[Lustre-discuss] OST went back in time to 0 (bug 9646)

Andreas Dilger adilger at sun.com
Fri Jul 31 13:57:12 PDT 2009


On Jul 30, 2009  08:29 +0200, Jakob Goldbach wrote:
> I have a question on bug 9646 - Server went back in time.
> 
> I had an OSS crash and had to pull the power. After mounting lustre
> again I see the following on one of my clients:
> 
> (import.c:909:ptlrpc_connect_interpret()) b-OST0010_UUID went back in
> time (transno 12901362807 was previously committed, server now claims
> 0)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
> 
> This bug description suggests that commits were lost in the hardware
> cache - but how can it lose all commits (transno is zero)? (btw, the cache
> is battery backed)
> 
> On the client where I saw this I had previously deactivated the import
> because of the crash. Is this the reason I'm seeing this transno as
> zero? (full dmesg below)

The error message is a bit misleading.  Bug 9646 describes the
situation where the last_committed transaction number rolled back
to some previous non-zero value.  That indicates some serious problem
in the storage.  In this case the client is not getting a complete reply
and is looking at a last_committed value that was never properly filled
in.  The manual's description of this message should probably be updated
to cover this case.
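To tell the two cases apart when scanning client logs, something like the
following rough sketch may help. The pattern is modelled only on the console
message quoted in this thread; the function name classify_rollback and the
labels it returns are hypothetical and not part of Lustre. A claimed value of
0 is treated as "likely an unfilled reply", which is a heuristic, not a
guarantee.

```python
import re

# Pattern modelled on the console message quoted in this thread.
# It is an assumption that other Lustre versions word the message the same way.
WENT_BACK = re.compile(
    r"(?P<target>\S+) went back in time "
    r"\(transno (?P<old>\d+) was previously committed, "
    r"server now claims (?P<new>\d+)\)"
)


def classify_rollback(message):
    """Return (target, old_transno, new_transno, kind), or None if no match.

    kind is a hypothetical label:
      "likely-unfilled-reply"  server claims 0, as in this thread
      "rollback"               non-zero value lower than before (bug 9646 case)
      "unexpected"             anything else
    """
    # Collapse line wrapping and quoting whitespace into single spaces.
    flat = " ".join(message.split())
    match = WENT_BACK.search(flat)
    if match is None:
        return None

    old = int(match.group("old"))
    new = int(match.group("new"))
    if new == 0:
        kind = "likely-unfilled-reply"
    elif new < old:
        kind = "rollback"
    else:
        kind = "unexpected"
    return match.group("target"), old, new, kind


if __name__ == "__main__":
    sample = (
        "(import.c:909:ptlrpc_connect_interpret()) b-OST0010_UUID went back in\n"
        "time (transno 12901362807 was previously committed, server now claims\n"
        "0)!"
    )
    print(classify_rollback(sample))
```

Only a "rollback" result corresponds to the storage problem that bug 9646
describes.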

This should probably be filed as a bug: if the reply was an error, the
check should be skipped and no message printed on the console.
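The intended behaviour can be sketched as follows. This is not the actual
Lustre code (which is C, in import.c); it is a minimal Python illustration,
and the function name check_last_committed and its parameters are
hypothetical. It assumes a reply status of 0 means success and anything else
means the reply was an error.

```python
def check_last_committed(reply_status, prev_committed, reply_committed):
    """Return a warning string, or None if nothing should be printed.

    reply_status    -- status of the reply (assumed: 0 means success)
    prev_committed  -- last committed transno the client saw before
    reply_committed -- last committed transno taken from this reply
    """
    # An error reply may carry a last_committed field that was never
    # filled in, so do not compare it against anything.
    if reply_status != 0:
        return None

    # Only a successful reply with a lower value is a real rollback.
    if reply_committed < prev_committed:
        return (
            "went back in time (transno %d was previously committed, "
            "server now claims %d)" % (prev_committed, reply_committed)
        )
    return None


if __name__ == "__main__":
    # Error reply with an unfilled value: no message.
    print(check_last_committed(-11, 12901362807, 0))
    # Successful reply with a lower value: warning.
    print(check_last_committed(0, 100, 50))
```

With a check like this, an error reply carrying an unfilled value of 0 would
produce no console message at all.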

> 2860:0:(import.c:508:import_select_connection())
> b-OST0010-osc-ffff81022ce89800: tried all connections, increasing
> latency to 27s
> 
> setting import backup-OST0010_UUID INACTIVE by administrator request
> 
> 8281:0:(import.c:508:import_select_connection())
> b-OST0010-osc-ffff81022ce89800: tried all connections, increasing
> latency to 32s
> 
> 167-0: This client was evicted by b-OST0010; in progress operations
> using this service will fail.
> 
> b-OST0010-osc-ffff81022ce89800: Connection restored to service b-OST0010
> using nid 172.16.14.36@tcp.
> 
> 11-0: an error occurred while communicating with 172.16.14.36@tcp. The
> ost_statfs operation failed with -11
> ...
> 11-0: an error occurred while communicating with 172.16.14.36@tcp. The
> obd_ping operation failed with -107
> 
> b-OST0010-osc-ffff81022ce89800: Connection to service backup-OST0010 via
> nid 172.16.14.36@tcp was lost; in progress operations using this service
> will wait for recovery to complete.
> 
> 2859:0:(import.c:909:ptlrpc_connect_interpret()) b-OST0010_UUID went
> back in time (transno 12901362807 was previously committed, server now
> claims 0)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
> 
> 167-0: This client was evicted by backup-OST0010; in progress operations
> using this service will fail.
> 
> b-OST0010-osc-ffff81022ce89800: Connection restored to service b-OST0010
> using nid 172.16.14.36@tcp.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



