[Lustre-discuss] OST went down running Lustre 1.6.6

Brian J. Murrell Brian.Murrell at Sun.COM
Wed Feb 11 09:07:13 PST 2009


On Wed, 2009-02-11 at 11:53 -0500, Brian Stone wrote:
> So, in this case, it appears that not all clients completed recovery,
> and recovery timed out.

During recovery, there is a progress message printed to the log
periodically stating how_many/expected clients have reconnected.  I
believe when recovery is aborted it also shows how many clients managed
to reconnect.
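
A rough sketch of pulling that reconnected/expected count out of a
status dump, in case it helps.  This reads a different source than the
log message above: it assumes the server's recovery_status file under
/proc/fs/lustre is "key: value" text with a "connected_clients: N/M"
line.  The sample text and both function names are hypothetical, so
check the field names against what your server actually prints.

```python
def parse_recovery_status(text):
    """Parse 'key: value' lines into a dict (layout is an assumption)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no colon at all
            fields[key.strip()] = value.strip()
    return fields


def client_progress(fields):
    """Return (connected, expected) from 'connected_clients: N/M', or None."""
    raw = fields.get("connected_clients")
    if raw is None or "/" not in raw:
        return None
    connected, expected = raw.split("/", 1)
    return int(connected), int(expected)


# Hypothetical sample, not captured from a real server.
SAMPLE = """status: RECOVERING
time_remaining: 120
connected_clients: 14/16
"""

if __name__ == "__main__":
    connected, expected = client_progress(parse_recovery_status(SAMPLE))
    print(f"{connected} of {expected} clients reconnected, "
          f"{expected - connected} still missing")
```

This only tells you how many clients are missing, not which ones.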

> I assume you have a short amount of time to 
> figure out who did not participate in recovery and get them to 
> reconnect.

Yes.  The length of the recovery window is calculated from obd_timeout.
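
To make the arithmetic concrete, here is a sketch that assumes the
window is simply obd_timeout scaled by a constant factor.  That
proportional model, the function name, and the numbers in the example
are all assumptions for illustration; the real calculation lives in the
Lustre source for your version.

```python
def recovery_window(obd_timeout, factor):
    """Seconds clients have to reconnect, assuming window = obd_timeout * factor.

    Both the proportional model and any particular factor are assumptions,
    not values taken from the Lustre source.
    """
    if obd_timeout <= 0 or factor <= 0:
        raise ValueError("obd_timeout and factor must be positive")
    return obd_timeout * factor


if __name__ == "__main__":
    # Illustrative values only: a 100 second obd_timeout and a factor of 2.5.
    print(recovery_window(100, 2.5))
```

The point is only that raising obd_timeout lengthens the window in
which slow clients can still rejoin.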

> What's the best way to get clients to reconnect that are not 
> participating in recovery?

Well, given a reliable network, it should just happen.  There is no way
to prod a client into reconnecting sooner or more often than it normally
would.  However, with a reliable network, this should not be a problem.

> What's the best way to identify clients that 
> are not participating in recovery?

Hrm.  I'm not sure that it's realistic to try to figure out who is
connected and who isn't, with the goal of troubleshooting and fixing
problems during recovery.  Recovery just doesn't last long enough.

Really, barring any bugs, all you need for successful recoveries is a
network that lets clients and servers communicate reliably.

b.
