[Lustre-discuss] Another server question.

Tue Feb 3 09:21:10 PST 2009

On Feb 3, 2009, at 11:42 AM, Brian J. Murrell wrote:

>
>> I down one of the servers (normal shutdown, not the MGS of course).
>> OK, so the clients seem to be frozen in regards to the lustre.
>
> Only if they want to access objects (files, or file stripes) on that
> server that you shut down, yes.

In our experience, despite what has been said and what we have read,
if we lose or take down a single OSS, our clients lose access to the
file system (I/O appears to block) until that OSS is back up and has
completed recovery.  That's just our experience, and it has been very
consistent.  We've never seen otherwise, though we would like to.  :)

>
>> Many here
>> have noted that it should be ok, with the exception of files that were
>> stored on the downed server,

Again, not in our experience.  We are currently running 1.6.4.2 and
have never seen this work.  Losing a single OSS renders the file
system pretty much unusable until the OSS has recovered.  We could
be doing something wrong, I suppose, but I'm not sure what.

>> but that does not seem to be the case here.
>> That is not my main concern however, the real question is, I bring the
>> server back up; check its ID by issuing lctl dl; I check the MGS by a
>> cat /proc/fs/lustre/devices and see the ID in there as UP. OK, so it all
>> seems well again, but the client is still (somewhat) stuck.

You have to wait for recovery to complete.  You can check the
recovery status on the OSSs and the MGS/MDS by running:

cd /proc/fs/lustre; find . -name "*recov*" -exec cat {} \;

Once all the OSSs and the MGS/MDS report a recovery status of
"COMPLETE", clients will be able to access the file system again.
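
If you'd rather not eyeball the output of that find, something along
these lines might help.  It is only a rough sketch, not anything we run
in production: it assumes each target exposes a file named
recovery_status containing a "status: ..." line, and the function name
pending_recovery is made up for this example.

```shell
#!/bin/sh
# Sketch: list Lustre targets whose recovery has not finished.
# Assumption: each target has a recovery_status file with a
# "status: ..." line. pending_recovery is a hypothetical helper name.

pending_recovery() {
    # $1 = directory to search (defaults to /proc/fs/lustre)
    root="${1:-/proc/fs/lustre}"
    find "$root" -name recovery_status 2>/dev/null | while read -r f; do
        status=$(grep '^status:' "$f" 2>/dev/null)
        case "$status" in
            *COMPLETE*) ;;   # recovery finished, nothing to report
            *) echo "$f: ${status:-status unknown}" ;;
        esac
    done
}
```

Run "pending_recovery" on each OSS and on the MDS; no output means
every target found on that server reports COMPLETE.  Anything it
prints is a target that clients are probably still waiting on.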

We've been running three separate Lustre file systems for over a year
now and are *very* happy with it.  There are a few things that we
still don't understand, and this is one of them.  We wish that when an
OSS went down, we lost access only to the files/objects on *that* OSS,
but, again, that has not been our experience.  Still, we've kissed a
lot of distributed/parallel file system frogs.  We'll take Lustre,
hands down.

Charlie Taylor
UF HPC Center
