[Lustre-discuss] Another server question.
Charles Taylor
taylor at hpc.ufl.edu
Tue Feb 3 09:21:10 PST 2009
On Feb 3, 2009, at 11:42 AM, Brian J. Murrell wrote:
>
>> I down one of the servers (normal shutdown, not the MGD of course).
>> OK, so the clients seem to be frozen in regards to the lustre.
>
> Only if they want to access objects (files, or file stripes) on that
> server that you shut down, yes.
In our experience, despite what has been said and what we have read,
if we lose or take down a single OSS, our clients lose access to the
file system (I/O appears blocked) until that OSS is back up and has
completed recovery. That's just our experience, and it has been very
consistent. We've never seen otherwise, though we would like to. :)
>
>> Many here
>> have noted that it should be ok, with the exception of files that
>> were
>> stored on the downed server,
Again, not in our experience. We are currently running 1.6.4.2 and
have never seen this work. Losing a single OSS renders the file
system pretty much unusable until the OSS has recovered. We could
be doing something wrong, I suppose, but I'm not sure what.
>> but that does not seem to be the case here.
>> That is not my main concern however, the real question is, I bring
>> the server
>> back up; check its ID by issuing lctl dl; I check the MGS by a cat /
>> proc/fs/lustre/devices
>> and see the ID in there as UP. OK, so it all seems well again, but
>> the client
>> is still (somewhat) stuck.
You have to wait for recovery to complete. You can check the
recovery status on the OSSs and MGS/MDS by....
cd /proc/fs/lustre; find . -name "*recov*" -exec cat {} \;
Once all the OSSs/MGS show recovery "COMPLETE", clients will be able
to access the file system again.
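The one-liner above can be wrapped into a small helper that prints each
recovery file alongside its "status:" line (COMPLETE, RECOVERING, etc.),
which is easier to scan across many OSTs. This is just a sketch: the
function name and the assumption that each recovery_status file carries a
"status:" field (as in the 1.6 /proc layout) are ours, not from the
original post.

```shell
# Sketch: summarize recovery state of all Lustre targets under a proc
# root. The default root and the "status:" field are assumptions based
# on the 1.6-era /proc/fs/lustre layout.
lustre_recovery_summary() {
    root="${1:-/proc/fs/lustre}"
    find "$root" -name "*recov*" 2>/dev/null | while read -r f; do
        # Pull the second field of the "status:" line, e.g. COMPLETE
        # or RECOVERING; fall back to "unknown" if the line is absent.
        status=$(awk '/^status:/ {print $2}' "$f")
        printf '%s %s\n' "$f" "${status:-unknown}"
    done
}

# On an OSS or the MDS you would run:
#   lustre_recovery_summary /proc/fs/lustre
```

Once every target reports COMPLETE, clients should unblock.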
We've been running three separate Lustre file systems for over a year
now and are *very* happy with it. There are a few things that we
still don't understand, and this is one of them. We wish that when an
OSS went down, we only lost access to files/objects on *that* OSS but,
again, that has not been our experience. Still, we've kissed a lot
of distributed/parallel file system frogs. We'll take Lustre, hands
down.
Charlie Taylor
UF HPC Center
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss