[Lustre-discuss] Lustre clients getting evicted
Brock Palen
brockp at umich.edu
Mon Feb 4 10:17:37 PST 2008
> Hi Brock,
>
> On Monday 04 February 2008 07:11:11 am Brock Palen wrote:
>> on our cluster that has been running lustre for about 1 month. I have
>> 1 MDT/MGS and 1 OSS with 2 OST's.
>>
>> Our cluster uses all Gige and has about 608 nodes 1854 cores.
>
> This seems to be a lot of clients for only one OSS (and thus for only
> one GigE link to the OSS).
It's more for evaluation; the 'real' file system is an NFS file system
provided by an OnStor Bobcat, so anything is an improvement. The
cluster IS too big, but there isn't a person at the university who is
willing to pay for anything other than more cluster nodes. Enough
with politics.
>
>> We have a lot of jobs that die, and/or go into high I/O wait; strace
>> shows processes stuck in fstat().
>>
>> The big problem is (i think) I would like some feedback on it that of
>> these 608 nodes 209 of them have in dmesg the string
>>
>> "This client was evicted by"
>>
>> Is this normal for clients to be dropped like this?
>
> I'm not an expert here, but evictions typically occur when a client
> hasn't been seen for a certain period by the OSS/MDS. This is often
> related to network problems. Considering your number of clients, if
> they all do I/O operations on the filesystem concurrently, maybe your
> Ethernet switches are the bottleneck and have to drop packets. Is your
> GigE network working fine outside of Lustre?
>
> To eliminate networking issues from the equation, you can try to lctl
> ping your MDS and OSS from a freshly evicted node, and see what you
> get. (lctl ping <your-oss-nid>)
I just had another node get evicted while running code, causing the
code to lock up. This time it was the MDS that evicted it. Pinging
works, though:
[root at nyx350 ~]# lctl ping 141.212.30.184 at tcp
12345-0 at lo
12345-141.212.30.184 at tcp
Recovery is slow; this client has been evicted for about 10 minutes.
I have attached the output of lctl dk from the client and some
syslog messages from the MDS.
>
>> Is there some
>> tuning that needs to be done to the server to carry this many nodes
>> out of the box? We are using default lustre install with Gige.
>
> Do your MDS or OSS show any particularly high load or memory usage? Do
> you see any Lustre-related error messages in their logs?
Nope, both servers have 2GB of RAM, and load is almost 0. No swapping.
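For anyone following along, the load/memory check above amounts to something like this on the MDS and OSS (standard Linux tools only; nothing Lustre-specific is assumed):

```shell
# Quick health snapshot on a server: load average, memory, and swap.
uptime            # load averages; should stay near 0 on an idle MDS/OSS
free -m           # swap "used" staying at 0 means no swapping
```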
Thanks
-------------- next part --------------
A non-text attachment was scrubbed...
Name: client.err
Type: application/octet-stream
Size: 27024 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20080204/3d7714b2/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mds.log
Type: application/octet-stream
Size: 997 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20080204/3d7714b2/attachment-0001.obj>
>
> Cheers,
> --
> Kilian
>
>