[Lustre-discuss] Lustre clients getting evicted

Brock Palen brockp at umich.edu
Wed Feb 6 07:59:05 PST 2008


> If the client gets evicted by the server, it might be triggered by:
>
> 1) the server did not get the client's pinger message for a long time.
> 2) the client is too busy to handle the server's lock cancel request.

Clients show a load of 4.2  (4 cores total, 1 process per core).

> 3) the client cancelled the lock, but the network dropped the cancel
> reply to the server.
I see a very small number (6339) of dropped packets on the OSS
interfaces.  Links between the switches show no errors.
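
For reference, I am reading those counters with something along these
lines (eth0 is just a placeholder for whichever interface the OSS
actually uses):

    # per-interface drop counters (RX-DRP / TX-DRP columns)
    netstat -i
    # or the dropped: fields for one interface
    ifconfig eth0 | grep -i dropped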


> 4) the server is too busy to handle the lock cancel reply from the
> client, or is blocked somewhere.

I started paying closer attention to the OSS once you said this;
sometimes I see the CPU use of socknal_sd00 hit 100%.  Is this the
process that keeps all the obd_pings going?
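
In case it matters, this is roughly how I have been watching it (the
/proc path is from memory for 1.6 and may differ between versions):

    # per-thread CPU for the socket LND scheduler threads
    top -b -n 1 | grep socknal
    # LNET message counters on the OSS
    cat /proc/sys/lnet/stats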

Both the OSS and the MDS/MGS are SMP systems running single
interfaces.  If I dual-homed the servers, would that create another
socknal process for LNET?
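
To frame the question, I am imagining a modprobe.conf entry on the
servers roughly like the sketch below; the interface names are
placeholders and I have not tried this, so please correct me if the
syntax is off:

    # /etc/modprobe.conf on the OSS/MDS (untested sketch)
    # one LNET tcp network per NIC
    options lnet networks="tcp0(eth0),tcp1(eth1)"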


>
> It seems there are a lot of metadata operations in your job.  I guess
> your evictions might be caused by the latter two reasons.  If you
> could provide the process stack trace on the MDS when the job died,
> it might help us figure out what is going on there.
>
> WangDi
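
For the stack trace: unless you mean something more Lustre-specific, I
assume the usual sysrq task dump on the MDS is what's wanted, roughly:

    # enable sysrq if it is off, then dump all task stacks
    echo 1 > /proc/sys/kernel/sysrq
    echo t > /proc/sysrq-trigger
    # traces land in the kernel ring buffer / syslog
    dmesg > /tmp/mds-task-stacks.txt   # output file name is arbitrary
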
>> Brock Palen
>> Center for Advanced Computing
>> brockp at umich.edu
>> (734)936-1985
>>
>>
>> On Feb 4, 2008, at 2:47 PM, Brock Palen wrote:
>>
>>
>>>> Which version of Lustre do you use?
>>>> Are the server and clients the same version and the same OS?  Which one?
>>>>
>>> lustre-1.6.4.1
>>>
>>> The servers (OSS and MDS/MGS) use the RHEL4 RPM from lustre.org:
>>> 2.6.9-55.0.9.EL_lustre.1.6.4.1smp
>>>
>>> The clients run patchless RHEL4
>>> 2.6.9-67.0.1.ELsmp
>>>
>>> One set of clients is on a 10.x network, while the servers and the
>>> other half of the clients are on a 141.x network.  Because we are
>>> using the tcp network type, we have not set up any LNET routes.  I
>>> don't think this should cause a problem, but I include the
>>> information for clarity.  We do route 10.x on campus.
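
(Clarifying the single-network setup above: everything sits on one LNET
tcp network, which in modprobe.conf terms is just something like the
sketch below; as I understand it, routes would only come into play if
the two subnets were declared as separate LNET networks with a router
between them.  The interface name and gateway NID are made up.)

    # /etc/modprobe.conf, clients and servers (current setup, sketch)
    options lnet networks=tcp0(eth0)
    # routes only needed with separate LNET networks, e.g.:
    # options lnet routes="tcp1 192.168.0.10@tcp0"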
>>>
>>>
>>>> Harald
>>>>
>>>> On Monday 04 February 2008 04:11 pm, Brock Palen wrote:
>>>>
>>>>> On our cluster, which has been running Lustre for about 1 month, I
>>>>> have 1 MDT/MGS and 1 OSS with 2 OSTs.
>>>>>
>>>>> Our cluster is all GigE and has about 608 nodes / 1854 cores.
>>>>>
>>>>> We have a lot of jobs that die and/or go into high IO wait; strace
>>>>> shows processes stuck in fstat().
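
(For anyone following along, one way to see this is to attach to a hung
process, roughly as below; the PID is whatever the stuck job reports.)

    # watch which syscall a hung process is sitting in
    strace -f -tt -p <PID>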
>>>>>
>>>>> The big problem (I think; I would like some feedback on this) is
>>>>> that of these 608 nodes, 209 of them have the following string in
>>>>> dmesg:
>>>>>
>>>>> "This client was evicted by"
>>>>>
>>>>> Is it normal for clients to be dropped like this?  Is there some
>>>>> tuning that needs to be done on the server to carry this many
>>>>> nodes out of the box?  We are using a default Lustre install with
>>>>> GigE.
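
(In case anyone wants to reproduce that kind of count, a sweep along
these lines is one way to get it; the node name pattern is a
placeholder for our host naming, and pdsh is assumed to be available.)

    # count eviction messages per node, drop nodes with zero hits
    pdsh -w node[001-608] 'dmesg | grep -c "This client was evicted by"' | grep -v ': 0$'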
>>>>>
>>>>>
>>>>> Brock Palen
>>>>> Center for Advanced Computing
>>>>> brockp at umich.edu
>>>>> (734)936-1985
>>>>>
>>>>>
>>>> -- 
>>>> Harald van Pee
>>>>
>>>> Helmholtz-Institut fuer Strahlen- und Kernphysik der Universitaet Bonn