[Lustre-discuss] Lustre clients getting evicted

Tom.Wang Tom.Wang at Sun.COM
Tue Feb 5 10:41:10 PST 2008


Brock Palen wrote:
> The timeouts fixed the random evictions.  The problem we were trying
> to solve in the first place is still present, though.  In talking
> with the user of the code, I learned it is related to a similar
> problem in another code.
>
> One code is from NOAA; the other is S3D from Sandia (I think).
>
> Both of these codes write one file per process (NetCDF for one,
> Tecplot for the other).  When the code has finished an iteration, it
> copies/tars/cpios the files to another location.  This is where the
> job will *sometimes* hang.  Most of the time it works, but with
> enough iterations of this method a job will hang at some point.  The
> job does not die; it just hangs.
>
> The NOAA code does the mv+cpio in its PBS script.  The S3D code uses
> system() to run tar.  In the end they show the same behavior.
>
> Has anyone seen similar behavior?
>   
If a client gets evicted by the server, it might be triggered by one
of the following:

1) The server did not receive the client's ping messages for a long
time (a timeout sketch follows this list).
2) The client is too busy to handle the server's lock cancel request.
3) The client cancelled the lock, but the network dropped the cancel
reply to the server.
4) The server is too busy to handle the lock cancel reply from the
client, or is blocked somewhere.

It seems there are a lot of metadata operations in your job, so I
would guess your evictions are caused by the latter two reasons.  If
you could provide the process stack traces on the MDS when the job
hangs, that might help us figure out what is going on there.
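
One way to capture those traces, assuming you have console or syslog
access on the MDS (a sketch, not the only method):

    # Enable the magic sysrq key if it is not already on:
    echo 1 > /proc/sys/kernel/sysrq

    # Dump the stack of every task into the kernel log; collect the
    # output from dmesg or /var/log/messages:
    echo t > /proc/sysrq-trigger

    # Also dump the Lustre kernel debug buffer around the hang:
    lctl dk /tmp/lustre-debug.log

Attaching the sysrq output from the time of the hang should be enough
to see where the MDS threads are blocked.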

WangDi
> Brock Palen
> Center for Advanced Computing
> brockp at umich.edu
> (734)936-1985
>
>
> On Feb 4, 2008, at 2:47 PM, Brock Palen wrote:
>
>   
>>> Which version of Lustre do you use?
>>> Are server and clients the same version and the same OS? Which one?
>>>       
>> lustre-1.6.4.1
>>
>> The servers (OSS and MDS/MGS) use the RHEL4 RPM from lustre.org:
>> 2.6.9-55.0.9.EL_lustre.1.6.4.1smp
>>
>> The clients run patchless RHEL4
>> 2.6.9-67.0.1.ELsmp
>>
>> One set of clients is on a 10.x network, while the servers and the
>> other half of the clients are on a 141.x network.  Because we are
>> using the tcp network type, we have not set up any LNET routes.  I
>> don't think this should cause a problem, but I include the
>> information for clarity.  We do route 10.x on campus.
>>
>>     
>>> Harald
>>>
>>> On Monday 04 February 2008 04:11 pm, Brock Palen wrote:
>>>       
>>>> On our cluster, which has been running Lustre for about 1 month,
>>>> I have 1 MDT/MGS and 1 OSS with 2 OSTs.
>>>>
>>>> Our cluster uses all GigE and has about 608 nodes / 1854 cores.
>>>>
>>>> We have a lot of jobs that die and/or go into high I/O wait;
>>>> strace shows processes stuck in fstat().
>>>>
>>>> The big problem (I think; I would like some feedback on this) is
>>>> that of these 608 nodes, 209 of them have the following string in
>>>> dmesg:
>>>>
>>>> "This client was evicted by"
>>>>
>>>> Is it normal for clients to be dropped like this?  Is there some
>>>> tuning that needs to be done on the server to carry this many
>>>> nodes out of the box?  We are using a default Lustre install with
>>>> GigE.
>>>>
>>>>
>>>> Brock Palen
>>>> Center for Advanced Computing
>>>> brockp at umich.edu
>>>> (734)936-1985
>>>>
>>>>
>>> -- 
>>> Harald van Pee
>>>
>>> Helmholtz-Institut fuer Strahlen- und Kernphysik der Universitaet  
>>> Bonn




More information about the lustre-discuss mailing list