[Lustre-discuss] Lustre clients getting evicted

Brock Palen brockp at umich.edu
Tue Feb 5 08:01:47 PST 2008


The timeouts fixed the random evictions.  The problem we were trying
to solve in the first place is still present, though.  In talking
with the user of the code, it turns out the problem is related to a
similar problem in another code.
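
(For anyone finding this in the archives later: a minimal, purely
hypothetical sketch of raising the Lustre-wide timeout on one node is
below, assuming the usual /proc/sys/lustre/timeout tunable; the value
300 is only an example, not necessarily what we used.)

/*
 * Hypothetical sketch: raise the Lustre-wide timeout on one node by
 * writing /proc/sys/lustre/timeout.  The value 300 (seconds) is only
 * an illustrative example.
 */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/lustre/timeout", "w");
    if (f == NULL) {
        perror("open /proc/sys/lustre/timeout");
        return 1;
    }
    fprintf(f, "300\n");        /* example value, in seconds */
    fclose(f);
    return 0;
}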

One code is from NOAA, the other is S3D from Sandia (I think).

Both of these codes write one file per process (NetCDF for one,
Tecplot for the other).  When the code has finished an iteration, it
copies/tars/cpios the files to another location.  This is where the
job will hang *some* of the time.  Most of the time it works, but
with enough iterations of this method a job will hang at some point.
The job does not die; it just hangs.

The NOAA code does the mv+cpio in its PBS script.  The S3D code uses
system() to run tar.  In the end they show the same behavior.
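
In case it helps to picture it, here is a rough, hypothetical sketch
of what I understand the S3D archive step to look like (the NOAA code
does the equivalent mv+cpio from its PBS script instead).  The paths
and file names are made up; the relevant detail is that system()
blocks until tar exits, so if tar stalls on the filesystem the
process just sits there:

/*
 * Hypothetical sketch of a per-iteration archive step that shells out
 * to tar with system().  Paths and file names are invented for
 * illustration; the real codes differ.
 */
#include <stdio.h>
#include <stdlib.h>

static int archive_iteration(int iter, int rank)
{
    char cmd[256];
    int rc;

    /* tar up this rank's output for the iteration (hypothetical names) */
    snprintf(cmd, sizeof(cmd),
             "tar cf /archive/iter%04d_rank%04d.tar out_%04d_%04d.plt",
             iter, rank, iter, rank);

    rc = system(cmd);       /* blocks here if tar stalls on the filesystem */
    if (rc != 0)
        fprintf(stderr, "archive of iteration %d failed (rc=%d)\n", iter, rc);
    return rc;
}

int main(void)
{
    /* single illustrative call; the real codes do this every iteration */
    return archive_iteration(1, 0) == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}

When it wedges, the parent is simply waiting inside system() for the
tar child to finish, which matches a job that never dies but never
makes progress either.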

Has anyone seen similar behavior?

Brock Palen
Center for Advanced Computing
brockp at umich.edu
(734)936-1985


On Feb 4, 2008, at 2:47 PM, Brock Palen wrote:

>> Which version of Lustre do you use?
>> Are the servers and clients the same version and the same OS?  Which one?
>
> lustre-1.6.4.1
>
> The servers (OSS and MDS/MGS) use the RHEL4 RPM from lustre.org:
> 2.6.9-55.0.9.EL_lustre.1.6.4.1smp
>
> The clients run patchless RHEL4
> 2.6.9-67.0.1.ELsmp
>
> One set of clients is on a 10.x network while the servers and the
> other half of the clients are on a 141. network.  Because we are
> using the tcp network type, we have not set up any LNET routes.  I
> don't think this should cause a problem, but I include the
> information for clarity.  We do route 10.x on campus.
>
>>
>> Harald
>>
>> On Monday 04 February 2008 04:11 pm, Brock Palen wrote:
>>> On our cluster that has been running Lustre for about 1 month, I
>>> have 1 MDT/MGS and 1 OSS with 2 OSTs.
>>>
>>> Our cluster uses all GigE and has about 608 nodes (1854 cores).
>>>
>>> We have a lot of jobs that die and/or go into high IO wait; strace
>>> shows processes stuck in fstat().
>>>
>>> The big problem (I think, and I would like some feedback on this)
>>> is that of these 608 nodes, 209 of them have in dmesg the string
>>>
>>> "This client was evicted by"
>>>
>>> Is it normal for clients to be dropped like this?  Is there some
>>> tuning that needs to be done on the server to carry this many nodes
>>> out of the box?  We are using a default Lustre install with GigE.
>>>
>>>
>>> Brock Palen
>>> Center for Advanced Computing
>>> brockp at umich.edu
>>> (734)936-1985
>>>
>>>
>>
>> -- 
>> Harald van Pee
>>
>> Helmholtz-Institut fuer Strahlen- und Kernphysik der Universitaet  
>> Bonn
>



