[Lustre-discuss] Lustre clients getting evicted

Roland Laifer Laifer at RZ.Uni-Karlsruhe.DE
Wed Feb 6 08:56:59 PST 2008


On Tue, Feb 05, 2008 at 11:01:47AM -0500, Brock Palen wrote:
> The timeouts fixed the random evictions.  The problem we were trying
> to solve in the first place is still there, though.  After talking
> with the user of the code, it appears to be related to a similar
> problem in another code.
> 
> One code is from NOAA, the other is S3D from Sandia (I think).
> 
> Both these codes write one file per process.  (NetCDF for one,  
> tecplot for the other).
> When the code has finished an iteration, they copy/tar/cpio the
> files to another location.  This is where the job will *sometimes*
> hang.  Most of the time it works, but with enough iterations of this
> method a job will hang at some point.  The job does not die; it just
> hangs.
> 
> The NOAA code does the mv+cpio in its pbs script.  The S3D code uses  
> system() to run tar.  In the end they have the same behavior.
> 
> Has anyone seen similar behavior?

We have seen evictions several times, and I have found that they are
worth investigating. Badly behaved applications can cause evictions,
e.g. if lots of nodes each write a few bytes to a shared file.
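
To illustrate the access pattern, here is a minimal sketch (not taken
from either application discussed in this thread) of one common
workaround: collect small records in memory and hand them to the file
system in a few large writes. The class name BufferedRecordWriter, the
flush_bytes threshold and the 1 MiB default are hypothetical choices
for illustration, and whether this avoids evictions on a particular
installation is an assumption that would need testing.

```python
import os


class BufferedRecordWriter:
    """Hypothetical helper: batch many small records into few large writes."""

    def __init__(self, path, flush_bytes=1 << 20):
        # O_APPEND so each flush lands at the current end of the file.
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self._flush_bytes = flush_bytes  # assumed threshold, tune per system
        self._chunks = []
        self._pending = 0
        self.write_calls = 0  # number of os.write() calls actually issued

    def write(self, record):
        """Queue one small record; flush only once enough data is pending."""
        self._chunks.append(record)
        self._pending += len(record)
        if self._pending >= self._flush_bytes:
            self.flush()

    def flush(self):
        """Write all queued records with as few system calls as possible."""
        if not self._chunks:
            return
        view = memoryview(b"".join(self._chunks))
        while view:
            # os.write() may write fewer bytes than requested, so loop.
            written = os.write(self._fd, view)
            view = view[written:]
            self.write_calls += 1
        self._chunks = []
        self._pending = 0

    def close(self):
        """Flush whatever is still queued and close the file descriptor."""
        self.flush()
        os.close(self._fd)
```

With 1000 records of 16 bytes and flush_bytes=4096, the writer issues
a handful of write calls instead of 1000. The trade-off is that
unflushed records are lost if the process dies before close().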

In one case the cause was a tecplot routine; the user reported that
it contains bad code (in preutil.c).

Regards, 
  Roland 
-- 
 --------------------------------------------------------------------------
  Roland Laifer 
  Rechenzentrum, Universitaet Karlsruhe (TH), D-76128 Karlsruhe, Germany
  Email: Roland.Laifer at rz.uni-karlsruhe.de, Phone: +49 721 608 4861, 
  Fax: +49 721 32550, Web: www.rz.uni-karlsruhe.de/personen/roland.laifer
 --------------------------------------------------------------------------


