[Lustre-discuss] Lustre clients getting evicted
Roland Laifer
Laifer at RZ.Uni-Karlsruhe.DE
Wed Feb 6 08:56:59 PST 2008
On Tue, Feb 05, 2008 at 11:01:47AM -0500, Brock Palen wrote:
> The timeouts fixed the random evictions. The problem we were trying
> to solve in the first place is still there, though. In talking
> with the user of the code, it turns out the problem is related to a
> similar problem in another code.
>
> One code is from NOAA; the other is S3D from Sandia (I think).
>
> Both of these codes write one file per process (NetCDF for one,
> tecplot for the other).
> When the code has finished an iteration, they copy/tar/cpio the
> files to another location. This is where the job will hang *some*times.
> Most of the time it works, but with enough iterations of this
> method a job will hang at some point. The job does not die; it just
> hangs.
>
> The NOAA code does the mv+cpio in its PBS script. The S3D code uses
> system() to run tar. In the end they have the same behavior.
>
> Has anyone seen similar behavior?
We have seen evictions several times, and I have found that they are
worth investigating. Evictions can be caused by badly behaved
applications, e.g. when many nodes each write a few bytes to a shared file.
One time the cause was a tecplot routine, and the user reported that
it contains bad code (in preutil.c).
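The "many nodes write a few bytes to a shared file" pattern can be illustrated with a small, hypothetical Python sketch. It is not taken from this thread or from tecplot. It counts how many write requests a process issues when it writes many tiny records one at a time, compared with buffering them in memory and flushing in large chunks. The names write_small, write_coalesced and run, the 16-byte record size and the 64 KiB flush threshold are all assumptions chosen for illustration. On Lustre, small writes from many clients to one shared file can generate a lot of lock traffic with the servers, so fewer, larger writes are generally gentler. Whether that explains a particular eviction would still have to be confirmed case by case.

```python
import os
import tempfile
from typing import Callable, Sequence, Tuple

# Hypothetical parameters, chosen only for illustration.
RECORD = b"x" * 16        # "a few bytes" per write
NUM_RECORDS = 1000
FLUSH_SIZE = 64 * 1024    # coalescing threshold in bytes


def _write_all(fd: int, data: bytes) -> None:
    """Write all of data, looping in case os.write() returns a short count."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def write_small(fd: int, records: Sequence[bytes]) -> int:
    """Anti-pattern: issue one write request per tiny record."""
    requests = 0
    for rec in records:
        _write_all(fd, rec)
        requests += 1
    return requests


def write_coalesced(fd: int, records: Sequence[bytes],
                    flush_size: int = FLUSH_SIZE) -> int:
    """Buffer records in memory and issue few, large write requests."""
    requests = 0
    buf = bytearray()
    for rec in records:
        buf += rec
        if len(buf) >= flush_size:
            _write_all(fd, bytes(buf))
            requests += 1
            buf.clear()
    if buf:  # flush whatever is left at the end
        _write_all(fd, bytes(buf))
        requests += 1
    return requests


def run(writer: Callable[..., int], records: Sequence[bytes],
        **kwargs) -> Tuple[int, int]:
    """Write records to a fresh temp file; return (write requests, file size)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "out.dat")
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            requests = writer(fd, records, **kwargs)
        finally:
            os.close(fd)
        return requests, os.path.getsize(path)


if __name__ == "__main__":
    records = [RECORD] * NUM_RECORDS
    print("small:    ", run(write_small, records))      # (1000, 16000)
    print("coalesced:", run(write_coalesced, records))  # (1, 16000)
```

Both variants produce the same 16000-byte file; the only difference is that the coalesced version trades a little memory for far fewer I/O requests. The flush threshold would need tuning for a real application.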
Regards,
Roland
--
--------------------------------------------------------------------------
Roland Laifer
Rechenzentrum, Universitaet Karlsruhe (TH), D-76128 Karlsruhe, Germany
Email: Roland.Laifer at rz.uni-karlsruhe.de, Phone: +49 721 608 4861,
Fax: +49 721 32550, Web: www.rz.uni-karlsruhe.de/personen/roland.laifer
--------------------------------------------------------------------------