[Lustre-discuss] short writes

David Singleton David.Singleton at anu.edu.au
Thu Jul 8 06:25:32 PDT 2010


The POSIX standard pretty clearly allows short writes to occur (number of
bytes written less than requested in a successful call to write) but it's
not something you see very often and I don't think many users/applications
expect it to occur when writing to disk-based files.  We are seeing it
fairly regularly and just wanted to confirm that we (or rather our users)
should expect this behaviour from Lustre.

We are seeing the issue with the infamous Gaussian quantum chem code
which spends literally days constantly writing to and reading from scratch files
in roughly 1GB chunks as part of out-of-core solvers.  We manage jobs using
simple SIGSTOP/SIGCONT based suspend/resume and occasionally jobs will flag
a short write immediately after a SIGCONT. The application incorrectly
treats this as an error and aborts.  Adding code to complete the write
appears to fix the problem (as you'd hope).  Now we are at the stage of
"debating" with the application developers whether it's their problem or
Lustre's.
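
For reference, a minimal sketch in C of the kind of completion loop meant
here. It is not the application's actual code: the name write_all and the
choice to treat a zero-byte return as an error are assumptions made for
illustration only. It simply re-issues write() for the remaining bytes
after a short write and retries if the call was interrupted before any
data was transferred.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: call write() repeatedly until all `count` bytes
 * have been written. Returns 0 on success, -1 on error (errno is set). */
int write_all(int fd, const void *buf, size_t count)
{
    const char *p = buf;
    size_t remaining = count;

    while (remaining > 0) {
        ssize_t n = write(fd, p, remaining);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any data moved: retry */
            return -1;          /* genuine error, errno set by write() */
        }
        if (n == 0) {
            /* Assumption: no progress on a non-empty request is treated
             * as an error so the loop cannot spin forever. */
            errno = EIO;
            return -1;
        }
        /* Short write: advance past the bytes that did go out. */
        p += n;
        remaining -= (size_t)n;
    }
    return 0;
}
```

A caller would replace a bare write(fd, buf, len) with write_all(fd, buf, len)
and only treat a -1 return as a failure.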

Is this considered normal Lustre behaviour?

This is with Lustre 1.8.3 clients on Linux kernel 2.6.27.46.

Thanks,
David



