[Lustre-discuss] possible file corruption

Jason Temple jtemple at cscs.ch
Tue Dec 4 06:52:44 PST 2012


Hello,

I have a troubling issue with random file corruption on both Lustre
1.8.6 (internal Cray Lustre) and Lustre 2.1 (Sonexion, produced by
Xyratex).

Randomly, our users come across files that either have 0 size or are
corrupted.  The 0-size files are usually ASCII files (normally created
serially with simple cat and awk statements), while the corrupted files
are weather data (GRIB) files that are most often truncated during an
untar operation.  Other times, the files have blocks filled with zeroes
in the middle of the file.
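
For anyone who wants to check their own files for the two symptoms
described above, here is a minimal sketch in Python.  The function name
find_zero_blocks and the 4096-byte block size are assumptions made for
illustration, not anything taken from Lustre.  It only finds aligned
blocks that are entirely zero, and legitimate data can contain such
blocks, so a hit is a hint rather than proof of corruption.

```python
import os


def find_zero_blocks(path, block_size=4096):
    """Return a list of (offset, length) runs of all-zero aligned blocks.

    block_size is an assumed scan granularity, not a Lustre parameter.
    """
    runs = []
    run_start = None
    offset = 0
    zero = bytes(block_size)
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            if chunk == zero[:len(chunk)]:
                # Start a new run, or continue the one in progress.
                if run_start is None:
                    run_start = offset
            elif run_start is not None:
                # A non-zero block closes the current run.
                runs.append((run_start, offset - run_start))
                run_start = None
            offset += len(chunk)
    if run_start is not None:
        runs.append((run_start, offset - run_start))
    return runs


def is_zero_size(path):
    """True if the file exists and has length 0."""
    return os.path.getsize(path) == 0
```

Running find_zero_blocks over a known-good copy of the same file first
shows which zero runs are expected, so only the extra ones need a look.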

The real kicker is that we cannot reproduce the problem reliably in
order to troubleshoot it.  I managed to trigger file truncation after
1500 iterations of untarring the same tar file, but in 30,000
iterations since then I haven't been able to reproduce it.
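
A repeated-untar test of this kind can be sketched roughly as follows.
This is an illustrative Python harness, not the script actually used;
the function names (reference_digests, verify_once, run) and the
scratch directory argument are hypothetical.  It computes a size and
SHA-256 for each regular file straight from the archive, then extracts
the archive repeatedly and compares what landed on the file system.

```python
import hashlib
import os
import shutil
import tarfile
import tempfile


def _sha256_stream(fobj, chunk=1 << 20):
    """Hash a file-like object in fixed-size chunks."""
    h = hashlib.sha256()
    for block in iter(lambda: fobj.read(chunk), b""):
        h.update(block)
    return h.hexdigest()


def reference_digests(tar_path):
    """Map member name -> (size, sha256), read from the archive itself."""
    ref = {}
    with tarfile.open(tar_path) as tar:
        for m in tar.getmembers():
            if m.isfile():
                ref[m.name] = (m.size, _sha256_stream(tar.extractfile(m)))
    return ref


def verify_once(tar_path, ref, scratch_dir):
    """Extract into a fresh directory and return a list of mismatches.

    Each mismatch is (name, expected_size, actual_size or None if missing).
    """
    dest = tempfile.mkdtemp(dir=scratch_dir)
    # Newer Python versions want an explicit extraction filter.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    try:
        with tarfile.open(tar_path) as tar:
            tar.extractall(dest, **kwargs)
        bad = []
        for name, (size, digest) in ref.items():
            path = os.path.join(dest, name)
            try:
                actual_size = os.path.getsize(path)
                with open(path, "rb") as f:
                    actual_digest = _sha256_stream(f)
            except OSError:
                bad.append((name, size, None))
                continue
            if (actual_size, actual_digest) != (size, digest):
                bad.append((name, size, actual_size))
        return bad
    finally:
        shutil.rmtree(dest, ignore_errors=True)


def run(tar_path, scratch_dir, iterations):
    """Return (iteration, mismatches) at the first failure, else None."""
    ref = reference_digests(tar_path)
    for i in range(iterations):
        bad = verify_once(tar_path, ref, scratch_dir)
        if bad:
            return i, bad
    return None
```

One caveat with a loop like this: reading the files back right after
extraction may be served from the client's page cache, so a check made
from a different client, or after the cache has been dropped, could see
something different.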

When it happens, there are no errors in the logs relating to Lustre, and
nothing is dumped into /tmp.

Has anyone come across this before?  I've searched Google for weeks but
have found only a few bugs that seem like they might be similar, and
those are usually related to netCDF and parallel I/O, while our cases of
corruption are usually encountered serially.

What log settings are suggested to try to capture this phantom while it
is happening?

Thanks in advance,

Jason


