[Lustre-discuss] Inconsistent data with Lustre 1.6.3
Andreas Dilger
adilger at sun.com
Tue Jan 29 16:09:43 PST 2008
On Jan 29, 2008 16:44 -0500, Jeff Darcy wrote:
> We're running Lustre 1.6.3 and Linux 2.6.18 on our 972-node
> (5832-processor) machines, and we're seeing some interesting problems
> when we run executables from a Lustre filesystem. When we run
> 5000-processor jobs, we often see some - maybe only a few, maybe a
> couple of dozen - fail with illegal-instruction and other traps, where
> examining the core file shows that the instructions in question are just
> fine (and the same as on jobs that succeeded). Has anybody else seen
> similar problems running executables from a Lustre filesystem?
>
> There's a significant chance that the problem is architecture-specific
> (our CPU architecture is MIPS with weak memory ordering) and/or in Linux
> rather than Lustre, but the same test has run fine using Lustre 1.6beta
> on Linux 2.6.15 and on other filesystems (e.g. NFS or ext3 over NBD)
> using current versions. If anybody has any suggestions about places to
> look, parameters to tweak for the sake of experimentation, etc. it would
> be most appreciated.
There is definitely a possibility that the MIPS page cache is not coherent
in some cases. I'm peripherally aware of these architecture-specific areas,
and it is possible that we don't handle all of them in the Lustre code.
In particular, I vaguely recalled (and then found) bug 933, which seems
directly relevant to your situation: on MIPS, flush_dcache_page() and
friends are NOT the no-ops that they are on most common architectures.
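
To make the concern concrete, here is a rough user-space model of the
pattern in question: whenever code fills a page cache page through the
kernel's own mapping, it should call flush_dcache_page() before user space
is allowed to see that page. This is only a sketch, not Lustre or kernel
code. The struct page here, fill_page(), page_visible_to_user(),
run_demo() and the flush counter are all hypothetical stand-ins; only the
name and signature of flush_dcache_page() match the real kernel interface.

```c
/* User-space model only. Everything here is a hypothetical stand-in,
 * not Lustre or kernel code; only the flush_dcache_page() name and
 * signature mirror the real kernel interface. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096

/* Stand-in for the kernel's struct page. */
struct page {
    unsigned char data[MODEL_PAGE_SIZE];
    int needs_flush;    /* set while kernel-side writes are unflushed */
};

static int flush_calls; /* counts flushes so the demo can check them */

/* Model of flush_dcache_page(). On cache-coherent architectures the
 * real one is effectively a no-op; here it just clears the flag. */
static void flush_dcache_page(struct page *page)
{
    page->needs_flush = 0;
    flush_calls++;
}

/* Hypothetical helper: copy incoming data into a page cache page,
 * then flush so a user mapping of the page would see the new data. */
static void fill_page(struct page *page, const void *src, size_t len)
{
    assert(len <= MODEL_PAGE_SIZE);
    memcpy(page->data, src, len);
    page->needs_flush = 1;      /* written through the kernel mapping */
    flush_dcache_page(page);    /* the call that must not be skipped */
}

/* In this model a page is safe to expose once it has been flushed. */
static int page_visible_to_user(const struct page *page)
{
    return !page->needs_flush;
}

/* Returns 0 if the page was filled, flushed exactly once, and intact. */
int run_demo(void)
{
    static struct page pg;
    const char text[] = "bytes of an executable read from disk";
    int before = flush_calls;

    fill_page(&pg, text, sizeof(text));
    if (!page_visible_to_user(&pg))
        return 1;
    if (memcmp(pg.data, text, sizeof(text)) != 0)
        return 2;
    return (flush_calls - before == 1) ? 0 : 3;
}
```

If the flush_dcache_page() call in fill_page() is removed, the model
reports the page as not yet safe to expose. On a coherent architecture the
real call costs nothing, which is how a missing one could go unnoticed
until the code runs on something like MIPS.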
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.