[Lustre-discuss] Inconsistent data with Lustre 1.6.3

Andreas Dilger adilger at sun.com
Tue Jan 29 16:09:43 PST 2008


On Jan 29, 2008  16:44 -0500, Jeff Darcy wrote:
> We're running Lustre 1.6.3 and Linux 2.6.18 on our 972-node 
> (5832-processor) machines, and we're seeing some interesting problems 
> when we run executables from a Lustre filesystem.  When we run 
> 5000-processor jobs, we often see some - maybe only a few, maybe a 
> couple of dozen - fail with illegal-instruction and other traps, where 
> examining the core file shows that the instructions in question are just 
> fine (and the same as on jobs that succeeded).  Has anybody else seen 
> similar problems running executables from a Lustre filesystem?
> 
> There's a significant chance that the problem is architecture-specific 
> (our CPU architecture is MIPS with weak memory ordering) and/or in Linux 
> rather than Lustre, but the same test has run fine using Lustre 1.6beta 
> on Linux 2.6.15 and on other filesystems (e.g. NFS or ext3 over NBD) 
> using current versions.  If anybody has any suggestions about places to 
> look, parameters to tweak for the sake of experimentation, etc. it would 
> be most appreciated.

There is definitely a possibility that the MIPS page cache is not coherent
in some cases.  I'm peripherally aware of these architecture-specific areas,
and it is possible that we don't handle all of them in the Lustre code.

In particular, I vaguely recalled (and then found) bug 933, which seems
directly relevant to your situation, since on MIPS flush_dcache_page() and
friends are NOT the no-ops that they are on most common architectures.
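
As one place to start looking, a rough userspace check is to hash the same
executable twice on a failing node: once through read() (the kernel copies
the data out of the page cache) and once through a read-only mmap() (user
space looks at the page through its own mapping). This is only a sketch
under assumptions: the script and function names are made up for
illustration, it assumes a Python 3 interpreter is available on the compute
nodes, and it can only catch data-cache aliasing between the kernel and user
views of a page. It would not notice a stale instruction cache, so matching
digests do not rule the problem out.

```python
import hashlib
import mmap
import os
import sys


def digest_via_read(path, chunk_size=1 << 20):
    """Hash the file through read(), i.e. data copied out by the kernel."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def digest_via_mmap(path):
    """Hash the file through a read-only mapping, as user space sees the pages."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        if os.fstat(f.fileno()).st_size == 0:
            return h.hexdigest()  # mmap cannot map an empty file
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            h.update(m)
    return h.hexdigest()


def views_agree(path):
    """Return True if the read() and mmap() views hash to the same digest."""
    return digest_via_read(path) == digest_via_mmap(path)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python check_views.py /path/on/lustre/executable
    target = sys.argv[1]
    print(target, "OK" if views_agree(target) else "MISMATCH")
```

Running this on every node right before the job starts, against the binary
on Lustre, and comparing the result with a digest taken on a known-good node
would at least show whether the two views of the file ever disagree.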

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



