[Lustre-discuss] ldlm_enqueue operation failures

Mon Feb 18 13:42:22 PST 2008

Hello!

On Feb 18, 2008, at 4:29 PM, Charles Taylor wrote:

> Unfortunately, the upgrade did not resolve our issue.    One of our
> users has an MPI app where every thread opens the same input file
> (actually several in succession).    Although we have run this job
> successfully before on up to 512 procs, it is not working now.
> Lustre seems to be locking up when all the threads go after the same
> file (to open) and we see things such as ...

Can you upload the full log, from the start of the problematic job to
the end, somewhere?
Also, around the time the first watchdog timeouts hit, it would be nice
if you could do a sysrq-t on the MDS too, to get traces of all threads
(you need a big dmesg buffer for them to fit, or use a serial console).
Does the job use flocks/fcntl locks at all? If not, then don't worry
about mounting with -o flock.
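
If you would rather trigger sysrq-t from a script than from the keyboard,
something like the sketch below should do it. It assumes you run it as root
on the MDS and that the kernel exposes the usual /proc/sysrq-trigger file;
the function name and the trigger_path parameter are made up here purely
for illustration.

```python
def trigger_sysrq(command="t", trigger_path="/proc/sysrq-trigger"):
    """Write one SysRq command letter to the kernel trigger file.

    'trigger_sysrq' and 'trigger_path' are hypothetical names for this
    sketch. The default path is the standard Linux SysRq trigger;
    writing 't' asks the kernel to dump the state of all tasks into the
    kernel log (visible through dmesg or on a serial console).
    """
    if len(command) != 1:
        raise ValueError("SysRq command must be a single character")
    with open(trigger_path, "w") as trigger:
        trigger.write(command)


# Intended use on the MDS, as root, when the watchdog timeouts start:
#     trigger_sysrq("t")
# then save the kernel log right away, before the buffer wraps.
```

The path is a parameter only so the function can be tried against an
ordinary file first; on the MDS itself the default is the one you want.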

Bye,
     Oleg