[Lustre-discuss] Help finding error in bugzilla

Jeff Blasius jeff.blasius at yale.edu
Wed Jul 9 11:55:11 PDT 2008


Thank you so much!

This user was sure this wasn't the case. Eventually we decided to
restart the MDS. This caused the Python processes stuck in D state to
return a trace indicating the problem.

It turns out Python started a process (popen) whose stderr was opened
on the same default file name by all 160 in-flight processes. Each was
a plain open, not an append, but that was enough contention to block
access to the entire directory.
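
For anyone who hits the same thing, here is a minimal sketch of one way
to avoid the shared file: give each launching process its own stderr
file. This is not the user's actual code. The helper names
(unique_stderr_path, run_with_private_stderr), the "stderr" prefix, and
the host-plus-pid naming scheme are all assumptions for illustration.

```python
import os
import socket
import subprocess
import sys
import tempfile


def unique_stderr_path(log_dir, prefix="stderr"):
    """Build a file name that includes host and pid (hypothetical scheme),
    so concurrent jobs do not all open the same file."""
    host = socket.gethostname()
    return os.path.join(log_dir, f"{prefix}.{host}.{os.getpid()}.log")


def run_with_private_stderr(cmd, log_dir):
    """Run cmd with its stderr redirected to a per-process file."""
    path = unique_stderr_path(log_dir)
    with open(path, "w") as err:
        result = subprocess.run(cmd, stderr=err)
    return result.returncode, path


if __name__ == "__main__":
    # Demo: a child that writes one line to stderr.
    log_dir = tempfile.mkdtemp()
    cmd = [sys.executable, "-c", "import sys; sys.stderr.write('oops')"]
    code, path = run_with_private_stderr(cmd, log_dir)
    print(code, path)
```

With 160 jobs each writing to a distinct file, no two clients should
need a write lock on the same object, although they still create files
in the same directory.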
               -jeff

On Wed, Jul 9, 2008 at 7:17 AM, Andreas Dilger <adilger at sun.com> wrote:
> On Jul 08, 2008  22:04 -0400, Jeff Blasius wrote:
>> Jul  8 14:24:54 oss10 kernel: LustreError:
>> 4572:0:(ldlm_lockd.c:646:ldlm_server_completion_ast()) ### enqueue
>> wait took 7744763506us from 1215533749 ns: filter-lustre0-OST0009_UUID
>> lock: 00000101a2b59580/0x9275e8a2d17f9488 lrc: 2/0,0 mode: PW/PW res:
>> 68100256/0 rrc: 74 type: EXT [0->33554431] (req 0->4095) flags: 20
>> remote: 0xf87d4d490599950 expref: 117 pid: 4757
>> Jul  8 14:24:54 oss10 kernel: LustreError:
>> 4572:0:(ldlm_lockd.c:646:ldlm_server_completion_ast()) Skipped 64
>> previous similar messages
>
> It looks like you have many processes writing to the start of the
> same file.  That causes unavoidable lock contention, and is most
> likely a bug in your program (e.g. the binary is linked with gprof
> and all of them are overwriting the same output file).
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
>



-- 
Jeff Blasius / jeff.blasius at yale.edu
Phone: (203)432-9940 51 Prospect Rm. 011
High Performance Computing (HPC)
UNIX Systems Administrator, Linux Systems Design & Support (LSDS)
Yale University Information Technology Services (ITS)


