[Lustre-discuss] lock callback timer expired, lock on destroyed export, locks stolen, busy with active RPCs, operation 400 on unconnected MDS

Thomas Roth t.roth at gsi.de
Mon May 3 08:49:15 PDT 2010


Hi all,

I just want to share a recent insight, and to increase the number of Google
hits for those who suffer from:
- MDT / filesystem becoming suddenly unusable
- LustreError:  ... lock callback timer expired ...
- LustreError: ...  lock on destroyed export ...
- Lustre: ... Stealing 1 locks ...
- Lustre: ... All locks stolen ...
- LustreError: ... busy with active 2 RPCs ...
- LustreError: ... operation 400 on unconnected MDS ...
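
For anyone who wants to check their own MDS logs for the same set of messages, here is a minimal sketch in Python. The substrings are taken from the list above; the category names and the function `count_symptoms` are my own hypothetical labels, not anything Lustre provides, and the exact message wording may differ between Lustre versions.

```python
import re

# Substrings copied from the messages listed above. The parts shown as
# "..." in the list vary, so only the stable fragments are matched.
# The dictionary keys are hypothetical labels chosen for this sketch.
PATTERNS = {
    "lock_callback_timeout": re.compile(r"lock callback timer expired"),
    "lock_on_destroyed_export": re.compile(r"lock on destroyed export"),
    "stealing_locks": re.compile(r"Stealing \d+ locks"),
    "all_locks_stolen": re.compile(r"All locks stolen"),
    "busy_active_rpcs": re.compile(r"busy with active \d+ RPCs"),
    "op_on_unconnected_mds": re.compile(r"operation \d+ on unconnected MDS"),
}


def count_symptoms(lines):
    """Count how many lines match each symptom pattern."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

It can be fed any iterable of log lines, for example an open syslog file: `count_symptoms(open(path, errors="replace"))`, where `path` is wherever your MDS writes its kernel messages.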

We have seen all of these and more on the MDT of our 1.6.7.2 cluster,
after it had been running for one year without major problems. For the
last two weeks, though, the system hasn't had an uptime of more than 30 hours.

We found a user's job submission script that probably caused all this by
starting
- several hundred (about 900) jobs simultaneously,
- all of them opening one and the same file for batch system errors and
one and the same file for their output.
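
One way to avoid the access pattern described above, assuming that the contention on the two shared files is indeed the cause, is to give every job its own output and error file and to submit in smaller batches. The following Python sketch only computes the file names and the batches; the function names, the `.out`/`.err` naming scheme and the batch size are hypothetical, and the actual submission call depends on your batch system.

```python
from pathlib import Path


def per_job_log_paths(log_dir, job_name, job_index):
    """Return a distinct (stdout, stderr) path pair for one job.

    The naming scheme is an assumption made for this sketch; the point
    is only that no two jobs share a file.
    """
    stem = f"{job_name}.{job_index:04d}"
    return Path(log_dir) / f"{stem}.out", Path(log_dir) / f"{stem}.err"


def batches(items, size):
    """Yield consecutive chunks of at most `size` items, so that jobs
    can be submitted in groups rather than all at once."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

A submission script could then loop over `batches(list(range(900)), 100)`, pass the two paths from `per_job_log_paths` to the batch system for each job, and pause between batches.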

So if someone is sitting in front of an uncooperative MDT, dazed and
confused as I was, perhaps this is the direction to investigate.

Still, I'd like to learn more about "operation X on unconnected MDS". On
the net I only found my own question from two years ago.

Regards,
Thomas

-- 
--------------------------------------------------------------------
Thomas Roth
Department: Informationstechnologie
GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1
64291 Darmstadt
