[Lustre-discuss] Lustre 1.6.4.2 Error
Dilling
dilling at zdv.uni-tuebingen.de
Fri Mar 21 11:15:18 PDT 2008
Hi,
a few days ago one of my users started a large number of MATLAB jobs that
occupied all processors on our 40-node, 2-CPU cluster (4-core system). More
details and log files can be found in the attachment. As a result, I
observed strange behavior from Lustre: ptlrpcd used 100% of one
CPU, and the second CPU was completely occupied by pwd, which was a child of
the MATLAB process invoked by the user. I/O on Lustre was partly
possible, but df reported "access denied". A recovery with the MDT
started after lustre.timeout=300 but did not complete. I had to reboot
all nodes that showed this behavior. The OST logged the message:
Mar 14 16:47:05 cn46 kernel: LustreError: 138-a: lustre-OST0003: A
client on nid 10.128.15.2 at tcp was evicted due to a lock glimpse
callback to 10.128.15.2 at tcp timed out: rc -110
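
For anyone scanning syslog for the same symptom, here is a minimal sketch in Python that pulls the fields out of an eviction message like the one above and names the return code. It rests on two assumptions: that Lustre return codes are negated Linux errno values, where 110 is ETIMEDOUT, and that the message has exactly the wording shown in this post. The names parse_eviction, EVICTION_RE and LINUX_ERRNO are made up for illustration and are not part of Lustre.

```python
import re

# Assumption: Lustre return codes are negated Linux errno values.
# Only the value seen in this post is listed; 110 is ETIMEDOUT on Linux.
LINUX_ERRNO = {110: "ETIMEDOUT"}

# Hypothetical pattern, written against the single message quoted above.
# The list archive rewrites "@" as " at ", so both forms are accepted.
EVICTION_RE = re.compile(
    r"LustreError: (?P<code>\S+): (?P<target>\S+): A client on nid "
    r"(?P<nid>\S+?)(?: at |@)(?P<net>\w+) was evicted due to a lock "
    r"(?P<kind>\w+) callback .* timed out: rc (?P<rc>-?\d+)"
)


def parse_eviction(message):
    """Return the fields of an eviction message as a dict, or None."""
    # Collapse line wrapping introduced by mail clients or syslog viewers.
    text = " ".join(message.split())
    match = EVICTION_RE.search(text)
    if match is None:
        return None
    rc = int(match.group("rc"))
    return {
        "target": match.group("target"),
        "nid": match.group("nid"),
        "net": match.group("net"),
        "callback": match.group("kind"),
        "rc": rc,
        "errno_name": LINUX_ERRNO.get(-rc, "unknown"),
    }


# The message from this post, wrapped as it appears in the mail.
SAMPLE = """Mar 14 16:47:05 cn46 kernel: LustreError: 138-a: lustre-OST0003: A
client on nid 10.128.15.2 at tcp was evicted due to a lock glimpse
callback to 10.128.15.2 at tcp timed out: rc -110"""

if __name__ == "__main__":
    print(parse_eviction(SAMPLE))
```

Read this way, rc -110 means the OST gave up waiting for the client to answer the glimpse callback, which would fit clients whose cores were stuck in soft lockups.
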
The client kernels reported soft lockups on all available cores.
Does anyone have an idea how to prevent such behavior? Thanks for your help.
Regards
w.d.
--------------------------------------------------------------------------------
W.Dilling Tel.: (49) 7071/29-70206
Universitaet Tuebingen Fax.: (49) 7071/29-5912
Zentrum fuer Datenverarbeitung mail: dilling at zdv.uni-tuebingen.de
Waechterstrasse 76
72074 Tuebingen
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lustre_error_14.03.2008.tar
Type: application/x-tar
Size: 71680 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20080321/ff21f16f/attachment.tar>