[Lustre-discuss] Failure to communicate with MDS via o2ib
Charles Taylor
taylor at hpc.ufl.edu
Tue May 27 06:46:39 PDT 2008
We have a few nodes that locked up due to memory oversubscription.
After rebooting, they can no longer communicate with any of our
three MDSs over IB and, consequently, we can no longer mount our
Lustre 1.6.4.2 file systems on them. All other communication via
the IB port (IPoIB for pings, ssh, etc.) seems fine. If we re-cable
a node to use its second IB port, communication with the MDSs is
re-established and we can mount the file systems as expected. Note
that this affects only a few nodes (out of 400) that seem to have
gotten into a bad state with regard to Lustre.
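To make the distinction between IP-level and LNet-level reachability
concrete, here is a minimal sketch of the kind of check involved,
assuming the MDSs are reachable as <ip>@o2ib NIDs and that "lctl ping"
exits non-zero on failure. The helper names and the example address
are hypothetical and not taken from our configuration.

```python
import subprocess


def make_nid(ip, net="o2ib"):
    """Build an LNet NID string such as '192.0.2.1@o2ib'."""
    return "{}@{}".format(ip, net)


def lctl_ping_cmd(nid):
    """Return the argv for an LNet-level ping of the given NID."""
    return ["lctl", "ping", nid]


def check_mds(ips, run=subprocess.run):
    """Map each MDS NID to True/False depending on whether 'lctl ping' succeeds.

    Assumption: a zero exit status means the NID answered. The 'run'
    parameter exists only so the function can be exercised without lctl.
    """
    results = {}
    for ip in ips:
        nid = make_nid(ip)
        try:
            proc = run(lctl_ping_cmd(nid), capture_output=True,
                       text=True, timeout=30)
            results[nid] = (proc.returncode == 0)
        except (OSError, subprocess.TimeoutExpired):
            # lctl missing, or the ping hung: treat as unreachable
            results[nid] = False
    return results


if __name__ == "__main__":
    # 192.0.2.1 is a placeholder documentation address, not a real MDS
    for nid, ok in check_mds(["192.0.2.1"]).items():
        print(nid, "reachable" if ok else "NOT reachable")
```

On an affected node, the expectation would be that an ordinary IP ping
to the MDS succeeds while this LNet-level check reports the same NID
as not reachable.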
Relevant info:
Lustre 1.6.4.2
CentOS 4.5 w/ updated kernel.
Linux r5b-s30.ufhpc 2.6.18-8.1.14.el5.L-1642 #1 SMP Wed Feb 20
10:59:48 EST 2008 x86_64 x86_64 x86_64 GNU/Linux
OFED 1.2
HCA #0: MT25208 Tavor Compat, Lion Cub, revision A0
Primary image is valid, unknown source
Secondary image is valid, unknown source
Vital Product Data
Product Name: Lion cub
P/N: MHEA28-1TC
E/C: A2
S/N: MT0637X00650
Freq/Power: PCIe x8
Checksum: Ok
Date Code: N/A
We have not tried rebooting the MDSs yet (kind of painful), so we
don't know for sure, but I'm guessing that doing so would make the
issue go away. I suppose it could be a problem at the IB layer (LID
reassignment or some such), but since Lustre is the only application
that seems to be manifesting the issue, that seems unlikely. I'm just
wondering if anyone else has encountered this and might know of a way
to clear it out (some obscure lnet command) without rebooting the MDSs.
Charlie Taylor
UF HPC Center