[Lustre-discuss] Failure to communicate with MDS via o2ib

Charles Taylor taylor at hpc.ufl.edu
Tue May 27 06:46:39 PDT 2008



We have a few nodes that locked up due to memory oversubscription.
After rebooting, they can no longer communicate with any of our
three MDSs over IB and, consequently, we can no longer mount our
Lustre 1.6.4.2 file systems on these nodes. All other communication
via the IB port (IPoIB for pings, ssh, etc.) seems fine. If we
re-cable a node to use its second IB port, communication with the
MDSs is re-established, we can mount the file systems again, and
everything works as expected. Note that this affects only a few
nodes (out of 400) that seem to have gotten into a bad state with
regard to Lustre.

Relevant info:

Lustre 1.6.4.2

CentOS 4.5 w/ updated kernel.

Linux r5b-s30.ufhpc 2.6.18-8.1.14.el5.L-1642 #1 SMP Wed Feb 20 10:59:48 EST 2008 x86_64 x86_64 x86_64 GNU/Linux

OFED 1.2

HCA #0: MT25208 Tavor Compat, Lion Cub, revision A0
   Primary image is valid, unknown source
   Secondary image is valid, unknown source

   Vital Product Data
     Product Name: Lion cub
     P/N: MHEA28-1TC
     E/C: A2
     S/N: MT0637X00650
     Freq/Power: PCIe x8
     Checksum: Ok
     Date Code: N/A

We have not yet tried rebooting the MDSs (kind of painful), so we
don't know for sure, but I'm guessing that rebooting them would make
the issue go away. I suppose it could be a problem at the IB layer
(LID reassignment or some such), but since Lustre is the only
application that seems to be manifesting the issue, that seems
unlikely. I'm just wondering if anyone else has encountered this and
might know of a way to clear it out (some obscure lnet command)
without rebooting the MDSs.
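
For anyone wanting to reproduce the check on an affected node, here is a
minimal sketch of how LNET reachability of each MDS over o2ib could be
tested separately from IPoIB. It assumes that `lctl ping <nid>` is
available on the client and that it exits non-zero when the ping fails.
The IP addresses are hypothetical placeholders, not our real MDS
addresses, and the helper names are made up for illustration.

```python
import subprocess


def o2ib_nid(ip, net="o2ib"):
    """Build an LNET NID string such as '192.0.2.10@o2ib' from an IPoIB address."""
    return f"{ip}@{net}"


def ping_command(nid):
    """Return the argv list for an LNET-level ping of the given NID."""
    return ["lctl", "ping", nid]


def check_nids(nids, run=subprocess.run):
    """Ping each NID and map it to True (reachable) or False.

    Assumption: lctl exits with a non-zero status when the ping fails.
    The `run` parameter exists only so the function can be exercised
    without a real lctl binary.
    """
    results = {}
    for nid in nids:
        proc = run(ping_command(nid), capture_output=True, text=True)
        results[nid] = proc.returncode == 0
    return results


# Hypothetical MDS addresses; substitute the real IPoIB addresses.
MDS_IPS = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]

if __name__ == "__main__":
    for nid, ok in check_nids([o2ib_nid(ip) for ip in MDS_IPS]).items():
        print(f"{nid}: {'ok' if ok else 'FAILED'}")
```

On a node in the bad state, one would expect an ordinary IP ping to
these addresses to succeed while the LNET-level ping fails, which is
the symptom described above.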


Charlie Taylor
UF HPC Center


