[Lustre-discuss] Failure to communicate with MDS via o2ib

Charles Taylor taylor@hpc.ufl.edu
Tue May 27 06:50:38 PDT 2008


Whoops, I meant to include the mount-time error messages...


/etc/init.d/lustre-client start
IB HCA detected - will try to sleep until link state becomes ACTIVE
   State becomes ACTIVE
Loading Lustre lnet module with option networks=o2ib:      [  OK  ]
Loading Lustre kernel module:                              [  OK  ]
mount -t lustre 10.13.24.40@o2ib:/ufhpc /ufhpc/scratch:


mount.lustre: mount 10.13.24.40@o2ib:/ufhpc at /ufhpc/scratch failed:
Cannot send after transport endpoint shutdown
                                                            [FAILED]
Error: Failed to mount 10.13.24.40@o2ib:/ufhpc
mount -t lustre 10.13.24.90@o2ib:/crn /crn/scratch:
mount.lustre: mount 10.13.24.90@o2ib:/crn at /crn/scratch failed:
Cannot send after transport endpoint shutdown
                                                            [FAILED]
Error: Failed to mount 10.13.24.90@o2ib:/crn
mount -t lustre 10.13.24.85@o2ib:/hpcdata /ufhpc/hpcdata:
mount.lustre: mount 10.13.24.85@o2ib:/hpcdata at /ufhpc/hpcdata failed:
Cannot send after transport endpoint shutdown
                                                            [FAILED]
Error: Failed to mount 10.13.24.85@o2ib:/hpcdata
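
If it helps anyone poke at the same thing, LNet reachability can be
probed directly with lctl (a minimal check, assuming the lnet module is
loaded as above; the NID below is just our first MDS):

   # NIDs this client is advertising on its configured networks
   lctl list_nids
   # ping the MDS NID at the LNet level, below the filesystem layer
   lctl ping 10.13.24.40@o2ib

If the lctl ping fails the same way while IPoIB pings succeed, that at
least pins the problem to the LNet/o2ib layer rather than Lustre proper.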
Charlie Taylor
UF HPC Center


On May 27, 2008, at 9:46 AM, Charles Taylor wrote:

>
>
> We have a few nodes that locked up due to memory oversubscription.
> After rebooting, they can no longer communicate with any of our
> three MDSs over IB and, consequently, we can no longer mount our
> Lustre 1.6.4.2 file systems on these nodes.  All other communication
> via the IB port (IPoIB for pings, ssh, etc.) seems fine.  If we
> re-cable a node to use its second IB port, communication with the
> MDSs is re-established and we can mount the file systems again.
> Note that this affects only a few nodes (out of 400) that seem to
> have gotten into a bad state with regard to Lustre.
>
> Relevant info:
>
> Lustre 1.6.4.2
>
> CentOS 4.5 w/ updated kernel.
>
> Linux r5b-s30.ufhpc 2.6.18-8.1.14.el5.L-1642 #1 SMP Wed Feb 20
> 10:59:48 EST 2008 x86_64 x86_64 x86_64 GNU/Linux
>
> OFED 1.2
>
> HCA #0: MT25208 Tavor Compat, Lion Cub, revision A0
>   Primary image is valid, unknown source
>   Secondary image is valid, unknown source
>
>   Vital Product Data
>     Product Name: Lion cub
>     P/N: MHEA28-1TC
>     E/C: A2
>     S/N: MT0637X00650
>     Freq/Power: PCIe x8
>     Checksum: Ok
>     Date Code: N/A
>
> We have not tried rebooting the MDSs yet (kind of painful), but I'm
> guessing that if we did, the issue would go away.  I suppose it
> could be a problem at the IB layer (LID reassignment or some such),
> but since Lustre is the only application manifesting the issue, that
> seems unlikely.  I'm just wondering if anyone else has encountered
> this and knows of a way to clear it out (some obscure lnet command)
> without rebooting the MDSs.
>
>
> Charlie Taylor
> UF HPC Center
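
P.S. The bluntest client-side reset we know of, short of rebooting, is
a full LNet/module reload; a sketch, assuming nothing else on the node
is using Lustre (we have not verified that this clears the bad state):

   umount -a -t lustre               # drop all lustre mounts first
   lctl network down                 # take LNet down cleanly
   lustre_rmmod                      # unload all lustre/lnet/lnd modules
   /etc/init.d/lustre-client start   # reload modules and remount

Since a full node reboot (which does all of the above and more) did not
help, the stale state is presumably on the MDS side, hence the question
about clearing it there without a reboot.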
