[Lustre-discuss] Failure to communicate with MDS via o2ib
Charles Taylor
taylor at hpc.ufl.edu
Tue May 27 07:36:07 PDT 2008
Here it is for one of the other MDSs (10.13.16.24@o2ib). As you
can see, the ipoib ping succeeds but the "lctl ping" fails, as does the
mount. The last few lines of dmesg are also below.
[root@r5b-s41 ~]# ping 10.13.16.24
PING 10.13.16.24 (10.13.16.24) 56(84) bytes of data.
64 bytes from 10.13.16.24: icmp_seq=0 ttl=64 time=0.168 ms
--- 10.13.16.24 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.168/0.168/0.168/0.000 ms, pipe 2
[root@r5b-s41 ~]# lctl ping 10.13.16.24@o2ib
failed to ping 10.13.16.24@o2ib: Input/output error
[root@r5b-s41 ~]# mount -t lustre 10.13.16.24@o2ib:/ufhpc /ufhpc/scratch
mount.lustre: mount 10.13.16.24@o2ib:/ufhpc at /ufhpc/scratch failed:
Cannot send after transport endpoint shutdown
dmesg....
LustreError: 12980:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8102320ee400 x15/t0 o501->MGS@MGC10.13.16.24@o2ib_0:26 lens 136/120 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 15c-8: MGC10.13.16.24@o2ib: The configuration from log
'ufhpc-client' failed (-108). This may be the result of communication
errors between this node and the MGS, a bad configuration, or other
errors. See the syslog for more information.
LustreError: 12980:0:(llite_lib.c:1021:ll_fill_super()) Unable to
process log: -108
Lustre: client ffff810232fc3800 umount complete
LustreError: 12980:0:(obd_mount.c:1924:lustre_fill_super()) Unable to
mount (-108)
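
For anyone reproducing this, the checks Isaac asked for below can be gathered in one pass. This is only a minimal sketch, assuming a POSIX shell: lnet_diag_cmds is a hypothetical helper name (not part of Lustre), and it just prints the lctl commands already mentioned in this thread for a given NID rather than running them.

```shell
# Hypothetical helper (not part of Lustre): print the LNET checks
# requested in this thread for a given NID such as 10.13.24.90@o2ib.
lnet_diag_cmds() {
    nid="$1"
    net="${nid##*@}"    # text after the last '@', e.g. o2ib
    echo "lctl list_nids"
    echo "lctl ping $nid"
    echo "lctl --net $net peer_list"
    echo "lctl --net $net conn_list"
}

# Print the commands for review; the default NID is just the one
# from Isaac's request and is used when no argument is given.
lnet_diag_cmds "${1:-10.13.24.90@o2ib}"
```

Piping the output to "sh -x" on an affected client would run the checks and echo each command before its result, which makes the output easier to paste into a reply.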
Thanks,
Charlie Taylor
On May 27, 2008, at 10:13 AM, Isaac Huang wrote:
> On Tue, May 27, 2008 at 09:50:38AM -0400, Charles Taylor wrote:
>> Whoops, I meant to include the mount-time error message....
>>
>> /etc/init.d/lustre-client start
>> IB HCA detected - will try to sleep until link state becomes ACTIVE
>> State becomes ACTIVE
>> Loading Lustre lnet module with option networks=o2ib: [ OK ]
>> Loading Lustre kernel module: [ OK ]
>> mount -t lustre 10.13.24.40@o2ib:/ufhpc /ufhpc/scratch:
>> mount.lustre: mount 10.13.24.40@o2ib:/ufhpc at /ufhpc/scratch failed:
>> Cannot send after transport endpoint shutdown
>> [FAILED]
>> Error: Failed to mount 10.13.24.40@o2ib:/ufhpc
>> mount -t lustre 10.13.24.90@o2ib:/crn /crn/scratch:
>> mount.lustre: mount 10.13.24.90@o2ib:/crn at /crn/scratch failed:
>> Cannot send after transport endpoint shutdown
>> [FAILED]
>> Error: Failed to mount 10.13.24.90@o2ib:/crn
>> mount -t lustre 10.13.24.85@o2ib:/hpcdata /ufhpc/hpcdata:
>> mount.lustre: mount 10.13.24.85@o2ib:/hpcdata at /ufhpc/hpcdata failed:
>> Cannot send after transport endpoint shutdown
>> [FAILED]
>> Error: Failed to mount 10.13.24.85@o2ib:/hpcdata
>
> Was there any error message in 'dmesg'? Can you try 'lctl ping
> 10.13.24.90@o2ib'? (and 'lctl list_nids' and 'lctl --net o2ib
> peer_list' and 'lctl --net o2ib conn_list').
>
> Isaac