[lustre-discuss] Lustre client mount fails: Request sent has timed out for slow reply
Ms. Megan Larko
dobsonunit at gmail.com
Fri Nov 25 11:05:07 PST 2016
I have a very small HPC cluster running CentOS 7.2. The lustre servers are
running lustre kernel-3.10.0-327.3.1.el7_lustre.x86_64. The clients are
I have two compute node clients successfully mounting the Lustre file
system from the servers. The next two compute clients will not mount
lustre. I have the lustre-client-2.8.0-3.10.0_327.3.1.el7.x86_64 and
lustre-client-modules-2.8.0-3.10.0_327.3.1.el7.x86_64 RPMs installed on all
compute clients, including the next two. My InfiniBand network is up and
successfully pings the other systems. I can cleanly "modprobe lustre"
using /etc/modprobe.d/lustre.conf containing one line: options lnet
networks="o2ib0(ib0)". This information is the same on both Lustre client
and server systems, all of which use ib0.
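
To compare the LNet setting across nodes, the option line quoted above can be parsed mechanically. The following is only a minimal Python sketch: the function name parse_lnet_networks is hypothetical, and it handles just the simple comma-separated net(iface) form, not the full LNet option syntax.

```python
import re

def parse_lnet_networks(line):
    """Return [(network, [interfaces])] from an 'options lnet networks=...' line.

    Hypothetical helper: covers only comma-separated 'net(iface,...)' entries.
    """
    match = re.search(r'networks\s*=\s*"?([^"\s]+)"?', line)
    if match is None:
        return []
    result = []
    # Each entry is a network name optionally followed by '(iface,...)'.
    for entry in re.findall(r'[^,(]+(?:\([^)]*\))?', match.group(1)):
        net, _, rest = entry.partition("(")
        ifaces = [i for i in rest.rstrip(")").split(",") if i]
        result.append((net.strip(), ifaces))
    return result

print(parse_lnet_networks('options lnet networks="o2ib0(ib0)"'))
```

Running this against /etc/modprobe.d/lustre.conf on a working client and on a failing client, then comparing the two results, would confirm that the files really are identical.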
On the next two compute clients I can successfully run "lctl ping mds-ib@o2ib0"
and ping the OSS the same way. I try to mount the Lustre file
system on those two clients with the command "mount -t lustre
A.B.C.D@o2ib0:/myLustre /myLustre", where the A.B.C.D address exists and
works as described above and the Lustre file system is "myLustre", which
mounts successfully on the two earlier compute clients.
This mount fails on both of my next two compute clients with the STDERR:
mount.lustre: mount A.B.C.D@o2ib0:/myLustre /myLustre failed: Input/output error
The compute client /var/log/messages file shows:
[date] [hostname] kernel: Lustre:
@@@ Request sent has timed out for slow reply: [sent 1480097968/real
1480097992] req@ffff8800aa14000 x1551992831868952/t0(0)
o250->MGCA.B.C.D@o2ib@A.B.C.D@o2ib:26:25 lens 520/544 e 0 to 1 dl
1480097973 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
The above appears 2X in a row followed by:
[date] [hostname] kernel: LustreError: 15c-8: MGCA.B.C.D@o2ib: The
configuration from log 'myLustre-client' failed (-5). This may be the
result of communication errors between this node and the MGS, a bad
configuration, or other errors. See the syslog for more information.
[date] [hostname] kernel: Lustre: Unmounted myLustre-client
[date] [hostname] kernel: LustreError:
unable to mount (-5)
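
The numbers in these messages can be read off with a few lines of Python. This is a rough sketch, not a Lustre tool: the helper names request_timing and describe_rc are hypothetical, the fragment is copied from the log above, and the exact meaning of the "sent" and "real" fields is defined by the Lustre source and is not asserted here. The code only reports the gap between the two timestamps and the errno name behind the negative return code.

```python
import errno
import os
import re

def request_timing(log_text):
    """Return (sent, real, real - sent) from a 'sent N/real M' fragment."""
    match = re.search(r"sent\s+(\d+)/real\s+(\d+)", log_text)
    if match is None:
        return None
    sent, real = int(match.group(1)), int(match.group(2))
    return sent, real, real - sent

def describe_rc(rc):
    """Map a negative kernel return code such as -5 to (errno name, text)."""
    code = abs(rc)
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)

# Fragment taken from the timed-out request quoted above.
fragment = "[sent 1480097968/real 1480097992]"
print(request_timing(fragment))
print(describe_rc(-5))
```

For the values quoted above the two timestamps are 24 seconds apart, and -5 is EIO, which matches the Input/output failure that mount.lustre reports.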
As all four compute nodes are built from a single kickstart file, I do not
understand why two compute clients can mount the /myLustre file system and
two cannot. The IB fabric on the in-kernel opensm-3.3.10-1.el7.x86_64
looks clean with no entries in the /var/log/opensm-unhealthy-ports-dump.
If I go all the way back to the last opensm start I do see a single line in
/var/log/opensm.log on the opensm server for the next compute client:
subn_validate_neighbor: ERR 7518: neighbor does not point back at us (guid:
[GUID of my next compute client])
Is this last opensm error completely stopping my Lustre mount when all
other IP pings are completely successful?
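
To see whether ERR 7518 is logged for both failing clients and for neither working one, the opensm log can be tallied per GUID. The Python below is a sketch under assumptions: the function name guids_with_err_7518 is hypothetical, the sample lines and their GUIDs are made-up placeholders, and the regular expression assumes each entry sits on one line with the "(guid: ...)" suffix shown above.

```python
import re
from collections import Counter

# Assumes the GUID follows "(guid:" on the same line as "ERR 7518:".
ERR_7518 = re.compile(r"ERR 7518:.*\(guid:\s*([^,)\s]+)")

def guids_with_err_7518(lines):
    """Count ERR 7518 occurrences per GUID in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = ERR_7518.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Made-up sample lines; the GUIDs are placeholders, not values from this post.
sample = [
    "subn_validate_neighbor: ERR 7518: neighbor does not point back at us (guid: 0x0002c90300000001)",
    "osm_state_mgr_process: some unrelated message",
    "subn_validate_neighbor: ERR 7518: neighbor does not point back at us (guid: 0x0002c90300000001)",
]
print(guids_with_err_7518(sample))
```

On the opensm server the same function could be fed the real file, for example with open("/var/log/opensm.log") as f and guids_with_err_7518(f), and the resulting GUIDs compared against the port GUIDs of the four compute clients.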