[lustre-discuss] Lustre client mount fails: Request sent has timed out for slow reply

Fri Nov 25 12:25:54 PST 2016

Possible causes in cases like this:
- duplicate client IP addresses (used only at connect time for o2iblnd)
- firewall rules (though unlikely to be the case for IB)
- SELinux (this is supported in Lustre 2.7+ but can still have rules that prevent mounting)
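
A minimal sketch of how the three causes above might be checked, offered as a starting point rather than a definitive procedure. The function names and the "hostname ip" inventory format are assumptions made for illustration. The only external commands used are getenforce, systemctl, and lctl list_nids, each skipped if not installed.

```shell
#!/bin/sh
# Sketch only. Function names and the "hostname ip" input format are
# hypothetical; they are not part of Lustre or any other package.

# Read "hostname ip" lines on stdin and print each IP address that
# appears on more than one line (a candidate duplicate client address).
find_duplicate_ips() {
    awk 'NF >= 2 { print $2 }' | sort | uniq -d
}

# Print the SELinux mode, the firewalld state, and the local LNet NIDs.
# Each check is skipped when the corresponding tool is not installed.
check_local_client() {
    if command -v getenforce >/dev/null 2>&1; then
        echo "SELinux: $(getenforce)"
    fi
    if command -v systemctl >/dev/null 2>&1; then
        echo "firewalld: $(systemctl is-active firewalld)"
    fi
    if command -v lctl >/dev/null 2>&1; then
        echo "LNet NIDs:"
        lctl list_nids
    fi
}
```

Running check_local_client on a working client and on a failing one, then comparing the output, should show whether SELinux or firewall state differs between them. find_duplicate_ips expects one line per node, for example the ib0 address collected by hand from each client, and prints nothing when every address is unique.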

Sorry, I don't know anything about opensm.  Presumably you've restarted these clients, and
other IB-level communications are working?

Cheers, Andreas

On Nov 25, 2016, at 12:05, Ms. Megan Larko <dobsonunit at gmail.com> wrote:
> 
> Greetings List!
> 
> I have a very small HPC cluster running CentOS 7.2.  The Lustre servers are running the Lustre kernel-3.10.0-327.3.1.el7_lustre.x86_64.  The clients are running kernel-3.10.0-327.3.1.el7.x86_64.
> 
> I have two compute node clients successfully mounting the Lustre file system from the servers.  The next two compute clients will not mount Lustre.  I have the lustre-client-2.8.0-3.10.0_327.3.1.el7.x86_64 and lustre-client-modules-2.8.0-3.10.0_327.3.1.el7.x86_64 RPMs installed on all compute clients, including the next two.  My InfiniBand network is up and successfully pings the other systems.  I can cleanly "modprobe lustre" using /etc/modprobe.d/lustre.conf containing one line: options lnet networks="o2ib0(ib0)".  This information is the same on both Lustre client and server systems, all of which use ib0.
> 
> On the next two compute clients I can successfully "lctl ping mds-ib@o2ib0", and I can ping the OSS the same way.  I try to mount the Lustre file system on the next two compute clients with the command "mount -t lustre A.B.C.D@o2ib0:/myLustre /myLustre", where the A.B.C.D address exists and works as described above.  The Lustre FS is "myLustre", and it mounts successfully on the two earlier compute clients.
> 
> This mount fails on both of my next two compute clients with the STDERR:
> 
> mount.lustre: mount A.B.C.D@o2ib0:/myLustre /myLustre failed: Input/output error
> 
> The compute client /var/log/messages file shows:
> [date] [hostname] kernel: Lustre: 51814:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1480097968/real 1480097992]  req@ffff8800aa14000 x1551992831868952/t0(0) o250->MGCA.B.C.D@o2ib@A.B.C.D@o2ib:26:25 lens 520/544 e 0 to 1 dl 1480997973 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
> 
> The above appears 2X in a row followed by:
> [date] [hostname] kernel: LustreError: 15c-8: MGCA.B.C.D@o2ib: The configuration from log 'myLustre-client' failed (-5).  This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors.  See the syslog for more information.
> [date] [hostname] kernel: Lustre: Unmounted myLustre-client
> [date] [hostname] kernel: LustreError: 53873:0:(obd_mount.c:1426:lustre_fill_super()) unable to mount  (-5)
> 
> As all four compute nodes are built from a single kickstart file, I do not understand why two compute clients can mount the /myLustre file system and two cannot.  The IB fabric on the in-kernel opensm-3.3.10-1.el7.x86_64 looks clean, with no entries in /var/log/opensm-unhealthy-ports-dump.  If I go all the way back to the last opensm start, I do see a single line in /var/log/opensm.log on the opensm server for the next compute client stating:
> subn_validate_neighbor: ERR 7518: neighbor does not point back at us (guid: [GUID of my next compute client])
> 
> Is this last opensm error completely stopping my Lustre mount when all other IP pings are completely successful?
> 
> TIA,
> megan