[lustre-discuss] Problem with OPA fabric leading to unexpected lustre behaviour

Kurt Strosahl strosahl at jlab.org
Mon Apr 15 05:24:29 PDT 2019


Good Morning,


     I'm working on an issue with my OPA network that is having an unusual impact on Lustre.  When one of the nodes on the OPA fabric reboots, it sometimes has trouble reaching one of the four LNet routers that we have set up.  This isn't, in itself, a Lustre problem, as the affected nodes can't even ping that LNet router's OPA interface.  The impact is that the Lustre file system can't mount, even though the other three LNet routers are available.  Eventually the issue clears up and Lustre is able to mount, but I'm wondering why having one of the four LNet routers unreachable would prevent the file system from mounting.  Is it because the LNet router is only down from the OPA side, while the IB side is still up?


Here is what the system records in dmesg:

[   45.839655] Lustre: Server MGS version (2.5.42.4) is much older than client. Consider upgrading server (2.10.4)
[   52.847699] Lustre: 3469:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1555330354/real 1555330354]  req@ffff882ef0930300 x1630882080227408/t0(0) o501->MGC172.17.4.125@o2ib@172.17.4.125@o2ib:26/25 lens 296/272 e 0 to 1 dl 1555330361 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[   52.847735] LustreError: 166-1: MGC172.17.4.125@o2ib: Connection to MGS (at 172.17.4.125@o2ib) was lost; in progress operations using this service will fail
[   52.847933] LustreError: 15c-8: MGC172.17.4.125@o2ib: The configuration from log 'lustre2-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
[   52.848876] Lustre: Unmounted lustre2-client
[   52.849424] Lustre: MGC172.17.4.125@o2ib: Connection restored to MGC172.17.4.125@o2ib_0 (at 172.17.4.125@o2ib)
[   52.857803] LustreError: 3469:0:(obd_mount.c:1582:lustre_fill_super()) Unable to mount  (-5)
[   56.828713] LNet: 3604:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 172.20.0.3@o2ib1: 4294722 seconds
[  106.829847] LNet: 3604:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 172.20.0.3@o2ib1: 4294772 seconds

The 172.20.0.3 address is the lnet router in question.
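
For anyone who wants to pull the affected peers out of a log like the one above, here is a minimal sketch in Python. It is not part of Lustre or of our setup; the function name timed_out_peers and the regular expression are my own assumptions, based only on the two kiblnd_check_conns() lines shown here. The pattern accepts both "@" and the " at " form that the list archive substitutes for it.

```python
import re

# Hypothetical pattern, derived only from the o2iblnd timeout lines quoted
# in this message. Accepts "@" or the archive's " at " in the NID.
TIMEOUT_RE = re.compile(
    r"^\[\s*(?P<uptime>\d+\.\d+)\]\s+LNet:.*kiblnd_check_conns\(\)\)\s+"
    r"Timed out tx for (?P<addr>[\d.]+)(?:@| at )(?P<net>\w+): "
    r"(?P<secs>\d+) seconds"
)


def timed_out_peers(dmesg_text):
    """Map each timed-out peer NID to the uptimes (seconds) it was reported at."""
    peers = {}
    for line in dmesg_text.splitlines():
        m = TIMEOUT_RE.match(line.strip())
        if m:
            nid = f"{m['addr']}@{m['net']}"
            peers.setdefault(nid, []).append(float(m["uptime"]))
    return peers


# The two timeout lines from the dmesg excerpt in this message.
sample = """\
[   56.828713] LNet: 3604:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 172.20.0.3 at o2ib1: 4294722 seconds
[  106.829847] LNet: 3604:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 172.20.0.3 at o2ib1: 4294772 seconds
"""

print(timed_out_peers(sample))
# prints {'172.20.0.3@o2ib1': [56.828713, 106.829847]}
```

Feeding it the full output of dmesg from a freshly rebooted node should show whether the timeouts involve only the one router or others as well.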


w/r,

Kurt J. Strosahl

System Administrator: Lustre, HPC
Scientific Computing Group, Thomas Jefferson National Accelerator Facility