[lustre-discuss] Lustre client 2.10.4 fails to mount over OPA fabric after kernel update to 3.10.0-862.11.6.el7

Anthony Brookfield a.brookfield at sheffield.ac.uk
Fri Aug 17 01:58:23 PDT 2018


Hi,

Is anyone else seeing clients on the latest 3.10.0-862.11.6.el7 kernel 
fail to mount Lustre over OPA?  We have Lustre 2.10.4 clients on CentOS 
7.5, running the OS-provided OPA software packages.  The servers are 
running Lustre 2.7.16.11.

Attempting to mount fails with errors:
kernel: LNet: Using FMR for registration
kernel: LNet: Added LNI 10.112.0.130@o2ib [128/2048/0/180]
kernel: Lustre: 1872:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1534426797/real 1534426797]  req@ffff8ca9059f0000 x1608963113091088/t0(0) o250->MGC10.112.0.11@o2ib@10.112.0.11@o2ib:26/25 lens 520/544 e 0 to 1 dl 1534426802 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
kernel: LustreError: 1632:0:(mgc_request.c:251:do_config_log_add()) MGC10.112.0.11@o2ib: failed processing log, type 1: rc = -5
kernel: LustreError: 1903:0:(mgc_request.c:603:do_requeue()) failed processing log: -5
kernel: Lustre: 1872:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1534426822/real 1534426822]  req@ffff8cb1109b0000 x1608963113091152/t0(0) o250->MGC10.112.0.11@o2ib@10.112.0.12@o2ib:26/25 lens 520/544 e 0 to 1 dl 1534426827 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
LustreError: 15c-8: MGC10.112.0.11@o2ib: The configuration from log 'lustre-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
kernel: Lustre: Unmounted lustre-client
mount: mount.lustre: mount 10.112.0.11@o2ib0:10.112.0.12@o2ib0:/lustre at /mnt/fastdata failed: Input/output error
mount: Is the MGS running?
kernel: LustreError: 1632:0:(obd_mount.c:1582:lustre_fill_super()) Unable to mount  (-5)
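
For reference, the -5 that recurs in these messages is a negated Linux 
errno value, which is why mount reports "Input/output error".  A minimal 
Python sketch of the lookup follows; the helper name decode_rc is made up 
for illustration, and the sample line is the do_requeue() message from 
the log above, rejoined onto one line.

```python
import errno
import os
import re

# The do_requeue() message from the log above, rejoined onto one line.
LOG_LINE = (
    "kernel: LustreError: 1903:0:(mgc_request.c:603:do_requeue()) "
    "failed processing log: -5"
)


def decode_rc(line):
    """Return (code, errno name, message) for a trailing negative
    return code in a log line, or None if the line has none."""
    match = re.search(r"(-\d+)\s*$", line)
    if match is None:
        return None
    code = int(match.group(1))
    number = abs(code)
    # errno.errorcode maps numeric values to names such as "EIO".
    name = errno.errorcode.get(number, "UNKNOWN")
    return code, name, os.strerror(number)


if __name__ == "__main__":
    print(decode_rc(LOG_LINE))
```

EIO is a generic code, so on its own it only confirms that the request to 
the MGS failed; it does not say why.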

Mounting Lustre over normal Ethernet on clients running kernel 
3.10.0-862.11.6.el7 works fine.

Downgrading the kernel packages to 3.10.0-862.9.1.el7 allows the clients 
to mount over OPA.

Omni-Path itself looks fine: IPoIB is working, the server addresses are 
pingable, and so on.  opainfo shows the link status is OK, and IMB test 
jobs run OK.
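
An IPoIB ping only exercises the IP path, whereas the o2ib LNet driver 
talks RDMA verbs, so an LNet-level check with "lctl ping" may be a closer 
test of what the mount actually needs.  A rough Python sketch, assuming 
lctl is installed on the client; the helper names are hypothetical and 
the NIDs are the two MGS addresses that appear in the log above.

```python
import shutil
import subprocess

# MGS NIDs as they appear in the log above.
MGS_NIDS = ["10.112.0.11@o2ib", "10.112.0.12@o2ib"]


def lnet_ping_cmd(nid):
    """Build the argument list for an LNet-level ping of one NID."""
    return ["lctl", "ping", nid]


def lnet_ping(nid, timeout=30):
    """Run 'lctl ping <nid>'.

    Returns True if the command exits with status 0, False if it fails
    or times out, and None if lctl is not found on this machine.
    """
    if shutil.which("lctl") is None:
        return None
    try:
        result = subprocess.run(
            lnet_ping_cmd(nid),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


if __name__ == "__main__":
    for nid in MGS_NIDS:
        print(nid, lnet_ping(nid))
```

Running this on both the failing and the downgraded kernel would show 
whether the difference is already visible at the LNet layer, before any 
mount is attempted.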

It would be helpful to know whether anyone else with OPA is seeing the 
same problem, or whether it is just our setup.

Cheers,

Anthony.

-- 
Dr Anthony Brookfield
Research Computing Infrastructure Specialist
CiCS, University of Sheffield.


