[lustre-discuss] MOFED 4.4-1.0.0.0

Hans Henrik Happe happe at nbi.dk
Fri Aug 3 04:53:28 PDT 2018


Hi,

Has anyone tried Mellanox OFED 4.4-1.0.0.0?

With Lustre 2.10.4 on CentOS 6.10 and 6.9 we are seeing issues. With
CentOS 6.9 and the previous supported MOFED version there are no
problems (CentOS 6.10 is not supported by the previous version).

We are using ConnectX-3 cards on kernel 2.6.32-696.18.7.el6.x86_64.

The first mount after starting openibd fails. The attached 'first.txt'
shows the log.

A second mount succeeds ('second.txt'). The OSTs are slowly added after
some timeouts. Everything seems to work after this.

After this we can unmount and mount again and everything is normal.
However, after reloading the driver (restarting openibd) the first
mount fails again.
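
The sequence above could be scripted roughly as follows. This is only a minimal sketch under assumptions: the MGS NID and file system name are taken from the attached logs, the mount point `/mnt/hpc` is hypothetical, and the helper names `run` and `reproduce` are made up for illustration.

```python
import subprocess

MGS_NID = "10.21.10.111@o2ib"  # MGS NID as it appears in the attached logs
FSNAME = "hpc"                 # file system name as it appears in the logs
MOUNT_POINT = "/mnt/hpc"       # hypothetical mount point


def run(cmd):
    """Run a command and return its exit status."""
    return subprocess.call(cmd)


def reproduce(run=run):
    """Restart openibd, then attempt the mount up to two times.

    Returns (first, second) exit codes; second is None when the
    first mount already succeeded.
    """
    mount = ["mount", "-t", "lustre", f"{MGS_NID}:/{FSNAME}", MOUNT_POINT]
    run(["service", "openibd", "restart"])
    first = run(mount)
    second = None if first == 0 else run(mount)
    return first, second
```

Calling `reproduce()` as root on an affected client would be expected to return a non-zero first code and a zero second code if the behaviour matches the description. It assumes Lustre is unmounted and its modules are unloaded before openibd is restarted.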

I'll have a go at CentOS 7.5 and contact Mellanox next.

Cheers,
Hans Henrik
-------------- next part --------------
Aug  3 13:26:49 node578 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 64, npartitions: 2
Aug  3 13:26:49 node578 kernel: alg: No test for adler32 (adler32-zlib)
Aug  3 13:26:49 node578 kernel: alg: No test for crc32 (crc32-table)
Aug  3 13:26:49 node578 kernel: alg: No test for crc32 (crc32-pclmul)
Aug  3 13:26:50 node578 kernel: Lustre: Lustre: Build Version: 2.10.4
Aug  3 13:26:50 node578 kernel: LNet: Added LNI 10.21.205.78@o2ib [8/256/0/180]
Aug  3 13:26:53 node578 kernel: LNet: 73616:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 10.21.10.111@o2ib: 4478161 seconds
Aug  3 13:26:53 node578 kernel: Lustre: 73626:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1533295610/real 1533295613]  req@ffff885f76c0bc80 x1607776977551376/t0(0) o250->MGC10.21.10.111@o2ib@10.21.10.111@o2ib:26/25 lens 520/544 e 0 to 1 dl 1533295615 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Aug  3 13:26:56 node578 kernel: LustreError: 73562:0:(mgc_request.c:251:do_config_log_add()) MGC10.21.10.111@o2ib: failed processing log, type 1: rc = -5
Aug  3 13:27:05 node578 kernel: LustreError: 73703:0:(mgc_request.c:603:do_requeue()) failed processing log: -5
Aug  3 13:27:18 node578 kernel: LNet: 73616:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 10.21.10.111@o2ib: 4478186 seconds
Aug  3 13:27:18 node578 kernel: Lustre: 73626:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1533295635/real 1533295638]  req@ffff88bfa9bc3cc0 x1607776977551440/t0(0) o250->MGC10.21.10.111@o2ib@10.21.10.111@o2ib:26/25 lens 520/544 e 0 to 1 dl 1533295645 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Aug  3 13:27:27 node578 kernel: LustreError: 15c-8: MGC10.21.10.111@o2ib: The configuration from log 'hpc-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Aug  3 13:27:27 node578 kernel: Lustre: Unmounted hpc-client
Aug  3 13:27:27 node578 kernel: LustreError: 73562:0:(obd_mount.c:1582:lustre_fill_super()) Unable to mount  (-5)
-------------- next part --------------
Aug  3 13:41:33 node578 kernel: Lustre: hpc: root_squash is set to 99:99
Aug  3 13:41:33 node578 kernel: Lustre: hpc: nosquash_nids set to 172.20.1.10@tcp1 172.20.1.221@tcp1 172.20.1.71@tcp1 10.121.16.11@tcp1
Aug  3 13:41:39 node578 kernel: Lustre: 73626:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1533296493/real 0]  req@ffff885f98ec79c0 x1607776977551760/t0(0) o38->hpc-MDT0000-mdc-ffff885f887bf800@10.21.10.101@o2ib:12/10 lens 520/544 e 0 to 1 dl 1533296498 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Aug  3 13:42:51 node578 kernel: LNet: 73616:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 10.21.10.102@o2ib: 4479119 seconds
Aug  3 13:42:51 node578 kernel: Lustre: 73626:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1533296568/real 1533296571]  req@ffff88bf97e36cc0 x1607776977551840/t0(0) o38->hpc-MDT0000-mdc-ffff885f887bf800@10.21.10.102@o2ib:12/10 lens 520/544 e 0 to 1 dl 1533296573 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Aug  3 13:43:14 node578 kernel: Lustre: Mounted hpc-client
Aug  3 13:43:16 node578 kernel: LustreError: 73774:0:(llite_lib.c:1772:ll_statfs_internal()) obd_statfs fails: rc = -5
Aug  3 13:43:17 node578 kernel: LNet: 73616:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 10.21.10.112@o2ib: 4479145 seconds
Aug  3 13:43:17 node578 kernel: Lustre: 73626:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1533296594/real 1533296597]  req@ffff885f98ec7cc0 x1607776977551904/t0(0) o8->hpc-OST0001-osc-ffff885f887bf800@10.21.10.112@o2ib:28/4 lens 520/544 e 0 to 1 dl 1533296599 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Aug  3 13:43:18 node578 kernel: LustreError: 73775:0:(llite_lib.c:1772:ll_statfs_internal()) obd_statfs fails: rc = -5
Aug  3 13:43:19 node578 kernel: LNet: 73616:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 10.21.10.121@o2ib: 5 seconds
Aug  3 13:43:19 node578 kernel: LNet: 73616:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 3 previous similar messages
Aug  3 13:43:33 node578 kernel: LustreError: 73782:0:(llite_lib.c:1772:ll_statfs_internal()) obd_statfs fails: rc = -5
Aug  3 13:43:38 node578 kernel: LustreError: 73784:0:(llite_lib.c:1772:ll_statfs_internal()) obd_statfs fails: rc = -5
Aug  3 13:43:38 node578 kernel: LustreError: 73784:0:(llite_lib.c:1772:ll_statfs_internal()) Skipped 1 previous similar message
Aug  3 13:43:44 node578 kernel: Lustre: 73626:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1533296619/real 0]  req@ffff88bfad20cc80 x1607776977552208/t0(0) o8->hpc-OST0006-osc-ffff885f887bf800@10.21.10.120@o2ib:28/4 lens 520/544 e 0 to 1 dl 1533296624 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Aug  3 13:43:44 node578 kernel: Lustre: 73626:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
Aug  3 13:43:44 node578 kernel: LNet: 73616:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 10.21.10.120@o2ib: 4479172 seconds
Aug  3 13:43:44 node578 kernel: LNet: 73616:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 1 previous similar message
Aug  3 13:44:34 node578 kernel: Lustre: 73626:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1533296669/real 0]  req@ffff88bf97f9fcc0 x1607776977552816/t0(0) o8->hpc-OST0002-osc-ffff885f887bf800@10.21.10.113@o2ib:28/4 lens 520/544 e 0 to 1 dl 1533296674 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Aug  3 13:44:34 node578 kernel: Lustre: 73626:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Aug  3 13:44:59 node578 kernel: LNet: 73616:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 10.21.10.113@o2ib: 2 seconds
Aug  3 13:44:59 node578 kernel: LNet: 73616:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 1 previous similar message
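
The "Timed out tx" lines in both logs report very different ages: some are a few seconds, others are around 4.48 million seconds (roughly 52 days), which stands out given the logs span only minutes. The following is a minimal sketch for pulling those values out of a log and flagging the large ones. The function names and the 3600-second threshold are assumptions for illustration, not part of Lustre or LNet. The pattern accepts both `@` and the archive's ` at ` spelling of a NID.

```python
import re

# Matches the o2iblnd message "Timed out tx for <NID>: <N> seconds".
# Mailing-list archives may rewrite "@" as " at ", so accept both.
TX_TIMEOUT = re.compile(
    r"Timed out tx for (?P<addr>[\d.]+)(?:@| at )(?P<net>\w+): (?P<secs>\d+) seconds"
)


def tx_timeouts(lines):
    """Yield (nid, seconds) for every 'Timed out tx' line."""
    for line in lines:
        m = TX_TIMEOUT.search(line)
        if m:
            yield f"{m['addr']}@{m['net']}", int(m["secs"])


def suspicious(timeouts, limit=3600):
    """Return entries whose reported age exceeds `limit` seconds.

    The default limit is an arbitrary, hypothetical threshold.
    """
    return [(nid, secs) for nid, secs in timeouts if secs > limit]


# Two lines copied from the attached logs.
sample = [
    "Aug  3 13:26:53 node578 kernel: LNet: 73616:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 10.21.10.111 at o2ib: 4478161 seconds",
    "Aug  3 13:43:19 node578 kernel: LNet: 73616:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 10.21.10.121 at o2ib: 5 seconds",
]
found = list(tx_timeouts(sample))
flagged = suspicious(found)
print(flagged)
```

To scan a full log instead of the sample, pass an open file object to `tx_timeouts`, since it only needs an iterable of lines.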


