[lustre-discuss] slow mount of lustre CentOS6 clients to 2.9 servers

Grigory Shamov Grigory.Shamov at umanitoba.ca
Wed May 10 09:25:48 PDT 2017


Hi Doug,


The NIDs are modified, yes. The logs had our real MDS's IP.

Prior to hitting this problem we had tried to build the Lustre client with
the older MLNX 3.2 and the elrepo kernel-lt. It proved impossible to build
the Lustre 2.9 client against the elrepo kernel-lt, so we switched the
kernel back to CentOS's 2.6.32 but upgraded MLNX to 3.4.1.

The Lustre 2.8.0 client did build back then, and that combination
(elrepo kernel-lt, MLNX 3.2 and Lustre 2.8) mounted without failures.

Could the mount problem somehow be related to the newer Mellanox drivers?
In a recent thread you mentioned this patch:
https://review.whamcloud.com/#/c/24306/ ? (But our hardware is old, so the
mlx4 driver gets loaded, not mlx5.)
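In case it helps, here is the kind of LNet-level check we can run before mounting. This is just a sketch; the NID below is the obfuscated one from the logs, and `lctl` is assumed to be installed with the client modules loaded:

```shell
# Hypothetical MGS NID, obfuscated as in the logs above -- substitute the real one.
MGS_NID="192.168.xxx.yyy@o2ib"

if command -v lctl >/dev/null 2>&1; then
    # Show the NIDs this client has brought up on LNet.
    lctl list_nids
    # Ping the MGS over LNet; a failure here points at the
    # fabric/driver layer rather than Lustre itself.
    lctl ping "$MGS_NID"
else
    echo "lctl not available on this node"
fi
```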

-- 
Grigory Shamov

Westgrid/ComputeCanada Site Lead
University of Manitoba
E2-588 EITC Building,
(204) 474-9625



From: "Oucharek, Doug S" <doug.s.oucharek at intel.com>
Date: Friday, May 5, 2017 at 1:09 PM
To: Grigory Shamov <Grigory.Shamov at umanitoba.ca>
Cc: "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
Subject: Re: [lustre-discuss] slow mount of lustre CentOS6 clients to 2.9
servers


Are the NIDs "192.168.xxx.yyy at o2ib" really configured that way or did you
modify those logs when pasting them to email?

Doug


On May 5, 2017, at 11:02 AM, Grigory Shamov <Grigory.Shamov at umanitoba.ca>
wrote:

Hi All,

We were installing a new Lustre storage.
To that end, we built new clients with the following configuration:

CentOS 6.8, kernel 2.6.32-642.el6.x86_64
Mellanox OFED 3.4.1.0 (on QDR fabric)

and either lustre-2.8.0 or lustre-2.9.0 clients, which we rebuilt from
source. The new server is Lustre 2.9 on CentOS 7.3.

Now, the clients we built have a problem mounting the filesystem. It
takes a long time and/or fails initially, with messages like the following
(from the 2.8 client):

mounting device 192.168.xxx.yyy at o2ib:/lustre at /lustrenew, flags=0x400
options=flock,device=192.168.xxx.yyy at o2ib:/lustre
mount.lustre: mount 192.168.xxx.yyy at o2ib:/lustre at /lustrenew failed:
Input/output error retries left: 0
mount.lustre: mount 192.168.xxx.yyy at o2ib:/lustre at /lustrenew failed:
Input/output error
Is the MGS running?

In dmesg:

LNet: HW CPU cores: 24, npartitions: 4
alg: No test for adler32 (adler32-zlib)
alg: No test for crc32 (crc32-table)
alg: No test for crc32 (crc32-pclmul)
Lustre: Lustre: Build Version: 2.8.0-RC5--PRISTINE-2.6.32-642.el6.x86_64
LNet: Added LNI 192.168.aaa.bbb at o2ib [8/256/0/180]
Lustre: 3476:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request
sent has timed out for sent delay: [sent 1493927511/real 0]
req at ffff88061a1aac80 x1566496533774340/t0(0)
o250->MGC192.168.xxx.yyy at o2ib@192.168.xxx.yyy at o2ib:26/25
 lens 520/544 e 0 to 1 dl 1493927516 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
LustreError: 15c-8: MGC192.168.xxx.yyy at o2ib: The configuration from log
'lustre-client' failed (-5). This may be the result of communication
errors between this node and the MGS, a bad configuration, or
 other errors. See the syslog for more information.
Lustre: Unmounted lustre-client


The initial mount thus fails; a retry then succeeds, but the OSTs take a
long time to become active:

UUID                  1K-blocks        Used  Available Use% Mounted on
lustre-MDT0000_UUID  1156701708      751100  1077936556  0%
/lustrenew[MDT:0]
OST0000            : inactive device
OST0001            : inactive device
OST0002            : inactive device
OST0003            : inactive device
OST0004            : inactive device
OST0005            : inactive device
OST0006            : inactive device
OST0007            : inactive device

filesystem summary:            0          0          0  0% /lustrenew

Then, after some 10 minutes, the mount completes and, performance-wise,
Lustre seems normal.
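As a workaround while debugging, we have been retrying by hand; mount.lustre also accepts a retry count so it retries internally instead of failing on the first I/O error. A sketch, reusing the (obfuscated) device and mount point from the logs:

```shell
# Device and mount point as they appear (obfuscated) in the logs above.
DEV="192.168.xxx.yyy@o2ib:/lustre"
MNT="/lustrenew"

if command -v mount.lustre >/dev/null 2>&1; then
    # retry=8: let mount.lustre retry the mount up to 8 times itself,
    # rather than returning EIO with "retries left: 0" immediately.
    mount -t lustre -o flock,retry=8 "$DEV" "$MNT"
else
    echo "mount.lustre not installed on this node"
fi
```

This only papers over the delay, of course; it does not explain why the first attempts time out.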

The same dmesg output from the 2.9 client:

LNet: HW CPU cores: 24, npartitions: 2
alg: No test for adler32 (adler32-zlib)
alg: No test for crc32 (crc32-table)
alg: No test for crc32 (crc32-pclmul)
Lustre: Lustre: Build Version: 2.9.0
LNet: Added LNI 192.168.aaa.bbb at o2ib [8/256/0/180]
Lustre: 3468:0:(client.c:2111:ptlrpc_expire_one_request()) @@@ Request
sent has timed out for sent delay: [sent 1493929145/real 0]
req at ffff880631d07c80 x1566498247147536/t0(0)
o250->MGC192.168.xxx.yyy at o2ib@192.168.xxx.yyy at o2ib:26/25
 lens 520/544 e 0 to 1 dl 1493929150 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
LustreError: 15c-8: MGC192.168.xxx.yyy at o2ib: The configuration from log
'lustre-client' failed (-5). This may be the result of communication
errors between this node and the MGS, a bad configuration, or
 other errors. See the syslog for more information.
Lustre: Unmounted lustre-client
LustreError: 3413:0:(obd_mount.c:1449:lustre_fill_super()) Unable to mount
 (-5)

I am at a loss as to what would cause such behavior. Could anyone advise
where to look for the cause of this problem? Thank you very much in
advance!

--
Grigory Shamov
HPC Site Lead,
University of Manitoba


_______________________________________________
lustre-discuss mailing list
lustre-discuss at lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org







