[lustre-discuss] open() against files on lustre hangs

Jan j4nd3r53n at gmail.com
Fri Feb 23 02:51:16 PST 2024


Hi Thomas,

Thank you for your suggestions and explanation. Yes, I found what the problem was: the firewall on the OSS. If only all problems turned out to be that easy :-)
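
For anyone who hits the same symptom: LNet over TCP listens on port 988 (you can see it in the client dmesg quoted below), so the OSS has to accept inbound 988/tcp from the clients and the MDS. As a rough illustration only - adjust for whichever firewall your OSS actually runs - on a firewalld-based system that would be something like:

oss1# firewall-cmd --permanent --add-port=988/tcp
oss1# firewall-cmd --reload

After opening the port, 'lctl ping' from the client to the OSS NID (as Thomas suggests below) is a quick way to confirm the path is open again.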

/jan

On 22/02/2024 15:20, Bertschinger, Thomas Andrew Hjorth wrote:
> Hello Jan,
> 
> More often than not, when I see stat() syscalls hanging, it's due to a communication issue with an OSS rather than an MDS. I think the message about "Lustre: comind-MDT0000: haven't heard from client ..." may be a downstream effect of the client hanging (perhaps due to an OSS issue), which causes the client to stop responding to the MDS, rather than the root cause.
> 
> (This is just conjecture, but in my experience the symptoms you describe are generally caused by an OSS issue.)
> 
> Here are some methods I commonly use to check that a client can communicate with each server:
> 
> client $ lfs df
> (should return a line for each server)
> 
> # get the server NID with "lctl list_nids" on the server side, and then for each server, do:
> client $ lctl ping $SERVER_NID
> 
> client $ lctl get_param osc.*.state | grep -B1 current
> (normal states include FULL, IDLE, but it shouldn't say DISCONN or CONNECTING ...)
> 
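> For reference, on a healthy client that last command prints something roughly like this for each OST (the osc device name here is just illustrative):
> 
> osc.comind-OST0000-osc-ffff9c1234567800.state=
> current_state: FULL
> 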
> Do those commands reveal any communication issues between the client and any of the servers?
> 
> - Thomas Bertschinger
> 
> 
> ________________________________________
> From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of Jan Andersen via lustre-discuss <lustre-discuss at lists.lustre.org>
> Sent: Thursday, February 22, 2024 1:42 AM
> To: lustre-discuss
> Subject: [EXTERNAL] [lustre-discuss] open() against files on lustre hangs
> 
> I have the beginnings of a Lustre filesystem, with a server, mds,
> hosting the MGS and MDS, and a storage node, oss1. The disks (/mgt and
> /mdt on mds, /ost on oss1) mount apparently without error.
> 
> I have set up a client, pxe, which mounts /lustre:
> 
> root at node080027eb24b8:~# mount -t lustre mds at tcp:/comind /lustre
> 
> This appears to be successful - from dmesg:
> 
> ...
> [Wed Feb 21 10:54:59 2024] libcfs: loading out-of-tree module taints kernel.
> [Wed Feb 21 10:54:59 2024] libcfs: module verification failed: signature
> and/or required key missing - tainting kernel
> [Wed Feb 21 10:54:59 2024] LNet: HW NUMA nodes: 1, HW CPU cores: 1,
> npartitions: 1
> [Wed Feb 21 10:54:59 2024] alg: No test for adler32 (adler32-zlib)
> [Wed Feb 21 10:55:00 2024] Key type ._llcrypt registered
> [Wed Feb 21 10:55:00 2024] Key type .llcrypt registered
> [Wed Feb 21 10:55:00 2024] Lustre: Lustre: Build Version: 2.15.4
> [Wed Feb 21 10:55:00 2024] LNet: Added LNI 192.168.50.13 at tcp [8/256/0/180]
> [Wed Feb 21 10:55:00 2024] LNet: Accept secure, port 988
> [Wed Feb 21 10:55:02 2024] Lustre: Mounted comind-client
> 
> I have, after several attempts, managed to create a file (or at least a
> directory entry):
> 
> root at node080027eb24b8:~# ls /lustre
> test
> 
> However, anything that tries to open anything in /lustre - e.g., 'ls -l' -
> just hangs indefinitely, which I suspect is because it is waiting for
> some sort of response on a network socket. An strace shows:
> 
> root at node080027eb24b8:~# strace -f /usr/bin/cat /lustre/test
> ...
> fstat(3, {st_mode=S_IFREG|0644, st_size=346132, ...}) = 0
> mmap(NULL, 346132, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fb3d0994000
> close(3)                                = 0
> fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0), ...}) = 0
> openat(AT_FDCWD, "/lustre/test", O_RDONLY) = 3
> fstat(3,
> 
> I see no change in dmesg on pxe and oss1, but this on mds:
> 
> ...
> [Wed Feb 21 10:50:06 2024] LDISKFS-fs (sdb1): mounted filesystem with
> ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
> [Wed Feb 21 10:50:44 2024] LDISKFS-fs (sda): mounted filesystem with
> ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
> [Wed Feb 21 10:50:44 2024] Lustre: comind-MDT0000: Imperative Recovery
> not enabled, recovery window 300-900
> [Wed Feb 21 10:51:15 2024] Lustre: comind-OST0000-osc-MDT0000:
> Connection restored to  (at 192.168.50.130 at tcp)
> [Wed Feb 21 10:57:04 2024] Lustre: comind-MDT0000: haven't heard from
> client 83befb6d-7ee2-4acb-997c-b15520dcb70d (at 192.168.50.13 at tcp) in
> 240 seconds. I think it's dead, and I am evicting it. exp
> 00000000ddc96899, cur 1708513026 expire 1708512876 last 1708512786
> 
> 
> So, something isn't right somewhere in the communication from pxe to mds
> - but what?

