[lustre-discuss] open() against files on lustre hangs

Jan Andersen j4nd3r53n at gmail.com
Thu Feb 22 00:42:46 PST 2024


I have the beginnings of a Lustre filesystem: a server, mds, hosting the 
MGS and MDS, and a storage node, oss1. The disks (/mgt and /mdt on mds, 
/ost on oss1) all mount, apparently without error.
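
For reference, the targets were formatted and mounted roughly along these 
lines (reconstructed from memory, so the exact devices and options may 
differ; I'm guessing sdb1 is the MGT and sda the MDT from the dmesg order 
further down, and the OST device below is just a placeholder):

# on mds
mkfs.lustre --mgs /dev/sdb1
mkfs.lustre --mdt --fsname=comind --index=0 --mgsnode=mds@tcp /dev/sda
mount -t lustre /dev/sdb1 /mgt
mount -t lustre /dev/sda /mdt

# on oss1
mkfs.lustre --ost --fsname=comind --index=0 --mgsnode=mds@tcp /dev/sdX
mount -t lustre /dev/sdX /ost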

I have set up a client, pxe, which mounts /lustre:

root@node080027eb24b8:~# mount -t lustre mds@tcp:/comind /lustre

This appears to be successful - from dmesg:

...
[Wed Feb 21 10:54:59 2024] libcfs: loading out-of-tree module taints kernel.
[Wed Feb 21 10:54:59 2024] libcfs: module verification failed: signature and/or required key missing - tainting kernel
[Wed Feb 21 10:54:59 2024] LNet: HW NUMA nodes: 1, HW CPU cores: 1, npartitions: 1
[Wed Feb 21 10:54:59 2024] alg: No test for adler32 (adler32-zlib)
[Wed Feb 21 10:55:00 2024] Key type ._llcrypt registered
[Wed Feb 21 10:55:00 2024] Key type .llcrypt registered
[Wed Feb 21 10:55:00 2024] Lustre: Lustre: Build Version: 2.15.4
[Wed Feb 21 10:55:00 2024] LNet: Added LNI 192.168.50.13@tcp [8/256/0/180]
[Wed Feb 21 10:55:00 2024] LNet: Accept secure, port 988
[Wed Feb 21 10:55:02 2024] Lustre: Mounted comind-client
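
For what it's worth, LNet reachability can be checked from both ends with 
lctl; something like this (NIDs taken from the logs here, the mds NID is a 
placeholder since I only know it by hostname, and I'd ping in both 
directions in case the problem is one-way):

# on the client (pxe)
lctl list_nids
lctl ping 192.168.50.130@tcp     # oss1
lctl ping <mds-nid>              # whatever 'lctl list_nids' reports on mds

# on mds and oss1
lctl ping 192.168.50.13@tcp      # the client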

I have, after several attempts, managed to create a file (or at least a 
directory entry):

root@node080027eb24b8:~# ls /lustre
test

However, anything that tries to open a file in /lustre (e.g. 'ls -l') 
just hangs indefinitely, which I suspect means it is waiting for some 
sort of response on a network socket. An strace shows:

root@node080027eb24b8:~# strace -f /usr/bin/cat /lustre/test
...
fstat(3, {st_mode=S_IFREG|0644, st_size=346132, ...}) = 0
mmap(NULL, 346132, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fb3d0994000
close(3)                                = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0), ...}) = 0
openat(AT_FDCWD, "/lustre/test", O_RDONLY) = 3
fstat(3,
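
If it would help, I can capture an RPC-level debug trace around one of 
these hangs; as I understand it from the manual, something like this 
should do it on the client (the debug flag and output file name are my 
own choices):

lctl set_param debug=+rpctrace
cat /lustre/test &               # reproduce the hang in the background
lctl dk /tmp/lustre-debug.log    # dump the kernel debug buffer to a file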

I see no change in dmesg on either pxe or oss1, but this appears on mds:

...
[Wed Feb 21 10:50:06 2024] LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[Wed Feb 21 10:50:44 2024] LDISKFS-fs (sda): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[Wed Feb 21 10:50:44 2024] Lustre: comind-MDT0000: Imperative Recovery not enabled, recovery window 300-900
[Wed Feb 21 10:51:15 2024] Lustre: comind-OST0000-osc-MDT0000: Connection restored to  (at 192.168.50.130@tcp)
[Wed Feb 21 10:57:04 2024] Lustre: comind-MDT0000: haven't heard from client 83befb6d-7ee2-4acb-997c-b15520dcb70d (at 192.168.50.13@tcp) in 240 seconds. I think it's dead, and I am evicting it. exp 00000000ddc96899, cur 1708513026 expire 1708512876 last 1708512786
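
On the client I can also dump the import state for the MDT and OST 
connections, which should show whether they are actually established 
(parameter and command names as I understand them for 2.15 - I haven't 
captured this output yet):

lctl get_param mdc.*.import osc.*.import
lfs check servers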


So, something isn't right somewhere in the communication from pxe to mds 
- but what?



