[lustre-discuss] open() against files on lustre hangs
Jan Andersen
j4nd3r53n at gmail.com
Thu Feb 22 00:42:46 PST 2024
I have the beginnings of a Lustre filesystem: a server, mds, hosting
the MGS and MDS, and a storage node, oss1. The disks (/mgt and /mdt on
mds, /ost on oss1) mount, apparently without error.
I have set up a client, pxe, which mounts /lustre:
root@node080027eb24b8:~# mount -t lustre mds@tcp:/comind /lustre
This appears to be successful - from dmesg:
...
[Wed Feb 21 10:54:59 2024] libcfs: loading out-of-tree module taints kernel.
[Wed Feb 21 10:54:59 2024] libcfs: module verification failed: signature and/or required key missing - tainting kernel
[Wed Feb 21 10:54:59 2024] LNet: HW NUMA nodes: 1, HW CPU cores: 1, npartitions: 1
[Wed Feb 21 10:54:59 2024] alg: No test for adler32 (adler32-zlib)
[Wed Feb 21 10:55:00 2024] Key type ._llcrypt registered
[Wed Feb 21 10:55:00 2024] Key type .llcrypt registered
[Wed Feb 21 10:55:00 2024] Lustre: Lustre: Build Version: 2.15.4
[Wed Feb 21 10:55:00 2024] LNet: Added LNI 192.168.50.13@tcp [8/256/0/180]
[Wed Feb 21 10:55:00 2024] LNet: Accept secure, port 988
[Wed Feb 21 10:55:02 2024] Lustre: Mounted comind-client
After several attempts, I have managed to create a file (or at least a
directory entry):
root@node080027eb24b8:~# ls /lustre
test
However, anything that tries to open a file in /lustre (e.g. 'ls -l')
just hangs indefinitely; I suspect it is waiting for some sort of
response on a network socket. An strace shows:
root@node080027eb24b8:~# strace -f /usr/bin/cat /lustre/test
...
fstat(3, {st_mode=S_IFREG|0644, st_size=346132, ...}) = 0
mmap(NULL, 346132, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fb3d0994000
close(3) = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0), ...}) = 0
openat(AT_FDCWD, "/lustre/test", O_RDONLY) = 3
fstat(3,
I see no change in dmesg on pxe and oss1, but this on mds:
...
[Wed Feb 21 10:50:06 2024] LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[Wed Feb 21 10:50:44 2024] LDISKFS-fs (sda): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[Wed Feb 21 10:50:44 2024] Lustre: comind-MDT0000: Imperative Recovery not enabled, recovery window 300-900
[Wed Feb 21 10:51:15 2024] Lustre: comind-OST0000-osc-MDT0000: Connection restored to (at 192.168.50.130@tcp)
[Wed Feb 21 10:57:04 2024] Lustre: comind-MDT0000: haven't heard from client 83befb6d-7ee2-4acb-997c-b15520dcb70d (at 192.168.50.13@tcp) in 240 seconds. I think it's dead, and I am evicting it. exp 00000000ddc96899, cur 1708513026 expire 1708512876 last 1708512786
So, something isn't right somewhere in the communication from pxe to mds
- but what?
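
The eviction message on mds carries three Unix timestamps (cur, expire,
last). As a rough aid to reading it, the sketch below converts them to
UTC and computes the gaps between them. The interpretation of the fields
(cur = time of eviction, expire = deadline by which the client should
have been heard from, last = last time the server heard from it) is an
assumption based on the field names and on the "240 seconds" in the
message; it is not taken from the Lustre documentation. The helper name
utc() is only for illustration.

```python
from datetime import datetime, timezone

# Values copied from the eviction line in the mds dmesg output.
cur, expire, last = 1708513026, 1708512876, 1708512786


def utc(ts: int) -> str:
    """Format a Unix timestamp as a UTC date and time."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


# Assumed meaning: how long the server had heard nothing from the client.
silence = cur - last
# Assumed meaning: how far past the deadline the eviction happened.
overdue = cur - expire

print(f"last heard from client: {utc(last)} UTC")
print(f"expiry deadline:        {utc(expire)} UTC")
print(f"eviction logged:        {utc(cur)} UTC")
print(f"silence before eviction: {silence} s, past deadline: {overdue} s")
```

The difference cur - last comes out at 240, which matches the "240
seconds" in the message. If the clocks on mds and pxe agree, the "last"
time falls before the "Mounted comind-client" line in the pxe dmesg, so
it is possible that the evicted client entry belongs to one of the
earlier mount attempts rather than to the current mount; the logs shown
here are not enough to confirm that.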