[lustre-discuss] Struggling with OSS mounts after a crash

Edmondson, Edward e.edmondson at ucl.ac.uk
Fri Jan 20 12:25:45 PST 2023


As mentioned in the original email, I did follow the correct writeconf procedure as per the manual, but it didn't work.

Anyway, I've managed to fix it, to some extent at least.

LNet on the OSSes only had the o2ib interface configured; it had no tcp interface on one of the subnets. I added the tcp one and it started working.

It's a bit mystifying as all the parameters are set up using the o2ib subnet addresses and interfaces. I assume this means the traffic is all over ib, but if there's a good way to confirm that I'd welcome hearing it!
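
One possible way to check which network carries the traffic is to compare the per-interface message counters that `lnetctl net show -v` reports, before and after a burst of I/O. The sketch below only parses that output. The sample text is illustrative: its layout is an assumption about what the command prints on this release, the o2ib NID is the one that appears in the MGS log, and the tcp NID and all counter values are made up. The helper names `ni_traffic` and `traffic_delta` are hypothetical.

```python
import re

# Illustrative text only: the layout follows what `lnetctl net show -v`
# is assumed to print; the tcp NID and all counter values are made up.
SAMPLE = """\
net:
    - net type: o2ib
      local NI(s):
        - nid: 10.3.255.199@o2ib
          status: up
          statistics:
              send_count: 1200
              recv_count: 1180
              drop_count: 0
    - net type: tcp
      local NI(s):
        - nid: 192.0.2.10@tcp
          status: up
          statistics:
              send_count: 3
              recv_count: 3
              drop_count: 0
"""

NID_RE = re.compile(r"^\s*-\s*nid:\s*(\S+)")
COUNT_RE = re.compile(r"^\s*(send_count|recv_count):\s*(\d+)")


def ni_traffic(show_output):
    """Return {nid: {"send_count": n, "recv_count": n}} for each local NI."""
    counts = {}
    nid = None
    for line in show_output.splitlines():
        m = NID_RE.match(line)
        if m:
            # A new "- nid:" entry starts; later counters belong to it.
            nid = m.group(1)
            counts[nid] = {}
            continue
        m = COUNT_RE.match(line)
        if m and nid is not None:
            counts[nid][m.group(1)] = int(m.group(2))
    return counts


def traffic_delta(before, after):
    """Messages sent plus received per NID between two snapshots."""
    return {
        nid: sum(after[nid].get(k, 0) - before.get(nid, {}).get(k, 0)
                 for k in ("send_count", "recv_count"))
        for nid in after
    }


for nid, counters in ni_traffic(SAMPLE).items():
    print(nid, counters)
```

In practice the text would come from running `lnetctl net show -v` on the OSS twice, once before and once after some client I/O, and passing both parsed snapshots to `traffic_delta`. If the o2ib NID's counters grow while the tcp NID's stay nearly flat, that suggests the bulk traffic is on ib. On a client, `lctl get_param osc.*.ost_conn_uuid` should also show which NID each OST connection is using.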

--
Dr Edd Edmondson
HPC Systems Manager
Dept of Physics and Astronomy
University College London

(he/him) During remote working email is the best way to contact me. If needed I am available by phone on 0203 108 1399, by Microsoft Teams, or other methods by arrangement.
On 18 Jan 2023 at 10:50 +0000, Edmondson, Edward via lustre-discuss <lustre-discuss at lists.lustre.org>, wrote:

Hi all,

I'm struggling to get my OSS mounts online after a less than clean shutdown. I'm on Lustre 2.12.9. Plenty of googling hasn't brought up anything that seems specific to the problem I'm having, unfortunately.

LNet seems to be up, pings work both ways, and the nodes are clearly communicating, judging by the logs. I've been through the log reconfiguration process with --writeconf on everything, step by step as in the manual.

On the OSS node when I try to mount:
mount.lustre: mount /dev/mapper/lustre-oss0 at /mnt/oss0 failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)

In logs:
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 31015:0:(ldlm_lib.c:494:client_obd_setup()) can't add initial connection
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 31015:0:(lwp_dev.c:125:lwp_setup()) lustre-MDT0000-lwp-OST0000: client obd setup error: rc = -2
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 31015:0:(lwp_dev.c:273:lwp_init0()) lustre-MDT0000-lwp-OST0000: setup lwp failed. -2
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 31015:0:(obd_config.c:559:class_setup()) setup lustre-MDT0000-lwp-OST0000 failed (-2)
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 31015:0:(obd_mount.c:202:lustre_start_simple()) lustre-MDT0000-lwp-OST0000 setup error -2
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 31015:0:(obd_mount_server.c:671:lustre_lwp_setup()) lustre-MDT0000-lwp-OST0000: setup up failed: rc -2
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 15c-8: MGC10.3.255.200@o2ib: The configuration from log 'lustre-client' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 30961:0:(obd_mount_server.c:1414:server_start_targets()) lustre-OST0000: failed to start LWP: -2
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 30961:0:(obd_mount_server.c:1992:server_fill_super()) Unable to start targets: -2
Jan 18 10:27:56 nas-0-4 kernel: Lustre: Failing over lustre-OST0000
Jan 18 10:27:57 nas-0-4 kernel: LustreError: 30961:0:(ldlm_lockd.c:3203:ldlm_cleanup()) ldlm still has namespaces; clean these up first.
Jan 18 10:27:57 nas-0-4 kernel: LustreError: 30961:0:(ldlm_lockd.c:2862:ldlm_put_ref()) ldlm_cleanup failed: -16
Jan 18 10:27:57 nas-0-4 kernel: Lustre: server umount lustre-OST0000 complete
Jan 18 10:27:57 nas-0-4 kernel: LustreError: 30961:0:(obd_mount.c:1604:lustre_fill_super()) Unable to mount (-2)
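
The rc values in the log above are negated Linux errno codes, which is why the -2 from the kernel surfaces as "No such file or directory" in the mount.lustre message. A small sketch for decoding them with the Python standard library follows; the helper name `describe_rc` is hypothetical.

```python
import errno
import os


def describe_rc(rc):
    """Turn a kernel-style negative return code into its errno name and text."""
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{rc}: {name} ({os.strerror(code)})"


# The two codes that appear in the OSS log: -2 (ENOENT) and -16 (EBUSY).
for rc in (-2, -16):
    print(describe_rc(rc))
```

The -2 (ENOENT) is the error that propagates up to the failed mount. The -16 (EBUSY) from ldlm_cleanup is raised during the teardown that follows the failure.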

On the MGS/MDT node (which has now mounted the MGS and MDT fine):
Jan 18 10:27:56 nas-0-3 kernel: Lustre: MGS: Connection restored to 24758df3-a11a-f5db-18a5-2e0e35f2099d (at 10.3.255.199@o2ib)
Jan 18 10:27:56 nas-0-3 kernel: Lustre: MGS: Regenerating lustre-OST0000 log by user request: rc = 0
Jan 18 10:27:56 nas-0-3 kernel: Lustre: Found index 0 for lustre-OST0000, updating log
Jan 18 10:27:56 nas-0-3 kernel: Lustre: Client log for lustre-OST0000 was not updated; writeconf the MDT first to regenerate it.

The MDT has absolutely been writeconfed so that last message isn't terribly helpful. fscks are clean, so there's not a problem there.

Any advice hugely appreciated!
