[lustre-discuss] [EXTERNAL] Re: Struggling with OSS mounts after a crash

Hanafi, Mahmoud (ARC-TN)[InuTeq, LLC] mahmoud.hanafi at nasa.gov
Fri Jan 20 09:41:38 PST 2023


Did you run writeconf on all targets and then try to mount them?

You should also dump the debug logs; they may provide additional info:
lctl dk /tmp/ldebug.out
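
If the default mask doesn't capture enough, you can widen it before reproducing the mount attempt. A rough sketch (the debug flags and output path are just examples):

lctl set_param debug=+config+info                    # enable extra debug flags
mount -t lustre /dev/mapper/lustre-oss0 /mnt/oss0    # reproduce the failure
lctl dk /tmp/ldebug.out                              # dump the kernel debug buffer to a file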


-Mahmoud Hanafi



From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of Andreas Dilger via lustre-discuss <lustre-discuss at lists.lustre.org>
Reply-To: Andreas Dilger <adilger at whamcloud.com>
Date: Friday, January 20, 2023 at 7:59 AM
To: "Edmondson, Edward" <e.edmondson at ucl.ac.uk>
Cc: "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
Subject: [EXTERNAL] Re: [lustre-discuss] Struggling with OSS mounts after a crash

You need to run writeconf on all targets at the same time, and then mount them in a specific order. That is documented in the Lustre Operations Manual.
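
A rough sketch of the sequence from the manual (the device paths are placeholders for your actual targets; check the manual for the exact steps):

# with all clients and all targets unmounted:
tunefs.lustre --writeconf /dev/<mgt_device>    # MGT first
tunefs.lustre --writeconf /dev/<mdt_device>    # then every MDT
tunefs.lustre --writeconf /dev/<ost_device>    # then every OST
# then mount in order: MGT, then MDT(s), then OSTs, then clients
mount -t lustre /dev/<mgt_device> /mnt/mgt
mount -t lustre /dev/<mdt_device> /mnt/mdt
mount -t lustre /dev/<ost_device> /mnt/ost0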
Cheers, Andreas


On Jan 18, 2023, at 03:49, Edmondson, Edward via lustre-discuss <lustre-discuss at lists.lustre.org> wrote:
Hi all,

I'm struggling to get my OSS mounts online after a less-than-clean shutdown. I'm on Lustre 2.12.9. Plenty of googling etc. doesn't bring up anything that seems specific to the problem I'm having, unfortunately.

LNet seems to be up: pings are OK both ways, and communications clearly happen between the nodes judging by the logs. I've been through the log reconfiguration process with --writeconf on everything, step by step as in the manual.
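
(For reference, the pings were plain LNet pings in both directions, something like:

lctl ping 10.3.255.200@o2ib    # from the OSS to the MGS NID seen in the logs below

and the equivalent from the MGS node back to the OSS.)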

On the OSS node when I try to mount:
mount.lustre: mount /dev/mapper/lustre-oss0 at /mnt/oss0 failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)
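
(For checking those hints without mounting, the on-disk target configuration can be inspected read-only; a sketch, assuming an ldiskfs target:

tunefs.lustre --dryrun /dev/mapper/lustre-oss0    # prints fsname, index, mgsnode without changing anything

to confirm the fsname and mgsnode match what the MGS expects.)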


In the logs:
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 31015:0:(ldlm_lib.c:494:client_obd_setup()) can't add initial connection
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 31015:0:(lwp_dev.c:125:lwp_setup()) lustre-MDT0000-lwp-OST0000: client obd setup error: rc = -2
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 31015:0:(lwp_dev.c:273:lwp_init0()) lustre-MDT0000-lwp-OST0000: setup lwp failed. -2
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 31015:0:(obd_config.c:559:class_setup()) setup lustre-MDT0000-lwp-OST0000 failed (-2)
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 31015:0:(obd_mount.c:202:lustre_start_simple()) lustre-MDT0000-lwp-OST0000 setup error -2
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 31015:0:(obd_mount_server.c:671:lustre_lwp_setup()) lustre-MDT0000-lwp-OST0000: setup up failed: rc -2
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 15c-8: MGC10.3.255.200 at o2ib: The configuration from log 'lustre-client' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 30961:0:(obd_mount_server.c:1414:server_start_targets()) lustre-OST0000: failed to start LWP: -2
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 30961:0:(obd_mount_server.c:1992:server_fill_super()) Unable to start targets: -2
Jan 18 10:27:56 nas-0-4 kernel: Lustre: Failing over lustre-OST0000
Jan 18 10:27:57 nas-0-4 kernel: LustreError: 30961:0:(ldlm_lockd.c:3203:ldlm_cleanup()) ldlm still has namespaces; clean these up first.
Jan 18 10:27:57 nas-0-4 kernel: LustreError: 30961:0:(ldlm_lockd.c:2862:ldlm_put_ref()) ldlm_cleanup failed: -16
Jan 18 10:27:57 nas-0-4 kernel: Lustre: server umount lustre-OST0000 complete
Jan 18 10:27:57 nas-0-4 kernel: LustreError: 30961:0:(obd_mount.c:1604:lustre_fill_super()) Unable to mount (-2)

On the MGS/MDT node (which has now mounted the MGS and MDT fine):
Jan 18 10:27:56 nas-0-3 kernel: Lustre: MGS: Connection restored to 24758df3-a11a-f5db-18a5-2e0e35f2099d (at 10.3.255.199 at o2ib)
Jan 18 10:27:56 nas-0-3 kernel: Lustre: MGS: Regenerating lustre-OST0000 log by user request: rc = 0
Jan 18 10:27:56 nas-0-3 kernel: Lustre: Found index 0 for lustre-OST0000, updating log
Jan 18 10:27:56 nas-0-3 kernel: Lustre: Client log for lustre-OST0000 was not updated; writeconf the MDT first to regenerate it.

The MDT has absolutely been writeconf'ed, so that last message isn't terribly helpful. fscks are clean, so there's no problem there.
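
(Side note: rc = -2 in the errors above is -ENOENT, which suggests the MGS can't find or serve the 'lustre-client' configuration log that the OST needs at startup. The logs held by the MGS can be inspected on the MGS node; a sketch, assuming lctl's llog commands are available in 2.12:

lctl --device MGS llog_catlist                # list the configuration logs the MGS holds
lctl --device MGS llog_print lustre-client    # dump the client log, if it exists
)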

Any advice hugely appreciated!

--
Dr Edd Edmondson
HPC Systems Manager
Dept of Physics and Astronomy
University College London

(he/him) While working remotely, email is the best way to contact me. If needed, I am available by phone on 0203 108 1399, by Microsoft Teams, or by other methods by arrangement.
_______________________________________________
lustre-discuss mailing list
lustre-discuss at lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

