<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title></title>
</head>
<body>
<div name="messageBodySection">
<div dir="auto">As mentioned in the original email, I did follow the writeconf procedure from the manual, but it didn't work.<br>
<br>
Anyway I've managed to fix it to some extent at least.<br>
<br>
lnet didn't have a tcp interface on one subnet on the OSSes, only the o2ib interface. I added the tcp one and it's started working.<br>
<br>
It's a bit mystifying as all the parameters are set up using the o2ib subnet addresses and interfaces. I assume this means the traffic is all over ib, but if there's a good way to confirm that I'd welcome hearing it!</div>
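<div dir="auto"><br>
One possible way to check, offered as a sketch rather than a definitive answer: lctl list_nids on a server lists its local NIDs, and on a client lctl get_param osc.*.ost_conn_uuid reports the NID each OSC is currently connected to. The small helper below only groups such NID strings by LNet network, so it is easy to see whether anything is on tcp. The function name and the tcp sample address are made up for illustration; only 10.3.255.199@o2ib comes from the logs in this thread.</div>
<pre>
```python
from collections import defaultdict


def group_nids_by_net(nids):
    """Group LNet NID strings such as '10.3.255.199@o2ib' by network name."""
    groups = defaultdict(list)
    for nid in nids:
        nid = nid.strip()
        if not nid:
            continue  # skip blank lines from command output
        addr, sep, net = nid.rpartition("@")
        if not sep:
            # No '@' found, so this is not a well-formed NID
            groups["unknown"].append(nid)
            continue
        groups[net].append(addr)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical sample: the tcp address is invented for the example
    sample = ["10.3.255.199@o2ib", "192.168.1.4@tcp"]
    for net, addrs in group_nids_by_net(sample).items():
        print(net, addrs)
```
</pre>
<div dir="auto">If every connection NID reported by the clients ends in @o2ib, that would suggest the bulk traffic is going over ib, although it does not by itself explain why the tcp interface was needed.</div>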
</div>
<div name="messageSignatureSection"><br>
<div class="matchFont">
<div dir="auto">-- </div>
<div dir="auto">Dr Edd Edmondson</div>
<div dir="auto">HPC Systems Manager</div>
<div dir="auto">Dept of Physics and Astronomy</div>
<div dir="auto">University College London</div>
<div dir="auto"><br>
</div>
<div dir="auto">(he/him) During remote working email is the best way to contact me. If needed I am available by phone on 0203 108 1399, by Microsoft Teams, or other methods by arrangement.</div>
</div>
</div>
<div name="messageReplySection">On 18 Jan 2023 at 10:50 +0000, Edmondson, Edward via lustre-discuss &lt;lustre-discuss@lists.lustre.org&gt;, wrote:<br>
<blockquote type="cite" style="border-left-color: grey; border-left-width: thin; border-left-style: solid; margin: 5px 5px;padding-left: 10px;">
<br>
<div>
<div class="WordSection1">
<div name="messageBodySection">
<div>
<p class="MsoNormal">Hi all,<br>
<br>
I'm struggling to get my OSS mounts online after a less than clean shutdown. I'm on Lustre 2.12.9. Unfortunately, plenty of googling hasn't turned up anything that matches the problem I'm having.<br>
<br>
lnet seems to be up: pings are OK both ways, and the nodes are clearly communicating, judging by the logs. I've been through the log reconfiguration process with --writeconf on everything, step by step as in the manual.<br>
<br>
On the OSS node when I try to mount:<br>
<span style="font-family:'Courier New'">mount.lustre: mount /dev/mapper/lustre-oss0 at /mnt/oss0 failed: No such file or directory<br>
Is the MGS specification correct?<br>
Is the filesystem name correct?<br>
If upgrading, is the copied client log valid? (see upgrade docs)<br>
<br>
</span></p>
<p class="MsoNormal">In logs:<br>
<span style="font-family:'Courier New'">Jan 18 10:27:56 nas-0-4 kernel: LustreError: 31015:0:(ldlm_lib.c:494:client_obd_setup()) can't add initial connection<br>
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 31015:0:(lwp_dev.c:125:lwp_setup()) lustre-MDT0000-lwp-OST0000: client obd setup error: rc = -2<br>
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 31015:0:(lwp_dev.c:273:lwp_init0()) lustre-MDT0000-lwp-OST0000: setup lwp failed. -2<br>
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 31015:0:(obd_config.c:559:class_setup()) setup lustre-MDT0000-lwp-OST0000 failed (-2)<br>
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 31015:0:(obd_mount.c:202:lustre_start_simple()) lustre-MDT0000-lwp-OST0000 setup error -2<br>
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 31015:0:(obd_mount_server.c:671:lustre_lwp_setup()) lustre-MDT0000-lwp-OST0000: setup up failed: rc -2<br>
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 15c-8: MGC10.3.255.200@o2ib: The configuration from log 'lustre-client' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.<br>
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 30961:0:(obd_mount_server.c:1414:server_start_targets()) lustre-OST0000: failed to start LWP: -2<br>
Jan 18 10:27:56 nas-0-4 kernel: LustreError: 30961:0:(obd_mount_server.c:1992:server_fill_super()) Unable to start targets: -2<br>
Jan 18 10:27:56 nas-0-4 kernel: Lustre: Failing over lustre-OST0000<br>
Jan 18 10:27:57 nas-0-4 kernel: LustreError: 30961:0:(ldlm_lockd.c:3203:ldlm_cleanup()) ldlm still has namespaces; clean these up first.<br>
Jan 18 10:27:57 nas-0-4 kernel: LustreError: 30961:0:(ldlm_lockd.c:2862:ldlm_put_ref()) ldlm_cleanup failed: -16<br>
Jan 18 10:27:57 nas-0-4 kernel: Lustre: server umount lustre-OST0000 complete<br>
Jan 18 10:27:57 nas-0-4 kernel: LustreError: 30961:0:(obd_mount.c:1604:lustre_fill_super()) Unable to mount (-2)<br>
</span><br>
On the MGS/MDT node (which has now mounted the MGS and MDT fine):<br>
<span style="font-family:'Courier New'">Jan 18 10:27:56 nas-0-3 kernel: Lustre: MGS: Connection restored to 24758df3-a11a-f5db-18a5-2e0e35f2099d (at 10.3.255.199@o2ib)<br>
Jan 18 10:27:56 nas-0-3 kernel: Lustre: MGS: Regenerating lustre-OST0000 log by user request: rc = 0<br>
Jan 18 10:27:56 nas-0-3 kernel: Lustre: Found index 0 for lustre-OST0000, updating log<br>
Jan 18 10:27:56 nas-0-3 kernel: Lustre: Client log for lustre-OST0000 was not updated; writeconf the MDT first to regenerate it.<br>
</span><br>
The MDT has absolutely been writeconfed so that last message isn't terribly helpful. fscks are clean, so there's not a problem there.<br>
<br>
Any advice hugely appreciated!</p>
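<p class="MsoNormal">For anyone reading the logs above: the negative numbers in the LustreError lines are kernel errno values, so -2 is ENOENT (matching the "No such file or directory" from mount.lustre) and -16 is EBUSY. A minimal sketch for pulling them out of a log line follows. The function name and the regular expression are illustrative assumptions, not part of any Lustre tooling.</p>
<pre>
```python
import errno
import re

# Match a negative integer that follows whitespace, '(' or start of line,
# e.g. "rc = -2", "failed (-2)", "failed: -16". Hyphens inside names such
# as "nas-0-4" or "15c-8" are not preceded by those characters, so they
# are ignored.
_RC_PATTERN = re.compile(r"(?:^|[\s(])-(\d+)\b")


def decode_rc(line):
    """Return (value, errno name) pairs for negative return codes in a log line."""
    results = []
    for match in _RC_PATTERN.finditer(line):
        number = int(match.group(1))
        name = errno.errorcode.get(number, "unknown")
        results.append((-number, name))
    return results


if __name__ == "__main__":
    line = ("Jan 18 10:27:57 nas-0-4 kernel: LustreError: "
            "30961:0:(ldlm_lockd.c:2862:ldlm_put_ref()) ldlm_cleanup failed: -16")
    print(decode_rc(line))
```
</pre>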
</div>
</div>
</div>
</div>
</blockquote>
</div>
</body>
</html>