[lustre-discuss] Lustre 2.12.6 on RHEL 7.9 not able to mount disks after reboot

Tue Aug 9 14:41:03 PDT 2022

JC,

The message asking whether the MGS is running is a pretty common error, 
and a fairly generic one: it can appear for many underlying problems, not 
only when the MGS is actually down. There's not a lot of detail in your 
message, but the first step is to make sure your OST device is present on 
the OSS server. You mentioned remounting the RAID directories; is this 
software/MD RAID? Are you using ldiskfs or ZFS for the backend storage? 
(I'll guess ldiskfs if you're using MD RAID.)
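A minimal sketch of that first check, assuming MD RAID with an ldiskfs OST 
on a hypothetical device /dev/md0 (substitute your own). The function name 
check_ost_device is made up for illustration. It lists active arrays from 
/proc/mdstat, confirms the block device exists, and runs 
'tunefs.lustre --dryrun', which prints the target's stored settings 
(including the mgsnode parameter it will contact) without writing to the 
disk. On a ZFS backend you would look at 'zpool status' instead.

```shell
#!/bin/bash
# Sketch only: check that an ldiskfs OST on MD RAID is present on the OSS.
# check_ost_device and /dev/md0 are hypothetical names; use your own device.

check_ost_device() {
    local dev="$1"

    # Active software RAID arrays appear as "mdN : active ..." in /proc/mdstat.
    if [ -r /proc/mdstat ]; then
        grep -E '^md' /proc/mdstat || echo "no active MD arrays in /proc/mdstat"
    fi

    # The OST block device must exist before 'mount -t lustre' can use it.
    if [ ! -b "$dev" ]; then
        echo "OST device $dev not found" >&2
        return 1
    fi

    # Print the stored Lustre target settings (including mgsnode) without
    # modifying anything on disk.
    tunefs.lustre --dryrun "$dev"
}
```

Run it as root on the OSS, e.g. 'check_ost_device /dev/md0', and compare the 
mgsnode NID it reports with the 'lctl list_nids' output on the MGS.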

If you've already verified the OST volume is present, see if you can 
'lctl ping' between the MDS and OSS nodes. I'm not sure of your 
background, so forgive me if this is too elementary, but on each node, 
type 'lctl list_nids' to get its Lustre network identifier (NID), then 
run 'lctl ping <NID>' against the other node to make sure the two can 
talk Lustre/LNet to each other:

[root@tin1:~]# lctl list_nids
192.168.101.1@o2ib1

[root@tin6:~]# lctl list_nids
192.168.101.6@o2ib1

[root@tin6:~]# lctl ping 192.168.101.1@o2ib1
12345-0@lo
12345-192.168.101.1@o2ib1

[root@tin1:~]# lctl ping 192.168.101.6@o2ib1
12345-0@lo
12345-192.168.101.6@o2ib1
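If you have several peers to check, the same 'lctl ping' test can be 
looped over a list of NIDs. This is a sketch: the function name ping_peers 
is hypothetical, and it assumes 'lctl ping' exits non-zero when the peer 
can't be reached (the case where it prints an error such as an I/O error).

```shell
#!/bin/bash
# Sketch only: run 'lctl ping' against each peer NID and report failures.
# ping_peers is a hypothetical helper; pass the NIDs from 'lctl list_nids'.

ping_peers() {
    local nid failures=0
    for nid in "$@"; do
        # Assumes lctl ping returns a non-zero exit status on failure.
        if lctl ping "$nid" >/dev/null 2>&1; then
            echo "OK    $nid"
        else
            echo "FAIL  $nid"
            failures=$((failures + 1))
        fi
    done
    # Return the number of unreachable peers (0 means every ping succeeded).
    return "$failures"
}
```

For example, from the OSS: 'ping_peers 192.168.101.1@o2ib1'. The return 
value is the count of failed peers; counts above 255 would wrap, which 
isn't a concern for a handful of servers.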

If you get a failure (such as an I/O error), then you have a communications 
problem and you'll want to make sure all the correct network interfaces are 
up. If the pings do work, then look for Lustre and LNet messages in 
/var/log/lustre and in dmesg.
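For the interface and log checks, a small sketch along these lines may 
help. The function names are hypothetical; 'lnetctl net show' should list 
each configured LNet network and its local NIs with their status, and the 
dmesg filter simply keeps kernel lines that mention Lustre or LNet 
(LustreError messages, for instance).

```shell
#!/bin/bash
# Sketch only: helpers for the interface and log checks (names are hypothetical).

show_lnet_interfaces() {
    # Lists configured LNet networks and their NIs; look for a status other than "up".
    lnetctl net show
}

show_lustre_kmsgs() {
    # Keep only kernel ring buffer lines that mention Lustre or LNet,
    # case-insensitively (matches "LustreError:", "LNet:", and similar).
    dmesg | grep -iE 'lustre|lnet'
}
```

Running both right after a failed mount attempt keeps the relevant messages 
near the end of the output.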

Cameron

On 8/9/22 06:45, Crowder, Jonathan via lustre-discuss wrote:
>
> Hello, this is my first post here, so I may need some guidance on the 
> function of this system.
>
> I am on a small team supporting some 36 TB Lustre servers for a 
> business unit. Our configuration per mount point is one Lustre master 
> node and three Lustre object storage servers. One of the object storage 
> servers went down in a reboot of unknown cause. After the Azure cloud 
> team booted it back into the Lustre kernel, we could not get it to 
> remount the RAID directories for storage at the local file paths we 
> have set up for them. I can get the exact output soon; it knows the MGS 
> node, but asks whether it's running. I am having difficulty 
> investigating why this is happening, since the other object storage 
> servers are working without issue.
>
> Thanks,
>
> JC
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org