[lustre-discuss] Lustre 2.12 OSTs fail on certain OSSs

Konzem, Kevin (Contractor) P kkonzem at contractor.usgs.gov
Fri Jul 9 12:58:16 PDT 2021

We are having an interesting issue with OST connectivity that we just can't figure out on our own. For background, in this instance we have one VM running as an MDS and two physical servers with direct-connected storage systems serving as OSSs. There are two OSTs per OSS, making four total, and pacemaker/corosync is set up to run HA for the OSTs.

When I set up this instance in February, I tested thoroughly and made sure that I could run the filesystem with any combination of OSTs running on either of the OSSs. Lately, however, the OSTs have connectivity issues if they run on a certain OSS. For instance, if OST0 and OST1 are running on OSS1, and OST2 and OST3 are running on OSS2, the filesystem works fine with no issues. But if I migrate any OST to the other OSS, that OST will mount and appear to be working fine in an 'lctl dl' run from the MDS, yet all files located on the affected OST will be unavailable from any client, and an 'lfs check servers' run from a client will hang for a while, then show "resource temporarily unavailable (11)" on that OST. Any attempt to access or even check the metadata of a file (ls, df, du, etc.) will freeze the session.

I kicked off a 'lfsck_start -o -t layout -A' from the MDT and it completed without finding anything to repair.
I'd appreciate it if anyone could point me in a direction to look for answers to this issue.

root@[MDS] ~ $ lctl dl
  0 UP osd-ldiskfs lustrest-MDT0000-osd lustrest-MDT0000-osd_UUID 11
  1 UP mgs MGS MGS 56
  2 UP mgc MGC[MDS STORAGE NETWORK  IP]@tcp bc90ff88-6a97-fd41-f1af-97bf148bf883 4
  3 UP mds MDS MDS_uuid 2
  4 UP lod lustrest-MDT0000-mdtlov lustrest-MDT0000-mdtlov_UUID 3
  5 UP mdt lustrest-MDT0000 lustrest-MDT0000_UUID 60
  6 UP mdd lustrest-MDD0000 lustrest-MDD0000_UUID 3
  7 UP qmt lustrest-QMT0000 lustrest-QMT0000_UUID 3
  8 UP osp lustrest-OST0000-osc-MDT0000 lustrest-MDT0000-mdtlov_UUID 4
  9 UP osp lustrest-OST0001-osc-MDT0000 lustrest-MDT0000-mdtlov_UUID 4
 10 UP osp lustrest-OST0002-osc-MDT0000 lustrest-MDT0000-mdtlov_UUID 4
 11 UP osp lustrest-OST0003-osc-MDT0000 lustrest-MDT0000-mdtlov_UUID 4
 12 UP lwp lustrest-MDT0000-lwp-MDT0000 lustrest-MDT0000-lwp-MDT0000_UUID 4

Attached is a screencap of 'lfs check servers' failing on the affected OST.
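
In case it helps anyone compare listings across failover states, here is a minimal Python sketch that parses 'lctl dl' output in the layout shown above and reports any device whose state is not UP. It assumes each device line has the form "index state type name uuid refcount"; the names Device, parse_lctl_dl and not_up are made up for this example and are not part of any Lustre tooling. Note that in our case every device shows UP even while clients cannot reach the migrated OST, so a clean result from a check like this does not rule out the problem.

```python
from typing import List, NamedTuple


class Device(NamedTuple):
    index: int
    state: str
    type: str
    name: str
    uuid: str
    refcount: int


def parse_lctl_dl(text: str) -> List[Device]:
    """Parse 'lctl dl' style output into Device records.

    Assumed layout per line: index state type name uuid refcount.
    Lines that do not match (prompts, blank lines) are skipped.
    """
    devices = []
    for line in text.splitlines():
        fields = line.split()
        # Need at least six fields, with numeric index and refcount.
        if len(fields) < 6 or not fields[0].isdigit() or not fields[-1].isdigit():
            continue
        # A redacted name may contain spaces, so treat everything between
        # the type and the last two fields as the name.
        name = " ".join(fields[3:-2])
        devices.append(
            Device(int(fields[0]), fields[1], fields[2], name,
                   fields[-2], int(fields[-1]))
        )
    return devices


def not_up(devices: List[Device]) -> List[Device]:
    """Return the devices whose state column is anything other than UP."""
    return [d for d in devices if d.state != "UP"]


if __name__ == "__main__":
    import sys

    # Example: lctl dl | python3 check_dl.py
    for dev in not_up(parse_lctl_dl(sys.stdin.read())):
        print(f"{dev.index} {dev.state} {dev.type} {dev.name}")
```

Piping the listing above through this prints nothing, since all thirteen devices are UP.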
-------------- next part --------------
A non-text attachment was scrubbed...
Name: MicrosoftTeams-image (5).png
Type: image/png
Size: 92738 bytes
Desc: MicrosoftTeams-image (5).png
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20210709/b54e559c/attachment-0001.png>