[lustre-discuss] OSTs waiting for clients on a pcs cluster

Meijering, Koos h.meijering at rug.nl
Fri Nov 19 04:38:21 PST 2021


One more addition: I also see the following message on the OSS that had the OST
before the failover:
Nov 19 12:43:59 dh4-oss01 kernel: LustreError: 137-5: muse-OST0001_UUID:
not available for connect from 172.23.53.214@o2ib4 (no target). If you are
running an HA pair check that the target is mounted on the other server.
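
For reference, where the target actually ended up can be checked on each OSS
node with something like this:

  mount -t lustre    # Lustre targets currently mounted on this node
  lctl dl            # local Lustre devices and their state
  pcs status         # where Pacemaker thinks the resource is running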

On Fri, 19 Nov 2021 at 12:01, Meijering, Koos <h.meijering at rug.nl> wrote:

> Hi Colin,
>
> I've attached three log files here: one from the metadata server and two
> from the object storage servers.
> Before these logs start the filesystem was working; then I requested the
> cluster to fail over muse-OST0001 from oss01 to oss02.
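> The failover was triggered through Pacemaker with something like this (the
> resource and node names here are just illustrative):
>
>   pcs resource move muse-OST0001 dh4-oss02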
>
>
> On Thu, 18 Nov 2021 at 17:11, Colin Faber <cfaber at gmail.com> wrote:
>
>> Hi Koos,
>>
>> First thing -- it's generally a bad idea to run newer server versions
>> with older clients (the opposite isn't true).
>>
>> Second -- do you have any logging that you can share from the client
>> itself? (dmesg, syslog, etc)
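>>
>> For example, something like this from an affected client would help (the
>> output file name is just an example):
>>
>>   dmesg -T | grep -iE 'lustre|lnet'
>>   lctl dk /tmp/lustre-client-debug.log   # dump the Lustre kernel debug log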
>>
>> A quick test may be to run 2.12.7 clients against your cluster to verify
>> that there is no interop problem.
>>
>> -cf
>>
>>
>> On Thu, Nov 18, 2021 at 8:58 AM Meijering, Koos via lustre-discuss <
>> lustre-discuss at lists.lustre.org> wrote:
>>
>>> Hi all,
>>>
>>> We have built a Lustre server environment on CentOS 7 with Lustre 2.12.7.
>>> The clients are running 2.12.5.
>>> The setup is three clusters for a 3 PB filesystem:
>>> one is a two-node cluster for the MGS and the MDTs,
>>> and the other two are also two-node clusters, used for the OSTs.
>>> The cluster framework is working as expected.
>>>
>>> The servers are connected in a multi-rail network, because some clients
>>> are on InfiniBand and the other clients are on Ethernet.
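>>> The LNet configuration on each node can be checked with, for example:
>>>
>>>   lnetctl net show    # configured networks/NIDs on this node
>>>   lctl list_nids      # NIDs this node advertises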
>>>
>>> But we have the following problem: when an OST fails over to the
>>> second node, the clients are unable to contact the OST that is started on
>>> the other node.
>>> The OST recovery status stays at "waiting for clients".
>>> When we fail it back it starts working again and the recovery status is
>>> "complete".
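>>> For reference, we check the recovery state on the OSS with something like:
>>>
>>>   lctl get_param obdfilter.muse-OST0001.recovery_status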
>>>
>>> We tried to abort the recovery but that does not work.
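>>> The abort was attempted on the OSS along these lines (the device index is
>>> a placeholder taken from the 'lctl dl' output):
>>>
>>>   lctl dl                               # find the device index of the OST
>>>   lctl --device <index> abort_recovery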
>>>
>>> We used these documents to build the cluster:
>>> https://wiki.lustre.org/Creating_the_Lustre_Management_Service_(MGS)
>>> https://wiki.lustre.org/Creating_the_Lustre_Metadata_Service_(MDS)
>>> https://wiki.lustre.org/Creating_Lustre_Object_Storage_Services_(OSS)
>>>
>>> https://wiki.lustre.org/Creating_Pacemaker_Resources_for_Lustre_Storage_Services
>>>
>>> I'm not sure what the next steps should be to find the problem, or where
>>> to look.
>>>
>>> Best regards
>>> Koos Meijering
>>> ........................................................................
>>> HPC Team
>>> Rijksuniversiteit Groningen
>>> ........................................................................
>>> _______________________________________________
>>> lustre-discuss mailing list
>>> lustre-discuss at lists.lustre.org
>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>>
>>