[lustre-discuss] OSTs waiting for clients on a pcs cluster

Colin Faber cfaber at gmail.com
Fri Nov 19 13:48:46 PST 2021


Hi Koos,

One thing you mentioned that I should have picked up on sooner was: "The
servers are connected in a multi-rail network, because some clients are on
IB and the other clients are on ethernet"

Can you describe your topology? How are the various elements connected to
each other?
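
If it helps, something like the following captures most of what I'm after
(just a sketch; run it on each server plus one IB and one ethernet client,
and the network and NID names will of course differ on your setup):

    lctl list_nids                          # this node's NIDs
    lnetctl net show                        # local LNet networks (e.g. o2ib4, tcp)
    lnetctl peer show                       # configured peers and their NIDs
    lnetctl export > lnet-$(hostname).yaml  # full LNet config, easy to share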

-cf


On Fri, Nov 19, 2021 at 5:38 AM Meijering, Koos <h.meijering at rug.nl> wrote:

> One more addition: I also see the following message on the OSS that had the OST
> before the failover:
> Nov 19 12:43:59 dh4-oss01 kernel: LustreError: 137-5: muse-OST0001_UUID:
> not available for connect from 172.23.53.214 at o2ib4 (no target). If you
> are running an HA pair check that the target is mounted on the other server.
>
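A quick way to check that "target is mounted on the other server" condition
on the node that should now own the OST (a sketch; the target name is taken
from the log line above, adjust for your real target):

    pcs status --full                                       # is the Pacemaker resource Started on oss02?
    mount -t lustre                                         # is the OST backend in the mount table?
    lctl dl                                                 # is muse-OST0001 listed as an obdfilter device?
    lctl get_param obdfilter.muse-OST0001.recovery_status   # what does the target itself report?
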
> On Fri, 19 Nov 2021 at 12:01, Meijering, Koos <h.meijering at rug.nl> wrote:
>
>> Hi Colin,
>>
>> I've attached three log files here: one from the metadata server and two from the
>> object stores.
>> Before these logs start the filesystem was working; then I requested the
>> cluster to fail over muse-OST0001 from oss01 to oss02.
>>
>>
>> On Thu, 18 Nov 2021 at 17:11, Colin Faber <cfaber at gmail.com> wrote:
>>
>>> Hi Koos,
>>>
>>> First thing -- it's generally a bad idea to run newer server versions
>>> with older clients (the opposite isn't true).
>>>
>>> Second -- do you have any logging that you can share from the client
>>> itself? (dmesg, syslog, etc)
>>>
>>> A quick test may be to run 2.12.7 clients against your cluster to verify
>>> that there is no interop problem.
>>>
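A quick way to confirm which versions are actually running on each side first
(a sketch; run on a client and on the servers):

    lctl get_param version        # Lustre version loaded on this node
    rpm -qa | grep -i lustre      # installed Lustre packages
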
>>> -cf
>>>
>>>
>>> On Thu, Nov 18, 2021 at 8:58 AM Meijering, Koos via lustre-discuss <
>>> lustre-discuss at lists.lustre.org> wrote:
>>>
>>>> Hi all,
>>>>
>>>> We have built a Lustre server environment on CentOS 7 with Lustre
>>>> 2.12.7.
>>>> The clients are running 2.12.5.
>>>> The setup is three Pacemaker clusters serving a 3 PB filesystem:
>>>> one two-node cluster for the MGS and the MDTs, and
>>>> two more two-node clusters for the OSTs.
>>>> The cluster framework is working as expected.
>>>>
>>>> The servers are connected in a multi-rail network, because some clients
>>>> are on IB and the other clients are on ethernet.
>>>>
>>>> But we have the following problem: when an OST fails over to the
>>>> second node, the clients are unable to contact the OST that is now started on
>>>> the other node.
>>>> The OST recovery status stays at "waiting for clients".
>>>> When we fail it back, it starts working again and the recovery status is
>>>> complete.
>>>>
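One thing worth ruling out when clients can only reach a target on its
original node is whether the OST was formatted with the NIDs of both OSS
nodes (a sketch; the backend device path is a placeholder for your layout):

    tunefs.lustre --dryrun /dev/mapper/ost0001 | grep -i failover

If no failover/servicenode NIDs show up there, the clients may have no
alternate NID to retry after a failover.
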
>>>> We tried to abort the recovery but that does not work.
>>>>
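For reference, recovery status is usually inspected, and recovery aborted,
on the OSS that currently holds the target (a sketch, using the OST name
from above):

    lctl get_param obdfilter.muse-OST0001.recovery_status   # status, connected/evicted clients, time left
    lctl --device muse-OST0001 abort_recovery                # force recovery to end if clients cannot reconnect
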
>>>> We used these documents to build the cluster:
>>>> https://wiki.lustre.org/Creating_the_Lustre_Management_Service_(MGS)
>>>> https://wiki.lustre.org/Creating_the_Lustre_Metadata_Service_(MDS)
>>>> https://wiki.lustre.org/Creating_Lustre_Object_Storage_Services_(OSS)
>>>>
>>>> https://wiki.lustre.org/Creating_Pacemaker_Resources_for_Lustre_Storage_Services
>>>>
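For reference, a per-OST Pacemaker resource built from that last guide looks
roughly like this (a sketch only; the device path, mount point, and location
score are placeholders, not our actual configuration):

    pcs resource create muse-OST0001 ocf:lustre:Lustre \
        target=/dev/mapper/ost0001 mountpoint=/lustre/muse/ost0001
    pcs constraint location muse-OST0001 prefers dh4-oss01=100
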
>>>> I'm not sure what the next steps should be to find the problem, or where
>>>> to look.
>>>>
>>>> Best regards
>>>> Koos Meijering
>>>> ........................................................................
>>>> HPC Team
>>>> Rijksuniversiteit Groningen
>>>> ........................................................................
>>>> _______________________________________________
>>>> lustre-discuss mailing list
>>>> lustre-discuss at lists.lustre.org
>>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>>>
>>>