[lustre-discuss] [EXTERNAL] I/O error on lctl ping although ibping successful

Youssef Eldakar youssefeldakar at gmail.com
Thu Jun 22 05:44:47 PDT 2023


Quite strangely, I found 2 good hosts (both successfully mount the file
system) where TCP ping goes through on one but does not on the other
(though LNET ping is OK for both).
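
For anyone repeating this comparison across many clients, here is a minimal Python sketch (not from the original thread) that runs a TCP ping and an LNET ping against the same address and reports both results side by side. It assumes Linux iputils `ping`, that `lctl ping` exits non-zero on failure, and that the LNET network type is `o2ib`. The address `10.0.0.5` and the helper names are hypothetical.

```python
import subprocess
from typing import Callable, Dict, Sequence


def command_succeeds(argv: Sequence[str], timeout: float = 10.0) -> bool:
    """Return True if the command exits with status 0 within the timeout."""
    try:
        result = subprocess.run(
            list(argv),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Missing binary or a hung command both count as a failed check.
        return False
    return result.returncode == 0


def check_host(
    ip: str,
    lnet_type: str = "o2ib",  # assumption: LNET runs over the o2ib network type
    run: Callable[[Sequence[str]], bool] = command_succeeds,
) -> Dict[str, object]:
    """Run a TCP ping and an LNET ping against the same IPoIB address."""
    tcp_ok = run(["ping", "-c", "1", "-W", "2", ip])
    lnet_ok = run(["lctl", "ping", f"{ip}@{lnet_type}"])
    return {"ip": ip, "tcp": tcp_ok, "lnet": lnet_ok}


def describe(result: Dict[str, object]) -> str:
    """Format one result as a single report line."""
    tcp = "ok" if result["tcp"] else "FAILED"
    lnet = "ok" if result["lnet"] else "FAILED"
    return f"{result['ip']}: TCP ping {tcp}, LNET ping {lnet}"


# Usage (hypothetical address): print(describe(check_host("10.0.0.5")))
```

The `run` parameter exists only so the logic can be exercised without real network access. In practice `lctl ping` typically needs root on a node with LNET loaded.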

- Youssef

On Wed, Jun 21, 2023 at 6:08 PM Youssef Eldakar <youssefeldakar at gmail.com>
wrote:

> Thanks, Rick, for that suggestion. TCP ping between a problematic host and
> the MDS indeed does not go through.
>
> Not exactly sure what to investigate next, but that gives me somewhere to
> start...
>
> - Youssef
>
> On Tue, Jun 20, 2023 at 7:00 PM Mohr, Rick via lustre-discuss <
> lustre-discuss at lists.lustre.org> wrote:
>
>> Have you tried TCP pings to the IP addresses associated with the IB
>> interfaces?
>>
>> --Rick
>>
>>
>> On 6/20/23, 12:11 PM, "lustre-discuss on behalf of Youssef Eldakar via
>> lustre-discuss" <lustre-discuss-bounces at lists.lustre.org on behalf of
>> lustre-discuss at lists.lustre.org> wrote:
>>
>>
>> In a cluster with ~100 Lustre clients (compute nodes) connected to the
>> MDS and OSS over Intel True Scale InfiniBand (a discontinued product), we
>> started seeing certain nodes fail to mount the Lustre file system and
>> report an I/O error on LNET (lctl) ping, even though an ibping test to the
>> MDS gives no errors. We tried rebooting the problematic nodes and even
>> fresh-installing the OS and Lustre client, which did not help. Rebooting
>> the MDS seems to help briefly once it is back up, but the same set of
>> problematic nodes always seems to eventually revert to the state where
>> they fail to ping the MDS over LNET.
>>
>>
>> Thank you for any pointers we may pursue.
>>
>>
>>
>>
>> Youssef Eldakar
>> Bibliotheca Alexandrina
>> www.bibalex.org <
>> https://urldefense.us/v2/url?u=http-3A__www.bibalex.org&d=DwMFaQ&c=v4IIwRuZAmwupIjowmMWUmLasxPEgYsgNI-O7C4ViYc&r=SpEwA4Pnyq7nH7aMGq8KpA&m=kwZRPirpHWOowgLmVOYe_KJ4ZigAHQk3DiF8-BwQ2qFikINn8C5-0SyyYEDelqDH&s=5DLPIzJx0tgg1TgSZkvvNNVfDfgpo-Prv-BPOga0WMA&e=>
>> hpc.bibalex.org <
>> https://urldefense.us/v2/url?u=http-3A__hpc.bibalex.org&d=DwMFaQ&c=v4IIwRuZAmwupIjowmMWUmLasxPEgYsgNI-O7C4ViYc&r=SpEwA4Pnyq7nH7aMGq8KpA&m=kwZRPirpHWOowgLmVOYe_KJ4ZigAHQk3DiF8-BwQ2qFikINn8C5-0SyyYEDelqDH&s=HMqKriFlJ2qwafMOSVJMqre9-wmJ--kaSS_rx4t7hQw&e=>
>>
>>
>>
>>
>>
>> _______________________________________________
>> lustre-discuss mailing list
>> lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>
>