[lustre-discuss] [EXTERNAL] I/O error on lctl ping although ibping successful

Mohr, Rick mohrrf at ornl.gov
Tue Jun 20 09:59:04 PDT 2023


Have you tried TCP/IP pings to the IP addresses associated with the IB interfaces?

--Rick
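
One way to act on this suggestion is to run an ordinary IP ping and an LNet ping against the same server address and compare the results. The sketch below is only an illustration and is not part of the original exchange: the function names, the `o2ib` network name, and the hint strings are assumptions, and the hints are rough heuristics rather than a diagnosis. It relies only on the standard `ping` and `lctl ping <nid>` commands.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes a command line and returns True if it exited with status 0.
Runner = Callable[[Sequence[str]], bool]


def run_ok(cmd: Sequence[str]) -> bool:
    """Run a command quietly and report whether it succeeded."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Command missing, not executable, or hung: treat as a failure.
        return False
    return result.returncode == 0


def check_server(ip: str, lnd: str = "o2ib", run: Runner = run_ok) -> dict:
    """Compare an IP ping and an LNet ping to the same server address.

    `ip` is the IP address configured on the server's IB interface.
    `lnd` is the LNet network name (assumed to be "o2ib" here).
    """
    nid = f"{ip}@{lnd}"
    ip_ok = run(["ping", "-c", "3", "-W", "2", ip])
    lnet_ok = run(["lctl", "ping", nid])

    # Heuristic hints only; they narrow down where to look next.
    if ip_ok and lnet_ok:
        hint = "IP and LNet pings both succeed"
    elif ip_ok:
        hint = "IP ping works but LNet ping fails; look at the LNet configuration or state"
    else:
        hint = "IP ping over the IB interface fails; check IP addressing on the IB interface first"

    return {"nid": nid, "ip_ping": ip_ok, "lnet_ping": lnet_ok, "hint": hint}
```

On an affected client, calling `check_server("<MDS IB interface IP>")` and printing the returned dictionary shows whether the failure is already visible at the IP level or only at the LNet level. The `run` parameter exists so the logic can be exercised without the real commands.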


On 6/20/23, 12:11 PM, "lustre-discuss on behalf of Youssef Eldakar via lustre-discuss" <lustre-discuss-bounces at lists.lustre.org on behalf of lustre-discuss at lists.lustre.org> wrote:


In a cluster of ~100 Lustre clients (compute nodes) connected to the MDS and OSS over Intel True Scale InfiniBand (a discontinued product), we started seeing certain nodes fail to mount the Lustre file system and return an I/O error on an LNet (lctl) ping, even though an ibping test to the MDS gives no errors. We tried rebooting the problematic nodes and even reinstalling the OS and Lustre client from scratch, which did not help. Rebooting the MDS appears to help briefly once the MDS is back up, but the same set of problematic nodes always eventually reverts to the state where they fail to ping the MDS over LNet.


Thank you for any pointers we may pursue.




Youssef Eldakar
Bibliotheca Alexandrina
www.bibalex.org
hpc.bibalex.org






