[lustre-discuss] rebooting nodes

Colin Faber colin.faber at seagate.com
Thu Aug 10 09:39:09 PDT 2017


In my experience, OFED tends to be unloaded before LNet is torn down. This
cuts the ground out from under LNet: the LNet module can't unload cleanly,
and the reboot hangs. The trick is to ensure that Lustre is unmounted first,
then LNet is unloaded, and only then the OFED modules. Generally, when
shutting down in this order, your reboot should be clean.
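
A minimal sketch of that ordering in Python, for anyone who wants to try it by hand before touching systemd. The specific commands are assumptions, not something taken from this thread: `umount -a -t lustre` for the unmount, `lustre_rmmod` for unloading the Lustre/LNet modules, and an `openibd` service for the OFED stack (the service name depends on how your OFED is packaged). It defaults to a dry run that only prints what it would do.

```python
import subprocess

# Teardown steps in the order described above: Lustre mounts first,
# then LNet, then OFED. The commands themselves are assumptions for
# illustration; substitute whatever your site actually uses.
TEARDOWN_STEPS = [
    ("unmount Lustre", ["umount", "-a", "-t", "lustre"]),
    ("unload LNet and Lustre modules", ["lustre_rmmod"]),
    ("stop OFED stack", ["service", "openibd", "stop"]),  # assumed service name
]


def teardown(dry_run=True):
    """Run each step in order and stop at the first failure.

    Returns the names of the steps that completed (or, in dry-run
    mode, the steps that would have been run).
    """
    done = []
    for name, cmd in TEARDOWN_STEPS:
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            result = subprocess.run(cmd)
            if result.returncode != 0:
                # Do not unload lower layers while an upper layer is still up.
                print(f"step failed: {name}; not continuing")
                break
        done.append(name)
    return done


if __name__ == "__main__":
    teardown(dry_run=True)
```

Stopping at the first failure is deliberate: if the unmount fails, unloading LNet or OFED anyway would recreate the situation that causes the hang.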

You can verify this idea by checking your console log during the shutdown.

-cf

On Thu, Aug 10, 2017 at 7:51 AM, Christopher Johnston <chjohnst at gmail.com>
wrote:

> On my systems, which use standard Ethernet (I'm in the cloud) and run 2.9, I
> see no issues on reboot.  I did have issues with the lnet driver not being
> able to grab the port on boot-up, so I backported the lnet systemd unit file
> from 2.10 to get around that.
>
> On Thu, Aug 10, 2017 at 9:44 AM, Ben Evans <bevans at cray.com> wrote:
>
>> Are the Infiniband drivers disappearing first?  I know that used to be an
>> issue.
>>
>> -Ben
>>
>> On 8/10/17, 8:59 AM, "lustre-discuss on behalf of Michael Di Domenico"
>> <lustre-discuss-bounces at lists.lustre.org on behalf of
>> mdidomenico4 at gmail.com> wrote:
>>
>> >does anyone else have issues when issuing 'reboot' while a lustre
>> >filesystem is mounted?
>> >
>> >we're running v2.9 clients on our workstations, but when a user goes
>> >to reboot the machine (from the gui) the system stalls under systemd
>> >while i presume it's attempting to unmount the filesystem.
>> >
>> >what i see on the console is: systemd kicks in and starts unmounting
>> >all the nfs shares we have, works fine.  but then it gets to lustre
>> >and starts throwing connection errors on the console.  it's almost as
>> >if systemd raced itself stopping lustre, whereby lnet got yanked out
>> >from under the mount before the unmount actually finished.
>> >
>> >after five minutes or so, it looks like systemd threw in the towel and
>> >gave up trying to unmount, but the system is stuck still trying to
>> >execute more shutdown tasks.
>> >
>> >when we mount lustre on the workstations, i have a script that figures
>> >some stuff out, issues a service lnet start, and then issues a mount
>> >command.  this all works fine, but i'm not sure if that's why systemd
>> >can't figure out what to do correctly.
>> >
>> >and since this is during a shutdown phase, debugging this is
>> >difficult.  any thoughts?
>> >_______________________________________________
>> >lustre-discuss mailing list
>> >lustre-discuss at lists.lustre.org
>> >http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org