[lustre-discuss] Rebooting storage nodes while jobs are running?

Harr, Cameron harr1 at llnl.gov
Fri Mar 1 10:58:00 PST 2019


We have multiple compute clusters that mount each of our Lustre file
systems, and we do OS/kernel updates on them without regard to each
other. Sometimes a client cluster is updated at the same time as one
of the Lustre clusters, but often it is not. This approach generally
works fine: jobs and file accesses simply hang until recovery on the
file system is finished.
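
To watch for the "hang until recovery" behaviour on a client while the servers reboot, one option is to list processes in uninterruptible sleep (state D). The following is only a minimal sketch, assuming a Linux client with /proc mounted; the function names are made up for illustration and are not part of Lustre or any site tooling. It shows only that a process is blocked, not that Lustre is the cause.

```python
import os

def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST ')' instead of splitting on whitespace.
    head, _, tail = line.rpartition(")")
    pid_str, _, comm = head.partition("(")
    state = tail.split()[0]  # first field after comm is the state letter
    return int(pid_str), comm, state

def d_state_processes(proc_root="/proc"):
    """List (pid, comm) for processes currently in uninterruptible sleep."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited (or stat was unreadable) mid-scan
        if state == "D":
            found.append((pid, comm))
    return found
```

Calling d_state_processes() periodically during the reboot window and again after recovery completes gives a rough picture of whether jobs resumed or are still blocked.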

On 2/27/19 8:05 AM, Carlson, Timothy S wrote:
> I will say YMMV.  I've rebooted storage nodes and have had mixed results, where we land in one of three buckets:
>
> 1) Codes breeze through and have just been stuck in D state while the OSSes reboot
> 2) RPCs get stuck somewhere and when the OSS comes back I eventually have to force an abort_recovery
> 3) A code dies by not handling the timeout (not sure if this is due to the code itself or the client improperly handling the timeout)
>
> On our current setup with around 1000 clients, 50ish OSSes, and 2.5.x-vintage Lustre servers, I would say bucket 1 is by far the largest percentage (>95%). Buckets 2 and 3 happen from time to time, with likelihood greater than 0.
>
> It's always a best practice to take a scheduled outage for a kernel/version upgrade. You never know what oddity your particular setup might encounter.
>
> Tim
>
> -----Original Message-----
> From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> On Behalf Of Paul Edmon
> Sent: Wednesday, February 27, 2019 7:54 AM
> To: lustre-discuss at lists.lustre.org
> Subject: Re: [lustre-discuss] Rebooting storage nodes while jobs are running?
>
> From experience, rebooting the storage nodes is fine; the processes accessing them will just hang until the nodes are restored.  I've done this many times on our cluster with no ill effect.
>
> That said, I have not tried it with kernel upgrades or Lustre release changes.  That may do something different and unexpected. Someone else on the list may have insight on these.
>
> -Paul Edmon-
>
> On 2/27/19 10:17 AM, Bernd Melchers wrote:
>> Hi all,
>> our environment: CentOS-7.6, lustre-2.12.0 on zfs-0.7.12, 2 MDS, 7 OSS, 180 clients.
>>
>> Is it possible to reboot the MDS and OSS servers (e.g. for a new kernel
>> or a new Lustre release) without affecting running jobs on the client nodes?
>> The reboot can take up to 15 minutes. Do the clients keep waiting for
>> the storage nodes to reappear, or will I/O operations get errors?
>> Is the behaviour of a client influenced by the timeout parameter
>> ("lctl get_param timeout") or by other parameters?
>>
>> Kind regards,
>> Bernd Melchers
>>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

