[lustre-discuss] Rebooting storage nodes while jobs are running?

Carlson, Timothy S Timothy.Carlson at pnnl.gov
Wed Feb 27 08:05:45 PST 2019


I will say YMMV.  I've rebooted storage nodes and have had mixed results, where we land in one of three buckets:

1) Codes breeze through, having just been stuck in D state while the OSSs reboot
2) RPCs get stuck somewhere and when the OSS comes back I eventually have to force an abort_recovery
3) A code dies by not handling the timeout (not sure if this is due to the code itself or the client improperly handling the timeout)
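
To tell bucket 1 apart from the others while an OSS is down, it can help to list which processes on a client are currently in D state (uninterruptible sleep). The following is a minimal Python sketch, not anything from the posts in this thread: the function names and the sample text are hypothetical, and it only parses the PID/STAT/COMMAND layout printed by `ps -eo pid,stat,comm`. A process in D state is not necessarily waiting on Lustre, so treat the result as a hint only.

```python
import subprocess


def uninterruptible_processes(ps_output):
    """Return (pid, command) pairs for rows whose STAT column starts with 'D'."""
    stuck = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue  # ignore malformed or empty rows
        pid, stat, command = fields
        if stat.startswith("D"):  # 'D' = uninterruptible sleep, usually I/O wait
            stuck.append((int(pid), command.strip()))
    return stuck


def list_stuck_now():
    """Run ps on the local node and report D-state processes (hypothetical helper)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,stat,comm"],
        capture_output=True, text=True, check=True,
    )
    return uninterruptible_processes(result.stdout)


# Made-up sample text in the layout of `ps -eo pid,stat,comm`; not real output.
SAMPLE = """\
  PID STAT COMMAND
 1201 Ss   sshd
 4410 D    my_solver
 4411 D+   my_solver
 4502 R    python
"""

print(uninterruptible_processes(SAMPLE))
```

Running `list_stuck_now()` on a client before, during, and after the reboot gives a rough picture of whether jobs are merely blocked and later resume.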

On our current setup, with around 1000 clients, 50ish OSSs, and 2.5.x-vintage Lustre servers, I would say outcome 1 is by far the most common (>95% of cases). Outcomes 2 and 3 happen from time to time, with likelihood greater than 0.

It's always a best practice to take a scheduled outage for a kernel/version upgrade. You never know what oddity your particular setup might encounter.

Tim

-----Original Message-----
From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> On Behalf Of Paul Edmon
Sent: Wednesday, February 27, 2019 7:54 AM
To: lustre-discuss at lists.lustre.org
Subject: Re: [lustre-discuss] Rebooting storage nodes while jobs are running?

From experience, rebooting the storage nodes is fine; the processes accessing them will just hang until the nodes are restored.  I've done this many times on our cluster with no ill effect.

That said, I have not tried it with kernel upgrades or Lustre release changes.  Those may do something different and unexpected. Someone else on the list may have insight on these.

-Paul Edmon-

On 2/27/19 10:17 AM, Bernd Melchers wrote:
> Hi all,
> our environment: CentOS-7.6, lustre-2.12.0 at zfs-0.7.12, 2 MDS, 7 OSS, 180 clients.
>
> Is it possible to reboot the MDS and OSS servers (e.g. for a new kernel
> or new Lustre releases) without affecting running jobs on the client nodes?
> The reboot can take up to 15 minutes. Do the clients wait for
> the storage nodes to reappear, or will I/O operations get errors?
> Is the behaviour of a client influenced by the timeout parameter
> ("lctl get_param timeout") or by other parameters?
>
> Kind regards
> Bernd Melchers
>
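
For the parameter question above, one practical step is to record the client-side value before and after maintenance so any change is visible. Below is a minimal Python sketch under stated assumptions: it relies on `lctl get_param` printing `name=value` lines, the helper names are hypothetical, and the sample value is made up for illustration. It does not settle which parameters actually govern how long a client waits.

```python
import subprocess


def parse_params(text):
    """Parse 'name=value' lines into a dict; lines without '=' are ignored."""
    params = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            params[name.strip()] = value.strip()
    return params


def read_params(names=("timeout",)):
    """Query the local client with `lctl get_param` (needs Lustre installed)."""
    result = subprocess.run(
        ["lctl", "get_param", *names],
        capture_output=True, text=True, check=True,
    )
    return parse_params(result.stdout)


# Illustrative sample only; the value on a real system may differ.
SAMPLE = "timeout=100\n"

print(parse_params(SAMPLE))
```

Only `timeout` is named in the question; any other parameter passed to `read_params` would be an assumption to verify against the Lustre manual for the installed release.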
_______________________________________________
lustre-discuss mailing list
lustre-discuss at lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

