[lustre-discuss] Rebooting storage nodes while jobs are running?

Wed Feb 27 07:54:16 PST 2019

From experience, rebooting the storage nodes is fine: the processes 
accessing them will simply hang until the servers are restored.  I've 
done this many times on our cluster with no ill effect.

That said, I have not tried it with kernel upgrades or Lustre release 
changes.  Those may behave differently and unexpectedly. Someone else 
on the list may have insight on these.
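
On the timeout question in the quoted message, it may help to record the
client-side settings before a reboot so they can be compared afterwards.
The following is only a minimal sketch, not a tested procedure: it assumes
that "lctl get_param" prints one "name=value" line per parameter, and that
at_min and at_max (the adaptive timeout bounds) are worth recording next to
timeout. The helper names parse_get_param and read_client_timeouts are
made up for illustration, and the sample values are not measurements.

```python
import subprocess


def parse_get_param(output):
    """Parse 'name=value' lines as printed by `lctl get_param`."""
    params = {}
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and anything that is not a name=value pair.
        if not line or "=" not in line:
            continue
        name, _, value = line.partition("=")
        params[name.strip()] = value.strip()
    return params


def read_client_timeouts(names=("timeout", "at_min", "at_max")):
    """Run `lctl get_param` on a client and return the values as a dict.

    Assumption: lctl is on PATH and the caller is allowed to read
    these parameters. Raises CalledProcessError if lctl fails.
    """
    result = subprocess.run(
        ["lctl", "get_param", *names],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_get_param(result.stdout)


if __name__ == "__main__":
    # Illustrative sample text only, so the sketch runs without Lustre.
    sample = "timeout=100\nat_min=0\nat_max=600\n"
    print(parse_get_param(sample))
```

On a real client one would call read_client_timeouts() before and after
the maintenance window and compare the two dictionaries.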

-Paul Edmon-

On 2/27/19 10:17 AM, Bernd Melchers wrote:
> Hi all,
> Our environment: CentOS-7.6, lustre-2.12.0 on zfs-0.7.12, 2 mds, 7 ods, 180 clients.
>
> Is it possible to reboot the mds and ods servers (e.g. for a new kernel or
> a new lustre release) without affecting running jobs on the client nodes?
> The reboot can take up to 15 minutes. Do the clients wait for
> the storage nodes to reappear, or will i/o operations get errors?
> Is the behaviour of a client influenced by the timeout parameter ("lctl get_param timeout")
> or by other parameters?
>
> Best regards,
> Bernd Melchers
>