[lustre-discuss] STOP'd processes on Lustre clients while OSS/OST unavailable?

Fri Feb 19 11:53:46 PST 2016

Paul,

I would say this is not very likely to work and could easily result in 
corrupted data.  With the servers going down completely, the clients 
will lose the locks they had (no possibility of recovery with the 
servers down completely like this), and any data not written out will be 
lost.  You can guarantee the processes are idle with SIGSTOP, yes, but 
you can't guarantee all of the data has been written out.

There are other possible issues as well, but I don't think it's 
necessary to detail them all.  I would strongly advise against this plan.
Instead, truly stop activity on the clients and unmount Lustre (to be
certain), then remount it after the maintenance is complete.
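
As a rough aid for the "truly stop activity" step, here is a minimal
Python sketch that lists which processes on a client still hold open
file descriptors under a given mount point, by walking /proc. The
function name pids_with_open_files and the mount point /mnt/lustre are
hypothetical, chosen only for illustration. It assumes a Linux /proc
and enough privilege to read other users' fd directories, and it does
not see working directories or memory-mapped files, so an empty result
is not proof that the mount is idle.

```python
import os


def pids_with_open_files(mount_point, proc_root="/proc"):
    """Return sorted PIDs holding an open file descriptor under mount_point."""
    # Trailing separator so that /mnt/lustre does not match /mnt/lustre2.
    prefix = os.path.join(os.path.realpath(mount_point), "")
    pids = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor was closed while we were looking
            if target.startswith(prefix) or target == prefix.rstrip("/"):
                pids.add(int(entry))
                break
    return sorted(pids)


if __name__ == "__main__":
    # /mnt/lustre is an assumed mount point; substitute your own.
    print(pids_with_open_files("/mnt/lustre"))
```

Any PID it reports would need to exit (not merely be STOP'd) before
the unmount can succeed cleanly.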

- Patrick

On 02/19/2016 01:45 PM, Paul Brunk wrote:
> Hi all:
>
> We have a Linux cluster (CentOS 6.5, Lustre 1.8.9-wcl) which mounts a
> Lustre FS from a CentOS-based server appliance (Lustre 2.1.0).
>
> The Lustre cluster has 4 OSSes as two failover pairs. Due to bad luck
> we have one OSS unbootable, and replacing it will require taking its
> live partner down too (though not any of the other Lustre servers).
>
> We can prevent I/O to the Lustre FS by suspending (kill -STOP) the
> user processes on the cluster compute nodes before the maintenance
> work, and resuming them (kill -CONT) afterwards.
>
> I don't know what would happen, though, in those cases where the
> STOP'd process has an open file descriptor on the Lustre FS. If the
> relevant OSS/OSTs become unavailable, and then available again, during
> the STOP'd time, what would happen when the process is CONT'd?
>
> I tried a Web search on this, but the best I could find was stuff
> which assumed that one of a failover partner set would remain
> available, or was specifically about evictions (which I guess are a
> risk of this maintenance procedure anyway). I did find one doc
> (http://wiki.lustre.org/Lustre_Resiliency:_Understanding_Lustre_Message_Loss_and_Tuning_for_Resiliency)
> which suggested that silent data corruption was a possibility in the
> event of evictions.
>
> But what about non-evicted clients with open filehandles?
>
> Thanks for any insight!
>