[lustre-discuss] STOP'd processes on Lustre clients while OSS/OST unavailable?

Paul Brunk pbrunk at uga.edu
Fri Feb 19 11:45:59 PST 2016


Hi all:

We have a Linux cluster (CentOS 6.5, Lustre 1.8.9-wc1) which mounts a
Lustre FS from a CentOS-based server appliance (Lustre 2.1.0).

The Lustre cluster has 4 OSSes as two failover pairs. Due to bad luck
we have one OSS unbootable, and replacing it will require taking its
live partner down too (though not any of the other Lustre servers).

We can prevent I/O to the Lustre FS by suspending (kill -STOP) the
user processes on the cluster compute nodes before the maintenance
work, and resuming them (kill -CONT) afterwards.
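
A rough sketch of that suspend/resume step, to make the plan concrete.
The MIN_UID cutoff of 500 (the CentOS 6 default for regular accounts),
the function names, and the /tmp/stopped.pids path are illustrative
assumptions, not part of any existing tooling:

```shell
#!/bin/sh
# Sketch only: suspend and later resume user processes on a compute node.
# MIN_UID is an assumption: CentOS 6 starts regular accounts at UID 500.
MIN_UID=${MIN_UID:-500}

# Print the PID of every process whose real UID is >= MIN_UID.
user_pids() {
    ps -eo pid=,uid= | awk -v min="$MIN_UID" '$2 >= min { print $1 }'
}

# Send the given signal (STOP or CONT) to each PID read from stdin.
# Processes that have already exited are reported and skipped.
signal_pids() {
    sig=$1
    while read -r pid; do
        kill -s "$sig" "$pid" 2>/dev/null || echo "could not signal $pid" >&2
    done
}
```

Run as root on each node, this would be used roughly as
"user_pids > /tmp/stopped.pids; signal_pids STOP < /tmp/stopped.pids"
before the work and "signal_pids CONT < /tmp/stopped.pids" afterwards.
Saving the PID list means only the processes that were actually
stopped get resumed. Note that it would also stop any non-root login
shells on the node, so it should be launched from a root session.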

I don't know what would happen, though, in those cases where the
STOP'd process has an open file descriptor on the Lustre FS. If the
relevant OSS/OSTs become unavailable, and then available again, during
the STOP'd time, what would happen when the process is CONT'd?

I tried a Web search on this, but the best I could find was material
which either assumed that one member of a failover pair would remain
available, or was specifically about evictions (which I guess are a
risk of this maintenance procedure anyway). I did find one doc
(http://wiki.lustre.org/Lustre_Resiliency:_Understanding_Lustre_Message_Loss_and_Tuning_for_Resiliency)
which suggested that silent data corruption was a possibility in the
event of evictions.

But what about non-evicted clients with open filehandles?

Thanks for any insight!

-- 
Paul Brunk, system administrator
Georgia Advanced Computing Resource Center (GACRC)
Enterprise IT Svcs, the University of Georgia

