[lustre-discuss] STOP'd processes on Lustre clients while OSS/OST unavailable?
oleg.drokin at intel.com
Fri Feb 19 12:11:01 PST 2016
Actually I have to disagree.
If the servers go down, but then come back up and complete recovery successfully, the locks will be replayed and it should all work transparently.
Clients would "pause" while trying to access those servers, for as long as needed, until the servers come back.
Also, file descriptors are something between the MDS and the clients, so if an OST goes down, file descriptors would not be affected.
That said, leaving the MDS up while some OSTs are down for a potentially prolonged time is not a great idea, and it might make sense to deactivate those OSTs on the MDS before bringing them down (and reactivate them once they are back).
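
As a rough illustration of the deactivate/reactivate step, here is a minimal Python sketch that builds the `lctl --device <devno> deactivate` / `activate` commands one would run on the MDS. It assumes the device numbers come from `lctl dl` output in the usual "devno state type name uuid refcount" layout; the filesystem name `testfs`, the sample listing, and the helper names are all hypothetical, and the exact OSC device names differ between Lustre versions. It defaults to a dry run, so nothing is executed unless you ask for it.

```python
import subprocess

# Hypothetical `lctl dl` output from an MDS; real names and numbers will differ.
SAMPLE_DL = """\
  0 UP mgs MGS MGS 5
  7 UP osc testfs-OST0000-osc testfs-mdtlov_UUID 5
  8 UP osc testfs-OST0001-osc testfs-mdtlov_UUID 5
"""


def find_osc_devices(dl_output, ost_labels):
    """Map each requested OST label (e.g. "OST0001") to its OSC device number."""
    devices = {}
    for line in dl_output.splitlines():
        fields = line.split()
        # Assumed columns: devno, state, type, name, uuid, refcount.
        if len(fields) < 4 or fields[2] != "osc":
            continue
        for label in ost_labels:
            if label in fields[3]:
                devices[label] = int(fields[0])
    return devices


def lctl_commands(devices, action):
    """Build one `lctl --device N <action>` argv list per device."""
    if action not in ("deactivate", "activate"):
        raise ValueError("action must be 'deactivate' or 'activate'")
    return [["lctl", "--device", str(devno), action]
            for _, devno in sorted(devices.items())]


def run(commands, dry_run=True):
    """Print the commands (dry run) or execute them on the MDS."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Before maintenance: deactivate the OSTs served by the OSS pair going down.
    devs = find_osc_devices(SAMPLE_DL, ["OST0000", "OST0001"])
    run(lctl_commands(devs, "deactivate"))
    # After the OSTs are back: run(lctl_commands(devs, "activate"))
```

In practice the listing would come from running `lctl dl` on the MDS rather than from a hard-coded string, and the commands should be checked against the documentation for the Lustre version in use.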
On Feb 19, 2016, at 2:53 PM, Patrick Farrell wrote:
> I would say this is not very likely to work and could easily result in corrupted data. With the servers going down completely, the clients will lose the locks they had (no possibility of recovery with the servers down completely like this), and any data not written out will be lost. You can guarantee the processes are idle with SIGSTOP, yes, but you can't guarantee all of the data has been written out.
> There are other possible issues as well, but I don't think it's necessary to detail them all. I would strongly advise against this plan - Just truly stop activity on the clients and unmount Lustre (to be certain), then remount it after the maintenance is complete.
> - Patrick
> On 02/19/2016 01:45 PM, Paul Brunk wrote:
>> Hi all:
>> We have a Linux cluster (CentOS 6.5, Lustre 1.8.9-wc1) which mounts a
>> Lustre FS from a CentOS-based server appliance (Lustre 2.1.0).
>> The Lustre cluster has 4 OSSes as two failover pairs. Due to bad luck
>> we have one OSS unbootable, and replacing it will require taking its
>> live partner down too (though not any of the other Lustre servers).
>> We can prevent I/O to the Lustre FS by suspending (kill -STOP) the
>> user processes on the cluster compute nodes before the maintenance
>> work, and resuming them (kill -CONT) afterwards.
>> I don't know what would happen, though, in those cases where the
>> STOP'd process has an open file descriptor on the Lustre FS. If the
>> relevant OSS/OSTs become unavailable, and then available again, during
>> the STOP'd time, what would happen when the process is CONT'd?
>> I tried a Web search on this, but the best I could find was stuff
>> which assumed that one of a failover partner set would remain
>> available, or was specifically about evictions (which I guess are a
>> risk of this maintenance procedure anyway). I did find one doc (
>> )which suggested that silent data corruption was a possibility in the
>> event of evictions.
>> But what about non-evicted clients with open filehandles?
>> Thanks for any insight!
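
For reference, the suspend/resume mechanics described in the original question (kill -STOP, then kill -CONT) can be sketched in Python as below. This is only an illustration under stated assumptions: it is Linux-only because it reads `/proc/<pid>/stat`, the helper names are hypothetical, and the demo uses a throwaway `sleep` child rather than real user jobs. As noted in the thread, stopping a process this way only makes it idle; it does not by itself guarantee that dirty data has been written out.

```python
import os
import signal
import subprocess


def process_state(pid):
    """Return the one-letter process state from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # The state is the first field after the parenthesised command name.
    return stat.rsplit(")", 1)[1].split()[0]


def suspend(pids):
    """Send SIGSTOP to each PID (equivalent to `kill -STOP`)."""
    for pid in pids:
        os.kill(pid, signal.SIGSTOP)


def resume(pids):
    """Send SIGCONT to each PID (equivalent to `kill -CONT`)."""
    for pid in pids:
        os.kill(pid, signal.SIGCONT)


if __name__ == "__main__":
    child = subprocess.Popen(["sleep", "30"])
    try:
        suspend([child.pid])
        # Block until the kernel reports the child as stopped.
        os.waitpid(child.pid, os.WUNTRACED)
        print(process_state(child.pid))  # "T" denotes a stopped process
        resume([child.pid])
    finally:
        child.kill()
        child.wait()
```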