<html>


<head>


<meta http-equiv="Content-Type" content="text/html; charset=utf-8">


</head>


<body>


<br>


Andreas,<br>


<br>


It's somewhat worse than that - no server crash is required.  On eviction, the LDLM locks the client holds from that target are destroyed, which also destroys the cached pages under them.  So any data that is not synced/persistent at the time of eviction is lost.<br>


<br>


(And, for others reading:)<br>


But that’s required by the purpose of eviction, which is mostly to allow the file system to make forward progress in the face of a misbehaving client (rather than just deadlocking forever).  And if your FS and clients are healthy, you shouldn’t normally have any evictions.  Note that we’re only talking about this in the context of a bug.<br>


<br>


- Patrick<br>


<br>


<hr style="display:inline-block;width:98%" tabindex="-1">


<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Andreas Dilger &lt;adilger@whamcloud.com&gt;<br>


<b>Sent:</b> Monday, October 22, 2018 8:55:57 PM<br>


<b>To:</b> Marion Hakanson<br>


<b>Cc:</b> Patrick Farrell; lustre-discuss@lists.lustre.org<br>


<b>Subject:</b> Re: [lustre-discuss] LU-11465 OSS/MDS deadlock in 2.10.5</font>


<div> </div>


</div>


<div class="BodyFragment"><font size="2"><span style="font-size:11pt;">


<div class="PlainText">On Oct 23, 2018, at 09:25, Marion Hakanson &lt;hakansom@ohsu.edu&gt; wrote:<br>


> <br>


> I think Patrick's warning of data loss on a local ZFS filesystem is not<br>


> quite right.  It's a design feature of ZFS that it flushes caches upon<br>


> committing writes before returning a "write complete" back to the<br>


> application.  Data loss can still happen if the storage lies to ZFS<br>


> about having sent the data to stable storage.<br>


<br>


Just to clarify, even ZFS on a local node does not avoid data loss if<br>


the file is written only to RAM, and is not sync'd to disk.  That is<br>


true of any filesystem, unless your writes are all O_SYNC (which can<br>


hurt performance significantly), or until NVRAM is used exclusively to<br>


store data.<br>


<br>


There is some time after the write() syscall returns to an application<br>


before the filesystem will even _start_ to write to the disk, to allow<br>


it to aggregate data from multiple write() syscalls for efficiency.<br>


Once the data is sent from RAM to disk, the disk should not ack the write<br>


until it is persistent.  If sync() (or variant) is called by userspace,<br>


that should not return until the data is persistent, which is true with<br>


Lustre as well.<br>


<br>


What Patrick was referring to is the case where the server crashes after<br>


the client write() has accepted the data, but before it is persistent on<br>


disk, and *then* the client is evicted from the server: the data is lost.<br>


It would still return an error if fsync() is called on the file handle,<br>


but this is often not done by applications.  The same is true if a local<br>


disk disconnects from the node before the data is persistent (e.g. USB<br>


device unplug, cable failure, external RAID enclosure power failure, etc).<br>
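<br>
The point about fsync() above can be illustrated with a minimal Python sketch (not from the original thread).  The helper name write_durably and the file name result.dat are hypothetical; the only idea shown is that an application which calls fsync() sees an error if write-back failed, as described above, instead of losing data silently.<br>
<pre>
```python
import os
import tempfile


def write_durably(path, data):
    """Write bytes to path and return only after fsync() has succeeded.

    Raises OSError if write() or fsync() reports a failure, so a failed
    write-back is surfaced to the caller instead of being silently dropped.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while len(view):
            # os.write() may write fewer bytes than requested.
            written = os.write(fd, view)
            view = view[written:]
        # write() returning only means the data reached the page cache.
        # fsync() blocks until the file system reports it as persistent.
        os.fsync(fd)
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Hypothetical output file in a scratch directory.
    target = os.path.join(tempfile.mkdtemp(), "result.dat")
    try:
        write_durably(target, b"simulation output\n")
        print("data is on stable storage")
    except OSError as err:
        print("write-back failed:", err)
```
</pre>
Calling fsync() after every write costs performance, so one common compromise is to call it once before closing each output file and to treat any error as a failed run.<br>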


<br>


Cheers, Andreas<br>


<br>


> Anyway, thanks, Andreas and others, for clarifying about the use of<br>


> abort_recovery.  Using it turns out to not have been helpful in our<br>


> situation so far, but this has been a useful discussion about the<br>


> risks of data loss, etc.<br>


> <br>


> Thanks and regards,<br>


> <br>


> Marion<br>


> <br>


> <br>


>> From: Patrick Farrell &lt;paf@cray.com&gt;<br>


>> To: "Mohr Jr, Richard Frank (Rick Mohr)" &lt;rmohr@utk.edu&gt;, Marion Hakanson<br>


>>       &lt;hakansom@ohsu.edu&gt;<br>


>> CC: "lustre-discuss@lists.lustre.org" &lt;lustre-discuss@lists.lustre.org&gt;<br>


>> Subject: Re: [lustre-discuss] LU-11465 OSS/MDS deadlock in 2.10.5<br>


>> Date: Fri, 19 Oct 2018 17:36:56 +0000<br>


>> <br>


>> There is a somewhat hidden danger with eviction: You can get silent data loss.  The simplest example is buffered writes (i.e., any that aren't direct I/O) - Lustre reports completion (i.e., your write() syscall completes) once the data is in the page cache on the client (like any modern file system, including local ones - you can get silent data loss on EXT4, XFS, ZFS, etc., if your disk becomes unavailable before data is written out of the page cache).<br>


>> <br>


>> So if that client with pending dirty data is evicted from the OST the data is destined for - which is essentially what abort recovery does - that data is lost, and the application doesn't get an error (because the write() call has already completed).<br>


>> <br>


>> A message is printed to the console on the client in this case, but you have to know to look for it.  The application will run to completion, but you may still experience data loss, and not know it.  It's just something to keep in mind - applications that run to completion despite evictions may not have completed *successfully*.<br>


>> <br>


>> - Patrick<br>


>> <br>


>> On 10/19/18, 11:42 AM, "lustre-discuss on behalf of Mohr Jr, Richard Frank (Rick Mohr)" &lt;lustre-discuss-bounces@lists.lustre.org on behalf of rmohr@utk.edu&gt; wrote:<br>


>> <br>


>> <br>


>>> On Oct 19, 2018, at 10:42 AM, Marion Hakanson &lt;hakansom@ohsu.edu&gt; wrote:<br>


>>> <br>


>>> Thanks for the feedback.  You're both confirming what we've learned so far, that we had to unmount all the clients (which required rebooting most of them), then reboot all the storage servers, to get things unstuck until the problem recurred.<br>


>>> <br>


>>> I tried abort_recovery on the clients last night, before rebooting the MDS, but that did not help.  Could well be I'm not using it right:<br>


>> <br>


>>    Aborting recovery is a server-side action, not something that is done on the client.  As mentioned by Peter, you can abort recovery on a single target after it is mounted by using “lctl --device &lt;DEV&gt; abort_recovery”.  But you can also just skip over the recovery step when the target is mounted by adding the “-o abort_recov” option to the mount command.  For example,


<br>


>> <br>


>>    mount -t lustre -o abort_recov /dev/my/mdt /mnt/lustre/mdt0<br>


>> <br>


>>    And similarly for OSTs.  So you should be able to just unmount your MDT/OST on the running file system and then remount with the abort_recov option.  From a client perspective, the lustre client will get evicted but should automatically reconnect.  


<br>


>> <br>


>>    Some applications can ride through a client eviction without causing issues, some cannot.  I think it depends largely on how the application does its IO and if there is any IO in flight when the eviction occurs.  I have had to do this a few times on a running cluster, and in my experience, we have had good luck with most of the applications continuing without issues.  Sometimes there are a few jobs that abort, but overall this is better than having to stop all jobs and remount lustre on all the compute nodes.<br>


>> <br>


>>    --<br>


>>    Rick Mohr<br>


>>    Senior HPC System Administrator<br>


>>    National Institute for Computational Sciences<br>


>>    <a href="http://www.nics.tennessee.edu">http://www.nics.tennessee.edu</a><br>


>> <br>


>>    _______________________________________________<br>


>>    lustre-discuss mailing list<br>


>>    lustre-discuss@lists.lustre.org<br>


>>    <a href="http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org">http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org</a><br>


>> <br>


>> <br>


> <br>




<br>


Cheers, Andreas<br>


---<br>


Andreas Dilger<br>


CTO Whamcloud<br>




</div>


</span></font></div>


</body>


</html>