<HTML>

<HEAD>

<TITLE>Re: [Lustre-devel] Failover & Force export for the DMU</TITLE>

</HEAD>

<BODY>

<FONT SIZE="4"><FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'>I forgot one other comment/question: shutdown of Lustre servers was traditionally sometimes very slow because of timeouts. With the Sandia “kill the export” features, is this still true?<BR>

<BR>

- peter -<BR>

<BR>

<BR>

On 4/17/08 9:10 AM, "Ricardo M. Correia" <Ricardo.M.Correia@Sun.COM> wrote:<BR>

<BR>

</SPAN></FONT></FONT><BLOCKQUOTE><FONT SIZE="4"><FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'>Hi Peter,<BR>

<BR>

Please see my comments.<BR>

<BR>

On Wed, 2008-04-16 at 17:18 -0700, Peter Braam wrote:<BR>

</SPAN></FONT></FONT><BLOCKQUOTE><FONT SIZE="4"><FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'> </SPAN></FONT></FONT><FONT FACE="Calibri, Verdana, Helvetica, Arial"><FONT SIZE="6"><SPAN STYLE='font-size:18pt'>I think that is fine – again, the key issue is not to kill the server while it gets these errors.  It may well be that the server needs a special “I’m recovering, be gentle with errors” mode to avoid reasonable panics.<BR>

</SPAN></FONT></FONT></BLOCKQUOTE><FONT FACE="Calibri, Verdana, Helvetica, Arial"><FONT SIZE="4"><SPAN STYLE='font-size:11pt'><BR>

I would say any error returned by the filesystem even in normal operation should be handled gently :)<BR>

<BR>

</SPAN></FONT></FONT><BLOCKQUOTE><FONT FACE="Calibri, Verdana, Helvetica, Arial"><FONT SIZE="4"><SPAN STYLE='font-size:11pt'> </SPAN></FONT><FONT SIZE="6"><SPAN STYLE='font-size:18pt'>Please explain why we want to export such a pool and on which node we want to export it. In fact, what is “export” (it should be similar to unmount)?  If things are failing, then, on the node that is failing, we don’t need this pool anymore; we need to shut things down, in most cases for a reboot.  We need the pool on the failover node.<BR>

</SPAN></FONT></FONT></BLOCKQUOTE><FONT FACE="Calibri, Verdana, Helvetica, Arial"><FONT SIZE="4"><SPAN STYLE='font-size:11pt'><BR>

The DMU has the notion of importing and exporting a pool, which is different from mounting/unmounting a filesystem inside the pool.<BR>

<BR>

Basically, an import consists of scanning and reading the labels of all the devices of a pool to find out the pool configuration.<BR>

After this process, the pool transitions to the imported state, which means that the DMU knows about the pool (has the pool configuration cached) and the user can perform any operation he desires on the pool.<BR>

<BR>

Usually after an import ZFS also mounts the filesystems inside the pool automatically, but this is not relevant here.<BR>

<BR>

In ZFS, an export consists of unmounting any filesystem belonging to the pool, flushing dirty data, marking the pool as exported on-disk and then removing the pool configuration from the cache.<BR>

In Lustre/ZFS, strictly speaking, there are no filesystems mounted, so we don't do that. But of course the export would fail if Lustre has any open objsets, so we need to close them first.<BR>

After this, the user can only operate/manipulate the pool if he re-imports it.<BR>
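<BR>
A minimal sketch of the import/export lifecycle described above, written as a toy in-memory model in Python. It is not the DMU or ZFS API: the class ToyPool, its methods, and the names "tank", "disk0", "disk1" and "ost0" are all made up for illustration.<BR>
<PRE>
```python
from enum import Enum


class PoolState(Enum):
    EXPORTED = "exported"
    IMPORTED = "imported"


class ToyPool:
    """Toy model of the pool lifecycle; hypothetical, not the DMU interface."""

    def __init__(self, name, device_labels):
        self.name = name
        self.device_labels = list(device_labels)  # what a label scan would find
        self.state = PoolState.EXPORTED
        self.cached_config = None
        self.open_objsets = set()
        self.dirty = False

    def import_pool(self):
        # Import: read the device labels and cache the resulting configuration.
        self.cached_config = {"name": self.name, "devices": sorted(self.device_labels)}
        self.state = PoolState.IMPORTED

    def open_objset(self, objset):
        if self.state is not PoolState.IMPORTED:
            raise RuntimeError("pool must be imported before use")
        self.open_objsets.add(objset)
        self.dirty = True

    def close_objset(self, objset):
        self.open_objsets.discard(objset)

    def export_pool(self):
        # Export is refused while any objset is still open.
        if self.open_objsets:
            names = ", ".join(sorted(self.open_objsets))
            raise RuntimeError("cannot export, open objsets: " + names)
        self.dirty = False               # flush dirty data
        self.state = PoolState.EXPORTED  # mark the pool as exported
        self.cached_config = None        # drop the cached configuration


pool = ToyPool("tank", ["disk1", "disk0"])
pool.import_pool()
pool.open_objset("ost0")
pool.close_objset("ost0")   # must happen before the export
pool.export_pool()
print(pool.state.value)     # exported
```
</PRE>
The model only captures the ordering constraint: objsets are closed first, then dirty data is flushed, and only then is the cached configuration dropped. A force-export would be the path that has to succeed even when the flush step cannot complete.<BR>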

<BR>

So basically, what we need to do when things are failing (on the node that is failing) is to close the filesystems and export the pool. The big problem is that the DMU cannot export a pool if the devices are experiencing fatal write failures, which is why we need a force-export mechanism.<BR>

<BR>

After that, we need to import the pool on the failover node, mount all the MDTs/OSTs that were stored there, do recovery, etc. I'm sure you understand this process much better than I do :)<BR>

<BR>

<BR>

</SPAN></FONT></FONT><BLOCKQUOTE><FONT FACE="Calibri, Verdana, Helvetica, Arial"><FONT SIZE="4"><SPAN STYLE='font-size:11pt'> </SPAN></FONT><FONT SIZE="6"><SPAN STYLE='font-size:18pt'>In fact there is a very useful distinction to make.  There are two failover scenarios:</SPAN></FONT><FONT SIZE="4"><SPAN STYLE='font-size:11pt'> <BR>

</SPAN></FONT></FONT><OL><LI><FONT FACE="Calibri, Verdana, Helvetica, Arial"><FONT SIZE="6"><SPAN STYLE='font-size:18pt'>fail over to move services away from failures on the OSS.  In this case a reboot/panic is not really harmful.</SPAN></FONT><FONT SIZE="4"><SPAN STYLE='font-size:11pt'> <BR>

</SPAN></FONT></FONT></OL></BLOCKQUOTE><FONT FACE="Calibri, Verdana, Helvetica, Arial"><FONT SIZE="4"><SPAN STYLE='font-size:11pt'><BR>

That's why when I heard about the need for this feature, I immediately proposed doing a panic, which wouldn't have any consequences assuming Lustre recovery does its job. But it's not useful in a "multiple pools in the same server" scenario.<BR>

<BR>

</SPAN></FONT></FONT><BLOCKQUOTE><FONT FACE="Calibri, Verdana, Helvetica, Arial"><FONT SIZE="4"><SPAN STYLE='font-size:11pt'> <BR>

</SPAN></FONT></FONT><OL START="2"><LI><FONT FACE="Calibri, Verdana, Helvetica, Arial"><FONT SIZE="6"><SPAN STYLE='font-size:18pt'>fail over from a fully functioning OSS/DMU to redistribute services.  In this case we need a control mechanism to turn the device read-only and clean up the DMU.</SPAN></FONT><FONT SIZE="4"><SPAN STYLE='font-size:11pt'> <BR>

</SPAN></FONT></FONT></OL></BLOCKQUOTE><FONT FACE="Calibri, Verdana, Helvetica, Arial"><FONT SIZE="4"><SPAN STYLE='font-size:11pt'><BR>

Why do we need to turn the device read-only in this case? Why can't we do a clean unmount/export if the devices are fully functioning?<BR>

Andreas has told me before that with ldiskfs, doing a clean unmount could take a lot of time if there's a lot of dirty data, but I don't believe this will be true with the DMU.<BR>

Even if such a problem were to arise, in the DMU it's trivial to limit the transaction group size and therefore limit the time it takes to sync a txg.<BR>
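<BR>
As a rough back-of-the-envelope sketch of that argument in Python: if the amount of dirty data in a txg is capped, the time to sync it is bounded by the cap divided by the sustained write bandwidth. The function name and the numbers below are illustrative assumptions, not DMU tunables or measurements.<BR>
<PRE>
```python
def worst_case_sync_seconds(txg_limit_bytes, write_bandwidth_bytes_per_s):
    """Rough upper bound on the time needed to sync one full txg."""
    if write_bandwidth_bytes_per_s == 0:
        raise ValueError("bandwidth must be non-zero")
    return txg_limit_bytes / write_bandwidth_bytes_per_s


MIB = 1024 * 1024

# Illustrative numbers only: a 512 MiB txg cap on storage sustaining 256 MiB/s.
print(worst_case_sync_seconds(512 * MIB, 256 * MIB))  # 2.0
```
</PRE>
Halving the cap halves the bound, which is why limiting the txg size also limits how long a clean export has to wait.<BR>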

<BR>

</SPAN></FONT></FONT><BLOCKQUOTE><FONT FACE="Calibri, Verdana, Helvetica, Arial"><FONT SIZE="4"><SPAN STYLE='font-size:11pt'> </SPAN></FONT><FONT SIZE="6"><SPAN STYLE='font-size:18pt'>Unfortunately we cannot consider mandating that there is only one file system per OSS because then we need an idle node to act as the failover node.  We must handle the problem of shutting “one or more” down, but only in the clean case (2). <BR>

</SPAN></FONT></FONT></BLOCKQUOTE><FONT FACE="Calibri, Verdana, Helvetica, Arial"><FONT SIZE="4"><SPAN STYLE='font-size:11pt'><BR>

In the clean case, we don't need force-export.<BR>

<BR>

Force-export is only really needed if <B>all</B> of the following conditions are true:<BR>

<BR>

1) We have more than 1 filesystem (MDT/OST) running in the same <U>userspace process</U> (note how I didn't say "same server". Also note that for Lustre 2.0, we will have a limitation of 1 userspace process per server).<BR>

<BR>

2) The MDTs/OSTs are stored in more than 1 ZFS pool (note how I didn't say "more than 1 device". A single ZFS pool can use multiple disk devices).<BR>

<BR>

3) One or more, but not all of the ZFS pools are suffering from fatal IO failures.<BR>

<BR>

4) We only want to fail over the MDTs/OSTs stored on the pools that are suffering IO failures, but we still want to keep the remaining MDTs/OSTs working in the same server.<BR>

<BR>

If there is a requirement of supporting a scenario where all of these conditions are true, then we need force-export. From my latest discussion with Andreas about this, we do need that.<BR>

If not all of the conditions are true, we could either do a clean export or do a panic, depending on the situation.<BR>
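<BR>
The four conditions above can be sketched as a small decision function in Python. This is only an illustration: the function and parameter names are made up, and the choice between "clean export" and "panic" when force-export is not needed (clean export if no pool is failing, panic otherwise) is my simplifying assumption about "depending on the situation".<BR>
<PRE>
```python
def failover_action(targets_in_process, pools_in_use, failing_pools, keep_healthy_targets):
    """Pick an action from the four conditions; a hypothetical helper."""
    if failing_pools == 0:
        # Clean case: the devices are healthy, so a normal export works.
        return "clean export"
    force_export_needed = (
        targets_in_process > 1              # 1) several MDTs/OSTs in one process
        and pools_in_use > 1                # 2) stored in more than one pool
        and pools_in_use > failing_pools    # 3) some, but not all, pools failing
        and keep_healthy_targets            # 4) healthy targets must stay up
    )
    return "force-export" if force_export_needed else "panic"


print(failover_action(4, 2, 1, True))   # force-export
print(failover_action(4, 2, 2, True))   # panic
print(failover_action(4, 2, 0, True))   # clean export
```
</PRE>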

<BR>

At least, that is my understanding :)<BR>

<BR>

Thanks,<BR>

Ricardo<BR>

<BR>

--<BR>

</SPAN><SPAN STYLE='font-size:10pt'><B>Ricardo Manuel Correia<BR>

</B>Lustre Engineering<BR>

</SPAN><SPAN STYLE='font-size:11pt'><BR>

</SPAN><SPAN STYLE='font-size:10pt'><B>Sun Microsystems, Inc.<BR>

</B>Portugal<BR>

Phone +351.214134023 / x58723<BR>

Mobile +351.912590825<BR>

Email Ricardo.M.Correia@Sun.COM<BR>

</SPAN><SPAN STYLE='font-size:11pt'><HR ALIGN=CENTER SIZE="3" WIDTH="95%"></SPAN></FONT></FONT><FONT SIZE="4"><FONT FACE="Consolas, Courier New, Courier"><SPAN STYLE='font-size:10pt'>_______________________________________________<BR>

Lustre-devel mailing list<BR>

Lustre-devel@lists.lustre.org<BR>

<a href="http://lists.lustre.org/mailman/listinfo/lustre-devel">http://lists.lustre.org/mailman/listinfo/lustre-devel</a><BR>

</SPAN></FONT></FONT></BLOCKQUOTE>

</BODY>

</HTML>