[Lustre-devel] Failover & Force export for the DMU

Thu Apr 17 09:10:31 PDT 2008

Hi Peter,

Please see my comments.

On Wed, 2008-04-16 at 17:18 -0700, Peter Braam wrote:

> I think that is fine – again, the key issue is not to kill the server
> while it gets these errors.  It may well be that the server needs a
> special “I’m recovering be gentle with errors” mode to avoid
> reasonable panics.

I would say any error returned by the filesystem even in normal
operation should be handled gently :)

> Please explain why we want to export such a pool and on which node we
> want to export it, in fact what is “export” (it should be similar to
> unmount)?  If things are failing, then, on the node that is failing,
> we don’t need this pool anymore, we need to shut things down, in most
> cases for a reboot.  We need the pool on the failover node.

The DMU has the notion of importing and exporting a pool, which is
different from mounting/unmounting a filesystem inside the pool.

Basically, an import consists of scanning and reading the labels of all
the devices of a pool to find out the pool configuration.
After this process, the pool transitions to the imported state, which
means that the DMU knows about the pool (has the pool configuration
cached) and the user can perform any operation on the pool.

Usually after an import ZFS also mounts the filesystems inside the pool
automatically, but this is not relevant here.

In ZFS, an export consists of unmounting any filesystem belonging to the
pool, flushing dirty data, marking the pool as exported on-disk and then
removing the pool configuration from the cache.
In Lustre/ZFS, strictly speaking there are no filesystems mounted, so we
skip the unmount step, but of course the export would fail if Lustre has
any open objsets, so we need to close them first.
After this, the user can only operate on the pool after re-importing
it.
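
To make the import/export states concrete, here is a toy Python model.
It is not DMU code: the class and method names (Pool, import_pool,
open_objset, export_pool) are made up for illustration. It only captures
the rules described above: a pool must be imported before use, and an
export fails while any objset is still open.

```python
class PoolBusyError(Exception):
    """Raised when an export is attempted while objsets are still open."""


class Pool:
    """Toy model of pool import/export state (hypothetical, not the DMU API)."""

    def __init__(self, name):
        self.name = name
        self.imported = False      # True once the configuration is cached
        self.open_objsets = set()  # objsets currently held open (e.g. by Lustre)

    def import_pool(self):
        # Stands in for scanning the device labels and caching the config.
        self.imported = True

    def open_objset(self, objset):
        if not self.imported:
            raise RuntimeError("pool must be imported before use")
        self.open_objsets.add(objset)

    def close_objset(self, objset):
        self.open_objsets.discard(objset)

    def export_pool(self):
        if not self.imported:
            raise RuntimeError("pool is not imported")
        if self.open_objsets:
            raise PoolBusyError(sorted(self.open_objsets))
        # Stands in for: flush dirty data, mark the pool exported on disk,
        # and drop the cached configuration.
        self.imported = False
```

In this model, Lustre closing its objsets corresponds to calling
close_objset() for each one before export_pool().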

So basically, what we need to do when things are failing (on the node
that is failing) is to close the filesystems and export the pool. The
big problem is that the DMU cannot export a pool if the devices are
experiencing fatal write failures, which is why we need a force-export
mechanism.

After that, we need to import the pool on the failover node, mount
all the MDTs/OSTs that were stored there, do recovery, etc. (I'm sure
you understand this process much better than I do :)
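
The whole sequence can be sketched roughly as follows. This is only a
toy in Python under my own assumptions: the pool is a plain dict, the
function names and the "force" parameter are hypothetical, and the
fallback from a normal export to a force-export is just one way the
control flow could look.

```python
def export_pool(pool, force=False):
    """Toy export: refuses while writes are failing unless force is set."""
    if pool["write_failures"] and not force:
        raise OSError("cannot flush dirty data: fatal write failures")
    pool["imported_on"] = None


def fail_over(pool, targets, standby):
    """Move the targets stored on pool to the standby node.

    Returns the list of steps taken, in order.
    """
    # Close every MDT/OST first, otherwise the export would fail.
    steps = [f"close {t}" for t in targets]
    try:
        export_pool(pool)
        steps.append("export")
    except OSError:
        # A normal export is impossible with fatal write failures.
        export_pool(pool, force=True)
        steps.append("force-export")
    pool["imported_on"] = standby
    steps.append(f"import on {standby}")
    steps += [f"mount {t}" for t in targets]
    steps.append("recovery")
    return steps
```

For example, fail_over({"write_failures": True, "imported_on": "oss1"},
["ost0", "ost1"], "oss2") takes the force-export path and leaves the
pool imported on "oss2".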

> In fact there is a very useful distinction to make.  There are two
> failover scenarios:
>      1. fail over to move services away from failures on the OSS.  In
>         this case a reboot/panic is not really harmful.

That's why when I heard about the need for this feature, I immediately
proposed doing a panic, which wouldn't have any consequences assuming
Lustre recovery does its job. But it's not useful in a "multiple pools
in the same server" scenario.

>      2. fail over from a fully functioning OSS/DMU to redistribute
>         services.  In this case we need a control mechanism to turn
>         the device read-only and clean up the DMU.

Why do we need to turn the device read-only in this case? Why can't we
do a clean unmount/export if the devices are fully functioning?
Andreas has told me before that with ldiskfs, doing a clean unmount
could take a lot of time if there's a lot of dirty data, but I don't
believe this will be true with the DMU.
Even if such a problem were to arise, in the DMU it's trivial to limit
the transaction group size and therefore limit the time it takes to sync
a txg.
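
As a back-of-the-envelope illustration of that last point, assume the
time to sync a txg is dominated by how fast the pool can write it out
(a simplification of mine, not a measured property of the DMU). Then
capping the txg size caps the sync time. The function name and the
numbers below are hypothetical.

```python
def max_txg_sync_seconds(txg_limit_bytes, write_bandwidth_bytes_per_s):
    """Rough upper bound on txg sync time: size limit / write bandwidth."""
    if write_bandwidth_bytes_per_s <= 0:
        raise ValueError("write bandwidth must be positive")
    return txg_limit_bytes / write_bandwidth_bytes_per_s


MIB = 2 ** 20
# A 512 MiB txg limit on a pool that sustains 256 MiB/s of writes.
bound = max_txg_sync_seconds(512 * MIB, 256 * MIB)
```

With these made-up figures the bound comes out at 2 seconds; halving the
txg limit halves the bound.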

> Unfortunately we cannot consider mandating that there is only one file
> system per OSS because then we need an idle node to act as the
> failover node.  We must handle the problem of shutting “one or more”
> down, but only in the clean case (2).

In the clean case, we don't need force-export.

Force-export is only really needed if all of the following conditions
are true:

1) We have more than 1 filesystem (MDT/OST) running in the same
userspace process (note how I didn't say "same server". Also note that
for Lustre 2.0, we will have a limitation of 1 userspace process per
server).

2) The MDTs/OSTs are stored in more than 1 ZFS pool (note how I didn't
say "more than 1 device". A single ZFS pool can use multiple disk
devices.).

3) One or more, but not all, of the ZFS pools are suffering from fatal
IO failures.

4) We only want to fail over the MDTs/OSTs stored on the pools that are
suffering IO failures, but we still want to keep the remaining MDTs/OSTs
working on the same server.

If there is a requirement of supporting a scenario where all of these
conditions are true, then we need force-export. From my latest
discussion with Andreas about this, we do need that.
If not all of the conditions are true, we could either do a clean export
or do a panic, depending on the situation.
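
The four conditions can be written down as a small decision function.
This is a sketch in Python, not anything that exists in Lustre; the
function and parameter names are made up. The mapping of the "not all
conditions true" cases is also my simplification: no failing pools
means a clean export, anything else falls back to a panic.

```python
def failure_action(targets_in_process, pools_in_use, failing_pools,
                   keep_healthy_targets):
    """Pick an action for one userspace server process.

    targets_in_process:   MDTs/OSTs running in the process      (condition 1)
    pools_in_use:         ZFS pools holding those targets       (condition 2)
    failing_pools:        pools with fatal IO failures          (condition 3)
    keep_healthy_targets: keep the other targets on this server (condition 4)
    """
    if failing_pools == 0:
        return "clean export"
    needs_force = (
        targets_in_process > 1
        and pools_in_use > 1
        and 0 < failing_pools < pools_in_use
        and keep_healthy_targets
    )
    return "force-export" if needs_force else "panic"
```

For instance, three targets on two pools with one pool failing and the
healthy targets to be kept gives "force-export"; if both pools are
failing, the function returns "panic".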

At least, that is my understanding :)

Thanks,
Ricardo

--

Ricardo Manuel Correia
Lustre Engineering

Sun Microsystems, Inc.
Portugal
Phone +351.214134023 / x58723
Mobile +351.912590825
Email Ricardo.M.Correia at Sun.COM