[Lustre-devel] Simplifying Interoperation
Robert Read
rread at sun.com
Wed Oct 22 11:06:11 PDT 2008
Huang,
Have you seen the new recovery interop architecture that came out of
the menlo park meetings?
http://arch.lustre.org/index.php?title=Simplified_Interoperation
robert
On Oct 21, 2008, at 22:44 , Huang Hua wrote:
> Hello Eric,
> I have some updates on the interop development.
>
> 1. I have implemented a "barrier" on the client (using a read-write
> semaphore, though Andreas suggests using a mutex).
> Before upgrading the MDS/OSS, we set up the barrier on all clients,
> stopping new requests from being sent to the MDS and OSS.
> Currently this is done manually, e.g. by running a command on all
> clients to barrier them.
> Then the user can explicitly sync Lustre from the clients, MDS, and
> OSS to ensure that no outstanding requests remain.
> Then all mdc and osc locks are cancelled on the clients manually. This
> step may be optional; I will do more testing.
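The client barrier in step 1 can be sketched as follows. This is a toy user-space model in Python, not the real implementation (which is a kernel read-write semaphore); all names are illustrative. Normal requests enter in "read" mode, while the barrier command takes the "write" side, blocking new requests and draining the ones in flight:

```python
import threading

class ClientBarrier:
    """Toy model of the client-side request barrier: outgoing requests
    take the barrier in shared mode; the barrier command takes it
    exclusively, blocking new requests and waiting for in-flight
    requests to drain."""

    def __init__(self):
        self._cond = threading.Condition()
        self._in_flight = 0
        self._raised = False

    def request_start(self):
        with self._cond:
            while self._raised:          # new requests block while barrier is up
                self._cond.wait()
            self._in_flight += 1

    def request_done(self):
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def raise_barrier(self):
        with self._cond:
            self._raised = True
            while self._in_flight > 0:   # drain outstanding requests
                self._cond.wait()

    def lower_barrier(self):
        with self._cond:
            self._raised = False
            self._cond.notify_all()
```

A mutex, as Andreas suggests, would serialize requests against each other as well; the reader-writer form only serializes requests against the barrier itself.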
>
> 2. The user can now upgrade the MDS. In this step, we need to run
> "tunefs.lustre --write_conf" to erase the old configuration, because
> that configuration cannot be recognized by a 2.0 MDS.
>
> 3. The user then upgrades the OSSs, or restarts them to regenerate the
> configuration. Either works.
>
> 4. After that, we clean up the barrier on the clients manually. The
> clients will reconnect to the servers, recover, and continue to run
> seamlessly.
>
>
> The problem here is that we have to set up and clean up the barrier on
> all clients by hand.
> Ideally, we would do it from the MDS/MGS with a DLM lock, or something
> similar.
> If this is strongly required, I will implement it later.
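The DLM-based alternative could work roughly like this sketch: each client holds a shared notification lock with a blocking callback, and when the MGS enqueues the lock exclusively, every callback fires, giving each client the cue to raise its barrier. This is a toy model; the class and method names are illustrative, not real Lustre DLM interfaces:

```python
class NotificationLock:
    """Toy sketch of a DLM-style barrier notification: clients register
    a blocking callback on a shared lock, and an exclusive enqueue by
    the server fires every callback before the grant can proceed."""

    def __init__(self):
        self._holders = []                    # (client_id, blocking_callback)

    def enqueue_shared(self, client_id, on_revoke):
        # A client takes the lock in shared mode and registers its
        # blocking callback (e.g. "raise the request barrier").
        self._holders.append((client_id, on_revoke))

    def enqueue_exclusive(self):
        # An exclusive request conflicts with every shared holder:
        # fire each blocking callback, then drop the shared grants.
        for client_id, on_revoke in self._holders:
            on_revoke(client_id)
        self._holders.clear()
```

As Eric notes below, the current assumptions about lock acquisition timeouts would need revisiting for this to be safe during a slow upgrade.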
> So far, preliminary upgrade/downgrade tests have passed. More testing
> is underway.
>
> I will answer some of your questions inline. Please see below.
>
>
>
>
> Eric Barton wrote:
>> Here are some first thoughts on Huang Hua's idea to simplify version
>> interoperation, and an invitation for further comments...
>>
>> 1. This scheme should not interfere with upgrade via failover pairs,
>> and it must also allow the MDS to be upgraded separately from the
>> OSSs. I think this means in general that we have to allow
>> piecemeal server upgrades.
>>
>> 2. This scheme needs a mechanism that...
>>
>> a) notifies clients when a particular server is about to upgrade so
>> that update operations are blocked until the upgrade completes
>> and the client reconnects to the upgraded (and/or failed over)
>> server.
>>
> Currently, this notification is done manually, by running a command on
> every client.
>
>> b) notifies the server when all clients have completed preparation
>> for the upgrade so that no further requests require resend.
>>
> This is done once barriers have been set up on all clients and sync
> has been run on them.
>
>> c) notifies clients when all outstanding updates have been
>> committed. If the server crashes before this point, client
>> replay is still required. Clients must not poll for this since
>> the server is shutting down.
>>
>> The DLM seems the right basic mechanism to notify clients, however
>> current assumptions about acquisition timeouts might be an issue.
>>
>> We must also ensure that the race between this server upgrade
>> process and connection establishment (including any new
>> notification locks) by new clients is handled consistently.
>>
> I think these races should be avoided by the user.
>
>
>>
>> 3. It's not clear to me that we need to evict, or even clean the
>> client cache provided the client doesn't attempt any more writes
>> until it has connected to the failover server.
> During an upgrade, we do not need to evict the clients.
> But during a downgrade, we have to evict them, because
> the 1.8 MDS does not understand FIDs; it only knows inode numbers.
> And while a 1.8 client is talking to a 2.0 MDS, they talk in FIDs and
> know nothing about real inode numbers.
>
>
>> The client can
>> re-acquire all the locks covering its cache during recovery after
>>    the upgrade - and there is no need for request reformatting here
>> since locks are replayed explicitly (i.e. new requests are
>> formatted from scratch using the correct protocol version).
>>
>> It does seem advisable however to clean the cache before such a
>> significant system incident.
>>
>> 4. We can avoid reformatting requests during open replay if this is
>>    also done explicitly.
>>
> While upgrading, the client will replay opens.
> The server has already committed these opens,
> so the 2.0 MDS only needs to "open" each file and return the handle
> to the client.
>
>
>> 5. This scheme prevents recovery on clients that were disconnected
>> when the upgrade began. Such clients will simply be evicted when
>> they reconnect even though the server should actually have
>> committed all their replayable requests.
>>
>> If this can be prevented, we can probably also dispense with much
>> of the notification described in (2) above. However it would
>> require (a) a change in the connection protocol to get clients to
>> purge their own replay queue and (b) changes to ensure resent
>> requests can be reconstructed from scratch (but maybe (b) is just
>> another way of saying "request reformatting").
>>
>>    If this is doable - it further raises the question of whether
>>    simply making all server requests synchronous during upgrades is
>>    enough to resolve most interoperation issues.
>>
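The connect-time purge suggested in point (5) could look something like this sketch. The EVERYTHING_COMMITTED flag name is purely hypothetical, not a real Lustre connect flag; it stands in for whatever connection-protocol change would tell a client that the server committed all its replayable requests:

```python
def on_reconnect(server_flags, replay_queue):
    """Toy sketch of point (5): if the connect reply carries a
    (hypothetical) EVERYTHING_COMMITTED flag, the client purges its
    own replay queue instead of being evicted; otherwise it replays
    as usual and returns the requests to resend."""
    if "EVERYTHING_COMMITTED" in server_flags:
        replay_queue.clear()        # nothing to replay; eviction avoided
        return []
    return list(replay_queue)       # requests to resend/replay
```

This is the cheap half of the idea; the expensive half, per (b), is making resent requests reconstructible from scratch in the new protocol version.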
>> 6. This is all about client/server communications. Are there any
>> issues for inter-server interoperation?
>>
>
> The MDS-OSS protocol does not change much.
> In my testing, there are no inter-server interop issues.
>
>
>> 7. Clients and servers may have to run with different versions for
>> extended periods (one customer ran like this for months). Does
>> this raise any issues with this scheme?
>>
> I do not see any issues.
> More testing is needed.
>
> Thanks,
> Huang Hua
>
>> Cheers,
>> Eric
>>
>>
>
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel