[Lustre-discuss] questions about an OST content

Andreas Dilger andreas.dilger at oracle.com
Wed Nov 10 02:06:43 PST 2010


On 2010-11-09, at 03:07, Aurelien Degremont wrote:
> Andreas Dilger wrote:
>>> Cold replace:
>>> 1 - Empty your OST
>>> 2 - Stop your filesystem
>>> 3 - Replace/reformat using the same index
>>> 4 - Restart using --writeconf
>>> 5 - Remount the clients
>> 6 - fix up the MDS's idea of the OST's last-allocated object.
>>> Did I miss something ?
>> Other than #6, it looks correct.
> 
> How do you fix #6?  What are the actions needed for that?

That is what is described in the rest of this email...

>>> What currently prevents a freshly formatted OST with the same index from registering itself properly (using the first_time flag) with the MGS and MDT when remounting, and:
>>> - refreshing its CONFIG from MGS internal cache
>>> - telling MDT to reset last_rcvd/LAST_ID it knows for this OST.
>>> That way, we could have an easy way to hot replace an OST.
>>> How do you think this can be achieved ?
>> It probably wouldn't be impossible to have a new OST gracefully replace an old one, if that is what the administrator wanted.  Some "special" action would need to be taken on the OST and/or MDT to ensure that this is what the admin wanted, instead of e.g. accidentally inserting some other OST with the same index and corrupting the filesystem because of duplicate object IDs, or not being able to access existing objects on the "real" OST at that index.
>> - the new OST would be best off to start allocating objects at the LAST_ID
>>  of the old OST, so that there is no risk of confusion between objects
>> - the MDT contains the old LAST_ID in its lov_objids file, and it sends this
>>  to the OST at connection time, this is no problem
>> - currently the new OST will refuse to allow the MDT to connect, because it
>>  detects that the old LAST_ID value from the MDT is inconsistent with its
>>  own value
>> - it would be relatively straightforward to have the OST detect if the local
>>  LAST_ID value was "new" and use the MDT value instead
> 
> Can we base this check on the 'first_time' flag?
> I mean, the OST updates its LAST_ID based on what the MDT tells it only if it has the 'first_time' flag set.

The problem is that if the 'first_time' flag is always set on a new OST, then any OST accidentally claiming the same index (e.g. from a test filesystem of the same name, or from user error) could replace the valid OST.  This 'first_time' flag could not be the default.
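
To make the concern concrete, here is a minimal Python sketch of the connect-time decision. It is not Lustre code: the function name, the exception, and both boolean parameters are hypothetical, and "consistent" is simplified to "equal". It only illustrates that a freshly formatted OST should adopt the MDT's LAST_ID when the administrator has explicitly asked for a replacement, and refuse otherwise.

```python
class ConnectRefused(Exception):
    """Hypothetical error: the OST will not accept the MDT's LAST_ID."""


def choose_last_id(local_last_id, mdt_last_id, freshly_formatted, replace_requested):
    """Pick the LAST_ID an OST should use when the MDT connects (sketch only)."""
    # Simplification: the two values count as consistent only when they match.
    if local_last_id == mdt_last_id:
        return local_last_id
    # Adopt the MDT's value only for a new OST that was explicitly primed
    # as a replacement; being freshly formatted is not enough on its own.
    if freshly_formatted and replace_requested:
        return mdt_last_id
    # Anything else could be a stray OST claiming the same index.
    raise ConnectRefused(
        "LAST_ID mismatch: local=%d, MDT=%d" % (local_last_id, mdt_last_id)
    )
```

With this shape, an OST from a test filesystem that happens to claim the same index falls through to the refusal, because nobody requested a replacement for it.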

>> - the danger is if the LAST_ID file was lost for some reason (e.g. corruption
>>  causes e2fsck to erase it).  In that case, the OST startup code should be
>>  smart enough to regenerate LAST_ID based on walking the object directories,
>>  which would also avoid the need to do this in e2fsck/lfsck (which can only
>>  run offline)
>> - in cases where the on-disk LAST_ID is much lower than the MDT-supplied
>>  value, the OST should just skip precreation of all the intermediate objects
>>  and just start using the new MDT value
> 
> This seems a different feature, even if related, which is "Better handling of LAST_ID corruption".

Partly, yes.
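
As a rough illustration of "regenerate LAST_ID by walking the object directories", the Python sketch below scans a directory tree and returns the highest numeric file name it finds. It assumes objects are stored as files named by their decimal object ID somewhere under one directory, and that non-numeric names (such as LAST_ID itself) can be ignored. The function name and the layout are assumptions for illustration, not the actual OST startup code.

```python
import os


def regenerate_last_id(object_root):
    """Return the highest numeric object name under object_root, or 0 if none."""
    highest = 0
    for _dirpath, _dirnames, filenames in os.walk(object_root):
        for name in filenames:
            # Skip anything that is not a plain decimal object ID,
            # for example the LAST_ID file itself.
            if name.isdigit():
                highest = max(highest, int(name))
    return highest
```

The result is only a lower bound for the next object ID. If the MDT supplies a larger value, the OST would start from the MDT's value rather than precreating everything in between.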

>> - the only other thing is to avoid the case where a "new" OST is accidentally
>>  assigned the same index, when that isn't what is wanted.  There needs to be
>>  some way to "prime" the new OST (that is NOT the default for a newly
>>  formatted OST), or conversely tell the MDT that it should signal the new
>>  OST to take the place of the old one, so that there are not any mistakes
> 
> Indeed, this is important, especially if we want this to support online replacement. Another option when formatting the OST?
> --replace ? Which is only accepted when --index is set?

Yes, that would probably be a good way to handle it from the user interface.  The other question is how to handle this internally.  Probably a flag stored in the mountinfo or last_rcvd file.
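
For the user-interface side, the rule "--replace is only accepted when --index is set" could be checked along these lines. This is a Python sketch with a made-up parser, not the real formatting tool; the program name and function name are hypothetical, and only the two option names come from the discussion above.

```python
import argparse


def parse_format_options(argv):
    """Parse the two options discussed here and enforce their dependency."""
    parser = argparse.ArgumentParser(prog="format-ost-sketch")
    parser.add_argument("--index", type=int, default=None,
                        help="OST index to assign")
    parser.add_argument("--replace", action="store_true",
                        help="mark this OST as a replacement for the old one")
    opts = parser.parse_args(argv)
    # Replacing only makes sense for a known index, so reject it otherwise.
    if opts.replace and opts.index is None:
        parser.error("--replace requires --index")
    return opts
```

Because --replace defaults to off, a newly formatted OST is never primed as a replacement unless the admin asks for it explicitly.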

>> Since this is something that has come up on this list a number of times in the last year, I guess it means that a Lustre filesystem is now outliving the hardware on which it runs, so it would definitely be worthwhile for someone to look at this.  I filed bug 24128 on this, in case anyone wants to work on it.
> 
> Can you also add it to Community project list?

Done.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.



