[Lustre-discuss] questions about an OST content

Andreas Dilger andreas.dilger at oracle.com
Mon Nov 8 16:04:03 PST 2010


On 2010-11-08, at 14:18, Aurélien Degrémont wrote:
> Tell me if I'm wrong regarding this OST update.
> AFAIK, there are two ways to replace an OST with a new one:
> 
> Hot replace:
> 1 - Disable your OST on MDT (lctl deactivate)
> 2 - Empty your OST
> 3 - Backup the magic files (last_rcvd, LAST_ID, CONFIG/*)
> 4 - Deactivate the OST on all clients also.
> 5 - Unmount the OST
> 6 - Replace, reformat using same index
> 7 - Put back the backup magic files.
> 8 - Restart the OST.
> 9 - Activate the OST everywhere.
> 
> Cold replace:
> 1 - Empty your OST
> 2 - Stop your filesystem
> 3 - Replace/reformat using the same index
> 4 - Restart using --writeconf
> 5 - Remount the clients
6 - fix up the MDS's idea of the OST's last-allocated object.

> Did I miss something?

Other than #6, it looks correct.

> As far as I understand this, the important point here is to have the OST 
> internal information in sync with what the MGS (CONFIG/*) and the MDT 
> (last_rcvd, LAST_ID) know.

Right.

> What currently prevents a freshly formatted OST with the same 
> index from registering itself properly (using the first_time flag) with 
> the MGS and MDT when remounting, and:
>  - refreshing its CONFIG from MGS internal cache
>  - telling MDT to reset last_rcvd/LAST_ID it knows for this OST.
> That way, we could have an easy way to hot replace an OST.
> How do you think this can be achieved?

It would probably be possible to have a new OST gracefully replace an old one, if that is what the administrator wanted.  Some "special" action would need to be taken on the OST and/or MDT to ensure that this is what the admin wanted, instead of e.g. accidentally inserting some other OST with the same index and corrupting the filesystem because of duplicate object IDs, or not being able to access existing objects on the "real" OST at that index.

- the new OST would be best off to start allocating objects at the LAST_ID
  of the old OST, so that there is no risk of confusion between objects
- the MDT contains the old LAST_ID in its lov_objids file and sends it to the
  OST at connection time, so this is no problem
- currently the new OST will refuse to allow the MDT to connect, because it
  detects that the old LAST_ID value from the MDT is inconsistent with its
  own value
- it would be relatively straightforward to have the OST detect if the local
  LAST_ID value was "new" and use the MDT value instead
- the danger is if the LAST_ID file was lost for some reason (e.g. corruption
  causes e2fsck to erase it).  In that case, the OST startup code should be
  smart enough to regenerate LAST_ID by walking the object directories,
  which would also avoid the need to do this in e2fsck/lfsck (which can only
  run offline)
- in cases where the on-disk LAST_ID is much lower than the MDT-supplied
  value, the OST should just skip precreation of all the intermediate objects
  and just start using the new MDT value
- the only other thing is to avoid the case where a "new" OST is accidentally
  assigned the same index, when that isn't what is wanted.  There needs to be
  some way to "prime" the new OST (that is NOT the default for a newly
  formatted OST), or conversely tell the MDT that it should signal the new
  OST to take the place of the old one, so that there are not any mistakes
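
To make the points above concrete, here is a rough sketch of the decision the OST could make at MDT connect time.  This is not Lustre code: the names (Action, scan_last_id, reconcile, primed_for_replace) are made up for illustration, the "primed" flag stands in for whatever explicit admin action ends up being chosen, and the scan assumes objects are stored as files whose names are their decimal object IDs somewhere under an object directory.  Keeping the local value when it is at or above the MDT value is also my assumption, not something stated above.

```python
import os
from enum import Enum


class Action(Enum):
    USE_LOCAL = "keep the local LAST_ID"
    ADOPT_MDT = "adopt the MDT value and skip precreating intermediate objects"
    REFUSE = "refuse the MDT connection"


def scan_last_id(object_root):
    """Rebuild LAST_ID by walking the object directories.

    Assumption: every object is a file named by its decimal object ID;
    anything else (e.g. the LAST_ID file itself) is ignored.
    """
    highest = 0
    for _dirpath, _dirnames, filenames in os.walk(object_root):
        for name in filenames:
            if name.isascii() and name.isdigit():
                highest = max(highest, int(name))
    return highest


def reconcile(local_last_id, mdt_last_id, primed_for_replace, object_root=None):
    """Return (Action, last_id) for an OST receiving the MDT's LAST_ID.

    local_last_id is None when the on-disk LAST_ID file is missing.
    primed_for_replace is the hypothetical explicit admin signal that
    this OST is meant to take the place of the old one.
    """
    if local_last_id is None:
        if object_root is None:
            # Nothing to rebuild from, so do not guess.
            return Action.REFUSE, None
        local_last_id = scan_last_id(object_root)

    if local_last_id >= mdt_last_id:
        # Local value already covers everything the MDT knows about.
        return Action.USE_LOCAL, local_last_id

    if primed_for_replace:
        # Replacement OST: start allocating where the old OST stopped.
        return Action.ADOPT_MDT, mdt_last_id

    # Unprimed OST with a much lower LAST_ID: today's behaviour, refuse.
    return Action.REFUSE, local_last_id
```

For example, reconcile(0, 1000, True) picks ADOPT_MDT with 1000 as the starting point, while the same call with primed_for_replace=False refuses the connection, which is the protection against accidentally inserting the wrong OST at that index.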

In conclusion, most of this is already close to working, but needs some amount of effort to get it tested and working smoothly.

Since this is something that has come up on this list a number of times in the last year, I guess it means that a Lustre filesystem is now outliving the hardware on which it runs, so it would definitely be worthwhile for someone to look at this.  I filed bug 24128 on this, in case anyone wants to work on it.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.



