[lustre-discuss] Correct procedure for OST replacement

Etienne Aujames eaujames at ddn.com
Tue Oct 25 05:15:51 PDT 2022


Hello,

I think you hit the following bug:
https://jira.whamcloud.com/browse/LU-15000 MDS crashes with
(osp_dev.c:1404:osp_obd_connect()) ASSERTION( osp->opd_connects == 1 )
failed

Stephane Thiell reported this issue and fixed it by patching his 2.12.7
version with https://review.whamcloud.com/46552 (2.15 backport:  
https://review.whamcloud.com/47515).

A backport for the b2_15 branch has been submitted but has not landed yet:
https://review.whamcloud.com/c/fs/lustre-release/+/48898

You could also check his LAD'22 presentation on removing OSTs (lctl
del_ost):
"A filesystem coming of age: live hardware upgrade practices at
Stanford Research Computing"
(https://www.eofs.eu/_media/events/lad22/2.5-stanfordrc_s_thiell.pdf)

Etienne AUJAMES

On Tue, 2022-10-25 at 10:12 +0000, Redl, Robert wrote:
> Dear Lustre Experts,
> 
> some time ago we removed an OST. We followed the instructions from
> the documentation (
> https://doc.lustre.org/lustre_manual.xhtml#lustremaint.remove_ost
> ) including cleaning up the logs from all related entries using
> llog_cancel. After the removal the system worked normally.
> 
> Now we are trying to add a new OST reusing the same index. If the OST
> is created with mkfs.lustre --replace, then it is possible to mount
> the OST, but it is not possible to mount the whole filesystem
> anymore. A client would see the following error message:
> 
> kernel: LustreError:
> 70451:0:(obd_config.c:1499:class_process_config()) no device for:
> project-OST0007-osc-ffff914108c2e800
> kernel: LustreError:
> 70451:0:(obd_config.c:2001:class_config_llog_handler())
> MGC10.163.52.14@tcp: cfg command failed: rc = -22
> kernel: Lustre:    cmd=cf00b 0:project-OST0007-osc  1:
> 10.163.52.20@tcp
> kernel: LustreError: 1760:0:(mgc_request.c:612:do_requeue()) failed
> processing log: -22
> 
> In order to make the filesystem mountable again, all log entries
> created by mounting the OST must be removed using llog_cancel.
> 
> If the OST is created using mkfs.lustre without --replace, then the
> OST itself is not mountable. The following error message is shown:
> 
> kernel: LustreError: 140-5: Server project-OST0007 requested index 7,
> but that index is already in use. Use --writeconf to force
> kernel: LustreError: 7302:0:(mgs_handler.c:503:mgs_target_reg())
> Failed to write project-OST0007 log (-98)
> 
> Given that the --writeconf suggested in the error message requires a
> full shutdown of the system, we would like to avoid that.
> 
> I wonder if we maybe overlooked something when the OST was removed.
> The logs for project-client, project-MDT0000, and project-MDT0001 are
> not showing any traces of the old OST anymore. Is there anything more
> that needs to be done to make Lustre forget that an OST with a given
> index existed at some point?
> 
> Lustre Version: 2.15.1, ZFS-backend.
> 
> Thanks a lot!
> Robert
> 
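
As a rough aid for the check described above (confirming that the
configuration logs no longer reference the old OST), here is a minimal
Python sketch. It assumes you have saved the text printed by
"lctl --device MGS llog_print project-client" (and likewise for the MDT
logs) and that each record is a brace-delimited line beginning with
"index: N". The sample records and the helper name find_target_records
are illustrative only and are not taken from the system in this thread.

```python
import re

# Illustrative llog_print-style records; NOT taken from the poster's system.
SAMPLE_LLOG = """\
- { index: 2, event: attach, device: project-clilov, type: lov, UUID: project-clilov_UUID }
- { index: 21, event: attach, device: project-OST0007-osc, type: osc, UUID: project-clilov_UUID }
- { index: 22, event: setup, device: project-OST0007-osc, UUID: project-OST0007_UUID, node: 10.163.52.20@tcp }
- { index: 25, event: attach, device: project-OST0008-osc, type: osc, UUID: project-clilov_UUID }
"""

# The first "index:" field on a line is taken to be the llog record index.
INDEX_RE = re.compile(r"\bindex:\s*(\d+)")


def find_target_records(llog_text, target):
    """Return the record indices of all lines that mention `target`."""
    hits = []
    for line in llog_text.splitlines():
        if target not in line:
            continue
        match = INDEX_RE.search(line)
        if match:
            hits.append(int(match.group(1)))
    return hits


if __name__ == "__main__":
    # Any index listed here is a record that still references the target.
    print(find_target_records(SAMPLE_LLOG, "project-OST0007"))
```

An empty list for each of project-client, project-MDT0000 and
project-MDT0001 would match the state described in the original
message. It does not rule out the LU-15000 assertion mentioned above.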