[lustre-discuss] How to eliminate zombie OSTs
Alejandro Sierra
algsierra at gmail.com
Wed Aug 9 09:44:48 PDT 2023
Hello,
In 2018 we deployed a Lustre 2.10.5 system with 20 OSTs on two OSSes
with 4 JBOD enclosures, each enclosure holding 24 disks of 12 TB, for
a total of nearly 1 PB. Since then we have had power failures and
failed RAID controller cards, each of which forced us to adjust the
configuration. After the last failure, the system keeps sending error
messages about OSTs that are no longer in the system. On the MDS I run
# lctl dl
and I get the 20 currently active OSTs:
oss01.lanot.unam.mx - OST00 /dev/disk/by-label/lustre-OST0000
oss01.lanot.unam.mx - OST01 /dev/disk/by-label/lustre-OST0001
oss01.lanot.unam.mx - OST02 /dev/disk/by-label/lustre-OST0002
oss01.lanot.unam.mx - OST03 /dev/disk/by-label/lustre-OST0003
oss01.lanot.unam.mx - OST04 /dev/disk/by-label/lustre-OST0004
oss01.lanot.unam.mx - OST05 /dev/disk/by-label/lustre-OST0005
oss01.lanot.unam.mx - OST06 /dev/disk/by-label/lustre-OST0006
oss01.lanot.unam.mx - OST07 /dev/disk/by-label/lustre-OST0007
oss01.lanot.unam.mx - OST08 /dev/disk/by-label/lustre-OST0008
oss01.lanot.unam.mx - OST09 /dev/disk/by-label/lustre-OST0009
oss02.lanot.unam.mx - OST15 /dev/disk/by-label/lustre-OST000f
oss02.lanot.unam.mx - OST16 /dev/disk/by-label/lustre-OST0010
oss02.lanot.unam.mx - OST17 /dev/disk/by-label/lustre-OST0011
oss02.lanot.unam.mx - OST18 /dev/disk/by-label/lustre-OST0012
oss02.lanot.unam.mx - OST19 /dev/disk/by-label/lustre-OST0013
oss02.lanot.unam.mx - OST25 /dev/disk/by-label/lustre-OST0019
oss02.lanot.unam.mx - OST26 /dev/disk/by-label/lustre-OST001a
oss02.lanot.unam.mx - OST27 /dev/disk/by-label/lustre-OST001b
oss02.lanot.unam.mx - OST28 /dev/disk/by-label/lustre-OST001c
oss02.lanot.unam.mx - OST29 /dev/disk/by-label/lustre-OST001d
but I also get 5 entries for OSTs that are not currently active and in
fact no longer exist:
28 IN osp lustre-OST0014-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
29 UP osp lustre-OST0015-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
30 UP osp lustre-OST0016-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
31 UP osp lustre-OST0017-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
32 UP osp lustre-OST0018-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
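(For anyone following along: OST names embed the OST index in
hexadecimal, so the stale entries OST0014 through OST0018 above are
simply OST indices 20 through 24 in decimal. A quick way to check the
mapping:)

```shell
# OST device names use a 4-digit hex index; print the names for
# decimal indices 20..24, which match the five stale entries above.
for i in $(seq 20 24); do
    printf 'lustre-OST%04x\n' "$i"
done
```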
When I try to eliminate them with
lctl conf_param -P osp.lustre-OST0015-osc-MDT0000.active=0
I get this error:
conf_param: invalid option -- 'P'
set a permanent config parameter.
This command must be run on the MGS node
usage: conf_param [-d] <target.keyword=val>
-d Remove the permanent setting.
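(As the usage message shows, `lctl conf_param` does not take a -P
flag; the persistent flag belongs to `lctl set_param -P`. A sketch of
the two forms I believe are correct here, per the Lustre Operations
Manual — the target names below are taken from the listing above and
may need adjusting, and both commands must be run on the MGS node:)

```shell
# conf_param form: no -P, and the parameter is named <ost>.osc.active
lctl conf_param lustre-OST0015.osc.active=0

# set_param -P form (the flag the original command intended):
lctl set_param -P osp.lustre-OST0015-osc-MDT0000.active=0
```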
If I run
lctl --device 28 deactivate
it completes without error, but nothing changes.
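(Since deactivation only hides the devices for the running session,
the usual way to purge stale OSTs from the configuration entirely is
to regenerate the configuration logs with --writeconf, as described in
the Lustre Operations Manual under "Regenerating Lustre Configuration
Logs". A hedged sketch — the device paths are placeholders, and the
whole filesystem must be stopped first:)

```shell
# With the filesystem unmounted on all clients and servers:
tunefs.lustre --writeconf /dev/mdtdev    # on the MDT device (MDS)
tunefs.lustre --writeconf /dev/ostdev    # on each remaining OST (OSSes)

# Then remount the MGS/MDT first and the OSTs afterwards; OSTs that
# are absent when the logs are regenerated should no longer appear.
```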
What can I do?
Thank you in advance for any help.
--
Alejandro Aguilar Sierra
LANOT, ICAyCC, UNAM