[lustre-discuss] How to eliminate zombie OSTs

Alejandro Sierra algsierra at gmail.com
Wed Aug 9 09:44:48 PDT 2023


Hello,

In 2018 we deployed a Lustre 2.10.5 system with 20 OSTs on two OSSes
with 4 JBODs, each enclosure holding 24 disks of 12 TB, for a total of
nearly 1 PB. Since then we have had power failures and failed RAID
controller cards, all of which forced us to adjust the configuration.
After the last failure, the system keeps logging error messages about
OSTs that are no longer in the system. On the MDS I run

# lctl dl

and I get the 20 currently active OSTs:

oss01.lanot.unam.mx     -       OST00   /dev/disk/by-label/lustre-OST0000
oss01.lanot.unam.mx     -       OST01   /dev/disk/by-label/lustre-OST0001
oss01.lanot.unam.mx     -       OST02   /dev/disk/by-label/lustre-OST0002
oss01.lanot.unam.mx     -       OST03   /dev/disk/by-label/lustre-OST0003
oss01.lanot.unam.mx     -       OST04   /dev/disk/by-label/lustre-OST0004
oss01.lanot.unam.mx     -       OST05   /dev/disk/by-label/lustre-OST0005
oss01.lanot.unam.mx     -       OST06   /dev/disk/by-label/lustre-OST0006
oss01.lanot.unam.mx     -       OST07   /dev/disk/by-label/lustre-OST0007
oss01.lanot.unam.mx     -       OST08   /dev/disk/by-label/lustre-OST0008
oss01.lanot.unam.mx     -       OST09   /dev/disk/by-label/lustre-OST0009
oss02.lanot.unam.mx     -       OST15   /dev/disk/by-label/lustre-OST000f
oss02.lanot.unam.mx     -       OST16   /dev/disk/by-label/lustre-OST0010
oss02.lanot.unam.mx     -       OST17   /dev/disk/by-label/lustre-OST0011
oss02.lanot.unam.mx     -       OST18   /dev/disk/by-label/lustre-OST0012
oss02.lanot.unam.mx     -       OST19   /dev/disk/by-label/lustre-OST0013
oss02.lanot.unam.mx     -       OST25   /dev/disk/by-label/lustre-OST0019
oss02.lanot.unam.mx     -       OST26   /dev/disk/by-label/lustre-OST001a
oss02.lanot.unam.mx     -       OST27   /dev/disk/by-label/lustre-OST001b
oss02.lanot.unam.mx     -       OST28   /dev/disk/by-label/lustre-OST001c
oss02.lanot.unam.mx     -       OST29   /dev/disk/by-label/lustre-OST001d

but I also get 5 that are not currently active; in fact, they no longer exist:

 28 IN osp lustre-OST0014-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
 29 UP osp lustre-OST0015-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
 30 UP osp lustre-OST0016-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
 31 UP osp lustre-OST0017-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
 32 UP osp lustre-OST0018-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
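
(On the MDS the state of those osp devices can presumably also be read
with something like

# lctl get_param osp.lustre-OST001*.active

assuming the osp "active" flag is the relevant one here.)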

When I try to eliminate them with

lctl conf_param -P osp.lustre-OST0015-osc-MDT0000.active=0

I get this error:

conf_param: invalid option -- 'P'
set a permanent config parameter.
This command must be run on the MGS node
usage: conf_param [-d] <target.keyword=val>
  -d  Remove the permanent setting.
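
From the usage message I suspect the -P flag belongs to set_param
rather than conf_param, so perhaps one of these is the intended form,
run on the MGS node (I have not verified that either one works here):

# lctl set_param -P osp.lustre-OST0015-osc-MDT0000.active=0
# lctl conf_param lustre-OST0015.osc.active=0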

If I do

lctl --device 28 deactivate

I don't get an error, but nothing changes.
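
From what I have read, lctl deactivate on the MDS only stops new
object allocation and does not remove the osp device from the
configuration logs, so the stale entries presumably come back after a
remount. If that is right, perhaps the only real fix is to regenerate
the config logs with a writeconf while the filesystem is stopped,
roughly like this (untested; <mdt-device> is a placeholder for our
actual MDT device). On the MDS, with the filesystem unmounted:

# tunefs.lustre --writeconf /dev/<mdt-device>

and on each OSS, for every OST:

# tunefs.lustre --writeconf /dev/disk/by-label/lustre-OST0000

remounting the MGS/MDT first and then the OSTs.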

What can I do?

Thank you in advance for any help.

--
Alejandro Aguilar Sierra
LANOT, ICAyCC, UNAM

