[lustre-discuss] How to eliminate zombie OSTs
Jeff Johnson
jeff.johnson at aeoncomputing.com
Wed Aug 9 09:49:41 PDT 2023
Alejandro,
Is your MGS located on the same node as your primary MDT? (combined MGS/MDT
node)
--Jeff
On Wed, Aug 9, 2023 at 9:46 AM Alejandro Sierra via lustre-discuss <
lustre-discuss at lists.lustre.org> wrote:
> Hello,
>
> In 2018 we deployed a Lustre 2.10.5 system with 20 OSTs on two OSSes
> with 4 JBOD enclosures, each with 24 disks of 12 TB, for a total of
> nearly 1 PB. Since then we have had power failures and failed RAID
> controller cards, each of which forced us to adjust the configuration.
> After the last failure, the system keeps sending error messages about
> OSTs that are no longer in the system. On the MDS I run
>
> # lctl dl
>
> and I get the 20 currently active OSTs
>
> oss01.lanot.unam.mx - OST00 /dev/disk/by-label/lustre-OST0000
> oss01.lanot.unam.mx - OST01 /dev/disk/by-label/lustre-OST0001
> oss01.lanot.unam.mx - OST02 /dev/disk/by-label/lustre-OST0002
> oss01.lanot.unam.mx - OST03 /dev/disk/by-label/lustre-OST0003
> oss01.lanot.unam.mx - OST04 /dev/disk/by-label/lustre-OST0004
> oss01.lanot.unam.mx - OST05 /dev/disk/by-label/lustre-OST0005
> oss01.lanot.unam.mx - OST06 /dev/disk/by-label/lustre-OST0006
> oss01.lanot.unam.mx - OST07 /dev/disk/by-label/lustre-OST0007
> oss01.lanot.unam.mx - OST08 /dev/disk/by-label/lustre-OST0008
> oss01.lanot.unam.mx - OST09 /dev/disk/by-label/lustre-OST0009
> oss02.lanot.unam.mx - OST15 /dev/disk/by-label/lustre-OST000f
> oss02.lanot.unam.mx - OST16 /dev/disk/by-label/lustre-OST0010
> oss02.lanot.unam.mx - OST17 /dev/disk/by-label/lustre-OST0011
> oss02.lanot.unam.mx - OST18 /dev/disk/by-label/lustre-OST0012
> oss02.lanot.unam.mx - OST19 /dev/disk/by-label/lustre-OST0013
> oss02.lanot.unam.mx - OST25 /dev/disk/by-label/lustre-OST0019
> oss02.lanot.unam.mx - OST26 /dev/disk/by-label/lustre-OST001a
> oss02.lanot.unam.mx - OST27 /dev/disk/by-label/lustre-OST001b
> oss02.lanot.unam.mx - OST28 /dev/disk/by-label/lustre-OST001c
> oss02.lanot.unam.mx - OST29 /dev/disk/by-label/lustre-OST001d
>
> but I also get 5 that are not currently active and that, in fact, no
> longer exist:
>
> 28 IN osp lustre-OST0014-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
> 29 UP osp lustre-OST0015-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
> 30 UP osp lustre-OST0016-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
> 31 UP osp lustre-OST0017-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
> 32 UP osp lustre-OST0018-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
>
> When I try to eliminate them with
>
> lctl conf_param -P osp.lustre-OST0015-osc-MDT0000.active=0
>
> I get the error
>
> conf_param: invalid option -- 'P'
> set a permanent config parameter.
> This command must be run on the MGS node
> usage: conf_param [-d] <target.keyword=val>
> -d Remove the permanent setting.
>
> If I do
>
> lctl --device 28 deactivate
>
> I don't get an error, but nothing changes.
>
> What can I do?
>
> Thank you in advance for any help.
>
> --
> Alejandro Aguilar Sierra
> LANOT, ICAyCC, UNAM
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
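For reference, the `-P` flag shown above belongs to `lctl set_param`, not `lctl conf_param`; on 2.10, `conf_param` takes the plain `<target>.<keyword>=<val>` form and must be run on the MGS node. A hedged sketch of the usual cleanup follows: the OST indices are taken from the listing above, so verify them against your own `lctl dl` output before running anything.

```shell
# On the MGS node (not a client), permanently mark each stale OST
# inactive. Note conf_param has no -P option; that flag is set_param's.
lctl conf_param lustre-OST0014.osc.active=0
lctl conf_param lustre-OST0015.osc.active=0
lctl conf_param lustre-OST0016.osc.active=0
lctl conf_param lustre-OST0017.osc.active=0
lctl conf_param lustre-OST0018.osc.active=0

# Equivalently, the set_param -P spelling (also run on the MGS):
#   lctl set_param -P osp.lustre-OST0014-osc-MDT0000.active=0

# To erase the stale entries from the configuration logs entirely, the
# heavier route is a writeconf: unmount every target, then on each MDT
# and OST device run
#   tunefs.lustre --writeconf <device>
# and remount the MGS/MDT first, OSTs afterwards.
```

`lctl --device 28 deactivate` only changes the running state on the node where it is issued and does not survive a remount, which would explain why it appeared to have no lasting effect.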
--
------------------------------
Jeff Johnson
Co-Founder
Aeon Computing
jeff.johnson at aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001 f: 858-412-3845
m: 619-204-9061
4170 Morena Boulevard, Suite C - San Diego, CA 92117
High-Performance Computing / Lustre Filesystems / Scale-out Storage