<div dir="ltr">Alejandro,<div><br></div><div>Is your MGS located on the same node as your primary MDT? (combined MGS/MDT node)</div><div><br></div><div>--Jeff</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Aug 9, 2023 at 9:46 AM Alejandro Sierra via lustre-discuss <<a href="mailto:lustre-discuss@lists.lustre.org">lustre-discuss@lists.lustre.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hello,<br>
<br>
In 2018 we deployed a Lustre 2.10.5 system with 20 OSTs on two OSSes<br>
and four JBODs, each enclosure with 24 disks of 12 TB, for a total of<br>
nearly 1 PB. Over the years we have had power failures and failed RAID<br>
controller cards, all of which forced us to adjust the configuration. After<br>
the last failure, the system keeps sending error messages about OSTs<br>
that are no longer in the system. On the MDS I run<br>
<br>
# lctl dl<br>
<br>
and I get the 20 currently active OSTs<br>
<br>
oss01.lanot.unam.mx - OST00 /dev/disk/by-label/lustre-OST0000<br>
oss01.lanot.unam.mx - OST01 /dev/disk/by-label/lustre-OST0001<br>
oss01.lanot.unam.mx - OST02 /dev/disk/by-label/lustre-OST0002<br>
oss01.lanot.unam.mx - OST03 /dev/disk/by-label/lustre-OST0003<br>
oss01.lanot.unam.mx - OST04 /dev/disk/by-label/lustre-OST0004<br>
oss01.lanot.unam.mx - OST05 /dev/disk/by-label/lustre-OST0005<br>
oss01.lanot.unam.mx - OST06 /dev/disk/by-label/lustre-OST0006<br>
oss01.lanot.unam.mx - OST07 /dev/disk/by-label/lustre-OST0007<br>
oss01.lanot.unam.mx - OST08 /dev/disk/by-label/lustre-OST0008<br>
oss01.lanot.unam.mx - OST09 /dev/disk/by-label/lustre-OST0009<br>
oss02.lanot.unam.mx - OST15 /dev/disk/by-label/lustre-OST000f<br>
oss02.lanot.unam.mx - OST16 /dev/disk/by-label/lustre-OST0010<br>
oss02.lanot.unam.mx - OST17 /dev/disk/by-label/lustre-OST0011<br>
oss02.lanot.unam.mx - OST18 /dev/disk/by-label/lustre-OST0012<br>
oss02.lanot.unam.mx - OST19 /dev/disk/by-label/lustre-OST0013<br>
oss02.lanot.unam.mx - OST25 /dev/disk/by-label/lustre-OST0019<br>
oss02.lanot.unam.mx - OST26 /dev/disk/by-label/lustre-OST001a<br>
oss02.lanot.unam.mx - OST27 /dev/disk/by-label/lustre-OST001b<br>
oss02.lanot.unam.mx - OST28 /dev/disk/by-label/lustre-OST001c<br>
oss02.lanot.unam.mx - OST29 /dev/disk/by-label/lustre-OST001d<br>
<br>
but I also get five that are not currently active and, in fact, no longer exist<br>
<br>
28 IN osp lustre-OST0014-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4<br>
29 UP osp lustre-OST0015-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4<br>
30 UP osp lustre-OST0016-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4<br>
31 UP osp lustre-OST0017-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4<br>
32 UP osp lustre-OST0018-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4<br>
<br>
When I try to remove them with<br>
<br>
lctl conf_param -P osp.lustre-OST0015-osc-MDT0000.active=0<br>
<br>
I get the error<br>
<br>
conf_param: invalid option -- 'P'<br>
set a permanent config parameter.<br>
This command must be run on the MGS node<br>
usage: conf_param [-d] <target.keyword=val><br>
-d Remove the permanent setting.<br>
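<br>
(If I read that help output correctly, the -P option belongs to lctl<br>
set_param rather than conf_param, and conf_param has to be run on the<br>
MGS node, so I suppose the equivalent would be something like<br>
<br>
lctl conf_param lustre-OST0015.osc.active=0<br>
<br>
on the MGS, or<br>
<br>
lctl set_param -P osp.lustre-OST0015-osc-MDT0000.active=0<br>
<br>
but I am not sure either of these is the right way to get rid of the<br>
stale entries.)<br>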
<br>
If I do<br>
<br>
lctl --device 28 deactivate<br>
<br>
I don't get an error, but nothing changes<br>
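<br>
(My understanding is that deactivating a device this way is only<br>
temporary and local to the running MDS, which may be why device 28<br>
already shows IN in the lctl dl output above while the others stay UP.<br>
I assume the state can also be checked with something like<br>
<br>
lctl get_param osp.lustre-OST0014-osc-MDT0000.active<br>
<br>
on the MDS, but the stale targets are still there either way.)<br>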
<br>
What can I do?<br>
<br>
Thank you in advance for any help.<br>
<br>
--<br>
Alejandro Aguilar Sierra<br>
LANOT, ICAyCC, UNAM<br>
_______________________________________________<br>
lustre-discuss mailing list<br>
<a href="mailto:lustre-discuss@lists.lustre.org" target="_blank">lustre-discuss@lists.lustre.org</a><br>
<a href="http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org" rel="noreferrer" target="_blank">http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org</a><br>
</blockquote></div><br clear="all"><div><br></div><span class="gmail_signature_prefix">-- </span><br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div><div dir="ltr">------------------------------<br>Jeff Johnson<br>Co-Founder<br>Aeon Computing<br><br><a href="mailto:jeff.johnson@aeoncomputing.com" target="_blank">jeff.johnson@aeoncomputing.com</a><br><a href="http://www.aeoncomputing.com" target="_blank">www.aeoncomputing.com</a><br>t: 858-412-3810 x1001 f: 858-412-3845<br>m: 619-204-9061<br><br>4170 Morena Boulevard, Suite C - San Diego, CA 92117<div><br></div><div>High-Performance Computing / Lustre Filesystems / Scale-out Storage</div></div></div></div></div>