[lustre-discuss] OSTffff created :-(
Torsten Harenberg
harenberg at physik.uni-wuppertal.de
Wed May 23 06:04:44 PDT 2018
Dear all,
we have been running a Lustre 2.5.3 installation for a couple of years
already. The devices come from a 3PAR SAN appliance.
Our users asked us to enlarge the available disk space, so we exported
two new LUNs to the OST servers.
The file systems were created with:
mkfs.lustre --fsname=lustre --ost --index 15 --backfstype=ldiskfs
--failnode=<IP>@tcp --mgsnode=<IP>@tcp
--mgsnode=<IP>@tcp --verbose /dev/mapper/OST000F
which went fine.
However, after mounting, the file system appears as
lustre-OSTffff_UUID 8585168804 35177704 8120481472 0%
/lustre[OST:65535]
in lfs df.
And lfs df prints 65k+ lines with
OSTfff5 : Resource temporarily unavailable
OSTfff6 : Resource temporarily unavailable
OSTfff7 : Resource temporarily unavailable
OSTfff8 : Resource temporarily unavailable
OSTfff9 : Resource temporarily unavailable
OSTfffa : Resource temporarily unavailable
OSTfffb : Resource temporarily unavailable
OSTfffc : Resource temporarily unavailable
OSTfffd : Resource temporarily unavailable
OSTfffe : Resource temporarily unavailable
in between.
Searching for the root cause of this, we saw:
------
[root at lustre4 ~]# tunefs.lustre /dev/mapper/OST000F
checking for existing Lustre data: found
Reading CONFIGS/mountdata
Read previous values:
Target: lustre-OSTffff
Index: 15
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x2
(OST )
Persistent mount opts: errors=remount-ro
Parameters: failover.node=<IP>@tcp
mgsnode=<IP>@tcp mgsnode=<IP>@tcp
Permanent disk data:
Target: lustre-OST000f
Index: 15
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x2
(OST )
Persistent mount opts: errors=remount-ro
Parameters: failover.node=<IP>@tcp
mgsnode=<IP>@tcp mgsnode=<IP>@tcp
------
We have no idea where the
Read previous values:
Target: lustre-OSTffff
comes from.
We then tried to free the OST immediately, which turned out to be
more complicated than expected.
We tried to follow the manual and issued on the MDS:
[root at lustre1 ~]# lctl --device lustre-OSTffff-osc-MDT0000 deactivate
But the device is still "UP":
[root at lustre1 ~]# lctl dl
0 UP osd-ldiskfs lustre-MDT0000-osd lustre-MDT0000-osd_UUID 24
1 UP mgs MGS MGS 427
2 UP mgc MGC132.195.124.201 at tcp 17eb290e-d0a6-2047-3250-84f893ebc47a 5
3 UP mds MDS MDS_uuid 3
4 UP lod lustre-MDT0000-mdtlov lustre-MDT0000-mdtlov_UUID 4
5 UP mdt lustre-MDT0000 lustre-MDT0000_UUID 455
6 UP mdd lustre-MDD0000 lustre-MDD0000_UUID 4
7 UP qmt lustre-QMT0000 lustre-QMT0000_UUID 4
8 UP osp lustre-OST0000-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
9 UP osp lustre-OST0001-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
10 UP osp lustre-OST0002-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
11 UP osp lustre-OST0003-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
12 UP osp lustre-OST0004-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
13 UP osp lustre-OST0005-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
14 UP osp lustre-OST0006-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
15 UP osp lustre-OST0007-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
16 UP osp lustre-OST0008-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
17 UP osp lustre-OST0009-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
18 UP osp lustre-OST000a-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
19 UP osp lustre-OST000b-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
20 UP osp lustre-OST000c-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
21 UP osp lustre-OST000d-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
22 UP osp lustre-OST000e-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
23 UP osp lustre-OSTffff-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
24 UP lwp lustre-MDT0000-lwp-MDT0000 lustre-MDT0000-lwp-MDT0000_UUID 5
We set it degraded on the OST:
[root at lustre4 ~]# lctl get_param obdfilter.*.degraded
obdfilter.lustre-OST0008.degraded=0
obdfilter.lustre-OST0009.degraded=0
obdfilter.lustre-OST000a.degraded=0
obdfilter.lustre-OST000b.degraded=0
obdfilter.lustre-OST000c.degraded=0
obdfilter.lustre-OST000d.degraded=0
obdfilter.lustre-OST000e.degraded=0
obdfilter.lustre-OSTffff.degraded=1
But the file system usage still grows:
[root at wnfg001 ~]# lfs df /lustre | grep ffff
lustre-OSTffff_UUID 8585168804 35159988 8120496592 0%
/lustre[OST:65535]
[root at wnfg001 ~]# lfs df /lustre | grep ffff
lustre-OSTffff_UUID 8585168804 35177704 8120481472 0%
/lustre[OST:65535]
We could stop the usage by setting it inactive on ALL (200+ in our
case) clients with
lctl set_param osc.lustre-OSTffff-*.active=0
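On our clients we pushed that out with something along these lines (a
sketch; pdsh and the hosts file are placeholders for whatever parallel
shell is in use):

```shell
# Deactivate the OSC for the broken OST on every client in one go.
# "clients" is a hypothetical file listing all 200+ client hostnames;
# any parallel shell (pdsh, clush, a for loop over ssh) works the same.
pdsh -w ^clients 'lctl set_param osc.lustre-OSTffff-*.active=0'
```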
But then the file system becomes unusable for the users:
-bash-4.1# touch
/lustre/gridsoft/arc/session/LeENDmtfhfsnsBfJnpimw0EmABFKDmABFKDmxSGKDmABFKDmhbxd6n/qq2
touch: setting times of
`/lustre/gridsoft/arc/session/LeENDmtfhfsnsBfJnpimw0EmABFKDmABFKDmxSGKDmABFKDmhbxd6n/qq2':
Cannot send after transport endpoint shutdown
The same is true for "lctl --device XX deactivate".
So we are looking for ways now to:
1.) set the OST read-only but keeping the file system usable
2.) then migrate what's on this OSTffff (we already started an lfs
find, but it takes very long)
3.) remove the OST and start from scratch.
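For step 2.), the migration we started is along these lines (a sketch
only; we are not sure lfs_migrate behaves identically on 2.5.3):

```shell
# Find regular files with objects on the bad OST (index 65535 == 0xffff)
# and rewrite them so their objects land on the remaining OSTs.
lfs find /lustre --obd lustre-OSTffff_UUID -type f |
    lfs_migrate -y
```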
It would also be really nice to understand where the OSTffff comes
from and how it can be avoided.
Any hint is really appreciated.
Best regards
Torsten
--
Dr. Torsten Harenberg harenberg at physik.uni-wuppertal.de
Bergische Universitaet
Fakultät 4 - Physik Tel.: +49 (0)202 439-3521
Gaussstr. 20 Fax : +49 (0)202 439-2811
42097 Wuppertal