[lustre-discuss] Lustre 2.8.0 - MDT/MGT failing to mount
Steve Barnet
barnet at icecube.wisc.edu
Thu May 4 10:00:05 PDT 2017
Hi all,
Thanks very much for all of your help and pointers! The
suggested procedure did indeed do the right thing.
So, for posterity and to hopefully help others who may
run into this:
1) Configuration log information is wrong due to index collision
(admin error). MDT/MGT fail to mount. In this particular
case, this happened after a kernel panic, and an e2fsck
was required to make the ldiskfs filesystem consistent.
2) Isolate the servers. A clean unmount from all the clients
was not possible, so I did this with iptables on the servers
(basically, drop all client traffic to port 988 on all servers).
We don't want client recovery attempts muddying the waters
on the MDS or OSSes.
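A minimal sketch of the isolation step, assuming iptables and the default
LNet TCP port 988; the client subnet shown (10.128.12.0/24) is a
hypothetical placeholder, substitute your own:

```shell
# On each Lustre server: drop incoming client traffic to the LNet TCP
# port (988 by default) so clients cannot trigger recovery while we work.
# The source subnet is hypothetical -- use your client network here.
iptables -I INPUT -p tcp --dport 988 -s 10.128.12.0/24 -j DROP

# Once the repair is finished and targets are remounted, remove the rule:
iptables -D INPUT -p tcp --dport 988 -s 10.128.12.0/24 -j DROP
```

Dropping by client subnet (rather than all port 988 traffic) leaves
server-to-server LNet connections intact, which matters when the MDS and
OSSes need to talk to each other during remount.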
3) Initial mount of the MDT fails, with log messages like:
May 4 06:27:21 lfs4-mds kernel: Lustre: MGS: Connection restored to
MGC10.128.11.174 at tcp1_0 (at 0 at lo)
May 4 06:27:21 lfs4-mds kernel: LustreError:
6091:0:(genops.c:334:class_newdev()) Device lfs4-OST000e-osc-MDT0000
already exists at 22, won't add
May 4 06:27:21 lfs4-mds kernel: LustreError:
6091:0:(obd_config.c:370:class_attach()) Cannot create device
lfs4-OST000e-osc-MDT0000 of type osp : -17
May 4 06:27:21 lfs4-mds kernel: LustreError:
6091:0:(obd_config.c:1666:class_config_llog_handler())
MGC10.128.11.174 at tcp1: cfg command failed: rc = -17
May 4 06:27:21 lfs4-mds kernel: Lustre: cmd=cf001
0:lfs4-OST000e-osc-MDT0000 1:osp 2:lfs4-MDT0000-mdtlov_UUID
May 4 06:27:21 lfs4-mds kernel:
May 4 06:27:21 lfs4-mds kernel: LustreError: 15c-8:
MGC10.128.11.174 at tcp1: The configuration from log 'lfs4-MDT0000' failed
(-17). This may be the result of communication errors between this node
and the MGS, a bad configuration, or other errors. See the syslog for
more information.
May 4 06:27:21 lfs4-mds kernel: LustreError:
6004:0:(obd_mount_server.c:1309:server_start_targets()) failed to start
server lfs4-MDT0000: -17
May 4 06:27:21 lfs4-mds kernel: LustreError:
6004:0:(obd_mount_server.c:1798:server_fill_super()) Unable to start
targets: -17
May 4 06:27:21 lfs4-mds kernel: Lustre: Failing over lfs4-MDT0000
May 4 06:27:27 lfs4-mds kernel: Lustre:
6004:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has
timed out for slow reply: [sent 1493897241/real 1493897241]
req at ffff88038d8fdc80 x1566404887184112/t0(0)
o251->MGC10.128.11.174 at tcp1@0 at lo:26/25 lens 224/224 e 0 to 1 dl
1493897247 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
May 4 06:27:28 lfs4-mds kernel: Lustre: server umount lfs4-MDT0000 complete
May 4 06:27:28 lfs4-mds kernel: LustreError:
6004:0:(obd_mount.c:1426:lustre_fill_super()) Unable to mount (-17)
It appears that the most relevant messages are probably these:
May 4 06:27:21 lfs4-mds kernel: LustreError: 15c-8:
MGC10.128.11.174 at tcp1: The configuration from log 'lfs4-MDT0000' failed
(-17). This may be the result of communication errors between this node
and the MGS, a bad configuration, or other errors. See the syslog for
more information.
May 4 06:27:21 lfs4-mds kernel: LustreError:
6004:0:(obd_mount_server.c:1309:server_start_targets()) failed to start
server lfs4-MDT0000: -17
May 4 06:27:21 lfs4-mds kernel: LustreError:
6004:0:(obd_mount_server.c:1798:server_fill_super()) Unable to start
targets: -17
These messages indicate the corrupted configuration information.
4) Regenerate the configuration logs using writeconf. This procedure
is documented in the manual, but breaks down to:
a) Make sure no clients can reach your servers.
b) Make sure *all* targets are unmounted
c) For each target:
# tunefs.lustre --writeconf /dev/<target>
d) Mount the MDT (in our case, combined MDT/MGT)
e) Mount the OSTs
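The steps above can be sketched as follows; the device and mount-point
names are hypothetical examples, substitute your own:

```shell
# Run with all clients blocked (step a) and the whole filesystem stopped.

# b) Make sure *all* targets are unmounted
umount /mnt/lustre/mdt        # on the MDS
umount /mnt/lustre/ost0       # on each OSS, for every OST

# c) Regenerate the configuration logs on every target
tunefs.lustre --writeconf /dev/mapper/mdt    # on the MDS (combined MDT/MGT)
tunefs.lustre --writeconf /dev/mapper/ost0   # on each OSS, for every OST

# d) Mount the MDT/MGT first so the MGS is up to accept the new logs...
mount -t lustre /dev/mapper/mdt /mnt/lustre/mdt

# e) ...then mount the OSTs
mount -t lustre /dev/mapper/ost0 /mnt/lustre/ost0
```

The ordering matters: every target must be down before the first
writeconf, and the MGS must be back before any OST registers, otherwise
the logs are regenerated against stale state.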
5) At this point, you see log messages like this:
a) On the MDS:
May 4 10:37:11 lfs4-mds kernel: Lustre: MGS: Connection restored to
MGC10.128.11.174 at tcp1_0 (at 0 at lo)
May 4 10:37:11 lfs4-mds kernel: Lustre: MGS: Logs for fs lfs4 were
removed by user request. All servers must be restarted in order to
regenerate the logs.
May 4 10:37:36 lfs4-mds kernel: Lustre: lfs4-MDT0000: Connection
restored to MGC10.128.11.174 at tcp1_0 (at 0 at lo)
May 4 10:38:01 lfs4-mds kernel: Lustre: MGS: Connection restored to
10.128.11.173 at tcp1 (at 10.128.11.173 at tcp1)
May 4 10:38:01 lfs4-mds kernel: Lustre: MGS: Regenerating lfs4-OST0000
log by user request.
May 4 10:38:09 lfs4-mds kernel: LustreError: 11-0:
lfs4-OST0000-osc-MDT0000: operation ost_connect to node
10.128.11.173 at tcp1 failed: rc = -16
May 4 10:38:34 lfs4-mds kernel: LustreError: 11-0:
lfs4-OST0000-osc-MDT0000: operation ost_connect to node
10.128.11.173 at tcp1 failed: rc = -16
b) On the OSSes (here lfs4-oss-02), normal client recovery:
May 4 10:40:32 lfs4-oss-02 kernel: Lustre: lfs4-OST0005: Will be in
recovery for at least 5:00, or until 16 clients reconnect
May 4 10:40:32 lfs4-oss-02 kernel: Lustre: Skipped 1 previous similar
message
May 4 10:40:32 lfs4-oss-02 kernel: Lustre: lfs4-OST0005: Denying
connection for new client lfs4-MDT0000-mdtlov_UUID(at
10.128.11.174 at tcp1), waiting for 16 known clients (0 recovered, 0 in
progress, and 0 evicted) to recover in 5:00
6) Upon mounting the OST with the index collision, I still saw
the following in the logs:
May 4 10:45:12 lfs4-mds kernel: LustreError:
16576:0:(genops.c:334:class_newdev()) Device lfs4-OST000e-osc-MDT0000
already exists at 22, won't add
May 4 10:45:12 lfs4-mds kernel: LustreError:
16576:0:(obd_config.c:370:class_attach()) Cannot create device
lfs4-OST000e-osc-MDT0000 of type osp : -17
May 4 10:45:12 lfs4-mds kernel: LustreError:
16576:0:(obd_config.c:1666:class_config_llog_handler())
MGC10.128.11.174 at tcp1: cfg command failed: rc = -17
May 4 10:45:12 lfs4-mds kernel: Lustre: cmd=cf001
0:lfs4-OST000e-osc-MDT0000 1:osp 2:lfs4-MDT0000-mdtlov_UUID
but the OST mounted correctly, and I am not seeing any other
indications of a problem.
After a little testing, it looks like the filesystem is back in
working condition.
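For anyone repeating this, a quick sanity check after remount might look
like the following; the mount point is a hypothetical example:

```shell
# On a client: every MDT and OST should be listed with sane usage numbers.
lfs df -h /mnt/lustre

# On a client (as root): ping the connection to every server target.
lfs check servers

# On the MDS: the OSC/OSP devices for each OST should be in the UP state.
lctl dl
```

If an OST is missing from `lfs df` or a device shows something other
than UP in `lctl dl`, that target likely needs another look before
letting users back in.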
Many thanks again to all who helped out!
Best,
---Steve
On 5/4/17 10:51 AM, Colin Faber wrote:
> Hi,
>
> Yes MGS/MDT as well as OSTs, Remount MGS/MDT, then OSTs, then clients.
>
> -cf
>
>
> On Thu, May 4, 2017 at 9:24 AM, Mohr Jr, Richard Frank (Rick Mohr) <
> rmohr at utk.edu> wrote:
>
>>
>>> On May 4, 2017, at 11:03 AM, Steve Barnet <barnet at icecube.wisc.edu>
>> wrote:
>>>
>>> On 5/4/17 10:01 AM, Mohr Jr, Richard Frank (Rick Mohr) wrote:
>>>> Did you try doing a writeconf to regenerate the config logs for the
>> file system?
>>>
>>>
>>> Not yet, but quick enough to try. Do this for the MDT/MGT first,
>>> then the OSTs?
>>>
>>
>> I believe that is correct, but you should check the Lustre manual to be
>> certain of the procedure.
>>
>> --
>> Rick Mohr
>> Senior HPC System Administrator
>> National Institute for Computational Sciences
>> http://www.nics.tennessee.edu
>>
>> _______________________________________________
>> lustre-discuss mailing list
>> lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>
>