[lustre-discuss] Crash due to transaction in readonly mode (snapshot)

Hans Henrik Happe happe at nbi.dk
Thu Sep 23 07:27:25 PDT 2021


Hi,

We had a crash with this in the MDS log:

Sep 22 13:45:07 sci-mds01 kernel: LustreError:
258240:0:(osd_handler.c:354:osd_trans_create()) 03781251-MDT0000:
someone try to start transaction under readonly mode, should be disabled.
Sep 22 13:45:07 sci-mds01 kernel: CPU: 31 PID: 94594 Comm:
mdt_rdpg05_005 Kdump: loaded Tainted: P           OE  ------------  
3.10.0-1160.6.1.el7.x86_64 #1
Sep 22 13:45:07 sci-mds01 kernel: Hardware name: Dell Inc. PowerEdge
R640/0HG0J8, BIOS 2.10.2 02/24/2021
Sep 22 13:45:07 sci-mds01 kernel: Call Trace:
Sep 22 13:45:07 sci-mds01 kernel: [<ffffffff89f81400>] dump_stack+0x19/0x1b
Sep 22 13:45:07 sci-mds01 kernel: [<ffffffffc143e64a>]
osd_trans_create+0x3ca/0x410 [osd_zfs]
Sep 22 13:45:07 sci-mds01 kernel: CPU: 10 PID: 258241 Comm:
mdt_rdpg05_001 Kdump: loaded Tainted: P           OE  ------------  
3.10.0-1160.6.1.el7.x86_64 #1
Sep 22 13:45:07 sci-mds01 kernel: [<ffffffffc12d885a>]
top_trans_create+0x8a/0x200 [ptlrpc]
Sep 22 13:45:07 sci-mds01 kernel: Hardware name: Dell Inc. PowerEdge
R640/0HG0J8, BIOS 2.10.2 02/24/2021
Sep 22 13:45:07 sci-mds01 kernel: [<ffffffffc16284dc>]
lod_trans_create+0x3c/0x50 [lod]
....
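
For anyone sifting through many such messages, the call-site prefix in the LustreError line above can be pulled apart mechanically. This is only a minimal sketch: the regular expression and the names `CALL_SITE`, `parse_call_site` and `field2` are illustrative and not part of any Lustre tool, and the sample is the first message above with the mail line-wrapping undone.

```python
import re

# Call-site prefix as it appears in the messages above:
#   <pid>:<second field>:(<file>:<line>:<function>())
# "field2" is a placeholder name; its meaning is not assumed here.
CALL_SITE = re.compile(
    r"(?P<pid>\d+):(?P<field2>\d+):"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s*"
    r"(?P<message>.*)"
)


def parse_call_site(text):
    """Return the call-site fields of a Lustre console message, or None."""
    match = CALL_SITE.search(text)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pid"] = int(fields["pid"])
    fields["line"] = int(fields["line"])
    return fields


# First message from the log above, joined back onto one line.
sample = (
    "Sep 22 13:45:07 sci-mds01 kernel: LustreError: "
    "258240:0:(osd_handler.c:354:osd_trans_create()) 03781251-MDT0000: "
    "someone try to start transaction under readonly mode, should be disabled."
)
info = parse_call_site(sample)
print(info["file"], info["line"], info["func"])
```

Grouping messages by source file and function this way makes it easier to see which code paths are reporting errors.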

Looks similar to this:
http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2018-August/015854.html

When restarting, the MGS starts fine, but the one MDT (science-MDT0000)
does not:

Sep 23 16:10:17 sci-mds00 kernel: Lustre: MGS: Connection restored to
0dd6cfa0-bdf7-c8ac-7bb9-182f7874e165 (at 0@lo)
Sep 23 16:10:17 sci-mds00 kernel: Lustre: Skipped 1 previous similar message
Sep 23 16:10:19 sci-mds00 kernel: Lustre:
52424:0:(llog_cat.c:93:llog_cat_new_log()) science-OST1100-osc-MDT0000:
there are no more free slots in catalog [0x2:0x1:0x0]:0
Sep 23 16:10:19 sci-mds00 kernel: LustreError:
52424:0:(osp_sync.c:1524:osp_sync_init()) science-OST1100-osc-MDT0000:
can't initialize llog: rc = -28
Sep 23 16:10:19 sci-mds00 kernel: LustreError:
52424:0:(obd_config.c:559:class_setup()) setup
science-OST1100-osc-MDT0000 failed (-28)
Sep 23 16:10:19 sci-mds00 kernel: LustreError:
52424:0:(obd_config.c:1835:class_config_llog_handler())
MGC10.120.10.90@tcp: cfg command failed: rc = -28
Sep 23 16:10:19 sci-mds00 kernel: Lustre:    cmd=cf003
0:science-OST1100-osc-MDT0000  1:science-OST1100_UUID  2:10.120.10.110@tcp
Sep 23 16:10:19 sci-mds00 kernel: LustreError: 15c-8:
MGC10.120.10.90@tcp: The configuration from log 'science-MDT0000' failed
(-28). This may be the result of communication errors between this node
and the MGS, a bad configuration, or other errors. Set.
Sep 23 16:10:19 sci-mds00 kernel: LustreError:
52172:0:(obd_mount_server.c:1397:server_start_targets()) failed to start
server science-MDT0000: -28
Sep 23 16:10:19 sci-mds00 kernel: LustreError:
52172:0:(obd_mount_server.c:1992:server_fill_super()) Unable to start
targets: -28
Sep 23 16:10:19 sci-mds00 kernel: Lustre: Failing over science-MDT0000
Sep 23 16:10:19 sci-mds00 kernel: Lustre: server umount science-MDT0000
complete
Sep 23 16:10:19 sci-mds00 kernel: LustreError:
52172:0:(obd_mount.c:1608:lustre_fill_super()) Unable to mount  (-28)


We tried regenerating the configuration logs with --writeconf, but that
only moved the problem: we now get this error when mounting an OST:

Sep 23 12:04:16 sci-mds00 kernel: Lustre: MGS: Logs for fs science were
removed by user request.  All servers must be restarted in order to
regenerate the logs: rc = 0
Sep 23 12:04:16 sci-mds00 kernel: Lustre: science-MDT0000: Imperative
Recovery not enabled, recovery window 300-900
Sep 23 12:04:38 sci-mds00 kernel: Lustre: MGS: Connection restored to
68b4cd3a-6c73-19c5-2925-935e42bdaf2b (at 10.120.10.111@tcp)
Sep 23 12:04:38 sci-mds00 kernel: Lustre: Skipped 2 previous similar
messages
Sep 23 12:04:38 sci-mds00 kernel: Lustre: MGS: Regenerating
science-OST1100 log by user request: rc = 0
Sep 23 12:04:45 sci-mds00 kernel: LustreError:
5547:0:(genops.c:556:class_register_device())
science-OST1100-osc-MDT0000: already exists, won't add
Sep 23 12:04:45 sci-mds00 kernel: LustreError:
5547:0:(obd_config.c:1835:class_config_llog_handler())
MGC10.120.10.90@tcp: cfg command failed: rc = -17
Sep 23 12:04:45 sci-mds00 kernel: Lustre:    cmd=cf001
0:science-OST1100-osc-MDT0000  1:osp  2:science-MDT0000-mdtlov_UUID 
Sep 23 12:04:45 sci-mds00 kernel: LustreError:
1345:0:(mgc_request.c:599:do_requeue()) failed processing log: -17
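
The rc values in these logs are negated Linux errno codes, so they can be decoded by name. A minimal sketch follows, assuming the Python errno table on the machine running it matches the kernel's numbering (true for these two codes on Linux); `decode_rc` is a hypothetical helper, not a Lustre utility.

```python
import errno
import os


def decode_rc(rc):
    """Map a negative kernel-style return code to (errno name, description)."""
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# The two return codes that appear in the logs above.
for rc in (-28, -17):
    name, text = decode_rc(rc)
    print(f"rc = {rc}: {name} ({text})")
```

Here -28 is ENOSPC, which would be consistent with the "no more free slots in catalog" message, and -17 is EEXIST, matching "already exists, won't add".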

Any ideas on how to solve this?

Cheers,
Hans Henrik