[lustre-discuss] old Lustre 2.8.0 panicking continuously

Torsten Harenberg harenberg at physik.uni-wuppertal.de
Thu Mar 12 07:18:32 PDT 2020


Dear all,

On 10.03.20 at 08:18, Torsten Harenberg wrote:
> Over the last few days (since Thursday), our Lustre instance has been
> surprisingly stable. We lowered the load a bit by limiting the number of
> running jobs, which may also have helped to stabilize the system.
> 
> We have enabled kdump, so if another crash happens anytime soon, we hope
> to at least get a dump that hints at where the problem is.

Now it has crashed again, but this time we got a backtrace and a dump.

The backtrace is:

<4>general protection fault: 0000 [#1] SMP
<4>last sysfs file:
/sys/devices/pci0000:00/0000:00:01.0/0000:04:00.1/host4/rport-4:0-1/target4:0:1/4:0:1:14/state
<4>CPU 13
<4>Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U)
osd_ldiskfs(U) ldiskfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U)
lmv(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic
crc32c_intel libcfs(U) autofs4 bonding ipt_REJECT nf_conntrack_ipv4
nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6
nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6
iTCO_wdt iTCO_vendor_support hpilo hpwdt serio_raw lpc_ich mfd_core
ioatdma dca ses enclosure sg bnx2x ptp pps_core libcrc32c mdio
power_meter acpi_ipmi ipmi_si ipmi_msghandler shpchp ext4 jbd2 mbcache
dm_round_robin sd_mod crc_t10dif qla2xxx scsi_transport_fc scsi_tgt
pata_acpi ata_generic ata_piix dm_multipath dm_mirror dm_region_hash
dm_log dm_mod hpsa [last unloaded: scsi_wait_scan]
<4>
<4>Pid: 18657, comm: ll_ost_io02_077 Not tainted
2.6.32-573.12.1.el6_lustre.x86_64 #1 HP ProLiant DL360p Gen8
<4>RIP: 0010:[<ffffffffa0bc9ee3>]  [<ffffffffa0bc9ee3>]
ldiskfs_ext_insert_extent+0xb3/0x10c0 [ldiskfs]
<4>RSP: 0018:ffff8806fa8136c0  EFLAGS: 00010246
<4>RAX: 0000000000000000 RBX: 0000000000000002 RCX: dead000000200200
<4>RDX: ffff8806fa813800 RSI: ffff88196f62a2c0 RDI: ffff880106fc3901
<4>RBP: ffff8806fa813790 R08: 0000000000000000 R09: ffff8807ff69f3c0
<4>R10: 0000000000000009 R11: 0000000000000002 R12: ffff88196f62a240
<4>R13: 0000000000000000 R14: 0000000000000002 R15: ffff88196f62a2c0
<4>FS:  0000000000000000(0000) GS:ffff88009a5a0000(0000)
knlGS:0000000000000000
<4>CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>CR2: 0000000000426820 CR3: 0000000001a8d000 CR4: 00000000000407e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process ll_ost_io02_077 (pid: 18657, threadinfo ffff8806fa810000,
task ffff8806faffaab0)
<4>Stack:
<4> ffff8806fa8136f0 ffffffff811beb8b 0000000000000002 dead000000200200
<4><d> ffff881f80925c00 ffff880106fc39b0 ffff8806fa813790 ffffffffa0be3dd1
<4><d> ffff88196f62a240 ffff880106fc39b0 ffff8806fa8137e8 00000000fa8137d4
<4>Call Trace:
<4> [<ffffffff811beb8b>] ? __mark_inode_dirty+0x3b/0x160
<4> [<ffffffffa0be3dd1>] ? ldiskfs_mb_new_blocks+0x241/0x640 [ldiskfs]
<4> [<ffffffffa0c6c169>] ldiskfs_ext_new_extent_cb+0x5d9/0x6d0 [osd_ldiskfs]
<4> [<ffffffff8129e348>] ? call_rwsem_wake+0x18/0x30
<4> [<ffffffffa0bc9c62>] ldiskfs_ext_walk_space+0x142/0x310 [ldiskfs]
<4> [<ffffffffa0c6bb90>] ? ldiskfs_ext_new_extent_cb+0x0/0x6d0 [osd_ldiskfs]
<4> [<ffffffffa0c6bafd>] osd_ldiskfs_map_nblocks+0x7d/0x110 [osd_ldiskfs]
<4> [<ffffffffa0c6c4d8>] osd_ldiskfs_map_inode_pages+0x278/0x2e0
[osd_ldiskfs]
<4> [<ffffffffa0bfc0d8>] ? __ldiskfs_journal_stop+0x68/0xa0 [ldiskfs]
<4> [<ffffffffa0c6edcb>] osd_write_commit+0x39b/0x9a0 [osd_ldiskfs]
<4> [<ffffffffa0de1c34>] ofd_commitrw_write+0x664/0xfa0 [ofd]
<4> [<ffffffffa0de2b2f>] ofd_commitrw+0x5bf/0xb10 [ofd]
<4> [<ffffffffa04bc791>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
<4> [<ffffffffa0756d74>] obd_commitrw+0x114/0x380 [ptlrpc]
<4> [<ffffffffa07603f0>] tgt_brw_write+0xc70/0x1540 [ptlrpc]
<4> [<ffffffff8105e646>] ? enqueue_task+0x66/0x80
<4> [<ffffffff8105a81d>] ? check_preempt_curr+0x6d/0x90
<4> [<ffffffff8106711e>] ? try_to_wake_up+0x24e/0x3e0
<4> [<ffffffffa06feb70>] ? lustre_swab_niobuf_remote+0x0/0x30 [ptlrpc]
<4> [<ffffffffa06b6140>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
<4> [<ffffffffa075ec2c>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
<4> [<ffffffffa070bc61>] ptlrpc_main+0xd21/0x1800 [ptlrpc]
<4> [<ffffffff8106eab0>] ? pick_next_task_fair+0xd0/0x130
<4> [<ffffffff81538f46>] ? schedule+0x176/0x3a0
<4> [<ffffffffa070af40>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
<4> [<ffffffff810a0fce>] kthread+0x9e/0xc0
<4> [<ffffffff8100c28a>] child_rip+0xa/0x20
<4> [<ffffffff810a0f30>] ? kthread+0x0/0xc0
<4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
<4>Code: 48 85 c9 0f 84 05 10 00 00 4d 85 ff 74 0a f6 45 8c 08 0f 84 33
07 00 00 45 31 ed 4c 89 e8 66 2e 0f 1f 84 00 00 00 00 00 49 63 de <44>
0f b7 49 02 48 8d 14 dd 00 00 00 00 49 89 df 49 c1 e7 06 49
<1>RIP  [<ffffffffa0bc9ee3>] ldiskfs_ext_insert_extent+0xb3/0x10c0 [ldiskfs]
<4> RSP <ffff8806fa8136c0>
[root at lustre3 127.0.0.1-2020-03-11-19:14:18]#
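In case it helps with the analysis: the faulting RIP (ldiskfs_ext_insert_extent+0xb3) can be mapped back to a source line with gdb's "list" command against the ldiskfs module, provided the module carries debug info. A minimal sketch; the module path below is a guess for this kernel and may need adjusting:

```shell
# Kernel version taken from the backtrace above.
KVER=2.6.32-573.12.1.el6_lustre.x86_64
# Hypothetical module location -- adjust to where ldiskfs.ko actually lives.
MOD=/lib/modules/$KVER/extra/kernel/fs/lustre-ldiskfs/ldiskfs.ko

# "list *(symbol+offset)" resolves the faulting instruction to a source line,
# assuming the module (or its debuginfo package) contains debug symbols.
if [ -f "$MOD" ]; then
    gdb -batch -ex 'list *(ldiskfs_ext_insert_extent+0xb3)' "$MOD"
else
    echo "module not found at $MOD; adjust MOD and re-run"
fi
```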


I still have to read up on how to load the vmcore into a debugger (I am
not experienced with kernel debugging).
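From what I have read so far, the crash(8) utility, rather than plain gdb, is the usual tool for opening kdump vmcores. A minimal sketch, assuming the default EL6 dump location and the matching kernel-debuginfo package; the paths below are guesses based on the console prompt and kernel version above:

```shell
# Hypothetical paths -- dump directory name taken from the console prompt,
# vmlinux from the kernel-debuginfo package matching the running kernel.
VMCORE=/var/crash/127.0.0.1-2020-03-11-19:14:18/vmcore
VMLINUX=/usr/lib/debug/lib/modules/2.6.32-573.12.1.el6_lustre.x86_64/vmlinux

# crash(8) understands compressed kdump vmcores; plain gdb usually does not.
if command -v crash >/dev/null 2>&1 && [ -f "$VMCORE" ]; then
    crash "$VMLINUX" "$VMCORE"
else
    echo "would run: crash $VMLINUX $VMCORE"
fi
```

Inside the crash prompt, "bt" re-prints the backtrace and "mod -S" loads module symbols.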

But if you already have a guess from reading the trace, I would be very
happy to take any advice.

By the way: we mounted the OSTs exactly the other way round from usual,
and this time the other machine crashed. So it seems to be related to the
content on the LUNs rather than a server hardware problem.

Thanks again

  Torsten


-- 
Dr. Torsten Harenberg     harenberg at physik.uni-wuppertal.de
Bergische Universitaet
Fakultät 4 - Physik       Tel.: +49 (0)202 439-3521
Gaussstr. 20              Fax : +49 (0)202 439-2811
42097 Wuppertal
