[lustre-discuss] Lustre OSS kernel panic after mounting OSTs

Riccardo Veraldi Riccardo.Veraldi at cnaf.infn.it
Tue Oct 30 05:05:38 PDT 2018


Hello,

I have a very critical problem.

One of my OSSes hits a kernel panic when trying to mount the OSTs.

After mounting 11 of the 12 OSTs, it goes into a kernel panic.
The order in which they are mounted does not matter.

Any clues or hints?

I cannot recover it, and it holds important data.

I already ran e2fsck, but it did not fix the problem. It did find a few
inode count inconsistencies.

kernel is 2.6.32-431.23.3.el6_lustre.x86_64

Red Hat Enterprise Linux Server release 6.7 (Santiago)

lustre-2.5.3-2.6.32_431.23.3.el6_lustre.x86_64.x86_64


Oct 30 04:58:52 psanaoss231 kernel: INFO: task tgt_recov:4569 blocked 
for more than 120 seconds.

Oct 30 04:58:52 psanaoss231 kernel:      Not tainted 
2.6.32-431.23.3.el6_lustre.x86_64 #1
Oct 30 04:58:52 psanaoss231 kernel: "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 30 04:58:52 psanaoss231 kernel: tgt_recov     D 0000000000000003     
0  4569      2 0x00000080
Oct 30 04:58:52 psanaoss231 kernel: ffff880bf2ae1da0 0000000000000046 
0000000000000000 0000000000000003
Oct 30 04:58:52 psanaoss231 kernel: ffff880bf2ae1d30 ffffffff81059096 
ffff880bf2ae1d40 ffff880bf2a1d500
Oct 30 04:58:52 psanaoss231 kernel: ffff880bf2b01ab8 ffff880bf2ae1fd8 
000000000000fbc8 ffff880bf2b01ab8
Oct 30 04:58:52 psanaoss231 kernel: Call Trace:
Oct 30 04:58:52 psanaoss231 kernel: [<ffffffff81059096>] ? 
enqueue_task+0x66/0x80
Oct 30 04:58:52 psanaoss231 kernel: [<ffffffffa07ae560>] ? 
check_for_clients+0x0/0x70 [ptlrpc]
Oct 30 04:58:52 psanaoss231 kernel: [<ffffffffa07afbcd>] 
target_recovery_overseer+0x9d/0x230 [ptlrpc]
Oct 30 04:58:52 psanaoss231 kernel: [<ffffffffa07ae250>] ? 
exp_connect_healthy+0x0/0x20 [ptlrpc]
Oct 30 04:58:52 psanaoss231 kernel: [<ffffffff8109afa0>] ? 
autoremove_wake_function+0x0/0x40
Oct 30 04:58:52 psanaoss231 kernel: [<ffffffffa07b6490>] ? 
target_recovery_thread+0x0/0x1920 [ptlrpc]
Oct 30 04:58:52 psanaoss231 kernel: [<ffffffffa07b69d0>] 
target_recovery_thread+0x540/0x1920 [ptlrpc]
Oct 30 04:58:52 psanaoss231 kernel: [<ffffffff81061d12>] ? 
default_wake_function+0x12/0x20
Oct 30 04:58:52 psanaoss231 kernel: [<ffffffffa07b6490>] ? 
target_recovery_thread+0x0/0x1920 [ptlrpc]
Oct 30 04:58:52 psanaoss231 kernel: [<ffffffff8109abf6>] kthread+0x96/0xa0
Oct 30 04:58:52 psanaoss231 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
Oct 30 04:58:52 psanaoss231 kernel: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
Oct 30 04:58:52 psanaoss231 kernel: [<ffffffff8100c200>] ? 
child_rip+0x0/0x20
Oct 30 04:59:02 psanaoss231 kernel: Lustre: ana13-OST0004: Recovery over 
after 3:05, of 147 clients 146 recovered and 1 was evicted.
Oct 30 04:59:03 psanaoss231 kernel: Lustre: ana13-OST0004: Client 
89ba817f-45c3-5e64-99a8-b472651bbe45 (at 172.21.52.213@o2ib) reconnecting
Oct 30 04:59:03 psanaoss231 kernel: Lustre: Skipped 94 previous similar 
messages
Oct 30 04:59:21 psanaoss231 kernel: LustreError: 
4569:0:(ost_handler.c:1123:ost_brw_write()) Dropping timed-out write 
from 12345-172.21.49.129@tcp because locking object 0x0:14198730 took 
153 seconds (limit was 30).
Oct 30 04:59:21 psanaoss231 kernel: Lustre: ana13-OST0005: Bulk IO write 
error with 3a71df2f-16e7-d507-2495-ab60364d8e7c (at 172.21.49.129@tcp), 
client will retry: rc -110
Oct 30 04:59:52 psanaoss231 kernel: ------------[ cut here ]------------
Oct 30 04:59:52 psanaoss231 kernel: kernel BUG at 
fs/jbd2/transaction.c:1033!
Oct 30 04:59:52 psanaoss231 kernel: invalid opcode: 0000 [#1] SMP
Oct 30 04:59:52 psanaoss231 kernel: last sysfs file: 
/sys/devices/system/cpu/online
Oct 30 04:59:52 psanaoss231 kernel: CPU 10
Oct 30 04:59:52 psanaoss231 kernel: Modules linked in: osp(U) ofd(U) 
lfsck(U) ost(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) 
ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) 
ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic 
sha256_generic crc32c_intel libcfs(U) nfs lockd fscache auth_rpcgss 
nfs_acl mpt3sas mpt2sas scsi_transport_sas raid_class mptctl mptbase 
autofs4 sunrpc ipt_REDIRECT iptable_nat nf_nat nf_conntrack_ipv4 
nf_conntrack nf_defrag_ipv4 ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs 
ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 microcode power_meter iTCO_wdt 
iTCO_vendor_support dcdbas ipmi_devintf sb_edac edac_core lpc_ich 
mfd_core shpchp igb i2c_algo_bit i2c_core ses enclosure sg ixgbe dca ptp 
pps_core mdio ext4 jbd2 mbcache raid1 sd_mod crc_t10dif ahci wmi mlx4_ib 
ib_sa ib_mad ib_core mlx4_en mlx4_core megaraid_sas dm_mirror 
dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
Oct 30 04:59:52 psanaoss231 kernel:
Oct 30 04:59:52 psanaoss231 kernel: Pid: 4272, comm: ll_ost01_007 Not 
tainted 2.6.32-431.23.3.el6_lustre.x86_64 #1 Dell Inc. PowerEdge R620/0PXXHP
Oct 30 04:59:52 psanaoss231 kernel: RIP: 0010:[<ffffffffa01198ad>]  
[<ffffffffa01198ad>] jbd2_journal_dirty_metadata+0x10d/0x150 [jbd2]
Oct 30 04:59:52 psanaoss231 kernel: RSP: 0018:ffff880c058437d0 EFLAGS: 
00010246
Oct 30 04:59:52 psanaoss231 kernel: RAX: ffff880c05573dc0 RBX: 
ffff880c043b8d08 RCX: ffff88175b0fedc8
Oct 30 04:59:52 psanaoss231 kernel: RDX: 0000000000000000 RSI: 
ffff88175b0fedc8 RDI: 0000000000000000
Oct 30 04:59:52 psanaoss231 kernel: RBP: ffff880c058437f0 R08: 
9010000000000000 R09: e886f5e8fbf37202
Oct 30 04:59:52 psanaoss231 kernel: R10: 0000000000000002 R11: 
0000000000000000 R12: ffff880c040c26d8
Oct 30 04:59:52 psanaoss231 kernel: R13: ffff88175b0fedc8 R14: 
ffff88174728c800 R15: 0000000000000008
Oct 30 04:59:52 psanaoss231 kernel: FS:  0000000000000000(0000) 
GS:ffff8800282a0000(0000) knlGS:0000000000000000
Oct 30 04:59:52 psanaoss231 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 
000000008005003b
Oct 30 04:59:52 psanaoss231 kernel: CR2: 00000034f304b750 CR3: 
0000000001a85000 CR4: 00000000000407e0
Oct 30 04:59:52 psanaoss231 kernel: DR0: 0000000000000000 DR1: 
0000000000000000 DR2: 0000000000000000
Oct 30 04:59:52 psanaoss231 kernel: DR3: 0000000000000000 DR6: 
00000000ffff0ff0 DR7: 0000000000000400
Oct 30 04:59:52 psanaoss231 kernel: Process ll_ost01_007 (pid: 4272, 
threadinfo ffff880c05842000, task ffff880c0634eaa0)
Oct 30 04:59:52 psanaoss231 kernel: Stack:
Oct 30 04:59:52 psanaoss231 kernel: ffff880c043b8d08 ffffffffa0d136f0 
ffff88175b0fedc8 0000000000000000
Oct 30 04:59:52 psanaoss231 kernel: <d> ffff880c05843830 
ffffffffa0cd100b ffff880c05843820 ffffffff8109af8f
Oct 30 04:59:52 psanaoss231 kernel: <d> ffff88175b105a40 
ffff880c043b8d08 0000000000000018 ffff88175b0fedc8
Oct 30 04:59:52 psanaoss231 kernel: Call Trace:
Oct 30 04:59:52 psanaoss231 kernel: [<ffffffffa0cd100b>] 
__ldiskfs_handle_dirty_metadata+0x7b/0x100 [ldiskfs]
Oct 30 04:59:52 psanaoss231 kernel: [<ffffffff8109af8f>] ? 
wake_up_bit+0x2f/0x40
Oct 30 04:59:52 psanaoss231 kernel: [<ffffffffa0d067c5>] 
ldiskfs_quota_write+0x165/0x210 [ldiskfs]
Oct 30 04:59:52 psanaoss231 kernel: [<ffffffff811eef11>] 
v2_write_file_info+0xa1/0xe0
Oct 30 04:59:52 psanaoss231 kernel: [<ffffffff811eb018>] 
dquot_acquire+0x138/0x140
Oct 30 04:59:52 psanaoss231 kernel: [<ffffffffa0d05956>] 
ldiskfs_acquire_dquot+0x66/0xb0 [ldiskfs]
Oct 30 04:59:52 psanaoss231 kernel: [<ffffffff811ecf8c>] dqget+0x2ac/0x390
Oct 30 04:59:52 psanaoss231 kernel: [<ffffffff811ed51b>] 
dquot_initialize+0x7b/0x240
Oct 30 04:59:52 psanaoss231 kernel: [<ffffffff8116f553>] ? 
kmem_cache_alloc_trace+0x1a3/0x1b0
Oct 30 04:59:52 psanaoss231 kernel: [<ffffffffa0d05bb3>] 
ldiskfs_dquot_initialize+0x83/0xd0 [ldiskfs]
Oct 30 04:59:52 psanaoss231 kernel: [<ffffffffa0dd0baf>] 
osd_attr_set+0x12f/0x540 [osd_ldiskfs]
Oct 30 04:59:52 psanaoss231 kernel: [<ffffffffa0ecb969>] 
dt_attr_set.clone.2+0x29/0xc0 [ofd]
Oct 30 04:59:52 psanaoss231 kernel: [<ffffffffa0ecf472>] 
ofd_attr_set+0x522/0x6c0 [ofd]
Oct 30 04:59:52 psanaoss231 kernel: [<ffffffffa0ec0e68>] 
ofd_setattr+0x678/0xc10 [ofd]
Oct 30 04:59:52 psanaoss231 kernel: [<ffffffffa07eeeae>] ? 
lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
Oct 30 04:59:52 psanaoss231 kernel: [<ffffffffa0e711bb>] 
ost_setattr+0x30b/0x930 [ost]
Oct 30 04:59:52 psanaoss231 kernel: [<ffffffffa0e741bd>] 
ost_handle+0x1f8d/0x44d0 [ost]
Oct 30 04:59:52 psanaoss231 kernel: [<ffffffffa07f68db>] ? 
ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
Oct 30 04:59:52 psanaoss231 kernel: [<ffffffffa07fecf5>] 
ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
Oct 30 04:59:52 psanaoss231 kernel: [<ffffffffa05164ce>] ? 
cfs_timer_arm+0xe/0x10 [libcfs]
Oct 30 04:59:52 psanaoss231 kernel: [<ffffffffa05273cf>] ? 
lc_watchdog_touch+0x6f/0x170 [libcfs]
Oct 30 04:59:52 psanaoss231 kernel: [<ffffffffa07f63d9>] ? 
ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
Oct 30 04:59:52 psanaoss231 kernel: [<ffffffff810546b9>] ? 
__wake_up_common+0x59/0x90
Oct 30 04:59:52 psanaoss231 kernel: [<ffffffffa080005d>] 
ptlrpc_main+0xaed/0x1740 [ptlrpc]
Oct 30 04:59:52 psanaoss231 kernel: [<ffffffffa07ff570>] ? 
ptlrpc_main+0x0/0x1740 [ptlrpc]
Oct 30 04:59:52 psanaoss231 kernel: [<ffffffff8109abf6>] kthread+0x96/0xa0
Oct 30 04:59:52 psanaoss231 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
Oct 30 04:59:52 psanaoss231 kernel: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
Oct 30 04:59:52 psanaoss231 kernel: [<ffffffff8100c200>] ? 
child_rip+0x0/0x20
Oct 30 04:59:52 psanaoss231 kernel: Code: c6 9c 03 00 00 4c 89 f7 e8 c1 
21 41 e1 48 8b 33 ba 01 00 00 00 4c 89 e7 e8 11 ec ff ff 4c 89 f0 66 ff 
00 66 66 90 e9 73 ff ff ff <0f> 0b eb fe 0f 0b eb fe 0f 0b 66 0f 1f 84 
00 00 00 00 00 eb f5
Oct 30 04:59:52 psanaoss231 kernel: RIP  [<ffffffffa01198ad>] 
jbd2_journal_dirty_metadata+0x10d/0x150 [jbd2]
Oct 30 04:59:52 psanaoss231 kernel: RSP <ffff880c058437d0>
Oct 30 04:59:52 psanaoss231 kernel: ---[ end trace 5ceb40448d3277c6 ]---
Oct 30 04:59:52 psanaoss231 kernel: Kernel panic - not syncing: Fatal 
exception
Oct 30 04:59:52 psanaoss231 kernel: Pid: 4272, comm: ll_ost01_007 
Tainted: G      D    --------------- 2.6.32-431.23.3.el6_lustre.x86_64 #1


