[lustre-discuss] ZFS PANIC
Bob Ball
ball at umich.edu
Fri Feb 10 08:23:09 PST 2017
Well, I find this odd, to say the least. Everything below is from
yesterday and persisted through a couple of reboots. Today, shortly
after I sent this, I found all the disks idle but this one OST (out of
six) totally unresponsive, so I power-cycled the system, and it came
back up just fine: no issues, no complaints, responsive. So I have no
idea why this healed itself.
Can anyone enlighten me?
I _think_ that what triggered this was adding a few more client mounts
of the Lustre file system; that's when it all went wrong. Is this
helpful, or just a coincidence? Current state:
18 UP obdfilter umt3B-OST000f umt3B-OST000f_UUID 403
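For context, that line is `lctl dl` output, where the last column is the device reference count. A quick way to pull just the OST devices and their refcounts (the awk filter is my own sketch, demonstrated here on the captured line since it otherwise needs a live OSS):

```shell
# Print each obdfilter (OST) device name and its reference count
# from `lctl dl`-style output; demonstrated on the line above.
line='18 UP obdfilter umt3B-OST000f umt3B-OST000f_UUID 403'
echo "$line" | awk '$3 == "obdfilter" { print $4, $NF }'

# On the live OSS, the same filter against real output:
#   lctl dl | awk '$3 == "obdfilter" { print $4, $NF }'
```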
bob
On 2/10/2017 9:39 AM, Bob Ball wrote:
> Hi,
>
> I am getting this message
>
> PANIC: zfs: accessing past end of object 29/7 (size=33792
> access=33792+128)
>
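A quick sanity check on the numbers in that message (my own reading; the variable names are illustrative, the values are verbatim from the message):

```python
# Fields from the ZFS panic message for object 29/7.
size = 33792                  # size=33792: object size ZFS reports
offset, length = 33792, 128   # access=33792+128: offset + length

# The write starts exactly at the reported end of the object and
# would extend 128 bytes past it, which is what trips the check.
assert offset == size
assert offset + length > size

# 128 bytes also matches the per-client slot size in the last_rcvd
# file (written via tgt_client_data_write in the stack below), and
# the object size is an exact multiple of it -- consistent with a
# new client connect appending one more slot.
print(size // length)  # 264 existing 128-byte slots
```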
> The affected OST now seems to reject new mounts from clients, and the
> connection (reference) count that lctl dl shows for its obdfilter
> device keeps increasing but never seems to decrease.
>
> This is Lustre 2.7.58 with ZFS 0.6.4.2.
>
> Can anyone help me diagnose and fix whatever is going wrong here? I've
> included the stack dump below.
>
> Thanks,
> bob
>
>
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781874]
> Showing stack for process 24449
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781876]
> Pid: 24449, comm: ll_ost00_078 Tainted: P ---------------
> 2.6.32-504.16.2.el6_lustre #7
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781878]
> Call Trace:
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
> [11630254.781902] [<ffffffffa0406f8d>] ? spl_dumpstack+0x3d/0x40 [spl]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
> [11630254.781908] [<ffffffffa040701d>] ? vcmn_err+0x8d/0xf0 [spl]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
> [11630254.781950] [<ffffffffa0465a46>] ? RW_WRITE_HELD+0x66/0xb0 [zfs]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
> [11630254.781970] [<ffffffffa0466eb8>] ?
> dbuf_rele_and_unlock+0x268/0x3f0 [zfs]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
> [11630254.781991] [<ffffffffa04687ba>] ? dbuf_read+0x5ca/0x8a0 [zfs]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
> [11630254.782024] [<ffffffffa04bb032>] ? zfs_panic_recover+0x52/0x60
> [zfs]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
> [11630254.782045] [<ffffffffa0471e3b>] ?
> dmu_buf_hold_array_by_dnode+0x41b/0x560 [zfs]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
> [11630254.782068] [<ffffffffa0472205>] ? dmu_buf_hold_array+0x65/0x90
> [zfs]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
> [11630254.782090] [<ffffffffa0472668>] ? dmu_write+0x68/0x1a0 [zfs]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
> [11630254.782147] [<ffffffffa08fa0ae>] ? lprocfs_oh_tally+0x2e/0x50
> [obdclass]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
> [11630254.782173] [<ffffffffa103f311>] ? osd_write+0x1d1/0x390 [osd_zfs]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
> [11630254.782206] [<ffffffffa0926aad>] ? dt_record_write+0x3d/0x130
> [obdclass]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
> [11630254.782305] [<ffffffffa0ba7575>] ?
> tgt_client_data_write+0x165/0x1b0 [ptlrpc]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
> [11630254.782347] [<ffffffffa0bab575>] ?
> tgt_client_data_update+0x335/0x680 [ptlrpc]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
> [11630254.782388] [<ffffffffa0bac298>] ? tgt_client_new+0x3d8/0x6a0
> [ptlrpc]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
> [11630254.782407] [<ffffffffa117fad3>] ? ofd_obd_connect+0x363/0x400
> [ofd]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
> [11630254.782443] [<ffffffffa0b12158>] ?
> target_handle_connect+0xe58/0x2d30 [ptlrpc]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
> [11630254.782450] [<ffffffff8106d1a5>] ? enqueue_entity+0x125/0x450
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
> [11630254.782457] [<ffffffff8105870c>] ? check_preempt_curr+0x7c/0x90
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
> [11630254.782462] [<ffffffff81064a2e>] ? try_to_wake_up+0x24e/0x3e0
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
> [11630254.782481] [<ffffffffa07da6ca>] ? lc_watchdog_touch+0x7a/0x190
> [libcfs]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
> [11630254.782524] [<ffffffffa0bb6f52>] ?
> tgt_request_handle+0x5b2/0x1230 [ptlrpc]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
> [11630254.782564] [<ffffffffa0b5f5d1>] ? ptlrpc_main+0xe41/0x1920
> [ptlrpc]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
> [11630254.782570] [<ffffffff81014959>] ? sched_clock+0x9/0x10
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
> [11630254.782576] [<ffffffff81529e1e>] ? thread_return+0x4e/0x7d0
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
> [11630254.782615] [<ffffffffa0b5e790>] ? ptlrpc_main+0x0/0x1920 [ptlrpc]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
> [11630254.782622] [<ffffffff8109e71e>] ? kthread+0x9e/0xc0
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
> [11630254.782626] [<ffffffff8109e680>] ? kthread+0x0/0xc0
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
> [11630254.782632] [<ffffffff8100c20a>] ? child_rip+0xa/0x20
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
> [11630254.782636] [<ffffffff8109e680>] ? kthread+0x0/0xc0
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
> [11630254.782641] [<ffffffff8100c200>] ? child_rip+0x0/0x20
>
>
> Later, that same process showed:
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773156]
> LNet: Service thread pid 24449 was inactive for 200.00s. The thread
> might be hung, or it might only be slow and will resume later. Dumping
> the stack trace for debugging purposes:
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773163]
> Pid: 24449, comm: ll_ost00_078
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773164]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773165]
> Call Trace:
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
> [11630454.773181] [<ffffffff81010f85>] ? show_trace_log_lvl+0x55/0x70
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
> [11630454.773194] [<ffffffff8152966e>] ? dump_stack+0x6f/0x76
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
> [11630454.773249] [<ffffffffa0407035>] vcmn_err+0xa5/0xf0 [spl]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
> [11630454.773373] [<ffffffffa0465a46>] ? RW_WRITE_HELD+0x66/0xb0 [zfs]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
> [11630454.773393] [<ffffffffa0466eb8>] ?
> dbuf_rele_and_unlock+0x268/0x3f0 [zfs]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
> [11630454.773412] [<ffffffffa04687ba>] ? dbuf_read+0x5ca/0x8a0 [zfs]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
> [11630454.773444] [<ffffffffa04bb032>] zfs_panic_recover+0x52/0x60 [zfs]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
> [11630454.773463] [<ffffffffa0471e3b>]
> dmu_buf_hold_array_by_dnode+0x41b/0x560 [zfs]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
> [11630454.773483] [<ffffffffa0472205>] dmu_buf_hold_array+0x65/0x90
> [zfs]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
> [11630454.773502] [<ffffffffa0472668>] dmu_write+0x68/0x1a0 [zfs]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
> [11630454.773579] [<ffffffffa08fa0ae>] ? lprocfs_oh_tally+0x2e/0x50
> [obdclass]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
> [11630454.773617] [<ffffffffa103f311>] osd_write+0x1d1/0x390 [osd_zfs]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
> [11630454.773659] [<ffffffffa0926aad>] dt_record_write+0x3d/0x130
> [obdclass]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
> [11630454.773860] [<ffffffffa0ba7575>]
> tgt_client_data_write+0x165/0x1b0 [ptlrpc]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
> [11630454.773899] [<ffffffffa0bab575>]
> tgt_client_data_update+0x335/0x680 [ptlrpc]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
> [11630454.773938] [<ffffffffa0bac298>] tgt_client_new+0x3d8/0x6a0
> [ptlrpc]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
> [11630454.773961] [<ffffffffa117fad3>] ofd_obd_connect+0x363/0x400 [ofd]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
> [11630454.773997] [<ffffffffa0b12158>]
> target_handle_connect+0xe58/0x2d30 [ptlrpc]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
> [11630454.774002] [<ffffffff8106d1a5>] ? enqueue_entity+0x125/0x450
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
> [11630454.774006] [<ffffffff8105870c>] ? check_preempt_curr+0x7c/0x90
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
> [11630454.774010] [<ffffffff81064a2e>] ? try_to_wake_up+0x24e/0x3e0
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
> [11630454.774055] [<ffffffffa07da6ca>] ? lc_watchdog_touch+0x7a/0x190
> [libcfs]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
> [11630454.774111] [<ffffffffa0bb6f52>]
> tgt_request_handle+0x5b2/0x1230 [ptlrpc]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
> [11630454.774163] [<ffffffffa0b5f5d1>] ptlrpc_main+0xe41/0x1920 [ptlrpc]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
> [11630454.774167] [<ffffffff81014959>] ? sched_clock+0x9/0x10
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
> [11630454.774170] [<ffffffff81529e1e>] ? thread_return+0x4e/0x7d0
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
> [11630454.774225] [<ffffffffa0b5e790>] ? ptlrpc_main+0x0/0x1920 [ptlrpc]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
> [11630454.774230] [<ffffffff8109e71e>] kthread+0x9e/0xc0
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
> [11630454.774232] [<ffffffff8109e680>] ? kthread+0x0/0xc0
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
> [11630454.774235] [<ffffffff8100c20a>] child_rip+0xa/0x20
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
> [11630454.774237] [<ffffffff8109e680>] ? kthread+0x0/0xc0
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
> [11630454.774239] [<ffffffff8100c200>] ? child_rip+0x0/0x20
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774240]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774243]
> LustreError: dumping log to /tmp/lustre-log.1486613143.24449
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630455.164028]
> Pid: 23795, comm: ll_ost01_026
>
> There were at least four different PIDs showing this situation; the
> affected threads all seem to be named like ll_ost01_063.
>
>
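To see how many threads are affected, the watchdog messages can be pulled out of syslog (a sketch; demonstrated on one captured line, since the real thing needs the OSS logs):

```shell
# Extract the PIDs of service threads the LNet watchdog flagged as
# inactive; demonstrated on one line captured above.
msg='LNet: Service thread pid 24449 was inactive for 200.00s.'
echo "$msg" | grep -oE 'Service thread pid [0-9]+' | grep -oE '[0-9]+'

# On the OSS itself, something like:
#   grep -E 'Service thread pid [0-9]+ was inactive' /var/log/messages
# The binary dump noted above can be decoded to text with:
#   lctl df /tmp/lustre-log.1486613143.24449 /tmp/lustre-log.txt
```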
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>