[lustre-discuss] ZFS PANIC
Bob Ball
ball at umich.edu
Mon Feb 13 10:00:54 PST 2017
OK, so I tried some new client mounts today, and each time a new
client attempts to mount, the zfs PANIC throws. This happened from 2
separate client machines. It seems clear from last week's
responsiveness problem that it is impacting a single OST. After it
happens I power cycle the OSS, because it will not shut down cleanly,
and it comes back fine (I had pre-cycled the system before trying the
mounts). The OSS is quiet, with no excessive traffic or load, so this
does not match the cases I found in Google searches on this, where the
OSS was under heavy load and a fix was reportedly found in an earlier
zfsonlinux version. The OST I suspect is at the heart of this is
always the last to finish connecting, as evidenced by the "lctl dl"
count of connections.
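
For reference, the connection count I'm watching is just the last
column of the "lctl dl" output (the device refcount). A quick way to
pull it for one device -- the sample line is the one from my earlier
mail below; on a live OSS you would pipe the real "lctl dl" output
instead of the echo:

```shell
# The last field of an "lctl dl" line is the device refcount, which is
# what I am using as the connection count. Sample line captured from
# the OSS; replace the echo with a real "lctl dl" on a live system.
echo "18 UP obdfilter umt3B-OST000f umt3B-OST000f_UUID 403" |
  awk '$3 == "obdfilter" { print $4, "connections:", $NF }'
# prints: umt3B-OST000f connections: 403
```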
As I don't know what else to do, I am draining this OST and will
reformat/re-create it on spare disks once the drain completes. It
would be nice, though, if someone had a better way to fix this, or
could point to a reason why this is consistently happening now.
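
In case it helps anyone following along, the drain itself is the
usual deactivate-and-migrate sequence. The device name follows our
fsname (umt3B) from the logs below; the /lustre mount point is just
an illustration, so adjust both to your own setup:

```shell
# Stop new object allocations to the suspect OST -- run on the MDS.
# The osc device name follows the usual fsname-OSTxxxx-osc-MDTxxxx
# pattern; umt3B is this filesystem's name.
lctl --device umt3B-OST000f-osc-MDT0000 deactivate

# Then, from a client (assuming /lustre is the client mount point),
# move every file with objects on that OST onto the remaining OSTs:
lfs find /lustre --ost umt3B-OST000f -type f | lfs_migrate -y
```

Once "lfs find" comes back empty, the OST can be taken down and
reformatted.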
bob
On 2/10/2017 11:23 AM, Bob Ball wrote:
> Well, I find this odd, to say the least. All of this below was from
> yesterday, and persisted through a couple of reboots. Today, shortly
> after I sent this, I found all the disks idle, but this one OST out of
> 6 totally unresponsive, so I power cycled the system, and it came up
> just fine. No issues, no complaints, responsive.... So I have no
> idea why this healed itself.
>
> Can anyone enlighten me?
>
> I _think_ that what triggered this was adding a few more client mounts
> of the lustre file system. That's when it all went wrong. Is this
> helpful? Or just a coincidence? Current state:
> 18 UP obdfilter umt3B-OST000f umt3B-OST000f_UUID 403
>
> bob
>
> On 2/10/2017 9:39 AM, Bob Ball wrote:
>> Hi,
>>
>> I am getting this message
>>
>> PANIC: zfs: accessing past end of object 29/7 (size=33792
>> access=33792+128)
>>
>> The affected OST now seems to reject new mounts from clients, and
>> the lctl dl count of connections to its obdfilter device increases
>> but never seems to decrease.
>>
>> This is Lustre 2.7.58 with zfs 0.6.4.2
>>
>> Can anyone help me diagnose and fix whatever is going wrong here?
>> I've included the stack dump below.
>>
>> Thanks,
>> bob
>>
>>
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
>> [11630254.781874] Showing stack for process 24449
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
>> [11630254.781876] Pid: 24449, comm: ll_ost00_078 Tainted: P
>> --------------- 2.6.32.504.16.2.el6_lustre #7
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
>> [11630254.781878] Call Trace:
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
>> [11630254.781902] [<ffffffffa0406f8d>] ? spl_dumpstack+0x3d/0x40 [spl]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
>> [11630254.781908] [<ffffffffa040701d>] ? vcmn_err+0x8d/0xf0 [spl]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
>> [11630254.781950] [<ffffffffa0465a46>] ? RW_WRITE_HELD+0x66/0xb0 [zfs]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
>> [11630254.781970] [<ffffffffa0466eb8>] ?
>> dbuf_rele_and_unlock+0x268/0x3f0 [zfs]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
>> [11630254.781991] [<ffffffffa04687ba>] ? dbuf_read+0x5ca/0x8a0 [zfs]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
>> [11630254.782024] [<ffffffffa04bb032>] ? zfs_panic_recover+0x52/0x60
>> [zfs]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
>> [11630254.782045] [<ffffffffa0471e3b>] ?
>> dmu_buf_hold_array_by_dnode+0x41b/0x560 [zfs]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
>> [11630254.782068] [<ffffffffa0472205>] ?
>> dmu_buf_hold_array+0x65/0x90 [zfs]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
>> [11630254.782090] [<ffffffffa0472668>] ? dmu_write+0x68/0x1a0 [zfs]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
>> [11630254.782147] [<ffffffffa08fa0ae>] ? lprocfs_oh_tally+0x2e/0x50
>> [obdclass]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
>> [11630254.782173] [<ffffffffa103f311>] ? osd_write+0x1d1/0x390
>> [osd_zfs]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
>> [11630254.782206] [<ffffffffa0926aad>] ? dt_record_write+0x3d/0x130
>> [obdclass]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
>> [11630254.782305] [<ffffffffa0ba7575>] ?
>> tgt_client_data_write+0x165/0x1b0 [ptlrpc]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
>> [11630254.782347] [<ffffffffa0bab575>] ?
>> tgt_client_data_update+0x335/0x680 [ptlrpc]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
>> [11630254.782388] [<ffffffffa0bac298>] ? tgt_client_new+0x3d8/0x6a0
>> [ptlrpc]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
>> [11630254.782407] [<ffffffffa117fad3>] ? ofd_obd_connect+0x363/0x400
>> [ofd]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
>> [11630254.782443] [<ffffffffa0b12158>] ?
>> target_handle_connect+0xe58/0x2d30 [ptlrpc]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
>> [11630254.782450] [<ffffffff8106d1a5>] ? enqueue_entity+0x125/0x450
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
>> [11630254.782457] [<ffffffff8105870c>] ? check_preempt_curr+0x7c/0x90
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
>> [11630254.782462] [<ffffffff81064a2e>] ? try_to_wake_up+0x24e/0x3e0
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
>> [11630254.782481] [<ffffffffa07da6ca>] ?
>> lc_watchdog_touch+0x7a/0x190 [libcfs]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
>> [11630254.782524] [<ffffffffa0bb6f52>] ?
>> tgt_request_handle+0x5b2/0x1230 [ptlrpc]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
>> [11630254.782564] [<ffffffffa0b5f5d1>] ? ptlrpc_main+0xe41/0x1920
>> [ptlrpc]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
>> [11630254.782570] [<ffffffff81014959>] ? sched_clock+0x9/0x10
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
>> [11630254.782576] [<ffffffff81529e1e>] ? thread_return+0x4e/0x7d0
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
>> [11630254.782615] [<ffffffffa0b5e790>] ? ptlrpc_main+0x0/0x1920
>> [ptlrpc]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
>> [11630254.782622] [<ffffffff8109e71e>] ? kthread+0x9e/0xc0
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
>> [11630254.782626] [<ffffffff8109e680>] ? kthread+0x0/0xc0
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
>> [11630254.782632] [<ffffffff8100c20a>] ? child_rip+0xa/0x20
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
>> [11630254.782636] [<ffffffff8109e680>] ? kthread+0x0/0xc0
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
>> [11630254.782641] [<ffffffff8100c200>] ? child_rip+0x0/0x20
>>
>>
>> Later, that same process showed:
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.773156] LNet: Service thread pid 24449 was inactive for
>> 200.00s. The thread might be hung, or it might only be slow and will
>> resume later. Dumping the stack trace for debugging purposes:
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.773163] Pid: 24449, comm: ll_ost00_078
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773164]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.773165] Call Trace:
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.773181] [<ffffffff81010f85>] ? show_trace_log_lvl+0x55/0x70
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.773194] [<ffffffff8152966e>] ? dump_stack+0x6f/0x76
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.773249] [<ffffffffa0407035>] vcmn_err+0xa5/0xf0 [spl]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.773373] [<ffffffffa0465a46>] ? RW_WRITE_HELD+0x66/0xb0 [zfs]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.773393] [<ffffffffa0466eb8>] ?
>> dbuf_rele_and_unlock+0x268/0x3f0 [zfs]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.773412] [<ffffffffa04687ba>] ? dbuf_read+0x5ca/0x8a0 [zfs]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.773444] [<ffffffffa04bb032>] zfs_panic_recover+0x52/0x60
>> [zfs]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.773463] [<ffffffffa0471e3b>]
>> dmu_buf_hold_array_by_dnode+0x41b/0x560 [zfs]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.773483] [<ffffffffa0472205>] dmu_buf_hold_array+0x65/0x90
>> [zfs]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.773502] [<ffffffffa0472668>] dmu_write+0x68/0x1a0 [zfs]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.773579] [<ffffffffa08fa0ae>] ? lprocfs_oh_tally+0x2e/0x50
>> [obdclass]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.773617] [<ffffffffa103f311>] osd_write+0x1d1/0x390 [osd_zfs]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.773659] [<ffffffffa0926aad>] dt_record_write+0x3d/0x130
>> [obdclass]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.773860] [<ffffffffa0ba7575>]
>> tgt_client_data_write+0x165/0x1b0 [ptlrpc]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.773899] [<ffffffffa0bab575>]
>> tgt_client_data_update+0x335/0x680 [ptlrpc]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.773938] [<ffffffffa0bac298>] tgt_client_new+0x3d8/0x6a0
>> [ptlrpc]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.773961] [<ffffffffa117fad3>] ofd_obd_connect+0x363/0x400
>> [ofd]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.773997] [<ffffffffa0b12158>]
>> target_handle_connect+0xe58/0x2d30 [ptlrpc]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.774002] [<ffffffff8106d1a5>] ? enqueue_entity+0x125/0x450
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.774006] [<ffffffff8105870c>] ? check_preempt_curr+0x7c/0x90
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.774010] [<ffffffff81064a2e>] ? try_to_wake_up+0x24e/0x3e0
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.774055] [<ffffffffa07da6ca>] ?
>> lc_watchdog_touch+0x7a/0x190 [libcfs]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.774111] [<ffffffffa0bb6f52>]
>> tgt_request_handle+0x5b2/0x1230 [ptlrpc]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.774163] [<ffffffffa0b5f5d1>] ptlrpc_main+0xe41/0x1920
>> [ptlrpc]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.774167] [<ffffffff81014959>] ? sched_clock+0x9/0x10
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.774170] [<ffffffff81529e1e>] ? thread_return+0x4e/0x7d0
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.774225] [<ffffffffa0b5e790>] ? ptlrpc_main+0x0/0x1920
>> [ptlrpc]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.774230] [<ffffffff8109e71e>] kthread+0x9e/0xc0
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.774232] [<ffffffff8109e680>] ? kthread+0x0/0xc0
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.774235] [<ffffffff8100c20a>] child_rip+0xa/0x20
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.774237] [<ffffffff8109e680>] ? kthread+0x0/0xc0
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.774239] [<ffffffff8100c200>] ? child_rip+0x0/0x20
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774240]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630454.774243] LustreError: dumping log to
>> /tmp/lustre-log.1486613143.24449
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
>> [11630455.164028] Pid: 23795, comm: ll_ost01_026
>>
>> There were at least 4 different PIDs that showed this situation. They
>> seem to be named like ll_ost01_063
>>
>>
>> _______________________________________________
>> lustre-discuss mailing list
>> lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>
>