[lustre-discuss] ZFS PANIC

Bob Ball ball at umich.edu
Mon Feb 13 10:00:54 PST 2017


OK, so, I tried mounts from some new systems today, and each time the 
new client attempts to mount, the zfs PANIC is thrown.  This happened 
from 2 separate client machines.  It seems clear from the 
responsiveness problem last week that it is impacting a single OST.  
After it happens, I power cycle the OSS because it will not shut down 
cleanly, and it comes back fine (I had already power cycled the system 
where I tried the mount).  The OSS is quiet, with no excessive traffic 
or load, so that does not match up with the Google hits I found on 
this, where the OSS was under heavy load and a fix was reportedly made 
in an earlier version of zfsonlinux.  The OST I suspect of being at 
the heart of this is always the last to finish connecting, as 
evidenced by the "lctl dl" count of connections.
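
For reference, this is roughly how I am watching those connections on 
the OSS; the OST name below is ours, adjust as needed, and I am not 
certain the two counters are tallied identically:

# last column of "lctl dl" is the reference count on each device
lctl dl | grep obdfilter

# number of client exports (connections) held by one OST
lctl get_param obdfilter.umt3B-OST000f.num_exports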

As I don't know what else to do, I am draining this OST and, once the 
drain completes, will reformat/re-create it using spare disks.  It 
would be nice, though, if someone had a better way to fix this, or 
could point to the actual reason why this is consistently happening 
now.
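
Roughly, the drain follows the usual remove-an-OST recipe; the mount 
point and MDT index below are placeholders for our setup, not 
necessarily what anyone else should use:

# on the MDS: stop new object creation on the suspect OST
lctl set_param osp.umt3B-OST000f-osc-MDT0000.max_create_count=0

# on a client: migrate existing files off that OST
lfs find --ost umt3B-OST000f_UUID /lustre/umt3B | lfs_migrate -y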

bob


On 2/10/2017 11:23 AM, Bob Ball wrote:
> Well, I find this odd, to say the least.  All of this below was from 
> yesterday, and persisted through a couple of reboots.  Today, shortly 
> after I sent this, I found all the disks idle, but this one OST out of 
> 6 was totally unresponsive, so I power cycled the system, and it came up 
> just fine.  No issues, no complaints, responsive....  So I have no 
> idea why this healed itself.
>
> Can anyone enlighten me?
>
> I _think_ that what triggered this was adding a few more client mounts 
> of the Lustre file system.  That's when it all went wrong.  Is this 
> helpful?  Or just a coincidence?  Current state:
>  18 UP obdfilter umt3B-OST000f umt3B-OST000f_UUID 403
>
> bob
>
> On 2/10/2017 9:39 AM, Bob Ball wrote:
>> Hi,
>>
>> I am getting this message
>>
>> PANIC: zfs: accessing past end of object 29/7 (size=33792 
>> access=33792+128)
>>
>> The affected OST seems to reject new mounts from clients now, and the 
>> lctl dl count of connections to its obdfilter device increases, but 
>> never seems to decrease?
>>
>> This is Lustre 2.7.58 with zfs 0.6.4.2
>>
>> Can anyone help me diagnose and fix whatever is going wrong here? 
>> I've included the stack dump below.
>>
>> Thanks,
>> bob
>>
>>
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
>> [11630254.781874] Showing stack for process 24449
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
>> [11630254.781876] Pid: 24449, comm: ll_ost00_078 Tainted: P           
>> ---------------    2.6.32.504.16.2.el6_lustre #7
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
>> [11630254.781878] Call Trace:
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
>> [11630254.781902]  [<ffffffffa0406f8d>] ? spl_dumpstack+0x3d/0x40 [spl]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
>> [11630254.781908]  [<ffffffffa040701d>] ? vcmn_err+0x8d/0xf0 [spl]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
>> [11630254.781950]  [<ffffffffa0465a46>] ? RW_WRITE_HELD+0x66/0xb0 [zfs]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
>> [11630254.781970]  [<ffffffffa0466eb8>] ? 
>> dbuf_rele_and_unlock+0x268/0x3f0 [zfs]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
>> [11630254.781991]  [<ffffffffa04687ba>] ? dbuf_read+0x5ca/0x8a0 [zfs]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
>> [11630254.782024]  [<ffffffffa04bb032>] ? zfs_panic_recover+0x52/0x60 
>> [zfs]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
>> [11630254.782045]  [<ffffffffa0471e3b>] ? 
>> dmu_buf_hold_array_by_dnode+0x41b/0x560 [zfs]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
>> [11630254.782068]  [<ffffffffa0472205>] ? 
>> dmu_buf_hold_array+0x65/0x90 [zfs]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
>> [11630254.782090]  [<ffffffffa0472668>] ? dmu_write+0x68/0x1a0 [zfs]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
>> [11630254.782147]  [<ffffffffa08fa0ae>] ? lprocfs_oh_tally+0x2e/0x50 
>> [obdclass]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
>> [11630254.782173]  [<ffffffffa103f311>] ? osd_write+0x1d1/0x390 
>> [osd_zfs]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
>> [11630254.782206]  [<ffffffffa0926aad>] ? dt_record_write+0x3d/0x130 
>> [obdclass]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
>> [11630254.782305]  [<ffffffffa0ba7575>] ? 
>> tgt_client_data_write+0x165/0x1b0 [ptlrpc]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
>> [11630254.782347]  [<ffffffffa0bab575>] ? 
>> tgt_client_data_update+0x335/0x680 [ptlrpc]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
>> [11630254.782388]  [<ffffffffa0bac298>] ? tgt_client_new+0x3d8/0x6a0 
>> [ptlrpc]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
>> [11630254.782407]  [<ffffffffa117fad3>] ? ofd_obd_connect+0x363/0x400 
>> [ofd]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
>> [11630254.782443]  [<ffffffffa0b12158>] ? 
>> target_handle_connect+0xe58/0x2d30 [ptlrpc]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
>> [11630254.782450]  [<ffffffff8106d1a5>] ? enqueue_entity+0x125/0x450
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
>> [11630254.782457]  [<ffffffff8105870c>] ? check_preempt_curr+0x7c/0x90
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
>> [11630254.782462]  [<ffffffff81064a2e>] ? try_to_wake_up+0x24e/0x3e0
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
>> [11630254.782481]  [<ffffffffa07da6ca>] ? 
>> lc_watchdog_touch+0x7a/0x190 [libcfs]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
>> [11630254.782524]  [<ffffffffa0bb6f52>] ? 
>> tgt_request_handle+0x5b2/0x1230 [ptlrpc]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
>> [11630254.782564]  [<ffffffffa0b5f5d1>] ? ptlrpc_main+0xe41/0x1920 
>> [ptlrpc]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
>> [11630254.782570]  [<ffffffff81014959>] ? sched_clock+0x9/0x10
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
>> [11630254.782576]  [<ffffffff81529e1e>] ? thread_return+0x4e/0x7d0
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
>> [11630254.782615]  [<ffffffffa0b5e790>] ? ptlrpc_main+0x0/0x1920 
>> [ptlrpc]
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
>> [11630254.782622]  [<ffffffff8109e71e>] ? kthread+0x9e/0xc0
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
>> [11630254.782626]  [<ffffffff8109e680>] ? kthread+0x0/0xc0
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
>> [11630254.782632]  [<ffffffff8100c20a>] ? child_rip+0xa/0x20
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
>> [11630254.782636]  [<ffffffff8109e680>] ? kthread+0x0/0xc0
>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
>> [11630254.782641]  [<ffffffff8100c200>] ? child_rip+0x0/0x20
>>
>>
>> Later, that same process showed:
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.773156] LNet: Service thread pid 24449 was inactive for 
>> 200.00s. The thread might be hung, or it might only be slow and will 
>> resume later. Dumping the stack trace for debugging purposes:
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.773163] Pid: 24449, comm: ll_ost00_078
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773164]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.773165] Call Trace:
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.773181]  [<ffffffff81010f85>] ? show_trace_log_lvl+0x55/0x70
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.773194]  [<ffffffff8152966e>] ? dump_stack+0x6f/0x76
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.773249]  [<ffffffffa0407035>] vcmn_err+0xa5/0xf0 [spl]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.773373]  [<ffffffffa0465a46>] ? RW_WRITE_HELD+0x66/0xb0 [zfs]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.773393]  [<ffffffffa0466eb8>] ? 
>> dbuf_rele_and_unlock+0x268/0x3f0 [zfs]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.773412]  [<ffffffffa04687ba>] ? dbuf_read+0x5ca/0x8a0 [zfs]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.773444]  [<ffffffffa04bb032>] zfs_panic_recover+0x52/0x60 
>> [zfs]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.773463]  [<ffffffffa0471e3b>] 
>> dmu_buf_hold_array_by_dnode+0x41b/0x560 [zfs]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.773483]  [<ffffffffa0472205>] dmu_buf_hold_array+0x65/0x90 
>> [zfs]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.773502]  [<ffffffffa0472668>] dmu_write+0x68/0x1a0 [zfs]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.773579]  [<ffffffffa08fa0ae>] ? lprocfs_oh_tally+0x2e/0x50 
>> [obdclass]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.773617]  [<ffffffffa103f311>] osd_write+0x1d1/0x390 [osd_zfs]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.773659]  [<ffffffffa0926aad>] dt_record_write+0x3d/0x130 
>> [obdclass]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.773860]  [<ffffffffa0ba7575>] 
>> tgt_client_data_write+0x165/0x1b0 [ptlrpc]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.773899]  [<ffffffffa0bab575>] 
>> tgt_client_data_update+0x335/0x680 [ptlrpc]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.773938]  [<ffffffffa0bac298>] tgt_client_new+0x3d8/0x6a0 
>> [ptlrpc]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.773961]  [<ffffffffa117fad3>] ofd_obd_connect+0x363/0x400 
>> [ofd]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.773997]  [<ffffffffa0b12158>] 
>> target_handle_connect+0xe58/0x2d30 [ptlrpc]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.774002]  [<ffffffff8106d1a5>] ? enqueue_entity+0x125/0x450
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.774006]  [<ffffffff8105870c>] ? check_preempt_curr+0x7c/0x90
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.774010]  [<ffffffff81064a2e>] ? try_to_wake_up+0x24e/0x3e0
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.774055]  [<ffffffffa07da6ca>] ? 
>> lc_watchdog_touch+0x7a/0x190 [libcfs]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.774111]  [<ffffffffa0bb6f52>] 
>> tgt_request_handle+0x5b2/0x1230 [ptlrpc]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.774163]  [<ffffffffa0b5f5d1>] ptlrpc_main+0xe41/0x1920 
>> [ptlrpc]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.774167]  [<ffffffff81014959>] ? sched_clock+0x9/0x10
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.774170]  [<ffffffff81529e1e>] ? thread_return+0x4e/0x7d0
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.774225]  [<ffffffffa0b5e790>] ? ptlrpc_main+0x0/0x1920 
>> [ptlrpc]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.774230]  [<ffffffff8109e71e>] kthread+0x9e/0xc0
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.774232]  [<ffffffff8109e680>] ? kthread+0x0/0xc0
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.774235]  [<ffffffff8100c20a>] child_rip+0xa/0x20
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.774237]  [<ffffffff8109e680>] ? kthread+0x0/0xc0
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.774239]  [<ffffffff8100c200>] ? child_rip+0x0/0x20
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774240]
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630454.774243] LustreError: dumping log to 
>> /tmp/lustre-log.1486613143.24449
>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 
>> [11630455.164028] Pid: 23795, comm: ll_ost01_026
>>
>> There were at least 4 different PIDs that showed this situation.  They 
>> all seem to have names of the form ll_ost01_063.
>>
>>


