[lustre-discuss] ZFS PANIC

Bob Ball ball at umich.edu
Fri Feb 10 08:23:09 PST 2017


Well, I find this odd, to say the least.  Everything below is from 
yesterday, and the problem persisted through a couple of reboots.  
Today, shortly after I sent this, I found all the disks idle but this 
one OST out of six totally unresponsive, so I power cycled the system, 
and it came up just fine.  No issues, no complaints, responsive.  So I 
have no idea why this healed itself.

Can anyone enlighten me?

I _think_ that what triggered this was adding a few more client mounts 
of the Lustre file system.  That's when it all went wrong.  Is this 
helpful, or just a coincidence?  Current state from lctl dl:
  18 UP obdfilter umt3B-OST000f umt3B-OST000f_UUID 403
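
For what it's worth, the trailing 403 on that line is the device 
reference count, which climbs as client exports are created.  A quick 
way to watch whether connections really are piling up on the suspect 
OST is something like this (the device name is from my setup; adjust 
for yours):

   # local device list; the last column is the reference count
   lctl dl
   # number of client exports held by the affected OST
   lctl get_param obdfilter.umt3B-OST000f.num_exports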

bob

On 2/10/2017 9:39 AM, Bob Ball wrote:
> Hi,
>
> I am getting this message
>
> PANIC: zfs: accessing past end of object 29/7 (size=33792 access=33792+128)
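>
> In case it helps with the diagnosis: that message comes from 
> zfs_panic_recover() in the DMU.  dmu_buf_hold_array_by_dnode() raises 
> it when a read or write would run past the size ZFS has recorded for 
> an object, here object 7 in dataset 29 (recorded size 33792, 
> attempted access of 128 bytes at offset 33792).  Given 
> tgt_client_data_write() in the trace below, I assume object 7 is this 
> OST's last_rcvd file, which gets a per-client slot written on every 
> new connect.  If it recurs, the object's recorded size could be 
> checked offline with zdb, along these lines (the pool/dataset name is 
> made up; substitute the OST's actual dataset):
>
>    # dump dnode details (size, block size) for object 7
>    zdb -dddd ostpool/ost0000 7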
>
> The affected OST now seems to reject new mounts from clients, and the 
> lctl dl reference count on the obdfilter device increases but never 
> seems to decrease.
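>
> One possible stopgap, not a fix: ZFS has a zfs_recover module option 
> that makes zfs_panic_recover() log a warning instead of panicking.  
> It only masks the symptom and won't repair whatever left the object 
> short, but for the record:
>
>    # at module load time, e.g. in /etc/modprobe.d/zfs.conf
>    options zfs zfs_recover=1
>    # or at runtime
>    echo 1 > /sys/module/zfs/parameters/zfs_recover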
>
> This is Lustre 2.7.58 with zfs 0.6.4.2
>
> Can anyone help me diagnose and fix whatever is going wrong here? I've 
> included the stack dump below.
>
> Thanks,
> bob
>
>
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781874] Showing stack for process 24449
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781876] Pid: 24449, comm: ll_ost00_078 Tainted: P           ---------------    2.6.32-504.16.2.el6_lustre #7
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781878] Call Trace:
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781902]  [<ffffffffa0406f8d>] ? spl_dumpstack+0x3d/0x40 [spl]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781908]  [<ffffffffa040701d>] ? vcmn_err+0x8d/0xf0 [spl]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781950]  [<ffffffffa0465a46>] ? RW_WRITE_HELD+0x66/0xb0 [zfs]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781970]  [<ffffffffa0466eb8>] ? dbuf_rele_and_unlock+0x268/0x3f0 [zfs]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781991]  [<ffffffffa04687ba>] ? dbuf_read+0x5ca/0x8a0 [zfs]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782024]  [<ffffffffa04bb032>] ? zfs_panic_recover+0x52/0x60 [zfs]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782045]  [<ffffffffa0471e3b>] ? dmu_buf_hold_array_by_dnode+0x41b/0x560 [zfs]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782068]  [<ffffffffa0472205>] ? dmu_buf_hold_array+0x65/0x90 [zfs]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782090]  [<ffffffffa0472668>] ? dmu_write+0x68/0x1a0 [zfs]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782147]  [<ffffffffa08fa0ae>] ? lprocfs_oh_tally+0x2e/0x50 [obdclass]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782173]  [<ffffffffa103f311>] ? osd_write+0x1d1/0x390 [osd_zfs]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782206]  [<ffffffffa0926aad>] ? dt_record_write+0x3d/0x130 [obdclass]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782305]  [<ffffffffa0ba7575>] ? tgt_client_data_write+0x165/0x1b0 [ptlrpc]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782347]  [<ffffffffa0bab575>] ? tgt_client_data_update+0x335/0x680 [ptlrpc]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782388]  [<ffffffffa0bac298>] ? tgt_client_new+0x3d8/0x6a0 [ptlrpc]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782407]  [<ffffffffa117fad3>] ? ofd_obd_connect+0x363/0x400 [ofd]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782443]  [<ffffffffa0b12158>] ? target_handle_connect+0xe58/0x2d30 [ptlrpc]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782450]  [<ffffffff8106d1a5>] ? enqueue_entity+0x125/0x450
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782457]  [<ffffffff8105870c>] ? check_preempt_curr+0x7c/0x90
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782462]  [<ffffffff81064a2e>] ? try_to_wake_up+0x24e/0x3e0
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782481]  [<ffffffffa07da6ca>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782524]  [<ffffffffa0bb6f52>] ? tgt_request_handle+0x5b2/0x1230 [ptlrpc]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782564]  [<ffffffffa0b5f5d1>] ? ptlrpc_main+0xe41/0x1920 [ptlrpc]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782570]  [<ffffffff81014959>] ? sched_clock+0x9/0x10
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782576]  [<ffffffff81529e1e>] ? thread_return+0x4e/0x7d0
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782615]  [<ffffffffa0b5e790>] ? ptlrpc_main+0x0/0x1920 [ptlrpc]
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782622]  [<ffffffff8109e71e>] ? kthread+0x9e/0xc0
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782626]  [<ffffffff8109e680>] ? kthread+0x0/0xc0
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782632]  [<ffffffff8100c20a>] ? child_rip+0xa/0x20
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782636]  [<ffffffff8109e680>] ? kthread+0x0/0xc0
> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782641]  [<ffffffff8100c200>] ? child_rip+0x0/0x20
>
>
> Later, that same process showed:
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773156] LNet: Service thread pid 24449 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773163] Pid: 24449, comm: ll_ost00_078
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773164]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773165] Call Trace:
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773181]  [<ffffffff81010f85>] ? show_trace_log_lvl+0x55/0x70
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773194]  [<ffffffff8152966e>] ? dump_stack+0x6f/0x76
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773249]  [<ffffffffa0407035>] vcmn_err+0xa5/0xf0 [spl]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773373]  [<ffffffffa0465a46>] ? RW_WRITE_HELD+0x66/0xb0 [zfs]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773393]  [<ffffffffa0466eb8>] ? dbuf_rele_and_unlock+0x268/0x3f0 [zfs]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773412]  [<ffffffffa04687ba>] ? dbuf_read+0x5ca/0x8a0 [zfs]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773444]  [<ffffffffa04bb032>] zfs_panic_recover+0x52/0x60 [zfs]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773463]  [<ffffffffa0471e3b>] dmu_buf_hold_array_by_dnode+0x41b/0x560 [zfs]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773483]  [<ffffffffa0472205>] dmu_buf_hold_array+0x65/0x90 [zfs]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773502]  [<ffffffffa0472668>] dmu_write+0x68/0x1a0 [zfs]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773579]  [<ffffffffa08fa0ae>] ? lprocfs_oh_tally+0x2e/0x50 [obdclass]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773617]  [<ffffffffa103f311>] osd_write+0x1d1/0x390 [osd_zfs]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773659]  [<ffffffffa0926aad>] dt_record_write+0x3d/0x130 [obdclass]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773860]  [<ffffffffa0ba7575>] tgt_client_data_write+0x165/0x1b0 [ptlrpc]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773899]  [<ffffffffa0bab575>] tgt_client_data_update+0x335/0x680 [ptlrpc]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773938]  [<ffffffffa0bac298>] tgt_client_new+0x3d8/0x6a0 [ptlrpc]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773961]  [<ffffffffa117fad3>] ofd_obd_connect+0x363/0x400 [ofd]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773997]  [<ffffffffa0b12158>] target_handle_connect+0xe58/0x2d30 [ptlrpc]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774002]  [<ffffffff8106d1a5>] ? enqueue_entity+0x125/0x450
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774006]  [<ffffffff8105870c>] ? check_preempt_curr+0x7c/0x90
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774010]  [<ffffffff81064a2e>] ? try_to_wake_up+0x24e/0x3e0
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774055]  [<ffffffffa07da6ca>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774111]  [<ffffffffa0bb6f52>] tgt_request_handle+0x5b2/0x1230 [ptlrpc]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774163]  [<ffffffffa0b5f5d1>] ptlrpc_main+0xe41/0x1920 [ptlrpc]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774167]  [<ffffffff81014959>] ? sched_clock+0x9/0x10
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774170]  [<ffffffff81529e1e>] ? thread_return+0x4e/0x7d0
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774225]  [<ffffffffa0b5e790>] ? ptlrpc_main+0x0/0x1920 [ptlrpc]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774230]  [<ffffffff8109e71e>] kthread+0x9e/0xc0
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774232]  [<ffffffff8109e680>] ? kthread+0x0/0xc0
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774235]  [<ffffffff8100c20a>] child_rip+0xa/0x20
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774237]  [<ffffffff8109e680>] ? kthread+0x0/0xc0
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774239]  [<ffffffff8100c200>] ? child_rip+0x0/0x20
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774240]
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774243] LustreError: dumping log to /tmp/lustre-log.1486613143.24449
> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630455.164028] Pid: 23795, comm: ll_ost01_026
>
> There were at least 4 different PIDs that showed this situation; the 
> threads all have names like ll_ost01_063.
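>
> If anyone wants the full list, the affected thread names can be 
> pulled out of the log with something like:
>
>    grep -o 'll_ost[0-9_]*' /var/log/messages | sort -u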
>
>


