[lustre-discuss] ZFS PANIC

Bob Ball ball at umich.edu
Fri Feb 10 06:39:39 PST 2017


Hi,

I am getting this message

PANIC: zfs: accessing past end of object 29/7 (size=33792 access=33792+128)
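
If I read the numbers right, this is a 128-byte write starting exactly at
the current end (offset 33792) of object 7 in objset 29, and from the
stack below it happens while a new client record is being written
(tgt_client_new -> tgt_client_data_write). Once the OST is quiet I plan
to inspect that object with zdb, roughly like this (the pool/dataset
names here are placeholders for my setup):

    zdb -d ost-pool              # list datasets; objset 29 should map to one of them
    zdb -dddd ost-pool/ost0 7    # dump object 7: type, size, block pointers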

The affected OST now seems to reject new mounts from clients, and the
connection count that lctl dl reports for the obdfilter device keeps
increasing but never seems to decrease; the commands below show how I am
counting.
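
    lctl dl | grep obdfilter                  # list the obdfilter device(s); last column is the refcount
    lctl get_param obdfilter.*.num_exports    # per-device export count; keeps climbing here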

This is Lustre 2.7.58 with ZFS 0.6.4.2.

Can anyone help me diagnose and fix whatever is going wrong here? I've 
included the stack dump below.

Thanks,
bob


2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781874] 
Showing stack for process 24449
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781876] 
Pid: 24449, comm: ll_ost00_078 Tainted: P           ---------------    
2.6.32-504.16.2.el6_lustre #7
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781878] 
Call Trace:
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781902]  
[<ffffffffa0406f8d>] ? spl_dumpstack+0x3d/0x40 [spl]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781908]  
[<ffffffffa040701d>] ? vcmn_err+0x8d/0xf0 [spl]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781950]  
[<ffffffffa0465a46>] ? RW_WRITE_HELD+0x66/0xb0 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781970]  
[<ffffffffa0466eb8>] ? dbuf_rele_and_unlock+0x268/0x3f0 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781991]  
[<ffffffffa04687ba>] ? dbuf_read+0x5ca/0x8a0 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782024]  
[<ffffffffa04bb032>] ? zfs_panic_recover+0x52/0x60 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782045]  
[<ffffffffa0471e3b>] ? dmu_buf_hold_array_by_dnode+0x41b/0x560 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782068]  
[<ffffffffa0472205>] ? dmu_buf_hold_array+0x65/0x90 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782090]  
[<ffffffffa0472668>] ? dmu_write+0x68/0x1a0 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782147]  
[<ffffffffa08fa0ae>] ? lprocfs_oh_tally+0x2e/0x50 [obdclass]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782173]  
[<ffffffffa103f311>] ? osd_write+0x1d1/0x390 [osd_zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782206]  
[<ffffffffa0926aad>] ? dt_record_write+0x3d/0x130 [obdclass]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782305]  
[<ffffffffa0ba7575>] ? tgt_client_data_write+0x165/0x1b0 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782347]  
[<ffffffffa0bab575>] ? tgt_client_data_update+0x335/0x680 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782388]  
[<ffffffffa0bac298>] ? tgt_client_new+0x3d8/0x6a0 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782407]  
[<ffffffffa117fad3>] ? ofd_obd_connect+0x363/0x400 [ofd]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782443]  
[<ffffffffa0b12158>] ? target_handle_connect+0xe58/0x2d30 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782450]  
[<ffffffff8106d1a5>] ? enqueue_entity+0x125/0x450
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782457]  
[<ffffffff8105870c>] ? check_preempt_curr+0x7c/0x90
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782462]  
[<ffffffff81064a2e>] ? try_to_wake_up+0x24e/0x3e0
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782481]  
[<ffffffffa07da6ca>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782524]  
[<ffffffffa0bb6f52>] ? tgt_request_handle+0x5b2/0x1230 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782564]  
[<ffffffffa0b5f5d1>] ? ptlrpc_main+0xe41/0x1920 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782570]  
[<ffffffff81014959>] ? sched_clock+0x9/0x10
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782576]  
[<ffffffff81529e1e>] ? thread_return+0x4e/0x7d0
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782615]  
[<ffffffffa0b5e790>] ? ptlrpc_main+0x0/0x1920 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782622]  
[<ffffffff8109e71e>] ? kthread+0x9e/0xc0
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782626]  
[<ffffffff8109e680>] ? kthread+0x0/0xc0
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782632]  
[<ffffffff8100c20a>] ? child_rip+0xa/0x20
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782636]  
[<ffffffff8109e680>] ? kthread+0x0/0xc0
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782641]  
[<ffffffff8100c200>] ? child_rip+0x0/0x20


Later, that same process showed:
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773156] 
LNet: Service thread pid 24449 was inactive for 200.00s. The thread 
might be hung, or it might only be slow and will resume later. Dumping 
the stack trace for debugging purposes:
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773163] 
Pid: 24449, comm: ll_ost00_078
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773164]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773165] 
Call Trace:
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773181]  
[<ffffffff81010f85>] ? show_trace_log_lvl+0x55/0x70
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773194]  
[<ffffffff8152966e>] ? dump_stack+0x6f/0x76
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773249]  
[<ffffffffa0407035>] vcmn_err+0xa5/0xf0 [spl]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773373]  
[<ffffffffa0465a46>] ? RW_WRITE_HELD+0x66/0xb0 [zfs]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773393]  
[<ffffffffa0466eb8>] ? dbuf_rele_and_unlock+0x268/0x3f0 [zfs]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773412]  
[<ffffffffa04687ba>] ? dbuf_read+0x5ca/0x8a0 [zfs]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773444]  
[<ffffffffa04bb032>] zfs_panic_recover+0x52/0x60 [zfs]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773463]  
[<ffffffffa0471e3b>] dmu_buf_hold_array_by_dnode+0x41b/0x560 [zfs]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773483]  
[<ffffffffa0472205>] dmu_buf_hold_array+0x65/0x90 [zfs]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773502]  
[<ffffffffa0472668>] dmu_write+0x68/0x1a0 [zfs]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773579]  
[<ffffffffa08fa0ae>] ? lprocfs_oh_tally+0x2e/0x50 [obdclass]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773617]  
[<ffffffffa103f311>] osd_write+0x1d1/0x390 [osd_zfs]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773659]  
[<ffffffffa0926aad>] dt_record_write+0x3d/0x130 [obdclass]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773860]  
[<ffffffffa0ba7575>] tgt_client_data_write+0x165/0x1b0 [ptlrpc]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773899]  
[<ffffffffa0bab575>] tgt_client_data_update+0x335/0x680 [ptlrpc]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773938]  
[<ffffffffa0bac298>] tgt_client_new+0x3d8/0x6a0 [ptlrpc]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773961]  
[<ffffffffa117fad3>] ofd_obd_connect+0x363/0x400 [ofd]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773997]  
[<ffffffffa0b12158>] target_handle_connect+0xe58/0x2d30 [ptlrpc]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774002]  
[<ffffffff8106d1a5>] ? enqueue_entity+0x125/0x450
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774006]  
[<ffffffff8105870c>] ? check_preempt_curr+0x7c/0x90
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774010]  
[<ffffffff81064a2e>] ? try_to_wake_up+0x24e/0x3e0
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774055]  
[<ffffffffa07da6ca>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774111]  
[<ffffffffa0bb6f52>] tgt_request_handle+0x5b2/0x1230 [ptlrpc]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774163]  
[<ffffffffa0b5f5d1>] ptlrpc_main+0xe41/0x1920 [ptlrpc]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774167]  
[<ffffffff81014959>] ? sched_clock+0x9/0x10
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774170]  
[<ffffffff81529e1e>] ? thread_return+0x4e/0x7d0
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774225]  
[<ffffffffa0b5e790>] ? ptlrpc_main+0x0/0x1920 [ptlrpc]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774230]  
[<ffffffff8109e71e>] kthread+0x9e/0xc0
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774232]  
[<ffffffff8109e680>] ? kthread+0x0/0xc0
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774235]  
[<ffffffff8100c20a>] child_rip+0xa/0x20
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774237]  
[<ffffffff8109e680>] ? kthread+0x0/0xc0
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774239]  
[<ffffffff8100c200>] ? child_rip+0x0/0x20
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774240]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774243] 
LustreError: dumping log to /tmp/lustre-log.1486613143.24449
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630455.164028] 
Pid: 23795, comm: ll_ost01_026

There were at least 4 different PIDs that showed this situation, all OST 
service threads with names like ll_ost01_063; the grep below is how I 
pulled them out of the logs.
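
To list the affected threads I grepped the watchdog messages, along
these lines (the log path is per our EL6 syslog config):

    grep -o 'Service thread pid [0-9]*' /var/log/messages | sort -u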



