[lustre-discuss] OSS Panics in ptlrpc_prep_bulk_page

Stearman, Marc stearman2 at llnl.gov
Fri Oct 16 15:21:18 PDT 2015


I agree with Patrick.  Funny thing about network-based file systems is that they tend not to work when the network is failing.  Are you seeing any errors on your IB fabric?  Any errors from the subnet manager?
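If it helps, the standard InfiniBand diagnostics are a reasonable first pass; exact tool names and options depend on which OFED/infiniband-diags you have installed, so treat this as a sketch rather than a recipe:

    ibstat                 # local HCA/port state on each node
    ibqueryerrors          # fabric-wide error counters (symbol errors, link downed, etc.)
    iblinkinfo             # link state/width/speed for every link in the fabric

It's also worth checking the opensm log (often /var/log/opensm.log, depending on your setup) on whichever node is running the subnet manager.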

-Marc

----
D. Marc Stearman
Lustre Operations Lead
stearman2 at llnl.gov
Office:  925-423-9670
Mobile:  925-216-7516




> On Oct 16, 2015, at 3:09 PM, Patrick Farrell <paf at cray.com> wrote:
> 
> Good afternoon,
> 
> I think you've got to unwind this a bit.  You've got a massive number of communication errors - I'd start there and try to analyze those.  You've also got nodes trying to reach the failover partners of some of your OSTs - Are the OSSes dying?  (That could cause the communication errors.)  Or is it simply because the clients can't reliably communicate with them?
> 
> It's extremely likely that everything flows from the communication errors or their immediate cause.  For example, they're likely causing the evictions.
> 
> I'd start with and concentrate on those.  There should be a bit more info either from the clients reporting the errors or from the nodes they're trying to connect to.
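> One way to get that extra info (a sketch - debug flag names vary a little between versions) is to raise the LNet-related debug flags on an affected client and on the OSS it's failing to reach, then dump the kernel debug buffer around one of the failures:
> 
>     lctl set_param debug=+neterror     # usually on by default, but worth confirming
>     lctl set_param debug=+net          # much chattier; enable briefly
>     # ... wait for / reproduce one of the connection failures ...
>     lctl dk /tmp/lustre-debug.$(hostname).log
> 
> Comparing timestamps between the client-side and server-side dumps should show whether the requests are leaving the client at all.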
> 
> - Patrick
> 
> From: lustre-discuss [lustre-discuss-bounces at lists.lustre.org] on behalf of Exec Unerd [execunerd at gmail.com]
> Sent: Friday, October 16, 2015 4:23 PM
> To: Lustre discussion
> Subject: [lustre-discuss] OSS Panics in ptlrpc_prep_bulk_page
> 
> We have a smallish cluster -- a few thousand cores on the client side; four OSSs on the Lustre server side.
> 
> Under otherwise normal operations, some of the clients will stop being able to find some of the OSTs. 
> 
> When this happens, the OSSs start seeing an escalating error count. As more clients hit this condition, we start seeing tens of thousands of errors of the following sort on the OSSs, eventually resulting in a kernel panic with what looks like "LNET/ptlrpc" messages. 
> 
> We have tried this with clients at v2.5.3 and v2.7.55. The OSSs are running v2.7.55. The kernel on the OSS side is based on RHEL's 2.6.32-504.23.4.el6.x86_64, with the 2.7.55 server patches of course.
> 
> OSS panic message:
> LustreError: 31929:0:(client.c:210:__ptlrpc_prep_bulk_page()) ASSERTION( pageoffset + len <= ((1UL) << 12) ) failed: 
> LustreError: 31929:0:(client.c:210:__ptlrpc_prep_bulk_page()) LBUG
> Kernel panic - not syncing: LBUG
> Pid: 31929, comm: ll_ost_io00_050 Tainted: P
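> For reference, the (1UL) << 12 in the assertion is just the page size:
> 
>     $ echo $((1 << 12))
>     4096
>     $ getconf PAGE_SIZE
>     4096
> 
> so the LBUG appears to be complaining that a single bulk I/O fragment was described with an offset + length running past a 4 KB page boundary.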
> 
> We think this is because the clients (randomly) are unable to find the OSTs. The clients show messages like the following:
> Oct 15 23:08:29 client00 kernel: Lustre: 60196:0:(client.c:1918:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1444964873/real 1444964873] req at ffff8801026d1400 x1514474144942020/t0(0) o8->fs00-OST012d-osc-ffff88041640a000 at 172.18.83.180@o2ib:28/4 lens 400/544 e 0 to 1 dl 1444964909 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
> Oct 15 23:12:53 client00 kernel: LNetError: 60196:0:(o2iblnd_cb.c:1322:kiblnd_connect_peer()) Can't resolve addr for 172.16.10.12 at o2ib: -19
> Oct 15 23:12:53 client00 kernel: LNetError: 60196:0:(o2iblnd_cb.c:1322:kiblnd_connect_peer()) Skipped 385 previous similar messages
> 
> It says "Can't resolve addr", but they can resolve the address of the OSS via DNS, so I don't know what "resolve" means in this context.
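> If I had to guess, "resolve" here is the RDMA address-resolution step inside the o2iblnd driver rather than anything DNS-related, and -19 looks like ENODEV, i.e. no usable IB device/route for that address at that moment. From an affected client the LNet path can at least be tested directly (the NIDs below are just the ones from the log above):
> 
>     lctl list_nids                  # confirm the client's own o2ib NID is still configured
>     lctl ping 172.16.10.12@o2ib     # LNet-level ping of the OSS NID the client couldn't reach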
> 
> The OSTs are always actually available on the OSSes, and most (e.g. 99%) of the clients can always talk to them even while a few clients are showing the above errors.
> 
> It's just that, inexplicably, some of the clients sometimes won't connect to some of the OSTs even though everybody else can. 
> 
> We see a ton of the following throughout the day on the OSSs, even when the OSSs are all up and seem to be serving data without issue: 
> Oct 15 05:12:23 OSS02 kernel: LustreError: 137-5: fs00-OST00c9_UUID: not available for connect from [IP]@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server. 
> Oct 15 05:12:23 OSS02 kernel: LustreError: Skipped 700 previous similar messages 
> 
> This appears to show lots of clients trying to reach "fs00-OST00c9" via an OSS that (a) is a valid HA service node for that OST, but (b) isn't actually the one serving it at the moment. So we'd expect the client to move on to the next service node and find the OST there... Which is what 99% of the clients actually do. But randomly, some of the clients just keep cycling through the available service nodes and never find the OSTs.
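> One thing that might show where a stuck client is getting wedged is its import for the problem OST (a sketch; parameter paths may differ slightly between the 2.5 and 2.7 clients):
> 
>     # which server UUID the client currently thinks owns the OST
>     lctl get_param osc.fs00-OST00c9-osc-*.ost_conn_uuid
>     # full import state: connection state, current NID, failover_nids list, retry counts
>     lctl get_param osc.fs00-OST00c9-osc-*.import
> 
> In particular, whether the failover_nids list in the import actually lists both service nodes for that OST.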
> 
> We also see a lot of eviction notices throughout the day on all servers (MDS and OSS).
> Oct 15 23:23:45 MDS00 kernel: Lustre: fs00-MDT0000: haven't heard from client ac1445c9-2178-b3c9-c701-d6ff83e13210 (at [IP]@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff882028bd6800, cur 1444965825 expire 1444965675 last 1444965598
> Oct 15 23:23:45 MDS00 kernel: Lustre: Skipped 10 previous similar messages
> 
> We're pretty sure the above is a totally unrelated issue, but it is putting additional pressure on the OSSs. Add it all up, and the storage cluster could be getting >10k errors in a given second.
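> A rough way to put numbers on that is to count LustreError lines per minute in syslog on one OSS (log path depends on your syslog setup, and the "Skipped N previous similar messages" lines mean the real counts are higher still):
> 
>     awk '/LustreError/ { split($3, t, ":"); print $1, $2, t[1] ":" t[2] }' /var/log/messages | sort | uniq -c | sort -rn | head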
> 
> Eventually, the glut of invalid client attempts of each type results in a kernel panic on the OSS, usually referencing ptlrpc_prep_bulk_page like the one below.
> 
> LustreError: 31929:0:(client.c:210:__ptlrpc_prep_bulk_page()) ASSERTION( pageoffset + len <= ((1UL) << 12) ) failed: 
> LustreError: 31929:0:(client.c:210:__ptlrpc_prep_bulk_page()) LBUG
> Kernel panic - not syncing: LBUG
> Pid: 31929, comm: ll_ost_io00_050 Tainted: P           ---------------    2.6.32-504.23.4.el6.x86_64 #1
> Call Trace:
>  [<ffffffff8152931c>] ? panic+0xa7/0x16f
>  [<ffffffffa097becb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
>  [<ffffffffa0cae5e8>] ? __ptlrpc_prep_bulk_page+0x118/0x1e0 [ptlrpc]
>  [<ffffffffa0cae6c1>] ? ptlrpc_prep_bulk_page_nopin+0x11/0x20 [ptlrpc]
>  [<ffffffffa0d2c162>] ? tgt_brw_read+0xa92/0x11d0 [ptlrpc]
>  [<ffffffffa0cbfa0b>] ? lustre_pack_reply_v2+0x1eb/0x280 [ptlrpc]
>  [<ffffffffa0cbfb46>] ? lustre_pack_reply_flags+0xa6/0x1e0 [ptlrpc]
>  [<ffffffffa098968a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
>  [<ffffffffa0d2994c>] ? tgt_request_handle+0xa4c/0x1290 [ptlrpc]
>  [<ffffffffa0cd15b1>] ? ptlrpc_main+0xe41/0x1910 [ptlrpc]
>  [<ffffffff81529a1e>] ? thread_return+0x4e/0x7d0
>  [<ffffffffa0cd0770>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
>  [<ffffffff8109e78e>] ? kthread+0x9e/0xc0
>  [<ffffffff8100c28a>] ? child_rip+0xa/0x20
>  [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
>  [<ffffffff8100c280>] ? child_rip+0x0/0x20
> [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is b0)
> dmar: DRHD: handling fault status reg 2
> dmar: INTR-REMAP: Request device [[82:00.0] fault index 48
> INTR-REMAP:[fault reason 34] Present field in the IRTE entry is clear
> dmar: INTR-REMAP: Request device [[82:00.0] fault index 4a
> INTR-REMAP:[fault reason 34] Present field in the IRTE entry is clear
> dmar: DRHD: handling fault status reg 200
> 
> I've been trying to find information on this sort of thing, but it's not exactly a common problem. :-( Thanks for your time and assistance.
> 
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


