[lustre-discuss] OSS Panics in ptlrpc_prep_bulk_page

Patrick Farrell paf at cray.com
Fri Oct 16 15:09:20 PDT 2015


Good afternoon,

I think you've got to unwind this a bit.  You've got a massive number of communication errors - I'd start there and try to analyze those.  You've also got nodes trying to reach the failover partners of some of your OSTs - Are the OSSes dying?  (That could cause the communication errors.)  Or is it simply because the clients can't reliably communicate with them?

It's extremely likely that everything flows from the communication errors or their immediate cause.  For example, they're likely causing the evictions.

I'd start with those and concentrate on them.  There should be a bit more info, either from the clients reporting the errors or from the nodes they're trying to connect to.

- Patrick

________________________________
From: lustre-discuss [lustre-discuss-bounces at lists.lustre.org] on behalf of Exec Unerd [execunerd at gmail.com]
Sent: Friday, October 16, 2015 4:23 PM
To: Lustre discussion
Subject: [lustre-discuss] OSS Panics in ptlrpc_prep_bulk_page

We have a smallish cluster -- a few thousand cores on the client side; four OSSs on the Lustre server side.

During otherwise normal operation, some of the clients will stop being able to find some of the OSTs.

When this happens, the OSSs start seeing an escalating error count. As more clients hit this condition, we start seeing tens of thousands of errors of the following sort on the OSSs, eventually resulting in a kernel panic on the OSS with what look like "LNET/ptlrpc" messages.

We have tried this with clients at v2.5.3 and v2.7.55. The OSSs are running v2.7.55. The kernel on the OSS side is based on RHEL's 2.6.32-504.23.4.el6.x86_64, with the 2.7.55 server patches of course.

OSS panic message:
LustreError: 31929:0:(client.c:210:__ptlrpc_prep_bulk_page()) ASSERTION( pageoffset + len <= ((1UL) << 12) ) failed:
LustreError: 31929:0:(client.c:210:__ptlrpc_prep_bulk_page()) LBUG
Kernel panic - not syncing: LBUG
Pid: 31929, comm: ll_ost_io00_050 Tainted: P
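
For what it's worth, my paraphrase of that assertion (reconstructed from the message text only, not copied from the Lustre tree) is that every page fragment queued for a bulk transfer has to fit inside a single 4 KiB page, something like:

#include <assert.h>

#define OSS_PAGE_SIZE (1UL << 12)   /* 4096 bytes, the x86_64 PAGE_SIZE */

/* Sketch of the invariant only; the real code uses LASSERT(), which is
 * what fires the LBUG and the "Kernel panic - not syncing: LBUG" above. */
static void prep_bulk_page_check(unsigned long pageoffset, unsigned long len)
{
        assert(pageoffset + len <= OSS_PAGE_SIZE);
}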

We think this is because the clients (randomly) are unable to find the OSTs. The clients show messages like the following:
Oct 15 23:08:29 client00 kernel: Lustre: 60196:0:(client.c:1918:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1444964873/real 1444964873] req@ffff8801026d1400 x1514474144942020/t0(0) o8->fs00-OST012d-osc-ffff88041640a000@172.18.83.180@o2ib:28/4 lens 400/544 e 0 to 1 dl 1444964909 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 15 23:12:53 client00 kernel: LNetError: 60196:0:(o2iblnd_cb.c:1322:kiblnd_connect_peer()) Can't resolve addr for 172.16.10.12@o2ib: -19
Oct 15 23:12:53 client00 kernel: LNetError: 60196:0:(o2iblnd_cb.c:1322:kiblnd_connect_peer()) Skipped 385 previous similar messages

It says "Can't resolve addr", but they can resolve the address of the OSS via DNS, so I don't know what "resolve" means in this context

The OSTs are always actually available on the OSSs, and most (roughly 99%) of the clients can always talk to them, even while a few clients are showing the above errors.

It's just that, inexplicably, some of the clients sometimes won't connect to some of the OSTs even though everybody else can.

We see a ton of the following throughout the day on the OSSs, even when the OSSs are all up and seem to be serving data without issue:
Oct 15 05:12:23 OSS02 kernel: LustreError: 137-5: fs00-OST00c9_UUID: not available for connect from [IP]@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
Oct 15 05:12:23 OSS02 kernel: LustreError: Skipped 700 previous similar messages

This appears to show lots of clients trying to reach "fs00-OST00c9" via an OSS that (a) is a valid HA service node for that OST but (b) isn't actually the one serving it at the moment. So we'd expect the client to move on to the next service node and find the OST there, which is what 99% of the clients actually do. But randomly, some of the clients just keep cycling through the available service nodes and never find the OSTs.
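
To illustrate what I mean (a toy sketch of my understanding, not Lustre source): the client knows every service-node NID for the target and, on a "no target" reply, should just try the next one, wrapping around until one of them actually has the OST mounted. The NIDs below are hypothetical.

#include <stdio.h>

static const char *service_nodes[] = {
        "172.18.83.179@o2ib",       /* hypothetical NIDs for fs00-OST00c9 */
        "172.18.83.180@o2ib",
};
#define NUM_NODES ((int)(sizeof(service_nodes) / sizeof(service_nodes[0])))

static const char *next_service_node(int *cur)
{
        *cur = (*cur + 1) % NUM_NODES;
        return service_nodes[*cur];
}

int main(void)
{
        int cur = 0, i;

        /* A healthy client fails on the first node and lands on the second;
         * the misbehaving clients seem to loop here indefinitely. */
        for (i = 0; i < 4; i++)
                printf("trying %s\n", next_service_node(&cur));
        return 0;
}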

We also see a lot of eviction notices throughout the day on all servers (MDS and OSS).
Oct 15 23:23:45 MDS00 kernel: Lustre: fs00-MDT0000: haven't heard from client ac1445c9-2178-b3c9-c701-d6ff83e13210 (at [IP]@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff882028bd6800, cur 1444965825 expire 1444965675 last 1444965598
Oct 15 23:23:45 MDS00 kernel: Lustre: Skipped 10 previous similar messages
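
Decoding the numbers in that line (my reading of the fields, not taken from the Lustre source): cur, expire, and last look like Unix epoch seconds, and cur - last is the "haven't heard from" age being reported.

#include <stdio.h>

int main(void)
{
        long cur = 1444965825, expire = 1444965675, last = 1444965598;

        printf("cur - last   = %ld s (matches the reported 227 seconds)\n", cur - last);
        printf("cur - expire = %ld s (how far past the printed expire time we are)\n", cur - expire);
        return 0;
}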

We're pretty sure the above is a totally unrelated issue, but it is putting additional pressure on the OSSs. Add it all up, and the storage cluster could be getting >10k errors in a given second.

Eventually, the glut of invalid client attempts of each type results in a kernel panic on the OSS, usually referencing ptlrpc_prep_bulk_page like the one below.

LustreError: 31929:0:(client.c:210:__ptlrpc_prep_bulk_page()) ASSERTION( pageoffset + len <= ((1UL) << 12) ) failed:
LustreError: 31929:0:(client.c:210:__ptlrpc_prep_bulk_page()) LBUG
Kernel panic - not syncing: LBUG
Pid: 31929, comm: ll_ost_io00_050 Tainted: P           ---------------    2.6.32-504.23.4.el6.x86_64 #1
Call Trace:
 [<ffffffff8152931c>] ? panic+0xa7/0x16f
 [<ffffffffa097becb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
 [<ffffffffa0cae5e8>] ? __ptlrpc_prep_bulk_page+0x118/0x1e0 [ptlrpc]
 [<ffffffffa0cae6c1>] ? ptlrpc_prep_bulk_page_nopin+0x11/0x20 [ptlrpc]
 [<ffffffffa0d2c162>] ? tgt_brw_read+0xa92/0x11d0 [ptlrpc]
 [<ffffffffa0cbfa0b>] ? lustre_pack_reply_v2+0x1eb/0x280 [ptlrpc]
 [<ffffffffa0cbfb46>] ? lustre_pack_reply_flags+0xa6/0x1e0 [ptlrpc]
 [<ffffffffa098968a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
 [<ffffffffa0d2994c>] ? tgt_request_handle+0xa4c/0x1290 [ptlrpc]
 [<ffffffffa0cd15b1>] ? ptlrpc_main+0xe41/0x1910 [ptlrpc]
 [<ffffffff81529a1e>] ? thread_return+0x4e/0x7d0
 [<ffffffffa0cd0770>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
 [<ffffffff8109e78e>] ? kthread+0x9e/0xc0
 [<ffffffff8100c28a>] ? child_rip+0xa/0x20
 [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
[Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is b0)
dmar: DRHD: handling fault status reg 2
dmar: INTR-REMAP: Request device [[82:00.0] fault index 48
INTR-REMAP:[fault reason 34] Present field in the IRTE entry is clear
dmar: INTR-REMAP: Request device [[82:00.0] fault index 4a
INTR-REMAP:[fault reason 34] Present field in the IRTE entry is clear
dmar: DRHD: handling fault status reg 200

I've been trying to find information on this sort of thing, but it's not exactly a common problem. :-( Thanks for your time and assistance.
