[lustre-discuss] OSS Panics in ptlrpc_prep_bulk_page

Exec Unerd execunerd at gmail.com
Fri Oct 16 14:23:48 PDT 2015


We have a smallish cluster -- a few thousand cores on the client side;
four OSSs on the Lustre server side.

Under otherwise normal operations, some of the clients will stop being able
to find some of the OSTs.

When this happens, the OSSs start seeing an escalating error count. As more
clients hit this condition, we start seeing tens of thousands of errors of
the following sort on the OSS, eventually ending in a kernel panic on the
OSS with what look like LNET/ptlrpc messages.

We have tried this with clients running v2.5.3 and v2.7.55. The OSSs are
running v2.7.55. The kernel on the OSS side is based on RHEL's
2.6.32-504.23.4.el6.x86_64, with the 2.7.55 server patches of course.

OSS panic message:
LustreError: 31929:0:(client.c:210:__ptlrpc_prep_bulk_page()) ASSERTION(
pageoffset + len <= ((1UL) << 12) ) failed:
LustreError: 31929:0:(client.c:210:__ptlrpc_prep_bulk_page()) LBUG
Kernel panic - not syncing: LBUG
Pid: 31929, comm: ll_ost_io00_050 Tainted: P
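
For reference, ((1UL) << 12) is 4096, i.e. PAGE_SIZE on x86_64, so the
assertion is saying that a single bulk fragment (page offset plus length)
would run past the end of a page. A trivial illustration of the failing
condition, with made-up numbers purely for illustration:

  #include <assert.h>
  #include <stdio.h>

  #define PAGE_SIZE (1UL << 12)          /* 4096 bytes on x86_64 */

  int main(void)
  {
          /* hypothetical fragment from a bulk I/O request */
          unsigned long pageoffset = 3072;
          unsigned long len        = 2048;

          /* 3072 + 2048 = 5120 > 4096, mirroring the failed LASSERT */
          printf("pageoffset + len = %lu, limit = %lu\n",
                 pageoffset + len, PAGE_SIZE);
          assert(pageoffset + len <= PAGE_SIZE);
          return 0;
  }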

We think this is because some of the clients (seemingly at random) are
unable to find the OSTs. The clients show messages like the following:
Oct 15 23:08:29 client00 kernel: Lustre:
60196:0:(client.c:1918:ptlrpc_expire_one_request()) @@@ Request sent has
timed out for slow reply: [sent 1444964873/real 1444964873]
req at ffff8801026d1400 x1514474144942020/t0(0)
o8->fs00-OST012d-osc-ffff88041640a000 at 172.18.83.180@o2ib:28/4 lens 400/544
e 0 to 1 dl 1444964909 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 15 23:12:53 client00 kernel: LNetError:
60196:0:(o2iblnd_cb.c:1322:kiblnd_connect_peer()) Can't resolve addr for
172.16.10.12 at o2ib: -19
Oct 15 23:12:53 client00 kernel: LNetError:
60196:0:(o2iblnd_cb.c:1322:kiblnd_connect_peer()) Skipped 385 previous
similar messages

It says "Can't resolve addr", but they can resolve the address of the OSS
via DNS, so I don't know what "resolve" means in this context
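
For what it's worth, -19 is -ENODEV ("No such device") in kernel errno
terms, and the message comes from kiblnd_connect_peer() failing to set up
an RDMA connection to the peer NID, so our (possibly wrong) reading is that
"resolve" here means o2iblnd resolving the NID to an RDMA route/device on
the IB fabric, not DNS. A trivial userspace check of the errno mapping,
just to confirm the number (nothing Lustre-specific):

  #include <errno.h>
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
          int rc = -19;   /* the rc from the client log above */
          /* prints: rc -19 -> No such device (ENODEV = 19) */
          printf("rc %d -> %s (ENODEV = %d)\n", rc, strerror(-rc), ENODEV);
          return 0;
  }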

The OSTs are always actually available on the OSSs, and most (roughly 99%)
of the clients can always talk to them, even while a few clients are showing
the above errors.

It's just that, inexplicably, some of the clients sometimes won't connect
to some of the OSTs even though everybody else can.

We see a ton of the following throughout the day on the OSSs, even when the
OSSs are all up and seem to be serving data without issue:
Oct 15 05:12:23 OSS02 kernel: LustreError: 137-5: fs00-OST00c9_UUID: not
available for connect from [IP]@o2ib (no target). If you are running an HA
pair check that the target is mounted on the other server.
Oct 15 05:12:23 OSS02 kernel: LustreError: Skipped 700 previous similar
messages

This appears to show lots of clients trying to reach "fs00-OST00c9" via an
OSS that (a) is a valid HA service node for that OST, but (b) isn't actually
the one serving it at the moment. So we'd expect the client to move on to
the next service node and find the OST there, which is what 99% of the
clients actually do. But randomly, some of the clients just keep cycling
through the available service nodes and never find the OSTs.

We also see a lot of eviction notices throughout the day on all servers
(MDS and OSS):
Oct 15 23:23:45 MDS00 kernel: Lustre: fs00-MDT0000: haven't heard from
client ac1445c9-2178-b3c9-c701-d6ff83e13210 (at [IP]@o2ib) in 227 seconds.
I think it's dead, and I am evicting it. exp ffff882028bd6800, cur
1444965825 expire 1444965675 last 1444965598
Oct 15 23:23:45 MDS00 kernel: Lustre: Skipped 10 previous similar messages

We're pretty sure the above is a totally unrelated issue, but it is putting
additional pressure on the OSSs. Add it all up, and the storage cluster
could be getting >10k errors in a given second.

Eventually, the glut of invalid client attempts of each type results in a
kernel panic on the OSS, usually referencing ptlrpc_prep_bulk_page like the
one below.

LustreError: 31929:0:(client.c:210:__ptlrpc_prep_bulk_page()) ASSERTION(
pageoffset + len <= ((1UL) << 12) ) failed:
LustreError: 31929:0:(client.c:210:__ptlrpc_prep_bulk_page()) LBUG
Kernel panic - not syncing: LBUG
Pid: 31929, comm: ll_ost_io00_050 Tainted: P           ---------------
 2.6.32-504.23.4.el6.x86_64 #1
Call Trace:
 [<ffffffff8152931c>] ? panic+0xa7/0x16f
 [<ffffffffa097becb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
 [<ffffffffa0cae5e8>] ? __ptlrpc_prep_bulk_page+0x118/0x1e0 [ptlrpc]
 [<ffffffffa0cae6c1>] ? ptlrpc_prep_bulk_page_nopin+0x11/0x20 [ptlrpc]
 [<ffffffffa0d2c162>] ? tgt_brw_read+0xa92/0x11d0 [ptlrpc]
 [<ffffffffa0cbfa0b>] ? lustre_pack_reply_v2+0x1eb/0x280 [ptlrpc]
 [<ffffffffa0cbfb46>] ? lustre_pack_reply_flags+0xa6/0x1e0 [ptlrpc]
 [<ffffffffa098968a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
 [<ffffffffa0d2994c>] ? tgt_request_handle+0xa4c/0x1290 [ptlrpc]
 [<ffffffffa0cd15b1>] ? ptlrpc_main+0xe41/0x1910 [ptlrpc]
 [<ffffffff81529a1e>] ? thread_return+0x4e/0x7d0
 [<ffffffffa0cd0770>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
 [<ffffffff8109e78e>] ? kthread+0x9e/0xc0
 [<ffffffff8100c28a>] ? child_rip+0xa/0x20
 [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
[Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is b0)
dmar: DRHD: handling fault status reg 2
dmar: INTR-REMAP: Request device [[82:00.0] fault index 48
INTR-REMAP:[fault reason 34] Present field in the IRTE entry is clear
dmar: INTR-REMAP: Request device [[82:00.0] fault index 4a
INTR-REMAP:[fault reason 34] Present field in the IRTE entry is clear
dmar: DRHD: handling fault status reg 200
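
Reading the trace, the path is tgt_brw_read -> ptlrpc_prep_bulk_page_nopin
-> __ptlrpc_prep_bulk_page, i.e. the OSS is building the bulk descriptor
for a client read when the per-fragment check fires. As far as we can tell,
the offsets and lengths at that point are derived from the I/O request the
client sent, which would mean a single garbled read request is enough to
LBUG the whole OSS. A heavily simplified sketch of that stage as we
understand it (the structures, names, and loop are ours, only the asserted
condition comes from the panic message):

  #include <assert.h>
  #include <stddef.h>

  #define PAGE_SIZE (1UL << 12)

  struct page;                        /* stand-in for the kernel's struct page */

  /* hypothetical per-page fragment of a bulk read, roughly what the read
   * path sees after the request has been mapped to local buffers */
  struct io_frag {
          struct page  *page;
          unsigned int  pageoffset;   /* start of the data within the page */
          unsigned int  len;          /* bytes to send from this page */
  };

  /* roughly what the read path does per fragment while building the bulk
   * descriptor for the reply; the assert mirrors the LASSERT in
   * __ptlrpc_prep_bulk_page() that is panicking our OSS */
  static void prep_bulk_reply(const struct io_frag *frags, size_t nfrags)
  {
          for (size_t i = 0; i < nfrags; i++) {
                  /* if a request-derived fragment crosses a page boundary,
                   * the server LBUGs instead of rejecting the request */
                  assert(frags[i].pageoffset + frags[i].len <= PAGE_SIZE);
                  /* ... queue frags[i] on the bulk descriptor ... */
          }
  }

  int main(void)
  {
          struct io_frag ok = { NULL, 0, 4096 };   /* exactly one page: fine */
          prep_bulk_reply(&ok, 1);
          return 0;
  }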

I've been trying to find information on this sort of thing, but it's not
exactly a common problem. :-( Thanks for your time and assistance.