<div dir="ltr"><div>We have a smallish cluster -- a few thousand cores on the client side; four OSSs on the Lustre server side.</div><div><br></div><div>Under otherwise normal operations, some of the clients will stop being able to find some of the OSTs. </div><div><br></div><div>When this happens, the OSSs start seeing an escalating error count. As more clients hit this condition, we start seeing tens of thousands of errors of the following sort on the OSS, eventually resulting in a kernel panic on the OSS with what looks like "LNET/ptlrpc" messages. </div><div><br></div><div>We have tried this with clients at v2.5.3 and v2.7.55. The OSSs are running v2.7.55. The kernel on the OSS side is based on RHEL's 2.6.32-504.23.4.el6.x86_64, with the 2.7.55 server patches of course.</div><div><br></div><div>OSS panic message:</div><div><font face="monospace, monospace" size="1">LustreError: 31929:0:(client.c:210:__ptlrpc_prep_bulk_page()) ASSERTION( pageoffset + len <= ((1UL) << 12) ) failed: </font></div><div><font face="monospace, monospace" size="1">LustreError: 31929:0:(client.c:210:__ptlrpc_prep_bulk_page()) LBUG</font></div><div><font face="monospace, monospace" size="1">Kernel panic - not syncing: LBUG</font></div><div><font face="monospace, monospace" size="1">Pid: 31929, comm: ll_ost_io00_050 Tainted: P</font></div><div><br></div><div>We think this is because the clients (randomly) are unable to find the OSTs. 
The clients show messages like the following:</div><div><font face="monospace, monospace" size="1">Oct 15 23:08:29 client00 kernel: Lustre: 60196:0:(client.c:1918:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1444964873/real 1444964873] req@ffff8801026d1400 x1514474144942020/t0(0) o8->fs00-OST012d-osc-ffff88041640a000@172.18.83.180@o2ib:28/4 lens 400/544 e 0 to 1 dl 1444964909 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1</font></div><div><font face="monospace, monospace" size="1">Oct 15 23:12:53 client00 kernel: LNetError: 60196:0:(o2iblnd_cb.c:1322:kiblnd_connect_peer()) Can't resolve addr for 172.16.10.12@o2ib: -19</font></div><div><font face="monospace, monospace" size="1">Oct 15 23:12:53 client00 kernel: LNetError: 60196:0:(o2iblnd_cb.c:1322:kiblnd_connect_peer()) Skipped 385 previous similar messages</font></div><div><br></div><div>It says "Can't resolve addr", but the clients can resolve the address of the OSS via DNS, so I don't know what "resolve" means in this context.</div><div><br></div><div>The OSTs are always actually available on the OSSes, and most (roughly 99%) of the clients can always talk to them even while a few clients are showing the above errors.</div><div><br></div><div>It's just that, inexplicably, some of the clients sometimes won't connect to some of the OSTs even though everybody else can. </div><div><br></div><div>We see a ton of the following throughout the day on the OSSs, even when the OSSs are all up and seem to be serving data without issue: </div><div><font face="monospace, monospace" size="1">Oct 15 05:12:23 OSS02 kernel: LustreError: 137-5: fs00-OST00c9_UUID: not available for connect from [IP]@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server. 
</font></div><div><font face="monospace, monospace" size="1">Oct 15 05:12:23 OSS02 kernel: LustreError: Skipped 700 previous similar messages </font></div><div><br></div><div>This appears to show lots of clients trying to reach "fs00-OST00c9" via an OSS that (a) is a valid HA service node for that OST, but (b) isn't actually the one serving it at the moment. So we'd expect the client to move on to the next service node and find the OST there, which is what 99% of the clients actually do. But randomly, some of the clients just keep cycling through the available service nodes and never find the OSTs.</div><div><br></div><div>We also see a lot of eviction notices throughout the day on all servers (MDS and OSS):</div><div><font face="monospace, monospace" size="1">Oct 15 23:23:45 MDS00 kernel: Lustre: fs00-MDT0000: haven't heard from client ac1445c9-2178-b3c9-c701-d6ff83e13210 (at [IP]@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff882028bd6800, cur 1444965825 expire 1444965675 last 1444965598</font></div><div><font face="monospace, monospace" size="1">Oct 15 23:23:45 MDS00 kernel: Lustre: Skipped 10 previous similar messages</font></div><div><br></div><div>We're pretty sure the above is a totally unrelated issue, but it is putting additional pressure on the OSSs. Add it all up, and the storage cluster could be getting >10k errors in a given second.</div><div><br></div><div>Eventually, the glut of invalid client attempts of each type results in a kernel panic on the OSS, usually referencing ptlrpc_prep_bulk_page like the one below. 
</div><div><br></div><div><font face="monospace, monospace" size="1">LustreError: 31929:0:(client.c:210:__ptlrpc_prep_bulk_page()) ASSERTION( pageoffset + len <= ((1UL) << 12) ) failed: </font></div><div><font face="monospace, monospace" size="1">LustreError: 31929:0:(client.c:210:__ptlrpc_prep_bulk_page()) LBUG</font></div><div><font face="monospace, monospace" size="1">Kernel panic - not syncing: LBUG</font></div><div><font face="monospace, monospace" size="1">Pid: 31929, comm: ll_ost_io00_050 Tainted: P           ---------------    2.6.32-504.23.4.el6.x86_64 #1</font></div><div><font face="monospace, monospace" size="1">Call Trace:</font></div><div><font face="monospace, monospace" size="1"> [<ffffffff8152931c>] ? panic+0xa7/0x16f</font></div><div><font face="monospace, monospace" size="1"> [<ffffffffa097becb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]</font></div><div><font face="monospace, monospace" size="1"> [<ffffffffa0cae5e8>] ? __ptlrpc_prep_bulk_page+0x118/0x1e0 [ptlrpc]</font></div><div><font face="monospace, monospace" size="1"> [<ffffffffa0cae6c1>] ? ptlrpc_prep_bulk_page_nopin+0x11/0x20 [ptlrpc]</font></div><div><font face="monospace, monospace" size="1"> [<ffffffffa0d2c162>] ? tgt_brw_read+0xa92/0x11d0 [ptlrpc]</font></div><div><font face="monospace, monospace" size="1"> [<ffffffffa0cbfa0b>] ? lustre_pack_reply_v2+0x1eb/0x280 [ptlrpc]</font></div><div><font face="monospace, monospace" size="1"> [<ffffffffa0cbfb46>] ? lustre_pack_reply_flags+0xa6/0x1e0 [ptlrpc]</font></div><div><font face="monospace, monospace" size="1"> [<ffffffffa098968a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]</font></div><div><font face="monospace, monospace" size="1"> [<ffffffffa0d2994c>] ? tgt_request_handle+0xa4c/0x1290 [ptlrpc]</font></div><div><font face="monospace, monospace" size="1"> [<ffffffffa0cd15b1>] ? ptlrpc_main+0xe41/0x1910 [ptlrpc]</font></div><div><font face="monospace, monospace" size="1"> [<ffffffff81529a1e>] ? 
thread_return+0x4e/0x7d0</font></div><div><font face="monospace, monospace" size="1"> [<ffffffffa0cd0770>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]</font></div><div><font face="monospace, monospace" size="1"> [<ffffffff8109e78e>] ? kthread+0x9e/0xc0</font></div><div><font face="monospace, monospace" size="1"> [<ffffffff8100c28a>] ? child_rip+0xa/0x20</font></div><div><font face="monospace, monospace" size="1"> [<ffffffff8109e6f0>] ? kthread+0x0/0xc0</font></div><div><font face="monospace, monospace" size="1"> [<ffffffff8100c280>] ? child_rip+0x0/0x20</font></div><div><font face="monospace, monospace" size="1">[Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is b0)</font></div><div><font face="monospace, monospace" size="1">dmar: DRHD: handling fault status reg 2</font></div><div><font face="monospace, monospace" size="1">dmar: INTR-REMAP: Request device [[82:00.0] fault index 48</font></div><div><font face="monospace, monospace" size="1">INTR-REMAP:[fault reason 34] Present field in the IRTE entry is clear</font></div><div><font face="monospace, monospace" size="1">dmar: INTR-REMAP: Request device [[82:00.0] fault index 4a</font></div><div><font face="monospace, monospace" size="1">INTR-REMAP:[fault reason 34] Present field in the IRTE entry is clear</font></div><div><font face="monospace, monospace" size="1">dmar: DRHD: handling fault status reg 200</font></div><div><br></div><div>I've been trying to find information on this sort of thing, but it's not exactly a common problem. :-( Thanks for your time and assistance.</div><div><br></div></div>