[lustre-discuss] Diagnosing lustre errors.

Jacek Tomaka jacekt at dug.com
Sat Dec 15 21:22:19 PST 2018


Hello,

I have been trying to chase down some errors on our machines that
lead to the OOM killer killing processes. Sometimes I see OOM
messages and Lustre errors around the same time, but so far I have
failed to determine whether the OOM killer is causing the Lustre
errors, or the Lustre errors (possibly caused by network problems)
are the cause of the OOM. The theory here is that dirty blocks
accumulate in the page cache and cannot be freed because network
delays prevent writing them back to the server.
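One way to watch for that condition is to poll the dirty/writeback
counters; below is a minimal sketch of such a monitor in Python.
Dirty and Writeback are the standard /proc/meminfo fields; the 5 s
interval and 1 GiB threshold are arbitrary choices, not values taken
from this incident.

#!/usr/bin/env python3
# Log a timestamp whenever dirty + writeback pages exceed a threshold,
# so the output can be lined up against OOM and Lustre messages.
import time

def meminfo_kb(field):
    # /proc/meminfo lines look like "Dirty:    12345 kB"
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1])
    return 0

THRESHOLD_KB = 1024 * 1024  # 1 GiB, arbitrary
while True:
    dirty = meminfo_kb("Dirty")
    writeback = meminfo_kb("Writeback")
    if dirty + writeback > THRESHOLD_KB:
        print(time.strftime("%b %d %H:%M:%S"),
              "Dirty=%d kB Writeback=%d kB" % (dirty, writeback),
              flush=True)
    time.sleep(5)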

From userspace it is visible either as a process being killed by the
OOM killer or as a process erroring out with an I/O error (-5, EIO)
while reading or writing a file on the Lustre file system.
Below is a case where it failed because of an I/O error.

What would be the steps to diagnose this problem?
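For reference, the userspace symptom is just a plain errno coming
back from write(2); a minimal Python sketch of how it surfaces (the
file path here is hypothetical):

#!/usr/bin/env python3
# After an eviction, write() on the Lustre file fails with
# errno 5 (EIO), which Python surfaces as an OSError.
import errno, os

path = "/lustre/h4/scratch/testfile"  # hypothetical path on the mount

try:
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    os.write(fd, b"x" * 4096)
    os.close(fd)
except OSError as e:
    if e.errno == errno.EIO:
        print("I/O error (-5) on Lustre write; client likely evicted")
    else:
        raise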


Lustre: Lustre: Build Version: 2.10.1
Linux hnod2004 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50
UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Client logs (172.27.31.151):

Dec 15 19:17:00 hnod2004 kernel: INFO: task awk:108160 blocked for more than 120 seconds.
Dec 15 19:17:00 hnod2004 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 15 19:17:00 hnod2004 kernel: awk             D ffffffff816a76b0  0 108160 108148 0x00000080
Dec 15 19:17:00 hnod2004 kernel: ffff8802155f3800 0000000000000002 ffff8804c1b41fa0 ffff8802155f3fd8
Dec 15 19:17:00 hnod2004 kernel: ffff8802155f3fd8 ffff8802155f3fd8 ffff8804c1b41fa0 ffff881fbbb96cc0
Dec 15 19:17:00 hnod2004 kernel: 0000000000000000 7fffffffffffffff ffff88203ff91de8 ffffffff816a76b0
Dec 15 19:17:00 hnod2004 kernel: Call Trace:
Dec 15 19:17:00 hnod2004 kernel: [<ffffffff816a76b0>] ? bit_wait+0x50/0x50
Dec 15 19:17:00 hnod2004 kernel: [<ffffffff816a9589>] schedule+0x29/0x70
Dec 15 19:17:00 hnod2004 kernel: [<ffffffff816a7099>] schedule_timeout+0x239/0x2c0
Dec 15 19:17:00 hnod2004 kernel: [<ffffffffc0a35318>] ? ldlm_lock_remove_from_lru_check+0x158/0x1a0 [ptlrpc]
Dec 15 19:17:00 hnod2004 kernel: [<ffffffffc0a36899>] ? ldlm_lock_addref_internal_nolock+0x19/0x100 [ptlrpc]
Dec 15 19:17:00 hnod2004 kernel: [<ffffffff810e93ac>] ? ktime_get_ts64+0x4c/0xf0
Dec 15 19:17:00 hnod2004 kernel: [<ffffffff816a76b0>] ? bit_wait+0x50/0x50
Dec 15 19:17:00 hnod2004 kernel: [<ffffffff816a8c0d>] io_schedule_timeout+0xad/0x130
Dec 15 19:17:00 hnod2004 kernel: [<ffffffff816a8ca8>] io_schedule+0x18/0x20
Dec 15 19:17:00 hnod2004 kernel: [<ffffffff816a76c1>] bit_wait_io+0x11/0x50
Dec 15 19:17:00 hnod2004 kernel: [<ffffffff816a71e5>] __wait_on_bit+0x65/0x90
Dec 15 19:17:00 hnod2004 kernel: [<ffffffff81181cc1>] wait_on_page_bit+0x81/0xa0
Dec 15 19:17:00 hnod2004 kernel: [<ffffffff810b19e0>] ? wake_bit_function+0x40/0x40
Dec 15 19:17:00 hnod2004 kernel: [<ffffffffc1167bb2>] vvp_page_assume+0x32/0xa0 [lustre]
Dec 15 19:17:00 hnod2004 kernel: [<ffffffffc061d1f8>] cl_page_assume+0x68/0x240 [obdclass]
Dec 15 19:17:00 hnod2004 kernel: [<ffffffffc1159df0>] ll_write_begin+0x100/0x830 [lustre]
Dec 15 19:17:00 hnod2004 kernel: [<ffffffff811824de>] generic_file_buffered_write+0x11e/0x2a0
Dec 15 19:17:00 hnod2004 kernel: [<ffffffff81184902>] __generic_file_aio_write+0x1e2/0x400
Dec 15 19:17:00 hnod2004 kernel: [<ffffffffc1168d5b>] __generic_file_write_iter+0xcb/0x340 [lustre]
Dec 15 19:17:00 hnod2004 kernel: [<ffffffffc116cc46>] vvp_io_write_start+0x4a6/0x810 [lustre]
Dec 15 19:17:00 hnod2004 kernel: [<ffffffffc0620217>] ? cl_lock_request+0x67/0x1f0 [obdclass]
Dec 15 19:17:00 hnod2004 kernel: [<ffffffffc0621e25>] cl_io_start+0x65/0x130 [obdclass]
Dec 15 19:17:00 hnod2004 kernel: [<ffffffffc06241ee>] cl_io_loop+0x12e/0xc90 [obdclass]
Dec 15 19:17:00 hnod2004 kernel: [<ffffffffc1118258>] ll_file_io_generic+0x498/0xc40 [lustre]
Dec 15 19:17:00 hnod2004 kernel: [<ffffffff811bc401>] ? page_add_file_rmap+0x1/0xc0
Dec 15 19:17:00 hnod2004 kernel: [<ffffffffc1118cdd>] ll_file_aio_write+0x12d/0x1f0 [lustre]
Dec 15 19:17:00 hnod2004 kernel: [<ffffffffc1118e6e>] ll_file_write+0xce/0x1e0 [lustre]
Dec 15 19:17:00 hnod2004 kernel: [<ffffffff81200cad>] vfs_write+0xbd/0x1e0
Dec 15 19:17:00 hnod2004 kernel: [<ffffffff8111f394>] ? __audit_syscall_entry+0xb4/0x110
Dec 15 19:17:00 hnod2004 kernel: [<ffffffff81201abf>] SyS_write+0x7f/0xe0
Dec 15 19:17:00 hnod2004 kernel: [<ffffffff816b5292>] tracesys+0xdd/0xe2
Dec 15 19:18:12 hnod2004 kernel: LustreError: 11-0: h4-OST002e-osc-ffff881f01e01800: operation obd_ping to node 172.27.0.64@tcp failed: rc = -107
Dec 15 19:18:12 hnod2004 kernel: Lustre: h4-OST002e-osc-ffff881f01e01800: Connection to h4-OST002e (at 172.27.0.64@tcp) was lost; in progress operations using this service will wait for recovery to complete
Dec 15 19:18:12 hnod2004 kernel: LustreError: 167-0: h4-OST002e-osc-ffff881f01e01800: This client was evicted by h4-OST002e; in progress operations using this service will fail.


OST logs (172.27.0.64):
[Sat Dec 15 19:12:55 2018] LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 172.27.31.151@tcp  ns: filter-h4-OST002e_UUID lock: ffff88101d9cae00/0x7f97843f44bf7e83 lrc: 3/0,0 mode: PW/PW res: [0xd64612:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400000020 nid: 172.27.31.151@tcp remote: 0xa2b3e4825f0f14ba expref: 19 pid: 15538 timeout: 9310208750 lvb_type: 0
[Sat Dec 15 19:12:56 2018] LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) Skipped 1 previous similar message
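
To see which side fires first, one option is to pull the eviction
and OOM-killer timestamps out of the same kernel log. A rough sketch
follows; "Out of memory"/"oom-killer" are the stock kernel OOM
messages, while the use of dmesg -T and the substring matches are my
own assumptions:

#!/usr/bin/env python3
# Rough sketch: scan `dmesg -T` output and print Lustre eviction and
# OOM-killer messages in log order, to see which comes first.
import re, subprocess

out = subprocess.run(["dmesg", "-T"],
                     capture_output=True, text=True).stdout

events = []
for line in out.splitlines():
    m = re.match(r"\[(.+?)\]\s+(.*)", line)
    if not m:
        continue
    stamp, msg = m.groups()
    if "evicted" in msg or "evicting client" in msg:
        events.append((stamp, "LUSTRE", msg))
    elif "Out of memory" in msg or "oom-killer" in msg:
        events.append((stamp, "OOM", msg))

for stamp, kind, msg in events:
    print(stamp, kind, msg[:120])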

Regards.
-- 
Jacek Tomaka
Geophysical Software Developer

DownUnder GeoSolutions

76 Kings Park Road
West Perth 6005 WA, Australia
tel +61 8 9287 4143
jacekt at dug.com
www.dug.com

