[lustre-discuss] Lustre client crashes in Lustre 2.12.3 with data on MDT

Christopher Mountford cjm14 at leicester.ac.uk
Fri Jan 10 07:42:17 PST 2020


Hi,

We just switched to a new 2.12.3 Lustre storage system on our local HPC cluster and have seen a number of client node crashes, all leaving a similar syslog entry:

Jan 10 13:21:08 spectre15 kernel: LustreError: 24567:0:(osc_cache.c:3062:osc_cache_writeback_range()) extent ffff9238ac5133f0@{[0 -> 255/255], [1|0|-|cache|wiY|ffff9238a7370f00],[1703936|89|+|-|ffff9238733e7180|256|          (null)]}
Jan 10 13:21:08 spectre15 kernel: LustreError: 24567:0:(osc_cache.c:3062:osc_cache_writeback_range()) ### extent: ffff9238ac5133f0 ns: alice3-OST0019-osc-ffff9248e7337800 lock: ffff9238733e7180/0x3c4db4a67c3d39eb lrc: 2/0,0 mode: PW/PW res: [0x740000400:0x132478:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 65536->262143) flags: 0x20000000000 nid: local remote: 0xda6676eba5fdbd0c expref: -99 pid: 24499 timeout: 0 lvb_type: 1
Jan 10 13:21:08 spectre15 kernel: LustreError: 24567:0:(osc_cache.c:1241:osc_extent_tree_dump0()) Dump object ffff9238a7370f00 extents at osc_cache_writeback_range:3062, mppr: 256.
Jan 10 13:21:08 spectre15 kernel: LustreError: 24567:0:(osc_cache.c:1246:osc_extent_tree_dump0()) extent ffff9238ac5133f0@{[0 -> 255/255], [1|0|-|cache|wiY|ffff9238a7370f00], [1703936|89|+|-|ffff9238733e7180|256|          (null)]} in tree 1.
Jan 10 13:21:08 spectre15 kernel: LustreError: 24567:0:(osc_cache.c:1246:osc_extent_tree_dump0()) ### extent: ffff9238ac5133f0 ns: alice3-OST0019-osc-ffff9248e7337800 lock: ffff9238733e7180/0x3c4db4a67c3d39eb lrc: 2/0,0 mode: PW/PW res: [0x740000400:0x132478:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 65536->262143) flags: 0x20000000000 nid: local remote: 0xda6676eba5fdbd0c expref: -99 pid: 24499 timeout: 0 lvb_type: 1
Jan 10 13:21:08 spectre15 kernel: LustreError: 24567:0:(osc_cache.c:3062:osc_cache_writeback_range()) ASSERTION( ext->oe_start >= start && ext->oe_end <= end ) failed:
Jan 10 13:21:08 spectre15 kernel: LustreError: 24567:0:(osc_cache.c:3062:osc_cache_writeback_range()) LBUG
Jan 10 13:21:08 spectre15 kernel: Pid: 24567, comm: rm 3.10.0-1062.9.1.el7.x86_64 #1 SMP Fri Dec 6 15:49:49 UTC 2019
Jan 10 13:21:08 spectre15 kernel: Call Trace:
Jan 10 13:21:08 spectre15 kernel: [<ffffffffc0e167cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
Jan 10 13:21:08 spectre15 kernel: [<ffffffffc0e1687c>] lbug_with_loc+0x4c/0xa0 [libcfs]
Jan 10 13:21:08 spectre15 kernel: [<ffffffffc13b1a5d>] osc_cache_writeback_range+0xacd/0x1260 [osc]
Jan 10 13:21:08 spectre15 kernel: [<ffffffffc13a07f5>] osc_io_fsync_start+0x85/0x1a0 [osc]
Jan 10 13:21:08 spectre15 kernel: [<ffffffffc105e388>] cl_io_start+0x68/0x130 [obdclass]
Jan 10 13:21:08 spectre15 kernel: [<ffffffffc133e537>] lov_io_call.isra.7+0x87/0x140 [lov]
Jan 10 13:21:08 spectre15 kernel: [<ffffffffc133e6f6>] lov_io_start+0x56/0x150 [lov]
Jan 10 13:21:08 spectre15 kernel: [<ffffffffc105e388>] cl_io_start+0x68/0x130 [obdclass]
Jan 10 13:21:08 spectre15 kernel: [<ffffffffc106055c>] cl_io_loop+0xcc/0x1c0 [obdclass]
Jan 10 13:21:08 spectre15 kernel: [<ffffffffc1473d3b>] cl_sync_file_range+0x2db/0x380 [lustre]
Jan 10 13:21:08 spectre15 kernel: [<ffffffffc148ba90>] ll_delete_inode+0x160/0x230 [lustre]
Jan 10 13:21:08 spectre15 kernel: [<ffffffff88668544>] evict+0xb4/0x180
Jan 10 13:21:08 spectre15 kernel: [<ffffffff8866896c>] iput+0xfc/0x190
Jan 10 13:21:08 spectre15 kernel: [<ffffffff8865cbde>] do_unlinkat+0x1ae/0x2d0
Jan 10 13:21:08 spectre15 kernel: [<ffffffff8865dc5b>] SyS_unlinkat+0x1b/0x40
Jan 10 13:21:08 spectre15 kernel: [<ffffffff88b8dede>] system_call_fastpath+0x25/0x2a
Jan 10 13:21:08 spectre15 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
Jan 10 13:21:08 spectre15 kernel: Kernel panic - not syncing: LBUG


We are able to reproduce the error on a test system. It appears to be caused by removing multiple files with a single 'rm -f *'; strangely, repeating the test and deleting the same files one at a time is fine (both results are reproducible). Only files with a data on MDT layout cause the crash.
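
For reference, a rough sketch of the reproducer we used on the test system (the paths and file sizes here are illustrative, not the exact values from our runs):

    # Directory with a data on MDT layout: the first 1MiB of each file
    # lives on the MDT, anything beyond that is striped to the OSTs.
    mkdir /mnt/testfs/domdir
    lfs setstripe -E 1M -L mdt -E -1 /mnt/testfs/domdir

    # Create some small files that fit entirely in the DoM component.
    for i in $(seq 1 20); do
        dd if=/dev/zero of=/mnt/testfs/domdir/file$i bs=64k count=4
    done

    # Deleting them all at once LBUGs the 2.12.3 client:
    cd /mnt/testfs/domdir && rm -f *

    # ...whereas recreating them and removing them one at a time
    # completes without error.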

We have been using the 2.12.3 client (with 2.10.7 servers) since December without issue. The problem only seems to have appeared since we moved to the new Lustre 2.12.3 filesystem, which has data on MDT enabled. We have confirmed that deleting files which do not have a data on MDT layout does not cause the above problem.
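
(We identified which files have a data on MDT layout with lfs getstripe - the affected files all have a first component with the mdt pattern, along these lines, where the path is just an example:

    $ lfs getstripe /mnt/testfs/domdir/file1
      ...
      lcme_extent.e_start: 0
      lcme_extent.e_end:   1048576
        lmm_pattern:       mdt
      ...
)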

This looks to me like LU-12462 (https://jira.whamcloud.com/browse/LU-12462); however, that ticket only appears to list 2.13.0 (and 2.12.4) as affected - not 2.12.3. I'm not familiar with Jira, though, so I could be reading it wrong!

Any suggestions on how best to report/resolve this?

We have repeated the tests using a 2.13.0 test client and do not see any crashes on that client (LU-12462 is marked as fixed in 2.13).

Regards,
Christopher.


-- 
# Dr. Christopher Mountford
# System specialist - Research Computing/HPC
# 
# IT services,
#     University of Leicester, University Road, 
#     Leicester, LE1 7RH, UK 
#
# t: 0116 252 3471
# e: cjm14 at le.ac.uk


