[lustre-discuss] Lustre client crashes in Lustre 2.12.3 with data on MDT

Peter Jones pjones at whamcloud.com
Fri Jan 10 07:52:03 PST 2020


While I'm not the right person to interpret the stack trace, I can decipher JIRA, and the state of LU-12462 is that the fix has already landed for the upcoming 2.12.4 release. So, if you have a good reproducer, you could always test a single client on the tip of b2_12 (either building from git or grabbing the latest build from https://build.whamcloud.com/job/lustre-b2_12/). What's there now is close to the finished article, and this will tell you whether moving to 2.12.4 when it comes out will resolve this issue for you.
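For the build-from-git route, the usual client-only flow looks roughly like the sketch below (assuming a RHEL/CentOS 7 node with the matching kernel-devel and rpm-build packages installed; adjust for your environment):

    # clone the canonical tree and switch to the 2.12 maintenance branch
    git clone git://git.whamcloud.com/fs/lustre-release.git
    cd lustre-release
    git checkout b2_12

    # client-only build; --disable-server skips the server (ldiskfs/ZFS) pieces
    sh autogen.sh
    ./configure --disable-server
    make rpms   # produces lustre-client RPMs to install on the test client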

On 2020-01-10, 7:42 AM, "lustre-discuss on behalf of Christopher Mountford" <lustre-discuss-bounces at lists.lustre.org on behalf of cjm14 at leicester.ac.uk> wrote:

    Hi,
    
    We just switched to a new 2.12.3 Lustre storage system on our local HPC cluster and have seen a number of client node crashes, all leaving a similar syslog entry:
    
    Jan 10 13:21:08 spectre15 kernel: LustreError: 24567:0:(osc_cache.c:3062:osc_cache_writeback_range()) extent ffff9238ac5133f0@{[0 -> 255/255], [1|0|-|cache|wiY|ffff9238a7370f00],[1703936|89|+|-|ffff9238733e7180|256|          (null)]}
    Jan 10 13:21:08 spectre15 kernel: LustreError: 24567:0:(osc_cache.c:3062:osc_cache_writeback_range()) ### extent: ffff9238ac5133f0 ns: alice3-OST0019-osc-ffff9248e7337800 lock: ffff9238733e7180/0x3c4db4a67c3d39eb lrc: 2/0,0 mode: PW/PW res: [0x740000400:0x132478:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 65536->262143) flags: 0x20000000000 nid: local remote: 0xda6676eba5fdbd0c expref: -99 pid: 24499 timeout: 0 lvb_type: 1
    Jan 10 13:21:08 spectre15 kernel: LustreError: 24567:0:(osc_cache.c:1241:osc_extent_tree_dump0()) Dump object ffff9238a7370f00 extents at osc_cache_writeback_range:3062, mppr: 256.
    Jan 10 13:21:08 spectre15 kernel: LustreError: 24567:0:(osc_cache.c:1246:osc_extent_tree_dump0()) extent ffff9238ac5133f0@{[0 -> 255/255], [1|0|-|cache|wiY|ffff9238a7370f00], [1703936|89|+|-|ffff9238733e7180|256|          (null)]} in tree 1.
    Jan 10 13:21:08 spectre15 kernel: LustreError: 24567:0:(osc_cache.c:1246:osc_extent_tree_dump0()) ### extent: ffff9238ac5133f0 ns: alice3-OST0019-osc-ffff9248e7337800 lock: ffff9238733e7180/0x3c4db4a67c3d39eb lrc: 2/0,0 mode: PW/PW res: [0x740000400:0x132478:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 65536->262143) flags: 0x20000000000 nid: local remote: 0xda6676eba5fdbd0c expref: -99 pid: 24499 timeout: 0 lvb_type: 1
    Jan 10 13:21:08 spectre15 kernel: LustreError: 24567:0:(osc_cache.c:3062:osc_cache_writeback_range()) ASSERTION( ext->oe_start >= start && ext->oe_end <= end ) failed:
    Jan 10 13:21:08 spectre15 kernel: LustreError: 24567:0:(osc_cache.c:3062:osc_cache_writeback_range()) LBUG
    Jan 10 13:21:08 spectre15 kernel: Pid: 24567, comm: rm 3.10.0-1062.9.1.el7.x86_64 #1 SMP Fri Dec 6 15:49:49 UTC 2019
    Jan 10 13:21:08 spectre15 kernel: Call Trace:
    Jan 10 13:21:08 spectre15 kernel: [<ffffffffc0e167cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
    Jan 10 13:21:08 spectre15 kernel: [<ffffffffc0e1687c>] lbug_with_loc+0x4c/0xa0 [libcfs]
    Jan 10 13:21:08 spectre15 kernel: [<ffffffffc13b1a5d>] osc_cache_writeback_range+0xacd/0x1260 [osc]
    Jan 10 13:21:08 spectre15 kernel: [<ffffffffc13a07f5>] osc_io_fsync_start+0x85/0x1a0 [osc]
    Jan 10 13:21:08 spectre15 kernel: [<ffffffffc105e388>] cl_io_start+0x68/0x130 [obdclass]
    Jan 10 13:21:08 spectre15 kernel: [<ffffffffc133e537>] lov_io_call.isra.7+0x87/0x140 [lov]
    Jan 10 13:21:08 spectre15 kernel: [<ffffffffc133e6f6>] lov_io_start+0x56/0x150 [lov]
    Jan 10 13:21:08 spectre15 kernel: [<ffffffffc105e388>] cl_io_start+0x68/0x130 [obdclass]
    Jan 10 13:21:08 spectre15 kernel: [<ffffffffc106055c>] cl_io_loop+0xcc/0x1c0 [obdclass]
    Jan 10 13:21:08 spectre15 kernel: [<ffffffffc1473d3b>] cl_sync_file_range+0x2db/0x380 [lustre]
    Jan 10 13:21:08 spectre15 kernel: [<ffffffffc148ba90>] ll_delete_inode+0x160/0x230 [lustre]
    Jan 10 13:21:08 spectre15 kernel: [<ffffffff88668544>] evict+0xb4/0x180
    Jan 10 13:21:08 spectre15 kernel: [<ffffffff8866896c>] iput+0xfc/0x190
    Jan 10 13:21:08 spectre15 kernel: [<ffffffff8865cbde>] do_unlinkat+0x1ae/0x2d0
    Jan 10 13:21:08 spectre15 kernel: [<ffffffff8865dc5b>] SyS_unlinkat+0x1b/0x40
    Jan 10 13:21:08 spectre15 kernel: [<ffffffff88b8dede>] system_call_fastpath+0x25/0x2a
    Jan 10 13:21:08 spectre15 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
    Jan 10 13:21:08 spectre15 kernel: Kernel panic - not syncing: LBUG
    
    
    We are able to reproduce the error on a test system - it appears to be caused by removing multiple files with a single rm -f *; strangely, repeating this but deleting the files one at a time is fine (both results are reproducible). Only files with a data-on-MDT layout cause the crash.
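    In case it helps, the reproducer boils down to something like the sketch below (the mount point, directory name, and 64K DoM component size are placeholders, not our exact layout):
    
        # set a default composite layout whose first 64K component lives on the MDT
        lfs setstripe -E 64K -L mdt -E -1 /mnt/testfs/domdir
    
        # create a batch of small files that fit entirely in the DoM component
        for i in $(seq 1 50); do
            dd if=/dev/zero of=/mnt/testfs/domdir/file$i bs=4k count=4
        done
    
        # deleting them all at once hits the LBUG; deleting them one at a time does not
        cd /mnt/testfs/domdir && rm -f *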
    
    We have been using the 2.12.3 client (with 2.10.7 servers) since December without issue. The problem only seems to have started since we moved to a new Lustre 2.12.3 filesystem which has data on MDT enabled. We have confirmed that deleting files which do not have a data-on-MDT layout does not cause the above problem.
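    (For reference, we distinguish the two cases with lfs getstripe; the path below is just an example:
    
        # inspect the layout - a data-on-MDT file's first component reports the mdt pattern
        lfs getstripe /mnt/testfs/domdir/file1
    
    Files with a DoM first component show that pattern, while the files that delete cleanly have ordinary OST-striped layouts.)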
    
    This looks to me like LU-12462 (https://jira.whamcloud.com/browse/LU-12462); however, that ticket only appears to list 2.13.0 (and 2.12.4) as affected, not 2.12.3. I'm not familiar with JIRA though, so I could be reading this wrong!
    
    Any suggestions on how best to report/resolve this?
    
    We have repeated the tests using a 2.13.0 test client and do not see any crashes there (LU-12462 says this is fixed in 2.13).
    
    Regards,
    Christopher.
    
    
    -- 
    # Dr. Christopher Mountford
    # System specialist - Research Computing/HPC
    # 
    # IT services,
    #     University of Leicester, University Road, 
    #     Leicester, LE1 7RH, UK 
    #
    # t: 0116 252 3471
    # e: cjm14 at le.ac.uk
    
    _______________________________________________
    lustre-discuss mailing list
    lustre-discuss at lists.lustre.org
    http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
    


