[lustre-discuss] Object creation bottleneck

Wed Jun 3 10:41:28 PDT 2020

Hi all,

We have a small Lustre environment (one MGS/MDS, one MDT, two OSSes,
and four OSTs, with ZFS backends on all targets). Occasionally, user
jobs that rapidly create thousands of files choke the MDS, which leads
to poor performance of the FS for users (long wait times for directory
listings, file creation, etc.). I've advised users to avoid this sort
of workflow when possible, or to use local scratch storage when not,
but I'd like to lessen the impact as much as I can when it happens.

When this occurs, the MDS processes have stack traces that look like:

[<ffffffffb7992d77>] call_rwsem_down_write_failed+0x17/0x30
[<ffffffffc16b3225>] lod_alloc_qos.constprop.18+0x205/0x1840 [lod]
[<ffffffffc16b9847>] lod_qos_prep_create+0x12d7/0x1890 [lod]
[<ffffffffc16ba015>] lod_prepare_create+0x215/0x2e0 [lod]
[<ffffffffc16a9e1e>] lod_declare_striped_create+0x1ee/0x980 [lod]
[<ffffffffc16ae6f4>] lod_declare_create+0x204/0x590 [lod]
[<ffffffffc1724ca2>] mdd_declare_create_object_internal+0xe2/0x2f0 [mdd]
[<ffffffffc17146dc>] mdd_declare_create+0x4c/0xcb0 [mdd]
[<ffffffffc1718067>] mdd_create+0x847/0x14e0 [mdd]
[<ffffffffc11cb5ff>] mdt_reint_open+0x224f/0x3240 [mdt]
[<ffffffffc11be693>] mdt_reint_rec+0x83/0x210 [mdt]
[<ffffffffc119b1b3>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[<ffffffffc11a7a92>] mdt_intent_open+0x82/0x3a0 [mdt]
[<ffffffffc11a5bb5>] mdt_intent_policy+0x435/0xd80 [mdt]
[<ffffffffc1b8cd56>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
[<ffffffffc1bb5366>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
[<ffffffffc1c3db02>] tgt_enqueue+0x62/0x210 [ptlrpc]
[<ffffffffc1c442ea>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[<ffffffffc1be929b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[<ffffffffc1becbfc>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
[<ffffffffb76c61f1>] kthread+0xd1/0xe0
[<ffffffffb7d8dd1d>] ret_from_fork_nospec_begin+0x7/0x21
[<ffffffffffffffff>] 0xffffffffffffffff

which to me implies that they're waiting on the OSTs to allocate
objects. The OSTs are each a ZFS span of mirrors. I've disabled sync on
the datasets and set the osd-zfs parameters osd_object_sync_delay_us
and osd_txg_sync_delay_us to 0 (this FS is entirely scratch), which has
improved things a bit, but we still have issues.

Does anyone have any pointers for improving OST performance for this
pathological use case? 
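
In case it helps anyone compare notes, here is a rough sketch of how
the stack dumps above could be tallied to see how many threads are
stuck on the same path. It assumes each thread's trace has been saved
as plain text in the format shown above; the function names
(parse_frames, blocking_path, tally) are made up for this example and
are not part of any Lustre tooling. In the trace above the innermost
frame is call_rwsem_down_write_failed, called from lod_alloc_qos.

```python
import re
from collections import Counter

# One frame looks like "[<addr>] symbol+0xOFF/0xSIZE [module]".
# The "[module]" suffix is optional (core-kernel frames have none) and
# may sit on the next line if the mail client wrapped the trace.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(trace_text):
    """Return (symbol, module) tuples, innermost frame first."""
    return [
        (m.group("symbol"), m.group("module") or "kernel")
        for m in FRAME_RE.finditer(trace_text)
    ]


def blocking_path(trace_text, depth=2):
    """Join the `depth` innermost symbols, e.g. 'callee <- caller'."""
    return " <- ".join(sym for sym, _ in parse_frames(trace_text)[:depth])


def tally(traces, depth=2):
    """Count how many traces share the same innermost call path."""
    return Counter(blocking_path(t, depth) for t in traces)


# First three frames of the trace posted above, used as sample input.
SAMPLE = """\
[<ffffffffb7992d77>] call_rwsem_down_write_failed+0x17/0x30
[<ffffffffc16b3225>] lod_alloc_qos.constprop.18+0x205/0x1840 [lod]
[<ffffffffc16b9847>] lod_qos_prep_create+0x12d7/0x1890 [lod]
"""

if __name__ == "__main__":
    # Pretend two threads produced the same trace.
    for path, count in tally([SAMPLE, SAMPLE]).most_common():
        print(count, path)
```

If most threads share the same two innermost frames, that would at
least confirm they are all queued behind the same lock rather than
stalled in different places.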