[lustre-discuss] OST not freeing space for deleted files?

Degremont, Aurelien degremoa at amazon.fr
Fri Jan 13 03:35:56 PST 2023


In the past, when I've seen this kind of issue, it was really because there were more threads adding new work to that queue than the MDT was able to drain.
- Verify how many sync_in_flight you have (see the sketch after the commands below).
- You're talking about Robinhood. Is Robinhood deleting lots of files?
- You say your destroy queue is not emptying; is there a steady UNLINK load coming to your MDT?
- Verify how many new requests are coming to your MDT:

lctl set_param mdt.lfsc-MDT0000.md_stats=clear
sleep 10
lctl get_param mdt.lfsc-MDT0000.md_stats
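
For the sync_in_flight check, a quick way to watch the OSP queues on the MDS is to read the counters for the two affected OSTs a few times and see whether they move at all (device names taken from your output below):

# run several times; the counters should be decreasing if the queue is draining
lctl get_param osp.lfsc-OST0004-osc-MDT0000.sync_in_flight
lctl get_param osp.lfsc-OST0004-osc-MDT0000.sync_changes
lctl get_param osp.lfsc-OST0005-osc-MDT0000.sync_in_flight
lctl get_param osp.lfsc-OST0005-osc-MDT0000.sync_changes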


Aurélien

On 12/01/2023 18:38, "lustre-discuss on behalf of Daniel Szkola via lustre-discuss" <lustre-discuss-bounces at lists.lustre.org on behalf of lustre-discuss at lists.lustre.org> wrote:


    I'm not seeing anything obvious. Today, the inode counts have increased and the group has reached its hard limit.
    We have Robinhood running and the numbers there seem accurate, but the quota numbers are still high.
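
    For reference, I'm comparing the Robinhood numbers against the group quota report, roughly like this (placeholder group name and mount point):

    lfs quota -g somegroup /lustre/lfsc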

    I’m seeing things like this on the MDS node in dmesg:

    [Wed Jan 11 11:39:07 2023] LustreError: 39308:0:(osp_dev.c:1682:osp_iocontrol()) lfsc-OST0004-osc-MDT0000: unrecognized ioctl 0xc00866e6 by lctl
    [Wed Jan 11 11:39:14 2023] LustreError: 39314:0:(class_obd.c:465:class_handle_ioctl()) OBD ioctl : No Device -12066
    [Wed Jan 11 11:39:38 2023] LustreError: 39385:0:(class_obd.c:465:class_handle_ioctl()) OBD ioctl : No Device -12066
    [Wed Jan 11 11:39:38 2023] LustreError: 39385:0:(class_obd.c:465:class_handle_ioctl()) Skipped 1 previous similar message
    [Wed Jan 11 12:06:12 2023] LustreError: 41360:0:(lod_dev.c:1551:lod_sync()) lfsc-MDT0000-mdtlov: can't sync ost 4: rc = -110
    [Wed Jan 11 12:06:12 2023] LustreError: 41360:0:(lod_dev.c:1551:lod_sync()) Skipped 1 previous similar message
    [Wed Jan 11 12:09:30 2023] LustreError: 41362:0:(lod_dev.c:1551:lod_sync()) lfsc-MDT0000-mdtlov: can't sync ost 4: rc = -110
    [Wed Jan 11 16:18:27 2023] LustreError: 41360:0:(lod_dev.c:1551:lod_sync()) lfsc-MDT0000-mdtlov: can't sync ost 4: rc = -110

    I'm only seeing this for OST4, though, and not OST5, even though both seem to be having the problem, so these errors may be harmless.

    I still don’t know why the destroys_in_flight are over 13 million and not decreasing. Any ideas?
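
    In case it helps, this is roughly how I've been watching the counters on the MDS (just a simple loop; the interval is arbitrary):

    while true; do
        lctl get_param osp.lfsc-OST0004-osc-MDT0000.destroys_in_flight \
                       osp.lfsc-OST0005-osc-MDT0000.destroys_in_flight
        sleep 60
    done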

    —
    Dan Szkola
    FNAL



    > On Jan 12, 2023, at 2:59 AM, Degremont, Aurelien <degremoa at amazon.fr> wrote:
    >
    > Hello Daniel,
    >
    > You should also check whether some user workload is triggering that load, for example a constant stream of SYNCs to the files on those OSTs.
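    >
    > A rough way to spot that from the OSS side (a sketch; assuming the usual obdfilter stats) is to clear the OST stats, wait a bit, and then look at the sync and destroy counters:
    >
    > lctl set_param obdfilter.lfsc-OST0004.stats=clear
    > sleep 60
    > lctl get_param obdfilter.lfsc-OST0004.stats | grep -E 'sync|destroy'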
    >
    > Aurélien
    >
    > On 11/01/2023 22:37, "lustre-discuss on behalf of Daniel Szkola via lustre-discuss" <lustre-discuss-bounces at lists.lustre.org on behalf of lustre-discuss at lists.lustre.org> wrote:
    >
    >
    >    We recently had to take an OSS node that hosts two OSTs out of service to test the hardware as it was randomly power cycling.
    >
    >    I migrated all files off the two OSTs, and after some testing we brought the node back into service after recreating the ZFS pools
    >    and the two OSTs. Since then it has mostly been working fine; however, we've noticed a few group quotas reporting file usage that doesn't
    >    seem to match what is actually on the filesystem. The inode counts seem to be correct, but the space used is way too high.
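    >
    >    For context, the migration was done with lfs_migrate, roughly like this (placeholder mount point):
    >
    >    lfs find /lustre/lfsc -type f --ost lfsc-OST0004_UUID | lfs_migrate -y
    >    lfs find /lustre/lfsc -type f --ost lfsc-OST0005_UUID | lfs_migrate -y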
    >
    >    After lots of poking around I am seeing this on the two OSTS:
    >
    >    osp.lfsc-OST0004-osc-MDT0000.sync_changes=13802381
    >    osp.lfsc-OST0005-osc-MDT0000.sync_changes=13060667
    >
    >    I upped max_rpcs_in_progress and max_rpcs_in_flight for the two OSTs, but that just caused the numbers to dip slightly.
    >    All other OSTs show 0 for that value. The destroys_in_flight counters also show similar numbers for the two OSTs.
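    >
    >    For reference, the tuning I tried looked roughly like this (the values here are just examples):
    >
    >    lctl set_param osp.lfsc-OST0004-osc-MDT0000.max_rpcs_in_flight=32
    >    lctl set_param osp.lfsc-OST0004-osc-MDT0000.max_rpcs_in_progress=8192
    >    (and the same for OST0005)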
    >
    >    Any ideas how I can remedy this?
    >
    >    Lustre 2.12.8
    >    ZFS 0.7.13
    >
    >    —
    >    Dan Szkola
    >    FNAL
    >
    >
    >
    >
    >
    >    _______________________________________________
    >    lustre-discuss mailing list
    >    lustre-discuss at lists.lustre.org
    >    http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
    >

    _______________________________________________
    lustre-discuss mailing list
    lustre-discuss at lists.lustre.org
    http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


