[lustre-discuss] On the origins of lnet stat's "drop_count"

Tue Dec 13 00:07:50 PST 2022

Hello all,

As is tradition, resident "off the beaten path" guy, Christian here! I've
been trying to track down some odd eviction behavior and, whilst conducting
a network survey, noticed an odd development: a steadily increasing number
of drops reported by lnet stat's "drop_count" statistic, exclusively on the
machine serving as the MGS+MDS and far in excess of the drops reported by
/proc/net/dev or ifconfig. On the affected interfaces, the kernel's own
counters show fewer than 2K rx drops over about a week of uptime, while
Lustre's drop_count reports more than 60K drops, as shown below:

statistics:
    msgs_alloc: 0
    msgs_max: 635
    rst_alloc: 20
    errors: 0
    send_count: 931455351
    resend_count: 0
    response_timeout_count: 0
    local_interrupt_count: 0
    local_dropped_count: 0
    local_aborted_count: 0
    local_no_route_count: 0
    local_timeout_count: 0
    local_error_count: 0
    remote_dropped_count: 29
    remote_error_count: 22
    remote_timeout_count: 0
    network_timeout_count: 0
    recv_count: 934393871
    route_count: 0
    drop_count: 66750
    send_length: 32635120259432
    recv_length: 43228611181641
    route_length: 0
    drop_length: 0
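
For anyone wanting to repeat the comparison, here is a minimal Python
sketch of how the two sets of counters could be put side by side. It
assumes the stats text is laid out as "key: value" lines like the output
above, and that the kernel counters come from /proc/net/dev, where the
fourth receive column is "drop". The function names and the trimmed sample
are hypothetical, for illustration only:

```python
def parse_lnet_stats(text):
    """Parse 'key: value' counter lines into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        value = value.strip()
        # Skip headers such as 'statistics:' that carry no number.
        if sep and value.isdigit():
            stats[key.strip()] = int(value)
    return stats


def parse_rx_drops(proc_net_dev_text):
    """Return {interface: rx_drop} from /proc/net/dev-style text."""
    drops = {}
    for line in proc_net_dev_text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        # Receive columns are: bytes packets errs drop ...
        if sep and len(fields) >= 4 and fields[3].isdigit():
            drops[name.strip()] = int(fields[3])
    return drops


# Trimmed copy of the counters quoted above.
SAMPLE = """statistics:
    send_count: 931455351
    recv_count: 934393871
    drop_count: 66750
    drop_length: 0
"""

stats = parse_lnet_stats(SAMPLE)
# Fraction of received messages that LNet counted as dropped.
drop_fraction = stats["drop_count"] / stats["recv_count"]
print(f"drop_count={stats['drop_count']} fraction={drop_fraction:.2e}")
```

On a live node, parse_rx_drops(open("/proc/net/dev").read()) would give
the kernel-side numbers to set against stats["drop_count"].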

I've been trying to account for what exactly is contributing to that drop
count, and the logs have not been particularly helpful. I can identify two
relevant messages: one with the signature "Dropping ACK from ... to invalid
MD", and another with the signature "Dropping PUT". Each message seems to
refer to a consistent set of NIDs, though the two sets differ. Neither
message appears nearly often enough to account for the 66K drops, however:
together they show up only ~500 times in the debug logs I have available,
which span days of utilization.
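
The tally above can be reproduced with something like the following
Python sketch, which counts each signature per NID. The NID pattern and
the sample log lines are assumptions made up for illustration; real debug
log lines carry more fields, and the regular expression may need adjusting
for a given network type:

```python
import re
from collections import Counter

# The two message signatures mentioned above.
SIGNATURES = ("Dropping ACK from", "Dropping PUT")

# Assumed NID shape: address, '@', network name (e.g. 10.0.0.5@tcp).
NID_RE = re.compile(r"[\w.\-]+@[a-z0-9]+")


def count_drop_messages(lines):
    """Count (signature, NID) pairs across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        for sig in SIGNATURES:
            if sig in line:
                match = NID_RE.search(line)
                nid = match.group(0) if match else "unknown"
                counts[(sig, nid)] += 1
                break  # one signature per line is enough
    return counts


# Made-up lines, not taken from a real log.
sample_lines = [
    "Dropping ACK from 10.0.0.5@tcp to invalid MD",
    "Dropping ACK from 10.0.0.5@tcp to invalid MD",
    "Dropping PUT from 10.0.0.9@tcp",
    "some unrelated message",
]

counts = count_drop_messages(sample_lines)
for (sig, nid), n in sorted(counts.items()):
    print(f"{n:6d}  {sig}  {nid}")
```

Summing the counts and comparing the total with drop_count gives a quick
sense of how much of the counter the logged messages can explain.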

I've trawled around the 2.14 codebase, but before diving in too deep I
figured I'd ask the experts what this number means and what expectations I
should have. What sorts of events cause the drop_count reported by lnet to
increment? Is it normal to see it increase over time in an otherwise
healthy cluster? And given that drop_length is 0 here while the count is
high, what sorts of events am I likely experiencing?

Cheers, and thanks as always for your time,
Christian