[lustre-discuss] Clients disconnect when there are 2+ growing file streams

Christian Kuntz c.kuntz at opendrives.com
Fri Sep 16 22:59:30 PDT 2022


Howdy all!

The long and short of it: when multiple clients each write a long (100+ GB)
contiguous stream to a file (a different file per client) over an SMB export of
the lustre client mount, all but one stream locks up and the clients writing
those streams momentarily lose their connections to the MDS/OSS. The evictions
appear to be caused by either spurious connection resets or lock callbacks
hitting their timeouts. Apologies if something is unclear or a detail is off;
I've been staring at this problem for a while now and have experienced some
mild acute cognitive decline.

I'm on Lustre 2.14.0, Debian buster, kernel 4.19.208-3, ZFS 2.0.3, using
upstream OFED.

The cluster is 4 nodes total, arranged in two HA pairs, with the cluster as a
whole managed by Pacemaker. The first pair hosts the MGT, MDT0, and OST0; the
second pair hosts OST1. All 4 nodes also run the Lustre client and export the
mount over SMB, with SMB clients connecting via multichannel.
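
For reference, where the targets are active at any given moment can be
confirmed with the usual tools (the mount point below is a placeholder):
```
# From any node with the client mounted: which MDT/OSTs are visible and their usage
lfs df -h /mnt/lustre
# On a server: which local obd devices/targets are set up
lctl dl
# Pacemaker's view of where each resource is currently running
crm_mon -1
```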

The network topology is fairly simple; `lnetctl net show` on node 1 looks like this:

net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: tcp1
      local NI(s):
        - nid: 192.168.68.121@tcp1
          status: up
          interfaces:
              0: lacpbond0.1
    - net type: tcp2
      local NI(s):
        - nid: 182.168.98.121@tcp2
          status: up
          interfaces:
              0: lacpbond0.2

The other nodes share this config; the peers look like:

peer:
    - primary nid: 192.168.68.122@tcp1
      Multi-Rail: True
      peer ni:
        - nid: 192.168.68.122@tcp1
          state: NA
        - nid: 192.168.98.122@tcp2
          state: NA
    - primary nid: 192.168.68.132@tcp1
      Multi-Rail: True
      peer ni:
        - nid: 192.168.68.132@tcp1
          state: NA
        - nid: 192.168.98.132@tcp2
          state: NA
    - primary nid: 0@lo
      Multi-Rail: False
      peer ni:
        - nid: 0@lo
          state: NA
    - primary nid: 192.168.68.131@tcp1
      Multi-Rail: True
      peer ni:
        - nid: 192.168.68.131@tcp1
          state: NA
        - nid: 192.168.98.131@tcp2
          state: NA
    - primary nid: 192.168.98.121@tcp2
      Multi-Rail: False
      peer ni:
        - nid: 192.168.98.121@tcp2
          state: NA
    - primary nid: 192.168.68.121@tcp1
      Multi-Rail: False
      peer ni:
        - nid: 192.168.68.121@tcp1
          state: NA

121 is pair-1-node-1, 122 is pair-1-node-2, 131 is pair-2-node-1, and 132 is
pair-2-node-2.

At the time of testing, 121 holds the MGT, MDT0, and OST0, and 131 holds OST1.

Perhaps of interest: there are more IP addresses resident on those bond VLANs,
but only one IP per network shows in `lnetctl net show`. The extra addresses
are VIPs, one additional per subnet, used for SMB multichannel traffic.
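
One thing I've been wondering about, though this is pure speculation on my
part: whether pinning LNet more explicitly per VLAN would keep those extra VIPs
out of the picture. Something along these lines, e.g. in
/etc/modprobe.d/lustre.conf on node 1 (path and values illustrative):
```
# Bind each tcp network to its bond VLAN interface only (effectively what the
# lnetctl output above already shows, so this may well change nothing):
options lnet networks="tcp1(lacpbond0.1),tcp2(lacpbond0.2)"
# Alternatively, an ip2nets rule keyed on the primary address ranges:
# options lnet ip2nets="tcp1 192.168.68.[121-132]; tcp2 192.168.98.[121-132]"
```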

Observed errors look like this in the log from the client that gets the axe:
```
00000400:00000100:11.0:1663299328.923153:0:5195:0:(lib-move.c:3987:lnet_parse_put())
Dropping PUT from 12345-192.168.68.121 at tcp1 portal 28 match
1743801431915008 offset 0 length 520: 4
00000100:02020000:8.0:1663357306.033900:0:5313:0:(client.c:1371:ptlrpc_check_status())
11-0: dfsone-OST0000-osc-ffff9f85fcbe7800: operation ldlm_enqueue to node
192.168.68.121 at tcp1 failed: rc = -107
00000100:00080000:8.0:1663357306.035405:0:5313:0:(recover.c:237:ptlrpc_request_handle_notconn())
import dfsone-OST0000-osc-ffff9f85fcbe7800 of
dfsone-OST0000_UUID at 192.168.68.121@tcp1 abruptly disconnected: reconnecting
00000100:00080000:8.0:1663357306.035407:0:5313:0:(import.c:86:import_set_state_nolock())
0000000090940d7d dfsone-OST0000_UUID: changing import state from FULL to
DISCONN
00000100:02000400:8.0:1663357306.035409:0:5313:0:(import.c:200:ptlrpc_set_import_discon())
dfsone-OST0000-osc-ffff9f85fcbe7800: Connection to dfsone-OST0000 (at
192.168.68.121 at tcp1) was lost; in progress operations using this service
will wait for recovery to complete
00000100:00080000:8.0:1663357306.035412:0:5313:0:(import.c:86:import_set_state_nolock())
0000000090940d7d dfsone-OST0000_UUID: changing import state from DISCONN to
CONNECTING
00000100:00080000:8.0:1663357306.035414:0:5313:0:(import.c:544:import_select_connection())
dfsone-OST0000-osc-ffff9f85fcbe7800: connect to NID 192.168.68.121 at tcp1
last attempt 285795
00000100:00080000:8.0:1663357306.035415:0:5313:0:(import.c:625:import_select_connection())
dfsone-OST0000-osc-ffff9f85fcbe7800: import 0000000090940d7d using
connection 192.168.68.121 at tcp1/192.168.68.121 at tcp1
00000100:00080000:13.0:1663357306.036858:0:5304:0:(import.c:1085:ptlrpc_connect_interpret())
dfsone-OST0000-osc-ffff9f85fcbe7800: connect to target with instance 22
00000008:00020000:13.0:1663357306.036861:0:5304:0:(osc_request.c:1042:osc_init_grant())
dfsone-OST0000-osc-ffff9f85fcbe7800: granted 3407872 but already consumed
247070720
00000080:00080000:13.0:1663357306.038241:0:5304:0:(lcommon_misc.c:75:cl_init_ea_size())
updating def/max_easize: 72/65536
00000100:00080000:13.0:1663357306.038244:0:5304:0:(import.c:1298:ptlrpc_connect_interpret())
@@@ dfsone-OST0000-osc-ffff9f85fcbe7800: evicting (reconnect/recover flags
not set: 4)  req at 0000000023e1589b x1743840866079616/t0(0)
o8->dfsone-OST0000-osc-ffff9f85fcbe7800 at 192.168.68.121@tcp1:28/4
 lens 520/416 e 0 to 0 dl 1663357312 ref 1 fl Interpret:RNQU/0/0 rc 0/0
job:''
00000100:00080000:13.0:1663357306.038250:0:5304:0:(import.c:86:import_set_state_nolock())
0000000090940d7d dfsone-OST0000_UUID: changing import state from CONNECTING
to EVICTED
00000100:02020000:13.0:1663357306.038251:0:5304:0:(import.c:1580:ptlrpc_import_recovery_state_machine())
167-0: dfsone-OST0000-osc-ffff9f85fcbe7800: This client was evicted by
dfsone-OST0000; in progress operations using this service will fail.
00000100:00080000:13.0:1663357306.039600:0:5304:0:(import.c:1585:ptlrpc_import_recovery_state_machine())
evicted from dfsone-OST0000_UUID at 192.168.68.121@tcp1; invalidating
```
Followed by a LOT of dirty page discards.

The debug logs from OST0000 at the time show very little:
```
00010000:00020000:7.0:1663357306.009983:0:25912:0:(ldlm_lockd.c:260:expired_lock_main())
### lock callback timer expired after 101s: evicting client at
192.168.68.132 at tcp1  ns: filter-dfsone-OST0000_UUID lock:
000000005efd3160/0xe7b2f62b49d87750 lrc: 3/0,0 mode: PR/PR res:
[0x309f8:0x0:0x0].0x0 rrc: 109 type: EXT [0->18446744073709551615] (req
2072186880->2072219647) gid 0 flags: 0x60000400020020 nid:
192.168.68.132 at tcp1 remote: 0x7090539ead802aa5 expref: 2484 pid: 31523
timeout: 344608 lvb_type: 0
00000020:00080000:7.0:1663357306.012934:0:25912:0:(genops.c:1602:class_fail_export())
disconnecting export 000000003df4eab0/af951890-1dbc-4df6-9745-dfe62f31e4e6
00000020:00000080:7.0:1663357306.012938:0:25912:0:(genops.c:1443:class_disconnect())
disconnect: cookie 0xe7b2f62b147efd01
00000100:00080000:7.0:1663357306.012941:0:25912:0:(import.c:86:import_set_state_nolock())
00000000247857f4 <90><9F>O1S<95><FF><FF>: changing import state from FULL
to CLOSED
00000100:00020000:8.0:1663357306.012981:0:10171:0:(client.c:1266:ptlrpc_import_delay_req())
@@@ IMP_CLOSED  req at 00000000fcbfa5de x1743802516165888/t0(0)
o105->dfsone-OST0000 at 192.168.68.132@tcp1:15/16 lens 392/224 e 0 to 0 dl 0
ref 1 fl Rpc:QU/0/ffffffff rc 0/-1 job:''
00010000:00020000:9.0:1663357306.012999:0:31532:0:(ldlm_lib.c:3506:target_bulk_io())
@@@ Eviction on bulk READ  req at 00000000fa939054 x1743840866079296/t0(0)
o3->af951890-1dbc-4df6-9745-dfe62f31e4e6 at 192.168.68.132@tcp1:202/0 lens
488/440 e 0 to 0 dl 1663357312 ref 1 fl Interpret:/0/0 rc 0/0 job:''
00000020:00080000:7.0:1663357306.013001:0:25912:0:(genops.c:1619:class_fail_export())
disconnected export 000000003df4eab0/af951890-1dbc-4df6-9745-dfe62f31e4e6
00010000:00020000:7.0:1663357306.013005:0:25912:0:(ldlm_lockd.c:260:expired_lock_main())
### lock callback timer expired after 101s: evicting client at
192.168.68.132 at tcp1  ns: filter-dfsone-OST0000_UUID lock:
00000000594515f1/0xe7b2f62b48a04d27 lrc: 3/0,0 mode: PR/PR res:
[0x308e5:0x0:0x0].0x0 rrc: 2023 type: EXT [0->4106174463] (req
4105555968->4105621503) gid 0 flags: 0x60000400020020 nid:
192.168.68.132 at tcp1 remote: 0x7090539ead103d4b expref: 2484 pid: 32689
timeout: 344608 lvb_type: 0
00000020:00080000:7.0:1663357306.013008:0:25912:0:(genops.c:1597:class_fail_export())
disconnecting dead export
000000003df4eab0/af951890-1dbc-4df6-9745-dfe62f31e4e6; skipping
00010000:00020000:7.0:1663357306.013010:0:25912:0:(ldlm_lockd.c:260:expired_lock_main())
### lock callback timer expired after 101s: evicting client at
192.168.68.132 at tcp1  ns: filter-dfsone-OST0000_UUID lock:
000000008ffe558b/0xe7b2f62b4888ebef lrc: 3/0,0 mode: PR/PR res:
[0x308e5:0x0:0x0].0x0 rrc: 2023 type: EXT [0->2483986431] (req
2483920896->2483986431) gid 0 flags: 0x60000400020020 nid:
192.168.68.132 at tcp1 remote: 0x7090539ead09565e expref: 2484 pid: 7009
timeout: 344608 lvb_type: 0
```
All the while, SMB on the client is unhappy:
```
[285788.856940] INFO: task smbd:11634 blocked for more than 120 seconds.
[285788.857717]       Tainted: P           O      4.19.0-21-amd64 #1 Debian
4.19.208-3
[285788.858468] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[285788.859160] smbd            D    0 11634  30610 0x00000000
[285788.859161] Call Trace:
[285788.859168]  __schedule+0x29f/0x840
[285788.859170]  schedule+0x28/0x80
[285788.859171]  io_schedule+0x12/0x40
[285788.859176]  wait_on_page_bit+0x114/0x150
[285788.859178]  ? page_cache_tree_insert+0xe0/0xe0
[285788.859230]  cl_page_assume+0x68/0x230 [obdclass]
[285788.859253]  ll_write_begin+0x365/0xf30 [lustre]
[285788.859256]  generic_perform_write+0xf4/0x1b0
[285788.859258]  ? file_update_time+0xed/0x130
[285788.859259]  __generic_file_write_iter+0xfe/0x1c0
[285788.859269]  vvp_io_write_start+0x502/0xb80 [lustre]
[285788.859283]  cl_io_start+0x59/0x110 [obdclass]
[285788.859296]  cl_io_loop+0x9a/0x1e0 [obdclass]
[285788.859304]  ll_file_io_generic+0x37f/0xb30 [lustre]
[285788.859312]  ll_file_write_iter+0x4fc/0x620 [lustre]
[285788.859315]  new_sync_write+0xfb/0x160
[285788.859316]  vfs_write+0xa5/0x1a0
[285788.859318]  ksys_pwrite64+0x61/0xa0
[285788.859322]  do_syscall_64+0x53/0x110
[285788.859324]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[285788.859326] RIP: 0033:0x7f926ec7eedf
[285788.859328] Code: Bad RIP value.
[285788.859329] RSP: 002b:00007f9261b75b50 EFLAGS: 00000293 ORIG_RAX:
0000000000000012
[285788.859331] RAX: ffffffffffffffda RBX: 000000000000003b RCX:
00007f926ec7eedf
[285788.859331] RDX: 0000000000001000 RSI: 000055b55876f6f0 RDI:
000000000000003b
[285788.859332] RBP: 000055b55876f6f0 R08: 0000000000000000 R09:
00000000ffffffff
[285788.859332] R10: 000000000cf51ab0 R11: 0000000000000293 R12:
0000000000001000
[285788.859333] R13: 000000000cf51ab0 R14: 00007f926dde4c00 R15:
000055b558734430
```

Possibly related: there are a number of reconnects in the logs that get labeled
as network errors, seemingly triggered by IPs that are resident on the nodes
but not part of the cluster configuration (a quick cross-check against the
local VIPs is shown after the log excerpt):
```
00000800:02000400:10.0:1663252872.643150:0:5192:0:(socklnd_cb.c:614:ksocknal_process_transmit())
Host 192.168.98.137 reset our connection while we were sending data; it may
have rebooted.
00000100:00000400:8.0:1663252872.643164:0:5321:0:(client.c:2282:ptlrpc_expire_one_request())
@@@ Request sent has failed due to network error: [sent 1663252872/real
1663252872]  req at 0000000051b5e1ac x1743829801974592/t0(0)
o2->dfsone-OST0001-osc-ffff9f85fcbe7800 at 192.168.68.131@tcp1:28/4 lens
440/432 e 0 to 1 dl 1663252879 ref 1 fl Rpc:eXQr/0/ffffffff rc 0/-1 job:''
00000100:00080000:8.0:1663252872.643174:0:5321:0:(import.c:86:import_set_state_nolock())
00000000f6d77c17 dfsone-OST0001_UUID: changing import state from FULL to
DISCONN
00000100:02000400:8.0:1663252872.643175:0:5321:0:(import.c:200:ptlrpc_set_import_discon())
dfsone-OST0001-osc-ffff9f85fcbe7800: Connection to dfsone-OST0001 (at
192.168.68.131 at tcp1) was lost; in progress operations using this service
will wait for recovery to complete
00000100:00080000:8.0:1663252872.643178:0:5321:0:(import.c:444:ptlrpc_pinger_force())
dfsone-OST0001_UUID: waking up pinger s:DISCONN
00000100:00020000:6.0:1663252872.643180:0:5174:0:(events.c:208:client_bulk_callback())
event type 1, status -5, desc 00000000943f7085
00000100:00080000:9.0:1663252872.643186:0:26650:0:(pinger.c:249:ptlrpc_pinger_process_import())
af951890-1dbc-4df6-9745-dfe62f31e4e6->dfsone-OST0001_UUID: level DISCONN/3
force 1 force_next 0 deactive 0 pingable 1 suppress 0
00000100:00080000:9.0:1663252872.643188:0:26650:0:(recover.c:58:ptlrpc_initiate_recovery())
dfsone-OST0001_UUID: starting recovery
00000100:00080000:9.0:1663252872.643189:0:26650:0:(import.c:86:import_set_state_nolock())
00000000f6d77c17 dfsone-OST0001_UUID: changing import state from DISCONN to
CONNECTING
00000100:00080000:9.0:1663252872.643191:0:26650:0:(import.c:544:import_select_connection())
dfsone-OST0001-osc-ffff9f85fcbe7800: connect to NID 192.168.68.131 at tcp1
last attempt 9075
00000100:00080000:9.0:1663252872.643193:0:26650:0:(import.c:625:import_select_connection())
dfsone-OST0001-osc-ffff9f85fcbe7800: import 00000000f6d77c17 using
connection 192.168.68.131 at tcp1/192.168.68.131 at tcp1
00000100:00000400:7.0:1663252872.643216:0:5309:0:(client.c:2282:ptlrpc_expire_one_request())
@@@ Request sent has failed due to network error: [sent 1663252871/real
1663252871]  req at 000000001865eefa x1743829801842240/t0(0)
o4->dfsone-OST0001-osc-ffff9f85fcbe7800 at 192.168.68.131@tcp1:6/4 lens
488/448 e 0 to 1 dl 1663252878 ref 2 fl Rpc:eXQr/0/ffffffff rc 0/-1 job:''
00000100:00080000:7.0:1663252872.643222:0:5309:0:(import.c:234:ptlrpc_set_import_discon())
osc: import 00000000f6d77c17 already not connected (conn 12, was 11):
CONNECTING
00000800:02000400:12.0:1663252872.643226:0:5193:0:(socklnd_cb.c:614:ksocknal_process_transmit())
Host 192.168.68.137 reset our connection while we were sending data; it may
have rebooted.
```
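
(As mentioned above, a quick cross-check that addresses like .137 are just the
local multichannel VIPs on the bond VLANs, rather than some unknown host, is to
compare what is plumbed on the interfaces against what LNet is actually using:)
```
# All addresses on the bond VLAN interfaces vs. the NIDs LNet knows about
ip -o -4 addr show dev lacpbond0.1
ip -o -4 addr show dev lacpbond0.2
lnetctl net show | grep nid
```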

Potentially of note are the LNet statistics from 132:
```
statistics:
    msgs_alloc: 8
    msgs_max: 1657
    rst_alloc: 0
    errors: 0
    send_count: 485076548
    resend_count: 7
    response_timeout_count: 0
    local_interrupt_count: 0
    local_dropped_count: 0
    local_aborted_count: 0
    local_no_route_count: 0
    local_timeout_count: 3
    local_error_count: 25
    remote_dropped_count: 27
    remote_error_count: 850
    remote_timeout_count: 0
    network_timeout_count: 0
    recv_count: 484607083
    route_count: 0
    drop_count: 52
    send_length: 135618914369236
    recv_length: 640430677792
    route_length: 0
    drop_length: 25056
```
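
(That block is plain `lnetctl stats show` output from 132; polling it during a
failing run, e.g. as below, shows which of the error/drop counters actually
move.)
```
# Poll the LNet counters every few seconds during a test run
watch -n 5 'lnetctl stats show | grep -E "error|drop|timeout|resend"'
```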

I'm wondering if someone with fresher eyes and a better knowledge of Lustre's
internals might be able to tell me where things are breaking. Is this a known
issue that may be resolved in a future release, a problem with my cluster's
network config, or something I can resolve with tuning parameters to
accommodate these larger-than-usual transfers? Modifying client behavior is a
possibility, but it would require engaging with the vendor.
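
For concreteness, this is the sort of tuning I have in mind, if it's even the
right direction (values below are placeholders, not something I've applied):
```
# Inspect the current timeout / adaptive-timeout settings on clients and servers
lctl get_param timeout at_min at_max at_history
# Example of widening them persistently from the MGS (placeholder values)
lctl set_param -P timeout=300 at_max=600
```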

Cheers, and thanks for your time!
Christian