[lustre-discuss] MDS often overloaded

Zeeshan Ali Shah javaclinic at gmail.com
Mon Oct 8 02:51:05 PDT 2018


We are getting the following errors when we run rsync.

Background: we have three filesystems on the same MDS, and each filesystem's MDT is on a different ZFS pool. Could that be an issue?

Any advice?

Errors below:
--------
[Mon Oct  8 12:29:11 2018] Lustre: sgp-MDT0000-mdc-ffff883ffc2c3000: Connection restored to 172.100.120.25@o2ib (at 172.100.120.25@o2ib)
[Mon Oct  8 12:29:11 2018] Lustre: Skipped 14 previous similar messages
[Mon Oct  8 12:31:25 2018] LNet: 31340:0:(o2iblnd_cb.c:1350:kiblnd_reconnect_peer()) Abort reconnection of 172.100.120.25@o2ib: connected
[Mon Oct  8 12:31:25 2018] LNet: 31340:0:(o2iblnd_cb.c:1350:kiblnd_reconnect_peer()) Skipped 9 previous similar messages
[Mon Oct  8 12:31:32 2018] Lustre: 54961:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1538991905/real 1538991905] req@ffff8819acf68f00 x1611847503061104/t0(0) o36->sgp-MDT0000-mdc-ffff883ffc2c3000@172.100.120.25@o2ib:12/10 lens 608/33520 e 0 to 1 dl 1538991912 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[Mon Oct  8 12:31:32 2018] Lustre: 54961:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
[Mon Oct  8 12:31:32 2018] Lustre: sgp-MDT0000-mdc-ffff883ffc2c3000: Connection to sgp-MDT0000 (at 172.100.120.25@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[Mon Oct  8 12:31:32 2018] Lustre: Skipped 2 previous similar messages
[Mon Oct  8 12:31:32 2018] Lustre: sgp-MDT0000-mdc-ffff883ffc2c3000: Connection restored to 172.100.120.25@o2ib (at 172.100.120.25@o2ib)
[Mon Oct  8 12:31:32 2018] Lustre: Skipped 2 previous similar messages
[Mon Oct  8 12:34:01 2018] LNet: 25934:0:(o2iblnd_cb.c:2307:kiblnd_passive_connect()) Stale connection request
[Mon Oct  8 12:34:01 2018] LNet: 25934:0:(o2iblnd_cb.c:2307:kiblnd_passive_connect()) Skipped 2 previous similar messages
[Mon Oct  8 12:34:01 2018] LNet: 31340:0:(o2iblnd_cb.c:1350:kiblnd_reconnect_peer()) Abort reconnection of 172.100.120.25@o2ib: connected
[Mon Oct  8 12:34:01 2018] LNet: 31340:0:(o2iblnd_cb.c:1350:kiblnd_reconnect_peer()) Skipped 3 previous similar messages
[Mon Oct  8 12:34:08 2018] Lustre: 54961:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1538992061/real 1538992061] req@ffff881ad060f500 x1611847503440304/t0(0) o101->sgp-MDT0000-mdc-ffff883ffc2c3000@172.100.120.25@o2ib:12/10 lens 880/33728 e 0 to 1 dl 1538992068 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[Mon Oct  8 12:34:08 2018] Lustre: 54961:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[Mon Oct  8 12:34:08 2018] Lustre: sgp-MDT0000-mdc-ffff883ffc2c3000: Connection to sgp-MDT0000 (at 172.100.120.25@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[Mon Oct  8 12:34:08 2018] Lustre: Skipped 1 previous similar message
[Mon Oct  8 12:34:08 2018] Lustre: sgp-MDT0000-mdc-ffff883ffc2c3000: Connection restored to 172.100.120.25@o2ib (at 172.100.120.25@o2ib)
[Mon Oct  8 12:34:08 2018] Lustre: Skipped 1 previous similar message
------
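The two ptlrpc_expire_one_request() messages in the log carry epoch timestamps, and the gap between them can be checked directly. Below is a minimal Python sketch, assuming the "sent" field is the time the request was sent and "dl" is its reply deadline, both in seconds since the epoch. It pulls those fields, the opcode and the MDC import name out of each message. The regular expression EXPIRED_REQ and the helper parse_timeouts are hypothetical names for illustration, not part of any Lustre tool.

```python
import re

# The two "Request sent has timed out" messages from the log, each rejoined onto one line.
LOG_LINES = [
    "[Mon Oct  8 12:31:32 2018] Lustre: 54961:0:(client.c:2114:ptlrpc_expire_one_request()) "
    "@@@ Request sent has timed out for slow reply: [sent 1538991905/real 1538991905] "
    "req@ffff8819acf68f00 x1611847503061104/t0(0) "
    "o36->sgp-MDT0000-mdc-ffff883ffc2c3000@172.100.120.25@o2ib:12/10 lens 608/33520 "
    "e 0 to 1 dl 1538991912 ref 2 fl Rpc:X/0/ffffffff rc 0/-1",
    "[Mon Oct  8 12:34:08 2018] Lustre: 54961:0:(client.c:2114:ptlrpc_expire_one_request()) "
    "@@@ Request sent has timed out for slow reply: [sent 1538992061/real 1538992061] "
    "req@ffff881ad060f500 x1611847503440304/t0(0) "
    "o101->sgp-MDT0000-mdc-ffff883ffc2c3000@172.100.120.25@o2ib:12/10 lens 880/33728 "
    "e 0 to 1 dl 1538992068 ref 2 fl Rpc:X/0/ffffffff rc 0/-1",
]

# Hypothetical pattern: captures the send time, the opcode after " o", the import name
# before the first "@", and the deadline after " dl".
EXPIRED_REQ = re.compile(r"\[sent (\d+)/real \d+\].*? o(\d+)->(\S+?)@.*? dl (\d+)")


def parse_timeouts(lines):
    """Return one dict per expired-request message (hypothetical helper)."""
    results = []
    for line in lines:
        match = EXPIRED_REQ.search(line)
        if match is None:
            continue  # not a ptlrpc_expire_one_request() timeout message
        sent, opcode, target, deadline = match.groups()
        results.append({
            "opcode": int(opcode),
            "target": target,
            "sent": int(sent),
            "deadline": int(deadline),
            # Seconds between sending the request and its reply deadline.
            "window_s": int(deadline) - int(sent),
        })
    return results


results = parse_timeouts(LOG_LINES)

if __name__ == "__main__":
    for r in results:
        print(f"o{r['opcode']} to {r['target']}: deadline {r['window_s']} s after send")
```

For both messages the deadline is 7 seconds after the send time (1538991912 - 1538991905 and 1538992068 - 1538992061). Assuming the standard Lustre opcode numbering (MDS_REINT = 36, LDLM_ENQUEUE = 101), the expired requests were a metadata update and a lock enqueue sent to sgp-MDT0000.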