[lustre-discuss] Robinhood exhausting RPC resources against 2.5.5 lustre file systems

Oucharek, Doug S doug.s.oucharek at intel.com
Wed May 17 14:06:24 PDT 2017


How is it that you are getting the same NID registered twice in the log file:

Feb 24 20:46:44 10.7.7.8 kernel: LNet: Added LNI 10.7.17.8@o2ib [8/256/0/180]
Feb 24 20:46:44 10.7.7.8 kernel: LNet: Added LNI 10.7.17.8@o2ib [8/256/0/180]
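One quick check is whether every line in the file is doubled, not just the LNet one, which would point at duplicate syslog forwarding to the collector rather than LNet actually adding the NID twice. A minimal sketch, using a stand-in log file name and sample contents:

```shell
# Sketch: find exact duplicate lines in a syslog file. If *all* kernel
# lines show up with count 2, the doubling is on the syslog side.
# "messages.sample" is a stand-in for the real /var/log/messages.
log=messages.sample

# sample data mimicking the doubled lines quoted above
cat > "$log" <<'EOF'
Feb 24 20:46:44 10.7.7.8 kernel: LNet: Added LNI 10.7.17.8@o2ib [8/256/0/180]
Feb 24 20:46:44 10.7.7.8 kernel: LNet: Added LNI 10.7.17.8@o2ib [8/256/0/180]
Feb 24 20:46:45 10.7.7.8 kernel: LDISKFS-fs (md127): mounted filesystem with ordered data mode. quota=off. Opts:
EOF

# print each line that occurs more than once, prefixed with its count
sort "$log" | uniq -cd
```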

Doug

On May 17, 2017, at 11:04 AM, Jessica Otey <jotey at nrao.edu> wrote:


All,

We have observed an unfortunate interaction between Robinhood and two Lustre 2.5.5 file systems (both of which originated as 1.8.9 file systems).

Robinhood was used successfully against these file systems at 1.8.9, 2.4.3, and then 2.5.3 (a total time span of about 11 months).

We also have a third Lustre file system that originated as 2.4.3 and has since been upgraded to 2.5.5, against which Robinhood is currently operating as expected. This leads me to suppose that the issue may have to do with the interaction between Robinhood and a legacy-1.8.x-now-lustre-2.5.5 system. But I don't know.

The problem manifests itself as follows: either a Robinhood file scan or the start of changelog consumption consumes all the available RPC resources on the MDT. This in turn leaves the MDT unable to satisfy any other requests from clients, which in turn leads to client disconnections (the MDT thinks they are dead and evicts them). Meanwhile, Robinhood itself is unable to traverse the file system to gather the information it seeks, so its scans either hang (due to the client disconnect) or run at a rate at which they would never complete (less than 1 file per second).

If we don't run Robinhood at all, the file system performs as expected (after a remount of the MDT).

Initially, we thought the difficulty might be that we had neglected to activate the FID-in-dirent feature when we upgraded to 2.4.3. We activated it on one of these systems and ran an lfsck oi_scrub, but that did not ameliorate the problem.

Any thoughts on this matter would be appreciated. (We miss using Robinhood!)

Thanks,

Jessica

************************************************************

More data for those who cannot help themselves:

April 2016 - Robinhood comes into production use against both our 1.8.9 file systems.

July 2016 - Upgrade to 2.4.3 (on both production lustre file systems) -- Robinhood rebuilt against 2.4.3 client; changelog consumption now included.

        Lustre "reconnects" (from /var/log/messages on one of the MDTs):

        July 2016: 4

        Aug 2016: 20

        Sept 2016: 8

        Oct 2016: 8

Nov 4-6, 2016 - Upgrade to 2.5.3 (on both production lustre file systems) -- Robinhood rebuilt against 2.5.3 client.

        Lustre "reconnects":

        Nov. 2016: 180

        Dec. 2016: 62

        Jan. 2017: 96

        Feb 1-24, 2017: 2

Feb 24, 2017 - Upgrade to 2.5.5 (on both production lustre file systems)

**** NAASC-Lustre MDT coming back ****

Feb 24 20:46:44 10.7.7.8 kernel: Lustre: Lustre: Build Version: 2.5.5-g22a210f-CHANGED-2.6.32-642.6.2.el6_lustre.2.5.5.x86_64
Feb 24 20:46:44 10.7.7.8 kernel: Lustre: Lustre: Build Version: 2.5.5-g22a210f-CHANGED-2.6.32-642.6.2.el6_lustre.2.5.5.x86_64
Feb 24 20:46:44 10.7.7.8 kernel: LNet: Added LNI 10.7.17.8@o2ib [8/256/0/180]
Feb 24 20:46:44 10.7.7.8 kernel: LNet: Added LNI 10.7.17.8@o2ib [8/256/0/180]
Feb 24 20:46:45 10.7.7.8 kernel: LDISKFS-fs (md127): mounted filesystem with ordered data mode. quota=off. Opts:
Feb 24 20:46:45 10.7.7.8 kernel: LDISKFS-fs (md127): mounted filesystem with ordered data mode. quota=off. Opts:
Feb 24 20:46:46 10.7.7.8 kernel: Lustre: MGC10.7.17.8@o2ib: Connection restored to MGS (at 0@lo)
Feb 24 20:46:46 10.7.7.8 kernel: Lustre: MGC10.7.17.8@o2ib: Connection restored to MGS (at 0@lo)
Feb 24 20:46:47 10.7.7.8 kernel: Lustre: naaschpc-MDT0000: used disk, loading
Feb 24 20:46:47 10.7.7.8 kernel: Lustre: naaschpc-MDT0000: used disk, loading

The night after this upgrade, a regular rsync to the backup Lustre system provokes a failure/client disconnect. (Unfortunately, I don't have the logs of Robinhood activity from this time, but I believe I restarted the service after the system came back.)

Feb 25 02:14:24 10.7.7.8 kernel: LustreError: 25103:0:(service.c:2020:ptlrpc_server_handle_request()) @@@ Dropping timed-out request from 12345-10.7.17.123@o2ib: deadline 6:11s ago
Feb 25 02:14:24 10.7.7.8 kernel: LustreError: 25103:0:(service.c:2020:ptlrpc_server_handle_request()) @@@ Dropping timed-out request from 12345-10.7.17.123@o2ib: deadline 6:11s ago
Feb 25 02:14:24 10.7.7.8 kernel:  req@ffff88045b3a2050 x1560271381909936/t0(0) o103->bb228923-4216-cc59-d847-38b543af1ae2@10.7.17.123@o2ib:0/0 lens 3584/0 e 0 to 0 dl 1488006853 ref 1 fl Interpret:/0/ffffffff rc 0/-1
Feb 25 02:14:24 10.7.7.8 kernel:  req@ffff88045b3a2050 x1560271381909936/t0(0) o103->bb228923-4216-cc59-d847-38b543af1ae2@10.7.17.123@o2ib:0/0 lens 3584/0 e 0 to 0 dl 1488006853 ref 1 fl Interpret:/0/ffffffff rc 0/-1
Feb 25 02:14:24 10.7.7.8 kernel: Lustre: 25111:0:(service.c:2052:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (6:11s); client may timeout.  req@ffff88045b3a2850 x1560271381909940/t0(0) o103->bb228923-4216-cc59-d847-38b543af1ae2@10.7.17.123@o2ib:0/0 lens 3584/0 e 0 to 0 dl 1488006853 ref 1 fl Interpret:/0/ffffffff rc 0/-1
Feb 25 02:14:24 10.7.7.8 kernel: Lustre: 25111:0:(service.c:2052:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (6:11s); client may timeout.  req@ffff88045b3a2850 x1560271381909940/t0(0) o103->bb228923-4216-cc59-d847-38b543af1ae2@10.7.17.123@o2ib:0/0 lens 3584/0 e 0 to 0 dl 1488006853 ref 1 fl Interpret:/0/ffffffff rc 0/-1
Feb 25 02:14:24 10.7.7.8 kernel: LustreError: 25103:0:(service.c:2020:ptlrpc_server_handle_request()) Skipped 30 previous similar messages
Feb 25 02:14:24 10.7.7.8 kernel: LustreError: 25103:0:(service.c:2020:ptlrpc_server_handle_request()) Skipped 30 previous similar messages
Feb 25 02:14:25 10.7.7.8 kernel: LustreError: 7526:0:(service.c:2020:ptlrpc_server_handle_request()) @@@ Dropping timed-out request from 12345-10.7.17.123@o2ib: deadline 6:11s ago
Feb 25 02:14:25 10.7.7.8 kernel: LustreError: 7526:0:(service.c:2020:ptlrpc_server_handle_request()) @@@ Dropping timed-out request from 12345-10.7.17.123@o2ib: deadline 6:11s ago
Feb 25 02:14:25 10.7.7.8 kernel:  req@ffff880149b75850 x1560271381926356/t0(0) o103->bb228923-4216-cc59-d847-38b543af1ae2@10.7.17.123@o2ib:0/0 lens 3584/0 e 0 to 0 dl 1488006854 ref 1 fl Interpret:/0/ffffffff rc 0/-1
Feb 25 02:14:25 10.7.7.8 kernel:  req@ffff880149b75850 x1560271381926356/t0(0) o103->bb228923-4216-cc59-d847-38b543af1ae2@10.7.17.123@o2ib:0/0 lens 3584/0 e 0 to 0 dl 1488006854 ref 1 fl Interpret:/0/ffffffff rc 0/-1
Feb 25 02:14:25 10.7.7.8 kernel: Lustre: 7528:0:(service.c:2052:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (6:11s); client may timeout.  req@ffff880149bb0050 x1560271381926372/t0(0) o103->bb228923-4216-cc59-d847-38b543af1ae2@10.7.17.123@o2ib:0/0 lens 3584/0 e 0 to 0 dl 1488006854 ref 1 fl Interpret:/0/ffffffff rc 0/-1
Feb 25 02:14:25 10.7.7.8 kernel: Lustre: 7528:0:(service.c:2052:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (6:11s); client may timeout.  req@ffff880149bb0050 x1560271381926372/t0(0) o103->bb228923-4216-cc59-d847-38b543af1ae2@10.7.17.123@o2ib:0/0 lens 3584/0 e 0 to 0 dl 1488006854 ref 1 fl Interpret:/0/ffffffff rc 0/-1
Feb 25 02:14:25 10.7.7.8 kernel: Lustre: 7528:0:(service.c:2052:ptlrpc_server_handle_request()) Skipped 1533 previous similar messages
Feb 25 02:14:25 10.7.7.8 kernel: Lustre: 7528:0:(service.c:2052:ptlrpc_server_handle_request()) Skipped 1533 previous similar messages
Feb 25 02:14:25 10.7.7.8 kernel: LustreError: 7526:0:(service.c:2020:ptlrpc_server_handle_request()) Skipped 1653 previous similar messages
Feb 25 02:14:25 10.7.7.8 kernel: LustreError: 7526:0:(service.c:2020:ptlrpc_server_handle_request()) Skipped 1653 previous similar messages
Feb 25 02:14:49 10.7.7.8 kernel: Lustre: naaschpc-MDT0000: Client bb228923-4216-cc59-d847-38b543af1ae2 (at 10.7.17.123@o2ib) reconnecting
Feb 25 02:14:49 10.7.7.8 kernel: Lustre: naaschpc-MDT0000: Client bb228923-4216-cc59-d847-38b543af1ae2 (at 10.7.17.123@o2ib) reconnecting
Feb 25 02:14:49 10.7.7.8 kernel: Lustre: naaschpc-MDT0000: Client bb228923-4216-cc59-d847-38b543af1ae2 (at 10.7.17.123@o2ib) refused reconnection, still busy with 1 active RPCs
Feb 25 02:14:49 10.7.7.8 kernel: Lustre: naaschpc-MDT0000: Client bb228923-4216-cc59-d847-38b543af1ae2 (at 10.7.17.123@o2ib) refused reconnection, still busy with 1 active RPCs
Feb 25 02:14:49 10.7.7.8 kernel: Lustre: Skipped 1 previous similar message
Feb 25 02:14:49 10.7.7.8 kernel: Lustre: Skipped 1 previous similar message
Feb 25 02:14:49 10.7.7.8 kernel: LustreError: 9437:0:(ldlm_lib.c:2693:target_bulk_io()) @@@ bulk PUT failed: rc -107  req@ffff88050290d050 x1560271382966952/t0(0) o37->bb228923-4216-cc59-d847-38b543af1ae2@10.7.17.123@o2ib:0/0 lens 448/440 e 0 to 0 dl 1488006895 ref 1 fl Interpret:/0/0 rc 0/0
Feb 25 02:14:49 10.7.7.8 kernel: LustreError: 9437:0:(ldlm_lib.c:2693:target_bulk_io()) @@@ bulk PUT failed: rc -107  req@ffff88050290d050 x1560271382966952/t0(0) o37->bb228923-4216-cc59-d847-38b543af1ae2@10.7.17.123@o2ib:0/0 lens 448/440 e 0 to 0 dl 1488006895 ref 1 fl Interpret:/0/0 rc 0/0
Feb 25 02:15:14 10.7.7.8 kernel: Lustre: naaschpc-MDT0000: Client bb228923-4216-cc59-d847-38b543af1ae2 (at 10.7.17.123@o2ib) reconnecting
Feb 25 02:15:14 10.7.7.8 kernel: Lustre: naaschpc-MDT0000: Client bb228923-4216-cc59-d847-38b543af1ae2 (at 10.7.17.123@o2ib) reconnecting

This problem then escalates... the Lustre "reconnects" from the logs for March will be 2818; for April, 4574!

Grepping for "active RPCs", excluding dates of known downtimes (Nov 5, 2016; Feb 24, 2017; Apr 21, 2017). Dates not listed below had zero occurrences.

2016
Nov 22nd: 72
Dec 14th: 32

2017
Jan 8th: 20
Jan 28th: 18
---
Feb 25: 2
---
Mar 13: 56
Mar 14: 236  First user ticket filed about fs non-responsiveness
Mar 15: 60
Mar 16: 90
Mar 17: 160
Mar 18: 280
Mar 19: 178
Mar 20: 280
Mar 21: 276
Mar 22: 140
Mar 23: 102
Mar 24: 216
Mar 25: 278
Mar 26: 142
Mar 27: 82
Mar 28: 52
Mar 29: 78
Mar 30: 52
Mar 31: 52
--
Apr 1: 112
Apr 2: 216
Apr 3: 276
Apr 4: 278
Apr 5: 278
Apr 6: 278
Apr 7: 158
Apr 8: 216
Apr 9: 278
Apr 10: 118
Apr 11: 216
Apr 12: 156
Apr 13: 78
Apr 14: 42
Apr 15: 74
Apr 16: 216
Apr 17: 88
Apr 18: 62
Apr 19: 198
Apr 20: 280
April 21, 2017 - on naasc lustre only: MDT and OSTs unmounted; MDT remounted in ordered mode; lfsck oi_scrub run
Apr 22: 0
Apr 23: 0
Apr 24: 0
Apr 25: 0
Apr 26: 0
Apr 27: 0
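The per-day counts above can be produced with a one-liner along these lines (a sketch; the log file name, sample contents, and excluded dates are stand-ins for the real /var/log/messages on the MDT):

```shell
# Sketch: tally "active RPCs" refusals per day from an MDS syslog,
# excluding known-downtime dates.
log=messages.sample

# small sample mimicking the real MDT log
cat > "$log" <<'EOF'
Mar 14 02:14:49 10.7.7.8 kernel: Lustre: naaschpc-MDT0000: Client X (at 10.7.17.123@o2ib) refused reconnection, still busy with 1 active RPCs
Mar 14 03:10:02 10.7.7.8 kernel: Lustre: naaschpc-MDT0000: Client X (at 10.7.17.123@o2ib) refused reconnection, still busy with 1 active RPCs
Mar 15 01:00:00 10.7.7.8 kernel: Lustre: naaschpc-MDT0000: Client X (at 10.7.17.123@o2ib) refused reconnection, still busy with 1 active RPCs
EOF

grep 'active RPCs' "$log" \
  | grep -vE '^(Nov  5|Feb 24|Apr 21)' \
  | awk '{count[$1" "$2]++} END {for (d in count) print d": "count[d]}' \
  | sort
# prints: Mar 14: 2
#         Mar 15: 1
```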

On Friday, April 28, I started a file system scan from our 'new' Robinhood box, akebono, against both production file systems:

2017/04/28 14:38:35 [130550/1] CheckFS | '/.lustre/naasc' matches mount point '/.lustre/naasc', type=lustre, fs=10.7.17.8@o2ib0:/naaschpc

This scan does not successfully complete, and the RPC resource problem reappears:
Apr 28: 114      (the first occurrence is at 3pm)

Apr 28 15:01:30 10.7.7.8 kernel: Lustre: naaschpc-MDT0000: Client cab57f86-2124-884d-49a2-ec2871f36f78 (at 10.7.17.122@o2ib) refused reconnection, still busy with 1 active RPCs

Apr 29: 282
Apr 30: 284
---
May 1: 186
May 2: 248
May 3: 204 (day of emergency MDT restart)

Since May 4: 0    Robinhood has not been restarted against NAASC Lustre, and rsyncs have been running without issues.

*********
A Robinhood scan/changelog consumption also produced an identical situation on our other Lustre file system:

2017/04/28 14:53:11 [130926/1] CheckFS | '/.lustre/cv' matches mount point '/.lustre/cv', type=lustre, fs=10.7.3.9@tcp2:/cvlustre

***** FROM CV-lustre's MDT *****

Apr 28 14:53:24 10.7.7.9 kernel: Lustre: cvlustre-MDT0000: Client deb3834b-6b53-5e68-123b-dfaedd352ac6 (at 10.7.17.122@o2ib) refused reconnection, still busy with 1 active RPCs
Apr 28 14:53:24 10.7.7.9 kernel: Lustre: cvlustre-MDT0000: Client deb3834b-6b53-5e68-123b-dfaedd352ac6 (at 10.7.17.122@o2ib) refused reconnection, still busy with 1 active RPCs
Apr 28 14:53:24 10.7.7.9 kernel: Lustre: Skipped 19 previous similar messages
Apr 28 14:53:24 10.7.7.9 kernel: Lustre: Skipped 19 previous similar messages
_______________________________________________
lustre-discuss mailing list
lustre-discuss at lists.lustre.org<mailto:lustre-discuss at lists.lustre.org>
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
