[lustre-discuss] Robinhood exhausting RPC resources against 2.5.5 lustre file systems

Colin Faber cfaber at gmail.com
Wed May 17 12:56:57 PDT 2017


Have you been able to replicate the issue with a simple test?


find /mnt/lustre -exec lfs path2fid {} \;

?
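
If a plain walk like that reproduces the MDS overload on its own, it would point
at raw metadata request volume rather than anything Robinhood-specific. A rough,
throttled variant to bound the load (the 10000-file cap and the sleep are
arbitrary):

    # fid-lookup a bounded sample of files, one at a time, while watching
    # the MDS logs for "active RPCs" / eviction messages
    find /mnt/lustre -type f | head -n 10000 | while read -r f; do
        lfs path2fid "$f"
        sleep 0.01
    done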

On May 17, 2017 12:04 PM, "Jessica Otey" <jotey at nrao.edu> wrote:

> All,
>
> We have observed an unfortunate interaction between Robinhood and two
> Lustre 2.5.5 file systems (both of which originated as 1.8.9 file systems).
>
> Robinhood was used successfully against both file systems while they were
> running 1.8.9, then 2.4.3, and then 2.5.3 (a total time span of about 11 months).
>
> We also have a third Lustre file system that originated as 2.4.3, and has
> since been upgraded to 2.5.5, against which Robinhood is currently
> operating as expected. This leads me to suppose that the issue may have to
> do with the interaction between Robinhood and a legacy-1.8.x-now-lustre-2.5.5
> system. But I don't know.
>
> The problem manifests itself as follows: Either a Robinhood file scan or
> the initiation of the consumption of changelogs results in the consumption
> of all the available RPC resources on the MDT. This in turn leads to the MDT
> not being able to satisfy any other requests from clients, which in turn
> leads to client disconnections (the MDT thinks they are dead and evicts
> them). Meanwhile, Robinhood itself is unable to traverse the file system to
> gather the information it seeks, and so its scans either hang (due to the
> client disconnect) or run at a rate such that they would never complete
> (less than 1 file per second).
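>
> For anyone who wants to look at the MDS side while this is happening,
> something along these lines should show service thread usage and the
> registered changelog readers (parameter names taken from the 2.5 manual,
> so treat them as approximate):
>
>     # on the MDS: running metadata service threads vs. the configured max
>     lctl get_param mds.MDS.mdt.threads_started mds.MDS.mdt.threads_max
>     # registered changelog consumers (Robinhood's reader appears here)
>     lctl get_param mdd.*.changelog_users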
>
> If we don't run robinhood at all, the file system performs (after a
> remount of the MDT) as expected.
>
> Initially, we thought that the difficulty might be that we neglected to
> activate the FID-in-dirent feature when we upgraded to 2.4.3. We did so on
> one of these systems, and ran an lfsck oi_scrub, but that did not
> ameliorate the problem.
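>
> In case it helps anyone trying to reproduce this, the OI scrub state on the
> MDT can be read back with something like the following (osd-ldiskfs
> parameter name per the manual, so approximate):
>
>     # on the MDS: current OI scrub status/statistics for this MDT
>     lctl get_param osd-ldiskfs.naaschpc-MDT0000.oi_scrub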
>
> Any thoughts on this matter would be appreciated. (We miss using
> Robinhood!)
>
> Thanks,
>
> Jessica
>
> ************************************************************
>
> More data for those who cannot help themselves:
> April 2016 - Robinhood comes into production use against both our 1.8.9
> file systems.
>
> July 2016 - Upgrade to 2.4.3 (on both production lustre file systems) --
> Robinhood rebuilt against 2.4.3 client; changelog consumption now included.
>
>         Lustre "reconnects" (from /var/log/messages on one of the MDTs):
>
>         July 2016: 4
>
>         Aug 2016: 20
>
>         Sept 2016: 8
>
>         Oct 2016: 8
>
> Nov 4-6, 2016 - Upgrade to 2.5.3 (on both production lustre file systems)
> -- Robinhood rebuilt against 2.5.3 client.
>
>         Lustre "reconnects":
>
> Nov. 2016: 180
>
> Dec. 2016: 62
>
> Jan. 2017: 96
>
> Feb 1-24, 2017: 2
>
> Feb 24, 2017 - Upgrade to 2.5.5 (on both production lustre file
> systems)
>
> **** NAASC-Lustre MDT coming back ****
> Feb 24 20:46:44 10.7.7.8 kernel: Lustre: Lustre: Build Version:
> 2.5.5-g22a210f-CHANGED-2.6.32-642.6.2.el6_lustre.2.5.5.x86_64
> Feb 24 20:46:44 10.7.7.8 kernel: LNet: Added LNI 10.7.17.8 at o2ib
> [8/256/0/180]
> Feb 24 20:46:45 10.7.7.8 kernel: LDISKFS-fs (md127): mounted filesystem
> with ordered data mode. quota=off. Opts:
> Feb 24 20:46:46 10.7.7.8 kernel: Lustre: MGC10.7.17.8 at o2ib: Connection
> restored to MGS (at 0 at lo)
> Feb 24 20:46:47 10.7.7.8 kernel: Lustre: naaschpc-MDT0000: used disk,
> loading
>
> The night after this upgrade, a regular rsync to the backup Lustre system
> provokes a failure/client disconnect. (Unfortunately, I don't have the logs
> to look at Robinhood activity from this time, but I believe I restarted the
> service after the system came back.)
>
> Feb 25 02:14:24 10.7.7.8 kernel: LustreError: 25103:0:(service.c:2020:
> ptlrpc_server_handle_request()) @@@ Dropping timed-out request from
> 12345-10.7.17.123 at o2ib: deadline 6:11s ago
> Feb 25 02:14:24 10.7.7.8 kernel:  req at ffff88045b3a2050
> x1560271381909936/t0(0) o103->bb228923-4216-cc59-d847-
> 38b543af1ae2 at 10.7.17.123@o2ib:0/0 lens 3584/0 e 0 to 0 dl 1488006853 ref
> 1 fl Interpret:/0/ffffffff rc 0/-1
> Feb 25 02:14:24 10.7.7.8 kernel: Lustre: 25111:0:(service.c:2052:
> ptlrpc_server_handle_request()) @@@ Request took longer than estimated
> (6:11s); client may timeout.  req at ffff88045b3a2850
> x1560271381909940/t0(0) o103->bb228923-4216-cc59-d847-
> 38b543af1ae2 at 10.7.17.123@o2ib:0/0 lens 3584/0 e 0 to 0 dl 1488006853 ref
> 1 fl Interpret:/0/ffffffff rc 0/-1
> Feb 25 02:14:24 10.7.7.8 kernel: LustreError: 25103:0:(service.c:2020:
> ptlrpc_server_handle_request()) Skipped 30 previous similar messages
> Feb 25 02:14:25 10.7.7.8 kernel: LustreError:
> 7526:0:(service.c:2020:ptlrpc_server_handle_request()) @@@ Dropping
> timed-out request from 12345-10.7.17.123 at o2ib: deadline 6:11s ago
> Feb 25 02:14:25 10.7.7.8 kernel:  req at ffff880149b75850
> x1560271381926356/t0(0) o103->bb228923-4216-cc59-d847-
> 38b543af1ae2 at 10.7.17.123@o2ib:0/0 lens 3584/0 e 0 to 0 dl 1488006854 ref
> 1 fl Interpret:/0/ffffffff rc 0/-1
> Feb 25 02:14:25 10.7.7.8 kernel: Lustre: 7528:0:(service.c:2052:ptlrpc_server_handle_request())
> @@@ Request took longer than estimated (6:11s); client may timeout.
> req at ffff880149bb0050 x1560271381926372/t0(0) o103->
> bb228923-4216-cc59-d847-38b543af1ae2 at 10.7.17.123@o2ib:0/0 lens 3584/0 e 0
> to 0 dl 1488006854 ref 1 fl Interpret:/0/ffffffff rc 0/-1
> Feb 25 02:14:25 10.7.7.8 kernel: Lustre: 7528:0:(service.c:2052:ptlrpc_server_handle_request())
> Skipped 1533 previous similar messages
> Feb 25 02:14:25 10.7.7.8 kernel: LustreError:
> 7526:0:(service.c:2020:ptlrpc_server_handle_request()) Skipped 1653
> previous similar messages
> Feb 25 02:14:49 10.7.7.8 kernel: Lustre: naaschpc-MDT0000: Client
> bb228923-4216-cc59-d847-38b543af1ae2 (at 10.7.17.123 at o2ib) reconnecting
> Feb 25 02:14:49 10.7.7.8 kernel: Lustre: naaschpc-MDT0000: Client
> bb228923-4216-cc59-d847-38b543af1ae2 (at 10.7.17.123 at o2ib) refused
> reconnection, still busy with 1 active RPCs
> Feb 25 02:14:49 10.7.7.8 kernel: Lustre: Skipped 1 previous similar message
> Feb 25 02:14:49 10.7.7.8 kernel: LustreError: 9437:0:(ldlm_lib.c:2693:target_bulk_io())
> @@@ bulk PUT failed: rc -107  req at ffff88050290d050
> x1560271382966952/t0(0) o37->bb228923-4216-cc59-d847-
> 38b543af1ae2 at 10.7.17.123@o2ib:0/0 lens 448/440 e 0 to 0 dl 1488006895 ref
> 1 fl Interpret:/0/0 rc 0/0
> Feb 25 02:14:49 10.7.7.8 kernel: Lustre: naaschpc-MDT0000: Client
> bb228923-4216-cc59-d847-38b543af1ae2 (at 10.7.17.123 at o2ib) reconnecting
>
> This problem then escalates... the Lustre "reconnects" in the logs climb to
> 2,818 for March and 4,574 for April!
>
>
>
> Grepping for "active RPCs", excluding dates of known downtimes (Nov 5,
> 2016 | Feb 24, 2017 | Apr 21, 2017). Dates not included are zero. (See the
> grep sketch after this list.)
>
> *2016*
> Nov 22nd: 72
> Dec 14th: 32
>
> *2017*
> Jan 8th: 20
> Jan 28th: 18
> ---
> Feb 25: 2
> ---
> Mar 13: 56
> Mar 14: 236  First user ticket filed about fs non-responsiveness
> Mar 15: 60
> Mar 16: 90
> Mar 17: 160
> Mar 18: 280
> Mar 19: 178
> Mar 20: 280
> Mar 21: 276
> Mar 22: 140
> Mar 23: 102
> Mar 24: 216
> Mar 25: 278
> Mar 26: 142
> Mar 27: 82
> Mar 28: 52
> Mar 29: 78
> Mar 30: 52
> Mar 31: 52
> --
> Apr 1: 112
> Apr 2: 216
> Apr 3: 276
> Apr 4: 278
> Apr 5: 278
> Apr 6: 278
> Apr 7: 158
> Apr 8: 216
> Apr 9: 278
> Apr 10: 118
> Apr 11: 216
> Apr 12: 156
> Apr 13: 78
> Apr 14: 42
> Apr 15: 74
> Apr 16: 216
> Apr 17: 88
> Apr 18: 62
> Apr 19: 198
> Apr 20: 280
> April 21, 2017 - on naasc lustre only: MDT and OSTs unmounted; MDT
> remounted in ordered mode; lfsck oi_scrub run
> Apr 22: 0
> Apr 23: 0
> Apr 24: 0
> Apr 25: 0
> Apr 26: 0
> Apr 27: 0
>
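> A grep along these lines reproduces this kind of per-day count (log path and
> syslog date format assumed):
>
>     # per-day counts of "active RPCs" refusals in the MDS syslog
>     grep 'active RPCs' /var/log/messages | awk '{print $1, $2}' | sort | uniq -c
>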
> On Friday, April 28, I start a file system scan from our 'new' robinhood
> box, akebono, on both production file systems:
>
> 2017/04/28 14:38:35 [130550/1] CheckFS | '/.lustre/naasc' matches mount
> point '/.lustre/naasc', type=lustre, fs=10.7.17.8 at o2ib0:/naaschpc
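>
> For context, the scan and the changelog consumer are started with the
> standard Robinhood modes, roughly as follows (flag names per the Robinhood
> documentation, and the config name 'naasc' is illustrative):
>
>     # one-shot namespace scan of the filesystem described by the naasc config
>     robinhood --scan --once -f naasc
>     # or: consume MDT changelogs continuously instead of scanning
>     robinhood --readlog -f naasc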
>
> This scan does not successfully complete. And the RPC resource problem
> reappears:
> Apr 28: 114      The first one is at 3pm...
>
> Apr 28 15:01:30 10.7.7.8 kernel: Lustre: naaschpc-MDT0000: Client
> cab57f86-2124-884d-49a2-ec2871f36f78 (at 10.7.17.122 at o2ib) refused
> reconnection, still busy with 1 active RPCs
>
> Apr 29: 282
> Apr 30: 284
> ---
> May 1: 186
> May 2: 248
> May 3: 204 (day of emergency MDT restart)
>
> Since May 4: 0    Robinhood has not been restarted against naasc lustre...
> and rsyncs have been running without issues.
>
> *********
> Robinhood scan/changelog consumption also produced an identical situation
> on our other lustre fs:
>
> 2017/04/28 14:53:11 [130926/1] CheckFS | '/.lustre/cv' matches mount point
> '/.lustre/cv', type=lustre, fs=10.7.3.9 at tcp2:/cvlustre
>
> ***** FROM CV-lustre's MDT *****
>
> Apr 28 14:53:24 10.7.7.9 kernel: Lustre: cvlustre-MDT0000: Client
> deb3834b-6b53-5e68-123b-dfaedd352ac6 (at 10.7.17.122 at o2ib) refused
> reconnection, still busy with 1 active RPCs
> Apr 28 14:53:24 10.7.7.9 kernel: Lustre: Skipped 19 previous similar
> messages
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
>