[lustre-discuss] Robinhood exhausting RPC resources against 2.5.5 lustre file systems

Jessica Otey jotey at nrao.edu
Wed May 17 11:04:28 PDT 2017


All,

We have observed an unfortunate interaction between Robinhood and two 
Lustre 2.5.5 file systems (both of which originated as 1.8.9 file systems).

Robinhood ran successfully against these file systems while they were at 
1.8.9, 2.4.3, and then 2.5.3 (a total time span of about 11 months).

We also have a third Lustre file system that originated as 2.4.3 and 
has since been upgraded to 2.5.5, against which Robinhood is currently 
operating as expected. This leads me to suppose that the issue may have 
to do with the interaction between Robinhood and a 
legacy-1.8.x, now-Lustre-2.5.5 system. But I don't know.

The problem manifests itself as follows: either a Robinhood file scan or 
the start of changelog consumption consumes all the available RPC 
resources on the MDT. This in turn leaves the MDT unable to satisfy any 
other requests from clients, which in turn leads to client 
disconnections (the MDT thinks they are dead and evicts them). 
Meanwhile, Robinhood itself is unable to traverse the file system to 
gather the information it seeks, so its scans either hang (due to the 
client disconnect) or run at a rate so slow (less than 1 file per 
second) that they would never complete.
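
In case it helps anyone trying to reproduce this, the MDT-side 
saturation can be watched while a scan runs with something like the 
following (parameter names are from memory for our 2.5.x builds and may 
differ on other versions):

    # on the MDS: metadata service threads in use vs. the configured cap
    lctl get_param mds.MDS.mdt.threads_started mds.MDS.mdt.threads_max

    # number of client exports currently attached to the MDT
    lctl get_param mdt.*.num_exports

    # watch for the eviction pattern quoted in the logs further down
    grep 'active RPCs' /var/log/messages | tail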

If we don't run Robinhood at all, the file system performs as expected 
(after a remount of the MDT).

Initially, we thought that the difficulty might be that we neglected to 
activate the FID-in-dirent feature when we upgraded to 2.4.3. We enabled 
it on one of these systems and ran an lfsck oi_scrub, but that did not 
ameliorate the problem.
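
For the record, the FID-in-dirent attempt on that one file system went 
roughly like this (this is from memory, not a recipe; the device and 
target names are simply the ones that appear in our logs below):

    # with the MDT unmounted: enable the ldiskfs dirdata feature
    tune2fs -O dirdata /dev/md127

    # after remounting the MDT: start an OI scrub and watch its progress
    # (check 'lctl lfsck_start --help' on your build for the exact type name)
    lctl lfsck_start -M naaschpc-MDT0000 -t scrub
    lctl get_param -n osd-ldiskfs.naaschpc-MDT0000.oi_scrub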

Any thoughts on this matter would be appreciated. (We miss using Robinhood!)

Thanks,

Jessica

************************************************************

More data for those who cannot help themselves:

April 2016 - Robinhood comes into production use against both our 1.8.9 
file systems.

July 2016 - Upgrade to 2.4.3 (on both production lustre file systems) -- 
Robinhood rebuilt against 2.4.3 client; changelog consumption now 
included (setup sketched after the reconnect counts below).

         Lustre "reconnects" (from /var/log/messages on one of the MDTs):

         July 2016: 4

         Aug 2016: 20

         Sept 2016: 8

         Oct 2016: 8
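
Regarding the changelog consumption added at the July 2016 upgrade, the 
per-MDT setup was along these lines (as I recall it; the reader id 
printed will be something like cl1):

    # register a changelog consumer on the MDT
    lctl --device naaschpc-MDT0000 changelog_register

    # confirm the registered readers and how far they have consumed
    lctl get_param mdd.naaschpc-MDT0000.changelog_users

Robinhood then consumes the records in its changelog-reader mode 
(--readlog in the versions we have been running, if I remember the flag 
correctly).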

Nov 4-6, 2016 - Upgrade to 2.5.3 (on both production lustre file 
systems) -- Robinhood rebuilt against 2.5.3 client.

         Lustre "reconnects":

Nov. 2016: 180

Dec. 2016: 62

Jan. 2017: 96

Feb 1-24, 2017: 2

Feb 24, 2017 - Upgrade to 2.5.5 (on both production lustre file systems)

**** NAASC-Lustre MDT coming back ****

Feb 24 20:46:44 10.7.7.8 kernel: Lustre: Lustre: Build Version: 2.5.5-g22a210f-CHANGED-2.6.32-642.6.2.el6_lustre.2.5.5.x86_64
Feb 24 20:46:44 10.7.7.8 kernel: LNet: Added LNI 10.7.17.8@o2ib [8/256/0/180]
Feb 24 20:46:45 10.7.7.8 kernel: LDISKFS-fs (md127): mounted filesystem with ordered data mode. quota=off. Opts:
Feb 24 20:46:46 10.7.7.8 kernel: Lustre: MGC10.7.17.8@o2ib: Connection restored to MGS (at 0@lo)
Feb 24 20:46:47 10.7.7.8 kernel: Lustre: naaschpc-MDT0000: used disk, loading

The night after this upgrade, a regular rsync to the backup Lustre 
system provokes a failure/client disconnect. (Unfortunately, I don't 
have the logs to look at Robinhood activity from this time, but I 
believe I restarted the service after the system came back.)

Feb 25 02:14:24 10.7.7.8 kernel: LustreError: 25103:0:(service.c:2020:ptlrpc_server_handle_request()) @@@ Dropping timed-out request from 12345-10.7.17.123@o2ib: deadline 6:11s ago
Feb 25 02:14:24 10.7.7.8 kernel:  req@ffff88045b3a2050 x1560271381909936/t0(0) o103->bb228923-4216-cc59-d847-38b543af1ae2@10.7.17.123@o2ib:0/0 lens 3584/0 e 0 to 0 dl 1488006853 ref 1 fl Interpret:/0/ffffffff rc 0/-1
Feb 25 02:14:24 10.7.7.8 kernel: Lustre: 25111:0:(service.c:2052:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (6:11s); client may timeout. req@ffff88045b3a2850 x1560271381909940/t0(0) o103->bb228923-4216-cc59-d847-38b543af1ae2@10.7.17.123@o2ib:0/0 lens 3584/0 e 0 to 0 dl 1488006853 ref 1 fl Interpret:/0/ffffffff rc 0/-1
Feb 25 02:14:24 10.7.7.8 kernel: LustreError: 25103:0:(service.c:2020:ptlrpc_server_handle_request()) Skipped 30 previous similar messages
Feb 25 02:14:25 10.7.7.8 kernel: LustreError: 7526:0:(service.c:2020:ptlrpc_server_handle_request()) @@@ Dropping timed-out request from 12345-10.7.17.123@o2ib: deadline 6:11s ago
Feb 25 02:14:25 10.7.7.8 kernel:  req@ffff880149b75850 x1560271381926356/t0(0) o103->bb228923-4216-cc59-d847-38b543af1ae2@10.7.17.123@o2ib:0/0 lens 3584/0 e 0 to 0 dl 1488006854 ref 1 fl Interpret:/0/ffffffff rc 0/-1
Feb 25 02:14:25 10.7.7.8 kernel: Lustre: 7528:0:(service.c:2052:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (6:11s); client may timeout. req@ffff880149bb0050 x1560271381926372/t0(0) o103->bb228923-4216-cc59-d847-38b543af1ae2@10.7.17.123@o2ib:0/0 lens 3584/0 e 0 to 0 dl 1488006854 ref 1 fl Interpret:/0/ffffffff rc 0/-1
Feb 25 02:14:25 10.7.7.8 kernel: Lustre: 7528:0:(service.c:2052:ptlrpc_server_handle_request()) Skipped 1533 previous similar messages
Feb 25 02:14:25 10.7.7.8 kernel: LustreError: 7526:0:(service.c:2020:ptlrpc_server_handle_request()) Skipped 1653 previous similar messages
Feb 25 02:14:49 10.7.7.8 kernel: Lustre: naaschpc-MDT0000: Client bb228923-4216-cc59-d847-38b543af1ae2 (at 10.7.17.123@o2ib) reconnecting
Feb 25 02:14:49 10.7.7.8 kernel: Lustre: naaschpc-MDT0000: Client bb228923-4216-cc59-d847-38b543af1ae2 (at 10.7.17.123@o2ib) refused reconnection, still busy with 1 active RPCs
Feb 25 02:14:49 10.7.7.8 kernel: Lustre: Skipped 1 previous similar message
Feb 25 02:14:49 10.7.7.8 kernel: LustreError: 9437:0:(ldlm_lib.c:2693:target_bulk_io()) @@@ bulk PUT failed: rc -107  req@ffff88050290d050 x1560271382966952/t0(0) o37->bb228923-4216-cc59-d847-38b543af1ae2@10.7.17.123@o2ib:0/0 lens 448/440 e 0 to 0 dl 1488006895 ref 1 fl Interpret:/0/0 rc 0/0
Feb 25 02:15:14 10.7.7.8 kernel: Lustre: naaschpc-MDT0000: Client bb228923-4216-cc59-d847-38b543af1ae2 (at 10.7.17.123@o2ib) reconnecting

This problem then escalates: the Lustre "reconnects" in the logs total 
2,818 for March and 4,574 for April!

Grepping for "active RPCs" (roughly as sketched below), excluding dates 
of known downtimes (Nov 5, 2016 | Feb 24, 2017 | Apr 21, 2017); dates 
not listed are zero.
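
The counting itself is nothing fancy; roughly:

    # per-day counts of "refused reconnection ... active RPCs" on the MDS
    grep -h 'active RPCs' /var/log/messages* | awk '{print $1, $2}' | sort | uniq -c

    # the monthly "reconnects" totals above came from the same logs,
    # grepping for 'reconnecting' instead
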

*2016*
Nov 22nd: 72
Dec 14th: 32

*2017*
Jan 8th: 20
Jan 28th: 18
---
Feb 25: 2
---
Mar 13: 56
Mar 14: 236  First user ticket filed about fs non-responsiveness
Mar 15: 60
Mar 16: 90
Mar 17: 160
Mar 18: 280
Mar 19: 178
Mar 20: 280
Mar 21: 276
Mar 22: 140
Mar 23: 102
Mar 24: 216
Mar 25: 278
Mar 26: 142
Mar 27: 82
Mar 28: 52
Mar 29: 78
Mar 30: 52
Mar 31: 52
--
Apr 1: 112
Apr 2: 216
Apr 3: 276
Apr 4: 278
Apr 5: 278
Apr 6: 278
Apr 7: 158
Apr 8: 216
Apr 9: 278
Apr 10: 118
Apr 11: 216
Apr 12: 156
Apr 13: 78
Apr 14: 42
Apr 15: 74
Apr 16: 216
Apr 17: 88
Apr 18: 62
Apr 19: 198
Apr 20: 280
April 21, 2017 - on naasc lustre only: MDT and OSTs unmounted; MDT 
remounted in ordered mode; lfsck oi_scrub run
Apr 22: 0
Apr 23: 0
Apr 24: 0
Apr 25: 0
Apr 26: 0
Apr 27: 0

On Friday, April 28, I start a file system scan from our 'new' robinhood 
box, akebono, on both production file systems:

2017/04/28 14:38:35 [130550/1] CheckFS | '/.lustre/naasc' matches mount point '/.lustre/naasc', type=lustre, fs=10.7.17.8@o2ib0:/naaschpc

This scan does not complete successfully, and the RPC resource problem 
reappears:

Apr 28: 114      (the first occurrence is at 3 p.m.)

Apr 28 15:01:30 10.7.7.8 kernel: Lustre: naaschpc-MDT0000: Client cab57f86-2124-884d-49a2-ec2871f36f78 (at 10.7.17.122@o2ib) refused reconnection, still busy with 1 active RPCs

Apr 29: 282
Apr 30: 284
---
May 1: 186
May 2: 248
May 3: 204 (day of emergency MDT restart)

Since May 4: 0    (Robinhood has not been restarted against naasc 
lustre, and rsyncs have been running without issues.)
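
For anyone in a similar position, one obvious thing to try before 
re-enabling Robinhood would be throttling its parallelism so a scan 
cannot flood the MDT with concurrent requests. Something like the 
following in the robinhood configuration is what I have in mind 
(parameter names are from the 2.5.x config template as I remember it; 
please check them against your installed version):

    FS_Scan
    {
        # keep the number of parallel scan threads low while testing
        nb_threads_scan = 2;
    }

    EntryProcessor
    {
        # fewer pipeline threads means fewer concurrent MDT requests
        nb_threads = 4;
    }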

*********
Robinhood scan/changelog consumption also produced an identical 
situation on our other lustre fs:

2017/04/28 14:53:11 [130926/1] CheckFS | '/.lustre/cv' matches mount point '/.lustre/cv', type=lustre, fs=10.7.3.9@tcp2:/cvlustre

***** FROM CV-lustre's MDT *****

Apr 28 14:53:24 10.7.7.9 kernel: Lustre: cvlustre-MDT0000: Client deb3834b-6b53-5e68-123b-dfaedd352ac6 (at 10.7.17.122@o2ib) refused reconnection, still busy with 1 active RPCs
Apr 28 14:53:24 10.7.7.9 kernel: Lustre: Skipped 19 previous similar messages
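
One housekeeping note while Robinhood is stopped on both file systems: 
any changelog readers that remain registered keep unconsumed records 
accumulating on the MDTs, so they are worth keeping an eye on, and a 
reader that will not be resumed can be deregistered so the records can 
be purged (the device name and reader id below are just examples):

    # list registered changelog consumers and their positions
    lctl get_param mdd.*.changelog_users

    # drop a consumer that will not be resumed
    lctl --device cvlustre-MDT0000 changelog_deregister cl1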