[Lustre-discuss] Slow read performance across OSSes
James Robnett
jrobnett at aoc.nrao.edu
Sat Oct 17 17:58:33 PDT 2009
Many thanks for the reply Andreas.
> You're sure that there isn't some other strange effect here, like you
> are only measuring the speed of a single iozone thread or similar?
I'm just looking at the output from Bonnie++ running on the client.
I see corresponding numbers when examining iostat on each OST. The sum
of all iostats from each OST in use matches the bonnie++ numbers.
Can Bonnie be at fault? I've only been setting the test size. I'll try
iozone to see if it returns similar results.
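For reference, a cross-check run might look something like this (the mount point and file size are placeholders; the size should be at least twice client RAM to defeat caching):

```shell
# Hypothetical mount point and size; adjust for the real client.
MNT=/mnt/lustre
SIZE=16g

# iozone: -i 0 = sequential write test, -i 1 = sequential read test;
# -r 1m matches Lustre's 1 MB RPC size; -e includes flush time.
echo "iozone -e -i 0 -i 1 -s $SIZE -r 1m -f $MNT/iozone.tmp"
```

Using a 1 MB record size keeps the comparison to bonnie++ apples-to-apples with the 256-page bulk RPCs seen in brw_stats.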
> This is definitely NOT expected, and I'm puzzled as to why this might
> be.
Considering how 'stock' this should be, i.e. RHEL 5.3 with Sun-provided
RPMs, I must be doing something wrong or more folks would see it, but I'm
dipped if I know what it is. Everything works, no errors; it's just slow
across multiple OSSes.
> You could check /proc/fs/lustre/obdfilter/*/brw_stats on the
> respective OSTs
> to see if the client is not assembling the RPCs very well for some
> reason.
I ran two instances of bonnie++, the first used OST0000 and OST0001
on OSS1, the second used OST0001 on OSS1 and OST0002 on OSS2. I rebooted
between each run to reset the stats.
The contents of /proc/fs/lustre/obdfilter/lustre-OST0001/brw_stats
look essentially identical in both runs even though the read rate in the
first was 114MB/s and in the second 38MB/s. I've appended the read portion
of both files below.
Not sure exactly what I should be looking for in those stats. I'm also
curious how it could be the OST at fault, since two OSTs on one OSS give
the expected ~115MB/s read rate while two OSTs on two OSSes give ~40MB/s.
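As a quick illustration of reading those stats (not part of the original exchange), the "pages per bulk r/w" section can be summarized with awk; the two sample lines here are the dominant rows from the 40MB/s run quoted below:

```shell
# Report what fraction of read RPCs arrived as full 256-page (1 MB)
# bulk transfers; input mimics brw_stats "pages per bulk r/w" rows.
stats='1: 5003 17 17
256: 24145 82 100'
echo "$stats" | awk '
  NF == 4 { total += $2; if ($1 == "256:") full = $2 }
  END { printf "%.0f%% full-size RPCs\n", 100 * full / total }'
```

In both runs the vast majority of RPCs are full 1 MB transfers, which is why the brw_stats look healthy despite the throughput difference.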
> Alternately, it might be that you have configured the disk storage of
> OSS-1
> and OSS-2 to compete (e.g. different partitions sharing the same disks).
Each OSS has two internal PCI 8 port 3ware 9550sx cards and 16 internal
disks carved into two separate 7+1 RAID 5 groups (one per card). They're
physically distinct where disk storage is concerned.
> No, the client needs to assemble the OST objects itself, regardless of
> whether the OSTs are on the same OSS or not. The file should be striped
> over all of the OSTs involved in the test.
Iostat on each OST confirms the striping: as I change the striping,
reads appear on exactly the OSTs I'd expect. OSTs not in use are
quiescent; OSTs in use show uniform read rates between them, and those
rates are relatively constant per second. No starvation is apparent.
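For the record, pinning a test file's layout to specific OSTs can be sketched with lfs setstripe (file names here are hypothetical; -i only fixes the first OST index, and with default round-robin allocation the next stripe lands on the following OST):

```shell
MNT=/mnt/lustre   # hypothetical mount point
# -c sets the stripe count, -i the index of the first OST.
echo "lfs setstripe -c 2 -i 0 $MNT/same-oss.dat"    # OST0000 + OST0001, one OSS
echo "lfs setstripe -c 2 -i 1 $MNT/cross-oss.dat"   # OST0001 + OST0002, two OSSes
echo "lfs getstripe $MNT/cross-oss.dat"             # verify the resulting layout
```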
It sure seems like an issue with the client being unable to handle
multiple streams from multiple OSSes, while it handles multiple streams
from a single OSS just fine.
I've tried to think of some way the switch could be at fault but
haven't come up with anything. It's a Cisco 2960 gigabit switch, and
while it can block, it shouldn't be in this case. I have no problem
obtaining 115MB/s reads and writes as long as I avoid reading across
two OSSes.
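One way to rule the network in or out would be raw TCP tests with iperf, first one OSS at a time and then both concurrently, to see whether two simultaneous senders to one client collapse (hostnames are hypothetical; run "iperf -s" on each OSS first):

```shell
# One stream at a time: each should be close to GigE line rate.
for oss in oss1 oss2; do
  echo "iperf -c $oss -t 10"
done
# Both streams at once: if the aggregate drops well below line
# rate, the switch or client NIC is suspect rather than Lustre.
echo "iperf -c oss1 -t 10 & iperf -c oss2 -t 10 & wait"
```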
Again, many thanks for the reply. If nothing else, knowing it really
is wrong will make me keep digging. If you can think of any output I
could show or any test I could run to help isolate the problem, I'm
all ears.
James Robnett
NRAO/NM
Below is the read portion of brw_stats for OST0001 from the 40MB/s
run (left) and 115MB/s run (right), I removed the write portion for
clarity.
read (40MB/s) | read (115MB/s)
pages per bulk r/w rpcs % cum % | rpcs % cum %
1: 5003 17 17 | 5256 18 18
2: 13 0 17 | 23 0 18
4: 11 0 17 | 1 0 18
8: 19 0 17 | 1 0 18
16: 14 0 17 | 11 0 18
32: 53 0 17 | 18 0 18
64: 47 0 17 | 11 0 18
128: 74 0 17 | 35 0 18
256: 24145 82 100 | 23415 81 100
read | read
discontiguous pages rpcs % cum % | rpcs % cum %
0: 29261 99 99 | 28735 99 99
1: 61 0 99 | 34 0 99
2: 18 0 99 | 2 0 100
3: 15 0 99 | 0 0 100
4: 9 0 99 | 0 0 100
5: 7 0 99 | 0 0 100
6: 4 0 99 | 0 0 100
7: 3 0 99 | 0 0 100
8: 0 0 99 | 0 0 100
9: 1 0 100 | 0 0 100
10: 0 0 100 | 0 0 100
11: 0 0 100 | 0 0 100
12: 0 0 100 | 0 0 100
13: 0 0 100 |
read | read
discontiguous blocks rpcs % cum % | rpcs % cum %
0: 29261 99 99 | 28735 99 99
1: 61 0 99 | 34 0 99
2: 18 0 99 | 2 0 100
3: 15 0 99 | 0 0 100
4: 9 0 99 | 0 0 100
5: 7 0 99 | 0 0 100
6: 4 0 99 | 0 0 100
7: 3 0 99 | 0 0 100
8: 0 0 99 | 0 0 100
9: 1 0 100 | 0 0 100
10: 0 0 100 | 0 0 100
11: 0 0 100 | 0 0 100
12: 0 0 100 | 0 0 100
13: 0 0 100 |
read | read
disk fragmented I/Os ios % cum % | ios % cum %
0: 1 0 0 | 5308 18 18
1: 5084 17 17 | 12 0 18
2: 44 0 17 | 18 0 18
3: 46 0 17 | 17 0 18
4: 38 0 17 | 10 0 18
5: 31 0 17 | 20 0 18
6: 30 0 17 | 12 0 18
7: 29 0 18 | 23353 81 99
8: 24034 81 99 | 21 0 100
9: 27 0 99 | 0 0 100
10: 8 0 99 | 0 0 100
11: 3 0 99 | 0 0 100
12: 3 0 99 | 0 0 100
13: 0 0 99 |
14: 1 0 100 |
read | read
disk I/Os in flight ios % cum % | ios % cum %
1: 15990 8 8 | 14821 7 7
2: 16817 8 16 | 16105 8 16
3: 15968 8 24 | 14930 7 23
4: 15761 7 32 | 14260 7 31
5: 16390 8 40 | 14644 7 38
6: 17131 8 49 | 15039 7 46
7: 17786 8 58 | 15383 7 54
8: 18551 9 67 | 15887 8 62
9: 7313 3 71 | 7218 3 66
10: 7100 3 74 | 7006 3 70
11: 6755 3 78 | 6824 3 73
12: 6416 3 81 | 6738 3 77
13: 5931 2 84 | 6438 3 80
14: 5386 2 87 | 6209 3 83
15: 4831 2 89 | 5983 3 86
16: 4287 2 91 | 5540 2 89
17: 2146 1 92 | 2314 1 90
18: 1928 0 93 | 2213 1 92
19: 1703 0 94 | 2046 1 93
20: 1531 0 95 | 1911 0 94
21: 1376 0 96 | 1772 0 95
22: 1202 0 96 | 1602 0 95
23: 1011 0 97 | 1398 0 96
24: 749 0 97 | 1190 0 97
25: 435 0 97 | 640 0 97
26: 383 0 98 | 584 0 97
27: 358 0 98 | 526 0 98
28: 328 0 98 | 477 0 98
29: 298 0 98 | 434 0 98
30: 258 0 98 | 365 0 98
31: 2559 1 100 | 2224 1 100
read | read
I/O time (1/1000s) ios % cum % | ios % cum %
1: 1079 3 3 | 339 1 1
2: 5565 18 22 | 3228 11 12
4: 5672 19 41 | 6847 23 36
8: 2649 9 50 | 4393 15 51
16: 5967 20 71 | 8461 29 80
32: 7243 24 95 | 4243 14 95
64: 1073 3 99 | 1176 4 99
128: 126 0 99 | 84 0 100
256: 5 0 100 | 0 0 100
512: 0 0 100 | 0 0 100
read | read
disk I/O size ios % cum % | ios % cum %
4K: 5147 2 2 | 5263 2 2
8K: 94 0 2 | 28 0 2
16K: 18 0 2 | 11 0 2
32K: 45 0 2 | 20 0 2
64K: 98 0 2 | 48 0 2
128K: 193276 97 100 | 187351 97 100