[Lustre-discuss] Plateau around 200MiB/s bond0
Arden Wiebe
albert682 at yahoo.com
Wed Jan 28 17:24:30 PST 2009
I ran this on my 6GigE Bond0 MGS Client. I had to go back and cd to the mounted lustre directory and filesystem.
[root at lustreone ~]# cd /mnt/ioio
[root at lustreone ioio]# iozone -t1 -i0 -il -r4m -s2g
Record Size 4096 KB
File size set to 2097152 KB
Command line used: iozone -t1 -i0 -il -r4m -s2g
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
Throughput test with 1 process
Each process writes a 2097152 Kbyte file in 4096 Kbyte records
Children see throughput for 1 initial writers = 106916.81 KB/sec
Parent sees throughput for 1 initial writers = 105244.22 KB/sec
Min throughput per process = 106916.81 KB/sec
Max throughput per process = 106916.81 KB/sec
Avg throughput per process = 106916.81 KB/sec
Min xfer = 2097152.00 KB
Children see throughput for 1 rewriters = 106882.15 KB/sec
Parent sees throughput for 1 rewriters = 105215.34 KB/sec
Min throughput per process = 106882.15 KB/sec
Max throughput per process = 106882.15 KB/sec
Avg throughput per process = 106882.15 KB/sec
Min xfer = 2097152.00 KB
I ran this to match the physical ram in the MGS Client.
[root at lustreone ioio]# iozone -t1 -i0 -il -r4m -s8g
Run began: Wed Jan 28 17:33:53 2009
Record Size 4096 KB
File size set to 8388608 KB
Command line used: iozone -t1 -i0 -il -r4m -s8g
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
Throughput test with 1 process
Each process writes a 8388608 Kbyte file in 4096 Kbyte records
Children see throughput for 1 initial writers = 100817.04 KB/sec
Parent sees throughput for 1 initial writers = 100420.04 KB/sec
Min throughput per process = 100817.04 KB/sec
Max throughput per process = 100817.04 KB/sec
Avg throughput per process = 100817.04 KB/sec
Min xfer = 8388608.00 KB
Children see throughput for 1 rewriters = 100884.15 KB/sec
Parent sees throughput for 1 rewriters = 100487.30 KB/sec
Min throughput per process = 100884.15 KB/sec
Max throughput per process = 100884.15 KB/sec
Avg throughput per process = 100884.15 KB/sec
Min xfer = 8388608.00 KB
Then I ran this to match both my processor count and my physical RAM by increasing -t1 to -t4. A subsequent test with -t6 proved redundant.
[root at lustreone ioio]# iozone -t4 -i0 -il -r4m -s8g
Run began: Wed Jan 28 17:37:33 2009
Record Size 4096 KB
File size set to 8388608 KB
Command line used: iozone -t4 -i0 -il -r4m -s8g
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
Throughput test with 4 processes
Each process writes a 8388608 Kbyte file in 4096 Kbyte records
Children see throughput for 4 initial writers = 206173.77 KB/sec
Parent sees throughput for 4 initial writers = 191062.04 KB/sec
Min throughput per process = 48302.41 KB/sec
Max throughput per process = 54266.61 KB/sec
Avg throughput per process = 51543.44 KB/sec
Min xfer = 7467008.00 KB
Children see throughput for 4 rewriters = 206216.61 KB/sec
Parent sees throughput for 4 rewriters = 205358.90 KB/sec
Min throughput per process = 50336.13 KB/sec
Max throughput per process = 53059.13 KB/sec
Avg throughput per process = 51554.15 KB/sec
Min xfer = 7958528.00 KB
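To compare these iozone figures with the MiB/s numbers quoted elsewhere in the thread, the KB/sec values can be converted. This is only a rough sketch: it assumes iozone's "Kbytes" are 1024-byte units and takes 10^9 bit/s as the raw rate of one GigE link, ignoring protocol overhead. The function names are made up for illustration.

```python
KIB = 1024
MIB = 1024 * 1024

def kb_per_sec_to_mib(kb_per_sec):
    """Convert an iozone KB/sec figure to MiB/s (assumes 1 KB = 1024 bytes)."""
    return kb_per_sec * KIB / MIB

def gige_links_used(kb_per_sec, link_bits_per_sec=1_000_000_000):
    """Express a throughput as a multiple of one GigE link's raw bit rate."""
    return kb_per_sec * KIB * 8 / link_bits_per_sec

# "Children see throughput" values for initial writers, copied from the runs above.
results = {
    "-t1 -s2g": 106916.81,
    "-t1 -s8g": 100817.04,
    "-t4 -s8g": 206173.77,
}

for label, kb in results.items():
    print(f"{label}: {kb_per_sec_to_mib(kb):.1f} MiB/s, "
          f"{gige_links_used(kb):.2f} x one GigE link")
```

Under these assumptions the single-process runs sit just below the raw rate of one link and the four-process run just below two links, which would be consistent with the roughly 200MiB/s plateau in the subject line.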
Screenshots at http://ioio.ca/iozone/MGSClient/images.html clearly show a large jump in smooth, stable network activity from the 200MiB/s range to the 400MiB/s range.
If one were to have more processors, would that increase maximum throughput? Does the number of GigE interfaces scale with the number of processors? Given a 6GigE bond0, can I test in any other way to get past the 412MiB/s plateau? How do I best interpret the above results?
--- On Wed, 1/28/09, Jeremy Mann <jeremy at biochem.uthscsa.edu> wrote:
From: Jeremy Mann <jeremy at biochem.uthscsa.edu>
Subject: Re: [Lustre-discuss] Plateau around 200MiB/s bond0
To: "Arden Wiebe" <albert682 at yahoo.com>
Cc: "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
Date: Wednesday, January 28, 2009, 1:56 PM
Arden, we also use dual channel gigE (bond0) and in my tests found that
this works best:
options bonding miimon=100 mode=802.3ad xmit_hash_policy=layer3+4
This allows us to get roughly 250 MB/s transfers. Here is the iozone
command I used:
iozone -t1 -i0 -il -r4m -s2g
You will not get any more performance unless you move to InfiniBand or
another interconnect.
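The difference between the hash policies can be sketched from the formulas in the Linux bonding documentation: layer2 is described as (source MAC XOR destination MAC) modulo slave count, and layer3+4 as ((source port XOR dest port) XOR ((source IP XOR dest IP) AND 0xffff)) modulo slave count. The code below is a simplified illustration of those formulas, not the driver's code. The peer MAC, the peer IP 192.168.0.9 and the source ports are hypothetical, and port 988 is assumed to be the Lustre acceptor port.

```python
import ipaddress

def mac_to_int(mac):
    """Turn 'aa:bb:cc:dd:ee:ff' into an integer."""
    return int(mac.replace(":", ""), 16)

def layer2_slave(src_mac, dst_mac, slave_count):
    """layer2 policy: depends only on the two MAC addresses."""
    return (mac_to_int(src_mac) ^ mac_to_int(dst_mac)) % slave_count

def layer34_slave(src_ip, dst_ip, src_port, dst_port, slave_count):
    """layer3+4 policy: depends on IP addresses and TCP/UDP ports."""
    ip_xor = int(ipaddress.IPv4Address(src_ip)) ^ int(ipaddress.IPv4Address(dst_ip))
    return ((src_port ^ dst_port) ^ (ip_xor & 0xFFFF)) % slave_count

SLAVES = 6
SRC_MAC = "00:1b:21:28:77:db"   # bond0 HWaddr from the thread
DST_MAC = "00:22:15:06:3a:94"   # hypothetical peer MAC
SRC_IP = "192.168.0.7"          # MGS client address from the thread
DST_IP = "192.168.0.9"          # hypothetical OSS address
DST_PORT = 988                  # assumed Lustre acceptor port
SRC_PORTS = [1021, 1022, 1023]  # hypothetical source ports, one per connection

# layer2 has no per-connection input, so every flow to this peer uses one slave.
l2 = layer2_slave(SRC_MAC, DST_MAC, SLAVES)
l34 = [layer34_slave(SRC_IP, DST_IP, p, DST_PORT, SLAVES) for p in SRC_PORTS]

print("layer2 slave index for every connection:", l2)
print("layer3+4 slave index per connection:", l34)
```

With layer2, all traffic between one pair of hosts hashes to the same slave, which is the point Jeff makes in the quoted message below about a single client talking to a single OST. With layer3+4, separate TCP connections between the same two hosts can land on different slaves.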
Jeffrey Alan Bennett wrote:
> Hi Arden,
>
> Are you obtaining more than 100 MB/sec from one client to one OST? Given
> that you are using 802.3ad link aggregation, it will determine the
> physical NIC by the other party's MAC address. So having multiple OST and
> multiple clients will improve the chances of using more than one NIC of
> the bonding.
>
> What is the maximum performance you obtain on the client with two 1GbE?
>
> jeff
>
>
>
>
> ________________________________
> From: lustre-discuss-bounces at lists.lustre.org
> [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of Arden Wiebe
> Sent: Sunday, January 25, 2009 12:08 AM
> To: lustre-discuss at lists.lustre.org
> Subject: Re: [Lustre-discuss] Plateau around 200MiB/s bond0
>
> So if one OST gets 200MiB/s and another OST gets 200MiB/s, does that make
> 400 MiB/s, or is that not how to calculate throughput? I will eventually
> plug the right sequence into iozone to measure it.
>
> From my perspective it looks like ioio.ca/ioio.jpg ioio.ca/lustreone.png
> ioio.ca/lustretwo.png ioio.ca/lustrethree.png ioio.ca/lustrefour.png
>
> --- On Sat, 1/24/09, Arden Wiebe <albert682 at yahoo.com> wrote:
>
> From: Arden Wiebe <albert682 at yahoo.com>
> Subject: [Lustre-discuss] Plateau around 200MiB/s bond0
> To: lustre-discuss at lists.lustre.org
> Date: Saturday, January 24, 2009, 6:04 PM
>
> 1-2948-SFP Plus Baseline 3Com Switch
> 1-MGS bond0(eth0,eth1,eth2,eth3,eth4,eth5) raid1
> 1-MDT bond0(eth0,eth1,eth2,eth3,eth4,eth5) raid1
> 2-OSS bond0(eth0,eth1,eth2,eth3,eth4,eth5) raid6
> 1-MGS-CLIENT bond0(eth0,eth1,eth2,eth3,eth4,eth5)
> 1-CLIENT bond0(eth0,eth1)
> 1-CLIENT eth0
> 1-CLIENT eth0
>
> So far I have failed to create an external journal for the MDT, MGS and both
> OSSs. How do I add the external journal to /etc/fstab, specifically the
> output of e2label /dev/sdb, and with what fstab options?
>
> [root at lustreone ~]# cat /proc/fs/lustre/devices
> 0 UP mgs MGS MGS 17
> 1 UP mgc MGC192.168.0.7 at tcp 876c20af-aaec-1da0-5486-1fc61ec8cd15 5
> 2 UP lov ioio-clilov-ffff810209363c00
> 7307490a-4a12-4e8c-56ea-448e030a82e4 4
> 3 UP mdc ioio-MDT0000-mdc-ffff810209363c00
> 7307490a-4a12-4e8c-56ea-448e030a82e4 5
> 4 UP osc ioio-OST0000-osc-ffff810209363c00
> 7307490a-4a12-4e8c-56ea-448e030a82e4 5
> 5 UP osc ioio-OST0001-osc-ffff810209363c00
> 7307490a-4a12-4e8c-56ea-448e030a82e4 5
> [root at lustreone ~]# lfs df -h
> UUID bytes Used Available Use% Mounted on
> ioio-MDT0000_UUID 815.0G 534.0M 767.9G 0% /mnt/ioio[MDT:0]
> ioio-OST0000_UUID 3.6T 28.4G 3.4T 0% /mnt/ioio[OST:0]
> ioio-OST0001_UUID 3.6T 18.0G 3.4T 0% /mnt/ioio[OST:1]
>
> filesystem summary: 7.2T 46.4G 6.8T 0% /mnt/ioio
>
> [root at lustreone ~]# cat /proc/net/bonding/bond0
> Ethernet Channel Bonding Driver: v3.2.4 (January 28, 2008)
>
> Bonding Mode: IEEE 802.3ad Dynamic link aggregation
> Transmit Hash Policy: layer2 (0)
> MII Status: up
> MII Polling Interval (ms): 100
> Up Delay (ms): 0
> Down Delay (ms): 0
>
> 802.3ad info
> LACP rate: slow
> Active Aggregator Info:
> Aggregator ID: 1
> Number of ports: 1
> Actor Key: 17
> Partner Key: 1
> Partner Mac Address: 00:00:00:00:00:00
>
> Slave Interface: eth0
> MII Status: up
> Link Failure Count: 1
> Permanent HW addr: 00:1b:21:28:77:db
> Aggregator ID: 1
>
> Slave Interface: eth1
> MII Status: up
> Link Failure Count: 1
> Permanent HW addr: 00:1b:21:28:77:6c
> Aggregator ID: 2
>
> Slave Interface: eth3
> MII Status: up
> Link Failure Count: 0
> Permanent HW addr: 00:22:15:06:3a:94
> Aggregator ID: 3
>
> Slave Interface: eth2
> MII Status: up
> Link Failure Count: 0
> Permanent HW addr: 00:22:15:06:3a:93
> Aggregator ID: 4
>
> Slave Interface: eth4
> MII Status: up
> Link Failure Count: 0
> Permanent HW addr: 00:22:15:06:3a:95
> Aggregator ID: 5
>
> Slave Interface: eth5
> MII Status: up
> Link Failure Count: 0
> Permanent HW addr: 00:22:15:06:3a:96
> Aggregator ID: 6
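One way to read the bonding status above is to compare the port count of the active aggregator with the number of slaves. A rough parser is sketched below. It assumes the /proc/net/bonding/bond0 layout shown in this thread, and the sample text is a trimmed copy of that output. The interpretation in the last message is my reading of 802.3ad behaviour, not something confirmed in the thread.

```python
import re

# Trimmed copy of the /proc/net/bonding/bond0 output quoted above.
SAMPLE = """\
802.3ad info
LACP rate: slow
Active Aggregator Info:
    Aggregator ID: 1
    Number of ports: 1
    Partner Mac Address: 00:00:00:00:00:00

Slave Interface: eth0
Aggregator ID: 1

Slave Interface: eth1
Aggregator ID: 2

Slave Interface: eth3
Aggregator ID: 3

Slave Interface: eth2
Aggregator ID: 4

Slave Interface: eth4
Aggregator ID: 5

Slave Interface: eth5
Aggregator ID: 6
"""

def summarize_bonding(text):
    """Return (ports in active aggregator, {slave name: aggregator id})."""
    active_ports = None
    slave_aggs = {}
    current_slave = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"Number of ports:\s*(\d+)", line)
        if m:
            active_ports = int(m.group(1))
        m = re.match(r"Slave Interface:\s*(\S+)", line)
        if m:
            current_slave = m.group(1)
        m = re.match(r"Aggregator ID:\s*(\d+)", line)
        # Only record aggregator IDs that appear inside a slave section.
        if m and current_slave is not None:
            slave_aggs[current_slave] = int(m.group(1))
    return active_ports, slave_aggs

active_ports, slave_aggs = summarize_bonding(SAMPLE)
distinct_ids = len(set(slave_aggs.values()))
print(f"ports in active aggregator: {active_ports}")
print(f"slaves: {len(slave_aggs)}, distinct aggregator IDs: {distinct_ids}")
if active_ports is not None and active_ports < len(slave_aggs):
    print("only part of the bond is aggregated; check LACP on the switch ports")
```

In a bond where LACP has negotiated with the switch I would expect the slaves to share one aggregator ID and "Number of ports" to equal the slave count. Here there is one port, six different aggregator IDs and an all-zero partner MAC, which may mean the switch ports are not set up for LACP.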
> [root at lustreone ~]# cat /proc/mdstat
> Personalities : [raid1]
> md0 : active raid1 sdb[0] sdc[1]
> 976762496 blocks [2/2] [UU]
>
> unused devices: <none>
> [root at lustreone ~]# cat /etc/fstab
> LABEL=/                 /          ext3    defaults                 1 1
> tmpfs                   /dev/shm   tmpfs   defaults                 0 0
> devpts                  /dev/pts   devpts  gid=5,mode=620           0 0
> sysfs                   /sys       sysfs   defaults                 0 0
> proc                    /proc      proc    defaults                 0 0
> LABEL=MGS               /mnt/mgs   lustre  defaults,_netdev         0 0
> 192.168.0.7@tcp0:/ioio  /mnt/ioio  lustre  defaults,_netdev,noauto  0 0
>
> [root at lustreone ~]# ifconfig
> bond0 Link encap:Ethernet HWaddr 00:1B:21:28:77:DB
> inet addr:192.168.0.7 Bcast:192.168.0.255 Mask:255.255.255.0
> inet6 addr: fe80::21b:21ff:fe28:77db/64 Scope:Link
> UP BROADCAST RUNNING MASTER MULTICAST MTU:9000 Metric:1
> RX packets:5457486 errors:0 dropped:0 overruns:0 frame:0
> TX packets:4665580 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:12376680079 (11.5 GiB) TX bytes:34438742885 (32.0 GiB)
>
> eth0 Link encap:Ethernet HWaddr 00:1B:21:28:77:DB
> inet6 addr: fe80::21b:21ff:fe28:77db/64 Scope:Link
> UP BROADCAST RUNNING SLAVE MULTICAST MTU:9000 Metric:1
> RX packets:3808615 errors:0 dropped:0 overruns:0 frame:0
> TX packets:4664270 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:1000
> RX bytes:12290700380 (11.4 GiB) TX bytes:34438581771 (32.0 GiB)
> Base address:0xec00 Memory:febe0000-fec00000
>
> From what I have read, not having an external journal configured for the
> OSTs is a sure recipe for slowness, which I would rather not have
> considering the goal is around 350MiB/s or more, which should be
> obtainable.
>
> Here is how I formatted the raid6 device on both OSS's that have identical
> [root at lustrefour ~]# fdisk -l
>
> Disk /dev/sda: 1000.2 GB, 1000204886016 bytes
> 255 heads, 63 sectors/track, 121601 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
>
> Device Boot Start End Blocks Id System
> /dev/sda1 * 1 121601 976760001 83 Linux
>
> Disk /dev/sdb: 1000.2 GB, 1000204886016 bytes
> 255 heads, 63 sectors/track, 121601 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
>
> Disk /dev/sdb doesn't contain a valid partition table
>
> Disk /dev/sdc: 1000.2 GB, 1000204886016 bytes
> 255 heads, 63 sectors/track, 121601 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
>
> Disk /dev/sdc doesn't contain a valid partition table
>
> Disk /dev/sdd: 1000.2 GB, 1000204886016 bytes
> 255 heads, 63 sectors/track, 121601 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
>
> Disk /dev/sdd doesn't contain a valid partition table
>
> Disk /dev/sde: 1000.2 GB, 1000204886016 bytes
> 255 heads, 63 sectors/track, 121601 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
>
> Disk /dev/sde doesn't contain a valid partition table
>
> Disk /dev/sdf: 1000.2 GB, 1000204886016 bytes
> 255 heads, 63 sectors/track, 121601 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
>
> Disk /dev/sdf doesn't contain a valid partition table
>
> Disk /dev/sdg: 1000.2 GB, 1000204886016 bytes
> 255 heads, 63 sectors/track, 121601 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
>
> Disk /dev/sdg doesn't contain a valid partition table
>
> Disk /dev/sdh: 1000.2 GB, 1000204886016 bytes
> 255 heads, 63 sectors/track, 121601 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
>
> Disk /dev/sdh doesn't contain a valid partition table
>
> Disk /dev/md0: 4000.8 GB, 4000819183616 bytes
> 2 heads, 4 sectors/track, 976762496 cylinders
> Units = cylinders of 8 * 512 = 4096 bytes
>
> Disk /dev/md0 doesn't contain a valid partition table
> [root at lustrefour ~]#
>
> [root at lustrefour ~]# mdadm --create --assume-clean /dev/md0 --level=6 --chunk=128 --raid-devices=6 /dev/sd[cdefgh]
> [root at lustrefour ~]# cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sdc[0] sdh[5] sdg[4] sdf[3] sde[2] sdd[1]
> 3907049984 blocks level 6, 128k chunk, algorithm 2 [6/6] [UUUUUU]
> in: 16674 reads, 16217479 writes; out: 3022788 reads, 32865192 writes
> 7712698 in raid5d, 8264 out of stripes, 25661224 handle called
> reads: 0 for rmw, 1710975 for rcw. zcopy writes: 4864584, copied writes: 16115932
> 0 delayed, 0 bit delayed, 0 active, queues: 0 in, 0 out
> 0 expanding overlap
>
>
> unused devices: <none>
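The geometry of this array follows from the mdadm options, and it can be checked against the block count in /proc/mdstat. A small sketch: the helper name is made up, and the per-device size of 976762496 1K blocks is an assumption, taken from the raid1 md0 on lustreone, which is built from disks of the same size.

```python
def raid6_geometry(raid_devices, chunk_kib, device_blocks_1k):
    """Return (data disks, full stripe in KiB, usable 1K blocks) for a RAID6 set."""
    data_disks = raid_devices - 2              # RAID6 spends two chunks per stripe on parity
    full_stripe_kib = data_disks * chunk_kib   # data written in one full stripe
    usable_blocks = data_disks * device_blocks_1k
    return data_disks, full_stripe_kib, usable_blocks

# --raid-devices=6 --chunk=128 from the mdadm command above.
data_disks, stripe_kib, usable = raid6_geometry(6, 128, 976762496)
print(f"data disks: {data_disks}")
print(f"full stripe: {stripe_kib} KiB")
print(f"usable size: {usable} blocks")
```

The usable size agrees with the 3907049984 blocks reported by /proc/mdstat. Writes that do not cover a whole 512 KiB stripe generally make md read data or parity back before writing, so I/O sizes that are multiples of the full stripe, such as the 4 MiB records in the iozone runs, should be the friendlier case for this array.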
>
> Followed with:
>
> [root at lustrefour ~]# mkfs.lustre --ost --fsname=ioio --mgsnode=192.168.0.7@tcp0 --mkfsoptions="-J device=/dev/sdb1" --reformat /dev/md0
>
> [root at lustrefour ~]# mke2fs -b 4096 -O journal_dev /dev/sdb1
>
> But that is hard to reassemble on reboot, or at least it was before I used
> e2label and labelled things properly. Question: how do I label the external
> journal in fstab, if at all? Right now I am only running
>
> [root at lustrefour ~]# mkfs.lustre --fsname=ioio --ost --mgsnode=192.168.0.7@tcp0 --reformat /dev/md0
>
> So just raid6 no external journal.
>
> [root at lustrefour ~]# cat /etc/fstab
> LABEL=/                 /           ext3    defaults                 1 1
> tmpfs                   /dev/shm    tmpfs   defaults                 0 0
> devpts                  /dev/pts    devpts  gid=5,mode=620           0 0
> sysfs                   /sys        sysfs   defaults                 0 0
> proc                    /proc       proc    defaults                 0 0
> LABEL=ioio-OST0001      /mnt/ost00  lustre  defaults,_netdev         0 0
> 192.168.0.7@tcp0:/ioio  /mnt/ioio   lustre  defaults,_netdev,noauto  0 0
>
> [root at lustrefour ~]#
>
>
> [root at lustreone bin]# ./ost-survey -s 4096 /mnt/ioio
> ./ost-survey: 01/24/09 OST speed survey on /mnt/ioio from 192.168.0.7 at tcp
> Number of Active OST devices : 2
> Worst Read OST indx: 0 speed: 38.789337
> Best Read OST indx: 1 speed: 40.017201
> Read Average: 39.403269 +/- 0.613932 MB/s
> Worst Write OST indx: 0 speed: 49.227064
> Best Write OST indx: 1 speed: 78.673564
> Write Average: 63.950314 +/- 14.723250 MB/s
> Ost# Read(MB/s) Write(MB/s) Read-time Write-time
> ----------------------------------------------------
> 0 38.789 49.227 105.596 83.206
> 1 40.017 78.674 102.356 52.063
> [root at lustreone bin]# ./ost-survey -s 1024 /mnt/ioio
> ./ost-survey: 01/24/09 OST speed survey on /mnt/ioio from 192.168.0.7 at tcp
> Number of Active OST devices : 2
> Worst Read OST indx: 0 speed: 38.559620
> Best Read OST indx: 1 speed: 40.053787
> Read Average: 39.306704 +/- 0.747083 MB/s
> Worst Write OST indx: 0 speed: 71.623744
> Best Write OST indx: 1 speed: 82.764897
> Write Average: 77.194320 +/- 5.570577 MB/s
> Ost# Read(MB/s) Write(MB/s) Read-time Write-time
> ----------------------------------------------------
> 0 38.560 71.624 26.556 14.297
> 1 40.054 82.765 25.566 12.372
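The "Average +/-" lines in the ost-survey output can be reproduced from the per-OST speeds. The check below suggests that the +/- figure is the population standard deviation, which for two OSTs is half the difference between them. The per-OST sum is included only as a naive estimate of aggregate throughput: it assumes the OSTs were measured one at a time and share no bottleneck, which is an assumption on my part.

```python
from statistics import mean, pstdev

# Per-OST speeds in MB/s (OST 0, OST 1), copied from the two surveys above.
survey = {
    "read  -s 4096": [38.789337, 40.017201],
    "write -s 4096": [49.227064, 78.673564],
    "read  -s 1024": [38.559620, 40.053787],
    "write -s 1024": [71.623744, 82.764897],
}

summary = {}
for label, speeds in survey.items():
    # (average, population standard deviation, naive sum over OSTs)
    summary[label] = (mean(speeds), pstdev(speeds), sum(speeds))
    avg, dev, total = summary[label]
    print(f"{label}: average {avg:.6f} +/- {dev:.6f} MB/s, naive sum {total:.1f} MB/s")
```

Whether the two OSTs actually add up this way depends on the client's network path, which is the plateau being discussed in this thread.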
> [root at lustreone bin]# dd of=/mnt/ioio/bigfileMGS if=/dev/zero bs=1048576
> 3536+0 records in
> 3536+0 records out
> 3707764736 bytes (3.7 GB) copied, 38.4775 seconds, 96.4 MB/s
>
> lustreone, lustretwo, lustrethree and lustrefour all have the same modprobe.conf
>
> [root at lustrefour ~]# cat /etc/modprobe.conf
> alias eth0 e1000
> alias eth1 e1000
> alias scsi_hostadapter pata_marvell
> alias scsi_hostadapter1 ata_piix
> options lnet networks=tcp
> alias eth2 sky2
> alias eth3 sky2
> alias eth4 sky2
> alias eth5 sky2
> alias bond0 bonding
> options bonding miimon=100 mode=4
> [root at lustrefour ~]#
>
> When I do the same from all clients I can watch
> ./usr/bin/gnome-system-monitor, and the send and receive from the various
> nodes reaches a 209 MiB/s plateau. Uggh
>
>
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
--
Jeremy Mann
jeremy at biochem.uthscsa.edu
University of Texas Health Science Center
Bioinformatics Core Facility
http://www.bioinformatics.uthscsa.edu
Phone: (210) 567-2672