[Lustre-discuss] Plateau around 200MiB/s bond0

Arden Wiebe albert682 at yahoo.com
Wed Jan 28 17:24:30 PST 2009


I ran this on my 6xGigE bond0 MGS client.  I had to go back and cd into the mounted Lustre filesystem first.

[root at lustreone ~]# cd /mnt/ioio
[root at lustreone ioio]# iozone -t1 -i0 -il -r4m -s2g

        Record Size 4096 KB
        File size set to 2097152 KB
        Command line used: iozone -t1 -i0 -il -r4m -s2g
        Output is in Kbytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
        Throughput test with 1 process
        Each process writes a 2097152 Kbyte file in 4096 Kbyte records

        Children see throughput for  1 initial writers  =  106916.81 KB/sec
        Parent sees throughput for  1 initial writers   =  105244.22 KB/sec
        Min throughput per process                      =  106916.81 KB/sec
        Max throughput per process                      =  106916.81 KB/sec
        Avg throughput per process                      =  106916.81 KB/sec
        Min xfer                                        = 2097152.00 KB

        Children see throughput for  1 rewriters        =  106882.15 KB/sec
        Parent sees throughput for  1 rewriters         =  105215.34 KB/sec
        Min throughput per process                      =  106882.15 KB/sec
        Max throughput per process                      =  106882.15 KB/sec
        Avg throughput per process                      =  106882.15 KB/sec
        Min xfer                                        = 2097152.00 KB

I ran this to match the physical RAM in the MGS client.

[root at lustreone ioio]# iozone -t1 -i0 -il -r4m -s8g
Run began: Wed Jan 28 17:33:53 2009

        Record Size 4096 KB
        File size set to 8388608 KB
        Command line used: iozone -t1 -i0 -il -r4m -s8g
        Output is in Kbytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
        Throughput test with 1 process
        Each process writes a 8388608 Kbyte file in 4096 Kbyte records

        Children see throughput for  1 initial writers  =  100817.04 KB/sec
        Parent sees throughput for  1 initial writers   =  100420.04 KB/sec
        Min throughput per process                      =  100817.04 KB/sec
        Max throughput per process                      =  100817.04 KB/sec
        Avg throughput per process                      =  100817.04 KB/sec
        Min xfer                                        = 8388608.00 KB

        Children see throughput for  1 rewriters        =  100884.15 KB/sec
        Parent sees throughput for  1 rewriters         =  100487.30 KB/sec
        Min throughput per process                      =  100884.15 KB/sec
        Max throughput per process                      =  100884.15 KB/sec
        Avg throughput per process                      =  100884.15 KB/sec
        Min xfer                                        = 8388608.00 KB

Then I ran this to match my processor count as well as my physical RAM, by increasing -t1 to -t4.  A subsequent test with -t6 proved redundant.

[root at lustreone ioio]# iozone -t4 -i0 -il -r4m -s8g
Run began: Wed Jan 28 17:37:33 2009

        Record Size 4096 KB
        File size set to 8388608 KB
        Command line used: iozone -t4 -i0 -il -r4m -s8g
        Output is in Kbytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
        Throughput test with 4 processes
        Each process writes a 8388608 Kbyte file in 4096 Kbyte records

        Children see throughput for  4 initial writers  =  206173.77 KB/sec
        Parent sees throughput for  4 initial writers   =  191062.04 KB/sec
        Min throughput per process                      =   48302.41 KB/sec
        Max throughput per process                      =   54266.61 KB/sec
        Avg throughput per process                      =   51543.44 KB/sec
        Min xfer                                        = 7467008.00 KB

        Children see throughput for  4 rewriters        =  206216.61 KB/sec
        Parent sees throughput for  4 rewriters         =  205358.90 KB/sec
        Min throughput per process                      =   50336.13 KB/sec
        Max throughput per process                      =   53059.13 KB/sec
        Avg throughput per process                      =   51554.15 KB/sec
        Min xfer                                        = 7958528.00 KB

Screenshots at http://ioio.ca/iozone/MGSClient/images.html clearly show a large jump in smooth, stable network activity from the 200MiB/s range to the 400MiB/s range.

If one were to have more processors, would that increase maximum throughput?  Does the number of GigE interfaces scale with the number of processors?  Given the 6xGigE bond0, can I test in any other way to get past the 412MiB/s plateau?  How do I best interpret the above results?
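
If it would help, iozone can also drive several clients at once in cluster mode.  This is an untested sketch: the hostnames and clients.txt are placeholders, and it assumes iozone is installed on every client, passwordless ssh works, and this iozone honours RSH=ssh for the remote shell.

[root at lustreone ioio]# cat > clients.txt <<EOF
lustreone /mnt/ioio /usr/bin/iozone
clienttwo /mnt/ioio /usr/bin/iozone
clientthree /mnt/ioio /usr/bin/iozone
EOF
[root at lustreone ioio]# export RSH=ssh
[root at lustreone ioio]# iozone -+m clients.txt -t3 -i0 -r4m -s8g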

--- On Wed, 1/28/09, Jeremy Mann <jeremy at biochem.uthscsa.edu> wrote:

From: Jeremy Mann <jeremy at biochem.uthscsa.edu>
Subject: Re: [Lustre-discuss] Plateau around 200MiB/s bond0
To: "Arden Wiebe" <albert682 at yahoo.com>
Cc: "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
Date: Wednesday, January 28, 2009, 1:56 PM

Arden, we also use dual-channel GigE (bond0), and in my tests I found that
this works best:

options bonding miimon=100 mode=802.3ad xmit_hash_policy=layer3+4

This allows us to get roughly 250 MB/s transfers. Here is the iozone
command I used:

 iozone -t1 -i0 -il -r4m -s2g

You will not get any more performance unless you move to Infiniband or
another interconnect.
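
Since mode=4 and mode=802.3ad are the same bonding mode, the only change from the modprobe.conf you posted would be adding the hash policy.  A sketch, assuming the rest of the file stays as-is:

alias bond0 bonding
options bonding miimon=100 mode=802.3ad xmit_hash_policy=layer3+4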

Jeffrey Alan Bennett wrote:
> Hi Arden,
>
> Are you obtaining more than 100 MB/sec from one client to one OST? Given
> that you are using 802.3ad link aggregation, it will determine the
> physical NIC by the other party's MAC address. So having multiple OSTs and
> multiple clients will improve the chances of using more than one NIC of
> the bond.
>
> What is the maximum performance you obtain on the client with two 1GbE?
>
> jeff
>
>
>
>
> ________________________________
> From: lustre-discuss-bounces at lists.lustre.org
> [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of Arden Wiebe
> Sent: Sunday, January 25, 2009 12:08 AM
> To: lustre-discuss at lists.lustre.org
> Subject: Re: [Lustre-discuss] Plateau around 200MiB/s bond0
>
> So if one OST gets 200MiB/s and another OST gets 200MiB/s, does that make
> 400MiB/s, or is that not how to calculate throughput?  I will eventually
> plug the right sequence into iozone to measure it.
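>
> One rough way to check (just a sketch; the file names are placeholders)
> would be to run the same streaming write from two clients at the same time
> and add the two rates that dd reports:
>
> # on client A:
> dd of=/mnt/ioio/bigfileA if=/dev/zero bs=1048576 count=4096 conv=fsync
> # on client B, at the same time:
> dd of=/mnt/ioio/bigfileB if=/dev/zero bs=1048576 count=4096 conv=fsync
>
> conv=fsync is there so the cache flush is included in the timing, assuming
> the installed dd supports it.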
>
> From my perspective it looks like ioio.ca/ioio.jpg ioio.ca/lustreone.png
> ioio.ca/lustretwo.png ioio.ca/lustrethree.png ioio.ca/lustrefour.png
>
> --- On Sat, 1/24/09, Arden Wiebe <albert682 at yahoo.com> wrote:
>
> From: Arden Wiebe <albert682 at yahoo.com>
> Subject: [Lustre-discuss] Plateau around 200MiB/s bond0
> To: lustre-discuss at lists.lustre.org
> Date: Saturday, January 24, 2009, 6:04 PM
>
> 1-2948-SFP Plus Baseline 3Com Switch
> 1-MGS bond0(eth0,eth1,eth2,eth3,eth4,eth5) raid1
> 1-MDT bond0(eth0,eth1,eth2,eth3,eth4,eth5) raid1
> 2-OSS bond0(eth0,eth1,eth2,eth3,eth4,eth5) raid6
> 1-MGS-CLIENT bond0(eth0,eth1,eth2,eth3,eth4,eth5)
> 1-CLIENT bond0(eth0,eth1)
> 1-CLIENT eth0
> 1-CLIENT eth0
>
> So far I have failed at creating an external journal for the MDT, MGS and
> the two OSSs.  How do I add the external journal to /etc/fstab, specifically
> the label from e2label /dev/sdb followed by what options for fstab?
>
> [root at lustreone ~]# cat /proc/fs/lustre/devices
>   0 UP mgs MGS MGS 17
>   1 UP mgc MGC192.168.0.7 at tcp 876c20af-aaec-1da0-5486-1fc61ec8cd15 5
>   2 UP lov ioio-clilov-ffff810209363c00
> 7307490a-4a12-4e8c-56ea-448e030a82e4 4
>   3 UP mdc ioio-MDT0000-mdc-ffff810209363c00
> 7307490a-4a12-4e8c-56ea-448e030a82e4 5
>   4 UP osc ioio-OST0000-osc-ffff810209363c00
> 7307490a-4a12-4e8c-56ea-448e030a82e4 5
>   5 UP osc ioio-OST0001-osc-ffff810209363c00
> 7307490a-4a12-4e8c-56ea-448e030a82e4 5
> [root at lustreone ~]# lfs df -h
> UUID                     bytes      Used Available  Use% Mounted on
> ioio-MDT0000_UUID       815.0G    534.0M    767.9G    0% /mnt/ioio[MDT:0]
> ioio-OST0000_UUID         3.6T     28.4G      3.4T    0% /mnt/ioio[OST:0]
> ioio-OST0001_UUID         3.6T     18.0G      3.4T    0% /mnt/ioio[OST:1]
>
> filesystem summary:       7.2T     46.4G      6.8T    0% /mnt/ioio
>
> [root at lustreone ~]# cat /proc/net/bonding/bond0
> Ethernet Channel Bonding Driver: v3.2.4 (January 28, 2008)
>
> Bonding Mode: IEEE 802.3ad Dynamic link aggregation
> Transmit Hash Policy: layer2 (0)
> MII Status: up
> MII Polling Interval (ms): 100
> Up Delay (ms): 0
> Down Delay (ms): 0
>
> 802.3ad info
> LACP rate: slow
> Active Aggregator Info:
>         Aggregator ID: 1
>         Number of ports: 1
>         Actor Key: 17
>         Partner Key: 1
>         Partner Mac Address: 00:00:00:00:00:00
>
> Slave Interface: eth0
> MII Status: up
> Link Failure Count: 1
> Permanent HW addr: 00:1b:21:28:77:db
> Aggregator ID: 1
>
> Slave Interface: eth1
> MII Status: up
> Link Failure Count: 1
> Permanent HW addr: 00:1b:21:28:77:6c
> Aggregator ID: 2
>
> Slave Interface: eth3
> MII Status: up
> Link Failure Count: 0
> Permanent HW addr: 00:22:15:06:3a:94
> Aggregator ID: 3
>
> Slave Interface: eth2
> MII Status: up
> Link Failure Count: 0
> Permanent HW addr: 00:22:15:06:3a:93
> Aggregator ID: 4
>
> Slave Interface: eth4
> MII Status: up
> Link Failure Count: 0
> Permanent HW addr: 00:22:15:06:3a:95
> Aggregator ID: 5
>
> Slave Interface: eth5
> MII Status: up
> Link Failure Count: 0
> Permanent HW addr: 00:22:15:06:3a:96
> Aggregator ID: 6
> [root at lustreone ~]# cat /proc/mdstat
> Personalities : [raid1]
> md0 : active raid1 sdb[0] sdc[1]
>       976762496 blocks [2/2] [UU]
>
> unused devices: <none>
> [root at lustreone ~]# cat /etc/fstab
> LABEL=/                 /                       ext3    defaults        1 1
> tmpfs                   /dev/shm                tmpfs   defaults        0 0
> devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
> sysfs                   /sys                    sysfs   defaults        0 0
> proc                    /proc                   proc    defaults        0 0
> LABEL=MGS               /mnt/mgs                lustre  defaults,_netdev 0 0
> 192.168.0.7 at tcp0:/ioio  /mnt/ioio               lustre  defaults,_netdev,noauto 0 0
>
> [root at lustreone ~]# ifconfig
> bond0     Link encap:Ethernet  HWaddr 00:1B:21:28:77:DB
>           inet addr:192.168.0.7  Bcast:192.168.0.255  Mask:255.255.255.0
>           inet6 addr: fe80::21b:21ff:fe28:77db/64 Scope:Link
>           UP BROADCAST RUNNING MASTER MULTICAST  MTU:9000  Metric:1
>           RX packets:5457486 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:4665580 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:12376680079 (11.5 GiB)  TX bytes:34438742885 (32.0 GiB)
>
> eth0      Link encap:Ethernet  HWaddr 00:1B:21:28:77:DB
>           inet6 addr: fe80::21b:21ff:fe28:77db/64 Scope:Link
>           UP BROADCAST RUNNING SLAVE MULTICAST  MTU:9000  Metric:1
>           RX packets:3808615 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:4664270 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:12290700380 (11.4 GiB)  TX bytes:34438581771 (32.0 GiB)
>           Base address:0xec00 Memory:febe0000-fec00000
>
> From what I have read, not having an external journal configured for the
> OSTs is a sure recipe for slowness, which I would rather not have,
> considering the goal is around 350MiB/s or more, which should be obtainable.
>
> Here is how I formatted the raid6 device on both OSSs, which have identical disks:
> [root at lustrefour ~]# fdisk -l
>
> Disk /dev/sda: 1000.2 GB, 1000204886016 bytes
> 255 heads, 63 sectors/track, 121601 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
>
>    Device Boot      Start         End      Blocks   Id  System
> /dev/sda1   *           1      121601   976760001   83  Linux
>
> Disk /dev/sdb: 1000.2 GB, 1000204886016 bytes
> 255 heads, 63 sectors/track, 121601 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
>
> Disk /dev/sdb doesn't contain a valid partition table
>
> Disk /dev/sdc: 1000.2 GB, 1000204886016 bytes
> 255 heads, 63 sectors/track, 121601 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
>
> Disk /dev/sdc doesn't contain a valid partition table
>
> Disk /dev/sdd: 1000.2 GB, 1000204886016 bytes
> 255 heads, 63 sectors/track, 121601 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
>
> Disk /dev/sdd doesn't contain a valid partition table
>
> Disk /dev/sde: 1000.2 GB, 1000204886016 bytes
> 255 heads, 63 sectors/track, 121601 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
>
> Disk /dev/sde doesn't contain a valid partition table
>
> Disk /dev/sdf: 1000.2 GB, 1000204886016 bytes
> 255 heads, 63 sectors/track, 121601 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
>
> Disk /dev/sdf doesn't contain a valid partition table
>
> Disk /dev/sdg: 1000.2 GB, 1000204886016 bytes
> 255 heads, 63 sectors/track, 121601 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
>
> Disk /dev/sdg doesn't contain a valid partition table
>
> Disk /dev/sdh: 1000.2 GB, 1000204886016 bytes
> 255 heads, 63 sectors/track, 121601 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
>
> Disk /dev/sdh doesn't contain a valid partition table
>
> Disk /dev/md0: 4000.8 GB, 4000819183616 bytes
> 2 heads, 4 sectors/track, 976762496 cylinders
> Units = cylinders of 8 * 512 = 4096 bytes
>
> Disk /dev/md0 doesn't contain a valid partition table
> [root at lustrefour ~]#
>
> [root at lustrefour ~]#  mdadm --create --assume-clean /dev/md0 --level=6
> --chunk=128 --raid-devices=6 /dev/sd[cdefgh]
> [root at lustrefour ~]# cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sdc[0] sdh[5] sdg[4] sdf[3] sde[2] sdd[1]
>       3907049984 blocks level 6, 128k chunk, algorithm 2 [6/6] [UUUUUU]
>                 in: 16674 reads, 16217479 writes; out: 3022788 reads,
> 32865192 writes
>                 7712698 in raid5d, 8264 out of stripes, 25661224 handle
> called
>                 reads: 0 for rmw, 1710975 for rcw. zcopy writes: 4864584,
> copied writes: 16115932
>                 0 delayed, 0 bit delayed, 0 active, queues: 0 in, 0 out
>                 0 expanding overlap
>
>
> unused devices: <none>
>
> Followed with:
>
> [root at lustrefour ~]# mkfs.lustre --ost --fsname=ioio
> --mgsnode=192.168.0.7 at tcp0 --mkfsoptions="-J device=/dev/sdb1" --reformat
> /dev/md0
>
> [root at lustrefour ~]# mke2fs -b 4096 -O journal_dev /dev/sdb1
>
> But that is hard to reassemble on reboot, or at least it was before I used
> e2label and labeled things right.  Question: how do I label the external
> journal in fstab, if at all?  Right now I am only running
>
> [root at lustrefour ~]# mkfs.lustre --fsname=ioio --ost
> --mgsnode=192.168.0.7 at tcp0 --reformat /dev/md0
>
> So it is just raid6 with no external journal.
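>
> If I try the external journal again, maybe the right order is to label the
> journal device first and point mkfs.lustre at that label (untested sketch;
> the label name is made up, and as far as I can tell the journal device would
> not need its own fstab line because only the OST itself gets mounted):
>
> [root at lustrefour ~]# mke2fs -b 4096 -O journal_dev -L ioio-OST0001-journal /dev/sdb1
> [root at lustrefour ~]# mkfs.lustre --ost --fsname=ioio --mgsnode=192.168.0.7 at tcp0 \
>     --mkfsoptions="-J device=LABEL=ioio-OST0001-journal" --reformat /dev/md0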
>
> [root at lustrefour ~]# cat /etc/fstab
> LABEL=/                 /                       ext3    defaults        1 1
> tmpfs                   /dev/shm                tmpfs   defaults        0 0
> devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
> sysfs                   /sys                    sysfs   defaults        0 0
> proc                    /proc                   proc    defaults        0 0
> LABEL=ioio-OST0001      /mnt/ost00              lustre  defaults,_netdev 0 0
> 192.168.0.7 at tcp0:/ioio  /mnt/ioio               lustre  defaults,_netdev,noauto 0 0
>
> [root at lustrefour ~]#
>
>
> [root at lustreone bin]# ./ost-survey -s 4096 /mnt/ioio
> ./ost-survey: 01/24/09 OST speed survey on /mnt/ioio from 192.168.0.7 at tcp
> Number of Active OST devices : 2
> Worst  Read OST indx: 0 speed: 38.789337
> Best   Read OST indx: 1 speed: 40.017201
> Read Average: 39.403269 +/- 0.613932 MB/s
> Worst  Write OST indx: 0 speed: 49.227064
> Best   Write OST indx: 1 speed: 78.673564
> Write Average: 63.950314 +/- 14.723250 MB/s
> Ost#  Read(MB/s)  Write(MB/s)  Read-time  Write-time
> ----------------------------------------------------
> 0     38.789       49.227        105.596      83.206
> 1     40.017       78.674        102.356      52.063
> [root at lustreone bin]# ./ost-survey -s 1024 /mnt/ioio
> ./ost-survey: 01/24/09 OST speed survey on /mnt/ioio from 192.168.0.7 at tcp
> Number of Active OST devices : 2
> Worst  Read OST indx: 0 speed: 38.559620
> Best   Read OST indx: 1 speed: 40.053787
> Read Average: 39.306704 +/- 0.747083 MB/s
> Worst  Write OST indx: 0 speed: 71.623744
> Best   Write OST indx: 1 speed: 82.764897
> Write Average: 77.194320 +/- 5.570577 MB/s
> Ost#  Read(MB/s)  Write(MB/s)  Read-time  Write-time
> ----------------------------------------------------
> 0     38.560       71.624        26.556      14.297
> 1     40.054       82.765        25.566      12.372
> [root at lustreone bin]# dd of=/mnt/ioio/bigfileMGS if=/dev/zero bs=1048576
> 3536+0 records in
> 3536+0 records out
> 3707764736 bytes (3.7 GB) copied, 38.4775 seconds, 96.4 MB/s
>
> lustreone, lustretwo, lustrethree and lustrefour all have the same modprobe.conf:
>
> [root at lustrefour ~]# cat /etc/modprobe.conf
> alias eth0 e1000
> alias eth1 e1000
> alias scsi_hostadapter pata_marvell
> alias scsi_hostadapter1 ata_piix
> options lnet networks=tcp
> alias eth2 sky2
> alias eth3 sky2
> alias eth4 sky2
> alias eth5 sky2
> alias bond0 bonding
> options bonding miimon=100 mode=4
> [root at lustrefour ~]#
>
> When I do the same from all clients I can watch /usr/bin/gnome-system-monitor,
> and the send and receive traffic from the various nodes reaches a 209 MiB/s
> plateau.  Uggh.
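>
> A rough way to see whether that traffic is spread across the bond slaves or
> stuck on one of them (just a sketch using plain sysfs counters; the numbers
> are cumulative bytes, so watch how fast each one grows):
>
> while true; do
>   for i in eth0 eth1 eth2 eth3 eth4 eth5; do
>     printf "%s:%s " $i $(cat /sys/class/net/$i/statistics/tx_bytes)
>   done
>   echo; sleep 1
> done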
>
>
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>


-- 
Jeremy Mann
jeremy at biochem.uthscsa.edu

University of Texas Health Science Center
Bioinformatics Core Facility
http://www.bioinformatics.uthscsa.edu
Phone: (210) 567-2672




      


