[Lustre-discuss] Plateau around 200MiB/s bond0

Jeffrey Alan Bennett jab at sdsc.edu
Wed Jan 28 12:30:05 PST 2009


Hi Arden,

Are you obtaining more than 100 MB/s from one client to one OST? Since you are using 802.3ad link aggregation with a layer2 transmit hash, the outgoing physical NIC is chosen by hashing the other party's MAC address, so any single client-to-server conversation stays on one link. Having multiple OSTs and multiple clients improves the chances of using more than one NIC in the bond.
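
If the bonding driver and the 3Com switch support it, a layer3+4 transmit hash lets multiple TCP streams from a single host spread across the slaves. A minimal sketch for /etc/modprobe.conf (an assumption on my part that your driver version accepts this option; note layer3+4 hashing is not strictly 802.3ad compliant):

options bonding miimon=100 mode=4 xmit_hash_policy=layer3+4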

What is the maximum performance you obtain on the client with the two bonded 1GbE links?

jeff




________________________________
From: lustre-discuss-bounces at lists.lustre.org [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of Arden Wiebe
Sent: Sunday, January 25, 2009 12:08 AM
To: lustre-discuss at lists.lustre.org
Subject: Re: [Lustre-discuss] Plateau around 200MiB/s bond0

So if one OST gets 200 MiB/s and another OST gets 200 MiB/s, does that add up to 400 MiB/s, or is that not how to calculate aggregate throughput?  I will eventually plug the right options into iozone to measure it.
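
Something like this is what I would try first (a guess at a reasonable invocation; the record size, per-thread file size, and thread count are just starting points):

iozone -i 0 -i 1 -r 1m -s 4g -t 4 -F /mnt/ioio/f1 /mnt/ioio/f2 /mnt/ioio/f3 /mnt/ioio/f4

-i 0 and -i 1 select the write and read tests, -r 1m uses a 1 MiB record size, -s 4g writes 4 GiB per thread so the client cache does not dominate, and -t 4 runs four threads in throughput mode.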

From my perspective it looks like ioio.ca/ioio.jpg ioio.ca/lustreone.png ioio.ca/lustretwo.png ioio.ca/lustrethree.png ioio.ca/lustrefour.png

--- On Sat, 1/24/09, Arden Wiebe <albert682 at yahoo.com> wrote:

From: Arden Wiebe <albert682 at yahoo.com>
Subject: [Lustre-discuss] Plateau around 200MiB/s bond0
To: lustre-discuss at lists.lustre.org
Date: Saturday, January 24, 2009, 6:04 PM

1-2948-SFP Plus Baseline 3Com Switch
1-MGS bond0(eth0,eth1,eth2,eth3,eth4,eth5) raid1
1-MDT bond0(eth0,eth1,eth2,eth3,eth4,eth5) raid1
2-OSS bond0(eth0,eth1,eth2,eth3,eth4,eth5) raid6
1-MGS-CLIENT bond0(eth0,eth1,eth2,eth3,eth4,eth5)
1-CLIENT bond0(eth0,eth1)
1-CLIENT eth0
1-CLIENT eth0

So far I have failed at creating an external journal for the MDT, MGS, and both OSSes.  How do I add the external journal to /etc/fstab, specifically using the output of e2label /dev/sdb, and with what fstab options?
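
My working guess, sketched below but not yet verified on these boxes (/dev/sdb1 and the label name are just what I would pick), is that the journal gets its label at mke2fs time and the filesystem is pointed at it when formatting:

mke2fs -b 4096 -O journal_dev -L ost00-journal /dev/sdb1
mkfs.lustre --ost --fsname=ioio --mgsnode=192.168.0.7@tcp0 \
    --mkfsoptions="-J device=LABEL=ost00-journal" --reformat /dev/md0

If that is right, the journal device itself would not need an fstab line at all, since the filesystem opens its own journal at mount time.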

[root at lustreone ~]# cat /proc/fs/lustre/devices
  0 UP mgs MGS MGS 17
  1 UP mgc MGC192.168.0.7@tcp 876c20af-aaec-1da0-5486-1fc61ec8cd15 5
  2 UP lov ioio-clilov-ffff810209363c00 7307490a-4a12-4e8c-56ea-448e030a82e4 4
  3 UP mdc ioio-MDT0000-mdc-ffff810209363c00 7307490a-4a12-4e8c-56ea-448e030a82e4 5
  4 UP osc ioio-OST0000-osc-ffff810209363c00 7307490a-4a12-4e8c-56ea-448e030a82e4 5
  5 UP osc ioio-OST0001-osc-ffff810209363c00 7307490a-4a12-4e8c-56ea-448e030a82e4 5
[root at lustreone ~]# lfs df -h
UUID                     bytes      Used Available  Use% Mounted on
ioio-MDT0000_UUID       815.0G    534.0M    767.9G    0% /mnt/ioio[MDT:0]
ioio-OST0000_UUID         3.6T     28.4G      3.4T    0% /mnt/ioio[OST:0]
ioio-OST0001_UUID         3.6T     18.0G      3.4T    0% /mnt/ioio[OST:1]

filesystem summary:       7.2T     46.4G      6.8T    0% /mnt/ioio

[root at lustreone ~]# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.2.4 (January 28, 2008)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: slow
Active Aggregator Info:
        Aggregator ID: 1
        Number of ports: 1
        Actor Key: 17
        Partner Key: 1
        Partner Mac Address: 00:00:00:00:00:00

Slave Interface: eth0
MII Status: up
Link Failure Count: 1
Permanent HW addr: 00:1b:21:28:77:db
Aggregator ID: 1

Slave Interface: eth1
MII Status: up
Link Failure Count: 1
Permanent HW addr: 00:1b:21:28:77:6c
Aggregator ID: 2

Slave Interface: eth3
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:22:15:06:3a:94
Aggregator ID: 3

Slave Interface: eth2
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:22:15:06:3a:93
Aggregator ID: 4

Slave Interface: eth4
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:22:15:06:3a:95
Aggregator ID: 5

Slave Interface: eth5
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:22:15:06:3a:96
Aggregator ID: 6
[root at lustreone ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb[0] sdc[1]
      976762496 blocks [2/2] [UU]

unused devices: <none>
[root at lustreone ~]# cat /etc/fstab
LABEL=/                 /                       ext3    defaults        1 1
tmpfs                   /dev/shm                tmpfs   defaults        0 0
devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
sysfs                   /sys                    sysfs   defaults        0 0
proc                    /proc                   proc    defaults        0 0
LABEL=MGS               /mnt/mgs                lustre  defaults,_netdev 0 0
192.168.0.7@tcp0:/ioio     /mnt/ioio               lustre  defaults,_netdev,noauto 0 0

[root at lustreone ~]# ifconfig
bond0     Link encap:Ethernet  HWaddr 00:1B:21:28:77:DB
          inet addr:192.168.0.7  Bcast:192.168.0.255  Mask:255.255.255.0
          inet6 addr: fe80::21b:21ff:fe28:77db/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:9000  Metric:1
          RX packets:5457486 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4665580 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:12376680079 (11.5 GiB)  TX bytes:34438742885 (32.0 GiB)

eth0      Link encap:Ethernet  HWaddr 00:1B:21:28:77:DB
          inet6 addr: fe80::21b:21ff:fe28:77db/64 Scope:Link
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:9000  Metric:1
          RX packets:3808615 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4664270 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:12290700380 (11.4 GiB)  TX bytes:34438581771 (32.0 GiB)
          Base address:0xec00 Memory:febe0000-fec00000

From what I have read, not having an external journal configured for the OSTs is a sure recipe for slowness, which I would rather avoid given that the goal is around 350 MiB/s or more, which should be attainable.

Here is how I formatted the RAID6 device on both OSSes, which have identical disks:
[root at lustrefour ~]# fdisk -l

Disk /dev/sda: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1      121601   976760001   83  Linux

Disk /dev/sdb: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdb doesn't contain a valid partition table

Disk /dev/sdc: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdc doesn't contain a valid partition table

Disk /dev/sdd: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdd doesn't contain a valid partition table

Disk /dev/sde: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sde doesn't contain a valid partition table

Disk /dev/sdf: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdf doesn't contain a valid partition table

Disk /dev/sdg: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdg doesn't contain a valid partition table

Disk /dev/sdh: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdh doesn't contain a valid partition table

Disk /dev/md0: 4000.8 GB, 4000819183616 bytes
2 heads, 4 sectors/track, 976762496 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

Disk /dev/md0 doesn't contain a valid partition table
[root at lustrefour ~]#

[root at lustrefour ~]#  mdadm --create --assume-clean /dev/md0 --level=6 --chunk=128 --raid-devices=6 /dev/sd[cdefgh]
[root at lustrefour ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdc[0] sdh[5] sdg[4] sdf[3] sde[2] sdd[1]
      3907049984 blocks level 6, 128k chunk, algorithm 2 [6/6] [UUUUUU]
                in: 16674 reads, 16217479 writes; out: 3022788 reads, 32865192 writes
                7712698 in raid5d, 8264 out of stripes, 25661224 handle called
                reads: 0 for rmw, 1710975 for rcw. zcopy writes: 4864584, copied writes: 16115932
                0 delayed, 0 bit delayed, 0 active, queues: 0 in, 0 out
                0 expanding overlap


unused devices: <none>

Followed with:

[root at lustrefour ~]# mkfs.lustre --ost --fsname=ioio --mgsnode=192.168.0.7@tcp0 --mkfsoptions="-J device=/dev/sdb1" --reformat /dev/md0

[root at lustrefour ~]# mke2fs -b 4096 -O journal_dev /dev/sdb1

But that was hard to reassemble on reboot, or at least it was before I used e2label and labeled things right.  Question: how do I label the external journal in fstab, if at all?  Right now I am only running:

[root at lustrefour ~]# mkfs.lustre --fsname=ioio --ost --mgsnode=192.168.0.7@tcp0 --reformat /dev/md0

So for now it is just RAID6 with no external journal.

[root at lustrefour ~]# cat /etc/fstab
LABEL=/                 /                       ext3    defaults        1 1
tmpfs                   /dev/shm                tmpfs   defaults        0 0
devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
sysfs                   /sys                    sysfs   defaults        0 0
proc                    /proc                   proc    defaults        0 0
LABEL=ioio-OST0001      /mnt/ost00              lustre  defaults,_netdev 0 0
192.168.0.7@tcp0:/ioio     /mnt/ioio               lustre  defaults,_netdev,noauto 0 0

[root at lustrefour ~]#


[root at lustreone bin]# ./ost-survey -s 4096 /mnt/ioio
./ost-survey: 01/24/09 OST speed survey on /mnt/ioio from 192.168.0.7@tcp
Number of Active OST devices : 2
Worst  Read OST indx: 0 speed: 38.789337
Best   Read OST indx: 1 speed: 40.017201
Read Average: 39.403269 +/- 0.613932 MB/s
Worst  Write OST indx: 0 speed: 49.227064
Best   Write OST indx: 1 speed: 78.673564
Write Average: 63.950314 +/- 14.723250 MB/s
Ost#  Read(MB/s)  Write(MB/s)  Read-time  Write-time
----------------------------------------------------
0     38.789       49.227        105.596      83.206
1     40.017       78.674        102.356      52.063
[root at lustreone bin]# ./ost-survey -s 1024 /mnt/ioio
./ost-survey: 01/24/09 OST speed survey on /mnt/ioio from 192.168.0.7@tcp
Number of Active OST devices : 2
Worst  Read OST indx: 0 speed: 38.559620
Best   Read OST indx: 1 speed: 40.053787
Read Average: 39.306704 +/- 0.747083 MB/s
Worst  Write OST indx: 0 speed: 71.623744
Best   Write OST indx: 1 speed: 82.764897
Write Average: 77.194320 +/- 5.570577 MB/s
Ost#  Read(MB/s)  Write(MB/s)  Read-time  Write-time
----------------------------------------------------
0     38.560       71.624        26.556      14.297
1     40.054       82.765        25.566      12.372
[root at lustreone bin]# dd of=/mnt/ioio/bigfileMGS if=/dev/zero bs=1048576
3536+0 records in
3536+0 records out
3707764736 bytes (3.7 GB) copied, 38.4775 seconds, 96.4 MB/s
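
I suspect that dd figure includes client cache, though; forcing the data out before the clock stops (assuming a GNU dd new enough for conv=fdatasync) would look like:

dd of=/mnt/ioio/bigfileMGS if=/dev/zero bs=1048576 count=4096 conv=fdatasync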

lustreone, lustretwo, lustrethree, and lustrefour all have the same modprobe.conf:

[root at lustrefour ~]# cat /etc/modprobe.conf
alias eth0 e1000
alias eth1 e1000
alias scsi_hostadapter pata_marvell
alias scsi_hostadapter1 ata_piix
options lnet networks=tcp
alias eth2 sky2
alias eth3 sky2
alias eth4 sky2
alias eth5 sky2
alias bond0 bonding
options bonding miimon=100 mode=4
[root at lustrefour ~]#
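
One thing I can check from the bonding output above is whether LACP ever negotiated with the 3Com (plain grep, nothing exotic):

grep -E "Partner Mac|Aggregator ID" /proc/net/bonding/bond0

A partner MAC of all zeros, or every slave landing in its own Aggregator ID, would mean the links never formed a single LAG on the switch side.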

When I do the same from all clients, I can watch /usr/bin/gnome-system-monitor, and the send and receive rates from the various nodes reach a 209 MiB/s plateau.  Ugh.


