[Lustre-discuss] Plateau around 200MiB/s bond0

Arden Wiebe albert682@yahoo.com
Sat Jan 24 18:04:21 PST 2009


1-2948-SFP Plus Baseline 3Com Switch
1-MGS bond0(eth0,eth1,eth2,eth3,eth4,eth5) raid1
1-MDT bond0(eth0,eth1,eth2,eth3,eth4,eth5) raid1
2-OSS bond0(eth0,eth1,eth2,eth3,eth4,eth5) raid6
1-MGS-CLIENT bond0(eth0,eth1,eth2,eth3,eth4,eth5)
1-CLIENT bond0(eth0,eth1)
1-CLIENT eth0
1-CLIENT eth0

So far I have failed to create an external journal for the MDT, the MGS, and the two OSSs.  How do I add the external journal to /etc/fstab -- specifically, given the label reported by e2label /dev/sdb, what fstab options should follow it?
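What I think the sequence should look like (a sketch I have not gotten working end to end; device names and the label are from my setup).  The journal block device itself is never mounted, so as far as I can tell it needs no fstab line of its own -- it gets bound to the OST at mkfs time:

```shell
# Make the journal device FIRST, with a label, then point the OST at it.
# The -J device= option is passed through --mkfsoptions to mke2fs.
mke2fs -b 4096 -O journal_dev -L ioio-journal /dev/sdb1
mkfs.lustre --ost --fsname=ioio --mgsnode=192.168.0.7@tcp0 \
    --mkfsoptions="-J device=LABEL=ioio-journal" --reformat /dev/md0
```

Referencing the journal by LABEL= rather than /dev/sdb1 should also survive device renumbering across reboots.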

[root@lustreone ~]# cat /proc/fs/lustre/devices
  0 UP mgs MGS MGS 17
  1 UP mgc MGC192.168.0.7@tcp 876c20af-aaec-1da0-5486-1fc61ec8cd15 5
  2 UP lov ioio-clilov-ffff810209363c00 7307490a-4a12-4e8c-56ea-448e030a82e4 4
  3 UP mdc ioio-MDT0000-mdc-ffff810209363c00 7307490a-4a12-4e8c-56ea-448e030a82e4 5
  4 UP osc ioio-OST0000-osc-ffff810209363c00 7307490a-4a12-4e8c-56ea-448e030a82e4 5
  5 UP osc ioio-OST0001-osc-ffff810209363c00 7307490a-4a12-4e8c-56ea-448e030a82e4 5
[root@lustreone ~]# lfs df -h
UUID                     bytes      Used Available  Use% Mounted on
ioio-MDT0000_UUID       815.0G    534.0M    767.9G    0% /mnt/ioio[MDT:0]
ioio-OST0000_UUID         3.6T     28.4G      3.4T    0% /mnt/ioio[OST:0]
ioio-OST0001_UUID         3.6T     18.0G      3.4T    0% /mnt/ioio[OST:1]

filesystem summary:       7.2T     46.4G      6.8T    0% /mnt/ioio

[root@lustreone ~]# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.2.4 (January 28, 2008)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: slow
Active Aggregator Info:
        Aggregator ID: 1
        Number of ports: 1
        Actor Key: 17
        Partner Key: 1
        Partner Mac Address: 00:00:00:00:00:00

Slave Interface: eth0
MII Status: up
Link Failure Count: 1
Permanent HW addr: 00:1b:21:28:77:db
Aggregator ID: 1

Slave Interface: eth1
MII Status: up
Link Failure Count: 1
Permanent HW addr: 00:1b:21:28:77:6c
Aggregator ID: 2

Slave Interface: eth3
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:22:15:06:3a:94
Aggregator ID: 3

Slave Interface: eth2
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:22:15:06:3a:93
Aggregator ID: 4

Slave Interface: eth4
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:22:15:06:3a:95
Aggregator ID: 5

Slave Interface: eth5
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:22:15:06:3a:96
Aggregator ID: 6
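Note that the six slaves report six different Aggregator IDs and the active aggregator shows only one port, so if I read this right only one link is actually carrying LACP traffic.  A quick check that just parses the /proc file shown above:

```shell
# Count slaves per aggregator ID; with a working 802.3ad bundle all
# slaves should share ONE Aggregator ID (a single line with count 6).
# usage: bond_agg_count /proc/net/bonding/bond0
bond_agg_count() {
    awk '/^Slave Interface:/ {in_slave=1}
         in_slave && /Aggregator ID:/ {print $3; in_slave=0}' \
        "${1:-/proc/net/bonding/bond0}" | sort | uniq -c
}
```

On the output above this would print six lines, one aggregator per slave, which points at the switch and NICs not negotiating a single bundle.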
[root@lustreone ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb[0] sdc[1]
      976762496 blocks [2/2] [UU]

unused devices: <none>
[root@lustreone ~]# cat /etc/fstab
LABEL=/                 /                       ext3    defaults        1 1
tmpfs                   /dev/shm                tmpfs   defaults        0 0
devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
sysfs                   /sys                    sysfs   defaults        0 0
proc                    /proc                   proc    defaults        0 0
LABEL=MGS               /mnt/mgs                lustre  defaults,_netdev 0 0
192.168.0.7@tcp0:/ioio  /mnt/ioio               lustre  defaults,_netdev,noauto 0 0

[root@lustreone ~]# ifconfig
bond0     Link encap:Ethernet  HWaddr 00:1B:21:28:77:DB
          inet addr:192.168.0.7  Bcast:192.168.0.255  Mask:255.255.255.0
          inet6 addr: fe80::21b:21ff:fe28:77db/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:9000  Metric:1
          RX packets:5457486 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4665580 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:12376680079 (11.5 GiB)  TX bytes:34438742885 (32.0 GiB)

eth0      Link encap:Ethernet  HWaddr 00:1B:21:28:77:DB
          inet6 addr: fe80::21b:21ff:fe28:77db/64 Scope:Link
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:9000  Metric:1
          RX packets:3808615 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4664270 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:12290700380 (11.4 GiB)  TX bytes:34438581771 (32.0 GiB)
          Base address:0xec00 Memory:febe0000-fec00000

From what I have read, not having an external journal configured for the OSTs is a sure recipe for slowness, which I would rather avoid, considering the goal is around 350MiB/s or more -- which should be attainable.

Here is how I formatted the raid6 device on both OSSs, which have identical disks:
[root@lustrefour ~]# fdisk -l

Disk /dev/sda: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1      121601   976760001   83  Linux

Disk /dev/sdb: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdb doesn't contain a valid partition table

Disk /dev/sdc: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdc doesn't contain a valid partition table

Disk /dev/sdd: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdd doesn't contain a valid partition table

Disk /dev/sde: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sde doesn't contain a valid partition table

Disk /dev/sdf: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdf doesn't contain a valid partition table

Disk /dev/sdg: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdg doesn't contain a valid partition table

Disk /dev/sdh: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdh doesn't contain a valid partition table

Disk /dev/md0: 4000.8 GB, 4000819183616 bytes
2 heads, 4 sectors/track, 976762496 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

Disk /dev/md0 doesn't contain a valid partition table
[root@lustrefour ~]#

[root@lustrefour ~]#  mdadm --create --assume-clean /dev/md0 --level=6 --chunk=128 --raid-devices=6 /dev/sd[cdefgh]
[root@lustrefour ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdc[0] sdh[5] sdg[4] sdf[3] sde[2] sdd[1]
      3907049984 blocks level 6, 128k chunk, algorithm 2 [6/6] [UUUUUU]
                in: 16674 reads, 16217479 writes; out: 3022788 reads, 32865192 writes
                7712698 in raid5d, 8264 out of stripes, 25661224 handle called
                reads: 0 for rmw, 1710975 for rcw. zcopy writes: 4864584, copied writes: 16115932
                0 delayed, 0 bit delayed, 0 active, queues: 0 in, 0 out
                0 expanding overlap


unused devices: <none>
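To keep the array from being hard to reassemble on reboot, recording it in mdadm.conf should make assembly deterministic by UUID instead of depending on device ordering (a sketch; I have not yet confirmed this on these boxes):

```shell
# Run once after mdadm --create, before the next reboot.
echo 'DEVICE partitions' > /etc/mdadm.conf
mdadm --detail --scan >> /etc/mdadm.conf
```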

Followed with:

[root@lustrefour ~]# mkfs.lustre --ost --fsname=ioio --mgsnode=192.168.0.7@tcp0 --mkfsoptions="-J device=/dev/sdb1" --reformat /dev/md0

[root@lustrefour ~]# mke2fs -b 4096 -O journal_dev /dev/sdb1

But that is hard to reassemble on reboot, or at least it was before I used e2label to label things properly.  Question: how should the external journal be labeled in fstab, if at all?  Right now I am only running:

[root@lustrefour ~]# mkfs.lustre --fsname=ioio --ost --mgsnode=192.168.0.7@tcp0 --reformat /dev/md0

So just raid6 no external journal.
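If the OSTs stay formatted like this, I believe the external journal can be attached afterwards with tune2fs rather than reformatting (a sketch; /dev/sdb1 as the journal device is my layout, and the OST must be unmounted first):

```shell
# Create the labeled journal device, drop the OST's internal journal,
# then attach the external one by label.
mke2fs -b 4096 -O journal_dev -L ioio-journal /dev/sdb1
tune2fs -O ^has_journal /dev/md0
tune2fs -J device=LABEL=ioio-journal /dev/md0
```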

[root@lustrefour ~]# cat /etc/fstab
LABEL=/                 /                       ext3    defaults        1 1
tmpfs                   /dev/shm                tmpfs   defaults        0 0
devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
sysfs                   /sys                    sysfs   defaults        0 0
proc                    /proc                   proc    defaults        0 0
LABEL=ioio-OST0001      /mnt/ost00              lustre  defaults,_netdev 0 0
192.168.0.7@tcp0:/ioio  /mnt/ioio               lustre  defaults,_netdev,noauto 0 0

[root@lustrefour ~]#


[root@lustreone bin]# ./ost-survey -s 4096 /mnt/ioio
./ost-survey: 01/24/09 OST speed survey on /mnt/ioio from 192.168.0.7@tcp
Number of Active OST devices : 2
Worst  Read OST indx: 0 speed: 38.789337
Best   Read OST indx: 1 speed: 40.017201
Read Average: 39.403269 +/- 0.613932 MB/s
Worst  Write OST indx: 0 speed: 49.227064
Best   Write OST indx: 1 speed: 78.673564
Write Average: 63.950314 +/- 14.723250 MB/s
Ost#  Read(MB/s)  Write(MB/s)  Read-time  Write-time
----------------------------------------------------
0     38.789       49.227        105.596      83.206
1     40.017       78.674        102.356      52.063
[root@lustreone bin]# ./ost-survey -s 1024 /mnt/ioio
./ost-survey: 01/24/09 OST speed survey on /mnt/ioio from 192.168.0.7@tcp
Number of Active OST devices : 2
Worst  Read OST indx: 0 speed: 38.559620
Best   Read OST indx: 1 speed: 40.053787
Read Average: 39.306704 +/- 0.747083 MB/s
Worst  Write OST indx: 0 speed: 71.623744
Best   Write OST indx: 1 speed: 82.764897
Write Average: 77.194320 +/- 5.570577 MB/s
Ost#  Read(MB/s)  Write(MB/s)  Read-time  Write-time
----------------------------------------------------
0     38.560       71.624        26.556      14.297
1     40.054       82.765        25.566      12.372
[root@lustreone bin]# dd of=/mnt/ioio/bigfileMGS if=/dev/zero bs=1048576
3536+0 records in
3536+0 records out
3707764736 bytes (3.7 GB) copied, 38.4775 seconds, 96.4 MB/s
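The 96.4 MB/s above includes page-cache effects, since plain dd reports before data hits the disks.  A variant that syncs before reporting gives a more honest number (TESTFILE here defaults to a temp file; point it at a file on /mnt/ioio to measure Lustre):

```shell
# Stream 64 MiB and fdatasync before dd prints its rate, so the page
# cache does not inflate the result.
TESTFILE="${TESTFILE:-$(mktemp)}"
dd if=/dev/zero of="$TESTFILE" bs=1M count=64 conv=fdatasync
```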

lustreone, lustretwo, lustrethree, and lustrefour all have the same modprobe.conf:

[root@lustrefour ~]# cat /etc/modprobe.conf
alias eth0 e1000
alias eth1 e1000
alias scsi_hostadapter pata_marvell
alias scsi_hostadapter1 ata_piix
options lnet networks=tcp
alias eth2 sky2
alias eth3 sky2
alias eth4 sky2
alias eth5 sky2
alias bond0 bonding
options bonding miimon=100 mode=4
[root@lustrefour ~]#
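One thing I am going to try here (an assumption on my part, not something I have verified on these boxes): with mode=4 and the default layer2 hash policy, every flow between the same two MAC addresses hashes onto a single slave, which by itself would cap any one node pair near one gigabit link.  The layer3+4 policy spreads flows by IP and port instead:

```
options bonding miimon=100 mode=4 xmit_hash_policy=layer3+4
```

The switch-side LACP trunk would need a matching load-balance setting for receive traffic to spread as well.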

When I do the same from all clients, I can watch gnome-system-monitor and the send and receive traffic from the various nodes reaches a 209 MiB/s plateau.  Uggh.



      

