[Lustre-discuss] High difference in I/O network traffic in lustre client

Lex lexluthor87 at gmail.com
Mon Feb 1 01:44:28 PST 2010


Hi guys,

While trying to improve the performance of our storage system, I found some
strange behaviour that I can't explain myself, so I'm posting here in the hope
that someone can help me clarify it.

I'm using Lustre clients as web servers for file downloads. When the system is
under heavy load (about 12,000 concurrent connections across 8 web servers,
each one a Lustre client), %iowait is pushed to about 98% and the load average
is about 1-2000, which I think is a terrible number for a load average. It
seems to come only from %iowait, because I could still run almost every command
normally over ssh. In that case, though, the inbound and outbound network
traffic are almost the same (although only a few MB/s).
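For anyone who wants to check the %iowait figure independently of top or sar, here is a minimal sketch, assuming the standard Linux /proc/stat layout where the aggregate "cpu" line lists user, nice, system, idle, iowait, irq, softirq and steal in that order. The function names are illustrative only and are not part of any Lustre tool.

```python
def parse_cpu_line(stat_text):
    """Return the counters of the aggregate 'cpu' line of /proc/stat as ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two /proc/stat snapshots."""
    first = parse_cpu_line(before)
    second = parse_cpu_line(after)
    # Sum only user..steal (first eight fields); guest time, where present,
    # is already accounted for inside the user counter.
    deltas = [b - a for a, b in zip(first[:8], second[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is the iowait counter
```

To use it, read /proc/stat twice a few seconds apart and pass both texts to iowait_percent. A high value with responsive ssh sessions is consistent with processes blocked on I/O rather than a CPU shortage.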

The odd thing is that right now, with only about 3,500 concurrent
connections, the load average is about 50 (still too high, right?), iowait is
about 70%, and the difference between receive and transmit network traffic is
very large, about 10-20 MB (see the attached file, please).
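One way to see where the receive/transmit gap sits is to measure each interface separately. The following is a rough sketch, assuming the usual Linux /proc/net/dev format (receive bytes in the first column after the interface name, transmit bytes in the ninth). The function names are hypothetical, and counter wraparound is ignored.

```python
def parse_net_dev(text):
    """Map interface name -> (rx_bytes, tx_bytes) from /proc/net/dev text."""
    counters = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue  # not a complete counter line
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def rates_mb_per_s(before, after, seconds):
    """Per-interface (rx, tx) rates in MB/s between two snapshots."""
    first = parse_net_dev(before)
    second = parse_net_dev(after)
    rates = {}
    for name, (rx_after, tx_after) in second.items():
        if name not in first:
            continue  # interface appeared between the two samples
        rx_before, tx_before = first[name]
        rates[name] = (
            (rx_after - rx_before) / seconds / 1e6,
            (tx_after - tx_before) / seconds / 1e6,
        )
    return rates
```

Comparing the receive rate on the LNET interface with the transmit rate on the WAN interface shows how much data is read from Lustre but never delivered to web clients.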

We have only about 20 connections to our local Lustre storage system:

netstat -nat | grep 192.168.1.75
tcp        0    560 192.168.1.75:1023           192.168.1.85:988            ESTABLISHED
tcp        0      0 192.168.1.75:1023           192.168.1.81:988            ESTABLISHED
tcp        0      0 192.168.1.75:988            192.168.1.85:1023           ESTABLISHED
tcp        0      0 192.168.1.75:988            192.168.1.85:1022           ESTABLISHED
tcp        0      0 192.168.1.75:988            192.168.1.81:1023           ESTABLISHED
tcp        0      0 192.168.1.75:988            192.168.1.81:1022           ESTABLISHED
tcp        0      0 192.168.1.75:988            192.168.1.100:1023          ESTABLISHED
tcp        0      0 192.168.1.75:1021           192.168.1.78:988            ESTABLISHED
tcp        0      0 192.168.1.75:1023           192.168.1.78:988            ESTABLISHED
tcp        0      0 192.168.1.75:1022           192.168.1.78:988            ESTABLISHED
tcp        0    560 192.168.1.75:1023           192.168.1.100:988           ESTABLISHED

and about 400 connections with clients from the Internet:

netstat -nat | grep out_wan_ip | grep EST | wc -l
407
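The two netstat counts can also be broken down per peer. This is a small sketch, assuming the usual `netstat -nat` column order (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State) and IPv4 addresses; established_by_peer is an illustrative name.

```python
from collections import Counter


def established_by_peer(netstat_text):
    """Count ESTABLISHED TCP connections per remote IP in `netstat -nat` output."""
    peers = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue  # skip headers and incomplete lines
        if fields[5] != "ESTABLISHED":
            continue
        # The foreign address looks like "ip:port"; drop the port.
        peer_ip = fields[4].rsplit(":", 1)[0]
        peers[peer_ip] += 1
    return peers
```

Feeding it the full `netstat -nat` output and calling most_common() on the result lists the peers holding the most connections, which helps separate the Lustre servers on port 988 from ordinary web clients.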

We're currently using 2 Gigabit Ethernet cards, one on the 192.168.1.0/24
network for LNET and the other with the WAN IP for delivering files out to the
Internet, and about 15 MB/s of throughput is "lost" somehow!

So, my questions are:

- Does anyone have an idea or hint about the high load on our Lustre clients /
web servers as described above? I followed this link
<http://rackerhacker.com/2008/03/11/hunting-down-elusive-sources-of-iowait/>
and found that the kjournald process is the main "culprit" (on our OSTs, it
was the "ll" process).
- What causes the large difference between the receive and transmit directions
on our Lustre clients / web servers?


I'm really stressed by the poor performance of our storage system and hope
someone here can help point something out.

Any help would be highly appreciated.

Best regards


-------------- next part --------------
A non-text attachment was scrubbed...
Name: iotraf.jpg
Type: image/jpeg
Size: 66692 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20100201/1aa231ee/attachment.jpg>

