[Lustre-discuss] Reply High difference in I/O network traffic in lustre client

Andreas Dilger adilger at sun.com
Mon Feb 1 12:25:36 PST 2010


On 2010-02-01, at 08:29, Lex wrote:
> I have 8 OSSs and 8 OSTs. Hardware info:
>
> CPU: Intel(R) Xeon E5420 2.5 GHz, chipset Intel 5000P
> 8GB RAM
> 8 x 1.5TB hard disks, divided into 2 arrays with an Adaptec 5805
> RAID controller
>
> We are using 2 x 1 Gigabit Ethernet cards with Linux bonding ( OS is
> CentOS 5.3 ). Our lustre clients work as web servers for file
> downloads, so many files are being read by web clients; I can't
> provide you an exact number. ( We have millions of files in our
> lustre storage system; unfortunately, quite a lot of them are small
> files: Linux soft links. ) Files are "striped" over 2 OSTs each;
> some are striped over all our OSTs ( fewer than those striped over
> 2 OSTs ).
>
> Do you have any idea for my issue ?

If you are using small files, you shouldn't be striping your files
over multiple OSTs.  That increases the workload on the OSTs (size
and lock overhead) without providing any benefit, because the data
is only stored on the first OST (assuming 1MB stripe size, and file
size <= 1MB).
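
To make the arithmetic concrete, here is a minimal sketch in Python. It is
only an illustration of round-robin (RAID-0 style) striping, not a Lustre
API: the function name objects_with_data and its default values are
assumptions chosen to match the 1MB stripe size and 2-OST layout discussed
in this thread.

```python
def objects_with_data(file_size, stripe_size=1 << 20, stripe_count=2):
    """Return how many of a file's OST objects actually hold data.

    Hypothetical helper for illustration only. It assumes plain
    round-robin striping: chunk 0 goes to the first object, chunk 1
    to the second, and so on, wrapping around after stripe_count.
    """
    if file_size <= 0:
        return 0
    # Number of stripe_size chunks needed to hold the file (ceiling division).
    chunks = -(-file_size // stripe_size)
    # Chunks are dealt out round-robin, so at most stripe_count objects are used.
    return min(chunks, stripe_count)


if __name__ == "__main__":
    for size in (4096, 1 << 20, (1 << 20) + 1, 10 << 20):
        print(size, objects_with_data(size))
```

Under these assumptions, any file of 1MB or less has data on a single
object, so the second stripe only adds object and lock overhead. For
directories of small files, a stripe count of 1 (for example set with
"lfs setstripe -c 1" on the directory) avoids that overhead for newly
created files.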

> On Mon, Feb 1, 2010 at 8:05 PM, Mag Gam <magawake at gmail.com> wrote:
> How many OSSs and OSTs do you have? What type of hardware are they
> running on? What type of network connection? Which OSS is the file
> you are trying to access on? Are the files striped?
>
> On Mon, Feb 1, 2010 at 4:44 AM, Lex <lexluthor87 at gmail.com> wrote:
> > Hi guys
> >
> > In an effort to improve our storage system performance, I found some
> > strange signs but, unfortunately, couldn't explain them myself. So I
> > am posting here so that you guys can help me clarify them.
> >
> > I'm using lustre clients as web servers for file downloads. When our
> > system is under heavy load ( about 12,000 concurrent connections for 8
> > web servers - lustre clients ), %iowait is pushed to about 98% and the
> > load average is about 1-2000 !!!! ( just because of %iowait; I could
> > still run almost every command normally over ssh ). I think that is a
> > terrible number for a load average! But in that case, the in and out
> > network traffic are almost the same ( although only a few MB/s :( )
> >
> > The odd thing is, right now, when we only have about 3,500 concurrent
> > connections, the load average is about 50 ( still too big, right? ),
> > iowait is about 70%, and the difference between receive and transmit
> > network traffic is too high, about 10-20MB ( see attached file, please )
> >
> > We have only about 20 connections to our local lustre storage system:
> >
> > netstat -nat | grep 192.168.1.75
> > tcp        0    560 192.168.1.75:1023           192.168.1.85:988            ESTABLISHED
> > tcp        0      0 192.168.1.75:1023           192.168.1.81:988            ESTABLISHED
> > tcp        0      0 192.168.1.75:988            192.168.1.85:1023           ESTABLISHED
> > tcp        0      0 192.168.1.75:988            192.168.1.85:1022           ESTABLISHED
> > tcp        0      0 192.168.1.75:988            192.168.1.81:1023           ESTABLISHED
> > tcp        0      0 192.168.1.75:988            192.168.1.81:1022           ESTABLISHED
> > tcp        0      0 192.168.1.75:988            192.168.1.100:1023          ESTABLISHED
> > tcp        0      0 192.168.1.75:1021           192.168.1.78:988            ESTABLISHED
> > tcp        0      0 192.168.1.75:1023           192.168.1.78:988            ESTABLISHED
> > tcp        0      0 192.168.1.75:1022           192.168.1.78:988            ESTABLISHED
> > tcp        0    560 192.168.1.75:1023           192.168.1.100:988           ESTABLISHED
> >
> > and about 400 connections with clients from the internet:
> >
> > netstat -nat | grep out_wan_ip | grep EST | wc -l
> > 407
> >
> > We're currently using 2 Gigabit Ethernet cards, one on the
> > 192.168.1.0/24 network for lnet and the other with a WAN IP for
> > delivering files out to the internet, and about 15MB/s of throughput
> > was "lost" somehow !!!!
> >
> > So, my question is:
> >
> > - Does anyone have an idea or hint about the high load situation with
> > our lustre client - web server as I described above?  I followed this
> > link and found out that the kjournald process is the main "culprit"
> > ( on our OST, it was the "ll" process )
> > - What causes such a large difference between the receive and transmit
> > directions on our lustre client - web server?
> >
> >
> > I'm really stressed by the poor performance of our storage system and
> > hope anyone here can help me point something out
> >
> > Any help would be highly appreciated
> >
> > Best regards
> >
> >
> >
> >
> >
> > _______________________________________________
> > Lustre-discuss mailing list
> > Lustre-discuss at lists.lustre.org
> > http://lists.lustre.org/mailman/listinfo/lustre-discuss
> >
> >


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



