[Lustre-discuss] slow direct_io , slow journal .. in OST log

Lex lexluthor87 at gmail.com
Mon Jan 25 08:35:58 PST 2010


I can't stop I/O to the Lustre filesystem as you suggested, because that
would take our service down. Instead, I can run hdparm against our backup
OST, which has exactly the same hardware as the primary one. This is the
result:

hdparm -t /dev/sdc

/dev/sdc:
 Timing buffered disk reads:  1318 MB in  3.00 seconds = 439.01 MB/sec
HDIO_DRIVE_CMD(null) (wait for flush complete) failed: Inappropriate ioctl for device

Is that helpful?
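
If it would help, I can also run the same read test on the drbd devices on
the backup node, roughly like this (just a sketch; I'm assuming the backup
node exposes the same /dev/drbd3 and /dev/drbd6 device names as the primary):

# buffered read test through the DRBD layer (read-only, non-destructive)
hdparm -t /dev/drbd3
hdparm -t /dev/drbd6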

About the performance numbers for the drbd or underlying RAID devices, could
you please tell me exactly what information you want?
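
For example, would a direct-I/O read from the backup OST's devices be the
kind of number you mean? A rough sketch (device names and the 1 GiB size are
placeholders; it only reads, so nothing on the backup is overwritten):

dd if=/dev/sdc of=/dev/null bs=1M count=1024 iflag=direct    # raw RAID array
dd if=/dev/drbd6 of=/dev/null bs=1M count=1024 iflag=direct  # through DRBD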

Many thanks

On Mon, Jan 25, 2010 at 11:43 AM, Aaron Knister <aaron.knister at gmail.com> wrote:

> I don't necessarily think there's anything wrong with using drbd or running
> it over gigabit ethernet. If you stop all I/O to the lustre filesystem, what
> does an hdparm -t show on the sdc and drbd devices? Do you have any
> performance numbers for the drbd or underlying raid devices?
>
> On Jan 24, 2010, at 11:17 PM, Lex wrote:
>
> Thank you for your fast reply, Aaron
>
> I'm using Gigabit Ethernet to synchronize data to our fail-over node.
> Is there something wrong with that? Please tell me.
>
> On Mon, Jan 25, 2010 at 10:35 AM, Aaron Knister <aaron.knister at gmail.com> wrote:
>
>> My best guess (and please correct me if I'm wrong) is that those messages
>> are because the underlying block devices are slow to respond to i/o
>> requests. It looks like you're using DRBD. What's your interconnect?
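>>
>> For example (just a sketch on my side; "r0" is a placeholder resource name),
>> you could check the replication state, protocol and sync rate on the OSS with:
>>
>> cat /proc/drbd     # connection state, roles and the protocol letter (A/B/C)
>> drbdadm dump r0    # effective DRBD configuration, including the syncer rate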
>>
>> On Jan 24, 2010, at 9:42 PM, Lex wrote:
>>
>> Hi list
>>
>> I have one OSS with hardware like this:
>>
>> CPU: Intel(R) Xeon E5420 2.5 GHz
>> Chipset: Intel 5000P
>> RAM: 8 GB
>>
>> On this OSS we are using two RAID-5 arrays as OSTs (each with 4 x 1.5 TB
>> hard drives on an Adaptec 5805 RAID controller).
>>
>> It worked quite smoothly before, but about two weeks ago I started seeing
>> many warnings (or so I thought) like these in /var/log/messages:
>>
>> Jan 25 08:41:23 OST6 kernel: Lustre:
>> 9587:0:(filter_io_26.c:706:filter_commitrw_write()) lustre-OST0006: slow
>> direct_io 35s
>> Jan 25 08:41:34 OST6 kernel: Lustre:
>> 9608:0:(filter_io_26.c:706:filter_commitrw_write()) lustre-OST0006: slow
>> direct_io 41s
>> Jan 25 08:41:34 OST6 kernel: Lustre:
>> 9608:0:(filter_io_26.c:706:filter_commitrw_write()) Skipped 2 previous
>> similar messages
>> Jan 25 08:41:35 OST6 kernel: Lustre:
>> 9645:0:(filter_io_26.c:706:filter_commitrw_write()) lustre-OST0006: slow
>> direct_io 43s
>> Jan 25 08:58:10 OST6 kernel: Lustre:
>> 9646:0:(filter_io_26.c:706:filter_commitrw_write()) lustre-OST0006: slow
>> direct_io 31s
>> Jan 25 08:59:39 OST6 kernel: Lustre:
>> 9609:0:(filter_io_26.c:706:filter_commitrw_write()) lustre-OST0006: slow
>> direct_io 30s
>> Jan 25 09:01:05 OST6 kernel: Lustre:
>> 9587:0:(filter_io_26.c:706:filter_commitrw_write()) lustre-OST0006: slow
>> direct_io 33s
>> Jan 25 09:03:23 OST6 kernel: Lustre:
>> 9633:0:(filter_io_26.c:706:filter_commitrw_write()) lustre-OST0006: slow
>> direct_io 32s
>> Jan 25 09:11:25 OST6 kernel: Lustre:
>> 9585:0:(filter_io_26.c:706:filter_commitrw_write()) lustre-OST0006: slow
>> direct_io 36s
>>
>> I googled around and found that it could be a problem with oss_num_threads,
>> so I brought it down to 64 (the formula I found in the 1.8 manual,
>> thread_number = RAM * CPU cores / 128 MB, gives 8192 MB * 4 / 128 MB = 256
>> for this machine):
>>
>> options ost oss_num_threads=64
>>
>> It still didn't help.
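>>
>> In case it matters, that line is in /etc/modprobe.conf on the OSS, and the
>> OSTs were remounted afterwards. If I understand the 1.8 parameters correctly,
>> the number of I/O threads actually running can be checked like this (please
>> correct me if the names are wrong):
>>
>> lctl get_param ost.OSS.ost_io.threads_started ost.OSS.ost_io.threads_max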
>>
>> I thought these were only harmless warnings, but maybe I'm wrong: our
>> performance has dropped quite heavily (it may be for some other reason, but
>> for now the slow direct_io problem is my main suspect).
>>
>> iostat -m 1 1
>> Linux 2.6.18-92.1.17.el5_lustre.1.8.0custom (OST6)      01/25/2010
>>
>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>            0.01    0.02    2.86   25.01    0.00   72.10
>>
>> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
>> sda               1.30         0.01         0.00      11386       3469
>> sdb               1.30         0.01         0.00      11531       3469
>> sdc             131.50        12.40         0.26   11793218     249934
>> sdd             178.46        18.00         0.26   17124065     250334
>> md2               3.33         0.02         0.00      22915       2634
>> md1               0.00         0.00         0.00          0          0
>> md0               0.00         0.00         0.00          0          0
>> drbd3           480.10        12.39         0.26   11789047     249639
>> drbd6           565.85        14.89         0.26   14168452     249211
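>>
>> If it would be more useful, I can also capture extended statistics with
>> per-device latency and utilization, e.g.:
>>
>> iostat -x 1 5   # the await and %util columns should show whether sdc/sdd
>>                 # or the drbd devices are the bottleneck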
>>
>>
>> So, could anyone please tell me whether these warnings impact our system
>> performance or not? And if they do, please give me a solution or some
>> advice on how to resolve it.
>>
>> Best regards
>>
>>  _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>
>>
>>
>
>

