[Lustre-discuss] slow direct_io , slow journal .. in OST log

Lex lexluthor87 at gmail.com
Thu Jan 28 23:04:46 PST 2010


Hi guys, do you have any ideas about my issue and my questions below?

On Wed, Jan 27, 2010 at 8:59 AM, Lex <lexluthor87 at gmail.com> wrote:

> Hi all
>
> I heard somewhere about an oversubscribing issue related to OST threads,
> but I wonder why, even though I calculated it with the formula I found in
> the manual ( thread_number = RAM * CPU cores / 128 MB - do correct me if
> there's something wrong with it, please ), the oversubscribing warning
> still appears.
>
> Maybe I have to choose my own value by trial and error, but is there any
> explanation for this situation?
>
> @Erik : could you please describe your bottleneck problem with the journal
> device for me, in as much detail as possible?
>
>
> On Tue, Jan 26, 2010 at 10:00 PM, Erik Froese <erik.froese at gmail.com> wrote:
>
>> Sorry Lex, I misread your email. I saw similar messages about my journal
>> devices. Each OST is an ext3 (plus extra features) filesystem, and each FS
>> has an associated journal that CAN be on a separate device. It's supposed
>> to speed up small-file operations. Mine were oversubscribed and became a
>> bottleneck.
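>>
>> For example, this is roughly how an external journal gets set up (a
>> minimal sketch - the device names and MGS NID here are made up, and the
>> journal's block size has to match the OST filesystem's):
>>
>>     # create a dedicated external journal on a small, fast device
>>     mke2fs -O journal_dev -b 4096 /dev/sdj1
>>     # format the OST to use that journal device
>>     mkfs.lustre --fsname=lustre --ost --mgsnode=192.168.0.10@tcp \
>>         --mkfsoptions="-J device=/dev/sdj1" /dev/sdc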
>>
>> Erik
>>
>>
>> On Mon, Jan 25, 2010 at 11:40 AM, Lex <lexluthor87 at gmail.com> wrote:
>>
>>> Sorry Erik if I'm raising such a "bad" question, but could you tell me
>>> more about the OST journal device? I don't know what it is, and I haven't
>>> seen it before in the Lustre manual.
>>>
>>> Best regards
>>>
>>>
>>> On Mon, Jan 25, 2010 at 10:52 PM, Erik Froese <erik.froese at gmail.com> wrote:
>>>
>>>> Is each OST's journal on its own physical disk? I've seen those messages
>>>> when there isn't enough hardware dedicated to the journal device.
>>>> Erik
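>>>>
>>>> P.S. A quick way to check is dumpe2fs (from e2fsprogs): an OST with an
>>>> external journal lists a journal device in its superblock, while an
>>>> internal one shows up as a journal inode.
>>>>
>>>>     dumpe2fs -h /dev/sdc | grep -i journal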
>>>>
>>>> On Sun, Jan 24, 2010 at 11:43 PM, Aaron Knister <aaron.knister at gmail.com> wrote:
>>>>
>>>>> I don't necessarily think there's anything wrong with using DRBD or
>>>>> running it over gigabit Ethernet. If you stop all I/O to the Lustre
>>>>> filesystem, what does an hdparm -t show on the sdc and drbd devices? Do
>>>>> you have any performance numbers for the DRBD or underlying RAID devices?
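>>>>>
>>>>> Something like this, run on the OSS while the filesystem is quiet (device
>>>>> names taken from your iostat output below):
>>>>>
>>>>>     hdparm -t /dev/sdc
>>>>>     hdparm -t /dev/sdd
>>>>>     hdparm -t /dev/drbd3
>>>>>     hdparm -t /dev/drbd6
>>>>>
>>>>> A big gap between the raw sdX numbers and the drbdX numbers would point
>>>>> at DRBD or the replication link rather than the disks.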
>>>>>
>>>>> On Jan 24, 2010, at 11:17 PM, Lex wrote:
>>>>>
>>>>> Thank you for your fast reply, Aaron
>>>>>
>>>>> I'm using gigabit Ethernet to synchronize data to our fail-over node. Is
>>>>> there something wrong with that? Please tell me.
>>>>>
>>>>> On Mon, Jan 25, 2010 at 10:35 AM, Aaron Knister <aaron.knister at gmail.com> wrote:
>>>>>
>>>>>> My best guess (and please correct me if I'm wrong) is that those
>>>>>> messages appear because the underlying block devices are slow to respond
>>>>>> to I/O requests. It looks like you're using DRBD. What's your
>>>>>> interconnect?
>>>>>>
>>>>>> On Jan 24, 2010, at 9:42 PM, Lex wrote:
>>>>>>
>>>>>> Hi list
>>>>>>
>>>>>> I have one OSS with hardware like this:
>>>>>>
>>>>>> CPU: Intel(R) Xeon E5420 2.5 GHz
>>>>>> Chipset: Intel 5000P
>>>>>> RAM: 8 GB
>>>>>>
>>>>>> On this OSS we are using 2 RAID-5 arrays as OSTs (each has 4 x 1.5 TB
>>>>>> hard drives on an Adaptec 5805 RAID controller).
>>>>>>
>>>>>> It worked quite smoothly before, but about 2 weeks ago I started seeing
>>>>>> many warnings (I think they are warnings) like this in /var/log/messages:
>>>>>>
>>>>>> Jan 25 08:41:23 OST6 kernel: Lustre: 9587:0:(filter_io_26.c:706:filter_commitrw_write()) lustre-OST0006: slow direct_io 35s
>>>>>> Jan 25 08:41:34 OST6 kernel: Lustre: 9608:0:(filter_io_26.c:706:filter_commitrw_write()) lustre-OST0006: slow direct_io 41s
>>>>>> Jan 25 08:41:34 OST6 kernel: Lustre: 9608:0:(filter_io_26.c:706:filter_commitrw_write()) Skipped 2 previous similar messages
>>>>>> Jan 25 08:41:35 OST6 kernel: Lustre: 9645:0:(filter_io_26.c:706:filter_commitrw_write()) lustre-OST0006: slow direct_io 43s
>>>>>> Jan 25 08:58:10 OST6 kernel: Lustre: 9646:0:(filter_io_26.c:706:filter_commitrw_write()) lustre-OST0006: slow direct_io 31s
>>>>>> Jan 25 08:59:39 OST6 kernel: Lustre: 9609:0:(filter_io_26.c:706:filter_commitrw_write()) lustre-OST0006: slow direct_io 30s
>>>>>> Jan 25 09:01:05 OST6 kernel: Lustre: 9587:0:(filter_io_26.c:706:filter_commitrw_write()) lustre-OST0006: slow direct_io 33s
>>>>>> Jan 25 09:03:23 OST6 kernel: Lustre: 9633:0:(filter_io_26.c:706:filter_commitrw_write()) lustre-OST0006: slow direct_io 32s
>>>>>> Jan 25 09:11:25 OST6 kernel: Lustre: 9585:0:(filter_io_26.c:706:filter_commitrw_write()) lustre-OST0006: slow direct_io 36s
>>>>>>
>>>>>> I googled around and found that it could be a problem with
>>>>>> oss_num_threads, so I brought it down to 64 (by the formula I found in
>>>>>> the 1.8 manual, thread_number = RAM * CPU cores / 128 MB, the calculated
>>>>>> value would be 256):
>>>>>>
>>>>>> options ost oss_num_threads=64
>>>>>>
>>>>>> It still didn't help.
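>>>>>>
>>>>>> After reloading the modules I also checked that the setting took effect
>>>>>> (parameter names as I understand the 1.8 /proc layout - please correct
>>>>>> me if they differ on your version):
>>>>>>
>>>>>>     # OSS I/O threads currently running vs. the configured ceiling
>>>>>>     lctl get_param ost.OSS.ost_io.threads_started
>>>>>>     lctl get_param ost.OSS.ost_io.threads_max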
>>>>>>
>>>>>> I thought it was only a harmless warning, but maybe I'm wrong: our
>>>>>> performance has gone down quite heavily (maybe for some other reason,
>>>>>> but for now the slow direct_io problem is my only suspect).
>>>>>>
>>>>>> iostat -m 1 1
>>>>>> Linux 2.6.18-92.1.17.el5_lustre.1.8.0custom (OST6)      01/25/2010
>>>>>>
>>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>>            0.01    0.02    2.86   25.01    0.00   72.10
>>>>>>
>>>>>> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
>>>>>> sda               1.30         0.01         0.00      11386       3469
>>>>>> sdb               1.30         0.01         0.00      11531       3469
>>>>>> sdc             131.50        12.40         0.26   11793218     249934
>>>>>> sdd             178.46        18.00         0.26   17124065     250334
>>>>>> md2               3.33         0.02         0.00      22915       2634
>>>>>> md1               0.00         0.00         0.00          0          0
>>>>>> md0               0.00         0.00         0.00          0          0
>>>>>> drbd3           480.10        12.39         0.26   11789047     249639
>>>>>> drbd6           565.85        14.89         0.26   14168452     249211
>>>>>>
>>>>>>
>>>>>> So, could anyone please tell me whether this warning impacts our system
>>>>>> performance or not? And if it does, please give me a solution or some
>>>>>> advice to resolve it.
>>>>>>
>>>>>> Best regards
>>>>>>
>>>>>
>>>>
>>>
>>
>