[Lustre-discuss] disk fragmented I/Os
Lu Wang
wanglu at ihep.ac.cn
Wed Mar 31 22:44:59 PDT 2010
Hi,
We partitioned the device /dev/sda with the "parted" command, using a GUID Partition Table (GPT). Therefore the sfdisk result is not correct:
[root at boss32 ~]# sfdisk -uS -l /dev/sda
WARNING: GPT (GUID Partition Table) detected on '/dev/sda'! The util sfdisk doesn't support GPT. Use GNU Parted.
Disk /dev/sda: 1094112 cylinders, 255 heads, 63 sectors/track
Units = sectors of 512 bytes, counting from 0
Device Boot Start End #sectors Id System
/dev/sda1 1 4294967295 4294967295 ee EFI GPT
start: (c,h,s) expected (0,0,2) found (0,0,1)
/dev/sda2 0 - 0 0 Empty
/dev/sda3 0 - 0 0 Empty
/dev/sda4 0 - 0 0 Empty
With "parted", you can see the correct result:
[root at boss32 ~]# parted /dev/sda
GNU Parted 1.8.1
Using /dev/sda
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) p
Model: TOYOU NetStor_iSUM510 (scsi)
Disk /dev/sda: 8999GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Number Start End Size File system Name Flags
1 17.4kB 4500GB 4500GB ext3
2 4500GB 8999GB 4499GB ext3
(parted)
------------------
Lu Wang
2010-04-01
-------------------------------------------------------------
From: Kevin Van Maren
Date: 2010-04-01 11:09:25
To: Lu Wang
Cc: lustre-discuss
Subject: Re: [Lustre-discuss] Re: Re: disk fragmented I/Os
Thanks for the "df" output -- at 96% full the problem is likely Lustre
fragmenting the IO because it cannot allocate contiguous space on the
OST. If possible, free up a bunch of space on the OSTs (i.e., delete old
large files) and see if it improves. Still not clear to me why you
don't have more outstanding IOs to the disk.
8 TiB devices would only have been 1% less capacity than you have with
two partitions. Possibly related, did you ensure partitions were
aligned on the RAID boundary of the underlying device? RAID alignment
is the main reason it is recommended to not use drive partitions with
Lustre. "sfdisk -uS -l /dev/sda" will show the actual start.
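As a rough sketch of the alignment check described above (the start sector and stripe width below are illustrative examples, not values taken from this system):

```shell
# Check whether a partition's start sector falls on the RAID stripe boundary.
# START and ALIGN are example values: substitute the start sector reported by
# "sfdisk -uS -l" and your array's stripe width in 512-byte sectors.
START=34          # e.g. an old parted default start sector for GPT
ALIGN=2048        # e.g. a 1 MiB stripe = 2048 sectors of 512 bytes
if [ $((START % ALIGN)) -eq 0 ]; then
    echo "aligned"
else
    echo "misaligned by $((START % ALIGN)) sectors"
fi
```

A misaligned start means every full-stripe write from Lustre straddles two stripes on the array, forcing read-modify-write cycles.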
Kevin
Lu Wang wrote:
> We are using lustre 1.8.1.1 on 2.6.18-128.7.1.el5. The disk controller is NetStor_iSUM510, driver is qla2xxx (8.02.00.06.05.03-k).
>
> We made 2 partitions for each disk volume:
> /dev/sda1 4325574520 3911425648 194422312 96% /lustre/ost1
> /dev/sda2 4324980788 3898888124 206396204 95% /lustre/ost2
> /dev/sdb1 4325574520 3909042320 196805640 96% /lustre/ost3
> /dev/sdb2 4324980788 3920306524 184977804 96% /lustre/ost4
> /dev/sdc1 4325574520 3868328108 237519852 95% /lustre/ost5
> /dev/sdc2 4324980788 3921774384 183509944 96% /lustre/ost6
> /dev/sdd1 4325574520 3911662272 194185688 96% /lustre/ost7
> /dev/sdd2 4324980788 3884415428 220868900 95% /lustre/ost8
>
> This is because Lustre cannot support OSTs larger than 8 TB.
>
>
>
>
> ------------------
> Lu Wang
> 2010-04-01
>
> -------------------------------------------------------------
> From: Kevin Van Maren
> Date: 2010-03-31 19:56:59
> To: Lu Wang
> Cc: lustre-discuss
> Subject: Re: [Lustre-discuss] disk fragmented I/Os
>
> Lu Wang wrote:
>
>> Dear list,
>> We have got a brw_stats result for one OST since it was up. According to this statistic, 50 percent of disk I/Os are fragmented. I found an earlier discussion on this list that referred to this question:
>> http://lists.lustre.org/pipermail/lustre-discuss/2009-August/011433.html
>> It seems it is ideal to have 100% of disk I/Os with fragment "1" or "0". I don't know why the I/Os are fragmented, since I found that max_sectors_kb is big enough (32767 KB, about 32 MB) for the biggest disk I/O size (according to brw_stats, it is 1 MB).
>>
>> # cat /sys/block/sda/queue/max_sectors_kb
>> 32767
>> # cat /sys/block/sda/queue/max_hw_sectors_kb
>> 32767
>>
>>
>
> So the drive is limited to 32MB IOs. Below it is clear that you are
> seeing fragmentation, so the question becomes why are the IOs being
> broken up? It is very unlikely Lustre is breaking up the IO willingly,
> so most likely something in the IO stack is restricting the IO sizes.
>
> What are you using for an OST, and what controller/driver/driver version?
>
> What version of Lustre, and what version of Linux are you using on the
> OSS node?
>
>
>> read | write
>> pages per bulk r/w rpcs % cum % | rpcs % cum %
>> 128: 4976083 20 45 | 198864 4 13
>> 256: 13457144 54 100 | 3522333 86 100
>>
>>
> So the clients are doing 1MB RPCs to the server (which is good).
>
>
>> read | write
>> disk fragmented I/Os ios % cum % | ios % cum %
>> 0: 9821 0 0 | 0 0 0
>> 1: 11933478 48 48 | 630964 15 15
>> 2: 12726392 51 99 | 3350479 82 97
>> 3: 155476 0 99 | 84465 2 99
>>
>>
> But all your IOs are being broken in half.
>
>
>> read | write
>> disk I/Os in flight ios % cum % | ios % cum %
>> 1: 10954265 28 28 | 3781021 49 49
>> 2: 9217023 24 53 | 3329128 43 93
>> 3: 6063548 15 69 | 272981 3 97
>>
>>
>
> This is really bad -- it seems that it only ever issues one write at a
> time to the disks.
> Lustre would normally issue up to 31, so there may be something about
> your disk or driver
> preventing multiple outstanding IOs.
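One thing worth checking here is the device's SCSI queue depth, since a depth of 1 would explain never having more than one outstanding write. This is a guarded probe; the sysfs path is standard for SCSI devices but may be absent on other storage stacks:

```shell
# Print the SCSI queue depth for sda, if the kernel exposes it.
QD_FILE=/sys/block/sda/device/queue_depth
if [ -r "$QD_FILE" ]; then
    echo "queue_depth: $(cat "$QD_FILE")"
else
    echo "queue_depth not exposed for this device"
fi
```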
>
>
>> read | write
>> disk I/O size ios % cum % | ios % cum %
>> 256K: 4464264 11 29 | 288737 3 11
>> 512K: 24846133 65 94 | 5997373 78 90
>> 1M: 1951161 5 100 | 747214 9 100
>>
>>
>
> Basically this is saying that nearly all 1MB IOs are being broken into
> 512KB pieces.
>
>
>> I have 2 questions:
>> 1. Could anyone explain what these parameters mean exactly?
>> /sys/block/sda/queue/max_sectors_kb /sys/block/sda/queue/max_hw_sectors_kb
>>
> How large an IO size can be sent (allowed) to the disk, and how large of
> an IO the disk drive supports.
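In concrete terms, simple arithmetic on the values quoted above shows the request-size cap is not the limiting factor:

```shell
# max_sectors_kb = 32767 KiB, i.e. just under 32 MiB; the largest RPC seen
# in brw_stats is 1 MiB (1024 KiB), so requests are nowhere near the cap.
MAX_KB=32767
RPC_KB=1024
echo "cap is $((MAX_KB / RPC_KB))x the largest RPC"
```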
>
>
>> And what do "disk fragmented I/Os" and "disk I/O size" in brw_stats mean?
>>
>>
> How many pieces each ldiskfs write is broken into, and the size of the
> pieces.
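To make that concrete, the write column of the "disk fragmented I/Os" table above can be summarized like this (a toy awk pass over the counts already quoted in this thread, pasted inline rather than read from /proc):

```shell
# Share of write I/Os that were split into more than one piece, using the
# write counts from the brw_stats excerpt earlier in the thread.
awk '{ total += $2; if ($1 + 0 > 1) frag += $2 }
     END { printf "%.0f%% of writes were fragmented\n", 100 * frag / total }' <<'EOF'
0: 0
1: 630964
2: 3350479
3: 84465
EOF
```

This prints "84% of writes were fragmented", matching the cumulative percentages in the table (15% of writes went out whole).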
>
>
>> 2. In which cases will the disk I/O be fragmented?
>>
>>
>> Thanks a lot in advance!
>>
>> Best Regards
>> Lu Wang
>>
>>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>