[Lustre-discuss] disk fragmented I/Os
Lu Wang
wanglu at ihep.ac.cn
Wed Mar 31 22:44:59 PDT 2010
Hi,
We partitioned the device /dev/sda with the "parted" command, using a GUID Partition Table (GPT). Therefore the sfdisk result is not correct:
[root at boss32 ~]# sfdisk -uS -l /dev/sda
WARNING: GPT (GUID Partition Table) detected on '/dev/sda'! The util sfdisk doesn't support GPT. Use GNU Parted.
Disk /dev/sda: 1094112 cylinders, 255 heads, 63 sectors/track
Units = sectors of 512 bytes, counting from 0
Device Boot Start End #sectors Id System
/dev/sda1 1 4294967295 4294967295 ee EFI GPT
start: (c,h,s) expected (0,0,2) found (0,0,1)
/dev/sda2 0 - 0 0 Empty
/dev/sda3 0 - 0 0 Empty
/dev/sda4 0 - 0 0 Empty
With "parted", you can see the correct result:
[root at boss32 ~]# parted /dev/sda
GNU Parted 1.8.1
Using /dev/sda
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) p
Model: TOYOU NetStor_iSUM510 (scsi)
Disk /dev/sda: 8999GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Number Start End Size File system Name Flags
1 17.4kB 4500GB 4500GB ext3
2 4500GB 8999GB 4499GB ext3
(parted)
------------------
Lu Wang
2010-04-01
-------------------------------------------------------------
From: Kevin Van Maren
Date: 2010-04-01 11:09:25
To: Lu Wang
Cc: lustre-discuss
Subject: Re: [Lustre-discuss] Re: Re: disk fragmented I/Os
Thanks for the "df" output -- at 96% full the problem is likely Lustre
fragmenting the IO because it cannot allocate contiguous space on the
OST. If possible, free up a bunch of space on the OSTs (i.e., delete old
large files) and see if it improves. Still not clear to me why you
don't have more outstanding IOs to the disk.
8 TiB devices would only have been 1% less capacity than you have with
two partitions. Possibly related, did you ensure partitions were
aligned on the RAID boundary of the underlying device? RAID alignment
is the main reason it is recommended to not use drive partitions with
Lustre. "sfdisk -uS -l /dev/sda" will show the actual start.
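As a rough sketch of the alignment check described above (the start sector and stripe width below are illustrative examples, not values taken from this system):

```shell
# Check whether a partition's start sector falls on the RAID stripe boundary.
# START and ALIGN are example values: substitute the start sector reported by
# "sfdisk -uS -l" and your array's stripe width in 512-byte sectors.
START=34          # e.g. an old parted default start sector for GPT
ALIGN=2048        # e.g. a 1 MiB stripe = 2048 sectors of 512 bytes
if [ $((START % ALIGN)) -eq 0 ]; then
    echo "aligned"
else
    echo "misaligned by $((START % ALIGN)) sectors"
fi
```

A misaligned start means every full-stripe write from Lustre straddles two stripes on the array, forcing read-modify-write cycles.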
Kevin
Lu Wang wrote:
> We are using lustre 1.8.1.1 on 2.6.18-128.7.1.el5. The disk controller is NetStor_iSUM510, driver is qla2xxx (8.02.00.06.05.03-k).
>
> We made 2 partitions for each disk volume:
> /dev/sda1 4325574520 3911425648 194422312 96% /lustre/ost1
> /dev/sda2 4324980788 3898888124 206396204 95% /lustre/ost2
> /dev/sdb1 4325574520 3909042320 196805640 96% /lustre/ost3
> /dev/sdb2 4324980788 3920306524 184977804 96% /lustre/ost4
> /dev/sdc1 4325574520 3868328108 237519852 95% /lustre/ost5
> /dev/sdc2 4324980788 3921774384 183509944 96% /lustre/ost6
> /dev/sdd1 4325574520 3911662272 194185688 96% /lustre/ost7
> /dev/sdd2 4324980788 3884415428 220868900 95% /lustre/ost8
>
> This is because Lustre cannot support OSTs larger than 8 TB.
>
>
>
>
> ------------------
> Lu Wang
> 2010-04-01
>
> -------------------------------------------------------------
> From: Kevin Van Maren
> Date: 2010-03-31 19:56:59
> To: Lu Wang
> Cc: lustre-discuss
> Subject: Re: [Lustre-discuss] disk fragmented I/Os
>
> Lu Wang wrote:
>
>> Dear list,
>> We have got a brw_stats result for one OST since it was up. According to this statistic, 50 percent of disk I/Os are fragmented. I found an earlier discussion on this list that referred to this question:
>> http://lists.lustre.org/pipermail/lustre-discuss/2009-August/011433.html
>> It seems it is ideal to have 100% of disk I/Os with fragment "1" or "0". I don't know why the I/Os are fragmented, since I found that max_sectors_kb is big enough (32767 KB, about 32 MB) for the biggest disk I/O size (according to brw_stats, it is 1 MB).
>>
>> # cat /sys/block/sda/queue/max_sectors_kb
>> 32767
>> # cat /sys/block/sda/queue/max_hw_sectors_kb
>> 32767
>>
>>
>
> So the drive is limited to 32MB IOs. Below it is clear that you are
> seeing fragmentation, so the question becomes why are the IOs being
> broken up? It is very unlikely Lustre is breaking up the IO willingly,
> so most likely something in the IO stack is restricting the IO sizes.
>
> What are you using for an OST, and what controller/driver/driver version?
>
> What version of Lustre, and what version of Linux are you using on the
> OSS node?
>
>
>> read | write
>> pages per bulk r/w rpcs % cum % | rpcs % cum %
>> 128: 4976083 20 45 | 198864 4 13
>> 256: 13457144 54 100 | 3522333 86 100
>>
>>
> So the clients are doing 1MB RPCs to the server (which is good).
>
>
>> read | write
>> disk fragmented I/Os ios % cum % | ios % cum %
>> 0: 9821 0 0 | 0 0 0
>> 1: 11933478 48 48 | 630964 15 15
>> 2: 12726392 51 99 | 3350479 82 97
>> 3: 155476 0 99 | 84465 2 99
>>
>>
> But all your IOs are being broken in half.
>
>
>> read | write
>> disk I/Os in flight ios % cum % | ios % cum %
>> 1: 10954265 28 28 | 3781021 49 49
>> 2: 9217023 24 53 | 3329128 43 93
>> 3: 6063548 15 69 | 272981 3 97
>>
>>
>
> This is really bad -- it seems that it only ever issues one write at a
> time to the disks.
> Lustre would normally issue up to 31, so there may be something about
> your disk or driver
> preventing multiple outstanding IOs.
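One thing worth checking here is the device's SCSI queue depth, since a depth of 1 would explain never having more than one outstanding write. This is a guarded probe; the sysfs path is standard for SCSI devices but may be absent on other storage stacks:

```shell
# Print the SCSI queue depth for sda, if the kernel exposes it.
QD_FILE=/sys/block/sda/device/queue_depth
if [ -r "$QD_FILE" ]; then
    echo "queue_depth: $(cat "$QD_FILE")"
else
    echo "queue_depth not exposed for this device"
fi
```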
>
>
>> read | write
>> disk I/O size ios % cum % | ios % cum %
>> 256K: 4464264 11 29 | 288737 3 11
>> 512K: 24846133 65 94 | 5997373 78 90
>> 1M: 1951161 5 100 | 747214 9 100
>>
>>
>
> Basically this is saying that nearly all 1MB IOs are being broken into
> 512KB pieces.
>
>
>> I have 2 questions:
>> 1. Could anyone explain what these parameters mean exactly?
>> /sys/block/sda/queue/max_sectors_kb /sys/block/sda/queue/max_hw_sectors_kb
>>
> How large an IO size can be sent (allowed) to the disk, and how large of
> an IO the disk drive supports.
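In concrete terms, simple arithmetic on the values quoted above shows the request-size cap is not the limiting factor:

```shell
# max_sectors_kb = 32767 KiB, i.e. just under 32 MiB; the largest RPC seen
# in brw_stats is 1 MiB (1024 KiB), so requests are nowhere near the cap.
MAX_KB=32767
RPC_KB=1024
echo "cap is $((MAX_KB / RPC_KB))x the largest RPC"
```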
>
>
>> And what do "disk fragmented I/Os" and "disk I/O size" in brw_stats mean?
>>
>>
> How many pieces each ldiskfs write is broken into, and the size of the
> pieces.
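To make that concrete, the write column of the "disk fragmented I/Os" table above can be summarized like this (a toy awk pass over the counts already quoted in this thread, pasted inline rather than read from /proc):

```shell
# Share of write I/Os that were split into more than one piece, using the
# write counts from the brw_stats excerpt earlier in the thread.
awk '{ total += $2; if ($1 + 0 > 1) frag += $2 }
     END { printf "%.0f%% of writes were fragmented\n", 100 * frag / total }' <<'EOF'
0: 0
1: 630964
2: 3350479
3: 84465
EOF
```

This prints "84% of writes were fragmented", matching the cumulative percentages in the table (15% of writes went out whole).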
>
>
>> 2. In which cases will the disk I/O be fragmented?
>>
>>
>> Thanks a lot in advance!
>>
>> Best Regards
>> Lu Wang
>>
>>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>