[Lustre-discuss] hacking max_sectors

Kevin Van Maren Kevin.Vanmaren at Sun.COM
Wed Aug 26 03:27:17 PDT 2009


Andreas Dilger wrote:
> On Aug 26, 2009  00:46 -0400, Robin Humble wrote:
>   
>> I've had another go at fixing the problem I was seeing a few months ago:
>>   http://lists.lustre.org/pipermail/lustre-discuss/2009-April/010315.html
>> and which we are now seeing again as we set up a new machine
>> with 128k-chunk software RAID (md) RAID6 8+2, e.g.:
>>   Lustre: test-OST000d: underlying device md5 should be tuned for larger I/O requests: max_sectors = 1024 could be up to max_hw_sectors=2560 
>>
>> Without this patch, and despite raising max_sectors_kb on all the
>> disks to a ridiculously huge value, all Lustre 1M RPCs are still
>> fragmented into two 512k chunks before being sent to md :-/  Likely
>> md then aggregates them again, because performance isn't totally
>> dismal, which it would be if every stripe write were a
>> read-modify-write.
>>     

They are "RCW"s (reconstruct-writes) rather than "RMW"s (read-modify-writes).

> Yes, we've seen this same issue, but haven't been able to tweak the
> /sys tunables correctly to get MD RAID to agree.  I wonder if the
> problem is that the /sys/block/*/queue/max_* tunables are being set
> too late in the MD startup, so MD picked up the 1024-sector value
> too early and never updates it afterward?
>   

There is no such tunable for the "md" device (but there needs to be!), so
it appears that the kernel default value is used.  This is true even
if the "sd" devices' limits are raised before "md" is loaded.
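
As best I can tell, the cap comes from the generic block layer rather than
from md itself.  Roughly, paraphrasing from memory of block/ll_rw_blk.c in
the 2.6.18-era kernel (not a verbatim copy), whatever limit the driver asks
for gets clamped along these lines:

/*
 * Paraphrased sketch of blk_queue_max_sectors() from the 2.6.18-era
 * block layer (from memory, not a verbatim copy).  Whatever limit the
 * driver requests, the value that actually governs request sizing
 * (q->max_sectors, exported as max_sectors_kb) is capped at
 * BLK_DEF_MAX_SECTORS; only q->max_hw_sectors keeps the requested value.
 * With the stock BLK_DEF_MAX_SECTORS of 1024 sectors (512k), a 1M Lustre
 * RPC is split into two 512k requests, which matches the
 * "max_sectors = 1024 could be up to max_hw_sectors=2560" message above.
 */
void blk_queue_max_sectors(request_queue_t *q, unsigned int max_sectors)
{
	if ((max_sectors << 9) < PAGE_CACHE_SIZE) {
		max_sectors = 1 << (PAGE_CACHE_SHIFT - 9);
		printk(KERN_INFO "%s: set to minimum %d\n",
		       __FUNCTION__, max_sectors);
	}

	if (BLK_DEF_MAX_SECTORS > max_sectors) {
		q->max_hw_sectors = q->max_sectors = max_sectors;
	} else {
		q->max_sectors = BLK_DEF_MAX_SECTORS;	/* 1024 = 512k */
		q->max_hw_sectors = max_sectors;
	}
}

Raising BLK_DEF_MAX_SECTORS to 2048 (1M), as in Robin's patch below, lifts
that cap above the Lustre RPC size; a per-device sysfs knob for md, as
suggested below, would be the cleaner long-term fix.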


>> with the patch, 1M I/Os are being fed to md (according to brw_stats),
>> and performance is a little better for RAID6 8+2 with 128k chunks, and
>> a bit worse for RAID6 8+2 with 64k chunks (which is curiously now fed
>> half 512k and half 1M I/Os by Lustre).
>>     
>
> This was the other question I'd asked internally.  If the array is
> formatted with 64kB chunks, then 512k IOs shouldn't cause any
> read-modify-write operations and should (in theory) give the same
> performance as 1M IOs on a 128kB-chunk array.  What is the relative
> performance of the 64kB and 128kB configurations?
>
>   
>> The one-liner is a core kernel change, so perhaps some Lustre/kernel
>> block-device/md people could look at it and see whether it's acceptable
>> for inclusion in the standard Lustre OSS kernels, or whether it breaks
>> assumptions somewhere in the core SCSI layer.
>>
>> IMHO the best solution would be to apply the patch and then add a
>> /sys/block/md*/queue/ directory for md devices so that max_sectors_kb
>> and max_hw_sectors_kb can be tuned without recompiling the kernel...
>> is that possible?
>>
>> The patch is against 2.6.18-128.1.14.el5-lustre1.8.1:
>>     
>
>   
>> --- linux-2.6.18.x86_64.lustre/include/linux/blkdev.h	2009-08-18 17:40:51.000000000 +1000
>> +++ linux-2.6.18.x86_64.lustre.hackBlock/include/linux/blkdev.h	2009-08-21 13:47:55.000000000 +1000
>> @@ -778,7 +778,7 @@
>>  #define MAX_PHYS_SEGMENTS 128
>>  #define MAX_HW_SEGMENTS 128
>>  #define SAFE_MAX_SECTORS 255
>> -#define BLK_DEF_MAX_SECTORS 1024
>> +#define BLK_DEF_MAX_SECTORS 2048
>>  
>>  #define MAX_SEGMENT_SIZE	65536
>>     
>
> This patch definitely looks reasonable, and since we already patch
> the server kernel, it doesn't appear to be a huge problem to include
> it.  Can you please create a bug and attach the patch there?
>
> Cheers, Andreas

Already done: Bug 20533.  It has a few other notes.
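
On the 64k-vs-128k chunk question above, a throwaway snippet (not part of
any patch) to sanity-check the numbers in this thread:

/* Full-stripe sizes for the 8+2 RAID6 layouts discussed above versus
 * the old and new BLK_DEF_MAX_SECTORS caps (512-byte sectors, so
 * max_sectors_kb = max_sectors / 2).
 */
#include <stdio.h>

int main(void)
{
	const unsigned data_disks = 8;			/* RAID6 8+2 */
	const unsigned chunk_kb[] = { 64, 128 };	/* chunk sizes tested */
	const unsigned cap_sectors[] = { 1024, 2048 };	/* old/new default */
	int i;

	for (i = 0; i < 2; i++)
		printf("chunk %3uk -> full stripe = %uk\n",
		       chunk_kb[i], data_disks * chunk_kb[i]);

	for (i = 0; i < 2; i++)
		printf("BLK_DEF_MAX_SECTORS %u -> max_sectors_kb = %uk\n",
		       cap_sectors[i], cap_sectors[i] / 2);

	return 0;
}

So with the old 512k cap, a 1M RPC is split in two, but a 512k request is
still exactly one full stripe on the 64k-chunk array, which fits with
Andreas' point that the 64k layout shouldn't see read-modify-writes even
with 512k requests.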

Kevin



