[Lustre-discuss] hacking max_sectors
Andreas Dilger
adilger at sun.com
Wed Aug 26 03:11:12 PDT 2009
On Aug 26, 2009 00:46 -0400, Robin Humble wrote:
> I've had another go at fixing the problem I was seeing a few months ago:
> http://lists.lustre.org/pipermail/lustre-discuss/2009-April/010315.html
> and which we are seeing again now as we are setting up a new machine
> with 128k-chunk software RAID (md), RAID6 8+2, e.g.
> Lustre: test-OST000d: underlying device md5 should be tuned for larger I/O requests: max_sectors = 1024 could be up to max_hw_sectors=2560
>
> without this patch, and despite raising all disks to a ridiculously
> huge max_sectors_kb, all Lustre 1M RPCs are still fragmented into two
> 512k chunks before being sent to md :-/ likely md then aggregates them
> again, because performance isn't totally dismal, which it would be if
> it were 100% read-modify-writes for each stripe write.
Yes, we've seen this same issue, but haven't been able to tweak the
/sys tunables correctly to get MD RAID to agree. I wonder if the
problem is that the /sys/block/*/queue/max_* tunables are being set
too late in the MD startup, and it has picked up the 1024 sectors
value too early, and never updates it afterward?
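One way to test that theory is to read the tunables back after the md array is assembled, and re-apply them if they regressed. A minimal sketch, assuming the usual /sys/block/<dev>/queue layout; the helper name and the idea of passing the queue directory as an argument are my own, not anything in the kernel:

```shell
#!/bin/sh
# Hypothetical helper: raise a block device's max_sectors_kb up to its
# hardware limit (max_hw_sectors_kb). Takes the queue directory as an
# argument so it can be pointed at /sys/block/sdX/queue, /sys/block/mdN/queue
# (where such entries exist), or a test directory.
raise_max_sectors() {
    qdir="$1"
    hw=$(cat "$qdir/max_hw_sectors_kb")
    cur=$(cat "$qdir/max_sectors_kb")
    # Only write if the current value is below the hardware limit.
    if [ "$cur" -lt "$hw" ]; then
        echo "$hw" > "$qdir/max_sectors_kb"
    fi
    # Print the resulting value so the caller can verify it stuck.
    cat "$qdir/max_sectors_kb"
}

# Example (run as root on a real system):
#   for d in /sys/block/sd*/queue; do raise_max_sectors "$d"; done
```

Running this after md startup, then checking whether the value later reverts, would distinguish "set too late" from "never consulted again".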
> with the patch, 1M i/o's are being fed to md (according to brw_stats),
> and performance is a little better for RAID6 8+2 with 128k chunks, and
> a bit worse for RAID6 8+2 with 64k chunks (which are curiously now fed
> half 512k and half 1M i/o's by Lustre).
This was the other question I'd asked internally. If the array is
formatted with 64kB chunks then 512k IOs shouldn't cause any read-modify-
write operations and (in theory) give the same performance as 1M IOs on
a 128kB chunksize array. What is the relative performance of the
64kB and 128kB configurations?
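The full-stripe arithmetic behind that expectation: a write avoids read-modify-write when it covers a whole stripe, i.e. a multiple of data_disks * chunk size. A quick sketch (the function name is just for illustration):

```shell
#!/bin/sh
# Full-stripe size in kB for an md RAID6 n+2 array is simply the number
# of data disks times the chunk size; writes sized as a multiple of this
# need no read-modify-write cycle.
stripe_kb() {
    data_disks="$1"
    chunk_kb="$2"
    echo $((data_disks * chunk_kb))
}

stripe_kb 8 64    # 512  -> a 512k IO is exactly one full stripe
stripe_kb 8 128   # 1024 -> a 1M IO is exactly one full stripe
```

So on the 64kB-chunk array, even the fragmented 512k requests are still full-stripe writes, which is why the two configurations should in theory perform alike.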
> the one-liner is a core kernel change, so perhaps some Lustre/kernel
> block device/md people can look at it and see if it's acceptable for
> inclusion in standard Lustre OSS kernels, or whether it breaks
> assumptions in the core scsi layer somehow.
>
> IMHO the best solution would be to apply the patch, and then expose
> /sys/block/md*/queue/ entries for md devices so that max_sectors_kb and
> max_hw_sectors_kb can be tuned without recompiling the kernel...
> is that possible?
>
> the patch is against 2.6.18-128.1.14.el5-lustre1.8.1
> --- linux-2.6.18.x86_64.lustre/include/linux/blkdev.h 2009-08-18 17:40:51.000000000 +1000
> +++ linux-2.6.18.x86_64.lustre.hackBlock/include/linux/blkdev.h 2009-08-21 13:47:55.000000000 +1000
> @@ -778,7 +778,7 @@
> #define MAX_PHYS_SEGMENTS 128
> #define MAX_HW_SEGMENTS 128
> #define SAFE_MAX_SECTORS 255
> -#define BLK_DEF_MAX_SECTORS 1024
> +#define BLK_DEF_MAX_SECTORS 2048
>
> #define MAX_SEGMENT_SIZE 65536
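For reference, the units in the one-liner: block-layer sectors are 512 bytes, so the old default of 1024 sectors caps requests at 512 kB, while 2048 sectors allows the 1 MB requests that match Lustre's 1M RPCs. A small sketch of the conversion (helper name is illustrative):

```shell
#!/bin/sh
# Convert a sector count (512-byte sectors) to a request size in kB,
# as used by the max_sectors_kb sysfs tunables.
sectors_to_kb() {
    echo $(( $1 * 512 / 1024 ))
}

sectors_to_kb 1024   # 512  -> old BLK_DEF_MAX_SECTORS cap
sectors_to_kb 2048   # 1024 -> patched cap, one full 1M RPC
```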
This patch definitely looks reasonable, and since we already patch
the server kernel it doesn't appear to be a huge problem to include
it. Can you please create a bug and attach the patch there?
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.