[Lustre-discuss] hacking max_sectors
Andreas Dilger
adilger at sun.com
Wed Aug 26 03:11:12 PDT 2009
On Aug 26, 2009 00:46 -0400, Robin Humble wrote:
> I've had another go at fixing the problem I was seeing a few months ago:
> http://lists.lustre.org/pipermail/lustre-discuss/2009-April/010315.html
> and which we are seeing again now as we are setting up a new machine
> with 128k-chunk software RAID (md), RAID6 8+2, e.g.
> Lustre: test-OST000d: underlying device md5 should be tuned for larger I/O requests: max_sectors = 1024 could be up to max_hw_sectors=2560
>
> without this patch, and despite raising all disks to a ridiculously
> huge max_sectors_kb, all Lustre 1M RPCs are still fragmented into two
> 512k chunks before being sent to md :-/ likely md then aggregates them
> again, because performance isn't totally dismal, which it would be if
> it were 100% read-modify-writes for each stripe write.
Yes, we've seen this same issue, but haven't been able to tweak the
/sys tunables correctly to get MD RAID to agree. I wonder if the
problem is that the /sys/block/*/queue/max_* tunables are being set
too late in the MD startup, and it has picked up the 1024 sectors
value too early, and never updates it afterward?
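One way to test that theory is to read the tunables back after the md array is assembled, and re-apply them if they regressed. A minimal sketch, assuming the usual /sys/block/<dev>/queue layout; the helper name and the idea of passing the queue directory as an argument are my own, not anything in the kernel:

```shell
#!/bin/sh
# Hypothetical helper: raise a block device's max_sectors_kb up to its
# hardware limit (max_hw_sectors_kb). Takes the queue directory as an
# argument so it can be pointed at /sys/block/sdX/queue, /sys/block/mdN/queue
# (where such entries exist), or a test directory.
raise_max_sectors() {
    qdir="$1"
    hw=$(cat "$qdir/max_hw_sectors_kb")
    cur=$(cat "$qdir/max_sectors_kb")
    # Only write if the current value is below the hardware limit.
    if [ "$cur" -lt "$hw" ]; then
        echo "$hw" > "$qdir/max_sectors_kb"
    fi
    # Print the resulting value so the caller can verify it stuck.
    cat "$qdir/max_sectors_kb"
}

# Example (run as root on a real system):
#   for d in /sys/block/sd*/queue; do raise_max_sectors "$d"; done
```

Running this after md startup, then checking whether the value later reverts, would distinguish "set too late" from "never consulted again".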
> with the patch, 1M i/o's are being fed to md (according to brw_stats),
> and performance is a little better for RAID6 8+2 with 128k chunks, and
> a bit worse for RAID6 8+2 with 64k chunks (which are curiously now fed
> half 512k and half 1M i/o's by Lustre).
This was the other question I'd asked internally. If the array is
formatted with 64kB chunks then 512k IOs shouldn't cause any read-modify-
write operations and (in theory) give the same performance as 1M IOs on
a 128kB chunksize array. What is the relative performance of the
64kB and 128kB configurations?
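The full-stripe arithmetic behind that expectation: a write avoids read-modify-write when it covers a whole stripe, i.e. a multiple of data_disks * chunk size. A quick sketch (the function name is just for illustration):

```shell
#!/bin/sh
# Full-stripe size in kB for an md RAID6 n+2 array is simply the number
# of data disks times the chunk size; writes sized as a multiple of this
# need no read-modify-write cycle.
stripe_kb() {
    data_disks="$1"
    chunk_kb="$2"
    echo $((data_disks * chunk_kb))
}

stripe_kb 8 64    # 512  -> a 512k IO is exactly one full stripe
stripe_kb 8 128   # 1024 -> a 1M IO is exactly one full stripe
```

So on the 64kB-chunk array, even the fragmented 512k requests are still full-stripe writes, which is why the two configurations should in theory perform alike.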
> the one-liner is a core kernel change, so perhaps some Lustre/kernel
> block device/md people can look at it and see if it's acceptable for
> inclusion in standard Lustre OSS kernels, or whether it breaks
> assumptions in the core scsi layer somehow.
>
> IMHO the best solution would be to apply the patch, and then expose
> /sys/block/md*/queue/ entries for md devices so that max_sectors_kb and
> max_hw_sectors_kb can be tuned without recompiling the kernel...
> is that possible?
>
> the patch is against 2.6.18-128.1.14.el5-lustre1.8.1
> --- linux-2.6.18.x86_64.lustre/include/linux/blkdev.h 2009-08-18 17:40:51.000000000 +1000
> +++ linux-2.6.18.x86_64.lustre.hackBlock/include/linux/blkdev.h 2009-08-21 13:47:55.000000000 +1000
> @@ -778,7 +778,7 @@
> #define MAX_PHYS_SEGMENTS 128
> #define MAX_HW_SEGMENTS 128
> #define SAFE_MAX_SECTORS 255
> -#define BLK_DEF_MAX_SECTORS 1024
> +#define BLK_DEF_MAX_SECTORS 2048
>
> #define MAX_SEGMENT_SIZE 65536
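For reference, the units in the one-liner: block-layer sectors are 512 bytes, so the old default of 1024 sectors caps requests at 512 kB, while 2048 sectors allows the 1 MB requests that match Lustre's 1M RPCs. A small sketch of the conversion (helper name is illustrative):

```shell
#!/bin/sh
# Convert a sector count (512-byte sectors) to a request size in kB,
# as used by the max_sectors_kb sysfs tunables.
sectors_to_kb() {
    echo $(( $1 * 512 / 1024 ))
}

sectors_to_kb 1024   # 512  -> old BLK_DEF_MAX_SECTORS cap
sectors_to_kb 2048   # 1024 -> patched cap, one full 1M RPC
```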
This patch definitely looks reasonable, and since we already patch
the server kernel it doesn't appear to be a huge problem to include
it. Can you please create a bug and attach the patch there?
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.