[Lustre-discuss] Fragmented I/O

Kevin Hildebrand kevin at umd.edu
Thu May 12 07:52:40 PDT 2011


One of the oddities I'm seeing, which has me grasping at write 
fragmentation and I/O sizes, may not be directly related to these things at 
all.  Periodically, iostat shows one or more of my OST disks running at 
99% utilization.  Reads per second are somewhere in the 150-200 range, 
while read kB/second is quite small, and the average request size is also 
very small.  llobdstat output on the OST in question usually shows zero or 
very small values for reads and writes, and values for 
stats/punches/creates/deletes in the ones and twos.
While this is happening, Lustre starts complaining about 'slow commitrw', 
'slow direct_io', etc., and accesses from the clients are usually hanging.

Why would the disk(s) be pegged while llobdstat shows zero activity?

After a few minutes in this state, the %util drops back down to 
single-digit percentages and normal I/O resumes on the clients.
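
A rough way to capture both views during one of these stalls (sketch only; 
sdb and lustre-OST0000 below are placeholders for the affected device and 
OST):

# iostat -xk sdb 5 2      (the second report is a 5-second average)
# cat /proc/fs/lustre/obdfilter/lustre-OST0000/stats
# cat /proc/fs/lustre/obdfilter/lustre-OST0000/brw_stats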

Thanks,
Kevin

On Thu, 12 May 2011, Kevin Van Maren wrote:

> Kevin Hildebrand wrote:
>>
>> The PERC 6 and H800 use megaraid_sas; I'm currently running
>> 00.00.04.17-RH1.
>>
>> The max_sectors values (320) are what is set by default; I am able to
>> set them to something smaller than 320, but not larger.
>
> Right.  You cannot set max_sectors_kb larger than max_hw_sectors_kb
> (Linux normally defaults most drivers to 512, but Lustre sets the two to
> be the same).  You may want to instrument your HBA driver to see what is
> going on (i.e., why max_hw_sectors_kb is < 1024); I don't know whether it
> is due to a driver limitation or a true hardware limit.
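>
> For example (sketch only; the device name and exact error text are
> illustrative), the relationship shows up directly in sysfs, and a write
> above the hardware limit is rejected:
>
> # cat /sys/block/sdb/queue/max_hw_sectors_kb
> 320
> # echo 1024 > /sys/block/sdb/queue/max_sectors_kb
> -bash: echo: write error: Invalid argument
> # echo 256 > /sys/block/sdb/queue/max_sectors_kb     (<= hw limit, succeeds)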
>
> Most drivers have a limit of 512KB by default; see Bug 22850 for the
> patches that fixed the QLogic and Emulex fibre channel drivers.
>
> Kevin
>
>> Kevin
>>
>> On Wed, 11 May 2011, Kevin Van Maren wrote:
>>
>>> You didn't say, but I think they are LSI-based: are you using the mptsas
>>> driver with the PERC cards?  Which driver version?
>>>
>>> First, max_sectors_kb should normally be set to a power-of-two value,
>>> like 256, rather than an odd size like 320.  This number should also
>>> match the native RAID stripe size of the device, to avoid
>>> read-modify-write cycles.  (See Bug 22886 on why not to make it > 1024
>>> in general.)
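>>>
>>> For example (illustrative geometry only; substitute your actual segment
>>> size and data-disk count): with a 64KB segment size and 4 data disks,
>>> the full stripe is 64KB * 4 = 256KB, so:
>>>
>>> # echo 256 > /sys/block/sdb/queue/max_sectors_kb
>>> # cat /sys/block/sdb/queue/max_sectors_kb
>>> 256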
>>>
>>> See Bug 17086 for patches that raise the max_sectors_kb limit for the
>>> mptsas driver to 1MB, or to the true hardware maximum, rather than a
>>> driver-imposed limit; however, the hardware may still be limited to
>>> sizes < 1MB.
>>>
>>> Also, to clarify the sizes: the smallest bucket >= transfer_size is the
>>> one incremented, so a 320KB I/O increments the 512KB bucket.  Since your
>>> HW says it can only do a 320KB I/O, there will never be a 1MB I/O.
>>>
>>> You may want to instrument your HBA driver to see what is going on
>>> (i.e., why max_hw_sectors_kb is < 1024).
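>>>
>>> (As a first step, before patching anything, it may be worth checking
>>> whether your driver build exposes a module parameter for this; whether
>>> such a parameter exists depends on the driver and version:)
>>>
>>> # modinfo mptsas | grep -i sector
>>> # modinfo megaraid_sas | grep -i sector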
>>>
>>> Kevin
>>>
>>>
>>> Kevin Hildebrand wrote:
>>>> Hi, I'm having some performance issues on my Lustre filesystem and it
>>>> looks to me like it's related to I/Os getting fragmented before being
>>>> written to disk, but I can't figure out why.  This system is RHEL5,
>>>> running Lustre 1.8.4.
>>>>
>>>> All of my OSTs look pretty much the same-
>>>>
>>>>                             read      |     write
>>>> pages per bulk r/w     rpcs  % cum % |  rpcs  % cum %
>>>> 1:                   88811  38  38   | 46375  17  17
>>>> 2:                    1497   0  38   | 7733   2  20
>>>> 4:                    1161   0  39   | 1840   0  21
>>>> 8:                    1168   0  39   | 7148   2  24
>>>> 16:                    922   0  40   | 3297   1  25
>>>> 32:                    979   0  40   | 7602   2  28
>>>> 64:                   1576   0  41   | 9046   3  31
>>>> 128:                  7063   3  44   | 16284   6  37
>>>> 256:                129282  55 100   | 162090  62 100
>>>>
>>>>
>>>>                             read      |     write
>>>> disk fragmented I/Os   ios   % cum % |  ios   % cum %
>>>> 0:                   51181  22  22   |    0   0   0
>>>> 1:                   45280  19  42   | 82206  31  31
>>>> 2:                   16615   7  49   | 29108  11  42
>>>> 3:                    3425   1  50   | 17392   6  49
>>>> 4:                  110445  48  98   | 129481  49  98
>>>> 5:                    1661   0  99   | 2702   1  99
>>>>
>>>>                             read      |     write
>>>> disk I/O size          ios   % cum % |  ios   % cum %
>>>> 4K:                  45889   8   8   | 56240   7   7
>>>> 8K:                   3658   0   8   | 6416   0   8
>>>> 16K:                  7956   1  10   | 4703   0   9
>>>> 32K:                  4527   0  11   | 11951   1  10
>>>> 64K:                114369  20  31   | 134128  18  29
>>>> 128K:                 5095   0  32   | 17229   2  31
>>>> 256K:                 7164   1  33   | 30826   4  35
>>>> 512K:               369512  66 100   | 465719  64 100
>>>>
>>>> Oddly, there's no 1024K row in the I/O size table...
>>>>
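>>>> (For reference, the tables above come from the obdfilter brw_stats
>>>> files; writing to the file should reset the counters, which makes a
>>>> fresh sample easier to read.  The OST name below is a placeholder:)
>>>>
>>>> # echo 0 > /proc/fs/lustre/obdfilter/lustre-OST0000/brw_stats   (reset)
>>>> # cat /proc/fs/lustre/obdfilter/lustre-OST0000/brw_stats        (after some load)
>>>>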
>>>>
>>>> ...and these seem small to me as well, but I can't seem to change them.
>>>> Writing new values to either file doesn't change anything.
>>>>
>>>> # cat /sys/block/sdb/queue/max_hw_sectors_kb
>>>> 320
>>>> # cat /sys/block/sdb/queue/max_sectors_kb
>>>> 320
>>>>
>>>> Hardware in question is DELL PERC 6/E and DELL PERC H800 RAID
>>>> controllers, with MD1000 and MD1200 arrays, respectively.
>>>>
>>>>
>>>> Any clues on where I should look next?
>>>>
>>>> Thanks,
>>>>
>>>> Kevin
>>>>
>>>> Kevin Hildebrand
>>>> University of Maryland, College Park
>>>> Office of Information Technology
>>>> _______________________________________________
>>>> Lustre-discuss mailing list
>>>> Lustre-discuss at lists.lustre.org
>>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>>>
>>>
>>>
>
>


