[Lustre-discuss] Fragmented I/O

Jason Rappleye jason.rappleye at nasa.gov
Thu May 12 07:57:41 PDT 2011


On May 12, 2011, at 7:52 AM, Kevin Hildebrand wrote:

> 
> One of the oddities I'm seeing, the one that has me grasping at write 
> fragmentation and I/O sizes, may not be directly related to those things 
> at all.  Periodically, iostat shows that one or more of my OST disks is 
> running at 99% utilization.  Reads per second are somewhere in the 
> 150-200 range, while read kB/second is quite small.  

That sounds familiar. You're probably experiencing these:

https://bugzilla.lustre.org/show_bug.cgi?id=24183
http://jira.whamcloud.com/browse/LU-15

Jason

> The average 
> request size is also very small.  llobdstat output on the OST in question 
> usually shows zero or very small values for reads and writes, and values 
> for stats/punches/creates/deletes in the ones and twos.
> While this is happening, Lustre starts complaining about 'slow commitrw', 
> 'slow direct_io', etc.  At the same time, accesses from clients are 
> usually hanging.
> 
> Why would the disk(s) be pegged while llobdstat shows zero activity?
> 
> After a few minutes in this state, the %util drops back down to single 
> digit percentages and normal I/O resumes on the clients.
> 
> Thanks,
> Kevin
> 
> On Thu, 12 May 2011, Kevin Van Maren wrote:
> 
>> Kevin Hildebrand wrote:
>>> 
>>> The PERC 6 and H800 use megaraid_sas, I'm currently running
>>> 00.00.04.17-RH1.
>>> 
>>> The max_sectors values (320) are what is set by default; I am
>>> able to set max_sectors_kb to something smaller than 320, but not larger.
>> 
>> Right.  You cannot set max_sectors_kb larger than max_hw_sectors_kb
>> (Linux normally defaults max_sectors_kb to 512 for most drivers, but
>> Lustre sets the two to be the same).  You may want to instrument your
>> HBA driver to see what is going on (i.e., why max_hw_sectors_kb is
>> < 1024); I don't know whether it is due to a driver limitation or a
>> true hardware limit.
>> 
>> Most drivers have a limit of 512KB by default; see Bug 22850 for the
>> patches that fixed the QLogic and Emulex fibre channel drivers.
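
[Editor's note: a minimal sketch for surveying these caps across all block
devices on an OSS; any device whose max_hw_sectors_kb is below 1024
explains why the 1MB bucket in brw_stats stays empty.]

```shell
# Report the request-size caps for each block device.  max_sectors_kb is
# the tunable soft cap; max_hw_sectors_kb is the driver/hardware ceiling
# that the soft cap can never exceed.
for q in /sys/block/*/queue; do
    [ -f "$q/max_sectors_kb" ] || continue   # skip partial sysfs entries
    dev=${q#/sys/block/}; dev=${dev%/queue}
    echo "$dev: max_sectors_kb=$(cat "$q/max_sectors_kb")" \
         "max_hw_sectors_kb=$(cat "$q/max_hw_sectors_kb")"
done
```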
>> 
>> Kevin
>> 
>>> Kevin
>>> 
>>> On Wed, 11 May 2011, Kevin Van Maren wrote:
>>> 
>>>> You didn't say, but I think they are LSI-based: are you using the mptsas
>>>> driver with the PERC cards?  Which driver version?
>>>> 
>>>> First, max_sectors_kb should normally be set to a power-of-two value,
>>>> like 256, rather than an odd size like 320.  This number should also
>>>> match the native RAID stripe size of the device, to avoid
>>>> read-modify-write cycles.  (See Bug 22886 on why not to make it
>>>> > 1024 in general.)
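
[Editor's note: a concrete illustration of matching max_sectors_kb to the
array's full-stripe size; chunk_kb=64 and data_disks=4 are assumed example
values, not figures from this thread.]

```shell
# The full-stripe write size is chunk_size * number_of_data_disks.
# Example assumption: a 6-disk RAID6 with a 64KB chunk has 4 data disks,
# giving a 256KB full stripe.  A power-of-two max_sectors_kb equal to
# this lets each request cover whole stripes, avoiding read-modify-write.
chunk_kb=64
data_disks=4
stripe_kb=$((chunk_kb * data_disks))
echo "full-stripe write size: ${stripe_kb}KB"
# To apply (sdb is a placeholder; requires root):
# echo "$stripe_kb" > /sys/block/sdb/queue/max_sectors_kb
```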
>>>> 
>>>> See Bug 17086 for patches to increase the max_sectors_kb limitation for
>>>> the mptsas driver to 1MB, or the true hardware maximum, rather than a
>>>> driver limit; however, the hardware may still be limited to sizes < 1MB.
>>>> 
>>>> Also, to clarify the sizes: the smallest bucket >= transfer_size is
>>>> the one incremented, so a 320KB I/O increments the 512KB bucket.
>>>> Since your HW says it can only do a 320KB I/O, there will never be a
>>>> 1MB I/O.
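
[Editor's note: the round-up-to-a-power-of-two bucketing rule described
above can be sketched as follows; the 4KB starting bucket matches the
smallest row in the disk I/O size table below.]

```shell
# An I/O is counted in the smallest power-of-two bucket >= its size, so a
# 320KB I/O (the max_hw_sectors_kb cap seen in this thread) lands in the
# 512KB bucket, and the 1MB bucket is never incremented.
kb=320
bucket=4                          # smallest bucket, in KB
while [ "$bucket" -lt "$kb" ]; do
    bucket=$((bucket * 2))        # 4 -> 8 -> ... -> 512
done
echo "${kb}KB I/O is counted in the ${bucket}KB bucket"
```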
>>>> 
>>>> You may want to instrument your HBA driver to see what is going on (ie,
>>>> why the max_hw_sectors_kb is < 1024).
>>>> 
>>>> Kevin
>>>> 
>>>> 
>>>> Kevin Hildebrand wrote:
>>>>> Hi, I'm having some performance issues on my Lustre filesystem and it
>>>>> looks to me like it's related to I/Os getting fragmented before being
>>>>> written to disk, but I can't figure out why.  This system is RHEL5,
>>>>> running Lustre 1.8.4.
>>>>> 
>>>>> All of my OSTs look pretty much the same-
>>>>> 
>>>>>                            read      |     write
>>>>> pages per bulk r/w     rpcs  % cum % |  rpcs  % cum %
>>>>> 1:                   88811  38  38   | 46375  17  17
>>>>> 2:                    1497   0  38   | 7733   2  20
>>>>> 4:                    1161   0  39   | 1840   0  21
>>>>> 8:                    1168   0  39   | 7148   2  24
>>>>> 16:                    922   0  40   | 3297   1  25
>>>>> 32:                    979   0  40   | 7602   2  28
>>>>> 64:                   1576   0  41   | 9046   3  31
>>>>> 128:                  7063   3  44   | 16284   6  37
>>>>> 256:                129282  55 100   | 162090  62 100
>>>>> 
>>>>> 
>>>>>                            read      |     write
>>>>> disk fragmented I/Os   ios   % cum % |  ios   % cum %
>>>>> 0:                   51181  22  22   |    0   0   0
>>>>> 1:                   45280  19  42   | 82206  31  31
>>>>> 2:                   16615   7  49   | 29108  11  42
>>>>> 3:                    3425   1  50   | 17392   6  49
>>>>> 4:                  110445  48  98   | 129481  49  98
>>>>> 5:                    1661   0  99   | 2702   1  99
>>>>> 
>>>>>                            read      |     write
>>>>> disk I/O size          ios   % cum % |  ios   % cum %
>>>>> 4K:                  45889   8   8   | 56240   7   7
>>>>> 8K:                   3658   0   8   | 6416   0   8
>>>>> 16K:                  7956   1  10   | 4703   0   9
>>>>> 32K:                  4527   0  11   | 11951   1  10
>>>>> 64K:                114369  20  31   | 134128  18  29
>>>>> 128K:                 5095   0  32   | 17229   2  31
>>>>> 256K:                 7164   1  33   | 30826   4  35
>>>>> 512K:               369512  66 100   | 465719  64 100
>>>>> 
>>>>> Oddly, there's no 1024K row in the I/O size table...
>>>>> 
>>>>> 
>>>>> ...and these seem small to me as well, but I can't seem to change
>>>>> them.  Writing new values to either file doesn't change anything.
>>>>> 
>>>>> # cat /sys/block/sdb/queue/max_hw_sectors_kb
>>>>> 320
>>>>> # cat /sys/block/sdb/queue/max_sectors_kb
>>>>> 320
>>>>> 
>>>>> Hardware in question is DELL PERC 6/E and DELL PERC H800 RAID
>>>>> controllers, with MD1000 and MD1200 arrays, respectively.
>>>>> 
>>>>> 
>>>>> Any clues on where I should look next?
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Kevin
>>>>> 
>>>>> Kevin Hildebrand
>>>>> University of Maryland, College Park
>>>>> Office of Information Technology
>>>>> _______________________________________________
>>>>> Lustre-discuss mailing list
>>>>> Lustre-discuss at lists.lustre.org
>>>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>>>> 
>>>> 
>>>> 
>> 
>> 

--
Jason Rappleye
System Administrator
NASA Advanced Supercomputing Division
NASA Ames Research Center
Moffett Field, CA 94035







