[Lustre-devel] New test results for "ls -Ul"
yong.fan at whamcloud.com
Mon May 30 01:11:59 PDT 2011
Inline comments follow:
On 5/30/11 1:51 PM, Jinshan Xiong wrote:
> On May 26, 2011, at 6:01 AM, Eric Barton wrote:
>> Interesting results. Thank you - especially for graphing the results
>> so thoroughly.
>> I’m attaching them here and cc-ing lustre-devel since these are of
>> general interest.
>> I don’t think your conclusion number (1), that CLIO locking is
>> slowing us down, is as obvious from these results as you imply. If
>> you just compare the 1.8 and patched 2.x per-file times and how they
>> scale with #stripes you get this…
>> The gradients of these lines should correspond to the additional
>> time per stripe required to stat each file, and I’ve graphed these
>> times below (ignoring the 0-stripe data for this calculation because
>> I’m just interested in the incremental per-stripe cost).
>> They show per-stripe overhead for 1.8 well above patched 2.x for the
>> lower stripe
>> counts, but whereas 1.8 gets better with more stripes, patched 2.x
>> gets worse. I’m
>> guessing that at high stripe counts, 1.8 puts many concurrent
>> glimpses on the wire
>> and does it quite efficiently. I’d like to understand better how you
>> control the #
>> of glimpse-aheads you keep on the wire – is it a single fixed number,
>> or a fixed
>> number per OST or some other scheme? In any case, it will be
>> interesting to see
>> measurements at higher stripe counts.
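>> (To make the arithmetic explicit: if t(n) is the per-file stat time
>> at n stripes, the incremental per-stripe cost graphed here is the
>> slope (t(n2) - t(n1)) / (n2 - n1) between measured stripe counts n1
>> and n2; the 0-stripe points are excluded from that fit.)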
>> From: Fan Yong [mailto:yong.fan at whamcloud.com]
>> Sent: 12 May 2011 10:18 AM
>> To: Eric Barton
>> Cc: Bryon Neitzel; Ian Colle; Liang Zhen
>> Subject: New test results for "ls -Ul"
>> I have improved the statahead load-balancing mechanism to distribute
>> the statahead load across more CPU cores on the client, and adjusted
>> AGL (async glimpse lock) to fit the CLIO lock state machine. With
>> these improvements, 'ls -Ul' runs faster than with the old patches,
>> especially on large SMP nodes. On the other hand, as the degree of
>> parallelism increases, the lower-level network scheduler becomes the
>> bottleneck, so I combined my patches with Liang's SMP patches for
>> this test.
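>> As a side note, client statahead behaviour can be watched while
>> 'ls -Ul' runs via the llite tunables (a sketch; check the parameter
>> names on the build under test):
>>
>>   lctl get_param llite.*.statahead_max    # statahead window size
>>   lctl get_param llite.*.statahead_stats  # statahead hit/miss counters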
>> Test environment:
>> - client: fat-intel-4, 24 cores
>> - server: client-xxx, 4 OSSes, 8 OSTs on each OSS
>> Configurations compared:
>> - my patches + SMP patches
>> - my patches
>> - original b1_8 (shares the same server with "b2x_patched")
>> - original b2_x
>> Some notes:
>> 1) Stripe count strongly affects traversal performance, and the
>> impact is more than linear. Even with all the patches applied to
>> b2_x, the impact of stripe count is still larger than on b1_8. This
>> is related to the complex CLIO lock state machine and its tedious
>> iteration/repeat operations; it is not easy to make it run as
>> efficiently as b1_8.
> Hi there,
> I did some tests to investigate the overhead of the CLIO lock state
> machine and glimpse locks, and I found something new.
> Basically I did the same thing as what Nasf had done, but I only
> cared about the overhead of glimpse locks. For this purpose, I ran
> 'ls -lU' twice for each test: the 1st run is only used to populate
> the IBITS UPDATE lock cache for the files; then I dropped cl_locks
> and ldlm_locks from the client-side cache by setting the lru_size of
> the ldlm namespaces to zero, and ran 'ls -lU' once again. In the
> second run, the statahead thread always finds a cached IBITS lock
> (we can check the mdc lock_count to be sure), so the elapsed time of
> ls is glimpse-related.
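> A minimal sketch of that sequence (mount point hypothetical; I zero
> lru_size only for the OSC namespaces, so the MDC IBITS locks stay
> cached, which the second run relies on; some builds use 'clear'
> instead of 0 to drop locks):
>
>   # 1st run: warm the IBITS UPDATE lock cache
>   time ls -lU /mnt/lustre/dir > /dev/null
>   # drop cached cl_locks/ldlm_locks on the OST side
>   lctl set_param ldlm.namespaces.*osc*.lru_size=0
>   # confirm the MDC locks are still cached
>   lctl get_param ldlm.namespaces.*mdc*.lock_count
>   # 2nd run: elapsed time is now glimpse-dominated
>   time ls -lU /mnt/lustre/dir > /dev/null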
> This is what I got from the test:
> Description and test environment:
> - 'ls -Ul time' means the time to finish the second run;
> - 100K means 100K files under the same directory; 400K means 400K
> files under the same directory;
> - there are two OSSes in my test, and each OSS has 8 OSTs; the OSTs
> alternate between the two OSSes, i.e., OST0, 2, 4, ... are on OSS0
> and OST1, 3, 5, ... are on OSS1;
> - each node has 12G memory and 4 CPU cores;
> - latest lustre-master build, b140
> and the prorated per-stripe overhead:
> From the above test, it is very hard to conclude that cl_lock causes
> ls time to increase with the stripe count.
> Here is the test script I used, and the test output is attached as
> well. Please let me know if I missed something.
In theory, the glimpse RPCs for the stripes of a single file should be
processed in parallel, so a higher stripe count should mean a lower
average per-stripe overhead; at least that is the expectation. A flat
per-stripe line therefore cannot show that the overhead is small
enough. I suggest comparing against b1_8 on the same tests.
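As an illustration (with made-up numbers, not from the attached data):
if each glimpse RPC takes about 1 ms and the glimpses for an n-stripe
file really go out in parallel, the elapsed glimpse time per file stays
near 1 ms for any n, so the prorated per-stripe overhead should fall
roughly as 1/n. A prorated overhead that stays flat instead means the
total glimpse time is growing linearly with the stripe count.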
> Let's take a step back and reconsider the real cause in Nasf's test.
> I tend to think the load on the OSSes might cause that symptom. It's
> obvious that Async Glimpse Lock produces more stress on the OSS,
> especially in his test environment where multiple OSTs are actually
> on the same OSS. This will also make the ls time increase with the
> stripe count, since the OSSes have to handle more RPCs in a given
> interval as the stripe count grows: in his setup of 4 OSSes with 8
> OSTs each, a 32-stripe file sends glimpses to every OST, i.e. 8 RPCs
> per OSS per file. This problem may be mitigated by distributing the
> OSTs across more OSSes.
Basically, I agree with you that heavy load on the OSS may be the
performance bottleneck. Just as I said in a former email, we found
that the CPU load on the OSSes was quite high during "ls -Ul" in the
large-stripe cases. It would be easy to verify as long as we had
sufficiently powerful OSSes; unfortunately we do not have them now.
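In the meantime, the OSS-side CPU saturation itself is easy to watch
during a run (host names and mount point hypothetical):

  # sample CPU usage on each OSS while the client traverses
  for oss in oss0 oss1 oss2 oss3; do
      ssh $oss vmstat 1 > $oss.cpu &
  done
  time ls -Ul /mnt/lustre/dir > /dev/null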
>> 2) Patched b2_x is much faster than original b2_x; for traversing a
>> 400K-file, 32-striped directory it is 100 times or more faster.
>> 3) Patched b2_x is also faster than b1_8; within our tests, patched
>> b2_x is at least 4x faster than b1_8, which matches the requirement
>> in the ORNL contract.
>> 4) Original b2_x is faster than b1_8 only in small-stripe cases, up
>> to 4 stripes. In larger-stripe cases it is slower than b1_8, which
>> is consistent with the ORNL test results.
>> 5) The largest stripe count in our test is 32; we do not have the
>> resources to test larger stripe counts, and I also wonder whether
>> testing larger-striped directories is worthwhile. How many customers
>> want a large, fully striped directory, i.e. 1M 160-striped files in
>> a single directory? If that is a rare case, spending a lot of time
>> on it is not worth it.
>> We need to confirm with ORNL the final acceptance test cases and
>> environment, including:
>> a) stripe count
>> b) item count
>> c) network latency, with or without LNET routers (I suggest without
>> routers)
>> d) OST count on each OSS