[lustre-discuss] ZFS-OST layout, number of OSTs

Patrick Farrell paf at cray.com
Sun Oct 22 11:21:43 PDT 2017


Thomas,

This is likely a reflection of an older issue that has since been resolved.  For a long time, Lustre reserved max_rpcs_in_flight*max_pages_per_rpc pages of memory for each OST (on the client).  This was a huge memory commitment in larger setups, but it was fixed a few versions back, and per-OST memory usage on the client is now pretty trivial when the client isn’t doing I/O to that OST.  The main arguments against large OST counts are probably the pain of managing larger numbers of them, and individual OSTs being slow (because each one uses fewer disks), which requires users to stripe files more widely to see the benefit.  Wider striping is both an administrative burden for users and uses more space on the metadata server to track the file layouts.
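
To give a feel for the size of that old reservation, here is a rough back-of-the-envelope sketch in Python. The tunable values (8 RPCs in flight, 256 pages per RPC, 4 KiB pages) are assumptions chosen for illustration and should be checked against the actual client settings; only the OST count of 550 comes from the original question.

```python
# Rough estimate of the old per-OST client reservation:
#   max_rpcs_in_flight * max_pages_per_rpc pages, for every OST.
# The three values below are ASSUMED for illustration, not read from a system.
PAGE_SIZE = 4096          # bytes per page (typical x86_64)
MAX_RPCS_IN_FLIGHT = 8    # assumed client tunable
MAX_PAGES_PER_RPC = 256   # assumed client tunable (1 MiB RPCs with 4 KiB pages)


def reserved_bytes(num_osts,
                   rpcs=MAX_RPCS_IN_FLIGHT,
                   pages=MAX_PAGES_PER_RPC,
                   page_size=PAGE_SIZE):
    """Bytes a client would reserve across num_osts OSTs under the old scheme."""
    return num_osts * rpcs * pages * page_size


if __name__ == "__main__":
    for osts in (1, 550, 2000):
        gib = reserved_bytes(osts) / 2**30
        print(f"{osts:5d} OSTs -> {gib:.2f} GiB reserved per client")
```

Under these assumed settings the reservation grows linearly with the OST count, which is why it mattered on large systems and why larger RPC sizes would have made it worse.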

But if your MDT is large and your users are amenable to thinking about striping (or you set a good default striping policy - progressive file layouts from 2.10 are wonderful for this), then it’s probably fine.  The largest OST counts I am aware of are in the low thousands.
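
As a sketch of why a progressive layout works as a default policy, the Python below models a layout as a list of extents, each with its own stripe count, so that small files stay on one OST and only large files are spread widely. The extent boundaries and stripe counts are hypothetical, and this is a toy model of the idea, not Lustre's implementation.

```python
# Toy model of a progressive file layout: each component covers a byte
# range of the file and has its own stripe count.
# The boundaries and counts below are HYPOTHETICAL, for illustration only.
MiB = 2**20

LAYOUT = [
    (4 * MiB, 1),      # first 4 MiB: a single stripe
    (256 * MiB, 4),    # up to 256 MiB: four stripes
    (None, 16),        # rest of the file (None = to end of file): 16 stripes
]


def stripe_count_at(offset, layout=LAYOUT):
    """Return the stripe count of the component covering a byte offset."""
    for end, count in layout:
        if end is None or offset < end:
            return count
    raise ValueError("layout has no component covering this offset")


if __name__ == "__main__":
    for size in (1 * MiB, 100 * MiB, 10 * 1024 * MiB):
        # Look at the last byte of the file to see how wide it ends up.
        print(size // MiB, "MiB file ends in a component with",
              stripe_count_at(size - 1), "stripe(s)")
```

With a layout of this shape, users of small files never have to think about striping, while large files pick up more OSTs automatically as they grow.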

Ah, one more thing - clients must ping every OST periodically if they haven’t otherwise contacted it within the required interval.  This can contribute to network traffic and CPU noise/jitter on the clients.  I don’t have a good sense of how serious this is in practice, but I know some larger sites worry about it.
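
For a rough sense of scale, the Python sketch below computes an upper bound on ping traffic, assuming every client is idle towards every OST and pings each one once per interval. The client count and the 25-second interval are assumptions for illustration; the real interval depends on the site's timeout settings.

```python
# Upper bound on keepalive pings: every client pings every OST once per
# interval. Real traffic is lower, since OSTs a client has contacted
# recently do not need a separate ping.
def pings_per_second(num_clients, num_osts, ping_interval_s):
    """Worst-case pings per second across the whole file system."""
    return num_clients * num_osts / ping_interval_s


if __name__ == "__main__":
    # All inputs except the OST count from the original question are ASSUMED.
    clients, osts, interval = 1000, 550, 25
    print("per client:", pings_per_second(1, osts, interval), "pings/s")
    print("whole system:", pings_per_second(clients, osts, interval), "pings/s")
```

The per-client figure is what shows up as jitter on compute nodes; the system-wide figure is what the servers and network have to absorb.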

- Patrick


________________________________
From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of Thomas Roth <t.roth at gsi.de>
Sent: Sunday, October 22, 2017 9:04:35 AM
To: Lustre Discuss
Subject: [lustre-discuss] ZFS-OST layout, number of OSTs

Hi all,

I have done some "fio" benchmarking, amongst other things to test the proposition that to get more IOPS, the number of disks per raidz should be smaller.
I was happy I could reproduce that: one server with 30 disks in one raidz2 (= one zpool = one OST) is indeed slower than one with 30 disks in three raidz2 (one zpool, one OST).
I also ran fio on a third server where the 30 disks make up 3 raidz2 = 3 zpools = 3 OSTs; that one is faster still.

Now I seem to remember a warning not to have too many OSTs in one Lustre file system, because each OST eats some memory on the client. I haven't found that reference, and I would like to ask what the critical numbers might be. How much RAM are we talking about? Is there any other "wise" limit on the number of OSTs?
Currently our clients are equipped with 128 or 256 GB RAM.  We have 550 OSTs in the system, but the next cluster could easily grow much larger if we stick to the small OSTs.

Regards,
Thomas