[Lustre-discuss] Performance Expectations of Lustre

Nick Jennings nick at creativemotiondesign.com
Mon Jan 26 10:54:40 PST 2009


Hi Brian! Thanks for the reply; comments below.

Brian J. Murrell wrote:
>>   Instead of just adding another 1TB server, I need to plan for a more 
>> scalable solution. Immediately Lustre came to mind, but I'm wondering 
>> about the performance. Basically our company does niche web-hosting for 
>> "Creative Professionals" so we need fast access to the data in order to 
>> have snappy web services for our clients. Typically these are smaller 
>> files (2MB pictures, 50MB videos, .swf files, etc.).
> 
> Well, I'm not sure those files would fall within our general
> classification of "small files" (wherein we know we don't perform very
> well).  Our small-file issues are usually characterized by "kernel
> builds" and ~ use, where files are usually much smaller than 1MB.

  Aha, OK, well then that's good to know. There's also some kind of 
read-ahead and client-side caching, right? So files which are accessed 
frequently will be served faster.
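
For what it's worth, the read-ahead and client-side cache limits look 
tunable per client via lctl; something along these lines (parameter 
names are taken from the manual for the 1.6/1.8 series, so treat the 
exact names and values as my assumptions):

    # on a client, as root: inspect the current limits (values in MB)
    lctl get_param llite.*.max_read_ahead_mb llite.*.max_cached_mb

    # raise them for read-heavy, repeatedly-accessed content
    lctl set_param llite.*.max_read_ahead_mb=64
    lctl set_param llite.*.max_cached_mb=512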


>>   Also I'm wondering about the best way to set this up in terms of speed 
>> and ease of growth. I want the web-servers and the storage pool to be 
>> independent of each other, so I can add web-servers as the web traffic 
>> increases, and add more storage as our storage needs grow.
> 
> Well, your web-servers would be Lustre clients.  There is no fixed
> relationship, or rather no requirement, between the number of clients
> and the number of servers used.  You use as many servers as your
> client load demands.  So you could imagine both ends of the spectrum:
> relatively few clients could be enough to tax quite a few servers, or
> a lot of clients with modest demand could require only a few servers.
> 
>>   I was thinking initially we could start with 2 servers, both attached 
>> to the storage array, set up as OSSes and also functioning as 
>> (load-balanced) web-servers.
> 
> Sounds like you are describing 2 storage servers, which would require at
> least 3 servers total.  Don't forget about the MDS.  Also don't forget
> about HA if that's a concern for you.  You could make the 2 OSSes
> failover partners for each other if you are willing to accept degraded
> performance while one of the OSSes is down.
>
> If HA is important to you however, you need to address an MDS failover
> with a second server to pick up the MDT should the active MDS fail.
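
(Noting down how I read the failover part: each OST would be formatted 
with its partner OSS's NID listed as a failover node, and the MDT 
likewise with the standby MDS, roughly as below. Hostnames, NIDs and 
device names are placeholders I made up.)

    # on oss1, formatting an OST whose failover partner is oss2
    mkfs.lustre --fsname=webfs --ost --mgsnode=mds1@tcp0 \
        --failnode=oss2@tcp0 /dev/sdb
    mount -t lustre /dev/sdb /mnt/ost0

    # clients mount with both MDS/MGS NIDs listed, so they can fail over
    mount -t lustre mds1@tcp0:mds2@tcp0:/webfs /mnt/webfs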

HA is definitely critical; if the storage pool becomes inaccessible we 
lose clients (and all fingers point at me!). However, I need to find a 
reasonable balance between cost, scalability and performance. The idea 
would be to start small, with the simplest configuration, but allow for 
a lot of growth. In a year's time, if we are using 5TB of data, we will 
be in a very good position financially and can afford a systems expansion.

So for starters, what can I get away with here? 1 OSS, 1 MDS & 1 client 
node? Is it a smart thing to do to have the MDS and OSS share the same 
storage target (just a separate partition for the MDS)? What kind of 
system specs are advisable for each node type (MDS, OSS & client) as far 
as RAM, CPU, disk configuration etc.? Also, is it possible to add more 
OSSes to take over existing OSTs that another OSS was previously 
managing? i.e. if I have the MD3000i split into 5x1TB volumes (5 OSTs), 
and the OSS is getting hammered, I set up another OSS and hand off 2 or 
3 OSTs from the old OSS to the new one, and set it up as failover for 
the remaining OSTs. Doable?
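
What I'm picturing for the hand-off is roughly the following; hostnames 
and device paths are made up, and I'm going from my reading of the 
manual, so corrections welcome:

    # on the old OSS: stop serving the OST
    umount /mnt/ost3

    # on the new OSS (it has shared access to the same MD3000i LUN):
    # register its NID for this target if it wasn't already listed as a
    # --failnode at format time (a --writeconf may also be needed here,
    # I'm not certain)
    tunefs.lustre --failnode=oss2@tcp0 /dev/mapper/ost3

    # then bring the OST up on the new OSS
    mount -t lustre /dev/mapper/ost3 /mnt/ost3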



> As for OSSes being web-servers, that would require the OSS/web-servers
> also to be clients, and that is an unsupported configuration due to the
> risk of deadlock under memory pressure.  The recommended architecture
> would be to make the web-servers Lustre clients.

I see, so from the get-go I'm going to need an internal gigE network for 
OSS/Client communication.
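
If I understand the LNET side right, pointing everything at the internal 
NIC is just a module option on the servers and clients, along the lines 
of (the interface name is an assumption on my part):

    # /etc/modprobe.conf (or a file under /etc/modprobe.d/)
    options lnet networks="tcp0(eth1)"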


>> performance can I expect, am I out of touch to expect something similar 
>> to a directly attached RAID array?
> 
> I think the numbers we generally talk about are on the order of 80% of
> the raw storage bandwidth (assuming a capable network and so on).  Maybe
> somebody who is closer to the benchmarking that we are constantly doing
> can comment further on how close to raw disk we are achieving lately.

Is it safe to say my bottleneck is going to be the OSS & not the 
network? Is there some documentation I can read about typical setups, 
use cases & methods for optimal performance?
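
(For my own sanity-checking I was planning something crude like the 
following: compare a large streaming transfer through Lustre on a client 
against the raw speed of the OSS's backing LUN. File name, device and 
sizes are arbitrary; the lustre-iokit surveys are presumably the more 
rigorous way to do this.)

    # on a Lustre client: streaming write through the filesystem
    dd if=/dev/zero of=/mnt/webfs/ddtest bs=1M count=4096 oflag=direct

    # on the OSS: raw read speed of the backing device
    # (read-only on purpose -- writing here would corrupt the OST)
    dd if=/dev/sdb of=/dev/null bs=1M count=4096 iflag=direct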

Thanks!
-Nick


