[lustre-discuss] separate SSD only filesystem including HDD

Patrick Farrell paf at cray.com
Tue Aug 28 08:37:05 PDT 2018


Hmm – It’s possible you’ve got an issue, but I think more likely is that your chosen benchmarks aren’t capable of showing the higher speed.

I’m not really sure about your fio test: writing 4K random blocks will be relatively slow and might not speed up with more disks, but I can’t speak to it in detail for fio.  I would try a much larger block size and possibly more processes (is numjobs the number of concurrent processes?)…
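For example (purely illustrative; the path, sizes, and job count are placeholders, not tuned for your setup), something along these lines should show more of what the hardware can do:

fio --name=seqwrite --directory=/ssd --ioengine=libaio --iodepth=16 --rw=write --bs=1M --direct=1 --size=20G --numjobs=8 --group_reporting

The larger block size plus several jobs with a deeper queue keeps enough I/O in flight to actually stress the disks rather than the benchmark itself.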

But I am sure about your other two:
Both of those tests (dd and cp) are single threaded, and if they’re running to Lustre (rather than to the ZFS volume directly), 1.3 GB/s is around the maximum expected speed.  On a recent Xeon, one process can write a maximum of about 1-1.5 GB/s to Lustre, depending on various details.  Improving disk speed won’t affect that limit for one process; it’s a client-side thing.  Try several processes at once, ideally from multiple clients (and definitely writing to multiple files), if you really want to see your OST bandwidth limit.
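As a rough sketch (file paths and counts are placeholders), several concurrent dd writers to separate files on the Lustre mount would look something like:

for i in 1 2 3 4; do
  dd if=/dev/zero of=/ssd/stripe_test.$i bs=16M count=1024 oflag=direct &
done
wait
# aggregate bandwidth = sum of the four writers; oflag=direct keeps the client page cache out of the picture

Run a copy of that loop from more than one client if you want to take the single-client limit out of the equation entirely.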

Also, a block size of 10GB is *way* too big for dd and will harm performance.  It’s going to cause a slowdown vs a smaller block size, like 16M or so.

There’s also a limit on how fast /dev/zero can be read, especially with really large block sizes [it cannot provide 10 GiB of zeroes at a time, which is why you had to add the “fullblock” flag, which does multiple reads (and writes)].  Here’s a quick sample on a system here, writing to /dev/null (so there is no real limit on the write bandwidth of the destination):
dd if=/dev/zero bs=10G of=/dev/null count=1
0+1 records in
0+1 records out
2147479552 bytes (2.1 GB, 2.0 GiB) copied, 1.66292 s, 1.3 GB/s

Notice that it’s 1.3 GB/s, the same as your result.

Try 16M instead:
saturn-p2:/lus # dd if=/dev/zero bs=16M of=/dev/null count=1024
1024+0 records in
1024+0 records out
17179869184 bytes (17 GB, 16 GiB) copied, 7.38402 s, 2.3 GB/s

Also note that multiple dds reading from /dev/zero will run into issues with the bandwidth of /dev/zero.  /dev/zero is different from what most people assume – one would think it just magically spews zeroes at any rate needed, but it’s not really designed to be read at high speed and actually isn’t that fast.  If you really want to test high-speed storage, you may need a tool that allocates memory and writes that out, not just dd.  (IOR is one example.)
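Purely as an illustration (the process count, sizes, and output path are placeholders), an IOR run of the kind meant here looks roughly like:

mpirun -np 8 ior -a POSIX -w -F -t 16m -b 4g -o /ssd/ior_testfile
# -F gives one file per process, -t is the transfer size, -b the amount written per process, -w limits it to the write phase

IOR fills its transfer buffers in memory, so you end up measuring the storage rather than how fast /dev/zero can be read.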

From: Zeeshan Ali Shah <javaclinic at gmail.com>
Date: Tuesday, August 28, 2018 at 9:52 AM
To: Patrick Farrell <paf at cray.com>
Cc: "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
Subject: Re: [lustre-discuss] separate SSD only filesystem including HDD

1) fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k --direct=0 --size=20G --numjobs=4 --runtime=240 --group_reporting

2) time cp x x2

3) and dd if=/dev/zero of=/ssd/d.data bs=10G count=4 iflag=fullblock

Any other ways to test this, please let me know.

/Zee



On Tue, Aug 28, 2018 at 3:54 PM Patrick Farrell <paf at cray.com<mailto:paf at cray.com>> wrote:
How are you measuring write speed?

________________________________
From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org<mailto:lustre-discuss-bounces at lists.lustre.org>> on behalf of Zeeshan Ali Shah <javaclinic at gmail.com<mailto:javaclinic at gmail.com>>
Sent: Tuesday, August 28, 2018 1:30:03 AM
To: lustre-discuss at lists.lustre.org<mailto:lustre-discuss at lists.lustre.org>
Subject: [lustre-discuss] separate SSD only filesystem including HDD

Dear All, I recently deployed a 10PB+ Lustre solution which is working fine. Recently, for a genomics pipeline, we acquired additional racks with dedicated compute nodes and a single 24-NVMe-SSD server per rack.  Each SSD server is connected to the compute nodes via 100G Omni-Path.

Issue 1: when I combine the SSDs in stripe mode using ZFS, we do not scale linearly in performance. For example, a single SSD writes at 1.3 GB/s, so adding 5 of those in stripe mode should give us something approaching 5 x 1.3 GB/s, but we still get only 1.3 GB/s out of those 5 SSDs.
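(To illustrate the layout: the striped pool is essentially the equivalent of the following; device names are placeholders.)

zpool create -f ssdpool nvme0n1 nvme1n1 nvme2n1 nvme3n1 nvme4n1
# plain devices with no raidz/mirror keyword are striped across, i.e. a "RAID-0" of vdevs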

Issue 2: once issue #1 is resolved, the second challenge is to expose the 24 NVMes to the compute nodes in a distributed, parallel fashion. NFS is not an option; we tried GlusterFS, but due to its DHT it is slow.

I am thinking of adding another filesystem to our existing MDT and installing OSTs/OSS on the NVMe server, mounting this SSD filesystem where needed. So basically we will end up having two filesystems (the normal 10PB+ one and a second, SSD-based one).
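For the sake of illustration only (the fsname, index, MGS NID, and pool/dataset names below are placeholders), each NVMe-backed OST would then be formatted and mounted roughly like:

mkfs.lustre --ost --backfstype=zfs --fsname=ssdfs --index=0 --mgsnode=<mgs NID> ostpool/ost0
mount -t lustre ostpool/ost0 /mnt/lustre/ost0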

Does this sound correct?

Any other advice would be appreciated.


/Zeeshan



