[lustre-discuss] separate SSD only filesystem including HDD

Zeeshan Ali Shah javaclinic at gmail.com
Tue Aug 28 23:14:56 PDT 2018


Thanks a lot, Patrick, for the detailed answer. I tried GNU parallel with dd
and overall the throughput increased locally .. you are right, it is due to
the client-side single-thread issue.
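
For reference, the local test looked roughly like this (a sketch only; paths,
job count and sizes are placeholders, not the exact command):

seq 1 8 | parallel -j 8 dd if=/dev/zero of=/ssd/test.{} bs=16M count=1024 oflag=direct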

What about the 2nd challenge, exporting a bunch of NVMes from a single server
as a shared volume? I tried GlusterFS (very slow due to DHT); Lustre, by
creating another filesystem in our existing MDT, could be an option. I also
tried exporting a single NVMe over Fabrics (NVMe-oF) target, which looks
promising, but I am looking for a shared-volume kind of setup ...
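
For the Lustre option, I imagine something roughly like the following (a sketch
only; the fsname, NIDs and device paths are made up, and I assume the new
filesystem would register with our existing MGS and get its own MDT):

# on an MDS: a small MDT for the new SSD filesystem, registered with the existing MGS
mkfs.lustre --fsname=ssdfs --mgsnode=mgs@o2ib --mdt --index=0 /dev/mdt_ssd_dev

# on the NVMe server: one OST per NVMe device (or per small zpool)
mkfs.lustre --fsname=ssdfs --mgsnode=mgs@o2ib --ost --index=0 /dev/nvme0n1

# on the compute nodes
mount -t lustre mgs@o2ib:/ssdfs /mnt/ssdfs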

Any advice?


/Zeeshan



On Tue, Aug 28, 2018 at 6:37 PM Patrick Farrell <paf at cray.com> wrote:

> Hmm – It’s possible you’ve got an issue, but I think more likely is that
> your chosen benchmarks aren’t capable of showing the higher speed.
>
>
>
> I’m not really sure about your fio test - writing 4K random blocks will be
> relatively slow and might not speed up with more disks, but I can’t speak
> to it in detail for fio.  I would try a much larger size and possibly more
> processes (is numjobs the number of concurrent processes?)…
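>
> For instance, something in this direction, just to illustrate larger,
> sequential, multi-process I/O (untested; the directory and sizes are just
> placeholders):
>
> fio --name=seqwrite --ioengine=libaio --iodepth=16 --rw=write --bs=1M \
>     --direct=1 --size=20G --numjobs=8 --directory=/ssd --group_reporting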
>
>
>
> But I am sure about your other two:
> Both of those tests (dd and cp) are single threaded, and if they’re
> running to Lustre (rather than to the ZFS volume directly), 1.3 GB/s is
> around the maximum expected speed.  On a recent Xeon, one process can write
> a maximum of about 1-1.5 GB/s to Lustre, depending on various details.
> Improving disk speed won’t affect that limit for one process; it’s a
> client-side thing.  Try several processes at once, ideally from multiple clients
> (and definitely writing to multiple files), if you really want to see your
> OST bandwidth limit.
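>
> E.g., from each client, something like this (illustrative only; the
> directory is a placeholder):
>
> for i in $(seq 1 8); do
>   dd if=/dev/zero of=/lustre/test/$(hostname).$i bs=16M count=1024 oflag=direct &
> done
> wait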
>
>
>
> Also, a block size of 10GB is **way** too big for dd and will harm
> performance.  It’s going to cause a slowdown vs. a smaller block size, like
> 16M or something.
>
>
>
> There’s also a limit on how fast /dev/zero can be read, especially with
> really large block sizes [it cannot provide 10 GiB of zeroes at a time,
> that’s why you had to add the “fullblock” flag, which is doing multiple
> reads (and writes)].  Here’s a quick sample on a system here, writing to
> /dev/null (so there is no real limit on the write bandwidth of the
> destination):
>
> dd if=/dev/zero bs=10G of=/dev/null count=1
>
> 0+1 records in
>
> 0+1 records out
>
> 2147479552 bytes (2.1 GB, 2.0 GiB) copied, 1.66292 s, 1.3 GB/s
>
>
>
> Notice that it’s 1.3 GB/s, the same as your result.
>
>
>
> Try 16M instead:
>
> saturn-p2:/lus # dd if=/dev/zero bs=16M of=/dev/null count=1024
>
> 1024+0 records in
>
> 1024+0 records out
>
> 17179869184 bytes (17 GB, 16 GiB) copied, 7.38402 s, 2.3 GB/s
>
>
>
> Also note that multiple dds reading from /dev/zero will run into issues
> with the bandwidth of /dev/zero.  /dev/zero is different from what most people
> assume – One would think it just magically spews zeroes at any rate needed,
> but it’s not really designed to be read at high speed and actually isn’t
> that fast.  If you really want to test high speed storage, you may need a
> tool that allocates memory and writes that out, not just dd.  (ior is one
> example)
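>
> Something like this, for example (not tuned, just the shape of it; the path
> is a placeholder):
>
> mpirun -np 16 ior -w -t 16m -b 4g -F -o /lustre/test/ior.dat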
>
>
>
> *From: *Zeeshan Ali Shah <javaclinic at gmail.com>
> *Date: *Tuesday, August 28, 2018 at 9:52 AM
> *To: *Patrick Farrell <paf at cray.com>
> *Cc: *"lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
> *Subject: *Re: [lustre-discuss] separate SSD only filesystem including HDD
>
>
>
> 1) fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite
> --bs=4k --direct=0 --size=20G --numjobs=4 --runtime=240 --group_reporting
>
>
>
> 2) time cp x x2
>
>
>
> 3) and dd if=/dev/zero of=/ssd/d.data bs=10G count=4 iflag=fullblock
>
>
>
> any other way to test this plz let me know
>
>
>
> /Zee
>
>
>
>
>
>
>
> On Tue, Aug 28, 2018 at 3:54 PM Patrick Farrell <paf at cray.com> wrote:
>
> How are you measuring write speed?
>
> ------------------------------
>
> *From:* lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on
> behalf of Zeeshan Ali Shah <javaclinic at gmail.com>
> *Sent:* Tuesday, August 28, 2018 1:30:03 AM
> *To:* lustre-discuss at lists.lustre.org
> *Subject:* [lustre-discuss] separate SSD only filesystem including HDD
>
>
>
> Dear All, I recently deployed a 10PB+ Lustre solution which is working fine.
> Recently, for a genomic pipeline, we acquired additional racks with dedicated
> compute nodes and a single 24-NVMe SSD server per rack.  Each SSD server is
> connected to the compute nodes via 100 G Omni-Path.
>
>
>
> Issue 1: when I combine the SSDs in stripe mode using ZFS, we do not scale
> linearly in terms of performance.  For example, a single SSD's write speed is
> 1.3 GB/s; adding 5 of those in stripe mode should give us close to 1.3 x 5,
> but we still get 1.3 GB/s out of those 5 SSDs.
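>
> While a test is writing, per-device throughput can be watched with something
> like the following (the pool name is just a placeholder), to check whether
> all 5 SSDs are actually being hit:
>
> zpool iostat -v ssdpool 1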
>
>
>
> Issue 2: if we resolve issue #1, the 2nd challenge is to expose the 24 NVMes
> to the compute nodes in a distributed and parallel way; NFS is not an option ..
> we tried GlusterFS, but due to its DHT it is slow..
>
>
>
> I am thinking of adding another filesystem to our existing MDT and installing
> OSTs/OSS on the NVMe server, mounting this specific SSD filesystem where
> needed.  So basically we will end up having two filesystems (one with the
> normal 10PB+ and a 2nd with SSD)..
>
>
> Does this sound correct?
>
>
>
> any other advice please ..
>
>
>
>
>
> /Zeeshan
>
>
>
>
>
>