[lustre-discuss] separate SSD only filesystem including HDD

Zeeshan Ali Shah javaclinic at gmail.com
Fri Aug 31 07:26:55 PDT 2018


Dear Adilger, it is a single server with 24 NVMe devices and a 100G OPA card

On Fri, Aug 31, 2018 at 11:20 AM Andreas Dilger <adilger at whamcloud.com>
wrote:

> Just to confirm, there is only a single NVMe device in each server node,
> or there is a single server with 24 NVMe devices in it?
>
> Depending on what you want to use the NVMe storage for (e.g. very fast
> short-term scratch == burst buffer) it may be OK to just make a Lustre
> filesystem with each NVMe device a separate OST with no redundancy.  The
> failure rate for these devices is low, and adding redundancy will hurt
> performance.
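>
> As a rough, untested sketch (fsname, MGS NID, and device names below are
> placeholders), each NVMe would be formatted as its own OST with a unique
> index and mounted on the server, e.g. for the first device:
>
> mkfs.lustre --ost --fsname=nvmefs --index=0 \
>     --mgsnode=10.0.0.1@o2ib /dev/nvme0n1
> mount -t lustre /dev/nvme0n1 /mnt/lustre/ost0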
>
> Cheers, Andreas
>
> On Aug 29, 2018, at 00:14, Zeeshan Ali Shah <javaclinic at gmail.com> wrote:
> >
> > Thanks a lot Patrick for the detailed answer. I tried GNU parallel with
> > dd and overall the throughput increased locally .. you are right, it is
> > due to the client-side single-thread issue.
> >
> > What about the 2nd challenge, exporting a bunch of NVMes from a single
> > server as a shared volume? I tried GlusterFS (very slow due to DHT);
> > Lustre, by creating another filesystem against our existing MDT, could
> > be an option. I also tried exporting a single NVMe over Fabrics
> > (NVMe-oF), which looks promising, but I am looking for a shared-volume
> > kind of setup ...
> >
> > any advice ?
>
>
> > On Tue, Aug 28, 2018 at 6:37 PM Patrick Farrell <paf at cray.com> wrote:
> >> Hmm – It’s possible you’ve got an issue, but I think it’s more likely
> >> that your chosen benchmarks aren’t capable of showing the higher speed.
> >>
> >> I’m not really sure about your fio test - writing 4K random blocks will
> be relatively slow and might not speed up with more disks, but I can’t
> speak to it in detail for fio.  I would try a much larger size and possibly
> more processes (is numjobs the number of concurrent processes?)…
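> >>
> >> As a rough, untested sketch, something like the following (the size,
> >> job count, and target directory are only illustrative) would exercise
> >> larger sequential writes with more concurrency:
> >>
> >> fio --name=seqwrite --ioengine=libaio --iodepth=16 --rw=write \
> >>     --bs=16m --direct=1 --size=20G --numjobs=8 \
> >>     --directory=/ssd --group_reporting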
> >>
> >> But I am sure about your other two:
> >> Both of those tests (dd and cp) are single threaded, and if they’re
> running to Lustre (rather than to the ZFS volume directly), 1.3 GB/s is
> around the maximum expected speed.  On a recent Xeon, one process can write
> a maximum of about 1-1.5 GB/s to Lustre, depending on various details.
> Improving disk speed won’t affect that limit for one process, it’s a client
> side thing.  Try several processes at once, ideally from multiple clients
> (and definitely writing to multiple files), if you really want to see your
> OST bandwidth limit.
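> >>
> >> A minimal sketch of that idea (file names, path, and counts are
> >> arbitrary) is simply several dd processes writing to separate files on
> >> the Lustre mount:
> >>
> >> for i in $(seq 1 8); do
> >>     dd if=/dev/zero of=/lustre/testdir/file$i bs=16M count=1024 &
> >> done
> >> wait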
> >>
> >> Also, a block size of 10GB is *way* too big for DD and will harm
> performance.  It’s going to cause slowdown vs a smaller block size, like
> 16M or something.
> >>
> >>
> >> There’s also a limit on how fast /dev/zero can be read, especially with
> really large block sizes [it cannot provide 10 GiB of zeroes at a time,
> that’s why you had to add the “fullblock” flag, which is doing multiple
> reads (and writes)].  Here’s a quick sample on a system here, writing to
> /dev/null (so there is no real limit on the write bandwidth of the
> destination):
> >>
> >> dd if=/dev/zero bs=10G of=/dev/null count=1
> >> 0+1 records in
> >> 0+1 records out
> >> 2147479552 bytes (2.1 GB, 2.0 GiB) copied, 1.66292 s, 1.3 GB/s
> >>
> >> Notice that 1.3 GB/s, the same as your result.
> >>
> >>
> >>
> >> Try 16M instead:
> >>
> >> saturn-p2:/lus # dd if=/dev/zero bs=16M of=/dev/null count=1024
> >> 1024+0 records in
> >> 1024+0 records out
> >> 17179869184 bytes (17 GB, 16 GiB) copied, 7.38402 s, 2.3 GB/s
> >>
> >>
> >>
> >> Also note that multiple dds reading from /dev/zero will run into
> >> issues with the bandwidth of /dev/zero.  /dev/zero is different from
> >> what most people assume – one would think it just magically spews
> >> zeroes at any rate needed, but it’s not really designed to be read at
> >> high speed and actually isn’t that fast.  If you really want to test
> >> high-speed storage, you may need a tool that allocates memory and
> >> writes that out, not just dd (ior is one example).
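> >>
> >> A typical ior run, as a sketch only (the process count, transfer size,
> >> block size, and file path below are placeholders), would look
> >> something like:
> >>
> >> mpirun -np 8 ior -w -t 16m -b 4g -F -o /lustre/testdir/iorfile
> >>
> >> where -F gives each process its own file.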
> >>
> >>
> >>
> >>> From: Zeeshan Ali Shah <javaclinic at gmail.com>
> >>> Date: Tuesday, August 28, 2018 at 9:52 AM
> >>> To: Patrick Farrell <paf at cray.com>
> >>> Cc: "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org
> >
> >>> Subject: Re: [lustre-discuss] separate SSD only filesystem including
> HDD
> >>>
> >>> 1) fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite
> --bs=4k --direct=0 --size=20G --numjobs=4 --runtime=240 --group_reporting
> >>> 2) time cp x x2
> >>> 3) and dd if=/dev/zero of=/ssd/d.data bs=10G count=4 iflag=fullblock
> >>>
> >>> any other way to test this, please let me know
> >>>
> >>> /Zee
> >>>
> >>>
> >>>
> >>>> On Tue, Aug 28, 2018 at 3:54 PM Patrick Farrell <paf at cray.com> wrote:
> >>>>
> >>>> How are you measuring write speed?
> >>>>
> >>>>
> >>>>> From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on
> behalf of Zeeshan Ali Shah <javaclinic at gmail.com>
> >>>>> Sent: Tuesday, August 28, 2018 1:30:03 AM
> >>>>> To: lustre-discuss at lists.lustre.org
> >>>>> Subject: [lustre-discuss] separate SSD only filesystem including HDD
> >>>>>
> >>>>>
> >>>>>
> >>>>> Dear All, I recently deployed a 10PB+ Lustre solution which is
> >>>>> working fine. Recently, for a genomics pipeline, we acquired
> >>>>> additional racks with dedicated compute nodes and a single 24-NVMe
> >>>>> SSD server per rack. Each SSD server is connected to the compute
> >>>>> nodes via 100G Omni-Path.
> >>>>>
> >>>>>
> >>>>>
> >>>>> Issue 1: when I combine the SSDs in stripe mode using ZFS, the
> >>>>> performance does not scale linearly. For example, a single SSD
> >>>>> writes at 1.3 GB/s, so adding 5 of them in stripe mode should give
> >>>>> roughly 5 x 1.3 GB/s or somewhat less, but we still get only
> >>>>> 1.3 GB/s out of those 5 SSDs.
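> >>>>>
> >>>>> (For illustration only: the stripe is just a plain pool of the
> >>>>> devices, roughly "zpool create nvmepool nvme0n1 nvme1n1 nvme2n1
> >>>>> nvme3n1 nvme4n1", and per-device activity during a test can be
> >>>>> watched with "zpool iostat -v nvmepool 1"; pool and device names
> >>>>> are placeholders.)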
> >>>>>
> >>>>>
> >>>>>
> >>>>> Issue 2: if we resolve issue #1, the 2nd challenge is to expose the
> >>>>> 24 NVMes to the compute nodes in a distributed, parallel way. NFS is
> >>>>> not an option .. we tried GlusterFS, but due to its DHT it is slow..
> >>>>>
> >>>>>
> >>>>>
> >>>>> I am thinking of adding another filesystem to our existing MDT and
> >>>>> installing OSTs/OSS on the NVMe server, mounting this SSD filesystem
> >>>>> where needed. So basically we would end up having two filesystems
> >>>>> (one with the normal 10PB+ storage and a 2nd with SSDs)..
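> >>>>>
> >>>>> As a rough, untested sketch of that layout (fsname, MGS NID, and
> >>>>> device names are placeholders), the SSD filesystem would get its
> >>>>> own MDT and OSTs while reusing the existing MGS:
> >>>>>
> >>>>> mkfs.lustre --mdt --fsname=ssdfs --index=0 \
> >>>>>     --mgsnode=10.0.0.1@o2ib /dev/nvme0n1
> >>>>> mkfs.lustre --ost --fsname=ssdfs --index=0 \
> >>>>>     --mgsnode=10.0.0.1@o2ib /dev/nvme1n1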
> >>>>>
> >>>>>
> >>>>> Does this sound correct?
> >>>>>
> >>>>>
> >>>>>
> >>>>> any other advice please ..
>
> Cheers, Andreas
> ---
> Andreas Dilger
> Principal Lustre Architect
> Whamcloud
>