[lustre-discuss] Lustre failover configuration - Need help in selecting storage

jeevan.patnaik at wipro.com jeevan.patnaik at wipro.com
Tue Mar 15 03:09:04 PDT 2016


Hi Malcolm,

Thanks! That's a very good point. I had almost forgotten that we should prefer a cross-failover configuration rather than individual failover.

So, using the same connection type is what we should prefer (to cover failover + load distribution).

--
Regards,
Jeevan.

From: Cowe, Malcolm J [mailto:malcolm.j.cowe at intel.com]
Sent: 15 March 2016 15:24
To: Jeevan Behara Patnaik (GIS) <jeevan.patnaik at wipro.com>; lustre-discuss at lists.lustre.org
Subject: RE: [lustre-discuss] Lustre failover configuration - Need help in selecting storage

One of the reasons to use the same connection type for the IO path to the storage is to ensure consistency in performance regardless of which server the storage is mounted on. However, there is another reason for using a symmetrical IO path: Lustre systems are designed to distribute the IO workload across multiple servers in parallel, maximising the available throughput across the network and the disk IO, with each server delivering the same level of performance.

The servers are usually configured into building blocks of paired clusters for HA (that is, 2 servers attached to one or more shared DAS arrays). The storage is split into multiple LUNs, with half of the LUNs presented to one server, half to the other server in the pair. This means that each server is able to transact IO, each server has the same performance characteristics and there are no idle or passive servers. The result is maximum utilisation and consistent performance across all the servers in the network.

For example, if you have 2 OSS servers (oss1, oss2) connected to a 60 disk tray split into 6 RAID 6 (8+2) LUNs, then 3 LUNs would be primary targets on oss1, 3 on oss2, and you'd allow the LUNs to migrate on failover. This way, each of the servers is active on the network, maximising the available throughput performance of the file system. Similarly, the MGT and MDT are commonly paired into a metadata server pair.
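
To make the layout in the example above concrete, here is a minimal Python sketch (an editorial illustration, not part of the original message). The server names and the 60-disk, 8+2 figures come from the example; the LUN names and the two function names are hypothetical, and the sketch only models the assignment, it does not configure Lustre or any HA software.

```python
# Illustrative sketch: oss1/oss2 and the 60-disk, 8+2 figures come from the
# example above; the LUN names and function names are hypothetical.
def plan_luns(disks, data, parity, servers):
    """Carve a tray into equal RAID groups and alternate primaries across the HA pair."""
    lun_count = disks // (data + parity)
    plan = []
    for i in range(lun_count):
        primary = servers[i % len(servers)]
        failover = servers[(i + 1) % len(servers)]
        plan.append({"lun": f"lun{i}", "primary": primary, "failover": failover})
    return plan


def hosts_after_failure(plan, failed):
    """Return where each LUN is served once the given server has failed."""
    return {
        e["lun"]: e["failover"] if e["primary"] == failed else e["primary"]
        for e in plan
    }


plan = plan_luns(disks=60, data=8, parity=2, servers=["oss1", "oss2"])
for entry in plan:
    print(entry)
print(hosts_after_failure(plan, failed="oss1"))
```

In normal operation each server is primary for three LUNs; if oss1 fails, all six are served from oss2 until oss1 returns.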

If you create an imbalance in the performance of each IO path, then the 2nd server is going to end up as a passive node only, rather than being another server to scale out the bandwidth.

I've attached some example diagrams (apologies to the list for the additional 120KB or so - not sure the list will accept it actually :) ), which highlight at a very high level a fairly well used pattern for the metadata and OSS servers for HA. Just pictures, but enough to get an idea of what Lustre is about. Where costs are a concern, one can also investigate use of JBODs, although they add their own complexity with regard to storage management (identifying failed disks, etc.). ZFS is gaining popularity as a storage platform but has its own challenges as well.

Malcolm Cowe
High Performance Data Division
Intel Corporation | www.intel.com

From: jeevan.patnaik at wipro.com [mailto:jeevan.patnaik at wipro.com]
Sent: Tuesday, March 15, 2016 6:18 PM
To: bevans at cray.com; lustre-discuss at lists.lustre.org; Cowe, Malcolm J
Subject: RE: [lustre-discuss] Lustre failover configuration - Need help in selecting storage

Thanks Ben and Malcolm,

Yes, now I have an idea of what to do. I had thought that a multi-port DAS enclosure that lets two servers share the same storage would be hard to find. Also, if cost is a concern, we can still use one directly attached primary node and a network-attached failover node.

--
Regards,
Jeevan.

From: Cowe, Malcolm J [mailto:malcolm.j.cowe at intel.com]
Sent: 15 March 2016 01:58
To: Jeevan Behara Patnaik (GIS) <jeevan.patnaik at wipro.com>; lustre-discuss at lists.lustre.org
Subject: RE: [lustre-discuss] Lustre failover configuration - Need help in selecting storage

Why not use a multi-ported direct attached storage (DAS) enclosure? Performance is retained and configuration is straightforward. There are a number of such enclosures available from a range of vendors, many of whom have solutions that have been qualified with Lustre.

Malcolm Cowe
High Performance Data Division
Intel Corporation | www.intel.com


From: Ben Evans [mailto:bevans at cray.com]
Sent: 14 March 2016 19:15
To: Jeevan Behara Patnaik (GIS) <jeevan.patnaik at wipro.com>; lustre-discuss at lists.lustre.org
Subject: Re: [lustre-discuss] Lustre failover configuration - Need help in selecting storage

You'll only go as fast as your slowest piece.

With that in mind, first figure out what sorts of bandwidth you can actually get across your chosen network type (per server).  That will dictate how fast you want your storage to be.  Benchmark it, make sure you can get the I/O over the wire that you think you can for that one server.

Next, find a disk system that can deliver that speed for you (you'll be able to get some of the info, but you'll want to benchmark that as well, with different RAID configurations, settings, etc.).  You may want to overprovision storage speed, since you probably won't be getting ideal throughput numbers.
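
The sizing logic in the two paragraphs above can be sketched in a few lines of Python (an editorial illustration, not part of the original message). Every number here is a hypothetical placeholder to be replaced with your own benchmark results, and the 25% headroom is only an assumed example of overprovisioning, not a recommendation from the thread.

```python
# All figures are hypothetical placeholders; substitute your own benchmark results.
def server_throughput(network_mb_s, storage_mb_s):
    """A server delivers no more than the slower of its network and storage paths."""
    return min(network_mb_s, storage_mb_s)


def storage_target(network_mb_s, headroom=0.25):
    """Storage speed to aim for: the measured network rate plus assumed headroom."""
    return network_mb_s * (1 + headroom)


measured_network = 1000.0  # MB/s per server, hypothetical network benchmark result
measured_storage = 800.0   # MB/s per server, hypothetical disk benchmark result

print("effective MB/s:", server_throughput(measured_network, measured_storage))
print("storage target MB/s:", storage_target(measured_network))
```

With these placeholder figures the storage path is the slowest piece, so it caps what the server can deliver regardless of the network.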

As to redundancy, there are a number of direct-attach systems that allow you to connect two servers to the same set of disks.  You don't need (or really want) anything fancy like a SAN.

Given the cost/performance ratios, you might also experiment with a few smaller OSTs made up of SSDs, or using something like flashcache on the MDT(s).

-Ben Evans

From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of "jeevan.patnaik at wipro.com" <jeevan.patnaik at wipro.com>
Date: Monday, March 14, 2016 at 8:36 AM
To: "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
Subject: [lustre-discuss] Lustre failover configuration - Need help in selecting storage



We need storage specifically for an HPC Lustre failover setup, where two servers must share the same block-level storage for the failover configuration.

With very limited knowledge of hardware, my understanding is as follows:
*         NAS can be used for shared storage, but the intermediate network will be a bottleneck for speed.
*         SAN can be used, but it is costly to implement and not really needed for storage of 50-100TB.
*         Even if we find a storage enclosure with multiple iSCSI ports, the storage can be used only by splitting it, i.e., it works as two storage devices and the same storage can't be used by both systems. (One thing to note here: in the Lustre setup, both servers would be attached, but only one would be used at a time. I am not sure how that is possible and need to check on this.)
*         Having two virtual machines may be one way to do it. But then it is not really helpful for the purpose of failover, as there would be only one physical machine.

But while posting the question, I am thinking that maybe we can compromise on speed with NAS if we have one directly attached server (primary) and the other attached via the network (failover), so that we face slowness only when the primary stops working.

I posted a similar question on Server Fault (http://serverfault.com/questions/763569/is-it-possible-to-have-a-directly-attached-shared-storage-accessed-at-block-lev) and got the following response:
"Have you actually attempted to set up a proof of concept, or at least looked through the documentation (http://doc.lustre.org/lustre_manual.xhtml)? Lustre really doesn't care very much how you connect to the underlying storage, so you can do whatever gets you the bandwidth you need."

So, is it true that we don't need to worry about bandwidth of the storage server?

I mean, for example: the communication as I understood is as follows:

==>  Client <----> MGS (Ethernet)

==>  MGS <----> MGT (Direct/ISCSI)

==>  MGS <----> MDS (Ethernet/Internal Communication)

==>  MDS <----> MDT (Direct/ISCSI/Ethernet)

==>  MDS <----> OSS (Ethernet)

==>  OSS <----> OST (Direct/ISCSI/Ethernet)

==>  OST <----> Client (Ethernet)

Does this mean that performance won't be affected at any stage if iSCSI is replaced by Ethernet, or if a link with limited bandwidth is used?








--
Thanks and Regards,
Jeevan Patnaik B| Project Engineer
Nokia IT - HEE Platform | WIPRO Technologies - Hyderabad
Mob: +91-9000607181| Off: +91-4030970347.

The information contained in this electronic message and any attachments to this message are intended for the exclusive use of the addressee(s) and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately and destroy all copies of this message and any attachments. WARNING: Computer viruses can be transmitted via email. The recipient should check this email and any attachments for the presence of viruses. The company accepts no liability for any damage caused by any virus transmitted by this email. www.wipro.com<http://www.wipro.com>