<div dir="ltr">You're also leaving out the corosync/pacemaker/stonith configuration. That is unless you are doing manual export/import of pools.</div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Feb 10, 2017 at 9:03 PM, Vicker, Darby (JSC-EG311) <span dir="ltr"><<a href="mailto:darby.vicker-1@nasa.gov" target="_blank">darby.vicker-1@nasa.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Sure.  Our hardware is very similar to this:<br>

<br>

<a href="https://www.supermicro.com/solutions/Lustre.cfm" rel="noreferrer" target="_blank">https://www.supermicro.com/<wbr>solutions/Lustre.cfm</a><br>

<br>

We are using twin servers instead of two single-chassis servers as shown there, but functionally this is the same; we can just fit more into a single rack with the twin servers.  We are using a single JBOD per twin server, as shown in one of the configurations on the above page, with ZFS as the backend.  All servers are dual-homed on both Ethernet and IB.  The combined MGS/MDS is at 10.148.0.30 on IB and X.X.98.30 on Ethernet; the secondary MDS/MGS is at the .31 address on both networks.  Because the MGS and MDS are combined, they fail over together.  Getting MGS failover to work properly required the patch from LU-8397, so we are running 2.9.0 with that patch and compiling our own server RPMs.  That is pretty simple with ZFS, since you don't need a patched kernel.  The Lustre formatting and configuration bits are below.  I'm leaving out the ZFS pool creation, but I think you get the idea.<br>

<br>

I hope that helps.<br>

<br>

Darby<br>

<br>

<br>

<br>

if [[ $HOSTNAME == *mds* ]] ; then<br>

<br>

    mkfs.lustre \<br>

        --fsname=hpfs-fsl \<br>

        --backfstype=zfs \<br>

        --reformat \<br>

        --verbose \<br>

        --mgs --mdt --index=0 \<br>

        --servicenode=${LUSTRE_LOCAL_TCP_IP}@tcp0,${LUSTRE_LOCAL_IB_IP}@o2ib0 \<br>

        --servicenode=${LUSTRE_PEER_TCP_IP}@tcp0,${LUSTRE_PEER_IB_IP}@o2ib0 \<br>

        metadata/meta-fsl<br>

<br>

elif [[ $HOSTNAME == *oss* ]] ; then<br>

<br>

   num=`hostname --short | sed 's/hpfs-fsl-//' | sed 's/oss//'`<br>

   num=`printf '%g' $num`<br>

<br>

   mkfs.lustre \<br>

       --mgsnode=X.X.98.30@tcp0,10.148.0.30@o2ib0 \<br>

       --mgsnode=X.X.98.31@tcp0,10.148.0.31@o2ib0 \<br>

       --fsname=hpfs-fsl \<br>

       --backfstype=zfs \<br>

       --reformat \<br>

       --verbose \<br>

       --ost --index=$num \<br>

       --servicenode=${LUSTRE_LOCAL_TCP_IP}@tcp0,${LUSTRE_LOCAL_IB_IP}@o2ib0 \<br>

       --servicenode=${LUSTRE_PEER_TCP_IP}@tcp0,${LUSTRE_PEER_IB_IP}@o2ib0 \<br>

       $pool/ost-fsl<br>

fi<br>
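
As a side note, the hostname-to-index step in the OSS branch above can also be sketched in plain shell, without sed or printf '%g'.  This is only a sketch under assumptions: ost_index is a hypothetical helper name, and it assumes host names end in "oss" followed by decimal digits, as they do here.  Leading zeros are stripped so that a value like 08 is never read as octal.<br>

```shell
# Sketch: derive a decimal OST index from a host name such as hpfs-fsl-oss08.
# ost_index is a hypothetical helper, not part of Lustre.
# The body runs in a subshell, so its variables do not leak to the caller.
ost_index() (
    host=${1%%.*}            # drop any domain part, like `hostname --short`
    digits=${host##*oss}     # keep whatever follows the last "oss"
    case $digits in
        ''|*[!0-9]*) return 1 ;;   # no purely numeric suffix: refuse to guess
    esac
    # Strip leading zeros but keep a lone "0" (oss00 maps to index 0).
    while :; do
        case $digits in
            0?*) digits=${digits#0} ;;
            *) break ;;
        esac
    done
    printf '%s\n' "$digits"
)
```

For example, ost_index hpfs-fsl-oss08 prints 8, and the call fails for a name such as hpfs-fsl-mds0, so a script can stop before passing a bad value to --index.<br>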

<br>

<br>

<br>

<br>

/etc/ldev.conf:<br>

<br>

#local  foreign/-  label       [md|zfs:]device-path   [journal-path]/- [raidtab]<br>

<br>

hpfs-fsl-mds0  hpfs-fsl-mds1  hpfs-fsl-MDT0000  zfs:metadata/meta-fsl<br>

<br>

hpfs-fsl-oss00 hpfs-fsl-oss01 hpfs-fsl-OST0000  zfs:oss00-0/ost-fsl<br>

hpfs-fsl-oss01 hpfs-fsl-oss00 hpfs-fsl-OST0001  zfs:oss01-0/ost-fsl<br>

hpfs-fsl-oss02 hpfs-fsl-oss03 hpfs-fsl-OST0002  zfs:oss02-0/ost-fsl<br>

hpfs-fsl-oss03 hpfs-fsl-oss02 hpfs-fsl-OST0003  zfs:oss03-0/ost-fsl<br>

hpfs-fsl-oss04 hpfs-fsl-oss05 hpfs-fsl-OST0004  zfs:oss04-0/ost-fsl<br>

hpfs-fsl-oss05 hpfs-fsl-oss04 hpfs-fsl-OST0005  zfs:oss05-0/ost-fsl<br>

hpfs-fsl-oss06 hpfs-fsl-oss07 hpfs-fsl-OST0006  zfs:oss06-0/ost-fsl<br>

hpfs-fsl-oss07 hpfs-fsl-oss06 hpfs-fsl-OST0007  zfs:oss07-0/ost-fsl<br>

hpfs-fsl-oss08 hpfs-fsl-oss09 hpfs-fsl-OST0008  zfs:oss08-0/ost-fsl<br>

hpfs-fsl-oss09 hpfs-fsl-oss08 hpfs-fsl-OST0009  zfs:oss09-0/ost-fsl<br>

hpfs-fsl-oss10 hpfs-fsl-oss11 hpfs-fsl-OST000a  zfs:oss10-0/ost-fsl<br>

hpfs-fsl-oss11 hpfs-fsl-oss10 hpfs-fsl-OST000b  zfs:oss11-0/ost-fsl<br>
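
The OSS lines above follow a regular pattern: nodes are paired (0,1), (2,3) and so on, each node lists its partner as the foreign host, and the target label carries the index as four hex digits (which is why oss10 maps to OST000a).  A minimal sketch that regenerates those lines is below.  gen_ldev_oss is a hypothetical helper, and the host, label, and pool names are simply copied from the listing above, so treat them as assumptions for any other site.<br>

```shell
# Sketch: print ldev.conf lines for OSS failover pairs (0,1), (2,3), ...
# gen_ldev_oss is a hypothetical helper; the naming scheme is taken from
# the listing in this message and is an assumption anywhere else.
gen_ldev_oss() (
    count=$1                 # number of OSS nodes, assumed to be even
    i=0
    while [ "$i" -lt "$count" ]; do
        # Even-numbered nodes pair with the next node, odd with the previous.
        if [ $((i % 2)) -eq 0 ]; then
            peer=$((i + 1))
        else
            peer=$((i - 1))
        fi
        # Host and pool names use two decimal digits; the label uses hex.
        printf 'hpfs-fsl-oss%02d hpfs-fsl-oss%02d hpfs-fsl-OST%04x  zfs:oss%02d-0/ost-fsl\n' \
            "$i" "$peer" "$i" "$i"
        i=$((i + 1))
    done
)
```

Running gen_ldev_oss 12 should reproduce the twelve OSS lines above, which makes it easy to compare the output against the file on each server and catch a mistyped label or partner.<br>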

<br>

<br>

<br>

<br>

/etc/modprobe.d/lustre.conf:<br>

<br>

options lnet networks=tcp0(enp4s0),o2ib0(ib1)<br>

options ko2iblnd map_on_demand=32<br>
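
The networks value above maps each LNet network to one interface, and those network names are the same tcp0 and o2ib0 suffixes used in the --servicenode and --mgsnode arguments.  A small sketch for splitting that value into network/interface pairs is below, for example to check that each interface exists before loading the module.  lnet_pairs is a hypothetical helper, and it assumes exactly one interface per network, as in this configuration.<br>

```shell
# Sketch: split an lnet "networks=" value into "network interface" pairs.
# lnet_pairs is a hypothetical helper; it assumes one interface per network,
# e.g. tcp0(enp4s0),o2ib0(ib1), and does not handle multi-interface entries.
lnet_pairs() (
    set -f                   # no pathname expansion while splitting
    IFS=,                    # split the value on commas (subshell-local)
    for entry in $1; do
        net=${entry%%\(*}    # text before the opening parenthesis
        iface=${entry#*\(}   # text after the opening parenthesis
        iface=${iface%\)}    # drop the closing parenthesis
        printf '%s %s\n' "$net" "$iface"
    done
)
```

Called with the value from the options line above, it prints one line for tcp0 with enp4s0 and one for o2ib0 with ib1.<br>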

<div class="HOEnZb"><div class="h5"><br>

-----Original Message-----<br>

From: Brian Andrus <<a href="mailto:toomuchit@gmail.com">toomuchit@gmail.com</a>><br>

Date: Friday, February 10, 2017 at 12:07 AM<br>

To: Darby Vicker <<a href="mailto:darby.vicker-1@nasa.gov">darby.vicker-1@nasa.gov</a>>, Ben Evans <<a href="mailto:bevans@cray.com">bevans@cray.com</a>>, "<a href="mailto:lustre-discuss@lists.lustre.org">lustre-discuss@lists.lustre.<wbr>org</a>" <<a href="mailto:lustre-discuss@lists.lustre.org">lustre-discuss@lists.lustre.<wbr>org</a>><br>

Subject: Re: [lustre-discuss] design to enable kernel updates<br>

<br>

Darby,<br>

<br>

Do you mind if I inquire about the setup for your lustre systems?<br>

I'm trying to understand how the MGS/MGT is set up for high availability.<br>

I understand with OSTs and MDTs where all I really need is to have the<br>

failnode set when I do the mkfs.lustre.<br>

However, as I understand it, you have to use something like pacemaker<br>

and drbd to deal with the MGS/MGT. Is this how you approached it?<br>

<br>

Brian Andrus<br>

<br>

<br>

<br>

On 2/6/2017 12:58 PM, Vicker, Darby (JSC-EG311) wrote:<br>

> Agreed.  We are just about to go into production on our next LFS with the<br>

> setup described.  We had to get past a bug in the MGS failover for<br>

> dual-homed servers but as of last week that is done and everything is<br>

> working great (see "MGS failover problem" thread on this mailing list from<br>

> this month and last).  We are in the process of syncing our existing LFS<br>

> to this new one and I've failed over/rebooted/upgraded the new LFS servers<br>

> many times now to make sure we can do this in practice when the new LFS goes<br>

> into production.  It's working beautifully.<br>

><br>

> Many thanks to the lustre developers for their continued efforts.  We have<br>

> been using and have been fans of lustre for quite some time now and it<br>

> just keeps getting better.<br>

><br>

> -----Original Message-----<br>

> From: lustre-discuss <<a href="mailto:lustre-discuss-bounces@lists.lustre.org">lustre-discuss-bounces@lists.<wbr>lustre.org</a>> on behalf of Ben Evans <<a href="mailto:bevans@cray.com">bevans@cray.com</a>><br>

> Date: Monday, February 6, 2017 at 2:22 PM<br>

> To: Brian Andrus <<a href="mailto:toomuchit@gmail.com">toomuchit@gmail.com</a>>, "<a href="mailto:lustre-discuss@lists.lustre.org">lustre-discuss@lists.lustre.<wbr>org</a>" <<a href="mailto:lustre-discuss@lists.lustre.org">lustre-discuss@lists.lustre.<wbr>org</a>><br>

> Subject: Re: [lustre-discuss] design to enable kernel updates<br>

><br>

> It's certainly possible.  When I've done that sort of thing, I upgrade<br>

> the OS on all the servers first, then boot half of them (the A side) to the<br>

> new image, at which point all the targets fail over to the B servers.  Once<br>

> the A side is back up, reboot the B half to the new OS.  Finally, do a<br>

> failback to the "normal" running state.<br>

><br>

> At least when I've done it, you'll want to do the failovers manually so<br>

> the HA infrastructure doesn't surprise you for any reason.<br>

><br>

> -Ben<br>

><br>

> On 2/6/17, 2:54 PM, "lustre-discuss on behalf of Brian Andrus"<br>

> <<a href="mailto:lustre-discuss-bounces@lists.lustre.org">lustre-discuss-bounces@lists.<wbr>lustre.org</a> on behalf of <a href="mailto:toomuchit@gmail.com">toomuchit@gmail.com</a>><br>

> wrote:<br>

><br>

>> All,<br>

>><br>

>> I have been contemplating how lustre could be configured such that I<br>

>> could update the kernel on each server without downtime.<br>

>><br>

>> It seems this is _almost_ possible when you have a SAN system so you<br>

>> have failover for OSTs and MDTs. BUT the MGS/MGT seems to be the<br>

>> problematic one, since rebooting that seems to cause downtime that cannot<br>

>> be avoided.<br>

>><br>

>> If you have a system where the disks are physically part of the OSS<br>

>> hardware, you are out of luck. The hypothetical scenario I am using is<br>

>> if someone had a VM that was a qcow image on a lustre mount (basically<br>

>> an active, open file being read/written to continuously). How could<br>

>> lustre be built to ensure anyone on the VM would not notice a kernel<br>

>> upgrade to the underlying lustre servers?<br>

>><br>

>><br>

>> Could such a setup be done? It seems that would be a better use case for<br>

>> something like GPFS or Gluster, but being a die-hard lustre enthusiast,<br>

>> I want to at least show it could be done.<br>

>><br>

>><br>

>> Thanks in advance,<br>

>><br>

>> Brian Andrus<br>

>><br>

>> ______________________________<wbr>_________________<br>

>> lustre-discuss mailing list<br>

>> <a href="mailto:lustre-discuss@lists.lustre.org">lustre-discuss@lists.lustre.<wbr>org</a><br>

>> <a href="http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org" rel="noreferrer" target="_blank">http://lists.lustre.org/<wbr>listinfo.cgi/lustre-discuss-<wbr>lustre.org</a><br>


><br>

><br>

<br>

<br>

<br>


</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr">------------------------------<br>Jeff Johnson<br>Co-Founder<br>Aeon Computing<br><br><a href="mailto:jeff.johnson@aeoncomputing.com" target="_blank">jeff.johnson@aeoncomputing.com</a><br><a href="http://www.aeoncomputing.com" target="_blank">www.aeoncomputing.com</a><br>t: 858-412-3810 x1001   f: 858-412-3845<br>m: 619-204-9061<br><br>4170 Morena Boulevard, Suite D - San Diego, CA 92117<div><br></div><div>High-Performance Computing / Lustre Filesystems / Scale-out Storage</div></div></div>

</div>