[lustre-discuss] design to enable kernel updates

Jeff Johnson jeff.johnson at aeoncomputing.com
Fri Feb 10 21:57:41 PST 2017


You're also leaving out the corosync/pacemaker/stonith configuration. That
is, unless you are doing manual export/import of pools.
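
For anyone following along, a minimal sketch of what a "manual" failover of
a ZFS-backed target looks like, using the pool/dataset names from Darby's
configuration below (the mountpoint is a placeholder):

    # on the node giving up the OST (if it is still reachable):
    umount /mnt/lustre/ost-fsl
    zpool export oss00-0

    # on the surviving partner:
    zpool import -f oss00-0
    mount -t lustre oss00-0/ost-fsl /mnt/lustre/ost-fsl

Pacemaker/corosync automate exactly this hand-off, with stonith fencing the
failed node first so the pool can't end up imported on both servers at once.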

On Fri, Feb 10, 2017 at 9:03 PM, Vicker, Darby (JSC-EG311) <
darby.vicker-1 at nasa.gov> wrote:

> Sure.  Our hardware is very similar to this:
>
> https://www.supermicro.com/solutions/Lustre.cfm
>
> We are using twin servers instead of two single-chassis servers as shown
> there, but functionally this is the same; we can just fit more into a
> single rack with the twin servers.  We are using a single JBOD per twin
> server, as shown in one of the configurations on the above page, and are
> using ZFS as the backend.  All servers are dual-homed on both Ethernet and
> IB.  The combined MGS/MDS is at the 10.148.0.30 address on IB and X.X.98.30
> on Ethernet, and the secondary MGS/MDS is at the .31 address on both
> networks.  With the combined MDS/MGS, they both fail over together.  This
> did require a patch from LU-8397 to get MGS failover working properly, so
> we are using 2.9.0 with the LU-8397 patch and are compiling our own server
> RPMs.  But this is pretty simple with ZFS since you don't need a patched
> kernel.  The lustre formatting and configuration bits are below.  I'm
> leaving out our exact ZFS pool creation commands, but I think you get the
> idea.
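>
> (For illustration only, the kind of pool creation involved might look like
> the following; the device names, vdev layout, and properties are
> placeholders rather than our actual commands.  The pool/dataset names match
> the mkfs.lustre calls and /etc/ldev.conf below.)
>
>     # hypothetical MDT pool on the MDS pair -- use real by-id/multipath devices
>     zpool create -o cachefile=none -O canmount=off metadata \
>         mirror /dev/disk/by-id/DISK_A /dev/disk/by-id/DISK_B
>
>     # hypothetical OST pool on an OSS (one raidz2 vdev shown); pool names
>     # follow the oss00-0, oss01-0, ... pattern used in /etc/ldev.conf
>     zpool create -o cachefile=none -O canmount=off oss00-0 \
>         raidz2 /dev/disk/by-id/DISK_1 /dev/disk/by-id/DISK_2 \
>                /dev/disk/by-id/DISK_3 /dev/disk/by-id/DISK_4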
>
> I hope that helps.
>
> Darby
>
>
>
> # Format the Lustre targets.  The MDS pair hosts a combined MGS/MDT;
> # each OSS formats one OST whose index is derived from its hostname.
> if [[ $HOSTNAME == *mds* ]] ; then
>
>     mkfs.lustre \
>         --fsname=hpfs-fsl \
>         --backfstype=zfs \
>         --reformat \
>         --verbose \
>         --mgs --mdt --index=0 \
>         --servicenode=${LUSTRE_LOCAL_TCP_IP}@tcp0,${LUSTRE_LOCAL_IB_IP}@o2ib0 \
>         --servicenode=${LUSTRE_PEER_TCP_IP}@tcp0,${LUSTRE_PEER_IB_IP}@o2ib0 \
>         metadata/meta-fsl
>
> elif [[ $HOSTNAME == *oss* ]] ; then
>
>     # derive the OST index from the hostname (hpfs-fsl-oss07 -> 7);
>     # printf '%g' strips any leading zero
>     num=`hostname --short | sed 's/hpfs-fsl-//' | sed 's/oss//'`
>     num=`printf '%g' $num`
>
>     # $pool is this OSS's ZFS pool, e.g. oss00-0 on hpfs-fsl-oss00
>     # (see /etc/ldev.conf below)
>     mkfs.lustre \
>         --mgsnode=X.X.98.30@tcp0,10.148.0.30@o2ib0 \
>         --mgsnode=X.X.98.31@tcp0,10.148.0.31@o2ib0 \
>         --fsname=hpfs-fsl \
>         --backfstype=zfs \
>         --reformat \
>         --verbose \
>         --ost --index=$num \
>         --servicenode=${LUSTRE_LOCAL_TCP_IP}@tcp0,${LUSTRE_LOCAL_IB_IP}@o2ib0 \
>         --servicenode=${LUSTRE_PEER_TCP_IP}@tcp0,${LUSTRE_PEER_IB_IP}@o2ib0 \
>         $pool/ost-fsl
> fi
>
>
>
>
> /etc/ldev.conf:
>
> #local          foreign/-       label             [md|zfs:]device-path   [journal-path]/-  [raidtab]
>
> hpfs-fsl-mds0  hpfs-fsl-mds1  hpfs-fsl-MDT0000  zfs:metadata/meta-fsl
>
> hpfs-fsl-oss00 hpfs-fsl-oss01 hpfs-fsl-OST0000  zfs:oss00-0/ost-fsl
> hpfs-fsl-oss01 hpfs-fsl-oss00 hpfs-fsl-OST0001  zfs:oss01-0/ost-fsl
> hpfs-fsl-oss02 hpfs-fsl-oss03 hpfs-fsl-OST0002  zfs:oss02-0/ost-fsl
> hpfs-fsl-oss03 hpfs-fsl-oss02 hpfs-fsl-OST0003  zfs:oss03-0/ost-fsl
> hpfs-fsl-oss04 hpfs-fsl-oss05 hpfs-fsl-OST0004  zfs:oss04-0/ost-fsl
> hpfs-fsl-oss05 hpfs-fsl-oss04 hpfs-fsl-OST0005  zfs:oss05-0/ost-fsl
> hpfs-fsl-oss06 hpfs-fsl-oss07 hpfs-fsl-OST0006  zfs:oss06-0/ost-fsl
> hpfs-fsl-oss07 hpfs-fsl-oss06 hpfs-fsl-OST0007  zfs:oss07-0/ost-fsl
> hpfs-fsl-oss08 hpfs-fsl-oss09 hpfs-fsl-OST0008  zfs:oss08-0/ost-fsl
> hpfs-fsl-oss09 hpfs-fsl-oss08 hpfs-fsl-OST0009  zfs:oss09-0/ost-fsl
> hpfs-fsl-oss10 hpfs-fsl-oss11 hpfs-fsl-OST000a  zfs:oss10-0/ost-fsl
> hpfs-fsl-oss11 hpfs-fsl-oss10 hpfs-fsl-OST000b  zfs:oss11-0/ost-fsl
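>
> (The "local" and "foreign" columns name each target's failover pair.  A
> sketch of how this file is typically consumed, assuming the ldev-based
> lustre init script shipped with the server RPMs; the exact arguments,
> particularly "foreign", may differ by version:)
>
>     service lustre start            # mount the targets ldev.conf assigns to this host
>     service lustre start foreign    # during a failover, also take over the partner's targets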
>
>
>
>
> /etc/modprobe.d/lustre.conf:
>
> options lnet networks=tcp0(enp4s0),o2ib0(ib1)
> options ko2iblnd map_on_demand=32
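>
> (A quick sanity check once the modules are loaded is that each server
> reports both NIDs; the output below is illustrative for the primary MDS.)
>
>     lctl list_nids
>     X.X.98.30@tcp
>     10.148.0.30@o2ib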
>
> -----Original Message-----
> From: Brian Andrus <toomuchit at gmail.com>
> Date: Friday, February 10, 2017 at 12:07 AM
> To: Darby Vicker <darby.vicker-1 at nasa.gov>, Ben Evans <bevans at cray.com>,
> "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
> Subject: Re: [lustre-discuss] design to enable kernel updates
>
> Darby,
>
> Do you mind if I inquire about the setup for your lustre systems?
> I'm trying to understand how the MGS/MGT is set up for high availability.
> I understand that with OSTs and MDTs all I really need is to have the
> failnode set when I do the mkfs.lustre.
> However, as I understand it, you have to use something like pacemaker
> and drbd to deal with the MGS/MGT. Is this how you approached it?
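>
> (For reference, the "failnode" style Brian describes looks roughly like
> this for an ldiskfs OST; the addresses and device are placeholders:
>
>     mkfs.lustre --fsname=testfs --ost --index=0 \
>         --mgsnode=10.0.0.1@tcp --failnode=10.0.0.3@tcp /dev/sdb
>
> The mkfs.lustre commands earlier in this thread use --servicenode on both
> nodes instead, including for the combined MGS/MDT.)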
>
> Brian Andrus
>
>
>
> On 2/6/2017 12:58 PM, Vicker, Darby (JSC-EG311) wrote:
> > Agreed.  We are just about to go into production on our next LFS with
> > the setup described.  We had to get past a bug in the MGS failover for
> > dual-homed servers, but as of last week that is done and everything is
> > working great (see the "MGS failover problem" thread on this mailing
> > list from this month and last).  We are in the process of syncing our
> > existing LFS to this new one, and I've failed over/rebooted/upgraded
> > the new LFS servers many times now to make sure we can do this in
> > practice when the new LFS goes into production.  It's working
> > beautifully.
> >
> > Many thanks to the lustre developers for their continued efforts.  We
> > have been using and have been fans of lustre for quite some time now and
> > it just keeps getting better.
> >
> > -----Original Message-----
> > From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org>
> > on behalf of Ben Evans <bevans at cray.com>
> > Date: Monday, February 6, 2017 at 2:22 PM
> > To: Brian Andrus <toomuchit at gmail.com>,
> > "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
> > Subject: Re: [lustre-discuss] design to enable kernel updates
> >
> > It's certainly possible.  When I've done that sort of thing, you upgrade
> > the OS on all the servers first, then boot half of them (the A side) to
> > the new image, and all the targets will fail over to the B servers.  Once
> > the A side is up, reboot the B half to the new OS.  Finally, do a failback
> > to the "normal" running state.
> >
> > At least when I've done it, you'll want to do the failovers manually so
> > the HA infrastructure doesn't surprise you for any reason.
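> >
> > (If the targets are managed by pacemaker, one way to drive the failover
> > yourself rather than waiting for fencing is to put a node in standby
> > before rebooting it; the node name is a placeholder and assumes the pcs
> > tooling from RHEL/CentOS 7:)
> >
> >     pcs cluster standby oss-a-01     # its targets migrate to the B partner
> >     # ... reboot oss-a-01 into the new image ...
> >     pcs cluster unstandby oss-a-01   # allow it to host targets again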
> >
> > -Ben
> >
> > On 2/6/17, 2:54 PM, "lustre-discuss on behalf of Brian Andrus"
> > <lustre-discuss-bounces at lists.lustre.org on behalf of
> > toomuchit at gmail.com> wrote:
> >
> >> All,
> >>
> >> I have been contemplating how lustre could be configured such that I
> >> could update the kernel on each server without downtime.
> >>
> >> It seems this is _almost_ possible when you have a SAN system, so you
> >> have failover for OSTs and MDTs. BUT the MGS/MGT seems to be the
> >> problematic one, since rebooting it seems to cause downtime that cannot
> >> be avoided.
> >>
> >> If you have a system where the disks are physically part of the OSS
> >> hardware, you are out of luck. The hypothetical scenario I am using is
> >> if someone had a VM that was a qcow image on a lustre mount (basically
> >> an active, open file being read/written to continuously). How could
> >> lustre be built to ensure anyone on the VM would not notice a kernel
> >> upgrade to the underlying lustre servers?
> >>
> >>
> >> Could such a setup be done? It seems that would be a better use case for
> >> something like GPFS or Gluster, but being a die-hard lustre enthusiast,
> >> I want to at least show it could be done.
> >>
> >>
> >> Thanks in advance,
> >>
> >> Brian Andrus
> >>
> >
> >
>
>
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>



-- 
------------------------------
Jeff Johnson
Co-Founder
Aeon Computing

jeff.johnson at aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001   f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite D - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage

