[lustre-discuss] Configuring Lustre failover on non-shared targets/disks
Dilger, Andreas
andreas.dilger at intel.com
Thu May 26 23:31:47 PDT 2016
Note that you need to download and install the lustre-osd-zfs and lustre-osd-zfs-mount RPMs for your installation in order to configure and mount ZFS filesystems. It appears that these are not installed.
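For example, on an RPM-based system something along these lines should pull them in (a minimal sketch only; the repository setup, package versions, and matching zfs/spl kmod packages depend on your installation):

  yum install lustre-osd-zfs lustre-osd-zfs-mount

Once the ZFS OSD packages are in place, your original command should be accepted:

  mkfs.lustre --mgs --backfstype=zfs mds1_2/mgs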
Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel High Performance Data Division
On 2016/05/21, 01:04, "sohamm" <sohamm at gmail.com> wrote:
>
> Hi Andreas
>
> It took me some time to get back to this. I started to try out this
> configuration on a bunch of VMs with powerful underlying hardware.
>
> Configuration:
> One physical machine hosts two VMs (VM1 and VM2). Both have
> kernel 3.10.0-327.13.1.el7_lustre.x86_64, ZFS, and iSCSI installed.
> VM1 - disk1, disk2
> VM2 - disk3, disk4
>
> After the iSCSI setup:
> VM1 - disk1, disk3
> VM2 - disk4, disk2
>
> After creating the zpools:
> VM1 - disk1 || disk3 (ZFS mirror) - for the MGS
> VM2 - disk4 || disk2 (ZFS mirror) - for the MDT
>
> [root at lustre_mgs01_vm03 ~]# zpool status
>   pool: mds1_2
>  state: ONLINE
>   scan: none requested
> config:
>
>         NAME          STATE     READ WRITE CKSUM
>         mds1_2        ONLINE       0     0     0
>           mirror-0    ONLINE       0     0     0
>             sdb       ONLINE       0     0     0
>             vdb2      ONLINE       0     0     0
>
> When setting up the MGS/MDT I get the following error:
>
> [root at lustre_mgs01_vm03 /]# mkfs.lustre --mgs --backfstype=zfs mds1_2/mgs
> mkfs.lustre FATAL: unhandled/unloaded fs type 5 'zfs'
> mkfs.lustre FATAL: unable to prepare backend (22)
> mkfs.lustre: exiting with 22 (Invalid argument)
>
> When I searched for that specific error I ran into this
> Jira ticket: https://jira.hpdd.intel.com/browse/LU-7601
> I have Lustre version:
> [root at lustre_mgs01_vm03 /]# cat /proc/fs/lustre/version
> lustre: 2.8.53_11_gfd4ab6e
> kernel: patchless_client
> build: 2.8.53_11_gfd4ab6e
>
> I found an earlier discussion on a similar topic. I plan to set up
> something similar, but with iSCSI instead of shared storage boxes.
> I don't see output similar to that thread for the mkfs.lustre command:
> https://lists.01.org/pipermail/hpdd-discuss/2013-December/000662.html
>
> I understand that this might not be a typical setup, but I would like
> to set it up and measure the performance if possible.
> Please let me know if I am missing something.
>
> Thanks
> Divakar
On Sat, Feb 6, 2016 at 1:57 AM, Dilger, Andreas <andreas.dilger at intel.com> wrote:
> On 2016/02/05, 17:08, "sohamm" <sohamm at gmail.com> wrote:
>
> Hi
>
> I have been reading a bunch of documents on how failures are handled in
> Lustre, and almost all of them seem to indicate that I would need
> shared disks/targets for an MDS or OSS failover configuration. I want
> to know if a failover configuration is possible without shared disks.
> E.g. I have one physical box I want to configure as an OSS/OST and another
> as the MGS/MDS/MDT. Each physical box will have its own HDDs/SSDs, and
> they are connected via Ethernet. Please guide me and point me to any good
> documentation available for such a configuration.
It is _possible_ to do this without shared disks, if there is some other mechanism to make the data available on both nodes. One option is to use iSCSI targets (SRP or iSER) and mirror the drives across the two servers using ZFS, making sure you serve each mirrored device from only one node. Then, if the primary server fails you can mount the filesystem on the backup node. This is described in http://wiki.lustre.org/MDT_Mirroring_with_ZFS_and_SRP and http://cdn.opensfs.org/wp-content/uploads/2011/11/LUG-2012.pptx .
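As a rough sketch of what the ZFS side of such a setup could look like (pool, device, and mount-point names below are made up for illustration; the iSCSI export itself is done with whatever target software you prefer):

  # On the primary MDS node: mirror a local disk with the iSCSI LUN
  # exported by the peer (which shows up as an ordinary SCSI device, here /dev/sdc)
  zpool create mdtpool mirror /dev/sdb /dev/sdc
  mkfs.lustre --fsname=testfs --mgs --mdt --index=0 --backfstype=zfs mdtpool/mdt0
  mount -t lustre mdtpool/mdt0 /mnt/lustre/mdt0

  # On failover: import the pool on the backup node and mount the target there
  zpool import -f mdtpool
  mount -t lustre mdtpool/mdt0 /mnt/lustre/mdt0

In a real failover configuration you would also want to format the target with --servicenode entries for both servers' NIDs so that clients know where to reconnect.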
Note that if you only have a 2-way mirror you've lost 1/2 of your disks during failover. That might be OK for the MDT if it has been configured correctly, since there are additional copies of metadata. For the OST you could use RAID-1+5 or RAID-1+6 (e.g. mirror of RAID-5/6 devices on each node). With a more complex configuration it would even potentially be possible to export iSCSI disks from a group of nodes and use RAID-6 of disks from different nodes so that redundancy isn't lost when a single node goes down. That might get hairy during configuration for a large system.
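For instance (again only a sketch, with hypothetical device names), each node could build a local RAID-6 with mdadm, export it over iSCSI, and the active server would then mirror its local md device with the peer's exported one:

  # On each node: build a RAID-6 from the local disks and export it via iSCSI
  mdadm --create /dev/md0 --level=6 --raid-devices=6 /dev/sd[b-g]
  # On the active node only: mirror the local array with the peer's array
  # (assumed here to appear as /dev/sdh over iSCSI)
  zpool create ostpool mirror /dev/md0 /dev/sdh

That way the OST stays redundant even after losing a whole node, at the cost of layering ZFS on top of md devices.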
Another alternative to iSCSI+ZFS would be to use some other form of network block device (e.g. NBD or DRBD) and then build your target on top of that. It is essentially the same, but the consistency is managed by the block device instead of the filesystem. IMHO (just a gut feeling, never tested) a "robust" network block device would be slower than having ZFS do this, because the block device doesn't know the details of what the filesystem is doing and will add its own overhead to provide consistency in addition to the consistency provided by ZFS itself.
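For comparison, a minimal DRBD resource definition could look roughly like this (hostnames, addresses, and devices are hypothetical):

  resource mdt0 {
    device    /dev/drbd0;
    disk      /dev/sdb;
    meta-disk internal;
    on mds01 { address 192.168.1.11:7789; }
    on mds02 { address 192.168.1.12:7789; }
  }

After "drbdadm up mdt0" on both nodes and "drbdadm primary mdt0" on the active one, the Lustre target would be formatted on /dev/drbd0 (with ldiskfs, or with ZFS layered on top of the DRBD device).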
That said, this isn't a typical Lustre configuration, but I think there would definitely be other interested parties if you tried this out and reported your results back here.
Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel High Performance Data Division