[lustre-discuss] Lustre/ZFS snapshots mount error

Robert Redl robert.redl at lmu.de
Tue Sep 11 03:53:56 PDT 2018


Thanks for the fast reply! If I understood correctly, it is currently
not possible to use the changelog feature together with the snapshot
feature, right?

Is there already an LU ticket about that?

Cheers,
Robert


On 09/10/2018 02:57 PM, Yong, Fan wrote:
>
> It is suspected that some llog records were still pending when the
> snapshot was taken. When such a snapshot is mounted, certain conditions
> trigger llog cleanup/modification automatically, so the failure is not
> related to your actions when mounting the snapshot. Since we cannot
> control the system state at the time the snapshot is taken, we have to
> skip llog-related cleanup/modification when mounting the snapshot. That
> "skip" logic is exactly what needs to be added.
>
>  
>
> Cheers,
>
> Nasf
>
> *From:*lustre-discuss [mailto:lustre-discuss-bounces at lists.lustre.org]
> *On Behalf Of * Robert Redl
> *Sent:* Saturday, September 8, 2018 9:04 PM
> *To:* lustre-discuss at lists.lustre.org
> *Subject:* Re: [lustre-discuss] Lustre/ZFS snapshots mount error
>
>  
>
> Dear All,
>
> We have a similar setup with Lustre on ZFS and make regular use of
> snapshots for backups (tape backups use the snapshots as their source).
> We would like to use robinhood in the future, and the question is how
> best to do it.
>
> Would it be a workaround to temporarily disable the robinhood daemon
> during the mount process?
> Does the problem only occur when changelogs are consumed during the
> process of mounting a snapshot? Or is it also a problem when
> changelogs are consumed while the snapshot remains mounted (which is
> for us typically several hours)?
> Is there already an LU-ticket about this issue?
>
> Thanks!
> Robert
>
> -- 
> Dr. Robert Redl
> Scientific Programmer, "Waves to Weather" (SFB/TRR165)
> Meteorologisches Institut
> Ludwig-Maximilians-Universität München
> Theresienstr. 37, 80333 München, Germany
>
> Am 03.09.2018 um 08:16 schrieb Yong, Fan:
>
>     I would say that it is not the order of your operations that caused
>     the trouble. Instead, it is related to the snapshot mount logic. As
>     mentioned in my earlier reply, we need a patch to the llog logic to
>     avoid modifying llogs under snapshot mode.
>
>      
>
>      
>
>     --
>
>     Cheers,
>
>     Nasf
>
>     *From:*Kirk, Benjamin (JSC-EG311) [mailto:benjamin.kirk at nasa.gov]
>     *Sent:* Tuesday, August 28, 2018 7:53 PM
>     *To:* lustre-discuss at lists.lustre.org
>     <mailto:lustre-discuss at lists.lustre.org>
>     *Cc:* Andreas Dilger <adilger at whamcloud.com>
>     <mailto:adilger at whamcloud.com>; Yong, Fan <fan.yong at intel.com>
>     <mailto:fan.yong at intel.com>
>     *Subject:* Re: [lustre-discuss] Lustre/ZFS snapshots mount error
>
>      
>
>     The MDS situation is very basic: active/passive mds0/mds1 for both
>     fsA & fsB.  fsA has the combined mgs/mdt in a single zfs
>     filesystem, and fsB has its own mdt in a separate zfs filesystem.
>     mds0 is primary for all.
>
>      
>
>     fsA & fsB DO both have changelogs enabled to feed robinhood databases.
>
>      
>
>     What’s the recommended procedure here we should follow before
>     mounting the snapshots?
>
>      
>
>     1) disable changelogs on the active MDTs (this will compromise
>     robinhood, requiring a rescan…), or
>
>     2) temporarily halt changelog consumption / cleanup (e.g. stop
>     robinhood in our case) and then mount the snapshot?
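Option 2 above could be sketched roughly as follows. This is a dry-run illustration only: the `robinhood` systemd service name and the snapshot name are assumptions, not confirmed details from this thread; adjust for your site.

```shell
#!/bin/sh
# Sketch of option 2: pause changelog consumption while a snapshot is mounted.
# The "robinhood" service name and snapshot names are illustrative assumptions.
RUN=${RUN:-echo}   # dry-run by default: print commands instead of executing them

mount_snapshot_safely() {
    fsname=$1; snap=$2
    $RUN systemctl stop robinhood                      # stop the changelog consumer
    $RUN lctl snapshot_mount -F "$fsname" -n "$snap"   # mount the read-only snapshot
    # ... read from the mounted snapshot here (e.g. feed the tape backup) ...
    $RUN lctl snapshot_umount -F "$fsname" -n "$snap"  # unmount when done
    $RUN systemctl start robinhood                     # resume changelog consumption
}

mount_snapshot_safely eadc eadc_AutoSS-Fri
```

Set `RUN=` to actually execute the commands instead of printing them.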
>
>      
>
>     Thanks for the help!
>
>      
>
>     --
>
>     Benjamin S. Kirk, Ph.D.
>
>     NASA Lyndon B. Johnson Space Center
>
>     Acting Chief, Aeroscience & Flight Mechanics Division
>
>      
>
>         On Aug 27, 2018, at 7:33 PM, Yong, Fan <fan.yong at intel.com
>         <mailto:fan.yong at intel.com>> wrote:
>
>          
>
>         According to the stack trace, something was trying to clean up
>         old empty llogs while the snapshot was being mounted. We do NOT
>         allow any modification when mounting a snapshot; otherwise it
>         would trigger a ZFS backend BUG(). That is why we added the
>         LASSERT() when starting the transaction. One possible solution
>         is to add a check in the llog logic to avoid modifying llogs
>         under snapshot mode.
>
>
>         --
>         Cheers,
>         Nasf
>
>         -----Original Message-----
>         From: lustre-discuss
>         [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of
>         Andreas Dilger
>         Sent: Tuesday, August 28, 2018 5:57 AM
>         To: Kirk, Benjamin (JSC-EG311) <benjamin.kirk at nasa.gov
>         <mailto:benjamin.kirk at nasa.gov>>
>         Cc: lustre-discuss at lists.lustre.org
>         <mailto:lustre-discuss at lists.lustre.org>
>         Subject: Re: [lustre-discuss] Lustre/ZFS snapshots mount error
>
>         It's probably best to file an LU ticket for this issue.
>
>         It looks like something in the log processing at mount time is
>         trying to modify the configuration files.  I'm not sure whether
>         that should be allowed or not.
>
>         Does fsB have the same MGS as fsA?  Does it have the same MDS
>         node as fsA?
>         If it has a different MDS, you might consider giving it its
>         own MGS as well.
>         That doesn't have to be a separate MGS node, just a separate
>         filesystem (ZFS fileset in the same zpool) on the MDS node.
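That suggestion might look like the sketch below. Dry-run illustration only: the pool name `metadata`, dataset name, and mount point are assumptions for the sake of example, not taken from this thread.

```shell
#!/bin/sh
# Sketch of a separate MGS as its own ZFS filesystem in the existing metadata
# zpool. Pool/dataset names and the mount point are illustrative assumptions.
RUN=${RUN:-echo}   # dry-run by default: print commands instead of executing them

create_separate_mgs() {
    dataset=$1
    # mkfs.lustre with a ZFS backend creates the dataset in the existing
    # pool and formats it as a standalone MGS
    $RUN mkfs.lustre --mgs --backfstype=zfs "$dataset"
    $RUN mkdir -p /mnt/mgs
    $RUN mount -t lustre "$dataset" /mnt/mgs   # start the MGS
}

create_separate_mgs metadata/mgs
```

Set `RUN=` to actually execute the commands instead of printing them.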
>
>         Cheers, Andreas
>
>
>
>             On Aug 27, 2018, at 10:18, Kirk, Benjamin (JSC-EG311)
>             <benjamin.kirk at nasa.gov <mailto:benjamin.kirk at nasa.gov>>
>             wrote:
>
>             Hi all,
>
>             We have two filesystems, fsA & fsB (eadc below), both of
>             which get snapshots taken daily, rotated over a week.
>             It’s a beautiful feature we’ve been using in production
>             ever since it was introduced in 2.10.
>
>             -) We’ve got Lustre/ZFS 2.10.4 on CentOS 7.5.
>             -) Both fsA & fsB have changelogs active.
>             -) fsA has combined mgt/mdt on a single ZFS filesystem.
>             -) fsB has a single mdt on a single ZFS filesystem.
>             -) for fsA, I have no issues mounting any of the snapshots
>             via lctl.
>             -) for fsB, I can mount the three most recent snapshots,
>             then encounter errors:
>
>             [root at hpfs-fsl-mds0 ~]# lctl snapshot_mount -F eadc -n
>             eadc_AutoSS-Mon
>             mounted the snapshot eadc_AutoSS-Mon with fsname 3d40bbc
>             [root at hpfs-fsl-mds0 ~]# lctl snapshot_umount -F eadc -n
>             eadc_AutoSS-Mon
>             [root at hpfs-fsl-mds0 ~]# lctl snapshot_mount -F eadc -n
>             eadc_AutoSS-Sun
>             mounted the snapshot eadc_AutoSS-Sun with fsname 584c07a
>             [root at hpfs-fsl-mds0 ~]# lctl snapshot_umount -F eadc -n
>             eadc_AutoSS-Sun
>             [root at hpfs-fsl-mds0 ~]# lctl snapshot_mount -F eadc -n
>             eadc_AutoSS-Sat
>             mounted the snapshot eadc_AutoSS-Sat with fsname 4e646fe
>             [root at hpfs-fsl-mds0 ~]# lctl snapshot_umount -F eadc -n
>             eadc_AutoSS-Sat
>             [root at hpfs-fsl-mds0 ~]# lctl snapshot_mount -F eadc -n eadc_AutoSS-Fri
>             mount.lustre: mount metadata/meta-eadc at eadc_AutoSS-Fri at /mnt/eadc_AutoSS-Fri_MDT0000 failed: Read-only file system
>             Can't mount the snapshot eadc_AutoSS-Fri: Read-only file system
>
>             The relevant bits from dmesg are
>             [1353434.417762] Lustre: 3d40bbc-MDT0000: set dev_rdonly on this device
>             [1353434.417765] Lustre: Skipped 3 previous similar messages
>             [1353434.649480] Lustre: 3d40bbc-MDT0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
>             [1353434.649484] Lustre: Skipped 3 previous similar messages
>             [1353434.866228] Lustre: 3d40bbc-MDD0000: changelog on
>             [1353434.866233] Lustre: Skipped 1 previous similar message
>             [1353435.427744] Lustre: 3d40bbc-MDT0000: Connection restored to ... at tcp (at ... at tcp)
>             [1353435.427747] Lustre: Skipped 23 previous similar messages
>             [1353445.255899] Lustre: Failing over 3d40bbc-MDT0000
>             [1353445.255903] Lustre: Skipped 3 previous similar messages
>             [1353445.256150] LustreError: 11-0: 3d40bbc-OST0000-osc-MDT0000: operation ost_disconnect to node ... at tcp failed: rc = -107
>             [1353445.257896] LustreError: Skipped 23 previous similar messages
>             [1353445.353874] Lustre: server umount 3d40bbc-MDT0000 complete
>             [1353445.353877] Lustre: Skipped 3 previous similar messages
>             [1353475.302224] Lustre: 4e646fe-MDD0000: changelog on
>             [1353475.302228] Lustre: Skipped 1 previous similar message
>             [1353498.964016] LustreError: 25582:0:(osd_handler.c:341:osd_trans_create()) 36ca26b-MDT0000-osd: someone try to start transaction under readonly mode, should be disabled.
>             [1353498.967260] LustreError: 25582:0:(osd_handler.c:341:osd_trans_create()) Skipped 1 previous similar message
>             [1353498.968829] CPU: 6 PID: 25582 Comm: mount.lustre Kdump: loaded Tainted: P           OE  ------------   3.10.0-862.6.3.el7.x86_64 #1
>             [1353498.968830] Hardware name: Supermicro SYS-6027TR-D71FRF/X9DRT, BIOS 3.2a 08/04/2015
>             [1353498.968832] Call Trace:
>             [1353498.968841]  [<ffffffffb5b0e80e>] dump_stack+0x19/0x1b
>             [1353498.968851]  [<ffffffffc0cbe5db>] osd_trans_create+0x38b/0x3d0 [osd_zfs]
>             [1353498.968876]  [<ffffffffc1116044>] llog_destroy+0x1f4/0x3f0 [obdclass]
>             [1353498.968887]  [<ffffffffc111f0f6>] llog_cat_reverse_process_cb+0x246/0x3f0 [obdclass]
>             [1353498.968897]  [<ffffffffc111a32c>] llog_reverse_process+0x38c/0xaa0 [obdclass]
>             [1353498.968910]  [<ffffffffc111eeb0>] ? llog_cat_process_cb+0x4e0/0x4e0 [obdclass]
>             [1353498.968922]  [<ffffffffc111af69>] llog_cat_reverse_process+0x179/0x270 [obdclass]
>             [1353498.968932]  [<ffffffffc1115585>] ? llog_init_handle+0xd5/0x9a0 [obdclass]
>             [1353498.968943]  [<ffffffffc1116e78>] ? llog_open_create+0x78/0x320 [obdclass]
>             [1353498.968949]  [<ffffffffc12e55f0>] ? mdd_root_get+0xf0/0xf0 [mdd]
>             [1353498.968954]  [<ffffffffc12ec7af>] mdd_prepare+0x13ff/0x1c70 [mdd]
>             [1353498.968966]  [<ffffffffc166b037>] mdt_prepare+0x57/0x3b0 [mdt]
>             [1353498.968983]  [<ffffffffc1183afd>] server_start_targets+0x234d/0x2bd0 [obdclass]
>             [1353498.968999]  [<ffffffffc1153500>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
>             [1353498.969012]  [<ffffffffc118541d>] server_fill_super+0x109d/0x185a [obdclass]
>             [1353498.969025]  [<ffffffffc115cef8>] lustre_fill_super+0x328/0x950 [obdclass]
>             [1353498.969038]  [<ffffffffc115cbd0>] ? lustre_common_put_super+0x270/0x270 [obdclass]
>             [1353498.969041]  [<ffffffffb561f3bf>] mount_nodev+0x4f/0xb0
>             [1353498.969053]  [<ffffffffc1154f18>] lustre_mount+0x38/0x60 [obdclass]
>             [1353498.969055]  [<ffffffffb561ff3e>] mount_fs+0x3e/0x1b0
>             [1353498.969060]  [<ffffffffb563d4b7>] vfs_kern_mount+0x67/0x110
>             [1353498.969062]  [<ffffffffb563fadf>] do_mount+0x1ef/0xce0
>             [1353498.969066]  [<ffffffffb55f7c2c>] ? kmem_cache_alloc_trace+0x3c/0x200
>             [1353498.969069]  [<ffffffffb5640913>] SyS_mount+0x83/0xd0
>             [1353498.969074]  [<ffffffffb5b20795>] system_call_fastpath+0x1c/0x21
>             [1353498.969079] LustreError: 25582:0:(llog_cat.c:1027:llog_cat_reverse_process_cb()) 36ca26b-MDD0000: fail to destroy empty log: rc = -30
>             [1353498.970785] CPU: 6 PID: 25582 Comm: mount.lustre Kdump: loaded Tainted: P           OE  ------------   3.10.0-862.6.3.el7.x86_64 #1
>             [1353498.970786] Hardware name: Supermicro SYS-6027TR-D71FRF/X9DRT, BIOS 3.2a 08/04/2015
>             [1353498.970787] Call Trace:
>             [1353498.970790]  [<ffffffffb5b0e80e>] dump_stack+0x19/0x1b
>             [1353498.970795]  [<ffffffffc0cbe5db>] osd_trans_create+0x38b/0x3d0 [osd_zfs]
>             [1353498.970807]  [<ffffffffc1117921>] llog_cancel_rec+0xc1/0x880 [obdclass]
>             [1353498.970817]  [<ffffffffc111e13b>] llog_cat_cleanup+0xdb/0x380 [obdclass]
>             [1353498.970827]  [<ffffffffc111f14d>] llog_cat_reverse_process_cb+0x29d/0x3f0 [obdclass]
>             [1353498.970838]  [<ffffffffc111a32c>] llog_reverse_process+0x38c/0xaa0 [obdclass]
>             [1353498.970848]  [<ffffffffc111eeb0>] ? llog_cat_process_cb+0x4e0/0x4e0 [obdclass]
>             [1353498.970858]  [<ffffffffc111af69>] llog_cat_reverse_process+0x179/0x270 [obdclass]
>             [1353498.970868]  [<ffffffffc1115585>] ? llog_init_handle+0xd5/0x9a0 [obdclass]
>             [1353498.970878]  [<ffffffffc1116e78>] ? llog_open_create+0x78/0x320 [obdclass]
>             [1353498.970883]  [<ffffffffc12e55f0>] ? mdd_root_get+0xf0/0xf0 [mdd]
>             [1353498.970887]  [<ffffffffc12ec7af>] mdd_prepare+0x13ff/0x1c70 [mdd]
>             [1353498.970894]  [<ffffffffc166b037>] mdt_prepare+0x57/0x3b0 [mdt]
>             [1353498.970908]  [<ffffffffc1183afd>] server_start_targets+0x234d/0x2bd0 [obdclass]
>             [1353498.970924]  [<ffffffffc1153500>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
>             [1353498.970938]  [<ffffffffc118541d>] server_fill_super+0x109d/0x185a [obdclass]
>             [1353498.970950]  [<ffffffffc115cef8>] lustre_fill_super+0x328/0x950 [obdclass]
>             [1353498.970962]  [<ffffffffc115cbd0>] ? lustre_common_put_super+0x270/0x270 [obdclass]
>             [1353498.970964]  [<ffffffffb561f3bf>] mount_nodev+0x4f/0xb0
>             [1353498.970976]  [<ffffffffc1154f18>] lustre_mount+0x38/0x60 [obdclass]
>             [1353498.970978]  [<ffffffffb561ff3e>] mount_fs+0x3e/0x1b0
>             [1353498.970980]  [<ffffffffb563d4b7>] vfs_kern_mount+0x67/0x110
>             [1353498.970982]  [<ffffffffb563fadf>] do_mount+0x1ef/0xce0
>             [1353498.970984]  [<ffffffffb55f7c2c>] ? kmem_cache_alloc_trace+0x3c/0x200
>             [1353498.970986]  [<ffffffffb5640913>] SyS_mount+0x83/0xd0
>             [1353498.970989]  [<ffffffffb5b20795>] system_call_fastpath+0x1c/0x21
>             [1353498.970996] LustreError: 25582:0:(mdd_device.c:354:mdd_changelog_llog_init()) 36ca26b-MDD0000: changelog init failed: rc = -30
>             [1353498.972790] LustreError: 25582:0:(mdd_device.c:427:mdd_changelog_init()) 36ca26b-MDD0000: changelog setup during init failed: rc = -30
>             [1353498.974525] LustreError: 25582:0:(mdd_device.c:1061:mdd_prepare()) 36ca26b-MDD0000: failed to initialize changelog: rc = -30
>             [1353498.976229] LustreError: 25582:0:(obd_mount_server.c:1879:server_fill_super()) Unable to start targets: -30
>             [1353499.072002] LustreError: 25582:0:(obd_mount.c:1582:lustre_fill_super()) Unable to mount  (-30)
>
>
>             I’m hoping those traces mean something to someone - any ideas?
>
>             Thanks!
>
>             --
>             Benjamin S. Kirk
>
>             _______________________________________________
>             lustre-discuss mailing list
>             lustre-discuss at lists.lustre.org
>             <mailto:lustre-discuss at lists.lustre.org>
>             http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
>
>         Cheers, Andreas
>         ---
>         Andreas Dilger
>         CTO Whamcloud
>
>
>
>
>      
>
>
>
>
>     _______________________________________________
>
>     lustre-discuss mailing list
>
>     lustre-discuss at lists.lustre.org
>     <mailto:lustre-discuss at lists.lustre.org>
>
>     http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
>  
>


