[lustre-discuss] Changelog users failing to clear records in 2.8, can anyone help?

Colin Faber cfaber at gmail.com
Thu Jun 1 10:38:29 PDT 2017


There seem to have been a few instances of this reported here on the list
in the last few months. I don't recall the earlier versions of Lustre, but
we have also seen this in the wild on customer systems, so it is very likely
a bug that results in corruption of llog files.

-cf


On Thu, Jun 1, 2017 at 11:36 AM, Dilger, Andreas <andreas.dilger at intel.com>
wrote:

> On Jun 1, 2017, at 10:55, Faccini, Bruno <bruno.faccini at intel.com> wrote:
> >
> > Hello,
> > According to the error messages, it looks like there is a corrupted
> > plain-LLOG file for the ChangeLogs of MDT0. Unfortunately, neither e2fsck
> > nor lfsck can help to recover in this case.
>
> Bruno,
> is this bug fixed in newer Lustre releases, or can something be done in
> the ChangeLog handling so that the ChangeLog can still be cleared in this
> case?  I don't think we care if the record is invalid when it is being
> deleted...  Could you please file a ticket in Jira about this, if it isn't
> already fixed.
>
> Cheers, Andreas
>
> > I think that to clear this situation you need to stop/umount this MDT
> > and re-mount it as ldiskfs, move both the changelog_users and
> > changelog_catalog files to some alternate place/name (do not remove
> > them!), umount ldiskfs, re-start/mount your MDT, re-run an RBH full scan,
> > and re-register a ChangeLog user.
> > The only side effect of doing so can be the volume of orphan plain-LLOGs
> > that will be kept, consuming space on the MDT. You should be able to
> > identify them by running the llog_reader tool over the saved/renamed old
> > catalog file, which will list the references to all these remaining
> > plain-LLOGs, allowing you to find and remove them during a new
> > ldiskfs-mount session.
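The "move, do not remove" step described above could be scripted roughly as follows. This is only a minimal sketch: it assumes the MDT is already mounted as ldiskfs and that both files sit at the top level of that mount (an assumption, not something stated in this thread). The function name and the `.saved` suffix are made up for illustration.

```python
from pathlib import Path

# File names are taken from the message above. Their location at the top
# level of the ldiskfs mount is an assumption of this sketch.
CHANGELOG_FILES = ("changelog_catalog", "changelog_users")


def set_aside_changelog_files(ldiskfs_root, suffix=".saved"):
    """Rename the changelog llog files in place; nothing is ever deleted."""
    root = Path(ldiskfs_root)
    renamed = []
    for name in CHANGELOG_FILES:
        src = root / name
        dst = root / (name + suffix)
        if not src.exists():
            continue  # nothing to set aside under this name
        if dst.exists():
            # Never overwrite an earlier saved copy.
            raise FileExistsError(f"refusing to overwrite {dst}")
        src.rename(dst)
        renamed.append(dst)
    return renamed
```

It would be called with the ldiskfs mount point, for example `set_aside_changelog_files("/mnt/mdt_ldiskfs")`, where that path is hypothetical. The renamed catalog is then still available as input for llog_reader.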
> >
> > Bruno.
> >
> >> On Jun 1, 2017, at 4:09 PM, Gibbins, Faye <Faye.Gibbins at cirrus.com> wrote:
> >>
> >> Hi,
> >>
> >> We have 4 file systems on our Lustre cluster. All have changelog users
> >> registered for Robinhood to use.
> >>
> >> We have discovered that a changelog user for one of the file systems is
> >> not catching up to its index. Manual runs of Robinhood fail to read any
> >> more records even though, according to mdd/tools-MDT0000/changelog_users,
> >> there are records to read!
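One way to quantify "not catching up" is to compare each registered user's index against the current index reported in changelog_users. The sketch below assumes the text layout shown in its docstring (a "current index:" line followed by an ID/index table). Treat that layout as an assumption and check it against your own output. The function name is hypothetical.

```python
def changelog_backlog(changelog_users_text):
    """Return {user_id: records_not_yet_cleared} from changelog_users text.

    Assumed layout (not taken from this thread):
        current index: <N>
        ID    index
        cl1   <M>
    """
    current = None
    users = {}
    for line in changelog_users_text.splitlines():
        line = line.strip()
        if line.startswith("current index:"):
            current = int(line.split(":", 1)[1])
            continue
        parts = line.split()
        # User rows start with an ID such as "cl1" followed by a number.
        if len(parts) >= 2 and parts[0].startswith("cl") and parts[1].isdigit():
            users[parts[0]] = int(parts[1])
    if current is None:
        raise ValueError("no 'current index:' line found")
    return {uid: current - idx for uid, idx in users.items()}
```

A user whose backlog keeps growing while the consumer is running is the one that is stuck.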
> >>
> >> Over time the changelog had filled and the file system had become
> >> sluggish. Wiping the Robinhood MySQL database and reinitializing Robinhood
> >> with a full scan didn’t fix the issue and, as I said above, three other
> >> changelogs from different file systems (on the same MGS) are OK when used
> >> from the same Robinhood instance.
> >>
> >> What makes me think this is a Lustre problem (we are using 2.8 on ext4)
> >> is this (repeated) error we are getting in syslog:
> >>
> >> [Wed May 31 14:06:59 2017] Lustre: 46400:0:(llog.c:530:llog_process_thread()) invalid length -420090294 in llog record for index 372672342/61708
> >> [Wed May 31 14:06:59 2017] LustreError: 46400:0:(mdd_device.c:261:llog_changelog_cancel()) tools-MDD0000: cancel idx 645 of catalog 0x7:10 rc=-22
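To track how often these two messages recur, the relevant fields can be pulled out of syslog with a small parser. This is a sketch only: the regular expressions are written against the two lines quoted above and may not match other Lustre versions, and the function and key names are invented for illustration. The rc value is translated with Python's standard errno table (-22 corresponds to EINVAL).

```python
import errno
import re

# Patterns written against the two messages quoted in this thread.
INVALID_LEN_RE = re.compile(
    r"invalid length (-?\d+) in llog record for index (\d+)/(\d+)")
CANCEL_RE = re.compile(r"cancel idx (\d+) of catalog (\S+) rc=(-?\d+)")


def parse_llog_errors(text):
    """Return a list of dicts, one per matching log line, in input order."""
    events = []
    for line in text.splitlines():
        m = INVALID_LEN_RE.search(line)
        if m:
            events.append({
                "kind": "invalid_length",
                "length": int(m.group(1)),
                # The two numbers around "/" are kept as printed.
                "index_pair": (int(m.group(2)), int(m.group(3))),
            })
            continue
        m = CANCEL_RE.search(line)
        if m:
            rc = int(m.group(3))
            events.append({
                "kind": "cancel_failed",
                "idx": int(m.group(1)),
                "catalog": m.group(2),
                "rc": rc,
                # Kernel return codes are negative errno values.
                "errno_name": errno.errorcode.get(-rc, "unknown"),
            })
    return events
```

Feeding it the output of `dmesg` or a syslog file (with each message on a single line) gives a list that can be counted or grouped by catalog.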
> >>
> >> Deregistering the user from the changelog and starting with a new one
> >> has not changed the behaviour, and we still can’t use this new user to
> >> track changes to the file system.
> >>
> >> Can anyone offer any advice on how to resolve this issue in the
> >> changelog?
> >> If not, can anyone confirm whether taking the file system down for an
> >> e2fsck/lfsck will fix issues with the changelog? I’d settle for being able
> >> to clear the whole log and start afresh, if that’s possible.
> >>
> >> Yours
> >> Faye Gibbins
> >> Snr SysAdmin, Unix Lead Architect
> >> Software Systems and Cloud Services
> >> Cirrus Logic | cirrus.com  | +44 (0) 131 272 7398
> >>
> >> _______________________________________________
> >> lustre-discuss mailing list
> >> lustre-discuss at lists.lustre.org
> >> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> >
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Intel Corporation