[lustre-discuss] Changelog record cleanup in /O/1/d*

Tue Dec 6 08:11:07 PST 2016

Thanks, Cory.  We are still running 2.5.3.90, which doesn't have that fix.  That patch looks like it would solve our slow-to-mount MDT.  FWIW, I don't think we have many (any?) empty plain llogs, but the removal of the llog_process_or_fork() call in llog_cat_init_and_process() looks like it addresses our issue - I see that call in the stacks of the osp-syn-* threads while the MDT is being read heavily during mounts.

As a followup - is there any reason *not* to unmount the MDT, mount it as ldiskfs, and simply delete the plain llogs in our MDT's O/1/d* folders that contain only CHANGELOG_REC records?   Or even every file under the MDT's O/1/d* folders?  I'm a little unsure.  It seems that most of (if not all of) the files there now are just taking up space, and nothing else is going to remove them.

FWIW, our intent is to start using changelogs and robinhood again after we upgrade to a later version of Lustre than what we are currently running, at which time we'll just start over - register new changelog users and rescan the whole filesystem.  We won't care about any prior history.

Thanks again,

Craig

________________________________
From: Cory Spitz <spitzcor at cray.com>
Sent: Monday, December 5, 2016 5:30 PM
To: Prescott,Craig P; lustre-discuss at lists.lustre.org
Subject: Re: [lustre-discuss] Changelog record cleanup in /O/1/d*

Craig, FWIW, this sounds a lot like https://jira.hpdd.intel.com/browse/LU-5038, which was addressed in 2.7.0.
-Cory

--

From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of "Prescott,Craig P" <prescott at rc.ufl.edu>
Date: Monday, December 5, 2016 at 3:02 PM
To: "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
Subject: [lustre-discuss] Changelog record cleanup in /O/1/d*

We were running 2.5.3.90 with changelogs enabled earlier this summer.  We ran into a catalog corruption issue (LU-6556) - we decided to deregister our changelog users, move the CONFIGS/changelog_{catalog,users} files out of the way, and carry on until we had an opportunity to upgrade.  We did not remove anything from /O/1/d* at that time (though we probably should have).

We've observed that mounting our MDT can take many minutes - I can see with iostat that the MDT is very busy with reads while it is being mounted.  I suspect that those stale files in /O/1/d* are the reason (there are lots of them), as they are processed by the OSP sync at MDT startup.  I looked with debugfs at the /O/1/d* directories - there are thousands of files and their timestamps are consistent with when we were using changelogs.  I dumped a few randomly selected ones and checked with llog_reader that the records they contain are of type CHANGELOG_REC (type=10660000).

At the least, I think we should remove the files in /O/1/d* that contain CHANGELOG_REC entries.  Can I just delete every file in /O/1/d*, or do I need to be careful and only remove the files that contain CHANGELOG_REC entries?

The reason I ask is that I do see a handful of files that are not changelog-related in these directories - their timestamps are newer and their record type as reported by llog_reader is not CHANGELOG_REC or CHANGELOG_USER.  There are only a small number of such files, though.
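To make the "changelog-only or not" check concrete, here is a minimal Python sketch of how a dumped file could be classified from saved llog_reader text output. It assumes each record line carries a "type=<value>" token, as in the type=10660000 string quoted above; the exact output layout may differ between Lustre versions. The function names and the changelog_types parameter are hypothetical, and the sketch only reads text. It does not delete anything.

```python
import re

# Record type string quoted in this thread for CHANGELOG_REC.
CHANGELOG_REC_TYPE = "10660000"

# Assumption: every record line in the dump contains a "type=<value>" token.
_TYPE_TOKEN = re.compile(r"\btype=([0-9A-Fa-f]+)")


def record_types(dump_text):
    """Return the set of record type strings found in a text dump."""
    return {m.group(1).lower() for m in _TYPE_TOKEN.finditer(dump_text)}


def only_changelog_records(dump_text,
                           changelog_types=frozenset({CHANGELOG_REC_TYPE})):
    """True if the dump has at least one record and all types are changelog types."""
    types = record_types(dump_text)
    # An empty dump is treated as "not changelog-only" so it gets a manual look.
    return bool(types) and types <= changelog_types
```

A file for which only_changelog_records() returns False (mixed types, other types, or no records found) would be left alone and inspected by hand, which matches the handful of newer, non-changelog files mentioned above.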

Thanks,

Craig Prescott
University of Florida Research Computing