[lustre-discuss] 1.8 client on 3.13.0 kernel

Mohr Jr, Richard Frank (Rick Mohr) rmohr at utk.edu
Thu Sep 10 08:17:55 PDT 2015


I did an upgrade from Lustre 1.8.6 to 2.4.3 on our servers, and for the most part things went pretty well.  I’ll chime in on a couple of Martin’s points and mention a few other things.

> On Sep 10, 2015, at 9:30 AM, Martin Hecht <hecht at hlrs.de> wrote:
> In any case the file systems should be clean before starting the
> upgrade, so I would recommend to run e2fsck on all targets and repair
> them before starting the upgrade. We did so, but unfortunately our
> e2fsprogs were not really up to date and after our lustre upgrade a lot
> of fixes for e2fsprogs were committed to Whamcloud's e2fsprogs git. So,
> probably some errors on the file systems were still present, but
> unnoticed when we did the upgrade.

This is a very important point.  While I didn’t run e2fsck before the upgrade (though maybe I should have), I made sure to install the latest e2fsprogs.
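
As a rough sketch of the pre-upgrade check discussed above, the snippet below only builds and prints the command lines; it does not run anything. It assumes the targets are unmounted ldiskfs devices, and the device paths are hypothetical placeholders, not names from this thread.

```python
import shlex


def plan_readonly_checks(devices):
    """Return one read-only e2fsck command (as an argv list) per device."""
    # -f forces a check even if the file system is marked clean.
    # -n opens the file system read-only and answers "no" to every prompt,
    # so this pass reports problems without modifying anything.
    return [["e2fsck", "-f", "-n", dev] for dev in devices]


if __name__ == "__main__":
    # Hypothetical device paths, for illustration only.
    targets = ["/dev/mapper/mdt0", "/dev/mapper/ost0"]

    # e2fsck -V prints the installed e2fsprogs version, which is worth
    # confirming before trusting the result of any check.
    print("e2fsck -V")
    for argv in plan_readonly_checks(targets):
        print(shlex.join(argv))
```

A read-only pass like this only shows what a repair run would touch; whether and how to repair is a separate decision.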

> Lustre 2 introduces the FID (which is something like an inode number,
> where lustre 1.8 used the inode number of the underlying ldiskfs, but
> with the possibility to have several MDTs in one file system a
> replacement was needed). The FID is stored in the inode, but it can also
> be activated that the FIDs are stored in the directory node, which makes
> lookups faster, especially when there are many files in a directory.
> However, there were bugs in the code that takes care about adding the
> FID to the directory entry when the file system is converted from 1.8 to
> 2.x. So, I would recommend to use a version in which these bugs are
> solved. We went to 2.4.1 that time. By default this fid_in_dirent
> feature is not automatically enabled, however, this is the only point
> where a performance boost may be expected... so we took the risk to
> enable this... and ran into some bugs.

Enabling fid_in_dirent prevents you from backing out of the upgrade.  In theory, if you upgraded to Lustre 2.x without enabling fid_in_dirent, you could always revert to Lustre 1.8.  We tried this on a test system, and the downgrade seemed to work.  However, this was a small-scale test, and I have never tried it on a production file system.  But if you want to minimize possible complications, you could always leave this disabled for a while after the upgrade, and then enable it later on if things are going well.

> LU-4504 quota out of sync: turn off quota, run e2fsck, turn it on again
> - I believe that's something which must be done anyhow quite often,
> because there is no quotacheck anymore. It's run in the background when
> enabling quotas, but file systems have to be unmounted for this.

We didn’t exactly hit this bug, but I will mention that we have had a couple of instances where e2fsck complained about problems on an OST, and it turned out that we had to disable and re-enable quotas on the OST to correct the issue.
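
The sequence Martin describes (quota off, e2fsck, quota on) might be sketched as below. This is an assumption-laden illustration, not the exact procedure either of us ran: it assumes the target is unmounted and that quota is toggled through the ldiskfs quota feature with tune2fs, and the device path is a hypothetical placeholder. The snippet only prints the commands.

```python
import shlex


def plan_quota_reset(device):
    """Return argv lists for a quota off / fsck / quota on cycle on one target."""
    return [
        # Assumption: quota is controlled by the ldiskfs "quota" feature flag.
        ["tune2fs", "-O", "^quota", device],  # clear the quota feature
        ["e2fsck", "-f", device],             # force a full check with quota off
        ["tune2fs", "-O", "quota", device],   # set the quota feature again
    ]


if __name__ == "__main__":
    # Hypothetical device path, for illustration only.
    for argv in plan_quota_reset("/dev/mapper/ost0"):
        print(shlex.join(argv))
```

Printing the plan first makes it easy to review the order of operations for each OST before touching a production target.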

> LU-4743: We had to remove the CATALOGS file on another file system
> (otherwise the MDT wouldn't mount)

We hit this problem.

Someone I know had to do a Lustre upgrade, and they suggested that I apply a patch for LU-4708 (which I did).  But if you upgrade to Lustre 2.5.2 or later, that patch should already be included.

My only other advice is to test as much as possible prior to the upgrade.  If you have a little test hardware, install the same Lustre 1.8 version you are currently running in production and then try upgrading that to the new Lustre version.  I think preparation is the key.  I spent about two months reading about upgrade procedures, talking with others who have upgraded, reading JIRA bug reports, and running tests on hardware.

Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
