[lustre-discuss] 1.8 client on 3.13.0 kernel

Martin Hecht hecht at hlrs.de
Thu Sep 10 06:30:42 PDT 2015


Hi Lewis,

it's difficult to tell how much of the data loss was actually related
to the lustre upgrade itself. We upgraded 6 file systems and had to do
it more or less in one shot, because at that time they were sharing a
common MGS server. All servers of one file system must be on the same
version (at least for the major upgrade from 1.8 to 2.x; there are
rolling upgrades for minor versions in the lustre 2 branch now, but I
have no experience with that).

In any case the file systems should be clean before starting the
upgrade, so I would recommend running e2fsck on all targets and
repairing them first. We did so, but unfortunately our e2fsprogs were
not really up to date, and after our lustre upgrade a lot of fixes for
e2fsprogs were committed to Whamcloud's e2fsprogs git. So probably
some errors were still present on the file systems, but went unnoticed
when we did the upgrade.
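
A pre-upgrade check could look like this (the device name is a
placeholder of course, and the targets must be unmounted):

    # check and repair an unmounted MDT/OST with up-to-date e2fsprogs
    e2fsck -f -p /dev/mapper/mdt0
    # if -p bails out because of serious problems, run interactively:
    e2fsck -f /dev/mapper/mdt0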

Lustre 2 introduces the FID (something like an inode number; lustre
1.8 used the inode number of the underlying ldiskfs, but with the
possibility of having several MDTs in one file system a replacement
was needed). The FID is stored in the inode, but you can additionally
enable storing the FIDs in the directory entries, which makes lookups
faster, especially when there are many files in a directory. However,
there were bugs in the code that takes care of adding the FID to the
directory entry when the file system is converted from 1.8 to 2.x. So
I would recommend using a version in which these bugs are fixed. We
went to 2.4.1 at that time. By default this fid_in_dirent feature is
not enabled automatically; however, it is the only point where a
performance boost can be expected... so we took the risk of enabling
it... and ran into some bugs.
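
If I remember correctly this is tied to the dirdata feature of ldiskfs
on the MDT. Treat the following as a sketch from memory, not a recipe,
and check the manual for your target version (device name is again a
placeholder):

    # fid_in_dirent relies on the ldiskfs dirdata feature;
    # enabling it on the unmounted MDT (from memory, double-check!)
    tune2fs -O dirdata /dev/mapper/mdt0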

We had other file systems still on 1.8, so with the server upgrade we
didn't upgrade the clients, because lustre 2 clients wouldn't have
been able to mount the 1.8 file systems. And we use quotas, and for
this you need the 1.8.9 client with a patch that corrects a defect of
the 1.8.9 client when it talks to 2.x servers (LU-3067); older 1.8
clients don't support the Lustre 2 quota at all (which came in 2.2 or
2.4, I'm not 100% sure). BTW, it still runs out of sync from time to
time, but the limit seems to be fine now, it's just the numbers the
users see: lfs quota prints numbers that are too low, so users run out
of quota earlier than they expect... It's better in the latest 2.5
versions now.
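
For the record, this is where the numbers disagree (user name and
mount point are made up):

    # the usage reported here was lower than what was really
    # allocated, so users hit their limit earlier than expected
    lfs quota -u someuser /lustre/fs1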

Here is an unsorted(!) list of bugs we hit during the lustre upgrade.
For most of them we weren't the first ones, but I guess you could wait
forever for the version in which all bugs are resolved :-)

LU-3067 - already mentioned above, a patch for 1.8.9 clients
interoperating with 2.x servers (1.8.9 is needed for working quota).
Without this patch clients become unresponsive with 100% cpu load,
then just hang and devices become unavailable; a reboot doesn't work,
so a power cycle is needed, but after a while the problem reappears.

LU-4504 - e2fsck noticed quota issues similar to this bug on OSTs.
Use the latest e2fsprogs, check again, and then the ldiskfs backend
doesn't run into this anymore.

e2fsck noticed quota issues on the MDT ("Problem in HTREE directory
inode 21685465: block #16 not referenced"), which however could be
fixed by e2fsck.

LU-5626 - MDT becomes read-only: one file system where the MDT had
been corrupted at an earlier stage and obviously not fully repaired
hit an LBUG upon MDT mount; it could only be mounted with the noscrub
option.
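
The emergency mount looked roughly like this (device name and mount
point are placeholders):

    # noscrub keeps OI scrub from running on the damaged MDT
    mount -t lustre -o noscrub /dev/mapper/mdt0 /mnt/lustre-mdt0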

The MDT group_upcall (which can be configured with tunefs) used to be
/usr/sbin/l_getgroups in lustre 1.8 and was set by default. The
program is called l_getidentity now and is not configured by default
anymore. You should either change it with tunefs or put an appropriate
symlink in place as a fallback. Anyhow, lustre 2 file systems don't
use it by default anymore, they just trust the client. It also means
that users/groups are not needed on the lustre servers anymore. (We
had local passwd/group files there so that secondary groups work
properly; alternatively you could configure ldap, but without the
group_upcall all of this is handled by the lustre client.)
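
If you want the upcall behaviour back, it is set like this nowadays
(parameter name from memory; device name is a placeholder):

    # persistent, on the unmounted MDT:
    tunefs.lustre --param mdt.identity_upcall=/usr/sbin/l_getidentity \
        /dev/mapper/mdt0
    # or at runtime on the MDS:
    lctl set_param mdt.*.identity_upcall=/usr/sbin/l_getidentity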

LU-5626 and LU-2627: the .. (parent) directory entries were damaged
by adding the FID; once all old directories were converted and all
files somehow recovered (in several consecutive attempts), the problem
was gone. The number of emergency maintenances is basically limited by
the depth of your directory structure. It could be repaired by running
e2fsck, followed by manually moving everything back (save the log of
the e2fsck, which tells you the relation between the objects in
lost+found and their original paths!)
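
As far as I remember we did the repair on the ldiskfs level, along
these lines (device name and log path are placeholders):

    # keep the e2fsck log, it maps the objects that end up in
    # lost+found to their original paths
    e2fsck -fy /dev/mapper/mdt0 2>&1 | tee /root/e2fsck-mdt0.log
    # then mount the MDT as ldiskfs and move the entries out of
    # lost+found back to where the log says they belong
    mount -t ldiskfs /dev/mapper/mdt0 /mnt/mdt0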

LU-4504 - quota out of sync: turn off quota, run e2fsck, turn it on
again. I believe that's something which must be done quite often
anyhow, because there is no quotacheck anymore; the accounting is
rebuilt in the background when quotas are enabled, but the file
systems have to be unmounted for the e2fsck.
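
With the new quota framework that is handled via conf_param, roughly
like this (the fsname fs1 is a placeholder; syntax as far as I recall
from the 2.4 manual):

    # turn enforcement off, fsck the unmounted targets, re-enable;
    # u = user quota, g = group quota
    lctl conf_param fs1.quota.mdt=none
    lctl conf_param fs1.quota.ost=none
    # ... umount the targets, run e2fsck, mount them again ...
    lctl conf_param fs1.quota.mdt=ug
    lctl conf_param fs1.quota.ost=ug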

Related to quota, there is a change in the lfs setquota command. The
manual says that soft limits must be < hard limits, but you now have
to specify both. You could pass a zero, but in later versions the
value must be present on the command line. In 1.8 lfs setquota was
more relaxed, but it simply didn't initialize some values properly.
This change caused our quota management to fail; however, after fixing
the call it worked fine again.
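
For reference, a call that works with the stricter versions looks like
this (user name and limits are made up; whether size suffixes like G
are accepted depends on your lfs version, otherwise give the values in
kilobytes):

    # -b/-i set the soft limits, -B/-I the hard limits;
    # soft < hard, and 0 means "no limit"
    lfs setquota -u someuser -b 900G -B 1T -i 0 -I 0 /lustre/fs1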

LU-3861 - quota severely broken: it was not possible to move files
for some users/groups while it worked for others. Copying, on the
other hand, seemed to work. Maybe this was in combination with one of
the first attempts to fix the FID issue. However, neither e2fsck nor
tune2fs could fix the problem. We had to upgrade to e2fsprogs 1.42.7,
which contained some improvements that made e2fsck able to fix this
and allowed ldiskfs to run more stably afterwards.

LU-3917: during the upgrade we needed to re-create the PENDING
directory on the ldiskfs level on one of our file systems.
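
That was done with the MDT mounted as ldiskfs, something like the
following (device name and mount point are placeholders; I don't
remember the exact permissions we gave the directory, see the ticket):

    # LU-3917 workaround: re-create the missing PENDING directory
    mount -t ldiskfs /dev/mapper/mdt0 /mnt/mdt0
    mkdir /mnt/mdt0/PENDING
    umount /mnt/mdt0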

LU-4743: We had to remove the CATALOGS file on another file system
(otherwise the MDT wouldn't mount)
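
Same pattern on the ldiskfs level; the file gets re-created on the
next mount (device name and mount point are placeholders):

    # LU-4743 workaround: remove the stale llog CATALOGS file
    mount -t ldiskfs /dev/mapper/mdt0 /mnt/mdt0
    rm /mnt/mdt0/CATALOGS
    umount /mnt/mdt0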

And if you upgrade to 2.5, there was a bug which caused the MDS to
crash when the large_xattr feature (needed for wide striping) is not
set and a user tries to use wide striping anyway. But probably you
don't have that many OSTs, because the stripe count was limited anyway
in 1.8.
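
If you do need wide striping on 2.5, the feature is enabled on the
unmounted ldiskfs MDT, something like this (feature name from memory;
in newer e2fsprogs it is known as ea_inode):

    # enable large xattrs on the MDT so wide striping works and
    # can't trip the MDS bug
    tune2fs -O large_xattr /dev/mapper/mdt0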

A couple of other problems were related to the software our supplier
uses to manage the lustre servers, but that's not a lustre issue, it's
just how a large number of servers is booted, maintained and
configured. Anyhow, fighting these problems on top didn't make things
easier ;-)

That was a very much shortened list of our upgrade trouble (shortened
not in the number of issues, but by leaving out the log messages,
discussions, and attempts to repair things...). Later we also
configured a separate MGS for each file system, upgraded once more to
2.5, and reworked the lnet configuration - all of that was much less
trouble than the upgrade from 1.8 to 2.4.1. Maybe, looking back, that
was a bad version, but at some point you have to decide on a target
version - and maybe I would do exactly the same step again, now with
the knowledge of what can happen and which things I must keep an eye
on. I wouldn't enable the fid_in_dirent feature though, and I would
for sure update e2fsprogs as a first step.

best regards,
Martin

On 09/09/2015 03:16 PM, Lewis Hyatt wrote:
> OK thanks for sharing your experience. Unfortunately I can't see a way
> for us to get duplicate hardware, so we will have to give it a shot;
> we were going to try the artificial test first as well. If you don't
> mind taking another minute, I'd be curious what was the nature of the
> problems you ran into... was it potential data loss, or just issues
> getting it to perform the upgrade? Thanks again.
>
> -lewis 

