[lustre-discuss] Compiling lustre-2.10.6

Tung-Han Hsieh thhsieh at twcp1.phys.ntu.edu.tw
Fri Mar 29 04:50:49 PDT 2019


Dear All,

This is a follow-up on the issue of migrating data out of an OST.

Two weeks ago we upgraded our Lustre file system to version
2.10.6 (the OSTs are based on ldiskfs). Since many OSTs are filled
to over 99% of their capacity, we planned to migrate part of their
data to emptier OSTs to balance the usage.

On the MDS, we turned off writing to the nearly full OSTs with:

(In MDS)
echo 0 > /proc/fs/lustre/osc/chome-OST0000-osc-MDT0000/max_create_count
echo 0 > /proc/fs/lustre/osc/chome-OST0001-osc-MDT0000/max_create_count
echo 0 > /proc/fs/lustre/osc/chome-OST0002-osc-MDT0000/max_create_count
....
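
With 36 OSTs to disable, the per-OST lines above can be generated by a
loop. The following is only a minimal sketch: it assumes OST indices are
formatted as four hexadecimal digits (as in "chome-OST001a" in the dmesg
line quoted later in this message), and the function names and the
DRY_RUN switch are invented for illustration. By default it prints the
commands instead of running them.

```shell
#!/bin/sh
# Sketch: generate the "echo 0 > .../max_create_count" commands for a list
# of OST indices. Names below are illustrative, not part of Lustre.
FSNAME=chome            # file system name, taken from the paths above
MDT=MDT0000             # MDT name, taken from the paths above

osc_param_path() {
    # $1 = file system name, $2 = OST index in decimal.
    # Assumption: the OST index appears as 4 hex digits in the path.
    printf '/proc/fs/lustre/osc/%s-OST%04x-osc-%s/max_create_count\n' \
        "$1" "$2" "$MDT"
}

disable_creates() {
    # $@ = decimal OST indices. With DRY_RUN=1 (the default) only print.
    for idx in "$@"; do
        path=$(osc_param_path "$FSNAME" "$idx")
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "echo 0 > $path"
        else
            echo 0 > "$path"
        fi
    done
}

disable_creates 0 1 2
```

Running it with DRY_RUN=0 on the MDS would perform the writes; reviewing
the printed commands first is the safer order.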

We have 40 OSTs in total, distributed across 8 file servers, only
one of which is newly installed and has a lot of free space.
So we turned off writing to the other 7 file servers (36 OSTs),
and are migrating their data one at a time.
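
This message does not say which tool performs the migration. One common
approach, sketched here under that caveat, is to list the files that have
objects on a given OST with "lfs find --ost" and pipe them to
"lfs_migrate". The mount point /mnt/chome is hypothetical, and the OST
UUID format (fsname-OSTxxxx_UUID) is an assumption based on the usual
Lustre naming. The script only prints the command line.

```shell
#!/bin/sh
# Sketch: build (and print) a migration command for one OST.
MOUNT=/mnt/chome        # hypothetical client mount point, not from the thread
FSNAME=chome            # file system name, from the paths above

migrate_cmd() {
    # $1 = decimal OST index.
    # Assumption: OST UUIDs look like <fsname>-OST<4 hex digits>_UUID.
    ost=$(printf '%s-OST%04x_UUID' "$FSNAME" "$1")
    # lfs find lists files with objects on that OST;
    # lfs_migrate -y migrates them without asking for confirmation.
    printf 'lfs find --ost %s %s | lfs_migrate -y\n' "$ost" "$MOUNT"
}

migrate_cmd 0
```

The printed line would be run on a Lustre client, one OST at a time,
after object creation on that OST has been disabled on the MDS.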

In the first week, progress was quite smooth, and we were happy that
everything looked fine. But then the migration became slower and
slower. There were no abnormal messages at all in the "dmesg" output
of any file server, so at first we did not pay attention to it. Today
we suddenly realized why the migration is getting slower: the file
system often freezes, without any response, for around 30 seconds
during migration.
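
Since the servers log nothing, a client-side probe can at least record
when the freezes happen. The following is a minimal sketch, not something
used in this thread: the 5-second threshold is arbitrary, the function
names are invented, and /tmp stands in for the Lustre mount point.

```shell
#!/bin/sh
# Sketch: time a directory listing and report it when it is slow.
THRESHOLD=5             # seconds; arbitrary value chosen for illustration

elapsed_secs() {
    # Run "$@" with output discarded and print the whole seconds it took.
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

probe() {
    # $1 = directory to probe, e.g. the client mount point of the file system.
    secs=$(elapsed_secs ls "$1")
    if [ "$secs" -ge "$THRESHOLD" ]; then
        echo "$(date '+%F %T') slow: ls $1 took ${secs}s"
    fi
}

probe /tmp              # replace /tmp with the Lustre mount point
```

Calling probe in a loop with a short sleep while the migration runs gives
timestamps that can be compared against the server logs.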

When the file system freezes, every command, whether running "ls",
copying a file, or even changing directory with "cd", just hangs
for around 30 seconds. The situation is quite similar to what we
encountered when using Lustre-2.5.X. At that time we tried
several ways to disable writing to the full OSTs:

(In MDS)
echo 0 > /proc/fs/lustre/osc/chome-OST0000-osc-MDT0000/max_create_count

or

(In MDS)
echo 0 > /proc/fs/lustre/osc/chome-OST0000-osc-MDT0000/active

or

(In OSS)
lctl set_param fail_loc=0x229 fail_val=-1

Whichever way we used, we always saw error messages like this in "dmesg":

========================================================================
[960570.287161] Lustre: chome-OST001a-osc-MDT0000: slow creates, last=[0x1001a0000:0x3ef241:0x0], next=[0x1001a0000:0x3ef241:0x0], reserved=0, syn_changes=0, syn_rpc_in_progress=0, status=0
========================================================================

Since Lustre-2.5.X has a bug in this area, we upgraded the system to
Lustre-2.10.6. However, after one week of migration, the same problem
arose again, but this time we did not see any error messages on any of
the file servers or clients.

Is there any way to fix it? Would upgrading to Lustre-2.10.7 help?

Thank you very much in advance for your response.

Best Regards,

T.H.Hsieh


On Sat, Mar 16, 2019 at 10:24:55AM +0800, Tung-Han Hsieh wrote:
> Dear YangSheng,
> 
> Sorry for the late reply, and thank you very much for your suggestion.
> Here is the follow-up on our tests.
> 
> We followed your suggestion. In this file:
> 
>     ldiskfs/kernel_patches/series/ldiskfs-3.0-sles11sp3.series
> 
> we added the following line at the end of that file:
> 
>     rhel6.3/ext4-export-64bit-name-hash.patch
> 
> in order to apply the patch you suggested. With the kernel
> "linux-3.0.101-138.gcdbe806.tar.gz" obtained from the SuSE
> (sles11sp3) Linux distribution, we can now successfully build the
> Lustre 2.10.6 code (using the ldiskfs backend) on our Debian
> 7.11 / 8.7 / 9.8 systems. We are now looking for a suitable
> opportunity to shut down our cluster and upgrade the Lustre file system.
> 
> Thanks very much for your suggestion. It solved a big problem for us.
> 
> Best Regards,
> 
> T.H.Hsieh
> 
> 
> On Wed, Mar 13, 2019 at 01:02:03AM +0800, YangSheng wrote:
> > Hi, Hsieh,
> > 
> > You can port the ldiskfs patch (ldiskfs/kernel_patches/patches/rhel6.3/ext4-export-64bit-name-hash.patch) to fix this issue. This patch has landed in RHEL, but apparently not in Debian.
> > 
> > Thanks,
> > YangSheng
> > 
> > > On Mar 13, 2019, at 12:36 AM, Tung-Han Hsieh <thhsieh at twcp1.phys.ntu.edu.tw> wrote:
> > > 
> > > Dear Farrell,
> > > 
> > > Thank you very much for your response. Unfortunately, the wiki does
> > > not cover our special environment.
> > > 
> > > We need to build Lustre Master (i.e., to handle MDS and OSS) on a
> > > Debian Linux 8.X/9.X operating system, with LDISKFS as the backend
> > > for MDT and OST. This is because we already have many OSSs holding
> > > a lot of data. They were upgraded from Lustre-1.8.8 to Lustre-2.5.3.
> > > These servers all run Debian 8.11 Linux. We just want to keep
> > > them running smoothly to provide a stable computing
> > > environment.
> > > 
> > > I saw that Lustre Master with LDISKFS is tightly bound to RHEL
> > > 6.4 or 7.3. Unfortunately, that is not our case, so we have to
> > > find some non-standard way to build it for our system.
> > > 
> > > From Lustre-1.6.X through 1.8.X and up to 2.5.X, we got the source
> > > code of the Linux kernel from SuSE, patched it, and then built the
> > > Lustre source. For example, for 2.5.X, we took the following Linux
> > > kernel from sles11sp3:
> > > 
> > >    linux-3.0.101-138.gcdbe806
> > > 
> > > and patch it via the following commands:
> > > 
> > >    cd linux-3.0.101-138.gcdbe806
> > >    ln -s ../lustre-2.5.X/lustre/kernel_patches/series/3.0-sles11sp3.series series
> > >    ln -s ../lustre-2.5.X/lustre/kernel_patches/patches patches
> > >    quilt push -av
> > > 
> > > Then we build the Linux kernel, and build Lustre with the following
> > > instructions:
> > > 
> > >    cd lustre-2.5.X
> > >    ./configure --prefix=/opt/lustre --with-linux=/usr/src/linux-3.0.101-138.gcdbe806 --with-o2ib=no
> > >    make
> > >    make install
> > > 
> > > These are the standard methods we found in the Lustre-1.8.X manuals.
> > > They worked before, and the resulting system ran without problems for
> > > a long time. So for Lustre-2.10.X, we wanted to try a similar method.
> > > Now we have run into compile errors saying that LDISKFS_HTREE_EOF_32BIT
> > > and LDISKFS_HTREE_EOF_64BIT are not defined.
> > > 
> > > I guess that these symbols are defined somewhere in RHEL systems,
> > > but unfortunately are not available in other Linux distributions. I
> > > searched the entire source code of lustre-2.10.6, including all the
> > > patch files that come with it, and found no definition of
> > > these two symbols. So maybe I should ask: what are the definitions
> > > of these symbols? We could just add them to the
> > > header file "ldiskfs/ldiskfs.h" to complete the compilation.
> > > 
> > > P.S. On Debian Linux, we have already successfully compiled Lustre-2.10.5
> > >    with the ZFS backend. It works very well in one of our clusters. However,
> > >    this time we need Lustre-2.10.X with the LDISKFS backend, because there
> > >    are a lot of OSTs formatted with LDISKFS. The total size of the data
> > >    is more than 50TB, which makes it impossible to back up and reformat.
> > > 
> > > Thanks very much.
> > > 
> > > T.H.Hsieh
> > > 
> > > On Tue, Mar 12, 2019 at 03:55:13PM +0000, Patrick Farrell wrote:
> > >> Hsieh,
> > >> 
> > >> 
> > >> We have instructions for compiling from source here on our Wiki:
> > >> 
> > >> https://wiki.whamcloud.com/display/PUB/Building+Lustre+from+Source
> > >> 
> > >> 
> > >> Are you following those?  If not, I'd suggest it - Your problem looks likely to be an error in the build process.
> > >> 
> > >> 
> > >> We also have prebuilt 2.10.6 packages for many platforms:
> > >> https://downloads.whamcloud.com/public/lustre/
> > >> 
> > >> 
> > >> - Patrick
> > >> 
> > >> 
> > >> 
> > >> ________________________________
> > >> From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of Tung-Han Hsieh <thhsieh at twcp1.phys.ntu.edu.tw>
> > >> Sent: Tuesday, March 12, 2019 9:22:57 AM
> > >> To: lustre-discuss at lists.lustre.org
> > >> Subject: [lustre-discuss] Compiling lustre-2.10.6
> > >> 
> > >> Dear All,
> > >> 
> > >> I am trying to compile lustre-2.10.6 from source code. During
> > >> compilation, there are undefined symbols:
> > >> 
> > >> ==========================================================================
> > >>  CC [M]  /home/thhsieh/lustre/L-2.10.6/lustre-2.10.6/lustre/osd-ldiskfs/osd_handler.o
> > >> In file included from /home/thhsieh/lustre/L-2.10.6/lustre-2.10.6/lustre/osd-ldiskfs/osd_handler.c:72:0:
> > >> /home/thhsieh/lustre/L-2.10.6/lustre-2.10.6/lustre/osd-ldiskfs/osd_internal.h: In function 'ldiskfs_get_htree_eof':
> > >> /home/thhsieh/lustre/L-2.10.6/lustre-2.10.6/lustre/osd-ldiskfs/osd_internal.h:1254:10: error: 'LDISKFS_HTREE_EOF_32BIT' undeclared (first use in this function)
> > >>   return LDISKFS_HTREE_EOF_32BIT;
> > >>          ^
> > >> /home/thhsieh/lustre/L-2.10.6/lustre-2.10.6/lustre/osd-ldiskfs/osd_internal.h:1254:10: note: each undeclared identifier is reported only once for each function it appears in
> > >> /home/thhsieh/lustre/L-2.10.6/lustre-2.10.6/lustre/osd-ldiskfs/osd_internal.h:1256:10: error: 'LDISKFS_HTREE_EOF_64BIT' undeclared (first use in this function)
> > >>   return LDISKFS_HTREE_EOF_64BIT;
> > >>          ^
> > >> /home/thhsieh/lustre/L-2.10.6/lustre-2.10.6/lustre/osd-ldiskfs/osd_handler.c: In function 'osd_check_lmv':
> > >> /home/thhsieh/lustre/L-2.10.6/lustre-2.10.6/lustre/osd-ldiskfs/osd_handler.c:968:19: error: 'LDISKFS_HTREE_EOF_64BIT' undeclared (first use in this function)
> > >>    filp->f_pos != LDISKFS_HTREE_EOF_64BIT);
> > >>                   ^
> > >> In file included from /home/thhsieh/lustre/L-2.10.6/lustre-2.10.6/lustre/osd-ldiskfs/osd_handler.c:72:0:
> > >> /home/thhsieh/lustre/L-2.10.6/lustre-2.10.6/lustre/osd-ldiskfs/osd_internal.h: In function 'ldiskfs_get_htree_eof':
> > >> /home/thhsieh/lustre/L-2.10.6/lustre-2.10.6/lustre/osd-ldiskfs/osd_internal.h:1257:1: error: control reaches end of non-void function [-Werror=return-type]
> > >> }
> > >> ^
> > >> cc1: all warnings being treated as errors
> > >> scripts/Makefile.build:321: recipe for target '/home/thhsieh/lustre/L-2.10.6/lustre-2.10.6/lustre/osd-ldiskfs/osd_handler.o' failed
> > >> ==========================================================================
> > >> 
> > >> I searched the source code and found only the definition of
> > >> LDISKFS_HTREE_EOF in ldiskfs/ldiskfs.h:
> > >> 
> > >> #define LDISKFS_HTREE_EOF       0x7fffffff
> > >> 
> > >> but I cannot find the definitions of LDISKFS_HTREE_EOF_32BIT and
> > >> LDISKFS_HTREE_EOF_64BIT. Could anyone tell me how to fix this problem?
> > >> 
> > >> Thanks very much.
> > >> 
> > >> T.H.Hsieh
> > >> _______________________________________________
> > >> lustre-discuss mailing list
> > >> lustre-discuss at lists.lustre.org
> > >> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> > 
> > 

