[lustre-discuss] Hidden QoS in Lustre ?

Simon Guilbault simon.guilbault at calculquebec.ca
Thu Oct 29 06:49:42 PDT 2020


Our current workaround was to use the following command on the MGS with
Lustre 2.12.5, which includes the patches from LU-12651 and LU-12759 (we were
using a patched 2.12.4 a few months ago):
lctl set_param -P osc.*.grant_shrink=0

We could not find the root cause of the underlying problem; dynamic grant
shrinking seems to be useful when the OSTs are running out of free space.
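
For reference, a rough way to check the grant state on a client before and
after applying the workaround (a sketch; the grant_shrink parameter may not
be exposed on older clients):

lctl get_param osc.*.cur_grant_bytes
lctl get_param osc.*.grant_shrink

# or disable it temporarily on a single client for testing:
lctl set_param osc.*.grant_shrink=0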

On Wed, Oct 28, 2020 at 11:47 PM Tung-Han Hsieh <thhsieh at twcp1.phys.ntu.edu.tw> wrote:

> Dear Simon,
>
> Thank you very much for your hint. Yes, you are right. We compared
> the grant sizes of the two clients by running (on each client):
>
>         lctl get_param osc.*.cur_grant_bytes
>
> - Client A: It has been running the following large data transfer for over
>   36 hrs.
>
>         while [ 1 ]; do
>             tar cf - /home/large/data | ssh remote_host "cat > /dev/null"
>         done
>
>   The value of "cur_grant_bytes" is 796134.
>
> - Client B: It was almost idle while Client A was running its transfer.
>
>   The value of "cur_grant_bytes" is 1715863552.
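>
> For reference, a rough sketch of how we could log this value over time on a
> client (the interval and log path are arbitrary choices):
>
>         while true; do
>             echo "=== $(date) ===" >> /tmp/grant.log
>             lctl get_param osc.*.cur_grant_bytes >> /tmp/grant.log
>             sleep 60
>         done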
>
> If this is what seriously hit the I/O performance of Client A, is it
> possible to keep this value constant, at least for the head node (since
> the head node is the most likely one in the whole cluster to have large
> and long-running data I/O, especially in a data center)?
>
> I would also like to ask: why does this value have to be dynamically
> adjusted?
>
> Thank you very much for your comment in advance.
>
> Best Regards,
>
> T.H.Hsieh
>
> On Wed, Oct 28, 2020 at 02:00:21PM -0400, Simon Guilbault wrote:
> > Hi, we had a similar performance problem on our login/DTN nodes a few
> > months ago: the grant size was shrinking and getting stuck under 1MB.
> > Once under 1MB, the client had to send every request to the OST using
> > sync IO.
> >
> > Check the output of the following command:
> > lctl get_param osc.*.cur_grant_bytes
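> >
> > (A rough one-liner to flag any OSC whose grant has dropped below the 1MB
> > sync-IO threshold; a sketch, assuming the usual name=value output format:)
> >
> > lctl get_param osc.*.cur_grant_bytes | awk -F= '$2 < 1048576 {print $0}'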
> >
> > On Wed, Oct 28, 2020 at 12:08 AM Tung-Han Hsieh <thhsieh at twcp1.phys.ntu.edu.tw> wrote:
> >
> > > Dear All,
> > >
> > > Sorry that I am not sure whether this mail was successfully posted to
> > > the lustre-discuss mailing list or not, so I am resending it. Please
> > > ignore it if you have already read it.
> > >
> > >
> > > ===========================================================================
> > >
> > > Dear Andreas,
> > >
> > > Thank you very much for your kind suggestions. These days I got a chance
> > > to follow them for the test. This email reports the results I have
> > > obtained so far. What I have done is:
> > >
> > > 1. Upgrade one client (with Infiniband) to Lustre 2.13.56_44_gf8a8d3f
> > >    (obtained from github). The compiling information is:
> > >
> > >    - Linux kernel 4.19.123.
> > >    - Infiniband MLNX_OFED_SRC-4.6-1.0.1.1.
> > >    - ./configure --prefix=/opt/lustre \
> > >                  --with-o2ib=/path/of/mlnx-ofed-kernel-4.6 \
> > >                  --disable-server --enable-mpitests=no
> > >    - make
> > >    - make install
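> > >
> > >    As a quick sanity check after installation (a sketch; the exact
> > >    paths and version strings will differ):
> > >
> > >    lctl get_param version
> > >    cat /sys/fs/lustre/version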
> > >
> > > 2. We mounted the Lustre file system (Lustre MDT/OST servers: version
> > >    2.12.4, with Infiniband and a ZFS backend) with this command:
> > >
> > >    - mount -t lustre -o flock mdt at o2ib:/chome /home
> > >
> > > 3. The script to simulate a large data transfer is the following
> > >    (the directory "/home/large/data" contains 758 files, each of size
> > >    600MB):
> > >
> > >    while [ 1 ]; do
> > >        tar cf - /home/large/data | ssh remote_host "cat > /dev/null"
> > >    done
> > >
> > >    ps. Note that this scenario is common in a large data center, where
> > >        some users transfer large data out of the data center through
> > >        the head node, while other users copy files and do their
> > >        normal work on the same head node.
> > >
> > > 4. During the data transfer in the background, I occasionally ran this
> > >    command on the same client to test whether there was any abnormality
> > >    in I/O performance (where /home/dir1/file has a size of 600MB):
> > >
> > >    cp /home/dir1/file /home/dir2/
> > >
> > >    In the beginning this command completed in about 1 sec. But after
> > >    around 18 hours (not exactly, because the test ran overnight while
> > >    I was sleeping), the problem appeared. The time to complete the same
> > >    cp command was more than 1 minute.
> > >
> > >    During the test, I am sure that the whole cluster was idle. The MDT
> > >    and OST servers had no other load. The CPU usage of the testing
> > >    client was below 0.3.
> > >
> > >    Then I stopped the test and let the whole system sit completely idle.
> > >    But after 3 hours, the I/O abnormality of the same "cp" command was
> > >    still there. Only after I unmounted and remounted /home did the
> > >    abnormality of "cp" recover to normal.
> > >
> > > Before and after remounting /home (which I call "reset"), I did the
> > > following tests:
> > >
> > > 1. Using "top" to check the memory usage:
> > >
> > > Before reset:
> > > =====================================
> > > top - 10:43:15 up 35 days, 52 min,  3 users,  load average: 0.00, 0.00, 0.00
> > > Tasks: 404 total,   1 running, 162 sleeping,   0 stopped,   0 zombie
> > > %Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> > > KiB Mem : 13232632+total, 13000131+free,   647784 used,  1677220 buff/cache
> > > KiB Swap: 15631240 total, 15631240 free,        0 used. 13076376+avail Mem
> > >
> > > After reset:
> > > =====================================
> > > top - 10:48:02 up 35 days, 57 min,  3 users,  load average: 0.04, 0.01, 0.00
> > > Tasks: 395 total,   1 running, 159 sleeping,   0 stopped,   0 zombie
> > > %Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> > > KiB Mem : 13232632+total, 12946539+free,   675948 used,  2184976 buff/cache
> > > KiB Swap: 15631240 total, 15631240 free,        0 used. 13073571+avail Mem
> > >
> > >    It seems that most of the memory was in the "free" state. The amount
> > >    of hidden memory was negligible, so I did not further investigate the
> > >    amount of slab memory.
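> > >
> > >    (If we do revisit the slab usage later, something like the following
> > >    could give a rough picture; a sketch, and the Lustre cache names vary
> > >    between versions:)
> > >
> > >    slabtop -o | head -20
> > >    grep -iE 'ldlm|lustre|osc|lov' /proc/slabinfo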
> > >
> > > 2. Using "strace" with the following commands:
> > >
> > >    - Before reset (each cp took about 1 min):
> > >      strace -c -o /tmp/log2-err.txt cp /home/dir1/file /home/dir2/
> > >
> > >    - After reset (each cp took about 1 sec):
> > >      strace -c -o /tmp/log2-reset.txt cp /home/dir1/file /home/dir2/
> > >
> > >    From the log files, the major time consumers were the read and write
> > >    syscalls; the others were negligible.
> > >
> > >    % time     seconds  usecs/call     calls    errors syscall
> > >    ------ ----------- ----------- --------- --------- ----------------
> > >    (Before reset)
> > >     71.46    0.278424        1920       145           write
> > >     28.06    0.109322         705       155           read
> > >    (After reset)
> > >     52.92    0.299091        2063       145           write
> > >     46.85    0.264777        1708       155           read
> > >
> > >    Before reset, since we had done the cp test on the same file a
> > >    few times, the file was already cached. So the read time is
> > >    smaller before reset than after reset (since after the reset /home
> > >    was remounted and the cache was dropped).
> > >
> > >    Hence from this result, the time spent in syscalls looks normal. The
> > >    performance drop seems to occur elsewhere.
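> > >
> > >    (One caveat we may follow up on: strace -c sums the CPU time spent in
> > >    syscalls by default. A sketch of collecting wall-clock numbers instead,
> > >    assuming a reasonably recent strace:)
> > >
> > >    strace -c -w -o /tmp/cp-wall.txt cp /home/dir1/file /home/dir2/
> > >    strace -T -tt -e trace=read,write -o /tmp/cp-trace.txt \
> > >        cp /home/dir1/file /home/dir2/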
> > >
> > > I have not yet investigated the Lustre kernel debug log by enabling
> > > Lustre debug=-1. We will find another chance to do it.
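> > >
> > > (When we do, the rough plan is something like the following; a sketch,
> > > with an arbitrary output path:)
> > >
> > > lctl set_param debug=-1
> > > lctl clear                      # empty the kernel debug buffer
> > > cp /home/dir1/file /home/dir2/
> > > lctl dk /tmp/lustre-debug.log   # dump the debug buffer to a file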
> > >
> > > Meanwhile, any comments or suggestions are very welcome.
> > >
> > > Thanks for your help in advance.
> > >
> > >
> > > Best Regards,
> > >
> > > T.H.Hsieh
> > >
> > >
> > > On Thu, Oct 08, 2020 at 01:32:53PM -0600, Andreas Dilger wrote:
> > > > On Oct 8, 2020, at 10:37 AM, Tung-Han Hsieh <thhsieh at twcp1.phys.ntu.edu.tw> wrote:
> > > > >
> > > > > Dear All,
> > > > >
> > > > > In the past months, we encountered several episodes of Lustre I/O
> > > > > slowing down abnormally. It is quite mysterious, since there seems to be
> > > > > no problem with the network hardware, nor with Lustre itself, as there
> > > > > are no error messages at all on the MDT/OST/client sides.
> > > > >
> > > > > Recently we probably found a way to reproduce it, and now have some
> > > > > suspicions. We found that if we continuously perform I/O on a client
> > > > > without stopping, then after some time threshold (probably more than 24
> > > > > hours), the additional file I/O bandwidth of that client shrinks
> > > > > dramatically.
> > > > >
> > > > > Our configuration is the following:
> > > > > - One MDT and one OST server, based on ZFS + Lustre-2.12.4.
> > > > > - The OST is served by a RAID 5 system with 15 SAS hard disks.
> > > > > - Some clients connect to MDT/OST through Infiniband, some through
> > > > >  gigabit ethernet.
> > > > >
> > > > > Our test focused on the clients using Infiniband, and is described
> > > > > in the following:
> > > > >
> > > > > We have a huge (several TB) amount of data stored in the Lustre file
> > > > > system to be transferred to an outside network. In order not to exhaust
> > > > > the network bandwidth of our institute, we transfer the data with limited
> > > > > bandwidth via the following command:
> > > > >
> > > > > rsync -av --bwlimit=1000 <data_in_Lustre> <out_side_server>:/<out_side_path>/
> > > > >
> > > > > That is, the transfer rate is 1 MB per second, which is relatively
> > > > > low. The client reads the data from Lustre through Infiniband, so during
> > > > > the data transmission there should presumably be no problem doing other
> > > > > data I/O on the same client. On average, copying a 600 MB file from one
> > > > > directory to another (both in the same Lustre file system) took about
> > > > > 1.0 - 2.0 secs, even while the rsync process was still working.
> > > > >
> > > > > But after about 24 hours of continuously sending data via rsync, the
> > > > > additional I/O on the same client was dramatically reduced. When this
> > > > > happened, it took more than 1 minute to copy a 600 MB file from one place
> > > > > to another (both in the same Lustre) while rsync was still running.
> > > > >
> > > > > Then we stopped the rsync process and waited for a while (about one
> > > > > hour). The I/O performance of copying that 600 MB file returned to
> > > > > normal.
> > > > >
> > > > > Based on this observation, we are wondering whether there is a hidden
> > > > > QoS mechanism built into Lustre. When a process occupies the I/O
> > > > > bandwidth for a long time and exceeds some limit, does Lustre
> > > > > automatically shrink the I/O bandwidth for all processes running on the
> > > > > same client?
> > > > >
> > > > > I am not against such a QoS design, if it does exist. But the amount of
> > > > > shrinking seems to be too large for Infiniband (QDR and above). So I am
> > > > > further wondering whether this is because our system mixes clients that
> > > > > have Infiniband with clients that do not.
> > > > >
> > > > > Could anyone help to fix this problem? Any suggestions will be very much
> > > > > appreciated.
> > > >
> > > > There is no "hidden QOS", unless it is so well hidden that I don't know
> > > > about it.
> > > >
> > > > You could investigate several different things to isolate the problem:
> > > > - try with a 2.13.56 client to see if the problem is already fixed
> > > > - check if the client is using a lot of CPU when it becomes slow
> > > > - run strace on your copy process to see which syscalls are slow
> > > > - check memory/slab usage
> > > > - enable Lustre debug=-1 and dump the kernel debug log to see where
> > > >   the process is taking a long time to complete a request
> > > >
> > > > It is definitely possible that there is some kind of problem, since this
> > > > is not a very common workload to be continuously writing to the same file
> > > > descriptor for over a day.  You'll have to do the investigation on your
> > > > system to isolate the source of the problem.
> > > >
> > > > Cheers, Andreas