[lustre-discuss] Hidden QoS in Lustre ?

Tung-Han Hsieh thhsieh at twcp1.phys.ntu.edu.tw
Wed Oct 28 20:46:58 PDT 2020


Dear Simon,

Thank you very much for your hint. Yes, you are right. We compared
the grant size of two clients by running, on each client:

	lctl get_param osc.*.cur_grant_bytes

- Client A: It has been running the following large data transfer loop for over 36 hrs.

        while [ 1 ]; do
            tar cf - /home/large/data | ssh remote_host "cat > /dev/null"
        done

  The value of "cur_grant_bytes" is 796134.

- Client B: It was almost idle while Client A was running the loop.

  The value of "cur_grant_bytes" is 1715863552.
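
In case it is useful, here is the small loop I am now using to watch how
the grant evolves on the busy client (just a sketch built around the same
lctl parameter; the log path and interval are arbitrary):

        while true; do
            date >> /tmp/grant-log.txt
            lctl get_param osc.*.cur_grant_bytes >> /tmp/grant-log.txt
            sleep 600
        done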

If this is what severely hit the I/O performance of Client A, is it
possible to keep this value constant, at least for the head node? The
head node is the machine in the whole cluster most likely to perform
large, long-running data I/O, especially in a data center.

I would also like to ask: why does this value have to be adjusted
dynamically?
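
In case it helps others hitting the same issue, I am also planning to look
for grant-related tunables on the client, roughly like this (I have not yet
verified which of these parameters our clients actually expose, so please
treat the parameter names below as guesses to be checked):

        # list whatever grant-related parameters this client exposes
        lctl list_param osc.*.grant*

        # if a grant shrink interval tunable exists, inspect it
        lctl get_param osc.*.grant_shrink_interval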

Thank you very much for your comment in advance.

Best Regards,

T.H.Hsieh

On Wed, Oct 28, 2020 at 02:00:21PM -0400, Simon Guilbault wrote:
> Hi, we had a similar performance problem on our login/DTN nodes a few
> months ago: the grant size was shrinking and getting stuck under 1MB.
> Once under 1MB, the client had to send every request to the OST using
> sync IO.
> 
> Check the output of the following command:
> lctl get_param osc.*.cur_grant_bytes
> 
> On Wed, Oct 28, 2020 at 12:08 AM Tung-Han Hsieh <
> thhsieh at twcp1.phys.ntu.edu.tw> wrote:
> 
> > Dear All,
> >
> > Sorry, I am not sure whether this mail was successfully posted to the
> > lustre-discuss mailing list, so I am resending it. Please ignore it if
> > you have already read it.
> >
> > ===========================================================================
> >
> > Dear Andreas,
> >
> > Thank you very much for your kind suggestions. In the past few days I
> > had a chance to follow them and run the tests. This email reports the
> > results I have obtained so far. What I did was:
> >
> > 1. Upgrade one client (with Infiniband) to Lustre 2.13.56_44_gf8a8d3f
> >    (obtained from github). The compiling information is:
> >
> >    - Linux kernel 4.19.123.
> >    - Infiniband MLNX_OFED_SRC-4.6-1.0.1.1.
> >    - ./configure --prefix=/opt/lustre \
> >                  --with-o2ib=/path/of/mlnx-ofed-kernel-4.6 \
> >                  --disable-server --enable-mpitests=no
> >    - make
> >    - make install
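> >
> >    (As a quick sanity check that the client really runs the new build,
> >    I simply query the version it reports:)
> >
> >    lctl get_param version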
> >
> > 2. We mounted the lustre file system (lustre MDT/OST servers: version
> >    2.12.4 with Infiniband with ZFS backend) by this command:
> >
> >    - mount -t lustre -o flock mdt at o2ib:/chome /home
> >
> > 3. The script to simulate large data transfer is following:
> >    (the directory "/home/large/data" contains 758 files, each size 600MB)
> >
> >    while [ 1 ]; do
> >        tar cf - /home/large/data | ssh remote_host "cat > /dev/null"
> >    done
> >
> >    P.S. This scenario is common in a large data center: some users
> >         transfer large data sets out of the data center through the head
> >         node, while other users copy files and do their normal work on
> >         the same head node.
> >
> > 4. While the data transfer was running in the background, I occasionally
> >    ran this command on the same client to test whether there was any
> >    abnormality in I/O performance (where /home/dir1/file has size 600MB):
> >
> >    cp /home/dir1/file /home/dir2/
> >
> >    In the beginning this command completed in about 1 sec. But after
> >    roughly 18 hours (not exactly, because the test ran overnight while
> >    I was sleeping), the problem appeared: the same cp command took more
> >    than 1 minute to complete.
> >
> >    During the test I am sure the whole cluster was otherwise idle. The MDT
> >    and OST servers had no other load. The CPU usage of the testing client
> >    was below 0.3.
> >
> >    Then I stopped the test and let the whole system sit completely idle.
> >    But after 3 hours the I/O abnormality of the same "cp" command was still
> >    there. Only after I unmounted and remounted /home did the "cp"
> >    performance return to normal. (Next time I will time the cp runs
> >    automatically; see the sketch below.)
> >
> > Before and after remounting /home (which I call "reset"), I did the
> > following tests:
> >
> > 1. Using "top" to check the memory usage:
> >
> > Before reset:
> > =====================================
> > top - 10:43:15 up 35 days, 52 min,  3 users,  load average: 0.00, 0.00,
> > 0.00
> > Tasks: 404 total,   1 running, 162 sleeping,   0 stopped,   0 zombie
> > %Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,
> > 0.0 st
> > KiB Mem : 13232632+total, 13000131+free,   647784 used,  1677220 buff/cache
> > KiB Swap: 15631240 total, 15631240 free,        0 used. 13076376+avail Mem
> >
> > After reset:
> > =====================================
> > top - 10:48:02 up 35 days, 57 min,  3 users,  load average: 0.04, 0.01,
> > 0.00
> > Tasks: 395 total,   1 running, 159 sleeping,   0 stopped,   0 zombie
> > %Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,
> > 0.0 st
> > KiB Mem : 13232632+total, 12946539+free,   675948 used,  2184976 buff/cache
> > KiB Swap: 15631240 total, 15631240 free,        0 used. 13073571+avail Mem
> >
> >    It seems that most of the memory was in the "free" state, and the
> >    amount of hidden memory was negligible, so I did not investigate the
> >    slab memory further; if it becomes necessary, I will check it with
> >    the commands sketched below.
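> >
> >    (Roughly what I would run to check the slab and client cache usage;
> >    standard kernel tools plus one llite parameter:)
> >
> >    slabtop -o | head -20
> >    grep -E 'Slab|SReclaimable|SUnreclaim' /proc/meminfo
> >    lctl get_param llite.*.max_cached_mb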
> >
> > 2. Using "strace" with the following commands:
> >
> >    - Before reset (each cp took about 1 min):
> >      strace -c -o /tmp/log2-err.txt cp /home/dir1/file /home/dir2/
> >
> >    - After reset (each cp took about 1 sec):
> >      strace -c -o /tmp/log2-reset.txt cp /home/dir1/file /home/dir2/
> >
> >    From the log files, the major time consumers were the read and write
> >    syscalls. The others are negligible.
> >
> >    % time     seconds  usecs/call     calls    errors syscall
> >    ------ ----------- ----------- --------- --------- ----------------
> >    (Before reset)
> >     71.46    0.278424        1920       145           write
> >     28.06    0.109322         705       155           read
> >    (After reset)
> >     52.92    0.299091        2063       145           write
> >     46.85    0.264777        1708       155           read
> >
> >    Before reset we had already run the cp test on the same file a few
> >    times, so the file was already cached. That is why the read time is
> >    smaller before the reset than after it (after the reset /home was
> >    freshly remounted).
> >
> >    Hence, from this result the syscall times look normal. The performance
> >    drop seems to occur somewhere else. (Next time I will also record
> >    per-call latencies; see the strace sketch below.)
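> >
> >    (The strace invocation I plan to use for that, with plain strace
> >    options; the log path is arbitrary:)
> >
> >    strace -T -ttt -e trace=read,write -o /tmp/cp-trace.txt \
> >        cp /home/dir1/file /home/dir2/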
> >
> > I have not yet investigated the Lustre kernel debug log by enabling
> > debug=-1. We will find another chance to do it, roughly as sketched below.
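> >
> > (A minimal sketch of how I plan to capture the debug log around one slow
> > cp; the buffer size is a guess that still needs to be tuned for this
> > client:)
> >
> >    lctl set_param debug=-1          # enable all debug messages
> >    lctl set_param debug_mb=1024     # enlarge the kernel debug buffer
> >    lctl clear                       # drop anything already in the buffer
> >    cp /home/dir1/file /home/dir2/   # reproduce one slow copy
> >    lctl dk /tmp/lustre-debug.log    # dump the kernel debug log to a file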
> >
> > In the meantime, any comments or suggestions are very welcome.
> >
> > Thanks for your help in advance.
> >
> >
> > Best Regards,
> >
> > T.H.Hsieh
> >
> >
> > On Thu, Oct 08, 2020 at 01:32:53PM -0600, Andreas Dilger wrote:
> > > On Oct 8, 2020, at 10:37 AM, Tung-Han Hsieh <
> > thhsieh at twcp1.phys.ntu.edu.tw> wrote:
> > > >
> > > > Dear All,
> > > >
> > > > In the past months we encountered several episodes of Lustre I/O
> > > > slowing down abnormally. It is quite mysterious: there seems to be no
> > > > problem with the network hardware, nor with Lustre itself, since there
> > > > are no error messages at all on the MDT, OST or client side.
> > > >
> > > > Recently we probably found a way to reproduce it, and now have some
> > > > suspicions. We found that if we continuously perform I/O on a client
> > > > without stopping, then after some time threshold (probably more than
> > > > 24 hours) the additional file I/O bandwidth of that client shrinks
> > > > dramatically.
> > > >
> > > > Our configuration is the following:
> > > > - One MDT and one OST server, based on ZFS + Lustre-2.12.4.
> > > > - The OST is served by a RAID 5 system with 15 SAS hard disks.
> > > > - Some clients connect to MDT/OST through Infiniband, some through
> > > >  gigabit ethernet.
> > > >
> > > > Our test focused on the clients using InfiniBand, and is described
> > > > below:
> > > >
> > > > We have a huge amount of data (several TB) stored in the Lustre file
> > > > system that needs to be transferred to an outside network. In order not
> > > > to exhaust the network bandwidth of our institute, we transfer the data
> > > > with limited bandwidth via the following command:
> > > >
> > > > rsync -av --bwlimit=1000 <data_in_Lustre> <out_side_server>:/<out_side_path>/
> > > >
> > > > That is, the transfer rate is 1 MB per second, which is relatively low.
> > > > The client reads the data from Lustre through InfiniBand, so during the
> > > > transfer there should presumably be no problem doing other data I/O on
> > > > the same client. On average, copying a 600 MB file from one directory
> > > > to another (both in the same Lustre file system) took about 1.0 - 2.0
> > > > secs, even while the rsync process was still working.
> > > >
> > > > But after about 24 hours of continuously sending data via rsync, the
> > > > additional I/O on the same client shrank dramatically. When that
> > > > happened, it took more than 1 minute to copy a 600 MB file from one
> > > > place to another (both in the same Lustre) while rsync was still
> > > > running.
> > > >
> > > > Then we stopped the rsync process and waited for a while (about one
> > > > hour). The I/O performance of copying that 600 MB file returned to
> > > > normal.
> > > >
> > > > Based on this observation, we suspect that there may be a hidden QoS
> > > > mechanism built into Lustre. When a process occupies the I/O bandwidth
> > > > for a long time and exceeds some limit, does Lustre automatically
> > > > shrink the I/O bandwidth for all processes running on the same client?
> > > >
> > > > I am not against such a QoS design, if it does exist, but the amount of
> > > > shrinking seems too large for InfiniBand (QDR and above). I further
> > > > suspect that this might be because our system mixes clients, some with
> > > > InfiniBand and some without.
> > > >
> > > > Could anyone help us fix this problem? Any suggestions will be greatly
> > > > appreciated.
> > >
> > > There is no "hidden QOS", unless it is so well hidden that I don't know
> > > about it.
> > >
> > > You could investigate several different things to isolate the problem:
> > > - try with a 2.13.56 client to see if the problem is already fixed
> > > - check if the client is using a lot of CPU when it becomes slow
> > > - run strace on your copy process to see which syscalls are slow
> > > - check memory/slab usage
> > > - enable Lustre debug=-1 and dump the kernel debug log to see where
> > >   the process is taking a long time to complete a request
> > >
> > > It is definitely possible that there is some kind of problem, since this
> > > is not a very common workload to be continuously writing to the same file
> > > descriptor for over a day.  You'll have to do the investigation on your
> > > system to isolate the source of the problem.
> > >
> > > Cheers, Andreas
> > >
> > >
> > >
> > >
> > >
> >
> >
> > _______________________________________________
> > lustre-discuss mailing list
> > lustre-discuss at lists.lustre.org
> > http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> >

