[lustre-discuss] Hidden QoS in Lustre ?

Nathan Dauchy - NOAA Affiliate nathan.dauchy at noaa.gov
Wed Nov 11 18:25:50 PST 2020


Simon, Tung-Han,

You may also want to watch these tickets that seem to be related to the
issue you describe:
https://jira.whamcloud.com/browse/LU-14124
https://jira.whamcloud.com/browse/LU-14125

-Nathan


On Mon, Nov 2, 2020 at 7:18 AM Simon Guilbault <
simon.guilbault at calculquebec.ca> wrote:

> Hi,
>
> If you set it on the MGS, it becomes the new default for all clients and
> for new mounts of the FS. The catch is that you need LU-12759 (fixed in
> 2.12.4) on your clients, since there was a bug on older clients and the
> setting did not work correctly there.
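>
> For example, a minimal sketch of applying and then checking this, assuming
> the clients are already on 2.12.4 or later:
>
>         # on the MGS: make grant_shrink=0 the persistent default
>         lctl set_param -P osc.*.grant_shrink=0
>
>         # on a client, after a fresh mount: confirm version and setting
>         lctl get_param version
>         lctl get_param osc.*.grant_shrink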
>
> On Mon, Nov 2, 2020 at 12:38 AM Tung-Han Hsieh <
> thhsieh at twcp1.phys.ntu.edu.tw> wrote:
>
>> Dear Simon,
>>
>> Following your suggestions, we have now confirmed that the problem of
>> dropping I/O performance on a client under continuous background I/O is
>> solved. It works like a charm. Thank you so much!
>>
>> Here is a final question. We found that this command:
>>
>>         lctl set_param osc.*.grant_shrink=0
>>
>> can be run on the client, which pins the value of "cur_grant_bytes" at
>> its highest value, 1880752127, and thereby fixes the problem. However,
>> whenever we remount the file system (I mean, explicitly umount and mount
>> it again), we need to execute this command again to set grant_shrink to
>> zero.
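>>
>> For example, a minimal sketch of what we run after each remount (using the
>> same mount options and fsname as later in this thread):
>>
>>         mount -t lustre -o flock mdt@o2ib:/chome /home
>>         lctl set_param osc.*.grant_shrink=0
>>         lctl get_param osc.*.cur_grant_bytes    # should stay at the full grant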
>>
>> But this command:
>>
>>         lctl set_param -P osc.*.grant_shrink=0
>>
>> has to be run on the MGS node. When setting it only on the MGS, without
>> also setting it on the client, it seems that the "cur_grant_bytes" of the
>> testing client still drops under continuous background I/O. So I am
>> asking: what is the meaning of this setting on the MGS node?
>>
>> Thank you very much.
>>
>>
>> T.H.Hsieh
>>
>> On Fri, Oct 30, 2020 at 01:37:01PM +0800, Tung-Han Hsieh wrote:
>> > Dear Simon,
>> >
>> > Thank you very much for the useful information. We are now arranging a
>> > system maintenance date in order to upgrade to Lustre 2.12.5. Then we
>> > will follow your suggestion and see whether this problem is fixed.
>> >
>> > Here I report a test of how cur_grant_bytes changed over time under
>> > continuous I/O. Again, the client ran the following script for
>> > continuous reading in the background:
>> >
>> >     # The Lustre file system was mounted under /home
>> >     while [ 1 ]; do
>> >         tar cf - /home/large/data | ssh remote_host "cat > /dev/null"
>> >     done
>> >
>> > Every 20 minutes, on the same client, we copied a 600MB file from one
>> > directory to another within Lustre, and checked "cur_grant_bytes" with
>> > the following command on that client:
>> >
>> >     /opt/lustre/sbin/lctl get_param osc.*.cur_grant_bytes
>> >
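>> > For reference, this check could also be scripted as a simple logging
>> > loop, e.g. (a sketch, assuming a fixed 20-minute interval):
>> >
>> >     while [ 1 ]; do
>> >         date
>> >         /opt/lustre/sbin/lctl get_param osc.*.cur_grant_bytes
>> >         sleep 1200
>> >     done
>> >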
>> > The result is (every line separated by around 20 mins):
>> >
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=1880752127
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=1410564096
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=1059201024
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=794400768
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=595800576
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=446850432
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=335137824
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=251353368
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=188515026
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=141386270
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=106039703
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=79529778
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=59647334
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=44735501
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=33551626
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=25163720
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=18872790
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=14154593
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=10615945
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=7961959
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=5971470
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=4478603
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=3358953
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=2519215
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=1889412
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=1417059
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=1062795
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=797097
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=797097
>> > ....
>> >
>> > The value 797097 seems to be the minimum. When it dropped to 1062795,
>> > the time of the cp increased dramatically, from around 1 second to 1
>> > minute. In addition, during the test the cluster was completely idle,
>> > and it is clear that the test does not saturate either the network or
>> > the MDT / OST hardware (they had almost no load).
>> >
>> > I am wondering whether this could be a bug worth reporting to the
>> > development team.
>> >
>> > Best Regards,
>> >
>> > T.H.Hsieh
>> >
>> > On Thu, Oct 29, 2020 at 09:49:42AM -0400, Simon Guilbault wrote:
>> > > Our current workaround was to use the following command on the MGS with
>> > > Lustre 2.12.5, which includes the patches from LU-12651 and LU-12759 (we
>> > > were using a patched 2.12.4 a few months ago):
>> > >
>> > > lctl set_param -P osc.*.grant_shrink=0
>> > >
>> > > We could not find the root cause of the underlying problem; dynamic grant
>> > > shrinking seems to be useful when the OSTs are running out of free space.
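>> > >
>> > > If OST free space is the concern here, one quick way to check it from a
>> > > client is, for example:
>> > >
>> > > lfs df -h /home
>> > >
>> > > which reports the space usage of each OST (/home being the client mount
>> > > point used elsewhere in this thread).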
>> > >
>> > > On Wed, Oct 28, 2020 at 11:47 PM Tung-Han Hsieh <
>> > > thhsieh at twcp1.phys.ntu.edu.tw> wrote:
>> > >
>> > > > Dear Simon,
>> > > >
>> > > > Thank you very much for your hint. Yes, you are right. We compared
>> > > > the grant sizes of the two clients by running, on each client:
>> > > >
>> > > >         lctl get_param osc.*.cur_grant_bytes
>> > > >
>> > > > - Client A: It has run the following large data transfer for over 36 hrs.
>> > > >
>> > > >         while [ 1 ]; do
>> > > >             tar cf - /home/large/data | ssh remote_host "cat > /dev/null"
>> > > >         done
>> > > >
>> > > >   The value of "cur_grant_bytes" is 796134.
>> > > >
>> > > > - Client B: It was almost idle while Client A was running.
>> > > >
>> > > >   The value of "cur_grant_bytes" is 1715863552.
>> > > >
>> > > > If this is what hurt the I/O performance of Client A so severely, is
>> > > > it possible to keep it at a constant value, at least for the head node?
>> > > > (The head node is the most likely machine in the whole cluster to have
>> > > > large, long-running data I/O, especially in a data center.)
>> > > >
>> > > > I would also like to ask: why does this value have to be dynamically
>> > > > adjusted?
>> > > >
>> > > > Thank you very much in advance for your comments.
>> > > >
>> > > > Best Regards,
>> > > >
>> > > > T.H.Hsieh
>> > > >
>> > > > On Wed, Oct 28, 2020 at 02:00:21PM -0400, Simon Guilbault wrote:
>> > > > > Hi, we had a similar performance problem on our login/DTN nodes a few
>> > > > > months ago. The problem was that the grant size was shrinking and
>> > > > > getting stuck under 1MB. Once under 1MB, the client had to send every
>> > > > > request to the OST using sync IO.
>> > > > >
>> > > > > Check the output of the following command:
>> > > > > lctl get_param osc.*.cur_grant_bytes
>> > > > >
>> > > > > On Wed, Oct 28, 2020 at 12:08 AM Tung-Han Hsieh <
>> > > > > thhsieh at twcp1.phys.ntu.edu.tw> wrote:
>> > > > >
>> > > > > > Dear All,
>> > > > > >
>> > > > > > Sorry, I am not sure whether this mail was successfully posted to
>> > > > > > the lustre-discuss mailing list or not, so I am resending it. Please
>> > > > > > ignore it if you have already read it.
>> > > > > >
>> > > > > >
>> > > > > > ===========================================================================
>> > > > > >
>> > > > > > Dear Andreas,
>> > > > > >
>> > > > > > Thank you very much for your kind suggestions. These days I got a
>> > > > > > chance to follow your suggestions for the test. This email reports
>> > > > > > the results I have obtained so far. What I have done is:
>> > > > > >
>> > > > > > 1. Upgraded one client (with Infiniband) to Lustre 2.13.56_44_gf8a8d3f
>> > > > > >    (obtained from GitHub). The build configuration is:
>> > > > > >
>> > > > > >    - Linux kernel 4.19.123.
>> > > > > >    - Infiniband MLNX_OFED_SRC-4.6-1.0.1.1.
>> > > > > >    - ./configure --prefix=/opt/lustre \
>> > > > > >                  --with-o2ib=/path/of/mlnx-ofed-kernel-4.6 \
>> > > > > >                  --disable-server --enable-mpitests=no
>> > > > > >    - make
>> > > > > >    - make install
>> > > > > >
>> > > > > > 2. Mounted the Lustre file system (MDT/OST servers: version 2.12.4,
>> > > > > >    with Infiniband and a ZFS backend) with this command:
>> > > > > >
>> > > > > >    - mount -t lustre -o flock mdt@o2ib:/chome /home
>> > > > > >
>> > > > > > 3. The script to simulate a large data transfer is the following
>> > > > > >    (the directory "/home/large/data" contains 758 files, each 600MB
>> > > > > >    in size):
>> > > > > >
>> > > > > >    while [ 1 ]; do
>> > > > > >        tar cf - /home/large/data | ssh remote_host "cat > /dev/null"
>> > > > > >    done
>> > > > > >
>> > > > > >    P.S. Note that this scenario is common in a large data center,
>> > > > > >        where some users transfer large data out of the data center
>> > > > > >        through the head node, while other users copy files and do
>> > > > > >        their normal work on the same head node.
>> > > > > >
>> > > > > > 4. During the background data transfer, I occasionally ran this
>> > > > > >    command on the same client to test whether there was any
>> > > > > >    abnormality in I/O performance (where /home/dir1/file is 600MB):
>> > > > > >
>> > > > > >    cp /home/dir1/file /home/dir2/
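>> > > > > >
>> > > > > >    For example, each run can be timed with (a simple sketch):
>> > > > > >
>> > > > > >    time cp /home/dir1/file /home/dir2/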
>> > > > > >
>> > > > > >    In the beginning this command completed in about 1 second. But
>> > > > > >    after around 18 hours (not exactly, because the test ran overnight
>> > > > > >    while I was sleeping), the problem appeared: the time to complete
>> > > > > >    the same cp command was more than 1 minute.
>> > > > > >
>> > > > > >    During the test, I am sure that the whole cluster was idle. The
>> > > > > >    MDT and OST servers had no other load. The CPU usage of the
>> > > > > >    testing client was below 0.3.
>> > > > > >
>> > > > > >    Then I stopped the test and let the whole system sit completely
>> > > > > >    idle. But even after 3 hours, the I/O abnormality of the same "cp"
>> > > > > >    command was still there. Only after I unmounted and remounted
>> > > > > >    /home did the "cp" performance recover to normal.
>> > > > > >
>> > > > > > Before and after remounting /home (which I call "reset"), I did the
>> > > > > > following tests:
>> > > > > >
>> > > > > > 1. Using "top" to check the memory usage:
>> > > > > >
>> > > > > > Before reset:
>> > > > > > =====================================
>> > > > > > top - 10:43:15 up 35 days, 52 min,  3 users,  load average: 0.00, 0.00, 0.00
>> > > > > > Tasks: 404 total,   1 running, 162 sleeping,   0 stopped,   0 zombie
>> > > > > > %Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
>> > > > > > KiB Mem : 13232632+total, 13000131+free,   647784 used,  1677220 buff/cache
>> > > > > > KiB Swap: 15631240 total, 15631240 free,        0 used. 13076376+avail Mem
>> > > > > >
>> > > > > > After reset:
>> > > > > > =====================================
>> > > > > > top - 10:48:02 up 35 days, 57 min,  3 users,  load average: 0.04, 0.01, 0.00
>> > > > > > Tasks: 395 total,   1 running, 159 sleeping,   0 stopped,   0 zombie
>> > > > > > %Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
>> > > > > > KiB Mem : 13232632+total, 12946539+free,   675948 used,  2184976 buff/cache
>> > > > > > KiB Swap: 15631240 total, 15631240 free,        0 used. 13073571+avail Mem
>> > > > > >
>> > > > > >    It seems that most of the memory was in the "free" state. The
>> > > > > >    amount of hidden memory was negligible, so I did not investigate
>> > > > > >    the slab memory usage any further.
>> > > > > >
>> > > > > > 2. Using "strace" with the following commands:
>> > > > > >
>> > > > > >    - Before reset (each cp took 1 min):
>> > > > > >      strace -c -o /tmp/log2-err.txt cp /home/dir1/file /home/dir2/
>> > > > > >
>> > > > > >    - After reset (each cp took 1 sec):
>> > > > > >      strace -c -o /tmp/log2-reset.txt cp /home/dir1/file /home/dir2/
>> > > > > >
>> > > > > >    From the log files, the main time consumers were the read and
>> > > > > >    write syscalls. The others are negligible.
>> > > > > >
>> > > > > >    % time     seconds  usecs/call     calls    errors syscall
>> > > > > >    ------ ----------- ----------- --------- --------- ----------------
>> > > > > >    (Before reset)
>> > > > > >     71.46    0.278424        1920       145           write
>> > > > > >     28.06    0.109322         705       155           read
>> > > > > >    (After reset)
>> > > > > >     52.92    0.299091        2063       145           write
>> > > > > >     46.85    0.264777        1708       155           read
>> > > > > >
>> > > > > >    Before the reset, since we had run the cp test on the same file a
>> > > > > >    few times, the file was already cached. So the read time is
>> > > > > >    smaller before the reset than after it (since the reset remounted
>> > > > > >    /home).
>> > > > > >
>> > > > > >    Hence, from this result, the syscall times look normal. The
>> > > > > >    performance drop seems to occur somewhere else.
>> > > > > >
>> > > > > > I have not yet investigated the Lustre kernel debug log by enabling
>> > > > > > Lustre debug=-1. We will find another chance to do that.
>> > > > > >
>> > > > > > For now, any comments or suggestions are very welcome.
>> > > > > >
>> > > > > > Thanks in advance for your help.
>> > > > > >
>> > > > > >
>> > > > > > Best Regards,
>> > > > > >
>> > > > > > T.H.Hsieh
>> > > > > >
>> > > > > >
>> > > > > > On Thu, Oct 08, 2020 at 01:32:53PM -0600, Andreas Dilger wrote:
>> > > > > > > On Oct 8, 2020, at 10:37 AM, Tung-Han Hsieh <
>> > > > > > thhsieh at twcp1.phys.ntu.edu.tw> wrote:
>> > > > > > > >
>> > > > > > > > Dear All,
>> > > > > > > >
>> > > > > > > > In the past months, we have encountered several episodes of Lustre
>> > > > > > > > I/O abnormally slowing down. It is quite mysterious: there seems to
>> > > > > > > > be no problem with the network hardware, nor with Lustre itself,
>> > > > > > > > since there is no error message at all on the MDT/OST/client sides.
>> > > > > > > >
>> > > > > > > > Recently we probably found a way to reproduce it, which gave us
>> > > > > > > > some suspicions. We found that if we continuously perform I/O on a
>> > > > > > > > client without stopping, then after some time threshold (probably
>> > > > > > > > more than 24 hours), the additional file I/O bandwidth of that
>> > > > > > > > client shrinks dramatically.
>> > > > > > > >
>> > > > > > > > Our configuration is the following:
>> > > > > > > > - One MDT and one OST server, based on ZFS + Lustre-2.12.4.
>> > > > > > > > - The OST is served by a RAID 5 system with 15 SAS hard disks.
>> > > > > > > > - Some clients connect to MDT/OST through Infiniband, some through
>> > > > > > > >   gigabit ethernet.
>> > > > > > > >
>> > > > > > > > Our test focused on the clients using Infiniband, and is described
>> > > > > > > > in the following:
>> > > > > > > >
>> > > > > > > > We have a huge amount of data (several TB) stored in the Lustre
>> > > > > > > > file system, to be transferred to an outside network. In order not
>> > > > > > > > to exhaust the network bandwidth of our institute, we transfer the
>> > > > > > > > data with limited bandwidth via the following command:
>> > > > > > > >
>> > > > > > > > rsync -av --bwlimit=1000 <data_in_Lustre> <out_side_server>:/<out_side_path>/
>> > > > > > > >
>> > > > > > > > That is, the transfer rate is 1 MB per second, which is relatively
>> > > > > > > > low. The client reads the data from Lustre through Infiniband. So
>> > > > > > > > during the data transmission there should presumably be no problem
>> > > > > > > > doing other data I/O on the same client. On average, copying a
>> > > > > > > > 600 MB file from one directory to another (both in the same Lustre
>> > > > > > > > file system) took about 1.0 - 2.0 secs, even while the rsync
>> > > > > > > > process was still working.
>> > > > > > > >
>> > > > > > > > But after about 24 hours of continuously sending data via rsync,
>> > > > > > > > the additional I/O on the same client shrank dramatically. When it
>> > > > > > > > happens, it takes more than 1 minute to copy a 600 MB file from one
>> > > > > > > > place to another (both in the same Lustre) while rsync is still
>> > > > > > > > running.
>> > > > > > > >
>> > > > > > > > Then we stopped the rsync process and waited for a while (about one
>> > > > > > > > hour). The I/O performance of copying that 600 MB file returned to
>> > > > > > > > normal.
>> > > > > > > >
>> > > > > > > > Based on this observation, we suspect that there may be a hidden
>> > > > > > > > QoS mechanism built into Lustre. When a process occupies the I/O
>> > > > > > > > bandwidth for a long time and exceeds some limit, does Lustre
>> > > > > > > > automatically shrink the I/O bandwidth for all processes running
>> > > > > > > > on the same client?
>> > > > > > > >
>> > > > > > > > I am not against such a QoS design, if it does exist. But the
>> > > > > > > > amount of shrinking seems too large for Infiniband (QDR and above).
>> > > > > > > > I am further suspecting that this could be due to our system mixing
>> > > > > > > > clients, some of which have Infiniband while others do not.
>> > > > > > > >
>> > > > > > > > Could anyone help to fix this problem? Any suggestions would be
>> > > > > > > > very much appreciated.
>> > > > > > >
>> > > > > > > There is no "hidden QOS", unless it is so well hidden that I don't
>> > > > > > > know about it.
>> > > > > > >
>> > > > > > > You could investigate several different things to isolate the problem:
>> > > > > > > - try with a 2.13.56 client to see if the problem is already fixed
>> > > > > > > - check if the client is using a lot of CPU when it becomes slow
>> > > > > > > - run strace on your copy process to see which syscalls are slow
>> > > > > > > - check memory/slab usage
>> > > > > > > - enable Lustre debug=-1 and dump the kernel debug log to see where
>> > > > > > >   the process is taking a long time to complete a request (see the
>> > > > > > >   sketch below)
>> > > > > > >
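>> > > > > > > For that last item, a minimal sketch of capturing the debug log
>> > > > > > > around one slow copy could look like the following (the buffer size
>> > > > > > > and file paths are only examples):
>> > > > > > >
>> > > > > > >         lctl set_param debug=-1          # enable all debug flags
>> > > > > > >         lctl set_param debug_mb=1024     # enlarge the debug buffer
>> > > > > > >         lctl clear                       # drop old debug entries
>> > > > > > >
>> > > > > > >         time cp /home/dir1/file /home/dir2/   # reproduce the slow copy
>> > > > > > >
>> > > > > > >         lctl dk > /tmp/lustre-debug.log  # dump the kernel debug log
>> > > > > > >         lctl set_param debug=0           # turn verbose debugging off again
>> > > > > > >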
>> > > > > > > It is definitely possible that there is some kind of problem, since
>> > > > > > > this is not a very common workload to be continuously writing to the
>> > > > > > > same file descriptor for over a day.  You'll have to do the
>> > > > > > > investigation on your system to isolate the source of the problem.
>> > > > > > >
>> > > > > > > Cheers, Andreas
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > >
>> > > > > >
>> > > >
>>