<div dir="ltr">Our current workaround is to run the following command on the MGS, with Lustre 2.12.5, which includes the patches from LU-12651 and LU-12759 (a few months ago we were using a patched 2.12.4):<div><div>lctl set_param -P osc.*.grant_shrink=0</div><div><br></div><div>We could not find the root cause of the underlying problem. Dynamic grant shrinking seems to be useful when the OSTs are running out of free space. </div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Oct 28, 2020 at 11:47 PM Tung-Han Hsieh <<a href="mailto:thhsieh@twcp1.phys.ntu.edu.tw">thhsieh@twcp1.phys.ntu.edu.tw</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Dear Simon,<br>

<br>

Thank you very much for your hint. Yes, you are right. We compared<br>

the grant size on two clients by running the following on each client:<br>

<br>

        lctl get_param osc.*.cur_grant_bytes<br>

<br>

- Client A: It has run the following large data transfer for over 36 hrs.<br>

<br>

        while [ 1 ]; do<br>

            tar cf - /home/large/data | ssh remote_host "cat > /dev/null"<br>

        done<br>

<br>

  The value of "cur_grant_bytes" is 796134.<br>

<br>

- Client B: It was almost idle while Client A was running.<br>

<br>

  The value of "cur_grant_bytes" is 1715863552.<br>
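<br>

A minimal sketch of how this check could be automated, not part of the test above.<br>
It assumes that "lctl get_param" prints one "name=value" line per OSC, and it uses<br>
1 MiB as the threshold, following the "stuck under 1MB" symptom reported in this thread.<br>
The function name low_grant_oscs is hypothetical.<br>

```shell
#!/bin/sh
# Print every OSC whose cur_grant_bytes is below a threshold (default 1 MiB).
# Input: lines of the form "osc.<target>.cur_grant_bytes=<bytes>" on stdin,
# which is the assumed output format of "lctl get_param osc.*.cur_grant_bytes".
low_grant_oscs() {
    threshold="${1:-1048576}"
    awk -F= -v limit="$threshold" '
        /cur_grant_bytes=/ && ($2 + 0) < (limit + 0) { print $1 ": " $2 " bytes" }
    '
}

# Typical use on a client (requires the Lustre client tools):
#   lctl get_param osc.*.cur_grant_bytes | low_grant_oscs
```

With the two values reported here, only Client A's OSC (796134 bytes) would be listed.<br>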

<br>

If this is what seriously degraded the I/O performance of Client A,<br>

is it possible to keep the grant at a constant value, at least for the head<br>

node (since the head node is the one most likely to carry large,<br>

long-running data I/O for the whole cluster, especially in a data center)?<br>

<br>

I would also like to ask: why does this value have to be dynamically adjusted?<br>

<br>

Thank you very much for your comment in advance.<br>

<br>

Best Regards,<br>

<br>

T.H.Hsieh<br>

<br>

On Wed, Oct 28, 2020 at 02:00:21PM -0400, Simon Guilbault wrote:<br>

> Hi, we had a similar performance problem on our login/DTN nodes a few<br>

> months ago. The problem was that the grant size was shrinking and getting<br>

> stuck under 1MB. Once under 1MB, the client had to send every request to<br>

> the OST using sync IO.<br>

> <br>

> Check the output of the following command:<br>

> lctl get_param osc.*.cur_grant_bytes<br>

> <br>

> On Wed, Oct 28, 2020 at 12:08 AM Tung-Han Hsieh <<br>

> <a href="mailto:thhsieh@twcp1.phys.ntu.edu.tw" target="_blank">thhsieh@twcp1.phys.ntu.edu.tw</a>> wrote:<br>

> <br>

> > Dear All,<br>

> ><br>

> > Sorry, I am not sure whether this mail was successfully posted to<br>

> > the lustre-discuss mailing list or not, so I am resending it. Please<br>

> > ignore it if you have already read it.<br>

> ><br>

> > ===========================================================================<br>

> ><br>

> > Dear Andreas,<br>

> ><br>

> > Thank you very much for your kind suggestions. These days I got a chance<br>

> > to follow them in a test. This email reports the results I have<br>

> > obtained so far. What I did was:<br>

> ><br>

> > 1. Upgrade one client (with Infiniband) to Lustre 2.13.56_44_gf8a8d3f<br>

> >    (obtained from github). The compiling information is:<br>

> ><br>

> >    - Linux kernel 4.19.123.<br>

> >    - Infiniband MLNX_OFED_SRC-4.6-1.0.1.1.<br>

> >    - ./configure --prefix=/opt/lustre \<br>

> >                  --with-o2ib=/path/of/mlnx-ofed-kernel-4.6 \<br>

> >                  --disable-server --enable-mpitests=no<br>

> >    - make<br>

> >    - make install<br>

> ><br>

> > 2. We mounted the lustre file system (lustre MDT/OST servers: version<br>

> >    2.12.4 with Infiniband with ZFS backend) by this command:<br>

> ><br>

> >    - mount -t lustre -o flock mdt@o2ib:/chome /home<br>

> ><br>

> > 3. The script to simulate a large data transfer is the following:<br>

> >    (the directory "/home/large/data" contains 758 files, each 600MB in size)<br>

> ><br>

> >    while [ 1 ]; do<br>

> >        tar cf - /home/large/data | ssh remote_host "cat > /dev/null"<br>

> >    done<br>

> ><br>

> >    P.S. Note that this scenario is common in a large data center: some<br>

> >        users transfer large amounts of data out of the data center through<br>

> >        the head node, while other users copy files and do their<br>

> >        normal work on the same head node.<br>

> ><br>

> > 4. During the data transfer in the background, I occasionally ran this<br>

> >    command in the same client to test whether there is any abnormality<br>

> >    in I/O performance (where /home/dir1/file has size 600MB):<br>

> ><br>

> >    cp /home/dir1/file /home/dir2/<br>

> ><br>

> >    In the beginning this command can complete in about 1 sec. But after<br>

> >    around 18 hours (not exactly, because the test ran overnight while<br>

> >    I was sleeping), the problem appeared. The time to complete the same<br>

> >    cp command was more than 1 minute.<br>

> ><br>

> >    During the test, I am sure that the whole cluster was idle. The MDT<br>

> >    and OST servers had no other load. The CPU usage of the testing<br>

> >    client was below 0.3.<br>

> ><br>

> >    Then I stopped the test, and let the whole system go completely idle. But<br>

> >    after 3 hours, the I/O abnormality of the same "cp" command was still<br>

> >    there. Only after I unmounted and remounted /home did the "cp"<br>

> >    performance return to normal.<br>

> ><br>

> > Before and after remounting /home (which I call "reset"), I did the<br>

> > following tests:<br>

> ><br>

> > 1. Using "top" to check the memory usage:<br>

> ><br>

> > Before reset:<br>

> > =====================================<br>

> > top - 10:43:15 up 35 days, 52 min,  3 users,  load average: 0.00, 0.00, 0.00<br>

> > Tasks: 404 total,   1 running, 162 sleeping,   0 stopped,   0 zombie<br>

> > %Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st<br>

> > KiB Mem : 13232632+total, 13000131+free,   647784 used,  1677220 buff/cache<br>

> > KiB Swap: 15631240 total, 15631240 free,        0 used. 13076376+avail Mem<br>

> ><br>

> > After reset:<br>

> > =====================================<br>

> > top - 10:48:02 up 35 days, 57 min,  3 users,  load average: 0.04, 0.01, 0.00<br>

> > Tasks: 395 total,   1 running, 159 sleeping,   0 stopped,   0 zombie<br>

> > %Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st<br>

> > KiB Mem : 13232632+total, 12946539+free,   675948 used,  2184976 buff/cache<br>

> > KiB Swap: 15631240 total, 15631240 free,        0 used. 13073571+avail Mem<br>

> ><br>

> >    It seems that most of the memory was in the "free" state. The amount of<br>

> >    hidden memory was negligible. So I did not further investigate the<br>

> >    amount of slab memory.<br>

> ><br>

> > 2. Using "strace" with the following commands:<br>

> ><br>

> >    - Before reset (each cp took 1 min):<br>

> >      strace -c -o /tmp/log2-err.txt cp /home/dir1/file /home/dir2/<br>

> ><br>

> >    - After reset (each cp took 1 sec):<br>

> >      strace -c -o /tmp/log2-reset.txt cp /home/dir1/file /home/dir2/<br>

> ><br>

> >    From the log files, the read and write syscalls took most of the time.<br>

> >    The others are negligible.<br>

> ><br>

> >    % time     seconds  usecs/call     calls    errors syscall<br>

> >    ------ ----------- ----------- --------- --------- ----------------<br>

> >    (Before reset)<br>

> >     71.46    0.278424        1920       145           write<br>

> >     28.06    0.109322         705       155           read<br>

> >    (After reset)<br>

> >     52.92    0.299091        2063       145           write<br>

> >     46.85    0.264777        1708       155           read<br>

> ><br>

> >    Before reset, since we have done the cp test for the same file a<br>

> >    few times, the file was already cached. So the reading time is<br>

> >    smaller before reset than that after reset (since after reset /home<br>

> >    was remounted).<br>

> ><br>

> >    Hence from this result, the time of syscalls looks normal. The<br>

> >    performance drop seems to occur elsewhere.<br>

> ><br>

> > I have not yet investigated the Lustre kernel debug log by enabling<br>

> > Lustre debug=-1. We will find another chance to do it.<br>

> ><br>

> > Up to now, any comments or suggestions are very welcome.<br>

> ><br>

> > Thanks for your help in advance.<br>

> ><br>

> ><br>

> > Best Regards,<br>

> ><br>

> > T.H.Hsieh<br>

> ><br>

> ><br>

> > On Thu, Oct 08, 2020 at 01:32:53PM -0600, Andreas Dilger wrote:<br>

> > > On Oct 8, 2020, at 10:37 AM, Tung-Han Hsieh <<br>

> > <a href="mailto:thhsieh@twcp1.phys.ntu.edu.tw" target="_blank">thhsieh@twcp1.phys.ntu.edu.tw</a>> wrote:<br>

> > > ><br>

> > > > Dear All,<br>

> > > ><br>

> > > > In the past months, we have several times seen Lustre I/O slow down<br>

> > > > abnormally. It is quite mysterious, because there seems to be no problem<br>

> > > > with the network hardware or with Lustre itself: there is no error<br>

> > > > message at all on the MDT, OST, or client side.<br>

> > > ><br>

> > > > Recently we probably found a way to reproduce it, and now have some<br>

> > > > suspicions. We found that if we continuously perform I/O on a client<br>

> > > > without stopping, then after some time threshold (probably more than 24<br>

> > > > hours), the bandwidth available to additional file I/O on that client<br>

> > > > shrinks dramatically.<br>

> > > ><br>

> > > > Our configuration is the following:<br>

> > > > - One MDT and one OST server, based on ZFS + Lustre-2.12.4.<br>

> > > > - The OST is served by a RAID 5 system with 15 SAS hard disks.<br>

> > > > - Some clients connect to MDT/OST through Infiniband, some through<br>

> > > >  gigabit ethernet.<br>

> > > ><br>

> > > > Our test focused on the clients using Infiniband, and is described<br>

> > > > below:<br>

> > > ><br>

> > > > We have a huge amount of data (several TB) stored in the Lustre file<br>

> > > > system to be transferred to an outside network. In order not to exhaust<br>

> > > > the network bandwidth of our institute, we transfer the data with<br>

> > > > limited bandwidth via the following command:<br>

> > > ><br>

> > > > rsync -av --bwlimit=1000 <data_in_Lustre> <out_side_server>:/<out_side_path>/<br>

> > > ><br>

> > > > That is, the transfer rate is 1 MB per second, which is relatively<br>

> > > > low. The client reads the data from Lustre through Infiniband. So during<br>

> > > > the transfer, presumably there is no problem doing other data I/O<br>

> > > > on the same client. On average, copying a 600 MB file from one<br>

> > > > directory to another (both in the same Lustre file system) took about<br>

> > > > 1.0 - 2.0 secs, even while the rsync process was still running.<br>

> > > ><br>

> > > > But after about 24 hours of continuously sending data via rsync, the<br>

> > > > additional I/O on the same client slowed down dramatically. When that<br>

> > > > happened, it took more than 1 minute to copy a 600 MB file from one place<br>

> > > > to another (both in the same Lustre) while rsync was still running.<br>

> > > ><br>

> > > > Then we stopped the rsync process and waited for a while (about one<br>

> > > > hour). The I/O performance of copying that 600 MB file returned to normal.<br>

> > > ><br>

> > > > Based on this observation, we suspect that there may be a<br>

> > > > hidden QoS mechanism built into Lustre. When a process occupies the I/O<br>

> > > > bandwidth for a long time and exceeds some limit, does Lustre<br>

> > > > automatically shrink the I/O bandwidth for all processes running on the<br>

> > > > same client?<br>

> > > ><br>

> > > > I am not against such a QoS design, if it does exist. But the amount of<br>

> > > > shrinking seems to be too large for Infiniband (QDR and above). So<br>

> > > > I further wonder whether this is because our system mixes<br>

> > > > clients that have Infiniband with clients that do not.<br>

> > > ><br>

> > > > Could anyone help to fix this problem? Any suggestions will be much<br>

> > > > appreciated.<br>

> > ><br>

> > > There is no "hidden QOS", unless it is so well hidden that I don't know<br>

> > > about it.<br>

> > ><br>

> > > You could investigate several different things to isolate the problem:<br>

> > > - try with a 2.13.56 client to see if the problem is already fixed<br>

> > > - check if the client is using a lot of CPU when it becomes slow<br>

> > > - run strace on your copy process to see which syscalls are slow<br>

> > > - check memory/slab usage<br>

> > > - enable Lustre debug=-1 and dump the kernel debug log to see where<br>

> > >   the process is taking a long time to complete a request<br>

> > ><br>

> > > It is definitely possible that there is some kind of problem, since this<br>

> > > is not a very common workload to be continuously writing to the same file<br>

> > > descriptor for over a day.  You'll have to do the investigation on your<br>

> > > system to isolate the source of the problem.<br>

> > ><br>

> > > Cheers, Andreas<br>

> > ><br>

> > ><br>

> > ><br>

> > ><br>

> > ><br>

> ><br>

> ><br>


> ><br>

</blockquote></div>
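<br>
The workaround at the top of this thread disables grant shrinking from the MGS with "lctl set_param -P osc.*.grant_shrink=0". A minimal sketch for confirming on a client that the setting has propagated follows. It assumes that "lctl get_param osc.*.grant_shrink" prints one "name=value" line per OSC; the function name check_grant_shrink_disabled is hypothetical.<br>

```shell
#!/bin/sh
# List OSCs where grant shrinking is still enabled (grant_shrink != 0).
# Input: "osc.<target>.grant_shrink=<0|1>" lines on stdin, which is the
# assumed output format of "lctl get_param osc.*.grant_shrink".
# Exit status: 0 if every OSC reports 0, 1 if any OSC is still enabled.
check_grant_shrink_disabled() {
    awk -F= '
        /grant_shrink=/ && ($2 + 0) != 0 { print $1; found = 1 }
        END { exit found ? 1 : 0 }
    '
}

# Typical use on a client, after running the set_param command on the MGS:
#   lctl get_param osc.*.grant_shrink | check_grant_shrink_disabled \
#       && echo "grant shrinking is disabled on all OSCs"
```

An exit status of 1 means at least one OSC still has shrinking enabled, and the names printed show which ones.<br>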