<div dir="ltr"><div>Hi,</div><div><br></div>If you set it on the MGS, it becomes the new default for all clients and for every new mount of the file system. The catch is that your clients need the fix for LU-12759 (fixed in 2.12.4), because a bug on older clients kept that setting from working correctly.</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Nov 2, 2020 at 12:38 AM Tung-Han Hsieh <<a href="mailto:thhsieh@twcp1.phys.ntu.edu.tw">thhsieh@twcp1.phys.ntu.edu.tw</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Dear Simon,<br>

<br>

Following your suggestions, we have now confirmed that the problem of<br>

a client's I/O performance dropping under continuous background I/O<br>

is solved. It works like a charm. Thank you so much!<br>

<br>

Here is a final question. We found that this command:<br>

<br>

        lctl set_param osc.*.grant_shrink=0<br>

<br>

can be run on the client. It pins "cur_grant_bytes" at its highest<br>

value, 1880752127, and thereby fixes the problem.<br>

Whenever we remount the file system (that is, explicitly umount and<br>

mount it), we need to run this command again to<br>

set grant_shrink back to zero.<br>

<br>

But this command:<br>

<br>

        lctl set_param -P osc.*.grant_shrink=0<br>

<br>

has to be run on the MGS node. When we set it only on the MGS and<br>

not on the client, the "cur_grant_bytes" of the<br>

test client still seems to drop under continuous background I/O.<br>

So I would like to ask what this setting on the MGS node actually does.<br>
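<br>
One way to check whether the MGS-side setting reached a client is to look at grant_shrink on that client after mounting. A minimal sketch, assuming "lctl get_param osc.*.grant_shrink" prints lines in the same "name=value" form as the cur_grant_bytes output quoted in this thread (the helper name grant_shrink_enabled is made up for illustration):<br>
<pre>
```shell
# Hypothetical helper: list OSC parameters whose grant_shrink is still
# enabled (any value other than 0). Reads "name=value" lines on stdin.
grant_shrink_enabled() {
    awk -F= '$1 ~ /\.grant_shrink$/ && $2 != 0 { print $1 }'
}

# On a client, assuming lctl is in PATH:
#   lctl get_param osc.*.grant_shrink | grant_shrink_enabled
```
</pre>
Empty output would mean every OSC on that client already has grant_shrink=0; any name printed is one where the setting did not take effect.<br>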

<br>

Thank you very much.<br>

<br>

<br>

T.H.Hsieh<br>

<br>

On Fri, Oct 30, 2020 at 01:37:01PM +0800, Tung-Han Hsieh wrote:<br>

> Dear Simon,<br>

> <br>

> Thank you very much for your useful information. Now we are arranging<br>

> the system maintenance date in order to upgrade to Lustre-2.12.5. Then<br>

> we will follow your suggestion to see whether this problem could be<br>

> fixed.<br>

> <br>

> Here I report a test of how cur_grant_bytes changed over time under<br>

> continuous I/O. Again, the client ran the following script for<br>

> continuous reading in the background:<br>

> <br>

>     # The Lustre file system was mounted under /home<br>

>     while [ 1 ]; do<br>

>         tar cf - /home/large/data | ssh remote_host "cat > /dev/null"<br>

>     done<br>

> <br>

> And every 20 minutes, on the same client, we copied a 600MB file from one<br>

> directory to another within Lustre, and checked "cur_grant_bytes" with<br>

> the following command run on the same client:<br>

> <br>

>     /opt/lustre/sbin/lctl get_param osc.*.cur_grant_bytes<br>

> <br>

> The result is (every line separated by around 20 mins):<br>

> <br>

> osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=1880752127<br>

> osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=1410564096<br>

> osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=1059201024<br>

> osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=794400768<br>

> osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=595800576<br>

> osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=446850432<br>

> osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=335137824<br>

> osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=251353368<br>

> osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=188515026<br>

> osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=141386270<br>

> osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=106039703<br>

> osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=79529778<br>

> osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=59647334<br>

> osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=44735501<br>

> osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=33551626<br>

> osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=25163720<br>

> osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=18872790<br>

> osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=14154593<br>

> osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=10615945<br>

> osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=7961959<br>

> osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=5971470<br>

> osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=4478603<br>

> osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=3358953<br>

> osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=2519215<br>

> osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=1889412<br>

> osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=1417059<br>

> osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=1062795<br>

> osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=797097<br>

> osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=797097<br>

> ....<br>

> <br>

> The value 797097 seems to be the minimum. When it dropped to 1062795,<br>

> the time of cp dramatically increased from around 1 sec to 1 min. In<br>

> addition, during the test, the cluster was otherwise completely idle. And<br>

> this test clearly did not saturate either the network<br>

> or the MDT / OST hardware (they had almost no load).<br>
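> <br>
> The samples listed above shrink by a nearly constant factor between readings. A minimal sketch to compute that factor from "name=value" lines such as the lctl get_param output above (the helper name grant_ratios is made up for illustration):<br>
<pre>
```shell
# Hypothetical helper: for "name=value" samples on stdin, print each value
# after the first together with its ratio to the previous sample.
grant_ratios() {
    awk -F= 'NR > 1 && prev > 0 { printf "%s %.3f\n", $2, $2 / prev } { prev = $2 }'
}

# Example with the first three samples from the list above:
#   printf '%s\n' x=1880752127 x=1410564096 x=1059201024 | grant_ratios
```
</pre>
> For the values listed, each reading is about 0.75 of the one before it, so the grant loses roughly a quarter every 20 minutes until it settles at 797097.<br>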

> <br>

> I am wondering whether this could be a bug to report to the development<br>

> team.<br>

> <br>

> Best Regards,<br>

> <br>

> T.H.Hsieh<br>

> <br>

> On Thu, Oct 29, 2020 at 09:49:42AM -0400, Simon Guilbault wrote:<br>

> > Our current workaround is to use the following command on the MGS with<br>

> > Lustre 2.12.5, which includes the patches in LU-12651 and LU-12759 (we were<br>

> > using a patched 2.12.4 a few months ago):<br>

> > lctl set_param -P osc.*.grant_shrink=0<br>

> > <br>

> > We could not find the root cause of the underlying problem. Dynamic grant<br>

> > shrinking seems to be useful when the OSTs are running out of free space.<br>

> > <br>

> > On Wed, Oct 28, 2020 at 11:47 PM Tung-Han Hsieh <<br>

> > <a href="mailto:thhsieh@twcp1.phys.ntu.edu.tw" target="_blank">thhsieh@twcp1.phys.ntu.edu.tw</a>> wrote:<br>

> > <br>

> > > Dear Simon,<br>

> > ><br>

> > > Thank you very much for your hint. Yes, you are right. We compared<br>

> > > the grant size of two clients by running the following on each client:<br>

> > ><br>

> > >         lctl get_param osc.*.cur_grant_bytes<br>

> > ><br>

> > > - Client A: It has run the following large data transfer for over 36 hrs.<br>

> > ><br>

> > >         while [ 1 ]; do<br>

> > >             tar cf - /home/large/data | ssh remote_host "cat > /dev/null"<br>

> > >         done<br>

> > ><br>

> > >   The value of "cur_grant_bytes" is 796134.<br>

> > ><br>

> > > - Client B: It is almost idling during the action of Client A.<br>

> > ><br>

> > >   The value of "cur_grant_bytes" is 1715863552.<br>

> > ><br>

> > > If this is what seriously hurt the I/O performance of Client A,<br>

> > > is it possible to keep it at a constant value, at least for the head<br>

> > > node (since the head node is the one most likely to carry large, long-running<br>

> > > data I/O for the whole cluster, especially in a data center)?<br>

> > ><br>

> > > I would also like to ask: why does this value have to be dynamically adjusted?<br>

> > ><br>

> > > Thank you very much for your comment in advance.<br>

> > ><br>

> > > Best Regards,<br>

> > ><br>

> > > T.H.Hsieh<br>

> > ><br>

> > > On Wed, Oct 28, 2020 at 02:00:21PM -0400, Simon Guilbault wrote:<br>

> > > > Hi, we had a similar performance problem on our login/DTN nodes a few<br>

> > > > months ago. The problem was that the grant size kept shrinking and got<br>

> > > > stuck under 1MB. Once under 1MB, the client had to send every request to<br>

> > > > the OST using sync IO.<br>

> > > ><br>

> > > > Check the output of the following command:<br>

> > > > lctl get_param osc.*.cur_grant_bytes<br>
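> > > > <br>
> > > > A minimal sketch for picking out affected OSCs from that output. The helper name low_grant_oscs is made up for illustration, and the default limit of 1048576 bytes is only an assumed reading of "1MB":<br>
<pre>
```shell
# Hypothetical helper: print OSC grant parameters whose value is below a
# limit in bytes (default 1048576, an assumed reading of "1MB").
low_grant_oscs() {
    awk -F= -v limit="${1:-1048576}" \
        '$1 ~ /\.cur_grant_bytes$/ && $2 + 0 < limit + 0 { print $1, $2 }'
}

# On a client, assuming lctl is in PATH:
#   lctl get_param osc.*.cur_grant_bytes | low_grant_oscs
```
</pre>
> > > > The later measurements in this thread report the slowdown already at 1062795 bytes, so passing a somewhat higher limit (for example "low_grant_oscs 2097152") may be the safer check.<br>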

> > > ><br>

> > > > On Wed, Oct 28, 2020 at 12:08 AM Tung-Han Hsieh <<br>

> > > > <a href="mailto:thhsieh@twcp1.phys.ntu.edu.tw" target="_blank">thhsieh@twcp1.phys.ntu.edu.tw</a>> wrote:<br>

> > > ><br>

> > > > > Dear All,<br>

> > > > ><br>

> > > > > Sorry, I am not sure whether this mail was successfully posted to<br>

> > > > > the lustre-discuss mailing list, so I am resending it. Please<br>

> > > > > ignore it if you have already read it.<br>

> > > > ><br>

> > > > ><br>

> > > ===========================================================================<br>

> > > > ><br>

> > > > > Dear Andreas,<br>

> > > > ><br>

> > > > > Thank you very much for your kind suggestions. These days I got a<br>

> > > > > chance to follow your suggestions for the test. This email is to report<br>

> > > > > the results I have so far. What I have done is:<br>

> > > > ><br>

> > > > > 1. Upgrade one client (with Infiniband) to Lustre 2.13.56_44_gf8a8d3f<br>

> > > > >    (obtained from github). The compiling information is:<br>

> > > > ><br>

> > > > >    - Linux kernel 4.19.123.<br>

> > > > >    - Infiniband MLNX_OFED_SRC-4.6-1.0.1.1.<br>

> > > > >    - ./configure --prefix=/opt/lustre \<br>

> > > > >                  --with-o2ib=/path/of/mlnx-ofed-kernel-4.6 \<br>

> > > > >                  --disable-server --enable-mpitests=no<br>

> > > > >    - make<br>

> > > > >    - make install<br>

> > > > ><br>

> > > > > 2. We mounted the lustre file system (lustre MDT/OST servers: version<br>

> > > > >    2.12.4 with Infiniband with ZFS backend) by this command:<br>

> > > > ><br>

> > > > >    - mount -t lustre -o flock mdt@o2ib:/chome /home<br>

> > > > ><br>

> > > > > 3. The script to simulate large data transfer is the following<br>

> > > > >    (the directory "/home/large/data" contains 758 files, each 600MB in size):<br>

> > > > ><br>

> > > > >    while [ 1 ]; do<br>

> > > > >        tar cf - /home/large/data | ssh remote_host "cat > /dev/null"<br>

> > > > >    done<br>

> > > > ><br>

> > > > >    ps. Note that this scenario is common in a large data center, where<br>

> > > > >        some users transfer large data out of the data center through<br>

> > > > >        the head node, while other users might copy files and do their<br>

> > > > >        normal work on the same head node.<br>

> > > > ><br>

> > > > > 4. During the data transfer in the background, I occasionally ran this<br>

> > > > >    command on the same client to test whether there was any abnormality<br>

> > > > >    in I/O performance (where /home/dir1/file has size 600MB):<br>

> > > > ><br>

> > > > >    cp /home/dir1/file /home/dir2/<br>

> > > > ><br>

> > > > >    In the beginning this command can complete in about 1 sec. But after<br>

> > > > >    around 18 hours (not exactly, because the test ran overnight while<br>

> > > > >    I was sleeping), the problem appeared. The time to complete the same<br>

> > > > >    cp command was more than 1 minute.<br>

> > > > ><br>

> > > > >    During the test, I am sure that the whole cluster was idle. The MDT<br>

> > > > >    and OST servers did not have any other load. The CPU usage of the test<br>

> > > > >    client was below 0.3.<br>

> > > > ><br>

> > > > >    Then I stopped the test, and let the whole system sit completely idle. But<br>

> > > > >    after 3 hours, the I/O abnormality of the same "cp" command was still<br>

> > > > >    there. Only after I unmounted and remounted /home did the<br>

> > > > >    "cp" performance return to normal.<br>

> > > > ><br>

> > > > > Before and after remounting /home (which I call "reset"), I did the<br>

> > > > > following tests:<br>

> > > > ><br>

> > > > > 1. Using "top" to check the memory usage:<br>

> > > > ><br>

> > > > > Before reset:<br>

> > > > > =====================================<br>

> > > > > top - 10:43:15 up 35 days, 52 min,  3 users,  load average: 0.00, 0.00, 0.00<br>

> > > > > Tasks: 404 total,   1 running, 162 sleeping,   0 stopped,   0 zombie<br>

> > > > > %Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st<br>

> > > > > KiB Mem : 13232632+total, 13000131+free,   647784 used,  1677220 buff/cache<br>

> > > > > KiB Swap: 15631240 total, 15631240 free,        0 used. 13076376+avail Mem<br>

> > > > ><br>

> > > > > After reset:<br>

> > > > > =====================================<br>

> > > > > top - 10:48:02 up 35 days, 57 min,  3 users,  load average: 0.04, 0.01, 0.00<br>

> > > > > Tasks: 395 total,   1 running, 159 sleeping,   0 stopped,   0 zombie<br>

> > > > > %Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st<br>

> > > > > KiB Mem : 13232632+total, 12946539+free,   675948 used,  2184976 buff/cache<br>

> > > > > KiB Swap: 15631240 total, 15631240 free,        0 used. 13073571+avail Mem<br>

> > > > ><br>

> > > > >    It seems that most of the memory was in the "free" state. The amount of<br>

> > > > >    hidden memory was negligible. So I did not further investigate the<br>

> > > > >    amount of slab memory.<br>

> > > > ><br>

> > > > > 2. Using "strace" with the following commands:<br>

> > > > ><br>

> > > > >    - Before reset (took 1 min of each cp):<br>

> > > > >      strace -c -o /tmp/log2-err.txt cp /home/dir1/file /home/dir2/<br>

> > > > ><br>

> > > > >    - After reset (took 1 sec of each cp):<br>

> > > > >      strace -c -o /tmp/log2-reset.txt cp /home/dir1/file /home/dir2/<br>

> > > > ><br>

> > > > >    From the log files, most of the time was spent in the read and write<br>

> > > > >    syscalls. The others are negligible.<br>

> > > > ><br>

> > > > >    % time     seconds  usecs/call     calls    errors syscall<br>

> > > > >    ------ ----------- ----------- --------- --------- ----------------<br>

> > > > >    (Before reset)<br>

> > > > >     71.46    0.278424        1920       145           write<br>

> > > > >     28.06    0.109322         705       155           read<br>

> > > > >    (After reset)<br>

> > > > >     52.92    0.299091        2063       145           write<br>

> > > > >     46.85    0.264777        1708       155           read<br>

> > > > ><br>

> > > > >    Before reset, since we had run the cp test on the same file a<br>

> > > > >    few times, the file was already cached. So the read time is<br>

> > > > >    smaller before reset than after reset (since after reset /home<br>

> > > > >    was remounted).<br>

> > > > ><br>

> > > > >    Hence from this result, the syscall times look normal. The<br>

> > > > >    performance drop seems to occur elsewhere.<br>
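> > > > ><br>
> > > > >    A minimal sketch for totalling the "seconds" column of such a summary, so it can be compared with the elapsed time of the cp (the helper name strace_total_seconds is made up for illustration). For the "before reset" rows above the total is well under one second, far less than the one minute the cp took. That would fit strace -c summarising system time by default rather than elapsed time; its -w option is documented to switch the summary to wall-clock time, which is worth confirming against the local man page.<br>
<pre>
```shell
# Hypothetical helper: sum the "seconds" column of "strace -c" summary rows
# read from stdin, skipping the header, separator and "total" lines.
strace_total_seconds() {
    awk '$1 ~ /^[0-9.]+$/ && $2 ~ /^[0-9.]+$/ && $NF != "total" { sum += $2 }
         END { printf "%.6f\n", sum }'
}

# Example: strace_total_seconds < /tmp/log2-err.txt
```
</pre>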

> > > > ><br>

> > > > > I have not yet investigated the Lustre kernel debug log by enabling<br>

> > > > > Lustre debug=-1. We will find another chance to do it.<br>

> > > > ><br>

> > > > > Up to now, any comments or suggestions are very welcome.<br>

> > > > ><br>

> > > > > Thanks for your help in advance.<br>

> > > > ><br>

> > > > ><br>

> > > > > Best Regards,<br>

> > > > ><br>

> > > > > T.H.Hsieh<br>

> > > > ><br>

> > > > ><br>

> > > > > On Thu, Oct 08, 2020 at 01:32:53PM -0600, Andreas Dilger wrote:<br>

> > > > > > On Oct 8, 2020, at 10:37 AM, Tung-Han Hsieh <<br>

> > > > > <a href="mailto:thhsieh@twcp1.phys.ntu.edu.tw" target="_blank">thhsieh@twcp1.phys.ntu.edu.tw</a>> wrote:<br>

> > > > > > ><br>

> > > > > > > Dear All,<br>

> > > > > > ><br>

> > > > > > > In the past months, we have several times seen Lustre I/O slow down<br>

> > > > > > > abnormally. It is quite mysterious, since there seems to be no problem with<br>

> > > > > > > the network hardware, nor with Lustre itself: there is no error message<br>

> > > > > > > at all on the MDT/OST/client sides.<br>

> > > > > > ><br>

> > > > > > > Recently we probably found a way to reproduce it, and now have some<br>

> > > > > > > suspicions. We found that if we continuously perform I/O on a client<br>

> > > > > > > without stopping, then after some time threshold (probably more than 24<br>

> > > > > > > hours), the bandwidth for additional file I/O on that client shrinks<br>

> > > > > > > dramatically.<br>

> > > > > > ><br>

> > > > > > > Our configuration is the following:<br>

> > > > > > > - One MDT and one OST server, based on ZFS + Lustre-2.12.4.<br>

> > > > > > > - The OST is served by a RAID 5 system with 15 SAS hard disks.<br>

> > > > > > > - Some clients connect to MDT/OST through Infiniband, some through<br>

> > > > > > >  gigabit ethernet.<br>

> > > > > > ><br>

> > > > > > > Our test focused on the clients using Infiniband, and is described<br>

> > > > > > > below:<br>

> > > > > > ><br>

> > > > > > > We have a huge amount (several TB) of data stored in the Lustre file<br>

> > > > > > > system to be transferred to an outside network. In order not to exhaust<br>

> > > > > > > the network bandwidth of our institute, we transfer the data with limited<br>

> > > > > > > bandwidth via the following command:<br>

> > > > > > ><br>

> > > > > > > rsync -av --bwlimit=1000 <data_in_Lustre> <out_side_server>:/<out_side_path>/<br>

> > > > > > ><br>

> > > > > > > That is, the transfer rate is 1 MB per second, which is relatively<br>

> > > > > > > low. The client reads the data from Lustre through Infiniband. So during<br>

> > > > > > > data transmission, presumably there is no problem doing other data I/O<br>

> > > > > > > on the same client. On average, copying a 600 MB file from one directory<br>

> > > > > > > to another (both in the same Lustre file system) took about<br>

> > > > > > > 1.0 - 2.0 secs, even while the rsync process was still running.<br>

> > > > > > ><br>

> > > > > > > But after about 24 hours of continuously sending data via rsync, the<br>

> > > > > > > additional I/O on the same client shrank dramatically. When that happens,<br>

> > > > > > > it takes more than 1 minute to copy a 600 MB file from one place to another<br>

> > > > > > > (both in the same Lustre) while rsync is still running.<br>

> > > > > > ><br>

> > > > > > > Then we stopped the rsync process and waited for a while (about one<br>

> > > > > > > hour). The I/O performance of copying that 600 MB file returned to normal.<br>

> > > > > > ><br>

> > > > > > > Based on this observation, we suspect that there may be a<br>

> > > > > > > hidden QoS mechanism built into Lustre. When a process occupies the I/O<br>

> > > > > > > bandwidth for a long time and exceeds some limit, does Lustre automatically<br>

> > > > > > > shrink the I/O bandwidth for all processes running on the same client?<br>

> > > > > > ><br>

> > > > > > > I am not against such a QoS design, if it does exist. But the amount of<br>

> > > > > > > shrinking seems to be too large for Infiniband (QDR and above). So<br>

> > > > > > > I further wonder whether this is because our system mixes<br>

> > > > > > > clients that have Infiniband with clients that do not?<br>

> > > > > > ><br>

> > > > > > > Could anyone help to fix this problem? Any suggestions will be very much<br>

> > > > > > > appreciated.<br>

> > > > > ><br>

> > > > > > There is no "hidden QOS", unless it is so well hidden that I don't know<br>

> > > > > > about it.<br>

> > > > > ><br>

> > > > > > You could investigate several different things to isolate the problem:<br>

> > > > > > - try with a 2.13.56 client to see if the problem is already fixed<br>

> > > > > > - check if the client is using a lot of CPU when it becomes slow<br>

> > > > > > - run strace on your copy process to see which syscalls are slow<br>

> > > > > > - check memory/slab usage<br>

> > > > > > - enable Lustre debug=-1 and dump the kernel debug log to see where<br>

> > > > > >   the process is taking a long time to complete a request<br>

> > > > > ><br>

> > > > > > It is definitely possible that there is some kind of problem, since this<br>

> > > > > > is not a very common workload to be continuously writing to the same file<br>

> > > > > > descriptor for over a day.  You'll have to do the investigation on your<br>

> > > > > > system to isolate the source of the problem.<br>

> > > > > ><br>

> > > > > > Cheers, Andreas<br>

> > > > > ><br>

> > > > > ><br>

> > > > > ><br>

> > > > > ><br>

> > > > > ><br>

> > > > ><br>

> > > > ><br>

> > > > > _______________________________________________<br>

> > > > > lustre-discuss mailing list<br>

> > > > > <a href="mailto:lustre-discuss@lists.lustre.org" target="_blank">lustre-discuss@lists.lustre.org</a><br>

> > > > > <a href="http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org" rel="noreferrer" target="_blank">http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org</a><br>

> > > > ><br>

> > ><br>

</blockquote></div>