[lustre-discuss] sudden read performance drop on sequential forward read.

Vicker, Darby (JSC-EG311) darby.vicker-1 at nasa.gov
Thu Aug 31 18:24:48 PDT 2017


This sounds exactly like what we ran into when we upgraded to 2.9 (and it is still present in 2.10).  See these:

http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2017-May/014524.html
https://jira.hpdd.intel.com/browse/LU-9574

The mailing list thread describes our problem in a little more detail and gives a workaround (reverting a specific commit that seems to cause the problem).  In the LU, Jinshan uploaded a patch about a week and a half ago that seems to fix this for us.  It would be good to know if it helps your situation too.

From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of John Bauer <bauerj at iodoctors.com>
Date: Thursday, August 31, 2017 at 7:52 PM
To: "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
Subject: [lustre-discuss] sudden read performance drop on sequential forward read.

All,

I have an application that writes a 100 GB file forwards and then begins a sequence of reading a 70 GB section of that file forwards and backwards. At some point in the run, though not always at the same point, the read performance degrades dramatically. The initial forward reads run at about 1.3 GB/s and the backward reads at about 300 MB/s; then, in an instant, the forward read performance drops to 2.8 MB/s. From about 250 seconds on, this is the only file being read or written by the application, which runs on a dedicated client node.

The file has a stripe count of 4 and a stripe size of 512 KB. If the stripe count is changed to 1, this behavior does not present itself. CPU usage is minimal during the period of degraded performance, and the LNET traffic is also about 2.8 MB/s during that period. The client node has 64 GB of memory, so Lustre cannot cache the entire 70 GB active set of the file being read. The Lustre client version is 2.9.0.
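
For reference, the access pattern reduces to roughly the sketch below (sizes scaled way down so it runs anywhere; the path and chunk counts are placeholders, not our actual setup):

#!/usr/bin/env python3
# Rough sketch of the access pattern: write a large file forwards, then
# read a section of it forwards and then backwards in 512 KB requests.
# PATH and the scaled-down sizes are placeholders for illustration only.
import os, time

CHUNK = 512 * 1024              # 512 KB requests, as in the real run
FILE_SIZE = 200 * CHUNK         # stands in for the 100 GB file
SECTION = 140 * CHUNK           # stands in for the 70 GB section
PATH = "/mnt/lustre/testfile"   # assumed mount point and file name

buf = os.urandom(CHUNK)

# Forward write of the whole file.
with open(PATH, "wb") as f:
    for _ in range(FILE_SIZE // CHUNK):
        f.write(buf)

# One forward pass and one backward pass over the section, timing each
# pass so a sudden throughput drop is visible.
with open(PATH, "rb") as f:
    t0 = time.time()
    for off in range(0, SECTION, CHUNK):
        f.seek(off)
        f.read(CHUNK)
    t1 = time.time()
    for off in range(SECTION - CHUNK, -CHUNK, -CHUNK):
        f.seek(off)
        f.read(CHUNK)
    t2 = time.time()

print("forward:  %.1f MB/s" % (SECTION / (t1 - t0) / 2**20))
print("backward: %.1f MB/s" % (SECTION / (t2 - t1) / 2**20))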

Any ideas what could be causing this?  What should I be watching under /proc/fs/lustre to find some clues?
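
So far the only thing I have thought to do is poll the client-side counters with something like the loop below. It uses lctl get_param; the llite and osc parameter names are my best guess at what is relevant and may vary by Lustre version:

#!/usr/bin/env python3
# Periodically dump client-side stats that might show readahead or RPC
# trouble. Uses `lctl get_param` so it works wherever the proc entries
# actually live; the parameter names are assumptions, not confirmed.
import subprocess, time

PARAMS = [
    "llite.*.read_ahead_stats",   # readahead hit/miss counters
    "osc.*.rpc_stats",            # per-OST RPC size histograms
]

while True:
    for p in PARAMS:
        print(subprocess.run(["lctl", "get_param", p],
                             capture_output=True, text=True).stdout)
    time.sleep(10)                # one sample every 10 seconds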

The behavior is depicted in the image below, which shows file position as a function of wall clock time.  The writes and reads are all 512 KB.

Thanks,

John



[Attachment: image001.png, plot of file position vs. wall clock time: http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20170901/9009c9a5/attachment-0001.png]

--

I/O Doctors, LLC

507-766-0378

bauerj at iodoctors.com