[Lustre-discuss] Optimize parallel compilation on lustre

Andreas Dilger andreas.dilger at oracle.com
Mon Jun 28 13:00:38 PDT 2010


On 2010-06-28, at 10:04, Maxence Dunnewind wrote:
>> If you are interested to do a tiny bit of hacking, it would be interesting to do an experiment to see what kind of performance can be gotten in your benchmark by a single client.  Currently, Lustre limits each client to a single filesystem-modifying metadata operation at one time, in order to prevent the clients from overwhelming the server, and to ensure that the clients can recover the filesystem correctly in case of a server crash.
> 
> I just tested this. Before, I tried to do an out-of-tree build. My four clients
> are using nfsroot, so I put the kernel source on it, then I mount Lustre on
> /mnt/lustre, and I compile in /mnt/lustre/build (with make O=). The results
> (without your patch) are interesting:
> 7 min 42 against 9 min 37 before, with -j 4
> 4 min 51 against 5 min 34 with -j 8
> 3 min 27 against 4 min 19 with -j 16
> I also use -pipe as a gcc option, to avoid temp files.

I was actually thinking of keeping the source tree on Lustre as well, just not building the output files in the same directory as the input files.  It isn't clear from this result whether the speedup was due to having the input files in a separate directory (i.e. lock contention), or whether it was because you had a second server hosting the input files (i.e. RPC limitation of the server).
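To make the distinction concrete, here is a sketch of the setup I had in mind, with both the input files and the output directory on the same Lustre filesystem. The paths are illustrative, not the ones from your test:

```shell
# Sketch: keep both the source tree and the build output on Lustre,
# but in separate directories (paths here are assumptions).

# Copy the kernel source from the NFS root onto the Lustre mount:
cp -a ~/linux-2.6 /mnt/lustre/src/linux

# Build out of tree with make O=, so the output files (and the
# directory locks they take) are separate from the input files:
cd /mnt/lustre/src/linux
make O=/mnt/lustre/build defconfig
make O=/mnt/lustre/build -j8
```

This keeps everything on one server, so any remaining speedup would be attributable to reduced lock contention rather than to a second server absorbing part of the RPC load.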

> So, my first question is: would it be possible in some way to disable cache coherency on some subdirectory? If I know all the files in this directory will be accessed read-only, I do not need coherency. It would allow reading the files from Lustre instead of NFS.

I don't think this would be practical to do for many years.

> I then tried with your patch; not much difference:
> 
> 4 min 43 against 4 min 51 without it (-j 8)

Ah, this number is with a separate server for the input files.  It might be more interesting to see if it made a difference with the files all hosted on the same server.

> 7min 40 against 7 min 42 with -j 8

This should be "-j 4" to match the above numbers.

> So it changes almost nothing :)

That implies that the MDS modifying RPCs are not necessarily the bottleneck here.

>> I'm not sure if it makes a difference in your case or not, but increasing the MDC RPCs in flight might also help performance.  Also, increasing the client cache size and the number of IO RPCs may also help.  On the clients run:
>> 
>> lctl set_param *.*.max_rpcs_in_flight=64
>> lctl set_param osc.*.max_dirty_mb=512
> no change

Hmm, I'd thought that allowing more of the output files to be cached on the clients might reduce the compilation time, but that doesn't seem to be the bottleneck either.
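One thing worth double-checking is that the set_param commands actually took effect on all four clients. The corresponding get_param calls would look something like this (exact parameter names can vary slightly between Lustre versions):

```shell
# Verify the tunables on each client (a sketch; names may differ
# slightly depending on the Lustre version):
lctl get_param mdc.*.max_rpcs_in_flight
lctl get_param osc.*.max_rpcs_in_flight
lctl get_param osc.*.max_dirty_mb
```

Note that these settings are not persistent across a remount, so they would need to be reapplied (or set permanently on the MGS) between test runs.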

Did you try pre-reading all of the input files on the clients to see if eliminating the small-file reads was a source of improvement?
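A minimal way to do that pre-read on each client might be the following; SRC is an assumed path, so adjust it to wherever the source tree lives on the Lustre mount:

```shell
# Warm the client page cache by reading every input file once (sketch).
# SRC is an assumed location for the source tree; override as needed.
SRC="${SRC:-/mnt/lustre/src/linux}"

# Read all regular files and discard the data; after this, the
# compile should hit the local cache instead of issuing small reads.
find "$SRC" -type f -print0 | xargs -0 cat > /dev/null
```

If the build gets noticeably faster after this warm-up, the small-file read RPCs were a significant part of the cost.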

> I will try directly on the mds (so on only one node) to compare.


I look forward to your results.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.



