[Lustre-discuss] OST load distribution

Wed May 8 08:08:12 PDT 2013

I've seen issues like this when a user ran lfs setstripe -i 0 on their directory when they really wanted lfs setstripe -i -1.  An index of 0 makes every new file in that directory start on OST index 0 (OST0000), whereas -1, the default, lets the MDS choose the starting OST.  It could be that one of your users is creating ALL their files starting on OST0000, making it busier than the rest.  For files striped across more than one OST, the successive stripes would be placed elsewhere on the file system.
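
A rough sketch of how one might check a suspect directory and reset its starting index. The path in the usage comment is hypothetical, and the LFS variable is only a convenience I've added so the lfs binary can be overridden; the lfs getstripe -d and lfs setstripe -i calls themselves are the standard ones.

```shell
# Path to the lfs utility (assumption: it is on PATH on a Lustre client).
LFS=${LFS:-lfs}

# Print the default striping of a directory without listing its files.
# A stripe_offset of 0 in the output would mean every new file created
# there starts on OST index 0; -1 means the MDS picks the starting OST.
show_default_stripe() {
    "$LFS" getstripe -d "$1"
}

# Set the directory's starting OST index back to -1 (MDS chooses).
# This only affects files created afterwards; existing files stay
# on the OSTs where they were originally allocated.
reset_stripe_index() {
    "$LFS" setstripe -i -1 "$1"
}

# Usage (hypothetical directory name):
#   show_default_stripe /lustre/some_user_dir
#   reset_stripe_index  /lustre/some_user_dir
```

If the directory also has a non-default stripe count or stripe size you want to keep, it is probably safest to pass those explicitly on the same setstripe command rather than assume they carry over.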

-Marc

----
D. Marc Stearman
Lustre Operations Lead
stearman2 at llnl.gov
925.423.9670

On May 8, 2013, at 6:12 AM, Jure Pečar <pegasus at nerv.eu.org> wrote:

> 
> Hello,
> 
> I have a lustre 2.2 environment which looks like this:
> 
> # lfs df -h
> UUID                       bytes        Used   Available Use% Mounted on
> lustre22-MDT0000_UUID      95.0G        9.4G       79.3G  11% /lustre[MDT:0]
> lustre22-OST0000_UUID       5.5T        2.1T        3.3T  39% /lustre[OST:0]
> lustre22-OST0001_UUID       5.5T        1.2T        4.3T  22% /lustre[OST:1]
> lustre22-OST0002_UUID       5.5T     1016.0G        4.5T  18% /lustre[OST:2]
> lustre22-OST0003_UUID       5.5T      948.3G        4.5T  17% /lustre[OST:3]
> lustre22-OST0004_UUID       5.5T      812.3G        4.7T  15% /lustre[OST:4]
> lustre22-OST0005_UUID       5.5T      641.4G        4.8T  11% /lustre[OST:5]
> lustre22-OST0006_UUID       5.5T      619.4G        4.8T  11% /lustre[OST:6]
> lustre22-OST0007_UUID       5.5T      587.0G        4.9T  11% /lustre[OST:7]
> lustre22-OST0008_UUID       5.5T      539.7G        4.9T  10% /lustre[OST:8]
> OST0009             : inactive device
> lustre22-OST000a_UUID       5.5T      531.3G        4.9T  10% /lustre[OST:10]
> lustre22-OST000b_UUID       5.5T      488.9G        5.0T   9% /lustre[OST:11]
> lustre22-OST000c_UUID       5.5T      451.2G        5.0T   8% /lustre[OST:12]
> lustre22-OST000d_UUID       5.5T      450.1G        5.0T   8% /lustre[OST:13]
> lustre22-OST000e_UUID       5.5T      448.8G        5.0T   8% /lustre[OST:14]
> lustre22-OST000f_UUID       5.5T      444.0G        5.0T   8% /lustre[OST:15]
> lustre22-OST0010_UUID       5.5T      422.5G        5.0T   8% /lustre[OST:16]
> lustre22-OST0011_UUID       5.5T      414.5G        5.0T   7% /lustre[OST:17]
> lustre22-OST0012_UUID       5.5T      406.9G        5.1T   7% /lustre[OST:18]
> OST0013             : inactive device
> 
> Reading through the documentation, I see that Lustre should prefer the OSTs with the most free disk space (qos_prio_free is set to 91%). However, my monitoring tells me that OST0000 is by far the most loaded, with a loadavg over 300 and network traffic 3-5x higher than the rest.
> 
> I raised qos_threshold_rr to 55% and am waiting to see the results. Right now I have clients reading and writing to this fs at around 600MB/s aggregated, generating hundreds of files per job.
> 
> How soon should I expect to see results?
> 
> What else can I do to spread the load from OST0000 evenly among the other OSTs?
> 
> 
> -- 
> 
> Jure Pečar
> http://jure.pecar.org