[Lustre-discuss] help tracking down extremely high loads on OSSs

John White jwhite at lbl.gov
Mon Oct 18 12:00:53 PDT 2010


Far, far from it.  All OSTs are at most 23% full.  There appear to be no lagging disks.
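
For anyone trying to narrow down a similar load figure, here is a minimal sketch rather than a definitive tool. On Linux the load average counts tasks in uninterruptible sleep (state D) as well as runnable ones, so a very high load with mostly idle CPUs can simply mean many threads are blocked on I/O. The script reads /proc/loadavg and every /proc/<pid>/task/<tid>/stat file, counts tasks per state, and groups the D-state ones by command name with any trailing thread number stripped. The function names and the sample thread names in the usage note are my own illustrations, not anything taken from the systems in this thread.

```python
#!/usr/bin/env python3
"""Summarise load average and task states from /proc (Linux only)."""
import glob
import os
from collections import Counter


def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def parse_stat(line):
    """Return (comm, state) from the contents of a /proc/<pid>/stat file."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    left, right = line.rsplit(")", 1)
    comm = left.split("(", 1)[1]
    state = right.split()[0]
    return comm, state


def summarise(stat_lines):
    """Count tasks per state and, for D-state tasks, per command group."""
    states = Counter()
    blocked = Counter()
    for line in stat_lines:
        comm, state = parse_stat(line)
        states[state] += 1
        if state == "D":
            # Strip a trailing thread number so numbered service threads
            # are counted together (illustrative grouping only).
            blocked[comm.rstrip("_0123456789")] += 1
    return states, blocked


def read_all_stats():
    """Read every thread's stat file; tasks may exit while we scan."""
    lines = []
    for path in glob.glob("/proc/[0-9]*/task/[0-9]*/stat"):
        try:
            with open(path) as handle:
                lines.append(handle.read())
        except OSError:
            continue
    return lines


def main():
    if not os.path.exists("/proc/loadavg"):
        print("/proc/loadavg not found; this sketch assumes Linux")
        return
    with open("/proc/loadavg") as handle:
        print("load averages:", parse_loadavg(handle.read()))
    states, blocked = summarise(read_all_stats())
    print("tasks by state:", dict(states))
    for comm, count in blocked.most_common(10):
        print(f"{count:5d} D-state  {comm}")


if __name__ == "__main__":
    main()
```

Run on an OSS while clients are generating IO, this would show whether the load figure is dominated by D-state kernel threads (for example, hypothetically, a large group of OST IO service threads) rather than by runnable processes. Comparing the output with the well-behaved file system's OSSs under similar IO might help show where the two differ.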

On Oct 18, 2010, at 11:55 AM, Wojciech Turek wrote:

> Is this filesystem nearly full? Fragmentation can decrease back end performance.
>  
> Also check the disk stats on the DDN; maybe you have a slow disk in one of your tiers.
> 
> Wojciech
> 
> On 18 October 2010 18:49, Peter Kjellstrom <cap at nsc.liu.se> wrote:
> On Monday 18 October 2010, John White wrote:
> > Hello Folks,
> >       A while back (say 3 weeks ago) we started noticing extremely high loads
> > (load avg around 300 at times) on our OSSs when in production and serving
> > IO.  This cluster was, at the time, on 1.8.2 (we have since upgraded to
> > 1.8.4 but the problem remains).  The load increases fairly predictably as
> > clients generate IO but even 2 clients can produce a load avg above 5.00.
> 
> Does this impact performance or does it only show up as an unexpectedly high
> number on the OSSes?
> 
> /Peter
> 
> > An identical file system of ours does not exhibit this behavior (sticks
> > below load avg 1.00 under even the heaviest IO load).  I've looked around
> > bugzilla and haven't found anything.  We've disabled heartbeat on the
> > off-chance that it was generating the load (it's not), and we've tried a
> > different client transport (o2ib->tcp), which did not solve the issue.
> > There doesn't appear to be any specific non-kernel thread causing the
> > high load.  The only info in dmesg/syslog pertains to sporadic client
> > evictions or sporadic slow setattr due to heavy IO load (we've since tuned
> > the number of OST threads).  We're basically out of ideas to try.
> >
> > As reference, this is a 1 MDS/4 OSS cluster backed by a DDN 9900 couplet
> > (15 tiers, 1:1 lun mapping) running the lustre.org rpm build kernel for
> > 1.8.4.  The MDS/OSSs are Dell R710s and the MDT is a Dell MD1000.  Is this
> > a common problem or should a bug be filed?  Any info available upon
> > request.  Thanks for your time.
> > ----------------
> > John White
> > High Performance Computing Services (HPCS)
> > (510) 486-7307
> > One Cyclotron Rd, MS: 50B-3209C
> > Lawrence Berkeley National Lab
> > Berkeley, CA 94720
> 
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
> 
> 
> 
> 
> -- 
> Wojciech Turek
> 
> Senior System Architect
> 
> High Performance Computing Service
> University of Cambridge
> Email: wjt27 at cam.ac.uk
> Tel: (+)44 1223 763517 

----------------
John White
High Performance Computing Services (HPCS)
(510) 486-7307
One Cyclotron Rd, MS: 50B-3209C
Lawrence Berkeley National Lab
Berkeley, CA 94720



