[Lustre-discuss] help tracking down extremely high loads on OSSs

John White jwhite at lbl.gov
Mon Oct 18 11:59:14 PDT 2010


We've thoroughly examined the back-end storage and the connections between the OSSs and back-end.  There are no faults as of now.  Previously our couplet had lost cache sync, but that's since been resolved and the load issue remains.
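
One way to test the I/O-latency theory quoted below: on Linux the load average counts tasks in uninterruptible sleep (state D) as well as runnable ones, so service threads blocked on slow I/O raise the load without using CPU. The following is a minimal sketch, not something that was run for this thread, that tallies thread states from /proc on an OSS. The helper names (task_state, count_states, read_all_stats) are made up for illustration; the only assumption about the system is the standard /proc/<pid>/task/<tid>/stat layout.

```python
from collections import Counter
from pathlib import Path


def task_state(stat_text):
    """Return the one-letter state from the contents of a /proc/.../stat file."""
    # Field 2 (comm) is wrapped in parentheses and may itself contain spaces
    # or ')', so take the first token after the LAST ')'.
    return stat_text[stat_text.rindex(")") + 1:].split()[0]


def count_states(stat_texts):
    """Tally task states (R, S, D, ...) over an iterable of stat contents."""
    return Counter(task_state(text) for text in stat_texts)


def read_all_stats(proc=Path("/proc")):
    """Yield the stat contents of every thread, kernel threads included."""
    for stat_file in proc.glob("[0-9]*/task/[0-9]*/stat"):
        try:
            yield stat_file.read_text(errors="replace")
        except OSError:
            continue  # the thread exited while we were scanning


if __name__ == "__main__":
    states = count_states(read_all_stats())
    # A D count that tracks the load average suggests threads stuck waiting
    # on I/O; a high R count points at CPU contention instead.
    print("running (R):", states["R"])
    print("uninterruptible (D):", states["D"])
```

If the D count is close to the reported load average while the CPUs are mostly idle, the load reflects threads waiting on the I/O path rather than compute work.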


On Oct 18, 2010, at 10:43 AM, Paul Nowoczynski wrote:

> I wonder if there's some type of fault in the I/O path which is increasing the latency of individual I/Os?  Something like this could affect the load especially when considering the number of kernel threads on the OST.
> paul
> 
> John White wrote:
>> Hello Folks,
>> 	A while back (say 3 weeks ago) we started noticing extremely high loads (load avg around 300 at times) on our OSSs when in production and serving IO.  This cluster was, at the time, on 1.8.2 (we have since upgraded to 1.8.4 but the problem remains).  The load increases fairly predictably as clients generate IO but even 2 clients can produce a load avg above 5.00.  An identical file system of ours does not exhibit this behavior (sticks below load avg 1.00 under even the heaviest IO load).  I've looked around bugzilla and haven't found anything.  We've disabled heartbeat on the off-chance that was generating the load (it's not), and we've tried a different client transport (o2ib->tcp), which did not solve the issue.  There doesn't appear to be any specific non-kernel thread causing the high load.  The only info in dmesg/syslog pertains to sporadic client evictions or sporadic slow setattr due to heavy IO load (we've since tuned the number of OST threads).  We're basically out of ideas to try.
>> 
>> As reference, this is a 1 MDS/4 OSS cluster backed by a DDN 9900 couplet (15 tiers, 1:1 lun mapping) running the lustre.org rpm build kernel for 1.8.4.  The MDS/OSSs are Dell R710s and the MDT is a Dell MD1000.  Is this a common problem or should a bug be filed?  Any info available upon request.  Thanks for your time.
>> ----------------
>> John White
>> High Performance Computing Services (HPCS)
>> (510) 486-7307
>> One Cyclotron Rd, MS: 50B-3209C
>> Lawrence Berkeley National Lab
>> Berkeley, CA 94720
>> 
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>  
> 
> 

----------------
John White
High Performance Computing Services (HPCS)
(510) 486-7307
One Cyclotron Rd, MS: 50B-3209C
Lawrence Berkeley National Lab
Berkeley, CA 94720



