[Lustre-discuss] "Slow XXXX" Syslog Messages

Fri Jun 27 03:39:28 PDT 2008

On Fri, Jun 27, 2008 at 06:08:12AM -0400, Charles Taylor wrote:
> I'm curious about all the "slow xxxx" messages on our OSSs.   It is  
> true that the OSSs are getting hammered and the "spindles"  (actually  
> RAID 5 LUNs) are saturated (96% to 100% busy in iostat).   However,  
> await and svctm numbers are *very* reasonable (in my experience).    
> The await times range from 50 to 250 ms and the svctm numbers are  
> usually only a few ms.
> 
> The CPUs are mostly idle or servicing interrupts from the FC HCAs and
> IB HCAs, and the MDT threads are mostly sleeping or in disk wait. The
> answer seems obvious enough, but Lustre is reporting delays (that's
> what I assume the "slow whatevers" are about) on the order of a
> minute, which is 2 to 3 orders of magnitude beyond anything I'm seeing
> in iostat.
> 
> What is Lustre really trying to tell us?    It seems like there may be  
> some race condition that is forcing some of these operations to wait  
> until one or more timers expire before they can proceed.    Device  
> delays alone don't seem to account for it.
> 296:fsfilt_commit_wait()) crn-OST0009: slow journal start 43s

Hello!

We had the same problem with one of our metadata servers (meta1). After we
switched over to its pair (meta2), everything seems to be all right.
By the way, when we created the cluster there was only meta2; meta1
was attached to it later with DRBD. Everything else about the two
servers is the same.
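
If it helps to compare two servers like this, here is a rough Python sketch
for tallying the "slow ..." messages per target. The message shape is only
an assumption based on the single sample line quoted above ("<target>: slow
<operation> <N>s"); the function name summarize_slow is made up for
illustration and is not part of Lustre.

```python
import re
from collections import defaultdict

# Assumed message shape, taken from the one sample line quoted above:
#   "... crn-OST0009: slow journal start 43s"
# Other Lustre versions may format these messages differently.
SLOW_RE = re.compile(r"(?P<target>\S+): slow (?P<op>.+?) (?P<secs>\d+)s\b")


def summarize_slow(lines):
    """Return {(target, operation): (count, max_seconds)} for matching lines."""
    stats = defaultdict(lambda: [0, 0])
    for line in lines:
        m = SLOW_RE.search(line)
        if m is None:
            continue  # not a "slow ..." message in the assumed shape
        entry = stats[(m["target"], m["op"])]
        entry[0] += 1
        entry[1] = max(entry[1], int(m["secs"]))
    return {key: tuple(val) for key, val in stats.items()}
```

Feeding it the lines of each server's syslog (for example from an open file
object) gives a count and worst-case delay per operation, which makes it
easier to see whether one server of a pair is the one logging the delays.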

This is just for your information; I know it's not a big help.

Bye,

tamas