[Lustre-discuss] OSS load in the roof

Fri Jun 27 10:39:46 PDT 2008

On Fri, Jun 27, 2008 at 01:07:32PM -0400, Brian J. Murrell wrote:
> On Fri, 2008-06-27 at 12:44 -0400, Brock Palen wrote:
> > 
> > All of them are stuck in un-interruptible sleep.
> > Has anyone seen this happen before?  Is this caused by a pending disk  
> > failure?
> 
> Well, they are certainly stuck because of some blocking I/O.  That could
> be disk failure, indeed.
> 
> > mptscsi: ioc1: attempting task abort! (sc=0000010038904c40)
> > scsi1 : destination target 0, lun 0
> >          command = Read (10) 00 75 94 40 00 00 10 00 00
> > mptscsi: ioc1: task abort: SUCCESS (sc=0000010038904c40)
> 
> That does not look like a picture of happiness, indeed, no.  You have
> SCSI commands aborting.
> 

Well, these messages are not nice of course, since they mean the mpt error
handler got activated, but in principle a SCSI device can recover from that.
Unfortunately, the verbosity level of the SCSI layer makes it impossible to
figure out what the actual problem was. Since we suffered from severe
SCSI problems ourselves, I wrote quite a number of patches to improve the
situation. We can now at least understand where a problem came from, and the
error handling is slightly improved as well. These patches are presently for
2.6.22 only, but my plan is to send them upstream for 2.6.28.

> > Lustre: 6698:0:(lustre_fsfilt.h:306:fsfilt_setattr()) nobackup- 
> > OST0001: slow setattr 100s
> > Lustre: 6698:0:(watchdog.c:312:lcw_update_time()) Expired watchdog  
> > for pid 6698 disabled after 103.1261s
> 
> Those are just fallout from the above disk situation.

Probably the device was offlined, and this should also have been
printed in the logs. Brock, can you check the device status
(cat /sys/block/sdX/device/state)?
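
If there are many disks behind the OSS, checking them one by one gets tedious. Something along these lines should read the same sysfs file for every block device in one go. It is only a rough sketch: the function name scsi_device_states is made up for illustration, and it assumes that a healthy device reports "running" while an offlined one reports "offline".

```python
import os


def scsi_device_states(sysfs_block="/sys/block"):
    """Return {device_name: state} for block devices that expose device/state."""
    states = {}
    if not os.path.isdir(sysfs_block):
        # No sysfs here (not Linux, or not mounted): nothing to report.
        return states
    for name in sorted(os.listdir(sysfs_block)):
        state_path = os.path.join(sysfs_block, name, "device", "state")
        try:
            with open(state_path) as f:
                states[name] = f.read().strip()
        except OSError:
            # No device/state file (e.g. loop or ram devices), or it is
            # not readable; skip this entry.
            continue
    return states


if __name__ == "__main__":
    for name, state in scsi_device_states().items():
        # Assumption: anything other than "running" deserves a closer look.
        flag = "" if state == "running" else "  <-- check this device"
        print(f"{name}: {state}{flag}")
```

Any device flagged here would be the first candidate to look at alongside the mptscsi messages in the log.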

Cheers,
Bernd