[Lustre-discuss] OSS load in the roof

Brock Palen brockp at umich.edu
Fri Jun 27 10:44:13 PDT 2008


On Jun 27, 2008, at 1:39 PM, Bernd Schubert wrote:
> On Fri, Jun 27, 2008 at 01:07:32PM -0400, Brian J. Murrell wrote:
>> On Fri, 2008-06-27 at 12:44 -0400, Brock Palen wrote:
>>>
>>> All of them are stuck in uninterruptible sleep.
>>> Has anyone seen this happen before?  Is this caused by a pending
>>> disk failure?
>>
>> Well, they are certainly stuck because of some blocking I/O.  That
>> could be disk failure, indeed.
>>
>>> mptscsi: ioc1: attempting task abort! (sc=0000010038904c40)
>>> scsi1 : destination target 0, lun 0
>>>          command = Read (10) 00 75 94 40 00 00 10 00 00
>>> mptscsi: ioc1: task abort: SUCCESS (sc=0000010038904c40)
>>
>> That does not look like a picture of happiness, indeed, no.  You have
>> SCSI commands aborting.
>>
>
> Well, these messages are of course not nice, since the mpt error handler
> got activated, but in principle a scsi device can recover from that.
> Unfortunately, the default scsi verbosity level makes it impossible to
> figure out what the actual problem was. Since we suffered from severe
> scsi problems, I wrote quite a number of patches to improve the
> situation. We can now at least understand where the problem came from
> and also have slightly improved error handling. These are presently for
> 2.6.22 only, but my plan is to send them upstream for 2.6.28.
>
>
>>> Lustre: 6698:0:(lustre_fsfilt.h:306:fsfilt_setattr()) nobackup-
>>> OST0001: slow setattr 100s
>>> Lustre: 6698:0:(watchdog.c:312:lcw_update_time()) Expired watchdog
>>> for pid 6698 disabled after 103.1261s
>>
>> Those are just fallout from the above disk situation.
>
> Probably the device was offlined, and that should also have been
> printed in the logs. Brock, can you check the device status
> (cat /sys/block/sdX/device/state)?

I/O is still flowing from both OSTs on that OSS:

[root at nyx167 ~]# cat /sys/block/sd*/device/state
running
running
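
Something like the following shows the state of every disk at once (a
rough sketch; the sysfs attributes are standard, but ioerr_cnt only
exists on newer kernels, hence the fallback):

for dev in /sys/block/sd*; do
    name=${dev##*/}
    state=$(cat "$dev/device/state")
    vendor=$(cat "$dev/device/vendor")
    model=$(cat "$dev/device/model")
    # ioerr_cnt is missing on older kernels, so fall back to "n/a"
    ioerr=$(cat "$dev/device/ioerr_cnt" 2>/dev/null || echo n/a)
    echo "$name: state=$state vendor=$vendor model=$model ioerr_cnt=$ioerr"
done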

Sigh, it only needs to live till August, when we install our x4500s.
I think it's safe to send a notice to users that they may want to copy
their data.
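
On Bernd's point about scsi verbosity: if the OSS kernel was built with
CONFIG_SCSI_LOGGING, the mid-layer log level can be raised at runtime,
which might show what the mpt aborts are actually hitting.  Roughly (the
value is a packed bitfield, 3 bits per facility; the exact layout is in
include/scsi/scsi_logging.h, so treat the number below as an example
only):

cat /proc/sys/dev/scsi/logging_level          # current level, 0 = quiet
echo 511 > /proc/sys/dev/scsi/logging_level   # example value only -- check the header
echo 0 > /proc/sys/dev/scsi/logging_level     # back to quiet once the aborts are captured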

>
> Cheers,
> Bernd



