[Lustre-discuss] OSS load in the roof

Fri Jun 27 13:23:08 PDT 2008

On Fri, Jun 27, 2008 at 02:29:24PM -0400, Brock Palen wrote:
> On Jun 27, 2008, at 2:22 PM, Bernd Schubert wrote:
>> On Fri, Jun 27, 2008 at 01:44:13PM -0400, Brock Palen wrote:
>>> On Jun 27, 2008, at 1:39 PM, Bernd Schubert wrote:
>>>> On Fri, Jun 27, 2008 at 01:07:32PM -0400, Brian J. Murrell wrote:
>>>>> On Fri, 2008-06-27 at 12:44 -0400, Brock Palen wrote:
>>>>>>
>>>>>> All of them are stuck in un-interruptible sleep.
>>>>>> Has anyone seen this happen before?  Is this caused by a pending
>>>>>> disk failure?
>>>>>
>>>>> Well, they are certainly stuck because of some blocking I/O.  That
>>>>> could be disk failure, indeed.
>>>>>
>>>>>> mptscsi: ioc1: attempting task abort! (sc=0000010038904c40)
>>>>>> scsi1 : destination target 0, lun 0
>>>>>>          command = Read (10) 00 75 94 40 00 00 10 00 00
>>>>>> mptscsi: ioc1: task abort: SUCCESS (sc=0000010038904c40)
>>>>>
>>>>> That does not look like a picture of happiness, indeed, no.  You have
>>>>> SCSI commands aborting.
>>>>>
>>>>
>>>> Well, these messages are not nice of course, since the mpt error handler
>>>> got activated, but in principle a SCSI device can recover from that.
>>>> Unfortunately, the verbosity level of SCSI makes it impossible to
>>>> figure out what the actual problem was. Since we suffered from severe
>>>> SCSI problems, I wrote quite a number of patches to improve the situation.
>>>> We can now at least understand where the problem came from and also have
>>>> slightly improved error handling. These are presently for 2.6.22 only,
>>>> but my plan is to send them upstream for 2.6.28.
>>>>
>>>>
>>>>>> Lustre: 6698:0:(lustre_fsfilt.h:306:fsfilt_setattr()) nobackup-OST0001: slow setattr 100s
>>>>>> Lustre: 6698:0:(watchdog.c:312:lcw_update_time()) Expired watchdog for pid 6698 disabled after 103.1261s
>>>>>
>>>>> Those are just fallout from the above disk situation.
>>>>
>>>> Probably the device was offlined, and that should also have been
>>>> printed in the logs. Brock, can you check the device status
>>>> (cat /sys/block/sdX/device/state)?
>>>
>>> I/O is still flowing from both OSTs on that OSS:
>>>
>>> [root@nyx167 ~]# cat /sys/block/sd*/device/state
>>> running
>>> running
>>
>> So the device recovered. Is this parallel SCSI? If so, it might now run at
>> a lower SCSI speed level, but you should have got domain validation messages
>> about this (unless you are using a customized driver with DV disabled).
>
> It's Fibre Channel for the medium, directly connected (no loop or switch),
> so I am not sure. The driver is the stock one with RHEL4.
>

OK, quite different then. I have very little experience with FC, so I have no
idea what's wrong with your system now.
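
For keeping an eye on this, the state check quoted above
(cat /sys/block/sdX/device/state) could be scripted roughly as follows. This
is only a minimal sketch: the function names and the "sd*" pattern are made up
for illustration, and it assumes the usual sysfs layout where each SCSI disk
exposes a device/state file.

```python
import glob
import os


def scsi_device_states(sysfs_block="/sys/block", pattern="sd*"):
    """Return {device_name: state} for every block device matching pattern.

    sysfs_block is a parameter (hypothetical, for illustration) so the
    function can be pointed at a test directory instead of the real sysfs.
    """
    states = {}
    state_files = glob.glob(os.path.join(sysfs_block, pattern, "device", "state"))
    for state_file in sorted(state_files):
        # <sysfs_block>/<name>/device/state -> <name>
        name = os.path.basename(os.path.dirname(os.path.dirname(state_file)))
        with open(state_file) as fh:
            states[name] = fh.read().strip()
    return states


def not_running(states):
    """Names of devices whose state is anything other than 'running'."""
    return sorted(name for name, state in states.items() if state != "running")


if __name__ == "__main__":
    states = scsi_device_states()
    for name, state in sorted(states.items()):
        print(f"{name}: {state}")
    bad = not_running(states)
    if bad:
        print("not running:", ", ".join(bad))
```

Run periodically (from cron, say), this would show a device that has been
taken offline by the error handler even if the log message was missed. It
does not say anything about why the aborts happened in the first place.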

Cheers,
Bernd