[Lustre-discuss] 1.6.5.1 OSS crashes

Robin Humble rjh+lustre at cita.utoronto.ca
Fri Jul 25 00:48:17 PDT 2008


On Sun, Jul 20, 2008 at 08:40:19AM -0400, Mag Gam wrote:
>I am trying to understand. What was the problem? How does SD_IOSTATS
>affect the crash? How did you disable this?

the comments describe the bug:
  https://bugzilla.lustre.org/show_bug.cgi?id=16404#c22
which from a quick look seems to be an SMP locking issue around the
statistics collection that presumably, under some circumstances,
can cause an overflow and a crash.

the way to disable it is to rebuild the patched-by-Lustre RHEL kernel
with the CONFIG_SD_IOSTATS option turned off.
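
if you want to confirm that a rebuilt kernel really has the option off,
something like the sketch below may help. it only parses kernel config
text in the usual format ("OPTION=value" when set, "# OPTION is not set"
when disabled). the helper name kconfig_option_state is made up for this
example, and where the config file lives on your OSS (often
/boot/config-<kernel version>) is an assumption you should check locally.

```python
import re


def kconfig_option_state(config_text, option):
    """Report how a kernel config option appears in config_text.

    Returns the value string (e.g. 'y' or 'm') if the option is set,
    'not set' if it is explicitly disabled, or None if it is absent.
    Hypothetical helper, not part of Lustre or any kernel tooling.
    """
    set_re = re.compile(r"^%s=(.*)$" % re.escape(option))
    unset_re = re.compile(r"^# %s is not set$" % re.escape(option))
    for line in config_text.splitlines():
        line = line.strip()
        match = set_re.match(line)
        if match:
            return match.group(1)
        if unset_re.match(line):
            return "not set"
    return None


# Illustrative config fragment, not copied from a real kernel build.
sample = """\
CONFIG_SCSI=y
# CONFIG_SD_IOSTATS is not set
"""

print(kconfig_option_state(sample, "CONFIG_SD_IOSTATS"))
```

to use it on a real machine, read the config file into a string and pass
that in place of sample. a result of None means the option is not
mentioned at all, which is worth a second look rather than assuming it
is off.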

>Sorry for a newbie question....

no probs.
let me know if you need a recipe for patching and rebuilding this
kernel. I should really write it all down before I forget anyway...

there are most likely descriptions for patching and building kernels on
the Lustre wiki too.

cheers,
robin

>
>
>On Sun, Jul 20, 2008 at 4:54 AM, Robin Humble
><rjh+lustre at cita.utoronto.ca> wrote:
>> On Fri, Jul 18, 2008 at 09:02:36AM -0400, Brian J. Murrell wrote:
>>>On Fri, 2008-07-18 at 05:52 -0400, Robin Humble wrote:
>>>> Hi,
>>>>
>>>> I'm seeing coordinated OSS crashes with Lustre 1.6.5.1.
>>>>
>>>> our RHEL4 OSS have been stable for ~months with these kernels:
>>>>   kernel-lustre-smp-2.6.9-67.0.4.EL_lustre.1.6.4.3
>>>>   kernel-lustre-smp-2.6.9-55.0.9.EL_lustre.1.6.4.2
>>>>
>>>> but have crashed hard, twice, about 10hrs apart as soon as we started
>>>> using this kernel:
>>>>   kernel-lustre-smp-2.6.9-67.0.7.EL_lustre.1.6.5.1
>>>Can you try rebuilding the kernel, disabling SD_IOSTATS?
>>
>> done. I rebuilt using the stock kernel's InfiniBand stack and
>>  # CONFIG_SD_IOSTATS is not set
>>
>>  % cexec -p oss: uptime
>> oss x17:  18:45:07 up 1 day, 30 min,  1 user,  load average: 4.97, 7.00, 6.27
>> oss x18:  18:45:07 up 1 day, 23 min,  1 user,  load average: 4.18, 5.78, 5.71
>> oss x19:  18:45:07 up 1 day, 23 min,  1 user,  load average: 5.18, 5.66, 4.60
>>
>> which is >> the 10hrs it was crashing at before.
>> good guess about the cause of the problem! :-)
>>
>> maybe that rhel4 1.6.5.1 kernel rpm needs a respin then? seems like a
>> fairly critical issue... :-/
>>
>> cheers,
>> robin
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>


