[Lustre-discuss] 1.6.5.1 OSS crashes
Robin Humble
rjh+lustre at cita.utoronto.ca
Sun Jul 20 01:54:53 PDT 2008
On Fri, Jul 18, 2008 at 09:02:36AM -0400, Brian J. Murrell wrote:
>On Fri, 2008-07-18 at 05:52 -0400, Robin Humble wrote:
>> Hi,
>>
>> I'm seeing coordinated OSS crashes with Lustre 1.6.5.1.
>>
>> our RHEL4 OSS have been stable for ~months with these kernels:
>> kernel-lustre-smp-2.6.9-67.0.4.EL_lustre.1.6.4.3
>> kernel-lustre-smp-2.6.9-55.0.9.EL_lustre.1.6.4.2
>>
>> but have crashed hard, twice, about 10hrs apart as soon as we started
>> using this kernel:
>> kernel-lustre-smp-2.6.9-67.0.7.EL_lustre.1.6.5.1
>Can you try rebuilding the kernel, disabling SD_IOSTATS?

done. I rebuilt using the stock kernel's InfiniBand stack and
# CONFIG_SD_IOSTATS is not set
% cexec -p oss: uptime
oss x17: 18:45:07 up 1 day, 30 min, 1 user, load average: 4.97, 7.00, 6.27
oss x18: 18:45:07 up 1 day, 23 min, 1 user, load average: 4.18, 5.78, 5.71
oss x19: 18:45:07 up 1 day, 23 min, 1 user, load average: 5.18, 5.66, 4.60
which is much longer than the ~10hrs it took to crash before.
good guess about the cause of the problem! :-)
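
for anyone wanting to check whether their own OSS kernel has the option
enabled, here is a minimal sketch. it assumes the kernel config is installed
as /boot/config-$(uname -r) (the usual place on RHEL) and that the option
shows up as CONFIG_SD_IOSTATS=y when enabled. the function name
check_sd_iostats is just made up for illustration:

```shell
#!/bin/sh
# Report whether CONFIG_SD_IOSTATS is enabled in a kernel config file.
# Assumption: the config lives at /boot/config-<running kernel version>
# unless a different path is passed as the first argument.
check_sd_iostats() {
    cfg="${1:-/boot/config-$(uname -r)}"

    # Without a readable config file we cannot tell either way.
    if [ ! -r "$cfg" ]; then
        echo "unknown"
        return 0
    fi

    # An enabled option appears as "CONFIG_SD_IOSTATS=y"; a disabled one
    # appears as the comment "# CONFIG_SD_IOSTATS is not set" or is absent.
    if grep -q '^CONFIG_SD_IOSTATS=y' "$cfg"; then
        echo "enabled"
    else
        echo "disabled"
    fi
    return 0
}
```

calling check_sd_iostats with no argument looks at the running kernel;
passing a path checks some other config file, e.g. the .config in a kernel
build tree before the rebuild.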
maybe that rhel4 1.6.5.1 kernel rpm needs a respin then? seems like a
fairly critical issue... :-/
cheers,
robin