[Lustre-discuss] OSS kernel panic

Wed Dec 3 18:30:05 PST 2008

Hi,

We have a Lustre filesystem on a 200-node cluster that had been pretty stable since
June 2008 until three weeks ago.  Since then the OSS kernel panics have escalated and
now occur about every 2 hours.
The MDT/MGS is on an x86_64 server with 8 GB of memory and two dual-core AMD processors.
The OSS is on an x86_64 server with 8 GB of memory and two dual-core AMD processors.
One OST, RAID 6, ~9 TB (I know it is larger than what is currently tested), at 58%.
Lustre 1.6.4.2

I decreased the number of threads to 256 and then to 128, thinking the storage was
oversubscribed; however, the kernel panics continue.  The storage logs show no errors.
I have run an fsck, and it detected no filesystem issues.
We do have an average of ~35 Gaussian programs running, which generate heavy I/O; however,
collectl does not show any system stress before the panic.  The console shows a few messages
about brw_writes and OST timeouts.  I am attaching the syslog messages from just before one of
the kernel panics and the one Lustre dump that contains data.

If anyone has any thoughts, I would appreciate them.
Denise
-------------- next part --------------
A non-text attachment was scrubbed...
Name: messages
Type: application/octet-stream
Size: 161948 bytes
Desc: messages
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20081203/d1dbe851/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lustre-log.1228334804.4054
Type: application/octet-stream
Size: 904990 bytes
Desc: lustre-log.1228334804.4054
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20081203/d1dbe851/attachment-0001.obj>