[Lustre-discuss] OSS kernel panic

Brian J. Murrell Brian.Murrell at Sun.COM
Wed Dec 3 19:25:42 PST 2008


On Wed, 2008-12-03 at 19:30 -0700, Hummel, Denise wrote:
> Hi;

I've only had a chance to take the quickest of peeks at this...

> We have a Lustre filesystem on a 200-node cluster that had been pretty stable
> since June 2008, until three weeks ago.  The OSS kernel panics have escalated
> since then to about every 2 hours.

Those are not "panics".  A kernel panic is a very particular thing, and
what you are seeing is not that; it is watchdog timers firing.

I notice that they mostly (all?) seem to be in ldiskfs code paths, and that at
the end of the messages there are a bunch of these:

Dec  3 13:07:36 oss1 kernel: Lustre: 3990:0:(lustre_fsfilt.h:240:fsfilt_brw_start_log()) lustre-OST0000: slow journal start 112s
Dec  3 13:07:36 oss1 kernel: Lustre: 3942:0:(filter_io_26.c:711:filter_commitrw_write()) lustre-OST0000: slow brw_start 36s
Dec  3 13:07:36 oss1 kernel: Lustre: 3947:0:(lustre_fsfilt.h:205:fsfilt_start_log()) lustre-OST0000: slow journal start 128s
Dec  3 13:07:36 oss1 kernel: Lustre: 3947:0:(watchdog.c:312:lcw_update_time()) Expired watchdog for pid 3947 disabled after 128.2092s
Dec  3 13:07:36 oss1 kernel: Lustre: 3988:0:(lustre_fsfilt.h:296:fsfilt_commit_wait()) lustre-OST0000: slow journal start 150s
Dec  3 13:07:36 oss1 kernel: Lustre: 3988:0:(filter_io_26.c:776:filter_commitrw_write()) lustre-OST0000: slow commitrw commit 150s
Dec  3 13:07:36 oss1 kernel: Lustre: 4035:0:(filter_io_26.c:763:filter_commitrw_write()) lustre-OST0000: slow direct_io 31s
Dec  3 13:07:36 oss1 kernel: Lustre: 4053:0:(filter_io_26.c:698:filter_commitrw_write()) lustre-OST0000: slow i_mutex 150s
Dec  3 13:07:36 oss1 kernel: Lustre: 4000:0:(filter_io_26.c:763:filter_commitrw_write()) lustre-OST0000: slow direct_io 132s
Dec  3 13:07:36 oss1 kernel: Lustre: 4000:0:(filter_io_26.c:763:filter_commitrw_write()) Skipped 10 previous similar messages
Dec  3 13:07:36 oss1 kernel: Lustre: 4054:0:(filter_io_26.c:776:filter_commitrw_write()) lustre-OST0000: slow commitrw commit 151s

Which means your storage is too slow for the load that the OSS is
putting on it.

> I decreased the threads to 256 then 128 thinking the storage was oversubscribed

Your instinct was right, as this is certainly a symptom of that problem;
it can be a symptom of other problems as well, however.
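
For what it's worth, on a 1.6-era release the OSS thread count is normally
capped at module load time.  A minimal sketch, assuming the standard
oss_num_threads parameter of the ost module (check the manual for the exact
spelling on your release):

# /etc/modprobe.conf on each OSS; 128 here is only an example value
options ost oss_num_threads=128

The setting only takes effect when the OST services start, so a module reload
(and hence an OST unmount/remount) is needed to apply it.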

> The storage has no errors in the logs.

Hrm.  That was going to be my next question.  This symptom can also
describe a back end storage system that has "slowed down".

Or it could also describe a load that has gone up over time.  Perhaps the
storage has always been oversubscribed but just never taxed hard enough for the
symptom to show up.

Did you ever do any lustre-iokit benchmarking of your storage before you put
Lustre on it?  I hope so, because that gives you a good baseline: you can do
another, say, obdfilter-survey run now and compare your performance then and
now to see how they measure up.

Even if you didn't do a baseline obdfilter-survey run before you
started, doing one now will help you tune the number of OST threads you
can use before you enter the realm of diminishing returns.  The
alternative is of course continuing to binary search for your "sweet
spot".

If you choose the latter, once you have found the number of OST threads
you can run with before hitting too many "slow" messages and watchdog
timeouts, you can do some benchmarking to see if your performance is as
you would expect given your storage interconnect and hardware.  If not,
you will need to start trying to figure out why.
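
As a crude first sanity check of the raw back end (completely outside Lustre),
a direct-I/O read of the block device under the OST is non-destructive and
gives you a rough sequential number to compare against what the hardware should
deliver.  /dev/sdX below is just a placeholder for your actual OST device:

# read-only, so safe on a live OST, though it does add load while it runs
dd if=/dev/sdX of=/dev/null bs=1M count=4096 iflag=direct

If even that single sequential stream is far below what the array should do,
the problem is below Lustre; if it looks healthy, the contention is more likely
in the concurrent load Lustre is generating.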

b.
