[Lustre-discuss] kernel panic with 1.6.5rc2 on mds

Andreas Dilger adilger at sun.com
Sat May 17 08:47:24 PDT 2008


On May 16, 2008  12:45 +0200, Patrick Winnertz wrote:
> As I wrote in #11742 [1], I experienced a kernel panic on the MDS after
> doing heavy I/O on the 1.6.5rc2 cluster.  Since nobody has answered this
> bug so far (and in other cases the Lustre team is _really_ fast (thanks
> for that :))), I fear that it has not been noticed by anybody.
> 
> This kernel panic seems to be related to the bug mentioned above
> (#11742), as that bug number appears in the dmesg output when the node died.
> Furthermore, right before it started to fail there were several messages
> like the following:
> 
> LustreError: 3342:0:(osc_request.c:678:osc_announce_cached()) dirty 
> 81108992 > dirty_max 33554432
> 
> This behaviour is described in #13344 [2].

Sorry, I don't have net access right now, so I can't see your comments
in the bug, but the above message is definitely unusual and an indication
of some kind of code bug.

The client imposes a limit on the amount of dirty data that it can cache
(in /proc/fs/lustre/osc/*/max_dirty_mb, default 32MB), on a per-OST basis.
This ensures that at lock cancellation time the client isn't holding, say,
5TB of dirty data that would take 30 minutes to flush to the OST.
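
If it helps while debugging, below is a minimal userspace sketch that dumps
these values on a client.  It assumes the 1.6.x proc layout; max_dirty_mb
comes straight from the path above, while cur_dirty_bytes is only my guess
at the matching counter name, so adjust the glob patterns if your tree
differs:

    /* Minimal sketch: dump the per-OST dirty-cache limit on a client.
     * Assumes the 1.6.x proc layout; cur_dirty_bytes is a guess at the
     * matching counter name -- adjust the glob if your tree differs. */
    #include <glob.h>
    #include <stdio.h>
    #include <string.h>

    static void dump(const char *pattern)
    {
            glob_t g;
            size_t i;

            if (glob(pattern, 0, NULL, &g) != 0)
                    return;
            for (i = 0; i < g.gl_pathc; i++) {
                    FILE *f = fopen(g.gl_pathv[i], "r");
                    char buf[64];

                    if (f == NULL)
                            continue;
                    if (fgets(buf, sizeof(buf), f) != NULL) {
                            buf[strcspn(buf, "\n")] = '\0';
                            printf("%s: %s\n", g.gl_pathv[i], buf);
                    }
                    fclose(f);
            }
            globfree(&g);
    }

    int main(void)
    {
            dump("/proc/fs/lustre/osc/*/max_dirty_mb");
            dump("/proc/fs/lustre/osc/*/cur_dirty_bytes"); /* assumed name */
            return 0;
    }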

It seems that either the client's accounting of dirty pages has gone wrong,
or the client has actually dirtied far more data (about 80MB) than it should
have been allowed to (32MB).

Could you please explain the type of IO the client is doing?  Is this a
normal write(), or writev(), pwrite(), O_DIRECT, mmap, or something else?
Were there IO errors, IO resends, or some other unusual problem?
The entry points for this IO into Lustre are all slightly different, and
it wouldn't be the first time there was an accounting error somewhere.
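
To help narrow that down, here is a minimal sketch (not Lustre-specific)
contrasting two of those entry points: a normal buffered write() versus an
O_DIRECT write(), which bypasses the client page cache and so should not
accumulate dirty pages at all.  The /mnt/lustre paths and the 1MB chunk
size are only placeholders:

    /* Sketch contrasting a buffered write() with an O_DIRECT write().
     * The /mnt/lustre paths and the 1MB chunk size are placeholders. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define CHUNK (1024 * 1024)

    static void write_chunk(const char *path, int extra_flags, void *buf)
    {
            int fd = open(path, O_WRONLY | O_CREAT | extra_flags, 0644);

            if (fd < 0) {
                    perror(path);
                    return;
            }
            if (write(fd, buf, CHUNK) < 0)
                    perror("write");
            close(fd);
    }

    int main(void)
    {
            void *buf;

            /* O_DIRECT needs aligned buffers (typically page-aligned). */
            if (posix_memalign(&buf, 4096, CHUNK) != 0)
                    return 1;
            memset(buf, 0xab, CHUNK);

            /* 1) Buffered write(): pages are dirtied in the client cache
             *    and counted against max_dirty_mb until written back. */
            write_chunk("/mnt/lustre/buffered.dat", 0, buf);

            /* 2) O_DIRECT write(): data goes straight to the OST, so the
             *    dirty-page accounting should not be involved at all. */
            write_chunk("/mnt/lustre/direct.dat", O_DIRECT, buf);

            free(buf);
            return 0;
    }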


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
