[Lustre-discuss] Possible out of memory condition

Andreas Dilger adilger at sun.com
Sat Oct 25 02:18:56 PDT 2008


On Oct 22, 2008  14:37 -0600, Craig Tierney wrote:
> I just had two nodes hang with the following soft lockup messages.
> I am running Centos 5.2 (2.6.18-93.1.13.el5) with the patchless client
> (1.6.5.1).  My nodes do not have swap configured on them (no local
> disks).  We do have a tool that looks for out-of-memory conditions,
> and neither of the nodes in question reported a problem (not that it
> is perfect).

Note that soft lockups are only a warning.  It shouldn't mean that the
node is completely dead, only that some thread was hogging the CPU.
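
Because the warning names the thread that was hogging the CPU (here
kswapd0, pid 418), a monitoring tool like the out-of-memory checker
mentioned above could also scan the kernel log for these lines. What
follows is a minimal sketch, not an existing tool. It assumes the message
format shown in the log below. The names parse_soft_lockup, SoftLockup
and SOFT_LOCKUP_RE are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Matches the soft-lockup format seen in this log (an assumption for other
# kernels), e.g. "BUG: soft lockup - CPU#2 stuck for 10s! [kswapd0:418]"
SOFT_LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]+):(?P<pid>\d+)\]"
)

class SoftLockup(NamedTuple):
    cpu: int      # CPU the watchdog reported as stuck
    seconds: int  # how long it went without rescheduling
    comm: str     # name of the thread that was hogging the CPU
    pid: int      # that thread's pid

def parse_soft_lockup(line: str) -> Optional[SoftLockup]:
    """Return the parsed warning, or None if the line is not one."""
    m = SOFT_LOCKUP_RE.search(line)
    if m is None:
        return None
    return SoftLockup(int(m["cpu"]), int(m["secs"]), m["comm"], int(m["pid"]))

line = "Oct 22 08:06:45 h53 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [kswapd0:418]"
print(parse_soft_lockup(line))
# SoftLockup(cpu=2, seconds=10, comm='kswapd0', pid=418)
```

The search uses re.search rather than a full-line match, so syslog
prefixes such as "Oct 22 08:06:45 h53 kernel:" are ignored.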

> Does the problem look like an issue with Lustre?

There are lots of Lustre (osc) functions on the stack...

> Oct 22 08:06:45 h53 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [kswapd0:418]
> Oct 22 08:06:45 h53 kernel: Call Trace:
> Oct 22 08:06:45 h53 kernel:  [<ffffffff8871125a>] :osc:cache_remove_extent+0x4a/0x90
> Oct 22 08:06:45 h53 kernel:  [<ffffffff88707c5a>] :osc:osc_teardown_async_page+0x25a/0x3c0

Do you have particularly large files in use (e.g. in the realm of 1TB or
more)?  It seems possible that, if there are a lot of pages to be cleaned
up, this might cause a report like this.
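
As a rough illustration of why "a lot of pages" could matter, here is a
back-of-envelope sketch. It assumes 4 KiB pages and teardown cost that is
linear in the page count, with no rescheduling during the loop. It also
assumes a per-page cost of 100 ns, which is made up for illustration and
was not measured. In practice the number of pages a client actually has
cached is also bounded by its RAM. The function names are hypothetical.

```python
PAGE_SIZE = 4096              # bytes; assumed 4 KiB x86_64 page size
SOFT_LOCKUP_THRESHOLD_S = 10  # "stuck for 10s" in the log above

def pages_for(nbytes: int, page_size: int = PAGE_SIZE) -> int:
    # Ceiling division: a partial final page still occupies a whole page.
    return -(-nbytes // page_size)

def teardown_seconds(npages: int, ns_per_page: float) -> float:
    # Total time if every page costs ns_per_page and the loop never yields.
    return npages * ns_per_page / 1e9

one_tib = 1 << 40                # 1 TiB
n = pages_for(one_tib)           # 268435456 pages
t = teardown_seconds(n, 100.0)   # hypothetical 100 ns/page -> ~26.8 s
print(n, round(t, 1), t > SOFT_LOCKUP_THRESHOLD_S)
# 268435456 26.8 True
```

Under these assumptions a single non-yielding pass over that many pages
would run well past the 10-second window. A smaller cost per page or a
smaller page count would scale the result down linearly.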

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
