[Lustre-discuss] Possible out of memory condition
Craig.Tierney at noaa.gov
Mon Oct 27 09:16:39 PDT 2008
Andreas Dilger wrote:
> On Oct 22, 2008 14:37 -0600, Craig Tierney wrote:
>> I just had two nodes hang with the following soft lockup messages.
>> I am running Centos 5.2 (2.6.18-93.1.13.el5) with the patchless client
>> (18.104.22.168). My nodes do not have swap configured on them (no local
>> disks). We do have a tool that looks for out of memory condition
>> and neither of the nodes in question reported a problem (not that it
>> is perfect).
> Note that soft lockups are only a warning. It shouldn't mean that the
> node is completely dead, only that some thread was hogging the CPU.
The two soft lockup messages (one from kswapd0, the other from the user
process convert_emiss) kept repeating for 6 hours before I rebooted
the node. I don't recall whether I could still log in to the node.
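For what it's worth, a quick way to see how often (and for how long) the reports repeated is to tally them from the saved syslog. This is just a sketch; /var/log/messages is an assumed log path for CentOS 5, so adjust for your syslog setup:

```shell
# Tally soft-lockup reports by the task that triggered them.
# /var/log/messages is an assumed log path; adjust as needed.
grep 'BUG: soft lockup' /var/log/messages \
  | sed 's/.*stuck for [0-9]*s! //' \
  | sort | uniq -c | sort -rn
```

A run of identical entries for the same task (e.g. [kswapd0:418]) over hours, rather than a one-off warning, is what distinguishes a stuck thread from a transient CPU hog.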
>> Does the problem look like an issue with Lustre?
> Lots of Lustre functions on the stack...
>> Oct 22 08:06:45 h53 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [kswapd0:418]
>> Oct 22 08:06:45 h53 kernel: Call Trace:
>> Oct 22 08:06:45 h53 kernel: [<ffffffff8871125a>] :osc:cache_remove_extent+0x4a/0x90
>> Oct 22 08:06:45 h53 kernel: [<ffffffff88707c5a>] :osc:osc_teardown_async_page+0x25a/0x3c0
> Do you have particularly large files in use (e.g. in the realm of 1TB or
> more)? It seems possible that if there are a lot of pages to be cleaned
> up that this might cause a report like this.
My first guess is no, we don't create files that large. But it is
entirely possible a user misused this code in a way that produced some
very large files (appending instead of creating). I will check it out.
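One way to check that guess is to scan the mount for files approaching 1TB. A minimal sketch, assuming GNU find (for the G size suffix and -printf); /mnt/lustre is a placeholder for the actual mount point:

```shell
#!/bin/sh
# Sketch: list files at or above ~1TB under a mount point, largest first.
# /mnt/lustre is a placeholder mount point; GNU find is assumed.
MOUNT=${1:-/mnt/lustre}
find "$MOUNT" -type f -size +1024G -printf '%s\t%p\n' 2>/dev/null | sort -rn
```

Walking a large Lustre namespace this way can take a while and loads the MDS, so it's best run off-hours.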
> Cheers, Andreas
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
Craig Tierney (craig.tierney at noaa.gov)