[Lustre-discuss] Lustre-1.8.4 : BUG soft lock up

Joe Landman landman at scalableinformatics.com
Tue Aug 9 23:27:41 PDT 2011


On 08/10/2011 01:40 AM, Jeff Johnson wrote:
> Greetings,
>
> The below console output is from a 1.8.4 OST (RHEL5.5,
> 2.6.18-194.3.1.el5_lustre.1.8.4, x86_64). Not saying it is a Lustre bug
> for sure. Just wondering if anyone has seen this or something very
> similar. Updating to 1.8.6 WC variant isn't an option at this time.

The kswapd kernel swap thread was stuck on that CPU for more than 10 
seconds.  Possibly a race condition on the disk.

>
> If anyone has some insight into this I'd appreciate the feedback.
>
> Thanks,
>
> --Jeff
>
> BUG: soft lockup - CPU#6 stuck for 10s! [kswapd0:409]

More to the point, it shouldn't be swapping.  What do these report?

	sysctl -a | grep swappiness

	grep -i swap /proc/meminfo

Likely you have some process with a memory leak, and you need to flush 
cache/swap every now and then to make sure it doesn't fill up.
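A minimal sketch of that periodic flush, assuming standard Linux procfs knobs (run as root, e.g. from cron; the cadence and whether to cycle swap at all are site-specific judgment calls, not something from this thread):

```shell
# Sketch: reclaim leaked/cached memory by hand.  No-op when not root.
if [ "$(id -u)" -eq 0 ]; then
    sync                               # write dirty pages out first
    echo 3 > /proc/sys/vm/drop_caches  # drop pagecache, dentries, inodes
    swapoff -a && swapon -a            # pull swapped pages back into RAM
fi
```

Note that `drop_caches` only discards clean cache; it won't free memory a leaking process still holds, it just buys headroom.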

> CPU 6:

> RIP: 0010:[<ffffffff801011bf>]  [<ffffffff801011bf>] dqput+0x105/0x19f

This is a quota put (dqput()).  It takes some spin locks, and there 
could be allocations in some of the calls it makes; I haven't checked.

http://lxr.free-electrons.com/source/fs/quota/dquot.c?a=microblaze#L718
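Since the lockup is inside the quota release path, it may be worth confirming that quotas are actually enabled on the OST backend filesystems. A quick, read-only check (this just inspects mount options, it doesn't prove the trace is quota's fault):

```shell
# Look for usrquota/grpquota mount options on the backend filesystems;
# a hit means the dqput() path in the trace above is live on this node.
grep -E 'usrquota|grpquota' /proc/mounts || echo "no quota mount options found"
```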

> RSP: 0018:ffff8101be805cd0  EFLAGS: 00000202
> RAX: ffff81012e03f000 RBX: 0000000000000000 RCX: ffff81012e03f000
> RDX: ffffffffffffffe2 RSI: 0000000000000002 RDI: ffff81012f4f01c0
> RBP: ffff81007fb4c918 R08: ffff810000018b00 R09: ffff81007fb4c918
> R10: ffff8101be805c60 R11: ffffffff8b6448f0 R12: ffff8101be805c60
> R13: ffffffff8b6448f0 R14: 00000000ffffffe2 R15: ffffffff8b6448f0
> FS:  0000000000000000(0000) GS:ffff8101bfc2adc0(0000) knlGS:0000000000000000
> CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 0000000000402000 CR3: 0000000000201000 CR4: 00000000000006e0
>
> Call Trace:
>    [<ffffffff8010182b>] dquot_drop+0x30/0x5e
>    [<ffffffff8b647e83>] :ldiskfs:ldiskfs_dquot_drop+0x43/0x70
>    [<ffffffff80022d99>] clear_inode+0xb4/0x123
>    [<ffffffff80034e52>] dispose_list+0x41/0xe0
>    [<ffffffff8002d6a7>] shrink_icache_memory+0x1b7/0x1e6
>    [<ffffffff8003f466>] shrink_slab+0xdc/0x153
>    [<ffffffff80057e59>] kswapd+0x343/0x46c
>    [<ffffffff800a0ab2>] autoremove_wake_function+0x0/0x2e
>    [<ffffffff80057b16>] kswapd+0x0/0x46c
>    [<ffffffff800a089a>] keventd_create_kthread+0x0/0xc4
>    [<ffffffff80032890>] kthread+0xfe/0x132
>    [<ffffffff8009d728>] request_module+0x0/0x14d
>    [<ffffffff8005dfb1>] child_rip+0xa/0x11
>    [<ffffffff800a089a>] keventd_create_kthread+0x0/0xc4
>    [<ffffffff80032792>] kthread+0x0/0x132
>    [<ffffffff8005dfa7>] child_rip+0x0/0x11

There are a couple of known RHEL bugs that this resembles.
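One commonly suggested mitigation for kswapd stalls on busy Lustre servers (an assumption on my part, not something established in this thread, and the value shown is purely illustrative) is to raise the free-page reserve so reclaim kicks in earlier and kswapd is less likely to get stuck deep in slab/inode shrinking:

```shell
# /proc/sys/vm/min_free_kbytes is the same knob as sysctl vm.min_free_kbytes.
cat /proc/sys/vm/min_free_kbytes             # current reserve, in kB
# echo 262144 > /proc/sys/vm/min_free_kbytes # example: 256 MB reserve (as root)
```

Any change here trades usable RAM for reclaim headroom, so test it under real load before making it permanent in /etc/sysctl.conf.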




-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


