[Lustre-discuss] 1.8.1.1
Andreas Dilger
adilger at sun.com
Fri Nov 27 15:22:36 PST 2009
On 2009-11-27, at 03:13, Papp Tamás wrote:
> Craig Prescott wrote, On 2009. 11. 19. 20:42:
>> We had the same problem with 1.8.x.x.
>>
>> We set lnet.printk=0 on our OSS nodes and it has helped us
>> dramatically - we have not seen the 'soft lockup' problem since.
>>
>> sysctl -w lnet.printk=0
>>
>> This will turn off all but 'emerg' messages from lnet.
>>
>> It would be interesting to know if this avoided the lockups for
>> you, too.
>
> Well, this definitely helped, but it didn't resolve the root of the
> problem.
>
> A few minutes ago we were not able to reach our cluster from clients.
>
> On MDS I see a lot of this:
>
>
> Nov 27 10:52:17 meta1 kernel: BUG: soft lockup - CPU#3 stuck for
> 10s! [ll_evictor:6123]
> Nov 27 10:52:18 meta1 kernel: Call Trace:
> :obdclass:lustre_hash_for_each_empty+0x237/0x2b0
> :obdclass:class_disconnect+0x398/0x420
> :mds:mds_disconnect+0x121/0xe40
> :obdclass:class_fail_export+0x384/0x4c0
> :ptlrpc:ping_evictor_main+0x4f8/0x7e0
> default_wake_function+0x0/0xe
> :ptlrpc:ping_evictor_main+0x0/0x7e0
This looks like the server evicting a client that holds a lot of locks.
One thing to try is adding a call to cond_resched() in
lustre_hash_for_each_empty(), since it seems this function could run
for a long time if func() never causes a reschedule:
lustre_hash_for_each_empty(lustre_hash_t *lh, lh_for_each_cb func,
{
        read_unlock(&lh->lh_rwlock);
        func(obj, data);
        (void)lh_put(lh, hnode);
+       cond_resched();
        goto restart;
I'm not sure this is the root cause, but you could check the DLM lock
stats in /proc/fs/lustre/ldlm/namespaces/*/lock_count on some clients,
to see how many locks they are holding, or the same on the MDS, which
will be the total number of locks currently granted to all clients.
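
A minimal sketch of that check, for convenience: it reads every
lock_count file under the namespaces directory and prints one line per
namespace plus a total. The function name lock_counts and its optional
directory argument are illustrative assumptions, not part of Lustre;
only the /proc path comes from the paragraph above.

```shell
#!/bin/sh
# Sketch: summarise DLM lock counts per namespace.
# "lock_counts" and its optional directory argument are hypothetical
# names used for illustration; the default path is the one quoted above.
lock_counts() {
    dir="${1:-/proc/fs/lustre/ldlm/namespaces}"
    total=0
    for f in "$dir"/*/lock_count; do
        [ -r "$f" ] || continue            # no match, or unreadable file
        n=$(cat "$f")
        case "$n" in
            ''|*[!0-9]*) continue ;;       # skip anything that is not a plain number
        esac
        ns=$(basename "$(dirname "$f")")   # namespace directory name
        printf '%s %s\n' "$ns" "$n"
        total=$((total + n))
    done
    printf 'total %s\n' "$total"
}
```

Calling lock_counts with no argument on a client should list the locks
that client holds per namespace; on the MDS the same call should give
the number of locks currently granted to all clients.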
> After them:
>
> 6009:0:(events.c:367:server_bulk_callback()) event type 4, status
> -5, desc ffff81022ed54280
This is just fallout from the MDS being too busy to handle requests.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.