[Lustre-discuss] 1.8.1.1

Andreas Dilger adilger at sun.com
Fri Nov 27 15:22:36 PST 2009


On 2009-11-27, at 03:13, Papp Tamás wrote:
> Craig Prescott wrote, On 2009. 11. 19. 20:42:
>> We had the same problem with 1.8.x.x.
>>
>> We set lnet.printk=0 on our OSS nodes and it has helped us  
>> dramatically - we have not seen the 'soft lockup' problem since.
>>
>> sysctl -w lnet.printk=0
>>
>> This will turn off all but 'emerg' messages from lnet.
>>
>> It would be interesting to know if this avoided the lockups for  
>> you, too.
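
For reference, a minimal sketch of checking and applying this across a set
of OSS nodes; it assumes pdsh is available and "oss[01-08]" stands in for
your actual OSS host list:

    pdsh -w oss[01-08] 'sysctl lnet.printk'        # show the current printk mask
    pdsh -w oss[01-08] 'sysctl -w lnet.printk=0'   # keep only 'emerg' messages

Like any "sysctl -w", the setting is not persistent, so it needs to be
reapplied after a reboot once lnet is loaded.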
>
> Well, this definitely helped, but didn't resolve the root of the  
> problem.
>
> A few minutes ago we were not able to reach our cluster from clients.
>
> On MDS I see a lot of this:
>
>
> Nov 27 10:52:17 meta1 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [ll_evictor:6123]
> Nov 27 10:52:18 meta1 kernel: Call Trace:
>  :obdclass:lustre_hash_for_each_empty+0x237/0x2b0
>  :obdclass:class_disconnect+0x398/0x420
>  :mds:mds_disconnect+0x121/0xe40
>  :obdclass:class_fail_export+0x384/0x4c0
>  :ptlrpc:ping_evictor_main+0x4f8/0x7e0
>  default_wake_function+0x0/0xe
>  :ptlrpc:ping_evictor_main+0x0/0x7e0

This looks like the server evicting a client that has a lot of locks.
One thing to try is adding a call to cond_resched() inside
lustre_hash_for_each_empty(), since that function can run for a long
time without ever rescheduling if func() never sleeps - which is
exactly what the 10s soft-lockup watchdog is complaining about:

lustre_hash_for_each_empty(lustre_hash_t *lh, lh_for_each_cb func, void *data)
{
         ...
                         read_unlock(&lh->lh_rwlock);
                         func(obj, data);
                         (void)lh_put(lh, hnode);
+                        cond_resched();   /* lock is dropped here, safe to yield */
                         goto restart;
         ...
}

I'm not sure this is the root cause, but you could check the DLM lock  
stats in /proc/fs/lustre/ldlm/namespaces/*/lock_count on some clients  
to see how many locks they are holding, or look at the same files on  
the MDS, where the value is the total number of locks currently  
granted to all clients.
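
For example, a quick way to dump those counters (a minimal sketch; the
paths are the same wildcarded ones as above, run on a client or on the
MDS):

    # print the current lock count for each DLM namespace
    for f in /proc/fs/lustre/ldlm/namespaces/*/lock_count; do
            echo "$f: $(cat "$f")"
    done

On a client this shows how many locks that client holds per namespace;
on the MDS it is the total granted to all clients.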

> After those messages:
>
> 6009:0:(events.c:367:server_bulk_callback()) event type 4, status  
> -5, desc ffff81022ed54280

This is just fallout from the MDS being too busy to handle requests.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



