[Lustre-discuss] Slow read performance after ~700MB-2GB
Jason Rappleye
jason.rappleye at nasa.gov
Wed Jun 9 13:51:46 PDT 2010
On Jun 9, 2010, at 12:45 PM, Andreas Dilger wrote:
> On 2010-06-09, at 13:41, Alexander Oltu wrote:
>> On Wed, 9 Jun 2010 10:29:36 -0700 Jason Rappleye wrote:
>>> Is vm.zone_reclaim_mode set to a value other than zero?
>>
>> Yes, it was 1, as soon as I set it to 0 the problem disappears.
>
> Interesting, I have never heard of this problem before. Is this a
> client-side parameter, or on the server?
Client. In 1.6.x Lustre doesn't use the page cache on the server
(right?), so I don't think this will cause a problem. At least, not
this particular problem.
When we first saw this problem a few weeks ago it appeared that client
processes were stuck in uninterruptible sleep in blk_congestion_wait,
but upon further examination we saw they were still issuing 1-2 I/Os
per second. The kernel stack trace looked like this:
<ffffffff8013d6ec>{internal_add_timer+21}
<ffffffff8030fbc4>{schedule_timeout+138}
<ffffffff8013def0>{process_timeout+0}
<ffffffff8030f3ec>{io_schedule_timeout+88}
<ffffffff801f1d74>{blk_congestion_wait+102}
<ffffffff80148d46>{autoremove_wake_function+0}
<ffffffff8016851d>{throttle_vm_writeout+33}
<ffffffff8016aa0e>{remove_mapping+133}
<ffffffff8016b8e8>{shrink_zone+3367}
<ffffffff80218799>{find_next_bit+96}
<ffffffff8016c435>{zone_reclaim+430}
<ffffffff8843c3ba>{:ptlrpc:ldlm_lock_decref+154}
<ffffffff8852df5a>{:osc:cache_add_extent+1178}
<ffffffff8860f838>{:lustre:ll_removepage+488}
<ffffffff8852152a>{:osc:osc_prep_async_page+426}
<ffffffff8860c953>{:lustre:llap_shrink_cache+1715}
<ffffffff88524224>{:osc:osc_queue_group_io+644}
<ffffffff801671a2>{get_page_from_freelist+222}
<ffffffff8016756d>{__alloc_pages+113}
<ffffffff80162416>{add_to_page_cache+57}
<ffffffff80162c49>{grab_cache_page_nowait+53}
<ffffffff8860e368>{:lustre:ll_readahead+2584}
<ffffffff8851db55>{:osc:osc_check_rpcs+773}
<ffffffff8012c52c>{__wake_up+56}
<ffffffff88515db1>{:osc:loi_list_maint+225}
<ffffffff88330288>{:libcfs:cfs_alloc+40}
<ffffffff88615557>{:lustre:ll_readpage+4775}
<ffffffff885b3109>{:lov:lov_fini_enqueue_set+585}
<ffffffff88438cc7>{:ptlrpc:ldlm_lock_add_to_lru+119}
<ffffffff8843719e>{:ptlrpc:lock_res_and_lock+190}
<ffffffff883d792f>{:obdclass:class_handle_unhash_nolock+207}
<ffffffff8843bb1c>{:ptlrpc:ldlm_lock_decref_internal+1356}
<ffffffff885b235f>{:lov:lov_finish_set+1695}
<ffffffff801629bd>{do_generic_mapping_read+525}
<ffffffff8016476e>{file_read_actor+0}
<ffffffff8016328b>{__generic_file_aio_read+324}
<ffffffff80164576>{generic_file_readv+143}
<ffffffff885b07c9>{:lov:lov_merge_lvb+281}
<ffffffff80148d46>{autoremove_wake_function+0}
<ffffffff8019f156>{__touch_atime+118}
<ffffffff885ef821>{:lustre:ll_file_readv+6385}
<ffffffff80216f4f>{__up_read+16}
<ffffffff885efada>{:lustre:ll_file_read+26}
<ffffffff801878f0>{vfs_read+212}
<ffffffff80187cd0>{sys_read+69}
<ffffffff8010ae5e>{system_call+126}
zone_reclaim_mode is set in mm/page_alloc.c:build_zonelists by
examining the distance between nodes. The value of 20 in the old BIOS
was fine; the remote distance is set to 21 in the new BIOS, and that
put it over the edge. We're working with the vendor to understand why
that change was made. In any case, setting it back to zero works
around the issue.
Just to be clear, the remote node distance in and of itself doesn't
have anything to do with the problem - setting zone_reclaim_mode to
one on a host with the original distances is sufficient to reproduce
the problem.
When the problem occurs, echoing 3 into drop_caches provides some
temporary relief, but the kernel will eventually re-enter
zone_reclaim.
>
>>> This sounds a lot like a problem we recently experienced when a BIOS
>>> upgrade changed the ACPI SLIT table, which specifies the
>>> distances between NUMA nodes in a system. It put the remote node
>>> distance over the threshold the kernel uses to decide whether or not
>>> to enable the inline zone reclaim path. At least on SLES, Lustre
>>> doesn't seem to be able to free up pages in the page cache in this
>>> path, and performance dropped to 2-4MB/s. In my test case I was
>>> issuing 2MB I/Os and the kernel only let 1-2 I/Os trickle out per
>>> second, so that matches up with what you're seeing.
>
> Jason, do you have enough of an understanding of this codepath to
> know why Lustre is not freeing pages in this case? Is it because
> Lustre just doesn't have a VM callback that frees pages at all, or
> is it somehow ignoring the requests from the kernel to free up the
> pages?
Kind of. The stack trace above definitely shows that the process
enters the inline zone reclaim path. zone_reclaim itself calls
shrink_slab, which asks each cache that has been registered via
set_shrinker to free up memory. At least, that's how it works in the
SLES 10 kernel.
Looking at our 1.6.7 client sources (no, we don't run 1.6.7 on our
servers!), llite registers the cache using ll_register_cache, which
calls the kernel register_cache function. It ends up being a no-op if
register_cache isn't found by autoconf. My understanding is that the
interface changed at some point, and some versions of Lustre don't
support the set_shrinker interface.
It *looks* like 1.8.2 does the right thing, at least in terms of
registering the cache. Apparently register_shrinker is the function to
call in some kernels. autoconf checks for its existence, and if it's
present, Lustre wraps the call with its own version of set_shrinker (in
lustre/include/linux/lustre_compat25.h). Otherwise, it calls the
kernel's set_shrinker directly.
I haven't tried to reproduce this problem with a 1.8.2 client yet, but
I should be able to sneak in a test this week and see if it works as
expected.
j
--
Jason Rappleye
System Administrator
NASA Advanced Supercomputing Division
NASA Ames Research Center
Moffett Field, CA 94035