[Lustre-discuss] Slow read performance after ~700MB-2GB
Jason Rappleye
jason.rappleye at nasa.gov
Wed Jun 9 13:51:46 PDT 2010
On Jun 9, 2010, at 12:45 PM, Andreas Dilger wrote:
> On 2010-06-09, at 13:41, Alexander Oltu wrote:
>> On Wed, 9 Jun 2010 10:29:36 -0700 Jason Rappleye wrote:
>>> Is vm.zone_reclaim_mode set to a value other than zero?
>>
>> Yes, it was 1, as soon as I set it to 0 the problem disappears.
>
> Interesting, I have never heard of this problem before. Is this a
> client-side parameter, or on the server?
Client. In 1.6.x Lustre doesn't use the page cache on the server
(right?), so I don't think this will cause a problem. At least, not
this particular problem.
When we first saw this problem a few weeks ago it appeared that client
processes were stuck in uninterruptible sleep in blk_congestion_wait,
but upon further examination we saw they were still issuing 1-2 I/Os
per second. The kernel stack trace looked like this:
<ffffffff8013d6ec>{internal_add_timer+21}
<ffffffff8030fbc4>{schedule_timeout+138}
<ffffffff8013def0>{process_timeout+0}
<ffffffff8030f3ec>{io_schedule_timeout+88}
<ffffffff801f1d74>{blk_congestion_wait+102}
<ffffffff80148d46>{autoremove_wake_function+0}
<ffffffff8016851d>{throttle_vm_writeout+33}
<ffffffff8016aa0e>{remove_mapping+133}
<ffffffff8016b8e8>{shrink_zone+3367}
<ffffffff80218799>{find_next_bit+96}
<ffffffff8016c435>{zone_reclaim+430}
<ffffffff8843c3ba>{:ptlrpc:ldlm_lock_decref+154}
<ffffffff8852df5a>{:osc:cache_add_extent+1178}
<ffffffff8860f838>{:lustre:ll_removepage+488}
<ffffffff8852152a>{:osc:osc_prep_async_page+426}
<ffffffff8860c953>{:lustre:llap_shrink_cache+1715}
<ffffffff88524224>{:osc:osc_queue_group_io+644}
<ffffffff801671a2>{get_page_from_freelist+222}
<ffffffff8016756d>{__alloc_pages+113}
<ffffffff80162416>{add_to_page_cache+57}
<ffffffff80162c49>{grab_cache_page_nowait+53}
<ffffffff8860e368>{:lustre:ll_readahead+2584}
<ffffffff8851db55>{:osc:osc_check_rpcs+773}
<ffffffff8012c52c>{__wake_up+56}
<ffffffff88515db1>{:osc:loi_list_maint+225}
<ffffffff88330288>{:libcfs:cfs_alloc+40}
<ffffffff88615557>{:lustre:ll_readpage+4775}
<ffffffff885b3109>{:lov:lov_fini_enqueue_set+585}
<ffffffff88438cc7>{:ptlrpc:ldlm_lock_add_to_lru+119}
<ffffffff8843719e>{:ptlrpc:lock_res_and_lock+190}
<ffffffff883d792f>{:obdclass:class_handle_unhash_nolock+207}
<ffffffff8843bb1c>{:ptlrpc:ldlm_lock_decref_internal+1356}
<ffffffff885b235f>{:lov:lov_finish_set+1695}
<ffffffff801629bd>{do_generic_mapping_read+525}
<ffffffff8016476e>{file_read_actor+0}
<ffffffff8016328b>{__generic_file_aio_read+324}
<ffffffff80164576>{generic_file_readv+143}
<ffffffff885b07c9>{:lov:lov_merge_lvb+281}
<ffffffff80148d46>{autoremove_wake_function+0}
<ffffffff8019f156>{__touch_atime+118}
<ffffffff885ef821>{:lustre:ll_file_readv+6385}
<ffffffff80216f4f>{__up_read+16}
<ffffffff885efada>{:lustre:ll_file_read+26}
<ffffffff801878f0>{vfs_read+212}
<ffffffff80187cd0>{sys_read+69}
<ffffffff8010ae5e>{system_call+126}
zone_reclaim_mode is set in mm/page_alloc.c:build_zonelists by
examining the distance between nodes. The value of 20 in the old BIOS
was fine; the remote distance is set to 21 in the new BIOS, and that
put it over the edge. We're working with the vendor to understand why
that change was made. In any case, setting it back to zero works
around the issue.
Just to be clear, the remote node distance in and of itself doesn't
have anything to do with the problem - setting zone_reclaim_mode to
one on a host with the original distances is sufficient to reproduce
the problem.
When the problem occurs, echoing 3 into drop_caches provides some
temporary relief, but the kernel will eventually re-enter
zone_reclaim.
>
>>> This sounds a lot like a problem we recently experienced when a BIOS
>>> upgrade changed the ACPI SLIT table, which specifies the
>>> distances between NUMA nodes in a system. It put the remote node
>>> distance over the threshold the kernel uses to decide whether or not
>>> to enable the inline zone reclaim path. At least on SLES, Lustre
>>> doesn't seem to be able to free up pages in the page cache in this
>>> path, and performance dropped to 2-4MB/s. In my test case I was
>>> issuing 2MB I/Os and the kernel only let 1-2 I/Os trickle out per
>>> second, so that matches up with what you're seeing.
>
> Jason, do you have enough of an understanding of this codepath to
> know why Lustre is not freeing pages in this case? Is it because
> Lustre just doesn't have a VM callback that frees pages at all, or
> is it somehow ignoring the requests from the kernel to free up the
> pages?
Kind of. The stack trace above definitely shows that the process
enters the inline zone reclaim path. zone_reclaim itself calls
shrink_slab, which asks each cache that has been registered via
set_shrinker to free up memory. At least, that's how it works in the
SLES 10 kernel.
Looking at our 1.6.7 client sources (no, we don't run 1.6.7 on our
servers!), llite registers the cache using ll_register_cache, which
calls the kernel register_cache function. It ends up being a no-op if
register_cache isn't found by autoconf. My understanding is that the
interface changed at some point, and some versions of Lustre don't
support the set_shrinker interface.
It *looks* like 1.8.2 does the right thing, at least in terms of
registering the cache. Apparently register_shrinker is the function to
call in some kernels. autoconf checks for its existence, and if it's
present, Lustre wraps the call with its own version of set_shrinker (in
lustre/include/linux/lustre_compat25.h). Otherwise, it calls the
kernel's set_shrinker directly.
I haven't tried to reproduce this problem with a 1.8.2 client yet, but
I should be able to sneak in a test this week and see if it works as
expected.
j
--
Jason Rappleye
System Administrator
NASA Advanced Supercomputing Division
NASA Ames Research Center
Moffett Field, CA 94035