[Lustre-discuss] Slow read performance after ~700MB-2GB

Andreas Dilger andreas.dilger at oracle.com
Thu Jun 10 15:46:11 PDT 2010


On 2010-06-10, at 08:48, Cory Spitz wrote:
> Slightly off-topic, but did anyone else notice that readahead is triggering the shrinking and page writeout?  ll_read_ahead_page() clears __GFP_WAIT but it seems sane to me that it should also drop __GFP_IO.  In my opinion, Lustre
> shouldn't speculatively force other pages out.  Only when there is an actual,
> demonstrated need, should it force out the other pages.

We used to have a kernel patch (and more recently I implemented this using generic kernel EXPORT_FUNCTION() operations) that provides grab_cache_page_nowait_gfp(), which allows specifying the GFP mask when allocating pages for readahead.  Without that, the kernel uses the GFP mask from the address space, which we have no control over.
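
For reference, here is a minimal sketch of what such a helper can look like -- this is not the actual patch, just an illustration; trylock_page() is the newer spelling of the old TestSetPageLocked() dance, and the __GFP_FS masking mirrors what the in-kernel grab_cache_page_nowait() already does:

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/gfp.h>

/*
 * Sketch only: like grab_cache_page_nowait(), but with an explicit
 * gfp mask instead of mapping_gfp_mask(mapping), so a readahead
 * caller can drop __GFP_WAIT/__GFP_IO and avoid forcing writeout.
 */
static struct page *
grab_cache_page_nowait_gfp(struct address_space *mapping,
                           pgoff_t index, gfp_t gfp_mask)
{
        struct page *page = find_get_page(mapping, index);

        if (page) {
                /* already cached: take the page lock without sleeping */
                if (trylock_page(page))
                        return page;
                page_cache_release(page);
                return NULL;
        }

        /* allocate with the caller's mask, not the address space's */
        page = alloc_page(gfp_mask & ~__GFP_FS);
        if (page && add_to_page_cache_lru(page, mapping, index,
                                          gfp_mask & ~__GFP_FS)) {
                page_cache_release(page);
                page = NULL;
        }
        return page;
}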

That said, preventing readahead from applying memory pressure also has a negative side effect.  When the client memory is full (i.e. all the time) there is NO readahead generated, because the readahead grab_cache_page_nowait_gfp() calls always fail, and this degrades performance significantly, since the reads are now synchronous and a single stream, instead of pipelined.
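
To make that failure mode concrete, a readahead loop built on such a helper degrades roughly as follows (the function names other than grab_cache_page_nowait_gfp() are hypothetical, purely for illustration):

/*
 * Illustration only.  When the no-wait allocation fails under memory
 * pressure the window collapses to zero pages, and every read falls
 * back to a synchronous, one-page-at-a-time round trip instead of a
 * pipelined stream of RPCs.
 */
static unsigned long
readahead_window(struct address_space *mapping, pgoff_t start,
                 unsigned long nr_pages, gfp_t ra_gfp)
{
        unsigned long i;

        for (i = 0; i < nr_pages; i++) {
                struct page *page;

                page = grab_cache_page_nowait_gfp(mapping, start + i,
                                                  ra_gfp);
                if (page == NULL)
                        break;  /* memory full: nothing gets prefetched */

                queue_async_read_rpc(page);     /* hypothetical async queue */
        }
        return i;       /* 0 => the caller is reduced to synchronous reads */
}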

While it is true that some speculative readahead may result in evicting other useful pages from cache, it is more likely to be prefetching useful pages that the current process wants to use immediately and evicting old/useless pages.  

The readahead algorithms definitely need some improvement, and it is possible that they are being over-zealous here, but it isn't possible to say for certain in this case.

I'd say the core problem is that no reclaim is being triggered and/or the reclaim is deadlocked on the cache cleaning, and that is the first issue to focus on here.


> Jason Rappleye wrote:
> [...]
>> When we first saw this problem a few weeks ago it appeared that client  
>> processes were stuck in uninterruptible sleep in blk_congestion_wait,  
>> but upon further examination we saw they were still issuing 1-2 I/Os  
>> per second. The kernel stack trace looked like this:
>> 
>> <ffffffff8013d6ec>{internal_add_timer+21}
>> <ffffffff8030fbc4>{schedule_timeout+138}
>> <ffffffff8013def0>{process_timeout+0}
>> <ffffffff8030f3ec>{io_schedule_timeout+88}
>> <ffffffff801f1d74>{blk_congestion_wait+102}
>> <ffffffff80148d46>{autoremove_wake_function+0}
>> <ffffffff8016851d>{throttle_vm_writeout+33}
>> <ffffffff8016aa0e>{remove_mapping+133}
>> <ffffffff8016b8e8>{shrink_zone+3367}
>> <ffffffff80218799>{find_next_bit+96}
>> <ffffffff8016c435>{zone_reclaim+430}
>> <ffffffff8843c3ba>{:ptlrpc:ldlm_lock_decref+154}
>> <ffffffff8852df5a>{:osc:cache_add_extent+1178}
>> <ffffffff8860f838>{:lustre:ll_removepage+488}
>> <ffffffff8852152a>{:osc:osc_prep_async_page+426}
>> <ffffffff8860c953>{:lustre:llap_shrink_cache+1715}
>> <ffffffff88524224>{:osc:osc_queue_group_io+644}
>> <ffffffff801671a2>{get_page_from_freelist+222}
>> <ffffffff8016756d>{__alloc_pages+113}
>> <ffffffff80162416>{add_to_page_cache+57}
>> <ffffffff80162c49>{grab_cache_page_nowait+53}
>> <ffffffff8860e368>{:lustre:ll_readahead+2584}
>> <ffffffff8851db55>{:osc:osc_check_rpcs+773}
>> <ffffffff8012c52c>{__wake_up+56}
>> <ffffffff88515db1>{:osc:loi_list_maint+225}
>> <ffffffff88330288>{:libcfs:cfs_alloc+40}
>> <ffffffff88615557>{:lustre:ll_readpage+4775}
>> <ffffffff885b3109>{:lov:lov_fini_enqueue_set+585}
>> <ffffffff88438cc7>{:ptlrpc:ldlm_lock_add_to_lru+119}
>> <ffffffff8843719e>{:ptlrpc:lock_res_and_lock+190}
>> <ffffffff883d792f>{:obdclass:class_handle_unhash_nolock+207}
>> <ffffffff8843bb1c>{:ptlrpc:ldlm_lock_decref_internal+1356}
>> <ffffffff885b235f>{:lov:lov_finish_set+1695}
>> <ffffffff801629bd>{do_generic_mapping_read+525}
>> <ffffffff8016476e>{file_read_actor+0}
>> <ffffffff8016328b>{__generic_file_aio_read+324}
>> <ffffffff80164576>{generic_file_readv+143}
>> <ffffffff885b07c9>{:lov:lov_merge_lvb+281}
>> <ffffffff80148d46>{autoremove_wake_function+0}
>> <ffffffff8019f156>{__touch_atime+118}
>> <ffffffff885ef821>{:lustre:ll_file_readv+6385}
>> <ffffffff80216f4f>{__up_read+16}
>> <ffffffff885efada>{:lustre:ll_file_read+26}
>> <ffffffff801878f0>{vfs_read+212}
>> <ffffffff80187cd0>{sys_read+69}
>> <ffffffff8010ae5e>{system_call+126}
>> 
> [...]


Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.



