[lustre-devel] Lustre upstream client TODO list

NeilBrown neilb at suse.com
Sun Feb 11 18:09:37 PST 2018


On Mon, Feb 12 2018, Patrick Farrell wrote:

> Neil,
>
> Apologies if you've answered this elsewhere, but what's the genesis of
> your current (extremely welcome) interest in Lustre?  Some commitment
> by SUSE? 

"commitment" might be too strong a word - certainly too strong for me to
use - but "interest" is probably fair.  Some interest within SUSE.

NeilBrown


>
> Regards,
> - Patrick
> ________________________________
> From: lustre-devel <lustre-devel-bounces at lists.lustre.org> on behalf of NeilBrown <neilb at suse.com>
> Sent: Sunday, February 11, 2018 5:54:51 PM
> To: James Simmons; Lustre Development List
> Cc: Oleg Drokin
> Subject: Re: [lustre-devel] Lustre upstream client TODO list
>
> On Sun, Feb 11 2018, James Simmons wrote:
>
>> So I sent a patch upstream that laid out what most needs to be done for
>> the linux lustre client to leave staging. I placed the new text here for
>> ease of reading so you don't have to go searching for it. Feedback is
>> welcome. Hopefully posting it will make it clear what needs to be done.
>
>
> Thanks so much for putting this together and pushing it out.  I really
> appreciated it and hope to show that appreciation with patches :-)
>
> NeilBrown
>
>>
>> Currently all the work directed toward the lustre upstream client is tracked
>> at the following link:
>>
>> https://jira.hpdd.intel.com/browse/LU-9679
>>
>> Under this ticket you will see the following work items that need to be
>> addressed:
>>
>> ******************************************************************************
>> * libcfs cleanup
>> *
>> * https://jira.hpdd.intel.com/browse/LU-9859
>> *
>> * Track all the cleanups and simplification of the libcfs module. Remove
>> * functions the kernel provides. Possibly integrate some of the functionality
>> * into the kernel proper.
>> *
>> ******************************************************************************
>>
>> https://jira.hpdd.intel.com/browse/LU-100086
>>
>> LNET_MINOR conflicts with USERIO_MINOR
>>
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-8130
>>
>> Fix and simplify libcfs hash handling
>>
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-8703
>>
>> The current way we handle SMP is wrong. Platforms like ARM and KNL can have
>> core and NUMA setups with things like NUMA nodes with no cores. We need to
>> handle such cases. This work also greatly simplified the lustre SMP code.
>>
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-9019
>>
>> Replace libcfs time API with standard kernel APIs. Also migrate away from
>> jiffies. We found that jiffies can vary between nodes, which can lead to
>> corner cases that break the file system because nodes behave inconsistently.
>> So move to time64_t and ktime_t as much as possible.
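
For anyone following along, here is a rough user-space sketch of why a raw
tick count is a poor unit to share between nodes. It assumes the variation
comes from kernels built with different HZ values, which the ticket text
above does not spell out. ticks_to_msecs() and secs_to_ticks() are made-up
names for illustration, not kernel or Lustre functions.

```c
#include <stdint.h>

/*
 * Hypothetical helper: how many milliseconds 'ticks' represents on a
 * node whose kernel runs at 'hz' ticks per second.
 */
static uint64_t ticks_to_msecs(uint64_t ticks, unsigned int hz)
{
	return ticks * 1000 / hz;
}

/*
 * Hypothetical helper: convert a timeout kept in seconds (a unit that
 * means the same thing on every node) into local ticks at the last
 * moment, using the local node's 'hz'.
 */
static uint64_t secs_to_ticks(uint64_t secs, unsigned int hz)
{
	return secs * hz;
}
```

With these definitions the same 1000 ticks mean 1000 ms at HZ=1000 but
4000 ms at HZ=250, whereas a value kept in seconds is only turned into
ticks locally and so cannot be misread by a peer.
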
>>
>> ******************************************************************************
>> * Proper IB support for ko2iblnd
>> ******************************************************************************
>> https://jira.hpdd.intel.com/browse/LU-9179
>>
>> Poor performance for the ko2iblnd driver. This is related to many of the
>> patches below that are missing from the linux client.
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-9886
>>
>> Crash in upstream kiblnd_handle_early_rxs()
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-10394 / LU-10526 / LU-10089
>>
>> Default to using MEM_REG
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-10459
>>
>> throttle tx based on queue depth
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-9943
>>
>> correct WR fast reg accounting
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-10291
>>
>> remove concurrent_sends tunable
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-10213
>>
>> calculate qp max_send_wrs properly
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-9810
>>
>> use fewer CQ entries for each connection
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-10129 / LU-9180
>>
>> rework map_on_demand behavior
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-10129
>>
>> query device capabilities
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-10015
>>
>> fix race at kiblnd_connect_peer
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-9983
>>
>> allow for discontiguous fragments
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-9500
>>
>> Don't Page Align remote_addr with FastReg
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-9448
>>
>> handle empty CPTs
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-9507
>>
>> Don't Assert On Reconnect with MultiQP
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-9472
>>
>> Fix FastReg map/unmap for MLX5
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-9425
>>
>> Turn on 2 sges by default
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-8943
>>
>> Enable Multiple OPA Endpoints between Nodes
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-5718
>>
>> multiple sges for work request
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-9094
>>
>> kill timedout txs from ibp_tx_queue
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-9094
>>
>> reconnect peer for REJ_INVALID_SERVICE_ID
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-8752
>>
>> Stop MLX5 triggering a dump_cqe
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-8874
>>
>> Move ko2iblnd to latest RDMA changes
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-8875 / LU-8874
>>
>> Change to new RDMA done callback mechanism
>>
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-9164 / LU-8874
>>
>> Incorporate RDMA map/unmap APIs into ko2iblnd
>>
>> ******************************************************************************
>> * sysfs/debugfs fixes
>> *
>> * https://jira.hpdd.intel.com/browse/LU-8066
>> *
>> * The original migration to sysfs was done in haste without properly working
>> * utilities to test the changes. This covers the work to restore the proper
>> * behavior. Huge project to make this right.
>> *
>> ******************************************************************************
>>
>> https://jira.hpdd.intel.com/browse/LU-9431
>>
>> The function class_process_proc_param was used for our mass updates of proc
>> tunables. It didn't work with sysfs and it was just ugly so it was removed.
>> In the process the ability to mass update thousands of clients was lost. This
>> work restores this in a sane way.
>>
>> ------------------------------------------------------------------------------
>> https://jira.hpdd.intel.com/browse/LU-9091
>>
>> One of the major requests from users is the ability to pass parameters to a
>> sysfs file in various units. For example, we can set max_pages_per_rpc,
>> but this can vary between platforms due to different platform sizes. So you can
>> set it like max_pages_per_rpc=16MiB. The original code to handle this was written
>> before the string helpers were created, so the code doesn't follow that format,
>> but it would be easy to move to. Currently the string helpers do the reverse
>> of what we need, changing bytes to a string. We need to change a string to bytes.
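
The string-to-bytes direction described above could look roughly like the
following user-space sketch. size_str_to_bytes() is a hypothetical name, not
an existing Lustre or kernel helper, and the sketch assumes that the K/M/G/T
suffixes (optionally followed by "iB" or "B") are all powers of two.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (illustration only): convert a string such as
 * "16MiB", "4k" or "512" into a byte count.
 * Returns 0 on success, -EINVAL on malformed input, -ERANGE on overflow.
 */
static int size_str_to_bytes(const char *str, uint64_t *bytes)
{
	unsigned long long val;
	unsigned int shift = 0;
	char *end;

	/* Require a leading digit: no sign, no leading whitespace. */
	if (!str || !bytes || !isdigit((unsigned char)*str))
		return -EINVAL;

	errno = 0;
	val = strtoull(str, &end, 10);
	if (errno == ERANGE)
		return -ERANGE;

	/* Optional unit prefix, treated as a power of two (assumption). */
	switch (toupper((unsigned char)*end)) {
	case 'K': shift = 10; end++; break;
	case 'M': shift = 20; end++; break;
	case 'G': shift = 30; end++; break;
	case 'T': shift = 40; end++; break;
	default: break;
	}

	/* Optional "iB" after a prefix, or a plain "B". */
	if (shift && strncmp(end, "iB", 2) == 0)
		end += 2;
	else if (*end == 'B')
		end++;

	/* Anything left over is garbage. */
	if (*end != '\0')
		return -EINVAL;

	/* Reject values that would overflow 64 bits once shifted. */
	if (shift && val > (UINT64_MAX >> shift))
		return -ERANGE;

	*bytes = (uint64_t)val << shift;
	return 0;
}
```

So "16MiB" would come back as 16777216 bytes, and a sysfs store handler
could then divide by the local page size to get a page count. An in-kernel
version would presumably build on existing parsing helpers rather than
strtoull().
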
>>
>> ******************************************************************************
>> * Proper user land to kernel space interface for Lustre
>> *
>> * https://jira.hpdd.intel.com/browse/LU-9680
>> *
>> ******************************************************************************
>>
>> https://jira.hpdd.intel.com/browse/LU-8915
>>
>> Don't use linux list structure as user land arguments for lnet selftest.
>> This code is pretty poor quality and really needs to be reworked.
>>
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-8834
>>
>> The lustre ioctl LL_IOC_FUTIMES_3 is very generic. Need to either work with
>> other file systems with similar functionality and make a common syscall
>> interface or rework our server code to automagically do it for us.
>>
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-6202
>>
>> Clean up ioctl handling. We have many obsolete ioctls. Also the way we do
>> ioctls can be changed over to netlink. This also has the benefit of working
>> better with HPC systems that do IO forwarding. Such systems don't like ioctls
>> very well.
>>
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-9667
>>
>> More cleanups by making our utilities use sysfs instead of ioctls for LNet.
>> Also it has been requested to move the remaining ioctls to the netlink API.
>>
>> ******************************************************************************
>> * Misc
>> ******************************************************************************
>>
>> ------------------------------------------------------------------------------
>> https://jira.hpdd.intel.com/browse/LU-9855
>>
>> Clean up obdclass preprocessor code. One of the major eye sores is the various
>> pointer redirections and macros used by the obdclass. This makes the code very
>> difficult to understand. It was requested by Al Viro to clean this up before
>> we leave staging.
>>
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-9633
>>
>> Migrate to sphinx kernel-doc style comments. Add documents in Documentation.
>>
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-6142
>>
>> Possible remaining coding style fixes. Remove dead code. Enforce kernel code
>> style. Other minor misc cleanups...
>>
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-8837
>>
>> Separate client/server functionality. Functions only used by server can be
>> removed from client. Most of this has been done but we need an inspection of
>> the code to make sure.
>>
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-8964
>>
>> Lustre client readahead/writeback control needs to better suit what the kernel
>> provides. This is currently being explored. We could end up replacing the CLIO
>> readahead abstraction with the kernel proper version.
>>
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-9862
>>
>> Patch that landed for LU-7890 leads to static checker errors
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-9868
>>
>> dcache/namei fixes for lustre
>> ------------------------------------------------------------------------------
>>
>> https://jira.hpdd.intel.com/browse/LU-10467
>>
>> use standard linux wait_event macros (work by Neil Brown)
>>
>> ------------------------------------------------------------------------------