[lustre-discuss] Lustre 2.10.1 + RHEL7 Page Allocation Failures

Dilger, Andreas andreas.dilger at intel.com
Wed Nov 29 16:47:35 PST 2017


In particular, see the patch https://review.whamcloud.com/30164

LU-10133 o2iblnd: fall back to vmalloc for mlx4/mlx5

If a large QP is allocated with kmalloc(), but fails due to memory
fragmentation, fall back to vmalloc() to handle the allocation.
This is done in the upstream kernel, but was only fixed in mlx4
in the RHEL7.3 kernel, and in neither mlx4 nor mlx5 in the RHEL6 kernel.
Also fix mlx5 for SLES12 kernels.

Test-Parameters: trivial
Signed-off-by: Andreas Dilger <andreas.dilger at intel.com>
Change-Id: Ie74800edd27bf4c3210724079cbebbae532d1318
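
For reference, the change is just a fallback on the allocation path, along
the lines of the sketch below (simplified, with illustrative function names
rather than the actual mlx4/mlx5 driver code):

static void *qp_buf_alloc(size_t size)
{
	void *buf;

	/* Try physically contiguous memory first.  __GFP_NOWARN avoids
	 * the page-allocation-failure stack trace when a large chunk
	 * (e.g. order 8) is unavailable due to fragmentation. */
	buf = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
	if (buf == NULL)
		/* Fall back to vmalloc(), which needs only order-0
		 * pages and so succeeds even on fragmented memory. */
		buf = vmalloc(size);

	return buf;
}

static void qp_buf_free(void *buf)
{
	kvfree(buf);	/* frees either a kmalloc or a vmalloc pointer */
}

Newer upstream kernels wrap this same try-kmalloc-then-fall-back-to-vmalloc
pattern in kvmalloc()/kvfree(), which is why the problem is already fixed
there.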

On Nov 29, 2017, at 06:09, Jones, Peter A <peter.a.jones at intel.com> wrote:
> 
> Charles
> 
> That ticket is completely open, so you do have access to everything. As I understand it, the options are either to use the latest MOFED update rather than relying on the in-kernel OFED (which I believe is the advice usually provided by Mellanox anyway) or to apply the kernel patch Andreas has created that is referenced in the ticket.
> 
> Peter
> 
> On 2017-11-29, 2:50 AM, "lustre-discuss on behalf of Charles A Taylor" <lustre-discuss-bounces at lists.lustre.org on behalf of chasman at ufl.edu> wrote:
> 
>> 
>> Hi All,
>> 
>> We recently upgraded from Lustre 2.5.3.90 on EL6 to 2.10.1 on EL7 (details below) but have hit what looks like LU-10133 (order 8 page allocation failures).
>> 
>> We don’t have access to look at the JIRA ticket in more detail, but from what we can tell the fix is to change from vmalloc() to vmalloc_array() in the mlx4 drivers.  However, the vmalloc_array() infrastructure is in an upstream (far upstream) kernel, so I’m not sure when we’ll see that fix.
>> 
>> While this may not be a Lustre issue directly, I know we can’t be the only Lustre site running 2.10.1 over IB on Mellanox ConnectX-3 HCAs.  So far we have tried increasing vm.min_free_kbytes to 8 GB, but that does not help.  We have zone_reclaim_mode disabled (for other reasons that may not be valid under EL7), but since order 8 chunks get depleted on both NUMA nodes I’m not sure enabling it is the answer either (though we have not tried that yet).
>> 
>> [root at ufrcmds1 ~]# cat /proc/buddyinfo 
>> Node 0, zone      DMA      1      0      0      0      2      1      1      0      1      1      3 
>> Node 0, zone    DMA32   1554  13496  11481   5108    150      0      0      0      0      0      0 
>> Node 0, zone   Normal 114119 208080  78468  35679   6215    690      0      0      0      0      0 
>> Node 1, zone   Normal  81295 184795 106942  38818   4485    293   1653      0      0      0      0 
>> 
>> (The columns are counts of free blocks of order 0 through 10; an order 8 block is 256 contiguous pages, i.e. 1 MiB, and that column is zero in both Normal zones above.)
>> 
>> I’m wondering whether other sites are hitting this and, if so, what you are doing to work around the issue on your OSSs.
>> 
>> Regards,
>> 
>> Charles Taylor
>> UF Research Computing
>> 
>> 
>> Some Details:
>> -------------------
>> OS: RHEL 7.4 (Linux ufrcoss28.ufhpc 3.10.0-693.2.2.el7_lustre.x86_64)
>> Lustre: 2.10.1 (lustre-2.10.1-1.el7.x86_64)
>> Clients: ~1400 (still running 2.5.3.90 but we are in the process of upgrading)
>> Servers: 10 HA OSS pairs (20 OSSs)
>>    128 GB RAM
>>    6 OSTs (8+2 RAID-6) per OSS 
>>    Mellanox ConnectX-3 IB/VPI HCAs 
>>    RedHat Native IB Stack (i.e. not MOFED)
>>    mlx4_core driver:
>>       filename:       /lib/modules/3.10.0-693.2.2.el7_lustre.x86_64/kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_core.ko.xz
>>       version:        2.2-1
>>       license:        Dual BSD/GPL
>>       description:    Mellanox ConnectX HCA low-level driver
>>       author:         Roland Dreier
>>       rhelversion:    7.4
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation
