<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><div class=""><br class=""></div><div class=""><div class="">Hi All,</div><div class=""><br class=""></div>We recently upgraded from Lustre 2.5.3.90 on EL6 to 2.10.1 on EL7 (details below) and have hit what looks like LU-10133 (order 8 page allocation failures).<br class=""><br class=""><div class=""><div class="">We don’t have access to the JIRA ticket to look at it in more detail, but from what we can tell the fix is to change from vmalloc() to vmalloc_array() in the mlx4 drivers. However, the vmalloc_array() infrastructure is in an upstream (far upstream) kernel, so I’m not sure when we’ll see that fix.</div></div><div class=""><br class=""></div><div class="">While this may not be a Lustre issue directly, I know we can’t be the only Lustre site running 2.10.1 over IB on Mellanox ConnectX-3 HCAs. So far we have tried increasing vm.min_free_kbytes to 8GB, but that does not help. vm.zone_reclaim_mode is disabled (for other reasons that may not be valid under EL7), but order 8 chunks get depleted on both NUMA nodes, so I’m not sure enabling it is the answer either (though we have not tried it yet).</div><div class=""><br class=""></div><div class="">[root@ufrcmds1 ~]# cat /proc/buddyinfo <br class=""><font face="Courier" class="">Node 0, zone DMA 1 0 0 0 2 1 1 0 1 1 3 <br class="">Node 0, zone DMA32 1554 13496 11481 5108 150 0 0 0 0 0 0 <br class="">Node 0, zone Normal 114119 208080 78468 35679 6215 690 0 0 0 0 0 <br class="">Node 1, zone Normal 81295 184795 106942 38818 4485 293 1653 0 0 0 0 </font><br class=""><br class=""></div><div class="">I’m wondering if other sites are hitting this and, if so, what you are doing to work around the issue on your OSSs. 
</div><div class=""><br class=""></div><div class="">Regards,</div><div class=""><br class=""></div><div class="">Charles Taylor</div><div class="">UF Research Computing</div><div class=""><br class=""></div><div class=""><br class=""></div><div class="">Some Details:</div><div class="">-------------------</div><div class="">OS: RHEL 7.4 (Linux ufrcoss28.ufhpc 3.10.0-693.2.2.el7_lustre.x86_64)<br class="">Lustre: 2.10.1 (lustre-2.10.1-1.el7.x86_64)<br class="">Clients: ~1400 (still running 2.5.3.90, but we are in the process of upgrading)<br class="">Servers: 10 HA OSS pairs (20 OSSs)<br class=""> 128 GB RAM</div><div class=""> 6 OSTs (8+2 RAID-6) per OSS <br class=""> Mellanox ConnectX-3 IB/VPI HCAs <br class=""> Red Hat native IB stack (i.e. not MOFED)<br class=""> mlx4_core driver:<br class=""> filename: /lib/modules/3.10.0-693.2.2.el7_lustre.x86_64/kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_core.ko.xz<br class=""> version: 2.2-1<br class=""> license: Dual BSD/GPL<br class=""> description: Mellanox ConnectX HCA low-level driver<br class=""> author: Roland Dreier<br class=""> rhelversion: 7.4</div></div></body></html>