[lustre-discuss] Status of LU-8703 for Knights Landing

Prout, Andrew - LLSC - MITLL aprout at ll.mit.edu
Mon Feb 6 14:43:58 PST 2017


Patrick,

Yes, it was a hard stop: libcfs refused to insmod. I expect the issue would not appear if you have MCDRAM configured in cache mode, so it would depend on how you have that set up.



Thanks, I wasn't aware of the module parameters for bypassing the problematic detection code. Setting "cpu_pattern" worked nicely.



Andrew Prout

Lincoln Laboratory Supercomputing Center

MIT Lincoln Laboratory

244 Wood Street, Lexington, MA 02420



From: Patrick Farrell [mailto:paf at cray.com]
Sent: Wednesday, February 01, 2017 4:27 PM
To: Prout, Andrew - LLSC - MITLL; lustre-discuss at lists.lustre.org
Subject: Re: Status of LU-8703 for Knights Landing



Andrew,



Are they really just not working?  I didn't see that with KNL. The default CPT generated without the fixes from LU-8703 is very weird, but it didn't affect performance much; the real NUMA-ness of KNL processors seems to be minimal, despite the various NUMA-related configuration options. Then again, Cray systems are unusual, and I don't think I ever saw an empty NUMA node (possibly something we fix in the BIOS).  Anyway, you should be able to work around this without patching your client: just set some module parameters before loading the modules and starting Lustre.



I can think of two things that should work; both are module parameters for the libcfs module, I believe.  I haven't tried this, so it's possible your error is coming earlier in the loading process, but I think not, based on the message.



1. Limit yourself to 1 partition, by setting cpu_npartitions to 1.

static int cpu_npartitions;
module_param(cpu_npartitions, int, 0444);
MODULE_PARM_DESC(cpu_npartitions, "# of CPU partitions");



2. Or, you could draw up a CPU partition table yourself.  Parameter name is cpu_pattern.



Here's the code describing that:

/**
 * modparam for setting CPU partitions patterns:
 *
 * i.e: "0[0,1,2,3] 1[4,5,6,7]", number before bracket is CPU partition ID,
 *      number in bracket is processor ID (core or HT)
 *
 * i.e: "N 0[0,1] 1[2,3]" the first character 'N' means numbers in bracket
 *       are NUMA node ID, number before bracket is CPU partition ID.
 *
 * i.e: "N", shortcut expression to create CPT from NUMA & CPU topology
 *
 * NB: If user specified cpu_pattern, cpu_npartitions will be ignored
 */
static char *cpu_pattern = "N";
module_param(cpu_pattern, charp, 0444);
MODULE_PARM_DESC(cpu_pattern, "CPU partitions pattern");



Notice the default pattern is N, but you can override it.



(Code references from libcfs/libcfs/linux/linux-cpu.c in Lustre.)
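
As an illustration of the explicit pattern syntax quoted above, here is a minimal sketch in C that builds a cpu_pattern string by splitting CPUs 0..ncpus-1 into equal contiguous partitions. The helper names build_cpu_pattern and append are hypothetical and are not part of libcfs. Contiguous CPU numbering is only an assumption and has not been checked against any particular KNL core layout.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: append formatted text at buf[*used].
 * Returns 0 on success, -1 if the text would not fit in the buffer. */
static int append(char *buf, size_t len, size_t *used, const char *fmt, ...)
{
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = vsnprintf(buf + *used, len - *used, fmt, ap);
    va_end(ap);
    if (n < 0 || (size_t)n >= len - *used)
        return -1;
    *used += (size_t)n;
    return 0;
}

/* Hypothetical helper: write a pattern such as "0[0,1,2,3] 1[4,5,6,7]"
 * into buf, where the number before each bracket is the partition ID and
 * the numbers inside are processor IDs.
 * Assumes ncpus divides evenly into nparts contiguous partitions.
 * Returns the string length, or -1 on bad arguments or a too-small buffer. */
int build_cpu_pattern(char *buf, size_t len, int ncpus, int nparts)
{
    size_t used = 0;
    int per_part, part, cpu, first;

    if (buf == NULL || len == 0 || ncpus <= 0 || nparts <= 0 ||
        ncpus % nparts != 0)
        return -1;
    per_part = ncpus / nparts;

    for (part = 0; part < nparts; part++) {
        first = part * per_part;
        /* Partitions are separated by a single space. */
        if (append(buf, len, &used, "%s%d[", part ? " " : "", part) != 0)
            return -1;
        for (cpu = first; cpu < first + per_part; cpu++) {
            if (append(buf, len, &used, "%s%d", cpu > first ? "," : "", cpu) != 0)
                return -1;
        }
        if (append(buf, len, &used, "]") != 0)
            return -1;
    }
    return (int)used;
}
```

Calling build_cpu_pattern(buf, sizeof(buf), 8, 2) reproduces the first example from the comment. The resulting string could then be passed to libcfs at load time, for instance through a modprobe options line such as: options libcfs cpu_pattern="0[0,1,2,3] 1[4,5,6,7]" (which file under /etc/modprobe.d to use is site-specific).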



Either of those should let you get past the error, no need to carry patches.  I can't speak to the production-readiness of the patches, but I'd definitely go the module parameter route if it were my system.



- Patrick

  _____

From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of Prout, Andrew - LLSC - MITLL <aprout at ll.mit.edu>
Sent: Wednesday, February 1, 2017 3:11:07 PM
To: lustre-discuss at lists.lustre.org
Subject: [lustre-discuss] Status of LU-8703 for Knights Landing



Does anyone know the production-readiness of the patches attached to LU-8703 to fix issues with Lustre on Xeon Phi Knights Landing hardware? We're considering merging them into our 2.9.0 client to get it working on our KNL nodes.



Andrew Prout

Lincoln Laboratory Supercomputing Center

MIT Lincoln Laboratory

244 Wood Street, Lexington, MA 02420
