[lustre-discuss] LNET Self-test
tegner at foi.se
Wed Feb 8 00:52:23 PST 2017
Thanks a lot!
A related question: is it possible to use the result from the "ping"
test to verify the latency obtained from openmpi? Or, how do I know if
the result from the "ping" test is "acceptable"?
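For comparison with an MPI latency benchmark, a latency-oriented session can be run with the `ping` test type instead of `brw`. A minimal sketch, reusing the node addresses from this thread (the group name `lat_ping` and the concurrency value are illustrative assumptions, not from the original posts):

```shell
# LNET selftest "ping" sketch: ping sends small RPCs, so the RPC
# rate (not the bandwidth numbers) is what relates to latency.
# Assumes the same two o2ib nodes as in the read test below.
export LST_SESSION=$$
lst new_session ping
lst add_group servers 10.0.12.12@o2ib
lst add_group clients 10.0.12.11@o2ib
lst add_batch lat_ping
lst add_test --batch lat_ping --concurrency 1 --from clients --to servers ping
lst run lat_ping
lst stat clients & sleep 10; kill $!
lst end_session
```

With concurrency 1, one would roughly expect latency on the order of 1/(RPC rate); the exact relationship to an openmpi ping-pong number is not one-to-one, since LNET RPCs and MPI messages take different paths through the stack.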
On 02/07/2017 06:38 PM, Oucharek, Doug S wrote:
> Because the stat command is “lst stat servers”, the statistics you are seeing are from the perspective of the server. The “from” and “to” parameters can get quite confusing for the read case. When reading, you are transferring the bulk data from the “to” group to the “from” group (yes, seems the opposite of what you would expect). I think the “from” and “to” labels were designed to make sense in the write case and the logic was just flipped for the read case.
> So, the stats you show indicate that you are writing an average of 3.6GiB/s (note: the lnet-selftest stats are mislabeled and should be MiB/s rather than MB/s…I have fixed this in the latest release. You are then getting 3.8GB/s). The reason you see traffic in the read direction is due to responses/acks. That is why there are a lot of small messages going back to the server (high RPC rate, small bandwidth).
> So, your test looks like it is working to me.
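Since the stats are reported from the perspective of the group named on the `lst stat` command line, one way to see the bulk read traffic show up under [R] is to sample the reader group instead of the server group. A small sketch, assuming the session and groups from the script below are already running:

```shell
# Stats are relative to the group you name: sampling "readers"
# instead of "servers" shows the bulk data arriving as [R] traffic,
# matching the intuition for a read test.
lst stat readers & sleep 10; kill $!
```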
>> On Feb 7, 2017, at 2:13 AM, Jon Tegner <tegner at foi.se> wrote:
>> Probably doing something wrong here, but I tried to test only READING with the following:
>> export LST_SESSION=$$
>> lst new_session read
>> lst add_group servers 10.0.12.12@o2ib
>> lst add_group readers 10.0.12.11@o2ib
>> lst add_batch bulk_read
>> lst add_test --batch bulk_read --concurrency 12 --from readers --to servers \
>> brw read check=simple size=1M
>> lst run bulk_read
>> lst stat servers & sleep 10; kill $!
>> lst end_session
>> which in my case gives:
>> [LNet Rates of servers]
>> [R] Avg: 3633 RPC/s Min: 3633 RPC/s Max: 3633 RPC/s
>> [W] Avg: 7241 RPC/s Min: 7241 RPC/s Max: 7241 RPC/s
>> [LNet Bandwidth of servers]
>> [R] Avg: 2.29 MB/s Min: 2.29 MB/s Max: 2.29 MB/s
>> [W] Avg: 3608.44 MB/s Min: 3608.44 MB/s Max: 3608.44 MB/s
>> It seems strange that it reports non-zero numbers in the [W] position, and especially that the bandwidth is low in the [R] position (since I explicitly requested "read"). Also note that if I change "brw read" to "brw write" in the script above, the results are "reversed", in the sense that the higher bandwidth number is reported in the [R] position. That is, "brw read" reports (almost) the expected bandwidth in the [W] position, whereas "brw write" reports it in the [R] position.
>> This is on CentOS-6.5/Lustre-2.5.3. Will try 7.3/2.9.0 later.
>> On 02/06/2017 05:45 PM, Oucharek, Doug S wrote:
>>> Try running just a read test and then just a write test rather than having both at the same time and see if the performance goes up.