[lustre-discuss] Lustre traffic slow on OPA fabric network

Cory Spitz spitzcor at cray.com
Thu Jul 5 19:08:43 PDT 2018


It sounds like you've diagnosed the problem as your OPA fabric.  Do you see network errors that would help confirm that theory?  Can you test the network without Lustre and LNet to prove its fitness?  That is, does it pass network diagnostics?  If it does, LNet Self Test may help as the next diagnostic.  There is a guide at http://wiki.lustre.org/LNET_Selftest.
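
For illustration, here is a rough sketch of what a basic LNet Self Test bulk-read session might look like, written as a small Python helper that only prints the lst console commands so they can be reviewed before anything is run. The NIDs, the group names, the batch name and the o2ib1 network label are all hypothetical placeholders, not values from this thread; substitute the NIDs of one OPA client and one router or server. The lnet_selftest module has to be loaded on every node under test as well as on the console node.

```python
import shlex


def lst_script(clients, servers, batch="bulk_rw", size="1M", seconds=30):
    """Return shell lines for a basic LNet self-test bulk-read run.

    clients and servers are lists of NID strings. All names used here
    (group names, batch name, example NIDs) are illustrative assumptions.
    """
    q = shlex.quote
    return [
        # lnet_selftest must also be loaded on every node being tested
        "modprobe lnet_selftest",
        # lst identifies its session through this environment variable
        "export LST_SESSION=$$",
        "lst new_session read_write",
        "lst add_group clients " + " ".join(q(n) for n in clients),
        "lst add_group servers " + " ".join(q(n) for n in servers),
        f"lst add_batch {q(batch)}",
        f"lst add_test --batch {q(batch)} --from clients --to servers "
        f"brw read size={q(size)}",
        f"lst run {q(batch)}",
        # sample statistics for a fixed interval, then stop the reporter
        f"lst stat clients servers & sleep {int(seconds)}; kill $!",
        "lst end_session",
    ]


if __name__ == "__main__":
    # Hypothetical NIDs on an OPA (o2ib) network
    print("\n".join(lst_script(["10.10.0.21@o2ib1"], ["10.10.0.5@o2ib1"])))
```

Running the same test client-to-router and then router-to-OSS separately should help show which side of the router the losses are on.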

-Cory

-- 

On 7/3/18, 1:59 PM, "lustre-discuss on behalf of Kurt Strosahl" <lustre-discuss-bounces at lists.lustre.org on behalf of strosahl at jlab.org> wrote:

    Good Afternoon,
    
       I've been seeing a great deal of slowness from clients on an OPA network accessing Lustre through LNet routers.  The nodes take a very long time to complete things like lfs df, and show lots of dropped / reestablished connections.  The OSS systems show this as well, and occasionally will report that all routes are down to a host on the Omni-Path fabric.  They also show large numbers of bulk callback errors.  The LNet routers show large numbers of PUT_NACK messages, as well as Abort reconnection messages for nodes on the OPA fabric.
    
    w/r, 
    Kurt J. Strosahl
    System Administrator: Lustre, HPC
    Scientific Computing Group, Thomas Jefferson National Accelerator Facility


