[Lustre-discuss] Lustre clients getting evicted

Craig Prescott prescott at hpc.ufl.edu
Mon Feb 11 12:19:21 PST 2008


Aaron Knister wrote:
> I'm having a similar issue with Lustre 1.6.4.2 and InfiniBand. Under
> load, the clients hang about every 10 minutes, which is really bad for
> a production machine. The only way to fix the hang is to reboot the
> server. My users are getting extremely impatient :-/
> 
> I see this on the clients-
> 
> LustreError: 2814:0:(client.c:975:ptlrpc_expire_one_request()) @@@ timeout (sent at 1202756629, 301s ago) req@ffff8100af233600 x1796079/t0 o6->data-OST0000_UUID@192.168.64.71@o2ib:28 lens 336/336 ref 1 fl Rpc:/0/0 rc 0/-22
> Lustre: data-OST0000-osc-ffff810139ce4800: Connection to service data-OST0000 via nid 192.168.64.71@o2ib was lost; in progress operations using this service will wait for recovery to complete.
> LustreError: 11-0: an error occurred while communicating with 192.168.64.71@o2ib. The ost_connect operation failed with -16
> LustreError: 11-0: an error occurred while communicating with 192.168.64.71@o2ib. The ost_connect operation failed with -16
> 
> I've increased the timeout to 300 seconds and it has helped marginally.

Hi Aaron;

We set the timeout to a large value (1000 seconds) on our 400-node
cluster (mostly o2ib, some tcp clients).  Until we did this, we had
loads of evictions.  In our case, it solved the problem.
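
For anyone reading the quoted log against the timeout setting, here is a minimal Python sketch that pulls the send time, request age, and return codes out of the ptlrpc_expire_one_request() line and maps the negative codes to errno names. The helper names (parse_expired_request, errno_name) are made up for illustration and are not part of Lustre; the regular expressions assume the line layout shown in Aaron's message, and the errno names assume a Linux-style errno table.

```python
import errno
import re
from datetime import datetime, timezone

# Log line from the quoted message, with the mail-wrapped pieces rejoined.
LOG_LINE = (
    "LustreError: 2814:0:(client.c:975:ptlrpc_expire_one_request()) @@@ "
    "timeout (sent at 1202756629, 301s ago) req@ffff8100af233600 x1796079/t0 "
    "o6->data-OST0000_UUID@192.168.64.71@o2ib:28 lens 336/336 ref 1 fl "
    "Rpc:/0/0 rc 0/-22"
)


def parse_expired_request(line):
    """Extract send time, age, and the two rc values from a timeout line.

    Hypothetical helper: assumes the 'sent at <epoch>, <n>s ago' and
    'rc <a>/<b>' layout seen in the quoted log. Returns None otherwise.
    """
    sent = re.search(r"sent at (\d+), (\d+)s ago", line)
    rc = re.search(r"rc (-?\d+)/(-?\d+)", line)
    if not sent or not rc:
        return None
    return {
        "sent_utc": datetime.fromtimestamp(int(sent.group(1)), tz=timezone.utc),
        "age_seconds": int(sent.group(2)),
        "rc": (int(rc.group(1)), int(rc.group(2))),
    }


def errno_name(code):
    """Map a negative kernel-style return code to its symbolic errno name."""
    if code == 0:
        return "OK"
    return errno.errorcode.get(abs(code), "UNKNOWN")


if __name__ == "__main__":
    info = parse_expired_request(LOG_LINE)
    print(info["sent_utc"].isoformat(), info["age_seconds"], info["rc"])
    # Codes that appear in the quoted log: -16 (ost_connect) and -22 (rc field).
    for code in (-16, -22):
        print(code, errno_name(code))
```

On the quoted line this reports a request age of 301 seconds, which is just past the 300-second timeout Aaron mentions. The -16 from ost_connect maps to EBUSY and the -22 in the rc field maps to EINVAL.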

Cheers,
Craig


