[Lustre-discuss] Large directories optimization

Thu Sep 17 13:17:54 PDT 2009

Hello!

On Sep 17, 2009, at 7:28 AM, Lukas Hejtmanek wrote:
> LustreError: 11-0: an error occurred while communicating with x.x.x.x@tcp.
> The mds_connect operation failed with -16
> Lustre: Request x112815827 sent from stable-OST0001-osc-ffff8802855b7800 to
> NID x.x.x.x@tcp 100s ago has timed out (limit 100s).

This looks like your OSTs are overloaded (do you get any "slow ..."
messages or watchdog triggers in the logs there?) and are dragging the
MDS down with them: the MDS tries to do e.g. creates on the OSTs, which
is slow, so the client times out from the MDS as well. You did not show
that part in your log, but we can see that the MDS refuses the client
connection (-16, i.e. EBUSY) because it thinks it is still processing a
request from this client.
The spurious eviction is addressed by adaptive timeouts (enabled by
default in 1.8).
If you bring down the load on the OSTs, that should help. Several
methods were discussed on this list recently, such as reducing the
number of service threads.
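
For reading the numeric code in the quoted mds_connect error, here is a
minimal sketch that maps it to a symbolic errno name. It assumes the
"operation failed with -N" wording shown in the quoted log; the helper
name decode_failure and the regular expression are made up for
illustration and are not part of Lustre.

```python
import errno
import re

# Assumption: the line contains "The <op> operation failed with -<N>",
# as in the log quoted above. Other Lustre messages are not handled.
FAILED_RE = re.compile(r"The (\w+) operation failed with -(\d+)")


def decode_failure(line):
    """Return (operation, errno name) for a matching line, else None."""
    match = FAILED_RE.search(line)
    if match is None:
        return None
    operation = match.group(1)
    code = int(match.group(2))
    # errno.errorcode maps errno numbers to names on the local platform;
    # fall back to "unknown" for codes it does not list.
    return operation, errno.errorcode.get(code, "unknown")


if __name__ == "__main__":
    print(decode_failure("The mds_connect operation failed with -16"))
```

On Linux, code 16 is EBUSY, which matches the MDS still considering the
client busy with an earlier request.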

> LustreError: 166-1: MGCx.x.x.x@tcp: Connection to service MGS via nid
> x.x.x.x@tcp was lost; in progress operations using this service will fail.
> Lustre: MGCx.x.x.x@tcp: Reactivating import

Now this is unexpected, and I do not see a timeout, so I do not know
what actually happened there.

Bye,
     Oleg