[Lustre-discuss] Lustre client lockups

Thu Nov 6 09:23:20 PST 2008

On Nov 04, 2008  09:06 -0800, Kurt Dillen wrote:
> We have a serious problem with Lustre.  For the past few days we have
> had lockups on the client side.  Not all clients are affected.
> 
> We are running this kernel: 2.6.16-54-0.2.5_lustre.1.6.4.3smp.
> 
> Statahead has already been disabled on the systems.
> 
> Some more information about the environment:
> 
> - The Lustre clients are all VMware virtual systems
> - The Lustre farm is all VMware virtual systems
> 
> The errors I see are the following:
> 
> LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
> 0, status -5, desc ffff8100e5dca000
> LustreError: 3428:0:(client.c:975:ptlrpc_expire_one_request()) @@@
> timeout (sent at 1225816920, 100s ago)  req@ffff8100e7e2ba00 x17940/t0
> o4->lustre-OST0005_UUID@172.16.0.29@tcp:28 lens 384/352 ref 2 fl Rpc:/
> 0/0 rc 0/-22
> Lustre: lustre-OST0005-osc-ffff8100e8551800: Connection to service
> lustre-OST0005 via nid 172.16.0.29@tcp was lost; in progress
> operations using this service will wait for recovery to complete.

These all look like network problems (a failed bulk transfer, then a
request timeout and a lost connection to the OST).  Running production
Lustre servers inside VMware doesn't make much sense.  We don't test
clients inside VMware, but I don't think that is nearly as bad as running
the servers in a virtual environment.
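
For anyone reading the numbers in the quoted log: the negative values
(status -5, rc 0/-22) appear to be ordinary negative Linux errno codes.
A minimal Python sketch for looking them up, assuming that reading is
correct; the helper name decode_status is hypothetical and not part of
Lustre:

```python
import errno
import os


def decode_status(status: int) -> str:
    """Return 'NAME (message)' for a kernel-style negative errno value.

    Hypothetical helper for illustration only; it is not a Lustre tool.
    """
    code = abs(status)
    # errno.errorcode maps numeric codes to symbolic names such as 'EIO'.
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror gives the platform's human-readable message for the code.
    return f"{name} ({os.strerror(code)})"


# Values taken from the quoted log lines.
for status in (-5, -22):
    print(status, decode_status(status))
```

This only names the codes (EIO for -5, EINVAL for -22); it does not say
where in the network path the I/O error came from.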

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.