[Lustre-discuss] Lustre client lockups

Wed Nov 5 12:22:54 PST 2008

On Tue, 2008-11-04 at 09:06 -0800, Kurt Dillen wrote:
> 
> Some more information about the environment:
> 
> - Lustre clients are all vmware virtual systems
> - Lustre Farm are all vmware virtual systems

Hrm.  That is a bit of a red flag right there.

> the errors I see are the following:
> 
> LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
> 0, status -5, desc ffff8100e5dca000
> LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
> 0, status -5, desc ffff8100e519e000
> LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
> 0, status -5, desc ffff8100e4e0a000
> LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
> 0, status -5, desc ffff8100e86b1bc0
> LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
> 0, status -5, desc ffff8100e79fe5c0
> LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
> 0, status -5, desc ffff8100e70a88c0
> LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
> 0, status -5, desc ffff8100e7081280
> LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
> 0, status -5, desc ffff8100e6d6d5c0
> LustreError: 3428:0:(client.c:975:ptlrpc_expire_one_request()) @@@
> timeout (sent at 1225816920, 100s ago)  req@ffff8100e7e2ba00 x17940/t0
> o4->lustre-OST0005_UUID@172.16.0.29@tcp:28 lens 384/352 ref 2 fl Rpc:/
> 0/0 rc 0/-22
> Lustre: lustre-OST0005-osc-ffff8100e8551800: Connection to service
> lustre-OST0005 via nid 172.16.0.29@tcp was lost; in progress
> operations using this service will wait for recovery to complete.
> Lustre: lustre-OST0005-osc-ffff8100e8551800: Connection restored to
> service lustre-OST0005 using nid 172.16.0.29@tcp.

These are just regular timeouts, with nothing in them to explain the
cause.  A detailed analysis of all of your server logs (not something we
can do here on lustre-discuss) might yield more, but I have suspicions
about your vmware-farm setup.  Running many VMs that all compete for the
same host resources makes the environment unpredictable.
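
If you want a quick summary of the client messages before digging into
the server logs, something along these lines may help.  It is only a
rough sketch: the two patterns are inferred from the excerpt quoted
above and nothing else, and the function name summarize() is
hypothetical, not part of any Lustre tool.  It counts the
client_bulk_callback() status values by errno name (-5 is EIO) and
converts the "sent at" epoch time of each expired request to UTC.

```python
import errno
import re
from collections import Counter
from datetime import datetime, timezone

# Both patterns are assumptions based only on the log excerpt in this thread.
BULK_RE = re.compile(r"client_bulk_callback\(\)\) event type\s+\d+, status (-?\d+)")
EXPIRE_RE = re.compile(r"timeout \(sent at (\d+), (\d+)s ago\)")


def summarize(log_text):
    """Summarize bulk-callback errors and expired requests in client log text."""
    # Messages pasted from mail are wrapped and quoted: strip the "> "
    # markers and join everything into one line so the patterns can
    # match across the original line breaks.
    flat = " ".join(line.lstrip("> ").strip() for line in log_text.splitlines())

    # Count bulk-callback failures by symbolic errno name (status is negative).
    statuses = Counter()
    for match in BULK_RE.finditer(flat):
        code = int(match.group(1))
        statuses[errno.errorcode.get(-code, str(code))] += 1

    # Record when each expired request was sent (UTC) and how long it waited.
    timeouts = []
    for match in EXPIRE_RE.finditer(flat):
        sent = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
        timeouts.append((sent.isoformat(), int(match.group(2))))

    return {"bulk_status": dict(statuses), "timeouts": timeouts}
```

Feeding it the text of the client's messages file (or the quoted excerpt
as-is) returns a small dictionary; lining up the "sent" times against
load on the vmware host would be one way to check the contention theory.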

I'm not sure if you are using host-only or bridged networking, but my
(now quite historic) experience with running lots of vmware machines on
a single piece of hardware is that the host-only network is less than
robust and the memory requirements of running many VMs on a single
machine are demanding.  Additionally, if you have many OSTs all sharing
the same physical disk, you will have further contention there.  In that
kind of environment, timeouts are not surprising.

I would also encourage you to try 1.6.6 now that it is out, and to get
some baseline performance metrics of all of this virtual hardware with
our iokit.

b.
