[Lustre-discuss] Interpreting iozone measurements

Jeremy Filizetti jeremy.filizetti@gmail.com
Wed Mar 9 05:46:24 PST 2011


> we have a testbed running with Lustre 1.8.3 and an RTT of ~4 ms (10GE
> network cards everywhere) for a ping between client and servers. If I
> have read the iozone source code correctly, iozone reports bandwidth in
> KB/s and includes the time for the open() call, but not for close(). If

To include the close() time you need to add the -c flag; the -e flag
includes fsync/fflush in the measured time.
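
As a sketch, a run that includes both the flush and close() time in the
measurement might look like this (the test file path is a placeholder for a
file on your Lustre mount):

```shell
# Hypothetical iozone run: -e includes fsync/fflush and -c includes
# close() in the measured time; -i 0 runs only the write/rewrite test.
# /mnt/lustre/testfile is a placeholder path on the Lustre client.
iozone -e -c -i 0 -s 64k -r 64k -f /mnt/lustre/testfile
```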
> I write a single 64 byte file (with one I/O request), iozone tells me
> something like '295548', which means ~295 MB/s. Dividing the file size
> by the bandwidth, I get the time that was needed to create the file and
> write the 64 bytes (with a single request). In this case, the time is
> about 0.2 microseconds, which is way below the RTT.

That seems oddly fast for such a small file over any latency, since you
shouldn't even be able to lock the file in that time.
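
For reference, the per-file time implied by iozone's KB/s figure can be
recomputed directly from the numbers quoted above:

```shell
# Recompute the time implied by iozone's reported bandwidth:
# 64 bytes at 295548 KB/s (figures from the message above).
usec=$(awk 'BEGIN { printf "%.2f", 64 / (295548 * 1024) * 1e6 }')
echo "$usec microseconds"   # ~0.21 us, far below a 4 ms RTT
```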
> That mean for a Lustre file system, if I create a file and write 64
> bytes, the client sends two(?) RPCs to the server and does not wait for
> the completion. Is this correct? But it will wait for the completion of
> both RPCs when the file is closed?

You can see what Lustre is doing if the client isn't doing any other
activity and you enable some tracing.  "sysctl -w lnet.debug='+rpctrace
vfstrace'" should allow you to see the VFS ops ll_file_open,
ll_file_writev/ll_file_aio_write, and ll_file_release, along with any
RPCs generated by them.  You should see an RPC for the file open, which
will be a 101 opcode requesting the lock, and AFAIK you should see a
reply before the client actually attempts to write any data.  So that
should bring your time up to at least 4 ms for 1 RTT.  The initial write
should request a lock from the first stripe, followed by an OST write
RPC (opcode 4), followed by a file close (opcode 35).  I ran a test over
4 ms latency so you can see what I'm referring to.  I thought there was
a patch in Lustre a few months back that forced a flush before a file
close, but this trace is from a 1.8.5 client, so I'm guessing that isn't
how it works, because between closing the file and the end of the trace
I had to "sync" for the OST write to show up.
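
For completeness, one way to capture a trace like the one below (paths and
the test command are placeholders; on newer clients "lctl set_param
debug=..." replaces the sysctl):

```shell
# Enable RPC and VFS tracing, run the small write, then dump the
# kernel debug buffer to a file for inspection.
sysctl -w lnet.debug='+rpctrace vfstrace'
lctl clear                          # empty the debug buffer first
dd if=/dev/zero of=/mnt/lustre/f bs=64 count=1
lctl dk > /tmp/lustre-trace.log     # "dk" = debug_kernel dump
```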

00000100:00100000:3:1299677923.962151:0:31223:0:(client.c:2087:ptlrpc_queue_wait())
Sending RPC pname:cluuid:pid:xid:nid:opc
bash:052a94ea-8d57-ac3e-829e-890e0bb65639:31223:x1362774702706288:192.168.66.2@tcp8:101
00000100:00100000:3:1299677923.966512:0:31223:0:(client.c:2192:ptlrpc_queue_wait())
Completed RPC pname:cluuid:pid:xid:nid:opc
bash:052a94ea-8d57-ac3e-829e-890e0bb65639:31223:x1362774702706288:192.168.66.2@tcp8:101
00000080:00200000:3:1299677923.966562:0:31223:0:(file.c:419:ll_file_open())
VFS Op:inode=219/2880174574(ffff8801e1a68550), flags 101101
00000100:00100000:3:1299677923.966615:0:31223:0:(client.c:2087:ptlrpc_queue_wait())
Sending RPC pname:cluuid:pid:xid:nid:opc
bash:052a94ea-8d57-ac3e-829e-890e0bb65639:31223:x1362774702706292:192.168.66.3@tcp8:101
00000100:00100000:3:1299677923.970994:0:31223:0:(client.c:2192:ptlrpc_queue_wait())
Completed RPC pname:cluuid:pid:xid:nid:opc
bash:052a94ea-8d57-ac3e-829e-890e0bb65639:31223:x1362774702706292:192.168.66.3@tcp8:101
00000080:00200000:3:1299677923.971058:0:31223:0:(file.c:256:ll_file_release())
VFS Op:inode=219/2880174574(ffff8801e1a68550)
00000100:00100000:3:1299677923.971078:0:31223:0:(client.c:2087:ptlrpc_queue_wait())
Sending RPC pname:cluuid:pid:xid:nid:opc
bash:052a94ea-8d57-ac3e-829e-890e0bb65639:31223:x1362774702706296:192.168.66.2@tcp8:35
00000100:00100000:3:1299677923.975447:0:31223:0:(client.c:2192:ptlrpc_queue_wait())
Completed RPC pname:cluuid:pid:xid:nid:opc
bash:052a94ea-8d57-ac3e-829e-890e0bb65639:31223:x1362774702706296:192.168.66.2@tcp8:35
00000100:00100000:0:1299677927.817739:0:2894:0:(client.c:1177:ptlrpc_send_new_req())
Sending RPC pname:cluuid:pid:xid:nid:opc
ptlrpcd:052a94ea-8d57-ac3e-829e-890e0bb65639:2894:x1362774702706304:192.168.66.3@tcp8:4
00000100:00100000:0:1299677927.826642:0:2894:0:(client.c:1444:ptlrpc_check_set())
Completed RPC pname:cluuid:pid:xid:nid:opc
ptlrpcd:052a94ea-8d57-ac3e-829e-890e0bb65639:2894:x1362774702706304:192.168.66.3@tcp8:4

> The numbers look different when I disable the client side cache by
> setting max_dirty_mb to 0.

Without any grant, I think all RPCs have to be synchronous, so you'll
see a huge performance hit over latency.
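
As a sketch, the client-side write-back cache can be disabled per OSC like
this (the glob matches all OSC devices; the value is in MB, and 32 below is
just an example default to restore):

```shell
# Disable the client write-back cache on all OSCs; with no dirty
# pages allowed, writes must be sent out synchronously.
lctl set_param osc.*.max_dirty_mb=0
# Restore a working value afterwards (32 MB here is an example).
lctl set_param osc.*.max_dirty_mb=32
```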


Jeremy


