[Lustre-discuss] How to determine which lustre clients are loading filesystem.

Andreas Dilger andreas.dilger at oracle.com
Thu Jul 8 12:35:01 PDT 2010


On 2010-07-08, at 12:03, Wojciech Turek wrote:
> Our Lustre filesystem (Lustre 1.8.3, RHEL5) has recently become very busy and users are noticing the slowness. The Lustre system consists of ~550 clients and currently we have 50 different users running jobs. I can see that the OSS servers have load oscillating between 100 and 300, and collectl shows that there is a lot of I/O going on (mainly reads). I would like to find a good method of finding out which Lustre clients are generating the I/O, so I can pinpoint the high load to particular jobs. I hope that some Lustre users can share their experience in this matter.

There are a number of ways to do this.  One way is to check the "/proc/fs/lustre/obdfilter/*/exports/*/stats" files, which contain per-client statistics.  They can be cleared by writing "0" to each file; after some time has passed, look for the files that show a lot of operations.
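
As a rough sketch of the stats approach (not a tested site script), the function below adds up the sample counts in every per-client stats file and prints the busiest clients first.  The function name and the optional base-directory argument are my own inventions, and it assumes that each line after "snapshot_time" has the form "<counter> <sample_count> samples [<unit>] ...", so check that against the files on your own OSS before trusting the numbers:

```shell
# top_clients_by_ops [BASE_DIR]
# Prints "<total_ops> <client_nid>" per client, busiest first, summed
# over all OSTs under BASE_DIR (default: /proc/fs/lustre/obdfilter).
# Assumption: the 2nd column of each stats line is the sample count.
top_clients_by_ops() {
    base="${1:-/proc/fs/lustre/obdfilter}"
    for f in "$base"/*/exports/*/stats; do
        [ -r "$f" ] || continue
        # The export directory name is the client NID.
        nid=$(basename "$(dirname "$f")")
        # Sum the sample counts, skipping the timestamp line.
        awk -v nid="$nid" \
            '$1 != "snapshot_time" { n += $2 } END { print n + 0, nid }' "$f"
    done |
        # Merge the per-OST totals into one total per client.
        awk '{ total[$2] += $1 } END { for (c in total) print total[c], c }' |
        sort -k1,1nr -k2,2
}
```

To use it, clear the counters first (for example with a loop that does echo 0 > "$f" over the same stats files), wait a while, and then run "top_clients_by_ops | head -20" on each OSS.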

Another way that I heard some sites were doing this is to use the "rpc history".  They may already have a script to do this, but the basics are below:

oss# lctl set_param ost.OSS.ost_io.req_buffer_history=10240
{wait a few seconds to collect some history}
oss# lctl get_param ost.OSS.ost_io.req_history

This will give you a list of the past (up to) 10240 RPCs for the "ost_io" RPC service, which is what you are observing the high load on:

3436037:192.168.20.1@tcp:12345-192.168.20.159@tcp:x1340648957534353:448:Complete:1278612656:0s(-6s) opc 3
3436038:192.168.20.1@tcp:12345-192.168.20.159@tcp:x1340648957536190:448:Complete:1278615489:1s(-41s) opc 3
3436039:192.168.20.1@tcp:12345-192.168.20.159@tcp:x1340648957536193:448:Complete:1278615490:0s(-6s) opc 3

This output is in the format:

identifier:target_nid:source_nid:rpc_xid:rpc_size:rpc_status:arrival_time:service_time(deadline) opcode

Using some shell scripting, one can find the clients sending the most RPC requests:

oss# lctl get_param ost.OSS.ost_io.req_history | tr ":" " " | cut -d" " -f3,9,10 | sort | uniq -c | sort -nr | head -20


   3443 12345-192.168.20.159@tcp opc 3
   1215 12345-192.168.20.157@tcp opc 3
    121 12345-192.168.20.157@tcp opc 4

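This will give you a sorted list of the top 20 client/operation pairs, i.e. the clients that are sending the most RPCs to the ost_io service, along with the operation being done (3 = OST_READ, 4 = OST_WRITE).  Note that a client doing both reads and writes appears on two lines, one per opcode.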
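
If you would rather have one line per client, the same history can be summarized with awk.  This is only a sketch under the record format shown above: the function name is made up, it reads the history on stdin, skips anything that does not look like a record, and strips the numeric prefix (e.g. "12345-") from the source NID:

```shell
# summarize_req_history
# Reads req_history records on stdin and prints one line per client:
#   <total_rpcs> <reads> <writes> <client_nid>
# sorted with the busiest client first.
summarize_req_history() {
    awk '
        {
            # A record has at least 8 colon-separated fields and ends
            # in "opc <number>"; skip headers and anything else.
            if (split($0, f, ":") < 8) next
            if ($(NF - 1) != "opc") next
            client = f[3]                  # source NID field
            sub(/^[0-9]+-/, "", client)    # drop the numeric prefix
            total[client]++
            if ($NF == 3) reads[client]++          # OST_READ
            else if ($NF == 4) writes[client]++    # OST_WRITE
        }
        END {
            for (c in total)
                printf "%d %d %d %s\n", total[c], reads[c], writes[c], c
        }
    ' | sort -k1,1nr -k4
}
```

For example:

oss# lctl get_param ost.OSS.ost_io.req_history | summarize_req_history | head -20

Opcodes other than 3 and 4 are counted in the total only.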

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.



