[Lustre-discuss] How to determine which lustre clients are loading filesystem.
Andreas Dilger
andreas.dilger at oracle.com
Thu Jul 8 12:35:01 PDT 2010
On 2010-07-08, at 12:03, Wojciech Turek wrote:
> Our Lustre filesystem (Lustre 1.8.3, RHEL5) got recently very busy and users are noticing the slowness. The Lustre system consists of ~550 clients and currently we have 50 different users running jobs. I can see that OSS servers have load oscillating between 100-300 and collectl shows that there are lots of I/O going on (mainly read). I would like to find a good method of finding out which Lustre clients are generating the I/O so I could pinpoint the high load to a particular jobs. I hope that some Lustre users can share their experience in that matter.
There are a number of ways to do this. One way is to check the "/proc/fs/lustre/obdfilter/*/exports/*/stats" files, which contain per-client statistics. The counters can be cleared by writing "0" to each file; after letting them accumulate for a while, look for the files showing the most operations.
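As a rough sketch of that first approach (not a tested site script): the helper names below are hypothetical, and the "read_bytes <count> samples [bytes] <min> <max> <sum>" line layout is my assumption about what the per-export stats files contain, so check it against your own output first. The optional root argument only exists so the functions can be pointed at a copy of the tree.

```shell
# Hypothetical helper: reset the per-client counters by writing "0" to each stats file.
clear_client_stats() {
    root="${1:-/proc/fs/lustre/obdfilter}"
    for f in "$root"/*/exports/*/stats; do
        if [ -w "$f" ]; then
            echo 0 > "$f"
        fi
    done
}

# Hypothetical helper: list stats files ordered by read_bytes sample count.
# Assumed line layout: read_bytes <count> samples [bytes] <min> <max> <sum>
top_readers() {
    root="${1:-/proc/fs/lustre/obdfilter}"
    for f in "$root"/*/exports/*/stats; do
        [ -r "$f" ] || continue
        n=$(awk '$1 == "read_bytes" { print $2 }' "$f")
        # Skip exports that have done no reads since the last clear.
        if [ -n "$n" ]; then
            echo "$n $f"
        fi
    done | sort -nr | head -20
}
```

Run clear_client_stats on the OSS, wait a while, then run top_readers; the export directory in each path identifies the client.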
Another way, which I have heard some sites use, is the "rpc history". They may already have a script to do this, but the basics are below:
oss# lctl set_param ost.OSS.ost_io.req_buffer_history=10240
{wait a few seconds to collect some history}
oss# lctl get_param ost.OSS.ost_io.req_history
This will give you a list of the past (up to) 10240 RPCs for the "ost_io" RPC service, which is what you are observing the high load on:
3436037:192.168.20.1@tcp:12345-192.168.20.159@tcp:x1340648957534353:448:Complete:1278612656:0s(-6s) opc 3
3436038:192.168.20.1@tcp:12345-192.168.20.159@tcp:x1340648957536190:448:Complete:1278615489:1s(-41s) opc 3
3436039:192.168.20.1@tcp:12345-192.168.20.159@tcp:x1340648957536193:448:Complete:1278615490:0s(-6s) opc 3
This output is in the format:
identifier:target_nid:source_nid:rpc_xid:rpc_size:rpc_status:arrival_time:service_time(deadline) opcode
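To make the field positions concrete, here is a small sketch that splits each history line into labelled fields. The function name is hypothetical, and it assumes NIDs appear as lctl prints them (for example 192.168.20.159@tcp, with no embedded spaces) and that every record ends in "opc <number>".

```shell
# Hypothetical helper: label the fields of req_history lines read from stdin.
split_req_history() {
    # Split on ":" and " "; the "name=" header line from get_param has
    # fewer than 10 fields and is skipped.
    awk -F'[: ]' 'NF >= 10 && $9 == "opc" {
        printf "id=%s target=%s source=%s xid=%s size=%s status=%s arrival=%s service=%s opcode=%s\n",
               $1, $2, $3, $4, $5, $6, $7, $8, $10
    }'
}
```

Usage would be something like: lctl get_param ost.OSS.ost_io.req_history | split_req_history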
Using some shell scripting, one can find the clients sending the most RPC requests:
oss# lctl get_param ost.OSS.ost_io.req_history | tr ":" " " | cut -d" " -f3,9,10 | sort | uniq -c | sort -nr | head -20
3443 12345-192.168.20.159@tcp opc 3
1215 12345-192.168.20.157@tcp opc 3
121 12345-192.168.20.157@tcp opc 4
This will give you a sorted list of the top 20 clients that are sending the most RPCs to the ost_io service, along with the operation being done (3 = OST_READ, 4 = OST_WRITE).
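If you would rather have one line per client with reads and writes side by side, something along these lines might work. It is only a sketch: the function name is hypothetical, it assumes the same line format as above with NIDs written as lctl prints them (e.g. 192.168.20.159@tcp), and opcodes other than 3 and 4 are counted in the total only.

```shell
# Hypothetical helper: per-client RPC totals with read/write breakdown,
# reading req_history output from stdin.
summarize_req_history() {
    awk -F'[: ]' '
        NF >= 10 && $(NF-1) == "opc" {
            client = $3
            total[client]++
            if ($NF == 3) reads[client]++        # 3 = OST_READ
            else if ($NF == 4) writes[client]++  # 4 = OST_WRITE
        }
        END {
            for (c in total)
                printf "%d %s read=%d write=%d\n", total[c], c, reads[c], writes[c]
        }
    ' | sort -nr | head -20
}
```

Usage would be something like: lctl get_param ost.OSS.ost_io.req_history | summarize_req_history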
Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.