[Lustre-discuss] Performance Issue Troubleshooting
Sebastian Gutierrez
gutseb at cs.stanford.edu
Thu Nov 13 11:45:01 PST 2008
Hello,
I am working with Lustre 1.6.5.
A couple of weeks ago the racks holding the MDS and OSTs were moved.
Ever since the move, our users have complained about performance
issues. Most of the complaints have been about listing files and bash
auto-completion, but the jobs running on the cluster seem to be
running into issues as well.
I have seen few Lustre errors in /var/log/messages, aside from a few
nodes being evicted.
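To put a number on those evictions, here is a quick sketch; the
"evict" pattern and the log path are assumptions about how the
messages appear on this setup:

```shell
# count_evictions: count Lustre eviction messages in a syslog file.
# The pattern and the log path are assumptions about this setup.
count_evictions() {
    grep -ciE 'lustre.*evict' "$1"
}
# e.g.: count_evictions /var/log/messages
```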
I have rebooted all the clients to try to remedy the performance issues.
On the OSS we have 4 Gig-E ports bonded, so I assume the bottleneck
may be the MDS, which has only one Gig-E network up-link.
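One thing worth checking after a physical move is whether each link
actually re-negotiated at 1000 Mb/s. A sketch, assuming Linux sysfs;
the interface name is a placeholder:

```shell
# link_speed: print the negotiated speed in Mb/s for a NIC, or "unknown"
# when sysfs does not expose it (interface down, virtual, non-Linux).
link_speed() {
    cat "/sys/class/net/$1/speed" 2>/dev/null || echo unknown
}
link_speed eth0   # "eth0" is a placeholder for the MDS up-link
```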
I have inherited this system and am verifying the configuration as I go.
I may also be missing something fairly obvious, so I apologize if I
give too much information.
Running lfs check servers from mds_2 shows everything active:
OSC_hoxa-mds-1.Stanford.EDU_ost1-cluster_MNT_cluster-ffff81022e1a6400
active.
OSC_hoxa-mds-1.Stanford.EDU_ost2-cluster_MNT_cluster-ffff81022e1a6400
active.
OSC_hoxa-mds-1.Stanford.EDU_ost3-cluster_MNT_cluster-ffff81022e1a6400
active.
OSC_hoxa-mds-1.Stanford.EDU_ost4-cluster_MNT_cluster-ffff81022e1a6400
active.
MDC_hoxa-mds-1.Stanford.EDU_mds-hoxa_MNT_cluster-ffff81022e1a6400 active.
From mds_1, the command shows:
0 UP mgs MGS MGS 25
1 UP mgc MGC192.168.136.10 at tcp c5642c46-5232-2004-4ba7-01a5ae11047f 5
2 UP mdt MDS MDS_uuid 3
3 UP lov lustre-mdtlov lustre-mdtlov_UUID 4
4 UP mds lustre-MDT0000 mds-hoxa_UUID 91
5 UP osc lustre-OST0000-osc lustre-mdtlov_UUID 5
6 UP osc lustre-OST0003-osc lustre-mdtlov_UUID 5
7 UP osc lustre-OST0001-osc lustre-mdtlov_UUID 5
8 UP osc lustre-OST0002-osc lustre-mdtlov_UUID 5
lctl device_list
0 UP mgc MGC192.168.136.10 at tcp 5467a230-9a4b-41bc-dabf-5ae86ba7a287 5
2 UP lov lov-cluster-ffff81022e1a6400
fd455bf8-544e-5f93-f316-270497373710 4
3 UP osc
OSC_hoxa-mds-1.Stanford.EDU_ost1-cluster_MNT_cluster-ffff81022e1a6400
fd455bf8-544e-5f93-f316-270497373710 5
4 UP osc
OSC_hoxa-mds-1.Stanford.EDU_ost2-cluster_MNT_cluster-ffff81022e1a6400
fd455bf8-544e-5f93-f316-270497373710 5
5 UP osc
OSC_hoxa-mds-1.Stanford.EDU_ost3-cluster_MNT_cluster-ffff81022e1a6400
fd455bf8-544e-5f93-f316-270497373710 5
6 UP osc
OSC_hoxa-mds-1.Stanford.EDU_ost4-cluster_MNT_cluster-ffff81022e1a6400
fd455bf8-544e-5f93-f316-270497373710 5
7 UP mdc
MDC_hoxa-mds-1.Stanford.EDU_mds-hoxa_MNT_cluster-ffff81022e1a6400
fd455bf8-544e-5f93-f316-270497373710 5
Running the following from a target:
lfs getstripe /cluster/
OBDS:
0: ost1-cluster_UUID ACTIVE
1: ost2-cluster_UUID ACTIVE
2: ost3-cluster_UUID ACTIVE
3: ost4-cluster_UUID ACTIVE
/cluster/
default stripe_count: 1 stripe_size: 1048576 stripe_offset: 0
I have also verified that the subfolders have the same configuration.
So I started running llstat -i10 on the OSS and the MDS against
different stats files while doing an ls that experiences the lockups.
I found that there seem to be quite a few requests with high wait
times. I looked into the directory I was running ls in, and it looks
like the user has about 80,000 files in his directories, with file
sizes around 1.5k.
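For what it's worth, every ls (and every bash completion) has to stat
each entry, so directory size matters here. A sketch for profiling a
suspect directory — the path is whatever the user's directory is, and
find -printf assumes GNU find:

```shell
# dir_profile: count the regular files directly inside a directory and
# report their average size. A directory with ~80k tiny files turns
# every "ls" into tens of thousands of MDS stat RPCs.
# (Sketch; GNU find assumed.)
dir_profile() {
    find "$1" -maxdepth 1 -type f -printf '%s\n' |
        awk '{ n++; s += $1 } END { printf "%d files, avg %.0f bytes\n", n, (n ? s / n : 0) }'
}
# e.g.: dir_profile /cluster/username   (path hypothetical)
```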
Any pointers to track this down would be greatly appreciated.
/proc/fs/lustre/mdt/MDS/mds/stats @ 1226456419.648209
Name                Cur.Count  Cur.Rate  #Events     Unit    last      min  avg     max      stddev
req_waittime        34367      3436      1265403433  [usec]  428069    3    80.58   2068859  683.10
req_qdepth          34367      3436      1265403433  [reqs]  3540      0    1.41    317      2.18
req_active          34367      3436      1265403433  [reqs]  54803     1    10.12   127      8.66
req_timeout         34368      3436      1265403434  [sec]   34368     1    3.26    169      7.82
reqbuf_avail        70115      7011      2746703429  [bufs]  17929503  157  249.87  256      6.44
ldlm_ibits_enqueue  34351      3435      1244343736  [reqs]  34351     1    1.00    1        0.00
This is the same output without the ls that locks up the box:
/proc/fs/lustre/mdt/MDS/mds/stats @ 1226456999.791520
Name                Cur.Count  Cur.Rate  #Events     Unit    last     min  avg     max      stddev
req_waittime        10664      1066      1266340431  [usec]  136698   3    80.54   2068859  682.85
req_qdepth          10664      1066      1266340431  [reqs]  2162     0    1.40    317      2.18
req_active          10664      1066      1266340431  [reqs]  14923    1    10.12   127      8.65
req_timeout         10664      1066      1266340431  [sec]   10664    1    3.26    169      7.82
reqbuf_avail        23857      2385      2748636467  [bufs]  6102353  157  249.87  256      6.44
ldlm_ibits_enqueue  10647      1064      1245279702  [reqs]  10647    1    1.00    1        0.00
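A rough read of the req_waittime rows in those two snapshots, assuming
(my reading of llstat output) that "last" is the summed value over the
10-second interval and Cur.Count is the number of events in it:

```shell
# Mean per-request wait in each snapshot, plus the cumulative worst case.
# Numbers are the req_waittime rows from the two snapshots above.
awk 'BEGIN {
    printf "with ls:    %.1f usec mean wait over %d reqs\n", 428069 / 34367, 34367
    printf "without ls: %.1f usec mean wait over %d reqs\n", 136698 / 10664, 10664
    printf "worst observed wait: %.2f s\n", 2068859 / 1000000
}'
```

If that reading is right, the mean wait is nearly identical in both
cases (~12-13 usec); what stands out is the roughly 3x request rate
during the ls and the ~2 s worst-case wait in the cumulative max.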