[Lustre-discuss] Performance Issue Troubleshooting

Thu Nov 13 11:45:01 PST 2008

Hello
I am working with lustre 1.6.5.
A couple of weeks ago we had a move of the racks that the MDS and OSTs
were in.  Ever since the move we have had our users complain about
performance issues.    Most of the complaints have been regarding
listing files and bash auto-completion but the jobs running on the
cluster seem to be running into issues as well.

I have received few Lustre errors in /var/log/messages
except for a few nodes being evicted.

I have rebooted all the clients to try to remedy the performance issues.
On the OSS we have 4 Gig-E ports bonded. So i assume that the bottle
neck may be the MDS which only has one Gig-E network up-link.

I have inherited this system and am verifying the configuration as I go.
    I may also be missing something fairly obvious.  So I apologize if I
give to much information.

lfs check servers > from the mds_2 shows all active.

OSC_hoxa-mds-1.Stanford.EDU_ost1-cluster_MNT_cluster-ffff81022e1a6400
active.
OSC_hoxa-mds-1.Stanford.EDU_ost2-cluster_MNT_cluster-ffff81022e1a6400
active.
OSC_hoxa-mds-1.Stanford.EDU_ost3-cluster_MNT_cluster-ffff81022e1a6400
active.
OSC_hoxa-mds-1.Stanford.EDU_ost4-cluster_MNT_cluster-ffff81022e1a6400
active.
MDC_hoxa-mds-1.Stanford.EDU_mds-hoxa_MNT_cluster-ffff81022e1a6400 active.

from mds_1 the command shows
  0 UP mgs MGS MGS 25
  1 UP mgc MGC192.168.136.10 at tcp c5642c46-5232-2004-4ba7-01a5ae11047f 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov lustre-mdtlov lustre-mdtlov_UUID 4
  4 UP mds lustre-MDT0000 mds-hoxa_UUID 91
  5 UP osc lustre-OST0000-osc lustre-mdtlov_UUID 5
  6 UP osc lustre-OST0003-osc lustre-mdtlov_UUID 5
  7 UP osc lustre-OST0001-osc lustre-mdtlov_UUID 5
  8 UP osc lustre-OST0002-osc lustre-mdtlov_UUID 5

lctl device_list
0 UP mgc MGC192.168.136.10 at tcp 5467a230-9a4b-41bc-dabf-5ae86ba7a287 5
  2 UP lov lov-cluster-ffff81022e1a6400
fd455bf8-544e-5f93-f316-270497373710 4
  3 UP osc
OSC_hoxa-mds-1.Stanford.EDU_ost1-cluster_MNT_cluster-ffff81022e1a6400
fd455bf8-544e-5f93-f316-270497373710 5
  4 UP osc
OSC_hoxa-mds-1.Stanford.EDU_ost2-cluster_MNT_cluster-ffff81022e1a6400
fd455bf8-544e-5f93-f316-270497373710 5
  5 UP osc
OSC_hoxa-mds-1.Stanford.EDU_ost3-cluster_MNT_cluster-ffff81022e1a6400
fd455bf8-544e-5f93-f316-270497373710 5
  6 UP osc
OSC_hoxa-mds-1.Stanford.EDU_ost4-cluster_MNT_cluster-ffff81022e1a6400
fd455bf8-544e-5f93-f316-270497373710 5
  7 UP mdc
MDC_hoxa-mds-1.Stanford.EDU_mds-hoxa_MNT_cluster-ffff81022e1a6400
fd455bf8-544e-5f93-f316-270497373710 5

running the following from a target
lfs getstripe /cluster/

OBDS:
0: ost1-cluster_UUID ACTIVE
1: ost2-cluster_UUID ACTIVE
2: ost3-cluster_UUID ACTIVE
3: ost4-cluster_UUID ACTIVE
/cluster/
default stripe_count: 1 stripe_size: 1048576 stripe_offset: 0

I have also verified that the sub folders also have the same configuration.

So i started doing llstat -i10 on the oss and the MDS on different stat
files.  While doing a ls that experiences the lockups.

I found that there seems to be quite a few requests with high wait
times.  I have looked into the directory i was doing the LS in and it
looks like the user has about 80000 files in his directories.  The files
sizes are about 1.5k.

Any pointers to track this down would be greatly appreciated.

/proc/fs/lustre/mdt/MDS/mds/stats @ 1226456419.648209
Name                      Cur.Count  Cur.Rate   #Events   Unit
 last        min          avg        max    stddev
req_waittime              34367      3436       1265403433[usec]
428069          3        80.58    2068859    683.10
req_qdepth                34367      3436       1265403433[reqs]
 3540          0         1.41        317      2.18
req_active                34367      3436       1265403433[reqs]
54803          1        10.12        127      8.66
req_timeout               34368      3436       1265403434[sec]
34368          1         3.26        169      7.82
reqbuf_avail              70115      7011       2746703429[bufs]
17929503        157       249.87        256      6.44
ldlm_ibits_enqueue        34351      3435       1244343736[reqs]
34351          1         1.00          1      0.00

This is without the ls that locks up the box.

/proc/fs/lustre/mdt/MDS/mds/stats @ 1226456999.791520
Name                      Cur.Count  Cur.Rate   #Events   Unit
 last        min          avg        max    stddev
req_waittime              10664      1066       1266340431[usec]
136698          3        80.54    2068859    682.85
req_qdepth                10664      1066       1266340431[reqs]
 2162          0         1.40        317      2.18
req_active                10664      1066       1266340431[reqs]
14923          1        10.12        127      8.65
req_timeout               10664      1066       1266340431[sec]
10664          1         3.26        169      7.82
reqbuf_avail              23857      2385       2748636467[bufs]
6102353        157       249.87        256      6.44
ldlm_ibits_enqueue        10647      1064       1245279702[reqs]
10647          1         1.00          1      0.00