[Lustre-discuss] collectl

Tom.Wang Tom.Wang at Sun.COM
Tue Jul 29 10:38:22 PDT 2008


Mark Seger wrote:
>
>
> Tom.Wang wrote:
>> Hi, Mark
>> Mark Seger wrote:
>>> I recently got an email from a collectl/lustre user who reported I'm 
>>> not handling all the data for an MDS properly and showed me the 
>>> following from his stats file:
>>>
>>> snapshot_time             1217282595.543548 secs.usecs
>>> req_waittime              170215294 samples [usec] 0 1992338 23146560348 30473383620091658
>>> req_qdepth                170215294 samples [reqs] 0 9444 124827689 348732141299
>>> req_active                170215294 samples [reqs] 1 127 352457248 5984384772
>>> req_timeout               170215294 samples [sec] 1 301 442496292 28785551992
>>> reqbuf_avail              410936563 samples [bufs] 128 1024 420313410439 430054594284593
>>> ldlm_flock_enqueue        4 samples [reqs] 1 1 4 4
>>> ldlm_ibits_enqueue        139193009 samples [reqs] 1 1 139193009 139193009
>>> mds_reint_create          11018837 samples [reqs] 1 1 11018837 11018837
>>> mds_reint_link            51315 samples [reqs] 1 1 51315 51315
>>> mds_reint_setattr         395400 samples [reqs] 1 1 395400 395400
>>> mds_reint_rename          224241 samples [reqs] 1 1 224241 224241
>>> mds_reint_unlink          13109877 samples [reqs] 1 1 13109877 13109877
>>> mds_getattr               49739 samples [usec] 10 271559 7792733 655474619433
>>> mds_connect               2373 samples [usec] 12 1300137 5532759 3705296256699
>>> mds_disconnect            291 samples [usec] 18 3047 76662 88841758
>>> mds_getstatus             899 samples [usec] 6 41 9824 114260
>>> mds_statfs                4090 samples [usec] 5 79121 1205331 45689777475
>>> mds_sync                  1190 samples [usec] 1843 5014449 236091207 450545148159775
>>> mds_quotactl              2049 samples [usec] 7 883434 7687131 3232395885317
>>> mds_getxattr              36089 samples [usec] 9 8996 675208 252525110
>>> mds_setxattr              1230 samples [usec] 123 10110 263367 225741995
>>> obd_ping                  6124258 samples [usec] 0 30366 67518884 4513131130
>>>
>>> I had never heard of these 'reint' counters, nor seen many of the 
>>> others, so I assumed they were added recently. The interesting thing 
>>> is that I have 1.6.5.1 installed and my stats file shows:
>>>
>>> snapshot_time             1217344821.777664 secs.usecs
>>> req_waittime              5500 samples [usec] 6 399 57198 831384
>>> req_qdepth                5500 samples [reqs] 0 2 16 18
>>> req_active                5500 samples [reqs] 1 2 5514 5542
>>> req_timeout               5500 samples [sec] 1 10 5572 6292
>>> reqbuf_avail              12650 samples [bufs] 511 512 6475895 3315195785
>>> ldlm_ibits_enqueue        1646 samples [reqs] 1 1 1646 1646
>>> mds_reint_unlink          8 samples [reqs] 1 1 8 8
>>> mds_getattr               17 samples [usec] 11 14567 15655 212882791
>>> mds_connect               24 samples [usec] 33 1418 2948 2142530
>>> mds_getstatus             17 samples [usec] 8 15 165 1657
>>> mds_statfs                55 samples [usec] 7 48 762 15484
>>> mds_sync                  800 samples [usec] 803 40698 2188133 15429580807
>>> obd_ping                  2933 samples [usec] 6 48 48843 1130361
>> The MDS_REINT and LDLM request information reported in /proc has been 
>> more detailed since 1.6.5; see bug 14184.
> I don't think so.  It looks like a bunch of patches and random 
> comments.  Are we both talking about - 
> https://bugzilla.lustre.org/show_bug.cgi?id=14184 ?
> The first thing I did was to see what it said about those 5 reint 
> counters and so searched for "mds_reint_" and found none of them.  
> Perhaps there are some clues to the meanings to some of the counters 
> but certainly not all of them.
What I meant is that the original MDS_REINT request counter has been split 
into five sub-request counters (that is what I meant by "detailed"; sorry 
about the confusion). In the original implementation there was only one 
"mds_reint" request counter in the stats file. We changed that into five 
specific mds_reint request counters, so the MDS now records which kind of 
reint request it has handled. I am not sure whether the Lustre manual 
describes all of these stats; could you check that?
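
For a tool that used to read the single "mds_reint" counter, one way to get a
comparable total is to sum the five sub-counters. This is only a minimal
sketch, not code from Lustre or collectl: it assumes each stats line has the
form "<name> <count> samples [unit] ..." as in the samples quoted above, and
the helper names (sample_counts, total_reint) are hypothetical.

```python
# Hypothetical helpers: these names are illustrative, not part of Lustre or collectl.
REINT_COUNTERS = (
    "mds_reint_create",
    "mds_reint_link",
    "mds_reint_setattr",
    "mds_reint_rename",
    "mds_reint_unlink",
)


def sample_counts(stats_text):
    """Map each counter name to its sample count (second field of the line)."""
    counts = {}
    for line in stats_text.splitlines():
        fields = line.split()
        # Assumed layout: "<name> <count> samples [unit] ...".
        # snapshot_time and blank lines do not match and are skipped.
        if len(fields) >= 3 and fields[2] == "samples":
            counts[fields[0]] = int(fields[1])
    return counts


def total_reint(stats_text):
    """Sum the five mds_reint_* counters; a missing counter counts as zero."""
    counts = sample_counts(stats_text)
    return sum(counts.get(name, 0) for name in REINT_COUNTERS)


example = """\
snapshot_time             1217344821.777664 secs.usecs
ldlm_ibits_enqueue        1646 samples [reqs] 1 1 1646 1646
mds_reint_unlink          8 samples [reqs] 1 1 8 8
mds_getattr               17 samples [usec] 11 14567 15655 212882791
"""

if __name__ == "__main__":
    print(total_reint(example))
```

A counter that has never been incremented does not appear in the file at all
(the 1.6.5.1 sample above only lists mds_reint_unlink), so missing names are
treated as zero rather than as an error.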

>
> Also, you didn't answer my question about what types of operations 
> will cause each of the different counters to increment.  
Hmm, I assume you are only asking about the mds_reint requests? Each of 
these counters is incremented when the MDS identifies the request and is 
ready to handle it. So:

mds_reint_unlink:  unlink a file
mds_reint_create: mknod or mkdir
mds_reint_link: link a file
mds_reint_setattr: chmod/chacl or other setattr meta ops.
mds_reint_rename: rename a file.
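
To turn these counters into per-operation rates, a monitoring tool could take
two snapshots of the stats file and divide the change in each sample count by
the elapsed snapshot_time. The sketch below is only an illustration under
assumptions: it assumes the counts are cumulative between snapshots and that
lines look like the samples quoted earlier in this thread; parse_stats and
reint_rates are hypothetical names, not collectl or Lustre functions.

```python
# Operation descriptions come from the list above; helper names are hypothetical.
REINT_OPS = {
    "mds_reint_unlink": "unlink a file",
    "mds_reint_create": "mknod or mkdir",
    "mds_reint_link": "link a file",
    "mds_reint_setattr": "chmod/chacl or other setattr meta ops",
    "mds_reint_rename": "rename a file",
}


def parse_stats(text):
    """Return (snapshot_time, {counter_name: sample_count}) for one stats file."""
    snapshot = None
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "snapshot_time":
            snapshot = float(fields[1])
        elif len(fields) >= 3 and fields[2] == "samples":
            counts[fields[0]] = int(fields[1])
    if snapshot is None:
        raise ValueError("no snapshot_time line found")
    return snapshot, counts


def reint_rates(before_text, after_text):
    """Requests per second for each mds_reint_* counter between two snapshots."""
    t0, c0 = parse_stats(before_text)
    t1, c1 = parse_stats(after_text)
    elapsed = t1 - t0
    rates = {}
    for name in REINT_OPS:
        # Assumption: counts are cumulative, so the difference is the activity
        # in the interval; a counter absent from the file is treated as zero.
        delta = c1.get(name, 0) - c0.get(name, 0)
        rates[name] = delta / elapsed if elapsed > 0 else 0.0
    return rates


before = """\
snapshot_time             100.000000 secs.usecs
mds_reint_unlink          8 samples [reqs] 1 1 8 8
"""
after = """\
snapshot_time             110.000000 secs.usecs
mds_reint_unlink          28 samples [reqs] 1 1 28 28
mds_reint_create          5 samples [reqs] 1 1 5 5
"""

if __name__ == "__main__":
    for name, rate in reint_rates(before, after).items():
        print(f"{name:20s} {rate:8.2f}/s  ({REINT_OPS[name]})")
```

The before/after text here is made-up test input, not real MDS output.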

Hope I did not miss your questions this time.

> Would it be easier and more beneficial to other users to move this to 
> lustre-discuss?
Yes.
>
> -mark
>
>
Thanks
WangDi


-- 
Regards,
Tom Wangdi    
--
Sun Lustre Group
System Software Engineer 
http://www.sun.com



