[Lustre-discuss] collectl

Mark Seger Mark.Seger at hp.com
Tue Jul 29 11:36:47 PDT 2008


Thanks for posting this...

Just to be a little clearer, what I'm trying to do is multipurpose. My 
main objective is to make sure collectl gathers and reports 
comprehensive/accurate lustre data.  The next most important thing is 
that users of collectl (including me) understand what that data means.

As for making sure I collect the correct data, it sounds like you're 
saying the only thing that changed in 1.6.5.1 is splitting reint into 5 
new counters, which is certainly something I can do as well but it will 
require a code change and will also result in incorrect data being 
reported in earlier versions of collectl.
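
A rough sketch of how a collector could cope with both layouts. It assumes the older combined counter is named mds_reint (as described further down in this thread) and that a counter missing from the file simply means zero. The helper name and the dictionary layout are hypothetical and are not collectl's actual code:

```python
# Hypothetical helper, not collectl code: report one total reint count
# whether the MDS exposes the old combined counter or the split ones.
SPLIT_REINT = (
    "mds_reint_create",
    "mds_reint_link",
    "mds_reint_setattr",
    "mds_reint_rename",
    "mds_reint_unlink",
)


def reint_total(counts):
    """counts maps counter name -> sample count taken from the stats file."""
    if "mds_reint" in counts:
        # Older layout: a single combined counter.
        return counts["mds_reint"]
    # Newer layout: sum whichever split counters have appeared so far.
    return sum(counts.get(name, 0) for name in SPLIT_REINT)
```

Reporting the five values separately would still need the code change mentioned above; this only keeps a combined total meaningful on either version.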

From the perspective of understanding what the data means, I'd like to 
make sure the meanings of the different fields are properly documented 
in collectl, and since none of this seems to be written down anywhere 
else, that could actually become a very useful document to others.  Right 
now if you look at http://collectl.sourceforge.net/Data.html you can see the 
data definitions for all collectl data, including lustre.  Some 
definitions are good while others could use more work.

One thing that confuses me about lustre counters, and maybe others, is I 
don't really know what they mean, when they change and in fact how to 
stimulate them to change.  For example, on my system I'm doing a watch 
of /proc/fs/lustre/mdt/MDS/mds/stats and only see 1 reint counter, 
because the others are all 0.  So I went and did some file renames, and 
chmods and sure enough, the other counters did appear.  Cool!
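
To illustrate the "counters only show up once they are non-zero" behavior, here is a minimal parsing sketch. It assumes each counter line has the form "name count samples [unit] ..." as in the stats output quoted below; parse_stats is a made-up name, and the sample text is a shortened copy of my own stats file:

```python
def parse_stats(text):
    """Return {counter_name: sample_count} from the text of a stats file."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        # Counter lines look like: name count "samples" [unit] ...
        # The snapshot_time line has no "samples" field and is skipped.
        if len(fields) >= 3 and fields[2] == "samples":
            counts[fields[0]] = int(fields[1])
    return counts


sample = """\
snapshot_time             1217344821.777664 secs.usecs
req_waittime              5500 samples [usec] 6 399 57198 831384
mds_reint_unlink          8 samples [reqs] 1 1 8 8
"""
counts = parse_stats(sample)
print(counts.get("mds_reint_unlink", 0))  # counter present in the file
print(counts.get("mds_reint_rename", 0))  # never incremented, so absent
```

Looking counters up with a default of 0 is the design choice that matters here: an absent counter is treated as "not yet incremented" rather than as an error.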

The easiest thing for me to do is to simply say that reint_setattr 
counts the number of setattrs, but that would be a pretty weak 
definition. When I did a single chmod on 100 files, setattr only 
incremented by 1 and I expected it to increment by 100.  This means that 
the counter is not a count of how many files actually had their attributes 
changed, which is what I had thought.  It would seem to me that this 
level of detail should be captured somewhere.  And why didn't 
reint_setattr get incremented when I created the files?  Shouldn't it have 
been?  And if not, shouldn't that be noted somewhere as well?
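
One way to run this kind of experiment repeatably is to take a snapshot of the counters before and after an operation and compare them. The sketch below assumes each snapshot is already a dictionary of counter name to sample count; counter_deltas is a hypothetical name, and the numbers are made up to mirror the chmod test rather than copied from a real run:

```python
def counter_deltas(before, after):
    """Return {name: increase} for counters that changed between snapshots."""
    deltas = {}
    for name in set(before) | set(after):
        # A counter missing from a snapshot is taken to be zero.
        change = after.get(name, 0) - before.get(name, 0)
        if change:
            deltas[name] = change
    return deltas


# Illustrative values only: setattr appears after the chmod, unlink is unchanged.
before = {"mds_reint_unlink": 8}
after = {"mds_reint_unlink": 8, "mds_reint_setattr": 1}
print(counter_deltas(before, after))
```

Recording which operation produced which delta would be a starting point for the per-counter definitions.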

That's just 2 reint counters.  How about all the other mds counters?  
Are there any missing from the list below?  Remember, since you only see 
them after they're incremented you don't always see them, and that's 
apparently what happened to me with the reint counters.

What I really don't want to do in this thread is get into a low level 
discussion about each counter and what it means, but if someone is 
willing to write the words, I'm willing to put them in my document.  If 
people want to discuss the finer points of any of the definitions in a 
different thread (or borrow this one) that'd be fine too, but I don't 
want to be the one responsible for the words or all you're going to see 
is 'reint_setattr counts the number of setattr calls' and I really don't 
think that would be all that useful to anyone.

-mark

Tom.Wang wrote:
> Mark Seger wrote:
>>
>>
>> Tom.Wang wrote:
>>> Hi, Mark
>>> Mark Seger wrote:
>>>> I recently got an email from a collectl/lustre user who reported 
>>>> I'm not handling all the data for an MDS properly and showed me the 
>>>> following from his stats file:
>>>>
>>>> snapshot_time             1217282595.543548 secs.usecs
>>>> req_waittime              170215294 samples [usec] 0 1992338 23146560348 30473383620091658
>>>> req_qdepth                170215294 samples [reqs] 0 9444 124827689 348732141299
>>>> req_active                170215294 samples [reqs] 1 127 352457248 5984384772
>>>> req_timeout               170215294 samples [sec] 1 301 442496292 28785551992
>>>> reqbuf_avail              410936563 samples [bufs] 128 1024 420313410439 430054594284593
>>>> ldlm_flock_enqueue        4 samples [reqs] 1 1 4 4
>>>> ldlm_ibits_enqueue        139193009 samples [reqs] 1 1 139193009 139193009
>>>> mds_reint_create          11018837 samples [reqs] 1 1 11018837 11018837
>>>> mds_reint_link            51315 samples [reqs] 1 1 51315 51315
>>>> mds_reint_setattr         395400 samples [reqs] 1 1 395400 395400
>>>> mds_reint_rename          224241 samples [reqs] 1 1 224241 224241
>>>> mds_reint_unlink          13109877 samples [reqs] 1 1 13109877 13109877
>>>> mds_getattr               49739 samples [usec] 10 271559 7792733 655474619433
>>>> mds_connect               2373 samples [usec] 12 1300137 5532759 3705296256699
>>>> mds_disconnect            291 samples [usec] 18 3047 76662 88841758
>>>> mds_getstatus             899 samples [usec] 6 41 9824 114260
>>>> mds_statfs                4090 samples [usec] 5 79121 1205331 45689777475
>>>> mds_sync                  1190 samples [usec] 1843 5014449 236091207 450545148159775
>>>> mds_quotactl              2049 samples [usec] 7 883434 7687131 3232395885317
>>>> mds_getxattr              36089 samples [usec] 9 8996 675208 252525110
>>>> mds_setxattr              1230 samples [usec] 123 10110 263367 225741995
>>>> obd_ping                  6124258 samples [usec] 0 30366 67518884 4513131130
>>>>
>>>> and I've never heard of all these 'reint' variables, nor have I seen 
>>>> many of the others either, and so thought they were recently 
>>>> added.
>>>> The interesting thing is I have 1.6.5.1 installed and my stats file 
>>>> shows:
>>>>
>>>> snapshot_time             1217344821.777664 secs.usecs
>>>> req_waittime              5500 samples [usec] 6 399 57198 831384
>>>> req_qdepth                5500 samples [reqs] 0 2 16 18
>>>> req_active                5500 samples [reqs] 1 2 5514 5542
>>>> req_timeout               5500 samples [sec] 1 10 5572 6292
>>>> reqbuf_avail              12650 samples [bufs] 511 512 6475895 3315195785
>>>> ldlm_ibits_enqueue        1646 samples [reqs] 1 1 1646 1646
>>>> mds_reint_unlink          8 samples [reqs] 1 1 8 8
>>>> mds_getattr               17 samples [usec] 11 14567 15655 212882791
>>>> mds_connect               24 samples [usec] 33 1418 2948 2142530
>>>> mds_getstatus             17 samples [usec] 8 15 165 1657
>>>> mds_statfs                55 samples [usec] 7 48 762 15484
>>>> mds_sync                  800 samples [usec] 803 40698 2188133 15429580807
>>>> obd_ping                  2933 samples [usec] 6 48 48843 1130361
>>> MDS_REINT and LDLM req information update /proc has been detailed 
>>> since 1.6.5, see bug 14184.
>> I don't think so.  It looks like a bunch of patches and random 
>> comments.  Are we both talking about - 
>> https://bugzilla.lustre.org/show_bug.cgi?id=14184 ?
>> The first thing I did was to see what it said about those 5 reint 
>> counters and so searched for "mds_reint_" and found none of them.  
>> Perhaps there are some clues to the meanings to some of the counters 
>> but certainly not all of them.
> What I mean here is that the original MDS_REINT req has been divided 
> into 5 sub-req stats (which is what I meant by "detailed" here, sorry 
> about the confusion).
> In the original implementation, there was only 1 "mds_reint" req counter 
> in the stats file, and we changed it into five specific 
> mds_reint req counters,
> so the MDS knows what kind of reint request it has handled. 
> Not sure whether the lustre manual includes descriptions of all these stats,
> you might check that?
>
>>
>> Also, you didn't answer my question about what types of operations 
>> will cause each of the different counters to increment.  
> Hmm, I assume you are only asking about the mds_reint reqs?  These 
> counters are incremented when the mds identifies the request and is 
> ready to handle it. So
>
> mds_reint_unlink:  unlink a file
> mds_reint_create: mknod or mkdir
> mds_reint_link: link a file
> mds_reint_setattr: chmod/chacl or other setattr meta ops.
> mds_reint_rename: rename a file.
>
> Hope I did not miss your questions this time.
>
>> Would it be easier and more beneficial to other users to move this to 
>> lustre-discuss?
> Yes.
>>
>> -mark
>>
>>
> Thanks
> WangDi
>
>



