[Lustre-devel] Lustre RPC visualization

Michael Kluge Michael.Kluge at tu-dresden.de
Fri May 28 07:54:33 PDT 2010


Hi WangDi,

> Looks great! Just one question: you said, "All these counters can be
> broken down by the type of RPC (op code)". You actually implemented
> that, but it is not shown in the attached picture?

Yes.

> And could you please also add "Server queued RPCs" over time?

Already done.

Some good news: the feature that lets Vampir show something like a heat
map (Eric asked about this) comes back with the release at ISC. It is
now called "performance radar". It can produce a heat map for a counter
and does some other things as well. I could send a picture around, but
I first need a bigger trace (more hosts generating traces in
parallel).


Regards, Michael

> Thanks
> WangDi
> 
> Michael Kluge wrote:
> > Hi WangDi,
> >
> > So, for the moment, I am done with what I promised. The remaining
> > work is mainly debugging with more input data sets. A screenshot of
> > Vampir showing the derived counter values for the RPC processing/queue
> > times on the server and the client is attached. Units for the values
> > are either microseconds or a plain count.
> >
> >
> > Regards, Michael
> >
> > On Sunday, 16.05.2010, at 11:29 +0200, Michael Kluge wrote:
> >   
> >> Hi WangDi,
> >>
> >> The first version works. A screenshot is attached. I have implemented
> >> a couple of counters: RPCs in flight and RPCs completed in total on
> >> the client; RPCs enqueued, RPCs in processing and RPCs completed in
> >> total on the server. All these counters can be broken down by the type
> >> of RPC (op code). The picture does not yet have the lines that show
> >> each single RPC. I still have to do counters like "avg. time to
> >> complete an RPC over the last second", and there are some more TODOs,
> >> like the timer synchronization. (In the screenshot the first and the
> >> last counter show total values while the one in the middle shows a
> >> rate.)
> >>
> >> What I would like to have is a complete set of traces from a small
> >> cluster (<100 nodes) including the servers. Would that be possible?
> >>
> >> Is one of you in Hamburg from May 31 to June 3 for ISC'2010? I'll be
> >> there and would like to talk about what would be useful for the next
> >> steps.
> >>
> >>
> >> Regards, Michael
> >>
> >> On 03.05.2010 at 21:52, di.wang wrote:
> >>     
> >>> Michael Kluge wrote:
> >>>       
> >>>>>> One more question: RPC 1334380768266400 (in the log WangDi sent me)
> >>>>>> has on the client side only a "Sending RPC" message, thus missing the
> >>>>>> "Completed RPC". The server has all three (received, start work, done
> >>>>>> work). Has this RPC vanished on the way back to the client? There is
> >>>>>> no further indication of what happened. The last timestamp in the client
> >>>>>> log is:
> >>>>>> 1272565368.228628
> >>>>>> and the server says it finished the processing of the request at:
> >>>>>> 1272565281.379471
> >>>>>> So the client log has been recorded long enough to contain the
> >>>>>> "Completed RPC" message for this RPC if it ever arrived ...
> >>>>>>             
> >>>>> Logically, yes. But in some cases debug log records get dropped for
> >>>>> some reason (actually, this is not rare). You probably need to
> >>>>> maintain an average time from the server "Handled RPC" to the client
> >>>>> "Completed RPC" and then estimate the client "Completed RPC" time
> >>>>> from it in such a case.
> >>>>>           
> >>>> Oh my gosh ;) I don't want to start speculating about the usefulness
> >>>> of incomplete debug logs. Anyway, what can get lost? Any kind of
> >>>> message on the servers and clients? I would like to know which
> >>>> cases have to be handled while I try to track individual RPCs on
> >>>> their way.
> >>>>         
> >>> Any record can get lost here. Unfortunately, there are no messages
> >>> indicating that records went missing. :(
> >>> (Usually, I would check the timestamps in the log, i.e. look for no
> >>> records for a "long" time, for example several seconds, but this is
> >>> not an accurate method.)
> >>>
> >>> I guess you can just ignore these incomplete records as a first
> >>> step? Let's see how the incomplete logs impact the profiling
> >>> result; then we can decide how to deal with this.
> >>>
> >>> Thanks
> >>> Wangdi
> >>>       
> >>>> Regards, Michael
> >>>> _______________________________________________
> >>>> Lustre-devel mailing list
> >>>> Lustre-devel at lists.lustre.org
> >>>> http://lists.lustre.org/mailman/listinfo/lustre-devel
> >>>>         
> >>>       
> >>     
> >
> >   
> >
> > ------------------------------------------------------------------------
> >
> 
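For illustration, the counters described in the quoted message of 16.05.2010
(RPCs in flight and completed on the client; RPCs enqueued, in processing and
completed on the server, each broken down by op code) could be derived from a
stream of RPC events roughly as in the following Python sketch. The event
names, the tuple layout and the function names are assumptions made for this
example; they are not the literal Lustre debug log messages and not the code
used for the Vampir traces.

```python
from collections import Counter

# Hypothetical event names, not the literal Lustre debug log messages.
# Each event kind maps to a list of (counter name, delta) updates.
UPDATES = {
    "client_send":  [("client_in_flight", +1)],
    "client_done":  [("client_in_flight", -1), ("client_completed", +1)],
    "server_recv":  [("server_enqueued", +1)],
    "server_start": [("server_enqueued", -1), ("server_processing", +1)],
    "server_done":  [("server_processing", -1), ("server_completed", +1)],
}


def derive_counters(events):
    """Return [(timestamp, snapshot)] for (timestamp, kind, opcode) events.

    Each snapshot maps (counter name, opcode) to its value after the event.
    """
    counters = Counter()
    samples = []
    for ts, kind, opcode in sorted(events):
        for name, delta in UPDATES[kind]:
            counters[(name, opcode)] += delta
        samples.append((ts, dict(counters)))
    return samples


def total(snapshot, name):
    """Sum one counter over all op codes."""
    return sum(v for (n, _), v in snapshot.items() if n == name)
```

Because records can be lost, a "current value" counter such as
client_in_flight may drift or even become negative in this sketch; the
"completed in total" counters only ever grow.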
> 
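The heuristic WangDi mentions in the quoted thread (looking for stretches with
no records for a "long" time) can be sketched in Python as below. The function
name and the default threshold of 2 seconds are assumptions for this example;
the thread only says "several seconds" and notes that the method is not
accurate.

```python
def find_gaps(timestamps, threshold=2.0):
    """Return (start, end) pairs of consecutive log timestamps (in seconds)
    that are more than `threshold` seconds apart.

    A reported gap only hints that records may have been dropped;
    the node could simply have been idle during that time.
    """
    ts = sorted(timestamps)
    return [(a, b) for a, b in zip(ts, ts[1:]) if b - a > threshold]
```

For example, find_gaps([0.0, 0.5, 1.0, 6.0, 6.2]) reports the single
interval (1.0, 6.0).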

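WangDi's suggestion in the quoted thread (maintain an average time from the
server "Handled RPC" to the client "Completed RPC" and use it to guess a
missing client completion time) could look roughly like this in Python. The
dictionary layout, the key names and the function name are assumptions for
this example. It also assumes that client and server timestamps are
comparable; any constant clock offset between the two hosts ends up inside
the average.

```python
from statistics import mean


def estimate_missing_completions(rpcs):
    """Estimate client completion times for RPCs whose record was lost.

    `rpcs` maps an RPC id to a dict with the hypothetical keys
    "server_done" and "client_done" (seconds, or None if missing).
    Returns {rpc id: estimated client completion time}.
    """
    # Average server-done -> client-done delay over all complete RPCs.
    delays = [
        r["client_done"] - r["server_done"]
        for r in rpcs.values()
        if r.get("client_done") is not None and r.get("server_done") is not None
    ]
    if not delays:
        return {}  # nothing to base an estimate on
    avg = mean(delays)
    # Apply the average to every RPC that has only the server-side record.
    return {
        rpc_id: r["server_done"] + avg
        for rpc_id, r in rpcs.items()
        if r.get("client_done") is None and r.get("server_done") is not None
    }
```

The result is only a guess: it fills the gap for visualization and should be
marked as estimated rather than measured.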
-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:    (+49) 351 463-37773
e-mail: michael.kluge at tu-dresden.de
WWW:    http://www.tu-dresden.de/zih

