profile on
profile off
These commands call the ProfileSupport functions for you. After you call ProfileSupport.Off(), the collected PC counts are stored in memory so that they can be read by the debugger.
If you want to turn profiling on and off rapidly from the shell, you can say:
profile cmd
This starts profiling immediately before running the shell command and stops it immediately after, which reduces the amount of extraneous code that is profiled.
FlatProfile.perl tweed kernel/sal/SPIN/spin_kernel.sys > profile
The script will print its results to standard output so you can redirect it to a file or a pipe. The output of FlatProfile.perl looks like this:
Count Percent Procedure
----- ------- ---------
47072 81.62 scc_putc
9146 15.86 MachineCPUPrivate__RestoreInterruptLevel
788 1.37 MachineCPUPrivate__SetInterruptLevel
476 0.83 Thread__Idle
31 0.05 sccputc
30 0.05 prf
.... continues ....
The Count column is the raw number of ClockTick events in that procedure, so
47072 ticks * 1 millisecond/tick = 47.072 seconds. The Percent column is the
percentage of ClockTick events in that procedure and the last column is
the symbolic name of the procedure.
The second problem with PC sampling is that you have to collect large frequency counts to get accurate information. For example, if you have a procedure which usually takes less than one millisecond to execute, then it will most likely not receive a ClockTick every time it runs. You will need to ensure that the procedure runs many times in order to get a statistically meaningful number of ClockTick events.
In order to build a profiled version of SPIN or of some subpart of SPIN, give a definition (any definition) for the variable PROFILE on the gmake command line. For example,
gmake PROFILE=TRUE kernel
will build the kernel with profiling.
gmake PROFILE=ON
will build everything. The profiled code is built in separate directories from the unprofiled code. For Modula-3 sources, this happens by defining a new target called ALPHA_SPIN_PROF. Within sal, we have a new configuration called SPIN_PROF. The final bootable kernel is called kernel/sal/SPIN_PROF/spin.boot.
To build the unprofiled code, just type the same commands as before, such as "gmake kernel" and "gmake user."
Even though the system automatically tries to load either all profiled or all unprofiled code, you should be able to combine profiled and unprofiled code freely in your system. Let me know if there are any problems with this.
You can profile the kernel while it is loading extensions. A revision
to the profiling implementation has removed the earlier restriction
which prevented profiling while linking new code.
When you start m3gdbttd, specify the name of the kernel you are
running. Connect to your crash machine and run "domain sweep"
just like you would to debug your code. Next run "gprof" which
makes m3gdbttd read the profile information and write it into
two files in your current directory. The files are called "dlinked.syms"
and "gmon.out". "dlinked.syms" contains the names and addresses of
the procedures in your spin kernel, including all of the dynamically
linked code. "gmon.out" contains the PC sample counts and procedure
call counts gathered by the profiling code. You can detach from
your crash machine and quit after running the "gprof" command.
Use the "spinprof" program in spin/local/bin to analyze the two
files that were written by m3gdbttd. "spinprof" reads in the files
and prints out a long listing that breaks down the amount of time
spent in a procedure on behalf of each caller. I'll describe this in
more detail below. You will probably want to direct the output of
"spinprof" to a file since it will be thousands of lines.
An example session of the Unix commands to type to read a profile appears below.
There is a Perl script which will execute these commands for you
automatically. You just have to give it the name of your crash machine
and kernel to be debugged and the profile information will be printed
on the standard output. To do the same thing as the manual m3gdbttd session shown below, you would use the single GraphProfile.perl command listed after it.
Below is an example of the first type of listing, with parent and child
procedures:
The procedures listed before PhysAddr.Deallocate are its callers or parents.
These were MemoryForSAL.FreePhysPage and MemoryObject.Destroy. MemoryForSAL.FreePhysPage
called PhysAddr.Deallocate 184 times and MemoryObject.Destroy called PhysAddr.Deallocate
3000 times.
The procedures listed after PhysAddr.Deallocate are its callees or children.
These three procedures, PhysFreeMem.DeallocatePage, RTHooks.LockMutex and
RTHooks.UnlockMutex, were each called 3184 times from PhysAddr.Deallocate.
The entry 3184/3184 for PhysFreeMem.DeallocatePage shows that
PhysAddr.Deallocate called PhysFreeMem.DeallocatePage 3184 times and that
PhysFreeMem.DeallocatePage was called a total of 3184 times from all of its
parents. Compare this to RTHooks.LockMutex, which was called 3184 times from
PhysAddr.Deallocate but a total of 43,330 times from all of its callers.
If a cycle occurs in the call graph, then you will see a listing for the
cycle which shows the total time spent in it. There is a cycle which
occurs in the garbage collector that you will probably see in your own
profiles. I verified that this cycle really occurs and is not an artifact of
a profiling bug.
The second listing is a flat profile ranking the procedures by the total time spent
in them, not counting their descendants. An example appears below:
A few C and assembly files need to know whether call graph profiling is
enabled. They have statements of the form #ifdef CALL_GRAPH inside them
and their m3makefiles define CALL_GRAPH when profiling is enabled.
One of these files is kernel/spincore/src/machine/alpha/Profile.s. It
contains the actual procedure which counts the execution of call graph arcs.
The procedure is named _mcount.
Some of our kernel code, such as sal and the device driver code, is
compiled with the DEC C compiler. This compiler inserts calls to
_mcount in an unusual way: it reserves extra space in each procedure,
filled with no-ops. The no-ops are overwritten with calls to _mcount
by the ld program at link time. This creates a problem for us when compiling
device driver code because it will be dynamically linked into our kernel
and will never have profiling calls inserted by ld. To get around this
problem, there is a special Perl script called InsertMcount.perl which
is used to compile the device driver code with profiling enabled.
We have designed a flexible profiling system which can profile PC values
occurring anywhere in the Alpha's 64-bit address space. This is necessary
for profiling while dynamically linking new code which can be placed at
an unpredictable location in memory. The mechanism we use is a series of
three tables which map from the pair (caller PC, callee PC)
to the data structure which contains the execution count and elapsed cycles
for that call arc. Here is a picture of the tables:
The only limit on how long the profiler can run is the size of the
second and third tables. I anticipate that the third table is the
likely bottleneck, so you can change its size with the command
"profile arcs [size]" from the SPIN shell. The default size is 10000.
When you turn profiling off, you will get a warning message if you
ran out of room in either of the tables.
Extracting the profile information
After you turn off profiling, the information is stored in the
memory of the crash machine. You need to use m3gdbttd to connect
to the machine and read it out. First I will describe the individual
commands that you can use in m3gdbttd and then I'll describe the
perl script that can do everything for you.
spinach: m3gdbttd kernel/sal/SPIN_PROF/spin_kernel.sys
(gdb) ta t tweed
(gdb) domain sweep
(gdb) gprof
(gdb) det
(gdb) q
spinach: spinprof > profile.output
Or, using the Perl script that performs all of the above steps:
spinach: GraphProfile.perl tweed kernel/sal/SPIN_PROF/spin_kernel.sys > profile.output
Interpreting the output
The "spinprof" program produces the same style of output as "gprof." The
procedures are numbered according to how much time was spent in them,
number 1 being the most time. You will see this number in brackets next
to the procedure name. First "spinprof"
prints a description of its own output, then it prints each procedure
with its callers (parents) and callees (children). Then it prints a flat
profile which shows the ranking of time spent in each procedure. Finally
it prints an index of all procedures in alphabetical order.
granularity: each sample hit covers 4 byte(s) for 0.01% of 12.12 seconds
called/total parents
index %time self descendents called+self name index
called/total children
-----------------------------------------------
0.00 0.00 184/3184 MemoryForSAL__FreePhysPage [145]
0.00 0.03 3000/3184 MemoryObject__Destroy [49]
[79] 0.3 0.01 0.03 3184 PhysAddr__Deallocate [79]
0.00 0.02 3184/3184 PhysFreeMem__DeallocatePage [89]
0.00 0.00 3184/43330 RTHooks__LockMutex [77]
0.00 0.00 3184/55644 RTHooks__UnlockMutex [71]
This is the listing for PhysAddr.Deallocate, which was 79th in the ranking
of time spent inside it (including its descendants). PhysAddr.Deallocate
was called 3184 times and 0.3% of the total time was spent in it and in its
descendants. The breakdown was 0.01 seconds spent inside PhysAddr.Deallocate and
0.03 seconds inside of its descendants.
granularity: each sample hit covers 4 byte(s) for 0.01% of 12.12 seconds
% cumulative self self total
time seconds seconds calls ms/call ms/call name
66.6 8.06 8.06 Thread__Idle [1]
20.3 10.53 2.47 2116 1.17 1.17 scc_putc [13]
4.0 11.01 0.48 swap_ipl [20]
0.9 11.12 0.11 424 0.25 0.25 pmap_zero_page [46]
0.7 11.20 0.08 bzero [50]
0.4 11.25 0.05 408 0.12 0.13 pmap_destroy [65]
0.3 11.29 0.04 __remq [72]
You can see that some procedures do not have a number of calls listed
for them. This happens either because the procedure was never called by
another, such as Thread.Idle, which is started by a context switch, or because
the procedure is written in assembly language, such as swap_ipl, bzero,
or __remq.
2.2 How call graph profiling is implemented.
Build environment
At build time, you must give a definition for the PROFILE variable on the gmake command line, as described above.
Runtime
Call graph profiling requires some extra work at runtime when profiling
is turned on and off. For the most part, this work is to allocate some
space to hold the call graph arc counts and to set up a table which
maps PC values into indices in an array of call graph arcs.
Caller PC Hash Full Caller PC Callee PC + Profile Info
-------------- ------------------- ------------------
| unsigned | - | struct CallerPC | --> | struct ArcInfo |
|------------| \ |-----------------| |----------------|
| | -> | | | |
-------------- ------------------- ------------------
This table is This table This table has
keyed by a 16-bit remembers the the count of the
hash of the whole caller PC. number of times the
caller PC value.                       call arc occurred.
Let's walk through a typical profiling hit to see how these data structures
work together. We begin with the values ParentPC, ChildPC and Cycles,
which are the PC of the caller and the callee and the number of cycles
elapsed since the last invocation of the profiler. We compute a 16-bit
hash of ParentPC and use it as an index into the Caller PC Hash table.
This gives us a small integer index, call it ParentIndex, which is
the offset of the head of a list in the second table. This list contains
all Caller PC values that have occurred which hashed to ParentIndex. We
walk down the list until we find the PC value that matches ParentPC.
This element of the list contains an index into the third table,
call it ChildIndex, which is the offset of a list of Callee PC values
that have been called from ParentPC. We walk down this list until
we find the PC value matching ChildPC and then we add the Cycles
to this entry in the third table.
garrett@cs.washington.edu