2013-01-10 23:25:18 +00:00
|
|
|
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
|
|
|
|
"http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd"
|
|
|
|
[<!ENTITY % poky SYSTEM "../poky.ent"> %poky; ] >
|
|
|
|
|
2013-01-16 00:29:17 +00:00
|
|
|
<chapter id='profile-manual-usage'>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 00:29:17 +00:00
|
|
|
<title>Basic Usage (with examples) for each of the Yocto Tracing Tools</title>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
|
|
|
<para>
|
2013-01-16 00:29:17 +00:00
|
|
|
This chapter presents basic usage examples for each of the tracing
|
|
|
|
tools.
|
2013-01-10 23:25:18 +00:00
|
|
|
</para>
|
|
|
|
|
2013-01-16 00:29:17 +00:00
|
|
|
<section id='profile-manual-perf'>
|
|
|
|
<title>perf</title>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
|
|
|
<para>
|
2013-01-16 00:29:17 +00:00
|
|
|
The 'perf' tool is the profiling and tracing tool that comes
|
|
|
|
bundled with the Linux kernel.
|
2013-01-10 23:25:18 +00:00
|
|
|
</para>
|
|
|
|
|
|
|
|
<para>
|
2013-01-16 00:29:17 +00:00
|
|
|
Don't let the fact that it's part of the kernel fool you into thinking
|
|
|
|
that it's only for tracing and profiling the kernel - you can indeed
|
|
|
|
use it to trace and profile just the kernel, but you can also use it
|
|
|
|
to profile specific applications separately (with or without kernel
|
|
|
|
context), and you can also use it to trace and profile the kernel
|
|
|
|
and all applications on the system simultaneously to gain a system-wide
|
|
|
|
view of what's going on.
|
2013-01-10 23:25:18 +00:00
|
|
|
</para>
|
|
|
|
|
|
|
|
<para>
|
2013-01-16 00:29:17 +00:00
|
|
|
In many ways, it aims to be a superset of all the tracing and profiling
|
|
|
|
tools available in Linux today, including all the other tools covered
|
|
|
|
in this HOWTO. The past couple of years have seen perf subsume a lot
|
|
|
|
of the functionality of those other tools, and at the same time those
|
|
|
|
other tools have removed large portions of their previous functionality
|
|
|
|
and replaced it with calls to the equivalent functionality now
|
|
|
|
implemented by the perf subsystem. Extrapolation suggests that at
|
|
|
|
some point those other tools will simply become completely redundant
|
|
|
|
and go away; until then, we'll cover those other tools in these pages
|
|
|
|
and in many cases show how the same things can be accomplished in
|
|
|
|
perf and the other tools when it seems useful to do so.
|
2013-01-10 23:25:18 +00:00
|
|
|
</para>
|
|
|
|
|
|
|
|
<para>
|
2013-01-16 00:29:17 +00:00
|
|
|
The coverage below details some of the most common ways you'll likely
|
|
|
|
want to apply the tool; full documentation can be found either within
|
|
|
|
the tool itself or in the man pages at
|
|
|
|
<ulink url='http://linux.die.net/man/1/perf'>perf(1)</ulink>.
|
2013-01-10 23:25:18 +00:00
|
|
|
</para>
|
|
|
|
|
2013-01-16 00:29:17 +00:00
|
|
|
<section id='perf-setup'>
|
|
|
|
<title>Setup</title>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 00:29:17 +00:00
|
|
|
<para>
|
|
|
|
For this section, we'll assume you've already performed the basic
|
|
|
|
setup outlined in the General Setup section.
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 00:29:17 +00:00
|
|
|
<para>
|
|
|
|
In particular, you'll get the most mileage out of perf if you
|
|
|
|
profile an image built with INHIBIT_PACKAGE_STRIP = "1" in your
|
|
|
|
local.conf.
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 00:29:17 +00:00
|
|
|
<para>
|
|
|
|
perf runs on the target system for the most part. You can archive
|
|
|
|
profile data and copy it to the host for analysis, but for the
|
|
|
|
rest of this document we assume you've ssh'ed to the target and
|
|
|
|
will be running the perf commands on the target.
|
|
|
|
</para>
|
|
|
|
</section>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 00:29:17 +00:00
|
|
|
<section id='perf-basic-usage'>
|
|
|
|
<title>Basic Usage</title>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 00:29:17 +00:00
|
|
|
<para>
|
|
|
|
The perf tool is pretty much self-documenting. To remind yourself
|
|
|
|
of the available commands, simply type 'perf', which will show you
|
|
|
|
basic usage along with the available perf subcommands:
|
|
|
|
<literallayout class='monospaced'>
|
|
|
|
root@crownbay:~# perf
|
|
|
|
|
|
|
|
usage: perf [--version] [--help] COMMAND [ARGS]
|
|
|
|
|
|
|
|
The most commonly used perf commands are:
|
|
|
|
annotate Read perf.data (created by perf record) and display annotated code
|
|
|
|
archive Create archive with object files with build-ids found in perf.data file
|
|
|
|
bench General framework for benchmark suites
|
|
|
|
buildid-cache Manage build-id cache.
|
|
|
|
buildid-list List the buildids in a perf.data file
|
|
|
|
diff Read two perf.data files and display the differential profile
|
|
|
|
evlist List the event names in a perf.data file
|
|
|
|
inject Filter to augment the events stream with additional information
|
|
|
|
kmem Tool to trace/measure kernel memory(slab) properties
|
|
|
|
kvm Tool to trace/measure kvm guest os
|
|
|
|
list List all symbolic event types
|
|
|
|
lock Analyze lock events
|
|
|
|
probe Define new dynamic tracepoints
|
|
|
|
record Run a command and record its profile into perf.data
|
|
|
|
report Read perf.data (created by perf record) and display the profile
|
|
|
|
sched Tool to trace/measure scheduler properties (latencies)
|
|
|
|
script Read perf.data (created by perf record) and display trace output
|
|
|
|
stat Run a command and gather performance counter statistics
|
|
|
|
test Runs sanity tests.
|
|
|
|
timechart Tool to visualize total system behavior during a workload
|
|
|
|
top System profiling tool.
|
|
|
|
|
|
|
|
See 'perf help COMMAND' for more information on a specific command.
|
2013-01-16 20:49:45 +00:00
|
|
|
</literallayout>
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<section id='using-perf-to-do-basic-profiling'>
|
|
|
|
<title>Using perf to do Basic Profiling</title>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
As a simple test case, we'll profile the 'wget' of a fairly large
|
|
|
|
file, which is a minimally interesting case because it has both
|
|
|
|
file and network I/O aspects, and at least in the case of standard
|
|
|
|
Yocto images, it's implemented as part of busybox, so the methods
|
|
|
|
we use to analyze it can be applied in much the same way to the whole
|
|
|
|
host of supported busybox applets in Yocto.
|
|
|
|
<literallayout class='monospaced'>
|
2013-01-16 00:29:17 +00:00
|
|
|
root@crownbay:~# rm linux-2.6.19.2.tar.bz2; \
|
|
|
|
wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2
|
2013-01-16 20:49:45 +00:00
|
|
|
</literallayout>
|
|
|
|
The quickest and easiest way to get some basic overall data about
|
|
|
|
what's going on for a particular workload is to profile it using
|
|
|
|
'perf stat'. 'perf stat' basically profiles using a few default
|
|
|
|
counters and displays the summed counts at the end of the run:
|
|
|
|
<literallayout class='monospaced'>
|
2013-01-16 00:29:17 +00:00
|
|
|
root@crownbay:~# perf stat wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2
|
|
|
|
Connecting to downloads.yoctoproject.org (140.211.169.59:80)
|
|
|
|
linux-2.6.19.2.tar.b 100% |***************************************************| 41727k 0:00:00 ETA
|
|
|
|
|
|
|
|
Performance counter stats for 'wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2':
|
|
|
|
|
|
|
|
4597.223902 task-clock # 0.077 CPUs utilized
|
|
|
|
23568 context-switches # 0.005 M/sec
|
|
|
|
68 CPU-migrations # 0.015 K/sec
|
|
|
|
241 page-faults # 0.052 K/sec
|
|
|
|
3045817293 cycles # 0.663 GHz
|
|
|
|
<not supported> stalled-cycles-frontend
|
|
|
|
<not supported> stalled-cycles-backend
|
|
|
|
858909167 instructions # 0.28 insns per cycle
|
|
|
|
165441165 branches # 35.987 M/sec
|
|
|
|
19550329 branch-misses # 11.82% of all branches
|
|
|
|
|
|
|
|
59.836627620 seconds time elapsed
|
2013-01-16 20:49:45 +00:00
|
|
|
</literallayout>
|
|
|
|
Many times such a simple-minded test doesn't yield much of
|
|
|
|
interest, but sometimes it does (see Real-world Yocto bug
|
|
|
|
(slow loop-mounted write speed)).
|
|
|
|
</para>
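<para>
As a rough cross-check, the figures that 'perf stat' prints after
each '#' can be recomputed from the raw counts. The Python sketch
below is only an illustration: the function name and dictionary
keys are hypothetical, the formulas are inferred from the sample
run above rather than taken from the perf sources, and it assumes
that task-clock is reported in milliseconds.
</para>

```python
def derived_metrics(task_clock_ms, elapsed_s, cycles, instructions,
                    branches, branch_misses):
    """Recompute the '#' annotations of a 'perf stat' run from raw counts."""
    task_clock_s = task_clock_ms / 1000.0  # assumption: task-clock is in msec
    return {
        "cpus_utilized": task_clock_s / elapsed_s,
        "ghz": cycles / task_clock_s / 1e9,
        "insns_per_cycle": instructions / cycles,
        "branches_m_per_sec": branches / task_clock_s / 1e6,
        "branch_miss_pct": 100.0 * branch_misses / branches,
    }


# Raw counts copied from the sample 'perf stat wget' run above.
metrics = derived_metrics(
    task_clock_ms=4597.223902,
    elapsed_s=59.836627620,
    cycles=3045817293,
    instructions=858909167,
    branches=165441165,
    branch_misses=19550329,
)

for name, value in metrics.items():
    print(f"{name:20s} {value:.3f}")
```

<para>
Under these assumptions the results agree with the annotations in
the listing, which suggests that the low CPU utilization simply
reflects a workload that spends most of its elapsed time waiting
on the network.
</para>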
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
Also, note that 'perf stat' isn't restricted to a fixed set of
|
|
|
|
counters - basically any event listed in the output of 'perf list'
|
|
|
|
can be tallied by 'perf stat'. For example, suppose we wanted to
|
|
|
|
see a summary of all the events related to kernel memory
|
|
|
|
allocation/freeing along with cache hits and misses:
|
|
|
|
<literallayout class='monospaced'>
|
2013-01-16 00:29:17 +00:00
|
|
|
root@crownbay:~# perf stat -e kmem:* -e cache-references -e cache-misses wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2
|
|
|
|
Connecting to downloads.yoctoproject.org (140.211.169.59:80)
|
|
|
|
linux-2.6.19.2.tar.b 100% |***************************************************| 41727k 0:00:00 ETA
|
|
|
|
|
|
|
|
Performance counter stats for 'wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2':
|
|
|
|
|
|
|
|
5566 kmem:kmalloc
|
|
|
|
125517 kmem:kmem_cache_alloc
|
|
|
|
0 kmem:kmalloc_node
|
|
|
|
0 kmem:kmem_cache_alloc_node
|
|
|
|
34401 kmem:kfree
|
|
|
|
69920 kmem:kmem_cache_free
|
|
|
|
133 kmem:mm_page_free
|
|
|
|
41 kmem:mm_page_free_batched
|
|
|
|
11502 kmem:mm_page_alloc
|
|
|
|
11375 kmem:mm_page_alloc_zone_locked
|
|
|
|
0 kmem:mm_page_pcpu_drain
|
|
|
|
0 kmem:mm_page_alloc_extfrag
|
|
|
|
66848602 cache-references
|
|
|
|
2917740 cache-misses # 4.365 % of all cache refs
|
|
|
|
|
|
|
|
44.831023415 seconds time elapsed
|
2013-01-16 20:49:45 +00:00
|
|
|
</literallayout>
|
|
|
|
So 'perf stat' gives us a nice easy way to get a quick overview of
|
|
|
|
what might be happening for a set of events, but normally we'd
|
|
|
|
need a little more detail in order to understand what's going on
|
|
|
|
in a way that we can act on in a useful way.
|
|
|
|
</para>
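<para>
If you want to post-process such a summary rather than read it by
eye, the counter lines are easy to pick apart. The following
Python sketch is a hypothetical helper, not part of perf: it
assumes the plain "count event-name" layout shown above, handles
integer counts only, and uses a shortened copy of the sample
output as its input.
</para>

```python
import re

# One counter line of 'perf stat' output: an integer count, then an event name.
COUNTER_LINE = re.compile(r"^\s*(\d+)\s+([A-Za-z0-9_:-]+)")


def parse_counts(text):
    """Return {event_name: count} for every integer counter line in text."""
    counts = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


# Shortened copy of the sample output above.
SAMPLE = """
5566 kmem:kmalloc
125517 kmem:kmem_cache_alloc
34401 kmem:kfree
69920 kmem:kmem_cache_free
66848602 cache-references
2917740 cache-misses # 4.365 % of all cache refs

44.831023415 seconds time elapsed
"""

counts = parse_counts(SAMPLE)
miss_pct = 100.0 * counts["cache-misses"] / counts["cache-references"]
print(f"cache miss rate: {miss_pct:.3f} %")
```

<para>
The elapsed-time line is skipped because its value is not an
integer; a fuller parser would need to handle fractional values
such as task-clock as well.
</para>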
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
To dive down into a next level of detail, we can use 'perf
|
|
|
|
record'/'perf report' which will collect profiling data and
|
|
|
|
present it to us using an interactive text-based UI (or
|
|
|
|
simply as text if we specify --stdio to 'perf report').
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
As our first attempt at profiling this workload, we'll simply
|
|
|
|
run 'perf record', handing it the workload we want to profile
|
|
|
|
(everything after 'perf record' and any perf options we hand
|
|
|
|
it - here none - will be executed in a new shell). perf collects
|
|
|
|
samples until the process exits and records them in a file named
|
|
|
|
'perf.data' in the current working directory.
|
|
|
|
<literallayout class='monospaced'>
|
2013-01-16 00:29:17 +00:00
|
|
|
root@crownbay:~# perf record wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 00:29:17 +00:00
|
|
|
Connecting to downloads.yoctoproject.org (140.211.169.59:80)
|
|
|
|
linux-2.6.19.2.tar.b 100% |************************************************| 41727k 0:00:00 ETA
|
|
|
|
[ perf record: Woken up 1 times to write data ]
|
|
|
|
[ perf record: Captured and wrote 0.176 MB perf.data (~7700 samples) ]
|
|
|
|
</literallayout>
|
|
|
|
To see the results in a 'text-based UI' (tui), simply run
|
|
|
|
'perf report', which will read the perf.data file in the current
|
|
|
|
working directory and display the results in an interactive UI:
|
2013-01-16 20:49:45 +00:00
|
|
|
<literallayout class='monospaced'>
|
2013-01-16 00:29:17 +00:00
|
|
|
root@crownbay:~# perf report
|
2013-01-16 20:49:45 +00:00
|
|
|
</literallayout>
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
<imagedata fileref="figures/perf-wget-flat-stripped.png" width="6in" depth="7in" align="center" scalefit="1" />
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
The above screenshot displays a 'flat' profile, one entry for
|
|
|
|
each 'bucket' corresponding to the functions that were profiled
|
|
|
|
during the profiling run, ordered from the most popular to the
|
|
|
|
least (perf has options to sort in various orders and keys as
|
|
|
|
well as display entries only above a certain threshold and so
|
|
|
|
on - see the perf documentation for details). Note that this
|
|
|
|
includes both userspace functions (entries containing a [.]) and
|
|
|
|
kernel functions accounted to the process (entries containing
|
|
|
|
a [k]). (perf has command-line modifiers that can be used to
|
|
|
|
restrict the profiling to kernel or userspace, among others).
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
Notice also that the above report shows an entry for 'busybox',
|
|
|
|
which is the executable that implements 'wget' in Yocto, but that
|
|
|
|
instead of a useful function name in that entry, it displays
|
|
|
|
a not-so-friendly hex value. The steps below will show
|
|
|
|
how to fix that problem.
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
Before we do that, however, let's try running a different profile,
|
|
|
|
one which shows something a little more interesting. The only
|
|
|
|
difference between the new profile and the previous one is that
|
|
|
|
we'll add the -g option, which will record not just the address
|
|
|
|
of a sampled function, but the entire callchain to the sampled
|
|
|
|
function as well:
|
|
|
|
<literallayout class='monospaced'>
|
2013-01-16 00:29:17 +00:00
|
|
|
root@crownbay:~# perf record -g wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2
|
|
|
|
Connecting to downloads.yoctoproject.org (140.211.169.59:80)
|
|
|
|
linux-2.6.19.2.tar.b 100% |************************************************| 41727k 0:00:00 ETA
|
|
|
|
[ perf record: Woken up 3 times to write data ]
|
|
|
|
[ perf record: Captured and wrote 0.652 MB perf.data (~28476 samples) ]
|
2013-01-10 23:25:18 +00:00
|
|
|
|
|
|
|
|
2013-01-16 00:29:17 +00:00
|
|
|
root@crownbay:~# perf report
|
2013-01-16 20:49:45 +00:00
|
|
|
</literallayout>
|
|
|
|
</para>
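<para>
When the same record/report cycle is repeated many times, it can
help to generate the command lines from a script. The Python
sketch below only assembles the argument lists for the
invocations used in this section; the helper names are
hypothetical, and the only perf options it relies on are the -g
and --stdio options mentioned in the text.
</para>

```python
import shlex


def perf_record_argv(workload, callchains=False):
    """Build the argument list for 'perf record [-g] WORKLOAD...'."""
    argv = ["perf", "record"]
    if callchains:
        argv.append("-g")  # also record the callchain of each sample
    return argv + list(workload)


def perf_report_argv(stdio=False):
    """Build the argument list for 'perf report [--stdio]'."""
    argv = ["perf", "report"]
    if stdio:
        argv.append("--stdio")  # plain text instead of the interactive UI
    return argv


URL = "http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2"

record = perf_record_argv(["wget", URL], callchains=True)
report = perf_report_argv(stdio=True)

print(shlex.join(record))
print(shlex.join(report))
```

<para>
On a target that has Python installed, these lists could be
handed to subprocess.run(); the sketch deliberately stops at
printing them so that it has no side effects.
</para>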
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
<imagedata fileref="figures/perf-wget-g-copy-to-user-expanded-stripped.png" width="6in" depth="7in" align="center" scalefit="1" />
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
Using the callgraph view, we can actually see not only which
|
|
|
|
functions took the most time, but we can also see a summary of
|
|
|
|
how those functions were called and learn something about how the
|
|
|
|
program interacts with the kernel in the process.
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
Notice that each entry in the above screenshot now contains a '+'
|
|
|
|
on the left-hand side. This means that we can expand the entry and
|
|
|
|
drill down into the callchains that feed into that entry.
|
|
|
|
Pressing 'enter' on any one of them will expand the callchain
|
|
|
|
(you can also press 'E' to expand them all at the same time or 'C'
|
|
|
|
to collapse them all).
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
In the screenshot above, we've toggled the __copy_to_user_ll()
|
|
|
|
entry and several subnodes all the way down. This lets us see
|
|
|
|
which callchains contributed to the profiled __copy_to_user_ll()
|
|
|
|
function which contributed 1.77% to the total profile.
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
As a bit of background explanation for these callchains, think
|
|
|
|
about what happens at a high level when you run wget to get a file
|
|
|
|
out on the network. Basically what happens is that the data comes
|
|
|
|
into the kernel via the network connection (socket) and is passed
|
|
|
|
to the userspace program 'wget' (which is actually a part of
|
|
|
|
busybox, but that's not important for now), which takes the buffers
|
|
|
|
the kernel passes to it and writes them to a disk file to save them.
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
The part of this process that we're looking at in the above call
|
|
|
|
stacks is the part where the kernel passes the data it's read from
|
|
|
|
the socket down to wget, i.e. a copy-to-user.
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
Notice also that there's a case here where a hex value
|
|
|
|
is displayed in the callstack, here in the expanded
|
|
|
|
sys_clock_gettime() function. Later we'll see it resolve to a
|
|
|
|
userspace function call in busybox.
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
<imagedata fileref="figures/perf-wget-g-copy-from-user-expanded-stripped.png" width="6in" depth="7in" align="center" scalefit="1" />
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
The above screenshot shows the other half of the journey for the
|
|
|
|
data - from the wget program's userspace buffers to disk. To get
|
|
|
|
the buffers to disk, the wget program issues a write(2), which
|
|
|
|
does a copy-from-user to the kernel, which then takes care via
|
|
|
|
some circuitous path (probably also present somewhere in the
|
|
|
|
profile data), to get it safely to disk.
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
Now that we've seen the basic layout of the profile data and the
|
|
|
|
basics of how to extract useful information out of it, let's get
|
|
|
|
back to the task at hand and see if we can get some basic idea
|
|
|
|
about where the time is spent in the program we're profiling,
|
|
|
|
wget. Remember that wget is actually implemented as an applet
|
|
|
|
in busybox, so while the process name is 'wget', the executable
|
|
|
|
we're actually interested in is busybox. So let's expand the
|
|
|
|
first entry containing busybox:
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
<imagedata fileref="figures/perf-wget-busybox-expanded-stripped.png" width="6in" depth="7in" align="center" scalefit="1" />
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
Again, before we expanded we saw that the function was labeled
|
|
|
|
with a hex value instead of a symbol as with most of the kernel
|
|
|
|
entries. Expanding the busybox entry doesn't make it any better.
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
The problem is that perf can't find the symbol information for the
|
|
|
|
busybox binary, which is actually stripped out by the Yocto build
|
|
|
|
system.
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
One way around that is to put the following in your local.conf
|
|
|
|
when you build the image:
|
|
|
|
<literallayout class='monospaced'>
|
2013-01-16 00:29:17 +00:00
|
|
|
INHIBIT_PACKAGE_STRIP = "1"
|
2013-01-16 20:49:45 +00:00
|
|
|
</literallayout>
|
|
|
|
However, we already have an image with the binaries stripped,
|
|
|
|
so what can we do to get perf to resolve the symbols? Basically
|
|
|
|
we need to install the debuginfo for the busybox package.
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
To generate the debug info for the packages in the image, we can
|
|
|
|
add dbg-pkgs to EXTRA_IMAGE_FEATURES in local.conf. For example:
|
|
|
|
<literallayout class='monospaced'>
|
2013-01-16 00:29:17 +00:00
|
|
|
EXTRA_IMAGE_FEATURES = "debug-tweaks tools-profile dbg-pkgs"
|
2013-01-16 20:49:45 +00:00
|
|
|
</literallayout>
|
|
|
|
Additionally, in order to generate the type of debuginfo that
|
|
|
|
perf understands, we also need to add the following to local.conf:
|
|
|
|
<literallayout class='monospaced'>
|
2013-01-16 00:29:17 +00:00
|
|
|
PACKAGE_DEBUG_SPLIT_STYLE = 'debug-file-directory'
|
2013-01-16 20:49:45 +00:00
|
|
|
</literallayout>
|
|
|
|
Once we've done that, we can install the debuginfo for busybox.
|
|
|
|
The debug packages once built can be found in
|
|
|
|
build/tmp/deploy/rpm/* on the host system. Find the
|
|
|
|
busybox-dbg-...rpm file and copy it to the target. For example:
|
|
|
|
<literallayout class='monospaced'>
|
2013-01-16 00:29:17 +00:00
|
|
|
[trz@empanada core2]$ scp /home/trz/yocto/crownbay-tracing-dbg/build/tmp/deploy/rpm/core2/busybox-dbg-1.20.2-r2.core2.rpm root@192.168.1.31:
|
|
|
|
root@192.168.1.31's password:
|
|
|
|
busybox-dbg-1.20.2-r2.core2.rpm 100% 1826KB 1.8MB/s 00:01
|
2013-01-16 20:49:45 +00:00
|
|
|
</literallayout>
|
|
|
|
Now install the debug rpm on the target:
|
|
|
|
<literallayout class='monospaced'>
|
2013-01-16 00:29:17 +00:00
|
|
|
root@crownbay:~# rpm -i busybox-dbg-1.20.2-r2.core2.rpm
|
2013-01-16 20:49:45 +00:00
|
|
|
</literallayout>
|
|
|
|
Now that the debuginfo is installed, we see that the busybox
|
|
|
|
entries now display their functions symbolically:
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
<imagedata fileref="figures/perf-wget-busybox-debuginfo.png" width="6in" depth="7in" align="center" scalefit="1" />
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
If we expand one of the entries and press 'enter' on a leaf node,
|
|
|
|
we're presented with a menu of actions we can take to get more
|
|
|
|
information related to that entry:
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
<imagedata fileref="figures/perf-wget-busybox-dso-zoom-menu.png" width="6in" depth="7in" align="center" scalefit="1" />
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
One of these actions allows us to show a view that displays a
|
|
|
|
busybox-centric view of the profiled functions (in this case we've
|
|
|
|
also expanded all the nodes using the 'E' key):
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
<imagedata fileref="figures/perf-wget-busybox-dso-zoom.png" width="6in" depth="7in" align="center" scalefit="1" />
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
Finally, we can see that now that the busybox debuginfo is
|
|
|
|
installed, the previously unresolved symbol in the
|
|
|
|
sys_clock_gettime() entry mentioned earlier is now resolved,
|
|
|
|
and shows that the sys_clock_gettime system call that was the
|
|
|
|
source of 6.75% of the copy-to-user overhead was initiated by
|
|
|
|
the handle_input() busybox function:
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
<imagedata fileref="figures/perf-wget-g-copy-to-user-expanded-debuginfo.png" width="6in" depth="7in" align="center" scalefit="1" />
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
At the lowest level of detail, we can dive down to the assembly
|
|
|
|
level and see which instructions caused the most overhead in a
|
|
|
|
function. Pressing 'enter' on the 'udhcpc_main' function, we're
|
|
|
|
again presented with a menu:
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
<imagedata fileref="figures/perf-wget-busybox-annotate-menu.png" width="6in" depth="7in" align="center" scalefit="1" />
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
Selecting 'Annotate udhcpc_main', we get a detailed listing of
|
|
|
|
percentages by instruction for the udhcpc_main function. From the
|
|
|
|
display, we can see that over 50% of the time spent in this
|
|
|
|
function is taken up by a couple of tests and the move of a
|
|
|
|
constant (1) to a register:
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
<imagedata fileref="figures/perf-wget-busybox-annotate-udhcpc.png" width="6in" depth="7in" align="center" scalefit="1" />
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
As a segue into tracing, let's try another profile using a
|
|
|
|
different counter, something other than the default 'cycles'.
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
The tracing and profiling infrastructure in Linux has become
|
|
|
|
unified in a way that allows us to use the same tool with a
|
|
|
|
completely different set of counters, not just the standard
|
|
|
|
hardware counters that traditional tools have had to restrict
|
|
|
|
themselves to (of course the traditional tools can also make use
|
|
|
|
of the expanded possibilities now available to them, and in some
|
|
|
|
cases have, as mentioned previously).
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<para>
|
|
|
|
We can get a list of the available events that can be used to
|
|
|
|
profile a workload via 'perf list':
|
|
|
|
<literallayout class='monospaced'>
|
2013-01-16 00:29:17 +00:00
|
|
|
root@crownbay:~# perf list
|
|
|
|
|
|
|
|
List of pre-defined events (to be used in -e):
|
|
|
|
cpu-cycles OR cycles [Hardware event]
|
|
|
|
stalled-cycles-frontend OR idle-cycles-frontend [Hardware event]
|
|
|
|
stalled-cycles-backend OR idle-cycles-backend [Hardware event]
|
|
|
|
instructions [Hardware event]
|
|
|
|
cache-references [Hardware event]
|
|
|
|
cache-misses [Hardware event]
|
|
|
|
branch-instructions OR branches [Hardware event]
|
|
|
|
branch-misses [Hardware event]
|
|
|
|
bus-cycles [Hardware event]
|
|
|
|
ref-cycles [Hardware event]
|
|
|
|
|
|
|
|
cpu-clock [Software event]
|
|
|
|
task-clock [Software event]
|
|
|
|
page-faults OR faults [Software event]
|
|
|
|
minor-faults [Software event]
|
|
|
|
major-faults [Software event]
|
|
|
|
context-switches OR cs [Software event]
|
|
|
|
cpu-migrations OR migrations [Software event]
|
|
|
|
alignment-faults [Software event]
|
|
|
|
emulation-faults [Software event]
|
|
|
|
|
|
|
|
L1-dcache-loads [Hardware cache event]
|
|
|
|
L1-dcache-load-misses [Hardware cache event]
|
|
|
|
L1-dcache-prefetch-misses [Hardware cache event]
|
|
|
|
L1-icache-loads [Hardware cache event]
|
|
|
|
L1-icache-load-misses [Hardware cache event]
|
|
|
|
.
|
|
|
|
.
|
|
|
|
.
|
|
|
|
rNNN [Raw hardware event descriptor]
|
|
|
|
cpu/t1=v1[,t2=v2,t3 ...]/modifier [Raw hardware event descriptor]
|
|
|
|
(see 'perf list --help' on how to encode it)
|
|
|
|
|
|
|
|
mem:<addr>[:access] [Hardware breakpoint]
|
|
|
|
|
|
|
|
sunrpc:rpc_call_status [Tracepoint event]
|
|
|
|
sunrpc:rpc_bind_status [Tracepoint event]
|
|
|
|
sunrpc:rpc_connect_status [Tracepoint event]
|
|
|
|
sunrpc:rpc_task_begin [Tracepoint event]
|
|
|
|
skb:kfree_skb [Tracepoint event]
|
|
|
|
skb:consume_skb [Tracepoint event]
|
|
|
|
skb:skb_copy_datagram_iovec [Tracepoint event]
|
|
|
|
net:net_dev_xmit [Tracepoint event]
|
|
|
|
net:net_dev_queue [Tracepoint event]
|
|
|
|
net:netif_receive_skb [Tracepoint event]
|
|
|
|
net:netif_rx [Tracepoint event]
|
|
|
|
napi:napi_poll [Tracepoint event]
|
|
|
|
sock:sock_rcvqueue_full [Tracepoint event]
|
|
|
|
sock:sock_exceed_buf_limit [Tracepoint event]
|
|
|
|
udp:udp_fail_queue_rcv_skb [Tracepoint event]
|
|
|
|
hda:hda_send_cmd [Tracepoint event]
|
|
|
|
hda:hda_get_response [Tracepoint event]
|
|
|
|
hda:hda_bus_reset [Tracepoint event]
|
|
|
|
scsi:scsi_dispatch_cmd_start [Tracepoint event]
|
|
|
|
scsi:scsi_dispatch_cmd_error [Tracepoint event]
|
|
|
|
scsi:scsi_eh_wakeup [Tracepoint event]
|
|
|
|
drm:drm_vblank_event [Tracepoint event]
|
|
|
|
drm:drm_vblank_event_queued [Tracepoint event]
|
|
|
|
drm:drm_vblank_event_delivered [Tracepoint event]
|
|
|
|
random:mix_pool_bytes [Tracepoint event]
|
|
|
|
random:mix_pool_bytes_nolock [Tracepoint event]
|
|
|
|
random:credit_entropy_bits [Tracepoint event]
|
|
|
|
gpio:gpio_direction [Tracepoint event]
|
|
|
|
gpio:gpio_value [Tracepoint event]
|
|
|
|
block:block_rq_abort [Tracepoint event]
|
|
|
|
block:block_rq_requeue [Tracepoint event]
|
|
|
|
block:block_rq_issue [Tracepoint event]
|
|
|
|
block:block_bio_bounce [Tracepoint event]
|
|
|
|
block:block_bio_complete [Tracepoint event]
|
|
|
|
block:block_bio_backmerge [Tracepoint event]
|
|
|
|
.
|
|
|
|
.
|
|
|
|
writeback:writeback_wake_thread [Tracepoint event]
|
|
|
|
writeback:writeback_wake_forker_thread [Tracepoint event]
|
|
|
|
writeback:writeback_bdi_register [Tracepoint event]
|
|
|
|
.
|
|
|
|
.
|
|
|
|
writeback:writeback_single_inode_requeue [Tracepoint event]
|
|
|
|
writeback:writeback_single_inode [Tracepoint event]
|
|
|
|
kmem:kmalloc [Tracepoint event]
|
|
|
|
kmem:kmem_cache_alloc [Tracepoint event]
|
|
|
|
kmem:mm_page_alloc [Tracepoint event]
|
|
|
|
kmem:mm_page_alloc_zone_locked [Tracepoint event]
|
|
|
|
kmem:mm_page_pcpu_drain [Tracepoint event]
|
|
|
|
kmem:mm_page_alloc_extfrag [Tracepoint event]
|
|
|
|
vmscan:mm_vmscan_kswapd_sleep [Tracepoint event]
|
|
|
|
vmscan:mm_vmscan_kswapd_wake [Tracepoint event]
|
|
|
|
vmscan:mm_vmscan_wakeup_kswapd [Tracepoint event]
|
|
|
|
vmscan:mm_vmscan_direct_reclaim_begin [Tracepoint event]
|
|
|
|
.
|
|
|
|
.
|
|
|
|
module:module_get [Tracepoint event]
|
|
|
|
module:module_put [Tracepoint event]
|
|
|
|
module:module_request [Tracepoint event]
|
|
|
|
sched:sched_kthread_stop [Tracepoint event]
|
|
|
|
sched:sched_wakeup [Tracepoint event]
|
|
|
|
sched:sched_wakeup_new [Tracepoint event]
|
|
|
|
sched:sched_process_fork [Tracepoint event]
|
|
|
|
sched:sched_process_exec [Tracepoint event]
|
|
|
|
sched:sched_stat_runtime [Tracepoint event]
|
|
|
|
rcu:rcu_utilization [Tracepoint event]
|
|
|
|
workqueue:workqueue_queue_work [Tracepoint event]
|
|
|
|
workqueue:workqueue_execute_end [Tracepoint event]
|
|
|
|
signal:signal_generate [Tracepoint event]
|
|
|
|
signal:signal_deliver [Tracepoint event]
|
|
|
|
timer:timer_init [Tracepoint event]
|
|
|
|
timer:timer_start [Tracepoint event]
|
|
|
|
timer:hrtimer_cancel [Tracepoint event]
|
|
|
|
timer:itimer_state [Tracepoint event]
|
|
|
|
timer:itimer_expire [Tracepoint event]
|
|
|
|
irq:irq_handler_entry [Tracepoint event]
|
|
|
|
irq:irq_handler_exit [Tracepoint event]
|
|
|
|
irq:softirq_entry [Tracepoint event]
|
|
|
|
irq:softirq_exit [Tracepoint event]
|
|
|
|
irq:softirq_raise [Tracepoint event]
|
|
|
|
printk:console [Tracepoint event]
|
|
|
|
task:task_newtask [Tracepoint event]
|
|
|
|
task:task_rename [Tracepoint event]
|
|
|
|
syscalls:sys_enter_socketcall [Tracepoint event]
|
|
|
|
syscalls:sys_exit_socketcall [Tracepoint event]
|
|
|
|
.
|
|
|
|
.
|
|
|
|
.
|
|
|
|
syscalls:sys_enter_unshare [Tracepoint event]
|
|
|
|
syscalls:sys_exit_unshare [Tracepoint event]
|
|
|
|
raw_syscalls:sys_enter [Tracepoint event]
|
|
|
|
raw_syscalls:sys_exit [Tracepoint event]
|
2013-01-16 20:49:45 +00:00
|
|
|
</literallayout>
|
|
|
|
</para>
|
2013-01-10 23:25:18 +00:00
|
|
|
|
2013-01-16 20:49:45 +00:00
|
|
|
<note>
|
|
|
|
Tying It Together: This is exactly the same set of events defined
|
|
|
|
by the trace event subsystem and exposed by
|
|
|
|
ftrace/trace-cmd/kernelshark as files in
|
|
|
|
/sys/kernel/debug/tracing/events, by SystemTap as
|
|
|
|
kernel.trace("tracepoint_name") and (partially) accessed by LTTng.
|
|
|
|
</note>
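<para>
Because each tracepoint appears as a directory under
/sys/kernel/debug/tracing/events, the same 'subsystem:event' names
can be recovered by walking that tree. The following is a minimal
sketch in Python, not part of perf or ftrace: list_trace_events()
and make_demo_tree() are hypothetical helper names, and the sketch
assumes that each event directory contains a 'format' file. It
builds a small stand-in tree so that it runs without debugfs
mounted:
<literallayout class='monospaced'>
```python
import os
import tempfile


def list_trace_events(events_root):
    """Return sorted 'subsystem:event' names found under events_root.

    Assumes the layout events_root/SUBSYSTEM/EVENT/, where each event
    directory holds a 'format' file.
    """
    names = []
    for subsystem in sorted(os.listdir(events_root)):
        subsystem_dir = os.path.join(events_root, subsystem)
        if not os.path.isdir(subsystem_dir):
            continue  # skip top-level control files such as 'enable'
        for event in sorted(os.listdir(subsystem_dir)):
            if os.path.isfile(os.path.join(subsystem_dir, event, "format")):
                names.append("%s:%s" % (subsystem, event))
    return names


def make_demo_tree(root, names):
    """Create a tiny stand-in tree so the sketch runs without debugfs."""
    for name in names:
        subsystem, event = name.split(":")
        event_dir = os.path.join(root, subsystem, event)
        os.makedirs(event_dir)
        with open(os.path.join(event_dir, "format"), "w") as handle:
            handle.write("name: %s\n" % event)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as demo_root:
        make_demo_tree(demo_root, ["sched:sched_wakeup", "irq:softirq_entry"])
        for name in list_trace_events(demo_root):
            print(name)
```
</literallayout>
On a target with debugfs mounted, pointing list_trace_events() at
/sys/kernel/debug/tracing/events as root should produce names in
the same 'subsystem:event' form that 'perf list' prints for
tracepoint events.
</para>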
<para>
Only a subset of these would be of interest to us when looking at
this workload, so let's choose the most likely subsystems
(identified by the string before the colon in the Tracepoint events)
and do a 'perf stat' run using only those wildcarded subsystems:
<literallayout class='monospaced'>
root@crownbay:~# perf stat -e skb:* -e net:* -e napi:* -e sched:* -e workqueue:* -e irq:* -e syscalls:* wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2
Performance counter stats for 'wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2':

23323 skb:kfree_skb
0 skb:consume_skb
49897 skb:skb_copy_datagram_iovec
6217 net:net_dev_xmit
6217 net:net_dev_queue
7962 net:netif_receive_skb
2 net:netif_rx
8340 napi:napi_poll
0 sched:sched_kthread_stop
0 sched:sched_kthread_stop_ret
3749 sched:sched_wakeup
0 sched:sched_wakeup_new
0 sched:sched_switch
29 sched:sched_migrate_task
0 sched:sched_process_free
1 sched:sched_process_exit
0 sched:sched_wait_task
0 sched:sched_process_wait
0 sched:sched_process_fork
1 sched:sched_process_exec
0 sched:sched_stat_wait
2106519415641 sched:sched_stat_sleep
0 sched:sched_stat_iowait
147453613 sched:sched_stat_blocked
12903026955 sched:sched_stat_runtime
0 sched:sched_pi_setprio
3574 workqueue:workqueue_queue_work
3574 workqueue:workqueue_activate_work
0 workqueue:workqueue_execute_start
0 workqueue:workqueue_execute_end
16631 irq:irq_handler_entry
16631 irq:irq_handler_exit
28521 irq:softirq_entry
28521 irq:softirq_exit
28728 irq:softirq_raise
1 syscalls:sys_enter_sendmmsg
1 syscalls:sys_exit_sendmmsg
0 syscalls:sys_enter_recvmmsg
0 syscalls:sys_exit_recvmmsg
14 syscalls:sys_enter_socketcall
14 syscalls:sys_exit_socketcall
.
.
.
16965 syscalls:sys_enter_read
16965 syscalls:sys_exit_read
12854 syscalls:sys_enter_write
12854 syscalls:sys_exit_write
.
.
.

58.029710972 seconds time elapsed
</literallayout>
Let's pick one of these tracepoints and tell perf to do a profile
using it as the sampling event:
<literallayout class='monospaced'>
root@crownbay:~# perf record -g -e sched:sched_wakeup wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2
</literallayout>
</para>
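<para>
Counts such as the 'perf stat' figures above are easy to
post-process once they have been saved as text. As a rough sketch,
assuming counter lines that start with 'COUNT subsystem:event' as
shown above, the following Python totals the counts per subsystem
and drops the subsystems that never fired. The names
parse_stat_counts() and active_subsystems() are hypothetical, and
SAMPLE_STAT_OUTPUT is a short excerpt of the output above:
<literallayout class='monospaced'>
```python
from collections import defaultdict

SAMPLE_STAT_OUTPUT = """\
23323 skb:kfree_skb
0 skb:consume_skb
49897 skb:skb_copy_datagram_iovec
6217 net:net_dev_xmit
3749 sched:sched_wakeup
0 workqueue:workqueue_execute_start
16631 irq:irq_handler_entry
16965 syscalls:sys_enter_read
58.029710972 seconds time elapsed
"""


def parse_stat_counts(text):
    """Map 'subsystem:event' to its count, ignoring non-counter lines."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        # A counter line starts with: COUNT SUBSYSTEM:EVENT
        if len(fields) >= 2 and ":" in fields[1]:
            digits = fields[0].replace(",", "")
            if digits.isdigit():
                counts[fields[1]] = int(digits)
    return counts


def active_subsystems(counts):
    """Sum counts per subsystem, keeping only subsystems that fired."""
    totals = defaultdict(int)
    for name, count in counts.items():
        subsystem = name.split(":", 1)[0]
        totals[subsystem] += count
    return {name: total for name, total in totals.items() if total > 0}


if __name__ == "__main__":
    totals = active_subsystems(parse_stat_counts(SAMPLE_STAT_OUTPUT))
    for subsystem in sorted(totals):
        print("%-12s %d" % (subsystem, totals[subsystem]))
```
</literallayout>
For the excerpt, the 'skb' subsystem totals 73220 events, while
'workqueue' is dropped because its only sampled counter is zero.
A summary like this is one way to decide which subsystems are
worth recording in a later, more detailed run.
</para>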
<para>
<imagedata fileref="figures/sched-wakeup-profile.png" width="6in" depth="7in" align="center" scalefit="1" />
</para>

<para>
The screenshot above shows the results of running a profile using
the sched:sched_wakeup tracepoint, which shows the relative costs of
various paths to sched_wakeup. Note that sched_wakeup is the
name of the tracepoint; it's actually defined just inside
ttwu_do_wakeup(), which accounts for the function name actually
displayed in the profile:
<literallayout class='monospaced'>
/*
 * Mark the task runnable and perform wakeup-preemption.
 */
static void
ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
{
        trace_sched_wakeup(p, true);
        .
        .
        .
}
</literallayout>
A couple of the more interesting callchains are expanded and
displayed above, basically some network receive paths that
presumably end up waking up wget (busybox) when network data is
ready.
</para>

<para>
Note that because tracepoints are normally used for tracing,
the default sampling period for tracepoints is 1, i.e. for
tracepoints perf will sample on every event occurrence (this
can be changed using the -c option). This is in contrast to
hardware counters such as, for example, the default 'cycles'
hardware counter used for normal profiling, where sampling
periods are much higher (in the thousands) because profiling should
have as low an overhead as possible and sampling on every cycle
would be prohibitively expensive.
</para>
</section>

<section id='using-perf-to-do-basic-tracing'>
<title>Using perf to do Basic Tracing</title>

<para>
Profiling is a great tool for solving many problems or for
getting a high-level view of what's going on with a workload or
across the system. It is, however, by definition an approximation,
as suggested by the most prominent word associated with it,
'sampling'. On the one hand, it allows a representative picture of
what's going on in the system to be cheaply taken, but on the other
hand, that cheapness limits its utility when that data suggests a
need to 'dive down' more deeply to discover what's really going
on. In such cases, the only way to see what's really going on is
to be able to look at (or summarize more intelligently) the
individual steps that go into the higher-level behavior exposed
by the coarse-grained profiling data.
</para>

<para>
As a concrete example, we can trace all the events we think might
be applicable to our workload:
<literallayout class='monospaced'>
root@crownbay:~# perf record -g -e skb:* -e net:* -e napi:* -e sched:sched_switch -e sched:sched_wakeup -e irq:* \
-e syscalls:sys_enter_read -e syscalls:sys_exit_read -e syscalls:sys_enter_write -e syscalls:sys_exit_write \
wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2
</literallayout>
We can look at the raw trace output using 'perf script' with no
arguments:
<literallayout class='monospaced'>
root@crownbay:~# perf script

perf 1262 [000] 11624.857082: sys_exit_read: 0x0
perf 1262 [000] 11624.857193: sched_wakeup: comm=migration/0 pid=6 prio=0 success=1 target_cpu=000
wget 1262 [001] 11624.858021: softirq_raise: vec=1 [action=TIMER]
wget 1262 [001] 11624.858074: softirq_entry: vec=1 [action=TIMER]
wget 1262 [001] 11624.858081: softirq_exit: vec=1 [action=TIMER]
wget 1262 [001] 11624.858166: sys_enter_read: fd: 0x0003, buf: 0xbf82c940, count: 0x0200
wget 1262 [001] 11624.858177: sys_exit_read: 0x200
wget 1262 [001] 11624.858878: kfree_skb: skbaddr=0xeb248d80 protocol=0 location=0xc15a5308
wget 1262 [001] 11624.858945: kfree_skb: skbaddr=0xeb248000 protocol=0 location=0xc15a5308
wget 1262 [001] 11624.859020: softirq_raise: vec=1 [action=TIMER]
wget 1262 [001] 11624.859076: softirq_entry: vec=1 [action=TIMER]
wget 1262 [001] 11624.859083: softirq_exit: vec=1 [action=TIMER]
wget 1262 [001] 11624.859167: sys_enter_read: fd: 0x0003, buf: 0xb7720000, count: 0x0400
wget 1262 [001] 11624.859192: sys_exit_read: 0x1d7
wget 1262 [001] 11624.859228: sys_enter_read: fd: 0x0003, buf: 0xb7720000, count: 0x0400
wget 1262 [001] 11624.859233: sys_exit_read: 0x0
wget 1262 [001] 11624.859573: sys_enter_read: fd: 0x0003, buf: 0xbf82c580, count: 0x0200
wget 1262 [001] 11624.859584: sys_exit_read: 0x200
wget 1262 [001] 11624.859864: sys_enter_read: fd: 0x0003, buf: 0xb7720000, count: 0x0400
wget 1262 [001] 11624.859888: sys_exit_read: 0x400
wget 1262 [001] 11624.859935: sys_enter_read: fd: 0x0003, buf: 0xb7720000, count: 0x0400
wget 1262 [001] 11624.859944: sys_exit_read: 0x400
</literallayout>
This gives us a detailed, timestamped sequence of the recorded
events as they occurred within the workload.
</para>
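<para>
Raw 'perf script' lines like those above are regular enough to be
picked apart with a few lines of code. The following Python is
only a sketch: parse_line() and bytes_read() are hypothetical
helper names, the pattern assumes the default line layout shown
above (a command name without spaces and a six-digit fractional
timestamp), and error returns from read are not handled:
<literallayout class='monospaced'>
```python
import re

# comm, pid, [cpu], seconds.fraction, event name, then event-specific text.
LINE_PATTERN = re.compile(
    r"^\s*(\S+)\s+(\d+)\s+\[(\d+)\]\s+(\d+)\.(\d+):\s+(\w+):\s*(.*)$"
)

SAMPLE_LINES = """\
perf 1262 [000] 11624.857082: sys_exit_read: 0x0
wget 1262 [001] 11624.858166: sys_enter_read: fd: 0x0003, buf: 0xbf82c940, count: 0x0200
wget 1262 [001] 11624.858177: sys_exit_read: 0x200
wget 1262 [001] 11624.859192: sys_exit_read: 0x1d7
"""


def parse_line(line):
    """Split one trace line into a dict, or return None if it doesn't match."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    comm, pid, cpu, secs, frac, event, detail = match.groups()
    return {
        "comm": comm,
        "pid": int(pid),
        "cpu": int(cpu),
        # Integer microseconds avoid floating-point rounding of timestamps.
        "usecs": int(secs) * 1000000 + int(frac),
        "event": event,
        "detail": detail,
    }


def bytes_read(lines):
    """Total the hexadecimal return values of sys_exit_read events."""
    total = 0
    for line in lines:
        record = parse_line(line)
        if record and record["event"] == "sys_exit_read":
            total += int(record["detail"], 16)
    return total


if __name__ == "__main__":
    print(bytes_read(SAMPLE_LINES.splitlines()))
```
</literallayout>
For the four sample lines, bytes_read() returns 983, the sum of
the three sys_exit_read return values (0, 512 and 471). Parsing
text this way is fragile, which is part of the motivation for
the language bindings described later in this section.
</para>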
<para>
In many ways, profiling can be viewed as a subset of tracing -
theoretically, if you have a set of trace events that's sufficient
to capture all the important aspects of a workload, you can derive
any of the results or views that a profiling run can.
</para>

<para>
Another aspect of traditional profiling is that while powerful in
many ways, it's limited by the granularity of the underlying data.
Profiling tools offer various ways of sorting and presenting the
sample data, which make it much more useful and amenable to user
experimentation, but in the end it can't be used in an open-ended
way to extract data that just isn't present because, conceptually,
most of it has been thrown away.
</para>

<para>
Full-blown detailed tracing data does, however, offer the opportunity
to manipulate and present the information collected during a
tracing run in an infinite variety of ways.
</para>

<para>
Another way to look at it is that there are only so many ways that
the 'primitive' counters can be used on their own to generate
interesting output; to get anything more complicated than simple
counts requires some amount of additional logic, which is typically
very specific to the problem at hand. For example, if we wanted to
make use of a 'counter' that maps to the value of the time
difference between when a process was scheduled to run on a
processor and the time it actually ran, we wouldn't expect such
a counter to exist on its own, but we could derive one called, say,
'wakeup_latency' and use it to extract a useful view of that metric
from trace data. Likewise, we really can't figure out from standard
profiling tools how much data every process on the system reads and
writes, along with how many of those reads and writes fail
completely. If we have sufficient trace data, however, we could,
with the right tools, easily extract and present that information,
but we'd need something other than pre-canned profiling tools to
do that.
</para>
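<para>
A derived metric such as the 'wakeup_latency' mentioned above
could be computed along the following lines. This is only a
sketch: it assumes the trace has already been decoded into
(timestamp in microseconds, event name, fields) tuples, which is a
simplification of what a trace handler actually receives, and
wakeup_latencies() and SAMPLE_EVENTS are hypothetical names with
illustrative values:
<literallayout class='monospaced'>
```python
def wakeup_latencies(events):
    """Yield (pid, latency_usecs) for each wakeup followed by a switch-in.

    'events' is an iterable of (usecs, event_name, fields) tuples in time
    order, a simplified stand-in for decoded trace records.
    """
    pending = {}  # pid -> timestamp of its most recent wakeup
    for usecs, event_name, fields in events:
        if event_name == "sched_wakeup":
            pending[fields["pid"]] = usecs
        elif event_name == "sched_switch":
            woken_at = pending.pop(fields["next_pid"], None)
            if woken_at is not None:
                yield fields["next_pid"], usecs - woken_at


SAMPLE_EVENTS = [
    (6171460045, "sched_wakeup", {"comm": "kworker/1:1", "pid": 21}),
    (6171460066, "sched_switch", {"prev_pid": 1383, "next_pid": 21}),
    (6171460093, "sched_switch", {"prev_pid": 21, "next_pid": 1383}),
    (6171468063, "sched_wakeup", {"comm": "kworker/0:3", "pid": 1209}),
    (6171468107, "sched_switch", {"prev_pid": 0, "next_pid": 1209}),
]

if __name__ == "__main__":
    for pid, latency in wakeup_latencies(SAMPLE_EVENTS):
        print("pid %d ran %d usecs after being woken" % (pid, latency))
```
</literallayout>
For the sample data, the generator yields a latency of 21
microseconds for pid 21 and 44 microseconds for pid 1209. The
switch to pid 1383 produces nothing because no wakeup for that
pid was seen first.
</para>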
<para>
Luckily, there is a general-purpose way to handle such needs,
called 'programming languages'. Making programming languages
easily available to apply to such problems given the specific
format of data is called a 'programming language binding' for
that data and language. Perf supports two programming language
bindings, one for Python and one for Perl.
</para>

<note>
Tying It Together: Language bindings for manipulating and
aggregating trace data are of course not a new
idea. One of the first projects to do this was IBM's DProbes
dpcc compiler, an ANSI C compiler which targeted a low-level
assembly language running on an in-kernel interpreter on the
target system. This is exactly analogous to what Sun's DTrace
did, except that DTrace invented its own language for the purpose.
SystemTap, heavily inspired by DTrace, also created its own
one-off language, but rather than running the product on an
in-kernel interpreter, created an elaborate compiler-based
machinery to translate its language into kernel modules written
in C.
</note>

<para>
Now that we have the trace data in perf.data, we can use
'perf script -g' to generate a skeleton script with handlers
for the read/write entry/exit events we recorded:
<literallayout class='monospaced'>
root@crownbay:~# perf script -g python
generated Python script: perf-script.py
</literallayout>
The skeleton script simply creates a Python function for each
event type in the perf.data file. The body of each function simply
prints the event name along with its parameters. For example:
<literallayout class='monospaced'>
def net__netif_rx(event_name, context, common_cpu,
        common_secs, common_nsecs, common_pid, common_comm,
        skbaddr, len, name):
    print_header(event_name, common_cpu, common_secs, common_nsecs,
            common_pid, common_comm)

    print "skbaddr=%u, len=%u, name=%s\n" % (skbaddr, len, name),
</literallayout>
We can run that script directly to print all of the events
contained in the perf.data file:
<literallayout class='monospaced'>
root@crownbay:~# perf script -s perf-script.py

in trace_begin
syscalls__sys_exit_read 0 11624.857082795 1262 perf nr=3, ret=0
sched__sched_wakeup 0 11624.857193498 1262 perf comm=migration/0, pid=6, prio=0, success=1, target_cpu=0
irq__softirq_raise 1 11624.858021635 1262 wget vec=TIMER
irq__softirq_entry 1 11624.858074075 1262 wget vec=TIMER
irq__softirq_exit 1 11624.858081389 1262 wget vec=TIMER
syscalls__sys_enter_read 1 11624.858166434 1262 wget nr=3, fd=3, buf=3213019456, count=512
syscalls__sys_exit_read 1 11624.858177924 1262 wget nr=3, ret=512
skb__kfree_skb 1 11624.858878188 1262 wget skbaddr=3945041280, location=3243922184, protocol=0
skb__kfree_skb 1 11624.858945608 1262 wget skbaddr=3945037824, location=3243922184, protocol=0
irq__softirq_raise 1 11624.859020942 1262 wget vec=TIMER
irq__softirq_entry 1 11624.859076935 1262 wget vec=TIMER
irq__softirq_exit 1 11624.859083469 1262 wget vec=TIMER
syscalls__sys_enter_read 1 11624.859167565 1262 wget nr=3, fd=3, buf=3077701632, count=1024
syscalls__sys_exit_read 1 11624.859192533 1262 wget nr=3, ret=471
syscalls__sys_enter_read 1 11624.859228072 1262 wget nr=3, fd=3, buf=3077701632, count=1024
syscalls__sys_exit_read 1 11624.859233707 1262 wget nr=3, ret=0
syscalls__sys_enter_read 1 11624.859573008 1262 wget nr=3, fd=3, buf=3213018496, count=512
syscalls__sys_exit_read 1 11624.859584818 1262 wget nr=3, ret=512
syscalls__sys_enter_read 1 11624.859864562 1262 wget nr=3, fd=3, buf=3077701632, count=1024
syscalls__sys_exit_read 1 11624.859888770 1262 wget nr=3, ret=1024
syscalls__sys_enter_read 1 11624.859935140 1262 wget nr=3, fd=3, buf=3077701632, count=1024
syscalls__sys_exit_read 1 11624.859944032 1262 wget nr=3, ret=1024
</literallayout>
That in itself isn't very useful; after all, we can accomplish
pretty much the same thing by simply running 'perf script'
without arguments in the same directory as the perf.data file.
</para>

<para>
We can, however, replace the print statements in the generated
function bodies with whatever we want, and thereby make it
infinitely more useful.
</para>

<para>
As a simple example, let's just replace the print statements in
the function bodies with a simple function that does nothing but
increment a per-event count. When the program is run against a
perf.data file, each time a particular event is encountered,
a tally is incremented for that event. For example:
<literallayout class='monospaced'>
def net__netif_rx(event_name, context, common_cpu,
        common_secs, common_nsecs, common_pid, common_comm,
        skbaddr, len, name):
    inc_counts(event_name)
</literallayout>
Each event handler function in the generated code is modified
to do this. For convenience, we define a common function called
inc_counts() that each handler calls; inc_counts() simply tallies
a count for each event using the 'counts' hash, which is a
specialized hash that does Perl-like autovivification, a
capability that's extremely useful for the kinds of multi-level
aggregation commonly used in processing traces (see perf's
documentation on the Python language binding for details):
<literallayout class='monospaced'>
counts = autodict()

def inc_counts(event_name):
    try:
        counts[event_name] += 1
    except TypeError:
        counts[event_name] = 1
</literallayout>
Finally, at the end of the trace processing run, we want to
print the result of all the per-event tallies. For that, we
use the special 'trace_end()' function:
<literallayout class='monospaced'>
def trace_end():
    for event_name, count in counts.items():
        print("%-40s %10s" % (event_name, count))
</literallayout>
The end result is a summary of all the events recorded in the
trace:
<literallayout class='monospaced'>
skb__skb_copy_datagram_iovec 13148
irq__softirq_entry 4796
irq__irq_handler_exit 3805
irq__softirq_exit 4795
syscalls__sys_enter_write 8990
net__net_dev_xmit 652
skb__kfree_skb 4047
sched__sched_wakeup 1155
irq__irq_handler_entry 3804
irq__softirq_raise 4799
net__net_dev_queue 652
syscalls__sys_enter_read 17599
net__netif_receive_skb 1743
syscalls__sys_exit_read 17598
net__netif_rx 2
napi__napi_poll 1877
syscalls__sys_exit_write 8990
</literallayout>
Note that this is pretty much exactly the same information we get
from 'perf stat', which goes a little way to support the idea
mentioned previously that given the right kind of trace data,
higher-level profiling-type summaries can be derived from it.
</para>
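<para>
The same autovivification idea extends naturally to the
multi-level aggregation mentioned earlier, such as per-process
read and write totals. The following standalone sketch assumes
that autodict() behaves like a recursively defaulting dictionary,
and defines its own stand-in so that it runs outside of perf. The
add() helper and the trimmed-down handler signatures are
hypothetical; real generated handlers take the full argument list
shown in the skeleton script:
<literallayout class='monospaced'>
```python
from collections import defaultdict


def autodict():
    """Stand-in: a dictionary whose missing keys become nested autodicts."""
    return defaultdict(autodict)


totals = autodict()


def add(path, amount):
    """Add 'amount' at totals[path[0]][path[1]]..., creating levels as needed."""
    node = totals
    for key in path[:-1]:
        node = node[key]
    try:
        node[path[-1]] += amount
    except TypeError:
        # First touch: the slot held an empty autodict, not a number.
        node[path[-1]] = amount


def syscalls__sys_exit_read(comm, ret):
    """Hypothetical trimmed-down handler: only the fields used here."""
    if ret >= 0:
        add((comm, "read", "bytes"), ret)
    else:
        add((comm, "read", "failed"), 1)


def syscalls__sys_exit_write(comm, ret):
    """Hypothetical trimmed-down handler: only the fields used here."""
    if ret >= 0:
        add((comm, "write", "bytes"), ret)
    else:
        add((comm, "write", "failed"), 1)


def trace_end():
    """Print one line per process, operation and tally."""
    for comm in sorted(totals):
        for operation in sorted(totals[comm]):
            for kind in sorted(totals[comm][operation]):
                value = totals[comm][operation][kind]
                print("%-10s %-6s %-7s %10d" % (comm, operation, kind, value))
```
</literallayout>
Feeding the handlers the return values 512, 471 and -11 for reads
by 'wget' leaves 983 bytes read and one failed read recorded
under totals['wget']['read'], without any of the intermediate
levels having been created explicitly.
</para>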
<para>
For more details, see the documentation on using the
<ulink url='http://linux.die.net/man/1/perf-script-python'>'perf script' Python binding</ulink>.
</para>
</section>

<section id='system-wide-tracing-and-profiling'>
<title>System-Wide Tracing and Profiling</title>

<para>
The examples so far have focused on tracing a particular program or
workload - in other words, every profiling run has specified the
program to profile on the command line, e.g. 'perf record wget ...'.
</para>

<para>
It's also possible, and more interesting in many cases, to run a
system-wide profile or trace while running the workload in a
separate shell.
</para>

<para>
To do system-wide profiling or tracing, you typically use
the -a flag to 'perf record'.
</para>

<para>
To demonstrate this, open up one window and start the profile
using the -a flag (press Ctrl-C to stop tracing):
<literallayout class='monospaced'>
root@crownbay:~# perf record -g -a
^C[ perf record: Woken up 6 times to write data ]
[ perf record: Captured and wrote 1.400 MB perf.data (~61172 samples) ]
</literallayout>
In another window, run the wget test:
<literallayout class='monospaced'>
root@crownbay:~# wget <ulink url='http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2'>http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2</ulink>
Connecting to downloads.yoctoproject.org (140.211.169.59:80)
linux-2.6.19.2.tar.b 100% |*******************************| 41727k 0:00:00 ETA
</literallayout>
Here we see entries not only for our wget load, but for other
processes running on the system as well:
</para>

<para>
<imagedata fileref="figures/perf-systemwide.png" width="6in" depth="7in" align="center" scalefit="1" />
</para>

<para>
In the snapshot above, we can see callchains that originate in
libc, and a callchain from Xorg that demonstrates that we're
using a proprietary X driver in userspace (notice the presence
of 'PVR' and some other unresolvable symbols in the expanded
Xorg callchain).
</para>

<para>
Note also that we have both kernel and userspace entries in the
above snapshot. We can also tell perf to focus on userspace by
providing a modifier, in this case 'u', to the 'cycles' hardware
counter when we record a profile:
<literallayout class='monospaced'>
root@crownbay:~# perf record -g -a -e cycles:u
^C[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.376 MB perf.data (~16443 samples) ]
</literallayout>
</para>

<para>
<imagedata fileref="figures/perf-report-cycles-u.png" width="6in" depth="7in" align="center" scalefit="1" />
</para>

<para>
Notice that in the screenshot above, we see only userspace
entries ([.]).
</para>

<para>
Finally, we can press 'enter' on a leaf node and select the 'Zoom
into DSO' menu item to show only entries associated with a
specific DSO. In the screenshot below, we've zoomed into the
'libc' DSO, which shows all the entries associated with the
libc-xxx.so DSO.
</para>

<para>
<imagedata fileref="figures/perf-systemwide-libc.png" width="6in" depth="7in" align="center" scalefit="1" />
</para>

<para>
We can also use the system-wide -a switch to do system-wide
tracing. Here we'll trace a couple of scheduler events:
<literallayout class='monospaced'>
root@crownbay:~# perf record -a -e sched:sched_switch -e sched:sched_wakeup
^C[ perf record: Woken up 38 times to write data ]
[ perf record: Captured and wrote 9.780 MB perf.data (~427299 samples) ]
</literallayout>
We can look at the raw output using 'perf script' with no
arguments:
<literallayout class='monospaced'>
root@crownbay:~# perf script

perf 1383 [001] 6171.460045: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
perf 1383 [001] 6171.460066: sched_switch: prev_comm=perf prev_pid=1383 prev_prio=120 prev_state=R+ ==> next_comm=kworker/1:1 next_pid=21 next_prio=120
kworker/1:1 21 [001] 6171.460093: sched_switch: prev_comm=kworker/1:1 prev_pid=21 prev_prio=120 prev_state=S ==> next_comm=perf next_pid=1383 next_prio=120
swapper 0 [000] 6171.468063: sched_wakeup: comm=kworker/0:3 pid=1209 prio=120 success=1 target_cpu=000
swapper 0 [000] 6171.468107: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/0:3 next_pid=1209 next_prio=120
kworker/0:3 1209 [000] 6171.468143: sched_switch: prev_comm=kworker/0:3 prev_pid=1209 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
perf 1383 [001] 6171.470039: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
perf 1383 [001] 6171.470058: sched_switch: prev_comm=perf prev_pid=1383 prev_prio=120 prev_state=R+ ==> next_comm=kworker/1:1 next_pid=21 next_prio=120
kworker/1:1 21 [001] 6171.470082: sched_switch: prev_comm=kworker/1:1 prev_pid=21 prev_prio=120 prev_state=S ==> next_comm=perf next_pid=1383 next_prio=120
perf 1383 [001] 6171.480035: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
</literallayout>
</para>
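<para>
A system-wide scheduler trace like the one above can be
summarized per task with very little code. The sketch below
counts how often each task is switched in, using the key=value
fields of the sched_switch lines. It is an illustration only:
switch_ins() is a hypothetical helper name, SAMPLE_TRACE is an
excerpt of the output above, and the pattern assumes that field
values contain no spaces:
<literallayout class='monospaced'>
```python
import re
from collections import Counter

# key=value pairs as printed for sched events.
FIELD_PATTERN = re.compile(r"(\w+)=(\S+)")

SAMPLE_TRACE = """\
perf 1383 [001] 6171.460045: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
perf 1383 [001] 6171.460066: sched_switch: prev_comm=perf prev_pid=1383 prev_prio=120 prev_state=R+ ==> next_comm=kworker/1:1 next_pid=21 next_prio=120
kworker/1:1 21 [001] 6171.460093: sched_switch: prev_comm=kworker/1:1 prev_pid=21 prev_prio=120 prev_state=S ==> next_comm=perf next_pid=1383 next_prio=120
swapper 0 [000] 6171.468107: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/0:3 next_pid=1209 next_prio=120
"""


def switch_ins(trace_text):
    """Count how often each task is switched in (the next_comm field)."""
    tally = Counter()
    for line in trace_text.splitlines():
        if " sched_switch: " not in line:
            continue  # only sched_switch lines carry next_comm
        fields = dict(FIELD_PATTERN.findall(line))
        next_comm = fields.get("next_comm")
        if next_comm is not None:
            tally[next_comm] += 1
    return tally


if __name__ == "__main__":
    for comm, count in sorted(switch_ins(SAMPLE_TRACE).items()):
        print("%-16s %d" % (comm, count))
```
</literallayout>
In the excerpt, 'kworker/1:1', 'perf' and 'kworker/0:3' are each
switched in once. Note that 'perf' itself shows up in the tally,
which is the kind of noise that event filters can remove at
record time.
</para>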
<section id='perf-filtering'>
<title>Filtering</title>

<para>
Notice that there are a lot of events that don't really have
anything to do with what we're interested in, namely events
that schedule 'perf' itself in and out or that wake perf up.
We can get rid of those by using the '--filter' option -
for each event we specify using -e, we can add a --filter
after that to filter out trace events that contain fields
with specific values:
<literallayout class='monospaced'>
root@crownbay:~# perf record -a -e sched:sched_switch --filter 'next_comm != perf &amp;&amp; prev_comm != perf' -e sched:sched_wakeup --filter 'comm != perf'
^C[ perf record: Woken up 38 times to write data ]
[ perf record: Captured and wrote 9.688 MB perf.data (~423279 samples) ]


root@crownbay:~# perf script

swapper 0 [000] 7932.162180: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/0:3 next_pid=1209 next_prio=120
kworker/0:3 1209 [000] 7932.162236: sched_switch: prev_comm=kworker/0:3 prev_pid=1209 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
perf 1407 [001] 7932.170048: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
perf 1407 [001] 7932.180044: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
perf 1407 [001] 7932.190038: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
perf 1407 [001] 7932.200044: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
perf 1407 [001] 7932.210044: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
perf 1407 [001] 7932.220044: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
swapper 0 [001] 7932.230111: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
swapper 0 [001] 7932.230146: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/1:1 next_pid=21 next_prio=120
kworker/1:1 21 [001] 7932.230205: sched_switch: prev_comm=kworker/1:1 prev_pid=21 prev_prio=120 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120
swapper 0 [000] 7932.326109: sched_wakeup: comm=kworker/0:3 pid=1209 prio=120 success=1 target_cpu=000
swapper 0 [000] 7932.326171: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/0:3 next_pid=1209 next_prio=120
kworker/0:3 1209 [000] 7932.326214: sched_switch: prev_comm=kworker/0:3 prev_pid=1209 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
</literallayout>
In this case, we've filtered out all events that have 'perf'
in their 'comm', 'prev_comm', or 'next_comm' fields. Notice
that there are still events recorded for perf, but that
those events don't have values of 'perf' for the filtered
fields. To completely filter out anything from perf will
require a bit more work, but for the purpose of demonstrating
how to use filters, it's close enough.
</para>
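<para>
The reason some 'perf' entries survive is that the filters test
the fields of each event, not the task that was running when the
event fired. The following Python mirrors the two filter
expressions as ordinary predicates to make that visible. It is a
user-space illustration only, not how the kernel evaluates
filters: keep_sched_switch(), keep_sched_wakeup() and
apply_filters() are hypothetical names, and SAMPLE_EVENTS holds
simplified (current task, event name, fields) tuples modelled on
the traces above:
<literallayout class='monospaced'>
```python
def keep_sched_switch(fields):
    """Mirror of the filter: next_comm != perf AND prev_comm != perf."""
    return fields["next_comm"] != "perf" and fields["prev_comm"] != "perf"


def keep_sched_wakeup(fields):
    """Mirror of the filter: comm != perf."""
    return fields["comm"] != "perf"


FILTERS = {
    "sched_switch": keep_sched_switch,
    "sched_wakeup": keep_sched_wakeup,
}

# (current task, event name, event fields)
SAMPLE_EVENTS = [
    ("perf", "sched_wakeup", {"comm": "kworker/1:1", "pid": 21}),
    ("perf", "sched_switch", {"prev_comm": "perf", "next_comm": "kworker/1:1"}),
    ("kworker/1:1", "sched_switch", {"prev_comm": "kworker/1:1", "next_comm": "perf"}),
    ("swapper", "sched_switch", {"prev_comm": "swapper/0", "next_comm": "kworker/0:3"}),
]


def apply_filters(events):
    """Return the events whose fields pass the filter for their event type."""
    return [event for event in events if FILTERS[event[1]](event[2])]


if __name__ == "__main__":
    for task, name, fields in apply_filters(SAMPLE_EVENTS):
        print(task, name, fields)
```
</literallayout>
Two of the four sample events pass. The first one is a
sched_wakeup that fired while 'perf' was the current task, yet
its 'comm' field names the task being woken ('kworker/1:1'), so
the filter keeps it. This matches the remaining 'perf' lines in
the filtered output above.
</para>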
<note>
Tying It Together: This is exactly the same set of event
filters defined by the trace event subsystem. See the
ftrace/tracecmd/kernelshark section for more discussion about
these event filters.
</note>

<note>
Tying It Together: These event filters are implemented by a
special-purpose pseudo-interpreter in the kernel and are an
integral and indispensable part of the perf design as it
relates to tracing. Kernel-based event filters provide a
mechanism to precisely throttle the event stream that appears
in user space, where it makes sense to provide bindings to real
programming languages for postprocessing the event stream.
This architecture allows for the intelligent and flexible
partitioning of processing between the kernel and user space.
Contrast this with other tools such as SystemTap, which does
all of its processing in the kernel and as such requires a
special project-defined language in order to accommodate that
design, or LTTng, where everything is sent to userspace and
as such requires a super-efficient kernel-to-userspace
transport mechanism in order to function properly. While
perf certainly can benefit from, for instance, advances in
the design of the transport, it doesn't fundamentally depend
on them. Basically, if you find that your perf tracing
application is causing buffer I/O overruns, it probably
means that you aren't taking enough advantage of the
kernel filtering engine.
</note>
</section>
</section>

<section id='using-dynamic-tracepoints'>
<title>Using Dynamic Tracepoints</title>

<para>
perf isn't restricted to the fixed set of static tracepoints
listed by 'perf list'. Users can also add their own 'dynamic'
tracepoints anywhere in the kernel. For instance, suppose we
want to define our own tracepoint on do_fork(). We can do that
using the 'perf probe' subcommand:
<literallayout class='monospaced'>
root@crownbay:~# perf probe do_fork
Added new event:
probe:do_fork (on do_fork)

You can now use it in all perf tools, such as:

perf record -e probe:do_fork -aR sleep 1
</literallayout>
Adding a new tracepoint via 'perf probe' results in an event
with all the expected files and format in
/sys/kernel/debug/tracing/events, just the same as for static
tracepoints (as discussed in more detail in the trace events
subsystem section):
<literallayout class='monospaced'>
root@crownbay:/sys/kernel/debug/tracing/events/probe/do_fork# ls -al
drwxr-xr-x 2 root root 0 Oct 28 11:42 .
drwxr-xr-x 3 root root 0 Oct 28 11:42 ..
-rw-r--r-- 1 root root 0 Oct 28 11:42 enable
-rw-r--r-- 1 root root 0 Oct 28 11:42 filter
-r--r--r-- 1 root root 0 Oct 28 11:42 format
-r--r--r-- 1 root root 0 Oct 28 11:42 id

root@crownbay:/sys/kernel/debug/tracing/events/probe/do_fork# cat format
name: do_fork
ID: 944
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:int common_padding; offset:8; size:4; signed:1;

field:unsigned long __probe_ip; offset:12; size:4; signed:0;

print fmt: "(%lx)", REC->__probe_ip
</literallayout>
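Because the 'format' file has the same regular layout for
static and dynamic tracepoints, it is easy to read from a
script. The following is a minimal sketch, not part of perf
itself, that turns the 'field:' lines of such a file into a
list of dictionaries. The names FIELD_RE, SAMPLE and
parse_format are made up for this example, and the sample text
is an abbreviated copy of the listing above:
<literallayout class='monospaced'>
```python
import re

# One 'field:' line: C declaration, then offset, size and signedness.
FIELD_RE = re.compile(
    r"field:(.+?);\s*offset:(\d+);\s*size:(\d+);\s*signed:(\d+);"
)

# Abbreviated copy of the probe:do_fork format file shown above.
SAMPLE = """\
name: do_fork
ID: 944
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:int common_pid; offset:4; size:4; signed:1;

field:unsigned long __probe_ip; offset:12; size:4; signed:0;

print fmt: "(%lx)", REC->__probe_ip
"""


def parse_format(text):
    """Return one dict per 'field:' line; all other lines are skipped."""
    fields = []
    for line in text.splitlines():
        match = FIELD_RE.search(line)
        if match is None:
            continue
        decl, offset, size, signed = match.groups()
        # Split "unsigned long __probe_ip" at the last space. For array
        # fields the name keeps its suffix, for example "comm[16]".
        ctype, _, name = decl.rpartition(" ")
        fields.append({
            "type": ctype,
            "name": name,
            "offset": int(offset),
            "size": int(size),
            "signed": signed == "1",
        })
    return fields


if __name__ == "__main__":
    for field in parse_format(SAMPLE):
        print(field)
```
</literallayout>
On a real system the text would be read from the 'format' file
itself, which normally requires root access to debugfs.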
We can list all dynamic tracepoints currently in existence:
<literallayout class='monospaced'>
root@crownbay:~# perf probe -l
probe:do_fork (on do_fork)
probe:schedule (on schedule)
</literallayout>
Let's record system-wide ('sleep 30' is a trick for recording
system-wide: the command itself does nothing but wake up after
30 seconds, while perf records whatever happens in the meantime):
<literallayout class='monospaced'>
root@crownbay:~# perf record -g -a -e probe:do_fork sleep 30
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.087 MB perf.data (~3812 samples) ]
</literallayout>
Using 'perf script' we can see each do_fork event that fired:
<literallayout class='monospaced'>
root@crownbay:~# perf script

# ========
# captured on: Sun Oct 28 11:55:18 2012
# hostname : crownbay
# os release : 3.4.11-yocto-standard
# perf version : 3.4.11
# arch : i686
# nrcpus online : 2
# nrcpus avail : 2
# cpudesc : Intel(R) Atom(TM) CPU E660 @ 1.30GHz
# cpuid : GenuineIntel,6,38,1
# total memory : 1017184 kB
# cmdline : /usr/bin/perf record -g -a -e probe:do_fork sleep 30
# event : name = probe:do_fork, type = 2, config = 0x3b0, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern
= 0, id = { 5, 6 }
# HEADER_CPU_TOPOLOGY info available, use -I to display
# ========
#
matchbox-deskto 1197 [001] 34211.378318: do_fork: (c1028460)
matchbox-deskto 1295 [001] 34211.380388: do_fork: (c1028460)
pcmanfm 1296 [000] 34211.632350: do_fork: (c1028460)
pcmanfm 1296 [000] 34211.639917: do_fork: (c1028460)
matchbox-deskto 1197 [001] 34217.541603: do_fork: (c1028460)
matchbox-deskto 1299 [001] 34217.543584: do_fork: (c1028460)
gthumb 1300 [001] 34217.697451: do_fork: (c1028460)
gthumb 1300 [001] 34219.085734: do_fork: (c1028460)
gthumb 1300 [000] 34219.121351: do_fork: (c1028460)
gthumb 1300 [001] 34219.264551: do_fork: (c1028460)
pcmanfm 1296 [000] 34219.590380: do_fork: (c1028460)
matchbox-deskto 1197 [001] 34224.955965: do_fork: (c1028460)
matchbox-deskto 1306 [001] 34224.957972: do_fork: (c1028460)
matchbox-termin 1307 [000] 34225.038214: do_fork: (c1028460)
matchbox-termin 1307 [001] 34225.044218: do_fork: (c1028460)
matchbox-termin 1307 [000] 34225.046442: do_fork: (c1028460)
matchbox-deskto 1197 [001] 34237.112138: do_fork: (c1028460)
matchbox-deskto 1311 [001] 34237.114106: do_fork: (c1028460)
gaku 1312 [000] 34237.202388: do_fork: (c1028460)
</literallayout>
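Text output like this is also easy to summarize in user space.
The following is a minimal sketch, assuming the default line
layout shown above (command, pid, CPU in brackets, timestamp,
event name) and command names without embedded spaces. The
names LINE_RE, SAMPLE and count_by_comm are made up for this
example, and the sample is a short excerpt of the listing:
<literallayout class='monospaced'>
```python
import re
from collections import Counter

# comm, pid, [cpu], timestamp, event name (layout assumed from the listing).
LINE_RE = re.compile(r"^\s*(\S+)\s+(\d+)\s+\[(\d+)\]\s+(\d+\.\d+):\s+(\S+):")

# Short excerpt of the 'perf script' output shown above.
SAMPLE = """\
# ========
# hostname : crownbay
# ========
#
matchbox-deskto 1197 [001] 34211.378318: do_fork: (c1028460)
matchbox-deskto 1295 [001] 34211.380388: do_fork: (c1028460)
pcmanfm 1296 [000] 34211.632350: do_fork: (c1028460)
"""


def count_by_comm(text, event="do_fork"):
    """Count how many times each command triggered the given event."""
    counts = Counter()
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # header lines emitted by 'perf script'
        match = LINE_RE.match(line)
        if match and match.group(5) == event:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    for comm, count in count_by_comm(SAMPLE).most_common():
        print("%-16s %d" % (comm, count))
```
</literallayout>
For anything beyond a quick summary, the 'perf script' Python
binding is usually a better fit than parsing text, since it
hands each event's fields to the script directly.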
And using 'perf report' on the same file, we can see the
callgraphs from starting a few programs during those 30 seconds:
</para>

<para>
<imagedata fileref="figures/perf-probe-do_fork-profile.png" width="6in" depth="7in" align="center" scalefit="1" />
</para>
<note>
Tying It Together: The trace events subsystem accommodates static
and dynamic tracepoints in exactly the same way - there's no
difference as far as the infrastructure is concerned. See the
ftrace section for more details on the trace event subsystem.
</note>
<note>
Tying It Together: Dynamic tracepoints are implemented under the
covers by kprobes and uprobes. kprobes and uprobes are also used
by SystemTap and are in fact its main focus.
</note>
</section>
</section>

<section id='perf-documentation'>
<title>Documentation</title>

<para>
Online versions of the man pages for the commands discussed in this
section can be found here:
<itemizedlist>
<listitem><para>The <ulink url='http://linux.die.net/man/1/perf-stat'>'perf stat' manpage</ulink>.
</para></listitem>
<listitem><para>The <ulink url='http://linux.die.net/man/1/perf-record'>'perf record' manpage</ulink>.
</para></listitem>
<listitem><para>The <ulink url='http://linux.die.net/man/1/perf-report'>'perf report' manpage</ulink>.
</para></listitem>
<listitem><para>The <ulink url='http://linux.die.net/man/1/perf-probe'>'perf probe' manpage</ulink>.
</para></listitem>
<listitem><para>The <ulink url='http://linux.die.net/man/1/perf-script'>'perf script' manpage</ulink>.
</para></listitem>
<listitem><para>Documentation on using the
<ulink url='http://linux.die.net/man/1/perf-script-python'>'perf script' python binding</ulink>.
</para></listitem>
<listitem><para>The top-level
<ulink url='http://linux.die.net/man/1/perf'>perf(1) manpage</ulink>.
</para></listitem>
</itemizedlist>
</para>

<para>
Normally, you should be able to invoke the man pages via perf
itself, e.g. 'perf help' or 'perf help record'.
</para>

<para>
However, perf relies on the man pages for most of its help
functionality, and by default Yocto doesn't install man pages,
so these commands may not work on a default image. This is a
known problem and is tracked by a Yocto bug:
<ulink url='https://bugzilla.yoctoproject.org/show_bug.cgi?id=3388'>Bug 3388 - perf: enable man pages for basic 'help' functionality</ulink>.
</para>

<para>
The man pages in text form, along with some other files, such as
a set of examples, can be found in the 'perf' directory of the
kernel tree:
<literallayout class='monospaced'>
tools/perf/Documentation
</literallayout>
There's also a nice perf tutorial on the perf wiki that goes
into more detail than we do here in certain areas:
<ulink url='https://perf.wiki.kernel.org/index.php/Tutorial'>Perf Tutorial</ulink>
</para>
</section>
</section>
</chapter>
<!--
vim: expandtab tw=80 ts=4
-->