Sometimes you just want to abuse Linux perf
to make it do a thing it’s not designed for, and a proper C program
would represent an excessive amount of work.
Here are two tricks I find helpful when jotting down hacky analysis scripts.
Programmatically interacting with addr2line -i
Perf can resolve symbols itself, but addr2line is a lot more flexible (especially when you inflict subtle things on your executable’s mappings).
It’s already nice that addr2line -Cfe /path/to/binary lets you write
hex addresses to stdin and spits out the corresponding function name
on one line, and its source location on the next (or ?? / ??:0 if
debug info is missing). However, for heavily inlined (cough C++
cough) programs, you really want the whole callstack that’s encoded
in the debug info, not just the most deeply inlined function (“oh
great, it’s in std::vector<Foo>::size()”).
The --inlines flag (-i)
addresses that… by printing source locations for inline callers on
their own line(s). Now that the output for each address can span
a variable number of lines, how is one to know when to stop reading?
A simple trick is to always write two addresses to addr2line’s
standard input: the address we want to symbolicate, and an address
that never has debug info (e.g., 0).
EDIT: Travis Downs reports that llvm-addr2line-14 finds debug info for 0x0
(presumably a bug; I don’t see that on llvm-addr2line-12)
and suggests looking for 0x0.* in addition to ?? / ??:0. It’s
easy enough to stop when either happens, and clang’s version of
addr2line can be a lot faster than binutils’ on files with a lot of
debug information.[1]
We now know that the first set of resolution information lines (one
line when printing only the file and line number, two lines when
printing function names as well with -f) belongs to the address we
want to symbolicate. We also know to expect output for missing
information (??:0 or ?? / ??:0) from the dummy address. We can thus
keep reading until we find a set of lines that corresponds to missing
information, and disregard that final source info.
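As a rough sketch of that stopping rule, here is a small Python helper that consumes one response from an iterator of output lines. The name read_records and its with_functions parameter are made up for illustration, and the sketch assumes the missing-information markers are exactly the ?? / ??:0 strings described above; it does not handle the llvm-addr2line-14 behaviour mentioned in the EDIT.

```python
from typing import Iterator, List, Tuple


def read_records(lines: Iterator[str],
                 with_functions: bool = True) -> List[Tuple[str, ...]]:
    """Consume the output for one "$IP\\n0\\n" request.

    Returns the records that belong to $IP, innermost frame first.
    `with_functions` says whether addr2line was started with -f.
    """
    # One line per record without -f, two lines (function, location) with it.
    size = 2 if with_functions else 1
    # Assumed marker for "no debug info", as printed for the dummy address.
    missing = ("??", "??:0") if with_functions else ("??:0",)

    def next_record() -> Tuple[str, ...]:
        return tuple([next(lines).rstrip("\n") for _ in range(size)])

    # The first record always belongs to $IP, even if it is "missing".
    records = [next_record()]
    while True:
        record = next_record()
        if record == missing:
            # That was the dummy address: drop it and stop reading.
            return records
        records.append(record)
```

The helper never reads past the dummy address's record, so the same iterator can be reused for the next request.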
For example, when $IP itself has no debug info, passing $IP\n0\n on stdin could yield:
??
??:0
??
??:0
or, without function names (no -f),
??:0
??:0
In both cases we first consume the first set of lines (the output
for $IP must include at least one record), then consume the next set
of lines and observe that it represents missing information, so we stop
reading.
When debug information is present, we might instead find
foo()
src/foo.cc:10
??
??:0
The same algorithm clearly works.
Finally, with inlining, we might instead observe
inline_function()
src/bar.h:5
foo()
src/foo.cc:12
??
??:0
We’ll unconditionally assign the first pair of lines to $IP,
read a second pair of lines, see that it’s not ?? / ??:0
and push that to the bottom of the inline source location
stack, and finally stop after reading the third pair of lines.
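Wired up to a live addr2line process, the whole exchange could look something like the following Python sketch. The spawn and symbolize helpers are hypothetical names, not part of any tool. The sketch assumes -f is always passed (so records are pairs of lines), that the tool flushes its output after each address, and that the dummy address 0 resolves to ?? / ??:0.

```python
import subprocess
from typing import IO, List, Tuple

MISSING = ("??", "??:0")  # assumed output for the dummy address


def spawn(binary: str, tool: str = "addr2line") -> subprocess.Popen:
    """Start one long-lived addr2line process for `binary`."""
    return subprocess.Popen(
        [tool, "-C", "-f", "-i", "-e", binary],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)


def symbolize(to_tool: IO[str], from_tool: IO[str],
              addr: int) -> List[Tuple[str, str]]:
    """Return (function, location) pairs for `addr`, innermost first."""
    # Always send two addresses: the real one, then the dummy.
    to_tool.write(f"{addr:#x}\n0\n")
    to_tool.flush()

    def read_pair() -> Tuple[str, str]:
        function = from_tool.readline()
        location = from_tool.readline()
        if not location:  # readline() returns "" at end of file
            raise EOFError("addr2line exited before answering")
        return function.rstrip("\n"), location.rstrip("\n")

    frames = [read_pair()]  # the first pair always belongs to addr
    while (pair := read_pair()) != MISSING:
        frames.append(pair)  # an inline caller of the previous frame
    return frames


# Usage (the address is a placeholder, not taken from a real binary):
#   proc = spawn("/path/to/binary")
#   frames = symbolize(proc.stdin, proc.stdout, 0x401136)
```

Taking the two streams as arguments rather than the Popen object keeps the protocol logic testable with in-memory files.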
Triggering PMU events from non-PMU perf events
Performance monitoring unit (PMU) events in perf tend to be much more powerful than non-PMU events: each perf “driver” works independently, so only PMU events can snapshot the Processor Trace (PT) buffer, for example.
However, we sometimes really want to trigger on a non-PMU event. For example, we might want to watch for writes to a specific address with a hardware breakpoint, and snapshot the PT buffer to figure out what happened in the microseconds preceding that write. Unfortunately, that doesn’t work out of the box: only PMU events can snapshot the buffer. I remember running into a similar limitation when I wanted to capture performance counters after non-PMU events.
There is, however, a way to trigger PMU events from most non-PMU events: watch for far branches! Many years ago, I believe I also found these events much more reliable for detecting preemption than the scheduler’s software event.
Far branches are rare (they certainly don’t happen in regular x86-64 userspace programs), but interrupts usually trigger a far CALL to execute the handler in ring 0 (attributed to ring 0), and a far RET to switch back to the user program (attributed to ring 3).
We can thus configure
perf record \
-e intel_pt//u \
-e BR_INST_RETIRED.FAR_BRANCH/aux-sample-size=...,period=1/u \
-e mem:0x...:wu ...
to:
- trigger a debug interrupt when userspace writes to the watched memory address
- which will increment the far_branch performance monitoring counter
- which triggers Linux’s performance monitoring interrupt handler
- which will finally write both the far branch event and its associated PT buffer to the perf event ring buffer.
Not only does this work, but it also minimises the trigger latency.
That’s a big win compared to, e.g., perf record’s built-in
--switch-output-event: a trigger latency on the order of hundreds of
microseconds forces a large PT buffer in order to capture the period
we’re actually interested in, and copying that large buffer slows
down everything.
Is this documented?
Who knows? (Who cares?) These tricks fulfill a common need in quick hacks, and I’ve been using (and rediscovering) them for years.
I find tightly scoped tools that don’t try to generalise have an ideal insight:effort ratio. Go write your own!
[1] I ended up passing a generated string suffixed with a UUIDv4 as a sentinel: llvm-addr2line just spits back any line that doesn’t look like an address. Alexey Alexandrov on the profiler developers’ slack noted that llvm-symbolizer cleanly terminates each sequence of frames with an empty line.