Magic-trace: Diagnose tricky performance issues with Intel Processor Trace

Intel Processor Trace is a hardware technology that can record all program execution flow along with timing information accurate to around 30ns. As far as I can tell almost nobody uses it, seemingly because capturing the data is tricky and, without any visualization tools, you’re forced to read enormous text dumps.

Magic-trace is a tool we built and open-sourced to make it easy to capture a trace of around 10ms leading up to a function call you chose to instrument, and then visualize the call stack on a timeline where you can zoom in and see every function call and how long it took. Here’s a captured trace of 5ms of OCaml program startup:

5ms of OCaml startup

And here’s the same trace zoomed in to an arbitrary 500 nanoseconds. The thin red events are 1-3 nanoseconds:

500ns of OCaml startup

Recently we’ve been using this tool to diagnose performance issues that would be very difficult to solve with other tools. Using it is as easy as adding a Magic_trace.take_snapshot call to your code (or using a fuzzy-finder to select any existing function), then running magic-trace attach and using the fuzzy-finder to select your process. It’ll spit out a trace you can view in Google’s Perfetto trace viewer.

In this post we’ll go over why Processor Trace is so special, the difficulties of building something on top of a hardware technology almost nobody uses, how we were beset by a kernel bug and a hardware bug, and the kinds of problems we’ve been able to solve with the tool.

Why Intel Processor Trace, and why not?

Let’s look at the major types of performance analysis tools and why magic-trace serves a different niche:

Sampling profilers interrupt the program every 250 microseconds or so, sample the current call stack, and then summarize them all together. These are great for giving you a sense of where your program is spending its time. However, at Jane Street we have lots of high-performance trading systems that spend nearly all of their time waiting for network packets that we want to respond to in far less than the 250-microsecond sampling interval. Sampling profilers are approximately useless for diagnosing latency issues on that scale: you’d be lucky to get one sample in the code you care about!

Even in more traditional systems, you may want to diagnose short but rare tail-latency events, or distinguish a function being called 10 times more than you expected from one call to it taking 10 times longer than expected, which a sampling profiler can’t do.

Instrumentation-based tracing either patches or compiles probes into a program that record when certain functions start and end, then typically visualizes them on an interactive timeline UI. We re-use the UI from the Perfetto tracing system for magic-trace, although we needed to fork it to better handle events at the scale of single nanoseconds. High-performance tracing systems like tracy even manage to get the overhead down to around 2ns per call (we built a similar system for OCaml and open-sourced it). However, instrumenting every single function is risky (e.g. you might triple the cost of a 1ns function that’s called everywhere) so typically they require manual instrumentation, and sometimes your performance problems are in an app or function you haven’t annotated.
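To make the overhead tradeoff concrete, here’s a minimal C sketch of the kind of probe such tracers compile into each function: read a timestamp and append a fixed-size record to a per-thread ring buffer. This is not tracy’s or our OCaml tracer’s actual code, just an illustration of why the per-call cost can stay in the nanosecond range.

#include <stdint.h>
#include <x86intrin.h>  // __rdtsc

// One fixed-size record per function entry or exit.
struct event {
  uint64_t tsc;      // timestamp from the rdtsc instruction
  const char *name;  // static string, so recording it is just a pointer store
  uint8_t kind;      // 0 = function entry, 1 = function exit
};

// Per-thread ring buffer, so probes never contend on a lock.
static __thread struct event buf[1 << 12];
static __thread uint32_t head;

static inline void probe(const char *name, uint8_t kind) {
  struct event *e = &buf[head++ & ((1 << 12) - 1)];
  e->tsc = __rdtsc();  // a few nanoseconds; the dominant cost of the probe
  e->name = name;
  e->kind = kind;
}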

Hardware tracing like Intel Processor Trace (IPT) has the advantages of tracing but doesn’t require any instrumentation, and can have much lower overhead than instrumenting everything. It uses a very efficient format that encodes just enough info to reconstruct the control flow – for example, a conditional branch takes one bit. Time overhead for IPT varies from 2-20% depending on the program, with every one of our programs I’ve benchmarked experiencing less than a 10% slowdown and usually under 5%.

There are a few downsides to Processor Trace though:

  • Many VMs don’t support it and it needs a post-Skylake Intel processor (some other vendors have similar tech; AMD doesn’t yet).
  • You have no choice but the full 1GB/s firehose (with the exception of some limited filtering options) so it’s difficult to store and analyze longer traces. With instrumentation you can manually pick the important functions and economize on trace size.
  • Decoding is slow because it needs to follow along with disassembled instructions from the binary and reconstruct the flow. Other than specialized decoders for fuzzing, the fastest decoder is 60x slower than real time.

A minimum viable product

During Jane Street’s 2021 summer internship, I was talking to some colleagues about our issues profiling very short interesting time segments. I noted that Intel Processor Trace would be great for this but that it was really hard to use. Then I realized that with the trace visualization library I had just written, and some features from the Processor Trace documentation I had just read, I could see a path to a user-friendly tool. So I drafted a new intern project document, and for the second half of his internship, Chris Lambert and I worked on putting it together.

The key idea behind quickly making a useful tool was to limit the scope:

  • We’d focus on the circular buffer mode, where it overwrites old data until you snapshot it after something interesting happens. Processor Trace can save all data, but doing so creates 1GB of trace file per second.
  • We’d trigger the snapshots based on a function call in the target program. There are lots of other possibilities for deciding when to snapshot, but calling a function is very flexible, especially if you put it behind custom logic waiting for tail-latency events or something.
  • We’d only visualize function calls and returns, and only on a trace timeline. Processor Trace gives you full control-flow data and in theory you could visualize down to individual lines, but that ends up being too much data to deal with.

The first stage was to implement the tool as a wrapper around the Linux perf tool’s Processor Trace functionality, and Chris blazed through it in under a week. Sending the SIGUSR2 signal to perf caused it to take a snapshot, so Chris wrote a Magic_trace.take_snapshot function that sent SIGUSR2 to the parent pid. Then he wrote a parser and call-stack reconstructor to turn the perf script text dump of all branches into a trace, handling OCaml features like tail calls and some exceptions.
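For flavor, here’s a minimal C sketch of that first triggering mechanism (the real function lives in our OCaml Magic_trace library): perf record’s snapshot mode dumps its circular buffer when it receives SIGUSR2, and the target ran as a child of perf, so triggering a snapshot was just a signal to the parent pid.

#include <signal.h>
#include <unistd.h>

// perf record in snapshot mode dumps its buffer on SIGUSR2, and the
// traced program runs as a child of perf, so signalling the parent
// asks perf to take a snapshot.
void take_snapshot(void) {
  kill(getppid(), SIGUSR2);
}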

It was pretty exciting to look through the first traces and be able to zoom in and see the smallest details, immediately noticing things like OCaml program startup time being mostly composed of module initializers page-faulting in random parts of the binary.

Directly using kernel APIs and libipt

Then we embarked on something harder. Parsing the output of the perf tool was slow and couldn’t do the instruction-level decoding needed for properly handling pushes and pops to the OCaml exception handler stack. We decided to try directly using the kernel perf_event_open API and Intel’s libipt decoding library.

This turned out to be quite tricky, as we couldn’t find any evidence anyone had ever tried directly integrating perf_event_open with libipt before. I ended up spending my days poring over documentation and source code of libipt and the perf tool to figure out how to do things we hadn’t understood yet and handing answers and example links over to Chris, who wrote and debugged the C code to interface with the APIs and with OCaml.

After lots of research and debugging, by the end of his internship we’d managed to get a trace of events out of our from-scratch implementation. After Chris left I debugged the remaining issues and plumbed it in fully. Hopefully now that we’ve published a reference codebase, anyone else attempting this will have an easier time.
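As a rough picture of what that integration involves, here’s a hedged C sketch of the perf_event_open setup for Intel PT, based on the perf_event_open(2) man page rather than on magic-trace’s exact code; error handling and the sysfs-derived config bits are elided.

#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
  // Intel PT is a dynamic PMU; its perf event type is published in sysfs.
  FILE *f = fopen("/sys/bus/event_source/devices/intel_pt/type", "r");
  int pt_type;
  fscanf(f, "%d", &pt_type);
  fclose(f);

  struct perf_event_attr attr;
  memset(&attr, 0, sizeof(attr));
  attr.size = sizeof(attr);
  attr.type = pt_type;
  // Real code also sets attr.config bits (cyc, tsc, ...) parsed from
  // /sys/bus/event_source/devices/intel_pt/format/ for precise timing.
  attr.exclude_kernel = 1;

  // Trace ourselves for the sake of the sketch; magic-trace passes the
  // pid of the process it attaches to.
  int fd = syscall(SYS_perf_event_open, &attr, getpid(), -1, -1, 0);

  // First mmap: control page plus a normal ring buffer (1 + 2^n pages)
  // carrying side-band records like mmap events, needed to locate the code.
  long page = sysconf(_SC_PAGESIZE);
  struct perf_event_mmap_page *pc =
      mmap(NULL, page * (1 + 8), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

  // Second mmap: the AUX area that the PT hardware writes packets into.
  // Mapping it PROT_READ-only puts it in the overwrite ("snapshot") mode
  // that magic-trace relies on.
  pc->aux_offset = page * (1 + 8);
  pc->aux_size = page * 1024;
  void *aux =
      mmap(NULL, pc->aux_size, PROT_READ, MAP_SHARED, fd, pc->aux_offset);

  // The raw packets in `aux` then go to a decoder such as libipt, along
  // with the binary's instructions, to reconstruct the control flow.
  (void)aux;
  return 0;
}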

Hardware breakpoints for seamless triggering

After Chris left and things were working, the biggest feature we needed to make the tool useful and easy was the ability to attach to existing processes. Unfortunately this broke our parent-SIGUSR2-based snapshot signalling. I wanted Magic_trace.take_snapshot to have close to zero overhead while magic-trace wasn’t attached, and low overhead even when it did trigger a snapshot. I thought I might have to have every process host a tiny IPC server or use ptrace, but I wasn’t happy with those solutions.

I spent a bunch of time looking for a better solution and eventually I found a really satisfying one in the perf_event_open docs. It turns out that perf_event_open can use hardware breakpoints and notify you when a memory address is executed or accessed.

The cool thing about this approach is that it requires no cooperation from the target, has no overhead when not attached, and can actually be used on any function we want, not just a special Magic_trace.take_snapshot function. When we do use it on a special function, we can sample registers to see the arguments it was called with, allowing the user to include metadata with their snapshot.
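Here’s a hedged C sketch of what such a trigger can look like, following the perf_event_open(2) man page; the symbol address and the register mask are illustrative placeholders (real code reads the address from the target’s ELF symbol table).

#include <linux/hw_breakpoint.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

// Ask the kernel for a hardware execute-breakpoint on `symbol_addr` in
// process `pid`; the fd then reports a sample every time that address runs.
int open_breakpoint(pid_t pid, unsigned long symbol_addr) {
  struct perf_event_attr attr;
  memset(&attr, 0, sizeof(attr));
  attr.size = sizeof(attr);
  attr.type = PERF_TYPE_BREAKPOINT;
  attr.bp_type = HW_BREAKPOINT_X;  // fire when the address is executed
  attr.bp_addr = symbol_addr;      // e.g. the address of take_snapshot
  attr.bp_len = sizeof(long);      // required value for execute breakpoints
  attr.sample_period = 1;          // report every single hit
  // Capture user-space registers at the hit: on x86-64 the first arguments
  // arrive in rdi, rsi, rdx, rcx, so sampling them (plus rip) recovers the
  // metadata passed to the snapshot function. Bit numbers here are the
  // arch-specific values from asm/perf_regs.h.
  attr.sample_type = PERF_SAMPLE_REGS_USER;
  attr.sample_regs_user =
      (1ULL << 5) | (1ULL << 4) | (1ULL << 3) | (1ULL << 2) | (1ULL << 8);
  attr.wakeup_events = 1;          // wake the tracer on each sample
  return syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0);
}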

I think it says something interesting about my programming aesthetic that I spent a whole day researching alternatives to adding a tiny IPC server and ended up using a niche kernel API and hardware feature. I knew the hardware allowed a design which didn’t require recompiling or adding extra bloat to processes that weren’t being traced, and I really wanted to make first-time use as smooth as possible and avoid bloating everyone else’s programs. If I did go the IPC route, I was at least going to use less-obscure-but-still-rare Linux-only abstract domain sockets (named by the PID) to avoid having to clean up files or deal with ports. Sometimes standard approaches can’t get you to an ideal user experience, but they’re easier for your coworkers to maintain, you run into fewer issues, and you need to do less research. This tradeoff leaves low-hanging fruit for people who enjoy diving deep into obscure documentation and debugging weird issues, which can tip the balance. Hardware breakpoints, the whole magic-trace project, and other projects of mine are all the result of delighting in asking myself “could I obliterate this problem by being willing to do cursed things?”

Kernel bugs and hardware bugs, the perils of being early

People have sometimes used Processor Trace, and it mostly works, but I’ve learned that when using niche and complicated new hardware, I can’t have the same low priors as I usually do about bugs being due to the kernel or hardware.

I was excited to be able to try my hand at kernel debugging for the first time when I discovered a way to crash the kernel using a specific unusual combination of Processor Trace features. I used info from the kernel core dump, and read through control flow paths and recent patches in the kernel, to figure out the reason for the null pointer access. It turns out a patch added a flag that made one piece of state invalid to access, but missed guarding it with an if statement in one place. Exactly the kind of bug that algebraic data types in OCaml/Rust/etc help you avoid 🙂

Another bug was much more mysterious and difficult. On exactly one program out of any I tried, Processor Trace would mysteriously stop adding events to the trace buffer before it reached the snapshot point. I spent 2 weeks adding various sorts of observability and fixing other issues that got in the way (so at least magic-trace ended up better regardless), and couldn’t find any sensible software cause, e.g. a context switch that lined up with when the recording stopped. Finally I tried running it on a newer generation of Intel processors and the problem went away. I suspect it may be Intel erratum SKL171 “Intel® PT May Drop All Packets After an Internal Buffer Overflow”, which happens under a “rare microarchitectural condition”, although it still might be some race condition kernel bug that’s very consistent only on the older hardware.

Solving tricky problems

People have only been using magic-trace internally for about a month, but we’ve already made good use of it.

The original design goal was to help with performance problems in high-performance trading systems that sampling profilers are hopeless for, and that’s panned out. It helped identify a 100ns performance regression caused by a patch that turned out to prevent a function call from being inlined. It also helped diagnose why a new compiler version made a trading system slower, which also came down to an inlining decision.

But after we built magic-trace, we realized it could help with another kind of difficult performance problem that people at Jane Street encounter frequently. We use Async to cooperatively handle many concurrent tasks. The “cooperatively” part means that if one task takes too long then all other tasks have to wait for it. If you have an issue that causes a task to handle way more work than usual, it can cause a “long async cycle”. These can be really tricky to debug if they only happen occasionally, since you don’t get any info about what code was too slow. Previously, people resorted to capturing enormous perf profiles and then using logged monotonic timestamps to filter the profile to the relevant region.

Now with magic-trace, people can just add a snippet of code that calls Magic_trace.take_snapshot after cycles over a certain length, then attach magic-trace and wait for it to capture. Even if a long cycle is 15 seconds, the last 10 milliseconds of the job are normally the same uniform large batch of work, so you can just look back in the trace to see which code is doing too much work. We’ve already used this to solve one tricky issue where there were way more items in a certain collection than expected and a loop was spending seconds working over them. Filtering a sampling profile would’ve been harder and wouldn’t have been able to tell whether the function was looping too many times rather than, say, taking a really long time once or just always being somewhat slow.

Even if magic-trace is only indispensable for certain high-performance code, as a user-friendly performance tool in a toolbox it can be useful for all sorts of debugging and performance problems just by being quicker and easier to use than alternatives.

How you can use magic-trace

We designed magic-trace for our own OCaml code, but now you can use it on any native executable with symbol names (e.g. a C++ or Rust program) as long as you have a new enough Intel Linux machine. Here’s how:

# If this prints 1 your machine supports Processor Trace with precise timing
cat /sys/bus/event_source/devices/intel_pt/caps/psb_cyc

# Install Opam (https://opam.ocaml.org/doc/Install.html), then OCaml 4.12.1
opam switch create 4.12.1
opam switch 4.12.1

# Add the Jane Street pre-release repo, magic-trace isn't on public Opam yet
opam repo add janestreet-bleeding https://ocaml.janestreet.com/opam-repository

# Install the magic-trace command line tool
opam install magic-trace

# This lets you fuzzy-search a process to attach to and a symbol to snapshot on
magic-trace attach -symbol '' -output magic.ftf

# Go to https://ui.perfetto.dev/ and open the trace file

Nobody else has used Perfetto for traces like this before, so we also needed to leave our OCaml and C extensions for the land of TypeScript and patch Perfetto to support zooming to nanosecond levels. The main Perfetto UI works, and we hope to upstream some patches to it, but for the best experience you can build the UI from our fork.

Let’s use more Processor Trace!

Intel Processor Trace is an incredibly powerful and cool technology, and it’s an absolute shame that more people don’t use it. I hope that magic-trace shows people how useful Processor Trace and technologies like it can be, and makes it easier to use Processor Trace by providing an example codebase.

One way to build tools on top of Processor Trace that I haven’t mentioned yet is perf-dlfilter, which allows you to consume perf events using an efficient C API with a shared library rather than parsing text output. We didn’t use it because it was just being submitted to the kernel as we were writing magic-trace; we didn’t learn about it until I stumbled upon it months later. I’d recommend that tools start with perf and dlfilter rather than perf_event_open and libipt, as perf implements tons of stuff you’ll otherwise need to reimplement.
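For reference, a minimal dlfilter is just a shared object exporting a filter_event function. This is a hedged sketch following the perf-dlfilter(1) documentation, not code we shipped:

#include <perf/perf_dlfilter.h>
#include <stdio.h>

// Called by perf script for every sample; return 0 to keep the sample in
// the output, non-zero to drop it.
int filter_event(void *data, const struct perf_dlfilter_sample *sample,
                 void *ctx) {
  // A real tool would reconstruct call stacks here; this just prints fields.
  printf("time=%llu ip=%#llx\n", (unsigned long long)sample->time,
         (unsigned long long)sample->ip);
  return 0;
}

Build it with something like cc -shared -fPIC -o filter.so filter.c and pass it to perf script --dlfilter ./filter.so on a recorded trace.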

At the end of his internship, Chris even suggested that with hindsight we should have forked perf to add a C interface rather than embarking on the libipt route – and luckily someone else did, with the specific goal of efficiently reading Processor Trace events! You don’t even need a super new kernel, because you can compile the perf tool from a newer kernel tree and use it on an older kernel.

Here are some more ideas we’ve thought of for Processor Trace tooling that nobody’s built yet and that we might build into magic-trace:

  • Visualizing control flow at the level of individual lines of code, perhaps with a custom trace viewer designed for Processor Trace which lazily decodes the lowest levels again as you zoom in.
  • Feedback Directed Optimization of low-latency systems by optimizing based on recorded control flow in the latency-critical case.
  • Using Processor Trace to evaluate compiler optimizations by counting the number of actually executed register spills, page faults, etc. on benchmarks.
  • Building efficient instrumentation-based tracing on top of the PTWRITE instruction in the latest processors, which allows adding data into the trace.
  • Using the “PEBS via PT” feature on very new processors to sample cache misses or branch mispredicts and add them to the trace, so you can notice why your function is slow.
  • Using root permissions to record every process on every core, plus the kernel, and combining it with task switch information to visualize literally everything the processor was doing during a time slice, so mysterious performance problems have nowhere left to hide.

If more people use Processor Trace, more VM hypervisors will support it, Intel will hopefully invest more in it as a differentiating factor, and perhaps AMD will try to catch up. And if you want to use Processor Trace today, try magic-trace on your own problems, or apply to Jane Street and come try it on ours: we’re always hiring!
