Perforator is a tool for recording performance metrics over subregions of a program using the Linux "perf" interface.

Overview

Perforator

Documentation Go Report Card MIT License

Perforator is a tool for recording performance metrics over subregions of a program (e.g., functions) using the Linux "perf" interface. The perf tool provided by the Linux kernel only supports collecting statistics over the complete lifetime of a program, which is often inconvenient when a program includes setup and cleanup that should not be profiled along with the benchmark. Perforator is not as comprehensive as perf but it allows you to collect statistics for individual functions or address ranges.

Perforator only supports Linux AMD64. The target ELF binary may be generated from any language. For function lookup, make sure the binary is not stripped (it must contain a symbol table), and for additional information (source code regions, inlined function lookup), the binary must include DWARF information. Perforator supports position-independent binaries.

Perforator is primarily intended to be used as a CLI tool, but includes a library for more general user-code tracing called utrace, a library for reading ELF/DWARF information from executables, and a library for tracing perf events in processes.

Installation

There are three ways to install Perforator.

  1. Download the prebuilt binary from the releases page.

  2. Install from source:

git clone https://github.com/zyedidia/perforator
cd perforator
make build # or make install to install to $GOBIN
  1. Install with go get (version info will be missing):
go get github.com/zyedidia/perforator/cmd/perforator

Usage

First make sure that you have the appropriate permissions to record the events you are interested in (this may require running Perforator with sudo or modifying /proc/sys/kernel/perf_event_paranoid -- see this post). If Perforator still can't find any events, double check that your system supports the perf_event_open system call (try installing the perf tool from the Linux kernel).

Example

Suppose we had a C function that summed an array and wanted to benchmark it for some large array of numbers. We could write a small benchmark program like so:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <stdint.h>

#define SIZE 10000000

uint64_t sum(uint32_t* numbers) {
    uint64_t sum = 0;
    for (int i = 0; i < SIZE; i++) {
        sum += numbers[i];
    }
    return sum;
}

int main() {
    srand(time(NULL));
    uint32_t* numbers = malloc(SIZE * sizeof(uint32_t));
    for (int i = 0; i < SIZE; i++) {
        numbers[i] = rand();
    }

    uint64_t result = sum(numbers);
    printf("%lu\n", result);
    return 0;
}

If we want to determine the number of cache misses, branch mispredictions, etc... perf is not suitable because running perf stat on this program will profile the creation of the array in addition to the sum. With Perforator, we can measure just the sum.

Profiling functions

First compile with

$ gcc -g -O2 -o bench bench.c

Now we can measure with Perforator:

$ perforator -r sum ./bench
+---------------------+-------------+
| Event               | Count (sum) |
+---------------------+-------------+
| instructions        | 50000004    |
| branch-instructions | 10000002    |
| branch-misses       | 10          |
| cache-references    | 1246340     |
| cache-misses        | 14984       |
| time-elapsed        | 4.144814ms  |
+---------------------+-------------+
10736533065142551

Results are printed immediately when the profiled function returns.

Note: in this case we compiled with -g to include DWARF debugging information. This was necessary because GCC will inline the call to sum, so Perforator needs to be able to read the DWARF information to determine where it was inlined to. If you compile without -g make sure the target function is not being inlined (either you know it is not inlined, or you mark it with the noinline attribute).

Fun fact: clang does a better job optimizing this code than gcc. I tried running this example with clang instead and found it only had 1,250,000 branch instructions (roughly 8x fewer than gcc!). The reason: vector instructions.

Events

By default, Perforator will measure some basic events such as instructions executed, cache references, cache misses, branches, branch misses. You can specify events yourself with the -e flag:

$ perforator -e l1d-read-accesses,l1d-read-misses -r sum ./bench
+-------------------+-------------+
| Event             | Count (sum) |
+-------------------+-------------+
| l1d-read-accesses | 10010311    |
| l1d-read-misses   | 625399      |
| time-elapsed      | 4.501523ms  |
+-------------------+-------------+
10736888439771461

To view available events, use the --list flag:

$ perforator --list hardware # List hardware events
$ perforator --list software # List software events
$ perforator --list cache    # List cache events
$ perforator --list trace    # List kernel trace events

Detailed documentation for each event is available in the manual page for Perforator. See the perforator.1 manual included with the prebuilt binary. The man directory in the source code contains the Markdown source, which can be compiled using Pandoc (via make perforator.1).

Source code regions

In additional to profiling functions, you may profile regions specified by source code ranges if your binary has DWARF debugging information.

$ perforator -r bench.c:18-bench.c:23 ./bench
+---------------------+-------------------------------+
| Event               | Count (bench.c:18-bench.c:23) |
+---------------------+-------------------------------+
| instructions        | 668794280                     |
| branch-instructions | 169061639                     |
| branch-misses       | 335360                        |
| cache-references    | 945581                        |
| cache-misses        | 3569                          |
| time-elapsed        | 78.433272ms                   |
+---------------------+-------------------------------+
10737167007294257

Only certain line numbers are available for breakpoints. The range is exclusive on the upper bound, meaning that in the example above bench.c:23 is not included in profiling.

You may also directly specify addresses as decimal or hexadecimal numbers. This is useful if you don't have DWARF information but you know the addresses you want to profile (for example, by inspecting the disassembly via objdump).

Multiple regions

You can also profile multiple regions at once:

$ perforator -r bench.c:18-bench.c:23 -r sum -r main ./bench
+---------------------+-------------------------------+
| Event               | Count (bench.c:18-bench.c:23) |
+---------------------+-------------------------------+
| instructions        | 697120715                     |
| branch-instructions | 162949718                     |
| branch-misses       | 302849                        |
| cache-references    | 823087                        |
| cache-misses        | 3645                          |
| time-elapsed        | 78.832332ms                   |
+---------------------+-------------------------------+
+---------------------+-------------+
| Event               | Count (sum) |
+---------------------+-------------+
| instructions        | 49802557    |
| branch-instructions | 10000002    |
| branch-misses       | 9           |
| cache-references    | 1246639     |
| cache-misses        | 14382       |
| time-elapsed        | 4.235705ms  |
+---------------------+-------------+
10739785644063349
+---------------------+--------------+
| Event               | Count (main) |
+---------------------+--------------+
| instructions        | 675150939    |
| branch-instructions | 184259174    |
| branch-misses       | 386503       |
| cache-references    | 1128637      |
| cache-misses        | 8368         |
| time-elapsed        | 83.132829ms  |
+---------------------+--------------+

In this case, it may be useful to use the --summary option, which will aggregate all results into a table that is printed when tracing stops.

$ perforator --summary -r bench.c:19-bench.c:24 -r sum -r main ./bench
10732787118410148
+-----------------------+--------------+---------------------+---------------+------------------+--------------+--------------+
| region                | instructions | branch-instructions | branch-misses | cache-references | cache-misses | time-elapsed |
+-----------------------+--------------+---------------------+---------------+------------------+--------------+--------------+
| bench.c:18-bench.c:23 | 718946520    | 172546336           | 326000        | 833098           | 3616         | 81.798381ms  |
| main                  | 678365328    | 174259806           | 363737        | 1115394          | 4403         | 86.321344ms  |
| sum                   | 43719896     | 10000002            | 9             | 1248069          | 16931        | 4.453342ms   |
+-----------------------+--------------+---------------------+---------------+------------------+--------------+--------------+

You can use the --sort-key and --reverse-sort options to modify which columns are sorted and how. In addition, you can use the --csv option to write the output table in CSV form.

Note: to an astute observer, the results from the above table don't look very accurate. In particular the totals for the main function seem questionable. This is due to event multiplexing (explained more below), and for best results you should not profile multiple regions simultaneously. In the table above, you can see that it's likely that profiling for main was disabled while sum was running.

Groups

The CPU has a fixed number of performance counters. If you try recording more events than there are counters, "multiplexing" will be performed to estimate the totals for all the events. For example, if we record 6 events on the sum benchmark, the instruction count becomes less stable. This is because the number of events now exceeds the number of hardware registers for counting, and multiplexing occurs. To ensure that certain events are always counted together, you can put them all in a group with the -g option. The -g option has the same syntax as the -e option, but may be specified multiple times (for multiple groups).

Notes and caveats

  • Tip: enable verbose mode with the -V flag when you are not seeing the expected result.
  • Many CPUs expose additional/non-standardized raw perf events. Perforator does not currently support those events.
  • Perforator has only limited support for multithreaded programs. It supports profiling programs with multiple threads as long as it is the case that each profiled region is only run by one thread (ever). In addition, the beginning and end of a region must be run by the same thread. This means if you are benchmarking Go you should call runtime.LockOSThread in your benchmark to prevent a goroutine migration while profiling.
  • A region is either active or inactive, it cannot be active multiple times at once. This means for recursive functions only the first invocation of the function is tracked.
  • Be careful of multiplexing, which occurs when you are trying to record more events than there are hardware counter registers. In particular, if you profile a function inside of another function being profiled, this will likely result in multiplexing and possibly incorrect counts. Perforator will automatically attempt to scale counts when multiplexing occurs. To see if this has happened, use the -V flag, which will print information when multiplexing is detected.
  • Be careful if your target functions are being inlined. Perforator will automatically attempt to read DWARF information to determine the inline sites for target functions but it's a good idea to double check if you are seeing weird results. Use the -V flag to see where Perforator thinks the inline site is.

How it works

Perforator uses ptrace to trace the target program and enable profiling for certain parts of the target program. Perforator places the 0xCC "interrupt" instruction at the beginning of the profiled function which allows it to regain control when the function is executed. At that point, Perforator will place the original code back (whatever was initially overwritten by the interrupt byte), determine the return address by reading the top of the stack, and place an interrupt byte at that address. Then Perforator will enable profiling and resume the target process. When the next interrupt happens, the target will have reached the return address and Perforator can stop profiling, remove the interrupt, and place a new interrupt back at the start of the function.

Issues
  • Add flag to not sort output

    Add flag to not sort output

    I've added a flag, --no-sort, which will disable the sorting of the summary table.

    I opted for the inverse (a flag to disable something) as it's non-breaking.

    opened by AlexGustafsson 5
  • Thank you

    Thank you

    This is no issue per say, but I'd just like to sincerely thank you for providing this tool. As I'm sure you've experienced first hand, using perf_event_open and the related tooling can be a big hassle. I've spent countless hours implementing perf_event_open in C for micro-benchmarks and as many hours reading up on the perf tool. The custom code I wrote worked great, but it was cumbersome to manually instrument the source code. The perf tool is great - but it lacks one important factor; the ability to run perf stat for a specific function that may be somewhere along the lifecycle of the program. Sure, perf record provides similar functionality, but it is not necessarily easy to convert the measurements to actual values. Your tool solves all of this, and for that I'm truly grateful.

    opened by AlexGustafsson 2
  • Add a flag to ignore missing regions

    Add a flag to ignore missing regions

    I've added a flag, --ignore-missing-regions, which will allow perforator to continue running even though it's unable to find a region.

    The change is non-breaking.

    The use case is times where you compare a non-optimized reference implementation with an optimized (-O3 or the like) implementation. In the latter case, GCC and friends may happily inline functions that are only used a single time. This results in the binary (symbol table) not containing the symbol at all - since it's no longer a function per say.

    With this change, one may use --ignore-missing-regions to run both binaries with the same perforator command, making it more useful in scripts where it may be unknown if the target binary lacks a specific region.

    opened by AlexGustafsson 2
  • Where is trace repo

    Where is trace repo

    hello, @zyedidia . I would like to try perforator tool. go get depends on utrace repo, but seems utrace absent. Is this a necessary dependency or is there an alternative?

    opened by LilyaBu 1
  • Analyzing a hand-written assembly file

    Analyzing a hand-written assembly file

    Note: this is not an issue regarding the software, rather information on how to use it with hand-written assembly files. That way others may more easily find the information.

    When a compiler creates an ELF binary used in Linux, it stores information about its symbols (functions etc.) - such as its name, size and type. In hand-written assembly, this information is usually lacking, which means perforator will be unable to properly identify the symbols.

    Let's say we have an assembly file poly_Rq_mul.s containing the following code:

    global poly_Rq_mul
    .global _poly_Rq_mul
    poly_Rq_mul:
    _poly_Rq_mul:
    ...
    mov %r8, %rsp
    pop %r12
    ret
    

    What we want to do is A; tell the compiler that poly_Rq_mul is a function and B; tell the compiler the size of poly_Rq_mul.

    We do that by first, adding a section after the .global definitions as follows:

    .global poly_Rq_mul
    .global _poly_Rq_mul
    +#ifdef __linux__
    +.type poly_Rq_mul, @function
    +.type _poly_Rq_mul, @function
    +#endif
    poly_Rq_mul:
    _poly_Rq_mul:
    

    After the end of a function, add these symbols to specify the size.

    mov %r8, %rsp
    pop %r12
    ret
    +#ifdef __linux__
    +.poly_Rq_mul_end:
    +.size poly_Rq_mul, .-poly_Rq_mul
    +.size _poly_Rq_mul, .-_poly_Rq_mul
    +#endif
    

    Lastly, we want to rename the file so that it's file extension is .S instead of .s as GCC (and I assume other compilers) will correctly invoke the preprocessor to deal with the ifdef. That way, the same assembly file may be compiled on Linux, macOS and more. If you are certain that the code will only be run on Linux, simply ignore the #ifdef __linux__ and #endif, as well as the file's extension.

    opened by AlexGustafsson 1
  • is it possible to attach to PID?

    is it possible to attach to PID?

    Hello, @zyedidia. Is it possible to attach to PID?

    opened by LilyaBu 3
Releases(v0.4.1)
Owner
Zachary Yedidia
Zachary Yedidia
gProfiler combines multiple sampling profilers to produce unified visualization of what your CPU

gProfiler combines multiple sampling profilers to produce unified visualization of what your CPU is spending time on, displaying stack traces of your processes across native programs1 (includes Golang), Java and Python runtimes, and kernel routines.

Granulate 332 Jul 21, 2021
Creates Prometheus Metrics for PolicyReports and ClusterPolicyReports. It also sends PolicyReportResults to different Targets like Grafana Loki or Slack

PolicyReporter Motivation Kyverno ships with two types of validation. You can either enforce a rule or audit it. If you don't want to block developers

Frank Jogeleit 57 Jul 15, 2021
A super simple Lodash like utility library with essential functions that empowers the development in Go

A simple Utility library for Go Go does not provide many essential built in functions when it comes to the data structure such as slice and map. This

Rahul Baruri 81 Jul 17, 2021
go-sundheit:A library built to provide support for defining service health for golang services

A library built to provide support for defining service health for golang services. It allows you to register async health checks for your dependencies and the service itself, and provides a health endpoint that exposes their status.

AppsFlyer 419 Jul 23, 2021
perfessor - Continuous Profiling Sidecar

perfessor - Continuous Profiling Sidecar About Perfessor is a continuous profiling agent that can profile running programs using perf It then converts

null 43 Jun 2, 2021
psutil for golang

gopsutil: psutil for golang This is a port of psutil (https://github.com/giampaolo/psutil). The challenge is porting all psutil functions on some arch

shirou 6.5k Jul 24, 2021
GoWrap is a command line tool for generating decorators for Go interfaces

GoWrap GoWrap is a command line tool that generates decorators for Go interface types using simple templates. With GoWrap you can easily add metrics,

Max Chechel 524 Jul 19, 2021
Cross-platform Bluetooth API for Go and TinyGo.

Go Bluetooth is a cross-platform package for using Bluetooth Low Energy hardware from the Go programming language.

TinyGo 283 Jul 21, 2021
提供能简单处理服务熔断逻辑的工具包。

circuit 简介 提供能简单处理服务熔断逻辑的工具包。

Rui Zhong 3 Jul 19, 2021
Goridge is high performance PHP-to-Golang codec library which works over native PHP sockets and Golang net/rpc package.

Goridge is high performance PHP-to-Golang codec library which works over native PHP sockets and Golang net/rpc package. The library allows you to call Go service methods from PHP with a minimal footprint, structures and []byte support.

Spiral Scout 1k Jul 26, 2021
ssdt - Survey security.txt files

ssdt - Survey security.txt files A program to quickly survey security.txt files found on the Alexa Top 1 Million websites. The program takes about 15

null 78 Jul 9, 2021
.NET LINQ capabilities in Go

go-linq A powerful language integrated query (LINQ) library for Go. Written in vanilla Go, no dependencies! Complete lazy evaluation with iterator pat

Ahmet Alp Balkan 2.6k Jul 24, 2021
A simple thread-safe, fixed size LRU written in Go. Based on dominictarr's Hashlru Algorithm. 🔃

go-hashlru A simple thread-safe, fixed size LRU written in Go. Based on dominictarr's Hashlru Algorithm. ?? Uses map[interface{}]interface{} to allow

Saurabh Pujari 66 Jul 11, 2021
Helpfully Functional Go like underscore.js

/\ \ __ __ ___ \_\ \ __ _ __ ____ ___ ___ _ __ __ __ __

null 342 Jul 14, 2021