Enhance the scheduler workload recording/replaying subsystem based on the perf tool

Registered by Dmitry Antipov on 2012-05-14

Perf is the performance counters subsystem in Linux. Performance counters are usually CPU hardware registers that count hardware events such as instructions executed, cache misses suffered, or branches mispredicted. The subsystem was designed mainly as a basis for profiling applications, to trace dynamic control flow and identify hotspots. However, recording certain software events, such as context switches, fork() and exit() system calls, task wakeups, or task migrations from one core to another, creates an opportunity to replay those events in real time, thus simulating the workload that was running at the time of recording. Support for this kind of simulation is currently implemented by the 'record' and 'sched replay' perf commands. It works by collecting samples from all online CPUs (or from a specified subset of online CPUs) for the period during which the specified process is running, thus recording a trace of system-wide activity. A number of so-called 'mockup' threads are then started to mimic the workload based on the events in the trace. These threads replay the timings (CPU runtime and sleep patterns) of the workload as it occurred when it was recorded.
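
The mockup-thread idea can be illustrated with a small Python sketch. This is not how perf implements replay; the trace layout (a dict mapping a task name to a list of ("run" | "sleep", seconds) pairs) and the names burn_cpu, mock_thread and replay are hypothetical and chosen only for illustration.

```python
import threading
import time


def burn_cpu(seconds):
    """Busy-loop for roughly `seconds` of wall-clock time."""
    deadline = time.perf_counter() + seconds
    while time.perf_counter() < deadline:
        pass


def mock_thread(name, events, elapsed):
    """Replay one task's run/sleep pattern and store how long it took."""
    start = time.perf_counter()
    for kind, seconds in events:
        if kind == "run":
            burn_cpu(seconds)    # mimic CPU runtime
        elif kind == "sleep":
            time.sleep(seconds)  # mimic time spent off the CPU
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
    elapsed[name] = time.perf_counter() - start


def replay(trace):
    """Start one mock thread per recorded task; return elapsed time per task."""
    elapsed = {}
    threads = [
        threading.Thread(target=mock_thread, args=(name, events, elapsed))
        for name, events in trace.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return elapsed


if __name__ == "__main__":
    # Hypothetical two-task trace; durations are in seconds.
    trace = {
        "task-a": [("run", 0.01), ("sleep", 0.02), ("run", 0.01)],
        "task-b": [("sleep", 0.01), ("run", 0.02)],
    }
    for name, seconds in sorted(replay(trace).items()):
        print(f"{name}: {seconds:.3f}s")
```

Because the sketch measures wall-clock time rather than scheduler-accounted runtime, it only approximates the recorded timings.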

This system has one major disadvantage: it records and replays a system-wide workload, which creates 'nested system' side effects. For example, if a kernel thread such as ksoftirqd becomes active during trace recording, its events are accounted along with the events caused by the real workload. During replay, this ksoftirqd is 'replayed' in addition to the real ksoftirqd thread of the underlying system, so the kernel scheduler has to deal with one extra thread. In addition, the perf tool itself may be quite CPU-intensive (depending on how many counters are recorded, the speed of the underlying storage media, and so on), so the recorded data is affected by the recording subsystem's own overhead.
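
One conceivable way to limit the 'nested system' effect is to drop events of known kernel threads before replay. The Python sketch below shows the idea on an in-memory list only; the event layout (task name, kind, duration in nanoseconds) and the function drop_tasks are hypothetical and do not correspond to an existing perf option.

```python
def drop_tasks(events, excluded_prefixes=("ksoftirqd/",)):
    """Return the events whose task name does not start with an excluded prefix."""
    prefixes = tuple(excluded_prefixes)
    return [ev for ev in events if not ev[0].startswith(prefixes)]


if __name__ == "__main__":
    # Hypothetical recorded events: (task name, kind, duration in ns).
    recorded = [
        ("my_app", "run", 4_000_000),
        ("ksoftirqd/0", "run", 150_000),
        ("my_app", "sleep", 1_000_000),
    ]
    for ev in drop_tasks(recorded):
        print(ev)
```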

Another important issue is the perf record file format and its portability, at least between different systems of each major architecture (although a small subset of the trace may be portable without any problems). This is the most important usability limitation: it is highly desirable, for example, to exchange traces recorded on different systems with the same CPU core and see how core clock speed, memory speed/size, and other hardware parameters affect the scheduler. Note that even if perf.data itself is portable, it may be tricky to collect the other system information needed by perf (a vmlinux image and/or kallsyms may be easily provided, but not the whole contents of the /sys/kernel/debug/tracing directory).

Another interesting feature would be 'software scaling' of a recorded trace. For example, if the trace was recorded on a hardware emulator that is known to be ~10x slower than the real hardware, it should be possible to 'upscale' the trace by 10x, thus matching the 10x speedup from the emulated to the real hardware.
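
As a worked example of such scaling, the Python sketch below divides every recorded duration by the speedup factor. The event layout (kind, duration in nanoseconds) and the function scale_trace are hypothetical. Whether sleep periods should shrink together with CPU runtime is an open design question, so the sketch makes it a parameter.

```python
def scale_trace(events, speedup, scale_sleep=True):
    """Divide recorded durations by `speedup` (e.g. 10 for a 10x slower emulator)."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    scaled = []
    for kind, duration_ns in events:
        if kind == "run" or scale_sleep:
            duration_ns = int(round(duration_ns / speedup))
        scaled.append((kind, duration_ns))
    return scaled


if __name__ == "__main__":
    # Hypothetical trace recorded on an emulator that is ~10x slower.
    emulated = [("run", 50_000_000), ("sleep", 20_000_000)]
    print(scale_trace(emulated, speedup=10))
    print(scale_trace(emulated, speedup=10, scale_sleep=False))
```

With a speedup of 10, a 50 ms run period recorded on the emulator becomes a 5 ms run period in the scaled trace.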

The main perf development tree is hosted at git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux . To add it to the local git repository, use:

$ cd git/linux # or whichever name you chose
$ git remote add acme git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
$ git remote update acme
$ git checkout -b tmp.perf/core acme/tmp.perf/core

The latest perf bits now allow 'perf report' to work across different machines and even between different architectures (at least for ARM and x86). The patch to allow 'perf sched replay' to do the same is still on the way (see http://patches.linaro.org/9316/ for patch details).

Since Linux 3.6.0-rc1, this feature (including cross-replaying) is expected to work out of the box in the vanilla kernel tree. To encourage people to try this feature, there are 4 recorded samples, each taken from a 10-minute workload: https://wiki.linaro.org/WorkingGroups/PowerManagement/Doc/PerfRecordReplay.

Note that this feature doesn't work on Android, at least with 3.4x-based kernels. Android has its own version of perf, which doesn't match the kernel version. This causes a lot of problems (see http://lists.linaro.org/pipermail/linaro-dev/2012-August/013131.html). A patch that is expected to fix them is still on the way (see http://lists.linaro.org/pipermail/linaro-dev/2012-August/013134.html).

Blueprint information

Amit Kucheria
Dmitry Antipov
Series goal: Accepted for trunk
Milestone target: 2012.11
Started by: Amit Kucheria on 2012-06-18
Completed by: Amit Kucheria on 2012-11-28

[dzin Aug 22, 2012] Done as of 3.6-rc1 but not integrated into Linaro LEBs yet.
[dzin Sep 24, 2012] Works in 3.6 mainline kernels. Dmitry is on vacation and will confirm once he returns. Retarget to 12.10.
[dzin Nov 8, 2012] There is currently no resource available to work on this. Setting to blocked and moving to 12.11.
[dzin Nov 19, 2012] Still no resource for this blueprint; waiting for Amit to return.
[amitk Nov 28, 2012] Closing this blueprint, feature is in mainline and available now. Pending work items are only required if we start using this functionality extensively. Currently, we're awaiting the workload test suite to be delivered by ARM.

Pending Work items:
Develop basic perf verification tests for pm-qa: BLOCKED
Implement a kind of "selective replay" (e.g. the ability to filter some processes/threads out of the recorded set and thus replay only a subset of the recorded data): BLOCKED

Headline: TBD
Acceptance: TBD


Work Items

Work items for 2012.05:
Investigate portability of perf data format: DONE

Work items for 2012.06:
Make sure basic perf commands work with a data file recorded on another machine: DONE

Work items for 2012.07:
Add support for cross-platform record and playback: DONE

Work items for 2012.11:
Integrate perf improvements into linux-linaro: DONE
