OpenCSD - CoreSight Trace Decode Library  0.10.0
AutoFDO and ARM Trace {#AutoFDO}
=====================

@brief Using CoreSight trace and perf with OpenCSD for AutoFDO.

## Introduction

Feedback directed optimization (FDO, also known as profile guided
optimization - PGO) uses a profile of a program's execution to guide the
optimizations performed by the compiler. Traditionally, this involves
building an instrumented version of the program, which records a profile of
execution as it runs. The instrumentation adds significant runtime
overhead, possibly changing the behaviour of the program, and it may not be
possible to run the instrumented program in a production environment
(e.g. where performance criteria must be met).

AutoFDO uses facilities in the hardware to sample the behaviour of the
program in the production environment and generate the execution profile.
An improved profile can be obtained by including the branch history
(i.e. a record of the most recent branches taken) when generating
instruction samples. On Arm systems, the ETM (Embedded Trace Macrocell)
can be used to generate such records.

The process can be broken down into the following steps:

* Record execution trace of the program
* Convert the execution trace to instruction samples with branch histories
* Convert the instruction samples to source level profiles
* Use the source level profile with the compiler

This article describes how to enable ETM trace on Arm targets running Linux
and use the ETM trace to generate AutoFDO profiles and compile an optimized
program.


## Execution trace on Arm targets

Debug and trace of Arm targets is provided by CoreSight. This consists of
a set of components that allow access to debug logic, record (trace) the
execution of a processor and route this data through the system, collecting
it into a store.

To record the execution of a processor, we require the following
components:

* A trace source. The core contains a trace unit, called an ETM, that emits
  data describing the instructions executed by the core.
* Trace links. The trace data generated by the ETM must be moved through
  the system to the component that collects the data (sink). Links
  include:
  * Funnels: merge multiple streams of data
  * FIFOs: buffer data to smooth out bursts
  * Replicators: send a stream of data to multiple components
* Sinks. These receive the trace data and store it or send it to an
  external device:
  * ETB: A small circular buffer (64-128 kilobytes) that stores the most
    recent data
  * ETR: A larger (several megabytes) buffer that uses system RAM to
    store data
  * TPIU: Sends data to an off-chip capture device (e.g. Arm DSTREAM)

Each Arm SoC design may have a different layout (topology) of components.
This topology is described to the OS drivers by the platform's devicetree
or (in future) ACPI firmware.

For application profiling, we need to store several megabytes of data
within the system, so we will use the ETR, with the capture tool (perf)
periodically draining the buffer to a file.

Even though we have a large capture buffer, the ETM can still generate a
lot of data very quickly - typically an ETM will generate ~1 bit of data
per instruction (depending on the workload), which results in roughly
250Mbytes per second for a core running at 2GHz and executing about one
instruction per cycle. This leads to problems storing and decoding such
large volumes of data. AutoFDO uses samples of program execution, so we can
avoid this problem by using the ETM's features to only record small slices
of execution - e.g. collect ~5000 cycles of data every 50M cycles. This
reduces the data rate to a manageable level - a few megabytes per minute.
This technique is known as 'strobing'.
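
As a rough check on these figures, the arithmetic can be sketched in shell.
The clock speed, trace rate and strobe settings come from the text above;
the figure of one instruction per cycle is an assumption made for this
sketch, so the results are order-of-magnitude estimates for a single core
only.

```shell
#!/bin/sh
# Rough ETM bandwidth estimate. All inputs are illustrative assumptions.
clock_hz=2000000000   # 2GHz core, assumed to run ~1 instruction per cycle
bits_per_insn=1       # ~1 bit of trace per instruction (workload dependent)
window=5000           # cycles traced in each strobe window
interval=50000000     # cycles from the start of one window to the next

# Continuous trace: bytes generated per second.
full_rate=$(( clock_hz * bits_per_insn / 8 ))

# Strobed trace: only window/interval of the cycles are traced.
strobed_per_min=$(( full_rate * window / interval * 60 ))

echo "continuous: $(( full_rate / 1000000 )) MB/s"
echo "strobed:    $(( strobed_per_min / 1000 )) kB/min"
```

With these inputs, strobing cuts the volume by a factor of 10000
(`interval / window`), which is what makes capture to a file practical.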


## Enabling trace

### Driver support

To collect ETM trace, the CoreSight drivers must be included in the
kernel. Some of the driver support is not yet included in the mainline
kernel and many targets are using older kernels. To enable CoreSight trace
on these targets, Arm have provided backports of the latest CoreSight
drivers and ETM strobing patch at:

  [TODO: link to git repos for CoreSight backports]

You can include these backports in your kernel by either merging the
appropriate branch using git or generating patches (using `git
format-patch`).

For 4.9 based kernels, use the `coresight-4.9-etr-etm_strobe` branch:

```
git merge coresight-4.9-etr-etm_strobe
```

or

```
git format-patch --output-directory /output/dir v4.9..coresight-4.9-etr-etm_strobe
cd my_kernel
git am /output/dir/*.patch # or, if not using git: cat /output/dir/*.patch | patch -p1
```

For 4.14 based kernels, use the `coresight-4.14-etm_strobe` branch:

```
git merge coresight-4.14-etm_strobe
```

or

```
git format-patch --output-directory /output/dir v4.14..coresight-4.14-etm_strobe
cd my_kernel
git am /output/dir/*.patch # or, if not using git: cat /output/dir/*.patch | patch -p1
```

The CoreSight trace drivers must also be enabled in the kernel
configuration. This can be done using the configuration menu (`make
menuconfig`), selecting `Kernel hacking` / `CoreSight Tracing Support` and
enabling all options, or by setting the following in the configuration
file:

```
CONFIG_CORESIGHT=y
CONFIG_CORESIGHT_LINK_AND_SINK_TMC=y
CONFIG_CORESIGHT_SINK_TPIU=y
CONFIG_CORESIGHT_SOURCE_ETM4X=y
CONFIG_CORESIGHT_DYNAMIC_REPLICATOR=y
CONFIG_CORESIGHT_STM=y
CONFIG_CORESIGHT_CATU=y
```

Compile the kernel for your target in the usual way, e.g.

```
make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu-
```

Each target may have a different layout of CoreSight components. To
collect trace into a sink, the kernel drivers need to know which other
devices need to be configured to route data from the source to the sink.
This is described in the devicetree (and in future, the ACPI tables). The
devicetree defines which CoreSight devices are present in the system,
where they are located and how they are connected together. The devicetree
for some platforms includes a description of the platform's CoreSight
components, but in other cases you may have to ask the platform/SoC vendor
to supply it or create it yourself (see Appendix: Describing CoreSight in
Devicetree).

Once the target has been booted with the devicetree describing the
CoreSight devices, you should find the devices in sysfs:

```
# ls /sys/bus/coresight/devices/
28440000.etm 28540000.etm 28640000.etm 28740000.etm
28c03000.funnel 28c04000.etf 28c05000.replicator 28c06000.etr
28c07000.tpiu
```
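
Scripts that drive trace capture usually need the name of a sink or the
list of ETMs. The following is a minimal sketch, assuming the
`<address>.<type>` device naming shown in the listing (other kernel
versions may name the devices differently). `list_coresight_devices` is a
hypothetical helper, not part of the kernel or perf.

```shell
#!/bin/sh
# Print the CoreSight devices of one kind, one per line.
#   $1: device kind, matched against the name suffix (e.g. "etr", "etm")
#   $2: sysfs directory to search (defaults to the path shown above)
list_coresight_devices() (
    kind=$1
    dir=${2:-/sys/bus/coresight/devices}
    for dev in "$dir"/*."$kind"; do
        # If nothing matches, the pattern is left unexpanded: skip it.
        [ -e "$dev" ] || continue
        basename "$dev"
    done
)
```

On the example target, `list_coresight_devices etr` would print
`28c06000.etr`, the name passed to perf as the sink.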

### Perf tools

The perf tool is used to capture execution trace, configuring the trace
sources to generate trace, routing the data to the sink and collecting the
data from the sink.

Arm recommends using the perf version corresponding to the kernel running
on the target. This can be built from the same kernel sources with:

```
make -C tools/perf ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu-
```

If the post-processing (`perf inject`) of the captured data is not being
done on the target, then the OpenCSD library is not required for this build
of perf.

Trace is captured by collecting the `cs_etm` event from perf. The sink
to collect data into is specified as a parameter of this event. Trace can
also be restricted to user space or kernel space with the `u` or `k`
parameters. For example:

```
perf record -e cs_etm/@28c06000.etr/u --per-thread -- /bin/ls
```

This records the userspace execution of `/bin/ls` into the ETR located at
0x28c06000. Note that the `--per-thread` option is required - perf currently
only supports trace of a single thread of execution. CPU wide trace is a
work in progress.


## Processing trace and profiles

perf is also used to convert the execution trace to an instruction profile.
This requires a different build of perf, using the version of perf from
Linux v4.17 or later, as the trace processing code isn't included in the
driver backports. Trace decode is provided by the OpenCSD library
(<https://github.com/Linaro/OpenCSD>), v0.9.1 or later. This is packaged
for Debian testing (install the libopencsd0, libopencsd-dev packages) or
can be compiled from source and installed.

The AutoFDO tool <https://github.com/google/autofdo> is used to convert the
instruction profiles to source profiles for the GCC and clang/llvm
compilers.


## Recording and profiling

Once trace collection using perf is working, we can use it to profile
an application.

The application must be compiled to include sufficient debug information to
map instructions back to source lines. For GCC, use the `-g1` or `-gmlt`
options. For clang/llvm, also add the `-fdebug-info-for-profiling` option.

perf identifies the active program or library using the build identifier
stored in the ELF file. This should be added at link time with the compiler
flag `-Wl,--build-id=sha1`.

The next step is to record the execution trace of the application using the
perf tool. The ETM strobing should be configured before running the perf
tool. There are two parameters:

 * window size: A number of CPU cycles (W)
 * period: Trace is enabled for W cycles every _period_ * W cycles.

For example, a typical configuration is to use a window size of 5000 cycles
and a period of 10000 - this will collect 5000 cycles of trace every 50M
cycles. With these proof-of-concept patches, the strobe parameters are
configured via sysfs - each ETM will have `strobe_window` and
`strobe_period` parameters in `/sys/bus/coresight/devices/NNNNNNNN.etm` and
these values will have to be written to each (in a future version, this
will be integrated into the drivers and perf tool). The attached `record.sh`
(TODO: attach!) script automates this process.
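
Since `record.sh` is not included here, the following is only a sketch of
the sysfs step, assuming that the `strobe_window` and `strobe_period`
attributes described above are present for every ETM. `set_etm_strobe` is
a hypothetical helper name and is not taken from `record.sh`.

```shell
#!/bin/sh
# Write the strobe parameters to every ETM found under a sysfs directory.
#   $1: window size in CPU cycles
#   $2: period (trace is enabled for one window every period * window cycles)
#   $3: device directory (defaults to the CoreSight sysfs path)
set_etm_strobe() (
    window=$1
    period=$2
    dir=${3:-/sys/bus/coresight/devices}
    for etm in "$dir"/*.etm; do
        [ -d "$etm" ] || continue   # no ETMs matched the pattern
        echo "$window" > "$etm/strobe_window" || exit 1
        echo "$period" > "$etm/strobe_period" || exit 1
    done
)
```

On a target this would typically be run as root before starting perf, e.g.
`set_etm_strobe 5000 10000` for the configuration described above.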

To collect trace from an application using ETM strobing, run:

```
taskset -c 0 ./record.sh --strobe 5000 10000 28c06000.etr ./my_application arg1 arg2
```

The taskset command is used to ensure the process stays on the same CPU
during execution.

The raw trace can be examined using the `perf report` command:

```
perf report -D -i perf.data --stdio
```

For example:

```
0x1d370 [0x30]: PERF_RECORD_AUXTRACE size: 0x2003c0 offset: 0 ref: 0x39ba881d145f8639 idx: 0 tid: 4551 cpu: -1

. ... CoreSight ETM Trace data: size 2098112 bytes
 Idx:0; ID:12; I_ASYNC : Alignment Synchronisation.
 Idx:12; ID:12; I_TRACE_INFO : Trace Info.; INFO=0x0
 Idx:17; ID:12; I_ADDR_L_64IS0 : Address, Long, 64 bit, IS0.; Addr=0xFFFF000008A4991C;
 Idx:48; ID:14; I_ASYNC : Alignment Synchronisation.
 Idx:60; ID:14; I_TRACE_INFO : Trace Info.; INFO=0x0
 Idx:65; ID:14; I_ADDR_L_64IS0 : Address, Long, 64 bit, IS0.; Addr=0xFFFF000008A4991C;
 Idx:96; ID:14; I_ASYNC : Alignment Synchronisation.
 Idx:108; ID:14; I_TRACE_INFO : Trace Info.; INFO=0x0
 Idx:113; ID:14; I_ADDR_L_64IS0 : Address, Long, 64 bit, IS0.; Addr=0xFFFF000008A4991C;
 Idx:122; ID:14; I_TRACE_ON : Trace On.
 Idx:123; ID:14; I_ADDR_CTXT_L_64IS0 : Address & Context, Long, 64 bit, IS0.; Addr=0x0000000000407B00; Ctxt: AArch64,EL0, NS;
 Idx:134; ID:14; I_ATOM_F3 : Atom format 3.; ENN
 Idx:135; ID:14; I_ATOM_F5 : Atom format 5.; NENEN
 Idx:136; ID:14; I_ATOM_F5 : Atom format 5.; ENENE
 Idx:137; ID:14; I_ATOM_F5 : Atom format 5.; NENEN
 Idx:138; ID:14; I_ATOM_F3 : Atom format 3.; ENN
 Idx:139; ID:14; I_ATOM_F3 : Atom format 3.; NNE
 Idx:140; ID:14; I_ATOM_F1 : Atom format 1.; E
.....
```

The execution trace is then converted to an instruction profile using
the perf build with trace decode support. This may be done on a different
machine from the one that collected the trace (e.g. when cross compiling for
an embedded target). The `perf inject` command
decodes the execution trace and generates periodic instruction samples,
with branch histories:

```
perf inject -i perf.data -o inj.data --itrace=i100000il
```

The `--itrace` option configures the instruction sample behaviour:

* `i100000i` generates an instruction sample every 100000 instructions
  (only instruction count periods are currently supported, future versions
  may support time or cycle count periods)
* `l` includes the branch histories on each sample
* `b` generates a sample on each branch (not used here)
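
When scripting the conversion, the option string can be assembled from its
parts. This is a small sketch that uses only the instruction-count period
and the `l` and `b` flags described above; `itrace_option` is a
hypothetical helper, not a perf feature.

```shell
#!/bin/sh
# Build an --itrace option for "perf inject".
#   $1: number of instructions between samples
#   $2: extra flags, e.g. "l" (branch histories) or "b" (branch samples)
itrace_option() (
    period=$1
    flags=$2
    case $period in
        ''|*[!0-9]*)
            echo "itrace_option: period must be a whole number" >&2
            exit 1
            ;;
    esac
    # "i<N>i" means one sample every N instructions.
    printf '%s\n' "--itrace=i${period}i${flags}"
)
```

For instance, `perf inject -i perf.data -o inj.data "$(itrace_option 100000 l)"`
passes the same option as the `perf inject` command given earlier.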

Perf requires the original program binaries to decode the execution trace.
If the `inject` command is run on a different system from the one the trace
was captured on, then the binary and any shared libraries must be added to
perf's cache with:

```
perf buildid-cache -a /path/to/binary_or_library
```
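
When an application uses several shared libraries, the command has to be
repeated for each file. The loop below is one way to do that.
`cache_binaries` and the `PERF` override variable are illustrative names
introduced for this sketch; only `perf buildid-cache -a` comes from the
text above.

```shell
#!/bin/sh
# Add each listed binary or shared library to perf's build-id cache.
# Set PERF=echo to preview the commands instead of running perf.
cache_binaries() {
    for bin in "$@"; do
        if [ ! -f "$bin" ]; then
            echo "skipping $bin: not a regular file" >&2
            continue
        fi
        "${PERF:-perf}" buildid-cache -a "$bin" || return 1
    done
}
```

A call such as `cache_binaries ./my_application ./libfoo.so` (hypothetical
file names) should be run on the machine that runs `perf inject`, using
copies of the exact files that ran on the target.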

`perf report` can also be used to show the instruction samples:

```
perf report -D -i inj.data --stdio
.......
0x1528 [0x630]: PERF_RECORD_SAMPLE(IP, 0x2): 4551/4551: 0x434b98 period: 3093 addr: 0
... branch stack: nr:64
..... 0: 0000000000434b58 -> 0000000000434b68 0 cycles P 0
..... 1: 0000000000436a88 -> 0000000000434b4c 0 cycles P 0
..... 2: 0000000000436a64 -> 0000000000436a78 0 cycles P 0
..... 3: 00000000004369d0 -> 0000000000436a60 0 cycles P 0
..... 4: 000000000043693c -> 00000000004369cc 0 cycles P 0
..... 5: 00000000004368a8 -> 0000000000436928 0 cycles P 0
..... 6: 000000000042d070 -> 00000000004368a8 0 cycles P 0
..... 7: 000000000042d108 -> 000000000042d070 0 cycles P 0
.......
..... 57: 0000000000448ee0 -> 0000000000448f24 0 cycles P 0
..... 58: 0000000000448ea4 -> 0000000000448ebc 0 cycles P 0
..... 59: 0000000000448e20 -> 0000000000448e94 0 cycles P 0
..... 60: 0000000000448da8 -> 0000000000448ddc 0 cycles P 0
..... 61: 00000000004486f4 -> 0000000000448da8 0 cycles P 0
..... 62: 00000000004480fc -> 00000000004486d4 0 cycles P 0
..... 63: 0000000000448658 -> 00000000004480ec 0 cycles P 0
 ... thread: program1:4551
 ...... dso: /home/root/program1
.......
```

The instruction samples produced by `perf inject` are then passed to the
autofdo tool to generate source level profiles for the compiler. For
clang/LLVM:

```
create_llvm_prof -binary=/path/to/binary -profile=inj.data -out=program.llvmprof
```

And for GCC:

```
create_gcov -binary=/path/to/binary -profile=inj.data -gcov_version=1 -gcov=program.gcov
```

The profiles can be viewed with:

```
llvm-profdata show -sample program.llvmprof
```

Or, for GCC:

```
dump_gcov -gcov_version=1 program.gcov
```

## Using profile in the compiler

The profile produced by the above steps can then be passed to the compiler
to optimize the next build of the program.

For GCC, use the `-fauto-profile` option:

```
gcc -O2 -fauto-profile=program.gcov -o program program.c
```

For Clang, use the `-fprofile-sample-use` option:

```
clang -O2 -fprofile-sample-use=program.llvmprof -o program program.c
```


## Summary

The basic commands to run an application and create a compiler profile are:

```
taskset -c 0 ./record.sh --strobe 5000 10000 28c06000.etr ./my_application arg1 arg2
perf inject -i perf.data -o inj.data --itrace=i100000il
create_llvm_prof -binary=/path/to/binary -profile=inj.data -out=program.llvmprof
```

For GCC, use `create_gcov` in place of `create_llvm_prof`.


## References

* AutoFDO tool: <https://github.com/google/autofdo>
  * Build fix: <https://github.com/google/autofdo/pull/73>
* GCC's wiki on autofdo: <https://gcc.gnu.org/wiki/AutoFDO>, <https://gcc.gnu.org/wiki/AutoFDO/Tutorial>
* Google paper: <https://ai.google/research/pubs/pub45290>
* CoreSight kernel docs: Documentation/trace/coresight.txt

## Troubleshooting

TODO:

* Record simple program (e.g. /bin/ls)
  * examine raw trace - look for overflows, corruption
* Check no errors reported
  * mmap error indicates no route from source to sink - bad device tree
  * try nearer sink
  * data loss warning - bandwidth problems
  * What if data loss is reported?
  * Don't worry - strobing


## Appendix: Describing CoreSight in Devicetree


Each component has an entry in the devicetree that describes its:

* type: The `compatible` field defines which driver to use
* location: A `reg` field defines the component's address and size on the bus
* clocks: The `clocks` and `clock-names` fields state which clock provides
  the `apb_pclk` clock.
* connections to other components: `port` and `ports` fields link the
  component to ports of other components

To create the devicetree, some information about the platform is required:

* The memory address of the CoreSight components. This is the address in
  the CPU's address space where the CPU can access each CoreSight
  component.
* The connections between the components.

This information can be found in the SoC's reference manual or you may need
to ask the platform/SoC vendor to supply it.

An ETMv4 source is declared with a section like this:

```
etm0: etm@22040000 {
    compatible = "arm,coresight-etm4x", "arm,primecell";
    reg = <0 0x22040000 0 0x1000>;

    cpu = <&A72_0>;
    clocks = <&soc_smc50mhz>;
    clock-names = "apb_pclk";
    port {
        cluster0_etm0_out_port: endpoint {
            remote-endpoint = <&cluster0_funnel_in_port0>;
        };
    };
};
```

This describes an ETMv4 attached to core A72_0, located at 0x22040000, with
its output linked to port 0 of a funnel. The funnel is described with:

```
funnel@220c0000 { /* cluster0 funnel */
    compatible = "arm,coresight-funnel", "arm,primecell";
    reg = <0 0x220c0000 0 0x1000>;

    clocks = <&soc_smc50mhz>;
    clock-names = "apb_pclk";
    power-domains = <&scpi_devpd 0>;
    ports {
        #address-cells = <1>;
        #size-cells = <0>;

        port@0 {
            reg = <0>;
            cluster0_funnel_out_port: endpoint {
                remote-endpoint = <&main_funnel_in_port0>;
            };
        };

        port@1 {
            reg = <0>;
            cluster0_funnel_in_port0: endpoint {
                slave-mode;
                remote-endpoint = <&cluster0_etm0_out_port>;
            };
        };

        port@2 {
            reg = <1>;
            cluster0_funnel_in_port1: endpoint {
                slave-mode;
                remote-endpoint = <&cluster0_etm1_out_port>;
            };
        };
    };
};
```

This describes a funnel located at 0x220c0000, receiving data from 2 ETMs
and sending the merged data to another funnel. We continue describing
components with similar blocks until we reach the sink (an ETR):

```
etr@20070000 {
    compatible = "arm,coresight-tmc", "arm,primecell";
    reg = <0 0x20070000 0 0x1000>;
    iommus = <&smmu_etr 0>;

    clocks = <&soc_smc50mhz>;
    clock-names = "apb_pclk";
    power-domains = <&scpi_devpd 0>;
    port {
        etr_in_port: endpoint {
            slave-mode;
            remote-endpoint = <&replicator_out_port1>;
        };
    };
};
```

Full descriptions of the properties of each component can be found in the
Linux source at Documentation/devicetree/bindings/arm/coresight.txt.
The Arm Juno platform's devicetree (arch/arm64/boot/dts/arm) provides an
example CoreSight description.

Many systems include a TPIU for off-chip trace. While this isn't required
for self-hosted trace, it should still be included in the devicetree. This
allows the drivers to access it and ensure it is put into a disabled state;
otherwise it may limit the trace bandwidth, causing data loss.