LLVM Code Coverage Mapping Format

Introduction

LLVM’s code coverage mapping format is used to provide code coverage analysis using LLVM’s and Clang’s instrumentation-based profiling (Clang’s -fprofile-instr-generate option).

This document is aimed at those who would like to know how LLVM’s code coverage mapping works under the hood. Prior knowledge of how Clang’s profile-guided optimization works is useful, but not required. For those interested in using LLVM to provide code coverage analysis for their own programs, see the Clang documentation <https://clang.llvm.org/docs/SourceBasedCodeCoverage.html>.

We start by briefly describing LLVM’s code coverage mapping format and the way that Clang and LLVM’s code coverage tool work with this format. After the basics are down, more advanced features of the coverage mapping format are discussed, such as the data structures, the LLVM IR representation and the binary encoding.

High Level Overview

LLVM’s code coverage mapping format is designed to be a self-contained data format that can be embedded into the LLVM IR and into object files. It’s described in this document as a mapping format because its goal is to store the data that is required for a code coverage tool to map between the specific source ranges in a file and the execution counts obtained after running the instrumented version of the program.

The mapping data is used in two places in the code coverage process:

  • When clang compiles a source file with -fcoverage-mapping, it generates the mapping information that describes the mapping between the source ranges and the profiling instrumentation counters. This information gets embedded into the LLVM IR and conveniently ends up in the final executable file when the program is linked.
  • It is also used by llvm-cov: the mapping information is extracted from an object file and is used to associate the execution counts (the values of the profile instrumentation counters) with the source ranges in a file. After that, the tool is able to generate various code coverage reports for the program.

The coverage mapping format aims to be a “universal format” that would be suitable for usage by any frontend, and not just by Clang. It also aims to provide the frontend the possibility of generating the minimal coverage mapping data in order to reduce the size of the IR and object files. For example, instead of emitting mapping information for each statement in a function, the frontend is allowed to group the statements with the same execution count into regions of code, and emit the mapping information only for those regions.

Advanced Concepts

The remainder of this guide is meant to give you insight into the way the coverage mapping format works.

The coverage mapping format operates on a per-function level as the profile instrumentation counters are associated with a specific function. For each function that requires code coverage, the frontend has to create coverage mapping data that can map between the source code ranges and the profile instrumentation counters for that function.

Mapping Region

The function’s coverage mapping data contains an array of mapping regions. A mapping region stores the source code range that is covered by this region, the file id, the coverage mapping counter and the region’s kind. There are several kinds of mapping regions:

  • Code regions associate portions of source code and coverage mapping counters. They make up the majority of the mapping regions. They are used by the code coverage tool to compute the execution counts for lines, highlight the regions of code that were never executed, and to obtain the various code coverage statistics for a function. For example:
  1. int main(int argc, const char *argv[]) { // Code Region from 1:40 to 9:2
  2.
  3.   if (argc > 1) { // Code Region from 3:17 to 5:4
  4.     printf("%s\n", argv[1]);
  5.   } else { // Code Region from 5:10 to 7:4
  6.     printf("\n");
  7.   }
  8.   return 0;
  9. }
  • Skipped regions are used to represent source ranges that were skipped by Clang’s preprocessor. They don’t associate with coverage mapping counters, as the frontend knows that they are never executed. They are used by the code coverage tool to mark the skipped lines inside a function as non-code lines that don’t have execution counts. For example:
  1. int main() { // Code Region from 1:12 to 6:2
  2. #ifdef DEBUG // Skipped Region from 2:1 to 4:2
  3.   printf("Hello world");
  4. #endif
  5.   return 0;
  6. }
  • Expansion regions are used to represent Clang’s macro expansions. They have an additional property: the expanded file id. This property can be used by the code coverage tool to find the mapping regions that are created as a result of this macro expansion, by checking if their file id matches the expanded file id. They don’t associate with coverage mapping counters, as the code coverage tool can determine the execution count for this region by looking up the execution count of the first region with a corresponding file id. For example:
  1. int func(int x) {
  2.   #define MAX(x,y) ((x) > (y) ? (x) : (y))
  3.   return MAX(x, 42); // Expansion Region from 3:10 to 3:13
  4. }

Source Range:

The source range record contains the starting and ending location of a certainmapping region. Both locations include the line and the column numbers.

File ID:

The file id is an integer value that tells us in which source file or macro expansion this region is located. It enables Clang to produce mapping information for the code defined inside macros, as this example demonstrates:

  1. void func(const char *str) { // Code Region from 1:28 to 6:2 with file id 0
  2.   #define PUT printf("%s\n", str) // 2 Code Regions from 2:15 to 2:34 with file ids 1 and 2
  3.   if(*str)
  4.     PUT; // Expansion Region from 4:5 to 4:8 with file id 0 that expands a macro with file id 1
  5.   PUT; // Expansion Region from 5:3 to 5:6 with file id 0 that expands a macro with file id 2
  6. }
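
The relationship between file ids and the expanded file id of an expansion region can be sketched with a small data structure. This is only an illustration in Python: the Region class, the expansion_count helper and the execution counts are hypothetical and are not part of LLVM. It looks up the count of an expansion region as described above, by taking the first region whose file id matches the expanded file id.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Region:
    kind: str                               # "code" or "expansion"
    file_id: int                            # file or macro expansion holding the region
    start: Tuple[int, int]                  # (line, column)
    end: Tuple[int, int]                    # (line, column)
    count: Optional[int] = None             # execution count, code regions only
    expanded_file_id: Optional[int] = None  # expansion regions only


def expansion_count(regions: List[Region], expansion: Region) -> Optional[int]:
    """Return the count of the first region whose file id is the expanded file id."""
    for region in regions:
        if region.file_id == expansion.expanded_file_id:
            return region.count
    return None


# Regions of the func/PUT example; the counts are invented for illustration.
regions = [
    Region("code", 0, (1, 28), (6, 2), count=3),
    Region("expansion", 0, (4, 5), (4, 8), expanded_file_id=1),
    Region("expansion", 0, (5, 3), (5, 6), expanded_file_id=2),
    Region("code", 1, (2, 15), (2, 34), count=2),
    Region("code", 2, (2, 15), (2, 34), count=3),
]

for region in regions:
    if region.kind == "expansion":
        print(region.start, expansion_count(regions, region))
```

With these invented counts, the expansion on line 4 resolves to the count of the region with file id 1, and the expansion on line 5 to the count of the region with file id 2.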

Counter:

A coverage mapping counter can represent a reference to the profile instrumentation counter. The execution count for a region with such a counter is determined by looking up the value of the corresponding profile instrumentation counter.

It can also represent a binary arithmetical expression that operates on coverage mapping counters or other expressions. The execution count for a region with an expression counter is determined by evaluating the expression’s arguments and then adding them together or subtracting them from one another. In the example below, a subtraction expression is used to compute the execution count for the compound statement that follows the else keyword:

  int main(int argc, const char *argv[]) { // Region's counter is a reference to the profile counter #0

    if (argc > 1) { // Region's counter is a reference to the profile counter #1
      printf("%s\n", argv[1]);
    } else { // Region's counter is an expression (reference to the profile counter #0 - reference to the profile counter #1)
      printf("\n");
    }
    return 0;
  }

Finally, a coverage mapping counter can also represent an execution count of zero. The zero counter is used to provide coverage mapping for unreachable statements and expressions, like in the example below:

  int main() {
    return 0;
    printf("Hello world!\n"); // Unreachable region's counter is zero
  }

The zero counters allow the code coverage tool to display proper line execution counts for the unreachable lines and highlight the unreachable code. Without them, the tool would think that those lines and regions were still executed, as it doesn’t possess the frontend’s knowledge.

LLVM IR Representation

The coverage mapping data is stored in the LLVM IR using a global constant structure variable called __llvm_coverage_mapping with the IPSK_covmap section specifier (i.e. “.lcovmap$M” on Windows and “__llvm_covmap” elsewhere).

For example, let’s consider a C file and how it gets compiled to LLVM:

  int foo() {
    return 42;
  }
  int bar() {
    return 13;
  }

The coverage mapping variable generated by Clang has 2 fields:

  • Coverage mapping header.
  • An optionally compressed list of filenames present in the translation unit.

The variable has 8-byte alignment because ld64 cannot always pack symbols from different object files tightly (the word-level alignment assumption is baked in too deeply).

  @__llvm_coverage_mapping = internal constant { { i32, i32, i32, i32 }, [32 x i8] }
  {
    { i32, i32, i32, i32 } ; Coverage map header
    {
      i32 0,  ; Always 0. In prior versions, the number of affixed function records
      i32 32, ; The length of the string that contains the encoded translation unit filenames
      i32 0,  ; Always 0. In prior versions, the length of the affixed string that contains the encoded coverage mapping data
      i32 3,  ; Coverage mapping format version
    },
    [32 x i8] c"…" ; Encoded data (dissected later)
  }, section "__llvm_covmap", align 8

The current version of the format is version 4. There are two differences from version 3:

  • Function records are now named symbols, and are marked linkonce_odr. This allows linkers to merge duplicate function records. Merging of duplicate dummy records (emitted for functions included-but-not-used in a translation unit) reduces size bloat in the coverage mapping data. As part of this change, region mapping information for a function is now included within the function record, instead of being affixed to the coverage header.
  • The filename list for a translation unit may optionally be zlib-compressed.

The only difference between versions 3 and 2 is that a special encoding for column end locations was introduced to indicate gap regions.

In version 1, the function record for foo was defined as follows:

  { i8*, i32, i32, i64 } {
    i8* getelementptr inbounds ([3 x i8]* @__profn_foo, i32 0, i32 0), ; Function's name
    i32 3, ; Function's name length
    i32 9, ; Function's encoded coverage mapping data string length
    i64 0  ; Function's structural hash
  }

In version 2, the function record for foo was defined as follows:

  { i64, i32, i64 } {
    i64 0x5cf8c24cdb18bdac, ; Function's name MD5
    i32 9, ; Function's encoded coverage mapping data string length
    i64 0  ; Function's structural hash
  }

Coverage Mapping Header:

The coverage mapping header has the following fields:

  • The number of function records affixed to the coverage header. Always 0, but present for backwards compatibility.
  • The length of the string in the second field of __llvm_coverage_mapping that contains the encoded translation unit filenames.
  • The length of the string in the third field of __llvm_coverage_mapping that contains any encoded coverage mapping data affixed to the coverage header. Always 0, but present for backwards compatibility.
  • The format version. The current version is 4 (encoded as a 3).
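
As a rough illustration of the four header fields, the sketch below unpacks them from raw bytes. It assumes the four i32 values are laid out back to back in little-endian byte order, which depends on the target, and the function and key names are hypothetical. The version is stored as one less than the format version, as noted above.

```python
import struct


def parse_coverage_header(raw: bytes) -> dict:
    """Unpack the four i32 header fields (little-endian layout is an assumption)."""
    n_records, filenames_len, coverage_len, raw_version = struct.unpack_from("<4I", raw)
    return {
        "num_function_records": n_records,  # always 0 in the current format
        "filenames_length": filenames_len,  # length of the encoded filenames
        "coverage_length": coverage_len,    # always 0 in the current format
        "version": raw_version + 1,         # version 4 is encoded as 3
    }


# Header values taken from the IR sample shown earlier.
sample_header = struct.pack("<4I", 0, 32, 0, 3)
header = parse_coverage_header(sample_header)
print(header)
```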

Function record:

A function record is a structure of the following type:

  1. { i64, i32, i64, i64, [? x i8] }

It contains the function name’s MD5, the length of the encoded mapping data for that function, the function’s structural hash value, the hash of the filenames in the function’s translation unit, and the encoded mapping data.
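
The name hash can be reproduced with a few lines of Python. The sketch assumes that the 64-bit value is the first eight bytes of the MD5 digest of the function name, read as a little-endian integer; under that assumption it yields the value 0x5cf8c24cdb18bdac shown for foo in the version 2 record. The helper name is hypothetical, and the exact string that a frontend hashes may differ from the plain source-level name (for example for functions with internal linkage).

```python
import hashlib


def function_name_hash(name: str) -> int:
    """Low 64 bits of the MD5 digest of the name, read as little-endian (assumption)."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")


print(hex(function_name_hash("foo")))
```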

Dissecting the sample:

Here’s an overview of the encoded data that was stored in the IR for the coverage mapping sample that was shown earlier:

  • The IR contains the following string constant that represents the encoded coverage mapping data for the sample translation unit:
  1. c"\01\15\1Dx\DA\13\D1\0F-N-*\D6/+\CE\D6/\C9-\D0O\CB\CF\D7K\06\00N+\07]"
  • The string contains values that are encoded in the LEB128 format, which is used throughout for storing integers. It also contains a compressed payload.

  • The first three LEB128-encoded numbers in the sample specify the number of filenames, the length of the uncompressed filenames, and the length of the compressed payload (or 0 if compression is disabled). In this sample, there is 1 filename that is 21 bytes in length (uncompressed), and stored in 29 bytes (compressed).

  • The coverage mapping from the first function record is encoded in this string:

  1. c"\01\00\00\01\01\01\0C\02\02"

This string consists of the following bytes:

  • 0x01 - The number of file ids used by this function. There is only one file id used by the mapping data in this function.
  • 0x00 - An index into the filenames array which corresponds to the file “/Users/alex/test.c”.
  • 0x00 - The number of counter expressions used by this function. This function doesn’t use any expressions.
  • 0x01 - The number of mapping regions that are stored in an array for the function’s file id #0.
  • 0x01 - The coverage mapping counter for the first region in this function. The value of 1 tells us that it’s a coverage mapping counter that is a reference to the profile instrumentation counter with an index of 0.
  • 0x01 - The starting line of the first mapping region in this function.
  • 0x0C - The starting column of the first mapping region in this function.
  • 0x02 - The ending line of the first mapping region in this function, stored as the number of lines past the starting line (numLines).
  • 0x02 - The ending column of the first mapping region in this function.

  • The length of the substring that contains the encoded coverage mapping data for the second function record is also 9. It’s structured like the mapping data for the first function record.

  • The two trailing bytes are zeroes and are used to pad the coverage mapping data to give it the 8-byte alignment.
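
The dissection above can be checked mechanically. The following Python sketch is an illustration only and not the llvm-cov implementation: read_leb128 and the field labels are hypothetical names, and the byte strings are the two samples from this section with the IR "\XX" escapes rewritten as Python "\xXX" escapes. It reads the three leading numbers of the filenames string and then decodes every value of the first function record’s mapping data.

```python
def read_leb128(data: bytes, pos: int = 0):
    """Decode one unsigned LEB128 value; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:  # continuation bit clear: last byte of the value
            return value, pos
        shift += 7


# Encoded filenames string from the IR sample.
filenames_blob = (
    b"\x01\x15\x1dx\xda\x13\xd1\x0f-N-*\xd6/+\xce\xd6/"
    b"\xc9-\xd0O\xcb\xcf\xd7K\x06\x00N+\x07]"
)
pos = 0
num_filenames, pos = read_leb128(filenames_blob, pos)
uncompressed_len, pos = read_leb128(filenames_blob, pos)
compressed_len, pos = read_leb128(filenames_blob, pos)
payload = filenames_blob[pos:]  # compressed filenames, left undecoded here

# Mapping data of the first function record.
mapping = b"\x01\x00\x00\x01\x01\x01\x0c\x02\x02"
labels = [
    "numFileIds", "filenameIndex0", "numExpressions", "numRegions",
    "counter", "deltaLineStart", "columnStart", "numLines", "columnEnd",
]
values = []
pos = 0
while pos < len(mapping):
    value, pos = read_leb128(mapping, pos)
    values.append(value)

for label, value in zip(labels, values):
    print(f"{label}: {value}")
```

The labels apply to this particular sample, which has a single file id, no expressions and a single region; a general reader has to follow the counts it decodes along the way.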

Encoding

The per-function coverage mapping data is encoded as a stream of bytes with a simple structure. The structure is built from encoding types, such as variable-length unsigned integers, that are used to encode the File ID Mapping, the Counter Expressions and the Mapping Regions.

The format of the structure follows:

[file id mapping, counter expressions, mapping regions]

The translation unit filenames are encoded using the same encoding types as the per-function coverage mapping data, with the following structure:

[numFilenames : LEB128, filename0 : string, filename1 : string, …]

Types

This section describes the basic types that are used by the encoding formatand can appear after : in the [foo : type] description.

LEB128

LEB128 is an unsigned integer value that is encoded using DWARF’s LEB128 encoding, optimizing for the case where values are small (1 byte for values less than 128).
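
A minimal sketch of unsigned LEB128 in Python follows; the function names are hypothetical. Each byte carries seven bits of the value, least significant group first, and the high bit of a byte signals that another byte follows.

```python
def encode_leb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    if value < 0:
        raise ValueError("unsigned LEB128 cannot encode negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte, high bit clear
            return bytes(out)


def decode_leb128(data: bytes, pos: int = 0):
    """Decode one unsigned LEB128 value; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


for number in (0, 127, 128, 624485):
    print(number, encode_leb128(number).hex())
```

Values below 128 occupy a single byte, which is the common case for line deltas, columns and small counter ids.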

Strings

[length : LEB128, characters…]

String values are encoded with a LEB128 value for the length of the string and a sequence of bytes for its characters.
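
As an illustration, the sketch below reads a length-prefixed string and, building on it, an uncompressed filename list of the form [numFilenames, filename0, filename1, …]. The helper names and the sample bytes are made up, and the text is assumed to be UTF-8.

```python
def read_leb128(data: bytes, pos: int):
    """Decode one unsigned LEB128 value; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


def read_string(data: bytes, pos: int):
    """Read [length : LEB128, characters...]; return (text, next position)."""
    length, pos = read_leb128(data, pos)
    return data[pos:pos + length].decode("utf-8"), pos + length


def read_filenames(data: bytes, pos: int = 0):
    """Read [numFilenames : LEB128, filename0 : string, ...] (uncompressed form)."""
    count, pos = read_leb128(data, pos)
    names = []
    for _ in range(count):
        name, pos = read_string(data, pos)
        names.append(name)
    return names, pos


# Hand-built example: two filenames of three characters each.
encoded = b"\x02\x03a.c\x03b.h"
print(read_filenames(encoded))
```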

File ID Mapping

[numIndices : LEB128, filenameIndex0 : LEB128, filenameIndex1 : LEB128, …]

File id mapping in a function’s coverage mapping stream contains the indices into the translation unit’s filenames array.

Counter

[value : LEB128]

A coverage mapping counter is stored in a single LEB128 value. It is composed of two parts: the tag, which is stored in the lowest 2 bits, and the counter data, which is stored in the remaining bits.

Tag:

The counter’s tag encodes the counter’s kind and, if the counter is an expression, the expression’s kind. The possible tag values are:

  • 0 - The counter is zero.
  • 1 - The counter is a reference to the profile instrumentation counter.
  • 2 - The counter is a subtraction expression.
  • 3 - The counter is an addition expression.

Data:

The counter’s data is interpreted in the following manner:

  • When the counter is a reference to the profile instrumentation counter, then the counter’s data is the id of the profile counter.
  • When the counter is an expression, then the counter’s data is the index into the array of counter expressions.
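
The split between tag and data can be sketched as follows. The helper names and the TAG_NAMES table are hypothetical; the bit layout is the one described above, with the tag in the lowest 2 bits and the data in the remaining bits.

```python
TAG_NAMES = {
    0: "zero",
    1: "counter reference",
    2: "subtraction expression",
    3: "addition expression",
}


def decode_counter(value: int):
    """Split an encoded counter into (tag, data)."""
    return value & 0x3, value >> 2


def encode_counter(tag: int, data: int) -> int:
    """Combine a tag (0..3) and its data into an encoded counter."""
    return (data << 2) | tag


for encoded in (0, 1, 5, 2):
    tag, data = decode_counter(encoded)
    print(encoded, TAG_NAMES[tag], data)
```

For instance, the value 1 from the dissected sample decodes to a reference to profile counter 0, and the value 5 decodes to a reference to profile counter 1.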

Counter Expressions

[numExpressions : LEB128, expr0LHS : LEB128, expr0RHS : LEB128, expr1LHS : LEB128, expr1RHS : LEB128, …]

Counter expressions consist of two counters as they represent binary arithmetic operations. The expression’s kind is determined from the tag of the counter that references this expression.
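
To make the evaluation concrete, here is a minimal recursive evaluator in Python. It is a sketch under the encoding described in this section, not llvm-cov’s code: the function name and the profile counts are invented. The expression array holds (LHS, RHS) pairs of encoded counters, and the tag of the referencing counter selects subtraction or addition.

```python
def evaluate(counter: int, expressions, profile_counts) -> int:
    """Evaluate an encoded counter against profile counts."""
    tag, data = counter & 0x3, counter >> 2
    if tag == 0:                      # zero counter
        return 0
    if tag == 1:                      # reference to a profile counter
        return profile_counts[data]
    lhs, rhs = expressions[data]      # expression: data indexes the array
    left = evaluate(lhs, expressions, profile_counts)
    right = evaluate(rhs, expressions, profile_counts)
    return left - right if tag == 2 else left + right


# The if/else example: expression 0 is (counter #0, counter #1).
expressions = [(1, 5)]                # 1 = reference to #0, 5 = reference to #1
profile_counts = [10, 4]              # invented execution counts
else_counter = 2                      # tag 2 (subtraction), expression index 0
print(evaluate(else_counter, expressions, profile_counts))
```

With these invented counts, the else branch evaluates to 10 - 4 = 6 executions.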

Mapping Regions

[numRegionArrays : LEB128, regionsForFile0, regionsForFile1, …]

The mapping regions are stored in an array of sub-arrays where every region in a particular sub-array has the same file id.

The file id for a sub-array of regions is the index of that sub-array in the main array, e.g. the first sub-array will have the file id of 0.

Sub-Array of Regions

[numRegions : LEB128, region0, region1, …]

The mapping regions for a specific file id are stored in an array that is sorted in an ascending order by the region’s starting location.

Mapping Region

[header, source range]

The mapping region record contains two sub-records: the header, which stores the counter and/or the region’s kind, and the source range that contains the starting and ending location of this region.

Header

[counter]

or

[pseudo-counter]

The header encodes the region’s counter and the region’s kind.

The value of the counter’s tag distinguishes between the counters and pseudo-counters: if the tag is zero, then this header contains a pseudo-counter, otherwise this header contains an ordinary counter.

Counter:

A mapping region whose header has a counter with a non-zero tag is a code region.

Pseudo-Counter:

[value : LEB128]

A pseudo-counter is stored in a single LEB128 value, just like the ordinary counter. It has the following interpretation:

  • bits 0-1: tag, which is always 0.

  • bit 2: expansionRegionTag. If this bit is set, then this mapping region is an expansion region.

  • remaining bits: data. If this region is an expansion region, then the data contains the expanded file id of that region.

Otherwise, the data contains the region’s kind. The possible region kind values are:

  • 0 - This mapping region is a code region with a counter of zero.
  • 2 - This mapping region is a skipped region.
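
Put together, a region header could be classified as in the sketch below. The function name and the returned tuples are hypothetical; the bit positions follow the description above, with the tag in bits 0-1, the expansion flag in bit 2 and the data in the remaining bits.

```python
def classify_header(value: int):
    """Return (region kind, payload) for an encoded region header.

    The payload is the encoded counter for code regions, the expanded
    file id for expansion regions, and None for skipped regions.
    """
    if value & 0x3:                   # non-zero tag: ordinary counter
        return "code", value
    if value & 0x4:                   # bit 2 set: expansion region
        return "expansion", value >> 3
    kind = value >> 3                 # otherwise the data is the region kind
    if kind == 0:
        return "code", 0              # code region with a counter of zero
    if kind == 2:
        return "skipped", None
    raise ValueError(f"region kind {kind} is not described in this section")


for header in (5, 0, 12, 16):
    print(header, classify_header(header))
```

Here 12 is an expansion region whose expanded file id is 1, and 16 is a skipped region (kind 2 shifted past the three low bits).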

Source Range

[deltaLineStart : LEB128, columnStart : LEB128, numLines : LEB128, columnEnd : LEB128]

The source range record contains the following fields:

  • deltaLineStart: The difference between the starting line of the current mapping region and the starting line of the previous mapping region.

If the current mapping region is the first region in the current sub-array, then it stores the starting line of that region.

  • columnStart: The starting column of the mapping region.

  • numLines: The difference between the ending line and the starting line of the current mapping region.

  • columnEnd: The ending column of the mapping region. If the high bit is set, the current mapping region is a gap area. A count for a gap area is only used as the line execution count if there are no other regions on a line.
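
The delta encoding of source ranges can be illustrated with the three code regions of the first example in this document (1:40 to 9:2, 3:17 to 5:4 and 5:10 to 7:4). The sketch below is only an illustration: the function name is hypothetical, and it assumes that the "high bit" of columnEnd is bit 31 of a 32-bit value, which the text above does not spell out.

```python
GAP_BIT = 1 << 31  # assumption: the high bit of a 32-bit column value


def decode_source_ranges(fields):
    """Turn (deltaLineStart, columnStart, numLines, columnEnd) tuples of one
    sub-array into (lineStart, columnStart, lineEnd, columnEnd, isGap)."""
    regions = []
    line_start = 0  # the first delta is the starting line itself
    for delta, column_start, num_lines, column_end in fields:
        line_start += delta
        is_gap = bool(column_end & GAP_BIT)
        column_end &= ~GAP_BIT        # strip the gap marker from the column
        regions.append(
            (line_start, column_start, line_start + num_lines, column_end, is_gap)
        )
    return regions


# Encoded form of the regions 1:40-9:2, 3:17-5:4 and 5:10-7:4.
encoded = [(1, 40, 8, 2), (2, 17, 2, 4), (2, 10, 2, 4)]
for region in decode_source_ranges(encoded):
    print(region)
```

Because the sub-array is sorted by starting location, every delta is non-negative and fits the unsigned LEB128 encoding.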