LLVM Introduction

A typical compiler pipeline consists of several stages. The middle stage often works with one or more forms of the program that sit between the source language and the target machine code, known as intermediate representations.

[Figure 1]

LLVM is a statically typed intermediate representation and an associated toolchain for manipulating, optimizing and converting this intermediate form into native code. LLVM code comes in two flavors: a binary bitcode format (.bc) and a textual assembly format (.ll). The command line tools llvm-as and llvm-dis convert between the two forms: llvm-as assembles .ll into .bc, and llvm-dis disassembles .bc back into .ll. We’ll mostly be working with the human-readable LLVM assembly, which we’ll casually refer to as IR, reserving the word assembly for the native assembly that results from compilation. Note that the binary format for LLVM bitcode starts with the magic two-byte sequence 0x42 0x43, or “BC” in ASCII, which is followed by the bytes 0xC0 0xDE.
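As a rough illustration of the magic bytes, the following Python sketch tests whether a byte string or file begins with the "BC" prefix described above. The function names are hypothetical, and the check is deliberately minimal: it looks only at the first two bytes and does not validate the rest of the file.

```python
BITCODE_MAGIC = b"BC"  # the bytes 0x42 0x43


def looks_like_bitcode(data: bytes) -> bool:
    """Return True if data begins with the two magic bytes."""
    return data[:2] == BITCODE_MAGIC


def file_looks_like_bitcode(path: str) -> bool:
    """Read the first two bytes of a file and test them (hypothetical helper)."""
    with open(path, "rb") as handle:
        return looks_like_bitcode(handle.read(2))
```

A textual .ll file typically starts with a comment or a definition, so a check like this is enough to tell the two flavors apart in simple tooling.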

An LLVM module consists of a sequence of top-level, mutually scoped definitions: functions, globals, type declarations, and external declarations.

Symbols used in an LLVM module are either global or local. Global symbols begin with @ and local symbols begin with %. All symbols must be defined or forward declared.

  ; external declaration of the C library function
  declare i32 @putchar(i32)

  ; @add is a global symbol, %a and %b are local symbols
  define i32 @add(i32 %a, i32 %b) {
    %1 = add i32 %a, %b
    ret i32 %1
  }

  define void @main() {
    %1 = call i32 @add(i32 0, i32 97)
    call i32 @putchar(i32 %1)
    ret void
  }
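To make the behavior of the module above concrete, here is a small Python sketch that mirrors its arithmetic. It is an analogy rather than a translation: the masking imitates the wrap-around of a 32-bit add, and returning a character stands in for the call to putchar.

```python
def add(a: int, b: int) -> int:
    # An add on i32 wraps modulo 2**32; the mask imitates that behavior.
    return (a + b) & 0xFFFFFFFF


def main() -> str:
    code = add(0, 97)
    # putchar would write this character to standard output.
    return chr(code)


if __name__ == "__main__":
    print(main())
```

Since 97 is the ASCII code for the letter a, running the LLVM module writes that single character.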

An LLVM function consists of a sequence of basic blocks, each containing a sequence of instructions that assign to local values and ending in a terminator instruction such as br or ret. During compilation, basic blocks roughly correspond to labels in the native assembly output.

  define double @main(double %x) {
  entry:
    %0 = alloca double
    ; branch targets are written as label %name
    br label %body
  body:
    store double %x, double* %0
    ; LLVM 3.7 and later spell out the loaded type before the pointer
    %1 = load double, double* %0
    %2 = fadd double %1, 1.000000e+00
    ret double %2
  }

First-class types in LLVM align very closely with machine types. Alignment and platform-specific sizes are kept out of the type specification and are given instead in the data layout for a module.

Type            Description
i1              A 1-bit integer
i32             A 32-bit integer
i32*            A pointer to a 32-bit integer
i32**           A pointer to a pointer to a 32-bit integer
double          A 64-bit floating point value
float (i32)     A function taking an i32 and returning a 32-bit floating point value
<4 x i32>       A vector of four 32-bit integer values
{i32, double}   A struct of a 32-bit integer and a double
<{i8*, i32}>    A packed struct of an 8-bit integer pointer and a 32-bit integer
[4 x i32]       An array of four i32 values

LLVM integer types carry no signedness of their own; whether a value is treated as signed or unsigned is decided by the instruction that operates on it.
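The difference between an ordinary struct and a packed one can be illustrated by analogy with Python's struct module. This is only a sketch: the struct module follows the host C compiler's layout rules rather than an LLVM data layout, and the field pair used here (an 8-bit integer followed by a 32-bit integer, comparable to {i8, i32} and <{i8, i32}>) is chosen purely for illustration.

```python
import struct

# "=" uses standard sizes with no padding between fields, comparable to a
# packed struct such as <{i8, i32}>: 1 byte + 4 bytes.
PACKED_SIZE = struct.calcsize("=bi")

# "@" uses the host's native sizes and alignment, comparable to an ordinary
# {i8, i32}, where padding may be inserted before the 32-bit field.
NATIVE_SIZE = struct.calcsize("@bi")

if __name__ == "__main__":
    print("packed size:", PACKED_SIZE)
    print("native size:", NATIVE_SIZE)
```

The packed size is always 5 bytes, while the native size depends on the platform's alignment rules and is at least as large.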

While LLVM IR is normally generated programmatically, we can also write it by hand. For example, consider the following minimal LLVM IR module.

  declare i32 @putchar(i32)

  define void @main() {
    call i32 @putchar(i32 42)
    ret void
  }

Running llc on this module produces platform-specific assembly. For example, llc -march=x86-64 on a Linux system generates output like the following:

          .file   "minimal.ll"
          .text
          .globl  main
          .align  16, 0x90
          .type   main,@function
  main:
          movl    $42, %edi
          jmp     putchar
  .Ltmp0:
          .size   main, .Ltmp0-main
          .section        ".note.GNU-stack","",@progbits
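The llc invocation mentioned above can be scripted. The following Python sketch builds the command line and runs it only when llc is found on the PATH. The helper names and the output file name minimal.s are assumptions made for illustration. The -march flag is the one used in the text, and -o is llc's output-file option.

```python
import shutil
import subprocess


def llc_command(source: str, output: str, march: str = "x86-64") -> list:
    """Build the argument list for an llc invocation (hypothetical helper)."""
    return ["llc", f"-march={march}", source, "-o", output]


def compile_ir(source: str, output: str, march: str = "x86-64") -> bool:
    """Run llc if it is installed; return False when it is not available."""
    if shutil.which("llc") is None:
        return False
    subprocess.run(llc_command(source, output, march), check=True)
    return True


if __name__ == "__main__":
    print(" ".join(llc_command("minimal.ll", "minimal.s")))
```

The exact assembly that llc emits varies with the LLVM version and target, so the listing above should be read as representative output rather than something to match byte for byte.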

What makes LLVM so compelling is that it lets us write our assembly-like IR as if we had an infinite number of CPU registers, abstracting away register allocation and instruction selection. LLVM IR also has the advantage of being mostly platform independent and retargetable, although details such as calling conventions, vectors, and pointer sizes keep it from being entirely independent.

As an integral part of Clang, LLVM is very well suited for compiling C-like languages, but it is also a capable toolchain for compiling both imperative and functional languages more broadly. Some notable languages using LLVM include:

GHC has an LLVM compilation path that is enabled with the -fllvm flag. The library ghc-core can be used to view the IR compilation artifacts.