LLVM’s Optional Rich Disassembly Output

Introduction

LLVM’s default disassembly output is raw text. To give consumers more ability to introspect the instructions’ textual representation, or to reformat it for a more user-friendly display, there is an optional rich disassembly output.

This optional output is sufficient to reference into individual portions of the instruction text. It is intended for clients like disassemblers, list file generators, and pretty-printers, which need more than the raw instructions and the ability to print them.

To provide this functionality the assembly text is marked up with annotations. The markup is simple enough in syntax to be robust even in the case of version mismatches between consumers and producers. That is, the syntax generally does not carry semantics beyond “this text has an annotation,” so consumers can simply ignore annotations they do not understand or do not care about.

After calling LLVMCreateDisasm() to create a disassembler context, the optional output is enabled with this call:

  LLVMSetDisasmOptions(DC, LLVMDisassembler_Option_UseMarkup);

Subsequent calls to LLVMDisasmInstruction() will then return output strings with the marked up annotations.

Instruction Annotations

Contextual markups

Annotated assembly display will supply contextual markup to help clients implement things like pretty printers more efficiently. Most markup will be target independent, so clients can provide good display without any target specific knowledge.

Annotated assembly goes through the normal instruction printer, but optionally includes contextual tags on portions of the instruction string. An annotation is any ‘<’ ‘>’ delimited section of text (1).

  annotation: '<' tag-name tag-modifier-list ':' annotated-text '>'
  tag-name: identifier
  tag-modifier-list: comma delimited identifier list

The tag-name is an identifier which gives the type of the annotation. For the first pass, this will be very simple, with memory references, registers, and immediates having the tag names “mem”, “reg”, and “imm”, respectively.

The tag-modifier-list is typically additional target-specific context, such as register class.

Clients should accept and ignore any tag-names or tag-modifiers they do not understand, allowing the annotations to grow in richness without breaking older clients.
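A minimal sketch of how a client might apply this rule is shown here. The names tag_kind and classify_tag are hypothetical and are not part of the LLVM C API. The sketch also assumes that a tag-name consists only of letters, digits, and underscores, so that whatever follows it (a modifier list or the ':') ends the name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical client-side classification of annotation tags. */
enum tag_kind { TAG_REG, TAG_MEM, TAG_IMM, TAG_UNKNOWN };

/* Classifies the annotation whose opening '<' is at `s`.
   Any tag-name the client does not know maps to TAG_UNKNOWN, so newer
   producers can introduce tags without breaking this code. */
enum tag_kind classify_tag(const char *s)
{
    assert(s != NULL && s[0] == '<');
    const char *name = s + 1;

    /* Assumption: the tag-name is the longest run of identifier characters. */
    size_t len = strspn(name,
                        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                        "abcdefghijklmnopqrstuvwxyz"
                        "0123456789_");

    if (len == 3 && strncmp(name, "reg", 3) == 0) return TAG_REG;
    if (len == 3 && strncmp(name, "mem", 3) == 0) return TAG_MEM;
    if (len == 3 && strncmp(name, "imm", 3) == 0) return TAG_IMM;
    return TAG_UNKNOWN; /* accepted, but carries no meaning for this client */
}
```

A pretty-printer could, for instance, pick a color for TAG_REG, TAG_MEM, and TAG_IMM and print the text of a TAG_UNKNOWN annotation unstyled. Tag-modifiers are skipped entirely in this sketch, which is one way of ignoring those that are not understood.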

For example, an ARM load of a stack-relative location might be annotated as:

  ldr <reg gpr:r0>, <mem regoffset:[<reg gpr:sp>, <imm:#4>]>

1: For assembly dialects in which ‘<’ and/or ‘>’ are legal tokens, a literal token is escaped by immediately repeating the character. For example, a literal ‘<’ character is output as ‘<<’ in an annotated assembly string.
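To illustrate the grammar and the escaping rule together, here is a sketch of a routine that removes all markup from an annotated string and recovers the plain assembly text. The function name strip_markup is hypothetical and not part of the LLVM C API. The sketch follows the footnote literally and treats every doubled ‘<<’ or ‘>>’ as an escaped literal, so it assumes that two annotations never close back to back with nothing in between. It also assumes the tag header never contains a ':' other than the one that ends it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies `in` to `out` with all annotation markup removed.
   Returns the number of characters written (not counting the terminator),
   or (size_t)-1 if `out` is too small or the markup is unbalanced. */
size_t strip_markup(const char *in, char *out, size_t out_size)
{
    assert(in != NULL && out != NULL);
    const size_t fail = (size_t)-1;
    size_t n = 0;   /* characters written so far */
    int depth = 0;  /* annotations currently open */

    for (size_t i = 0; in[i] != '\0'; ) {
        char c = in[i];
        if ((c == '<' || c == '>') && in[i + 1] == c) {
            /* Doubled delimiter: an escaped literal '<' or '>'. */
            if (n + 1 >= out_size) return fail;
            out[n++] = c;
            i += 2;
        } else if (c == '<') {
            /* Annotation start: skip tag-name and modifiers through ':'. */
            const char *colon = strchr(in + i, ':');
            if (colon == NULL) return fail;
            i = (size_t)(colon - in) + 1;
            depth++;
        } else if (c == '>') {
            /* Annotation end: must match an earlier start. */
            if (depth == 0) return fail;
            depth--;
            i++;
        } else {
            /* Ordinary annotated or unannotated text. */
            if (n + 1 >= out_size) return fail;
            out[n++] = c;
            i++;
        }
    }
    if (depth != 0 || n >= out_size) return fail;
    out[n] = '\0';
    return n;
}
```

Applied to the ARM example above, this yields the text ldr r0, [sp, #4]. A client that only wants plain output could run each string returned by LLVMDisasmInstruction() through such a routine, while a pretty-printer would act on each tag instead of discarding it.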

C API Details

The intended consumers of this information use the C API. Therefore the C API for the disassembler provides an option to produce disassembled instructions with annotations: the LLVMSetDisasmOptions() function and the LLVMDisassembler_Option_UseMarkup option (see above).