User Guide for AMDGPU Backend

Introduction

The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the R600 family and continuing through the current GCN families. It lives in the llvm/lib/Target/AMDGPU directory.

LLVM

Target Triples

Use the clang -target <Architecture>-<Vendor>-<OS>-<Environment> option to specify the target triple:

AMDGPU Architectures

  Architecture  Description
  r600          AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
  amdgcn        AMD GPUs GCN GFX6 onwards for graphics and compute shaders.

AMDGPU Vendors

  Vendor   Description
  amd      Can be used for all AMD GPU usage.
  mesa3d   Can be used if the OS is mesa3d.

AMDGPU Operating Systems

  OS       Description
  <empty>  Defaults to the unknown OS.
  amdhsa   Compute kernels executed on HSA [HSA] compatible runtimes such as AMD's ROCm [AMD-ROCm].
  amdpal   Graphic shaders and compute kernels executed on AMD PAL runtime.
  mesa3d   Graphic shaders and compute kernels executed on Mesa 3D runtime.

AMDGPU Environments

  Environment  Description
  <empty>      Default.

Processors

Use the clang -mcpu <Processor> option to specify the AMDGPU processor. The names from both the Processor and Alternative Processor columns can be used.

AMDGPU Processors

Columns: Processor (Alternative Processor), Target Triple Architecture, dGPU/APU, Target Features Supported [Default], ROCm Support, Example Products.

Radeon HD 2000/3000 Series (R600) [AMD-RADEON-HD-2000-3000]
  r600    r600  dGPU
  r630    r600  dGPU
  rs880   r600  dGPU
  rv670   r600  dGPU

Radeon HD 4000 Series (R700) [AMD-RADEON-HD-4000]
  rv710   r600  dGPU
  rv730   r600  dGPU
  rv770   r600  dGPU

Radeon HD 5000 Series (Evergreen) [AMD-RADEON-HD-5000]
  cedar   r600  dGPU
  cypress r600  dGPU
  juniper r600  dGPU
  redwood r600  dGPU
  sumo    r600  dGPU

Radeon HD 6000 Series (Northern Islands) [AMD-RADEON-HD-6000]
  barts   r600  dGPU
  caicos  r600  dGPU
  cayman  r600  dGPU
  turks   r600  dGPU

GCN GFX6 (Southern Islands (SI)) [AMD-GCN-GFX6]
  gfx600 (tahiti)                          amdgcn  dGPU
  gfx601 (hainan, oland, pitcairn, verde)  amdgcn  dGPU

GCN GFX7 (Sea Islands (CI)) [AMD-GCN-GFX7]
  gfx700 (kaveri)           amdgcn  APU
      Example products: A6-7000, A6 Pro-7050B, A8-7100, A8 Pro-7150B, A10-7300, A10 Pro-7350B, FX-7500, A8-7200P, A10-7400P, FX-7600P
  gfx701 (hawaii)           amdgcn  dGPU  ROCm
      Example products: FirePro W8100, FirePro W9100, FirePro S9150, FirePro S9170
  gfx702                    amdgcn  dGPU  ROCm
      Example products: Radeon R9 290, Radeon R9 290x, Radeon R390, Radeon R390x
  gfx703 (kabini, mullins)  amdgcn  APU
      Example products: E1-2100, E1-2200, E1-2500, E2-3000, E2-3800, A4-5000, A4-5100, A6-5200, A4 Pro-3340B
  gfx704 (bonaire)          amdgcn  dGPU
      Example products: Radeon HD 7790, Radeon HD 8770, R7 260, R7 260X

GCN GFX8 (Volcanic Islands (VI)) [AMD-GCN-GFX8]
  gfx801 (carrizo)          amdgcn  APU   xnack[on]
      Example products: A6-8500P, Pro A6-8500B, A8-8600P, Pro A8-8600B, FX-8800P, Pro A12-8800B
      Example products (ROCm support): A10-8700P, Pro A10-8700B, A10-8780P
      Example products: A10-9600P, A10-9630P, A12-9700P, A12-9730P, FX-9800P, FX-9830P
      Example products: E2-9010, A6-9210, A9-9410
  gfx802 (iceland, tonga)   amdgcn  dGPU  xnack[off]  ROCm
      Example products: FirePro S7150, FirePro S7100, FirePro W7100, Radeon R285, Radeon R9 380, Radeon R9 385, Mobile FirePro M7170
  gfx803 (fiji)             amdgcn  dGPU  xnack[off]  ROCm
      Example products: Radeon R9 Nano, Radeon R9 Fury, Radeon R9 FuryX, Radeon Pro Duo, FirePro S9300x2, Radeon Instinct MI8
  gfx803 (polaris10)        amdgcn  dGPU  xnack[off]  ROCm
      Example products: Radeon RX 470, Radeon RX 480, Radeon Instinct MI6
  gfx803 (polaris11)        amdgcn  dGPU  xnack[off]  ROCm
      Example products: Radeon RX 460
  gfx810 (stoney)           amdgcn  APU   xnack[on]

GCN GFX9 [AMD-GCN-GFX9]
  gfx900                    amdgcn  dGPU  xnack[off]  ROCm
      Example products: Radeon Vega Frontier Edition, Radeon RX Vega 56, Radeon RX Vega 64, Radeon RX Vega 64 Liquid, Radeon Instinct MI25
  gfx902                    amdgcn  APU   xnack[on]
      Example products: Ryzen 3 2200G, Ryzen 5 2400G
  gfx904                    amdgcn  dGPU  xnack[off]
      Example products: TBA
  gfx906                    amdgcn  dGPU  xnack[off]
      Example products: Radeon Instinct MI50, Radeon Instinct MI60
  gfx908                    amdgcn  dGPU  xnack[off], sram-ecc[on]
      Example products: TBA
  gfx909                    amdgcn  APU   xnack[on]
      Example products: TBA (Raven Ridge 2)

GCN GFX10 [AMD-GCN-GFX10]
  gfx1010                   amdgcn  dGPU  xnack[off], wavefrontsize64[off], cumode[off]
      Example products: TBA
  gfx1011                   amdgcn  dGPU  xnack[off], wavefrontsize64[off], cumode[off]
      Example products: TBA
  gfx1012                   amdgcn  dGPU  xnack[off], wavefrontsize64[off], cumode[off]
      Example products: TBA

Target Features

Target features control how code is generated to support certain processor specific features. Not all target features are supported by all processors. The runtime must ensure that the features supported by the device used to execute the code match the features enabled when generating the code. A mismatch of features may result in incorrect execution, or a reduction in performance.

The target features supported by each processor, and the default value used if not specified explicitly, are listed in AMDGPU Processors.

Use the clang -m[no-]<TargetFeature> option to specify the AMDGPU target features.

For example:

  • -mxnack
    Enable the xnack feature.
  • -mno-xnack
    Disable the xnack feature.

AMDGPU Target Features

  • -m[no-]xnack
    Enable/disable generating code that has memory clauses that are compatible with having XNACK replay enabled. This is used for demand paging and page migration. If XNACK replay is enabled in the device, then if a page fault occurs the code may execute incorrectly if the xnack feature is not enabled. Executing code that has the feature enabled on a device that does not have XNACK replay enabled will execute correctly, but may be less performant than code with the feature disabled.
  • -m[no-]sram-ecc
    Enable/disable generating code that assumes SRAM ECC is enabled/disabled.
  • -m[no-]wavefrontsize64
    Control the default wavefront size used when generating code for kernels. When disabled native wavefront size 32 is used, when enabled wavefront size 64 is used.
  • -m[no-]cumode
    Control the default wavefront execution mode used when generating code for kernels. When disabled native WGP wavefront execution mode is used, when enabled CU wavefront execution mode is used (see Memory Model).

Address Spaces

The AMDGPU architecture supports a number of memory address spaces. The address space names use the OpenCL standard names, with some additions.

The AMDGPU address spaces correspond to architecture-specific LLVM address space numbers used in LLVM IR.

The AMDGPU address spaces are described in AMDGPU Address Spaces. Only 64-bit process address spaces are supported for the amdgcn target.

AMDGPU Address Spaces (64-Bit Process Address Space)

  Address Space Name                 LLVM IR Address  HSA Segment  Hardware        Address  NULL Value
                                     Space Number     Name         Name            Size
  Generic                            0                flat         flat            64       0x0000000000000000
  Global                             1                global       global          64       0x0000000000000000
  Region                             2                N/A          GDS             32       not implemented for AMDHSA
  Local                              3                group        LDS             32       0xFFFFFFFF
  Constant                           4                constant     same as global  64       0x0000000000000000
  Private                            5                private      scratch         32       0x00000000
  Constant 32-bit                    6                TODO
  Buffer Fat Pointer (experimental)  7                TODO
  • Generic
  • The generic address space uses the hardware flat address support available in GFX7-GFX10. This uses two fixed ranges of virtual addresses (the private and local apertures), that are outside the range of addressable global memory, to map from a flat address to a private or local address.

FLAT instructions can take a flat address and access global, private (scratch), and group (LDS) memory depending on if the address is within one of the aperture ranges. Flat access to scratch requires hardware aperture setup and setup in the kernel prologue (see Flat Scratch). Flat access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup (see M0).

To convert between a private or group address space address (termed a segment address) and a flat address the base address of the corresponding aperture can be used. For GFX7-GFX8 these are available in the HSA AQL Queue the address of which can be obtained with Queue Ptr SGPR (see Initial Kernel Execution State). For GFX9-GFX10 the aperture base addresses are directly available as inline constant registers SRC_SHARED_BASE/LIMIT and SRC_PRIVATE_BASE/LIMIT. In 64-bit address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32 which makes it easier to convert from flat to segment or segment to flat.

A global address space address has the same value when used as a flat address so no conversion is needed.
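The aperture-based conversion described above can be sketched as follows. This is an illustrative model only, not backend code: the aperture base values are hypothetical placeholders chosen solely to satisfy the 2^32 alignment property, while the real bases come from the HSA AQL queue (GFX7-GFX8) or the SRC_SHARED_BASE/SRC_PRIVATE_BASE registers (GFX9-GFX10).

```python
# Illustrative model of flat <-> segment address conversion using aperture
# base addresses. The base values below are hypothetical, 2^32-aligned
# placeholders, not real hardware values.
APERTURE_SIZE = 1 << 32  # aperture size is 2^32 bytes in 64-bit address mode

PRIVATE_APERTURE_BASE = 0x0000_2000_0000_0000  # hypothetical, 2^32 aligned
SHARED_APERTURE_BASE  = 0x0000_3000_0000_0000  # hypothetical, 2^32 aligned

def segment_to_flat(segment_addr: int, aperture_base: int) -> int:
    """A 32-bit segment (private or group) address becomes the low 32 bits
    of the flat address within its aperture."""
    assert 0 <= segment_addr < APERTURE_SIZE
    return aperture_base | segment_addr

def flat_to_segment(flat_addr: int, aperture_base: int) -> int:
    """Check the flat address lies inside the aperture, then keep the
    low 32 bits; alignment of the base makes this a simple mask."""
    assert aperture_base <= flat_addr < aperture_base + APERTURE_SIZE
    return flat_addr & (APERTURE_SIZE - 1)

# Round-trip: an LDS segment address survives conversion to flat and back.
lds_flat = segment_to_flat(0x1000, SHARED_APERTURE_BASE)
assert flat_to_segment(lds_flat, SHARED_APERTURE_BASE) == 0x1000
```

Because the base is 2^32-aligned and the aperture is 2^32 bytes, the conversion in each direction is a single OR or mask, which is the simplification the text above refers to.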

  • Global and Constant
  • The global and constant address spaces both use global virtual addresses, which are the same virtual address space used by the CPU. However, some virtual addresses may only be accessible to the CPU, some only accessible by the GPU, and some by both.

Using the constant address space indicates that the data will not change during the execution of the kernel. This allows scalar read instructions to be used. The vector and scalar L1 caches are invalidated of volatile data before each kernel dispatch execution to allow constant memory to change values between kernel dispatches.

  • Region
  • The region address space uses the hardware Global Data Store (GDS). All wavefronts executing on the same device will access the same memory for any given region address. However, the same region address accessed by wavefronts executing on different devices will access different memory. It is higher performance than global memory. It is allocated by the runtime. The data store (DS) instructions can be used to access it.
  • Local
  • The local address space uses the hardware Local Data Store (LDS) which is automatically allocated when the hardware creates the wavefronts of a work-group, and freed when all the wavefronts of a work-group have terminated. All wavefronts belonging to the same work-group will access the same memory for any given local address. However, the same local address accessed by wavefronts belonging to different work-groups will access different memory. It is higher performance than global memory. The data store (DS) instructions can be used to access it.
  • Private
  • The private address space uses the hardware scratch memory support which automatically allocates memory when it creates a wavefront, and frees it when a wavefront terminates. The memory accessed by a lane of a wavefront for any given private address will be different to the memory accessed by another lane of the same or different wavefront for the same private address.

If a kernel dispatch uses scratch, then the hardware allocates memory from a pool of backing memory allocated by the runtime for each wavefront. The lanes of the wavefront access this using dword (4 byte) interleaving. The mapping used from private address to backing memory address is:

  wavefront-scratch-base +
  ((private-address / 4) * wavefront-size * 4) +
  (wavefront-lane-id * 4) +
  (private-address % 4)

If each lane of a wavefront accesses the same private address, the interleaving results in adjacent dwords being accessed and hence requires fewer cache lines to be fetched.
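The mapping above can be checked with a small model. This is only a sketch; the function name and the default scratch base are invented for illustration.

```python
def private_to_backing(private_addr: int, lane_id: int, wavefront_size: int,
                       wavefront_scratch_base: int = 0) -> int:
    """Map a per-lane private address to its dword-interleaved backing
    memory address, following the formula in the text above."""
    return (wavefront_scratch_base
            + (private_addr // 4) * wavefront_size * 4
            + lane_id * 4
            + (private_addr % 4))

# With wave size 64, the same private dword in adjacent lanes maps to
# adjacent backing dwords, so a wave-uniform access touches contiguous memory.
assert private_to_backing(0, lane_id=0, wavefront_size=64) == 0
assert private_to_backing(0, lane_id=1, wavefront_size=64) == 4
# The next private dword of lane 0 is a full wave stride (64 * 4 bytes) away.
assert private_to_backing(4, lane_id=0, wavefront_size=64) == 256
```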

There are different ways that the wavefront scratch base address is determined by a wavefront (see Initial Kernel Execution State).

Scratch memory can be accessed in an interleaved manner using buffer instructions with the scratch buffer descriptor and per wavefront scratch offset, by the scratch instructions, or by flat instructions. Multi-dword access is not supported except by flat and scratch instructions in GFX9-GFX10.

  • Constant 32-bit
  • TODO
  • Buffer Fat Pointer
  • The buffer fat pointer is an experimental address space that is currently unsupported in the backend. It exposes a non-integral pointer that is in the future intended to support the modelling of 128-bit buffer descriptors plus a 32-bit offset into the buffer (in total encapsulating a 160-bit pointer), allowing normal LLVM load/store/atomic operations to be used to model the buffer descriptors used heavily in graphics workloads targeting the backend.

Memory Scopes

This section provides LLVM memory synchronization scopes supported by the AMDGPU backend memory model when the target triple OS is amdhsa (see Memory Model and Target Triples).

The memory model supported is based on the HSA memory model [HSA] which is based in turn on HRF-indirect with scope inclusion [HRF]. The happens-before relation is transitive over the synchronizes-with relation independent of scope, and synchronizes-with allows the memory scope instances to be inclusive (see table AMDHSA LLVM Sync Scopes).

This is different to the OpenCL [OpenCL] memory model which does not have scope inclusion and requires the memory scopes to exactly match. However, this is conservatively correct for OpenCL.

AMDHSA LLVM Sync Scopes

  • none
    The default: system.
    Synchronizes with, and participates in modification and seq_cst total orderings with, other operations (except image operations) for all address spaces (except private, or generic that accesses private) provided the other operation's sync scope is:
      • system.
      • agent and executed by a thread on the same agent.
      • workgroup and executed by a thread in the same work-group.
      • wavefront and executed by a thread in the same wavefront.
  • agent
    Synchronizes with, and participates in modification and seq_cst total orderings with, other operations (except image operations) for all address spaces (except private, or generic that accesses private) provided the other operation's sync scope is:
      • system or agent and executed by a thread on the same agent.
      • workgroup and executed by a thread in the same work-group.
      • wavefront and executed by a thread in the same wavefront.
  • workgroup
    Synchronizes with, and participates in modification and seq_cst total orderings with, other operations (except image operations) for all address spaces (except private, or generic that accesses private) provided the other operation's sync scope is:
      • system, agent or workgroup and executed by a thread in the same work-group.
      • wavefront and executed by a thread in the same wavefront.
  • wavefront
    Synchronizes with, and participates in modification and seq_cst total orderings with, other operations (except image operations) for all address spaces (except private, or generic that accesses private) provided the other operation's sync scope is:
      • system, agent, workgroup or wavefront and executed by a thread in the same wavefront.
  • singlethread
    Only synchronizes with, and participates in modification and seq_cst total orderings with, other operations (except image operations) running in the same thread for all address spaces (for example, in signal handlers).
  • one-as
    Same as system but only synchronizes with other operations within the same address space.
  • agent-one-as
    Same as agent but only synchronizes with other operations within the same address space.
  • workgroup-one-as
    Same as workgroup but only synchronizes with other operations within the same address space.
  • wavefront-one-as
    Same as wavefront but only synchronizes with other operations within the same address space.
  • singlethread-one-as
    Same as singlethread but only synchronizes with other operations within the same address space.
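The scope-inclusion rule can be modelled with a toy predicate. This is not LLVM code: the function, the level encoding, and the `shared_level` parameter are invented for illustration, and the model ignores the `one-as` variants, `singlethread`, and the image-operation exclusion.

```python
# Toy model of scope inclusion: two atomic operations synchronize if each
# operation's sync scope is at least as wide as the innermost execution
# unit (wavefront < workgroup < agent < system) that the two threads share.
LEVELS = {"wavefront": 0, "workgroup": 1, "agent": 2, "system": 3}

def synchronizes(scope_a: str, scope_b: str, shared_level: str) -> bool:
    """shared_level is the innermost unit both threads belong to,
    e.g. 'workgroup' for two threads in the same work-group."""
    needed = LEVELS[shared_level]
    return LEVELS[scope_a] >= needed and LEVELS[scope_b] >= needed

# Scope inclusion: a system-scope op synchronizes with an agent-scope op
# on the same agent, which an exact-match (OpenCL-style) model would reject.
assert synchronizes("system", "agent", "agent")
# But workgroup scope is not wide enough when the threads only share an agent.
assert not synchronizes("workgroup", "agent", "agent")
```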

AMDGPU Intrinsics

The AMDGPU backend implements the following LLVM IR intrinsics.

This section is WIP.

AMDGPU Attributes

The AMDGPU backend supports the following LLVM IR attributes.

AMDGPU LLVM IR Attributes

  • "amdgpu-flat-work-group-size"="min,max"
    Specify the minimum and maximum flat work group sizes that will be specified when the kernel is dispatched. Generated by the amdgpu_flat_work_group_size CLANG attribute [CLANG-ATTR].
  • "amdgpu-implicitarg-num-bytes"="n"
    Number of kernel argument bytes to add to the kernel argument block size for the implicit arguments. This varies by OS and language (for OpenCL see OpenCL kernel implicit arguments appended for AMDHSA OS).
  • "amdgpu-num-sgpr"="n"
    Specifies the number of SGPRs to use. Generated by the amdgpu_num_sgpr CLANG attribute [CLANG-ATTR].
  • "amdgpu-num-vgpr"="n"
    Specifies the number of VGPRs to use. Generated by the amdgpu_num_vgpr CLANG attribute [CLANG-ATTR].
  • "amdgpu-waves-per-eu"="m,n"
    Specify the minimum and maximum number of waves per execution unit. Generated by the amdgpu_waves_per_eu CLANG attribute [CLANG-ATTR].
  • "amdgpu-ieee" true/false
    Specify whether the function expects the IEEE field of the mode register to be set on entry. Overrides the default for the calling convention.
  • "amdgpu-dx10-clamp" true/false
    Specify whether the function expects the DX10_CLAMP field of the mode register to be set on entry. Overrides the default for the calling convention.

Code Object

The AMDGPU backend generates a standard ELF [ELF] relocatable code object that can be linked by lld to produce a standard ELF shared code object which can be loaded and executed on an AMDGPU target.

Header

The AMDGPU backend uses the following ELF header:

AMDGPU ELF Header

  Field                   Value
  e_ident[EI_CLASS]       ELFCLASS32 or ELFCLASS64
  e_ident[EI_DATA]        ELFDATA2LSB
  e_ident[EI_OSABI]       ELFOSABI_NONE, ELFOSABI_AMDGPU_HSA, ELFOSABI_AMDGPU_PAL, or ELFOSABI_AMDGPU_MESA3D
  e_ident[EI_ABIVERSION]  ELFABIVERSION_AMDGPU_HSA, ELFABIVERSION_AMDGPU_PAL, or ELFABIVERSION_AMDGPU_MESA3D
  e_type                  ET_REL or ET_DYN
  e_machine               EM_AMDGPU
  e_entry                 0
  e_flags                 See AMDGPU ELF Header e_flags

AMDGPU ELF Header Enumeration Values

  Name                         Value
  EM_AMDGPU                    224
  ELFOSABI_NONE                0
  ELFOSABI_AMDGPU_HSA          64
  ELFOSABI_AMDGPU_PAL          65
  ELFOSABI_AMDGPU_MESA3D       66
  ELFABIVERSION_AMDGPU_HSA     1
  ELFABIVERSION_AMDGPU_PAL     0
  ELFABIVERSION_AMDGPU_MESA3D  0
  • e_ident[EI_CLASS]
  • The ELF class is:

    • ELFCLASS32 for r600 architecture.
    • ELFCLASS64 for amdgcn architecture which only supports 64-bit process address space applications.
  • e_ident[EI_DATA]
  • All AMDGPU targets use ELFDATA2LSB for little-endian byte ordering.
  • e_ident[EI_OSABI]
  • One of the following AMDGPU architecture specific OS ABIs (see AMDGPU Operating Systems):

    • ELFOSABI_NONE for unknown OS.
    • ELFOSABI_AMDGPU_HSA for amdhsa OS.
    • ELFOSABI_AMDGPU_PAL for amdpal OS.
    • ELFOSABI_AMDGPU_MESA3D for mesa3d OS.
  • e_ident[EI_ABIVERSION]
  • The ABI version of the AMDGPU architecture specific OS ABI to which the code object conforms:

    • ELFABIVERSION_AMDGPU_HSA is used to specify the version of AMD HSA runtime ABI.
    • ELFABIVERSION_AMDGPU_PAL is used to specify the version of AMD PAL runtime ABI.
    • ELFABIVERSION_AMDGPU_MESA3D is used to specify the version of AMD MESA 3D runtime ABI.
  • e_type
  • Can be one of the following values:

    • ET_REL
    • The type produced by the AMDGPU backend compiler as it is a relocatable code object.
    • ET_DYN
    • The type produced by the linker as it is a shared code object. The AMD HSA runtime loader requires an ET_DYN code object.
  • e_machine

  • The value EM_AMDGPU is used for the machine for all processors supported by the r600 and amdgcn architectures (see AMDGPU Processors). The specific processor is specified in the EF_AMDGPU_MACH bit field of the e_flags (see AMDGPU ELF Header e_flags).
  • e_entry
  • The entry point is 0 as the entry points for individual kernels must be selected in order to invoke them through AQL packets.
  • e_flags
  • The AMDGPU backend uses the following ELF header flags:

AMDGPU ELF Header e_flags

  Name                   Value       Description
  AMDGPU Processor Flag              See AMDGPU Processors.
  EF_AMDGPU_MACH         0x000000ff  AMDGPU processor selection mask for EF_AMDGPU_MACH_xxx values defined in AMDGPU EF_AMDGPU_MACH Values.
  EF_AMDGPU_XNACK        0x00000100  Indicates if the xnack target feature is enabled for all code contained in the code object. If the processor does not support the xnack target feature then must be 0. See Target Features.
  EF_AMDGPU_SRAM_ECC     0x00000200  Indicates if the sram-ecc target feature is enabled for all code contained in the code object. If the processor does not support the sram-ecc target feature then must be 0. See Target Features.

AMDGPU EF_AMDGPU_MACH Values

  Name                           Value        Description (see AMDGPU Processors)
  EF_AMDGPU_MACH_NONE            0x000        not specified
  EF_AMDGPU_MACH_R600_R600       0x001        r600
  EF_AMDGPU_MACH_R600_R630       0x002        r630
  EF_AMDGPU_MACH_R600_RS880      0x003        rs880
  EF_AMDGPU_MACH_R600_RV670      0x004        rv670
  EF_AMDGPU_MACH_R600_RV710      0x005        rv710
  EF_AMDGPU_MACH_R600_RV730      0x006        rv730
  EF_AMDGPU_MACH_R600_RV770      0x007        rv770
  EF_AMDGPU_MACH_R600_CEDAR      0x008        cedar
  EF_AMDGPU_MACH_R600_CYPRESS    0x009        cypress
  EF_AMDGPU_MACH_R600_JUNIPER    0x00a        juniper
  EF_AMDGPU_MACH_R600_REDWOOD    0x00b        redwood
  EF_AMDGPU_MACH_R600_SUMO       0x00c        sumo
  EF_AMDGPU_MACH_R600_BARTS      0x00d        barts
  EF_AMDGPU_MACH_R600_CAICOS     0x00e        caicos
  EF_AMDGPU_MACH_R600_CAYMAN     0x00f        cayman
  EF_AMDGPU_MACH_R600_TURKS      0x010        turks
  reserved                       0x011-0x01f  Reserved for r600 architecture processors.
  EF_AMDGPU_MACH_AMDGCN_GFX600   0x020        gfx600
  EF_AMDGPU_MACH_AMDGCN_GFX601   0x021        gfx601
  EF_AMDGPU_MACH_AMDGCN_GFX700   0x022        gfx700
  EF_AMDGPU_MACH_AMDGCN_GFX701   0x023        gfx701
  EF_AMDGPU_MACH_AMDGCN_GFX702   0x024        gfx702
  EF_AMDGPU_MACH_AMDGCN_GFX703   0x025        gfx703
  EF_AMDGPU_MACH_AMDGCN_GFX704   0x026        gfx704
  reserved                       0x027        Reserved.
  EF_AMDGPU_MACH_AMDGCN_GFX801   0x028        gfx801
  EF_AMDGPU_MACH_AMDGCN_GFX802   0x029        gfx802
  EF_AMDGPU_MACH_AMDGCN_GFX803   0x02a        gfx803
  EF_AMDGPU_MACH_AMDGCN_GFX810   0x02b        gfx810
  EF_AMDGPU_MACH_AMDGCN_GFX900   0x02c        gfx900
  EF_AMDGPU_MACH_AMDGCN_GFX902   0x02d        gfx902
  EF_AMDGPU_MACH_AMDGCN_GFX904   0x02e        gfx904
  EF_AMDGPU_MACH_AMDGCN_GFX906   0x02f        gfx906
  EF_AMDGPU_MACH_AMDGCN_GFX908   0x030        gfx908
  EF_AMDGPU_MACH_AMDGCN_GFX909   0x031        gfx909
  reserved                       0x032        Reserved.
  EF_AMDGPU_MACH_AMDGCN_GFX1010  0x033        gfx1010
  EF_AMDGPU_MACH_AMDGCN_GFX1011  0x034        gfx1011
  EF_AMDGPU_MACH_AMDGCN_GFX1012  0x035        gfx1012
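Decoding a header's e_flags field follows directly from the flag layout above. The sketch below is illustrative (the helper function is invented, and only a few of the EF_AMDGPU_MACH values are reproduced); the constants themselves come from the tables above.

```python
# Sketch: decoding the processor and feature bits of an AMDGPU ELF e_flags
# value. Constants are taken from the AMDGPU ELF Header e_flags and
# EF_AMDGPU_MACH tables; only a handful of MACH values are listed here.
EF_AMDGPU_MACH     = 0x000000FF
EF_AMDGPU_XNACK    = 0x00000100
EF_AMDGPU_SRAM_ECC = 0x00000200

MACH_NAMES = {
    0x02C: "gfx900",
    0x02F: "gfx906",
    0x030: "gfx908",
    0x033: "gfx1010",
}

def decode_e_flags(e_flags: int):
    """Return (processor name, xnack enabled, sram-ecc enabled)."""
    mach = e_flags & EF_AMDGPU_MACH
    return (MACH_NAMES.get(mach, f"unknown (0x{mach:03x})"),
            bool(e_flags & EF_AMDGPU_XNACK),
            bool(e_flags & EF_AMDGPU_SRAM_ECC))

# gfx900 code object with the xnack target feature enabled:
assert decode_e_flags(0x12C) == ("gfx900", True, False)
# gfx908 code object with the sram-ecc target feature enabled:
assert decode_e_flags(0x230) == ("gfx908", False, True)
```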

Sections

An AMDGPU target ELF code object has the standard ELF sections which include:

AMDGPU ELF Sections
NameTypeAttributes
.bssSHTNOBITSSHF_ALLOC + SHF_WRITE
.dataSHT_PROGBITSSHF_ALLOC + SHF_WRITE
.debug*SHT_PROGBITSnone
.dynamicSHT_DYNAMICSHF_ALLOC
.dynstrSHT_PROGBITSSHF_ALLOC
.dynsymSHT_PROGBITSSHF_ALLOC
.gotSHT_PROGBITSSHF_ALLOC + SHF_WRITE
.hashSHT_HASHSHF_ALLOC
.noteSHT_NOTEnone
.relanameSHT_RELAnone
.rela.dynSHT_RELAnone
.rodataSHT_PROGBITSSHF_ALLOC
.shstrtabSHT_STRTABnone
.strtabSHT_STRTABnone
.symtabSHT_SYMTABnone
.textSHT_PROGBITSSHF_ALLOC + SHF_EXECINSTR

These sections have their standard meanings (see [ELF]) and are only generated if needed.

  • .debug*
  • The standard DWARF sections. See DWARF for information on the DWARF produced by the AMDGPU backend.
  • .dynamic, .dynstr, .dynsym, .hash
  • The standard sections used by a dynamic loader.
  • .note
  • See Note Records for the note records supported by the AMDGPUbackend.
  • .rela<name>, .rela.dyn
  • For relocatable code objects, name is the name of the section to which the relocation records apply. For example, .rela.text is the section name for relocation records associated with the .text section.

For linked shared code objects, .rela.dyn contains all the relocation records from each of the relocatable code object's .rela<name> sections.

See Relocation Records for the relocation records supported by the AMDGPU backend.

  • .text
  • The executable machine code for the kernels and functions they call. Generated as position independent code. See Code Conventions for information on conventions used in the ISA generation.

Note Records

The AMDGPU backend code object contains ELF note records in the .note section. The set of generated notes and their semantics depend on the code object version; see Code Object V2 Note Records (-mattr=-code-object-v3) and Code Object V3 Note Records (-mattr=+code-object-v3).

As required by ELFCLASS32 and ELFCLASS64, minimal zero byte padding must be generated after the name field to ensure the desc field is 4 byte aligned. In addition, minimal zero byte padding must be generated to ensure the desc field size is a multiple of 4 bytes. The sh_addralign field of the .note section must be at least 4 to indicate at least 8 byte alignment.
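The padding rules can be made concrete with a small builder. This is a sketch following the standard ELF note layout, not backend code; the desc content here is an arbitrary placeholder byte, not real metadata.

```python
import struct

# Sketch of the ELF note record layout with the padding rules described
# above: namesz/descsz/type header, then the null-terminated name padded
# to a 4 byte boundary so desc is 4-byte aligned, then desc padded so its
# size is a multiple of 4 bytes.
def build_note(name: bytes, note_type: int, desc: bytes) -> bytes:
    name = name + b"\0"  # name field includes the terminating null
    record = struct.pack("<III", len(name), len(desc), note_type)
    record += name + b"\0" * (-len(name) % 4)  # pad so desc is 4-byte aligned
    record += desc + b"\0" * (-len(desc) % 4)  # pad desc size to multiple of 4
    return record

# "AMDGPU" note with a 1-byte placeholder desc (NT_AMDGPU_METADATA = 32):
note = build_note(b"AMDGPU", 32, b"\x81")
# 12-byte header + "AMDGPU\0" padded to 8 bytes + 1-byte desc padded to 4:
assert len(note) == 12 + 8 + 4
assert note[12:19] == b"AMDGPU\0"
```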

Code Object V2 Note Records (-mattr=-code-object-v3)

Warning

Code Object V2 is not the default code object version emitted by this version of LLVM. For a description of the notes generated with the default configuration (Code Object V3) see Code Object V3 Note Records (-mattr=+code-object-v3).

The AMDGPU backend code object uses the following ELF note record in the .note section when compiling for Code Object V2 (-mattr=-code-object-v3).

Additional note records may be present, but any which are not documented here are deprecated and should not be used.

AMDGPU Code Object V2 ELF Note Records

  Name   Type                        Description
  "AMD"  NT_AMD_AMDGPU_HSA_METADATA  <metadata null terminated string>

AMDGPU Code Object V2 ELF Note Record Enumeration Values

  Name                        Value
  reserved                    0-9
  NT_AMD_AMDGPU_HSA_METADATA  10
  reserved                    11

Code Object V3 Note Records (-mattr=+code-object-v3)

The AMDGPU backend code object uses the following ELF note record in the .note section when compiling for Code Object V3 (-mattr=+code-object-v3).

Additional note records may be present, but any which are not documented here are deprecated and should not be used.

AMDGPU Code Object V3 ELF Note Records

  Name      Type                Description
  "AMDGPU"  NT_AMDGPU_METADATA  Metadata in Message Pack [MsgPack] binary format.

AMDGPU Code Object V3 ELF Note Record Enumeration Values

  Name                Value
  reserved            0-31
  NT_AMDGPU_METADATA  32

Symbols

Symbols include the following:

AMDGPU ELF Symbols

  Name          Type        Section               Description
  link-name     STT_OBJECT  .data, .rodata, .bss  Global variable
  link-name.kd  STT_OBJECT  .rodata               Kernel descriptor
  link-name     STT_FUNC    .text                 Kernel entry point
  link-name     STT_OBJECT  SHN_AMDGPU_LDS        Global variable in LDS
  • Global variable
  • Global variables both used and defined by the compilation unit.

If the symbol is defined in the compilation unit then it is allocated in the appropriate section according to whether it has initialized data or is readonly.

If the symbol is external then its section is STN_UNDEF and the loader will resolve relocations using the definition provided by another code object or explicitly defined by the runtime.

If the symbol resides in local/group memory (LDS) then its section is the special processor-specific section name SHN_AMDGPU_LDS, and the st_value field describes alignment requirements as it does for common symbols.

  • Kernel descriptor
  • Every HSA kernel has an associated kernel descriptor. It is the address of the kernel descriptor that is used in the AQL dispatch packet used to invoke the kernel, not the kernel entry point. The layout of the HSA kernel descriptor is defined in Kernel Descriptor.
  • Kernel entry point
  • Every HSA kernel also has a symbol for its machine code entry point.

Relocation Records

The AMDGPU backend generates Elf64_Rela relocation records. Supported relocatable fields are:

  • word32
  • This specifies a 32-bit field occupying 4 bytes with arbitrary byte alignment. These values use the same byte order as other word values in the AMDGPU architecture.
  • word64
  • This specifies a 64-bit field occupying 8 bytes with arbitrary byte alignment. These values use the same byte order as other word values in the AMDGPU architecture.

The following notations are used for specifying relocation calculations:

  • A
  • Represents the addend used to compute the value of the relocatable field.
  • G
  • Represents the offset into the global offset table at which the relocation entry's symbol will reside during execution.
  • GOT
  • Represents the address of the global offset table.
  • P
  • Represents the place (section offset for et_rel or address for et_dyn) of the storage unit being relocated (computed using r_offset).
  • S
  • Represents the value of the symbol whose index resides in the relocation entry. Relocations not using this must specify a symbol index of STN_UNDEF.
  • B
  • Represents the base address of a loaded executable or shared object which is the difference between the ELF address and the actual load address. Relocations using this are only valid in executable or shared objects.

The following relocation types are supported:

AMDGPU ELF Relocation Records

  Relocation Type         Kind             Value  Field   Calculation
  R_AMDGPU_NONE                            0      none    none
  R_AMDGPU_ABS32_LO       Static, Dynamic  1      word32  (S + A) & 0xFFFFFFFF
  R_AMDGPU_ABS32_HI       Static, Dynamic  2      word32  (S + A) >> 32
  R_AMDGPU_ABS64          Static, Dynamic  3      word64  S + A
  R_AMDGPU_REL32          Static           4      word32  S + A - P
  R_AMDGPU_REL64          Static           5      word64  S + A - P
  R_AMDGPU_ABS32          Static, Dynamic  6      word32  S + A
  R_AMDGPU_GOTPCREL       Static           7      word32  G + GOT + A - P
  R_AMDGPU_GOTPCREL32_LO  Static           8      word32  (G + GOT + A - P) & 0xFFFFFFFF
  R_AMDGPU_GOTPCREL32_HI  Static           9      word32  (G + GOT + A - P) >> 32
  R_AMDGPU_REL32_LO       Static           10     word32  (S + A - P) & 0xFFFFFFFF
  R_AMDGPU_REL32_HI       Static           11     word32  (S + A - P) >> 32
  reserved                                 12
  R_AMDGPU_RELATIVE64     Dynamic          13     word64  B + A

R_AMDGPU_ABS32_LO and R_AMDGPU_ABS32_HI are only supported by the mesa3d OS, which does not support R_AMDGPU_ABS64.

There is no current OS loader support for 32-bit programs and so R_AMDGPU_ABS32 is not used.
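The relocation calculations are plain arithmetic on the quantities defined above. The sketch below evaluates the R_AMDGPU_REL32_LO / R_AMDGPU_REL32_HI pair; the function names and the S, A, P values are arbitrary examples, not backend code.

```python
# Sketch: evaluating the R_AMDGPU_REL32_LO / R_AMDGPU_REL32_HI relocation
# calculations from the table above for a PC-relative 64-bit displacement
# split across two 32-bit fields.
def rel32_lo(S: int, A: int, P: int) -> int:
    return (S + A - P) & 0xFFFFFFFF

def rel32_hi(S: int, A: int, P: int) -> int:
    return (S + A - P) >> 32

# Arbitrary example values: symbol value, addend, place.
S, A, P = 0x1_2345_6789, 0x8, 0x1000
assert rel32_lo(S, A, P) == 0x23455791
assert rel32_hi(S, A, P) == 0x1
# The two halves recombine into the full 64-bit displacement:
assert (rel32_hi(S, A, P) << 32) | rel32_lo(S, A, P) == S + A - P
```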

DWARF

Warning

This section describes a provisional proposal that is not currently fully implemented and is subject to change.

Standard DWARF [DWARF] sections can be generated. These contain information that maps the code object executable code and data to the source language constructs. It can be used by tools such as debuggers and profilers.

This section defines the AMDGPU target specific DWARF. It applies to DWARF Version 4 and 5.

Overview

The AMDGPU has several features that require additional DWARF functionality in order to support optimized code.

A single code object can contain code for kernels that have different wave sizes. The vector registers and some scalar registers are based on the wave size. AMDGPU defines distinct DWARF registers for each wave size. This simplifies the consumer of the DWARF so that each register has a fixed size, rather than being dynamic according to the wave mode. Similarly, distinct DWARF registers are defined for those registers that vary in size according to the process address size. This allows a consumer to treat a specific AMDGPU target as a single architecture regardless of how it is configured. The compiler explicitly specifies the registers that match the mode of the code it is generating.

AMDGPU optimized code may spill vector registers to non-global address space memory, and this spilling may be done only for lanes that are active on entry to the subprogram. To support this, a location description that can be created as a masked select is required.

Since the active lane mask may be held in a register, a way to get the value of a register on entry to a subprogram is required. To support this an operation that returns the caller value of a register as specified by the Call Frame Information (see Call Frame Information) is required.

Current DWARF uses an empty expression to indicate an undefined location description. Since the masked select composite location description operation takes more than one location description, it is necessary to have an explicit way to specify an undefined location description. Otherwise it is not possible to specify that a particular one of the input location descriptions is undefined.

CFI describes restoring callee saved registers that are spilled. Currently CFI only allows a location description that is a register, memory address, or implicit location description. AMDGPU optimized code may spill scalar registers into portions of vector registers. This requires extending CFI to allow any location description.

The vector registers of the AMDGPU are represented as their full wave size,meaning the wave size times the dword size. This reflects the actual hardware,and allows the compiler to generate DWARF for languages that map a thread to thecomplete wave. It also allows more efficient DWARF to be generated to describethe CFI as only a single expression is required for the whole vector register,rather than a separate expression for each lane’s dword of the vector register.It also allows the compiler to produce DWARF that indexes the vector register ifit spills scalar registers into portions of a vector registers.

Since DWARF stack value entries have a base type and AMDGPU registers are avector of dwords, the ability to specify that a base type is a vector isrequired.

If the source language is mapped onto the AMDGPU wavefronts in a SIMT manner, then the variable DWARF location expressions must compute the location for a single lane of the wavefront. Therefore, a DWARF operator is required to denote the current lane, much like DW_OP_push_object_address denotes the current object. The DW_OP_piece operators only allow literal indices. Therefore, a composite location description is required that can take a computed index of a location description (such as a vector register).

If the source language is mapped onto the AMDGPU wavefronts in a SIMT manner, the compiler can use the AMDGPU execution mask register to control which lanes are active. To describe the conceptual location of non-active lanes a DWARF expression is needed that can compute a per lane PC. For efficiency, this is done for the wave as a whole. This expression benefits from having a masked select composite location description operation. This requires an attribute for the source location of each lane. The AMDGPU may update the execution mask for whole wave operations and so needs an attribute that computes the current active lane mask.

AMDGPU needs to be able to describe addresses that are in different kinds of memory. Optimized code may need to describe a variable that resides in pieces that are in different kinds of storage, which may include parts of registers, memory that is in a mixture of memory kinds, implicit values, or be undefined. DWARF has the concept of segment addresses. However, the segment cannot be specified within a DWARF expression, which is only able to specify the offset portion of a segment address. The segment index is only provided by the entity that specifies the DWARF expression. Therefore, the segment index is a property that can only be put on complete objects, such as a variable. That makes it only suitable for describing an entity (such as variable or subprogram code) that is in a single kind of memory. Therefore, AMDGPU uses the DWARF concept of address spaces. For example, a variable may be allocated in a register that is partially spilled to the call stack which is in the private address space, and partially spilled to the local address space.

DWARF uses the concept of an address in many expression operators but does not define how it relates to address spaces. For example, DW_OP_push_object_address pushes the address of an object. Other contexts implicitly push an address on the stack before evaluating an expression. For example, the DW_AT_use_location attribute of the DW_TAG_ptr_to_member_type. The expression that uses the address needs to do so in a general way and not be dependent on the address space of the address. For example, a pointer to member value may want to be applied to an object that may reside in any address space.

The number of registers and the cost of memory operations are much higher for AMDGPU than for a typical CPU. The compiler attempts to optimize whole variables and arrays into registers. Currently DWARF only allows DW_OP_push_object_address and related operations to work with a global memory location. To support AMDGPU optimized code it is required to generalize DWARF to allow any location description to be used. This allows the use of registers, or composite location descriptions that may be a mixture of memory, registers, or even implicit values.

Allowing location descriptions to be entries on the DWARF stack allows them to compose naturally. It allows objects to be located in any kind of memory address space, in registers, be implicit values, be undefined, or a composite of any of these.

By extending DWARF carefully, all existing DWARF expressions can retain their current semantic meaning. DWARF has implicit conversions that convert from a value that is treated as an address in the default address space to a memory location description. This can be extended to allow a default address space memory location description to be implicitly converted back to its address value. To allow composition of composite location descriptions, an explicit operator that indicates the end is required. This can be implied if the end of a DWARF expression is reached, allowing current DWARF expressions to remain legal.

The DW_OP_plus and DW_OP_minus operations can be defined to operate on a memory location description in the default target architecture address space and a generic type, and produce a memory location description. This allows them to continue to be used to offset an address. To generalize offsetting to any location description, including location descriptions that describe when bytes are in registers, are implicit, or a composite of these, the DW_OP_LLVM_offset and DW_OP_LLVM_bit_offset operations are added. These do not perform wrapping, which would be hard to define for location descriptions of non-memory kinds. This allows DW_OP_push_object_address to push a location description that may be in a register, or be an implicit value, and the DWARF expression of DW_TAG_ptr_to_member_type can contain DW_OP_LLVM_offset to offset within it. DW_OP_LLVM_bit_offset generalizes DWARF to work with bit fields.

The DWARF DW_OP_xderef* operations allow a value to be converted into an address of a specified address space which is then read. But they provide no way to create a memory location description for an address in a non-default address space. For example, AMDGPU variables can be allocated in the local address space at a fixed address. It is required to have an operation to create an address in a specific address space that can be used to define the location description of the variable. Defining this operation to produce a location description allows the size of addresses in an address space to be larger than the generic type.

If an operation had to produce a value that can be implicitly converted to a memory location description, then it would be limited to the size of the generic type, which matches the size of the default address space. Its value would be unspecified and likely not match any value in the actual program. By making the result a location description, it allows a consumer great freedom in how it implements it. The implicit conversion back to a value can be limited only to the default address space to maintain compatibility.

Similarly, DW_OP_breg* treats the register as containing an address in the default address space. It is required to be able to specify the address space of the register value.

Almost all uses of addresses in DWARF are limited to defining location descriptions, or to be dereferenced to read memory. The exception is DW_CFA_val_offset, which uses the address to set the value of a register. By defining the CFA DWARF expression as being a memory location description, it can maintain what address space it is in, and that can be used to convert the offset address back to an address in that address space. (An alternative is to define DW_CFA_val_offset to implicitly use the default address space, and add another operation that specifies the address space.)

This approach allows all existing DWARF to have identical semantics. It allows the compiler to explicitly specify the address space it is using. For example, a compiler could choose to access private memory in a swizzled manner when mapping a source language to a wave in a SIMT manner, or to access it in an unswizzled manner if mapping the same language with the wave being the thread. It also allows the compiler to mix the address spaces it uses to access private memory. For example, for SIMT it can still spill entire vector registers in an unswizzled manner, while using swizzled accesses for SIMT variables. This approach allows memory location descriptions for different address spaces to be combined using the regular DW_OP_piece operators.

Location descriptions are an abstraction of storage; they give freedom to the consumer on how to implement them. They allow the address space to encode lane information so they can be used to read memory with only the memory description and no extra arguments. The same set of operations can operate on locations independent of their kind of storage. DW_OP_deref can therefore be used on any storage kind. DW_OP_xderef is unnecessary except as a more compact way to convert a segment address and then dereference it.

Several approaches were considered, and the one proposed appears to be the cleanest and offers the greatest improvement of DWARF's ability to support optimized code. Examining the gdb debugger and LLVM compiler, it appears to require only modest changes as they both already have to support general use of location descriptions. It is anticipated that will be the case for other debuggers and compilers.

The following provides the definitions for the additional operators, as well as clarifying how existing expression operators, CFI operators, and attributes behave with respect to generalized location descriptions that support address spaces. It has been defined such that it is backwards compatible with DWARF 5. The definitions are intended to fully define well-formed DWARF in a consistent style. Some sections are organized to mirror the DWARF 5 specification structure, with non-normative text shown in italics.

Language Names

Language codes defined for use with the DW_AT_language attribute are defined in AMDGPU DWARF Language Names.

AMDGPU DWARF Language Names

  Language Name    | Code   | Default Lower Bound | Description
  DW_LANG_LLVM_HIP | 0x8100 | 0                   | AMD HIP Language. See [HIP].

The DW_LANG_LLVM_HIP language can be supported by extending the C++ language.

Register Mapping

DWARF registers are encoded as numbers, which are mapped to architecture registers. The mapping for AMDGPU is defined in AMDGPU DWARF Register Mapping.

AMDGPU DWARF Register Mapping

  DWARF Register | AMDGPU Register | Bit Size | Description
  0              | PC_32           | 32       | Program Counter (PC) when executing in a 32-bit process address space. Used in the CFI to describe the PC of the calling frame.
  1              | EXEC_MASK_32    | 32       | Execution Mask Register when executing in wave 32 mode.
  2-15           | Reserved        |          |
  16             | PC_64           | 64       | Program Counter (PC) when executing in a 64-bit process address space. Used in the CFI to describe the PC of the calling frame.
  17             | EXEC_MASK_64    | 64       | Execution Mask Register when executing in wave 64 mode.
  18-31          | Reserved        |          |
  32-95          | SGPR0-SGPR63    | 32       | Scalar General Purpose Registers.
  96-127         | Reserved        |          |
  128-511        | Reserved        |          |
  512-1023       | Reserved        |          |
  1024-1087      | Reserved        |          |
  1088-1129      | SGPR64-SGPR105  | 32       | Scalar General Purpose Registers.
  1130-1535      | Reserved        |          |
  1536-1791      | VGPR0-VGPR255   | 32*32    | Vector General Purpose Registers when executing in wave 32 mode.
  1792-2047      | Reserved        |          |
  2048-2303      | AGPR0-AGPR255   | 32*32    | Vector Accumulation Registers when executing in wave 32 mode.
  2304-2559      | Reserved        |          |
  2560-2815      | VGPR0-VGPR255   | 64*32    | Vector General Purpose Registers when executing in wave 64 mode.
  2816-3071      | Reserved        |          |
  3072-3327      | AGPR0-AGPR255   | 64*32    | Vector Accumulation Registers when executing in wave 64 mode.
  3328-3583      | Reserved        |          |

The vector registers are represented as the full size for the wavefront. They are organized as consecutive dwords (32-bits), one per lane, with the dword at the least significant bit position corresponding to lane 0 and so forth. DWARF location expressions involving the DW_OP_LLVM_offset and DW_OP_LLVM_push_lane operations are used to select the part of the vector register corresponding to the lane that is executing the current thread of execution in languages that are implemented using a SIMD or SIMT execution model.
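The lane layout above can be sketched as follows (illustrative Python, not part of LLVM; the function name is hypothetical):

```python
# Illustrative sketch of the vector register layout described above:
# consecutive dwords, one per lane, with lane 0 at the least
# significant bit position.
DWORD_BITS = 32

def lane_dword_bit_offset(lane: int, wavefront_size: int) -> int:
    """Bit offset of `lane`'s dword within the wave-sized register view."""
    assert 0 <= lane < wavefront_size
    return lane * DWORD_BITS

# A wave 64 VGPR is 64 * 32 = 2048 bits wide; lane 5's dword starts
# at bit offset 160.
assert lane_dword_bit_offset(5, 64) == 160
assert lane_dword_bit_offset(0, 32) == 0
```

This is the offset a DW_OP_LLVM_push_lane followed by a scaled DW_OP_LLVM_bit_offset would compute to select the focused lane's dword.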

If the wavefront size is 32 lanes then the wave 32 mode register definitions are used. If the wavefront size is 64 lanes then the wave 64 mode register definitions are used. Some AMDGPU targets support executing in both wave 32 and wave 64 mode. The register definitions corresponding to the wave mode of the generated code will be used.

If code is generated to execute in a 32-bit process address space then the 32-bit process address space register definitions are used. If code is generated to execute in a 64-bit process address space then the 64-bit process address space register definitions are used. The amdgcn target only supports the 64-bit process address space.

Address Class Mapping

DWARF address classes are used for languages with the concept of memory address spaces. They are used in the DW_AT_address_class attribute for pointer type, reference type, subroutine, and subroutine type debugger information entries (DIEs).

The address class mapping for AMDGPU is defined in AMDGPU DWARF Address Class Mapping.

AMDGPU DWARF Address Class Mapping

  DWARF Address Class Name | Value | AMDGPU Address Space
  DW_ADDR_none             | 0x00  | Generic (Flat)
  DW_ADDR_AMDGPU_global    | 0x01  | Global
  DW_ADDR_AMDGPU_region    | 0x02  | Region (GDS)
  DW_ADDR_AMDGPU_local     | 0x03  | Local (group/LDS)
  DW_ADDR_AMDGPU_constant  | 0x04  | Global
  DW_ADDR_AMDGPU_private   | 0x05  | Private (Scratch)

See Address Spaces for information on the AMDGPU address spaces, including address size and NULL value.

For AMDGPU, the address class encodes the address class as declared in the source language type.

For AMDGPU, if no DW_AT_address_class attribute is present, then the DW_ADDR_none address class is used.

Note

The DW_ADDR_none default was defined as Generic and not Global to match the LLVM address space ordering. This ordering was chosen to better support CUDA-like languages such as HIP that do not have address spaces in the language type system, but do allow variables to be allocated in different address spaces. So effectively all CUDA and HIP source language addresses are generic.

Note

Currently DWARF defines address class values as architecture specific. It is unclear how language specific address spaces are intended to be represented in DWARF.

For example, OpenCL defines address spaces for global, local, constant, and private. These are part of the type system and are modifiers of pointer types. In addition, OpenCL defines generic pointers that can reference either the global, local, or private address spaces. To support the OpenCL language the debugger would want to support casting pointers between the generic and other address spaces, and possibly using pointer casting to form an address for a specific address space out of an integral value.

The method used to dereference a pointer type or reference type value is defined in DWARF expressions using DW_OP_xderef*, which uses an architecture specific address space.

DWARF defines the DW_AT_address_class attribute on pointer types and reference types. It specifies the method to use to dereference them. Why is the value of this not the same as the address space value used in DW_OP_xderef*, since in both cases it is architecture specific and the architecture presumably will use the same set of methods to dereference pointers in both cases?

Since DW_AT_address_class uses an architecture specific value, it cannot in general capture the source language address space type modifier concept. On some architectures all source language address space modifiers may actually use the same method for dereferencing pointers.

One possibility is for DWARF to add a DW_TAG_LLVM_address_class_type type modifier that can be applied to a pointer type and reference type. The DW_AT_address_class attribute could be re-defined to not be architecture specific and instead define generalized language values that will support OpenCL and other languages using address spaces. The DW_AT_address_class could be defined to not be applied to pointer or reference types, but instead only to the DW_TAG_LLVM_address_class_type type modifier entry.

If a pointer type or reference type is not modified by DW_TAG_LLVM_address_class_type, or if DW_TAG_LLVM_address_class_type has no DW_AT_address_class attribute, then the pointer type or reference type would be defined to use the DW_ADDR_none address class as currently. Since modifiers can be chained, it would need to be defined whether multiple DW_TAG_LLVM_address_class_type modifiers are legal, and if so whether the outermost one takes precedence.

A target implementation that supports multiple address spaces would need to map DW_ADDR_none appropriately to support CUDA-like languages that have no address classes in the type system, but do support variable allocation in address spaces. See the above note that describes why AMDGPU chose to make DW_ADDR_none map to the Generic AMDGPU address space and not the Global address space.

An alternative would be to define DW_ADDR_none as being the global address class and then change DW_ADDR_global to DW_ADDR_generic. Compilers generating DWARF for CUDA-like languages would then have to define every CUDA-like language pointer type or reference type using DW_TAG_LLVM_address_class_type with a DW_AT_address_class attribute of DW_ADDR_generic to match the language semantics. The AMDGPU alternative avoids needing to do this and seems to fit better into how Clang and LLVM have added support for the CUDA-like languages on top of existing C++ language support.

A new DW_AT_address_space attribute could be defined that can be applied to pointer type, reference type, subroutine, and subroutine type to describe how objects having the given type are dereferenced or called (the role that DW_AT_address_class currently provides). The values of DW_AT_address_space would be architecture specific and the same as used in DW_OP_xderef*.

Address Space Mapping

DWARF address spaces are used in location expressions to describe the memory space where data resides. Address spaces correspond to a target specific memory space and are not tied to any source language concept.

The AMDGPU address space mapping is defined in AMDGPU DWARF Address Space Mapping.

AMDGPU DWARF Address Space Mapping

  DWARF Address Space Name            | Value     | Address Byte Size (64-bit / 32-bit process) | AMDGPU Address Space | Notes
  DW_ASPACE_none                      | 0x00      | 8 / 4 | Global            | default address space
  DW_ASPACE_AMDGPU_generic            | 0x01      | 8 / 4 | Generic (Flat)    |
  DW_ASPACE_AMDGPU_region             | 0x02      | 4 / 4 | Region (GDS)      |
  DW_ASPACE_AMDGPU_local              | 0x03      | 4 / 4 | Local (group/LDS) |
  Reserved                            | 0x04      |       |                   |
  DW_ASPACE_AMDGPU_private_lane       | 0x05      | 4 / 4 | Private (Scratch) | focused lane
  DW_ASPACE_AMDGPU_private_wave       | 0x06      | 4 / 4 | Private (Scratch) | unswizzled wave
  Reserved                            | 0x07-0x1F |       |                   |
  DW_ASPACE_AMDGPU_private_lane<0-63> | 0x20-0x5F | 4 / 4 | Private (Scratch) | specific lane

See Address Spaces for information on the AMDGPU address spaces, including address size and NULL value.

The DW_ASPACE_none address space is the default address space used in DWARF operations that do not specify an address space. It therefore has to map to the global address space so that the DW_OP_addr* and related operations can refer to addresses in the program code.

The DW_ASPACE_AMDGPU_generic address space allows location expressions to specify the flat address space. If the address corresponds to an address in the local address space then it corresponds to the wave that is executing the focused thread of execution. If the address corresponds to an address in the private address space then it corresponds to the lane that is executing the focused thread of execution for languages that are implemented using a SIMD or SIMT execution model.

Note

CUDA-like languages such as HIP that do not have address spaces in the language type system, but do allow variables to be allocated in different address spaces, will need to explicitly specify the DW_ASPACE_AMDGPU_generic address space in the DWARF operations as the default address space is the global address space.

The DW_ASPACE_AMDGPU_local address space allows location expressions to specify the local address space corresponding to the wave that is executing the focused thread of execution.

The DW_ASPACE_AMDGPU_private_lane address space allows location expressions to specify the private address space corresponding to the lane that is executing the focused thread of execution for languages that are implemented using a SIMD or SIMT execution model.

The DW_ASPACE_AMDGPU_private_wave address space allows location expressions to specify the unswizzled private address space corresponding to the wave that is executing the focused thread of execution. The wave view of private memory is the per wave unswizzled backing memory layout defined in Address Spaces, such that address 0 corresponds to the first location for the backing memory of the wave (namely the address is not offset by wavefront-scratch-base). So to convert from a DW_ASPACE_AMDGPU_private_lane to a DW_ASPACE_AMDGPU_private_wave segment address, perform the following:

  private-address-wave =
      ((private-address-lane / 4) * wavefront-size * 4) +
      (wavefront-lane-id * 4) + (private-address-lane % 4)

If the DW_ASPACE_AMDGPU_private_lane segment address is dword aligned and the start of the dwords for each lane starting with lane 0 is required, then this simplifies to:

  private-address-wave = private-address-lane * wavefront-size

A compiler can use this address space to read a complete spilled vector register back into a complete vector register in the CFI. The frame pointer can be a private lane segment address which is dword aligned, which can be shifted to multiply by the wave size, and then used to form a private wave segment address that gives a location for a contiguous set of dwords, one per lane, where the vector register dwords are spilled. The compiler knows the wave size since it generates the code. Note that the type of the address may have to be converted as the size of a private lane segment address may be smaller than the size of a private wave segment address.
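The lane-to-wave conversion above can be sketched in Python (illustrative only; the names mirror the pseudo-variables used in the formulas):

```python
# Sketch of the private lane -> private wave segment address
# conversion described above. `wavefront_size` is 32 or 64, and
# `lane` is the wavefront lane id of the focused lane.
def private_lane_to_wave(addr_lane: int, lane: int, wavefront_size: int) -> int:
    return ((addr_lane // 4) * wavefront_size * 4
            + lane * 4
            + addr_lane % 4)

def private_lane_to_wave_dword_aligned(addr_lane: int, wavefront_size: int) -> int:
    # Simplified form: dword aligned address, lane 0 dword start wanted.
    assert addr_lane % 4 == 0
    return addr_lane * wavefront_size

# The two forms agree for lane 0 and a dword aligned address:
assert private_lane_to_wave(8, 0, 64) == 512
assert private_lane_to_wave_dword_aligned(8, 64) == 512
```

The general form swizzles each lane's bytes into the per wave backing memory; the simplified form is the multiply-by-wave-size shift the text describes for reloading spilled vector registers.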

The DW_ASPACE_AMDGPU_private_lane<n> address space allows location expressions to specify the private address space corresponding to a specific lane. For example, this can be used when the compiler spills scalar registers to scratch memory, with each scalar register being saved to a different lane's scratch memory.

Expressions

The following sections define the new DWARF expression operators used by AMDGPU, as well as clarifying the extensions to already existing DWARF 5 operations.

DWARF expressions describe how to compute a value or specify a location description. An expression is encoded as a stream of operations, each consisting of an opcode followed by zero or more literal operands. The number of operands is implied by the opcode.

Operations represent a postfix operation on a simple stack machine. They can act on entries on the stack, including adding entries and removing entries. If the kind of a stack entry does not match the kind required by the operation, and is not implicitly convertible to the required kind, then the DWARF expression is ill-formed.

Each stack entry can be one of two kinds: a value or a location description. Value stack entries are described in Value Operations and location description stack entries are described in Location Description Operations.

The evaluation of a DWARF expression can provide the location description of an object, the value of an array bound, the length of a dynamic string, the desired value itself, and so on.

The result of the evaluation of a DWARF expression is defined as:

  • If evaluation of the DWARF expression is on behalf of a DW_OP_call operation for a DW_AT_location attribute that belongs to a DW_TAG_dwarf_procedure debugging information entry, then all the entries on the stack are left, and execution of the DWARF expression containing the DW_OP_call operation continues.

  • If evaluation of the DWARF expression requires a location description, then:

    • If the stack is empty, an undefined location description is returned.

    • If the top stack entry is a location description, or can be converted to one, then the, possibly converted, location description is returned. Any other entries on the stack are discarded.

    • Otherwise the DWARF expression is ill-formed.

Note

Could define this case as returning an implicit location description as if the DW_OP_implicit operation is performed.

  • If evaluation of the DWARF expression requires a value, then:

    • If the top stack entry is a value, or can be converted to one, then the, possibly converted, value is returned. Any other entries on the stack are discarded.

    • Otherwise the DWARF expression is ill-formed.
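The result rules above can be sketched as follows (illustrative Python; implicit conversions between values and location descriptions are omitted for brevity, and all names are hypothetical):

```python
# Sketch of the expression-result rules: a stack holds (kind, payload)
# pairs, with the top of the stack last. `want` is "location" or "value".
UNDEFINED_LOCATION = ("location", "undefined")

def expression_result(stack, want):
    if want == "location":
        if not stack:
            return UNDEFINED_LOCATION        # empty stack -> undefined location
        if stack[-1][0] == "location":
            return stack[-1]                 # other entries are discarded
        raise ValueError("ill-formed: top entry is not a location")
    # want == "value"
    if stack and stack[-1][0] == "value":
        return stack[-1]                     # other entries are discarded
    raise ValueError("ill-formed: top entry is not a value")

assert expression_result([], "location") == UNDEFINED_LOCATION
assert expression_result([("value", 1), ("location", "reg0")],
                         "location") == ("location", "reg0")
```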

Stack Operations

The following operations manipulate the DWARF stack. Operations that index the stack assume that the top of the stack (most recently added entry) has index 0. They allow the stack entries to be either a value or a location description.

If any stack entry accessed by a stack operation is an incomplete composite location description, then the DWARF expression is ill-formed.

Note

These operations now support stack entries that are values and location descriptions.

Note

If it is desired to also make them work with incomplete composite location descriptions, then it would need to be defined that the composite location storage specified by the incomplete composite location description is also replicated when a copy is pushed. This ensures that each copy of the incomplete composite location description can update the composite location storage it specifies independently.

  • DW_OP_dup

DW_OP_dup duplicates the stack entry at the top of the stack.

  • DW_OP_drop

DW_OP_drop pops the stack entry at the top of the stack and discards it.

  • DW_OP_pick

DW_OP_pick has a single unsigned 1-byte operand that is treated as an index I. A copy of the stack entry with index I is pushed onto the stack.

  • DW_OP_over

DW_OP_over pushes a copy of the stack entry with index 1.

This is equivalent to a DW_OP_pick 1 operation.

  • DW_OP_swap

DW_OP_swap swaps the top two stack entries. The entry at the top of the stack becomes the second stack entry, and the second stack entry becomes the top of the stack.

  • DW_OP_rot

DW_OP_rot rotates the first three stack entries. The entry at the top of the stack becomes the third stack entry, the second entry becomes the top of the stack, and the third entry becomes the second entry.
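The stack operations above can be sketched on a plain Python list (illustrative only; the top of the stack is the end of the list, and index 0 refers to the top as in the text):

```python
# Minimal sketch of the DWARF stack operations on a Python list.
def dup(s):  s.append(s[-1])            # DW_OP_dup
def drop(s): s.pop()                    # DW_OP_drop
def pick(s, i): s.append(s[-1 - i])     # DW_OP_pick, index 0 = top
def over(s): pick(s, 1)                 # DW_OP_over == DW_OP_pick 1
def swap(s): s[-1], s[-2] = s[-2], s[-1]                    # DW_OP_swap
def rot(s):  s[-1], s[-2], s[-3] = s[-2], s[-3], s[-1]      # DW_OP_rot

s = [1, 2, 3]           # 3 is the top of the stack
rot(s)                  # old top becomes the third entry
assert s == [3, 1, 2]
swap(s)
assert s == [3, 2, 1]
over(s)                 # copy of the second entry is pushed
assert s == [3, 2, 1, 2]
```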

Value Operations

Each value stack entry has a type and a value, and can represent a value of any supported base type of the target machine. The base type specifies the size and encoding of the value.

Note

It may be better to add an implicit pointer value kind that is produced when DW_OP_deref* retrieves the full contents of an implicit pointer location storage created by the DW_OP_implicit_pointer or DW_OP_LLVM_aspace_implicit_pointer operations.

Instead of a base type, value stack entries can have a distinguished generic type, which is an integral type that has the size of an address in the target architecture default address space on the target machine and unspecified signedness.

The generic type is the same as the unspecified type used for stack operations defined in DWARF Version 4 and before.

An integral type is a base type that has an encoding of DW_ATE_signed, DW_ATE_signed_char, DW_ATE_unsigned, DW_ATE_unsigned_char, DW_ATE_boolean, or any target architecture defined integral encoding in the inclusive range DW_ATE_lo_user to DW_ATE_hi_user.

Note

It is unclear if DW_ATE_address is an integral type. gdb does not seem to consider it integral.

  • DW_OP_LLVM_push_lane New

DW_OP_LLVM_push_lane pushes a value with the generic type that is the target architecture lane identifier of the thread of execution for which a user presented expression is currently being evaluated. For languages that are implemented using a SIMD or SIMT execution model this is the lane number that corresponds to the source language thread of execution upon which the user is focused. Otherwise this is the value 0.

For AMDGPU, the lane identifier returned by DW_OP_LLVM_push_lane corresponds to the hardware lane number, which is numbered from 0 to the wavefront size minus 1.

  • DW_OP_entry_value

DW_OP_entry_value pushes the value that the described location held upon entering the current subprogram.

It has two operands. The first is an unsigned LEB128 integer. The second is a block of bytes, with a length equal to the first operand, treated as a DWARF expression E.

E is evaluated as if it had been evaluated upon entering the current subprogram. E assumes no values are present on the DWARF stack initially and results in exactly one value being pushed on the DWARF stack when completed.

DW_OP_push_object_address is not meaningful inside of this DWARF operation.

If the result of E is a register location description (see Register Location Descriptions), DW_OP_entry_value pushes the value that register had upon entering the current subprogram. The value entry type is the target machine register base type. If the register value is undefined or the register location description bit offset is not 0, then the DWARF expression is ill-formed.

The register location description provides a more compact form for the case where the value was in a register on entry to the subprogram.

Otherwise, the expression result is required to be a value, and DW_OP_entry_value pushes that value.

The values needed to evaluate DW_OP_entry_value could be obtained in several ways. The consumer could suspend execution on entry to the subprogram, record values needed by DW_OP_entry_value expressions within the subprogram, and then continue; when evaluating DW_OP_entry_value, the consumer would use these recorded values rather than the current values. Or, when evaluating DW_OP_entry_value, the consumer could virtually unwind using the Call Frame Information (see Call Frame Information) to recover register values that might have been clobbered since the subprogram entry point.

Note

It is unclear why this operation is defined this way. If the expression is simply using existing variables then it is just a regular expression. It is unclear how the compiler instructs the consumer to create the saved copies of the variables on entry. It seems only the compiler knows how to do this. If the main purpose is only to read the entry value of a register using CFI, then it would be better to have an operation that explicitly does just that, such as DW_OP_LLVM_call_frame_entry_reg.

Location Description Operations

Information about the location of program objects is provided by location descriptions. Location descriptions specify the storage that holds the program objects, and a position within the storage.

A location storage is a linear stream of bits that can hold values. Each location storage has a size in bits and can be accessed using a zero-based bit offset. The ordering of bits within location storage uses the bit numbering and direction conventions that are appropriate to the current language on the target architecture.

Note

For AMDGPU bytes are ordered with least significant bytes first, and bits are ordered within bytes with least significant bits first.

There are five kinds of location storage: undefined, memory, register, implicit, and composite. Memory and register location storage corresponds to the target architecture memory address spaces and registers. Implicit location storage corresponds to fixed values that can only be read. Undefined location storage indicates no value is available and therefore cannot be read or written. Composite location storage allows a mixture of these, where some bits come from one kind of location storage and some from another kind of location storage.

Note

It may be better to add an implicit pointer location storage kind for DW_OP_implicit_pointer or DW_OP_LLVM_aspace_implicit_pointer.

Location description stack entries specify a location storage to which they refer, and a bit offset relative to the start of the location storage.

General Operations

  • DW_OP_LLVM_offset New

DW_OP_LLVM_offset pops two stack entries. The first must be an integral type value that is treated as a byte displacement D. The second must be a location description L.

It adds the value of D scaled by 8 (the byte size) to the bit offset of L, and pushes the updated L.

If the updated bit offset of L is less than 0 or greater than or equal to the size of the location storage specified by L, then the DWARF expression is ill-formed.

  • DW_OP_LLVM_offset_uconst New

DW_OP_LLVM_offset_uconst has a single unsigned LEB128 integer operand that is treated as a displacement D.

It pops one stack entry that must be a location description L. It adds the value of D scaled by 8 (the byte size) to the bit offset of L, and pushes the updated L.

If the updated bit offset of L is less than 0 or greater than or equal to the size of the location storage specified by L, then the DWARF expression is ill-formed.

This operation is supplied specifically to be able to encode more field displacements in two bytes than can be done with DW_OP_lit<n> DW_OP_LLVM_offset.

  • DW_OP_LLVM_bit_offset New

DW_OP_LLVM_bit_offset pops two stack entries. The first must be an integral type value that is treated as a bit displacement D. The second must be a location description L.

It adds the value of D to the bit offset of L, and pushes the updated L.

If the updated bit offset of L is less than 0 or greater than or equal to the size of the location storage specified by L, then the DWARF expression is ill-formed.
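The offsetting semantics can be sketched on a minimal location description (illustrative Python; a location description is reduced here to a storage size in bits plus a bit offset, and all names are hypothetical):

```python
# Sketch of DW_OP_LLVM_offset / DW_OP_LLVM_bit_offset semantics.
class Location:
    def __init__(self, storage_bits: int, bit_offset: int = 0):
        self.storage_bits = storage_bits
        self.bit_offset = bit_offset

    def bit_offset_op(self, d_bits: int) -> "Location":   # DW_OP_LLVM_bit_offset
        new = self.bit_offset + d_bits
        if new < 0 or new >= self.storage_bits:
            raise ValueError("ill-formed: offset outside location storage")
        return Location(self.storage_bits, new)

    def offset_op(self, d_bytes: int) -> "Location":      # DW_OP_LLVM_offset
        return self.bit_offset_op(d_bytes * 8)            # D scaled by 8

# Offsetting 4 bytes into a 2048-bit (wave 64 VGPR sized) storage:
assert Location(storage_bits=2048).offset_op(4).bit_offset == 32
```

Note that, per the text, there is no wrapping: exceeding the storage size raises instead of wrapping around.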

  • DW_OP_deref

The DW_OP_deref operation pops one stack entry that must be a location description L.

A value of the bit size of the generic type is retrieved from the location storage specified by L starting at the bit offset specified by L. The retrieved generic type value V is pushed on the stack.

If any bit of the value is retrieved from the undefined location storage, or the offset of any bit exceeds the size of the location storage specified by L, then the DWARF expression is ill-formed.

See Implicit Location Descriptions for special rules concerning implicit location descriptions created by the DW_OP_implicit_pointer and DW_OP_LLVM_aspace_implicit_pointer operations.

  • DW_OP_deref_size

DW_OP_deref_size has a single 1-byte unsigned integral constant treated as a byte result size S.

It pops one stack entry that must be a location description L.

A value of S scaled by 8 (the byte size) bits is retrieved from the location storage specified by L starting at the bit offset specified by L. The value V retrieved is zero-extended to the bit size of the generic type before being pushed onto the stack with the generic type.

If S is larger than the byte size of the generic type, if any bit of the value is retrieved from the undefined location storage, or if the offset of any bit exceeds the size of the location storage specified by L, then the DWARF expression is ill-formed.

See Implicit Location Descriptions for special rules concerning implicit location descriptions created by the DW_OP_implicit_pointer and DW_OP_LLVM_aspace_implicit_pointer operations.
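The retrieval and zero-extension rule for DW_OP_deref_size can be illustrated with a small sketch. The byte-aligned bytearray storage and the 64-bit generic type are assumptions for brevity; real consumers must handle arbitrary bit offsets:

```python
# Illustrative sketch of DW_OP_deref_size: read S bytes from a location
# storage at a bit offset and zero-extend to the generic type's bit size.

GENERIC_BITS = 64  # assumed generic type size

def deref_size(storage: bytearray, bit_offset: int, s_bytes: int) -> int:
    if s_bytes * 8 > GENERIC_BITS:
        raise ValueError("ill-formed: S larger than the generic type size")
    if bit_offset + s_bytes * 8 > len(storage) * 8:
        raise ValueError("ill-formed: read exceeds location storage size")
    assert bit_offset % 8 == 0, "sketch handles byte-aligned offsets only"
    start = bit_offset // 8
    raw = bytes(storage[start:start + s_bytes])
    # int.from_bytes yields an unsigned value, i.e. zero-extended.
    return int.from_bytes(raw, "little")

mem = bytearray(b"\x34\x12\xff\xff")
print(hex(deref_size(mem, 0, 2)))  # 0x1234
```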

  • DW_OP_deref_type

DW_OP_deref_type has two operands. The first is a 1-byte unsigned integral constant whose value S is the same as the size of the base type referenced by the second operand. The second operand is an unsigned LEB128 integer that represents the offset of a debugging information entry E in the current compilation unit, which must be a DW_TAG_base_type entry that provides the type of the result value.

It pops one stack entry that must be a location description L. A value of the bit size S is retrieved from the location storage specified by L starting at the bit offset specified by L. The retrieved result type value V is pushed on the stack.

If any bit of the value is retrieved from the undefined location storage, or if the offset of any bit exceeds the size of the specified location storage, then the DWARF expression is ill-formed.

See Implicit Location Descriptions for special rules concerning implicit location descriptions created by the DW_OP_implicit_pointer and DW_OP_LLVM_aspace_implicit_pointer operations.

While the size of the pushed value could be inferred from the base type definition, it is encoded explicitly into the operation so that the operation can be parsed easily without reference to the .debug_info section.

  • DW_OP_xderef Deprecated

DW_OP_xderef pops two stack entries. The first must be an integral type value that is treated as an address A. The second must be an integral type value that is treated as an address space identifier AS for those architectures that support multiple address spaces.

The operation is equivalent to performing DW_OP_swap; DW_OP_LLVM_form_aspace_address; DW_OP_deref. The retrieved generic type value V is left on the stack.

  • DW_OP_xderef_size Deprecated

DW_OP_xderef_size has a single 1-byte unsigned integral constant treated as a byte result size S.

It pops two stack entries. The first must be an integral type value that is treated as an address A. The second must be an integral type value that is treated as an address space identifier AS for those architectures that support multiple address spaces.

The operation is equivalent to performing DW_OP_swap; DW_OP_LLVM_form_aspace_address; DW_OP_deref_size S. The zero-extended retrieved generic type value V is left on the stack.

  • DW_OP_xderef_type Deprecated

DW_OP_xderef_type has two operands. The first is a 1-byte unsigned integral constant S whose value is the same as the size of the base type referenced by the second operand. The second operand is an unsigned LEB128 integer R that represents the offset of a debugging information entry E in the current compilation unit, which must be a DW_TAG_base_type entry that provides the type of the result value.

It pops two stack entries. The first must be an integral type value that is treated as an address A. The second must be an integral type value that is treated as an address space identifier AS for those architectures that support multiple address spaces.

The operation is equivalent to performing DW_OP_swap; DW_OP_LLVM_form_aspace_address; DW_OP_deref_type S R. The retrieved result type value V is left on the stack.

  • DW_OP_push_object_address

DW_OP_push_object_address pushes the location description L of the object currently being evaluated as part of evaluation of a user presented expression.

This object may correspond to an independent variable described by its own debugging information entry or it may be a component of an array, structure, or class whose address has been dynamically determined by an earlier step during user expression evaluation.

This operator provides explicit functionality (especially for arrays involving descriptors) that is analogous to the implicit push of the base address of a structure prior to evaluation of a DW_AT_data_member_location to access a data member of a structure.

  • DW_OP_call2, DW_OP_call4, DW_OP_call_ref

DW_OP_call2, DW_OP_call4, and DW_OP_call_ref perform DWARF procedure calls during evaluation of a DWARF expression or location description.

DW_OP_call2 and DW_OP_call4 have one operand that is a 2- or 4-byte unsigned offset, respectively, of a debugging information entry D in the current compilation unit.

DW_OP_call_ref has one operand that is a 4-byte unsigned value in the 32-bit DWARF format, or an 8-byte unsigned value in the 64-bit DWARF format, that is treated as an offset of a debugging information entry D in a .debug_info section, which may be contained in an executable or shared object file other than that containing the operator. For references from one executable or shared object file to another, the relocation must be performed by the consumer.

Operand interpretation of DW_OP_call2, DW_OP_call4, and DW_OP_call_ref is exactly like that for DW_FORM_ref2, DW_FORM_ref4, and DW_FORM_ref_addr, respectively.

If D has a DW_AT_location attribute, then the DWARF expression E corresponding to the current program location is selected.

Note

To allow DW_OP_call* to compute the location description for any variable or formal parameter regardless of whether the producer has optimized it to a constant, the following rule could be added:

Note

If D has a DW_AT_const_value attribute, then a DWARF expression E consisting of a DW_OP_implicit_value operation with the value of the DW_AT_const_value attribute is selected.

This would be consistent with DW_OP_implicit_pointer.

Alternatively, the use of DW_AT_const_value for DW_TAG_variable and DW_TAG_formal_parameter debugging information entries that are constants could be deprecated in favor of DW_AT_location with an implicit location description; then this rule would not be required.

Otherwise, an empty expression E is selected.

If D is a DW_TAG_dwarf_procedure debugging information entry, then E is evaluated using the same DWARF expression stack. Any existing stack entries may be accessed and/or removed in the evaluation of E, and the evaluation of E may add any new stack entries.

Values on the stack at the time of the call may be used as parameters by the called expression, and values left on the stack by the called expression may be used as return values, by prior agreement between the calling and called expressions.

Otherwise, E is evaluated on a separate DWARF stack and the resulting location description L is pushed on the DW_OP_call* operation’s stack.

  • DW_OP_LLVM_call_frame_entry_reg New

DW_OP_LLVM_call_frame_entry_reg has a single unsigned LEB128 integer operand that is treated as a target architecture register number R.

It pushes a location description L that holds the value of register R on entry to the current subprogram as defined by the Call Frame Information (see Call Frame Information).

If there is no Call Frame Information defined, then the default rules for the target architecture are used. If the register rule is undefined, then the undefined location description is pushed. If the register rule is same value, then a register location description for R is pushed.

Undefined Location Descriptions

The undefined location storage represents a piece or all of an object that is present in the source but not in the object code (perhaps due to optimization). Neither reading nor writing the undefined location storage is meaningful.

An undefined location description specifies the undefined location storage. There is no concept of the size of the undefined location storage, nor of a bit offset for an undefined location description. The DW_OP_LLVM_offset* operations leave an undefined location description unchanged. The DW_OP_piece operations can explicitly or implicitly specify an undefined location description, allowing any size and offset to be specified, and result in a part with all undefined bits.

  • DW_OP_LLVM_undefined New

DW_OP_LLVM_undefined pushes an undefined location description L.

Memory Location Descriptions

There is a memory location storage that corresponds to each of the target architecture linear memory address spaces. The size of each memory location storage corresponds to the range of the addresses in the address space.

It is target architecture defined how address space location storage maps to target architecture physical memory. For example, they may be independent memory, or more than one location storage may alias the same physical memory, possibly at different offsets and with different interleaving. The mapping may also be dictated by the source language address classes.

A memory location description specifies a memory location storage. The bit offset corresponds to an address in the address space scaled by 8 (the byte size). Bits accessed using a memory location description access the corresponding target architecture memory starting at the bit offset.

DW_ASPACE_none is defined as the target architecture default address space.

The target architecture default address space for AMDGPU is the global address space.

If a stack entry is required to be a location description, but it is a value with the generic type, then it is implicitly converted to a memory location description that specifies memory in the target architecture default address space with a bit offset equal to the value scaled by 8 (the byte size).

Note

If it is desired to allow any integral type value to be implicitly converted to a memory location description in the target architecture default address space:

Note

If a stack entry is required to be a location description, but it is a value with an integral type, then it is implicitly converted to a memory location description. The stack entry value is zero extended to the size of the generic type and the least significant generic type size bits are treated as a twos-complement unsigned value to be used as an address. The converted memory location description specifies memory location storage corresponding to the target architecture default address space with a bit offset equal to the address scaled by 8 (the byte size).

The implicit conversion could also be defined as target specific. For example, gdb checks if the value is an integral type. If it is not, it gives an error. Otherwise, gdb zero-extends the value to 64 bits. If the gdb target defines a hook function, then it is called and can modify the 64-bit value, possibly sign extending the original value. Finally, gdb treats the 64-bit value as a memory location address.

If a stack entry is required to be a location description, but it is an implicit pointer value IPV with the target architecture default address space, then it is implicitly converted to the location description specified by IPV. See Implicit Location Descriptions.

If a stack entry is required to be a value with a generic type, but it is a memory location description in the target architecture default address space with a bit offset that is a multiple of 8, then it is implicitly converted to a value with a generic type that is equal to the bit offset divided by 8 (the byte size).
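The two implicit conversions between generic-type values and default-address-space memory location descriptions can be sketched as below. The tuple representation of a memory location and the 64-bit generic type are illustrative assumptions:

```python
# Sketch of the implicit conversions between generic-type values and
# default-address-space memory location descriptions.
# The ("default_aspace", bit_offset) tuple shape is an assumption.

GENERIC_BITS = 64  # assumed generic type size

def value_to_memory_location(value: int):
    """Value with the generic type -> memory location, offset = value * 8."""
    bit_offset = (value % (1 << GENERIC_BITS)) * 8
    return ("default_aspace", bit_offset)

def memory_location_to_value(loc):
    """Default-address-space memory location -> generic-type value."""
    aspace, bit_offset = loc
    if aspace != "default_aspace" or bit_offset % 8 != 0:
        raise ValueError("no implicit conversion defined")
    return bit_offset // 8  # bit offset divided by the byte size

loc = value_to_memory_location(0x1000)
print(loc)                            # ('default_aspace', 32768)
print(memory_location_to_value(loc))  # 4096
```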

  • DW_OP_addr

DW_OP_addr has a single byte constant value operand, which has the size of the generic type, treated as an address A.

It pushes a memory location description L on the stack that specifies the memory location storage for the target architecture default address space with a bit offset equal to A scaled by 8 (the byte size).

If the DWARF is part of a code object, then A may need to be relocated. For example, in the ELF code object format, A must be adjusted by the difference between the ELF segment virtual address and the virtual address at which the segment is loaded.

  • DW_OP_addrx

DW_OP_addrx has a single unsigned LEB128 integer operand that is treated as a zero-based index into the .debug_addr section relative to the value of the DW_AT_addr_base attribute of the associated compilation unit. The address value A in the .debug_addr section has the size of the generic type.

It pushes a memory location description L on the stack that specifies the memory location storage for the target architecture default address space with a bit offset equal to A scaled by 8 (the byte size).

If the DWARF is part of a code object, then A may need to be relocated. For example, in the ELF code object format, A must be adjusted by the difference between the ELF segment virtual address and the virtual address at which the segment is loaded.

  • DW_OP_LLVM_form_aspace_address New

DW_OP_LLVM_form_aspace_address pops the top two stack entries. The first must be an integral type value that is treated as an address space identifier AS for those architectures that support multiple address spaces. The second must be an integral type value that is treated as an address A.

The address size S is defined as the address bit size of the target architecture’s address space that corresponds to AS.

A is adjusted by zero extending it to S bits and the least significant S bits are treated as a twos-complement unsigned value.

DW_OP_LLVM_form_aspace_address pushes a memory location description L that specifies the memory location storage that corresponds to AS, with a bit offset equal to the adjusted A scaled by 8 (the byte size).

If AS is not one of the values defined by the target architecture’s DW_ASPACE_* values, then the DWARF expression is ill-formed.

See Implicit Location Descriptions for special rules concerning implicit pointer values produced by dereferencing implicit location descriptions created by the DW_OP_implicit_pointer and DW_OP_LLVM_aspace_implicit_pointer operations.

The AMDGPU address spaces are defined in AMDGPU DWARF Address Space Mapping.
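The address adjustment performed by DW_OP_LLVM_form_aspace_address can be sketched as follows. The address-space table is a made-up example (not AMDGPU's real mapping), and the tuple shape of the memory location is an assumption:

```python
# Sketch of DW_OP_LLVM_form_aspace_address: zero-extend A to the address
# bit size S of address space AS, then scale by 8 to get a bit offset.
# The ASPACE_BITS table is hypothetical, not a real target's mapping.

ASPACE_BITS = {0: 64, 1: 32}  # e.g. 0 = default/global, 1 = a 32-bit space

def form_aspace_address(a: int, aspace: int):
    if aspace not in ASPACE_BITS:
        raise ValueError("ill-formed: unknown address space")
    s = ASPACE_BITS[aspace]
    adjusted = a & ((1 << s) - 1)  # keep least significant S bits, unsigned
    return (aspace, adjusted * 8)  # memory location: (storage, bit offset)

# A value wider than the 32-bit address space wraps to its low 32 bits.
print(form_aspace_address(0x1_0000_0010, 1))  # (1, 128)
```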

  • DW_OP_form_tls_address

DW_OP_form_tls_address pops one stack entry that must be an integral type value, and treats it as a thread-local storage address.

DW_OP_form_tls_address pushes a memory location description L for the target architecture default address space that corresponds to the thread-local storage address.

The meaning of the thread-local storage address is defined by the run-time environment. If the run-time environment supports multiple thread-local storage blocks for a single thread, then the block corresponding to the executable or shared library containing this DWARF expression is used.

Some implementations of C, C++, Fortran, and other languages support a thread-local storage class. Variables with this storage class have distinct values and addresses in distinct threads, much as automatic variables have distinct values and addresses in each function invocation. Typically, there is a single block of storage containing all thread-local variables declared in the main executable, and a separate block for the variables declared in each shared library. Each thread-local variable can then be accessed in its block using an identifier. This identifier is typically an offset into the block and pushed onto the DWARF stack by one of the DW_OP_const<n><x> operations prior to the DW_OP_form_tls_address operation. Computing the address of the appropriate block can be complex (in some cases, the compiler emits a function call to do it), and difficult to describe using ordinary DWARF location descriptions. Instead of forcing complex thread-local storage calculations into the DWARF expressions, the DW_OP_form_tls_address operation allows the consumer to perform the computation based on the run-time environment.

  • DW_OP_call_frame_cfa

DW_OP_call_frame_cfa pushes the memory location description L of the Canonical Frame Address (CFA) of the current function, obtained from the Call Frame Information (see Call Frame Information).

Although the value of DW_AT_frame_base can be computed using other DWARF expression operators, in some cases this would require an extensive location list because the values of the registers used in computing the CFA change during a subroutine. If the Call Frame Information is present, then it already encodes such changes, and it is space efficient to reference that.

  • DW_OP_fbreg

DW_OP_fbreg has a single signed LEB128 integer operand that is treated as a byte displacement D.

The DWARF expression E corresponding to the current program location is selected from the DW_AT_frame_base attribute of the current function and evaluated. The resulting memory location description L’s bit offset is updated as if the DW_OP_LLVM_offset D operation were applied. The updated L is pushed.

This is typically a stack pointer register plus or minus some offset.

  • DW_OP_breg0, DW_OP_breg1, …, DW_OP_breg31

The DW_OP_breg<n> operations encode the numbers of up to 32 registers, numbered from 0 through 31, inclusive. The register number R corresponds to the n in the operation name.

They have a single signed LEB128 integer operand that is treated as a byte displacement D.

The address space identifier AS is defined as the one corresponding to the target architecture’s default address space.

The address size S is defined as the address bit size of the target architecture’s address space corresponding to AS.

The contents of the register specified by R are retrieved as a twos-complement unsigned value and zero extended to S bits. D is added and the least significant S bits are treated as a twos-complement unsigned value to be used as an address A.

They push a memory location description L that specifies the memory location storage that corresponds to AS, with a bit offset equal to A scaled by 8 (the byte size).
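The address computation above can be sketched in a few lines. The register file, the 64-bit address size, and the memory location tuple are illustrative assumptions:

```python
# Sketch of DW_OP_breg<n>: register contents plus a byte displacement,
# truncated to the default address space's address bit size S.
# Register values and S are illustrative assumptions.

S = 64  # assumed address bit size of the default address space

def op_breg(registers, r: int, d: int):
    # Zero-extend the register value, add D, keep the low S bits.
    a = (registers[r] + d) & ((1 << S) - 1)
    return ("default_aspace", a * 8)  # bit offset = A scaled by 8

regs = {5: 0x7FFF_0000}
loc = op_breg(regs, 5, -16)  # address 0x7FFEFFF0, as a bit offset
print(loc)
```

DW_OP_bregx and DW_OP_LLVM_aspace_bregx follow the same computation, taking R (and, for the latter, AS) from operands and the stack instead.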

  • DW_OP_bregx

DW_OP_bregx has two operands. The first is an unsigned LEB128 integer that is treated as a register number R. The second is a signed LEB128 integer that is treated as a byte displacement D.

The action is the same as for DW_OP_breg<n> except that R is used as the register number and D is used as the byte displacement.

  • DW_OP_LLVM_aspace_bregx New

DW_OP_LLVM_aspace_bregx has two operands. The first is an unsigned LEB128 integer that is treated as a register number R. The second is a signed LEB128 integer that is treated as a byte displacement D. It pops one stack entry that is required to be an integral type value that is treated as an address space identifier AS for those architectures that support multiple address spaces.

The action is the same as for DW_OP_breg<n> except that R is used as the register number, D is used as the byte displacement, and AS is used as the address space identifier.

If AS is not one of the values defined by the target architecture’s DW_ASPACE_* values, then the DWARF expression is ill-formed.

Note

Could also consider adding DW_OP_aspace_breg0, DW_OP_aspace_breg1, …, DW_OP_aspace_breg31, which would save encoding size.

Register Location Descriptions

There is a register location storage that corresponds to each of the target architecture registers. The size of each register location storage corresponds to the size of the corresponding target architecture register.

A register location description specifies a register location storage. The bit offset corresponds to a bit position within the register. Bits accessed using a register location description access the corresponding target architecture register starting at the bit offset.

  • DW_OP_reg0, DW_OP_reg1, …, DW_OP_reg31

DW_OP_reg<n> operations encode the numbers of up to 32 registers, numbered from 0 through 31, inclusive. The target architecture register number R corresponds to the n in the operation name.

DW_OP_reg<n> pushes a register location description L that specifies the register location storage that corresponds to R, with a bit offset of 0.

  • DW_OP_regx

DW_OP_regx has a single unsigned LEB128 integer operand that is treated as a target architecture register number R.

DW_OP_regx pushes a register location description L that specifies the register location storage that corresponds to R, with a bit offset of 0.

These operations name a register location. To fetch the contents of a register, it is necessary to use DW_OP_regval_type, one of the register based addressing operations such as DW_OP_bregx, or DW_OP_deref* on a register location description.

Implicit Location Descriptions

Implicit location storage represents a piece or all of an object which has no actual location in the program but whose contents are nonetheless known, either as a constant or because they can be computed from other locations and values in the program.

An implicit location description specifies an implicit location storage. The bit offset corresponds to a bit position within the implicit location storage. Bits accessed using an implicit location description access the corresponding implicit storage value starting at the bit offset.

  • DW_OP_implicit_value

DW_OP_implicit_value has two operands. The first is an unsigned LEB128 integer treated as a byte size S. The second is a block of bytes with a length equal to S treated as a literal value V.

An implicit location storage LS is created with the literal value V and a size of S. An implicit location description L is pushed that specifies LS with a bit offset of 0.
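As a sketch, the effect of DW_OP_implicit_value on the stack can be modeled as below; the dict-based storage and location shapes are illustrative assumptions:

```python
# Sketch of DW_OP_implicit_value: create an implicit location storage
# holding a literal byte block and push a location description for it.
# The dict-based representation is an illustrative assumption.

def op_implicit_value(stack, value_bytes: bytes):
    storage = {"kind": "implicit",
               "bytes": value_bytes,
               "size_bits": len(value_bytes) * 8}
    # Push an implicit location description with a bit offset of 0.
    stack.append({"storage": storage, "bit_offset": 0})
    return stack

stack = []
op_implicit_value(stack, (42).to_bytes(4, "little"))
print(stack[-1]["storage"]["size_bits"])  # 32
```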

  • DW_OP_stack_value

DW_OP_stack_value pops one stack entry that must be a value treated as a literal value V.

An implicit location storage LS is created with the literal value V and a size equal to V’s base type size. An implicit location description L is pushed that specifies LS with a bit offset of 0.

The DW_OP_stack_value operation specifies that the object does not exist in memory but its value is nonetheless known and is at the top of the DWARF expression stack. In this form of location description, the DWARF expression represents the actual value of the object, rather than its location.

See Implicit Location Descriptions for special rules concerning implicit pointer values produced by dereferencing implicit location descriptions created by the DW_OP_implicit_pointer and DW_OP_LLVM_aspace_implicit_pointer operations.

Note

Since location descriptions are allowed on the stack, the DW_OP_stack_value operation no longer terminates the DWARF expression.

  • DW_OP_implicit_pointer

An optimizing compiler may eliminate a pointer, while still retaining the value that the pointer addressed. DW_OP_implicit_pointer allows a producer to describe this value.

DW_OP_implicit_pointer specifies that the object is a pointer to the target architecture default address space that cannot be represented as a real pointer, even though the value it would point to can be described. In this form of location description, the DWARF expression refers to a debugging information entry that represents the actual location description of the object to which the pointer would point. Thus, a consumer of the debug information would be able to access the dereferenced pointer, even when it cannot access the pointer itself.

DW_OP_implicit_pointer has two operands. The first is a 4-byte unsigned value in the 32-bit DWARF format, or an 8-byte unsigned value in the 64-bit DWARF format, that is treated as a debugging information entry reference R. The second is a signed LEB128 integer that is treated as a byte displacement D.

R is used as the offset of a debugging information entry E in a .debug_info section, which may be contained in an executable or shared object file other than that containing the operator. For references from one executable or shared object file to another, the relocation must be performed by the consumer.

The first operand interpretation is exactly like that for DW_FORM_ref_addr.

The address space identifier AS is defined as the one corresponding to the target architecture’s default address space.

The address size S is defined as the address bit size of the target architecture’s address space corresponding to AS.

An implicit location storage LS is created that has the bit size of S. An implicit location description L is pushed that specifies LS and has a bit offset of 0.

If a DW_OP_deref* operation pops a location description L’ and retrieves S’ bits where some retrieved bits come from LS such that either:

  • L’ is an implicit location description that specifies LS with bit offset 0, and S’ equals S.
  • L’ is a complete composite location description that specifies a canonical form composite location storage LS’. The bits retrieved all come from a single part P’ of LS’. P’ has a bit size of S and has an implicit location description PL’. PL’ specifies LS with a bit offset of 0.

Then the value V pushed by the DW_OP_deref* operation is an implicit pointer value IPV with an address space of AS, a debugging information entry of E, and a base type of T. If AS is the target architecture default address space, then T is the generic type. Otherwise, T is an architecture specific integral type with a bit size equal to S.

Otherwise, if a DW_OP_deref* operation is applied to a location description such that some retrieved bits come from LS, then the DWARF expression is ill-formed.

If IPV is either implicitly converted to a location description (only done if AS is the target architecture default address space) or used by DW_OP_LLVM_form_aspace_address (only done if the address space specified is AS), then the resulting location description is:

  • If E has a DW_AT_location attribute, the DWARF expression corresponding to the current program location is selected and evaluated from the DW_AT_location attribute. The expression result is the resulting location description RL.

  • If E has a DW_AT_const_value attribute, then an implicit location storage RLS is created from the DW_AT_const_value attribute’s value, with a size matching the size of the DW_AT_const_value attribute’s value. The resulting implicit location description RL specifies RLS with a bit offset of 0.

Note

If the use of DW_AT_const_value for variables and formal parameters were deprecated in favor of DW_AT_location with an implicit location description, then this rule would not be required.

  • Otherwise, the DWARF expression is ill-formed.

The bit offset of RL is updated as if the DW_OP_LLVM_offset D operation were applied.

If a DW_OP_stack_value operation pops a value that is the same as IPV, then it pushes a location description that is the same as L.

The DWARF expression is ill-formed if it accesses LS or IPV in any other manner.

The restrictions on how an implicit pointer location description created by DW_OP_implicit_pointer and DW_OP_LLVM_aspace_implicit_pointer, or an implicit pointer value created by DW_OP_deref*, can be used are to simplify the DWARF consumer.

  • DW_OP_LLVM_aspace_implicit_pointer New

DW_OP_LLVM_aspace_implicit_pointer has two operands that are the same as for DW_OP_implicit_pointer.

It pops one stack entry that must be an integral type value that is treated as an address space identifier AS for those architectures that support multiple address spaces.

The implicit location description L that is pushed is the same as for DW_OP_implicit_pointer except that the address space identifier used is AS.

If AS is not one of the values defined by the target architecture’s DW_ASPACE_* values, then the DWARF expression is ill-formed.

The debugging information entry referenced by a DW_OP_implicit_pointer or DW_OP_LLVM_aspace_implicit_pointer operation is typically a DW_TAG_variable or DW_TAG_formal_parameter entry whose DW_AT_location attribute gives a second DWARF expression or a location list that describes the value of the object, but the referenced entry may be any entry that contains a DW_AT_location or DW_AT_const_value attribute (for example, DW_TAG_dwarf_procedure). By using the second DWARF expression, a consumer can reconstruct the value of the object when asked to dereference the pointer described by the original DWARF expression containing the DW_OP_implicit_pointer or DW_OP_LLVM_aspace_implicit_pointer operation.

Composite Location Descriptions

A composite location storage represents an object or value which may be contained in part of another location storage, or contained in parts of more than one location storage.

Each part has a part location description L and a part bit size S. The bits of the part comprise S contiguous bits from the location storage specified by L, starting at the bit offset specified by L. All the bits must be within the size of the location storage specified by L or the DWARF expression is ill-formed.

A composite location storage can have zero or more parts. The parts are contiguous such that the zero-based location storage bit index will range over each part with no gaps between them. Therefore, the size of a composite location storage is the sum of the sizes of its parts. The DWARF expression is ill-formed if the size of the contiguous location storage is larger than the size of the memory location storage corresponding to the target architecture’s largest address space.

The canonical form of a composite location storage is computed by applying the following steps to a composite location storage:

  • If any part P has a composite location description L, it is replaced by a copy of the parts of the composite location storage specified by L that are selected by the bit size of P starting at the bit offset of L. The location description of the first copied part has its bit offset updated as necessary, and the last copied part has its bit size updated as necessary, to reflect the bits selected by P. This rule is applied repeatedly until no part has a composite location description.
  • If the size of any part is zero, it is removed.
  • If any adjacent parts P1 to Pn have location descriptions that specify the same location storage LS such that the bits selected form a contiguous portion of LS, then they are replaced by a single new part P’. P’ has a location description L that specifies LS with the same bit offset as P1’s location description, and a bit size equal to the sum of the bit sizes of P1 to Pn inclusive.

A composite location description specifies the canonical form of a composite location storage and a bit offset.
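The last two canonicalization steps can be sketched as below. Parts are modeled as (storage, bit offset, bit size) tuples; flattening of nested composites (the first step) is omitted for brevity, and this representation is an illustrative assumption:

```python
# Sketch of composite canonicalization: drop zero-sized parts and merge
# adjacent parts that select contiguous bits of the same location storage.
# The (storage, bit_offset, bit_size) tuple shape is an assumption.

def canonicalize(parts):
    result = []
    for storage, offset, size in parts:
        if size == 0:
            continue  # zero-sized parts are removed
        if result:
            ps, po, psz = result[-1]
            if ps == storage and po + psz == offset:
                result[-1] = (ps, po, psz + size)  # merge contiguous parts
                continue
        result.append((storage, offset, size))
    return result

parts = [("reg0", 0, 16), ("reg0", 16, 16), ("mem", 64, 0), ("mem", 128, 32)]
print(canonicalize(parts))  # [('reg0', 0, 32), ('mem', 128, 32)]
```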

There are operations that push a composite location description that specifies a composite location storage that is created by the operation.

There are other operations that allow a composite location storage, and a composite location description that specifies it, to be created incrementally. Each part is described by a separate operation. There may be one or more operations to create the final composite location storage and associated description. A series of such operations describes the parts of the composite location storage in the order that the associated part operations are executed.

To support incremental creation, a composite location description can be in an incomplete state. When an incremental operation operates on an incomplete composite location description, it adds a new part; otherwise it creates a new composite location description. The DW_OP_LLVM_piece_end operation explicitly makes an incomplete composite location description complete.

If the top stack entry is an incomplete composite location description after the execution of a DWARF expression has completed, it is converted to a complete composite location description.

If a stack entry is required to be a location description, but it is an incomplete composite location description, then the DWARF expression is ill-formed.

Note that a DWARF expression may arbitrarily compose composite location descriptions from any other location description, including other composite location descriptions.

The incremental composite location description operations are defined to be compatible with the definitions in DWARF 5 and earlier.

  • DW_OP_piece

DW_OP_piece has a single unsigned LEB128 integer that is treated as a byte size S.

The action is based on the context:

  • If the stack is empty, then an incomplete composite location description L is pushed that specifies a new composite location storage LS and has a bit offset of 0. LS has a single part P that specifies the undefined location description, and has a bit size of S scaled by 8 (the byte size).
  • If the top stack entry is an incomplete composite location description L, then the composite location storage LS that it specifies is updated to append a part that specifies an undefined location description, and has a bit size of S scaled by 8 (the byte size).
  • If the top stack entry is a location description or can be converted to one, then it is popped and treated as a part location description PL. Then:
    • If the stack is empty or the top stack entry is not an incomplete composite location description, then an incomplete composite location description L is pushed that specifies a new composite location storage LS. LS has a single part that specifies PL, and has a bit size of S scaled by 8 (the byte size).
    • Otherwise, the composite location storage LS specified by the top stack incomplete composite location description L is updated to append a part that specifies PL, and has a bit size of S scaled by 8 (the byte size).
  • Otherwise, the DWARF expression is ill-formed.

If LS is not in canonical form, it is updated to be in canonical form.
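The context-dependent action can be sketched as a small stack-machine fragment. This is a hedged model, not the LLVM implementation; stack entries are assumed to be either ("incomplete", parts) composites or plain location descriptions, with parts as (location, bit size) pairs.

```python
# Sketch of the DW_OP_piece action on a DWARF expression stack.

UNDEFINED = "undefined"

def op_piece(stack, byte_size):
    bit_size = byte_size * 8  # S scaled by 8 (the byte size)
    if not stack:
        # Empty stack: push an incomplete composite with one undefined part.
        stack.append(("incomplete", [(UNDEFINED, bit_size)]))
    elif isinstance(stack[-1], tuple) and stack[-1][0] == "incomplete":
        # Top is an incomplete composite: append an undefined part.
        stack[-1][1].append((UNDEFINED, bit_size))
    else:
        # Top is a location description: pop it as the part location PL.
        pl = stack.pop()
        if stack and isinstance(stack[-1], tuple) and stack[-1][0] == "incomplete":
            stack[-1][1].append((pl, bit_size))
        else:
            stack.append(("incomplete", [(pl, bit_size)]))
    return stack

# DW_OP_reg0; DW_OP_piece 4; DW_OP_piece 4 -> one composite, two 32-bit parts
stack = op_piece(op_piece(["reg0"], 4), 4)
print(stack)  # [('incomplete', [('reg0', 32), ('undefined', 32)])]
```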

Many compilers store a single variable in sets of registers, or store a variable partially in memory and partially in registers. DW_OP_piece provides a way of describing how large a part of a variable a particular DWARF location description refers to.

If a computed byte displacement is required, the DW_OP_LLVM_offset operation can be used to update the part location description.

  • DW_OP_bit_piece

DW_OP_bit_piece has two operands. The first is an unsigned LEB128 integer that is treated as the part bit size S. The second is an unsigned LEB128 integer that is treated as a bit displacement D.

The action is the same as for DW_OP_piece except that any part created has the bit size S, and the location description of any created part has its bit offset updated as if the DW_OP_LLVM_bit_offset D operation were applied.

If a computed bit displacement is required, the DW_OP_LLVM_bit_offset operation can be used to update the part location description.

Note

The bit offset operand is not needed as DW_OP_LLVM_bit_offset can be used on the part’s location description.

  • DW_OP_LLVM_piece_end New

If the top stack entry is an incomplete composite location description L, then it is updated to be a complete composite location description with the same parts. Otherwise, the DWARF expression is ill-formed.

  • DW_OP_LLVM_extend New

DW_OP_LLVM_extend has two operands. The first is an unsigned LEB128 integer that is treated as the element bit size S. The second is an unsigned LEB128 integer that is treated as a count C.

It pops one stack entry that must be a location description and is treated as the part location description PL.

A complete composite location description L is pushed that comprises C parts that each specify PL and have a bit size of S.

The DWARF expression is ill-formed if the element bit size or count are 0.
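A sketch of this action, using the same illustrative (location, bit size) part model as above; the names are assumptions, not the LLVM implementation.

```python
# Sketch of DW_OP_LLVM_extend: pop a part location description PL and
# push a complete composite of C parts, each selecting PL with bit size S.

def op_llvm_extend(stack, bit_size, count):
    if bit_size == 0 or count == 0:
        raise ValueError("ill-formed: element bit size and count must be non-zero")
    pl = stack.pop()  # must be a location description
    stack.append(("complete", [(pl, bit_size)] * count))
    return stack

# Extend one lane's location across a 4-element vector of 32-bit elements.
print(op_llvm_extend(["vgpr0[lane]"], 32, 4))
```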

  • DW_OP_LLVM_select_bit_piece New

DW_OP_LLVM_select_bit_piece has two operands. The first is an unsigned LEB128 integer that is treated as the element bit size S. The second is an unsigned LEB128 integer that is treated as a count C.

It pops three stack entries. The first must be an integral type value that is treated as a bit mask value M. The second must be a location description that is treated as the one-location description L1. The third must be a location description that is treated as the zero-location description L0.

A complete composite location description L is pushed that specifies a new composite location storage LS. LS comprises C parts that each specify a part location description PL and have a bit size of S. The PL for part N is defined as:

  • If the Nth least significant bit of M is a zero, then the PL for part N is the same as L0, otherwise it is the same as L1.
  • The PL for part N is updated as if the DW_OP_LLVM_bit_offset N*S operation was applied.

If LS is not in canonical form, it is updated to be in canonical form.

The DWARF expression is ill-formed if S or C are 0, or if the bit size of M is less than C.
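The mask-driven part selection can be sketched as below. This is an illustrative model under assumed names ("vgpr0", "spill_mem" are hypothetical locations); parts are (location, bit offset, bit size) tuples, with the bit offset reflecting the DW_OP_LLVM_bit_offset N*S update.

```python
# Sketch of DW_OP_LLVM_select_bit_piece: build a composite whose part N
# selects the one-location L1 if bit N of mask M is set, else the
# zero-location L0, with each part offset by N*S bits into its location.

def op_select_bit_piece(stack, bit_size, count):
    if bit_size == 0 or count == 0:
        raise ValueError("ill-formed: S and C must be non-zero")
    mask = stack.pop()      # M: integral bit mask value
    one_loc = stack.pop()   # L1: one-location description
    zero_loc = stack.pop()  # L0: zero-location description
    parts = []
    for n in range(count):
        base = one_loc if (mask >> n) & 1 else zero_loc
        # PL for part N is updated as if DW_OP_LLVM_bit_offset N*S applied.
        parts.append((base, n * bit_size, bit_size))
    stack.append(("complete", parts))
    return stack

# Lanes 0 and 2 active (M = 0b0101): active lanes were spilled to memory,
# inactive lanes remain in the vector register.
stack = ["vgpr0", "spill_mem", 0b0101]
print(op_select_bit_piece(stack, 32, 4))
```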

DW_OP_bit_piece is used instead of DW_OP_piece when the piece to be assembled into a value or assigned to is not byte-sized or is not at the start of the part location description.

Note

For AMDGPU:

  • In CFI expressions DW_OP_LLVM_select_bit_piece is used to describe unwinding vector registers that are spilled under the execution mask to memory: the zero-location description is the vector register, and the one-location description is the spilled memory location. The DW_OP_LLVM_form_aspace_address operation is used to specify the address space of the memory location description.
  • DW_OP_LLVM_select_bit_piece is used by the DW_AT_LLVM_lane_pc attribute expression where divergent control flow is controlled by the execution mask. An undefined location description together with DW_OP_LLVM_extend is used to indicate the lane was not active on entry to the subprogram.
Expression Operation Encodings

The following table gives the encoding of the DWARF expression operations added for AMDGPU.

AMDGPU DWARF Expression Operation Encodings

  Operation                            Code  Number of Operands  Notes
  DW_OP_LLVM_form_aspace_address       0xe7  0
  DW_OP_LLVM_push_lane                 0xea  0
  DW_OP_LLVM_offset                    0xe9  0
  DW_OP_LLVM_offset_uconst             TBD   1                   ULEB128 byte displacement
  DW_OP_LLVM_bit_offset                TBD   0
  DW_OP_LLVM_call_frame_entry_reg      TBD   1                   ULEB128 register number
  DW_OP_LLVM_undefined                 TBD   0
  DW_OP_LLVM_aspace_bregx              TBD   2                   ULEB128 register number, ULEB128 byte displacement
  DW_OP_LLVM_aspace_implicit_pointer   TBD   2                   4- or 8-byte offset of DIE, SLEB128 byte displacement
  DW_OP_LLVM_piece_end                 TBD   0
  DW_OP_LLVM_extend                    TBD   2                   ULEB128 bit size, ULEB128 count
  DW_OP_LLVM_select_bit_piece          TBD   2                   ULEB128 bit size, ULEB128 count

Debugging Information Entry Attributes

This section provides changes to existing debugger information attributes and defines attributes added by the AMDGPU target.

  • DW_AT_location

If the result of the DW_AT_location DWARF expression is required to be a location description, then it may have any kind of location description (see Location Description Operations).

  • DW_AT_const_value

Note

Could deprecate using the DW_AT_const_value attribute for DW_TAG_variable or DW_TAG_formal_parameter debugger information entries that are constants. Instead, DW_AT_location could be used with a DWARF expression that produces an implicit location description now that any location description can be used within a DWARF expression. This allows the DW_OP_call* operations to be used to push the location description of any variable regardless of how it is optimized.

  • DW_AT_frame_base

A DW_TAG_subprogram or DW_TAG_entry_point debugger information entry may have a DW_AT_frame_base attribute, whose value is a DWARF expression or location list that describes the frame base for the subroutine or entry point.

If the result of the DWARF expression is a register location description, then the DW_OP_deref operation is applied to compute the frame base memory location description in the target architecture default address space.

Note

This rule could be removed and require the producer to create the required location description directly using DW_OP_call_frame_cfa, DW_OP_fbreg, DW_OP_breg*, or DW_OP_LLVM_aspace_bregx. This would also then allow a target to implement the call frames within a large register.

Otherwise, the result of the DWARF expression is required to be a memory location description in any of the target architecture address spaces which is the frame base.

  • DW_AT_data_member_location

For a DW_AT_data_member_location attribute there are two cases:

  • If the value is an integer constant, it is the offset in bytes from the beginning of the containing entity. If the beginning of the containing entity has a non-zero bit offset then the beginning of the member entry has that same bit offset as well.

  • Otherwise, the value must be a DWARF expression or location list. The DWARF expression E corresponding to the current program location is selected. The location description of the beginning of the containing entity is pushed on the DWARF stack before E is evaluated. The result of the evaluation is the location description of the base of the member entry.

Note

The beginning of the containing entity can now be any location description and can be bit aligned.

  • DW_AT_use_location

The DW_TAG_ptr_to_member_type debugging information entry has a DW_AT_use_location attribute whose value is a DWARF expression or location list. The DWARF expression E corresponding to the current program location is selected. It is used to compute the location description of the member of the class to which the pointer to member entry points.

The method used to find the location description of a given member of a class or structure is common to any instance of that class or structure and to any instance of the pointer or member type. The method is thus associated with the type entry, rather than with each instance of the type.

The DW_AT_use_location description is used in conjunction with the location descriptions for a particular object of the given pointer to member type and for a particular structure or class instance.

Two values are pushed onto the DWARF expression stack before E is evaluated. The first value pushed is the value of the pointer to member object itself. The second value pushed is the location description of the base of the entire structure or union instance containing the member whose address is being calculated.

  • DW_AT_data_location

The DW_AT_data_location attribute may be used with any type that provides one or more levels of hidden indirection and/or run-time parameters in its representation. Its value is a DWARF expression E which computes the location description of the data for an object. When this attribute is omitted, the location description of the data is the same as the location description of the object.

E will typically begin with DW_OP_push_object_address which loads the location description of the object which can then serve as a descriptor in subsequent calculation.

  • DW_AT_vtable_elem_location

An entry for a virtual function also has a DW_AT_vtable_elem_location attribute whose value is a DWARF expression or location list. The DWARF expression E corresponding to the current program location is selected. The location description of the object of the enclosing type is pushed onto the expression stack before E is evaluated. The resulting location description is the slot for the function within the virtual function table for the enclosing class.

  • DW_AT_static_link

If a DW_TAG_subprogram or DW_TAG_entry_point debugger information entry is nested, it may have a DW_AT_static_link attribute, whose value is a DWARF expression or location list. The DWARF expression E corresponding to the current program location is selected. The result of evaluating E is the frame base memory location description of the relevant instance of the subroutine that immediately encloses the subroutine or entry point.

  • DW_AT_return_addr

A DW_TAG_subprogram, DW_TAG_inlined_subroutine, or DW_TAG_entry_point debugger information entry may have a DW_AT_return_addr attribute, whose value is a DWARF expression or location list. The DWARF expression E corresponding to the current program location is selected. The result of evaluating E is the location description for the place where the return address for the subroutine or entry point is stored.

Note

It is unclear why DW_TAG_inlined_subroutine has a DW_AT_return_addr attribute but not a DW_AT_frame_base or DW_AT_static_link attribute. Seems it would either have all of them or none. Since inlined subprograms do not have a frame it seems they would have none of these attributes.

  • DW_AT_LLVM_lanes New

For languages that are implemented using a SIMD or SIMT execution model, a DW_TAG_subprogram, DW_TAG_inlined_subroutine, or DW_TAG_entry_point debugger information entry may have a DW_AT_LLVM_lanes attribute whose value is an integer constant that is the number of lanes per thread.

If not present, the default value of 1 is used.

The DWARF is ill-formed if the value is 0.

  • DW_AT_LLVM_lane_pc New

For languages that are implemented using a SIMD or SIMT execution model, a DW_TAG_subprogram, DW_TAG_inlined_subroutine, or DW_TAG_entry_point debugging information entry may have a DW_AT_LLVM_lane_pc attribute whose value is a DWARF expression or location list. The DWARF expression E corresponding to the current program location is selected. The result of evaluating E is a location description that references a wave size vector of generic type elements. Each element holds the conceptual program location of the corresponding lane, where the least significant element corresponds to the first target architecture lane identifier and so forth. If the lane was not active when the subprogram was called, its element is an undefined location description.

DW_AT_LLVM_lane_pc allows the compiler to indicate conceptually where each lane of a SIMT thread is positioned even when it is in divergent control flow that is not active.

If not present, the thread is not being used in a SIMT manner, and the thread’s program location is used.

See AMDGPU DW_AT_LLVM_lane_pc for AMDGPU information.

  • DW_AT_LLVM_active_lane New

For languages that are implemented using a SIMD or SIMT execution model, a DW_TAG_subprogram, DW_TAG_inlined_subroutine, or DW_TAG_entry_point debugger information entry may have a DW_AT_LLVM_active_lane attribute whose value is a DWARF expression or location list. The DWARF expression E corresponding to the current program location is selected. The result of evaluating E is an integral value that is the mask of active lanes for the current program location. The Nth least significant bit of the mask corresponds to the Nth lane. If the bit is 1, the lane is active, otherwise it is inactive.

Some targets may update the target architecture execution mask for regions of code that must execute with different sets of lanes than the current active lanes. For example, some code must execute in whole wave mode. DW_AT_LLVM_active_lane allows the compiler to provide the means to determine the actual active lanes.

If not present and DW_AT_LLVM_lanes is greater than 1, then the target architecture execution mask is used.

See AMDGPU DW_AT_LLVM_active_lane for AMDGPU information.
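A minimal sketch of how a consumer might interpret the active-lane mask; the helper name is illustrative, not part of any debugger API.

```python
# Bit N of the integral value produced by the DW_AT_LLVM_active_lane
# expression tells whether lane N is active at the current location.

def active_lanes(mask, num_lanes):
    """Return the list of active lane numbers for a given mask."""
    return [n for n in range(num_lanes) if (mask >> n) & 1]

# A wave with DW_AT_LLVM_lanes = 8 where the expression evaluated to 0b10010110.
print(active_lanes(0b10010110, 8))  # [1, 2, 4, 7]
```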

  • DW_AT_LLVM_vector_size New

A base type V may have the DW_AT_LLVM_vector_size attribute whose value is an integer constant that is the vector size S.

The representation of a vector base type is as S contiguous elements, each one having the representation of a base type E that is the same as V without the DW_AT_LLVM_vector_size attribute.

If not present, the base type is not a vector.

The DWARF is ill-formed if S is not greater than 0.

Note

LLVM mentions a non-upstreamed debugger information entry that is intended to support vector types. However, that was not for a base type so it would not be suitable as the type of a stack value entry. Perhaps it could be replaced by using this attribute.

  • DW_AT_LLVM_augmentation New

A compilation unit may have a DW_AT_LLVM_augmentation attribute, whose value is an augmentation string.

The augmentation string allows users to indicate that there is additional target-specific information in the debugging information entries. For example, this might be information about the version of target-specific extensions that are being used.

If not present, or if the string is empty, then the compilation unit has no augmentation string.

Note

For AMDGPU, the augmentation string contains:

  1. [amd:v0.0]

The “vX.Y” specifies the major X and minor Y version number of the AMDGPU extensions used in the DWARF of the compilation unit. The version number conforms to [SEMVER].
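A hedged sketch of extracting the version from such an augmentation string; the "[amd:vX.Y]" form comes from the text above, while the helper name is an assumption.

```python
# Parse the major/minor version out of an "[amd:vX.Y]" augmentation string.
import re

def parse_amd_augmentation(s):
    """Extract (major, minor) from an '[amd:vX.Y]' augmentation string."""
    m = re.search(r"\[amd:v(\d+)\.(\d+)\]", s)
    if m is None:
        return None  # no AMDGPU augmentation present
    return int(m.group(1)), int(m.group(2))

print(parse_amd_augmentation("[amd:v0.0]"))  # (0, 0)
print(parse_amd_augmentation(""))            # None
```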

Attribute Encodings

The following table gives the encoding of the debugging information entry attributes added for AMDGPU.

AMDGPU DWARF Attribute Encodings

  Attribute Name             Value  Classes
  DW_AT_LLVM_lanes                  constant
  DW_AT_LLVM_lane_pc                exprloc, loclist
  DW_AT_LLVM_active_lane            exprloc, loclist
  DW_AT_LLVM_vector_size            constant
  DW_AT_LLVM_augmentation           string

Call Frame Information

DWARF Call Frame Information describes how an agent can virtually unwind call frames in a running process or core dump.

Note

AMDGPU conforms to the DWARF standard with additional support added for address spaces. Register unwind DWARF expressions are generalized to allow any location description, including composite and implicit location descriptions.

Structure of Call Frame Information

The register rules are:

  • undefined
  • A register that has this rule has no recoverable value in the previous frame. (By convention, it is not preserved by a callee.)
  • same value
  • This register has not been modified from the previous frame. (By convention, it is preserved by the callee, but the callee has not modified it.)
  • offset(N)
  • The previous value of this register is saved at the location description computed as if the DW_OP_LLVM_offset N operation is applied to the current CFA memory location description, where N is a signed byte offset.
  • val_offset(N)
  • The previous value of this register is the address in the address space of the memory location description computed as if the DW_OP_LLVM_offset N operation is applied to the current CFA memory location description, where N is a signed byte displacement.

If the register size does not match the size of an address in the address space of the current CFA memory location description, then the DWARF is ill-formed.

  • register(R)
  • The previous value of this register is stored in another register numbered R.

If the register sizes do not match, then the DWARF is ill-formed.

  • expression(E)
  • The previous value of this register is located at the location description produced by executing the DWARF expression E (see Expressions).
  • val_expression(E)
  • The previous value of this register is the value produced by executing the DWARF expression E (see Expressions).

If the value type size does not match the register size, then the DWARF is ill-formed.

  • architectural
  • The rule is defined externally to this specification by the augmenter.

A Common Information Entry holds information that is shared among many Frame Description Entries. There is at least one CIE in every non-empty .debug_frame section. A CIE contains the following fields, in order:

  • length (initial length)

A constant that gives the number of bytes of the CIE structure, not including the length field itself. The size of the length field plus the value of length must be an integral multiple of the address size specified in the address_size field.

  • CIE_id (4 or 8 bytes)

A constant that is used to distinguish CIEs from FDEs.

In the 32-bit DWARF format, the value of the CIE id in the CIE header is 0xffffffff; in the 64-bit DWARF format, the value is 0xffffffffffffffff.

  • version (ubyte)

A version number. This number is specific to the call frame information and is independent of the DWARF version number.

The value of the CIE version number is 4.

  • augmentation (sequence of UTF-8 characters)

A null-terminated UTF-8 string that identifies the augmentation to this CIE or to the FDEs that use it. If a reader encounters an augmentation string that is unexpected, then only the following fields can be read:

  • CIE: length, CIE_id, version, augmentation
  • FDE: length, CIE_pointer, initial_location, address_range

If there is no augmentation, this value is a zero byte.

The augmentation string allows users to indicate that there is additional target-specific information in the CIE or FDE which is needed to virtually unwind a stack frame. For example, this might be information about dynamically allocated data which needs to be freed on exit from the routine.

Because the .debug_frame section is useful independently of any .debug_info section, the augmentation string always uses UTF-8 encoding.

Note

For AMDGPU, the augmentation string contains:

  1. [amd:v0.0]

The “vX.Y” specifies the major X and minor Y version number of the AMDGPU extensions used in the DWARF of the compilation unit. The version number conforms to [SEMVER].

  • address_size (ubyte)

The size of a target address in this CIE and any FDEs that use it, in bytes. If a compilation unit exists for this frame, its address size must match the address size here.

Note

For AMDGPU:

  • segment_selector_size (ubyte)

The size of a segment selector in this CIE and any FDEs that use it, in bytes.

Note

For AMDGPU:

  • Does not use a segment selector so this is 0.
  • code_alignment_factor (unsigned LEB128)

A constant that is factored out of all advance location instructions (see Row Creation Instructions). The resulting value is (operand * code_alignment_factor).

Note

For AMDGPU:

  • 4 bytes.
  • data_alignment_factor (signed LEB128)

A constant that is factored out of certain offset instructions (see CFA Definition Instructions and Register Rule Instructions). The resulting value is (operand * data_alignment_factor).

Note

For AMDGPU:

  • 4 bytes.
  • return_address_register (unsigned LEB128)

An unsigned LEB128 constant that indicates which column in the rule table represents the return address of the function. Note that this column might not correspond to an actual machine register.

Note

For AMDGPU:

  • PC_32 for 32-bit processes and PC_64 for 64-bit processes defined in Register Mapping.
  • initial_instructions (array of ubyte)

A sequence of rules that are interpreted to create the initial setting of each column in the table.

The default rule for all columns before interpretation of the initial instructions is the undefined rule. However, an ABI authoring body or a compilation system authoring body may specify an alternate default value for any or all columns.

Note

For AMDGPU:

  • Since a subprogram A with fewer registers can be called from subprogram B that has more allocated, A will not change any of the extra registers as it cannot access them. Therefore, the default rule for all columns is same value.
  • padding (array of ubyte)

Enough DW_CFA_nop instructions to make the size of this entry match the length value above.

An FDE contains the following fields, in order:

  • length (initial length)

A constant that gives the number of bytes of the header and instruction stream for this function, not including the length field itself. The size of the length field plus the value of length must be an integral multiple of the address size.

  • CIE_pointer (4 or 8 bytes)

A constant offset into the .debug_frame section that denotes the CIE that is associated with this FDE.

  • initial_location (segment selector and target address)

The address of the first location associated with this table entry. If the segment_selector_size field of this FDE’s CIE is non-zero, the initial location is preceded by a segment selector of the given length.

  • address_range (target address)

The number of bytes of program instructions described by this entry.

  • instructions (array of ubyte)

A sequence of table defining instructions that are described in Call Frame Instructions.

  • padding (array of ubyte)

Enough DW_CFA_nop instructions to make the size of this entry match the length value above.

Call Frame Instructions

Some call frame instructions have operands that are encoded as DWARF expressions E (see Expressions). The DWARF operators that can be used in E have the following restrictions:

  • DW_OP_addrx, DW_OP_call2, DW_OP_call4, DW_OP_call_ref, DW_OP_const_type, DW_OP_constx, DW_OP_convert, DW_OP_deref_type, DW_OP_regval_type, and DW_OP_reinterpret operators are not allowed because the call frame information must not depend on other debug sections.

  • DW_OP_push_object_address is not allowed because there is no object context to provide a value to push.

  • DW_OP_call_frame_cfa and DW_OP_entry_value are not allowed because their use would be circular.

  • DW_OP_LLVM_call_frame_entry_reg is not allowed if evaluating E causes a circular dependency between DW_OP_LLVM_call_frame_entry_reg operators.

For example, the DWARF is ill-formed if a register R1 has a DW_CFA_def_cfa_expression instruction that evaluates a DW_OP_LLVM_call_frame_entry_reg operator that specifies register R2, and register R2 has a DW_CFA_def_cfa_expression instruction that evaluates a DW_OP_LLVM_call_frame_entry_reg operator that specifies register R1.

Call frame instructions to which these restrictions apply include DW_CFA_def_cfa_expression, DW_CFA_expression, and DW_CFA_val_expression.

Row Creation Instructions

These instructions are the same as in DWARF 5.

CFA Definition Instructions
  • DW_CFA_def_cfa

The DW_CFA_def_cfa instruction takes two unsigned LEB128 operands representing a register number R and a (non-factored) byte displacement D. The required action is to define the current CFA rule to be the memory location description that is the result of evaluating the DWARF expression DW_OP_bregx R, D.

Note

Could also consider adding DW_CFA_def_aspace_cfa and DW_CFA_def_aspace_cfa_sf which allow a register R, offset D, and address space AS to be specified. For example, that would save a byte of encoding over using DW_CFA_def_cfa R, D; DW_CFA_LLVM_def_cfa_aspace AS;.

  • DW_CFA_def_cfa_sf

The DW_CFA_def_cfa_sf instruction takes two operands: an unsigned LEB128 value representing a register number R and a signed LEB128 factored byte displacement D. The required action is to define the current CFA rule to be the memory location description that is the result of evaluating the DWARF expression DW_OP_bregx R, D*data_alignment_factor.

The action is the same as DW_CFA_def_cfa except that the second operand is signed and factored.

  • DW_CFA_def_cfa_register

The DW_CFA_def_cfa_register instruction takes a single unsigned LEB128 operand representing a register number R. The required action is to define the current CFA rule to be the memory location description that is the result of evaluating the DWARF expression DW_OP_constu AS; DW_OP_LLVM_aspace_bregx R, D where D and AS are the old CFA byte displacement and address space respectively.

If the subprogram has no current CFA rule, or the rule was defined by a DW_CFA_def_cfa_expression instruction, then the DWARF is ill-formed.

  • DW_CFA_def_cfa_offset

The DW_CFA_def_cfa_offset instruction takes a single unsigned LEB128 operand representing a (non-factored) byte displacement D. The required action is to define the current CFA rule to be the memory location description that is the result of evaluating the DWARF expression DW_OP_constu AS; DW_OP_LLVM_aspace_bregx R, D where R and AS are the old CFA register number and address space respectively.

If the subprogram has no current CFA rule, or the rule was defined by a DW_CFA_def_cfa_expression instruction, then the DWARF is ill-formed.

  • DW_CFA_def_cfa_offset_sf

The DW_CFA_def_cfa_offset_sf instruction takes a signed LEB128 operand representing a factored byte displacement D. The required action is to define the current CFA rule to be the memory location description that is the result of evaluating the DWARF expression DW_OP_constu AS; DW_OP_LLVM_aspace_bregx R, D*data_alignment_factor where R and AS are the old CFA register number and address space respectively.

If the subprogram has no current CFA rule, or the rule was defined by a DW_CFA_def_cfa_expression instruction, then the DWARF is ill-formed.

The action is the same as DW_CFA_def_cfa_offset except that the operand is signed and factored.

  • DW_CFA_LLVM_def_cfa_aspace New

The DW_CFA_LLVM_def_cfa_aspace instruction takes a single unsigned LEB128 operand representing an address space identifier AS for those architectures that support multiple address spaces. The required action is to define the current CFA rule to be the memory location description L that is the result of evaluating the DWARF expression DW_OP_constu AS; DW_OP_LLVM_aspace_bregx R, D where R and D are the old CFA register number and byte displacement respectively.

If AS is not one of the values defined by the target architecture’s DW_ASPACE_* values, then the DWARF expression is ill-formed.

  • DW_CFA_def_cfa_expression

The DW_CFA_def_cfa_expression instruction takes a single operand encoded as a DW_FORM_exprloc value representing a DWARF expression E. The required action is to define the current CFA rule to be the memory location description computed by evaluating E.

See Call Frame Instructions regarding restrictions on the DWARF expression operators that can be used in E.

If the result of evaluating E is not a memory location description with a bit offset that is a multiple of 8 (the byte size), then the DWARF is ill-formed.
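The CFA-definition instructions above each update only part of the current CFA rule. A minimal sketch, assuming the CFA rule can be modeled as a (register, displacement, address space) triple with illustrative register numbering:

```python
# Illustrative model of how DW_CFA_def_cfa_* instructions update the CFA rule.

class CFARule:
    def __init__(self, reg, disp, aspace=0):
        self.reg, self.disp, self.aspace = reg, disp, aspace

    def def_cfa(self, reg, disp):
        # DW_CFA_def_cfa: new register and displacement, default address space.
        self.reg, self.disp, self.aspace = reg, disp, 0

    def def_cfa_register(self, reg):
        # DW_CFA_def_cfa_register: keep the old displacement and address space.
        self.reg = reg

    def def_cfa_offset(self, disp):
        # DW_CFA_def_cfa_offset: keep the old register and address space.
        self.disp = disp

    def def_cfa_aspace(self, aspace):
        # DW_CFA_LLVM_def_cfa_aspace: keep the old register and displacement.
        self.aspace = aspace

cfa = CFARule(reg=32, disp=0)
cfa.def_cfa_offset(16)  # CFA = [register 32 + 16] in the default address space
cfa.def_cfa_aspace(1)   # same register and displacement, new address space
print((cfa.reg, cfa.disp, cfa.aspace))  # (32, 16, 1)
```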

Register Rule Instructions

Note

For AMDGPU:

  • DW_CFA_undefined

The DW_CFA_undefined instruction takes a single unsigned LEB128 operand that represents a register number R. The required action is to set the rule for the register specified by R to undefined.

  • DW_CFA_same_value

The DW_CFA_same_value instruction takes a single unsigned LEB128 operand that represents a register number R. The required action is to set the rule for the register specified by R to same value.

  • DW_CFA_offset

The DW_CFA_offset instruction takes two operands: a register number R (encoded with the opcode) and an unsigned LEB128 constant representing a factored displacement D. The required action is to change the rule for the register specified by R to be an offset(D*data_alignment_factor) rule.

Note

Seems this should be named DW_CFA_offset_uf since the offset is unsigned factored.

  • DW_CFA_offset_extended

The DW_CFA_offset_extended instruction takes two unsigned LEB128 operands representing a register number R and a factored displacement D. This instruction is identical to DW_CFA_offset except for the encoding and size of the register operand.

Note

Seems this should be named DW_CFA_offset_extended_uf since the displacement is unsigned factored.

  • DW_CFA_offset_extended_sf

The DW_CFA_offset_extended_sf instruction takes two operands: an unsigned LEB128 value representing a register number R and a signed LEB128 factored displacement D. This instruction is identical to DW_CFA_offset_extended except that D is signed.

  • DW_CFA_val_offset

The DW_CFA_val_offset instruction takes two unsigned LEB128 operands representing a register number R and a factored displacement D. The required action is to change the rule for the register indicated by R to be a val_offset(D*data_alignment_factor) rule.

Note

Seems this should be named DW_CFA_val_offset_uf since the displacement is unsigned factored.

  • DW_CFA_val_offset_sf

The DW_CFA_val_offset_sf instruction takes two operands: an unsigned LEB128 value representing a register number R and a signed LEB128 factored displacement D. This instruction is identical to DW_CFA_val_offset except that D is signed.

  • DW_CFA_register

The DW_CFA_register instruction takes two unsigned LEB128 operands representing register numbers R1 and R2 respectively. The required action is to set the rule for the register specified by R1 to be register(R) where R is R2.

  • DW_CFA_expression

The DWCFA_expression instruction takes two operands: an unsigned LEB128value representing a register number R, and a DW_FORM_block valuerepresenting a DWARF expression E. The required action is to change the rulefor the register specified by R to be an _expression(E) rule. The memorylocation description of the current CFA is pushed on the DWARF stack priorto execution of E.

That is, the DWARF expression computes the location description where the register value can be retrieved.

See Call Frame Instructions regarding restrictions on the DWARF expression operators that can be used in E.

  • DW_CFA_val_expression

The DW_CFA_val_expression instruction takes two operands: an unsigned LEB128 value representing a register number R, and a DW_FORM_block value representing a DWARF expression E. The required action is to change the rule for the register specified by R to be a val_expression(E) rule. The memory location description of the current CFA is pushed on the DWARF evaluation stack prior to execution of E.

That is, E computes the value of register R.

See Call Frame Instructions regarding restrictions on the DWARF expression operators that can be used in E.

If the result of evaluating E is not a value with a base type size that matches the register size, then the DWARF is ill-formed.

  • DW_CFA_restore

The DW_CFA_restore instruction takes a single operand (encoded with the opcode) that represents a register number R. The required action is to change the rule for the register specified by R to the rule assigned it by the initial_instructions in the CIE.

  • DW_CFA_restore_extended

The DW_CFA_restore_extended instruction takes a single unsigned LEB128 operand that represents a register number R. This instruction is identical to DW_CFA_restore except for the encoding and size of the register operand.
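The factored-displacement rules above can be modeled in a few lines. The following is an illustrative consumer-side sketch, not LLVM source; the helper names and the example data alignment factor are assumptions.

```python
# Sketch of how a DWARF consumer turns a factored displacement D from
# DW_CFA_offset_extended / DW_CFA_val_offset into an unwind-table rule,
# using the data alignment factor recorded in the CIE.

def offset_rule(d, data_alignment_factor):
    # offset(N): the register's previous value is saved at address CFA + N.
    return ("offset", d * data_alignment_factor)

def val_offset_rule(d, data_alignment_factor):
    # val_offset(N): the register's previous value is the value CFA + N.
    return ("val_offset", d * data_alignment_factor)

# With a typical downward-growing stack and data_alignment_factor = -4,
# a factored displacement of 2 addresses the slot at CFA - 8.
rule = offset_rule(2, -4)
```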

Row State Instructions

These instructions are the same as in DWARF 5.

Call Frame Calling Address

When virtually unwinding frames, consumers frequently wish to obtain the address of the instruction which called a subroutine. This information is not always provided. Typically, however, one of the registers in the virtual unwind table is the Return Address.

If a Return Address register is defined in the virtual unwind table, and its rule is undefined (for example, by DW_CFA_undefined), then there is no return address and no call address, and the virtual unwind of stack activations is complete.

In most cases the return address is in the same context as the calling address, but that need not be the case, especially if the producer knows in some way the call never will return. The context of the “return address” might be on a different line, in a different lexical block, or past the end of the calling subroutine. If a consumer were to assume that it was in the same context as the calling address, the virtual unwind might fail.

For architectures with constant-length instructions where the return address immediately follows the call instruction, a simple solution is to subtract the length of an instruction from the return address to obtain the calling instruction. For architectures with variable-length instructions (for example, x86), this is not possible. However, subtracting 1 from the return address, although not guaranteed to provide the exact calling address, generally will produce an address within the same context as the calling address, and that usually is sufficient.

Note

For AMDGPU the instructions are variable size and a consumer can subtract 1 from the return address to get the address of a byte within the call site instructions.
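The heuristic described above can be sketched as follows. This is an illustrative helper, not part of any unwinder API; the function name is an assumption.

```python
from typing import Optional

def call_site_address(return_address: int, insn_size: Optional[int]) -> int:
    """Recover an address inside the call-site instruction from a return
    address. With a fixed instruction length, subtract that length; for
    variable-size ISAs (e.g. AMDGPU, x86) subtract 1, which lands on some
    byte of the calling instruction within the same context."""
    if insn_size is not None:
        return return_address - insn_size
    return return_address - 1
```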

Call Frame Information Instruction Encodings

The following table gives the encoding of the DWARF call frame information instructions added for AMDGPU.

AMDGPU DWARF Call Frame Information Instruction Encodings
Instruction | High 2 Bits | Low 6 Bits | Operand 1 | Operand 2
DW_CFA_LLVM_def_cfa_aspace | 0 | 0x.. | ULEB128 |

Line Table

Note

AMDGPU does not use the isa state machine register and always sets it to 0.

Accelerated Access

Lookup By Name

Note

For AMDGPU:

  • The rule for debugger information entries included in the name index in the optional .debug_names section is extended to also include named DW_TAG_variable debugging information entries with a DW_AT_location attribute that includes a DW_OP_LLVM_form_aspace_address operation.

  • The lookup by name section header augmentation_string string field contains:

  1. [amd:v0.0]

The “vX.Y” specifies the major X and minor Y version number of the AMDGPU extensions used in the DWARF of the compilation unit. The version number conforms to [SEMVER].
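A consumer checking the augmentation string might extract the version like this. The helper name and error handling are illustrative, not part of the specification.

```python
import re

def parse_amd_augmentation(s: str):
    """Parse an "amd:vX.Y" augmentation string, as described above,
    into a (major, minor) tuple."""
    m = re.fullmatch(r"amd:v(\d+)\.(\d+)", s)
    if m is None:
        raise ValueError("not an AMDGPU augmentation string: %r" % s)
    return int(m.group(1)), int(m.group(2))
```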

Lookup By Address

Note

For AMDGPU:

  • The lookup by address section header table:

    • address_size (ubyte)
    • Match the address size for the Global address space defined in AMDGPU DWARF Address Space Mapping.

    • segment_selector_size (ubyte)

    • AMDGPU does not use a segment selector so this is 0. The entries in the .debug_aranges do not have a segment selector.

Data Representation

32-Bit and 64-Bit DWARF Formats

Note

For AMDGPU:

  • For the amdgcn target only the 64-bit process address space is supported.
  • The producer can generate either 32-bit or 64-bit DWARF format.
  • Within the body of the .debug_info section, certain forms of attribute value depend on the choice of DWARF format as follows. For the 32-bit DWARF format, the value is a 4-byte unsigned integer; for the 64-bit DWARF format, the value is an 8-byte unsigned integer.

AMDGPU DWARF .debug_info Section Attribute Sizes
Form | Role
DW_FORM_line_strp | offset in .debug_line_str
DW_FORM_ref_addr | offset in .debug_info
DW_FORM_sec_offset | offset in a section other than .debug_info or .debug_str
DW_FORM_strp | offset in .debug_str
DW_FORM_strp_sup | offset in .debug_str section of supplementary object file
DW_OP_call_ref | offset in .debug_info
DW_OP_implicit_pointer | offset in .debug_info
DW_OP_LLVM_aspace_implicit_pointer | offset in .debug_info
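The 32-bit and 64-bit DWARF formats are distinguished by the initial length field of each unit, per the DWARF standard: a 4-byte value below the reserved range is a 32-bit length, while the escape value 0xffffffff means an 8-byte length follows. A minimal reader sketch:

```python
import struct

def read_initial_length(buf: bytes, off: int = 0):
    """Read a DWARF initial-length field and report which format follows.
    Returns (format, length, offset-past-the-length-field)."""
    (word,) = struct.unpack_from("<I", buf, off)
    if word == 0xFFFFFFFF:
        # 64-bit DWARF format: the real length is the next 8 bytes.
        (length,) = struct.unpack_from("<Q", buf, off + 4)
        return "DWARF64", length, off + 12
    # 32-bit DWARF format: the 4-byte value itself is the length.
    return "DWARF32", word, off + 4
```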

Unit Headers

Note

For AMDGPU:

AMDGPU DW_AT_LLVM_lane_pc

The DW_AT_LLVM_lane_pc attribute can be used to specify the program location of the separate lanes of a SIMT thread. See Debugging Information Entry Attributes.

If the lane is an active lane then this will be the same as the current programlocation.

If the lane is inactive, but was active on entry to the subprogram, then this is the program location in the subprogram at which execution of the lane is conceptually positioned.

If the lane was not active on entry to the subprogram, then this will be the undefined location. A client debugger can check if the lane is part of a valid work-group by checking that the lane is in the range of the associated work-group within the grid, accounting for partial work-groups. If it is not then the debugger can omit any information for the lane. Otherwise, the debugger may repeatedly unwind the stack and inspect the DW_AT_LLVM_lane_pc of the calling subprogram until it finds a non-undefined location. Conceptually the lane only has the call frames for which it has a non-undefined DW_AT_LLVM_lane_pc.

The following example illustrates how the AMDGPU backend can generate a location list for the nested IF/THEN/ELSE structures of the following subprogram pseudo code for a target with 64 lanes per wave.

SUBPROGRAM X
BEGIN
  a;
  IF (c1) THEN
    b;
    IF (c2) THEN
      c;
    ELSE
      d;
    ENDIF
    e;
  ELSE
    f;
  ENDIF
  g;
END

The AMDGPU backend may generate the following pseudo LLVM MIR to manipulate the execution mask (EXEC) to linearize the control flow. The condition is evaluated to make a mask of the lanes for which the condition evaluates to true. First the THEN region is executed by setting the EXEC mask to the logical AND of the current EXEC mask with the condition mask. Then the ELSE region is executed by negating the EXEC mask and logical AND of the saved EXEC mask at the start of the region. After the IF/THEN/ELSE region the EXEC mask is restored to the value it had at the beginning of the region. This is shown below. Other approaches are possible, but the basic concept is the same.

$lex_start:
  a;
  %1 = EXEC
  %2 = c1
$lex_1_start:
  EXEC = %1 & %2
$lex_1_then:
  b;
  %3 = EXEC
  %4 = c2
$lex_1_1_start:
  EXEC = %3 & %4
$lex_1_1_then:
  c;
  EXEC = ~EXEC & %3
$lex_1_1_else:
  d;
  EXEC = %3
$lex_1_1_end:
  e;
  EXEC = ~EXEC & %1
$lex_1_else:
  f;
  EXEC = %1
$lex_1_end:
  g;
$lex_end:

To create the location list that defines the location description of a vector of lane program locations, the LLVM MIR DBG_VALUE pseudo instruction can be used to annotate the linearized control flow. This can be done by defining an artificial variable for the lane PC. The location list created for it is used to define the value of the DW_AT_LLVM_lane_pc attribute.

A DWARF procedure is defined for each well nested structured control flow region which provides the conceptual lane program location for a lane if it is not active (namely it is divergent). The expression for each region inherits the value of the immediately enclosing region and modifies it according to the semantics of the region.

For an IF/THEN/ELSE region the divergent program location is at the start of the region for the THEN region since it is executed first. For the ELSE region the divergent program location is at the end of the IF/THEN/ELSE region since the THEN region has completed.

The lane PC artificial variable is assigned at each region transition. It uses the immediately enclosing region’s DWARF procedure to compute the program location for each lane assuming they are divergent, and then modifies the result by inserting the current program location for each lane that the EXEC mask indicates is active.

By having separate DWARF procedures for each region, they can be reused to define the value for any nested region. This reduces the amount of DWARF required.

The following provides an example using pseudo LLVM MIR.

$lex_start:
  DEFINE_DWARF %__uint_64 = DW_TAG_base_type[
    DW_AT_name = "__uint64";
    DW_AT_byte_size = 8;
    DW_AT_encoding = DW_ATE_unsigned;
  ];
  DEFINE_DWARF %__active_lane_pc = DW_TAG_dwarf_procedure[
    DW_AT_name = "__active_lane_pc";
    DW_AT_location = [
      DW_OP_regx PC;
      DW_OP_LLVM_extend 64, 64;
      DW_OP_regval_type EXEC, %__uint_64;
      DW_OP_LLVM_select_bit_piece 64, 64;
    ];
  ];
  DEFINE_DWARF %__divergent_lane_pc = DW_TAG_dwarf_procedure[
    DW_AT_name = "__divergent_lane_pc";
    DW_AT_location = [
      DW_OP_LLVM_undefined;
      DW_OP_LLVM_extend 64, 64;
    ];
  ];
  DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
    DW_OP_call_ref %__divergent_lane_pc;
    DW_OP_call_ref %__active_lane_pc;
  ];
  a;
  %1 = EXEC;
  DBG_VALUE %1, $noreg, %__lex_1_save_exec;
  %2 = c1;
$lex_1_start:
  EXEC = %1 & %2;
$lex_1_then:
  DEFINE_DWARF %__divergent_lane_pc_1_then = DW_TAG_dwarf_procedure[
    DW_AT_name = "__divergent_lane_pc_1_then";
    DW_AT_location = DIExpression[
      DW_OP_call_ref %__divergent_lane_pc;
      DW_OP_xaddr &lex_1_start;
      DW_OP_stack_value;
      DW_OP_LLVM_extend 64, 64;
      DW_OP_call_ref %__lex_1_save_exec;
      DW_OP_deref_type 64, %__uint_64;
      DW_OP_LLVM_select_bit_piece 64, 64;
    ];
  ];
  DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
    DW_OP_call_ref %__divergent_lane_pc_1_then;
    DW_OP_call_ref %__active_lane_pc;
  ];
  b;
  %3 = EXEC;
  DBG_VALUE %3, $noreg, %__lex_1_1_save_exec;
  %4 = c2;
$lex_1_1_start:
  EXEC = %3 & %4;
$lex_1_1_then:
  DEFINE_DWARF %__divergent_lane_pc_1_1_then = DW_TAG_dwarf_procedure[
    DW_AT_name = "__divergent_lane_pc_1_1_then";
    DW_AT_location = DIExpression[
      DW_OP_call_ref %__divergent_lane_pc_1_then;
      DW_OP_xaddr &lex_1_1_start;
      DW_OP_stack_value;
      DW_OP_LLVM_extend 64, 64;
      DW_OP_call_ref %__lex_1_1_save_exec;
      DW_OP_deref_type 64, %__uint_64;
      DW_OP_LLVM_select_bit_piece 64, 64;
    ];
  ];
  DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
    DW_OP_call_ref %__divergent_lane_pc_1_1_then;
    DW_OP_call_ref %__active_lane_pc;
  ];
  c;
  EXEC = ~EXEC & %3;
$lex_1_1_else:
  DEFINE_DWARF %__divergent_lane_pc_1_1_else = DW_TAG_dwarf_procedure[
    DW_AT_name = "__divergent_lane_pc_1_1_else";
    DW_AT_location = DIExpression[
      DW_OP_call_ref %__divergent_lane_pc_1_then;
      DW_OP_xaddr &lex_1_1_end;
      DW_OP_stack_value;
      DW_OP_LLVM_extend 64, 64;
      DW_OP_call_ref %__lex_1_1_save_exec;
      DW_OP_deref_type 64, %__uint_64;
      DW_OP_LLVM_select_bit_piece 64, 64;
    ];
  ];
  DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
    DW_OP_call_ref %__divergent_lane_pc_1_1_else;
    DW_OP_call_ref %__active_lane_pc;
  ];
  d;
  EXEC = %3;
$lex_1_1_end:
  DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
    DW_OP_call_ref %__divergent_lane_pc;
    DW_OP_call_ref %__active_lane_pc;
  ];
  e;
  EXEC = ~EXEC & %1;
$lex_1_else:
  DEFINE_DWARF %__divergent_lane_pc_1_else = DW_TAG_dwarf_procedure[
    DW_AT_name = "__divergent_lane_pc_1_else";
    DW_AT_location = DIExpression[
      DW_OP_call_ref %__divergent_lane_pc;
      DW_OP_xaddr &lex_1_end;
      DW_OP_stack_value;
      DW_OP_LLVM_extend 64, 64;
      DW_OP_call_ref %__lex_1_save_exec;
      DW_OP_deref_type 64, %__uint_64;
      DW_OP_LLVM_select_bit_piece 64, 64;
    ];
  ];
  DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
    DW_OP_call_ref %__divergent_lane_pc_1_else;
    DW_OP_call_ref %__active_lane_pc;
  ];
  f;
  EXEC = %1;
$lex_1_end:
  DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
    DW_OP_call_ref %__divergent_lane_pc;
    DW_OP_call_ref %__active_lane_pc;
  ];
  g;
$lex_end:

The DWARF procedure %__active_lane_pc is used to update the lane pc elements that are active with the current program location.

Artificial variables %__lex_1_save_exec and %__lex_1_1_save_exec are created for the execution masks saved on entry to a region. Using the DBG_VALUE pseudo instruction, location lists that describe where they are allocated at any given program location will be created. The compiler may allocate them to registers, or spill them to memory.

The DWARF procedures for each region use the saved execution mask value to update only the lanes that are active on entry to the region. All other lanes retain the value of the enclosing region where they were last active. If they were not active on entry to the subprogram, then they will have the undefined location description.

Other structured control flow regions can be handled similarly. For example, loops would set the divergent program location for the region at the end of the loop. Any lanes active will be in the loop, and any lanes not active must have exited the loop.

An IF/THEN/ELSEIF/ELSEIF/… region can be treated as a nest of IF/THEN/ELSE regions.

The DWARF procedures can use the active lane artificial variable described in AMDGPU DW_AT_LLVM_active_lane rather than the actual EXEC mask in order to support whole or quad wave mode.
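The select/extend machinery above can be modeled in plain code to make the semantics concrete. This is an illustrative model of the expression evaluation, not compiler output; the function names and lane representation are assumptions.

```python
# Model of DW_OP_LLVM_select_bit_piece: for each lane i, pick the element
# from the "active" vector if bit i of the mask is set, otherwise keep the
# element from the "divergent" vector.

def select_bit_piece(mask, active, divergent):
    return [active[i] if (mask >> i) & 1 else divergent[i]
            for i in range(len(active))]

def lane_pcs(pc, exec_mask, divergent_pcs, lanes=64):
    """Compute the per-lane PC vector: start from the divergent lane PCs
    produced by the enclosing region's DWARF procedure, then overwrite
    the entries of lanes that EXEC marks active with the current PC
    (DW_OP_regx PC; DW_OP_LLVM_extend 64, 64; ...select_bit_piece)."""
    extended_pc = [pc] * lanes  # DW_OP_LLVM_extend: replicate across lanes
    return select_bit_piece(exec_mask, extended_pc, divergent_pcs)
```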

AMDGPU DW_AT_LLVM_active_lane

The DW_AT_LLVM_active_lane attribute can be used to specify the lanes that are conceptually active for a SIMT thread. See Debugging Information Entry Attributes.

The execution mask may be modified to implement whole or quad wave mode operations. For example, all lanes may need to temporarily be made active to execute a whole wave operation. Such regions would save the EXEC mask, update it to enable the necessary lanes, perform the operations, and then restore the EXEC mask from the saved value. While executing the whole wave region, the conceptual execution mask is the saved value, not the EXEC value.

This is handled by defining an artificial variable for the active lane mask. The active lane mask artificial variable would be the actual EXEC mask for normal regions, and the saved execution mask for regions where the mask is temporarily updated. The location list created for this artificial variable is used to define the value of the DW_AT_LLVM_active_lane attribute.

Source Text

Source text for online-compiled programs (e.g. those compiled by the OpenCL runtime) may be embedded into the DWARF v5 line table using the clang -gembed-source option, described in table AMDGPU Debug Options.

For example:

  • -gembed-source
  • Enable the embedded source DWARF v5 extension.
  • -gno-embed-source
  • Disable the embedded source DWARF v5 extension.

AMDGPU Debug Options
Debug Flag | Description
-g[no-]embed-source | Enable/disable embedding source text in DWARF debug sections. Useful for environments where source cannot be written to disk, such as when performing online compilation.

This option enables an extended content type in the DWARF v5 Line Number Program Header, which is used to encode embedded source.

AMDGPU DWARF Line Number Program Header Extended Content Types
Content Type | Form
DW_LNCT_LLVM_source | DW_FORM_line_strp

The source field will contain the UTF-8 encoded, null-terminated source text with '\n' line endings. When the source field is present, consumers can use the embedded source instead of attempting to discover the source on disk. When the source field is absent, consumers can access the file to get the source text.

The above content type appears in the file_name_entry_format field of the line table prologue, and its corresponding values appear in the file_names field. The current encoding of the content type is documented in table AMDGPU DWARF Line Number Program Header Extended Content Types Encoding.

AMDGPU DWARF Line Number Program Header Extended Content Types Encoding
Content Type | Value
DW_LNCT_LLVM_source | 0x2001

Code Conventions

This section provides code conventions used for each supported target triple OS (see Target Triples).

AMDHSA

This section provides code conventions used when the target triple OS is amdhsa (see Target Triples).

Code Object Target Identification

The AMDHSA OS uses the following syntax to specify the code object target as a single string:

<Architecture>-<Vendor>-<OS>-<Environment>-<Processor><Target Features>

Where:

  • <Architecture>, <Vendor>, <OS> and <Environment> are the same as the Target Triple (see Target Triples).
  • <Processor> is the same as the Processor (see Processors).
  • <Target Features> is a list of the enabled Target Features (see Target Features), each prefixed by a plus, that apply to Processor. The list must be in the same order as listed in the table AMDGPU Target Features. Note that Target Features must be included in the list if they are enabled even if that is the default for Processor.

For example:

"amdgcn-amd-amdhsa--gfx902+xnack"
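The rules above can be sketched as a small string builder. This is an illustrative helper, not a ROCm or LLVM API; it assumes the amdgcn-amd-amdhsa triple with an empty environment (hence the double dash) and that the caller supplies features already ordered per the AMDGPU Target Features table.

```python
def code_object_target(processor, features):
    """Build the AMDHSA code object target string
    <Architecture>-<Vendor>-<OS>-<Environment>-<Processor><Target Features>,
    with each enabled feature prefixed by '+'."""
    return "amdgcn-amd-amdhsa--" + processor + "".join("+" + f for f in features)
```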

Code Object Metadata

The code object metadata specifies extensible metadata associated with the code objects executed on HSA [HSA] compatible runtimes such as AMD’s ROCm [AMD-ROCm]. The encoding and semantics of this metadata depend on the code object version; see Code Object V2 Metadata (-mattr=-code-object-v3) and Code Object V3 Metadata (-mattr=+code-object-v3).

Code object metadata is specified in a note record (see Note Records) and is required when the target triple OS is amdhsa (see Target Triples). It must contain the minimum information necessary to support the ROCm kernel queries. For example, the segment sizes needed in a dispatch packet. In addition, a high level language runtime may require other information to be included. For example, the AMD OpenCL runtime records kernel argument information.

Code Object V2 Metadata (-mattr=-code-object-v3)

Warning

Code Object V2 is not the default code object version emitted by this version of LLVM. For a description of the metadata generated with the default configuration (Code Object V3) see Code Object V3 Metadata (-mattr=+code-object-v3).

Code object V2 metadata is specified by the NT_AMD_AMDGPU_METADATA note record (see Code Object V2 Note Records (-mattr=-code-object-v3)).

The metadata is specified as a YAML formatted string (see [YAML] and YAML I/O).

The metadata is represented as a single YAML document comprised of the mapping defined in table AMDHSA Code Object V2 Metadata Map and referenced tables.

For boolean values, the string values of false and true are used for false and true respectively.

Additional information can be added to the mappings. To avoid conflicts, any non-AMD key names should be prefixed by “vendor-name.”.

AMDHSA Code Object V2 Metadata Map
String Key | Value Type | Required? | Description
“Version” | sequence of 2 integers | Required |
  • The first integer is the major version. Currently 1.
  • The second integer is the minor version. Currently 0.
“Printf” | sequence of strings | |

Each string is encoded information about a printf function call. The encoded information is organized as fields separated by colon (‘:’):

ID:N:S[0]:S[1]:…:S[N-1]:FormatString

where:

ID
A 32-bit integer as a unique id for each printf function call.
N
A 32-bit integer equal to the number of arguments of the printf function call minus 1.
S[i] (where i = 0, 1, …, N-1)
32-bit integers for the size in bytes of the i-th FormatString argument of the printf function call.
FormatString
The format string passed to the printf function call.

“Kernels” | sequence of mapping | Required | Sequence of the mappings for each kernel in the code object. See AMDHSA Code Object V2 Kernel Metadata Map for the definition of the mapping.
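The colon-separated printf encoding above can be decoded with a small parser. This is an illustrative sketch, not runtime code; the helper name is an assumption. Splitting on at most N+2 colons preserves any colons that appear inside the format string itself.

```python
def parse_printf_info(s):
    """Decode one "Printf" metadata string of the form
    ID:N:S[0]:...:S[N-1]:FormatString into (id, sizes, format_string)."""
    id_str, n_str, rest = s.split(":", 2)
    n = int(n_str)
    parts = rest.split(":", n)           # n size fields, then the format string
    sizes = [int(p) for p in parts[:n]]
    fmt = parts[n] if n < len(parts) else ""
    return int(id_str), sizes, fmt
```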
AMDHSA Code Object V2 Kernel Metadata Map
String Key | Value Type | Required? | Description
“Name” | string | Required | Source name of the kernel.
“SymbolName” | string | Required | Name of the kernel descriptor ELF symbol.
“Language” | string | |

Source language of the kernel. Values include:

  • “OpenCL C”
  • “OpenCL C++”
  • “HCC”
  • “OpenMP”

“LanguageVersion” | sequence of 2 integers | |
  • The first integer is the major version.
  • The second integer is the minor version.
“Attrs” | mapping | | Mapping of kernel attributes. See AMDHSA Code Object V2 Kernel Attribute Metadata Map for the mapping definition.
“Args” | sequence of mapping | | Sequence of mappings of the kernel arguments. See AMDHSA Code Object V2 Kernel Argument Metadata Map for the definition of the mapping.
“CodeProps” | mapping | | Mapping of properties related to the kernel code. See AMDHSA Code Object V2 Kernel Code Properties Metadata Map for the mapping definition.
AMDHSA Code Object V2 Kernel Attribute Metadata Map
String Key | Value Type | Required? | Description
“ReqdWorkGroupSize” | sequence of 3 integers | | If not 0, 0, 0 then all values must be >=1 and the dispatch work-group size X, Y, Z must correspond to the specified values. Defaults to 0, 0, 0. Corresponds to the OpenCL reqd_work_group_size attribute.
“WorkGroupSizeHint” | sequence of 3 integers | | The dispatch work-group size X, Y, Z is likely to be the specified values. Corresponds to the OpenCL work_group_size_hint attribute.
“VecTypeHint” | string | | The name of a scalar or vector type. Corresponds to the OpenCL vec_type_hint attribute.
“RuntimeHandle” | string | | The external symbol name associated with a kernel. The OpenCL runtime allocates a global buffer for the symbol and saves the kernel’s address to it, which is used for device side enqueueing. Only available for device side enqueued kernels.
AMDHSA Code Object V2 Kernel Argument Metadata Map
String Key | Value Type | Required? | Description
“Name” | string | | Kernel argument name.
“TypeName” | string | | Kernel argument type name.
“Size” | integer | Required | Kernel argument size in bytes.
“Align” | integer | Required | Kernel argument alignment in bytes. Must be a power of two.
“ValueKind” | string | Required |

Kernel argument kind that specifies how to set up the corresponding argument. Values include:

“ByValue”
The argument is copied directly into the kernarg.
“GlobalBuffer”
A global address space pointer to the buffer data is passed in the kernarg.
“DynamicSharedPointer”
A group address space pointer to dynamically allocated LDS is passed in the kernarg.
“Sampler”
A global address space pointer to an S# is passed in the kernarg.
“Image”
A global address space pointer to a T# is passed in the kernarg.
“Pipe”
A global address space pointer to an OpenCL pipe is passed in the kernarg.
“Queue”
A global address space pointer to an OpenCL device enqueue queue is passed in the kernarg.
“HiddenGlobalOffsetX”
The OpenCL grid dispatch global offset for the X dimension is passed in the kernarg.
“HiddenGlobalOffsetY”
The OpenCL grid dispatch global offset for the Y dimension is passed in the kernarg.
“HiddenGlobalOffsetZ”
The OpenCL grid dispatch global offset for the Z dimension is passed in the kernarg.
“HiddenNone”
An argument that is not used by the kernel. Space needs to be left for it, but it does not need to be set up.
“HiddenPrintfBuffer”
A global address space pointer to the runtime printf buffer is passed in the kernarg.
“HiddenHostcallBuffer”
A global address space pointer to the runtime hostcall buffer is passed in the kernarg.
“HiddenDefaultQueue”
A global address space pointer to the OpenCL device enqueue queue that should be used by the kernel by default is passed in the kernarg.
“HiddenCompletionAction”
A global address space pointer to help link enqueued kernels into the ancestor tree for determining when the parent kernel has finished.
“HiddenMultiGridSyncArg”
A global address space pointer for multi-grid synchronization is passed in the kernarg.

“ValueType” | string | Required |

Kernel argument value type. Only present if “ValueKind” is “ByValue”. For vector data types, the value is for the element type. Values include:

  • “Struct”
  • “I8”
  • “U8”
  • “I16”
  • “U16”
  • “F16”
  • “I32”
  • “U32”
  • “F32”
  • “I64”
  • “U64”
  • “F64”

“PointeeAlign” | integer | | Alignment in bytes of the pointee type for a pointer type kernel argument. Must be a power of 2. Only present if “ValueKind” is “DynamicSharedPointer”.
“AddrSpaceQual” | string | |

Kernel argument address space qualifier. Only present if “ValueKind” is “GlobalBuffer” or “DynamicSharedPointer”. Values are:

  • “Private”
  • “Global”
  • “Constant”
  • “Local”
  • “Generic”
  • “Region”

“AccQual” | string | |

Kernel argument access qualifier. Only present if “ValueKind” is “Image” or “Pipe”. Values are:

  • “ReadOnly”
  • “WriteOnly”
  • “ReadWrite”

“ActualAccQual” | string | |

The actual memory accesses performed by the kernel on the kernel argument. Only present if “ValueKind” is “GlobalBuffer”, “Image”, or “Pipe”. This may be more restrictive than indicated by “AccQual” to reflect what the kernel actually does. If not present then the runtime must assume what is implied by “AccQual” and “IsConst”. Values are:

  • “ReadOnly”
  • “WriteOnly”
  • “ReadWrite”

“IsConst” | boolean | | Indicates if the kernel argument is const qualified. Only present if “ValueKind” is “GlobalBuffer”.
“IsRestrict” | boolean | | Indicates if the kernel argument is restrict qualified. Only present if “ValueKind” is “GlobalBuffer”.
“IsVolatile” | boolean | | Indicates if the kernel argument is volatile qualified. Only present if “ValueKind” is “GlobalBuffer”.
“IsPipe” | boolean | | Indicates if the kernel argument is pipe qualified. Only present if “ValueKind” is “Pipe”.
AMDHSA Code Object V2 Kernel Code Properties Metadata Map
String Key | Value Type | Required? | Description
“KernargSegmentSize” | integer | Required | The size in bytes of the kernarg segment that holds the values of the arguments to the kernel.
“GroupSegmentFixedSize” | integer | Required | The amount of group segment memory required by a work-group in bytes. This does not include any dynamically allocated group segment memory that may be added when the kernel is dispatched.
“PrivateSegmentFixedSize” | integer | Required | The amount of fixed private address space memory required for a work-item in bytes. If the kernel uses a dynamic call stack then additional space must be added to this value for the call stack.
“KernargSegmentAlign” | integer | Required | The maximum byte alignment of arguments in the kernarg segment. Must be a power of 2.
“WavefrontSize” | integer | Required | Wavefront size. Must be a power of 2.
“NumSGPRs” | integer | Required | Number of scalar registers used by a wavefront for GFX6-GFX10. This includes the special SGPRs for VCC, Flat Scratch (GFX7-GFX10) and XNACK (for GFX8-GFX10). It does not include the 16 SGPRs added if a trap handler is enabled. It is not rounded up to the allocation granularity.
“NumVGPRs” | integer | Required | Number of vector registers used by each work-item for GFX6-GFX10.
“MaxFlatWorkGroupSize” | integer | Required | Maximum flat work-group size supported by the kernel in work-items. Must be >=1 and consistent with ReqdWorkGroupSize if not 0, 0, 0.
“NumSpilledSGPRs” | integer | | Number of stores from a scalar register to a register allocator created spill location.
“NumSpilledVGPRs” | integer | | Number of stores from a vector register to a register allocator created spill location.
Code Object V3 Metadata (-mattr=+code-object-v3)

Code object V3 metadata is specified by the NT_AMDGPU_METADATA note record (see Code Object V3 Note Records (-mattr=+code-object-v3)).

The metadata is represented as Message Pack formatted binary data (see [MsgPack]). The top level is a Message Pack map that includes the keys defined in table AMDHSA Code Object V3 Metadata Map and referenced tables.

Additional information can be added to the maps. To avoid conflicts, any key names should be prefixed by “vendor-name.” where vendor-name can be the name of the vendor and specific vendor tool that generates the information. The prefix is abbreviated to simply “.” when it appears within a map that has been added by the same vendor-name.
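The prefixing rule above can be made concrete with a tiny helper. This is an illustrative sketch, not part of any metadata API, and "acme" is a hypothetical vendor name.

```python
def vendor_key(vendor, name, nested_same_vendor=False):
    """Form a non-standard metadata key: prefixed "vendor-name." at the
    point where the vendor first adds it, abbreviated to "." inside a map
    that was itself added by the same vendor."""
    return ("." if nested_same_vendor else vendor + ".") + name
```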

AMDHSA Code Object V3 Metadata Map
String Key | Value Type | Required? | Description
“amdhsa.version” | sequence of 2 integers | Required |
  • The first integer is the major version. Currently 1.
  • The second integer is the minor version. Currently 0.
“amdhsa.printf” | sequence of strings | |

Each string is encoded information about a printf function call. The encoded information is organized as fields separated by colon (‘:’):

ID:N:S[0]:S[1]:…:S[N-1]:FormatString

where:

ID
A 32-bit integer as a unique id for each printf function call.
N
A 32-bit integer equal to the number of arguments of the printf function call minus 1.
S[i] (where i = 0, 1, …, N-1)
32-bit integers for the size in bytes of the i-th FormatString argument of the printf function call.
FormatString
The format string passed to the printf function call.

“amdhsa.kernels” | sequence of map | Required | Sequence of the maps for each kernel in the code object. See AMDHSA Code Object V3 Kernel Metadata Map for the definition of the keys included in that map.
AMDHSA Code Object V3 Kernel Metadata Map
String Key | Value Type | Required? | Description
“.name” | string | Required | Source name of the kernel.
“.symbol” | string | Required | Name of the kernel descriptor ELF symbol.
“.language” | string | |

Source language of the kernel. Values include:

  • “OpenCL C”
  • “OpenCL C++”
  • “HCC”
  • “HIP”
  • “OpenMP”
  • “Assembler”

“.language_version” | sequence of 2 integers | |
  • The first integer is the major version.
  • The second integer is the minor version.
“.args” | sequence of map | | Sequence of maps of the kernel arguments. See AMDHSA Code Object V3 Kernel Argument Metadata Map for the definition of the keys included in that map.
“.reqd_workgroup_size” | sequence of 3 integers | | If not 0, 0, 0 then all values must be >=1 and the dispatch work-group size X, Y, Z must correspond to the specified values. Defaults to 0, 0, 0. Corresponds to the OpenCL reqd_work_group_size attribute.
“.workgroup_size_hint” | sequence of 3 integers | | The dispatch work-group size X, Y, Z is likely to be the specified values. Corresponds to the OpenCL work_group_size_hint attribute.
“.vec_type_hint” | string | | The name of a scalar or vector type. Corresponds to the OpenCL vec_type_hint attribute.
“.device_enqueue_symbol” | string | | The external symbol name associated with a kernel. The OpenCL runtime allocates a global buffer for the symbol and saves the kernel’s address to it, which is used for device side enqueueing. Only available for device side enqueued kernels.
“.kernarg_segment_size” | integer | Required | The size in bytes of the kernarg segment that holds the values of the arguments to the kernel.
“.group_segment_fixed_size” | integer | Required | The amount of group segment memory required by a work-group in bytes. This does not include any dynamically allocated group segment memory that may be added when the kernel is dispatched.
“.private_segment_fixed_size” | integer | Required | The amount of fixed private address space memory required for a work-item in bytes. If the kernel uses a dynamic call stack then additional space must be added to this value for the call stack.
“.kernarg_segment_align” | integer | Required | The maximum byte alignment of arguments in the kernarg segment. Must be a power of 2.
“.wavefront_size” | integer | Required | Wavefront size. Must be a power of 2.
“.sgpr_count” | integer | Required | Number of scalar registers required by a wavefront for GFX6-GFX9. A register is required if it is used explicitly, or if a higher numbered register is used explicitly. This includes the special SGPRs for VCC, Flat Scratch (GFX7-GFX9) and XNACK (for GFX8-GFX9). It does not include the 16 SGPRs added if a trap handler is enabled. It is not rounded up to the allocation granularity.
“.vgpr_count” | integer | Required | Number of vector registers required by each work-item for GFX6-GFX9. A register is required if it is used explicitly, or if a higher numbered register is used explicitly.
“.max_flat_workgroup_size” | integer | Required | Maximum flat work-group size supported by the kernel in work-items. Must be >=1 and consistent with ReqdWorkGroupSize if not 0, 0, 0.
“.sgpr_spill_count” | integer | | Number of stores from a scalar register to a register allocator created spill location.
“.vgpr_spill_count” | integer | | Number of stores from a vector register to a register allocator created spill location.
AMDHSA Code Object V3 Kernel Argument Metadata Map
String KeyValue TypeRequired?Description
“.name”string Kernel argument name.
“.type_name”string Kernel argument type name.
“.size”integerRequiredKernel argument size in bytes.
“.offset”integerRequiredKernel argument offset inbytes. The offset must be amultiple of the alignmentrequired by the argument.
“.value_kind”stringRequired

Kernel argument kind thatspecifies how to set up thecorresponding argument.Values include:

“by_value”
The argument is copieddirectly into the kernarg.
“global_buffer”
A global address space pointerto the buffer data is passedin the kernarg.
“dynamic_shared_pointer”
A group address space pointerto dynamically allocated LDSis passed in the kernarg.
“sampler”
A global address spacepointer to a S# is passed inthe kernarg.
“image”
A global address spacepointer to a T# is passed inthe kernarg.
“pipe”
A global address space pointerto an OpenCL pipe is passed inthe kernarg.
“queue”
A global address space pointerto an OpenCL device enqueuequeue is passed in thekernarg.
“hidden_global_offset_x”
The OpenCL grid dispatchglobal offset for the Xdimension is passed in thekernarg.
“hidden_global_offset_y”
The OpenCL grid dispatchglobal offset for the Ydimension is passed in thekernarg.
“hidden_global_offset_z”
The OpenCL grid dispatchglobal offset for the Zdimension is passed in thekernarg.
“hidden_none”
An argument that is not usedby the kernel. Space needs tobe left for it, but it doesnot need to be set up.
“hidden_printf_buffer”
A global address space pointerto the runtime printf bufferis passed in kernarg.
“hidden_hostcall_buffer”
A global address space pointerto the runtime hostcall bufferis passed in kernarg.
“hidden_default_queue”
A global address space pointerto the OpenCL device enqueuequeue that should be used bythe kernel by default ispassed in the kernarg.
“hidden_completion_action”
A global address space pointerto help link enqueued kernels intothe ancestor tree for determiningwhen the parent kernel has finished.
“hidden_multigrid_sync_arg”
A global address space pointer formulti-grid synchronization ispassed in the kernarg.
“.value_type”stringRequired

Kernel argument value type. Onlypresent if “.value_kind” is“by_value”. For vector datatypes, the value is for theelement type. Values include:

  • “struct”
  • “i8”
  • “u8”
  • “i16”
  • “u16”
  • “f16”
  • “i32”
  • “u32”
  • “f32”
  • “i64”
  • “u64”
  • “f64”
“.pointee_align”integer Alignment in bytes of pointeetype for pointer type kernelargument. Must be a powerof 2. Only present if“.value_kind” is“dynamic_shared_pointer”.
“.address_space”string

Kernel argument address spacequalifier. Only present if“.value_kind” is “global_buffer” or“dynamic_shared_pointer”. Valuesare:

  • “private”
  • “global”
  • “constant”
  • “local”
  • “generic”
  • “region”
“.access”string

Kernel argument accessqualifier. Only present if“.value_kind” is “image” or“pipe”. Valuesare:

  • “read_only”
  • “write_only”
  • “read_write”
“.actual_access”string

The actual memory accesses performed by the kernel on the kernel argument. Only present if “.value_kind” is “global_buffer”, “image”, or “pipe”. This may be more restrictive than indicated by “.access” to reflect what the kernel actually does. If not present then the runtime must assume what is implied by “.access” and “.is_const”. Values are:

  • “read_only”
  • “write_only”
  • “read_write”
“.is_const”boolean Indicates if the kernel argumentis const qualified. Only presentif “.value_kind” is“global_buffer”.
“.is_restrict”boolean Indicates if the kernel argumentis restrict qualified. Onlypresent if “.value_kind” is“global_buffer”.
“.is_volatile”boolean Indicates if the kernel argumentis volatile qualified. Onlypresent if “.value_kind” is“global_buffer”.
“.is_pipe”boolean Indicates if the kernel argumentis pipe qualified. Only presentif “.value_kind” is “pipe”.

Kernel Dispatch

The HSA architected queuing language (AQL) defines a user space memory interface that can be used to control the dispatch of kernels, in an agent independent way. An agent can have zero or more AQL queues created for it using the ROCm runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the HSA Platform System Architecture Specification [HSA] for the AQL queue mechanics and packet layouts.

The packet processor of a kernel agent is responsible for detecting and dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the packet processor is implemented by the hardware command processor (CP), asynchronous dispatch controller (ADC) and shader processor input controller (SPI).

The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel mode driver to initialize and register the AQL queue with CP.

To dispatch a kernel the following actions are performed. This can occur in the CPU host program, or from an HSA kernel executing on a GPU.

  • A pointer to an AQL queue for the kernel agent on which the kernel is to be executed is obtained.
  • A pointer to the kernel descriptor (see Kernel Descriptor) of the kernel to execute is obtained. It must be for a kernel that is contained in a code object that was loaded by the ROCm runtime on the kernel agent with which the AQL queue is associated.
  • Space is allocated for the kernel arguments using the ROCm runtime allocator for a memory region with the kernarg property for the kernel agent that will execute the kernel. It must be at least 16 byte aligned.
  • Kernel argument values are assigned to the kernel argument memory allocation. The layout is defined in the HSA Programmer’s Language Reference [HSA]. For AMDGPU the kernel execution directly accesses the kernel argument memory in the same way constant memory is accessed. (Note that the HSA specification allows an implementation to copy the kernel argument contents to another location that is accessed by the kernel.)
  • An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime API uses 64-bit atomic operations to reserve space in the AQL queue for the packet. The packet must be set up, and the final write must use an atomic store release to set the packet kind to ensure the packet contents are visible to the kernel agent. AQL defines a doorbell signal mechanism to notify the kernel agent that the AQL queue has been updated. These rules, and the layout of the AQL queue and kernel dispatch packet, are defined in the HSA System Architecture Specification [HSA].
  • A kernel dispatch packet includes information about the actual dispatch, such as grid and work-group size, together with information from the code object about the kernel, such as segment sizes. The ROCm runtime queries on the kernel symbol can be used to obtain the code object values which are recorded in the Code Object Metadata.
  • CP executes micro-code and is responsible for detecting and setting up the GPU to execute the wavefronts of a kernel dispatch.
  • CP ensures that when a wavefront starts executing the kernel machine code, the scalar general purpose registers (SGPR) and vector general purpose registers (VGPR) are set up as required by the machine code. The required setup is defined in the Kernel Descriptor. The initial register state is defined in Initial Kernel Execution State.
  • The prolog of the kernel machine code (see Kernel Prolog) sets up the machine state as necessary before continuing executing the machine code that corresponds to the kernel.
  • When the kernel dispatch has completed execution, CP signals the completion signal specified in the kernel dispatch packet if not 0.

Image and Samplers

Image and sampler handles created by the ROCm runtime are 64-bit addresses of a hardware 32 byte V# and 48 byte S# object respectively. In order to support the HSA query_sampler operations two extra dwords are used to store the HSA BRIG enumeration values for the queries that are not trivially deducible from the S# representation.

HSA Signals

HSA signal handles created by the ROCm runtime are 64-bit addresses of a structure allocated in memory accessible from both the CPU and GPU. The structure is defined by the ROCm runtime and subject to change between releases (see [AMD-ROCm-github]).

HSA AQL Queue

The HSA AQL queue structure is defined by the ROCm runtime and subject to change between releases (see [AMD-ROCm-github]). For some processors it contains fields needed to implement certain language features such as the flat address aperture bases. It also contains fields used by CP such as managing the allocation of scratch memory.

Kernel Descriptor

A kernel descriptor consists of the information needed by CP to initiate the execution of a kernel, including the entry point address of the machine code that implements the kernel.

Kernel Descriptor for GFX6-GFX10

CP microcode requires the kernel descriptor to be allocated on 64 byte alignment.

Kernel Descriptor for GFX6-GFX10
BitsSizeField NameDescription
31:04 bytesGROUP_SEGMENT_FIXED_SIZEThe amount of fixed localaddress space memoryrequired for a work-groupin bytes. This does notinclude any dynamicallyallocated local addressspace memory that may beadded when the kernel isdispatched.
63:324 bytesPRIVATE_SEGMENT_FIXED_SIZEThe amount of fixedprivate address spacememory required for awork-item in bytes. Ifis_dynamic_callstack is 1then additional space mustbe added to this value forthe call stack.
127:648 bytes Reserved, must be 0.
191:1288 bytesKERNEL_CODE_ENTRY_BYTE_OFFSETByte offset (possiblynegative) from baseaddress of kerneldescriptor to kernel’sentry point instructionwhich must be 256 bytealigned.
351:192 20 bytes Reserved, must be 0.
383:3524 bytesCOMPUTE_PGM_RSRC3
GFX6-9
Reserved, must be 0.
GFX10
Compute Shader (CS)program settings used byCP to set upCOMPUTE_PGM_RSRC3configurationregister. Seecompute_pgm_rsrc3 for GFX10.
415:3844 bytesCOMPUTE_PGM_RSRC1Compute Shader (CS)program settings used byCP to set upCOMPUTE_PGM_RSRC1configurationregister. Seecompute_pgm_rsrc1 for GFX6-GFX10.
447:4164 bytesCOMPUTE_PGM_RSRC2Compute Shader (CS)program settings used byCP to set upCOMPUTE_PGM_RSRC2configurationregister. Seecompute_pgm_rsrc2 for GFX6-GFX10.
4481 bitENABLE_SGPR_PRIVATE_SEGMENT_BUFFER

Enable the setup of theSGPR user data registers(seeInitial Kernel Execution State).

The total number of SGPRuser data registersrequested must not exceed16 and match value incompute_pgm_rsrc2.user_sgpr.user_sgpr_count.Any requests beyond 16will be ignored.

4491 bitENABLE_SGPR_DISPATCH_PTRsee above
4501 bitENABLE_SGPR_QUEUE_PTRsee above
4511 bitENABLE_SGPR_KERNARG_SEGMENT_PTRsee above
4521 bitENABLE_SGPR_DISPATCH_IDsee above
4531 bitENABLE_SGPR_FLAT_SCRATCH_INITsee above
4541 bitENABLE_SGPR_PRIVATE_SEGMENT_SIZEsee above
457:4553 bits Reserved, must be 0.
4581 bitENABLE_WAVEFRONT_SIZE32
GFX6-9
Reserved, must be 0.
GFX10
  • If 0 execute inwavefront size 64 mode.
  • If 1 execute innative wavefront size32 mode.
463:4595 bits Reserved, must be 0.
511:4646 bytes Reserved, must be 0.
512Total size 64 bytes.
compute_pgm_rsrc1 for GFX6-GFX10
BitsSizeField NameDescription
5:06 bitsGRANULATED_WORKITEM_VGPR_COUNT

Number of vector registerblocks used by each work-item;granularity is devicespecific:

GFX6-GFX9
  • vgprs_used 0..256
  • max(0, ceil(vgprs_used / 4) - 1)
GFX10 (wavefront size 64)
  • max_vgpr 1..256
  • max(0, ceil(vgprs_used / 4) - 1)
GFX10 (wavefront size 32)
  • max_vgpr 1..256
  • max(0, ceil(vgprs_used / 8) - 1)

Where vgprs_used is definedas the highest VGPR numberexplicitly referenced plusone.

Used by CP to set upCOMPUTE_PGM_RSRC1.VGPRS.

TheAssemblercalculates thisautomatically for theselected processor fromvalues provided to the.amdhsa_kernel directiveby the.amdhsa_next_free_vgprnested directive (seeAMDHSA Kernel Assembler Directives).

9:64 bitsGRANULATED_WAVEFRONT_SGPR_COUNT

Number of scalar registerblocks used by a wavefront;granularity is devicespecific:

GFX6-GFX8
  • sgprs_used 0..112
  • max(0, ceil(sgprs_used / 8) - 1)
GFX9
  • sgprs_used 0..112
  • 2 * max(0, ceil(sgprs_used / 16) - 1)
GFX10
Reserved, must be 0.(128 SGPRs alwaysallocated.)

Where sgprs_used isdefined as the highestSGPR number explicitlyreferenced plus one, plusa target-specific numberof additional specialSGPRs for VCC,FLAT_SCRATCH (GFX7+) andXNACK_MASK (GFX8+), andany additionaltarget-specificlimitations. It does notinclude the 16 SGPRs addedif a trap handler isenabled.

The target-specificlimitations and specialSGPR layout are defined inthe hardwaredocumentation, which canbe found in theProcessorstable.

Used by CP to set upCOMPUTE_PGM_RSRC1.SGPRS.

TheAssemblercalculates thisautomatically for theselected processor fromvalues provided to the.amdhsa_kernel directiveby the.amdhsa_next_free_sgprand .amdhsa_reserve*nested directives (seeAMDHSA Kernel Assembler Directives).

11:102 bitsPRIORITY

Must be 0.

Start executing wavefrontat the specified priority.

CP is responsible forfilling inCOMPUTE_PGM_RSRC1.PRIORITY.

13:122 bitsFLOAT_ROUND_MODE_32

Wavefront starts executionwith specified roundingmode for single (32bit) floating pointprecision floating pointoperations.

Floating point roundingmode values are defined inFloating Point Rounding Mode Enumeration Values.

Used by CP to set upCOMPUTE_PGM_RSRC1.FLOAT_MODE.

15:142 bitsFLOAT_ROUND_MODE_16_64

Wavefront starts executionwith specified roundingdenorm mode for half/double (16and 64-bit) floating pointprecision floating pointoperations.

Floating point roundingmode values are defined inFloating Point Rounding Mode Enumeration Values.

Used by CP to set upCOMPUTE_PGM_RSRC1.FLOAT_MODE.

17:162 bitsFLOAT_DENORM_MODE_32

Wavefront starts executionwith specified denorm modefor single (32bit) floating pointprecision floating pointoperations.

Floating point denorm modevalues are defined inFloating Point Denorm Mode Enumeration Values.

Used by CP to set upCOMPUTE_PGM_RSRC1.FLOAT_MODE.

19:182 bitsFLOAT_DENORM_MODE_16_64

Wavefront starts executionwith specified denorm modefor half/double (16and 64-bit) floating pointprecision floating pointoperations.

Floating point denorm modevalues are defined inFloating Point Denorm Mode Enumeration Values.

Used by CP to set upCOMPUTE_PGM_RSRC1.FLOAT_MODE.

201 bitPRIV

Must be 0.

Start executing wavefrontin privilege trap handlermode.

CP is responsible forfilling inCOMPUTE_PGM_RSRC1.PRIV.

211 bitENABLE_DX10_CLAMP

Wavefront starts executionwith DX10 clamp modeenabled. Used by the vectorALU to force DX10 styletreatment of NaN’s (whenset, clamp NaN to zero,otherwise pass NaNthrough).

Used by CP to set upCOMPUTE_PGM_RSRC1.DX10_CLAMP.

221 bitDEBUG_MODE

Must be 0.

Start executing wavefrontin single step mode.

CP is responsible forfilling inCOMPUTE_PGM_RSRC1.DEBUG_MODE.

231 bitENABLE_IEEE_MODE

Wavefront starts executionwith IEEE modeenabled. Floating pointopcodes that supportexception flag gatheringwill quiet and propagatesignaling-NaN inputs perIEEE 754-2008. Min_dx10 andmax_dx10 become IEEE754-2008 compliant due tosignaling-NaN propagationand quieting.

Used by CP to set upCOMPUTE_PGM_RSRC1.IEEE_MODE.

241 bitBULKY

Must be 0.

Only one work-group allowedto execute on a computeunit.

CP is responsible forfilling inCOMPUTE_PGM_RSRC1.BULKY.

251 bitCDBG_USER

Must be 0.

Flag that can be used tocontrol debugging code.

CP is responsible forfilling inCOMPUTE_PGM_RSRC1.CDBG_USER.

261 bitFP16_OVFL
GFX6-GFX8
Reserved, must be 0.
GFX9-GFX10

Wavefront starts executionwith specified fp16 overflowmode.

  • If 0, fp16 overflow generates+/-INF values.
  • If 1, fp16 overflow that is theresult of an +/-INF input valueor divide by 0 produces a +/-INF,otherwise clamps computedoverflow to +/-MAX_FP16 asappropriate.

Used by CP to set upCOMPUTE_PGM_RSRC1.FP16_OVFL.

28:272 bits Reserved, must be 0.
291 bitWGP_MODE
GFX6-GFX9
Reserved, must be 0.
GFX10
  • If 0 execute work-groups inCU wavefront execution mode.
  • If 1 execute work-groups in WGP wavefront execution mode.

See Memory Model.

Used by CP to set upCOMPUTE_PGM_RSRC1.WGP_MODE.

301 bitMEM_ORDERED
GFX6-9
Reserved, must be 0.
GFX10

Controls the behavior of thewaitcnt’s vmcnt and vscntcounters.

  • If 0 vmcnt reports completionof load and atomic with returnout of order with sampleinstructions, and the vscntreports the completion ofstore and atomic withoutreturn in order.
  • If 1 vmcnt reports completionof load, atomic with returnand sample instructions inorder, and the vscnt reportsthe completion of store andatomic without return in order.

Used by CP to set upCOMPUTE_PGM_RSRC1.MEM_ORDERED.

311 bitFWD_PROGRESS
GFX6-9
Reserved, must be 0.
GFX10
  • If 0 execute SIMD wavefrontsusing oldest first policy.
  • If 1 execute SIMD wavefronts toensure wavefronts will make someforward progress.

Used by CP to set upCOMPUTE_PGM_RSRC1.FWD_PROGRESS.

32Total size 4 bytes
compute_pgm_rsrc2 for GFX6-GFX10
BitsSizeField NameDescription
01 bitENABLE_SGPR_PRIVATE_SEGMENT_WAVEFRONT_OFFSET

Enable the setup of theSGPR wavefront scratch offsetsystem register (seeInitial Kernel Execution State).

Used by CP to set upCOMPUTE_PGM_RSRC2.SCRATCH_EN.

5:15 bitsUSER_SGPR_COUNT

The total number of SGPRuser data registersrequested. This number mustmatch the number of userdata registers enabled.

Used by CP to set upCOMPUTE_PGM_RSRC2.USER_SGPR.

61 bitENABLE_TRAP_HANDLER

Must be 0.

This bit representsCOMPUTE_PGM_RSRC2.TRAP_PRESENT,which is set by the CP ifthe runtime has installed atrap handler.

71 bitENABLE_SGPR_WORKGROUP_ID_X

Enable the setup of thesystem SGPR register forthe work-group id in the Xdimension (seeInitial Kernel Execution State).

Used by CP to set upCOMPUTE_PGM_RSRC2.TGID_X_EN.

81 bitENABLE_SGPR_WORKGROUP_ID_Y

Enable the setup of thesystem SGPR register forthe work-group id in the Ydimension (seeInitial Kernel Execution State).

Used by CP to set upCOMPUTE_PGM_RSRC2.TGID_Y_EN.

91 bitENABLE_SGPR_WORKGROUP_ID_Z

Enable the setup of thesystem SGPR register forthe work-group id in the Zdimension (seeInitial Kernel Execution State).

Used by CP to set upCOMPUTE_PGM_RSRC2.TGID_Z_EN.

101 bitENABLE_SGPR_WORKGROUP_INFO

Enable the setup of thesystem SGPR register forwork-group information (seeInitial Kernel Execution State).

Used by CP to set upCOMPUTE_PGM_RSRC2.TGID_SIZE_EN.

12:112 bitsENABLE_VGPR_WORKITEM_ID

Enable the setup of theVGPR system registers usedfor the work-item ID.System VGPR Work-Item ID Enumeration Valuesdefines the values.

Used by CP to set upCOMPUTE_PGM_RSRC2.TIDIG_CMP_CNT.

131 bitENABLE_EXCEPTION_ADDRESS_WATCH

Must be 0.

Wavefront starts execution with address watch exceptions enabled which are generated when L1 has witnessed a thread access an address of interest.

CP is responsible forfilling in the addresswatch bit inCOMPUTE_PGM_RSRC2.EXCP_EN_MSBaccording to what theruntime requests.

141 bitENABLE_EXCEPTION_MEMORY

Must be 0.

Wavefront starts execution with memory violation exceptions enabled which are generated when a memory violation has occurred for this wavefront from L1 or LDS (write-to-read-only-memory, mis-aligned atomic, LDS address out of range, illegal address, etc.).

CP sets the memoryviolation bit inCOMPUTE_PGM_RSRC2.EXCP_EN_MSBaccording to what theruntime requests.

23:159 bitsGRANULATED_LDS_SIZE

Must be 0.

CP uses the rounded value from the dispatch packet, not this value, as the dispatch may contain dynamically allocated group segment memory. CP writes directly to COMPUTE_PGM_RSRC2.LDS_SIZE.

Amount of group segment(LDS) to allocate for eachwork-group. Granularity isdevice specific:

GFX6:
roundup(lds-size / (64 * 4))
GFX7-GFX10:
roundup(lds-size / (128 * 4))
241 bitENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION

Wavefront starts executionwith specified exceptionsenabled.

Used by CP to set upCOMPUTE_PGM_RSRC2.EXCP_EN(set from bits 0..6).

IEEE 754 FP InvalidOperation

251 bitENABLE_EXCEPTION_FP_DENORMAL_SOURCEFP Denormal one or moreinput operands is adenormal number
261 bitENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZEROIEEE 754 FP Division byZero
271 bitENABLE_EXCEPTION_IEEE_754_FP_OVERFLOWIEEE 754 FP Overflow
281 bitENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOWIEEE 754 FP Underflow
291 bitENABLE_EXCEPTION_IEEE_754_FP_INEXACTIEEE 754 FP Inexact
301 bitENABLE_EXCEPTION_INT_DIVIDE_BY_ZEROInteger Division by Zero(rcp_iflag_f32 instructiononly)
311 bit Reserved, must be 0.
32Total size 4 bytes.
compute_pgm_rsrc3 for GFX10
BitsSizeField NameDescription
3:04 bitsSHARED_VGPR_COUNTNumber of shared VGPRs for wavefront size 64. Granularity 8. Value 0-120.compute_pgm_rsrc1.vgprs + shared_vgpr_cnt cannot exceed 64.
31:428bits Reserved, must be 0.
32Total size 4 bytes.
Floating Point Rounding Mode Enumeration Values
Enumeration NameValueDescription
FLOAT_ROUND_MODE_NEAR_EVEN0Round Ties To Even
FLOAT_ROUND_MODE_PLUS_INFINITY1Round Toward +infinity
FLOAT_ROUND_MODE_MINUS_INFINITY2Round Toward -infinity
FLOAT_ROUND_MODE_ZERO3Round Toward 0
Floating Point Denorm Mode Enumeration Values
Enumeration NameValueDescription
FLOAT_DENORM_MODE_FLUSH_SRC_DST0Flush Source and DestinationDenorms
FLOAT_DENORM_MODE_FLUSH_DST1Flush Output Denorms
FLOAT_DENORM_MODE_FLUSH_SRC2Flush Source Denorms
FLOAT_DENORM_MODE_FLUSH_NONE3No Flush
System VGPR Work-Item ID Enumeration Values
Enumeration NameValueDescription
SYSTEM_VGPR_WORKITEM_ID_X0Set work-item X dimensionID.
SYSTEM_VGPR_WORKITEM_ID_X_Y1Set work-item X and Ydimensions ID.
SYSTEM_VGPR_WORKITEM_ID_X_Y_Z2Set work-item X, Y and Zdimensions ID.
SYSTEM_VGPR_WORKITEM_ID_UNDEFINED3Undefined.

Initial Kernel Execution State

This section defines the register state that will be set up by the packet processor prior to the start of execution of every wavefront. This is limited by the constraints of the hardware controllers of CP/ADC/SPI.

The order of the SGPR registers is defined, but the compiler can specify which ones are actually set up in the kernel descriptor using the enable_sgpr* bit fields (see Kernel Descriptor). The register numbers used for enabled registers are dense starting at SGPR0: the first enabled register is SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have an SGPR number.

The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to all wavefronts of the grid. It is possible to specify more than 16 User SGPRs using the enable_sgpr* bit fields, in which case only the first 16 are actually initialized. These are then immediately followed by the System SGPRs that are set up by ADC/SPI and can have different values for each wavefront of the grid dispatch.

SGPR register initial state is defined inSGPR Register Set Up Order.

SGPR Register Set Up Order
SGPR OrderName(kernel descriptor enablefield)NumberofSGPRsDescription
FirstPrivate Segment Buffer(enable_sgpr_private_segment_buffer)4

V# that can be used, togetherwith Scratch Wavefront Offsetas an offset, to access theprivate address space using asegment address.

CP uses the value provided bythe runtime.

thenDispatch Ptr(enable_sgpr_dispatch_ptr)264-bit address of AQL dispatchpacket for kernel dispatchactually executing.
thenQueue Ptr(enable_sgpr_queue_ptr)264-bit address of amd_queue_tobject for AQL queue on whichthe dispatch packet wasqueued.
thenKernarg Segment Ptr(enable_sgpr_kernarg_segment_ptr)2

64-bit address of Kernargsegment. This is directlycopied from thekernarg_address in the kerneldispatch packet.

Having CP load it once avoidsloading it at the beginning ofevery wavefront.

thenDispatch Id(enable_sgpr_dispatch_id)264-bit Dispatch ID of thedispatch packet beingexecuted.
thenFlat Scratch Init(enable_sgpr_flat_scratch_init)2

This is 2 SGPRs:

GFX6
Not supported.
GFX7-GFX8

The first SGPR is a 32-bitbyte offset fromSH_HIDDEN_PRIVATE_BASE_VIMIDto per SPI base of memoryfor scratch for the queueexecuting the kerneldispatch. CP obtains thisfrom the runtime. (TheScratch Segment Buffer baseaddress isSH_HIDDEN_PRIVATE_BASE_VIMIDplus this offset.) The valueof Scratch Wavefront Offset mustbe added to this offset bythe kernel machine code,right shifted by 8, andmoved to the FLAT_SCRATCH_HISGPR register.FLAT_SCRATCH_HI correspondsto SGPRn-4 on GFX7, andSGPRn-6 on GFX8 (where SGPRnis the highest numbered SGPRallocated to the wavefront).FLAT_SCRATCH_HI ismultiplied by 256 (as it isin units of 256 bytes) andadded toSH_HIDDEN_PRIVATE_BASE_VIMIDto calculate the per wavefrontFLAT SCRATCH BASE in flatmemory instructions thataccess the scratchaperture.

The second SGPR is 32-bitbyte size of a singlework-item’s scratch memoryusage. CP obtains this fromthe runtime, and it isalways a multiple of DWORD.CP checks that the value inthe kernel dispatch packetPrivate Segment Byte Size isnot larger, and requests theruntime to increase thequeue’s scratch size ifnecessary. The kernel codemust move it toFLAT_SCRATCH_LO which isSGPRn-3 on GFX7 and SGPRn-5on GFX8. FLAT_SCRATCH_LO isused as the FLAT SCRATCHSIZE in flat memoryinstructions. Having CP loadit once avoids loading it atthe beginning of everywavefront.

GFX9-GFX10
This is the64-bit base address of theper SPI scratch backingmemory managed by SPI forthe queue executing thekernel dispatch. CP obtainsthis from the runtime (anddivides it if there aremultiple Shader Arrays eachwith its own SPI). The valueof Scratch Wavefront Offset mustbe added by the kernelmachine code and the resultmoved to the FLAT_SCRATCHSGPR which is SGPRn-6 andSGPRn-5. It is used as theFLAT SCRATCH BASE in flatmemory instructions.
thenPrivate Segment Size(enable_sgpr_private_segment_size)1

The 32-bit byte size of a single work-item’s scratch memory allocation. This is the value from the kernel dispatch packet Private Segment Byte Size rounded up by CP to a multiple of DWORD.

Having CP load it once avoidsloading it at the beginning ofevery wavefront.

This is not used forGFX7-GFX8 since it is the samevalue as the second SGPR ofFlat Scratch Init. However, itmay be needed for GFX9-GFX10 whichchanges the meaning of theFlat Scratch Init value.

thenGrid Work-Group Count X(enable_sgpr_grid_workgroup_count_X)132-bit count of the number ofwork-groups in the X dimensionfor the grid beingexecuted. Computed from thefields in the kernel dispatchpacket as ((grid_size.x +workgroup_size.x - 1) /workgroup_size.x).
thenGrid Work-Group Count Y(enable_sgpr_grid_workgroup_count_Y &&less than 16 previousSGPRs)1

32-bit count of the number of work-groups in the Y dimension for the grid being executed. Computed from the fields in the kernel dispatch packet as ((grid_size.y + workgroup_size.y - 1) / workgroup_size.y).

Only initialized if <16previous SGPRs initialized.

thenGrid Work-Group Count Z(enable_sgpr_grid_workgroup_count_Z &&less than 16 previousSGPRs)1

32-bit count of the number of work-groups in the Z dimension for the grid being executed. Computed from the fields in the kernel dispatch packet as ((grid_size.z + workgroup_size.z - 1) / workgroup_size.z).

Only initialized if <16previous SGPRs initialized.

thenWork-Group Id X(enable_sgpr_workgroup_id_X)132-bit work-group id in Xdimension of grid forwavefront.
thenWork-Group Id Y(enable_sgpr_workgroup_id_Y)132-bit work-group id in Ydimension of grid forwavefront.
thenWork-Group Id Z(enable_sgpr_workgroup_id_Z)132-bit work-group id in Zdimension of grid forwavefront.
thenWork-Group Info(enable_sgpr_workgroup_info)1{first_wavefront, 14’b0000,ordered_append_term[10:0],threadgroup_size_in_wavefronts[5:0]}
thenScratch Wavefront Offset(enable_sgpr_private_segment_wavefront_offset)132-bit byte offset from baseof scratch base of queueexecuting the kerneldispatch. Must be used as anoffset with Privatesegment address when usingScratch Segment Buffer. Itmust be used to set up FLATSCRATCH for flat addressing(seeFlat Scratch).

The order of the VGPR registers is defined, but the compiler can specify which ones are actually set up in the kernel descriptor using the enable_vgpr* bit fields (see Kernel Descriptor). The register numbers used for enabled registers are dense starting at VGPR0: the first enabled register is VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a VGPR number.

VGPR register initial state is defined inVGPR Register Set Up Order.

VGPR Register Set Up Order
VGPR OrderName(kernel descriptor enablefield)NumberofVGPRsDescription
FirstWork-Item Id X(Always initialized)132-bit work item id in Xdimension of work-group forwavefront lane.
thenWork-Item Id Y(enable_vgpr_workitem_id> 0)132-bit work item id in Ydimension of work-group forwavefront lane.
thenWork-Item Id Z(enable_vgpr_workitem_id> 1)132-bit work item id in Zdimension of work-group forwavefront lane.

The setting of registers is done by GPU CP/ADC/SPI hardware as follows:

  • SGPRs before the Work-Group Ids are set by CP using the 16 User Data registers.
  • Work-group Id registers X, Y, Z are set by ADC which supports any combination including none.
  • Scratch Wavefront Offset is set by SPI on a per wavefront basis which is why its value cannot be included with the flat scratch init value which is per queue.
  • The VGPRs are set by SPI which only supports specifying either (X), (X, Y) or (X, Y, Z).

The Flat Scratch register pair are adjacent SGPRs so they can be moved as a 64-bit value to the hardware required SGPRn-3 and SGPRn-4 respectively.

The global segment can be accessed either using buffer instructions (GFX6, which has V# 64-bit address support), flat instructions (GFX7-GFX10), or global instructions (GFX9-GFX10).

If buffer operations are used then the compiler can generate a V# with the following properties:

  • base address of 0
  • no swizzle
  • ATC: 1 if IOMMU present (such as APU)
  • ptr64: 1
  • MTYPE set to support memory coherence that matches the runtime (such as CC for APU and NC for dGPU).

Kernel Prolog

The compiler performs initialization in the kernel prologue depending on the target and information about things like stack usage in the kernel and called functions. Some of this initialization requires the compiler to request certain User and System SGPRs be present in the Initial Kernel Execution State via the Kernel Descriptor.

CFI
  • The CFI return address is undefined.
  • The CFI CFA is defined using an expression which evaluates to a memory location description for the private segment address 0.
M0
  • GFX6-GFX8
  • The M0 register must be initialized with a value at least the total LDS size if the kernel may access LDS via DS or flat operations. The total LDS size is available in the dispatch packet. For M0, it is also possible to use the maximum possible value of LDS for the given target (0x7FFF for GFX6 and 0xFFFF for GFX7-GFX8).
  • GFX9-GFX10
  • The M0 register is not used for range checking LDS accesses and so does notneed to be initialized in the prolog.
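The M0 choices above can be sketched as follows. This is an illustrative helper only; the function name and arguments are assumptions for illustration and not part of LLVM:

```python
# Illustrative sketch of the M0 initialization choices for GFX6-GFX8.
# (Helper name and signature are assumptions, not an LLVM API.)

GFX6, GFX7, GFX8 = 6, 7, 8

def m0_init_value(gfx, total_lds_size=None):
    """Value to initialize M0 with on GFX6-GFX8.

    Use the total LDS size from the dispatch packet if known; otherwise
    fall back to the maximum LDS size for the given target (0x7FFF for
    GFX6 and 0xFFFF for GFX7-GFX8).
    """
    if total_lds_size is not None:
        return total_lds_size
    return 0x7FFF if gfx == GFX6 else 0xFFFF
```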
Stack Pointer

If the kernel has function calls it must set up the ABI stack pointer described in Non-Kernel Functions by setting SGPR32 to the unswizzled scratch offset of the address past the last local allocation.

Frame Pointer

If the kernel needs a frame pointer for the reasons defined in SIFrameLowering then SGPR33 is used and is always set to 0 in the kernel prolog. If a frame pointer is not required then all uses of the frame pointer are replaced with immediate 0 offsets.

Flat Scratch

If the kernel or any function it calls may use flat operations to access scratch memory, the prolog code must set up the FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wavefront Offset SGPR registers (see Initial Kernel Execution State):

  • GFX6
  • Flat scratch is not supported.

  • GFX7-GFX8

  1. The low word of Flat Scratch Init is the 32-bit byte offset from SH_HIDDEN_PRIVATE_BASE_VIMID to the base of scratch backing memory being managed by SPI for the queue executing the kernel dispatch. This is the same value used in the Scratch Segment Buffer V# base address. The prolog must add the value of Scratch Wavefront Offset to get the wavefront's byte scratch backing memory offset from SH_HIDDEN_PRIVATE_BASE_VIMID. Since FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted by 8 before moving into FLAT_SCRATCH_LO.
  2. The second word of Flat Scratch Init is the 32-bit byte size of a single work-item's scratch memory usage. This is directly loaded from the kernel dispatch packet Private Segment Byte Size and rounded up to a multiple of DWORD. Having CP load it once avoids loading it at the beginning of every wavefront. The prolog must move it to FLAT_SCRATCH_HI for use as FLAT SCRATCH SIZE.
  • GFX9-GFX10
  • The Flat Scratch Init is the 64-bit address of the base of scratch backing memory being managed by SPI for the queue executing the kernel dispatch. The prolog must add the value of Scratch Wavefront Offset and move it to the FLAT_SCRATCH pair for use as the flat scratch base in flat memory instructions.
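The GFX7-GFX8 arithmetic described above can be sketched as a small calculation. This is an illustrative sketch under assumed input names (the rounding to DWORD is performed by CP per the text, but is repeated here for self-containment):

```python
# Illustrative sketch of the GFX7-GFX8 FLAT_SCRATCH setup arithmetic.
# Parameter names are assumptions for illustration, not hardware names.

DWORD = 4

def flat_scratch_init_gfx7_gfx8(flat_scratch_init_lo,
                                scratch_wavefront_offset,
                                private_segment_byte_size):
    # 1. Wavefront byte offset from SH_HIDDEN_PRIVATE_BASE_VIMID; since
    #    FLAT_SCRATCH_LO is in units of 256 bytes, shift right by 8.
    flat_scratch_lo = (flat_scratch_init_lo + scratch_wavefront_offset) >> 8
    # 2. Per-work-item scratch size, rounded up to a multiple of DWORD,
    #    is moved to FLAT_SCRATCH_HI for use as FLAT SCRATCH SIZE.
    flat_scratch_hi = (private_segment_byte_size + DWORD - 1) // DWORD * DWORD
    return flat_scratch_lo, flat_scratch_hi
```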
Private Segment Buffer

A set of four SGPRs beginning at a four-aligned SGPR index are always selected to serve as the scratch V# for the kernel as follows:

  • If it is known during instruction selection that there is stack usage, SGPR0-3 is reserved for use as the scratch V#. Stack usage is assumed if optimizations are disabled (-O0), if stack objects already exist (for locals, etc.), or if there are any function calls.

  • Otherwise, four high numbered SGPRs beginning at a four-aligned SGPR index are reserved for the tentative scratch V#. These will be used if it is determined that spilling is needed.

    • If no use is made of the tentative scratch V#, then it is unreserved and the register count is determined ignoring it.
    • If use is made of the tentative scratch V#, then its register numbers are shifted to the first four-aligned SGPR index after the highest one allocated by the register allocator, and all uses are updated. The register count includes them in the shifted location.
    • In either case, if the processor has the SGPR allocation bug, the tentative allocation is not shifted or unreserved in order to ensure the register count is higher to work around the bug.

    Note

    This approach of using a tentative scratch V# and shifting the register numbers if used avoids having to perform register allocation a second time if the tentative V# is eliminated. This is more efficient and avoids the problem that the second register allocation may perform spilling which will fail as there is no longer a scratch V#.
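The shifted location described above can be computed directly. A minimal sketch, assuming SGPR indices are simple integers and the V# must start strictly after the highest allocated SGPR:

```python
# Illustrative sketch (not LLVM's implementation) of choosing the
# shifted base for a tentative scratch V#: the first four-aligned SGPR
# index after the highest SGPR allocated by the register allocator.

def shifted_scratch_vsharp_base(highest_allocated_sgpr):
    """First four-aligned SGPR index strictly after the given index."""
    return (highest_allocated_sgpr + 4) // 4 * 4
```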

When the kernel prolog code is being emitted it is known whether the scratch V# described above is actually used. If it is, the prolog code must set it up by copying the Private Segment Buffer to the scratch V# registers and then adding the Private Segment Wavefront Offset to the queue base address in the V#. The result is a V# with a base address pointing to the beginning of the wavefront scratch backing memory.

The Private Segment Buffer is always requested, but the Private Segment Wavefront Offset is only requested if it is used (see Initial Kernel Execution State).

Memory Model

This section describes the mapping of the LLVM memory model onto AMDGPU machine code (see Memory Model for Concurrent Operations).

The AMDGPU backend supports the memory synchronization scopes specified in Memory Scopes.

The code sequences used to implement the memory model are defined in table AMDHSA Memory Model Code Sequences GFX6-GFX10.

The sequences specify the order of instructions that a single thread must execute. The s_waitcnt and buffer_wbinvl1_vol are defined with respect to other memory instructions executed by the same thread. This allows them to be moved earlier or later, which can allow them to be combined with other instances of the same instruction, or hoisted/sunk out of loops to improve performance. Only the instructions related to the memory model are given; additional s_waitcnt instructions are required to ensure registers are defined before being used. These may be able to be combined with the memory model s_waitcnt instructions as described above.

The AMDGPU backend supports the following memory models:

HSA Memory Model [HSA]
The HSA memory model uses a single happens-before relation for all address spaces (see Address Spaces).
OpenCL Memory Model [OpenCL]
The OpenCL memory model has separate happens-before relations for the global and local address spaces. Only a fence specifying both global and local address space, and seq_cst instructions, join the relationships. Since the LLVM memfence instruction does not allow an address space to be specified, the OpenCL fence has to conservatively assume both local and global address space was specified. However, optimizations can often be done to eliminate the additional s_waitcnt instructions when there are no intervening memory instructions which access the corresponding address space. The code sequences in the table indicate what can be omitted for the OpenCL memory model. The target triple environment is used to determine if the source language is OpenCL (see OpenCL).

ds/flat_load/store/atomic instructions to local memory are termed LDS operations.

buffer/global/flat_load/store/atomic instructions to global memory are termed vector memory operations.

For GFX6-GFX9:

  • Each agent has multiple shader arrays (SA).
  • Each SA has multiple compute units (CU).
  • Each CU has multiple SIMDs that execute wavefronts.
  • The wavefronts for a single work-group are executed in the same CU but may be executed by different SIMDs.
  • Each CU has a single LDS memory shared by the wavefronts of the work-groups executing on it.
  • All LDS operations of a CU are performed as wavefront wide operations in a global order and involve no caching. Completion is reported to a wavefront in execution order.
  • The LDS memory has multiple request queues shared by the SIMDs of a CU. Therefore, the LDS operations performed by different wavefronts of a work-group can be reordered relative to each other, which can result in reordering the visibility of vector memory operations with respect to LDS operations of other wavefronts in the same work-group. A s_waitcnt lgkmcnt(0) is required to ensure synchronization between LDS operations and vector memory operations between wavefronts of a work-group, but not between operations performed by the same wavefront.
  • The vector memory operations are performed as wavefront wide operations and completion is reported to a wavefront in execution order. The exception is that for GFX7-GFX9 flat_load/store/atomic instructions can report out of vector memory order if they access LDS memory, and out of LDS operation order if they access global memory.
  • The vector memory operations access a single vector L1 cache shared by all SIMDs of a CU. Therefore, no special action is required for coherence between the lanes of a single wavefront, or for coherence between wavefronts in the same work-group. A buffer_wbinvl1_vol is required for coherence between wavefronts executing in different work-groups as they may be executing on different CUs.
  • The scalar memory operations access a scalar L1 cache shared by all wavefronts on a group of CUs. The scalar and vector L1 caches are not coherent. However, scalar operations are used in a restricted way so do not impact the memory model. See Address Spaces.
  • The vector and scalar memory operations use an L2 cache shared by all CUs on the same agent.
  • The L2 cache has independent channels to service disjoint ranges of virtual addresses.
  • Each CU has a separate request queue per channel. Therefore, the vector and scalar memory operations performed by wavefronts executing in different work-groups (which may be executing on different CUs) of an agent can be reordered relative to each other. A s_waitcnt vmcnt(0) is required to ensure synchronization between vector memory operations of different CUs. It ensures a previous vector memory operation has completed before executing a subsequent vector memory or LDS operation and so can be used to meet the requirements of acquire and release.
  • The L2 cache can be kept coherent with other agents on some targets, or ranges of virtual addresses can be set up to bypass it to ensure system coherence.

For GFX10:

  • Each agent has multiple shader arrays (SA).
  • Each SA has multiple work-group processors (WGP).
  • Each WGP has multiple compute units (CU).
  • Each CU has multiple SIMDs that execute wavefronts.
  • The wavefronts for a single work-group are executed in the same WGP. In CU wavefront execution mode the wavefronts may be executed by different SIMDs in the same CU. In WGP wavefront execution mode the wavefronts may be executed by different SIMDs in different CUs in the same WGP.
  • Each WGP has a single LDS memory shared by the wavefronts of the work-groups executing on it.
  • All LDS operations of a WGP are performed as wavefront wide operations in a global order and involve no caching. Completion is reported to a wavefront in execution order.
  • The LDS memory has multiple request queues shared by the SIMDs of a WGP. Therefore, the LDS operations performed by different wavefronts of a work-group can be reordered relative to each other, which can result in reordering the visibility of vector memory operations with respect to LDS operations of other wavefronts in the same work-group. A s_waitcnt lgkmcnt(0) is required to ensure synchronization between LDS operations and vector memory operations between wavefronts of a work-group, but not between operations performed by the same wavefront.
  • The vector memory operations are performed as wavefront wide operations. Completion of load/store/sample operations are reported to a wavefront in execution order of other load/store/sample operations performed by that wavefront.
  • The vector memory operations access a vector L0 cache. There is a single L0 cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no special action is required for coherence between the lanes of a single wavefront. However, a BUFFER_GL0_INV is required for coherence between wavefronts executing in the same work-group as they may be executing on SIMDs of different CUs that access different L0s. A BUFFER_GL0_INV is also required for coherence between wavefronts executing in different work-groups as they may be executing on different WGPs.
  • The scalar memory operations access a scalar L0 cache shared by all wavefronts on a WGP. The scalar and vector L0 caches are not coherent. However, scalar operations are used in a restricted way so do not impact the memory model. See Address Spaces.
  • The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on the same SA. Therefore, no special action is required for coherence between the wavefronts of a single work-group. However, a BUFFER_GL1_INV is required for coherence between wavefronts executing in different work-groups as they may be executing on different SAs that access different L1s.
  • The L1 caches have independent quadrants to service disjoint ranges of virtual addresses.
  • Each L0 cache has a separate request queue per L1 quadrant. Therefore, the vector and scalar memory operations performed by different wavefronts, whether executing in the same or different work-groups (which may be executing on different CUs accessing different L0s), can be reordered relative to each other. A s_waitcnt vmcnt(0) & vscnt(0) is required to ensure synchronization between vector memory operations of different wavefronts. It ensures a previous vector memory operation has completed before executing a subsequent vector memory or LDS operation and so can be used to meet the requirements of acquire, release and sequential consistency.
  • The L1 caches use an L2 cache shared by all SAs on the same agent.
  • The L2 cache has independent channels to service disjoint ranges of virtual addresses.
  • Each L1 quadrant of a single SA accesses a different L2 channel. Each L1 quadrant has a separate request queue per L2 channel. Therefore, the vector and scalar memory operations performed by wavefronts executing in different work-groups (which may be executing on different SAs) of an agent can be reordered relative to each other. A s_waitcnt vmcnt(0) & vscnt(0) is required to ensure synchronization between vector memory operations of different SAs. It ensures a previous vector memory operation has completed before executing a subsequent vector memory operation and so can be used to meet the requirements of acquire, release and sequential consistency.
  • The L2 cache can be kept coherent with other agents on some targets, or ranges of virtual addresses can be set up to bypass it to ensure system coherence.

Private address space uses buffer_load/store using the scratch V# (GFX6-GFX8), or scratch_load/store (GFX9-GFX10). Since only a single thread is accessing the memory, atomic memory orderings are not meaningful and all accesses are treated as non-atomic.

Constant address space uses buffer/global_load instructions (or equivalent scalar memory instructions). Since the constant address space contents do not change during the execution of a kernel dispatch it is not legal to perform stores, and atomic memory orderings are not meaningful and all accesses are treated as non-atomic.

A memory synchronization scope wider than work-group is not meaningful for the group (LDS) address space and is treated as work-group.

The memory model does not support the region address space which is treated as non-atomic.

Acquire memory ordering is not meaningful on store atomic instructions and is treated as non-atomic.

Release memory ordering is not meaningful on load atomic instructions and is treated as non-atomic.

Acquire-release memory ordering is not meaningful on load or store atomic instructions and is treated as acquire and release respectively.
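The ordering-downgrade rules above can be summarized as a small mapping. A minimal sketch, illustrative only (not LLVM's implementation), assuming instruction kinds are named "load" and "store":

```python
# Illustrative sketch of how memory orderings are downgraded on load
# and store atomic instructions, per the rules stated above.

def effective_ordering(instr, ordering):
    """Map (instruction kind, requested ordering) to the ordering used."""
    if instr == "store" and ordering == "acquire":
        return "non-atomic"   # acquire not meaningful on stores
    if instr == "load" and ordering == "release":
        return "non-atomic"   # release not meaningful on loads
    if ordering == "acq_rel":
        # acquire-release splits into acquire (loads) / release (stores)
        return "acquire" if instr == "load" else "release"
    return ordering
```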

The AMDGPU backend only uses scalar memory operations to access memory that is proven to not change during the execution of the kernel dispatch. This includes constant address space and global address space for program scope const variables. Therefore the kernel machine code does not have to maintain the scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar and vector L1 caches are invalidated between kernel dispatches by CP since constant address space data may change between kernel dispatch executions. See Address Spaces.

The one exception is if scalar writes are used to spill SGPR registers. In this case the AMDGPU backend ensures the memory location used to spill is never accessed by vector memory operations at the same time. If scalar writes are used then a s_dcache_wb is inserted before the s_endpgm and before a function return since the locations may be used for vector memory instructions by a future wavefront that uses the same scratch area, or a function call that creates a frame at the same address, respectively. There is no need for a s_dcache_inv as all scalar writes are write-before-read in the same thread.

For GFX6-GFX9, scratch backing memory (which is used for the private address space) is accessed with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is only accessed by a single thread, and is always write-before-read, there is never a need to invalidate these entries from the L1 cache. Hence all cache invalidates are done as *_vol to only invalidate the volatile cache lines.

For GFX10, scratch backing memory (which is used for the private address space) is accessed with MTYPE NC (non-coherent). Since the private address space is only accessed by a single thread, and is always write-before-read, there is never a need to invalidate these entries from the L0 or L1 caches.

For GFX10, wavefronts are executed in native mode with in-order reporting of loads and sample instructions. In this mode vmcnt reports completion of load, atomic with return and sample instructions in order, and the vscnt reports the completion of store and atomic without return in order. See MEM_ORDERED field in compute_pgm_rsrc1 for GFX6-GFX10.

In GFX10, wavefronts can be executed in WGP or CU wavefront execution mode:

  • In WGP wavefront execution mode the wavefronts of a work-group are executed on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per CU L0 caches is required for work-group synchronization. Also accesses to L1 at work-group scope need to be explicitly ordered as the accesses from different CUs are not ordered.
  • In CU wavefront execution mode the wavefronts of a work-group are executed on the SIMDs of a single CU of the WGP. Therefore, all global memory access by the work-group access the same L0 which in turn ensures L1 accesses are ordered and so do not require explicit management of the caches for work-group synchronization.

See WGP_MODE field in compute_pgm_rsrc1 for GFX6-GFX10 and Target Features.

On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing to invalidate the L2 cache. For GFX6-GFX9, this also causes it to be treated as non-volatile and so is not invalidated by *_vol. On APU it is accessed as CC (cache coherent) and so the L2 cache will be coherent with the CPU and other agents.

AMDHSA Memory Model Code Sequences GFX6-GFX10
LLVM Instr | LLVM Memory Ordering | LLVM Memory Sync Scope | AMDGPU Address Space | AMDGPU Machine Code GFX6-9 | AMDGPU Machine Code GFX10
Non-Atomic
load | none | none
  • global
  • generic
  • private
  • constant
  • !volatile & !nontemporal
    1. buffer/global/flat_load
  • volatile & !nontemporal
    1. buffer/global/flat_load glc=1
  • nontemporal
    1. buffer/global/flat_load glc=1 slc=1
  • !volatile & !nontemporal
    1. buffer/global/flat_load
  • volatile & !nontemporal
    1. buffer/global/flat_load glc=1 dlc=1
  • nontemporal
    1. buffer/global/flat_load slc=1
load | none | none
  • local
  1. ds_load
  1. ds_load
store | none | none
  • global
  • generic
  • private
  • constant
  • !nontemporal
    1. buffer/global/flat_store
  • nontemporal
    1. buffer/global/flat_store glc=1 slc=1
  • !nontemporal

    1. buffer/global/flat_store
  • nontemporal

    1. buffer/global/flat_store slc=1
store | none | none
  • local
  1. ds_store
  1. ds_store
Unordered Atomic
load atomic | unordered | any | any | Same as non-atomic. | Same as non-atomic.
store atomic | unordered | any | any | Same as non-atomic. | Same as non-atomic.
atomicrmw | unordered | any | any | Same as monotonic atomic. | Same as monotonic atomic.
Monotonic Atomic
load atomic | monotonic
  • singlethread
  • wavefront
  • global
  • generic
  1. buffer/global/flat_load
  1. buffer/global/flat_load
load atomic | monotonic
  • workgroup
  • global
  • generic
  1. buffer/global/flat_load
  1. buffer/global/flat_load glc=1
  • If CU wavefront execution mode, omit glc=1.
load atomic | monotonic
  • singlethread
  • wavefront
  • workgroup
  • local
  1. ds_load
  1. ds_load
load atomic | monotonic
  • agent
  • system
  • global
  • generic
  1. buffer/global/flat_load glc=1
  1. buffer/global/flat_load glc=1 dlc=1
store atomic | monotonic
  • singlethread
  • wavefront
  • workgroup
  • agent
  • system
  • global
  • generic
  1. buffer/global/flat_store
  1. buffer/global/flat_store
store atomic | monotonic
  • singlethread
  • wavefront
  • workgroup
  • local
  1. ds_store
  1. ds_store
atomicrmw | monotonic
  • singlethread
  • wavefront
  • workgroup
  • agent
  • system
  • global
  • generic
  1. buffer/global/flat_atomic
  1. buffer/global/flat_atomic
atomicrmw | monotonic
  • singlethread
  • wavefront
  • workgroup
  • local
  1. ds_atomic
  1. ds_atomic
Acquire Atomic
load atomic | acquire
  • singlethread
  • wavefront
  • global
  • local
  • generic
  1. buffer/global/ds/flat_load
  1. buffer/global/ds/flat_load
load atomic | acquire
  • workgroup
  • global
  1. buffer/global/flat_load
  1. buffer/global_load glc=1
  • If CU wavefront execution mode, omit glc=1.
  1. s_waitcnt vmcnt(0)
  • If CU wavefront execution mode, omit.
  • Must happen before the following buffer_gl0_inv and before any following global/generic load/load atomic/store/store atomic/atomicrmw.
  1. buffer_gl0_inv
  • If CU wavefront execution mode, omit.
  • Ensures that following loads will not see stale data.
load atomic | acquire
  • workgroup
  • local
  1. ds_load
  2. s_waitcnt lgkmcnt(0)
  • If OpenCL, omit.
  • Must happen before any following global/generic load/load atomic/store/store atomic/atomicrmw.
  • Ensures any following global data read is no older than the load atomic value being acquired.
  1. ds_load
  2. s_waitcnt lgkmcnt(0)
  • If OpenCL, omit.
  • Must happen before the following buffer_gl0_inv and before any following global/generic load/load atomic/store/store atomic/atomicrmw.
  • Ensures any following global data read is no older than the load atomic value being acquired.
  1. buffer_gl0_inv
  • If CU wavefront execution mode, omit.
  • If OpenCL, omit.
  • Ensures that following loads will not see stale data.
load atomic | acquire
  • workgroup
  • generic
  1. flat_load
  2. s_waitcnt lgkmcnt(0)
  • If OpenCL, omit.
  • Must happen before any following global/generic load/load atomic/store/store atomic/atomicrmw.
  • Ensures any following global data read is no older than the load atomic value being acquired.
  1. flat_load glc=1
  • If CU wavefront execution mode, omit glc=1.
  1. s_waitcnt lgkmcnt(0) & vmcnt(0)
  • If CU wavefront execution mode, omit vmcnt.
  • If OpenCL, omit lgkmcnt(0).
  • Must happen before the following buffer_gl0_inv and any following global/generic load/load atomic/store/store atomic/atomicrmw.
  • Ensures any following global data read is no older than the load atomic value being acquired.
  1. buffer_gl0_inv
  • If CU wavefront execution mode, omit.
  • Ensures that following loads will not see stale data.
load atomic | acquire
  • agent
  • system
  • global
  1. buffer/global/flat_load glc=1
  2. s_waitcnt vmcnt(0)
  • Must happen before following buffer_wbinvl1_vol.
  • Ensures the load has completed before invalidating the cache.
  1. buffer_wbinvl1_vol
  • Must happen before any following global/generic load/load atomic/atomicrmw.
  • Ensures that following loads will not see stale global data.
  1. buffer/global_load glc=1 dlc=1
  2. s_waitcnt vmcnt(0)
  • Must happen before following buffer_gl0_inv and buffer_gl1_inv.
  • Ensures the load has completed before invalidating the caches.
  1. buffer_gl0_inv; buffer_gl1_inv
  • Must happen before any following global/generic load/load atomic/atomicrmw.
  • Ensures that following loads will not see stale global data.
load atomic | acquire
  • agent
  • system
  • generic
  1. flat_load glc=1
  2. s_waitcnt vmcnt(0) & lgkmcnt(0)
  • If OpenCL, omit lgkmcnt(0).
  • Must happen before following buffer_wbinvl1_vol.
  • Ensures the flat_load has completed before invalidating the cache.
  1. buffer_wbinvl1_vol
  • Must happen before any following global/generic load/load atomic/atomicrmw.
  • Ensures that following loads will not see stale global data.
  1. flat_load glc=1 dlc=1
  2. s_waitcnt vmcnt(0) & lgkmcnt(0)
  • If OpenCL, omit lgkmcnt(0).
  • Must happen before following buffer_gl0_inv and buffer_gl1_inv.
  • Ensures the flat_load has completed before invalidating the caches.
  1. buffer_gl0_inv; buffer_gl1_inv
  • Must happen before any following global/generic load/load atomic/atomicrmw.
  • Ensures that following loads will not see stale global data.
atomicrmw | acquire
  • singlethread
  • wavefront
  • global
  • local
  • generic
  1. buffer/global/ds/flat_atomic
  1. buffer/global/ds/flat_atomic
atomicrmw | acquire
  • workgroup
  • global
  1. buffer/global/flat_atomic
  1. buffer/global_atomic
  2. s_waitcnt vm/vscnt(0)
  • If CU wavefront execution mode, omit.
  • Use vmcnt if atomic with return and vscnt if atomic with no-return.
  • Must happen before the following buffer_gl0_inv and before any following global/generic load/load atomic/store/store atomic/atomicrmw.
  1. buffer_gl0_inv
  • If CU wavefront execution mode, omit.
  • Ensures that following loads will not see stale data.
atomicrmw | acquire
  • workgroup
  • local
  1. ds_atomic
  2. s_waitcnt lgkmcnt(0)
  • If OpenCL, omit.
  • Must happen before any following global/generic load/load atomic/store/store atomic/atomicrmw.
  • Ensures any following global data read is no older than the atomicrmw value being acquired.
  1. ds_atomic
  2. s_waitcnt lgkmcnt(0)
  • If OpenCL, omit.
  • Must happen before the following buffer_gl0_inv.
  • Ensures any following global data read is no older than the atomicrmw value being acquired.
  1. buffer_gl0_inv
  • If OpenCL, omit.
  • Ensures that following loads will not see stale data.
atomicrmw | acquire
  • workgroup
  • generic
  1. flat_atomic
  2. s_waitcnt lgkmcnt(0)
  • If OpenCL, omit.
  • Must happen before any following global/generic load/load atomic/store/store atomic/atomicrmw.
  • Ensures any following global data read is no older than the atomicrmw value being acquired.
  1. flat_atomic
  2. s_waitcnt lgkmcnt(0) & vm/vscnt(0)
  • If CU wavefront execution mode, omit vm/vscnt.
  • If OpenCL, omit lgkmcnt(0).
  • Use vmcnt if atomic with return and vscnt if atomic with no-return.
  • Must happen before the following buffer_gl0_inv.
  • Ensures any following global data read is no older than the atomicrmw value being acquired.
  1. buffer_gl0_inv
  • If CU wavefront execution mode, omit.
  • Ensures that following loads will not see stale data.
atomicrmw | acquire
  • agent
  • system
  • global
  1. buffer/global/flat_atomic
  2. s_waitcnt vmcnt(0)
  • Must happen before following buffer_wbinvl1_vol.
  • Ensures the atomicrmw has completed before invalidating the cache.
  1. buffer_wbinvl1_vol
  • Must happen before any following global/generic load/load atomic/atomicrmw.
  • Ensures that following loads will not see stale global data.
  1. buffer/global_atomic
  2. s_waitcnt vm/vscnt(0)
  • Use vmcnt if atomic with return and vscnt if atomic with no-return.
  • Must happen before following buffer_gl0_inv and buffer_gl1_inv.
  • Ensures the atomicrmw has completed before invalidating the caches.
  1. buffer_gl0_inv; buffer_gl1_inv
  • Must happen before any following global/generic load/load atomic/atomicrmw.
  • Ensures that following loads will not see stale global data.
atomicrmw | acquire
  • agent
  • system
  • generic
  1. flat_atomic
  2. s_waitcnt vmcnt(0) & lgkmcnt(0)
  • If OpenCL, omit lgkmcnt(0).
  • Must happen before following buffer_wbinvl1_vol.
  • Ensures the atomicrmw has completed before invalidating the cache.
  1. buffer_wbinvl1_vol
  • Must happen before any following global/generic load/load atomic/atomicrmw.
  • Ensures that following loads will not see stale global data.
  1. flat_atomic
  2. s_waitcnt vm/vscnt(0) & lgkmcnt(0)
  • If OpenCL, omit lgkmcnt(0).
  • Use vmcnt if atomic with return and vscnt if atomic with no-return.
  • Must happen before following buffer_gl0_inv and buffer_gl1_inv.
  • Ensures the atomicrmw has completed before invalidating the caches.
  1. buffer_gl0_inv; buffer_gl1_inv
  • Must happen before any following global/generic load/load atomic/atomicrmw.
  • Ensures that following loads will not see stale global data.
fence | acquire
  • singlethread
  • wavefront
none | none | none
fence | acquire
  • workgroup
none
  1. s_waitcnt lgkmcnt(0)
  • If OpenCL and address space is not generic, omit.
  • However, since LLVM currently has no address space on the fence need to conservatively always generate. If fence had an address space then set to address space of OpenCL fence flag, or to generic if both local and global flags are specified.
  • Must happen after any preceding local/generic load atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the fence-paired-atomic).
  • Must happen before any following global/generic load/load atomic/store/store atomic/atomicrmw.
  • Ensures any following global data read is no older than the value read by the fence-paired-atomic.
  1. s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)
  • If CU wavefront execution mode, omit vmcnt and vscnt.
  • If OpenCL and address space is not generic, omit lgkmcnt(0).
  • If OpenCL and address space is local, omit vmcnt(0) and vscnt(0).
  • However, since LLVM currently has no address space on the fence need to conservatively always generate. If fence had an address space then set to address space of OpenCL fence flag, or to generic if both local and global flags are specified.
  • Could be split into separate s_waitcnt vmcnt(0), s_waitcnt vscnt(0) and s_waitcnt lgkmcnt(0) to allow them to be independently moved according to the following rules.
  • s_waitcnt vmcnt(0) must happen after any preceding global/generic load atomic/atomicrmw-with-return-value with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the fence-paired-atomic).
  • s_waitcnt vscnt(0) must happen after any preceding global/generic atomicrmw-no-return-value with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the fence-paired-atomic).
  • s_waitcnt lgkmcnt(0) must happen after any preceding local/generic load atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the fence-paired-atomic).
  • Must happen before the following buffer_gl0_inv.
  • Ensures that the fence-paired atomic has completed before invalidating the cache. Therefore any following locations read must be no older than the value read by the fence-paired-atomic.
  1. buffer_gl0_inv
  • If CU wavefront execution mode, omit.
  • Ensures that following loads will not see stale data.
fence | acquire
  • agent
  • system
none
  1. s_waitcnt lgkmcnt(0) & vmcnt(0)
  • If OpenCL and address space is not generic, omit lgkmcnt(0).
  • However, since LLVM currently has no address space on the fence need to conservatively always generate (see comment for previous fence).
  • Could be split into separate s_waitcnt vmcnt(0) and s_waitcnt lgkmcnt(0) to allow them to be independently moved according to the following rules.
  • s_waitcnt vmcnt(0) must happen after any preceding global/generic load atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the fence-paired-atomic).
  • s_waitcnt lgkmcnt(0) must happen after any preceding local/generic load atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the fence-paired-atomic).
  • Must happen before the following buffer_wbinvl1_vol.
  • Ensures that the fence-paired atomic has completed before invalidating the cache. Therefore any following locations read must be no older than the value read by the fence-paired-atomic.
  1. buffer_wbinvl1_vol
  • Must happen before any following global/generic load/load atomic/store/store atomic/atomicrmw.
  • Ensures that following loads will not see stale global data.
  1. s_waitcnt lgkmcnt(0) &vmcnt(0) & vscnt(0)
  • If OpenCL andaddress space isnot generic, omitlgkmcnt(0).
  • If OpenCL andaddress space islocal, omitvmcnt(0) and vscnt(0).
  • However, since LLVMcurrently has noaddress space onthe fence need toconservativelyalways generate(see comment forprevious fence).
  • Could be split intoseparate s_waitcntvmcnt(0), s_waitcntvscnt(0) and s_waitcntlgkmcnt(0) to allowthem to beindependently movedaccording to thefollowing rules.
  • s_waitcnt vmcnt(0)must happen afterany precedingglobal/generic loadatomic/atomicrmw-with-return-valuewith an equal orwider sync scopeand memory orderingstronger thanunordered (this istermed thefence-paired-atomic).
  • s_waitcnt vscnt(0)must happen afterany precedingglobal/genericatomicrmw-no-return-valuewith an equal orwider sync scopeand memory orderingstronger thanunordered (this istermed thefence-paired-atomic).
  • s_waitcnt lgkmcnt(0)must happen afterany precedinglocal/generic loadatomic/atomicrmwwith an equal orwider sync scopeand memory orderingstronger thanunordered (this istermed thefence-paired-atomic).
  • Must happen beforethe followingbuffer_gl_inv.
  • Ensures that thefence-paired atomichas completedbefore invalidatingthecaches. Thereforeany followinglocations read mustbe no older thanthe value read bythefence-paired-atomic.
  1. buffer_gl0_inv;buffer_gl1_inv
  • Must happen before anyfollowing global/genericload/loadatomic/store/storeatomic/atomicrmw.
  • Ensures thatfollowing loadswill not see staleglobal data.
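The acquire rows above are reached from LLVM IR like the following hypothetical fragment (the kernel and argument names are illustrative only, not from the LLVM sources):

```llvm
; Hypothetical example: an agent-scope acquire fence following a relaxed
; atomic load of a flag. Per the rows above, the fence lowers to an
; s_waitcnt on the counters tracking the fence-paired-atomic, followed by
; a cache invalidate (buffer_wbinvl1_vol on GFX6-9; buffer_gl0_inv and
; buffer_gl1_inv on GFX10).
define amdgpu_kernel void @acquire_fence(ptr addrspace(1) %flag, ptr addrspace(1) %data, ptr addrspace(1) %out) {
entry:
  %f = load atomic i32, ptr addrspace(1) %flag syncscope("agent") monotonic, align 4
  fence syncscope("agent") acquire
  %v = load i32, ptr addrspace(1) %data, align 4
  store i32 %v, ptr addrspace(1) %out, align 4
  ret void
}
```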
Release Atomic
store atomic release
  • singlethread
  • wavefront
  • global
  • local
  • generic
  1. buffer/global/ds/flat_store
  1. buffer/global/ds/flat_store
store atomic release
  • workgroup
  • global
  1. s_waitcnt lgkmcnt(0)
  • If OpenCL, omit.
  • Must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
  • Must happen before the following store.
  • Ensures that all memory operations to local have completed before performing the store that is being released.
  2. buffer/global/flat_store
  1. s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)
  • If CU wavefront execution mode, omit vmcnt(0) and vscnt(0).
  • If OpenCL, omit lgkmcnt(0).
  • Could be split into separate s_waitcnt vmcnt(0), s_waitcnt vscnt(0) and s_waitcnt lgkmcnt(0) to allow them to be independently moved according to the following rules.
  • s_waitcnt vmcnt(0) must happen after any preceding global/generic load/load atomic/atomicrmw-with-return-value.
  • s_waitcnt vscnt(0) must happen after any preceding global/generic store/store atomic/atomicrmw-no-return-value.
  • s_waitcnt lgkmcnt(0) must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
  • Must happen before the following store.
  • Ensures that all memory operations have completed before performing the store that is being released.
  2. buffer/global_store
store atomic release
  • workgroup
  • local
  1. ds_store
  1. s_waitcnt vmcnt(0) & vscnt(0)
  • If CU wavefront execution mode, omit.
  • If OpenCL, omit.
  • Could be split into separate s_waitcnt vmcnt(0) and s_waitcnt vscnt(0) to allow them to be independently moved according to the following rules.
  • s_waitcnt vmcnt(0) must happen after any preceding global/generic load/load atomic/atomicrmw-with-return-value.
  • s_waitcnt vscnt(0) must happen after any preceding global/generic store/store atomic/atomicrmw-no-return-value.
  • Must happen before the following store.
  • Ensures that all global memory operations have completed before performing the store that is being released.
  2. ds_store
store atomic release
  • workgroup
  • generic
  1. s_waitcnt lgkmcnt(0)
  • If OpenCL, omit.
  • Must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
  • Must happen before the following store.
  • Ensures that all memory operations to local have completed before performing the store that is being released.
  2. flat_store
  1. s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)
  • If CU wavefront execution mode, omit vmcnt(0) and vscnt(0).
  • If OpenCL, omit lgkmcnt(0).
  • Could be split into separate s_waitcnt vmcnt(0), s_waitcnt vscnt(0) and s_waitcnt lgkmcnt(0) to allow them to be independently moved according to the following rules.
  • s_waitcnt vmcnt(0) must happen after any preceding global/generic load/load atomic/atomicrmw-with-return-value.
  • s_waitcnt vscnt(0) must happen after any preceding global/generic store/store atomic/atomicrmw-no-return-value.
  • s_waitcnt lgkmcnt(0) must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
  • Must happen before the following store.
  • Ensures that all memory operations have completed before performing the store that is being released.
  2. flat_store
store atomic release
  • agent
  • system
  • global
  • generic
  1. s_waitcnt lgkmcnt(0) & vmcnt(0)
  • If OpenCL, omit lgkmcnt(0).
  • Could be split into separate s_waitcnt vmcnt(0) and s_waitcnt lgkmcnt(0) to allow them to be independently moved according to the following rules.
  • s_waitcnt vmcnt(0) must happen after any preceding global/generic load/store/load atomic/store atomic/atomicrmw.
  • s_waitcnt lgkmcnt(0) must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
  • Must happen before the following store.
  • Ensures that all memory operations to memory have completed before performing the store that is being released.
  2. buffer/global/ds/flat_store
  1. s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)
  • If OpenCL, omit lgkmcnt(0).
  • Could be split into separate s_waitcnt vmcnt(0), s_waitcnt vscnt(0) and s_waitcnt lgkmcnt(0) to allow them to be independently moved according to the following rules.
  • s_waitcnt vmcnt(0) must happen after any preceding global/generic load/load atomic/atomicrmw-with-return-value.
  • s_waitcnt vscnt(0) must happen after any preceding global/generic store/store atomic/atomicrmw-no-return-value.
  • s_waitcnt lgkmcnt(0) must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
  • Must happen before the following store.
  • Ensures that all memory operations to memory have completed before performing the store that is being released.
  2. buffer/global/ds/flat_store
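The store release rows above are exercised by IR of the following shape (a hypothetical fragment; the names are illustrative only):

```llvm
; Hypothetical example: a workgroup-scope release store publishing data
; through a flag. Per the rows above, the release store is preceded by
; s_waitcnt lgkmcnt(0) (GFX6-9) or s_waitcnt lgkmcnt(0) & vmcnt(0) &
; vscnt(0) (GFX10) before the buffer/global/flat_store.
define amdgpu_kernel void @release_store(ptr addrspace(1) %data, ptr addrspace(1) %flag) {
entry:
  store i32 42, ptr addrspace(1) %data, align 4
  store atomic i32 1, ptr addrspace(1) %flag syncscope("workgroup") release, align 4
  ret void
}
```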
atomicrmw release
  • singlethread
  • wavefront
  • global
  • local
  • generic
  1. buffer/global/ds/flat_atomic
  1. buffer/global/ds/flat_atomic
atomicrmw release
  • workgroup
  • global
  1. s_waitcnt lgkmcnt(0)
  • If OpenCL, omit.
  • Must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
  • Must happen before the following atomicrmw.
  • Ensures that all memory operations to local have completed before performing the atomicrmw that is being released.
  2. buffer/global/flat_atomic
  1. s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)
  • If CU wavefront execution mode, omit vmcnt(0) and vscnt(0).
  • Could be split into separate s_waitcnt vmcnt(0), s_waitcnt vscnt(0) and s_waitcnt lgkmcnt(0) to allow them to be independently moved according to the following rules.
  • s_waitcnt vmcnt(0) must happen after any preceding global/generic load/load atomic/atomicrmw-with-return-value.
  • s_waitcnt vscnt(0) must happen after any preceding global/generic store/store atomic/atomicrmw-no-return-value.
  • s_waitcnt lgkmcnt(0) must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
  • Must happen before the following atomicrmw.
  • Ensures that all memory operations have completed before performing the atomicrmw that is being released.
  2. buffer/global_atomic
atomicrmw release
  • workgroup
  • local
  1. ds_atomic
  1. s_waitcnt vmcnt(0) & vscnt(0)
  • If CU wavefront execution mode, omit.
  • If OpenCL, omit.
  • Could be split into separate s_waitcnt vmcnt(0) and s_waitcnt vscnt(0) to allow them to be independently moved according to the following rules.
  • s_waitcnt vmcnt(0) must happen after any preceding global/generic load/load atomic/atomicrmw-with-return-value.
  • s_waitcnt vscnt(0) must happen after any preceding global/generic store/store atomic/atomicrmw-no-return-value.
  • Must happen before the following store.
  • Ensures that all global memory operations have completed before performing the store that is being released.
  2. ds_atomic
atomicrmw release
  • workgroup
  • generic
  1. s_waitcnt lgkmcnt(0)
  • If OpenCL, omit.
  • Must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
  • Must happen before the following atomicrmw.
  • Ensures that all memory operations to local have completed before performing the atomicrmw that is being released.
  2. flat_atomic
  1. s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)
  • If CU wavefront execution mode, omit vmcnt(0) and vscnt(0).
  • If OpenCL, omit lgkmcnt(0).
  • Could be split into separate s_waitcnt vmcnt(0), s_waitcnt vscnt(0) and s_waitcnt lgkmcnt(0) to allow them to be independently moved according to the following rules.
  • s_waitcnt vmcnt(0) must happen after any preceding global/generic load/load atomic/atomicrmw-with-return-value.
  • s_waitcnt vscnt(0) must happen after any preceding global/generic store/store atomic/atomicrmw-no-return-value.
  • s_waitcnt lgkmcnt(0) must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
  • Must happen before the following atomicrmw.
  • Ensures that all memory operations have completed before performing the atomicrmw that is being released.
  2. flat_atomic
atomicrmw release
  • agent
  • system
  • global
  • generic
  1. s_waitcnt lgkmcnt(0) & vmcnt(0)
  • If OpenCL, omit lgkmcnt(0).
  • Could be split into separate s_waitcnt vmcnt(0) and s_waitcnt lgkmcnt(0) to allow them to be independently moved according to the following rules.
  • s_waitcnt vmcnt(0) must happen after any preceding global/generic load/store/load atomic/store atomic/atomicrmw.
  • s_waitcnt lgkmcnt(0) must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
  • Must happen before the following atomicrmw.
  • Ensures that all memory operations to global and local have completed before performing the atomicrmw that is being released.
  2. buffer/global/ds/flat_atomic
  1. s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)
  • If OpenCL, omit lgkmcnt(0).
  • Could be split into separate s_waitcnt vmcnt(0), s_waitcnt vscnt(0) and s_waitcnt lgkmcnt(0) to allow them to be independently moved according to the following rules.
  • s_waitcnt vmcnt(0) must happen after any preceding global/generic load/load atomic/atomicrmw-with-return-value.
  • s_waitcnt vscnt(0) must happen after any preceding global/generic store/store atomic/atomicrmw-no-return-value.
  • s_waitcnt lgkmcnt(0) must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
  • Must happen before the following atomicrmw.
  • Ensures that all memory operations to global and local have completed before performing the atomicrmw that is being released.
  2. buffer/global/ds/flat_atomic
fence release
  • singlethread
  • wavefront
none
none
none
fence release
  • workgroup
none
  1. s_waitcnt lgkmcnt(0)
  • If OpenCL and address space is not generic, omit.
  • However, since LLVM currently has no address space on the fence need to conservatively always generate. If fence had an address space then set to address space of OpenCL fence flag, or to generic if both local and global flags are specified.
  • Must happen after any preceding local/generic load/load atomic/store/store atomic/atomicrmw.
  • Must happen before any following store atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the fence-paired-atomic).
  • Ensures that all memory operations to local have completed before performing the following fence-paired-atomic.
  1. s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)
  • If CU wavefront execution mode, omit vmcnt(0) and vscnt(0).
  • If OpenCL and address space is not generic, omit lgkmcnt(0).
  • If OpenCL and address space is local, omit vmcnt(0) and vscnt(0).
  • However, since LLVM currently has no address space on the fence need to conservatively always generate. If fence had an address space then set to address space of OpenCL fence flag, or to generic if both local and global flags are specified.
  • Could be split into separate s_waitcnt vmcnt(0), s_waitcnt vscnt(0) and s_waitcnt lgkmcnt(0) to allow them to be independently moved according to the following rules.
  • s_waitcnt vmcnt(0) must happen after any preceding global/generic load/load atomic/atomicrmw-with-return-value.
  • s_waitcnt vscnt(0) must happen after any preceding global/generic store/store atomic/atomicrmw-no-return-value.
  • s_waitcnt lgkmcnt(0) must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
  • Must happen before any following store atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the fence-paired-atomic).
  • Ensures that all memory operations have completed before performing the following fence-paired-atomic.
fence release
  • agent
  • system
none
  1. s_waitcnt lgkmcnt(0) & vmcnt(0)
  • If OpenCL and address space is not generic, omit lgkmcnt(0).
  • If OpenCL and address space is local, omit vmcnt(0).
  • However, since LLVM currently has no address space on the fence need to conservatively always generate. If fence had an address space then set to address space of OpenCL fence flag, or to generic if both local and global flags are specified.
  • Could be split into separate s_waitcnt vmcnt(0) and s_waitcnt lgkmcnt(0) to allow them to be independently moved according to the following rules.
  • s_waitcnt vmcnt(0) must happen after any preceding global/generic load/store/load atomic/store atomic/atomicrmw.
  • s_waitcnt lgkmcnt(0) must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
  • Must happen before any following store atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the fence-paired-atomic).
  • Ensures that all memory operations have completed before performing the following fence-paired-atomic.
  1. s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)
  • If OpenCL and address space is not generic, omit lgkmcnt(0).
  • If OpenCL and address space is local, omit vmcnt(0) and vscnt(0).
  • However, since LLVM currently has no address space on the fence need to conservatively always generate. If fence had an address space then set to address space of OpenCL fence flag, or to generic if both local and global flags are specified.
  • Could be split into separate s_waitcnt vmcnt(0), s_waitcnt vscnt(0) and s_waitcnt lgkmcnt(0) to allow them to be independently moved according to the following rules.
  • s_waitcnt vmcnt(0) must happen after any preceding global/generic load/load atomic/atomicrmw-with-return-value.
  • s_waitcnt vscnt(0) must happen after any preceding global/generic store/store atomic/atomicrmw-no-return-value.
  • s_waitcnt lgkmcnt(0) must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
  • Must happen before any following store atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the fence-paired-atomic).
  • Ensures that all memory operations have completed before performing the following fence-paired-atomic.
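The release fence rows above are reached by IR of the following shape (a hypothetical fragment; the names are illustrative only):

```llvm
; Hypothetical example: a workgroup-scope release fence ordering a plain
; data store before a relaxed atomic flag store (the fence-paired-atomic).
; Per the rows above, the fence lowers to the s_waitcnt sequence that must
; complete before the following store atomic.
define amdgpu_kernel void @release_fence(ptr addrspace(1) %data, ptr addrspace(1) %flag) {
entry:
  store i32 1, ptr addrspace(1) %data, align 4
  fence syncscope("workgroup") release
  store atomic i32 1, ptr addrspace(1) %flag syncscope("workgroup") monotonic, align 4
  ret void
}
```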
Acquire-Release Atomic
atomicrmw acq_rel
  • singlethread
  • wavefront
  • global
  • local
  • generic
  1. buffer/global/ds/flat_atomic
  1. buffer/global/ds/flat_atomic
atomicrmw acq_rel
  • workgroup
  • global
  1. s_waitcnt lgkmcnt(0)
  • If OpenCL, omit.
  • Must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
  • Must happen before the following atomicrmw.
  • Ensures that all memory operations to local have completed before performing the atomicrmw that is being released.
  2. buffer/global/flat_atomic
  1. s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)
  • If CU wavefront execution mode, omit vmcnt(0) and vscnt(0).
  • If OpenCL, omit lgkmcnt(0).
  • Must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
  • Could be split into separate s_waitcnt vmcnt(0), s_waitcnt vscnt(0) and s_waitcnt lgkmcnt(0) to allow them to be independently moved according to the following rules.
  • s_waitcnt vmcnt(0) must happen after any preceding global/generic load/load atomic/atomicrmw-with-return-value.
  • s_waitcnt vscnt(0) must happen after any preceding global/generic store/store atomic/atomicrmw-no-return-value.
  • s_waitcnt lgkmcnt(0) must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
  • Must happen before the following atomicrmw.
  • Ensures that all memory operations have completed before performing the atomicrmw that is being released.
  2. buffer/global_atomic
  3. s_waitcnt vm/vscnt(0)
  • If CU wavefront execution mode, omit vm/vscnt(0).
  • Use vmcnt(0) if atomic with return and vscnt(0) if atomic with no-return.
  • Must happen before the following buffer_gl0_inv.
  • Ensures any following global data read is no older than the atomicrmw value being acquired.
  4. buffer_gl0_inv
  • If CU wavefront execution mode, omit.
  • Ensures that following loads will not see stale data.
atomicrmw acq_rel
  • workgroup
  • local
  1. ds_atomic
  2. s_waitcnt lgkmcnt(0)
  • If OpenCL, omit.
  • Must happen before any following global/generic load/load atomic/store/store atomic/atomicrmw.
  • Ensures any following global data read is no older than the load atomic value being acquired.
  1. s_waitcnt vmcnt(0) & vscnt(0)
  • If CU wavefront execution mode, omit.
  • If OpenCL, omit.
  • Could be split into separate s_waitcnt vmcnt(0) and s_waitcnt vscnt(0) to allow them to be independently moved according to the following rules.
  • s_waitcnt vmcnt(0) must happen after any preceding global/generic load/load atomic/atomicrmw-with-return-value.
  • s_waitcnt vscnt(0) must happen after any preceding global/generic store/store atomic/atomicrmw-no-return-value.
  • Must happen before the following store.
  • Ensures that all global memory operations have completed before performing the store that is being released.
  2. ds_atomic
  3. s_waitcnt lgkmcnt(0)
  • If OpenCL, omit.
  • Must happen before the following buffer_gl0_inv.
  • Ensures any following global data read is no older than the load atomic value being acquired.
  4. buffer_gl0_inv
  • If CU wavefront execution mode, omit.
  • If OpenCL, omit.
  • Ensures that following loads will not see stale data.
atomicrmw acq_rel
  • workgroup
  • generic
  1. s_waitcnt lgkmcnt(0)
  • If OpenCL, omit.
  • Must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
  • Must happen before the following atomicrmw.
  • Ensures that all memory operations to local have completed before performing the atomicrmw that is being released.
  2. flat_atomic
  3. s_waitcnt lgkmcnt(0)
  • If OpenCL, omit.
  • Must happen before any following global/generic load/load atomic/store/store atomic/atomicrmw.
  • Ensures any following global data read is no older than the load atomic value being acquired.
  1. s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)
  • If CU wavefront execution mode, omit vmcnt(0) and vscnt(0).
  • If OpenCL, omit lgkmcnt(0).
  • Could be split into separate s_waitcnt vmcnt(0), s_waitcnt vscnt(0) and s_waitcnt lgkmcnt(0) to allow them to be independently moved according to the following rules.
  • s_waitcnt vmcnt(0) must happen after any preceding global/generic load/load atomic/atomicrmw-with-return-value.
  • s_waitcnt vscnt(0) must happen after any preceding global/generic store/store atomic/atomicrmw-no-return-value.
  • s_waitcnt lgkmcnt(0) must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
  • Must happen before the following atomicrmw.
  • Ensures that all memory operations have completed before performing the atomicrmw that is being released.
  2. flat_atomic
  3. s_waitcnt lgkmcnt(0) & vm/vscnt(0)
  • If CU wavefront execution mode, omit vm/vscnt(0).
  • If OpenCL, omit lgkmcnt(0).
  • Must happen before the following buffer_gl0_inv.
  • Ensures any following global data read is no older than the load atomic value being acquired.
  4. buffer_gl0_inv
  • If CU wavefront execution mode, omit.
  • Ensures that following loads will not see stale data.
atomicrmw acq_rel
  • agent
  • system
  • global
  1. s_waitcnt lgkmcnt(0) & vmcnt(0)
  • If OpenCL, omit lgkmcnt(0).
  • Could be split into separate s_waitcnt vmcnt(0) and s_waitcnt lgkmcnt(0) to allow them to be independently moved according to the following rules.
  • s_waitcnt vmcnt(0) must happen after any preceding global/generic load/store/load atomic/store atomic/atomicrmw.
  • s_waitcnt lgkmcnt(0) must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
  • Must happen before the following atomicrmw.
  • Ensures that all memory operations to global have completed before performing the atomicrmw that is being released.
  2. buffer/global/flat_atomic
  3. s_waitcnt vmcnt(0)
  • Must happen before following buffer_wbinvl1_vol.
  • Ensures the atomicrmw has completed before invalidating the cache.
  4. buffer_wbinvl1_vol
  • Must happen before any following global/generic load/load atomic/atomicrmw.
  • Ensures that following loads will not see stale global data.
  1. s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)
  • If OpenCL, omit lgkmcnt(0).
  • Could be split into separate s_waitcnt vmcnt(0), s_waitcnt vscnt(0) and s_waitcnt lgkmcnt(0) to allow them to be independently moved according to the following rules.
  • s_waitcnt vmcnt(0) must happen after any preceding global/generic load/load atomic/atomicrmw-with-return-value.
  • s_waitcnt vscnt(0) must happen after any preceding global/generic store/store atomic/atomicrmw-no-return-value.
  • s_waitcnt lgkmcnt(0) must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
  • Must happen before the following atomicrmw.
  • Ensures that all memory operations to global have completed before performing the atomicrmw that is being released.
  2. buffer/global_atomic
  3. s_waitcnt vm/vscnt(0)
  • Use vmcnt(0) if atomic with return and vscnt(0) if atomic with no-return.
  • Must happen before following buffer_gl0_inv and buffer_gl1_inv.
  • Ensures the atomicrmw has completed before invalidating the caches.
  4. buffer_gl0_inv; buffer_gl1_inv
  • Must happen before any following global/generic load/load atomic/atomicrmw.
  • Ensures that following loads will not see stale global data.
atomicrmw acq_rel
  • agent
  • system
  • generic
  1. s_waitcnt lgkmcnt(0) & vmcnt(0)
  • If OpenCL, omit lgkmcnt(0).
  • Could be split into separate s_waitcnt vmcnt(0) and s_waitcnt lgkmcnt(0) to allow them to be independently moved according to the following rules.
  • s_waitcnt vmcnt(0) must happen after any preceding global/generic load/store/load atomic/store atomic/atomicrmw.
  • s_waitcnt lgkmcnt(0) must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
  • Must happen before the following atomicrmw.
  • Ensures that all memory operations to global have completed before performing the atomicrmw that is being released.
  2. flat_atomic
  3. s_waitcnt vmcnt(0) & lgkmcnt(0)
  • If OpenCL, omit lgkmcnt(0).
  • Must happen before following buffer_wbinvl1_vol.
  • Ensures the atomicrmw has completed before invalidating the cache.
  4. buffer_wbinvl1_vol
  • Must happen before any following global/generic load/load atomic/atomicrmw.
  • Ensures that following loads will not see stale global data.
  1. s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)
  • If OpenCL, omit lgkmcnt(0).
  • Could be split into separate s_waitcnt vmcnt(0), s_waitcnt vscnt(0) and s_waitcnt lgkmcnt(0) to allow them to be independently moved according to the following rules.
  • s_waitcnt vmcnt(0) must happen after any preceding global/generic load/load atomic/atomicrmw-with-return-value.
  • s_waitcnt vscnt(0) must happen after any preceding global/generic store/store atomic/atomicrmw-no-return-value.
  • s_waitcnt lgkmcnt(0) must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
  • Must happen before the following atomicrmw.
  • Ensures that all memory operations have completed before performing the atomicrmw that is being released.
  2. flat_atomic
  3. s_waitcnt vm/vscnt(0) & lgkmcnt(0)
  • If OpenCL, omit lgkmcnt(0).
  • Use vmcnt(0) if atomic with return and vscnt(0) if atomic with no-return.
  • Must happen before following buffer_gl0_inv and buffer_gl1_inv.
  • Ensures the atomicrmw has completed before invalidating the caches.
  4. buffer_gl0_inv; buffer_gl1_inv
  • Must happen before any following global/generic load/load atomic/atomicrmw.
  • Ensures that following loads will not see stale global data.
fence acq_rel
  • singlethread
  • wavefront
none
none
none
fence acq_rel
  • workgroup
none
  1. s_waitcnt lgkmcnt(0)
  • If OpenCL and address space is not generic, omit.
  • However, since LLVM currently has no address space on the fence need to conservatively always generate (see comment for previous fence).
  • Must happen after any preceding local/generic load/load atomic/store/store atomic/atomicrmw.
  • Must happen before any following global/generic load/load atomic/store/store atomic/atomicrmw.
  • Ensures that all memory operations to local have completed before performing any following global memory operations.
  • Ensures that the preceding local/generic load atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the acquire-fence-paired-atomic) has completed before following global memory operations. This satisfies the requirements of acquire.
  • Ensures that all previous memory operations have completed before a following local/generic store atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the release-fence-paired-atomic). This satisfies the requirements of release.
  1. s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)
  • If CU wavefront execution mode, omit vmcnt(0) and vscnt(0).
  • If OpenCL and address space is not generic, omit lgkmcnt(0).
  • If OpenCL and address space is local, omit vmcnt(0) and vscnt(0).
  • However, since LLVM currently has no address space on the fence need to conservatively always generate (see comment for previous fence).
  • Could be split into separate s_waitcnt vmcnt(0), s_waitcnt vscnt(0) and s_waitcnt lgkmcnt(0) to allow them to be independently moved according to the following rules.
  • s_waitcnt vmcnt(0) must happen after any preceding global/generic load/load atomic/atomicrmw-with-return-value.
  • s_waitcnt vscnt(0) must happen after any preceding global/generic store/store atomic/atomicrmw-no-return-value.
  • s_waitcnt lgkmcnt(0) must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
  • Must happen before any following global/generic load/load atomic/store/store atomic/atomicrmw.
  • Ensures that all memory operations have completed before performing any following global memory operations.
  • Ensures that the preceding local/generic load atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the acquire-fence-paired-atomic) has completed before following global memory operations. This satisfies the requirements of acquire.
  • Ensures that all previous memory operations have completed before a following local/generic store atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the release-fence-paired-atomic). This satisfies the requirements of release.
  • Must happen before the following buffer_gl0_inv.
  • Ensures that the acquire-fence-paired atomic has completed before invalidating the cache. Therefore any following locations read must be no older than the value read by the acquire-fence-paired-atomic.
  2. buffer_gl0_inv
  • If CU wavefront execution mode, omit.
  • Ensures that following loads will not see stale data.
fence acq_rel
  • agent
  • system
none
  1. s_waitcnt lgkmcnt(0) & vmcnt(0)
  • If OpenCL and address space is not generic, omit lgkmcnt(0).
  • However, since LLVM currently has no address space on the fence need to conservatively always generate (see comment for previous fence).
  • Could be split into separate s_waitcnt vmcnt(0) and s_waitcnt lgkmcnt(0) to allow them to be independently moved according to the following rules.
  • s_waitcnt vmcnt(0) must happen after any preceding global/generic load/store/load atomic/store atomic/atomicrmw.
  • s_waitcnt lgkmcnt(0) must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
  • Must happen before the following buffer_wbinvl1_vol.
  • Ensures that the preceding global/local/generic load atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the acquire-fence-paired-atomic) has completed before invalidating the cache. This satisfies the requirements of acquire.
  • Ensures that all previous memory operations have completed before a following global/local/generic store atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the release-fence-paired-atomic). This satisfies the requirements of release.
  2. buffer_wbinvl1_vol
  • Must happen before any following global/generic load/load atomic/store/store atomic/atomicrmw.
  • Ensures that following loads will not see stale global data. This satisfies the requirements of acquire.
  1. s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)
  • If OpenCL and address space is not generic, omit lgkmcnt(0).
  • If OpenCL and address space is local, omit vmcnt(0) and vscnt(0).
  • However, since LLVM currently has no address space on the fence need to conservatively always generate (see comment for previous fence).
  • Could be split into separate s_waitcnt vmcnt(0), s_waitcnt vscnt(0) and s_waitcnt lgkmcnt(0) to allow them to be independently moved according to the following rules.
  • s_waitcnt vmcnt(0) must happen after any preceding global/generic load/load atomic/atomicrmw-with-return-value.
  • s_waitcnt vscnt(0) must happen after any preceding global/generic store/store atomic/atomicrmw-no-return-value.
  • s_waitcnt lgkmcnt(0) must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
  • Must happen before the following buffer_gl0_inv and buffer_gl1_inv.
  • Ensures that the preceding global/local/generic load atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the acquire-fence-paired-atomic) has completed before invalidating the caches. This satisfies the requirements of acquire.
  • Ensures that all previous memory operations have completed before a following global/local/generic store atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the release-fence-paired-atomic). This satisfies the requirements of release.
  2. buffer_gl0_inv; buffer_gl1_inv
  • Must happen before any following global/generic load/load atomic/store/store atomic/atomicrmw.
  • Ensures that following loads will not see stale global data. This satisfies the requirements of acquire.
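The acq_rel atomicrmw rows above correspond to IR such as this hypothetical exchange (the kernel name is illustrative only):

```llvm
; Hypothetical example: an agent-scope acq_rel exchange, e.g. a lock
; handoff. Per the rows above it is bracketed by a releasing s_waitcnt
; before the atomic, and a waitcnt plus cache invalidate after it to
; satisfy the acquire half.
define amdgpu_kernel void @acq_rel_xchg(ptr addrspace(1) %lock) {
entry:
  %old = atomicrmw xchg ptr addrspace(1) %lock, i32 1 syncscope("agent") acq_rel
  ret void
}
```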
Sequential Consistent Atomic
load atomic seq_cst
  • singlethread
  • wavefront
  • global
  • local
  • generic
Same as corresponding load atomic acquire, except must generate all instructions even for OpenCL.
Same as corresponding load atomic acquire, except must generate all instructions even for OpenCL.
load atomic seq_cst
  • workgroup
  • global
  • generic
  1. s_waitcnt lgkmcnt(0)
  • Must happen after preceding global/generic load atomic/store atomic/atomicrmw with memory ordering of seq_cst and with equal or wider sync scope. (Note that seq_cst fences have their own s_waitcnt lgkmcnt(0) and so do not need to be considered.)
  • Ensures any preceding sequential consistent local memory instructions have completed before executing this sequentially consistent instruction. This prevents reordering a seq_cst store followed by a seq_cst load. (Note that seq_cst is stronger than acquire/release as the reordering of load acquire followed by a store release is prevented by the waitcnt of the release, but there is nothing preventing a store release followed by load acquire from completing out of order.)
  2. Following instructions same as corresponding load atomic acquire, except must generate all instructions even for OpenCL.
  1. s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)
  • If CU wavefront execution mode, omit vmcnt(0) and vscnt(0).
  • Could be split into separate s_waitcnt vmcnt(0), s_waitcnt vscnt(0) and s_waitcnt lgkmcnt(0) to allow them to be independently moved according to the following rules.
  • s_waitcnt lgkmcnt(0) must happen after preceding local load atomic/store atomic/atomicrmw with memory ordering of seq_cst and with equal or wider sync scope. (Note that seq_cst fences have their own s_waitcnt lgkmcnt(0) and so do not need to be considered.)
  • s_waitcnt vmcnt(0) must happen after preceding global/generic load atomic/atomicrmw-with-return-value with memory ordering of seq_cst and with equal or wider sync scope. (Note that seq_cst fences have their own s_waitcnt vmcnt(0) and so do not need to be considered.)
  • s_waitcnt vscnt(0) must happen after preceding global/generic store atomic/atomicrmw-no-return-value with memory ordering of seq_cst and with equal or wider sync scope. (Note that seq_cst fences have their own s_waitcnt vscnt(0) and so do not need to be considered.)
  • Ensures any preceding sequential consistent global/local memory instructions have completed before executing this sequentially consistent instruction. This prevents reordering a seq_cst store followed by a seq_cst load. (Note that seq_cst is stronger than acquire/release as the reordering of load acquire followed by a store release is prevented by the waitcnt of the release, but there is nothing preventing a store release followed by load acquire from completing out of order.)
  2. Following instructions same as corresponding load atomic acquire, except must generate all instructions even for OpenCL.
load atomic seq_cst
  • workgroup
  • local
Same as corresponding load atomic acquire, except it must generate all instructions even for OpenCL.
  1. s_waitcnt vmcnt(0) & vscnt(0)
  • If CU wavefront execution mode, omit.
  • Could be split into separate s_waitcnt vmcnt(0) and s_waitcnt vscnt(0) to allow them to be independently moved according to the following rules.
  • waitcnt vmcnt(0) must happen after preceding global/generic load atomic/atomicrmw-with-return-value with memory ordering of seq_cst and with equal or wider sync scope. (Note that seq_cst fences have their own s_waitcnt vmcnt(0) and so do not need to be considered.)
  • waitcnt vscnt(0) must happen after preceding global/generic store atomic/atomicrmw-no-return-value with memory ordering of seq_cst and with equal or wider sync scope. (Note that seq_cst fences have their own s_waitcnt vscnt(0) and so do not need to be considered.)
  • Ensures any preceding sequential consistent global memory instructions have completed before executing this sequentially consistent instruction. This prevents reordering a seq_cst store followed by a seq_cst load. (Note that seq_cst is stronger than acquire/release as the reordering of load acquire followed by a store release is prevented by the waitcnt of the release, but there is nothing preventing a store release followed by load acquire from completing out of order.)
  1. Following instructions same as corresponding load atomic acquire, except must generate all instructions even for OpenCL.
load atomic seq_cst
  • agent
  • system
  • global
  • generic
  1. s_waitcnt lgkmcnt(0) & vmcnt(0)
  • Could be split into separate s_waitcnt vmcnt(0) and s_waitcnt lgkmcnt(0) to allow them to be independently moved according to the following rules.
  • waitcnt lgkmcnt(0) must happen after preceding global/generic load atomic/store atomic/atomicrmw with memory ordering of seq_cst and with equal or wider sync scope. (Note that seq_cst fences have their own s_waitcnt lgkmcnt(0) and so do not need to be considered.)
  • waitcnt vmcnt(0) must happen after preceding global/generic load atomic/store atomic/atomicrmw with memory ordering of seq_cst and with equal or wider sync scope. (Note that seq_cst fences have their own s_waitcnt vmcnt(0) and so do not need to be considered.)
  • Ensures any preceding sequential consistent global memory instructions have completed before executing this sequentially consistent instruction. This prevents reordering a seq_cst store followed by a seq_cst load. (Note that seq_cst is stronger than acquire/release as the reordering of load acquire followed by a store release is prevented by the waitcnt of the release, but there is nothing preventing a store release followed by load acquire from completing out of order.)
  1. Following instructions same as corresponding load atomic acquire, except must generate all instructions even for OpenCL.
  1. s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)
  • Could be split into separate s_waitcnt vmcnt(0), s_waitcnt vscnt(0) and s_waitcnt lgkmcnt(0) to allow them to be independently moved according to the following rules.
  • waitcnt lgkmcnt(0) must happen after preceding local load atomic/store atomic/atomicrmw with memory ordering of seq_cst and with equal or wider sync scope. (Note that seq_cst fences have their own s_waitcnt lgkmcnt(0) and so do not need to be considered.)
  • waitcnt vmcnt(0) must happen after preceding global/generic load atomic/atomicrmw-with-return-value with memory ordering of seq_cst and with equal or wider sync scope. (Note that seq_cst fences have their own s_waitcnt vmcnt(0) and so do not need to be considered.)
  • waitcnt vscnt(0) must happen after preceding global/generic store atomic/atomicrmw-no-return-value with memory ordering of seq_cst and with equal or wider sync scope. (Note that seq_cst fences have their own s_waitcnt vscnt(0) and so do not need to be considered.)
  • Ensures any preceding sequential consistent global memory instructions have completed before executing this sequentially consistent instruction. This prevents reordering a seq_cst store followed by a seq_cst load. (Note that seq_cst is stronger than acquire/release as the reordering of load acquire followed by a store release is prevented by the waitcnt of the release, but there is nothing preventing a store release followed by load acquire from completing out of order.)
  1. Following instructions same as corresponding load atomic acquire, except must generate all instructions even for OpenCL.
store atomic seq_cst
  • singlethread
  • wavefront
  • workgroup
  • global
  • local
  • generic
Same as corresponding store atomic release, except it must generate all instructions even for OpenCL.
store atomic seq_cst
  • agent
  • system
  • global
  • generic
Same as corresponding store atomic release, except it must generate all instructions even for OpenCL.
atomicrmw seq_cst
  • singlethread
  • wavefront
  • workgroup
  • global
  • local
  • generic
Same as corresponding atomicrmw acq_rel, except it must generate all instructions even for OpenCL.
atomicrmw seq_cst
  • agent
  • system
  • global
  • generic
Same as corresponding atomicrmw acq_rel, except it must generate all instructions even for OpenCL.
fence seq_cst
  • singlethread
  • wavefront
  • workgroup
  • agent
  • system
none
Same as corresponding fence acq_rel, except it must generate all instructions even for OpenCL.

The memory order also adds the single thread optimization constraints defined in table AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX10.

AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX10
LLVM Memory Ordering | Optimization Constraints
unordered: none
monotonic: none
acquire
  • If a load atomic/atomicrmw then no following load/load atomic/store/store atomic/atomicrmw/fence instruction can be moved before the acquire.
  • If a fence then same as load atomic, plus no preceding associated fence-paired-atomic can be moved after the fence.
release
  • If a store atomic/atomicrmw then no preceding load/load atomic/store/store atomic/atomicrmw/fence instruction can be moved after the release.
  • If a fence then same as store atomic, plus no following associated fence-paired-atomic can be moved before the fence.
acq_rel: Same constraints as both acquire and release.
seq_cst
  • If a load atomic then same constraints as acquire, plus no preceding sequentially consistent load atomic/store atomic/atomicrmw/fence instruction can be moved after the seq_cst.
  • If a store atomic then the same constraints as release, plus no following sequentially consistent load atomic/store atomic/atomicrmw/fence instruction can be moved before the seq_cst.
  • If an atomicrmw/fence then same constraints as acq_rel.

Trap Handler ABI

For code objects generated by the AMDGPU backend for HSA [HSA] compatible runtimes (such as ROCm [AMD-ROCm]), the runtime installs a trap handler that supports the s_trap instruction with the following usage:

AMDGPU Trap Handler for AMDHSA OS
Usage | Code Sequence | Trap Handler Inputs | Description
reserved | s_trap 0x00 | | Reserved by hardware.
debugtrap(arg) | s_trap 0x01 | SGPR0-1: queue_ptr; VGPR0: arg | Reserved for HSA debugtrap intrinsic (not implemented).
llvm.trap | s_trap 0x02 | SGPR0-1: queue_ptr | Causes dispatch to be terminated and its associated queue put into the error state.
llvm.debugtrap | s_trap 0x03 | | If debugger not installed then behaves as a no-operation: the trap handler is entered and immediately returns to continue execution of the wavefront. If the debugger is installed, causes the debug trap to be reported by the debugger and the wavefront is put in the halt state until resumed by the debugger.
reserved | s_trap 0x04 | | Reserved.
reserved | s_trap 0x05 | | Reserved.
reserved | s_trap 0x06 | | Reserved.
debugger breakpoint | s_trap 0x07 | | Reserved for debugger breakpoints.
reserved | s_trap 0x08 | | Reserved.
reserved | s_trap 0xfe | | Reserved.
reserved | s_trap 0xff | | Reserved.

Call Convention

Note

This section is currently incomplete and has inaccuracies. It is a WIP that will be updated as information is determined.

See Address Space Mapping for information on swizzled addresses. Unswizzled addresses are normal linear addresses.

Kernel Functions

This section describes the call convention ABI for the outer kernel function.

See Initial Kernel Execution State for the kernel callconvention.

The following is not part of the AMDGPU kernel calling convention but describes how the AMDGPU implements function calls:

  • Clang decides the kernarg layout to match the HSA Programmer's Language Reference [HSA].
    • All structs are passed directly.
    • Lambda values are passed TBA.
  • The kernel performs certain setup in its prolog, as described inKernel Prolog.
Non-Kernel Functions

This section describes the call convention ABI for functions other than theouter kernel function.

If a kernel has function calls then scratch is always allocated and used for the call stack which grows from low address to high address using the swizzled scratch address space.

On entry to a function:

  • SGPR0-3 contain a V# with the following properties (see Private Segment Buffer):

    • Base address pointing to the beginning of the wavefront scratch backing memory.
    • Swizzled with dword element size and stride of wavefront size elements.
  • The FLAT_SCRATCH register pair is set up. See Flat Scratch.

  • GFX6-8: M0 register set to the size of LDS in bytes. See M0.

  • The EXEC register is set to the lanes active on entry to the function.

  • MODE register: TBD

  • VGPR0-31 and SGPR4-29 are used to pass function input arguments as described below.

  • SGPR30-31 return address (RA). The code address that the function must return to when it completes. The value is undefined if the function is no return.

  • SGPR32 is used for the stack pointer (SP). It is an unswizzled scratch offset relative to the beginning of the wavefront scratch backing memory.

The unswizzled SP can be used with buffer instructions as an unswizzled SGPR offset with the scratch V# in SGPR0-3 to access the stack in a swizzled manner.

The unswizzled SP value can be converted into the swizzled SP value by:

swizzled SP = unswizzled SP / wavefront size

This may be used to obtain the private address space address of stack objects and to convert this address to a flat address by adding the flat scratch aperture base address.

The swizzled SP value is always 4-byte aligned for the r600 architecture and 16-byte aligned for the amdgcn architecture.

Note

The amdgcn value is selected to avoid dynamic stack alignment for the OpenCL language which has the largest base type defined as 16 bytes.

On entry, the swizzled SP value is the address of the first function argument passed on the stack. Other stack passed arguments are positive offsets from the entry swizzled SP value.

The function may use positive offsets beyond the last stack passed argument for stack allocated local variables and register spill slots. If necessary the function may align these to greater alignment than 16 bytes. After these the function may dynamically allocate space for such things as runtime sized alloca local allocations.

If the function calls another function, it will place any stack allocated arguments after the last local allocation and adjust SGPR32 to the address after the last local allocation.

  • All other registers are unspecified.

  • Any necessary waitcnt has been performed to ensure memory is available to the function.

On exit from a function:

  • VGPR0-31 and SGPR4-29 are used to pass function result arguments as described below. Any registers used are considered clobbered registers.

  • The following registers are preserved and have the same value as on entry:

    • FLAT_SCRATCH

    • EXEC

    • GFX6-8: M0

    • All SGPR and VGPR registers except the clobbered registers of SGPR4-31 and VGPR0-31.

For the AMDGPU backend, an inter-procedural register allocation (IPRA) optimization may mark some of the clobbered SGPR4-31 and VGPR0-31 registers as preserved if it can be determined that the called function does not change their value.

  • The PC is set to the RA provided on entry.
  • MODE register: TBD.
  • All other registers are clobbered.
  • Any necessary waitcnt has been performed to ensure memory accessed by the function is available to the caller.

The function input arguments are made up of the formal arguments explicitly declared by the source language function plus the implicit input arguments used by the implementation.

The source language input arguments are:

  • Any source language implicit this or self argument comes first as a pointer type.
  • Followed by the function formal arguments in left to right source order.

The source language result arguments are:

  • The function result argument.

The source language input or result struct type arguments that are less than or equal to 16 bytes, are decomposed recursively into their base type fields, and each field is passed as if a separate argument. For input arguments, if the called function requires the struct to be in memory, for example because its address is taken, then the function body is responsible for allocating a stack location and copying the field arguments into it. Clang terms this direct struct.

The source language input struct type arguments that are greater than 16 bytes, are passed by reference. The caller is responsible for allocating a stack location to make a copy of the struct value and pass the address as the input argument. The called function is responsible to perform the dereference when accessing the input argument. Clang terms this by-value struct.

A source language result struct type argument that is greater than 16 bytes, is returned by reference. The caller is responsible for allocating a stack location to hold the result value and passes the address as the last input argument (before the implicit input arguments). In this case there are no result arguments. The called function is responsible to perform the dereference when storing the result value. Clang terms this structured return (sret).

TODO: correct the sret definition.

Lambda argument types are treated as struct types with an implementation definedset of fields.

For the AMDGPU backend all source language arguments (including the decomposed struct type arguments) are passed in VGPRs unless marked inreg in which case they are passed in SGPRs.

The AMDGPU backend walks the function call graph from the leaves to determine which implicit input arguments are used, propagating to each caller of the function. The used implicit arguments are appended to the function arguments after the source language arguments in the following order:

  • Work-Item ID (1 VGPR)

The X, Y and Z work-item IDs are packed into a single VGPR with the following layout. Only fields actually used by the function are set. The other bits are undefined.

The values come from the initial kernel execution state. See VGPR Register Set Up Order.

Work-item implicit argument layout
Bits | Size | Field Name
9:0 | 10 bits | X Work-Item ID
19:10 | 10 bits | Y Work-Item ID
29:20 | 10 bits | Z Work-Item ID
31:30 | 2 bits | Unused

  • Dispatch Ptr (2 SGPRs)

The value comes from the initial kernel execution state. See SGPR Register Set Up Order.

  • Queue Ptr (2 SGPRs)

The value comes from the initial kernel execution state. See SGPR Register Set Up Order.

  • Kernarg Segment Ptr (2 SGPRs)

The value comes from the initial kernel execution state. See SGPR Register Set Up Order.

  • Dispatch id (2 SGPRs)

The value comes from the initial kernel execution state. See SGPR Register Set Up Order.

  • Work-Group ID X (1 SGPR)

The value comes from the initial kernel execution state. See SGPR Register Set Up Order.

  • Work-Group ID Y (1 SGPR)

The value comes from the initial kernel execution state. See SGPR Register Set Up Order.

  • Work-Group ID Z (1 SGPR)

The value comes from the initial kernel execution state. See SGPR Register Set Up Order.

  • Implicit Argument Ptr (2 SGPRs)

The value is computed by adding an offset to Kernarg Segment Ptr to get the global address space pointer to the first kernarg implicit argument.

The input and result arguments are assigned in order in the following manner:

Note

There are likely some errors and omissions in the following description that need correction.

TODO: Check the clang source code to decipher how function arguments and return results are handled. Also see the AMDGPU specific values used.
  • VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to VGPR31.

If there are more arguments than will fit in these registers, the remaining arguments are allocated on the stack in order on naturally aligned addresses.

  • SGPR arguments are assigned to consecutive SGPRs starting at SGPR0 up to SGPR29.

If there are more arguments than will fit in these registers, the remaining arguments are allocated on the stack in order on naturally aligned addresses.

Note that decomposed struct type arguments may have some fields passed in registers and some in memory.

TODO: So a struct which can pass some fields as decomposed register arguments will pass the rest as decomposed stack elements? But an argument that will not start in registers will not be decomposed and will be passed as a non-decomposed stack value?

The following is not part of the AMDGPU function calling convention but describes how the AMDGPU implements function calls:

  • SGPR33 is used as a frame pointer (FP) if necessary. Like the SP it is an unswizzled scratch address. It is only needed if runtime sized alloca are used, or for the reasons defined in SIFrameLowering.
  • Runtime stack alignment is not currently supported.
  • Allocating SGPR arguments on the stack is not supported.

  • No CFI is currently generated. See Call Frame Information.

Note

CFI will be generated that defines the CFA as the unswizzled address relative to the wave scratch base in the unswizzled private address space of the lowest address stack allocated local variable.

``DW_AT_frame_base`` will be defined as the swizzled address in the swizzled private address space by dividing the CFA by the wavefront size (since CFA is always at least dword aligned which matches the scratch swizzle element size).

If no dynamic stack alignment was performed, the stack allocated arguments are accessed as negative offsets relative to ``DW_AT_frame_base``, and the local variables and register spill slots are accessed as positive offsets relative to ``DW_AT_frame_base``.
  • Function argument passing is implemented by copying the input physical registers to virtual registers on entry. The register allocator can spill if necessary. These are copied back to physical registers at call sites. The net effect is that each function call can have these values in entirely distinct locations. The IPRA can help avoid shuffling argument registers.

  • Call sites are implemented by setting up the arguments at positive offsets from SP. Then SP is incremented to account for the known frame size before the call and decremented after the call.

Note

The CFI will reflect the changed calculation needed to compute the CFA from SP.
  • 4 byte spill slots are used in the stack frame. One slot is allocated for an emergency spill slot. Buffer instructions are used for stack accesses and not the flat_scratch instruction.

TODO: Explain when the emergency spill slot is used.

AMDPAL

This section provides code conventions used when the target triple OS is amdpal (see Target Triples) for passing runtime parameters from the application/runtime to each invocation of a hardware shader. These parameters include both generic, application-controlled parameters called user data as well as system-generated parameters that are a product of the draw or dispatch execution.

User Data

Each hardware stage has a set of 32-bit user data registers which can be written from a command buffer and then loaded into SGPRs when waves are launched via a subsequent dispatch or draw operation. This is the way most arguments are passed from the application/runtime to a hardware shader.

Compute User Data

Compute shader user data mappings are simpler than graphics shaders, and have a fixed mapping.

Note that there are always 10 available user data entries in registers; entries beyond that limit must be fetched from memory (via the spill table pointer) by the shader.

PAL Compute Shader User Data Registers
User Register | Description
0 | Global Internal Table (32-bit pointer)
1 | Per-Shader Internal Table (32-bit pointer)
2-11 | Application-Controlled User Data (10 32-bit values)
12 | Spill Table (32-bit pointer)
13-14 | Thread Group Count (64-bit pointer)
15 | GDS Range

Graphics User Data

Graphics pipelines support a much more flexible user data mapping:

PAL Graphics Shader User Data Registers
User Register | Description
0 | Global Internal Table (32-bit pointer)
| Per-Shader Internal Table (32-bit pointer)
1-15 | Application Controlled User Data (1-15 contiguous 32-bit values in registers)
| Spill Table (32-bit pointer)
| Draw Index (First Stage Only)
| Vertex Offset (First Stage Only)
| Instance Offset (First Stage Only)

The placement of the global internal table remains fixed in the first user data SGPR register. Otherwise all parameters are optional, and can be mapped to any desired user data SGPR register, with the following restrictions:

  • Draw Index, Vertex Offset, and Instance Offset can only be used by the first active hardware stage in a graphics pipeline (i.e. where the API vertex shader runs).
  • Application-controlled user data must be mapped into a contiguous range of user data registers.
  • The application-controlled user data range supports compaction remapping, so only entries that are actually consumed by the shader must be assigned to corresponding registers. Note that in order to support an efficient runtime implementation, the remapping must pack registers in the same order as entries, with unused entries removed.

Global Internal Table

The global internal table is a table of shader resource descriptors (SRDs) that define how certain engine-wide, runtime-managed resources should be accessed from a shader. The majority of these resources have HW-defined formats, and it is up to the compiler to write/read data as required by the target hardware.

The following table illustrates the required format:

PAL Global Internal Table
Offset | Description
0-3 | Graphics Scratch SRD
4-7 | Compute Scratch SRD
8-11 | ES/GS Ring Output SRD
12-15 | ES/GS Ring Input SRD
16-19 | GS/VS Ring Output #0
20-23 | GS/VS Ring Output #1
24-27 | GS/VS Ring Output #2
28-31 | GS/VS Ring Output #3
32-35 | GS/VS Ring Input SRD
36-39 | Tessellation Factor Buffer SRD
40-43 | Off-Chip LDS Buffer SRD
44-47 | Off-Chip Param Cache Buffer SRD
48-51 | Sample Position Buffer SRD
52 | vaRange::ShadowDescriptorTable High Bits

The pointer to the global internal table passed to the shader as user data is a 32-bit pointer. The top 32 bits should be assumed to be the same as the top 32 bits of the pipeline, so the shader may use the program counter's top 32 bits.

Unspecified OS

This section provides code conventions used when the target triple OS is empty (see Target Triples).

Trap Handler ABI

For code objects generated by the AMDGPU backend for non-amdhsa OS, the runtime does not install a trap handler. The llvm.trap and llvm.debugtrap instructions are handled as follows:

AMDGPU Trap Handler for Non-AMDHSA OS
Usage | Code Sequence | Description
llvm.trap | s_endpgm | Causes wavefront to be terminated.
llvm.debugtrap | none | Compiler warning given that there is no trap handler installed.

Source Languages

OpenCL

When the language is OpenCL the following differences occur:

OpenCL kernel implicit arguments appended for AMDHSA OS
Position | Byte Size | Byte Alignment | Description
1 | 8 | 8 | OpenCL Global Offset X
2 | 8 | 8 | OpenCL Global Offset Y
3 | 8 | 8 | OpenCL Global Offset Z
4 | 8 | 8 | OpenCL address of printf buffer
5 | 8 | 8 | OpenCL address of virtual queue used by enqueue_kernel.
6 | 8 | 8 | OpenCL address of AqlWrap struct used by enqueue_kernel.
7 | 8 | 8 | Pointer argument used for Multi-grid synchronization.

HCC

When the language is HCC the following differences occur:

Assembler

The AMDGPU backend has an LLVM-MC based assembler which is currently in development. It supports AMDGCN GFX6-GFX10.

This section describes general syntax for instructions and operands.

Instructions

An instruction has the following syntax:

<opcode> <operand0>, <operand1>,… <modifier0> <modifier1>…

Operands are comma-separated while modifiers are space-separated.

The order of operands and modifiers is fixed. Most modifiers are optional and may be omitted.

Links to detailed instruction syntax description may be found in the following table. Note that features under development are not included in this description.

For more information about instructions, their semantics and supported combinations of operands, refer to one of the instruction set architecture manuals [AMD-GCN-GFX6], [AMD-GCN-GFX7], [AMD-GCN-GFX8], [AMD-GCN-GFX9] and [AMD-GCN-GFX10].

Operands

Detailed description of operands may be found here.

Modifiers

Detailed description of modifiers may be foundhere.

Instruction Examples

DS
  1. ds_add_u32 v2, v4 offset:16
  2. ds_write_src2_b64 v2 offset0:4 offset1:8
  3. ds_cmpst_f32 v2, v4, v6
  4. ds_min_rtn_f64 v[8:9], v2, v[4:5]

For a full list of supported instructions, refer to "LDS/GDS instructions" in the ISA Manual.

FLAT
  1. flat_load_dword v1, v[3:4]
  2. flat_store_dwordx3 v[3:4], v[5:7]
  3. flat_atomic_swap v1, v[3:4], v5 glc
  4. flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
  5. flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc

For a full list of supported instructions, refer to "FLAT instructions" in the ISA Manual.

MUBUF
  1. buffer_load_dword v1, off, s[4:7], s1
  2. buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
  3. buffer_store_format_xy v[1:2], off, s[4:7], s1
  4. buffer_wbinvl1
  5. buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc

For a full list of supported instructions, refer to "MUBUF Instructions" in the ISA Manual.

SMRD/SMEM
  1. s_load_dword s1, s[2:3], 0xfc
  2. s_load_dwordx8 s[8:15], s[2:3], s4
  3. s_load_dwordx16 s[88:103], s[2:3], s4
  4. s_dcache_inv_vol
  5. s_memtime s[4:5]

For a full list of supported instructions, refer to "Scalar Memory Operations" in the ISA Manual.

SOP1
  1. s_mov_b32 s1, s2
  2. s_mov_b64 s[0:1], 0x80000000
  3. s_cmov_b32 s1, 200
  4. s_wqm_b64 s[2:3], s[4:5]
  5. s_bcnt0_i32_b64 s1, s[2:3]
  6. s_swappc_b64 s[2:3], s[4:5]
  7. s_cbranch_join s[4:5]

For a full list of supported instructions, refer to "SOP1 Instructions" in the ISA Manual.

SOP2
  1. s_add_u32 s1, s2, s3
  2. s_and_b64 s[2:3], s[4:5], s[6:7]
  3. s_cselect_b32 s1, s2, s3
  4. s_andn2_b32 s2, s4, s6
  5. s_lshr_b64 s[2:3], s[4:5], s6
  6. s_ashr_i32 s2, s4, s6
  7. s_bfm_b64 s[2:3], s4, s6
  8. s_bfe_i64 s[2:3], s[4:5], s6
  9. s_cbranch_g_fork s[4:5], s[6:7]

For a full list of supported instructions, refer to "SOP2 Instructions" in the ISA Manual.

SOPC
  1. s_cmp_eq_i32 s1, s2
  2. s_bitcmp1_b32 s1, s2
  3. s_bitcmp0_b64 s[2:3], s4
  4. s_setvskip s3, s5

For a full list of supported instructions, refer to "SOPC Instructions" in the ISA Manual.

SOPP
  1. s_barrier
  2. s_nop 2
  3. s_endpgm
  4. s_waitcnt 0 ; Wait for all counters to be 0
  5. s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
  6. s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
  7. s_sethalt 9
  8. s_sleep 10
  9. s_sendmsg 0x1
  10. s_sendmsg sendmsg(MSG_INTERRUPT)
  11. s_trap 1

For a full list of supported instructions, refer to "SOPP Instructions" in the ISA Manual.

Unless otherwise mentioned, little verification is performed on the operands of SOPP Instructions, so it is up to the programmer to be familiar with the range of acceptable values.

VALU

For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA), the assembler will automatically use optimal encoding based on its operands. To force specific encoding, one can add a suffix to the opcode of the instruction:

  • _e32 for 32-bit VOP1/VOP2/VOPC
  • _e64 for 64-bit VOP3
  • _dpp for VOP_DPP
  • _sdwa for VOP_SDWA

VOP1/VOP2/VOP3/VOPC examples:

  1. v_mov_b32 v1, v2
  2. v_mov_b32_e32 v1, v2
  3. v_nop
  4. v_cvt_f64_i32_e32 v[1:2], v2
  5. v_floor_f32_e32 v1, v2
  6. v_bfrev_b32_e32 v1, v2
  7. v_add_f32_e32 v1, v2, v3
  8. v_mul_i32_i24_e64 v1, v2, 3
  9. v_mul_i32_i24_e32 v1, -3, v3
  10. v_mul_i32_i24_e32 v1, -100, v3
  11. v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
  12. v_max_f16_e32 v1, v2, v3

VOP_DPP examples:

  1. v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
  2. v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  3. v_mov_b32 v0, v0 wave_shl:1
  4. v_mov_b32 v0, v0 row_mirror
  5. v_mov_b32 v0, v0 row_bcast:31
  6. v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
  7. v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  8. v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0

VOP_SDWA examples:

  1. v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
  2. v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
  3. v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
  4. v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
  5. v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0

For full list of supported instructions, refer to “Vector ALU instructions”.

Code Object V2 Predefined Symbols (-mattr=-code-object-v3)

Warning

Code Object V2 is not the default code object version emitted by this version of LLVM. For a description of the predefined symbols available with the default configuration (Code Object V3) see Code Object V3 Predefined Symbols (-mattr=+code-object-v3).

The AMDGPU assembler defines and updates some symbols automatically. These symbols do not affect code generation.

.option.machine_version_major

Set to the GFX major generation number of the target being assembled for. For example, when assembling for a "GFX9" target this will be set to the integer value "9". The possible GFX major generation numbers are presented in Processors.

.option.machine_version_minor

Set to the GFX minor generation number of the target being assembled for. For example, when assembling for a "GFX810" target this will be set to the integer value "1". The possible GFX minor generation numbers are presented in Processors.

.option.machine_version_stepping

Set to the GFX stepping generation number of the target being assembled for. For example, when assembling for a "GFX704" target this will be set to the integer value "4". The possible GFX stepping generation numbers are presented in Processors.

.kernel.vgpr_count

Set to zero each time a .amdgpu_hsa_kernel (name) directive is encountered. At each instruction, if the current value of this symbol is less than or equal to the maximum VGPR number explicitly referenced within that instruction then the symbol value is updated to equal that VGPR number plus one.

.kernel.sgpr_count

Set to zero each time a .amdgpu_hsa_kernel (name) directive is encountered. At each instruction, if the current value of this symbol is less than or equal to the maximum SGPR number explicitly referenced within that instruction then the symbol value is updated to equal that SGPR number plus one.

Code Object V2 Directives (-mattr=-code-object-v3)

Warning

Code Object V2 is not the default code object version emitted by this version of LLVM. For a description of the directives supported with the default configuration (Code Object V3) see Code Object V3 Directives (-mattr=+code-object-v3).

The AMDGPU ABI defines auxiliary data in the output code object. In assembly source, one can specify it with assembler directives.

.hsa_code_object_version major, minor

major and minor are integers that specify the version of the HSA code object that will be generated by the assembler.

.hsa_code_object_isa [major, minor, stepping, vendor, arch]

major, minor, and stepping are all integers that describe the instruction set architecture (ISA) version of the assembly program.

vendor and arch are quoted strings. vendor should always be equal to “AMD” and arch should always be equal to “AMDGPU”.

By default, the assembler will derive the ISA version, vendor, and arch from the value of the -mcpu option that is passed to the assembler.

.amdgpu_hsa_kernel (name)

This directive specifies that the symbol with the given name is a kernel entry point (label) and the object should contain a corresponding symbol of type STT_AMDGPU_HSA_KERNEL.

.amd_kernel_code_t

This directive marks the beginning of a list of key / value pairs that are used to specify the amd_kernel_code_t object that will be emitted by the assembler. The list must be terminated by the .end_amd_kernel_code_t directive. For any amd_kernel_code_t values that are unspecified a default value will be used. The default value for all keys is 0, with the following exceptions:

  • amd_code_version_major defaults to 1.
  • amd_kernel_code_version_minor defaults to 2.
  • amd_machine_kind defaults to 1.
  • amd_machine_version_major, amd_machine_version_minor, and amd_machine_version_stepping are derived from the value of the -mcpu option that is passed to the assembler.
  • kernel_code_entry_byte_offset defaults to 256.
  • wavefront_size defaults to 6 for all targets before GFX10. For GFX10 onwards it defaults to 6 if target feature wavefrontsize64 is enabled, otherwise 5. Note that wavefront size is specified as a power of two, so a value of n means a size of 2^n.
  • call_convention defaults to -1.
  • kernarg_segment_alignment, group_segment_alignment, and private_segment_alignment default to 4. Note that alignments are specified as a power of 2, so a value of n means an alignment of 2^n.
  • enable_wgp_mode defaults to 1 if target feature cumode is disabled for GFX10 onwards.
  • enable_mem_ordered defaults to 1 for GFX10 onwards.
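
The power-of-two encoding used by several of these keys can be illustrated with a short Python sketch (the decode helper and dict are illustrative, not part of any LLVM or ROCm API; the values are the documented defaults):

```python
# Defaults from the list above that use log2 encoding.
defaults = {
    "wavefront_size": 6,             # pre-GFX10 default
    "kernarg_segment_alignment": 4,
    "group_segment_alignment": 4,
    "private_segment_alignment": 4,
}

def decode(n):
    """A stored value of n means 2**n."""
    return 1 << n

print(decode(defaults["wavefront_size"]))             # 64-lane wavefront
print(decode(defaults["kernarg_segment_alignment"]))  # 16-byte alignment
```

So the default wavefront_size of 6 encodes a 64-lane wavefront, and the default alignment value of 4 encodes 16-byte alignment.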

The .amd_kernel_code_t directive must be placed immediately after the function label and before any instructions.

For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document, the comments in lib/Target/AMDGPU/AmdKernelCodeT.h, and test/CodeGen/AMDGPU/hsa.s.

Code Object V2 Example Source Code (-mattr=-code-object-v3)

Warning

Code Object V2 is not the default code object version emitted by this version of LLVM. For a description of the directives supported with the default configuration (Code Object V3) see Code Object V3 Example Source Code (-mattr=+code-object-v3).

Here is an example of a minimal assembly source file, defining one HSA kernel:

     .hsa_code_object_version 1,0
     .hsa_code_object_isa

     .hsatext
     .globl  hello_world
     .p2align 8
     .amdgpu_hsa_kernel hello_world

  hello_world:

     .amd_kernel_code_t
        enable_sgpr_kernarg_segment_ptr = 1
        is_ptr64 = 1
        compute_pgm_rsrc1_vgprs = 0
        compute_pgm_rsrc1_sgprs = 0
        compute_pgm_rsrc2_user_sgpr = 2
        compute_pgm_rsrc1_wgp_mode = 0
        compute_pgm_rsrc1_mem_ordered = 0
        compute_pgm_rsrc1_fwd_progress = 1
     .end_amd_kernel_code_t

     s_load_dwordx2 s[0:1], s[0:1] 0x0
     v_mov_b32 v0, 3.14159
     s_waitcnt lgkmcnt(0)
     v_mov_b32 v1, s0
     v_mov_b32 v2, s1
     flat_store_dword v[1:2], v0
     s_endpgm
  .Lfunc_end0:
     .size   hello_world, .Lfunc_end0-hello_world

Code Object V3 Predefined Symbols (-mattr=+code-object-v3)

The AMDGPU assembler defines and updates some symbols automatically. These symbols do not affect code generation.

.amdgcn.gfx_generation_number

Set to the GFX major generation number of the target being assembled for. For example, when assembling for a “GFX9” target this will be set to the integer value “9”. The possible GFX major generation numbers are presented in Processors.

.amdgcn.gfx_generation_minor

Set to the GFX minor generation number of the target being assembled for. For example, when assembling for a “GFX810” target this will be set to the integer value “1”. The possible GFX minor generation numbers are presented in Processors.

.amdgcn.gfx_generation_stepping

Set to the GFX stepping generation number of the target being assembled for. For example, when assembling for a “GFX704” target this will be set to the integer value “4”. The possible GFX stepping generation numbers are presented in Processors.

.amdgcn.next_free_vgpr

Set to zero before assembly begins. At each instruction, if the current value of this symbol is less than or equal to the maximum VGPR number explicitly referenced within that instruction then the symbol value is updated to equal that VGPR number plus one.

May be used to set the .amdhsa_next_free_vgpr directive in AMDHSA Kernel Assembler Directives.

May be set at any time, e.g. manually set to zero at the start of each kernel.

.amdgcn.next_free_sgpr

Set to zero before assembly begins. At each instruction, if the current value of this symbol is less than or equal to the maximum SGPR number explicitly referenced within that instruction then the symbol value is updated to equal that SGPR number plus one.

May be used to set the .amdhsa_next_free_sgpr directive in AMDHSA Kernel Assembler Directives.

May be set at any time, e.g. manually set to zero at the start of each kernel.

Code Object V3 Directives (-mattr=+code-object-v3)

Directives which begin with .amdgcn are valid for all amdgcn architecture processors, and are not OS-specific. Directives which begin with .amdhsa are specific to amdgcn architecture processors when the amdhsa OS is specified. See Target Triples and Processors.

.amdgcn_target <target>

Optional directive which declares the target supported by the containing assembler source file. Valid values are described in Code Object Target Identification. Used by the assembler to validate command-line options such as -triple, -mcpu, and those which specify target features.

.amdhsa_kernel <name>

Creates a correctly aligned AMDHSA kernel descriptor and a symbol, <name>.kd, in the current location of the current section. Only valid when the OS is amdhsa. <name> must be a symbol that labels the first instruction to execute, and does not need to be previously defined.

Marks the beginning of a list of directives used to generate the bytes of a kernel descriptor, as described in Kernel Descriptor. Directives which may appear in this list are described in AMDHSA Kernel Assembler Directives. Directives may appear in any order, must be valid for the target being assembled for, and cannot be repeated. Directives support the range of values specified by the field they reference in Kernel Descriptor. If a directive is not specified, it is assumed to have its default value, unless it is marked as “Required”, in which case it is an error to omit the directive. This list of directives is terminated by an .end_amdhsa_kernel directive.
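
These constraints (any order, no repeats, “Required” directives mandatory) can be sketched as a small hypothetical Python model; validate and REQUIRED are illustrative names, not assembler internals:

```python
# The two directives documented as "Required" in the table below.
REQUIRED = {".amdhsa_next_free_vgpr", ".amdhsa_next_free_sgpr"}

def validate(directives):
    """directives: (name, value) pairs appearing between .amdhsa_kernel
    and .end_amdhsa_kernel. Raises on a repeat or a missing required one."""
    seen = set()
    for name, _value in directives:
        if name in seen:
            raise ValueError("directive repeated: " + name)
        seen.add(name)
    missing = REQUIRED - seen
    if missing:
        raise ValueError("missing required: " + ", ".join(sorted(missing)))
    return True

print(validate([(".amdhsa_next_free_sgpr", 2),
                (".amdhsa_next_free_vgpr", 3)]))   # order does not matter
```

Omitting either required directive, or listing any directive twice, is an error; every other directive simply falls back to its default.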

AMDHSA Kernel Assembler Directives

  • .amdhsa_group_segment_fixed_size (default: 0; GFX6-GFX10): Controls GROUP_SEGMENT_FIXED_SIZE in Kernel Descriptor for GFX6-GFX10.
  • .amdhsa_private_segment_fixed_size (default: 0; GFX6-GFX10): Controls PRIVATE_SEGMENT_FIXED_SIZE in Kernel Descriptor for GFX6-GFX10.
  • .amdhsa_user_sgpr_private_segment_buffer (default: 0; GFX6-GFX10): Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in Kernel Descriptor for GFX6-GFX10.
  • .amdhsa_user_sgpr_dispatch_ptr (default: 0; GFX6-GFX10): Controls ENABLE_SGPR_DISPATCH_PTR in Kernel Descriptor for GFX6-GFX10.
  • .amdhsa_user_sgpr_queue_ptr (default: 0; GFX6-GFX10): Controls ENABLE_SGPR_QUEUE_PTR in Kernel Descriptor for GFX6-GFX10.
  • .amdhsa_user_sgpr_kernarg_segment_ptr (default: 0; GFX6-GFX10): Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in Kernel Descriptor for GFX6-GFX10.
  • .amdhsa_user_sgpr_dispatch_id (default: 0; GFX6-GFX10): Controls ENABLE_SGPR_DISPATCH_ID in Kernel Descriptor for GFX6-GFX10.
  • .amdhsa_user_sgpr_flat_scratch_init (default: 0; GFX6-GFX10): Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in Kernel Descriptor for GFX6-GFX10.
  • .amdhsa_user_sgpr_private_segment_size (default: 0; GFX6-GFX10): Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in Kernel Descriptor for GFX6-GFX10.
  • .amdhsa_wavefront_size32 (default: target feature specific (-wavefrontsize64); GFX10): Controls ENABLE_WAVEFRONT_SIZE32 in Kernel Descriptor for GFX6-GFX10.
  • .amdhsa_system_sgpr_private_segment_wavefront_offset (default: 0; GFX6-GFX10): Controls ENABLE_SGPR_PRIVATE_SEGMENT_WAVEFRONT_OFFSET in compute_pgm_rsrc2 for GFX6-GFX10.
  • .amdhsa_system_sgpr_workgroup_id_x (default: 1; GFX6-GFX10): Controls ENABLE_SGPR_WORKGROUP_ID_X in compute_pgm_rsrc2 for GFX6-GFX10.
  • .amdhsa_system_sgpr_workgroup_id_y (default: 0; GFX6-GFX10): Controls ENABLE_SGPR_WORKGROUP_ID_Y in compute_pgm_rsrc2 for GFX6-GFX10.
  • .amdhsa_system_sgpr_workgroup_id_z (default: 0; GFX6-GFX10): Controls ENABLE_SGPR_WORKGROUP_ID_Z in compute_pgm_rsrc2 for GFX6-GFX10.
  • .amdhsa_system_sgpr_workgroup_info (default: 0; GFX6-GFX10): Controls ENABLE_SGPR_WORKGROUP_INFO in compute_pgm_rsrc2 for GFX6-GFX10.
  • .amdhsa_system_vgpr_workitem_id (default: 0; GFX6-GFX10): Controls ENABLE_VGPR_WORKITEM_ID in compute_pgm_rsrc2 for GFX6-GFX10. Possible values are defined in System VGPR Work-Item ID Enumeration Values.
  • .amdhsa_next_free_vgpr (Required; GFX6-GFX10): Maximum VGPR number explicitly referenced, plus one. Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in compute_pgm_rsrc1 for GFX6-GFX10.
  • .amdhsa_next_free_sgpr (Required; GFX6-GFX10): Maximum SGPR number explicitly referenced, plus one. Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in compute_pgm_rsrc1 for GFX6-GFX10.
  • .amdhsa_reserve_vcc (default: 1; GFX6-GFX10): Whether the kernel may use the special VCC SGPR. Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in compute_pgm_rsrc1 for GFX6-GFX10.
  • .amdhsa_reserve_flat_scratch (default: 1; GFX7-GFX10): Whether the kernel may use flat instructions to access scratch memory. Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in compute_pgm_rsrc1 for GFX6-GFX10.
  • .amdhsa_reserve_xnack_mask (default: target feature specific (+xnack); GFX8-GFX10): Whether the kernel may trigger XNACK replay. Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in compute_pgm_rsrc1 for GFX6-GFX10.
  • .amdhsa_float_round_mode_32 (default: 0; GFX6-GFX10): Controls FLOAT_ROUND_MODE_32 in compute_pgm_rsrc1 for GFX6-GFX10. Possible values are defined in Floating Point Rounding Mode Enumeration Values.
  • .amdhsa_float_round_mode_16_64 (default: 0; GFX6-GFX10): Controls FLOAT_ROUND_MODE_16_64 in compute_pgm_rsrc1 for GFX6-GFX10. Possible values are defined in Floating Point Rounding Mode Enumeration Values.
  • .amdhsa_float_denorm_mode_32 (default: 0; GFX6-GFX10): Controls FLOAT_DENORM_MODE_32 in compute_pgm_rsrc1 for GFX6-GFX10. Possible values are defined in Floating Point Denorm Mode Enumeration Values.
  • .amdhsa_float_denorm_mode_16_64 (default: 3; GFX6-GFX10): Controls FLOAT_DENORM_MODE_16_64 in compute_pgm_rsrc1 for GFX6-GFX10. Possible values are defined in Floating Point Denorm Mode Enumeration Values.
  • .amdhsa_dx10_clamp (default: 1; GFX6-GFX10): Controls ENABLE_DX10_CLAMP in compute_pgm_rsrc1 for GFX6-GFX10.
  • .amdhsa_ieee_mode (default: 1; GFX6-GFX10): Controls ENABLE_IEEE_MODE in compute_pgm_rsrc1 for GFX6-GFX10.
  • .amdhsa_fp16_overflow (default: 0; GFX9-GFX10): Controls FP16_OVFL in compute_pgm_rsrc1 for GFX6-GFX10.
  • .amdhsa_workgroup_processor_mode (default: target feature specific (-cumode); GFX10): Controls ENABLE_WGP_MODE in Kernel Descriptor for GFX6-GFX10.
  • .amdhsa_memory_ordered (default: 1; GFX10): Controls MEM_ORDERED in compute_pgm_rsrc1 for GFX6-GFX10.
  • .amdhsa_forward_progress (default: 0; GFX10): Controls FWD_PROGRESS in compute_pgm_rsrc1 for GFX6-GFX10.
  • .amdhsa_exception_fp_ieee_invalid_op (default: 0; GFX6-GFX10): Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in compute_pgm_rsrc2 for GFX6-GFX10.
  • .amdhsa_exception_fp_denorm_src (default: 0; GFX6-GFX10): Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in compute_pgm_rsrc2 for GFX6-GFX10.
  • .amdhsa_exception_fp_ieee_div_zero (default: 0; GFX6-GFX10): Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in compute_pgm_rsrc2 for GFX6-GFX10.
  • .amdhsa_exception_fp_ieee_overflow (default: 0; GFX6-GFX10): Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in compute_pgm_rsrc2 for GFX6-GFX10.
  • .amdhsa_exception_fp_ieee_underflow (default: 0; GFX6-GFX10): Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in compute_pgm_rsrc2 for GFX6-GFX10.
  • .amdhsa_exception_fp_ieee_inexact (default: 0; GFX6-GFX10): Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in compute_pgm_rsrc2 for GFX6-GFX10.
  • .amdhsa_exception_int_div_zero (default: 0; GFX6-GFX10): Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in compute_pgm_rsrc2 for GFX6-GFX10.

.amdgpu_metadata
.amdgpu_metadata

Optional directive which declares the contents of the NT_AMDGPU_METADATA note record (see AMDGPU Code Object V3 ELF Note Records).

The contents must be in the [YAML] markup format, with the same structure and semantics described in Code Object V3 Metadata (-mattr=+code-object-v3).

This directive is terminated by an .end_amdgpu_metadata directive.

Code Object V3 Example Source Code (-mattr=+code-object-v3)

Here is an example of a minimal assembly source file, defining one HSA kernel:

  .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional

  .text
  .globl hello_world
  .p2align 8
  .type hello_world,@function
  hello_world:
    s_load_dwordx2 s[0:1], s[0:1] 0x0
    v_mov_b32 v0, 3.14159
    s_waitcnt lgkmcnt(0)
    v_mov_b32 v1, s0
    v_mov_b32 v2, s1
    flat_store_dword v[1:2], v0
    s_endpgm
  .Lfunc_end0:
    .size hello_world, .Lfunc_end0-hello_world

  .rodata
  .p2align 6
  .amdhsa_kernel hello_world
    .amdhsa_user_sgpr_kernarg_segment_ptr 1
    .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
    .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
  .end_amdhsa_kernel

  .amdgpu_metadata
  ---
  amdhsa.version:
    - 1
    - 0
  amdhsa.kernels:
    - .name: hello_world
      .symbol: hello_world.kd
      .kernarg_segment_size: 48
      .group_segment_fixed_size: 0
      .private_segment_fixed_size: 0
      .kernarg_segment_align: 4
      .wavefront_size: 64
      .sgpr_count: 2
      .vgpr_count: 3
      .max_flat_workgroup_size: 256
  ...
  .end_amdgpu_metadata

If an assembly source file contains multiple kernels and/or functions, the .amdgcn.next_free_vgpr and .amdgcn.next_free_sgpr symbols may be reset using the .set <symbol>, <expression> directive. For example, in the case of two kernels, where func1 is only called from kern1, it is sufficient to group the function with the kernel that calls it and reset the symbols between the two connected components:

  .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional

  // gpr tracking symbols are implicitly set to zero

  .text
  .globl kern0
  .p2align 8
  .type kern0,@function
  kern0:
    // ...
    s_endpgm
  .Lkern0_end:
    .size kern0, .Lkern0_end-kern0

  .rodata
  .p2align 6
  .amdhsa_kernel kern0
    // ...
    .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
    .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
  .end_amdhsa_kernel

  // reset symbols to begin tracking usage in func1 and kern1
  .set .amdgcn.next_free_vgpr, 0
  .set .amdgcn.next_free_sgpr, 0

  .text
  .hidden func1
  .global func1
  .p2align 2
  .type func1,@function
  func1:
    // ...
    s_setpc_b64 s[30:31]
  .Lfunc1_end:
    .size func1, .Lfunc1_end-func1

  .globl kern1
  .p2align 8
  .type kern1,@function
  kern1:
    // ...
    s_getpc_b64 s[4:5]
    s_add_u32 s4, s4, func1@rel32@lo+4
    s_addc_u32 s5, s5, func1@rel32@hi+4
    s_swappc_b64 s[30:31], s[4:5]
    // ...
    s_endpgm
  .Lkern1_end:
    .size kern1, .Lkern1_end-kern1

  .rodata
  .p2align 6
  .amdhsa_kernel kern1
    // ...
    .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
    .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
  .end_amdhsa_kernel

These symbols cannot identify connected components in order to automatically track the usage for each kernel. However, in some cases careful organization of the kernels and functions in the source file means there is minimal additional effort required to accurately calculate GPR usage.

Additional Documentation

[AMD-RADEON-HD-2000-3000] AMD R6xx shader ISA
[AMD-RADEON-HD-4000] AMD R7xx shader ISA
[AMD-RADEON-HD-5000] AMD Evergreen shader ISA
[AMD-RADEON-HD-6000] AMD Cayman/Trinity shader ISA
[AMD-GCN-GFX6] AMD Southern Islands Series ISA
[AMD-GCN-GFX7] AMD Sea Islands Series ISA
[AMD-GCN-GFX8] AMD GCN3 Instruction Set Architecture
[AMD-GCN-GFX9] AMD “Vega” Instruction Set Architecture
[AMD-GCN-GFX10] AMD “RDNA 1.0” Instruction Set Architecture
[AMD-ROCm] ROCm: Open Platform for Development, Discovery and Education Around GPU Computing
[AMD-ROCm-github] ROCm github
[HSA] Heterogeneous System Architecture (HSA) Foundation
[HIP] HIP Programming Guide
[ELF] Executable and Linkable Format (ELF)
[DWARF] DWARF Debugging Information Format
[YAML] YAML Ain’t Markup Language (YAML™) Version 1.2
[MsgPack] Message Pack
[SEMVER] Semantic Versioning
[OpenCL] The OpenCL Specification Version 2.0
[HRF] Heterogeneous-race-free Memory Models
[CLANG-ATTR] Attributes in Clang