User Guide for NVPTX Back-end

Introduction

To support GPU programming, the NVPTX back-end supports a subset of LLVM IRalong with a defined set of conventions used to represent GPU programmingconcepts. This document provides an overview of the general usage of the back-end, including a description of the conventions used and the set of acceptedLLVM IR.

Note

This document assumes a basic familiarity with CUDA and the PTXassembly language. Information about the CUDA Driver API and the PTX assemblylanguage can be found in the CUDA documentation.

Conventions

Marking Functions as Kernels

In PTX, there are two types of functions: device functions, which are onlycallable by device code, and kernel functions, which are callable by hostcode. By default, the back-end will emit device functions. Metadata is used todeclare a function as a kernel function. This metadata is attached to thenvvm.annotations named metadata object, and has the following format:

  1. !0 = !{<function-ref>, metadata !"kernel", i32 1}

The first parameter is a reference to the kernel function. The followingexample shows a kernel function calling a device function in LLVM IR. Thefunction @my_kernel is callable from host code, but @my_fmad is not.

  1. define float @my_fmad(float %x, float %y, float %z) {
  2. %mul = fmul float %x, %y
  3. %add = fadd float %mul, %z
  4. ret float %add
  5. }
  6.  
  7. define void @my_kernel(float* %ptr) {
  8. %val = load float, float* %ptr
  9. %ret = call float @my_fmad(float %val, float %val, float %val)
  10. store float %ret, float* %ptr
  11. ret void
  12. }
  13.  
  14. !nvvm.annotations = !{!1}
  15. !1 = !{void (float*)* @my_kernel, !"kernel", i32 1}

When compiled, the PTX kernel functions are callable by host-side code.

Address Spaces

The NVPTX back-end uses the following address space mapping:

Address SpaceMemory Space
0Generic
1Global
2Internal Use
3Shared
4Constant
5Local

Every global variable and pointer type is assigned to one of these addressspaces, with 0 being the default address space. Intrinsics are provided whichcan be used to convert pointers between the generic and non-generic addressspaces.

As an example, the following IR will define an array @g that resides inglobal device memory.

  1. @g = internal addrspace(1) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ]

LLVM IR functions can read and write to this array, and host-side code cancopy data to it by name with the CUDA Driver API.

Note that since address space 0 is the generic space, it is illegal to haveglobal variables in address space 0. Address space 0 is the default addressspace in LLVM, so the addrspace(N) annotation is required for globalvariables.

Triples

The NVPTX target uses the module triple to select between 32/64-bit codegeneration and the driver-compiler interface to use. The triple architecturecan be one of nvptx (32-bit PTX) or nvptx64 (64-bit PTX). Theoperating system should be one of cuda or nvcl, which determines theinterface used by the generated code to communicate with the driver. Mostusers will want to use cuda as the operating system, which makes thegenerated PTX compatible with the CUDA Driver API.

Example: 32-bit PTX for CUDA Driver API: nvptx-nvidia-cuda

Example: 64-bit PTX for CUDA Driver API: nvptx64-nvidia-cuda

NVPTX Intrinsics

Address Space Conversion

‘llvm.nvvm.ptr.*.to.gen’ Intrinsics

Syntax:

These are overloaded intrinsics. You can use these on any pointer types.

  1. declare i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)*)
  2. declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*)
  3. declare i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)*)
  4. declare i8* @llvm.nvvm.ptr.local.to.gen.p0i8.p5i8(i8 addrspace(5)*)
Overview:

The ‘llvm.nvvm.ptr.*.to.gen’ intrinsics convert a pointer in a non-genericaddress space to a generic address space pointer.

Semantics:

These intrinsics modify the pointer value to be a valid generic address spacepointer.

‘llvm.nvvm.ptr.gen.to.*’ Intrinsics

Syntax:

These are overloaded intrinsics. You can use these on any pointer types.

  1. declare i8 addrspace(1)* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8*)
  2. declare i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8*)
  3. declare i8 addrspace(4)* @llvm.nvvm.ptr.gen.to.constant.p4i8.p0i8(i8*)
  4. declare i8 addrspace(5)* @llvm.nvvm.ptr.gen.to.local.p5i8.p0i8(i8*)
Overview:

The ‘llvm.nvvm.ptr.gen.to.*’ intrinsics convert a pointer in the genericaddress space to a pointer in the target address space. Note that theseintrinsics are only useful if the address space of the target address space ofthe pointer is known. It is not legal to use address space conversionintrinsics to convert a pointer from one non-generic address space to anothernon-generic address space.

Semantics:

These intrinsics modify the pointer value to be a valid pointer in the targetnon-generic address space.

Reading PTX Special Registers

‘llvm.nvvm.read.ptx.sreg.*’

Syntax:
  1. declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  2. declare i32 @llvm.nvvm.read.ptx.sreg.tid.y()
  3. declare i32 @llvm.nvvm.read.ptx.sreg.tid.z()
  4. declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
  5. declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
  6. declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
  7. declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
  8. declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
  9. declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
  10. declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x()
  11. declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.y()
  12. declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.z()
  13. declare i32 @llvm.nvvm.read.ptx.sreg.warpsize()
Overview:

The ‘@llvm.nvvm.read.ptx.sreg.*’ intrinsics provide access to the PTXspecial registers, in particular the kernel launch bounds. These registersmap in the following way to CUDA builtins:

CUDA BuiltinPTX Special Register Intrinsic
threadId@llvm.nvvm.read.ptx.sreg.tid.
blockIdx@llvm.nvvm.read.ptx.sreg.ctaid.
blockDim@llvm.nvvm.read.ptx.sreg.ntid.
gridDim@llvm.nvvm.read.ptx.sreg.nctaid.

Barriers

‘llvm.nvvm.barrier0’

Syntax:
  1. declare void @llvm.nvvm.barrier0()
Overview:

The ‘@llvm.nvvm.barrier0()’ intrinsic emits a PTX bar.sync 0instruction, equivalent to the __syncthreads() call in CUDA.

Other Intrinsics

For the full set of NVPTX intrinsics, please see theinclude/llvm/IR/IntrinsicsNVVM.td file in the LLVM source tree.

Linking with Libdevice

The CUDA Toolkit comes with an LLVM bitcode library called libdevice thatimplements many common mathematical functions. This library can be used as ahigh-performance math library for any compilers using the LLVM NVPTX target.The library can be found under nvvm/libdevice/ in the CUDA Toolkit andthere is a separate version for each compute architecture.

For a list of all math functions implemented in libdevice, seelibdevice Users Guide.

To accommodate various math-related compiler flags that can affect codegeneration of libdevice code, the library code depends on a special LLVM IRpass (NVVMReflect) to handle conditional compilation within LLVM IR. Thispass looks for calls to the @__nvvm_reflect function and replaces themwith constants based on the defined reflection parameters. Such conditionalcode often follows a pattern:

  1. float my_function(float a) {
  2. if (__nvvm_reflect("FASTMATH"))
  3. return my_function_fast(a);
  4. else
  5. return my_function_precise(a);
  6. }

The default value for all unspecified reflection parameters is zero.

The NVVMReflect pass should be executed early in the optimizationpipeline, immediately after the link stage. The internalize pass is alsorecommended to remove unused math functions from the resulting PTX. For aninput IR module module.bc, the following compilation flow is recommended:

  • Save list of external functions in module.bc
  • Link module.bc with libdevice.compute_XX.YY.bc
  • Internalize all functions not in list from (1)
  • Eliminate all unused internal functions
  • Run NVVMReflect pass
  • Run standard optimization pipeline

Note

linkonce and linkonce_odr linkage types are not suitable for thelibdevice functions. It is possible to link two IR modules that have beenlinked against libdevice using different reflection variables.

Since the NVVMReflect pass replaces conditionals with constants, it willoften leave behind dead code of the form:

  1. entry:
  2. ..
  3. br i1 true, label %foo, label %bar
  4. foo:
  5. ..
  6. bar:
  7. ; Dead code
  8. ..

Therefore, it is recommended that NVVMReflect is executed early in theoptimization pipeline before dead-code elimination.

The NVPTX TargetMachine knows how to schedule NVVMReflect at the beginningof your pass manager; just use the following code when setting up your passmanager:

  1. std::unique_ptr<TargetMachine> TM = ...;
  2. PassManagerBuilder PMBuilder(...);
  3. if (TM)
  4. TM->adjustPassManager(PMBuilder);

Reflection Parameters

The libdevice library currently uses the following reflection parameters tocontrol code generation:

FlagDescription
__CUDA_FTZ=[0,1]Use optimized code paths that flush subnormals to zero

The value of this flag is determined by the “nvvm-reflect-ftz” module flag.The following sets the ftz flag to 1.

  1. !llvm.module.flag = !{!0}
  2. !0 = !{i32 4, !"nvvm-reflect-ftz", i32 1}

(i32 4 indicates that the value set here overrides the value in anothermodule we link with. See the _LangRef <LangRef.html#module-flags-metadata>_for details.)

Executing PTX

The most common way to execute PTX assembly on a GPU device is to use the CUDADriver API. This API is a low-level interface to the GPU driver and allows forJIT compilation of PTX code to native GPU machine code.

Initializing the Driver API:

  1. CUdevice device;
  2. CUcontext context;
  3.  
  4. // Initialize the driver API
  5. cuInit(0);
  6. // Get a handle to the first compute device
  7. cuDeviceGet(&device, 0);
  8. // Create a compute device context
  9. cuCtxCreate(&context, 0, device);

JIT compiling a PTX string to a device binary:

  1. CUmodule module;
  2. CUfunction function;
  3.  
  4. // JIT compile a null-terminated PTX string
  5. cuModuleLoadData(&module, (void*)PTXString);
  6.  
  7. // Get a handle to the "myfunction" kernel function
  8. cuModuleGetFunction(&function, module, "myfunction");

For full examples of executing PTX assembly, please see the CUDA Samples distribution.

Common Issues

ptxas complains of undefined function: __nvvm_reflect

When linking with libdevice, the NVVMReflect pass must be used. SeeLinking with Libdevice for more information.

Tutorial: A Simple Compute Kernel

To start, let us take a look at a simple compute kernel written directly inLLVM IR. The kernel implements vector addition, where each thread computes oneelement of the output vector C from the input vectors A and B. To make thiseasier, we also assume that only a single CTA (thread block) will be launched,and that it will be one dimensional.

The Kernel

  1. target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
  2. target triple = "nvptx64-nvidia-cuda"
  3.  
  4. ; Intrinsic to read X component of thread ID
  5. declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
  6.  
  7. define void @kernel(float addrspace(1)* %A,
  8. float addrspace(1)* %B,
  9. float addrspace(1)* %C) {
  10. entry:
  11. ; What is my ID?
  12. %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
  13.  
  14. ; Compute pointers into A, B, and C
  15. %ptrA = getelementptr float, float addrspace(1)* %A, i32 %id
  16. %ptrB = getelementptr float, float addrspace(1)* %B, i32 %id
  17. %ptrC = getelementptr float, float addrspace(1)* %C, i32 %id
  18.  
  19. ; Read A, B
  20. %valA = load float, float addrspace(1)* %ptrA, align 4
  21. %valB = load float, float addrspace(1)* %ptrB, align 4
  22.  
  23. ; Compute C = A + B
  24. %valC = fadd float %valA, %valB
  25.  
  26. ; Store back to C
  27. store float %valC, float addrspace(1)* %ptrC, align 4
  28.  
  29. ret void
  30. }
  31.  
  32. !nvvm.annotations = !{!0}
  33. !0 = !{void (float addrspace(1)*,
  34. float addrspace(1)*,
  35. float addrspace(1)*)* @kernel, !"kernel", i32 1}

We can use the LLVM llc tool to directly run the NVPTX code generator:

  1. # llc -mcpu=sm_20 kernel.ll -o kernel.ptx

Note

If you want to generate 32-bit code, change p:64:64:64 to p:32:32:32in the module data layout string and use nvptx-nvidia-cuda as thetarget triple.

The output we get from llc (as of LLVM 3.4):

  1. //
  2. // Generated by LLVM NVPTX Back-End
  3. //
  4.  
  5. .version 3.1
  6. .target sm_20
  7. .address_size 64
  8.  
  9. // .globl kernel
  10. // @kernel
  11. .visible .entry kernel(
  12. .param .u64 kernel_param_0,
  13. .param .u64 kernel_param_1,
  14. .param .u64 kernel_param_2
  15. )
  16. {
  17. .reg .f32 %f<4>;
  18. .reg .s32 %r<2>;
  19. .reg .s64 %rl<8>;
  20.  
  21. // %bb.0: // %entry
  22. ld.param.u64 %rl1, [kernel_param_0];
  23. mov.u32 %r1, %tid.x;
  24. mul.wide.s32 %rl2, %r1, 4;
  25. add.s64 %rl3, %rl1, %rl2;
  26. ld.param.u64 %rl4, [kernel_param_1];
  27. add.s64 %rl5, %rl4, %rl2;
  28. ld.param.u64 %rl6, [kernel_param_2];
  29. add.s64 %rl7, %rl6, %rl2;
  30. ld.global.f32 %f1, [%rl3];
  31. ld.global.f32 %f2, [%rl5];
  32. add.f32 %f3, %f1, %f2;
  33. st.global.f32 [%rl7], %f3;
  34. ret;
  35. }

Dissecting the Kernel

Now let us dissect the LLVM IR that makes up this kernel.

Data Layout

The data layout string determines the size in bits of common data types, theirABI alignment, and their storage size. For NVPTX, you should use one of thefollowing:

32-bit PTX:

  1. target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"

64-bit PTX:

  1. target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"

Target Intrinsics

In this example, we use the @llvm.nvvm.read.ptx.sreg.tid.x intrinsic toread the X component of the current thread’s ID, which corresponds to a readof register %tid.x in PTX. The NVPTX back-end supports a large set ofintrinsics. A short list is shown below; please seeinclude/llvm/IR/IntrinsicsNVVM.td for the full list.

IntrinsicCUDA Equivalent
i32 @llvm.nvvm.read.ptx.sreg.tid.{x,y,z}threadIdx.{x,y,z}
i32 @llvm.nvvm.read.ptx.sreg.ctaid.{x,y,z}blockIdx.{x,y,z}
i32 @llvm.nvvm.read.ptx.sreg.ntid.{x,y,z}blockDim.{x,y,z}
i32 @llvm.nvvm.read.ptx.sreg.nctaid.{x,y,z}gridDim.{x,y,z}
void @llvm.nvvm.barrier0()__syncthreads()

Address Spaces

You may have noticed that all of the pointer types in the LLVM IR example hadan explicit address space specifier. What is address space 1? NVIDIA GPUdevices (generally) have four types of memory:

  • Global: Large, off-chip memory
  • Shared: Small, on-chip memory shared among all threads in a CTA
  • Local: Per-thread, private memory
  • Constant: Read-only memory shared across all threads

These different types of memory are represented in LLVM IR as address spaces.There is also a fifth address space used by the NVPTX code generator thatcorresponds to the “generic” address space. This address space can representaddresses in any other address space (with a few exceptions). This allowsusers to write IR functions that can load/store memory using the sameinstructions. Intrinsics are provided to convert pointers between the genericand non-generic address spaces.

See Address Spaces and NVPTX Intrinsics for more information.

Kernel Metadata

In PTX, a function can be either a kernel function (callable from the hostprogram), or a device function (callable only from GPU code). You can thinkof kernel functions as entry-points in the GPU program. To mark an LLVM IRfunction as a kernel function, we make use of special LLVM metadata. TheNVPTX back-end will look for a named metadata node callednvvm.annotations. This named metadata must contain a list of metadata thatdescribe the IR. For our purposes, we need to declare a metadata node thatassigns the “kernel” attribute to the LLVM IR function that should be emittedas a PTX kernel function. These metadata nodes take the form:

  1. !{<function ref>, metadata !"kernel", i32 1}

For the previous example, we have:

  1. !nvvm.annotations = !{!0}
  2. !0 = !{void (float addrspace(1)*,
  3. float addrspace(1)*,
  4. float addrspace(1)*)* @kernel, !"kernel", i32 1}

Here, we have a single metadata declaration in nvvm.annotations. Thismetadata annotates our @kernel function with the kernel attribute.

Running the Kernel

Generating PTX from LLVM IR is all well and good, but how do we execute it ona real GPU device? The CUDA Driver API provides a convenient mechanism forloading and JIT compiling PTX to a native GPU device, and launching a kernel.The API is similar to OpenCL. A simple example showing how to load andexecute our vector addition code is shown below. Note that for brevity thiscode does not perform much error checking!

Note

You can also use the ptxas tool provided by the CUDA Toolkit to offlinecompile PTX to machine code (SASS) for a specific GPU architecture. Suchbinaries can be loaded by the CUDA Driver API in the same way as PTX. Thiscan be useful for reducing startup time by precompiling the PTX kernels.

  1. #include <iostream>
  2. #include <fstream>
  3. #include <cassert>
  4. #include "cuda.h"
  5.  
  6.  
  7. void checkCudaErrors(CUresult err) {
  8. assert(err == CUDA_SUCCESS);
  9. }
  10.  
  11. /// main - Program entry point
  12. int main(int argc, char **argv) {
  13. CUdevice device;
  14. CUmodule cudaModule;
  15. CUcontext context;
  16. CUfunction function;
  17. CUlinkState linker;
  18. int devCount;
  19.  
  20. // CUDA initialization
  21. checkCudaErrors(cuInit(0));
  22. checkCudaErrors(cuDeviceGetCount(&devCount));
  23. checkCudaErrors(cuDeviceGet(&device, 0));
  24.  
  25. char name[128];
  26. checkCudaErrors(cuDeviceGetName(name, 128, device));
  27. std::cout << "Using CUDA Device [0]: " << name << "\n";
  28.  
  29. int devMajor, devMinor;
  30. checkCudaErrors(cuDeviceComputeCapability(&devMajor, &devMinor, device));
  31. std::cout << "Device Compute Capability: "
  32. << devMajor << "." << devMinor << "\n";
  33. if (devMajor < 2) {
  34. std::cerr << "ERROR: Device 0 is not SM 2.0 or greater\n";
  35. return 1;
  36. }
  37.  
  38. std::ifstream t("kernel.ptx");
  39. if (!t.is_open()) {
  40. std::cerr << "kernel.ptx not found\n";
  41. return 1;
  42. }
  43. std::string str((std::istreambuf_iterator<char>(t)),
  44. std::istreambuf_iterator<char>());
  45.  
  46. // Create driver context
  47. checkCudaErrors(cuCtxCreate(&context, 0, device));
  48.  
  49. // Create module for object
  50. checkCudaErrors(cuModuleLoadDataEx(&cudaModule, str.c_str(), 0, 0, 0));
  51.  
  52. // Get kernel function
  53. checkCudaErrors(cuModuleGetFunction(&function, cudaModule, "kernel"));
  54.  
  55. // Device data
  56. CUdeviceptr devBufferA;
  57. CUdeviceptr devBufferB;
  58. CUdeviceptr devBufferC;
  59.  
  60. checkCudaErrors(cuMemAlloc(&devBufferA, sizeof(float)*16));
  61. checkCudaErrors(cuMemAlloc(&devBufferB, sizeof(float)*16));
  62. checkCudaErrors(cuMemAlloc(&devBufferC, sizeof(float)*16));
  63.  
  64. float* hostA = new float[16];
  65. float* hostB = new float[16];
  66. float* hostC = new float[16];
  67.  
  68. // Populate input
  69. for (unsigned i = 0; i != 16; ++i) {
  70. hostA[i] = (float)i;
  71. hostB[i] = (float)(2*i);
  72. hostC[i] = 0.0f;
  73. }
  74.  
  75. checkCudaErrors(cuMemcpyHtoD(devBufferA, &hostA[0], sizeof(float)*16));
  76. checkCudaErrors(cuMemcpyHtoD(devBufferB, &hostB[0], sizeof(float)*16));
  77.  
  78.  
  79. unsigned blockSizeX = 16;
  80. unsigned blockSizeY = 1;
  81. unsigned blockSizeZ = 1;
  82. unsigned gridSizeX = 1;
  83. unsigned gridSizeY = 1;
  84. unsigned gridSizeZ = 1;
  85.  
  86. // Kernel parameters
  87. void *KernelParams[] = { &devBufferA, &devBufferB, &devBufferC };
  88.  
  89. std::cout << "Launching kernel\n";
  90.  
  91. // Kernel launch
  92. checkCudaErrors(cuLaunchKernel(function, gridSizeX, gridSizeY, gridSizeZ,
  93. blockSizeX, blockSizeY, blockSizeZ,
  94. 0, NULL, KernelParams, NULL));
  95.  
  96. // Retrieve device data
  97. checkCudaErrors(cuMemcpyDtoH(&hostC[0], devBufferC, sizeof(float)*16));
  98.  
  99.  
  100. std::cout << "Results:\n";
  101. for (unsigned i = 0; i != 16; ++i) {
  102. std::cout << hostA[i] << " + " << hostB[i] << " = " << hostC[i] << "\n";
  103. }
  104.  
  105.  
  106. // Clean up after ourselves
  107. delete [] hostA;
  108. delete [] hostB;
  109. delete [] hostC;
  110.  
  111. // Clean-up
  112. checkCudaErrors(cuMemFree(devBufferA));
  113. checkCudaErrors(cuMemFree(devBufferB));
  114. checkCudaErrors(cuMemFree(devBufferC));
  115. checkCudaErrors(cuModuleUnload(cudaModule));
  116. checkCudaErrors(cuCtxDestroy(context));
  117.  
  118. return 0;
  119. }

You will need to link with the CUDA driver and specify the path to cuda.h.

  1. # clang++ sample.cpp -o sample -O2 -g -I/usr/local/cuda-5.5/include -lcuda

We don’t need to specify a path to libcuda.so since this is installed in asystem location by the driver, not the CUDA toolkit.

If everything goes as planned, you should see the following output whenrunning the compiled program:

  1. Using CUDA Device [0]: GeForce GTX 680
  2. Device Compute Capability: 3.0
  3. Launching kernel
  4. Results:
  5. 0 + 0 = 0
  6. 1 + 2 = 3
  7. 2 + 4 = 6
  8. 3 + 6 = 9
  9. 4 + 8 = 12
  10. 5 + 10 = 15
  11. 6 + 12 = 18
  12. 7 + 14 = 21
  13. 8 + 16 = 24
  14. 9 + 18 = 27
  15. 10 + 20 = 30
  16. 11 + 22 = 33
  17. 12 + 24 = 36
  18. 13 + 26 = 39
  19. 14 + 28 = 42
  20. 15 + 30 = 45

Note

You will likely see a different device identifier based on your hardware

Tutorial: Linking with Libdevice

In this tutorial, we show a simple example of linking LLVM IR with thelibdevice library. We will use the same kernel as the previous tutorial,except that we will compute C = pow(A, B) instead of C = A + B.Libdevice provides an __nv_powf function that we will use.

  1. target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
  2. target triple = "nvptx64-nvidia-cuda"
  3.  
  4. ; Intrinsic to read X component of thread ID
  5. declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
  6. ; libdevice function
  7. declare float @__nv_powf(float, float)
  8.  
  9. define void @kernel(float addrspace(1)* %A,
  10. float addrspace(1)* %B,
  11. float addrspace(1)* %C) {
  12. entry:
  13. ; What is my ID?
  14. %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
  15.  
  16. ; Compute pointers into A, B, and C
  17. %ptrA = getelementptr float, float addrspace(1)* %A, i32 %id
  18. %ptrB = getelementptr float, float addrspace(1)* %B, i32 %id
  19. %ptrC = getelementptr float, float addrspace(1)* %C, i32 %id
  20.  
  21. ; Read A, B
  22. %valA = load float, float addrspace(1)* %ptrA, align 4
  23. %valB = load float, float addrspace(1)* %ptrB, align 4
  24.  
  25. ; Compute C = pow(A, B)
  26. %valC = call float @__nv_powf(float %valA, float %valB)
  27.  
  28. ; Store back to C
  29. store float %valC, float addrspace(1)* %ptrC, align 4
  30.  
  31. ret void
  32. }
  33.  
  34. !nvvm.annotations = !{!0}
  35. !0 = !{void (float addrspace(1)*,
  36. float addrspace(1)*,
  37. float addrspace(1)*)* @kernel, !"kernel", i32 1}

To compile this kernel, we perform the following steps:

  • Link with libdevice
  • Internalize all but the public kernel function
  • Run NVVMReflect and set __CUDA_FTZ to 0
  • Optimize the linked module
  • Codegen the moduleThese steps can be performed by the LLVM llvm-link, opt, and llctools. In a complete compiler, these steps can also be performed entirelyprogrammatically by setting up an appropriate pass configuration (seeLinking with Libdevice).
  1. # llvm-link t2.bc libdevice.compute_20.10.bc -o t2.linked.bc
  2. # opt -internalize -internalize-public-api-list=kernel -nvvm-reflect-list=__CUDA_FTZ=0 -nvvm-reflect -O3 t2.linked.bc -o t2.opt.bc
  3. # llc -mcpu=sm_20 t2.opt.bc -o t2.ptx

Note

The -nvvm-reflect-list=_CUDA_FTZ=0 is not strictly required, as anyundefined variables will default to zero. It is shown here for evaluationpurposes.

This gives us the following PTX (excerpt):

  1. //
  2. // Generated by LLVM NVPTX Back-End
  3. //
  4.  
  5. .version 3.1
  6. .target sm_20
  7. .address_size 64
  8.  
  9. // .globl kernel
  10. // @kernel
  11. .visible .entry kernel(
  12. .param .u64 kernel_param_0,
  13. .param .u64 kernel_param_1,
  14. .param .u64 kernel_param_2
  15. )
  16. {
  17. .reg .pred %p<30>;
  18. .reg .f32 %f<111>;
  19. .reg .s32 %r<21>;
  20. .reg .s64 %rl<8>;
  21.  
  22. // %bb.0: // %entry
  23. ld.param.u64 %rl2, [kernel_param_0];
  24. mov.u32 %r3, %tid.x;
  25. ld.param.u64 %rl3, [kernel_param_1];
  26. mul.wide.s32 %rl4, %r3, 4;
  27. add.s64 %rl5, %rl2, %rl4;
  28. ld.param.u64 %rl6, [kernel_param_2];
  29. add.s64 %rl7, %rl3, %rl4;
  30. add.s64 %rl1, %rl6, %rl4;
  31. ld.global.f32 %f1, [%rl5];
  32. ld.global.f32 %f2, [%rl7];
  33. setp.eq.f32 %p1, %f1, 0f3F800000;
  34. setp.eq.f32 %p2, %f2, 0f00000000;
  35. or.pred %p3, %p1, %p2;
  36. @%p3 bra BB0_1;
  37. bra.uni BB0_2;
  38. BB0_1:
  39. mov.f32 %f110, 0f3F800000;
  40. st.global.f32 [%rl1], %f110;
  41. ret;
  42. BB0_2: // %__nv_isnanf.exit.i
  43. abs.f32 %f4, %f1;
  44. setp.gtu.f32 %p4, %f4, 0f7F800000;
  45. @%p4 bra BB0_4;
  46. // %bb.3: // %__nv_isnanf.exit5.i
  47. abs.f32 %f5, %f2;
  48. setp.le.f32 %p5, %f5, 0f7F800000;
  49. @%p5 bra BB0_5;
  50. BB0_4: // %.critedge1.i
  51. add.f32 %f110, %f1, %f2;
  52. st.global.f32 [%rl1], %f110;
  53. ret;
  54. BB0_5: // %__nv_isinff.exit.i
  55.  
  56. ...
  57.  
  58. BB0_26: // %__nv_truncf.exit.i.i.i.i.i
  59. mul.f32 %f90, %f107, 0f3FB8AA3B;
  60. cvt.rzi.f32.f32 %f91, %f90;
  61. mov.f32 %f92, 0fBF317200;
  62. fma.rn.f32 %f93, %f91, %f92, %f107;
  63. mov.f32 %f94, 0fB5BFBE8E;
  64. fma.rn.f32 %f95, %f91, %f94, %f93;
  65. mul.f32 %f89, %f95, 0f3FB8AA3B;
  66. // inline asm
  67. ex2.approx.ftz.f32 %f88,%f89;
  68. // inline asm
  69. add.f32 %f96, %f91, 0f00000000;
  70. ex2.approx.f32 %f97, %f96;
  71. mul.f32 %f98, %f88, %f97;
  72. setp.lt.f32 %p15, %f107, 0fC2D20000;
  73. selp.f32 %f99, 0f00000000, %f98, %p15;
  74. setp.gt.f32 %p16, %f107, 0f42D20000;
  75. selp.f32 %f110, 0f7F800000, %f99, %p16;
  76. setp.eq.f32 %p17, %f110, 0f7F800000;
  77. @%p17 bra BB0_28;
  78. // %bb.27:
  79. fma.rn.f32 %f110, %f110, %f108, %f110;
  80. BB0_28: // %__internal_accurate_powf.exit.i
  81. setp.lt.f32 %p18, %f1, 0f00000000;
  82. setp.eq.f32 %p19, %f3, 0f3F800000;
  83. and.pred %p20, %p18, %p19;
  84. @!%p20 bra BB0_30;
  85. bra.uni BB0_29;
  86. BB0_29:
  87. mov.b32 %r9, %f110;
  88. xor.b32 %r10, %r9, -2147483648;
  89. mov.b32 %f110, %r10;
  90. BB0_30: // %__nv_powf.exit
  91. st.global.f32 [%rl1], %f110;
  92. ret;
  93. }