SIMD for faster computing

Minimum Rust version: 1.27

The basics of SIMD are now available! SIMD stands for “single instruction, multiple data.” Consider a function like this:

    pub fn foo(a: &[u8], b: &[u8], c: &mut [u8]) {
        for ((a, b), c) in a.iter().zip(b).zip(c) {
            *c = *a + *b;
        }
    }

Here, we’re taking two slices, and adding the numbers together, placing the result in a third slice. The simplest possible way to do this would be to do exactly what the code does, and loop through each set of elements, add them together, and store it in the result. However, compilers can often do better. LLVM will usually “autovectorize” code like this, which is a fancy term for “use SIMD.” Imagine that a and b were both 16 elements long. Each element is a u8, and so that means that each slice would be 128 bits of data. Using SIMD, we could put both a and b into 128-bit registers, add them together in a single instruction, and then copy the resulting 128 bits into c. That’d be much faster!
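
To make that concrete, here’s a rough sketch of what that 16-element addition could look like written directly against std::arch. The function name add_16_bytes is ours, and we use SSE2 intrinsics, which every x86_64 CPU supports, so no feature detection is needed:

    #[cfg(target_arch = "x86_64")]
    pub fn add_16_bytes(a: &[u8; 16], b: &[u8; 16], c: &mut [u8; 16]) {
        use std::arch::x86_64::{__m128i, _mm_add_epi8, _mm_loadu_si128, _mm_storeu_si128};

        unsafe {
            // Load all 16 bytes of a and b into two 128-bit registers.
            let va = _mm_loadu_si128(a.as_ptr() as *const __m128i);
            let vb = _mm_loadu_si128(b.as_ptr() as *const __m128i);
            // Add all 16 pairs of bytes in a single instruction.
            // (Unlike the scalar loop above, this addition wraps on overflow.)
            let vc = _mm_add_epi8(va, vb);
            // Copy the resulting 128 bits into c.
            _mm_storeu_si128(c.as_mut_ptr() as *mut __m128i, vc);
        }
    }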

While stable Rust has always been able to take advantage of autovectorization, sometimes, the compiler just isn’t smart enough to realize that we can do something like this. Additionally, not every CPU has these features, and so LLVM may not use them, so that your program can run on a wide variety of hardware. The std::arch module allows us to use these kinds of instructions directly, which means we don’t need to rely on a smart compiler. Additionally, it includes some features that allow us to choose a particular implementation based on various criteria. For example:

    #[cfg(all(any(target_arch = "x86", target_arch = "x86_64"),
              target_feature = "avx2"))]
    fn foo() {
        #[cfg(target_arch = "x86")]
        use std::arch::x86::_mm256_add_epi64;
        #[cfg(target_arch = "x86_64")]
        use std::arch::x86_64::_mm256_add_epi64;

        unsafe {
            _mm256_add_epi64(...);
        }
    }

Here, we use cfg flags to choose the correct version based on the machine we’re targeting; on x86 we use that version, and on x86_64 we use its version. We can also choose at runtime:

    fn foo() {
        #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
        {
            if is_x86_feature_detected!("avx2") {
                return unsafe { foo_avx2() };
            }
        }

        foo_fallback();
    }

Here, we have two versions of the function: one which uses AVX2, a specific kind of SIMD feature that lets you do 256-bit operations. The is_x86_feature_detected! macro will generate code that detects if your CPU supports AVX2, and if so, calls the foo_avx2 function. If not, then we fall back to a non-AVX implementation, foo_fallback. This means that our code will run super fast on CPUs that support AVX2, but still work on ones that don’t, albeit slower.
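
The example above doesn’t show what foo_avx2 and foo_fallback look like; here’s a hypothetical sketch of how they might be declared. The #[target_feature] attribute tells the compiler it may use AVX2 instructions in that one function, which is why the function must be unsafe, and why the caller checks for support first:

    fn foo_fallback() {
        // A plain Rust implementation; LLVM may still autovectorize it.
    }

    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    #[target_feature(enable = "avx2")]
    unsafe fn foo_avx2() {
        // An implementation using AVX2 intrinsics goes here. Calling this
        // is only safe once we've checked that the CPU supports AVX2.
    }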

If all of this seems a bit low-level and fiddly, well, it is! std::arch is specifically primitives for building these kinds of things. We hope to eventually stabilize a std::simd module with higher-level stuff in the future. But landing the basics now lets the ecosystem experiment with higher-level libraries starting today. For example, check out the faster crate. Here’s a code snippet with no SIMD:

    let lots_of_3s = (&[-123.456f32; 128][..]).iter()
        .map(|v| {
            9.0 * v.abs().sqrt().sqrt().recip().ceil().sqrt() - 4.0 - 2.0
        })
        .collect::<Vec<f32>>();

To use SIMD with this code via faster, you’d change it to this:

    let lots_of_3s = (&[-123.456f32; 128][..]).simd_iter()
        .simd_map(f32s(0.0), |v| {
            f32s(9.0) * v.abs().sqrt().rsqrt().ceil().sqrt() - f32s(4.0) - f32s(2.0)
        })
        .scalar_collect();

It looks almost the same: simd_iter instead of iter, simd_map instead of map, and f32s(2.0) instead of 2.0. But you get a SIMD-ified version generated for you.

Beyond that, you may never write any of this yourself, but as always, the libraries you depend on may. For example, the regex crate contains these SIMD speedups without you needing to do anything at all!