Practical C++17: Loop Unrolling with Lambdas and Fold Expressions
In this blog post, we’ll delve into the unroll<N>() template function for template unrolling, understand its mechanics, and see how it can improve your code. We’ll look at lambdas, fold expressions, and integer sequences.
Let’s get started!
A little background
In a recent article, Vector math library codegen in Debug · Aras’ website, Aras Pranckevičius discusses some coding techniques that help with the performance of debug code. There, I came across an intriguing technique used in the Blender Math library, which he shows in his text.
One interesting example was this one:
friend VecBase operator+(const VecBase &a, const VecBase &b)
{
    VecBase result;
    unroll<Size>([&](auto i) { result[i] = a[i] + b[i]; });
    return result;
}
And I’m curious how this unroll<Size>() function works under the hood.
Why unroll() Matters
Before we go into the intricacies of the unroll() function, it’s good to learn why such a technique is valuable. In performance-critical applications, such as graphics rendering, real-time simulations, or high-frequency trading, every millisecond counts. Traditional loops, while easy to write, can introduce runtime overhead (a counter update, a comparison, and a branch per iteration) that compile-time techniques like loop unrolling can reduce or eliminate.
In short, template unrolling automates the expansion of loops during compilation, replacing iterative constructs with repetitive code blocks.
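As a rough illustration of what "expansion" means here, the sketch below puts a small runtime loop next to the hand-expanded form that template unrolling aims to generate for you. The function names `sum_loop` and `sum_expanded` are made up for this example and do not come from Blender or the original article.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Runtime loop: a counter, a comparison, and a jump on each iteration
// (unless the optimizer removes them).
inline int sum_loop(const std::array<int, 4>& v) {
    int total = 0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        total += v[i];
    }
    return total;
}

// Hand-expanded equivalent: straight-line code with no loop control.
// This is the shape that unrolling produces automatically.
inline int sum_expanded(const std::array<int, 4>& v) {
    int total = 0;
    total += v[0];
    total += v[1];
    total += v[2];
    total += v[3];
    return total;
}
```

Both functions compute the same result; the difference is only in how the repetition is expressed.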
Introducing the unroll() Template Function
Let’s break down the unroll() function inspired by Blender’s C++ math library. This function leverages modern C++ features such as lambdas, variadic templates, and fold expressions to perform compile-time loop unrolling efficiently.
Here’s a simplified implementation of the unroll() function:
#include <cstddef>  // std::size_t
#include <utility>  // std::index_sequence, std::make_index_sequence

// Helper to implement unroll via parameter pack expansion
template<class Fn, std::size_t... I>
void unroll_impl(Fn fn, std::index_sequence<I...>) {
    (fn(I), ...); // Calls fn(0), fn(1), ..., fn(N-1)
}

// Primary unroll function
template<int N, class Fn>
void unroll(Fn fn) {
    unroll_impl(fn, std::make_index_sequence<N>());
}
Breaking It Down:
- unroll_impl:
  - fn: The lambda function to execute.
  - std::index_sequence<I...>: A compile-time sequence of indices.
  - Utilizes a fold expression (fn(I), ...) to call fn for each index in the sequence.
- unroll:
  - N: The number of times to unroll (i.e., the size).
  - fn: The lambda function to execute.
  - Generates an index_sequence from 0 to N-1 using std::make_index_sequence<N>() and passes it to unroll_impl.
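To make the two building blocks above more concrete, here is a small sketch that checks what std::make_index_sequence<4> actually is and uses the same comma-fold shape as (fn(I), ...). The helper `sum_indices` is a hypothetical name introduced only for this illustration; the code assumes a C++17 compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>

// std::make_index_sequence<4> is an alias for std::index_sequence<0, 1, 2, 3>.
static_assert(std::is_same<std::make_index_sequence<4>,
                           std::index_sequence<0, 1, 2, 3>>::value,
              "make_index_sequence<4> yields indices 0..3");

// Hypothetical helper: adds up the indices with a fold over the comma
// operator, the same shape as (fn(I), ...) in unroll_impl.
template <std::size_t... I>
constexpr std::size_t sum_indices(std::index_sequence<I...>) {
    std::size_t total = 0;
    ((total += I), ...);  // expands to (total += 0), (total += 1), ...
    return total;
}

// Evaluated entirely at compile time: 0 + 1 + 2 + 3.
static_assert(sum_indices(std::make_index_sequence<4>{}) == 6, "0+1+2+3");
```

Because the pack is expanded by the compiler, there is no loop variable at run time; each index appears as a literal constant in the expanded expression.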
This setup ensures that the lambda fn is invoked exactly N times, each with a unique index from 0 to N-1.
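One way to convince yourself of this is to record the indices the lambda receives. The sketch below restates unroll() so that it compiles on its own (with std::size_t for N, which is an assumption of this example, not the article's signature) and adds a hypothetical helper, `visited_indices`, that collects the calls in order.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

template <class Fn, std::size_t... I>
void unroll_impl(Fn fn, std::index_sequence<I...>) {
    (fn(I), ...);  // a comma fold evaluates its operands left to right
}

template <std::size_t N, class Fn>
void unroll(Fn fn) {
    unroll_impl(fn, std::make_index_sequence<N>());
}

// Hypothetical helper: records every index that unroll<N>() passes
// to the lambda, in call order.
template <std::size_t N>
std::vector<std::size_t> visited_indices() {
    std::vector<std::size_t> seen;
    unroll<N>([&](auto i) { seen.push_back(i); });
    return seen;
}
```

Calling `visited_indices<4>()` yields the indices 0, 1, 2, 3 in that order, and `visited_indices<0>()` yields an empty vector because the fold over an empty pack does nothing.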
You can learn more about iteration at compile time in my other article: C++ Templates: How to Iterate through std::tuple: the Basics - C++ Stories
Practical Example: Vector Addition
To illustrate the power of unroll() combined with lambdas, let’s implement a simple vector addition operation.
#include <array>
#include <cassert>
#include <iostream>

// Base vector structure with 4 components
template<typename T>
struct Vector4 {
    T x, y, z, w;

    // Element access using indices. A small table of member addresses
    // avoids indexing past x through a reinterpret_cast, which is
    // formally undefined behavior.
    T& operator[](int index) {
        assert(index >= 0 && index < 4);
        T* const members[] = {&x, &y, &z, &w};
        return *members[index];
    }

    const T& operator[](int index) const {
        assert(index >= 0 && index < 4);
        const T* const members[] = {&x, &y, &z, &w};
        return *members[index];
    }

    // Vector addition using unroll and lambda
    Vector4 operator+(const Vector4& other) const {
        Vector4 result;
        unroll<4>([&](auto i) {
            result[i] = (*this)[i] + other[i];
        });
        return result;
    }
};
- Vector4 Structure: Holds four components: x, y, z, and w.
- operator[]: Allows accessing components via indices 0 to 3.
- Addition Operator (operator+):
  - Creates a new Vector4 named result.
  - Calls unroll<4>() with a lambda that adds corresponding components:
    - result[0] = this->x + other.x
    - result[1] = this->y + other.y
    - result[2] = this->z + other.z
    - result[3] = this->w + other.w
  - Returns the result vector.
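The same idea generalizes to any fixed size, which is closer in spirit to the unroll<Size>() snippet quoted at the start. The sketch below is a hypothetical `VecN<T, Size>` that stores its components in a std::array; it is not Blender's actual VecBase, and unroll() is restated with std::size_t for N so the block compiles on its own.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

template <class Fn, std::size_t... I>
void unroll_impl(Fn fn, std::index_sequence<I...>) {
    (fn(I), ...);
}

template <std::size_t N, class Fn>
void unroll(Fn fn) {
    unroll_impl(fn, std::make_index_sequence<N>());
}

// Hypothetical generic vector (an assumption for this example).
template <typename T, std::size_t Size>
struct VecN {
    std::array<T, Size> data{};

    T& operator[](std::size_t index) {
        assert(index < Size);
        return data[index];
    }

    const T& operator[](std::size_t index) const {
        assert(index < Size);
        return data[index];
    }

    // Component-wise addition, expanded Size times at compile time.
    friend VecN operator+(const VecN& a, const VecN& b) {
        VecN result;
        unroll<Size>([&](auto i) { result[i] = a[i] + b[i]; });
        return result;
    }
};
```

With array storage, indexed access needs no casts, and the size becomes a template parameter instead of being hard-coded to 4.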
The Blender Math code is available here: @Github commit
Using the Vector Addition
template<typename T>
std::ostream& operator<<(std::ostream& os, const Vector4<T>& v) {
    unroll<4>([&](auto i) {
        os << v[i] << " ";
    });
    return os;
}

int main() {
    Vector4<float> vec1 = {1.0f, 2.0f, 3.0f, 4.0f};
    Vector4<float> vec2 = {5.0f, 6.0f, 7.0f, 8.0f};
    Vector4<float> sum = vec1 + vec2;
    std::cout << "Sum: " << sum;
}
Play with the code @Compiler Explorer
When vec1 + vec2 is executed:
- The lambda inside operator+ is called four times (for indices 0 to 3), performing component-wise addition.
- Thanks to unroll(), there’s no loop overhead: the fold expression is expanded into four separate calls at compile time.
- The result is a new Vector4 containing the sums of corresponding components.
This approach can enhance performance in optimized builds while keeping the code clean and easy to understand.
Other techniques
unroll() isn’t the only choice for loop unrolling; here are some others worth mentioning:
- Manual Loop Unrolling: This technique involves explicitly writing out each iteration of the loop in your code. It’s straightforward and gives you complete control over the unrolling process. However, it can become tedious and error-prone for larger loops, and it may reduce code readability and maintainability.
- Compiler Pragmas/Directives: Many compilers offer pragmas or directives that suggest or enforce loop unrolling. This method is easy to apply and allows the compiler to handle the complexity of unrolling. However, it is compiler-dependent, meaning not all compilers support the same pragmas, and the results may vary.
- SIMD (Single Instruction, Multiple Data) Instructions: SIMD instructions enable the execution of the same operation on multiple data points simultaneously, effectively unrolling loops at the hardware level. This can lead to substantial performance improvements by utilizing the parallel processing capabilities of modern CPUs. The downside is that it requires specific knowledge of hardware instructions, making the code less portable and more complex.
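For comparison, here is a minimal sketch of the first two alternatives from the list above. The function names `add4_manual` and `add4_pragma` are invented for this example. The pragmas shown are the Clang and GCC spellings; other compilers may ignore them or need a different directive, and in every case they are only a request to the optimizer.

```cpp
#include <cassert>

// Manual unrolling: every iteration is written out by hand.
inline void add4_manual(const float* a, const float* b, float* out) {
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
    out[2] = a[2] + b[2];
    out[3] = a[3] + b[3];
}

// Pragma-based unrolling: the loop stays in the source, and the
// compiler is asked (not forced) to unroll it.
inline void add4_pragma(const float* a, const float* b, float* out) {
#if defined(__clang__)
#pragma clang loop unroll(full)
#elif defined(__GNUC__)
#pragma GCC unroll 4
#endif
    for (int i = 0; i < 4; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

Both functions produce the same results; they differ in how much control you have and in how portable the source is.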
A recursive version, C++14
If you cannot use fold expressions, then here’s a recursive solution:
// Primary template for unrolling
template<int N>
struct Unroll {
    template<typename Fn>
    static inline void apply(Fn fn) {
        Unroll<N - 1>::apply(fn); // Recurse with N-1
        fn(N - 1);                // Process the current index
    }
};

// Specialization for the base case
template<>
struct Unroll<0> {
    template<typename Fn>
    static inline void apply(Fn) {
        // Base case: do nothing
    }
};
And here’s a working example @Compiler Explorer
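As a quick check of the call order, the sketch below restates the recursive Unroll so it compiles on its own and adds a hypothetical helper, `recursive_indices`, that records the indices. Because each level recurses first and only then calls fn(N - 1), the indices come out in ascending order.

```cpp
#include <cassert>
#include <vector>

template <int N>
struct Unroll {
    template <typename Fn>
    static void apply(Fn fn) {
        Unroll<N - 1>::apply(fn);  // handle indices 0..N-2 first
        fn(N - 1);                 // then the current index
    }
};

template <>
struct Unroll<0> {
    template <typename Fn>
    static void apply(Fn) {}  // base case: nothing to do
};

// Hypothetical helper: collects the indices passed to the lambda.
inline std::vector<int> recursive_indices() {
    std::vector<int> seen;
    Unroll<4>::apply([&](int i) { seen.push_back(i); });
    return seen;
}
```

The lambda is copied at every recursion level, which is cheap here because it only captures a reference.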
There’s also a good example, a dot product, in the great book on templates: C++ Templates: The Complete Guide (2nd Edition)
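The book uses its own template machinery for that example. As a rough counterpart built on this article's fold-expression unroll(), a dot product might be sketched as follows. The function name `dot` and the use of std::array are assumptions for this illustration, and unroll() is restated with std::size_t for N so the block compiles on its own.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

template <class Fn, std::size_t... I>
void unroll_impl(Fn fn, std::index_sequence<I...>) {
    (fn(I), ...);
}

template <std::size_t N, class Fn>
void unroll(Fn fn) {
    unroll_impl(fn, std::make_index_sequence<N>());
}

// Hypothetical dot product: accumulates a[i] * b[i] for i in 0..N-1,
// with the N multiply-adds expanded at compile time.
template <std::size_t N>
float dot(const std::array<float, N>& a, const std::array<float, N>& b) {
    float sum = 0.0f;
    unroll<N>([&](auto i) { sum += a[i] * b[i]; });
    return sum;
}
```

Here the lambda accumulates into a captured variable instead of writing to a result vector, which shows that unroll() is not limited to element-wise operations.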
Summary
In this text, we explored an interesting technique for unrolling repetitive and simple loop statements. Thanks to C++17 features like fold expressions, combined with templates and lambdas, the code is elegant and easy to understand.
Read more:
- Book: C++ Templates: The Complete Guide (2nd Edition)
- Vector math library codegen in Debug · Aras’ website
Back to you
- Have you implemented loop unrolling in your C++ projects?
- Do you prefer using lambdas over traditional functors for performance-critical code?