Getting high performance out of C++ applications can be essential, particularly for scientific and engineering code. Happily, we have a lot of tools and techniques that can be brought to bear on this task. Less happily, this can be a substantial effort, depending on how much speed you want to wring out of your code. While we can't cover the full breadth of optimizing performance here, we take a look at some powerful and interesting approaches below.
Profile the Suspects
You can't make good use of your performance refactoring time if you don't know where the performance "hot spots" in the code are. You may have a good idea of where the bottlenecks are, but intuition about performance is often wrong. Our favorite profilers at the moment are Intel's VTune and, on Linux, perf.
Profilers have limitations to be aware of. The stalwart gprof can be useful, but it omits system call time, so you don't see important costs such as heap allocation time. Sampling rates and methods can also introduce inaccuracy. Performance can vary considerably across platforms and compilers, so profile on all your target environments.
Hoist Away, Mateys!
Loops are where most hot spots live in technical applications. Hoisting computations out of inner loops is an old standby method for improving code performance that is still useful. Compilers are getting better at doing this but they don't get them all (yet). Even small bits of code like std::sin(x) can be expensive performance "bugs" if called repeatedly with the same argument. A corollary to this is "do it once": don't compute the same thing multiple times, even if it isn't in a loop.
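For example, in a sketch like the following (illustrative only, not from any particular codebase), hoisting the loop-invariant std::sin call saves a transcendental evaluation on every iteration:

#include <cmath>
#include <cstddef>
#include <vector>

void scale_by_sin( std::vector< double > & v, double const x )
{
	// Before: v[i] *= std::sin(x); inside the loop recomputes the same value on every pass
	double const sin_x( std::sin( x ) ); // Hoisted: computed once
	for ( std::size_t i = 0; i < v.size(); ++i ) {
		v[ i ] *= sin_x;
	}
}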
Transcendental Meditation Yes, Computation No
Transcendental and algebraic functions are surprisingly expensive and should be avoided when possible. This might mean hoisting them out of loops where possible or eliminating them altogether. For example, std::pow(x,2) can be replaced by x*x or a little inline square(x) function. Disappointingly, not all compilers do this optimization for you, so put a bunch of these little functions into your performance toolkit. Looking at the formulas that the C++ code implements, you can often find ways to reduce these calls by restructuring the operations or recognizing simpler forms within them.
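A minimal sketch of the kind of helper we mean (the square name is just illustrative):

template< typename T >
inline T square( T const x )
{
	return x * x; // Cheaper than std::pow( x, 2 ) when the compiler doesn't do this for you
}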
A Heap of Trouble
Heap allocation is becoming a relatively bigger bottleneck with modern processors. Avoiding unnecessary heap use is a great way to reap large performance gains. A lot of these heap allocations are sort of hidden in innocuous C++ code like:
void foo( Heapy h ); // Passes a heap-allocating object by value
Function-local (non-static) objects can also be a source of non-obvious heap hits. Making some of these static may be a solution in some cases. Declaring them only within code blocks that need them can reduce the heap hit if they aren't needed on all code paths. It may also be possible to eliminate these heap-allocating objects altogether by rethinking the algorithm, perhaps by doing some operations in place. Heap-allocating classes that are created and destroyed many times can benefit from a pool allocator and/or a small object optimization strategy to reduce heap use. Also, look for unnecessary copies of these expensive types, both within algorithms and in pass-by-value calls outside of a move context.
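Two of these ideas in sketch form, reusing the hypothetical Heapy type from the declaration above (the smooth function is illustrative only):

#include <vector>

class Heapy; // The heap-allocating type from the example above

void foo( Heapy const & h ); // Pass by const reference: no copy, no hidden allocation

void smooth( std::vector< double > const & in, std::vector< double > & out )
{
	static std::vector< double > work; // static: the buffer's allocation is paid once, not on every call
	work.assign( in.begin(), in.end() );
	// ... smooth the values in work ...
	out = work;
}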
Move It!
C++11 added a lot of great features for code clarity and robustness, but the addition of move semantics is a big win for performance. By adding move support to classes that own expensive resources you can see performance gains, such as when passing temporaries, without any other code changes. Of course, you can go further, such as explicitly moving discardable non-temporaries. So, yes, move it!
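A minimal sketch of what adding move support can look like (the Field class is hypothetical):

#include <cstddef>
#include <utility>
#include <vector>

class Field
{
public:
	explicit Field( std::size_t const n ) : data_( n ) {}
	Field( Field const & ) = default;              // Copies still allowed when needed
	Field( Field && ) noexcept = default;          // Move construction steals the buffer
	Field & operator =( Field const & ) = default;
	Field & operator =( Field && ) noexcept = default;
private:
	std::vector< double > data_;
};

Field make_field() { return Field( 1000000 ); } // Returned temporary is moved, not deep copied

void use()
{
	Field f( make_field() ); // No copy of the million doubles
	Field g( 10 );
	g = std::move( f ); // Explicit move of a discardable non-temporary
}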
Memoize That Function
Hot spot functions with expensive algorithms may be called with the same arguments often enough that a memoization strategy can pay off. Memoization means building a lookup table of results for given sets of inputs that can be used for subsequent calls. (Any global data used by the function must be treated as inputs.) With a sufficiently fast lookup, maybe a hash table, this can be a big win. A limited size lookup using a smart pruning strategy is usually needed to avoid excessive memory use. Functions should usually be "pure" (without side-effects) for memoization to be safe.
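A bare-bones sketch of memoizing a pure function of one argument (the function, cache size cap, and exact-value key matching are all illustrative choices):

#include <cmath>
#include <unordered_map>

double expensive( double const x ) // Stand-in for a costly pure function
{
	return std::exp( std::sin( x ) * std::cos( x ) );
}

double expensive_memo( double const x )
{
	static std::unordered_map< double, double > cache; // Lookup table of prior results
	auto const it( cache.find( x ) );
	if ( it != cache.end() ) return it->second; // Cache hit: skip the expensive work
	double const y( expensive( x ) );
	if ( cache.size() < 100000u ) cache[ x ] = y; // Crude size cap in lieu of a smarter pruning strategy
	return y;
}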
Inlining
It isn't exciting, but getting the right inlining granularity can yield big performance gains. Getting small, high-call-count functions set up for inlining is the first step. Breaking larger high-call-count functions into a frequently used inline part and a seldom used out-of-line part can be effective. We've found that compilers are not usually aggressive enough in their inlining size growth criteria for technical applications, so experimenting with compiler inlining controls is worth a try. The "always inline" directives that some compilers offer can be another way to get critical functions inlined: writing a portable macro to wrap these directives is recommended. Conversely, the "never inline" directives can be useful on non-critical functions to fine tune the code growth vs. inlining tradeoffs. On a related note, we've recently found that the compiler option that allows inlining of functions not (implicitly or explicitly) declared inline can greatly slow builds with no benefit for code that has already been properly inline tuned.
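A portable wrapper along these lines might look like this sketch (the FORCE_INLINE name is ours; the directives shown are the usual MSVC and GCC/Clang ones):

#if defined(_MSC_VER)
#define FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#define FORCE_INLINE inline __attribute__((always_inline))
#else
#define FORCE_INLINE inline
#endif

FORCE_INLINE double half( double const x ) { return 0.5 * x; } // Small, critical function we want inlined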
Be Scalable, Not Complex
The relationship between the size of the problem or model being analyzed and the run time is called computational, or "big-O", complexity. Some algorithms have nice complexity such that run time rises only linearly with model size, O(N), or slightly super-linearly, such as O(N log N), in big-O notation. Others are much less scalable, rising with some higher power of model size or even exponentially. Often teams will benchmark an application with small, easy to construct test cases, missing the devastating performance loss in large cases when unforeseen complexity dominates. Finding hidden scalability problems in a code requires profiling large models.
Once the culprits have been identified it is time to find solutions. Changing data structures is sometimes the answer. Different data structures, both those in the C++ Standard Library and the ones we write ourselves, have different complexity properties. While a std::vector is fast for lookups, it is a bad choice for heavy insertion/deletion use such as a dynamic sorted collection. Often just changing to a different common C++ container will do the trick. Sometimes something more exotic is needed, like an octree. We've had a few projects where quadratic, O(N²), complexity could be reduced to O(N log N) by using an octree to represent a collection of spatial objects in 3D for spatial queries, such as which objects are within some distance of each other or which might be hit by a ray of light. Sometimes it isn't the type of data structure but the way it is used that creates a scalability problem. For instance, looping over a large collection to operate on only a sub-linear sized subset is a common source of "hidden" complexity: creating the subcollection of interest outside of the hot spot will do the trick. Changing to a lower complexity algorithm may be possible. In the worst case, where no sufficiently low-complexity solution exists, resorting to a parallel design may be necessary.
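A sketch of the sub-collection idea (the Node type and relax function are illustrative): building the small active subset once moves the full O(N) scan out of the hot loop.

#include <vector>

struct Node { bool active; double value; };

void relax( std::vector< Node > & nodes, int const iterations )
{
	std::vector< Node * > active; // Built once with a single O(N) pass
	for ( Node & node : nodes ) if ( node.active ) active.push_back( &node );
	for ( int it = 0; it < iterations; ++it ) { // Hot loop now touches only the small subset
		for ( Node * node : active ) node->value *= 0.5;
	}
}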
Loopy Containers
Scientific and engineering codes use a lot of small, fixed size vector and matrix objects, such as spatial coordinates and rotation matrices or quaternions. Using dynamic-sized array types for these can be a serious performance hit, both due to the heap allocation when constructed and the use of loops for operations. Developing heap-free and loop-free little class templates for these types can be a little work but can yield a nice speed boost. We use the ObjexxFCL Vector2, Vector3, and Vector4 types in reengineering projects with great results.
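A stripped-down sketch of such a type (not the actual ObjexxFCL implementation):

template< typename T >
class Vec3
{
public:
	Vec3( T const x, T const y, T const z ) : x_( x ), y_( y ), z_( z ) {}
	Vec3 operator +( Vec3 const & v ) const { return Vec3( x_ + v.x_, y_ + v.y_, z_ + v.z_ ); } // No loop
	T dot( Vec3 const & v ) const { return x_ * v.x_ + y_ * v.y_ + z_ * v.z_; }
private:
	T x_, y_, z_; // Members live in the object itself: no heap allocation
};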
Array Lookups
C++ doesn't provide multidimensional arrays, so scientific and engineering applications either write their own or use one of the available array libraries such as Eigen, Armadillo, uBLAS, or Boost.MultiArray. High quality C++ array libraries generally provide excellent performance, yet simple element lookup speed can still lag that of Fortran. One reason is that Fortran arrays are built into the language, so its compilers can figure out array index strides in loops and avoid computing the memory offset from the indexes for each element. We can't change the C++ compiler, so the next best solution is to offer a linear, or flat, 1D indexing operator for multidimensional arrays that can be used to improve the performance of hot spot loops. The ObjexxFCL Arrays provide this with operator[] and we use it to tune loop performance all the way back to Fortran speed parity. The Armadillo library also provides a flat accessor operator.
For an n x n matrix-vector multiplication the traditional but slower approach looks like:
for ( int i = 0; i < n; ++i ) {
	double b_i = 0.0;
	for ( int j = 0; j < n; ++j ) {
		b_i += A(i,j) * x(j); // Array must compute offset = i*n+j
	}
	b(i) = b_i;
}
but with linear indexing we can write it like this and gain some speed:
for ( int i = 0, l = 0; i < n; ++i ) {
	double b_i = 0.0;
	for ( int j = 0; j < n; ++j, ++l ) {
		b_i += A[l] * x(j); // Offset l only requires an increment
	}
	b(i) = b_i;
}
For 2D arrays we save an integer multiply but for 3+D arrays the offset computation involves a series of multiply+add steps, so the performance benefit of linear indexing rises with array rank.
Vectorization
This is a big topic but suffice it to say that making your hot spot loops vectorization friendly can yield big speedups. Aim first for loops clean enough to auto-vectorize. Ideally, that means unit strides through array memory and no branching or function calls in those loops. You need "fast math" options like GCC's -Ofast to get floating point loops to auto-vectorize: be aware that floating point error handling can be affected. Intel's Vectorization Advisor and compiler vectorization reports can help you see whether vectorization is happening. Memory alignment matters less on modern CPUs, but if you can use an aligned array container then adding alignment directives to the code can still help by eliminating peel loops. The ObjexxFCL Arrays can be built with alignment but std::vector does not guarantee aligned data. If your critical loops won't auto-vectorize then explicit SIMD intrinsic calls may be worthwhile, though these are harder to get right and to maintain. Also, build with compile options for your target hardware so that the actual SIMD hardware can be fully exploited: AVX SIMD is twice as wide as SSE so it can double your throughput.
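For reference, a loop in auto-vectorization-friendly form can be as simple as this sketch: unit stride, no branches, no function calls.

#include <cstddef>

void saxpy( std::size_t const n, float const a, float const * const x, float * const y )
{
	for ( std::size_t i = 0; i < n; ++i ) {
		y[ i ] += a * x[ i ]; // Contiguous, unit-stride access that compilers can vectorize
	}
}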
Cache is King
Cache efficiency is a vital part of performance on modern CPUs, and cache misses can hobble an application. In technical applications cache misses often arise when a multidimensional array is accessed in the "wrong" direction, causing large jumps through memory instead of unit striding. This is such a large performance hit that it can pay to create transposed copies or algorithm variants for large arrays. Some ObjexxFCL Array functions have cache-friendly logic for sufficiently large arrays.
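As an illustrative sketch, for a row-major 2D array stored in a flat buffer, keeping the last index in the innermost loop gives unit-stride access:

#include <cstddef>
#include <vector>

double sum( std::vector< double > const & A, std::size_t const m, std::size_t const n )
{
	double s( 0.0 );
	for ( std::size_t i = 0; i < m; ++i ) { // Row index outermost
		for ( std::size_t j = 0; j < n; ++j ) { // Column index innermost: unit stride
			s += A[ i * n + j ];
		}
	}
	return s; // Swapping the loops would jump n elements per access and thrash the cache
}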
Another place cache inefficiency can rear its head is the use of nested objects in computational hot spots. Object-oriented designs tend to lead to "array of structs" (AoS) bookkeeping data structures that are great for code modularity but create cache inefficiencies when used computationally. For example, consider computing the net torque of a system of Mass objects: we just want a dot product of the mass weight and radius vectors, but those values are sprinkled through these large Mass objects instead of sitting in contiguous vectors that can be processed efficiently with few cache misses (and vectorized). The common recommendation to change to a "struct of arrays" (SoA) layout may be an impractical refactoring task and/or a big sacrifice of the benefits of your OO design. A more practical approach can be to create arrays of these computational quantities outside of the performance-critical sections.
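A sketch of that approach for the torque example (the Mass struct shown here is illustrative):

#include <cstddef>
#include <vector>

struct Mass { double weight; double radius; /* ... many other members ... */ };

double net_torque( std::vector< Mass > const & masses )
{
	std::size_t const n( masses.size() );
	std::vector< double > w( n ), r( n );
	for ( std::size_t i = 0; i < n; ++i ) { // One pass over the big objects to gather the needed values
		w[ i ] = masses[ i ].weight;
		r[ i ] = masses[ i ].radius;
	}
	double t( 0.0 );
	for ( std::size_t i = 0; i < n; ++i ) t += w[ i ] * r[ i ]; // Cache-friendly, vectorizable dot product
	return t;
}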
Sometimes these remote object lookups are not for the computational quantities but rather appear in conditionals inside performance-critical loops. These conditionals not only prevent vectorization but are cache antagonistic, so reworking logic to avoid them when possible can give a big performance boost.
Parallelization
This is too big to delve into here but effective parallelization can be a significant undertaking so it may make sense to explore it after other techniques, particularly when refactoring existing code. But when crafting new algorithms it does make sense to consider whether parallel approaches would be beneficial.
Odds and Ends
A few other ideas from the C++ performance grab-bag for hot spots:
- Bit shifting instead of multiplication/division by powers of two (see the sketch after this list).
- Switch statements instead of long if/else blocks.
- Prefer printf family calls to C++ stream i/o.
- Prefer binary i/o to formatted i/o.
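For the bit shifting item, a small sketch (compilers usually apply this automatically when the operand is a compile-time constant, so it matters most where the compiler can't see that):

unsigned int times8( unsigned int const i ) { return i << 3; } // i * 8
unsigned int div4( unsigned int const i ) { return i >> 2; } // i / 4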
Contact Objexx for help boosting your C++ code performance.