Accelerating Application Performance
Run-time performance is critical for many technical applications. Objexx brings a variety of capabilities to improving the performance and scalability of applications and has produced large and often dramatic speed gains.
Profiling applications is an important first step in understanding the location and nature of performance bottlenecks. Objexx uses a range of profiling tools such as Intel VTune, Zoom, prof, oprofile, and gprof, often using more than one to exploit the strengths of each tool and to evaluate performance on multiple platforms.
Performance tuning involves localized code changes without significantly altering algorithms, such as:
- Avoiding unnecessary heap allocations in loops or on every function call.
- Hoisting expressions out of loops.
- Using local accumulators.
- Using less expensive but equivalent computations.
- Altering data structures or access sequences for better cache efficiency.
- Precomputing sub-collections of interest to avoid extra loop passes.
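As an illustration of several of these tuning steps, the following minimal C++ sketch contrasts a naive loop with a tuned one. The function names and the computation itself are hypothetical, chosen only to show hoisting, a local accumulator, and removal of a per-pass heap allocation.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Naive version (hypothetical example): copies each row into a temporary
// vector (a heap allocation per outer pass), recomputes the loop-invariant
// sqrt(scale) on every inner pass, and accumulates through a reference
// parameter that the compiler cannot keep in a register as freely.
void sum_scaled_squares_naive(const std::vector<std::vector<double>>& rows,
                              double scale, double& total)
{
    total = 0.0;
    for (std::size_t i = 0; i < rows.size(); ++i) {
        std::vector<double> tmp(rows[i]);  // unnecessary copy and allocation
        for (std::size_t j = 0; j < tmp.size(); ++j) {
            total += std::sqrt(scale) * tmp[j] * tmp[j];
        }
    }
}

// Tuned version: the invariant factor is hoisted and applied once, the sum
// lives in a local accumulator, and rows are read in place without copying.
double sum_scaled_squares_tuned(const std::vector<std::vector<double>>& rows,
                                double scale)
{
    const double factor = std::sqrt(scale);  // hoisted out of both loops
    double sum = 0.0;                        // local accumulator
    for (const auto& row : rows) {
        for (double x : row) sum += x * x;
    }
    return factor * sum;
}
```

Applying the factor once is mathematically equivalent to the naive form but can change floating point rounding slightly, so results should be compared with a tolerance in real code.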
It is often possible to find algorithms with much higher performance for a hot spot operation. Sometimes this is a refinement of the existing algorithm, possibly exploiting some precomputed data. In other cases a very different algorithm with better performance and/or complexity can be developed. The best solution may involve moving to a parallel algorithm. Often the improved algorithm is tied to a new data structure.
Data structures can be critical to performance. A better data structure can significantly reduce the operation cost or amortized cost. Sometimes the computational complexity can be reduced by moving to a different data structure, yielding large scalability gains. Data structure optimization can include:
- Migrating to a data structure with better performance/complexity for the usage pattern. For example, a C++ std::vector can be less efficient than std::set for use as an ordered collection where insertion and deletion costs dominate.
- Exploiting the complexity benefits of specialized data structures, such as the use of octrees or k-d trees for spatial queries.
- Using a data structure more appropriate to the access patterns and any application multithreading or multiprocessing.
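The std::vector versus std::set trade-off mentioned above can be sketched as follows. The helper names insert_sorted and erase_sorted are hypothetical; both containers hold unique values in sorted order, but each insertion or deletion in the vector shifts O(N) elements while the set does O(log N) work per operation.

```cpp
#include <algorithm>
#include <set>
#include <vector>

// Sorted std::vector: O(log N) search, then an O(N) shift of the tail.
bool insert_sorted(std::vector<int>& v, int value)
{
    auto it = std::lower_bound(v.begin(), v.end(), value);
    if (it != v.end() && *it == value) return false;  // already present
    v.insert(it, value);
    return true;
}

bool erase_sorted(std::vector<int>& v, int value)
{
    auto it = std::lower_bound(v.begin(), v.end(), value);
    if (it == v.end() || *it != value) return false;  // not found
    v.erase(it);
    return true;
}

// std::set: O(log N) insert and erase, at the cost of one node per element.
bool insert_sorted(std::set<int>& s, int value)
{
    return s.insert(value).second;
}

bool erase_sorted(std::set<int>& s, int value)
{
    return s.erase(value) > 0;
}
```

Which container wins depends on the collection size and the mix of operations: the contiguous vector is often faster for small collections or traversal-heavy use, so the choice is best confirmed by profiling.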
Getting the highest throughput on modern hardware may require exploiting CPU SIMD vectorization, multithreading, multiprocessing, or GPU/coprocessor execution units.
Recent x86 CPU SSE and AVX SIMD capabilities allow amenable loops to be vectorized, typically providing a 4X or 8X performance boost. It can be challenging to restructure loops to get auto-vectorization to occur on multiple compilers because the capabilities of auto-vectorizers vary. Sometimes alignment and dependency directives can help vectorization. Assuring aligned data and unit stride memory access will facilitate good vectorization. In more challenging situations explicit vectorization intrinsic functions are needed to obtain optimal vectorization. Keeping this all cross-platform friendly and automatically adaptable to different CPUs is challenging but important. Specialized builds for different CPU SIMD generations can allow optimal performance across a range of end-user hardware.
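A minimal sketch of a vectorization-friendly loop with dependency hints is shown below. The function name axpy is illustrative. Two assumptions are worth stating: __restrict is a non-standard extension (accepted by GCC, Clang, and MSVC), and the OpenMP simd pragma only takes effect when OpenMP SIMD support is enabled (for example with -fopenmp or -fopenmp-simd in GCC); otherwise it is ignored and the loop is left to the auto-vectorizer.

```cpp
#include <cstddef>

// y[i] += a * x[i] over unit-stride arrays.
// __restrict promises the compiler that x and y do not overlap, removing
// the aliasing dependency that can otherwise block vectorization.
void axpy(std::size_t n, double a,
          const double* __restrict x, double* __restrict y)
{
    #pragma omp simd  // hint: loop iterations are independent
    for (std::size_t i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}
```

Whether a given compiler actually vectorizes the loop is best checked with its vectorization report rather than assumed.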
Code refactoring may be part of enabling vectorization. Moving from an array of structures to a structure of arrays design is sometimes needed to get unit stride memory access. In other cases it may suffice to copy data into arrays before a performance-critical loop. Transposing 2D arrays may be necessary to get unit striding. Padding 2D array rows can be used to get row-aligned memory. Splitting complex loops into multiple, vectorization-friendly loops may be beneficial. Refactoring code to eliminate function calls or conditional expressions from loops can also bring vectorization speed gains.
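The array of structures to structure of arrays change can be sketched as below. The particle types and function names are hypothetical. In the array of structures layout successive x values are three doubles apart, while in the structure of arrays layout they are contiguous, giving the unit stride access that vectorizers prefer. The to_soa function also illustrates the lighter-weight option of copying data into arrays before a performance-critical loop.

```cpp
#include <cstddef>
#include <vector>

// Array of structures: x, y, z are interleaved in memory.
struct ParticleAoS { double x, y, z; };

// Structure of arrays: each coordinate is stored contiguously.
struct ParticlesSoA {
    std::vector<double> x, y, z;
};

// Copy interleaved data into separate coordinate arrays.
ParticlesSoA to_soa(const std::vector<ParticleAoS>& aos)
{
    ParticlesSoA soa;
    soa.x.reserve(aos.size());
    soa.y.reserve(aos.size());
    soa.z.reserve(aos.size());
    for (const ParticleAoS& p : aos) {
        soa.x.push_back(p.x);
        soa.y.push_back(p.y);
        soa.z.push_back(p.z);
    }
    return soa;
}

// Stride-3 access: each pass skips over the y and z members.
void translate_x_aos(std::vector<ParticleAoS>& aos, double dx)
{
    for (std::size_t i = 0; i < aos.size(); ++i) aos[i].x += dx;
}

// Unit stride access over one contiguous array.
void translate_x_soa(ParticlesSoA& soa, double dx)
{
    double* x = soa.x.data();
    const std::size_t n = soa.x.size();
    for (std::size_t i = 0; i < n; ++i) x[i] += dx;
}
```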
Multithreading and multiprocessing can be used to exploit multicore and multiple CPU architectures to obtain large performance gains. These can introduce complexity and overhead into a code base so they must be used judiciously. Protection of shared resources can lead to stalls, and failing to protect resources leaves hard-to-isolate bugs. Parallelization often depends on non-standard language extensions and third-party libraries such as OpenMP and Cilk.
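As a small example of protecting a shared resource without locking, the sketch below uses an OpenMP reduction so that each thread accumulates into its own private copy of the sum, with the copies combined at the end of the loop. The function name dot is illustrative, and the code assumes a compiler with OpenMP enabled (such as -fopenmp with GCC); without it the pragma is ignored and the loop runs serially with the same result.

```cpp
#include <cassert>
#include <vector>

// Dot product with a parallel reduction over the loop index.
double dot(const std::vector<double>& a, const std::vector<double>& b)
{
    assert(a.size() == b.size());
    const long n = static_cast<long>(a.size());  // signed index for older OpenMP
    const double* pa = a.data();
    const double* pb = b.data();
    double sum = 0.0;
    // Each thread gets a private sum; no lock or atomic is needed.
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < n; ++i) {
        sum += pa[i] * pb[i];
    }
    return sum;
}
```

Because the reduction changes the order of the additions, the parallel result can differ from the serial one in the last few bits, and for short loops the threading overhead can outweigh the gain.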
Modern graphics hardware is capable of highly parallel floating point vector operations that can be used as a compute engine for performance-critical loops in some applications. GPU programming APIs such as CUDA and OpenACC can be used to exploit this hardware. Frameworks such as OpenCL can simplify making the best use of available CPU and GPU hardware at run time. Specialized coprocessors such as the Intel Xeon Phi may be another target for hardware-based performance enhancement.
Graphical algorithms that run on a suitable GPU, when one is present, can be faster than CPU-based algorithms for tasks such as computing the solar radiation on a building model with shading caused by building surfaces and external objects. Such algorithms might be built on OpenGL®.
Objexx introduced an octree algorithm in the Rosetta biomolecular modeling system to reduce performance-critical neighbor identification from O(N^2) to O(N log N) complexity, resulting in dramatic scalability gains.
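To illustrate the kind of complexity reduction involved, the following is a generic point octree sketch with a radius query. It is not the Rosetta implementation: the class, its parameters (capacity, max_depth), and the brute-force comparison function are all hypothetical. A brute-force neighbor search tests every point for every query, while the octree prunes whole cubes that lie outside the query radius.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <memory>
#include <vector>

struct Point { double x, y, z; };

inline double dist2(const Point& a, const Point& b)
{
    const double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Brute force: every query visits all N points.
std::size_t count_within_brute_force(const std::vector<Point>& pts,
                                     const Point& q, double radius)
{
    std::size_t n = 0;
    for (const Point& p : pts) if (dist2(p, q) <= radius * radius) ++n;
    return n;
}

// Hypothetical point octree over the cube center +/- half on each axis.
class Octree {
public:
    Octree(Point center, double half, std::size_t capacity = 8, int max_depth = 16)
        : center_(center), half_(half), capacity_(capacity), max_depth_(max_depth) {}

    // Precondition: p lies inside the root cube; otherwise queries may miss it.
    void insert(const Point& p)
    {
        assert(std::abs(p.x - center_.x) <= half_ &&
               std::abs(p.y - center_.y) <= half_ &&
               std::abs(p.z - center_.z) <= half_);
        add(p);
    }

    // Append to out every stored point within radius of q.
    void query(const Point& q, double radius, std::vector<Point>& out) const
    {
        if (box_dist2(q) > radius * radius) return;  // prune: cube out of range
        for (const Point& p : points_)
            if (dist2(p, q) <= radius * radius) out.push_back(p);
        for (const auto& child : children_)
            if (child) child->query(q, radius, out);
    }

private:
    void add(const Point& p)
    {
        if (!split_) {
            if (points_.size() < capacity_ || max_depth_ == 0) {
                points_.push_back(p);
                return;
            }
            split_ = true;  // leaf is full: push its points down one level
            std::vector<Point> old;
            old.swap(points_);
            for (const Point& o : old) child_for(o).add(o);
        }
        child_for(p).add(p);
    }

    // Select (and lazily create) the octant child that contains p.
    Octree& child_for(const Point& p)
    {
        const int i = (p.x >= center_.x ? 1 : 0) | (p.y >= center_.y ? 2 : 0) |
                      (p.z >= center_.z ? 4 : 0);
        if (!children_[i]) {
            const double h = half_ / 2.0;
            const Point c{center_.x + ((i & 1) ? h : -h),
                          center_.y + ((i & 2) ? h : -h),
                          center_.z + ((i & 4) ? h : -h)};
            children_[i].reset(new Octree(c, h, capacity_, max_depth_ - 1));
        }
        return *children_[i];
    }

    // Squared distance from q to this node's cube (0 if q is inside).
    double box_dist2(const Point& q) const
    {
        const double dx = std::fmax(std::abs(q.x - center_.x) - half_, 0.0);
        const double dy = std::fmax(std::abs(q.y - center_.y) - half_, 0.0);
        const double dz = std::fmax(std::abs(q.z - center_.z) - half_, 0.0);
        return dx * dx + dy * dy + dz * dz;
    }

    Point center_;
    double half_;
    std::size_t capacity_;
    int max_depth_;
    bool split_ = false;
    std::vector<Point> points_;
    std::array<std::unique_ptr<Octree>, 8> children_;
};
```

For reasonably well-distributed points and a query radius that captures a bounded number of neighbors, building the tree and querying around every point is expected to scale roughly as O(N log N), compared with O(N^2) for the brute-force approach.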
Objexx carried out performance tuning and refactoring for EnergyPlus after its conversion from Fortran to C++, obtaining performance that is up to 5X faster than the Fortran version. An octree system built for EnergyPlus provides major complexity/scalability gains for the spatial queries in daylighting and solar computations.