
Objexx Labs
Discoveries, ideas, and musings from the Objexx development labs


OpenSceneGraph Windows Builds

OpenSceneGraph
Posted 2016/04/23

OpenSceneGraph provides the integrated real-time 3D visualization for ObjexxSISAME. OpenSceneGraph (OSG) packages are readily available for modern Linux platforms but Windows binaries for recent OSG and Visual C++ releases have been scarce. OpenSceneGraph is not simple to build due to the numerous 3rd party library dependencies, the many build options, and a lack of up-to-date build information.

Recently, we took the plunge and developed a recipe for building OSG and we are happy to report that we had success. The CMake support in OSG is a big help but some tweaking was needed to point it to the latest Windows SDK libraries. Getting the 3rd party libraries built required some sequencing and special handling. Visual C++ 2015 was used in this first phase but we will try Intel C++ builds and compare performance in the near future.

We are making the binaries available for download on the OpenSceneGraph page. We have both generic and AVX2-optimized builds. Now that we have a cookbook for these builds we plan to add binaries for new OSG versions as they are released.



Multidimensional C++ Arrays

Performance
Posted 2016/03/29

Since C++ doesn't provide multidimensional arrays in its standard library, technical applications typically use one of the better-known array libraries such as Eigen, Armadillo, uBLAS, or Boost.MultiArray. These each have somewhat different goals, capabilities, and performance. At Objexx we have a focus on array performance and we do a lot of conversions from Fortran to C++. You can criticize Fortran for its many limitations but its arrays have capabilities that are very nice for scientific and engineering application development.

From this perspective we have certain features that we would like to see these major array libraries adopt:

  • Flexible index ranges (like Fortran). Sometimes natural indexing for an algorithm doesn't start at zero. With C++11 arrays can provide a nice syntax for this, such as we use in the ObjexxFCL arrays:
      Array2D_int A( {-n,n}, 3 );
    
    This can be supported with no lookup performance penalty so it is nice to give projects the option of natural indexing.
  • Live slice (section) proxies. These can exploit the same C++11 syntax shown above for index ranges for elegant usage:
      A( {-i,i,2}, 2 )
    
    Slices can be non-contiguous so their use must be balanced against performance considerations but a proper "Fortranic" implementation can eliminate other kinds of overhead.
  • Linear indexing to eliminate the index offset overhead in lookups for 2+D arrays in loops. See Array Lookups for more on linear indexing.
  • Row and column major array layout support. Row major is more natural in the C/C++ world but column major is more familiar for projects migrating from Fortran. We provide both variants in the ObjexxFCL and we built a nifty Clang-based tool for converting a code base between them.


High-Performance C++

Performance
Posted 2016/03/28

Getting high performance out of C++ applications can be essential, particularly for scientific and engineering code. Happily, we have a lot of tools and techniques that can be brought to bear on this task. Less happily, this can be a substantial effort, depending on how much speed you want to wring out of your code. While we can't cover the full breadth of optimizing performance here, we take a look at some powerful and interesting approaches below.

Profile the Suspects

You can't make good use of your performance refactoring time if you don't know where the performance "hot spots" in the code are. You may have a good idea about where the bottlenecks are but intuition about performance is often wrong. Our favorite profilers at the moment are Intel's VTune and, on Linux, perf.

Profilers have limitations to be aware of. The stalwart gprof can be useful but it omits system call time so you don't see important things like heap allocation time. Sampling rates/methods may be inaccurate. Performance can also vary across platforms and compilers considerably, so profile on all your target environments.

Hoist Away, Mateys!

Loops are where most hot spots live in technical applications. Hoisting computations out of inner loops is an old standby method for improving code performance that is still useful. Compilers are getting better at doing this but they don't get them all (yet). Even small bits of code like std::sin(x) can be expensive performance "bugs" if called repeatedly with the same argument. A corollary to this is "do it once": don't compute the same thing multiple times, even if it isn't in a loop.
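
For illustration, here is a minimal sketch (hypothetical function and names) of hoisting an invariant std::sin call out of a loop:

#include <cmath>
#include <vector>

// std::sin( x ) is loop-invariant so compute it once outside the loop
void damp( std::vector< double > & v, double const x )
{
   double const s( std::sin( x ) ); // Hoisted: one call instead of v.size() calls
   for ( double & e : v ) {
      e *= s; // Was: e *= std::sin( x );
   }
}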

Transcendental Meditation Yes, Computation No

Transcendental and algebraic functions are surprisingly expensive and should be avoided when possible. This might mean hoisting them out of loops where possible or eliminating them altogether. For example, std::pow(x,2) can be replaced by x*x or a little inline square(x) function. Disappointingly, not all compilers do this optimization for you, so put a bunch of these little functions into your performance toolkit. Looking at the formulas the C++ code implements, you can often find ways to reduce these calls by restructuring the operations or recognizing simpler forms within them.
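
A minimal sketch of such toolkit functions, assuming nothing beyond standard C++:

// Cheap replacements for std::pow with small integer exponents
inline double square( double const x ) { return x * x; }
inline double cube( double const x ) { return x * x * x; }

With these, an expression like std::pow( r, 2 ) * std::pow( s, 3 ) becomes square( r ) * cube( s ), avoiding two expensive library calls.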

A Heap of Trouble

Heap allocation is becoming a relatively bigger bottleneck with modern processors. Avoiding unnecessary heap use is a great way to reap large performance gains. A lot of these heap allocations are sort of hidden in innocuous C++ code like:

void foo( Heapy h ); // Passes a heap-allocating object by value

Function-local (non-static) objects can also be a source of non-obvious heap hits. Making some of these static may be a solution in some cases. Declaring them only within code blocks that need them can reduce the heap hit if they aren't needed on all code paths. It may also be possible to eliminate these heap-allocating objects by rethinking the algorithm, maybe by doing some in-place operations. Heap-allocating classes that are being created and destroyed many times can benefit from a pool allocator and/or a small object optimization strategy to reduce the heap use. Also, look for unnecessary copies of these expensive types, both within algorithms and pass-by-value outside of a move context.
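
Mirroring the example above, passing by const reference avoids the copy, and reserving vector capacity avoids repeated reallocation (a sketch with a hypothetical Heapy type):

#include <vector>

struct Heapy { std::vector< double > data; }; // Hypothetical heap-owning type

void foo( Heapy const & h ); // By const reference: no copy, no heap hit

void bar()
{
   std::vector< double > work;
   work.reserve( 1024 ); // One allocation instead of repeated growth
   for ( int i = 0; i < 1024; ++i ) work.push_back( 0.5 * i );
}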

Move It!

C++11 adds a lot of great features for code clarity and robustness but the addition of move semantics is a clear win for performance. By adding move support to your classes that have expensive resources you can see performance gains without any other code changes, such as when passing temporaries. Of course, you can go further, such as explicit moves for discardable non-temporaries. So, yes, move it!
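
A minimal sketch of an explicit move of a discardable non-temporary:

#include <string>
#include <utility>
#include <vector>

void collect( std::vector< std::string > & log, std::string line )
{
   log.push_back( std::move( line ) ); // Steals line's buffer: no copy
}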

Memoize That Function

Hot spot functions with expensive algorithms may be called with the same arguments often enough that a memoization strategy can pay off. Memoization means building a lookup table of results for given sets of inputs that can be used for subsequent calls. (Any global data used by the function must be treated as inputs.) With a sufficiently fast lookup, maybe a hash table, this can be a big win. A limited size lookup using a smart pruning strategy is usually needed to avoid excessive memory use. Functions should usually be "pure" (without side-effects) for memoization to be safe.
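
A minimal single-threaded sketch, assuming a pure function and omitting the pruning strategy:

#include <cmath>
#include <unordered_map>

double expensive( double const x )
{
   static std::unordered_map< double, double > cache; // Unbounded: prune in real use
   auto const it( cache.find( x ) );
   if ( it != cache.end() ) return it->second; // Cache hit: skip the work
   double const r( std::exp( std::sin( x ) ) ); // Stand-in for the expensive computation
   cache.emplace( x, r );
   return r;
}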

Inlining

It isn't exciting but getting the right inlining granularity can yield big performance gains. Getting small, high-call count functions set up for inlining is the first step. Breaking larger high-call count functions into a frequently used inline part and a seldom used out-of-line part can be effective. We've found that compilers are not usually aggressive enough in their inlining size growth criteria for technical applications so experimenting with compiler inlining controls is worth a try. The "always inline" directives that some compilers offer can be another way to get critical functions inlined: writing a portable macro to wrap these directives is recommended. Conversely, the "never inline" directives can be useful on non-critical functions to fine tune the code growth vs. inlining tradeoffs. On a related note, recently we've found that the compiler option that allows inlining of functions not (implicitly or explicitly) inlined can greatly slow builds with no benefit for code that has been properly inline tuned already.
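
For example, a portable macro wrapping the MSVC and GCC/Clang "always inline" spellings might look like:

// Portable "always inline" wrapper macro
#if defined(_MSC_VER)
#define FORCE_INLINE __forceinline
#elif defined(__GNUC__)
#define FORCE_INLINE inline __attribute__((always_inline))
#else
#define FORCE_INLINE inline
#endif

FORCE_INLINE double half( double const x ) { return 0.5 * x; } // Hypothetical hot function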

Be Scalable, Not Complex

The relationship between the size of the problem or model being analyzed and the run time is called computational, or "big-O", complexity. Some algorithms have nice complexity such that run time rises only linearly with model size, O(N), or slightly super-linearly, such as O(N log N), in big-O notation. Others are much less scalable, rising with some higher power of model size or even exponentially. Often teams will benchmark an application with small, easy to construct test cases, missing the devastating performance loss in large cases when unforeseen complexity dominates. Finding hidden scalability problems in a code requires profiling large models.

Once the culprits have been identified it is time to find solutions. Changing data structures is sometimes the answer. Different data structures, both those in the C++ Standard Library and the ones we write ourselves, have different complexity properties. While a std::vector is fast for lookups, it is a bad choice for heavy insertion/deletion use such as for a dynamic sorted collection. Often just changing to a common C++ container will do the trick. Sometimes something more exotic is needed, like an octree. We've had a few projects where quadratic, O(N²), complexity could be reduced to O(N log N) by use of an octree to represent a collection of spatial objects in 3D for spatial queries, such as which objects are within some distance of each other or which might be hit by a ray of light. Sometimes it isn't the type of data structure but the way it is used that creates a scalability problem. For instance, looping over a large collection to operate on only a sub-linear sized subset is a common source of "hidden" complexity: creating the subcollection of interest outside of the hot spot will do the trick (see the sketch below). Changing to a lower complexity algorithm may be possible. In the worst case, where no sufficiently low-complexity solutions exist, resorting to a parallel design may be necessary.
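
A minimal sketch of the subcollection idea, with hypothetical Item and field names:

#include <vector>

struct Item { bool active; double value; };

// Build the small active subset once, outside the hot spot, so the
// repeatedly run hot loops scan only the subset, not the whole collection
std::vector< double > active_values( std::vector< Item > const & items )
{
   std::vector< double > sub;
   for ( Item const & item : items ) {
      if ( item.active ) sub.push_back( item.value );
   }
   return sub;
}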

Loopy Containers

Scientific and engineering codes use a lot of small, fixed size vector and matrix objects, such as spatial coordinates and rotation matrices or quaternions. Using dynamic-sized array types for these can be a serious performance hit, both due to the heap allocation when constructed and the use of loops for operations. Developing heap-free and loop-free little class templates for these types can be a little work but can yield a nice speed boost. We use the ObjexxFCL Vector2, Vector3, and Vector4 types in reengineering projects with great results.
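
A minimal flavor of such a type (a sketch, not the actual ObjexxFCL Vector3):

struct Vec3
{
   double x, y, z; // Stack storage: no heap allocation

   Vec3 operator +( Vec3 const & v ) const
   { return { x + v.x, y + v.y, z + v.z }; } // Unrolled: no loop

   double dot( Vec3 const & v ) const
   { return ( x * v.x ) + ( y * v.y ) + ( z * v.z ); }
};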

Array Lookups

C++ doesn't provide multidimensional arrays so scientific and engineering applications either write their own or use one of the available array libraries such as Eigen, Armadillo, uBLAS, or Boost.MultiArray. High quality C++ array libraries can provide generally excellent performance, yet simple element lookup speed can still lag that of Fortran. One reason is that Fortran arrays are built in to the language so its compilers can figure out array index strides in loops and avoid computing the memory offset from the indexes for each element. We can't change the C++ compiler so the next best solution is to offer a linear, or flat, 1D indexing operator for multidimensional arrays that can be used to improve performance of hot spot loops. The ObjexxFCL Arrays provide this with operator[] and we use it to tune loop performance all the way back to Fortran speed parity. The Armadillo library also provides a flat accessor operator.

For an n x n matrix-vector multiplication the traditional but slower approach looks like:

for ( int i = 0; i < n; ++i ) {
   double b_i = 0.0;
   for ( int j = 0; j < n; ++j ) {
      b_i += A(i,j) * x(j); // Array must compute offset = i*n+j
   }
   b(i) = b_i;
}

but with linear indexing we can write it like this and gain some speed:

for ( int i = 0, l = 0; i < n; ++i ) {
   double b_i = 0.0;
   for ( int j = 0; j < n; ++j, ++l ) {
      b_i += A[l] * x(j); // Offset l only requires an increment
   }
   b(i) = b_i;
}

For 2D arrays we save an integer multiply but for 3+D arrays the offset computation involves a series of multiply+add steps, so the performance benefit of linear indexing rises with array rank.

Vectorization

This is a big topic but suffice it to say that making your hot spot loops vectorization friendly can yield big speedups. Aim first for loops clean enough to auto-vectorize. Ideally, that means unit strides through array memory and no branching or function calls in those loops. You need "fast math" options like GCC's -Ofast to get floating point loops to auto-vectorize: be aware that floating point error handling can be affected. Intel's Vectorization Advisor and compiler vectorization reports can help you see if vectorization is happening. Worry less about memory alignment with modern CPUs, but if you can use an aligned array container then adding alignment directives to the code will be beneficial by eliminating peel loops. The ObjexxFCL Arrays can be built with alignment but std::vector does not guarantee aligned data. If your critical loops won't auto-vectorize then explicit SIMD intrinsic calls may be worthwhile: these are harder to get right and to maintain. Also, build with compile options for your target hardware so that the actual SIMD hardware can be fully exploited: AVX SIMD is twice as wide as SSE so it can double your throughput.
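
A minimal sketch of an auto-vectorization friendly loop: unit stride, no branches, no function calls (built with, say, GCC -O3):

void saxpy( int const n, float const a, float const * x, float * y )
{
   for ( int i = 0; i < n; ++i ) {
      y[ i ] += a * x[ i ]; // Unit-stride streaming access vectorizes well
   }
}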

Cache is King

Cache efficiency is a vital part of performance on modern CPUs. Cache misses can hobble an application. Cache misses are a problem in technical applications when a multidimensional array is accessed in the "wrong" direction, causing large jumps through memory instead of unit striding. This is such a large performance hit that it can pay to create transposed copies or algorithm variants for large arrays. Some ObjexxFCL Array functions have cache-friendly logic for sufficiently large arrays.

Another place cache inefficiency can rear its head is the use of nested objects in computational hot spots. Object-oriented designs tend to lead to "array of structs" (AoS) bookkeeping data structures that are great for code modularity but create cache inefficiencies when used computationally. For example, consider computing the net torque of a system of Mass objects: we just want a dot product of the mass weight and radius vectors but they are sprinkled in these large Mass objects instead of in contiguous vectors that can be processed efficiently with few cache misses (and vectorized). The common recommendation to change to a "struct of arrays" (SoA) layout may be an impractical refactoring task and/or a big sacrifice of the benefits of your OO design. A more practical approach can be to create arrays of these computational quantities outside of the performance-critical sections.
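
A sketch of this gathering approach with a hypothetical Mass type:

#include <cstddef>
#include <vector>

struct Mass { double weight; double radius; /* ...many other fields... */ };

// Gather the computational fields into contiguous arrays outside the hot
// spot so the dot product is cache-friendly and can vectorize
double net_torque( std::vector< Mass > const & masses )
{
   std::size_t const n( masses.size() );
   std::vector< double > w( n ), r( n );
   for ( std::size_t i = 0; i < n; ++i ) { // One pass through the big objects
      w[ i ] = masses[ i ].weight;
      r[ i ] = masses[ i ].radius;
   }
   double torque( 0.0 );
   for ( std::size_t i = 0; i < n; ++i ) torque += w[ i ] * r[ i ]; // Clean dot product
   return torque;
}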

Sometimes these remote object lookups are not for the computational quantities but rather appear in conditionals inside performance-critical loops. These conditionals not only prevent vectorization but are cache antagonistic, so reworking logic to avoid them when possible can give a big performance boost.

Parallelization

This is too big a topic to delve into here, but effective parallelization can be a significant undertaking, so it may make sense to explore it after other techniques, particularly when refactoring existing code. But when crafting new algorithms it does make sense to consider whether parallel approaches would be beneficial.

Odds and Ends

A few other ideas from the C++ performance grab-bag for hot spots, with micro-examples below:

  • Bit shifting instead of multiplication/division.
  • Switch statements instead of long if/else blocks.
  • Prefer printf family calls to C++ stream i/o.
  • Prefer binary i/o to formatted i/o.
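
Hedged micro-examples of the bit shifting and printf tips:

#include <cstdio>

unsigned half_of( unsigned const i ) { return i >> 1; } // i / 2 via shift (unsigned)

void report( int const n ) { std::printf( "n = %d\n", n ); } // printf over C++ streams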


Contact Objexx for help boosting your C++ code performance.



Fast Keys and Slow Maps

Performance
Posted 2013/02/03

The C++ Standard Library provides the std::map container for collections of items that need to be looked up by a "key" identifier. This is much more flexible than an array container like std::vector that can only look items up by a contiguous set of integer indexes. While std::map (and its new C++11 relative std::unordered_map) have some admirable complexity guarantees, they have much worse performance than an array. For fast technical applications we would love to have the semantics and syntax of std::map but the performance of an array. Happily, there are a number of situations where, with a little work, we can craft such a solution.

The C++ map containers are designed to provide good complexity guarantees for large containers with an arbitrary mix of insertion, lookup, and deletion operations. Achieving this asymptotic complexity comes at the cost of a large time and space overhead that is wasteful for collections that don't fall in this category.

Collections that are small or where lookups dominate insertion/deletion are common in scientific and engineering applications. An example that is prominent in ObjexxSISAME is vectors and matrices indexed by the axes for the model's degree of freedom set (ObjexxSISAME supports any meaningful 1, 2, or 3D model DOF set): these have only up to 6 keys and the key set is fixed once they are initialized. Or we may be modeling proteins where we need to look up the 20 canonical amino acids by a named key. Using a map for lookups on such containers is taking a massive, yes massive, performance hit.

So, how do we create this fast map-like container? Well, there are variations depending on the specifics but the basic ingredients, sketched in code after the list, are:

  • A lightweight key class that holds an internal index value.
    • The index type could be an appropriately small integer type for the number of possible keys.
    • The key will usually have an associated name that is either stored in the key or determined by lookup in an external container when many containers with the same key set are expected.
    • The key class may contain an enumeration that defines the possible index values and a static mechanism for looking up the name from that index and vice versa.
    • The key class must define an order via operator< and an operator that does conversion to the index type.
    • The key type is normally small enough that it is fastest to pass it by value rather than const reference.
  • A container with map (and optionally vector) semantics that holds a contiguous vector of the actual values and holds or has access to another vector that maps key indexes to indexes in this internal vector.
    • This second vector has room for all possible keys and uses a sentinel flag value to indicate a non-present key.
    • When this mapping from keys to indexes is common to many containers it can be held externally to the container and accessed via a stored pointer.
    • The container provides operator[] and/or operator() that take key arguments and use a two-step lookup for the value for a given key, of the form return vals_[idxs_[key]]. This two-step lookup is a bit slower than a normal one-step vector lookup but is still dramatically faster than std::map.
    • If useful, the container can also expose index-based lookup operators for faster one-step lookup by client code.
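
A minimal sketch of these ingredients with hypothetical names (not Objexx's actual key map containers; the key name machinery is omitted):

#include <cassert>
#include <cstdint>
#include <vector>

// Lightweight key wrapping a small integer index
class Key
{
public:
   enum Index : std::uint8_t { X = 0, Y, Z, N_KEYS }; // The possible keys
   Key( Index const i ) : i_( i ) {}
   operator std::uint8_t() const { return i_; } // Conversion to the index type
   friend bool operator <( Key const a, Key const b ) { return a.i_ < b.i_; }
private:
   std::uint8_t i_;
};

// Map-like container using the two-step lookup vals_[ idxs_[ key ] ]
template< typename T >
class KeyMap
{
public:
   enum : std::uint8_t { npos = 0xFF }; // Sentinel: key not present

   KeyMap() : idxs_( Key::N_KEYS, npos ) {}

   void insert( Key const key, T const & val )
   {
      idxs_[ key ] = static_cast< std::uint8_t >( vals_.size() );
      vals_.push_back( val );
   }

   T const & operator []( Key const key ) const
   {
      assert( idxs_[ key ] != npos ); // Key must be present
      return vals_[ idxs_[ key ] ]; // Two-step lookup: far faster than std::map
   }

private:
   std::vector< std::uint8_t > idxs_; // Key index -> internal index (or npos)
   std::vector< T > vals_; // Contiguous values
};

Usage is then map-like: a KeyMap< double > m can be filled with m.insert( Key::X, 1.5 ) and read with m[ Key::X ].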
[Benchmark chart: Key Map vs. std::map]

The performance impact of using a custom key map system can be dramatic, as the benchmark results shown here demonstrate.

The specific design can vary depending on such criteria as whether the key-to-index map is common to many containers, whether it is small enough that repeating it in each container is worth the space cost to avoid the extra indirection, and whether the key-to-index map is known at compile time (and thus can be a template argument, avoiding both space and indirection costs).

Additional notes:

  • Operations over the whole container can be performed internally over the underlying vector, avoiding even the fast key-based lookup step.
  • Multidimensional key-accessed "matrix" containers can be created along the same lines, built on top of a vector of vectors, for example.
  • An obvious question is how large does the key set (and thus container) have to be for std::map to start to be competitive again. The precise answer to this depends on the specific platform and compiler but in our experience map isn't faster until you are up to thousands of possible keys.
  • You may want different containers depending on whether the value being stored for each key is a small/built-in type that is fastest if passed by value or a larger object that is best passed by const reference.
  • Normally we have the internal vector hold just the values but in some cases you may want a container that holds std::pair pairs like std::map. An alternative that may suffice is to have a mechanism for reverse lookup of the key for a given internal vector index.
  • If your container must support frequent insertion/deletion operations note that for sufficiently small containers it will be faster to hold just an unordered vector of std::pair and do a linear search on it for each key-based lookup: the overhead of sorting and binary search dominates linear search for small collections.

The takeaway from this is that use of std::map and similar containers should be limited in technical software and that there are much faster containers for many situations where maps get used. Objexx has developed a suite of fast key map containers for various scenarios that we have used with great results in our in-house and client applications.



Python Parallel Job Controller

Python
Posted 2012/12/10

Python has built-in support for parallel processing via the multiprocessing, subprocess, and threading packages but does not provide any tools to help use system resources effectively to run a set of jobs in parallel. That requires the ability to measure the system resource loads and start queued jobs when sufficient resources become available. It is not hard to build such a package with the right tools and some basic observations.

There are not a lot of Python packages for measuring system resources and loads but psutil is a good choice. Using psutil you can obtain metrics such as the total and available system memory, and the total or per-core CPU load. Because psutil cannot measure the i/o load on a system, there is not a good way to avoid running too many i/o-bound jobs in parallel. The best simple approach is probably to add a cap on the number of jobs that can be running in parallel. This is not ideal but can suffice for most purposes until a better package becomes available.

[Sequence diagram: job controller]

A basic job controller and Job class system can be written with a modest amount of code; a skeletal sketch follows the list below. This probably has:

  • A constructor that may take arguments for the baseline priority level and CPU and RAM thresholds for starting a job. Optional arguments with default job CPU time and RAM resource usage may also be useful.
  • Methods for adding and removing jobs to/from the queue.
  • A run method that checks the system load and starts one or more jobs if resources allow.
  • Optionally, a mechanism for priority escalation/deescalation.
  • One or more Job classes that hold the specification for a job, estimates for its resource requirements if available, and a mechanism to run the job when the controller tells it to.
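
A skeletal sketch of such a system, assuming subprocess-based jobs and hypothetical threshold defaults:

import subprocess
import time

import psutil

class Job:
    """Hypothetical job: holds a command and runs it when told to."""

    def __init__( self, args ):
        self.args = args

    def run( self ):
        return subprocess.Popen( self.args )

class JobController:
    """Minimal controller: starts queued jobs when resources allow."""

    def __init__( self, cpu_max = 80.0, ram_max = 90.0, jobs_max = 8 ):
        self.cpu_max = cpu_max # Lowest per-core CPU % must be below this
        self.ram_max = ram_max # RAM % in use must be below this
        self.jobs_max = jobs_max # Cap: backstop for i/o-bound jobs
        self.queue = []
        self.running = []

    def add( self, job ):
        self.queue.append( job )

    def resources_ok( self ):
        cpu = min( psutil.cpu_percent( percpu = True ) ) # Load since last poll
        ram = psutil.virtual_memory().percent
        return ( cpu < self.cpu_max ) and ( ram < self.ram_max )

    def run( self ):
        while self.queue or self.running:
            self.running = [ p for p in self.running if p.poll() is None ]
            if ( self.queue and ( len( self.running ) < self.jobs_max )
                    and self.resources_ok() ):
                self.running.append( self.queue.pop( 0 ).run() )
            time.sleep( 1.0 ) # Avoid excessive polling load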

Some considerations in building such a system include:

  • Providing an "aggressiveness" priority setting to client code and/or users to let them control the tradeoff between system responsiveness and job throughput will make your job controller more flexible.
  • Abstracting the Job concept into classes allows a range of job types to be supported, with customization of the underlying run mechanism (subprocess, multiprocessing, ...), job resource estimates, and so forth.
  • A sequential "Job" type is useful when a meta-Job must run a list of smaller jobs in a given order.
  • Estimates for a job's resource (CPU and memory) requirements, if available, can be used to refine the logic on starting jobs.
  • To limit the number of running jobs (such as to provide a backstop for i/o-bound jobs as described above) in a manner that is portable across different parallel packages requires saving the job processes and polling them to count the number running.
  • A time.sleep() should normally be used to avoid an excessive load due to polling the system.
  • Using the lowest per-core CPU load metric in determining whether to start a job is recommended for most applications but a more sophisticated algorithm that looks at the CPU load distribution could be warranted for a complex or mission-critical system.
  • Providing an escalation mechanism may be necessary to get jobs to start on a heavily loaded system. This raises the job starting "aggressiveness" as time passes with no jobs starting and then drops it back to the baseline level after a job starts. Without this your jobs may almost never catch a window when the system is not running at full bore. The escalation logic may need to be tuned to reflect the priority of the jobs compared to other tasks being run.
  • Setting the priority escalation strategy and sleep time between system load polling are heuristics that may take some fine tuning to find a good balance.



Efficient Qt Frozen Columns and Rows

Qt
Posted 2012/05/28

Qt provides a rich GUI framework with excellent support for tables but when you need a feature that isn't directly supported things can get tricky. One such feature for tables is a so-called "frozen" column or row that should remain fixed like the headers when scrolling. Qt table widget and view classes do not support frozen columns or rows and enabling them can be both tricky to get right and an efficiency problem.

[Screenshot: table with a frozen units row]

Frozen rows and columns are useful when they contain label-like content that doesn't belong in the header for some reason, such as being editable. In our case, we use a frozen row in numerical tables for the dimensional units of the columns as in the example shown here. As you scroll the table it is useful to keep the units in view and sometimes the units can be changed via drop-down combo boxes to convert the values.

Qt's Frozen Column Example presents the basics of an approach but we found it lacking in a few ways:

  • Frozen row support isn't covered but is readily built by analogy.
  • It is seriously and unnecessarily inefficient. The reason is that the frozen table view shares the regular table view model so it has to call setColumnHidden on all the non-frozen columns (or the analogous call for a frozen row). For a large table this is bad enough if the table size is static but for an editable table you have to re-hide all those columns or rows whenever the table structure changes. We found this to be a real performance problem even with tables of modest size. The fix is relatively simple: derive a model for the frozen table that only has one column or row. For the frozen row case it looks something like this in PySide/PyQt:
    class MyTableFrozenRowModel( MyTableModel ):
    
        def __init__( self, parent = None ):
            MyTableModel.__init__( self, parent )
    
        def rowCount( self, index = None ):
            return 1 # Treat the frozen table as having just the one frozen row
    The effect on performance is dramatic. This could probably also be done as a proxy model with similar performance benefits.
  • It is fragile and has visual defects if used as written. For example, the updateFrozenTableGeometry implementation is wrong because the verticalHeader width is not always set yet when it gets called. The work-around is to set the table and frozen table verticalHeader widths yourself to a reasonable value and then call setGeometry (see the sketch after this list).
  • It neglects to mention that when you insert or remove rows, for a frozen column, or columns, for a frozen row, you have to emit layoutChanged on the frozen table model.
  • It forgets some other details that you need, such as coupling setFont for the two table views.
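
A hedged PySide sketch of that geometry work-around for the frozen row case, assuming a hypothetical frozen_view attribute on the main table view and an arbitrary fixed header width:

    # In the main table view class (sketch)
    def updateFrozenViewGeometry( self ):
        w = 64 # Set the header widths ourselves: Qt may not have set them yet
        self.verticalHeader().setFixedWidth( w )
        self.frozen_view.verticalHeader().setFixedWidth( w )
        # Place the one-row frozen view over the first data row
        self.frozen_view.setGeometry( self.frameWidth(),
            self.frameWidth() + self.horizontalHeader().height(),
            self.viewport().width() + w, self.rowHeight( 0 ) )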

With careful attention to these details a defect-free and efficient Qt frozen column or row table can be implemented.



Migrating to PySide

PySide
Posted 2012/03/31

PySide is a Python binding for Qt. PySide is similar to the PyQt system but provides LGPL licensing and a community development model that makes it more attractive for commercial projects than PyQt. While the commercial license cost of PyQt is a factor, the bigger issue with PyQt is the need to build it manually on each development system for every release: unlike the GPL version, no binary installers are provided for the commercial PyQt.

We have been waiting for the right time to attempt migrating some projects from PyQt to PySide. The main issues were the maturity/stability of PySide and support for our full toolchain. We had experimented with early PySide releases with some applications and found that, despite a few problems and the lack of matplotlib plotting support, the GUIs ran correctly. With the recent release of PySide 1.1.0 and matplotlib 1.1.0 with PySide support it seemed that the time had come to attempt the full migration.

Recently we successfully ported a couple of engineering applications from PyQt to PySide.

Basics

The basics of a PyQt to PySide port can be found in the Differences Between PySide and PyQt page. Among other items this includes:

  • Changing PyQt4 to PySide in your import statements.
  • Changing pyqtSignal to Signal and pyqtSlot to Slot.
  • Removing QVariant and QString imports and replacing their use by the corresponding Python types. Note that occurrences of QVariant() would normally be replaced by None but that may be something to look at more carefully. You also need to remove uses of toString() and toPyObject() since those are QVariant methods. If the variable that had .toString() is not a str after the removal of QVariant wrappers then wrap it in str(). Removing QVariant wrappers could leave you with some lines of the form
     var = var 
    or the more amusing
     var = var if var is not None else None 
    and those can, of course, be removed.
  • Adding a [0] to the return value of the QFileDialog static function getOpenFileName to get the file name from the returned tuple. With PySide 1.1.0 you also need to do this for getOpenFileNames and getSaveFileName. Note that these return unicode strings so wrap the result in an str() call if a Python string is needed (and your app doesn't attempt to support Unicode path/file names).

Here are some other porting tips we learned along the way:

  • PySide does not declare or call __del__() methods on its Qt derived classes when they are destroyed. In the rare case that you depended on PyQt calling your custom destructor you will have to call this yourself with PySide.
  • Sometimes the object held in a QVariant might have been a Python builtin type or a user-defined type, requiring some code to handle it properly: you might have needed to test whether to call toPyObject() on it with PyQt. When QVariant is eliminated the need for such logic disappears.
  • PySide 1.1.0 has a bug in the QAction setShortcut() method: it won't accept a single shortcut value, either as a scalar or in a list with one item. The work around is to wrap these scalar shortcuts like this:
     action.setShortcut( QKeySequence( shortcut ) ) 

Packaging

Packaging with PyInstaller was expected to be an easy migration from PyQt but we hit a few speedbumps. The biggest was that PyInstaller sees matplotlib importing both PySide and PyQt modules and so it will bundle libraries from both of them if both are installed. If you have a commercial application but have the GPL PyQt installed (for non-commercial use, of course) this could create at least the appearance of a licensing issue. It took some trial and error to get the right solution because the PyInstaller documentation is a bit hand-wavy and because you have to look at the binaries and pure return values from Analysis() to see what to exclude. Adding the right excludes argument to the Analysis() constructor in the spec file does the trick:

a = Analysis( ..., excludes=['sip', 'PyQt4', 'PyQt4.QtCore', 'PyQt4.QtGui'])

If you simply remove PyQt dependencies from a.binaries after Analysis() is built you will still have the PyQt dependencies in your binaries and they will fail to run without the PyQt libraries.

Using matplotlib's pylab/pyplot interface causes PyInstaller to build a dependency on Tk into your application executables even though Tk is not used in a PySide application. Fixing this to avoid shipping Tk libraries with your application means growing the excludes list. On Windows it looks like this:

a = Analysis( ..., excludes=['sip', 'PyQt4', 'PyQt4.QtCore', 'PyQt4.QtGui',
    '_tkinter', 'tk85.dll', 'tcl85.dll'])

Since you are presumably using matplotlib's Qt4Agg backend it is nice to also eliminate the other backends. We find that you can just delete them from the binary collection PyInstaller generates but why not exclude them from the beginning and save the unnecessary copying/deletion:

a = Analysis( ..., excludes=['sip', 'PyQt4', 'PyQt4.QtCore', 'PyQt4.QtGui',
    '_tkinter', 'tk85.dll', 'tcl85.dll',
    'matplotlib.backends._tkagg',
    'matplotlib.backends._gtkagg',
    'matplotlib.backends._backend_gdk'])

You may also need to modify the matplotlibrc and matplotlib.conf files PyInstaller puts in the mpl-data subdirectory to change the default backends to Qt4Agg. If your system-wide versions of those files already have this change then you won't need to do it as part of the packaging process.

Another issue we found was that the generated executables don't freeze the QT_API=pyside environment setting made during the build phase, so your application will still need QT_API to be set at runtime. To avoid making users need to know/set this it is best to move the QT_API setup inside your code before the first matplotlib import that could happen in any code path:

os.environ[ 'QT_API' ] = 'pyside'
import matplotlib


Using OSG 3.x in Qt

OSG
Posted 2012/02/27

ObjexxSISAME uses OpenSceneGraph (OSG) for real-time 3D visualization. We integrated an OSG viewer into the Qt-based GUI being developed for ObjexxSISAME 2.0. The Qt integration "cookbook" for OSG has changed with OSG 3.0 and is still being refined. Various examples are helpful but none give the whole picture. After some effort we have this running nicely and here are notes on the approach that we found best:

[Screenshot: ObjexxSISAME GUI with embedded OSG viewer]
  • The parts of your model that can move need to have callback classes inherited from osg::NodeCallback that the corresponding osg::PositionAttitudeTransform is connected to through a setUpdateCallback() call.
  • In a simulation application you are updating the model state (position) outside of OSG. To get real-time visualization your simulation loop needs to call the frame() method of your OSG viewer instance after updating the positions.
  • To manipulate the camera view during the simulation you need to add a process_events() method to the OSG viewer that calls QCoreApplication::processEvents(). Without that call all the user interaction operations are saved and applied only after the simulation completes and OSG yields control. When multi-threading of OSG in Qt becomes reliable this issue may go away.
  • To add this process_events() call into the OSG viewer class we found it easiest to create a ModelViewer class that inherits from osgViewer::Viewer and osgQt::GLWidget (a minimal sketch follows this list). ModelViewer needs some OSG secret sauce that we adapted from various examples and refined by experimentation.
  • Your OSG code can avoid dependencies on Qt with only the concrete ModelViewer dependent on both OSG and Qt. We do this in ObjexxSISAME to allow us to share the OSG code with a console/batch mode that has an optional real-time 3D visualization in a standard osgViewer window: this code path has no Qt dependency and thus could be used to build a non-GUI version for an exotic platform lacking Qt support.
  • Antialiasing is great on graphics card+driver combinations that support it but there is no call you can make from your application to determine whether the system supports AA. We provide the user control over the AA sampling with the ability to disable it if necessary. Currently, for example, we find that the nouveau driver does not support AA on older nVidia cards.
  • The antialiasing setup that works in a stand-alone osgViewer, calling setNumMultiSamples() on the osg::DisplaySettings instance, does not work when the viewer is embedded in Qt: it appears that Qt controls antialiasing in that context. The Qt-based method from this OSG discussion works.
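
A bare-bones sketch of the ModelViewer shell (the graphics context and camera setup "secret sauce" is omitted):

#include <osgViewer/Viewer>
#include <osgQt/GLWidget>
#include <QCoreApplication>

// Combines the OSG viewer with a Qt GL widget
class ModelViewer : public osgViewer::Viewer, public osgQt::GLWidget
{
public:
   // Pump the Qt event loop so camera manipulation works mid-simulation
   void process_events()
   {
      QCoreApplication::processEvents();
   }
};

The simulation loop then calls frame() followed by process_events() on the viewer after each position update.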

With this hard-won knowledge we have OSG running happily in the ObjexxSISAME 2.0 code base.



Why is YAML Slow?

Development
Posted 2012/02/20

We are using YAML style input files for a number of applications because they are easy to read and edit outside of a GUI. We were dismayed to find that loading times were quite slow for files of modest size (5K lines, 200KB) when using stock YAML parsers such as yaml-cpp and libyaml.

Digging into the specification and stock implementations reveals why parsing YAML is slow: YAML supports many features and syntax variations that complicate parsing and can require super-linear look ahead processing. Here's an example from the YAML 1.2 specification:

!!map {
  ? !!str "implicit block key"
  : !!seq [
    !!map {
      ? !!str "implicit flow key"
      : !!str "value",
    }
  ]
}

Clearly, the readability and elegance of YAML has been lost along with good performance.

Realizing that we do not want or need syntactic support for higher level capabilities (since we are building that into our file structure semantics where it belongs) we found that very simple and fast (linear time) parsers could be written (in C++ and Python for our purposes) for this simplified YAML. These custom parsers are now in use in ObjexxSISAME and in a Python-based wrapper developed as part of the reengineering/modernization of a legacy FEM application.



Real-World F2PY

F2PY
Posted 2012/02/13

F2PY is a solid tool for interfacing a Python front end application with a Fortran compute engine but we found some limitations and quirks that led us to develop a nonstandard method for using it. Using F2PY in the standard, documented fashion has some problems. If you are not using the canonical Fortran compiler, especially on Windows, you can have difficulty getting F2PY to find and use your compiler. And if you want or need to use certain compiler switches the process for doing that is not easy.

The root problem is that F2PY tries to store and maintain knowledge about all the major Fortran compilers on different platforms, down to the level of the format in which they report their versions. Inevitably, much of this information is out of date and thus support for your platform/compiler combination within the F2PY system requires a bit of hacking. This is worse on Windows, primarily because the F2PY developer does not use Windows (Objexx has contributed a few patches and problem reports for these issues). There are many F2PY mailing list reports about it failing to find and use the compiler that was requested.

After investing some time in trying to get F2PY to work with alternative compilers on Windows, Objexx went a different route: we extracted the necessary parts and do our own compiling, using F2PY just to generate the interfaces. This is not hard to do and eliminates much grief.

Here is an outline of the steps we followed:

  • Copy the fortranobject.* files to your build directory
  • Set up a build script that uses F2PY to generate the .pyf interfaces and then builds the interface code using the necessary options and your preferred compiler.

Veneer Interface

Another recommendation for fast and reliable F2PY use is to present to F2PY only a lightweight Fortran veneer with the interfaces that the Python needs to access, keeping the actual computational code in separately built libraries. This interface is essentially a set of wrappers that call the real computational code that lives in these separate libraries. This has the benefits of faster F2PY processing and lower likelihood of F2PY seeing something that trips it up. It also keeps !f2py comments (that are used to get the desired interface without customizing the .pyf file by hand) out of your computational Fortran code, where they can confuse developers not familiar with F2PY and could be altered or removed, which is hard to notice in a large source file. Another benefit is that it clearly identifies and limits the interface used by Python.

Pluggable User Fortran Libraries

Using an interface veneer as described above enables pluggable dynamic/shared Fortran libraries that can be built without installing or using F2PY, which is very desirable for end user custom libraries, and swapped in and out for each run of your application. The approach is as follows:

  • Put a lib sub-directory under the bin directory where executables go.
  • Create "stub" source for each customizable routine or bundle of routines.
    • The stub routines should have the argument lists and little else except documentation on the arguments and requirements/recommendations on the computations that must occur in each. They may contain sample/standard computations if that is desired.
    • If the stub routines are not actual default implementations it is a good idea to have the application automatically detect if any stub routine is called. One way to do this is to have a flag argument in the stub code that real implementations must alter to indicate that the routine can be called.
    • It is recommended to have an identifier string in the stub routine that users are encouraged to set to indicate the routine's source, developer, and content. This can be part of the interface and passed back to the application to include a report of the custom routines used in a run.
  • Build the stub libraries as part of your build process and put them in this lib subdirectory.
  • Point the application at the stub libraries in lib by default.
  • Include the source to the stub routines (or user-specific variants) in the application distribution and instructions on the compiler required for each platform (must be ABI compatible with the Fortran compiler used to build the application).
  • Users build their custom dynamic/shared libraries and put them in a directory such as /app/custom and can then "point" your application to them either by a command line argument to the application, a saved user preference, or an environment variable, depending on the approach preferred.

Argument-Free F2PY Callbacks

Objexx has developed a method for using callbacks within the Fortran that does not require passing the callback routine throughout the Fortran argument lists, which is a burden and can degrade the clarity of the code. This requires use of a Fortran 2003 feature called procedure pointers, which are supported by recent versions of a number of the major Fortran compilers including both Intel Fortran and GFortran.

This method involves storing the pointer to the callback procedure in a module that is then made available in any routine by adding a USE module statement in that routine. This is a fairly obvious approach in C-based languages but until procedure pointers were added to Fortran there was no way to get the callback reference to a Fortran routine other than by passing it as an argument.

The callback routine and associated procedure pointer is defined in the callback module like this:

! Callback routine interface
INTERFACE
  SUBROUTINE callback_prototype( msg, level )
    CHARACTER(*), INTENT(IN) :: msg
    INTEGER, INTENT(IN) :: level
  END SUBROUTINE callback_prototype
END INTERFACE

! Callback procedure pointer
PROCEDURE( callback_prototype ), POINTER :: callback_ptr => NULL()

On every entry into the Fortran the callback needs to be registered with the Fortran since the Fortran DLL state is not carried over. That is done by calling a routine like this:

SUBROUTINE message_caller( message_callback )
  USE CallbackModule

  IMPLICIT NONE

  ! Arguments
  EXTERNAL :: message_callback
!f2py character*(*) msg
!f2py integer level
!f2py call message_callback(msg,level)

  ! Set the message callback procedure pointer
  callback_ptr => message_callback

END SUBROUTINE message_caller

and putting a call like this at the top of every routine that is an entry point from Python:

CALL message_caller( message_callback )

Then within the Fortran messages can be sent via the registered callback by adding the line:

USE CallbackModule

to the routine and making calls like this:

CALL message_call( msg, LogLevel_FATAL )

where message_call is a wrapper routine that lives in the callback module that looks like this:

SUBROUTINE message_call( msg, level )
  CHARACTER(*), INTENT(IN) :: msg
  INTEGER, INTENT(IN) :: level
  CALL callback_ptr( TRIM( msg ), level )
END SUBROUTINE message_call

This setup appears a bit complicated but all the details are in the callback module and the client code just has to add one USE statement and can then do callbacks without adding arguments throughout the Fortran call tree.

F2PY Logging

Leveraging the F2PY callback capabilities, logging systems can be built in Python and Fortran that work together. First, this means that the Python code has a nicely featured logging system that can do things such as routing log messages to one or more "sinks" that can include the console in a CLI run, a log pane/widget in a GUI run, and a log file for any run. Also, the level at which logging happens, e.g., informational messages and higher or also debug messages when debugging, can be controlled by the user. Python's built-in logging package provides this capability.
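
On the Python side the wiring can be as simple as registering a callback that forwards to the logging package. A sketch, where engine is a hypothetical F2PY-built extension module with the message_caller registration routine shown earlier:

import logging

logger = logging.getLogger( 'app' )

def message_callback( msg, level ):
    logger.log( level, msg ) # Assumes the Fortran levels match logging's levels

# Register before calling into the Fortran on each entry:
# engine.message_caller( message_callback )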

The next step is to add a Fortran logging module that exploits the argument-free callback mechanism and can accept messages from anywhere in the Fortran with levels (info, warning, …), both acting on those messages within the Fortran and passing them back to the Python where they can be wired into the Python logging mechanism. In addition to simple text messages, interfaces can be added for other message types, such as the time step completed and the current estimate or max number of time steps, which the Python GUI can use to display a progress bar. This integration is also important when the Fortran is doing a STOP to terminate a run due to some problem: the Fortran can notify the Python that it is about to terminate, clean up resources, and then do the STOP. This is especially important when using Python multiprocessing so that processes are not left hanging, which happens if the Fortran dies without signaling the Python.

A fairly small amount of boilerplate code is required to make this all work and the results are well worth the effort.