
Objexx Labs
Discoveries, ideas, and musings from the Objexx development labs

Fast Keys and Slow Maps

Performance
Posted 2013/02/03

The C++ Standard Library provides the std::map container for collections of items that need to be looked up by a "key" identifier. This is much more flexible than an array container like std::vector, which can only look items up by a contiguous set of integer indexes. While std::map (and its new C++11 relative std::unordered_map) have some admirable complexity guarantees, they have much worse performance than an array. For fast technical applications we would love to have the semantics and syntax of std::map but the performance of an array. Happily, there are a number of situations where, with a little work, we can craft such a solution.

The C++ map containers are designed to provide good complexity guarantees for large containers with an arbitrary mix of insertion, lookup, and deletion operations. Achieving this asymptotic complexity comes at the cost of a large time and space overhead that is wasteful for collections that don't fall in this category.

Collections that are small or where lookups dominate insertion/deletion are common in scientific and engineering applications. A prominent example in ObjexxSISAME is the vectors and matrices indexed by the axes of the model's degree of freedom set (ObjexxSISAME supports any meaningful 1, 2, or 3D model DOF set): these have at most 6 keys and the key set is fixed once they are initialized. Or we may be modeling proteins where we need to look up the 20 canonical amino acids by a named key. Using a map for lookups on such containers imposes a massive, yes massive, performance hit.

So, how do we create this fast map-like container? Well, there are variations depending on the specifics but the basic ingredients are:

  • A lightweight key class that holds an internal index value.
    • The index type could be an appropriately small integer type for the number of possible keys.
    • The key will usually have an associated name that is either stored in the key or determined by lookup in an external container when many containers with the same key set are expected.
    • The key class may contain an enumeration that defines the possible index values and a static mechanism for looking up the name from that index and vice versa.
    • The key class must define an order via operator< and an operator that does conversion to the index type.
    • The key type is normally small enough that it is fastest to pass it by value rather than const reference.
  • A container with map (and optionally vector) semantics that holds a contiguous vector of the actual values and holds or has access to another vector that maps key indexes to indexes in this internal vector.
    • This second vector has room for all possible keys and uses a sentinel flag value to indicate a non-present key.
    • When this mapping from keys to indexes is common to many containers it can be held externally to the container and accessed via a stored pointer.
    • The container provides operator[] and/or operator() that take key arguments and use a two-step lookup for the value for a given key, of the form return vals_[idxs_[key]]. This two-step lookup is a bit slower than a normal one-step vector lookup but is still dramatically faster than std::map.
    • If useful, the container can also expose index-based lookup operators for faster one-step lookup by client code.
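The ingredients above can be sketched in a few lines of Python (our production containers are C++, and all names here are illustrative): a lightweight key holding a small integer index, a contiguous value vector, and a second vector mapping key indexes to value slots with a sentinel for non-present keys.

```python
NO_KEY = -1  # sentinel flag value: this key is not present in the container

class Key:
    """Lightweight key holding an index into the universe of possible keys."""
    __slots__ = ("index", "name")
    def __init__(self, index, name):
        self.index = index
        self.name = name

class KeyMap:
    """Map-like container: contiguous values plus a key-index -> slot vector."""
    def __init__(self, n_possible):
        self.idxs = [NO_KEY] * n_possible  # room for all possible keys
        self.vals = []                     # contiguous vector of actual values
    def __setitem__(self, key, value):
        slot = self.idxs[key.index]
        if slot == NO_KEY:                 # first insertion of this key
            self.idxs[key.index] = len(self.vals)
            self.vals.append(value)
        else:
            self.vals[slot] = value
    def __getitem__(self, key):
        # The two-step lookup: vals_[idxs_[key]]
        return self.vals[self.idxs[key.index]]
```

In the C++ versions the conversion to the index type is an operator on the key class, so the two-step lookup compiles down to two array indexings with no hashing or tree traversal.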
Key Map vs map

The performance impact of using a custom key map system can be dramatic, as the benchmark results shown here demonstrate.

The specific design can vary depending on criteria such as whether the key-to-index map is common to many containers, whether it is small enough that repeating it per container is worth the space cost to avoid the extra indirection, and whether the key-to-index map is known at compile time (and thus can be a template argument, avoiding both the space and indirection costs).

Additional notes:

  • Operations over the whole container can be performed internally over the underlying vector, avoiding even the fast key-based lookup step.
  • Multidimensional key-accessed "matrix" containers can be created along the same lines, built on top of a vector of vectors, for example.
  • An obvious question is how large does the key set (and thus container) have to be for std::map to start to be competitive again. The precise answer to this depends on the specific platform and compiler but in our experience map isn't faster until you are up to thousands of possible keys.
  • You may want different containers depending on whether the value being stored for each key is a small/built-in type that is fastest if passed by value or a larger object that is best passed by const reference.
  • Normally we have the internal vector hold just the values but in some cases you may want a container that holds std::pair pairs like std::map. An alternative that may suffice is to have a mechanism for reverse lookup of the key for a given internal vector index.
  • If your container must support frequent insertion/deletion operations note that for sufficiently small containers it will be faster to hold just an unordered vector of std::pair and do a linear search on it for each key-based lookup: the overhead of sorting and binary search dominates linear search for small collections.
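The last note can be made concrete with a short sketch (illustrative Python; the idea applies equally in C++): for small key sets an unsorted vector of pairs with linear search avoids all sorting and tree/hash overhead.

```python
class SmallMap:
    """Unordered vector of (key, value) pairs with linear search per lookup.

    For sufficiently small collections this beats sorted/binary-search and
    tree-based maps because the per-operation overhead dominates there.
    """
    def __init__(self):
        self.pairs = []
    def __setitem__(self, key, value):
        for i, (k, _) in enumerate(self.pairs):
            if k == key:                 # key present: overwrite in place
                self.pairs[i] = (key, value)
                return
        self.pairs.append(key_value := (key, value))
    def __getitem__(self, key):
        for k, v in self.pairs:          # linear search: fast when small
            if k == key:
                return v
        raise KeyError(key)
```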

The takeaway from this is that use of std::map and similar containers should be limited in technical software and that there are much faster containers for many situations where maps get used. Objexx has developed a suite of fast key map containers for various scenarios that we have used with great results in our in-house and client applications.

Python Parallel Job Controller

Python
Posted 2012/12/10

Python has built-in support for parallel processing via the multiprocessing, subprocess, and threading packages but does not provide any tools to help use system resources effectively to run a set of jobs in parallel. That requires the ability to measure the system resource loads and start queued jobs when sufficient resources become available. It is not hard to build such a package with the right tools and some basic observations.

There are not a lot of Python packages for measuring system resources and loads, but psutil is a good choice. Using psutil you can obtain metrics such as the total and available system memory, and the total or per-core CPU load. Because psutil cannot measure the i/o load on a system, there is not a good way to avoid running too many i/o-bound jobs in parallel. The best simple approach is probably to add a cap on the number of jobs that can be running in parallel. This is not an ideal approach but can suffice for most purposes until a better package becomes available.

Job Controller Sequence Diagram

A basic job controller and Job class system can be written with a modest amount of code. It typically includes:

  • A constructor that may take arguments for the baseline priority level and CPU and RAM thresholds for starting a job. Optional arguments with default job CPU time and RAM resource usage may also be useful.
  • Methods for adding and removing jobs to/from the queue.
  • A run method that checks the system load and starts one or more jobs if resources allow.
  • Optionally, a mechanism for priority escalation/deescalation.
  • One or more Job classes that hold the specification for a job, estimates for its resource requirements if available, and a mechanism to run the job when the controller tells it to.
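The pieces above can be sketched as follows. This is a minimal illustration, not our production controller: the `load_probe` callable stands in for a psutil-based measurement (e.g., the lowest per-core load from psutil.cpu_percent(percpu=True)), and jobs run synchronously here to keep the sketch self-contained.

```python
class Job:
    """Wraps a callable to run; a real Job would launch a subprocess or
    multiprocessing.Process and carry CPU/RAM resource estimates."""
    def __init__(self, target, cpu_est=1.0, ram_est=0.0):
        self.target = target
        self.cpu_est = cpu_est   # estimated CPU load (illustrative units)
        self.ram_est = ram_est   # estimated RAM use (illustrative units)
        self.done = False
    def run(self):
        self.target()
        self.done = True

class JobController:
    """Starts queued jobs while the measured CPU load stays under a threshold."""
    def __init__(self, load_probe, cpu_threshold=0.8, max_running=4):
        self.load_probe = load_probe        # returns current CPU load in [0, 1]
        self.cpu_threshold = cpu_threshold  # don't start jobs above this load
        self.max_running = max_running      # cap: backstop for i/o-bound jobs
        self.queue = []
    def add(self, job):
        self.queue.append(job)
    def remove(self, job):
        self.queue.remove(job)
    def run(self):
        """One polling pass: start queued jobs while resources allow."""
        started = []
        while self.queue and len(started) < self.max_running:
            if self.load_probe() >= self.cpu_threshold:
                break  # system is busy: leave remaining jobs queued
            job = self.queue.pop(0)
            job.run()
            started.append(job)
        return started
```

A real run loop would call run() repeatedly with a time.sleep() between passes, as discussed below.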

Some considerations in building such a system include:

  • Providing an "aggressiveness" priority setting to client code and/or users to let them control the tradeoff between system responsiveness and job throughput will make your job controller more flexible.
  • Abstracting the Job concept into classes allows a range of job types to be supported, with customization of the underlying run mechanism (subprocess, multiprocessing, ...), job resource estimates, and so forth.
  • A sequential "Job" type is useful when a meta-Job must run a list of smaller jobs in a given order.
  • Estimates for a job's resource (CPU and memory) requirements, if available, can be used to refine the logic on starting jobs.
  • To limit the number of running jobs (such as to provide a backstop for i/o-bound jobs as described above) in a manner that is portable to different parallel packages, you must save the process handles and poll them to count how many are still running.
  • A time.sleep() should normally be used to avoid an excessive load due to polling the system.
  • Using the lowest per-core CPU load metric in determining whether to start a job is recommended for most applications but a more sophisticated algorithm that looks at the CPU load distribution could be warranted for a complex or mission-critical system.
  • Providing an escalation mechanism may be necessary to get jobs to start on a heavily loaded system. This raises the job starting "aggressiveness" as time passes with no jobs starting and then drops it back to the baseline level after a job starts. Without this your jobs may almost never catch a window when the system is not running at full bore. The escalation logic may need to be tuned to reflect the priority of the jobs compared to other tasks being run.
  • Setting the priority escalation strategy and sleep time between system load polling are heuristics that may take some fine tuning to find a good balance.

Efficient Qt Frozen Columns and Rows

Qt
Posted 2012/05/28

Qt provides a rich GUI framework with excellent support for tables, but when you need a feature that isn't directly supported things can get tricky. One such feature for tables is a so-called "frozen" column or row that should remain fixed like the headers when scrolling. Qt table widget and view classes do not support frozen columns or rows, and enabling them can be both tricky to get right and an efficiency problem.

Frozen Row Table

Frozen rows and columns are useful when they contain label-like content that doesn't belong in the header for some reason, such as being editable. In our case, we use a frozen row in numerical tables for the dimensional units of the columns as in the example shown here. As you scroll the table it is useful to keep the units in view, and sometimes the units can be changed via drop-down combo boxes to convert the values.

Qt's Frozen Column Example presents the basics of an approach but we found it lacking in a few ways:

  • Frozen row support isn't covered but is readily built by analogy.
  • It is seriously and unnecessarily inefficient. The reason is that the frozen table view shares the regular table view model so it has to call setColumnHidden on all the non-frozen columns (or the analogous call for a frozen row). For a large table this is bad enough if the table size is static but for an editable table you have to re-hide all those columns or rows whenever the table structure changes. We found this to be a real performance problem even with tables of modest size. The fix is relatively simple: derive a model for the frozen table that only has one column or row. For the frozen row case it looks something like this in PySide/PyQt:
    class MyTableFrozenRowModel( MyTableModel ):
    
        def __init__( self, parent = None ):
            MyTableModel.__init__( self, parent )
    
        def rowCount( self, index = None ):
            return 1 # Treat the frozen table as having just the one frozen row
    The effect on performance is dramatic. This could probably also be done as a proxy model with similar performance benefits.
  • It is fragile and has visual defects if used as written. For example, the updateFrozenTableGeometry implementation is wrong because the verticalHeader width is not always set yet when it gets called. The work-around is to set the table and frozen table verticalHeader widths yourself to a reasonable value and then call setGeometry.
  • It neglects to mention that when you insert or remove rows, for a frozen column, or columns, for a frozen row, you have to emit layoutChanged on the frozen table model.
  • It forgets some other details that you need, such as coupling setFont for the two table views.

With careful attention to these details a defect-free and efficient Qt frozen column or row table can be implemented.

Migrating to PySide

PySide
Posted 2012/03/31

PySide is a Python binding for Qt. PySide is similar to the PyQt system but provides LGPL licensing and a community development model that makes it more attractive for commercial projects than PyQt. While the commercial license cost of PyQt is a factor the bigger issue with PyQt is the need to build it manually on each development system for every release: unlike the GPL version no binary installers are provided for the commercial PyQt.

We have been waiting for the right time to attempt migrating some projects from PyQt to PySide. The main issues were the maturity/stability of PySide and support for our full toolchain. We had experimented with early PySide releases with some applications and found that, despite a few problems and the lack of matplotlib plotting support, the GUIs ran correctly. With the recent release of PySide 1.1.0 and matplotlib 1.1.0 with PySide support it seemed that the time had come to attempt the full migration.

Recently we successfully ported a couple of engineering applications from PyQt to PySide.

Basics

The basics of a PyQt to PySide port can be found in the Differences Between PySide and PyQt page. Among other items this includes:

  • Changing PyQt4 to PySide in your import statements.
  • Changing pyqtSignal to Signal and pyqtSlot to Slot.
  • Removing QVariant and QString imports and replacing their use by the corresponding Python types. Note that occurrences of QVariant() would normally be replaced by None but that may be something to look at more carefully. You also need to remove uses of toString() and toPyObject() since those are QVariant methods. If the variable that had .toString() is not a str after the removal of QVariant wrappers then wrap it in str(). Removing QVariant wrappers could leave you with some lines of the form
     var = var 
    or the more amusing
     var = var if var is not None else None 
    and those can, of course, be removed.
  • Adding a [0] to the return value of the QFileDialog static function getOpenFileName to get the file name from the returned tuple. With PySide 1.1.0 you also need to do this for getOpenFileNames and getSaveFileName as well. Note that these return unicode strings so wrap the result in an str() call if a Python string is needed (and your app doesn't attempt to support Unicode path/file names).
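If code must run under both bindings during the migration, the getOpenFileName difference can be absorbed by a small helper. The name `chosen_path` is hypothetical, not a Qt API; it simply wraps the [0] fix described above.

```python
def chosen_path(result):
    """Normalize a QFileDialog.getOpenFileName()-style return value.

    PySide returns a (fileName, selectedFilter) tuple where PyQt returned a
    plain string; either way, hand back a Python str of the file name.
    (Use str() only if your app doesn't support Unicode path/file names.)
    """
    if isinstance(result, tuple):
        result = result[0]  # the [0] fix: pull the file name from the tuple
    return str(result)
```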

Here are some other porting tips we learned along the way:

  • PySide does not declare or call __del__() methods on its Qt derived classes when they are destroyed. In the rare case that you depended on PyQt calling your custom destructor you will have to call this yourself with PySide.
  • Sometimes the object held in a QVariant might have been a Python builtin type or a user-defined type, requiring some code to handle it properly: you might have needed to test whether to call toPyObject() on it with PyQt. When QVariant is eliminated the need for such logic disappears.
  • PySide 1.1.0 has a bug in the QAction setShortcut() method: it won't accept a single shortcut value, either as a scalar or in a list with one item. The work around is to wrap these scalar shortcuts like this:
     action.setShortcut( QKeySequence( shortcut ) ) 

Packaging

Packaging with PyInstaller was expected to be an easy migration from PyQt but we hit a few speedbumps. The biggest was that PyInstaller sees matplotlib importing both PySide and PyQt modules and so it will bundle libraries from both of them if both are installed. If you have a commercial application but have the GPL PyQt installed (for non-commercial use, of course) this could create at least the appearance of a licensing issue. It took some trial and error to get the right solution because the PyInstaller documentation is a bit hand-wavy and because you have to look at the binaries and pure return values from Analysis() to see what to exclude. Adding the right excludes argument to the Analysis() constructor in the spec file does the trick:

a = Analysis( ..., excludes=['sip', 'PyQt4', 'PyQt4.QtCore', 'PyQt4.QtGui'])

If you simply remove PyQt dependencies from a.binaries after Analysis() is built you will still have the PyQt dependencies in your binaries and they will fail to run without the PyQt libraries.

Using matplotlib's pylab/pyplot interface causes PyInstaller to build a dependency on Tk into your application executables even though Tk is not used in a PySide application. Fixing this to avoid shipping Tk libraries with your application means growing the excludes list. On Windows it looks like this:

a = Analysis( ..., excludes=['sip', 'PyQt4', 'PyQt4.QtCore', 'PyQt4.QtGui',
    '_tkinter', 'tk85.dll', 'tcl85.dll'])

Since you are presumably using matplotlib's Qt4Agg backend it is nice to also eliminate the other backends. We find that you can just delete them from the binary collection PyInstaller generates, but why not exclude them from the beginning and save the unnecessary copying/deletion:

a = Analysis( ..., excludes=['sip', 'PyQt4', 'PyQt4.QtCore', 'PyQt4.QtGui',
    '_tkinter', 'tk85.dll', 'tcl85.dll',
    'matplotlib.backends._tkagg',
    'matplotlib.backends._gtkagg',
    'matplotlib.backends._backend_gdk'])

You may also need to modify the matplotlibrc and matplotlib.conf files PyInstaller puts in the mpl-data subdirectory to change the default backends to Qt4Agg. If your system-wide versions of those files already have this change then you won't need to do it as part of the packaging process.

Another issue we found was that the generated executables don't freeze the QT_API=pyside environment setting made during the build phase, so your application will still need QT_API set at runtime. To avoid making users need to know/set this, it is best to move the QT_API setup inside your code, before the first matplotlib import that could happen in any code path:

os.environ[ 'QT_API' ] = 'pyside'
import matplotlib

Using OSG 3.x in Qt

OSG
Posted 2012/02/27

ObjexxSISAME uses OpenSceneGraph (OSG) for real-time 3D visualization. We integrated an OSG viewer into the Qt-based GUI being developed for ObjexxSISAME 2.0. The Qt integration "cookbook" for OSG has changed with OSG 3.0 and is still being refined. Various examples are helpful but none give the whole picture. After some effort we have this running nicely and here are notes on the approach that we found best:

ObjexxSISAME GUI
  • The parts of your model that can move need to have callback classes inherited from osg::NodeCallback that the corresponding osg::PositionAttitudeTransform is connected to through a setUpdateCallback() call.
  • In a simulation application you are updating the model state (position) outside of OSG. To get real-time visualization your simulation loop needs to call the frame() method of your OSG viewer instance after updating the positions.
  • To manipulate the camera view during the simulation you need to add a process_events() method to the OSG viewer that calls QCoreApplication::processEvents(). Without that call, all user interaction operations are saved up and applied after the simulation completes and OSG yields control. When multi-threading of OSG in Qt becomes reliable this issue may go away.
  • To add this process_events() call into the OSG viewer class we found it easiest to create a ModelViewer class that inherits from osgViewer::Viewer and osgQt::GLWidget. ModelViewer needs some OSG secret sauce that we adapted from various examples and refined by experimentation.
  • Your OSG code can avoid dependencies on Qt with only the concrete ModelViewer dependent on both OSG and Qt. We do this in ObjexxSISAME to allow us to share the OSG code with a console/batch mode that has an optional real-time 3D visualization in a standard osgViewer window: this code path has no Qt dependency and thus could be used to build a non-GUI version for an exotic platform lacking Qt support.
  • Antialiasing is great on graphics card+driver combinations that support it but there is no call you can make from your application to determine whether the system supports AA. We provide the user control over the AA sampling with the ability to disable it if necessary. Currently, for example, we find that the nouveau driver does not support AA on older nVidia cards.
  • The antialiasing setup that works in a stand-alone osgViewer, calling setNumMultiSamples() on the osg::DisplaySettings instance, does not work when the viewer in embedded in Qt: it appears that Qt controls antialiasing in that context. The Qt-based method from this OSG discussion works.

With this hard-won knowledge we have OSG running happily in the ObjexxSISAME 2.0 code base.

Why is YAML Slow?

Development
Posted 2012/02/20

We are using YAML style input files for a number of applications because they are easy to read and edit outside of a GUI. We were dismayed to find that loading times were quite slow for files of modest size (5K lines, 200KB) when using stock YAML parsers such as yaml-cpp and libyaml.

Digging into the specification and stock implementations reveals why parsing YAML is slow: YAML supports many features and syntax variations that complicate parsing and can require super-linear look ahead processing. Here's an example from the YAML 1.2 specification:

!!map {
  ? !!str "implicit block key"
  : !!seq [
    !!map {
      ? !!str "implicit flow key"
      : !!str "value",
    }
  ]
}

Clearly, the readability and elegance of YAML have been lost along with good performance.

Realizing that we do not want or need syntactic support for higher level capabilities (since we are building that into our file structure semantics where it belongs) we found that very simple and fast (linear time) parsers could be written (in C++ and Python for our purposes) for this simplified YAML. These custom parsers are now in use in ObjexxSISAME and in a Python-based wrapper developed as part of the reengineering/modernization of a legacy FEM application.
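To give the flavor of such a simplified parser, here is a minimal single-pass sketch in Python (not our production parser): it accepts only indentation-nested key: value maps, with no flow syntax, tags, anchors, or look-ahead, which is what makes linear time easy.

```python
def parse_simple_yaml(text):
    """Parse a simplified YAML subset: indentation-nested 'key: value' maps.

    One pass over the lines, linear time. A stack of (indent, container)
    pairs tracks the current nesting; a line with no value opens a child map.
    """
    root = {}
    stack = [(-1, root)]  # sentinel indent so top-level keys go into root
    for raw in text.splitlines():
        if not raw.strip() or raw.lstrip().startswith('#'):
            continue  # skip blank lines and comments
        indent = len(raw) - len(raw.lstrip(' '))
        key, _, value = raw.strip().partition(':')
        value = value.strip()
        while indent <= stack[-1][0]:
            stack.pop()  # close maps nested deeper than this line
        container = stack[-1][1]
        if value:
            container[key] = value  # scalar values kept as strings here
        else:
            child = {}
            container[key] = child
            stack.append((indent, child))
    return root
```

A real version adds sequences, type conversion, and error reporting, but stays a simple linear scan.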

Real-World F2PY

F2PY
Posted 2012/02/13

F2PY is a solid tool for interfacing a Python front end application with a Fortran compute engine but we found some limitations and quirks that led us to develop a nonstandard method for using it. Using F2PY in the standard, documented fashion has some problems. If you are not using the canonical Fortran compiler, especially on Windows, you can have difficulty getting F2PY to find and use your compiler. And if you want or need to use certain compiler switches the process for doing that is not easy.

The problem is that F2PY tries to store and maintain knowledge about all the major Fortran compilers on different platforms, down to the format in which they report their versions. Inevitably, much of this information is out of date, and so support for your platform/compiler combination within the F2PY system requires a bit of hacking. This is worse on Windows, primarily because the F2PY developer does not use Windows (Objexx has contributed a few patches and problem reports for these issues). There are many F2PY mailing list reports about it failing to find and use the compiler that was requested.

After investing some time in trying to get F2PY to work with alternative compilers on Windows, Objexx went a different route: we extracted the necessary parts and do our own compiling, using F2PY only to generate the interfaces. This is not hard to do and eliminates much grief.

Here is an outline of the steps we followed:

  • Copy the fortranobject.* files to your build directory.
  • Set up a build script that uses F2PY to generate the .pyf interfaces and then builds the interface code using the necessary options and your preferred compiler.

Veneer Interface

Another recommendation for fast and reliable F2PY use is to present to F2PY only a lightweight Fortran veneer containing the interfaces that the Python needs, keeping the actual computational code in separately built libraries. This interface is essentially a set of wrappers that call the real computational code that lives in these separate libraries. This has the benefits of faster F2PY processing and lower likelihood of F2PY seeing something that trips it up. It also keeps !f2py comments (which are used to get the desired interface without customizing the .pyf file by hand) out of your computational Fortran code, where they can confuse developers not familiar with F2PY and could be altered or removed, which is hard to notice in a large source file. Another benefit is that it clearly identifies and limits the interface used by Python.

Pluggable User Fortran Libraries

Using an interface veneer as described above enables pluggable dynamic/shared Fortran libraries that can be built without installing or using F2PY, which is very desirable for end user custom libraries, and swapped in and out for each run of your application. The approach is as follows:

  • Put a lib sub-directory under the bin directory where executables go.
  • Create "stub" source for each customizable routine or bundle of routines.
    • The stub routines should have the argument lists and little else except documentation on the arguments and requirements/recommendations on the computations that must occur in each. They may contain sample/standard computations if that is desired.
    • If the stub routines are not actual default implementations it is a good idea to have the application automatically detect if any stub routine is called. One way to do this is to have a flag argument in the stub code that real implementations must alter to indicate that the routine can be called.
    • It is recommended to have an identifier string in the stub routine that users are encouraged to set to indicate the routine's source, developer, and content. This can be part of the interface and passed back to the application to include a report of the custom routines used in a run.
  • Build the stub libraries as part of your build process and put them in this lib subdirectory.
  • Point the application at the stub libraries in lib by default.
  • Include the source to the stub routines (or user-specific variants) in the application distribution and instructions on the compiler required for each platform (must be ABI compatible with the Fortran compiler used to build the application).
  • Users build their custom dynamic/shared libraries and put them in a directory such as /app/custom and can then "point" your application to them either by a command line argument to the application, a saved user preference, or an environment variable, depending on the approach preferred.

Argument-Free F2PY Callbacks

Objexx has developed a method for using callbacks within the Fortran that does not require passing the callback routine throughout the Fortran argument lists, which is a burden and can degrade the clarity of the code. This requires use of a Fortran 2003 feature called procedure pointers, which are supported by recent versions of a number of the major Fortran compilers including both Intel Fortran and GFortran.

This method involves storing the pointer to the callback procedure in a module that is then made available in any routine by adding a USE module statement in that routine. This is a fairly obvious approach in C-based languages but until procedure pointers were added to Fortran there was no way to get the callback reference to a Fortran routine other than by passing it as an argument.

The callback routine and associated procedure pointer is defined in the callback module like this:

! Callback routine interface
INTERFACE
  SUBROUTINE callback_prototype( msg, level )
    CHARACTER(*), INTENT(IN) :: msg
    INTEGER, INTENT(IN) :: level
  END SUBROUTINE callback_prototype
END INTERFACE

! Callback procedure pointer
PROCEDURE( callback_prototype ), POINTER :: callback_ptr => NULL()

On every entry into the Fortran the callback needs to be registered with the Fortran since the Fortran DLL state is not carried over. That is done by calling a routine like this:

SUBROUTINE message_caller( message_callback )
  USE CallbackModule

  IMPLICIT NONE

  ! Arguments
  EXTERNAL :: message_callback
!f2py character*(*) msg
!f2py integer level
!f2py call message_callback(msg,level)

  ! Set the message callback procedure pointer
  callback_ptr => message_callback

END SUBROUTINE message_caller

and putting a call like this at the top of every routine that is an entry point from Python:

CALL message_caller( message_callback )

Then within the Fortran messages can be sent via the registered callback by adding the line:

USE CallbackModule

to the routine and making calls like this:

CALL message_call( msg, LogLevel_FATAL )

where message_call is a wrapper routine that lives in the callback module that looks like this:

SUBROUTINE message_call( msg, level )
  CHARACTER(*), INTENT(IN) :: msg
  INTEGER, INTENT(IN) :: level
  CALL callback_ptr( TRIM( msg ), level )
END SUBROUTINE message_call

This setup appears a bit complicated but all the details are in the callback module and the client code just has to add one USE statement and can then do callbacks without adding arguments throughout the Fortran call tree.

F2PY Logging

Leveraging the F2PY callback capabilities, logging systems can be built in Python and Fortran that work together. First, this means that the Python code has a nicely featured logging system that can do things such as routing log messages to one or more "sinks" that can include the console in a CLI run, a log pane/widget in a GUI run, and a log file for any run. Also, the level at which logging happens, e.g., informational messages and higher or also debug messages when debugging, can be controlled by the user. Python's built-in logging package provides this capability.

The next step is to add a Fortran logging module, exploiting the argument-free callback mechanism, that can accept messages from anywhere in the Fortran with levels (info, warning, …) and can both act on those messages within the Fortran and pass them back to the Python, where they can be wired into the Python logging mechanism. In addition to simple text messages, interfaces can be added for other message types, such as the time step completed and the current estimate or max number of time steps, which the Python GUI can use to display a progress bar. This integration is also important when the Fortran is doing a STOP to terminate a run due to some problem: the Fortran can notify the Python that it is about to terminate, clean up resources, and then do the STOP. This is especially important when using Python multiprocessing so that processes are not left hanging, which happens if the Fortran dies without signaling the Python.
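The Python side of this wiring is small. A sketch, assuming the Fortran callback signature shown earlier (msg, level); the integer-to-level mapping here is an illustrative convention, not part of F2PY:

```python
import logging

# Hypothetical mapping from Fortran-side integer levels to Python levels;
# the numbering convention is ours, not something F2PY defines.
FORTRAN_LEVELS = {10: logging.DEBUG, 20: logging.INFO, 30: logging.WARNING,
                  40: logging.ERROR, 50: logging.CRITICAL}

log = logging.getLogger("engine")  # illustrative logger name

def message_callback(msg, level):
    """Python end of the Fortran message callback.

    Registered with the Fortran via the message_caller entry point; routes
    each message into Python's logging system, which then fans it out to
    the configured sinks (console, GUI log pane, log file, ...).
    """
    log.log(FORTRAN_LEVELS.get(level, logging.INFO), msg.strip())
```

The Python logging configuration (handlers, levels) then controls what the user sees, exactly as for messages originating in the Python code.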

A fairly small amount of boilerplate code is required to make this all work and the results are well worth the effort.
