[chemfp] alignment, speed, OpenMP vs. multithreading

Andrew Dalke dalke at dalkescientific.com
Mon Nov 14 21:10:45 EST 2011


Kim Walisch contributed an improved popcount method, which gives roughly two-fold performance increase on "large" fingerprints. The fingerprint must be at least 768 bits long for it to kick in. I select the best method depending on the fingerprint length and alignments.

His algorithm requires 64-bit/8-byte aligned data structures.

The FingerprintArena uses a Python string, and the implementation on my Mac is 4 byte aligned. I've updated the code to support extra start padding (and end padding, for technical reason) so

I've made a few changes to the FingerprintArena so that it supports start and end padding, so I can align the overall content. If the fingerprints are not a multiple of the alignment size then I add the needed 0s to the end of fingerprint so those are also aligned.


The end result is a new "alignment" parameter to load_fingerprints(), with a default value of 8. You can also use alignment=1, 2, 4, or another power of two.


Kim also contributed OpenMP support. I tested it with a benchmark I have based on 110885 2048-bit fingerprints generated RDKit. It does NxN similarity calculations with a very high threshold of 0.99 (which means it's not actually doing 110885*110885 calculations.)

Note: this is on a machine with an Intel Core 2 Duo; and the reproducibility is +/- 1% and not to the number of decimal points I give.

Baseline (no OpenMP): 15.90 seconds
OMP_NUM_THREADS=1: 15.86 seconds
OMP_NUM_THREADS=2: 10.60 seconds
OMP_NUM_THREADS=4: 12.19 seconds

I compared it to equivalent multi-threaded and multi-processing versions:
num_threads=1: 16.00 seconds
num_threads=2: 9.84 seconds
num_threads=3: 7.56 seconds
num_threads=4: 7.17 seconds

num_processes=1: 16.34 seconds
num_processes=2: 10.15 seconds
num_processes=3: 7.62 seconds
num_processes=4: 7.32 seconds

I did the same test with 253564 4096-bit fingerprints:

Baseline (no OpenMP): 157.12 seconds
OMP_NUM_THREADS=1: 155.35 seconds
OMP_NUM_THREADS=2: 95.52 seconds
OMP_NUM_THREADS=3: 88.22 seconds
OMP_NUM_THREADS=4: 88.23 seconds

num_threads=1: 157.26 seconds
num_threads=2: 97.32 seconds
num_threads=3: 73.20 seconds
num_threads=4: 70.70 seconds

num_processes=1: 155.76 seconds
num_processes=2: 96.30 seconds
num_processes=3: 72.25 seconds
num_processes=4: 67.71 seconds

Clearly this benchmark code scales better with multi-threaded and multi-process programming than with OpenMP.

You can also see the scaling. The problem is (253564./110885)**2 * (4096/2048) = 10. times larger, and it takes 10 times longer.


However, as I found out, OpenMP and multi-threading don't mix on a Mac. I tried to use the multi-threaded benchmark with a chemfp compiled with OpenMP, and it gave a segfault. This seems to be a well-known problem with OpenMP.

I don't know how I'm going to handle this. I want people to be able to choose between the OpenMP support and the multi-threaded support. The easiest is to make this a compile-time option. More complicated is support for a "FingerprintArena" and "FingerprintArenaOMP", with different code paths.

There still a problem with multi-process support. If I spawn off two processes, and each wants 4 threads, then that's going to be slower than two processes each with 2 thread (assuming a 4 core CPU).


I suspect I'll make this a compile-time option and add a function call to report if the package was compiled with OpenMP support.


Andrew
dalke at dalkescientific.com




More information about the chemfp mailing list