[chemfp] popcount methods for sparse fingerprints

Andrew Dalke dalke at dalkescientific.com
Wed Nov 30 09:14:24 EST 2011


chemfp includes support for RDKit's "Morgan" fingerprints, which are based on the ECFP fingerprints.

These produce sparse fingerprints; on the order of 5% density. The intersection of two fingerprints has about a 1% density.

I investigated whether there was a better way to compute the popcount of very sparse fingerprints. Various sources suggest that the method of Wegner (1960) might be good for that case.

My conclusion is that it's not worthwhile to add specific support in chemfp's dense fingerprint popcount routines to go faster with very sparse fingerprints.

=================

Here's how I came to that conclusion:


I took the popcnt.cpp benchmark from http://www.dalkescientific.com/writings/diary/archive/2011/11/02/faster_popcount_update.html and changed the initialization routine so it produced a 1% dense set of bits.

#define DENSITY 100
int main()
{
    int size = BUFSIZE / sizeof(unsigned);
    buf = new unsigned[size];
    for (int i = 0; i < size; i++) {
        // buf[i] = rand();   /* original: fully random words */
        /* set each of the low 8 bits with probability 1/DENSITY */
        buf[i] = ( ((rand() % DENSITY == 0) ? 1 : 0) |
                   ((rand() % DENSITY == 0) ? 2 : 0) |
                   ((rand() % DENSITY == 0) ? 4 : 0) |
                   ((rand() % DENSITY == 0) ? 8 : 0) |
                   ((rand() % DENSITY == 0) ? 16 : 0) |
                   ((rand() % DENSITY == 0) ? 32 : 0) |
                   ((rand() % DENSITY == 0) ? 64 : 0) |
                   ((rand() % DENSITY == 0) ? 128 : 0) );
    }


I tweaked the Wegner algorithm to handle 128-bit integers (this proved to be a bit faster than with 64-bit integers):

/* Define uint128_t for gcc */
typedef unsigned int uint128_t __attribute__((mode(TI)));

static inline int popcount_wegner(uint128_t *buf, int n) {
    int cnt = 0;
    uint128_t v;
    while (n--) {
        v = *buf;
        while (v) {
            cnt++;
            v &= v - 1;   /* clear the least significant set bit */
        }
        buf++;
    }
    return cnt;
}



The timing numbers I got were:

Bitslice(7) 1181259 us; cnt = 167617
Bitslice(24) 850004 us; cnt = 167617
Lauradoux 473657 us; cnt = 167617
SSE2 8-bit 460307 us; cnt = 167617
SSE2 16-bit 491941 us; cnt = 167617
SSE2 32-bit 558696 us; cnt = 167617
SSSE3 381819 us; cnt = 167617
16-bit LUT 902937 us; cnt = 167617
8-bit LUT 1444457 us; cnt = 167617
gcc popcount 2791489 us; cnt = 167617
gcc popcountll 1439977 us; cnt = 167617
FreeBSD version 1 2634433 us; cnt = 167617
FreeBSD version 2 1776582 us; cnt = 167617
Wikipedia #2 1426408 us; cnt = 167617
Wikipedia #3 952089 us; cnt = 167617
HAKMEM 169/X11 3015327 us; cnt = 167617
naive 1722101 us; cnt = 167617
Wegner/Kernigan 820007 us; cnt = 167617
...


You can see that the Wegner method is now only about half the speed of Lauradoux, but it's still not competitive with the other, faster methods.

I also tried modifying the 8-bit LUT method so it does a short-circuit check against 0, as in:

do {
    i = *buf;
    if (i == 0) { buf++; continue; }   /* this is the new line */
    cnt += lut[i & 255];
    cnt += lut[i >> 8 & 255];
    cnt += lut[i >> 16 & 255];
    cnt += lut[i >> 24];
    buf++;
} while (--n);

but the performance only goes from

8-bit LUT 1444457 us; cnt = 167617

to

8-bit LUT 1125538 us; cnt = 167617

With a density of 1 bit in 100, a 32-bit word is all zero only about 2/3rds of the time, so apparently the check doesn't succeed often enough to pay off.


So I tried with a bit density of 1:1000 (that is, 0.1%). In that case the timings are:


8-bit LUT 556267 us; cnt = 16625
Bitslice(7) 1188504 us; cnt = 16625
Bitslice(24) 849170 us; cnt = 16625
Lauradoux 473826 us; cnt = 16625
SSE2 8-bit 461019 us; cnt = 16625
SSE2 16-bit 491207 us; cnt = 16625
SSE2 32-bit 556490 us; cnt = 16625
SSSE3 379895 us; cnt = 16625
16-bit LUT 902497 us; cnt = 16625
8-bit LUT 558576 us; cnt = 16625
gcc popcount 2789873 us; cnt = 16625
gcc popcountll 1438053 us; cnt = 16625
FreeBSD version 1 2632657 us; cnt = 16625
FreeBSD version 2 1778908 us; cnt = 16625
Wikipedia #2 1429721 us; cnt = 16625
Wikipedia #3 954920 us; cnt = 16625
HAKMEM 169/X11 3008643 us; cnt = 16625
naive 787304 us; cnt = 16625
Wegner/Kernigan 384522 us; cnt = 16625


Finally, a case where Wegner is only a bit slower than SSSE3, and the 8-bit LUT is only a bit slower than Lauradoux. However, it's also possible to optimize Lauradoux for that case, with

for (j = 0; j < 30; j += 3) {
    count1 = data[j];
    count2 = data[j+1];
    half1  = data[j+2];
    if (count1 == 0 && count2 == 0 && half1 == 0) {   /* special case for sparse bits */
        continue;
    }

which gives a time of

Lauradoux 398458 us; cnt = 16625



Going one step further, with a bit density of 1:10000, the corresponding modified Lauradoux, SSSE3, LUT, and Wegner times are:

Lauradoux 345854 us; cnt = 1652 (473002 us without the special check)
SSSE3 379182 us; cnt = 1652
8-bit LUT 483056 us; cnt = 1652
Wegner/Kernigan 350071 us; cnt = 1652
(estimated using POPCNT instruction; 250000 us)

Therefore, Wegner is never faster than the best of the other methods. (Note that this benchmark does not include a test of the POPCNT instruction from the SSE4.2 instruction set. My tests elsewhere have it at about twice the speed of Lauradoux.)

That said, on hardware without POPCNT it may still be somewhat faster to have special support for very sparse fingerprints.


Will that ever happen for real data? I don't think so.

To get an intersection bit density of 1:1000, the fingerprints need a density of about 3%, assuming a random distribution (the intersection of two independent fingerprints of density p has density p^2, and sqrt(0.001) is about 3.2%). The sparsest fingerprints are not randomly distributed, so I'll estimate 2% density. In the worst case, that's 82 bits set out of 4096. (Preliminary tests show that 4096 bits isn't more effective than 2048 bits.) The sparsest fingerprints have about 3*num_atoms bits set, so that corresponds to 27 atoms (or 14 atoms for the more realistic case of 2048 bits).

Testing one data set I have, the average molecule size is 16-17 atoms (excluding hydrogens).

Therefore, at the extreme edge it is possible to be in a situation where special support for very sparse fingerprints gives an extra 25% performance, but that requires:

- using one of the sparse fingerprints on very small molecules

- with a fingerprint size of 4096 (instead of the suggested 1024 bits for that case,
which would be 4x faster)

- on a machine without SSSE3 or SSE4.2 instructions

I believe this combination is unrealistic, and so it is not worth exploring this option further.


Instead, it would be better to develop code which specifically handles sparse fingerprints without converting/folding them to dense fingerprints.


Andrew
dalke at dalkescientific.com



