[chemfp] initial FPS Tanimoto search code

Andrew Dalke dalke at dalkescientific.com
Sun Feb 14 23:37:31 EST 2010


I've been doing a lot of work to make the chemfp Python library handle the text-oriented fps files both well and fast. It handles over 600,000 1024-bit fingerprints per second, reading from disk, on a recent-model MacBook Pro.

The code is now pushed onto the project hg repository. It includes some rather experimental code for multithreaded search, which I'm going to pull out but will report on later in this email.

You'll have to use the normal "python setup.py install" to build and install the code.

It uses a C extension because popcount is just something that Python is poor at. My best Python implementation was about 1/8th the speed of the code in C. Once I had the C code I started adding to it, to speed up even more bits of the fps processing by working directly with the raw file format in C. I'm now using Python mostly for I/O and memory management, and C for performance.

As a result there's about 600 lines of C code for what's about 60 lines of Python, but the code is about 6 times faster than the version which only uses a C extension for population counts. (I haven't implemented a fully pure-Python solution, but I would guess it's about 12x slower than the version which uses the C extension for population counts.)
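For a sense of what the C extension replaces, here's a typical pure-Python popcount using a 256-entry lookup table. This is just an illustrative sketch, not chemfp's code:

# Sketch only -- not chemfp's implementation. Precompute the number
# of set bits in every possible byte value, then sum over the bytes
# of the fingerprint string.
_popcount_table = [bin(i).count("1") for i in range(256)]

def byte_popcount(fp):
    # fp is a raw fingerprint byte string
    return sum(_popcount_table[ord(c)] for c in fp)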


Here's an example of how to use it. I'm working here with the NCI data set converted to an fps file using RDKit, through rdkit2fps:

First step: open the fps file and get the first record.


>>> import chemfp
>>> targets = chemfp.open("nci.fps")
>>> fp, id = next(iter(targets))
>>> len(fp)
128
>>> fp[:10], id
(' \x02\x06\x00\x00\x00\x00\x04\x02\x80', '9425004')


Because it's file-based, I need to reset the file stream in order to reuse it.


>>> targets.reset()


There are some built-in high-level search routines. This one figures out how many compounds have a Tanimoto similarity of at least 0.9 with the query fingerprint. Since the query is itself one of the targets, the answer must be at least 1.


>>> chemfp.tanimoto_count(fp, targets, 0.9)
1

What about a lower threshold?


>>> chemfp.tanimoto_count(fp, targets, 0.7)
0

Oops! I need to reset the stream. (Should I do this automatically?)


>>> targets.reset()
>>> chemfp.tanimoto_count(fp, targets, threshold=0.4)
97

That's not so helpful. What about reporting the 10 closest structures? The "tanimoto_knearest" function takes both k and threshold values.


>>> targets.reset()
>>> chemfp.tanimoto_knearest(fp, targets, k=10, threshold=0.4)
[('9425004', 1.0), ('9425034', 0.80821917808219179), ('9425021', 0.68681318681318682), ('9428620', 0.67759562841530052), ('9428618', 0.66272189349112431), ('9428616', 0.63030303030303025), ('9428627', 0.6211180124223602), ('9425032', 0.60256410256410253), ('9425062', 0.59523809523809523), ('9425057', 0.59281437125748504)]
>>>


Or perhaps you want to see the distribution of all scores? For that, use the "tanimoto()" function, which also takes a threshold, though I'm using the default of 0.0 here:


>>> targets.reset()
>>> bins = [0] * 11
>>> for target_id, score in chemfp.tanimoto(fp, targets):
...     bins[int(score*10)] += 1
...
>>> bins
[75, 15292, 8331, 528, 51, 38, 6, 0, 1, 0, 1]
>>> sum(bins)
24323
>>>


You can see that most scores are in the 10% <= score < 20% range.



The counts code is the fastest (though k-nearest is very close, at least when k is small), and can process about two-thirds of a million fingerprints per second.


>>> import time
>>> targets.reset()
>>> t1 = time.time(); chemfp.tanimoto_count(fp, targets, 0.0); t2 = time.time()
24323
>>> print 24323/(t2-t1)
668488.671725
>>>


(The slowest code is chemfp.tanimoto() which has to extract all identifiers and return all Tanimoto scores to Python. That takes 0.12s, or a bit over 200,000 fingerprints per second.)

These numbers should be comparable to those reported by Kristensen, Nielsen, and Pedersen (same fingerprint size, very similar CPU):

http://www.almob.org/content/pdf/1748-7188-5-9.pdf

Their method processes about half a million fingerprints per second for Tanimoto scores < 0.8. The performance here comes from using C and from a data representation which is much faster to process (though that required a lot of C code).
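For reference, the per-comparison work is small: the Tanimoto score of two binary fingerprints A and B is popcount(A & B) / popcount(A | B). In pure Python that might look like the following sketch, reusing the illustrative byte lookup table from earlier (chemfp does all of this in C):

def tanimoto_score(fp1, fp2):
    # Tanimoto(A, B) = |A & B| / |A | B| over the fingerprint bits.
    # fp1 and fp2 are equal-length byte strings.
    and_count = or_count = 0
    for a, b in zip(fp1, fp2):
        x, y = ord(a), ord(b)
        and_count += _popcount_table[x & y]
        or_count += _popcount_table[x | y]
    if or_count == 0:
        return 0.0
    return float(and_count) / or_count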


BTW, my numbers are about 15% faster when the data is all in memory instead of read from disk. If performance is a concern, it's best to load the fps data into an in-memory binary structure, which saves memory and speeds things up further.
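Even without the binary structure, just reading the records into a list avoids the per-search file I/O and the reset() calls. A sketch of the idea (it assumes the search functions accept a plain list of (fingerprint, id) records, not just an open fps reader):

import chemfp

# Read everything once into a plain Python list so that repeated
# searches don't need to re-read or reset the file stream.
records = list(chemfp.open("nci.fps"))   # (fingerprint, id) pairs
query_fp = records[0][0]

# Repeated searches reuse the same in-memory data; no reset() needed.
print chemfp.tanimoto_count(query_fp, records, 0.9)
print chemfp.tanimoto_count(query_fp, records, 0.4)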

I also implemented a multi-threaded version of the count code using the in-memory data structure. Using two threads gives me a 40% speedup on my dual-core machine. That code is in fps_reader.py but will be removed, re-written, and relocated.
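The general shape of that experiment is something like the following sketch (not the actual fps_reader.py code, and threaded_tanimoto_count is a made-up name; threads only help here if the C search code releases the GIL, which the 40% speedup suggests it does):

import threading
import chemfp

def threaded_tanimoto_count(query_fp, records, threshold, num_threads=2):
    # Split the in-memory records into one chunk per thread, count
    # matches in each chunk concurrently, then total the counts.
    chunk_size = (len(records) + num_threads - 1) // num_threads
    counts = [0] * num_threads

    def count_chunk(i):
        chunk = records[i*chunk_size:(i+1)*chunk_size]
        counts[i] = chemfp.tanimoto_count(query_fp, chunk, threshold)

    threads = [threading.Thread(target=count_chunk, args=(i,))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(counts)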


The code in version control should work, but it's still for the brave - try it out!


Andrew
dalke at dalkescientific.com



