[chemfp] status of format parsers

Andrew Dalke dalke at dalkescientific.com
Wed Feb 10 04:04:03 EST 2010


Hello everyone, and welcome to the chemfp list.

I've been working on code to handle the format spec I proposed, and to see what difference a binary format makes compared to a text format for storing fingerprint data.

Only a few things have come up with the text format which need fixing in the spec. These are mostly details like stating that the columns of a fingerprint line may be separated by more than a single whitespace character (which is how I originally wrote it). I expect to go back through the spec in the next couple of weeks and resolve any open issues.
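
For concreteness, here's roughly how fingerprint lines can be read in Python. The layout (hex-encoded fingerprint first, then the identifier, with whitespace between the columns) follows the proposal; the function name and the assumption that header lines start with '#' are just how I'd sketch it, not the final API.

    import binascii

    def read_fps(infile):
        # Yield (identifier, fingerprint-as-byte-string) pairs from an
        # FPS file.  Assumes header lines start with '#' and that each
        # fingerprint line is "<hex fingerprint> <identifier>", possibly
        # with more than one whitespace character between the columns.
        for line in infile:
            if line.startswith("#") or not line.strip():
                continue
            hex_fp, identifier = line.split(None, 1)
            yield identifier.strip(), binascii.unhexlify(hex_fp)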

If you have any comments or proposals, do let me know.

I've been doing some timing tests of finding the top-N Tanimoto similarities to a given query which exceed a certain threshold. My test targets are the 24,329 compounds from the NCI data set, using OpenBabel's 1021-bit FP2 hash fingerprints.
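
(For anyone who wants to reproduce the setup, generating the fingerprints with Open Babel's Python bindings looks roughly like this; the filename is a placeholder and I'm going from memory on the pybel calls.)

    import pybel

    # Compute an FP2 fingerprint for every record in an SD file.
    # ("nci.sdf" is a placeholder filename.)
    fingerprints = {}
    for mol in pybel.readfile("sdf", "nci.sdf"):
        fingerprints[mol.title] = mol.calcfp("FP2")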

I wrote several different parsers. Note that in all cases I'm computing the Tanimoto using a C extension, with one specialized for hex strings and another for byte strings, which I use when appropriate to reduce the number of conversions.
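
For reference, here's the byte-string Tanimoto in pure Python; the C extension computes the same thing, only without the interpreter overhead. (The table-driven popcount is just one way to write it, and the names are mine.)

    # Popcount for each possible byte value, computed once.
    _POPCOUNT = [bin(i).count("1") for i in range(256)]

    def byte_tanimoto(fp1, fp2):
        # fp1 and fp2 are byte strings of the same length.
        intersect = union = 0
        for b1, b2 in zip(bytearray(fp1), bytearray(fp2)):
            intersect += _POPCOUNT[b1 & b2]
            union += _POPCOUNT[b1 | b2]
        return intersect / float(union) if union else 0.0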


1) parse the file in Python into record objects, then feed the objects to the Tanimoto algorithm. In this case the N-best algorithm is also done in Python. This requires rescanning the file for each search.

2) parse the file into an in-memory string containing the fingerprints as bytes. Here the Tanimoto search and the N-best algorithm are both implemented in C. This has a preparse step and memory overhead compared to the first approach, but the searches are much faster.

3) parse the file into B+1 in-memory strings, binned by popcount: all target fingerprints with 0 bits set go into bin 0, ..., and all target fingerprints with all bits set go into bin B. I used the same Tanimoto search and N-best algorithm as #2, with some Python code to merge the best hits from each bin, visiting the bins in the Baldi ordering to hopefully find near matches first (there's a rough sketch of this after the list).

4) save the data structure from #2 into a binary file, and load directly from that file.

5) Yesterday I realized that in #1 I build record objects only to throw them away, and I'm parsing in Python instead of C. So I wrote a C extension to parse MB-sized chunks of FPS input and compute the Tanimoto directly. This was a bit tricky to write.
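
To make #3 concrete, here's a rough sketch of the bin-visiting logic. It uses the bound (from the Swamidass and Baldi paper) that fingerprints with q and t bits set can have a Tanimoto of at most min(q,t)/max(q,t), so the bins are visited in order of decreasing bound and the scan stops once the bound falls below the threshold. Here search_bin() is just a stand-in for the C search from #2; the names and structure are illustrative, not the actual code.

    def binned_search(query_fp, query_popcount, bins, threshold, n_best):
        # bins[t] holds the target fingerprints with exactly t bits set.
        q = query_popcount
        bound = lambda t: min(q, t) / float(max(q, t)) if max(q, t) else 0.0
        # Visit the bins with the highest possible Tanimoto first.
        order = sorted(range(len(bins)), key=bound, reverse=True)
        hits = []
        for t in order:
            if bound(t) < threshold:
                break   # no remaining bin can reach the threshold
            # search_bin() is a stand-in for the C search of one bin;
            # it returns (score, id) pairs at or above the threshold.
            hits.extend(search_bin(query_fp, bins[t], threshold))
        hits.sort(reverse=True)
        return hits[:n_best]

The reason to visit the likely bins first is that once N hits are in hand, the threshold can be raised to the current N-th best score, pruning the later bins even harder; that refinement isn't shown here.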


I timed each of these approaches finding the 5 closest structures with a minimum similarity threshold of 0.4.

The results are:

1: 0.20s (FPS parsing in Python and making a generic record structure)

2: 0.27s, with 0.012 seconds spent in search

3: 0.29s, with 0.014 seconds spent in search
(The search itself is really fast, so the Python overhead of doing
the Baldi ordering probably explains the slowdown. With a threshold
of 0.9 the search time is 0.007 seconds.)

4: 0.021s, with 0.007 seconds for the search.
(The search time should be the same as #2, but there's a StringIO()
to string conversion which does a malloc and a copy. I don't think
that should take 0.005 seconds, though.)

5: 0.04s (C extension for FPS parsing and similarity calculation)


You can read the difference between #1 and #5 as meaning that C is about 5 times faster than Python. It's probably more than that, since I'm still using Python to handle the file I/O and to do some of the sorting.


The take-home message should be that bit counting is extremely fast. That leads to two different conclusions:

A. high performance means minimizing data translation and management overhead
B. few will care about high performance (after all, how many people suffer
from being able to do a Tanimoto search of 1 million compounds in 1.25
seconds instead of 0.1 s?)

However, I care. ;)

This also means I could write a C helper function to make parsing FPS files faster, but I don't know if I care enough to do that. One of the advantages of the binary format is that it's dead simple to parse, as well as fast. (Okay, so perhaps I don't care enough.)
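
As an illustration of "dead simple": loading something like the #2/#4 data structure from a binary file is little more than a couple of reads. The layout below (a small header with the count and the bytes per fingerprint, followed by one contiguous block of fingerprint data) is only a stand-in for illustration; the real FPB layout is still part of the proposal.

    import struct

    def load_fingerprint_block(filename):
        # Hypothetical layout: two little-endian 32-bit integers giving
        # the number of fingerprints and the bytes per fingerprint,
        # followed by the concatenated fingerprint bytes.
        with open(filename, "rb") as infile:
            num_fps, fp_size = struct.unpack("<II", infile.read(8))
            arena = infile.read(num_fps * fp_size)
        # The C search code can use 'arena' directly; fingerprint i is
        # arena[i*fp_size:(i+1)*fp_size].
        return num_fps, fp_size, arena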



I compared the sizes of the FPS text format and the proposed FPB binary format. The binary format has a couple of optional sections, including a table for the Baldi counts:

6_445_749 - FPS
3_405_472 - minimal FPB
3_409_576 - FPB with Baldi counts (1024 bits * 4 bytes/bit = 4K bytes extra)

The factor of almost 2 is a direct consequence of using hex encoding of each byte.
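
That is, each raw byte becomes two hex characters, before even counting the identifier column and the newline:

    import binascii
    raw = b"\x00" * 128                       # a 128-byte (1024-bit) fingerprint
    assert len(binascii.hexlify(raw)) == 256  # the hex form is twice as long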

Comparing compressed sizes (using gzip -c -9):

766_718 - FPS.gz
725_806 - minimal FPB.gz
895_372 - with Baldi counts (Why is this so much bigger? There's only
4K of extra data there, but the compressed file is about 170K larger!)


Compressed using bzip2:

772_478 - FPS
710_087 - minimal FPB
753_319 - with Baldi counts


Andrew
dalke at dalkescientific.com



