[chemfp] chemfp 1.5 release

Andrew Dalke dalke at dalkescientific.com
Wed Aug 15 18:02:58 EDT 2018


Hi all,

  I'm putting the finishing touches on chemfp 1.5, which I'll release tomorrow.

You can download it now at http://dalkescientific.com/releases/chemfp-1.5.tar.gz .

There may be a couple of documentation changes between now and then, but the code won't change.

The main goal of this release was to make chemfp 1.x more useful as a reference baseline for similarity search performance. For example, the "--times" option now distinguishes between search time and output time.

I have also sped up the k-nearest and count searches by about 10% by copying over a "fast integer rejection test" that was already used in the threshold search. In short, chemfp expects that most fingerprints will not be included in the result. Computing the Tanimoto score requires a 64-bit floating-point division, which takes a lot of time. The speedup comes from using integer multiplication to reject nearly all mismatches, so the division is only done for the few candidates that pass.
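As a rough illustration of the idea (a Python sketch, not chemfp's actual C code), the test can be written with the threshold expressed as a ratio of two integers. The function names and the num/denom representation of the threshold are assumptions made for this example. With c the intersection popcount and u the union popcount, c/u >= num/denom is equivalent to c*denom >= num*u, which needs no division:

```python
def popcount(x):
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")


def threshold_search(query, targets, num, denom):
    """Return (index, score) pairs where Tanimoto >= num/denom.

    Fingerprints are plain Python integers here. The threshold is
    given as the integer ratio num/denom (hypothetical representation).
    """
    query_popcount = popcount(query)
    hits = []
    for i, target in enumerate(targets):
        c = popcount(query & target)
        union = query_popcount + popcount(target) - c
        # Integer rejection test: skip the division for most targets.
        if c * denom < num * union:
            continue
        # Only the survivors pay for the floating-point division.
        # Two empty fingerprints are given a score of 0.0.
        score = c / float(union) if union else 0.0
        hits.append((i, score))
    return hits


query = 0b11110000
targets = [0b11110000, 0b11000000, 0b00001111, 0b11100001]
print(threshold_search(query, targets, 7, 10))
```

With a threshold of 7/10, only the first target survives the integer test, so only one division is performed for the four targets.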

Those who use chemfp without a popcount index, that is, with "reorder=False", will like that I have optimized that case. Previously it used a generic and slow scoring function. Now it uses the optimized popcount and intersect popcount functions. The result is about 5x faster.
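To make the two approaches concrete, here is a small Python sketch (Python 3, because of int.from_bytes). It is not chemfp's implementation, and all of the function names are made up for this example. The first function walks the fingerprints byte by byte with a lookup table, in the spirit of a generic byte-based scorer. The second computes the same score from a popcount and an intersection popcount taken over the whole fingerprint at once:

```python
# Lookup table: number of 1 bits for each possible byte value.
_BYTE_POPCOUNT = [bin(i).count("1") for i in range(256)]


def byte_tanimoto(fp1, fp2):
    """Generic scorer: accumulate the counts one byte at a time."""
    a = b = c = 0
    for x, y in zip(bytearray(fp1), bytearray(fp2)):
        a += _BYTE_POPCOUNT[x]
        b += _BYTE_POPCOUNT[y]
        c += _BYTE_POPCOUNT[x & y]
    union = a + b - c
    return c / float(union) if union else 0.0


def popcount(fp):
    """Popcount of a byte-string fingerprint, taken as one wide integer."""
    return bin(int.from_bytes(fp, "little")).count("1")


def intersect_popcount(fp1, fp2):
    """Popcount of the bitwise AND of two equal-length fingerprints."""
    x = int.from_bytes(fp1, "little") & int.from_bytes(fp2, "little")
    return bin(x).count("1")


def popcount_tanimoto(fp1, fp2):
    """Tanimoto from popcount and intersection popcount."""
    c = intersect_popcount(fp1, fp2)
    union = popcount(fp1) + popcount(fp2) - c
    return c / float(union) if union else 0.0


fp1 = bytes.fromhex("f0ff")
fp2 = bytes.fromhex("0f0f")
print(byte_tanimoto(fp1, fp2), popcount_tanimoto(fp1, fp2))
```

Both functions return the same score. The sketch only shows how the score decomposes; it says nothing about the relative speed of the C code.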

The full details from the CHANGELOG are below.

Best regards,

				Andrew
				dalke at dalkescientific.com



What's new in 1.5 (16 August 2018)
=================================

BUG FIX: the k-nearest symmetric Tanimoto search code contained a flaw
when there was more than one fingerprint with no bits set and the
threshold was 0.0. Since all of the scores will be 0.0, the code uses
the first k fingerprints as the matches. However, it put all of the
hits into the first search result (item 0), rather than the
corresponding result for each given query. This also opened up a race
condition in the OpenMP implementation, which could cause chemfp to
crash.

* The threshold search used a fast integer-based rejection test before
computing the exact score. The rejection test is now included in the
count and k-nearest algorithms, making them about 10% faster.

* Unindexed search (which occurs when the fingerprints are not in
popcount order) now uses the fast popcount implementations rather than
the generic byte-based one. The result is about 5x faster.

* Changed the simsearch --times option for more fine-grained
reporting. The output (sent to stderr) now looks like:

  open 0.01 read 0.08 search 0.10 output 0.27 total 0.46

where 'open' is the time to open the file and read the metadata,
'read' is the time spent reading the file, 'search' is the time for
the actual search, 'output' is the time to write the search results,
and 'total' is the total time from when the file is opened to when the
last output is written.

* Added SearchResult.format_ids_and_scores_as_bytes() to improve the
simsearch output performance when there are many hits. It turns out
the limiting factor in that case is not the search time but output
formatting. The old code used Python calls to convert each score to a
double. The new code pushes that work into C. My benchmark used a
k=all --NxN search of ~2,000 PubChem fingerprints to generate about 4M
scores. The output time went from 15.60s to 5.62s. (The search time
was only 0.11s on my laptop.)

* There is a new option, "report-algorithm" with the corresponding
environment variable CHEMFP_REPORT_ALGORITHM. The default does
nothing. Set it to "1" to have chemfp print a description of the
search algorithm used, including any specialization, and the number of
threads. For example:

  chemfp search using threshold Tanimoto arena, index, single threaded (generic)
  chemfp search using knearest Tanimoto arena symmetric, OpenMP (generic), 8 threads

This feature is only available if chemfp is compiled with OpenMP
support.

* Better error handling in simsearch so that an I/O error prints an
error message and exits rather than giving a full stack trace.

* Chemfp 3.3 added the options "use-specialized-algorithms" and
"num-column-threads", and the corresponding environment variables
CHEMFP_USE_SPECIALIZED_ALGORITHMS and CHEMFP_NUM_COLUMN_THREADS. These
are supported for future-compatibility, but will always be 0 and 1,
respectively.

* Don't warn about the CHEMFP_LICENSE or CHEMFP_LICENSE_MANAGER
variables. These are used by chemfp versions which require a license key.

* Fixed bugs in bitops.get_option(). The C API returned an error value
and raised an exception on error, and the Python API forgot to return
the value.

* The setup code now recognizes if you are using clang and will set
the OpenMP compiler flags.
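
For anyone collecting benchmark numbers, the --times line described in the changelog above is a sequence of name/value pairs, so it is easy to turn into a dictionary. The following Python sketch assumes the line has exactly the format shown in the changelog example; parse_times is a hypothetical helper, not part of chemfp:

```python
def parse_times(line):
    """Parse a simsearch --times line into a {name: seconds} dict.

    Assumes whitespace-separated name/value pairs, as in the
    changelog example.
    """
    fields = line.split()
    if len(fields) % 2 != 0:
        raise ValueError("expected name/value pairs: %r" % (line,))
    return {name: float(value)
            for name, value in zip(fields[0::2], fields[1::2])}


# The example line from the changelog.
times = parse_times("open 0.01 read 0.08 search 0.10 output 0.27 total 0.46")
print(times["search"], times["output"])
```

In the changelog example the four phases (open, read, search, output) add up to the reported total of 0.46 seconds.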


