[chemfp] call for chemfp-1.3 beta testers

Andrew Dalke dalke at dalkescientific.com
Wed Aug 30 08:05:25 EDT 2017


Hi all,

  As I mentioned a few months ago, I'm putting together a chemfp-1.3 release. A feature-complete version is ready, as of this morning. This includes some performance improvements and some back-ports from the commercial chemfp-3.0 version.

I'm looking for people who are willing to test it out, that is, make sure it compiles on their machines and works with their code.

If you are interested, please reply to me directly and I'll let you know how to download it.

The biggest performance boost is for the similarity searches of the 166-bit MACCS keys, which are now about 40% faster.

Some of these back-ports are:
  - support for RDKit Pattern and Avalon fingerprints
  - updates for the latest versions of all three supported toolkits
  - FingerprintWriter class

I've put the admittedly terse CHANGELOG at the end of this email.

It is not ready for general use. There is still some testing to do, and the documentation has not been updated. It will likely be a few weeks until that is ready.

Best regards,

				Andrew
				dalke at dalkescientific.com

What's new in 1.3a1 (30 Aug 2017)
=================================

This version of chemfp only supports Python 2.7. Version chemfp-1.1
supports 2.5 and 2.6. For Python 3.5+ support, ask about buying a copy
of chemfp-3.1.

WARNING: Changed the default nBitsPerHash for RDKitFingerprint from 4
to 2 to match the RDKit default.

RDKit changed its hash and MACCS fingerprint implementation a few
years ago. Updated chemfp to identify newer implementations as
"RDKit-Fingerprint/2" and "RDKit-MACCS166/2".

Added support for RDKit-Pattern and RDKit-Avalon fingerprints. The new
rdkit2fps command-line options are "--pattern" and "--avalon".

RDKit-Pattern/1 is from very old versions of RDKit. RDKit-Pattern/2 is
up to 2016, RDKit-Pattern/3 is from 2017.3 and RDKit-Pattern/4 will be
in 2017.9.

Added a definition for key 44 to the 'RDMACCS' fingerprints. It was missing in
version 1. Chemfp supports both definitions. The rdkit2fps option
"--rdmaccs" uses the most recent version. To be specific, specify
either "--rdmaccs/1" or "--rdmaccs/2".

Removed support for OEGraphSim v1.0, which OpenEye replaced in 2010.

New OpenEye-MACCS166/3 fingerprint type, to match OEGraphSim v2.2.0.

Improved the FPS reader performance. Simsearch in '--scan' mode is
about 40% faster and '--memory' load time is about 10%
faster. chemfp.load_fingerprints() is about 15% faster. (Measured as
(old_time-new_time)/old_time.)
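
As a quick illustration of what those percentages mean, here is a small sketch of the formula above. The function name and the timings are made up for the example; they are not chemfp measurements.

```python
def fraction_faster(old_time, new_time):
    """Return the relative time saved, (old_time - new_time) / old_time."""
    return (old_time - new_time) / float(old_time)

# Hypothetical timings, in seconds, chosen only to illustrate the formula.
saved = fraction_faster(10.0, 6.0)
print("%.0f%% faster" % (saved * 100))
```

By this measure, "40% faster" means the new run takes 60% of the old time, which is roughly 1.67 times the old throughput.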

Improved the similarity search performance of the 166-bit MACCS keys
by about 40%.

Backported the FPS reader and writer code from chemfp-3.0 as well
as support for io.Location.

Renamed chemfp.read_structure_fingerprints() to
chemfp.read_molecule_fingerprints(). The old API is still valid, but
the first call to it will generate a warning message.
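
A rename like this is commonly handled with a thin forwarding function. The following is only a sketch of that pattern, not the chemfp source: the stand-in reader, the message text, and the choice of DeprecationWarning are all assumptions.

```python
import warnings

def read_molecule_fingerprints(*args, **kwargs):
    # Stand-in for the real reader; it only echoes its arguments.
    return args, kwargs

_warned = False

def read_structure_fingerprints(*args, **kwargs):
    """Old name: warn on the first call, then forward to the new name."""
    global _warned
    if not _warned:
        _warned = True
        warnings.warn(
            "read_structure_fingerprints() is renamed to "
            "read_molecule_fingerprints()",
            DeprecationWarning, stacklevel=2)
    return read_molecule_fingerprints(*args, **kwargs)
```

Existing code keeps working through the old name, and switching to the new name removes the warning.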

Fix: Some of the Tanimoto calculations stored intermediate values as a
double. Some of the values, like 0.6, cannot be represented exactly as
a double. As a result, some Tanimoto scores were off by 1 ulp (the
last bit in the double). They are now exactly correct.
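
The effect is easy to reproduce in plain Python. This sketch is not the chemfp code; it only shows how rounding an intermediate value to a double can change the last bit of a score that should be 3/5.

```python
# Tanimoto for a pair with 3 bits in common and 5 bits in the union.
intersection, union = 3, 5

direct = intersection / float(union)            # rounds once
via_reciprocal = intersection * (1.0 / union)   # rounds twice

print(repr(direct))
print(repr(via_reciprocal))
print(direct == via_reciprocal)
```

The two results differ in the last bit, which matters when a score is compared against a threshold such as 0.6.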

Fix: if the query fingerprint had 1 bit set and the threshold was 0.0
then the sublinear bounds for the Tanimoto searches (used when there
is a popcount index) failed to check targets with 0 bits set.
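
For context, the sublinear search relies on the usual popcount bounds for Tanimoto: against a query with q bits set, a target can only reach threshold t if its popcount lies between q*t and q/t. The sketch below illustrates those bounds under that assumption. It is not chemfp's implementation, popcount_bounds is a hypothetical name, and a production version would also need to guard against floating-point rounding at the edges.

```python
import math

def popcount_bounds(query_popcount, threshold, num_bits):
    """Return the (low, high) target popcounts that could reach the threshold."""
    if threshold <= 0.0:
        # Every target scores >= 0.0, including those with no bits set.
        return 0, num_bits
    low = int(math.ceil(query_popcount * threshold))
    high = min(num_bits, int(math.floor(query_popcount / threshold)))
    return low, high

print(popcount_bounds(1, 0.0, 166))
print(popcount_bounds(10, 0.5, 166))
```

With a threshold of 0.0 the lower bound has to be 0, which is the case the fix above addresses.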

Fix: If a query had 0 bits then the k-nearest code for a symmetric
arena returned 0 matches, even when the threshold was 0.0. It now
returns the first k targets.

There was a bug in the sublinear range checks. It should only occur in
symmetric searches when the batch_size is larger than the number of
records with a popcount just outside of the expected range.

Changed rdkit2fps, ob2fps, and oe2fps so the default --errors is
'ignore' instead of 'strict'. This is based on a lot of feedback
asking how to make those tools ignore errors. I decided that silent
errors (at the chemfp level, but toolkits still send warnings and
errors to stderr) were simply not the right thing for those tools.


The --with-* and --without-* configuration options (for OpenMP and
SSSE3 support) can now also be specified via environment variables. In
the following, the value "0" means disable (same as "--without-*") and
"1" means enable (same as "--with-*"):
  CHEMFP_OPENMP -  compile for OpenMP (default: "1")
  CHEMFP_SSSE3  -  compile SSSE3 popcount support (default: "1")
  CHEMFP_AVX2   -  compile AVX2 popcount support (default: "0")

This makes it easier to do a "pip install" directly on the tar.gz file
or use chemfp under an automated testing system like tox, even when
the default options are not appropriate. For example, the default C
compiler on Mac OS X doesn't support OpenMP. If you want OpenMP
support then install gcc and specify it with the "CC" environment variable. If you don't
want OpenMP support then you can do:

  CHEMFP_OPENMP=0 pip install chemfp-1.3a1.tar.gz
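
To show how a "0"/"1" setting like this can be interpreted, here is a minimal sketch of a helper that a setup script might use. The env_flag name and the error handling are assumptions for illustration and are not taken from chemfp's setup.py; only the variable names and defaults come from the list above.

```python
import os

def env_flag(name, default):
    """Interpret an environment variable as "0" (disable) or "1" (enable)."""
    value = os.environ.get(name, default)
    if value not in ("0", "1"):
        raise ValueError("%s must be '0' or '1', not %r" % (name, value))
    return value == "1"

use_openmp = env_flag("CHEMFP_OPENMP", "1")
use_ssse3 = env_flag("CHEMFP_SSSE3", "1")
use_avx2 = env_flag("CHEMFP_AVX2", "0")
print(use_openmp, use_ssse3, use_avx2)
```

Rejecting values other than "0" and "1" makes a typo fail loudly instead of silently falling back to the default.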

Backported bitops functions from chemfp-3.0. The new functions are:
  hex_contains, hex_contains_bit, hex_intersect, hex_union, hex_difference,
  byte_hex_tanimoto, byte_contains_bit,
  byte_to_bitlist, byte_from_bitlist,
  hex_to_bitlist, hex_from_bitlist,
  hex_encode, hex_encode_as_bytes, hex_decode

The new hex encode/decode functions are important if you want to write
code which is forward compatible for Python 3, where s.encode("hex")
is no longer supported.
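
For comparison, the standard library offers a hex round trip that behaves the same on Python 2.7 and Python 3. This sketch uses binascii rather than the new bitops functions, whose exact signatures and return types should be checked against the chemfp documentation.

```python
import binascii

fp = b"\x01\xab\xff"   # a 24-bit fingerprint as a byte string

# hexlify() takes bytes and returns bytes on both Python 2.7 and 3.
hex_fp = binascii.hexlify(fp)
assert binascii.unhexlify(hex_fp) == fp

print(hex_fp.decode("ascii"))
```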

