[chemfp] chemfp-1.4 and chemfp-3.2 are available

Andrew Dalke dalke at dalkescientific.com
Mon Mar 19 19:41:00 EDT 2018


Hi chemfp users!

I've just released chemfp 1.4 (the no-cost version) and chemfp-3.2 (the commercial version).

You can download version 1.4 at http://dalkescientific.com/releases/chemfp-1.4.tar.gz .

The relevant section of the CHANGELOG is below.

Please let me know if there are any problems.

I'll be making a more public announcement tomorrow.

Cheers!


				Andrew
				dalke at dalkescientific.com

What's new in 1.4 (19 March 2018)
=================================

This version mostly contains bug fixes and internal improvements. The
biggest additions are the fpcat command-line program, support for Dave
Cosgrove's 'flush' fingerprint file format, and support for
'fromAtoms' in some of the RDKit fingerprints.

The configuration has changed to use setuptools.

Previously the command-line programs were installed as small
scripts. Now they are created and installed using the
"console_scripts" entry_point as part of the install process. This is
more in line with the modern way of installing command-line tools for
Python.

If these scripts are no longer installed correctly, please let me
know.

The :ref:`fpcat <fpcat>` command-line tool was back-ported from
chemfp 3.1. It can be used to merge a set of FPS files together, and
to convert to/from the flush file format. This version does not
support the FPB file format.

If you have installed the `chemfp_converters package
<https://pypi.python.org/pypi/chemfp-converters/>`_ then chemfp will
use it to read and write fingerprint files in flush format. It can be
used as output from the *2fps programs, as input and output to fpcat,
and as query input to simsearch.

Added "fromAtoms" support for the RDKit hash, torsion, Morgan, and
pair fingerprints. This is primarily useful if you want to generate
the circular environment around specific atoms of a single molecule,
and you know the atom indices. If you pass in multiple molecules then
the same indices will be used for all of them. Out-of-range values are
ignored.

The command-line option is "--from-atoms", which takes a
comma-separated list of non-negative integer atom indices. For
example:

        --from-atoms 0
        --from-atoms 29,30
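As a rough illustration of what such an option value looks like once
parsed, here is a minimal sketch in Python. It is not chemfp's own
parser; the function names parse_from_atoms and drop_out_of_range are
made up for this example. The second helper mirrors the "out-of-range
values are ignored" behavior described above.

```python
def parse_from_atoms(text):
    """Parse a comma-separated list of non-negative integer atom indices.

    Hypothetical helper for illustration; not part of chemfp.
    """
    indices = []
    for term in text.split(","):
        try:
            value = int(term)
        except ValueError:
            raise ValueError("not an integer atom index: %r" % (term,))
        if value < 0:
            raise ValueError("atom index must be non-negative: %r" % (term,))
        indices.append(value)
    return indices


def drop_out_of_range(indices, num_atoms):
    """Keep only the indices which are valid for a molecule with num_atoms atoms."""
    return [i for i in indices if i < num_atoms]


print(parse_from_atoms("29,30"))
print(drop_out_of_range([0, 29, 30], 30))
```

With a 30-atom molecule the index 30 is out of range, so only 0 and 29
would be used.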

The corresponding fingerprint type strings have also been updated. If
fromAtoms is specified then the string "fromAtoms=i,j,k,..." is added
to the string. If it is not specified then the fromAtoms term is not
present, in order to maintain compatibility with older type
strings. (The philosophy is that two fingerprint types are equivalent
if and only if their type strings are equivalent.)
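The rule can be sketched as follows. This is only an illustration of
the "add the term only when specified" idea, not chemfp's code. The
helper name with_from_atoms and the base type string "Example-FP/1
fpSize=2048" are placeholders, not real chemfp type strings.

```python
def with_from_atoms(type_string, from_atoms=None):
    """Return the type string, adding a fromAtoms term only if one was given.

    Hypothetical helper; the real type strings come from chemfp itself.
    """
    if from_atoms is None:
        # No fromAtoms: the string is unchanged, so it still matches
        # type strings written by older versions.
        return type_string
    term = "fromAtoms=" + ",".join(str(i) for i in from_atoms)
    return type_string + " " + term


base = "Example-FP/1 fpSize=2048"  # placeholder type string
print(with_from_atoms(base))
print(with_from_atoms(base, [29, 30]))
```

Because the term is left out when fromAtoms is not used, comparing two
type strings for equality continues to work across versions.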

The --from-atoms option is only useful when there's a single query and
when you have some other mechanism to determine which subset of the
atoms to use. For example, you might parse a SMILES, use a SMARTS
pattern to find the subset, get the indices of the SMARTS match, and
pass the SMILES and indices to rdk2fps to generate the fingerprint for
that substructure.

Be aware that the union of the fingerprint for --from-atoms X and the
fingerprint for --from-atoms Y might not be equal to the fingerprint
for --from-atoms X,Y. However, if a bit is present in the union of the
X and Y fingerprints then it will be present in the X,Y fingerprint.

Why?  The fingerprint implementation first generates a sparse count
fingerprint, then converts that to a bitstring fingerprint. The
conversion is affected by the feature count. If a feature is present
in both X and Y then the X,Y fingerprint may have additional bits set
over the individual fingerprints.
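A toy model may make this easier to see. The sketch below is not the
RDKit algorithm. It assumes a made-up encoding where each feature gets
one bit per count threshold (1, 2, 4, 8), and it assumes the counts
for X,Y are simply the sum of the counts for X and for Y. The feature
names are invented.

```python
from collections import Counter

# Assumed count thresholds for this toy encoding: a feature with count
# c sets one bit for each threshold t where c >= t.
THRESHOLDS = (1, 2, 4, 8)


def count_to_bits(counts):
    """Convert a sparse count fingerprint into a set of (feature, slot) bits."""
    bits = set()
    for feature, count in counts.items():
        for slot, minimum in enumerate(THRESHOLDS):
            if count >= minimum:
                bits.add((feature, slot))
    return bits


x_counts = Counter({"C-C": 1, "C-O": 1})   # features found from atoms X
y_counts = Counter({"C-C": 1, "C-N": 1})   # features found from atoms Y
xy_counts = x_counts + y_counts            # "C-C" now has a count of 2

union_bits = count_to_bits(x_counts) | count_to_bits(y_counts)
xy_bits = count_to_bits(xy_counts)

print(sorted(xy_bits - union_bits))   # bits only in the X,Y fingerprint
print(union_bits <= xy_bits)          # every union bit is in X,Y
```

In this model the shared "C-C" feature reaches a count of 2 only in the
combined fingerprint, which sets one extra bit that neither individual
fingerprint has.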

The ob2fps, rdk2fps, and oe2fps programs now also include the chemfp
version information on the software line of the metadata. This
improves data provenance because the fingerprint output might be
affected by a bug in chemfp.

The Metadata 'date' is now always a datetime instance, and not a
string. If you pass a string into the Metadata constructor, like
Metadata(date="datestr"), then the date will be converted to a
datetime instance. Use "metadata.datestamp" to get the ISO string
representation of the Metadata date.
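The behavior is roughly like the following stand-in class. This is not
the chemfp Metadata class. ToyMetadata is a made-up name, and the
accepted string format ("YYYY-MM-DDTHH:MM:SS") is an assumption for
the example.

```python
from datetime import datetime


class ToyMetadata(object):
    """Stand-in which stores 'date' as a datetime, even if given a string."""

    def __init__(self, date=None):
        if isinstance(date, str):
            # Assumed input format for this sketch.
            date = datetime.strptime(date, "%Y-%m-%dT%H:%M:%S")
        self.date = date

    @property
    def datestamp(self):
        """The date as an ISO string, or None if there is no date."""
        if self.date is None:
            return None
        return self.date.replace(microsecond=0).isoformat()


metadata = ToyMetadata(date="2018-03-19T19:41:00")
print(type(metadata.date).__name__)
print(metadata.datestamp)
```

Code which used to treat the date as a string should switch to the
datestamp form.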

Fixed a bug where a k=0 similarity search using an FPS file as the
targets caused a segfault. The code assumed that k would be at least
1. With the fix, a k=0 search will read the entire file, checking for
format errors, and return no hits.

Fixed a bug where only the first ~100 queries against an FPS
target search would return the correct ids. (Forgot to include the
block offset when extracting the ids.)

Fixed a bug where if the query fingerprint had 1 bit set and the
threshold was 0.0 then the sublinear bounds for the Tanimoto searches
(used when there is a popcount index) failed to check targets with 0
bits set.
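For background, the sublinear bounds come from the fact that a target
with b bits set can only reach a Tanimoto score of t against a query
with a bits set if a*t <= b <= a/t. The sketch below shows that
standard bound, with the threshold of 0.0 handled as a special case.
It is not chemfp's implementation, the function name popcount_bounds
is made up, and it ignores the floating point rounding issues a real
implementation must deal with.

```python
import math


def popcount_bounds(query_popcount, threshold, num_bits):
    """Return the (min, max) target popcounts which need to be searched.

    Hypothetical helper which illustrates the standard Tanimoto bound.
    """
    if not (0.0 <= threshold <= 1.0):
        raise ValueError("threshold must be between 0.0 and 1.0")
    if threshold == 0.0:
        # Every score is >= 0.0, so every target qualifies,
        # including the ones with no bits set.
        return 0, num_bits
    low = int(math.ceil(query_popcount * threshold))
    high = min(num_bits, int(math.floor(query_popcount / threshold)))
    return low, high


print(popcount_bounds(10, 0.5, 2048))
print(popcount_bounds(1, 0.0, 2048))
```

The second call is the case from the bug report: the lower bound must
be 0 so that targets with 0 bits set are still checked.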

