[chemfp] scipy sparse matrix adapter

Wed Jun 14 19:20:28 EDT 2017

Hi all,

  A lot of people use the scikit-learn clustering algorithms. I've decided to add a way to convert a chemfp SearchResults into a scipy compressed sparse row (CSR) matrix, which can be passed into at least some of the clustering algorithms.

This would be part of the upcoming 1.3 release (the no-cost version) and in 3.1 (the commercial version).

I need feedback because I have no experience with scipy or the clustering tools in scikit-learn. I have put a prototype version of the code, which works with chemfp-1.1, at

 http://dalkescientific.com/chemfp_to_scipy_csr.py 

In theory (see previous disclaimer), when run as a command-line tool it will use DBSCAN to cluster the specified fingerprint file.

Here's the command-line --help:

usage: chemfp_to_scipy_csr.py [-h] [-t FLOAT] [--eps FLOAT]
                             [--min-samples INT] [--num-jobs INT]
                             FILENAME

test prototype adapter between chemfp and scipy.cluster using DBSCAN

positional arguments:
 FILENAME

optional arguments:
 -h, --help            show this help message and exit
 -t FLOAT, --threshold FLOAT
                       minimum similarity threshold (default: 0.8)
 --eps FLOAT           The maximum distance between two samples for them to
                       be considered as in the same neighborhood. (default:
                       0.1)
 --min-samples INT     The number of samples (or total weight) in a
                       neighborhood for a point to be considered as a core
                       point. This includes the point itself. (default: 5)
 --num-jobs INT, -j INT
                       The number of parallel jobs to run. If -1, then the
                       number of jobs is set to the number of CPU cores.
                       (default: 1)

Internally the important function is:

  to_scipy_csr(
    search_results,  # a chemfp SearchResults object from the NxN or NxM search

    reorder=True,    # by default ensure the SearchResults are ordered by index

    dtype=None,      # by default store the scores as a float64, which is what
                     # chemfp uses internally. You could use float32 to save space.

    as_distance=False,  # if True, return the values as the distance "1-Tanimoto"
    num_cols=None):  # work around a current limitation in chemfp
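
To make the layout concrete, here is a minimal pure-Python sketch of how per-query hit lists flatten into the three arrays behind a CSR matrix. It is not the prototype's code: "rows_to_csr_arrays" and the "rows" input (one list of (target_index, score) pairs per query, standing in for a SearchResults) are hypothetical names made up for this illustration. Only the "reorder" and "as_distance" options are mirrored.

```python
def rows_to_csr_arrays(rows, reorder=True, as_distance=False):
    """Flatten per-query hit lists into the CSR (data, indices, indptr) arrays.

    `rows` is a hypothetical stand-in for a SearchResults object:
    one list of (target_index, score) pairs per query.
    """
    data = []      # one score (or distance) per stored hit
    indices = []   # the column (target) index of each stored hit
    indptr = [0]   # row i occupies data[indptr[i]:indptr[i+1]]
    for hits in rows:
        if reorder:
            hits = sorted(hits)  # order each row by target index
        for index, score in hits:
            indices.append(index)
            data.append(1.0 - score if as_distance else score)
        indptr.append(len(indices))
    return data, indices, indptr


# Three queries; the last one has no hits above the threshold.
rows = [
    [(1, 0.9), (0, 1.0)],
    [(0, 0.9)],
    [],
]
data, indices, indptr = rows_to_csr_arrays(rows)
```

If scipy is installed, these three arrays can be handed to scipy.sparse.csr_matrix((data, indices, indptr), shape=(num_rows, num_cols)). The shape has to be given explicitly because it cannot always be recovered from the hits alone.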

One limitation is that it will copy all of the data to a new array, so it will roughly double the memory consumption (or increase it by a bit over 50% if you use "float32" instead of the default "float64").
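
As a rough back-of-the-envelope check of those numbers, the sketch below compares the per-hit cost of the copy with the per-hit cost of the original search results. The byte sizes are assumptions, not measurements: it supposes that each hit is held as an 8-byte float64 score plus a 4-byte index on both sides, and it ignores the small per-row pointer array of the CSR matrix.

```python
# Assumption: the search results hold each hit as a float64 score (8 bytes)
# plus a 32-bit index (4 bytes).
ORIGINAL_BYTES_PER_HIT = 8 + 4


def csr_bytes_per_hit(score_bytes, index_bytes=4):
    """Bytes the CSR copy needs per stored hit (score + column index)."""
    return score_bytes + index_bytes


def extra_fraction(score_bytes):
    """Memory added by the copy, relative to the original search results."""
    return float(csr_bytes_per_hit(score_bytes)) / ORIGINAL_BYTES_PER_HIT


for name, score_bytes in (("float64", 8), ("float32", 4)):
    print("%s adds about %.0f%%" % (name, 100 * extra_fraction(score_bytes)))
```

Under those assumptions the float64 copy adds about 100% and the float32 copy about 67%.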

What I want to know is:
 - does the DBSCAN example work and make sense? I could easily have messed up the array construction, and I don't have an easy way to verify it because I would be re-using my own assumptions.

 - does it work with the other clustering algorithms?

 - are there other arguments I should add to make it easier to work with another clustering algorithm? (For example, I added the "as_distance" option to compute 1-similarity once I learned that DBSCAN needs a distance.)
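
One consequence of "as_distance" that may be worth checking: the search only keeps pairs with a similarity of at least the threshold, so the largest distance that can appear in the matrix is 1 - threshold. An eps larger than that would ask about neighbors the sparse matrix does not contain. The helper names below are hypothetical and only sketch the check; they are not part of the prototype.

```python
def max_stored_distance(threshold):
    """Largest 1-Tanimoto distance a threshold search can leave in the matrix."""
    return 1.0 - threshold


def eps_is_covered(eps, threshold, tol=1e-12):
    """True if every pair within `eps` is guaranteed to be in the matrix.

    `tol` absorbs floating-point rounding in 1.0 - threshold.
    """
    return eps <= max_stored_distance(threshold) + tol


# The command-line defaults: --threshold 0.8 and --eps 0.1
defaults_ok = eps_is_covered(eps=0.1, threshold=0.8)
# An eps of 0.3 would need pairs down to similarity 0.7, which were not kept.
too_wide = eps_is_covered(eps=0.3, threshold=0.8)
```

With the defaults the check passes, since 0.1 is below 1 - 0.8.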

				Andrew
				dalke at dalkescientific.com