[chemfp] scipy sparse matrix adapter
Andrew Dalke
dalke at dalkescientific.com
Wed Jun 14 19:20:28 EDT 2017
Hi all,
A lot of people use the scikit-learn clustering algorithms. I've decided to add a way to convert a chemfp SearchResults into a scipy sparse row matrix, which can be passed into at least some of the clustering algorithms.
This would be part of the upcoming 1.3 release (the no-cost version) and in 3.1 (the commercial version).
I need feedback because I have no experience with scipy or the clustering tools in scikit-learn. I have put a prototype version of the code, which works with chemfp-1.1, at
http://dalkescientific.com/chemfp_to_scipy_csr.py
In theory (see previous disclaimer), when run as a command-line tool it will use DBSCAN to cluster the specified fingerprint file.
Here's the command-line --help:
usage: chemfp_to_scipy_csr.py [-h] [-t FLOAT] [--eps FLOAT]
                              [--min-samples INT] [--num-jobs INT]
                              FILENAME

test prototype adapter between chemfp and scipy.cluster using DBSCAN

positional arguments:
  FILENAME

optional arguments:
  -h, --help            show this help message and exit
  -t FLOAT, --threshold FLOAT
                        minimum similarity threshold (default: 0.8)
  --eps FLOAT           The maximum distance between two samples for them to
                        be considered as in the same neighborhood. (default:
                        0.1)
  --min-samples INT     The number of samples (or total weight) in a
                        neighborhood for a point to be considered as a core
                        point. This includes the point itself. (default: 5)
  --num-jobs INT, -j INT
                        The number of parallel jobs to run. If -1, then the
                        number of jobs is set to the number of CPU cores.
                        (default: 1)
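To make the relationship between --threshold, --eps and --min-samples concrete, here is a rough sketch of the DBSCAN core-point test written in plain Python. It is not the prototype's code and it is not scikit-learn's implementation. The function name core_points and the list-of-(index, score) input are made up for illustration. It assumes the distance is 1 - similarity, and that a pair missing from the search results can never count as a neighbor.

```python
def core_points(similarity_rows, eps=0.1, min_samples=5):
    """Return the indices of the DBSCAN core points.

    similarity_rows[i] is a list of (j, score) pairs: the hits for
    fingerprint i that passed the similarity threshold (hypothetical
    stand-in for a chemfp SearchResults).
    """
    core = []
    for i, hits in enumerate(similarity_rows):
        # A hit is a neighbor when its distance (1 - similarity) is <= eps.
        neighbors = {j for j, score in hits if 1.0 - score <= eps}
        # The point itself is included in its own neighborhood.
        neighbors.add(i)
        if len(neighbors) >= min_samples:
            core.append(i)
    return core


# Three fingerprints; 0 and 1 are very similar, 2 is a weaker match to 0.
rows = [
    [(1, 0.95), (2, 0.85)],
    [(0, 0.95)],
    [(0, 0.85)],
]
tight = core_points(rows, eps=0.1, min_samples=2)
loose = core_points(rows, eps=0.2, min_samples=2)
```

With the defaults above (threshold 0.8, eps 0.1), hits with a similarity between 0.8 and 0.9 are stored in the matrix but are too far away to count as neighbors. Going the other way, an eps larger than 1 - threshold cannot find any extra neighbors, because those pairs were never stored.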
Internally the important function is:
  to_scipy_csr(
      search_results,     # a chemfp SearchResults object from the NxN or NxM search
      reorder=True,       # by default ensure the SearchResults are ordered by index
      dtype=None,         # by default store the scores as a float64, which is what
                          # chemfp uses internally. You could use float32 to save space.
      as_distance=False,  # if True, return the values as the distance "1-Tanimoto"
      num_cols=None):     # work around a current limitation in chemfp
One limitation is that it copies all of the data to a new array, which roughly doubles the memory consumption (or increases it by a bit over 50% if you use "float32" instead of the default "float64").
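For anyone unfamiliar with the CSR layout, here is a minimal sketch of the kind of copy involved. It is not the prototype's actual code. The name hits_to_csr_arrays and the list-of-(column, score) input are hypothetical stand-ins for a SearchResults object. It builds the three flat arrays that CSR uses: the stored values, their column indices, and the row pointers.

```python
def hits_to_csr_arrays(rows, as_distance=False):
    """Flatten per-row hits into the CSR (data, indices, indptr) arrays.

    rows[i] is a list of (column_index, score) pairs for row i
    (hypothetical stand-in for a chemfp SearchResults).
    """
    data, indices, indptr = [], [], [0]
    for hits in rows:
        # Sort by column index so each row is in canonical CSR order
        # (compare the reorder=True option described above).
        for col, score in sorted(hits):
            indices.append(col)
            data.append(1.0 - score if as_distance else score)
        # indptr[i]:indptr[i+1] is the slice of data/indices for row i.
        indptr.append(len(indices))
    return data, indices, indptr


rows = [
    [(1, 0.9), (0, 1.0)],
    [(0, 0.9), (1, 1.0), (2, 0.85)],
    [(2, 1.0), (1, 0.85)],
]
data, indices, indptr = hits_to_csr_arrays(rows)
```

The three lists can then be handed to scipy as scipy.sparse.csr_matrix((data, indices, indptr), shape=(num_rows, num_cols)). Every score and index is copied into the new arrays, which is where the extra memory goes.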
What I want to know is:
- does the DBSCAN example work and make sense? I could easily have messed up the array construction, and I don't have an easy way to verify it because I would be re-using my own assumptions.
- does it work with the other clustering algorithms?
- are there other arguments I should add to make it easier to work with another clustering algorithm? (For example, I added the "as_distance" option to compute 1-similarity once I learned that DBSCAN needs a distance.)
Andrew
dalke at dalkescientific.com