[chemfp] convert chemfp arena to NumPy array
Andrew Dalke
dalke at dalkescientific.com
Fri Sep 28 06:26:43 EDT 2018
I received a question in private email about how to convert a chemfp arena into a NumPy array, more specifically into a NumPy boolean array with one column per bit, which could be passed into scikit-learn and other machine learning packages.
I figure that people here might be interested, so I have placed a copy of the conversion code at http://dalkescientific.com/chemfp_to_numpy.py .
Here's an example of use.
>>> import chemfp
>>> import chemfp_to_numpy
>>> arena = chemfp.load_fingerprints("chembl_23.maccs.fps")
>>> arr = chemfp_to_numpy.to_numpy_array(arena)
>>> arr.shape
(1727081, 24)
>>> arr[0]
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0], dtype=uint8)
>>> arr[-1]
array([128, 4, 0, 1, 224, 146, 77, 190, 151, 255, 247, 254, 255,
       255, 255, 255, 255, 255, 255, 255, 63, 0, 0, 0], dtype=uint8)
>>>
>>> bitarr = chemfp_to_numpy.to_numpy_bitarray(arena)
>>> bitarr.shape
(1727081, 166)
>>> bitarr[0]
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=uint8)
>>> bitarr[-1]
array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0,
       1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=uint8)
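For readers who want to see how the two outputs above relate, here is a minimal sketch using only NumPy. It is not taken from chemfp_to_numpy.py and may differ from what that module does internally. It assumes that bit 0 is the least-significant bit of byte 0, which is consistent with the session above (the first byte, 128, unpacks to 0, 0, 0, 0, 0, 0, 0, 1). The names `packed`, `num_bits` and `bits` are illustrative only.

```python
import numpy as np

# Last row of the packed array from the session above:
# 24 bytes hold the 166 MACCS bits, with 26 unused padding bits at the end.
packed = np.array([[128, 4, 0, 1, 224, 146, 77, 190, 151, 255, 247, 254, 255,
                    255, 255, 255, 255, 255, 255, 255, 63, 0, 0, 0]],
                  dtype=np.uint8)

num_bits = 166

# Assumption: little-endian bit order within each byte.
# The bitorder= and count= arguments require NumPy 1.17 or newer;
# count= drops the padding bits past num_bits.
bits = np.unpackbits(packed, axis=1, count=num_bits, bitorder="little")

print(bits.shape)
print(bits[0, :16])
```

The result has one uint8 column per fingerprint bit. Calling `bits.astype(bool)` would give a boolean array if a downstream package prefers that dtype.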
These seem like useful functions, so I plan to add them to the next chemfp release, as methods of the chemfp arena object.
I would appreciate any feedback. Two questions I have are:
- are there more appropriate names?
- how important is it to be able to select specific bits in to_numpy_bitarray() instead of returning all bits?
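To make the second question concrete, here is one way bit selection could work on the packed bytes without first unpacking every bit. This is only a sketch: `select_bits` is a hypothetical helper, not part of chemfp or chemfp_to_numpy.py, and it assumes the same little-endian bit order as the output above.

```python
import numpy as np

def select_bits(packed, bit_indices):
    """Hypothetical helper: return only the requested bit columns.

    packed is a 2-D uint8 array with one fingerprint per row;
    bit_indices is a sequence of bit positions to extract.
    """
    bit_indices = np.asarray(bit_indices)
    # Pick the byte column that contains each requested bit.
    byte_cols = packed[:, bit_indices // 8]
    # Shift the wanted bit down to position 0, then mask it off.
    shifts = (bit_indices % 8).astype(np.uint8)
    return (byte_cols >> shifts) & np.uint8(1)

# First four bytes of the last fingerprint shown earlier.
packed = np.array([[128, 4, 0, 1]], dtype=np.uint8)
selected = select_bits(packed, [7, 10, 24, 0])
print(selected)
```

The same columns could also be taken from the full bit array with `bitarr[:, [7, 10, 24, 0]]`, at the cost of building all of the columns first.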
Best regards,
Andrew