[chemfp] convert chemfp arena to NumPy array

Andrew Dalke dalke at dalkescientific.com
Fri Sep 28 06:26:43 EDT 2018


I received a question in private email about how to convert a chemfp arena into a NumPy array, more specifically into a NumPy boolean array with one column per bit, which could be passed into scikit-learn and other machine learning packages.

I figure that people here might be interested, so I have placed a copy of the code at http://dalkescientific.com/chemfp_to_numpy.py .

Here's an example of use.

>>> import chemfp
>>> import chemfp_to_numpy
>>> arena = chemfp.load_fingerprints("chembl_23.maccs.fps")

>>> arr = chemfp_to_numpy.to_numpy_array(arena)
>>> arr.shape
(1727081, 24)
>>> arr[0]
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0], dtype=uint8)
>>> arr[-1]
array([128,   4,   0,   1, 224, 146,  77, 190, 151, 255, 247, 254, 255,
       255, 255, 255, 255, 255, 255, 255,  63,   0,   0,   0], dtype=uint8)
>>>


>>> bitarr = chemfp_to_numpy.to_numpy_bitarray(arena)
>>> bitarr.shape
(1727081, 166)
>>> bitarr[0]
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=uint8)
>>> bitarr[-1]
array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0,
       1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=uint8)
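
The two outputs above are related in a simple way: each row of the bit array is the matching row of the byte array with every byte expanded least-significant bit first, then trimmed to the 166 MACCS bits. The following is a minimal sketch of that relationship using NumPy's unpackbits(). It is not the code in chemfp_to_numpy.py; the helper name unpack_fingerprint_bits is made up for illustration, and the bitorder argument assumes NumPy 1.17 or later.

```python
import numpy as np

def unpack_fingerprint_bits(byte_arr, num_bits):
    """Hypothetical helper: expand packed fingerprint bytes to one column per bit."""
    # bitorder="little" emits bit 0 of each byte first, which is the
    # ordering that reproduces the bitarr output shown above.
    bits = np.unpackbits(byte_arr, axis=1, bitorder="little")
    # 24 bytes hold 192 bits; keep only the real fingerprint bits.
    return bits[:, :num_bits]

# The bytes of arr[-1] from the session above, as a 1-row array.
last_row = np.array([[128, 4, 0, 1, 224, 146, 77, 190, 151, 255, 247, 254, 255,
                      255, 255, 255, 255, 255, 255, 255, 63, 0, 0, 0]],
                    dtype=np.uint8)

bits = unpack_fingerprint_bits(last_row, 166)
print(bits.shape)
print(np.flatnonzero(bits[0])[:5])  # positions of the first few set bits
```

The first set bits land at positions 7 and 10, which agrees with the bitarr[-1] row printed above.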


This seems like useful functionality, so I plan to add these two functions to the next chemfp release, as methods of the chemfp arena object.

I would appreciate any feedback. Two questions I have are:
  - are there more appropriate names?
  - how important is it to be able to select specific bits in to_numpy_bitarray() instead of returning all bits?
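
As context for the second question: without such an option, specific bits can still be picked out of the full array afterwards with ordinary NumPy column indexing, at the cost of building every column first. A small sketch with made-up data (the array contents and the bit positions are only for illustration):

```python
import numpy as np

# Stand-in for the result of to_numpy_bitarray(): 3 fingerprints, 8 bits each.
bitarr = np.array([[0, 1, 0, 0, 1, 0, 0, 1],
                   [1, 1, 0, 0, 0, 0, 1, 0],
                   [0, 0, 0, 1, 1, 0, 0, 0]], dtype=np.uint8)

wanted_bits = [1, 4, 7]           # hypothetical bit positions of interest
subset = bitarr[:, wanted_bits]   # one column per selected bit, in the given order
print(subset.shape)
```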

Best regards,

				Andrew
				dalke at dalkescientific.com



