[chemfp] sparse fingerprints

Greg Landrum greg at landrumdecker.com
Sat Nov 5 02:14:52 EDT 2011


Andrew,

The comments below are mostly agreement. I think this looks good.

On Wed, Sep 28, 2011 at 1:34 PM, Andrew Dalke <dalke at dalkescientific.com> wrote:

> Now that I have released 1.0 of the dense fingerprints, I'm

> going to work on support for sparse fingerprints. This is

> mostly to support circular fingerprints like the ones in RDKit.

>

> I would also like to support count fingerprints.

>

> Proposed fingerprint encoding

> =============================

>

> I have much less experience with sparse fingerprints than with

> dense ones. I talked a bit with some of the folks at

> AstraZeneca/Mölndal about how to encode the bits, and we came

> up with this format:

>

> bits ::= 'X'

>      |  bit (',' bit)*

>

> bit ::= bitno ( ':' count)

>

> bitno ::= non-negative integer

> count ::= positive integer


so far so good.


>

> Here are some examples:

>

>   1,5,9   -- set bits 1, 5, and 9

>   1:3,5,9 -- bit 1 is set 3 times, bits 5 and 9 are set once

>   X       -- none of the bits are set

>   2097149,4063232 - two bits are set

>

> In addition, the bit numbers need to be in numeric order. That

> is, "1,5,9" is valid while "1,9,5" is not. This makes it faster

> to compute the similarity between two encoded strings - it

> can be done in Nbits_1 + Nbits_2 time, rather than building

> a hash table or other N log N data structure.


I don't think this would cause any problems.


> The 'X' symbol

> --------------

>

> The 'X' meaning "no bits" is there because I'm going to preserve the

> same essential file format as the FPS file, with bits + "\t" + id.

> Without the 'X' then there's lines like:

>

>  1:3,5,9\tID00001\n

>  \tID00002\n

>

> and I believe (without evidence) that it's going to lead to

> more mistakes than

>

>  1:3,5,9\tID00001\n

>  X\tID00002\n

>

> For one, it's easier to embed into other file formats.


Makes sense.


>

> Alternate encodings

> ===================

>

> RDKit uses a delta compression scheme. The differences between

> two fingerprints should follow a geometric distribution, which means

> they are optimally compressed with a Golomb code.

>

> However, there's no way I'm going to force someone to go

> through this complexity. The point of chemfp is to promote

> an easy to use exchange format.

>

>

> Fingerprint data line

> =====================

>

> I'm still thinking about the issues related to the corresponding

> file format. I want the fingerprint data lines to look like

> the data lines in the FPS format, meaning:

>

>   bits '\t' id ('\t' another column)* '\r'? '\n'

>

> as in

>

>   1:3,5,9\tID00001\n

>   26,8832\tID00002\tother fields\n

>

>

> Format disambiguation

> =====================

>

> In nearly every case it's trivial to figure out if a data

> line is an FPS line or a sparse fingerprint:

>  - if it has a ',' or ':' then it's a sparse fingerprint

>  - if it has the characters [a-fA-F] then it's a dense fingerprint

>  - if it starts with a '0' then it's a dense fingerprint

>  - if it has an odd number of digits then it's a sparse fingerprint

>

> However, it's not possible to figure out if

>

> 1234\tXYZZY

>

> is a sparse fingerprint with one bit (bit #1234) or a dense

> fingerprint of length 16.

>

> Therefore, to disambiguate the two, there's a new "magic" type

> line at the top of the file:

>

> #SPF1

>

> This is for "SParse Fingerprint".

>

> The other fields are unchanged ('type', 'aromaticity', 'date', etc.)

> except that 'num_bits' is, I think, not useful when one talks about

> fingerprints of size 2**32 or 2**64.


It could make a difference in which data structure you read the
fingerprints into, but that's the only thing I can think of.


> New extension? Likely not.

> ==========================

>

> This leads to the question - should these new fingerprints have

> a different extension? With the "#SPF1" line it's possible to

> figure things out based on the content.

>

> But experience tells me that content introspection leads to

> confusion. For example, "chemfp.open" right now opens an FPS

> file, which supports iteration over the dense fingerprints and

> the id, where the fingerprint is a byte string.

>

> The sparse file reader can't return a byte string, and

> functions with polymorphic return types are bad.

>

> So I'll have new APIs for working with sparse fingerprints,

> and a new ".sfp" extension.


yep.


> Non-numeric sparse fingerprints?

> ================================

>

> I don't know how to handle non-numeric sparse fingerprints. For

> example, I'm doing some work where I generate subgraphs as canonical

> SMARTS strings, including with counts. For example, phenol

> might be represented as:

>

>   c         6 times

>   c1ccccc1  1 time

>   O         1 time

>   Oc        1 times

>   Occ       2 times

>

>

> The above scheme requires bit positions, not strings.

>

> I think it's find to say that this is outside of the scope of

> what I'll support for this format. It's better to normalize

> the strings and just use a bit number as the index into a table

> stored elsewhere.


agreed


More information about the chemfp mailing list