[chemfp] sparse fingerprints
Greg Landrum
greg at landrumdecker.com
Sat Nov 5 02:14:52 EDT 2011
Andrew,
The comments below are mostly agreement. I think this looks good.
On Wed, Sep 28, 2011 at 1:34 PM, Andrew Dalke <dalke at dalkescientific.com> wrote:
> Now that I have released 1.0 of the dense fingerprints, I'm
> going to work on support for sparse fingerprints. This is
> mostly to support circular fingerprints like the ones in RDKit.
>
> I would also like to support count fingerprints.
>
> Proposed fingerprint encoding
> =============================
>
> I have much less experience with sparse fingerprints than with
> dense ones. I talked a bit with some of the folks at
> AstraZeneca/Mölndal about how to encode the bits, and we came
> up with this format:
>
> bits ::= 'X'
> | bit (',' bit)*
>
> bit ::= bitno ( ':' count)
>
> bitno ::= non-negative integer
> count ::= positive integer
so far so good.
>
> Here are some examples:
>
> 1,5,9 -- set bits 1, 5, and 9
> 1:3,5,9 -- bit 1 is set 3 times, bits 5 and 9 are set once
> X -- none of the bits are set
> 2097149,4063232 - two bits are set
>
> In addition, the bit numbers need to be in numeric order. That
> is, "1,5,9" is valid while "1,9,5" is not. This makes it faster
> to compute the similarity between two encoded strings - it
> can be done in Nbits_1 + Nbits_2 time, rather than building
> a hash table or other N log N data structure.
I don't think this would cause any problems.
> The 'X' symbol
> --------------
>
> The 'X' meaning "no bits" is there because I'm going to preserve the
> same essential file format as the FPS file, with bits + "\t" + id.
> Without the 'X' then there's lines like:
>
> 1:3,5,9\tID00001\n
> \tID00002\n
>
> and I believe (without evidence) that it's going to lead to
> more mistakes than
>
> 1:3,5,9\tID00001\n
> X\tID00002\n
>
> For one, it's easier to embed into other file formats.
Makes sense.
>
> Alternate encodings
> ===================
>
> RDKit uses a delta compression scheme. The differences between
> two fingerprints should follow a geometric distribution, which means
> they are optimally compressed with a Golomb code.
>
> However, there's no way I'm going to force someone to go
> through this complexity. The point of chemfp is to promote
> an easy to use exchange format.
>
>
> Fingerprint data line
> =====================
>
> I'm still thinking about the issues related to the corresponding
> file format. I want the fingerprint data lines to look like
> the data lines in the FPS format, meaning:
>
> bits '\t' id ('\t' another column)* '\r'? '\n'
>
> as in
>
> 1:3,5,9\tID00001\n
> 26,8832\tID00002\tother fields\n
>
>
> Format disambiguation
> =====================
>
> In nearly every case it's trivial to figure out if a data
> line is an FPS line or a sparse fingerprint:
> - if it has a ',' or ':' then it's a sparse fingerprint
> - if it has the characters [a-fA-F] then it's a dense fingerprint
> - if it starts with a '0' then it's a dense fingerprint
> - if it has an odd number of digits then it's a sparse fingerprint
>
> However, it's not possible to figure out if
>
> 1234\tXYZZY
>
> is a sparse fingerprint with one bit (bit #1234) or a dense
> fingerprint of length 16.
>
> Therefore, to disambiguate the two, there's a new "magic" type
> line at the top of the file:
>
> #SPF1
>
> This is for "SParse Fingerprint".
>
> The other fields are unchanged ('type', 'aromaticity', 'date', etc.)
> except that 'num_bits' is, I think, not useful when one talks about
> fingerprints of size 2**32 or 2**64.
It could make a difference in which data structure you read the
fingerprints into, but that's the only thing I can think of.
> New extension? Likely not.
> ==========================
>
> This leads to the question - should these new fingerprints have
> a different extension? With the "#SPF1" line it's possible to
> figure things out based on the content.
>
> But experience tells me that content introspection leads to
> confusion. For example, "chemfp.open" right now opens an FPS
> file, which supports iteration over the dense fingerprints and
> the id, where the fingerprint is a byte string.
>
> The sparse file reader can't return a byte string, and
> functions with polymorphic return types are bad.
>
> So I'll have new APIs for working with sparse fingerprints,
> and a new ".sfp" extension.
yep.
> Non-numeric sparse fingerprints?
> ================================
>
> I don't know how to handle non-numeric sparse fingerprints. For
> example, I'm doing some work where I generate subgraphs as canonical
> SMARTS strings, including with counts. For example, phenol
> might be represented as:
>
> c 6 times
> c1ccccc1 1 time
> O 1 time
> Oc 1 times
> Occ 2 times
>
>
> The above scheme requires bit positions, not strings.
>
> I think it's find to say that this is outside of the scope of
> what I'll support for this format. It's better to normalize
> the strings and just use a bit number as the index into a table
> stored elsewhere.
agreed
More information about the chemfp
mailing list