[chemfp] sparse fingerprints

Andrew Dalke dalke at dalkescientific.com
Wed Sep 28 07:34:49 EDT 2011


Now that I have released 1.0 of the dense fingerprints, I'm
going to work on support for sparse fingerprints. This is
mostly to support circular fingerprints like the ones in RDKit.

I would also like to support count fingerprints.

Proposed fingerprint encoding
=============================

I have much less experience with sparse fingerprints than with
dense ones. I talked a bit with some of the folks at
AstraZeneca/Mölndal about how to encode the bits, and we came
up with this format:

bits ::= 'X'
| bit (',' bit)*

bit ::= bitno ( ':' count)

bitno ::= non-negative integer
count ::= positive integer

Here are some examples:

1,5,9 -- set bits 1, 5, and 9
1:3,5,9 -- bit 1 is set 3 times, bits 5 and 9 are set once
X -- none of the bits are set
2097149,4063232 - two bits are set

In addition, the bit numbers need to be in numeric order. That
is, "1,5,9" is valid while "1,9,5" is not. This makes it faster
to compute the similarity between two encoded strings - it
can be done in Nbits_1 + Nbits_2 time, rather than building
a hash table or other N log N data structure.


The 'X' symbol
--------------

The 'X' meaning "no bits" is there because I'm going to preserve the
same essential file format as the FPS file, with bits + "\t" + id.
Without the 'X' then there's lines like:

1:3,5,9\tID00001\n
\tID00002\n

and I believe (without evidence) that it's going to lead to
more mistakes than

1:3,5,9\tID00001\n
X\tID00002\n

For one, it's easier to embed into other file formats.

Alternate encodings
===================

RDKit uses a delta compression scheme. The differences between
two fingerprints should follow a geometric distribution, which means
they are optimally compressed with a Golomb code.

However, there's no way I'm going to force someone to go
through this complexity. The point of chemfp is to promote
an easy to use exchange format.


Fingerprint data line
=====================

I'm still thinking about the issues related to the corresponding
file format. I want the fingerprint data lines to look like
the data lines in the FPS format, meaning:

bits '\t' id ('\t' another column)* '\r'? '\n'

as in

1:3,5,9\tID00001\n
26,8832\tID00002\tother fields\n


Format disambiguation
=====================

In nearly every case it's trivial to figure out if a data
line is an FPS line or a sparse fingerprint:
- if it has a ',' or ':' then it's a sparse fingerprint
- if it has the characters [a-fA-F] then it's a dense fingerprint
- if it starts with a '0' then it's a dense fingerprint
- if it has an odd number of digits then it's a sparse fingerprint

However, it's not possible to figure out if

1234\tXYZZY

is a sparse fingerprint with one bit (bit #1234) or a dense
fingerprint of length 16.

Therefore, to disambiguate the two, there's a new "magic" type
line at the top of the file:

#SPF1

This is for "SParse Fingerprint".

The other fields are unchanged ('type', 'aromaticity', 'date', etc.)
except that 'num_bits' is, I think, not useful when one talks about
fingerprints of size 2**32 or 2**64.

New extension? Likely not.
==========================

This leads to the question - should these new fingerprints have
a different extension? With the "#SPF1" line it's possible to
figure things out based on the content.

But experience tells me that content introspection leads to
confusion. For example, "chemfp.open" right now opens an FPS
file, which supports iteration over the dense fingerprints and
the id, where the fingerprint is a byte string.

The sparse file reader can't return a byte string, and
functions with polymorphic return types are bad.

So I'll have new APIs for working with sparse fingerprints,
and a new ".sfp" extension.



Non-numeric sparse fingerprints?
================================

I don't know how to handle non-numeric sparse fingerprints. For
example, I'm doing some work where I generate subgraphs as canonical
SMARTS strings, including with counts. For example, phenol
might be represented as:

c 6 times
c1ccccc1 1 time
O 1 time
Oc 1 times
Occ 2 times


The above scheme requires bit positions, not strings.

I think it's find to say that this is outside of the scope of
what I'll support for this format. It's better to normalize
the strings and just use a bit number as the index into a table
stored elsewhere.

(It could actually be stored in the fingerprint as a "x-"
experimental header, for those wishing to try.)


Any comments?


Andrew
dalke at dalkescientific.com




More information about the chemfp mailing list