[chemfp] single fingerprint queries against PubChem-sized data

Andrew Dalke dalke at dalkescientific.com
Mon Feb 13 21:05:47 EST 2012


Hi all,

I've been experimenting with how well chemfp handles large data sets. I generated 1024 bit fingerprints of the 32,245,089 compounds in PubChem. This takes just under 4GB of memory for the fingerprint data, plus memory for the identifiers.

It turns out that chemfp uses signed 32-bit integers for a lot of the internal calculations. Those overflow once a byte offset passes 2**31, that is, 2 GB of fingerprint data. Fixing that will _not_ be part of the upcoming release.
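The arithmetic behind that limit is easy to check:

    num_fps = 32245089          # compounds in PubChem
    fp_bytes = 1024 // 8        # a 1024-bit fingerprint is 128 bytes
    print num_fps * fp_bytes    # 4127371392 bytes, just under 4 GB
    print 2**31                 # 2147483648, so byte offsets overflow a signed 32-bit int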

As a workaround, I split the data set into two parts and work with it that way.

The code for the workaround is:

import chemfp
from chemfp import fps_search

class MultiArena(object):
    """Present a list of fingerprint arenas as a single read-only arena."""
    def __init__(self, arenas):
        self.arenas = arenas
        self._size = sum(map(len, arenas))

    def __len__(self):
        return self._size

    def __iter__(self):
        for arena in self.arenas:
            for x in arena:
                yield x

    def __getitem__(self, i):
        # Map a global index to the arena which contains that record.
        assert i >= 0, i
        for arena in self.arenas:
            if i >= len(arena):
                i -= len(arena)
            else:
                return arena[i]
        raise IndexError(i)

    def count_tanimoto_hits_fp(self, fp, threshold=0.7):
        # Counts are additive across arenas.
        return sum(arena.count_tanimoto_hits_fp(fp, threshold)
                   for arena in self.arenas)

    def knearest_tanimoto_search_fp(self, fp, k=3, threshold=0.7):
        # The only way to merge the k-nearest values is to extract
        # everything and turn them into a new FPSSearchResult.
        search_results = []
        for arena in self.arenas:
            arena_results = arena.knearest_tanimoto_search_fp(fp, k, threshold)
            search_results.extend(arena_results.get_ids_and_scores())
        if not search_results:
            # zip(*[]) would fail to unpack, so handle "no hits" directly.
            return fps_search.FPSSearchResult([], [])
        # Always sort; each arena's results are sorted, but their
        # concatenation is not.
        search_results.sort(key=lambda x: x[1], reverse=True)
        search_results = search_results[:k]
        ids, scores = zip(*search_results)
        return fps_search.FPSSearchResult(ids, scores)

    def threshold_tanimoto_search_fp(self, fp, threshold=0.7):
        search_results = []
        for arena in self.arenas:
            arena_results = arena.threshold_tanimoto_search_fp(fp, threshold)
            search_results.extend(arena_results.get_ids_and_scores())
        if not search_results:
            return fps_search.FPSSearchResult([], [])
        ids, scores = zip(*search_results)
        return fps_search.FPSSearchResult(ids, scores)


With that in place I can do:

import itertools

# 32 245 089 records
fps = chemfp.open("tree.fps")
evens = chemfp.load_fingerprints(itertools.islice(fps, 0, None, 2), fps.metadata)
print "Loaded evens"
fps = chemfp.open("tree.fps")
odds = chemfp.load_fingerprints(itertools.islice(fps, 1, None, 2), fps.metadata)
print "Loaded odds"

a = MultiArena([evens, odds])
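
A quick sanity check that the combined arena acts like a single one (something along these lines should work):

    query_id, query_fp = a[0]
    print a.count_tanimoto_hits_fp(query_fp, threshold=0.9)  # at least 1; the query is a target
    nearest = a.knearest_tanimoto_search_fp(query_fp, k=5)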

This at least lets me do simple fingerprint searches. BTW, the load time is rather long - this is where my proposed FPB format would be perfect. I haven't spent much time optimizing the loading path, and perhaps there's something I can do in C to make it faster.


I wanted to see how well chemfp scales to data sets of this size when used with a single fingerprint as the query.


I did a series of k-nearest-neighbor searches over a large range of k values. My use case is a web page where you sketch a structure and, as you sketch, interactively see the nearest results. I figured that k=100, or a 10x10 results grid, is an approximate upper bound, but I went up to k=10,000 because I could.

For each k I randomly sampled 100 fingerprints from the targets to use as queries, and timed each search. (Each query is itself in the target set, so there will always be at least 1 hit in the result.) This is the timing code:

import random
import time

def knearest_time(arena):
    N = 100
    print N, "samples from", len(arena), "fingerprints"
    print "    k   avg  stddev   min   max"
    fmt = "%5d  %.02f   %.02f   %.02f  %.02f"
    for k in (range(1, 10) + range(10, 50, 5) + range(50, 100, 10) +
              range(100, 500, 25) + range(500, 1000, 100) +
              range(1000, 10001, 500)):
        times = []
        for attempt in range(N):
            # Use a randomly chosen target fingerprint as the query.
            id, fp = random.choice(arena)
            t1 = time.time()
            arena.knearest_tanimoto_search_fp(fp, k=k, threshold=0.0)
            dt = time.time() - t1
            times.append(dt)
        # meanstdv returns the mean and standard deviation of a list.
        mean, std = meanstdv(times)
        print fmt % (k, mean, std, min(times), max(times))
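
(meanstdv isn't shown; any mean/standard-deviation helper will do, e.g. something like:)

    import math

    def meanstdv(values):
        # Mean and sample standard deviation of a list of numbers.
        n = len(values)
        mean = sum(values) / float(n)
        var = sum((x - mean) ** 2 for x in values) / (n - 1)
        return mean, math.sqrt(var)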




[Attachment: PubChem-knearest.png, plot of the k-nearest timings:
<http://eight.pairlist.net/pipermail/chemfp/attachments/20120214/3ce81213/attachment-0002.png>]


This is the worst-case situation for chemfp because:
- I search two arenas, each for its k-nearest hits, and merge the results to find the overall k-nearest.
- This code path contains no parallelism; it runs on a single core. (A threaded sketch follows below.)
- Random sampling means there's little cache coherency. From other work I've found that cache coherency is very important for high-speed performance.
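
Since the per-arena searches are independent, a threaded variant could overlap them. Here's a sketch, assuming the underlying C search code releases the GIL while it runs (I haven't verified that):

    import threading

    def parallel_count(arenas, fp, threshold=0.7):
        # Run each arena's count in its own thread and sum the results.
        # This only helps if the search routine releases the GIL.
        counts = [None] * len(arenas)
        def worker(i, arena):
            counts[i] = arena.count_tanimoto_hits_fp(fp, threshold)
        threads = [threading.Thread(target=worker, args=(i, arena))
                   for i, arena in enumerate(arenas)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return sum(counts)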


The search is still sub-second, on average, and even the worst case for the target scenario is only 1.2 seconds. Ideally I would want 0.1 seconds for an "instantaneous" search, but I think what's there now is pretty fast.




I did the same for the threshold search (that is, count all targets which are at least 'threshold' similar to the query). Here's the benchmark code:

def threshold_time(arena):
    N = 100
    print N, "samples from", len(arena), "fingerprints"
    print "threshold   avg  stddev   min   max"
    fmt = "  %-5s    %.02f   %.02f   %.02f  %.02f"
    for threshold in ("0.99", "0.98", "0.97", "0.96", "0.95", "0.94", "0.93",
                      "0.92", "0.91", "0.90",
                      "0.875", "0.85", "0.825", "0.8", "0.775", "0.75", "0.725", "0.7",
                      "0.65", "0.6", "0.55", "0.5", "0.4", "0.3", "0.25", "0.2",
                      "0.175", "0.15", "0.125", "0.1", "0.09", "0.08", "0.07", "0.06",
                      "0.05", "0.04", "0.03", "0.02", "0.01", "0.0"):
        times = []
        for attempt in range(N):
            id, fp = random.choice(arena)
            t1 = time.time()
            # Time the count; count_tanimoto_hits_fp returns only the
            # number of hits at or above the threshold.
            arena.count_tanimoto_hits_fp(fp, threshold=float(threshold))
            dt = time.time() - t1
            times.append(dt)
        mean, std = meanstdv(times)
        print fmt % (threshold, mean, std, min(times), max(times))
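
Both benchmarks take the combined arena built earlier:

    knearest_time(a)
    threshold_time(a)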


[Attachment: PubChem-threshold.png, plot of the threshold timings:
<http://eight.pairlist.net/pipermail/chemfp/attachments/20120214/3ce81213/attachment-0003.png>]


Again, this is on a single core, synthesized from two searches, and represents the worst-case performance.

As you can see on the left, it takes 0.0 seconds to determine that everything is at least 0.0 similar to the query. What I don't understand is why the time peaks around threshold=0.15. I've verified that it's real by repeating the run, and by double-checking that the counts are monotonic non-increasing as I raise the threshold.
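
That check is simple enough; for a fixed query, the count at a higher threshold must never exceed the count at a lower one:

    def check_monotonic(arena, fp, thresholds):
        # Hit counts must not increase as the threshold rises.
        prev_count = None
        for threshold in sorted(float(t) for t in thresholds):
            count = arena.count_tanimoto_hits_fp(fp, threshold=threshold)
            assert prev_count is None or count <= prev_count, \
                (threshold, count, prev_count)
            prev_count = count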

This plot doesn't match anyone else's results that I know of. For example, figure 4 of Liu et al., "Accelerating Chemical Database Searching Using GPUs", JCIM, p. 1814, shows a very smooth curve, with no estimate of the standard deviation. (I dislike it when people don't show error bars.) Their curve also has a very different shape from mine: it's more concave, and nearly flat from 0.7 to 1.0.




Andrew
dalke at dalkescientific.com



