[chemfp] simsearch output format

Andrew Dalke dalke at dalkescientific.com
Sun Nov 6 21:15:01 EST 2011

Hi all,

I'm having second thoughts about the simsearch output format.

It's line oriented, with one line per query. I designed it with
k-nearest searches and high thresholds in mind. In those cases
there are only a few dozen or so columns.

However, he wanted to search a single compound against the ZINC
database, and as a result ended up with one very long line.

He did something neat, which was to reverse the query and target
parameters. I hadn't thought of that solution. Still, it's not
the right solution for chemfp. For example, it prevents the
queries from being used in a pipeline mode.

I'm going to move the output to a new format:

"%Query" \t <query_id1> \t <number_of_matches1> \n
<score1_1> \t <target_id1_1> \n
<score1_2> \t <target_id1_2> \n
<score1_3> \t <target_id1_3> \n
...
"%Query" \t <query_id2> \t <number_of_matches2> \n
<score2_1> \t <target_id2_1> \n
...
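
As a rough illustration of how a block-oriented layout like this could
be read back, here is a minimal Python sketch. It is not chemfp code.
It assumes the fields are separated by a single tab with no surrounding
spaces, that "%Query" appears literally without the quotes, and that
scores parse as floats. The name read_query_blocks is hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

Hit = Tuple[float, str]  # (score, target_id)


def read_query_blocks(lines: Iterable[str]) -> Iterator[Tuple[str, List[Hit]]]:
    """Yield (query_id, hits) for each "%Query" block.

    Hypothetical helper: assumes tab-separated fields and a header of
    the form %Query<TAB>query_id<TAB>number_of_matches.
    """
    it = iter(lines)
    for line in it:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3 or fields[0] != "%Query":
            raise ValueError(f"expected a %Query header, got {line!r}")
        query_id, count = fields[1], int(fields[2])

        # The header says how many hit lines follow, so a reader never
        # has to look ahead to find the end of the block.
        hits: List[Hit] = []
        for _ in range(count):
            hit_line = next(it, None)
            if hit_line is None:
                raise ValueError(f"truncated block for query {query_id!r}")
            score, target_id = hit_line.rstrip("\n").split("\t")
            hits.append((float(score), target_id))
        yield query_id, hits
```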

The new format has a few advantages over the other format, suggested at
http://blueobelisk.shapado.com/questions/how-should-i-report-similarity-search-results

<score1_1> <query_id1> \t <target_id1_1> \n
<score1_2> <query_id1> \t <target_id1_2> \n
<score1_3> <query_id1> \t <target_id1_3> \n
...
<score2_1> <query_id2> \t <target_id2_1> \n
...

This latter format takes almost twice as much space
(although compression would help more here), doesn't
handle the "no hits" case, and can't be used for the
"--counts" option to simsearch.
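
To make the "no hits" point concrete, here is a small Python sketch
that writes the same results both ways. It is only an illustration,
not what simsearch does internally. It assumes tab-separated fields in
both layouts, and the names format_blocks and format_flat are
hypothetical.

```python
from typing import Dict, List, Tuple

# query_id -> list of (score, target_id); example data, not real output
Results = Dict[str, List[Tuple[float, str]]]


def format_blocks(results: Results) -> str:
    """One %Query header per query, then one line per hit."""
    out = []
    for query_id, hits in results.items():
        out.append(f"%Query\t{query_id}\t{len(hits)}\n")
        for score, target_id in hits:
            out.append(f"{score}\t{target_id}\n")
    return "".join(out)


def format_flat(results: Results) -> str:
    """One line per hit, with the query id repeated on every line."""
    out = []
    for query_id, hits in results.items():
        for score, target_id in hits:
            out.append(f"{score}\t{query_id}\t{target_id}\n")
    return "".join(out)


results: Results = {"q1": [(0.95, "t1"), (0.8, "t2")], "q2": []}

# q2 has no hits. It still gets a header line in the block layout,
# but leaves no trace at all in the flat layout.
print(format_blocks(results), end="")
print(format_flat(results), end="")
```

In the block layout the header line also carries the number of matches,
so a query with zero hits is still reported.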

So I'm going to go ahead and change the output format.

Andrew
dalke at dalkescientific.com
