[chemfp] simsearch output format

Andrew Dalke dalke at dalkescientific.com
Sun Nov 6 21:15:01 EST 2011


Hi all,

I'm having second thoughts about the simsearch output format.

It's line oriented, with one line per query. I designed it with
k-nearest searches and high thresholds in mind. In those cases
there are only a few dozen or so columns.

However, one user wanted to search a single compound against the
ZINC database, and as a result ended up with one very long line.

He did something neat, which was to reverse the query and target
parameters. I hadn't thought of that solution. Still, it's not
the right solution for chemfp. For example, it prevents the
queries from being used in a pipeline mode.

I'm going to move to a new format:


"%Query" \t <query_id1> \t <number_of_matches1> \n
<score1_1> \t <target_id1_1> \n
<score1_2> \t <target_id1_2> \n
<score1_3> \t <target_id1_3> \n
...
"%Query" \t <query_id2> \t <number_of_matches2> \n
<score2_1> \t <target_id2_1> \n
...


This has a few advantages over the other format, suggested at
http://blueobelisk.shapado.com/questions/how-should-i-report-similarity-search-results

<score1_1> \t <query_id1> \t <target_id1_1> \n
<score1_2> \t <query_id1> \t <target_id1_2> \n
<score1_3> \t <query_id1> \t <target_id1_3> \n
...
<score2_1> \t <query_id2> \t <target_id2_1> \n
...


This latter format takes almost twice as much space
(although compression would help more here), doesn't
handle the "no hits" case, and can't be used for the
"--counts" option to simsearch.
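
To make the "%Query" layout concrete, here is a minimal Python sketch
of a writer and a reader for it. This is only an illustration, not
chemfp code: the function names write_hits and read_hits are
hypothetical, the "%.5f" score precision is an arbitrary choice, and I
am assuming the fields are separated by a bare tab with no surrounding
spaces.

```python
import io


def write_hits(outfile, results):
    """Write results in the proposed "%Query" layout.

    `results` is an iterable of (query_id, hits) pairs, where `hits` is a
    list of (score, target_id) pairs. Hypothetical helper, not chemfp API.
    """
    for query_id, hits in results:
        # The header line carries the number of matches, so a query with
        # no hits is still reported (with a count of 0).
        outfile.write("%%Query\t%s\t%d\n" % (query_id, len(hits)))
        for score, target_id in hits:
            # Score precision is an assumption made for this sketch.
            outfile.write("%.5f\t%s\n" % (score, target_id))


def read_hits(infile):
    """Yield (query_id, hits) pairs, reading one query block at a time."""
    while True:
        header = infile.readline()
        if not header:
            return  # end of input
        fields = header.rstrip("\n").split("\t")
        if len(fields) != 3 or fields[0] != "%Query":
            raise ValueError("expected a %%Query header line, got %r" % (header,))
        query_id, num_matches = fields[1], int(fields[2])
        hits = []
        for _ in range(num_matches):
            line = infile.readline()
            if not line:
                raise ValueError("input ended inside the block for %r" % (query_id,))
            score, target_id = line.rstrip("\n").split("\t", 1)
            hits.append((float(score), target_id))
        yield query_id, hits
```

Because the reader only needs one query block in memory at a time,
it can sit in the middle of a pipeline, and a query with zero
matches still shows up as a header line with a count of 0.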


So I'm going to go ahead and change the output format.

Andrew
dalke at dalkescientific.com



