[chemfp] Open Babel Fastsearch to FPS conversion tool
Andrew Dalke
dalke at dalkescientific.com
Tue Dec 19 21:17:13 EST 2017
Hi chemfp users,
I've been working on a tool to convert an Open Babel "Fastsearch" index file (typically with a .fs extension) into FPS fingerprints. I've included the current --help at the bottom of this email.
Is anyone here interested in helping me test it out?
I developed it so I could do a side-by-side comparison of Open Babel's and chemfp's similarity search. I wanted to make sure chemfp used the same fingerprints.
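For anyone who wants to check a few scores by hand, here is a minimal Python sketch of the kind of comparison I mean. It is not the code from the tool. It assumes FPS data lines of the form "hex fingerprint, tab, identifier", and the names parse_fps_line and tanimoto are made up for this illustration.

```python
def parse_fps_line(line):
    """Split one FPS data line into (fingerprint bytes, identifier).

    Assumes the line is '<hex fingerprint><TAB><identifier>'; header
    lines starting with '#' must be skipped by the caller.
    """
    hex_fp, identifier = line.rstrip("\n").split("\t")[:2]
    return bytes.fromhex(hex_fp), identifier


def tanimoto(fp1, fp2):
    """Tanimoto similarity between two equal-length byte fingerprints."""
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints must have the same length")
    # Count the bits set in both fingerprints and in either fingerprint.
    both = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    either = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))
    # Convention chosen for this sketch: two all-zero fingerprints score 0.0.
    return both / either if either else 0.0
```

If the two toolkits really use the same fingerprints, the scores computed this way from the converted FPS file should match the scores Open Babel reports for the same pairs.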
It might also be useful for those who want both OB and chemfp fingerprints, because the structures need to be processed only once, to produce the .fs file, rather than a second time with ob2fps.
What's mostly needed is feedback about the --help documentation and the error messages/error handling. I also don't know if my code works correctly for large files, where the original data file is larger than 2 GB.
If you are interested, let me know by private email.
Andrew
dalke at dalkescientific.com
usage: fs2fps.py [-h] [-d FILENAME] [--in FORMAT]
                 [--delimiter {to-eol,tab,whitespace,space}]
                 [--type TYPE | --no-type] [--source SOURCE | --no-source]
                 [--date DATE | --no-date] [--output FILENAME] [--out FORMAT]
                 INDEX_FILE

Convert an Open Babel fastsearch index into FPS format.

positional arguments:
  INDEX_FILE            fastsearch index filename (usually ends with '.fs')

optional arguments:
  -h, --help            show this help message and exit
  -d FILENAME, --datafile FILENAME
                        Location of the structure data file. Used to extract
                        record identifiers. Only SD and SMILES files are
                        supported.
  --in FORMAT           Structure data file format (default: guess based on
                        the extension and default to 'smi')
  --delimiter {to-eol,tab,whitespace,space}
                        delimiter style for a SMILES file (default: 'to-eol')
  --type TYPE           specify the 'type' metadata value (default: use the
                        type from the index file)
  --no-type             do not include the type in the metadata
  --source SOURCE       specify the 'source' metadata value (default: use the
                        --datafile or the data filename in the index)
  --no-source           do not include the source in the metadata
  --date DATE           specify the 'date' metadata value (default: use the
                        creation date of the index). This must be in ISO
                        format in UTC, for example: '2017-12-20T02:12:18'
  --no-date             do not include the date in the metadata
  --output FILENAME, -o FILENAME
                        save the fingerprints to FILENAME (default=stdout)
  --out FORMAT          output fingerprint format. One of fps, fps.gz, or fpb.
                        (default guesses from output filename, or is 'fps')
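
The --date value has to match the ISO layout shown in the help. As a rough Python sketch of how such a string could be checked or generated (this is an illustration, not the tool's own validation code; ISO_FORMAT, parse_date and utc_now_string are names invented here):

```python
from datetime import datetime, timezone

# Layout of the example in the help text: '2017-12-20T02:12:18'
ISO_FORMAT = "%Y-%m-%dT%H:%M:%S"


def parse_date(text):
    """Return a datetime for a --date style string, or raise ValueError."""
    return datetime.strptime(text, ISO_FORMAT)


def utc_now_string():
    """Return the current UTC time formatted in the same layout."""
    return datetime.now(timezone.utc).strftime(ISO_FORMAT)
```

A string produced by utc_now_string() could then be passed as the DATE argument of --date.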