Hacker News

rogerbinns, last Wednesday at 11:28 PM

My project APSW may have exactly what you need. It wraps SQLite, providing a Python API, and that includes the FTS5 full text search functionality. https://rogerbinns.github.io/apsw/textsearch.html

You can store your text and PDFs in SQLite (or their filenames) and use the FTS5 infrastructure to do tokenization, query execution, and ranking. You can write your own tokenizer in Python, as well as ranking functions. A pure Python tokenizer for HTML is included, as well as a pure Python implementation of BM25.
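To give a rough idea of what a BM25 ranking function computes, here is a minimal standalone sketch. It is not APSW's or FTS5's implementation: the function name `bm25_scores`, the list-of-tokens document representation, and the always-positive IDF variant are assumptions made for illustration, and it assumes a non-empty document list. The defaults k1=1.2 and b=0.75 are the commonly used values.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`.

    Higher scores mean better matches in this sketch.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: how many documents contain each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)  # term frequency within this document
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Rare terms get a larger weight; the "1 +" keeps it positive.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents are penalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "sqlite full text search".split(),
    "python wrapper for sqlite".split(),
    "pdf text extraction".split(),
]
scores = bm25_scores(["text", "search"], docs)
```

The first document matches both query terms and scores highest, the third matches only "text", and the second matches nothing and scores zero.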

You can chain tokenizers so it is just a few lines of code to call pypdf's extract_text method, and then have the bundled UnicodeWords tokenizer properly extract tokens/words, and Simplify to do case folding and accent stripping if desired.
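The chaining idea can be sketched in plain Python without any APSW calls. Everything below is a hypothetical stand-in: `split_words` only approximates word segmentation with a regular expression, `simplify` does case folding and accent stripping via `unicodedata`, and neither is the bundled UnicodeWords or Simplify tokenizer. In a real pipeline the input text would come from an extraction step such as pypdf's extract_text.

```python
import re
import unicodedata


def split_words(text):
    # Rough word segmentation: runs of letters, digits, and underscores.
    return re.findall(r"\w+", text)


def simplify(token):
    # Case fold, decompose accented characters, then drop combining marks.
    decomposed = unicodedata.normalize("NFKD", token.casefold())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def tokenize(text):
    # Chain the stages: split first, then simplify each token.
    return [simplify(t) for t in split_words(text)]


tokens = tokenize("Full-Text SEARCH in SQLite")
```

Each stage only sees the output of the previous one, which is what makes it cheap to swap in a different extractor or add another normalization step.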

There is a lot more useful functionality, all done from Python. You can see code in action in the example/tour at https://rogerbinns.github.io/apsw/example-fts.html


Replies

radiator, last Wednesday at 11:48 PM

Thank you for publishing your work. Do you know of any similar projects with examples of custom tokenizers, e.g. for synonyms, snowball, but written in C?
