logoalt Hacker News

radiatorlast Wednesday at 11:48 PM1 replyview on HN

Thank you for publishing your work. Do you know of any similar projects with examples of custom tokenizers, e.g. for synonyms, snowball, but written in C?


Replies

rogerbinnslast Thursday at 5:34 PM

SQLite itself is in C so you can use the API directly https://www.sqlite.org/fts5.html#custom_tokenizers

The text is in UTF8 bytes so any C code would have to deal with that and mapping to Unicode codepoints, plus lots of other text processing so some kind of library would also be needed. I don't know of any.