Hacker News

radiator 11/20/2024

Thank you for publishing your work. Do you know of any similar projects with examples of custom tokenizers, e.g. for synonyms, snowball, but written in C?


Replies

rogerbinns 11/21/2024

SQLite itself is written in C, so you can use the API directly: https://www.sqlite.org/fts5.html#custom_tokenizers
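
To give a rough idea of the shape of such a tokenizer, here is a minimal standalone sketch. It does not include sqlite3.h or register anything with FTS5; the `token_cb` typedef only mirrors the parameter order of the `xToken` callback described on the linked page, and the names `tokenize_ascii`, `collected`, and `collect` are made up for illustration. It assumes the default "C" locale, so every non-ASCII byte is treated as a separator, which is exactly the limitation a real tokenizer would have to lift.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical callback type, modeled on the FTS5 xToken parameters:
 * context, flags, token bytes, token length, start and end byte offsets. */
typedef int (*token_cb)(void *ctx, int tflags, const char *tok, int ntok,
                        int istart, int iend);

/* Split text into runs of ASCII letters/digits, lowercase each run, and
 * pass it to emit(). Tokens longer than 64 bytes are truncated, but the
 * offsets still cover the whole run. Returns 0, or the first non-zero
 * value returned by emit(). */
int tokenize_ascii(const char *text, int ntext, void *ctx, token_cb emit)
{
    char buf[64];
    int i = 0;
    while (i < ntext) {
        /* Skip separators (anything that is not alphanumeric). */
        while (i < ntext && !isalnum((unsigned char)text[i])) i++;
        int start = i;
        int n = 0;
        while (i < ntext && isalnum((unsigned char)text[i])) {
            if (n < (int)sizeof buf)
                buf[n++] = (char)tolower((unsigned char)text[i]);
            i++;
        }
        if (i > start) {
            int rc = emit(ctx, 0, buf, n, start, i);
            if (rc != 0) return rc;
        }
    }
    return 0;
}

/* Example callback: counts tokens and remembers the last one seen. */
struct collected {
    int count;
    char last[65];
    int last_start, last_end;
};

int collect(void *ctx, int tflags, const char *tok, int ntok,
            int istart, int iend)
{
    struct collected *c = ctx;
    (void)tflags;
    c->count++;
    memcpy(c->last, tok, (size_t)ntok);   /* ntok is at most 64 */
    c->last[ntok] = '\0';
    c->last_start = istart;
    c->last_end = iend;
    return 0;
}
```

Calling `tokenize_ascii("Hello, World", 12, &c, collect)` with a zeroed `struct collected c` should emit two tokens, "hello" and "world". Synonyms or stemming would go between the lowercasing step and the `emit()` call.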

The text arrives as UTF-8 bytes, so any C code would have to deal with that and with mapping the bytes to Unicode codepoints. There is a lot of other text processing on top of that, so some kind of library would also be needed. I don't know of any.
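
For the bytes-to-codepoints part alone, a hand-rolled decoder is small. The following is only a sketch of that one step; the function name `utf8_decode` is made up for illustration, and it does nothing about case folding, normalization, or word boundaries, which is where a library would come in.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s, where n bytes are available.
 * On success, stores the codepoint in *cp and returns the number of
 * bytes consumed (1 to 4). Returns 0 for empty, truncated, or
 * malformed input (bad lead or continuation byte, overlong encoding,
 * surrogate, or value above U+10FFFF). */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len;
    uint32_t c, min;

    if (n == 0) return 0;

    if (s[0] < 0x80) {                    /* 0xxxxxxx: plain ASCII */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {   /* 110xxxxx: 2 bytes */
        len = 2; c = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {   /* 1110xxxx: 3 bytes */
        len = 3; c = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {   /* 11110xxx: 4 bytes */
        len = 4; c = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;                         /* stray continuation or invalid lead */
    }

    if (n < len) return 0;                /* truncated sequence */

    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* must be 10xxxxxx */
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }

    /* Reject overlong forms, UTF-16 surrogates, and out-of-range values. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;

    *cp = c;
    return len;
}
```

A tokenizer would call this in a loop, advancing by the returned length and keeping track of byte offsets, since FTS5 reports token positions as byte offsets into the original text rather than codepoint indexes.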