Thank you for publishing your work. Do you know of any similar projects with examples of custom tokenizers (e.g. for synonyms or Snowball stemming), but written in C?
The text arrives as UTF-8 bytes, so any C code would have to decode those bytes into Unicode codepoints and do a lot of other text processing on top, which means some kind of library would also be needed. I don't know of any such projects.
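To give a feel for the byte-to-codepoint step, here is a minimal sketch of a UTF-8 decoder in plain C. The function name `utf8_decode` is made up for illustration and is not part of SQLite or any library. It only covers decoding; case folding, normalization, word breaking and stemming would still need something more substantial.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence from s[0..n).
   On success, stores the codepoint in *cp and returns the number of
   bytes consumed (1 to 4). Returns 0 if the input is empty, truncated,
   or malformed (stray continuation byte, overlong form, surrogate,
   or value above U+10FFFF). */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    if (n == 0) return 0;

    unsigned char b = s[0];
    size_t len;     /* total length of the sequence in bytes */
    uint32_t c;     /* codepoint being assembled */
    uint32_t min;   /* smallest value allowed for this length */

    if (b < 0x80) { *cp = b; return 1; }                 /* ASCII */
    else if ((b & 0xE0) == 0xC0) { len = 2; c = b & 0x1F; min = 0x80; }
    else if ((b & 0xF0) == 0xE0) { len = 3; c = b & 0x0F; min = 0x800; }
    else if ((b & 0xF8) == 0xF0) { len = 4; c = b & 0x07; min = 0x10000; }
    else return 0;                    /* continuation or invalid lead byte */

    if (n < len) return 0;            /* sequence is cut off */

    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong encodings, UTF-16 surrogates and out-of-range values. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;

    *cp = c;
    return len;
}
```

A tokenizer would call this in a loop, advancing by the returned length and tracking byte offsets, since token positions are reported to FTS5 as byte offsets into the original text.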
SQLite itself is written in C, so you can use its tokenizer API directly: https://www.sqlite.org/fts5.html#custom_tokenizers