Hacker News

mgaunard · yesterday at 12:38 PM

In practice you should always normalize your Unicode data; then all you need is a memcmp plus a boundary check.
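Roughly the normalize-then-memcmp half of that, as a sketch using ICU (just an illustration, not the library from the article; the `equivalent` name is made up). The boundary-check half is sketched below.

    #include <unicode/normalizer2.h>
    #include <unicode/unistr.h>
    #include <cstring>
    #include <string>

    // Normalize both sides to NFC, then canonical equivalence reduces to
    // plain byte equality -- the "memcmp" step.
    bool equivalent(const std::string &a_utf8, const std::string &b_utf8) {
        UErrorCode status = U_ZERO_ERROR;
        const icu::Normalizer2 *nfc = icu::Normalizer2::getNFCInstance(status);
        if (U_FAILURE(status)) return false;

        std::string na, nb;
        nfc->normalize(icu::UnicodeString::fromUTF8(a_utf8), status).toUTF8String(na);
        nfc->normalize(icu::UnicodeString::fromUTF8(b_utf8), status).toUTF8String(nb);
        if (U_FAILURE(status)) return false;

        return na.size() == nb.size() &&
               std::memcmp(na.data(), nb.data(), na.size()) == 0;
    }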

Interestingly enough, this library doesn't provide grapheme cluster tokenization or boundary checking, which are among the most useful primitives for this.
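The kind of boundary check I mean, sketched with ICU's BreakIterator (again, not the library's API, just an assumption of how you'd do it with ICU):

    #include <unicode/brkiter.h>
    #include <unicode/unistr.h>
    #include <memory>

    // True if [start, limit) lies on grapheme-cluster boundaries of `text`,
    // so a match there doesn't split a user-perceived character
    // (e.g. a base letter plus combining accent, or an emoji sequence).
    bool on_grapheme_boundaries(const icu::UnicodeString &text,
                                int32_t start, int32_t limit) {
        UErrorCode status = U_ZERO_ERROR;
        std::unique_ptr<icu::BreakIterator> it(
            icu::BreakIterator::createCharacterInstance(icu::Locale::getRoot(), status));
        if (U_FAILURE(status)) return false;
        it->setText(text);
        return it->isBoundary(start) && it->isBoundary(limit);
    }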


Replies

stingraycharles · yesterday at 12:57 PM

That’s not practical in many situations, as the normalization alone may very well be more expensive than the search.

If you’re in control of all data representations in your entire stack, then yes, of course. But that’s hardly ever the case, and different tradeoffs are made at different points (e.g. storage in UTF-8 for space efficiency, but an in-memory representation in UTF-32 for speed).
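Just to illustrate the tradeoff (a toy sketch using ICU, nothing to do with the library in the article):

    #include <unicode/unistr.h>
    #include <cstdio>
    #include <string>
    #include <vector>

    int main() {
        // "café" as UTF-8: 5 bytes for 4 code points (the é takes two bytes).
        std::string utf8 = "caf\xC3\xA9";
        icu::UnicodeString s = icu::UnicodeString::fromUTF8(utf8);

        UErrorCode status = U_ZERO_ERROR;
        std::vector<UChar32> utf32(s.countChar32());
        s.toUTF32(utf32.data(), static_cast<int32_t>(utf32.size()), status);

        // UTF-32 costs 4 bytes per code point, but every code point is
        // directly indexable, which is why it's attractive in memory.
        std::printf("UTF-8 bytes: %zu, UTF-32 bytes: %zu\n",
                    utf8.size(), utf32.size() * sizeof(UChar32));
    }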

orthoxerox · yesterday at 1:03 PM

In practice the data is not always yours to normalize. You're not going to case-fold your library, but you still want to be able to search it.