Hacker News

mgaunard · yesterday at 12:38 PM

In practice you should always normalize your Unicode data; then all you need is a memcmp plus a boundary check.
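Roughly the normalize-then-memcmp half of that, as a sketch using ICU (just an illustration, not the library from the article; the `equivalent` name is made up). The boundary-check half is sketched below.

    #include <unicode/normalizer2.h>
    #include <unicode/unistr.h>
    #include <cstring>
    #include <string>

    // Normalize both sides to NFC, then canonical equivalence reduces to
    // plain byte equality -- the "memcmp" step.
    bool equivalent(const std::string &a_utf8, const std::string &b_utf8) {
        UErrorCode status = U_ZERO_ERROR;
        const icu::Normalizer2 *nfc = icu::Normalizer2::getNFCInstance(status);
        if (U_FAILURE(status)) return false;

        std::string na, nb;
        nfc->normalize(icu::UnicodeString::fromUTF8(a_utf8), status).toUTF8String(na);
        nfc->normalize(icu::UnicodeString::fromUTF8(b_utf8), status).toUTF8String(nb);
        if (U_FAILURE(status)) return false;

        return na.size() == nb.size() &&
               std::memcmp(na.data(), nb.data(), na.size()) == 0;
    }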

Interestingly enough, this library doesn't provide grapheme cluster tokenization or boundary checking, which are among the most useful primitives for this.
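The kind of boundary check I mean, sketched with ICU's BreakIterator (again, not the library's API, just an assumption of how you'd do it with ICU):

    #include <unicode/brkiter.h>
    #include <unicode/unistr.h>
    #include <memory>

    // True if [start, limit) lies on grapheme-cluster boundaries of `text`,
    // so a match there doesn't split a user-perceived character
    // (e.g. a base letter plus combining accent, or an emoji sequence).
    bool on_grapheme_boundaries(const icu::UnicodeString &text,
                                int32_t start, int32_t limit) {
        UErrorCode status = U_ZERO_ERROR;
        std::unique_ptr<icu::BreakIterator> it(
            icu::BreakIterator::createCharacterInstance(icu::Locale::getRoot(), status));
        if (U_FAILURE(status)) return false;
        it->setText(text);
        return it->isBoundary(start) && it->isBoundary(limit);
    }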


Replies

stingraycharles · yesterday at 12:57 PM

That’s not practical in many situations, as the normalization alone may very well be more expensive than the search.

If you’re in control of all data representations in your entire stack, then yes, of course. But that’s hardly ever the case, and different tradeoffs are made at different points (e.g. storage in UTF-8 for space efficiency, but an in-memory representation in UTF-32 for speed).
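Just to illustrate the tradeoff (a toy sketch using ICU, nothing to do with the library in the article):

    #include <unicode/unistr.h>
    #include <cstdio>
    #include <string>
    #include <vector>

    int main() {
        // "café" as UTF-8: 5 bytes for 4 code points (the é takes two bytes).
        std::string utf8 = "caf\xC3\xA9";
        icu::UnicodeString s = icu::UnicodeString::fromUTF8(utf8);

        UErrorCode status = U_ZERO_ERROR;
        std::vector<UChar32> utf32(s.countChar32());
        s.toUTF32(utf32.data(), static_cast<int32_t>(utf32.size()), status);

        // UTF-32 costs 4 bytes per code point, but every code point is
        // directly indexable, which is why it's attractive in memory.
        std::printf("UTF-8 bytes: %zu, UTF-32 bytes: %zu\n",
                    utf8.size(), utf32.size() * sizeof(UChar32));
    }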

orthoxerox · yesterday at 1:03 PM

In practice the data is not always yours to normalize. You're not going to case-fold your library, but you still want to be able to search it.