Full Unicode Search at 50× ICU Speed with AVX‑512

206 points • by ashvardanian • last Monday at 4:42 PM • 74 comments • view on HN

Comments

I was really confused about the case folding, this page explained the motivation well https://jean.abou-samra.fr/blog/unicode-misconceptions

""" Continuing with the previous example of “ß”, one has lowercase("ss") != lowercase("ß") but uppercase("ss") == uppercase("ß"). Conversely, for legacy reasons (compatibility with encodings predating Unicode), there exists a Kelvin sign “K”, which is distinct from the Latin uppercase letter “K”, but also lowercases to the normal Latin lowercase letter “k”, so that uppercase("K") != uppercase("K") but lowercase("K") == lowercase("K").

The correct way is to use Unicode case folding, a form of normalization designed specifically for case-insensitive comparisons. Both casefold("ß") == casefold("ss") and casefold("K") == casefold("K") are true. Case folding usually yields the same result as lowercasing, but not always (e.g., “ß” lowercases to itself but case-folds to “ss”). """

One question I have is why have Kelvin sign that is distinct from Latin K and other indistinguishable symbols? To make quantified machine readable (oh, this is not a 100K license plate or money amount, but a temperature)? Or to make it easier for specialized software to display it in correct placed/units?

➕ show 2 replies

ashvardanian • yesterday at 1:37 PM

This article is about the ugliest — but arguably the most important — piece of open-source software I’ve written this year. The write-up ended up long and dense, so here’s a short TL;DR:

I grouped all Unicode 17 case-folding rules and built ~3K lines of AVX-512 kernels around them to enable fully standards-compliant, case-insensitive substring search across the entire 1M+ Unicode range, operating directly on UTF-8 bytes. In practice, this is often ~50× faster than ICU, and also less wrong than most tools people rely on today—from grep-style utilities to products like Google Docs, Microsoft Excel, and VS Code.

StringZilla v4.5 is available for C99, C++11, Python 3, Rust, Swift, Go, and JavaScript. The article covers the algorithmic tradeoffs, benchmarks across 20+ Wikipedia dumps in different languages, and quick starts for each binding.

Thanks to everyone for feature requests and bug reports. I'll do my best to port this to Arm as well — but first, I'm trying to ship one more thing before year's end.

➕ show 5 replies

unwind • yesterday at 1:28 PM

Very cool and impressive performance.

I was worried (I find it confusing when Unicode "shadows" of normal letters exist, and those are of course also dangerous in some cases when they can be mis-interpreted for the letter they look more or less exactly like) by the article's use of U+212A (Kelvin symbol) as sample text, so I had to look it up [1].

Anyway, according to Wikipedia the dedicated symbol should not be used:

However, this is a compatibility character provided for compatibility with legacy encodings. The Unicode standard recommends using U+004B K LATIN CAPITAL LETTER K instead; that is, a normal capital K.

That was comforting, to me. :)

[1]: https://en.wikipedia.org/wiki/Kelvin#Orthography

➕ show 1 reply

orthoxerox • yesterday at 1:17 PM

Is it possible to extend this to support additional transformation rules like Any-Latin;Latin-ASCII? To make it possible to find "Վարդանյան" in a haystack by searching for "vardanyan"?

➕ show 1 reply

mgaunard • yesterday at 12:38 PM

In practice you should always normalize your Unicode data, then all you need to do is memcmp + boundary check.

Interestingly enough this library doesn't provide grapheme cluster tokenization and/or boundary checking which is one of the most useful primitive for this.

➕ show 2 replies

hans_castorp • yesterday at 3:36 PM

Would be cool if that could be integrated into Postgres :)

➕ show 1 reply

bawolff • yesterday at 8:59 PM

While this is definitely really cool, wouldn't people who need this type of speed usually just store the text to be searched already case folded?

anonnon • yesterday at 4:05 PM

> ICU has bindings for Rust that provide case-folding functionality, but not case-insensitive substring search.

> ICU has many bindings. The Rust one doesn’t expose any substring search functionality, but the Python one does:

Python's ICU support is based on ICU4C. Rust's ICU "bindings" are actually a new implementation called ICU4X, by developers who worked on i18n at Mozilla and Google and on ICU4C, with the goal of a cleaner, more performant implementation that is also memory safe. Maybe not relevant (as in substantially altering the benchmarks), but it's at least worth noting that the ICU backends aren't consistent throughout.

➕ show 1 reply

moralestapia • yesterday at 3:28 PM

Ash is amazing!

Also very cool and approachable guy.

(Best wishes if you're reading this.)

kardianos • yesterday at 2:48 PM

Looks neat. What are all the genomic sequence comparisons in there for? Is this a grab bag of interesting string methods or is there a motivation for this?

➕ show 1 reply

user3939382 • today at 6:42 AM

I just used AVX-512 today for a lisp processor very performant.

xking5205 • yesterday at 1:36 PM

its good

andersa • yesterday at 12:37 PM

From a German user perspective, ICU and your fancy library are incorrect, actually. Mass is not a different casing of Maß, they are different characters. Google likely changed this because it didn't do what users wanted.

➕ show 5 replies

alt Hacker News

Full Unicode Search at 50× ICU Speed with AVX‑512

Comments