logoalt Hacker News

kwilletsyesterday at 8:05 PM1 replyview on HN

This is a fairly simple idea of indexing characters for each column/offset and compressing the bitmaps. Simple is good, as the overhead of more sophisticated ideas (eg suffix sorting) is often prohibitive.

One suggestion is to index the end-of-string as a character as well; then you don't need negative offsets. But that turns the suffix search into a wildcard type of thing where you have to try all offsets, which is what the '%pat%' searches do already, so maybe it's OK.


Replies

Sesse__today at 9:31 AM

AFAIK the most common design for these kinds of systems is using trigram posting lists with position information, i.e., where in the string does the trigram occur. (It's the extra position information that means that you don't need to re-check the string itself.) No need for many different bitmaps; you just take an existing GIN-like design, remove deduplication and add some side information.