It feels weird that the search index is bigger than the underlying data, weren't search indexes...

oblio • last Friday at 9:25 PM • 4 replies

It feels weird that the search index is bigger than the underlying data. Weren't search indexes supposed to be efficient formats giving fast access to the underlying data?

Replies

andylizf • last Friday at 9:29 PM

Exactly. That's because instead of just mapping keywords, vector search stores the rich meaning of the text as massive data structures, and LEANN is our solution to that paradoxical inefficiency.

iezepov • last Saturday at 4:26 AM

Good point! Maybe indexing is a bad term here, and it's more like feature extraction (and since embeddings are high dimensional we extract a lot of features). From that point of view it makes sense that "the index" takes more space than the original data.

yichuan • last Friday at 9:37 PM

I guess for semantic search (rather than keyword search), the index is larger than the text because we need to embed it into a huge semantic space, which makes sense to me.
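
To put rough numbers on that, here is a back-of-the-envelope sketch. The chunk size (1,000 characters of text per embedded chunk), the 768-dimensional embedding, and float32 storage are all assumptions picked for illustration; none of them come from the thread or from LEANN, and the function names are hypothetical.

```python
def embedding_bytes(dim: int, bytes_per_value: int = 4) -> int:
    """Storage for one dense vector (float32 by default)."""
    return dim * bytes_per_value


def index_to_text_ratio(chunk_chars: int, dim: int, bytes_per_value: int = 4) -> float:
    """Vector bytes divided by text bytes for one chunk.

    Assumes roughly 1 byte per character (plain ASCII text) and ignores
    any graph or metadata overhead the index may add on top.
    """
    return embedding_bytes(dim, bytes_per_value) / chunk_chars


# Assumed values, for illustration only.
chunk_chars = 1000
dim = 768

print(embedding_bytes(dim))                   # bytes per stored vector
print(index_to_text_ratio(chunk_chars, dim))  # vector bytes per text byte
```

Under these assumptions the vectors alone come out at about three times the size of the text they describe, before counting any extra structure a vector index keeps for fast lookup.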

brookst • last Saturday at 1:49 PM

Nonclustered indexes in an RDBMS can be larger than the tables. It’s usually poor design or indexing a very simple schema in a non-trivial way, but the ultimate goal of the index is speed, not size. As long as you can select and use only a subset of the index based on its ordering, it’s still a win.
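
A minimal sketch of that last point, using a sorted Python list as a stand-in for a nonclustered index. This is not how any particular RDBMS implements its indexes; the table, the column names, and `range_lookup` are hypothetical.

```python
import bisect

# Hypothetical table: (row_id, name, age), stored in no particular order.
rows = [
    (1, "alice", 29),
    (2, "bob", 41),
    (3, "carol", 34),
    (4, "dave", 29),
    (5, "erin", 52),
]

# Stand-in for a nonclustered index on `age`: (age, row_id) pairs kept sorted.
index = sorted((age, row_id) for row_id, _name, age in rows)


def range_lookup(index, lo, hi):
    """Return row_ids with lo <= age <= hi, touching only a slice of the index."""
    # (lo,) sorts before every (lo, row_id), so this finds the first match.
    start = bisect.bisect_left(index, (lo,))
    # (hi, inf) sorts after every (hi, row_id), so this finds one past the last match.
    end = bisect.bisect_right(index, (hi, float("inf")))
    return [row_id for _age, row_id in index[start:end]]


print(range_lookup(index, 29, 34))
```

Because the index is ordered, the two binary searches locate the matching slice without scanning the rest, which is the sense in which a large index can still be a win.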
