I've been embedding all HN comments since 2023, pulled from BigQuery, and hosting them at https://hn.fiodorov.es
Source is at https://github.com/afiodorov/hn-search
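For anyone curious how such a pipeline fits together, here is a minimal sketch (not the author's actual code): pull comments from the public BigQuery dataset and embed them with sentence-transformers. It assumes the `bigquery-public-data.hacker_news.full` schema and the all-MiniLM-L6-v2 model; in practice you would batch and checkpoint rather than encode everything in one call.

```python
# Hedged sketch: fetch HN comments from the public BigQuery table and embed them.
# Table/column names assume the bigquery-public-data.hacker_news.full schema.
from google.cloud import bigquery
from sentence_transformers import SentenceTransformer

client = bigquery.Client()
query = """
    SELECT id, text
    FROM `bigquery-public-data.hacker_news.full`
    WHERE type = 'comment'
      AND text IS NOT NULL
      AND EXTRACT(YEAR FROM timestamp) >= 2023
    LIMIT 100000  -- keep the sketch small; the real dataset is far larger
"""
rows = list(client.query(query).result())

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(
    [r["text"] for r in rows],
    batch_size=256,
    show_progress_bar=True,
    normalize_embeddings=True,  # unit vectors so dot product == cosine similarity
)
```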
Maybe I’m reading this wrong, but commercial use of comments is prohibited by the HN Privacy and Data Policy. So is creating derivative works (which a vector representation technically is).
I think it would be useful to add a right-click menu option to HN content, like "similar sentences", which displays a list of links to them. I wonder if it would tell me that this suggestion has been made before.
I know it's unrelated, but does anyone know a good paper comparing vector search vs "normal" full-text search? Sometimes I ask myself whether the squeeze is worth the juice. A quick head-to-head on your own corpus is also easy to run, as in the sketch below.
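Not a paper, but here is a hedged sketch of what that comparison could look like on a small corpus: lexical BM25 (via the rank_bm25 package) against dense retrieval (sentence-transformers). The corpus and query are made up for illustration.

```python
# Compare a lexical ranker (BM25) with a dense embedding ranker on the same toy corpus.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

corpus = [
    "Postgres full-text search uses tsvector and GIN indexes",
    "Vector databases store dense embeddings for semantic similarity",
    "BM25 is a classic lexical ranking function",
]
query = "semantic search with embeddings"

# Lexical side: BM25 over whitespace tokens
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
bm25_scores = bm25.get_scores(query.lower().split())

# Dense side: cosine similarity over normalized embeddings
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
doc_emb = model.encode(corpus, normalize_embeddings=True)
q_emb = model.encode(query, normalize_embeddings=True)
dense_scores = doc_emb @ q_emb

print("BM25 ranking: ", np.argsort(-bm25_scores))
print("Dense ranking:", np.argsort(-dense_scores))
```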
Am I misunderstanding what a parquet file is, or are all of the HN posts along with the embedding metadata a total of 55GB?
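For scale, a back-of-envelope (the comment count below is my own guess, not a figure from the dataset): all-MiniLM-L6-v2 emits 384-dimensional vectors, so at float32 that is ~1.5 KB per comment before any text or Parquet compression, and tens of millions of comments gets into the tens of GB quickly.

```python
# Rough arithmetic only; the row count is a placeholder guess.
dims = 384                    # all-MiniLM-L6-v2 embedding size
bytes_per_vec = dims * 4      # float32 -> 1536 bytes per comment
rows = 30_000_000             # hypothetical comment count
print(bytes_per_vec * rows / 1e9, "GB of raw vectors")  # ~46 GB before text and compression
```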
Why is this not on Hugging Face as a dataset yet? Is anyone putting this on Hugging Face?
Why all-MiniLM-L6-v2? This is so old and terribly behind the new models...
Fun project. I'm sure it will get a lot of interest here.
For those into vector storage in general, one thing that has interested me lately is the idea of storing vectors as GGUF files and bringing the familiar llama.cpp-style quants to them (e.g. Q4_K, MXFP4, etc.). An example of this is below.
https://gist.github.com/davidmezzetti/ca31dff155d2450ea1b516...
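As a hedged sketch of the basic idea, here is what writing and reading an embedding matrix with llama.cpp's gguf-py package could look like. The file, architecture, and tensor names are made up for illustration, and the quantization step itself (Q4_K etc.) is what the linked gist actually covers.

```python
# Persist an embedding matrix inside a GGUF file using gguf-py, then read it back.
import numpy as np
from gguf import GGUFWriter, GGUFReader

vectors = np.random.rand(10_000, 384).astype(np.float32)  # stand-in embeddings

writer = GGUFWriter("embeddings.gguf", "embeddings")  # path, architecture name
writer.add_tensor("vectors", vectors)
writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()

reader = GGUFReader("embeddings.gguf")
t = reader.tensors[0]
print(t.name, t.shape, t.data.dtype)
```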
Scratches off one of my todos.
I don't know how to feel about this. Is the only purpose of the comments here to train some commercial model? I have a feeling this might affect my involvement here going forward.
Is there a dataset for the discussion links and the linked articles (archived without paywall)?
Finetune an LLM on post_score -> high-quality slop generator
lmao this is gold
I don't remember licensing my HN comments for third-party processing.
Don't use all-MiniLM-L6-v2 for new vector embeddings datasets.
Yes, it's the open-weights embedding model used in all the tutorials, and it was the most pragmatic model to use in sentence-transformers when vector stores were in their infancy. But it's old: it doesn't implement the newest advances in architectures and training data pipelines, and it has a low context length of 512 when current embedding models can do 2k+ with more efficient tokenizers.
For open-weights, I would recommend EmbeddingGemma (https://huggingface.co/google/embeddinggemma-300m) instead, which has incredible benchmarks and a 2k context window; although it's larger and slower to encode, the payoff is worth it. For a compromise, bge-base-en-v1.5 (https://huggingface.co/BAAI/bge-base-en-v1.5) or nomic-embed-text-v1.5 (https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) are also good.
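Swapping models is a one-line change in sentence-transformers. A minimal sketch with bge-base-en-v1.5 (the documents and query are made up; the query prefix follows the bge model card's recommendation for retrieval):

```python
# Retrieval with bge-base-en-v1.5 instead of all-MiniLM-L6-v2.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

docs = [
    "Show HN: semantic search over HN comments",
    "Storing embeddings in Parquet files",
]
doc_emb = model.encode(docs, normalize_embeddings=True)

query = "Represent this sentence for searching relevant passages: search HN comments by meaning"
query_emb = model.encode(query, normalize_embeddings=True)

scores = doc_emb @ query_emb  # cosine similarity, since vectors are normalized
print(scores)
```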