Hacker News

Ask HN: How are you doing RAG locally?

413 points by tmaly 01/14/2026 | 156 comments

I am curious how people are doing RAG locally with minimal dependencies for internal code or complex documents?

Are you using a vector database, some type of semantic search, a knowledge graph, a hypergraph?


Comments

ehsanu1 01/15/2026

Embedded usearch vector database. https://github.com/unum-cloud/USearch

lee1012 01/15/2026

lee101/gobed https://github.com/lee101/gobed — static embedding models, so embedding takes milliseconds, plus on-GPU search with a CAGRA-style GPU index. A few tricks for speed: int8 quantization on the embeddings, and fusing embedding and search into the same kernel, since the embedding really is just a trained map of per-token embeddings averaged together.
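The int8-quantization idea can be sketched in pure Python. This is an illustration of the technique only, not gobed's implementation (which is Go, with GPU kernels and a CAGRA-style index); each vector gets its own scale factor so the approximate dot product stays close to the float one.

```python
# Sketch of int8 embedding quantization + brute-force dot-product search.
# Illustration of the idea only, not gobed's actual implementation.

def quantize_int8(vec):
    """Scale a float vector into int8 range; return (ints, scale)."""
    scale = max(abs(x) for x in vec) / 127 or 1.0
    return [round(x / scale) for x in vec], scale

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def search(query, index, k=2):
    """index: list of (doc_id, int8_vec, scale); top-k by approximate dot product."""
    q_ints, q_scale = quantize_int8(query)
    scored = [(dot(q_ints, v) * q_scale * s, doc_id) for doc_id, v, s in index]
    return sorted(scored, reverse=True)[:k]

docs = {"a": [0.1, 0.9, 0.2], "b": [0.8, 0.1, 0.1], "c": [0.2, 0.8, 0.3]}
index = []
for doc_id, vec in docs.items():
    ints, scale = quantize_int8(vec)
    index.append((doc_id, ints, scale))

print(search([0.1, 1.0, 0.2], index))  # "a" then "c" rank highest
```

Int8 storage is 4x smaller than float32, and integer dot products are cheap; the per-vector scale keeps the ranking nearly identical to the full-precision one.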

beret4breakfast 01/15/2026

For the purposes of learning, I’ve built a chatbot using ollama, streamlit, chromadb and docling. Mostly playing around with embedding and chunking on a document library.

eajr 01/14/2026

Local LibreChat which bundles a vector db for docs.

nineteen999 01/14/2026

A little BM25 can get you quite a way with an LLM.
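How far a little BM25 gets you is easy to show with a tiny pure-Python ranker (the classic Okapi BM25 formula; a sketch, not a production implementation):

```python
import math
from collections import Counter

def bm25_index(docs, k1=1.5, b=0.75):
    """docs: {doc_id: text}. Returns a search closure over the corpus."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    n = len(docs)
    avgdl = sum(len(t) for t in tokenized.values()) / n
    # document frequency of each term
    df = Counter(term for toks in tokenized.values() for term in set(toks))

    def score(query, doc_id):
        toks = tokenized[doc_id]
        tf = Counter(toks)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(toks) / avgdl))
        return s

    def search(query, k=3):
        return sorted(((score(query, d), d) for d in docs), reverse=True)[:k]

    return search

docs = {
    1: "retrieval augmented generation with local models",
    2: "cooking pasta at home",
    3: "local retrieval pipelines for code search",
}
search = bm25_index(docs)
print(search("local retrieval"))  # docs 1 and 3 outrank doc 2
```

Feed the top-k chunks into the LLM's prompt and you have a zero-dependency RAG pipeline.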

geuis 01/15/2026

I don't. I actually write code.

To answer the question more directly, I've spent the last couple of years with a few different quant models, mostly running on llama.cpp and ollama depending on the model. The results are way slower than the paid token APIs, but they are completely free of external influence and cost.

However, the models I've tested generally turn out to be pretty dumb at the quant level I need to run them at for reasonable speed. And their code generation capabilities are just a mess, not worth dealing with.

ramesh31 01/14/2026

SQLite with FTS5
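A minimal sketch of the FTS5 approach using the sqlite3 module in Python's standard library (this assumes your SQLite build was compiled with FTS5, which most bundled builds are):

```python
import sqlite3

# Index document chunks in an in-memory FTS5 table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE chunks USING fts5(doc, body)")
conn.executemany(
    "INSERT INTO chunks VALUES (?, ?)",
    [
        ("readme", "local retrieval augmented generation setup"),
        ("notes", "grocery list and errands"),
        ("design", "vector search versus keyword retrieval tradeoffs"),
    ],
)

# bm25() is FTS5's built-in ranking function (lower scores rank better).
rows = conn.execute(
    "SELECT doc FROM chunks WHERE chunks MATCH ? ORDER BY bm25(chunks)",
    ("retrieval",),
).fetchall()
print(rows)  # matches 'readme' and 'design', not 'notes'
```

No server, no extra dependencies, and the index lives in the same file as the rest of your data.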

lormayna 01/15/2026

I have done some experiments with nomic embedding through Ollama and ChromaDB.

Works well, but I haven't tested it at larger scale.

juleshenry 01/15/2026

SurrealDB coupled with local vectorization. Mac M1 16GB

yandrypozo 01/15/2026

Is there a thread for hardware used for local LLMs?

andoando 01/15/2026

Anyone have suggestions for doing semantic caching?
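One common approach: cache LLM answers keyed by query embedding and return a cached answer when a new query is similar enough to an old one. A stdlib sketch, with a toy bag-of-words `embed()` standing in for a real embedding model:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

class SemanticCache:
    """Return a cached answer when a query is 'close enough' to a past one."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer)

    def get(self, query):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query, answer):
        self.entries.append((embed(query), answer))

cache = SemanticCache()
cache.put("how do I run RAG locally", "use sqlite fts5")
print(cache.get("how do I run RAG locally?"))  # near-duplicate -> cache hit
print(cache.get("best pasta recipe"))          # unrelated -> None
```

The threshold is the whole game: too low and you serve stale answers to genuinely different questions, too high and you never hit the cache.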

sinandrei 01/15/2026

Anyone use these approaches with academic pdfs?

jacekm 01/15/2026

I am curious: what are you using local RAG for?

mooball 01/15/2026

i thought rag/embeddings were dead with the large context windows. that's what i get for listening to chatgpt.

baalimago 01/15/2026

I thought context building via tooling was shown to be more effective than RAG in practically every way?

Question being: WHY would I be doing RAG locally?

Strift 01/15/2026

I just use a web server and a search engine.

TL;DR:

- chunk files, index chunks
- vector/hybrid search over the index
- node app to handle requests (was the quickest to implement, LLMs understand OpenAPI well)

I wrote about it here: https://laurentcazanove.com/blog/obsidian-rag-api
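That chunk-index-search loop can be sketched in stdlib Python (not the blog's actual Node code; the bag-of-words embedder below is a stand-in for a real embedding model, and only the vector half of hybrid search is shown):

```python
import math

def chunk(text, size=8):
    """Naive fixed-size chunking by word count."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def make_embedder(corpus_texts):
    """Bag-of-words embedder over the corpus vocabulary; a stand-in
    for a real embedding model."""
    vocab = sorted({t for text in corpus_texts for t in text.lower().split()})
    pos = {t: i for i, t in enumerate(vocab)}

    def embed(text):
        vec = [0.0] * len(vocab)
        for t in text.lower().split():
            if t in pos:
                vec[pos[t]] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return [v / norm for v in vec]

    return embed

files = {
    "notes.md": "vector search finds chunks whose embeddings are similar",
    "todo.md": "buy milk and call the plumber tomorrow morning",
}
embed = make_embedder(files.values())
index = [(name, c, embed(c)) for name, text in files.items() for c in chunk(text)]

def search(query, k=2):
    """Cosine similarity over unit-normalized vectors is just a dot product."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, v)), name, c) for name, c, v in index]
    return sorted(scored, reverse=True)[:k]

print(search("vector similarity search")[0][1])  # -> notes.md
```

Swap `embed` for calls to a real model and put `search` behind an HTTP handler and you have roughly the architecture described.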

electroglyph 01/15/2026

simple lil setup with qdrant

jeanloolz 01/15/2026

Sqlite-vec

whattheheckheck 01/14/2026

Anythingllm is promising

xpl 01/16/2026

SQLite with extensions; scales to millions of docs easily.

pdyc 01/15/2026

SQLite's bm25 (via FTS5)

jeffchuber 01/15/2026

Try out Chroma — or better yet, ask Opus to!

__mharrison__ 01/15/2026

Grep (rg)

VerifiedReports 01/16/2026

Whatever "RAG" is...
