I am curious how people are doing RAG locally with minimal dependencies for internal code or complex documents?
Are you using a vector database, some type of semantic search, a knowledge graph, a hypergraph?
lee101/gobed (https://github.com/lee101/gobed): static embedding models, so texts are embedded in milliseconds, with on-GPU search via a CAGRA-style GPU index. A few things for speed: int8 quantization on the embeddings, and fusing embedding and search into the same kernel, since the embedding really is just a trained map of per-token embeddings plus averaging.
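For anyone wondering what "a trained map of embeddings per token plus averaging" looks like in practice, here's a toy sketch (the vectors are random stand-ins, not gobed's actual trained table, and the dimension is made up): the "model" is a token-to-vector table, a text embedding is the mean of its token vectors, and symmetric int8 quantization shrinks the result for an index.

```python
import random

# Toy static embedding model: a fixed token -> vector table. In a real
# model like gobed's these vectors are trained; here they're random.
random.seed(0)
DIM = 8
table = {}

def token_vec(tok):
    # Lazily assign a vector per token for the sketch.
    if tok not in table:
        table[tok] = [random.uniform(-1.0, 1.0) for _ in range(DIM)]
    return table[tok]

def embed(text):
    # A text embedding is just the mean of its token vectors:
    # a table lookup plus averaging, no neural forward pass.
    vecs = [token_vec(t) for t in text.lower().split()]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def quantize_int8(vec):
    # Symmetric int8 quantization: scale so the max magnitude maps to 127.
    scale = max(abs(v) for v in vec) / 127 or 1.0
    return [round(v / scale) for v in vec], scale

emb = embed("fast local retrieval")
q, scale = quantize_int8(emb)
print(len(emb), min(q), max(q))
```

Because embedding is a lookup plus an average, it's trivially parallel, which is what makes fusing it with the search kernel plausible.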
For the purposes of learning, I’ve built a chatbot using ollama, streamlit, chromadb and docling. Mostly playing around with embedding and chunking on a document library.
Local LibreChat which bundles a vector db for docs.
A little BM25 can get you quite a way with an LLM.
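For reference, "a little BM25" really is little; a minimal pure-Python sketch of Okapi BM25 scoring (toy documents, conventional default k1/b parameters):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc against the query with classic Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    # Document frequency per term.
    df = Counter()
    for d in tokenized:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for t in query.lower().split():
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(s)
    return scores

docs = [
    "how to configure the build system",
    "retrieval augmented generation pipelines",
    "notes on retrieval and ranking",
]
print(bm25_scores("retrieval ranking", docs))
```

Feed the top-scoring chunks into the LLM prompt and you have the whole pipeline, no vector database required.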
I don't. I actually write code.
To answer the question more directly, I've spent the last couple of years running a few different quant models, mostly on llama.cpp or ollama depending on the model. The results are way slower than the paid token API versions, but they're completely free of external influence and cost.
However, the models I've tested generally turn out to be pretty dumb at the quant level I run to stay relatively fast, and their code generation is a mess I'd rather not deal with.
SQLite with FTS5
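The FTS5 route is only a few lines; a minimal sketch with Python's built-in sqlite3 (assuming your SQLite build ships the FTS5 extension, which the standard Python distribution's usually does):

```python
import sqlite3

# In-memory DB for the sketch; point this at a file for real use.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(path, body)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [
        ("notes/rag.md", "retrieval augmented generation with local models"),
        ("notes/db.md", "sqlite fts5 full text search and bm25 ranking"),
    ],
)

# SQLite's bm25() returns smaller values for better matches,
# so ORDER BY it ascending to put the best hit first.
rows = conn.execute(
    "SELECT path FROM docs WHERE docs MATCH ? ORDER BY bm25(docs)",
    ("bm25",),
).fetchall()
print(rows)
```

Zero extra dependencies, and the index lives in the same file as the rest of your data.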
I have done some experiments with nomic embedding through Ollama and ChromaDB.
Works well, but I haven't tested it at a larger scale.
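For the Ollama side of a setup like that, the embeddings call is a single POST to the local server. A stdlib-only sketch that just builds the request (endpoint and model name taken from Ollama's API docs as I understand them; adjust if your version differs, and note the commented-out line needs a running server):

```python
import json
import urllib.request

# Local Ollama server's embeddings endpoint (default port assumed).
OLLAMA_URL = "http://localhost:11434/api/embeddings"

def embed_request(text, model="nomic-embed-text"):
    # Build the request; the response JSON carries an "embedding" list.
    payload = json.dumps({"model": model, "prompt": text}).encode()
    return urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )

req = embed_request("hello world")
print(json.loads(req.data))
# vec = json.load(urllib.request.urlopen(req))["embedding"]  # needs ollama running
```

The resulting vectors then go into ChromaDB (or any other store) as-is.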
SurrealDB coupled with local vectorization. Mac M1 16GB
Is there a thread for hardware used for local LLMs?
Anyone have suggestions for doing semantic caching?
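One common shape for semantic caching, sketched with a toy bag-of-words "embedding" (everything here is illustrative; swap in a real embedding model and a real vector store): key the cache by query embedding, and return a cached response when a new query's cosine similarity clears a threshold.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words vector; a real setup would use an embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, query):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None  # cache miss: call the LLM, then put() the answer

    def put(self, query, response):
        self.entries.append((embed(query), response))

cache = SemanticCache(threshold=0.8)
cache.put("how do I reset my password", "Use the reset link on the login page.")
print(cache.get("how do i reset my password"))   # near-duplicate query: hit
print(cache.get("what is the deploy process"))   # unrelated query: miss
```

The threshold is the whole game: too low and you serve wrong answers, too high and you never hit.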
I am curious what are you using local RAG for?
i thought rag/embeddings were dead with the large context windows. that's what i get for listening to chatgpt.
I thought that context building via tooling was shown to be more effective than rag in practically every way?
Question being: WHY would I be doing RAG locally?
I just use a web server and a search engine.
TL;DR:
- chunk files, index chunks
- vector/hybrid search over the index
- node app to handle requests (it was the quickest to implement, and LLMs understand OpenAPI well)
I wrote about it here: https://laurentcazanove.com/blog/obsidian-rag-api
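The chunk-then-search part of a setup like that fits in a few lines; a sketch with toy term-frequency vectors standing in for real embeddings (chunk sizes are arbitrary, and a hybrid setup would also mix in BM25 scores):

```python
import math
from collections import Counter

def chunk(text, size=80, overlap=20):
    # Fixed-size character chunks with overlap: the simplest chunking scheme.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def vec(text):
    # Toy term-frequency vector; a real pipeline would embed each chunk.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

doc = (
    "Deployment is handled by a small script that builds the image and "
    "pushes it to the registry. Rollbacks reuse the previous image tag. "
    "Secrets live in the environment, never in the repo."
)
index = [(c, vec(c)) for c in chunk(doc)]             # chunk files, index chunks
query = vec("how do rollbacks work")
best = max(index, key=lambda e: cosine(query, e[1]))  # search over the index
print(best[0])
```

The winning chunk is what you hand to the LLM as context, via whatever request handler you prefer.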
simple lil setup with qdrant
AnythingLLM is promising
sqlite's bm25
try out chroma, or better yet ask opus to!
Grep (rg)
Whatever "RAG" is...
Embedded usearch vector database. https://github.com/unum-cloud/USearch