Hacker News

Ask HN: How are you doing RAG locally?

134 points by tmaly yesterday at 2:38 PM | 46 comments

I am curious: how are people doing RAG locally, with minimal dependencies, for internal code or complex documents?

Are you using a vector database, some type of semantic search, a knowledge graph, a hypergraph?


Comments

lsb today at 9:15 AM

I'm using Sonnet with a 1M context window at work, just stuffing everything into the window (it works fine for now), and I'm hoping to investigate Recursive Language Models with DSPy when I'm using local models with Ollama.

esperent today at 7:48 AM

I'm lucky enough to have 95% of my docs in small markdown files, so I'm just... not (+). I'm using SQLite FTS5 (full text search) to build a normal search index and using that. Well, I already had the index, so I just wired it up to my mastra agents. Each file has a short description field, so if a keyword search surfaces a doc, the agents check the description and, if it matches, load the whole doc.

This took about one hour to set up and works very well.

(+) At least, I don't think this counts as RAG. I'm honestly a bit hazy on the definition. But there's no vectordb anyway.
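A rough sketch of this kind of setup, for reference. It assumes SQLite's FTS5 extension is available (it is in most Python builds); the table, column names, and the description-check step are illustrative placeholders, not esperent's actual code.

  import sqlite3

  db = sqlite3.connect("docs.db")
  db.execute(
      "CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(path, description, body)"
  )

  def index_doc(path, description, body):
      db.execute("INSERT INTO docs VALUES (?, ?, ?)", (path, description, body))
      db.commit()

  def keyword_search(query, k=5):
      # bm25() is FTS5's built-in ranking function (lower scores are better)
      return db.execute(
          "SELECT path, description FROM docs WHERE docs MATCH ? "
          "ORDER BY bm25(docs) LIMIT ?",
          (query, k),
      ).fetchall()

  # The agent scans the returned descriptions and, when one looks relevant,
  # loads the whole file at `path` into context.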

CuriouslyC today at 4:55 AM

Don't use a vector database for code; embeddings are slow and a poor fit for code. Code likes BM25 + trigrams, which gets better results while keeping search responses snappy.
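One low-dependency way to get BM25 over trigrams, for anyone curious: SQLite FTS5 ships a trigram tokenizer (SQLite 3.34+) and bm25() ranking, so a sketch like the following works with no extra services. The table name and sample row are made up.

  import sqlite3

  db = sqlite3.connect("code_index.db")
  # The trigram tokenizer indexes 3-character substrings, so partial
  # identifiers still match; bm25() handles the ranking.
  db.execute(
      "CREATE VIRTUAL TABLE IF NOT EXISTS code "
      "USING fts5(path, content, tokenize='trigram')"
  )
  db.execute(
      "INSERT INTO code VALUES (?, ?)",
      ("src/search.py", "def rank_results(query, docs): ..."),
  )

  rows = db.execute(
      "SELECT path, bm25(code) AS score FROM code WHERE code MATCH ? "
      "ORDER BY score LIMIT 10",
      ('"rank_res"',),  # quoted so FTS5 treats it as a literal phrase
  ).fetchall()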

spqw today at 7:54 AM

I am surprised to see so few setups leveraging LSP (Language Server Protocol) support. It was added to Claude Code last month. Most setups rely on naive grep.

Bombthecat today at 8:21 AM

AnythingLLM for documents, amazing tool!

softwaredoug today at 8:18 AM

I built a Pandas extension, SearchArray, and I just use that (plus in-memory embeddings) for any toy thing.

https://github.com/softwaredoug/searcharray

autogn0me today at 7:06 AM

https://github.com/ggozad/haiku.rag/ - the embedded LanceDB is convenient and the project has benchmarks; it uses Docling. I run qwen3-embedding:4b (2560 dimensions) with gpt-oss:20b.

baalimago today at 7:32 AM

I thought that context building via tooling was shown to be more effective than RAG in practically every way?

Question being: WHY would I be doing RAG locally?

cbcoutinho today at 6:43 AM

The Nextcloud MCP Server [0] supports Qdrant as a vector db to store embeddings and provide semantic search across your personal documents. This turns any LLM and MCP client (e.g. Claude Code) into a RAG system that you can use to chat with your files.

For local deployments, Qdrant supports storing embeddings in memory as well as in a local directory (similar to SQLite); for larger deployments, Qdrant can run as a standalone service/sidecar and be made available over the network.

[0] https://github.com/cbcoutinho/nextcloud-mcp-server
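For reference, Qdrant's local/embedded mode looks roughly like this; a sketch, not the Nextcloud server's actual code. The collection name, vector size, and the embed() stub are placeholders for whatever embedding model you use.

  from qdrant_client import QdrantClient
  from qdrant_client.models import Distance, VectorParams, PointStruct

  def embed(text):
      # Placeholder: swap in a real embedding model
      return [0.0] * 768

  client = QdrantClient(path="./qdrant_data")  # or QdrantClient(":memory:")

  client.create_collection(
      collection_name="docs",
      vectors_config=VectorParams(size=768, distance=Distance.COSINE),
  )
  client.upsert(
      collection_name="docs",
      points=[
          PointStruct(id=1, vector=embed("backup policy text"),
                      payload={"path": "policies/backup.md"}),
      ],
  )
  hits = client.search(
      collection_name="docs",
      query_vector=embed("how long are backups kept?"),
      limit=5,
  )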

rahimnathwani yesterday at 4:31 PM

If your data aren't too large, you can use faiss-cpu and pickle

https://pypi.org/project/faiss-cpu/
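A minimal sketch of that approach: faiss holds the vectors, pickle holds the chunk texts. The embed() stub, dimension, and file names are placeholders; swap in a real embedding model.

  import pickle
  import numpy as np
  import faiss

  DIM = 384

  def embed(texts):
      # Placeholder: replace with a real embedding model
      rng = np.random.default_rng(0)
      return rng.random((len(texts), DIM), dtype=np.float32)

  chunks = ["first chunk of a doc", "second chunk", "something about backups"]
  index = faiss.IndexFlatL2(DIM)
  index.add(embed(chunks))

  # Persist both the index and the chunk texts
  faiss.write_index(index, "docs.index")
  with open("chunks.pkl", "wb") as f:
      pickle.dump(chunks, f)

  # Query: embed the question and look up the nearest chunks
  distances, ids = index.search(embed(["what about backups?"]), 2)
  print([chunks[i] for i in ids[0]])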

geuis today at 7:59 AM

I don't. I actually write code.

To answer the question more directly, I've spent the last couple of years with a few different quantized models, mostly running on llama.cpp or Ollama, depending. The results are way slower than the paid token API versions, but they are completely free of external influence and cost.

However, the models I've tested generally turn out to be pretty dumb at the quantization levels I have to run them at to stay relatively fast. And their code generation capabilities are just a mess, not worth dealing with.

beret4breakfast today at 7:39 AM

For the purposes of learning, I’ve built a chatbot using ollama, streamlit, chromadb and docling. Mostly playing around with embedding and chunking on a document library.

init0 today at 5:33 AM

I built a lib for myself https://pypi.org/project/piragi/

dvorka today at 6:41 AM

Any suggestions for what to use as an embedding model runtime and for semantic search in C++?

ehsanu1 today at 6:38 AM

Embedded usearch vector database. https://github.com/unum-cloud/USearch
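Roughly, for anyone who hasn't seen it; the dimension, key, vectors, and file name here are arbitrary examples, not a recommended setup.

  import numpy as np
  from usearch.index import Index

  index = Index(ndim=256, metric="cos")          # cosine similarity
  vec = np.random.rand(256).astype(np.float32)   # stand-in for a real embedding
  index.add(42, vec)                             # 42 is an arbitrary integer key
  matches = index.search(vec, 5)                 # top-5 nearest keys + distances
  index.save("docs.usearch")                     # single-file, embedded persistence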

lormayna today at 7:00 AM

I have done some experiments with nomic embedding through Ollama and ChromaDB.

Works well, but I haven't tested it at larger scale.
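That combination looks roughly like this. It assumes `ollama pull nomic-embed-text` has been run and the ollama and chromadb Python packages are installed; the document contents and collection name are made up.

  import ollama
  import chromadb

  def embed(text):
      return ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]

  client = chromadb.PersistentClient(path="./chroma")
  collection = client.get_or_create_collection("docs")

  docs = {
      "notes.md": "the backup job runs nightly at 02:00",
      "readme.md": "project overview and setup instructions",
  }
  collection.add(
      ids=list(docs),
      documents=list(docs.values()),
      embeddings=[embed(t) for t in docs.values()],
  )

  results = collection.query(query_embeddings=[embed("when do backups run?")], n_results=1)
  print(results["documents"])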

Strift today at 7:47 AM

I just use a web server and a search engine.

TL;DR:
- chunk files, index chunks
- vector/hybrid search over the index
- node app to handle requests (it was the quickest to implement, and LLMs understand OpenAPI well)

I wrote about it here: https://laurentcazanove.com/blog/obsidian-rag-api

eajr yesterday at 3:57 PM

Local LibreChat, which bundles a vector db for docs.

motakuk yesterday at 4:42 PM

LightRAG, with Archestra as a UI and the LightRAG MCP.

lee1012 today at 4:58 AM

lee101/gobed (https://github.com/lee101/gobed): static embedding models, so texts are embedded in milliseconds, plus on-GPU search with a CAGRA-style on-GPU index. There are a few things for speed, like int8 quantization on the embeddings and fusing embedding and search into the same kernel, since the embedding really is just a trained map of per-token embeddings plus averaging.

whattheheckheck yesterday at 4:10 PM

AnythingLLM is promising

jeanloolz today at 5:44 AM

sqlite-vec

nineteen999 yesterday at 8:19 PM

A little BM25 can get you quite a way with an LLM.
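For instance, something as small as rank_bm25 plus a local Ollama call can stand in for a whole RAG stack. A sketch, with made-up chunks and an example model name:

  import ollama
  from rank_bm25 import BM25Okapi

  chunks = [
      "The backup job runs nightly at 02:00 via cron.",
      "Deployments are triggered from the main branch.",
      "Logs are rotated weekly and kept for 30 days.",
  ]
  bm25 = BM25Okapi([c.lower().split() for c in chunks])

  question = "When do backups run?"
  context = bm25.get_top_n(question.lower().split(), chunks, n=2)

  prompt = ("Answer using only this context:\n" + "\n".join(context)
            + "\n\nQuestion: " + question)
  reply = ollama.chat(model="llama3.2", messages=[{"role": "user", "content": prompt}])
  print(reply["message"]["content"])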

jeffchuber today at 4:36 AM

try out Chroma, or better yet ask Opus to!

electroglyph today at 4:46 AM

simple lil setup with qdrant

pdyc today at 5:30 AM

sqlite's bm25

ramesh31 yesterday at 7:54 PM

SQLite with FTS5

undergrowth today at 5:18 AM

undergrowth.io
