How are people doing RAG locally, with minimal dependencies, for internal code or complex documents?
Are you using a vector database, some type of semantic search, a knowledge graph, a hypergraph?
I'm lucky enough to have 95% of my docs in small markdown files, so I'm just... not (+). I'm using SQLite FTS5 (full-text search) to build a normal search index and using that. Well, I already had the index, so I just wired it up to my mastra agents. Each file has a short description field, so if a keyword search surfaces a doc, the agent checks the description and, if it matches, loads the whole doc (roughly sketched below).
This took about one hour to set up and works very well.
(+) At least, I don't think this counts as RAG. I'm honestly a bit hazy on the definition. But there's no vectordb anyway.
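In case it's useful to anyone, a minimal sketch of that kind of FTS5 index in Python's sqlite3 (table and column names are made up, and the agent wiring is omitted):

    import sqlite3

    con = sqlite3.connect("docs.db")
    # One row per markdown file: path, short description, full body.
    con.execute("CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(path, description, body)")
    con.execute("INSERT INTO docs VALUES (?, ?, ?)", (
        "runbooks/deploys.md",
        "How we deploy services to staging and prod",
        "...full markdown body...",
    ))
    con.commit()

    # FTS5 ranks matches with BM25 by default; return path + description so the
    # agent can check the description before deciding to load the whole file.
    hits = con.execute(
        "SELECT path, description FROM docs WHERE docs MATCH ? ORDER BY rank LIMIT 5",
        ("deploy staging",),
    ).fetchall()
    print(hits)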
Don't use a vector database for code; embeddings are slow and a poor fit for code. Code likes BM25 + trigrams, which gets better results while keeping search responses snappy.
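If you want to try BM25 + trigrams with zero extra dependencies, SQLite's FTS5 has a trigram tokenizer (SQLite 3.34+) and ranks with BM25 by default. A rough sketch with a made-up schema:

    import sqlite3

    con = sqlite3.connect("code.db")
    # The trigram tokenizer gives substring-style matching on identifiers;
    # ranking is plain BM25.
    con.execute("CREATE VIRTUAL TABLE IF NOT EXISTS code USING fts5(path, src, tokenize='trigram')")
    con.execute("INSERT INTO code VALUES (?, ?)", (
        "src/retry.go",
        "func retryWithBackoff(ctx context.Context) error { /* ... */ }",
    ))
    con.commit()

    # A partial identifier still matches thanks to trigrams.
    hits = con.execute(
        "SELECT path FROM code WHERE code MATCH ? ORDER BY rank LIMIT 20",
        ("WithBackoff",),
    ).fetchall()
    print(hits)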
I am surprised to see so few setups leveraging LSP (Language Server Protocol) support; it was added to Claude Code last month. Most setups rely on naive grep.
AnythingLLM for documents, amazing tool!
I built a Pandas extension, SearchArray, and I just use that (plus in-memory embeddings) for any toy thing.
https://github.com/ggozad/haiku.rag/ - the embedded LanceDB is convenient and it has benchmarks; uses docling. qwen3-embedding:4b (2560-dim) with gpt-oss:20b.
I thought context building via tooling was shown to be more effective than RAG in practically every way?
Question being: WHY would I be doing RAG locally?
The Nextcloud MCP Server [0] supports Qdrant as a vector DB to store embeddings and provide semantic search across your personal documents. This turns any LLM + MCP client (e.g. Claude Code) into a RAG system you can use to chat with your files.
For local deployments, Qdrant supports storing embeddings in memory as well as in a local directory (similar to sqlite) - for larger deployments Qdrant supports running as a standalone service/sidecar and can be made available over the network.
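For reference, a minimal sketch of those local modes with the Python qdrant-client (assuming a recent client version; collection name, vector size, and payload are placeholders):

    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, PointStruct, VectorParams

    # path="..." persists to a local directory with no server process;
    # QdrantClient(":memory:") keeps everything in RAM; for bigger setups
    # you point the client at a running Qdrant service via url=... instead.
    client = QdrantClient(path="./qdrant_data")

    client.create_collection(
        collection_name="notes",
        vectors_config=VectorParams(size=768, distance=Distance.COSINE),
    )
    client.upsert(
        collection_name="notes",
        points=[PointStruct(id=1, vector=[0.1] * 768, payload={"path": "notes/todo.md"})],
    )

    hits = client.query_points(collection_name="notes", query=[0.1] * 768, limit=5)
    print(hits.points)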
If your data aren't too large, you can use faiss-cpu and pickle
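A minimal sketch of that pattern, with random vectors standing in for real embeddings:

    import pickle

    import faiss
    import numpy as np

    dim = 384
    chunks = ["first chunk of text", "second chunk of text"]
    vectors = np.random.rand(len(chunks), dim).astype("float32")  # stand-in for real embeddings

    faiss.normalize_L2(vectors)            # normalize so inner product == cosine similarity
    index = faiss.IndexFlatIP(dim)         # exact search, no training step
    index.add(vectors)

    # Persist the index and the chunk texts side by side.
    faiss.write_index(index, "chunks.faiss")
    with open("chunks.pkl", "wb") as f:
        pickle.dump(chunks, f)

    # Query: embed the query the same way, then look up the top-k chunks.
    query = np.random.rand(1, dim).astype("float32")
    faiss.normalize_L2(query)
    scores, ids = faiss.read_index("chunks.faiss").search(query, 2)
    print([chunks[i] for i in ids[0]])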
I don't. I actually write code.
To answer the question more directly: I've spent the last couple of years running a few different quantized models, mostly on llama.cpp and ollama, depending. The results are way slower than the paid token APIs, but they are completely free of external influence and cost.
However, the models I've tested generally turn out to be pretty dumb at the quantization level I need to run them at to stay relatively fast, and their code generation is a mess not worth dealing with.
For the purposes of learning, I’ve built a chatbot using ollama, streamlit, chromadb and docling. Mostly playing around with embedding and chunking on a document library.
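For anyone wanting to try the same thing, a rough sketch of the embed-and-store loop with the Python chromadb and ollama clients (file path, chunk size, and model name are placeholders; docling and streamlit are left out):

    import chromadb
    import ollama

    client = chromadb.PersistentClient(path="./chroma")
    collection = client.get_or_create_collection("library")

    def embed(text: str) -> list[float]:
        # nomic-embed-text is one option; newer ollama clients also offer ollama.embed().
        return ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]

    # Naive fixed-size chunking; docling output would let you chunk on structure instead.
    doc = open("library/handbook.md").read()
    chunks = [doc[i:i + 1000] for i in range(0, len(doc), 1000)]
    collection.add(
        ids=[f"handbook-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=[embed(c) for c in chunks],
    )

    results = collection.query(query_embeddings=[embed("vacation policy")], n_results=3)
    print(results["documents"][0])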
Any suggestions for what to use as an embedding model runtime and for semantic search in C++?
Embedded usearch vector database. https://github.com/unum-cloud/USearch
I have done some experiments with nomic embedding through Ollama and ChromaDB.
Works well, but I haven't tested it at larger scale.
I just use a web server and a search engine.
TL;DR:
- chunk files, index the chunks
- vector/hybrid search over the index (see the sketch below)
- a Node app to handle requests (it was the quickest to implement, and LLMs understand OpenAPI well)
I wrote about it here: https://laurentcazanove.com/blog/obsidian-rag-api
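The hybrid part can be as simple as merging the keyword and vector result lists; a language-agnostic sketch in Python using reciprocal rank fusion (the post's actual implementation is a Node app and may fuse results differently):

    def rrf(keyword_hits: list[str], vector_hits: list[str], k: int = 60) -> list[str]:
        """Merge two ranked lists of chunk ids with reciprocal rank fusion."""
        scores: dict[str, float] = {}
        for hits in (keyword_hits, vector_hits):
            for rank, chunk_id in enumerate(hits):
                scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
        return sorted(scores, key=scores.get, reverse=True)

    # Chunks found by both searches float to the top.
    print(rrf(["a", "b", "c"], ["c", "a", "d"]))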
Local LibreChat which bundles a vector db for docs.
LightRAG, with Archestra as a UI via the LightRAG MCP server.
lee101/gobed (https://github.com/lee101/gobed): static embedding models, so texts are embedded in milliseconds, plus GPU search with a CAGRA-style on-GPU index. A few tricks for speed, like int8 quantization of the embeddings and fusing embedding and search into the same kernel, since the embedding really is just a trained map of per-token embeddings that gets averaged.
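That last point (a static embedding is just a per-token lookup that gets averaged) is easy to picture with a toy sketch, using made-up vectors:

    import numpy as np

    # Toy static embedding table: one fixed vector per token, no transformer pass.
    vocab = {"vector": 0, "database": 1, "search": 2}
    table = np.random.rand(len(vocab), 8).astype("float32")

    def embed(text: str) -> np.ndarray:
        # Look up each known token and average - this is why static models
        # can embed in milliseconds.
        ids = [vocab[t] for t in text.lower().split() if t in vocab]
        return table[ids].mean(axis=0)

    print(embed("vector search"))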
AnythingLLM is promising
Sqlite-vec
A little BM25 can get you quite a way with an LLM.
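If you want the zero-infrastructure version of that, the rank_bm25 package is a few lines (the corpus here is a placeholder):

    from rank_bm25 import BM25Okapi

    corpus = [
        "how to rotate the api keys for the billing service",
        "runbook for restoring the postgres replica",
        "style guide for internal documentation",
    ]
    bm25 = BM25Okapi([doc.split() for doc in corpus])

    query = "restore postgres backup".split()
    # The top matches go into the LLM prompt as context.
    print(bm25.get_top_n(query, corpus, n=2))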
try out chroma, or better yet ask opus to!
simple lil setup with qdrant
sqlite's bm25
SQLite with FTS5
undergrowth.io
I'm using Sonnet with the 1M context window at work, just stuffing everything into the window (it works fine for now), and I'm hoping to investigate Recursive Language Models with DSPy once I'm running local models with Ollama.