Hacker News

Ask HN: How are you doing RAG locally?

413 points by tmaly · 01/14/2026 · 156 comments

I am curious how people are doing RAG locally with minimal dependencies for internal code or complex documents?

Are you using a vector database, some type of semantic search, a knowledge graph, a hypergraph?


Comments

navar · 01/15/2026

For the retrieval stage, we have developed a highly efficient text embedding model that is friendly to CPU-only setups:

https://huggingface.co/MongoDB/mdbr-leaf-ir

It ranks #1 on a bunch of leaderboards for models of its size. It can be used interchangeably with the model it has been distilled from (https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1...).

You can see an example comparing semantic (i.e., embeddings-based) search vs bm25 vs hybrid here: http://search-sensei.s3-website-us-east-1.amazonaws.com (warning! It will download ~50MB of data for the model weights and onnx runtime on first load, but should otherwise run smoothly even on a phone)

This mini app illustrates the advantage of semantic over BM25 search. For instance, embedding models "know" that "j lo" refers to Jennifer Lopez.

We have also published the recipe for training this type of model, if you're interested in doing so; we show that it can be done on relatively modest hardware, and training data is very easy to obtain: https://arxiv.org/abs/2509.12539
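
If you want to kick the tires, here's a minimal sketch of using it through sentence-transformers (loading by model id as on the Hugging Face card; whether queries need a specific prompt prefix is something to verify on the card):

  from sentence_transformers import SentenceTransformer

  # CPU-only is the point of this model
  model = SentenceTransformer("MongoDB/mdbr-leaf-ir")

  docs = ["Jennifer Lopez is an American singer.", "BM25 is a ranking function."]
  doc_emb = model.encode(docs, normalize_embeddings=True)
  query_emb = model.encode(["who is j lo?"], normalize_embeddings=True)

  # with normalized vectors, cosine similarity is just a dot product
  print(query_emb @ doc_emb.T)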

__jf__ · 01/15/2026

For vector generation I started using Meta-Llama-3-8B in April 2024, with Python and Transformers, for each text chunk on an RTX A6000. Wow, that thing was fast, but it was noisy and also burns 500 W. So a year ago I switched to an M1 Ultra and only had to replace Transformers with Apple's MLX Python library. Approximately the same speed, but less heat and noise. The Llama model has 4k dimensions, so at fp16 that's 8 kilobytes per chunk, which I store in a BLOB column in SQLite via numpy.save(). Between running on the RTX and the M1 there is a very small difference in vector output, but not enough to change retrieval results, regenerate the vectors, or switch to another LLM.

For retrieval I load all the vectors from the SQLite database into a numpy array and hand it to FAISS. Faiss-gpu was impressively fast on the RTX A6000; faiss-cpu is slower on the M1 Ultra but still fast enough for my purposes (I'm firing a few queries per day, not per minute). For 5 million chunks, memory usage is around 40 GB, which fit into the A6000 and easily fits into the 128 GB of the M1 Ultra. It works, I'm happy.
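
A minimal sketch of that pattern: fp16 vectors written with numpy.save() into a BLOB column, then loaded into one array for exact FAISS search. The table name and the cosine normalization are my assumptions, not from the comment:

  import io, sqlite3
  import numpy as np
  import faiss  # faiss-cpu

  DIM = 4096  # Llama-3-8B hidden size: 4096 dims x 2 bytes = 8 KB per chunk at fp16

  conn = sqlite3.connect("chunks.db")
  conn.execute("CREATE TABLE IF NOT EXISTS chunks (id INTEGER PRIMARY KEY, emb BLOB)")

  def save_vec(chunk_id, vec):
      buf = io.BytesIO()
      np.save(buf, vec.astype(np.float16))  # serialize the fp16 vector into the BLOB
      conn.execute("INSERT INTO chunks (id, emb) VALUES (?, ?)", (chunk_id, buf.getvalue()))

  def build_index():
      rows = conn.execute("SELECT emb FROM chunks ORDER BY id").fetchall()
      xb = np.stack([np.load(io.BytesIO(r[0])).astype(np.float32) for r in rows])
      faiss.normalize_L2(xb)           # normalize so inner product behaves like cosine
      index = faiss.IndexFlatIP(DIM)   # exact, brute-force search
      index.add(xb)
      return index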

beklein · 01/15/2026

Most of my complex documents are, luckily, Markdown files.

I can recommend https://github.com/tobi/qmd/ . It’s a simple CLI tool for searching in these kinds of files. My previous workflow was based on fzf, but this tool gives better results and enables even more fuzzy queries. I don’t use it for code, though.

show 1 reply
CuriouslyC · 01/15/2026

Don't use a vector database for code; embeddings are slow and a poor fit for it. Code works best with BM25 + trigram search, which gets better results while keeping search responses snappy.
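
One cheap way to approximate that combo with stock SQLite: FTS5's built-in trigram tokenizer plus its bm25() ranking (requires SQLite >= 3.34; the schema here is illustrative, not something the comment prescribes):

  import sqlite3

  conn = sqlite3.connect("code.db")
  conn.execute(
      "CREATE VIRTUAL TABLE IF NOT EXISTS code_idx "
      "USING fts5(path, body, tokenize='trigram')"
  )

  def index_file(path, body):
      conn.execute("INSERT INTO code_idx (path, body) VALUES (?, ?)", (path, body))

  def search(query, k=10):
      # bm25() scores are lower-is-better in FTS5; trigram queries need >= 3 chars
      return conn.execute(
          "SELECT path, bm25(code_idx) AS score FROM code_idx "
          "WHERE code_idx MATCH ? ORDER BY score LIMIT ?",
          (query, k),
      ).fetchall()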

esperent · 01/15/2026

I'm lucky enough to have 95% of my docs in small Markdown files, so I'm just... not (+). I'm using SQLite FTS5 (full-text search) to build a normal search index and using that. Well, I already had the index, so I just wired it up to my Mastra agents. Each file has a short description field, so if a keyword search surfaces a doc, the agents check the description and, if it matches, load the whole doc.

This took about one hour to set up and works very well.

(+) At least, I don't think this counts as RAG. I'm honestly a bit hazy on the definition. But there's no vectordb anyway.

eb0la · 01/15/2026

We started with PGVector just because we already knew Postgres and it was easy to hand over to the operations people.

After some time we noticed that a semi-structured field in the prompt had a 100% match with the content needed to process the prompt.

Turns out operators had started putting tags in both the input and the documents that needed to match, for every use case (not many: about 50 docs).

Now we look for the field first and put the corresponding file in the prompt, then we look for matches in the database using the embedding.

85% of the time we don't need the vectordb.

theahura · 01/15/2026

SQLite works shockingly well. The agents know how to write good queries, know how to chain queries, and can generally manipulate the DB however they need. At nori (https://usenori.ai/watchtower) we use SQLite + vec0 + fts5 for semantic and word search.

scosman · 01/15/2026

Kiln wraps up all the parts in one app. Just drag and drop in files. You can easily compare different configs on your dataset: extraction methods, embedding model, search method (BM25, hybrid, vector), etc.

It uses LanceDB and has dozens of different extraction/embedding models to choose from. It even has evals for checking retrieval accuracy, including automatically generating the eval dataset.

You can use its UI, or call the RAG via MCP.

https://github.com/kiln-ai/kiln

https://docs.kiln.tech/docs/documents-and-search-rag

acutesoftware · 01/15/2026

I am using LangChain with a SQLite database. It works pretty well on a 16 GB GPU, but I started running it on a crappy NUC, which also worked, with lesser results.

The real lightbulb moment is when you realise the ONLY thing RAG passes to the LLM is a short string of search results with small chunks of text. This changes it from 'magic' to 'ahh, ok, I need better search results'. With small models you cannot pass a lot of search results (TOP_K=5 is probably the limit), otherwise the small models 'forget context'.
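
That handoff is small enough to sketch. Everything below is illustrative rather than from the parent's code; it just shows that the "R" step reduces to assembling a short prompt from the top-k hits:

  TOP_K = 5  # small models lose the thread beyond roughly this many chunks

  def build_prompt(question, hits):
      # hits: list of (chunk_text, score) from your search index, best first
      context = "\n\n".join(
          f"[{i + 1}] {text}" for i, (text, _score) in enumerate(hits[:TOP_K])
      )
      return (
          "Answer using only the context below.\n\n"
          f"Context:\n{context}\n\n"
          f"Question: {question}"
      )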

It is fun trying to get decent results, and it is a rabbit hole. The next step I am going into is pre-summarising files and folders.

I open sourced the code I was using - https://github.com/acutesoftware/lifepim-ai-core

amscotti · 01/15/2026

More of a proof of concept to test out ideas, but here's my approach for local RAG: https://github.com/amscotti/local-LLM-with-RAG

Using Ollama for the embeddings with “nomic-embed-text”, and LanceDB for the vector database. Recently updated it to use “agentic” RAG, though that's probably not needed for a small project.
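
A minimal sketch of that pairing, as I understand the ollama and lancedb Python clients (the data and table name are illustrative; verify the call signatures against each project's docs):

  import lancedb
  import ollama  # assumes a local Ollama server with nomic-embed-text pulled

  def embed(text):
      return ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]

  db = lancedb.connect("./lancedb")
  table = db.create_table(
      "docs",
      data=[{"text": t, "vector": embed(t)} for t in ["first doc", "second doc"]],
  )

  # nearest-neighbor search over the stored vectors
  for hit in table.search(embed("my question")).limit(3).to_list():
      print(hit["text"], hit["_distance"])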

autogn0me · 01/15/2026

https://github.com/ggozad/haiku.rag/ - the embedded LanceDB is convenient and the project has benchmarks; it uses Docling. I run qwen3-embedding:4b (2560 dimensions) with gpt-oss:20b.

juanre · 01/15/2026

I built https://github.com/juanre/llmemory and I use it both locally and as part of company apps. Quite happy with the performance.

It uses PostgreSQL with pgvector, hybrid BM25, multi-query expansion, and reranking.

(It's the first time I've shared it publicly, so I'm sure there will be quirks.)

lmeyerov · 01/15/2026

Claude Code / Codex, which internally use ripgrep (I'm unsure if in parallel mode), plus project-specific static analyzers.

Studies generally show that agentic retrieval with text search is pretty good. Adding vector retrieval and graph RAG (the typical parallel multi-retrieval followed by reranking) gives a bit of speedup and quality lift. That lines up with my local-workflow experience: the lift is enough that I want it from $$$$ consumer/prosumer tools, but not easy enough to DIY that I want to invest in it locally. And for those who struggle with tools like Spotlight running when it shouldn't, that kind of background indexing turns me off on the cost/benefit side.

For code, I experiment with unsound tools (semgrep, ...) vs sound flow analyzers, carefully set up for the project. Basically, AI coders love to use grep/sed for global replace refactors and other global needs, but keep getting tripped up on sound flow analysis. As with lint and type checking, that needs to be set up per project and taught as a skill. I'm not happy with any of my experiments here yet, however :(

marwamc · 01/15/2026

BM25 has been sufficient for my needs. I typically need to refer to the codebases of existing tools as reference sources (istio, envoy, oauth2-proxy, the tantivy index, etc.), so I just clone those repos, index them, and search away. I built a CLI and MCP tool for this workflow.

https://github.com/rhobimd-oss/shebe

One area where BM25 particularly shines is the refactoring workflow: say you want to upgrade your istio installation from 1.28 to 1.29, and maybe in 1.29 the AuthorizationPolicy CRD has a breaking change in one of its properties. BM25 lets you efficiently enumerate all the code locations in your codebase that need to change, and then you can set the CLI coders off using this list. Grep and LSP can also perform this enumeration, but they have shortcomings. Wrote about it here: https://github.com/rhobimd-oss/shebe/blob/main/WHY_SHEBE.md#...

spqw · 01/15/2026

I am surprised to see so few setups leveraging LSP (Language Server Protocol) support. It was added to Claude Code last month. Most setups rely on naive grep.

yokuze · 01/15/2026

I made, and use this: https://github.com/libragen/libragen

It’s a CLI tool and MCP server for creating discrete, versioned “libraries” of RAG-able content.

Under the hood, it uses an embedding model locally. It chunks your content and stores embeddings in SQLite. The search functionality uses vector + keyword search + a re-ranking model.

You can also point it at any GitHub repo and it will create a RAG DB out of it.

You can also use the MCP server to create and query the libraries.

Site: https://www.libragen.dev/

raghavankl · 01/15/2026

I have Python tooling to do indexing and relevance scoring offline using Ollama.

https://github.com/raghavan/pdfgptindexer-offline

threecheese · 01/16/2026

For my personal PKM slash “learn this crap” system, I have fully local hybrid search on my MacBook using MLX and SQLite.

I store file content blobs in SQLite and use FTS5 (BM25) to maintain a full-text index, plus sqlite-vec for storing embeddings. Search uses both, then reciprocal rank fusion merges the best results and pipes them to a local transformers model to judge. It’s all Python with the mlx-lm and mlx-embeddings libraries; the models are grabbed from Hugging Face. It’s not the fastest, but it’s local and easy to understand (and for Claude to write, mostly).
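
The fusion step is small enough to write by hand. A sketch of reciprocal rank fusion over the two ranked id lists (k=60 is the conventional constant; nothing here is specific to the parent's setup):

  # reciprocal rank fusion: each list votes 1/(k + rank) for its members
  def rrf(bm25_ids, vector_ids, k=60):
      scores = {}
      for ranked in (bm25_ids, vector_ids):
          for rank, doc_id in enumerate(ranked, start=1):
              scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
      return sorted(scores, key=scores.get, reverse=True)

  # e.g. rrf(["a", "b", "c"], ["c", "a", "d"]) ranks "a" and "c" ahead of "b" and "d"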

gaganyatri · 01/15/2026

Built discovery using:

- Qwen-3-VL-8B for document OCR + prompts + tool calls
- ChromaDB for vector storage
- BM25 + embedding model for hybrid RAG
- Backend: FastAPI + Python
- Frontend: React + TypeScript
- vLLM + Docker for model deployment on an L40 GPU

Demo: https://app.dwani.ai

GitHub: https://github.com/dwani-ai/discovery

Now working on adding agentic features, via continuous analysis of documents with generated prompts.

rahimnathwani · 01/14/2026

If your data aren't too large, you can use faiss-cpu and pickle:

https://pypi.org/project/faiss-cpu/
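
A minimal sketch of that setup, with random vectors standing in for real embeddings; faiss.serialize_index() makes the index picklable alongside your chunk texts:

  import pickle
  import numpy as np
  import faiss  # pip install faiss-cpu

  dim = 384                                    # hypothetical embedding size
  embeddings = np.random.rand(100, dim).astype("float32")
  chunks = [f"chunk {i}" for i in range(100)]  # the texts behind the vectors

  index = faiss.IndexFlatIP(dim)               # exact, brute-force inner product
  index.add(embeddings)

  # persist index + chunk texts together in one pickle
  with open("kb.pkl", "wb") as f:
      pickle.dump({"index": faiss.serialize_index(index), "chunks": chunks}, f)

  # reload and query
  with open("kb.pkl", "rb") as f:
      blob = pickle.load(f)
  index = faiss.deserialize_index(blob["index"])
  scores, ids = index.search(embeddings[:1], 5)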

init0 · 01/15/2026

I built a lib for myself: https://pypi.org/project/piragi/

oliveiracwb · 01/15/2026

We handle ~300k customer interactions per day, so latency and precision really matter. We built an internal RAG-based portal on top of our knowledge base (basically a much better FAQ).

On the retrieval side, I built a custom search/indexing layer (Node) specifically for service traceability and discovery. It uses a hybrid approach — embeddings + full-text search + IVF-HNSW — to index and cross-reference our APIs, services, proxies and orchestration repos. The RAG pipelines sit on top of this layer, which gives us reasonable recall and predictable latency.

Compliance and observability are still a problem. Every year new vendors show up promising audits, data lineage and observability, but none of them really handle the informational sprawl of ~600 distributed systems. The entropy keeps increasing.

Lately I’ve been experimenting with a more semantic/logical KAG approach on top of knowledge graphs to map business rules scattered across those systems. The goal is to answer higher-level questions about how things actually work — Palantir-like outcomes, but with explicit logic instead of magic.

Curious if others are moving beyond “pure RAG” toward graph-based or hybrid reasoning setups.

bzGoRust · 01/15/2026

In my company, we built the internal chatbot based on RAG using LangChain + Milvus + an LLM. Since the documents are well formatted, it is easy to do overlapping chunking; the resulting chunks are all inserted into the Milvus vector DB. The hybrid search (combining dense and sparse search) that Milvus supports natively helps us retrieve better, and thus the answers are of better quality.
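
For reference, the overlapping-chunking part is only a few lines. A sketch with illustrative sizes (the parent doesn't say what they use):

  # split text into fixed-size windows that overlap, so content
  # straddling a boundary appears intact in at least one chunk
  def chunk(text, size=800, overlap=200):
      step = size - overlap
      return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]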

folli · 01/15/2026

I was just working on a RAG implementation for >500k news articles, completely local, using postgres as a vector database: https://github.com/r-follador/TeletextSignals

I'm positively surprised at how well it works, especially if you also connect it to an LLM.

tschellenbach · 01/15/2026

Vector & BM25 on Turbopuffer. (see https://github.com/GetStream/Vision-Agents/blob/main/plugins...)

philip1209 · 01/15/2026

I run a Mac Mini home datacenter [1]. I've been using Chroma, Qwen 0.6B embeddings, and gpt-oss-20b to build a search agent over my blog.

[1]: https://www.contraption.co/a-mini-data-center/

podgietaru · 01/15/2026

I made a small RAG database just using Postgres; I outlined it in the blog post below. I use it for RSS feed organisation and searching. The items are small blobs of text. I do the labeling using a pseudo-KNN algorithm.

https://aws.amazon.com/blogs/machine-learning/use-language-e...

The code for it is here: https://github.com/aws-samples/rss-aggregator-using-cohere-e...

The example link no longer works, as I no longer work at AWS.

cbcoutinho · 01/15/2026

The Nextcloud MCP Server [0] supports Qdrant as a vector DB to store embeddings and provide semantic search across your personal documents. This turns any LLM & MCP client (e.g. Claude Code) into a RAG system that you can use to chat with your files.

For local deployments, Qdrant supports storing embeddings in memory as well as in a local directory (similar to SQLite); for larger deployments, Qdrant runs as a standalone service/sidecar and can be made available over the network.

[0] https://github.com/cbcoutinho/nextcloud-mcp-server

g0wda · 01/15/2026

Store fp16 vector blobs in SQLite. Load the vectors into memory after filter queries and do a matvec multiplication for similarity scores (this part will be fast if the library, e.g. numpy/torch, uses multithreading/BLAS/GPU). I will migrate this to the very based https://github.com/sqliteai/sqlite-vector when it starts to become a bottleneck. In my case, the filters on other features (e.g. date, location) already subset a lot. All this is behind an interface that will let me switch out the backend.
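
A sketch of that scoring step; the cosine normalization is my addition (a raw matvec gives plain dot-product scores):

  import numpy as np

  def top_k(V, q, k=5):
      # V: (n, d) fp16 matrix loaded from the SQLite blobs; q: (d,) query vector
      Vf = V.astype(np.float32)              # upcast once for the BLAS path
      qf = q.astype(np.float32)
      scores = Vf @ qf                       # one matvec, multithreaded under BLAS
      scores /= np.linalg.norm(Vf, axis=1) * np.linalg.norm(qf) + 1e-9
      k = min(k, len(scores))
      idx = np.argpartition(-scores, k - 1)[:k]
      return idx[np.argsort(-scores[idx])]   # top-k row ids, best first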

prakashn27 · 01/15/2026

I find a local RAG system slows down my computer (I have an M1 Pro with 32 GB).

So I use a hosted one to prevent this. My business uses a vector DB, so I created a new DB to vectorize and host my knowledge base:

1. All of my knowledge base is Markdown files, so I split them by header tags.
2. Each split is hashed and the hash value is stored in SQLite.
3. The split is vectorized and pushed to the cloud DB.
4. Whenever I make changes, I run a script that splits and checks hashes; if a hash has changed, I upsert the document, and if not, I do nothing.

This keeps the store up to date.

For search I have a CLI query that searches and fetches from the vector store.
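
The hash gate in steps 2-4 fits in a few lines. A sketch in which the schema and the push callback are hypothetical:

  import hashlib, sqlite3

  conn = sqlite3.connect("hashes.db")
  conn.execute("CREATE TABLE IF NOT EXISTS docs (id TEXT PRIMARY KEY, hash TEXT)")

  def upsert_if_changed(doc_id, text, push_to_vector_db):
      h = hashlib.sha256(text.encode()).hexdigest()
      row = conn.execute("SELECT hash FROM docs WHERE id = ?", (doc_id,)).fetchone()
      if row and row[0] == h:
          return False                       # unchanged: skip re-embedding entirely
      conn.execute(
          "INSERT INTO docs (id, hash) VALUES (?, ?) "
          "ON CONFLICT(id) DO UPDATE SET hash = excluded.hash",
          (doc_id, h),
      )
      push_to_vector_db(doc_id, text)        # hypothetical hosted-store upsert
      return True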

metawake · 01/15/2026

I am using a vector DB via a Docker image. And for debugging and benchmarking local RAG retrieval, I've been building a CLI tool that shows what's actually being retrieved:

  ragtune explain "your query" --collection prod
Shows scores, sources, and diagnostics. Helps catch when your chunking or embeddings are silently failing, and gives you numbers to base your judgements on.

Open source: https://github.com/metawake/ragtune

mmargenot · 01/15/2026

I made an Obsidian extension that does semantic and hybrid (RRF with FTS) search with local models. I have done some knowledge-graph and ontology experimentation around this, but nothing that I’d like to include yet.

This is specifically a “remembrance agent”, so it surfaces atoms related to what you’re writing rather than doing anything generative.

Extension: https://github.com/mmargenot/tezcat

Also available in community plugins.

IXCoach · 01/19/2026

I have production agents that run vector search via FAISS locally (in their own environment, not third-party environments), and for which I am creating embeddings for specific domains:

1. Agent memory (it's an AI coach, so this is the unique training methods that allow instant adoption of new skills and distilling of best-fit skills for context)

2. User memory (the AI coach's memory of a user)

3. Session memory (for long conversations, instead of compaction or truncation)

Then separately I have coding agents that I give semantic search, using the same FAISS system:

- On command, they create new memories from lessons (consumes tokens)
- They vector-search FAISS when needing more context (2x greater agent alignment/outcomes this way)

And finally, I forked OpenAI's Codex terminal agent code to add built-in vector search and injection.

So I say "Find any uncovered TDD opportunity matching intent to actuality for auth on these 3 repos, write TDD coverage, and bring failures to my attention."

They set my message to {$query}, vector-search on {$query}, and embed the results in their context window programmatically, so there's no token consumption (what a freaking dream).

That's open source, if helpful. It's here:

https://github.com/Next-AI-Labs-Inc/codex/tree/nextailabs

I'm trying to determine where something like this fits in:

https://huggingface.co/MongoDB/mdbr-leaf-ir

My gaps right now are ...

I am not training the agents yet, i.e. fine-tuning the underlying models.

I would love the simplest approach to test this, because at least with the Codex clone I could easily swap in local models, but I somewhat doubt they will be able to match the performance of the hosted models.

Especially because Claude Code just pulled ahead of Codex in quality in the last week or so, and it is closed source. I'm seeing clear swarm agentic coding internally, which is a dream for context-window efficiency (in Claude Code as of today).

init0 · 01/15/2026

  from piragi import Ragi

  kb = Ragi(["./docs", "s3://bucket/data/*/*.pdf", "https://api.example.com/docs"])

  answer = kb.ask("How do I deploy this?")

that's it! with https://pypi.org/project/piragi/

yakkomajuri · 01/15/2026

I've written about this (and the post was even here on HN), mostly from the perspective of an organization running RAG on its own infra, but I cover the general components and alternatives to cloud services.

Not sure how useful it is for what you need specifically: https://blog.yakkomajuri.com/blog/local-rag

pj4533 · 01/15/2026

Well this isn’t code, but I’ve been working on a memory system for Claude Code. This portion provides semantic search over the session files in .claude/projects. It uses OpenAI for embeddings so not completely local (would be easy to modify) and storage in ChromaDB.

https://github.com/pj4533/seance

reactordev · 01/15/2026

I have three tools dedicated to this.

save_memory, recall_memory, search

save_memory vectorizes a session, summarizes it, and stores it in SQLite. recall_memory takes a vector or a previous tool-run id and loads the full text output. search takes a vector array or string array and searches through the graph using fuzzy matching and vector dot products.

It’s not fancy, but it works really well, with gpt-oss.

motakuk · 01/14/2026

LightRAG, with Archestra as a UI via the LightRAG MCP.

codebolt · 01/15/2026

Giving the LLM tools with an OData query interface has worked well for me. In C# it's pretty trivial to set up an MCP server with OData querying over an arbitrary data model. At work we have an Excel sheet with 40k rows, which the LLM was able to quickly and reliably analyse using this method.

Bombthecat · 01/15/2026

AnythingLLM for documents, amazing tool!

ktyptorio · 01/18/2026

I've just released a casual personal project for Ephemeral GraphRAG. It's still experimental and open source: https://github.com/gibram-io/gibram

robotswantdata · 01/15/2026

You don’t need a vector database or graph; it really depends on your existing infrastructure, file types, and needs.

The newer “agent search” approach can just query a file system or an API. It’s slightly slower, but easier to set up and maintain since there's no extra infrastructure.

softwaredoug · 01/15/2026

I built a Pandas extension, SearchArray, and I just use that (plus in-memory embeddings) for any toy project:

https://github.com/softwaredoug/searcharray

turnsout · 01/15/2026

The Claude Code model highlights the power of simple search (grep) and selective reads (only reading in excerpts). The only time I vectorize is when I explicitly want similarity-based search, but that's actually pretty rare.

lsb · 01/15/2026

I'm using Sonnet with the 1M context window at work, just stuffing everything into the window (it works fine for now), and I'm hoping to investigate Recursive Language Models with DSPy when I'm using local models with Ollama.

dvorka · 01/15/2026

Any suggestions for what to use as an embedding-model runtime and for semantic search in C++?

SamLeBarbare · 01/15/2026

sqlite + FTS + sqlite-vec + local LLM for reranking results (reasoning model)

claylyons · 01/15/2026

Has anyone tried this? https://aws.amazon.com/s3/features/vectors/

throwaway7783 · 01/15/2026

We have a Q&A database. The questions and answers are both trigram-indexed and also have embeddings, all in Postgres. We then use pgvector + trigram search and combine the two by relevance scores.
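
One way to combine the two signals in a single query. A sketch with psycopg, where the 50/50 weighting and the qa schema are my assumptions, not the parent's:

  import psycopg

  # blend pgvector cosine similarity with pg_trgm text similarity
  SQL = """
      SELECT id, question,
             0.5 * (1 - (embedding <=> %(qvec)s::vector))
           + 0.5 * similarity(question, %(qtext)s) AS score
      FROM qa
      ORDER BY score DESC
      LIMIT 5;
  """

  query_embedding = [0.1] * 768  # stand-in for a real query embedding
  qvec = "[" + ",".join(map(str, query_embedding)) + "]"

  with psycopg.connect("dbname=qa") as conn:
      rows = conn.execute(SQL, {"qvec": qvec, "qtext": "how do I reset my password"}).fetchall()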
