I am curious how people are doing RAG locally with minimal dependencies for internal code or complex documents?
Are you using a vector database, some type of semantic search, a knowledge graph, a hypergraph?
lee101/gobed (https://github.com/lee101/gobed): static embedding models, so texts are embedded in milliseconds, with on-GPU search via a CAGRA-style GPU index. A few things for speed: int8 quantization on the embeddings, and fusing embedding and search into the same kernel, since the embedding really is just a trained map of per-token embeddings plus averaging.
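For anyone wondering what "a trained map of embeddings per token plus averaging" looks like in practice, here's a toy sketch (the vectors are random stand-ins, not gobed's actual trained table, and the dimension is made up): the "model" is a token-to-vector table, a text embedding is the mean of its token vectors, and symmetric int8 quantization shrinks the result for an index.

```python
import random

# Toy static embedding model: a fixed token -> vector table. In a real
# model like gobed's these vectors are trained; here they're random.
random.seed(0)
DIM = 8
table = {}

def token_vec(tok):
    # Lazily assign a vector per token for the sketch.
    if tok not in table:
        table[tok] = [random.uniform(-1.0, 1.0) for _ in range(DIM)]
    return table[tok]

def embed(text):
    # A text embedding is just the mean of its token vectors:
    # a table lookup plus averaging, no neural forward pass.
    vecs = [token_vec(t) for t in text.lower().split()]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def quantize_int8(vec):
    # Symmetric int8 quantization: scale so the max magnitude maps to 127.
    scale = max(abs(v) for v in vec) / 127 or 1.0
    return [round(v / scale) for v in vec], scale

emb = embed("fast local retrieval")
q, scale = quantize_int8(emb)
print(len(emb), min(q), max(q))
```

Because embedding is a lookup plus an average, it's trivially parallel, which is what makes fusing it with the search kernel plausible.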
For the purposes of learning, I’ve built a chatbot using ollama, streamlit, chromadb and docling. Mostly playing around with embedding and chunking on a document library.
Local LibreChat which bundles a vector db for docs.
A little BM25 can get you quite a way with an LLM.
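For reference, "a little BM25" really is little; a minimal pure-Python sketch of Okapi BM25 scoring (toy documents, conventional default k1/b parameters):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc against the query with classic Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    # Document frequency per term.
    df = Counter()
    for d in tokenized:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for t in query.lower().split():
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(s)
    return scores

docs = [
    "how to configure the build system",
    "retrieval augmented generation pipelines",
    "notes on retrieval and ranking",
]
print(bm25_scores("retrieval ranking", docs))
```

Feed the top-scoring chunks into the LLM prompt and you have the whole pipeline, no vector database required.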
I don't. I actually write code.
To answer the question more directly, I've spent the last couple of years running a few different quant models, mostly on llama.cpp or ollama depending on the model. The results are way slower than the paid token API versions, but they're completely free of external influence and cost.
However, the models I've tested generally turn out to be pretty dumb at the quant level I run to stay relatively fast, and their code generation is a mess I'd rather not deal with.
SQLite with FTS5
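The FTS5 route is only a few lines; a minimal sketch with Python's built-in sqlite3 (assuming your SQLite build ships the FTS5 extension, which the standard Python distribution's usually does):

```python
import sqlite3

# In-memory DB for the sketch; point this at a file for real use.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(path, body)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [
        ("notes/rag.md", "retrieval augmented generation with local models"),
        ("notes/db.md", "sqlite fts5 full text search and bm25 ranking"),
    ],
)

# SQLite's bm25() returns smaller values for better matches,
# so ORDER BY it ascending to put the best hit first.
rows = conn.execute(
    "SELECT path FROM docs WHERE docs MATCH ? ORDER BY bm25(docs)",
    ("bm25",),
).fetchall()
print(rows)
```

Zero extra dependencies, and the index lives in the same file as the rest of your data.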
I have done some experiments with nomic embedding through Ollama and ChromaDB.
Works well, but I haven't tested it at a larger scale.
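For the Ollama side of a setup like that, the embeddings call is a single POST to the local server. A stdlib-only sketch that just builds the request (endpoint and model name taken from Ollama's API docs as I understand them; adjust if your version differs, and note the commented-out line needs a running server):

```python
import json
import urllib.request

# Local Ollama server's embeddings endpoint (default port assumed).
OLLAMA_URL = "http://localhost:11434/api/embeddings"

def embed_request(text, model="nomic-embed-text"):
    # Build the request; the response JSON carries an "embedding" list.
    payload = json.dumps({"model": model, "prompt": text}).encode()
    return urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )

req = embed_request("hello world")
print(json.loads(req.data))
# vec = json.load(urllib.request.urlopen(req))["embedding"]  # needs ollama running
```

The resulting vectors then go into ChromaDB (or any other store) as-is.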
SurrealDB coupled with local vectorization. Mac M1 16GB
Is there a thread for hardware used for local LLMs?
Anyone have suggestions for doing semantic caching?
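One common shape for semantic caching, sketched with a toy bag-of-words "embedding" (everything here is illustrative; swap in a real embedding model and a real vector store): key the cache by query embedding, and return a cached response when a new query's cosine similarity clears a threshold.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words vector; a real setup would use an embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, query):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None  # cache miss: call the LLM, then put() the answer

    def put(self, query, response):
        self.entries.append((embed(query), response))

cache = SemanticCache(threshold=0.8)
cache.put("how do I reset my password", "Use the reset link on the login page.")
print(cache.get("how do i reset my password"))   # near-duplicate query: hit
print(cache.get("what is the deploy process"))   # unrelated query: miss
```

The threshold is the whole game: too low and you serve wrong answers, too high and you never hit.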
I am curious what are you using local RAG for?
i thought rag/embeddings were dead with the large context windows. that's what i get for listening to chatgpt.
I thought that context building via tooling was shown to be more effective than rag in practically every way?
Question being: WHY would I be doing RAG locally?
I just use a web server and a search engine.
TL;DR:
- chunk files, index chunks
- vector/hybrid search over the index
- node app to handle requests (it was the quickest to implement, and LLMs understand OpenAPI well)
I wrote about it here: https://laurentcazanove.com/blog/obsidian-rag-api
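The chunk-then-search part of a setup like that fits in a few lines; a sketch with toy term-frequency vectors standing in for real embeddings (chunk sizes are arbitrary, and a hybrid setup would also mix in BM25 scores):

```python
import math
from collections import Counter

def chunk(text, size=80, overlap=20):
    # Fixed-size character chunks with overlap: the simplest chunking scheme.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def vec(text):
    # Toy term-frequency vector; a real pipeline would embed each chunk.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

doc = (
    "Deployment is handled by a small script that builds the image and "
    "pushes it to the registry. Rollbacks reuse the previous image tag. "
    "Secrets live in the environment, never in the repo."
)
index = [(c, vec(c)) for c in chunk(doc)]             # chunk files, index chunks
query = vec("how do rollbacks work")
best = max(index, key=lambda e: cosine(query, e[1]))  # search over the index
print(best[0])
```

The winning chunk is what you hand to the LLM as context, via whatever request handler you prefer.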
simple lil setup with qdrant
AnythingLLM is promising
sqlite's bm25
try out chroma, or better yet ask opus to!
Grep (rg)
Whatever "RAG" is...
Embedded usearch vector database. https://github.com/unum-cloud/USearch