RAG is taking a bunch of docs, chunking them it to text blocks of a certain length (how best todo this up for debate), creating a search API that takes query (like a google search) and compares it to the document chunks (very much how your describing). Take the returned chunks, ignore the score from vector search, feed those chunks into a re-ranker with the original query (this step is important vector search mostly sucks), filter those re-ranked for the top 1/2 results and then format a prompt like;
The user ask 'long query', we fetched some docs (see below), answer the query based on the docs (reference the docs if u feel like it)
Doc1.pdf - Chunk N Eat cheese
Doc2.pdf- Chunk Y Dont eat cheese
You then expose the search API as a "tool" for the LLM to call, slightly reformatting the prompt above into a multi turn convo, and suddenly you're in ze money.
But once your users are happy with those results they'll want something dumb like the latest football scores, then you need a web tool - and then it never ends.
To be fair though, its pretty powerful once you've got in place.
Sorry for my lack of knowledge, but I've been wondering what if you ask a question to the RAG, where the answer to the question is not close in embedding space to the embedded question? Will that not limit the quality of the result? Or how does a RAG handle that? I guess maybe the multi-turn convo you mentioned helps in this regard?
The way I see RAG is it's basically some sort of semantic search, where the query needs to be similar to whatever you are searching for in the embedding space order to get good results.
Is RAG how I would process my 20+ year old bug list for a piece of software I work on?
I've been thinking about this because it would be nice to have a fuzzier search.
Or you find your users search for id strings like k1231o to find ref docs and end up needing key word search and reranking.