> Embeddings are crucial here, as they efficiently identify and integrate vital information—like documents, conversation history, and tool definitions—directly into a model's working memory.
I feel like I'm falling behind here, but can someone explain this to me?
My high-level view of embeddings is that I send some text to the provider, they tokenize the text and then run it through some NN that spits out a vector of numbers of a particular size (looks to be variable in this case, including 768, 1536 and 3072). I can then use those embeddings in places like a vector DB where I might want to do some kind of similarity search (e.g. cosine similarity). I can also use them to do clustering on that similarity, which can give me some classification capabilities.
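That similarity-search step is just a vector comparison. A minimal sketch of cosine similarity over two embedding vectors (plain Python, no provider API involved):

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|): 1.0 = same direction, 0.0 = orthogonal
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]   # same direction, just scaled
print(cosine_similarity(v1, v2))  # → 1.0 (up to floating point)
```

A real setup would run this (or an approximate-nearest-neighbor version of it) over every stored vector in the DB and return the closest matches.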
But how does this translate to these things being "directly into a model's working memory"? My understanding is that with RAG I just throw a bunch of the embeddings into a vector DB as keys, but the ultimate text I send in the context to the LLM is the source text that the keys represent. I don't actually send the embeddings themselves to the LLM.
So what is this marketing stuff about "directly into a model's working memory"? Is my mental view wrong?
> So what is this marketing stuff about "directly into a model's working memory"? Is my mental view wrong?
Context is sometimes called working memory. But no, your understanding is right: find the right document through cosine similarity (and thus through embeddings), then add the content of those docs to the context.
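The whole flow fits in a few lines. A self-contained sketch, where `embed` is a toy bag-of-words stand-in for whatever embedding API the provider offers (hypothetical; a real system would call the provider and store vectors in a real vector DB):

```python
# Toy embedding: counts of a fixed vocabulary, just so the example runs.
def embed(text):
    vocab = ["cheese", "wine", "python", "snake", "code"]
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "cheese pairs well with wine",
    "python is a snake and also code",
]
# The "vector DB": embedding as key, source text as value.
index = [(embed(d), d) for d in docs]

query = "what wine goes with cheese"
qv = embed(query)
best = max(index, key=lambda kv: cosine(qv, kv[0]))

# Note: the retrieved *text* goes into the prompt, never the embedding itself.
prompt = f"Context:\n{best[1]}\n\nQuestion: {query}"
print(prompt)
```

The last two lines are the whole "working memory" trick: retrieval picks the text, and the text goes into the context window like any other prompt content.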
The directly into working memory bit is nonsense of course, but it does point to a problem that is probably worth solving.
What would it take to make the KV cache more portable and cut/paste vs. highly specific to the query?
In theory today, I should be able to process <long quote from document> <specific query>, just stop after the long document, and save the KV cache, right? The next time around, I can just load it in and continue from <new query>?
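The reuse idea above can be sketched as prefix caching: key the saved state by the exact document prefix, pay the prefill cost once, and only process the new query on later calls. This is a toy sketch — `run_prefill` stands in for the expensive forward pass, and its "state" is a placeholder, not real attention keys/values:

```python
import hashlib

kv_cache = {}

def run_prefill(prefix_tokens):
    # Placeholder for computing real KV tensors over the prefix.
    return {"processed": len(prefix_tokens)}

def answer(doc_tokens, query_tokens):
    key = hashlib.sha256(" ".join(doc_tokens).encode()).hexdigest()
    if key not in kv_cache:
        kv_cache[key] = run_prefill(doc_tokens)  # pay the cost once
    state = kv_cache[key]
    # Continue decoding from the cached state with only the new query.
    return state["processed"] + len(query_tokens)

doc = ["long", "document", "tokens"] * 100
answer(doc, ["first", "query"])
answer(doc, ["second", "query"])   # prefill skipped: cache hit
print(len(kv_cache))  # → 1
```

Note this only works because the document is an *exact prefix* at the same positions — which is precisely the limitation: you can't splice two independently cached segments together, as the next paragraph gets into.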
To keep going, you should be able to train the model to operate so that you can have discontinuous KV cache segments that are unrelated, so you can drop in <cached KV from doc 1> <cached KV from doc 2> with <query related to both> and have it just work ... but I don't think you can do that today.
I seem to remember seeing some papers that tried to "unRoPE" the KV and then "re-RoPE" it so it can be reused ... but I have not seen the latest. Anybody know what the current state is?
Seems crazy to have to re-process the same context multiple times just to ask it a new query.
Perhaps the person that wrote it is also confused. I guess Gemini's embedding model offers multilingual support, but we can use anything. The assumption is the developer uses these embeddings on their end with their own implementation of storage/querying (their own vector DB). The confusing thing is that the article suggests the whole process is now done automatically as soon as you send the embeddings to Gemini (which doesn't even make sense; shouldn't it only take text?).
Your mental model is correct.
They're listing applications of that by third parties to demonstrate the use-case, but this is just a model for generating those vectors.
At least in theory. If the model is the same, the embeddings can be reused by the model rather than recomputing them.
I believe this is what they mean.
In practice, how fast will the model change (including the tokenizer)? How fast will the vector DB be fully backfilled to match the model version?
That would be the “cache hit rate” of sorts and how much it helps likely depends on some of those variables for your specific corpus and query volumes.
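One way to make that concrete: embeddings are only reusable when the model (and tokenizer) version matches, so any cache has to be keyed by both the version and the text. A minimal sketch (the `fake_embed` function is a stand-in for a real embedding API call):

```python
embedding_cache = {}

def get_embedding(model_version, text, compute):
    # Key by (model_version, text): a new model version invalidates
    # every previously stored vector for the same text.
    key = (model_version, text)
    if key not in embedding_cache:
        embedding_cache[key] = compute(text)   # cache miss: recompute
    return embedding_cache[key]

fake_embed = lambda t: [float(len(t))]         # stand-in for the real API

get_embedding("v1", "hello", fake_embed)
get_embedding("v1", "hello", fake_embed)       # hit: nothing recomputed
get_embedding("v2", "hello", fake_embed)       # miss: model changed
print(len(embedding_cache))  # → 2
```

The hit rate then depends on exactly the variables mentioned above: how often the model version rolls over versus how often the same corpus text is re-embedded.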
LLMs can use search engines as a tool. One possibility is that Google embeds the search query with these embeddings, does retrieval using them, and then pastes the retrieved result into the model's chain of thought (which, unless they have an external memory module in their model, is basically the model's only working memory).
You're right on this. "Advanced" RAG techniques are all complete marketing BS; in the end, all you're doing is passing the text into the model's context window.
Your comment really helps me improve my mental model of LLMs. Can someone smarter help me verify my understanding:
1) at the end of the day, we are still sending raw text to the LLM as input to get output back as a response.
2) RAG/embedding is just a way to identify a "certain chunk" to be included in the LLM input so that you don't have to dump the entire ground-truth document into the LLM. Let's take Everlaw for example: all of their legal docs are stored as embeddings, and RAG/tool calls will retrieve the relevant documents to feed into the LLM input.
So in that sense, what do these startups that aren't building foundation models mean when they say they are training or fine-tuning models? Where does the line end between inputting into the LLM vs. having things baked into the model weights?
Oh, what you might be missing is that LLMs also use embeddings inside: it's how they represent tokens. It's just that you don't get to see those embeddings; they are inner workings.
RAG is taking a bunch of docs, chunking them into text blocks of a certain length (how best to do this is up for debate), and creating a search API that takes a query (like a Google search) and compares it to the document chunks (very much how you're describing). Take the returned chunks, ignore the score from the vector search, feed those chunks into a re-ranker along with the original query (this step is important; vector search mostly sucks), filter the re-ranked results down to the top 1-2, and then format a prompt like:
The user asked 'long query'; we fetched some docs (see below). Answer the query based on the docs (reference the docs if you feel like it).
Doc1.pdf - Chunk N: Eat cheese
Doc2.pdf - Chunk Y: Don't eat cheese
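The steps above (chunk, vector search, re-rank, take the top 1-2, format the prompt) can be sketched end to end. Here `vector_search` and `rerank` are hypothetical stand-ins scoring by word overlap; real systems would use cosine similarity over embeddings and a cross-encoder re-ranker respectively:

```python
def chunk(doc_name, text, size=5):
    # Split a doc into fixed-size word chunks (chunking strategy is debatable).
    words = text.split()
    return [(doc_name, i, " ".join(words[i:i + size]))
            for i in range(0, len(words), size)]

def vector_search(query, chunks, k=10):
    # Toy relevance: word overlap. Real code: cosine over embeddings.
    q = set(query.lower().split())
    scored = [(len(q & set(c[2].lower().split())), c) for c in chunks]
    return [c for s, c in sorted(scored, key=lambda x: -x[0])[:k]]

def rerank(query, candidates, top=2):
    # Toy re-ranker. Real code: cross-encoder scoring of (query, chunk) pairs.
    q = set(query.lower().split())
    scored = sorted(candidates,
                    key=lambda c: -len(q & set(c[2].lower().split())))
    return scored[:top]

chunks = (chunk("Doc1.pdf", "eat cheese every single day it is great")
          + chunk("Doc2.pdf", "do not eat cheese it is bad for you"))

query = "should I eat cheese"
top_chunks = rerank(query, vector_search(query, chunks))
context = "\n".join(f"{name} - chunk {i}: {text}"
                    for name, i, text in top_chunks)
prompt = f"The user asked '{query}'. Answer based on these docs:\n{context}"
print(prompt)
```

Swapping the toy scoring for real embeddings and a real re-ranker doesn't change the shape of the pipeline; the prompt at the end is the only thing the LLM ever sees.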
You then expose the search API as a "tool" for the LLM to call, slightly reformatting the prompt above into a multi-turn convo, and suddenly you're in ze money.
But once your users are happy with those results, they'll want something dumb like the latest football scores; then you need a web tool, and then it never ends.
To be fair though, it's pretty powerful once you've got it in place.