It may be worth pointing out that a few open weights models score higher than gemini-embedding-001 on MTEB:
https://huggingface.co/spaces/mteb/leaderboard
Particularly Qwen3-Embedding-8B and Qwen3-Embedding-4B.
The Matryoshka embeddings seem interesting:
> The Gemini embedding model, gemini-embedding-001, is trained using the Matryoshka Representation Learning (MRL) technique which teaches a model to learn high-dimensional embeddings that have initial segments (or prefixes) which are also useful, simpler versions of the same data. Use the output_dimensionality parameter to control the size of the output embedding vector. Selecting a smaller output dimensionality can save storage space and increase computational efficiency for downstream applications, while sacrificing little in terms of quality. By default, it outputs a 3072-dimensional embedding, but you can truncate it to a smaller size without losing quality to save storage space. We recommend using 768, 1536, or 3072 output dimensions. [0]
Looks like even the 256-dim embeddings perform really well.
[0]: https://ai.google.dev/gemini-api/docs/embeddings#quality-for...
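As a sketch, MRL truncation is literally just taking a prefix of the vector and re-normalizing (numpy only; a random unit vector stands in here for a real API embedding):

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of an MRL embedding and re-normalize.

    With Matryoshka-trained models, the leading prefix of the vector is
    itself a usable lower-dimensional embedding; re-normalizing keeps
    cosine similarity well-defined after truncation.
    """
    prefix = vec[:dim]
    return prefix / np.linalg.norm(prefix)

# Toy stand-in for a 3072-dim embedding returned by the API.
rng = np.random.default_rng(0)
full = rng.standard_normal(3072)
full /= np.linalg.norm(full)

small = truncate_embedding(full, 768)
print(small.shape)  # (768,)
```

The stored vectors shrink 4x, and any cosine-similarity index built on them works unchanged.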
To anyone working in these types of applications, are embeddings still worth it compared to agentic search for text? If I have a directory of text files, for example, is it better to save all of their embeddings in a VDB and use that, or are LLMs now good enough that I can just let them use ripgrep or something to search for themselves?
Question to other GCP users, how are you finding Google's aggressive deprecation of older embedding models? Feels like you have to pay to rerun your data through every 12 months.
It's always worth checking out the MTEB leaderboard: https://huggingface.co/spaces/mteb/leaderboard
There are some good open models there that have longer context limits and fewer dimensions.
The benchmarks are just a guide. It's best to build a test dataset with your own data. This is a good example of that: https://github.com/beir-cellar/beir/wiki/Load-your-custom-da...
Another benefit of having your own test dataset is that it can grow as your data grows. And you can quickly test new models to see how they perform with YOUR data.
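As a sketch of what such a custom evaluation can look like: label which document each test query should retrieve, then measure recall@k. Everything here is toy data, and the hash-based `embed` is just a deterministic stand-in; the model under test would slot into that function.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in embedder: deterministic bag-of-words hashing into 64 buckets.
    Replace this with calls to the embedding model being evaluated."""
    v = np.zeros(64)
    for w in text.lower().split():
        v[sum(map(ord, w)) % 64] += 1.0
    return v / np.linalg.norm(v)

def recall_at_k(queries, corpus, relevant, k=3):
    """Fraction of queries whose relevant doc appears in the top-k
    cosine-similarity results."""
    doc_vecs = np.stack([embed(d) for d in corpus])
    hits = 0
    for q, rel_idx in zip(queries, relevant):
        sims = doc_vecs @ embed(q)
        topk = np.argsort(-sims)[:k]
        hits += int(rel_idx in topk)
    return hits / len(queries)

corpus = ["reset your password in settings",
          "invoice and billing questions",
          "api rate limits and quotas"]
queries = ["how do I reset my password", "billing invoice help"]
relevant = [0, 1]  # index of the relevant doc for each query
print(recall_at_k(queries, corpus, relevant, k=1))  # 1.0
```

Re-running this loop over a growing labeled set is usually enough to compare a new model against the incumbent in minutes.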
I just can't wait for a globally scaled RAG system. I think that will be a turning point for search engines.
For now, https://exa.ai/ seems to be the only one doing something similar.
I have been thinking about how to solve this problem. I think one of the reasons some AI assistants shine vs. others is how well they reduce the amount of context the LLM needs to work with using built-in tools. I think there's room to democratize these capabilities. One such capability is allowing LLMs to work directly with embeddings.
I wrote an MCP server, directory-indexer[1], for this (a self-hosted indexing MCP server). The goal is to index any directories you want your AI to know about and give it MCP tools to search through the embeddings etc. While agentic grep may be valuable, when working with tons of files on similar topics (like customer cases, technical docs), pre-processed embeddings have proven valuable for me. One reason I really like it is that it democratizes my data and documents: it gives consistent results when working with different AI assistants - the alternative being vastly different results depending on the built-in capabilities of each coding assistant. Another is having access to your "knowledge" from any project you're on. Since this is self-hosted, I use nomic-embed-text for the embeddings, which has been sufficient for most use cases.
> Everlaw, a platform providing verifiable RAG to help legal professionals analyze large volumes of discovery documents, requires precise semantic matching across millions of specialized texts. Through internal benchmarks, Everlaw found gemini-embedding-001 to be the best, achieving 87% accuracy in surfacing relevant answers from 1.4 million documents filled with industry-specific and complex legal terms, surpassing Voyage (84%) and OpenAI (73%) models. Furthermore, Gemini Embedding's Matryoshka property enables Everlaw to use compact representations, focusing essential information in fewer dimensions. This leads to minimal performance loss, reduced storage costs, and more efficient retrieval and search.
This will make a lot of junior lawyers, or at least their work, obsolete.
Here is a good podcast on how AI will affect the legal industry:
https://open.spotify.com/episode/4IAHG68BeGZzr9uHXYvu5z?si=q...
Anyone who has recently worked on embedding model finetuning, any useful tools you'd recommend (both for dataset curation and actual finetuning)? Any models you'd recommend as especially good for finetuning?
I'm interested in both full model finetunes, and downstream matrix optimization as done in [1].
[1] https://github.com/openai/openai-cookbook/blob/main/examples...
I'm short on vocabulary here, but it seems that using content-embedding similarity to find relevant (chunks of) content to feed an LLM is orthogonal to using LLMs to take automatically curated content chunks and enrich a context with them.
Is that correct?
I'm just curious why this type of content selection seems to have been popularized and has in many ways become the de facto standard for RAG, yet (as far as I know, though I haven't looked at search in a long time) isn't generally used for general-purpose search?
VP of Engineering of re:cap here (featured in the article), if anybody has any more detailed questions, happy to answer!
I feel like tool calling killed RAG; however, you have less control over how the retrieved data is injected into the context.
Is one LLM embedding much better than another? To me, if you're building a vector database off embeddings, it's best (and not punitive) to stick with a self-hosted open-weights model.
Has anyone done some simple latency profiling of the Gemini embedding API vs. the OpenAI embedding API? It seems like that API call is one of the biggest chunks of time in a simple RAG setup.
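A minimal way to measure this yourself: wrap the call and collect percentiles. The `fake_embed_call` below is a stub standing in for the real client call; swap in whichever SDK you use.

```python
import time

def profile(fn, n=20):
    """Call fn n times and report p50/p95 latency in milliseconds."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000)
    samples.sort()
    return {"p50": samples[len(samples) // 2],
            "p95": samples[int(len(samples) * 0.95) - 1]}

# Stand-in for an embedding API request; replace with the real call.
def fake_embed_call():
    time.sleep(0.01)

print(profile(fake_embed_call))
```

Tail latency (p95) tends to matter more than the median for RAG, since one slow embedding call stalls the whole request.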
What open embedding models would people recommend? Still Nomic?
Good luck to anyone using it. We used it for embedding about 6k documents.
The API constantly gives you quota errors when you reach about 150 requests/min, even though the quota should allow about 50,000 requests/min.
We’d like to use the Batch API, but the model isn’t available yet.
Quite a nice model though. Being able to get embeddings for a specific task type [1] is very interesting. We used classification-specific embeddings and noticed a meaningful improvement when we used them as input for a classifier.
1: https://ai.google.dev/gemini-api/docs/embeddings#supported-t...
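A minimal sketch of that pattern, assuming the embeddings have already been fetched with the classification task type; synthetic vectors stand in for real ones here, and a simple nearest-centroid classifier plays the role of whatever model sits downstream:

```python
import numpy as np

# Synthetic stand-ins for classification-task embeddings; in practice
# these would come from the embedding API.
rng = np.random.default_rng(42)
dim = 32
centers = {"billing": rng.standard_normal(dim), "bug": rng.standard_normal(dim)}

def fake_embed(label):
    # Embedding near the class center, with a little noise.
    v = centers[label] + 0.3 * rng.standard_normal(dim)
    return v / np.linalg.norm(v)

train = [(fake_embed(lbl), lbl) for lbl in ["billing"] * 20 + ["bug"] * 20]

# Nearest-centroid classifier over the embedding space.
centroids = {}
for lbl in ("billing", "bug"):
    vecs = np.stack([v for v, l in train if l == lbl])
    c = vecs.mean(axis=0)
    centroids[lbl] = c / np.linalg.norm(c)

def classify(vec):
    return max(centroids, key=lambda lbl: float(vec @ centroids[lbl]))

print(classify(fake_embed("billing")))  # billing
print(classify(fake_embed("bug")))      # bug
```

The point of task-specific embeddings is that the classes separate more cleanly in the vector space, so even a classifier this simple benefits.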
Interesting. High-quality, optimized embeddings are very nice to have.
No image support is a deal breaker. Multi-modality is a must.
> Embeddings are crucial here, as they efficiently identify and integrate vital information—like documents, conversation history, and tool definitions—directly into a model's working memory.
I feel like I'm falling behind here, but can someone explain this to me?
My high-level view of embedding is that I send some text to the provider, they tokenize the text and then run it through some NN that spits out a vector of numbers of a particular size (looks to be variable in this case, including 768, 1536 and 3072). I can then use those embeddings in places like a vector DB where I might want to do some kind of similarity search (e.g. cosine distance). I can also use them to do clustering on that similarity, which gives me some classification capabilities.
But how does this translate to these things being "directly into a model's working memory'? My understanding is that with RAG I just throw a bunch of the embeddings into a vector DB as keys but the ultimate text I send in the context to the LLM is the source text that the keys represent. I don't actually send the embeddings themselves to the LLM.
So what is this marketing stuff about "directly into a model's working memory"? Is my mental view wrong?
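For what it's worth, that mental model matches how most RAG pipelines work: the embedding is only a lookup key, and what reaches the LLM is the source text. A sketch with a deterministic stand-in embedding function:

```python
import numpy as np

def embed(text):
    # Stand-in for the embedding API call: bag-of-words hashing.
    v = np.zeros(64)
    for w in text.lower().split():
        v[sum(map(ord, w)) % 64] += 1.0
    return v / np.linalg.norm(v)

# Toy index: each entry stores (embedding, source_text).
docs = ["the invoice is due on the 5th", "passwords must be rotated quarterly"]
index = [(embed(d), d) for d in docs]

def retrieve(query, k=1):
    sims = [(float(e @ embed(query)), text) for e, text in index]
    return [text for _, text in sorted(sims, reverse=True)[:k]]

query = "when is the invoice due?"
context = retrieve(query)
# The prompt the model actually sees is plain text, not vectors:
prompt = f"Context:\n{context[0]}\n\nQuestion: {query}"
print(prompt)
```

So "into the model's working memory" is best read as marketing shorthand for "the retrieved text ends up in the context window"; the vectors themselves never go to the LLM.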