It may be worth pointing out that a few open weights models score higher than gemini-embedding-001 on MTEB:
https://huggingface.co/spaces/mteb/leaderboard
Particularly Qwen3-Embedding-8B and Qwen3-Embedding-4B.
The Matryoshka embeddings seem interesting:
> The Gemini embedding model, gemini-embedding-001, is trained using the Matryoshka Representation Learning (MRL) technique which teaches a model to learn high-dimensional embeddings that have initial segments (or prefixes) which are also useful, simpler versions of the same data. Use the output_dimensionality parameter to control the size of the output embedding vector. Selecting a smaller output dimensionality can save storage space and increase computational efficiency for downstream applications, while sacrificing little in terms of quality. By default, it outputs a 3072-dimensional embedding, but you can truncate it to a smaller size without losing quality to save storage space. We recommend using 768, 1536, or 3072 output dimensions. [0]
Looks like even the 256-dim embeddings perform really well.
[0]: https://ai.google.dev/gemini-api/docs/embeddings#quality-for...
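As a sketch, MRL truncation is literally just taking a prefix of the vector and re-normalizing (numpy only; a random unit vector stands in here for a real API embedding):

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of an MRL embedding and re-normalize.

    With Matryoshka-trained models, the leading prefix of the vector is
    itself a usable lower-dimensional embedding; re-normalizing keeps
    cosine similarity well-defined after truncation.
    """
    prefix = vec[:dim]
    return prefix / np.linalg.norm(prefix)

# Toy stand-in for a 3072-dim embedding returned by the API.
rng = np.random.default_rng(0)
full = rng.standard_normal(3072)
full /= np.linalg.norm(full)

small = truncate_embedding(full, 768)
print(small.shape)  # (768,)
```

The stored vectors shrink 4x, and any cosine-similarity index built on them works unchanged.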
To anyone working in these types of applications, are embeddings still worth it compared to agentic search for text? If I have a directory of text files, for example, is it better to save all of their embeddings in a VDB and use that, or are LLMs now good enough that I can just let them use ripgrep or something to search for themselves?
Question to other GCP users, how are you finding Google's aggressive deprecation of older embedding models? Feels like you have to pay to rerun your data through every 12 months.
It's always worth checking out the MTEB leaderboard: https://huggingface.co/spaces/mteb/leaderboard
There are some good open models there that have longer context limits and fewer dimensions.
The benchmarks are just a guide. It's best to build a test dataset with your own data. This is a good example of that: https://github.com/beir-cellar/beir/wiki/Load-your-custom-da...
Another benefit of having your own test dataset is that it can grow as your data grows. And you can quickly test new models to see how they perform with YOUR data.
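As a sketch of what such a custom evaluation can look like: label which document each test query should retrieve, then measure recall@k. Everything here is toy data, and the hash-based `embed` is just a deterministic stand-in; the model under test would slot into that function.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in embedder: deterministic bag-of-words hashing into 64 buckets.
    Replace this with calls to the embedding model being evaluated."""
    v = np.zeros(64)
    for w in text.lower().split():
        v[sum(map(ord, w)) % 64] += 1.0
    return v / np.linalg.norm(v)

def recall_at_k(queries, corpus, relevant, k=3):
    """Fraction of queries whose relevant doc appears in the top-k
    cosine-similarity results."""
    doc_vecs = np.stack([embed(d) for d in corpus])
    hits = 0
    for q, rel_idx in zip(queries, relevant):
        sims = doc_vecs @ embed(q)
        topk = np.argsort(-sims)[:k]
        hits += int(rel_idx in topk)
    return hits / len(queries)

corpus = ["reset your password in settings",
          "invoice and billing questions",
          "api rate limits and quotas"]
queries = ["how do I reset my password", "billing invoice help"]
relevant = [0, 1]  # index of the relevant doc for each query
print(recall_at_k(queries, corpus, relevant, k=1))  # 1.0
```

Re-running this loop over a growing labeled set is usually enough to compare a new model against the incumbent in minutes.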
I just can't wait for a globally scaled RAG system. I think that will be a turning point for search engines.
For now, https://exa.ai/ seems to be the only one doing something similar.
I have been thinking about how to solve this problem. I think one of the reasons some AI assistants shine vs. others is how well they reduce the amount of context the LLM needs to work with using built-in tools. I think there's room to democratize these capabilities. One such capability is allowing LLMs to work directly with embeddings.
I wrote an MCP server, directory-indexer[1], for this (a self-hosted indexing MCP server). The goal is to index any directories you want your AI to know about and give it MCP tools to search through the embeddings etc. While agentic grep may be valuable, when working with tons of files on similar topics (like customer cases, technical docs), pre-processed embeddings have proven valuable for me. One reason I really like it is that it democratizes my data and documents: it gives consistent results when working with different AI assistants - the alternative being vastly different results depending on the built-in capabilities of each coding assistant. Another is having access to your "knowledge" from any project you're on. Since this is self-hosted, I use nomic-embed-text for the embeddings, which has been sufficient for most use cases.
> Everlaw, a platform providing verifiable RAG to help legal professionals analyze large volumes of discovery documents, requires precise semantic matching across millions of specialized texts. Through internal benchmarks, Everlaw found gemini-embedding-001 to be the best, achieving 87% accuracy in surfacing relevant answers from 1.4 million documents filled with industry-specific and complex legal terms, surpassing Voyage (84%) and OpenAI (73%) models. Furthermore, Gemini Embedding's Matryoshka property enables Everlaw to use compact representations, focusing essential information in fewer dimensions. This leads to minimal performance loss, reduced storage costs, and more efficient retrieval and search.
This will make a lot of junior lawyers, or at least their work, obsolete.
Here is a good podcast on how AI will affect the legal industry:
https://open.spotify.com/episode/4IAHG68BeGZzr9uHXYvu5z?si=q...
Anyone who has recently worked on embedding model finetuning, any useful tools you'd recommend (both for dataset curation and actual finetuning)? Any models you'd recommend as especially good for finetuning?
I'm interested in both full model finetunes, and downstream matrix optimization as done in [1].
[1] https://github.com/openai/openai-cookbook/blob/main/examples...
I'm short on vocabulary here, but it seems that using content-embedding similarity to find relevant (chunks of) content to feed an LLM is orthogonal to using LLMs to take automatically curated content chunks and enrich a context with them.
Is that correct?
I'm just curious why this type of content selection seems to have been popularized and has in many ways become the de facto standard for RAG, yet (as far as I know, though I haven't looked at search in a long time) isn't generally used for general-purpose search?
VP of Engineering of re:cap here (featured in the article), if anybody has any more detailed questions, happy to answer!
I feel like tool calling killed RAG; however, you have less control over how the retrieved data is injected into the context.
Is one LLM embedding much better than another? To me, if you're building a vector database off embeddings, it's best (and not punitive) to stick with a self-hosted open-weights model.
Has anyone done some simple latency profiling of the Gemini embedding API vs. the OpenAI embedding API? It seems like that API call is one of the biggest chunks of time in a simple RAG setup.
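A minimal way to measure this yourself: wrap the call and collect percentiles. The `fake_embed_call` below is a stub standing in for the real client call; swap in whichever SDK you use.

```python
import time

def profile(fn, n=20):
    """Call fn n times and report p50/p95 latency in milliseconds."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000)
    samples.sort()
    return {"p50": samples[len(samples) // 2],
            "p95": samples[int(len(samples) * 0.95) - 1]}

# Stand-in for an embedding API request; replace with the real call.
def fake_embed_call():
    time.sleep(0.01)

print(profile(fake_embed_call))
```

Tail latency (p95) tends to matter more than the median for RAG, since one slow embedding call stalls the whole request.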
What open embedding models would people recommend? Still Nomic?
Good luck to anyone using it. We used it for embedding about 6k documents.
The API constantly gives you quota errors when you reach about 150 requests/min, even though the quota should allow about 50,000 requests/min.
We’d like to use the Batch API, but the model isn’t available yet.
Quite a nice model though. Being able to get embeddings for a specific task type [1] is very interesting. We used classification-specific embeddings and noticed a meaningful improvement when we used them as input for a classifier.
1: https://ai.google.dev/gemini-api/docs/embeddings#supported-t...
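A minimal sketch of that pattern, assuming the embeddings have already been fetched with the classification task type; synthetic vectors stand in for real ones here, and a simple nearest-centroid classifier plays the role of whatever model sits downstream:

```python
import numpy as np

# Synthetic stand-ins for classification-task embeddings; in practice
# these would come from the embedding API.
rng = np.random.default_rng(42)
dim = 32
centers = {"billing": rng.standard_normal(dim), "bug": rng.standard_normal(dim)}

def fake_embed(label):
    # Embedding near the class center, with a little noise.
    v = centers[label] + 0.3 * rng.standard_normal(dim)
    return v / np.linalg.norm(v)

train = [(fake_embed(lbl), lbl) for lbl in ["billing"] * 20 + ["bug"] * 20]

# Nearest-centroid classifier over the embedding space.
centroids = {}
for lbl in ("billing", "bug"):
    vecs = np.stack([v for v, l in train if l == lbl])
    c = vecs.mean(axis=0)
    centroids[lbl] = c / np.linalg.norm(c)

def classify(vec):
    return max(centroids, key=lambda lbl: float(vec @ centroids[lbl]))

print(classify(fake_embed("billing")))  # billing
print(classify(fake_embed("bug")))      # bug
```

The point of task-specific embeddings is that the classes separate more cleanly in the vector space, so even a classifier this simple benefits.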
Interesting. High-quality, optimized embeddings are very nice to have.
No image support is a deal breaker. Multi-modality is a must.
> Embeddings are crucial here, as they efficiently identify and integrate vital information—like documents, conversation history, and tool definitions—directly into a model's working memory.
I feel like I'm falling behind here, but can someone explain this to me?
My high-level view of embedding is that I send some text to the provider, they tokenize the text and then run it through some NN that spits out a vector of numbers of a particular size (looks to be variable in this case, including 768, 1536 and 3072). I can then use those embeddings in places like a vector DB where I might want to do some kind of similarity search (e.g. cosine distance). I can also use them to do clustering on that similarity, which gives me some classification capabilities.
But how does this translate to these things being "directly into a model's working memory'? My understanding is that with RAG I just throw a bunch of the embeddings into a vector DB as keys but the ultimate text I send in the context to the LLM is the source text that the keys represent. I don't actually send the embeddings themselves to the LLM.
So what is this marketing stuff about "directly into a model's working memory"? Is my mental view wrong?
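For what it's worth, that mental model matches how most RAG pipelines work: the embedding is only a lookup key, and what reaches the LLM is the source text. A sketch with a deterministic stand-in embedding function:

```python
import numpy as np

def embed(text):
    # Stand-in for the embedding API call: bag-of-words hashing.
    v = np.zeros(64)
    for w in text.lower().split():
        v[sum(map(ord, w)) % 64] += 1.0
    return v / np.linalg.norm(v)

# Toy index: each entry stores (embedding, source_text).
docs = ["the invoice is due on the 5th", "passwords must be rotated quarterly"]
index = [(embed(d), d) for d in docs]

def retrieve(query, k=1):
    sims = [(float(e @ embed(query)), text) for e, text in index]
    return [text for _, text in sorted(sims, reverse=True)[:k]]

query = "when is the invoice due?"
context = retrieve(query)
# The prompt the model actually sees is plain text, not vectors:
prompt = f"Context:\n{context[0]}\n\nQuestion: {query}"
print(prompt)
```

So "into the model's working memory" is best read as marketing shorthand for "the retrieved text ends up in the context window"; the vectors themselves never go to the LLM.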