Google DeepMind puts Gemini Embedding 2 into the race for multimodal AI search

AIDesign

29 May

The new model is available through the Gemini API and Google Cloud Vertex AI, supporting search and retrieval across text, image, video, audio, documents, and code.

Multimodal AI search system showing text, image, video, audio, and code flowing into one digital retrieval model — ***Google DeepMind has released Gemini Embedding 2, a native multimodal embedding model available through the Gemini API and Google Cloud Vertex AI.***

Google DeepMind has released Gemini Embedding 2, a native multimodal embedding model built to help developers and organizations search across text, image, video, audio, documents, and code through one system.

The model is available now through the Gemini API and Google Cloud Vertex AI. It is designed for AI search, retrieval-augmented generation, recommendations, document retrieval, video search, audio-based search, and future agentic RAG workflows.

For education, research, and EdTech teams, the launch points to AI systems that can search across more than written content. Universities, schools, research organizations, and learning platforms increasingly hold information in lectures, PDFs, slides, diagrams, videos, audio recordings, code repositories, and internal knowledge bases.

Gemini Embedding 2 follows Google’s earlier text-only Gemini Embedding model. The new version extends the approach across multiple content types, while Google DeepMind reports stronger results on several text and code benchmarks as well as multimodal retrieval tasks.

The model produces embeddings of up to 3,072 dimensions, with support optimized for 768 and 1,536 dimensions.

One model for different types of content

Gemini Embedding 2 is built to help AI systems find relevant material across mixed formats. A search could involve a written query, an image, a video, an audio file, a document, or a combination of inputs.

That gives the model a practical role in education and research settings where information rarely sits in one format. A university could use multimodal retrieval to search across lecture recordings, research papers, diagrams, slides, video archives, student support materials, and internal documentation. An EdTech company could use it to improve discovery across learning content, assessment resources, help centers, product documentation, and training materials.

The Google Deepmind white paper released with the model describes Gemini Embedding 2 as able to embed "arbitrary combinations of interleaved inputs" across text, image, video, and audio.

The release also matters for retrieval-augmented generation, where an AI system needs to find the right source material before generating an answer. If the source material sits across PDFs, classroom video, audio recordings, charts, screenshots, code, and written documents, retrieval becomes a bigger part of the product experience.

Benchmark results cover search, code, video, audio, and documents

Google DeepMind reports state-of-the-art performance across several embedding benchmarks, including unimodal, cross-modal, and multimodal retrieval.

Gemini Embedding 2 recorded 62.9 Recall@1 on MSCOCO for text-to-image retrieval, 68.8 NDCG@10 on Vatex for text-to-video retrieval, 69.9 on MTEB multilingual, and 84.0 on MTEB Code.

The model also achieved 79.4 Recall@1 on Google Universal Embedding Challenge image-to-image retrieval, 80.5 mean Recall@1 on text-to-image retrieval, 91.2 mean Recall@1 on image-to-text retrieval, and 64.9 NDCG@10 on ViDoRe V2 document retrieval.

Those results are relevant beyond model rankings. Text retrieval still matters for institutional search, student support, research workflows, and AI assistants. Code retrieval matters for developer tools, computing courses, technical training, and workforce skills. Document retrieval matters for PDFs, tables, forms, reports, slide decks, and policy materials.

Audio is another part of the release. On the Massive Sound Embedding Benchmark retrieval split, Gemini Embedding 2 with native audio achieved an average mrr@10 of 73.99, compared with 70.40 when audio was first converted through automatic speech recognition. In cross-lingual retrieval, native audio reached 72.56, compared with 67.55 for the ASR-based route.

Specialist use cases include science, art, and food data

Gemini Embedding 2 was also tested on specialized image-to-text retrieval domains that were not explicit training targets.

The evaluation covered microscopy, fine art, astronomy, and culinary datasets. Gemini Embedding 2 achieved 79.3 R@5 on MicroVQA, 67.7 on ArtCap, 64.4 on AstroLLaVA, 90.2 on Recipe1M ingredients, and 92.1 on Recipe1M instructions.

For universities, research institutions, and scientific teams, those results are one of the more useful parts of the release. Research and teaching materials often include specialist images, terminology, data tables, diagrams, scans, videos, and PDFs. A multimodal embedding model that performs across different domains could support search tools for labs, digital libraries, research repositories, and subject-specific learning platforms.

Google DeepMind’s paper also points to future work around agentic RAG, video recommendation, interleaved multimodal retrieval, ranking signals, and end-to-end RAG training.

Gemini Embedding 2 is available now through the Gemini API and Google Cloud Vertex AI. Google DeepMind is positioning the model for developers and organizations building multimodal AI search, recommendation, document retrieval, audio retrieval, video search, and RAG systems.

ETIH Innovation Awards 2026