Vibe Coding Academy

Gemini Embedding 2

Google's first natively multimodal embedding model. One embedding space for text, images, video, audio, and PDFs, unlocking true cross-modal search and RAG.

Published March 10, 2026 · 8 min read

#AIModels #Embeddings #RAG

Today Google released Gemini Embedding 2: the first embedding model that natively handles text, images, video, audio, and PDFs in a single unified embedding space. This is a game-changer for anyone building RAG applications, semantic search, or multimodal AI apps.

Until now, building a search system that could handle both images and text required separate embedding models and complex orchestration. Gemini Embedding 2 changes that: one model, one embedding space, all modalities.

🚀 Why This Matters for Vibe Coders

If you're building AI apps with Claude Code, Cursor, or any coding assistant, multimodal embeddings let you build search and RAG systems that actually understand your content, whether it's code snippets, screenshots, tutorial videos, or documentation PDFs.

What Makes It Different

Previous embedding models like OpenAI's text-embedding-3 or Google's text-embedding-004 only handled text. For images, you needed CLIP. For audio, something else entirely. Each modality lived in its own vector space, making cross-modal search impossible without complex pipelines.

Gemini Embedding 2 solves this by training a single model across all modalities from the ground up. The result: one vector space in which a text query can be compared directly against images, audio, video, and PDF pages.

Model Specifications

Here's what you're working with:

Modality    Limit
Text        Up to 8,192 tokens
Images      Up to 6 per request
Video       Up to 120 seconds
Audio       Native support
PDFs        Up to 6 pages

Output Dimensions

Choose your embedding size based on your needs; supported sizes include 768 and 1536 dimensions.

Smaller dimensions reduce storage costs and speed up similarity searches, with a modest tradeoff in retrieval quality. For most applications, 1536 is the sweet spot.
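To make that storage tradeoff concrete, here's a back-of-envelope sketch. The numbers are illustrative only: it assumes float32 vectors and ignores index overhead.

```python
def storage_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in gigabytes (float32, no index overhead)."""
    return num_vectors * dims * bytes_per_value / 1e9

million = 1_000_000
full = storage_gb(million, 1536)   # about 6.1 GB at 1536 dims
small = storage_gb(million, 768)   # exactly half the footprint at 768 dims
```

Halving the dimension count halves both storage and the per-comparison cost of similarity search, which is why smaller sizes pay off at scale.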

Quick Start: Python Example

Here's how to embed text, images, and audio in a single request:

from google import genai
from google.genai import types

# Reads the API key from the GEMINI_API_KEY environment variable.
client = genai.Client()

with open("example.png", "rb") as f:
    image_bytes = f.read()

with open("sample.mp3", "rb") as f:
    audio_bytes = f.read()

result = client.models.embed_content(
    model="gemini-embedding-2-preview",
    contents=[
        "What is the meaning of life?",
        types.Part.from_bytes(
            data=image_bytes,
            mime_type="image/png",
        ),
        types.Part.from_bytes(
            data=audio_bytes,
            mime_type="audio/mpeg",
        ),
    ],
)

The model returns embeddings for each content item. All embeddings live in the same vector space, so you can directly compare a text embedding to an image embedding using cosine similarity.
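Cosine similarity itself is simple enough to write by hand. A minimal pure-Python sketch, using short stand-in vectors (in a real app, each vector would come from the API response, e.g. `result.embeddings[i].values`):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Stand-in vectors; real embeddings have hundreds of dimensions.
text_vec = [0.1, 0.3, 0.5]
image_vec = [0.2, 0.1, 0.4]
score = cosine_similarity(text_vec, image_vec)
```

Because all modalities share one space, the same function scores text-to-text, text-to-image, or image-to-audio pairs without any special casing.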

Specifying Output Dimensions

To use smaller embeddings, add the output_dimensionality parameter:

result = client.models.embed_content(
    model="gemini-embedding-2-preview",
    contents=["Your text here"],
    config={"output_dimensionality": 768}
)

Practical Use Cases for Vibe Coders

Here's where this gets exciting for builders:

📚 Documentation RAG

Index your docs, screenshots, and code samples together. Query with text or images.

🎬 Video Search

Make YouTube tutorials or Loom recordings searchable by content, not just titles.

🎧 Audio Knowledge Base

Index podcast episodes, meeting recordings, or voice memos for semantic retrieval.

📱 Visual Search Apps

Let users search by uploading a photo to find similar products, screenshots, or designs.

Example: Building a Multimodal RAG System

Let's say you're building a coding assistant that can reference your project's documentation, screenshots, and tutorial videos. Here's the flow:

  1. Index your content: Embed all docs, images, and video clips with Gemini Embedding 2
  2. Store in a vector DB: Use ChromaDB, Weaviate, or Pinecone
  3. Query multimodally: User asks "how do I fix this error?" + attaches screenshot
  4. Retrieve relevant context: Find matching docs, similar error screenshots, and tutorial segments
  5. Generate answer: Pass retrieved context to Claude or GPT for the response

Previously this required stitching together 3+ different models. Now it's one API call.
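The flow above can be sketched with a tiny in-memory index. The item IDs and vectors here are hard-coded stand-ins; in a real app each vector would come from `client.models.embed_content` and the store would be ChromaDB, Weaviate, or Pinecone:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Steps 1-2: index content of any modality; each entry pairs an embedding
# with metadata about what it points at (doc, screenshot, video clip).
index = [
    {"id": "docs/errors.md",       "kind": "text",  "vec": [0.9, 0.1, 0.0]},
    {"id": "shots/stacktrace.png", "kind": "image", "vec": [0.8, 0.2, 0.1]},
    {"id": "vids/setup.mp4",       "kind": "video", "vec": [0.0, 0.1, 0.9]},
]

def retrieve(query_vec: list[float], k: int = 2) -> list[dict]:
    """Steps 3-4: rank every item, regardless of modality, by similarity."""
    ranked = sorted(index, key=lambda e: cosine(query_vec, e["vec"]), reverse=True)
    return ranked[:k]

# Embedding of a query like "how do I fix this error?" plus a screenshot.
hits = retrieve([0.85, 0.15, 0.05])
# Step 5: pass the retrieved items as context to Claude or GPT.
```

Because text, images, and video clips all live in one index, a single ranked pass replaces the per-modality pipelines this used to require.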

Integrations

Gemini Embedding 2 works with the tools you're already using, including vector databases like ChromaDB, Weaviate, and Pinecone.

When to Use This

✅ Good Fit

Considerations

What This Means for AI Apps

Multimodal embeddings represent a step change in what's possible with semantic search and RAG. Instead of building separate pipelines for each content type, you can now index and query every modality through a single model and a single vector space.

For vibe coders building AI-powered apps, this is a significant unlock. The barrier to multimodal AI just got a lot lower.

Build AI Apps Faster

Join Vibe Coding Academy for tutorials, guides, and a community of builders shipping AI-powered products.

Join the Academy

Key Takeaways

Learn more in the official Google announcement.

Written by Abdul Khan