- Published on
Multimodal Embeddings — Searching Across Text, Images, and Audio Together
- Authors
  - Sanjeev Sharma (@webcoderspeed1)
Introduction
Most RAG systems have been purely text-based, but human knowledge is multimodal: documents contain images, videos, charts, and diagrams. Multimodal embedding models like CLIP make it possible to search across modalities, so a text query can retrieve images and a sketch can retrieve visually similar photos. This opens new use cases in e-commerce, medical imaging, and content discovery.
- CLIP: Contrastive Language-Image Pre-training
- Contrastive Learning Concept
- ImageBind: Alignment Across Six Modalities
- Embedding Images for RAG
- Cross-Modal Search Implementation
- Multimodal Document Understanding
- Storage Strategy for Mixed Content
- Use Cases: E-Commerce, Medical, Video Search
- Checklist
- Conclusion
CLIP: Contrastive Language-Image Pre-training
CLIP trains image and text encoders jointly with contrastive learning: a photo of a cat and the caption "a cat" should be similar in embedding space.
interface CLIPModel {
  imageEncoder: ImageEncoder; // ResNet or ViT → 512-dim vector
  textEncoder: TextEncoder;   // Transformer → 512-dim vector
}
async function clipEmbeddings(
  image: Buffer,
  text: string,
  model: CLIPModel
): Promise<{ imageEmbedding: number[]; textEmbedding: number[] }> {
  const imageEmbedding = await model.imageEncoder.encode(image);
  const textEmbedding = await model.textEncoder.encode(text);
  // Normalize to unit length (important for cosine similarity)
  const normalize = (v: number[]) => {
    const norm = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
    return v.map(x => x / norm);
  };
  return {
    imageEmbedding: normalize(imageEmbedding),
    textEmbedding: normalize(textEmbedding),
  };
}
// Key insight: image and text embeddings are in the SAME space
async function textToImageRetrieval(
  query: string,
  images: Buffer[],
  model: CLIPModel
): Promise<Buffer[]> {
  // Encode only the text query; no need to run the image encoder on a dummy buffer
  const textEmbedding = await model.textEncoder.encode(query);
  // Score each image by similarity to the text query
  // (cosine similarity is scale-invariant, so normalization is optional here)
  const scored = await Promise.all(
    images.map(async (img) => ({
      image: img,
      similarity: cosineSimilarity(textEmbedding, await model.imageEncoder.encode(img)),
    }))
  );
  // Sort by similarity, return top-10
  return scored
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, 10)
    .map(s => s.image);
}
CLIP's breakthrough: a single embedding space bridging vision and language. A text query and an image can be compared directly.
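The retrieval snippets above assume a cosineSimilarity helper. A minimal sketch (the name and signature are this article's convention, not part of any CLIP library):
// Cosine similarity between two vectors of equal length.
// Returns a value in [-1, 1]; for unit-normalized CLIP embeddings this equals the dot product.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}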
Contrastive Learning Concept
Contrastive learning maximizes similarity between matched pairs and minimizes similarity between mismatched pairs.
interface ContrastiveTrainingBatch {
  images: number[][][][]; // [batchSize, height, width, channels]
  texts: string[];
  // images[i] and texts[i] are matched; all other pairings are mismatches
}
async function contrastiveLoss(
  batch: ContrastiveTrainingBatch,
  model: CLIPModel,
  temperature: number = 0.07 // lower = sharper distinctions
): Promise<number> {
  const imageEmbeddings = await model.imageEncoder.encodeBatch(batch.images);
  const textEmbeddings = await model.textEncoder.encodeBatch(batch.texts);
  // Normalize each embedding to unit length
  const normalize = (vs: number[][]) =>
    vs.map(emb => {
      const norm = Math.sqrt(emb.reduce((s, y) => s + y * y, 0));
      return emb.map(x => x / norm);
    });
  const normalizedImages = normalize(imageEmbeddings);
  const normalizedTexts = normalize(textEmbeddings);
  // Compute similarity matrix [batchSize, batchSize]:
  // similarityMatrix[i][j] = dot product of image i and text j
  const similarityMatrix = matmul(normalizedImages, transpose(normalizedTexts));
  // Scale by temperature
  const scaledMatrix = similarityMatrix.map(row => row.map(s => s / temperature));
  // Contrastive loss: maximize the diagonal (matched pairs),
  // minimize the off-diagonal (mismatches)
  let loss = 0;
  for (let i = 0; i < batch.images.length; i++) {
    // Softmax over row i: which text does image i match?
    const textScores = softmax(scaledMatrix[i]);
    loss -= Math.log(textScores[i]); // cross-entropy: log probability of the correct match
  }
  // Note: the full CLIP loss is symmetric: it adds the same term computed
  // over columns (text matching image) and averages the two directions.
  return loss / batch.images.length;
}
The temperature parameter controls sharpness: a lower temperature sharpens the softmax distribution, so the model is penalized heavily unless it is confident about the correct match.
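The loss above leans on matmul, transpose, and softmax helpers that are not part of any standard library; they stand in for whatever tensor framework you train with. A minimal sketch to make the example self-contained:
// Matrix product of an [m × k] and a [k × n] matrix represented as number[][].
function matmul(a: number[][], b: number[][]): number[][] {
  return a.map(row =>
    b[0].map((_, j) => row.reduce((sum, x, k) => sum + x * b[k][j], 0))
  );
}
// Swap rows and columns, so matmul(images, transpose(texts)) gives pairwise dot products.
function transpose(m: number[][]): number[][] {
  return m[0].map((_, j) => m.map(row => row[j]));
}
// Numerically stable softmax: subtract the max before exponentiating.
function softmax(xs: number[]): number[] {
  const max = Math.max(...xs);
  const exps = xs.map(x => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}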
ImageBind: Alignment Across Six Modalities
ImageBind extends CLIP to six modalities: images, text, audio, depth, thermal, and IMU sensors. All share a single embedding space.
interface ImageBindModel {
  imageEncoder: Encoder;
  textEncoder: Encoder;
  audioEncoder: Encoder;
  depthEncoder: Encoder;
  thermalEncoder: Encoder;
  imuEncoder: Encoder;
}
type Modality = 'image' | 'text' | 'audio' | 'depth' | 'thermal' | 'imu';
async function imagebindEmbed(
  data: Buffer,
  modality: Modality,
  model: ImageBindModel
): Promise<number[]> {
  // Map each modality to its encoder, then dispatch on the requested one
  const encoders: Record<Modality, Encoder> = {
    image: model.imageEncoder,
    text: model.textEncoder,
    audio: model.audioEncoder,
    depth: model.depthEncoder,
    thermal: model.thermalEncoder,
    imu: model.imuEncoder,
  };
  return encoders[modality].encode(data);
}
// Cross-modal retrieval: text query → images, audio, depth
async function crossModalRetrieval(
  textQuery: string,
  corpus: Array<{ data: Buffer; modality: Modality }>,
  model: ImageBindModel
): Promise<{
  imageResults: Buffer[];
  audioResults: Buffer[];
  depthResults: Buffer[];
}> {
  const queryEmbedding = await imagebindEmbed(Buffer.from(textQuery), 'text', model);
  // Embed each item once and reuse the embedding for scoring
  const scored = await Promise.all(
    corpus.map(async (item) => {
      const embedding = await imagebindEmbed(item.data, item.modality, model);
      return {
        ...item,
        similarity: cosineSimilarity(queryEmbedding, embedding),
      };
    })
  );
  // Top-k per modality, sorted by similarity
  const topK = (modality: Modality, k = 5) =>
    scored
      .filter(s => s.modality === modality)
      .sort((a, b) => b.similarity - a.similarity)
      .slice(0, k)
      .map(s => s.data);
  return {
    imageResults: topK('image'),
    audioResults: topK('audio'),
    depthResults: topK('depth'),
  };
}
ImageBind enables creative retrieval: "Find me documents with similar mood to this song" or "Show me thermal images matching this visual scene."
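The same pattern works when the query is not text. A sketch of audio-to-image retrieval, assuming a raw audio buffer, pre-computed image embeddings, and the imagebindEmbed helper above:
// Audio → image retrieval: find images whose embeddings sit closest to a song's embedding
async function audioToImageRetrieval(
  audioClip: Buffer,
  images: Array<{ data: Buffer; embedding: number[] }>, // pre-computed image embeddings
  model: ImageBindModel
): Promise<Buffer[]> {
  const audioEmbedding = await imagebindEmbed(audioClip, 'audio', model);
  return images
    .map(img => ({ data: img.data, similarity: cosineSimilarity(audioEmbedding, img.embedding) }))
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, 5)
    .map(img => img.data);
}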
Embedding Images for RAG
Two strategies: image captioning or vision encoder.
Strategy 1: Image Captioning (Convert to text)
async function imageCaptioningStrategy(
  image: Buffer,
  captioningModel: VisionCaptioningModel
): Promise<string> {
  // Generate a natural-language caption for the image
  const caption = await captioningModel.caption(image);
  // Example: "a dog sitting on a beach at sunset"
  return caption;
}
// Then embed the caption as text
async function indexImageAsCaption(
  image: Buffer,
  captioningModel: VisionCaptioningModel,
  textEmbedder: TextEmbedder
): Promise<number[]> {
  const caption = await captioningModel.caption(image);
  return await textEmbedder.embed(caption);
}
Pros: retrieves images via text description. Cons: loses visual details, caption quality varies.
Strategy 2: Vision Encoder (Direct embedding)
async function visionEncoderStrategy(
  image: Buffer,
  visionEncoder: VisionEncoder
): Promise<number[]> {
  // Directly embed image features
  return await visionEncoder.encode(image);
}
// Retrieve images visually
async function imageToImageRetrieval(
  queryImage: Buffer,
  corpus: Buffer[],
  visionEncoder: VisionEncoder
): Promise<Buffer[]> {
  const queryEmbedding = await visionEncoder.encode(queryImage);
  const scored = await Promise.all(
    corpus.map(async (img) => ({
      image: img,
      similarity: cosineSimilarity(queryEmbedding, await visionEncoder.encode(img)),
    }))
  );
  return scored
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, 10)
    .map(s => s.image);
}
Pros: preserves visual detail; fast. Cons: supports only image-to-image search; text queries require a multimodal encoder like CLIP.
Best practice for RAG: Use CLIP (multimodal). Embed both caption and image in the same space for flexible retrieval.
async function multimodalImageIndexing(
  image: Buffer,
  imagePath: string,
  clipModel: CLIPModel,
  captioningModel: VisionCaptioningModel
): Promise<{
  imageCLIPEmbedding: number[];
  captionEmbedding: number[];
  caption: string;
}> {
  // Get a caption for the image
  const caption = await captioningModel.caption(image);
  // One CLIP call embeds both the image and its caption into the same space
  const { imageEmbedding, textEmbedding } = await clipEmbeddings(image, caption, clipModel);
  return {
    imageCLIPEmbedding: imageEmbedding,
    captionEmbedding: textEmbedding,
    caption,
  };
}
Store both embeddings; retrieve via text or image query.
Cross-Modal Search Implementation
Retrieve images from text query (or vice versa) in a single vector space.
interface MultimodalIndexEntry {
  id: string;
  textEmbedding: number[];
  imageEmbedding: number[];
  imagePath: string;
  caption: string;
}
async function crossModalSearch(
  query: string | Buffer, // text query or image query
  documents: MultimodalIndexEntry[],
  clipModel: CLIPModel
): Promise<MultimodalIndexEntry[]> {
  // Embed the query with whichever encoder matches its modality
  const queryEmbedding = Buffer.isBuffer(query)
    ? await clipModel.imageEncoder.encode(query)
    : await clipModel.textEncoder.encode(query);
  // Score documents by both caption and image similarity
  const scored = documents.map(doc => {
    const textSimilarity = cosineSimilarity(queryEmbedding, doc.textEmbedding);
    const imageSimilarity = cosineSimilarity(queryEmbedding, doc.imageEmbedding);
    return {
      doc,
      textSimilarity,
      imageSimilarity,
      combinedScore: 0.6 * textSimilarity + 0.4 * imageSimilarity, // weight text slightly higher
    };
  });
  return scored
    .sort((a, b) => b.combinedScore - a.combinedScore)
    .slice(0, 10)
    .map(s => s.doc);
}
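A quick usage sketch, assuming an index built with multimodalImageIndexing above; indexEntries, uploadedPhoto, and clipModel are illustrative names:
// Text query
const byText = await crossModalSearch('red running shoes on a white background', indexEntries, clipModel);
// Image query: a photo the user uploaded
const byImage = await crossModalSearch(uploadedPhoto, indexEntries, clipModel);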
Multimodal Document Understanding
Extract meaning from mixed-content documents (text + images).
interface MultimodalDocument {
  id: string;
  pages: Array<{
    text: string;
    images: Buffer[];
    textEmbedding: number[];
    imageEmbeddings: number[][]; // one embedding per image on the page
  }>;
}
async function processMultimodalDocument(
  pdfPath: string,
  clipModel: CLIPModel
): Promise<MultimodalDocument> {
  const pages = await extractPagesFromPDF(pdfPath);
  const processed = await Promise.all(
    pages.map(async (page) => {
      // Note: CLIP's text encoder has a short context window (77 tokens),
      // so long pages may need chunking or a dedicated text embedder.
      const textEmbedding = await clipModel.textEncoder.encode(page.text);
      const imageEmbeddings = await Promise.all(
        page.images.map(img => clipModel.imageEncoder.encode(img))
      );
      return {
        text: page.text,
        images: page.images,
        textEmbedding,
        imageEmbeddings,
      };
    })
  );
  return {
    id: generateId(),
    pages: processed,
  };
}
// Retrieve by page, combining text and image relevance
async function multimodalDocumentRetrieval(
  query: string,
  documents: MultimodalDocument[],
  clipModel: CLIPModel
): Promise<Array<{ documentId: string; pageIndex: number; score: number }>> {
  const queryEmbedding = await clipModel.textEncoder.encode(query);
  const results: Array<{ documentId: string; pageIndex: number; score: number }> = [];
  for (const doc of documents) {
    for (let pageIdx = 0; pageIdx < doc.pages.length; pageIdx++) {
      const page = doc.pages[pageIdx];
      const textScore = cosineSimilarity(queryEmbedding, page.textEmbedding);
      const imageScores = page.imageEmbeddings.map(imgEmb =>
        cosineSimilarity(queryEmbedding, imgEmb)
      );
      const avgImageScore =
        imageScores.length > 0 ? imageScores.reduce((a, b) => a + b) / imageScores.length : 0;
      const combinedScore = 0.7 * textScore + 0.3 * avgImageScore;
      results.push({
        documentId: doc.id,
        pageIndex: pageIdx,
        score: combinedScore,
      });
    }
  }
  return results
    .sort((a, b) => b.score - a.score)
    .slice(0, 20);
}
Storage Strategy for Mixed Content
Store efficiently without duplicating data.
interface StorageLayer {
  vectorDb: VectorStore;        // embeddings
  blobStore: BlobStore;         // images, audio
  documentDb: DocumentDatabase; // metadata, text
}
async function storeMultimodalDocument(
  doc: MultimodalDocument,
  storage: StorageLayer
): Promise<void> {
  // Store metadata and text in the document DB
  await storage.documentDb.insert({
    id: doc.id,
    textContent: doc.pages.map(p => p.text).join('\n'),
  });
  // Store images in the blob store, keeping the generated IDs
  // so embeddings can point back to their blobs
  const imageIds: string[][] = [];
  for (const page of doc.pages) {
    const pageImageIds: string[] = [];
    for (const image of page.images) {
      const imageId = generateId();
      await storage.blobStore.put(imageId, image);
      pageImageIds.push(imageId);
    }
    imageIds.push(pageImageIds);
  }
  // Store embeddings in the vector DB, referencing document, page, and blob
  const vectors = doc.pages.flatMap((page, pageIdx) => [
    {
      id: `${doc.id}-page-${pageIdx}-text`,
      embedding: page.textEmbedding,
      metadata: { documentId: doc.id, pageIndex: pageIdx, type: 'text' },
    },
    ...page.imageEmbeddings.map((emb, imgIdx) => ({
      id: `${doc.id}-page-${pageIdx}-image-${imgIdx}`,
      embedding: emb,
      metadata: {
        documentId: doc.id,
        pageIndex: pageIdx,
        type: 'image',
        imageIndex: imgIdx,
        blobId: imageIds[pageIdx][imgIdx],
      },
    })),
  ]);
  await storage.vectorDb.addMany(vectors);
}
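The flip side of this layout is the read path: vector search returns IDs and metadata, and the text and blobs are fetched lazily. A minimal sketch under the same assumed StorageLayer interfaces; vectorDb.search, documentDb.findById, and blobStore.get are placeholder method names, so substitute whatever your stores actually expose:
// Query the vector DB, then hydrate results from the document and blob stores
async function retrieveMultimodalChunks(
  queryEmbedding: number[],
  storage: StorageLayer,
  topK: number = 10
): Promise<Array<{ documentId: string; pageIndex: number; text: string; image?: Buffer }>> {
  const hits = await storage.vectorDb.search(queryEmbedding, topK);
  return Promise.all(
    hits.map(async (hit) => {
      const docRecord = await storage.documentDb.findById(hit.metadata.documentId);
      return {
        documentId: hit.metadata.documentId,
        pageIndex: hit.metadata.pageIndex,
        text: docRecord.textContent,
        // Only image hits carry a blobId; fetch the blob lazily
        image: hit.metadata.type === 'image'
          ? await storage.blobStore.get(hit.metadata.blobId)
          : undefined,
      };
    })
  );
}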
Use Cases: E-Commerce, Medical, Video Search
E-Commerce: Search products by image. A customer uploads a photo of a style they like; the system finds visually similar products.
Medical Imaging: Retrieve radiology images by pathology description or find similar cases.
Video Search: Index keyframes from videos; search by text or sketch.
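As a concrete sketch of the video case: extract keyframes, embed each with CLIP, and store them with timestamps so a text hit maps back to a moment in the video. extractKeyframes is a hypothetical helper (in practice, ffmpeg scene detection or fixed-interval sampling).
interface KeyframeEntry {
  videoId: string;
  timestampSec: number;
  embedding: number[];
}
// Hypothetical keyframe extractor, e.g. ffmpeg scene detection or one frame every N seconds
declare function extractKeyframes(videoPath: string): Promise<Array<{ frame: Buffer; timestampSec: number }>>;
async function indexVideo(
  videoId: string,
  videoPath: string,
  clipModel: CLIPModel
): Promise<KeyframeEntry[]> {
  const keyframes = await extractKeyframes(videoPath);
  return Promise.all(
    keyframes.map(async (kf) => ({
      videoId,
      timestampSec: kf.timestampSec,
      embedding: await clipModel.imageEncoder.encode(kf.frame),
    }))
  );
}
// A text query then scores keyframes exactly like any other image embedding
async function searchVideo(
  query: string,
  keyframes: KeyframeEntry[],
  clipModel: CLIPModel
): Promise<Array<KeyframeEntry & { score: number }>> {
  const queryEmbedding = await clipModel.textEncoder.encode(query);
  return keyframes
    .map(kf => ({ ...kf, score: cosineSimilarity(queryEmbedding, kf.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 10);
}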
Checklist
- Use CLIP for multimodal RAG (text + image)
- Consider ImageBind for richer modalities (audio, depth)
- Strategy 1: embed image captions as text (flexible but lossy)
- Strategy 2: embed images directly with vision encoder (rich but image-only search)
- Best: dual embeddings (caption + image) via CLIP for flexibility
- Store text in document DB, images in blob store, embeddings in vector DB
- Weight text and image scores (e.g., 70% text, 30% image for RAG)
- Benchmark on your domain; image relevance varies widely
Conclusion
Multimodal embeddings unlock RAG beyond text. CLIP, the easiest entry point, bridges vision and language in one embedding space; ImageBind extends the same idea to six modalities. Store efficiently by separating text, blobs, and embeddings. The combination of caption embeddings and visual embeddings gives you retrieval flexibility: search by text or by image.