Qdrant in Production — Collections, Quantization, and Filtering at Scale

Introduction

Qdrant has emerged as the vector database for teams that want flexibility without sacrificing simplicity. Whether self-hosted or on Qdrant Cloud, production deployments require careful attention to collection design, quantization strategies, and payload filtering. This guide covers real-world patterns.

Collection Setup: Vector Config and Storage

Qdrant collections combine vector configuration with payload schemas. Get this right from the start:

from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct

client = QdrantClient(url="http://localhost:6333")

# Create collection with explicit vector config
client.recreate_collection(
    collection_name="documents",
    vectors_config=VectorParams(
        size=1536,  # text-embedding-3-small dimension
        distance=Distance.COSINE,
        on_disk=False,  # Keep vectors in RAM for <100M vectors
    ),
)

# For >100M vectors, use on_disk=True
client.recreate_collection(
    collection_name="large_corpus",
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE,
        on_disk=True,  # Significantly reduces RAM, slight latency cost
    ),
)

# Distance options:
# - Distance.COSINE: cosine similarity; Qdrant normalizes vectors on upload
# - Distance.EUCLID: Euclidean distance (useful for clustering-style data)
# - Distance.DOT: dot product (fastest; only meaningful for pre-normalized vectors)

# Insert sample data
points = [
    PointStruct(
        id=1,
        vector=[0.1] * 1536,
        payload={"title": "AI Infrastructure", "domain": "ai"},
    ),
    PointStruct(
        id=2,
        vector=[0.2] * 1536,
        payload={"title": "Vector Databases", "domain": "db"},
    ),
]
client.upsert(collection_name="documents", points=points)

on_disk=True trades a small latency cost for 10-50× lower RAM usage and scales to billions of vectors; on_disk=False keeps vectors in RAM for the lowest query latency.

Payload Schema and Indexed Fields

Indexed payloads unlock fast filtering. Define your schema explicitly:

from qdrant_client.models import PointStruct, PayloadSchemaType

# Before inserting large amounts of data, create indexes
client.create_payload_index(
    collection_name="documents",
    field_name="domain",
    field_schema=PayloadSchemaType.KEYWORD,  # Exact match
)

# Index integer field for range queries
client.create_payload_index(
    collection_name="documents",
    field_name="date_created",
    field_schema=PayloadSchemaType.INTEGER,
)

# Index floating-point field
client.create_payload_index(
    collection_name="documents",
    field_name="confidence_score",
    field_schema=PayloadSchemaType.FLOAT,
)

# Insert data with indexed payloads
points = [
    PointStruct(
        id=i,
        vector=[0.1] * 1536,
        payload={
            "domain": "ai",
            "date_created": 1710604800,  # 2024-03-16T16:00:00Z
            "confidence_score": 0.95,
            "tags": ["embedding", "production"],
        },
    )
    for i in range(1000)
]

# Upsert with indexed payloads for fast filtering
client.upsert(collection_name="documents", points=points, wait=True)

# Verify indexes were created
info = client.get_collection("documents")
print(f"Collection info: {info}")

Index payload fields you'll filter on frequently. Filtering unindexed fields is O(n).

Scalar Quantization: 4× Memory Reduction

Quantization trades ~2-3% recall for massive memory savings:

from qdrant_client.models import (
    VectorParams, Distance, ScalarQuantization,
    ScalarQuantizationConfig, ScalarType, PointStruct
)

# Create collection with int8 scalar quantization
client.recreate_collection(
    collection_name="quantized_docs",
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE,
    ),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,
            quantile=0.99,
            always_ram=False,  # Keep quantized vectors on disk if memory-constrained
        ),
    ),
)

# Qdrant quantizes vectors on insert automatically
points = [
    PointStruct(
        id=i,
        vector=[0.1 + (i % 256) / 256.0] * 1536,  # Varied vectors
        payload={"index": i},
    )
    for i in range(10000)
]
client.upsert(collection_name="quantized_docs", points=points)

# Memory savings:
# float32: 1536 * 4 bytes = 6 KB per vector
# int8: 1536 * 1 byte = 1.5 KB per vector
# Savings: 75% (6 KB -> 1.5 KB)

# The query API is unchanged; quantization is transparent to callers
results = client.search(
    collection_name="quantized_docs",
    query_vector=[0.5] * 1536,
    limit=10,
)
print(f"Found {len(results)} results")

When to use scalar quantization: corpora beyond ~10M vectors where RAM dominates cost. Trade-off: roughly 2-3% recall loss, largely recoverable with oversampling and rescoring.

Binary Quantization for Search Speed

Binary quantization compresses to 1 bit per dimension:

from qdrant_client.models import (
    VectorParams, Distance, PointStruct,
    BinaryQuantization, BinaryQuantizationConfig
)

# Binary quantization: 1536 dimensions -> 192 bytes (1536/8)
client.recreate_collection(
    collection_name="binary_quantized",
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE,
    ),
    quantization_config=BinaryQuantization(
        binary=BinaryQuantizationConfig(always_ram=True),  # Keep in RAM for speed
    ),
)

# Memory per vector: 1536 bits / 8 = 192 bytes (vs 6 KB float32)
# Compression ratio: 32×

# Insert vectors; a generator avoids holding all 1M points in memory at once
def generate_binary_points(count: int):
    for i in range(count):
        yield PointStruct(id=i, vector=[0.1] * 1536, payload={"idx": i})

# Batch insert for efficiency
batch_size = 10000
batch = []
for point in generate_binary_points(1_000_000):
    batch.append(point)
    if len(batch) == batch_size:
        client.upsert(
            collection_name="binary_quantized",
            points=batch,
            wait=True,
        )
        batch = []
if batch:
    client.upsert(collection_name="binary_quantized", points=batch, wait=True)

# Binary quantization has higher recall loss (~5-10%) but extreme compression

Trade-off: binary quantization loses roughly 5-10% recall but compresses 32×. Use it for massive corpora where some recall loss is acceptable.

Payload Filtering: must, should, must_not

Complex filtering is Qdrant's strength. Build filter logic explicitly:

from qdrant_client.models import Filter, FieldCondition, MatchValue, Range

# must: all conditions required
must_filter = Filter(
    must=[
        FieldCondition(
            key="domain",
            match=MatchValue(value="ai"),
        ),
        FieldCondition(
            key="date_created",
            range=Range(gte=1710604800),  # On or after 2024-03-16 (UTC)
        ),
    ],
)

# should: at least one condition (OR logic)
should_filter = Filter(
    should=[
        FieldCondition(
            key="domain",
            match=MatchValue(value="ai"),
        ),
        FieldCondition(
            key="domain",
            match=MatchValue(value="ml"),
        ),
    ],
)

# must_not: exclude matching
must_not_filter = Filter(
    must_not=[
        FieldCondition(
            key="status",
            match=MatchValue(value="draft"),
        ),
    ],
)

# Complex filter: published AI or ML docs
complex_filter = Filter(
    must=[
        Filter(
            should=[
                FieldCondition(key="domain", match=MatchValue(value="ai")),
                FieldCondition(key="domain", match=MatchValue(value="ml")),
            ],
        ),
    ],
    must_not=[
        FieldCondition(key="status", match=MatchValue(value="draft")),
    ],
)

# Search with filtering
results = client.search(
    collection_name="documents",
    query_vector=[0.1] * 1536,
    query_filter=complex_filter,
    limit=10,
)

for result in results:
    print(f"ID: {result.id}, Score: {result.score}, Payload: {result.payload}")

Combine must/should/must_not for sophisticated queries. Qdrant applies filters during the vector search itself (filterable HNSW), so indexed payload filters add very little latency compared with post-filtering.

Batch Upsert With Metadata

Large-scale ingestion requires batching. Batches of a few thousand points keep individual requests fast and memory-friendly:

from qdrant_client.models import PointStruct
import time

# Generate 100K documents
def generate_documents(count: int):
    for i in range(count):
        yield PointStruct(
            id=i,
            vector=[0.1 + (i % 1000) / 1000.0] * 1536,
            payload={
                "title": f"Document {i}",
                "domain": ["ai", "ml", "db", "infra"][i % 4],
                "created_at": 1710604800 + (i * 3600),
                "author": f"author_{i % 100}",
                "tags": ["production", "documentation"],
            },
        )

# Upsert in batches
batch_size = 5000
docs = generate_documents(100_000)

start_time = time.time()
batch = []

for doc in docs:
    batch.append(doc)
    if len(batch) == batch_size:
        client.upsert(
            collection_name="documents",
            points=batch,
            wait=False,  # Return without waiting for persistence/indexing
        )
        batch = []

# Final batch
if batch:
    client.upsert(collection_name="documents", points=batch, wait=True)

elapsed = time.time() - start_time
print(f"Upserted 100K vectors in {elapsed:.2f} seconds")

wait=False returns before points are fully persisted and indexed, which speeds up ingestion; use wait=True on the final batch so the pipeline finishes in a consistent state.

Qdrant Cloud vs Self-Hosted

Production comparison:

# Qdrant Cloud (managed)
from qdrant_client import QdrantClient

cloud_client = QdrantClient(
    url="https://your-cluster-name.qdrant.io",
    api_key="YOUR_API_KEY",
)

# Self-hosted (Docker)
self_hosted_client = QdrantClient(
    url="http://qdrant-server:6333",
    prefer_grpc=True,  # Use gRPC (port 6334 by default) for lower latency
)

# Self-hosted with authentication
auth_client = QdrantClient(
    url="http://qdrant-server:6333",
    api_key="YOUR_API_KEY",
)

Qdrant Cloud: Managed backups, scaling, monitoring. Pricing scales with cluster RAM and storage; check current pricing for your region.

Self-hosted: Full control, lower cost at scale (>1TB). DevOps overhead.

Choose Cloud for <100GB or low DevOps bandwidth. Self-hosted for >1TB or compliance requirements.

Backup and Restore

Critical for production:

from datetime import datetime

def backup_collection(
    client: QdrantClient,
    collection_name: str,
    backup_path: str,
):
    """Create a snapshot; copy it to backup_path (e.g. S3) out of band"""
    snapshot = client.create_snapshot(
        collection_name=collection_name,
        wait=True,
    )
    print(f"Snapshot created: {snapshot}")

    # For Qdrant Cloud, download the snapshot via the snapshots API
    # For self-hosted, snapshots are written to storage/snapshots/

def restore_collection(
    client: QdrantClient,
    collection_name: str,
    snapshot_location: str,
):
    """Restore a collection from a snapshot file path or URL"""
    recovered = client.recover_snapshot(
        collection_name=collection_name,
        location=snapshot_location,
        wait=True,
    )
    print(f"Recovery result: {recovered}")

# Automated backup (daily at 2 AM)
# Use cron or cloud scheduler

backup_collection(client, "documents", f"backup-{datetime.now().date()}")

Store snapshots in S3 or GCS. Test restore procedures quarterly.

Monitoring Qdrant Metrics

Expose Prometheus metrics for observability:

# Qdrant exposes Prometheus metrics at /metrics on the HTTP port (6333)

import requests
import json

def get_qdrant_metrics():
    """Fetch Prometheus metrics"""
    response = requests.get("http://localhost:6333/metrics")
    return response.text

# Key signals to monitor (exact metric names vary by Qdrant version;
# check your deployment's /metrics output):
# - vector and point counts per collection
# - payload storage size
# - search request volume and latency
# - upsert/update latency

# Set up alerts:
# - p99 search latency above ~1s: performance degradation
# - vector count approaching disk/RAM capacity
# - error rates spiking

def get_collection_size(collection_name: str):
    """Get collection metadata"""
    info = client.get_collection(collection_name)
    return {
        "vectors": info.vectors_count,
        "indexed_vectors": info.indexed_vectors_count,
        "points": info.points_count,
    }

size = get_collection_size("documents")
print(f"Collection size: {json.dumps(size, indent=2)}")

Set up Grafana dashboards to track collection growth and query performance.
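If you just need a couple of gauges in your own tooling without a full Prometheus client library, the text exposition format is line-oriented and easy to parse. A minimal sketch (the sample payload is illustrative, not actual Qdrant output):

```python
def parse_prometheus_text(text: str) -> dict:
    """Parse simple 'name value' lines of Prometheus text exposition,
    skipping comments and labeled series."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        if not name or "{" in name:
            continue  # skip labeled series in this simple sketch
        try:
            metrics[name] = float(value)
        except ValueError:
            pass
    return metrics

sample = """\
# HELP collections_total number of collections
# TYPE collections_total gauge
collections_total 3
app_info{version="1.9.0"} 1
rest_responses_avg_duration_seconds 0.004
"""
print(parse_prometheus_text(sample))
# {'collections_total': 3.0, 'rest_responses_avg_duration_seconds': 0.004}
```

For anything beyond spot checks, let Prometheus scrape the endpoint directly and keep alert logic in its rule files rather than in application code.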

Checklist

  • Define vector dimensions and distance metric (Cosine/Euclidean/Dot)
  • Decide on_disk vs in-memory based on scale
  • Index payload fields for filtering (KEYWORD, INTEGER, FLOAT)
  • Benchmark quantization trade-offs (scalar vs binary vs none)
  • Implement batch upsert pipeline (5K-10K batch size)
  • Set up daily snapshots to S3/GCS
  • Configure Prometheus metrics and Grafana dashboard
  • Test filter performance on production payloads
  • Document collection schema and migration procedures
  • Plan capacity for 12-month growth

Conclusion

Qdrant's production-ready architecture supports billions of vectors with sophisticated filtering and quantization. Start with a clear vector config and indexed payloads. Quantization unlocks cost-efficiency at scale, and batch operations raise ingestion throughput. The cloud vs self-hosted decision comes down to scale, cost, and DevOps capacity. Monitor metrics rigorously, and these patterns will let you scale efficiently.