Qdrant in Production — Collections, Quantization, and Filtering at Scale

Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
Qdrant has emerged as the vector database for teams that want flexibility without sacrificing simplicity. Whether self-hosted or on Qdrant Cloud, production deployments require careful attention to collection design, quantization strategies, and payload filtering. This guide covers real-world patterns.
- Collection Setup: Vector Config and Storage
- Payload Schema and Indexed Fields
- Scalar Quantization: 4× Memory Reduction
- Binary Quantization for Search Speed
- Payload Filtering: must, should, must_not
- Batch Upsert With Metadata
- Qdrant Cloud vs Self-Hosted
- Backup and Restore
- Monitoring Qdrant Metrics
- Checklist
- Conclusion
Collection Setup: Vector Config and Storage
Qdrant collections combine vector configuration with payload schemas. Get this right from the start:
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct

client = QdrantClient(url="http://localhost:6333")

# Create collection with explicit vector config
client.recreate_collection(
    collection_name="documents",
    vectors_config=VectorParams(
        size=1536,  # text-embedding-3-small dimension
        distance=Distance.COSINE,
        on_disk=False,  # Keep vectors in RAM for <100M vectors
    ),
)

# For >100M vectors, use on_disk=True
client.recreate_collection(
    collection_name="large_corpus",
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE,
        on_disk=True,  # Significantly reduces RAM at a slight latency cost
    ),
)
# Distance options:
# - Distance.COSINE: cosine similarity (Qdrant normalizes vectors on upload)
# - Distance.EUCLID: Euclidean distance (good for cluster-structured data)
# - Distance.DOT: dot product (fastest; expects pre-normalized vectors)
# Insert sample data
points = [
    PointStruct(
        id=1,
        vector=[0.1] * 1536,
        payload={"title": "AI Infrastructure", "domain": "ai"},
    ),
    PointStruct(
        id=2,
        vector=[0.2] * 1536,
        payload={"title": "Vector Databases", "domain": "db"},
    ),
]
client.upsert(collection_name="documents", points=points)
on_disk=True: 10-50× less RAM, scales to billions of vectors. on_disk=False: sub-millisecond search latency.
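To make the in-RAM vs on-disk decision concrete, a back-of-envelope estimate helps. The helper below is an illustration, not part of the Qdrant API; it computes only the raw vector storage, and real memory usage is higher once HNSW links and payloads are added:

```python
def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_component: int = 4) -> int:
    """Raw storage for the vectors alone (float32 = 4 bytes per component).
    Actual RAM usage is higher: HNSW graph links and payloads add overhead."""
    return num_vectors * dim * bytes_per_component

# 10M text-embedding-3-small vectors (1536 dims) in float32:
total = raw_vector_bytes(10_000_000, 1536)
print(f"{total / 1024**3:.1f} GiB")  # ~57.2 GiB -> on_disk=True starts to look attractive
```

If the estimate comfortably fits your node's RAM, keep on_disk=False; otherwise move vectors to disk or quantize (next sections).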
Payload Schema and Indexed Fields
Indexed payloads unlock fast filtering. Define your schema explicitly:
from qdrant_client.models import PointStruct, PayloadSchemaType

# Before inserting large amounts of data, create payload indexes
client.create_payload_index(
    collection_name="documents",
    field_name="domain",
    field_schema=PayloadSchemaType.KEYWORD,  # Exact-match filtering
)

# Index an integer field for range queries
client.create_payload_index(
    collection_name="documents",
    field_name="date_created",
    field_schema=PayloadSchemaType.INTEGER,
)

# Index a floating-point field
client.create_payload_index(
    collection_name="documents",
    field_name="confidence_score",
    field_schema=PayloadSchemaType.FLOAT,
)
# Insert data with indexed payloads
points = [
    PointStruct(
        id=i,
        vector=[0.1] * 1536,
        payload={
            "domain": "ai",
            "date_created": 1710604800,  # 2024-03-17
            "confidence_score": 0.95,
            "tags": ["embedding", "production"],
        },
    )
    for i in range(1000)
]

# Upsert with indexed payloads for fast filtering
client.upsert(collection_name="documents", points=points, wait=True)

# Verify indexes were created
info = client.get_collection("documents")
print(f"Collection info: {info}")
Index the payload fields you'll filter on frequently. Filtering on unindexed fields falls back to a full scan, which is O(n).
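Qdrant payloads are schemaless, so nothing stops a point from arriving with date_created as a string, and such values simply won't benefit from the integer index. A small client-side check (a hypothetical helper, not a qdrant-client API) keeps indexed fields consistent before upsert:

```python
# Expected types for the indexed fields declared above
SCHEMA = {"domain": str, "date_created": int, "confidence_score": float}

def validate_payload(payload: dict) -> list:
    """Return a list of type violations for the indexed fields."""
    errors = []
    for field, expected in SCHEMA.items():
        if field in payload and not isinstance(payload[field], expected):
            errors.append(f"{field}: expected {expected.__name__}, "
                          f"got {type(payload[field]).__name__}")
    return errors

print(validate_payload({"domain": "ai", "date_created": "2024-03-17"}))
# -> ['date_created: expected int, got str']
```

Run it in your ingestion pipeline and reject or coerce bad records before they reach the collection.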
Scalar Quantization: 4× Memory Reduction
Quantization trades ~2-3% recall for massive memory savings:
from qdrant_client.models import (
    VectorParams, Distance, ScalarQuantization, ScalarQuantizationConfig,
    ScalarType, PointStruct,
)

# Create collection with int8 scalar quantization
client.recreate_collection(
    collection_name="quantized_docs",
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE,
    ),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,
            quantile=0.99,
            always_ram=False,  # Keep quantized vectors on disk if memory-constrained
        ),
    ),
)
# Qdrant quantizes vectors on insert automatically
points = [
    PointStruct(
        id=i,
        vector=[0.1 + (i % 256) / 256.0] * 1536,  # Varied vectors
        payload={"index": i},
    )
    for i in range(10000)
]
client.upsert(collection_name="quantized_docs", points=points)
# Memory savings:
# float32: 1536 * 4 bytes = 6 KB per vector
# int8:    1536 * 1 byte  = 1.5 KB per vector
# Savings: 75% (6 KB -> 1.5 KB), a 4x reduction

# Quantization is transparent to the caller: queries look exactly the same
results = client.search(
    collection_name="quantized_docs",
    query_vector=[0.5] * 1536,
    limit=10,
)
print(f"Found {len(results)} results")
Use scalar quantization when you have >10M vectors and RAM is the dominant cost; expect roughly 2-3% recall loss.
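The memory arithmetic above generalizes across the quantization options covered in this guide. A small comparison helper (an illustration, not a Qdrant API):

```python
def bytes_per_vector(dim: int, mode: str) -> float:
    """Per-vector storage for the compressed representation only."""
    if mode == "float32":
        return dim * 4   # 4 bytes per component, no quantization
    if mode == "int8":   # scalar quantization
        return dim * 1   # 1 byte per component
    if mode == "binary": # binary quantization (next section)
        return dim / 8   # 1 bit per component
    raise ValueError(f"unknown mode: {mode}")

for mode in ("float32", "int8", "binary"):
    print(f"{mode}: {bytes_per_vector(1536, mode)} bytes")
# float32: 6144, int8: 1536, binary: 192.0
```

Multiply by your vector count to size a node; note that Qdrant can also keep the original float32 vectors on disk for rescoring, which adds storage but not RAM.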
Binary Quantization for Search Speed
Binary quantization compresses to 1 bit per dimension:
from qdrant_client.models import (
    VectorParams, Distance, PointStruct,
    BinaryQuantization, BinaryQuantizationConfig,
)

# Binary quantization: 1536 dimensions -> 192 bytes (1536 / 8)
client.recreate_collection(
    collection_name="binary_quantized",
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE,
    ),
    quantization_config=BinaryQuantization(
        binary=BinaryQuantizationConfig(always_ram=True),  # Keep in RAM for speed
    ),
)
# Memory per vector: 1536 bits / 8 = 192 bytes (vs 6 KB float32)
# Compression ratio: 32x
# Insert vectors
points = [
    PointStruct(id=i, vector=[0.1] * 1536, payload={"idx": i})
    for i in range(1_000_000)
]

# Batch insert for efficiency
batch_size = 10000
for i in range(0, len(points), batch_size):
    client.upsert(
        collection_name="binary_quantized",
        points=points[i : i + batch_size],
        wait=True,
    )
# Binary quantization has higher recall loss (~5-10%) but extreme compression
Trade-off: binary quantization loses roughly 5-10% recall but compresses 32×. Use it for massive corpora where approximate top-k recall is acceptable, ideally with rescoring against the original vectors enabled.
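Conceptually, binary quantization keeps only the sign of each vector component. Qdrant does this internally; the stdlib sketch below exists purely to show the idea of packing one bit per dimension:

```python
def binarize(vector: list) -> bytes:
    """Pack each component's sign bit (positive -> 1) into bytes, MSB first."""
    bits = 0
    for x in vector:
        bits = (bits << 1) | (1 if x > 0 else 0)
    n_bytes = (len(vector) + 7) // 8  # 8 dimensions per byte
    return bits.to_bytes(n_bytes, "big")

vec = [0.3, -0.1, 0.7, -0.2, 0.0, 0.9, -0.4, 0.5]
packed = binarize(vec)
print(len(packed))  # 8 dims fit in 1 byte; 1536 dims fit in 192 bytes
```

Distance between binarized vectors reduces to cheap bitwise operations (Hamming distance), which is where the search speedup comes from.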
Payload Filtering: must, should, must_not
Complex filtering is Qdrant's strength. Build filter logic explicitly:
from qdrant_client.models import Filter, FieldCondition, MatchValue, Range

# must: all conditions required (AND)
must_filter = Filter(
    must=[
        FieldCondition(
            key="domain",
            match=MatchValue(value="ai"),
        ),
        FieldCondition(
            key="date_created",
            range=Range(gte=1710604800),  # After 2024-03-17
        ),
    ],
)

# should: at least one condition must match (OR)
should_filter = Filter(
    should=[
        FieldCondition(
            key="domain",
            match=MatchValue(value="ai"),
        ),
        FieldCondition(
            key="domain",
            match=MatchValue(value="ml"),
        ),
    ],
)

# must_not: exclude matching points
must_not_filter = Filter(
    must_not=[
        FieldCondition(
            key="status",
            match=MatchValue(value="draft"),
        ),
    ],
)

# Complex filter: published AI or ML docs
complex_filter = Filter(
    must=[
        Filter(
            should=[
                FieldCondition(key="domain", match=MatchValue(value="ai")),
                FieldCondition(key="domain", match=MatchValue(value="ml")),
            ],
        ),
    ],
    must_not=[
        FieldCondition(key="status", match=MatchValue(value="draft")),
    ],
)

# Search with filtering
results = client.search(
    collection_name="documents",
    query_vector=[0.1] * 1536,
    query_filter=complex_filter,
    limit=10,
)
for result in results:
    print(f"ID: {result.id}, Score: {result.score}, Payload: {result.payload}")
Combine must/should/must_not for sophisticated queries. Qdrant applies query_filter during the vector search itself (filterable HNSW), not as a post-filter, so index the fields you filter on to keep latency low.
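Range conditions on date_created compare raw epoch seconds, so it pays to convert calendar dates explicitly rather than hard-coding magic numbers. A small stdlib helper (not part of qdrant-client):

```python
from datetime import datetime, timezone

def to_epoch(year: int, month: int, day: int) -> int:
    """Midnight UTC for the given date, as epoch seconds (for Range filters)."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())

print(to_epoch(2024, 3, 17))  # 1710633600
```

Then `Range(gte=to_epoch(2024, 3, 17))` matches documents created on or after that date in UTC; being explicit about the timezone avoids off-by-hours bugs in date filters.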
Batch Upsert With Metadata
Large-scale ingestion requires batching; batches of a few thousand points (up to roughly 10K) work well in practice:
from qdrant_client.models import PointStruct
import time

# Generate 100K documents
def generate_documents(count: int):
    for i in range(count):
        yield PointStruct(
            id=i,
            vector=[0.1 + (i % 1000) / 1000.0] * 1536,
            payload={
                "title": f"Document {i}",
                "domain": ["ai", "ml", "db", "infra"][i % 4],
                "created_at": 1710604800 + (i * 3600),
                "author": f"author_{i % 100}",
                "tags": ["production", "documentation"],
            },
        )
# Upsert in batches
batch_size = 5000
docs = generate_documents(100_000)

start_time = time.time()
batch = []
for doc in docs:
    batch.append(doc)
    if len(batch) == batch_size:
        client.upsert(
            collection_name="documents",
            points=batch,
            wait=False,  # Async upsert for speed
        )
        batch = []

# Final batch
if batch:
    client.upsert(collection_name="documents", points=batch, wait=True)

elapsed = time.time() - start_time
print(f"Upserted 100K vectors in {elapsed:.2f} seconds")
wait=False returns before points are fully persisted and indexed, so it's faster but not immediately consistent. Use wait=True on the final batch to confirm all writes have landed.
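The accumulate-and-flush loop above can be factored into a reusable generator, sketched here in plain Python so it works with any iterable of points:

```python
from itertools import islice

def batched(iterable, batch_size: int):
    """Yield lists of up to batch_size items from any iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

sizes = [len(b) for b in batched(range(12_345), 5000)]
print(sizes)  # [5000, 5000, 2345]
```

Usage then collapses to `for batch in batched(generate_documents(100_000), 5000): client.upsert(collection_name="documents", points=batch, wait=False)`, with no manual final-batch handling.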
Qdrant Cloud vs Self-Hosted
Production comparison:
# Qdrant Cloud (managed)
from qdrant_client import QdrantClient

cloud_client = QdrantClient(
    url="https://your-cluster-name.qdrant.io",
    api_key="YOUR_API_KEY",
)

# Self-hosted (Docker)
self_hosted_client = QdrantClient(
    url="http://qdrant-server:6333",
    prefer_grpc=True,  # Use gRPC (port 6334) for lower latency
)

# Self-hosted with authentication enabled
auth_client = QdrantClient(
    url="http://qdrant-server:6333",
    api_key="YOUR_API_KEY",
)
Qdrant Cloud: Managed backups, scaling, monitoring. ~$0.50/GB/month.
Self-hosted: Full control, lower cost at scale (>1TB). DevOps overhead.
Choose Cloud for <100GB or low DevOps bandwidth. Self-hosted for >1TB or compliance requirements.
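Using the article's illustrative ~$0.50/GB/month figure (actual Qdrant Cloud pricing varies by tier and region; treat this as a sketch, not a quote), a rough cost estimate is simple arithmetic:

```python
def cloud_monthly_cost(data_gb: float, usd_per_gb: float = 0.50) -> float:
    """Illustrative managed-storage cost; real pricing depends on tier/region."""
    return data_gb * usd_per_gb

for gb in (100, 1024, 10_240):
    print(f"{gb} GB -> ${cloud_monthly_cost(gb):,.0f}/month")
```

Compare the result against your self-hosting bill (instances, storage, and the engineering time for upgrades and on-call) to find your own break-even point.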
Backup and Restore
Critical for production:
def backup_collection(
    client: QdrantClient,
    collection_name: str,
):
    """Export collection as a snapshot"""
    snapshot = client.create_snapshot(
        collection_name=collection_name,
        wait=True,
    )
    print(f"Snapshot created: {snapshot}")
    # For Qdrant Cloud, download via the snapshots API
    # For self-hosted, snapshots land under storage/snapshots/

def restore_collection(
    client: QdrantClient,
    collection_name: str,
    snapshot_location: str,
):
    """Restore a collection from a snapshot URL or local file URI"""
    recovered = client.recover_snapshot(
        collection_name=collection_name,
        location=snapshot_location,  # e.g. "file:///qdrant/snapshots/..." or an https URL
        wait=True,
    )
    print(f"Recovery result: {recovered}")

# Automated backup (e.g. daily at 2 AM via cron or a cloud scheduler)
backup_collection(client, "documents")
Store snapshots in S3 or GCS. Test restore procedures quarterly.
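Daily snapshots accumulate, so a retention policy is worth automating alongside the backup job. A hypothetical helper that decides which snapshots to delete, assuming date-stamped names like "backup-2024-03-17" (the naming scheme is an assumption, not a Qdrant convention):

```python
def snapshots_to_delete(names: list, keep: int = 7) -> list:
    """Keep the `keep` most recent date-stamped snapshots, return the rest.
    Assumes ISO-dated names, which sort chronologically as strings."""
    ordered = sorted(names, reverse=True)  # newest first
    return ordered[keep:]

names = [f"backup-2024-03-{day:02d}" for day in range(1, 11)]
print(snapshots_to_delete(names, keep=7))
# -> ['backup-2024-03-03', 'backup-2024-03-02', 'backup-2024-03-01']
```

Feed the returned names to your S3/GCS deletion call (or Qdrant's delete-snapshot endpoint) after each successful backup.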
Monitoring Qdrant Metrics
Expose Prometheus metrics for observability:
# Qdrant exposes Prometheus metrics on the REST port: http://<host>:6333/metrics
import requests
import json

def get_qdrant_metrics():
    """Fetch Prometheus metrics in text format"""
    response = requests.get("http://localhost:6333/metrics")
    return response.text

# Signals to monitor (exact metric names vary by Qdrant version;
# check your /metrics output):
# - vector count per collection
# - payload storage size
# - query volume (REST/gRPC request counters)
# - query latency
# - upsert latency

# Set up alerts on:
# - search latency above your SLO (e.g. p99 > 1s): performance degradation
# - vector counts approaching disk/RAM capacity
# - error rates spiking

def get_collection_size(collection_name: str):
    """Get collection metadata"""
    info = client.get_collection(collection_name)
    return {
        "points": info.points_count,
        "vectors": info.vectors_count,
        "indexed_vectors": info.indexed_vectors_count,
    }

size = get_collection_size("documents")
print(f"Collection size: {json.dumps(size, indent=2)}")
Set up Grafana dashboards to track collection growth and query performance.
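The /metrics endpoint returns the standard Prometheus text format, so extracting a single value for ad-hoc checks takes only a few lines. A minimal parser sketch (ignores labels and HELP/TYPE comments; the metric names in the sample are illustrative):

```python
def parse_metric(metrics_text: str, name: str):
    """Return the first value for a metric in Prometheus text format, or None."""
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        parts = line.split()
        if len(parts) == 2 and parts[0].split("{")[0] == name:
            return float(parts[1])
    return None

sample = """# TYPE collections_total gauge
collections_total 3
rest_responses_total{method="GET"} 1042
"""
print(parse_metric(sample, "collections_total"))     # 3.0
print(parse_metric(sample, "rest_responses_total"))  # 1042.0
```

For production use, let Prometheus scrape the endpoint directly; a parser like this is only for quick scripts and smoke tests.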
Checklist
- Define vector dimensions and distance metric (Cosine/Euclidean/Dot)
- Decide on_disk vs in-memory based on scale
- Index payload fields for filtering (KEYWORD, INTEGER, FLOAT)
- Benchmark quantization trade-offs (scalar vs binary vs none)
- Implement batch upsert pipeline (5K-10K batch size)
- Set up daily snapshots to S3/GCS
- Configure Prometheus metrics and Grafana dashboard
- Test filter performance on production payloads
- Document collection schema and migration procedures
- Plan capacity for 12-month growth
Conclusion
Qdrant's production-ready architecture supports billions of vectors with sophisticated filtering and quantization. Start with a clear vector config and indexed payloads. Quantization unlocks cost-efficiency at scale. Batched operations maximize ingestion throughput. Cloud vs self-hosted comes down to scale and DevOps capacity. Monitor metrics rigorously. Master these patterns and you'll scale efficiently.