- Published on
The Grafana LGTM Stack — Logs, Metrics, Traces, and Profiles in One Platform
- Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
Observability means understanding system behavior through logs, metrics, traces, and profiles. Historically, these signals lived in separate silos: Prometheus for metrics, ELK for logs, Jaeger for traces, and custom profiling tools. The Grafana LGTM stack (Loki for logs, Grafana for visualization, Tempo for traces, and Mimir for Prometheus-compatible metrics) converges these signals into a single platform. With correlation between logs and traces, dashboards that pull from all sources, and unified alerting, incident resolution accelerates dramatically. This post covers Prometheus metrics with recording rules, Loki log aggregation, Tempo distributed tracing, building dashboards as code, and cost optimization.
- Prometheus Metrics and Recording Rules
- Loki Log Aggregation with LogQL
- Tempo Distributed Tracing
- Grafana Dashboards as Code (JSON)
- Alert Routing with Grafana OnCall
- Correlation Between Logs, Metrics, and Traces
- Cost Optimization
- Checklist
- Conclusion
Prometheus Metrics and Recording Rules
Prometheus scrapes metrics from instrumented applications. Recording rules pre-compute expensive queries and reduce query load.
Prometheus configuration (prometheus.yml):
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: us-east-1
    environment: prod

rule_files:
  - 'recording_rules.yml'
  - 'alerting_rules.yml'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
  - job_name: 'prometheus'
    static_configs:
      - targets:
          - localhost:9090
Recording rules (recording_rules.yml):
groups:
  - name: api_server
    interval: 30s
    rules:
      - record: api:requests:rate1m
        expr: rate(api_requests_total[1m])
      - record: api:requests:rate5m
        expr: rate(api_requests_total[5m])
      - record: api:latency:p95
        expr: histogram_quantile(0.95, rate(api_request_duration_seconds_bucket[5m]))
      - record: api:latency:p99
        expr: histogram_quantile(0.99, rate(api_request_duration_seconds_bucket[5m]))
      - record: api:error_rate:rate1m
        expr: rate(api_requests_total{status=~"5.."}[1m]) / rate(api_requests_total[1m])
  - name: database
    interval: 30s
    rules:
      - record: db:connections:active
        expr: pg_stat_activity_count{state="active"}
      - record: db:connections:idle
        expr: pg_stat_activity_count{state="idle"}
      - record: db:replication:lag_bytes
        expr: pg_wal_lsn_lag_bytes
      - record: db:cache_hit_ratio
        expr: sum(rate(pg_heap_blks_hit[5m])) / (sum(rate(pg_heap_blks_hit[5m])) + sum(rate(pg_heap_blks_read[5m])))
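The p95/p99 recording rules lean on histogram_quantile(), which locates the bucket containing the target rank and interpolates linearly inside it. A TypeScript sketch of that estimate, using hypothetical cumulative request-duration buckets:

```typescript
// Approximate a quantile from cumulative histogram buckets, the way
// histogram_quantile() does: find the bucket containing the target rank,
// then interpolate linearly between the bucket's bounds.
type Bucket = { le: number; count: number }; // cumulative count of samples <= le

function histogramQuantile(q: number, buckets: Bucket[]): number {
  const total = buckets[buckets.length - 1].count;
  const rank = q * total;
  let prevLe = 0;
  let prevCount = 0;
  for (const b of buckets) {
    if (b.count >= rank) {
      const bucketCount = b.count - prevCount;
      const frac = bucketCount === 0 ? 0 : (rank - prevCount) / bucketCount;
      return prevLe + (b.le - prevLe) * frac;
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return buckets[buckets.length - 1].le;
}

// Hypothetical buckets (seconds): 90 of 100 requests completed within 0.5s
const buckets: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1.0, count: 100 },
];
console.log(histogramQuantile(0.95, buckets)); // 0.75
```

Because the result is interpolated, its accuracy depends on bucket layout, which is why choosing sensible histogram buckets matters as much as the rule itself.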
Alerting rules (alerting_rules.yml):
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        expr: api:error_rate:rate1m > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
          description: "Error rate {{ $value | humanizePercentage }} exceeds 5% threshold"
      - alert: HighLatency
        expr: api:latency:p99 > 1.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.instance }}"
          description: "P99 latency {{ $value }}s exceeds 1s threshold"
      - alert: DatabaseReplicationLag
        expr: db:replication:lag_bytes > 1073741824  # 1GB
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Database replication lag exceeds 1GB"
Loki Log Aggregation with LogQL
Loki is a log aggregation system built to pair natively with Grafana. Unlike Elasticsearch, Loki indexes only labels, not log content, which makes it dramatically cheaper to run at scale.
Promtail configuration (promtail-config.yaml):
server:
  http_listen_port: 9080

# positions file is required so Promtail can resume where it left off
positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: kubernetes
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - production
            - staging
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_namespace_name]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: replace
        target_label: app
      - source_labels: [__meta_kubernetes_pod_container_name]
        action: replace
        target_label: container
  - job_name: syslog
    syslog:
      listen_address: 0.0.0.0:514
      labels:
        job: syslog
    relabel_configs:
      - source_labels: [__syslog_message_hostname]
        target_label: hostname
LogQL queries:
# Count logs per app
count_over_time({app="api"}[5m])

# Filter errors
{app="api"} |= "error" | json | status >= 500

# Parse JSON logs and extract fields
{app="api"} | json level="level", status="status" | status >= 500

# Calculate error rate (5xx responses divided by all responses)
sum(rate({app="api"} | json status="status" | status >= 500 [5m]))
  /
sum(rate({app="api"} | json status="status" [5m]))

# Extract a field with a regexp parser and filter on it
{app="worker"} |= "timeout" | regexp "request_id=(?P<request_id>[\\w-]+)" | request_id="abc123"
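Under the hood, Promtail (or any client) delivers these logs through Loki's push API, POST /loki/api/v1/push, as streams keyed by a label set with [nanosecond-timestamp, line] value pairs. A TypeScript sketch of building that payload; the labels and log line are made up for illustration:

```typescript
// Build a Loki push-API payload: one stream per label set, each entry a
// [nanosecond-epoch-timestamp-string, log-line] pair.
type LokiPayload = {
  streams: { stream: Record<string, string>; values: [string, string][] }[];
};

function buildLokiPayload(
  labels: Record<string, string>,
  lines: string[],
  nowMs: number = Date.now()
): LokiPayload {
  return {
    streams: [
      {
        stream: labels,
        // Loki wants nanosecond timestamps as strings; appending six zeros
        // converts a millisecond epoch exactly.
        values: lines.map((line): [string, string] => [`${nowMs}000000`, line]),
      },
    ],
  };
}

const payload = buildLokiPayload(
  { app: "api", namespace: "production" },
  [JSON.stringify({ level: "error", status: 502, msg: "upstream timeout" })]
);
// Would be sent with:
// fetch("http://loki:3100/loki/api/v1/push", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(payload),
// });
console.log(payload.streams[0].values.length); // 1
```

Note that every distinct label combination opens a new stream, which is the same reason Loki labels must stay low-cardinality.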
Tempo Distributed Tracing
Tempo captures end-to-end request flows. With Tempo in the LGTM stack, you can jump from a metric alert to logs to the trace of that request.
Tempo configuration (tempo.yaml):
server:
http_listen_port: 3200
grpc_listen_port: 4317
distributor:
rate_limit_enabled: true
rate_limit: 100000
rate_limit_bytes: 100000000
ingester:
lifecycler:
ring:
kvstore:
store: inmemory
replication_factor: 1
max_block_duration: 5m
storage:
trace:
backend: local
wal:
path: /var/tempo/wal
local:
path: /var/tempo/blocks
blocklist_poll: 5m
Instrumentation (OpenTelemetry in TypeScript):
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { Resource } from "@opentelemetry/resources";
import { SemanticResourceAttributes } from "@opentelemetry/semantic-conventions";

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: "api-server",
    [SemanticResourceAttributes.SERVICE_VERSION]: "1.2.3",
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: "production",
  }),
  traceExporter: new OTLPTraceExporter({
    url: "http://tempo:4318/v1/traces", // Tempo's OTLP/HTTP receiver
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
console.log("OpenTelemetry SDK started");

process.on("SIGTERM", () => {
  sdk.shutdown()
    .then(() => console.log("Tracing terminated"))
    .catch((err) => console.log("Error terminating tracing", err))
    .finally(() => process.exit(0));
});
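The auto-instrumentations propagate context between services via the W3C traceparent header (version-traceId-spanId-flags), which is how Tempo stitches spans from different services into one trace. A self-contained sketch of building and parsing that header; the IDs below are the examples from the W3C Trace Context spec:

```typescript
// W3C Trace Context: traceparent = "00-<32 hex traceId>-<16 hex spanId>-<2 hex flags>"
function buildTraceparent(traceId: string, spanId: string, sampled: boolean): string {
  return `00-${traceId}-${spanId}-${sampled ? "01" : "00"}`;
}

function parseTraceparent(
  header: string
): { traceId: string; spanId: string; sampled: boolean } | null {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  // Bit 0 of the flags byte is the "sampled" flag
  return { traceId: m[1], spanId: m[2], sampled: (parseInt(m[3], 16) & 1) === 1 };
}

const header = buildTraceparent(
  "4bf92f3577b34da6a3ce929d0e0e4736",
  "00f067aa0ba902b7",
  true
);
console.log(header); // 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
console.log(parseTraceparent(header)?.sampled); // true
```

Because the sampled flag travels in the header, a sampling decision made at the edge is honored by every downstream service.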
Grafana Dashboards as Code (JSON)
Store dashboards in Git. Use Grafonnet (Jsonnet DSL) or raw JSON.
Dashboard JSON:
{
  "dashboard": {
    "title": "API Server",
    "tags": ["api", "production"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(api_requests_total[5m])",
            "legendFormat": "{{ handler }} {{ method }}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "rate(api_requests_total{status=~\"5..\"}[5m]) / rate(api_requests_total[5m])",
            "legendFormat": "{{ handler }}"
          }
        ],
        "type": "graph",
        "alert": {
          "name": "HighErrorRate",
          "conditions": [
            {
              "operator": { "type": "gt" },
              "query": { "params": ["0.05"] },
              "reducer": { "params": [] },
              "type": "query"
            }
          ]
        }
      }
    ]
  }
}
Grafonnet (Jsonnet) example, using the classic grafonnet-lib builder API:

local grafana = import 'grafonnet/grafana.libsonnet';
local dashboard = grafana.dashboard;
local row = grafana.row;
local graphPanel = grafana.graphPanel;
local prometheus = grafana.prometheus;

dashboard.new(
  'API Server Dashboard',
  tags=['api', 'production'],
  timezone='browser',
)
.addRow(
  row.new(title='Request Metrics')
  .addPanel(
    graphPanel.new('Request Rate')
    .addTarget(prometheus.target(
      'rate(api_requests_total[5m])',
      legendFormat='{{ handler }} {{ method }}',
    ))
  )
  .addPanel(
    graphPanel.new('Error Rate')
    .addTarget(prometheus.target(
      'rate(api_requests_total{status=~"5.."}[5m]) / rate(api_requests_total[5m])',
      legendFormat='{{ handler }}',
    ))
  )
)
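Either form ends up as JSON that CI pushes through Grafana's HTTP API (POST /api/dashboards/db). A TypeScript sketch of preparing that request; the endpoint is real, but token handling and the dashboard body are placeholders:

```typescript
// Wrap a dashboard definition for Grafana's POST /api/dashboards/db endpoint.
// Setting id to null plus overwrite: true makes repeated CI deploys idempotent.
function prepareDashboardRequest(
  dashboard: Record<string, unknown>,
  token: string
): { body: string; headers: Record<string, string> } {
  return {
    body: JSON.stringify({
      dashboard: { ...dashboard, id: null },
      overwrite: true,
      message: "Updated via CI/CD",
    }),
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${token}`, // service-account token injected by the pipeline
    },
  };
}

const req = prepareDashboardRequest(
  { title: "API Server", tags: ["api", "production"] },
  "placeholder-token"
);
// fetch("http://grafana:3000/api/dashboards/db", { method: "POST", ...req })
console.log(JSON.parse(req.body).overwrite); // true
```

Running this from CI on every merge keeps Git as the single source of truth and makes manual dashboard edits visible as drift.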
Alert Routing with Grafana OnCall
Route alerts based on rules, escalate to on-call responders, and notify via Slack, PagerDuty, or email.
OnCall integration:
- Create escalation policy
- Create on-call schedule
- Route alerts to on-call
- Notify via Slack/webhook
Example alert routing:
groups:
  - name: critical_alerts
    rules:
      - alert: DatabaseDown
        expr: up{job="postgres"} == 0
        for: 2m
        labels:
          severity: critical
          team: database
        annotations:
          summary: "PostgreSQL is down"
          description: "PostgreSQL on {{ $labels.instance }} is unreachable"
Configure a notification channel (contact point) in Grafana that forwards to OnCall; OnCall then escalates according to the on-call schedule and acknowledgement state.
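A webhook receiver on the other end sees Alertmanager's standard payload: a top-level status plus a list of alerts carrying labels and annotations. A minimal TypeScript routing sketch; the team-to-channel map is hypothetical:

```typescript
// Shape of Alertmanager's webhook payload, trimmed to the fields used here.
type WebhookAlert = {
  status: "firing" | "resolved";
  labels: Record<string, string>;
  annotations: Record<string, string>;
};
type WebhookPayload = { status: string; alerts: WebhookAlert[] };

// Hypothetical routing table: team label -> notification channel
const routes: Record<string, string> = {
  database: "#db-oncall",
  api: "#api-oncall",
};

function routeAlerts(payload: WebhookPayload): { channel: string; summary: string }[] {
  return payload.alerts
    .filter((a) => a.status === "firing") // ignore resolved alerts here
    .map((a) => ({
      channel: routes[a.labels.team] ?? "#oncall-default",
      summary: a.annotations.summary ?? a.labels.alertname,
    }));
}

const routed = routeAlerts({
  status: "firing",
  alerts: [
    {
      status: "firing",
      labels: { alertname: "DatabaseDown", team: "database", severity: "critical" },
      annotations: { summary: "PostgreSQL is down" },
    },
  ],
});
console.log(routed[0].channel); // #db-oncall
```

Routing on a `team` label like this is why the alerting rules above attach `team: database` in the first place.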
Correlation Between Logs, Metrics, and Traces
Grafana's unified querying correlates signals, letting you jump from metrics to logs to traces without leaving Grafana.
Example workflow:
- Dashboard shows error rate spike
- Click on spike → view logs for that time range
- See error messages in logs
- Click on a log entry → view the trace for that request
- Trace shows bottleneck: slow database query
- Jump to database metrics: high query latency
Trace linking in dashboard:
{
  "fieldConfig": {
    "defaults": {
      "links": [
        {
          "targetBlank": true,
          "title": "View Trace",
          "url": "d/trace-detail?var-trace_id=${__data.fields.trace_id}"
        }
      ]
    }
  }
}
Cost Optimization
Observability at scale is expensive. Optimize carefully.
Metrics retention (Prometheus itself keeps a single resolution; the lower-resolution tiers come from recording rules or a long-term store such as Mimir):
- High resolution (15s scrape interval): 30 days
- Lower resolution (1m, via recording rules): 1 year
- Aggregates (hourly): 5 years
# Prometheus retention is set via command-line flags, not prometheus.yml
prometheus \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=100GB
Log sampling:
- ERROR logs: keep 100%
- WARN logs: keep 100%
- INFO logs: keep 10% (drop 90%)
- DEBUG logs: keep 1% (drop 99%)
# Loki has no server-side sampling config; sample in the Promtail pipeline
# instead (the sampling stage requires a recent Promtail; rate = fraction kept)
pipeline_stages:
  - match:
      selector: '{app="api"} |= "\"level\":\"debug\""'
      stages:
        - sampling:
            rate: 0.01  # keep 1% of DEBUG logs
  - match:
      selector: '{app="api"} |= "\"level\":\"info\""'
      stages:
        - sampling:
            rate: 0.1   # keep 10% of INFO logs
Trace sampling:
- Errors: 100%
- Long requests (>1s): 50%
- Normal requests: 5%
import { TraceIdRatioBasedSampler } from "@opentelemetry/sdk-trace-base";

const sampler = new TraceIdRatioBasedSampler(0.05); // head-sample 5% of traces
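Note that the error and long-request tiers above cannot be decided when a trace starts, so they need tail-based sampling in a collector (for example the OpenTelemetry Collector's tail_sampling processor); the ratio sampler covers only the head-based 5% tier. Its core idea is a deterministic decision from the trace ID, so every service in a request agrees. A sketch of that idea (the real OpenTelemetry implementation uses a different hashing scheme over more bits):

```typescript
// Deterministic ratio sampling: interpret the first 8 hex chars of the
// trace ID as an unsigned 32-bit number and keep the trace if it falls
// below ratio * 2^32. Every service computes the same answer for the
// same trace ID, so traces are kept or dropped whole.
function shouldSample(traceId: string, ratio: number): boolean {
  const bucket = parseInt(traceId.slice(0, 8), 16); // 0 .. 2^32 - 1
  return bucket < ratio * 0x100000000;
}

console.log(shouldSample("00000000aaaaaaaaaaaaaaaaaaaaaaaa", 0.05)); // true  (bucket 0)
console.log(shouldSample("ffffffffaaaaaaaaaaaaaaaaaaaaaaaa", 0.05)); // false (bucket near max)
```

Determinism is the whole point: a random coin flip per service would produce broken partial traces.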
Cardinality control: Avoid high-cardinality labels. Bad:
api_requests_total{user_id="123456"}
Good:
api_requests_total{service="api", handler="orders"}
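A metric's worst-case series count is the product of the distinct values of each of its labels, which is why a single user_id label can overwhelm Prometheus while handler and status labels stay cheap. A quick back-of-envelope helper (the label counts are illustrative):

```typescript
// Upper bound on time series for one metric: the product of the number of
// distinct values each label can take.
function seriesUpperBound(labelValueCounts: number[]): number {
  return labelValueCounts.reduce((acc, n) => acc * n, 1);
}

// 10 services x 50 handlers x 5 status classes = 2,500 series: manageable.
console.log(seriesUpperBound([10, 50, 5])); // 2500
// Add a user_id label with 1M distinct values and it explodes:
console.log(seriesUpperBound([10, 50, 5, 1_000_000])); // 2500000000
```

Prometheus's `prometheus_tsdb_head_series` metric tracks the live series count, so you can alert on cardinality growth before it becomes an outage.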
Checklist
- Prometheus configured with appropriate scrape intervals per job
- Recording rules pre-compute expensive queries (p95, p99, error rates)
- Alerting rules define critical thresholds with runbook URLs
- Loki configured with appropriate label selectors (not log content)
- Promtail scrapes all relevant log sources and adds labels
- Tempo ingests traces from all services via OpenTelemetry
- Dashboards stored as JSON in Git; updated via CI/CD
- Grafana alerts routed through OnCall with escalation policies
- Links between metrics, logs, and traces configured
- Retention policies set per signal type (30d metrics, 7d logs, 3d traces)
- Sampling rules optimize cost without losing critical signals
- Cardinality monitoring enabled; high-cardinality labels blocked
- Runbooks linked from all production alerts
Conclusion
The Grafana LGTM stack converges observability signals, enabling faster incident resolution. Prometheus with recording rules pre-computes complex aggregations. Loki provides affordable log aggregation. Tempo captures distributed traces. Grafana correlates these signals and powers alerting. Store dashboards in Git, automate their deployment, and leverage unified querying to jump from metrics to logs to traces. With proper retention policies, sampling strategies, and cardinality controls, you can observe production systems at massive scale without cost spiraling out of control.