Correlation¶
Correlation is the process of connecting related telemetry data across traces, metrics, and logs. When telemetry is properly correlated, you can seamlessly navigate from a metric anomaly to the specific traces and logs that explain what happened.
Why Correlation Matters¶
Without correlation, troubleshooting looks like this:
- Alert fires for high error rate
- Search logs for errors around that time
- Guess which requests might be related
- Manually piece together the story
With correlation:
- Alert fires for high error rate
- Click through to see exactly which traces had errors
- View the logs for those specific requests
- Understand the root cause in minutes
How Correlation Works¶
Trace Context Propagation¶
Every request gets a unique trace ID that follows it through your system:
graph LR
A[Request] -->|trace-id: abc123| B[Service A]
B -->|trace-id: abc123| C[Service B]
C -->|trace-id: abc123| D[Database]
B -->|trace-id: abc123| E[Service C]
All telemetry generated during that request includes the same trace ID, making it easy to find related data.
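As a rough illustration of how the trace ID travels between services, here is a minimal sketch that assumes the W3C Trace Context `traceparent` header layout (version, trace ID, parent span ID, flags). The helper names `new_traceparent`, `child_traceparent`, and `trace_id_of` are hypothetical; in a real service an OpenTelemetry propagator normally handles this for you.

```python
import re
import secrets

# W3C Trace Context header: version-traceid-parentid-flags, lowercase hex.
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace: random 16-byte trace ID and 8-byte span ID."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(incoming: str) -> str:
    """Header a service sends downstream: same trace ID, new span ID."""
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        # Unparseable header: start a fresh trace instead of guessing.
        return new_traceparent()
    _version, trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def trace_id_of(header: str) -> str:
    """Extract the trace ID field from a traceparent header."""
    return header.split("-")[1]


# The request enters Service A, which calls Service B and Service C.
at_service_a = new_traceparent()
to_service_b = child_traceparent(at_service_a)
to_service_c = child_traceparent(at_service_a)

same_trace = (
    trace_id_of(at_service_a)
    == trace_id_of(to_service_b)
    == trace_id_of(to_service_c)
)
print(same_trace)
```

Each hop gets its own span ID, while the trace ID and the sampling flag are carried forward unchanged.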
Correlation Identifiers¶
| Identifier | Scope | Purpose |
|---|---|---|
| Trace ID | Entire request | Links all operations for a single request |
| Span ID | Single operation | Identifies a specific operation within a trace |
| Parent Span ID | Operation relationship | Shows which operation called which |
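To show how the three identifiers fit together, the sketch below rebuilds the call tree from the diagram using only `parent_span_id`. The `Span` class and the short IDs (`s1`, `s2`, ...) are illustrative assumptions, not an IAPM or OpenTelemetry data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: str                  # shared by every span in the request
    span_id: str                   # unique per operation
    parent_span_id: Optional[str]  # None marks the root span
    name: str


def render_tree(spans: list[Span]) -> list[str]:
    """Return indented span names, ordered by parent/child relationship."""
    children: dict[Optional[str], list[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_span_id, []).append(span)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in children.get(parent_id, []):
            lines.append("  " * depth + span.name)
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# Illustrative IDs only; real trace and span IDs are longer hex values.
spans = [
    Span("abc123", "s1", None, "Service A"),
    Span("abc123", "s2", "s1", "Service B"),
    Span("abc123", "s3", "s2", "Database"),
    Span("abc123", "s4", "s1", "Service C"),
]
print("\n".join(render_tree(spans)))
```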
Linking Telemetry Types¶
IAPM automatically correlates:
| Link | What is linked | How |
|---|---|---|
| Trace → Logs | Logs emitted during a span | Trace ID and Span ID in log context |
| Trace → Metrics | Metrics with trace exemplars | Exemplar links to representative traces |
| Logs → Trace | Log entries to their request | Trace ID extracted from log attributes |
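Conceptually, each of these links is a join on the trace ID. The sketch below follows a metric exemplar to the logs of the trace it points at. The record shapes and field names are assumptions made for illustration and do not reflect IAPM's storage schema.

```python
from collections import defaultdict

# Hypothetical records; the field names are illustrative only.
logs = [
    {"trace_id": "abc123", "span_id": "s3", "message": "Query timeout after 2000ms"},
    {"trace_id": "abc123", "span_id": "s1", "message": "Request received"},
    {"trace_id": "def456", "span_id": "s9", "message": "Request received"},
]
latency_metric = {
    "name": "request_latency_ms",
    "exemplars": [{"value": 2500, "trace_id": "abc123", "span_id": "s1"}],
}


def index_logs_by_trace(records):
    """Logs -> Trace: group log records by the trace ID they carry."""
    by_trace = defaultdict(list)
    for record in records:
        by_trace[record["trace_id"]].append(record)
    return by_trace


def logs_for_exemplars(metric, by_trace):
    """Metric -> Trace -> Logs: follow each exemplar to its trace's logs."""
    messages = []
    for exemplar in metric["exemplars"]:
        for record in by_trace.get(exemplar["trace_id"], []):
            messages.append(record["message"])
    return messages


by_trace = index_logs_by_trace(logs)
print(logs_for_exemplars(latency_metric, by_trace))
```

Filtering additionally on `span_id` narrows the result to the logs emitted during one specific operation.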
Correlation in Practice¶
Scenario: Debugging a Slow Request¶
1. Dashboard shows P99 latency spike
└─ Click metric to see exemplar traces
2. View trace with 2.5s duration
└─ See span breakdown: Auth(50ms) → API(100ms) → DB(2300ms)
3. Click slow database span
└─ View correlated logs: "Query timeout after 2000ms"
4. Root cause identified: Missing index on customer_id column
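Step 2 of this scenario amounts to finding the span that dominates the trace. A small sketch using the durations quoted above, assuming the three spans run one after another without overlap (the function name `slowest_span` is hypothetical):

```python
# Span durations from the scenario, in milliseconds.
spans = [("Auth", 50), ("API", 100), ("DB", 2300)]


def slowest_span(spans):
    """Return (name, duration_ms, share_of_total) for the longest span."""
    total = sum(duration for _, duration in spans)
    name, duration = max(spans, key=lambda span: span[1])
    return name, duration, duration / total


name, duration, share = slowest_span(spans)
print(f"{name}: {duration}ms ({share:.0%} of span time)")
```

Here the database span accounts for 2300 of 2450 ms, which is why it is the one worth opening for its correlated logs.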
Scenario: Investigating an Error¶
1. Error rate alert fires
└─ View affected traces
2. Trace shows failure in Payment Service
└─ Span has error flag and exception details
3. View correlated logs
└─ Full stack trace and request context
4. Root cause identified: Third-party API timeout
Enabling Correlation¶
Automatic Correlation¶
OpenTelemetry SDKs automatically:
- Generate and propagate trace context
- Inject trace IDs into logs (with proper configuration)
- Attach trace exemplars to metrics (where exemplar support is enabled)
Manual Correlation¶
For custom telemetry, include correlation context:
// Log inside an active span; no trace arguments are passed explicitly
logger.LogInformation("Processing order {OrderId}", orderId);
// When log correlation is configured, the SDK attaches trace_id and span_id to this record
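If your logging pipeline does not add trace context for you, the mechanism can be sketched by hand. The example below uses Python's standard `logging` module for illustration; the `current_trace` variable, `TraceContextFilter`, and `JsonFormatter` are hypothetical names, and an OpenTelemetry SDK would normally track the active span instead of a hand-rolled context variable.

```python
import contextvars
import io
import json
import logging

# Hypothetical holder for the active (trace_id, span_id) pair.
current_trace = contextvars.ContextVar("current_trace", default=None)


class TraceContextFilter(logging.Filter):
    """Copy the active trace_id and span_id onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        trace_id, span_id = current_trace.get() or ("", "")
        record.trace_id = trace_id
        record.span_id = span_id
        return True  # never drop the record, only enrich it


class JsonFormatter(logging.Formatter):
    """Emit structured logs so trace_id can be queried as a field."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
            "span_id": record.span_id,
        })


# In-memory stream so the output can be inspected; a service would
# typically write to stdout or a log shipper instead.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(TraceContextFilter())
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

current_trace.set(("abc123", "s2"))
logger.info("Processing order %s", 42)
print(stream.getvalue().strip())
```

The call site stays free of trace arguments; the filter enriches every record that passes through the handler.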
Best Practices¶
Do¶
- Configure log correlation - Ensure your logging framework includes trace context
- Use structured logging - Makes correlation queries more effective
- Propagate context - Pass trace headers to all downstream services
- Add business context - Include relevant IDs (user, order, session) in spans
Avoid¶
- Breaking context propagation - Don't lose the trace context across async operations, background threads, or message queues
- Over-relying on timestamps - Time-based correlation is fragile; use trace IDs
- Ignoring sampling - Make sure every service honors the same sampling decision for a trace, so logs and exemplars don't point to traces that were dropped
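Context loss in background work is the most common of these pitfalls. The sketch below shows one way to carry the caller's context onto a worker thread with Python's `contextvars`. The `current_trace_id` variable and `submit_with_context` helper are hypothetical stand-ins for what an instrumentation library would provide.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Hypothetical holder for the active trace ID.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


def handle_payment() -> str:
    # Runs on a worker thread and reads whatever context is active there.
    return f"trace_id={current_trace_id.get()}"


def submit_with_context(pool, fn, *args):
    """Capture the caller's context and run fn inside it on the worker."""
    ctx = contextvars.copy_context()
    return pool.submit(ctx.run, fn, *args)


current_trace_id.set("abc123")
with ThreadPoolExecutor(max_workers=1) as pool:
    # A plain pool.submit(handle_payment) may run without the caller's
    # context, depending on the runtime; copying it explicitly avoids that.
    result = submit_with_context(pool, handle_payment).result()

print(result)  # trace_id=abc123
```

The same principle applies to message queues: the context has to be serialized into the message and restored by the consumer, since nothing carries it across the process boundary implicitly.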
IAPM Correlation Features¶
IAPM enhances correlation with:
- Automatic log-trace linking - View logs inline with trace spans
- Metric exemplars - Jump from metrics to representative traces
- Service maps - Visualize how services correlate in your architecture
- Cross-service queries - Search across all telemetry types simultaneously
Next Steps¶
- Learn about Observability - The three pillars that get correlated
- Understand Instrumentation - How to emit correlated telemetry
- Explore OpenTelemetry - The framework powering correlation