Specification Template — Example (Platform)
Title
Distributed tracing across services A/B/C using OpenTelemetry
Summary
Lack of end‑to‑end request visibility slows incident resolution and obscures latency regressions. Introduce distributed tracing across services A/B/C with consistent trace/segment IDs and correlation into logs and metrics.
Background
Currently, services log request IDs inconsistently and emit partial metrics. Cross‑service latency is inferred. P1 incidents require manual log stitching.
Problem statement
We cannot reliably follow a request across services or attribute latency to a specific hop. MTTR remains high; regressions ship unnoticed.
Desired outcome
≥ 95% of inbound requests produce a complete trace spanning services A/B/C with service/hop timings and error annotations. On‑call can pivot from an incident to the slow hop within 2 minutes.
Scope
In scope
- HTTP/gRPC ingress for A/B/C
- Asynchronous messaging between B→C
- Trace context propagation and sampling policy
Out of scope
- Legacy admin service D
- Mobile client instrumentation (tracked separately)
Users / stakeholders affected
- Platform/Infra: owns tracing stack and sampling policy
- Service owners A/B/C: adopt instrumentation
- Support/On‑call: use traces during incidents
Constraints
- Overhead ≤ 3% CPU and ≤ 5% latency on P95
- No PII in traces; follow existing data retention policy
- Self‑hosted collector; no new vendor contracts this quarter
Requirements
Functional requirements
- Inject/extract W3C trace context across HTTP/gRPC and messaging
- Emit spans with standard attributes (service, http.method, status, db.statement redacted)
- Correlate trace_id with structured logs
Non-functional requirements
- Trace coverage ≥ 95% for A/B/C ingress within 30 days
- Sampling: 10% baseline with dynamic upsampling on 5xx bursts
- Storage retention: 14 days
Acceptance criteria
- Synthetic request across A→B→C yields a single trace with three service spans and correct timing within ±10ms tolerant error.
- Incident drill: pager on 5xx burst; on‑call locates slow hop and error cause via trace within 2 minutes.
- Overhead measured in staging ≤ 3% CPU/≤ 5% P95.
Proposed approach
- Adopt OpenTelemetry SDKs/auto‑instrumentation for A/B/C (language‑appropriate).
- Deploy OTel Collector with pipelines → export to existing metrics/logs backends.
- Implement middleware for context propagation in messaging.
- Add logging interceptor to append trace_id/span_id to structured logs.
Alternatives considered
Alternative 1
Custom request IDs only. Rejected: lacks timings, spans, and ecosystem support.
Alternative 2
Buy managed tracing now. Rejected this quarter due to procurement timing; revisit next Q.
Dependencies
- Existing metrics/logs backends capacity
- Sidecar/base image updates for services
Risks
- Sampling too low to capture rare failures
- Excess cardinality from labels
Mitigations
- Dynamic sampling rules for error conditions; targeted always‑on sampling for critical routes
- Attribute allow‑list; drop high‑cardinality fields
Validation plan
- Staging load: confirm coverage, overhead, and span correctness
- Fault injection: induce B 5xx and verify drill success
- Dashboards: trace coverage, collector queue depth, span export failures
Rollout / release notes
- Week 1: A service; Week 2: B; Week 3: C; Week 4: sampling tuning
- Canary 10% traffic per service; expand when overhead within thresholds
- Rollback: disable exporter; keep context propagation
Operational considerations
- Ownership: Platform team (collector), service owners (SDK updates)
- Monitoring: coverage %, export latency, collector CPU/mem
- Alerting: coverage drop > 10pp, collector queue backpressure
- Runbooks: collector restart, sampling change, SDK version rollback
Decisions already made
- Use W3C Trace Context and OTel semantic conventions
- Export to existing backends (no new vendor)
Decisions still needed
- Whether to enable DB/sql instrumentation in C (privacy review)
Completion criteria
- Coverage ≥ 95% for A/B/C ingress for 7 consecutive days; incident drill passed
- Runbooks and ownership acknowledged by teams
References
- RFC: Tracing adoption plan (link)
- OTel semantic conventions (link)
- Logging guide with trace correlation (link)