Specification Template — Example (Product)

Title

Task update notifications delivered within 60 seconds (p99)

Summary

Customers report that task update notifications arrive 15–45 minutes late. We will deliver them within 60 seconds at the 99th percentile while keeping the infrastructure cost increase at or below 5%.

Background

Notifications are currently batched on a 15‑minute schedule to reduce cost. This design trades timeliness for batch efficiency and no longer meets user expectations for near‑real‑time updates.

Problem statement

Notification latency is too high (15–45 minutes). Users miss timely updates and must refresh manually to see changes.

Desired outcome

End‑to‑end notification delivery (event → user receipt) is ≤ 60s at p99 with no duplicate notifications and ≤ 5% infra cost increase.

Scope

In scope

  • Task update events (create, assign, status change, comment)
  • In‑app bell, email, and push channels

Out of scope

  • Weekly digests and marketing emails
  • SLA guarantees for third‑party push providers beyond current contracts

Users / stakeholders affected

  • End users: receive timely task updates
  • Internal teams: product (priorities), platform (eventing), support (incident playbooks)
  • Operational owners: on‑call for notifications service

Constraints

  • Cost increase ≤ 5% monthly for notification infrastructure
  • Maintain opt‑out preferences and rate limits
  • No PII expansion; comply with existing privacy posture

Requirements

Functional requirements

  • Emit a notification for each task update event and route it to the user’s active channels according to their preferences.
  • De‑dupe per event/user across retries.
  • Apply back‑pressure when a downstream channel provider’s latency spikes; do not drop events.
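
One way to read the back‑pressure requirement is a bounded buffer per channel: when the buffer stays full, the consumer stops accepting work instead of discarding it. The sketch below is illustrative only. `ChannelBuffer`, `offer`, `take`, and `max_pending` are hypothetical names, and it assumes the event bus redelivers events that are left unacknowledged.

```python
import queue


class ChannelBuffer:
    """Bounded hand-off between the event consumer and one channel's workers."""

    def __init__(self, max_pending: int):
        # A full queue is the back-pressure signal; the size is a tuning assumption.
        self._pending: queue.Queue = queue.Queue(maxsize=max_pending)

    def offer(self, event: dict, timeout_s: float) -> bool:
        """Try to enqueue; return False (never drop) if the buffer stays full."""
        try:
            self._pending.put(event, timeout=timeout_s)
            return True
        except queue.Full:
            # Assumption: the caller leaves the event unacknowledged so the
            # event bus redelivers it later.
            return False

    def take(self) -> dict:
        """Called by a channel worker; frees one slot in the buffer."""
        return self._pending.get_nowait()
```

A consumer would acknowledge an event upstream only after `offer` returns True, so a slow provider slows intake rather than losing notifications.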

Non-functional requirements

  • p99 end‑to‑end delivery ≤ 60s measured externally
  • Availability ≥ 99.9% for notification API
  • Idempotent processing, so that each user sees a given notification exactly once even when events are retried

Acceptance criteria

  • For a synthetic load of 1000 task update events spread across a 10‑minute window, end‑to‑end delivery is ≤ 60s at p99, with ≤ 5 failed deliveries.
  • No duplicate notifications for the same event/user across 3 retry scenarios.
  • Cost dashboard shows ≤ 5% month‑over‑month increase for notification stack.
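
The latency criterion can be checked mechanically once the load test has recorded per‑event latencies. The following is a minimal sketch, assuming the nearest‑rank percentile definition and assuming that undelivered events are counted as failures and excluded from the latency sample; the spec does not fix either choice, and the function names are hypothetical.

```python
def percentile_nearest_rank(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def meets_latency_criterion(
    latencies_s: list[float],
    failures: int,
    target_s: float = 60.0,
    max_failures: int = 5,
) -> bool:
    """True if p99 latency is within target and failures stay within the allowance."""
    if not latencies_s or failures > max_failures:
        return False
    return percentile_nearest_rank(latencies_s, 99) <= target_s
```

With 1000 delivered events, this definition tolerates at most 10 deliveries slower than 60s.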

Proposed approach

  • Replace 15‑minute batching with event‑driven processing using existing event bus.
  • Introduce per‑channel worker pools with retry and jitter; de‑duplication by event_id+user_id key.
  • Add “notification-delivered” ack to measure end‑to‑end latency.
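
De‑duplication on the event_id+user_id key could look like the sketch below. It is an in‑memory stand‑in, not the service's implementation: `DedupeStore`, `claim`, and the TTL are hypothetical, and the `channel` component of the key is an added assumption so that in‑app, email, and push can each deliver once for the same event. A production version would need an atomic set‑if‑absent in a store shared by all workers.

```python
import time
from typing import Optional


class DedupeStore:
    """Remembers which (event, user, channel) sends were already claimed."""

    def __init__(self, ttl_s: float = 3600.0):
        # TTL bounds memory; it should exceed the longest retry horizon (assumption).
        self._ttl_s = ttl_s
        self._expires_at: dict[tuple[str, str, str], float] = {}

    def claim(
        self,
        event_id: str,
        user_id: str,
        channel: str,
        now: Optional[float] = None,
    ) -> bool:
        """Return True only for the first claim of a key within the TTL."""
        now = time.monotonic() if now is None else now
        key = (event_id, user_id, channel)
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False  # a retry of something already claimed
        self._expires_at[key] = now + self._ttl_s
        return True
```

A worker would call `claim` before sending and skip the send when it returns False, which keeps retries from producing a second notification.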

Alternatives considered

Alternative 1

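Shorten the batch interval to 1 minute. Rejected: batching still imposes a latency floor of up to one interval, which can consume the entire 60s budget, and sends stay aligned to interval boundaries, producing bursty load.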

Alternative 2

Adopt a third‑party real‑time vendor. Rejected to stay within the cost cap and to avoid a new vendor dependency mid‑quarter.

Dependencies

  • Event bus SLA; schema for task.update events
  • Channel providers (email/push) rate limits and current contracts

Risks

  • Downstream provider throttling increases latency
  • Cost spike from bursty traffic

Mitigations

  • Adaptive retry with back‑off; circuit breaker per channel
  • Dynamic worker pool sizing with caps; cost guardrails alerting
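
The retry and circuit‑breaker mitigations can be sketched as follows. This assumes "full jitter" exponential back‑off and a simple consecutive‑failure breaker; the spec names neither variant, and every name and default here (`backoff_delay`, `CircuitBreaker`, the 30s cap, the threshold of 5) is hypothetical.

```python
import random
from typing import Callable, Optional


def backoff_delay(
    attempt: int,
    base_s: float = 0.5,
    cap_s: float = 30.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Full-jitter back-off: a random delay in [0, min(cap_s, base_s * 2**attempt))."""
    return rng() * min(cap_s, base_s * (2 ** attempt))


class CircuitBreaker:
    """Per-channel breaker: opens after N consecutive failures, probes after a cool-down."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self, now: float) -> bool:
        """Closed: always allow. Open: allow again once the cool-down has elapsed."""
        if self.opened_at is None:
            return True
        return now - self.opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now: float) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now  # (re)open and restart the cool-down
```

The delay cap is kept well below 60s here so that a single retry does not by itself exceed the latency target; the right value depends on provider behavior.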

Validation plan

  • Synthetic load test during off‑peak; measure end‑to‑end p95/p99
  • Shadow traffic for 48 hours before cutover
  • Dashboards: latency histogram, duplicate rate, error rate, cost

Rollout / release notes

  • Phase 1: dark‑read and ack pipeline (no user‑visible change)
  • Phase 2: 10% traffic canary by org; expand to 100% over 24–48h
  • Rollback: switch routing back to the batch pipeline; preserve events in the dead‑letter queue (DLQ)
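
A canary "by org" needs a stable assignment so that an org does not flip between pipelines as the percentage grows. One possible sketch uses a hash bucket per org; `in_canary`, `choose_pipeline`, the salt, and the pipeline labels are hypothetical.

```python
import hashlib


def in_canary(org_id: str, percent: int, salt: str = "notif-realtime") -> bool:
    """Deterministically place an org in one of 100 buckets; the lowest `percent` buckets are canary."""
    digest = hashlib.sha256(f"{salt}:{org_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


def choose_pipeline(org_id: str, percent: int) -> str:
    """Route canary orgs to the new path and everyone else to the existing batch path."""
    return "event_driven" if in_canary(org_id, percent) else "batch"
```

Raising `percent` from 10 to 100 only adds orgs, and setting it to 0 sends every org back to the batch pipeline, which matches the rollback step.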

Operational considerations

  • Ownership: Notifications team
  • Monitoring: latency, duplicate rate, provider 4xx/5xx, DLQ depth
  • Alerting: p99 > 60s for 15m, duplicate rate > 0.1%
  • Runbooks: provider outage, DLQ drain, cost spike
  • On‑call: pager policy updated
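
The two alert conditions can be expressed as small predicates. This sketch assumes p99 is computed once per minute and that "for 15m" means the last 15 consecutive one‑minute values all breach; the function names are hypothetical, and a real monitoring system would express this in its own rule language.

```python
def sustained_latency_breach(
    per_minute_p99_s: list[float],
    threshold_s: float = 60.0,
    minutes: int = 15,
) -> bool:
    """True if each of the most recent `minutes` one-minute p99 values exceeds the threshold."""
    if len(per_minute_p99_s) < minutes:
        return False  # not enough history to call it sustained
    return all(v > threshold_s for v in per_minute_p99_s[-minutes:])


def duplicate_rate_breach(
    duplicates: int,
    delivered: int,
    max_rate: float = 0.001,  # 0.1%
) -> bool:
    """True if duplicates exceed 0.1% of delivered notifications."""
    if delivered == 0:
        return False
    return duplicates / delivered > max_rate
```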

Decisions already made

  • Keep existing channels/providers this quarter
  • Measure user‑visible latency at the client receipt event

Decisions still needed

  • Whether to pre‑aggregate comment notifications within a 30s window (owner: PM)

Completion criteria

  • Latency SLO met for 7 consecutive days in production
  • Documentation and runbooks updated; owner acknowledged

References

  • Intent: Notification latency reduction (link)
  • Event schema: task.update v2 (link)
  • Dashboards: Notifications SLO (link)