SLOs & System Reliability
We publish our service level objectives so you know exactly what to expect. When we miss, here's what happens.
Our Service Level Objectives
These are the targets we hold ourselves to. Live component health is shown above and on the status page.
Manifest/API Availability
Target: 99.9%Percentage of successful responses (2xx) from manifest and API endpoints over a 30-day rolling window.
Decision Latency (p95)
Target: <200msTime from cue receipt to pod decision completion, measured at the 95th percentile.
Pod Deadline Hit Rate
Target: > 98%Percentage of ad pods that complete stitching before the playback deadline.
Invalid Manifest Rate
Target: < 0.01%Percentage of manifests that fail HLS/DASH validation or cause player parse errors.
Tracking Integrity
Target: 100%Exactly-once semantics: no duplicates, no missed events. Verified by reconciliation.
How We Measure
# SLI Calculation Pipeline ┌─────────────────────────────────────────────────────────────────┐ │ Real-Time Metrics Flow │ ├─────────────────────────────────────────────────────────────────┤ │ │ │ ┌─────────┐ ┌─────────────┐ ┌──────────────┐ │ │ │ Request │───▶│ Prometheus │───▶│ SLI Bucket │ │ │ │ Logs │ │ Scrape │ │ Aggregator │ │ │ └─────────┘ └─────────────┘ └──────────────┘ │ │ │ │ │ ▼ │ │ ┌──────────────────────────────┐ │ │ │ 30-Day Rolling Window │ │ │ │ ───────────────────── │ │ │ │ availability = good/total │ │ │ │ latency_p95 = histogram │ │ │ │ error_rate = errors/total │ │ │ └──────────────────────────────┘ │ │ │ │ │ ┌──────────────────────┴───────┐ │ │ ▼ ▼ │ │ ┌─────────────┐ ┌─────────────┐ │ │ │ Dashboard │ │ Alerts │ │ │ │ (Grafana) │ │ (PagerDuty) │ │ │ └─────────────┘ └─────────────┘ │ │ │ └─────────────────────────────────────────────────────────────────┘
What Happens When We Miss
Automated mitigations kick in before humans need to act. Here's the protection stack.
Safe Mode
When error rates exceed thresholds, the system automatically enters safe mode: reduced fanout, strict deadlines, and slate fallback for ad breaks.
Automatic Rollback
Configuration changes that cause SLI regression are automatically rolled back within 60 seconds of detection.
Partner Quarantine
Partners with repeated timeouts or errors are temporarily quarantined with exponential backoff cooldown.
Circuit Breaker
Per-partner circuit breakers prevent cascading failures. Trips after threshold breaches, recovers automatically.
Live Status Snapshot
Status temporarily unavailable
For real-time status and incident updates, visit our status page.
View Status PageTrust starts with transparency
Download our Security Pack for the full picture: architecture, threat model, compliance, and more.