Operational Guarantees & Failure Modes
How ApexMediation SSAI handles failures, timeouts, and edge cases. This page documents exact fallback behavior for Live vs VOD modes.
Core Principle: Never Black Screen
The system is designed to fail open and keep playback moving. When ad delivery fails, it falls back to slate (neutral filler) or passes content through where possible, avoiding black screens and player stalls.
SLA Metrics
| Metric | Target | Notes |
|---|---|---|
| Pod Decision Latency (p95) | < 200ms | From cue detection to filled pod |
| Manifest Generation (p99) | < 50ms | Stitched manifest response time |
| Session Availability | 99.9% | Active sessions recoverable after incident |
| Tracking Event Delivery | 99.95% | Exactly-once delivery with retry |
| Pod Deadline Miss Rate | < 5% | Percent of breaks requiring fallback |
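As a rough illustration of how two of these targets could be checked against recorded breaks, the sketch below computes a nearest-rank p95 and a deadline miss rate. The `BreakRecord` shape and the `checkSla` helper are hypothetical names introduced for this example, not part of the ApexMediation API.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical per-break record; field names are assumptions for this sketch.
interface BreakRecord {
  decisionLatencyMs: number; // cue detection to filled pod
  usedFallback: boolean;     // true if the pod deadline was missed
}

function checkSla(records: BreakRecord[]) {
  const p95LatencyMs = percentile(records.map(r => r.decisionLatencyMs), 95);
  const deadlineMissRate = records.filter(r => r.usedFallback).length / records.length;
  return {
    p95LatencyMs,
    deadlineMissRate,
    latencyOk: p95LatencyMs < 200,      // target from the table: < 200ms
    missRateOk: deadlineMissRate < 0.05 // target from the table: < 5%
  };
}
```

Nearest-rank is only one of several percentile definitions; a production metrics pipeline may interpolate or use histogram buckets instead.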
Failure Scenarios & Fallbacks
Partner Timeout
Demand partner doesn't respond within the bid deadline.
Live Mode
- Partner excluded from current pod
- If no ads available: serve slate
- Circuit breaker trips after 3 consecutive timeouts
- Auto-heal after 60s cooldown
VOD Mode
- Partner excluded, try next partner
- Longer timeout budget (500ms vs 200ms for Live)
- If no ads: skip break, content continues
- Retry on next session for same asset
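The per-mode behavior above can be sketched as a pure decision function. This is a simplified model, assuming partner responses have already been collected with their observed latencies; `PartnerResponse`, `PodDecision`, and `decidePod` are hypothetical names, and only the 200ms and 500ms budgets come from the text.

```typescript
type Mode = "live" | "vod";

// Timeout budgets quoted in the lists above.
const TIMEOUT_BUDGET_MS: Record<Mode, number> = { live: 200, vod: 500 };

// Hypothetical shape of a partner response as seen by the decisioning layer.
interface PartnerResponse {
  partner: string;
  latencyMs: number; // how long the partner took to answer
  ads: string[];     // creative IDs returned (empty if no fill)
}

type PodDecision =
  | { kind: "ads"; ads: string[]; excluded: string[] }
  | { kind: "slate"; excluded: string[] }
  | { kind: "skip"; excluded: string[] };

function decidePod(mode: Mode, responses: PartnerResponse[]): PodDecision {
  const budget = TIMEOUT_BUDGET_MS[mode];
  const excluded: string[] = [];
  const ads: string[] = [];
  for (const r of responses) {
    if (r.latencyMs > budget) excluded.push(r.partner); // timed out: drop from this pod
    else ads.push(...r.ads);
  }
  if (ads.length > 0) return { kind: "ads", ads, excluded };
  // No usable ads: Live serves slate, VOD skips the break.
  return mode === "live" ? { kind: "slate", excluded } : { kind: "skip", excluded };
}
```

With this model, a partner answering in 350ms is excluded in Live mode but still contributes ads in VOD mode, which is the practical effect of the longer VOD budget.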
Cue Drift (Late/Early Cue)
SCTE-35 cue arrives after expected position or with unexpected timing.
Live Mode
- Late cue (<2s): align to next segment boundary
- Very late cue (>2s): skip break, log warning
- Early cue: buffer until segment boundary
- Metrics track drift distribution
VOD Mode
- Cue positions fixed at manifest generation
- No runtime drift (pre-stitched)
- Original cue positions preserved in metadata
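The Live-mode cue rules can be sketched as a small classifier. This assumes fixed-duration segments starting at time 0 and treats a cue that is exactly 2s late as still alignable; both are assumptions for the example, and `handleLiveCue` is a hypothetical name.

```typescript
type CueAction =
  | { kind: "align"; insertAtSec: number }   // late but usable: snap to next boundary
  | { kind: "buffer"; insertAtSec: number }  // early: hold until the boundary
  | { kind: "skip"; reason: string };        // too late: drop the break

const LATE_CUE_LIMIT_SEC = 2; // threshold from the list above

// First segment boundary at or after tSec, assuming fixed-length segments from t = 0.
function nextBoundary(tSec: number, segmentSec: number): number {
  return Math.ceil(tSec / segmentSec) * segmentSec;
}

function handleLiveCue(expectedSec: number, arrivalSec: number, segmentSec: number): CueAction {
  const drift = arrivalSec - expectedSec; // positive = late, negative = early
  if (drift > LATE_CUE_LIMIT_SEC) {
    return { kind: "skip", reason: "cue arrived " + drift + "s late" };
  }
  if (drift >= 0) {
    return { kind: "align", insertAtSec: nextBoundary(arrivalSec, segmentSec) };
  }
  return { kind: "buffer", insertAtSec: nextBoundary(expectedSec, segmentSec) };
}
```

For example, with 6s segments a cue expected at 12s but arriving at 13s is aligned to the boundary at 18s, while one arriving at 15s is skipped.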
Missing Segments (404 Mid-Ad)
Ad segment returns 404 or times out during playback.
Live Mode
- Substitute slate segment at same duration
- Creative flagged, excluded from future pods
- Tracking event: error with reason
- Alert to ops if pattern detected
VOD Mode
- Manifest pre-validated before serving
- All segment URLs verified reachable
- Stale ads auto-expired from cache
- Player sees discontinuity if unavoidable
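The Live-mode substitution can be illustrated with a pass over the pod's segments. This is a minimal sketch: the segment shapes, the `SLATE_URI` value, and `replaceFailedSegments` are hypothetical, and tracking events and ops alerts are left out.

```typescript
interface AdSegment { uri: string; durationSec: number; creativeId: string }
interface OutSegment { uri: string; durationSec: number }

const SLATE_URI = "slate/segment.ts"; // hypothetical slate location

// Swap every failed ad segment for slate of the same duration and
// collect the creatives that should be excluded from future pods.
function replaceFailedSegments(pod: AdSegment[], failedUris: string[]) {
  const segments: OutSegment[] = [];
  const flaggedCreatives: string[] = [];
  for (const seg of pod) {
    if (failedUris.indexOf(seg.uri) === -1) {
      segments.push({ uri: seg.uri, durationSec: seg.durationSec });
      continue;
    }
    // Keep the duration so the break length and timeline stay unchanged.
    segments.push({ uri: SLATE_URI, durationSec: seg.durationSec });
    if (flaggedCreatives.indexOf(seg.creativeId) === -1) {
      flaggedCreatives.push(seg.creativeId);
    }
  }
  return { segments, flaggedCreatives };
}
```

Preserving the original duration is the important design point: the player's timeline does not shift, so the return to content happens where the original pod would have ended.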
Encoder/Origin Restart
Origin stream resets with new sequence numbers or timestamp discontinuity.
Live Mode
- Detect via media sequence jump
- Insert #EXT-X-DISCONTINUITY
- Abort in-flight pod if mid-break
- Session continues (no reconnect needed)
VOD Mode
- N/A (VOD origin is immutable)
- Asset version tracking prevents stale refs
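The Live-mode restart handling can be sketched by comparing two successive origin playlist snapshots. The heuristic here (media sequence moving backwards, or jumping past the previous window) is an assumption about what counts as a "jump"; `PlaylistSnapshot` and `stitchNewSegments` are hypothetical names, and `#EXTINF` lines are omitted for brevity.

```typescript
interface PlaylistSnapshot {
  mediaSequence: number; // value of #EXT-X-MEDIA-SEQUENCE
  segmentUris: string[];
}

// Assumed heuristic: the sequence went backwards, or skipped past everything seen so far.
function originRestarted(prev: PlaylistSnapshot, next: PlaylistSnapshot): boolean {
  const endOfPrevWindow = prev.mediaSequence + prev.segmentUris.length;
  return next.mediaSequence < prev.mediaSequence || next.mediaSequence > endOfPrevWindow;
}

interface StitchResult { lines: string[]; podAborted: boolean }

// Returns the playlist lines to append for this refresh.
function stitchNewSegments(prev: PlaylistSnapshot, next: PlaylistSnapshot, midBreak: boolean): StitchResult {
  const lines: string[] = [];
  let podAborted = false;
  if (originRestarted(prev, next)) {
    lines.push("#EXT-X-DISCONTINUITY"); // tell the player timestamps/sequence reset
    podAborted = midBreak;              // drop the in-flight pod if we were mid-break
    for (const uri of next.segmentUris) lines.push(uri);
  } else {
    // Normal refresh: append only segments not already in the previous window.
    const firstNew = prev.mediaSequence + prev.segmentUris.length - next.mediaSequence;
    for (const uri of next.segmentUris.slice(Math.max(firstNew, 0))) lines.push(uri);
  }
  return { lines, podAborted };
}
```

Because the discontinuity is signaled inside the same playlist, the viewer's session carries on without a reconnect.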
Safe Mode Activation
System detects elevated error rates or latency.
Thresholds (from SafeModeManager)
- Error rate threshold: 10% of requests
- Latency threshold: 200ms (safe mode triggers when latency exceeds this)
- Minimum samples: 100 before triggering
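A minimal sketch of how these thresholds might combine is shown below. It assumes safe mode triggers when either the error rate or the latency exceeds its threshold, and that the latency statistic is computed elsewhere; `WindowStats` and `shouldEnterSafeMode` are hypothetical names, not the actual `SafeModeManager` API.

```typescript
// Threshold values taken from the list above.
const SAFE_MODE_THRESHOLDS = {
  errorRate: 0.1,  // 10% of requests
  latencyMs: 200,
  minSamples: 100
};

// Hypothetical rolling-window statistics.
interface WindowStats {
  requests: number;
  errors: number;
  latencyMs: number; // precomputed latency statistic for the window (assumption)
}

function shouldEnterSafeMode(stats: WindowStats): boolean {
  // Too few samples: a handful of failures should not flip the whole system.
  if (stats.requests < SAFE_MODE_THRESHOLDS.minSamples) return false;
  const errorRate = stats.errors / stats.requests;
  return errorRate > SAFE_MODE_THRESHOLDS.errorRate
    || stats.latencyMs > SAFE_MODE_THRESHOLDS.latencyMs;
}
```

The minimum-sample guard matters most at low traffic, where a single failed request could otherwise push the error rate far above 10%.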
Fallback Ladder
When primary ad delivery fails, the system cascades through fallback options:
Try Alternate Partners
If primary partner times out, immediately try secondary/tertiary partners if time permits.
Serve House Ads
Publisher-provided house ads (promos, PSAs) configured as zero-cost fallback.
Serve Slate
Generic slate content (spinning logo, "We'll be right back") for the break duration.
Skip Break (VOD only)
For VOD, if all options exhausted, break is omitted and content continues seamlessly.
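The ladder above can be modeled as an ordered list of steps, where the first step that produces a fill wins. This is a sketch under assumptions: `BreakContext`, `Fill`, and `resolveBreak` are hypothetical, and the inputs are taken as already resolved (for example, alternate partners that answered in time).

```typescript
type Mode = "live" | "vod";

// Hypothetical inputs available when a break must be filled.
interface BreakContext {
  mode: Mode;
  alternatePartnerAds: string[]; // ads from secondary/tertiary partners that answered in time
  houseAds: string[];            // publisher-provided promos and PSAs
  slateUri: string | null;       // null if no slate is configured
}

interface Fill {
  source: "partner" | "house" | "slate" | "skip";
  ads: string[];
}

type Step = (ctx: BreakContext) => Fill | null;

// Rungs in the order described above; each returns null when it cannot fill.
const LADDER: Step[] = [
  ctx => (ctx.alternatePartnerAds.length > 0 ? { source: "partner", ads: ctx.alternatePartnerAds } : null),
  ctx => (ctx.houseAds.length > 0 ? { source: "house", ads: ctx.houseAds } : null),
  ctx => (ctx.slateUri !== null ? { source: "slate", ads: [ctx.slateUri] } : null),
  ctx => (ctx.mode === "vod" ? { source: "skip", ads: [] } : null)
];

function resolveBreak(ctx: BreakContext): Fill | null {
  for (const step of LADDER) {
    const fill = step(ctx);
    if (fill !== null) return fill;
  }
  return null; // Live with no slate configured: not covered by the ladder as described
}
```

Keeping the rungs in a single ordered array makes the priority explicit and easy to audit.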
Circuit Breakers
Per-partner circuit breakers prevent cascading failures:
```typescript
// From SSAICircuitBreaker.ts
export const CircuitBreakerConfig = {
  // Trip conditions
  errorRateThreshold: 0.5, // 50% error rate trips breaker
  latencyThresholdMs: 500, // p95 > 500ms trips breaker
  minRequestsForTrip: 10,  // Need 10 requests before evaluation

  // Recovery
  cooldownMs: 60000,       // 60s cooldown before retry
  halfOpenRequests: 3,     // 3 test requests in half-open state
  recoveryThreshold: 0.8,  // 80% success to fully recover
};
```
Breaker states:
- Closed: normal operation
- Open: partner bypassed
- Half-open: testing recovery
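To show how this configuration could drive the three states, here is a minimal state-machine sketch. It is not the contents of `SSAICircuitBreaker.ts`: the `PartnerBreaker` class is hypothetical, the latency trip condition is omitted, the clock is passed in explicitly, and half-open probing is not limited to three concurrent requests.

```typescript
// Values copied from the configuration above (latency condition not modeled here).
const CircuitBreakerConfig = {
  errorRateThreshold: 0.5,
  minRequestsForTrip: 10,
  cooldownMs: 60000,
  halfOpenRequests: 3,
  recoveryThreshold: 0.8
};

type BreakerState = "closed" | "open" | "half-open";

class PartnerBreaker {
  private state: BreakerState = "closed";
  private requests = 0;
  private errors = 0;
  private openedAt = 0;
  private probeResults: boolean[] = [];

  currentState(): BreakerState { return this.state; }

  // True if the partner may be called at nowMs; moves open -> half-open after cooldown.
  allowRequest(nowMs: number): boolean {
    if (this.state === "open" && nowMs - this.openedAt >= CircuitBreakerConfig.cooldownMs) {
      this.state = "half-open";
      this.probeResults = [];
    }
    return this.state !== "open";
  }

  // Record the outcome of one partner request.
  record(success: boolean, nowMs: number): void {
    if (this.state === "half-open") {
      this.probeResults.push(success);
      if (this.probeResults.length < CircuitBreakerConfig.halfOpenRequests) return;
      const okRate = this.probeResults.filter(ok => ok).length / this.probeResults.length;
      if (okRate >= CircuitBreakerConfig.recoveryThreshold) this.reset();
      else this.trip(nowMs);
      return;
    }
    if (this.state === "open") return; // bypassed partners produce no samples
    this.requests++;
    if (!success) this.errors++;
    const enoughData = this.requests >= CircuitBreakerConfig.minRequestsForTrip;
    if (enoughData && this.errors / this.requests >= CircuitBreakerConfig.errorRateThreshold) {
      this.trip(nowMs);
    }
  }

  private trip(nowMs: number): void {
    this.state = "open";
    this.openedAt = nowMs;
  }

  private reset(): void {
    this.state = "closed";
    this.requests = 0;
    this.errors = 0;
  }
}
```

One consequence of these numbers: with 3 probe requests and an 80% recovery threshold, all 3 probes must succeed, since 2 of 3 is only about 67%.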
Incident Response
During major incidents, the system provides:
Incident Bundle
Exportable evidence bundle with decision traces, manifest versions, partner timings, and error logs.
Auto-Rollback
If new config causes degradation, automatic rollback to last-known-good state.
Partner Quarantine
Problematic partners automatically isolated until manual review.
Real-time Alerting
PagerDuty/Slack integration for P1 incidents with runbook links.