Operational Guarantees & Failure Modes

How ApexMediation SSAI handles failures, timeouts, and edge cases. This page documents exact fallback behavior for Live vs VOD modes.

Core Principle: Never Black Screen

The system is designed to fail open and keep playback moving. When ad delivery fails, it falls back to slate (neutral filler) or passes content through where possible, avoiding black screens and player stalls.

SLA Metrics

Metric                     | Target  | Notes
Pod Decision Latency (p95) | < 200ms | From cue detection to filled pod
Manifest Generation (p99)  | < 50ms  | Stitched manifest response time
Session Availability       | 99.9%   | Active sessions recoverable after incident
Tracking Event Delivery    | 99.95%  | Exactly-once delivery with retry
Pod Deadline Miss Rate     | < 5%    | Percent of breaks requiring fallback

Failure Scenarios & Fallbacks

Partner Timeout

Demand partner doesn't respond within the bid deadline.

Live Mode

  • Partner excluded from current pod
  • If no ads available: serve slate
  • Circuit breaker trips after 3 consecutive timeouts
  • Auto-heal after 60s cooldown

VOD Mode

  • Partner excluded, try next partner
  • Longer timeout budget (500ms vs 200ms in Live)
  • If no ads: skip break, content continues
  • Retry on next session for same asset
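
The per-mode timeout handling above could be sketched roughly as follows. This is an illustration only, not the ApexMediation implementation: `BID_TIMEOUT_MS`, `noFillAction`, and `callWithDeadline` are hypothetical names, and the budgets are the 200ms and 500ms figures quoted in the lists.

```typescript
// Hypothetical sketch; names and structure are assumptions.
export type Mode = "live" | "vod";

// Timeout budgets quoted in the lists above.
export const BID_TIMEOUT_MS: Record<Mode, number> = { live: 200, vod: 500 };

// What to do when no partner returns an ad in time.
export function noFillAction(mode: Mode): "slate" | "skip-break" {
  return mode === "live" ? "slate" : "skip-break";
}

// Resolves with the partner's response, or with null if the deadline
// passes first or the request fails.
export function callWithDeadline<T>(
  request: Promise<T>,
  deadlineMs: number,
): Promise<T | null> {
  return new Promise<T | null>((resolve) => {
    const timer = setTimeout(() => resolve(null), deadlineMs);
    request.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      () => {
        clearTimeout(timer);
        resolve(null);
      },
    );
  });
}
```

A caller would treat a `null` result as "partner excluded from this pod" and move on to the next partner, or to `noFillAction(mode)` once no partners remain.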

Cue Drift (Late/Early Cue)

SCTE-35 cue arrives earlier or later than its expected position.

Live Mode

  • Late cue (<2s): align to next segment boundary
  • Very late cue (>2s): skip break, log warning
  • Early cue: buffer until segment boundary
  • Metrics track drift distribution

VOD Mode

  • Cue positions fixed at manifest generation
  • No runtime drift (pre-stitched)
  • Original cue positions preserved in metadata
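
The Live-mode drift rules can be read as a small decision function. The sketch below is hypothetical (`resolveLiveCue` and its parameters are not from the product). It assumes fixed-duration segments and treats a cue that is exactly 2s late as still usable, since the list does not say which side that boundary falls on.

```typescript
// Hypothetical sketch of the Live-mode cue drift rules; not product code.
export type CueDecision =
  | { kind: "insert"; atSec: number }
  | { kind: "skip"; reason: string };

const VERY_LATE_SEC = 2; // threshold from the list above

// First segment boundary at or after the given stream time.
function nextBoundary(timeSec: number, segmentSec: number): number {
  return Math.ceil(timeSec / segmentSec) * segmentSec;
}

// expectedSec: splice position signalled by the cue
// arrivalSec:  stream time at which the cue was actually seen
// segmentSec:  segment duration (assumed constant)
export function resolveLiveCue(
  expectedSec: number,
  arrivalSec: number,
  segmentSec: number,
): CueDecision {
  const latenessSec = arrivalSec - expectedSec;
  if (latenessSec > VERY_LATE_SEC) {
    // Very late: skip the break; the caller logs the warning.
    return { kind: "skip", reason: `cue arrived ${latenessSec}s late` };
  }
  if (latenessSec > 0) {
    // Late but usable: start the break at the next boundary after arrival.
    return { kind: "insert", atSec: nextBoundary(arrivalSec, segmentSec) };
  }
  // Early or on time: hold until the boundary at or after the signalled position.
  return { kind: "insert", atSec: nextBoundary(expectedSec, segmentSec) };
}
```

With 6s segments, a cue expected at 12s but seen at 13s would start the break at 18s, while one seen at 15s would be skipped.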

Missing Segments (404 Mid-Ad)

Ad segment returns 404 or times out during playback.

Live Mode

  • Substitute slate segment at same duration
  • Creative flagged, excluded from future pods
  • Tracking event: error with reason
  • Alert to ops if pattern detected

VOD Mode

  • Manifest pre-validated before serving
  • All segment URLs verified reachable
  • Stale ads auto-expired from cache
  • Player sees discontinuity if unavoidable
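
As an illustration of the Live-mode behavior, the sketch below swaps in a slate segment of the same duration when an ad segment fetch fails, flags the creative, and records an error event. `SegmentFallback` and its fields are hypothetical, and the URLs in the usage note are placeholders.

```typescript
// Hypothetical sketch of slate substitution on a failed ad segment.
export interface AdSegment {
  creativeId: string;
  url: string;
  durationSec: number;
}

export interface ServedSegment {
  url: string;
  durationSec: number;
}

export interface TrackingError {
  event: "error";
  creativeId: string;
  reason: string;
}

export class SegmentFallback {
  // Creatives to exclude from future pods.
  readonly flaggedCreatives = new Set<string>();
  // Error tracking events to deliver.
  readonly errors: TrackingError[] = [];
  private readonly slateUrl: string;

  constructor(slateUrl: string) {
    this.slateUrl = slateUrl;
  }

  // status is the HTTP status of the segment fetch, or "timeout".
  resolve(segment: AdSegment, status: number | "timeout"): ServedSegment {
    if (typeof status === "number" && status >= 200 && status < 300) {
      return { url: segment.url, durationSec: segment.durationSec };
    }
    this.flaggedCreatives.add(segment.creativeId);
    this.errors.push({
      event: "error",
      creativeId: segment.creativeId,
      reason: String(status),
    });
    // Same duration keeps the break length, and the timeline, unchanged.
    return { url: this.slateUrl, durationSec: segment.durationSec };
  }
}
```

For example, `new SegmentFallback("https://example.invalid/slate.ts").resolve(segment, 404)` returns the slate URL with the original segment's duration.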

Encoder/Origin Restart

Origin stream resets with new sequence numbers or timestamp discontinuity.

Live Mode

  • Detect via media sequence jump
  • Insert #EXT-X-DISCONTINUITY
  • Abort in-flight pod if mid-break
  • Session continues (no reconnect needed)

VOD Mode

  • N/A (VOD origin is immutable)
  • Asset version tracking prevents stale refs
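
One way to picture the Live-mode restart handling is a comparison of consecutive origin playlist snapshots. The heuristic here is an assumption, not the documented detection rule: a jump is flagged when the media sequence goes backwards or skips past the window the previous playlist covered. `PlaylistSnapshot`, `isSequenceJump`, and `onOriginRefresh` are hypothetical names.

```typescript
// Hypothetical sketch of restart detection from playlist snapshots.
export interface PlaylistSnapshot {
  mediaSequence: number; // value of EXT-X-MEDIA-SEQUENCE
  segmentCount: number; // segments listed in the playlist
}

// Assumed heuristic: the sequence went backwards (encoder reset) or jumped
// beyond the end of the previous window (gap in the origin stream).
export function isSequenceJump(
  prev: PlaylistSnapshot,
  next: PlaylistSnapshot,
): boolean {
  const wentBackwards = next.mediaSequence < prev.mediaSequence;
  const skippedAhead =
    next.mediaSequence > prev.mediaSequence + prev.segmentCount;
  return wentBackwards || skippedAhead;
}

// Returns the tags to emit before the next segment and whether the
// in-flight pod should be aborted.
export function onOriginRefresh(
  prev: PlaylistSnapshot,
  next: PlaylistSnapshot,
  midBreak: boolean,
): { tags: string[]; abortPod: boolean } {
  if (!isSequenceJump(prev, next)) {
    return { tags: [], abortPod: false };
  }
  return { tags: ["#EXT-X-DISCONTINUITY"], abortPod: midBreak };
}
```

The session itself is untouched in this sketch; only the stitched manifest gains a discontinuity marker, which matches the "no reconnect needed" behavior above.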

Safe Mode Activation

System detects elevated error rates or latency.

Thresholds (from SafeModeManager)

  • Error rate threshold: 10% of requests
  • Latency threshold: 200ms
  • Minimum samples: 100 before triggering

While safe mode is active:

  • Reduced fanout: Only healthy partners called
  • Reduced timeout: 200ms hard deadline
  • Slate fallback: Enabled by default
  • Config freeze: No config changes during incident
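
A rough sketch of how these thresholds might combine into a trigger decision is shown below. It is not the actual SafeModeManager code, whose API is not shown here. The field names are hypothetical, the latency threshold is read as a 200ms ceiling, and p95 is assumed as the latency statistic.

```typescript
// Hypothetical sketch of the safe mode trigger; not SafeModeManager itself.
export const SafeModeThresholds = {
  errorRate: 0.1, // 10% of requests
  latencyMs: 200, // assumed to be a ceiling on observed latency
  minSamples: 100, // no decision below this sample count
};

export interface WindowStats {
  requests: number;
  errors: number;
  p95LatencyMs: number; // assumption: p95 is the statistic compared
}

export function shouldEnterSafeMode(stats: WindowStats): boolean {
  // Too few samples: a handful of failures should not trigger safe mode.
  if (stats.requests < SafeModeThresholds.minSamples) {
    return false;
  }
  const errorRate = stats.errors / stats.requests;
  return (
    errorRate > SafeModeThresholds.errorRate ||
    stats.p95LatencyMs > SafeModeThresholds.latencyMs
  );
}
```

The minimum sample count matters most at low traffic: 5 errors out of 10 requests is a 50% error rate, but this sketch ignores it until 100 requests have been observed.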

Fallback Ladder

When primary ad delivery fails, the system cascades through fallback options:

  1. Try Alternate Partners

    If primary partner times out, immediately try secondary/tertiary partners if time permits.

  2. Serve House Ads

    Publisher-provided house ads (promos, PSAs) configured as zero-cost fallback.

  3. Serve Slate

    Generic slate content (spinning logo, "We'll be right back") for the break duration.

  4. Skip Break (VOD only)

    For VOD, if all options are exhausted, the break is omitted and content continues seamlessly.
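
The ladder can be modeled as an ordered list of rungs where the first rung that produces a result wins. The sketch below is a simplified, synchronous illustration with hypothetical names (`Rung`, `runLadder`); real rungs would involve asynchronous partner calls and a time budget.

```typescript
// Hypothetical sketch of the fallback ladder as an ordered list of rungs.
export interface Rung {
  name: string;
  vodOnly?: boolean; // rung applies only to VOD sessions
  attempt: () => string | null; // returns a fill, or null if unavailable
}

export function runLadder(
  mode: "live" | "vod",
  rungs: Rung[],
): { rung: string; result: string } | null {
  for (const rung of rungs) {
    if (rung.vodOnly && mode !== "vod") {
      continue; // e.g. Live never skips a break
    }
    const result = rung.attempt();
    if (result !== null) {
      return { rung: rung.name, result };
    }
  }
  return null; // every applicable rung failed
}

// Example ladder mirroring the four steps above, with stubbed outcomes:
// partners and house ads are unavailable, slate may or may not be.
export function exampleLadder(slate: string | null): Rung[] {
  return [
    { name: "alternate-partners", attempt: () => null },
    { name: "house-ads", attempt: () => null },
    { name: "slate", attempt: () => slate },
    { name: "skip-break", vodOnly: true, attempt: () => "skipped" },
  ];
}
```

With the stubs above, `runLadder("live", exampleLadder("slate.m3u8"))` stops at the slate rung, while `runLadder("vod", exampleLadder(null))` falls through to `skip-break`.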

Circuit Breakers

Per-partner circuit breakers prevent cascading failures:

// From SSAICircuitBreaker.ts
export const CircuitBreakerConfig = {
  // Trip conditions
  errorRateThreshold: 0.5,    // 50% error rate trips breaker
  latencyThresholdMs: 500,    // p95 > 500ms trips breaker
  minRequestsForTrip: 10,     // Need 10 requests before evaluation
  
  // Recovery
  cooldownMs: 60000,          // 60s cooldown before retry
  halfOpenRequests: 3,        // 3 test requests in half-open state
  recoveryThreshold: 0.8,     // 80% success to fully recover
};

Breaker states:

  • CLOSED: Normal operation
  • OPEN: Partner bypassed
  • HALF-OPEN: Testing recovery
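
To show how the three states and the config values could fit together, here is a minimal state machine sketch. It is not the code in SSAICircuitBreaker.ts: `PartnerBreaker` is a hypothetical class, the latency trip condition is left out, counts are cumulative rather than windowed, and the caller passes in the clock so the behavior is easy to follow.

```typescript
// Hypothetical sketch of a per-partner breaker; values copied from the
// CircuitBreakerConfig shown above (latency condition omitted).
const cfg = {
  errorRateThreshold: 0.5,
  minRequestsForTrip: 10,
  cooldownMs: 60000,
  halfOpenRequests: 3,
  recoveryThreshold: 0.8,
};

export type BreakerState = "CLOSED" | "OPEN" | "HALF-OPEN";

export class PartnerBreaker {
  state: BreakerState = "CLOSED";
  private successes = 0;
  private failures = 0;
  private openedAt = 0;

  // Should the partner be called at time nowMs?
  allowRequest(nowMs: number): boolean {
    if (this.state === "OPEN" && nowMs - this.openedAt >= cfg.cooldownMs) {
      this.transition("HALF-OPEN", nowMs); // cooldown elapsed, start testing
    }
    return this.state !== "OPEN";
  }

  // Record the outcome of a request that was allowed through.
  record(success: boolean, nowMs: number): void {
    if (this.state === "OPEN") {
      return; // partner is bypassed, nothing to count
    }
    if (success) {
      this.successes++;
    } else {
      this.failures++;
    }
    const total = this.successes + this.failures;
    if (this.state === "CLOSED") {
      const errorRate = this.failures / total;
      if (total >= cfg.minRequestsForTrip && errorRate >= cfg.errorRateThreshold) {
        this.transition("OPEN", nowMs);
      }
    } else if (total >= cfg.halfOpenRequests) {
      // HALF-OPEN: all test requests are in, decide whether to recover.
      const recovered = this.successes / total >= cfg.recoveryThreshold;
      this.transition(recovered ? "CLOSED" : "OPEN", nowMs);
    }
  }

  private transition(next: BreakerState, nowMs: number): void {
    this.state = next;
    this.successes = 0;
    this.failures = 0;
    if (next === "OPEN") {
      this.openedAt = nowMs;
    }
  }
}
```

Note that with 3 half-open test requests and an 80% recovery threshold, all 3 must succeed in this sketch, since 2 of 3 is only about 67%.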

Incident Response

During major incidents, the system provides:

Incident Bundle

Exportable evidence bundle with decision traces, manifest versions, partner timings, and error logs.

Auto-Rollback

If new config causes degradation, automatic rollback to last-known-good state.

Partner Quarantine

Problematic partners automatically isolated until manual review.

Real-time Alerting

PagerDuty/Slack integration for P1 incidents with runbook links.
