Prometheus Uptime Alerts for 24/7 Streams: Know When It Breaks

Published on 2026-01-11 | Updated on 2026-05-06

Monitor your 24/7 stream with simple uptime checks and alerts so you can restart quickly.


You cannot watch the stream all day. Alerts tell you when to act.


What to monitor
Simple alert rules
Restart playbooks
Common causes of outages
VPS hosting notes

1. What to monitor

Is the stream live? Is the process running?

2. Simple alert rules

Alert if the live status drops or the encoder process exits.

3. Restart playbooks

Have a quick restart plan ready.
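
A restart plan can be as small as a supervisor loop. The following is a minimal sketch in Python, not a definitive implementation: the names `backoff_delays` and `supervise`, the 2-second base delay, the 60-second cap, and the 300-second "healthy run" threshold are all assumptions chosen for illustration. It reruns a command whenever it exits and waits longer between attempts when the command keeps failing quickly.

```python
import subprocess
import time


def backoff_delays(base: float = 2.0, cap: float = 60.0):
    """Yield an endless series of delays that double each time, up to cap."""
    delay = base
    while True:
        yield delay
        delay = min(delay * 2, cap)


def supervise(cmd: list, max_runs: int = 10, healthy_after: float = 300.0) -> None:
    """Run cmd repeatedly, sleeping with exponential backoff between runs.

    cmd is whatever argument list starts your encoder (hypothetical here).
    """
    delays = backoff_delays()
    for _ in range(max_runs):
        started = time.monotonic()
        result = subprocess.run(cmd)  # blocks until the process exits
        if time.monotonic() - started >= healthy_after:
            # The run lasted long enough to count as healthy: reset the backoff.
            delays = backoff_delays()
        delay = next(delays)
        print(f"command exited with code {result.returncode}; restarting in {delay:.0f}s")
        time.sleep(delay)
```

Call `supervise()` with your own ffmpeg argument list. A process manager such as systemd can do the same job; the sketch only shows the logic a playbook should cover.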

4. Common causes of outages

The usual culprits are network issues, bitrate spikes, and memory leaks.

5. VPS hosting notes

Run on a Streaming VPS for stable upload.

Why Prometheus matters for a 24/7 stream operator

If your stream goes black at 3 AM, the only way to find out within minutes is automated probing plus paging. Prometheus, Alertmanager, and a probe (blackbox_exporter or a custom exporter) make up the standard self-hosted stack.

What you can detect cheaply:

HLS / DASH manifest goes stale (no new segments for N seconds).
ffmpeg process died.
RTMP push to YouTube/Twitch returning errors.
VPS itself unreachable.
CPU pegged at 100% (encoder dropping frames).
Disk approaching full (long-tail recording / DVR).
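
A stale manifest deserves its own check, because a dead encoder can leave the last playlist in place and a plain HTTP status probe will still pass. Below is a minimal sketch in Python of the staleness logic a custom exporter could use. It assumes a live HLS media playlist that carries an `#EXT-X-MEDIA-SEQUENCE` tag; `media_sequence` and `StalenessTracker` are hypothetical names, and fetching the playlist is left to the caller.

```python
import re
import time
from typing import Optional

# Matches the media-sequence tag of an HLS media playlist.
_SEQ_RE = re.compile(r"^#EXT-X-MEDIA-SEQUENCE:(\d+)\s*$", re.MULTILINE)


def media_sequence(playlist_text: str) -> Optional[int]:
    """Return the EXT-X-MEDIA-SEQUENCE value, or None if the tag is missing."""
    match = _SEQ_RE.search(playlist_text)
    return int(match.group(1)) if match else None


class StalenessTracker:
    """Tracks how long the media sequence has gone without advancing."""

    def __init__(self) -> None:
        self.last_seq: Optional[int] = None
        self.last_change: Optional[float] = None

    def observe(self, playlist_text: str, now: Optional[float] = None) -> float:
        """Record one fetch of the playlist; return seconds since the sequence last changed."""
        now = time.time() if now is None else now
        seq = media_sequence(playlist_text)
        # The first observation, or any change in sequence number, resets the clock.
        if self.last_change is None or (seq is not None and seq != self.last_seq):
            self.last_seq = seq
            self.last_change = now
        return now - self.last_change
```

Feed `observe()` the playlist body on every poll and export the returned age as a gauge; an alert can then fire when the age exceeds your chosen N seconds.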

Minimal Prometheus + blackbox setup

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: stream-health
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://yt.example.com/hls/live.m3u8
          - https://twitch.example.com/hls/live.m3u8
    relabel_configs:
      - source_labels: [__address__]
        target_label:  __param_target
      - source_labels: [__param_target]
        target_label:  instance
      - target_label:  __address__
        replacement:   blackbox:9115

  - job_name: node
    static_configs:
      - targets: ['stream-vps:9100']

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - /etc/prometheus/rules/*.yml
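
Once Prometheus is running, it helps to confirm that the probes are actually being scraped. The sketch below uses Python's standard library to run an instant query against the Prometheus HTTP API (`/api/v1/query`) and list the targets whose `probe_success` is 0. The base URL `http://localhost:9090` and the function names are assumptions; adjust them to your deployment.

```python
import json
import urllib.parse
import urllib.request


def failing_targets(response: dict) -> list:
    """Return the instance labels whose sample value is 0 in an instant-query response."""
    if response.get("status") != "success":
        raise ValueError("query failed: %r" % response.get("error"))
    failing = []
    for sample in response["data"]["result"]:
        _timestamp, value = sample["value"]  # the value arrives as a string
        if float(value) == 0.0:
            failing.append(sample["metric"].get("instance", "<unknown>"))
    return failing


def query_prometheus(expr: str, base_url: str = "http://localhost:9090") -> dict:
    """Run an instant query and return the decoded JSON response."""
    url = base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)
```

For example, `failing_targets(query_prometheus('probe_success{job="stream-health"}'))` should return an empty list while both manifests are reachable. An empty result can also mean the job is not being scraped at all, so check the Targets page as well.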

Useful alert rules

# rules/stream.yml
groups:
- name: stream
  rules:
  - alert: HLSManifestDown
    expr: probe_success{job="stream-health"} == 0
    for: 1m
    labels: { severity: critical }
    annotations:
      summary: "HLS manifest {{ $labels.instance }} not returning a 2xx response"

  - alert: StreamCPUSaturated
    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 95
    for: 5m
    labels: { severity: warning }
    annotations:
      summary: "Encoder CPU >95% for 5 minutes on {{ $labels.instance }}"

  - alert: StreamDiskFilling
    expr: (node_filesystem_free_bytes{mountpoint="/var/recordings"} / node_filesystem_size_bytes{mountpoint="/var/recordings"}) < 0.10
    for: 10m
    labels: { severity: warning }

  - alert: FFmpegProcessMissing
    # absent() returns a series without an instance label, so the summary cannot name the host.
    expr: absent(namedprocess_namegroup_num_procs{groupname="ffmpeg"} > 0)
    for: 2m
    labels: { severity: critical }
    annotations:
      summary: "No running ffmpeg process reported by process-exporter"

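The FFmpegProcessMissing alert relies on the namedprocess_namegroup_num_procs metric, which comes from process-exporter rather than node_exporter. Configure process-exporter with a group named ffmpeg and add a scrape job for it (it listens on port 9256 by default).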

Alertmanager routing to your phone

receivers:
  - name: pager
    discord_configs:
      - webhook_url: https://discord.com/api/webhooks/...
    telegram_configs:
      - bot_token: 123:abc
        chat_id: -1001234567890

route:
  receiver: pager
  group_by: [alertname]
  group_wait: 30s
  repeat_interval: 1h

Don't use email alone; phone notifications cut median response time by ~10 minutes in our tests.
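
Before relying on this routing, it is worth pushing a test alert through it to confirm that your phone actually buzzes. The sketch below posts a synthetic alert to Alertmanager's v2 API (`POST /api/v2/alerts`) using only the Python standard library. The alert name `PagerSelfTest`, the function names, and the base URL `http://alertmanager:9093` (taken from the Prometheus config above) are assumptions for illustration.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(now: datetime, minutes: int = 5) -> list:
    """Build a one-element alert list; now must be timezone-aware."""
    return [{
        "labels": {"alertname": "PagerSelfTest", "severity": "critical"},
        "annotations": {"summary": "Test page, please ignore"},
        "startsAt": now.isoformat(),
        # The alert resolves on its own once endsAt has passed.
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def send_alerts(alerts: list, base_url: str = "http://alertmanager:9093") -> int:
    """POST the alerts to Alertmanager and return the HTTP status code."""
    req = urllib.request.Request(
        base_url + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

Running `send_alerts(build_test_alert(datetime.now(timezone.utc)))` should produce a notification after the configured `group_wait`. If nothing arrives, the receiver credentials are the first thing to check.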

What goes wrong without good alerts

A stream sits black for 8 hours and you find out from a viewer DM.
A disk fills, ffmpeg starts dropping segments silently, and the recordings end up corrupt.
A YouTube auth token expires, the RTMP push is refused, and you don't notice until the next day.

The total setup is one VPS with ~1 GB RAM running Prometheus, blackbox_exporter, and Alertmanager. Cost: tiny. Value: enormous.
