Prometheus Uptime Alerts for 24/7 Streams: Know When It Breaks

Monitor your 24/7 stream with simple uptime checks and alerts so you can restart quickly

Written by Jochem, Infrastructure Expert, 5-10 years experience in game server hosting, VPS infrastructure, and 24/7 streaming solutions.

You cannot watch the stream all day. Alerts tell you when to act.

Table of Contents

  1. What to monitor
  2. Simple alert rules
  3. Restart playbooks
  4. Common causes of outages
  5. VPS hosting notes

1. What to monitor

Is the stream live? Is the process running?

2. Simple alert rules

Alert if live status drops or process exits.

3. Restart playbooks

Have a quick restart plan ready.

4. Common causes of outages

Network issues, bitrate spikes, memory leaks.

5. VPS hosting notes

Run on a Streaming VPS for stable upload.

Why Prometheus matters for a 24/7 stream operator

If your stream goes black at 3 AM, the only way to find out within minutes is automated probing + paging. Prometheus + Alertmanager + a probe (blackbox-exporter or a custom exporter) is the standard self-hosted stack.

What you can detect cheaply:

  • HLS / DASH manifest goes stale (no new segments for N seconds).
  • ffmpeg process died.
  • RTMP push to YouTube/Twitch returning errors.
  • VPS itself unreachable.
  • CPU pegged at 100% (encoder dropping frames).
  • Disk approaching full (long-tail recording / DVR).

Minimal Prometheus + blackbox setup

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: stream-health
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://yt.example.com/hls/live.m3u8
          - https://twitch.example.com/hls/live.m3u8
    relabel_configs:
      - source_labels: [__address__]
        target_label:  __param_target
      - source_labels: [__param_target]
        target_label:  instance
      - target_label:  __address__
        replacement:   blackbox:9115

  - job_name: node
    static_configs:
      - targets: ['stream-vps:9100']

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - /etc/prometheus/rules/*.yml
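
The stream-health job asks blackbox-exporter to run a module called http_2xx, which has to exist in the exporter's own config file. A minimal sketch of that file follows. The second module, hls_manifest, is a hypothetical name introduced here to show how the probe could also require the playlist header in the response body. The file name and the timeout values are assumptions, not tuned recommendations.

```yaml
# blackbox.yml (file name assumed; passed to blackbox-exporter via --config.file)
modules:
  # Plain check: any 2xx response counts as success.
  http_2xx:
    prober: http
    timeout: 5s

  # Hypothetical stricter module: the body must also look like an HLS playlist.
  hls_manifest:
    prober: http
    timeout: 5s
    http:
      valid_status_codes: [200]
      fail_if_body_not_matches_regexp:
        - "#EXTM3U"
```

To try the stricter check, set module: [hls_manifest] in the stream-health job.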

Useful alert rules

# rules/stream.yml
groups:
- name: stream
  rules:
  - alert: HLSManifestDown
    expr: probe_success{job="stream-health"} == 0
    for: 1m
    labels: { severity: critical }
    annotations:
      summary: "HLS manifest {{ $labels.instance }} failed its HTTP probe"

  - alert: StreamCPUSaturated
    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 95
    for: 5m
    labels: { severity: warning }
    annotations:
      summary: "Encoder CPU >95% for 5 minutes on {{ $labels.instance }}"

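  - alert: StreamDiskFilling
    # avail_bytes excludes blocks reserved for root, so it reflects space a non-root process can use
    expr: (node_filesystem_avail_bytes{mountpoint="/var/recordings"} / node_filesystem_size_bytes{mountpoint="/var/recordings"}) < 0.10
    for: 10m
    labels: { severity: warning }
    annotations:
      summary: "Less than 10% disk space left on /var/recordings ({{ $labels.instance }})"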

  - alert: FFmpegProcessMissing
    # absent() over a comparison returns a series without labels, so instance cannot be templated here
    expr: absent(namedprocess_namegroup_num_procs{groupname="ffmpeg"} > 0)
    for: 2m
    labels: { severity: critical }
    annotations:
      summary: "No running ffmpeg process reported by process-exporter"
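
Two items from the detection list are not covered by the rules above: an unreachable VPS and a stale manifest. A hedged sketch of both, written as a separate, hypothetical rules file. The first rule relies on the built-in up metric for the node job. The second assumes the origin server sends a Last-Modified header on the manifest, which blackbox-exporter exposes as probe_http_last_modified_timestamp_seconds. The 30-second threshold is an assumption; a few segment durations is a reasonable starting point.

```yaml
# rules/stream-extra.yml (hypothetical file name)
groups:
- name: stream-extra
  rules:
  - alert: StreamVPSUnreachable
    # Prometheus sets up to 0 when a scrape of the target fails
    expr: up{job="node"} == 0
    for: 2m
    labels: { severity: critical }
    annotations:
      summary: "node_exporter on {{ $labels.instance }} is not answering scrapes"

  - alert: HLSManifestStale
    # Only works if the server returns Last-Modified for the manifest (assumption)
    expr: time() - probe_http_last_modified_timestamp_seconds{job="stream-health"} > 30
    for: 1m
    labels: { severity: warning }
    annotations:
      summary: "HLS manifest {{ $labels.instance }} has not changed for over 30 seconds"
```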

The FFmpegProcessMissing alert needs process-exporter running on the stream VPS, plus a scrape job for it in prometheus.yml. node_exporter does not expose the namedprocess_* metrics.
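
A minimal sketch of the process-exporter side, assuming the exporter runs on the stream VPS on its default port 9256 and is reachable under the same stream-vps host name used by the node job. The job name process and the config file name are assumptions. Grouping by the command name yields groupname="ffmpeg", which is the label the alert expression matches on.

```yaml
# process-exporter.yml (file name assumed): track only ffmpeg, grouped by command name
process_names:
  - name: "{{.Comm}}"
    comm:
      - ffmpeg
---
# Addition to scrape_configs in prometheus.yml (job name is an assumption)
- job_name: process
  static_configs:
    - targets: ['stream-vps:9256']
```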

Alertmanager routing to your phone

# alertmanager.yml
# discord_configs needs Alertmanager 0.25+, telegram_configs 0.24+
receivers:
  - name: pager
    discord_configs:
      - webhook_url: https://discord.com/api/webhooks/...
    telegram_configs:
      - bot_token: 123:abc
        chat_id: -1001234567890

route:
  receiver: pager
  group_by: [alertname]
  group_wait: 30s
  repeat_interval: 1h
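
The alert rules in this article carry a severity label of critical or warning, but the route above sends everything to one receiver. One possible refinement, sketched under assumptions: keep pager for critical alerts and send warnings to a quieter, hypothetical receiver named chat. The receiver split and the 4h repeat interval are illustrative choices, not part of the original setup.

```yaml
# alertmanager.yml (sketch; receiver "chat" is hypothetical)
receivers:
  - name: pager
    telegram_configs:
      - bot_token: 123:abc
        chat_id: -1001234567890
  - name: chat
    discord_configs:
      - webhook_url: https://discord.com/api/webhooks/...

route:
  receiver: chat            # default for anything not matched below
  group_by: [alertname]
  group_wait: 30s
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"
      receiver: pager
      repeat_interval: 1h
```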

Don't use email alone; phone notifications cut median response time by ~10 minutes in our tests.

What goes wrong without good alerts

  • A stream sits black for 8 hours and you find out from a viewer DM.
  • A disk fills, ffmpeg starts dropping segments silently, recordings are corrupt.
  • A YouTube auth token expires, RTMP refuses, you don't notice until next day.

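The whole setup fits on one VPS with ~1 GB RAM running Prometheus, blackbox-exporter, and Alertmanager. Run it on a different host from the encoder, otherwise an outage of the stream VPS takes the alerts down with it. Cost: tiny. Value: enormous.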

About the Author

Jochem is an infrastructure expert in game server hosting, VPS infrastructure, and 24/7 streaming solutions, with 5-10 years of experience.

Since 2023
500+ servers hosted
4.8/5 avg rating

I specialize in Minecraft, FiveM, Rust, and 24/7 streaming infrastructure, operating enterprise-grade AMD Ryzen 9 hardware in Netherlands datacenters.
