Reliability

Reliability that an alerting platform has to earn

The whole point of an alerting tool is to survive the worst minutes of the year. Here is what WardenPoint does to be ready for them: channel redundancy, predictable retries, acknowledgement-aware chains and an audit log you can read.

  • Uptime target: 99.99%
  • P99 dispatch: ~142ms
  • Channels per step: 3+

An alert that survives a bad day

  1. +0s
    API ingest — Accepted; UUID assigned
  2. +1s
    Telegram — Voice message dispatched
  3. +10s
    Voice fallback — Telegram unreachable · PSTN call placed
  4. +14s
    Responder — DTMF 1 · chain cancelled

Four pillars

How we keep an alert alive

An alerting platform has to address four reliability concerns. Each pillar lists what is wired in code today, not what we hope to ship.

01 · Channel redundancy

Multiple channels per step

Each escalation step can fire more than one channel in parallel. If Telegram is down, the phone call still rings. If both fail, SMS is the configured fallback.

  • Per-step channel arrays — fire many in parallel
  • Fallback channel per channel — if Telegram fails, SMS takes over
  • Carrier rotation for voice and SMS — no single provider failure stops dispatch
  • Channel health tracked in the audit log per dispatch
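
To make the parallel fan-out and per-channel fallback concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not WardenPoint's implementation: `dispatch_step`, the sender callables and the result strings are all hypothetical names, and the senders are simulated rather than real integrations.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical sender type: returns True when the channel accepted the alert.
Sender = Callable[[str], bool]


def dispatch_step(
    message: str,
    channels: List[str],
    senders: Dict[str, Sender],
    fallbacks: Dict[str, str],
) -> Dict[str, str]:
    """Fire every channel of one step in parallel; on failure try its fallback once."""

    def send_with_fallback(channel: str) -> str:
        if senders[channel](message):
            return "delivered"
        fallback = fallbacks.get(channel)
        if fallback is not None and senders[fallback](message):
            return f"delivered_via_{fallback}"
        return "failed"

    with ThreadPoolExecutor(max_workers=max(1, len(channels))) as pool:
        results = list(pool.map(send_with_fallback, channels))
    return dict(zip(channels, results))


# Simulated senders (assumption): Telegram is down, the other channels work.
senders = {
    "telegram": lambda msg: False,
    "voice_call": lambda msg: True,
    "sms": lambda msg: True,
}
outcome = dispatch_step(
    "db-primary unreachable",
    channels=["telegram", "voice_call"],
    senders=senders,
    fallbacks={"telegram": "sms"},
)
print(outcome)
```

In this simulated run the phone call goes through directly while the Telegram attempt is rescued by its SMS fallback, which mirrors the behaviour described in the bullets above.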

02 · Retry policy

Predictable, bounded retries

We retry the same channel before moving on, but never indefinitely. Retry intervals are explicit; failures fall through to the next channel.

  • Configurable per-channel attempts (default 3 for voice, 2 for Telegram)
  • Exponential or linear backoff per channel
  • Idempotency by (api_key, idempotency_key) — duplicate dispatches collapse
  • Hard cap per chain so a runaway alert cannot page forever
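
The idempotency rule can be sketched as a lookup keyed on the (api_key, idempotency_key) pair. The Python below is a toy in-memory model for illustration only; the `Dispatcher` class and its return shape are assumptions, and a production system would presumably use a shared store with an expiry rather than a dictionary.

```python
import uuid
from typing import Dict, Tuple


class Dispatcher:
    """Toy in-memory dispatcher (hypothetical); not the real WardenPoint API."""

    def __init__(self) -> None:
        # Maps (api_key, idempotency_key) to the UUID of the first dispatch.
        self._seen: Dict[Tuple[str, str], str] = {}

    def dispatch(self, api_key: str, idempotency_key: str, payload: dict) -> Tuple[str, bool]:
        """Return (notification_uuid, created); duplicates reuse the first UUID."""
        key = (api_key, idempotency_key)
        if key in self._seen:
            return self._seen[key], False
        notification_uuid = str(uuid.uuid4())
        self._seen[key] = notification_uuid
        return notification_uuid, True


d = Dispatcher()
first = d.dispatch("key_A", "deploy-42", {"text": "disk full"})
retry = d.dispatch("key_A", "deploy-42", {"text": "disk full"})
other = d.dispatch("key_B", "deploy-42", {"text": "disk full"})
```

Here `retry` collapses onto `first` because both parts of the key match, while `other` creates a new dispatch because the API key differs.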

03 · Acknowledgement-aware

The chain stops when someone owns it

Acknowledgement from any channel that received the alert cancels the rest of the chain. The audit log records who acked and when.

  • Ack from Telegram button, SMS reply, voice DTMF, email click or dashboard tap
  • Resolved hook from monitoring source cancels the chain automatically
  • Re-dispatch handled cleanly when the responder hands off mid-incident
  • No lingering retries after ack — guaranteed by the queue cancellation step
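
One way to picture the queue cancellation step is a filter over pending jobs. This Python sketch assumes a simple in-memory queue; `PendingJob`, `ChainQueue` and `acknowledge` are hypothetical names chosen for the example, not WardenPoint internals.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PendingJob:
    notification_uuid: str
    channel: str
    due_in_seconds: int


@dataclass
class ChainQueue:
    jobs: List[PendingJob] = field(default_factory=list)

    def acknowledge(self, notification_uuid: str) -> int:
        """Drop every queued retry or escalation for the acked alert; return the count."""
        before = len(self.jobs)
        self.jobs = [j for j in self.jobs if j.notification_uuid != notification_uuid]
        return before - len(self.jobs)


queue = ChainQueue([
    PendingJob("alert-1", "voice_call", 30),
    PendingJob("alert-1", "sms", 90),
    PendingJob("alert-2", "telegram", 10),
])
cancelled = queue.acknowledge("alert-1")
print(cancelled, [j.notification_uuid for j in queue.jobs])
```

After the ack, both pending jobs for `alert-1` are gone and the unrelated `alert-2` job is untouched.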

04 · Audit trail

Structured, queryable, exportable

Every dispatch, ack, escalation and resolve writes a JSON line. The shape is stable; old fields stay; new fields are added without breaking parsers.

  • JSON Lines audit log per company
  • Stable schema with notification_uuid, channel, status, actor, ip, request_id
  • CSV export per recipient group for SLA reviews
  • Linked to application logs via request_id
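
Because the log is JSON Lines, reconstructing the path of one alert needs only a line-by-line parse. The Python sketch below uses made-up sample lines: only the field names come from the schema listed above, while the values, the status strings and the extra `new_field` key are assumptions for illustration.

```python
import json
from typing import List

# Made-up sample lines; only the field names come from the documented schema.
SAMPLE_LOG = """\
{"notification_uuid": "n-1", "channel": "telegram", "status": "failed", "actor": "system", "ip": "203.0.113.7", "request_id": "req-9"}
{"notification_uuid": "n-1", "channel": "voice_call", "status": "delivered", "actor": "system", "ip": "203.0.113.7", "request_id": "req-9"}
{"notification_uuid": "n-2", "channel": "sms", "status": "delivered", "actor": "system", "ip": "203.0.113.8", "request_id": "req-10", "new_field": 1}
"""


def path_for(log_text: str, notification_uuid: str) -> List[str]:
    """Return the channel:status steps recorded for one notification, in log order."""
    steps = []
    for line in log_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between entries
        entry = json.loads(line)
        if entry.get("notification_uuid") == notification_uuid:
            steps.append(f'{entry["channel"]}:{entry["status"]}')
    return steps


print(path_for(SAMPLE_LOG, "n-1"))
```

The parser reads only the keys it needs, so an entry that carries an additional field, like the third sample line, does not break it.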

Retry policy

What we retry, and when we stop

Retries should be predictable. For each channel you pick the number of attempts, the backoff type and the fallback. WardenPoint ships sensible defaults and lets paid plans override them.

  • Voice calls retry up to 3× with exponential backoff before falling back to Telegram voice
  • Telegram retries 2× with linear backoff, then falls back to SMS
  • SMS retries 2× with provider rotation; carrier 5xx fails over to the next provider
  • Email retries on transient SMTP 4xx; permanent 5xx hard-fails and the chain moves on
  • Every retry decision lands in the audit log so the path is reconstructible
config/escalation.php (PHP)

    # config/escalation.php: retry policy
    'retry' => [
        'voice_call' => [
            'attempts' => 3,
            'backoff' => 'exponential',
            'fallback_channel' => 'telegram_voice',
        ],
        'telegram_voice' => [
            'attempts' => 2,
            'backoff' => 'linear',
            'fallback_channel' => 'sms',
        ],
    ],
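
To show what the `attempts` and `backoff` settings could translate to in practice, here is a small Python sketch of a delay schedule. The 5-second base delay, the doubling factor and the reading that `attempts` counts the first try are all assumptions for illustration; the page does not state the real intervals.

```python
from typing import List


def retry_delays(attempts: int, backoff: str, base_seconds: float = 5.0) -> List[float]:
    """Waits between consecutive attempts; N attempts means N - 1 waits."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    if backoff == "exponential":
        # base, 2*base, 4*base, ...
        return [base_seconds * 2 ** i for i in range(attempts - 1)]
    if backoff == "linear":
        # base, 2*base, 3*base, ...
        return [base_seconds * (i + 1) for i in range(attempts - 1)]
    raise ValueError(f"unknown backoff: {backoff}")


print(retry_delays(3, "exponential"))  # [5.0, 10.0]
print(retry_delays(2, "linear"))       # [5.0]
print(retry_delays(4, "linear"))       # [5.0, 10.0, 15.0]
```

With the configured values, a voice call would wait twice between its three attempts and Telegram voice once between its two, after which the `fallback_channel` takes over.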

Honest numbers

Numbers we operate against

Uptime target
99.99%

Internal SLO across the public API and dispatcher stack. /status reports the actual rolling number.

P99 dispatch latency
~142ms

WardenPoint-side latency from ingest to channel hand-off. Carrier-side adds 2–10s for voice/SMS.

Channels per step
3+

Each step can fire any combination of the eight channels in parallel. No artificial cap.

Retry hard cap
Per-chain

A chain cannot exceed its configured total duration. We refuse to page forever — by design.
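
A total-duration cap can be sketched as truncating the planned attempts once the running clock passes the limit. The Python below is a hypothetical model: `plan_within_cap`, the wait values and the 60-second cap are invented for the example and are not WardenPoint defaults.

```python
from typing import List, Tuple


def plan_within_cap(
    steps: List[Tuple[str, float]], cap_seconds: float
) -> List[Tuple[str, float]]:
    """Keep the (channel, start_offset) attempts that begin within the chain's cap."""
    planned: List[Tuple[str, float]] = []
    elapsed = 0.0
    for channel, wait_before in steps:
        elapsed += wait_before
        if elapsed > cap_seconds:
            break  # every later attempt would also start past the cap
        planned.append((channel, elapsed))
    return planned


# (channel, seconds to wait before this attempt); illustrative values only.
steps = [
    ("voice_call", 0),
    ("voice_call", 5),
    ("voice_call", 10),
    ("telegram_voice", 20),
    ("sms", 60),
]
plan = plan_within_cap(steps, cap_seconds=60)
print(plan)
```

In this example the SMS attempt would start at 95 seconds, so it is dropped and the chain ends after the Telegram voice attempt at 35 seconds.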

Reliability FAQ

Common reliability questions

The uptime target is 99.99% for the public API ingest and the dispatcher tier. The status page reports the actual rolling figure with a 90-day timeline.

Free plan

Prove the reliability claims with your own test

Set up a recipient, disable Telegram on the responder's phone and watch the fallback fire. The audit log records every step of the path.

  • Free forever plan
  • Audit log per dispatch
  • Carrier rotation built in