FabrikNotifications

Digest and escalation

Batch noisy sources into hourly summaries with digest mode, and auto-route unread critical notifications to designated users with escalation rules.

Preferences let you silence what doesn't matter. Digest and escalation are the two features that handle the other end — what to do when there's too much signal, and what to do when nobody's looking.

Digest mode

Digest replaces immediate delivery with batched summaries. Instead of twelve individual notifications across an hour, you get one: "12 scheduled task events — 10 succeeded, 2 failed."

Turning it on

Two fields in notification preferences:

  • digest_enabled — default off.
  • digest_interval_minutes — default 60. How often the buffer is flushed.

With digest on, the standard create_notification() call doesn't store a notification — it writes to NotificationBuffer instead. A periodic task flushes the buffer into one summary notification per source.
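The buffering decision can be sketched in a few lines. This is a minimal stand-in, not the real implementation: `Prefs`, `NOTIFICATIONS`, and `NOTIFICATION_BUFFER` are illustrative in-memory substitutes for the actual preference model and database tables.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# In-memory stand-ins for the real tables; names are illustrative.
NOTIFICATIONS = []          # would be the Notification table
NOTIFICATION_BUFFER = []    # would be NotificationBuffer

@dataclass
class Prefs:
    digest_enabled: bool = False          # default off
    digest_interval_minutes: int = 60     # default 60

def create_notification(user, prefs, source, title, severity):
    """With digest on, the event is buffered instead of stored."""
    record = {
        "user": user, "source": source, "title": title,
        "severity": severity, "created_at": datetime.now(timezone.utc),
    }
    # Digest mode redirects the write; nothing reaches the bell until flush.
    target = NOTIFICATION_BUFFER if prefs.digest_enabled else NOTIFICATIONS
    target.append(record)
    return record
```

The only behavioral change digest mode makes at write time is the destination; everything else (flushing, summarizing) happens later in the periodic task.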

How the flush works

The notifications.flush_notification_digests Celery task runs on a short cadence. For every user with digest enabled:

  1. Find buffered notifications older than digest_interval_minutes.
  2. Group them by source (scheduled_task_success, awx_execution_failure, etc.).
  3. For each source group:
    • Count successes vs. errors/warnings.
    • Pick the worst severity in the group as the summary type (error > warning > success).
    • Create a single summary notification: "12 scheduled task events — 10 succeeded, 2 failed" with metadata.digest=true.
  4. If email is enabled and the user's severity threshold allows it, email the summary.
  5. Delete the flushed buffer entries.
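The five steps above can be sketched as a single per-user flush function. This is a simplified model under stated assumptions — notifications are plain dicts, severities are the three levels named in the text, and the message format mirrors the example summary — not the actual Celery task body.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone
from itertools import groupby

SEVERITY_RANK = {"success": 0, "warning": 1, "error": 2}

def flush_digest(buffer, interval_minutes=60, now=None):
    """Group matured buffer entries by source and collapse each group
    into one summary dict. Field names are illustrative."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(minutes=interval_minutes)

    # Step 1: only entries older than the interval have "matured".
    matured = [n for n in buffer if n["created_at"] <= cutoff]

    # Step 2: group by source (groupby needs sorted input).
    keyfn = lambda n: n["source"]
    summaries = []
    for source, group in groupby(sorted(matured, key=keyfn), key=keyfn):
        group = list(group)
        counts = Counter(n["severity"] for n in group)
        # Step 3: worst severity in the group wins (error > warning > success).
        worst = max(group, key=lambda n: SEVERITY_RANK[n["severity"]])
        summaries.append({
            "source": source,
            "severity": worst["severity"],
            "message": f"{len(group)} {source} events — "
                       f"{counts.get('success', 0)} succeeded, "
                       f"{counts.get('error', 0)} failed",
            "metadata": {"digest": True},
        })

    # Step 5: flushed entries leave the buffer; immature ones stay.
    remaining = [n for n in buffer if n["created_at"] > cutoff]
    return summaries, remaining
```

The email step (step 4) is omitted here; it reuses the normal email path with the summary notification as input.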

The flushed summary is a normal notification from that point — it shows up in the bell, the center, and email. Its digest=true flag lets the UI render it distinctly (count badge, expandable breakdown).

Worst-severity wins

A batch of ten successes and two failures summarizes as error severity — because errors are the thing you'd care about. The UI still shows the success count; you're just being alerted at the severity level that matters.

This is deliberate: digest mode should never demote your attention. Batching is a convenience, not a reason to miss failures.

What digest does well

  • High-frequency, mostly-successful sources. Hourly health checks across 20 APICs produce 480 events per day (say 479 successes and 1 failure) → one summary per hour instead of 480 individual notifications.
  • Noisy sources during migrations. A bulk AWX run spins off dozens of execution notifications. Digest collapses them into a single "25 jobs, 23 succeeded, 2 failed" summary.
  • Email hygiene. Without digest, every event is a separate email thread. With digest, one email per hour per source.

What digest does badly

  • Time-critical failures. A single AWX failure at 14:05 won't surface until the next digest flush. If the source is "I need to know within minutes," keep digest off for it.
  • Interactive workflows. An approval request buffered for 50 minutes is useless — the requester is waiting.

You can't currently opt specific sources into digest while leaving others immediate — digest is a user-level toggle, not per-source. A workable pattern is to turn digest off and rely on high email-severity thresholds instead.

Quiet hours and digest serve different purposes. Quiet hours drop notifications during a window. Digest defers them and eventually delivers a summary. If you want overnight silence with morning summaries, quiet hours alone won't give you the summary — you'd leave quiet hours off and let digest collapse the overnight volume.

Escalation

Escalation auto-routes unread critical notifications to designated users after a configurable window. The intent: catastrophic notifications can't be silently ignored because the primary recipient went home.

The model

An EscalationRule is defined by an admin and lives in the database:

  • Name — Human-readable, shown in admin UI.
  • Source — Optional source filter (e.g. only awx_execution_failure). Empty = match all sources.
  • Min severity — Only escalate at or above this level. Default error.
  • Escalate after minutes — Wait this long after the original fires before escalating. Default 30.
  • Escalate to — M2M set of user recipients.
  • Email on escalation — Whether to ping the recipients via email too.
  • Is active — Kill switch; deactivated rules never fire.
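As a shape reference, the rule can be sketched as a plain dataclass. The real thing is a database model; this stand-in only mirrors the field names and defaults listed above.

```python
from dataclasses import dataclass, field

@dataclass
class EscalationRule:
    """Illustrative stand-in for the EscalationRule database model."""
    name: str
    source: str = ""                     # "" matches all sources
    min_severity: str = "error"          # only escalate at or above this
    escalate_after_minutes: int = 30     # wait before escalating
    escalate_to: list = field(default_factory=list)  # user recipients (M2M)
    email_on_escalation: bool = True     # also email the recipients
    is_active: bool = True               # kill switch
```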

How escalation actually fires

The notifications.check_escalations Celery task runs every 5 minutes. For each active rule:

  1. Find notifications with:
    • is_read=false
    • created_at <= now - escalate_after_minutes
    • Severity >= min_severity
    • Not already escalated (metadata.escalated not set)
    • Source matches the rule (if the rule specifies one)
  2. For each match, emit a new notification to every target in escalate_to with:
    • Title: [ESCALATED] {original title}
    • Message: includes original user, original message truncated, escalation age
    • Source: system_maintenance
    • Metadata: {escalated_from: <original id>, original_user: <username>}
  3. Mark the original notification's metadata escalated=true so it won't re-escalate.
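One pass of this check can be sketched as follows. Rules and notifications are plain dicts here, and the field names follow the text but are illustrative; the flag is set only after every rule has been evaluated, so a single notification can still be matched by multiple rules within the same pass.

```python
from datetime import datetime, timedelta, timezone

SEVERITY_RANK = {"success": 0, "warning": 1, "error": 2}

def check_escalations(rules, notifications, now=None):
    """One pass of the periodic escalation check (sketch)."""
    now = now or datetime.now(timezone.utc)
    escalated, to_mark = [], []
    for rule in rules:
        if not rule["is_active"]:
            continue
        cutoff = now - timedelta(minutes=rule["escalate_after_minutes"])
        for n in notifications:
            # Step 1: the five match conditions.
            if (n["is_read"]
                    or n["created_at"] > cutoff
                    or SEVERITY_RANK[n["severity"]] < SEVERITY_RANK[rule["min_severity"]]
                    or n["metadata"].get("escalated")
                    or (rule["source"] and n["source"] != rule["source"])):
                continue
            # Step 2: one copy per escalation target.
            for target in rule["escalate_to"]:
                escalated.append({
                    "user": target,
                    "title": f"[ESCALATED] {n['title']}",
                    "source": "system_maintenance",
                    "metadata": {"escalated_from": n["id"],
                                 "original_user": n["user"]},
                })
            to_mark.append(n)
    # Step 3: flag originals after all rules ran, preventing re-escalation
    # on later passes without blocking multiple rules in this one.
    for n in to_mark:
        n["metadata"]["escalated"] = True
    return escalated
```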

The escalated notification is a distinct row — the original stays put, and the escalation copy lives in each recipient's inbox with its own read/unread state. Recipients can acknowledge the escalation independently of the original.

Why the original keeps its unread state

Two reasons:

  • Escalation doesn't mean the original user is off the hook. They may come back online and deal with it. The unread state is a personal to-do signal.
  • Marking it read would hide it from the original user's dashboard — they'd never know an incident fired against them while they were away.

If the original user does read and resolve the notification, that doesn't unwind the escalation (the copy is already out there). The two notifications are independent once the escalation fires.

Source-specific escalations

Many deployments want different escalation behavior per source. You can create multiple rules:

  • APIC failure escalation — source=scheduled_task_failure, after 15 min, to the network on-call team.
  • AWX critical failure escalation — source=awx_execution_failure, min_severity=error, after 30 min, to the automation owners.
  • Catch-all error escalation — source empty, min_severity=error, after 60 min, to the platform admins.

Rules are evaluated independently; a single notification can be escalated by multiple rules if all of them match.

The [ESCALATED] convention

The title prefix is literal: [ESCALATED] Original title here. Recipients see it prominently in the bell dropdown — a visual indicator that this isn't a new event, it's an unread-too-long event. The metadata carries escalated_from so the UI could deep-link back to the original (current UI shows the text reference; direct click-through is a roadmap item).

Email on escalation

When email_on_escalation=true and the recipient has email enabled and passes the severity gate, the escalation copy emails. Most escalations want this — the whole point of escalation is "the in-app bell didn't work, try harder."

Admins usually create escalation rules with email_on_escalation=true. The recipients' own email-severity threshold still applies, so if a recipient has set email_min_severity=error they only get paged for error-class escalations.
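The two gates an escalation email passes can be expressed as one predicate. A hedged sketch — the helper name and argument shapes are hypothetical, but the logic follows the text: the rule's flag, the recipient's email toggle, and the recipient's own severity threshold all have to agree.

```python
SEVERITY_RANK = {"success": 0, "warning": 1, "error": 2}

def should_email_escalation(rule_email_on, recipient_email_enabled,
                            recipient_email_min_severity, severity):
    """True only when the rule wants email AND the recipient's own
    email preferences allow a message at this severity."""
    return (rule_email_on
            and recipient_email_enabled
            and SEVERITY_RANK[severity] >= SEVERITY_RANK[recipient_email_min_severity])
```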

Interactions with other features

Escalation + digest

If an escalation target has digest mode enabled, the escalation is buffered like any other notification. This is almost certainly wrong — you want escalations immediate. Admins typically configure escalation-target users to disable digest for their accounts, or at least set a short digest_interval_minutes (5 minutes) so escalations don't languish.

Escalation + quiet hours

Escalations in a recipient's quiet hours are dropped for that user (quiet hours apply to all notifications, no exceptions). If critical escalations must pierce quiet hours, route to multiple users — someone's quiet hours won't overlap someone else's on-call window.

Escalation + suppression

A notification that got filtered out entirely (per-source opt-out, in_app_enabled=false) was never persisted in the first place, so it can't be escalated from. The escalation starts from the persisted Notification row — upstream suppression means no escalation.

Operational patterns

The classic night-shift pattern

Rule 1 — all errors, escalate after 20 minutes, to the secondary on-call. Email on.

Rule 2 — AWX failures specifically, escalate after 10 minutes, to a senior engineer list.

Primary on-call gets the original. If they acknowledge within 20 min → no escalation. If they don't → secondary gets paged. If the AWX path is failing → senior team gets paged faster.
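Expressed as data, the two rules might look like this. Values follow the pattern above; recipient names are placeholders, and Rule 2's email flag is an assumption (the text doesn't specify it).

```python
# Night-shift pattern as plain dicts; field names follow the
# EscalationRule table above, recipient names are illustrative.
night_shift_rules = [
    {
        "name": "All errors to secondary on-call",
        "source": "",                          # match all sources
        "min_severity": "error",
        "escalate_after_minutes": 20,
        "escalate_to": ["secondary-oncall"],
        "email_on_escalation": True,
        "is_active": True,
    },
    {
        "name": "AWX failures to senior engineers",
        "source": "awx_execution_failure",     # source-specific, fires faster
        "min_severity": "error",
        "escalate_after_minutes": 10,
        "escalate_to": ["senior-eng-list"],
        "email_on_escalation": True,           # assumed; not stated in the text
        "is_active": True,
    },
]
```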

The compliance pattern

Rule — system_maintenance source, severity error, escalate after 0 minutes, to the compliance admin list, email on.

Any system-level error fires an immediate duplicate to the compliance team. Zero-minute escalation is technically valid; since the check task runs every 5 minutes, the escalation lands at the next check, up to 5 minutes later. Use when the compliance team needs visibility into every system-level incident.

The quiet escalation pattern

Rule — source empty, severity warning, escalate after 240 minutes (4 hours), to the user's manager, email off.

Unread warning-class notifications drift up to managers silently. No email noise — just an in-app heads-up that someone on the team is behind on their notifications. Used occasionally in regulated environments.

Troubleshooting

Digest and escalation issues that come up often:

  • "Digest notifications never arrive." The flush task runs on a Celery Beat schedule. If Celery Beat is down, digests don't flush. Check docker compose logs celery-beat.
  • "My digest is empty but I had events." The buffer is cleared after each flush. An event that arrived in the last flush window already went out; an event in the current window hasn't matured yet (digest_interval_minutes hasn't passed for it).
  • "Escalation fired for a notification I already read." The escalation check runs every 5 minutes and sees read state only as of that run; if a check picked the notification up just before you marked it read, the escalation still goes out, and reading afterwards doesn't recall it. The race window is at most 5 minutes.
  • "Escalated notifications keep firing." Check the metadata.escalated flag on the original — it should be set after first escalation. If a rule edit reset it, the notification is eligible again.
  • "I want to escalate to a Slack channel, not a user." Current escalations target users only. Route to a dedicated "slack-bridge" user whose email address forwards to Slack via SMTP-to-Slack. First-class webhook targets are a roadmap item.
  • "A rule matches too many notifications." Narrow by source and raise min_severity. Source-empty + severity info catches everything including success pings, which is rarely what you want.
  • "Escalation fires immediately for old notifications after I create a rule." Expected — the rule applies to any existing unread notification older than escalate_after_minutes. If you create a rule with escalate_after_minutes=30 and there are unread notifications from two hours ago, they escalate on the next check. Consider acknowledging old notifications before enabling aggressive rules.

That covers notifications — in-app, email, digest, escalation. The next major section — Administration — is about the admin-only view: user management, groups, permissions, audit logs, and the system-wide settings that keep everything in line.