Retention and settings
Per-user and global retention policies, size limits, duplicate handling, and the daily cleanup task that keeps snapshot storage bounded without surprises.
Time Machine settings are the knobs that decide how much history sticks around, how big a snapshot can get, and whether duplicate results are stored or skipped. Tuning them well is the difference between a lean archive and a database that eats disk for breakfast.
Global vs per-user settings
Two layers:
- Global defaults — one row with `user=null`. Applies to anyone without their own override. Managed by admins.
- Per-user overrides — one row per user who needs different behavior. An override replaces the global entirely (not merged field-by-field).
The resolution is first-match: user-specific row → fall back to global → if no global exists, create one with sensible defaults on first use.
This structure lets an admin set sane defaults (30 days, 10 MB cap, auto-cleanup on) and let specific users opt into different behavior — longer retention for compliance, larger caps for big-tenant queries, no auto-cleanup for forensic work.
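The first-match resolution above can be sketched in a few lines. This is a minimal illustration using plain dicts in place of the real settings rows — the function and row shapes are assumptions, though the defaults mirror the ones named in the text:

```python
# Defaults mirror the text: 30 days, 10 MB cap, auto-cleanup on.
GLOBAL_DEFAULTS = {"retention_days": 30, "max_snapshot_size_mb": 10,
                   "auto_cleanup_enabled": True}

def resolve_settings(user_id, rows):
    """First-match resolution: per-user row wins wholesale (no field merge),
    else the global row (user=None), else create a global row with defaults."""
    by_user = {r["user"]: r for r in rows}
    if user_id in by_user:
        return by_user[user_id]          # per-user override
    if None in by_user:
        return by_user[None]             # global defaults
    created = {"user": None, **GLOBAL_DEFAULTS}
    rows.append(created)                 # lazily create global on first use
    return created
```

Note that the override is all-or-nothing: a user row with a longer retention still carries its own size cap and duplicate setting, not the global ones.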
Retention policies
Three modes:
| Policy | Behavior | Typical use |
|---|---|---|
| By days | Delete snapshots older than N days | Default; keeps a rolling window of history |
| By count | Keep the newest N snapshots per saved query, delete the rest | Fixed-footprint retention regardless of snapshot frequency |
| Unlimited | Never delete | Compliance, long-term audit, forensics |
By days
- Field: `retention_days`, default 30.
- Every snapshot with `executed_at` older than `now - retention_days` is deleted by the cleanup task.
- Simple to reason about: "show me the last 30 days."
- Storage grows linearly with snapshot frequency in that window.
A daily-capture task over 30 days retains 30 snapshots per APIC. An hourly one retains 720. Plan caps accordingly.
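The footprint arithmetic is worth making explicit — per APIC, the retained count is just capture rate times window:

```python
def snapshots_retained(captures_per_day, retention_days):
    """Rolling-window footprint per APIC under by-days retention."""
    return captures_per_day * retention_days

# From the text: a daily task vs. an hourly task over a 30-day window.
assert snapshots_retained(1, 30) == 30
assert snapshots_retained(24, 30) == 720
```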
By count
- Field: `retention_count`, default 100.
- Per saved query, keep the newest `retention_count` snapshots. Anything beyond that is purged regardless of age.
- Fixed upper bound on storage per query.
Useful when capture frequency varies — an hourly task and a daily task both keep 100 snapshots, but the hourly covers ~4 days while the daily covers ~3 months.
By-count retention applies per saved query, not globally. 100 queries each with retention_count=100 means you can end up with 10K snapshots total. Budget accordingly.
Unlimited
- No deletion. Snapshots persist until you manually remove them.
- Use for queries where long-term history has compliance value.
- Disk growth is your responsibility. Monitor storage; the cleanup task does nothing here.
Size limits
Two fields:
- `max_snapshot_size_mb` — default 10 MB. Snapshots larger than this are refused at capture time.
- `warn_large_snapshots` — default on. Logs a warning when a snapshot exceeds the limit.
10 MB covers most queries comfortably. Specific workloads that need more:
- All endpoints across a fabric — often 20–50 MB.
- Full tenant dump with children — can be 30–100 MB on a busy tenant.
- Compliance-style full-state snapshots — routinely 50 MB+.
Raise the limit for the specific users who need it; keep the global low. An accidental "snapshot all objects" query without filters is a good thing to have refused.
When a snapshot is too large
The capture returns `{success: false, error: 'snapshot_too_large', size_mb, limit_mb}`. The user sees a clear error with the actual size and the limit. Two options:
- Narrow the query. Post-processors that drop unused attributes shrink snapshots dramatically — dropping `modTs`, `stats`, and nested children often halves the size.
- Raise the cap. Settings → Time Machine → edit the limit. Only for the users who need it.
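The size gate at capture time can be sketched like this. The function name and the serialization details are assumptions, but the error payload matches the one shown above:

```python
import json

def check_snapshot_size(result_data, max_snapshot_size_mb):
    """Hypothetical size gate: measure the serialized payload and refuse
    anything over the configured cap, returning the documented error shape."""
    size_mb = len(json.dumps(result_data).encode()) / (1024 * 1024)
    if size_mb > max_snapshot_size_mb:
        return {"success": False, "error": "snapshot_too_large",
                "size_mb": round(size_mb, 2), "limit_mb": max_snapshot_size_mb}
    return {"success": True, "size_mb": round(size_mb, 2)}
```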
Snapshots are jsonb. Storage is cheap but not free; retrieval time scales with size. Snapshots in the 100+ MB range are workable but make the snapshot detail view slow to load.
Duplicate handling
- `store_duplicates` — default off.
With it off (the usual case), `capture_snapshot()` skips the save when the SHA-256 hash matches the previous snapshot for the same (saved_query, APIC). The previous snapshot ID is returned so the caller knows which row "stood in" for this run.
With it on, every run creates a row regardless of content. Reasons to turn it on:
- Strict audit trails where "the query ran at 02:00 today with the same result as yesterday" is itself an audit artifact.
- Debugging the capture pipeline — want to see every capture, not just the novel ones.
Almost nobody needs this. The default is the right answer for nearly every deployment.
What duplicate skipping buys you
A stable fabric running daily snapshots generates ~1 row per month per query (the hash never changes). A fabric where things are changing generates rows only on the days drift happens. The database size reflects actual change volume, not snapshot frequency.
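The skip logic is simple enough to sketch. This is an illustration, not Fabrik's actual `capture_snapshot()` — the row shape and ID handling are assumptions; the SHA-256-against-previous comparison is the documented behavior:

```python
import hashlib
import json

def capture(result_data, prev, store_duplicates=False):
    """Duplicate-skip sketch: hash canonical JSON and compare against the
    previous snapshot for the same (saved_query, APIC)."""
    digest = hashlib.sha256(
        json.dumps(result_data, sort_keys=True).encode()
    ).hexdigest()
    if not store_duplicates and prev and prev["hash"] == digest:
        # No new row; the prior snapshot "stands in" for this run.
        return {"created": False, "snapshot_id": prev["id"]}
    new_id = (prev["id"] + 1) if prev else 1   # stand-in for a DB-assigned PK
    return {"created": True, "snapshot_id": new_id, "hash": digest}
```

Sorting keys before hashing matters: two serializations of the same result must produce the same digest, or identical payloads would look like drift.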
Auto-cleanup
- `auto_cleanup_enabled` — default on, global only.
This is a kill switch for the Celery Beat cleanup task. When off, the daily run (03:30 server time) logs "cleanup skipped" and does nothing.
Turning it off makes sense during:
- Migrations — you don't want cleanup touching tables you're restructuring.
- Compliance holds — legal wants "nothing deleted for the next 60 days."
- Forensics — an incident is under investigation and history must be preserved.
Turn it back on when the hold lifts. Expired snapshots accumulated during the hold will be cleaned up on the next daily run, possibly a big delete — not usually a problem but worth noticing.
How cleanup actually runs
The Celery Beat task `queries.cleanup_time_machine_snapshots` fires daily at 03:30 server time (not user time — server). It:
- Loads global settings.
- Short-circuits if `auto_cleanup_enabled` is false.
- Calls `settings.execute_cleanup()`, which issues one set-based SQL `DELETE`.
- Returns the deleted row count.
- If any rows were deleted, sends an info notification to every superuser ("Automatic cleanup removed 42 old snapshot(s)").
- If the task itself errors, sends an error notification and returns a failure payload.
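The steps above reduce to a short control flow. A sketch under stated assumptions — the callables are stand-ins for the real settings row, `execute_cleanup()`, and the notification layer:

```python
def cleanup_time_machine_snapshots(settings, execute_cleanup, notify_superusers):
    """Sketch of the daily cleanup task flow described above."""
    if not settings["auto_cleanup_enabled"]:
        return {"success": True, "skipped": True, "deleted": 0}
    try:
        deleted = execute_cleanup()        # one set-based SQL DELETE
    except Exception as exc:
        notify_superusers("error", f"Time Machine cleanup failed: {exc}")
        return {"success": False, "error": str(exc)}
    if deleted:
        notify_superusers(
            "info", f"Automatic cleanup removed {deleted} old snapshot(s)")
    return {"success": True, "deleted": deleted}
```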
The delete is deliberately raw SQL, not Django's `.delete()`. Django loads primary keys into Python memory and iterates — fine for tens of rows, catastrophic for hundreds of thousands. A single `DELETE ... WHERE executed_at < cutoff` on PostgreSQL completes in seconds against millions of rows.
By-count cleanup uses a window function
For `retention_by_count`, the delete uses a window function:

```sql
DELETE FROM time_machine_snapshots
WHERE id IN (
  SELECT id FROM (
    SELECT id, ROW_NUMBER() OVER (
      PARTITION BY saved_query_id ORDER BY executed_at DESC
    ) AS rn
    FROM time_machine_snapshots
    WHERE saved_query_id IS NOT NULL
  ) ranked
  WHERE rn > {retention_count}
)
```

One query per saved query would be O(N). The window-function approach is one query total, partitioned internally. That matters when you have hundreds of saved queries with history.
Cleanup preview
Before running cleanup, admins (or curious users) can ask "what would be deleted right now?" via the cleanup preview endpoint:
- Returns up to 500 rows that would be deleted (with query name, executed_at, size, result count).
- Returns the true total count separately — so the UI shows "5 sample rows, 12,483 total."
- Doesn't actually delete anything.
This is the right tool for verifying a retention change before applying it. Set the policy, preview the impact, then let the daily task run.
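The preview logic itself is a filter plus a count. A minimal sketch for the by-days case — the field names and the 500-row sample limit follow the text, everything else is an assumption:

```python
def cleanup_preview(snapshots, cutoff, sample_limit=500):
    """What the daily task would delete right now, without deleting anything.
    Returns a capped sample plus the true total, so the UI can show
    'N sample rows, M total'."""
    doomed = [s for s in snapshots if s["executed_at"] < cutoff]
    return {"total": len(doomed), "sample": doomed[:sample_limit]}
```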
Per-user settings the UI surfaces
Settings → Time Machine in the user profile shows:
- Retention policy dropdown (days / count / unlimited).
- Retention value (days or count, depending on policy).
- Max snapshot size (MB).
- Warn on large snapshots (checkbox).
- Store duplicates (checkbox).
Admin-only additionally:
- Auto-cleanup enabled toggle (global only).
- Global defaults editor — drives the fallback for users without overrides.
Exporting before cleanup
Retention deletes are irreversible. If you want to preserve history past the policy, export the snapshots first — the snapshot detail view has a Download JSON button that pulls the full result_data. For bulk export across many snapshots, use the admin interface or a direct PostgreSQL query.
This is the current story. A first-class export flow is on the roadmap.
Monitoring storage
A few queries that help keep an eye on growth:
```sql
-- Total snapshot count
SELECT COUNT(*) FROM time_machine_snapshots;

-- Snapshots per saved query, largest first
SELECT saved_query_id, COUNT(*) AS snapshots, SUM(result_size_bytes)/1024/1024 AS total_mb
FROM time_machine_snapshots
GROUP BY saved_query_id
ORDER BY total_mb DESC
LIMIT 20;

-- Daily capture rate
SELECT DATE(executed_at), COUNT(*)
FROM time_machine_snapshots
WHERE executed_at > NOW() - INTERVAL '30 days'
GROUP BY 1 ORDER BY 1;
```

Run these against the Fabrik PostgreSQL instance. They surface the queries you've over-snapshotted and let you dial retention down where it matters most.
Troubleshooting
Retention issues that come up often:
- "My snapshots vanished after a week." Your retention policy is `by_days` with `retention_days=7`. Check your Time Machine settings; you may be under per-user settings that didn't get cleared after a test.
- "Cleanup ran but nothing was deleted." `auto_cleanup_enabled` is off, or the policy is `unlimited`. Confirm in global settings.
- "A user can't raise their own snapshot size limit." That setting is per-user and editable from their profile. If they're locked out, an admin can edit their row.
- "The cleanup task failed last night." Check Celery Beat logs and superuser notifications; the task sends an error notification on failure. Usually a database lock or a migration in progress.
- "Disk usage is growing despite retention." `store_duplicates` may be on, inflating snapshot counts. Turn it off, then run cleanup manually by invoking the Celery task via the Django shell.
- "I want cleanup to run now, not at 03:30." Trigger the task manually: `docker compose exec backend celery -A fabrik call queries.cleanup_time_machine_snapshots`. The effect is identical to the scheduled run.
- "Do retention days count in server timezone?" Yes — cleanup compares to `now()` in the server's timezone. For most deployments the difference between server time and user time doesn't matter here; it's only visible at midnight-adjacent edges.
That's Time Machine. Combined with the Query Builder, scheduled tasks, and AWX automation, you now have the full read/write cycle — read fabric state, snapshot it over time, compare it, and push changes back via Ansible playbooks.
The next major section — Notifications — is about how Fabrik surfaces everything that's happened: task completions, approvals, drift alerts, system events.