Operations

Backup, restore, and high availability.

RackWatch is a single binary writing to a single database. That makes operations refreshingly simple — but you should still have a backup and a recovery story before you put it in front of an audit. Here's both, end-to-end.

Last updated: 2026-05-03

1. What state lives where

One process, one DB, one volume. That's the whole picture.

| What | Where | Backup priority |
|---|---|---|
| Fleet telemetry, alerts, history | SQLite at /data/rackwatch.db (default) or your SQL Server instance. | Critical |
| Configuration | Environment variables on the platform container: LICENSE_KEY, AGENT_API_KEY, DatabaseProvider, connection strings. | Critical (back up your docker-compose.yml or systemd unit; the values are secrets). |
| License key | In the env var above. Also in the email RackWatch sent at subscription time. | Replaceable: you can re-request from hello@rackwatch.io if lost. |
| Agent state | Stateless. The agent reads its config, scans the host, and posts to the platform. Nothing to back up. | None |

If the platform DB and your env-var file both survive, every other piece of RackWatch can be reconstructed from a fresh image pull.

2. Backups

2.1 SQLite (default)

SQLite ships with a hot-backup API. You can take a consistent snapshot of a running database without stopping the platform:

# Inside the platform container, or wherever the db lives
SNAPSHOT="/backup/rackwatch-$(date +%Y%m%d).db"
sqlite3 /data/rackwatch.db ".backup '$SNAPSHOT'"

# Compress + ship offsite. Name the file explicitly: aws s3 cp takes a single
# source, so a glob that matches several snapshots would make it fail.
gzip "$SNAPSHOT"
aws s3 cp "$SNAPSHOT.gz" s3://your-bucket/rackwatch/

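Schedule via cron (host) or a Kubernetes CronJob. A reasonable cadence is hourly for fleets above 50 hosts, daily for smaller ones. For hourly snapshots, include the hour in the file name (for example date +%Y%m%d-%H) so that runs on the same day don't collide. Keep at least 30 days of snapshots; the DB is small (typically <1 GB even for 500 hosts).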
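The 30-day retention can be enforced by the same job that takes the snapshot. The following is a minimal sketch, assuming compressed snapshots sit directly in one directory and follow the rackwatch-*.db.gz naming used above. The function name prune_snapshots is hypothetical, and -maxdepth and -delete assume GNU or BSD find.

```shell
#!/bin/sh
# prune_snapshots DIR [KEEP_DAYS]
# Deletes compressed RackWatch snapshots in DIR that are older than KEEP_DAYS
# (default 30). Hypothetical helper, not part of RackWatch itself.
prune_snapshots() {
    backup_dir="$1"
    keep_days="${2:-30}"
    # find rounds file age down to whole days, so "+30" matches files that
    # are 31 days old or older.
    find "$backup_dir" -maxdepth 1 -type f -name 'rackwatch-*.db.gz' \
        -mtime "+$keep_days" -delete
}
```

Calling prune_snapshots /backup 30 at the end of the backup job keeps the directory bounded. Only local copies are pruned; offsite retention is better handled by the storage provider's own lifecycle rules.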

2.2 SQL Server

Use whatever your DBA team already does for production SQL Server. The minimum we recommend:

Standard BACKUP DATABASE works — RackWatch doesn't use anything exotic. The schema is around two dozen tables; the index footprint is small.
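For teams without an existing backup job, a full backup can be driven from the shell with sqlcmd. This is a sketch under stated assumptions: the database name RackWatch, the backup path, and the SQL_HOST variable are placeholders rather than values from the product, and authentication flags are omitted. The helper only builds the T-SQL statement, so it can be reviewed before anything runs.

```shell
#!/bin/sh
# backup_statement DB_NAME BACKUP_FILE
# Prints a T-SQL full-backup statement. CHECKSUM makes SQL Server verify page
# checksums while writing the backup; INIT overwrites any backup sets already
# in BACKUP_FILE. Hypothetical helper, not part of RackWatch.
backup_statement() {
    printf "BACKUP DATABASE [%s] TO DISK = N'%s' WITH CHECKSUM, INIT;\n" "$1" "$2"
}

# Example invocation (database name, path, and host are assumptions):
#   sqlcmd -S "$SQL_HOST" -b \
#     -Q "$(backup_statement RackWatch /var/opt/mssql/backup/rackwatch.bak)"
```

The -b flag makes sqlcmd exit with a non-zero status if the statement fails, which lets a cron job or scheduler notice a failed backup.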

3. Restore

3.1 SQLite

# Stop the platform
docker stop rackwatch-platform

# Replace the db file
gunzip -c /backup/rackwatch-20260501.db.gz > /data/rackwatch.db

# Restart
docker start rackwatch-platform

# Verify the platform is back up, then confirm the dashboard shows the expected hosts
curl -fsSL http://localhost:5000/healthz

RackWatch will auto-apply any pending schema migrations on startup. If your backup is from an older version, the restore + auto-migrate path is supported within the current major version and the immediately preceding major (EULA §6).
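Before replacing the live file, it can be worth confirming that the snapshot decompresses cleanly and passes SQLite's built-in integrity check. A minimal sketch, assuming gzip-compressed snapshots as produced in section 2.1; the function name verify_snapshot is hypothetical.

```shell
#!/bin/sh
# verify_snapshot FILE.db.gz
# Decompresses the snapshot to a scratch file and runs PRAGMA integrity_check.
# Returns 0 only if SQLite reports exactly "ok". Hypothetical helper.
verify_snapshot() {
    snapshot_gz="$1"
    scratch="$(mktemp)" || return 1
    if gunzip -c "$snapshot_gz" > "$scratch" 2>/dev/null; then
        result="$(sqlite3 "$scratch" 'PRAGMA integrity_check;' 2>/dev/null)"
    else
        result="gunzip failed"
    fi
    rm -f "$scratch"
    [ "$result" = "ok" ]
}
```

For example, verify_snapshot /backup/rackwatch-20260501.db.gz || exit 1 placed ahead of the docker stop step aborts the restore while the current database is still untouched.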

3.2 SQL Server

Run a standard RESTORE DATABASE, then bring the database back online under the same connection string. The platform will reconnect automatically; agents will resume posting on their next interval (60s default).

3.3 What you'll see right after restore

Risk scores recompute on the first agent check-in following restore — typically within 1–2 minutes for the first agents, full fleet visibility within 5 minutes. Patch lag and CVE mappings are computed on demand from the data the agents have already posted, so they reappear without intervention.

4. Disaster recovery — realistic targets

| Scenario | Realistic target | How |
|---|---|---|
| Platform host dies | RTO ≤ 30 min, RPO ≤ 1 hour | Provision a new host, pull the image, restore the latest hourly SQLite snapshot, point DNS or load balancer at the new host. Agents auto-resume. |
| DB corruption (rare with SQLite if you use the backup API correctly) | RTO ≤ 15 min, RPO = your backup interval | Stop platform, restore most recent good backup, restart. sqlite3 .recover can sometimes salvage data from a corrupt file into a new database. |
| Whole rack / data center loss | RTO depends on cross-site replication; RPO = your offsite backup cadence | Restore from offsite backup at your DR site. Re-deploy agents pointed at the new platform URL or, better, use the same hostname so agents need no reconfiguration. |
| License key lost | RTO < 1 day | Email hello@rackwatch.io; we re-issue from the same Stripe customer record. Or fall back to TRIAL mode while you wait: the fleet keeps working and only the banner changes. |
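In the "platform host dies" scenario, the step most likely to go wrong under pressure is picking the right snapshot. A small sketch of that step, assuming date-stamped file names as in section 2.1 (so alphabetical order matches chronological order); the function name latest_snapshot is hypothetical.

```shell
#!/bin/sh
# latest_snapshot DIR
# Prints the path of the newest rackwatch-*.db.gz in DIR and returns 0,
# or prints nothing and returns 1 if DIR holds no snapshots.
# Relies on date-stamped names sorting chronologically; a file with a
# non-date suffix would break that assumption. Hypothetical helper.
latest_snapshot() {
    newest=""
    for f in "$1"/rackwatch-*.db.gz; do
        [ -e "$f" ] || continue   # the glob matched nothing
        newest="$f"               # glob results are sorted, so the last one wins
    done
    [ -n "$newest" ] && printf '%s\n' "$newest"
}
```

The result can feed the restore directly, for example gunzip -c "$(latest_snapshot /backup)" > /data/rackwatch.db with the platform stopped.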

5. High availability

Honest read: RackWatch the platform is a single-process binary. There is no native clustering or active-active mode today, and there won't be one until customer demand justifies the architectural cost. Most fleets that buy server-monitoring software are fine with a 30-minute RTO; if yours isn't, here are the real options.

5.1 Cold standby (recommended for most)

Take regular SQLite (or SQL Server) backups offsite. If the platform host dies, provision a new host, restore, redirect DNS. Most teams already have this pattern for other internal services. RTO 15–30 min, RPO matches your backup interval.

5.2 Active-passive via shared volume

Run two platform containers on a shared volume (NFS, EFS, or block storage with single-attach failover). Only one is "active" at a time — the other waits for the orchestrator to bring it up. Works with Docker Swarm, Kubernetes StatefulSet + PersistentVolumeClaim, or systemd + corosync/pacemaker. Failover is automatic; RTO drops to 1–5 min depending on the orchestrator.

Don't run two active containers writing to one SQLite file. SQLite permits only one writer at a time and relies on file locks to enforce it, and those locks are unreliable on many network filesystems. The active-passive pattern is fine because only one container writes at a time; the secondary is dormant until promoted.

5.3 SQL Server Always On

Set DatabaseProvider=SqlServer and point the connection string at an Always On Availability Group listener. The platform reconnects automatically on AG failover. Pair with a load balancer in front of two platform instances and you have an active-active read path for the dashboard plus a single writer for state changes.

5.4 What's NOT supported

6. Upgrades

Pull the new tag, restart the container. The platform applies schema migrations on startup; agent telemetry continues to queue at the agent end if the platform is down briefly during the swap.

# Take a backup first
sqlite3 /data/rackwatch.db ".backup '/backup/rackwatch-pre-upgrade.db'"

# Pull + restart
docker pull rackwatch/platform:latest
docker stop rackwatch-platform
docker rm rackwatch-platform
docker run -d --name rackwatch-platform --restart=always \
  -p 5000:5000 -v /data:/data -e LICENSE_KEY="..." rackwatch/platform:latest

To roll back: stop the new container, restore the pre-upgrade backup, restart with the previous image tag (rackwatch/platform:v1.X.Y). The image registry retains every published version.
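The rollback steps can be kept as a script so they are not improvised during an incident. The sketch below assumes the container name, paths, and port from the upgrade commands above and takes the previous image tag as an argument. The helper names run and rollback and the DRY_RUN switch are hypothetical; with DRY_RUN=1 the commands are printed instead of executed.

```shell
#!/bin/sh
# run CMD [ARGS...]
# Executes the command, or only prints it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# rollback PREVIOUS_TAG
# Stops the new container, restores the pre-upgrade database, and starts the
# previous image. "-e LICENSE_KEY" with no value passes the variable through
# from the calling shell, so LICENSE_KEY must already be exported there.
rollback() {
    previous_tag="$1"
    [ -n "$previous_tag" ] || { echo "usage: rollback PREVIOUS_TAG" >&2; return 2; }
    run docker stop rackwatch-platform &&
    run docker rm rackwatch-platform &&
    run cp /backup/rackwatch-pre-upgrade.db /data/rackwatch.db &&
    run docker run -d --name rackwatch-platform --restart=always \
        -p 5000:5000 -v /data:/data -e LICENSE_KEY \
        "rackwatch/platform:$previous_tag"
}
```

Running it once with DRY_RUN=1 and the tag you intend to return to shows exactly which commands would execute. Each step is chained with &&, so a failure stops the sequence rather than starting the old image against a half-restored database.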

7. Storage planning

Telemetry rows compress well. Empirical defaults:

If storage is tight, drop retention to 30 days via the RETENTION_DAYS env var. The aggregate score and patch lag don't lose accuracy at shorter retention; only historical trend graphs do.

8. Monitoring the monitor

Standard pattern: have something else check the platform's /healthz endpoint and alert if it's down.
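If no external monitor is available yet, a cron job on another host can cover the basics. A minimal sketch, assuming curl is installed and that /healthz answers with a success status when the platform is up; the function names check_health and wait_healthy are hypothetical, and how you alert on failure is left to your own tooling.

```shell
#!/bin/sh
# check_health URL
# Returns 0 unless the request fails, times out after 5 seconds, or the
# server answers with HTTP 400 or above.
check_health() {
    curl -fsS --max-time 5 -o /dev/null "$1"
}

# wait_healthy URL [ATTEMPTS] [DELAY_SECONDS]
# Polls the endpoint until it is healthy or the attempts run out.
# Useful after a restore or upgrade; defaults to 30 attempts, 2 seconds apart.
wait_healthy() {
    url="$1"
    attempts="${2:-30}"
    delay="${3:-2}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        check_health "$url" && return 0
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

A monitoring job would call check_health http://your-platform-host:5000/healthz and raise an alert on a non-zero exit status. The same wait_healthy call can gate the end of a restore or upgrade script.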

The platform also writes startup and license-state lines to stdout. Forward those to your log aggregator if you have one.

9. Where to ask

Operations questions, edge cases, post-mortems, or "is this the right architecture for my fleet" sanity checks: email hello@rackwatch.io. We reply within one business day.

For procurement teams: this page plus the security policy and EULA §5 (continuity & data portability) are usually what audit asks for. We'll happily sign a DPA or answer a security questionnaire — just write in.