Backup, restore, and high availability
RackWatch is a single binary writing to a single database. That makes operations refreshingly simple — but you should still have a backup and a recovery story before you put it in front of an audit. Here's both, end-to-end.
1. What state lives where
One process, one DB, one volume. That's the whole picture.
| What | Where | Backup priority |
|---|---|---|
| Fleet telemetry, alerts, history | SQLite at /data/rackwatch.db (default) or your SQL Server instance. | Critical |
| Configuration | Environment variables on the platform container: LICENSE_KEY, AGENT_API_KEY, DatabaseProvider, connection strings. | Critical (back up your docker-compose.yml or systemd unit; the values are secrets). |
| License key | In the env var above. Also in the email RackWatch sent at subscription time. | Replaceable — you can re-request from hello@rackwatch.io if lost. |
| Agent state | Stateless. The agent reads its config, scans the host, and posts to the platform. Nothing to back up. | None |
If the platform DB and your env-var file both survive, every other piece of RackWatch can be reconstructed from a fresh image pull.
2. Backups
2.1 SQLite (default)
SQLite ships with a hot-backup API. You can take a consistent snapshot of a running database without stopping the platform:
    # Inside the platform container, or wherever the db lives
    SNAP="/backup/rackwatch-$(date +%Y%m%d).db"   # add %H%M for hourly snapshots
    sqlite3 /data/rackwatch.db ".backup '$SNAP'"

    # Compress + ship offsite (aws s3 cp takes a single source file, so avoid globs)
    gzip "$SNAP"
    aws s3 cp "$SNAP.gz" s3://your-bucket/rackwatch/
Schedule via cron (host) or a Kubernetes CronJob. A reasonable cadence is hourly for fleets above 50 hosts, daily for smaller ones. Keep at least 30 days of snapshots; size the backup volume using the estimates in section 7 (roughly 5 MB per host per month of telemetry).
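The retention window can be enforced with a small pruning helper. This is a minimal sketch, assuming snapshots are named rackwatch-*.db.gz and live under /backup as in the example above; the function name `prune_snapshots` and the script paths in the crontab comment are hypothetical.

```shell
# prune_snapshots DIR DAYS
# Deletes compressed snapshots in DIR that are older than DAYS full days
# (find's "-mtime +N" semantics). Other files in DIR are left alone.
prune_snapshots() {
    local dir="$1" days="$2"
    find "$dir" -maxdepth 1 -type f -name 'rackwatch-*.db.gz' \
        -mtime +"$days" -delete
}

# Example host crontab entries (hypothetical script paths):
#   0 * * * *   /usr/local/bin/rackwatch-backup.sh   # hourly snapshot
#   30 3 * * *  /usr/local/bin/rackwatch-prune.sh    # calls: prune_snapshots /backup 30
```

Pruning by modification time keeps the helper independent of the date format in the file name, so it works for both daily and hourly snapshots.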
2.2 SQL Server
Use whatever your DBA team already does for production SQL Server. The minimum we recommend:
- Full backup nightly to a remote share or blob.
- Differential backup every 6 hours.
- Transaction log backup every 15 minutes if you're running in FULL recovery mode and need point-in-time recovery.
Standard BACKUP DATABASE works — RackWatch doesn't use anything exotic. The schema is around two dozen tables; the index footprint is small.
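The three backup types listed above map onto three T-SQL statements. The sketch below builds them and runs them through `sqlcmd`. The database name `RackWatch`, the backup directory, and the `SQL_SERVER`, `SQL_DB`, and `BACKUP_DIR` variables are assumptions for illustration, not RackWatch defaults. If your DBA team already has tooling, use that instead.

```shell
# backup_sql KIND DB FILE -> prints the T-SQL for one backup.
# KIND is full, diff, or log.
backup_sql() {
    local kind="$1" db="$2" file="$3"
    case "$kind" in
        full) printf "BACKUP DATABASE [%s] TO DISK = N'%s' WITH CHECKSUM;\n" "$db" "$file" ;;
        diff) printf "BACKUP DATABASE [%s] TO DISK = N'%s' WITH DIFFERENTIAL, CHECKSUM;\n" "$db" "$file" ;;
        log)  printf "BACKUP LOG [%s] TO DISK = N'%s' WITH CHECKSUM;\n" "$db" "$file" ;;
        *)    echo "unknown backup kind: $kind" >&2; return 1 ;;
    esac
}

# run_backup KIND -> executes the statement with sqlcmd.
# -E uses a trusted connection (swap for -U/-P if you use SQL logins);
# -b makes sqlcmd exit non-zero when the statement fails.
run_backup() {
    local kind="$1" stamp
    stamp=$(date +%Y%m%d-%H%M)
    sqlcmd -S "${SQL_SERVER:-localhost}" -E -b \
        -Q "$(backup_sql "$kind" "${SQL_DB:-RackWatch}" \
              "${BACKUP_DIR:-/var/backups/mssql}/rackwatch-$kind-$stamp.bak")"
}
```

Scheduling `run_backup full` nightly, `run_backup diff` every 6 hours, and `run_backup log` every 15 minutes reproduces the minimum cadence above. Log backups only succeed when the database is in FULL recovery mode.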
3. Restore
3.1 SQLite
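    # Stop the platform
    docker stop rackwatch-platform

    # Replace the db file; drop any stale WAL/SHM files left beside the old db
    gunzip -c /backup/rackwatch-20260501.db.gz > /data/rackwatch.db
    rm -f /data/rackwatch.db-wal /data/rackwatch.db-shm

    # Restart
    docker start rackwatch-platform

    # Verify the platform is healthy, then check the dashboard shows the expected hosts
    curl -fsSL http://localhost:5000/healthz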
RackWatch will auto-apply any pending schema migrations on startup. If your backup is from an older version, the restore + auto-migrate path is supported within the current major version and the immediately preceding major (EULA §6).
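Before overwriting the live database, it can be worth confirming that the snapshot itself is intact. The helper below is a sketch under the assumption that snapshots are gzip-compressed SQLite files, as in the backup example; `verify_snapshot` is a hypothetical name. It checks the gzip archive, then runs SQLite's `PRAGMA integrity_check` against a temporary decompressed copy.

```shell
# verify_snapshot FILE.gz
# Returns 0 only if the archive decompresses cleanly and SQLite
# reports the database as "ok".
verify_snapshot() {
    local gz="$1" tmp result
    gzip -t "$gz" || return 1                   # archive is readable
    tmp=$(mktemp) || return 1                   # needs room for the full db
    gunzip -c "$gz" > "$tmp" || { rm -f "$tmp"; return 1; }
    result=$(sqlite3 "$tmp" 'PRAGMA integrity_check;')
    rm -f "$tmp"
    [ "$result" = "ok" ]
}
```

Usage: `verify_snapshot /backup/rackwatch-20260501.db.gz && echo "safe to restore"`. Running the same check right after each backup catches a bad snapshot while older good ones are still retained.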
3.2 SQL Server
Run a standard RESTORE DATABASE, then make sure the database is reachable again at the address in your connection string. The platform will reconnect automatically; agents will resume posting on their next interval (60s default).
3.3 What you'll see right after restore
Risk scores recompute on the first agent check-in following restore — typically within 1–2 minutes for the first agents, full fleet visibility within 5 minutes. Patch lag and CVE mappings are computed on demand from the data the agents have already posted, so they reappear without intervention.
4. Disaster recovery — realistic targets
| Scenario | Realistic target | How |
|---|---|---|
| Platform host dies | RTO ≤ 30 min, RPO ≤ 1 hour | Provision a new host, pull the image, restore the latest hourly SQLite snapshot, point DNS or load balancer at the new host. Agents auto-resume. |
| DB corruption (rare with SQLite if you use the backup API correctly) | RTO ≤ 15 min, RPO = your backup interval | Stop platform, restore most recent good backup, restart. The sqlite3 .recover command can sometimes salvage data from a corrupt file by dumping it as SQL that you load into a fresh database file. |
| Whole rack / data center loss | RTO depends on cross-site replication; RPO = your offsite backup cadence | Restore from offsite backup at your DR site. Re-deploy agents pointed at the new platform URL — or better, use the same hostname so agents need no reconfiguration. |
| License key lost | RTO < 1 day | Email hello@rackwatch.io; we re-issue from the same Stripe customer record. Or fall back to TRIAL mode while you wait — fleet keeps working, banner just changes. |
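The RPO figures in this table only hold if snapshots are actually being produced on schedule. A minimal freshness check, assuming snapshots land in one directory with the rackwatch-*.db naming used earlier (`snapshot_within_rpo` is a hypothetical helper name):

```shell
# snapshot_within_rpo DIR MINUTES
# Returns 0 if at least one snapshot in DIR was modified within the
# last MINUTES minutes, non-zero otherwise.
snapshot_within_rpo() {
    local dir="$1" minutes="$2"
    [ -n "$(find "$dir" -maxdepth 1 -type f -name 'rackwatch-*.db*' \
              -mmin -"$minutes" | head -n 1)" ]
}
```

Usage: `snapshot_within_rpo /backup 60 || echo "no snapshot in the last hour; RPO target at risk" >&2`. Run it from the same scheduler that takes the backups, or from your existing monitoring.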
5. High availability
Honest read: RackWatch the platform is a single-process binary. There is no native clustering or active-active mode today, and there won't be one until customer demand justifies the architectural cost. Most fleets that buy server-monitoring software are fine with a 30-minute RTO; if yours isn't, here are the real options.
5.1 Cold standby (recommended for most)
Take regular SQLite (or SQL Server) backups offsite. If the platform host dies, provision a new host, restore, redirect DNS. Most teams already have this pattern for other internal services. RTO 15–30 min, RPO matches your backup interval.
5.2 Active-passive via shared volume
Define two platform instances against a shared volume (NFS, EFS, or block storage with single-attach failover). Only one is "active" at a time; the other stays down until the orchestrator brings it up. Works with Docker Swarm, Kubernetes StatefulSet + PersistentVolumeClaim, or systemd + corosync/pacemaker. Failover is automatic; RTO drops to 1–5 min depending on the orchestrator. If the database is SQLite, prefer single-attach block storage: SQLite's own documentation warns that file locking on network filesystems such as NFS can be unreliable.
Don't run two active containers writing to one SQLite file. SQLite's locking model assumes a single writer. The active-passive pattern is fine because only one writes at a time; the secondary is dormant until promoted.
5.3 SQL Server Always On
Switch DatabaseProvider=SqlServer and point the connection string at an Always On Availability Group listener. The platform reconnects automatically on AG failover. Pair with a load balancer in front of two platform instances and you have an active-active read path for the dashboard plus a single writer for state changes.
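As an illustration of what a listener-targeted connection string might look like, the sketch below assumes the platform accepts an ADO.NET (SqlClient) style connection string, which this page does not confirm. The listener host name and database name are placeholders, and credentials are left out on purpose.

```shell
# ag_connection_string LISTENER DATABASE
# Prints a SqlClient-style connection string aimed at an AG listener.
ag_connection_string() {
    local listener="$1" db="$2"
    # MultiSubnetFailover=True makes the client try all listener IPs in
    # parallel, which shortens reconnect time after an AG failover.
    printf 'Server=tcp:%s,1433;Database=%s;MultiSubnetFailover=True;\n' \
        "$listener" "$db"
}

# Placeholder listener and database names:
ag_connection_string rackwatch-ag.example.internal RackWatch
```

Append your authentication settings and pass the result to the platform through whichever connection-string environment variable your deployment already uses, alongside DatabaseProvider=SqlServer.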
5.4 What's NOT supported
- Active-active SQLite. Will corrupt the database. Don't do it.
- Multi-master writes. The platform assumes a single source of truth for fleet state.
- Auto-failover via the platform itself. Failover lives in your orchestrator (Kubernetes, Swarm, pacemaker), not in RackWatch.
6. Upgrades
Pull the new tag, restart the container. The platform applies schema migrations on startup; agent telemetry continues to queue at the agent end if the platform is down briefly during the swap.
    # Take a backup first
    sqlite3 /data/rackwatch.db ".backup '/backup/rackwatch-pre-upgrade.db'"

    # Pull + restart
    docker pull rackwatch/platform:latest
    docker stop rackwatch-platform
    docker rm rackwatch-platform
    docker run -d --name rackwatch-platform --restart=always \
      -p 5000:5000 -v /data:/data -e LICENSE_KEY="..." rackwatch/platform:latest
To roll back: stop the new container, restore the pre-upgrade backup, restart with the previous image tag (rackwatch/platform:v1.X.Y). The image registry retains every published version.
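The rollback steps can be scripted along these lines. This is a sketch, not an official tool: `rollback_platform` is a hypothetical name, the previous tag must be supplied by you, and the `docker run` flags mirror the upgrade example. `-e LICENSE_KEY` with no value passes the variable through from the calling shell's environment; add any other variables your deployment sets.

```shell
# rollback_platform PREV_TAG BACKUP_DB DATA_DIR
# Stops the current container, restores the pre-upgrade database, and
# starts the previous image tag.
rollback_platform() {
    local tag="$1" backup="$2" data_dir="$3"
    [ -f "$backup" ] || { echo "backup not found: $backup" >&2; return 1; }
    docker stop rackwatch-platform || true   # may already be stopped
    docker rm rackwatch-platform || true
    cp "$backup" "$data_dir/rackwatch.db" || return 1
    # Remove WAL/SHM files written by the newer version, if present.
    rm -f "$data_dir/rackwatch.db-wal" "$data_dir/rackwatch.db-shm"
    docker run -d --name rackwatch-platform --restart=always \
        -p 5000:5000 -v "$data_dir":/data -e LICENSE_KEY \
        "rackwatch/platform:$tag"
}
```

Usage: `rollback_platform <previous-tag> /backup/rackwatch-pre-upgrade.db /data`. Restoring the database matters because the newer version may already have applied schema migrations.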
7. Storage planning
Telemetry rows compress well. Empirical defaults:
- ~5 MB / host / month at the default 60-second sample interval.
- ~1.5 GB for a 100-host fleet over 90 days of retention.
- ~7.5 GB for 500 hosts at 90 days.
If storage is tight, drop retention to 30 days via the RETENTION_DAYS env var. The aggregate score and patch lag don't lose accuracy at shorter retention; only historical trend graphs do.
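For a quick estimate at other fleet sizes or retention settings, the per-host figure can be turned into a one-line calculation. This sketch assumes storage scales linearly with host count and retention and treats a month as 30 days; `estimate_storage_mb` is a hypothetical helper.

```shell
# estimate_storage_mb HOSTS RETENTION_DAYS -> approximate database size in MB
estimate_storage_mb() {
    local hosts="$1" days="$2"
    # ~5 MB per host per 30-day month; integer division rounds down.
    echo $(( hosts * days * 5 / 30 ))
}

estimate_storage_mb 100 90   # prints 1500 (about 1.5 GB)
estimate_storage_mb 100 30   # prints 500
```

Treat the result as a planning number only; a non-default sample interval will shift it.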
8. Monitoring the monitor
Standard pattern: have something else check the platform's /healthz endpoint and alert if it's down.
- External: a free uptime checker (Better Uptime, Updown, Healthchecks.io) hitting your platform URL every minute.
- Internal: a Prometheus blackbox probe, a cron + curl on a separate host, or a simple systemd timer that pages on failure.
- Heartbeat-from-agent: if no agent telemetry has arrived in 5 minutes, something is wrong with either the agent or the platform; your monitoring tool should already catch this from the agent side.
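The cron + curl option can be as small as the function below. It is a sketch: `check_platform` is a hypothetical name, and what happens on failure (paging, chat webhook, email) is left to whatever alerting you already run.

```shell
# check_platform BASE_URL
# Returns 0 if /healthz responds without an HTTP error within 10 seconds;
# otherwise logs a timestamped line to stderr and returns 1.
check_platform() {
    local url="$1"
    if curl -fsS --max-time 10 -o /dev/null "$url/healthz"; then
        return 0
    fi
    echo "$(date -u +%FT%TZ) RackWatch health check failed: $url" >&2
    return 1
}
```

Usage from a separate host: `check_platform http://rackwatch.example.internal:5000 || your-alert-command`, where both the host name and the alert command are placeholders. Running it on a different machine matters: a check that lives on the platform host goes down with it.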
The platform also writes startup and license-state lines to stdout. Forward those to your log aggregator if you have one.
9. Where to ask
Operations questions, edge cases, post-mortems, or "is this the right architecture for my fleet" sanity checks: email hello@rackwatch.io. We reply within one business day.
For procurement teams: this page plus the security policy and EULA §5 (continuity & data portability) are usually what audit asks for. We'll happily sign a DPA or answer a security questionnaire — just write in.