Operations

Backup, restore, and high availability.

RackWatch is a single binary writing to a single database. That makes operations refreshingly simple, but you should still have a backup and a recovery story before you put it in front of an audit. Here are both, end to end.

Last updated: 2026-05-03

1. What state lives where

One process, one DB, one volume. That's the whole picture.

What	Where	Backup priority
Fleet telemetry, alerts, history	SQLite at `/data/rackwatch.db` (default) or your SQL Server instance.	Critical
Configuration	Environment variables on the platform container — `LICENSE_KEY`, `AGENT_API_KEY`, `DatabaseProvider`, connection strings.	Critical (back up your `docker-compose.yml` or systemd unit; the values are secrets).
License key	In the env var above. Also in the email RackWatch sent at subscription time.	Replaceable — you can re-request from hello@rackwatch.io if lost.
Agent state	Stateless. The agent reads its config, scans the host, and posts to the platform. Nothing to back up.	None

If the platform DB and your env-var file both survive, every other piece of RackWatch can be reconstructed from a fresh image pull.

2. Backups

2.1 SQLite (default)

SQLite ships with a hot-backup API. You can take a consistent snapshot of a running database without stopping the platform:

# Inside the platform container, or wherever the db lives
SNAP="/backup/rackwatch-$(date +%Y%m%d).db"
sqlite3 /data/rackwatch.db ".backup '$SNAP'"

# Compress + ship offsite (one named file; `aws s3 cp` takes a single source, not a glob)
gzip "$SNAP"
aws s3 cp "$SNAP.gz" s3://your-bucket/rackwatch/

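Schedule via cron (host) or a Kubernetes CronJob. A reasonable cadence is hourly for fleets above 50 hosts, daily for smaller ones. For hourly runs, add the hour to the filename (for example `date +%Y%m%d-%H`) so snapshots from the same day don't collide. Keep at least 30 days of snapshots; the DB is modest in size (see section 7 for estimates).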
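The snapshot, compress, and retention steps above can be combined into one cron job. What follows is a minimal sketch, assuming bash, the `sqlite3` CLI, and a `find` that supports `-delete`. The function names, the `run` argument, the script path in the cron line, and the `KEEP_DAYS` variable are illustrative and not part of RackWatch.

```shell
#!/usr/bin/env bash
# Sketch of a snapshot job. Paths and names below are assumptions.
DB_PATH="${DB_PATH:-/data/rackwatch.db}"
BACKUP_DIR="${BACKUP_DIR:-/backup}"
KEEP_DAYS="${KEEP_DAYS:-30}"

# Build a snapshot name with hour-and-minute precision so hourly runs never collide.
snapshot_name() {
  printf '%s/rackwatch-%s.db' "$BACKUP_DIR" "$(date +%Y%m%d-%H%M)"
}

# Take a consistent online snapshot, then compress it.
take_snapshot() {
  local snap
  snap="$(snapshot_name)"
  sqlite3 "$DB_PATH" ".backup '$snap'" || return 1
  gzip -f "$snap" || return 1
  printf '%s.gz\n' "$snap"
}

# Delete compressed snapshots older than KEEP_DAYS days.
prune_snapshots() {
  find "$BACKUP_DIR" -maxdepth 1 -name 'rackwatch-*.db.gz' \
    -mtime +"$KEEP_DAYS" -delete
}

# Only act when called with "run", e.g. from cron:
#   0 * * * * /usr/local/bin/rackwatch-backup.sh run
if [[ "${1:-}" == "run" ]]; then
  take_snapshot && prune_snapshots
fi
```

Pruning runs only after a successful snapshot, so a failing backup job never deletes the last good copies.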

2.2 SQL Server

Use whatever your DBA team already does for production SQL Server. The minimum we recommend:

Full backup nightly to a remote share or blob.
Differential backup every 6 hours.
Transaction log backup every 15 minutes if you're running under the FULL recovery model and need point-in-time recovery.

Standard BACKUP DATABASE works — RackWatch doesn't use anything exotic. The schema is around two dozen tables; the index footprint is small.
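As a rough illustration of the three tiers above, the sketch below emits the matching T-SQL statements so they can be handed to `sqlcmd` from a scheduler. The database name `RackWatch`, the backup directory, and the `SQL_HOST` variable are placeholders, not RackWatch defaults; substitute whatever your DBA team uses.

```shell
#!/usr/bin/env bash
# Sketch: print T-SQL for full, differential, and log backups.
# DB_NAME and BACKUP_ROOT are placeholders, not RackWatch defaults.
DB_NAME="${DB_NAME:-RackWatch}"
BACKUP_ROOT="${BACKUP_ROOT:-/var/opt/mssql/backup}"

# Usage: backup_sql full|diff|log
backup_sql() {
  local kind="${1:-}" stamp
  stamp="$(date +%Y%m%d-%H%M%S)"
  case "$kind" in
    full)
      printf "BACKUP DATABASE [%s] TO DISK = N'%s/%s-full-%s.bak' WITH CHECKSUM;\n" \
        "$DB_NAME" "$BACKUP_ROOT" "$DB_NAME" "$stamp" ;;
    diff)
      printf "BACKUP DATABASE [%s] TO DISK = N'%s/%s-diff-%s.bak' WITH DIFFERENTIAL, CHECKSUM;\n" \
        "$DB_NAME" "$BACKUP_ROOT" "$DB_NAME" "$stamp" ;;
    log)
      printf "BACKUP LOG [%s] TO DISK = N'%s/%s-log-%s.trn' WITH CHECKSUM;\n" \
        "$DB_NAME" "$BACKUP_ROOT" "$DB_NAME" "$stamp" ;;
    *)
      echo "unknown backup kind: $kind" >&2
      return 2 ;;
  esac
}

# Example invocation (needs sqlcmd and a reachable server):
#   sqlcmd -S "$SQL_HOST" -E -b -Q "$(backup_sql full)"
```

The `-b` flag makes `sqlcmd` exit non-zero when the batch fails, which lets cron or your scheduler notice a broken backup.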

3. Restore

3.1 SQLite

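# Stop the platform
docker stop rackwatch-platform

# Replace the db file and clear any stale WAL/SHM sidecar files left by the old database
gunzip -c /backup/rackwatch-20260501.db.gz > /data/rackwatch.db
rm -f /data/rackwatch.db-wal /data/rackwatch.db-shm

# Restart
docker start rackwatch-platform

# Verify the platform is healthy, then confirm the dashboard shows the expected hosts
curl -fsSL http://localhost:5000/healthz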

RackWatch will auto-apply any pending schema migrations on startup. Restoring a backup taken on an older version and letting it auto-migrate is supported when the backup comes from the current major version or the immediately preceding one (EULA §6).
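Before a snapshot replaces the live file, it is worth confirming that it is actually restorable. The sketch below is one way to do that, assuming `gzip` and the `sqlite3` CLI are available on the host; the function name `verify_snapshot` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check a compressed snapshot before restoring it.
# Usage: verify_snapshot /backup/rackwatch-20260501.db.gz
verify_snapshot() {
  local gz="$1" tmp result
  # 1. The gzip stream must be intact.
  gzip -t "$gz" 2>/dev/null || { echo "corrupt archive: $gz" >&2; return 1; }
  # 2. The database inside must pass SQLite's own consistency check.
  tmp="$(mktemp)" || return 1
  gunzip -c "$gz" > "$tmp"
  result="$(sqlite3 "$tmp" 'PRAGMA integrity_check;' 2>&1)"
  rm -f "$tmp"
  [[ "$result" == "ok" ]] || { echo "integrity check failed: $result" >&2; return 1; }
}
```

The check works on a temporary copy, so it never touches `/data/rackwatch.db` and can also run on a schedule against offsite backups.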

3.2 SQL Server

Standard RESTORE DATABASE + bring the connection string back online. The platform will reconnect automatically; agents will resume posting on their next interval (60s default).

3.3 What you'll see right after restore

Risk scores recompute on the first agent check-in following restore — typically within 1–2 minutes for the first agents, full fleet visibility within 5 minutes. Patch lag and CVE mappings are computed on demand from the data the agents have already posted, so they reappear without intervention.

4. Disaster recovery — realistic targets

Scenario	Realistic target	How
Platform host dies	RTO ≤ 30 min, RPO ≤ 1 hour	Provision a new host, pull the image, restore the latest hourly SQLite snapshot, point DNS or load balancer at the new host. Agents auto-resume.
DB corruption (rare with SQLite if you use the backup API correctly)	RTO ≤ 15 min, RPO = your backup interval	Stop platform, restore most recent good backup, restart. If no good backup exists, `sqlite3 /data/rackwatch.db .recover` can sometimes salvage rows from a corrupt file by emitting SQL that you load into a fresh database.
Whole rack / data center loss	RTO depends on cross-site replication; RPO = your offsite backup cadence	Restore from offsite backup at your DR site. Re-deploy agents pointed at the new platform URL — or better, use the same hostname so agents need no reconfiguration.
License key lost	RTO < 1 day	Email hello@rackwatch.io; we re-issue from the same Stripe customer record. Or fall back to TRIAL mode while you wait — fleet keeps working, banner just changes.
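An RPO target only holds if snapshots keep arriving. The sketch below checks that the newest `rackwatch-*.db.gz` file in a directory is younger than a given limit. It assumes GNU `stat` (Linux); the function name `check_rpo` and its exit codes are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: fail when the newest snapshot is older than the RPO target.
# Usage: check_rpo <backup_dir> <max_age_seconds>
# Returns 0 if fresh, 1 if stale, 2 if no snapshots exist.
check_rpo() {
  local dir="$1" max_age="$2" newest=0 mtime f now
  for f in "$dir"/rackwatch-*.db.gz; do
    [[ -e "$f" ]] || continue            # glob matched nothing
    mtime="$(stat -c %Y "$f")"           # modification time, epoch seconds
    if (( mtime > newest )); then
      newest="$mtime"
    fi
  done
  (( newest > 0 )) || { echo "no snapshots in $dir" >&2; return 2; }
  now="$(date +%s)"
  if (( now - newest > max_age )); then
    echo "newest snapshot is $(( now - newest ))s old (limit ${max_age}s)" >&2
    return 1
  fi
}
```

For the one-hour RPO in the first row, `check_rpo /backup 3600` run from a separate cron job would flag a backup job that has silently stopped.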

5. High availability

Honest read: RackWatch the platform is a single-process binary. There is no native clustering or active-active mode today, and there won't be one until customer demand justifies the architectural cost. Most fleets that buy server-monitoring software are fine with a 30-minute RTO; if yours isn't, here are the real options.

5.1 Cold standby (recommended for most)

Take regular SQLite (or SQL Server) backups offsite. If the platform host dies, provision a new host, restore, redirect DNS. Most teams already have this pattern for other internal services. RTO 15–30 min, RPO matches your backup interval.

5.2 Active-passive via shared volume

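Run two platform containers on a shared volume (NFS, EFS, or block storage with single-attach failover). Only one is active at a time; the other stays stopped until the orchestrator brings it up. Works with Docker Swarm, Kubernetes StatefulSet + PersistentVolumeClaim, or systemd + corosync/pacemaker. Failover is automatic; RTO drops to 1–5 min depending on the orchestrator. Prefer single-attach block storage where you can: SQLite's file locking is unreliable on many network filesystems, so on NFS or EFS the orchestrator must guarantee the old instance is fully stopped before the new one starts.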

Don't run two active containers writing to one SQLite file. SQLite's locking model assumes a single writer. The active-passive pattern is fine because only one writes at a time; the secondary is dormant until promoted.

5.3 SQL Server Always On

Switch DatabaseProvider=SqlServer and point the connection string at an Always On Availability Group listener. The platform reconnects automatically on AG failover. Pair with a load balancer in front of two platform instances and you have an active-active read path for the dashboard plus a single writer for state changes.

5.4 What's NOT supported

Active-active SQLite. Will corrupt the database. Don't do it.
Multi-master writes. The platform assumes a single source of truth for fleet state.
Auto-failover via the platform itself. Failover lives in your orchestrator (Kubernetes, Swarm, pacemaker), not in RackWatch.

6. Upgrades

Pull the new tag, restart the container. The platform applies schema migrations on startup; agent telemetry continues to queue at the agent end if the platform is down briefly during the swap.

# Take a backup first
sqlite3 /data/rackwatch.db ".backup '/backup/rackwatch-pre-upgrade.db'"

# Pull + restart
docker pull rackwatch/platform:latest
docker stop rackwatch-platform
docker rm rackwatch-platform
docker run -d --name rackwatch-platform --restart=always \
  -p 5000:5000 -v /data:/data -e LICENSE_KEY="..." rackwatch/platform:latest

To roll back: stop and remove the new container, restore the pre-upgrade backup over /data/rackwatch.db, then start a container from the previous image tag (rackwatch/platform:v1.X.Y). The image registry retains every published version.
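The rollback steps can be scripted so they are rehearsed before they are needed. This is a sketch only: the `rollback` and `run` helpers and the `DRY_RUN` switch are hypothetical, and the container name, port, and volume simply mirror the upgrade commands above.

```shell
#!/usr/bin/env bash
# Sketch of the rollback steps. Set DRY_RUN=1 to print commands
# instead of running them.
run() {
  if [[ "${DRY_RUN:-0}" == "1" ]]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Usage: rollback <previous-image-tag> <pre-upgrade-backup-file>
rollback() {
  local tag="${1:-}" backup="${2:-}"
  [[ -n "$tag" && -n "$backup" ]] || { echo "usage: rollback TAG BACKUP" >&2; return 2; }
  run docker stop rackwatch-platform || return 1
  run docker rm rackwatch-platform || return 1
  # Put the pre-upgrade database back before the old image starts.
  run cp "$backup" /data/rackwatch.db || return 1
  # Drop leftover WAL/SHM sidecar files, if any; they belong to the newer database.
  run rm -f /data/rackwatch.db-wal /data/rackwatch.db-shm || return 1
  # "-e LICENSE_KEY" with no value forwards the variable from the calling shell.
  run docker run -d --name rackwatch-platform --restart=always \
    -p 5000:5000 -v /data:/data -e LICENSE_KEY "rackwatch/platform:$tag"
}
```

Running `DRY_RUN=1 rollback v1.X.Y /backup/rackwatch-pre-upgrade.db` prints the five commands in order without touching the container or the database.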

7. Storage planning

Telemetry rows compress well. Empirical defaults:

~5 MB / host / month at the default 60-second sample interval.
~1.5 GB for a 100-host fleet over 90 days of retention.
~7.5 GB for 500 hosts at 90 days.

If storage is tight, drop retention to 30 days via the RETENTION_DAYS env var. The aggregate score and patch lag don't lose accuracy at shorter retention; only historical trend graphs do.
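The sizing rule of thumb is simple enough to script. The sketch below applies the ~5 MB per host per month figure, treating a month as 30 days; the function name `estimate_storage_mb` is illustrative, and the result is an estimate, not a guarantee.

```shell
#!/usr/bin/env bash
# Sketch: estimate database size from the ~5 MB/host/month rule of thumb.
# Usage: estimate_storage_mb <hosts> <retention_days>
estimate_storage_mb() {
  local hosts="$1" days="$2" mb_per_host_month=5
  # Integer math; a month is treated as 30 days.
  echo $(( hosts * days * mb_per_host_month / 30 ))
}
```

For example, `estimate_storage_mb 100 90` prints 1500 (about 1.5 GB), and dropping a 500-host fleet to 30 days of retention gives `estimate_storage_mb 500 30`, which prints 2500.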

8. Monitoring the monitor

Standard pattern: have something else check the platform's /healthz endpoint and alert if it's down.

External: a free uptime checker (Better Uptime, Updown, Healthchecks.io) hitting your platform URL every minute.
Internal: a Prometheus blackbox probe, a cron + curl on a separate host, or a simple systemd timer that pages on failure.
Heartbeat-from-agent: if no agent telemetry has arrived in 5 minutes, something is wrong with either the agent or the platform — your monitoring tool should already catch this from the agent side.

The platform also writes startup and license-state lines to stdout. Forward those to your log aggregator if you have one.
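For the cron + curl option, a probe along the following lines is usually enough. It is a sketch under assumptions: the retry counts are arbitrary, the `CURL` override exists only so the probe can be exercised without a network, and whatever pages you on failure is left to your own tooling.

```shell
#!/usr/bin/env bash
# Sketch of a cron-friendly health probe for the platform.
HEALTH_URL="${HEALTH_URL:-http://localhost:5000/healthz}"
CURL="${CURL:-curl}"   # overridable so the probe can be tested offline

# Return 0 if /healthz answers without an HTTP error within 5 seconds.
platform_healthy() {
  "$CURL" -fsS --max-time 5 -o /dev/null "$HEALTH_URL"
}

# Retry a few times before declaring the platform down.
# Usage: probe [attempts] [delay_seconds]
probe() {
  local attempts="${1:-3}" delay="${2:-10}" i
  for (( i = 1; i <= attempts; i++ )); do
    platform_healthy 2>/dev/null && return 0
    if (( i < attempts )); then
      sleep "$delay"
    fi
  done
  echo "RackWatch /healthz failed $attempts checks: $HEALTH_URL" >&2
  return 1
}
```

Run it from a host other than the platform's own, so a dead platform host cannot take its monitor down with it. A non-zero exit from `probe` is the signal to hook into your paging tool.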

9. Where to ask

Operations questions, edge cases, post-mortems, or "is this the right architecture for my fleet" sanity checks: email hello@rackwatch.io. We reply within one business day.

For procurement teams: this page plus the security policy and EULA §5 (continuity & data portability) are usually what audit asks for. We'll happily sign a DPA or answer a security questionnaire — just write in.