High availability and failover
Windmill's architecture makes high availability straightforward: all application state lives in PostgreSQL. Servers and workers are stateless — they can be started, stopped, or moved between datacenters without any migration or state transfer.
Architecture overview
Windmill consists of three components:
- Servers — serve the API and UI, stateless
- Workers — execute jobs, stateless
- PostgreSQL — stores all state (job queue, scripts, flows, resources, schedules, audit logs)
Since servers and workers hold no local state, high availability reduces to ensuring PostgreSQL is highly available and that servers/workers can reach it.
Single-datacenter HA
For high availability within a single datacenter:
- Run multiple server replicas behind a load balancer. Any server can handle any request.
- Run multiple workers. Workers pull jobs from a shared queue — if one worker goes down, others continue processing.
- Use a managed PostgreSQL service or an HA PostgreSQL setup (e.g., CloudNativePG, Patroni, or your cloud provider's managed database).
No special Windmill configuration is needed. Windmill natively supports multiple servers and workers connecting to the same database.
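As a concrete illustration, a Docker Compose sketch of this setup might look as follows. The image tag, credentials, and hostnames are placeholders, not recommendations; the key point is that every replica shares the same `DATABASE_URL`:

```yaml
# Minimal sketch: multiple stateless server and worker replicas,
# all pointing at the same PostgreSQL database (illustrative values)
services:
  windmill_server:
    image: ghcr.io/windmill-labs/windmill:main
    deploy:
      replicas: 2            # any server can handle any request
    environment:
      - DATABASE_URL=postgres://windmill:changeme@db:5432/windmill
      - MODE=server
  windmill_worker:
    image: ghcr.io/windmill-labs/windmill:main
    deploy:
      replicas: 3            # workers pull jobs from the shared queue
    environment:
      - DATABASE_URL=postgres://windmill:changeme@db:5432/windmill
      - MODE=worker
```

Scaling up is then a matter of increasing the replica counts; no coordination between replicas is needed beyond the shared database.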
Multi-datacenter failover
For failover across two or more datacenters:
Recommended setup
- Primary datacenter: runs Windmill servers, workers, and the primary PostgreSQL instance.
- Secondary datacenter: runs a PostgreSQL replica (streaming replication). Windmill servers and workers are either pre-deployed (pointing at the primary DB) or ready to be spun up.
Failover procedure
When the primary datacenter becomes unavailable:
- Promote the PostgreSQL replica to primary in the secondary datacenter.
- Start (or redirect) Windmill servers and workers to point at the new primary database.
- Windmill will resume normal operation — workers will pick up queued jobs, servers will serve the API.
No data migration or state transfer is required on the Windmill side; the only thing that changes is the database connection string.
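The steps above can be sketched as follows. The commands are illustrative: the promotion step depends on your replication tooling (managed services and Kubernetes operators each have their own promote action), and the deployment names and credentials are placeholders.

```shell
# 1. Promote the standby in the secondary DC
#    (plain PostgreSQL shown; use your replication tooling's equivalent)
pg_ctl promote -D /var/lib/postgresql/data

# 2. Point Windmill at the new primary and restart
#    (Kubernetes example; names and credentials are placeholders)
kubectl set env deployment/windmill-server deployment/windmill-worker \
  DATABASE_URL="postgres://windmill:changeme@secondary-db:5432/windmill"
kubectl rollout restart deployment/windmill-server deployment/windmill-worker
```

If `DATABASE_URL` resolves through a DNS name that your failover tooling repoints, step 2 reduces to waiting for Windmill's connections to re-establish.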
What happens to in-flight jobs
- Queued jobs (not yet started) are stored in PostgreSQL and will be picked up by workers in the secondary datacenter.
- In-flight jobs (mid-execution when the primary went down) will not complete. They will appear as failed or timed-out. You can re-run them after failover.
- Scheduled jobs will resume on schedule once workers are running against the new primary database.
- Triggers (webhooks, SQS, Kafka, etc.) will resume once servers are running. Some trigger messages may need to be reprocessed depending on the trigger type's at-least-once guarantees.
Active-active (both datacenters running)
You can run Windmill workers in both datacenters simultaneously, all pointing at the primary PostgreSQL instance. This gives you:
- Faster failover — workers in the secondary datacenter are already running and start processing jobs as soon as they can reach the promoted database.
- Geographic distribution — jobs can be processed closer to data sources using worker groups and tags.
In this setup, only the database needs to fail over. Windmill servers and workers continue running and automatically reconnect.
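To route jobs to a specific datacenter, workers in each location can join a distinct worker group. A sketch, assuming Compose-style deployment (group names, image tag, and credentials are illustrative):

```yaml
# Workers in each DC join their own worker group; jobs tagged for a
# group run only on that group's workers (illustrative values)
services:
  worker_dc1:
    image: ghcr.io/windmill-labs/windmill:main
    environment:
      - DATABASE_URL=postgres://windmill:changeme@primary-db:5432/windmill
      - MODE=worker
      - WORKER_GROUP=dc1
  worker_dc2:
    image: ghcr.io/windmill-labs/windmill:main
    environment:
      - DATABASE_URL=postgres://windmill:changeme@primary-db:5432/windmill   # same primary DB
      - MODE=worker
      - WORKER_GROUP=dc2
```

Untagged jobs can run anywhere; tagged jobs stay close to their data sources.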
Database considerations
Connection string
Windmill connects to PostgreSQL via the `DATABASE_URL` environment variable. For failover, you have two options:
- DNS-based failover: point `DATABASE_URL` at a DNS name that gets updated during failover (e.g., a cloud provider's endpoint that follows the primary).
- Restart with new URL: update `DATABASE_URL` and restart Windmill servers/workers after promoting the replica.
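With the DNS-based option, the connection string itself never changes. A sketch (hostname and credentials are placeholders):

```shell
# Stable DNS name that your failover tooling repoints at the current primary
export DATABASE_URL="postgres://windmill:changeme@db.windmill.internal:5432/windmill"

# After promoting the replica, update the DNS record; Windmill processes
# reconnect on their next connection attempt without any config change.
```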
PostgreSQL replication
Windmill is compatible with any PostgreSQL replication solution:
- CloudNativePG (Kubernetes-native)
- Patroni (HA with automatic failover)
- AWS RDS Multi-AZ / Aurora
- Google Cloud SQL HA
- Azure Database for PostgreSQL with HA
Windmill does not require any special PostgreSQL extensions or configuration beyond standard PostgreSQL.
License key
Your Windmill license key is stored in the database and is not tied to a specific server or cluster. It will work on the secondary datacenter without any changes.
Kubernetes example
If you run Windmill on Kubernetes with CloudNativePG:
```yaml
# CloudNativePG cluster with replica in secondary DC
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: windmill-db
spec:
  instances: 3
  postgresql:
    parameters:
      max_connections: "200"
  bootstrap:
    initdb:
      database: windmill
      owner: windmill
```
Windmill's Helm chart can be deployed in both datacenters. In the secondary datacenter, keep replicas at 0 until failover, or run them actively pointing at the primary database.
```shell
# Secondary DC: deploy with 0 replicas (standby)
helm install windmill windmill/windmill \
  --set windmill.databaseUrl="postgres://windmill:password@primary-db:5432/windmill" \
  --set windmill.server.replicas=0 \
  --set windmill.worker.replicas=0

# During failover: scale up
kubectl scale deployment windmill-server --replicas=2
kubectl scale deployment windmill-worker --replicas=4
```
Git sync for disaster recovery
Git sync provides an additional layer of protection. When enabled, all scripts, flows, apps, and resources are version-controlled in a git repository. Even in a total database loss scenario, you can:
- Set up a fresh Windmill instance with a new database.
- Use `wmill sync push` to restore all your scripts, flows, and apps from git.
This restores your codebase but not job history, schedules, or runtime state.
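A recovery session with the `wmill` CLI might look like this. The workspace name and instance URL are placeholders, and the exact `wmill workspace add` arguments depend on your CLI version:

```shell
# Point the wmill CLI at the fresh instance (names/URL are placeholders)
wmill workspace add recovered recovered https://windmill.example.com

# From a checkout of the git-synced repository, push its contents
# (scripts, flows, apps, resources) into the new workspace
cd windmill-git-repo
wmill sync push
```

Schedules and job history are not in the repository, so they must be recreated or accepted as lost.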