# Scaling workers
Windmill uses a worker queue architecture where workers pull jobs from a shared queue and execute them one at a time. Understanding this pattern is essential for properly sizing your worker pool to meet your business requirements.
## How workers process jobs
Workers are autonomous processes that pull jobs from a queue in order of their scheduled time. Each worker:
- Executes one job at a time using full CPU and memory
- Pulls the next job as soon as the current one completes
- Can run up to 26 million jobs per month (at 100ms per job)
This architecture is horizontally scalable: add more workers to increase throughput, remove workers to reduce costs. There is no coordination overhead between workers.
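To make the pull model concrete, here is a minimal sketch of the loop each worker runs. It is a conceptual model only, not Windmill's actual implementation: `queue.pull()` and `run_job` are illustrative stand-ins.

```python
import time

def worker_loop(queue, run_job):
    """Conceptual model of a worker: pull the oldest scheduled job,
    run it to completion, then immediately pull the next one.
    (Illustrative sketch; not Windmill's internal code.)"""
    while True:
        job = queue.pull()   # hypothetical: next job by scheduled time, or None
        if job is None:
            time.sleep(0.1)  # idle briefly when the queue is empty
            continue
        run_job(job)         # one job at a time, full CPU and memory
```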
## Interactive simulator
Use this simulator to visualize how jobs flow through the queue and understand the relationship between job arrival rate, job duration, and worker count.
### Simulator modes
- Batch: All jobs are submitted at once, simulating scheduled bulk operations
- Continuous: Jobs arrive at a steady rate, simulating regular workloads
- Random: Jobs arrive at varying intervals, simulating unpredictable traffic
### Key metrics
- Elapsed time: Total time from first job to last completion
- Jobs/sec: Actual throughput achieved
- Worker occupancy: Percentage of time each worker spent processing (vs idle)
## Sizing your worker pool
The right number of workers depends on your specific requirements. Consider these factors:
### Job duration and arrival rate
The fundamental relationship is:
Required workers ≥ Job arrival rate × Average job duration
For example, if jobs arrive at 10/second and each takes 2 seconds:
- Minimum workers needed: 10 × 2 = 20 workers
With fewer workers, jobs will queue up. With more workers, some will be idle.
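This relationship is easy to encode as a quick sanity check (a minimal sketch; the function name is ours, not a Windmill API):

```python
import math

def min_workers(arrival_rate_per_s: float, avg_duration_s: float) -> int:
    """Smallest worker count whose combined throughput keeps up with arrivals."""
    return math.ceil(arrival_rate_per_s * avg_duration_s)

print(min_workers(10, 2))  # 20 workers for 10 jobs/s at 2 s each
```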
### Maximum acceptable queue time
If jobs must not wait more than X seconds before starting:
Required workers = (Peak arrival rate × Job duration) + (Peak arrival rate × Max queue time)
Example: Peak rate 5 jobs/sec, duration 3s, max wait 2s:
- Workers needed: (5 × 3) + (5 × 2) = 15 + 10 = 25 workers
This ensures even during peak load, no job waits more than 2 seconds.
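The same heuristic in code (again a sketch with illustrative names):

```python
import math

def workers_for_max_wait(peak_rate_per_s: float, duration_s: float,
                         max_wait_s: float) -> int:
    """Steady-state capacity plus a buffer for jobs that arrive during
    the allowed wait window, per the heuristic above."""
    return math.ceil(peak_rate_per_s * (duration_s + max_wait_s))

print(workers_for_max_wait(5, 3, 2))  # (5 * 3) + (5 * 2) = 25 workers
```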
### Handling traffic peaks
If your workload has predictable peaks (weekends, end of month, etc.):
- Fixed capacity: Size for peak load, accept idle workers during off-peak
- Autoscaling: Configure min/max workers to automatically adjust
## Practical examples
### Scenario 1: Batch ETL processing
Requirement: Process 1,000 daily reports, each taking 30 seconds, complete within 2 hours
- Total processing time: 1,000 × 30s = 30,000 seconds
- Available time: 2 hours = 7,200 seconds
- Minimum workers: 30,000 / 7,200 = 4.2 → 5 workers
With 5 workers, all jobs complete in approximately 100 minutes.
### Scenario 2: Real-time webhook processing
Requirement: Handle 100 webhooks/minute during business hours, each taking 5 seconds, max latency 10 seconds
- Arrival rate: 100/60 = 1.67 jobs/second
- Minimum workers: 1.67 × 5 = 8.3 → 9 workers
- With headroom to keep latency under the 10s target: 10 workers
### Scenario 3: Weekend traffic spikes
Requirement: Normal load 2 jobs/sec, weekend peaks at 8 jobs/sec, jobs take 1 second each
- Normal load: 2 × 1 = 2 workers minimum
- Peak load: 8 × 1 = 8 workers minimum
- Recommended: Use autoscaling with min=3, max=10
Configure autoscaling to scale up when queue depth increases and scale down when occupancy drops below 25%.
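All three scenarios reduce to the same arithmetic; a few lines of Python reproduce the numbers above (worker counts always round up, since fractional workers don't exist):

```python
import math

# Scenario 1: 1,000 jobs x 30 s each, within a 7,200 s window
print(math.ceil(1_000 * 30 / 7_200))  # 5 workers
# Scenario 2: 100 webhooks/min at 5 s each
print(math.ceil(100 / 60 * 5))        # 9 workers minimum
# Scenario 3: weekend peak of 8 jobs/s at 1 s each
print(math.ceil(8 * 1))               # 8 workers at peak
```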
## Priority queues with worker groups
For mixed workloads where some jobs are more time-sensitive:
- Create separate worker groups with different tags
- Assign high-priority jobs to dedicated workers
- Let lower-priority jobs share remaining capacity
Example configuration:
- `high-priority` worker group: 5 dedicated workers, handles critical customer-facing operations
- `default` worker group: 10 workers, handles everything else
- `low-priority` worker group: 3 workers, handles background analytics
This ensures critical jobs are never blocked by bulk operations.
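The routing behavior can be modeled in a few lines. This is a conceptual sketch of tag-based pulling, not Windmill's API (the queue structure and `pull_for_group` helper are ours):

```python
from collections import deque

# One logical queue per tag; a worker group only pulls jobs whose tag
# matches one of its own tags.
queues = {"high-priority": deque(), "default": deque(), "low-priority": deque()}

def pull_for_group(group_tags):
    """Return the next job this worker group may run, or None if idle."""
    for tag in group_tags:
        if queues[tag]:
            return queues[tag].popleft()
    return None

queues["high-priority"].append({"id": 1, "path": "f/billing/charge"})
print(pull_for_group(["high-priority"]))            # dedicated workers get it
print(pull_for_group(["default", "low-priority"]))  # None: nothing queued for them
```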
## Monitoring and alerting
Track worker performance to identify scaling needs:
- Queue metrics: Monitor delayed jobs per tag and queue wait times
- Occupancy rates: High sustained occupancy (>75%) suggests adding workers
- Worker alerts: Get notified when workers go offline
## Autoscaling configuration
For dynamic workloads, configure autoscaling to automatically adjust worker count:
| Parameter | Recommended starting value |
|---|---|
| Min workers | Base arrival rate × average job duration |
| Max workers | Peak arrival rate × average job duration × 1.5 |
| Scale-out trigger | 75% occupancy or jobs waiting > min_workers |
| Scale-in trigger | Less than 25% occupancy for 5+ minutes |
| Cooldown | 60-120 seconds between scaling events |
The autoscaling algorithm checks every 30 seconds and considers:
- Number of jobs waiting in queue
- Worker occupancy rates over 15s, 5m, and 30m intervals
- Cooldown periods to prevent thrashing
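Put together, the decision logic looks roughly like this (a hedged sketch built from the table's starting values; parameter names are illustrative, not Windmill's actual settings):

```python
import time

def autoscale_decision(queued_jobs, occupancy_5m, workers, cfg, last_event_ts):
    """Return +1 to scale out, -1 to scale in, 0 to hold."""
    if time.time() - last_event_ts < cfg["cooldown_s"]:
        return 0  # respect the cooldown to prevent thrashing
    if workers < cfg["max_workers"] and (
        occupancy_5m > 0.75 or queued_jobs > cfg["min_workers"]
    ):
        return +1  # queue building up or workers saturated
    if workers > cfg["min_workers"] and occupancy_5m < 0.25:
        return -1  # sustained low occupancy over the 5-minute window
    return 0

cfg = {"min_workers": 3, "max_workers": 10, "cooldown_s": 90}
print(autoscale_decision(queued_jobs=6, occupancy_5m=0.8,
                         workers=4, cfg=cfg, last_event_ts=0))  # +1: scale out
```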
## Worker memory sizing
Workers come in different sizes based on memory limits. The right size depends on your job requirements:
| Worker size | Memory | Compute units |
|---|---|---|
| Small | 1GB | 0.5 CU |
| Standard | 2GB | 1 CU |
| Large | >2GB | 2 CU (self-hosted capped at 2 CU regardless of actual memory) |
### Choosing the right memory limit
Set worker memory based on the maximum memory any individual job will need, plus some headroom:
- Simple API calls, webhooks, light scripts: 1-2GB is typically sufficient
- Data processing, ETL jobs: May need 4GB+ depending on data volume processed in memory
- Large file processing, ML inference: Consider 8GB+ for memory-intensive operations
If a job exceeds the worker's memory limit, it will be killed by the operating system. Monitor job memory usage and increase worker memory if you see OOM (out of memory) errors.
### Memory vs worker count trade-off
For the same compute budget, you can choose between:
- More small workers: Better parallelism for many short jobs
- Fewer large workers: Better for memory-intensive jobs that can't be parallelized
Example: 4 CUs can be configured as:
- 8 small workers (1GB each) - good for high-volume, light jobs
- 4 standard workers (2GB each) - balanced configuration
- 2 large workers (4GB each) - good for memory-intensive ETL
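A small helper makes the trade-off explicit (a sketch; the CU mapping follows the sizing table above, with 1GB = 0.5 CU and a 2 CU cap):

```python
def cu_per_worker(memory_gb: float) -> float:
    """CU cost of one worker: memory_gb / 2, capped at 2 CU."""
    return min(memory_gb / 2, 2.0)

budget_cu = 4
for mem_gb in (1, 2, 4):
    n = int(budget_cu / cu_per_worker(mem_gb))
    print(f"{mem_gb} GB workers: {n}")
# 1 GB workers: 8, 2 GB workers: 4, 4 GB workers: 2
```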
## Cost optimization
Worker billing is based on usage time with minute granularity:
- 10 workers for 1/10th of the month costs the same as 1 worker for the full month
- Use autoscaling to minimize idle workers
- Consider dedicated workers for high-throughput single-script scenarios
Mark development and staging instances as "Non-prod" in instance settings so they don't count toward your compute limits.