Comparison of orchestration engines - Methodology
Note: Airflow and Prefect only support Python; Airflow, Prefect, and Kestra do not support Go natively.
For additional insights about benchmark methodology, refer to our blog post.
In this benchmark study, we compared six job orchestration engines: Airflow, Prefect, Temporal, Kestra, Hatchet, and Windmill, focusing on performance across several scenarios. The aim was to evaluate not just raw task execution time, but also deeper engine-level behaviors such as scheduling efficiency, task dispatch latency, and worker utilization.
We chose to compute Fibonacci numbers as a simple task that can easily be run on each of these orchestrators. Given that Airflow has first-class support for Python, we used Python as the common language for the initial benchmarks. The function in charge of computing the Fibonacci numbers was kept very naive (see the task definition below).
Benchmark use cases
We defined three categories of workflow scenarios:
- Lightweight tasks: Simulates high-frequency, short-lived operations where engine overhead may dominate.
- Long-running tasks: Designed to surface runtime performance and engine efficiency when task duration is significant.
- Multi-worker scenarios: For engines demonstrating high efficiency and Go support, we ran:
  - 400 lightweight tasks
  - 100 long-running tasks

These were distributed across multiple workers, examining parallelism, load balancing, and assignment latency (a sketch of how such scenarios can be expressed follows below).
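To make the shape of these scenarios concrete, here is a minimal sketch of how the lightweight and long-running cases could be expressed as flows, using Prefect 2-style @flow/@task decorators and the naive Fibonacci function defined in the next section. This is illustrative only and not the actual benchmark code (the real definitions live in the windmill-benchmarks repository); the flow and task names are ours.

from prefect import flow, task


def fibo(n: int) -> int:
    # Naive recursive Fibonacci, as defined in the task definition below.
    if n <= 1:
        return n
    return fibo(n - 1) + fibo(n - 2)


@task
def fibo_task(n: int) -> int:
    # One orchestrated task = one Fibonacci computation.
    return fibo(n)


@flow
def lightweight_flow(task_count: int = 400, n: int = 10):
    # Many short tasks: engine overhead, not compute, dominates here.
    for _ in range(task_count):
        fibo_task(n)


@flow
def long_running_flow(task_count: int = 100, n: int = 33):
    # Fewer, heavier tasks: most of the time is spent inside fibo itself.
    for _ in range(task_count):
        fibo_task(n)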
Task definition
To ensure simplicity, repeatability, and no external dependencies, we used the classic recursive Fibonacci function:
- Python
- JavaScript
- Go
def fibo(n: int):
    if n <= 1:
        return n
    else:
        return fibo(n - 1) + fibo(n - 2)
function fibo(n) {
  if (n <= 1) {
    return n;
  }
  return fibo(n - 1) + fibo(n - 2);
}
func fibo(n int) int {
    if n <= 1 {
        return n
    }
    return fibo(n - 1) + fibo(n - 2)
}
fibo(10) was used for lightweight tasks, with an average execution time of ~10ms. fibo(33/38) was selected for long-running tasks, typically taking several hundred milliseconds.
This approach eliminates the need for external libraries, providing a level playing field and highlighting the core performance of the orchestration engines.
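As a rough, local sanity check of the baseline compute cost of these calls (outside of any orchestrator), the function can be timed directly. The snippet below is a minimal sketch and is not part of the benchmark harness.

import time

def fibo(n: int) -> int:
    if n <= 1:
        return n
    return fibo(n - 1) + fibo(n - 2)

for n in (10, 33):
    start = time.perf_counter()
    fibo(n)
    elapsed = time.perf_counter() - start
    # Prints only the raw compute time; execution times reported by an
    # orchestrator may additionally include per-task overhead.
    print(f"fibo({n}): {elapsed * 1000:.1f} ms")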
Language and runtime environment
Given native Python support in Airflow, Python was used as the primary implementation language for initial benchmarks. For orchestrators supporting multiple runtimes, we expanded testing to:
- JavaScript (where supported)
- Go for its speed, concurrency features, and lack of warmup latency
For Go-enabled engines, we also evaluated multi-worker configurations to explore scaling behavior.
Infrastructure setup
To standardize the environment, each orchestrator was deployed using its recommended docker-compose.yml setup, running on AWS m4.large instances. This provided a balanced compute and memory profile while ensuring consistency across platforms.
Performance evaluation metrics
The benchmarking framework was designed to expose both high-level throughput and low-level engine characteristics.
Key metrics observed:
- Execution time: the time it takes for the orchestrator to execute a task once it has been assigned to an executor
- Assignment time: the time it takes for a task to be assigned to an executor once it has been created in the queue
- Transition time: the time it takes to create the following task once the previous one has finished
- Worker load distribution: whether tasks were evenly distributed or exhibited contention/idling.
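For the last metric, a minimal sketch of how worker load can be summarized, assuming each task record carries the ID of the worker it ran on (a field we chose for illustration, not something every engine exposes under that name):

from collections import Counter

def worker_load(worker_ids: list[str]) -> dict[str, int]:
    # Number of tasks executed by each worker; a heavily skewed
    # distribution indicates contention or idling workers.
    return dict(Counter(worker_ids))

print(worker_load(["w1", "w1", "w2", "w1", "w3"]))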
Observational expectations:
- Short-running tasks: Performance is expected to be dominated by orchestration overhead, making it a strong indicator of engine efficiency in high-frequency workflows.
- Long-running tasks: Majority of time should be spent on actual computation, with minimal overhead from task management and worker assignment.
Extraction of timings
The timings were extracted either through each orchestrator's exposed API or by querying its database directly. For most of the engines, the following timestamps could be extracted:
- Reference time (t0): workflow start time
- created_at: task added to queue / scheduled time
- started_at: task assigned to worker
- completed_at: task finished
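As an illustration of how the metrics above can be derived from these timestamps, here is a minimal sketch. The helper is our own, not one of the extraction scripts from the windmill-benchmarks repository, and it assumes the tasks of a workflow are ordered by creation time; the field names mirror the timestamps listed above.

from dataclasses import dataclass
from datetime import datetime

@dataclass
class TaskTimings:
    # Timestamps as extracted from an engine's API or database.
    created_at: datetime    # task added to queue / scheduled
    started_at: datetime    # task assigned to a worker
    completed_at: datetime  # task finished

def derive_metrics(tasks: list[TaskTimings]) -> dict[str, list[float]]:
    # Assignment time: queue creation -> assignment to a worker.
    assignment = [(t.started_at - t.created_at).total_seconds() for t in tasks]
    # Execution time: assignment to a worker -> task completion.
    execution = [(t.completed_at - t.started_at).total_seconds() for t in tasks]
    # Transition time: previous task completed -> next task created.
    transition = [
        (nxt.created_at - prev.completed_at).total_seconds()
        for prev, nxt in zip(tasks, tasks[1:])
    ]
    return {"assignment": assignment, "execution": execution, "transition": transition}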
The scripts used to extract the data can be found in the windmill-benchmarks repository. We did not dive into the source code of each orchestration engine to check whether timestamp generation is consistent across all engines or whether there are slight differences.