
Airflow vs Prefect vs Temporal vs Kestra vs Windmill


We compared Airflow, Prefect, Temporal, Kestra and Windmill on the following use cases:

  • One flow composed of 40 lightweight tasks.
  • One flow composed of 10 long-running tasks.
More context

For additional insights about this study, refer to our blog post.

We chose to compute Fibonacci numbers as a simple task that can easily be run on each orchestrator. Given that Airflow has first-class support for Python, we used Python for all five orchestrators. The function in charge of computing the Fibonacci numbers was very naive:

def fibo(n: int):
    if n <= 1:
        return n
    else:
        return fibo(n - 1) + fibo(n - 2)

After some testing, we chose to compute fibo(10) for the lightweight tasks (taking around 10ms in our setup), and fibo(33) for what we called "long-running" tasks (taking at least a few hundred milliseconds, as seen in the results).
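The exact durations depend on hardware, so as a rough sketch, one can time the function directly with `time.perf_counter` to see how the two sizes compare on a given machine (timings on your setup will differ):

```python
import time

def fibo(n: int):
    if n <= 1:
        return n
    else:
        return fibo(n - 1) + fibo(n - 2)

def time_fibo(n: int) -> float:
    """Wall-clock duration of a single fibo(n) call, in seconds."""
    start = time.perf_counter()
    result = fibo(n)
    elapsed = time.perf_counter() - start
    print(f"fibo({n}) = {result} in {elapsed * 1000:.1f} ms")
    return elapsed

if __name__ == "__main__":
    time_fibo(10)  # lightweight task
    time_fibo(33)  # "long-running" task
```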

On the infrastructure side, we kept it simple and used the docker-compose.yml recommended in the documentation of each orchestrator. We deployed the orchestrators on AWS m4.large instances.

Airflow setup

We set up Airflow version 2.7.3 using the docker-compose.yaml referenced in Airflow's official documentation.

The DAG was the following:

from datetime import datetime

from airflow import DAG
from airflow.decorators import task

ITER = 10    # respectively 40
FIBO_N = 33  # respectively 10

# fibo as defined above

with DAG(
    dag_id="bench_{}".format(ITER),
    schedule=None,
    start_date=datetime(2023, 1, 1),
    catchup=False,
    tags=["benchmark"],
) as dag:
    for i in range(ITER):
        @task(task_id=f"task_{i}")
        def task_module():
            return fibo(FIBO_N)

        fibo_task = task_module()

        if i > 0:
            previous_task >> fibo_task
        previous_task = fibo_task

Results

For 10 long running tasks run sequentially:

| Task | Created at | Started at | Completed at |
|---|---|---|---|
| task_00 | 0.000 | 4.347 | 6.910 |
| task_01 | 7.315 | 9.690 | 16.387 |
| task_02 | 16.545 | 18.361 | 20.077 |
| task_03 | 20.130 | 21.785 | 23.487 |
| task_04 | 23.869 | 25.319 | 27.463 |
| task_05 | 28.061 | 29.665 | 32.354 |
| task_06 | 33.210 | 34.996 | 37.498 |
| task_07 | 38.378 | 39.938 | 41.754 |
| task_08 | 42.366 | 43.933 | 45.887 |
| task_09 | 46.281 | 50.179 | 54.668 |

For 40 lightweight tasks run sequentially:

| Task | Created at | Started at | Completed at |
|---|---|---|---|
| task_00 | 0.000 | 4.335 | 4.752 |
| task_01 | 6.236 | 8.710 | 8.923 |
| task_02 | 9.792 | 11.117 | 11.320 |
| task_03 | 12.157 | 13.513 | 13.733 |
| task_04 | 13.804 | 15.413 | 15.622 |
| task_05 | 16.201 | 17.587 | 17.849 |
| task_06 | 18.902 | 20.227 | 20.432 |
| task_07 | 21.262 | 22.691 | 22.958 |
| task_08 | 24.015 | 25.349 | 25.558 |
| task_09 | 26.368 | 28.158 | 28.635 |
| task_10 | 29.361 | 31.035 | 31.357 |
| task_11 | 31.861 | 36.245 | 37.062 |
| task_12 | 38.868 | 42.180 | 42.388 |
| task_13 | 42.641 | 44.027 | 44.280 |
| task_14 | 45.321 | 46.676 | 46.877 |
| task_15 | 47.676 | 49.073 | 49.298 |
| task_16 | 50.432 | 51.786 | 51.999 |
| task_17 | 52.415 | 53.852 | 54.051 |
| task_18 | 54.155 | 55.564 | 55.771 |
| task_19 | 56.575 | 58.346 | 58.781 |
| task_20 | 59.254 | 60.999 | 61.355 |
| task_21 | 62.071 | 63.671 | 64.079 |
| task_22 | 64.366 | 66.011 | 66.442 |
| task_23 | 67.061 | 68.619 | 68.866 |
| task_24 | 69.601 | 71.842 | 72.303 |
| task_25 | 73.373 | 77.495 | 78.212 |
| task_26 | 78.428 | 79.896 | 80.134 |
| task_27 | 81.199 | 82.495 | 82.741 |
| task_28 | 83.665 | 84.958 | 85.153 |
| task_29 | 85.205 | 86.561 | 86.766 |
| task_30 | 87.690 | 89.357 | 89.778 |
| task_31 | 90.419 | 91.970 | 92.282 |
| task_32 | 93.024 | 94.610 | 95.031 |
| task_33 | 95.636 | 97.495 | 97.745 |
| task_34 | 98.857 | 100.626 | 100.877 |
| task_35 | 101.926 | 103.271 | 103.477 |
| task_36 | 103.915 | 105.523 | 105.875 |
| task_37 | 105.996 | 107.412 | 107.622 |
| task_38 | 108.409 | 112.610 | 113.214 |
| task_39 | 114.054 | 115.998 | 116.221 |

Prefect setup

We set up Prefect version 2.14.4. We wrote our own simple docker-compose since we couldn't find a recommended one in Prefect's documentation. We chose PostgreSQL as the database, as it is the recommended option for production use cases.

version: '3.8'

services:
  postgres:
    image: postgres:14
    restart: unless-stopped
    volumes:
      - db_data:/var/lib/postgresql/data
    expose:
      - 5432
    environment:
      POSTGRES_PASSWORD: changeme
      POSTGRES_DB: prefect
    healthcheck:
      test: ['CMD-SHELL', 'pg_isready -U postgres']
      interval: 10s
      timeout: 5s
      retries: 5

  prefect-server:
    image: prefecthq/prefect:2-latest
    command:
      - prefect
      - server
      - start
    ports:
      - 4200:4200
    depends_on:
      postgres:
        condition: service_started
    volumes:
      - ${PWD}/prefect:/root/.prefect
      - ${PWD}/flows:/flows
    environment:
      PREFECT_API_DATABASE_CONNECTION_URL: postgresql+asyncpg://postgres:changeme@postgres:5432/prefect
      PREFECT_LOGGING_SERVER_LEVEL: INFO
      PREFECT_API_URL: http://localhost:4200/api

volumes:
  db_data: null

The flow was defined using the following Python file.

from prefect import flow, task

ITER = 10    # respectively 40
FIBO_N = 33  # respectively 10

def fibo(n: int):
    if n <= 1:
        return n
    else:
        return fibo(n - 1) + fibo(n - 2)

@task
def fibo_task():
    return fibo(FIBO_N)

@flow(name="bench_{}".format(ITER))
def benchmark_flow():
    for i in range(ITER):
        fibo_task()

if __name__ == "__main__":
    benchmark_flow.serve(name="bench_{}".format(ITER))

Results

For 10 long running tasks:

| Task | Created at | Started at | Completed at |
|---|---|---|---|
| task_00 | 0.000 | 1.270 | 2.629 |
| task_01 | 2.673 | 2.703 | 4.059 |
| task_02 | 4.095 | 4.121 | 5.475 |
| task_03 | 5.508 | 5.534 | 6.916 |
| task_04 | 6.951 | 6.979 | 8.337 |
| task_05 | 8.373 | 8.401 | 9.816 |
| task_06 | 9.849 | 9.874 | 11.253 |
| task_07 | 11.287 | 11.313 | 12.675 |
| task_08 | 12.710 | 12.737 | 14.070 |
| task_09 | 14.102 | 14.129 | 15.489 |

For 40 lightweight tasks run sequentially:

| Task | Created at | Started at | Completed at |
|---|---|---|---|
| task_00 | 0.000 | 1.213 | 1.257 |
| task_01 | 1.294 | 1.321 | 1.362 |
| task_02 | 1.394 | 1.423 | 1.463 |
| task_03 | 1.496 | 1.522 | 1.558 |
| task_04 | 1.587 | 1.612 | 1.647 |
| task_05 | 1.676 | 1.700 | 1.738 |
| task_06 | 1.767 | 1.791 | 1.828 |
| task_07 | 1.858 | 1.882 | 1.943 |
| task_08 | 1.974 | 1.998 | 2.037 |
| task_09 | 2.068 | 2.093 | 2.131 |
| task_10 | 2.162 | 2.188 | 2.228 |
| task_11 | 2.260 | 2.292 | 2.330 |
| task_12 | 2.359 | 2.382 | 2.420 |
| task_13 | 2.449 | 2.476 | 2.517 |
| task_14 | 2.548 | 2.573 | 2.612 |
| task_15 | 2.640 | 2.670 | 2.713 |
| task_16 | 2.742 | 2.765 | 2.800 |
| task_17 | 2.828 | 2.851 | 2.886 |
| task_18 | 2.916 | 2.940 | 2.975 |
| task_19 | 3.004 | 3.028 | 3.066 |
| task_20 | 3.095 | 3.119 | 3.156 |
| task_21 | 3.187 | 3.211 | 3.247 |
| task_22 | 3.276 | 3.299 | 3.335 |
| task_23 | 3.364 | 3.389 | 3.427 |
| task_24 | 3.462 | 3.489 | 3.528 |
| task_25 | 3.557 | 3.579 | 3.613 |
| task_26 | 3.641 | 3.664 | 3.699 |
| task_27 | 3.726 | 3.751 | 3.788 |
| task_28 | 3.817 | 3.839 | 3.873 |
| task_29 | 3.900 | 3.921 | 4.004 |
| task_30 | 4.033 | 4.059 | 4.094 |
| task_31 | 4.123 | 4.151 | 4.185 |
| task_32 | 4.211 | 4.234 | 4.267 |
| task_33 | 4.293 | 4.315 | 4.349 |
| task_34 | 4.377 | 4.404 | 4.442 |
| task_35 | 4.470 | 4.492 | 4.526 |
| task_36 | 4.555 | 4.577 | 4.611 |
| task_37 | 4.638 | 4.661 | 4.696 |
| task_38 | 4.726 | 4.749 | 4.784 |
| task_39 | 4.814 | 4.838 | 4.872 |

Temporal setup

We set up Temporal version 2.19.0 using the docker-compose.yml from the official GitHub repository.

The flow was defined using the following Python file. We executed it on the EC2 instance, using Python 3.10.12.

import asyncio
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.client import Client
from temporalio.worker import Worker

ITER = 10    # respectively 40
FIBO_N = 33  # respectively 10

# fibo as defined above

@activity.defn
async def fibo_activity(n: int) -> int:
    return fibo(n)

@workflow.defn
class BenchWorkflow:
    @workflow.run
    async def run(self) -> None:
        for i in range(ITER):
            await workflow.execute_activity(
                fibo_activity,
                FIBO_N,
                activity_id="task_{}".format(i),
                start_to_close_timeout=timedelta(seconds=60),
            )

async def main():
    client = await Client.connect("localhost:7233")
    flow_name = "bench-{}".format(ITER)
    async with Worker(
        client,
        task_queue=flow_name,
        workflows=[BenchWorkflow],
        activities=[fibo_activity],
    ):
        await client.execute_workflow(
            BenchWorkflow.run,
            id=flow_name,
            task_queue=flow_name,
        )

if __name__ == "__main__":
    asyncio.run(main())

Results

For 10 long running tasks:

| Task | Created at | Started at | Completed at |
|---|---|---|---|
| task_00 | 0.000 | 0.012 | 1.357 |
| task_01 | 1.380 | 1.388 | 2.697 |
| task_02 | 2.720 | 2.729 | 4.034 |
| task_03 | 4.056 | 4.065 | 5.371 |
| task_04 | 5.394 | 5.403 | 6.711 |
| task_05 | 6.733 | 6.742 | 8.050 |
| task_06 | 8.074 | 8.083 | 9.388 |
| task_07 | 9.411 | 9.420 | 10.739 |
| task_08 | 10.762 | 10.773 | 12.086 |
| task_09 | 12.111 | 12.120 | 13.434 |

For 40 lightweight tasks run sequentially:

| Task | Created at | Started at | Completed at |
|---|---|---|---|
| task_00 | 0.000 | 0.009 | 0.016 |
| task_01 | 0.034 | 0.044 | 0.052 |
| task_02 | 0.072 | 0.079 | 0.087 |
| task_03 | 0.107 | 0.116 | 0.124 |
| task_04 | 0.144 | 0.153 | 0.161 |
| task_05 | 0.180 | 0.189 | 0.197 |
| task_06 | 0.218 | 0.227 | 0.235 |
| task_07 | 0.256 | 0.265 | 0.273 |
| task_08 | 0.296 | 0.305 | 0.312 |
| task_09 | 0.332 | 0.340 | 0.348 |
| task_10 | 0.367 | 0.376 | 0.383 |
| task_11 | 0.403 | 0.412 | 0.420 |
| task_12 | 0.440 | 0.449 | 0.457 |
| task_13 | 0.486 | 0.498 | 0.507 |
| task_14 | 0.527 | 0.536 | 0.545 |
| task_15 | 0.565 | 0.574 | 0.583 |
| task_16 | 0.622 | 0.660 | 0.669 |
| task_17 | 0.721 | 0.759 | 0.768 |
| task_18 | 0.820 | 0.859 | 0.867 |
| task_19 | 0.920 | 0.959 | 0.967 |
| task_20 | 1.020 | 1.059 | 1.069 |
| task_21 | 1.122 | 1.159 | 1.167 |
| task_22 | 1.221 | 1.259 | 1.268 |
| task_23 | 1.321 | 1.360 | 1.368 |
| task_24 | 1.421 | 1.460 | 1.468 |
| task_25 | 1.521 | 1.560 | 1.568 |
| task_26 | 1.622 | 1.660 | 1.669 |
| task_27 | 1.721 | 1.759 | 1.767 |
| task_28 | 1.822 | 1.859 | 1.867 |
| task_29 | 1.921 | 1.960 | 1.969 |
| task_30 | 2.021 | 2.059 | 2.067 |
| task_31 | 2.121 | 2.160 | 2.168 |
| task_32 | 2.220 | 2.260 | 2.269 |
| task_33 | 2.322 | 2.359 | 2.368 |
| task_34 | 2.427 | 2.459 | 2.467 |
| task_35 | 2.522 | 2.559 | 2.568 |
| task_36 | 2.621 | 2.659 | 2.668 |
| task_37 | 2.721 | 2.759 | 2.768 |
| task_38 | 2.820 | 2.859 | 2.867 |
| task_39 | 2.921 | 2.959 | 2.967 |

Kestra setup

We set up Kestra version v0.22.3 using the docker-compose.yml from the official documentation. We made some adjustments to it to have a setup similar to the other orchestrators.

The flow we used to run the benchmarks is the following:

id: benchmark
namespace: company.team

inputs:
  - id: n
    type: INT
  - id: iters
    type: INT

tasks:
  - id: processIterations
    type: io.kestra.plugin.core.flow.ForEach
    values: '{{ range(0, inputs.iters - 1) }}'
    concurrencyLimit: 1
    tasks:
      - id: python
        type: io.kestra.plugin.scripts.python.Script
        containerImage: python:slim
        taskRunner:
          type: io.kestra.plugin.core.runner.Process
        script: |
          def fibo(n: int):
              if n <= 1:
                  return n
              else:
                  return fibo(n - 1) + fibo(n - 2)

          print(str(fibo({{ inputs.n }})))

We executed it once with n=33, iters=10 and once with n=10, iters=40. Note that we set the concurrency limit to 1, meaning all tasks run sequentially on one worker. Furthermore, no extra Python dependencies had to be installed during the execution of those flows, and we used a Process task runner to avoid starting a Docker container for each task execution.

Results

For 10 long running tasks:

| Task | Created at | Started at | Completed at |
|---|---|---|---|
| task_00 | 0.000 | 0.849 | 2.279 |
| task_01 | 2.324 | 2.373 | 3.764 |
| task_02 | 3.826 | 3.893 | 5.293 |
| task_03 | 5.330 | 5.383 | 6.775 |
| task_04 | 6.826 | 6.874 | 8.253 |
| task_05 | 8.295 | 8.358 | 9.759 |
| task_06 | 9.803 | 9.870 | 11.275 |
| task_07 | 11.323 | 11.379 | 12.822 |
| task_08 | 12.844 | 12.911 | 14.303 |
| task_09 | 14.344 | 14.390 | 15.786 |

For 40 lightweight tasks run sequentially:

| Task | Created at | Started at | Completed at |
|---|---|---|---|
| task_00 | 0.000 | 0.931 | 0.966 |
| task_01 | 1.007 | 1.037 | 1.073 |
| task_02 | 1.108 | 1.142 | 1.175 |
| task_03 | 1.220 | 1.246 | 1.286 |
| task_04 | 1.316 | 1.376 | 1.415 |
| task_05 | 1.437 | 1.485 | 1.523 |
| task_06 | 1.575 | 1.617 | 1.653 |
| task_07 | 1.674 | 1.697 | 1.735 |
| task_08 | 1.793 | 1.853 | 1.891 |
| task_09 | 1.924 | 1.984 | 2.018 |
| task_10 | 2.046 | 2.064 | 2.105 |
| task_11 | 2.155 | 2.220 | 2.258 |
| task_12 | 2.304 | 2.353 | 2.390 |
| task_13 | 2.450 | 2.510 | 2.550 |
| task_14 | 2.589 | 2.620 | 2.664 |
| task_15 | 2.702 | 2.754 | 2.789 |
| task_16 | 2.840 | 2.886 | 2.922 |
| task_17 | 2.978 | 3.017 | 3.053 |
| task_18 | 3.104 | 3.149 | 3.184 |
| task_19 | 3.259 | 3.307 | 3.344 |
| task_20 | 3.403 | 3.438 | 3.473 |
| task_21 | 3.548 | 3.596 | 3.634 |
| task_22 | 3.662 | 3.702 | 3.737 |
| task_23 | 3.790 | 3.835 | 3.871 |
| task_24 | 3.937 | 3.992 | 4.033 |
| task_25 | 4.102 | 4.152 | 4.190 |
| task_26 | 4.244 | 4.310 | 4.345 |
| task_27 | 4.412 | 4.469 | 4.505 |
| task_28 | 4.589 | 4.654 | 4.692 |
| task_29 | 4.752 | 4.811 | 4.844 |
| task_30 | 4.915 | 4.942 | 4.986 |
| task_31 | 5.048 | 5.074 | 5.110 |
| task_32 | 5.137 | 5.180 | 5.217 |
| task_33 | 5.270 | 5.311 | 5.349 |
| task_34 | 5.381 | 5.418 | 5.459 |
| task_35 | 5.490 | 5.525 | 5.566 |
| task_36 | 5.608 | 5.658 | 5.696 |
| task_37 | 5.730 | 5.765 | 5.804 |
| task_38 | 5.843 | 5.900 | 5.936 |
| task_39 | 5.965 | 6.006 | 6.044 |

Windmill setup

We set up Windmill version 1.483.1 using the docker-compose.yml from the official GitHub repository. We made some adjustments to it to have a setup similar to the other orchestrators. We set the number of workers to one and removed the native workers since they would have gone unused.

We executed the Windmill benchmarks in both "normal" and "dedicated worker" modes. To implement the two flows in Windmill, we first created a script simply computing the Fibonacci numbers:

# Windmill script: `u/benchmarkuser/fibo_script`
def fibo(n: int):
    if n <= 1:
        return n
    else:
        return fibo(n - 1) + fibo(n - 2)

def main(
    n: int,
):
    return fibo(n)

And then we used this script in a simple flow composed of a for-loop sequentially executing it. The YAML representation of the flow is as follows:

summary: Fibonacci benchmark flow
description: Flow running 10 (resp. 40) times Fibonacci of 33 (resp. 10)
value:
  modules:
    - id: a
      value:
        type: forloopflow
        modules:
          - id: b
            value:
              path: u/admin/fibo_script
              type: script
              input_transforms:
                n:
                  type: static
                  value: 33 # respectively 10
        iterator:
          expr: Array(10) # respectively 40
          type: javascript
        parallel: false
        skip_failures: true
schema:
  '$schema': https://json-schema.org/draft/2020-12/schema
  properties: {}
  required: []
  type: object

Results

For 10 long running tasks in normal mode:

| Task | Created at | Started at | Completed at |
|---|---|---|---|
| task_00 | 0.000 | 0.002 | 0.846 |
| task_01 | 0.858 | 0.906 | 1.705 |
| task_02 | 1.715 | 1.761 | 2.539 |
| task_03 | 2.548 | 2.595 | 3.365 |
| task_04 | 3.375 | 3.421 | 4.206 |
| task_05 | 4.215 | 4.263 | 5.033 |
| task_06 | 5.042 | 5.089 | 5.857 |
| task_07 | 5.866 | 5.913 | 6.684 |
| task_08 | 6.693 | 6.740 | 7.519 |
| task_09 | 7.529 | 7.579 | 8.347 |

For 40 lightweight tasks run sequentially in normal mode:

| Task | Created at | Started at | Completed at |
|---|---|---|---|
| task_00 | 0.000 | 0.003 | 0.059 |
| task_01 | 0.067 | 0.113 | 0.171 |
| task_02 | 0.180 | 0.226 | 0.280 |
| task_03 | 0.290 | 0.335 | 0.389 |
| task_04 | 0.398 | 0.446 | 0.501 |
| task_05 | 0.510 | 0.558 | 0.614 |
| task_06 | 0.622 | 0.669 | 0.725 |
| task_07 | 0.732 | 0.780 | 0.834 |
| task_08 | 0.842 | 0.889 | 0.942 |
| task_09 | 0.950 | 0.997 | 1.052 |
| task_10 | 1.061 | 1.108 | 1.166 |
| task_11 | 1.175 | 1.220 | 1.274 |
| task_12 | 1.283 | 1.330 | 1.385 |
| task_13 | 1.394 | 1.440 | 1.494 |
| task_14 | 1.503 | 1.550 | 1.605 |
| task_15 | 1.612 | 1.661 | 1.716 |
| task_16 | 1.723 | 1.770 | 1.823 |
| task_17 | 1.831 | 1.878 | 1.930 |
| task_18 | 1.939 | 1.986 | 2.041 |
| task_19 | 2.049 | 2.096 | 2.152 |
| task_20 | 2.161 | 2.209 | 2.266 |
| task_21 | 2.274 | 2.320 | 2.376 |
| task_22 | 2.384 | 2.431 | 2.486 |
| task_23 | 2.495 | 2.542 | 2.596 |
| task_24 | 2.604 | 2.652 | 2.706 |
| task_25 | 2.715 | 2.761 | 2.816 |
| task_26 | 2.825 | 2.872 | 2.925 |
| task_27 | 2.933 | 2.979 | 3.033 |
| task_28 | 3.042 | 3.090 | 3.145 |
| task_29 | 3.154 | 3.201 | 3.269 |
| task_30 | 3.278 | 3.325 | 3.382 |
| task_31 | 3.391 | 3.437 | 3.493 |
| task_32 | 3.501 | 3.548 | 3.602 |
| task_33 | 3.611 | 3.660 | 3.715 |
| task_34 | 3.723 | 3.770 | 3.823 |
| task_35 | 3.833 | 3.879 | 3.934 |
| task_36 | 3.942 | 3.990 | 4.045 |
| task_37 | 4.053 | 4.101 | 4.157 |
| task_38 | 4.165 | 4.212 | 4.268 |
| task_39 | 4.277 | 4.324 | 4.383 |

In dedicated worker mode, we obtained the following results. For 10 long running tasks:

| Task | Created at | Started at | Completed at |
|---|---|---|---|
| task_00 | 0.000 | 0.023 | 0.745 |
| task_01 | 0.776 | 0.797 | 1.518 |
| task_02 | 1.546 | 1.571 | 2.292 |
| task_03 | 2.298 | 2.340 | 3.057 |
| task_04 | 3.063 | 3.114 | 3.845 |
| task_05 | 3.874 | 3.889 | 4.608 |
| task_06 | 4.614 | 4.661 | 5.380 |
| task_07 | 5.385 | 5.433 | 6.151 |
| task_08 | 6.158 | 6.208 | 6.925 |
| task_09 | 6.933 | 6.981 | 7.701 |

And for the 40 lightweight tasks:

| Task | Created at | Started at | Completed at |
|---|---|---|---|
| task_00 | 0.000 | 0.019 | 0.022 |
| task_01 | 0.029 | 0.073 | 0.077 |
| task_02 | 0.081 | 0.125 | 0.128 |
| task_03 | 0.134 | 0.179 | 0.182 |
| task_04 | 0.187 | 0.231 | 0.234 |
| task_05 | 0.239 | 0.284 | 0.287 |
| task_06 | 0.292 | 0.338 | 0.341 |
| task_07 | 0.345 | 0.391 | 0.394 |
| task_08 | 0.398 | 0.444 | 0.447 |
| task_09 | 0.451 | 0.497 | 0.500 |
| task_10 | 0.505 | 0.549 | 0.552 |
| task_11 | 0.557 | 0.603 | 0.606 |
| task_12 | 0.610 | 0.655 | 0.659 |
| task_13 | 0.663 | 0.709 | 0.712 |
| task_14 | 0.716 | 0.761 | 0.764 |
| task_15 | 0.768 | 0.814 | 0.817 |
| task_16 | 0.821 | 0.867 | 0.870 |
| task_17 | 0.876 | 0.921 | 0.924 |
| task_18 | 0.929 | 0.973 | 0.976 |
| task_19 | 0.981 | 1.027 | 1.030 |
| task_20 | 1.035 | 1.080 | 1.083 |
| task_21 | 1.087 | 1.132 | 1.135 |
| task_22 | 1.139 | 1.186 | 1.189 |
| task_23 | 1.193 | 1.238 | 1.241 |
| task_24 | 1.246 | 1.292 | 1.295 |
| task_25 | 1.299 | 1.345 | 1.348 |
| task_26 | 1.352 | 1.398 | 1.401 |
| task_27 | 1.405 | 1.451 | 1.454 |
| task_28 | 1.458 | 1.504 | 1.507 |
| task_29 | 1.512 | 1.557 | 1.560 |
| task_30 | 1.564 | 1.611 | 1.614 |
| task_31 | 1.618 | 1.664 | 1.667 |
| task_32 | 1.671 | 1.717 | 1.720 |
| task_33 | 1.724 | 1.770 | 1.773 |
| task_34 | 1.777 | 1.823 | 1.826 |
| task_35 | 1.830 | 1.876 | 1.879 |
| task_36 | 1.884 | 1.930 | 1.933 |
| task_37 | 1.937 | 1.983 | 1.986 |
| task_38 | 1.991 | 2.036 | 2.039 |
| task_39 | 2.043 | 2.089 | 2.092 |

Comparisons


At a macro level, it took Airflow 54.668s to execute the 10 long-running tasks, while Prefect took 15.489s, Temporal 13.434s, Kestra 15.78s and Windmill 8.347s in normal mode (7.701s in dedicated worker mode).

The same can be observed for the 40 lightweight tasks, where Airflow took a total of 116.221s, Prefect 4.872s, Temporal 2.967s, Kestra 6.04s and Windmill 4.383s in normal mode (2.092s in dedicated worker mode).

By far, Airflow is the slowest. Temporal, Prefect and Kestra are faster, but not as fast as Windmill. For the 40 lightweight tasks, Windmill in normal mode was equivalent to Prefect and slightly slower than Temporal. This can be explained by the fact that the way Temporal works is closer to the way Windmill works in dedicated worker mode: Windmill in normal mode does a cold start for each task, and when the tasks are numerous and lightweight, most of the execution time ends up being taken by the cold start. In dedicated worker mode however, Windmill's behavior is closer to Temporal's, and we can see that the performance is similar, with a slight advantage for Windmill.

But we can dig a little deeper and compare the orchestrators across three categories:

  • Execution time: the time it takes for the orchestrator to execute a task once it has been assigned to an executor.
  • Assignment time: the time it takes for a task to be assigned to an executor once it has been created in the queue.
  • Transition time: the time it takes to create the following task once the previous one is finished.
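These three durations can be read straight off the timestamp tables: assignment is started − created, execution is completed − started, and transition is the gap between one task completing and the next being created. As a minimal sketch of that bookkeeping (the sample rows are the first three Temporal long-running tasks from the tables above):

```python
def breakdown(rows):
    """rows: (created, started, completed) timestamps in seconds, in task order.
    Returns total assignment, execution and transition durations."""
    assignment = sum(started - created for created, started, _ in rows)
    execution = sum(completed - started for _, started, completed in rows)
    # Gap between one task completing and the next one being created.
    transition = sum(nxt[0] - cur[2] for cur, nxt in zip(rows, rows[1:]))
    return assignment, execution, transition

# First three Temporal long-running tasks:
sample = [(0.000, 0.012, 1.357), (1.380, 1.388, 2.697), (2.720, 2.729, 4.034)]
assign, execute, transition = breakdown(sample)
print(f"assignment={assign:.3f}s execution={execute:.3f}s transition={transition:.3f}s")
```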

After looking at the macro numbers above, it's interesting to compare the time spent in each of the above categories, relative to the total time the orchestrator took to execute the flow.

For the 10 long running tasks flow, we see the following:

| | Airflow | Prefect | Temporal | Kestra | Windmill (normal) | Windmill (dedicated worker) |
|---|---|---|---|---|---|---|
| Total duration (in seconds) | 54.668 | 15.489 | 13.434 | 15.78 | 8.347 | 7.701 |
| Assignment | 40.36% | 9.77% | 0.71% | 8.66% | 5.15% | 4.82% |
| Execution | 51.72% | 88.18% | 97.74% | 88.86% | 93.83% | 93.55% |
| Transition | 7.93% | 2.05% | 1.55% | 2.49% | 1.02% | 1.62% |

The proportion of time spent in execution is important here since each task takes a long time to run. We see that Airflow and Prefect spend a lot of time assigning the tasks compared to the others (looking at the actual numbers, both spend a long time assigning the first task, after which assignment durations decrease). Kestra's assignment and transition durations are somewhere in the middle, and it spends most of its time in the execution phase. Airflow remains relatively slow though, while Prefect and Kestra reach decent performance; the exact same can be observed for the 40-task workflow below. Temporal and Windmill in normal mode are pretty similar. Windmill in dedicated worker mode is incredibly fast at executing the jobs, at the cost of spending a little more time in the assignment phase, but overall it is the fastest.

If we look at the 40 lightweight tasks flow, we have:

| | Airflow | Prefect | Temporal | Kestra | Windmill (normal) | Windmill (dedicated worker) |
|---|---|---|---|---|---|---|
| Total duration (in seconds) | 116.221 | 4.872 | 2.967 | 6.04 | 4.383 | 2.092 |
| Assignment | 64.63% | 44.62% | 35.58% | 44.35% | 42.00% | 85.67% |
| Execution | 10.77% | 31.73% | 11.26% | 24.81% | 50.27% | 5.83% |
| Transition | 24.60% | 23.65% | 53.16% | 30.84% | 7.73% | 8.49% |

Here we see that Windmill spends a greater portion of time executing the tasks, which can be explained by the fact that Windmill runs a "cold start" for each task submitted to the worker. However, it is by far the fastest at transitioning to the next task. As observed above, Windmill in dedicated worker mode is lightning fast at executing the tasks, but takes more time assigning a task to a worker.

Conclusion

Airflow is the slowest in all categories, followed by Prefect. If you're looking for a high-performance job orchestrator, they do not seem to be the best option. Temporal, Kestra and Windmill perform better and are closer to each other, but in both scenarios Windmill comes out ahead, whether in normal or dedicated worker mode. If you're looking for a job orchestrator for various long-running tasks, Windmill in normal mode will be the most performant solution, optimizing the duration of each task while transitions and assignments remain a small portion of the overall workload. To run lightweight tasks at a very fast pace, Windmill in dedicated worker mode should be your preferred choice, provided that the tasks are similar. It is lightning fast at execution and assignment.

Appendix: Scaling Windmill

We performed those benchmarks with a single worker, assuming the capacity to process jobs would scale linearly with the number of workers deployed on the stack. We haven't verified this assumption for Airflow, Prefect, Kestra and Temporal, but we scaled Windmill up to 100 virtual workers to verify it. And the conclusion is that it scales pretty linearly.

For this test, we deployed the same docker-compose as above on an AWS m4.xlarge instance (4 vCPUs, 16 GB of memory) and, to virtually increase the number of workers, we used the NUM_WORKERS environment variable Windmill accepts. Note that this is not strictly equivalent to adding real hardware to the stack, but until we reach the maximum capacity of the instance, both in terms of CPU and memory, we can assume it is a good approximation. The other change we had to make was to bump max_connections to 1000 on PostgreSQL: as we add more and more workers, each worker needs to connect to the database, and we need to increase the maximum number of connections PostgreSQL allows.

The job we ran was a simple job sleeping for 100ms, which is a good average duration for a job running on an orchestrator.

import time

def main():
    time.sleep(0.1)

Finally, we ran it in Windmill's dedicated worker mode, and we used a specific endpoint to "bulk-create" the jobs before any worker started pulling them from the queue. For this test to be representative, we had to measure the performance of Windmill processing a large number of jobs (10,000 in this case), and we quickly realised that inserting the jobs one by one into the queue took a non-negligible amount of time and was skewing the measured performance of the workers.

The results are the following:

| Number of workers | Throughput (jobs/sec), batch of 10k jobs |
|---|---|
| 2 | 19.9 |
| 6 | 59.8 |
| 10 | 99.6 |
| 20 | 198 |
| 30 | 298 |
| 40 | 391 |
| 50 | 496 |
| 60 | 591 |
| 70 | 693 |
| 80 | 786 |
| 90 | 887 |
| 100 | 981 |

This shows that Windmill scales linearly with the number of workers (at least up to 100 workers). We can also note that the throughput is close to optimal: given that each job takes 100ms to execute, a single worker can complete at most 10 jobs per second, so N workers processing jobs in parallel cannot go above 10×N jobs per second, and Windmill gets pretty close.
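A quick way to check the "close to optimal" claim: with a 100ms job, N workers cap out near 10×N jobs per second, and the measured throughputs from the table above all sit within a few percent of that ceiling:

```python
# (workers, measured jobs/sec) from the table above
results = [(2, 19.9), (6, 59.8), (10, 99.6), (20, 198),
           (30, 298), (40, 391), (50, 496), (60, 591),
           (70, 693), (80, 786), (90, 887), (100, 981)]

for workers, throughput in results:
    ideal = workers * 10  # a 100ms job caps each worker at 10 jobs/sec
    print(f"{workers:3d} workers: {throughput:6.1f} jobs/s, "
          f"{throughput / ideal:.1%} of ideal")
```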