Results: Go

Results for each benchmark run for each engine in all languages and settings

To evaluate orchestrator performance in Go, we tested Hatchet, Temporal, and Windmill using a consistent set of benchmark scenarios: 10 long-running tasks and 400 lightweight tasks on a single worker, and 100 long-running tasks and 400 lightweight tasks across 10 workers.

Go provides an ideal testbed for orchestration engine performance due to its low startup overhead and tight runtime, meaning most latency observed can be attributed to the orchestration engine itself.

The other orchestration engines either did not support Go natively (e.g., Kestra supports it only via Docker) or did not support it at all.

Single worker

Long-Running Tasks (10 tasks)

In the single-worker configuration for long-running tasks, total flow durations were: Hatchet: 30.715s, Temporal: 28.313s and Windmill: 27.648s.

As expected with heavier workloads, execution time dominated the run: Windmill: 93.82%, Temporal: 91.81% and Hatchet: 88.69%.

This suggests that all three engines handle Go workloads efficiently once tasks are assigned. Assignment times were lowest for Windmill (2.46%) and Temporal (2.60%), with Hatchet slightly higher at 4.95%. Transition times were also well managed, with Windmill again leading at just 3.73% of total time.

The results confirm that in long-running tasks, all three engines effectively minimize orchestration overhead in Go, with Windmill slightly outperforming the others in total time and transition latency.

	Hatchet	Temporal	Windmill
Total duration (in seconds)	30.715	28.313	27.648
Assignment	1.519 (4.95%)	0.735 (2.60%)	0.679 (2.46%)
Execution	27.242 (88.69%)	25.995 (91.81%)	25.938 (93.82%)
Transition	1.954 (6.36%)	1.583 (5.59%)	1.031 (3.73%)
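The percentages in the table can be reproduced from the phase durations. The sketch below assumes that each share is simply the phase duration divided by the total, and that the total is the sum of the three phases; both assumptions are consistent with the figures above. The `Phases` type and `Share` function are hypothetical names used for illustration only.

```go
package main

import "fmt"

// Phases holds the three phase durations (in seconds) reported for one engine.
type Phases struct {
	Assignment, Execution, Transition float64
}

// Total returns the sum of the three phases.
func (p Phases) Total() float64 {
	return p.Assignment + p.Execution + p.Transition
}

// Share returns part as a percentage of total.
func Share(part, total float64) float64 {
	return part / total * 100
}

func main() {
	// Hatchet, single worker, long-running tasks (values from the table).
	hatchet := Phases{Assignment: 1.519, Execution: 27.242, Transition: 1.954}
	total := hatchet.Total()
	fmt.Printf("total=%.3fs assignment=%.2f%% execution=%.2f%% transition=%.2f%%\n",
		total,
		Share(hatchet.Assignment, total),
		Share(hatchet.Execution, total),
		Share(hatchet.Transition, total))
}
```

The same calculation applies to the Temporal and Windmill columns.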

Lightweight Tasks (400 tasks)

For lightweight tasks, engine overhead becomes critical. The results show: Hatchet: 35.845s, Temporal: 39.016s and Windmill: 19.702s.

Here, the execution of fibo(10) is negligible (~10ms), so orchestration mechanics dominate performance. Windmill stands out, finishing in roughly half the total time of either Temporal (about 50%) or Hatchet (about 55%).

Breaking it down: Assignment: Windmill: 18.53%, Temporal: 41.24% and Hatchet: 68.87%. Transition: Windmill: 67.17%, Temporal: 52.62% and Hatchet: 26.75%.

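The high transition percentage for Windmill is partly a consequence of its short assignment phase (3.651s): with little time spent assigning tasks, transitions make up a larger fraction of a smaller total. In absolute terms, its transition time (13.234s) falls between Hatchet's (9.587s) and Temporal's (20.529s). Hatchet and Temporal, in contrast, spend far more time assigning tasks and maintaining queues (24.688s and 16.090s of assignment time, respectively), which stretches their total duration.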

Overall, Windmill clearly leads in lightweight throughput in Go with a single worker, suggesting minimal orchestration latency and efficient task sequencing.

	Hatchet	Temporal	Windmill
Total duration (in seconds)	35.845	39.016	19.702
Assignment	24.688 (68.87%)	16.090 (41.24%)	3.651 (18.53%)
Execution	1.570 (4.38%)	2.397 (6.14%)	2.817 (14.30%)
Transition	9.587 (26.75%)	20.529 (52.62%)	13.234 (67.17%)
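One way to compare the engines on this workload is the average orchestration overhead per task. The sketch below assumes overhead is the sum of assignment and transition time divided by the 400 tasks; this is a derived figure, not one reported by the benchmark, and `overheadPerTask` is a hypothetical helper.

```go
package main

import "fmt"

// overheadPerTask returns the average non-execution time per task, in milliseconds.
// Inputs are the assignment and transition durations in seconds.
func overheadPerTask(assignment, transition float64, tasks int) float64 {
	return (assignment + transition) / float64(tasks) * 1000
}

func main() {
	const tasks = 400
	// Assignment and transition durations (seconds) from the table above.
	engines := []struct {
		name                   string
		assignment, transition float64
	}{
		{"Hatchet", 24.688, 9.587},
		{"Temporal", 16.090, 20.529},
		{"Windmill", 3.651, 13.234},
	}
	for _, e := range engines {
		fmt.Printf("%-8s %.1f ms/task\n", e.name, overheadPerTask(e.assignment, e.transition, tasks))
	}
}
```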

Multi-worker

10 workers: 100 long-running tasks

All three engines performed well under this heavier workload, with durations as follows: Temporal: 11.152s, Windmill: 11.899s and Hatchet: 17.753s.

Temporal had slightly better worker utilization (94.86%) than Windmill (92.19%), and Windmill showed a slightly higher average wait time (5.478s vs. 5.012s). Active time per worker was nearly identical across engines (roughly 11s), suggesting that the core compute workload was equally distributed.

Windmill's advantage, however, lies in its fast scheduling (0.024s) and low transition costs, which helped maintain competitive performance despite a slightly higher wait time.

Hatchet showed lower utilization (63.05%) and higher idle times, indicating room for improvement in orchestrating long-running parallel workloads in Go. Note that for multi-worker scenarios, we tried two approaches: using RunBulkNoWait, which splits the individual Fibonacci tasks into separate workflows, and setting PARALLEL=true, which runs all Fibonacci tasks as parallel tasks within a single workflow. We did not see a significant difference in the results.
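The shape of the multi-worker workload (a batch of Fibonacci tasks fanned out to 10 workers) can be pictured in plain Go. This is only an illustrative sketch: it uses goroutines and a channel rather than any engine's API, and the `fib` and `runPool` functions, as well as the argument passed to `fib`, are assumptions rather than the benchmark's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// fib is a naive recursive Fibonacci, standing in for the benchmark task.
func fib(n int) int {
	if n < 2 {
		return n
	}
	return fib(n-1) + fib(n-2)
}

// runPool fans `tasks` calls of fib(n) out to `workers` goroutines and
// returns how many tasks each worker handled.
func runPool(workers, tasks, n int) []int {
	queue := make(chan int)
	counts := make([]int, workers)
	var wg sync.WaitGroup

	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for arg := range queue {
				_ = fib(arg)
				counts[id]++ // each goroutine writes only its own slot
			}
		}(w)
	}

	for i := 0; i < tasks; i++ {
		queue <- n
	}
	close(queue)
	wg.Wait()
	return counts
}

func main() {
	counts := runPool(10, 100, 10)
	fmt.Println("tasks per worker:", counts)
}
```

The per-worker counts vary from run to run, which is the kind of spread the "Tasks per Worker" rows summarize for each engine.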

Multi-Worker Comparison (10 workers)

Metric	Hatchet	Temporal	Windmill
Total Duration (s)	17.753	11.152	11.899
First Task Scheduled (s)	0.000	0.041	0.024
First Task Started (s)	0.170	0.063	0.377

Worker Load Distribution

Metric per worker	Statistic	Hatchet	Temporal	Windmill
Tasks per Worker	Average (StdDev)	10.0 (1.26)	10.0 (0.45)	10.0 (0.45)
Tasks per Worker	Min / Max	8 / 13	9 / 11	9 / 11
Active Time (s)	Average (StdDev)	11.193 (0.640)	10.578 (0.207)	10.970 (0.191)
Active Time (s)	Min / Max	9.675 / 12.020	10.270 / 10.883	10.597 / 11.249
Idle Time (s)	Average (StdDev)	6.560 (0.640)	0.574 (0.207)	0.929 (0.191)
Idle Time (s)	Min / Max	5.733 / 8.078	0.269 / 0.882	0.650 / 1.302
Avg Wait Time (s)	Average (StdDev)	8.322 (0.372)	5.012 (0.239)	5.478 (0.171)
Avg Wait Time (s)	Min / Max	7.679 / 8.780	4.598 / 5.287	5.105 / 5.696
Avg Task Duration (s)	Average (StdDev)	1.137 (0.154)	1.059 (0.040)	1.099 (0.050)
Avg Task Duration (s)	Min / Max	0.865 / 1.370	0.979 / 1.141	1.008 / 1.215
Worker Utilization (%)	Average (StdDev)	63.05 (3.60)	94.86 (1.86)	92.19 (1.61)
Worker Utilization (%)	Min / Max	54.50 / 67.71	92.09 / 97.59	89.06 / 94.54
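The averages in this table are consistent with utilization being active time divided by total run duration, and idle time being the remainder. This relationship is inferred from the numbers, not taken from a documented definition; the helper names below are hypothetical.

```go
package main

import "fmt"

// utilization returns active time as a percentage of the total run duration.
func utilization(active, total float64) float64 {
	return active / total * 100
}

// idle returns the time a worker was not executing tasks during the run.
func idle(active, total float64) float64 {
	return total - active
}

func main() {
	// Average active time and total duration (seconds), 100 long-running tasks.
	engines := []struct {
		name          string
		active, total float64
	}{
		{"Hatchet", 11.193, 17.753},
		{"Temporal", 10.578, 11.152},
		{"Windmill", 10.970, 11.899},
	}
	for _, e := range engines {
		fmt.Printf("%-8s utilization=%.2f%% idle=%.3fs\n",
			e.name, utilization(e.active, e.total), idle(e.active, e.total))
	}
}
```

Because the table averages per-worker ratios, a value computed from the averaged active time can differ from the reported figure in the last decimal place.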

10 workers: 400 lightweight tasks

In the multi-worker configuration for lightweight tasks, Temporal was the fastest, completing all 400 tasks in just 4.270 seconds. Windmill followed at 7.224 seconds, while Hatchet significantly lagged behind with a total duration of 37.809 seconds.

At first glance, Windmill appears notably slower than Temporal, but a deeper look into worker load distribution helps explain the trade-offs in orchestration design.

Worker Utilization: Temporal: 32.41%, Windmill: 11.75%, Hatchet: 3.16%
Avg Wait Time: Windmill: 4.235s, Temporal: 2.115s, Hatchet: 19.187s

Although Hatchet had perfect task distribution (exactly 40 per worker), its wait time and idle time were extremely high, indicating that tasks were not being effectively overlapped or queued. Its average task duration was very low (0.030s), but orchestration bottlenecks prevented it from benefiting from this. Note that for multi-worker scenarios, we tried two approaches: using RunBulkNoWait, which splits the individual Fibonacci tasks into separate workflows, and setting PARALLEL=true, which runs all Fibonacci tasks as parallel tasks within a single workflow. We did not see a significant difference in the results.

Temporal and Windmill both showed strong parallel execution, but with some differences: Windmill started scheduling tasks faster (first task scheduled at 0.025s), but took longer to start executing the first task (1.663s). This is because Windmill, when executing flow iterations in parallel, creates a sub-flow for each iteration containing the Fibonacci task, which adds a slight orchestration overhead compared to the other engines. Temporal, by contrast, started execution almost immediately at 0.108s and maintained lower wait and idle times.

Multi-Worker Comparison (10 workers)

Metric	Hatchet	Temporal	Windmill
Total Duration (s)	37.809	4.270	7.224
First Task Scheduled (s)	0.000	0.049	0.025
First Task Started (s)	0.723	0.108	1.663

Worker Load Distribution

Metric per worker	Statistic	Hatchet	Temporal	Windmill
Tasks per Worker	Average (StdDev)	40.0 (0.00)	40.0 (1.95)	40.0 (3.13)
Tasks per Worker	Min / Max	40 / 40	37 / 43	36 / 45
Active Time (s)	Average (StdDev)	1.193 (0.178)	1.384 (0.069)	0.849 (0.064)
Active Time (s)	Min / Max	0.877 / 1.501	1.228 / 1.511	0.764 / 0.952
Idle Time (s)	Average (StdDev)	36.616 (0.178)	2.886 (0.069)	6.375 (0.064)
Idle Time (s)	Min / Max	36.308 / 36.932	2.759 / 3.042	6.272 / 6.460
Avg Wait Time (s)	Average (StdDev)	19.187 (0.014)	2.115 (0.077)	4.235 (0.130)
Avg Wait Time (s)	Min / Max	19.165 / 19.207	1.967 / 2.217	3.949 / 4.475
Avg Task Duration (s)	Average (StdDev)	0.030 (0.004)	0.035 (0.001)	0.021 (0.001)
Avg Task Duration (s)	Min / Max	0.022 / 0.038	0.032 / 0.036	0.019 / 0.023
Worker Utilization (%)	Average (StdDev)	3.16 (0.47)	32.41 (1.61)	11.75 (0.89)
Worker Utilization (%)	Min / Max	2.32 / 3.97	28.76 / 35.39	10.58 / 13.18
