Persistent Storage

Persistent storage refers to any method of storing data that remains intact and accessible even after a system is powered off, restarted, or experiences a crash.

In the context of Windmill, the question is: where to effectively store and manage the data manipulated by Windmill (ETL, data ingestion and preprocessing, data migration and sync, etc.)?

TLDR

When it comes to storing data manipulated by Windmill, it is recommended to store only Windmill-specific elements (resources, variables, etc.) in Windmill itself. For the data itself, it is recommended to use external storage service providers that can be accessed from Windmill.


This document lists trusted services to use alongside Windmill.


There are 4 kinds of persistent storage in Windmill:

  1. Small data that is relevant between script/flow executions and can be persisted on Windmill itself.

  2. Object storage for large data such as S3.

  3. Big structured SQL data that is critical to your services and that is stored externally on an SQL Database or Data Warehouse.

  4. NoSQL and document databases such as MongoDB, and key-value stores.

You already have your own database

If you already have your own database provided by a supported integration, you can easily connect it to Windmill.

If your service provider is already part of our list of integrations, just add your database as a resource.

If your service provider is not already integrated with Windmill, you can create a new resource type to establish the connection (and, if you want, share the schema on our Hub).

Windmill is not designed to store heavy data that extends beyond the execution of a script or flow. Indeed, the worker executing a given computation is not necessarily the same as the one that executed the previous one, so the data would have to be retrieved from another location anyway.

Instead, Windmill is very convenient to use alongside dedicated data storage providers to manipulate large amounts of data.

There are however internal methods to persist data between executions of jobs.

States and Resources

Within Windmill, you can use States and Resources to store a transient state that can be represented as a small JSON object.

States

States are used by scripts to keep data persistent between runs of the same script by the same trigger (schedule or user).

In Windmill, States are considered resources, but they are excluded from the Workspace tab for clarity. They are displayed in the Resources menu, under a dedicated tab.

A state is an object stored as a resource of the resource type state which is meant to persist across distinct executions of the same script.

import requests
from wmill import set_state, get_state

def main():
    # Get temperature from last execution (None on the first run)
    last_temperature = get_state()

    # Fetch the temperature in Paris from wttr.in
    response = requests.get("http://wttr.in/Paris?format=%t")

    new_temperature = response.text.strip("°F")

    # Set current temperature to state
    set_state(new_temperature)

    # Compare last_temperature and new_temperature
    if last_temperature is None:
        return "No previous temperature to compare with."
    elif last_temperature < new_temperature:
        return "The temperature has increased."
    elif last_temperature > new_temperature:
        return "The temperature has decreased."
    else:
        return "The temperature has remained the same."

States are what enable Flows to watch for changes in most event watching scenarios (trigger scripts). The pattern is as follows:

  • Retrieve the last state or, if undefined, assume it is the first execution.
  • Retrieve the current state in the external system you are watching, e.g. the list of users having starred your repo or the maximum ID of posts on Hacker News.
  • Calculate the difference between the current state and the last internal state. This difference is what you will want to act upon.
  • Set the new state as the current state so that you do not process the elements you just processed.
  • Return the differences calculated previously so that you can process them in the next steps. You will likely want to for-loop over the items and trigger one Flow per item. This is exactly the pattern used when your Flow is in "Watching changes regularly" mode. A minimal sketch of this pattern follows the list.
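
Below is a minimal sketch of this pattern in TypeScript, using the Hacker News maximum item ID mentioned above. Only getState/setState come from the Windmill client; the fetchMaxId helper and the returned list are illustrative:

import * as wmill from 'windmill-client';

// Illustrative helper: fetch the current maximum item ID from Hacker News
async function fetchMaxId(): Promise<number> {
  const res = await fetch('https://hacker-news.firebaseio.com/v0/maxitem.json');
  return await res.json();
}

export async function main() {
  // Retrieve the last state; undefined means this is the first execution
  const lastMaxId: number | undefined = await wmill.getState();

  // Retrieve the current state of the external system being watched
  const currentMaxId = await fetchMaxId();

  // Calculate the difference between the current state and the last internal state
  const newIds =
    lastMaxId === undefined
      ? []
      : Array.from(
          { length: Math.max(0, currentMaxId - lastMaxId) },
          (_, i) => lastMaxId + i + 1
        );

  // Set the new state so the items just seen are not processed again
  await wmill.setState(currentMaxId);

  // Return the differences so a for-loop step can process them
  return newIds;
}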

The convenience functions to do this are:

TypeScript

  • getState() which retrieves an object of any type (internally a simple Resource) at a path determined by getStatePath, which is unique to the user currently executing the Script, the Flow in which it is called (if any), and the path of the Script.
  • setState(value: any) which sets the new state.

Please note it requires importing the wmill client library from Deno/Bun.


Python

  • get_state() which retrieves an object of any type (internally a simple Resource) at a path determined by get_state_path, which is unique to the user currently executing the Script, the Flow in which it is called (if any), and the path of the Script.
  • set_state(value: Any) which sets the new state.

Please note it requires importing the wmill client library from Python.


Resources

States are a specific type of resource in Windmill where the type is state and the path is automatically calculated for you based on the schedule path (if any) and the script path. In some cases, you may want to set the path arbitrarily and/or use a different type than state. In this case, you can use the setResource and getResource functions. The same resource can be used across different scripts and flows.

  • setResource(value: any, path?: string, initializeToTypeIfNotExist?: string): which sets a resource at a given path. This is equivalent to setState but allows you to set an arbitrary path and choose a type other than state if desired. See API.
  • getResource(path: string): gets a resource at a given path. See API.
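
A minimal sketch of these two functions, assuming the illustrative path u/user/last_sync:

import * as wmill from 'windmill-client';

export async function main() {
  // Illustrative path; pick any workspace path you can write to
  const path = 'u/user/last_sync';

  // Persist a small JSON object at an arbitrary path (with the type state)
  await wmill.setResource({ lastSyncedAt: new Date().toISOString() }, path, 'state');

  // Any other script or flow step can later read it back
  return await wmill.getResource(path);
}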

States can be seen in the Resources section of the Windmill app with a Resource Type of state.

tip

Variables are similar to resources but have no types, can be tagged as secret (in which case they are encrypted by the workspace key) and can only store strings. In some situations, you may prefer setVariable/getVariable to resources.
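
A minimal sketch using variables (the path is illustrative; variables can only store strings):

import * as wmill from 'windmill-client';

export async function main() {
  // Illustrative path; the value must be a string
  const path = 'u/user/last_run_at';

  await wmill.setVariable(path, new Date().toISOString());
  return await wmill.getVariable(path);
}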

In conclusion, setState and setResource are convenient ways to persist small JSON objects between multiple script executions.

Shared Directory

For heavier ETL processes or sharing data between steps in a flow, Windmill provides a Shared Directory feature.

The Shared Directory allows steps within a flow to share data by storing it in a designated folder.

caution

Although Shared Folders are recommended for persisting states within a flow, it's important to note that all steps are executed on the same worker and the data stored in the Shared Directory is strictly ephemeral to the flow execution.

To enable the Shared Directory, follow these steps:

  1. Open the Settings menu in the Windmill interface.
  2. Go to the Shared Directory section.
  3. Toggle on the option for Shared Directory on './shared'.

Flow Shared Directory

Once the Shared Directory is enabled, you can use it in your flow by referencing the ./shared folder. This folder is shared among the steps in the flow, allowing you to store and access data between them.
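
For example, a first step can write an intermediate file that a later step of the same flow reads back. A minimal sketch in TypeScript, assuming a runtime with node:fs support (the file name is illustrative):

// Step 1: write an intermediate artifact to the Shared Directory
import { writeFile } from 'node:fs/promises';

export async function main() {
  await writeFile('./shared/data.json', JSON.stringify({ rows: [1, 2, 3] }));
  return 'wrote ./shared/data.json';
}

And in a subsequent step of the same flow:

// Step 2: read the artifact back from the Shared Directory
import { readFile } from 'node:fs/promises';

export async function main() {
  const data = JSON.parse(await readFile('./shared/data.json', 'utf-8'));
  return data.rows.length;
}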

tip

Keep in mind that the contents of the ./shared folder are not preserved across suspends and sleeps. The directory is temporary and active only during the execution of the flow.

Large Data Files: S3, R2, MinIO, Azure Blob

For heavier data objects and unstructured data storage, Amazon S3 (Simple Storage Service) and its alternatives Cloudflare R2 and MinIO, as well as Azure Blob Storage, are highly scalable and durable object storage services that provide secure, reliable, and cost-effective storage for a wide range of data types and use cases.

Windmill comes with a native integration with S3 and Azure Blob, making it the recommended storage for large objects like files and binary data.

Use Amazon S3, R2, MinIO and Azure Blob directly

Amazon S3, Cloudflare R2 and MinIO all follow the same API schema and therefore share a common Windmill resource type. Azure Blob has a slightly different API than S3 but works with Windmill as well, using its dedicated resource type.

Amazon S3

Amazon S3 (Simple Storage Service) is a scalable and durable object storage service offered by Amazon Web Services (AWS), designed to provide developers and businesses with an effective way to store and retrieve any amount of data from anywhere on the web.


  1. Sign-up to AWS.

  2. Create a bucket on S3.

  3. Integrate it to Windmill by filling the resource type details for S3 APIs.

Make sure the user associated with the resource has the right policies allowed in AWS Identity and Access Management (IAM).
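
As an illustration, the S3 resource details typically look like the following; the field names follow the common S3-compatible resource type and the values are placeholders, so verify against the s3 resource type in your workspace:

{
  "bucket": "my-bucket",
  "region": "us-east-1",
  "useSSL": true,
  "endPoint": "s3.us-east-1.amazonaws.com",
  "accessKey": "<access_key>",
  "secretKey": "<secret_key>"
}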

tip

You can find examples and premade S3 scripts on Windmill Hub.

Cloudflare R2

Cloudflare R2 is a cloud-based storage service that provides developers and businesses with a cost-effective and secure way to store and access their data.

  1. Sign-up to Cloudflare

  2. Create a bucket on R2.

  3. Integrate it to Windmill by filling the resource type details for S3 APIs.

MinIO

MinIO is an open-source, high-performance, and scalable object storage server that is compatible with Amazon S3 APIs, designed for building private and public cloud storage solutions.

For best performance, install MinIO locally. Then from Windmill, just fill the S3 resource type.

Azure Blob

Azure Blob Storage is Microsoft's alternative to S3. It serves the same purpose but has a slightly different API.

  1. Go to your Azure Portal and open the "Storage accounts" application.

  2. Either select an existing account or create a new one.

  3. Create a container. Azure's containers are roughly the equivalent of S3 buckets. Note though that secret access keys are per account, not per container.

  4. Integrate it to Windmill by filling the resource type details for Azure Blob APIs.

Connect your Windmill workspace to your S3 bucket or your Azure Blob storage

Once you've created an S3 or Azure Blob resource in Windmill, go to the workspace settings > S3 Storage. Select the resource and click Save.

S3 storage workspace settings

The resource can be set to be public with the toggle "S3 resource details can be accessed by all users of this workspace".

In this case, the permissions set on the resource will be ignored when users interact with the S3 bucket via Windmill. Note that when the resource is public, the users might be able to access all of its details (including access keys and secrets) via some Windmill endpoints. When the resource is not set to be public, Windmill guarantees that users who don't have access to the resource won't be able to retrieve any of its details. That being said, access to a specific file inside the bucket will still be possible, and downloading and uploading objects will also be accessible to any workspace user. In short, as long as the user knows the path of the file they want to access, they will be able to read its content. The main difference is that users won't be able to browse the content of the bucket.

Once the workspace is configured, access to the bucket is made easy in Windmill.

When a script accepts an S3 file as input, it can be directly uploaded or chosen from the bucket explorer.

S3 file upload

S3 bucket browsing

When a script outputs an S3 file, it can be downloaded or previewed directly in Windmill's UI (for displayable files like text files, CSVs or parquet files).

S3 file download

Read a file from S3 within a script

import * as wmill from 'windmill-client';
import { S3Object } from 'windmill-client';

export async function main(input_file: S3Object) {
  // Load the entire file_content as a Uint8Array
  const file_content = await wmill.loadS3File(input_file);

  const decoder = new TextDecoder();
  const file_content_str = decoder.decode(file_content);
  console.log(file_content_str);

  // Or load the file lazily as a Blob
  const fileContentBlob = await wmill.loadS3FileStream(input_file);
  console.log(await fileContentBlob.text());
}

Read S3 file

Create a file in S3 within a script

import * as wmill from 'windmill-client';
import { S3Object } from 'windmill-client';

export async function main(s3_file_path: string) {
  const s3_file_output: S3Object = {
    s3: s3_file_path
  };

  const file_content = 'Hello Windmill!';
  // file_content can be either a string or ReadableStream<Uint8Array>
  await wmill.writeS3File(s3_file_output, file_content);
  return s3_file_output;
}

Write to S3 file

info

Certain file types, typically parquet files, can be directly rendered by Windmill.

For more info on how to use files and S3 files in Windmill, see Handling files and binary data.

Windmill embedded integration with Polars and DuckDB for data pipelines

ETLs can be easily implemented in Windmill using its integration with Polars and DuckDB, which facilitate working with tabular data. In this case, you don't need to manually interact with the S3 bucket; Polars/DuckDB does it natively and in an efficient way. Reading and writing datasets to S3 can be done seamlessly.

#requirements:
#polars==0.20.2
#s3fs==2023.12.0
#wmill>=1.229.0

import wmill
from wmill import S3Object
import polars as pl
import s3fs


def main(input_file: S3Object):
    bucket = wmill.get_resource("<PATH_TO_S3_RESOURCE>")["bucket"]

    # this will default to the workspace s3 resource
    storage_options = wmill.polars_connection_settings().storage_options
    # this will use the designated resource
    # storage_options = wmill.polars_connection_settings("<PATH_TO_S3_RESOURCE>").storage_options

    # input is a parquet file, so we use read_parquet in lazy mode.
    # Polars can read various file types, see
    # https://pola-rs.github.io/polars/py-polars/html/reference/io.html
    input_uri = "s3://{}/{}".format(bucket, input_file["s3"])
    input_df = pl.read_parquet(input_uri, storage_options=storage_options).lazy()

    # process the Polars dataframe. See Polars docs:
    # for dataframe: https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/index.html
    # for lazy dataframe: https://pola-rs.github.io/polars/py-polars/html/reference/lazyframe/index.html
    output_df = input_df.collect()
    print(output_df)

    # To write back the result to S3, Polars needs an s3fs connection
    s3 = s3fs.S3FileSystem(**wmill.polars_connection_settings().s3fs_args)
    output_file = "output/result.parquet"
    output_uri = "s3://{}/{}".format(bucket, output_file)
    with s3.open(output_uri, mode="wb") as output_s3:
        # persist the output dataframe back to S3 and return it
        output_df.write_parquet(output_s3)

    return S3Object(s3=output_file)

info

Polars and DuckDB need to be configured to access S3 from within the Windmill script. The job will need to access the S3 resource, which either needs to be accessible to the user running the job, or the S3 resource needs to be set as public in the workspace settings.

For more info on data pipelines in Windmill, see Data Pipelines.

Structured Databases: Postgres (Supabase, Neon.tech)

For Postgres databases (best for structured data storage and retrieval, where you can define schema and relationships between entities), we recommend using Supabase or Neon.tech.

Supabase

Supabase is an open-source alternative to Firebase, providing a backend-as-a-service platform that offers a suite of tools, including real-time subscriptions, authentication, storage, and a PostgreSQL-based database.

  1. Sign-up to Supabase's Cloud App or Self-Host it.

  2. Create a new Supabase project.

  3. Get a Connection string.

    • Go to the Settings section.
    • Click Database.
    • Find your Connection Info and Connection String. Direct connections are on port 5432.
  4. From Windmill, add your Supabase connection string as a Postgresql resource and Execute queries. Tip: you might need to set the sslmode to "disable".


You can also integrate Supabase directly through its API.

Neon.tech

Neon.tech is an open-source cloud database platform that provides fully managed PostgreSQL databases with high availability and scalability.

  1. Sign-up to Neon's Cloud App or Self-Host it.

  2. Set up a project and add data.

  3. Get a Connection string. You can obtain the connection string from the Connection Details widget on the Neon Dashboard: select a branch, a role, and the database you want to connect to, and a connection string will be constructed for you.

  4. From Windmill, add your Neon.tech connection string as a Postgresql resource and Execute queries.


tip

Adding the connection string as a Postgres resource requires parsing it.


For example, for psql postgres://daniel:<password>@ep-restless-rice.us-east-2.aws.neon.tech/neondb, that would be:

{
  "host": "ep-restless-rice.us-east-2.aws.neon.tech",
  "port": 5432,
  "user": "daniel",
  "dbname": "neondb",
  "sslmode": "require",
  "password": "<password>"
}

Where the sslmode should be "require" and Neon uses the default PostgreSQL port, 5432.
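
Once the resource is saved (this applies to Supabase and Neon alike), one way to query it from a TypeScript script is with the node-postgres (pg) client. A minimal sketch, assuming the resource shape shown above is passed as a script parameter:

import { Client } from 'pg';

// Shape matching the Postgresql resource fields shown above
type Postgresql = {
  host: string;
  port: number;
  user: string;
  dbname: string;
  sslmode: string;
  password: string;
};

export async function main(db: Postgresql) {
  const client = new Client({
    host: db.host,
    port: db.port,
    user: db.user,
    password: db.password,
    database: db.dbname,
    ssl: db.sslmode === 'require',
  });
  await client.connect();

  // Run a trivial query to confirm the connection works
  const { rows } = await client.query('SELECT now() AS now');

  await client.end();
  return rows;
}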

Key-Value Stores: MongoDB Atlas, Redis, Upstash

Key-value stores are a popular choice for managing non-structured data, providing a flexible and scalable solution for various data types and use cases. In the context of Windmill, you can use MongoDB Atlas, Redis, and Upstash to store and manipulate non-structured data effectively.

MongoDB Atlas

MongoDB Atlas is a managed database-as-a-service platform that provides an efficient way to deploy, manage, and optimize MongoDB instances. As a document-oriented NoSQL database, MongoDB is well-suited for handling large volumes of unstructured data. Its dynamic schema enables the storage and retrieval of JSON-like documents with diverse structures, making it a suitable option for managing non-structured data.

To use MongoDB Atlas with Windmill:

  1. Sign-up to Atlas.

  2. Create a database.

  3. Integrate it to Windmill by filling the resource type details.

tip

You can find examples and premade MongoDB scripts on Windmill Hub.
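
With the resource in place, scripts can query the database through the standard driver. A minimal sketch using the mongodb Node.js driver; the resource shape, database and collection names are illustrative assumptions, not the exact Windmill resource type:

import { MongoClient } from 'mongodb';

// Illustrative shape; check the MongoDB resource type in your workspace for the exact fields
type MongodbResource = { connection_string: string };

export async function main(db: MongodbResource) {
  const client = new MongoClient(db.connection_string);
  await client.connect();

  // Fetch a few documents from an illustrative collection
  const docs = await client.db('mydb').collection('users').find().limit(10).toArray();

  await client.close();
  return docs;
}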

Redis

Redis is an open-source, in-memory key-value store that can be used for caching, message brokering, and real-time analytics. It supports a variety of data structures such as strings, lists, sets, and hashes, providing flexibility for non-structured data storage and management. Redis is known for its high performance and low-latency data access, making it a suitable choice for applications requiring fast data retrieval and processing.

To use Redis with Windmill:

  1. Sign-up to Redis.

  2. Create a database.

  3. Integrate it to Windmill by filling the resource type details following the same schema as MongoDB Atlas. A usage sketch follows.
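
Once connected, reading and writing keys from a script is straightforward. A minimal sketch using the node-redis client; the connection URL is a placeholder and would normally come from a Windmill resource or secret variable:

import { createClient } from 'redis';

export async function main() {
  // Placeholder URL; in practice, read it from a Windmill resource or secret variable
  const client = createClient({ url: 'redis://default:<password>@<host>:6379' });
  await client.connect();

  // Store and read back a simple key
  await client.set('last_run', new Date().toISOString());
  const value = await client.get('last_run');

  await client.disconnect();
  return value;
}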

Upstash

Upstash is a serverless, edge-optimized key-value store designed for low-latency access to non-structured data. It is built on top of Redis, offering similar performance benefits and data structure support while adding serverless capabilities, making it easy to scale your data storage needs.

To use Upstash with Windmill:

  1. Sign-up to Upstash.

  2. Create a database.

  3. Integrate it to Windmill by filling the resource type details following the same schema as MongoDB Atlas.