Skip to main content

S3 APIs integrations

Windmill provides a unique resource type for any API following the typical S3 schema.

Windmill for data pipelines

You can link a Windmill workspace to an S3 bucket and use it as source and/or target of your processing steps seamlessly, without any boilerplate.


See Windmill for data pipelines for more details.

Add an S3 Resource

Here are the required details:

S3 resource type

PropertyTypeDescriptionDefaultRequired
bucketstringS3 bucket nametrue
regionstringS3 region for the buckettrue
useSSLbooleanUse SSL for connectionstruefalse
endPointstringS3 endpointtrue
accessKeystringAWS access keyfalse
pathStylebooleanUse path-style addressingfalsefalse
secretKeystringAWS secret keyfalse

For guidelines on where to find such details on a given platform, please go to the AWS S3 or Cloudflare R2 pages.

Your resource can be used passed as parameters or directly fetched within scripts, flows and apps.


tip

Find some pre-set interactions with S3 on the Hub.

Feel free to create your own S3 scripts on Windmill.

Connect your Windmill workspace to your S3 bucket or your Azure Blob storage

Once you've created an S3 or Azure Blob resource in Windmill, go to the workspace settings > S3 Storage. Select the resource and click Save.

S3 storage workspace settings

The resource can be set to be public with toggle "S3 resource details can be accessed by all users of this workspace".

In this case, the permissions set on the resource will be ignored when users interact with the S3 bucket via Windmill. Note that when the resource is public, the users might be able to access all of its details (including access keys and secrets) via some Windmill endpoints. When the resource is not set to be public, Windmill guarantees that users who don't have access to the resource won't be able to retrieve any of its details. That being said, access to a specific file inside the bucket will still be possible, and downloading and uploading objects will also be accessible to any workspace user. In short, as long as the user knows the path of the file they want to access, they will be able to read its content. The main difference is that users won't be able to browse the content of the bucket.

Once the workspace is configured, access to the bucket is made easy in Windmill.

When a script accepts a S3 file as input, it can be directly uploaded or chosen from the bucket explorer.

S3 file upload

S3 bucket browsing

When a script outputs a S3 file, it can be downloaded or previewed directly in Windmill's UI (for displayable files like text files, CSVs or parquet files).

S3 file download

Read a file from S3 within a script

import * as wmill from 'windmill-client';
import { S3Object } from 'windmill-client';

export async function main(input_file: S3Object) {
// Load the entire file_content as a Uint8Array
const file_content = await wmill.loadS3File(input_file);

const decoder = new TextDecoder();
const file_content_str = decoder.decode(file_content);
console.log(file_content_str);

// Or load the file lazily as a Blob
let fileContentBlob = await wmill.loadS3FileStream(input_file);
console.log(await fileContentBlob.text());
}

Read S3 file

Create a file in S3 within a script

import * as wmill from 'windmill-client';
import { S3Object } from 'windmill-client';

export async function main(s3_file_path: string) {
const s3_file_output: S3Object = {
s3: s3_file_path
};

const file_content = 'Hello Windmill!';
// file_content can be either a string or ReadableStream<Uint8Array>
await wmill.writeS3File(s3_file_output, file_content);
return s3_file_output;
}

Write to S3 file

info

Certain file types, typically parquet files, can be directly rendered by Windmill

Windmill embedded integration with Polars and DuckDB for data pipelines

ETLs can be easily implemented in Windmill using its integration with Polars and DuckDB for facilitate working with tabular data. In this case, you don't need to manually interact with the S3 bucket, Polars/DuckDB does it natively and in a efficient way. Reading and Writing datasets to S3 can be done seamlessly.

#requirements:
#polars==0.20.2
#s3fs==2023.12.0
#wmill>=1.229.0

import wmill
from wmill import S3Object
import polars as pl
import s3fs


def main(input_file: S3Object):
bucket = wmill.get_resource("<PATH_TO_S3_RESOURCE>")["bucket"]

# this will default to the workspace s3 resource
storage_options = wmill.polars_connection_settings().storage_options
# this will use the designated resource
# storage_options = wmill.polars_connection_settings("<PATH_TO_S3_RESOURCE>").storage_options

# input is a parquet file, we use read_parquet in lazy mode.
# Polars can read various file types, see
# https://pola-rs.github.io/polars/py-polars/html/reference/io.html
input_uri = "s3://{}/{}".format(bucket, input_file["s3"])
input_df = pl.read_parquet(input_uri, storage_options=storage_options).lazy()

# process the Polars dataframe. See Polars docs:
# for dataframe: https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/index.html
# for lazy dataframe: https://pola-rs.github.io/polars/py-polars/html/reference/lazyframe/index.html
output_df = input_df.collect()
print(output_df)

# To write back the result to S3, Polars needs an s3fs connection
s3 = s3fs.S3FileSystem(**wmill.polars_connection_settings().s3fs_args)
output_file = "output/result.parquet"
output_uri = "s3://{}/{}".format(bucket, output_file)
with s3.open(output_uri, mode="wb") as output_s3:
# persist the output dataframe back to S3 and return it
output_df.write_parquet(output_s3)

return S3Object(s3=output_file)
info

Polars and DuckDB need to be configured to access S3 within the Windmill script. The job will need to accessed the S3 resources, which either needs to be accessible to the user running the job, or the S3 resource needs to be set as public in the workspace settings.

For more info on how Data Pipelines in Windmill, see Data Pipelines.