Ducklake

This page is part of our section on Persistent storage & databases which covers where to effectively store and manage the data manipulated by Windmill. Check that page for more options on data storage.

Ducklake allows you to store massive amounts of data in S3, but still query it efficiently in natural SQL language.

Learn more about Ducklake

Getting started

Prerequisites:

A workspace storage configured
A Postgres or MySQL resource if you are not superuser. Superusers can use a Custom instance database.

Go to workspace settings -> Object storage (S3) and configure a Ducklake :

Ducklake settings

Using Ducklake in scripts

Ducklakes are referenced by their name. 'main' is the special default ducklake name, which can be omitted when referencing it.

Typescript
Python
DuckDB

import * as wmill from 'windmill-client';

export async function main(user_id: string) {
	// let sql = wmill.ducklake('named_ducklake');
	let sql = wmill.ducklake();

	// This string interpolation syntax is safe
	// and is transformed into a parameterized query
	let friend = await sql`SELECT * FROM friend WHERE id = ${user_id}`.fetchOne();
	// let allFriends = await sql`INSERT INTO friend VALUES ('John', 21)`.fetch();

	return friend;
}

import wmill

def main(user_id: str):
    # dl = wmill.ducklake('named_ducklake')
    dl = wmill.ducklake()

    # DuckDB scripts use named arguments
    friend = dl.query('SELECT * FROM friend WHERE id = $id', id=user_id).fetch_one()
    # all_friends = dl.query('SELECT * FROM friend').fetch()

    return friend

-- $user_id (bigint)

-- ATTACH 'ducklake://named_ducklake' AS dl;
ATTACH 'ducklake' AS dl;
USE dl;

SELECT * FROM friend WHERE id = $user_id;

-- Note: the original DuckDB syntax `ATTACH 'ducklake:postgres:connection_string'` does not benefit from Windmill's integration.

You can use the Ducklake button in the editor bar for convenience, which will insert the necessary statements for you. S3 content

DuckDB example

DuckDB is the native query engine for Ducklake. Other integrations (TypeScript, Python...) run DuckDB scripts under the hood. Note that these integrations do not start a new job when running the queries. The DuckDB script is run inline within the same worker.

In the example below, we pass a list of messages with positive, neutral or negative sentiment.
This list might come from a Python script which queries new reviews from the Google My Business API, and sends them to an LLM to determine their sentiment.
The messages are then inserted into a Ducklake table, which effectively creates a new parquet file and stores metadata in the catalog.

-- $messages (json[])

ATTACH 'ducklake://main' AS dl;
USE dl;

CREATE TABLE IF NOT EXISTS messages (
  content STRING NOT NULL,
  author STRING NOT NULL,
  date STRING NOT NULL,
  sentiment STRING
);

CREATE TEMP TABLE new_messages AS
  SELECT
    value->>'content' AS content,
    value->>'author' AS author,
    value->>'date' AS date,
    value->>'sentiment' AS sentiment
  FROM json_each($messages);

INSERT INTO messages
  SELECT * FROM new_messages;

Using the database manager

In your Ducklake settings, clicking the "Explore" button will open the database manager. You can perform all CRUD operations through the UI or with the SQL Repl.

Explore ducklake

What Ducklake does behind the scenes

If you explore your catalog database, you will see that Ducklake created some tables for you. These metadata tables store information about your data and where it is located in S3 :

Catalog database

If you explore your selected workspace storage you will see your tables and their contents as columnar, parquet files :

S3 content

Getting started​

Using Ducklake in scripts​

DuckDB example​

Using the database manager​

What Ducklake does behind the scenes​

Getting started

Using Ducklake in scripts

DuckDB example

Using the database manager

What Ducklake does behind the scenes