dbt Runner

Overview

Summary

The dbt Runner gear enables users to run dbt (data build tool) projects on Flywheel datasets stored in external storage. It transforms source parquet files into modeled data tables using user-defined dbt transformations, outputting the results back to external storage.

Cite

N/A

License

License: MIT

Classification

Category: Analysis

Gear Level:

[[TOC]]

Inputs

dbt_project_zip
- Name: dbt Project Archive
- Type: archive (zip file)
- Optional: No
- Description: Zip file containing a complete dbt project with dbt_project.yml, profiles.yml, models/, and other required dbt files
- Notes: The dbt project should be tested locally before uploading. Must use dbt-duckdb adapter.

Config

storage_label
- Name: Storage Label
- Type: string
- Description: Label of the Flywheel external storage containing source and output data
- Default: N/A (required)
source_prefix
- Name: Source Prefix
- Type: string
- Description: Path prefix within external storage where source Flywheel dataset files are located (do not include trailing slash)
- Default: N/A (required)
output_prefix
- Name: Output Prefix
- Type: string
- Description: Path prefix within external storage where transformed output tables will be written (do not include trailing slash)
- Default: N/A (required)
debug
- Name: Debug Mode
- Type: boolean
- Description: Log debug messages
- Default: false

Outputs

Files

The gear saves dbt artifacts as gear outputs:

manifest.json - Complete dbt project manifest with model metadata
run_results.json - Results from the dbt run execution
sources.json - Source metadata and freshness information
compiled/ - Directory containing compiled SQL for all models

Transformed data tables are written directly to the specified output prefix in external storage.

Metadata

No metadata is written to Flywheel containers.

Pre-requisites

Prerequisite Gear Runs

None - this gear operates independently on external storage.

Prerequisite Files

Source Dataset
- Origin: Exported from Flywheel via dataset export
- Location: External storage at the specified source_prefix
- Format: Flywheel dataset schema (tables/ directory with parquet files)

Prerequisite Metadata

None required.

Usage

This section provides a detailed description of how the gear works in Flywheel.

Description

The dbt Runner gear executes a complete dbt project workflow:

Validation - Validates configuration parameters, dbt project structure, and storage access
Download - Downloads source dataset from external storage
Transform - Runs dbt transformations using dbt-duckdb adapter
Upload - Uploads transformed results back to external storage
Archive - Saves dbt artifacts (manifest, run results) as gear outputs

The gear uses the dbt-duckdb adapter to process data locally within the gear container, enabling fast transformations on parquet files without requiring a separate database.

File Specifications

dbt_project_zip

The dbt project zip file must contain:

dbt_project.yml - Project configuration defining project name, version, and model settings
profiles.yml - dbt profile configuration for dbt-duckdb adapter
models/ - Directory containing dbt model SQL files
Optional: macros/, seeds/, tests/, snapshots/, analyses/ directories

Important Requirements:

Must use dbt-duckdb adapter (specified in profiles.yml)
Project should be tested locally before deployment
Directory Structure: Source data paths must use ../source_data/ prefix
- The gear downloads data to a source_data/ directory that is a sibling (not child) of the dbt project directory
- All model references to source data must use relative paths: ../source_data/tables/...
- When testing locally, create the same structure: place source_data/ in the parent folder of your dbt project

Example profiles.yml:

my_project:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: 'target/dev.duckdb'
      threads: 4
      extensions:
        - parquet

Directory Structure (Local and Gear):

parent_folder/                    # Your workspace (or gear work directory)
├── my_dbt_project/               # Your dbt project
│   ├── dbt_project.yml
│   ├── profiles.yml
│   └── models/
└── source_data/                  # Data directory (sibling to project)
    └── tables/
        ├── subjects/*.parquet
        └── sessions/*.parquet

Example Model SQL:

-- models/staging/stg_subjects.sql
SELECT *
FROM read_parquet('../source_data/tables/subjects/*.parquet')

Note: The ../source_data/ path references the parent directory where source data is located, ensuring consistency between local testing and gear execution.

Workflow

graph LR;
    A[dbt Project ZIP]:::input --> E;
    B[External Storage<br>Source Data]:::storage --> E;
    E((dbt Runner<br>Gear)):::gear;
    E --> F[External Storage<br>Transformed Tables]:::storage;
    E --> G[Analysis<br>dbt Artifacts]:::container;

    classDef container fill:#57d,color:#fff
    classDef input fill:#7a9,color:#fff
    classDef gear fill:#659,color:#fff
    classDef storage fill:#f90,color:#fff

Workflow Steps:

Upload dbt project zip file to Flywheel (attach to any container)
Configure gear with storage_label, source_prefix, and output_prefix
Run gear (typically at Analysis level)
Gear validates inputs and downloads source data
Gear executes dbt transformations
Transformed tables are uploaded to external storage at output_prefix
dbt artifacts are saved to the Analysis container

Use Cases

Use Case 1: Creating Analytics Tables

Conditions:

dbt project has been tested locally with sample data
Source dataset exists in external storage (Flywheel dataset export)
dbt models reference source parquet files correctly
Models require incremental updates (not supported in v0.1.0)

Description:

A data analyst has created dbt models to join subjects, sessions, and acquisitions tables with file metadata to create enriched analytics tables. They test the project locally using a sample dataset, then upload the dbt project zip to Flywheel and run the gear against the full production dataset in external storage. The gear executes all transformations and writes the final tables back to external storage for downstream analysis.

Use Case 2: Debugging Transformation Logic

Conditions:

Previous dbt run produced unexpected results
Need to inspect compiled SQL and execution details
dbt artifacts (manifest.json, run_results.json) are needed

Description:

A data engineer notices unexpected results from a dbt transformation. They download the manifest.json and run_results.json artifacts from the Analysis output to review the compiled SQL and understand which models succeeded or failed. They can trace through the lineage and identify the issue before updating their dbt project.

Logging

The gear provides detailed logging throughout execution:

Validation Phase - Reports on configuration validation, dbt project structure, and storage access
Download Phase - Logs number of files downloaded and their locations
dbt Execution - Shows complete dbt debug and run output
Upload Phase - Reports on files uploaded to external storage
Errors - Detailed error messages with context for troubleshooting

Log output is structured in numbered phases (e.g., [1/7], [2/7]) for easy tracking of progress.

Assumptions and Limitations

Model Output Requirements

Upload opt-in via meta.upload: Models must declare meta: {upload: "<path>"} in their config to be uploaded. The path is relative to target/. Models without meta.upload are not uploaded.
String paths only: The meta.upload value must be a string. Non-string values (e.g. True) are logged as a warning and skipped.
Subfolder structure: The meta.upload path is used as the relative path in external storage. For example, meta: {upload: 'main/summary.parquet'} uploads to {output_prefix}/main/summary.parquet.
Auto-creation: Target subdirectories are automatically created before dbt runs by scanning model files for location configurations.

Supported Features

Upload control: Any model that declares meta.upload is uploaded, regardless of materialization type. This works with external, table, and Python models.
dbt adapter: Only the dbt-duckdb adapter is supported. Other adapters (Snowflake, BigQuery, etc.) are not compatible.
Data formats: Source data must be in Parquet format. CSV, JSON, and other formats require preprocessing.
Dataset schema: Source data must follow the Flywheel dataset schema with a tables/ directory containing container-level subdirectories (subjects/, sessions/, etc.).

Source Data References

Relative paths: All source data references in dbt models must use ../source_data/ prefix (e.g., read_parquet('../source_data/tables/subjects/*.parquet')).
Directory structure: The gear creates source_data/ as a sibling directory to the dbt project, not as a child. Local testing should replicate this structure.

Execution Model

Full rebuild: Each gear run performs a complete rebuild of all models. No incremental processing is supported in this version.
No state persistence: The gear does not maintain dbt state between runs. Each execution starts fresh.
Single-threaded: While dbt can use multiple threads (configured in profiles.yml), all work happens within a single gear container.

Storage Integration

Write permissions: The gear requires write access to the output_prefix in external storage.
Overwrite behavior: Existing files at the output location are overwritten without warning.
No cleanup: The gear does not remove old output files. Previous runs' outputs remain in storage unless manually deleted.

Known Limitations

No incremental models: Incremental materialization is not supported. All models are rebuilt on every run.
No seeds support: dbt seed files are included in the project but not actively managed by the gear.
No snapshot support: dbt snapshot functionality is not tested or supported.
Limited error recovery: If dbt run fails mid-execution, partial results are not uploaded. The gear returns an error and no outputs are written to external storage.
File size constraints: Very large model outputs may encounter memory limits within the gear container. Reaching out to Flywheel support is recommended for large datasets so that appropriate resources can be allocated.

Contributing

[For more information about how to get started contributing to that gear, checkout CONTRIBUTING.md.]

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
docs		docs
fw_gear_dbt_runner		fw_gear_dbt_runner
tests		tests
.dockerignore		.dockerignore
.gitignore		.gitignore
.gitlab-ci.yml		.gitlab-ci.yml
.pre-commit-config.yaml		.pre-commit-config.yaml
CONTRIBUTING.md		CONTRIBUTING.md
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
manifest.json		manifest.json
pyproject.toml		pyproject.toml
requirements-dev.txt		requirements-dev.txt
requirements.txt		requirements.txt
run.py		run.py
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

dbt Runner

Overview

Summary

Cite

License

Classification

Inputs

Config

Outputs

Files

Metadata

Pre-requisites

Prerequisite Gear Runs

Prerequisite Files

Prerequisite Metadata

Usage

Description

File Specifications

dbt_project_zip

Workflow

Use Cases

Use Case 1: Creating Analytics Tables

Use Case 2: Debugging Transformation Logic

Logging

Assumptions and Limitations

Model Output Requirements

Supported Features

Source Data References

Execution Model

Storage Integration

Known Limitations

Contributing

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages