Getting Started Guide with dbt
The dbt Runner gear enables users to run dbt (data build tool) projects on Flywheel datasets stored in external storage. It transforms source parquet files into modeled data tables using user-defined dbt transformations, outputting the results back to external storage.
License: MIT
Category: Analysis
Gear Level:
- Project
- Subject
- Session
- Acquisition
- Analysis
[[TOC]]
- dbt_project_zip
- Name: dbt Project Archive
- Type: archive (zip file)
- Optional: No
- Description: Zip file containing a complete dbt project with dbt_project.yml, profiles.yml, models/, and other required dbt files
- Notes: The dbt project should be tested locally before uploading. Must use dbt-duckdb adapter.
- storage_label
- Name: Storage Label
- Type: string
- Description: Label of the Flywheel external storage containing source and output data
- Default: N/A (required)
- source_prefix
- Name: Source Prefix
- Type: string
- Description: Path prefix within external storage where source Flywheel dataset files are located (do not include trailing slash)
- Default: N/A (required)
- output_prefix
- Name: Output Prefix
- Type: string
- Description: Path prefix within external storage where transformed output tables will be written (do not include trailing slash)
- Default: N/A (required)
- debug
- Name: Debug Mode
- Type: boolean
- Description: Log debug messages
- Default: false
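The required options above can be checked up front before any data is moved. The following is a minimal sketch of such a validation step, not the gear's actual code; the function and message strings are illustrative assumptions.

```python
# Illustrative validation of the gear's configuration options (assumed names).
REQUIRED = ("storage_label", "source_prefix", "output_prefix")

def validate_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for key in REQUIRED:
        value = config.get(key)
        if not value:
            problems.append(f"missing required option: {key}")
        elif key.endswith("_prefix") and value.endswith("/"):
            # Prefixes must not include a trailing slash (see descriptions above).
            problems.append(f"{key} must not include a trailing slash")
    return problems

# Flags the trailing slash on source_prefix:
print(validate_config({"storage_label": "s3-data",
                       "source_prefix": "exports/run1/",
                       "output_prefix": "modeled"}))
```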
The gear saves dbt artifacts as gear outputs:
- manifest.json - Complete dbt project manifest with model metadata
- run_results.json - Results from the dbt run execution
- sources.json - Source metadata and freshness information
- compiled/ - Directory containing compiled SQL for all models
Transformed data tables are written directly to the specified output prefix in external storage.
No metadata is written to Flywheel containers.
None - this gear operates independently on external storage.
- Source Dataset
- Origin: Exported from Flywheel via dataset export
- Location: External storage at the specified source_prefix
- Format: Flywheel dataset schema (tables/ directory with parquet files)
None required.
This section provides a detailed description of how the gear works in Flywheel.
The dbt Runner gear executes a complete dbt project workflow:
- Validation - Validates configuration parameters, dbt project structure, and storage access
- Download - Downloads source dataset from external storage
- Transform - Runs dbt transformations using dbt-duckdb adapter
- Upload - Uploads transformed results back to external storage
- Archive - Saves dbt artifacts (manifest, run results) as gear outputs
The gear uses the dbt-duckdb adapter to process data locally within the gear container, enabling fast transformations on parquet files without requiring a separate database.
The dbt project zip file must contain:
- dbt_project.yml - Project configuration defining project name, version, and model settings
- profiles.yml - dbt profile configuration for dbt-duckdb adapter
- models/ - Directory containing dbt model SQL files
- Optional: macros/, seeds/, tests/, snapshots/, analyses/ directories
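Before uploading, an archive can be inspected for the required members listed above. This is a hedged sketch of such a pre-flight check, assuming the required files sit at the root of the zip; it is not the gear's actual validation logic.

```python
# Illustrative pre-upload check that a dbt project zip contains the
# required members (assumes they are at the archive root).
import zipfile

REQUIRED_MEMBERS = ("dbt_project.yml", "profiles.yml", "models/")

def missing_members(zip_path: str) -> list[str]:
    """Return required members absent from the archive."""
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
    missing = []
    for member in REQUIRED_MEMBERS:
        if member.endswith("/"):
            # Directories may appear only implicitly via their contents.
            present = any(n.startswith(member) for n in names)
        else:
            present = member in names
        if not present:
            missing.append(member)
    return missing
```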
Important Requirements:
- Must use dbt-duckdb adapter (specified in profiles.yml)
- Project should be tested locally before deployment
- Directory Structure: Source data paths must use the `../source_data/` prefix
- The gear downloads data to a `source_data/` directory that is a sibling (not a child) of the dbt project directory
- All model references to source data must use relative paths: `../source_data/tables/...`
- When testing locally, create the same structure: place `source_data/` in the parent folder of your dbt project
Example profiles.yml:

```yaml
my_project:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: 'target/dev.duckdb'
      threads: 4
      extensions:
        - parquet
```

Directory Structure (Local and Gear):
```
parent_folder/              # Your workspace (or gear work directory)
├── my_dbt_project/         # Your dbt project
│   ├── dbt_project.yml
│   ├── profiles.yml
│   └── models/
└── source_data/            # Data directory (sibling to project)
    └── tables/
        ├── subjects/*.parquet
        └── sessions/*.parquet
```
Example Model SQL:

```sql
-- models/staging/stg_subjects.sql
SELECT *
FROM read_parquet('../source_data/tables/subjects/*.parquet')
```

Note: The `../source_data/` path references the parent directory where source data is located, ensuring consistency between local testing and gear execution.
```mermaid
graph LR;
    A[dbt Project ZIP]:::input --> E;
    B[External Storage<br>Source Data]:::storage --> E;
    E((dbt Runner<br>Gear)):::gear;
    E --> F[External Storage<br>Transformed Tables]:::storage;
    E --> G[Analysis<br>dbt Artifacts]:::container;
    classDef container fill:#57d,color:#fff
    classDef input fill:#7a9,color:#fff
    classDef gear fill:#659,color:#fff
    classDef storage fill:#f90,color:#fff
```
Workflow Steps:
- Upload dbt project zip file to Flywheel (attach to any container)
- Configure gear with storage_label, source_prefix, and output_prefix
- Run gear (typically at Analysis level)
- Gear validates inputs and downloads source data
- Gear executes dbt transformations
- Transformed tables are uploaded to external storage at output_prefix
- dbt artifacts are saved to the Analysis container
Conditions:
- dbt project has been tested locally with sample data
- Source dataset exists in external storage (Flywheel dataset export)
- dbt models reference source parquet files correctly
- Models do not require incremental updates (incremental materialization is not supported in v0.1.0)
Description:
A data analyst has created dbt models to join subjects, sessions, and acquisitions tables with file metadata to create enriched analytics tables. They test the project locally using a sample dataset, then upload the dbt project zip to Flywheel and run the gear against the full production dataset in external storage. The gear executes all transformations and writes the final tables back to external storage for downstream analysis.
Conditions:
- Previous dbt run produced unexpected results
- Need to inspect compiled SQL and execution details
- dbt artifacts (manifest.json, run_results.json) are needed
Description:
A data engineer notices unexpected results from a dbt transformation. They download the manifest.json and run_results.json artifacts from the Analysis output to review the compiled SQL and understand which models succeeded or failed. They can trace through the lineage and identify the issue before updating their dbt project.
The gear provides detailed logging throughout execution:
- Validation Phase - Reports on configuration validation, dbt project structure, and storage access
- Download Phase - Logs number of files downloaded and their locations
- dbt Execution - Shows complete dbt debug and run output
- Upload Phase - Reports on files uploaded to external storage
- Errors - Detailed error messages with context for troubleshooting
Log output is structured in numbered phases (e.g., [1/7], [2/7]) for easy tracking of progress.
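The numbered-phase log style described above can be sketched as follows. The phase names and the seven-phase count are illustrative assumptions, not the gear's exact phases.

```python
# Illustrative numbered-phase logging in the "[1/7] ..." style (assumed phases).
import logging

PHASES = ["Validate config", "Validate dbt project", "Check storage access",
          "Download source data", "Run dbt", "Upload results", "Save artifacts"]

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("dbt-runner")

def log_phase(index: int) -> str:
    """Log and return a progress line like '[1/7] Validate config'."""
    msg = f"[{index + 1}/{len(PHASES)}] {PHASES[index]}"
    log.info(msg)
    return msg
```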
- Upload opt-in via `meta.upload`: Models must declare `meta: {upload: "<path>"}` in their config to be uploaded. The path is relative to `target/`. Models without `meta.upload` are not uploaded.
- String paths only: The `meta.upload` value must be a string. Non-string values (e.g. `True`) are logged as a warning and skipped.
- Subfolder structure: The `meta.upload` path is used as the relative path in external storage. For example, `meta: {upload: 'main/summary.parquet'}` uploads to `{output_prefix}/main/summary.parquet`.
- Auto-creation: Target subdirectories are automatically created before dbt runs by scanning model files for location configurations.
- Upload control: Any model that declares `meta.upload` is uploaded, regardless of materialization type. This works with `external`, `table`, and Python models.
- dbt adapter: Only the dbt-duckdb adapter is supported. Other adapters (Snowflake, BigQuery, etc.) are not compatible.
- Data formats: Source data must be in Parquet format. CSV, JSON, and other formats require preprocessing.
- Dataset schema: Source data must follow the Flywheel dataset schema with a `tables/` directory containing container-level subdirectories (`subjects/`, `sessions/`, etc.).
- Relative paths: All source data references in dbt models must use the `../source_data/` prefix (e.g., `read_parquet('../source_data/tables/subjects/*.parquet')`).
- Directory structure: The gear creates `source_data/` as a sibling directory to the dbt project, not as a child. Local testing should replicate this structure.
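The upload rules above (string-only `meta.upload` values resolved under `output_prefix`, everything else skipped) can be sketched as a small resolution function. This is an assumed illustration, not the gear's actual implementation.

```python
# Illustrative resolution of meta.upload values into destination keys
# under output_prefix, mirroring the upload rules (assumed, not gear code).
import posixpath

def resolve_uploads(models: dict, output_prefix: str) -> dict:
    """Map model name -> destination key; skip models without a string
    meta.upload value."""
    destinations = {}
    for name, meta in models.items():
        upload = (meta or {}).get("upload")
        if upload is None:
            continue  # no meta.upload: model is not uploaded
        if not isinstance(upload, str):
            # Non-string values are warned about and skipped.
            print(f"warning: skipping {name}: meta.upload must be a string")
            continue
        destinations[name] = posixpath.join(output_prefix, upload)
    return destinations
```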
- Full rebuild: Each gear run performs a complete rebuild of all models. No incremental processing is supported in this version.
- No state persistence: The gear does not maintain dbt state between runs. Each execution starts fresh.
- Single-threaded: While dbt can use multiple threads (configured in profiles.yml), all work happens within a single gear container.
- Write permissions: The gear requires write access to the `output_prefix` in external storage.
- Overwrite behavior: Existing files at the output location are overwritten without warning.
- No cleanup: The gear does not remove old output files. Previous runs' outputs remain in storage unless manually deleted.
- No incremental models: Incremental materialization is not supported. All models are rebuilt on every run.
- No seeds support: dbt seed files are included in the project but not actively managed by the gear.
- No snapshot support: dbt snapshot functionality is not tested or supported.
- Limited error recovery: If dbt run fails mid-execution, partial results are not uploaded. The gear returns an error and no outputs are written to external storage.
- File size constraints: Very large model outputs may encounter memory limits within the gear container. Reaching out to Flywheel support is recommended for large datasets so that appropriate resources can be allocated.
For more information about how to get started contributing to this gear, check out CONTRIBUTING.md.