dtsong/data-diff

 
 

data-diff -- Efficiently diff rows across databases

Note: This project is maintained by the community after Datafold sunset the project in May 2024.

data-diff is an open-source CLI and Python library for efficiently comparing data across 13+ database engines. It uses bisection and checksumming to find differing rows without transferring entire tables, making it fast even on tables with millions of rows.
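
To make the bisection-and-checksum idea concrete, here is a minimal in-memory sketch. It is not data-diff's implementation: the names `segment_checksum` and `diff_segment`, the dict-based tables, the integer keys, and the two-way split are all assumptions chosen for illustration. The real tool computes checksums inside the database so that rows only leave the server for segments that differ.

```python
import hashlib
from typing import Dict, Iterator, Tuple

# Hypothetical in-memory "table": integer key -> tuple of remaining column values.
Table = Dict[int, Tuple]


def segment_checksum(table: Table, lo: int, hi: int) -> str:
    """Checksum of every row whose key falls in the half-open range [lo, hi)."""
    digest = hashlib.md5()
    for key in sorted(k for k in table if lo <= k < hi):
        digest.update(repr((key, table[key])).encode())
    return digest.hexdigest()


def diff_segment(
    t1: Table, t2: Table, lo: int, hi: int, threshold: int = 4
) -> Iterator[Tuple[str, Tuple]]:
    """Yield ('-', row) for rows only in t1 and ('+', row) for rows only in t2."""
    # Matching checksums: the whole key range is treated as identical and skipped.
    if segment_checksum(t1, lo, hi) == segment_checksum(t2, lo, hi):
        return
    # Small enough range: compare the rows themselves.
    if hi - lo <= threshold:
        rows1 = {(k, t1[k]) for k in t1 if lo <= k < hi}
        rows2 = {(k, t2[k]) for k in t2 if lo <= k < hi}
        for row in sorted(rows1 - rows2):
            yield ("-", row)
        for row in sorted(rows2 - rows1):
            yield ("+", row)
        return
    # Otherwise split the key range in half and recurse into each half.
    mid = (lo + hi) // 2
    yield from diff_segment(t1, t2, lo, mid, threshold)
    yield from diff_segment(t1, t2, mid, hi, threshold)


if __name__ == "__main__":
    left = {i: (f"name{i}",) for i in range(100)}
    right = dict(left)
    right[42] = ("changed",)  # modified row
    del right[7]              # row missing from the second table
    for sign, row in diff_segment(left, right, 0, 100):
        print(sign, row)
```

Only the key ranges whose checksums disagree are split further, so most of a largely identical table is ruled out after a handful of checksum comparisons.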

Installation

pip install data-diff

Install with database-specific extras:

pip install 'data-diff[postgresql,mysql]'

Quick Start

CLI

data-diff \
  postgresql://user:password@localhost/db1 table1 \
  postgresql://user:password@localhost/db2 table2 \
  --key-columns id \
  --columns name,email,updated_at

Python API

import data_diff

diff = data_diff.diff_tables(
    table1=data_diff.connect_to_table("postgresql://localhost/db1", "table1", "id"),
    table2=data_diff.connect_to_table("postgresql://localhost/db2", "table2", "id"),
)

for sign, row in diff:
    print(sign, row)  # '-' = row only in table1, '+' = row only in table2
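
The `(sign, row)` pairs can be grouped into added, removed, and changed keys. The helper below is a sketch, not part of the data-diff API: `classify_diff` is a hypothetical name, and it assumes each row is a tuple whose first element is the key column, that `'-'` marks a row found only in the first table, and that a modified row shows up once with each sign.

```python
from typing import Dict, Iterable, List, Tuple


def classify_diff(
    diff: Iterable[Tuple[str, Tuple]], key_index: int = 0
) -> Dict[str, List]:
    """Group diff output by key into added, removed, and changed."""
    only_in_1: Dict = {}
    only_in_2: Dict = {}
    for sign, row in diff:
        target = only_in_1 if sign == "-" else only_in_2
        target[row[key_index]] = row
    return {
        # Key appears only with '+': present in table2, absent from table1.
        "added": sorted(only_in_2.keys() - only_in_1.keys()),
        # Key appears only with '-': present in table1, absent from table2.
        "removed": sorted(only_in_1.keys() - only_in_2.keys()),
        # Key appears with both signs: same key, different column values.
        "changed": sorted(only_in_1.keys() & only_in_2.keys()),
    }


if __name__ == "__main__":
    sample = [
        ("-", ("1", "alice@old.example")),
        ("+", ("1", "alice@new.example")),
        ("-", ("2", "bob@example.com")),
        ("+", ("3", "carol@example.com")),
    ]
    print(classify_diff(sample))
```

Passing the iterator returned by `diff_tables` in place of `sample` should work under the same assumptions, but check the row shape your version actually yields.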

Supported Databases

| Database   | Tier         | Known Limitations                                          |
|------------|--------------|------------------------------------------------------------|
| PostgreSQL | Production   |                                                            |
| MySQL      | Production   |                                                            |
| DuckDB     | Production   |                                                            |
| Redshift   | Stable       | Extends PostgreSQL driver                                  |
| Snowflake  | Stable       | No optimizer hints                                         |
| Presto     | Stable       | No primary key detection; fixed precision defaults         |
| Vertica    | Stable       |                                                            |
| Databricks | Stable       | No unique constraint support                               |
| MsSQL      | Limited      | No session timezone; no OFFSET support                     |
| BigQuery   | Limited      | No session timezone; array/struct compared as JSON         |
| ClickHouse | Limited      | Complex decimal normalization                              |
| Oracle     | Limited      | No OFFSET support; no EXPLAIN                              |
| Trino      | Experimental | Minimal driver extending Presto; may have SQL divergences  |

Tier definitions:

  • Production — All methods implemented, dedicated tests, CI coverage
  • Stable — Core functionality works, minor limitations noted above
  • Limited — Usable but missing some features; cross-database comparisons may have edge cases
  • Experimental — Minimal implementation, use with caution
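
When planning a cross-database comparison, it can help to look up the tier of both sides first. The sketch below copies the tiers from the table above into a dictionary. The `weakest_tier` helper and the rule that a comparison is only as well supported as its weaker side are assumptions for illustration, not something data-diff exposes or enforces.

```python
# Tiers copied from the Supported Databases table.
TIERS = {
    "PostgreSQL": "Production",
    "MySQL": "Production",
    "DuckDB": "Production",
    "Redshift": "Stable",
    "Snowflake": "Stable",
    "Presto": "Stable",
    "Vertica": "Stable",
    "Databricks": "Stable",
    "MsSQL": "Limited",
    "BigQuery": "Limited",
    "ClickHouse": "Limited",
    "Oracle": "Limited",
    "Trino": "Experimental",
}

# Ordered from least to most mature.
TIER_ORDER = ["Experimental", "Limited", "Stable", "Production"]


def weakest_tier(db1: str, db2: str) -> str:
    """Return the less mature of the two databases' tiers (hypothetical helper)."""
    return min((TIERS[db1], TIERS[db2]), key=TIER_ORDER.index)


if __name__ == "__main__":
    print(weakest_tier("PostgreSQL", "Oracle"))
```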

dbt Integration

data-diff integrates with dbt to compare tables between development and production environments:

data-diff --dbt

Install with dbt support:

pip install 'data-diff[dbt]'

See the full documentation for configuration details.

Windows Support

The pip install data-diff command works natively on Windows, macOS, and Linux. Database-specific extras install the same way across all platforms.

For development, the Makefile and docker compose workflows assume a Unix-like shell. On Windows, use WSL (recommended) or Git Bash to run make test, make up, and other development commands.

Documentation

Contributors

License

This project is licensed under the terms of the MIT License.
