Skip to content

SCD_TYPE_2_BY_COLUMN: opt-in strict attributes on close (no COALESCE(target, source) for updated_rows) #5757

@nathantapsas

Description

@nathantapsas

Summary

When SCD_TYPE_2_BY_COLUMN detects a change in check_columns (e.g. a row-level hash), SQLMesh rebuilds existing rows in the updated_rows CTE inside _scd_type_2 (sqlmesh/core/engine_adapter/base.py). For every unmanaged attribute it currently uses:

COALESCE(t_<col>, source.<col>)
— i.e. prefer the previous open row (t_*), but if that value is NULL, take the incoming batch (source).

Why this hurts

  1. NULL backfill from the next batch
    If the currently open row has NULL and the next daily batch has a value, the closed period’s row can show the new value even though it was NULL for the whole period you just closed. The physical table no longer matches that day’s snapshot.

  2. Erasures (value → NULL) are mishandled in the same pattern
    COALESCE(previous_value, NULL) keeps the old value. So when the new batch clears a field, the closed row may still show the old non-NULL while you expected NULL for the period that just ended—unless the change detection also fires purely on that column in a way that matches your mental model.

  3. Symptom
    After loading a later snapshot, two SCD versions can look identical on business columns even though the underlying snapshot rows differed (e.g. NULL vs populated on a non-tracked attribute). That undermines audits and “as-of” reporting against clients_snapshot.

Desired behavior (strict / snapshot-faithful SCD2)

When a version is closed because check_columns changed:

Attribute columns on that closed row should still describe that version’s row as it was while open—typically the values from t_* (the previous open/target row), not values filled in from the next batch’s source.
When a new version is inserted, it should still come from the current batch’s source as today.

So we want a model-level option on SCD_TYPE_2_BY_COLUMN, for example:

scd_attribute_close_mode: coalesce_null (current default) vs strict (or boolean disable_coalesce_on_updated_rows),
where strict means: for updated_rows, emit unmanaged attributes as t_ only (no COALESCE with source), so:

NULL in the open row stays NULL on the closed row when the next batch fills the field in the new row only.
Clearing a field in the next batch does not leave stale non-NULL on the closed row because of COALESCE.
(Exact naming and defaults are for maintainers; the behavioral split is what matters.)

Motivation

We load daily snapshots. The hash can change between days when e.g. a field is NULL in one extract and populated in another. After the later run, two SCD rows can look the same on business columns because the closed row was rewritten with COALESCE(latest, source). We need an optional mode that keeps closed attribute columns aligned with what was actually true for that version.

References

Implementation: sqlmesh/core/engine_adapter/base.py, _scd_type_2, CTE updated_rows (the COALESCE over prefixed_unmanaged_columns and joined source columns).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions