-
Notifications
You must be signed in to change notification settings - Fork 370
SCD_TYPE_2_BY_COLUMN: opt-in strict attributes on close (no COALESCE(target, source) for updated_rows) #5757
Description
Summary
When SCD_TYPE_2_BY_COLUMN detects a change in check_columns (e.g. a row-level hash), SQLMesh rebuilds existing rows in the updated_rows CTE inside _scd_type_2 (sqlmesh/core/engine_adapter/base.py). For every unmanaged attribute it currently uses:
COALESCE(t_<col>, source.<col>)
— i.e. prefer the previous open row (t_*), but if that value is NULL, take the incoming batch (source).
Why this hurts
-
NULL backfill from the next batch
If the currently open row has NULL and the next daily batch has a value, the closed period’s row can show the new value even though it was NULL for the whole period you just closed. The physical table no longer matches that day’s snapshot. -
Erasures (value → NULL) are mishandled in the same pattern
COALESCE(previous_value, NULL) keeps the old value. So when the new batch clears a field, the closed row may still show the old non-NULL while you expected NULL for the period that just ended—unless the change detection also fires purely on that column in a way that matches your mental model. -
Symptom
After loading a later snapshot, two SCD versions can look identical on business columns even though the underlying snapshot rows differed (e.g. NULL vs populated on a non-tracked attribute). That undermines audits and “as-of” reporting against clients_snapshot.
Desired behavior (strict / snapshot-faithful SCD2)
When a version is closed because check_columns changed:
Attribute columns on that closed row should still describe that version’s row as it was while open—typically the values from t_* (the previous open/target row), not values filled in from the next batch’s source.
When a new version is inserted, it should still come from the current batch’s source as today.
So we want a model-level option on SCD_TYPE_2_BY_COLUMN, for example:
scd_attribute_close_mode: coalesce_null (current default) vs strict (or boolean disable_coalesce_on_updated_rows),
where strict means: for updated_rows, emit unmanaged attributes as t_ only (no COALESCE with source), so:
NULL in the open row stays NULL on the closed row when the next batch fills the field in the new row only.
Clearing a field in the next batch does not leave stale non-NULL on the closed row because of COALESCE.
(Exact naming and defaults are for maintainers; the behavioral split is what matters.)
Motivation
We load daily snapshots. The hash can change between days when e.g. a field is NULL in one extract and populated in another. After the later run, two SCD rows can look the same on business columns because the closed row was rewritten with COALESCE(latest, source). We need an optional mode that keeps closed attribute columns aligned with what was actually true for that version.
References
Implementation: sqlmesh/core/engine_adapter/base.py, _scd_type_2, CTE updated_rows (the COALESCE over prefixed_unmanaged_columns and joined source columns).