
Informatica provider: Add SQL auto-lineage and selective lineage control#66612

Merged
RNHTTR merged 14 commits into apache:main from cetingokhan:informatica-provider-v0.2.0
Jun 15, 2026


@cetingokhan

Contributor

Add automatic SQL lineage detection and per-task/DAG lineage control to the Informatica provider (apache-airflow-providers-informatica).

Previously the provider only supported manual lineage through explicit inlets/outlets declarations.
This PR extends it with:

Automatic SQL Lineage

  • Adds a lineage/sql_parser.py module that uses sqlglot to parse SQL and extract source and target tables from SELECT, INSERT INTO, CREATE TABLE AS SELECT, and MERGE INTO statements.
  • Adds lineage/resolver.py which infers the SQL dialect from the task's connection ID string (e.g. postgres_conn_id → postgres, snowflake → snowflake) and resolves parsed table references against the Informatica EDC catalog. Supports 13 dialects: PostgreSQL, MySQL, Snowflake, BigQuery, Databricks, Redshift, SQLite, Oracle, Trino, Presto, Hive, Spark, and MSSQL (T-SQL).
  • When auto_lineage_enabled = True (the new default), the listener automatically detects SQL operators, parses their SQL, and creates lineage links — no inlets/outlets required on the task.
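
The dialect inference described above can be illustrated with a minimal sketch. This is not the code in lineage/resolver.py: the keyword table and the `infer_dialect` name are assumptions made for illustration, and only the 13 dialects and the "connection ID string to dialect" idea come from the description.

```python
from __future__ import annotations

# Hypothetical keyword table (an assumption, not the provider's actual mapping):
# a substring found in the connection ID -> the sqlglot dialect name to parse with.
_DIALECT_KEYWORDS: list[tuple[str, str]] = [
    ("postgres", "postgres"),
    ("mysql", "mysql"),
    ("snowflake", "snowflake"),
    ("bigquery", "bigquery"),
    ("databricks", "databricks"),
    ("redshift", "redshift"),
    ("sqlite", "sqlite"),
    ("oracle", "oracle"),
    ("trino", "trino"),
    ("presto", "presto"),
    ("hive", "hive"),
    ("spark", "spark"),
    ("mssql", "tsql"),  # MSSQL is parsed with the T-SQL dialect
]


def infer_dialect(conn_id: str) -> str | None:
    """Return a dialect name for a connection ID, or None if nothing matches."""
    lowered = conn_id.lower()
    for keyword, dialect in _DIALECT_KEYWORDS:
        if keyword in lowered:
            return dialect
    return None


if __name__ == "__main__":
    for conn in ("postgres_conn_id", "my_snowflake_dwh", "mssql_default", "http_default"):
        print(conn, "->", infer_dialect(conn))
```

Returning `None` for an unrecognized connection ID leaves the fallback decision (generic parsing or skipping auto-lineage) to the caller. How the provider handles that case is not stated here.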

Fail-fast validation (two-phase listener)

  • Refactors InformaticaListener to a two-phase model:
    • on_task_instance_running — pre-validates and resolves all inlet/outlet URIs (manual) or parsed table references (auto) before the operator's execute() is called. Introduces InformaticaLineageResolutionError which immediately fails the task when any URI or table cannot be resolved in the Informatica catalog. Resolved EDC object IDs are cached in memory.
    • on_task_instance_success — creates lineage links using the cached IDs, avoiding a second round of EDC calls.
    • on_task_instance_failed — clears the cache to prevent stale state.
  • This prevents silent lineage gaps: tasks that reference catalog objects not present in EDC now fail clearly before execution rather than succeeding with missing lineage.
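
The two-phase flow above can be sketched without Airflow or EDC. Everything here is a stand-in: `TwoPhaseLineage`, `LineageResolutionError`, and the dict-based catalog are hypothetical names used only to show the resolve, cache, link, and clear steps. The real InformaticaListener hooks and EDC calls may look quite different.

```python
from __future__ import annotations


class LineageResolutionError(Exception):
    """Stand-in for InformaticaLineageResolutionError (hypothetical)."""


class TwoPhaseLineage:
    def __init__(self, catalog: dict[str, str]) -> None:
        self._catalog = catalog  # assumption: URI -> EDC object ID lookup table
        self._cache: dict[str, dict[str, list[str]]] = {}
        self.links: list[tuple[str, str]] = []  # (source ID, target ID) pairs created

    def _resolve(self, uris: list[str]) -> list[str]:
        missing = [uri for uri in uris if uri not in self._catalog]
        if missing:
            # Phase 1 fails fast: the task should not run with unresolved objects.
            raise LineageResolutionError(f"not found in catalog: {missing}")
        return [self._catalog[uri] for uri in uris]

    def on_running(self, task_key: str, inlets: list[str], outlets: list[str]) -> None:
        # Resolve everything up front and cache the IDs for the success phase.
        self._cache[task_key] = {"in": self._resolve(inlets), "out": self._resolve(outlets)}

    def on_success(self, task_key: str) -> None:
        # Phase 2 uses only cached IDs, so no second catalog lookup is needed.
        ids = self._cache.pop(task_key, None)
        if ids is None:
            return
        for source_id in ids["in"]:
            for target_id in ids["out"]:
                self.links.append((source_id, target_id))

    def on_failed(self, task_key: str) -> None:
        # Drop cached IDs so a later retry starts from a clean state.
        self._cache.pop(task_key, None)
```

In this sketch an unresolved URI raises before anything is cached, so a failed validation cannot leave partial state behind for the success phase.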

Selective lineage control

  • Adds lineage/selective.py with disable_informatica_lineage(task_or_dag) and enable_informatica_lineage(task_or_dag) helpers, exported from airflow.providers.informatica.lineage. These let users opt individual tasks or entire DAGs out of automatic lineage without touching inlets/outlets.
  • Adds disabled_for_operators config option to exclude entire operator classes (e.g. BashOperator) from lineage tracking via airflow.cfg.

New configuration options ([informatica] section in airflow.cfg):

  • auto_lineage_enabled (bool, default True) — enable/disable SQL auto-lineage globally.
  • disabled_for_operators (str, default "") — semicolon-separated FQCNs of operator classes to skip.
  • request_timeout (int, default 30) — timeout in seconds for EDC REST API calls.
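
As a rough illustration of the three options, the sketch below parses an [informatica] section with the standard-library configparser. It does not use Airflow's own configuration API. `load_settings` and `operator_is_disabled` are hypothetical helpers (the latter only stands in for the idea behind is_operator_disabled()), and the class paths in the sample are illustrative values.

```python
import configparser

# Illustrative airflow.cfg fragment; the operator class paths are example values only.
SAMPLE_CFG = """
[informatica]
auto_lineage_enabled = True
disabled_for_operators = airflow.providers.standard.operators.bash.BashOperator;my_pkg.operators.NoLineageOperator
request_timeout = 30
"""


def load_settings(text: str) -> dict:
    """Parse the [informatica] section, applying the defaults listed above."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["informatica"]
    raw = section.get("disabled_for_operators", fallback="")
    return {
        "auto_lineage_enabled": section.getboolean("auto_lineage_enabled", fallback=True),
        # Semicolon-separated FQCNs -> a set, ignoring blanks and stray whitespace.
        "disabled_for_operators": {name.strip() for name in raw.split(";") if name.strip()},
        "request_timeout": section.getint("request_timeout", fallback=30),
    }


def operator_is_disabled(fqcn: str, settings: dict) -> bool:
    """Hypothetical per-operator lookup by fully qualified class name."""
    return fqcn in settings["disabled_for_operators"]


if __name__ == "__main__":
    settings = load_settings(SAMPLE_CFG)
    print(settings["request_timeout"])
    print(operator_is_disabled("my_pkg.operators.NoLineageOperator", settings))
```

Storing the excluded classes as a set keeps the per-task lookup cheap, which matters if the check runs on every task instance.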

Other changes

  • Adds example_dags/example_informatica_lineage.py demonstrating all four modes: auto-lineage, manual lineage, per-task disable, and operator-class exclusion.
  • Updates docs/guides/usage.rst with comprehensive documentation for all new features.
  • Adds is_operator_disabled() to conf.py for per-operator lookup.

Manual lineage still takes priority — if a task has any inlets or outlets defined, SQL parsing is skipped entirely.
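
The priority rule can be written as a small decision function. This is only a sketch of the rule as stated in this description. `pick_lineage_source` is a hypothetical name, and the per-task and per-operator disable checks are deliberately left out because their ordering relative to manual lineage is not spelled out here.

```python
def pick_lineage_source(inlets, outlets, sql, auto_lineage_enabled=True) -> str:
    """Return "manual", "auto", or "none" for a task (illustrative only)."""
    if inlets or outlets:
        # Any declared inlet or outlet wins; the SQL is never parsed.
        return "manual"
    if auto_lineage_enabled and sql:
        # No manual declarations: fall back to parsing the task's SQL.
        return "auto"
    return "none"


if __name__ == "__main__":
    print(pick_lineage_source(["db://a"], [], "SELECT 1"))
    print(pick_lineage_source([], [], "INSERT INTO t SELECT * FROM s"))
    print(pick_lineage_source([], [], None))
```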

closes: #ISSUE


Was generative AI tooling used to co-author this PR?
  • Yes — GitHub Copilot (Claude Sonnet 4.6, Claude Opus 4.6, GPT-5.3-Codex)

Generated-by: GitHub Copilot (Claude Sonnet 4.6) following the guidelines

@potiuk potiuk force-pushed the informatica-provider-v0.2.0 branch from f6677da to 144c81d Compare May 9, 2026 20:30
@cetingokhan cetingokhan force-pushed the informatica-provider-v0.2.0 branch from 6af66d4 to 32102fe Compare May 10, 2026 21:49
@potiuk potiuk added the ready for maintainer review Set after triaging when all criteria pass. label May 11, 2026
Comment thread providers/informatica/src/airflow/providers/informatica/plugins/listener.py Outdated
@cetingokhan cetingokhan requested a review from RNHTTR June 3, 2026 12:49
@potiuk potiuk removed the ready for maintainer review Set after triaging when all criteria pass. label Jun 9, 2026
@potiuk

potiuk commented Jun 9, 2026

Member

@cetingokhan — A reviewer (@RNHTTR) has requested changes on this PR, so I've removed the ready for maintainer review label — the next step is on your side. Could you address the review comments (push a fix, or reply in-thread explaining why the feedback doesn't apply)? Once addressed, re-request review from @RNHTTR or re-mark the PR ready and it returns to the maintainer queue. Thank you.

Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you.

@RNHTTR

RNHTTR commented Jun 11, 2026

Contributor

@cetingokhan Would you mind resolving the merge conflict?

Also, have you manually tested the functionality of this PR against a real Informatica environment? I think it'd be super helpful to actually run this against edc and make sure that pre_execute works as expected. Once you've been able to do that, I'll be happy to approve :)

@cetingokhan

Contributor Author

> @cetingokhan Would you mind resolving the merge conflict?
>
> Also, have you manually tested the functionality of this PR against a real Informatica environment? I think it'd be super helpful to actually run this against edc and make sure that pre_execute works as expected. Once you've been able to do that, I'll be happy to approve :)

As you can see, I added an Informatica EDC simulation app to the dev folder. The simulator behaves like Informatica for the parts we need to test lineage, so I tested against the simulator and it works well :) I have also been trying to get a trial or demo Informatica environment from Informatica, but nobody has responded yet :) I will keep following up with them about this :)

I think you can approve the PR. I'll keep an eye out for any issues, and if we need to test it on Informatica, I can use one of the customers' environments. :)

@RNHTTR RNHTTR merged commit 6f5172b into apache:main Jun 15, 2026
295 checks passed
pgagnon pushed a commit to pgagnon/airflow that referenced this pull request Jun 15, 2026
…rol (apache#66612)

* added auto-lineage support for sql operators

* docs(informatica): add sqlglot parsing limitations note to usage guide

* added new version details into the changelog

* fixed checks results

* fixed checks results

* added pre_execute for fail-fast validation

* docs errror fixes

* fix merge comment

---------

Co-authored-by: Ryan Hatter <25823361+RNHTTR@users.noreply.github.com>
imrichardwu pushed a commit to imrichardwu/airflow that referenced this pull request Jun 16, 2026
dingo4dev pushed a commit to dingo4dev/airflow that referenced this pull request Jun 16, 2026
RulerChen pushed a commit to RulerChen/airflow that referenced this pull request Jun 16, 2026
Lee-W pushed a commit that referenced this pull request Jun 17, 2026
Signed-off-by: PoAn Yang <payang@apache.org>