Informatica provider: Add SQL auto-lineage and selective lineage control #66612
@cetingokhan: A reviewer (@RNHTTR) has requested changes on this PR, so I've removed the Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer (a real person) will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you.

@cetingokhan Would you mind resolving the merge conflict? Also, have you manually tested the functionality of this PR against a real Informatica environment? I think it'd be super helpful to actually run this against EDC and make sure that
As you can see, I added an Informatica EDC simulation app to the dev folder. The simulator behaves like Informatica in the ways we need for testing lineage, so I tested with the simulator and it works well :) I have also been trying to get a trial or demo Informatica environment from Informatica, but nobody has responded yet :) I will keep reminding them about this :) I think you can approve the PR. I'll keep an eye out for any issues, and if we need to test it on Informatica, I can use one of the customers' environments. :)
…rol (apache#66612) * added auto-lineage support for sql operators * docs(informatica): add sqlglot parsing limitations note to usage guide * added new version details into the changelog * fixed checks results * fixed checks results * added pre_execute for fail-fast validation * docs errror fixes * fix merge comment --------- Co-authored-by: Ryan Hatter <25823361+RNHTTR@users.noreply.github.com>
Add automatic SQL lineage detection and per-task/DAG lineage control to the Informatica provider (`apache-airflow-providers-informatica`).

Previously the provider only supported manual lineage through explicit `inlets`/`outlets` declarations. This PR extends it with:

### Automatic SQL Lineage

- `lineage/sql_parser.py`: module that uses sqlglot to parse SQL and extract source and target tables from `SELECT`, `INSERT INTO`, `CREATE TABLE AS SELECT`, and `MERGE INTO` statements.
- `lineage/resolver.py`: infers the SQL dialect from the task's connection ID string (e.g. `postgres_conn_id → postgres`, `snowflake → snowflake`) and resolves parsed table references against the Informatica EDC catalog. Supports 13 dialects: PostgreSQL, MySQL, Snowflake, BigQuery, Databricks, Redshift, SQLite, Oracle, Trino, Presto, Hive, Spark, and MSSQL (T-SQL).
- With `auto_lineage_enabled = True` (the new default), the listener automatically detects SQL operators, parses their SQL, and creates lineage links; no `inlets`/`outlets` are required on the task.

### Fail-fast validation (two-phase listener)

`InformaticaListener` now follows a two-phase model:

- `on_task_instance_running`: pre-validates and resolves all inlet/outlet URIs (manual) or parsed table references (auto) before the operator's `execute()` is called. Introduces `InformaticaLineageResolutionError`, which immediately fails the task when any URI or table cannot be resolved in the Informatica catalog. Resolved EDC object IDs are cached in memory.
- `on_task_instance_success`: creates lineage links using the cached IDs, avoiding a second round of EDC calls.
- `on_task_instance_failed`: clears the cache to prevent stale state.

### Selective lineage control

- `lineage/selective.py` with `disable_informatica_lineage(task_or_dag)` and `enable_informatica_lineage(task_or_dag)` helpers, exported from `airflow.providers.informatica.lineage`. These let users opt individual tasks or entire DAGs out of automatic lineage without touching `inlets`/`outlets`.
- `disabled_for_operators` config option to exclude entire operator classes (e.g. `BashOperator`) from lineage tracking via `airflow.cfg`.

### New configuration options (`[informatica]` section in `airflow.cfg`)

- `auto_lineage_enabled` (bool, default `True`): enable/disable SQL auto-lineage globally.
- `disabled_for_operators` (str, default `""`): semicolon-separated FQCNs of operator classes to skip.
- `request_timeout` (int, default `30`): timeout in seconds for EDC REST API calls.

### Other changes

- `example_dags/example_informatica_lineage.py` demonstrating all four modes: auto-lineage, manual lineage, per-task disable, and operator-class exclusion.
- `docs/guides/usage.rst` with comprehensive documentation for all new features.
- `is_operator_disabled()` added to `conf.py` for per-operator lookup.

Manual lineage still takes priority: if a task has any `inlets` or `outlets` defined, SQL parsing is skipped entirely.

closes: #ISSUE
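For readers who want a feel for the dialect inference described for `lineage/resolver.py`, here is a minimal sketch. It is not the code from this PR: the function name `infer_dialect`, the `_DIALECT_HINTS` table, and the substring-matching rule are all assumptions made for illustration. The values on the right are sqlglot dialect names.

```python
from typing import Optional

# Hypothetical hint table (an assumption, not the provider's actual mapping).
# Keys are substrings searched for in the connection ID; values are sqlglot
# dialect names.
_DIALECT_HINTS = {
    "postgres": "postgres",
    "mysql": "mysql",
    "snowflake": "snowflake",
    "bigquery": "bigquery",
    "databricks": "databricks",
    "redshift": "redshift",
    "sqlite": "sqlite",
    "oracle": "oracle",
    "trino": "trino",
    "presto": "presto",
    "hive": "hive",
    "spark": "spark",
    "mssql": "tsql",
}


def infer_dialect(conn_id: str) -> Optional[str]:
    """Return a dialect name guessed from a connection ID, or None if no hint matches."""
    lowered = conn_id.lower()
    for hint, dialect in _DIALECT_HINTS.items():
        if hint in lowered:
            return dialect
    return None
```

Under these assumptions, `infer_dialect("postgres_conn_id")` yields `"postgres"`, and an ID with no recognizable hint yields `None`, which a caller could treat as "let the parser use its default dialect".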
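The two-phase listener flow (resolve and cache on running, link on success, clear on failure) can be sketched in plain Python. This is an illustrative stand-in, not the PR's `InformaticaListener`: the class `TwoPhaseLineage`, its method names, and the `lookup`/`create_link` callables are hypothetical, and `LineageResolutionError` only plays the role of `InformaticaLineageResolutionError`.

```python
from typing import Callable, Dict, List, Optional, Tuple


class LineageResolutionError(Exception):
    """Stand-in for the PR's InformaticaLineageResolutionError."""


class TwoPhaseLineage:
    """Hypothetical sketch of the resolve-then-link flow with an in-memory cache."""

    def __init__(
        self,
        lookup: Callable[[str], Optional[str]],
        create_link: Callable[[str, str], None],
    ) -> None:
        self._lookup = lookup            # reference -> catalog object ID, or None
        self._create_link = create_link  # (source_id, target_id) -> None
        self._cache: Dict[str, Tuple[List[str], List[str]]] = {}

    def _resolve(self, refs: List[str]) -> List[str]:
        ids = []
        for ref in refs:
            object_id = self._lookup(ref)
            if object_id is None:
                # Fail fast: raised before the operator would run.
                raise LineageResolutionError(f"cannot resolve {ref!r} in the catalog")
            ids.append(object_id)
        return ids

    def on_running(self, task_key: str, inlets: List[str], outlets: List[str]) -> None:
        # Phase 1: resolve everything up front and cache the resolved IDs.
        self._cache[task_key] = (self._resolve(inlets), self._resolve(outlets))

    def on_success(self, task_key: str) -> None:
        # Phase 2: link using cached IDs only, with no second catalog lookup.
        sources, targets = self._cache.pop(task_key, ([], []))
        for source_id in sources:
            for target_id in targets:
                self._create_link(source_id, target_id)

    def on_failed(self, task_key: str) -> None:
        # Drop cached IDs so a retry starts from a clean state.
        self._cache.pop(task_key, None)
```

Keeping resolution in the first phase means a missing catalog object stops the task before any work is done, and the success hook only has to create links.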
Was generative AI tooling used to co-author this PR?
Generated-by: GitHub Copilot (Claude Sonnet 4.6) following the guidelines