Skip to content

Enforce SQLToolset allowed_tables on queries, not just discovery#68487

Merged
kaxil merged 2 commits into
apache:mainfrom
astronomer:sql-toolset-allowed-tables-enforcement
Jun 16, 2026
Merged

Enforce SQLToolset allowed_tables on queries, not just discovery#68487
kaxil merged 2 commits into
apache:mainfrom
astronomer:sql-toolset-allowed-tables-enforcement

Conversation

@kaxil

@kaxil kaxil commented Jun 12, 2026

Copy link
Copy Markdown
Member

SQLToolset(allowed_tables=[...]) previously restricted only metadata discovery (list_tables/get_schema); the query and check_query tools never checked it, so an agent could read any table by name (SELECT * FROM secret). This makes allowed_tables an enforced boundary on the query tools too: the SQL is parsed with sqlglot and rejected before execution if it reaches any table not on the list.

What it catches

Every table a query reaches must be on the list, resolved with its database/catalog: direct, subquery, CTE body, JOIN, set operations, DESCRIBE, information_schema/pg_catalog, and DML (with allow_writes=True). CTE references are excluded by lexical scope, so a same-named CTE in another scope cannot hide a real table.

Constructs a schema.table list cannot describe are rejected while the list is active: table-valued functions (dblink), TABLE('name') row sources, the TABLE <name> shorthand, SHOW, dynamic SQL (EXEC), quoted identifiers (case-sensitive on the engine), cross-database references (otherdb.public.orders), and inline comments.

Design rationale

  • Reject comments: SELECT * FROM orders/*!UNION SELECT * FROM secret*/ runs the UNION arm on MySQL/MariaDB but is inert to sqlglot and other engines (a parser/engine differential). Rejecting comments closes the class instead of chasing variants.
  • Catalog-aware: the toolset targets one database, so a catalog-qualified reference to another database is refused.
  • Scope-aware CTE handling: excluding CTE names globally let an inner same-named CTE hide a real top-level table; exclusion now follows lexical scope and CTE ordering.

Tradeoffs and limitations

Application-level guardrail (parse-then-check): strong defense-in-depth, not a substitute for database permissions. It cannot police data reached through a function whose argument is itself SQL or a path: pg_read_file('...') (a file) or query_to_xml('SELECT ... FROM other', ...) / scalar dblink (a table, via a string the parser cannot read). For a hard boundary, run the connection as a least-privilege role with SELECT limited to the same tables. Quoted identifiers and comments are rejected while a list is active, so agents should send unquoted, comment-free SQL on restricted connections.

@kaxil kaxil requested a review from gopidesupavan as a code owner June 12, 2026 23:36
@kaxil kaxil changed the title Enforce SQLToolset's allowed_tables on queries Enforce SQLToolset allowed_tables on queries, not just discovery Jun 12, 2026
@kaxil kaxil force-pushed the sql-toolset-allowed-tables-enforcement branch from 6130da2 to eae9492 Compare June 12, 2026 23:40
allowed_tables previously restricted only metadata discovery (list_tables /
get_schema); the query and check_query tools never checked it, so an agent
could read any table by name. It is now enforced on the query tools as well:
the SQL is parsed with sqlglot and rejected before execution if it reaches a
table that is not on the list, resolved with its database/catalog and with CTE
references excluded by lexical scope.

Constructs a schema.table list cannot describe are rejected while a list is
active: table-valued functions, TABLE('name') row sources, the TABLE <name>
shorthand, SHOW, dynamic SQL, quoted identifiers, cross-database references,
and inline comments (where MySQL executable /*! ... */ comments hide).

This is application-level defense-in-depth, not a substitute for database
permissions: data reached through a function whose argument is itself SQL or a
path (pg_read_file, query_to_xml, scalar dblink) is out of its reach, so a
least-privilege DB role remains the hard boundary.
@kaxil kaxil force-pushed the sql-toolset-allowed-tables-enforcement branch from eae9492 to cb1c4a6 Compare June 13, 2026 00:06
Comment thread providers/common/ai/src/airflow/providers/common/ai/utils/sql_validation.py Outdated

@gopidesupavan gopidesupavan left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one comment it would be good to resolve them.. :)

A CTE used as a DML source (WITH src AS (...) INSERT INTO orders SELECT * FROM
src) was falsely rejected: CTE scoping was disabled for the whole DML statement,
so src was checked against allowed_tables as if it were a base table.

Now only the DML target is exempt from CTE resolution (you cannot write to a CTE,
so a same-named CTE never shadows the target); sources follow normal lexical CTE
scoping. The target, and any off-list table a CTE body actually reads, are still
enforced.
@kaxil kaxil merged commit b193ced into apache:main Jun 16, 2026
81 checks passed
@kaxil kaxil deleted the sql-toolset-allowed-tables-enforcement branch June 16, 2026 12:17
RulerChen pushed a commit to RulerChen/airflow that referenced this pull request Jun 16, 2026
…che#68487)

allowed_tables previously restricted only metadata discovery (list_tables /
get_schema); the query and check_query tools never checked it, so an agent
could read any table by name. It is now enforced on the query tools as well:
the SQL is parsed with sqlglot and rejected before execution if it reaches a
table that is not on the list, resolved with its database/catalog and with CTE
references excluded by lexical scope.

Constructs a schema.table list cannot describe are rejected while a list is
active: table-valued functions, TABLE('name') row sources, the TABLE <name>
shorthand, SHOW, dynamic SQL, quoted identifiers, cross-database references,
and inline comments (where MySQL executable /*! ... */ comments hide).

This is application-level defense-in-depth, not a substitute for database
permissions: data reached through a function whose argument is itself SQL or a
path (pg_read_file, query_to_xml, scalar dblink) is out of its reach, so a
least-privilege DB role remains the hard boundary.

* Allow CTE sources in DML against allowed_tables

A CTE used as a DML source (WITH src AS (...) INSERT INTO orders SELECT * FROM
src) was falsely rejected: CTE scoping was disabled for the whole DML statement,
so src was checked against allowed_tables as if it were a base table.

Now only the DML target is exempt from CTE resolution (you cannot write to a CTE,
so a same-named CTE never shadows the target); sources follow normal lexical CTE
scoping. The target, and any off-list table a CTE body actually reads, are still
enforced.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants