Skip to content

[fix](fe) Fix VSlotRef invalid slot id in INTERSECT/EXCEPT with ExprId reuse#62296

Open
yujun777 wants to merge 2 commits intoapache:masterfrom
yujun777:fix/intersect-exprid-reuse
Open

[fix](fe) Fix VSlotRef invalid slot id in INTERSECT/EXCEPT with ExprId reuse#62296
yujun777 wants to merge 2 commits intoapache:masterfrom
yujun777:fix/intersect-exprid-reuse

Conversation

@yujun777
Copy link
Copy Markdown
Contributor

@yujun777 yujun777 commented Apr 9, 2026

What problem does this PR solve?

Issue Number:

Problem Summary:

Introduced by #56366.

The real bug is not in the physical translator. It starts earlier in PushDownFilterThroughSetOperation when a filter is pushed through a set operation that still contains constant children.

For UNION/INTERSECT/EXCEPT, constant children intentionally reuse the set operation output ExprIds. When filter push-down cannot fold one of those constant children away, the rewrite promotes that constant expression into a regular child. Before this fix, the promoted child kept the same ExprIds as the set operation output, so regularChildrenOutputs and the set operation output collided on ExprId. That duplicate ExprId state later surfaced as wrong slot binding and errors such as VSlotRef have invalid slot id.

The fix handles the problem at the source: when PushDownFilterThroughSetOperation promotes a constant child into a regular child, it wraps the child with LogicalProject and fresh Alias outputs so the promoted child gets new ExprIds while preserving names and output order.

Reproducing query:

WITH
  tbl0(`n`) AS (SELECT 1 UNION ALL SELECT 3 UNION ALL SELECT NULL),
  tbl1(`n`) AS (SELECT 2 UNION ALL SELECT NULL UNION ALL SELECT 1)
(
  SELECT (`n` * 2) AS `n` FROM tbl0
  INTERSECT
  SELECT (`n` * 2) AS `n` FROM tbl1
);

Release note

Fixed a bug where set-operation queries could fail because filter push-down promoted a constant child into a regular child while reusing the set operation output ExprIds.

Check List (For Author)

  • Test: Unit test (PushDownFilterThroughSetOperationTest) / Regression test (set_operation_exprid_reuse, forceGenOut)
  • Behavior changed: No
  • Does this need documentation: No

@Thearas
Copy link
Copy Markdown
Contributor

Thearas commented Apr 9, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@yujun777
Copy link
Copy Markdown
Contributor Author

yujun777 commented Apr 9, 2026

run buildall

@yujun777
Copy link
Copy Markdown
Contributor Author

yujun777 commented Apr 9, 2026

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

  1. High - The fix is incomplete because the same ExprId->SlotRef overwrite pattern still exists in the functionally parallel recursive CTE translator path (visitPhysicalRecursiveUnion()), so recursive unions can still bind child result expressions to the wrong slots when ExprIds are reused.

Critical checkpoints

  • Goal of current task: Partially achieved. visitPhysicalSetOperation() is fixed and covered by regression tests, but the same translator invariant is still violated in visitPhysicalRecursiveUnion().
  • Modification size/focus: Yes, the patch is small and focused.
  • Concurrency: Not involved.
  • Special lifecycle/static init: No special lifecycle concerns in the changed code.
  • Configuration: None added.
  • Compatibility: No FE/BE protocol or storage compatibility change.
  • Parallel code paths: Not fully covered. visitPhysicalRecursiveUnion() is a parallel path that still uses the old translation order.
  • Special conditional checks: The new ordering comment is clear; no extra conditional-risk issue found.
  • Test coverage: Added good regression coverage for non-recursive set operations, but there is no recursive CTE regression for the same ExprId-reuse invariant.
  • Observability: No additional observability needed for this translator-only fix.
  • Transaction/persistence: Not involved.
  • Data writes/modifications: Not involved.
  • FE/BE variable passing: Not involved.
  • Performance: Acceptable; the change is still linear and does not add meaningful overhead.
  • Other issues: None beyond the missed recursive CTE path.

@Override
public PlanFragment visitPhysicalSetOperation(
PhysicalSetOperation setOperation, PlanTranslatorContext context) {
// Translate each child's result exprs and distribute exprs immediately after
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High: this only fixes visitPhysicalSetOperation(), but visitPhysicalRecursiveUnion() still follows the old order: visit both children, call generateTupleDesc(recursiveCte.getOutput(), ...), then translate recursiveCte.getRegularChildrenOutputs() (see lines 2302-2328 in the current file). That path uses the same global PlanTranslatorContext.exprIdToSlotRef map, so later child visits and the parent tuple creation still overwrite earlier mappings. Because LogicalRecursiveUnion.buildNewOutputs() can preserve an existing output ExprId, a recursive CTE with reused ExprIds can still translate to the wrong SlotRef and hit the same invalid-slot failure. Please apply the same interleaving strategy here and add a regression under recursive_cte.

@hello-stephen
Copy link
Copy Markdown
Contributor

FE UT Coverage Report

Increment line coverage 100.00% (12/12) 🎉
Increment coverage report
Complete coverage report

@yujun777 yujun777 force-pushed the fix/intersect-exprid-reuse branch from bf4213e to 8aa64f0 Compare April 9, 2026 14:48
@yujun777
Copy link
Copy Markdown
Contributor Author

yujun777 commented Apr 9, 2026

run buildall

@yujun777
Copy link
Copy Markdown
Contributor Author

yujun777 commented Apr 9, 2026

/review

@yujun777
Copy link
Copy Markdown
Contributor Author

yujun777 commented Apr 9, 2026

Re: review comment on visitPhysicalRecursiveUnion() — Applied the same interleaving fix. Result exprs and distribute exprs are now translated immediately after each child visit, before generateTupleDesc() can overwrite the ExprId→SlotRef mappings. Verified with recursive CTE tests.

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One issue found.

  1. regression-test/data/query_p0/set_operations/set_operation_exprid_reuse.out encodes the wrong expected result for intersect_nulls. Doris already treats NULL as equal for set operations: the existing query_p0/set_operations/set_with_null suite expects INTERSECT ... SELECT NULL to return \N, not to drop it. The new expected output only contains 1, so this regression test would bless a semantic regression instead of protecting against it.

Critical checkpoint conclusions:

  • Goal / correctness / test proof: The goal is to fix ExprId -> SlotRef overwrite during set-operation translation. The code change in PhysicalPlanTranslator.visitPhysicalSetOperation() does address that overwrite hazard by translating each child immediately after visiting it, and the added suite exercises the reported failing shapes. However, one expected result is incorrect, so the current tests do not yet prove the fix safely.
  • Scope / minimality: The code change is small and focused. Extending the same interleaving pattern to visitPhysicalRecursiveUnion() is also consistent with the same mutable-map mechanism.
  • Concurrency: No new concurrency or lock-safety concerns in the touched code path.
  • Lifecycle / static initialization: No special lifecycle or static-init concerns introduced here.
  • Config changes: None.
  • Compatibility: No FE/BE protocol or storage-format compatibility changes detected.
  • Parallel code paths: The analogous recursive CTE path was updated too, which is good. I did not find another remaining set-operation translation path with the same overwrite pattern in this file.
  • Special condition checks: The new comments explain the ExprId reuse hazard clearly enough.
  • Test coverage: Coverage improved for INTERSECT / EXCEPT / UNION DISTINCT and nested shapes, but the intersect_nulls oracle is inconsistent with existing Doris semantics and should be corrected. No recursive CTE regression was added for the newly touched parallel path.
  • Observability: No additional observability seems necessary for this FE translation fix.
  • Transaction / persistence: Not applicable.
  • Data writes / modifications: Not applicable beyond regression-test DDL/DML setup.
  • FE/BE variable passing: No new variables added.
  • Performance: The interleaving avoids extra retranslation hazards without adding meaningful overhead.
  • Other issues: None beyond the incorrect expected result above.

@hello-stephen
Copy link
Copy Markdown
Contributor

FE UT Coverage Report

Increment line coverage 100.00% (25/25) 🎉
Increment coverage report
Complete coverage report

Copy link
Copy Markdown
Contributor

@morrySnow morrySnow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should not has same exprid for different slot reference in a plan

@yujun777 yujun777 force-pushed the fix/intersect-exprid-reuse branch from 8aa64f0 to 843dadf Compare April 11, 2026 15:35
@yujun777
Copy link
Copy Markdown
Contributor Author

run buildall

@yujun777
Copy link
Copy Markdown
Contributor Author

/review

@yujun777
Copy link
Copy Markdown
Contributor Author

should not has same exprid for different slot reference in a plan

PushDownFilterThroughSetOperation introduce duplicate exprid, fix it

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings:

  1. High - regression-test/data/query_p0/set_operations/set_operation_exprid_reuse.out records the wrong expected result for the new INTERSECT NULL case. Doris' existing set_with_null regression suite treats NULL as equal in set semantics, so order_qt_intersect_nulls should return both \N and 1, not just 1. As written, this new regression would bless a semantic regression instead of proving the ExprId fix.

Critical checkpoint conclusions:

  • Goal of the task: Partially accomplished. The code change is aimed at removing ExprId collisions for promoted UNION constant rows, and the FE unit test covers that helper behavior. However, the new end-to-end regression currently has an incorrect golden result, so it does not reliably prove the user-visible fix.
  • Modification size/focus: Yes. The Java change is small and focused on PushDownFilterThroughSetOperation.
  • Concurrency: Not involved in this diff.
  • Special lifecycle/static init: Not involved.
  • Configuration changes: None.
  • Compatibility/incompatible changes: None observed.
  • Functionally parallel paths: No additional unaddressed path found in the current diff beyond the added regression issue.
  • Special conditional checks: The new project-wrapping branch is understandable and locally justified to generate fresh ExprIds.
  • Test coverage: FE unit coverage is good for the rewrite helper, but regression coverage is not yet trustworthy because of the wrong NULL expectation.
  • Observability: No additional observability needed for this FE rewrite fix.
  • Transaction/persistence: Not involved.
  • Data writes/modifications: Not involved.
  • FE-BE variable passing: Not involved.
  • Performance: No obvious performance regression; the new project wrapper is only on the promoted-constant rewrite path.
  • Other issues: The incorrect regression output above is the remaining blocker I found.

I did not run FE or regression tests locally in this runner; conclusions are based on code inspection and existing Doris regression semantics.

…d in UNION

### What problem does this PR solve?

Issue Number: close #CIR-19889

Related PR: apache#56366

Problem Summary:
PR apache#56366 (master, commit e27ceb3, 2025-09-28) introduced a bug in
PushDownFilterThroughSetOperation.addFiltersToNewChildren(). When a UNION
constant expression cannot be statically eliminated by a filter (e.g.,
NULL IN (4,2) evaluates to UNKNOWN), it is promoted from constantExprsList
to a regular child via LogicalOneRowRelation + LogicalFilter. The else
branch directly used eliminateFilter.getOutput() as regularChildrenOutputs,
inheriting the same ExprId as the UNION output slot.

This ExprId collision is then manifested in PhysicalPlanTranslator
.visitPhysicalSetOperation(): generateTupleDesc(output) registers
ExprId->SlotRef for the UNION output, which overwrites the child's earlier
ExprId->SlotRef mapping (both share the same ExprId). The child result
expression then resolves to the wrong slot, causing BE crash:
"VSlotRef have invalid slot id".

The fix wraps the promoted constant child in a LogicalProject with fresh
ExprIds (via new Alias(ne.toSlot(), ne.getName()) which auto-generates a
fresh ExprId), so that regularChildrenOutputs never share ExprIds with the
UNION output.

Also removed dead code: a Builder<List<SlotReference>> newChildrenOutput
that was constructed but never used.

### Release note

Fix INTERSECT/EXCEPT/UNION DISTINCT query crash with "VSlotRef have
invalid slot id" when a UNION constant row cannot be statically eliminated
during filter push-down.

### Check List (For Author)

- Test: Regression test (query_p0/set_operations/set_operation_exprid_reuse.groovy) + FE unit test (PushDownFilterThroughSetOperationTest, added testConstantPromotionGeneratesFreshExprIds)
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@yujun777 yujun777 force-pushed the fix/intersect-exprid-reuse branch from 843dadf to cac8ff4 Compare April 13, 2026 07:32
@yujun777
Copy link
Copy Markdown
Contributor Author

run buildall

@yujun777
Copy link
Copy Markdown
Contributor Author

/review

@github-actions
Copy link
Copy Markdown
Contributor

No issues found in the reviewed changes.

Critical checkpoints:

  • Goal and correctness: The patch addresses the reported ExprId collision in by assigning fresh ExprIds when a constant child is promoted to a regular child. The downstream flow now receives distinct child/output slot mappings, so the reported failure path is covered.
  • Scope and minimality: The code change is narrowly scoped to the constant-promotion branch and removes one dead local builder. I did not find unrelated behavioral changes.
  • Concurrency: Not applicable. The modified code is planner rewrite logic with no new shared-state or lock interactions.
  • Lifecycle / initialization: Not applicable. No special lifecycle or static initialization changes.
  • Configuration: None added.
  • Compatibility: No incompatible FE/BE protocol or storage-format change. The fix stays within existing logical-plan semantics.
  • Parallel code paths: I checked the related set-operation rewrite/translation flow and did not find another sibling promotion path in this area that still reuses the colliding ExprId mapping.
  • Conditional checks: The new conditional branch is justified and documented inline; it matches the specific constant-to-regular-child promotion case.
  • Test coverage: Adequate for this fix. The FE unit tests cover fresh-ExprId generation on promotion, and the regression case exercises INTERSECT/EXCEPT/UNION DISTINCT queries including the reported reproducer.
  • Observability: No additional observability appears necessary for this planner-only fix.
  • Transaction / persistence: Not applicable.
  • Data write / atomicity: Not applicable.
  • FE/BE variable passing: No new variables added.
  • Performance: The added is only introduced on the rare promoted-constant path and is an appropriate, minimal cost for preserving ExprId correctness.
  • Other issues: None identified during review.

Residual risk:

  • I did not run FE unit tests or regression tests in this runner, so the conclusion is based on static review of the changed paths and tests.

@hello-stephen
Copy link
Copy Markdown
Contributor

FE UT Coverage Report

Increment line coverage 100.00% (8/8) 🎉
Increment coverage report
Complete coverage report

Refresh the regression output for set_operation_exprid_reuse after the set operation ExprId reuse fix changed NULL handling in the result.

Key changes:
- regenerate regression-test/data/query_p0/set_operations/set_operation_exprid_reuse.out
- update expected rows for intersect_cte_expr, except_cte_expr, and intersect_nulls

Unit Test:
- run-regression-test.sh --run --conf regression-test/conf/regression-conf-custom.groovy -g p0 -d query_p0/set_operations -s set_operation_exprid_reuse -forceGenOut

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@yujun777
Copy link
Copy Markdown
Contributor Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

FE UT Coverage Report

Increment line coverage 100.00% (8/8) 🎉
Increment coverage report
Complete coverage report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants