
Fix fab deserialize user session leak #68100

Merged
vincbeck merged 5 commits into apache:main from pcorliss:fix-fab-deserialize-user-session-leak on Jun 18, 2026

Conversation

@pcorliss pcorliss commented Jun 5, 2026

Contributor

Fix periodic 500 due to idle_in_transaction timeout in FAB auth manager

When we upgraded to Airflow 3.1.8 (apache-airflow-providers-fab 3.4.0) we noticed periodic 500s being returned when connecting to certain API endpoints or browsing the GUI. We traced this back to a SQL error on the api-server.

sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.

(Subsequent requests that reuse the same scoped_session then fail with sqlalchemy.exc.PendingRollbackError: Can't reconnect until invalid transaction is rolled back. This continues until the session is removed.)

On investigation we found that transactions were being left open in the idle in transaction state, and PostgreSQL terminated those connections once they exceeded our timeout. In our environment the idle-in-transaction timeout is set to 5 minutes (idle_in_transaction_session_timeout = '5min' in postgresql.conf). The PostgreSQL default is 0 ("never time out"), so deployments using the default will not see this bug; you can confirm the live value with SHOW idle_in_transaction_session_timeout;.

We traced this back to this change:

We think the removal of create_session() is what removed the commit/rollback handler.
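
As an illustration of what a commit/rollback handler of this kind does, here is a minimal sketch using only the standard library. It is not Airflow's create_session() implementation: FakeSession and managed_session are hypothetical names, and the fake session only records which calls were made.

```python
from contextlib import contextmanager


class FakeSession:
    """Hypothetical stand-in for a database session; it only records calls."""

    def __init__(self):
        self.calls = []

    def commit(self):
        self.calls.append("commit")

    def rollback(self):
        self.calls.append("rollback")

    def close(self):
        self.calls.append("close")


@contextmanager
def managed_session(session):
    """Commit on success, roll back on error, and always close."""
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()


ok = FakeSession()
with managed_session(ok):
    pass  # a read-only query would run here

failed = FakeSession()
try:
    with managed_session(failed):
        raise RuntimeError("query failed")
except RuntimeError:
    pass

print(ok.calls)      # ['commit', 'close']
print(failed.calls)  # ['rollback', 'close']
```

Either path ends the transaction, so a connection handled this way is never left idle in transaction after the block exits.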

We also noticed several follow-up fixes and tried upgrading to Airflow 3.2.2 (apache-airflow-providers-fab 3.6.4), but the errors continued because none of them fix the root cause (the open transaction left behind on a worker thread):

Repro Steps

Deployed Instance

  1. Set PostgreSQL idle_in_transaction_session_timeout to a short value: ALTER SYSTEM SET idle_in_transaction_session_timeout = '10s'; SELECT pg_reload_conf();
  2. Clear all existing cookies or use an incognito window.
  3. Login via /auth/login.
  4. Navigate to /api/v2/version and expect a 200 response. This leaves a connection idle in transaction in the database (visible via SELECT pid, state, now()-state_change AS idle_for, LEFT(query, 80) FROM pg_stat_activity WHERE state='idle in transaction';).
  5. Take no action for at least the duration of PostgreSQL's idle_in_transaction_session_timeout.
  6. Navigate to /api/v2/version again and expect a 500 response, with OperationalError / PendingRollbackError in the api-server logs.

Breeze Script

Local breeze configuration in
files/airflow-breeze-config/environment_variables.env:

AIRFLOW__FAB__CACHE_TTL=0
AIRFLOW__API__WORKERS=1

AIRFLOW__FAB__CACHE_TTL=0 disables the TTL cache so every request exercises the cache-miss path. AIRFLOW__API__WORKERS=1 keeps a single gunicorn worker so the same anyio worker thread holds the leaked
connection across requests.
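
Why a single worker thread makes the leak visible can be sketched with the standard library alone. This is an illustration under assumptions, not FAB or SQLAlchemy code: FakeSession, get_session, and handle_request are hypothetical, and threading.local stands in for the per-thread registry behind scoped_session.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_registry = threading.local()  # one session slot per thread


class FakeSession:
    """Hypothetical session that only tracks whether a transaction is open."""

    def __init__(self):
        self.in_transaction = False

    def query(self):
        # The first statement implicitly begins a transaction.
        self.in_transaction = True

    def rollback(self):
        self.in_transaction = False


def get_session():
    """Return this thread's session, creating it on first use."""
    if not hasattr(_registry, "session"):
        _registry.session = FakeSession()
    return _registry.session


def handle_request(cleanup):
    """Simulate one request; report (transaction open on entry, session id)."""
    session = get_session()
    was_open_on_entry = session.in_transaction
    session.query()
    if cleanup:
        session.rollback()
    return was_open_on_entry, id(session)


# max_workers=1 mirrors the single worker thread used in the repro.
with ThreadPoolExecutor(max_workers=1) as pool:
    first = pool.submit(handle_request, False).result()   # leaks the transaction
    second = pool.submit(handle_request, True).result()   # inherits the leak
    third = pool.submit(handle_request, True).result()    # clean again

same_session = first[1] == second[1] == third[1]
print(first[0], second[0], third[0], same_session)  # False True False True
```

The second request finds the transaction that the first one left open because both run on the same thread and therefore share one session. With more worker threads the same leak still exists, but which request hits it becomes less predictable.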

Start breeze with PostgreSQL and shorten the idle-in-transaction timeout so the bug is observable in seconds:

breeze start-airflow --backend postgres \
  --postgres-version 16 \
  --integration none \
  --load-default-connections \
  --load-example-dags

# In a separate shell, on the host:
docker exec -u postgres breeze-postgres-1 \
  psql -d airflow -c \
  "ALTER SYSTEM SET idle_in_transaction_session_timeout = '10s'; \
   SELECT pg_reload_conf();"

Then run the minimal repro shell script:

#!/usr/bin/env bash
# Reproduce the FAB deserialize_user idle-in-transaction leak.
set -euo pipefail

API="${API:-http://localhost:28080}"
USERNAME="${USERNAME:-admin}"
PASSWORD="${PASSWORD:-admin}"
WAIT_SECONDS="${WAIT_SECONDS:-15}"

JAR=$(mktemp)
trap 'rm -f "$JAR"' EXIT

step() { printf "\n\033[1;34m== %s ==\033[0m\n" "$*"; }

step "1. Authenticate via /auth/login/"
LOGIN_HTML=$(curl -fsS -c "$JAR" -b "$JAR" "$API/auth/login/")
CSRF=$(printf '%s' "$LOGIN_HTML" \
  | grep -oE 'name="csrf_token"[^>]*value="[^"]*"' \
  | head -1 | sed -E 's/.*value="([^"]*)".*/\1/')
curl -fsS -o /dev/null -c "$JAR" -b "$JAR" -X POST "$API/auth/login/" \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  --data-urlencode "username=$USERNAME" \
  --data-urlencode "password=$PASSWORD" \
  --data-urlencode "csrf_token=$CSRF"

step "2. Warm GET /api/v2/version (leaks idle-in-tx scoped_session on worker thread)"
curl -fsS -o /dev/null -w 'HTTP %{http_code}\n' -b "$JAR" "$API/api/v2/version"

step "3. Confirm idle-in-transaction connection holding the FAB User JOIN"
docker exec breeze-postgres-1 psql -U postgres -d airflow -c \
  "SELECT pid, state, now()-state_change AS idle_for, LEFT(query, 80) \
   FROM pg_stat_activity \
   WHERE datname='airflow' AND state='idle in transaction';"

step "4. Sleep ${WAIT_SECONDS}s (> idle_in_transaction_session_timeout)"
sleep "$WAIT_SECONDS"

step "5. Probe GET /api/v2/version — pre-fix: 500; post-fix: 200"
HTTP=$(curl -s -o /tmp/repro-resp.txt -w '%{http_code}' -b "$JAR" "$API/api/v2/version")
echo "HTTP $HTTP"

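if [[ "$HTTP" == "500" ]]; then
  echo -e "\n\033[1;31mLEAK REPRODUCED\033[0m"
  exit 1
elif [[ "$HTTP" == "200" ]]; then
  echo -e "\n\033[1;32mNO LEAK\033[0m"
  exit 0
else
  # Any other status (e.g. 401, or 000 if the server is unreachable) is inconclusive.
  echo -e "\n\033[1;33mUNEXPECTED HTTP $HTTP\033[0m"
  exit 2
fi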

Pre-fix, the script exits non-zero with HTTP 500 at step 5. Post-fix, it exits 0 with HTTP 200, and no idle in transaction row is visible at step 3.


Was generative AI tooling used to co-author this PR?
  • Yes — Claude Code (Opus 4.7)

Generated-by: Claude Code (Opus 4.7) following the guidelines


  • Read the Pull Request Guidelines for more information. Note: commit author/co-author name and email in commits become permanently public when merged.
  • For fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
  • When adding dependency, check compliance with the ASF 3rd Party License Policy.
  • For significant user-facing changes create newsfragment: {pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.

@pcorliss pcorliss requested a review from vincbeck as a code owner June 5, 2026 20:16
@pcorliss pcorliss force-pushed the fix-fab-deserialize-user-session-leak branch from 5a971c6 to 81cf36a Compare June 5, 2026 20:16
pcorliss commented Jun 5, 2026

Contributor Author

Test failures appear to be environmental (a failure pulling golang from a remote URL). I don't see a retry option, so I will wait until there are new commits on main to rebase onto.

@pcorliss pcorliss force-pushed the fix-fab-deserialize-user-session-leak branch from 0198985 to 49e90c6 Compare June 6, 2026 02:33
@potiuk potiuk added the ready for maintainer review Set after triaging when all criteria pass. label Jun 8, 2026
pcorliss added 5 commits June 17, 2026 20:30
Address self-review findings: import airflow.settings at module top
instead of inside the test method, and constrain the create_session
context-manager mock to the protocol it actually needs.

AI-Assisted-By: Claude Opus 4.7
The compat-3.0.6 test environment configures the engine with NullPool,
which closes connections on release and exposes no checkedout() method.
Skip the leak assertion there — by construction NullPool cannot leak
the connection this test guards against.

AI-Assisted-By: Claude Opus 4.7
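
The guard described in the commit message above can be sketched as follows. This is an assumption about the general shape of such a check, not the test code from this PR: FakePool, FakeNullPool, and checked_out_or_none are hypothetical names, and the fake pools stand in for SQLAlchemy pool objects.

```python
class FakePool:
    """Hypothetical pool that can report checked-out connections."""

    def __init__(self, checked_out):
        self._checked_out = checked_out

    def checkedout(self):
        return self._checked_out


class FakeNullPool:
    """Hypothetical pool with no checkedout() method, as in the NullPool case."""


def checked_out_or_none(pool):
    """Return the checked-out count, or None when the pool cannot report one."""
    checkedout = getattr(pool, "checkedout", None)
    if checkedout is None:
        return None  # caller should skip the leak assertion
    return checkedout()


print(checked_out_or_none(FakePool(1)))     # 1
print(checked_out_or_none(FakePool(0)))     # 0
print(checked_out_or_none(FakeNullPool()))  # None
```

A test built this way would assert that the count is 0 after the call under test, and skip the assertion when the result is None.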
@pcorliss pcorliss force-pushed the fix-fab-deserialize-user-session-leak branch from 8f69128 to 551aa5b Compare June 18, 2026 01:31

@vincbeck vincbeck left a comment

Contributor


Hopefully this will finally solve this long-running issue!

@vincbeck vincbeck merged commit 91ba4a2 into apache:main Jun 18, 2026
77 checks passed

Labels

area:providers provider:fab ready for maintainer review Set after triaging when all criteria pass.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PendingRollbackError: Can't reconnect until invalid transaction is rolled back. Please rollback() fully before proceeding (Background on this error)

3 participants