Skip to content

fix: bind py4j callback server to a dynamic port to avoid 25334 collision (#86, #19)#275

Open
nikolauspschuetz wants to merge 1 commit into
awslabs:masterfrom
nikolauspschuetz:fix/issue-86-callback-server-port-collision
Open

fix: bind py4j callback server to a dynamic port to avoid 25334 collision (#86, #19)#275
nikolauspschuetz wants to merge 1 commit into
awslabs:masterfrom
nikolauspschuetz:fix/issue-86-callback-server-port-collision

Conversation

@nikolauspschuetz

Copy link
Copy Markdown

Problem

Using a Check with a lambda assertion starts a py4j callback server. PythonCallback.__init__ called gateway.start_callback_server() with no port, so py4j binds the hardcoded default 25334. Concurrent or repeated runs on the same host then collide:

OSError: [Errno 98] Address already in use (127.0.0.1:25334)

This single root cause underlies a cluster of reports: #86, #19, #7, #72, #156, #173, #198. The known port=0 workaround alone is insufficient — the JVM-side callback client keeps pointing at the old port, producing Error while obtaining a new communication channel (#19).

Fix

Start the callback server on a dynamic port and reset the JVM-side callback client to the port actually bound — the documented py4j idiom (mirrors py4j's own ResetCallbackClientTest):

gateway.start_callback_server(CallbackServerParameters(port=0))
port = gateway.get_callback_server().get_listening_port()
gateway.java_gateway_server.resetCallbackClient(
    gateway.java_gateway_server.getCallbackClient().getAddress(), port)

The is_shutdown restart branch is also updated to re-derive a fresh dynamic port instead of reverting to 25334.

Verification

Runtime-verified on real Spark 3.5 / Java 17 / py4j 0.10.9.7 / Deequ 2.0.8:

Added test_lambda_check_uses_dynamic_callback_port to tests/test_checks.py, which squats on 25334, runs a lambda-assertion Check, and asserts the callback server does not use the default port.

Reviewer note

This touches the shared JVM-bridge callback lifecycle. The restart branch reaches into py4j's _callback_server attribute so get_callback_server() returns the new server — worth a careful look. getAddress() returns /127.0.0.1 (verified live) and is passed straight back into resetCallbackClient.

Closes #86, #19, #7, #72, #156, #173, #198

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


Generated by AI (model: us.anthropic.claude-opus-4-6-v1, prompt: 416310f3) — may not be fully accurate. Reply if this doesn't help.

Comment thread pydeequ/scala_utils.py Outdated
Comment thread pydeequ/scala_utils.py Outdated
Comment thread tests/test_checks.py Outdated
PythonCallback let py4j bind the hardcoded default callback port 25334, so
concurrent or repeated runs on the same host using a lambda-based Check failed
with "OSError: [Errno 98] Address already in use (127.0.0.1:25334)".

Force a dynamic (OS-assigned) port by setting the gateway's existing
callback_server_parameters to port=0 before starting the server the stock way.
Reusing PySpark's own parameters (rather than passing a fresh
CallbackServerParameters + resetCallbackClient) keeps the JVM callback client
correctly wired (no "Error while obtaining a new communication channel", awslabs#19)
and, crucially, lets shutdown_callback_server() return cleanly at teardown on
both Linux and macOS.

Closes awslabs#86, awslabs#19, awslabs#7, awslabs#72, awslabs#156, awslabs#173, awslabs#198
@nikolauspschuetz nikolauspschuetz force-pushed the fix/issue-86-callback-server-port-collision branch from f626e1b to 4ad5b3b Compare June 27, 2026 17:41
@nikolauspschuetz nikolauspschuetz marked this pull request as ready for review June 27, 2026 17:41
@nikolauspschuetz

Copy link
Copy Markdown
Author

Ready for review. Reworked after deeper validation: the first approach (CallbackServerParameters(port=0) + resetCallbackClient) made shutdown_callback_server() hang at teardown on both Linux and macOS (I confirmed via the repo's own Dockerfile: stock shutdown returns cleanly, that approach hangs), which would have hung the test suite's tearDownClass in CI.

The current fix is minimal: set the gateway's existing (PySpark-configured) callback_server_parameters.port to 0 only if it's the legacy hardcoded 25334, then start the server the stock way. This avoids the collision and preserves PySpark's callback wiring, so shutdown returns cleanly. Verified on Linux (Docker) and macOS: full test_checks.py → 87 passed, exit 0, no teardown hang. cc @sudsali @chenliu0831. Closes #86, #19, #7, #72, #156, #173, #198.

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


Generated by AI (model: us.anthropic.claude-opus-4-6-v1, prompt: 416310f3) — may not be fully accurate. Reply if this doesn't help.

Comment thread pydeequ/scala_utils.py
Comment thread pydeequ/scala_utils.py
@nikolauspschuetz

Copy link
Copy Markdown
Author

Thanks for the automated review pass. The current revision addresses these findings:

Resolving the threads accordingly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Check with assertion as parameter throws "OSError: [Errno 98] Address already in use"

2 participants