fix: bind py4j callback server to a dynamic port to avoid 25334 collision (#86, #19) #275
PythonCallback let py4j bind the hardcoded default callback port 25334, so concurrent or repeated runs on the same host using a lambda-based Check failed with "OSError: [Errno 98] Address already in use (127.0.0.1:25334)". Force a dynamic (OS-assigned) port by setting the gateway's existing callback_server_parameters to port=0 before starting the server the stock way. Reusing PySpark's own parameters (rather than passing a fresh CallbackServerParameters + resetCallbackClient) keeps the JVM callback client correctly wired (no "Error while obtaining a new communication channel", awslabs#19) and, crucially, lets shutdown_callback_server() return cleanly at teardown on both Linux and macOS. Closes awslabs#86, awslabs#19, awslabs#7, awslabs#72, awslabs#156, awslabs#173, awslabs#198
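A minimal sketch of the approach described in this commit message, written against a duck-typed gateway rather than copied from the diff. The helper name `start_callback_server_on_dynamic_port` is hypothetical. `callback_server_parameters`, `start_callback_server()`, `get_callback_server()` and `get_listening_port()` are py4j members, and the sketch assumes the gateway is the one PySpark already configured.

```python
def start_callback_server_on_dynamic_port(gateway):
    """Start py4j's callback server on an OS-assigned port.

    `gateway` is assumed to be the py4j gateway that PySpark created
    (for example `spark.sparkContext._gateway`). This helper is a
    hypothetical illustration, not the code from the PR.
    """
    # Reuse the parameters object the gateway already holds and change
    # only the port: 0 asks the OS for any free port instead of 25334.
    gateway.callback_server_parameters.port = 0

    # Start the server the stock way, with no explicit parameters, so
    # py4j falls back to gateway.callback_server_parameters.
    gateway.start_callback_server()

    # Report the port that was actually bound.
    return gateway.get_callback_server().get_listening_port()
```

Mutating the existing parameters object, instead of building a fresh one, is the design choice the commit message argues for; the sketch does not show the teardown path.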
f626e1b to 4ad5b3b
Ready for review. Reworked after deeper validation: the first approach ( The current fix is minimal: set the gateway's existing (PySpark-configured)
Thanks for the automated review pass. The current revision addresses these findings:
Resolving the threads accordingly.
Problem
Using a Check with a lambda assertion starts a py4j callback server. PythonCallback.__init__ called gateway.start_callback_server() with no port, so py4j binds the hardcoded default 25334. Concurrent or repeated runs on the same host then collide with "OSError: [Errno 98] Address already in use (127.0.0.1:25334)".
This single root cause underlies a cluster of reports: #86, #19, #7, #72, #156, #173, #198. The known port=0 workaround alone is insufficient: the JVM-side callback client keeps pointing at the old port, producing "Error while obtaining a new communication channel" (#19).
Fix
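To see why a dynamic port avoids the collision, here is a minimal socket-level sketch that is independent of py4j and Spark. The names `try_bind` and `demonstrate_collision` are hypothetical, and a plain TCP listener stands in for the callback server.

```python
import errno
import socket


def try_bind(port):
    """Return a listening socket on 127.0.0.1:port, or None if the port is taken."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
        sock.listen(1)
        return sock
    except OSError as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            return None  # same failure mode as the 25334 collision
        raise


def demonstrate_collision():
    """Show that reusing a fixed port collides while port 0 does not."""
    first = try_bind(0)                  # stand-in for the first run's server
    fixed_port = first.getsockname()[1]
    second_fixed = try_bind(fixed_port)  # second run insisting on the same port
    second_dynamic = try_bind(0)         # second run letting the OS choose
    try:
        return {
            "fixed_port_collided": second_fixed is None,
            "ports_differ": second_dynamic.getsockname()[1] != fixed_port,
        }
    finally:
        # Close every socket that was actually opened.
        for sock in (first, second_fixed, second_dynamic):
            if sock is not None:
                sock.close()


if __name__ == "__main__":
    print(demonstrate_collision())
```

The first listener takes its port from the OS so the sketch does not depend on 25334 being free on the machine running it.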
Start the callback server on a dynamic port and reset the JVM-side callback client to the port actually bound, the documented py4j idiom (mirrors py4j's own ResetCallbackClientTest).
The is_shutdown restart branch is also updated to re-derive a fresh dynamic port instead of reverting to 25334.
Verification
Runtime-verified on real Spark 3.5 / Java 17 / py4j 0.10.9.7 / Deequ 2.0.8: ALL_SCENARIOS_PASSED.
Added test_lambda_check_uses_dynamic_callback_port to tests/test_checks.py, which squats on 25334, runs a lambda-assertion Check, and asserts the callback server does not use the default port.
Reviewer note
This touches the shared JVM-bridge callback lifecycle. The restart branch reaches into py4j's _callback_server attribute so get_callback_server() returns the new server; this is worth a careful look. getAddress() returns /127.0.0.1 (verified live) and is passed straight back into resetCallbackClient.
Closes #86, #19, #7, #72, #156, #173, #198
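The reset step named under Fix and in the reviewer note can be sketched roughly as follows. This illustrates the py4j idiom against a duck-typed gateway and is not the diff itself. The helper name `start_and_rewire_callback_server` is hypothetical, and the caller is assumed to supply a parameters object requesting port 0.

```python
def start_and_rewire_callback_server(gateway, callback_server_parameters):
    """Start the callback server, then point the JVM callback client at it.

    `gateway` is assumed to be a live py4j JavaGateway and
    `callback_server_parameters` a py4j CallbackServerParameters built
    with port=0. Hypothetical helper, for illustration only.
    """
    # Bind the Python-side callback server; with port=0 the OS picks the port.
    gateway.start_callback_server(callback_server_parameters)

    # Ask the running server which port it actually got.
    python_port = gateway.get_callback_server().get_listening_port()

    # Tell the JVM-side callback client to connect to that port, reusing
    # the address it already has (getAddress() is passed straight back).
    jvm_server = gateway.java_gateway_server
    jvm_server.resetCallbackClient(
        jvm_server.getCallbackClient().getAddress(), python_port
    )
    return python_port


# Typical call, assuming py4j is installed and `gateway` is a live JavaGateway:
#   from py4j.java_gateway import CallbackServerParameters
#   port = start_and_rewire_callback_server(
#       gateway, CallbackServerParameters(port=0))
```

Without the `resetCallbackClient` call the JVM side would keep dialing the port it was originally told about, which is the failure reported in #19.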