NODEJS-681: ControlConnection Concurrent Read and Write on .host and .connection #462

Open
toptobes wants to merge 2 commits into apache:trunk from toptobes:control-connection-refresh

Conversation


@toptobes toptobes commented May 20, 2026

This PR supersedes Jane He's #430, with her blessing.

After some investigation, we were unable to pin down the root cause of the NPEs; there are multiple avenues where the issue may have originated. We therefore decided to fix it at the lowest and simplest level we could: adding stronger concurrency control to _refresh directly via a _refreshInProgress flag.

I personally believe the issue stemmed from _setHealthListeners being called multiple times on the same host/connection, causing the listeners to trigger refreshes multiple times for the same event, leading to the NPEs mentioned in the ticket.

However, the issue is quite hard to reproduce organically, so the theory remains a theory.

Potential trace
  1. _refresh() is called
  2. _refresh() calls _refreshControlConnection()
  3. _refreshControlConnection() fails to borrow a connection so it calls _initializeConnection()
  4. _initializeConnection() calls _setHealthListeners()
  5. _refresh() gets back in control and then also calls _setHealthListeners()

which means that there's the potential of, sequentially:

  1. A new host and connection being set (call them H1 and C1)
  2. Listeners being attached to H1 and C1
  3. A newer host being set (call it H2)
  4. Listeners being attached to H2 and C1 without the previous listeners being removed
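
The sequence above can be reproduced in isolation with plain Node.js EventEmitter objects. This is only a sketch of the theory, not the driver's code: the `setHealthListeners` function, the `down` and `socketClose` event names, and the `refreshCalls` counter are simplified assumptions made for illustration.

```javascript
const { EventEmitter } = require('events');

// Hypothetical stand-ins for hosts and a connection; the real driver classes are richer.
const h1 = new EventEmitter();
const h2 = new EventEmitter();
const c1 = new EventEmitter();

let refreshCalls = 0;
const triggerRefresh = () => { refreshCalls++; };

// Simplified, assumed version of attaching health listeners.
function setHealthListeners(host, connection) {
  host.once('down', triggerRefresh);
  connection.once('socketClose', triggerRefresh);
}

// Steps 1-2: listeners are attached to H1 and C1.
setHealthListeners(h1, c1);
// Steps 3-4: a newer host is set and listeners are attached to H2 and C1,
// without the previous listeners being removed.
setHealthListeners(h2, c1);

// C1 now carries two listeners for the same event.
const attachedToC1 = c1.listenerCount('socketClose');

// A single socket close therefore triggers two refreshes.
c1.emit('socketClose');

console.log(attachedToC1, refreshCalls); // 2 2
```

In this sketch H1 also keeps a stale `down` listener, so a later event on the old host could trigger yet another refresh.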

Comment thread lib/control-connection.js

removeListeners();

if (self._isShuttingDown) {
Author

this conditional in _refresh() logs if true... should it also log here or no?

Comment thread lib/promise-utils.js
 */
function toBackground(promise) {
-  promise.catch(() => {});
+  promise?.catch(() => {});
Author

not sure if it's this package's responsibility to deal with the fact that the promise can be null 🤷
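
For reference, a small sketch of the behavioral difference between the two forms of the helper. `toBackgroundStrict` is a hypothetical name used here only to keep the unpatched version alongside the patched one.

```javascript
// Same shape as the patched helper: swallow rejections, tolerate nullish input.
function toBackground(promise) {
  promise?.catch(() => {});
}

// Unpatched form, kept under a hypothetical name for comparison.
function toBackgroundStrict(promise) {
  promise.catch(() => {});
}

// A rejected promise is handled, so no unhandled-rejection warning is raised.
toBackground(Promise.reject(new Error('boom')));

// With optional chaining, nullish input is a silent no-op.
toBackground(undefined);
toBackground(null);

// The unpatched form throws a TypeError on the same input.
let threw = false;
try {
  toBackgroundStrict(undefined);
} catch (err) {
  threw = err instanceof TypeError;
}

console.log(threw); // true
```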

assert.strictEqual(cc.hosts.length, 1);
});

it('should not break when refreshing concurrently', async () => {
Author

may need a better heuristic for ensuring the refreshes are okay... not sure... I just "borrowed" this from Jane's original PR

Comment thread lib/control-connection.js
*/
async _refresh(hostIterator) {
  if (this._refreshInProgress) {
    return;
Author

Returning an auto-resolving promise here would defer the _refresh call to the microtask queue, which may or may not open a hole for a race condition; I haven't checked.

I don't think it's necessary to return one, but if we do, it's definitely something that needs to be done with at least a little caution
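
A minimal sketch of the guard pattern under discussion, to make the control flow concrete. The class name `RefreshGuardSketch`, the `runs` counter, the timer-based `_unsafeDoRefresh` body, and the `try`/`finally` reset are assumptions for illustration and may not match the actual change.

```javascript
// Hypothetical, stripped-down class; only the guard mirrors the snippet above.
class RefreshGuardSketch {
  constructor() {
    this._refreshInProgress = false;
    this.runs = 0;
  }

  async _refresh() {
    if (this._refreshInProgress) {
      // In an async function a bare return still yields a promise
      // that resolves to undefined, so callers can always await it.
      return;
    }
    this._refreshInProgress = true;
    try {
      await this._unsafeDoRefresh();
    } finally {
      // Reset even if the refresh rejects; otherwise later calls would be skipped forever.
      this._refreshInProgress = false;
    }
  }

  async _unsafeDoRefresh() {
    this.runs++;
    // Stand-in for the real reconnection work.
    await new Promise((resolve) => setTimeout(resolve, 10));
  }
}

async function demo() {
  const cc = new RefreshGuardSketch();
  const first = cc._refresh();
  const second = cc._refresh(); // skipped: the flag is already set
  await Promise.all([first, second]);
  return { runs: cc.runs, secondIsPromise: second instanceof Promise };
}

demo().then((result) => console.log(result)); // { runs: 1, secondIsPromise: true }
```

Because the flag is checked and set synchronously, before the first `await`, two calls made back to back cannot both pass the guard in this sketch.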

Copilot AI left a comment

Pull request overview

This PR addresses NODEJS-681 by adding concurrency protection around ControlConnection._refresh() to avoid concurrent refresh executions that can lead to inconsistent .host / .connection state, and adds an integration test intended to exercise concurrent refresh behavior.

Changes:

  • Add _refreshInProgress guard and refactor refresh logic into _unsafeDoRefresh() in ControlConnection.
  • Add an integration test that triggers many _refresh() calls.
  • Make promiseUtils.toBackground() tolerate undefined/null inputs via optional chaining.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| lib/control-connection.js | Adds a refresh-in-progress guard and refactors refresh implementation into a separate method. |
| lib/promise-utils.js | Makes toBackground() no-op safely when given a nullish value. |
| test/integration/short/control-connection-tests.js | Adds a concurrency-focused integration test for control connection refresh. |


Comment thread lib/control-connection.js
Comment thread lib/control-connection.js Outdated
Comment thread lib/promise-utils.js
Comment on lines 146 to 153
/**
 * Deals with unexpected rejections in order to avoid the unhandled promise rejection warning or failure.
 * @param {Promise} promise
 * @returns {undefined}
 */
function toBackground(promise) {
-  promise.catch(() => {});
+  promise?.catch(() => {});
}
Comment on lines +164 to +170
const refreshPromises = [];
// randomly emit cc._refresh 100 times
for (let i = 0; i < 100; i++) {
  refreshPromises.push(cc._refresh());
  await helper.delayAsync(~~(Math.random() * 100));
}
await Promise.all(refreshPromises);
Contributor

Copilot's suggestion won't work. If the _refresh calls all start at around the same time and end at around the same time, the null pointer exception won't actually happen.

Contributor

And although my test can be unreliable, it won't produce false test failures.
I agree that this test isn't ideal, though; I'm open to ideas on how to improve it.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Contributor

@SiyaoIsHiding SiyaoIsHiding left a comment

I think the main reason it broke in the past is that this.host and this.connection are nullable (they are supposed to be null in certain circumstances, such as during a refresh), but calls like this._setHealthListeners(this.host, this.connection); do not treat them as nullable. If this were TypeScript, the compiler would complain that _setHealthListeners expects Host, Connection as arguments while this call passes Host?, Connection?.
So, aside from the _refreshInProgress flag to make sure only one refresh is happening at one time, I think we should still make sure we are using this.host and this.connection as nullable.
That means this._setHealthListeners(this.host, this.connection); should be

if (this.host && this.connection) {
  this._setHealthListeners(this.host, this.connection);
}

Other places that access this.host and this.connection should also have null guards like this.
What do you think?
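
To illustrate the point, a self-contained sketch of what an unguarded call does when both fields are null. The `setHealthListeners` function, the `state` object, and the event names are hypothetical simplifications, not the driver's implementation.

```javascript
const { EventEmitter } = require('events');

// Hypothetical simplification: attaches one listener to each argument.
function setHealthListeners(host, connection) {
  host.once('down', () => {});
  connection.once('socketClose', () => {});
}

// Mid-refresh state as described above: both fields are temporarily null.
const state = { host: null, connection: null };

// Unguarded call: fails with a TypeError, the JavaScript analogue of an NPE.
let unguardedError = null;
try {
  setHealthListeners(state.host, state.connection);
} catch (err) {
  unguardedError = err;
}

// Guarded call: skipped while either field is null.
let attached = false;
if (state.host && state.connection) {
  setHealthListeners(state.host, state.connection);
  attached = true;
}
const attachedWhileNull = attached;

// Once both fields are set again, the guarded call goes through.
state.host = new EventEmitter();
state.connection = new EventEmitter();
if (state.host && state.connection) {
  setHealthListeners(state.host, state.connection);
  attached = true;
}

console.log(unguardedError instanceof TypeError, attachedWhileNull, attached); // true false true
```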



3 participants