NODEJS-681: ControlConnection Concurrent Read and Write on .host and .connection#462
NODEJS-681: ControlConnection Concurrent Read and Write on .host and .connection#462toptobes wants to merge 2 commits into
Conversation
|
|
||
| removeListeners(); | ||
|
|
||
| if (self._isShuttingDown) { |
There was a problem hiding this comment.
this conditional in _refresh() logs if true... should it also log here or no?
| */ | ||
| function toBackground(promise) { | ||
| promise.catch(() => {}); | ||
| promise?.catch(() => {}); |
There was a problem hiding this comment.
not sure if it's this package's responsibility to deal with the fact that the promise can be null 🤷
| assert.strictEqual(cc.hosts.length, 1); | ||
| }); | ||
|
|
||
| it('should not break when refreshing concurrently', async () => { |
There was a problem hiding this comment.
may need a better heuristic for ensuring the refreshes are okay... not sure... I just "borrowed" this from Jane's original PR
| */ | ||
| async _refresh(hostIterator) { | ||
| if (this._refreshInProgress) { | ||
| return; |
There was a problem hiding this comment.
returning an auto-resolving promise here would defer the _refresh call to the microtask queue which may or may not open a hole for a race condition–I haven't checked.
I don't think it's necessary to return one, but if we do, it's definitely something that needs to be done with at least a little caution
There was a problem hiding this comment.
Pull request overview
This PR addresses NODEJS-681 by adding concurrency protection around ControlConnection._refresh() to avoid concurrent refresh executions that can lead to inconsistent .host / .connection state, and adds an integration test intended to exercise concurrent refresh behavior.
Changes:
- Add
_refreshInProgressguard and refactor refresh logic into_unsafeDoRefresh()inControlConnection. - Add an integration test that triggers many
_refresh()calls. - Make
promiseUtils.toBackground()tolerateundefined/nullinputs via optional chaining.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| lib/control-connection.js | Adds a refresh-in-progress guard and refactors refresh implementation into a separate method. |
| lib/promise-utils.js | Makes toBackground() no-op safely when given a nullish value. |
| test/integration/short/control-connection-tests.js | Adds a concurrency-focused integration test for control connection refresh. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| /** | ||
| * Deals with unexpected rejections in order to avoid the unhandled promise rejection warning or failure. | ||
| * @param {Promise} promise | ||
| * @returns {undefined} | ||
| */ | ||
| function toBackground(promise) { | ||
| promise.catch(() => {}); | ||
| promise?.catch(() => {}); | ||
| } |
| const refreshPromises = []; | ||
| // randomly emit cc._refresh 100 times | ||
| for (let i = 0; i < 100; i++) { | ||
| refreshPromises.push(cc._refresh()); | ||
| await helper.delayAsync(~~(Math.random() * 100)); | ||
| } | ||
| await Promise.all(refreshPromises); |
There was a problem hiding this comment.
Copilot's suggestion won't work. If the _refresh call all starts around the same time and ends around the same time, the null pointer exception won't actually happen
There was a problem hiding this comment.
And although my test can be unreliable, false test failures won't happen.
I agree that this test isn't ideal tho, open to ideas on how to improve it
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
SiyaoIsHiding
left a comment
There was a problem hiding this comment.
I think, the main reason why it broke in the past, is that this.host and this.connection are a nullable type, (they are supposed to be null in certain circumstainces like refreshing), but function calls like this._setHealthListeners(this.host, this.connection); are not treating them as nullable. Imagine if this is TypeScript, it would complain that _setHealthListeners expects Host, Connection as argument while this call passes Host?, Connection?.
So, aside from the _refreshInProgress flag to make sure only one refresh is happening at one time, I think we should still make sure we are using this.host and this.connection as nullable.
That means this._setHealthListeners(this.host, this.connection); should be
if (this.host && this.connection){
this._setHealthListeners(this.host, this.connection);
}
And other places that access this.host and this.connection should also has null guards, like this.
What do you think?
| const refreshPromises = []; | ||
| // randomly emit cc._refresh 100 times | ||
| for (let i = 0; i < 100; i++) { | ||
| refreshPromises.push(cc._refresh()); | ||
| await helper.delayAsync(~~(Math.random() * 100)); | ||
| } | ||
| await Promise.all(refreshPromises); |
There was a problem hiding this comment.
Copilot's suggestion won't work. If the _refresh call all starts around the same time and ends around the same time, the null pointer exception won't actually happen
| const refreshPromises = []; | ||
| // randomly emit cc._refresh 100 times | ||
| for (let i = 0; i < 100; i++) { | ||
| refreshPromises.push(cc._refresh()); | ||
| await helper.delayAsync(~~(Math.random() * 100)); | ||
| } | ||
| await Promise.all(refreshPromises); |
There was a problem hiding this comment.
And although my test can be unreliable, false test failures won't happen.
I agree that this test isn't ideal tho, open to ideas on how to improve it
This PR superceeds Jane He's #430 with her blessing
After some investigation, we were unable to figure out the root cause behind the NPEs, with there being multiple potential avenues where the issue may have originated from, and so we decided to fix the issue at the lowest and simplest level we could–simply adding a stronger concurrency control to
_refreshdirectly via a_refreshInProgressflagI personally believe the issue stemmed from
_setHealthListenersbeing called multiple times on the same host/connection, causing the listeners to trigger refreshes multiple times for the same event, leading to the NPEs mentioned in the ticket.However the issue is quite hard to organically reproduce so the theory remains a theory.
Potential trace
which means that there's the potential of, sequentially: