Skip to content

CCOR-13137 - Handle 401 + 403 (w/auth error) Gracefully#121

Open
chrishagglund-ship-it wants to merge 6 commits into
mainfrom
fix/refresh-token-when-expired
Open

CCOR-13137 - Handle 401 + 403 (w/auth error) Gracefully#121
chrishagglund-ship-it wants to merge 6 commits into
mainfrom
fix/refresh-token-when-expired

Conversation

@chrishagglund-ship-it

@chrishagglund-ship-it chrishagglund-ship-it commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

CCOR-13137

Rework automatic token refresh to be more like the python implementation, (react to 403 as well, require error string in json response)

Pull Request type

  • [ x ] Bug fix

Changes in this PR

Improve handling of errors from requests that were not authorized. Prior work added reactive automatic token refresh however the implementation was naive - this should bring it up to the standards of the preferred implementation from the python sdk.

Implementation requirements:

  • automatically refresh token inline when it is known to be expired
  • intercept 401 + 403 response that has error text values EXPIRED_TOKEN or INVALID_TOKEN, and make an inline attempt to get a fresh token
  • max 5 failures with exponential backoff + reset counter on success
  • fatal exception / terminate with error when retries is exhausted

…ion, (react to 403 as well, require error string in json response)
@codecov

codecov Bot commented Jun 5, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 86.48649% with 10 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...conductor/client/http/TokenRefreshInterceptor.java 85.29% 4 Missing and 1 partial ⚠️
...kes/conductor/client/http/OrkesAuthentication.java 88.00% 2 Missing and 1 partial ⚠️
...ctor/client/http/FatalAuthenticationException.java 50.00% 2 Missing ⚠️
Files with missing lines Coverage Δ Complexity Δ
...netflix/conductor/client/automator/TaskRunner.java 55.95% <100.00%> (+1.35%) 31.00 <4.00> (+5.00)
...main/java/io/orkes/conductor/client/ApiClient.java 15.06% <100.00%> (ø) 3.00 <0.00> (ø)
...ctor/client/http/FatalAuthenticationException.java 50.00% <50.00%> (ø) 1.00 <1.00> (?)
...kes/conductor/client/http/OrkesAuthentication.java 79.68% <88.00%> (+10.79%) 14.00 <4.00> (+4.00)
...conductor/client/http/TokenRefreshInterceptor.java 85.29% <85.29%> (ø) 11.00 <11.00> (?)

... and 4 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@chrishagglund-ship-it chrishagglund-ship-it changed the title https://orkes.atlassian.net/browse/CCOR-13137 CCOR-13137 - Handle 401 + 403 (w/auth error) Gracefully Jun 5, 2026

@mp-orkes mp-orkes left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall direction might be good (interceptor + backoff), but please address the comments before merge.

The permanent lockout, and the way it's currently handled, is a blocker IMO.

}
synchronized (refreshLock) {
if (tokenRefreshFailures >= MAX_TOKEN_REFRESH_FAILURES) {
throw new ConductorClientException("Token refresh has failed " + tokenRefreshFailures

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we hit this condition the process can never recover. As already mentioned in this comment.

As it is, this shouldn't be a recoverable Exception. It should be fatal, either a dedicated unchecked error that propagates out, or an explicit process exit.

@mp-orkes mp-orkes requested review from c4lm and manan164 June 10, 2026 15:01
…ration) - demonstrate token issue locally in different ways
… in a place that makes sense and exit program
@chrishagglund-ship-it

chrishagglund-ship-it commented Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

will now handle the permanent lockout by exiting the program. also enhanced the harness worker to have a basic mode that replicates the simple test worker connect and poll so i could run it conveniently to test that

edit: added some tests for the new fatal exception, also did manual testing with the harness worker both full and worker-only modes with bad and valid auth keys, worked as expected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants