Add Elastic Defensiveness #899

Merged
AbdulRahmanAlHamali merged 4 commits into pid-take-2 from defensiveness on Dec 4, 2025

Conversation

@AbdulRahmanAlHamali (Contributor)

This PR:

  1. Adds an elastic defensiveness coefficient that lets us block more than the observed error rate, depending on how high that error rate is (the higher it is, the more we block)
  2. Increases Ki, which makes sure we don't drop the rejection rate too quickly when the error rate drops (especially while we're blocking 100%)
  3. Replaces the integral anti-windup with a simple clamp between -10 and 10. For some reason, the saturation-based clamping was leading to weird results.
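Taken together, the three changes could be sketched roughly like this. This is a minimal toy controller, assuming a hypothetical `ElasticPid` class and the gain values shown in this PR; it is not the actual Semian implementation:

```ruby
# Toy PID-style rejection controller (hypothetical class, not Semian's code).
# Illustrates the three changes: elastic P term, larger Ki, simple integral clamp.
class ElasticPid
  def initialize(kp: 1.0, ki: 0.2, kd: 0.0, ideal_error_rate: 0.05)
    @kp = kp
    @ki = ki
    @kd = kd
    @ideal_error_rate = ideal_error_rate
    @integral = 0.0
    @rejection_rate = 0.0
    @last_p = 0.0
  end

  # Returns the new rejection rate for the observed error rate.
  def update(current_error_rate, dt = 1.0)
    delta_error = current_error_rate - @ideal_error_rate

    # Elastic defensiveness: the feedback term shrinks as delta_error grows,
    # so the equilibrium rejection rate exceeds the raw error rate.
    p_value = delta_error - (1 - delta_error) * @rejection_rate

    # Simple anti-windup: clamp the integral to [-10, 10].
    @integral = (@integral + p_value * dt).clamp(-10.0, 10.0)

    derivative = (p_value - @last_p) / dt
    @last_p = p_value

    output = @kp * p_value + @ki * @integral + @kd * derivative
    @rejection_rate = (@rejection_rate + output).clamp(0.0, 1.0)
  end
end
```

With a sustained 50% error rate this sketch settles at a rejection rate above 50%, and with the error rate below the ideal it never starts rejecting.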

integral_avg: metrics.sum { |m| m[:integral] } / metrics.length.to_f,
integral_min: metrics.map { |m| m[:integral] }.min,
integral_max: metrics.map { |m| m[:integral] }.max,
p_value_avg: metrics.sum { |m| m[:p_value] } / metrics.length.to_f,
Contributor Author

We were not recording P values in the CSV; added them.

Contributor Author

Much better protection.

lib/semian.rb Outdated
- kd: 0.5, # Small derivative gain (as per design doc)
+ kp: 1.0, # Standard proportional gain
+ ki: 0.2, # Moderate integral gain
+ kd: 0.0, # Derivative gain disabled
Contributor Author

The Kd seems to have a bad effect in general. When we are blocking 100% and the error rate drops to 0, P becomes very negative while the previous P was very positive, so the derivative term ends up accelerating the drop. That is good if the incident has actually recovered, but very bad if it hasn't.

Setting it to 0 seems to provide better protection.
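The effect can be illustrated numerically (the values below are made up for illustration, not taken from an actual run):

```ruby
# Made-up numbers showing how a derivative term amplifies the drop
# when P swings from strongly positive to strongly negative.
kd = 0.5
dt = 1.0
prev_p = 0.9  # error rate high, blocking hard
p = -0.8      # error rate just dropped to ~0 while rejection is ~100%

derivative_term = kd * (p - prev_p) / dt
# Contributes roughly -0.85 on top of kp * p, accelerating the drop
# in the rejection rate; with kd = 0.0 this term vanishes entirely.
puts derivative_term
```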

- # Output was clamped, reverse the integral accumulation
- @integral -= @last_p_value * dt
- end
+ @integral = @integral.clamp(-10.0, 10.0)
Contributor Author

I'm not really sure what was wrong with the previous clamping method, but the integral behaviour with it was very weird and it never settled back to 0.

A simple clamp to [-10, 10] seems to work, and makes sure the integral does go back to 0 eventually.
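As a sanity check on the new behaviour, a plain clamp is easy to reason about: the integral saturates at the bound under sustained positive P and unwinds back to 0 once P flips sign. Toy numbers below, not the real controller:

```ruby
# Toy demonstration of a clamped integral accumulator.
integral = 0.0
dt = 1.0

# Sustained positive P (0.5 per tick) pushes the integral into the upper bound.
100.times { integral = (integral + 0.5 * dt).clamp(-10.0, 10.0) }
puts integral # 10.0: saturated at the clamp, not 50.0

# Once P goes negative (error resolved while still rejecting), it unwinds.
20.times { integral = (integral - 0.5 * dt).clamp(-10.0, 10.0) }
puts integral # 0.0: settled back to zero
```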

- # P decreases when: rejection_rate > 0 (feedback mechanism)
- (current_error_rate - ideal_error_rate) - @rejection_rate
+ delta_error = current_error_rate - ideal_error_rate
+ delta_error - (1 - delta_error) * @rejection_rate
Contributor Author

This is the key change: an elastic defensiveness that stretches or relaxes based on how bad the error rate is.
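Solving the new P formula for its equilibrium (P = 0) makes the stretching visible: the steady-state rejection rate is delta_error / (1 - delta_error), which always sits above delta_error and grows faster the worse the errors get. An ideal_error_rate of 5% is assumed below for illustration:

```ruby
# Equilibrium (P = 0) rejection rate under the old and new P formulas.
ideal = 0.05
[0.10, 0.30, 0.60, 0.90].each do |error_rate|
  delta = error_rate - ideal
  old_r = delta               # old: P = delta - r            => r = delta
  new_r = delta / (1 - delta) # new: P = delta - (1 - delta) * r
  # new_r values above 1.0 would be clamped to 100% rejection in practice.
  printf("error=%.2f  old=%.3f  new=%.3f\n", error_rate, old_r, new_r)
end
```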

@anagayan left a comment

Edit: Moved comment to code section. Ignore this


Why is the baseline of successes lower on the new one?

Contributor

Pretty sure that's just normal test-to-test variability. I've noticed that running all tests in parallel usually decreases the RPS for some tests. See adaptive for this test as well.


Why are these two numbers so different in the graphs?

Contributor

Same reasoning as mentioned in my other comment: basically variability in test runs and other external factors.

I re-ran:

[image: re-run test results]


Seems like a lot of oscillations. Have we tried larger Ki values to see if it helps here?

Contributor

This is for classic.

ideal_error_rate = calculate_ideal_error_rate

# P = (error_rate - ideal_error_rate) - rejection_rate
# P = (error_rate - ideal_error_rate) - (1 - (error_rate - ideal_error_rate)) * rejection_rate

Thinking about this more, can we add an error threshold before we start rejecting? As in, we allow a range of errors to happen without engaging rejections; once errors pass that threshold, we engage the adaptive controller and start rejecting.

Contributor Author

I think that's a good idea. Do you have a suggestion on what the threshold should be?

I'm wondering whether it should be an absolute number (e.g. 1%) or something relative to the ideal error rate.

Another option could be to have finite resolution on our error observation, i.e. applying some rounding up/down so that a 1.2% error rate just becomes 1%.

What do you think?
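The finite-resolution option, if we went that way, could be as simple as rounding the observed error rate to 1% steps. Purely a sketch; `quantize` and the 1% resolution are hypothetical, nothing here is implemented:

```ruby
# Hypothetical quantization of the observed error rate.
RESOLUTION = 0.01 # illustrative 1% resolution

def quantize(error_rate, resolution = RESOLUTION)
  (error_rate / resolution).round * resolution
end

puts quantize(0.012) # 1.2% observed -> treated as 1%
puts quantize(0.017) # 1.7% observed -> treated as 2%
```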

Contributor

Don't we essentially have that threshold already, since we initially set the ideal error rate at 5%?
I do wonder if we should then clamp it to at least 1%.


I think it should be relative to the ideal error rate. Maybe 20-25% of it or something similar; completely arbitrary, but something we can play around with initially.

Contributor Author

Sounds good. Just to confirm: would you say your comment is specific to this "elastic defensiveness" implementation, or more general? If general, we can get this merged and open a separate issue.


Spoke with @Aguasvivas22 on this, and it looks like we technically already have it in place with the P formula and its subtraction. I think the only thing missing is that we may want to add a floor value so it doesn't go to 0.

More of a general comment, so ignore it on this PR; we can follow up if we agree and find it useful.
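The floor idea, for a future follow-up, might look like this. Purely hypothetical; the helper name is made up and the 1% value is the one floated earlier in the thread:

```ruby
# Hypothetical floor on the dynamically computed ideal error rate,
# so it never decays all the way to 0 (1% value is illustrative).
IDEAL_ERROR_RATE_FLOOR = 0.01

def floored_ideal_error_rate(computed)
  [computed, IDEAL_ERROR_RATE_FLOOR].max
end

puts floored_ideal_error_rate(0.005) # floored up to 0.01
puts floored_ideal_error_rate(0.05)  # unchanged: 0.05
```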

@AbdulRahmanAlHamali AbdulRahmanAlHamali merged commit b495564 into pid-take-2 Dec 4, 2025
32 checks passed
@AbdulRahmanAlHamali AbdulRahmanAlHamali deleted the defensiveness branch December 4, 2025 17:40