Add Elastic Defensiveness #899
```diff
 integral_avg: metrics.sum { |m| m[:integral] } / metrics.length.to_f,
 integral_min: metrics.map { |m| m[:integral] }.min,
 integral_max: metrics.map { |m| m[:integral] }.max,
+p_value_avg: metrics.sum { |m| m[:p_value] } / metrics.length.to_f,
```
We were not recording P values in the CSV, so I added them.
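For illustration, here is a runnable sketch of what those aggregations compute over a made-up metrics array (the key names mirror the diff; the data is invented):

```ruby
# Hypothetical per-window metrics, as collected during a benchmark run.
metrics = [
  { integral: 1.0, p_value: 0.25 },
  { integral: 3.0, p_value: 0.75 },
]

integral_avg = metrics.sum { |m| m[:integral] } / metrics.length.to_f # => 2.0
integral_min = metrics.map { |m| m[:integral] }.min                   # => 1.0
integral_max = metrics.map { |m| m[:integral] }.max                   # => 3.0
p_value_avg  = metrics.sum { |m| m[:p_value] } / metrics.length.to_f  # => 0.5
```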
much better protection
lib/semian.rb (outdated)
```diff
-kd: 0.5, # Small derivative gain (as per design doc)
 kp: 1.0, # Standard proportional gain
 ki: 0.2, # Moderate integral gain
+kd: 0.0, # Small derivative gain (as per design doc)
```
Kd seems to have a bad effect in general. When we are blocking 100% and the error rate drops to 0, P becomes very negative while the previous P was very positive, so the derivative term ends up accelerating the drop. That's good if the incident has actually recovered, but very bad if it hasn't.
Setting Kd to 0 seems to provide better protection.
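A small sketch of the effect described above, using a standalone derivative term (the function and numbers are illustrative, not Semian's actual controller code):

```ruby
# Kd scales the rate of change of P; a sign flip in P produces a large spike.
def derivative_term(kd, p_prev, p_curr, dt)
  kd * (p_curr - p_prev) / dt
end

# Blocking 100%, error rate drops to 0: P swings from +0.9 to -0.9 in one step.
derivative_term(0.5, 0.9, -0.9, 1.0) # => -0.9, pushes the output down even faster
derivative_term(0.0, 0.9, -0.9, 1.0) # => 0.0, no extra acceleration
```

With Kd = 0 the sign flip contributes nothing extra, so the controller backs off only as fast as the P and I terms allow.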
```diff
-  # Output was clamped, reverse the integral accumulation
-  @integral -= @last_p_value * dt
-end
+@integral = @integral.clamp(-10.0, 10.0)
```
I'm not really sure what was wrong with the previous clamping method, but the integral behaviour with it was very weird and never settled back to 0.
A simple clamp to [-10, 10] seems to work and ensures it does go back to 0 eventually.
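A minimal sketch of the simple-clamp accumulation (the clamp bounds follow the diff; the helper itself and the step values are illustrative):

```ruby
# Accumulate P into the integral, then clamp to [-10, 10]. Because the
# accumulated value is bounded rather than frozen, sustained negative error
# winds it straight back down instead of leaving it stuck.
def accumulate(integral, p_value, dt)
  (integral + p_value * dt).clamp(-10.0, 10.0)
end

integral = 0.0
20.times { integral = accumulate(integral, 1.0, 1.0) }  # saturates at 10.0
5.times  { integral = accumulate(integral, -1.0, 1.0) } # winds back down to 5.0
```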
```diff
-# P decreases when: rejection_rate > 0 (feedback mechanism)
-(current_error_rate - ideal_error_rate) - @rejection_rate
+delta_error = current_error_rate - ideal_error_rate
+delta_error - (1 - delta_error) * @rejection_rate
```
This is the key change: an elastic defensiveness that stretches or relaxes based on how bad the error rate is.
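To see the elasticity, here is an illustrative side-by-side of the two P formulas (the method names are made up; the arithmetic mirrors the diff):

```ruby
# Old: rejection feedback is subtracted at full weight.
def p_old(error_rate, ideal_error_rate, rejection_rate)
  (error_rate - ideal_error_rate) - rejection_rate
end

# New: the feedback is scaled by (1 - delta_error), so the worse the error
# rate, the less the current rejection rate counteracts P.
def p_new(error_rate, ideal_error_rate, rejection_rate)
  delta_error = error_rate - ideal_error_rate
  delta_error - (1 - delta_error) * rejection_rate
end

# Severe incident: 95% errors, 5% ideal, already rejecting 90% of traffic.
p_old(0.95, 0.05, 0.9) # => ~0.0,  old formula stops pushing
p_new(0.95, 0.05, 0.9) # => ~0.81, elastic formula keeps defending
```

The two formulas agree when delta_error is near 0, and diverge as it grows, which is what makes the defensiveness "stretch" with incident severity.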
Why is the baseline of successes lower on the new one?
Pretty sure that's just normal test-to-test variability. I've noticed that running all tests in parallel usually lowers the RPS for some tests; see the adaptive results for this test as well.
Why are these two numbers so different in the graphs?
Seems like a lot of oscillations. Have we tried larger Ki values to see if it helps here?
```diff
 ideal_error_rate = calculate_ideal_error_rate

-# P = (error_rate - ideal_error_rate) - rejection_rate
+# P = (error_rate - ideal_error_rate) - (1 - (error_rate - ideal_error_rate)) * rejection_rate
```
Thinking about this more, can we add an error threshold before we start rejecting? That is, we allow a range of errors to happen without engaging rejections; once the error rate passes that threshold, we start engaging the adaptive controller and rejections.
I think that's a good idea. Do you have a suggestion on what the threshold should be?
I'm wondering whether it should be an absolute number (e.g. 1%) or something relative to the ideal error rate.
Another option could be to have finite resolution on our error observation, i.e. applying some rounding up/down so that a 1.2% error rate just becomes 1%.
What do you think?
Don't we essentially have that threshold since we initially set the ideal error rate at 5%?
I do wonder if we should then clamp it to at least 1%
I think it should be relative to the ideal error rate. Maybe 20-25% of it or something similar; completely arbitrary, but something we can play around with initially.
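As a possible follow-up, here is a hypothetical sketch of such a relative dead band. The 25% fraction, the method name, and the constant are all assumptions for illustration, not part of this PR:

```ruby
# Assumed tuning knob: how far above the ideal error rate we tolerate
# errors before engaging the adaptive controller at all.
DEAD_BAND_FRACTION = 0.25

def engage_controller?(error_rate, ideal_error_rate)
  error_rate > ideal_error_rate * (1 + DEAD_BAND_FRACTION)
end

engage_controller?(0.06, 0.05) # => false, within the dead band
engage_controller?(0.08, 0.05) # => true, exceeds ideal by more than 25%
```

Because the band scales with the ideal error rate, a noisy service with a high baseline gets proportionally more slack than a normally quiet one.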
Sounds good. Just to confirm: would you say your comment is specific to this "elastic defensiveness" implementation, or more general? If it's general, we can get this merged and open a separate issue.
Spoke with @Aguasvivas22 about this, and it looks like we technically already have this in place via the P formula and its subtraction. I think the only thing missing is that we may want to add a floor value so it doesn't go to 0.
It's more of a general comment, so ignore it on this PR; we can follow up if we agree on it and find it useful.

This PR: