Add Elastic Defensiveness #899

Merged
AbdulRahmanAlHamali merged 4 commits into pid-take-2 from defensiveness on Dec 4, 2025

Conversation

@AbdulRahmanAlHamali (Contributor)

This PR:

  1. Adds an elastic defensiveness coefficient that lets us block more than the observed error rate, depending on how high that error rate is (the higher it is, the more we block)
  2. Increases Ki, which makes sure we don't drop the rejection rate too quickly when the error rate drops (especially while we're blocking 100%)
  3. Replaces the integral anti-windup with a simple clamp between -10 and 10. For some reason, the saturation-based clamping was leading to weird results.
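Taken together, the three changes could be sketched roughly like this. This is a minimal toy controller, assuming a hypothetical `ElasticPid` class and the gain values shown in this PR; it is not the actual Semian implementation:

```ruby
# Toy PID-style rejection controller (hypothetical class, not Semian's code).
# Illustrates the three changes: elastic P term, larger Ki, simple integral clamp.
class ElasticPid
  def initialize(kp: 1.0, ki: 0.2, kd: 0.0, ideal_error_rate: 0.05)
    @kp = kp
    @ki = ki
    @kd = kd
    @ideal_error_rate = ideal_error_rate
    @integral = 0.0
    @rejection_rate = 0.0
    @last_p = 0.0
  end

  # Returns the new rejection rate for the observed error rate.
  def update(current_error_rate, dt = 1.0)
    delta_error = current_error_rate - @ideal_error_rate

    # Elastic defensiveness: the feedback term shrinks as delta_error grows,
    # so the equilibrium rejection rate exceeds the raw error rate.
    p_value = delta_error - (1 - delta_error) * @rejection_rate

    # Simple anti-windup: clamp the integral to [-10, 10].
    @integral = (@integral + p_value * dt).clamp(-10.0, 10.0)

    derivative = (p_value - @last_p) / dt
    @last_p = p_value

    output = @kp * p_value + @ki * @integral + @kd * derivative
    @rejection_rate = (@rejection_rate + output).clamp(0.0, 1.0)
  end
end
```

With a sustained 50% error rate this sketch settles at a rejection rate above 50%, and with the error rate below the ideal it never starts rejecting.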

integral_avg: metrics.sum { |m| m[:integral] } / metrics.length.to_f,
integral_min: metrics.map { |m| m[:integral] }.min,
integral_max: metrics.map { |m| m[:integral] }.max,
p_value_avg: metrics.sum { |m| m[:p_value] } / metrics.length.to_f,
Contributor Author

We were not recording P values in the CSV; added them.

Contributor Author

Much better protection.

lib/semian.rb Outdated
- kd: 0.5, # Small derivative gain (as per design doc)
+ kp: 1.0, # Standard proportional gain
+ ki: 0.2, # Moderate integral gain
+ kd: 0.0, # Derivative gain disabled
Contributor Author

The Kd seems to have a bad effect in general. When we are blocking 100% and the error rate drops to 0, P becomes very negative while the previous P was very positive, so the derivative term ends up accelerating the drop. That is good if the incident has actually recovered, but very bad if it hasn't.

Setting it to 0 seems to provide better protection.
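The effect can be illustrated numerically (the values below are made up for illustration, not taken from an actual run):

```ruby
# Made-up numbers showing how a derivative term amplifies the drop
# when P swings from strongly positive to strongly negative.
kd = 0.5
dt = 1.0
prev_p = 0.9  # error rate high, blocking hard
p = -0.8      # error rate just dropped to ~0 while rejection is ~100%

derivative_term = kd * (p - prev_p) / dt
# Contributes roughly -0.85 on top of kp * p, accelerating the drop
# in the rejection rate; with kd = 0.0 this term vanishes entirely.
puts derivative_term
```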

- # Output was clamped, reverse the integral accumulation
- @integral -= @last_p_value * dt
- end
+ @integral = @integral.clamp(-10.0, 10.0)
Contributor Author

I'm not really sure what was wrong with the previous clamping method, but the integral behaviour with it was very weird and it never settled back to 0.

A simple clamp to [-10, 10] seems to work, and makes sure the integral does go back to 0 eventually.
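As a sanity check on the new behaviour, a plain clamp is easy to reason about: the integral saturates at the bound under sustained positive P and unwinds back to 0 once P flips sign. Toy numbers below, not the real controller:

```ruby
# Toy demonstration of a clamped integral accumulator.
integral = 0.0
dt = 1.0

# Sustained positive P (0.5 per tick) pushes the integral into the upper bound.
100.times { integral = (integral + 0.5 * dt).clamp(-10.0, 10.0) }
puts integral # 10.0: saturated at the clamp, not 50.0

# Once P goes negative (error resolved while still rejecting), it unwinds.
20.times { integral = (integral - 0.5 * dt).clamp(-10.0, 10.0) }
puts integral # 0.0: settled back to zero
```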

- # P decreases when: rejection_rate > 0 (feedback mechanism)
- (current_error_rate - ideal_error_rate) - @rejection_rate
+ delta_error = current_error_rate - ideal_error_rate
+ delta_error - (1 - delta_error) * @rejection_rate
Contributor Author

This is the key change: an elastic defensiveness that stretches or relaxes based on how bad the error rate is.
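Solving the new P formula for its equilibrium (P = 0) makes the stretching visible: the steady-state rejection rate is delta_error / (1 - delta_error), which always sits above delta_error and grows faster the worse the errors get. An ideal_error_rate of 5% is assumed below for illustration:

```ruby
# Equilibrium (P = 0) rejection rate under the old and new P formulas.
ideal = 0.05
[0.10, 0.30, 0.60, 0.90].each do |error_rate|
  delta = error_rate - ideal
  old_r = delta               # old: P = delta - r            => r = delta
  new_r = delta / (1 - delta) # new: P = delta - (1 - delta) * r
  # new_r values above 1.0 would be clamped to 100% rejection in practice.
  printf("error=%.2f  old=%.3f  new=%.3f\n", error_rate, old_r, new_r)
end
```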

@anagayan left a comment

Edit: Moved comment to code section. Ignore this


Why is the baseline of successes lower on the new one?

Contributor

Pretty sure that's just normal test-to-test variability. I've noticed that running all tests in parallel usually decreases the RPS for some tests. See adaptive for this test as well.


Why are these two numbers so different in the graphs?

Contributor

Same reasoning as mentioned in my other comment: basically variability in test runs and other external factors.

I re-ran:

[image: re-run test results]


Seems like a lot of oscillations. Have we tried larger Ki values to see if it helps here?

Contributor

This is for classic.

ideal_error_rate = calculate_ideal_error_rate

# P = (error_rate - ideal_error_rate) - rejection_rate
# P = (error_rate - ideal_error_rate) - (1 - (error_rate - ideal_error_rate)) * rejection_rate

Thinking about this more, can we add an error threshold before we start rejecting? As in, we allow a range of errors to happen without engaging rejections; once errors pass that threshold, we engage the adaptive controller and start rejecting.

Contributor Author

I think that's a good idea. Do you have a suggestion on what the threshold should be?

I'm wondering whether it should be an absolute number (e.g. 1%) or something relative to the ideal error rate.

Another option could be to have finite resolution on our error observation, i.e. applying some rounding up/down so that a 1.2% error rate just becomes 1%.

What do you think?
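The finite-resolution option, if we went that way, could be as simple as rounding the observed error rate to 1% steps. Purely a sketch; `quantize` and the 1% resolution are hypothetical, nothing here is implemented:

```ruby
# Hypothetical quantization of the observed error rate.
RESOLUTION = 0.01 # illustrative 1% resolution

def quantize(error_rate, resolution = RESOLUTION)
  (error_rate / resolution).round * resolution
end

puts quantize(0.012) # 1.2% observed -> treated as 1%
puts quantize(0.017) # 1.7% observed -> treated as 2%
```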

Contributor

Don't we essentially have that threshold already, since we initially set the ideal error rate at 5%?
I do wonder if we should then clamp it to at least 1%.


I think it should be relative to the ideal error rate. Maybe 20-25% of it or something similar; completely arbitrary, but something we can play around with initially.

Contributor Author

Sounds good. Just to confirm: would you say your comment is specific to this "elastic defensiveness" implementation, or more general? If general, we can get this merged and open a separate issue.


Spoke with @Aguasvivas22 on this, and it looks like we technically already have it in place with the P formula and its subtraction. I think the only thing missing is that we may want to add a floor value so it doesn't go to 0.

More of a general comment, so ignore it on this PR; we can follow up if we agree and find it useful.
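The floor idea, for a future follow-up, might look like this. Purely hypothetical; the helper name is made up and the 1% value is the one floated earlier in the thread:

```ruby
# Hypothetical floor on the dynamically computed ideal error rate,
# so it never decays all the way to 0 (1% value is illustrative).
IDEAL_ERROR_RATE_FLOOR = 0.01

def floored_ideal_error_rate(computed)
  [computed, IDEAL_ERROR_RATE_FLOOR].max
end

puts floored_ideal_error_rate(0.005) # floored up to 0.01
puts floored_ideal_error_rate(0.05)  # unchanged: 0.05
```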

@AbdulRahmanAlHamali AbdulRahmanAlHamali merged commit b495564 into pid-take-2 Dec 4, 2025
32 checks passed
@AbdulRahmanAlHamali AbdulRahmanAlHamali deleted the defensiveness branch December 4, 2025 17:40