Add sampler API, use in SDK tracer #225
Conversation
This reverts commit e743fdc.
```python
@classmethod
def get_bound_for_rate(cls, rate: float) -> int:
    return round(rate * (cls.CHECK_BYTES + 1))
```
If rate is less than 0.0000000000000000000271, bound will be rounded to 0 which means it never gets sampled.
Should we consider rounding up, or is this just too pedantic?
Actually it is 0.0000000000000000000813 since bound is rounded to 1 (and all-zero trace id is illegal according to the TraceContext spec). 😄
I used round intentionally here for what I thought were pedantic reasons: this way each of the 2^64 sampling rate "buckets" are split exactly in half:
```
r <= 2^-65                 : 0
2^-65 < r <= 3 * 2^-65     : 1
3 * 2^-65 < r <= 5 * 2^-65 : 2
```
and so on.
I think if you pass in a sampling rate <= 2^-65 it's more correct for us to make the actual rate 0 than 2^-64.
This doesn't exactly work because float precision is worse than 2^-64 for values close to 1, but it works near 0 and is at least theoretically correct (or should be, there may still be bugs here).
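For concreteness, here is a minimal runnable sketch of the rounding behavior discussed above. `CHECK_BYTES` and `get_bound_for_rate` are reconstructed from the diff; the `ProbabilitySampler` class name is an assumption.

```python
class ProbabilitySampler:
    # Mask for the low 8 bytes of the trace ID (as in the diff above).
    CHECK_BYTES = 0xFFFFFFFFFFFFFFFF

    @classmethod
    def get_bound_for_rate(cls, rate: float) -> int:
        # round() splits each 2^-64-wide "bucket" in half, so rates at or
        # below 2^-65 map to bound 0 and are never sampled.
        return round(rate * (cls.CHECK_BYTES + 1))

print(ProbabilitySampler.get_bound_for_rate(2 ** -65))  # 0: never sampled
print(ProbabilitySampler.get_bound_for_rate(2 ** -64))  # 1
print(ProbabilitySampler.get_bound_for_rate(0.5))       # 2 ** 63
```

One caveat: Python's `round()` uses round-half-to-even, so exact `.5` boundaries (e.g. `rate = 3 * 2^-65`, which maps to 1.5) round to the even neighbor rather than always down.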
all-zero trace id is illegal according to the TraceContext spec.
Oh that's interesting, so the trace ID space is [0x1, 0xf...f], not [0x0, 0xf...f]. So would you expect this line to be:
`return 1 + round(rate * cls.CHECK_BYTES)`
I think that'd give us the same behavior as above, but always sample trace ID 0x0.
Should we have special handling for 0x0 trace IDs?
I'm also happy to leave this as-is and treat 0x0 as a regular ID here, but treat it differently at span creation time. Up to you. :D
I don't have any preference here as I cannot think of a scenario where people would need such low but non-zero sample rate (unless someone is instrumenting SETI@home). 😆
Something I missed above: there are 2^64 - 1 trace IDs besides 0x0 that end in 0x0000000000000000. If we want to always/never sample the all-zero trace ID, we have to make that decision before masking.
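A quick illustration of that masking point (variable names are illustrative, not from the PR): every 128-bit trace ID whose low 64 bits are zero collapses to the same masked value as the all-zero ID.

```python
CHECK_BYTES = 0xFFFFFFFFFFFFFFFF  # mask for the low 8 bytes, as in the diff

all_zero = 0x0
zero_tail = 0xDEADBEEF << 64  # nonzero 128-bit ID whose low 64 bits are zero

# After masking, the two IDs are indistinguishable, so any special-casing
# of the all-zero trace ID has to happen before the mask is applied.
print((all_zero & CHECK_BYTES) == (zero_tail & CHECK_BYTES))  # True
```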
```python
if parent_context is not None:
    return Decision(parent_context.trace_options.recorded, {})

return Decision(trace_id & self.CHECK_BYTES < self.bound, {})
```
This could be a problem for systems that generate shorter trace_ids and use zero-padding / a constant part, especially in the low sample rate case.
Do you think we should check fewer bytes? Or make it configurable?
This follows the OC behavior; I actually don't know why we only check the lower 8 bytes.
I haven't dug into this.
If we're at the same bar as OpenCensus, we probably should be okay.
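To make the concern concrete, here is a hedged sketch (the helper name and the 1% rate are illustrative): if a system pads a short random ID into the high bytes and leaves the low 8 bytes zero, the masked value is always 0, which is below any nonzero bound, so every trace gets sampled regardless of the configured rate.

```python
CHECK_BYTES = 0xFFFFFFFFFFFFFFFF
bound = round(0.01 * (CHECK_BYTES + 1))  # 1% sampling rate

def sampled(trace_id: int) -> bool:
    return trace_id & CHECK_BYTES < bound

# 64-bit random part zero-padded on the right: the low 8 bytes are always 0.
padded = 0x1234ABCD5678EF90 << 64
print(sampled(padded))  # True -- every such ID is sampled, not 1% of them
```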
If everyone follows the W3C spec here, we should be covered:
When a system operates with a trace-id that is shorter than 16 bytes, it SHOULD fill-in the extra bytes with random values rather than zeroes.
As long as they follow the SHOULDs, which they MAY not. :P
reyang left a comment
Overall structure looks great!
I've left some math related comments.
Co-Authored-By: Reiley Yang <reyang@microsoft.com>
```python
# The sampler checks the last 8 bytes of the trace ID to decide whether to
# sample a given trace.
CHECK_BYTES = 0xFFFFFFFFFFFFFFFF
```
According to the W3C spec, you should use the high ("left") 8 bytes instead:
Many unique identification generation algorithms create IDs where one part of the value is constant (often time- or host-based), and the other part is a randomly generated value. Because tracing systems may make sampling decisions based on the value of trace-id, for increased interoperability vendors MUST keep the random part of trace-id ID on the left side.
But I wonder if we can find a more robust way to maybe randomly mix some bits here and there together.
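One hypothetical way to "mix some bits together" (not part of this PR): pass the checked bytes through a bijective integer finalizer such as the splitmix64 mixing step, so sequential or partially-constant trace IDs still map to well-spread values.

```python
def mix64(x: int) -> int:
    """splitmix64 finalizer: a bijective scramble of a 64-bit value."""
    x &= 0xFFFFFFFFFFFFFFFF
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & 0xFFFFFFFFFFFFFFFF
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & 0xFFFFFFFFFFFFFFFF
    return x ^ (x >> 31)

# Sequential IDs (e.g. from a counter-based generator) spread out:
print([hex(mix64(i) >> 56) for i in range(1, 4)])
```

One caveat: this finalizer maps 0 to 0, so the all-zero ID discussed above would still need special handling.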
I considered this but decided to stick to the OC behavior (or at least the intended behavior: this also fixes a rounding/off-by-one bug). FWIW, checking the high bytes also seems more correct to me, but in practice -- if people are using short trace IDs in the wild -- sampling every request seems worse than sampling based on the non-random part.
I think you're right about this. I suggested it as a change to the spec in open-telemetry/opentelemetry-specification#331.
Another fun benefit of checking bytes high-to-low is that the sampling decision should be mostly consistent between samplers that check different numbers of bytes. Unlike checking low-to-high where every incremental bit is effectively another coin toss. Ultimately we'll probably just put this number in the spec, but it's a neat property in any case. :D
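That consistency property can be illustrated with a hypothetical helper (not from the PR): when the rate is exactly representable in the smaller number of bits (e.g. 0.25 in 2 bits), samplers that check different numbers of high bits make identical decisions.

```python
import random

def sampled_high(high64: int, rate: float, bits: int) -> bool:
    # Compare only the top `bits` bits of the (random) high 64 bits.
    prefix = high64 >> (64 - bits)
    return prefix < round(rate * (1 << bits))

random.seed(0)
ids = [random.getrandbits(64) for _ in range(1000)]
# An 8-bit check and a 64-bit check agree on every ID at rate 0.25:
print(all(sampled_high(i, 0.25, 8) == sampled_high(i, 0.25, 64)
          for i in ids))  # True
```

Checking low-to-high instead would make each extra bit an independent coin toss, and the two checks would disagree on roughly a quarter of the IDs.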
FYI - keep an eye on https://github.com/w3c/trace-context/pull/344/files.
Let's revisit this once open-telemetry/opentelemetry-specification#331 and w3c/trace-context#344 are resolved.
In particular, the following errors are fixed in this commit:

* Don't return False in `__exit__`. Returning a literal causes a mypy error when combined with the `typing.Optional[bool]` type hint. Furthermore, exception handling is the same when returning `False` and when returning `None` (the exception is re-raised). Therefore, it's simpler to remove the return statement and change the type hint to `None`.
* Correctly initialize nested tuple. Tuples of length 1 should be initialized with a trailing comma to be properly interpreted.
* Pass correct type to `use_context()` in test.
* Add type annotations for test helper functions. Since we have `disallow_untyped_calls = True` in our mypy config for tests, we must add type annotations to any function that is called from a test.

Additionally, bump minimal mypy version to 0.740 to consistently reproduce these errors.
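Two of those fixes are easy to illustrate with a minimal sketch (the `Scope` class name is hypothetical): a context manager whose `__exit__` returns `None` re-raises exceptions exactly as `return False` would, and a 1-tuple needs its trailing comma.

```python
from types import TracebackType
from typing import Optional, Type

class Scope:
    def __enter__(self) -> "Scope":
        return self

    # Not returning anything (i.e. returning None) re-raises any exception,
    # just like `return False`, and type-checks cleanly as -> None.
    def __exit__(
        self,
        exc_type: Optional[Type[BaseException]],
        exc_val: Optional[BaseException],
        exc_tb: Optional[TracebackType],
    ) -> None:
        pass

single = ("only-element",)      # a 1-tuple: the trailing comma matters
not_a_tuple = ("only-element")  # just a parenthesized string
print(type(single).__name__, type(not_a_tuple).__name__)  # tuple str
```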
This PR adds the sampler API and changes the SDK tracer so that it creates either a `Span` or `DefaultSpan` depending on the sampling decision at span creation time.

The sampling spec is a moving target. The recently-merged open-telemetry/opentelemetry-specification#296 moves samplers out of the API, adds a distinction between "sampled" and "recorded" traces, and changes the behavior of the tracer with respect to links and attributes.
This PR doesn't implement the recent spec changes yet. I thought it'd be easier to write and review this in two stages: one to add basic samplers, and another to bring samplers up to speed with the spec.
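As a rough sketch of the flow this PR sets up (`Span`, `DefaultSpan`, and `Decision` are named in the PR; the `start_span` helper and the `AlwaysOn` sampler here are simplified assumptions, not the actual implementation):

```python
class Decision:
    def __init__(self, sampled: bool, attributes: dict) -> None:
        self.sampled = sampled
        self.attributes = attributes

class Span:
    """A real span that records its name, attributes, events, etc."""
    def __init__(self, name: str) -> None:
        self.name = name

class DefaultSpan:
    """A no-op span: propagates context but records nothing."""
    def __init__(self, name: str) -> None:
        self.name = name

class AlwaysOn:
    """Trivial sampler that samples every trace."""
    @staticmethod
    def should_sample(parent_context, trace_id: int) -> Decision:
        return Decision(True, {})

def start_span(sampler, parent_context, trace_id: int, name: str):
    # The sampler decides at span creation time which span type to create.
    decision = sampler.should_sample(parent_context, trace_id)
    return Span(name) if decision.sampled else DefaultSpan(name)

print(type(start_span(AlwaysOn, None, 0x1, "request")).__name__)  # Span
```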
See also: