-
Notifications
You must be signed in to change notification settings - Fork 11
fix(ddsketch): correct trim_left bin overflow and handle oversized metrics in encoder #1305
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 2 commits
9d0055e
9456e8c
aef56e9
7bc253d
f18b8ff
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -104,6 +104,15 @@ where | |
| encoded_len: usize, | ||
| uncompressed_len_limit: usize, | ||
| }, | ||
| #[snafu(display( | ||
| "input encoded size ({} bytes) exceeds compressed payload limit ({} bytes) and can never fit", | ||
| encoded_len, | ||
| compressed_len_limit | ||
| ))] | ||
| InputExceedsCompressedSizeLimit { | ||
| encoded_len: usize, | ||
| compressed_len_limit: usize, | ||
| }, | ||
| #[snafu(display("input was invalid for request builder: {:?}'", input))] | ||
| InvalidInput { input: E::Input }, | ||
| #[snafu(display("failed to encode/write payload: {}", source))] | ||
|
|
@@ -333,6 +342,16 @@ where | |
| }); | ||
| } | ||
|
|
||
| // If the input's encoded size already exceeds the compressed payload limit, it can never fit regardless of | ||
| // what else is in the payload, so there is no point in asking the caller to flush and retry. Return an error | ||
| // so the input is dropped rather than looping forever. | ||
| if self.scratch_buf.len() > self.compressed_len_limit { | ||
| return Err(RequestBuilderError::InputExceedsCompressedSizeLimit { | ||
| encoded_len: self.scratch_buf.len(), | ||
| compressed_len_limit: self.compressed_len_limit, | ||
| }); | ||
| } | ||
|
Comment on lines
+345
to
+353
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This actually... isn't right, now that I think about it. The whole thing is that we really can't know if we've exceed the compressed size limit or not until we actually attempt compressing the request payload. I think what's going on, more likely than not, is that we're triggering the "can't encode this metric" pathway ( I think we would actually just want to remove the conditional I mentioned from
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As a further thought, any sort of pathological metric like this would cause problems because even if we allow it in, and then detect an oversized payload during flush and so we split and try to do the two separate halves... the half with the pathological metric will still fail to produce a compressed payload that is within the limit.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'll take another look, I'm wondering whether we actually need to bail this out ahead of time or whether we could instead just not panic if we have a buffer with size 1 and still can't flush
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm also curious whether we expect any legit payloads to ever actually be this big (>3MB)?
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Fixing the compression estimator is definitely something we should do: its purpose is to act purely as an optimization hint to flush before we grow payloads so large that they need to be split... but in this case, it's causing a logical error (instructing the caller to flush when there's no actual payload) so that's just straight-up wrong. If we do that, then what will happen is that we'll keep letting encoded metrics through until the compressor finally does an intermediate flush (because it filled its own internal buffers) and then we'll quickly realize whether or not the current compressed size is over the limit or not... and if not, then the actual estimator logic will kick in. Finally, the check during flush protects us from oversized payloads.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Well, in the case of our weird Go tracer behavior reporting allocation statistics with crazy high sample rates... yes? That's uncompressed, though. All of those nearly identical bins should then compress down fairly well. With the v3 intake protocol (#1175), they'll compress down really well because of the delta encoding that we do.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Those don't get so large once the trimming is actually working though, right? Shouldn't they be in the 10s of KBs? |
||
|
|
||
| // If the input can't fit into the current request payload based on the uncompressed size limit, or isn't likely | ||
| // to fit into the current request payload based on the estimated compressed size limit, then return it to the | ||
| // caller: this indicates that a flush must happen before trying to encode the same input again. | ||
|
|
@@ -947,6 +966,40 @@ mod tests { | |
| // size limits. | ||
| } | ||
|
|
||
| #[tokio::test] | ||
| async fn input_exceeds_compressed_size_limit() { | ||
| // Regression test: when a single input's encoded size exceeds the compressed payload limit, the request | ||
| // builder previously returned Ok(Some(input)) (signalling "flush and retry"), but since the builder | ||
| // was empty there was nothing to flush. The caller (run_request_builder in the metrics encoder) would | ||
| // then panic because flush() returned an empty vec on a supposedly non-empty builder. | ||
| // | ||
| // The fix returns Err(InputExceedsCompressedSizeLimit) instead, letting the caller drop the metric | ||
| // and continue rather than entering an unresolvable flush loop. | ||
| let compressed_limit = 64; | ||
| let encoder = TestEncoder::new(compressed_limit, usize::MAX, "/submit"); | ||
| let mut request_builder = create_no_compression_request_builder(encoder).await; | ||
|
|
||
| // This input encodes to more bytes than the compressed limit, so it can never fit. | ||
| let oversized_input = "x".repeat(compressed_limit + 1); | ||
|
|
||
| match request_builder.encode(oversized_input).await { | ||
| Err(RequestBuilderError::InputExceedsCompressedSizeLimit { | ||
| encoded_len, | ||
| compressed_len_limit, | ||
| }) => { | ||
| assert_eq!(encoded_len, compressed_limit + 1); | ||
| assert_eq!(compressed_len_limit, compressed_limit); | ||
| } | ||
| other => panic!("expected InputExceedsCompressedSizeLimit, got: {:?}", other), | ||
| } | ||
|
|
||
| // The builder should still be empty and usable after the error. | ||
| assert!(request_builder.flush().await.is_empty()); | ||
|
|
||
| let small_input = "hello".to_string(); | ||
| assert_eq!(None, request_builder.encode(small_input).await.unwrap()); | ||
| } | ||
|
|
||
| #[tokio::test] | ||
| async fn uncompressed_size_limit_too_small() { | ||
| // Make sure that we can't build a request builder with an uncompressed size limit that is smaller than the | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically, I think we actually want to trim the lower bins, since lowest-collapsing is the default behavior for the official DDSketch implementation, IIRC: it means you end up trading accuracy at lower percentiles where it matters far less.
As is, we would be harming the upper percentiles: p95, p99, yadda yadda.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah
truncatekeeps the earlier ones, so I guess this wasn't trimming left to begin with. Will fix