feat(binary): add git binary patch support#61
Conversation
61c9c2b to
531e8d9
Compare
a2190d2 to
bc427f8
Compare
There was a problem hiding this comment.
We use our hand-written one. Alternatives:
- base85
- https://crates.io/crates/base85
- Same RFC 1924 alphabet
- MPL-2.0
- Doesn't have
decode_intoto avoid allocation
- base85rs
- https://crates.io/crates/base85rs
- Same RFC 1924 alphabet
decode -> Optionnot good for error reporting- Doesn't have
decode_intoto avoid allocation
- ascii85
- https://crates.io/crates/ascii85
- It's not Git base85
- z85
- https://crates.io/crates/z85
- ZeroMQ variant
There was a problem hiding this comment.
The hand-written one here is solid, and it is nice to avoid an extra dep if we can
| } else { | ||
| // ADD instruction | ||
| let add_len = control as usize; | ||
| let data = cursor.read_bytes(add_len)?; | ||
| result.extend_from_slice(data); | ||
| } |
There was a problem hiding this comment.
Its been a while since i've implemented the delta compression algorithm but I think that this may be mis handling the case where control == 0x00 which is reserved by by git but here we're treating it as an ADD of 0 bytes.
Although rereading the diffx spec you linked at the top of this file 0x00 doesn't seem to be called out as special or reserved so we could probably go either way.
There was a problem hiding this comment.
Good catch! This is definitely worth a compat test. I dare not reading GPL source to confirm that 😨.
| }); | ||
| } | ||
|
|
||
| let mut result = Vec::with_capacity(header_mod_size as usize); |
There was a problem hiding this comment.
This can be possibly dangerous as we're reading a u64 value from a possibly malformed set of delta instructions which could lead to some inputs causing an instant OOM. We probably want to try and cap the amount we're willing to allocate up front in order to avoid small adversarial inputs from crashing the process.
There was a problem hiding this comment.
Some naive analyze against rust-lang/rust main branch on disk (not diff)
- 581 binary files, largest is 2.2 MiB
- 83% are under 64 KiB
- 97% are under 256 KiB
- 1 exceeds 1 MiB
So I guess 64KiB prealloc cap is fine.
| fn decode_data(binary_data: &BinaryData<'_>) -> Result<Vec<u8>, BinaryPatchParseError> { | ||
| use std::io::Read; | ||
|
|
||
| let compressed = decode_base85_lines(binary_data.data)?; | ||
|
|
||
| let mut decoder = flate2::read::ZlibDecoder::new(&compressed[..]); | ||
| let mut decompressed = Vec::new(); | ||
| decoder | ||
| .read_to_end(&mut decompressed) | ||
| .map_err(|e| BinaryPatchParseErrorKind::DecompressionFailed(e.to_string()))?; | ||
|
|
||
| Ok(decompressed) | ||
| } |
There was a problem hiding this comment.
We don't seem to do any checks that the decompressed data matches the size that is present in BinaryData.size. Do we want to do an extra check for that to ensure the decompressed data is the expected size?
Awesome! hopefully getting all this in wasn't too painful. I have a few smaller things that I think would be nice to add or polish myself so hopefully i can find the time to do that before we are ready to do a release, although I won't hold up a release if i don't get around to those things. |
Feel free to dump your thoughts in an issue. I can help clean them up! Here is a rough list in my mind:
|
|
oops I guess my editor stripped those trailing space. I am afk but |
* Added types representing both literal and delta Git binary patches * Added a parser for the `GIT binary patch` format. This doesn't include the patch application (which will be added in later commits) The implementation is based on * Specification from <https://diffx.org/spec/binary-diffs.html> * Behavior observation of Git CLI
Returns the longest valid UTF-8 prefix of the input. For `str` this is the entire input; for `[u8]` it truncates at the first invalid byte. This will be used at call sites where generic `T: Text` input needs to be narrowed to `&str` for parsers that only handle ASCII data (e.g. binary patch base85 content).
The API was stabilized in 1.73. The lint was added in 1.93. This is required for a MSRV bump to 1.75
This is a preparation for binary diff application support. * Git binary patch is compressed by zlib hence flate2 * zlib-rs (which is the most performant zlib backend) requires MSRV 1.75.0+ hence the bump.
* Add base85 encoder/decoder and Git delta format decoder. * Wire them into `BinaryPatch::apply() and `apply_reverse()` for decoding zlib-compressed, base85-encoded binary payload. These are feature-gated behind the `binary` feature.
Now compat tests require `binary` Cargo feature.
Now replay tests require `binary` Cargo feature.
Assert current behavior where 0x00 control byte falls through to ADD which is a no-op. `git apply` rejects this as "unexpected delta opcode 0".
Declared literal size doesn't match decompressed data. `git apply` rejects this; diffy currently doesn't check.
After this, both git-apply and diffy reject.
A malformed delta header claiming a huge modified size could cause an instant OOM. Cap the pre-allocation at 64 KiB, and let the vec grow organically from actual instructions.
|
Ready for another round of review! Please review from d43d756. Old commits got no functional changes but doc and test snapshot tweaks. |
bmwill
left a comment
There was a problem hiding this comment.
lgtm with this really only being a nit, which i'll just fix before landing.
| /// When decoding data where the original byte count isn't a multiple of 4, | ||
| /// callers must handle truncation at a higher level. | ||
| /// For example, via a length indicator in Git binary patch. | ||
| pub fn decode_into(input: &str, output: &mut impl Extend<u8>) -> Result<(), Base85Error> { |
There was a problem hiding this comment.
I would probably have this take as input &[u8] instead of &str since we already have a valid alphabet any anything outside that is invalid, this will eliminate the perf cost of needing to validate a&[u8] into str before passing to this function.
There was a problem hiding this comment.
That is indeed better. Nice improvement!
| let (_, binary_input) = ps.input.split_at(abs_patch_start + binary_patch_start); | ||
| // Binary patch data is always ASCII (base85-encoded). | ||
| let binary_input = binary_input.as_str_prefix(); | ||
| let (binary_patch, consumed) = match parse_binary_patch(binary_input) { |
There was a problem hiding this comment.
As commented elsewhere, i would probably default to dropping to bytes here vs str to avoid the extra utf8 validation which isn't really needed since decoding a binary patch has a limited alphabet.
Fixes #43
Basically a port of weihanglo#33.
BTW, this is probably the very last major feature PR. I might have some other small ones for housekeeping and preparation of the new release.
Add parsing and application of git binary patches,
including both literal and delta formats.
This feature is behind the
binaryCargo feature.The implementation is based on
Some others highlights:
binaryfeature.