Add support for decoding CESU-8 encoded strings. #17
Merged
tcalmant merged 1 commit into tcalmant:master on Jun 29, 2017
Owner
Thanks for your contribution!
Contributor
Note that the CESU-8/Java-UTF-8 decoder in ftfy.bad_codecs does not enforce correctness, and is documented as being explicitly intended not to do so. Here's an example of a byte sequence that is invalid CESU-8 and is rejected by Java, but is accepted by ftfy's decoder:
So be careful not to rely on the codec to make accept/reject decisions about the validity of serialized objects ...
This works around Java's broken UTF-8 implementation.
You will need the ftfy module (https://github.com/LuminosoInsight/python-ftfy) for the patch to have an effect.
The following code will now output 😃 (`\U0001f603`) instead of raising a `UnicodeDecodeError` or outputting `??????`.

The problem with the byte sequence `ED A0 BD ED B8 83` is that, read as UTF-8, it decodes to `d83d de03`. These are surrogate code points, which are not valid in UTF-8 on their own, but together they form a valid UTF-16 surrogate pair. So you have to decode twice, first as UTF-8 and then as UTF-16, to end up with the Unicode character U+1F603.
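To make the two-step decode concrete, here is a minimal standard-library sketch. It is not the code from this patch (which relies on ftfy), and the helper names `decode_surrogate_pairs` and `combine_surrogates` are hypothetical, introduced only for illustration. It assumes Python 3, where the `surrogatepass` error handler lets the UTF-8 decoder pass surrogate code points through.

```python
def decode_surrogate_pairs(raw: bytes) -> str:
    """Hypothetical helper: decode CESU-8 style bytes in two steps."""
    # Step 1: decode as UTF-8, letting surrogate code points through
    # instead of raising UnicodeDecodeError.
    with_surrogates = raw.decode("utf-8", errors="surrogatepass")
    # Step 2: write the surrogates out as UTF-16 code units, then decode
    # that as UTF-16 so each high/low pair becomes one character.
    utf16 = with_surrogates.encode("utf-16-le", errors="surrogatepass")
    return utf16.decode("utf-16-le")


def combine_surrogates(high: int, low: int) -> int:
    """Hypothetical helper: the arithmetic behind a UTF-16 surrogate pair."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


raw = bytes.fromhex("ED A0 BD ED B8 83")
text = decode_surrogate_pairs(raw)
print(hex(ord(text)))                           # 0x1f603
print(hex(combine_surrogates(0xD83D, 0xDE03)))  # 0x1f603
```

This sketch is strict in one respect: an unpaired surrogate makes the final UTF-16 decode raise `UnicodeDecodeError`. It also does not handle other Java-specific quirks, such as the two-byte encoding of NUL.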