String.split hangs when string ends with quotation mark (QU class) #5

@GregLMcDonald

I've found that with the break: :line option, Unicode.String.split/2 hangs if the string argument ends with any quotation mark character. All of the following hang:

Unicode.String.split(~s(He said, "A cup of hot tea?"), locale: :en, break: :line) # Quotation mark (aka typewriter quote)
Unicode.String.split(~s(He said, “A cup of hot tea?”), locale: :en, break: :line) # Right double quotation mark
Unicode.String.split(~s(«Voilà le problème!»), locale: :fr, break: :line) # Right-pointing double angle quotation mark
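To try the calls above without freezing an IEx session, one option is to run each call in a task and give up after a timeout. This is only a sketch: `HangProbe` and `run/2` are hypothetical names, not part of unicode_string, and the 2-second default is an arbitrary choice.

```elixir
defmodule HangProbe do
  # Hypothetical helper, not part of unicode_string.
  # Runs `fun` in a separate process and stops waiting after `timeout_ms`.
  def run(fun, timeout_ms \\ 2_000) when is_function(fun, 0) do
    task = Task.async(fun)

    # Task.yield/2 returns nil on timeout; in that case kill the task.
    case Task.yield(task, timeout_ms) || Task.shutdown(task, :brutal_kill) do
      {:ok, result} -> {:ok, result}
      {:exit, reason} -> {:exit, reason}
      nil -> :timeout
    end
  end
end

# Usage, assuming unicode_string is available as a dependency:
#
#   HangProbe.run(fn ->
#     Unicode.String.split(~s(He said, "A cup of hot tea?"), locale: :en, break: :line)
#   end)
```

If the call really loops forever, `run/2` should return `:timeout` instead of blocking the shell.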

I would expect Unicode.String.split(~s(He said, "A cup of hot tea?"), locale: :en, break: :line) to return ["He ", "said, ", "\"A ", "cup ", "of ", "hot ", "tea?\""], and this is indeed how the online Unicode segmentation utility splits the same text. The same text is not problematic if break: :word is used. My tests were run with version 1.3 of the library on Elixir 1.14.

I poked around the code a bit, and the Unicode line segmentation algorithm, but must admit I find it all rather daunting. In the line segmentation test data, there doesn't seem to be anything that would prevent a break after a quotation mark.

As a workaround, for now I am padding the end of the string with a single space. This allows the call to return, and for my use case it is perfectly fine.
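The padding workaround could be wrapped roughly as follows. This is a sketch, not the library's API: `LineBreakWorkaround`, `split_lines/3` and the `splitter` argument are hypothetical names, and the clean-up step assumes the padding space ends up at the end of the last segment, which I have not confirmed for every input.

```elixir
defmodule LineBreakWorkaround do
  # Hypothetical wrapper, not part of unicode_string.
  # `splitter` defaults to Unicode.String.split/2 and is only a parameter
  # so the wrapper can be exercised without the library.
  def split_lines(string, opts \\ [], splitter \\ &Unicode.String.split/2) do
    opts = Keyword.put_new(opts, :break, :line)

    # Pad with one space so the string no longer ends in a quotation mark.
    (string <> " ")
    |> splitter.(opts)
    |> drop_padding()
  end

  # Remove the single padding space from the last segment, if present.
  # Assumption: the padding lands at the end of the final segment.
  defp drop_padding([]), do: []

  defp drop_padding(segments) do
    {last, rest} = List.pop_at(segments, -1)

    case String.replace_suffix(last, " ", "") do
      "" -> rest
      trimmed -> rest ++ [trimmed]
    end
  end
end

# Usage, assuming unicode_string is available as a dependency:
#
#   LineBreakWorkaround.split_lines(~s(He said, "A cup of hot tea?"), locale: :en)
```

If the trailing space is harmless for your use case, as it is for mine, the `drop_padding/1` step can simply be left out.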

Let me take this opportunity to thank you @kipcole9 for this incredible library and all your contributions to the Elixir ecosystem.
