Description
I've found that with the break: :line option, Unicode.String.split/2 hangs if the string argument ends with any quotation mark character. All of the following calls hang:
Unicode.String.split(~s(He said, "A cup of hot tea?"), locale: :en, break: :line) # Quotation mark (aka typewriter quote)
Unicode.String.split(~s(He said, “A cup of hot tea?”), locale: :en, break: :line) # Right double quotation mark
Unicode.String.split(~s(«Voilà le problème!»), locale: :fr, break: :line) # Right-pointing double angle quotation mark
I would expect Unicode.String.split(~s(He said, "A cup of hot tea?"), locale: :en, break: :line) to return ["He ", "said, ", "\"A ", "cup ", "of ", "hot ", "tea?\""], and this is indeed how the online Unicode segmentation utility splits the same text. The same text is not problematic if break: :word is used. My tests were run with version 1.3 of the library on Elixir 1.14.
I poked around the code and the Unicode line segmentation algorithm a bit, but I must admit I find it all rather daunting. In the line segmentation test data, there doesn't seem to be anything that would prevent a break after a quotation mark.
As a workaround, for now I am padding the end of the string with a single space. This allows the function to return, which is perfectly fine for my use case.
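For anyone hitting the same hang, here is a minimal sketch of that padding workaround. The module name LineBreakWorkaround and its helpers are hypothetical (they are not part of the library), and the cleanup step assumes the padding space ends up at the end of the last returned segment, which I have not verified for every input:

```elixir
defmodule LineBreakWorkaround do
  # Hypothetical helper, not part of unicode_string.
  # Pads the input with one trailing space so that
  # Unicode.String.split/2 returns when the string ends with a
  # quotation mark, then strips that padding from the result.
  def split_lines(string, opts \\ []) do
    opts = Keyword.put(opts, :break, :line)

    (string <> " ")
    |> Unicode.String.split(opts)
    |> drop_padding()
  end

  # Assumption: the padding space is the final character of the
  # last segment. If the last segment is only the padding, drop it.
  defp drop_padding([]), do: []

  defp drop_padding(segments) do
    {last, rest} = List.pop_at(segments, -1)

    case String.replace_suffix(last, " ", "") do
      "" -> rest
      trimmed -> rest ++ [trimmed]
    end
  end
end

# Usage, with the first failing example from this report:
# LineBreakWorkaround.split_lines(~s(He said, "A cup of hot tea?"), locale: :en)
```

If the trailing space is harmless for your use case (as it is for mine), the drop_padding step can simply be left out.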
Let me take this opportunity to thank you @kipcole9 for this incredible library and all your contributions to the Elixir ecosystem.