Encoding issue with non-English text #47

@omid-jf

Passing a non-English Unicode string to preprocessor.clean with the preprocessor.OPT.EMOJI option enabled returns garbled, meaningless characters.
This happens only in version 0.6.0.

The cause seems to be line 50 of preprocess.py.

To reproduce:
import preprocessor as p
p.set_options(p.OPT.URL, p.OPT.EMOJI, p.OPT.SMILEY)
print(p.clean("внесла предложение призвать всех избегать применять незаконные"))
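For comparison, the input above contains no URL, emoji, or smiley, so the expected output is the input string unchanged. The snippet below is only a rough sketch of that expected behavior using the standard library. `strip_emoji` and `EMOJI_PATTERN` are hypothetical names, the code point ranges are an assumption covering common emoji blocks only, and none of this is the library's actual implementation.

```python
import re

# Hypothetical stand-in for emoji removal; NOT the library's implementation.
# The ranges below are an assumption and cover common emoji blocks only.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flags)
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector-16
    "]+"
)


def strip_emoji(text: str) -> str:
    """Remove emoji code points and leave all other Unicode text untouched."""
    return EMOJI_PATTERN.sub("", text)


sample = "внесла предложение призвать всех избегать применять незаконные"

# Cyrillic text falls outside every range above, so it should pass through unchanged.
print(strip_emoji(sample) == sample)
```

Because the pattern operates on str code points and never round-trips through a byte encoding, non-Latin text is not altered. It could serve as a temporary workaround on 0.6.0 if only emoji removal is needed.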
