Clean pdffonts output to avoid invalid UTF-8 characters by tbk303 · Pull Request #134 · documentcloud/docsplit

tbk303 · 2015-09-22T09:20:55Z

I came across some weird PDF files for which pdffonts outputs invalid UTF-8 characters. This results in an "invalid UTF-8 ..." exception when the output is matched against NO_TEXT_DETECTED.

If Ruby 1.9/2.0 compatibility is required, I can also extend this pull request with a scrub polyfill.
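As a rough illustration of the approach described above, here is a minimal sketch of cleaning command output before matching it against a regular expression. The helper name `clean_utf8`, the stand-in pattern, and the sample string are assumptions made for this example; they are not taken from the pull request or from docsplit. The fallback branch shows one possible polyfill for Rubies without `String#scrub`.

```ruby
# Hypothetical helper for illustration; not the code from this pull request.
def clean_utf8(output)
  # Work on a copy tagged as UTF-8 so the caller's string is left untouched.
  text = output.dup.force_encoding(Encoding::UTF_8)
  return text.scrub('') if text.respond_to?(:scrub)

  # Fallback for Ruby 1.9/2.0, where String#scrub does not exist:
  # a round trip through UTF-16 drops the invalid bytes.
  text.encode(Encoding::UTF_16BE, invalid: :replace, undef: :replace, replace: '')
      .encode(Encoding::UTF_8)
end

# Stand-in pattern; the real NO_TEXT_DETECTED constant in docsplit may differ.
no_text_pattern = /---\n\z/

# Simulated pdffonts output containing a stray invalid byte (assumed sample).
raw = "name type emb\n\xFF---\n"
cleaned = clean_utf8(raw)

puts cleaned.valid_encoding?
puts !(cleaned =~ no_text_pattern).nil?
```

Dropping the invalid bytes (rather than replacing them with a marker character) keeps the cleaned output as close as possible to what a well-formed PDF would have produced, which matters if the result is only used for pattern matching.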

Clean pdffonts output to avoid invalid UTF-8 characters

e7a53d9

tbk303 mentioned this pull request Jan 5, 2016

rails invalid byte sequence in UTF-8 #135

Open

tbk303 and others added 7 commits February 6, 2019 13:52

Add timeout to OCR

bc41d6e

Raise upon failed image conversion

a6d5f4e

Update MAGICK_TMPDIR

1c2ff70

Update image_extractor.rb

fba35bc

Ruby3 compatability

e8e8941

Use qpdf instead of pdftk

3936296

Merge pull request #1 from fortytools/qpdf

4a7749f

Use qpdf instead of pdftk

tbk303 wants to merge 8 commits into documentcloud:master from fortytools:master
