Use \r\n in data/whirlwind.warc #13

@sebastian-nagel

The WARC file data/whirlwind.warc uses '\n' instead of '\r\n' to terminate lines in the WARC and HTTP headers. This can cause parsing of the WARC file to fail:

$> java -jar jwarc.jar ls data/whirlwind.warc
Exception in thread "main" java.io.UncheckedIOException: org.netpreserve.jwarc.ParsingException: invalid WARC record at position 8: WARC/1.0<-- HERE -->\nWARC-Type: warcinfo\nWARC-Date: 2024-05-... (offset 0 in whirlwind.warc)
        at org.netpreserve.jwarc.WarcReader$1.hasNext(WarcReader.java:357)
        at org.netpreserve.jwarc.tools.ListTool.main(ListTool.java:12)
        at org.netpreserve.jwarc.tools.WarcTool.main(WarcTool.java:40)
Caused by: org.netpreserve.jwarc.ParsingException: invalid WARC record at position 8: WARC/1.0<-- HERE -->\nWARC-Type: warcinfo\nWARC-Date: 2024-05-... (offset 0 in whirlwind.warc)
        at org.netpreserve.jwarc.WarcParser.parse(WarcParser.java:356)
        at org.netpreserve.jwarc.WarcReader.next(WarcReader.java:181)
        at org.netpreserve.jwarc.WarcReader$1.hasNext(WarcReader.java:355)
        ... 2 more

Same for:

  • data/whirlwind.warc.wat
  • data/whirlwind.warc.wet
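
To check a file without jwarc, here is a minimal Python sketch. It assumes the file is uncompressed and only inspects the first line (the `WARC/1.0` version line, which is where the exception above is raised). The helper names `first_line_uses_crlf` and `check_file` are made up for illustration and are not part of any library.

```python
def first_line_uses_crlf(data: bytes) -> bool:
    """Return True if the first line in `data` is terminated by CRLF.

    Returns False when the line ends with a bare LF or when no line
    terminator is found at all.
    """
    newline = data.find(b"\n")
    if newline <= 0:
        # No LF found (-1), or the LF is the very first byte (0).
        return False
    # The byte right before the LF must be a carriage return.
    return data[newline - 1:newline] == b"\r"


def check_file(path: str, probe_size: int = 4096) -> bool:
    """Read the start of an uncompressed WARC/WAT/WET file and check it."""
    with open(path, "rb") as f:
        return first_line_uses_crlf(f.read(probe_size))
```

If the files are as described in this issue, `check_file("data/whirlwind.warc")` should return `False`. Note that a fix cannot simply replace every '\n' in the file with '\r\n': that would also alter record payloads and invalidate the `Content-Length` values, so regenerating the files with a writer that emits '\r\n' is likely the safer route.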
