[BUG] Empty files created when invoking reshard_jsonl method at nemo_curator.utils.file_utils.py #43

@miguelusque

Describe the bug
Sometimes, when invoking the reshard_jsonl method in nemo_curator.utils.file_utils.py, some output files are created with size zero, which makes other scripts fail.

This is caused by the .to_textfiles method, which writes an empty file whenever one of the partitions in the dask bag is empty. That can happen for several reasons: for instance, when a jsonl file consists of a single line of around 1GB plus 5 lines of only a few megabytes each.
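
The behavior can be seen in isolation with a tiny bag. The following is a minimal sketch, not the code in reshard_jsonl: it assumes dask is installed, builds a three-partition bag, empties the middle partition with a filter, and writes the bag with to_textfiles. The function name, the sample records and the output directory are made up for illustration.

```python
import os
import tempfile

import dask
import dask.bag as db


def write_bag_with_empty_partition(output_dir):
    """Write a 3-partition bag whose middle partition is empty.

    Returns a dict mapping each output file name to its size in bytes.
    """
    # Three one-line records, one per partition (illustrative data).
    lines = ['{"id": 0}', '{"id": 1}', '{"id": 2}']
    bag = db.from_sequence(lines, npartitions=3)

    # Dropping the only record of the middle partition leaves it empty.
    bag = bag.filter(lambda line: line != '{"id": 1}')

    # to_textfiles writes one file per partition; "*" becomes the partition index.
    with dask.config.set(scheduler="synchronous"):
        bag.to_textfiles(os.path.join(output_dir, "*.jsonl"))

    return {
        name: os.path.getsize(os.path.join(output_dir, name))
        for name in sorted(os.listdir(output_dir))
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # The file for the emptied partition is expected to have size 0.
        print(write_bag_with_empty_partition(tmp))
```

The filter here only stands in for whatever leaves a partition empty in practice; with the skewed input described above, the cause would be how the lines get distributed across partitions.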

Steps/Code to reproduce bug

Create a jsonl file as described above and invoke reshard_jsonl on it; it will produce empty output files.
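
For anyone without access to such files, a skewed input of the same shape could be generated along these lines. This is only a sketch: the helper name, the "id" and "text" fields and the default sizes are assumptions, and the defaults are scaled down so the script runs quickly. To approach the case described above, big_chars would need to be around 1,000,000,000 and small_chars a few million. Whether a scaled-down file triggers the bug has not been verified here.

```python
import json
import os
import tempfile


def write_skewed_jsonl(path, big_chars=1_000_000, small_chars=1_000, n_small=5):
    """Write one very large JSON line followed by n_small much smaller ones."""
    with open(path, "w", encoding="utf-8") as f:
        # A single oversized record (hypothetical schema: "id" and "text").
        f.write(json.dumps({"id": 0, "text": "x" * big_chars}) + "\n")
        # A handful of comparatively tiny records.
        for i in range(1, n_small + 1):
            f.write(json.dumps({"id": i, "text": "y" * small_chars}) + "\n")
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out = write_skewed_jsonl(os.path.join(tmp, "skewed.jsonl"))
        print(os.path.getsize(out), "bytes written")
```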

I have some files like that in one of our clusters. Just ping me if you want to have access to them.

Expected behavior

No empty files created.
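
Until a fix lands, one possible workaround is to delete zero-byte files from the output directory after resharding. The sketch below uses only the standard library; the function name and the ".jsonl" suffix filter are assumptions, and this is not necessarily the approach the fix will take.

```python
import os
import tempfile


def remove_empty_files(output_dir, suffix=".jsonl"):
    """Delete zero-byte files ending in `suffix`; return the removed names."""
    removed = []
    for name in sorted(os.listdir(output_dir)):
        path = os.path.join(output_dir, name)
        # Only touch regular files with the expected suffix and size zero.
        if name.endswith(suffix) and os.path.isfile(path) and os.path.getsize(path) == 0:
            os.remove(path)
            removed.append(name)
    return removed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Simulate one empty and one non-empty shard.
        open(os.path.join(tmp, "0.jsonl"), "w").close()
        with open(os.path.join(tmp, "1.jsonl"), "w") as f:
            f.write('{"id": 1}\n')
        print(remove_empty_files(tmp))
```

Note that removing files this way leaves gaps in the shard numbering, which may matter to scripts that expect consecutive file names.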

P.S.: I will submit an MR today to fix this issue.
