Describe the bug
Sometimes, when invoking the reshard_jsonl method in nemo_curator.utils.file_utils, some output files are created with size zero, which makes other scripts fail.
This is caused by the .to_textfiles method, which writes an empty file whenever one of the partitions in the dask bag is empty. That can happen for several reasons, for instance when a jsonl file consists of a single line of around 1GB plus 5 lines of only a few megabytes each.
Steps/Code to reproduce bug
Create a jsonl file as described above and run reshard_jsonl on it; some of the resulting files will have size zero.
I have some files like that in one of our clusters. Just ping me if you want to have access to them.
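For anyone without access to those files, a skewed input like the one described can probably be generated locally. The following is a minimal sketch, not taken from the original files: the helper name write_skewed_jsonl and the "text" field are hypothetical, and the sizes in the example run are scaled down so it finishes quickly.

```python
import json
import os
import tempfile


def write_skewed_jsonl(path, big_bytes, small_bytes, n_small=5):
    """Write one very large JSON line followed by n_small much smaller ones."""
    with open(path, "w", encoding="utf-8") as f:
        # The "text" field name is an assumption, not taken from the report.
        f.write(json.dumps({"text": "a" * big_bytes}) + "\n")
        for _ in range(n_small):
            f.write(json.dumps({"text": "b" * small_bytes}) + "\n")
    return path


if __name__ == "__main__":
    # Scaled-down sizes; the report describes roughly 1GB for the large
    # line and a few megabytes for each of the small ones.
    out = os.path.join(tempfile.mkdtemp(), "skewed.jsonl")
    write_skewed_jsonl(out, big_bytes=1_000_000, small_bytes=5_000)
    print(out, os.path.getsize(out))
```

Raising big_bytes and small_bytes toward the sizes mentioned above should give a file with the same shape as the ones that trigger the empty outputs, although whether a given partition ends up empty also depends on the resharding parameters.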
Expected behavior
No empty files created.
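Until a fix is available, one possible workaround is to delete zero-size files from the output directory after resharding. This is only a sketch of a cleanup step, not the fix itself: remove_empty_files is a hypothetical helper, and the "*.jsonl" pattern assumes the output files use that extension.

```python
from pathlib import Path


def remove_empty_files(output_dir, pattern="*.jsonl"):
    """Delete zero-size files matching pattern; return the removed names."""
    removed = []
    for p in sorted(Path(output_dir).glob(pattern)):
        # Only regular files whose size is exactly zero bytes are removed.
        if p.is_file() and p.stat().st_size == 0:
            p.unlink()
            removed.append(p.name)
    return removed
```

Note that deleting files this way leaves gaps in the output file numbering, so scripts that expect consecutive indices would still need adjusting.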
P.S.: I will submit an MR today to fix this issue.