Switch DataLoader from HuggingFace to R2 #49

Once [feature/diloco](https://github.com/KMFODA/DistributedTraining/pull/47) is merged, our DataLoader will use the current blockchain block and the miner's uid to seed the selection of a random subset of the data, which is then loaded through HuggingFace's DataLoader.
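To make the seeding idea concrete, here is a minimal sketch of deriving a reproducible subset from a block number and a uid. It is not the code in the PR: `select_offsets`, `rows_per_page`, and the hashing scheme are all assumptions made for illustration.

```python
import hashlib
import random


def select_offsets(block: int, uid: int, n_pages: int, total_rows: int,
                   rows_per_page: int = 100) -> list[int]:
    """Pick n_pages row offsets deterministically from (block, uid).

    Hypothetical helper: the real loader may derive its seed differently.
    """
    max_pages = total_rows // rows_per_page
    if n_pages > max_pages:
        raise ValueError("requested more pages than the dataset contains")

    # Hash block and uid into a stable 64-bit seed. hashlib is used rather
    # than the built-in hash() so the seed does not vary between processes.
    digest = hashlib.sha256(f"{block}:{uid}".encode()).digest()
    seed = int.from_bytes(digest[:8], "big")

    # A private Random instance leaves the global RNG state untouched.
    rng = random.Random(seed)
    pages = rng.sample(range(max_pages), n_pages)
    return [page * rows_per_page for page in pages]
```

Because the offsets depend only on `(block, uid)`, anyone who knows both values can recompute which rows a given miner should have loaded.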

Given the ongoing issues with HuggingFace's dataset API, it is worth investing in a DataLoader that loads the data in the same format from an R2 instance where we host the data ourselves.

Templar does this in a very clean way in [r2_dataset.py](https://github.com/tplr-ai/templar/blob/main/src/tplr/r2_dataset.py), which can be used as inspiration.
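As a rough illustration of what reading from R2 could look like (this is not Templar's implementation), the sketch below reads a single row group from a parquet shard through R2's S3-compatible endpoint. It assumes `s3fs` and `pyarrow` are installed; the function names, bucket, key, and credential parameters are all placeholders.

```python
def r2_endpoint(account_id: str) -> str:
    """Build the S3-compatible endpoint URL for a Cloudflare account."""
    return f"https://{account_id}.r2.cloudflarestorage.com"


def read_row_group(bucket: str, key: str, row_group: int, *,
                   account_id: str, access_key: str, secret_key: str):
    """Read one row group of a parquet file stored in R2 as a pyarrow Table.

    Hypothetical helper; bucket and key names are placeholders.
    """
    # Imported here so the module can be loaded without the optional
    # third-party dependencies installed.
    import pyarrow.parquet as pq
    import s3fs

    fs = s3fs.S3FileSystem(
        key=access_key,
        secret=secret_key,
        client_kwargs={"endpoint_url": r2_endpoint(account_id)},
    )
    with fs.open(f"{bucket}/{key}", "rb") as f:
        parquet_file = pq.ParquetFile(f)
        if not 0 <= row_group < parquet_file.num_row_groups:
            raise IndexError("row group out of range")
        # Only the requested row group is read, not the whole shard.
        return parquet_file.read_row_group(row_group)
```

Reading by row group keeps each fetch small, which suits a loader that samples scattered pages rather than streaming whole files.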

The format of the parquet files stored within the dataset can be explored using the HF API: https://huggingface.co/docs/dataset-viewer/en/parquet#using-the-dataset-viewer-api
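For reference, a small sketch of querying the dataset viewer's `/parquet` endpoint to list a dataset's parquet files by config and split. The helper names are made up for this example, and the dataset name passed in is up to the caller.

```python
import json
import urllib.parse
import urllib.request
from collections import defaultdict

PARQUET_ENDPOINT = "https://datasets-server.huggingface.co/parquet"


def parquet_listing_url(dataset: str) -> str:
    """Build the dataset viewer URL that lists a dataset's parquet files."""
    query = urllib.parse.urlencode({"dataset": dataset})
    return f"{PARQUET_ENDPOINT}?{query}"


def fetch_parquet_listing(dataset: str, timeout: float = 30.0) -> dict:
    """Fetch the listing as a dict (performs a network request)."""
    with urllib.request.urlopen(parquet_listing_url(dataset), timeout=timeout) as resp:
        return json.load(resp)


def group_by_config_and_split(listing: dict) -> dict:
    """Map (config, split) to the parquet filenames in the listing."""
    grouped = defaultdict(list)
    for entry in listing.get("parquet_files", []):
        grouped[(entry["config"], entry["split"])].append(entry["filename"])
    return dict(grouped)
```

Each entry in `parquet_files` also carries a `url` and `size`, which is useful for mirroring the same shards into R2 without changing their layout.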



