Switch DataLoader from HuggingFace to R2 #49

@KMFODA

Description

Once feature/diloco is merged, we will be using a DataLoader that seeds a random subset of the data from the current blockchain block and the miner's uid, and loads that subset through HuggingFace's dataset API.
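To make the seeding idea concrete, here is a minimal sketch of how a block number and a uid could be combined into a reproducible row selection. The helper names (`make_seed`, `select_rows`) and the hashing scheme are assumptions for illustration, not the code in feature/diloco.

```python
import hashlib
import random


def make_seed(block: int, uid: int) -> int:
    """Derive a 64-bit seed from the block number and miner uid (hypothetical scheme)."""
    digest = hashlib.sha256(f"{block}:{uid}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def select_rows(block: int, uid: int, total_rows: int, n_rows: int) -> list[int]:
    """Pick n_rows distinct row indices out of total_rows, reproducibly."""
    # A private Random instance avoids touching the global RNG state.
    rng = random.Random(make_seed(block, uid))
    return sorted(rng.sample(range(total_rows), n_rows))
```

Anyone who knows the block and the uid can recompute the same indices, which is what lets a validator check that a miner trained on the subset it was assigned.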

Given the ongoing issues with HuggingFace's dataset API, it is worth investing in a DataLoader that loads the data in the same format from an R2 instance where we will host the data ourselves.

Templar does this very cleanly here: https://github.com/tplr-ai/templar/blob/main/src/tplr/r2_dataset.py, which can be used as inspiration.
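As a rough starting point, reading a single parquet file from R2 could look like the sketch below. It relies on R2 exposing an S3-compatible endpoint and assumes boto3 and pyarrow are installed. The function names, bucket, key, and credentials are placeholders, and this is not a reproduction of Templar's r2_dataset.py.

```python
import io


def r2_endpoint(account_id: str) -> str:
    """S3-compatible endpoint URL for a Cloudflare account."""
    return f"https://{account_id}.r2.cloudflarestorage.com"


def read_parquet_from_r2(account_id: str, access_key: str, secret_key: str,
                         bucket: str, key: str):
    """Download one parquet object from R2 and return it as a pyarrow Table.

    bucket and key are placeholders; the real layout is still to be decided.
    """
    # Third-party imports are kept local so the module imports without them.
    import boto3
    import pyarrow.parquet as pq

    client = boto3.client(
        "s3",
        endpoint_url=r2_endpoint(account_id),
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        region_name="auto",
    )
    body = client.get_object(Bucket=bucket, Key=key)["Body"].read()
    return pq.read_table(io.BytesIO(body))
```

This pulls the whole object into memory, which is fine for exploration. A production loader would more likely fetch only the row groups it needs.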

The format of the parquet files stored within the dataset can be explored using the HF API: https://huggingface.co/docs/dataset-viewer/en/parquet#using-the-dataset-viewer-api
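For example, the dataset viewer's /parquet endpoint lists the parquet files behind a dataset. The sketch below uses only the standard library. The helper names are illustrative and the dataset name in the usage comment is a placeholder, since the issue does not name the dataset.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://datasets-server.huggingface.co"


def parquet_listing_url(dataset: str) -> str:
    """Build the dataset viewer URL that lists a dataset's parquet files."""
    return f"{API_ROOT}/parquet?{urllib.parse.urlencode({'dataset': dataset})}"


def list_parquet_files(dataset: str) -> list[dict]:
    """Return the 'parquet_files' entries (config, split, url, filename, size)."""
    with urllib.request.urlopen(parquet_listing_url(dataset)) as response:
        payload = json.load(response)
    return payload["parquet_files"]


# Usage (needs network access; "some-org/some-dataset" is a placeholder):
# for entry in list_parquet_files("some-org/some-dataset"):
#     print(entry["split"], entry["filename"], entry["size"])
```

The returned URLs point at the parquet shards themselves, so the same listing can drive a one-off copy of the data into R2.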

Metadata

Labels: help wanted (open to contributions from the community)