Once feature/diloco is merged, our DataLoader will use the current blockchain block and the miner's uid to seed a random subset of the data, which it loads through HuggingFace.
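To make the seeding idea concrete, here is a minimal sketch of how a block number and a miner uid could deterministically pick a subset of the data. The function names (`derive_seed`, `select_subset`) and the hash-based seed derivation are assumptions for illustration only; the actual logic in feature/diloco may differ.

```python
import hashlib
import random
from typing import List


def derive_seed(block: int, uid: int) -> int:
    """Hypothetical helper: fold (block, uid) into a stable 64-bit seed.

    The ':' separator keeps pairs such as (1, 23) and (12, 3) distinct.
    """
    digest = hashlib.sha256(f"{block}:{uid}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def select_subset(num_items: int, subset_size: int, block: int, uid: int) -> List[int]:
    """Pick `subset_size` distinct indices out of `num_items`.

    The same (block, uid) pair always yields the same indices, so anyone
    who knows the block and the uid can recompute what a miner was assigned.
    """
    rng = random.Random(derive_seed(block, uid))
    return sorted(rng.sample(range(num_items), subset_size))
```

The indices could refer to rows, pages, or whole shards, depending on how the dataset is split up. Using a private `random.Random` instance rather than the global generator keeps the selection independent of any other randomness in the training loop.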
Given the ongoing issues with HuggingFace's dataset API, it is worth investing in a DataLoader that loads the data in the same format from an R2 instance where we host the data ourselves.
Templar does this in a very clean way here: https://github.com/tplr-ai/templar/blob/main/src/tplr/r2_dataset.py, which can be used as inspiration.
The format of the Parquet files stored within the dataset can be explored using the HF dataset viewer API: https://huggingface.co/docs/dataset-viewer/en/parquet#using-the-dataset-viewer-api
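As a starting point for that exploration, the sketch below queries the dataset viewer's `/parquet` endpoint and groups the returned file URLs by config and split. The helper names are hypothetical, the dataset name is a placeholder to fill in, and the response fields used (`parquet_files`, `config`, `split`, `url`) follow the linked documentation and should be checked against a live response.

```python
import json
import urllib.parse
import urllib.request
from collections import defaultdict
from typing import Dict, List, Tuple

PARQUET_ENDPOINT = "https://datasets-server.huggingface.co/parquet"


def parquet_listing_url(dataset: str) -> str:
    """Build the dataset viewer URL that lists a dataset's Parquet files."""
    return f"{PARQUET_ENDPOINT}?{urllib.parse.urlencode({'dataset': dataset})}"


def fetch_parquet_listing(dataset: str, timeout: float = 30.0) -> dict:
    """Fetch the listing as parsed JSON (performs a network request)."""
    with urllib.request.urlopen(parquet_listing_url(dataset), timeout=timeout) as resp:
        return json.load(resp)


def group_shards(listing: dict) -> Dict[Tuple[str, str], List[str]]:
    """Map (config, split) to the list of Parquet file URLs for that split."""
    shards: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for entry in listing.get("parquet_files", []):
        shards[(entry["config"], entry["split"])].append(entry["url"])
    return dict(shards)
```

Calling `group_shards(fetch_parquet_listing("<org>/<dataset>"))` gives the shard layout that an R2 copy would need to mirror for the new DataLoader to read the data in the same format.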