This repository contains code to fine-tune a Hugging Face LLM on the PPMI dataset to generate synthetic structured data.
Fine-tuning requires the PPMI data in JSONL format.
Specifically, the dataset file must contain one patient per line, each structured as a JSON object according to the schema defined in data/schema.json.
A sample dataset file is provided as data/data.jsonl.
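As a rough illustration of the expected layout, the sketch below reads a JSONL file line by line and checks that every line is a JSON object. It is not part of the repository: the helper name `load_jsonl` and the fields `patient_id` and `age` are hypothetical, and the real fields are whatever data/schema.json defines.

```python
import json
import os
import tempfile


def load_jsonl(path):
    """Read a JSONL file into a list of dicts, one per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc.msg})") from exc
            if not isinstance(record, dict):
                raise ValueError(f"{path}:{line_no}: expected a JSON object")
            records.append(record)
    return records


# Demo on a throwaway file; the field names are made up for illustration.
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    tmp.write('{"patient_id": "P001", "age": 61}\n')
    tmp.write('{"patient_id": "P002", "age": 70}\n')

patients = load_jsonl(tmp.name)
os.remove(tmp.name)
print(len(patients), "patients loaded")
```

Pointing `load_jsonl` at data/data.jsonl instead of the temporary file should give a quick sanity check before starting a fine-tuning run.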
First, create and activate a virtual environment and install this package in it by running:
pip install -e .
To fine-tune a model on the data in data/data.jsonl, run:
python scripts/finetune.py --config configs/finetune.yaml
To enable structured output generation for the LLM, a valid JSON schema must be defined for the data.
The schema provided in data/schema.json acts as a template: to obtain a valid schema, the following placeholders must be replaced
with the actual values from the dataset:
[
"__ENUM__",
"__MIN__",
"__MAX__",
"__EXCL_MIN__",
"__EXCL_MAX__",
"__MIN_LENGTH__",
"__MAX_LENGTH__",
]
Once a valid JSON schema is available, it is possible to convert it to the corresponding Pydantic model by running:
make data-model
This will create the correct Model class in data/model.py, which can then be used for the structured output generation and validation.
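The placeholder substitution described earlier could be automated along the following lines. This is only a sketch under several assumptions that the repository does not confirm: that the template is a flat object schema with a `properties` mapping, that each placeholder appears as the string value of a schema keyword (for example `"enum": "__ENUM__"`), and that the observed minimum and maximum are acceptable bounds. The function `fill_schema` and the fields `sex` and `age` are hypothetical. The exclusive bounds (`__EXCL_MIN__`, `__EXCL_MAX__`) are deliberately left untouched, since they must lie strictly outside the observed values and so require a domain decision.

```python
import copy
import json

# Assumed mapping from placeholder to a function of the observed values.
PLACEHOLDER_RULES = {
    "__ENUM__": lambda values: sorted(set(values)),
    "__MIN__": min,
    "__MAX__": max,
    "__MIN_LENGTH__": lambda values: min(len(v) for v in values),
    "__MAX_LENGTH__": lambda values: max(len(v) for v in values),
}


def fill_schema(template, records):
    """Return a copy of `template` with known placeholders replaced."""
    schema = copy.deepcopy(template)
    for name, spec in schema.get("properties", {}).items():
        # Collect the non-missing values observed for this field.
        values = [r[name] for r in records if r.get(name) is not None]
        if not values:
            continue  # nothing observed, leave placeholders in place
        for keyword, value in list(spec.items()):
            if isinstance(value, str) and value in PLACEHOLDER_RULES:
                spec[keyword] = PLACEHOLDER_RULES[value](values)
    return schema


# Hypothetical template and records, for illustration only.
template = {
    "type": "object",
    "properties": {
        "sex": {"type": "string", "enum": "__ENUM__"},
        "age": {"type": "integer", "minimum": "__MIN__", "maximum": "__MAX__"},
    },
}
records = [
    {"sex": "F", "age": 61},
    {"sex": "M", "age": 70},
    {"sex": "F", "age": 55},
]

filled = fill_schema(template, records)
print(json.dumps(filled, indent=2))
```

Deriving bounds from the training data means the generated values can never fall outside what was observed. If wider, clinically plausible ranges are preferred, the values can be set by hand instead.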
To generate a synthetic dataset using a fine-tuned model and the defined schema, run:
python scripts/generate.py --config configs/generate.yaml --model path/to/model