PPMI LLM

Purpose

This repository contains code to fine-tune a Hugging Face LLM on the PPMI dataset to generate synthetic structured data.

Data

Fine-tuning requires the PPMI data in JSONL format.

Specifically, the dataset file must contain one patient per line, each structured as a JSON object that follows the schema defined in data/schema.json.

A sample dataset file is provided as data/data.jsonl.
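As a quick sanity check of the one-patient-per-line format, the sketch below reads a JSONL file and confirms that every non-empty line parses to a JSON object. It is not part of this repository: the function name load_patients is hypothetical, and the check covers only the line format, not conformance to data/schema.json.

```python
import json
from pathlib import Path


def load_patients(path):
    """Read a JSONL file and return one dict per non-empty line.

    Hypothetical helper for illustration; not provided by this package.
    """
    patients = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines, e.g. a trailing newline
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(
                    f"{path}:{line_number}: invalid JSON ({exc.msg})"
                ) from exc
            if not isinstance(record, dict):
                # Each line must hold a single patient as a JSON object.
                raise ValueError(f"{path}:{line_number}: expected a JSON object")
            patients.append(record)
    return patients
```

For example, load_patients("data/data.jsonl") should return a list with one dictionary per patient in the sample file, or raise a ValueError naming the first offending line.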

Usage

Setup

First, create and activate a virtual environment, then install this package into it in editable mode by running:

pip install -e .

Fine-tuning

To fine-tune a model on the data in data/data.jsonl, run:

python scripts/finetune.py --config configs/finetune.yaml

Generation

Structured output generation requires a valid JSON schema for the data.

The schema provided in data/schema.json is a template. To obtain a valid schema, replace the following placeholders with the actual values from the dataset:

[
    "__ENUM__",
    "__MIN__",
    "__MAX__",
    "__EXCL_MIN__",
    "__EXCL_MAX__",
    "__MIN_LENGTH__",
    "__MAX_LENGTH__",
]
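Before generating the Pydantic model, it can help to confirm that no placeholder is left in the schema. The sketch below walks a loaded schema and reports where each remaining placeholder sits. It is not part of this repository: the function name find_placeholders is hypothetical, and it assumes that placeholders appear as whole string values (for example "minimum": "__MIN__"), which may not match how data/schema.json actually embeds them.

```python
# Placeholder names taken from the list above.
PLACEHOLDERS = {
    "__ENUM__",
    "__MIN__",
    "__MAX__",
    "__EXCL_MIN__",
    "__EXCL_MAX__",
    "__MIN_LENGTH__",
    "__MAX_LENGTH__",
}


def find_placeholders(node, path="$"):
    """Return (path, placeholder) pairs for every placeholder still in a schema.

    Hypothetical helper for illustration. Assumes a placeholder is always
    a complete string value, never a dictionary key or part of a longer string.
    """
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            found.extend(find_placeholders(value, f"{path}.{key}"))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            found.extend(find_placeholders(value, f"{path}[{index}]"))
    elif isinstance(node, str) and node in PLACEHOLDERS:
        found.append((path, node))
    return found
```

Loading data/schema.json with json.load and passing the result to find_placeholders should return an empty list once every placeholder has been replaced.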

Once a valid JSON schema is available, convert it to the corresponding Pydantic model by running:

make data-model

This creates the Model class in data/model.py, which can then be used for structured output generation and validation.

To generate a synthetic dataset using a fine-tuned model and the defined schema, run:

python scripts/generate.py --config configs/generate.yaml --model path/to/model
