aindo-com/ppmi-llm
PPMI LLM

Purpose

This repository contains code to fine-tune a Hugging Face LLM on the PPMI dataset to generate synthetic structured data.

Data

The code in this repository requires the PPMI data in JSONL format for the fine-tuning process.

Specifically, the dataset file must contain one patient per line, each structured as a JSON object according to the schema defined in data/schema.json.

A sample dataset file is provided as data/data.jsonl.
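As a minimal sketch of the expected format, a JSONL file can be read one patient per line with the standard library. The field names below (patient_id, age, diagnosis) are hypothetical placeholders; the real fields are defined in data/schema.json.

```python
import json

# Each line of a JSONL file is one complete JSON object (here: one patient).
# These records are invented for illustration only.
sample_jsonl = "\n".join([
    '{"patient_id": "PT-001", "age": 63, "diagnosis": "PD"}',
    '{"patient_id": "PT-002", "age": 58, "diagnosis": "Control"}',
])

# Parse each non-empty line into a dict, as a loader for data/data.jsonl might.
patients = [json.loads(line) for line in sample_jsonl.splitlines() if line.strip()]
print(len(patients))  # one record per line
```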

Usage

Setup

First, create and activate a virtual environment and install this package in it by running:

pip install -e .

Fine-tuning

To fine-tune a model on the data in data/data.jsonl, run:

python scripts/finetune.py --config configs/finetune.yaml

Generation

To enable structured output generation with the LLM, a valid JSON schema for the data must be defined.

The schema provided in data/schema.json acts as a template; the following placeholders must be replaced with the actual dataset values to obtain a valid schema:

[
    "__ENUM__",
    "__MIN__",
    "__MAX__",
    "__EXCL_MIN__",
    "__EXCL_MAX__",
    "__MIN_LENGTH__",
    "__MAX_LENGTH__",
]
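One way to perform this substitution is to walk the schema template and swap each placeholder string for a concrete value. This is a sketch, not the repository's actual mechanism; the replacement values below are invented examples.

```python
import json

# Hypothetical replacement values; in practice these would be computed
# from the real dataset (observed enum values, numeric ranges, etc.).
replacements = {
    "__ENUM__": ["PD", "Control", "Prodromal"],
    "__MIN__": 0,
    "__MAX__": 120,
    "__EXCL_MIN__": 0.0,
    "__EXCL_MAX__": 1.0,
    "__MIN_LENGTH__": 1,
    "__MAX_LENGTH__": 64,
}

def fill_placeholders(node):
    """Recursively replace placeholder strings with concrete values."""
    if isinstance(node, dict):
        return {k: fill_placeholders(v) for k, v in node.items()}
    if isinstance(node, list):
        return [fill_placeholders(v) for v in node]
    if isinstance(node, str) and node in replacements:
        return replacements[node]
    return node

# A toy fragment of a schema template, not the real data/schema.json.
template = {"type": "integer", "minimum": "__MIN__", "maximum": "__MAX__"}
schema = fill_placeholders(template)
print(json.dumps(schema))
```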

Once a valid JSON schema is available, it is possible to convert it to the corresponding Pydantic model by running:

make data-model

This creates the Model class in data/model.py, which can then be used for structured output generation and validation.
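For illustration, validation with such a Pydantic model might look like the sketch below. The Model class and its fields here are stand-ins, not the class the Makefile actually generates; this assumes Pydantic v2.

```python
from pydantic import BaseModel, ValidationError

# Hypothetical stand-in for the class `make data-model` emits in data/model.py.
class Model(BaseModel):
    patient_id: str
    age: int

# A conforming record validates and is parsed into a typed object.
record = Model.model_validate({"patient_id": "PT-001", "age": 63})

# A non-conforming record raises ValidationError.
try:
    Model.model_validate({"patient_id": "PT-002", "age": "not a number"})
except ValidationError:
    print("invalid record rejected")
```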

To generate a synthetic dataset using a fine-tuned model and the defined schema, run:

python scripts/generate.py --config configs/generate.yaml --model path/to/model
