Python framework for Bayesian topic modeling: LDA, HDP, CTM with evaluation & visualization

Nonparametric Bayesian Topic Modeling

This project restructures course material from the University of Chicago's STAT 37400 (Nonparametric Inference) into a modern, modular Python architecture.

A research framework for comparing topic modeling algorithms, including LDA variants and nonparametric Bayesian models.

Features

  • Multiple Algorithms: Gibbs Sampling LDA, Variational LDA, HDP, CTM
  • Unified Interface: Common API for all topic models
  • Evaluation Metrics: Topic coherence (UMass, C_V, NPMI), perplexity, diversity
  • Visualization: Word clouds, topic-word distributions, convergence plots
  • CLI Support: Easy experiment management from command line
  • Experiment Tracking: Optional WandB/MLflow integration
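
To make the UMass coherence listed under Evaluation Metrics concrete, here is a minimal sketch. This is an illustration of the standard UMass formula, not the package's implementation; the function name `umass_coherence` is hypothetical.

```python
# Illustrative UMass coherence: score ordered pairs of a topic's top
# words by their document co-occurrence counts.
import math

def umass_coherence(top_words, documents, eps=1.0):
    """C_UMass = sum over ordered pairs log((D(w_i, w_j) + eps) / D(w_j)),
    where D(.) counts documents containing the given word(s)."""
    doc_sets = [set(doc) for doc in documents]

    def d(*words):
        # number of documents containing all of `words`
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            score += math.log((d(top_words[i], top_words[j]) + eps)
                              / d(top_words[j]))
    return score
```

Higher (less negative) scores indicate that a topic's top words tend to appear in the same documents.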

Installation

# Create virtual environment with uv
uv venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

# Install with development dependencies
uv pip install -e ".[dev]"

# Or with all optional dependencies
uv pip install -e ".[all]"

Quick Start

Python API

from nbtm.models import create_model
from nbtm.data import load_corpus

# Load data
documents = load_corpus("data/raw/corpus.txt")

# Create and train model
model = create_model("lda_gibbs", num_topics=10)
model.fit(documents, num_iterations=1000)

# Get results
topics = model.get_all_topic_words(top_n=10)
for i, topic in enumerate(topics):
    print(f"Topic {i}: {[word for word, _ in topic]}")

Command Line

# Train a model
nbtm train --config configs/default.yaml --num-topics 10

# Evaluate model
nbtm evaluate --model-path outputs/model.pkl --metrics all

# Generate visualizations
nbtm visualize --model-path outputs/model.pkl --type wordcloud

# List available models
nbtm list-models

Supported Algorithms

Algorithm   Description                       Key Features
Gibbs LDA   Collapsed Gibbs sampling          Simple, interpretable
LDA-VI      Variational inference             Fast, scalable
HDP         Hierarchical Dirichlet Process    Automatic topic count
CTM         Correlated Topic Model            Topic correlations
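
To make the Gibbs LDA row concrete, here is a minimal, self-contained collapsed Gibbs sampler. This is an illustrative sketch under the standard LDA model, not the package's `lda_gibbs` implementation; the function signature is an assumption.

```python
# Minimal collapsed Gibbs sampling for LDA. Documents are lists of
# integer word ids in [0, vocab_size).
import numpy as np

def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              num_iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), num_topics))   # doc-topic counts
    nkw = np.zeros((num_topics, vocab_size))  # topic-word counts
    nk = np.zeros(num_topics)                 # topic totals
    z = []                                    # topic assignment per token
    for d, doc in enumerate(docs):
        zd = rng.integers(num_topics, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(num_iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # remove this token from the counts, then resample from the
                # collapsed conditional:
                #   p(z=k) ∝ (n_dk + α) (n_kw + β) / (n_k + V β)
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + vocab_size * beta)
                k = rng.choice(num_topics, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    # posterior-mean topic-word distributions (each row sums to 1)
    phi = (nkw + beta) / (nk[:, None] + vocab_size * beta)
    return phi
```

HDP extends this by also sampling the number of topics, which is why no `num_topics` argument is needed for that model.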

Project Structure

nbtm/
├── src/nbtm/           # Main package
│   ├── models/         # Topic model implementations
│   ├── data/           # Data loading and preprocessing
│   ├── training/       # Training infrastructure
│   ├── evaluation/     # Evaluation metrics
│   ├── visualization/  # Plotting tools
│   └── utils/          # Utilities
├── configs/            # YAML configurations
├── notebooks/          # Tutorial notebooks
├── docs/               # Documentation
└── tests/              # Test suite
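
The "Unified Interface" in src/nbtm/models/ could be realized as an abstract base class that every algorithm implements. The sketch below is an assumption about that design; `TopicModel` and the toy `MostFrequentModel` are hypothetical names, not the package's actual API.

```python
# Hypothetical unified interface: all models expose fit() and
# get_all_topic_words(), matching the Quick Start usage.
from abc import ABC, abstractmethod

class TopicModel(ABC):
    @abstractmethod
    def fit(self, documents, num_iterations=1000):
        """Train on a list of tokenized documents; return self."""

    @abstractmethod
    def get_all_topic_words(self, top_n=10):
        """Return, per topic, a list of (word, weight) pairs."""

class MostFrequentModel(TopicModel):
    """Toy model: every topic is the corpus's most frequent words."""
    def __init__(self, num_topics):
        self.num_topics = num_topics
        self._counts = {}

    def fit(self, documents, num_iterations=1000):
        for doc in documents:
            for w in doc:
                self._counts[w] = self._counts.get(w, 0) + 1
        return self

    def get_all_topic_words(self, top_n=10):
        top = sorted(self._counts.items(), key=lambda kv: -kv[1])[:top_n]
        return [list(top) for _ in range(self.num_topics)]
```

A `create_model(name, ...)` factory can then dispatch on a string key to any registered subclass, which is what lets the CLI and Python API share one code path.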

Configuration

Example configuration (configs/default.yaml):

model:
  name: lda_gibbs
  num_topics: 10
  alpha: 0.1
  beta: 0.01

training:
  num_iterations: 1000
  burn_in: 200

evaluation:
  compute_coherence: true
  coherence_measure: c_v
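
One plausible way the trainer could turn such a config into constructor arguments is sketched below; `build_model_kwargs` is a hypothetical helper, and the dict simply mirrors configs/default.yaml after YAML parsing.

```python
# Hypothetical glue between a parsed config and model construction.
config = {
    "model": {"name": "lda_gibbs", "num_topics": 10, "alpha": 0.1, "beta": 0.01},
    "training": {"num_iterations": 1000, "burn_in": 200},
}

def build_model_kwargs(config):
    """Split the model section into a registry key and constructor kwargs."""
    model_cfg = dict(config["model"])
    name = model_cfg.pop("name")
    return name, model_cfg

name, kwargs = build_model_kwargs(config)
# name is the registry key; kwargs feed the model constructor, and the
# training section feeds fit() separately.
```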

Development

# Install dev dependencies
uv pip install -e ".[dev]"

# Run tests
pytest

# Format code
black src/
ruff check src/ --fix

# Type check
mypy src/

License

MIT License

References

  • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. JMLR.
  • Teh, Y. W., et al. (2006). Hierarchical Dirichlet Processes. JASA.
  • Blei, D. M., & Lafferty, J. D. (2007). Correlated Topic Models. NIPS.
