cluster-distribution

Monte Carlo study of DBSCAN clustering on random 2D point clouds under complete spatial randomness (CSR, i.e. a homogeneous Poisson null), plus an analytic derivation of the null distribution and a calibrated anomaly detector.

uv sync
uv run cluster-distribution simulate   # or: fit, stats, visualize, plot, ...

Full catalog: docs/overview.md · Analytic results: docs/analytic_findings.md · Executive summary: docs/EXECUTIVE_SUMMARY.md

Findings in brief

DBSCAN on pure noise produces structured, non-trivial cluster statistics. The density ratio R = (N'/S')/λ₀ (cluster size N' over convex-hull area S', relative to the background density λ₀) factorizes as R(eps) = scale(eps)·X, where X is eps-invariant. Rescaling to R̃ = R·ε^α(ε), with the running exponent α(ε) = 2.031 + 0.258·ln ε fitted over 110 eps values, collapses the distribution onto a single master curve: a shifted inverse-gamma with integer shape 10 (= min_samples), SF(R̃) = P(10, 157.70/(R̃ − 7.51)), where P is the regularized lower incomplete gamma function. The pooled KS statistic is ≈0.005 over eps 1.00–1.60, and the z-calibration holds to |median_z| ≤ 0.02 at every eps. The location shift is essential: zero-loc families (log-logistic, plain inverse-gamma) mis-centre z by ~0.07σ (docs/RCA.md).
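
The rescaling and the master survival function can be sketched in standard-library Python. This is an illustration built only from the constants quoted above, not the repository's implementation; all function names are hypothetical, and the sign convention for z (larger R̃ gives larger z) is an assumption. Because the shape is the integer 10, the incomplete gamma function reduces to a finite sum.

```python
import math
from statistics import NormalDist

# Master-curve constants quoted in the text above.
SHAPE, LOC, SCALE = 10, 7.51, 157.70

def alpha(eps: float) -> float:
    """Running exponent alpha(eps) = 2.031 + 0.258 * ln(eps)."""
    return 2.031 + 0.258 * math.log(eps)

def rescale(R: float, eps: float) -> float:
    """Rescaled density ratio R~ = R * eps**alpha(eps)."""
    return R * eps ** alpha(eps)

def master_cdf(r_tilde: float) -> float:
    """CDF of the shifted inverse-gamma: Q(n, x) = exp(-x) * sum_{k<n} x^k / k!."""
    if r_tilde <= LOC:
        return 0.0
    x = SCALE / (r_tilde - LOC)
    term, total = 1.0, 1.0
    for k in range(1, SHAPE):
        term *= x / k
        total += term
    return math.exp(-x) * total

def master_sf(r_tilde: float) -> float:
    """Survival function SF(R~) = P(10, 157.70 / (R~ - 7.51)) = 1 - CDF."""
    return 1.0 - master_cdf(r_tilde)

def gaussian_z(r_tilde: float) -> float:
    """Gaussian-equivalent z via the probability integral transform (assumed sign)."""
    p = min(max(master_cdf(r_tilde), 1e-300), 1.0 - 1e-16)  # keep inside (0, 1)
    return NormalDist().inv_cdf(p)

if __name__ == "__main__":
    for r in (15.0, 25.0, 40.0):
        rt = rescale(r, eps=1.2)
        print(f"R={r:5.1f}  R~={rt:6.2f}  SF={master_sf(rt):.4f}  z={gaussian_z(rt):+.2f}")
```

At eps = 1 the rescaling is the identity, so R̃ = R there; for other eps values the exponent α(ε) absorbs the eps dependence before the master curve is applied.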

The master's parameters are not arbitrary (docs/RCA.md §8). Its support floor is DBSCAN's certification bound: a cluster with n = min_samples points must fit inside one core point's eps-ball, so R̃ ≥ min_samples·(1+2π²/3(m−1)²) ≈ 10.87 (observed global minimum 10.92 over 1.2M clusters). Its mean, loc + scale/(shape−1) = 25.03, matches the no-free-parameter amplitude C = N_min/f·k/(k−1) ≈ 25 derived from hull geometry. The law is N-invariant at fixed λ₀: analysis/ncheck.py runs N = 10k/20k/40k with radius √N and finds an identical master and calibration, with yield ∝ N. It is therefore a purely local observable of (λ₀, eps, min_samples).
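
The quoted mean can be checked directly from the fitted constants. The sketch below is a sanity check rather than part of the repository: it evaluates the closed-form mean of a shifted inverse-gamma and compares it with a seeded Monte Carlo estimate, using the fact that LOC + SCALE / G is shifted inverse-gamma when G is Gamma(shape, 1). The sample size and seed are arbitrary choices.

```python
import random

# Master-curve constants quoted in the text above.
SHAPE, LOC, SCALE = 10, 7.51, 157.70

# Closed-form mean of a shifted inverse-gamma (finite only for shape > 1).
analytic_mean = LOC + SCALE / (SHAPE - 1)

# Draw from the master by inverting Gamma(shape, 1) variates.
rng = random.Random(0)
samples = [LOC + SCALE / rng.gammavariate(SHAPE, 1.0) for _ in range(200_000)]
mc_mean = sum(samples) / len(samples)

print(f"analytic mean    = {analytic_mean:.2f}")
print(f"Monte Carlo mean = {mc_mean:.2f}")
print(f"smallest draw    = {min(samples):.2f}")
```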

The correct detection statistic is the Kulldorff scan likelihood ratio, not the bare density ratio. A typical CSR cluster scores "~6σ" under naive per-window scoring, which is overstated by ~10⁸×. See docs/analytic_findings.md for the full derivation.
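
For orientation, the standard Poisson-model form of Kulldorff's statistic compares the count inside a window with the count outside it. The sketch below computes the log-likelihood ratio for a single window; the function name and arguments are illustrative, and the repository's implementation (choice of windows, maximization over them, Monte Carlo calibration) may differ.

```python
import math

def kulldorff_llr(n_in: float, mu_in: float, n_total: float) -> float:
    """Poisson scan log-likelihood ratio for one window.

    n_in    : observed count inside the window
    mu_in   : count expected inside under the null (0 < mu_in < n_total)
    n_total : total observed count in the field
    Returns 0 unless the window is over-dense (n_in > mu_in).
    """
    if n_in <= mu_in:
        return 0.0
    inside = n_in * math.log(n_in / mu_in)
    n_out, mu_out = n_total - n_in, n_total - mu_in
    # x * log(x / c) tends to 0 as x -> 0, so an empty exterior contributes nothing.
    outside = n_out * math.log(n_out / mu_out) if n_out > 0 else 0.0
    return inside + outside

if __name__ == "__main__":
    # A 10-point cluster whose hull would hold 0.4 points on average (R = 25).
    print(kulldorff_llr(n_in=10, mu_in=0.4, n_total=10_000))
```

The scan statistic is the maximum of this quantity over all candidate windows, which is why its null distribution has to account for the number of windows searched and cannot be read off a single-window tail probability.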

Detector library

modules/cluster_detector.py: an importable, calibrated two-arm anomaly detector. It flags local over-densities and assigns each a look-elsewhere-corrected p-value and Gaussian-equivalent z, using DBSCAN + Kulldorff scan-LR for tight clumps and KDE peaks for extended over-densities (hybrid analytic/MC null).

import numpy as np
from modules.cluster_detector import Detector

det = Detector(radius=100.0, eps=1.2, n_background=10000)
det.calibrate(n_mc=2000, cache_path="null.npz")   # MC the null once; cached thereafter
# det = Detector.load("null.npz")                  # reuse a saved calibration

pts = np.load("my_points.npy")                     # (M, 2) coords inside the disk
for d in det.score(pts, zthr=3.0):
    print(d)   # Detection(method, z, p_value, x, y, n_points, scale)

For a non-CSR background, pass a custom null generator: calibrate(null_generator=lambda rng: my_points(rng)).
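
As an illustration of what such a generator might look like, the sketch below draws a background whose density rises linearly across the disk, using rejection sampling on top of a uniform disk sampler. It assumes that rng is a numpy.random.Generator and that the generator should return an (M, 2) array; neither is confirmed here, and my_points, csr_disk and the gradient parameter are hypothetical names.

```python
import numpy as np

def csr_disk(rng: np.random.Generator, n: int, radius: float) -> np.ndarray:
    """n points uniform in a disk; sqrt on the radius gives uniform area density."""
    r = radius * np.sqrt(rng.random(n))
    theta = 2.0 * np.pi * rng.random(n)
    return np.column_stack((r * np.cos(theta), r * np.sin(theta)))

def my_points(rng: np.random.Generator, n: int = 10_000,
              radius: float = 100.0, gradient: float = 0.5) -> np.ndarray:
    """Hypothetical non-CSR null: density proportional to 1 + gradient * x / radius.

    Requires |gradient| < 1 so the density stays positive on the disk.
    """
    out = np.empty((0, 2))
    while len(out) < n:
        cand = csr_disk(rng, n, radius)
        # Accept each candidate with probability density / max density.
        accept_prob = (1.0 + gradient * cand[:, 0] / radius) / (1.0 + abs(gradient))
        out = np.vstack((out, cand[rng.random(n) < accept_prob]))
    return out[:n]

if __name__ == "__main__":
    pts = my_points(np.random.default_rng(0))
    print(pts.shape, pts[:, 0].mean())
    # Assumed usage, following the call shown in the text:
    # det.calibrate(null_generator=lambda rng: my_points(rng))
```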

For cheap per-cluster ratings without any MC, det.score_clusters(pts) scores each DBSCAN cluster against the analytic shifted inv-gamma(10) master (same constants as the webapp demo). These are per-cluster p-values, not look-elsewhere corrected; use score() for detection. The master is valid for any field size: it depends only on (λ₀, eps, min_samples), verified at N = 10k–40k (analysis/ncheck.py).
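
To see why uncorrected per-cluster p-values should not be used for detection, consider the trials factor when many clusters are scored at once. The sketch below uses the Šidák formula for k independent tests purely as an illustration; it is not the correction applied by score(), which is described above as relying on a calibrated null, and the function name is hypothetical.

```python
import math

def sidak_global_p(p_local: float, n_trials: int) -> float:
    """Chance that at least one of n_trials independent tests reaches p_local.

    Computed as 1 - (1 - p)^k in a numerically stable form.
    """
    return -math.expm1(n_trials * math.log1p(-p_local))

if __name__ == "__main__":
    # One cluster at p = 1e-4 looks striking; among 1000 clusters it is unremarkable.
    for k in (1, 100, 1000):
        print(k, sidak_global_p(1e-4, k))
```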

Live demo

webapp/index.html: a static, dependency-free browser demo that opens with no build step. It has three live panels: the noise field with detected clusters, the R̃ = R·ε^α(ε) histogram converging to the shifted inv-gamma(10) master curve, and the cluster z-scores converging to N(0,1). Move the eps slider to see the collapse.

cd webapp && python3 -m http.server 8000   # then open http://localhost:8000

GitHub Pages: copy webapp/ to a docs/ folder or gh-pages branch (see webapp/README.md).

Data

The 207 simulation CSVs (~4.3 GB) are stored as Parquet on HuggingFace, not in this repo. Scripts download slices on demand and cache them locally in simdata/v2_parquet/.

HuggingFace dataset: https://huggingface.co/datasets/Winternewt/cluster-distribution-simdata

# Scripts handle this automatically; to fetch a slice manually:
import pandas as pd
df = pd.read_parquet(
    "hf://datasets/Winternewt/cluster-distribution-simdata/data/eps_1.20.parquet"
)
df = df[df.S_prime != -1]   # drop placeholder rows (no cluster found that iteration)

Bulk download:

huggingface-cli download Winternewt/cluster-distribution-simdata \
    --repo-type dataset --local-dir simdata/v2_parquet

Schema: S_prime (float64, convex-hull area), N_prime (int64, cluster size), iteration (int64). Rows with S_prime = -1 are placeholders (no cluster found in that field).
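
Given this schema, the density ratio R = (N'/S')/λ₀ and its rescaled form can be derived per row. The sketch below is a minimal example, not one of the repository's scripts: the function name is hypothetical, the demo rows are synthetic, and λ₀ = 1/π is an assumption corresponding to 10,000 points in a disk of radius 100.

```python
import numpy as np
import pandas as pd

def density_ratio(df: pd.DataFrame, lambda0: float, eps: float) -> pd.DataFrame:
    """Drop placeholder rows, then add R = (N'/S')/lambda0 and R~ = R * eps**alpha."""
    real = df[df["S_prime"] != -1].copy()
    real["R"] = (real["N_prime"] / real["S_prime"]) / lambda0
    alpha = 2.031 + 0.258 * np.log(eps)  # running exponent quoted in the findings
    real["R_tilde"] = real["R"] * eps ** alpha
    return real

if __name__ == "__main__":
    # Synthetic rows for illustration only; the middle row is a placeholder.
    demo = pd.DataFrame({
        "S_prime": [12.0, -1.0, 20.0],
        "N_prime": [10, 0, 14],
        "iteration": [0, 1, 2],
    })
    print(density_ratio(demo, lambda0=1.0 / np.pi, eps=1.2))
```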

Tools

Script	Purpose
`tools/simdata_to_parquet.py`	Convert local CSVs → zstd parquet, validate 1-to-1
`tools/upload_to_hf.py`	Upload parquet folder to HuggingFace (uses `upload_large_folder`)
`tools/validate_hf_download.py`	Download N spot-check files from HF and assert exact match vs local CSVs
