winternewt/cluster_distribution

cluster-distribution

Monte-Carlo study of DBSCAN clustering on random 2D point clouds (complete spatial randomness, CSR / Poisson null), plus an analytic derivation of the null distribution and a calibrated anomaly detector.

uv sync
uv run cluster-distribution simulate   # or: fit, stats, visualize, plot, ...

Full catalog: docs/overview.md · Analytic results: docs/analytic_findings.md · Executive summary: docs/EXECUTIVE_SUMMARY.md

Findings in brief

DBSCAN on pure noise produces structured, non-trivial cluster statistics. The density ratio R = (N'/S')/λ₀ (cluster size N' over convex-hull area S', relative to the background intensity λ₀) factorizes as R(ε) = scale(ε)·X, where X does not depend on ε. Rescaling to R̃ = R·ε^α(ε), with a running exponent α(ε) = 2.031 + 0.258·ln ε fit from 110 values of ε, collapses the distribution onto a single master curve: a shifted inverse-gamma with integer shape 10 (= min_samples), SF(R̃) = P(10, 157.70/(R̃ − 7.51)), where P is the regularized lower incomplete gamma function. The pooled KS statistic is ≈0.005 over ε = 1.00–1.60, and z-scores are calibrated to |median_z| ≤ 0.02 at every ε. The location shift is essential: zero-loc families (log-logistic, plain inverse-gamma) mis-centre z by ~0.07σ (docs/RCA.md).
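The rescaling and the master survival function can be sketched in a few lines of standard-library Python. This is an illustration built only from the constants quoted above, not the repository's implementation; the function names are hypothetical, and λ₀ is assumed to be the background point density (points per unit area). Because the shape is an integer, P(10, x) has a finite closed form, so no SciPy is needed.

```python
import math

# Constants quoted in the text above.
SHAPE = 10        # integer shape (= min_samples)
LOC = 7.51        # location shift of the master
SCALE = 157.70    # scale of the master


def alpha(eps: float) -> float:
    """Running exponent alpha(eps) = 2.031 + 0.258 * ln(eps)."""
    return 2.031 + 0.258 * math.log(eps)


def rescaled_ratio(n_prime: int, s_prime: float, lam0: float, eps: float) -> float:
    """R~ = R * eps**alpha(eps), with R = (N'/S') / lam0 (hypothetical helper)."""
    if s_prime <= 0 or lam0 <= 0:
        raise ValueError("hull area and background density must be positive")
    return (n_prime / s_prime) / lam0 * eps ** alpha(eps)


def reg_lower_gamma_int(n: int, x: float) -> float:
    """Regularized lower incomplete gamma P(n, x) for integer n >= 1."""
    if x <= 0.0:
        return 0.0
    if x > 700.0:
        return 1.0  # exp(-x) * polynomial is far below double precision here
    if x > n:
        # Complement form: P = 1 - exp(-x) * sum_{k<n} x^k / k!
        term = total = 1.0
        for k in range(1, n):
            term *= x / k
            total += term
        return 1.0 - math.exp(-x) * total
    # Series form, free of cancellation for small x (the far tail in R~):
    # P = exp(-x) * sum_{k>=n} x^k / k!
    term = x ** n / math.factorial(n)
    total = term
    k = n
    while term > 1e-17 * total:
        k += 1
        term *= x / k
        total += term
    return math.exp(-x) * total


def master_sf(r_tilde: float) -> float:
    """SF(R~) = P(10, 157.70 / (R~ - 7.51)); equals 1 at or below the shift."""
    if r_tilde <= LOC:
        return 1.0
    return reg_lower_gamma_int(SHAPE, SCALE / (r_tilde - LOC))
```

For a cluster with N' points and hull area S' found at a given eps, `master_sf(rescaled_ratio(N', S', lam0, eps))` is then a per-cluster tail probability under the master, without any look-elsewhere correction.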

The master's parameters are not arbitrary (docs/RCA.md §8). Its support floor is DBSCAN's certification bound: a cluster of n = min_samples points must fit inside one core point's eps-ball, so R̃ ≥ min_samples·(1+2π²/3(m−1)²) ≈ 10.87 (observed global minimum 10.92 over 1.2M clusters). Its mean, loc + scale/(shape−1) = 25.03, matches the no-free-parameter amplitude C = N_min/f·k/(k−1) ≈ 25 derived from hull geometry. The law is N-invariant at fixed λ₀ (analysis/ncheck.py: N = 10k/20k/40k with radius √N gives an identical master and calibration, with yield ∝ N), so it is a purely local observable of (λ₀, eps, min_samples).
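As a quick check, the quoted mean follows from the standard inverse-gamma mean scale/(shape − 1), valid for shape > 1, plus the location shift, using the master constants loc = 7.51, scale = 157.70, shape = 10:

```latex
\mathbb{E}[\tilde R]
  = \mathrm{loc} + \frac{\mathrm{scale}}{\mathrm{shape} - 1}
  = 7.51 + \frac{157.70}{10 - 1}
  \approx 7.51 + 17.52
  = 25.03
```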

The correct detection statistic is the Kulldorff scan likelihood ratio, not the bare density ratio. Under naive per-window scoring, a typical CSR cluster scores "~6σ", an overstatement of ~10⁸×. See docs/analytic_findings.md for the full derivation.
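For orientation, here is a minimal sketch of the textbook Kulldorff Poisson scan log-likelihood ratio for a single candidate window. It is not taken from the repository, and the function and argument names are hypothetical. It assumes the expected count inside the window is λ₀ times the window area and that expected counts sum to the observed total over the whole field.

```python
import math


def kulldorff_llr(n_in: int, mu_in: float, n_total: int) -> float:
    """Kulldorff Poisson scan log-likelihood ratio for one window.

    n_in    : points observed inside the candidate window
    mu_in   : points expected inside it under the null (e.g. lam0 * area)
    n_total : total number of points in the field
    """
    if mu_in <= 0 or n_in > n_total:
        raise ValueError("need mu_in > 0 and n_in <= n_total")
    if n_in <= mu_in:
        return 0.0  # only over-densities contribute
    n_out = n_total - n_in      # observed outside the window
    mu_out = n_total - mu_in    # expected outside the window
    llr = n_in * math.log(n_in / mu_in)
    if n_out > 0:
        llr += n_out * math.log(n_out / mu_out)
    return llr
```

The scan statistic is the maximum of this quantity over all candidate windows. Comparing that maximum with its Monte-Carlo distribution under the null, rather than scoring each window in isolation, is what accounts for the look-elsewhere effect.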

Detector library

modules/cluster_detector.py is an importable, calibrated two-arm anomaly detector. It flags local over-densities and assigns each a look-elsewhere-corrected p-value and Gaussian-equivalent z. One arm uses DBSCAN + Kulldorff scan-LR (tight clumps); the other uses KDE peaks (extended over-densities, hybrid analytic/MC null).

import numpy as np
from modules.cluster_detector import Detector

det = Detector(radius=100.0, eps=1.2, n_background=10000)
det.calibrate(n_mc=2000, cache_path="null.npz")   # MC the null once; cached thereafter
# det = Detector.load("null.npz")                  # reuse a saved calibration

pts = np.load("my_points.npy")                     # (M, 2) coords inside the disk
for d in det.score(pts, zthr=3.0):
    print(d)   # Detection(method, z, p_value, x, y, n_points, scale)

For a non-CSR background: calibrate(null_generator=lambda rng: my_points(rng)).

For cheap per-cluster ratings without any MC, det.score_clusters(pts) scores each DBSCAN cluster against the analytic shifted inv-gamma(10) master (same constants as the webapp demo). These p-values are per-cluster and not look-elsewhere corrected; use score() for detection. The master is valid for any field size: it depends only on (λ₀, eps, min_samples), as verified at N = 10k–40k (analysis/ncheck.py).
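To make the distinction between per-cluster and look-elsewhere-corrected p-values concrete, the sketch below converts a p-value to a one-sided Gaussian-equivalent z and applies a generic Šidák trials factor. Both helpers are hypothetical illustrations: the one-sided convention is an assumption, and the Šidák formula assumes independent trials. It is not how score() computes its correction.

```python
import math
from statistics import NormalDist


def gaussian_z(p_value: float) -> float:
    """One-sided Gaussian-equivalent z, i.e. the z with P(Z > z) = p_value.

    Requires 0 < p_value < 1 (NormalDist.inv_cdf raises otherwise).
    """
    return -NormalDist().inv_cdf(p_value)


def sidak_global_p(p_local: float, n_trials: int) -> float:
    """Chance that at least one of n_trials independent looks reaches p_local.

    Computes 1 - (1 - p_local)**n_trials via log1p/expm1 so that very small
    p-values keep their precision.
    """
    return -math.expm1(n_trials * math.log1p(-p_local))


if __name__ == "__main__":
    p_local = 1e-4
    for trials in (1, 100, 10_000):
        p_global = sidak_global_p(p_local, trials)
        print(trials, p_global, gaussian_z(p_global))
```

The same local p-value becomes steadily less significant as the number of clusters examined grows, which is why the per-cluster ratings suit ranking rather than detection.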

Live demo

webapp/index.html — static, dependency-free browser demo. Opens with no build step. Three live panels: noise field + detected clusters, R̃ = R·ε^α(ε) histogram converging to the shifted inv-gamma(10) master curve, cluster z-scores forming N(0,1). Move the eps slider to see the collapse.

cd webapp && python3 -m http.server 8000   # then open http://localhost:8000

GitHub Pages: copy webapp/ to a docs/ folder or gh-pages branch (see webapp/README.md).

Data

207 simulation CSVs (~4.3 GB) are stored as Parquet on HuggingFace — not in this repo. Scripts download slices on demand and cache them locally in simdata/v2_parquet/.

HuggingFace dataset: https://huggingface.co/datasets/Winternewt/cluster-distribution-simdata

# Scripts handle this automatically; to fetch a slice manually:
import pandas as pd
df = pd.read_parquet(
    "hf://datasets/Winternewt/cluster-distribution-simdata/data/eps_1.20.parquet"
)
df = df[df.S_prime != -1]   # drop placeholder rows (no cluster found that iteration)

Bulk download:

huggingface-cli download Winternewt/cluster-distribution-simdata \
    --repo-type dataset --local-dir simdata/v2_parquet

Schema: S_prime (float64, convex-hull area), N_prime (int64, cluster size), iteration (int64). Rows with S_prime = -1 are placeholders (no cluster was found in that field).

Tools

Script                           Purpose
tools/simdata_to_parquet.py      Convert local CSVs → zstd parquet, validate 1-to-1
tools/upload_to_hf.py            Upload parquet folder to HuggingFace (uses upload_large_folder)
tools/validate_hf_download.py    Download N spot-check files from HF and assert exact match vs local CSVs
