Monte-Carlo study of DBSCAN clustering on random 2D point clouds (CSR / Poisson null), plus analytic derivation of the null distribution and a calibrated anomaly detector.
uv sync
uv run cluster-distribution simulate # or: fit, stats, visualize, plot, ...

Full catalog: docs/overview.md · Analytic results: docs/analytic_findings.md · Executive summary: docs/EXECUTIVE_SUMMARY.md
DBSCAN-on-noise produces structured, non-trivial cluster statistics. The density ratio
R = (N'/S')/λ₀ obeys a factorization R(eps) = scale(eps)·X where X is
eps-invariant: rescaling to R̃ = R·ε^α(ε) (where α(ε) = 2.031 + 0.258·ln ε is a
running exponent fit from 110 eps values) collapses the distribution onto a single master:
a shifted inverse-gamma with integer shape 10 (= min_samples),
SF(R̃) = P(10, 157.70/(R̃ − 7.51)), pooled KS≈0.005 over eps 1.00–1.60 and z-calibrated
to |median_z| ≤ 0.02 at every eps. The location shift is essential — zero-loc families
(log-logistic, plain inv-gamma) mis-centre z by ~0.07σ (docs/RCA.md).
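
As a rough sketch of how the rescaling and the master survival function quoted above could be evaluated, the snippet below uses only the constants stated in this section. The function names (`alpha`, `rescale`, `master_sf`) are illustrative and are not taken from the repo. Because the shape is the integer 10, the regularized lower incomplete gamma P(10, x) reduces to a finite sum, so only the standard library is needed.

```python
import math

# Constants quoted in the text above.
SHAPE = 10       # integer shape (= min_samples)
LOC = 7.51       # location shift of the master
SCALE = 157.70   # scale of the master


def alpha(eps):
    """Running exponent alpha(eps) = 2.031 + 0.258 * ln(eps)."""
    return 2.031 + 0.258 * math.log(eps)


def rescale(r, eps):
    """Rescaled density ratio R~ = R * eps**alpha(eps)."""
    return r * eps ** alpha(eps)


def reg_lower_gamma_int(n, x):
    """P(n, x) for integer n >= 1: 1 - exp(-x) * sum_{k<n} x**k / k!."""
    term = 1.0
    total = 1.0
    for k in range(1, n):
        term *= x / k
        total += term
    return 1.0 - math.exp(-x) * total


def master_sf(r_tilde):
    """Survival function SF(R~) = P(10, 157.70 / (R~ - 7.51))."""
    if r_tilde <= LOC:
        return 1.0  # below the location shift every cluster "survives"
    return reg_lower_gamma_int(SHAPE, SCALE / (r_tilde - LOC))


if __name__ == "__main__":
    for r_tilde in (15.0, 25.0, 60.0):
        print(r_tilde, master_sf(r_tilde))
```

The finite-sum form loses precision far out in the upper tail, where SF is tiny; a library routine for the regularized incomplete gamma would be a safer choice for extreme values.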
The master's parameters are not arbitrary (docs/RCA.md §8): its support floor is
DBSCAN's certification bound — an n = min_samples cluster must fit inside one core
point's eps-ball, so R̃ ≥ min_samples·(1+2π²/3(m−1)²) ≈ 10.87 (observed global min
10.92 over 1.2M clusters), and its mean loc + scale/(shape−1) = 25.03 equals the
no-free-parameter amplitude C = N_min/f·k/(k−1) ≈ 25 derived from hull geometry.
The law is N-invariant at fixed λ₀ (analysis/ncheck.py: N = 10k/20k/40k, radius √N
— identical master and calibration, yield ∝ N), i.e. a purely local observable of
(λ₀, eps, min_samples).
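
The two pieces of arithmetic in the paragraph above can be checked in a few lines. This is only a worked example using numbers quoted in this README; it assumes the field is a disk of radius √N holding N points, as the ncheck description suggests, and the variable names are illustrative.

```python
import math

SHAPE, LOC, SCALE = 10, 7.51, 157.70

# Mean of a shifted inverse-gamma: loc + scale / (shape - 1), valid for shape > 1.
master_mean = LOC + SCALE / (SHAPE - 1)

# Background density lambda0 = N / (pi * radius**2) with radius = sqrt(N).
densities = []
for n_points in (10_000, 20_000, 40_000):
    radius = math.sqrt(n_points)
    densities.append(n_points / (math.pi * radius ** 2))

if __name__ == "__main__":
    print(round(master_mean, 2))
    print(densities)
```

Under that assumption λ₀ is 1/π for every N, which is why changing N at radius √N leaves the local statistics untouched and only scales the number of clusters found.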
The correct detection statistic is the Kulldorff scan likelihood ratio, not the bare density ratio. A typical CSR cluster scores "~6σ" under naive per-window scoring — overstated ~10⁸×. See docs/analytic_findings.md for the full derivation.
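
For orientation, here is a minimal sketch of the standard Poisson-model Kulldorff log-likelihood ratio for a single window. It is the textbook form, not necessarily what modules/cluster_detector.py implements; the function name and arguments are hypothetical. `c` is the observed count in the window, `e` the count expected there under the null (for example λ₀ times the window area), and `total` the number of points in the whole field.

```python
import math


def kulldorff_llr(c, e, total):
    """Poisson scan log-likelihood ratio for one window.

    Returns 0 when the window is not over-dense (c <= e).
    Assumes 0 < e < total and 0 <= c <= total.
    """
    if c <= e:
        return 0.0
    inside = c * math.log(c / e)
    rest = total - c
    # The 0 * log(0) limit is taken as 0 when every point lies in the window.
    outside = rest * math.log(rest / (total - e)) if rest > 0 else 0.0
    return inside + outside


if __name__ == "__main__":
    # A 20-point clump where 5 points were expected, in a 10,000-point field.
    print(kulldorff_llr(20, 5.0, 10_000))
```

In the usual scan-statistic procedure the significance comes from comparing the maximum LLR over all candidate windows with the distribution of that maximum under simulated null fields, which is what makes the resulting p-value look-elsewhere corrected.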
modules/cluster_detector.py — importable, calibrated two-arm anomaly detector.
Flags local over-densities and gives each a look-elsewhere-corrected p-value and
Gaussian-equivalent z via DBSCAN + Kulldorff scan-LR (tight clumps) and KDE peaks
(extended over-densities, hybrid analytic/MC null).
import numpy as np
from modules.cluster_detector import Detector
det = Detector(radius=100.0, eps=1.2, n_background=10000)
det.calibrate(n_mc=2000, cache_path="null.npz") # MC the null once; cached thereafter
# det = Detector.load("null.npz") # reuse a saved calibration
pts = np.load("my_points.npy") # (M, 2) coords inside the disk
for d in det.score(pts, zthr=3.0):
    print(d) # Detection(method, z, p_value, x, y, n_points, scale)

For a non-CSR background: calibrate(null_generator=lambda rng: my_points(rng)).
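
A custom null generator might look like the sketch below. It rests on assumptions that this README does not confirm: that `rng` is a `numpy.random.Generator`, and that the callable should return an (M, 2) array of coordinates inside the disk. The helper names `csr_disk` and `gradient_disk` are hypothetical.

```python
import numpy as np


def csr_disk(rng, n=10_000, radius=100.0):
    """Uniform (CSR) points in a disk; sqrt on the radius keeps the density flat."""
    r = radius * np.sqrt(rng.random(n))
    theta = rng.uniform(0.0, 2.0 * np.pi, n)
    return np.column_stack((r * np.cos(theta), r * np.sin(theta)))


def gradient_disk(rng, n=10_000, radius=100.0, slope=0.5):
    """Non-CSR background with density proportional to 1 + slope * x / radius.

    Uses rejection sampling on top of csr_disk; requires |slope| < 1 so the
    density stays positive everywhere in the disk.
    """
    bound = 1.0 + abs(slope)
    chunks = []
    need = n
    while need > 0:
        cand = csr_disk(rng, n=2 * need, radius=radius)
        accept = rng.random(len(cand)) * bound < 1.0 + slope * cand[:, 0] / radius
        kept = cand[accept][:need]
        chunks.append(kept)
        need -= len(kept)
    return np.vstack(chunks)


if __name__ == "__main__":
    pts = gradient_disk(np.random.default_rng(0), n=2_000)
    print(pts.shape)
    # Hypothetical hook-up, following the calibrate(...) call shown above:
    # det.calibrate(null_generator=lambda rng: gradient_disk(rng))
```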
For cheap per-cluster ratings without any MC, det.score_clusters(pts) scores each
DBSCAN cluster against the analytic shifted inv-gamma(10) master (same constants as the
webapp demo). These are per-cluster p-values and are not look-elsewhere corrected; use score() for detection.
The master is valid for any field size: it depends only on (λ₀, eps, min_samples),
verified at N = 10k–40k (analysis/ncheck.py).
webapp/index.html — static, dependency-free browser demo. Opens with no build step.
Three live panels: noise field + detected clusters, R̃ = R·ε^α(ε) histogram converging
to the shifted inv-gamma(10) master curve, cluster z-scores forming N(0,1). Move the eps
slider to see the collapse.
cd webapp && python3 -m http.server 8000 # then open http://localhost:8000

GitHub Pages: copy webapp/ to a docs/ folder or gh-pages branch (see webapp/README.md).
207 simulation CSVs (~4.3 GB) are stored as Parquet on HuggingFace — not in this repo.
Scripts download slices on demand and cache them locally in simdata/v2_parquet/.
HuggingFace dataset: https://huggingface.co/datasets/Winternewt/cluster-distribution-simdata
# Scripts handle this automatically; to fetch a slice manually:
import pandas as pd
df = pd.read_parquet(
    "hf://datasets/Winternewt/cluster-distribution-simdata/data/eps_1.20.parquet"
)
df = df[df.S_prime != -1] # drop placeholder rows (no cluster found that iteration)

Bulk download:
huggingface-cli download Winternewt/cluster-distribution-simdata \
  --repo-type dataset --local-dir simdata/v2_parquet

Schema: S_prime (float64, convex-hull area), N_prime (int64, cluster size),
iteration (int64). Rows with S_prime = -1 are placeholders (no cluster found in that field).
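
Given that schema, turning a slice into density ratios R = (N'/S')/λ₀ could look like the sketch below. The helper name `add_density_ratio` and the tiny in-memory frame are made up for illustration, and the default λ₀ = 1/π assumes N points in a disk of radius √N; pass the real background density if the simulation setup differs.

```python
import math

import pandas as pd

LAMBDA0 = 1.0 / math.pi  # assumption: N points in a disk of radius sqrt(N)


def add_density_ratio(df, lambda0=LAMBDA0):
    """Drop placeholder rows and add R = (N_prime / S_prime) / lambda0."""
    clusters = df[df["S_prime"] != -1].copy()
    clusters["R"] = (clusters["N_prime"] / clusters["S_prime"]) / lambda0
    return clusters


if __name__ == "__main__":
    # Synthetic rows in the documented schema; the values are invented.
    demo = pd.DataFrame(
        {
            "S_prime": [1.5, -1.0, 2.0],
            "N_prime": [10, 0, 12],
            "iteration": [0, 1, 2],
        }
    )
    print(add_density_ratio(demo))
```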
| Script | Purpose |
|---|---|
| tools/simdata_to_parquet.py | Convert local CSVs → zstd parquet, validate 1-to-1 |
| tools/upload_to_hf.py | Upload parquet folder to HuggingFace (uses upload_large_folder) |
| tools/validate_hf_download.py | Download N spot-check files from HF and assert exact match vs local CSVs |