Feature/bulk prefetch decomposition #37
Merged
adamjohnwright merged 37 commits into main on Apr 29, 2026
This PR adds significant improvements to the logic-network-generator:

## Testing (43 → 52 tests, all passing)
- Added 9 comprehensive tests for regulators and catalysts
  - Verifies negative regulators have pos_neg='neg'
  - Verifies positive regulators have pos_neg='pos'
  - Verifies catalysts have pos_neg='pos' and edge_type='catalyst'
- Added integration tests validating real network files
- Fixed Neo4j import mocking in all test files

## Documentation
- Created Architecture Overview (docs/ARCHITECTURE.md)
  - Complete system architecture and data flow
  - Virtual reactions and edge semantics
  - AND/OR logic rules and design decisions
- Created comprehensive examples (examples/)
  - Working example script with analysis
  - Usage patterns and troubleshooting guide
  - Example pathways table with complexity ratings
- Added CHANGELOG.md documenting all improvements
- Added test suite documentation (TEST_SUITE_SUMMARY.md)
- Added GitHub Actions badge to README

## Error Handling
- Enhanced Neo4j connector with specific exceptions
- Added informative error messages with troubleshooting steps
- Improved pathway generator logging
- Added graceful handling of file I/O errors

## Code Quality
- Added type hints to remaining functions
- Added type annotations to variables
- Fixed mypy warnings
- Enhanced docstrings with exception documentation

## Files Modified
- src/neo4j_connector.py - Error handling
- src/pathway_generator.py - Logging and error handling
- src/reaction_generator.py - Type hints
- src/logic_network_generator.py - Type annotations
- tests/*.py - Mock fixes and new regulator tests
- README.md - Updated documentation and test count
- CHANGELOG.md - Comprehensive change documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Enhancements:
- Added comprehensive CLI argument parsing with argparse
- Optional authentication (no auth by default, --username/--password when needed)
- Custom output file path via --output flag
- Species filtering with --all-species flag
- Debug and verbose logging modes
- Comprehensive error handling and validation
  - Query result validation with informative errors
  - File I/O error handling with troubleshooting hints
- Progress reporting and statistics summary
- Full type hints throughout
- Enhanced README documentation with examples
- Passes mypy and ruff linting checks

The script now follows production best practices with proper error handling, logging, and a user-friendly CLI interface.

Script size: 70 → 345 lines (roughly 5x, reflecting the added functionality). Tested successfully: generated a mapping file with 78,154 entities.
Changes:
- Added missing Union import in src/reaction_generator.py
- Removed unused pytest imports from test files
- Converted f-strings without placeholders to regular strings
- Removed unused Mock import from test_regulators_and_catalysts.py

All ruff checks now pass. Verified with: `poetry run ruff check src/ bin/ tests/`
Integration tests that require pathway_logic_network_69620.csv now skip gracefully when the file doesn't exist (e.g., in CI environments).

Changes:
- Added pytest.mark.skipif to test_network_invariants.py
- Added pytest.mark.skipif to test_actual_edge_semantics.py
- Tests pass when the file exists (52 passed)
- Tests skip when the file is missing (14 skipped, 38 passed)

This allows CI to pass while still running integration tests locally.
- Remove 15 historical markdown files (Nov 2025 bug analyses, fix summaries, status reports); the bugs they describe are now fixed and covered by tests.
- Distill the three load-bearing pieces (position-aware UUID design, EntitySet expansion behavior, decomposition loop semantics) into docs/UUID_DESIGN.md and docs/DESIGN_DECISIONS.md.
- Delete 5 root-level scratch scripts (analyze_loops, investigate_loops, validate_generated_network, validate_pathway, test_position_aware); all are superseded by the tests/ suite.
- Drop CHANGELOG.md; pre-1.0 with no users, the commits/PRs are the history.
- Update README.md, docs/ARCHITECTURE.md, src/logic_network_generator.py, and .github/pull_request_template.md to point at the new doc locations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Register a new `integration` marker for tests that depend on artifacts in output/ from a prior pathway generation run. Apply it to the four files that were previously skipping silently via pytest.mark.skipif: test_network_invariants, test_actual_edge_semantics, test_uid_reaction_connections, test_uuid_mapping_export.

CI now runs `pytest -m "not database and not integration"`, i.e. only the unit tier (101 tests, ~4s, no fixtures). The integration and database tiers stay runnable locally; tests/README.md documents how.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two test files were written when reaction IDs in cache files were
numeric, then silently broke when the codebase migrated to R-HSA
stable IDs ("R-HSA-9665106"). The breakage was masked by a fixture
parser that filtered all R-HSA-named pathway dirs out of
AVAILABLE_PATHWAYS, so the affected tests collected zero parameters
and reported as NOTSET-skipped.
Fixes:
- find_pathway_dir / get_available_pathways: extract trailing digits
via regex instead of endswith("_{id}"), handling both "Foo_12345"
and "Foo_R-HSA-12345" naming.
- test_all_reactions_present: query Neo4j for stId, drop the now-
unnecessary numeric comparison.
- test_all_reactions_in_decomposition / test_entity_sets_are_decomposed:
drop astype(int) on R-HSA-prefixed columns; query Neo4j for stId.
Also delete 28 stale output/ directories from the pre-stable-ID era
(they lacked stid_to_uuid_mapping.csv, which all current code paths
emit). The integration tier was failing on each one.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
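The regex change above can be sketched as follows (the helper name here is illustrative; in the codebase the logic lives in the fixture parser's find_pathway_dir / get_available_pathways):

```python
import re

def extract_pathway_id(dir_name: str):
    """Extract the trailing numeric pathway ID from a directory name.

    Handles both the legacy "Foo_12345" naming and the stable-ID
    "Foo_R-HSA-12345" naming, which endswith(f"_{id}") could not.
    """
    match = re.search(r"(\d+)$", dir_name)
    return match.group(1) if match else None
```

With this shape, R-HSA-named pathway directories are no longer filtered out of AVAILABLE_PATHWAYS, so the parametrized tests collect real parameters again.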
Previously and_or carried the within-entity decomposition logic for catalysts and regulators (Complex → 'and' across members, EntitySet → 'or' across alternatives). That conflicted with the reaction-level reading the column carries elsewhere: inputs are 'and' because they are required for the reaction to proceed; outputs are 'or' when they have multiple producers.

Switch to a single reaction-level meaning: anything that contributes to the reaction proceeding (catalyst, positive regulator) is 'and'; anything that blocks it (negative regulator) is 'or', because any one blocker suffices. Positive regulators get 'and' even though biologically the reaction can sometimes proceed without them; parameter learning later models the conditional dependency. The Complex/EntitySet decomposition tree is preserved in decomposed_uid_mapping.csv.

Tests updated to reflect the new contract:
- test_and_logic_consistency now allows positive regulators in AND.
- New test_negative_regulators_are_or asserts neg regulators are OR.
- New test_positive_regulators_are_and asserts pos regulators are AND.
- test_and_or_logic_per_type and test_entityset_catalyst_*: updated expectations to match the reaction-level semantic.

Also: add the missing pytest.mark.database to test_autophagy_validation (it was silently skipping in CI because the marker was absent), and update its hardcoded PATHWAY_DIR to the current R-HSA naming.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
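The reaction-level rule reduces to a one-liner (a hypothetical helper for illustration; in the codebase the assignment happens inside append_regulators):

```python
def and_or_for_regulator(pos_neg: str) -> str:
    """Reaction-level and_or for catalyst/regulator edges.

    Any single negative regulator suffices to block the reaction -> 'or'.
    Catalysts and positive regulators contribute to it proceeding -> 'and'.
    """
    return "or" if pos_neg == "neg" else "and"
```

Note that the entity's internal shape (Complex vs EntitySet) plays no role here; that structure lives in decomposed_uid_mapping.csv instead.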
source_entity_id was previously populated only for one path: a Complex
that contains an EntitySet, when decomposed. Bare-EntitySet inputs
(common case) lost the parent on the way back up to the reaction level
because break_apart_entity's EntitySet branch returned the flat set of
leaves without writing any rows. Some pathways (e.g. Circadian_clock
9909396) had zero rows with source_entity_id set despite having 16
EntitySets in Neo4j; even Autophagy hit only 1.7% population.
The fix adds an EntitySet provenance write inside break_apart_entity:
when an EntitySet S decomposes to leaves {A, B, ...}, emit one row per
leaf with reactome_id=S, component_id=leaf, source_entity_id=S. Leaves
that are themselves UIDs from a nested decomposition expand to their
component_ids so each component carries S as a parent.
The existing reactome_id-keyed cache lookup at the top of
break_apart_entity finds these rows on re-entry and short-circuits
correctly, so a given EntitySet's provenance is written exactly once.
Provenance row uids are deterministic sha256(entity_set|leaf|component)
hashes that never escape break_apart_entity's return value, so
downstream callers (best_reaction_match, _build_uid_index,
_get_reactome_id_from_hash) never receive them and treat them as the
internal markers they are.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
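The deterministic provenance uid scheme can be sketched like this (the exact separator and field order in the codebase are assumptions here; the commit only specifies sha256 over entity_set, leaf, and component):

```python
import hashlib

def provenance_uid(entity_set_id: str, leaf_id: str, component_id: str) -> str:
    """Deterministic uid for an EntitySet provenance row.

    Same (entity_set, leaf, component) triple always hashes to the same
    uid, so re-entry into break_apart_entity cannot mint duplicates; the
    uids never escape its return value, so downstream callers never see
    them.
    """
    key = f"{entity_set_id}|{leaf_id}|{component_id}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```

Determinism is the point: the reactome_id-keyed cache lookup can short-circuit on re-entry without any risk of the same provenance being written twice under different uids.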
A previous round nearly broke the matching layer by collapsing two separate concerns: how decomposition serves cross-reaction matching (plumbing) vs. how it serves perturbation in the final network (output). Same word, two different rules. The fix wasn't more code; the fix is making the rules explicit so future-me/future-LLM can't make the same mistake.

Three pieces:
- docs/DESIGN_DECISIONS.md gets a "Complex vs EntitySet" section laying out the biological semantic and a "Two layers of decomposition" section spelling out the matcher/output split. Catches human-level confusion before code is written.
- src/reaction_generator.py gets a module-level docstring stating which layer it serves and what NOT to do here, with a pointer at the design doc.
- tests/test_decomposition_semantics.py codifies the contract as pure unit tests with mocked Neo4j: EntitySet is flat alternatives, simple Complex is atomic, Complex-with-EntitySet is cartesian, ubiquitin is intact, repeated calls don't duplicate provenance rows. Each test has a one-line "if you're about to delete this, read the doc first" message in its assertion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root-input and terminal-output complexes now get synthetic edges
exposing their leaf components, so that individual proteins can be
perturbed at the network entry and read at the exit. Intermediate
complexes — those produced by some reaction AND consumed by another —
are deliberately left intact. They're the actual biological species
flowing between reactions, and the AB dimer is a different molecule
from free A and free B.
Two pieces:
- src/reaction_generator.py: new get_terminal_components(entity_id),
a pure recursive read that walks Complex/EntitySet structures down
to leaves. Used only for boundary expansion in the output layer; it
is the counterpart to break_apart_entity's matching-layer rules
and intentionally treats Complexes differently. Module docstring
already calls out the two-layer split; this helper lives next to
the matching helpers but is documented to belong to the output side.
- src/logic_network_generator.py: new _emit_boundary_decomposition_edges
runs after append_regulators. For each root-input complex C with
leaves {A, B, ...}, emit A→C, B→C, ... edges of edge_type='assembly';
for each terminal-output complex, emit C→A, C→B, ... of
edge_type='dissociation'. Each leaf shares one UUID across all
boundary contexts, so perturbing a leaf at assembly propagates
through any matching dissociation downstream.
Verified on R-HSA-9909396 (Circadian_clock): 14 assembly + 75
dissociation edges across 33 unique boundary leaves. All have
pos_neg='pos', and_or='and'. Network total grew from 1366 → 1455
edges; perturbation of individual proteins is now possible at
boundaries.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
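The boundary-expansion rule can be sketched as a pure function (names and signatures here are illustrative, not the real _emit_boundary_decomposition_edges / get_terminal_components API):

```python
def emit_boundary_edges(root_inputs, terminal_outputs, leaves_of):
    """Root-input complexes get leaf -> complex 'assembly' edges;
    terminal-output complexes get complex -> leaf 'dissociation' edges.
    Intermediate complexes are deliberately left intact.
    """
    edges = []
    for c in root_inputs:
        for leaf in leaves_of(c):
            edges.append({"source": leaf, "target": c,
                          "edge_type": "assembly",
                          "pos_neg": "pos", "and_or": "and"})
    for c in terminal_outputs:
        for leaf in leaves_of(c):
            edges.append({"source": c, "target": leaf,
                          "edge_type": "dissociation",
                          "pos_neg": "pos", "and_or": "and"})
    return edges
```

Because each leaf shares one UUID across all boundary contexts, perturbing a leaf at an assembly edge can propagate through any matching dissociation edge downstream.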
Three new tests in test_network_invariants.py:
- assembly edges are pos/and
- dissociation edges are pos/and
- no self-loops on boundary edges

Tests parametrize over generated pathway dirs and skip cleanly when a given pathway has no boundary edges, which is fine: the invariants fire once a pathway's network actually contains these edges.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
_emit_boundary_decomposition_edges previously minted a fresh UUID for every leaf of a root/terminal complex without checking whether that leaf's stId already had a UUID elsewhere in the network. So if MDM2 appeared as both a regulator (one UUID) and a leaf of the MDM2:TP53 root complex (a different fresh UUID), the two were disconnected nodes for the same biological protein. Perturbing one wouldn't propagate to the other, which kills the perturbation use case boundary expansion exists to enable.

The fix builds a stId → UUID lookup from reactome_id_to_uuid (which by this point holds every UUID minted by the VR phases and by append_regulators) and consults it in _leaf_uuid before minting. Existing UUIDs win; fresh ones are minted only for stIds new to the network.

Two new lockdown tests in TestBoundaryLeavesReuseExistingUUIDs prove the contract: a leaf with a pre-existing UUID reuses it; a leaf without one mints fresh. Both have a one-line "if you're about to break this, read the docs" message in the assertion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
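The reuse-before-mint invariant is small enough to sketch in full (a simplified stand-in for _leaf_uuid; the real lookup is built from reactome_id_to_uuid):

```python
import uuid

def leaf_uuid(st_id, stid_to_existing_uuid):
    """Return the network-wide UUID for a boundary leaf's stId.

    A leaf whose stId already has a UUID anywhere in the network reuses
    it; only genuinely new stIds mint fresh UUIDs, and the fresh one is
    recorded so later leaves of the same stId also reuse it.
    """
    if st_id in stid_to_existing_uuid:
        return stid_to_existing_uuid[st_id]
    fresh = str(uuid.uuid4())
    stid_to_existing_uuid[st_id] = fresh
    return fresh
```

This is the invariant that keeps MDM2-as-regulator and MDM2-as-complex-leaf on one node, so a perturbation on either reaches both.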
Six small cleanups, none changing observable behavior:
- create_reaction_id_map: drop the unused reaction_ids parameter the docstring already admitted was dead, and assert that incoming and outgoing hashes resolve to the same Reactome reaction. Hungarian pairing should guarantee this; the assert turns a silent quirk into a loud failure if it ever doesn't.
- Output edges: replace and_or="" for single-producer outputs with None. The empty string was inconsistent with the rest of the column (where "and" / "or" / NaN are the meaningful values) and produced serialization noise.
- Drop _extract_uid_and_reactome_values: defined, never called.
- _decompose_regulator_entity now returns 2-tuples (terminal_id, stoichiometry) instead of 3-tuples carrying a 'logic' field that has been ignored at the call site since the and_or = pos→and / neg→or rewrite. The decomposition rules in this helper still mirror break_apart_entity for matching consistency; AND/OR is set by append_regulators based on pos_neg, not by entity shape. Updated mocks in test_regulators_and_catalysts to match.
- find_root_inputs / find_terminal_outputs: dedupe with .unique() and drop NaNs symmetrically on both sides. The functions now return what their names suggest (a set-like list of distinct boundary UUIDs), and they handle nulls the same way.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously, importing src.neo4j_connector or src.logic_network_generator
opened a Neo4j connection at import time using hardcoded credentials.
This made pytest collection fail when Neo4j was unreachable and forced
test files to wrap imports in `with patch('py2neo.Graph')`.
Now the Graph is created lazily on first call to get_graph(), with
URL/user/password read from NEO4J_URL / NEO4J_USER / NEO4J_PASSWORD
(falling back to the previous defaults). A small _GraphProxy shim
keeps the existing `graph.run(...)` call sites working without a
mass-rewrite.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
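A minimal sketch of the lazy-connection shim described above (the env-var names come from the commit; the default URL and credentials here are placeholders, and the real _GraphProxy lives in src/neo4j_connector.py):

```python
import os

class _GraphProxy:
    """Defers the Neo4j connection until the first query, so importing
    the module never touches the database and pytest collection works
    without a reachable server."""

    def __init__(self, factory):
        self._factory = factory
        self._graph = None

    def run(self, query, **params):
        if self._graph is None:
            self._graph = self._factory()  # connect on first use only
        return self._graph.run(query, **params)

def _make_graph():
    from py2neo import Graph  # imported lazily, only when a query runs
    return Graph(
        os.getenv("NEO4J_URL", "bolt://localhost:7687"),
        auth=(os.getenv("NEO4J_USER", "neo4j"),
              os.getenv("NEO4J_PASSWORD", "neo4j")),
    )

# Existing `graph.run(...)` call sites keep working unchanged.
graph = _GraphProxy(_make_graph)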
Previously, each Reactome reaction triggered three separate Cypher queries (catalysts, positive regulators, negative regulators). For a 100-reaction pathway that's 300 round-trips just for these relationships. Now each kind is a single parameterized query that UNWINDs the reaction list — three round-trips total per pathway. The query bodies are also extracted to module-level _CATALYST_CYPHER / _POS_REG_CYPHER / _NEG_REG_CYPHER constants and use $reaction_ids parameterization instead of f-string interpolation, ruling out a class of escaping issues even though Reactome stable IDs aren't a real injection risk. A small _bulk_fetch_reaction_links helper handles the actual graph call; _bulk_fetch_regulators consolidates the body shared between positive and negative regulator fetching. Output is unchanged: same DataFrames with the same columns and the same per-(reaction_uuid, entity) rows. _execute_regulator_query is removed — it was only useful in the per-reaction loop. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The audit assertion in create_reaction_id_map (added in the previous cleanup commit) immediately fired when regenerating Circadian_clock, exposing a real existing bug: the same hash can appear in decomposed_uid_mapping under multiple reactome_ids, because hashes are sha256 of sorted components and don't carry reaction context. _get_reactome_id_from_hash returned the first match arbitrarily, so virtual reactions could be labeled with the wrong source reaction whenever two Reactome reactions happened to share an input or output combination.

Fix: emit each best_match as a triple (input_hash, output_hash, reaction_id) at the point the Hungarian assignment runs (which has the reaction_id) instead of reverse-deriving it later from the hash.
- src.reaction_generator.decompose_by_reactions now extends all_best_matches with triples.
- src.pathway_generator: the best_matches DataFrame and best_matches.csv cache now carry a reactome_id column.
- src.logic_network_generator.create_reaction_id_map reads the reaction_id directly from the row instead of looking it up; the defensive assert is unnecessary now and is removed.

This invalidates existing best_matches.csv caches (they lack the new column). The next regeneration will produce fresh caches with the right schema; clear output/*/cache before running.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The uri = "bolt://..." module variable was removed in commit 1995145 when the Neo4j connection went lazy, but three error-message f-strings still referenced it. They'd raise NameError instead of the intended ConnectionError if Neo4j ever became unreachable, masking the real diagnostic. Pulled the URL from os.getenv at the formatting site so the message keeps working.

Dead code removed:
- contains_reference_gene_product_molecule_or_isoform: this was the exact filter the bulk-prefetch branch removed (Change 2 from the earlier semantic discussion). The function definition was orphaned; nothing imports it.
- get_reactions(pathway_id, taxon_id): defined, never called.
- generate_pathway_file: drop the unused taxon_id and decompose parameters; the docstring already admitted taxon_id was unused, and the body never referenced decompose. The caller in bin/create-pathways is updated; the matching taxon_id = "9606" constant goes away too.
- Stale-cache shim that detected dbId-formatted reaction_connections caches and cleared them: a migration aid that's been obsolete since the codebase moved to stable IDs months ago.
- number_of_reaction_connections: int = -1 hardcoded debug toggle whose `if > 0` branch could never fire under the default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every query in neo4j_connector was built via f-string interpolation
of stable IDs (e.g. f"WHERE p.stId = '{pathway_id}'"). py2neo accepts
parameter binding via $name placeholders, which sidesteps an entire
class of escaping bugs and makes the queries auditable as plain
strings. Reactome stable IDs aren't a real injection risk in practice,
but this is the right shape for the codebase.
- prefetch_entity_data: all 5 queries (io, descendants, components,
members, ref) now bind reaction_ids / entity_ids as $params.
- prefetch_entity_decomposition_data: same treatment for its 3 queries.
- get_reaction_connections, get_pathway_name, get_labels,
get_complex_components, get_set_members, get_reference_entity_id:
bind $entity_id / $pathway_id directly.
- get_reaction_input_output_ids: relationship type can't be a Cypher
parameter, so the input_or_output value is now whitelist-validated
before being embedded.
Magic-number comment on the *0..10 traversal depth in
prefetch_entity_data: real Reactome nesting maxes out around 5; 10 is
a safety margin without being expensive.
Module-level call sites now use get_graph() directly instead of the
graph proxy, which is unreferenced everywhere else and is removed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Hungarian-pairing module is the heart of input-to-output matching within a reaction. It had no tests before this commit and an O(n×m×DataFrame-scan) inner loop that filtered the entire decomposed mapping twice per (i,j) pair.

11 new tests in tests/test_best_reaction_match.py lock the contract:
- counts matrix mechanics (no overlap, partial, full, multi-pair)
- square Hungarian picks the optimum
- non-square padding correctly DROPS the extras (the subtle path that would silently mislabel virtual reactions if it broke)
- empty input or output returns no matches

With those in place, the rewrite:
- Pre-builds a uid → component-set dict once instead of filtering the DataFrame per pair. Quadratic scans become quadratic set intersections, orders of magnitude faster on big pathways.
- find_best_match_both_decomposed_reactions converts inputs/outputs to lists once and reuses them, instead of re-listing per matched pair.
- Drops the unused `count` from the matched-pair zip loop.
- Returns a tuple instead of a 2-element list (callers destructure identically).
- Removes the find_best_reaction_match string-to-set coercion: the one production caller passes Set[UID], so the coercion was dead defensive code that would have masked caller bugs.
- Adds a module docstring explaining the padding/drop contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
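The counts-matrix half of the rewrite can be sketched as follows (function and argument names are illustrative; the real code also feeds this matrix to a Hungarian solver such as scipy.optimize.linear_sum_assignment):

```python
def build_counts_matrix(input_uids, output_uids, components_by_uid):
    """Component-overlap counts via set intersection.

    components_by_uid is built once per reaction (uid -> set of component
    ids), replacing the old pattern of filtering the whole decomposed
    mapping DataFrame twice per (i, j) pair.
    """
    return [
        [len(components_by_uid[i] & components_by_uid[o]) for o in output_uids]
        for i in input_uids
    ]
```

The matrix stays O(n×m), but each cell is now a set intersection over small sets instead of two full-DataFrame scans, which is where the orders-of-magnitude win on big pathways comes from.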
When a protein regulated multiple reactions (or appeared as both catalyst and regulator) and didn't otherwise show up as a regular VR input/output, append_regulators minted a fresh UUID for every regulator-row emission. The result: MDM2 regulating R1 and R2 became two disconnected nodes in the graph topology, both labeled MDM2 only in the secondary stid mapping. Perturbing one wouldn't propagate to the other; the same shape as the boundary-leaf disconnect fixed earlier.

Fix: when minting a fresh UUID for a regulator/catalyst member, record it in stid_to_existing_uuid so the next emission for the same stId reuses it. The same simple invariant that already governed boundary leaves and registry-seeded reuse.

Two new tests in TestRegulatorSharedAcrossReactions lock the contract: the same protein on two reactions shares one UUID, and a protein appearing as both catalyst and regulator shares one UUID.

Also drops the dead defensive fallback that swapped a NaN entity_id for the regulator-link UUID: the link UUID is not a Reactome stable ID, the fetch query never returns rows with NaN entities, and the substitution would have silently passed garbage to _decompose_regulator_entity. Letting a real upstream gap surface is the right behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
catalyst_map and the regulator maps stored the same kind of data (reaction stable id, regulator entity stable id, edge type, link uuid, reaction uuid) under different column names: catalyst_map used reaction_id and catalyst_id; regulator_map used reaction and PhysicalEntity. Every consumer had two branches to read the same field. The Cypher queries returned the same shape under three names.

Now all three DataFrames use the same _CAT_REG_COLUMNS: [reaction_id, entity_id, edge_type, uuid, reaction_uuid]

Knock-on cleanup:
- _bulk_fetch_regulators is gone; _bulk_fetch_reaction_entity_links is one helper that produces both kinds.
- regulator_configs in append_regulators no longer carries an entity_col; the column is always 'entity_id'.
- The cat_reg_entity_ids collection is one loop over the concat of all three maps instead of two loops with different column names.
- export_uuid_to_reactome_mapping's catalyst-vs-regulator branching collapses to a single line reading row['entity_id'].
- Test fixtures in test_regulators_and_catalysts updated to the new column names; the renames were mechanical (reaction → reaction_id, catalyst_id → entity_id, PhysicalEntity → entity_id).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Profiling Autophagy showed pd.concat dominating the run at 12.4s of 28s (44%), called 2,139 times across get_uids_for_iterproduct_components, get_broken_apart_ids, and _emit_entityset_provenance_rows. Each concat copies the entire growing DataFrame, making decomposition O(N²) in total rows. On big pathways with many EntitySet expansions (Cell_Cycle hit 4 hours; Disease never finished) this dominated runtime even after the best_reaction_match optimization landed.

Replace the module-level decomposed_uid_mapping DataFrame with a _DecompositionStore: a Python list of row dicts plus two side indexes (uid → row indices, reactome_id → row indices) that the hot lookups in break_apart_entity, get_broken_apart_ids, get_uids_for_iterproduct_components, and _emit_entityset_provenance_rows now consult instead of scanning a DataFrame. The pandas DataFrame is built exactly once at the end of get_decomposed_uid_mapping.

find_best_reaction_match still expects a DataFrame; it gets a small per-reaction slice (built from the store via dataframe_for_uids) containing only the rows for that reaction's input/output combinations, cheap to build and cheap to scan.

Tests in test_decomposition_semantics.py updated to clear the store in the fixture and materialize a DataFrame view via _store_df() when assertions need filter expressions.

Autophagy: 28s → 9s (3x). Circadian_clock similar. The real win is on big pathways where N gets large; the preliminary expectation is hours → minutes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
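The store idea reduces to a list plus side indexes; a minimal sketch (class and method names are illustrative, and the column set here is a subset of the real one):

```python
from collections import defaultdict
import pandas as pd

class DecompositionStore:
    """Append-only row store with O(1) appends and indexed lookups;
    the DataFrame is materialized exactly once at the end instead of
    being rebuilt by pd.concat on every append (the O(N^2) trap)."""

    def __init__(self):
        self._rows = []
        self._by_uid = defaultdict(list)
        self._by_reactome_id = defaultdict(list)

    def append(self, row: dict) -> None:
        idx = len(self._rows)
        self._rows.append(row)
        self._by_uid[row["uid"]].append(idx)
        self._by_reactome_id[row["reactome_id"]].append(idx)

    def rows_for_uid(self, uid):
        # Indexed lookup: no DataFrame scan in the hot path.
        return [self._rows[i] for i in self._by_uid.get(uid, [])]

    def to_dataframe(self) -> pd.DataFrame:
        return pd.DataFrame(self._rows)
```

Building the DataFrame once from a list of dicts is O(N); growing it via repeated concat copies the whole frame each time, which is exactly the 44% of runtime the profile surfaced.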
Hungarian matching produces an optimal 1-to-1 assignment, but real reactions routinely have unequal input vs output combination counts:
- EntitySet expansion can yield (e.g.) 3 input combinations and 2 output combinations; each alternative is a real biological path.
- Cleavage reactions produce one input and several fragment outputs; per the Reactome curator guide's IMBALANCE QA check this is legitimate, not an error. Fragments don't share a refEntity with the input molecule, so overlap is zero, but every input→fragment edge is still real.
- Modifications (phospho, ubiquitin) and transcription resolve separately via component_id_or_reference_entity_id: modified and unmodified forms share a reference entity, so the matcher pairs them automatically.

The previous code padded the cost matrix and dropped any pair that hit the padding (i.e. dropped surplus inputs or outputs). Now, after the Hungarian assignment, surplus inputs get paired with their best-overlap output (and surplus outputs with their best-overlap input). Symmetric on both sides. Zero-overlap pairs are still emitted; cleavage products need them.

Tests in test_best_reaction_match.py replace the old "drops extras" expectations with "fans out symmetrically", and add a cleavage fixture that proves a 1-input → 3-fragment-output reaction emits all three input→fragment pairs.

Documentation in docs/DESIGN_DECISIONS.md gets a new "Surplus inputs/outputs fan out" section that points to the curator-guide basis (cleavage QA exemption, refEntity-based mod handling).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
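The fan-out step can be sketched as a post-pass over the Hungarian result (names and the overlap callback are illustrative; the real code works on counts-matrix indices):

```python
def fan_out_surplus(matches, input_uids, output_uids, overlap):
    """After the 1-to-1 Hungarian assignment, pair each surplus input
    with its best-overlap output and each surplus output with its
    best-overlap input. Zero-overlap pairs are kept: cleavage fragments
    legitimately share no components with their input."""
    matched_inputs = {i for i, _ in matches}
    matched_outputs = {o for _, o in matches}
    extra = []
    for i in input_uids:
        if i not in matched_inputs and output_uids:
            extra.append((i, max(output_uids, key=lambda o: overlap(i, o))))
    for o in output_uids:
        if o not in matched_outputs and input_uids:
            extra.append((max(input_uids, key=lambda i: overlap(i, o)), o))
    return matches + extra
```

Note the symmetry: surplus on either side fans out, and the ties broken by max() don't matter for the zero-overlap cleavage case since every candidate scores zero.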
Phase 2 of create_pathway_logic_network's UUID assignment was doing a full registry rescan on every merge: when two UUIDs needed to be merged, the code iterated every entry of entity_uuid_registry to rewrite values. With many merges and a registry that grows into the hundreds of thousands of entries on big pathways (RAF_MAP_kinase_cascade hits ~512K), this is O(N²) and was the dominant cost.

Now merges record source→target unions in a separate uuid_unions map (deferred, no scan), and a single _canonicalize_registry pass after Phase 2 rewrites every registry value to its union-find root. _uf_find does path compression on each lookup, so resolving a UUID is amortized O(α(N)).

_get_or_create_entity_uuid grew an optional uuid_unions parameter; when callers don't pass one (single-shot use, tests), a fresh map is allocated. The three TestInterReactionConnectivity tests that read directly from the registry after merges now thread an explicit unions dict and call _canonicalize_registry before assertions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
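The deferred-merge scheme is standard union-find; a minimal sketch under the names used above (the real _uf_find / _canonicalize_registry signatures may differ):

```python
def uf_find(parent: dict, x):
    """Find the union-find root of x, with path compression."""
    root = x
    while parent.get(root, root) != root:
        root = parent[root]
    while parent.get(x, x) != x:  # compress the path just walked
        parent[x], x = root, parent[x]
    return root

def record_union(parent: dict, source, target):
    """Deferred merge: record source -> target; no registry scan."""
    parent[uf_find(parent, source)] = uf_find(parent, target)

def canonicalize_registry(registry: dict, parent: dict) -> None:
    """Single pass rewriting every registry value to its root."""
    for key, val in registry.items():
        registry[key] = uf_find(parent, val)
```

Each merge becomes O(α(N)) instead of O(N), and the one canonicalization pass replaces the per-merge rescans that dominated on ~512K-entry registries.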
Several MP-BioPath curator prediction files use slightly different column names: 'key output' (with a space) and 'key_outout' (a typo) instead of the canonical 'key_output'. A handful also have stray trailing tabs producing one-too-many fields on isolated rows.

Detect any of the known aliases and rename to 'key_output' before proceeding; switch to the python parser engine with on_bad_lines='warn' so the trailing-tab rows are handled leniently. Recovers ~300 additional test cases (12,589 → 12,895) without code changes elsewhere.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Generic Boolean perturbation propagator that runs against the generated networks and compares to the MP-BioPath curator-prediction test set (Sundararaman et al., 2017: 84 pathways with curator-predicted input/output state changes, 10 with empirical evidence).

bin/validate-against-mpbiopath.py:
- Maps perturbation gene names to all UUIDs whose stable ID's reference entity has that gene name (catches every modification/compartment form).
- Pins those UUIDs to 0 (knockout) or 2 (upregulation), runs synchronous Boolean propagation over a 3-state lattice with AND on input/catalyst/positive-regulator edges, OR on output/dissociation, and inversion for negative regulators.
- Reads the predicted state at each key-output UUID; compares to the curator.
- Builds adjacency and incoming-edge indexes once per pathway via numpy iteration (iterrows on 1.5M-row networks was unusably slow).
- Categorizes each failing case: gene_not_in_network, keyoutput_not_in_network, no_path, propagator_missed, or false_positive_change. Distinguishes test-set drift (v86→v96 entity retirements) from propagator limits from potential generation bugs.
- Reports both raw accuracy (counts drift cases as defaults) and valid-only accuracy (drift cases excluded, the honest number).

validation_results/ holds the run output:
- 70.55% on 12,895 currently-runnable tests (compared to MP-BioPath's ~75% on v86 networks with its tuned algorithm).
- 60% of failures are propagator limits, 37% are no-path (network bug or biology change), 2.6% are false positives.
- README documents methodology and how to read the failure categories.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…heck
bin/validate-against-mpbiopath.py grew a --ground-truth flag that picks
between 'curator' (default, 84 pathways) and 'experimental' (10
pathways with measured perturbation outcomes). The truth-table
loader is shared and tolerant of the schema variants between the two
file sets (key_output / key output / key_outout / key_ouput, trailing
tabs on stray rows). -999 cells in the experimental data are skipped.
bin/check-no-path-cases-in-neo4j.py: for each no_path failure case
from the validation, build the pathway's reaction-flow graph from
current Neo4j (entity → reaction → entity edges via input/catalyst/
regulator → output) and BFS for a path from the perturbed gene's
entities to the key-output entity. Decomposes via hasComponent /
hasMember / hasCandidate so a perturbation on a Complex matches its
sub-entities. Output classifies each case as bug_candidate (Neo4j has
a path; logic network missed it) or pathway_changed (Neo4j has no
path either; v86→v96 biology change).
Headline result vs curator (84 pathways, 12,895 valid tests):
- 70.55% accuracy
- 156 bug_candidates / 12,895 = 1.2% of valid tests potentially
indicate a generation issue (concentrated in 9 pathways, dominated
by Transcriptional_Regulation_by_TP53 / MDM2 with 80 cases)
- 17.7% propagator_missed (path exists, simple Boolean can't carry
the perturbation — propagator limit, not network)
- 9.8% pathway_changed (Reactome dropped a connection between v86
and v96; not a bug)
Vs experimental (10 pathways, 499 valid tests): 34.07%. Lower than
vs-curator because the experimental dataset is 90% non-normal cases,
heavily exposing the Boolean propagator's inability to propagate "up"
signals through AND gates whose other inputs sit at baseline. A
propagator limit, not a network limit.
validation_results/README.md updated with all of this so it's the
boss-facing summary.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two metrics reported side by side now:
- End-to-end accuracy (70.55%): directly comparable to MP-BioPath's
  published 75% — the network plus a generic Boolean propagator vs
  curator predictions across 12,895 valid tests.
- Network correctness (98.3%): of cases where the network's connectivity
  is what determines the answer, the network agrees with the curator.
  Excludes cases attributable to propagator limits or v86→v96 test-set
  staleness.

The latter is the cleanest measure of whether the logic-network
generation itself is correct. The former is what's directly comparable
to published numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
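The split between the two metrics can be sketched as a filter over categorized results. The category labels here are illustrative (the real output uses the failure taxonomy above, e.g. propagator_missed); only the exclusion logic is the point.

```python
def summarize(results):
    """results: dicts with 'correct' (bool) and 'category' (str).
    Returns (end_to_end_accuracy, network_correctness)."""
    # Drop test-set drift first: both metrics use valid tests only.
    valid = [r for r in results if r["category"] != "test_set_drift"]
    end_to_end = sum(r["correct"] for r in valid) / len(valid)
    # Network correctness: keep only cases where connectivity decides
    # the answer, i.e. exclude propagator-limit failures.
    network_cases = [r for r in valid
                     if r["category"] != "propagator_missed"]
    network = sum(r["correct"] for r in network_cases) / len(network_cases)
    return end_to_end, network
```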
Hand-traced the largest two clusters of "bug_candidate" failures (80
cases of MDM2 in Transcriptional_Regulation_by_TP53, 18 cases of PTEN
in PIP3_activates_AKT_signaling). Same structural pattern in both: the
curator's predicted effect depends on mass-action substrate depletion
(more enzyme flux → less substrate available downstream).

Our directed-flow network correctly emits every Reactome reaction with
its inputs, outputs, catalysts, and regulators, but does not emit edges
representing "this reaction's flux depletes its substrate." So
perturbing an enzyme propagates forward to the reaction's products but
not backward to "less of the substrate is left." That's the gap.

This is a known limitation of pathway-diagram-derived Boolean networks,
not a generation bug. Two ways to address it if needed: explicit
negative consumption edges (a semantic enhancement to the generator),
or letting alphaSignal's parameter learning absorb the mass-action
effect implicitly. Neither is a bug fix, because nothing is broken.

validation_results/README.md gets the full finding so it's the
boss-facing record.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
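The "explicit negative consumption edges" option could look roughly like this: for each input edge, emit a reversed neg edge so increased reaction activity can propagate "less substrate left" downstream. The column names follow the input/output/catalyst edge rows described in this repo, but are assumptions here, and this is a sketch of the idea, not a proposed patch.

```python
import pandas as pd

def add_consumption_edges(edges: pd.DataFrame) -> pd.DataFrame:
    """Given an edge table with source/target/edge_type/pos_neg columns,
    append a hypothetical 'consumption' edge (reaction -> input, neg)
    for every input edge (input -> reaction)."""
    inputs = edges[edges["edge_type"] == "input"]
    consumption = inputs.rename(
        columns={"source": "target", "target": "source"}  # reverse direction
    ).assign(edge_type="consumption", pos_neg="neg")
    return pd.concat([edges, consumption], ignore_index=True)
```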
- Rewrite README to reflect current features, env-var-based Neo4j
  config, R-HSA-prefixed pathway IDs, the per-pathway output layout,
  assembly/dissociation edge types, and the 70.55% / 98.3% validation
  framing.
- Delete .flake8 (project uses ruff; the flake8 config was unused).
- Drop the mypy step from CI — it ran with continue-on-error, masking
  19 type errors. Re-introduce when the codebase is ready to enforce it.
- Move sets-in-reactome-that-cause-combinatorial-explosion.txt under
  docs/ so the repo root is cleaner.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Most errors were genuine type-annotation gaps; two were pandas-stubs
strictness on dtype dicts. None were runtime bugs — the code worked,
the annotations just didn't reflect reality.

Notable fixes:
- get_broken_apart_ids: the parameter was annotated list[set[str]] but
  the function actually accepts heterogeneous list[set[str] | str].
  Corrected to Sequence[Set[str] | str] and added an explicit assert in
  the str-only branch so mypy can narrow.
- best_reaction_match: input/output_reactions widened from Sequence to
  Iterable since callers pass sets; added Optional defaults for
  reaction_id; renamed surplus-pairing loop variables to avoid
  Optional[int] reassignment churn.
- reaction_generator: renamed `member_ids` (rebound from set→dict
  across branches inside break_apart_entity); explicit list-typing on
  reaction_ids before prefetch_entity_data.
- logic_network_generator: explicit None-guard on stoichiometry int
  conversion (pd.isna doesn't narrow); str() coercion on uid groupby
  keys; deleted a dead pd.unique block whose result was never used.
- decomposed_uid_mapping_column_types: typed Dict[str, Any] so pandas
  read_csv(dtype=...) and DataFrame.astype(...) overloads accept it.

CI: re-added the mypy step (no continue-on-error this time) so future
type regressions block merges.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Dependencies (#9, #10):
- Bumped pyarrow ^15→^17, python-dotenv ^1.0.1→^1.2.2, pytest ^8.4→^9.0
  to clear 17 CVEs flagged by pip-audit. Refreshing the lock cleared
  another 6 transitive vulnerabilities (urllib3, certifi, pygments,
  filelock, virtualenv).
- pytest 9 requires Python 3.10+; Python 3.9 reached EOL 2025-10-31, so
  bumped the project floor to 3.10. README and pyproject.toml updated.
- Project-dep CVEs went from 23 → 0. The remaining 6 are in
  pip/setuptools themselves (env tooling, not project deps).

Coverage (#11):
- Added pytest-cov to the unit-tier CI step with a 40% floor (current
  unit coverage is 44%). The floor is intentionally below current
  coverage to avoid false-positive failures on small reorganizations;
  it acts as a regression bar, not a target. Higher floors are
  warranted once the database/integration tiers can run in CI.

Reactome version tracking (#12):
- Bumped the docker-compose Neo4j image Release94 → Release96 (the
  version the validation suite was actually run against — the old tag
  was drift).
- Added tests/test_reactome_version.py: a database-tier sentinel that
  reads the loaded Neo4j's DBInfo.version and prints it into the test
  log. Doesn't pin to a specific version; just records what each run
  used so a Reactome-correlated regression is easy to spot.
- Documented the upgrade workflow in README ("Tracking new Reactome
  releases" subsection) so bumping the image tag and re-running the
  database tier is the documented path forward.

CI hygiene:
- actions/checkout v3→v4, actions/setup-python v4→v5 in both workflows.
- ruff workflow Python 3.9→3.12 (matches test.yml).
- src/ ruff and mypy still clean; the existing tests/ ruff issues are
  pre-existing and not blocking (the CI ruff job only checks src/ and
  bin/).
- Fixed 3 small ruff issues in bin/ scripts (unused import,
  no-placeholder f-strings).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
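A 40% coverage floor with pytest-cov is typically wired into the CI step like this; the step name, marker expression, and use of poetry are assumptions here, while `--cov` and `--cov-fail-under` are standard pytest-cov flags.

```yaml
# Hypothetical shape of the unit-tier CI step with the coverage gate.
- name: Unit tests with coverage floor
  run: |
    poetry run pytest -m "not database and not integration" \
      --cov=src --cov-fail-under=40
```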
CI: test against Python 3.10, 3.11, 3.12 (was 3.12 only). Now matches
the supported floor declared in pyproject.toml; catches regressions on
the lowest supported version before they hit users.

CONTRIBUTING.md was stale on multiple fronts:
- Said Python 3.9; bumped to 3.10 to match pyproject.
- Referenced docker-compose Release94 in raw `docker run` form;
  switched to `docker-compose up -d` since the compose file is the
  pinned source of truth.
- Mentioned only the `database` marker; the integration tier exists too
  and the two have different prerequisites — documented both.
- Linked CHANGELOG.md, which doesn't exist; removed the link.
- Pointed at `mypy --ignore-missing-imports`, which is laxer than CI's
  `mypy src/`. Aligned the documented command with what CI actually
  enforces.
- Added a coverage-gate section (40% floor) so contributors aren't
  surprised by the failure mode.

Pre-commit: the mypy hook used `types-all`, which was archived years
ago and breaks on `pre-commit run --all-files`. Replaced it with the
specific stub we actually need (pandas-stubs); restricted to src/ via
`files:`; dropped --ignore-missing-imports so the hook matches CI
behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
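The described pre-commit hook shape would look roughly like this in .pre-commit-config.yaml; the `rev` is a placeholder, and the mirrors-mypy repo is the conventional way to run mypy under pre-commit.

```yaml
# Hypothetical hook entry matching the description above.
- repo: https://github.com/pre-commit/mirrors-mypy
  rev: v1.11.0   # placeholder; pin to whatever the project uses
  hooks:
    - id: mypy
      files: ^src/                         # matches CI's `mypy src/`
      additional_dependencies: [pandas-stubs]
```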
tests/test_comprehensive_validation.py scanned `output/` at module
import time (to populate AVAILABLE_PATHWAYS for parametrize), but
without checking that the directory exists. In CI there's no output/ —
the scan raised FileNotFoundError during pytest collection, which fails
the whole run before a single test executes.

The other test files that scan output/ already guard with
`if not output_dir.exists(): return []`. Added the same guard here.

Verified by running the unit-tier command in a fresh clone with no
output/ directory: 116 passed, 2 skipped, 0 errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
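The guard pattern in question, as a self-contained sketch (the function name is illustrative; the real code populates AVAILABLE_PATHWAYS for parametrize):

```python
from pathlib import Path

def discover_pathways(output_dir: Path = Path("output")) -> list[str]:
    """Scan output/ for generated pathway directories at collection time.
    Returning [] when the directory is missing keeps pytest collection
    alive in fresh clones and CI, where nothing has been generated yet."""
    if not output_dir.exists():
        return []
    return sorted(p.name for p in output_dir.iterdir() if p.is_dir())
```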