
Feature/bulk prefetch decomposition #37

Merged
adamjohnwright merged 37 commits into main from
feature/bulk-prefetch-decomposition
Apr 29, 2026
Conversation

@adamjohnwright
Contributor

No description provided.

adamjohnwright and others added 6 commits October 29, 2025 22:11
This PR adds significant improvements to the logic-network-generator:

## Testing (43 → 52 tests, all passing)
- Added 9 comprehensive tests for regulators and catalysts
- Verifies negative regulators have pos_neg='neg'
- Verifies positive regulators have pos_neg='pos'
- Verifies catalysts have pos_neg='pos' and edge_type='catalyst'
- Added integration tests validating real network files
- Fixed Neo4j import mocking in all test files

## Documentation
- Created Architecture Overview (docs/ARCHITECTURE.md)
  - Complete system architecture and data flow
  - Virtual reactions and edge semantics
  - AND/OR logic rules and design decisions
- Created comprehensive examples (examples/)
  - Working example script with analysis
  - Usage patterns and troubleshooting guide
  - Example pathways table with complexity ratings
- Added CHANGELOG.md documenting all improvements
- Added test suite documentation (TEST_SUITE_SUMMARY.md)
- Added GitHub Actions badge to README

## Error Handling
- Enhanced Neo4j connector with specific exceptions
- Added informative error messages with troubleshooting steps
- Improved pathway generator logging
- Added graceful handling of file I/O errors

## Code Quality
- Added type hints to remaining functions
- Added type annotations to variables
- Fixed mypy warnings
- Enhanced docstrings with exception documentation

## Files Modified
- src/neo4j_connector.py - Error handling
- src/pathway_generator.py - Logging and error handling
- src/reaction_generator.py - Type hints
- src/logic_network_generator.py - Type annotations
- tests/*.py - Mock fixes and new regulator tests
- README.md - Updated documentation and test count
- CHANGELOG.md - Comprehensive change documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Enhancements:
- Added comprehensive CLI argument parsing with argparse
- Optional authentication (no auth by default, --username/--password when needed)
- Custom output file path via --output flag
- Species filtering with --all-species flag
- Debug and verbose logging modes
- Comprehensive error handling and validation
- Query result validation with informative errors
- File I/O error handling with troubleshooting hints
- Progress reporting and statistics summary
- Full type hints throughout
- Enhanced README documentation with examples
- Passes mypy and ruff linting checks

The script now follows production best practices with proper error
handling, logging, and user-friendly CLI interface.

Script size: 70 → 345 lines (roughly 5× growth, covering the new functionality)

Tested successfully: Generated mapping file with 78,154 entities
Changes:
- Added missing Union import in src/reaction_generator.py
- Removed unused pytest imports from test files
- Converted f-strings without placeholders to regular strings
- Removed unused Mock import from test_regulators_and_catalysts.py

All ruff checks now pass. Verified with: poetry run ruff check src/ bin/ tests/
Integration tests that require pathway_logic_network_69620.csv now skip
gracefully when the file doesn't exist (e.g., in CI environments).

Changes:
- Added pytest.mark.skipif to test_network_invariants.py
- Added pytest.mark.skipif to test_actual_edge_semantics.py
- Tests pass when file exists (52 passed)
- Tests skip when file missing (14 skipped, 38 passed)

This allows CI to pass while still running integration tests locally.
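The skip pattern described above can be sketched as a reusable marker. This is a minimal illustration, not the repo's actual test code; the filename comes from the commit message and the placeholder test body is invented.

```python
import os

import pytest

# File produced by a prior pathway generation run (per the commit above);
# absent in CI, so dependent tests should skip rather than fail.
NETWORK_FILE = "pathway_logic_network_69620.csv"

requires_network_file = pytest.mark.skipif(
    not os.path.exists(NETWORK_FILE),
    reason=f"{NETWORK_FILE} not generated (e.g. CI environment)",
)


@requires_network_file
def test_network_has_rows():
    # Placeholder assertion; the real invariants live in
    # test_network_invariants.py and test_actual_edge_semantics.py.
    with open(NETWORK_FILE) as fh:
        assert sum(1 for _ in fh) > 1
```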
adamjohnwright and others added 23 commits April 27, 2026 09:58
Remove 15 historical markdown files (Nov 2025 bug analyses, fix
summaries, status reports) — bugs they describe are now fixed and
covered by tests. Distill the three load-bearing pieces (position-aware
UUID design, EntitySet expansion behavior, decomposition loop semantics)
into docs/UUID_DESIGN.md and docs/DESIGN_DECISIONS.md.

Delete 5 root-level scratch scripts (analyze_loops, investigate_loops,
validate_generated_network, validate_pathway, test_position_aware) —
all superseded by the tests/ suite.

Drop CHANGELOG.md; pre-1.0 with no users, commits/PRs are the history.

Update README.md, docs/ARCHITECTURE.md, src/logic_network_generator.py,
and .github/pull_request_template.md to point at the new doc locations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Register a new `integration` marker for tests that depend on
artifacts in output/ from a prior pathway generation run. Apply it
to the four files that were previously skipping silently via
pytest.mark.skipif: test_network_invariants, test_actual_edge_semantics,
test_uid_reaction_connections, test_uuid_mapping_export.

CI now runs `pytest -m "not database and not integration"` — only
the unit tier (101 tests, ~4s, no fixtures). The integration and
database tiers stay runnable locally; tests/README.md documents how.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two test files were written when reaction IDs in cache files were
numeric, then silently broke when the codebase migrated to R-HSA
stable IDs ("R-HSA-9665106"). The breakage was masked by a fixture
parser that filtered all R-HSA-named pathway dirs out of
AVAILABLE_PATHWAYS, so the affected tests collected zero parameters
and reported as NOTSET-skipped.

Fixes:
- find_pathway_dir / get_available_pathways: extract trailing digits
  via regex instead of endswith("_{id}"), handling both "Foo_12345"
  and "Foo_R-HSA-12345" naming.
- test_all_reactions_present: query Neo4j for stId, drop the now-
  unnecessary numeric comparison.
- test_all_reactions_in_decomposition / test_entity_sets_are_decomposed:
  drop astype(int) on R-HSA-prefixed columns; query Neo4j for stId.
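The trailing-digit extraction described in the first fix can be sketched as below. The function name is illustrative, not the repo's actual helper; the point is that one regex accepts both the numeric and the R-HSA-prefixed naming.

```python
import re

# Matches a trailing Reactome ID, with or without the R-HSA- prefix,
# e.g. "Foo_12345" and "Foo_R-HSA-12345" both yield "12345".
_TRAILING_ID = re.compile(r"(?:R-HSA-)?(\d+)$")


def extract_pathway_id(dirname: str):
    match = _TRAILING_ID.search(dirname)
    return match.group(1) if match else None
```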

Also delete 28 stale output/ directories from the pre-stable-ID era
(they lacked stid_to_uuid_mapping.csv, which all current code paths
emit). The integration tier was failing on each one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously and_or carried the within-entity decomposition logic for
catalysts and regulators (Complex → 'and' across members, EntitySet →
'or' across alternatives). That conflicted with the reaction-level
reading the column carries elsewhere — inputs are 'and' because they
are required for the reaction to proceed, outputs are 'or' when they
have multiple producers.

Switch to a single reaction-level meaning: anything that contributes
to the reaction proceeding (catalyst, positive regulator) is 'and';
anything that blocks it (negative regulator) is 'or' because any one
blocker suffices. Positive regulators get 'and' even though biologically
the reaction can sometimes proceed without them — parameter learning
later models the conditional dependency. The Complex/EntitySet
decomposition tree is preserved in decomposed_uid_mapping.csv.
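The reaction-level rule above reduces to a small mapping. The edge-kind names here are illustrative; in the real code the column is set by append_regulators from pos_neg, not by a standalone function.

```python
def and_or_for(edge_kind: str) -> str:
    # Anything that contributes to the reaction proceeding is 'and';
    # anything that blocks it is 'or' (any one blocker suffices).
    contributes = {"catalyst", "positive_regulator"}
    blocks = {"negative_regulator"}
    if edge_kind in contributes:
        return "and"
    if edge_kind in blocks:
        return "or"
    raise ValueError(f"unknown edge kind: {edge_kind!r}")
```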

Tests updated to reflect the new contract:
- test_and_logic_consistency now allows positive regulators in AND.
- New test_negative_regulators_are_or asserts neg regulators are OR.
- New test_positive_regulators_are_and asserts pos regulators are AND.
- test_and_or_logic_per_type and test_entityset_catalyst_*: updated
  expectations to match the reaction-level semantic.

Also: add the missing pytest.mark.database to test_autophagy_validation
(it was silently skipping in CI because the marker was absent), and
update its hardcoded PATHWAY_DIR to the current R-HSA naming.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
source_entity_id was previously populated only for one path: a Complex
that contains an EntitySet, when decomposed. Bare-EntitySet inputs
(common case) lost the parent on the way back up to the reaction level
because break_apart_entity's EntitySet branch returned the flat set of
leaves without writing any rows. Some pathways (e.g. Circadian_clock
9909396) had zero rows with source_entity_id set despite having 16
EntitySets in Neo4j; even Autophagy hit only 1.7% population.

The fix adds an EntitySet provenance write inside break_apart_entity:
when an EntitySet S decomposes to leaves {A, B, ...}, emit one row per
leaf with reactome_id=S, component_id=leaf, source_entity_id=S. Leaves
that are themselves UIDs from a nested decomposition expand to their
component_ids so each component carries S as a parent.

The existing reactome_id-keyed cache lookup at the top of
break_apart_entity finds these rows on re-entry and short-circuits
correctly, so a given EntitySet's provenance is written exactly once.

Provenance row uids are deterministic sha256(entity_set|leaf|component)
hashes that never escape break_apart_entity's return value, so
downstream callers (best_reaction_match, _build_uid_index,
_get_reactome_id_from_hash) never receive them and treat them as the
internal markers they are.
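The deterministic provenance uid can be sketched as below. The exact field order and separator used inside break_apart_entity may differ; this only illustrates that the hash is a pure function of its three parts, so repeated decomposition of the same EntitySet yields the same uids.

```python
import hashlib


def provenance_uid(entity_set: str, leaf: str, component: str) -> str:
    # Deterministic: same (entity_set, leaf, component) always hashes the same.
    payload = f"{entity_set}|{leaf}|{component}".encode()
    return hashlib.sha256(payload).hexdigest()
```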

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A previous round nearly broke the matching layer by collapsing two
separate concerns: how decomposition serves cross-reaction matching
(plumbing) vs. how it serves perturbation in the final network
(output). Same word, two different rules. The fix wasn't more code —
the fix is making the rules explicit so future-me/future-LLM can't
make the same mistake.

Three pieces:

- docs/DESIGN_DECISIONS.md gets a "Complex vs EntitySet" section
  laying out the biological semantic and a "Two layers of
  decomposition" section spelling out the matcher/output split.
  Catches human-level confusion before code is written.

- src/reaction_generator.py gets a module-level docstring stating
  which layer it serves and what NOT to do here, with a pointer at
  the design doc.

- tests/test_decomposition_semantics.py codifies the contract as
  pure unit tests with mocked Neo4j: EntitySet is flat alternatives,
  simple Complex is atomic, Complex-with-EntitySet is cartesian,
  ubiquitin is intact, repeated calls don't duplicate provenance
  rows. Each test has a one-line "if you're about to delete this,
  read the doc first" message in its assertion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root-input and terminal-output complexes now get synthetic edges
exposing their leaf components, so that individual proteins can be
perturbed at the network entry and read at the exit. Intermediate
complexes — those produced by some reaction AND consumed by another —
are deliberately left intact. They're the actual biological species
flowing between reactions, and the AB dimer is a different molecule
from free A and free B.

Two pieces:

- src/reaction_generator.py: new get_terminal_components(entity_id),
  a pure recursive read that walks Complex/EntitySet structures down
  to leaves. Used only for boundary expansion in the output layer; it
  is the counterpart to break_apart_entity's matching-layer rules
  and intentionally treats Complexes differently. Module docstring
  already calls out the two-layer split; this helper lives next to
  the matching helpers but is documented to belong to the output side.

- src/logic_network_generator.py: new _emit_boundary_decomposition_edges
  runs after append_regulators. For each root-input complex C with
  leaves {A, B, ...}, emit A→C, B→C, ... edges of edge_type='assembly';
  for each terminal-output complex, emit C→A, C→B, ... of
  edge_type='dissociation'. Each leaf shares one UUID across all
  boundary contexts, so perturbing a leaf at assembly propagates
  through any matching dissociation downstream.

Verified on R-HSA-9909396 (Circadian_clock): 14 assembly + 75
dissociation edges across 33 unique boundary leaves. All have
pos_neg='pos', and_or='and'. Network total grew from 1366 → 1455
edges; perturbation of individual proteins is now possible at
boundaries.
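The boundary expansion described above can be sketched as follows. The data shapes (plain dicts and lists) are assumptions for illustration, not the repo's real API; the edge fields match the verified output.

```python
def boundary_edges(root_inputs, terminal_outputs, leaves_of):
    """Emit leaf→complex 'assembly' edges at root inputs and
    complex→leaf 'dissociation' edges at terminal outputs."""
    edges = []
    for c in root_inputs:
        for leaf in leaves_of[c]:
            edges.append({"source": leaf, "target": c,
                          "edge_type": "assembly",
                          "pos_neg": "pos", "and_or": "and"})
    for c in terminal_outputs:
        for leaf in leaves_of[c]:
            edges.append({"source": c, "target": leaf,
                          "edge_type": "dissociation",
                          "pos_neg": "pos", "and_or": "and"})
    return edges
```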

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three new tests in test_network_invariants.py:
- assembly edges are pos/and
- dissociation edges are pos/and
- no self-loops on boundary edges

Tests parametrize over generated pathway dirs and skip cleanly when
a given pathway has no boundary edges, which is fine: the invariants
fire once a pathway's network actually contains these edges.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
_emit_boundary_decomposition_edges previously minted a fresh UUID for
every leaf of a root/terminal complex without checking whether that
leaf's stId already had a UUID elsewhere in the network. So if MDM2
appeared as both a regulator (one UUID) and a leaf of the MDM2:TP53
root complex (a different fresh UUID), the two were disconnected
nodes for the same biological protein. Perturbing one wouldn't
propagate to the other — which kills the perturbation use case
boundary expansion exists to enable.

The fix builds a stId → UUID lookup from reactome_id_to_uuid (which
by this point holds every UUID minted by the VR phases and by
append_regulators) and consults it in _leaf_uuid before minting.
Existing UUIDs win; fresh ones are minted only for stIds new to the
network.
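The reuse-before-mint rule can be stated in a few lines. A plain dict stands in here for the lookup built from reactome_id_to_uuid; the helper name echoes _leaf_uuid but the implementation is illustrative.

```python
import uuid


def leaf_uuid(st_id: str, stid_to_uuid: dict) -> str:
    existing = stid_to_uuid.get(st_id)
    if existing is not None:
        return existing               # existing UUIDs win
    fresh = str(uuid.uuid4())         # mint only for stIds new to the network
    stid_to_uuid[st_id] = fresh
    return fresh
```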

Two new lockdown tests in TestBoundaryLeavesReuseExistingUUIDs prove
the contract: a leaf with a pre-existing UUID reuses it; a leaf
without one mints fresh. Both have a one-line "if you're about to
break this, read the docs" message in the assertion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Six small cleanups, none changing observable behavior:

- create_reaction_id_map: drop the unused reaction_ids parameter the
  docstring already admitted was dead, and assert that incoming and
  outgoing hashes resolve to the same Reactome reaction. Hungarian
  pairing should guarantee this; the assert turns a silent quirk
  into a loud failure if it ever doesn't.

- Output edges: replace and_or="" for single-producer outputs with
  None. Empty string was inconsistent with the rest of the column
  (where "and" / "or" / NaN are the meaningful values) and produced
  serialization noise.

- Drop _extract_uid_and_reactome_values — defined, never called.

- _decompose_regulator_entity now returns 2-tuples (terminal_id,
  stoichiometry) instead of 3-tuples carrying a 'logic' field that
  has been ignored at the call site since the and_or = pos→and /
  neg→or rewrite. The decomposition rules in this helper still
  mirror break_apart_entity for matching consistency; AND/OR is set
  by append_regulators based on pos_neg, not by entity shape.
  Updated mocks in test_regulators_and_catalysts to match.

- find_root_inputs / find_terminal_outputs: dedupe with .unique()
  and drop NaNs symmetrically on both sides. The functions now
  return what their names suggest (a set-like list of distinct
  boundary UUIDs), and they handle nulls the same way.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously, importing src.neo4j_connector or src.logic_network_generator
opened a Neo4j connection at import time using hardcoded credentials.
This made pytest collection fail when Neo4j was unreachable and forced
test files to wrap imports in `with patch('py2neo.Graph')`.

Now the Graph is created lazily on first call to get_graph(), with
URL/user/password read from NEO4J_URL / NEO4J_USER / NEO4J_PASSWORD
(falling back to the previous defaults). A small _GraphProxy shim
keeps the existing `graph.run(...)` call sites working without a
mass-rewrite.
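The lazy-connection pattern can be sketched as below. py2neo.Graph is stood in for by an injectable factory so the sketch stays runnable without a server; the env-var names come from the commit message, but the fallback defaults shown are assumptions.

```python
import os

_graph = None


def get_graph(factory=lambda url, user, password: (url, user, password)):
    """Create the connection lazily on first call; later calls reuse it.
    `factory` stands in for py2neo.Graph; defaults below are placeholders."""
    global _graph
    if _graph is None:
        _graph = factory(
            os.getenv("NEO4J_URL", "bolt://localhost:7687"),
            os.getenv("NEO4J_USER", "neo4j"),
            os.getenv("NEO4J_PASSWORD", "neo4j"),
        )
    return _graph
```

Because nothing connects at import time, pytest collection no longer needs `with patch('py2neo.Graph')` around imports.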

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously, each Reactome reaction triggered three separate Cypher
queries (catalysts, positive regulators, negative regulators). For
a 100-reaction pathway that's 300 round-trips just for these
relationships. Now each kind is a single parameterized query that
UNWINDs the reaction list — three round-trips total per pathway.

The query bodies are also extracted to module-level _CATALYST_CYPHER
/ _POS_REG_CYPHER / _NEG_REG_CYPHER constants and use $reaction_ids
parameterization instead of f-string interpolation, ruling out a
class of escaping issues even though Reactome stable IDs aren't a
real injection risk.

A small _bulk_fetch_reaction_links helper handles the actual graph
call; _bulk_fetch_regulators consolidates the body shared between
positive and negative regulator fetching.

Output is unchanged: same DataFrames with the same columns and the
same per-(reaction_uuid, entity) rows. _execute_regulator_query is
removed — it was only useful in the per-reaction loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The audit assertion in create_reaction_id_map (added in the previous
cleanup commit) immediately fired when regenerating Circadian_clock,
exposing a real existing bug: the same hash can appear in
decomposed_uid_mapping under multiple reactome_ids, because hashes
are sha256 of sorted components and don't carry reaction context.
_get_reactome_id_from_hash returned the first match arbitrarily, so
virtual reactions could be labeled with the wrong source reaction
whenever two Reactome reactions happened to share an input or output
combination.

Fix: emit each best_match as a triple (input_hash, output_hash,
reaction_id) at the point the Hungarian assignment runs (which has
the reaction_id) instead of reverse-deriving it later from the hash.

- src.reaction_generator.decompose_by_reactions now extends
  all_best_matches with triples.
- src.pathway_generator: best_matches DataFrame and best_matches.csv
  cache now carry a reactome_id column.
- src.logic_network_generator.create_reaction_id_map reads the
  reaction_id directly from the row instead of looking it up; the
  defensive assert is unnecessary now and is removed.

This invalidates existing best_matches.csv caches (they lack the new
column). The next regeneration will produce fresh caches with the
right schema; clear output/*/cache before running.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The uri = "bolt://..." module variable was removed in commit 1995145
when the Neo4j connection went lazy, but three error-message f-strings
still referenced it. They'd raise NameError instead of the intended
ConnectionError if Neo4j ever became unreachable, masking the real
diagnostic. Pulled the URL from os.getenv at the formatting site so
the message keeps working.

Dead code removed:
- contains_reference_gene_product_molecule_or_isoform: this was the
  exact filter the bulk-prefetch branch removed (Change 2 from the
  earlier semantic discussion). The function definition was orphaned;
  nothing imports it.
- get_reactions(pathway_id, taxon_id): defined, never called.
- generate_pathway_file: drop unused taxon_id and decompose
  parameters; the docstring already admitted taxon_id was unused, and
  the body never referenced decompose. Caller in bin/create-pathways
  updated; the matching taxon_id = "9606" constant goes away too.
- Stale-cache shim that detected dbId-formatted reaction_connections
  caches and cleared them: a migration aid that's been obsolete since
  the codebase moved to stable IDs months ago.
- number_of_reaction_connections: int = -1 hardcoded debug toggle
  whose `if > 0` branch could never fire under the default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every query in neo4j_connector was built via f-string interpolation
of stable IDs (e.g. f"WHERE p.stId = '{pathway_id}'"). py2neo accepts
parameter binding via $name placeholders, which sidesteps an entire
class of escaping bugs and makes the queries auditable as plain
strings. Reactome stable IDs aren't a real injection risk in practice,
but this is the right shape for the codebase.

- prefetch_entity_data: all 5 queries (io, descendants, components,
  members, ref) now bind reaction_ids / entity_ids as $params.
- prefetch_entity_decomposition_data: same treatment for its 3 queries.
- get_reaction_connections, get_pathway_name, get_labels,
  get_complex_components, get_set_members, get_reference_entity_id:
  bind $entity_id / $pathway_id directly.
- get_reaction_input_output_ids: relationship type can't be a Cypher
  parameter, so the input_or_output value is now whitelist-validated
  before being embedded.
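The whitelist check for the one value that cannot be a Cypher parameter can be sketched as below. The allowed names and query text are illustrative, not the repo's actual code.

```python
# Relationship types can't be bound as $params, so validate before embedding.
_ALLOWED_IO = {"input", "output"}


def io_query(input_or_output: str) -> str:
    if input_or_output not in _ALLOWED_IO:
        raise ValueError(f"invalid relationship type: {input_or_output!r}")
    return f"MATCH (r)-[:{input_or_output}]->(e) RETURN e.stId"
```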

Magic-number comment on the *0..10 traversal depth in
prefetch_entity_data: real Reactome nesting maxes out around 5; 10 is
a safety margin without being expensive.

Module-level call sites now use get_graph() directly instead of the
graph proxy, which is unreferenced everywhere else and is removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Hungarian-pairing module is the heart of input-to-output matching
within a reaction. It had no tests before this commit and an
O(n×m×DataFrame-scan) inner loop that filtered the entire decomposed
mapping twice per (i,j) pair.

11 new tests in tests/test_best_reaction_match.py lock the contract:
- counts matrix mechanics (no overlap, partial, full, multi-pair)
- square Hungarian picks the optimum
- non-square padding correctly DROPS the extras (the subtle path
  that would silently mislabel virtual reactions if it broke)
- empty input or output returns no matches

With those in place, the rewrite:

- Pre-builds a uid → component-set dict once instead of filtering
  the DataFrame per pair. Quadratic scans become quadratic set
  intersections — orders of magnitude faster on big pathways.
- find_best_match_both_decomposed_reactions converts inputs/outputs
  to lists once and reuses them, instead of re-listing per matched
  pair.
- Drops the unused `count` from the matched-pair zip loop.
- Returns a tuple instead of a 2-element list (callers destructure
  identically).
- Removes the find_best_reaction_match string-to-set coercion: the
  one production caller passes Set[UID], so the coercion was dead
  defensive code that would have masked caller bugs.
- Adds module docstring explaining the padding/drop contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
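The precompute at the heart of the rewrite can be sketched as below: build a uid → component-set index once, then score each (input, output) pair with a set intersection instead of re-filtering the DataFrame. The row shape is an assumption mirroring the text.

```python
def build_component_index(rows):
    """rows: iterable of {'uid': ..., 'component_id': ...} dicts."""
    index = {}
    for row in rows:
        index.setdefault(row["uid"], set()).add(row["component_id"])
    return index


def overlap(index, input_uid, output_uid):
    # Set intersection replaces two DataFrame filters per (i, j) pair.
    return len(index.get(input_uid, set()) & index.get(output_uid, set()))
```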
When a protein regulated multiple reactions (or appeared as both
catalyst and regulator) and didn't otherwise show up as a regular
VR input/output, append_regulators minted a fresh UUID for every
regulator-row emission. The result: MDM2 regulating R1 and R2 became
two disconnected nodes in the graph topology, both labeled MDM2 only
in the secondary stid mapping. Perturbing one wouldn't propagate to
the other — same shape as the boundary-leaf disconnect fixed earlier.

Fix: when minting a fresh UUID for a regulator/catalyst member,
record it in stid_to_existing_uuid so the next emission for the
same stId reuses it. Same simple invariant that already governed
boundary leaves and registry-seeded reuse.

Two new tests in TestRegulatorSharedAcrossReactions lock the
contract: the same protein on two reactions shares one UUID, and a
protein appearing as both catalyst and regulator shares one UUID.

Also drops the dead defensive fallback that swapped a NaN entity_id
for the regulator-link UUID — the link UUID is not a Reactome stable
ID, the fetch query never returns rows with NaN entities, and the
substitution would have silently passed garbage to
_decompose_regulator_entity. Letting a real upstream gap surface is
the right behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
catalyst_map and the regulator maps stored the same kind of data —
(reaction stable id, regulator entity stable id, edge type, link
uuid, reaction uuid) — under different column names: catalyst_map
used reaction_id and catalyst_id; regulator_map used reaction and
PhysicalEntity. Every consumer had two branches to read the same
field. The Cypher queries returned the same shape under three names.

Now all three DataFrames use the same _CAT_REG_COLUMNS:
  [reaction_id, entity_id, edge_type, uuid, reaction_uuid]

Knock-on cleanup:
- _bulk_fetch_regulators is gone; _bulk_fetch_reaction_entity_links
  is one helper that produces both kinds.
- regulator_configs in append_regulators no longer carries an
  entity_col — the column is always 'entity_id'.
- The cat_reg_entity_ids collection is one loop over the concat of
  all three maps instead of two loops with different column names.
- export_uuid_to_reactome_mapping's catalyst-vs-regulator branching
  collapses to a single line reading row['entity_id'].
- Test fixtures in test_regulators_and_catalysts updated to the new
  column names; the renames were mechanical (reaction → reaction_id,
  catalyst_id → entity_id, PhysicalEntity → entity_id).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Profiling Autophagy showed pd.concat dominating the run at 12.4s of
28s (44%), called 2,139 times across get_uids_for_iterproduct_components,
get_broken_apart_ids, and _emit_entityset_provenance_rows. Each
concat copies the entire growing DataFrame, making decomposition
O(N²) in total rows. On big pathways with many EntitySet expansions
(Cell_Cycle hit 4 hours; Disease never finished) this dominated
runtime even after the best_reaction_match optimization landed.

Replace the module-level decomposed_uid_mapping DataFrame with a
_DecompositionStore: a Python list of row dicts plus two side
indexes (uid → row indices, reactome_id → row indices) that the
hot lookups in break_apart_entity, get_broken_apart_ids,
get_uids_for_iterproduct_components, and _emit_entityset_provenance_rows
now consult instead of scanning a DataFrame. The pandas DataFrame
is built exactly once at the end of get_decomposed_uid_mapping.

find_best_reaction_match still expects a DataFrame; it gets a small
per-reaction slice (built from the store via dataframe_for_uids)
containing only the rows for that reaction's input/output
combinations — cheap to build, cheap to scan.

Tests in test_decomposition_semantics.py updated to clear the store
in the fixture and materialize a DataFrame view via _store_df()
when assertions need filter expressions.

Autophagy: 28s → 9s (3x). Circadian_clock similar. Real win is on
big pathways where N gets large; preliminary expectation is hours
→ minutes.
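The store idea can be sketched minimally: O(1) appends to a list of row dicts, side indexes for the hot lookups, and a DataFrame built only once at the end (omitted here). Class and method names are illustrative, not the repo's _DecompositionStore API.

```python
class RowStore:
    def __init__(self):
        self.rows = []            # append-only list of row dicts
        self.by_uid = {}          # uid -> row indices
        self.by_reactome_id = {}  # reactome_id -> row indices

    def append(self, row):
        i = len(self.rows)
        self.rows.append(row)
        self.by_uid.setdefault(row["uid"], []).append(i)
        self.by_reactome_id.setdefault(row["reactome_id"], []).append(i)

    def rows_for_uid(self, uid):
        # Index lookup replaces a full-DataFrame scan per query.
        return [self.rows[i] for i in self.by_uid.get(uid, [])]
```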

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hungarian matching produces an optimal 1-to-1 assignment, but real
reactions routinely have unequal input vs output combination counts:

- EntitySet expansion can yield (e.g.) 3 input combinations and 2
  output combinations — each alternative is a real biological path.
- Cleavage reactions produce one input and several fragment outputs;
  per the Reactome curator guide's IMBALANCE QA check this is
  legitimate, not an error. Fragments don't share refEntity with the
  input molecule, so overlap is zero, but every input→fragment edge
  is still real.
- Modifications (phospho, ubiquitin) and transcription resolve
  separately via component_id_or_reference_entity_id — modified and
  unmodified forms share a reference entity so the matcher pairs
  them automatically.

The previous code padded the cost matrix and dropped any pair that
hit the padding (i.e. dropped surplus inputs or outputs). Now after
the Hungarian assignment, surplus inputs get paired with their
best-overlap output (and surplus outputs with their best-overlap
input). Symmetric on both sides. Zero-overlap pairs are still
emitted — cleavage products need them.

Tests in test_best_reaction_match.py replace the old "drops extras"
expectations with "fans out symmetrically", and add a cleavage
fixture that proves a 1-input → 3-fragment-output reaction emits
all three input→fragment pairs.

Documentation in docs/DESIGN_DECISIONS.md gets a new "Surplus
inputs/outputs fan out" section that points to the curator-guide
basis (cleavage QA exemption, refEntity-based mod handling).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
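The symmetric fan-out step can be sketched as below. `assigned` stands for the Hungarian result over the first min(n, m) rows and columns; each surplus row is then paired with its best-overlap column, and vice versa. Zero-overlap pairs survive, which is what the cleavage case needs.

```python
def fan_out(overlap, assigned):
    """overlap: n×m matrix of component overlaps; assigned: Hungarian pairs."""
    n = len(overlap)                          # input combinations
    m = len(overlap[0]) if overlap else 0     # output combinations
    pairs = list(assigned)
    used_rows = {i for i, _ in assigned}
    used_cols = {j for _, j in assigned}
    for i in range(n):                        # surplus inputs
        if i not in used_rows:
            pairs.append((i, max(range(m), key=lambda j: overlap[i][j])))
    for j in range(m):                        # surplus outputs, symmetrically
        if j not in used_cols:
            pairs.append((max(range(n), key=lambda i: overlap[i][j]), j))
    return pairs
```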
Phase 2 of create_pathway_logic_network's UUID assignment was doing
a full registry rescan on every merge: when two UUIDs needed to be
merged, the code iterated every entry of entity_uuid_registry to
rewrite values. With many merges and a registry that grows into the
hundreds of thousands of entries on big pathways (RAF_MAP_kinase_cascade
hits ~512K), this is O(N²) and was the dominant cost.

Now merges record source→target unions in a separate uuid_unions
map (deferred — no scan), and a single _canonicalize_registry pass
after Phase 2 rewrites every registry value to its union-find root.
_uf_find does path compression on each lookup, so resolving a UUID
is amortized O(α(N)).

_get_or_create_entity_uuid grew an optional uuid_unions parameter;
when callers don't pass one (single-shot use, tests), a fresh map
is allocated. The three TestInterReactionConnectivity tests that
read directly from the registry after merges now thread an explicit
unions dict and call _canonicalize_registry before assertions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Several MP-BioPath curator prediction files use slightly different
column names — 'key output' (with space) and 'key_outout' (typo)
instead of the canonical 'key_output'. A handful also have stray
trailing tabs producing one-too-many fields on isolated rows.

Detect any of the known aliases and rename to 'key_output' before
proceeding; switch to the python parser engine with on_bad_lines='warn'
so the trailing-tab rows are read leniently. Recovers ~300 additional
test cases (12,589 → 12,895) without code changes elsewhere.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Generic Boolean perturbation propagator that runs against the
generated networks and compares to the MP-BioPath curator-prediction
test set (Sundararaman et al., 2017 — 84 pathways with curator-predicted
input/output state changes, 10 with empirical evidence).

bin/validate-against-mpbiopath.py:
- Maps perturbation gene names to all UUIDs whose stable ID's reference
  entity has that gene name (catches every modification/compartment form).
- Pins those UUIDs to 0 (knockout) or 2 (upregulation), runs synchronous
  Boolean propagation over a 3-state lattice with AND on input/catalyst/
  positive-regulator edges, OR on output/dissociation, and inversion for
  negative regulators.
- Reads predicted state at each key-output UUID; compares to curator.
- Builds adjacency and incoming-edge indexes once per pathway via numpy
  iteration (iterrows on 1.5M-row networks was unusably slow).
- Categorizes each failing case: gene_not_in_network,
  keyoutput_not_in_network, no_path, propagator_missed, or
  false_positive_change. Distinguishes test-set drift (v86→v96 entity
  retirements) from propagator limits from potential generation bugs.
- Reports both raw accuracy (counts drift cases as defaults) and
  valid-only accuracy (drift cases excluded, the honest number).

validation_results/ holds the run output:
- 70.55% on 12,895 currently-runnable tests (compared to MP-BioPath's
  ~75% on v86 networks with its tuned algorithm).
- 60% of failures are propagator limits, 37% are no-path
  (network-bug-or-biology-change), 2.6% are false positives.
- README documents methodology and how to read the failure categories.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
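One natural reading of the 3-state lattice rules above (0 = down, 1 = baseline, 2 = up) is min for AND, max for OR, and reflection about baseline for negative regulators. This is a toy restatement of the update rules, an assumption about the propagator's semantics rather than code from bin/validate-against-mpbiopath.py.

```python
def and_gate(states):
    # All contributing inputs are required: the weakest one dominates.
    return min(states)


def or_gate(states):
    # Any producer suffices: the strongest one dominates.
    return max(states)


def invert(state):
    # Negative regulation flips the signal about baseline: 0<->2, 1->1.
    return 2 - state
```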
adamjohnwright and others added 8 commits April 29, 2026 11:55
…heck

bin/validate-against-mpbiopath.py grew a --ground-truth flag that picks
between 'curator' (default, 84 pathways) and 'experimental' (10
pathways with measured perturbation outcomes). The truth-table
loader is shared and tolerant of the schema variants between the two
file sets (key_output / key output / key_outout / key_ouput, trailing
tabs on stray rows). -999 cells in the experimental data are skipped.

bin/check-no-path-cases-in-neo4j.py: for each no_path failure case
from the validation, build the pathway's reaction-flow graph from
current Neo4j (entity → reaction → entity edges via input/catalyst/
regulator → output) and BFS for a path from the perturbed gene's
entities to the key-output entity. Decomposes via hasComponent /
hasMember / hasCandidate so a perturbation on a Complex matches its
sub-entities. Output classifies each case as bug_candidate (Neo4j has
a path; logic network missed it) or pathway_changed (Neo4j has no
path either; v86→v96 biology change).

Headline result vs curator (84 pathways, 12,895 valid tests):
  - 70.55% accuracy
  - 156 bug_candidates / 12,895 = 1.2% of valid tests potentially
    indicate a generation issue (concentrated in 9 pathways, dominated
    by Transcriptional_Regulation_by_TP53 / MDM2 with 80 cases)
  - 17.7% propagator_missed (path exists, simple Boolean can't carry
    the perturbation — propagator limit, not network)
  - 9.8% pathway_changed (Reactome dropped a connection between v86
    and v96; not a bug)

Vs experimental (10 pathways, 499 valid tests): 34.07%. Lower than
vs-curator because the experimental dataset is 90% non-normal cases,
heavily exposing the Boolean propagator's inability to propagate "up"
signals through AND gates whose other inputs sit at baseline. A
propagator limit, not a network limit.

validation_results/README.md updated with all of this so it's the
boss-facing summary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two metrics reported side by side now:
- End-to-end accuracy (70.55%): directly comparable to MP-BioPath's
  published 75% — the network plus a generic Boolean propagator vs
  curator predictions across 12,895 valid tests.
- Network correctness (98.3%): of cases where the network's
  connectivity is what determines the answer, the network agrees with
  the curator. Excludes cases attributable to propagator limits or
  v86→v96 test-set staleness.

The latter is the cleanest measure of whether the logic-network
generation itself is correct. The former is what's directly
comparable to published numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hand-traced the largest two clusters of "bug_candidate" failures
(80 cases of MDM2 in Transcriptional_Regulation_by_TP53, 18 cases of
PTEN in PIP3_activates_AKT_signaling). Same structural pattern in
both: the curator's predicted effect depends on mass-action substrate
depletion (more enzyme flux → less substrate available downstream).

Our directed-flow network correctly emits every Reactome reaction
with its inputs, outputs, catalysts, and regulators, but does not
emit edges representing "this reaction's flux depletes its
substrate." So perturbing an enzyme propagates forward to the
reaction's products but not backward to "less of the substrate is
left." That's the gap.

This is a known limitation of pathway-diagram-derived Boolean
networks, not a generation bug. Two ways to address it if needed:
explicit negative consumption edges (a semantic enhancement to the
generator), or letting alphaSignal's parameter learning absorb the
mass-action effect implicitly. Neither would be a bug fix, since
nothing here is broken; both are enhancements.

validation_results/README.md gets the full finding so it's the
boss-facing record.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Rewrite README to reflect current features, env-var-based Neo4j
  config, R-HSA-prefixed pathway IDs, the per-pathway output layout,
  assembly/dissociation edge types, and the 70.55% / 98.3% validation
  framing.
- Delete .flake8 (project uses ruff; the flake8 config was unused).
- Drop the mypy step from CI — it ran with continue-on-error masking
  19 type errors. Re-introduce when the codebase is ready to enforce.
- Move sets-in-reactome-that-cause-combinatorial-explosion.txt under
  docs/ so the repo root is cleaner.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Most errors were genuine type-annotation gaps; two were pandas-stubs
strictness on dtype dicts. None were runtime bugs — the code worked,
the annotations just didn't reflect reality.

Notable fixes:
- get_broken_apart_ids: parameter was annotated list[set[str]] but
  the function actually accepts heterogeneous list[set[str] | str].
  Tightened to Sequence[Set[str] | str] and added an explicit assert
  in the str-only branch so mypy can narrow.
- best_reaction_match: input/output_reactions widened from Sequence
  to Iterable since callers pass sets; added Optional defaults for
  reaction_id; renamed surplus-pairing loop variables to avoid
  Optional[int] reassignment churn.
- reaction_generator: renamed `member_ids` (rebound from set→dict
  across branches inside break_apart_entity); explicit list-typing
  on reaction_ids before prefetch_entity_data.
- logic_network_generator: explicit None-guard on stoichiometry int
  conversion (pd.isna doesn't narrow); str() coercion on uid groupby
  keys; deleted a dead pd.unique block whose result was never used.
- decomposed_uid_mapping_column_types: typed Dict[str, Any] so pandas
  read_csv(dtype=...) and DataFrame.astype(...) overloads accept it.

CI: re-added the mypy step (no continue-on-error this time) so future
type regressions block merges.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Dependencies (#9, #10):
- Bumped pyarrow ^15→^17, python-dotenv ^1.0.1→^1.2.2, pytest ^8.4→^9.0
  to clear 17 CVEs flagged by pip-audit. Refreshing the lock cleared
  another 6 transitive vulnerabilities (urllib3, certifi, pygments,
  filelock, virtualenv).
- pytest 9 requires Python 3.10+; Python 3.9 reached EOL 2025-10-31, so
  bumped the project floor to 3.10. README and pyproject.toml updated.
- Project-dep CVEs went from 23 → 0. Remaining 6 are in pip/setuptools
  themselves (env tooling, not project deps).

Coverage (#11):
- Added pytest-cov to the unit-tier CI step with a 40% floor (current
  unit coverage is 44%). The floor is intentionally below current to
  avoid false-positive failures on small reorganizations; it acts as a
  regression bar, not a target. Higher floors are warranted once the
  database/integration tiers can run in CI.

Reactome version tracking (#12):
- Bumped docker-compose Neo4j image Release94 → Release96 (the version
  the validation suite was actually run against — the old tag was drift).
- Added tests/test_reactome_version.py: a database-tier sentinel that
  reads the loaded Neo4j's DBInfo.version and prints it into the test
  log. Doesn't pin to a specific version; just records what each run
  used so a Reactome-correlated regression is easy to spot.
- Documented the upgrade workflow in README ("Tracking new Reactome
  releases" subsection) so bumping the image tag and re-running the
  database tier is the documented path forward.

CI hygiene:
- actions/checkout v3→v4, actions/setup-python v4→v5 in both workflows.
- ruff workflow Python 3.9→3.12 (matches test.yml).
- src/ ruff and mypy still clean; the existing tests/ ruff issues are
  pre-existing and not blocking (CI ruff job only checks src/ and bin/).
- Fixed 3 small ruff issues in bin/ scripts (unused import,
  no-placeholder f-strings).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI: test against Python 3.10, 3.11, 3.12 (was 3.12 only). Now matches
the supported floor declared in pyproject.toml; catches regressions
on the lowest supported version before they hit users.

CONTRIBUTING.md was stale on multiple fronts:
- Said Python 3.9; bumped to 3.10 to match pyproject.
- Referenced docker-compose Release94 in raw `docker run` form;
  switched to `docker-compose up -d` since the compose file is the
  pinned source of truth.
- Mentioned only `database` marker; integration tier exists too
  and they have different prerequisites — documented both.
- Linked CHANGELOG.md which doesn't exist; removed the link.
- Pointed at `mypy --ignore-missing-imports` which is laxer than
  CI's `mypy src/`. Aligned the documented command with what CI
  actually enforces.
- Added a coverage-gate section (40% floor) so contributors aren't
  surprised by the failure mode.

Pre-commit: the mypy hook used `types-all`, which was archived
years ago and breaks on `pre-commit run --all-files`. Replaced
with the specific stub we actually need (pandas-stubs); restricted
to src/ via `files:`; dropped --ignore-missing-imports so the hook
matches CI behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tests/test_comprehensive_validation.py scanned `output/` at module
import time (to populate AVAILABLE_PATHWAYS for parametrize), but
without checking that the directory exists. In CI there's no
output/ — the scan raised FileNotFoundError during pytest
collection, which fails the whole run before a single test executes.

The other test files that scan output/ already guard with
`if not output_dir.exists(): return []`. Added the same guard here.

Verified by running the unit-tier command in a fresh clone with no
output/ directory: 116 passed, 2 skipped, 0 errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@adamjohnwright adamjohnwright merged commit 1c8c687 into main Apr 29, 2026
4 checks passed