feat: Centralised configuration utility#17

Merged
nicoloesch merged 27 commits into main from 15-oa-config
Jul 2, 2026
Conversation

@nicoloesch nicoloesch commented Jun 9, 2026

Collaborator

Summary

Adapts omop-graph to the oa-configurator configuration layer, replacing the previous environment-variable-based setup with a typed TOML-backed config and a first-class omop-config configure omop_graph subcommand.

Notes

Due to the importance, this PR also absorbed the following issue:

Changes

  • OmopGraphConfig subclasses PackageConfigBase, exposing all package settings as typed Pydantic fields backed by [tools.omop_graph] in ~/.config/omop/config.toml
  • Entry point registered under omop.config so omop-config configure omop_graph prompts for package extras interactively or via named flags (--max-depth, --max-paths); omop-graph owns no database resource directly — it relies on the CDM resource configured by omop-alchemy
  • OmopGraphConfig.get_config() (inherited classmethod) replaces the old standalone get_config() function; all internal call sites updated
  • Resolver.from_active_config() replaces the old standalone get_resolver() function; all engine-creation helpers updated (db/session.py, oaklib_interface/omop_factory.py)
  • OmopGraphConfig.configure_logging(verbosity=…) (inherited classmethod) replaces the old standalone configure_logging() wrapper; extra_logging_namespaces = ("omop_alchemy", "omop_emb") declares transitive dependencies whose logs are also configured (omop_emb is optional but harmless to include)
  • docker-compose.yaml updated from --resource-set/--set flags to named flags matching the new CLI

@nicoloesch nicoloesch marked this pull request as ready for review June 30, 2026 02:09
@nicoloesch nicoloesch requested a review from gkennos June 30, 2026 02:09

@gkennos gkennos left a comment

Member

I think only the comments in paths and grounding actually block the release

Comment thread src/omop_graph/graph/paths.py Outdated

- if max_concepts and len(found_standard_concepts) >= max_concepts:
+ if max_concepts and all(
+ found_count_per_target.get(t, 0) >= max_concepts for t in targets

Member

if one of these targets has no concept_ancestor values then all() is permanently False and max_concepts has no effect

Collaborator Author

Valid edge case, but I think the current code is already the right trade-off. When all targets are reachable and every one of them has hit max_concepts, the break fires correctly. When some targets are unreachable, the break never fires and the BFS drains fully. That behaves the same as having no limit set at all, but that is the implication of the limit.

Instead of optimising the early-stopping condition, I optimised the number of round-trips to the DB by doing batched ancestry search. This should reduce the number of requests to the DB and speed the BFS up.

Member

yup ok that's a good optimisation
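
To make the edge case in this thread concrete, here is a small sketch of the early-stop condition quoted in the diff, wrapped in a hypothetical helper called `should_stop`. The target names and counts are made up for illustration; only the condition itself comes from the quoted code.

```python
def should_stop(found_count_per_target, targets, max_concepts):
    """Return True once every target has at least max_concepts hits."""
    return bool(max_concepts) and all(
        found_count_per_target.get(t, 0) >= max_concepts for t in targets
    )


targets = ["A", "B"]  # hypothetical target identifiers

# Both targets are reachable and have reached the limit: the break fires.
all_reached = {"A": 2, "B": 3}

# "B" has no ancestry rows, so its count stays at 0 and all() never becomes True.
one_unreachable = {"A": 5}

stop_when_all_reached = should_stop(all_reached, targets, max_concepts=2)
stop_when_one_unreachable = should_stop(one_unreachable, targets, max_concepts=2)
```

In the second case the search simply runs until the frontier is empty, which is the behaviour both sides of the thread describe.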

standard_concepts=tuple(sc for sc in standard_concepts if sc.match_kind != LabelMatchKind.EMBEDDING),
kg=kg,
nearest_concept_matches=nearest_concept_matches,
nearest_concept_matches=None, # No embedding-based scoring for non-embedding matches

Member

How come this was changed? If a synonym is very different textually but somehow overwhelmed in embedding space, it should not be penalised that hard, should it? For example, CIN is a synonym of Cervical intraepithelial neoplasia but does not appear in the top n embedding concepts. I don't think it should be scored to effectively 0 for its lack of textual similarity. It should get its (kind of mid, but existent) embedding score, even though that score wasn't enough to place in the top n and be resolved by the embedding resolver.

Was line 190/191 a huge performance hit? Is there a specific reason to keep the list shorter there? Otherwise I think it should not be None here.

Collaborator Author

The split is intentional. Resolvers that find concepts textually (Exact/Synonym/Partial/FTS) score textually; the EmbeddingResolver scores semantically. Adding embedding scoring to FTS results reintroduces the dilution problem the split was designed to fix: FTS surfaces hundreds of NOS/body-part variants with nearly identical embeddings, and scoring all of them semantically swamps the concept-specific signal.

This is NOT a performance fix but a structural/syntactic (?) fix.

Member

ok that is rational - just wasn't clear on why it was changed
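
The split discussed in this thread can be sketched roughly as follows. This is not the package's scoring code: `Candidate`, `text_score`, `score_candidates`, and `score_split` are hypothetical names, `difflib` stands in for whatever textual similarity the resolvers actually use, and the `LabelMatchKind` enum is re-declared here only so the example runs on its own.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from enum import Enum


class LabelMatchKind(Enum):
    # Re-declared for the sketch; member values are assumptions.
    EXACT = "exact"
    SYNONYM = "synonym"
    PARTIAL = "partial"
    FTS = "fts"
    EMBEDDING = "embedding"


@dataclass(frozen=True)
class Candidate:
    concept_id: int
    label: str
    match_kind: LabelMatchKind


def text_score(query: str, label: str) -> float:
    """Stand-in textual similarity in [0, 1]."""
    return SequenceMatcher(None, query.lower(), label.lower()).ratio()


def score_candidates(query, candidates, nearest_concept_matches):
    """Score textually when no embedding matches are supplied, else semantically."""
    scores = {}
    for c in candidates:
        if nearest_concept_matches is None:
            scores[c.concept_id] = text_score(query, c.label)
        else:
            scores[c.concept_id] = nearest_concept_matches.get(c.concept_id, 0.0)
    return scores


def score_split(query, candidates, nearest_concept_matches):
    """Textually found candidates never see the embedding scores."""
    textual = tuple(c for c in candidates if c.match_kind != LabelMatchKind.EMBEDDING)
    semantic = tuple(c for c in candidates if c.match_kind == LabelMatchKind.EMBEDDING)
    scores = score_candidates(query, textual, None)
    scores.update(score_candidates(query, semantic, nearest_concept_matches))
    return scores


candidates = [
    Candidate(1, "asthma", LabelMatchKind.EXACT),
    Candidate(2, "wheezing disorder", LabelMatchKind.EMBEDDING),
]
scores = score_split("asthma", candidates, {2: 0.83})
```

With this shape, a large batch of FTS hits cannot affect the semantic ranking, because those hits are never scored against the embedding matches in the first place.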

Comment thread src/omop_graph/oaklib_interface/omop_implementation.py Outdated
Comment thread scripts/benchmarks/benchmark.py Outdated
Comment thread scripts/benchmarks/trace_example.py
Comment thread src/omop_graph/db/session.py Outdated
@nicoloesch

Collaborator Author

@gkennos Included requested changes. There are 3 open conversations I would like your feedback on, which is why they are not being marked as "Resolved".

In addition to the suggested changes, I added an optimisation to the BFS search that queries the DB for the concept ancestors of the entire heap at once. That way, we should be able to significantly reduce the number of DB queries (i.e. no longer doing the ancestry check for EACH individual item). This improvement should become more noticeable the more concepts we have to check.

Let me know your thoughts on the implementation. It deliberately leaves room for assigning costs to specific predicate kinds in the future (which is why the heap is still there), so we can extend it if we want to. The docstrings should also indicate this.
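
A minimal sketch of the batching idea, assuming uniform edge costs and an in-memory stand-in for the database: `FakeDB`, `bfs_per_item`, `bfs_batched`, and the toy data are all hypothetical and only count simulated round-trips. The merged implementation may batch differently.

```python
import heapq
from collections import defaultdict


class FakeDB:
    """In-memory stand-in for the ancestry table that counts round-trips."""

    def __init__(self, ancestors):
        self.ancestors = ancestors
        self.queries = 0

    def ancestors_of(self, concept_id):
        self.queries += 1
        return self.ancestors.get(concept_id, set())

    def ancestors_of_many(self, concept_ids):
        self.queries += 1  # one round-trip for the whole batch
        return {c: self.ancestors.get(c, set()) for c in concept_ids}


def bfs_per_item(db, edges, start, targets):
    """Baseline: one ancestry query per popped node."""
    heap, seen, found = [(0, start)], {start}, defaultdict(int)
    while heap:
        cost, node = heapq.heappop(heap)
        ancestry = db.ancestors_of(node)
        for t in targets:
            found[t] += t in ancestry
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(heap, (cost + 1, nxt))
    return {t: n for t, n in found.items() if n}


def bfs_batched(db, edges, start, targets, max_concepts=None):
    """Drain the heap, fetch ancestry for all drained nodes in one query."""
    heap, seen, found = [(0, start)], {start}, defaultdict(int)
    while heap:
        frontier = [heapq.heappop(heap) for _ in range(len(heap))]
        ancestry = db.ancestors_of_many([node for _, node in frontier])
        for cost, node in frontier:
            for t in targets:
                found[t] += t in ancestry[node]
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (cost + 1, nxt))
        if max_concepts and all(found.get(t, 0) >= max_concepts for t in targets):
            break
    return {t: n for t, n in found.items() if n}


edges = {1: [2, 3], 2: [4], 3: [4, 5]}        # toy concept graph
ancestors = {2: {100}, 4: {100}, 5: {200}}    # toy ancestry rows
per_item_db, batched_db = FakeDB(ancestors), FakeDB(ancestors)
per_item_result = bfs_per_item(per_item_db, edges, 1, [100, 200])
batched_result = bfs_batched(batched_db, edges, 1, [100, 200])
```

On this toy graph both variants find the same counts, while the batched one issues one query per drained frontier instead of one per node. Note that draining the whole heap only preserves strict cost order when costs are uniform; with per-predicate costs the batch boundaries would need more care.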

@nicoloesch nicoloesch requested a review from gkennos July 1, 2026 01:50
@nicoloesch nicoloesch merged commit 049cd6e into main Jul 2, 2026
4 checks passed
@nicoloesch nicoloesch deleted the 15-oa-config branch July 2, 2026 06:18