feat: Centralised configuration utility#17
Conversation
gkennos
left a comment
There was a problem hiding this comment.
I think only the comments in paths and grounding actually block the release
|
|
||
| if max_concepts and len(found_standard_concepts) >= max_concepts: | ||
| if max_concepts and all( | ||
| found_count_per_target.get(t, 0) >= max_concepts for t in targets |
There was a problem hiding this comment.
if one of these targets has no concept_ancestor values then all() is permanently False and max_concepts has no effect
There was a problem hiding this comment.
Valid edge case but the current code is already the right trade-off. When all targets are reachable and all hit max_concepts the break fires correctly. When some targets are unreachable the break never fires and the BFS drains fully. This is identical to not having this limit set at all but that is the implication of the limit.
Instead of optimising the early-stopping condition, I optimised the number of round-trips to the DB by doing batched ancestry search. This should reduce the number of requests to the DB and speed the BFS up.
There was a problem hiding this comment.
yup ok that's a good optimisation
| standard_concepts=tuple(sc for sc in standard_concepts if sc.match_kind != LabelMatchKind.EMBEDDING), | ||
| kg=kg, | ||
| nearest_concept_matches=nearest_concept_matches, | ||
| nearest_concept_matches=None, # No embedding-based scoring for non-embedding matches |
There was a problem hiding this comment.
How come this was changed? if a synonym is super different textually but somehow overwhelmed in embedding space, it should not be penalised that hard? so - CIN is synonym of Cervical intraepithelial neoplasia, but does not appear in top n embedding concepts -> shouldn't be scored to effectively 0 by its lack of textual similarity I think? it should get its (kind of mid, but existent) embedding score, even though it wasn't enough to score in the top n and be resolved by the embedding resolver
was line 190/191 a huge performance hit? is there a reason specifically to keep the list shorter there? otherwise I think it should not be None here
There was a problem hiding this comment.
The split is intentional. Resolvers that find concepts textually (Exact/Synonym/Partial/FTS) score textually; the EmbeddingResolver scores semantically. Adding embedding scoring to FTS results reintroduces the dilution problem the split was designed to fix: FTS surfaces hundreds of NOS/body-part variants with nearly identical embeddings, and scoring all of them semantically swamps the concept-specific signal.
This is NOT a performance fix but a structural/syntactic (?) fix.
There was a problem hiding this comment.
ok that is rational - just wasn't clear on why it was changed
|
@gkennos Included requested changes. There are 3 open conversations I would like your feedback on, which is why they are not being marked as "Resolved". In addition to the suggested changes, I added an optimisation to the BFS search to query the DB for all concept ancestors for the entire heap at once. That way, we should be able to reduce the number of times the DB is queried significantly (i.e. not doing the ancestry check for EACH individual item). This performance improvement is probably the most noticeable the more concepts we have to check. Let me know your thoughts on the implementation. It is deliberately still open for bringing costs to specific predicate kinds in the future (which is why the heap is still there) so we can extend if we wanted to. Docstrings should also indicate this. |
Summary
Adapts
omop-graphto theoa-configuratorconfiguration layer, replacing the previous environment-variable-based setup with a typed TOML-backed config and a first-classomop-config configure omop_graphsubcommand.Notes
Due to the importance, this PR also absorbed the following issue:
Identityinpredicate_mapping.csv#25Changes
OmopGraphConfigsubclassesPackageConfigBase, exposing all package settings as typed Pydantic fields backed by[tools.omop_graph]in~/.config/omop/config.tomlomop.configsoomop-config configure omop_graphprompts for package extras interactively or via named flags (--max-depth,--max-paths); omop-graph owns no database resource directly — it relies on the CDM resource configured by omop-alchemyOmopGraphConfig.get_config()(inherited classmethod) replaces the old standaloneget_config()function; all internal call sites updatedResolver.from_active_config()replaces the old standaloneget_resolver()function; all engine-creation helpers updated (db/session.py,oaklib_interface/omop_factory.py)OmopGraphConfig.configure_logging(verbosity=…)(inherited classmethod) replaces the old standaloneconfigure_logging()wrapper;extra_logging_namespaces = ("omop_alchemy", "omop_emb")declares transitive dependencies whose logs are also configured (omop_embis optional but harmless to include)docker-compose.yamlupdated from--resource-set/--setflags to named flags matching the new CLI