
Save rankings feature#49

Merged
warreveys merged 8 commits into
techwolf-ai:mainfrom
warreveys:save-rankings-feature
May 26, 2026


@warreveys warreveys commented May 6, 2026

Collaborator

Description

Add a save_rankings: bool = False flag to workrb.evaluate() that persists the full prediction matrix for each ranking-task dataset under <output_folder>/rankings/<model_name>/<task>__<dataset_id>.json. Each artifact carries a self-describing header (schema version, workrb version, model/task/dataset identity, sizes, query/target canary strings).
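As a rough illustration of the artifact layout described above, the following standalone sketch writes one prediction matrix with a header to `<output_folder>/rankings/<model_name>/<task>__<dataset_id>.json`. It does not call workrb: the function name `save_ranking_artifact`, the JSON key names, and the schema version value are all assumptions made for the example, and the canary strings are omitted.

```python
import json
import tempfile
from pathlib import Path

SCHEMA_VERSION = 1  # hypothetical value; the real schema version is not stated in this PR


def save_ranking_artifact(output_folder, model_name, task, dataset_id, scores, workrb_version):
    """Write one prediction matrix plus a self-describing header as JSON.

    Key names below are illustrative assumptions, not the actual workrb schema.
    """
    path = Path(output_folder) / "rankings" / model_name / f"{task}__{dataset_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {
        "header": {
            "schema_version": SCHEMA_VERSION,
            "workrb_version": workrb_version,
            "model_name": model_name,
            "task": task,
            "dataset_id": dataset_id,
            "n_queries": len(scores),
            "n_targets": len(scores[0]) if scores else 0,
        },
        "scores": scores,  # one row of target scores per query
    }
    path.write_text(json.dumps(payload), encoding="utf-8")
    return path


with tempfile.TemporaryDirectory() as tmp:
    artifact_path = save_ranking_artifact(
        tmp, "demo-model", "demo_task", "en", [[0.9, 0.1], [0.2, 0.8]], "0.0.0"
    )
    loaded = json.loads(artifact_path.read_text(encoding="utf-8"))
    relative_path = artifact_path.relative_to(tmp).as_posix()
```

Because the identity fields live in the payload as well as in the path, a file copied out of its folder can still be matched to its model, task, and dataset.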

To enable this without recomputing, RankingTask.evaluate is split into compute_prediction_matrix + compute_metrics_from_prediction_matrix. Default behaviour of evaluate() is unchanged.
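The split can be pictured with a toy class. This is a minimal sketch, not the workrb implementation: `ToyRankingTask`, its constructor arguments, the `overlap` scorer, and the choice of mean reciprocal rank as the metric are all assumptions for illustration. Only the two method names come from the description above.

```python
class ToyRankingTask:
    """Stand-in for a ranking task; not the workrb class."""

    def __init__(self, queries, targets, relevant):
        self.queries = queries
        self.targets = targets
        self.relevant = relevant  # relevant[i] is the index of the correct target for query i

    def compute_prediction_matrix(self, score_fn):
        # Expensive step: one score per (query, target) pair, requires the model.
        return [[score_fn(q, t) for t in self.targets] for q in self.queries]

    def compute_metrics_from_prediction_matrix(self, matrix):
        # Cheap step: needs only the scores, so it can run on a saved matrix.
        reciprocal_ranks = []
        for row, gold in zip(matrix, self.relevant):
            rank = 1 + sum(1 for score in row if score > row[gold])
            reciprocal_ranks.append(1.0 / rank)
        return {"mrr": sum(reciprocal_ranks) / len(reciprocal_ranks)}

    def evaluate(self, score_fn):
        # Same result as before the split: compose the two steps.
        matrix = self.compute_prediction_matrix(score_fn)
        return self.compute_metrics_from_prediction_matrix(matrix)


def overlap(query, target):
    """Toy scorer: number of whitespace-separated tokens shared by query and target."""
    return len(set(query.split()) & set(target.split()))


task = ToyRankingTask(
    queries=["data engineer", "nurse"],
    targets=["senior data engineer", "registered nurse", "data analyst"],
    relevant=[0, 1],
)
matrix = task.compute_prediction_matrix(overlap)
metrics_direct = task.evaluate(overlap)
metrics_replayed = task.compute_metrics_from_prediction_matrix(matrix)
```

Keeping `evaluate` as a thin composition of the two steps is what lets the matrix be saved in between without scoring twice.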

Adds a companion entry point, workrb.evaluate_rankings(rankings_dir, tasks, ...), that replays saved artifacts to compute metrics without a model. A new rankings module handles loading, header validation (a schema-version mismatch or a structural mismatch is a hard reject; workrb-version drift only warns), and matrix materialization. RankingsArtifactMissing and RankingsArtifactInvalid are exported.
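The three validation outcomes could look roughly like the sketch below. It is self-contained and does not import workrb: the exception class is a local stand-in that only borrows the exported name, and `validate_header`, the header keys, and the supported schema version are assumptions for the example.

```python
import warnings

SUPPORTED_SCHEMA_VERSION = 1  # hypothetical value


class RankingsArtifactInvalid(Exception):
    """Local stand-in named after the exception this PR exports."""


def validate_header(header, expected, current_version):
    """Reject on schema or structural mismatch; only warn on version drift."""
    # Hard reject: the artifact was written with a schema this code cannot read.
    if header.get("schema_version") != SUPPORTED_SCHEMA_VERSION:
        raise RankingsArtifactInvalid(
            f"unsupported schema version: {header.get('schema_version')!r}"
        )
    # Hard reject: the artifact does not describe the dataset being replayed.
    for key in ("task", "dataset_id", "n_queries", "n_targets"):
        if header.get(key) != expected[key]:
            raise RankingsArtifactInvalid(
                f"{key} mismatch: artifact has {header.get(key)!r}, expected {expected[key]!r}"
            )
    # Soft check: a different library version is reported but tolerated.
    if header.get("workrb_version") != current_version:
        warnings.warn(
            f"artifact written by workrb {header.get('workrb_version')}, "
            f"replaying with {current_version}"
        )


expected = {"task": "demo_task", "dataset_id": "en", "n_queries": 2, "n_targets": 3}
header = {"schema_version": 1, "workrb_version": "0.1.0", **expected}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    validate_header(header, expected, current_version="0.2.0")
drift_warnings = [str(w.message) for w in caught]

try:
    validate_header({**header, "n_targets": 99}, expected, current_version="0.1.0")
    structural_error = None
except RankingsArtifactInvalid as exc:
    structural_error = str(exc)
```

Treating version drift as a warning keeps older artifacts replayable, while the hard rejects guard against computing metrics on a matrix whose shape or identity does not match the task.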

Also: BenchmarkMetadata gains replayed_from_workrb_version (None for normal runs), and _get_dataset_ids_to_evaluate is refactored so ExecutionMode.ALL consistently keeps every dataset regardless of aggregation mode.

Checklist

  • Added new tests for new functionality
  • Tested locally with example tasks
  • Code follows project style guidelines
  • Documentation updated
  • No new warnings introduced

warreveys added 2 commits May 6, 2026 11:40
Add a `save_rankings: bool = False` flag to `workrb.evaluate()` that
persists per-target ranking score arrays for each ranking-task dataset
under `<output_folder>/rankings/<model_name>/<task>__<dataset_id>.json`.
Each artifact also records `model_name` in its payload so files remain
self-describing if moved.

To enable this without recomputing the prediction matrix, `RankingTask.evaluate`
is split into `compute_prediction_matrix` + `compute_metrics_from_prediction_matrix`;
default behavior is unchanged.
@warreveys warreveys marked this pull request as draft May 6, 2026 10:07
@warreveys warreveys marked this pull request as ready for review May 6, 2026 10:24
@warreveys warreveys requested a review from Mattdl May 7, 2026 06:44
@warreveys warreveys merged commit dcba546 into techwolf-ai:main May 26, 2026
0 of 2 checks passed