Adjust placement of paragraph markers#298
Conversation
…nt of paragraph markers.
There was a problem hiding this comment.
Pull request overview
This PR ports SILNLP’s segment-boundary adjustment logic into Machine.py to improve the natural placement of paragraph markers during USFM marker placement, avoiding awkward segment starts (e.g., beginning with commas).
Changes:
- Add
SegmentBoundaryAdjuster(+TokenRejoiner) for adjusting boundaries in both raw-text and tokenized scenarios. - Integrate boundary adjustment when placing paragraph markers in
PlaceMarkersUsfmUpdateBlockHandler. - Add unit tests for the adjuster and an integration-style test covering paragraph-marker adjustment in USFM updates.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
machine/corpora/segment_boundary_adjuster.py |
New boundary-adjustment implementation for raw and tokenized segment pairs. |
machine/corpora/place_markers_usfm_update_block_handler.py |
Applies boundary adjustment when inserting paragraph markers based on alignment. |
tests/corpora/test_segment_boundary_adjuster.py |
New comprehensive unit tests for boundary-adjustment behavior. |
tests/corpora/test_place_markers_usfm_update_block_handler.py |
Adds a test validating adjusted paragraph-marker placement in an end-to-end update flow. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| accumulated_length = len(token_rejoiner.add_token_to_joined_text(token)) | ||
|
|
||
| if accumulated_length >= target_segment_length: | ||
| # In the unlikely case that the adjusted boundary falls in the middle of a token | ||
| # select the token boundary that is closest | ||
| error_with_current_boundary = accumulated_length - target_segment_length | ||
| error_with_previous_boundary = target_segment_length - (accumulated_length - len(token)) | ||
|
|
There was a problem hiding this comment.
This is not actually an issue, since we're working with tokenized text, which has no whitespace tokens. It doesn't matter on which side of the hypothetical space we decide to draw a boundary, because the boundary will still fall between the same two tokens either way. It's not worth adding extra logic to handle a distinction that will never result in different functionality.
| # If inserting a paragraph marker, make small adjustments to place it in a more natural location | ||
| if element.type == UsfmUpdateBlockElementType.PARAGRAPH: | ||
| adj_trg_tok = SegmentBoundaryAdjuster().adjust_tokenized_segment_pair_boundaries(adj_trg_tok, trg_toks) | ||
|
|
benjaminking
left a comment
There was a problem hiding this comment.
@benjaminking made 2 comments and resolved 2 discussions.
Reviewable status: 0 of 4 files reviewed, all discussions resolved.
| # If inserting a paragraph marker, make small adjustments to place it in a more natural location | ||
| if element.type == UsfmUpdateBlockElementType.PARAGRAPH: | ||
| adj_trg_tok = SegmentBoundaryAdjuster().adjust_tokenized_segment_pair_boundaries(adj_trg_tok, trg_toks) | ||
|
|
| accumulated_length = len(token_rejoiner.add_token_to_joined_text(token)) | ||
|
|
||
| if accumulated_length >= target_segment_length: | ||
| # In the unlikely case that the adjusted boundary falls in the middle of a token | ||
| # select the token boundary that is closest | ||
| error_with_current_boundary = accumulated_length - target_segment_length | ||
| error_with_previous_boundary = target_segment_length - (accumulated_length - len(token)) | ||
|
|
There was a problem hiding this comment.
This is not actually an issue, since we're working with tokenized text, which has no whitespace tokens. It doesn't matter on which side of the hypothetical space we decide to draw a boundary, because the boundary will still fall between the same two tokens either way. It's not worth adding extra logic to handle a distinction that will never result in different functionality.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #298 +/- ##
==========================================
+ Coverage 91.51% 91.60% +0.09%
==========================================
Files 355 357 +2
Lines 22934 23205 +271
==========================================
+ Hits 20987 21257 +270
- Misses 1947 1948 +1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
An SILNLP issue (sillsdev/silnlp#996) surfaced an issue with paragraph marker placement where paragraph markers would sometimes be placed in slightly, but obviously wrong positions, and for example, would lead to segments starting with commas.
SILNLP's verse segmentation code has a class that makes small adjustments to segment boundaries to try to avoid these issues. This PR ports that code to Machine.py, adds tests for it, and incorporates it into the marker placement algorithm.
This change is