Adjust placement of paragraph markers by benjaminking · Pull Request #298 · sillsdev/machine.py

benjaminking · 2026-05-14T17:45:01Z

An SILNLP issue (sillsdev/silnlp#996) surfaced an issue with paragraph marker placement where paragraph markers would sometimes be placed in slightly, but obviously wrong positions, and for example, would lead to segments starting with commas.

SILNLP's verse segmentation code has a class that makes small adjustments to segment boundaries to try to avoid these issues. This PR ports that code to Machine.py, adds tests for it, and incorporates it into the marker placement algorithm.

This change is

…nt of paragraph markers.

Copilot

Pull request overview

This PR ports SILNLP’s segment-boundary adjustment logic into Machine.py to improve the natural placement of paragraph markers during USFM marker placement, avoiding awkward segment starts (e.g., beginning with commas).

Changes:

Add SegmentBoundaryAdjuster (+ TokenRejoiner) for adjusting boundaries in both raw-text and tokenized scenarios.
Integrate boundary adjustment when placing paragraph markers in PlaceMarkersUsfmUpdateBlockHandler.
Add unit tests for the adjuster and an integration-style test covering paragraph-marker adjustment in USFM updates.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File	Description
`machine/corpora/segment_boundary_adjuster.py`	New boundary-adjustment implementation for raw and tokenized segment pairs.
`machine/corpora/place_markers_usfm_update_block_handler.py`	Applies boundary adjustment when inserting paragraph markers based on alignment.
`tests/corpora/test_segment_boundary_adjuster.py`	New comprehensive unit tests for boundary-adjustment behavior.
`tests/corpora/test_place_markers_usfm_update_block_handler.py`	Adds a test validating adjusted paragraph-marker placement in an end-to-end update flow.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

benjaminking · 2026-05-14T18:03:39Z

+            accumulated_length = len(token_rejoiner.add_token_to_joined_text(token))
+
+            if accumulated_length >= target_segment_length:
+                # In the unlikely case that the adjusted boundary falls in the middle of a token
+                # select the token boundary that is closest
+                error_with_current_boundary = accumulated_length - target_segment_length
+                error_with_previous_boundary = target_segment_length - (accumulated_length - len(token))
+


This is not actually an issue, since we're working with tokenized text, which has no whitespace tokens. It doesn't matter on which side of the hypothetical space we decide to draw a boundary, because the boundary will still fall between the same two tokens either way. It's not worth adding extra logic to handle a distinction that will never result in different functionality.

benjaminking · 2026-05-14T18:03:38Z

+            # If inserting a paragraph marker, make small adjustments to place it in a more natural location
+            if element.type == UsfmUpdateBlockElementType.PARAGRAPH:
+                adj_trg_tok = SegmentBoundaryAdjuster().adjust_tokenized_segment_pair_boundaries(adj_trg_tok, trg_toks)
+


…ance

benjaminking

@benjaminking made 2 comments and resolved 2 discussions.
Reviewable status: 0 of 4 files reviewed, all discussions resolved.

benjaminking · 2026-05-14T18:03:38Z

+            # If inserting a paragraph marker, make small adjustments to place it in a more natural location
+            if element.type == UsfmUpdateBlockElementType.PARAGRAPH:
+                adj_trg_tok = SegmentBoundaryAdjuster().adjust_tokenized_segment_pair_boundaries(adj_trg_tok, trg_toks)
+


benjaminking · 2026-05-14T18:03:39Z

+            accumulated_length = len(token_rejoiner.add_token_to_joined_text(token))
+
+            if accumulated_length >= target_segment_length:
+                # In the unlikely case that the adjusted boundary falls in the middle of a token
+                # select the token boundary that is closest
+                error_with_current_boundary = accumulated_length - target_segment_length
+                error_with_previous_boundary = target_segment_length - (accumulated_length - len(token))
+


This is not actually an issue, since we're working with tokenized text, which has no whitespace tokens. It doesn't matter on which side of the hypothetical space we decide to draw a boundary, because the boundary will still fall between the same two tokens either way. It's not worth adding extra logic to handle a distinction that will never result in different functionality.

codecov-commenter · 2026-05-14T18:05:01Z

Codecov Report

❌ Patch coverage is 99.63100% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 91.60%. Comparing base (aa71f15) to head (e2452a3).

Files with missing lines	Patch %	Lines
machine/corpora/segment_boundary_adjuster.py	98.73%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #298      +/-   ##
==========================================
+ Coverage   91.51%   91.60%   +0.09%     
==========================================
  Files         355      357       +2     
  Lines       22934    23205     +271     
==========================================
+ Hits        20987    21257     +270     
- Misses       1947     1948       +1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Add SegmentBoundaryAdjuster + tests, and use it to adjust the placeme…

00312e0

…nt of paragraph markers.

benjaminking requested a review from Copilot May 14, 2026 17:45

Copilot started reviewing on behalf of benjaminking May 14, 2026 17:45 View session

Copilot AI reviewed May 14, 2026

View reviewed changes

Initialize SegmentBoundaryAdjuster once per update block handler inst…

e2452a3

…ance

benjaminking commented May 14, 2026

View reviewed changes

benjaminking marked this pull request as ready for review May 14, 2026 18:20

benjaminking requested a review from Enkidu93 May 14, 2026 18:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Adjust placement of paragraph markers#298

Adjust placement of paragraph markers#298
benjaminking wants to merge 2 commits into
mainfrom
segment_boundary_adjuster

benjaminking commented May 14, 2026 •

edited by ddaspit

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

benjaminking May 14, 2026

Uh oh!

benjaminking May 14, 2026

Uh oh!

benjaminking left a comment

Uh oh!

benjaminking May 14, 2026

Uh oh!

benjaminking May 14, 2026

Uh oh!

codecov-commenter commented May 14, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

benjaminking commented May 14, 2026 • edited by ddaspit Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

benjaminking May 14, 2026

Choose a reason for hiding this comment

Uh oh!

benjaminking May 14, 2026

Choose a reason for hiding this comment

Uh oh!

benjaminking left a comment

Choose a reason for hiding this comment

Uh oh!

benjaminking May 14, 2026

Choose a reason for hiding this comment

Uh oh!

benjaminking May 14, 2026

Choose a reason for hiding this comment

Uh oh!

codecov-commenter commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

benjaminking commented May 14, 2026 •

edited by ddaspit

Loading

codecov-commenter commented May 14, 2026 •

edited

Loading