This PR removes the manual S3 pre-upload dependency from the AWS Transcribe flow and centralizes the audio file lifecycle in the base transcribe engine. It also fixes the custom embeddings flow by reusing the same shared transcription lifecycle instead of maintaining a separate and incorrect S3 handling path.
Changes Made
Added S3 existence checks before starting transcription.
Added upload of missing audio files when they are not already present in S3.
Added storage edit permission validation before uploading a missing file.
Added cleanup for generated transcription output and request-uploaded audio files.
Refactored the custom embeddings flow to reuse the shared transcription lifecycle.
Updated the custom embeddings flow to parse audio_segments from the transcription JSON.
Replaced deprecated generic stacktrace logging with specific error messages.
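As an illustration of the audio_segments parsing step, here is a minimal Python sketch that turns a transcription JSON document into CSV rows. It is not the PR's code: the function name `segments_to_csv` and the CSV column order are hypothetical, and it assumes the segments sit under `results.audio_segments` with `start_time`, `end_time`, and `transcript` fields.

```python
import csv
import io
import json


def segments_to_csv(transcription_json: str) -> str:
    """Convert the audio_segments of a transcription result into CSV text.

    Hypothetical helper: the field names below are an assumption about
    the transcription JSON layout, not taken from the PR's code.
    """
    payload = json.loads(transcription_json)
    # Assumed layout: segments live under results.audio_segments.
    segments = payload.get("results", {}).get("audio_segments", [])

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["start_time", "end_time", "transcript"])
    for segment in segments:
        # Missing fields become empty cells instead of raising.
        writer.writerow([
            segment.get("start_time", ""),
            segment.get("end_time", ""),
            segment.get("transcript", ""),
        ])
    return buffer.getvalue()
```

Reading segments directly from the JSON keeps the embeddings path independent of job status handling; a result with no segments yields a header-only CSV.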
How to Test
Run a transcript request for an audio file that already exists in the configured S3 path.
Run a transcript request for an audio file that is not in S3 but is available locally through the normal application flow.
Run the custom embeddings flow for a supported audio file.
Run the missing-file flow with a user who does not have edit access to the storage engine.
Expected outcomes
Existing S3 audio files are transcribed without being deleted.
Missing S3 audio files are uploaded automatically, transcribed, and cleaned up after processing.
Custom embeddings output is generated from audio_segments in the transcription JSON.
Users without storage edit permission cannot upload missing files.
Notes
This preserves the existing objectPath-based storage contract.
The older custom embeddings flow had a logic bug because it treated transcript text as a TranscriptionJobStatus.
No new security issue is introduced. The upload path is permission-checked, and cleanup is limited to request-created artifacts.
Problem:
The previous AWS Transcribe flow assumed the input audio file already existed in S3, which forced a manual AWS console upload step before transcript generation could run. The custom embeddings path tried to compensate for that separately, but it duplicated S3 lifecycle handling and relied on a broken status check.
Fix:
This change moves the S3 existence check, conditional upload, transcription execution, result retrieval, and cleanup into the shared base transcribe engine. The custom embeddings flow now reuses that shared lifecycle and only handles segment parsing for CSV output.
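The shared lifecycle described above can be sketched roughly as follows. This is a simplified Python model, not the engine's implementation: `InMemoryStorage`, `StoragePermissionError`, `run_transcription`, the `transcribe` callable, and the `.transcript.json` output naming are all hypothetical stand-ins for the storage engine and the AWS Transcribe job.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


class StoragePermissionError(Exception):
    """Raised when a user without edit access tries to upload."""


@dataclass
class InMemoryStorage:
    """Hypothetical stand-in for the S3-backed storage engine."""
    objects: Dict[str, bytes] = field(default_factory=dict)
    editors: Set[str] = field(default_factory=set)

    def exists(self, path: str) -> bool:
        return path in self.objects

    def can_edit(self, user: str) -> bool:
        return user in self.editors

    def put(self, path: str, data: bytes) -> None:
        self.objects[path] = data

    def get(self, path: str) -> bytes:
        return self.objects[path]

    def delete(self, path: str) -> None:
        self.objects.pop(path, None)


def run_transcription(
    storage: InMemoryStorage,
    user: str,
    object_path: str,
    local_audio: bytes,
    transcribe: Callable[[bytes], bytes],
) -> bytes:
    """Check, conditionally upload, transcribe, fetch the result, clean up."""
    uploaded_by_request = False
    if not storage.exists(object_path):
        # Only the upload path is gated by edit permission.
        if not storage.can_edit(user):
            raise StoragePermissionError(f"{user} cannot upload to storage")
        storage.put(object_path, local_audio)
        uploaded_by_request = True

    output_path = object_path + ".transcript.json"  # hypothetical naming
    try:
        # Stand-in for the job writing its output to storage.
        storage.put(output_path, transcribe(storage.get(object_path)))
        return storage.get(output_path)
    finally:
        # Always remove generated output; remove the audio only if this
        # request uploaded it, so pre-existing files are left alone.
        storage.delete(output_path)
        if uploaded_by_request:
            storage.delete(object_path)
```

Tracking `uploaded_by_request` is what lets cleanup distinguish request-created audio from files that were already in storage, and the `finally` block means cleanup also runs when transcription fails.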
Security:
This change does not introduce a new security issue. The new upload path is used only when the audio file is missing from S3, and it is gated by storage edit permission. Cleanup removes only artifacts created by the request, never pre-existing S3 audio files.