AWS transcribe S3 lifecycle and custom embeddings flow by Subhadeepghosh1 · Pull Request #2408 · SEMOSS/Semoss

Subhadeepghosh1 · 2026-04-24T12:15:52Z

Description

This PR removes the manual S3 pre-upload dependency from the AWS Transcribe flow and centralizes the audio file lifecycle in the base transcribe engine. It also fixes the custom embeddings flow by reusing the same shared transcription lifecycle instead of maintaining a separate and incorrect S3 handling path.

Changes Made

Added S3 existence checks before starting transcription.
Added upload of missing audio files when they are not already present in S3.
Added storage edit permission validation before uploading a missing file.
Added cleanup for generated transcription output and request-uploaded audio files.
Refactored the custom embeddings flow to reuse the shared transcription lifecycle.
Updated the custom embeddings flow to parse audio_segments from the transcription JSON.
Replaced deprecated generic stacktrace logging with specific error messages.
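
The pre-transcription steps listed above (existence check, permission check, conditional upload) can be sketched roughly as follows. This is a minimal illustration in Python against an in-memory stand-in for the storage engine, not the code in this PR; `FakeStorage`, `ensure_audio_in_storage`, and the `can_edit` flag are hypothetical names introduced only for the example.

```python
from dataclasses import dataclass, field


@dataclass
class FakeStorage:
    """In-memory stand-in for an S3-backed storage engine (hypothetical)."""
    objects: dict = field(default_factory=dict)

    def exists(self, object_path: str) -> bool:
        return object_path in self.objects

    def upload(self, object_path: str, data: bytes) -> None:
        self.objects[object_path] = data


def ensure_audio_in_storage(storage, object_path, local_audio, can_edit):
    """Return True if this request uploaded the file, False if it already existed.

    Raises PermissionError when the file is missing and the caller lacks
    edit permission on the storage engine.
    """
    if storage.exists(object_path):
        # Pre-existing file: nothing to upload, and nothing to clean up later.
        return False
    if not can_edit:
        raise PermissionError(f"edit permission required to upload {object_path}")
    storage.upload(object_path, local_audio)
    return True
```

The boolean return value is one way to let the caller remember whether the audio file was request-uploaded, so that later cleanup can leave pre-existing files alone.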

How to Test

Run a transcript request for an audio file that already exists in the configured S3 path.
Run a transcript request for an audio file that is not in S3 but is available locally through the normal application flow.
Run the custom embeddings flow for a supported audio file.
Run the missing-file flow with a user who does not have edit access to the storage engine.

Expected outcomes

Existing S3 audio files are transcribed without being deleted.
Missing S3 audio files are uploaded automatically, transcribed, and cleaned up after processing.
Custom embeddings output is generated from audio_segments in the transcription JSON.
Users without storage edit permission cannot upload missing files.

Notes

This preserves the existing objectPath-based storage contract.
The older custom embeddings flow had a logic bug because it treated transcript text as a TranscriptionJobStatus.
No new security issue is introduced. The upload path is permission-checked, and cleanup is limited to request-created artifacts.

Subhadeepghosh1 · 2026-04-24T12:16:17Z

Problem:
The previous AWS Transcribe flow assumed the input audio file already existed in S3, which forced a manual AWS console upload step before transcript generation could run. The custom embeddings path tried to compensate for that separately, but it duplicated S3 lifecycle handling and relied on a broken status check.

Fix:
This change moves the S3 existence check, conditional upload, transcription execution, result retrieval, and cleanup into the shared base transcribe engine. The custom embeddings flow now reuses that shared lifecycle and only handles segment parsing for CSV output.
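
As an illustration of the segment parsing step, the sketch below converts `audio_segments` from a transcription JSON document into CSV text. It is written in Python for brevity and is not the code in this PR. The JSON layout (`results.audio_segments` with `start_time`, `end_time`, and `transcript` per segment) is an assumption based on the usual AWS Transcribe output shape, and the function name and CSV column choice are hypothetical.

```python
import csv
import io
import json


def segments_to_csv(transcription_json: str) -> str:
    """Convert results.audio_segments into CSV text, one row per segment."""
    payload = json.loads(transcription_json)
    # Assumed layout: {"results": {"audio_segments": [{...}, ...]}}
    segments = payload.get("results", {}).get("audio_segments", [])

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["start_time", "end_time", "transcript"])
    for segment in segments:
        writer.writerow([
            segment.get("start_time", ""),
            segment.get("end_time", ""),
            segment.get("transcript", ""),
        ])
    return buffer.getvalue()
```

Using a CSV writer rather than manual string joining means transcript text containing commas or quotes is escaped correctly.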

Security:
This change does not introduce a new security issue. The new upload path is used only when the audio file is missing from S3 and is gated by storage edit permission. Cleanup removes only artifacts created by the request, never pre-existing S3 audio files.
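
The cleanup guarantee can be pictured with a small sketch: track the keys a request creates and delete only those, even when processing fails. This is an illustrative Python example using a plain dict as a stand-in bucket, not the code in this PR; `request_artifacts` and `put` are hypothetical names.

```python
from contextlib import contextmanager


@contextmanager
def request_artifacts(storage: dict):
    """Track keys created during one request and delete only those on exit."""
    created = []

    def put(key, data):
        # Mirror the existence check: never touch a key that is already present.
        if key in storage:
            return
        created.append(key)
        storage[key] = data

    try:
        yield put
    finally:
        # Runs on success and on error; pre-existing keys are never in `created`.
        for key in created:
            storage.pop(key, None)
```

Inside a `with request_artifacts(bucket) as put:` block, anything written through `put` is removed when the block exits, while objects that were already in the bucket are left as they were.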

fix: AWS transcribe S3 lifecycle and custom embeddings flow

51eae47

Subhadeepghosh1 requested a review from Pragya7011 April 24, 2026 12:15

Subhadeepghosh1 self-assigned this Apr 24, 2026

Subhadeepghosh1 requested a review from a team as a code owner April 24, 2026 12:15

Subhadeepghosh1 added 3 commits April 27, 2026 13:36

Merge branch 'dev' into AWSTranscribe

8f97344

Merge branch 'dev' into AWSTranscribe

2c7e581

Merge branch 'dev' into AWSTranscribe

aab4c3f

Subhadeepghosh1 wants to merge 4 commits into dev from AWSTranscribe
