AWS transcribe S3 lifecycle and custom embeddings flow #2408

Open
Subhadeepghosh1 wants to merge 4 commits into dev from AWSTranscribe

@Subhadeepghosh1
Contributor

Description

This PR removes the manual S3 pre-upload dependency from the AWS Transcribe flow and centralizes the audio file lifecycle in the base transcribe engine. It also fixes the custom embeddings flow by reusing the same shared transcription lifecycle instead of maintaining a separate and incorrect S3 handling path.

Changes Made

  • Added S3 existence checks before starting transcription.
  • Added upload of missing audio files when they are not already present in S3.
  • Added storage edit permission validation before uploading a missing file.
  • Added cleanup for generated transcription output and request-uploaded audio files.
  • Refactored the custom embeddings flow to reuse the shared transcription lifecycle.
  • Updated the custom embeddings flow to parse audio_segments from the transcription JSON.
  • Replaced deprecated generic stacktrace logging with specific error messages.
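The lifecycle described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `FakeStorage`, `run_transcription`, `PermissionDenied`, and the `.json` output naming are all hypothetical stand-ins, and the storage engine is replaced by an in-memory dictionary so the ordering of the steps is easy to follow.

```python
from dataclasses import dataclass, field


class PermissionDenied(Exception):
    """Raised when the caller may not write to the storage engine."""


@dataclass
class FakeStorage:
    """In-memory stand-in for the S3-backed storage engine (hypothetical)."""
    objects: dict = field(default_factory=dict)
    editors: set = field(default_factory=set)

    def exists(self, path):
        return path in self.objects

    def can_edit(self, user):
        return user in self.editors

    def upload(self, path, data):
        self.objects[path] = data

    def delete(self, path):
        self.objects.pop(path, None)


def run_transcription(storage, user, object_path, local_audio, transcribe):
    """Check, conditionally upload, transcribe, then clean up request-created objects."""
    created = []  # only paths created by this request are eligible for cleanup
    try:
        if not storage.exists(object_path):
            # Missing audio: uploading requires edit permission on the storage engine.
            if not storage.can_edit(user):
                raise PermissionDenied(f"{user} cannot upload to {object_path}")
            storage.upload(object_path, local_audio)
            created.append(object_path)
        output_path = object_path + ".json"  # assumed output naming, for illustration
        created.append(output_path)
        storage.upload(output_path, transcribe(storage.objects[object_path]))
        return storage.objects[output_path]
    finally:
        # Pre-existing audio is never in `created`, so it survives cleanup.
        for path in created:
            storage.delete(path)
```

Tracking created paths in a list, rather than deleting by naming convention, is one simple way to keep cleanup limited to request-created artifacts even when the request fails partway through.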

How to Test

  1. Run a transcript request for an audio file that already exists in the configured S3 path.
  2. Run a transcript request for an audio file that is not in S3 but is available locally through the normal application flow.
  3. Run the custom embeddings flow for a supported audio file.
  4. Run the missing-file flow with a user who does not have edit access to the storage engine.

Expected outcomes

  • Existing S3 audio files are transcribed without being deleted.
  • Missing S3 audio files are uploaded automatically, transcribed, and cleaned up after processing.
  • Custom embeddings output is generated from audio_segments in the transcription JSON.
  • Users without storage edit permission cannot upload missing files.

Notes

  • This preserves the existing objectPath-based storage contract.
  • The older custom embeddings flow had a logic bug because it treated transcript text as a TranscriptionJobStatus.
  • No new security issue is introduced. The upload path is permission-checked, and cleanup is limited to request-created artifacts.
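The old code is not shown in this description, so the following is only a guess at how a transcript-versus-status mix-up can look in practice. The enum below mirrors the status values AWS Transcribe reports (QUEUED, IN_PROGRESS, FAILED, COMPLETED); the two helper functions are hypothetical and exist only to contrast the two checks.

```python
from enum import Enum


class TranscriptionJobStatus(Enum):
    """Local stand-in for the job status values reported by AWS Transcribe."""
    QUEUED = "QUEUED"
    IN_PROGRESS = "IN_PROGRESS"
    FAILED = "FAILED"
    COMPLETED = "COMPLETED"


def is_completed_buggy(transcript_text):
    # Compares transcript text against a status value, so it is false
    # for any real transcript and the "completed" branch never runs.
    return transcript_text == TranscriptionJobStatus.COMPLETED.value


def is_completed(status):
    # Compares the job status itself, which is what the check should inspect.
    return status is TranscriptionJobStatus.COMPLETED
```

Reusing the shared lifecycle avoids this class of bug because the status check lives in one place and the embeddings flow only receives the finished transcription JSON.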

@Subhadeepghosh1 Subhadeepghosh1 self-assigned this Apr 24, 2026
@Subhadeepghosh1 Subhadeepghosh1 requested a review from a team as a code owner April 24, 2026 12:15
@Subhadeepghosh1
Contributor Author

Problem:
The previous AWS Transcribe flow assumed the input audio file already existed in S3, which forced a manual AWS console upload step before transcript generation could run. The custom embeddings path tried to compensate for that separately, but it duplicated S3 lifecycle handling and relied on a broken status check.

Fix:
This change moves the S3 existence check, conditional upload, transcription execution, result retrieval, and cleanup into the shared base transcribe engine. The custom embeddings flow now reuses that shared lifecycle and only handles segment parsing for CSV output.
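For reviewers, the segment-parsing step can be pictured along these lines. This is a sketch under assumptions: it expects `audio_segments` to sit under `results` in the transcription JSON with `start_time`, `end_time`, and `transcript` fields, and the CSV column layout and the `segments_to_csv` name are illustrative rather than the ones used in this PR.

```python
import csv
import io
import json


def segments_to_csv(transcription_json):
    """Convert audio_segments from a transcription JSON string into CSV text."""
    doc = json.loads(transcription_json)
    # Assumed location of the segments; missing keys yield an empty list.
    segments = doc.get("results", {}).get("audio_segments", [])
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["start_time", "end_time", "transcript"])  # assumed columns
    for seg in segments:
        writer.writerow([
            seg.get("start_time", ""),
            seg.get("end_time", ""),
            seg.get("transcript", ""),
        ])
    return buf.getvalue()
```

Using the `csv` module instead of manual string joins keeps transcripts that contain commas or quotes correctly escaped.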

Security:
This change does not introduce a new security issue. The new upload path is used only when the audio file is missing from S3, and it is gated by storage edit permission. Cleanup removes only artifacts created by the request, never pre-existing S3 audio files.
