Turn any video into a detailed, multi-page written analysis — fully offline, fully local.
FrameRead is an AI-powered video analysis pipeline that extracts audio transcriptions and visual frame descriptions from any video file, then synthesizes them into an exhaustive natural-language summary. Optionally, ask it a specific question and get a precise, evidence-backed answer.
Everything runs locally on your machine — no API keys, no cloud services, no data leaves your device.
- Dual-Pipeline Analysis — Processes both audio (speech) and video (frames) in parallel pipelines, then fuses them into a unified summary.
- Scene-Change Keyframe Extraction — Intelligently detects visual scene changes using histogram + SSIM comparison rather than naive interval sampling.
- Dynamic GPU Batching — Automatically profiles your GPU's VRAM at runtime and calculates the optimal batch size for vision inference. No manual tuning needed.
- Fully Offline — All models run locally. No internet required after initial model downloads.
- Importable as a Module — Use it from the command line or import it into your own Python project.
- Rich Structured Logging — Every pipeline stage emits timestamped, module-tagged logs for full observability.
- Automatic Cleanup — All temporary files (audio, frames, intermediate text) are deleted after each run.
- Hardware-Adaptive — Automatically detects GPU/CPU and selects optimal dtypes, batch sizes, and compute strategies.
- Prompt Q&A — Optionally pass a question to get a targeted answer grounded in the video content.
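The histogram half of the scene-change comparison can be sketched in a few lines. This is an illustrative reconstruction, not FrameRead's actual code (which also applies SSIM in `src/video/frame_extractor.py`); the function names and the `threshold` value are assumptions:

```python
import numpy as np

def hist_distance(frame_a, frame_b, bins=32):
    """Total-variation distance between grayscale histograms (0 = identical, 1 = disjoint)."""
    ha, _ = np.histogram(frame_a, bins=bins, range=(0, 255))
    hb, _ = np.histogram(frame_b, bins=bins, range=(0, 255))
    ha = ha / ha.sum()
    hb = hb / hb.sum()
    return 0.5 * np.abs(ha - hb).sum()

def select_keyframes(frames, threshold=0.3):
    """Keep frame 0, then every frame that differs enough from the last kept keyframe."""
    kept = [0]
    last = frames[0]
    for i, frame in enumerate(frames[1:], start=1):
        if hist_distance(last, frame) > threshold:
            kept.append(i)
            last = frame
    return kept
```

Comparing against the *last kept keyframe* (rather than the immediately previous frame) is what makes this robust to slow fades, and it is why a 900-frame video can collapse to a handful of keyframes.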
The pipeline follows a three-stage architecture:
```
Video File ──┬── Audio Pipeline ──→ Transcription (faster-whisper)
             │
             └── Video Pipeline ──→ Frame Descriptions (Qwen2-VL, local HF)
                          │
                          ▼
          Synthesis Layer (qwen3.5:9b via Ollama)
                          │
              ┌───────────┴───────────┐
              ▼                       ▼
       Master Summary             Q&A Answer
         (always)             (if prompt given)
```
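The fusion step in the diagram above can be sketched as a plain prompt builder that concatenates both pipelines' outputs before handing them to the synthesis model. The function name and prompt wording here are hypothetical; the real template lives in `src/llm/summarizer.py`:

```python
def build_synthesis_prompt(transcript: str, frame_descriptions: list[str]) -> str:
    """Fuse the audio transcript and per-keyframe descriptions into one LLM prompt."""
    frames = "\n".join(
        f"[keyframe {i}] {desc}" for i, desc in enumerate(frame_descriptions, start=1)
    )
    return (
        "Write an exhaustive, multi-page analysis of the following video.\n\n"
        f"=== AUDIO TRANSCRIPT ===\n{transcript}\n\n"
        f"=== KEYFRAME DESCRIPTIONS ===\n{frames}\n"
    )
```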
📖 For the full architecture diagram, pipeline details, and developer blueprint, see
VideoAnalyzer_ProjectSpec.md.
| Model | Purpose | Runtime | Size |
|---|---|---|---|
| distil-whisper/distil-large-v3 | Audio transcription | faster-whisper (CTranslate2) | ~1.5 GB |
| Qwen/Qwen2-VL-2B-Instruct | Visual frame analysis | HuggingFace Transformers (local) | ~4.5 GB |
| qwen3.5:9b | Summary synthesis & Q&A | Ollama (local) | ~6 GB |
All models are downloaded automatically on first run and cached locally for future use.
- Python ≥ 3.10
- ffmpeg installed and on PATH (install guide)
- Ollama installed (ollama.com)
- CUDA GPU recommended (8+ GB VRAM) — CPU mode works but is significantly slower
```bash
# Clone the repository
git clone https://github.com/s-pra1ham/FrameRead.git
cd FrameRead

# Create and activate a virtual environment
python -m venv venv

# Windows
venv\Scripts\activate

# macOS/Linux
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```

```bash
python run.py video.mp4
```

This processes the entire video and prints a comprehensive multi-page summary covering:
- Overview & purpose
- Detailed chronological narrative
- Key points & concepts
- Visual highlights
- Speakers & participants
- Tone & style
```bash
python run.py video.mp4 --prompt "What tools or technologies are mentioned?"
python run.py lecture.mp4 -p "Summarize the main argument in 3 bullet points."
```

This generates the full summary internally, then uses it to answer your question with cited evidence.
Use FrameRead programmatically in your own scripts or projects:
```python
from src import analyze

result = analyze(video_path="path/to/video.mp4")

print(result.summary)           # Full multi-page analysis
print(result.transcription)     # Timestamped transcript
print(result.keyframe_count)    # Number of keyframes extracted
print(result.duration_seconds)  # Total pipeline time in seconds
```

With a question:

```python
from src import analyze

result = analyze(
    video_path="path/to/video.mp4",
    prompt="What products are shown in this video?"
)

print(result.prompt_answer)  # Direct answer to your question
print(result.summary)        # Full summary is still available
```

| Field | Type | Description |
|---|---|---|
| summary | str | Complete multi-page master summary (always present) |
| prompt_answer | str \| None | Answer to your prompt (only if prompt was provided) |
| keyframe_count | int | Number of scene-change keyframes extracted |
| transcription | str | Full timestamped transcript of spoken audio |
| duration_seconds | float | Total wall-clock pipeline time |
| video_path | str | Absolute path to the analyzed video |
Run FrameRead on Google Colab's free T4 GPU — no local GPU required.
```
FrameRead/
├── src/                        ← Main package
│   ├── __init__.py             ← Public API: analyze()
│   ├── analyzer.py             ← Pipeline orchestrator
│   ├── config.py               ← All tunable constants
│   ├── audio/
│   │   ├── extractor.py        ← ffmpeg audio extraction
│   │   └── transcriber.py      ← Whisper transcription
│   ├── video/
│   │   ├── frame_extractor.py  ← Scene-change keyframe extraction
│   │   └── frame_analyzer.py   ← Qwen2-VL vision inference
│   ├── llm/
│   │   ├── ollama_manager.py   ← Ollama process & model management
│   │   └── summarizer.py       ← Summary + Q&A generation
│   └── utils/
│       ├── hardware.py         ← GPU/CPU detection
│       ├── logger.py           ← Centralized logging
│       ├── model_manager.py    ← Whisper model cache management
│       └── cleanup.py          ← Temp directory lifecycle
├── docs/
│   └── VideoAnalyzer_ProjectSpec.md  ← Full technical specification
├── run.py                      ← CLI entry point
├── requirements.txt
└── setup.py
```
📖 For the complete developer specification, see
docs/VideoAnalyzer_ProjectSpec.md.
- Input — You provide a video file path and an optional prompt.
- Audio Extraction — `ffmpeg` strips the audio track into a 16 kHz mono WAV.
- Transcription — `faster-whisper` transcribes every spoken word with timestamps.
- Keyframe Extraction — OpenCV reads every frame; histogram + SSIM comparison detects scene changes and saves only the meaningful keyframes.
- Vision Analysis — Each keyframe is described in detail by `Qwen2-VL-2B-Instruct` running locally. On GPU, batch size is dynamically calculated via a VRAM probe to maximize throughput without OOM.
- Synthesis — The full transcript + all frame descriptions are fed to `qwen3.5:9b` (via Ollama), which produces the master summary.
- Q&A (optional) — If a prompt was given, the summary is used as context to answer the question.
- Cleanup — All temporary files are automatically deleted.
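The dynamic batching in the Vision Analysis step boils down to dividing free VRAM by a per-frame memory cost. A minimal sketch, with an illustrative `bytes_per_frame` constant and a hypothetical function name — FrameRead derives the real figures from its runtime probe (on CUDA, free memory is available via `torch.cuda.mem_get_info()`):

```python
def dynamic_batch_size(free_vram_bytes: int,
                       bytes_per_frame: int = 1_500_000_000,
                       max_batch: int = 8) -> int:
    """Largest vision-inference batch that fits in free VRAM, clamped to [1, max_batch].

    bytes_per_frame is an assumed per-keyframe activation cost for illustration only.
    """
    return max(1, min(max_batch, free_vram_bytes // bytes_per_frame))
```

Clamping to at least 1 keeps CPU-sized or nearly-full GPUs working (just slowly), while the upper cap avoids chasing diminishing returns on large cards.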
```
[00:00:00.000] [INIT] Starting VideoAnalyzer pipeline for: C:\videos\demo.mp4
[00:00:00.012] [HARDWARE] ── Hardware Survey ──────────────────────────
[00:00:00.013] [HARDWARE] Device: CUDA (GPU)
[00:00:00.013] [HARDWARE] GPU: NVIDIA GeForce RTX 4060
[00:00:00.013] [HARDWARE] VRAM: 8.0 GB
[00:00:00.014] [HARDWARE] Torch dtype: float16
[00:00:00.014] [HARDWARE] ─────────────────────────────────────────────
[00:00:01.220] [AUDIO] Extracting audio from video...
[00:00:03.891] [AUDIO] ✓ Audio extracted in 2.7s
[00:00:07.441] [TRANSCRIBE] ✓ Transcription complete — 12 segments, 143 words, 3.5s
[00:00:07.500] [FRAMES] ✓ Extraction complete — 8 keyframes from 900 total frames (0.5s)
[00:00:08.100] [VISION] Vision Mode: Local Inference (Dynamic Batch size: 2)
[00:00:45.200] [VISION] All frames analyzed -- 37.1s total (Dynamic Batch Size: 2)
[00:01:12.300] [SUMMARY] ✓ Summary generated — 2847 words, 3891 tokens (27.1s)
[00:01:12.400] [CLEANUP] ✓ Cleaned up 11 files
[00:01:12.401] [DONE] ✨ Total pipeline time: 72.4s
```
This project is for personal and educational use.
S. Pratham — GitHub