Case Study
YouTube Research Agent (2025) — Automating YouTube Surveying with LLM Summaries
An open-source Python CLI tool that searches YouTube for high-view-count videos on a topic, extracts transcripts, summarizes each video using GPT-3.5 Turbo via a LangChain map-reduce chain, and emails a formatted digest to a recipient.
Tech Stack
YouTube Search
- google-api-python-client (YouTube Data API v3)
Transcript Extraction
- youtube-transcript-api
LLM Orchestration
- LangChain (langchain, langchain-openai, langchain-community)
Model
- OpenAI gpt-3.5-turbo
Email
- smtplib + email.mime (SMTP/TLS)
Config
- python-decouple
- python-dotenv
Runtime
- Python 3.11+
Problem
Researchers and knowledge workers often need to survey YouTube for information on a topic, which requires manual searching, skimming, and synthesizing notes across multiple videos.
Solution
Given a topic and recipient email, the agent:
1. Searches YouTube for relevant, high-view-count videos
2. Extracts transcripts
3. Summarizes each video using GPT-3.5 Turbo with a map-reduce summarization chain
4. Emails a digest of the top videos and their summaries to a specified recipient
Architecture & Data Flow
- CLI collects: topic, recipient email, max videos (default 5)
- Search module queries YouTube Data API v3, filters to videos published after 2024-01-01, orders results by view count, then fetches a transcript for each video (handles missing transcripts)
- Summarizer combines title + description + transcript, splits the text into 2000-character chunks with 200-character overlap using RecursiveCharacterTextSplitter, then runs LangChain map-reduce summarization using ChatOpenAI (gpt-3.5-turbo, temperature=0)
- Email module formats summaries as plain text and sends them via SMTP/TLS (default: Gmail SMTP on port 587)
- Output: returns summaries to CLI for console display and sends the email digest
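The search step above maps fairly directly onto the parameters of the YouTube Data API v3 `search.list` endpoint. The following is a minimal sketch of how those parameters might be assembled; the helper name `build_search_params` and the example topic are hypothetical, and the project's actual request may differ.

```python
from datetime import datetime, timezone


def build_search_params(topic: str, max_videos: int = 5) -> dict:
    """Build keyword arguments for a YouTube Data API v3 search.list call.

    Hypothetical helper: the filter date and ordering follow the case study,
    but the function itself is illustrative, not the project's code.
    """
    published_after = datetime(2024, 1, 1, tzinfo=timezone.utc)
    return {
        "part": "snippet",
        "q": topic,
        "type": "video",       # exclude channels and playlists
        "order": "viewCount",  # highest view count first
        # The API expects an RFC 3339 timestamp for publishedAfter.
        "publishedAfter": published_after.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "maxResults": max_videos,
    }


params = build_search_params("vector databases", max_videos=3)

# With google-api-python-client installed and an API key available,
# the request itself would look roughly like this:
#   from googleapiclient.discovery import build
#   youtube = build("youtube", "v3", developerKey=API_KEY)
#   response = youtube.search().list(**params).execute()
```

Keeping parameter construction separate from the network call makes the date filter and ordering easy to inspect without an API key.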
Key Files
- src/main.py: orchestrator (YouTubeResearchAgent) + CLI input
- src/config/settings.py: loads env vars (YouTube/OpenAI/email/SMTP)
- src/youtube/search.py: search + transcript extraction
- src/summarizer/summarize.py: chunking + map-reduce summarization chain
- src/email/sender.py: formats and sends email via SMTP/TLS
- tests/: present but currently empty placeholders
Design Patterns
- Separation of concerns across search, summarization, email modules
- Orchestrator pattern in main.py
- Map-reduce summarization to handle long transcripts
- Centralized configuration loading via settings.py
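To illustrate the orchestrator pattern listed above, here is a small sketch in which the three stages are passed in as plain callables. The names `Video` and `run_pipeline` are hypothetical and do not come from the project. Skipping videos without a transcript is an assumption made for this sketch; the source only says that missing transcripts are handled.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Video:
    title: str
    transcript: Optional[str]  # None when no transcript is available


def run_pipeline(
    topic: str,
    recipient: str,
    search: Callable[[str, int], list],
    summarize: Callable[[Video], str],
    send: Callable[[str, dict], None],
    max_videos: int = 5,
) -> dict:
    """Run search -> summarize -> email and return summaries keyed by title."""
    summaries = {}
    for video in search(topic, max_videos):
        if video.transcript is None:
            continue  # assumed policy: skip videos without a transcript
        summaries[video.title] = summarize(video)
    send(recipient, summaries)
    return summaries


# Stub stages stand in for the real search, summarizer, and email modules.
sent = []
result = run_pipeline(
    "rust async",
    "reader@example.com",
    search=lambda topic, n: [Video("A", "text a"), Video("B", None)][:n],
    summarize=lambda v: v.transcript.upper(),
    send=lambda to, s: sent.append((to, dict(s))),
)
```

Because the orchestrator depends only on callables, each stage can be replaced with a stub, which is what makes the modules independently testable.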
Technical Architecture
A Python CLI tool with clear separation of concerns. The orchestrator (src/main.py) coordinates three modules. The search module (src/youtube/search.py) queries YouTube Data API v3, filters to videos published after 2024-01-01, orders by view count, and extracts transcripts via youtube-transcript-api. The summarizer module (src/summarizer/summarize.py) combines title + description + transcript, chunks the text into 2000-character segments with 200-character overlap using RecursiveCharacterTextSplitter, and runs a LangChain map-reduce summarization chain using ChatOpenAI (gpt-3.5-turbo, temperature=0). The email module (src/email/sender.py) formats summaries as plain text and sends them via SMTP/TLS (default: Gmail SMTP on port 587). Configuration is centralized in src/config/settings.py using python-decouple and python-dotenv.
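The effect of the chunk size and overlap settings can be seen with a simplified stand-in for the splitter. This sketch uses fixed character windows; RecursiveCharacterTextSplitter itself prefers to break on separators such as paragraph breaks, newlines, and spaces, so its chunks are usually shorter than the limit and its overlaps are not exact. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int = 2000, overlap: int = 200) -> list:
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: it ignores natural separators.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each new window starts this far after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# A 5000-character input yields windows starting at 0, 1800, and 3600.
text = "".join(str(i % 10) for i in range(5000))
chunks = split_with_overlap(text)
```

Each window repeats the last 200 characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.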
Key Decisions
1. Used a LangChain map-reduce chain instead of simple prompt truncation: every part of the transcript reaches the model regardless of length
2. Separated search, summarization, and email into distinct modules: clear boundaries make each component independently testable
3. Used RecursiveCharacterTextSplitter with 200-character overlap: maintains context across chunk boundaries for coherent summaries
4. Chose gpt-3.5-turbo with temperature=0: more repeatable output for research summarization, at lower cost than GPT-4
5. Centralized all config in settings.py via python-decouple: single source of truth for API keys and SMTP credentials
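The map-reduce decision can be sketched without any LLM dependency by treating the model as a plain callable. This is an illustration of the general pattern, not the project's chain: `map_reduce_summarize`, the prompt wording, and `fake_llm` are all hypothetical. In LangChain, a comparable chain is commonly built with `load_summarize_chain(llm, chain_type="map_reduce")`, though the source does not say which helper the project uses.

```python
from typing import Callable


def map_reduce_summarize(chunks: list, llm: Callable[[str], str]) -> str:
    """Summarize each chunk (map), then summarize the combined summaries (reduce)."""
    if not chunks:
        return ""
    # Map step: one model call per chunk, each independent of the others.
    partials = [llm(f"Summarize this passage:\n\n{chunk}") for chunk in chunks]
    if len(partials) == 1:
        return partials[0]  # a single chunk needs no reduce step
    # Reduce step: one more call over the concatenated partial summaries.
    combined = "\n".join(partials)
    return llm(f"Combine these partial summaries into one summary:\n\n{combined}")


# A fake model that records prompts, so the call pattern is visible.
calls = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"summary-{len(calls)}"


final = map_reduce_summarize(["chunk a", "chunk b", "chunk c"], fake_llm)
```

Three chunks produce four model calls: three map calls plus one reduce call. Cost and latency therefore grow with transcript length, which is the price paid for not truncating the input.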
Tradeoffs
- CLI-only interface limits accessibility but keeps scope focused and avoids web infrastructure complexity
- Synchronous processing means longer wait times for multiple videos, but simplifies implementation and debugging
- Plain-text email limits formatting but avoids HTML rendering complexity and email client compatibility issues
- Hardcoded 2024-01-01 date filter limits flexibility but scopes results to recent content by default
- Stateless design (no caching/db) means repeated queries re-fetch and re-summarize, but avoids storage management
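As an illustration of the plain-text email tradeoff, the sketch below builds a digest message using only the standard library. It uses `email.message.EmailMessage` rather than the `email.mime` classes named in the tech stack, and the function name `build_digest`, the subject line, and the addresses are hypothetical.

```python
from email.message import EmailMessage


def build_digest(sender: str, recipient: str, topic: str, summaries: dict) -> EmailMessage:
    """Format summaries as a numbered plain-text digest message."""
    sections = [
        f"{i}. {title}\n{summary}"
        for i, (title, summary) in enumerate(summaries.items(), start=1)
    ]
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"YouTube research digest: {topic}"
    msg.set_content("\n\n".join(sections))  # str content defaults to text/plain
    return msg


msg = build_digest(
    "agent@example.com",
    "reader@example.com",
    "llm agents",
    {"Video A": "Covers X.", "Video B": "Covers Y."},
)

# Sending over SMTP with STARTTLS on port 587 would look roughly like:
#   import smtplib
#   with smtplib.SMTP("smtp.gmail.com", 587) as server:
#       server.starttls()
#       server.login(SMTP_USER, SMTP_PASSWORD)
#       server.send_message(msg)
```

A plain-text body needs no templating or client-specific testing, which is the simplicity the tradeoff above is buying.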
Current Limitations
- CLI-only (no web UI / API)
- Tests are stubs (empty)
- Uses print statements instead of structured logging
- Synchronous processing (no async/parallel)
- Hardcoded filter: videos after 2024-01-01
- No retry/rate limiting
- Plain-text email only
- Stateless (no caching/db)
Lessons
- Map-reduce is a practical pattern for summarizing content that exceeds LLM context windows without truncating the input
- Separation of concerns across modules pays off immediately when debugging pipeline stages independently
- Even simple CLI tools benefit from centralized configuration — scattered env var access is a maintenance hazard
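The configuration lesson can be sketched with the standard library alone. The project uses python-decouple, where the equivalent reads would be calls such as `config("SMTP_PORT", default=587, cast=int)`; the version below reads from a mapping instead so it runs without extra packages. The variable names (`YOUTUBE_API_KEY`, `OPENAI_API_KEY`, `SMTP_HOST`, `SMTP_PORT`) and the `Settings` class are assumptions for illustration, not the project's actual names.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    youtube_api_key: str
    openai_api_key: str
    smtp_host: str
    smtp_port: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read every setting in one place and fail early if a required key is missing."""
    required = ("YOUTUBE_API_KEY", "OPENAI_API_KEY")  # hypothetical variable names
    missing = [name for name in required if name not in env]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    return Settings(
        youtube_api_key=env["YOUTUBE_API_KEY"],
        openai_api_key=env["OPENAI_API_KEY"],
        smtp_host=env.get("SMTP_HOST", "smtp.gmail.com"),
        smtp_port=int(env.get("SMTP_PORT", "587")),
    )


# Passing a dict instead of os.environ keeps the example self-contained.
settings = load_settings({"YOUTUBE_API_KEY": "yt-key", "OPENAI_API_KEY": "oa-key"})
```

With a single loader, a missing key surfaces at startup rather than midway through a run, and other modules receive a typed object instead of reading environment variables themselves.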