Case Study

YouTube Research Agent (2025) — Automating YouTube Surveying with LLM Summaries

An open-source Python CLI tool that searches YouTube for high-view-count videos on a topic, extracts transcripts, summarizes each video using GPT-3.5 Turbo via a LangChain map-reduce chain, and emails a formatted digest to a recipient.

Open Source
CLI

Tech Stack

YouTube Search

  • google-api-python-client (YouTube Data API v3)

Transcript Extraction

  • youtube-transcript-api

LLM Orchestration

  • LangChain (langchain, langchain-openai, langchain-community)

Model

  • OpenAI gpt-3.5-turbo

Email

  • smtplib + email.mime (SMTP/TLS)

Config

  • python-decouple
  • python-dotenv

Runtime

  • Python 3.11+

Problem

Researchers and knowledge workers often need to survey YouTube for information on a topic, which requires manual searching, skimming, and synthesizing notes across multiple videos.

Solution

Given a topic and recipient email, the agent:

  1. Searches YouTube for relevant, high-view-count videos
  2. Extracts transcripts
  3. Summarizes each video using GPT-3.5 Turbo with a map-reduce summarization chain
  4. Emails a digest of the top videos and their summaries to a specified recipient
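
The following is a rough sketch of how these steps could fit together in src/main.py (the YouTubeResearchAgent orchestrator). The method names and data shapes here are assumptions for illustration, not the repository's actual API:

```python
# Hypothetical sketch of the pipeline in src/main.py; class wiring and method
# names are assumptions for illustration.
class YouTubeResearchAgent:
    def __init__(self, searcher, summarizer, emailer):
        self.searcher = searcher      # YouTube Data API v3 search + transcripts
        self.summarizer = summarizer  # LangChain map-reduce summarization
        self.emailer = emailer        # SMTP/TLS digest sender

    def run(self, topic: str, recipient: str, max_videos: int = 5) -> list[dict]:
        videos = self.searcher.search(topic, max_results=max_videos)
        summaries = []
        for video in videos:
            if not video.get("transcript"):
                continue  # skip videos without a usable transcript
            text = f'{video["title"]}\n{video["description"]}\n{video["transcript"]}'
            summaries.append(
                {"title": video["title"], "summary": self.summarizer.summarize(text)}
            )
        self.emailer.send_digest(recipient, topic, summaries)
        return summaries  # also displayed on the console by the CLI
```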

Architecture & Data Flow

Topic + Email → YouTube Search → Transcript Extraction → Map-Reduce Summary → Email Digest

  • CLI collects: topic, recipient email, max videos (default 5)
  • Search module queries YouTube Data API v3, filters to videos published after 2024-01-01, orders by view count, and fetches transcripts per video, handling missing ones gracefully (see the sketch after this list)
  • Summarizer combines title + description + transcript, splits text into 2000-character chunks with 200-character overlap using RecursiveCharacterTextSplitter, runs LangChain map-reduce summarization using ChatOpenAI (gpt-3.5-turbo, temperature=0)
  • Email module formats summaries as plain text and sends the digest via SMTP/TLS (default: Gmail SMTP on port 587)
  • Output: returns summaries to CLI for console display and sends the email digest
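
The search stage can be sketched roughly as follows. The google-api-python-client and youtube-transcript-api calls mirror the behavior described above (order by view count, publishedAfter 2024-01-01, per-video transcript fetch with a fallback), but the function name and return shape are assumptions; note also that get_transcript is the pre-1.0 youtube-transcript-api interface, while newer releases use an instance-based fetch method.

```python
# Hypothetical sketch of src/youtube/search.py; function name and return shape
# are assumed, but the API calls mirror the described behavior.
from googleapiclient.discovery import build
from youtube_transcript_api import YouTubeTranscriptApi

def search_videos(api_key: str, topic: str, max_results: int = 5) -> list[dict]:
    youtube = build("youtube", "v3", developerKey=api_key)
    response = youtube.search().list(
        q=topic,
        part="snippet",
        type="video",
        order="viewCount",                      # highest view count first
        publishedAfter="2024-01-01T00:00:00Z",  # hardcoded recency filter
        maxResults=max_results,
    ).execute()

    videos = []
    for item in response.get("items", []):
        video_id = item["id"]["videoId"]
        try:
            # Pre-1.0 youtube-transcript-api interface; newer versions use
            # YouTubeTranscriptApi().fetch(video_id) instead.
            segments = YouTubeTranscriptApi.get_transcript(video_id)
            transcript = " ".join(seg["text"] for seg in segments)
        except Exception:
            transcript = ""  # no usable transcript; handled downstream
        videos.append({
            "video_id": video_id,
            "title": item["snippet"]["title"],
            "description": item["snippet"]["description"],
            "transcript": transcript,
        })
    return videos
```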

Key Files

  • src/main.py: orchestrator (YouTubeResearchAgent) + CLI input
  • src/config/settings.py: loads env vars (YouTube/OpenAI/email/SMTP)
  • src/youtube/search.py: search + transcript extraction
  • src/summarizer/summarize.py: chunking + map-reduce summarization chain
  • src/email/sender.py: formats and sends email via SMTP/TLS
  • tests/: present but currently empty placeholders

Design Patterns

  • Separation of concerns across search, summarization, email modules
  • Orchestrator pattern in main.py
  • Map-reduce summarization to handle long transcripts
  • Centralized configuration loading via settings.py

Technical Architecture

A Python CLI tool with clear separation of concerns. The orchestrator (src/main.py) coordinates three modules: a search module (src/youtube/search.py) that queries YouTube Data API v3, filters to videos published after 2024-01-01, orders by view count, and extracts transcripts via youtube-transcript-api; a summarizer module (src/summarizer/summarize.py) that combines title + description + transcript, chunks text into 2000-character segments with 200-character overlap using RecursiveCharacterTextSplitter, and runs a LangChain map-reduce summarization chain using ChatOpenAI (gpt-3.5-turbo, temperature=0); and an email module (src/email/sender.py) that formats summaries into plain-text and sends via SMTP/TLS (default Gmail SMTP: port 587). Configuration is centralized in src/config/settings.py using python-decouple and python-dotenv.
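
A minimal sketch of the summarization step, consistent with the parameters above (2000-character chunks with 200-character overlap, gpt-3.5-turbo at temperature 0, LangChain map-reduce chain). The function name is an assumption, and exact LangChain import paths vary slightly between versions.

```python
# Hypothetical sketch of src/summarizer/summarize.py using the parameters described above.
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI

def summarize(text: str) -> str:
    # Split the combined title + description + transcript into overlapping chunks.
    splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
    docs = splitter.create_documents([text])

    # Deterministic, low-cost model; reads OPENAI_API_KEY from the environment.
    llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

    # Map: summarize each chunk independently; reduce: merge the chunk summaries.
    chain = load_summarize_chain(llm, chain_type="map_reduce")
    result = chain.invoke({"input_documents": docs})
    return result["output_text"]
```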

Key Decisions

  1. Used LangChain map-reduce chain over simple prompt truncation — preserves full transcript content regardless of length
  2. Separated search, summarization, and email into distinct modules — clear boundaries make each component independently testable
  3. Used RecursiveCharacterTextSplitter with 200-character overlap — maintains context across chunk boundaries for coherent summaries
  4. Chose gpt-3.5-turbo with temperature=0 — deterministic output for research summarization, lower cost than GPT-4
  5. Centralized all config in settings.py via python-decouple — single source of truth for API keys and SMTP credentials
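
A minimal sketch of what the centralized settings module could look like with python-decouple; the specific variable names and defaults (beyond the Gmail SMTP port noted earlier) are assumptions. Values are read from a .env file or the process environment.

```python
# Hypothetical sketch of src/config/settings.py; variable names are assumed.
from decouple import config

YOUTUBE_API_KEY = config("YOUTUBE_API_KEY")
OPENAI_API_KEY = config("OPENAI_API_KEY")

SMTP_HOST = config("SMTP_HOST", default="smtp.gmail.com")
SMTP_PORT = config("SMTP_PORT", default=587, cast=int)
SMTP_USER = config("SMTP_USER")
SMTP_PASSWORD = config("SMTP_PASSWORD")
```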

Tradeoffs

  • CLI-only interface limits accessibility but keeps scope focused and avoids web infrastructure complexity
  • Synchronous processing means longer wait times for multiple videos, but simplifies implementation and debugging
  • Plain-text email limits formatting but avoids HTML rendering complexity and email client compatibility issues (see the sketch after this list)
  • Hardcoded 2024-01-01 date filter limits flexibility but scopes results to recent content by default
  • Stateless design (no caching/db) means repeated queries re-fetch and re-summarize, but avoids storage management
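
A minimal sketch of the plain-text SMTP/TLS send referenced above, using only the standard library; the helper signature, subject line, and digest layout are assumptions.

```python
# Hypothetical sketch of src/email/sender.py; helper name and message layout are assumed.
import smtplib
from email.mime.text import MIMEText

def send_digest(host: str, port: int, user: str, password: str,
                recipient: str, topic: str, summaries: list[dict]) -> None:
    body = "\n\n".join(f'{s["title"]}\n{s["summary"]}' for s in summaries)
    msg = MIMEText(body, "plain")             # plain-text digest, no HTML
    msg["Subject"] = f"YouTube research digest: {topic}"
    msg["From"] = user
    msg["To"] = recipient

    with smtplib.SMTP(host, port) as server:  # e.g. smtp.gmail.com:587
        server.starttls()                     # upgrade the connection to TLS
        server.login(user, password)
        server.send_message(msg)
```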

Current Limitations

  • CLI-only (no web UI / API)
  • Tests are stubs (empty)
  • Uses print statements instead of structured logging
  • Synchronous processing (no async/parallel)
  • Hardcoded filter: videos after 2024-01-01
  • No retry/rate limiting
  • Plain-text email only
  • Stateless (no caching/db)

Lessons

  • Map-reduce is a practical pattern for summarizing content that exceeds LLM context windows without losing information
  • Separation of concerns across modules pays off immediately when debugging pipeline stages independently
  • Even simple CLI tools benefit from centralized configuration — scattered env var access is a maintenance hazard