Case Study

YouTube Research Agent (2025) — Automating YouTube Surveying with LLM Summaries

An open-source Python CLI tool that searches YouTube for high-view-count videos on a topic, extracts transcripts, summarizes each video using GPT-3.5 Turbo via a LangChain map-reduce chain, and emails a formatted digest to a recipient.

Open Source
CLI

Tech Stack

YouTube Search

  • google-api-python-client (YouTube Data API v3)

Transcript Extraction

  • youtube-transcript-api

LLM Orchestration

  • LangChain (langchain, langchain-openai, langchain-community)

Model

  • OpenAI gpt-3.5-turbo

Email

  • smtplib + email.mime (SMTP/TLS)

Config

  • python-decouple
  • python-dotenv

Runtime

  • Python 3.11+

Problem

Researchers and knowledge workers often need to survey YouTube for information on a topic, which requires manual searching, skimming, and synthesizing notes across multiple videos.

Solution

Given a topic and recipient email, the agent:

  1. Searches YouTube for relevant, high-view-count videos
  2. Extracts transcripts
  3. Summarizes each video using GPT-3.5 Turbo with a map-reduce summarization chain
  4. Emails a digest of the top videos and their summaries to a specified recipient
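
The following is a rough sketch of how these steps could fit together in src/main.py (the YouTubeResearchAgent orchestrator). The method names and data shapes here are assumptions for illustration, not the repository's actual API:

```python
# Hypothetical sketch of the pipeline in src/main.py; class wiring and method
# names are assumptions for illustration.
class YouTubeResearchAgent:
    def __init__(self, searcher, summarizer, emailer):
        self.searcher = searcher      # YouTube Data API v3 search + transcripts
        self.summarizer = summarizer  # LangChain map-reduce summarization
        self.emailer = emailer        # SMTP/TLS digest sender

    def run(self, topic: str, recipient: str, max_videos: int = 5) -> list[dict]:
        videos = self.searcher.search(topic, max_results=max_videos)
        summaries = []
        for video in videos:
            if not video.get("transcript"):
                continue  # skip videos without a usable transcript
            text = f'{video["title"]}\n{video["description"]}\n{video["transcript"]}'
            summaries.append(
                {"title": video["title"], "summary": self.summarizer.summarize(text)}
            )
        self.emailer.send_digest(recipient, topic, summaries)
        return summaries  # also displayed on the console by the CLI
```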

Architecture & Data Flow

Topic + Email → YouTube Search → Transcript Extraction → Map-Reduce Summary → Email Digest

  • CLI collects: topic, recipient email, max videos (default 5)
  • Search module queries YouTube Data API v3, filters to videos published after 2024-01-01, orders by view count, and fetches transcripts per video, handling missing ones gracefully (see the sketch after this list)
  • Summarizer combines title + description + transcript, splits text into 2000-character chunks with 200-character overlap using RecursiveCharacterTextSplitter, runs LangChain map-reduce summarization using ChatOpenAI (gpt-3.5-turbo, temperature=0)
  • Email module formats summaries as plain text and sends the digest via SMTP/TLS (default: Gmail SMTP on port 587)
  • Output: returns summaries to CLI for console display and sends the email digest
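
The search stage can be sketched roughly as follows. The google-api-python-client and youtube-transcript-api calls mirror the behavior described above (order by view count, publishedAfter 2024-01-01, per-video transcript fetch with a fallback), but the function name and return shape are assumptions; note also that get_transcript is the pre-1.0 youtube-transcript-api interface, while newer releases use an instance-based fetch method.

```python
# Hypothetical sketch of src/youtube/search.py; function name and return shape
# are assumed, but the API calls mirror the described behavior.
from googleapiclient.discovery import build
from youtube_transcript_api import YouTubeTranscriptApi

def search_videos(api_key: str, topic: str, max_results: int = 5) -> list[dict]:
    youtube = build("youtube", "v3", developerKey=api_key)
    response = youtube.search().list(
        q=topic,
        part="snippet",
        type="video",
        order="viewCount",                      # highest view count first
        publishedAfter="2024-01-01T00:00:00Z",  # hardcoded recency filter
        maxResults=max_results,
    ).execute()

    videos = []
    for item in response.get("items", []):
        video_id = item["id"]["videoId"]
        try:
            # Pre-1.0 youtube-transcript-api interface; newer versions use
            # YouTubeTranscriptApi().fetch(video_id) instead.
            segments = YouTubeTranscriptApi.get_transcript(video_id)
            transcript = " ".join(seg["text"] for seg in segments)
        except Exception:
            transcript = ""  # no usable transcript; handled downstream
        videos.append({
            "video_id": video_id,
            "title": item["snippet"]["title"],
            "description": item["snippet"]["description"],
            "transcript": transcript,
        })
    return videos
```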

Key Files

  • src/main.py: orchestrator (YouTubeResearchAgent) + CLI input
  • src/config/settings.py: loads env vars (YouTube/OpenAI/email/SMTP)
  • src/youtube/search.py: search + transcript extraction
  • src/summarizer/summarize.py: chunking + map-reduce summarization chain
  • src/email/sender.py: formats and sends email via SMTP/TLS
  • tests/: present but currently empty placeholders

Design Patterns

  • Separation of concerns across search, summarization, email modules
  • Orchestrator pattern in main.py
  • Map-reduce summarization to handle long transcripts
  • Centralized configuration loading via settings.py

Technical Architecture

A Python CLI tool with clear separation of concerns. The orchestrator (src/main.py) coordinates three modules: a search module (src/youtube/search.py) that queries YouTube Data API v3, filters to videos published after 2024-01-01, orders by view count, and extracts transcripts via youtube-transcript-api; a summarizer module (src/summarizer/summarize.py) that combines title + description + transcript, chunks text into 2000-character segments with 200-character overlap using RecursiveCharacterTextSplitter, and runs a LangChain map-reduce summarization chain using ChatOpenAI (gpt-3.5-turbo, temperature=0); and an email module (src/email/sender.py) that formats summaries into plain-text and sends via SMTP/TLS (default Gmail SMTP: port 587). Configuration is centralized in src/config/settings.py using python-decouple and python-dotenv.
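
A minimal sketch of the summarization step, consistent with the parameters above (2000-character chunks with 200-character overlap, gpt-3.5-turbo at temperature 0, LangChain map-reduce chain). The function name is an assumption, and exact LangChain import paths vary slightly between versions.

```python
# Hypothetical sketch of src/summarizer/summarize.py using the parameters described above.
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI

def summarize(text: str) -> str:
    # Split the combined title + description + transcript into overlapping chunks.
    splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
    docs = splitter.create_documents([text])

    # Deterministic, low-cost model; reads OPENAI_API_KEY from the environment.
    llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

    # Map: summarize each chunk independently; reduce: merge the chunk summaries.
    chain = load_summarize_chain(llm, chain_type="map_reduce")
    result = chain.invoke({"input_documents": docs})
    return result["output_text"]
```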

Key Decisions

  1. Used LangChain map-reduce chain over simple prompt truncation — preserves full transcript content regardless of length
  2. Separated search, summarization, and email into distinct modules — clear boundaries make each component independently testable
  3. Used RecursiveCharacterTextSplitter with 200-character overlap — maintains context across chunk boundaries for coherent summaries
  4. Chose gpt-3.5-turbo with temperature=0 — deterministic output for research summarization, lower cost than GPT-4
  5. Centralized all config in settings.py via python-decouple — single source of truth for API keys and SMTP credentials
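
A minimal sketch of what the centralized settings module could look like with python-decouple; the specific variable names and defaults (beyond the Gmail SMTP port noted earlier) are assumptions. Values are read from a .env file or the process environment.

```python
# Hypothetical sketch of src/config/settings.py; variable names are assumed.
from decouple import config

YOUTUBE_API_KEY = config("YOUTUBE_API_KEY")
OPENAI_API_KEY = config("OPENAI_API_KEY")

SMTP_HOST = config("SMTP_HOST", default="smtp.gmail.com")
SMTP_PORT = config("SMTP_PORT", default=587, cast=int)
SMTP_USER = config("SMTP_USER")
SMTP_PASSWORD = config("SMTP_PASSWORD")
```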

Tradeoffs

  • CLI-only interface limits accessibility but keeps scope focused and avoids web infrastructure complexity
  • Synchronous processing means longer wait times for multiple videos, but simplifies implementation and debugging
  • Plain-text email limits formatting but avoids HTML rendering complexity and email client compatibility issues (see the sketch after this list)
  • Hardcoded 2024-01-01 date filter limits flexibility but scopes results to recent content by default
  • Stateless design (no caching/db) means repeated queries re-fetch and re-summarize, but avoids storage management
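
A minimal sketch of the plain-text SMTP/TLS send referenced above, using only the standard library; the helper signature, subject line, and digest layout are assumptions.

```python
# Hypothetical sketch of src/email/sender.py; helper name and message layout are assumed.
import smtplib
from email.mime.text import MIMEText

def send_digest(host: str, port: int, user: str, password: str,
                recipient: str, topic: str, summaries: list[dict]) -> None:
    body = "\n\n".join(f'{s["title"]}\n{s["summary"]}' for s in summaries)
    msg = MIMEText(body, "plain")             # plain-text digest, no HTML
    msg["Subject"] = f"YouTube research digest: {topic}"
    msg["From"] = user
    msg["To"] = recipient

    with smtplib.SMTP(host, port) as server:  # e.g. smtp.gmail.com:587
        server.starttls()                     # upgrade the connection to TLS
        server.login(user, password)
        server.send_message(msg)
```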

Current Limitations

  • CLI-only (no web UI / API)
  • Tests are stubs (empty)
  • Uses print statements instead of structured logging
  • Synchronous processing (no async/parallel)
  • Hardcoded filter: videos after 2024-01-01
  • No retry/rate limiting
  • Plain-text email only
  • Stateless (no caching/db)

Lessons

  • Map-reduce is a practical pattern for summarizing content that exceeds LLM context windows without losing information
  • Separation of concerns across modules pays off immediately when debugging pipeline stages independently
  • Even simple CLI tools benefit from centralized configuration — scattered env var access is a maintenance hazard