Case Study

VidScript (2025) — Turning Video into Structured AI Notes

An open-source CLI tool that converts videos (local files or YouTube URLs) into structured AI-generated PDF notes using on-device Whisper transcription and LLMs, with optional RAG for long content.

Technical Stack

CLI Framework

  • commander
  • inquirer
  • ora
  • chalk
  • cli-progress

Audio Processing

  • fluent-ffmpeg
  • ytdl-core
  • yt-dlp (fallback)

Transcription

  • @xenova/transformers (Whisper whisper-tiny.en)

AI / LLM

  • Anthropic SDK (Claude 3 Opus, 3.5 Sonnet, 3.7 Sonnet)
  • OpenAI SDK (GPT-4 Turbo, GPT-4o)

RAG / Vectors

  • OpenAI text-embedding-3-small
  • Pinecone

PDF Generation

  • marked (Markdown to HTML)
  • Puppeteer (HTML to PDF)

Runtime

  • TypeScript (ES2022, strict)
  • dotenv
  • npm package (vidscript)

Problem

Video is difficult to search, skim, or reference. Long lectures and tutorials require manual note-taking, and even AI struggles with full transcripts due to context window limits.

Solution

A CLI tool that converts video content into structured PDF notes by:

  1. Downloading videos (for YouTube URLs)
  2. Extracting audio
  3. Transcribing with on-device Whisper
  4. Generating formatted notes using Claude or GPT
  5. Rendering the output as a styled PDF

For long videos, it uses Pinecone vector search to chunk transcripts and generate notes hierarchically using Retrieval-Augmented Generation.
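
This five-stage flow can be sketched as a typed pipeline. The stage functions below are hypothetical stand-ins for illustration only; the real tool delegates to ytdl-core, FFmpeg, Whisper, an LLM SDK, and Puppeteer at the corresponding steps.

```typescript
// Each stage transforms one artifact into the next. These bodies are
// placeholders; in VidScript each stage wraps an external tool or SDK.
type Stage<A, B> = (input: A) => Promise<B>;

const download: Stage<string, string> = async (url) => `${url}.mp4`;        // -> video path
const extractAudio: Stage<string, string> = async (v) => v.replace(".mp4", ".wav");
const transcribe: Stage<string, string> = async (audio) => `transcript of ${audio}`;
const generateNotes: Stage<string, string> = async (text) => `# Notes\n\n${text}`;
const renderPdf: Stage<string, string> = async (md) => md.replace(/^# /, "PDF: ");

async function runPipeline(url: string): Promise<string> {
  const video = await download(url);
  const audio = await extractAudio(video);
  const transcript = await transcribe(audio);
  const notes = await generateNotes(transcript);
  return renderPdf(notes);
}
```

Keeping each stage behind a narrow function signature like this is what makes the ytdl-core/yt-dlp fallback and the model-selection options drop-in swaps.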

Core Pipeline

Video Input → Audio Extraction → Whisper Transcription → AI Note Generation → PDF Output

  • YouTube videos downloaded via ytdl-core with yt-dlp fallback
  • Audio extracted using FFmpeg
  • Transcription runs locally using Xenova Whisper (whisper-tiny.en)
  • Notes generated via Claude or GPT (configurable)
  • Markdown converted to HTML and rendered to PDF using Puppeteer
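
The ytdl-core with yt-dlp fallback in the first bullet is a try-one-then-the-next pattern. A generic sketch, where the downloader functions are hypothetical stand-ins rather than VidScript's actual wrappers:

```typescript
// Try each downloader in order; return the first success, and rethrow
// the last error only if every downloader fails.
type Downloader = (url: string) => Promise<string>; // resolves to a file path

async function downloadWithFallback(
  url: string,
  downloaders: Downloader[],
): Promise<string> {
  let lastError: unknown;
  for (const dl of downloaders) {
    try {
      return await dl(url);
    } catch (err) {
      lastError = err; // e.g. ytdl-core breaking after a YouTube change
    }
  }
  throw lastError;
}
```

VidScript would pass the ytdl-core wrapper first and the yt-dlp wrapper second, so the heavier external binary only runs when the library path throws.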

RAG / Hierarchical Generation Path

For long transcripts that exceed LLM context window limits:

  1. Text is chunked (~500 tokens per chunk)
  2. Chunks are embedded using OpenAI text-embedding-3-small
  3. Embeddings are stored in Pinecone
  4. The LLM generates a top-level outline, then subtopics, then section-by-section notes via semantic retrieval
  5. Sections are merged with a generated table of contents
  6. The vector index is cleared afterward
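
The ~500-token chunking in step 1 can be approximated with a word-based splitter. The word-count heuristic below is an assumption for illustration (exact counts require the model's tokenizer), not VidScript's actual chunker:

```typescript
// Split a transcript into chunks of roughly `maxTokens` tokens, using
// the common rule of thumb of ~0.75 words per token. Real token counts
// vary by tokenizer; a production chunker would also overlap chunks.
function chunkTranscript(text: string, maxTokens = 500): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const wordsPerChunk = Math.max(1, Math.floor(maxTokens * 0.75));
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += wordsPerChunk) {
    chunks.push(words.slice(i, i + wordsPerChunk).join(" "));
  }
  return chunks;
}
```

Each chunk then gets one embedding vector, so chunk size directly trades retrieval granularity against the number of embedding calls.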

CLI Commands

  • vidscript generate: generate notes from a video
  • vidscript init: configure API keys
  • vidscript check: verify dependencies (FFmpeg, etc.)

Supported Models

  • Claude 3 Opus
  • Claude 3.5 Sonnet
  • Claude 3.7 Sonnet (default)
  • GPT-4 Turbo
  • GPT-4o

Technical Architecture

A TypeScript CLI tool (ES2022, strict) with main logic in src/index.ts and Pinecone integration in vectorStore.ts. The pipeline: download video (YouTube via ytdl-core with yt-dlp fallback, or accept local file), extract audio via FFmpeg, transcribe locally via Xenova Whisper (whisper-tiny.en), generate structured notes via Claude or GPT, convert Markdown to HTML via marked, render to PDF via Puppeteer. For long transcripts, a RAG path chunks the text (~500 tokens), embeds via OpenAI text-embedding-3-small, stores in Pinecone, generates notes hierarchically via semantic retrieval, merges sections with a table of contents, and clears the vector index afterward. Environment variables loaded via dotenv. Published as npm package (vidscript) with global CLI support.
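
The final merge step (per-section notes plus a generated table of contents) is plain Markdown assembly. A sketch, where the `Section` shape is illustrative rather than VidScript's actual type:

```typescript
// Merge per-section Markdown into one document with a generated table
// of contents linking to each section's heading anchor.
interface Section {
  title: string;
  markdown: string;
}

function mergeWithToc(docTitle: string, sections: Section[]): string {
  const toc = sections
    .map((s, i) => {
      const anchor = s.title.toLowerCase().replace(/\s+/g, "-");
      return `${i + 1}. [${s.title}](#${anchor})`;
    })
    .join("\n");
  const body = sections
    .map((s) => `## ${s.title}\n\n${s.markdown}`)
    .join("\n\n");
  return `# ${docTitle}\n\n## Table of Contents\n\n${toc}\n\n${body}\n`;
}
```

The merged Markdown is then handed to marked and Puppeteer, which is why link anchors follow standard heading-slug conventions.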

Key Decisions

  1. Used on-device Whisper (@xenova/transformers) instead of an API — avoids per-minute transcription costs and works offline
  2. Chose Puppeteer for PDF generation over a library like pdfkit — HTML/CSS provides full layout control with minimal code
  3. Implemented a RAG path with Pinecone for long transcripts instead of truncating — preserves full content without hitting context window limits
  4. Supported both Anthropic and OpenAI models — lets users choose based on preference and API key availability
  5. Used commander for the CLI framework — well-established, composable subcommands, minimal boilerplate

Tradeoffs

  • On-device Whisper (tiny.en) trades accuracy for speed and cost — larger models would improve quality but require more resources
  • Puppeteer adds a heavyweight dependency (headless Chrome) but provides the most reliable HTML-to-PDF rendering
  • RAG path adds Pinecone as a dependency and requires an API key, but is necessary for transcripts that exceed context windows
  • Output is generated in full (no streaming) — simpler implementation but longer perceived wait for large videos

Current State

Open-source on GitHub. Published to npm (200+ downloads). Full pipeline from video input to styled PDF output, with optional RAG for long content. No test suite yet.

Lessons

  • On-device inference (Whisper) makes CLI tools practical for individual users without API costs
  • RAG is a useful pattern even outside web apps — chunking and hierarchical generation solves context window limits in any pipeline
  • CLI UX matters — progress bars, spinners, and clear error messages are the difference between usable and abandoned tools