Case Study

VidScript (2025) — Turning Video into Structured AI Notes

An open-source CLI tool that converts videos (local files or YouTube URLs) into structured AI-generated PDF notes using on-device Whisper transcription and LLMs, with optional RAG for long content.

Technical Stack

CLI Framework

  • commander
  • inquirer
  • ora
  • chalk
  • cli-progress

Audio Processing

  • fluent-ffmpeg
  • ytdl-core
  • yt-dlp (fallback)

Transcription

  • @xenova/transformers (Whisper whisper-tiny.en)

AI / LLM

  • Anthropic SDK (Claude 3 Opus, 3.5 Sonnet, 3.7 Sonnet)
  • OpenAI SDK (GPT-4 Turbo, GPT-4o)

RAG / Vectors

  • OpenAI text-embedding-3-small
  • Pinecone

PDF Generation

  • marked (Markdown to HTML)
  • Puppeteer (HTML to PDF)

Runtime

  • TypeScript (ES2022, strict)
  • dotenv
  • npm package (vidscript)

Problem

Video is difficult to search, skim, or reference. Long lectures and tutorials require manual note-taking, and even AI struggles with full transcripts due to context window limits.

Solution

A CLI tool that converts video content into structured PDF notes by:

  1. Downloading videos (for YouTube URLs)
  2. Extracting audio
  3. Transcribing with on-device Whisper
  4. Generating formatted notes using Claude or GPT
  5. Rendering the output as a styled PDF

For long videos, it uses Pinecone vector search to chunk transcripts and generate notes hierarchically using Retrieval-Augmented Generation.
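
This five-stage flow can be sketched as a typed pipeline. The stage functions below are hypothetical stand-ins for illustration only; the real tool delegates to ytdl-core, FFmpeg, Whisper, an LLM SDK, and Puppeteer at the corresponding steps.

```typescript
// Each stage transforms one artifact into the next. These bodies are
// placeholders; in VidScript each stage wraps an external tool or SDK.
type Stage<A, B> = (input: A) => Promise<B>;

const download: Stage<string, string> = async (url) => `${url}.mp4`;        // -> video path
const extractAudio: Stage<string, string> = async (v) => v.replace(".mp4", ".wav");
const transcribe: Stage<string, string> = async (audio) => `transcript of ${audio}`;
const generateNotes: Stage<string, string> = async (text) => `# Notes\n\n${text}`;
const renderPdf: Stage<string, string> = async (md) => md.replace(/^# /, "PDF: ");

async function runPipeline(url: string): Promise<string> {
  const video = await download(url);
  const audio = await extractAudio(video);
  const transcript = await transcribe(audio);
  const notes = await generateNotes(transcript);
  return renderPdf(notes);
}
```

Keeping each stage behind a narrow function signature like this is what makes the ytdl-core/yt-dlp fallback and the model-selection options drop-in swaps.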

Core Pipeline

Video Input → Audio Extraction → Whisper Transcription → AI Note Generation → PDF Output

  • YouTube videos downloaded via ytdl-core with yt-dlp fallback
  • Audio extracted using FFmpeg
  • Transcription runs locally using Xenova Whisper (whisper-tiny.en)
  • Notes generated via Claude or GPT (configurable)
  • Markdown converted to HTML and rendered to PDF using Puppeteer
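
The ytdl-core with yt-dlp fallback in the first bullet is a try-one-then-the-next pattern. A generic sketch, where the downloader functions are hypothetical stand-ins rather than VidScript's actual wrappers:

```typescript
// Try each downloader in order; return the first success, and rethrow
// the last error only if every downloader fails.
type Downloader = (url: string) => Promise<string>; // resolves to a file path

async function downloadWithFallback(
  url: string,
  downloaders: Downloader[],
): Promise<string> {
  let lastError: unknown;
  for (const dl of downloaders) {
    try {
      return await dl(url);
    } catch (err) {
      lastError = err; // e.g. ytdl-core breaking after a YouTube change
    }
  }
  throw lastError;
}
```

VidScript would pass the ytdl-core wrapper first and the yt-dlp wrapper second, so the heavier external binary only runs when the library path throws.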

RAG / Hierarchical Generation Path

For long transcripts that exceed LLM context window limits:

  1. Text is chunked (~500 tokens per chunk)
  2. Chunks are embedded using OpenAI text-embedding-3-small
  3. Embeddings are stored in Pinecone
  4. The LLM generates a top-level outline, then subtopics, then section-by-section notes via semantic retrieval
  5. Sections are merged with a generated table of contents
  6. The vector index is cleared afterward
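
The ~500-token chunking in step 1 can be approximated with a word-based splitter. The word-count heuristic below is an assumption for illustration (exact counts require the model's tokenizer), not VidScript's actual chunker:

```typescript
// Split a transcript into chunks of roughly `maxTokens` tokens, using
// the common rule of thumb of ~0.75 words per token. Real token counts
// vary by tokenizer; a production chunker would also overlap chunks.
function chunkTranscript(text: string, maxTokens = 500): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const wordsPerChunk = Math.max(1, Math.floor(maxTokens * 0.75));
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += wordsPerChunk) {
    chunks.push(words.slice(i, i + wordsPerChunk).join(" "));
  }
  return chunks;
}
```

Each chunk then gets one embedding vector, so chunk size directly trades retrieval granularity against the number of embedding calls.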

CLI Commands

  • vidscript generate: generate notes from a video
  • vidscript init: configure API keys
  • vidscript check: verify dependencies (FFmpeg, etc.)

Supported Models

  • Claude 3 Opus
  • Claude 3.5 Sonnet
  • Claude 3.7 Sonnet (default)
  • GPT-4 Turbo
  • GPT-4o

Technical Architecture

A TypeScript CLI tool (ES2022, strict) with main logic in src/index.ts and Pinecone integration in vectorStore.ts. The pipeline: download video (YouTube via ytdl-core with yt-dlp fallback, or accept local file), extract audio via FFmpeg, transcribe locally via Xenova Whisper (whisper-tiny.en), generate structured notes via Claude or GPT, convert Markdown to HTML via marked, render to PDF via Puppeteer. For long transcripts, a RAG path chunks the text (~500 tokens), embeds via OpenAI text-embedding-3-small, stores in Pinecone, generates notes hierarchically via semantic retrieval, merges sections with a table of contents, and clears the vector index afterward. Environment variables loaded via dotenv. Published as npm package (vidscript) with global CLI support.
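
The final merge step (per-section notes plus a generated table of contents) is plain Markdown assembly. A sketch, where the `Section` shape is illustrative rather than VidScript's actual type:

```typescript
// Merge per-section Markdown into one document with a generated table
// of contents linking to each section's heading anchor.
interface Section {
  title: string;
  markdown: string;
}

function mergeWithToc(docTitle: string, sections: Section[]): string {
  const toc = sections
    .map((s, i) => {
      const anchor = s.title.toLowerCase().replace(/\s+/g, "-");
      return `${i + 1}. [${s.title}](#${anchor})`;
    })
    .join("\n");
  const body = sections
    .map((s) => `## ${s.title}\n\n${s.markdown}`)
    .join("\n\n");
  return `# ${docTitle}\n\n## Table of Contents\n\n${toc}\n\n${body}\n`;
}
```

The merged Markdown is then handed to marked and Puppeteer, which is why link anchors follow standard heading-slug conventions.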

Key Decisions

  1. Used on-device Whisper (@xenova/transformers) instead of an API — avoids per-minute transcription costs and works offline
  2. Chose Puppeteer for PDF generation over a library like pdfkit — HTML/CSS provides full layout control with minimal code
  3. Implemented a RAG path with Pinecone for long transcripts instead of truncating — preserves full content without hitting context window limits
  4. Supported both Anthropic and OpenAI models — lets users choose based on preference and API key availability
  5. Used commander for the CLI framework — well-established, composable subcommands, minimal boilerplate

Tradeoffs

  • On-device Whisper (tiny.en) trades accuracy for speed and cost — larger models would improve quality but require more resources
  • Puppeteer adds a heavyweight dependency (headless Chrome) but provides the most reliable HTML-to-PDF rendering
  • RAG path adds Pinecone as a dependency and requires an API key, but is necessary for transcripts that exceed context windows
  • Output is generated in full (no streaming) — simpler implementation but longer perceived wait for large videos

Current State

Open-source on GitHub. Published to npm (200+ downloads). Full pipeline from video input to styled PDF output, with optional RAG for long content. No test suite yet.

Lessons

  • On-device inference (Whisper) makes CLI tools practical for individual users without API costs
  • RAG is a useful pattern even outside web apps — chunking and hierarchical generation solves context window limits in any pipeline
  • CLI UX matters — progress bars, spinners, and clear error messages are the difference between usable and abandoned tools