Technical Stack
CLI Framework
- commander
- inquirer
- ora
- chalk
- cli-progress
Audio Processing
- fluent-ffmpeg
- ytdl-core
- yt-dlp (fallback)
Transcription
- @xenova/transformers (Whisper, whisper-tiny.en model)
AI / LLM
- Anthropic SDK (Claude 3 Opus, 3.5 Sonnet, 3.7 Sonnet)
- OpenAI SDK (GPT-4 Turbo, GPT-4o)
RAG / Vectors
- OpenAI text-embedding-3-small
- Pinecone
PDF Generation
- marked (Markdown to HTML)
- Puppeteer (HTML to PDF)
Runtime
- TypeScript (ES2022, strict)
- dotenv
- npm package (vidscript)
Problem
Video is difficult to search, skim, or reference. Long lectures and tutorials require manual note-taking, and even AI struggles with full transcripts due to context window limits.
Solution
A CLI tool that converts video content into structured PDF notes by:
1. Downloading videos (for YouTube URLs)
2. Extracting audio
3. Transcribing with on-device Whisper
4. Generating formatted notes using Claude or GPT
5. Rendering the output as a styled PDF
For long videos, it uses Pinecone vector search to chunk transcripts and generate notes hierarchically using Retrieval-Augmented Generation.
Core Pipeline
- YouTube videos downloaded via ytdl-core with yt-dlp fallback
- Audio extracted using FFmpeg
- Transcription runs locally using Xenova Whisper (whisper-tiny.en)
- Notes generated via Claude or GPT (configurable)
- Markdown converted to HTML and rendered to PDF using Puppeteer
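The five stages above can be sketched as a linear async pipeline. The stage functions here are hypothetical placeholders standing in for the real wrappers around ytdl-core/yt-dlp, FFmpeg, Xenova Whisper, the LLM SDKs, and marked + Puppeteer; only the control flow reflects the actual design.

```typescript
// Minimal sketch of the core pipeline. Every stage body below is a
// placeholder (assumption): the real implementations call out to
// ytdl-core / yt-dlp, fluent-ffmpeg, @xenova/transformers, an LLM SDK,
// and marked + Puppeteer respectively.
type Stage = (input: string) => Promise<string>;

const downloadVideo: Stage = async (url) => `${url}.mp4`;            // ytdl-core, yt-dlp fallback
const extractAudio: Stage = async (video) => `${video}.wav`;          // FFmpeg
const transcribe: Stage = async (audio) => `transcript of ${audio}`;  // whisper-tiny.en, on-device
const generateNotes: Stage = async (text) => `# Notes\n\n${text}`;    // Claude or GPT
const renderPdf: Stage = async (_md) => "notes.pdf";                  // marked + Puppeteer

async function runPipeline(source: string): Promise<string> {
  const stages: Stage[] = [downloadVideo, extractAudio, transcribe, generateNotes, renderPdf];
  let result = source;
  for (const stage of stages) {
    result = await stage(result); // each stage consumes the previous stage's output
  }
  return result;
}
```

Modeling each stage with the same signature keeps the orchestration a simple fold over the stage list, which also makes it easy to skip the download stage when the input is a local file.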
RAG / Hierarchical Generation Path
For long transcripts that exceed LLM context window limits:
1. Text is chunked (~500 tokens per chunk)
2. Chunks are embedded using OpenAI text-embedding-3-small
3. Embeddings are stored in Pinecone
4. The LLM generates a top-level outline, subtopics, and section-by-section notes via semantic retrieval
5. Sections are merged with a generated table of contents
6. The vector index is cleared afterward
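The chunking step above can be sketched as follows. This is a minimal illustration: token counts are approximated as whitespace-delimited words, whereas a real implementation would use the model's tokenizer; `chunkTranscript` and `maxTokens` are illustrative names, not the tool's actual API.

```typescript
// Sketch of transcript chunking for the RAG path.
// Assumption: ~1 token per word, which is only a rough approximation;
// a production version would count tokens with the embedding model's tokenizer.
function chunkTranscript(text: string, maxTokens = 500): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += maxTokens) {
    chunks.push(words.slice(i, i + maxTokens).join(" "));
  }
  return chunks;
}
```

Each chunk would then be embedded and upserted into Pinecone with its position in the transcript as metadata, so retrieved context can be re-ordered chronologically.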
CLI Commands
- vidscript generate – generate notes from a video
- vidscript init – configure API keys
- vidscript check – verify dependencies (FFmpeg, etc.)
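The real CLI registers these subcommands with commander; the dependency-free sketch below shows the same three-command dispatch shape, with placeholder handler bodies (assumptions) in place of the actual pipeline, inquirer prompts, and dependency checks.

```typescript
// Dependency-free sketch of the vidscript subcommand dispatch.
// Handler bodies are illustrative placeholders, not the tool's real logic.
type Handler = (args: string[]) => string;

const commands: Record<string, Handler> = {
  generate: (args) => `generating notes for ${args[0] ?? "<video>"}`, // full pipeline
  init: () => "prompting for API keys",                               // inquirer in the real tool
  check: () => "checking FFmpeg and other dependencies",
};

function dispatch(argv: string[]): string {
  const [cmd, ...rest] = argv;
  const handler = cmd !== undefined ? commands[cmd] : undefined;
  if (!handler) return `unknown command: ${cmd ?? "(none)"}`;
  return handler(rest);
}
```

With commander, each entry becomes a `program.command(...)` registration with a description and an action callback, which is what keeps the boilerplate minimal.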
Supported Models
- Anthropic: Claude 3 Opus, Claude 3.5 Sonnet, Claude 3.7 Sonnet
- OpenAI: GPT-4 Turbo, GPT-4o
Technical Architecture
A TypeScript CLI tool (ES2022, strict) with main logic in src/index.ts and Pinecone integration in vectorStore.ts. The pipeline: download video (YouTube via ytdl-core with yt-dlp fallback, or accept local file), extract audio via FFmpeg, transcribe locally via Xenova Whisper (whisper-tiny.en), generate structured notes via Claude or GPT, convert Markdown to HTML via marked, render to PDF via Puppeteer. For long transcripts, a RAG path chunks the text (~500 tokens), embeds via OpenAI text-embedding-3-small, stores in Pinecone, generates notes hierarchically via semantic retrieval, merges sections with a table of contents, and clears the vector index afterward. Environment variables loaded via dotenv. Published as npm package (vidscript) with global CLI support.
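The merge step at the end of the RAG path can be sketched like this. The `Section` shape, slug format, and heading levels are illustrative assumptions; the real section bodies are whatever Markdown the LLM produced per subtopic.

```typescript
// Sketch of merging per-section notes under a generated table of contents.
// Assumption: GitHub-style anchor slugs (lowercased, spaces to hyphens);
// the actual output format may differ.
interface Section {
  title: string;
  markdown: string; // LLM-generated notes for this section
}

function mergeWithToc(docTitle: string, sections: Section[]): string {
  const toc = sections
    .map((s, i) => `${i + 1}. [${s.title}](#${s.title.toLowerCase().replace(/\s+/g, "-")})`)
    .join("\n");
  const body = sections
    .map((s) => `## ${s.title}\n\n${s.markdown}`)
    .join("\n\n");
  return `# ${docTitle}\n\n## Table of Contents\n\n${toc}\n\n${body}\n`;
}
```

The merged Markdown then flows into the same marked → Puppeteer rendering path as the short-video case, so the RAG branch only changes how the Markdown is produced, not how it is rendered.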
Key Decisions
1. Used on-device Whisper (@xenova/transformers) instead of an API – avoids per-minute transcription costs and works offline
2. Chose Puppeteer for PDF generation over a library like pdfkit – HTML/CSS provides full layout control with minimal code
3. Implemented a RAG path with Pinecone for long transcripts instead of truncating – preserves full content without hitting context window limits
4. Supported both Anthropic and OpenAI models – lets users choose based on preference and API key availability
5. Used commander as the CLI framework – well-established, composable subcommands, minimal boilerplate
Tradeoffs
- On-device Whisper (tiny.en) trades accuracy for speed and cost — larger models would improve quality but require more resources
- Puppeteer adds a heavyweight dependency (headless Chrome) but provides the most reliable HTML-to-PDF rendering
- RAG path adds Pinecone as a dependency and requires an API key, but is necessary for transcripts that exceed context windows
- Output is generated in full (no streaming) — simpler implementation but longer perceived wait for large videos
Current State
Open-source on GitHub. Published to npm (200+ downloads). Full pipeline from video input to styled PDF output, with optional RAG for long content. No test suite yet.
Lessons
- On-device inference (Whisper) makes CLI tools practical for individual users without API costs
- RAG is a useful pattern even outside web apps — chunking and hierarchical generation solves context window limits in any pipeline
- CLI UX matters — progress bars, spinners, and clear error messages are the difference between usable and abandoned tools