How VCR# Talks to ttyd
The Side-by-Side Architecture
When VCR# records a terminal session, it doesn't directly manipulate a terminal window. Instead, four separate processes work together:
- VCR# - The orchestrator
- ttyd - The terminal server
- Browser (Chromium) - The rendering engine
- Shell (bash/pwsh/cmd) - The actual command interpreter
These components run independently, side-by-side, each handling a distinct responsibility.
Why This Separation Exists
Why not directly control a terminal?
Terminal emulation is deceptively complex. Getting fonts, colors, Unicode, ligatures, and emoji to render consistently across Windows, macOS, and Linux would take years of development. Each platform has different terminal APIs (ConPTY on Windows, PTY on Unix), different font rendering engines, and different default behaviors.
VCR#'s approach: composition over custom
Rather than building a custom terminal emulator, VCR# orchestrates existing, battle-tested tools:
- ttyd provides the terminal-to-web bridge
- xterm.js (inside ttyd's web page) handles terminal emulation and rendering
- Playwright provides browser automation and screenshot capabilities
Each component has a large user base that has already surfaced and fixed edge cases VCR# could never find on its own.
The Communication Chain
Sending Input: From VCR# to Shell
When a Type or Key command executes, input flows through multiple layers:
VCR# Command
↓
Playwright Automation API
↓
Browser Keyboard Event
↓
xterm.js (in browser)
↓
WebSocket Message
↓
ttyd Process
↓
PTY (Pseudo-Terminal)
↓
Shell Process (bash/pwsh/cmd)
What's actually happening:
- VCR# calls Playwright: page.Keyboard.TypeAsync("hello")
- Playwright sends browser automation commands: Simulates physical keyboard events in Chromium
- xterm.js receives keyboard events: The terminal emulator running in the browser captures them
- WebSocket transmission: xterm.js sends the keystrokes to ttyd over a WebSocket connection
- ttyd writes to PTY: Acts as a bridge, converting WebSocket messages to PTY input
- Shell receives input: The actual shell process (bash/PowerShell/cmd) reads from its PTY
This chain seems convoluted, but each step serves a purpose:
- Playwright: Provides reliable cross-platform automation
- Browser: Handles all the complexity of keyboard input (modifiers, special keys, international keyboards)
- xterm.js: Translates browser keyboard events into terminal input codes
- WebSocket: Provides bi-directional communication between browser and server
- ttyd: Bridges the web world (WebSocket) and the operating system's pseudo-terminal (PTY on Unix, ConPTY on Windows)
- PTY: Provides the shell with a terminal interface it understands
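The translation step attributed to xterm.js can be illustrated with a much-reduced sketch. The function below is hypothetical and is not xterm.js's code, which handles far more keys, modes, and modifiers. It only shows how a few browser key events map to the byte sequences a shell expects. The event shape ({ key, ctrlKey }) mirrors the browser's KeyboardEvent; the name keyEventToTerminalInput is an assumption made for illustration.

```javascript
// Simplified illustration only: maps a few browser key events to the
// byte sequences a terminal sends to the PTY. Not xterm.js's actual code.
const SPECIAL_KEYS = new Map([
  ["Enter", "\r"],          // carriage return
  ["Backspace", "\x7f"],    // DEL, the usual default for Backspace
  ["Tab", "\t"],
  ["Escape", "\x1b"],
  ["ArrowUp", "\x1b[A"],    // CSI sequences (normal cursor-key mode)
  ["ArrowDown", "\x1b[B"],
  ["ArrowRight", "\x1b[C"],
  ["ArrowLeft", "\x1b[D"],
]);

function keyEventToTerminalInput(event) {
  const { key, ctrlKey = false } = event;

  // Ctrl+letter becomes a control character: Ctrl+A is 0x01, Ctrl+C is 0x03.
  if (ctrlKey && /^[a-z]$/i.test(key)) {
    return String.fromCharCode(key.toUpperCase().charCodeAt(0) - 64);
  }
  if (SPECIAL_KEYS.has(key)) {
    return SPECIAL_KEYS.get(key);
  }
  // Printable characters pass through unchanged; bare modifiers send nothing.
  return key.length === 1 ? key : "";
}

// Ctrl+C becomes the single byte 0x03, which the PTY turns into an interrupt.
console.log(JSON.stringify(keyEventToTerminalInput({ key: "c", ctrlKey: true })));
```

Because this mapping already exists inside xterm.js, VCR# only has to describe keys at the browser level and never needs its own table of escape sequences.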
Reading Output: From Shell to VCR#
Shell Process Output
↓
PTY
↓
ttyd Process
↓
WebSocket Message
↓
xterm.js Rendering
↓
Browser Canvas Display
↓
Playwright JavaScript Execution
↓
VCR# receives text/screenshots
What's actually happening:
- Shell produces output: Command writes to stdout/stderr
- PTY captures output: Terminal interface receives the bytes
- ttyd forwards via WebSocket: Sends output to the browser
- xterm.js renders: Terminal emulator interprets ANSI codes, updates display
- Browser renders to canvas: xterm.js draws characters, colors, cursor to canvas elements
- VCR# reads through Playwright: Executes JavaScript in the browser to access xterm.js internals
Two ways VCR# reads output:
Text reading (for Wait commands):
// VCR# runs this JavaScript in the browser
window.term.buffer.active.getLine(lineNumber).translateToString(true)
This accesses xterm.js's internal buffer directly, reading the actual text content that was rendered.
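A slightly fuller version of that read might look like the sketch below. It relies only on the public xterm.js buffer API (buffer.active.length, getLine, translateToString). The function name readTerminalText and the fake terminal used to exercise it outside a browser are hypothetical, and the script VCR# actually runs may differ.

```javascript
// Sketch: collect every line of the active buffer into one string.
// `term` is expected to look like an xterm.js Terminal instance.
function readTerminalText(term) {
  const buffer = term.buffer.active;
  const lines = [];
  for (let i = 0; i < buffer.length; i++) {
    const line = buffer.getLine(i); // undefined when the index is out of range
    if (line) {
      lines.push(line.translateToString(true)); // true = trim trailing whitespace
    }
  }
  return lines.join("\n");
}

// Minimal stand-in for a terminal, used only to try the function in Node.
function makeFakeTerminal(rows) {
  const toLine = (text) => ({
    translateToString: (trimRight) => (trimRight ? text.trimEnd() : text),
  });
  return {
    buffer: {
      active: {
        length: rows.length,
        getLine: (i) => (i >= 0 && i < rows.length ? toLine(rows[i]) : undefined),
      },
    },
  };
}

console.log(readTerminalText(makeFakeTerminal(["$ echo hello   ", "hello"])));
```

In the browser, the same function would be applied to window.term, and the resulting string is what a Wait pattern would be matched against.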
Visual capture (for recording):
// Playwright screenshots the canvas elements
await page.ScreenshotAsync(options)
This captures the pixels displayed in the browser—the visual representation of the terminal.
The key architectural insight: VCR# operates entirely through the browser—it never directly interacts with the shell's input/output streams. All input goes through Playwright's keyboard automation, and all output is read from xterm.js's rendered display. This indirection is what enables consistent visual output across platforms—we capture exactly what xterm.js displays, not what the shell emitted.
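The gap between emitted bytes and displayed text is easy to see with colored output. The sketch below is purely illustrative: it strips CSI escape sequences from a raw string to approximate what ends up as visible text. A real emulator such as xterm.js interprets these sequences (colors, cursor movement, erasing) rather than discarding them, and the name stripCsiSequences is hypothetical.

```javascript
// CSI sequence: ESC [, then parameter bytes (0x30-0x3F), intermediate
// bytes (0x20-0x2F), and one final byte (0x40-0x7E).
const CSI_PATTERN = /\x1b\[[0-?]*[ -\/]*[@-~]/g;

function stripCsiSequences(raw) {
  return raw.replace(CSI_PATTERN, "");
}

const emitted = "\x1b[32mhello\x1b[0m"; // what the shell wrote: green "hello"
const visible = stripCsiSequences(emitted);

console.log(JSON.stringify(emitted)); // escape codes still present
console.log(JSON.stringify(visible)); // "hello"
```

A recorder that read the shell's streams directly would have to deal with the first string. Reading from xterm.js means VCR# only ever sees the second, already interpreted form.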
The Recording Lifecycle
Understanding how these components interact during a full recording session reveals why this architecture works:
Phase 1: Startup
VCR# starts the terminal server:
- Finds an available port (requests ephemeral port from OS)
- Spawns ttyd process: ttyd --port 12345 --writable pwsh.exe
- ttyd launches the shell (PowerShell in this example)
- ttyd binds to localhost:12345 and waits for browser connections
- VCR# polls the port until ttyd responds (confirms it's ready)
At this point, ttyd and the shell are running side-by-side:
- ttyd (serving a web page on port 12345)
- shell (running under ttyd, waiting for input)
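The startup steps can be sketched in a few lines of Node. This illustrates the technique and is not VCR#'s own code: the function names are hypothetical, the readiness check is assumed to be a plain TCP connect, and the ttyd flags are the ones from the command shown above.

```javascript
const net = require("net");

// Ask the OS for a free port by binding to port 0, then release it.
// Another process could still claim the port before ttyd binds to it.
function findFreePort() {
  return new Promise((resolve, reject) => {
    const server = net.createServer();
    server.once("error", reject);
    server.listen(0, "127.0.0.1", () => {
      const { port } = server.address();
      server.close(() => resolve(port));
    });
  });
}

// Argument list for the ttyd invocation: port, writable mode, then the shell.
function buildTtydArgs(port, shell) {
  return ["--port", String(port), "--writable", shell];
}

// Retry a TCP connect until something is listening (assumed readiness check).
function waitForPort(port, { retries = 50, delayMs = 100 } = {}) {
  return new Promise((resolve, reject) => {
    const attempt = (remaining) => {
      const socket = net.connect({ port, host: "127.0.0.1" });
      socket.once("connect", () => {
        socket.end();
        resolve();
      });
      socket.once("error", () => {
        socket.destroy();
        if (remaining <= 0) {
          reject(new Error(`port ${port} never opened`));
        } else {
          setTimeout(() => attempt(remaining - 1), delayMs);
        }
      });
    };
    attempt(retries);
  });
}

findFreePort().then((port) => {
  console.log(["ttyd", ...buildTtydArgs(port, "pwsh.exe")].join(" "));
});
```

After spawning ttyd with these arguments, a caller would await waitForPort(port) before pointing the browser at the server.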
VCR# connects the browser:
- Launches headless Chromium via Playwright
- Navigates to http://localhost:12345
- Browser loads ttyd's web page (which includes xterm.js)
- xterm.js initializes and connects to ttyd via WebSocket
- VCR# waits for terminal to be ready (checks for shell prompt)
Now four processes are running side-by-side:
- VCR# (orchestrating)
- Browser (displaying terminal)
- ttyd (bridging browser ↔ shell)
- shell (ready for commands)
Phase 2: Command Execution
VCR# executes tape commands:
For each command in the tape file, VCR# calls ExecuteAsync(), which interacts with the terminal through the browser.
Example: Type command
Type "echo hello"
This follows the input flow described above. VCR# calls page.Keyboard.TypeAsync("echo hello"), and each keystroke travels through the chain to the shell. Each character is echoed back from the shell side (by the PTY or the shell's line editor) as it arrives, so the recording shows the text appearing as if someone were typing in real time.
Example: Wait command
Wait /hello/
- VCR#: Starts polling loop (every 10ms)
- Executes JavaScript in browser: window.term.buffer.active
- Reads terminal text content from xterm.js
- Checks if pattern /hello/ matches
- Repeats until match found or timeout
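The polling steps above can be sketched as a small helper. It is a minimal illustration, not VCR#'s implementation: waitForPattern and readText are hypothetical names, the 10ms interval comes from the description above, and the 5-second default timeout is an assumption.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Poll `readText` until `pattern` matches or `timeoutMs` elapses.
// `readText` may be sync or async; in a real setup it would evaluate
// JavaScript in the browser to read the xterm.js buffer.
async function waitForPattern(readText, pattern, { intervalMs = 10, timeoutMs = 5000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const text = await readText();
    // search() ignores lastIndex, so a /g pattern is safe to reuse across polls.
    if (text.search(pattern) !== -1) return text;
    if (Date.now() >= deadline) {
      throw new Error(`Timed out after ${timeoutMs}ms waiting for ${pattern}`);
    }
    await sleep(intervalMs);
  }
}

// Stand-in for the terminal: the command's output "appears" on the third read.
let reads = 0;
const fakeRead = () => (++reads >= 3 ? "$ echo hello\nhello" : "$ echo hello");

waitForPattern(fakeRead, /^hello$/m).then(() => {
  console.log("matched after", reads, "reads"); // matched after 3 reads
});
```

The demo pattern is anchored (/^hello$/m) because the typed command line itself contains "hello". An unanchored /hello/ would match as soon as the command is echoed, before any output exists.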
Phase 3: Frame Capture
While commands execute, VCR# continuously captures screenshots:
- Background thread runs a capture loop at 50fps (20ms intervals)
- Each iteration: await page.ScreenshotAsync()
- Playwright captures the browser's canvas elements
- Returns PNG bytes
- VCR# writes PNG to disk: frame0001.png, frame0002.png, ...
This happens in parallel with command execution. The capture loop doesn't care what commands are running—it just screenshots at regular intervals. This separation ensures consistent framerate regardless of command timing.
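A fixed-interval capture loop of this kind can be sketched as follows. The names are hypothetical and captureFrame stands in for the screenshot-and-write step. The four-digit padding is inferred from the example file names, and scheduling against the start time is one reasonable approach, not necessarily the one VCR# uses.

```javascript
const FPS = 50;
const FRAME_INTERVAL_MS = 1000 / FPS; // 20 ms between frames

// frame0001.png, frame0002.png, ... (padding width taken from the example names)
function frameFileName(index) {
  return `frame${String(index).padStart(4, "0")}.png`;
}

// Calls `captureFrame` on a fixed schedule until `shouldStop` returns true.
// Each deadline is computed from the start time, so small timer delays
// do not accumulate from one frame to the next.
async function runCaptureLoop(captureFrame, shouldStop, now = Date.now) {
  const start = now();
  let frame = 0;
  while (!shouldStop()) {
    frame += 1;
    await captureFrame(frameFileName(frame));
    const nextDue = start + frame * FRAME_INTERVAL_MS;
    const delay = Math.max(0, nextDue - now());
    await new Promise((resolve) => setTimeout(resolve, delay));
  }
  return frame;
}

// Stub capture that only records the names it was asked to write.
const written = [];
runCaptureLoop(
  async (name) => { written.push(name); },
  () => written.length >= 3
).then((count) => console.log(count, written));
```

The loop knows nothing about tape commands; it only needs a way to take a frame and a signal to stop, which is the separation described above.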
Phase 4: Cleanup
When recording finishes:
- VCR# stops frame capture (no more screenshots)
- Closes browser (disconnects from ttyd)
- Kills ttyd process (which terminates the shell)
- Processes frames (trim blank frames, generate video)
- Deletes temporary PNG files
The components shut down in reverse order of startup, ensuring clean termination.
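Reverse-order shutdown is commonly implemented as a stack of cleanup steps. The sketch below shows that general pattern under assumed names (TeardownStack, register, runAll). The cleanup bodies are placeholders, and nothing here is taken from VCR#'s source.

```javascript
// Records cleanup steps as resources start, then runs them last-in, first-out.
class TeardownStack {
  constructor() {
    this.steps = [];
  }

  register(name, cleanup) {
    this.steps.push({ name, cleanup });
  }

  async runAll() {
    const order = [];
    while (this.steps.length > 0) {
      const { name, cleanup } = this.steps.pop();
      try {
        await cleanup();
      } catch (err) {
        // Keep going so one failure does not leak the remaining resources.
        console.error(`cleanup of ${name} failed:`, err);
      }
      order.push(name);
    }
    return order;
  }
}

const teardown = new TeardownStack();
teardown.register("ttyd", async () => { /* kill ttyd, which ends the shell */ });
teardown.register("browser", async () => { /* close the browser */ });
teardown.register("frame capture", async () => { /* stop the capture loop */ });

teardown.runAll().then((order) => console.log(order.join(" -> ")));
// prints: frame capture -> browser -> ttyd
```

Registering each step at the moment its resource starts keeps the shutdown order correct even if startup fails partway through.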
Why This Seems Inefficient (But Isn't)
At first glance, this architecture looks wasteful:
- Why run a browser just to automate a terminal?
- Why use WebSockets when the shell is on the same machine?
- Why capture screenshots instead of recording terminal escape codes?
The answers reveal VCR#'s design priorities:
- Browser overhead is deliberate: We prioritize pixel-perfect, cross-platform consistent output over being the lightest recorder. xterm.js's production-grade rendering (it powers VS Code's terminal) justifies the browser overhead.
- WebSocket indirection enables platform independence: ttyd handles all platform-specific PTY complexity, so VCR# never directly manipulates PTY APIs that differ across Windows, macOS, and Linux.
- Screenshots enable universality: Capturing pixels (not escape codes) means output works everywhere—GitHub READMEs, presentations, documentation sites—with no special player needed.
The "inefficiency" is an investment in reliability, consistency, and compatibility.
What This Means for Users
Understanding this architecture explains several VCR# characteristics:
Why startup takes 2-3 seconds: VCR# must launch ttyd, wait for port availability, launch browser, wait for xterm.js initialization, and wait for shell prompt. That's unavoidable with this architecture.
Why VCR# requires dependencies: ttyd, Chromium (via Playwright), and FFmpeg aren't optional—they're fundamental to how VCR# works.
Why recordings look identical everywhere: xterm.js provides consistent rendering across all platforms.
Why Wait commands are necessary: VCR# can't know when commands finish (it doesn't monitor the shell directly). It must watch the browser's terminal display for expected output.
Why recordings work in CI/CD: Everything runs headless—no visible windows, no display server needed. ttyd serves localhost, browser runs headless, shell executes commands normally.
The Browser as Automation Target
To VCR#, the terminal is the browser tab showing ttyd's web interface. All interaction happens through browser automation—keyboard input via Playwright, output reading via JavaScript execution, visual capture via screenshots.
This abstraction is what makes VCR# maintainable. Browser automation is well-understood and well-supported. Terminal emulation is complex and platform-specific. By operating at the browser level, VCR# avoids re-implementing terminal complexity.
Conclusion
When you run vcr demo.tape, you're orchestrating a multi-layered pipeline where each component handles a distinct concern. The result: recordings that look identical everywhere, work in CI/CD, and require no special playback tools.