Deep dive · Architecture

The full audio-to-answer pipeline

Three services — a Python transcriber, a Node.js backend, and an Electron HUD — work together to go from raw microphone audio to streaming AI answers in under a second.

Microphone
  ↓ PCM audio (16kHz)
Python Transcriber (FastAPI)
  ↓ VAD gates noise · rolling 30s deque buffer
  ↓ Whisper (MLX / openai-whisper) every 300ms
  ↓ LocalAgreement-2 → committed + tentative words
  ↓ Socket.IO: stt_partial (every 300ms) / stt_final (silence)
Node.js Backend (Express + Socket.IO)
  ↓ handleSttFinal() → ai.service.answerInterviewQuestion()
  ↓ Prompt: system + session memory + question
  ↓ Provider cascade: Groq → Gemini → OpenAI → Claude
  ↓ Streams question_answer_token events
Electron HUD (always-on-top · content-protected)
  ↓ Renders tokens as they arrive
Answer visible on screen ✓

Service 1: Python Transcriber

Microphone capture and VAD

audio_recorder.py captures raw PCM at 16 kHz from the default input device (or the device configured in settings). Before any transcription, a Voice Activity Detection model gates the audio — silence, background noise, and non-speech frames are dropped to avoid feeding garbage to Whisper.

Rolling buffer and 300 ms decode loop

Speech frames are appended to a rolling deque buffer (up to ~30 seconds). Every 300 ms, StreamingSTT hands the current buffer to Whisper and receives a word-level transcript with timestamps.

LocalAgreement-2: commit before silence

Each successive decode is compared to the previous. A word at position i is committed when it matches across two consecutive decodes. Committed words are forwarded to Node as stt_partial.committed without waiting for silence. Tentative (not yet agreed) words follow as stt_partial.tentative.

When 700 ms of silence is detected — or the user presses Cmd+Shift+X — the remaining buffer is force-decoded and emitted as stt_final, triggering the AI answer.

Service 2: Node.js Backend

Receive stt_final and build prompt

dataHandler.js receives the stt_final event. It assembles the prompt by combining the current question with conversation memory from InterviewTranscriptBuffer — the last 3–5 Q&A pairs plus compressed summaries of older context (~850 tokens max overhead).

Multi-provider AI with streaming

ai.service.js tries the configured provider cascade (default: Groq → Gemini → OpenAI → Claude). The first healthy provider is called with stream: true. Each token is immediately forwarded as a question_answer_token Socket.IO event to the HUD — no buffering.

If a provider errors or rate-limits, it is marked as cooling off and the next one in the cascade is tried. Config hot-reloads on every fs.watch event on api-keys.json — no restart needed to change providers or keys.

Session memory update

After question_answer_complete, the Q&A pair is stored in InterviewTranscriptBuffer. When 5 pairs accumulate, an async Ollama call compresses them into a ~150-token summary. This happens fire-and-forget — it never delays the next answer.

Service 3: Electron HUD

Always-on-top, content-protected overlay

The HUD is a frameless Electron window (380×460 px) pinned above all other windows at screen-saver level. setContentProtection(true) makes it invisible to every software-based screen capture tool. Socket.IO connects over ws://localhost:4000.

Live strip + streaming answer render

stt_partial events update the live strip: committed words render bright, tentative words render dim/italic. When question_answer_token events arrive, they are appended to the answer card in real time. The card is visible and updating while the model is still generating.

Flow 2: Screenshot analysis

A second, parallel flow handles screenshot-based questions. screenshot-monitor.service.js polls the uploads/ directory. New images are preprocessed with Sharp (contrast enhancement, grayscale) then passed to Tesseract for OCR. The extracted text is treated as the question and sent through the same AI pipeline — same provider cascade, same streaming HUD render.

← How screenshare invisibility works ← Why it's fast