Deep dive · Latency

Why SolveWatch answers in under 400 ms

Three compounding decisions (streaming-first architecture, fast inference providers, and commit-before-silence STT), backed by a deliberately lean prompt, stack so that the answer is already streaming while the interviewer is still finishing their sentence.

First token (Groq): ~200 ms
First token (Gemini Flash): ~400 ms
Full screenshot answer: ~1.8 s
STT decode interval: 300 ms

1. Streaming-first: you read while the model writes

SolveWatch never waits for the full AI response. The moment the first token arrives, it is forwarded over Socket.IO to the Electron HUD and rendered. In practice, the first few words are visible within 200–400 ms of sending the prompt, which is essentially the provider's own time to first token.

The pipeline uses server-sent streaming from every supported provider (stream: true for OpenAI/Groq, streamGenerateContent for Gemini, stream: true for Anthropic). Each token chunk is immediately emitted as a question_answer_token Socket.IO event without buffering.
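The forwarding step can be sketched as a loop that emits each chunk the moment it arrives. This is a minimal illustration, not SolveWatch's actual code: the `forward_tokens` helper, the `emit` callback, and the payload shape are assumptions, and only the `question_answer_token` event name comes from the description above.

```python
from typing import Callable, Iterable


def forward_tokens(chunks: Iterable[str], emit: Callable[[str, dict], None]) -> str:
    """Emit every non-empty chunk as soon as it arrives; return the full answer.

    `chunks` stands in for a provider's streaming response and `emit` for a
    Socket.IO-style emit function. Both are hypothetical stand-ins.
    """
    parts = []
    for chunk in chunks:
        if not chunk:
            continue  # skip empty deltas so the HUD never renders a blank update
        # No buffering: the chunk is sent before the next one is read.
        emit("question_answer_token", {"index": len(parts), "token": chunk})
        parts.append(chunk)
    return "".join(parts)


if __name__ == "__main__":
    answer = forward_tokens(["Use ", "a hash ", "map."], lambda event, payload: print(event, payload))
    print(answer)
```

Because the loop holds no queue of its own, the HUD's first paint is bounded by the provider's first-token latency plus one emit.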

2. Groq LPU: the fastest inference available

Groq's Language Processing Unit (LPU) is purpose-built for transformer inference. On llama-3.3-70b-versatile, Groq consistently returns first tokens in 150–250 ms and generates 500–800 tokens/sec — roughly 5–10× faster throughput than GPU-based providers serving the same model.

For an interview answer of ~150 words (~200 tokens), generation at 500–800 tokens/sec takes 250–400 ms; adding the 150–250 ms first-token latency gives roughly 400–650 ms in total. The answer is fully visible on screen before most interviewers reach the end of their question.
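The estimate can be checked with a short calculation. The sketch below uses only the figures quoted in this section (150–250 ms first token, 500–800 tokens/sec, ~200 tokens); the function name is illustrative.

```python
def total_latency_ms(ttft_ms: float, tokens: int, tokens_per_sec: float) -> float:
    """Time to first token plus the time needed to generate the remaining output."""
    return ttft_ms + tokens * 1000.0 / tokens_per_sec


# Best case: fastest first token and highest throughput quoted above.
best = total_latency_ms(ttft_ms=150, tokens=200, tokens_per_sec=800)   # 150 + 250 = 400 ms
# Worst case: slowest first token and lowest throughput quoted above.
worst = total_latency_ms(ttft_ms=250, tokens=200, tokens_per_sec=500)  # 250 + 400 = 650 ms

if __name__ == "__main__":
    print(f"{best:.0f}-{worst:.0f} ms")
```

The result lands in roughly the same range as the figure quoted above.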

Gemini 2.5 Flash is the first fallback and is similarly optimised for low latency. GPT-4o mini and Claude Sonnet follow in the cascade for reliability, not speed.
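A provider cascade of this kind can be sketched as follows. The `stream_with_fallback` helper and its fallback rule (switch providers only if a stream fails before its first token) are assumptions for illustration; the source does not say how SolveWatch decides when to fall back.

```python
from typing import Callable, Iterable, Iterator, Optional, Sequence, Tuple

# (name, function that starts a token stream for a prompt). Hypothetical shape.
Provider = Tuple[str, Callable[[str], Iterable[str]]]


def stream_with_fallback(prompt: str, providers: Sequence[Provider]) -> Iterator[str]:
    """Yield tokens from the first provider that produces a first token."""
    last_error: Optional[Exception] = None
    for name, start_stream in providers:
        try:
            stream = iter(start_stream(prompt))
            first = next(stream)  # a failure here means nothing was shown yet
        except StopIteration:
            last_error = RuntimeError(f"{name} returned an empty stream")
            continue
        except Exception as exc:  # timeout, rate limit, network error, ...
            last_error = exc
            continue
        # Once a token is on screen, stay with this provider to avoid duplicated text.
        yield first
        yield from stream
        return
    raise RuntimeError("all providers failed") from last_error
```

Ordering the list as Groq, Gemini 2.5 Flash, GPT-4o mini, Claude Sonnet would mirror the cascade described here: the fastest provider is tried first and the slower ones only pay their latency when something upstream fails.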

3. LocalAgreement-2: commit words before silence

Most STT pipelines flush on silence: they wait until you stop talking, transcribe the whole utterance, then forward it. SolveWatch uses LocalAgreement-2, a streaming commit policy in which the rolling audio buffer is re-decoded every 300 ms while you are still speaking.

The algorithm compares consecutive decode outputs. A word is committed (treated as final) when it appears in the same position across two successive decodes. Committed words are forwarded to the AI immediately — without waiting for sentence completion or silence.
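The commit rule can be sketched in a few lines. This is a simplified illustration under stated assumptions: each decode is taken to be a word list covering the whole utterance so far (buffer trimming is ignored), and "same position across two successive decodes" is read as the longest prefix on which the two latest hypotheses agree. The class and method names are hypothetical.

```python
from typing import List


class LocalAgreement2:
    """Commit a word once two consecutive decodes agree on it (simplified sketch)."""

    def __init__(self) -> None:
        self.committed: List[str] = []  # words already treated as final
        self.previous: List[str] = []   # hypothesis from the previous decode

    def update(self, hypothesis: List[str]) -> List[str]:
        """Feed the latest decode; return only the newly committed words."""
        start = len(self.committed)
        agreed = start
        limit = min(len(hypothesis), len(self.previous))
        # Extend the committed prefix while both decodes agree word for word.
        while agreed < limit and hypothesis[agreed] == self.previous[agreed]:
            agreed += 1
        new_words = hypothesis[start:agreed]
        self.committed.extend(new_words)
        self.previous = hypothesis
        return new_words


if __name__ == "__main__":
    policy = LocalAgreement2()
    for decode in (["how", "wood"], ["how", "would", "you"], ["how", "would", "you", "reverse"]):
        print(policy.update(decode))
```

In the usage example, "wood" is never committed because the next decode revises it to "would", while "how" is committed after the second decode and "would you" after the third. The trailing word of each decode always waits one more interval for confirmation.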

In practice this means the AI prompt often arrives 2–4 seconds earlier than a flush-on-silence approach, giving the model a head start while you are still finishing your sentence. By the time you stop talking, the answer is already streaming.

4. Prompt kept lean

The AI prompt contains: system instructions (~200 tokens), the last 3–5 Q&A pairs from session memory (~400 tokens max), and the current question. Total prompt is typically under 800 tokens. Shorter prompts mean lower time-to-first-token (TTFT) at the provider — every extra 1k tokens adds roughly 50–100 ms TTFT at Groq speeds.
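One way to keep session memory inside such a budget is to walk backwards from the newest Q&A pair and stop when the cap is reached. The sketch below is illustrative only: `select_history` is a hypothetical helper, and the four-characters-per-token estimate is a rough assumption rather than a real tokenizer.

```python
from typing import List, Tuple

QAPair = Tuple[str, str]


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token for English text.
    return max(1, len(text) // 4)


def select_history(pairs: List[QAPair], max_pairs: int = 5, max_tokens: int = 400) -> List[QAPair]:
    """Keep the most recent pairs that fit the token budget, returned oldest first."""
    kept: List[QAPair] = []
    used = 0
    # Walk from newest to oldest so the freshest context is kept when space runs out.
    for question, answer in reversed(pairs[-max_pairs:]):
        cost = estimate_tokens(question) + estimate_tokens(answer)
        if used + cost > max_tokens:
            break
        kept.append((question, answer))
        used += cost
    kept.reverse()
    return kept
```

With the defaults shown, memory never exceeds five pairs or about 400 estimated tokens, which leaves room for the ~200-token system instructions and the current question inside the 800-token target.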

Ollama-based conversation summarisation runs asynchronously after an answer is complete — it never blocks the next question's response path.
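The non-blocking behaviour can be illustrated with a background task. This is a sketch under assumptions: `summarise` and `answer_question` are hypothetical stand-ins, and the sleep simulates a local model call rather than invoking Ollama.

```python
import asyncio
from typing import List, Set, Tuple


async def summarise(history: List[Tuple[str, str]]) -> str:
    await asyncio.sleep(0.05)  # stand-in for a slow local summarisation call
    return f"summary of {len(history)} turns"


async def answer_question(question: str, history: List[Tuple[str, str]], background: Set[asyncio.Task]) -> str:
    answer = f"answer to {question}"  # stand-in for the streamed answer
    history.append((question, answer))
    # Schedule summarisation without awaiting it, so the next question is not delayed.
    task = asyncio.create_task(summarise(list(history)))
    background.add(task)  # keep a reference so the task is not garbage-collected
    task.add_done_callback(background.discard)
    return answer


async def main() -> None:
    background: Set[asyncio.Task] = set()
    history: List[Tuple[str, str]] = []
    print(await answer_question("q1", history, background))  # returns immediately
    print(await asyncio.gather(*background))                 # summary arrives later


if __name__ == "__main__":
    asyncio.run(main())
```

The answer path returns as soon as the task is scheduled; the summary completes later on the same event loop.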

Provider latency comparison

Provider / Model	First token (TTFT)	Generation speed	Role in cascade
Groq · llama-3.3-70b	~150–250 ms	~600 tok/s	Primary (fastest)
Gemini 2.5 Flash	~300–500 ms	~300 tok/s	Fallback #2
GPT-4o mini	~400–700 ms	~200 tok/s	Fallback #3
Claude Sonnet	~500–900 ms	~150 tok/s	Fallback #4
Ollama (llama3.2:1b)	~30–80 ms*	~60 tok/s	Offline / classify only

* Ollama TTFT on Apple Silicon M-series. Actual figures vary by hardware and network conditions.
