Deep dive · Observability

Grafana Cloud + OpenTelemetry

SolveWatch instruments every layer of the pipeline — from VAD and Whisper decode inside the Python transcriber to AI provider latency and token cost in the Node backend — and ships it all to Grafana Cloud over OTLP, without ever touching the hot answer path.

2 instrumented services · 10 s metric batch interval · 10k log queue depth · 0 ms hot path overhead

Architecture: two services, one Grafana destination

Both the Node.js backend (src/utils/telemetry.js) and the Python transcriber (transcriber/telemetry.py) ship an identical OTel stack: a metrics pipeline backed by MeterProvider with a PeriodicExportingMetricReader (10 s batch) and a logs pipeline backed by LoggerProvider with BatchLogRecordProcessor (5 s flush, 10 k bounded queue). Both export over OTLP HTTP to Grafana Cloud.

The previous on-disk NDJSON writers (logs/app.jsonl, logs/memory.jsonl) have been removed, and Grafana Cloud is now the single log destination. If telemetry is disabled, every call is a no-op. If the endpoint is unreachable, records wait in a bounded queue and are eventually dropped. In both cases the answer pipeline is never blocked.
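
The "never blocked" guarantee can be pictured as a thin facade that swallows every telemetry failure. The sketch below is illustrative only: the Telemetry class, its log_event method, and the sink callback are hypothetical names, not the contents of telemetry.py.

```python
from typing import Any, Callable, Dict, Optional


class Telemetry:
    """Hypothetical facade: calls return immediately and never raise."""

    def __init__(self, enabled: bool,
                 sink: Optional[Callable[[Dict[str, Any]], None]] = None) -> None:
        self.enabled = enabled
        self._sink = sink  # stand-in for "hand the record to the OTel pipeline"

    def log_event(self, name: str, **attrs: Any) -> bool:
        """Return True if the record was handed off, False if it was skipped."""
        if not self.enabled or self._sink is None:
            return False  # disabled: pure no-op
        try:
            self._sink({"event": name, **attrs})
            return True
        except Exception:
            # A telemetry failure must never propagate into the answer path.
            return False
```

Callers can ignore the return value; it exists only so the behavior is easy to check.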

Python transcriber (telemetry.py)
  OTel MeterProvider + LoggerProvider
  ↓ OTLP HTTP /v1/metrics (10 s batch), /v1/logs (5 s flush)
Node.js backend (telemetry.js)
  OTel MeterProvider + LoggerProvider
  ↓ OTLP HTTP /v1/metrics (10 s batch), /v1/logs (5 s flush)
Grafana Cloud OTLP gateway
  ↓ metrics → Grafana Mimir (PromQL)
  ↓ logs   → Grafana Loki (LogQL)
Dashboard + alerts in docs/grafana-dashboard.json ✓

Node.js backend metrics

The server instruments every step of both the screenshot and listen flows. Histograms capture latency distributions; counters track throughput and cost.

Metric	Type	What it measures
ai_ttft_ms	histogram	Time-to-first-token per provider call
ai_total_ms	histogram	Full AI generation duration
ocr_duration_ms	histogram	Tesseract OCR latency per image
screenshot_pipeline_total_ms	histogram	End-to-end screenshot → answer ready
end_to_end_question_ms	histogram	stt_final received → answer complete
http_request_duration_ms	histogram	Express route latency
ai_provider_success_total	counter	Successful AI calls (labeled by provider)
ai_provider_failure_total	counter	Failed AI calls (labeled by provider + reason)
ai_input_tokens_total	counter	Input tokens consumed
ai_output_tokens_total	counter	Output tokens generated
ai_cost_usd_total	counter	Estimated AI spend in USD
ai_cache_read_tokens_total	counter	Anthropic prompt-cache hits (tokens read)
ai_cache_creation_tokens_total	counter	Anthropic prompt-cache writes (first-time)
screenshot_captured_total	counter	Screenshots processed
ocr_failed_total	counter	OCR failures
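
To make the difference between ai_ttft_ms and ai_total_ms concrete, here is a minimal sketch of timing a streamed response. It is written in Python for illustration even though the backend is Node.js, and timed_stream, record, and clock are hypothetical names rather than anything from telemetry.js.

```python
import time
from typing import Callable, Iterable, Iterator


def timed_stream(
    tokens: Iterable[str],
    record: Callable[[str, float], None],
    clock: Callable[[], float] = time.perf_counter,
) -> Iterator[str]:
    """Yield tokens unchanged while recording ai_ttft_ms and ai_total_ms."""
    # Generators are lazy, so the timer starts when the consumer begins iterating.
    start = clock()
    first_seen = False
    try:
        for token in tokens:
            if not first_seen:
                first_seen = True
                record("ai_ttft_ms", (clock() - start) * 1000.0)
            yield token
    finally:
        # Runs on normal completion and when the consumer stops early.
        record("ai_total_ms", (clock() - start) * 1000.0)
```

In a real pipeline, record would forward the sample to a histogram; here it is just a callback so the timing logic stays independent of any SDK.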

Python transcriber metrics

The transcriber instruments VAD, Whisper, and speaker identification — the three steps where latency variance most affects the time between when someone finishes speaking and when the AI starts answering.

Metric	Type	What it measures
vad_latency_ms	histogram	VAD inference time per audio chunk
whisper_decode_ms	histogram	Whisper per-decode duration (300 ms loop)
speaker_id_latency_ms	histogram	Speaker ID classify() call latency
silence_wait_actual_ms	histogram	Measured silence gap before stt_final emit
utterances_detected_total	counter	VAD speech-start transitions
utterances_passed_total	counter	Utterances forwarded to AI
utterances_discarded_total	counter	Utterances filtered (labeled with reason)
listener_active	gauge	Always-on listener running (1/0)
whisper_model_loaded	gauge	Whisper model warm in memory (1/0)
speaker_id_model_status	gauge	Speaker ID model ready (labeled)
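
The three utterance counters are most useful when read together. The sketch below is an in-memory stand-in, not the transcriber's code: UtteranceStats is a hypothetical class, the reason strings are invented examples, and it assumes every detected utterance ends up either passed or discarded.

```python
from collections import Counter
from typing import Optional


class UtteranceStats:
    """In-memory stand-in for the three utterance counters."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, discard_reason: Optional[str] = None) -> None:
        """Count one detected utterance and its outcome."""
        self.counts["utterances_detected_total"] += 1
        if discard_reason is None:
            self.counts["utterances_passed_total"] += 1
        else:
            # The reason acts as a label so discards can be broken down by cause.
            self.counts[("utterances_discarded_total", discard_reason)] += 1

    def pass_rate(self) -> float:
        """Fraction of detected utterances that were forwarded to the AI."""
        detected = self.counts["utterances_detected_total"]
        if detected == 0:
            return 0.0
        return self.counts["utterances_passed_total"] / detected
```

A dashboard panel could compute the same ratio from the exported counters to show how much speech the filters remove.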

Host and process gauges (both services)

Both services run a background sampler every 10 s (a daemon thread in Python, an unref'd interval timer in Node.js) that pushes system-resource gauges. For GPU figures, the sampler reports MPS-allocated memory via PyTorch on Apple Silicon and uses pynvml on NVIDIA. Each gauge carries host_name, host_owner, and device_cpu_brand as direct metric labels, so Grafana dashboards can filter by machine without relying on resource attribute promotion.

Metric	What it measures
host_cpu_percent	Overall host CPU %
host_memory_percent	Host RAM used %
host_memory_used_bytes	Host RAM used (bytes)
process_cpu_percent	This process CPU %
process_memory_rss_bytes	Process RSS memory
gpu_utilization_percent	GPU utilization %
gpu_memory_used_bytes	GPU memory used
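
A background sampler of this kind might look like the following on the Python side. This is a sketch under assumptions: start_sampler, sample, and publish are hypothetical names, and the real sampler's data sources (PyTorch, pynvml, and so on) are replaced by an injected callback.

```python
import threading
from typing import Callable, Dict, Tuple


def start_sampler(
    sample: Callable[[], Dict[str, float]],
    publish: Callable[[Dict[str, float]], None],
    interval_s: float = 10.0,
) -> Tuple[threading.Thread, threading.Event]:
    """Run sample() every interval_s seconds on a daemon thread."""
    stop = threading.Event()

    def loop() -> None:
        while not stop.is_set():
            try:
                publish(sample())
            except Exception:
                pass  # a failed sample must never crash the service
            # wait() doubles as the sleep and returns early once stop is set
            stop.wait(interval_s)

    # daemon=True: the thread does not keep the process alive at shutdown
    thread = threading.Thread(target=loop, name="telemetry-sampler", daemon=True)
    thread.start()
    return thread, stop
```

Calling stop.set() ends the loop promptly because the wait is interruptible, which avoids a 10 s delay at shutdown.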

Multi-machine identity

Every metric and log record carries OTel resource attributes that uniquely identify the machine. host.id uses IOPlatformUUID on macOS and /etc/machine-id on Linux, so it stays stable across reboots. Both services derive the same host.id for a given machine, which means logs from Node and metrics from Python can be correlated in Grafana on host.id alone, with no extra join key.

service.name        solvewatch.server / solvewatch.transcriber
host.name           machine hostname
host.id             IOPlatformUUID (macOS) / machine-id (Linux)
host.arch           arm64 / x86_64
device.cpu.brand    Apple M3 Pro / Intel Core i9-...
os.type             darwin / linux / windows
host.owner          optional — set via host_owner in config
gpu.vendor / gpu.model  Apple / NVIDIA + model string
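
One plausible way to derive host.id in Python is shown below. The function names are hypothetical and the approach is an assumption: it reads IOPlatformUUID from the output of the macOS ioreg tool and reads /etc/machine-id on Linux, returning None anywhere else or on failure.

```python
import platform
import re
import subprocess
from pathlib import Path
from typing import Optional

_UUID_RE = re.compile(r'"IOPlatformUUID"\s*=\s*"([^"]+)"')


def parse_ioplatform_uuid(ioreg_output: str) -> Optional[str]:
    """Extract the IOPlatformUUID value from ioreg output, if present."""
    match = _UUID_RE.search(ioreg_output)
    return match.group(1) if match else None


def read_host_id() -> Optional[str]:
    """Return a reboot-stable machine identifier, or None if unavailable."""
    system = platform.system()
    try:
        if system == "Darwin":
            result = subprocess.run(
                ["ioreg", "-rd1", "-c", "IOPlatformExpertDevice"],
                capture_output=True, text=True, timeout=5, check=True,
            )
            return parse_ioplatform_uuid(result.stdout)
        if system == "Linux":
            return Path("/etc/machine-id").read_text().strip() or None
    except (OSError, subprocess.SubprocessError):
        return None
    return None  # other platforms are not covered by this sketch
```

Because both identifiers come from the operating system rather than the process, a Node.js implementation reading the same sources would arrive at the same value.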

Logging: from NDJSON files to Loki

Previously, file-logger.js and memory-logger.js wrote structured NDJSON to logs/app.jsonl and logs/memory.jsonl on disk. In this overhaul those writers have been replaced: both modules now delegate to telemetry.logEvent(), which emits an OTel log record to Grafana Loki via the same OTLP HTTP exporter. The call-site API is identical — no existing event emitters changed.

Logs are queued in a bounded BatchLogRecordProcessor (max 10 k records, flush every 5 s). If Grafana is unreachable, the OTel SDK retries with exponential backoff and drops the oldest records when the queue fills — the answer path is never blocked. The same design applies on the Python side (log_writer.py → telemetry.log()).
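
The drop-oldest behavior described above can be modeled with a fixed-size deque. This is a model of the described behavior, not the OTel SDK's internals; BoundedLogQueue and its methods are hypothetical names.

```python
import threading
from collections import deque
from typing import Any, Deque, Dict, List


class BoundedLogQueue:
    """Bounded queue that evicts the oldest record when full."""

    def __init__(self, max_records: int = 10_000) -> None:
        self._records: Deque[Dict[str, Any]] = deque(maxlen=max_records)
        self._lock = threading.Lock()
        self.dropped = 0

    def put(self, record: Dict[str, Any]) -> None:
        """Enqueue without ever waiting for space or for the network."""
        with self._lock:
            if len(self._records) == self._records.maxlen:
                self.dropped += 1  # append() below evicts the oldest record
            self._records.append(record)

    def drain(self) -> List[Dict[str, Any]]:
        """Remove and return everything queued, oldest first."""
        with self._lock:
            batch = list(self._records)
            self._records.clear()
        return batch
```

A flush thread would call drain() on its 5 s schedule and export the batch; producers only ever touch put(), so a slow or unreachable endpoint costs them nothing beyond a brief lock.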

Configuration

Telemetry is configured in config/api-keys.json under a telemetry key and can be toggled live from the settings page at http://localhost:4000/settings — no restart required. The settings page validates the OTLP endpoint before saving by POSTing an empty metrics payload and checking the response code.
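
The endpoint check could be sketched as follows. Several details are assumptions: the function names are hypothetical, the empty payload uses the OTLP/HTTP JSON encoding ({"resourceMetrics": []}), and any 2xx status is treated as success, which may not match the settings page exactly.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Dict


def http_post(url: str, body: bytes, headers: Dict[str, str],
              timeout_s: float = 5.0) -> int:
    """POST body to url and return the HTTP status code."""
    request = urllib.request.Request(url, data=body, headers=headers, method="POST")
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses still carry a status code


def validate_otlp_endpoint(
    endpoint: str,
    headers: Dict[str, str],
    post: Callable[[str, bytes, Dict[str, str]], int] = http_post,
) -> bool:
    """Send an empty metrics export and report whether it was accepted."""
    url = endpoint.rstrip("/") + "/v1/metrics"
    body = json.dumps({"resourceMetrics": []}).encode("utf-8")
    all_headers = {"Content-Type": "application/json", **headers}
    try:
        status = post(url, body, all_headers)
    except OSError:
        return False  # DNS failure, refused connection, timeout, and similar
    return 200 <= status < 300
```

Passing post as a parameter keeps the check testable without network access.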

// api-keys.json (the // comments annotate this example; standard JSON does not permit comments)
{
  "telemetry": {
    "enabled": true,
    "otlp_endpoint": "https://otlp-gateway-prod-us-east-0.grafana.net/otlp",
    "instance_id":   "123456",  // Grafana Cloud stack ID
    "access_token":  "glc_eyJ...",  // Access Policy token
    "service_prefix": "solvewatch",  // prefix for service.name
    "host_owner":    "yourname"   // optional machine label
  }
}
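
For illustration, the fields above could be turned into exporter settings roughly like this. Grafana Cloud documents HTTP Basic authentication for its OTLP gateway, with the instance ID as the username and the access token as the password; whether SolveWatch builds the header this way is an assumption, and grafana_otlp_settings is a hypothetical helper.

```python
import base64
from typing import Any, Dict


def grafana_otlp_settings(telemetry: Dict[str, Any]) -> Dict[str, Any]:
    """Derive signal URLs, auth header, and service names from the config."""
    endpoint = str(telemetry["otlp_endpoint"]).rstrip("/")
    credentials = f'{telemetry["instance_id"]}:{telemetry["access_token"]}'
    token = base64.b64encode(credentials.encode("utf-8")).decode("ascii")
    prefix = telemetry.get("service_prefix", "solvewatch")
    return {
        "metrics_url": f"{endpoint}/v1/metrics",
        "logs_url": f"{endpoint}/v1/logs",
        "headers": {"Authorization": f"Basic {token}"},
        # service.name values follow the <prefix>.<role> pattern
        "service_names": [f"{prefix}.server", f"{prefix}.transcriber"],
    }
```

The access token is a secret, so a header built this way should be kept out of logs and error messages.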
