
Deep dive · Observability

Grafana Cloud + OpenTelemetry

SolveWatch instruments every layer of the pipeline — from VAD and Whisper decode inside the Python transcriber to AI provider latency and token cost in the Node backend — and ships it all to Grafana Cloud over OTLP, without ever touching the hot answer path.

2 instrumented services · 10 s metric batch interval · 10 k log queue depth · 0 ms hot path overhead

Architecture: two services, one Grafana destination

Both the Node.js backend (src/utils/telemetry.js) and the Python transcriber (transcriber/telemetry.py) ship an identical OTel stack: a metrics pipeline backed by MeterProvider with a PeriodicExportingMetricReader (10 s batch) and a logs pipeline backed by LoggerProvider with BatchLogRecordProcessor (5 s flush, 10 k bounded queue). Both export over OTLP HTTP to Grafana Cloud.

The previous on-disk NDJSON writers (logs/app.jsonl, logs/memory.jsonl) have been removed. Grafana Cloud is now the single log destination. If telemetry is disabled or the endpoint is unreachable, every call is a no-op — the answer pipeline is never blocked.

Python transcriber (telemetry.py): OTel MeterProvider + LoggerProvider
  ↓ OTLP HTTP /v1/metrics (10 s batch), /v1/logs (5 s flush)
Node.js backend (telemetry.js): OTel MeterProvider + LoggerProvider
  ↓ OTLP HTTP /v1/metrics (10 s batch), /v1/logs (5 s flush)
Grafana Cloud OTLP gateway
  ↓ metrics → Grafana Mimir (PromQL)
  ↓ logs → Grafana Loki (LogQL)
Dashboard + alerts in docs/grafana-dashboard.json
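
The no-op guarantee described above (telemetry disabled or endpoint unreachable means the caller is never affected) can be sketched as a thin facade. This is a minimal illustration and not the code in telemetry.py: the class name SafeTelemetry, the sink callback, and the dropped counter are all hypothetical, and the sink is assumed to be a cheap in-memory enqueue.

```python
from typing import Callable, Dict, Optional

# Hypothetical hook standing in for the real OTel instruments.
Sink = Callable[[str, float, Dict[str, str]], None]


class SafeTelemetry:
    """Facade whose record() call never raises into the answer pipeline."""

    def __init__(self, enabled: bool, sink: Optional[Sink] = None) -> None:
        # When telemetry is disabled we keep no sink at all.
        self._sink = sink if enabled else None
        self.dropped = 0  # records lost because the sink failed

    def record(self, name: str, value: float, **labels: str) -> None:
        if self._sink is None:
            return  # disabled: plain no-op
        try:
            self._sink(name, value, labels)
        except Exception:
            # Exporter trouble is counted and swallowed, never re-raised.
            self.dropped += 1
```

Call sites stay identical whether telemetry is on or off, which is the property that keeps instrumentation out of the hot path's failure modes.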

Node.js backend metrics

The server instruments every step of both the screenshot and listen flows. Histograms capture latency distributions; counters track throughput and cost.

Metric | Type | What it measures
ai_ttft_ms | histogram | Time-to-first-token per provider call
ai_total_ms | histogram | Full AI generation duration
ocr_duration_ms | histogram | Tesseract OCR latency per image
screenshot_pipeline_total_ms | histogram | End-to-end screenshot → answer ready
end_to_end_question_ms | histogram | stt_final received → answer complete
http_request_duration_ms | histogram | Express route latency
ai_provider_success_total | counter | Successful AI calls (labeled by provider)
ai_provider_failure_total | counter | Failed AI calls (labeled by provider + reason)
ai_input_tokens_total | counter | Input tokens consumed
ai_output_tokens_total | counter | Output tokens generated
ai_cost_usd_total | counter | Estimated AI spend in USD
ai_cache_read_tokens_total | counter | Anthropic prompt-cache hits (tokens read)
ai_cache_creation_tokens_total | counter | Anthropic prompt-cache writes (first-time)
screenshot_captured_total | counter | Screenshots processed
ocr_failed_total | counter | OCR failures
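
To make the difference between ai_ttft_ms and ai_total_ms concrete, here is a rough sketch of timing a streamed response. It is written in Python for illustration only (the backend itself is Node.js), and timed_stream, record, and clock are hypothetical names, not functions from the SolveWatch codebase.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
Record = Callable[[str, float], None]


def timed_stream(
    chunks: Iterable[T],
    record: Record,
    clock: Callable[[], float] = time.perf_counter,
) -> Iterator[T]:
    """Yield chunks unchanged while recording ai_ttft_ms and ai_total_ms."""
    # Generators are lazy: the clock starts when the consumer first pulls.
    start = clock()
    first_seen = False
    try:
        for chunk in chunks:
            if not first_seen:
                first_seen = True
                record("ai_ttft_ms", (clock() - start) * 1000.0)
            yield chunk
    finally:
        # Runs on normal completion, on error, and when the consumer
        # closes the generator early.
        record("ai_total_ms", (clock() - start) * 1000.0)
```

Passing the clock in as a parameter keeps the wrapper testable with a fake time source.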

Python transcriber metrics

The transcriber instruments VAD, Whisper, and speaker identification — the three steps where latency variance most affects the time between when someone finishes speaking and when the AI starts answering.

Metric | Type | What it measures
vad_latency_ms | histogram | VAD inference time per audio chunk
whisper_decode_ms | histogram | Whisper per-decode duration (300 ms loop)
speaker_id_latency_ms | histogram | Speaker ID classify() call latency
silence_wait_actual_ms | histogram | Measured silence gap before stt_final emit
utterances_detected_total | counter | VAD speech-start transitions
utterances_passed_total | counter | Utterances forwarded to AI
utterances_discarded_total | counter | Utterances filtered (labeled with reason)
listener_active | gauge | Always-on listener running (1/0)
whisper_model_loaded | gauge | Whisper model warm in memory (1/0)
speaker_id_model_status | gauge | Speaker ID model ready (labeled)
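
The three utterance counters in the table above can be pictured with a small in-memory stand-in. This is a sketch under assumptions: the class UtteranceCounters, its method names, and the example reason "too_short" are hypothetical, and it assumes every finished utterance is either passed or discarded.

```python
from collections import Counter
from typing import Optional, Tuple

Key = Tuple[str, Tuple[Tuple[str, str], ...]]


class UtteranceCounters:
    """In-memory stand-in for the three utterance counters."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def _inc(self, name: str, **labels: str) -> None:
        # Labels are part of the key, so each reason gets its own series.
        key: Key = (name, tuple(sorted(labels.items())))
        self.counts[key] += 1

    def speech_started(self) -> None:
        self._inc("utterances_detected_total")

    def utterance_finished(self, discard_reason: Optional[str] = None) -> None:
        if discard_reason is None:
            self._inc("utterances_passed_total")
        else:
            self._inc("utterances_discarded_total", reason=discard_reason)

    def total(self, name: str) -> int:
        """Sum a counter across all of its label combinations."""
        return sum(v for (n, _), v in self.counts.items() if n == name)
```

Keeping the discard reason as a label rather than a separate metric per reason lets a dashboard break down discards without new instruments.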

Host and process gauges (both services)

Both services run a background sampler every 10 s (a daemon thread in Python, an unref'd interval in Node) that pushes system-resource gauges. On Apple Silicon the sampler reports MPS-allocated memory via PyTorch; on NVIDIA it uses pynvml. Each gauge carries host_name, host_owner, and device_cpu_brand as direct metric labels, so Grafana dashboards can filter by machine without relying on resource attribute promotion.

Gauge | What it measures
host_cpu_percent | Overall host CPU %
host_memory_percent | Host RAM used %
host_memory_used_bytes | Host RAM used (bytes)
process_cpu_percent | This process CPU %
process_memory_rss_bytes | Process RSS memory
gpu_utilization_percent | GPU utilization %
gpu_memory_used_bytes | GPU memory used
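
A background sampler of this kind might look roughly like the following on the Python side. It is a sketch only: GaugeSampler, read, and publish are hypothetical names, and the actual resource readings (CPU, RSS, GPU) are left to the injected read callable rather than guessed at here.

```python
import threading
from typing import Callable, Dict

Gauges = Dict[str, float]
Publish = Callable[[str, float, Dict[str, str]], None]


class GaugeSampler:
    """Polls read() on a daemon thread and publishes each gauge with labels."""

    def __init__(
        self,
        read: Callable[[], Gauges],
        publish: Publish,
        labels: Dict[str, str],
        interval_s: float = 10.0,
    ) -> None:
        self._read = read
        self._publish = publish
        self._labels = labels
        self._interval_s = interval_s
        self._stop = threading.Event()
        # daemon=True: the sampler never keeps the process alive on exit.
        self.thread = threading.Thread(
            target=self._run, name="gauge-sampler", daemon=True
        )

    def start(self) -> None:
        self.thread.start()

    def stop(self) -> None:
        self._stop.set()
        self.thread.join(timeout=5.0)

    def _run(self) -> None:
        while not self._stop.is_set():
            try:
                for name, value in self._read().items():
                    self._publish(name, value, self._labels)
            except Exception:
                pass  # a failed sample is skipped, not fatal
            # Event.wait doubles as an interruptible sleep.
            self._stop.wait(self._interval_s)
```

Passing the host labels in once at construction mirrors the idea of attaching host_name and friends directly to every gauge sample.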

Multi-machine identity

Every metric and log record carries OTel resource attributes that uniquely identify the machine. host.id uses IOPlatformUUID on macOS and /etc/machine-id on Linux, so it stays stable across reboots. Both services derive the same host.id for a given machine, so logs from Node and metrics from Python can be correlated in Grafana on host.id alone, with no separate join key.

Attribute | Value
service.name | solvewatch.server / solvewatch.transcriber
host.name | machine hostname
host.id | IOPlatformUUID (macOS) / machine-id (Linux)
host.arch | arm64 / x86_64
device.cpu.brand | Apple M3 Pro / Intel Core i9-...
os.type | darwin / linux / windows
host.owner | optional, set via host_owner in config
gpu.vendor / gpu.model | Apple / NVIDIA + model string
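
One plausible way to derive host.id from the sources named above is sketched below. The function names are hypothetical, and reading IOPlatformUUID by shelling out to ioreg is an assumption about the approach, not a description of what telemetry.py does.

```python
import platform
import re
import subprocess
from pathlib import Path
from typing import Optional

_UUID_RE = re.compile(r'"IOPlatformUUID"\s*=\s*"([^"]+)"')


def parse_ioplatform_uuid(ioreg_output: str) -> Optional[str]:
    """Extract the IOPlatformUUID value from ioreg text output."""
    match = _UUID_RE.search(ioreg_output)
    return match.group(1) if match else None


def read_host_id() -> Optional[str]:
    """Return a reboot-stable machine identifier, or None if unavailable."""
    system = platform.system()
    try:
        if system == "Darwin":
            out = subprocess.run(
                ["ioreg", "-rd1", "-c", "IOPlatformExpertDevice"],
                capture_output=True, text=True, timeout=5, check=True,
            ).stdout
            return parse_ioplatform_uuid(out)
        if system == "Linux":
            return Path("/etc/machine-id").read_text().strip() or None
    except (OSError, subprocess.SubprocessError):
        pass  # missing tool or file: fall through to None
    return None
```

Because both sources are properties of the machine rather than the process, any service that reads them the same way ends up with the same host.id.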

Logging: from NDJSON files to Loki

Previously, file-logger.js and memory-logger.js wrote structured NDJSON to logs/app.jsonl and logs/memory.jsonl on disk. In this overhaul those writers have been replaced: both modules now delegate to telemetry.logEvent(), which emits an OTel log record to Grafana Loki via the same OTLP HTTP exporter. The call-site API is identical — no existing event emitters changed.

Logs are queued in a bounded BatchLogRecordProcessor (max 10 k records, flush every 5 s). If Grafana is unreachable, the OTel SDK retries with exponential backoff and drops the oldest records when the queue fills, so the answer path is never blocked. The same design applies on the Python side (log_writer.py → telemetry.log()).
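
The drop-oldest behaviour of a bounded queue can be illustrated with a few lines of standard-library Python. This is a simplified model of the idea, not the OTel SDK's BatchLogRecordProcessor; BoundedLogQueue and its batch size of 512 are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List


class BoundedLogQueue:
    """Fixed-capacity queue: enqueue never blocks, oldest records go first."""

    def __init__(self, max_size: int = 10_000) -> None:
        self._items: Deque[Dict] = deque(maxlen=max_size)
        self.dropped = 0  # approximate under concurrent writers

    def enqueue(self, record: Dict) -> None:
        if len(self._items) == self._items.maxlen:
            # The append below silently evicts the oldest record.
            self.dropped += 1
        self._items.append(record)

    def drain(self, batch_size: int = 512) -> List[Dict]:
        """Remove and return up to batch_size records, oldest first."""
        count = min(batch_size, len(self._items))
        return [self._items.popleft() for _ in range(count)]
```

A periodic flusher would call drain() every few seconds and hand the batch to the exporter, so a slow or failing endpoint only costs old log records and never stalls the producer.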

Configuration

Telemetry is configured in config/api-keys.json under a telemetry key and can be toggled live from the settings page at http://localhost:4000/settings — no restart required. The settings page validates the OTLP endpoint before saving by POSTing an empty metrics payload and checking the response code.

// api-keys.json (comments are annotations only; strict JSON does not allow them)
{
  "telemetry": {
    "enabled": true,
    "otlp_endpoint": "https://otlp-gateway-prod-us-east-0.grafana.net/otlp",
    "instance_id": "123456",          // Grafana Cloud stack ID
    "access_token": "glc_eyJ...",     // Access Policy token
    "service_prefix": "solvewatch",   // prefix for service.name
    "host_owner": "yourname"          // optional machine label
  }
}
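
The endpoint check described above (POST an empty metrics payload, inspect the response code) could be sketched as follows. The helper names, the exact empty payload, and the use of HTTP Basic auth built from instance_id and access_token are assumptions for illustration; the settings page's real implementation may differ.

```python
import base64
import json
import urllib.request
from typing import Mapping


def build_probe_request(cfg: Mapping[str, str]) -> urllib.request.Request:
    """Build a POST of an empty OTLP/JSON metrics payload to /v1/metrics."""
    credentials = f'{cfg["instance_id"]}:{cfg["access_token"]}'.encode()
    auth = base64.b64encode(credentials).decode("ascii")
    url = cfg["otlp_endpoint"].rstrip("/") + "/v1/metrics"
    body = json.dumps({"resourceMetrics": []}).encode()  # assumed payload
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {auth}",
            "Content-Type": "application/json",
        },
    )


def endpoint_ok(cfg: Mapping[str, str], timeout_s: float = 5.0) -> bool:
    """Return True only if the gateway answers the probe with a 2xx status."""
    try:
        request = build_probe_request(cfg)
        with urllib.request.urlopen(request, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (OSError, ValueError):
        # Network errors, HTTP error statuses, and malformed URLs all fail.
        return False
```

Splitting request construction from the network call keeps the credential and URL handling easy to check without contacting Grafana Cloud.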