Hacker News

suganesh95, today at 2:53 AM

This is great. I built 3 assistants last week for the same purpose, each with an entirely different tech stack.

(Raspberry Pi Voice Assistant)

Jarvis uses Porcupine for wake word detection with the built-in "jarvis" keyword. Speech input flows through ElevenLabs Scribe v2 for transcription. The LLM layer uses Groq llama-3.3-70b-versatile as primary with Groq llama-3.1-8b-instant as fallback. Text-to-speech uses Smallest.ai Lightning with the Chetan voice. Audio input and output are handled by ALSA (arecord/aplay). End-to-end latency is 3.8–7.3 seconds.
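
For anyone curious about the ALSA part, here is a minimal sketch of driving arecord/aplay from Python. It is not the actual Jarvis code: the helper names, the file path, the "default" device, and the 16 kHz mono 16-bit format (chosen because that is what Porcupine expects) are all assumptions.

```python
import subprocess

# Assumption: 16 kHz, mono, 16-bit little-endian PCM.
SAMPLE_RATE_HZ = 16000


def record_cmd(path, seconds, device="default"):
    """Build an arecord command that captures `seconds` of audio into a WAV file."""
    return [
        "arecord",
        "-D", device,                # ALSA capture device
        "-f", "S16_LE",              # 16-bit signed little-endian samples
        "-r", str(SAMPLE_RATE_HZ),   # sample rate in Hz
        "-c", "1",                   # mono
        "-d", str(seconds),          # stop after this many seconds
        "-t", "wav",
        path,
    ]


def play_cmd(path, device="default"):
    """Build an aplay command that plays a WAV file."""
    return ["aplay", "-D", device, path]


def record_then_play(path="/tmp/utterance.wav", seconds=5):
    """Hypothetical round trip: record from the mic, then play it back."""
    subprocess.run(record_cmd(path, seconds), check=True)
    subprocess.run(play_cmd(path), check=True)
```

In a real loop the recorded file would go to the transcription step instead of straight back to aplay, and the TTS output would be what gets played.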

(Twilio + VPS)

This setup ingests audio via Twilio Media Streams in μ-law 8kHz format. Silero VAD detects speech for turn boundaries. Groq Whisper handles batch transcription. The LLM stack chains Groq llama-4-scout-17b (primary), Groq llama-3.3-70b-versatile (fallback 1), and Groq llama-3.1-8b-instant (fallback 2) with automatic failover. Text-to-speech uses Smallest.ai Lightning with Pooja voice. Audio is encoded from PCM to μ-law 8kHz before streaming back via Twilio. End-to-end latency is 0.5–1.1 seconds.
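
The automatic failover can be sketched roughly like this. It is an illustration, not the code running on the VPS: `call_model` is a hypothetical callable standing in for whatever client actually hits Groq, and the model names are copied from the description above rather than from the exact API identifiers.

```python
# Model names as written above; the real API identifiers may differ.
MODEL_CHAIN = (
    "llama-4-scout-17b",        # primary
    "llama-3.3-70b-versatile",  # fallback 1
    "llama-3.1-8b-instant",     # fallback 2
)


def complete_with_failover(prompt, call_model, models=MODEL_CHAIN):
    """Try each model in order and return (model, reply) from the first success.

    `call_model(model, prompt)` is a hypothetical function that returns the
    reply text or raises on any failure (timeout, rate limit, server error).
    """
    errors = []
    for model in models:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:
            # Remember why this model failed, then fall through to the next one.
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

Trying the models strictly in order keeps the common case on the primary; in practice a short per-call timeout matters here, since a slow primary would otherwise eat the latency budget before the fallback ever runs.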

───

(Alexa Skill)

Tina receives voice input through Alexa's built-in ASR, followed by Alexa's NLU for intent detection. The LLM is Claude Haiku routed through the OpenClaw gateway. Voice output uses Alexa's native text-to-speech. End-to-end latency is 1.5–2.5 seconds.
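
Since Alexa does both the ASR and the TTS, the skill backend mostly just wraps the LLM reply in Alexa's JSON response envelope. A minimal sketch, assuming plain-text speech output; `ask_llm` is a hypothetical stand-in for the call through the gateway, and the handler name is made up for illustration.

```python
def build_alexa_response(text, end_session=False):
    """Wrap reply text in the Alexa custom-skill response envelope."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": end_session,
        },
    }


def handle_query(user_text, ask_llm):
    """Hypothetical handler: send the recognized text to the LLM, speak the reply.

    `ask_llm(user_text)` is an assumed callable that returns the reply string.
    """
    reply = ask_llm(user_text)
    return build_alexa_response(reply)
```

Leaving `shouldEndSession` false keeps the session open for a follow-up turn, which suits a conversational assistant.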