Build Your Own Voice Assistant: Escape Alexa and Google

"Would you like to try Alexa Plus?"

No.

"Alexa Plus can help you with..."

No.

This happens about 80 times a day now. Amazon turned my Echo from a useful tool into a subscription salesperson. Every weather request, every timer, every simple question gets hijacked by an upsell. I'm done.

So I'm building my own. The code is working, the hardware is on the way, and I wanted to share what I've figured out so far. Turns out it's more approachable than I expected.

[Figure: voice assistant architecture with privacy-first design]

The setup: a Mac Mini that does everything

The target system runs on a Mac Mini that already lives in my office as a development server. It handles my AI coding workflows when I'm working remotely, so I can code from anywhere without worrying about compute power. Tacking voice assistant duties onto a machine that's already running 24/7 just makes sense.

Hardware (incoming):

  • Mac Mini M2 (already owned for dev work)
  • Omnidirectional USB microphone (picks up voice from anywhere in the room)
  • Omnidirectional speaker for responses

Right now I'm testing with a temporary mic and speakers plugged into the Mac Mini. The code works. Once the proper omnidirectional hardware arrives, it becomes a permanent fixture: one machine handling both AI development workflows and home assistant duties.

Architecture: modular by design

The system breaks into four independent components:

Microphone → Wake Word → Speech-to-Text → LLM + Tools → Text-to-Speech → Speaker

Each piece is swappable. Don't like OpenAI? Use Claude or a local model. Want different wake words? Train your own. Need new capabilities? Add a Python function.

[Figure: voice pipeline architecture diagram]

| Component | What It Does | Current Choice | Alternatives |
|---|---|---|---|
| Wake Word | Listens for trigger phrase | openWakeWord | Porcupine, train your own |
| Speech-to-Text | Converts voice to text | OpenAI Whisper | faster-whisper, whisper.cpp, Vosk |
| LLM | Understands and responds | GPT-4o | Claude, Ollama, Llama |
| Text-to-Speech | Speaks the response | OpenAI TTS | Piper (local), ElevenLabs |
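
To make "swappable" concrete, here is a minimal sketch of how the stages could sit behind plain callables. The names `Pipeline`, `Transcriber`, `Responder`, and `Synthesizer` are my own illustration, not part of any library, and the stand-in stages exist only so the example runs without a microphone or API key.

```python
from dataclasses import dataclass
from typing import Callable

# Each stage is just a callable, so any implementation with the same
# signature can be dropped in. These aliases are illustrative names.
Transcriber = Callable[[bytes], str]   # recorded audio -> text
Responder = Callable[[str], str]       # user text -> reply text
Synthesizer = Callable[[str], bytes]   # reply text -> audio


@dataclass
class Pipeline:
    transcribe: Transcriber
    respond: Responder
    synthesize: Synthesizer

    def handle(self, recorded_audio: bytes) -> bytes:
        """Run one command through STT -> LLM -> TTS."""
        text = self.transcribe(recorded_audio)
        reply = self.respond(text)
        return self.synthesize(reply)


# Stand-in stages: real ones would call Whisper, an LLM, and a TTS engine.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_respond(text: str) -> str:
    return f"You said: {text}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = Pipeline(fake_transcribe, fake_respond, fake_synthesize)
reply_audio = pipeline.handle(b"what time is it")
```

Swapping a component means passing a different function to `Pipeline`; nothing else in the flow has to change.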

Wake word detection: always listening, locally

The wake word detector runs entirely on-device. No audio leaves your machine until you say "Hey Jarvis" (or whatever phrase you pick).

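import numpy as np
from openwakeword.model import Model

class WakeWordDetector:
    def __init__(self, model_name: str = "hey_jarvis", threshold: float = 0.5):
        # Load the pre-trained model by name; inference runs on-device
        self.model = Model(wakeword_models=[model_name])
        self.threshold = threshold

    def detect(self, audio: np.ndarray) -> bool:
        # audio: one frame of 16 kHz, 16-bit mono samples
        # predict() returns a score per loaded model for this frame
        prediction = self.model.predict(audio.flatten())
        return any(score >= self.threshold for score in prediction.values())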

Available pre-trained wake words:

  • hey_jarvis (my current choice)
  • alexa (if you miss the old days)
  • hey_mycroft
  • hey_rhasspy

You can also train a custom wake word. I'm tempted to design a personality for my assistant with a unique name and train a wake word to match.

The key difference from commercial assistants: this runs locally. The audio stream never leaves your machine until wake word detection actually triggers a recording.
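
The gating idea can be sketched in a few lines: frames are thrown away until the detector fires, and only what comes after is handed on. This is a simplified illustration, not the project's actual loop. `frames_after_wake_word` and `fake_detect` are hypothetical names, and a real detector would score each frame with the wake word model instead of comparing bytes.

```python
from typing import Callable, Iterable, Iterator


def frames_after_wake_word(
    frames: Iterable[bytes],
    detect: Callable[[bytes], bool],
) -> Iterator[bytes]:
    """Drop frames until detect() fires, then hand back the rest of the stream."""
    stream = iter(frames)
    for frame in stream:
        if detect(frame):
            # Nothing before this point was stored or sent anywhere.
            return stream
    return iter(())  # stream ended without a wake word


# Stand-in detector: a real one would call the wake word model on each frame.
def fake_detect(frame: bytes) -> bool:
    return frame == b"WAKE"


mic = [b"tv noise", b"chatter", b"WAKE", b"what's the", b"weather"]
command_frames = list(frames_after_wake_word(mic, fake_detect))
```

Everything before the wake word is discarded as soon as it has been scored, which is what keeps background conversation off the network.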

Speech-to-text: record until silence

Once the wake word hits, the system records your command until you stop talking. Voice Activity Detection (VAD) handles that automatically:

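import webrtcvad

SAMPLE_RATE = 16000  # Hz; webrtcvad accepts 8000, 16000, 32000, or 48000
FRAME_MS = 30        # webrtcvad accepts 10, 20, or 30 ms frames

def record_until_silence(stream, vad, silence_duration=0.5):
    """Record until `silence_duration` seconds of continuous silence."""
    # Number of consecutive non-speech frames that ends the recording
    max_silent_frames = int(silence_duration * 1000 / FRAME_MS)
    frames = []
    silent_frames = 0

    while True:
        # Assumes stream.read() returns one FRAME_MS frame of 16-bit mono PCM bytes
        audio = stream.read()
        frames.append(audio)

        if vad.is_speech(audio, SAMPLE_RATE):
            silent_frames = 0
        else:
            silent_frames += 1

        if silent_frames >= max_silent_frames:
            break

    return b"".join(frames)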
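
One detail that is easy to trip over: webrtcvad only accepts 16-bit mono PCM in frames of exactly 10, 20, or 30 ms at 8, 16, 32, or 48 kHz. If your audio source hands you arbitrary chunk sizes, something has to slice them first. A minimal sketch, with `frame_size_bytes` and `split_frames` as hypothetical helper names:

```python
VALID_RATES = (8000, 16000, 32000, 48000)
VALID_FRAME_MS = (10, 20, 30)
BYTES_PER_SAMPLE = 2  # 16-bit mono PCM


def frame_size_bytes(sample_rate: int = 16000, frame_ms: int = 30) -> int:
    """Bytes in one VAD frame for the given rate and frame length."""
    if sample_rate not in VALID_RATES or frame_ms not in VALID_FRAME_MS:
        raise ValueError("unsupported sample rate or frame length")
    return sample_rate * frame_ms // 1000 * BYTES_PER_SAMPLE


def split_frames(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 30) -> list[bytes]:
    """Slice a PCM buffer into whole frames; a trailing partial frame is dropped."""
    size = frame_size_bytes(sample_rate, frame_ms)
    return [pcm[i:i + size] for i in range(0, len(pcm) - size + 1, size)]


frames = split_frames(bytes(2000))  # 2000 bytes of silence at 16 kHz
```

Each element of `frames` is then the right size to pass to `vad.is_speech(frame, 16000)`.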

The recorded audio goes to Whisper for transcription. OpenAI's API is fast and accurate, but there are solid local options if you want zero cloud dependency:

  • faster-whisper - CTranslate2-based reimplementation; GPU-accelerated on NVIDIA CUDA, CPU elsewhere
  • whisper.cpp - C/C++ port that runs on CPU almost anywhere and can use Apple Silicon GPUs via Metal
  • Vosk - Lightweight, real-time capable
  • DeepSpeech - Mozilla's open source option (no longer actively developed)

For a fully local setup, faster-whisper with the base model is accurate enough for short voice commands and keeps latency low.

The LLM: your brain, your choice

This is where it gets interesting. The LLM handles understanding your request and deciding what to do. Mine uses GPT-4o, but the architecture supports any model with tool calling:

def process_message(client, user_message, model="gpt-4o"):
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_message},
    ]

    while True:
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            tools=TOOLS,  # Registered capabilities
        )
        message = response.choices[0].message

        if not message.tool_calls:
            return message.content

        # The assistant turn that requested the tools must precede their results
        messages.append(message)

        # Execute tools and continue conversation
        for tool_call in message.tool_calls:
            # arguments is a JSON string produced by the model
            result = execute_tool(tool_call.function.name,
                                  tool_call.function.arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,  # ties the result to its request
                "content": result,
            })

Swap models easily:

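# Use Ollama locally through its OpenAI-compatible endpoint
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the SDK, ignored by Ollama
)
# Then pass a local model name, e.g. process_message(client, text, model="llama3.2")

# Use Claude instead
from anthropic import Anthropic
client = Anthropic()
# The Anthropic SDK uses client.messages.create rather than
# client.chat.completions.create, so process_message needs a small adapter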

The system prompt tells the LLM it's a voice assistant, so responses stay short and conversational instead of verbose and markdown-heavy.
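
For illustration, a prompt along these lines does the job. The wording is my own example rather than the exact prompt from the project, and `build_messages` is a hypothetical helper that mirrors the first lines of `process_message`.

```python
# Example wording only; tune it to your own assistant's personality.
SYSTEM_PROMPT = (
    "You are a voice assistant. Your replies are read aloud, so answer in "
    "one or two short, conversational sentences. Do not use markdown, "
    "bullet points, or code blocks."
)


def build_messages(user_message: str) -> list[dict[str, str]]:
    """Put the system prompt ahead of the transcribed user request."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_message},
    ]


messages = build_messages("what's the weather like?")
```

Telling the model that its output is spoken matters more than it sounds: a TTS engine will happily read asterisks and bullet markers out loud.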

Tools: the magic of Pydantic

This is the elegant part. Each capability is defined as a Pydantic model:

from pydantic import BaseModel, Field

class GetWeather(BaseModel):
    """Get the current weather for a location."""

    location: str | None = Field(
        default=None,
        description="City name (optional, defaults to current location)"
    )

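def get_weather(params: GetWeather) -> str:
    location = params.location or "your current location"
    # Call weather API here; placeholder values stand in for the real response
    temp, condition = 72, "clear skies"
    return f"It's currently {temp}°F in {location} with {condition}."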

That single class definition gives you:

  1. An LLM function schema, so the model knows how to call it
  2. A CLI interface, so you can test with uv run weather "Tokyo"
  3. Validation through type checking and constraints
  4. Documentation, since the docstring becomes the function description
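
To show what the first item means in practice, here is roughly the shape the model receives for `GetWeather` in the Chat Completions tool format. The parameters dict below is hand-written and simplified so the example has no dependencies; with Pydantic v2 it would come from `GetWeather.model_json_schema()`, whose output for an optional field is a bit more verbose. `to_openai_tool` is a hypothetical helper name.

```python
def to_openai_tool(name: str, description: str, parameters: dict) -> dict:
    """Wrap a JSON Schema in the Chat Completions tool format."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": parameters,
        },
    }


# Simplified stand-in for the JSON Schema of GetWeather (assumption:
# the real schema would be generated from the Pydantic model).
weather_parameters = {
    "type": "object",
    "properties": {
        "location": {
            "type": "string",
            "description": "City name (optional, defaults to current location)",
        }
    },
    "required": [],
}

weather_tool = to_openai_tool(
    "GetWeather",
    "Get the current weather for a location.",  # the class docstring
    weather_parameters,
)
TOOLS = [weather_tool]
```

The class name becomes the function name, the docstring becomes the description, and the field descriptions tell the model what each argument means.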

Adding a new capability is just writing a model and handler:

class ControlLight(BaseModel):
    """Turn a smart light on or off."""

    room: str = Field(description="Room name")
    state: bool = Field(description="True for on, False for off")
    brightness: int | None = Field(default=None, ge=0, le=100)

def control_light(params: ControlLight) -> str:
    # Call Home Assistant, Hue, or whatever
    return f"Turned {params.room} light {'on' if params.state else 'off'}"
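
The LLM loop earlier calls `execute_tool(name, arguments)` without showing it. One way it could work is a registry that maps tool names to an argument type and a handler. This is a sketch under assumptions: `REGISTRY` and `ControlLightArgs` are hypothetical names, and a stdlib dataclass stands in for the Pydantic model so the example runs without dependencies. With Pydantic v2 the parsing line would be `ControlLight.model_validate_json(arguments)`, which also enforces the `ge`/`le` constraints.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ControlLightArgs:
    """Dataclass stand-in for the ControlLight Pydantic model."""
    room: str
    state: bool
    brightness: Optional[int] = None


def control_light(params: ControlLightArgs) -> str:
    return f"Turned {params.room} light {'on' if params.state else 'off'}"


# tool name -> (argument type, handler)
REGISTRY: dict[str, tuple[type, Callable[..., str]]] = {
    "ControlLight": (ControlLightArgs, control_light),
}


def execute_tool(name: str, arguments: str) -> str:
    """Parse the model's JSON arguments and dispatch to the matching handler."""
    if name not in REGISTRY:
        return f"Unknown tool: {name}"
    args_type, handler = REGISTRY[name]
    try:
        params = args_type(**json.loads(arguments))
    except (TypeError, json.JSONDecodeError) as exc:
        # Report the problem back to the model instead of crashing the loop
        return f"Invalid arguments for {name}: {exc}"
    return handler(params)


result = execute_tool("ControlLight", '{"room": "kitchen", "state": true}')
```

Returning an error string rather than raising lets the model see what went wrong and retry with corrected arguments.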

Built-in tools include:

  • Weather (Open-Meteo, free)
  • News headlines (BBC API)
  • Web search (Perplexity)
  • Spotify control (play, pause, skip, volume)
  • System volume (macOS AppleScript)
  • Conversation history lookup

Text-to-speech: the response

Finally, the assistant speaks. OpenAI's TTS is fast and natural:

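def speak(client, text, voice="alloy"):
    response = client.audio.speech.create(
        model="tts-1",
        voice=voice,
        input=text,
        response_format="pcm",  # raw samples, no file header
    )
    # play_audio is a helper (not shown) that sends the raw PCM to the speakers
    play_audio(response.content)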

Voice options: alloy, echo, fable, onyx, nova, shimmer
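
A note on the `pcm` format: it is headerless audio, so a regular media player won't know what to do with it. As I understand OpenAI's documentation, the samples are 24 kHz, 16-bit, little-endian mono; treat those numbers as an assumption and check them against the current docs. If you want to save or debug a response, the standard library can wrap it in a WAV container. `pcm_to_wav` and `duration_seconds` are hypothetical helper names.

```python
import io
import wave

TTS_SAMPLE_RATE = 24_000  # assumed sample rate of the "pcm" response format
BYTES_PER_SAMPLE = 2      # 16-bit mono


def pcm_to_wav(pcm: bytes, sample_rate: int = TTS_SAMPLE_RATE) -> bytes:
    """Wrap raw 16-bit mono PCM in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(BYTES_PER_SAMPLE)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()


def duration_seconds(pcm: bytes, sample_rate: int = TTS_SAMPLE_RATE) -> float:
    """Playback length of a raw PCM buffer."""
    return len(pcm) / (BYTES_PER_SAMPLE * sample_rate)


one_second_of_silence = bytes(TTS_SAMPLE_RATE * BYTES_PER_SAMPLE)
wav_bytes = pcm_to_wav(one_second_of_silence)
```

Writing `wav_bytes` to a `.wav` file gives you something any player can open, which is handy when a response sounds wrong and you want to inspect it.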

For fully local operation, Piper TTS sounds great and runs on CPU with minimal latency.

My aspiration: design a custom personality for my assistant and clone a unique voice with ElevenLabs. Pair that with a custom wake word and you get a truly personal assistant instead of a generic one.

Running it

# Install
uv sync

# Download wake word models
uv run python -c "from openwakeword import utils; utils.download_models()"

# Configure
echo "OPENAI_API_KEY=your-key" > .env

# Run
uv run assistant

The assistant starts listening for "Hey Jarvis." Speak a command, and it responds through your speakers.

# Alternative modes
uv run assistant --repl          # Text mode (no audio)
uv run assistant "weather Tokyo" # One-shot query
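
The three modes map onto a small argument parser. This is a sketch of how the entry point could decide which mode to run, not the project's actual CLI code; `build_parser` and `select_mode` are hypothetical names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="assistant")
    parser.add_argument(
        "query",
        nargs="?",
        help="one-shot query; omit to listen for the wake word",
    )
    parser.add_argument("--repl", action="store_true", help="text mode (no audio)")
    return parser


def select_mode(argv: list[str]) -> str:
    """Return which mode the given command-line arguments ask for."""
    args = build_parser().parse_args(argv)
    if args.repl:
        return "repl"
    if args.query is not None:
        return "one-shot"
    return "voice"


mode = select_mode(["weather Tokyo"])
```

The text modes are worth having even if you only care about voice, because they let you test tools and prompts without talking to your computer.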

Going fully local

Want zero cloud dependency? Every component has a local alternative:

| Component | Cloud | Local Alternative |
|---|---|---|
| Wake Word | - | openWakeWord (already local) |
| STT | Whisper API | faster-whisper, whisper.cpp, Vosk |
| LLM | GPT-4o | Ollama + Llama 3.2 |
| TTS | OpenAI TTS | Piper |

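With faster-whisper for transcription, Ollama running Llama 3.2 for the brain, and Piper for speech, the whole voice pipeline stays on your machine. Responses are slightly slower, but no audio or transcripts leave your network. The only outbound traffic comes from tools that call external services, such as weather, news, search, or Spotify.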

Why this beats commercial assistants

| Feature | Alexa/Google | DIY Assistant |
|---|---|---|
| Wake word runs locally | Claimed | Verified |
| Audio sent to cloud | Always | Only after wake word |
| Choose your LLM | No | Yes |
| Add custom tools | Complex skills | Python function |
| Runs offline | Limited | Full (with local stack) |
| Data collection | Extensive | None |
| Cost | Device + privacy | API costs (~$0.01/query) |
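
To put the per-query figure in perspective, here is the back-of-the-envelope arithmetic. The $0.01 comes from the table above and is a rough estimate; the 30 queries per day is a made-up usage level for illustration, and the function names are hypothetical.

```python
COST_PER_QUERY_USD = 0.01  # rough per-query figure from the table above


def monthly_cost(queries_per_day: int, days: int = 30) -> float:
    """Estimated API spend for a month of use, in dollars."""
    return round(queries_per_day * days * COST_PER_QUERY_USD, 2)


def break_even_queries_per_day(subscription_usd: float, days: int = 30) -> float:
    """Daily queries at which API costs match a flat monthly fee."""
    return subscription_usd / (COST_PER_QUERY_USD * days)


household = monthly_cost(30)  # 30 queries a day is an assumed usage level
```

At that assumed usage the bill is about nine dollars a month, and the real number depends on which models you pick and how long the responses are.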

The Mac Mini setup has another benefit: it's already my remote development server. When I'm coding from my laptop at a coffee shop, I SSH into it for heavy AI workloads. The voice assistant becomes just another daemon running in the background.

What I'm testing

Working well:

  • "Hey Jarvis, what's the weather?"
  • "Hey Jarvis, play some jazz" (Spotify control)
  • "Hey Jarvis, what's in the news?"
  • "Hey Jarvis, search for Python 3.13 release notes"
  • "Hey Jarvis, set volume to 40"
  • "Hey Jarvis, what did I ask you about yesterday?" (history lookup)

On the roadmap:

  • Home Assistant integration for lights and thermostat
  • Calendar queries
  • Timer and reminder management
  • Omnidirectional mic and speaker hardware
  • Custom wake word and personality
  • Custom voice via ElevenLabs

The privacy difference

Commercial assistants optimize for data collection. Their business model depends on knowing everything about you. That's why they're "free" (after you buy the device). And now Amazon wants $20/month for Alexa Plus on top of that.

This system optimizes for functionality. It only processes what you explicitly ask, only when you trigger it, and you control where the data goes.

No subscription nagging. No upsells. Just a voice assistant that does what you ask.

Getting started

If you want to build your own, here are the key libraries:

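Core dependencies:

  • openwakeword (wake word detection)
  • webrtcvad (voice activity detection)
  • openai (Whisper transcription, GPT-4o, TTS)
  • pydantic (tool definitions)

For fully local:

  • faster-whisper (speech-to-text)
  • Ollama (LLM)
  • Piper (text-to-speech)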

The architecture is straightforward. Wake word detection loops until triggered; the system then records until silence, transcribes the audio, sends the text to the LLM with tools, and speaks the response. Each piece is independent and swappable.

The Pydantic tool pattern is the standout feature. Once it clicks, adding any capability becomes trivial. Weather, lights, music, calendars, home automation, all just a model and a handler function.

I'll share more as the hardware arrives and the system matures. For now, the core is solid and the Alexa can stay unplugged.

No more subscription pitches. Just answers.

Matthew Fontana
About the author

Staff Engineer at Airbnb · ex-Spotify, ex-UPS · 13 yrs in enterprise software

I build agentic developer platforms inside large engineering orgs, and I'm available to build them inside yours.