Session 5 — Build a voice-to-voice AI News Anchor using Pipecat. All services run locally. Sub-2-second latency. Zero API fees.
Every session so far used keyboards and screens. Session 5 is different — you speak and the AI speaks back.
An open-source Python framework for building voice and multimodal AI agents. Think of it as Express.js but for voice pipelines.
```python
# Pipecat — Voice AI in ~20 lines
from pipecat.pipeline import Pipeline
from pipecat.services import (
    WhisperSTT,   # Speech → Text
    OllamaLLM,    # Local LLM
    KokoroTTS,    # Text → Speech
)

# Build the pipeline
pipeline = Pipeline([
    WhisperSTT(model="base"),
    OllamaLLM(model="llama3.2"),
    KokoroTTS(voice="af_heart"),
])

# Run — that's it!
pipeline.run()
```
Your voice flows through three services — each running locally on your AI PC. No cloud round-trips.
Each service runs as a separate process. Pipecat orchestrates data flow between them.
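The orchestration idea is simple: each stage consumes the previous stage's output. Here is a toy sketch of that pattern (the class and function names are illustrative stand-ins, not Pipecat's actual API):

```python
from typing import Callable, List

class ToyPipeline:
    """Toy pipeline: each stage's output feeds the next stage."""
    def __init__(self, stages: List[Callable[[str], str]]):
        self.stages = stages

    def run(self, frame: str) -> str:
        for stage in self.stages:
            frame = stage(frame)
        return frame

# Illustrative stand-ins for the three local services
stt = lambda audio: f"transcript({audio})"   # speech → text
llm = lambda text: f"reply({text})"          # text → text
tts = lambda text: f"audio({text})"          # text → speech

pipeline = ToyPipeline([stt, llm, tts])
print(pipeline.run("mic-input"))  # audio(reply(transcript(mic-input)))
```

In the real framework each service also streams partial results downstream instead of waiting for a complete frame, which is what makes real-time conversation possible.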
You speak → AI thinks → AI speaks back. The entire round trip happens locally in under 2 seconds. Cloud round-trips typically multiply that latency by 3-5x.
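A rough latency budget makes the sub-2-second claim concrete. The per-stage numbers below are assumptions for illustration, not measurements; the key point is that streaming lets stages overlap, so what the user perceives is time-to-first-audio, not the sum of full-stage processing times:

```python
# Assumed (not measured) per-stage budget in milliseconds.
budget_ms = {
    "STT (streaming partial transcripts)": 300,
    "LLM time-to-first-token": 500,
    "TTS time-to-first-audio": 200,
    "audio transport and buffering": 100,
}
time_to_first_reply = sum(budget_ms.values())
print(f"≈ {time_to_first_reply} ms to first spoken word")  # ≈ 1100 ms
```

Even with generous headroom on every stage, a local pipeline stays well inside the 2-second budget; adding a 200-500 ms network round trip per stage is what pushes cloud setups past it.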
Every major cloud voice AI service charges per minute of audio or per character of text. Running locally? Unlimited. Forever. Free.
```python
# Cloud costs for 1 hour of voice AI:
#   Google Cloud Speech:  ₹2.00/min
#   OpenAI Whisper API:   ₹0.50/min
#   Cloud TTS:            ₹330-1300 per 1M chars
#   GPT-4 API:            ₹2.50 per 1K tokens
#
# 1 hour of voice conversation ≈
#   Cloud total: ~₹160-400 per hour
#   Local total: ₹0. Always. 🎉
```
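The arithmetic behind an hourly estimate looks like this. The per-unit rates are the ones quoted above; the usage volumes (characters spoken, tokens processed) are assumptions for illustration, so the totals are rough and heavier usage pushes them toward the upper bound:

```python
# Per-unit rates from the slide
stt_rate_low, stt_rate_high = 0.50, 2.00   # ₹/min (Whisper API, Google Cloud Speech)
tts_rate_low, tts_rate_high = 330, 1300    # ₹ per 1M characters
llm_rate = 2.50                            # ₹ per 1K tokens

# Assumed volumes for one hour of conversation (illustrative)
minutes = 60
tts_chars = 60_000    # ~1,000 spoken characters per minute
llm_tokens = 30_000   # tokens processed across the hour

low = stt_rate_low * minutes + tts_rate_low * tts_chars / 1e6 + llm_rate * llm_tokens / 1000
high = stt_rate_high * minutes + tts_rate_high * tts_chars / 1e6 + llm_rate * llm_tokens / 1000
print(f"cloud: ₹{low:.0f}-₹{high:.0f}/hour   local: ₹0")
```

The exact figure depends on how chatty the session is, but every extra minute of cloud conversation costs real money, while the local pipeline's marginal cost stays at zero.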
Speak naturally. The AI fetches real-time news and talks back. Try these conversations:
The same Pipecat pipeline pattern works for any voice agent — just swap the LLM prompt and data source.
Real-time news briefings via voice
Voice-based patient triage and reminders
Interactive voice tutors for students
Private voice assistant — no Alexa
AI phone agents for businesses
Voice UI for visually impaired users
In-car voice assistants
AI-powered interactive characters
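Every use case above keeps the same STT → LLM → TTS shape; only two things change: the system prompt and where the context comes from. A hedged sketch of that configuration pattern (all names here are illustrative, not Pipecat's API):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class VoiceAgentConfig:
    system_prompt: str
    fetch_context: Callable[[], str]  # news feed, patient DB, lesson plan, etc.

def build_llm_input(cfg: VoiceAgentConfig, user_speech: str) -> str:
    """Compose the text the LLM stage would receive for one turn."""
    return f"{cfg.system_prompt}\nContext: {cfg.fetch_context()}\nUser: {user_speech}"

news_anchor = VoiceAgentConfig(
    system_prompt="You are a concise news anchor.",
    fetch_context=lambda: "headlines: (fetched from a news API)",
)
voice_tutor = VoiceAgentConfig(
    system_prompt="You are a patient maths tutor.",
    fetch_context=lambda: "lesson: fractions",
)

print(build_llm_input(news_anchor, "What's in the news?"))
```

Swapping `news_anchor` for `voice_tutor` turns the news anchor into a tutor without touching the STT, LLM, or TTS stages.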
A complete voice AI agent — built with open-source tools, running locally, with sub-2s latency.
Pipeline architecture for voice AI agents
Three-service pipeline, all running locally
Streaming + local = real-time conversation
Your voice never leaves your device