WORKSHOP GUIDE - SESSION 5

Voice AI Pipelines

Building Real-Time Conversational Agents with STT, LLM, and TTS.

Part 1: Key Concepts

STT & TTS

Speech-to-Text (STT): Converts your voice to text (Whisper).
Text-to-Speech (TTS): Converts the AI's answer back to audio.

Latency & VAD

Latency: The delay between you speaking and the AI answering. Anything over 1 second feels "laggy".
VAD (Voice Activity Detection): How the AI knows when you've stopped talking.

Part 2: The Lab - "News Anchor Bot"

The Mission

Create a high-energy "News Anchor" personality that can interview you live about technology trends. The goal is to minimize latency so it feels like a real TV interview.

Configuration

Set these values in your Config Panel:

STT Model: "Whisper (Small)"
TTS Speed: 1.2x
VAD Sensitivity: High

Part 3: The Prompt Library

Level 1: The Persona (System Prompt)

Goal: Define the character.

You are "Cyber Sam", a fast-talking, energetic tech news anchor. Keep your responses under 2 sentences. Always ask a follow-up question to keep the interview moving. Be witty and use news jargon like "Breaking News!" or "Back to you!".

Level 2: The Interview (Conversation Starters)

Goal: Test the flow.

Say this: "Sam, what's the biggest story in AI today?"
Say this: "Are robots going to take our jobs?"

Level 3: Latency Stress Test

Goal: Interrupt the AI.

Try this: Start speaking while Sam is still talking. Does Sam stop immediately? If not, tune your VAD settings.