Overview
Pipecat uses user turn strategies to determine when user turns start and end. These strategies can use different techniques.

For detecting turn start:
- Voice Activity Detection (VAD): triggers when speech is detected
- Transcription-based (fallback): triggers when transcription is received but VAD didn’t detect speech
- Minimum words: waits for a minimum number of spoken words before triggering

For detecting turn end:
- Transcription-based: analyzes transcription to determine when the user is done
- Turn detection model: uses AI to understand if the user has finished their thought
Voice Activity Detection (VAD)
What VAD Does
VAD is responsible for detecting when a user starts and stops speaking. Pipecat uses Silero VAD, an open-source model that runs locally on CPU with minimal overhead.

Performance characteristics:
- Processes 30+ ms audio chunks in less than 1 ms
- Runs on a single CPU thread
- Minimal system resource impact
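To sanity-check those numbers on your own hardware, you can time the model directly. A rough benchmark sketch, assuming the standalone silero-vad package (Pipecat bundles the same model); install with `pip install silero-vad`, which pulls in torch:

```python
import time

import torch
from silero_vad import load_silero_vad

model = load_silero_vad()
chunk = torch.zeros(512)  # 512 samples = 32 ms of 16 kHz audio
model(chunk, 16000)       # warm-up call before timing

start = time.perf_counter()
for _ in range(100):
    model(chunk, 16000)   # returns a speech probability for the chunk
elapsed_ms = (time.perf_counter() - start) / 100 * 1000
print(f"avg inference per 32 ms chunk: {elapsed_ms:.2f} ms")
```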
VAD Configuration
VAD is configured through VADParams in your transport setup:
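A minimal sketch using the Daily transport; any transport that accepts a vad_analyzer works the same way, and the environment variable names here are placeholders:

```python
import os

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.transports.services.daily import DailyParams, DailyTransport

transport = DailyTransport(
    os.environ["DAILY_ROOM_URL"],  # placeholder: your Daily room URL
    os.environ["DAILY_TOKEN"],     # placeholder: your Daily token
    "Voice Bot",
    DailyParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
        vad_analyzer=SileroVADAnalyzer(
            params=VADParams(
                confidence=0.7,  # library default at the time of writing
                start_secs=0.2,  # speech must persist this long before a turn starts
                stop_secs=0.8,   # silence required before speech counts as stopped
                min_volume=0.6,  # library default at the time of writing
            )
        ),
    ),
)
```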
Key Parameters
start_secs (default: 0.2)
- How long a user must speak before VAD confirms speech has started
- Lower values = more responsive, but may trigger on brief sounds
- Higher values = less sensitive, but may miss quick utterances, like “yes”, “no”, or “ok”
stop_secs (default: 0.8)
- How much silence must be detected before confirming speech has stopped
- Critical for turn-taking behavior
- Modified automatically when using turn detection
confidence and min_volume
- Generally work well with defaults
- Only adjust after extensive testing with your specific audio conditions
User Turn Detection
While VAD detects speech vs. silence, it can’t understand linguistic context. A pause doesn’t mean the user is done. User turn strategies interpret VAD signals and transcriptions to determine actual turn boundaries.

How It Works
- Turn Start: When VAD detects speech (or transcription arrives), the start strategy emits UserStartedSpeakingFrame and optionally triggers an interruption
- Turn End: When the stop strategy determines the user is done, it emits UserStoppedSpeakingFrame
VAD also emits its own frames (VADUserStartedSpeakingFrame, VADUserStoppedSpeakingFrame), which indicate raw speech/silence detection. These are inputs to the turn strategies, not the final turn decisions.
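To see the distinction in practice, you can drop a small custom processor into your pipeline that logs both kinds of frames. A minimal sketch; the TurnFrameLogger class is ours, while the frame names and FrameProcessor base class follow Pipecat’s public API:

```python
from pipecat.frames.frames import (
    Frame,
    UserStartedSpeakingFrame,
    UserStoppedSpeakingFrame,
    VADUserStartedSpeakingFrame,
    VADUserStoppedSpeakingFrame,
)
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class TurnFrameLogger(FrameProcessor):
    """Logs raw VAD frames alongside the final turn decisions."""

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        # Check the VAD frames first: they are the raw speech/silence signals.
        if isinstance(frame, VADUserStartedSpeakingFrame):
            print("VAD: speech detected")
        elif isinstance(frame, VADUserStoppedSpeakingFrame):
            print("VAD: silence detected")
        # These frames are the turn strategies' actual decisions.
        elif isinstance(frame, UserStartedSpeakingFrame):
            print("Turn: user started speaking")
        elif isinstance(frame, UserStoppedSpeakingFrame):
            print("Turn: user stopped speaking")
        await self.push_frame(frame, direction)  # pass every frame downstream
```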
Detecting Turn End

Turn end detection uses transcription to understand what the user said:
- Transcription-based (default): Analyzes transcription to determine when the user has finished speaking.
- Turn detection model: Uses an AI model to judge whether the user has finished their thought. When enabled, stop_secs is automatically lowered to 0.2 so the model can analyze speech quickly (sketched below).
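If you opt into a turn detection model, it is typically attached to the transport alongside the VAD analyzer. A rough sketch assuming the local Smart Turn analyzer from a recent Pipecat release; the module path, class name, and constructor arguments vary across versions, so treat this as illustrative rather than definitive:

```python
from pipecat.audio.turn.smart_turn.local_smart_turn_v2 import LocalSmartTurnAnalyzerV2
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.transports.services.daily import DailyParams

params = DailyParams(
    audio_in_enabled=True,
    audio_out_enabled=True,
    # Raw speech/silence detection; stop_secs stays low so the turn model
    # gets candidate segments quickly (recent releases adjust this for you).
    vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
    # The turn model decides whether the user actually finished their thought.
    # Some releases take a smart_turn_model_path argument instead of
    # downloading weights on first use.
    turn_analyzer=LocalSmartTurnAnalyzerV2(),
)
```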
For more detail, see:
- User Turn Strategies: complete reference for start and stop strategies
- Smart Turn Overview: Smart Turn model implementation guide
Interruptions
Interruptions stop the bot when the user starts speaking. This is controlled by the enable_interruptions parameter on start strategies (enabled by default).
When a user turn starts with interruptions enabled:
- Bot immediately stops speaking
- Pending audio and text are cleared
- Pipeline ready for new user input
Keep interruptions enabled (the default) for natural conversations: it lets users interrupt the bot mid-sentence, just as in human conversation.
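Where exactly this flag lives depends on your Pipecat version: the start-strategy enable_interruptions parameter described above is the newer mechanism, while releases that configure interruptions at the pipeline level use allow_interruptions. A minimal sketch under the latter assumption:

```python
from pipecat.pipeline.task import PipelineParams, PipelineTask

# `pipeline` is your assembled Pipeline (transport input, STT, LLM, TTS,
# transport output) and is assumed to exist already.
task = PipelineTask(pipeline, params=PipelineParams(allow_interruptions=True))
```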
Best Practices
Optimal Configuration
For most voice AI use cases:
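A plausible starting point rather than a prescription; the values mirror the defaults discussed above, and you should tune them against real traffic:

```python
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams

vad_analyzer = SileroVADAnalyzer(
    params=VADParams(
        start_secs=0.2,  # responsive turn starts without triggering on brief noise
        stop_secs=0.8,   # patient enough for natural mid-sentence pauses
    )
)
```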
Performance Considerations
- Use local VAD: 150-200ms faster than remote VAD services
- Tune for your use case: Test with real audio conditions
- Monitor CPU usage: VAD adds minimal overhead but monitor in production
- Consider turn detection: Improves conversation quality but adds complexity
Key Takeaways
- VAD detects speech activity but turn detection understands conversation context
- Configuration affects user experience - tune parameters for your specific use case
- System frames coordinate behavior - enabling interruptions and natural turn-taking
- Local processing is faster - Silero VAD provides low-latency speech detection
- Turn detection improves quality - but requires careful VAD configuration
What’s Next
Now that you understand how speech input is detected and processed, let’s explore how that audio gets converted to text through speech recognition.

Speech to Text
Learn how to configure speech recognition in your voice AI pipeline