Overview
SpeechmaticsSTTService enables real-time speech transcription using Speechmatics’ WebSocket API with partial and final results, speaker diarization, and end of utterance detection (VAD) for comprehensive conversation analysis.
Since Speechmatics provides its own user turn start and end detection, you should use ExternalUserTurnStrategies to let Speechmatics handle turn management. See User Turn Strategies for configuration details.
Speechmatics STT API Reference
Pipecat’s API methods for Speechmatics STT integration
Example Implementation
Complete example with interruption handling
Speechmatics Documentation
Official Speechmatics documentation and features
Speaker Diarization Guide
Learn about separating different speakers in audio
Installation
To use Speechmatics services, install Pipecat with the Speechmatics extra, for example `pip install "pipecat-ai[speechmatics]"`.
Prerequisites
Speechmatics Account Setup
Before using Speechmatics STT services, you need:
- Speechmatics Account: Sign up at Speechmatics
- API Key: Generate an API key from your account dashboard
- Feature Selection: Configure transcription features like speaker diarization
Select Endpoint
Speechmatics STT supports the following endpoints (defaults to EU2):
| Region | Environment | STT Endpoint | Access |
|---|---|---|---|
| EU | EU1 | wss://neu.rt.speechmatics.com/ | Self-Service / Enterprise |
| EU | EU2 (Default) | wss://eu2.rt.speechmatics.com/ | Self-Service / Enterprise |
| US | US1 | wss://wus.rt.speechmatics.com/ | Enterprise |
Required Environment Variables
- SPEECHMATICS_API_KEY: Your Speechmatics API key for authentication
- SPEECHMATICS_RT_URL: Speechmatics endpoint URL (optional, defaults to EU2)
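As a minimal sketch of how these two variables might be read, the snippet below resolves the API key and endpoint and falls back to the EU2 URL from the endpoint table when SPEECHMATICS_RT_URL is unset. The helper name `load_speechmatics_settings` and the returned dictionary keys are illustrative assumptions, not part of Pipecat or the Speechmatics SDK.

```python
import os

# EU2 endpoint, copied from the endpoint table above.
DEFAULT_RT_URL = "wss://eu2.rt.speechmatics.com/"


def load_speechmatics_settings(env=None):
    """Return the API key and endpoint URL (hypothetical helper)."""
    env = os.environ if env is None else env
    api_key = env.get("SPEECHMATICS_API_KEY")
    if not api_key:
        raise RuntimeError("SPEECHMATICS_API_KEY is not set")
    # SPEECHMATICS_RT_URL is optional; use EU2 when it is absent or empty.
    url = env.get("SPEECHMATICS_RT_URL") or DEFAULT_RT_URL
    return {"api_key": api_key, "url": url}
```

Passing a plain dictionary instead of relying on `os.environ` keeps the helper easy to exercise in tests.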
End of Turn detection
The Speechmatics STT service supports Pipecat’s own end of turn detection (Silero VAD and Smart Turn) without any additional configuration. When using Pipecat’s features, the turn_detection_mode must be set to TurnDetectionMode.EXTERNAL (which is the default).
Default mode
By default, Speechmatics uses signals from Pipecat’s VAD / smart turn detection as input to trigger the end of turn and finalization of the current transcript segment. Pipecat’s voice activity detection and turn detection therefore work together with Speechmatics’ real-time processing.
If you wish to use features such as focusing on or ignoring other speakers, you may benefit from using the TurnDetectionMode.ADAPTIVE or TurnDetectionMode.SMART_TURN modes.
Adaptive End of Turn detection
This mode looks at the content of the speech, the pace of speaking and other acoustic information (using VAD) to determine when the user has finished speaking. This is especially important when using the plugin’s ability to focus on a specific speaker so that other speakers do not interrupt the agent / conversation. To use this mode, set the turn_detection_mode to TurnDetectionMode.ADAPTIVE in your STT configuration. You must also remove any other VAD / smart turn features within Pipecat to avoid a conflict.
Smart Turn detection
In addition to ADAPTIVE, Speechmatics also provides its own smart turn detection, which combines VAD with Smart Turn v3 from Pipecat. Enable it by setting the turn_detection_mode parameter to TurnDetectionMode.SMART_TURN.
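The rule running through the three modes is that Pipecat's own VAD / smart turn features should only be active when the mode is EXTERNAL. The sketch below encodes that rule as a small check. `TurnMode` is a hypothetical stand-in for the three TurnDetectionMode values named above (the string values are invented for illustration), and `conflicts_with_pipecat_vad` is not a Pipecat function.

```python
from enum import Enum


class TurnMode(Enum):
    """Hypothetical stand-in for the TurnDetectionMode values named above."""
    EXTERNAL = "external"
    ADAPTIVE = "adaptive"
    SMART_TURN = "smart_turn"


def conflicts_with_pipecat_vad(mode, pipecat_vad_enabled):
    """True when Pipecat's VAD / smart turn is on but the mode is not EXTERNAL."""
    return pipecat_vad_enabled and mode is not TurnMode.EXTERNAL
```

A check like this could run at startup to catch a configuration where both Pipecat and Speechmatics try to decide when the turn ends.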
Speaker Diarization
Speechmatics STT supports speaker diarization, which separates out different speakers in the audio. The identity of each speaker is returned in the TranscriptionFrame objects in the user_id attribute.
If speaker_active_format or speaker_passive_format is provided, the text output of the TranscriptionFrame is formatted to that specification. You can then describe the format in your system context so that the LLM can tell which speaker said which words. The passive format is optional; it applies when the engine has been told to focus on specific speakers, in which case all other speakers are formatted with speaker_passive_format.
- speaker_active_format -> the formatter for active speakers
- speaker_passive_format -> the formatter for passive / background speakers
Examples:
- <{speaker_id}>{text}</{speaker_id}> -> <S1>Good morning.</S1>
- @{speaker_id}: {text} -> @S1: Good morning.
Available attributes
| Attribute | Description | Example |
|---|---|---|
| speaker_id | The label of the speaker | S1 |
| text / content | The transcribed text | Good morning. |
| ts | The timestamp of the transcription | 2025-09-15T19:47:29.096+00:00 |
| start_time | The start time of the transcription segment | 0.0 |
| end_time | The end time of the transcription segment | 2.5 |
| lang | The language of the transcription segment | en |
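To see how the two example templates expand, the sketch below fills them with the sample attribute values from the table. It assumes the placeholders behave like Python `str.format` fields, which matches the examples shown but is an assumption about how the service substitutes them. The names `TAG_FORMAT`, `AT_FORMAT` and `render_segment` are illustrative only.

```python
# The two example templates from the section above.
TAG_FORMAT = "<{speaker_id}>{text}</{speaker_id}>"
AT_FORMAT = "@{speaker_id}: {text}"


def render_segment(template, **attributes):
    """Fill a speaker format template with segment attributes.

    Attributes the template does not mention are simply ignored.
    """
    return template.format(**attributes)


# Sample values taken from the attribute table.
segment = {
    "speaker_id": "S1",
    "text": "Good morning.",
    "lang": "en",
    "start_time": 0.0,
    "end_time": 2.5,
}

print(render_segment(TAG_FORMAT, **segment))  # <S1>Good morning.</S1>
print(render_segment(AT_FORMAT, **segment))   # @S1: Good morning.
```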
Speaker Lock
In conjunction with speaker diarization, you can decide at the start of or during a conversation to focus on a specific speaker, ignore or retain words from other speakers, or ignore one or more speakers altogether. In a configuration that focuses on S1 and ignores S2, the following will happen:
- S1 will be transcribed as normal and drive the end of turn and the conversation flow
- S2 will be ignored completely
- All other speakers’ words will be transcribed and emitted as tagged segments, but ONLY when a speaker in focus also speaks
For example, if S3 says “Hello”, the transcription is not emitted until S1 speaks again.
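The focus / ignore / retain behaviour described above can be modelled in a few lines of plain Python. This is only a simulation of the documented behaviour, not Pipecat's or Speechmatics' implementation: the function name `emit_segments`, the tuple representation of segments, and the choice to place retained words ahead of the focused speaker's words in each batch are all assumptions.

```python
def emit_segments(segments, focus=("S1",), ignore=("S2",)):
    """Simulate which segments are emitted, and when.

    `segments` is a list of (speaker_id, text) tuples in spoken order.
    Returns a list of batches; each batch is emitted at the moment a
    focused speaker talks.
    """
    emitted = []
    pending = []  # retained words from speakers that are not in focus
    for speaker, text in segments:
        if speaker in ignore:
            continue  # ignored speakers are dropped entirely
        if speaker in focus:
            # A focused speaker flushes anything retained so far.
            emitted.append(pending + [(speaker, text)])
            pending = []
        else:
            pending.append((speaker, text))
    return emitted


batches = emit_segments([("S3", "Hello"), ("S2", "Ignore me"), ("S1", "Hi there")])
print(batches)  # [[('S3', 'Hello'), ('S1', 'Hi there')]]
```

Here the "Hello" from S3 is held back until S1 speaks, and the words from S2 never appear.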
Language Support
Refer to the Speechmatics docs for more information on supported languages.
Set the language with the language parameter when creating the STT object. The exception to this is English / Mandarin, which has the code cmn_en.
| Language Code | Description | Locales |
|---|---|---|
| Language.AR | Arabic | - |
| Language.BA | Bashkir | - |
| Language.EU | Basque | - |
| Language.BE | Belarusian | - |
| Language.BG | Bulgarian | - |
| Language.BN | Bengali | - |
| Language.YUE | Cantonese | - |
| Language.CA | Catalan | - |
| Language.HR | Croatian | - |
| Language.CS | Czech | - |
| Language.DA | Danish | - |
| Language.NL | Dutch | - |
| Language.EN | English | en-US, en-GB, en-AU |
| Language.EO | Esperanto | - |
| Language.ET | Estonian | - |
| Language.FA | Persian | - |
| Language.FI | Finnish | - |
| Language.FR | French | - |
| Language.GL | Galician | - |
| Language.DE | German | - |
| Language.EL | Greek | - |
| Language.HE | Hebrew | - |
| Language.HI | Hindi | - |
| Language.HU | Hungarian | - |
| Language.IA | Interlingua | - |
| Language.IT | Italian | - |
| Language.ID | Indonesian | - |
| Language.GA | Irish | - |
| Language.JA | Japanese | - |
| Language.KO | Korean | - |
| Language.LV | Latvian | - |
| Language.LT | Lithuanian | - |
| Language.MS | Malay | - |
| Language.MT | Maltese | - |
| Language.CMN | Mandarin | cmn-Hans, cmn-Hant |
| Language.MR | Marathi | - |
| Language.MN | Mongolian | - |
| Language.NO | Norwegian | - |
| Language.PL | Polish | - |
| Language.PT | Portuguese | - |
| Language.RO | Romanian | - |
| Language.RU | Russian | - |
| Language.SK | Slovakian | - |
| Language.SL | Slovenian | - |
| Language.ES | Spanish | - |
| Language.SV | Swedish | - |
| Language.SW | Swahili | - |
| Language.TA | Tamil | - |
| Language.TH | Thai | - |
| Language.TR | Turkish | - |
| Language.UG | Uyghur | - |
| Language.UK | Ukrainian | - |
| Language.UR | Urdu | - |
| Language.VI | Vietnamese | - |
| Language.CY | Welsh | - |
Bilingual language packs are selected with the language and domain parameters as follows:
| Language Code | Description | Domain Options |
|---|---|---|
| cmn_en | English / Mandarin | - |
| en_ms | English / Malay | - |
| Language.ES | English / Spanish | bilingual-en |
| en_ta | English / Tamil | - |
Usage Examples
Examples are included in the Pipecat project:
- Using Speechmatics STT service -> 07a-interruptible-speechmatics.py
- Using Speechmatics STT service with VAD -> 07a-interruptible-speechmatics-vad.py
- Transcribing with Speechmatics STT -> 13h-speechmatics-transcription.py
Basic Configuration
Initialize the SpeechmaticsSTTService and use it in a pipeline:
With Diarization
This enables diarization and forwards text to the LLM only when the first speaker (S1) speaks. Words from other speakers are transcribed but only sent when the first speaker speaks. With the TurnDetectionMode.ADAPTIVE or TurnDetectionMode.SMART_TURN options, speaker diarization is used to determine when a speaker is speaking. You will need to disable VAD options within the selected transport object for this to work correctly (see 07b-interruptible-speechmatics-vad.py as an example).
Initialize the SpeechmaticsSTTService and use it in a pipeline:
Additional Notes
- Connection Management: Automatically handles WebSocket connections and reconnections
- Sample Rate: The default sample rate is 16000 in pcm_s16le format
- VAD Integration: Optionally supports Speechmatics’ built-in VAD and end of utterance detection
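For a sense of what the default audio format means in bytes, the sketch below computes the size of a chunk of pcm_s16le audio at 16000 Hz and packs float samples into that format. The 20 ms chunk length and mono audio are assumptions chosen for illustration, and both helper names are hypothetical.

```python
import struct

SAMPLE_RATE = 16000    # Hz, the default noted above
BYTES_PER_SAMPLE = 2   # pcm_s16le: signed 16-bit little-endian


def chunk_size_bytes(chunk_ms, sample_rate=SAMPLE_RATE, channels=1):
    """Bytes needed for `chunk_ms` milliseconds of pcm_s16le audio."""
    return sample_rate * chunk_ms // 1000 * BYTES_PER_SAMPLE * channels


def floats_to_pcm_s16le(samples):
    """Pack floats in [-1.0, 1.0] as little-endian signed 16-bit PCM."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(s * 32767) for s in clipped]
    return struct.pack("<%dh" % len(ints), *ints)


print(chunk_size_bytes(20))                          # 640
print(floats_to_pcm_s16le([0.0, 1.0, -1.0]).hex())   # 0000ff7f0180
```

At these defaults one second of mono audio is 32000 bytes, which is useful when sizing buffers for the WebSocket stream.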