ElevenLabs, the highly valued AI voice cloning and generation startup from Palantir alumni, today launched Scribe v1, a new speech-to-text model that reportedly achieves the highest accuracy across multiple languages. Users can try it on the ElevenLabs website.
According to the company’s benchmarks, it outperforms Google’s Gemini 2.0 Flash, OpenAI’s Whisper v3, and Deepgram Nova-3 at accurately converting speech into text on the web, reaching new record-low error rates.
The company claims that Scribe delivers state-of-the-art transcription accuracy in 99 languages, including improved performance in previously underserved languages such as Serbian, Cantonese, and Malayalam.
As Flavio Schneider, lead researcher at ElevenLabs, wrote on X, Scribe is the “smartest audio understanding model” ElevenLabs has released yet.
“Scribe doesn’t just transcribe — it understands audio,” Schneider continued in a threaded reply. “It can detect non-verbal events (like laughter, sound effects, music, and background noise) and analyze long audio contexts for accurate diarization, even in the most challenging environments.”
“Diarization” is the name given to the process of separating the speakers on a recording by their vocal qualities.
In fact, ElevenLabs’ documentation states that Scribe can distinguish and isolate up to 32 different speakers in the same audio file.
While ElevenLabs cautions that Scribe is “best used for when high-accuracy transcription is required rather than real-time transcription,” the company also plans to introduce a low-latency version soon, expanding its use to real-time applications.
Lowest word error rates (WER)
Scribe is designed to handle real-world audio challenges with precision. According to benchmark results from FLEURS and Common Voice, it records the lowest word error rates (WER) for many languages, including Italian and English, where the reported accuracy is 98.7% and 96.7% respectively (equivalent to WERs of roughly 1.3% and 3.3%).
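For readers unfamiliar with the metric, WER is conventionally defined as the number of word-level substitutions, deletions, and insertions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. The sketch below implements that standard definition in Python. It is not ElevenLabs' benchmark code: the function name and the text normalization (lowercasing and splitting on whitespace) are assumptions made for illustration, and real benchmarks typically apply more careful normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

With one wrong word in a six-word reference, the function returns 1/6, or about 16.7%. Under this definition, an accuracy of 98.7% corresponds to a WER of 1.3%.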
Key features include:
Speaker diarization to differentiate speakers in multi-speaker recordings
Word-level timestamps for detailed transcription accuracy
Detection of non-speech events, such as laughter and background noise
Structured transcript output for seamless integration via API
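To illustrate why word-level timestamps and speaker labels are useful together, the sketch below merges consecutive words from the same speaker into readable speaker turns. The `Word` and `Turn` types and their field names are hypothetical and chosen only for this example; they are not ElevenLabs' actual response schema, which should be taken from the official API documentation.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical shape of one transcribed word (not the real API schema).
    text: str
    start: float   # seconds from the start of the audio
    end: float
    speaker: str   # diarization label, e.g. "speaker_0"


@dataclass
class Turn:
    speaker: str
    start: float
    end: float
    text: str


def group_into_turns(words: list[Word]) -> list[Turn]:
    """Merge consecutive words spoken by the same speaker into one turn."""
    turns: list[Turn] = []
    for word in words:
        if turns and turns[-1].speaker == word.speaker:
            # Same speaker keeps talking: extend the current turn.
            turns[-1].end = word.end
            turns[-1].text += " " + word.text
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append(Turn(word.speaker, word.start, word.end, word.text))
    return turns
```

Each resulting turn carries the speaker label plus the start and end time of the span, which is the form most meeting-notes and captioning tools need.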
Pricing and availability
Scribe is available now via the ElevenLabs website and API.
Pricing is set at $0.40 per hour of input audio, with a 50% discount for the next six weeks. A low-latency version for real-time applications is also in development.
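As a rough budgeting aid, the quoted rate can be turned into a simple estimate. The sketch below assumes billing scales linearly with audio duration and ignores any minimums, tiers, or plan-specific terms, none of which are described here; the function and constant names are illustrative.

```python
LIST_PRICE_PER_HOUR = 0.40   # USD per hour of input audio, as quoted
LAUNCH_DISCOUNT = 0.50       # 50% introductory discount, as quoted


def estimate_cost(audio_hours: float, discounted: bool = False) -> float:
    """Estimate transcription cost in USD, assuming simple linear billing."""
    if audio_hours < 0:
        raise ValueError("audio_hours must be non-negative")
    rate = LIST_PRICE_PER_HOUR
    if discounted:
        rate *= 1 - LAUNCH_DISCOUNT
    return round(audio_hours * rate, 2)
```

Under these assumptions, 1,000 hours of audio would cost about $400 at list price and about $200 during the discount period.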
What it means for enterprises
For enterprise decision-makers, Scribe offers a tool for scalable, high-accuracy transcription, making it useful for industries that rely on automated documentation, meeting transcription, and content accessibility.
The model’s ability to handle diverse languages with high precision also benefits multinational businesses, media companies, and customer support applications.
Scribe’s pricing structure makes it competitive for businesses that require high-volume transcription services, and its API-based integration allows for seamless adoption in enterprise workflows.
Additionally, the upcoming low-latency version could position Scribe as a viable option for real-time communication tools.
Coming the same day as Octave, rival Hume’s text-to-speech counterpart
Timing is everything, and ElevenLabs chose to launch Scribe the same day that rival Hume AI unveiled Octave, an LLM-powered text-to-speech model that lets users customize AI-generated voices with adjustable emotions.
Octave is designed for content creation, including audiobooks, podcasts, and video game voiceovers. Unlike standard TTS systems, Octave considers context beyond individual sentences, adjusting tone, rhythm, and cadence dynamically to sound more natural.
Hume AI positions Octave as a direct competitor to ElevenLabs’ text-to-speech offerings, highlighting that Octave’s pricing is about half the cost of ElevenLabs’ current AI voice services.
While Scribe and Octave serve different functions, their development reflects the rising competition in AI-driven audio models.
ElevenLabs is prioritizing precise, multi-language speech recognition, while Hume AI is advancing expressive AI-generated speech.
For enterprises, this means more specialized solutions for both transcription and synthetic voice applications, enabling more efficient content production, customer engagement, and accessibility tools.
Scribe is now live, and ElevenLabs is hosting a virtual event next week with the team behind its development. More details, benchmarks, and API documentation are available in the official blog post.