New York City startup Hume AI emerged from stealth two years ago and has since raised millions of dollars in funding on the basis of its technology that creates emotive AI voices for use in enterprise applications.
Today, it’s taking its offerings a step further with a new large language and speech model called the “Omni-capable text and voice engine,” or Octave for short, designed to produce lifelike, emotionally nuanced speech for use across different forms of content, from audiobooks to prerecorded video game character dialog and film/TV/video.
Hume claims Octave is the first text-to-speech system powered by a large language model (LLM) trained not only on text but also on speech and emotion tokens. That training enables it to understand words in context and adjust tone, rhythm and cadence accordingly, and the user can further adjust the output at the sentence level with text prompts.
“We’re launching the first LLM for text-to-speech: a model that understands words in context, predicting the right emotions, rhythm, cadence and emphasis, making speech sound more human than ever before,” said Alan Cowen, Hume AI’s cofounder and CEO, in a video call interview with VentureBeat.
Octave’s capabilities go beyond basic voice generation. It can interpret character traits and style from a script alone, adjusting vocal inflections to match implied emotions. A sarcastic remark will be spoken sarcastically, a panicked sentence will sound urgent, and a whispered secret will be hushed, all without needing explicit direction.
In addition, if the user doesn’t like the generated voice or wants to adjust it, they can do so granularly through natural language by simply typing a text instruction to Octave, such as “happier, sadder, more frustrated, angrier, more sarcastic, more sincere,” and so on.
“You can describe a character — like a sarcastic medieval peasant — and the model will instantly create that voice, adjusting emotions like anger, sadness or happiness based on your instructions,” Cowen added. “Voice modulation works at the sentence level, but you can also adjust parts of a sentence, instructing the model to convey nuanced emotions like slight frustration mixed with humor or exasperation.”
The model also considers context beyond individual sentences. “Unlike traditional models that process text word by word, our model considers entire paragraphs, capturing context to deliver more natural and emotionally accurate speech,” he explained.
While the current release focuses on English-language speech, Octave also supports Spanish and is expected to expand its language capabilities in the near future.
Tailored for content creation
Octave is tailored for content creators and media production, offering a range of applications.
“This new model is designed for offline text-to-speech (perfect for audiobooks, podcasts, video voiceovers, and video game characters) where creators need realistic, character-specific voices,” Cowen explained.
However, the user must access it through Hume’s website, either on its Projects page or through an application programming interface (API). The “offline” part refers to the fact that this model is designed to produce discrete audio files that can be added to projects such as videos or audiobooks. It’s not designed to carry on real-time conversation, though that could theoretically be enabled by piping in text queries to the website.
Hume’s API allows developers to make up to 50 requests of the new Octave model per minute, with a maximum text length of 5,000 characters and descriptions capped at 1,000 characters. Each request can generate up to five outputs, and the supported audio formats include MP3, WAV and PCM.
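For developers planning around these limits, a client-side check before sending a request might look like the following minimal sketch. The numbers come from the limits reported above; the `TtsRequest` class, its field names and both functions are hypothetical illustrations, not part of Hume's SDK or API.

```python
from dataclasses import dataclass

# Limits as reported in the article. All names here are illustrative only.
MAX_TEXT_CHARS = 5_000
MAX_DESCRIPTION_CHARS = 1_000
MAX_OUTPUTS_PER_REQUEST = 5
MAX_REQUESTS_PER_MINUTE = 50
SUPPORTED_FORMATS = {"mp3", "wav", "pcm"}


@dataclass
class TtsRequest:
    """Hypothetical request shape used only for this sketch."""
    text: str
    description: str = ""
    num_outputs: int = 1
    audio_format: str = "mp3"


def validate_request(req: TtsRequest) -> list[str]:
    """Return a list of problems; an empty list means the request fits the limits."""
    problems = []
    if len(req.text) > MAX_TEXT_CHARS:
        problems.append(f"text has {len(req.text)} characters; limit is {MAX_TEXT_CHARS}")
    if len(req.description) > MAX_DESCRIPTION_CHARS:
        problems.append(
            f"description has {len(req.description)} characters; limit is {MAX_DESCRIPTION_CHARS}"
        )
    if not 1 <= req.num_outputs <= MAX_OUTPUTS_PER_REQUEST:
        problems.append(f"num_outputs must be between 1 and {MAX_OUTPUTS_PER_REQUEST}")
    if req.audio_format.lower() not in SUPPORTED_FORMATS:
        problems.append(f"unsupported audio format: {req.audio_format}")
    return problems


def min_seconds_between_requests() -> float:
    """Spacing that keeps a steady stream of calls under the per-minute cap."""
    return 60 / MAX_REQUESTS_PER_MINUTE
```

Checking locally avoids spending a request on input that would be rejected, and spacing calls by the value from `min_seconds_between_requests()` is one simple way to stay under the per-minute cap.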
Hume’s prior EVI series of models allows for streaming, real-time, back-and-forth interactions. They remain available and will continue to be developed.
Hume AI offers a subscription-based pricing model with tiers ranging from a free option to Creator, Pro, and Enterprise plans.
Here’s a concise breakdown of the options:
Free ($0/month) – 10,000 characters of text-to-speech per month (~10 minutes) with unlimited custom voices
Starter ($3/month) – 30,000 characters (~30 minutes) plus support for up to 20 projects
Creator ($10/month) – 100,000 characters (~100 minutes), usage-based pricing for additional characters ($0.20/1,000), and support for up to 1,000 projects
Pro ($50/month) – 500,000 characters (~500 minutes), lower usage-based pricing ($0.15/1,000), and support for up to 3,000 projects
Scale ($150/month) – 2,000,000 characters (~2,000 minutes), further reduced usage-based pricing ($0.13/1,000), and support for up to 10,000 projects
Business ($900/month) – 10,000,000 characters (~10,000 minutes), even lower usage-based pricing ($0.10/1,000), and support for up to 20,000 projects
Enterprise (Custom price) – Unlimited usage, custom legal terms, security assurances, significantly discounted bulk pricing, and priority support
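To compare tiers for a given workload, the listed figures can be turned into a small calculator. This is a rough sketch, not Hume's billing logic: it assumes overage is billed pro rata per character (the listing only gives a price per 1,000), and it uses the roughly 1,000 characters per minute implied by the listing to estimate audio length. Both function names are hypothetical.

```python
def estimated_minutes(chars: int) -> float:
    """Rough audio length, using the ~1,000 characters per minute implied by the tier list."""
    return chars / 1_000


def monthly_cost(base_fee: float, included_chars: int,
                 overage_per_1000: float, chars_used: int) -> float:
    """Monthly bill in USD for one tier, assuming pro-rata overage billing."""
    extra_chars = max(0, chars_used - included_chars)
    return round(base_fee + extra_chars / 1_000 * overage_per_1000, 2)


if __name__ == "__main__":
    usage = 150_000  # characters generated in one month
    print("approx. minutes of audio:", estimated_minutes(usage))
    print("$10/month tier:", monthly_cost(10, 100_000, 0.20, usage))
    print("$50/month tier:", monthly_cost(50, 500_000, 0.15, usage))
```

Under these assumptions, the $10/month tier remains the cheaper of the two until usage reaches about 300,000 characters a month, at which point its overage charges bring the bill to the $50 base fee of the next tier.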
Altogether, Hume emphasized that its Octave TTS pricing is around half the cost of the competing service from AI voice creation startup ElevenLabs, showing the intensifying competition in the text-to-speech space.
In addition, Hume AI conducted a blind comparison study with 180 human raters to benchmark Octave against ElevenLabs. The results showed that Octave was preferred in terms of audio quality (71.6% of trials), naturalness (51.7% of trials), and how well the speech matched descriptions of the desired voice (57.7% of trials), across 120 diverse prompts.
To further evaluate its performance, Hume AI has also launched the Expressive TTS Arena, a public benchmark designed to test how well AI models handle longer, expressive speech, an area that earlier TTS benchmarks have largely ignored.
Tens of trillions of language tokens
Unlike traditional text-to-speech systems that rely on limited speech datasets, Octave TTS is built on an LLM trained on tens of trillions of language tokens.
“Traditional text-to-speech models are trained on limited speech data, but ours is built on an LLM trained on tens of trillions of tokens, enabling it to reason, think, and infer emotions from text,” Cowen said.
The model was trained using tens of millions of hours of public, long-form speech data and Hume AI’s proprietary datasets of new voices recorded by survey participants.
“We collected data from people recording themselves through webcams, reacting naturally to videos, telling stories, and talking to others, including friends and family, to capture a wide range of emotional expressions,” Cowen said.
This extensive training allows the model to infer emotional context and follow detailed instructions, creating voices that match specific character descriptions and attributes.
Consistent character voices and limitations
Octave TTS maintains consistent character voices across long-form content.
“With our platform, you can generate unique voices for each character in an audiobook, like a middle-aged orc, and maintain that character’s voice throughout the story,” Cowen said.
This capability is supported by Hume AI’s “Projects” page, which handles long-form content like audiobooks by automatically chunking text while preserving character consistency and context across chapters.
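Hume has not described how the Projects page splits text, so the following is only an illustrative sketch of one common approach: greedily packing whole paragraphs into chunks that stay under a character limit (here defaulting to the 5,000-character request limit). The function name and the paragraph-boundary strategy are assumptions, and the sketch covers only the chunking step, not voice consistency.

```python
def chunk_text(text: str, max_chars: int = 5_000) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # As a last resort, cut a paragraph longer than the limit into fixed-size pieces.
        pieces = [paragraph[i:i + max_chars] for i in range(0, len(paragraph), max_chars)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate  # the piece still fits in the current chunk
            else:
                chunks.append(current)  # close the current chunk and start a new one
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Splitting at paragraph boundaries keeps each request's text self-contained, which matters for a model that is said to use paragraph-level context. A production splitter would likely also respect sentence and chapter boundaries rather than cutting oversized paragraphs at a fixed offset.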
Hume has technical guardrails built into its website and API prohibiting certain uses, but other than that, it’s open to use across a range of content and subjects, including potentially not-safe-for-work scenes such as those in popular romance novels.
“We give developers freedom, allowing content across a broad range of human experiences, though we restrict the creation of realistic children’s voices and imitations of specific individuals,” Cowen explained.
In addition, Cowen said that the company may adjust these guardrails for specific clients upon request, such as a children’s-book publisher looking to create voices for children’s audiobooks.
Hume AI is working on a forthcoming Voice Cloning feature, which will allow users to replicate a voice from as little as five seconds of audio. The company is developing safeguards to ensure ethical use before rolling out the feature publicly.
With its combination of contextual awareness, emotional expression and character customization, Octave TTS aims to provide content creators with more control and flexibility, delivering voices that sound both realistic and emotionally engaging.