TL;DR
- Artificial intelligence can do remarkable things, and its transcription capabilities are only expected to expand over time. Currently, however, AI transcription technology is not without its limitations.
- Those limits, particularly in accuracy, and the lack of a legal framework around synthetic voices, mean broadcasters are advised to experiment and use AI with care and human supervision.
- Dubbing is likely to be transformed by AI, with human talent potentially morphing into “AI Dubbing Managers” or Creative Directors.
Transcription is one process that stands to be uniquely impacted by recent developments in AI. Thanks to ever-evolving language and learning models, transcribing audio to text has never been faster or easier. But there are also limitations to new AI-powered transcription solutions.
The global translation service market will exceed $47 billion by 2031, largely driven by media and entertainment. Yet the current cost to caption titles for distribution on streaming services ranges between $60 and $100 per program hour, and the work typically takes one to three days to complete “because of excessive manual intervention,” claims Cineverse CTO Tony Huidor.
READ MORE: Global Translation Service Market Research (Market Research Future)
“Captions, and localization more broadly, are generally major pain points for content owners seeking to monetize their assets across the many streaming services,” Huidor added.
That’s because content companies need to generate far more revenue by broadening their audiences at significantly reduced costs.
“Companies have been priced out of bringing their entire content catalogs to market due to the extremely high costs of captioning and localization,” Huidor said.
READ MORE: Matchpoint AI Disrupts the Captioning Industry with the Launch of the Revolutionary MatchCaption AI Technology (Cineverse)
The traditional transcription process involves an individual transcriber listening to a piece of audio and manually converting every audio element they hear to text. It is clearly labor intensive, relies on trained specialists, and is costly.
But it does produce accurate results.
AI transcription eliminates the need for a human transcriber and relies instead on automatic speech recognition (ASR) technology. ASR uses language and learning models to interpret human speech and convert specific sounds (or phonemes) into written language.
Some of the most popular speech-to-text software is provided by Google, Azure, IBM, and Dragon Professional.
The upside of using automated transcription is the ability for companies to scale more of their output, to keep pace with huge global demand and to slash the costs of the whole exercise.
The main downside, as outlined by Vitac, is inaccuracy. AI systems tend to deliver poor-quality results when the input recording is poor, when there is more than one speaker, and when the audio contains a substantial amount of overlapping speech. Diverse accents or dialects among speakers can also inhibit the AI’s performance.
“All these variables can substantially impact AI’s ability to interpret and represent the audio of a recording and result in a final transcript containing a substantial number of errors,” Vitac says.
Its prescription to achieve “exceptionally high rates of accuracy” is to match automation with human experts. Not coincidentally, this is exactly the service it offers.
READ MORE: AI-Powered Captioning Solution, ‘Verbit Captivate,’ Provides New Levels of Accuracy, Customization (Vitac)
Broadcasters and publishers are somewhat reluctant to rely on AI transcription, given that tools to date have not proved foolproof. The BBC, for instance, values the trust that viewers put in the veracity of its output more than most broadcasters. It also faces increasing pressure to cut costs. It is exploring and evaluating AI tools, a route it advises others to follow.
Vanessa Lecomte, localization operations manager at BBC Studios, told language information site Slator that for all the benefits that AI has in localization, it “must match BBC’s quality standards at a minimum.”
She said, “The main question is whether AI can improve current processes, increase speed to market, and reduce costs.”
Lecomte advised balancing opportunities against the risk. “These technologies offer the potential to speed up the process, which in turn enables you to localize more content, reach new markets, but it shouldn’t be done to the detriment of quality or of a well-respected industry. So do the right thing and commit to a thoughtful localization strategy.”
The BBC is also addressing AI in dubbing using synthetic voices. Lecomte described the current dubbing process as “time-consuming and expensive involving many technical and creative talents.” She said her division is exploring the capabilities of AI dubbing technology to try to deliver more content, faster, and still meet quality standards, adding that this should be done responsibly with regard to talent rights.
READ MORE: How the BBC Evaluates the Use of AI in Dubbing and Captioning (Slator)
Anton Dvorkovich, CEO & Founder of Dubformer, also flagged the industry’s responsibility to establish regulations around the ethical use of human voices.
He also believes AI dubbing is “poised to dramatically transform the media industry…with solutions that cut production costs by 30-50%.”
“For now, investors and the media are struggling with the challenge of evaluating new solutions. However, the focus is shifting to the potential costs of emerging tools and their impact on the media industry,” he wrote in an op-ed for Streaming Media.
Solutions range from those like Papercup and Deepdub, where humans finalize the AI-powered dubbing, to “DIY translation tools” aimed at enabling freelance content creators to translate their videos with AI. One such solution, from HeyGen, relies on natural-sounding speech synthesis and text-to-speech software developed by ElevenLabs.
READ MORE: Papercup raises $20M for AI that automatically dubs videos (TechCrunch)
READ MORE: Deepdub raises $20M for AI-powered dubbing that uses actors’ original voices (TechCrunch)
Dvorkovich predicts the introduction of an “AI Dubbing Manager,” or proof listener, tasked with fine-tuning AI dubbing systems or types of content. This role could include listening to the automatic voice overs to grasp cultural nuances, refine voice modulation, and make corrections. Some actors and interpreters may transition into this profession as it evolves, he suggested.
There could also be Creative Directors for AI-enhanced productions to guide creative content developed through AI dubbing, while the market for actors to license their AI-generated voices will grow. “More tools will enter the market, enabling individuals to generate their voices with AI. Actors will be able to create new voices based on their own.”
READ MORE: What’s Next for AI Dubbing in the Media Industry? (Streaming Media)
AI-Powered Localization and Captioning Tools
Software developer Enco introduced AITrack and ENCO-GPT, which both use ChatGPT to generate language responses from text-based queries for automated TV and radio production workflows.
AITrack, for instance, integrates with Enco’s DAD radio automation system to generate and insert voice tracks between songs, leveraging synthetic voice engines to produce natural-sounding, engaging content.
ENCO-GPT could be used to condense a lengthy written news article into a few sentences, inject breaking news updates within live ad breaks, or automatically create ad copy on behalf of sponsors.
Company president Ken Frommert sees an opportunity to go bigger with both solutions. “We see opportunities to convert a morning or afternoon drive radio show into a short-form podcast, or summarize an 11:00 p.m. local news program for the TV station’s website…. It offers a seamless way to publish content in diverse forms.”
READ MORE: ENCO Moves Deeper into AI Universe with ChatGPT Toolsets (Enco)
LEXI Recorded, a VOD automated captioning solution from Australian firm AI-Media, claims 98% accuracy, “comparable to human captioning,” and even higher with the use of custom dictionaries or topic models. Its use is priced from 20 cents per minute.
“We are not just meeting but exceeding the demands for high-volume, quick, and precise captioning of recorded content,” said AI-Media’s Chief Product Officer, Bill McLaughlin, who will present the product at NAB Show in April.
READ MORE: AI-Media’s LEXI Tool Kit Expanded with LEXI Recorded – Breakthrough Solution for the Growing VOD Market (AI-Media)
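Accuracy figures such as the 98% quoted for LEXI Recorded depend on how errors are counted. The article does not say which metric AI-Media uses, so the following is only a minimal sketch of one widely used measure, word error rate (WER), which counts the word substitutions, deletions and insertions needed to turn the machine transcript into a human reference. The function names and sample sentences are illustrative assumptions, not part of any vendor's product.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: simple lowercase/whitespace tokenization. Real captioning
    evaluations usually normalize punctuation and numbers as well.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Accuracy expressed as 1 - WER, floored at zero."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))


if __name__ == "__main__":
    human = "the quick brown fox jumps over the lazy dog"
    machine = "the quick brown fox jumped over the lazy dog"
    print(f"WER: {word_error_rate(human, machine):.3f}")
    print(f"Accuracy: {word_accuracy(human, machine):.1%}")
```

On this measure, a 98% accurate transcript would contain roughly two word errors in every 100 words of reference text.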
Captions offers an AI-based video editing app and a solution for automatically generating subtitles. Both products are aimed at content creators and marketers.
It also offers an in-house voice cloning tool trained on licensed audio recordings to translate users’ audio into 28 other languages or use an AI voiceover to narrate the content from scratch.
Gaurav Misra, CEO and cofounder, says Captions’ approach to video editing software is different because its tools are designed specifically for editing talking videos. “Most video production editing is focused more on aesthetics like filters and colors, whereas our focus became more about conveying an idea or experience,” he told Rashi Shrivastava at Forbes.
READ MORE: Video Editing App Captions Just Raised $25 Million To Bring AI To Creators (Forbes)
Vitac claims its own AI captioning solution, Verbit Captivate, stands apart from “generic” ASR engines in being designed, developed and built in-house. “Whereas other AI captioning vendors either provide an engine or a service, Vitac is unique in that we own both. And because of that, we can change, update, upgrade, and customize customer offerings, tuning our solutions to individual customer needs, creating an offering that achieves accuracy and results on a personal level.”
Additionally, it pairs the tech with “human backup” — specialists who boost performance with prep, pre- and post-session research, and live-session monitoring.
Cineverse’s MatchCaption targets the localization of bulk film, television and video libraries “at significant scale.” It claims its generated captions are “perfectly timed and formatted according to industry standards, then auto converted into multiple caption/subtitle formats, to meet the specifications of all streaming platforms.”
It also claims its system can complete, for less than $10 per program hour, the same tasks that currently cost content owners $60-100, “and a full feature film can be completed, and quality checked in less than one hour, an 85% reduction in cost and 90% reduction in time.”
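The pricing figures quoted in this article can be put on a common per-program-hour footing with some simple arithmetic. The sketch below uses only numbers stated above ($60-100 per program hour today, under $10 for MatchCaption, 20 cents per minute for LEXI Recorded); the helper name and the choice to treat $10 as the ceiling price are assumptions made for illustration.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage saved when a cost falls from `before` to `after`."""
    if before <= 0:
        raise ValueError("`before` must be positive")
    return (before - after) / before * 100.0


# Figures quoted in the article, in US dollars per program hour.
CURRENT_COST_LOW = 60.0
CURRENT_COST_HIGH = 100.0
MATCHCAPTION_CEILING = 10.0  # "less than $10", so treated as an upper bound

# LEXI Recorded is priced "from 20 cents per minute"; convert to an hourly rate.
LEXI_PER_MINUTE = 0.20
lexi_per_hour = LEXI_PER_MINUTE * 60

saving_low = percent_reduction(CURRENT_COST_LOW, MATCHCAPTION_CEILING)
saving_high = percent_reduction(CURRENT_COST_HIGH, MATCHCAPTION_CEILING)

if __name__ == "__main__":
    print(f"Saving against a $60 baseline:  {saving_low:.1f}%")
    print(f"Saving against a $100 baseline: {saving_high:.1f}%")
    print(f"LEXI Recorded entry price: ${lexi_per_hour:.2f} per program hour")
```

At the $10 ceiling the saving works out to between roughly 83% and 90% depending on the baseline, so the claimed 85% cost reduction sits inside the range implied by the quoted prices.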