The world is full of data. And 90% of the world’s total data was created in the last 2 years.
Let this sink in for a moment: In the past 2 years, we’ve created more data than our human civilization produced since we started writing.
A lot of this data is multimedia. Whether it’s a simple customer call, a recording of a meeting, or a judicial hearing, organizations generate enormous amounts of multimedia files.
Of course, large organizations need to somehow extract value from this stored multimedia content.
The Challenge of Human-Powered Multimedia Transcription Services
Although it might seem more natural to have humans transcribe multimedia files, people can’t deliver the accuracy and volume that’s needed. Plus, there’s security and privacy to think about.
I tend to look at this problem from a construction perspective. Imagine we’re tasked with digging a 1,000-mile-long ditch in a perfectly straight line. Will we do it better, faster, and easier by getting 1,000 people, or by going for a mechanized approach? I’m pretty sure that anyone who’s come within 30 feet of a shovel will gladly call on the excavators.
Organizations that regularly do manual transcription of recordings say that it takes up to 9 hours of transcription work to transcribe a single hour of recorded interview. A 9:1 ratio is very hard to sustain at scale.
This means that a 4-hour court hearing can take up to 36 hours, nearly one worker’s full workweek, to transcribe. Multiply this by many more hours, more hearings, in more courtrooms, every single day, and we soon see the problem: human-powered transcription is not scalable.
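To make the arithmetic concrete, here is a small back-of-the-envelope sketch in Python. The 9:1 ratio is the upper bound quoted above; the 40-hour workweek and the courtroom numbers in the usage example are my own assumptions for illustration, not measured figures.

```python
# Rough workload estimate for manual transcription.
TRANSCRIPTION_RATIO = 9  # hours of work per hour of audio (upper bound quoted above)
WORKWEEK_HOURS = 40      # assumption: a standard full-time week


def transcription_hours(audio_hours, ratio=TRANSCRIPTION_RATIO):
    """Hours of manual work needed to transcribe the given hours of audio."""
    return audio_hours * ratio


def workweeks(audio_hours, ratio=TRANSCRIPTION_RATIO, week=WORKWEEK_HOURS):
    """The same workload expressed in full-time workweeks."""
    return transcription_hours(audio_hours, ratio) / week


if __name__ == "__main__":
    # A single 4-hour hearing.
    print(transcription_hours(4), "hours,", workweeks(4), "workweeks")

    # Hypothetical load: 10 courtrooms, 4 hours of hearings each, 5 days a week.
    weekly_audio = 10 * 4 * 5
    print(workweeks(weekly_audio), "workweeks of transcription per calendar week")
```

Under these assumptions, one week of hearings in ten courtrooms produces 45 workweeks of transcription, which means dozens of full-time transcribers just to keep pace.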
AI-powered transcription software, on the other hand, has become much easier to use, much faster and much more accurate than traditional transcription platforms. With any of the available software solutions for automated transcription, a single employee can cover a lot of ground, without the grind.
And one increasingly relevant consideration that makes a huge difference is privacy and security. Automated transcription software, where fewer people touch the data, is inherently more secure.
The Current Market Of AI-Based Transcription Software
The current market for AI-based transcription software is divided between established vendors and startups. The major difference between the two is how they approach the market.
Startup companies sell transcription platforms as a service directly to consumers. Larger companies, on the other hand, tend to offer speech-to-text through an API, as part of a larger product or as an enterprise-level offering. Some of them tend to focus more on dictation than on transcription. Others limit the machine’s role and offer ‘hybrid’ solutions of human and AI transcription, which require manual ‘polishing’ of the transcript provided. And some focus specifically on one or more subsets of users, like medical or legal transcription.
Let me briefly introduce the transcription software solutions that aim at the enterprise level.
1. Amazon Transcribe
Amazon offers a transcription service called Amazon Transcribe, along with an API that enables integration with third-party services.
Amazon Transcribe is an Automatic Speech Recognition (ASR) service that makes it easy to add speech-to-text capability to AWS applications. It can be used for lots of common applications, including transcription and generation of subtitles for multimedia content.
Together, AWS and Amazon Transcribe make speech-to-text conversion very simple. Just store the multimedia content you want to convert into text in an Amazon S3 bucket, and you can pull the extracted text via an API connection.
The process is very simple. The only things to worry about are to:
- Use the proper URL (the S3 location) of the uploaded file, and
- Specify the format and the language of the input.
In return, the platform provides a URI for retrieving the results.
The results are saved in an S3 bucket and are identified by this user-specific URI.
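The steps above can be sketched with boto3, the AWS SDK for Python. This is a minimal sketch, not a production setup: the job name, bucket, key, and region are hypothetical placeholders, and it assumes the audio file is already in S3 and that AWS credentials are configured.

```python
def build_transcribe_request(job_name, bucket, key,
                             media_format="mp3", language_code="en-US"):
    """Assemble the arguments for Transcribe's StartTranscriptionJob call.

    The file location, the input format, and the language are the
    pieces of information mentioned in the text above.
    """
    return {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": f"s3://{bucket}/{key}"},
        "MediaFormat": media_format,
        "LanguageCode": language_code,
    }


def start_job(request, region="us-east-1"):
    """Submit the job and return its current status (needs boto3 + credentials)."""
    import boto3  # imported here so the builder above works without the SDK

    client = boto3.client("transcribe", region_name=region)
    client.start_transcription_job(**request)
    job = client.get_transcription_job(
        TranscriptionJobName=request["TranscriptionJobName"]
    )
    # Once the status is COMPLETED, the result location is reported under
    # job["TranscriptionJob"]["Transcript"]["TranscriptFileUri"].
    return job["TranscriptionJob"]["TranscriptionJobStatus"]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    request = build_transcribe_request("hearing-001", "my-bucket", "audio/hearing.mp3")
    print(request)
```

Transcription jobs run asynchronously, so a real integration would poll `get_transcription_job` until the status changes, and only then fetch the transcript.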
2. Google Cloud Speech-to-Text
Google Cloud Speech-to-Text enables audio-to-text conversion by applying powerful neural network models in an easy-to-use API.
You can enable voice command-and-control, transcribe audio from call centers, and more. It can process real-time streaming or prerecorded audio, using Google’s Machine Learning technology.
Google Cloud Speech-to-Text is a little bit more complicated than Amazon Transcribe. Namely, you need to send a speech recognition request to the Cloud Speech-to-Text API in order to get started. The type of request you send depends on the length of the multimedia content you want to convert. Based on that, there are three types of requests:
- Synchronous Recognition requests. These requests are limited to audio content that is 1 minute long or less. This option sends audio content to the Speech-to-Text API, the API recognizes the content, and it returns results after all the audio has been processed.
- Asynchronous Recognition requests. These requests are limited to audio content up to 180 minutes.
- Streaming Recognition requests. These requests stream recognition while the audio content is being captured, enabling results while the user is still speaking. This option is very useful for real-time recognition like capturing live audio content.
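The choice between the three request types boils down to a simple rule, sketched below in plain Python. The limits are the ones quoted in this article (1 minute and 180 minutes) and may differ from Google’s current quotas, so treat them as assumptions. The function name is my own; the comments note which method of the `google-cloud-speech` client each type corresponds to.

```python
SYNC_LIMIT_SECONDS = 60         # synchronous: 1 minute or less (per the text)
ASYNC_LIMIT_SECONDS = 180 * 60  # asynchronous: up to 180 minutes (per the text)


def choose_request_type(duration_seconds=None, live=False):
    """Pick a recognition request type for a piece of audio."""
    if live:
        # Audio still being captured: SpeechClient.streaming_recognize
        return "streaming"
    if duration_seconds is None:
        raise ValueError("duration is required for prerecorded audio")
    if duration_seconds <= SYNC_LIMIT_SECONDS:
        # Short clips: SpeechClient.recognize
        return "synchronous"
    if duration_seconds <= ASYNC_LIMIT_SECONDS:
        # Longer recordings: SpeechClient.long_running_recognize
        return "asynchronous"
    raise ValueError("audio exceeds the asynchronous limit; split it first")


if __name__ == "__main__":
    print(choose_request_type(45))          # a short voice command
    print(choose_request_type(4 * 3600))    # raises: a 4-hour hearing is too long
```

Note that the 4-hour court hearing from earlier would exceed the asynchronous limit quoted here, so it would have to be split into shorter segments before submission.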
The biggest advantage of Google’s transcription service is their access to a huge amount of the training audio necessary to build a great speech model. Google has massive audio content repositories like YouTube, which is a big part of why Google shows greater accuracy than the others. Every time someone uploads their video presentation and then edits the subtitled text, they’re essentially helping Google get better at transcription. And with so many active YouTube users who contribute content, Google definitely has the edge of crowd-supported machine learning loops.
3. IBM Watson
Just like AWS and Google, IBM’s speech-to-text service provides an API to add speech transcription capabilities to applications.
What is unique about IBM speech-to-text service is that it combines information about language structure with the composition of the audio signal.
IBM Watson transcription software allows two different possibilities for transcription:
- Transcribe the audio with one option, which means that the platform returns a single transcription.
- Transcribe the audio content with two possible alternatives.
The thing about the IBM speech-to-text API is that its documentation demonstrates it with cURL, a command-line tool that sends your audio content to the service over HTTP and gets the text back.
cURL is not limited to Linux-based systems, though. It is also available for macOS and Windows, and any other HTTP client can send the same requests, so Microsoft fans don’t need additional plugins to get Dr. Watson talking.
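As a rough illustration of what such a request looks like, the sketch below builds (but does not send) a recognition request using only Python’s standard library. The service URL and API key are placeholders, since every IBM Cloud instance has its own; the `/v1/recognize` path, the `max_alternatives` parameter, and the `apikey` basic-auth scheme reflect my reading of IBM’s API reference and should be checked against the current documentation.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request


def build_recognize_request(service_url, api_key, audio_bytes,
                            content_type="audio/flac", max_alternatives=1):
    """Build an HTTP POST that asks the service to transcribe audio_bytes.

    max_alternatives=1 asks for a single transcription; a higher value
    asks the service to return alternative transcriptions as well.
    """
    query = urlencode({"max_alternatives": max_alternatives})
    token = base64.b64encode(f"apikey:{api_key}".encode()).decode()
    return Request(
        url=f"{service_url.rstrip('/')}/v1/recognize?{query}",
        data=audio_bytes,
        headers={
            "Content-Type": content_type,
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder URL, key, and audio, for illustration only.
    req = build_recognize_request("https://example.invalid/", "MY_KEY",
                                  b"fake-audio", max_alternatives=2)
    print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would send it and return the service’s JSON response, which is the same round trip the cURL examples perform.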
4. Nuance Dragon Speech Recognition Solutions
Nuance’s speech-to-text service is an AI-based transcription technology offered to businesses and consumers. Their NaturallySpeaking software is aimed at single users, while other products in their Dragon line are specialized for different industries.
Nuance Dragon also offers business-specific tools for law enforcement, education, financial services, and more. Although it might seem a little far off from the previous three services, Nuance is actually the first to make the move towards transcription in the ’60s. Until 2014, Nuance’s technology powered Siri, Apple’s voice-activated mobile assistant.
Because Nuance has built training into their products like the Law Enforcement and Legal solutions, this brand is pushing toward end-user acceptance.
For their Legal Transcription Service, they state that their system is trained on legal terminology using over 400 million words from legal documents. According to Nuance, this helps their transcriptions come back with 99% accuracy.
Another end-user-oriented pitch is their Law Enforcement solution. Their website states that law enforcement officers can use dictation in real time to create incident reports, do mandatory paperwork, and even use it in the field for quick searches for people, license plates, and so on.
So, Which Is Best?
With so much multimedia being produced by large organizations, it’s difficult for traditional, human-powered transcription services to keep up.
This is why companies like Amazon, Google, and IBM have come up with their AI-powered transcription software. Some, like Nuance, went a step further and offer end-user-oriented transcription packs.
The “Which Is Best” question is hard to answer. And, I don’t think there’s one answer that will suit everybody.
Which AI-powered transcription software solution is best for you will mostly depend on your needs.
So to answer this question, you’ll want to start with your technical requirements. Then consider your interest in using high-code or low-code solutions and, most importantly, your budget constraints.
This is a complex issue, and I’ll be covering the market of AI-powered transcription software in more detail. So, make sure you follow us on Facebook, Twitter, and LinkedIn to get notified when more in-depth coverage hits the web.
In the meantime, please take a moment to let me know if you have any comments, questions or ideas on the topic.