Did you know that the industry standard for manually transcribing 1 hour of clear audio is 4 hours? That is a 4 to 1 ratio. Poor-quality audio can take as many as 9 hours per hour of recording (a 9 to 1 ratio), which means a lot of manual hours are needed to transcribe audio and video files. Think about how much society relies on audio and video files these days. Recently, a law enforcement agency asked us to help solve this challenge. The agency was spending about 8 hours for every 1 hour (8 to 1) of body camera or interview audio or video. Our solution – ArkCase, Alfresco, and AWS – originally proposed for their IT modernization project, addressed this challenge. Intrigued, they asked for a demo. Seeing a clear opportunity to demonstrate further value, we asked for a few weeks to prepare it.
Before We Get to the How, Let’s Describe the What
What are the platforms being used in this solution?
- For starters, ArkCase is a low-code IT modernization and case management platform, with the goal of being the premier open source low-code case management platform.
- Alfresco is the leading open source enterprise content management platform providing core document management, business process management and records management services.
- AWS is the leading cloud-based platform as a service (PaaS) supporting our ability to comply with FedRAMP, CJIS, HIPAA, HiTECH, SOC2 and other security controls.
ArkCase provides an intuitive, accessible, and responsive user interface to view, stream, and transcribe rich media files. ArkCase integrates with the Amazon Transcribe service to provide high-volume, high-quality, and cost-effective transcription of audio and video files. The transcription functionality in ArkCase allows users to upload audio and video files within ArkCase and configure whether files are transcribed automatically or sent manually, allowing organizations to control their costs. ArkCase then uses an Alfresco Activiti workflow to process the files with the AWS automatic speech recognition (ASR) service, which produces the transcription files. The outcome is a highly visual transcription with closed captions that can be searched and edited for enrichment. In addition, the transcription can be compiled into a Microsoft Word document for sharing. The user interface allows streaming the rich media file while viewing the transcription text, which is extremely helpful when manually QAing the transcription.
What Does ArkCase Transcription Functionality Provide?
The ArkCase user can perform many different actions on the transcribed file:
- Viewing of the File Details
- Total Word Count
- Confidence Rating of the Transcribed File
- Transcription Status (In Process, Complete)
- Listen to or View the File in a Streaming Viewer with Closed Captions
- Searching for text within audio or video files
- Jumping to that section of the audio or video
- Tagging the file based on the transcription
- Viewing Individual Sections of the Transcription Text
- Each Transcription Text Section Shows:
- Start Time of that Section of Text
- Confidence Rating of that Section of Text
- Editing Individual Sections of Transcription Text during QA
- Automatically Compile the Transcription Text into a Single Document File
Can ArkCase Transcription Functionality be Configured?
ArkCase transcription functionality also provides administration configurations for certain transcribe options within the ArkCase application:
- Enable Transcription to turn on transcription
- Automatic or Manual Transcription to decide if you want all rich media files processed or to manually select the files you want to be processed
- Word Count per transcribed section for chunking and readability
- Confidence Threshold for highlighting sections that may need human reviewing
An Administrator can control the transcription functionality for all users in the ArkCase application by enabling or disabling it. If the functionality is enabled, rich media files can be sent to AWS for transcription manually or automatically. If the Admin enables automatic transcription, each rich media file uploaded into ArkCase is automatically sent to the Amazon Transcribe service. The Admin can also control the word count and confidence threshold for each section of transcription text. ArkCase flags any sections of transcription text that do not meet the configured confidence threshold.
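A minimal sketch of how section-level confidence flagging could work over an Amazon Transcribe result. The JSON shape below mirrors Transcribe's output (word-level items, each carrying a confidence score); the chunking-by-word-count approach, the function names, and the 0.85 threshold are illustrative assumptions, not ArkCase internals.

```python
def chunk_words(items, words_per_section=40):
    """Group word-level Transcribe items into fixed-size sections for readability."""
    words = [it for it in items if it.get("type") == "pronunciation"]
    return [words[i:i + words_per_section]
            for i in range(0, len(words), words_per_section)]


def flag_sections(items, words_per_section=40, threshold=0.85):
    """Return (start_time, avg_confidence, needs_review) for each section."""
    flagged = []
    for section in chunk_words(items, words_per_section):
        confidences = [float(w["alternatives"][0]["confidence"]) for w in section]
        avg = sum(confidences) / len(confidences)
        flagged.append((section[0]["start_time"], round(avg, 3), avg < threshold))
    return flagged
```

Any section whose average confidence falls below the configured threshold gets a review flag, which is the behavior described above.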
ArkCase transcription can support many different use cases, no matter your specific business. If you want to gain efficiencies and get more value from your audio and video content, let’s talk.
In recent years, multimedia content is used more than ever. This means the world's data trapped in multimedia formats keeps growing larger and harder to use. To help you solve this problem and use multimedia content to your advantage, we built this integration of ArkCase, Alfresco, and AWS.
ArkCase, Alfresco, and AWS together form a platform that can help enterprises derive value from ever-increasing multimedia content. Thanks to this integration, enterprises of all sizes can now use the Armedia Legal Module with Amazon Transcribe as part of it.
The Armedia Legal Module turns any audio or video file into text that can then be used like any other textual document. No more external transcription services. No more waiting. No more time wasted.
In one of my previous blog posts, I touched on the topic of AI-powered transcription services on the market. There, I introduced the idea that, at this pace of multimedia production, traditional human-powered transcription services are not the solution.
In the past 2 years, we've produced 90% of all the data our civilization has ever created. At this pace, and with a 9:1 ratio for transcribing multimedia files, human-powered transcription simply cannot keep up. It's too slow, too expensive, too prone to error, and too vulnerable to data leaks.
Just as hiring an army of workers to dig a perfectly straight, 1,000-mile ditch is not the best option, we need to start thinking about how machines can help.
In this blog post, I'd like to dig a bit deeper and give better coverage of the 4 major transcription services: Amazon, Google, IBM, and Nuance. They are all good players; however, only one can fully respond to all of your specific needs.
To help you choose the best transcription service provider, let’s make a little comparison between the four.
My Comparison Methodology
I’ll be covering the four providers from several different angles, so you can get a more comprehensive understanding of their value proposition for your specific needs. Here are the different angles I’ll be covering:
- Speed. The speed of a transcription platform is a crucial factor. Given enough time, anyone could transcribe multimedia content, but the whole point of platforms like these is to make that time as short as possible. In some cases, though, speed may not be the ultimate deciding factor: some companies will be better off with a slower but more accurate solution.
- Accuracy is paramount to a transcription platform. Very often, the worth of a transcription platform is measured by its accuracy. If the platform gives you a transcription that needs additional edits in punctuation and speaker attribution, then that platform hasn't done much of the job for you. Then again, companies that produce large volumes of transcripts may be better off with a slightly less accurate but much cheaper solution.
- Price. Whether you are a small company or a well-established vendor moving the market, everyone cares about costs. How much of a deciding factor this will be depends on how large your budget is and how important the other two metrics are.
Now that I've introduced the platforms and the methodology for comparing the 4 transcription services, let's get started.
Amazon Transcribe Service
To keep pace with the evolution of language, the Amazon Transcribe platform is continually learning and improving. It is designed to provide fast and accurate automated transcripts for multimedia files of varying quality.
Currently, Amazon's transcription service can process multimedia content within these limits:
- Duration: maximum 2 hours
- Custom vocabulary: maximum 50 KB file size
- Sampling rate: from 8 kHz (telephony audio) to 48 kHz
- Languages: English and Spanish
- Formats: WAV, MP3, MP4, FLAC
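The limits above lend themselves to a simple pre-flight check before submitting a file. This is a sketch: the constants come from the list above, while the function and its argument names are illustrative, not part of any AWS API.

```python
# Pre-flight check of a media file against the Amazon Transcribe limits
# listed above (2-hour duration, 8-48 kHz sampling, supported formats).
SUPPORTED_FORMATS = {"wav", "mp3", "mp4", "flac"}
MAX_DURATION_SECONDS = 2 * 60 * 60   # 2 hours
MIN_SAMPLE_RATE_HZ = 8_000           # telephony audio
MAX_SAMPLE_RATE_HZ = 48_000


def validate_media(fmt, duration_seconds, sample_rate_hz):
    """Return a list of constraint violations (an empty list means OK)."""
    problems = []
    if fmt.lower() not in SUPPORTED_FORMATS:
        problems.append(f"unsupported format: {fmt}")
    if duration_seconds > MAX_DURATION_SECONDS:
        problems.append("duration exceeds 2 hours")
    if not MIN_SAMPLE_RATE_HZ <= sample_rate_hz <= MAX_SAMPLE_RATE_HZ:
        problems.append("sample rate outside 8-48 kHz")
    return problems
```

Rejecting out-of-bounds files up front avoids failed jobs and wasted round trips to the service.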
Thanks to AWS's processing prowess, Amazon Transcribe transcribes at an astonishing speed.
The best thing about Amazon Transcribe is the accuracy of its transcriptions. AWS has been the world's most comprehensive and broadly adopted cloud platform for the last 12 years, and that experience shows in the accuracy of Amazon Transcribe's results.
Namely, unlike other transcription services, the Amazon Transcribe platform produces text that is ready to use, without a need for further editing. To achieve this, Amazon Transcribe pays special attention to:
- Punctuation. The Amazon Transcribe platform adds appropriate punctuation to the text as it goes and formats the text automatically, producing intelligible output that can be used without further editing.
- Confidence score. AWS Transcribe provides a confidence score that shows how confident the platform is in the transcription.
This means you can always check the confidence score to see whether a particular line of the transcript needs alterations.
- Possible alternatives. The platform also gives you an opportunity to make some alterations in cases where you are not completely satisfied with the results.
- Timestamp Generation. Powered by deep learning technologies, AWS Transcribe automatically generates time-stamped text transcripts.
This feature provides timestamps for every word, which makes it very easy to locate the audio in the original recording by searching for the text.
- Custom Vocabulary. AWS Transcribe allows you to create your own custom vocabulary. By creating and managing a custom vocabulary you expand and customize the speech recognition of AWS Transcribe.
Basically, custom vocabulary gives AWS Transcribe more information about how to process speech in the multimedia file.
This feature is very important in achieving high accuracy in transcriptions of specific use such as Engineering, Medical, Law Enforcement, Legal, etc.
- Multiple Speakers. AWS Transcribe platform can identify different speakers in a multimedia file. The platform can recognize when the speaker changes and attribute the transcribed text accordingly. Recognition of multiple speakers is handy when transcribing multimedia content that involves multiple speakers (such as telephone calls, meetings, etc.).
AWS Transcribe platform also allows you to specify the number of speakers you want to be identified in the multimedia file. The platform allows identification of up to 10 speakers.
The best performance is achieved when the number of speakers you ask it to identify matches the number of speakers in the multimedia content.
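Speaker identification and a custom vocabulary are both enabled through the job's `Settings`. Below is a sketch of the request parameters that boto3's `start_transcription_job` accepts; the job name, S3 URI, and vocabulary name are placeholders, and only the parameter keys follow the Amazon Transcribe API.

```python
def build_transcription_request(job_name, media_uri, num_speakers,
                                vocabulary_name=None):
    """Assemble kwargs for transcribe_client.start_transcription_job()."""
    if not 2 <= num_speakers <= 10:
        raise ValueError("Transcribe identifies between 2 and 10 speakers")
    settings = {"ShowSpeakerLabels": True, "MaxSpeakerLabels": num_speakers}
    if vocabulary_name:
        settings["VocabularyName"] = vocabulary_name
    return {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": media_uri},
        "MediaFormat": "mp3",
        "LanguageCode": "en-US",
        "Settings": settings,
    }

# Usage (requires AWS credentials, so commented out):
# import boto3
# transcribe = boto3.client("transcribe")
# transcribe.start_transcription_job(
#     **build_transcription_request(
#         "interview-042", "s3://my-bucket/interview-042.mp3", 2))
```

Matching `MaxSpeakerLabels` to the actual number of speakers, as noted above, is what gives the best attribution results.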
The best part of Amazon Transcribe, unlike the other transcription services discussed here, is that you pay as you go, based on the seconds of audio transcribed per month.
Amazon Transcribe API is billed monthly at a rate of $0.00056 per second. Usage is billed in one-second increments, with a minimum per request charge of 15 seconds.
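A quick check of that pricing, using the per-second rate and the 15-second minimum stated above:

```python
RATE_PER_SECOND = 0.00056        # USD, billed per one-second increment
MIN_BILLABLE_SECONDS = 15        # minimum charge per request


def transcribe_cost(audio_seconds):
    """Estimated charge in USD for a single transcription request."""
    billable = max(audio_seconds, MIN_BILLABLE_SECONDS)
    return billable * RATE_PER_SECOND

# A one-hour recording: 3600 s * $0.00056 = about $2.02.
# A 5-second clip still bills the 15-second minimum: about $0.0084.
```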
Thanks to all of these features, Amazon Transcribe may be considered a highly accurate transcription service. With its speed, accuracy, and price, it is one of the best, if not the best, player in the game.
Google Speech-to-Text
Google Speech-to-Text accepts multimedia content of different lengths and durations and returns results quickly. Thanks to Google's machine learning technology, the platform can also process real-time streaming or prerecorded audio content, including FLAC, AMR, PCMU, and Linear-16.
The platform recognizes 120 languages, which makes it much more advanced than the Amazon Transcribe platform in that respect.
However, despite this, Google still falls short on accuracy and price, compared to Amazon Transcribe platform.
Google Speech-to-Text accuracy improves over time as Google improves the internal speech recognition technology used by Google products. It includes:
- Automatic identification of the spoken language. Google employs this feature to automatically identify the language spoken in the multimedia content (out of 4 candidate languages you select) without any manual intervention.
- Automatic recognition of proper nouns and context-specific formatting. Google Speech-to-Text works well with real-life speech. It can accurately transcribe proper nouns and appropriately format language (such as dates and phone numbers).
- Phrase hints. Almost identical to Amazon's custom vocabulary, Google Speech-to-Text allows customization of context by providing a set of words and phrases that are likely to appear in the recording.
- Noise robustness. This feature of Google Speech-to-Text allows for noisy multimedia to be handled without additional noise cancellation.
- Inappropriate content filtering. Google Speech-to-Text is capable of filtering inappropriate content from text results for some languages.
- Automatic punctuation. Like Amazon Transcribe, this platform also adds punctuation to transcriptions.
- Speaker recognition. This feature is similar to Amazon’s recognition of multiple speakers. It makes automatic predictions about which of the speakers in a conversation spoke which part of the text.
Google Speech-to-Text costs $0.006 per 15 seconds, while the video model costs twice as much, at $0.012 per 15 seconds.
Considering speed, price, and accuracy, Google Speech-to-Text is definitely among the best in the industry. However, its features are mostly based on language rather than meaning and inference, which, for now, gives Amazon Transcribe the advantage in the game.
But, let’s move on and take a look at the other two transcription services.
IBM Watson Speech-To-Text
IBM Watson Speech-to-Text can transcribe speech from 7 different languages, though the service does not support every feature for all 7. For most languages, it supports two sampling-rate models: broadband and narrowband. It uses the broadband model for audio sampled at a minimum rate of 16 kHz and the narrowband model for audio sampled at a minimum rate of 8 kHz.
In addition to basic transcription, IBM Watson Speech-to-Text supports voice control of embedded systems, transcription of meetings and conference calls, and dictation of email and notes in real time.
When it comes to accuracy, IBM Watson speech-to-text pays special attention to:
- Keyword spotting. This feature enables search by a specific keyword. It basically identifies spoken phrases that match specific keyword strings.
- Speaker recognition. This feature is available for audio content in US English, Spanish or Japanese.
- Word alternatives. This feature lets you request alternative words that are acoustically similar to the words in the transcript.
- Word confidence. IBM Watson speech-to-text provides confidence levels for each word of a transcript.
- Word timestamps. The service also provides timestamps for the start and end of each word of a transcript.
- Profanity filtering. This feature censors profanity from US English transcripts.
The IBM Watson Speech-to-Text is priced at $0.02 per minute. This price applies to the use of both broadband and narrowband models.
IBM Watson Speech-to-Text has a wide range of possibilities. When it comes to accuracy, the features above say it all. IBM Watson Speech-to-Text is one of the most accurate transcription services.
However, not all of these features apply to all languages and, even more importantly, some of them are available only in beta. Measured against what you actually get, this makes IBM Watson Speech-to-Text considerably more expensive than the previous two transcription services.
Nuance Dragon Transcription
Nuance Transcription Engine can easily transcribe messages and conference calls in 43 different languages. Processing time depends on the length of the recording and the traffic on the server.
The service pays special attention to accuracy and for that matter includes the following features:
- Multi-speaker identification. Nuance Transcription Engine can recognize and transcribe up to six individual speakers.
- Customizable language models. This feature is very similar to Amazon Transcribe's custom vocabulary: it can identify various names using specialized vocabulary tools.
- Intelligent error correction. This transcribe service makes probability‑based suggestions for alternative words when the speech is too unclear to transcribe. This feature is very useful and significantly increases accuracy.
- Timestamps. Nuance Transcription Engine provides fully time-coded and stamped lines, which make the transcription clearer by making it possible to know who said what, and when, in a particular case.
Nuance Transcription Engine pricing starts at $150 as a one-time, lifetime license.
Although this transcription service is among the best on the market when it comes to accuracy, it differs considerably from the other transcription services in this comparison.
The major difference is that Nuance Transcription Engine focuses on transcribing voice messages and industry-specific content.
To be more specific, the Nuance Transcription Engine is one of the best, if not the best, medical transcription software in the world. Unfortunately, that means if you are not part of that industry, the accuracy of your transcriptions will not be as good as that of medical transcriptions.
Let’s Wrap Up
Research suggests the human brain retains only 10% of what we read and 20% of what we hear. That only underscores the need to derive value from multimedia content, and AI has proven to be the real deal when it comes to transcribing it.
Capturing and retrieving information from multimedia content using NLP and speech recognition has been the goal of artificial intelligence giants for the last decade, and their tools become more sophisticated every year.
In this comparison, I decided to include only the four transcription services that, by my research, are the best. I compared them on three factors: speed, accuracy, and price. Based on those factors, I found that:
- All four transcription services in the comparison have distinctive qualities that give them an advantage over other solutions on the market,
- They are all fast in processing and delivering results,
- They all show high accuracy of transcriptions,
- They all offer acceptable prices.
However, not all of them can equally respond to everyone’s needs. Take a good look at the comparison made above and decide which one will meet your needs best.
We at Armedia decided to rely on AWS and integrate Amazon Transcribe as part of our Armedia Legal Module for ArkCase.
What choice you make depends on your organization's requirements.
If you have any questions, do not hesitate to get in touch with us. Our team at Armedia is always at your service.
The world is full of data. And 90% of the world’s total data was created in the last 2 years.
Let this sink in for a moment: In the past 2 years, we’ve created more data than our human civilization produced since we started writing.
A lot of this data is multimedia files. Whether it is a simple customer call, a recording of a meeting, a judicial hearing, and so on, organizations generate enormous amounts of multimedia files.
Of course, large organizations need to somehow extract value from this stored multimedia content.
The Challenge of Human-Powered Multimedia Transcription Services
Although it may seem more natural for humans to take on the task of transcribing multimedia files, people cannot deliver the needed combination of accuracy and volume. Plus, there is security and privacy to think about.
I tend to look at this problem from a construction perspective. Imagine we're tasked with digging a 1,000-mile ditch in a perfectly straight line. Will we do it better, faster, and easier with 1,000 people, or with a mechanized approach? I'm pretty sure anyone who's come within 30 feet of a shovel will gladly call in the excavators.
Organizations that regularly do manual transcription of recordings say that it takes up to 9 hours of transcription work to transcribe a single hour of recorded interview. A 9:1 ratio is very hard to sustain at scale.
This means a 4-hour court hearing costs one worker a full workweek to transcribe. Multiply that by more hours, more hearings, in more courtrooms, every single day, and the problem becomes obvious: human-powered transcription is not scalable.
AI-powered transcription software, on the other hand, has become much easier to use, much faster and much more accurate than traditional transcription platforms. With any of the available software solutions for automated transcription, a single employee can cover a lot of ground, without the grind.
And one increasingly relevant consideration that makes a huge difference is privacy and security. Automated transcription software, where fewer people touch the data, is inherently more secure.
The Current Market Of AI-Based Transcription Software
The current market for AI-based transcription software is divided between established vendors and startups. The major difference between the two is their different approach to the market.
Startup companies sell transcription platforms as a service directly to consumers. Larger companies, on the other hand, tend to offer speech-to-text through an API, as part of a larger product, or as an enterprise-level offering. Some focus more on dictation than transcription. Others limit the computer's role and offer 'hybrid' human-plus-AI transcription services that require manual 'polishing' of the transcript. And some focus specifically on one or more subsets of users, like medical transcription or legal transcription.
Let me briefly introduce the transcription software solutions that aim at the enterprise level.
1. Amazon Transcribe
Amazon offers a transcription service called Amazon Transcribe with an API connection, enabling integration with third-party services.
It is an automatic speech recognition (ASR) service that makes it easy to add speech-to-text capability to AWS applications. It can be used for many common applications, including transcription and generating subtitles for multimedia content.
The AWS and Amazon Transcribe platform make speech-to-text conversion very simple. Just store the multimedia content you want converted in an Amazon S3 bucket, and you can pull the extracted text via an API call.
The process is very simple; the only things to worry about are to:
- Use proper URL when uploading the file, and
- Specify the format and the language of the input.
In return, the platform will provide a URI to get the results.
The results are saved in the S3 bucket and are identified by this user-specific URI you use to get your results.
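The retrieval step can be sketched as follows. The response shape mirrors the Amazon Transcribe job-status API; the job name and polling loop are illustrative assumptions.

```python
def transcript_uri(response):
    """Return the TranscriptFileUri once a job has completed, else None."""
    job = response["TranscriptionJob"]
    if job["TranscriptionJobStatus"] == "COMPLETED":
        return job["Transcript"]["TranscriptFileUri"]
    return None

# Usage (requires AWS credentials, so commented out):
# import boto3, time
# transcribe = boto3.client("transcribe")
# uri = None
# while uri is None:
#     resp = transcribe.get_transcription_job(
#         TranscriptionJobName="interview-042")
#     uri = transcript_uri(resp)
#     time.sleep(10)
```

Once the URI comes back, fetching it returns the JSON transcript stored in the S3 bucket, as described above.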
2. Google Cloud Speech-to-Text
Google Speech-to-Text enables audio-to-text conversion by applying powerful neural network models in an easy-to-use API.
You can enable voice command-and-control, transcribe audio from call centers, and more. It can process real-time streaming or prerecorded audio, using Google’s Machine Learning technology.
Google Cloud Speech-to-Text is a little more complicated than AWS Transcribe. Namely, to get started you need to send a speech recognition request to Cloud Speech-to-Text. The type of request you send depends on the length of the multimedia content you want to convert. Based on that, there are three types of requests:
- Synchronous Recognition requests. These are limited to audio content one minute long or less. This option sends audio to the Speech-to-Text API, which recognizes the content and returns results after all audio has been processed.
- Asynchronous Recognition requests. These requests are limited to audio content up to 180 minutes.
- Streaming Recognition requests. These requests stream recognition while the audio content is being captured, enabling results while the user is still speaking. This option is very useful for real-time recognition like capturing live audio content.
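The choice between the three request types can be sketched as a simple rule. The 60-second and 180-minute limits come from the descriptions above; the function itself is illustrative and not part of Google's client library.

```python
SYNC_LIMIT_SECONDS = 60            # synchronous requests: 1 minute or less
ASYNC_LIMIT_SECONDS = 180 * 60     # asynchronous requests: up to 180 minutes


def choose_request_type(duration_seconds, live_stream=False):
    """Pick the Cloud Speech-to-Text request type for a given input."""
    if live_stream:
        return "streaming"
    if duration_seconds <= SYNC_LIMIT_SECONDS:
        return "synchronous"
    if duration_seconds <= ASYNC_LIMIT_SECONDS:
        return "asynchronous"
    raise ValueError("split content longer than 180 minutes into chunks")
```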
The biggest advantage of Google’s transcription service is their access to a huge amount of the training audio necessary to build a great speech model. Google has massive audio content repositories like YouTube, which is a big part of why Google shows greater accuracy than the others. Every time someone uploads their video presentation and then edits the subtitled text, they’re essentially helping Google get better at transcription. And with so many active YouTube users who contribute content, Google definitely has the edge of crowd-supported machine learning loops.
3. IBM Watson
Just like AWS and Google, IBM’s speech-to-text service provides an API to add speech transcription capabilities to applications.
What is unique about IBM speech-to-text service is that it combines information about language structure with the composition of the audio signal.
IBM Watson transcription software allows two different modes of transcription:
- Transcribe the audio with a single result, meaning the platform returns one transcription.
- Transcribe the audio content with two possible alternatives.
One thing about the IBM Speech-to-Text API is that you work with it through cURL; that is, you use cURL commands to send your audio content for conversion to text.
The catch is that cURL comes preinstalled mainly on Linux-based and Unix-like systems. Windows users may need additional tooling to get Dr. Watson to speak Linux.
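For illustration, a cURL call to the Watson Speech-to-Text recognize endpoint typically looks like the sketch below. The API key, instance URL, and input file name are placeholders you must replace with your own service credentials.

```shell
# Placeholders: substitute your own IBM Cloud credentials before running.
APIKEY="your-api-key"
INSTANCE_URL="https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/your-instance-id"

# Build the command; hearing.flac is an example input file. The service
# responds with a JSON transcript on stdout.
CMD="curl -X POST -u apikey:${APIKEY} --header Content-Type:audio/flac --data-binary @hearing.flac ${INSTANCE_URL}/v1/recognize"

echo "$CMD"   # run it with: eval "$CMD" once the placeholders are filled in
```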
4. Nuance Dragon Speech Recognition Solutions
Nuance's speech-to-text service is an AI-based transcription technology offered to both businesses and consumers. Their NaturallySpeaking software is aimed at single users, while their Dragon line of products is specialized for different industries.
Nuance Dragon also offers business-specific tools for law enforcement, education, financial services, and more. Although it might seem a little far off from the previous three services, Nuance was actually the first to make the move toward transcription, back in the '60s. Until 2014, Nuance's technology powered Siri, Apple's voice-activated mobile assistant.
Because Nuance has built training into their products like the Law Enforcement and Legal solutions, this brand is pushing toward end-user acceptance.
For their Legal Transcription Service, they state that their system is trained on legal terminology using over 400 million words from legal documents, which they say helps their transcriptions come back with 99% accuracy.
Another end-user-oriented pitch is their Law Enforcement solution. Their website states that law enforcement officers can use dictation in real time to create incident reports, complete mandatory paperwork, and even use it in the field for quick searches for people, license plates, and so on.
So, Which Is Best?
With so much multimedia that large organizations produce, it’s difficult for traditional, human-powered transcription services to keep up.
This is why companies like Amazon, Google, and IBM have come up with their AI-powered transcription software. Some, like Nuance, went a step further and offer end-user-oriented transcription packs.
The “Which Is Best” question is hard to answer. And, I don’t think there’s one answer that will suit everybody.
Which AI-powered transcription software solution is best for you, will mostly depend on your needs.
So to answer this question, you'll want to start with your technical requirements, then your interest in using high-code or low-code solutions, and, most importantly, your budget constraints.
This is a complex issue, and I’ll be covering the market of AI-powered transcription software in more detail. So, make sure you follow us on Facebook, Twitter, and LinkedIn to get notified when a more in-depth coverage hits the web.
In the meantime, please take a moment to let me know if you have any comments, questions or ideas on the topic.