We recently had the opportunity to test and evaluate (T&E) the accuracy of a machine transcription (MT) product in our lab. Two challenges stood out: finding a test corpus we could trust and defining a fair scoring methodology.
The difficulty in finding a trustworthy test corpus was ensuring that every spoken – and misspoken – word appeared in the transcript; the transcript of each audio file tested had to be word-for-word accurate. We found plenty of translations of audio files, but translations are not transcriptions and would not work for this T&E effort.
The solution came from language lessons at the Defense Language Institute Foreign Language Center (DLIFLC), which pair audio samples with exact transcriptions. These lessons gave us plenty of audio with trusted transcriptions in numerous foreign languages.
The problem of scoring the accuracy of machine transcriptions was solved with the Word Error Rate (WER) metric. WER accounts for three kinds of errors an MT product might make:
- Word substitution (S) – substituting an incorrect word or a homophone
- Word insertion (I) – inserting a word that was not spoken
- Word deletion (D) – omitting a word that was spoken
Together with the total number of words spoken (N), these counts yield the error metric:

WER = (S + D + I) / N

From this, an accuracy measure, Accuracy = 1 − WER, allowed the comparison of MT product performance, or the performance of different language models within the same product.
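For the curious, here is a minimal Python sketch of the WER calculation using a standard Levenshtein alignment. sclite's alignment is more sophisticated, so treat this as illustrative only:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via a standard Levenshtein (edit-distance) alignment."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution or match
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    # The total edit distance is S + D + I; N is the reference length.
    return d[len(ref)][len(hyp)] / len(ref)

# Example: one substitution ("there" for "their") and one deletion ("all")
print(wer("their dog ate all the food", "there dog ate the food"))  # 2/6 ≈ 0.333
```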
To calculate S, D, I, and N, the team utilized the sclite utility from NIST, a tool designed specifically for scoring ASR and MT output. The tool expects its input files in one of three structured formats: conversation time-marked (CTM), segment time-marked (STM), or transcript (TRN). CTM files carry a time mark for each word, STM files mark each segment, and TRN files simply pair each utterance's text with an identifier. You can learn more about these file formats and ways to produce them at the sclite homepage.
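For orientation, the three layouts look roughly like this (the file names, speaker labels, and times below are purely illustrative):

```
# CTM: one word per line, with start time and duration
lesson01 1 0.42 0.31 hello
lesson01 1 0.75 0.40 welcome

# STM: one segment per line, with begin and end times
lesson01 1 spkr1 0.42 4.80 hello welcome to the lesson

# TRN: one utterance per line, tagged with an utterance id
hello welcome to the lesson (lesson01-utt01)
```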
The test team opted for a slightly easier approach. We discovered that if we removed all punctuation, extraneous spaces, and carriage returns, and formatted each reference file as a single line – one utterance – the sclite tool aligned and scored the transcript correctly. We validated this approach by reformatting a few of the STM-formatted files provided by the product vendor and reproduced their results to within 2%.
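A minimal sketch of that normalization step in Python, assuming UTF-8 plain-text transcripts; the file names and the utterance-id convention here are our own for illustration, not anything sclite mandates:

```python
import re
import string

def to_single_utterance(raw_text: str, utt_id: str) -> str:
    """Collapse a reference transcript to one punctuation-free line
    in TRN format: 'word word word ... (utt_id)'."""
    # Note: string.punctuation covers ASCII only; non-Latin scripts
    # may need a wider strip set.
    text = raw_text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\s+", " ", text).strip()  # drop extra spaces and newlines
    return f"{text.lower()} ({utt_id})"

with open("lesson01_ref.txt", encoding="utf-8") as f:
    line = to_single_utterance(f.read(), "lesson01")
with open("lesson01_ref.trn", "w", encoding="utf-8") as f:
    f.write(line + "\n")
```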
This T&E procedure proved successful, and every stage – preparing the reference transcript files, running the MT product, and scoring with the sclite tool – could be automated. In all, the team tested 480 files across four different languages.
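Automating the scoring loop can be as simple as shelling out to sclite for each reference/hypothesis pair. The directory layout and the -i rm utterance-id convention below are assumptions for illustration, not something the pipeline dictates:

```python
import subprocess
from pathlib import Path

REF_DIR = Path("refs")  # normalized *.trn reference files
HYP_DIR = Path("hyps")  # matching *.trn hypothesis files from the MT product

for ref in sorted(REF_DIR.glob("*.trn")):
    hyp = HYP_DIR / ref.name
    if not hyp.exists():
        continue
    # -r/-h name the reference and hypothesis files (both in trn format),
    # -i rm selects the utterance-id convention, and -o sum stdout prints
    # the summary report (S, D, I counts and WER) to stdout.
    subprocess.run(
        ["sclite", "-r", str(ref), "trn", "-h", str(hyp), "trn",
         "-i", "rm", "-o", "sum", "stdout"],
        check=True,
    )
```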
The approach to obtaining ground-truth reference test data and the use of an industry-standard test tool are just small examples of Armedia’s innovative capabilities in the T&E arena.