June 14, 2023
Here at Cortico, we empower our users to make sense of small group community conversations. As part of this process, users upload their conversations to our platform and the audio is transcribed using a combination of machine and manual transcription. Machine transcription is primarily used for word-level timings and to create a temporary transcript, while partners engage in manual transcription to further process conversations. Although the unique combination of machine learning and human listening is the cornerstone of Cortico’s work, the manual transcription process is very expensive.
Thus we periodically evaluate alternative machine transcription services to see if their accuracy is good enough to replace manual transcription for some conversations. OpenAI’s Whisper got a lot of notice when it came out last fall, and claims to “approach human level robustness and accuracy on English speech recognition.” We were interested to see whether it lives up to the hype.
Tl;dr I don’t think it does, but that’s not the (most) interesting part:
Whisper inserts the occasional completely made-up yet plausible-sounding phrase into the transcription, sometimes even changing the meaning of the sentence.
Currently, we use AssemblyAI for our machine transcription. Although we have evaluated other transcription services as well, I’m just going to focus here on how AssemblyAI compares with Whisper. We compare transcriptions from these services against those of our manual transcribers. The typical metric used in such comparisons is word error rate (WER).
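For concreteness, here is a minimal sketch of the kind of WER calculation involved, using the open source jiwer package. The file names and the light normalization are illustrative assumptions, not our exact pipeline.

```python
# Illustrative WER computation with the open source jiwer package. The file
# names and the normalization below are assumptions, not our actual pipeline.
import re

import jiwer


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace before scoring."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return re.sub(r"\s+", " ", text).strip()


reference = normalize(open("manual_transcript.txt").read())
hypothesis = normalize(open("machine_transcript.txt").read())

# WER = (substitutions + deletions + insertions) / number of reference words
print(f"WER: {jiwer.wer(reference, hypothesis):.4f}")
```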
Word error rate has its limitations, though, so I also manually inspected the results of automatically transcribing 13 of our public conversations. In some cases, when the audio quality was very poor or there was significant crosstalk, I had a difficult time figuring out what was actually said.
Whisper comes in a number of model sizes. All the results reported here are for the large V2 model. The conversations are all public and linked to below.
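As a point of reference, running the open source model locally with the whisper Python package looks roughly like this (the audio path is a placeholder, and I’m not claiming these are our exact decoding settings):

```python
# Minimal sketch of transcribing one conversation with the open source
# whisper package; "large-v2" is the checkpoint evaluated in the table below.
import whisper

model = whisper.load_model("large-v2")         # downloads the checkpoint on first use
result = model.transcribe("conversation.wav")  # placeholder path
print(result["text"])
```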
| Conversation | AssemblyAI WER | Whisper Large V2 WER | Notes |
| --- | --- | --- | --- |
| 1720 | 0.078 | 0.1614 | Apparent increased WER mostly due to disfluencies |
| 1964 | 0.0733 | 0.162 | Manual inspection showed that Whisper did much worse on this one |
| 2201 | 0.1712 | 0.1844 | |
| 2275 | 0.0388 | 0.048 | |
| 2368 | 0.053 | 0.1012 | Apparent increased WER mostly due to disfluencies |
| 2694 | 0.1184 | 0.1386 | |
| 2719 | 0.1238 | 0.197 | |
| 2774 | 0.0582 | 0.1059 | Apparent increased WER partially due to disfluencies |
| 2778 | 0.1006 | 0.1135 | |
| 2868 | 0.1273 | 0.1384 | |
| 2960 | 0.4467 | 65.34 | Extremely poor audio quality; manual transcribers also had a hard time. Despite the significant WER difference, it was hard for me to tell which was better |
On the whole, you can see Whisper does somewhat worse than AssemblyAI. Some part of that, although not all, is due to the disfluencies. Far more concerning, though, is that there are many cases where Whisper just makes things up. We’ll refer to these as confabulations (also commonly called hallucinations), as referenced here.
Below are examples of such confabulations, comparing the manual transcription against Whisper. The words in red were added by Whisper. (In the case of the added “so”s and “and”s, they were actually said and our manual transcription service omitted them; it is the larger blocks of text that were made up.)
| Audio | Manual transcription | Whisper |
| --- | --- | --- |
| Click here to listen | You know what I mean? But you know the saying, “You’re damned if you do, damned if you don’t.” | You know what I mean? So, but, you know, I’m not a scientist. You know, the saying you’re damned if you do it, damned if you don’t. |
| Click here to listen | And my son’s parent coordinator was the one that really helped me when I caught COVID and I was severely sick from it. | And my son’s parent coordinator, my son’s elementary school parent coordinator, and my school district coordinator was the one that really helped me when I caught COVID and I was severely sick from it. |
| Click here to listen | The nutrition needs to be a focus of the State and it’s not. I see that every day. | The nutrition needs to be a focus of the community. It’s not a focus of the state and it’s not. And I see that every day. |
| Click here to listen | But I think it’s the breakdown of the American family. | But I think it’s the breakdown of the environment and the American family. |
| Click here to listen | Yeah. So my name is Hector and yeah, I’m the portal curator for Johannesburg and Johannesburg, our community is located on the outskirts of the city, the urban area, and in a precinct called Victoria Yards. | And yeah, so my name is Hector and yeah, I’m the portal curator for the new book, The Art of Victoria. And I’m also the curator for Johannesburg and Johannesburg or our community is located on the outskirts of the city, the urban area and in a precinct called Victoria Yards. |
Large language models are good at creating plausible-sounding text, regardless of accuracy, and that’s certainly happening here. These made-up sentences all seem pretty reasonable! But I really didn’t expect it to happen in a use case like machine transcription.
In the cases of inserted text, Whisper also gives unrealistic word timings. For the inserted phrase, “And then you know, I think it’s also important to think about the future of our community,” Whisper reports that the entire sentence took just over half a second. Whisper does provide word confidences for its transcriptions, and for these confabulations the confidence does tend to be low, but it is still quite a surprise that this happens at all.
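As a rough illustration of how those word timings and confidences can be inspected with the open source package (this is a sketch, not the check we actually run, and the cutoffs are arbitrary, untuned guesses):

```python
# Sketch of flagging suspicious segments by word timing and word probability.
# The 6 words/second and 0.5 probability cutoffs are arbitrary, untuned guesses.
import whisper

model = whisper.load_model("large-v2")
result = model.transcribe("conversation.wav", word_timestamps=True)  # placeholder path

for segment in result["segments"]:
    words = segment.get("words", [])
    if not words:
        continue
    duration = words[-1]["end"] - words[0]["start"]
    rate = len(words) / duration if duration > 0 else float("inf")
    avg_prob = sum(w["probability"] for w in words) / len(words)
    if rate > 6.0 or avg_prob < 0.5:
        print(f"suspicious ({rate:.1f} words/sec, avg prob {avg_prob:.2f}):")
        print("   ", segment["text"].strip())
```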
I am certainly not the only person to have noted these confabulations. There are a lot of conversations about this on Whisper’s discussion board, although most of them mention either repeated words and phrases, or text inserted during a gap in the audio. Some of the time, though, the extraneous text is shoved in where there isn’t a gap, although there might be other gaps nearby. People on the discussion board discuss workarounds, but because our team is small, we really want a plug-and-play solution, and I imagine many other people are in a similar situation.
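For completeness, the workarounds people suggest tend to involve tweaking decoder settings along these lines. The options below do exist in the open source package, but whether they help is very audio-dependent, and we have not adopted them.

```python
# Commonly suggested tweaks from those discussion threads (a sketch, not a fix
# we endorse): stop conditioning on previously decoded text, which can reduce
# repetition loops, and decode greedily.
import whisper

model = whisper.load_model("large-v2")
result = model.transcribe(
    "conversation.wav",                # placeholder path
    condition_on_previous_text=False,  # don't feed earlier output back into the decoder
    temperature=0.0,                   # greedy decoding; skips the temperature fallback ladder
    no_speech_threshold=0.6,           # library default, shown explicitly
    logprob_threshold=-1.0,            # library default, shown explicitly
)
print(result["text"])
```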
So here’s my warning to you: at this time, if you need your transcriptions to be accurate without made-up text that sounds entirely plausible—but might change the meaning of an utterance—don’t use the open source Whisper model. (I can’t speak to their API, but wouldn’t expect it to be better.)
But if you still really want to use Whisper, Deepgram’s API includes a version of Whisper with some secret-sauce tweaks. On our dataset, Deepgram’s version of Whisper does not have the made-up text problem and otherwise typically outperforms Whisper run locally. It also significantly outperforms their Nova model on our conversations, but neither outperforms AssemblyAI. So for now, we’re sticking with AssemblyAI.
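For anyone who wants to try that route, a request to Deepgram’s hosted Whisper looks roughly like this. The model name and response parsing follow their public docs as I understand them, so treat the details as assumptions rather than a recipe.

```python
# Sketch of sending prerecorded audio to Deepgram's hosted Whisper. The model
# name ("whisper-large") and the response shape are taken from Deepgram's public
# docs as I understand them; swap in "nova" to target their Nova model instead.
import requests

with open("conversation.wav", "rb") as f:  # placeholder path
    audio = f.read()

resp = requests.post(
    "https://api.deepgram.com/v1/listen",
    params={"model": "whisper-large"},
    headers={
        "Authorization": "Token YOUR_DEEPGRAM_API_KEY",  # placeholder key
        "Content-Type": "audio/wav",
    },
    data=audio,
)
resp.raise_for_status()
print(resp.json()["results"]["channels"][0]["alternatives"][0]["transcript"])
```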