Evaluating OpenAI’s Whisper on Community Conversations

June 14, 2023

Here at Cortico, we empower our users to make sense of small group community conversations. As part of this process, users upload their conversations to our platform and the audio is transcribed using a combination of machine and manual transcription. Machine transcription is primarily used for word-level timings and to create a temporary transcript, while partners engage in manual transcription to further process conversations. Although the unique combination of machine learning and human listening is the cornerstone of Cortico’s work, the manual transcription process is very expensive.

Thus we periodically evaluate alternative machine transcription services to see if their accuracy is good enough to replace the manual transcription for some conversations. OpenAI’s Whisper got a lot of notice when it came out last fall, and claims to “approach human level robustness and accuracy on English speech recognition.” We were interested to see if OpenAI’s Whisper lives up to the hype.

TL;DR: I don’t think it does, but that’s not the (most) interesting part:

Whisper inserts the occasional completely made-up yet plausible-sounding phrase into the transcription, sometimes even changing the meaning of the sentence.

Analyzing Machine Transcription Accuracy

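Currently, we use AssemblyAI for our machine transcription. Although we have evaluated other transcription services as well, I’m just going to focus on how AssemblyAI compares with Whisper. We compare transcriptions from these services against those of our manual transcribers. The typical metric used in such comparisons is word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the machine transcript into the manual one, divided by the number of words in the manual transcript. Because inserted words count as errors, WER can exceed 1.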
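For readers who want to see the metric concretely, here is a minimal sketch of a WER calculation using a word-level edit distance. It is not the scoring code used for the numbers in this post; the function names and the simple lowercase-and-strip-punctuation tokenization are assumptions made for illustration.

```python
import re


def tokenize(text: str) -> list[str]:
    # Assumed normalization: lowercase, keep word characters and apostrophes.
    return re.findall(r"[\w'’]+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = tokenize(reference)
    hyp = tokenize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


manual = "But I think it's the breakdown of the American family."
machine = "But I think it's the breakdown of the environment and the American family."
print(word_error_rate(manual, machine))
```

In this example the machine transcript adds three words to a ten-word reference, so every error is an insertion. A transcript that inserts more words than the reference contains will score above 1.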

There are some limitations to this approach:

  • The level of audio quality and crosstalk varies significantly across our conversations, which take place either in person or on Zoom.
  • When people speak, they typically use filler words, such as um or like, repetition, such as “I was I was going,” and repairs, where someone corrects themselves mid-sentence. Our manual transcribers remove these disfluencies, and AssemblyAI removes many of these as well. This makes for a more readable transcript. However, although Whisper removes the ums and uhs, it transcribes most of the other disfluencies. It’s a more faithful representation of what was said, so there’s certainly an argument to be made for transcribing in this fashion, but it will raise the word error rate as calculated. Some conversations have a lot more disfluencies than others.
  • Interestingly, Whisper translates other languages into English if they appear in a conversation it has determined is in English, and some of the conversations that I looked at have significant non-English portions. Our manual transcription service transcribes these portions as ‘[foreign language]’, so this increases the error rate as calculated. I left the two multilingual conversations out of the table below.
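One way to reduce the disfluency effect described above is to normalize both transcripts before scoring. The sketch below is an illustration only and was not used for the results in this post: the filler list, the function name, and the rule of collapsing immediately repeated word sequences are all assumptions. It does not handle repairs, and it will also collapse deliberate repetition such as "very very".

```python
import re

# Assumed filler list; "like" is left out because it is often a real word.
FILLERS = {"um", "uh", "er", "hmm"}


def strip_disfluencies(text: str, max_ngram: int = 3) -> list[str]:
    """Drop filler words and collapse immediately repeated word sequences."""
    words = [
        w for w in re.findall(r"[\w'’]+", text.lower()) if w not in FILLERS
    ]
    out: list[str] = []
    for word in words:
        out.append(word)
        # If the last n words repeat the n words before them, drop the repeat.
        # Longer repeats are checked first.
        for n in range(max_ngram, 0, -1):
            if len(out) >= 2 * n and out[-n:] == out[-2 * n:-n]:
                del out[-n:]
                break
    return out


print(strip_disfluencies("I was I was going"))
```

Applying the same normalization to the manual and machine transcripts before computing WER would keep repetitions from counting as insertions, at the cost of hiding how faithfully each service captured what was actually said.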

Because of these limitations of using word error rate, I also manually inspected the results of automatically transcribing 13 of our public conversations. Even so, when the audio quality was very poor or there was significant crosstalk, I sometimes had a difficult time figuring out what was actually said.

Whisper comes in a number of different models of varying sizes. All the results reported here are for its large V2 model. These are all public conversations and are linked below.

Conversation | Assembly WER | Whisper Large V2 WER | Notes
1720 | 0.078 | 0.1614 | Apparent increased WER mostly due to disfluencies
1964 | 0.0733 | 0.162 | Manual inspection showed that Whisper did much worse on this one
2368 | 0.053 | 0.1012 | Apparent increased WER mostly due to disfluencies
2774 | 0.0582 | 0.1059 | Apparent increased WER partially due to disfluencies
2960 | 0.4467 | 65.34 | Extremely poor audio quality, manual transcribers also had a hard time. Despite the significant WER difference, it was hard for me to tell which was better

On the whole, you can see Whisper does somewhat worse than AssemblyAI. Some part of that, although not all, is due to the disfluencies. Far more concerning, though, is that there are many cases where Whisper just makes things up. We’ll refer to these as confabulations (also commonly called hallucinations).

Below are examples of such confabulations, comparing the manual transcription against Whisper. The words in red were added by Whisper. (The added instances of “so” and “and” were actually said, and our manual transcription service omitted them; it is the larger blocks of text that were made up.)

Audio | Manual transcription | Whisper
Click here to listen | You know what I mean? But you know the saying, “You’re damned if you do, damned if you don’t.” | You know what I mean? So, but, you know, I’m not a scientist. You know, the saying you’re damned if you do it, damned if you don’t.
Click here to listen | And my son’s parent coordinator was the one that really helped me when I caught COVID and I was severely sick from it. | And my son’s parent coordinator, my son’s elementary school parent coordinator, and my school district coordinator was the one that really helped me when I caught COVID and I was severely sick from it.
Click here to listen | The nutrition needs to be a focus of the State and it’s not. I see that every day. | The nutrition needs to be a focus of the community. It’s not a focus of the state and it’s not. And I see that every day.
Click here to listen | But I think it’s the breakdown of the American family. | But I think it’s the breakdown of the environment and the American family.
Click here to listen | Yeah. So my name is Hector and yeah, I’m the portal curator for Johannesburg and Johannesburg, our community is located on the outskirts of the city, the urban area, and in a precinct called Victoria Yards. | And yeah, so my name is Hector and yeah, I’m the portal curator for the new book, The Art of Victoria. And I’m also the curator for Johannesburg and Johannesburg or our community is located on the outskirts of the city, the urban area and in a precinct called Victoria Yards.

Large language models are good at creating plausible-sounding text, regardless of accuracy, and that’s certainly happening here. These made-up sentences all seem pretty reasonable! But I really didn’t expect it to happen in a use case like machine transcription.

In the cases of inserted text, Whisper also gives unrealistic word timings. For the inserted phrase, “And then you know, I think it’s also important to think about the future of our community,” Whisper says that this entire sentence took just over half a second. Whisper does provide word confidences for its transcriptions, and for these confabulations, the confidence does tend to be low, so it is quite a surprise that this is happening.
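The unrealistic timings suggest a simple screening heuristic: flag runs of words that were supposedly spoken faster than anyone can talk. The sketch below is one possible version, not something we run in production. It assumes each word is a dict with "start" and "end" times in seconds (the field names, window size, and threshold are assumptions to adapt to your transcription output), and a similar check could be applied to word confidences.

```python
def flag_rushed_spans(words, window=8, min_seconds_per_word=0.1):
    """Return (start, end) index pairs of word runs spoken implausibly fast.

    `end` is exclusive. A window of `window` consecutive words is flagged
    when its average time per word falls below `min_seconds_per_word`.
    """
    flagged = []
    for i in range(len(words) - window + 1):
        span = words[i:i + window]
        duration = span[-1]["end"] - span[0]["start"]
        if duration / window < min_seconds_per_word:
            flagged.append((i, i + window))

    # Merge overlapping or touching windows into single spans.
    merged = []
    for start, end in flagged:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


# Hypothetical data: 17 words crammed into about half a second.
rushed = [
    {"word": f"w{k}", "start": 10.0 + 0.03 * k, "end": 10.0 + 0.03 * (k + 1)}
    for k in range(17)
]
print(flag_rushed_spans(rushed))
```

The default threshold of 0.1 seconds per word corresponds to ten words per second, which is well above conversational speaking rates, so ordinary fast talkers should not trip it. Flagged spans still need a human listen, since the heuristic only says the timing is implausible, not that the text is wrong.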

I am certainly not the only person to have noted these confabulations. There are a lot of conversations about this on Whisper’s discussion board, although most of these mention either multiple repeats of words or phrases or inserting text during a gap in audio. Some of the time, the extraneous text is shoved in where there aren’t gaps, although there might be other gaps nearby. People on the discussion board discuss workarounds, but because our team is small, we really want a plug-and-play solution, and I imagine there are many other people in a similar situation. 

So here’s my warning to you: at this time, if you need your transcriptions to be accurate without made-up text that sounds entirely plausible—but might change the meaning of an utterance—don’t use the open source Whisper model. (I can’t speak to their API, but wouldn’t expect it to be better.)

But if you still really want to use Whisper, Deepgram’s API includes a version of Whisper with some secret sauce tweaks. On our dataset, Deepgram’s version of Whisper does not have the made-up text problem and otherwise typically outperforms Whisper run locally. Deepgram’s version of Whisper also significantly outperforms their Nova model on our conversations, but neither outperforms AssemblyAI. So for now, we’re sticking with AssemblyAI.
