Whisper
12 Dec 2022
OpenAI recently open-sourced their “Whisper” neural net for automatic speech recognition. While this is interesting in its own right, the fact that this is an open-source model that can be run in our own infrastructure may open up some deployment options that were not previously available. Let’s take a closer look.
Getting started with Whisper
Installation
Installing Whisper is much easier than I had feared and basically consists of these three steps:
- Create a virtual environment with Python 3.9.9. This is the recommended Python version for Whisper, and even though this is over a year old, better safe than sorry.
  conda create -n whisper python=3.9.9 anaconda
- Install PyTorch into the environment.
  conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
- Install Whisper from its GitHub repo.
  pip install git+https://github.com/openai/whisper.git
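With those steps done, a quick sanity check can be run from Python before moving on to the command-line tool. The snippet below is only a minimal sketch; the model size and audio file name are placeholders:

import whisper

# Load one of the pre-trained models; "base" is a small, fast starting point.
model = whisper.load_model("base")

# Transcribe a local audio file (Whisper uses ffmpeg to load and resample it).
result = model.transcribe("1st_english.wav")
print(result["text"])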
First experiments
I wanted to see how well Whisper works in different languages, so I made two short recordings to use for my initial experiments, one in English - “Hello and welcome to Whisper. This is my first demo.”…
…and one in Danish - “Lad os prøve på dansk og se om den kan oversætte det korrekt.” (“Let’s try in Danish and see if it can translate it correctly.”)
Processing the English audio file through Whisper gave this result:
> whisper 1st_english.wav
100%|███████████████████████████████████████| 461M/461M [00:16<00:00, 28.6MiB/s]
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:04.000] Hello and welcome to Whisper. This is my first demo.
Flawless transcription! Also notice that on this first run Whisper automatically downloaded the required model data. This also meant that the first run took significantly longer than subsequent runs.
Now for the Danish audio:
> whisper 1st_danish.wav
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:04.000] There's Pro Podensk, I'll see how I'm doing, how was it to go out?
That’s a terrible transcription, but notice how it reported the detected language as English. Let’s try again but this time tell it that the audio is Danish using the --language argument:
> whisper 1st_danish.wav --language Danish
[00:00.000 --> 00:04.000] Deres proberdensk, og se om den kører sit dekvært.
This is still terrible. In fact I don’t think any Dane reading this transcription would be able to guess what was actually said.
At this point I went back and actually listened to the recordings I made. They are very noisy, but not noisy enough that a native listener wouldn’t understand the content. I tried processing the Danish audio with Google Cloud Speech-to-Text, and it transcribed it flawlessly. Perhaps Whisper is just more sensitive to noise than other speech recognition systems?
Noise reduction
For my first attempt at improving recognition of the Danish audio, I used Audacity to reduce the noise. This is a semi-manual process: you first select a portion of the audio that should be silent, that is, one that contains only noise, so that Audacity can analyse the noise that is present. In a second step you select the audio to clean up and apply noise reduction based on that analysis. This generally leads to quite good results, but cannot be done unattended. Here is the resulting audio:
Whisper produced this transcription:
>whisper 1st_danish_nr_audacity.wav --language Danish
[00:00.000 --> 00:04.000] Der er pro-potensk, og se om den kører siddende korrekt.
While this is a little better, it’s still not good enough to be useful.
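As an aside, the same two-step idea - analyse a noise-only stretch, then subtract that profile from the full signal - can also be scripted so it runs unattended. The sketch below uses the third-party noisereduce package, which is not what was used for the results above; it is only meant to illustrate the approach:

# Sketch of scripted, profile-based noise reduction using the third-party
# noisereduce package (illustration only, not what produced the results above).
import noisereduce as nr
from scipy.io import wavfile

rate, data = wavfile.read("1st_danish_mono.wav")
data = data.astype("float32")

# Step 1: take a stretch assumed to contain only noise (here: the first half second).
noise_clip = data[: rate // 2]

# Step 2: reduce noise across the whole recording based on that noise profile.
cleaned = nr.reduce_noise(y=data, sr=rate, y_noise=noise_clip)
wavfile.write("1st_danish_nr_script.wav", rate, cleaned.astype("int16"))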
Since Audacity’s noise reduction method is unsuited for real-time use, I tried using GStreamer’s noise reduction from webrtcdsp using this command line:
>gst-launch-1.0 filesrc location=1st_danish_mono.wav ! wavparse ! audioconvert ! audioresample ! webrtcdsp echo-cancel=false extended-filter=false gain-control=false high-pass-filter=false ! wavenc ! filesink location=1st_danish_nr_gstreamer.wav
Use Windows high-resolution clock, precision: 1 ms
Setting pipeline to PAUSED ...
Pipeline is PREROLLING ...
Pipeline is PREROLLED ...
Setting pipeline to PLAYING ...
Redistribute latency...
New clock: GstSystemClock
Got EOS from element "pipeline0".
Execution ended after 0:00:00.023411500
Setting pipeline to NULL ...
Freeing pipeline ...
This produced the following audio, which is - to my ears - not quite as clean as what Audacity produced, but noticeably less noisy than the original:
From this, Whisper produced:
>whisper 1st_danish_nr_gstreamer.wav --language Danish
[00:00.000 --> 00:04.000] Deres proberdensk, og se om den kører sit dekvært.
Sadly this is precisely the same transcription we got from the original noisy audio.
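For completeness, the same pipeline can also be driven from code rather than gst-launch-1.0, which is closer to how it would be embedded in an application. The following is a sketch using GStreamer’s Python bindings, assuming PyGObject and the webrtcdsp plugin are installed:

# The same noise-reduction pipeline driven from Python via GStreamer's bindings.
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

pipeline = Gst.parse_launch(
    "filesrc location=1st_danish_mono.wav ! wavparse ! audioconvert ! audioresample "
    "! webrtcdsp echo-cancel=false extended-filter=false gain-control=false "
    "high-pass-filter=false ! wavenc ! filesink location=1st_danish_nr_gstreamer.wav"
)

pipeline.set_state(Gst.State.PLAYING)

# Wait until the file has been fully processed (EOS) or an error occurs.
bus = pipeline.get_bus()
bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE,
                       Gst.MessageType.EOS | Gst.MessageType.ERROR)
pipeline.set_state(Gst.State.NULL)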
Better audio recording
Let’s see if we can do better by making a cleaner recording to begin with. The following is another recording of me saying the same thing, but this time made in a quieter environment, which is immediately obvious when you hear the recording:
This time Whisper produced this transcription:
>whisper 2nd_danish_mono.wav --language Danish
[00:00.000 --> 00:04.000] Lad os prøve det på dansk og se om den kan oversætte det korrekt.
Flawless transcription in Danish! This illustrates that Whisper can produce impressive transcription results even in “small” languages like Danish provided that the recorded audio is relatively free of noise.
Translation
One of Whisper’s interesting features is that it can not only perform transcription but can actually perform translation into English at the same time. To use this feature all we have to do is add the --task translate argument. Let’s try this with the Danish audio:
>whisper 2nd_danish_mono.wav --language Danish --task translate
[00:00.000 --> 00:04.000] Let's try it in Danish and see if it can translate correctly.
This is a correct English translation, and we got it from the Danish audio in a single step. Quite impressive! And this is not limited to Danish audio. Whisper can do this for all its supported languages. Their blog post calls this “Any-to-English speech translation”. Do note, however, that Whisper only supports translation into English - at least at the time of writing.
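For reference, the same translation can be requested through the Python API by passing the task and language as options to transcribe - a sketch, with the model size chosen arbitrarily:

import whisper

model = whisper.load_model("small")

# task="translate" makes the model translate into English while transcribing;
# language="da" skips automatic language detection.
result = model.transcribe("2nd_danish_mono.wav", task="translate", language="da")
print(result["text"])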
Real-time usage
So far, we have only seen Whisper used in what you might call “batch mode”, where an audio file is prepared and then processed by Whisper in a separate step and as a whole. Whisper does not directly support real-time processing, where you feed a live audio signal to the model and get back live transcription results. However, several efforts are underway to repurpose Whisper for real-time processing, and I plan to look into this in the future. Real-time transcription would make it possible to use Whisper in many more settings, for instance with digital humans.
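To make the idea concrete, the most naive approach is to record short fixed-length chunks and transcribe each one as it completes. The sketch below assumes the third-party sounddevice package for microphone capture; a proper streaming implementation would need overlapping buffers and smarter segment handling, which is what the efforts mentioned above aim to solve:

# Naive chunked "real-time" transcription: record a few seconds, transcribe, repeat.
# Sketch only - there are gaps between chunks and words can be cut in half.
import sounddevice as sd
import whisper

model = whisper.load_model("base")
SAMPLE_RATE = 16000      # Whisper expects 16 kHz mono audio
CHUNK_SECONDS = 5

while True:
    chunk = sd.rec(int(CHUNK_SECONDS * SAMPLE_RATE),
                   samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()            # block until the chunk has been recorded
    result = model.transcribe(chunk.flatten(), fp16=False)  # fp16=False avoids a warning on CPU
    print(result["text"])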
Advantages
Whisper’s machine learning models are quite large, and in practice it needs a GPU with CUDA support to run at a useful speed, so what might be the advantage of running Whisper in your own infrastructure vs. using one of the available cloud speech recognition services from e.g. Google or Azure? By hosting your own speech recognition engine you can potentially deploy it closer to the other components of your system and thereby reduce latency. This could be important in real-time applications such as digital humans, where users’ overall perceived latency is a critical factor in usability. Hosting your own engine could also prove advantageous in terms of privacy - think GDPR. If you can avoid sending audio to a third party, you could reduce your compliance risk.