OpenAI Whisper - An open source speech recognition model with multilingual support

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is a multitask model that can perform multilingual speech recognition, speech translation, and language identification.

A Transformer sequence-to-sequence model is trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. All of these tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing for a single model to replace many different stages of a traditional speech processing pipeline. The multitask training format uses a set of special tokens that serve as task specifiers or classification targets.

The code and the model weights of Whisper are released under the MIT License.

What I find most interesting about this model is how they use a sequence of tokens as a universal interface to interact with the model. They use a “transcribe” token to instruct the model to transcribe speech, use language tokens to tell the model which language is being spoken, and the model outputs timestamp tokens interleaved with the text tokens so that the text can be associated with time points in the source audio.
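A rough sketch of what that token interface looks like. The special-token names (`<|startoftranscript|>`, language tokens, `<|transcribe|>`/`<|translate|>`, `<|notimestamps|>`) are the ones described in the Whisper paper; the helper function itself is illustrative, not part of the whisper library's API:

```python
# Illustrative sketch of Whisper's decoder prompt format, based on the
# special tokens described in the paper. build_prompt is a hypothetical
# helper, not an actual whisper API.

def build_prompt(language: str, task: str, timestamps: bool = True) -> list[str]:
    """Assemble the special-token prefix that tells the decoder what to do."""
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Suppressing timestamps is itself signaled by a special token.
        tokens.append("<|notimestamps|>")
    return tokens

# Transcribe Spanish audio, keeping timestamp tokens in the output:
print(build_prompt("es", "transcribe"))
# Translate French speech into English text, without timestamps:
print(build_prompt("fr", "translate", timestamps=False))
```

The decoder then continues generating text (and timestamp) tokens after this prefix, which is how one model stands in for a whole pipeline of language ID, transcription, translation, and alignment.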

Interestingly, it also performs better on Spanish and Italian than on English, which kind of makes sense, since both have more regular spelling-to-sound mappings than English.
