We just released version 3.0 of CTranslate2! Here’s an overview of the main changes:
The main highlight of this version is the integration of the Whisper speech-to-text model that was published by OpenAI a few weeks ago.
Its architecture is very similar to a text-to-text Transformer model, but it adds Conv1D layers that process the audio features before the encoder. On GPU, these Conv1D layers are implemented with cuDNN, which is a new optional dependency.
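To make the role of these layers concrete, here is a minimal NumPy sketch of Whisper's convolutional front-end: a stride-1 Conv1D that keeps the sequence length, followed by a stride-2 Conv1D that halves it. The weights are random and bias/GELU are omitted for brevity; the 80 mel bins, 3000 input frames, and a model dimension of 384 (the `tiny` size) match Whisper, but this is only an illustration of the shape arithmetic, not CTranslate2's actual kernel.

```python
import numpy as np

def conv1d(x, w, stride=1, padding=1):
    """Naive 1D convolution. x: (in_ch, T), w: (out_ch, in_ch, k)."""
    out_ch, in_ch, k = w.shape
    x = np.pad(x, ((0, 0), (padding, padding)))
    t_out = (x.shape[1] - k) // stride + 1
    y = np.empty((out_ch, t_out))
    for t in range(t_out):
        window = x[:, t * stride : t * stride + k]          # (in_ch, k)
        y[:, t] = np.tensordot(w, window, axes=([1, 2], [0, 1]))
    return y

# 80 log-mel bins over 3000 frames (30 seconds of audio), as in Whisper.
mels = np.random.randn(80, 3000)
w1 = np.random.randn(384, 80, 3) * 0.01    # stride-1 conv: length unchanged
w2 = np.random.randn(384, 384, 3) * 0.01   # stride-2 conv: length halved
h = conv1d(mels, w1, stride=1)
h = conv1d(h, w2, stride=2)
print(h.shape)  # (384, 1500): 1500 positions enter the Transformer encoder
```

The stride-2 convolution is what reduces 3000 audio frames to the 1500 positions the encoder actually attends over; from that point on, the model looks like a regular encoder-decoder Transformer.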
The current implementation already supports many CTranslate2 features and optimizations, such as quantization, asynchronous execution, and decoding with random sampling. It is up to 3x faster than the implementation in the Transformers library: