OpenNMT-py v3.0

argosopentech · November 6, 2022, 2:27am

We are pleased to announce the release of OpenNMT-py v3.0.

The main motivation was to simplify the data loading API, which relied on an old version of torchtext.
We decided to remove torchtext completely from the scope of OpenNMT-py.

We did our best to make the code structure more uniform, but of course it is not perfect. Also, we have not reworked the library examples and documentation yet. Help is very welcome.

argosopentech · November 6, 2022, 2:29am

I made an issue to track any updates needed for Argos Train:

github.com/argosopentech/argos-train

OpenNMT-py v3 support

opened 02:27AM - 06 Nov 22 UTC

argosopentech

https://forum.opennmt.net/t/opennmt-py-v3-0-is-out/5077

> The vanilla transformer uses sinusoidal positional encoding (position_encoding = true). We recommend using “maximum relative positions” encoding instead (max_relative_positions=20, position_encoding=false), which again has a small overhead.

> We kept the “fusedadam” (old legacy code), which provides the best performance in speed (compared to pytorch amp adam fp16, apex level O1/O2). We tested the new Adam(fused=true) released with pytorch 1.13 but it is way slower.

> Always use the highest batch size possible (up to your GPU RAM capacity) and use an update interval according to the “true batch size” you want. For instance, if your GPU can accept 8192 tokens, then if you use accum_count=12, you will have a true batch size of 98304 tokens.

> Adjust the bucket size to your CPU RAM. Most of the time a bucket between 200K and 500K examples will be suitable. The larger your bucket size is, the less padding you will have, since examples are sorted within this bucket and batches are yielded from it.
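To make the batch size arithmetic in the quote concrete, here is a minimal Python sketch. The function names `true_batch_size` and `accum_count_for` are hypothetical helpers written for this post, not part of OpenNMT-py, and the sketch assumes a single GPU with a token-based batch size.

```python
import math


def true_batch_size(batch_tokens: int, accum_count: int) -> int:
    """Tokens contributing to one optimizer update on a single GPU."""
    return batch_tokens * accum_count


def accum_count_for(target_tokens: int, batch_tokens: int) -> int:
    """Smallest accum_count whose true batch size reaches target_tokens."""
    return math.ceil(target_tokens / batch_tokens)


# Numbers from the quoted release notes: 8192 tokens per step, accum_count=12.
print(true_batch_size(8192, 12))     # 98304
print(accum_count_for(98304, 8192))  # 12
```

Going the other way is probably the more common case: pick the true batch size you want, measure how many tokens fit on the GPU, and derive the accumulation count from that.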

I’ve already found at least one issue:

github.com/OpenNMT/OpenNMT-py

onmt_train crashes in inputter.py

opened 01:24AM - 06 Nov 22 UTC

PJ-Finlay

```
  File "/home/argosopentech/OpenNMT-py/onmt/inputters/inputter.py", line 133, in _read_vocab_file
    if int(line.split(None, 1)[1]) >= min_count:
IndexError: list index out of range
```

I've started getting this error when trying to run onmt_train. It looks like a possible bug in inputter.py.

```
Traceback (most recent call last):
  File "/home/argosopentech/env/bin/onmt_train", line 33, in <module>
    sys.exit(load_entry_point('OpenNMT-py', 'console_scripts', 'onmt_train')())
  File "/home/argosopentech/OpenNMT-py/onmt/bin/train.py", line 65, in main
    train(opt)
  File "/home/argosopentech/OpenNMT-py/onmt/bin/train.py", line 50, in train
    train_process(opt, device_id=0)
  File "/home/argosopentech/OpenNMT-py/onmt/train_single.py", line 131, in main
    checkpoint, vocabs, transforms_cls = _init_train(opt)
  File "/home/argosopentech/OpenNMT-py/onmt/train_single.py", line 75, in _init_train
    vocabs, transforms_cls = prepare_transforms_vocabs(opt)
  File "/home/argosopentech/OpenNMT-py/onmt/train_single.py", line 28, in prepare_transforms_vocabs
    vocabs = build_vocab(opt, specials)
  File "/home/argosopentech/OpenNMT-py/onmt/inputters/inputter.py", line 55, in build_vocab
    src_vocab = _read_vocab_file(opt.src_vocab, opt.src_words_min_frequency)
  File "/home/argosopentech/OpenNMT-py/onmt/inputters/inputter.py", line 133, in _read_vocab_file
    if int(line.split(None, 1)[1]) >= min_count:
IndexError: list index out of range
```
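For anyone hitting the same thing: the expression in the traceback indexes the second whitespace-separated field of each vocab line, so a line that has a token but no count column would raise exactly this IndexError. The sketch below reproduces that and shows one tolerant way to read such a file. `failing_check` and `read_vocab_tolerant` are hypothetical names, keeping tokens that have no count is an assumed policy, and none of this is the actual OpenNMT-py code or fix.

```python
def failing_check(line, min_count):
    # Same expression as in the traceback: needs a second, integer column.
    return int(line.split(None, 1)[1]) >= min_count


def read_vocab_tolerant(lines, min_count=0):
    """Return tokens with count >= min_count, tolerating missing counts."""
    tokens = []
    for line in lines:
        fields = line.split(None, 1)
        if not fields:
            continue  # blank line
        if len(fields) == 1:
            # No count column: keep the token (assumed policy, not upstream's).
            tokens.append(fields[0])
            continue
        if int(fields[1]) >= min_count:
            tokens.append(fields[0])
    return tokens


sample = ["the 120\n", "cat 7\n", "rare\n", "\n"]

try:
    failing_check("rare\n", 5)
except IndexError as exc:
    print("reproduced:", exc)

print(read_vocab_tolerant(sample, min_count=10))
```

If the vocab file was built by a tool that writes only tokens, adding a count column to each line may be enough to get past the crash, but that is a guess based on the traceback alone.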

argosopentech · November 8, 2022, 12:17pm

We just released version 3.0 of CTranslate2! Here’s an overview of the main changes:

The main highlight of this version is the integration of the Whisper speech-to-text model that was published by OpenAI a few weeks ago.

Its architecture is very similar to a text-to-text Transformer model, but it uses Conv1D layers to transform the audio features. On GPU, Conv1D layers are implemented using cuDNN, which is a new optional dependency.
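As a rough illustration of what a Conv1D layer does to a sequence of audio features, here is a naive single-channel sketch in plain Python. It is not how CTranslate2 or cuDNN implement it, and the kernel size, stride and padding values are illustrative assumptions rather than values taken from the announcement.

```python
def conv1d_out_len(length, kernel, stride=1, padding=0):
    """Number of output positions of a 1D convolution (no dilation)."""
    return (length + 2 * padding - kernel) // stride + 1


def conv1d(signal, weights, stride=1, padding=0):
    """Naive single-channel 1D convolution (cross-correlation, as in DL frameworks)."""
    padded = [0.0] * padding + list(signal) + [0.0] * padding
    k = len(weights)
    return [
        sum(padded[i + j] * weights[j] for j in range(k))
        for i in range(0, len(padded) - k + 1, stride)
    ]


frames = [1, 2, 3, 4, 5, 6]
out = conv1d(frames, [1, 1, 1], stride=2, padding=1)
print(out)                                               # [3.0, 9.0, 15.0]
print(conv1d_out_len(len(frames), 3, stride=2, padding=1))  # 3
```

With a stride of 2 the layer halves the number of time steps, so the Transformer layers that follow see a shorter sequence than the raw feature frames.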

The current implementation already supports many CTranslate2 features and optimizations such as quantization, asynchronous execution, decoding with random sampling, etc. It is up to 3x faster than the implementation in the Transformers library: