The current priorities are improving the training scripts to better automate training, and collecting user input [1, 2] to identify valuable models to train.
Looking forward, I am planning to keep the current package format for at least most of 2022. When breaking changes do occur, I’m considering single-character tokenization and seq2seq sentence boundary detection. Depending on how the machine translation field progresses, few-shot translation, which is already implemented, may also play a larger role in later versions.
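Single-character tokenization would replace subword units with individual characters. A minimal sketch of the idea (a real tokenizer, such as SentencePiece in character mode, would also handle the vocabulary and special tokens; these function names are illustrative, not part of Argos Translate's API):

```python
# Sketch of single-character tokenization: every character, including
# spaces and punctuation, becomes its own token.
def char_tokenize(text: str) -> list[str]:
    return list(text)

def char_detokenize(tokens: list[str]) -> str:
    return "".join(tokens)

print(char_tokenize("Hi!"))  # ['H', 'i', '!']
```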
I’d also like to expand into using CTranslate2 language models for more tasks, possibly including Q&A, summarizing text, generating text, messaging, and more. It’s currently possible to use Argos Translate for custom tasks by training a custom translation model, but pretrained models and better support would make this much easier.
Another promising area is combining multiple pieces of functionality into one model. This allows for larger models in absolute terms (instead of many small ones) that can share an understanding of language and the world. For example, the sentence boundary detection models currently used in the Mac app are separate from the translation models; in the future it could become possible to combine them.
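One common way to combine tasks in a single seq2seq model is to prefix each input with a task token, so the same network learns to route on the prefix. The token names and helper below are hypothetical illustrations, not Argos Translate's actual implementation:

```python
# Sketch: multi-task seq2seq inputs distinguished by a task-token prefix.
# One model trained on both streams can serve both tasks.
TASK_TOKENS = {
    "translate": "<translate>",
    "sentence_boundary": "<detect-sentence-boundary>",
}

def build_input(task: str, text: str) -> str:
    """Prefix the raw text with its task token before feeding the model."""
    return f"{TASK_TOKENS[task]} {text}"

print(build_input("sentence_boundary", "This is the first sentence. This is more"))
```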
The idea with seq2seq sentence boundary detection is to run an input like this through the network:
<detect-sentence-boundary> This is the first sentence. This is more text that i
Which gives an output like this:
This is the first sentence. <sentence-boundary>
This is currently used for the Mac app, which can’t support Stanza. If there’s enough improvement in ML hardware and software by the time 2.0 makes sense, this may no longer be necessary; instead, entire paragraphs or more could be translated at a time.
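As a toy stand-in for the seq2seq model, the input/output contract above can be sketched with a regex. A real model would be a trained network that learns boundaries from data; the function below only illustrates the format (the token strings are taken from the example above, the rest is illustrative):

```python
import re

TASK_TOKEN = "<detect-sentence-boundary>"
BOUNDARY_TOKEN = "<sentence-boundary>"

def detect_sentence_boundary(model_input: str) -> str:
    """Return the first complete sentence followed by the boundary token."""
    text = model_input.removeprefix(TASK_TOKEN).strip()
    match = re.search(r"[.!?]", text)
    if match is None:
        return text  # no boundary found in this window of text
    first_sentence = text[: match.end()]
    return f"{first_sentence} {BOUNDARY_TOKEN}"

inp = TASK_TOKEN + " This is the first sentence. This is more text that i"
print(detect_sentence_boundary(inp))
# -> This is the first sentence. <sentence-boundary>
```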
A long-term goal could be trivializing the process of 1) generating a new language model (or improving an existing one) and 2) using the new language model in Argos.
I guess this just means improving the usability and decreasing the learning curve for https://github.com/argosopentech/onmt-models
If training can be done without a GPU (for learning purposes on a very small model), that should also be made available as an option. Just a thought; I guess it would be interesting to see if making it easier to generate models leads to overall better translations, with models contributed by a wider community.
Easier-to-use training scripts are definitely a goal; I’ve been working on a Docker version that’s easier to run.