Argos Translate Roadmap

I’m currently mostly working on the training scripts, to better automate training and to train more and higher-quality models. There are also a number of open tickets for various smaller things.

Looking forward, breaking changes in 2.0 are still a ways off, but I want to do single-character tokenization and seq2seq sentence boundary detection. Depending on how the field progresses, few-shot translation, which is already implemented, may also play a larger role in later versions.

Repost from GitHub

The idea with seq2seq sentence boundary detection is to run an input like this through the network:

<detect-sentence-boundary> This is the first sentence. This is more text that i

Which gives an output like this:

This is the first sentence. <sentence-boundary>

This is currently used for the Mac app, which can’t support Stanza. If ML hardware/software improves enough by the time 2.0 makes sense, this may no longer be necessary, and entire paragraphs or more could be translated at a time instead.
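The input/output example above can be turned into a splitting loop: prefix the detect token, run the model, read everything before the boundary token as the first sentence, and repeat on the remainder. Here is a minimal sketch of that loop; `run_seq2seq` is a stand-in stub (it just cuts at the first period), not the actual Argos Translate model API.

```python
# Special tokens from the seq2seq sentence boundary detection scheme.
DETECT_TOKEN = "<detect-sentence-boundary>"
BOUNDARY_TOKEN = "<sentence-boundary>"


def run_seq2seq(prompt: str) -> str:
    """Stub standing in for the real seq2seq model.

    Echoes the text up to the first period and appends the boundary token,
    mimicking the example output in the post. A real implementation would
    run the prompt through the trained translation network instead.
    """
    text = prompt.replace(DETECT_TOKEN, "", 1).strip()
    period = text.find(".")
    if period == -1:
        return text
    return text[: period + 1] + " " + BOUNDARY_TOKEN


def detect_first_sentence(text: str) -> str:
    """Prefix the detect token, run the model, and read off the first sentence."""
    output = run_seq2seq(f"{DETECT_TOKEN} {text}")
    return output.split(BOUNDARY_TOKEN)[0].strip()


def split_sentences(text: str) -> list:
    """Repeatedly peel off the first detected sentence to split a paragraph."""
    sentences = []
    remaining = text.strip()
    while remaining:
        sentence = detect_first_sentence(remaining)
        if not sentence:
            break
        sentences.append(sentence)
        remaining = remaining[len(sentence):].strip()
    return sentences
```

With the stub, `split_sentences("This is the first sentence. This is more text.")` returns the two sentences separately; swapping the stub for a real model would give the same loop learned boundaries instead of a period heuristic.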


A long-term goal could be trivializing the process of 1) generating a new language model (or improving an existing one) and 2) using the new language model in Argos.

I guess just improving the usability / decreasing the learning curve for https://github.com/argosopentech/onmt-models

If training can be done without a GPU (for learning purposes, on a very small model), that should also be made available as an option. Just a thought; I guess it would be interesting to see whether making it easier to generate models leads to overall better translations, with models contributed by a wider community.

Easier-to-use training scripts are definitely a goal. I’ve been working on a Docker version that’s easier to run.
