Yes, it is indeed a very big change to the architecture of Argos Translate, and it also opens the door to using models from large companies like Meta.
I think it would be interesting to add a progress bar. When I download models via Chrome, the size, the remaining time, etc. are shown, so it should be doable to implement something similar in argospm.
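A minimal sketch of what that could look like, assuming argospm fetches packages over HTTP with requests (the function name and arguments here are made up for illustration, not the real argospm internals):

# Hypothetical sketch of a download progress bar for argospm.
# Assumes packages are fetched over HTTP with requests; url and
# dest_path are illustrative names, not the real API.
import sys
import requests

def download_with_progress(url, dest_path, chunk_size=8192):
    response = requests.get(url, stream=True)
    response.raise_for_status()
    total = int(response.headers.get("Content-Length", 0))
    done = 0
    with open(dest_path, "wb") as f:
        for chunk in response.iter_content(chunk_size=chunk_size):
            f.write(chunk)
            done += len(chunk)
            if total:
                percent = 100 * done / total
                sys.stdout.write(f"\r{done}/{total} bytes ({percent:.1f}%)")
                sys.stdout.flush()
    sys.stdout.write("\n")

With the Content-Length header available, remaining time could also be estimated from the average download rate so far.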
Multilingual translation with M2M-100 now works in the dev branch!
pip install git+https://github.com/argosopentech/argos-translate.git@v2
argospm install translate
argos-translate --from-lang en --to-lang es "I am translating from English to Spanish with the M2M-100 language model."
Estoy traducindo del inglés al español con el modelo de lenguaje M2M-100.
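For reference, here is a rough equivalent using the Python API (assuming the installed-languages interface behaves the same on the v2 branch, and that the en -> es model is already installed via argospm):

# Rough Python-API equivalent of the CLI call above.
import argostranslate.translate

installed = argostranslate.translate.get_installed_languages()
from_lang = next(lang for lang in installed if lang.code == "en")
to_lang = next(lang for lang in installed if lang.code == "es")
translation = from_lang.get_translation(to_lang)
print(translation.translate("I am translating from English to Spanish with the M2M-100 language model."))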
Great!
How will Argos Translate handle the case where several models are installed? For example:
the M2M-100 model is installed
the Argos model translate-en_es is installed
Which one takes priority? The one loaded first?
It's not strictly determined currently; it would just be whichever is loaded first.
I've thought about using some kind of "specificity" heuristic where Argos Translate would load the model that translates to the fewest other languages, as sketched below. Nothing is implemented right now though.
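A minimal sketch of that heuristic, assuming each candidate model exposes the set of language codes it can translate to (the candidates shape is hypothetical, not the real Argos Translate data structure):

# Hypothetical sketch of the "specificity" heuristic: among models
# that can handle the requested target language, prefer the one that
# targets the fewest languages overall, so translate-en_es would win
# over M2M-100 for en -> es.
def pick_model(candidates, to_code):
    # candidates: list of (model, set of target language codes) -- assumed shape
    usable = [(model, targets) for model, targets in candidates if to_code in targets]
    if not usable:
        return None
    # Most specific model = fewest possible target languages.
    model, _ = min(usable, key=lambda pair: len(pair[1]))
    return model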
I couldn't get these lines to work. When I do argospm update && argospm search, the M2M package doesn't seem to exist in the index.
Try deleting the existing packages and index:
rm -rf ~/.local/share/argos-translate/
rm -rf ~/.local/cache/argos-translate/
You may also need to run:
argospm update
You could add some metadata for the model in properties such as "prefix," "suffix," and more specific ones for the src/tgt sides.
A lot of the annoyance is probably with models such as M2M, where the encoder doesn't know what language to translate to and the decoder is given a token and decodes from there; at least with these fields you'd be able to add to a sequence of tokens.
Ultimately these properties couldn't just be one token, so they'd either need to be an array of tokens or a string which is tokenized.
E.g.:
src_prefix: "__opt_src_SRCLANG __opt_tgt__TGTLANG"
This would allow a standard process of adding these tokens to each sequence going into a model: replace the placeholders with language names, and if the properties don't exist then just do nothing (see the sketch below).
For some models I've made, tokenizing these prefix properties wouldn't work as opposed to using an array, because I artificially inserted the target and source tokens to decrease sequence length [to avoid having an unnecessary space token between the two when tokenized].
This could be extended beyond just src/tgt tokens to formality or locale tokens.
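A rough sketch of how such metadata could be applied, assuming a metadata dict shipped with the package and some tokenize function; none of these names are the real Argos Translate API:

# Hypothetical sketch: prepend prefix metadata to a tokenized source
# sequence, replacing the SRCLANG/TGTLANG placeholders. Handles both
# the array-of-tokens and string forms discussed above.
def apply_src_prefix(tokens, metadata, src_lang, tgt_lang, tokenize):
    prefix = metadata.get("src_prefix")
    if prefix is None:
        return tokens  # property absent: do nothing
    if isinstance(prefix, list):
        # Already an array of tokens; substitute placeholders per token.
        prefix_tokens = [t.replace("SRCLANG", src_lang).replace("TGTLANG", tgt_lang) for t in prefix]
    else:
        # A string: substitute placeholders, then tokenize the result.
        prefix = prefix.replace("SRCLANG", src_lang).replace("TGTLANG", tgt_lang)
        prefix_tokens = tokenize(prefix)
    return prefix_tokens + tokens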
Metadata with the model may be the best strategy for adding different prefixes to the source and target text.
I also posted about this on the OpenNMT Forum: