Retraining to address shortcomings in DE-IT / IT-DE translations

This morning I tested the demo API and noticed that some translations between Italian and German are not accurate at all.

I was wondering whether I can retrain with the parallel corpus JRC-Acquis (Index of JRC-Acquis/alignments) or similar datasets in order to improve the accuracy a little.

I saw you described how to retrain the model with “Locomotive”, and I have a couple of questions in this regard:

  1. Is it only a matter of quantity? Will we improve the accuracy just by adding more “examples”?

  2. Have you tracked the training sets that were used to train the current production model?

  3. It is not clear to me whether retraining is cumulative or not. I mean, if I want to add some “examples”, do we have to include all the datasets used so far in “sources” again, or not? (See the sketch after this list for what I mean.)
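For context, my rough understanding of the Locomotive config is something like the sketch below; only the “sources” list is something I have actually seen mentioned, while the other field names and the dataset entries are just my guesses:

```python
import json

# Rough sketch of a Locomotive-style config. Only the "sources" list is taken
# from the discussion above; all other keys and the dataset entries are guesses.
config = {
    "from": {"name": "German", "code": "de"},    # assumed key layout
    "to": {"name": "Italian", "code": "it"},     # assumed key layout
    "version": "1.0",
    "sources": [
        # If retraining is NOT cumulative, the datasets used for the current
        # production model would have to be listed here again ...
        "opus://ParaCrawl",                              # hypothetical entry
        # ... together with the new data to add:
        "https://example.org/jrc-acquis-de-it.zip",      # hypothetical URL
    ],
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
```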

Many thanks,
Simone

I think we should train a de-it model. Right now we’re translating de->en->it, and I think de-it might be a popular language pair. Opus has a lot of data for de-it.
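For reference, here’s roughly what the current two-hop path looks like from the Python API; as far as I know, translate() pivots through English automatically when no direct de→it package is installed:

```python
import argostranslate.translate

# With only de->en and en->it packages installed, Argos Translate chains the
# two models (pivoting through English); a direct de-it model would avoid that.
text = "Der Vertrag tritt am ersten Tag des folgenden Monats in Kraft."
print(argostranslate.translate.translate(text, "de", "it"))
```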

I added a help-wanted tag

There’s a “references” field in the .argosmodel package’s metadata.json file that tells you which datasets were used for training.
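If you want to check it programmatically, something like this sketch should work; .argosmodel packages are zip archives, and it just searches for metadata.json inside the archive rather than assuming its exact path (the filename below is only an example):

```python
import json
import zipfile

# A .argosmodel package is a zip archive; locate metadata.json inside it
# and return the "references" field listing the training datasets.
def read_references(package_path):
    with zipfile.ZipFile(package_path) as archive:
        name = next(n for n in archive.namelist() if n.endswith("metadata.json"))
        metadata = json.loads(archive.read(name))
    return metadata.get("references")

print(read_references("translate-de_it-1_0.argosmodel"))  # hypothetical filename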

You can download the production .argosmodel packages here: argospm Index
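The Python API can also fetch and install packages for you; this sketch assumes the library’s package index is the same argospm index linked above:

```python
import argostranslate.package

# Download and install a production package programmatically, e.g. de->en.
argostranslate.package.update_package_index()
available = argostranslate.package.get_available_packages()
pkg = next(p for p in available if p.from_code == "de" and p.to_code == "en")
argostranslate.package.install_from_path(pkg.download())
```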