After also training on English-French pairs, I have come to the conclusion that the training data really is the dominant factor in the model's quality.
I first tried a mix of excerpts from CCMatrix, UNPC, MultiUN, EuroPat, TED talks and DGT (EU), to which I added the Canadian Hansards and Cadlaws. Totalling 55M selected sentence pairs, this got me slightly better results than the existing models, but nothing really exciting: the improvement was due to the hyperparameters.
Then I reduced the dataset to 25M sentence pairs, keeping only the best excerpts, and got a small jump of 0.4 points in COMET score. Nothing too noticeable in a manual evaluation.
Adding 20M backtranslated sentences from the Leipzig University Wortschatz corpora brought the scores back to the values I had had with the 55M sentence pairs. So much for backtranslation.
Incidentally, I was instructed to look for an LLM-backed translator, and came across TowerInstruct. The team that developed it explicitly used the reference-free wmt22-cometkiwi-da model to filter their training data. They reach a 0.8824/0.884 COMET score after training with only 2M sentence pairs per language. For comparison, Google is at 0.8922/0.8992, LT1.9 at 0.863/0.882, and I barely reach 0.873 toward French and 0.890 toward English.
So I devised a script to calculate the COMET scores and sort the sentence pairs accordingly (~10k a minute, a little more than 10 million a day). It is not a filter yet, since I first have to find out how the scores distribute across the various corpora, but a good threshold for a filter might lie between 0.85 and 0.89, depending on how selective one wants to be.
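For anyone wanting to reproduce the idea, the core of it is just the standard unbabel-comet API. Here is a minimal sketch (not my exact script): it assumes unbabel-comet ≥ 2.0, a tab-separated source/target file, and made-up file names and batch size.

```python
# Minimal sketch, not the actual script: score EN-FR pairs with the
# reference-free wmt22-cometkiwi-da model and sort them best-first.
# File names and batch size are placeholders; assumes unbabel-comet >= 2.0.
from comet import download_model, load_from_checkpoint

model = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))

# One tab-separated "source<TAB>target" pair per line (assumed format).
with open("pairs.en-fr.tsv", encoding="utf-8") as f:
    pairs = [line.rstrip("\n").split("\t") for line in f]

# CometKiwi is reference-free: it only needs the source and the translation.
data = [{"src": src, "mt": tgt} for src, tgt in pairs]
output = model.predict(data, batch_size=64, gpus=1)

# Sort by quality estimate and keep the score, so a threshold
# (e.g. 0.85-0.89) can be applied later on.
scored = sorted(zip(output.scores, pairs), key=lambda x: x[0], reverse=True)
with open("pairs.scored.tsv", "w", encoding="utf-8") as f:
    for score, (src, tgt) in scored:
        f.write(f"{score:.4f}\t{src}\t{tgt}\n")
```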
The goal is to get the best 25M sentence pairs COMET-wise and train models on them. For now, the script is single-GPU only, but the code can be optimized for multiple GPUs.
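If I do parallelize it, the simplest option is probably not multi-GPU inference inside one process but sharding the input and running one scoring process per GPU. A rough sketch of that idea (the shard names and score_shard.py wrapper are hypothetical):

```python
# Rough sketch of a simple multi-GPU setup: one scoring process per GPU,
# each pinned to a device with CUDA_VISIBLE_DEVICES. "score_shard.py" is a
# hypothetical wrapper around the scoring code above taking input/output paths.
import os
import subprocess

shards = ["shard0.tsv", "shard1.tsv", "shard2.tsv", "shard3.tsv"]  # pre-split input
procs = []
for gpu, shard in enumerate(shards):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    procs.append(subprocess.Popen(
        ["python", "score_shard.py", shard, shard + ".scored"], env=env))
for p in procs:
    p.wait()
# The per-shard outputs can then be concatenated and re-sorted by score.
```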