I am a native speaker, but honestly I can’t say whether it improves on the existing model in an objective way. Some phrases look better, but others look a lot worse. I think the best way to decide would be to A/B test it with real users. (In other words, please don’t push this new model for now; keep the old one.)
The reason I was training the model was that repeated-words problem, but it seems that pierotofy has fixed it in another way?
For the pt-es model: based on your previous experience with other languages that used the lang1->en->lang2 double-translate path, do you think we would get visible improvements? If so, I can try training that.
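Just to make sure we mean the same thing by the double-translate path, here is a minimal sketch of the idea. The `translate` callable is a stand-in for whatever does a single-hop translation, not the project's actual API:

```python
from typing import Callable

def pivot_translate(
    translate: Callable[[str, str, str], str],
    text: str,
    src: str,
    tgt: str,
    pivot: str = "en",
) -> str:
    """Chain two single-hop translations through a pivot language."""
    intermediate = translate(text, src, pivot)  # e.g. pt -> en
    return translate(intermediate, pivot, tgt)  # e.g. en -> es
```

A direct pt-es model would replace both hops with a single call, so any error introduced by the intermediate English step goes away.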
I haven’t normally included non-English language pairs, but I would be willing to merge es-pt. Since the languages are so similar, I think there would be a big performance improvement from not pivoting through English.
I’ve now seen this issue where the models need to be re-zipped a few times, and I believe it’s caused by Google Drive: downloading from Google Drive seems to mess with the .zip file compression somehow.
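If it helps anyone debug this, here is a quick way to check whether a downloaded archive is intact, using Python’s standard `zipfile` module (the path is just a placeholder for the file you pulled from Google Drive):

```python
import zipfile

# Placeholder path; point it at the downloaded model archive.
path = "downloaded_model.argosmodel"

if not zipfile.is_zipfile(path):
    print("Not a valid zip archive at all (header is missing or corrupted)")
else:
    with zipfile.ZipFile(path) as zf:
        bad = zf.testzip()  # first member with a bad CRC, or None if all pass
        print("Archive OK" if bad is None else f"Corrupt member: {bad}")
```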
Any other data source I should include in this list? (or maybe exclude?). Does more data sources mean better results? Or can it hinder the quality of the model after a certain point? Thanks in advance
I looked into this more and I think there is something broken with the current zip implementation. I made a pull request with a new version that seems to work better.
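For reference, re-creating the archive from a model directory with Python’s standard library is enough to produce a clean zip. This is just a sketch of that idea, not the contents of the pull request; the function name and layout are mine:

```python
import os
import zipfile

def rezip_model(model_dir: str, out_path: str) -> None:
    """Re-create a model archive with standard deflate compression,
    writing paths relative to the model directory's parent."""
    root = os.path.dirname(os.path.abspath(model_dir))
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for dirpath, _dirnames, filenames in os.walk(model_dir):
            for name in sorted(filenames):  # stable, reproducible ordering
                full = os.path.join(dirpath, name)
                zf.write(full, arcname=os.path.relpath(full, root))
```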
I’ve encountered the same problem.
I assume it has to do with the way the checkpoint files are sorted in the script: they are compared as strings, so "9" sorts higher than "5".
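To illustrate: a plain string sort compares character by character, so "9000" lands after "50000". Sorting on the step number extracted from the filename fixes it. A small sketch (the filename pattern is just an example of what the script might see):

```python
import re

checkpoints = [
    "model_step_1000.pt",
    "model_step_5000.pt",
    "model_step_9000.pt",
    "model_step_50000.pt",
]

# Plain string sort: '9' > '5', so step 9000 sorts after step 50000.
print(sorted(checkpoints)[-1])  # model_step_9000.pt

# Fix: sort on the numeric step number instead of the raw filename.
def step_number(name: str) -> int:
    match = re.search(r"step_(\d+)", name)
    return int(match.group(1)) if match else -1

print(sorted(checkpoints, key=step_number)[-1])  # model_step_50000.pt
```

The same numeric key should be used anywhere the script picks "latest" checkpoints, including for averaging.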
I worked around this problem temporarily by simply deleting step files 1000 through 9000.
The same problem exists when averaging checkpoints, so be careful: you may end up averaging checkpoints 9000 and 50000, for example. The BLEU score may be high, but the model translates with mediocre quality.