Filtering data with a pre-trained model

The technique suggested here is to use a pre-trained translation model to translate your source data, compare your generated translation to the target data, and then filter out the data that doesn’t match your generated translation well. You then retrain a new model on the filtered data.

If you had much more powerful language models in the future you could even try something like this:

Is this a high quality translation? (yes/no)

__en__ Hola Mundo

Hello World

yes