Improving Vietnamese to English model

I’ve found some mistakes in Vietnamese-to-English translation, namely that the name of a language gets changed to “English” in certain sentences.

Where can I find the current Vietnamese-to-English LM? There’s no Vietnamese on Argos’s page, so where did LibreTranslate get the LM from?

Oh, I know what the problem is - bloody Wikimedia. Let us assume that LibreTranslate uses the OPUS corpus.
Then we get sentence pairs like the ones in the Wikimedia dataset: https://opus.nlpl.eu/results/vi&en/corpus-result-table

The English version includes the Vietnamese title in brackets, and the Vietnamese version has the English title in brackets. It’s stuff like that that teaches the LM to sometimes substitute the names of foreign languages with “English”.
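A rough sketch of how pairs with this swapped-title pattern could be flagged before training. The function name `has_swapped_gloss` is made up for illustration, and it assumes the titles sit in round parentheses; it is not anything LibreTranslate or Argos actually ships.

```python
import re
import unicodedata

# Text inside one level of round parentheses (assumption: glosses use "(...)").
BRACKETED = re.compile(r"\(([^()]+)\)")


def _norm(text: str) -> str:
    # NFC + casefold so the same Vietnamese word compares equal on both sides.
    return unicodedata.normalize("NFC", text).casefold()


def has_swapped_gloss(en: str, vi: str) -> bool:
    """True if text bracketed on one side appears unbracketed on the other."""
    en, vi = _norm(en), _norm(vi)
    en_plain = BRACKETED.sub("", en)
    vi_plain = BRACKETED.sub("", vi)
    en_glosses = [g.strip() for g in BRACKETED.findall(en)]
    vi_glosses = [g.strip() for g in BRACKETED.findall(vi)]
    return any(g and g in vi_plain for g in en_glosses) or any(
        g and g in en_plain for g in vi_glosses
    )


pairs = [
    ("Hanoi (Hà Nội) is the capital of Vietnam.",
     "Hà Nội (Hanoi) là thủ đô của Việt Nam."),
    ("The river is long.", "Con sông dài."),
]
flagged = [p for p in pairs if has_swapped_gloss(*p)]
```

Flagged pairs could either be dropped or have the bracketed part stripped from both sides; which is better probably depends on how much data is left afterwards.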

Oh… Wikipedia’s dataset is even worse. Citations are left untranslated, and sometimes the Vietnamese text already contains an English translation (so we are getting English to Vietnamese + English): https://opus.nlpl.eu/sample/en&vi/Wikipedia&v1.0/sample

Overall, it seems that certain technical Vietnamese texts “code-switch” into English. Those pairs need to be culled, along with Vietnamese and English texts that are about the same topic but aren’t actually translations of each other. I’ll try training an LM on data filtered like that, and maybe another one where other corpora I trust take the place of the removed datasets.
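For the culling step, something along these lines might work as a first pass. It is only a heuristic sketch: `copied_fraction`, `keep_pair`, and both thresholds are assumptions picked for illustration, not tuned values. The idea is that a Vietnamese side which repeats many words verbatim from the English side is probably code-switched or untranslated, and a large length mismatch hints that the two sides are not real translations.

```python
import re
import unicodedata

# Runs of letters only (no digits or underscores), Unicode-aware.
WORD = re.compile(r"[^\W\d_]+")


def tokens(text: str) -> list[str]:
    return [w.casefold() for w in WORD.findall(unicodedata.normalize("NFC", text))]


def copied_fraction(en: str, vi: str) -> float:
    """Share of Vietnamese-side words that also occur verbatim on the English side."""
    en_words = set(tokens(en))
    vi_words = tokens(vi)
    if not vi_words:
        return 0.0
    return sum(w in en_words for w in vi_words) / len(vi_words)


def keep_pair(en: str, vi: str, max_copied: float = 0.3, max_len_ratio: float = 2.5) -> bool:
    """Keep a pair only if it looks translated (assumed, untuned thresholds)."""
    en_len, vi_len = len(tokens(en)), len(tokens(vi))
    if en_len == 0 or vi_len == 0:
        return False
    ratio = max(en_len, vi_len) / min(en_len, vi_len)
    return ratio <= max_len_ratio and copied_fraction(en, vi) <= max_copied
```

Proper names and loanwords overlap legitimately, so a word-overlap threshold like this will throw away some good pairs; it would be worth eyeballing a sample of what gets removed before training on the result.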