The use of LLMs to correct noisy translation data

With the rise of a new cheap, open-source LLM, DeepSeek, I've started to wonder how applicable it might be to fixing src-tgt alignment in noisy translation data.
I don't trust LLMs to produce the translations themselves, but given an incorrect or misaligned pair, I do think an LLM's command of language can correct unnatural over-, under-, or mistranslations, as well as decide when it doesn't know how to fix something.
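
Concretely, the correction step could look something like this (a minimal sketch assuming DeepSeek's OpenAI-compatible chat API; the prompt wording and the UNSURE opt-out are just illustrations of the "decide when it can't fix it" idea, not a tested recipe):

```python
from openai import OpenAI

# DeepSeek exposes an OpenAI-compatible endpoint; model name and
# prompt wording here are illustrative.
client = OpenAI(api_key="sk-...", base_url="https://api.deepseek.com")

SYSTEM_PROMPT = (
    "You are a translation editor. Given a source sentence and a noisy "
    "target translation, return only the corrected target sentence. "
    "If you cannot fix it confidently, return the single token UNSURE."
)

def correct_pair(src: str, tgt: str, src_lang: str, tgt_lang: str) -> str | None:
    """Ask the LLM to repair one noisy/misaligned translation pair."""
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"{src_lang}: {src}\n{tgt_lang}: {tgt}"},
        ],
        temperature=0.0,  # conservative edits, not creative rewrites
    )
    fixed = response.choices[0].message.content.strip()
    return None if fixed == "UNSURE" else fixed  # None = couldn't fix, drop or review
```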

This will no doubt only be testable and functional with high-resource languages; the biggest benefit will obviously be for Chinese and English, descending from there to other high-resource languages.

To correct a million sentences (assuming an average of 50 characters per translation, and ~3 characters per token), it would cost around $7 with the DeepSeek-V3 API, as opposed to around $166 with GPT-4o, which is similar in performance (according to their repo).
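
For the arithmetic, a quick back-of-the-envelope check (the per-million-token prices are the promotional DeepSeek-V3 rates as I understand them at the time of writing; treat them as assumptions and check current pricing):

```python
# 1M sentence pairs, ~50 chars per side, ~3 chars per token,
# and roughly equal input and output token counts.
sentences = 1_000_000
tokens_m = sentences * 50 / 3 / 1e6          # ~16.7M tokens each way
input_price, output_price = 0.14, 0.28       # USD per 1M tokens (assumed)
print(f"~${tokens_m * (input_price + output_price):.2f}")  # -> ~$7.00
```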


That’s an interesting use case; in terms of speed, I wonder how long it would take to correct a few million sentences.

I’m going to test this with a two-stage training plan. I’ll start with CCMatrix data and do all my normal filtering first (some perplexity filtering, sentence similarity, length comparisons, normalization), train a model until it plateaus, then take half of the data and feed it through DeepSeek (around 7M sentences).
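
As a rough sketch, that pre-filtering pass could be a predicate like the one below; the perplexity and similarity scorers are hypothetical stand-ins (in practice a language model and a sentence-embedding model such as LASER or LaBSE), and all thresholds are illustrative:

```python
from typing import Callable

def keep_pair(
    src: str,
    tgt: str,
    ppl: Callable[[str], float],        # hypothetical perplexity scorer
    sim: Callable[[str, str], float],   # hypothetical similarity scorer
    max_ppl: float = 500.0,             # illustrative threshold
    min_sim: float = 0.7,               # illustrative threshold
) -> bool:
    """Length-ratio, perplexity, and similarity filters on one pair."""
    # Length comparison: drop pairs that look badly over/under-translated.
    ratio = len(src) / max(len(tgt), 1)
    if not 0.5 <= ratio <= 2.0:
        return False
    # Perplexity filter on the target side.
    if ppl(tgt) > max_ppl:
        return False
    # Cross-lingual sentence similarity between source and target.
    return sim(src, tgt) >= min_sim
```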

My models are English-centric multilingual models, where the xx-en and en-xx data is the same, just reversed. So I think I’ll take half of an xx-en dataset and split that in half again, so that DeepSeek corrects the first portion in the en-xx direction (DeepSeek fixing the xx side), and the second half in the xx-en direction (DeepSeek fixing the English side).
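
The split itself is simple; something like this (the function name is mine, and the seed just keeps the split reproducible):

```python
import random

def split_for_correction(pairs: list[tuple[str, str]], seed: int = 42):
    """Split an xx-en corpus into the two correction halves."""
    rng = random.Random(seed)
    shuffled = pairs[:]        # copy so the original order is untouched
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    fix_xx = shuffled[:mid]    # DeepSeek corrects the xx side (en-xx view)
    fix_en = shuffled[mid:]    # DeepSeek corrects the en side (xx-en view)
    return fix_xx, fix_en
```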

Then I’ll train the model on only this DeepSeek-corrected data and compare the results.
