I’ve been studying Chinese for a year, and even at my level I can tell there is little use for a translation model between simplified and traditional Chinese: the difference mostly comes down to how certain character components are written.
For instance, the “speech” radical takes a half-dozen strokes in traditional writing, while in simplified it looks like an i, and so on. The simplified script was introduced under Mao, which is why most overseas Chinese communities still use traditional writing.
A rule-based post-processing step can easily handle this kind of conversion.
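At its core, such a rule is just a per-character lookup table. A minimal sketch in Python (the table below covers only a few illustrative characters; real converters such as OpenCC also handle one-to-many mappings and word-level context, which a plain character table cannot):

```python
# Tiny illustrative simplified -> traditional character table.
# Real tools (e.g. OpenCC) ship full tables plus phrase-level rules.
S2T = {
    "语": "語",  # "language": the speech radical 讠 becomes 言
    "说": "說",  # "to speak"
    "汉": "漢",  # "Han"
    "门": "門",  # "door"
}

def to_traditional(text: str) -> str:
    # Characters not in the table (shared between both scripts,
    # or simply missing here) pass through unchanged.
    return "".join(S2T.get(ch, ch) for ch in text)

print(to_traditional("说汉语"))  # prints 說漢語
```

Going the other direction is mostly the same table inverted, though a few traditional characters merge into one simplified form, which is where word-level disambiguation becomes necessary.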
Ancient texts are next to impossible to translate this way anyway, because written Chinese itself was reformed under the first Chinese Republic a century ago, when Classical Chinese gave way to the written vernacular.
Chinese is not the only language to have undergone dramatic changes over the last century: such reforms were deemed necessary to overcome mass illiteracy in many countries, Turkey being another example. Portugal and the Soviet Union simplified their grammar and orthography too, though not to the point of making the written language unrecognizable to older generations.
Even if one tried to train a model for this, there is simply not enough data on OPUS to make it worthwhile (about 300K sentence pairs, 95% of them from the notoriously unreliable CCMultiAligned), so the result would be worse than the current pivot.