Improving Transliteration in LibreTranslate

LibreTranslate currently uses the Polyglot library for transliteration. As far as I can tell the output looks good, but I don’t speak any languages that require transliteration, so I can’t judge it well. The problem is that Polyglot can be difficult to install and has caused some friction for LibreTranslate users [1] [2]. I think it’s worth looking into alternatives. This isn’t a pressing issue, so we can also pursue long-term fixes.

Some ideas:

  • To my mind the ideal solution is to train the core seq2seq Transformer model so that it does any necessary transliteration itself. The model already maps a source string of Unicode characters to a target string, so it could transliterate as part of that mapping.
  • We could generate datasets for the transliteration task by running the current transliteration library and saving its inputs and outputs. This would help train a seq2seq model as described above, and we could potentially release a dataset that would be helpful to other groups.
  • Investigate other libraries
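
To make the dataset idea concrete, here is a minimal sketch of what saving the library’s inputs and outputs could look like. The function name `build_transliteration_dataset`, the JSON Lines layout, and the `source`/`target` field names are assumptions made for illustration, not anything LibreTranslate does today. The transliteration function is passed in as a plain callable, so the sketch doesn’t depend on Polyglot being installed.

```python
import json
from typing import Callable, Iterable


def build_transliteration_dataset(
    words: Iterable[str],
    transliterate: Callable[[str], str],
    out_path: str,
) -> int:
    """Write one {"source", "target"} JSON object per line; return the pair count.

    `transliterate` is any callable mapping a source string to its
    transliteration (hypothetical interface, chosen for this sketch).
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for word in words:
            target = transliterate(word)
            if not target:
                # Skip words the library could not transliterate.
                continue
            record = {"source": word, "target": target}
            # ensure_ascii=False keeps Cyrillic etc. readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

With Polyglot installed, the callable could be something like `Transliterator(source_lang="ru", target_lang="en").transliterate` from `polyglot.transliteration`, if I’m reading its docs correctly. The resulting file could then be mixed into the normal training data.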

+1 for training transliteration directly into the model.


From what I can tell, the seq2seq model is already doing a decent job of transliteration, so it may not be necessary to create a dataset just for transliteration. The training data should already contain many transliterated names, so the model should be able to learn this from that data alone.

I found a list of Russian names and put them through LibreTranslate (without a separate transliteration library) and Google Translate:

| Russian Name | LibreTranslate Seq2Seq Model | Google Translate |
|---|---|---|
| Авдотья | August | Avdotya |
| Дана | Dana | Dana |
| Жанна | Jeanne | Jeanne |
| Игнатий | Ignat | Ignatius |
| Лика | Lika | Lika |
| Прокопий | Procopies | Procopius |
| Тамила | Tamil | Tamil |
| Феодосия | Feodos | Feodosia |
| Чингиз | Chingiz | Genghis |
| Шамиль | Shamil | Shamil |
| Яков | Yakov | Jacob |
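
For anyone who wants to repeat this with a longer name list, here is a rough sketch of how the comparison could be scripted. The `libretranslate` helper posts to the `/translate` endpoint; the `http://localhost:5000` URL assumes a local instance without an API key, and the helper names and the exact-match `agreement` measure are my own choices for illustration. Exact string agreement with Google Translate is only a crude proxy: several of the mismatches above (Yakov vs. Jacob, Chingiz vs. Genghis) are arguably fine transliterations.

```python
import json
import urllib.request


def libretranslate(text, source="ru", target="en",
                   url="http://localhost:5000/translate"):
    """Translate `text` via a LibreTranslate instance (URL is an assumption)."""
    payload = json.dumps(
        {"q": text, "source": source, "target": target, "format": "text"}
    ).encode("utf-8")
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["translatedText"]


def agreement(rows):
    """Return (matches, total) for (name, model_output, reference) rows."""
    matches = sum(1 for _, model_out, reference in rows if model_out == reference)
    return matches, len(rows)


# The rows from the table above: (Russian name, LibreTranslate, Google Translate).
ROWS = [
    ("Авдотья", "August", "Avdotya"),
    ("Дана", "Dana", "Dana"),
    ("Жанна", "Jeanne", "Jeanne"),
    ("Игнатий", "Ignat", "Ignatius"),
    ("Лика", "Lika", "Lika"),
    ("Прокопий", "Procopies", "Procopius"),
    ("Тамила", "Tamil", "Tamil"),
    ("Феодосия", "Feodos", "Feodosia"),
    ("Чингиз", "Chingiz", "Genghis"),
    ("Шамиль", "Shamil", "Shamil"),
    ("Яков", "Yakov", "Jacob"),
]
```

On the table above, `agreement(ROWS)` gives 5 exact matches out of 11. To rebuild the rows from a fresh name list, the middle column would come from calling `libretranslate(name)` for each name.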

If any native speakers have an opinion on Russian or another transliterated language, I’d be curious what they have to say. From what I remember, we didn’t get many complaints about poor transliteration before we added Polyglot; we got a pull request to add it and merged it.
