LibreTranslate currently uses the Polyglot library for transliteration. From what I can tell it performs well, but I don’t speak any languages that require transliteration, so I can’t judge reliably. The issue is that Polyglot can be difficult to install, which has caused some friction for LibreTranslate users [1] [2]. I think it would be worth looking into possible alternatives. This isn’t a pressing issue, so we can also pursue long-term fixes.
Some ideas:
- To my mind the ideal solution is to train the core seq2seq Transformer model so that it does any necessary transliteration. The seq2seq model maps a source string of Unicode characters to a target string and could do transliteration as part of this.
- We could generate datasets for the transliteration task by running the current transliteration library and saving its input and output. This would help train a seq2seq model as described above, and we could potentially release a dataset that would be helpful to other groups.
- Investigate other libraries
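The dataset idea above could be sketched roughly as follows. This is only a minimal illustration: `write_transliteration_pairs` and `toy_transliterate` are hypothetical names, and the toy function stands in for whatever call the current transliteration library actually exposes. The TSV layout is also just an assumption, not an agreed format.

```python
import csv
import io
from typing import Callable, Iterable, TextIO


def write_transliteration_pairs(
    words: Iterable[str],
    transliterate: Callable[[str], str],
    out: TextIO,
) -> int:
    """Write (source, transliteration) rows as TSV and return the row count."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    count = 0
    for word in words:
        target = transliterate(word)
        # Skip empty outputs so failed lookups do not end up in the dataset.
        if not target:
            continue
        writer.writerow([word, target])
        count += 1
    return count


# Hypothetical stand-in for the real transliteration call; it only knows
# three Cyrillic letters and returns "" for anything else.
_TOY_TABLE = {"Д": "D", "а": "a", "н": "n"}


def toy_transliterate(word: str) -> str:
    if any(ch not in _TOY_TABLE for ch in word):
        return ""
    return "".join(_TOY_TABLE[ch] for ch in word)


buffer = io.StringIO()
rows_written = write_transliteration_pairs(["Дана", "Лика"], toy_transliterate, buffer)
```

In practice the `transliterate` argument would wrap the existing library, and `out` would be a file opened with `encoding="utf-8"`.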
+1 for training transliteration directly into the model.
From what I can tell the seq2seq model is already doing a decent job of transliteration, so it may not be necessary to create a dataset just for transliteration. The training data should already contain many transliterated names, so the model should be able to learn transliteration from that alone.
I found a list of Russian names and put them through LibreTranslate (without a separate transliteration library) and Google Translate:
| Russian Name | LibreTranslate Seq2Seq Model | Google Translate |
| --- | --- | --- |
| Авдотья | August | Avdotya |
| Дана | Dana | Dana |
| Жанна | Jeanne | Jeanne |
| Игнатий | Ignat | Ignatius |
| Лика | Lika | Lika |
| Прокопий | Procopies | Procopius |
| Тамила | Tamil | Tamil |
| Феодосия | Feodos | Feodosia |
| Чингиз | Chingiz | Genghis |
| Шамиль | Shamil | Shamil |
| Яков | Yakov | Jacob |
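For a rough tally of the comparison above, here is a small sketch that counts how often the two systems produce the same string. The function name `agreement` is made up for illustration, and exact string match is only a crude proxy: Google Translate is not ground truth, and some differences (Yakov vs. Jacob) are translation choices rather than transliteration errors.

```python
# Rows copied from the comparison above:
# (Russian name, LibreTranslate seq2seq output, Google Translate output)
ROWS = [
    ("Авдотья", "August", "Avdotya"),
    ("Дана", "Dana", "Dana"),
    ("Жанна", "Jeanne", "Jeanne"),
    ("Игнатий", "Ignat", "Ignatius"),
    ("Лика", "Lika", "Lika"),
    ("Прокопий", "Procopies", "Procopius"),
    ("Тамила", "Tamil", "Tamil"),
    ("Феодосия", "Feodos", "Feodosia"),
    ("Чингиз", "Chingiz", "Genghis"),
    ("Шамиль", "Shamil", "Shamil"),
    ("Яков", "Yakov", "Jacob"),
]


def agreement(rows):
    """Return (names where both outputs match exactly, names where they differ)."""
    same = [name for name, libre, google in rows if libre == google]
    different = [name for name, libre, google in rows if libre != google]
    return same, different


same, different = agreement(ROWS)
print(f"{len(same)} of {len(ROWS)} outputs match exactly")
print("Differences:", ", ".join(different))
```

On this list the two systems agree on 5 of the 11 names.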
If any native speakers have an opinion on Russian or another transliterated language, I’d be curious what they have to say. From what I remember, before we added Polyglot for transliteration we didn’t get many complaints about poor transliteration, but we received a pull request to add it, so we did.