Help wanted: Kabyle language model for Argos Translate

argosopentech · February 12, 2024, 3:29pm

The easiest fix would be to use a Stanza model for another, similar language. Since Kabyle is written using the Latin alphabet [1], models for other languages that use the same character set, like English or Turkish, might work. The Stanza model only needs to recognize sentence boundaries, not translate, so as long as the basic structure of sentences looks similar between the languages it will probably work.
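As a rough sketch of that idea (not the actual Argos Translate code), you could map Kabyle to a borrowed Stanza language before building the tokenizer pipeline. The `FALLBACK_SBD_LANG` table and both helper names are hypothetical, and whether English is the best stand-in is an assumption you would need to test on real Kabyle text.

```python
from typing import List

# Hypothetical fallback table: Kabyle ("kab") has no Stanza model, so borrow
# the tokenizer of another Latin-script language. "en" is an assumption.
FALLBACK_SBD_LANG = {"kab": "en"}


def sbd_language(code: str) -> str:
    """Return the language code whose Stanza model should split sentences."""
    return FALLBACK_SBD_LANG.get(code, code)


def split_sentences(text: str, code: str) -> List[str]:
    """Split text into sentences using the (possibly borrowed) Stanza model."""
    import stanza  # imported lazily so the mapping works without Stanza installed

    # Only the tokenize processor is needed for sentence boundaries.
    nlp = stanza.Pipeline(lang=sbd_language(code), processors="tokenize")
    return [sentence.text for sentence in nlp(text).sentences]
```

Calling `split_sentences(some_kabyle_text, "kab")` would then run the English tokenizer over the Kabyle input; swapping `"en"` for `"tr"` in the table is an easy way to compare candidates.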

I’ve been exploring switching from Stanza to spaCy, but it looks like spaCy doesn’t support Kabyle either (spaCy supports these languages: Catalan, Chinese, Croatian, Danish, Dutch, English, Finnish, French, German, Greek, Italian, Japanese, Korean, Lithuanian, Macedonian, Multi-language, Norwegian Bokmål, Polish, Portuguese, Romanian, Russian, Slovenian, Spanish, Swedish, Ukrainian).

I’ve also done some experiments in Argos Translate 2 Beta on using neural networks, trained with OpenNMT-py and run on CTranslate2, to do sentence boundary detection. The benefit of this approach is that it lets you use one software stack for both splitting sentences and translating. However, as you can see in my experimental results, this approach performs much worse than libraries designed specifically for this type of text processing, like Stanza and spaCy.

The Argos Translate 2 Beta code has better configuration options, so you can deactivate sentence boundary detection entirely. This would let you translate short strings of text without needing to do any sentence splitting.
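To illustrate what deactivating sentence boundary detection means in practice, here is a minimal sketch. It is not the Argos Translate 2 Beta API: `translate_text`, `translate_sentence`, and `split_sentences` are hypothetical names, and the "translator" in the usage lines is just a stand-in function.

```python
from typing import Callable, List, Optional


def translate_text(
    text: str,
    translate_sentence: Callable[[str], str],
    split_sentences: Optional[Callable[[str], List[str]]] = None,
) -> str:
    """Translate text, optionally skipping sentence boundary detection.

    With split_sentences=None the whole string goes to the translator as one
    unit, which is the behavior you want for short strings.
    """
    if split_sentences is None:
        return translate_sentence(text)
    # Otherwise translate sentence by sentence and rejoin with spaces.
    return " ".join(translate_sentence(s) for s in split_sentences(text))


# Stand-in "translator" for demonstration only: it just upper-cases the input.
short_result = translate_text("azul fell-awen", str.upper)
split_result = translate_text("a. b", str.upper, lambda t: t.split(". "))
```

For languages like Kabyle that lack a sentence splitter, passing no splitter avoids the dependency altogether, at the cost of sending long inputs to the translation model in one piece.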