I tried manually testing a variety of the current Argos Translate languages to see which ones xx_sent_ud_sm can successfully segment into sentences (a rough sketch of the check I used is shown after the list):
Arabic works
Chinese does not work
Dutch works
Irish works
Korean works
Russian works
Thai does not work
Turkish works
Urdu does not work
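For reference, here is roughly the kind of check I ran for each language. This is a minimal sketch, assuming the model has been installed with `python -m spacy download xx_sent_ud_sm`; the sample text and expected split are just illustrative:

```python
import spacy

# Load spaCy's multilingual sentence segmentation model
nlp = spacy.load("xx_sent_ud_sm")

# Illustrative sample: two Russian sentences that should split into two
text = "Привет, мир. Как дела?"
doc = nlp(text)

sentences = [sent.text for sent in doc.sents]
print(sentences)  # the language "works" if the split matches expectations
```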
Chinese not working with this model is a known issue; however, there is a dedicated Chinese Spacy model that does work (zh_core_web_sm).
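In that case Chinese segmentation could go through the dedicated pipeline instead, along these lines (assuming zh_core_web_sm is installed via `python -m spacy download zh_core_web_sm`; the sample text is illustrative):

```python
import spacy

# Dedicated Chinese pipeline; xx_sent_ud_sm fails on Chinese
nlp_zh = spacy.load("zh_core_web_sm")

doc = nlp_zh("今天天气很好。我们去公园吧。")
print([sent.text for sent in doc.sents])
```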
I try pretty hard to minimize the dependencies for Argos Translate, so I'm attracted to the idea of dropping the need for Stanza if possible. I think the xx_sent_ud_sm model is probably good enough for most languages, and I can try to fix the languages that don't work, like Thai, with one-off libraries (such as pythainlp) that are lighter weight than Stanza (see the sketch after this paragraph). I'm also okay with dropping support for a small number of languages (possibly Urdu) if it makes the codebase higher performance and easier to maintain. It may also be worth looking into adding support to Spacy for the languages we need and then contributing it back to the Spacy project so that we can use it (Spacy training documentation).
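For Thai, for instance, a lighter-weight replacement might look something like this. This is a sketch using pythainlp's sent_tokenize (assuming pythainlp is installed); the sample text is illustrative:

```python
from pythainlp.tokenize import sent_tokenize

# Thai has no sentence-final punctuation, so a Thai-specific
# segmenter is needed where xx_sent_ud_sm falls short
text = "ฉันรักภาษาไทย วันนี้อากาศดีมาก"
print(sent_tokenize(text))
```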
Do you incorporate language detection into the process? If not, that's a way to get more specialized solutions for different languages. A one-size-fits-all model is difficult to find.
Argos Translate always knows what languages it's translating from and to. Language detection ("auto" as the source language) is a feature of LibreTranslate, and LibreTranslate passes the detected source language code to Argos Translate. So for Chinese I think I'm just going to use a different Spacy model when the user is translating Chinese source text.
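A sketch of what that per-language dispatch could look like. The mapping, caching, and function names here are hypothetical illustrations, not actual Argos Translate code:

```python
import spacy

# Hypothetical mapping from source language codes to spaCy
# pipelines; anything not listed falls back to the multilingual model
_SENTENCIZER_MODELS = {
    "zh": "zh_core_web_sm",
}
_DEFAULT_MODEL = "xx_sent_ud_sm"

_loaded = {}

def get_sentencizer(source_lang: str):
    """Load (and cache) the spaCy pipeline for a source language."""
    model_name = _SENTENCIZER_MODELS.get(source_lang, _DEFAULT_MODEL)
    if model_name not in _loaded:
        _loaded[model_name] = spacy.load(model_name)
    return _loaded[model_name]

def split_sentences(text: str, source_lang: str) -> list[str]:
    """Segment text using the pipeline chosen for its source language."""
    nlp = get_sentencizer(source_lang)
    return [sent.text for sent in nlp(text).sents]
```

Caching the loaded pipelines matters here because spacy.load is expensive, and a translation server would otherwise reload the model on every request.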