Community Trained Welsh model

Someone trained this Welsh model for Argos Translate 3 years ago and it’s still live on GitHub and seems to work. I think it’s MIT licensed so I could upload it to the package index if the community wants it.

I’d appreciate feedback on the quality from anyone who speaks Welsh. I’m also not sure how much demand there would be for Welsh. I try to avoid uploading languages that won’t get used frequently because it wastes bandwidth when people download language packs and don’t use them.

Source text


Hey there,

First, thank you for all the effort you’re putting into Argos Translate.

While Welsh might not be as widely spoken or mainstream as some other languages, it has some online presence (more than Irish, Urdu, or Albanian, all of which we currently have), and I would love to see it added to the package index.

Something that might aid your decision on which languages to include or exclude: we crawl significant portions of the web and use Argos for translation, so I decided to share a few of our statistics on the domains we crawl (I omitted English and included percentages rather than actual numbers; each domain is counted once).
We use a combination of lingua-py (pemistahl/lingua-py on GitHub) and fastText for language detection, so there is a bit of a bias towards languages lingua-py detects.

Language Percentage Notes
de 15.39
es 10.25
fr 8.93
zh 7.5 ***
ru 7.24
pt 6.74
ja 6.62
nl 6.47
it 3.74
id 3.22
pl 2.58
tr 2.37
sv 1.49
fa 1.37
cs 1.35
ro 1.16
ko 1.08
la 1.02 **
da 1
hu 0.91
th 0.76
uk 0.75
ar 0.72
nb 0.7
fi 0.69
el 0.6
sk 0.5
he 0.49
yo 0.48 *
vi 0.46
tl 0.37
bg 0.33
hr 0.27 *
ca 0.24
bs 0.22 *
lt 0.22
sl 0.19
et 0.16
ms 0.14
eo 0.13
nn 0.13 *
bn 0.11
lv 0.11
hi 0.1
az 0.06
cy 0.06 *
is 0.05 *
sq 0.05
ts 0.05 *
st 0.04 *
af 0.04 *
mk 0.03 *
mi 0.03 *
ka 0.03 *
sw 0.02 *
mr 0.02 *
sr 0.02 *
kk 0.02 *
eu 0.02 *
tn 0.02 *
mn 0.02 *
hy 0.02 *
ta 0.02 *
xh 0.01 *
sn 0.01 *
ur 0.01
zu 0.01 *
gu 0.01 *
te 0.01 *
so 0.01 *
be 0 *
lg 0 *
ga 0

Notes:

* Languages unsupported by Argos Translate
** Latin is probably not common; it is just commonly used by templates and such (Lorem Ipsum, etc.)
*** lingua-py doesn’t separate ZT and ZH language codes as of today, so the statistic is a bit off here
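
For anyone who wants to produce a similar table from their own crawl, here is a rough Python sketch of the counting step only. It is not our actual pipeline: the function name `language_shares`, the `(domain, language_code)` input shape, and the rule of keeping the first language seen for each domain are all assumptions made for illustration. Language detection itself is assumed to have happened already.

```python
from collections import Counter


def language_shares(pages, exclude=("en",)):
    """Return (language, percentage) pairs, counting each domain once.

    `pages` is an iterable of (domain, language_code) pairs, one per
    crawled page. Hypothetical input shape, chosen for illustration.
    """
    first_lang = {}
    for domain, lang in pages:
        # Assumption: a domain is assigned the first language detected on it.
        first_lang.setdefault(domain, lang)

    # Drop excluded languages (English here) before computing percentages.
    counts = Counter(l for l in first_lang.values() if l not in exclude)
    total = sum(counts.values())
    if total == 0:
        return []

    shares = [(lang, round(100 * n / total, 2)) for lang, n in counts.items()]
    # Highest share first; ties broken alphabetically by language code.
    return sorted(shares, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    sample = [
        ("a.example", "de"),
        ("a.example", "fr"),  # same domain again, ignored
        ("b.example", "de"),
        ("c.example", "cy"),
        ("d.example", "en"),  # excluded from the totals
    ]
    for lang, pct in language_shares(sample):
        print(lang, pct)
```

Because English is dropped before the total is taken, the percentages sum to roughly 100 over the remaining languages, which matches how the table above is described.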

Thanks,
Daniel


Supporting Latin would be cool