Vietnamese is available in the online LibreTranslate but not in the offline version.
I saw that it had an error and was removed a little over a year ago.
Is it going to be made available for the offline version any time soon?
Check this thread which has links to models that might not have been uploaded to the argos-index just yet: OPUS-MT Language Models Port Thread
that works
thanks!!!
This was the bug I removed Vietnamese for.
It would be great to get the Vietnamese model working again but I haven't looked into it much.
got it
The input needs a little sed processing
Is the Stanza issue fixed in newer versions of Stanza?
This does not process:
XE BA BÁNH TỰ CHẾ (phần 5) chạy thử
This does:
XE BA BÁNH TỰ CHẾ phần 5 chạy thử
I don't know if the output is correct,
only that it doesn't produce an error.
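The workaround described here, removing the parentheses before the text reaches the translator, could be sketched in Python roughly as follows. The function name `strip_parentheses` is made up for this example, and it only covers the one failing input reported in this thread; it does not fix the underlying Stanza bug.

```python
import re


def strip_parentheses(text: str) -> str:
    """Replace round brackets with spaces, then collapse repeated whitespace.

    Hypothetical pre-processing step: it mirrors the manual edit shown in
    this thread and says nothing about translation quality.
    """
    without_brackets = re.sub(r"[()]", " ", text)
    return re.sub(r"\s+", " ", without_brackets).strip()


print(strip_parentheses("XE BA BÁNH TỰ CHẾ (phần 5) chạy thử"))
```

Whether dropping the brackets changes the meaning of the translation is a separate question, as noted above.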
Is there a way to force download Vietnamese?
It looks like the bug with Vietnamese got fixed on Stanza's end so I uploaded the Opus-MT model to the index.
Vietnamese should be live now for new LibreTranslate installations:
I'm seeing reports that this might have caused a regression:
It looks like upgrading Stanza to fix the Vietnamese model breaks other languages so I had to revert the commit adding Vietnamese support:
I also released Argos Translate 1.9.3 which pins the Stanza version at 1.1.1.
Is there any plan for adding Vietnamese again?
There's still a bug with Vietnamese and Stanza. We tried to move away from Stanza but couldn't find a good replacement so there's still no Vietnamese support. It may be possible to get Vietnamese to work by using the Stanza model for another language like Chinese.
Aren't we using spaCy in the main branch now? We've been running it for a while, and it works pretty well (and there is Vietnamese support).
I have spaCy on the main branch but I think I'm going to revert it. It doesn't support a lot of the languages we need.
Hello,
Actually, I have had the hybrid spaCy/Stanza PR #460 running in our lab for a few months now, and no problems have occurred. Using spaCy xx, I developed a Swahili-French model, a Tatar-English model to research CometKiwi-based data processing (the language is nominally not supported, but scores are 90+% accurate and my data improvement features work), and a Malay-English model. I'm also starting work on new versions of my earlier models, which inherited some bugs from the data used.
All of these use the spaCy French, English, or multilingual models. They install fairly straightforwardly using the package_from_file.py scripts, and so far sentences have been split as they should be.
As for Vietnamese, I have tried not to use spaCy models that require external dependencies, since most of them are redundant for SBD. If the spaCy vi model we're dealing with is the one developed by trungtv, it falls into this category (moreover, it's a "large" model, and the download instructions do not allow excluding features).
I'll have to work on a Vietnamese-French model later this year; then I can tell whether spaCy multilingual does the job or not.
@argosopentech this is very unfortunate; spaCy has provided a significant boost in performance for our use case. This might be because Stanza 1.1.1, the version we are pinned to, is ancient (circa 2020). Which languages are not supported?
Wondering if it makes sense to keep spaCy for most languages and use a (hopefully upgraded) Stanza for the rest? We're more than happy to assist in testing this on a large scale and perhaps retrain a few models if necessary.
I don't think we even have to retrain the models: training currently processes sentences that are already segmented, so the only thing to do is replace the SBD files within the package.
As for what to use for SBD, I was all about Stanza a year ago, but then I found the reason why Stanza was pinned to 1.1.1: later versions are not backward compatible, which means packages have been published with insufficient testing, which, pardon my French, s****. Then, since I have had requests for languages that Stanza doesn't support, I came to appreciate the way spaCy xx works, insofar as it can split texts written in Cyrillic as well as Latin script without heavy dependencies.
Vietnamese uses an extended Latin alphabet and its punctuation is quite regular, I think, so there's a good chance spaCy xx is the right SBD system for it if Stanza 1.1.1 is not compatible with the language library.
For languages that are written from right to left (Hebrew, Arabic), have non-standard punctuation (Chinese, Armenian, Hindi), or do not use punctuation as commonly understood (Thai), it's quite likely that Stanza works better... and there are even some such languages (Bangla, Khmer, possibly Georgian) without a Stanza tokenizer.
Still, this wouldn't preclude training a prototype model, because the data on OPUS is mostly sentences or phrases, and then working out the SBD issue afterwards by calling an SBD or tokenizer other than Stanza or spaCy from a yet-to-be-defined SBD class.
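To make the "yet-to-be-defined SBD class" idea a bit more concrete, here is a minimal sketch of what such an interface might look like. None of these class names exist in Argos Translate; they are made up for illustration. The spaCy part assumes spaCy v3, where `spacy.blank("xx")` creates an empty multilingual pipeline and `add_pipe("sentencizer")` adds the rule-based sentence splitter, so no model download is needed.

```python
import re
from typing import List, Protocol


class SentenceSplitter(Protocol):
    """Hypothetical interface: anything with a split() method qualifies."""

    def split(self, text: str) -> List[str]: ...


class PunctuationSplitter:
    """Dependency-free fallback: split after '.', '!' or '?' plus whitespace."""

    _boundary = re.compile(r"(?<=[.!?])\s+")

    def split(self, text: str) -> List[str]:
        return [part for part in self._boundary.split(text.strip()) if part]


class SpacyMultilingualSplitter:
    """Wraps spaCy's blank multilingual pipeline with the rule-based sentencizer."""

    def __init__(self) -> None:
        import spacy  # imported lazily so the fallback works without spaCy

        self._nlp = spacy.blank("xx")
        self._nlp.add_pipe("sentencizer")

    def split(self, text: str) -> List[str]:
        return [sent.text.strip() for sent in self._nlp(text).sents]


splitter: SentenceSplitter = PunctuationSplitter()
print(splitter.split("Xin chào. Bạn khỏe không? Tốt!"))
```

A Stanza-backed class could implement the same `split()` method, which would let a package choose its splitter without the rest of the code caring which library is behind it.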
Thanks for the information @NicoLe, maybe you can share some more information regarding what is not backward compatible?
I would rather not speak for the maintainers, but in terms of future maintenance, keeping Stanza at 1.1.1 is not really sustainable. Skimming through the changelog, the main diff is between 1.1.1 and 1.2.0 (tokenize rename), but I'm sure that doesn't cover all the changes.
I mentioned rebuilding/repackaging the models because of issue #400: Stanza is part of the package and, in some cases, processors were missing (probably because the models were updated and are not backward compatible). Perhaps a suitable solution would be to remove the Stanza folders from the argospm index and force stanza.download() on first call. Speaking of Georgian, there is Georgian support in the latest Stanza version.
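For what it's worth, the "download on first call" idea could look roughly like the sketch below. The helper `ensure_tokenizer` is hypothetical, and so is the assumption that Stanza stores tokenizer files under `<resources_dir>/<lang>/tokenize`. The `model_dir` keyword is the name used by recent Stanza releases as far as I know, so check it against whichever version ends up pinned.

```python
from pathlib import Path
from typing import Callable, Optional


def ensure_tokenizer(
    lang: str,
    resources_dir: Path,
    download: Optional[Callable[[str], None]] = None,
) -> bool:
    """Download the tokenizer for `lang` unless it is already on disk.

    Returns True if a download was triggered, False if files were found.
    The directory layout checked here is an assumption, not a documented API.
    """
    if (resources_dir / lang / "tokenize").is_dir():
        return False
    if download is None:
        import stanza  # imported lazily; only needed when nothing is cached

        def download(code: str) -> None:
            stanza.download(code, model_dir=str(resources_dir), processors="tokenize")

    download(lang)
    return True
```

Passing the downloader in as a parameter keeps the function testable without network access; real code would just call `ensure_tokenizer("vi", some_dir)` before building the pipeline.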
Once the maintainers decide which way we are going, I'm happy to assist to the best of my abilities.
As for Georgian being supported, it's a brand-new feature in Stanza 1.10.1: it seems from the release information on github.com that Stanford rebuilt all models from Universal Dependencies 2.15, and there's a flurry of recent Stanza models on huggingface.co to support this.
It looks like the maintainer wants to use CTranslate2 eventually instead of Stanza, spaCy and whatever else, probably because after reading all the Stanza-related issues this looked like the most sensible strategy at some point.
But maybe using Stanza 1.10.1 to complement spaCy whenever necessary would result in a stable and fast product.
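A rough sketch of that kind of routing, assuming spaCy xx is the default and Stanza is used only where it is expected to do better. The language set below is just the guess made earlier in this thread (right-to-left scripts, non-standard punctuation, Thai), written as ISO 639-1 codes; it has not been benchmarked, and the function name is made up.

```python
# Hypothetical list taken from the discussion above, not from any test results.
STANZA_PREFERRED = frozenset({"ar", "he", "zh", "hy", "hi", "th"})


def pick_sbd_backend(lang_code: str) -> str:
    """Return the name of the sentence splitter to use for a language code."""
    if lang_code.lower() in STANZA_PREFERRED:
        return "stanza"
    return "spacy-xx"


for code in ("vi", "th", "fr"):
    print(code, pick_sbd_backend(code))
```

Languages with no Stanza tokenizer at all would simply stay on the default branch.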