Offline Vietnamese

fxkl47BF · December 14, 2023, 4:01pm

Vietnamese is available in the online LibreTranslate but not in the offline version.
I saw that it had an error and was removed a little over a year ago.
Is it going to be made available for the offline version any time soon?

pierotofy · December 15, 2023, 3:02am

Check this thread which has links to models that might not have been uploaded to the argos-index just yet: OPUS-MT Language Models Port Thread

fxkl47BF · December 15, 2023, 2:24pm

that works
thanks!!!

argosopentech · December 15, 2023, 11:07pm

This was the bug I removed Vietnamese for.

It would be great to get the Vietnamese model working again but I haven’t looked into it much.

fxkl47BF · December 15, 2023, 11:22pm

got it
the input needs a little sed processing

argosopentech · December 15, 2023, 11:28pm

Is the Stanza issue fixed in newer versions of Stanza?

fxkl47BF · December 16, 2023, 12:51am

this does not process
XE BA BÁNH TỰ CHẾ (phần 5) chạy thử

this does
XE BA BÁNH TỰ CHẾ phần 5 chạy thử

i don’t know if the output is correct
only that it doesn’t produce an error
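
For anyone who wants the same cleanup without sed, here is a minimal Python sketch. It assumes the parentheses are the only thing tripping the tokenizer, which this thread does not confirm, and `strip_parens` is a made-up helper name, not part of Argos Translate or LibreTranslate.

```python
import re


def strip_parens(text: str) -> str:
    """Drop parenthesis characters but keep the text that was inside them."""
    # Replace each parenthesis with a space so neighbouring words stay separated.
    cleaned = re.sub(r"[()]", " ", text)
    # Collapse the doubled spaces left behind and trim the ends.
    return re.sub(r"\s+", " ", cleaned).strip()


print(strip_parens("XE BA BÁNH TỰ CHẾ (phần 5) chạy thử"))
# prints: XE BA BÁNH TỰ CHẾ phần 5 chạy thử
```

Like the sed approach, this only avoids the error; it says nothing about whether the translation itself is correct.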

HarrisonJung · March 12, 2024, 1:18pm

Is there a way to force download Vietnamese?

argosopentech · March 13, 2024, 3:23pm

It looks like the bug with Vietnamese got fixed on Stanza’s end so I uploaded the Opus-MT model to the index.

Vietnamese should be live now for new LibreTranslate installations:

argosopentech · March 15, 2024, 10:03pm

I’m seeing reports that this might have caused a regression:

github.com/argosopentech/argos-translate

Stanza version >=1.1.1 breaks a few languages

opened 04:37AM - 14 Mar 24 UTC

yudelevi

I spent quite a bit of time figuring out what was wrong. I wanted to upgrade Stanza to the latest 1.8.1, but by default, it overwrites the resources.json file inside the packages.

The list of languages I couldn't get working with stanza 1.8.1: az, bn, eo, ms, sq, tl, zt, tr

While I managed to avoid the files being overwritten by specifying download_method=None and allow_unknown_language=False, the resource file format significantly changed between version 1.8.1 and 1.1.1.

I ended up downgrading until I hit a version that worked and fell back to 1.1.1. I only ran into this after upgrading to 1.9.2 where the stanza version is >=1.2.1

With stanza==1.1.1 (ignore download_method=None, it doesn't exist before 1.4.0):

```
>>> import stanza
>>> p=stanza.Pipeline(
...     lang="az",
...     dir="/home/dyudelevich/.local/share/argos-translate/packages/translate-az_en-1_5/stanza",
...     processors="tokenize",
...     use_gpu=True,
...     logging_level="DEBUG",
...     download_method=None
... )
2024-03-14 04:25:27 DEBUG: Loading resource file...
2024-03-14 04:25:27 DEBUG: Processing parameter "processors"...
2024-03-14 04:25:27 DEBUG: Found tokenize: imst.
2024-03-14 04:25:27 INFO: Loading these models for language: az (Turkish):
=======================
| Processor | Package |
-----------------------
| tokenize  | imst    |
=======================
2024-03-14 04:25:27 INFO: Use device: gpu
2024-03-14 04:25:27 INFO: Loading: tokenize
2024-03-14 04:25:27 DEBUG: With settings:
2024-03-14 04:25:27 DEBUG: {'model_path': '/home/dyudelevich/.local/share/argos-translate/packages/translate-az_en-1_5/stanza/az/tokenize/imst.pt', 'lang': 'az', 'mode': 'predict'}
2024-03-14 04:25:27 INFO: Done loading processors!
```

With stanza==1.2:

```
2024-03-14 04:26:24 INFO: Use device: gpu
2024-03-14 04:26:24 INFO: Loading: tokenize
2024-03-14 04:26:24 DEBUG: With settings:
2024-03-14 04:26:24 DEBUG: {'model_path': '/home/dyudelevich/.local/share/argos-translate/packages/translate-az_en-1_5/stanza/az/tokenize/imst.pt', 'lang': 'az', 'mode': 'predict'}
2024-03-14 04:26:25 INFO: Loading: mwt
2024-03-14 04:26:25 DEBUG: With settings:
2024-03-14 04:26:25 DEBUG: {'model_path': '/home/dyudelevich/.local/share/argos-translate/packages/translate-az_en-1_5/stanza/az/mwt/imst.pt', 'lang': 'az', 'mode': 'predict'}
2024-03-14 04:26:25 ERROR: Cannot load model from /home/dyudelevich/.local/share/argos-translate/packages/translate-az_en-1_5/stanza/az/mwt/imst.pt
Traceback (most recent call last):
  File "/home/dyudelevich/.local/lib/python3.10/site-packages/stanza/pipeline/core.py", line 128, in __init__
    self.processors[processor_name] = NAME_TO_PROCESSOR_CLASS[processor_name](config=curr_processor_config,
  File "/home/dyudelevich/.local/lib/python3.10/site-packages/stanza/pipeline/processor.py", line 155, in __init__
    self._set_up_model(config, use_gpu)
  File "/home/dyudelevich/.local/lib/python3.10/site-packages/stanza/pipeline/mwt_processor.py", line 21, in _set_up_model
    self._trainer = Trainer(model_file=config['model_path'], use_cuda=use_gpu)
  File "/home/dyudelevich/.local/lib/python3.10/site-packages/stanza/models/mwt/trainer.py", line 36, in __init__
    self.load(model_file, use_cuda)
  File "/home/dyudelevich/.local/lib/python3.10/site-packages/stanza/models/mwt/trainer.py", line 141, in load
    checkpoint = torch.load(filename, lambda storage, loc: storage)
  File "/usr/lib/python3/dist-packages/torch/serialization.py", line 791, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/usr/lib/python3/dist-packages/torch/serialization.py", line 271, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/usr/lib/python3/dist-packages/torch/serialization.py", line 252, in __init__
    super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/home/dyudelevich/.local/share/argos-translate/packages/translate-az_en-1_5/stanza/az/mwt/imst.pt'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/dyudelevich/.local/lib/python3.10/site-packages/stanza/pipeline/core.py", line 155, in __init__
    raise FileNotFoundError('Could not find model file %s, although there are other models downloaded for language %s. Perhaps you need to download a specific model. Try: stanza.download(lang="%s",package=None,processors={"%s":"%s"})' % (model_path, lang, lang, processor_name, model_name)) from e
FileNotFoundError: Could not find model file /home/dyudelevich/.local/share/argos-translate/packages/translate-az_en-1_5/stanza/az/mwt/imst.pt, although there are other models downloaded for language az. Perhaps you need to download a specific model. Try: stanza.download(lang="az",package=None,processors={"mwt":"imst"})
```
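
The failure in the report comes down to a processor model file that is not in the package. A rough way to spot this before building a pipeline is to check the files on disk. The sketch below assumes the layout visible in the log paths, `<stanza_dir>/<lang>/<processor>/<package>.pt`; `missing_model_files` is a hypothetical helper, not something Stanza or Argos Translate provides.

```python
from pathlib import Path


def missing_model_files(stanza_dir, lang, processors, package):
    """Return the processors whose model file is absent on disk.

    Assumed layout (taken from the log paths in the issue):
    <stanza_dir>/<lang>/<processor>/<package>.pt
    """
    root = Path(stanza_dir) / lang
    return [
        name
        for name in processors
        if not (root / name / f"{package}.pt").is_file()
    ]
```

For a package like the one in the report, calling it with `lang="az"`, `processors=["tokenize", "mwt"]` and `package="imst"` would be expected to return `["mwt"]`, matching the file the traceback says is missing.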

argosopentech · March 17, 2024, 12:45pm

It looks like upgrading Stanza to fix the Vietnamese model breaks other languages so I had to revert the commit adding Vietnamese support:

I also released Argos Translate 1.9.3 which pins the Stanza version at 1.1.1.
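
To check whether a given environment actually has the pinned version, something like the following could be used. It is only a sketch: `stanza_pin_status` and `PINNED_STANZA` are made-up names, and the comparison is a plain string match rather than a full version-specifier check.

```python
from importlib import metadata

# Version that Argos Translate 1.9.3 pins, according to this thread.
PINNED_STANZA = "1.1.1"


def stanza_pin_status(installed=None):
    """Return 'missing', 'ok' or 'mismatch' for the installed Stanza version.

    Pass `installed` explicitly to check a version string without
    looking at the current environment.
    """
    if installed is None:
        try:
            installed = metadata.version("stanza")
        except metadata.PackageNotFoundError:
            return "missing"
    return "ok" if installed == PINNED_STANZA else "mismatch"
```

Calling `stanza_pin_status()` with no argument inspects the current environment; `stanza_pin_status("1.8.1")` reports `"mismatch"`.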

anh_hoang · April 15, 2025, 4:13pm

Is there any plan for adding Vietnamese again?

argosopentech · April 17, 2025, 11:01am

There’s still a bug with Vietnamese and Stanza. We tried to move away from Stanza but couldn’t find a good replacement so there’s still no Vietnamese support. It may be possible to get Vietnamese to work by using the Stanza model for another language like Chinese.

yudelevi · April 21, 2025, 5:49pm

Aren’t we using spacy in main branch now? We’ve been running it for a while, and it works pretty well (and there is Vietnamese support)

argosopentech · April 23, 2025, 10:06am

I have Spacy on the main branch but I think I’m going to revert it. It doesn’t support a lot of the languages we need

NicoLe · April 23, 2025, 12:19pm

Hello,
Actually, I have had the hybrid spacy/stanza PR#460 running in our lab for a few months now, and no problems have occurred. Using spacy xx, I developed a Swahili-French model, a Tatar-English model to research cometkiwi-based data processing (the language is nominally not supported, but scores are 90+% accurate and my data improvement features work), and a Malay/English model, and I’m starting work on new versions of my former models (which inherited some bugs from the data used).
All of these use spacy for French, English or multilingual; they install rather straightforwardly using package_from_file.py scripts and, so far, it’s been cutting sentences as it should.

As for Vietnamese, I have tried not to use spacy models that require external dependencies, since most of them are redundant for SBD. If the spacy vi we’re dealing with is the one developed by trungtv, it falls into this category (moreover, it’s a “large” model, and the download instructions do not allow excluding features).
I’ll have to work on a model for Vietnamese-French later this year, then I can tell whether spacy multilingual does the job or not.

yudelevi · April 23, 2025, 5:12pm

@argosopentech this is very unfortunate, spacy has provided a significant boost in performance for our use-case. This might be because the version 1.1.1 of stanza that we are pinned to is ancient (circa 2020). What languages are not supported?

Wondering if it makes sense to keep spacy for most languages and handle (hopefully upgraded) stanza for the rest? We’re more than happy to assist in testing this on a large scale and perhaps retrain a few models if necessary.

NicoLe · April 23, 2025, 10:15pm

I don’t think we even have to retrain the models: currently, training processes already-segmented sentences, so the only thing to do is replace the SBD files within the package.
As for what to use for SBD, I was all about stanza a year ago, but then found the reason why stanza was pinned to 1.1.1 (later versions are not backward compatible, which means packages have been published with insufficient testing, which, pardon my French, s****). Then, since I have had requests for languages that stanza doesn’t support, I came to appreciate the way spacy xx works, insofar as it can split texts written in Cyrillic as well as Latin without heavy dependencies.
Vietnamese uses an extended Latin alphabet and its punctuation is quite regular, I think, so there’s a good chance spacy xx is the right SBD system for it if stanza 1.1.1 is not compatible with the language library.
With languages that are written from right to left (Hebrew, Arabic), have non-standard punctuation (Chinese, Armenian, Hindi) or do not use punctuation as commonly understood (Thai), it’s quite likely that stanza works better… and there are even languages (Bangla, Khmer, possibly Georgian) without a stanza tokenizer.
But still, this wouldn’t preclude training a prototype model, because the data on OPUS is principally sentences or phrases, and working out the SBD issue afterwards by calling an SBD tool or tokenizer other than stanza or spacy from a yet-to-be-defined SBD class.
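
As a rough illustration of the split described in this post, a per-language dispatch could look like the sketch below. Everything in it is hypothetical: the function name, the backend labels and the language sets are just this post's guesses turned into ISO 639-1 codes, not a tested list and not existing Argos Translate code. Georgian is left out because its status is only "possibly" above.

```python
# Languages this post suspects are better served by stanza
# (right-to-left scripts, non-standard punctuation, Thai).
STANZA_PREFERRED = {"he", "ar", "zh", "hy", "hi", "th"}

# Languages this post says have no stanza tokenizer (Bangla, Khmer).
NO_STANZA_TOKENIZER = {"bn", "km"}


def choose_sbd_backend(lang_code: str) -> str:
    """Pick a sentence-boundary-detection backend label for a language code."""
    code = lang_code.lower()
    if code in NO_STANZA_TOKENIZER:
        return "custom"  # placeholder for the yet-to-be-defined SBD class
    if code in STANZA_PREFERRED:
        return "stanza"
    return "spacy-xx"  # default: the multilingual spacy pipeline
```

Under these assumptions `choose_sbd_backend("vi")` returns `"spacy-xx"`, which is the guess made above for Vietnamese, while `"th"` maps to `"stanza"`.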

yudelevi · April 24, 2025, 8:37pm

Thanks for the information @NicoLe, maybe you can share some more information regarding what is not backward compatible?

I would rather not speak for the maintainers, but in terms of future maintenance, keeping Stanza at 1.1.1 is not really sustainable. Skimming through the changelog, the main diff is between 1.1.1 and 1.2.0 (tokenize rename), but I’m sure that doesn’t cover all the changes.

I mentioned rebuilding/repackaging the model due to issue #400, stanza is part of the package and in some cases, processors were missing (probably due to the fact the models were updated and are not backward compatible). Perhaps a suitable solution might be to remove stanza folders from the argospm index and force stanza.download() on first call. Speaking of Georgian, there is Georgian support in the latest Stanza version.
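
A minimal sketch of the "download on first call" idea, under some assumptions: the model is considered present if a `<model_dir>/<lang>` directory exists, `ensure_stanza_model` is a made-up name, and the directory is passed to `stanza.download` positionally because, as far as I can tell, the keyword name for it differs between Stanza versions. The downloader can be swapped out, which also makes the logic testable without Stanza installed.

```python
from pathlib import Path


def ensure_stanza_model(lang, model_dir, download=None):
    """Download the Stanza model for `lang` only if it is not on disk yet.

    Returns True if a download was triggered, False if the model
    directory was already there.
    """
    if (Path(model_dir) / lang).is_dir():
        return False
    if download is None:
        # Imported lazily so the on-disk check works without Stanza installed.
        import stanza
        download = stanza.download
    # Assumption: the second positional argument is the model directory.
    download(lang, str(model_dir))
    return True
```

This would run once per language on the first translation call, so packages in the index would no longer need to carry their own stanza folder.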

Once the maintainers decide which way we are going, I’m happy to assist to the best of my abilities.

NicoLe · April 25, 2025, 4:36am

As for Georgian being supported, it’s a brand new feature in stanza 1.10.1: it seems from the release information on github.com that Stanford rebuilt all models from Universal Dependencies 2.15, and there’s a flurry of recent stanza models on huggingface.co to support this.

It looks like the maintainer wants to eventually use CTranslate2 instead of Stanza, Spacy and whatever else, probably because, after reading all the stanza-related issues, this looked like the most sensible strategy at some point.

But maybe using the 1.10.1 version of stanza to complete spacy whenever necessary would result in a stable and fast product.