Error doing Stanza sentence boundary detection in Vietnamese

argosopentech · August 6, 2022, 10:30pm

github.com/argosopentech/argos-translate

ValueError when tokenize some inputs using `Vietnamese → English`

opened 08:04AM - 16 Nov 21 UTC

AutumnSun1996

bug help wanted

A simple example:

```python
from argostranslate.translate import get_installed_languages

languages_list = get_installed_languages()
languages = {l.code: l for l in languages_list}

trans = languages['vi'].get_translation(languages['en'])

text = 'thuc luc di em trai <@!12345>'
res = trans.translate(text)
```

output:

```
Traceback (most recent call last):
  File "test.py", line 9, in <module>
    res = trans.translate(text)
  File "/home/username/.local/lib/python3.7/site-packages/argostranslate/translate.py", line 52, in translate
    return self.hypotheses(input_text, num_hypotheses=1)[0].value
  File "/home/username/.local/lib/python3.7/site-packages/argostranslate/translate.py", line 275, in hypotheses
    paragraph, num_hypotheses
  File "/home/username/.local/lib/python3.7/site-packages/argostranslate/translate.py", line 160, in hypotheses
    self.pkg, paragraph, self.translator, num_hypotheses
  File "/home/username/.local/lib/python3.7/site-packages/argostranslate/translate.py", line 385, in apply_packaged_translation
    stanza_sbd = stanza_pipeline(input_text)
  File "/home/username/.local/lib/python3.7/site-packages/stanza/pipeline/core.py", line 166, in __call__
    doc = self.process(doc)
  File "/home/username/.local/lib/python3.7/site-packages/stanza/pipeline/core.py", line 160, in process
    doc = self.processors[processor_name].process(doc)
  File "/home/username/.local/lib/python3.7/site-packages/stanza/pipeline/tokenize_processor.py", line 88, in process
    no_ssplit=self.config.get('no_ssplit', False))
  File "/home/username/.local/lib/python3.7/site-packages/stanza/models/tokenize/utils.py", line 165, in output_predictions
    st0 = text.index(part, char_offset) - char_offset
ValueError: substring not found
```

The bug only occurs for vi->en, thus should be related to the model used by stanza.
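
The last frame of the traceback is a plain `str.index` call, which raises `ValueError: substring not found` whenever the text it is asked to locate does not occur verbatim in the input after the current offset. Below is a minimal sketch of that mechanism. `locate_parts` is a hypothetical helper that only mirrors the `text.index(part, char_offset)` pattern; it is not Stanza's code, and the mismatching part in the second call is made up for illustration, since the thread does not show what the Vietnamese model actually produced.

```python
def locate_parts(text, parts):
    """Return the start offset of each part, searching left to right.

    Hypothetical helper mirroring the `text.index(part, char_offset)`
    pattern from the traceback; it is not Stanza's implementation.
    """
    offsets = []
    char_offset = 0
    for part in parts:
        start = text.index(part, char_offset)  # raises ValueError if absent
        offsets.append(start)
        char_offset = start + len(part)
    return offsets


text = "thuc luc di em trai <@!12345>"

# Parts that occur verbatim in the text are located without error.
print(locate_parts(text, ["thuc", "luc", "di"]))

# A part that differs from the input (an assumed, made-up variant of the
# last token) cannot be found, so str.index raises ValueError.
try:
    locate_parts(text, ["trai", "<@! 12345>"])
except ValueError as err:
    print("ValueError:", err)
```

If this is indeed the mechanism, any input for which the tokenizer's output no longer matches the original characters would trigger the same error, which fits the report that only some inputs fail.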

argosopentech · March 13, 2024, 2:53pm

It looks like the Stanza bug for Vietnamese should be fixed now:

github.com/stanfordnlp/stanza

ValueError: substring not found

opened 02:03AM - 22 Nov 20 UTC

closed 11:49AM - 11 Dec 20 UTC

pipiman

bug fixed on dev

**Describe the bug**
When using the Vietnamese POS processor, this problem occurs.

**To Reproduce**
Steps to reproduce the behavior:
1. Read the sentences `s`;
2. Call `nlp(s)`;
3. `ValueError: substring not found` is raised.

**Environment (please complete the following information):**
- OS: CentOS
- Python version: Python 3.6.8
- Stanza version: 1.1.1

I incremented the minimum Stanza version number to fix this and released the fix in Argos Translate v1.9.2.
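
For anyone who wants to confirm which Stanza version their environment actually resolved to, here is a rough sketch using only the standard library. The helper names are hypothetical, and the `1.2.1` minimum is an assumption taken from the `>=1.2.1` figure quoted in the follow-up report later in this thread, not from the Argos Translate source.

```python
from importlib import metadata


def parse_version(version):
    """Turn '1.2.1' into (1, 2, 1); a non-numeric suffix ends the parse."""
    numbers = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        numbers.append(int(digits))
    return tuple(numbers)


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '1.2' compares equal to '1.2.0'.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_stanza_version():
    """Return the installed Stanza version string, or None if missing."""
    try:
        return metadata.version("stanza")
    except metadata.PackageNotFoundError:
        return None


MINIMUM_STANZA = "1.2.1"  # assumption: figure quoted later in this thread

version = installed_stanza_version()
if version is None:
    print("stanza is not installed")
elif meets_minimum(version, MINIMUM_STANZA):
    print(f"stanza {version} satisfies >= {MINIMUM_STANZA}")
else:
    print(f"stanza {version} is older than {MINIMUM_STANZA}")
```

The comparison is deliberately simple and ignores pre-release ordering; a packaging library would be the more robust choice in real code.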

argosopentech · March 17, 2024, 12:45pm

github.com/argosopentech/argos-translate

Stanza version >=1.1.1 breaks a few languages

opened 04:37AM - 14 Mar 24 UTC

yudelevi

I spent quite a bit of time figuring out what was wrong. I wanted to upgrade Stanza to the latest 1.8.1, but by default, it overwrites the resources.json file inside the packages.

The list of languages I couldn't get working with stanza 1.8.1: az, bn, eo, ms, sq, tl, zt, tr

While I managed to avoid the files being overwritten by specifying `download_method=None` and `allow_unknown_language=False`, the resource file format significantly changed between version 1.8.1 and 1.1.1.

I ended up downgrading until I hit a version that worked and fell back to 1.1.1. I only ran into this after upgrading to 1.9.2 where the stanza version is >=1.2.1.

With stanza==1.1.1 (ignore `download_method=None`, it doesn't exist before 1.4.0):

```
>>> import stanza
>>> p=stanza.Pipeline(
...     lang="az",
...     dir="/home/dyudelevich/.local/share/argos-translate/packages/translate-az_en-1_5/stanza",
...     processors="tokenize",
...     use_gpu=True,
...     logging_level="DEBUG",
...     download_method=None
... )
2024-03-14 04:25:27 DEBUG: Loading resource file...
2024-03-14 04:25:27 DEBUG: Processing parameter "processors"...
2024-03-14 04:25:27 DEBUG: Found tokenize: imst.
2024-03-14 04:25:27 INFO: Loading these models for language: az (Turkish):
=======================
| Processor | Package |
-----------------------
| tokenize  | imst    |
=======================
2024-03-14 04:25:27 INFO: Use device: gpu
2024-03-14 04:25:27 INFO: Loading: tokenize
2024-03-14 04:25:27 DEBUG: With settings:
2024-03-14 04:25:27 DEBUG: {'model_path': '/home/dyudelevich/.local/share/argos-translate/packages/translate-az_en-1_5/stanza/az/tokenize/imst.pt', 'lang': 'az', 'mode': 'predict'}
2024-03-14 04:25:27 INFO: Done loading processors!
```

With stanza==1.2:

```
2024-03-14 04:26:24 INFO: Use device: gpu
2024-03-14 04:26:24 INFO: Loading: tokenize
2024-03-14 04:26:24 DEBUG: With settings:
2024-03-14 04:26:24 DEBUG: {'model_path': '/home/dyudelevich/.local/share/argos-translate/packages/translate-az_en-1_5/stanza/az/tokenize/imst.pt', 'lang': 'az', 'mode': 'predict'}
2024-03-14 04:26:25 INFO: Loading: mwt
2024-03-14 04:26:25 DEBUG: With settings:
2024-03-14 04:26:25 DEBUG: {'model_path': '/home/dyudelevich/.local/share/argos-translate/packages/translate-az_en-1_5/stanza/az/mwt/imst.pt', 'lang': 'az', 'mode': 'predict'}
2024-03-14 04:26:25 ERROR: Cannot load model from /home/dyudelevich/.local/share/argos-translate/packages/translate-az_en-1_5/stanza/az/mwt/imst.pt
Traceback (most recent call last):
  File "/home/dyudelevich/.local/lib/python3.10/site-packages/stanza/pipeline/core.py", line 128, in __init__
    self.processors[processor_name] = NAME_TO_PROCESSOR_CLASS[processor_name](config=curr_processor_config,
  File "/home/dyudelevich/.local/lib/python3.10/site-packages/stanza/pipeline/processor.py", line 155, in __init__
    self._set_up_model(config, use_gpu)
  File "/home/dyudelevich/.local/lib/python3.10/site-packages/stanza/pipeline/mwt_processor.py", line 21, in _set_up_model
    self._trainer = Trainer(model_file=config['model_path'], use_cuda=use_gpu)
  File "/home/dyudelevich/.local/lib/python3.10/site-packages/stanza/models/mwt/trainer.py", line 36, in __init__
    self.load(model_file, use_cuda)
  File "/home/dyudelevich/.local/lib/python3.10/site-packages/stanza/models/mwt/trainer.py", line 141, in load
    checkpoint = torch.load(filename, lambda storage, loc: storage)
  File "/usr/lib/python3/dist-packages/torch/serialization.py", line 791, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/usr/lib/python3/dist-packages/torch/serialization.py", line 271, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/usr/lib/python3/dist-packages/torch/serialization.py", line 252, in __init__
    super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/home/dyudelevich/.local/share/argos-translate/packages/translate-az_en-1_5/stanza/az/mwt/imst.pt'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/dyudelevich/.local/lib/python3.10/site-packages/stanza/pipeline/core.py", line 155, in __init__
    raise FileNotFoundError('Could not find model file %s, although there are other models downloaded for language %s. Perhaps you need to download a specific model. Try: stanza.download(lang="%s",package=None,processors={"%s":"%s"})' % (model_path, lang, lang, processor_name, model_name)) from e
FileNotFoundError: Could not find model file /home/dyudelevich/.local/share/argos-translate/packages/translate-az_en-1_5/stanza/az/mwt/imst.pt, although there are other models downloaded for language az. Perhaps you need to download a specific model. Try: stanza.download(lang="az",package=None,processors={"mwt":"imst"})
```
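
The second log fails because the pipeline looks for an `mwt` model that the package does not ship. One way to see this before constructing a pipeline is to list which model files a package actually contains. The sketch below uses only the standard library and assumes the `<stanza dir>/<lang>/<processor>/<package>.pt` layout visible in the log paths above. Both helper names are hypothetical, and the example path is a placeholder.

```python
from pathlib import Path


def available_models(stanza_dir, lang):
    """Map each processor directory to the .pt model files found in it.

    Assumes the <stanza_dir>/<lang>/<processor>/<package>.pt layout that
    appears in the log paths; other layouts are not handled.
    """
    lang_dir = Path(stanza_dir) / lang
    models = {}
    if not lang_dir.is_dir():
        return models
    for processor_dir in sorted(lang_dir.iterdir()):
        if processor_dir.is_dir():
            files = sorted(p.name for p in processor_dir.glob("*.pt"))
            if files:
                models[processor_dir.name] = files
    return models


def missing_processors(stanza_dir, lang, wanted):
    """Return the processors in `wanted` that have no model file on disk."""
    found = available_models(stanza_dir, lang)
    return [name for name in wanted if name not in found]


# Placeholder path: substitute the stanza directory of an installed package.
package_stanza_dir = "packages/translate-az_en-1_5/stanza"
print(available_models(package_stanza_dir, "az"))
print(missing_processors(package_stanza_dir, "az", ["tokenize", "mwt"]))
```

For the package in the report, this would be expected to list `mwt` as missing, matching the `FileNotFoundError` above.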