Sentence Boundary Detection for Machine Translation

I manually tested a variety of current Argos Translate languages to see which ones the xx_sent_ud_sm model could successfully segment into sentences:

Arabic works
Chinese does not work
Dutch works
Irish works
Korean works
Russian works
Thai does not work
Turkish works
Urdu does not work
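
For anyone who wants to reproduce this, a minimal spot check looks roughly like the sketch below (it assumes the model has been installed with `python -m spacy download xx_sent_ud_sm`):

    import spacy

    # xx_sent_ud_sm is a small multilingual pipeline that only does
    # sentence segmentation
    nlp = spacy.load("xx_sent_ud_sm")
    doc = nlp("Привет, мир. Как дела?")  # Russian, one of the languages that works
    print([sent.text for sent in doc.sents])  # expect two sentences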

Chinese not working with this model is a known issue; however, there is a dedicated Chinese Spacy model that does work (zh_core_web_sm).

I try pretty hard to minimize the dependencies for Argos Translate, so I’m attracted to the idea of dropping the need for Stanza if possible. I think the xx_sent_ud_sm model is probably good enough for most languages, and I can try to fix languages like Thai that don’t work using one-off libraries (like pythainlp) that are lighter weight than Stanza. I’m also okay with dropping support for a small number of languages (like possibly Urdu) if it makes the codebase higher performance and easier to maintain. It may also be worth looking into adding support to Spacy for the languages we need and then contributing it back to the Spacy project so that we can use it (Spacy training documentation).
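
For example, the Thai special-casing could look something like this (a sketch only; it assumes pythainlp's sent_tokenize with its default engine is good enough):

    from pythainlp.tokenize import sent_tokenize

    # pythainlp handles Thai, which has no sentence-final punctuation
    text = "ฉันไปโรงเรียน วันนี้อากาศดีมาก"
    print(sent_tokenize(text))  # expect a list of Thai sentence strings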

Do you incorporate language detection into the process? If not, that’s a way to get more specialized solutions for different languages. A one-size-fits-all solution is difficult to find.

Argos Translate always knows what languages it’s translating from/to. Language detection (“auto” as the source language) is a feature in LibreTranslate, and LibreTranslate passes the source language code to Argos Translate. So for Chinese I think I’m just going to use a different Spacy model when the user is translating Chinese source text.
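
Something like this illustrative sketch would cover the Chinese case (this is not the actual Argos Translate code, and the mapping contents are just an assumption):

    import spacy

    # map ISO 639 codes that have a dedicated Spacy model to that model,
    # and fall back to the multilingual sentence segmenter otherwise
    DEDICATED_SBD_MODELS = {"zh": "zh_core_web_sm"}

    def load_sbd_model(source_lang_code):
        name = DEDICATED_SBD_MODELS.get(source_lang_code, "xx_sent_ud_sm")
        return spacy.load(name)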

From my Discord server:

I’ve done some testing on the latest main, and for my use case (larger documents) the results with Spacy are amazing! I’m not sure of the exact cause (I can guess it’s because the old Stanza version had suboptimal dependencies), but throughput is up ~2-3x on CPU. Another observation is that I’m no longer seeing any improvement running on GPU (it was ~2-3x faster before).

I’m planning to delay releasing Argos Translate 1.10 and the Spacy change for a bit. I want to do more testing and figure out solutions for the languages Spacy doesn’t support well. I’ve also thought about trying to train my own model for Spacy with the language support we need.

A 2-3x performance increase seems like a very good improvement!

I keep wondering if there’s a way to keep the Stanza models for languages that aren’t supported, but without using the stanza package, maybe via ONNX or similar. I’ll try to find some time one of these days to investigate.

I managed to run a proof of concept on Stanza’s Italian tokenizer, removing PyTorch inference from the loop. :partying_face:

  1. I edited Stanza’s code to run the ONNX exporter, which gave me an ONNX model.

  2. I removed the PyTorch inference code and replaced it with the newly exported ONNX model. The results were equivalent, apart from some small numerical differences:
    [[{'id': (1,), 'text': 'Questa', 'misc': 'start_char=0|end_char=6'}, {'id': (2,), 'text': "e'", 'misc': 'start_char=7|end_char=9'}, {'id': (3,), 'text': 'una', 'misc': 'start_char=10|end_char=13'}, {'id': (4,), 'text': 'frase', 'misc': 'start_char=14|end_char=19'}, {'id': (5,), 'text': '.', 'misc': 'start_char=19|end_char=20'}], [{'id': (1,), 'text': 'Questa', 'misc': 'start_char=21|end_char=27'}, {'id': (2,), 'text': "e'", 'misc': 'start_char=28|end_char=30'}, {'id': (3,), 'text': 'un', 'misc': 'start_char=31|end_char=33'}, {'id': (4,), 'text': 'altra', 'misc': 'start_char=34|end_char=39'}, {'id': (5,), 'text': '.', 'misc': 'start_char=39|end_char=40'}]]
    ["Questa e' una frase.", "Questa e' un altra."]

In a nutshell, I edited Stanza’s trainer.py as follows:

    def predict(self, inputs):
        self.model.eval()
        units, labels, features, _ = inputs
        if self.use_cuda:
            units = units.cuda()
            labels = labels.cuda()
            features = features.cuda()

        pred = self.model(units, features)

        # One-time export: uncomment this block (including the exit) to
        # write the ONNX model to disk, then rerun without it.
        # torch.onnx.export(self.model,  # model being run
        #     (units, features),  # model input (or a tuple for multiple inputs)
        #     "/home/piero/Downloads/staging/it.onnx",  # where to save the model
        #     export_params=True,  # store the trained parameter weights inside the model file
        #     opset_version=10,  # the ONNX version to export the model to
        #     do_constant_folding=True,  # whether to execute constant folding for optimization
        #     input_names=['units', 'features'],  # the model's input names
        #     output_names=['modelOutput'],  # the model's output names
        # )
        # exit(1)

        # Proof of concept only: in real code the InferenceSession should
        # be created once and reused, not rebuilt on every predict() call.
        import onnxruntime
        ort_session = onnxruntime.InferenceSession(
            "/home/piero/Downloads/staging/it.onnx",
            providers=["CPUExecutionProvider"],
        )

        def to_numpy(tensor):
            return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

        # compute ONNX Runtime output prediction
        ort_inputs = {'units': to_numpy(units), 'features': to_numpy(features)}
        ort_outs = ort_session.run(None, ort_inputs)

        # compare ONNX Runtime and PyTorch results
        # np.testing.assert_allclose(to_numpy(pred), ort_outs[0], rtol=1e-03, atol=1e-05)

        # return ort_outs[0]

        return pred.data.cpu().numpy()

Uncommenting the second-to-last line (return ort_outs[0]) makes predict() return the ONNX output rather than the PyTorch one.
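
As a follow-up check, the exported file can be loaded with onnxruntime alone, with no PyTorch in the environment at all (the path is shortened here; this sketch just prints what the exported graph expects as inputs):

    import onnxruntime

    # load the exported model and list its expected input names/shapes/types
    session = onnxruntime.InferenceSession("it.onnx", providers=["CPUExecutionProvider"])
    for inp in session.get_inputs():
        print(inp.name, inp.shape, inp.type)  # should show 'units' and 'features'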

Hello,
All the models I’ve made with Locomotive use stanza so far, and since they have to run primarily on Triton as a speech-to-text/translation/subtitling ensemble, we could not simply use LibreTranslate for inference. So we coded our own stanza/sp/ct2 pipeline.

However, we have to work out a complex strategy (BLS) for general pivoting, so I have decided to run an additional LT server for odd use cases and ensemble models for the few cases where we always need to pivot. For this to work as intended, I need to get LT to support stanza as a fallback to spacy, or vice versa, and you actually need it too.

As of two weeks ago, I found out that any language with unusual punctuation will not work with the multilingual Spacy model. Hence not only th and hy, but also hi and all the other languages written in India, are affected. This starts to add up quite a bit, and there is no “spacyfic” model (as there is for zh or ko) for any of them.

So far, I have managed to get Locomotive to fall back to spacy when there is no stanza model, and my teammate coded stanza segmentation on the Triton server. He will soon code spacy segmentation for language pairs that are in my pipeline but don’t have stanza, so I’ll clone my LT lab and work out with him in the coming weeks how to get the best of both worlds.
(With models to date already embedding the stanza lib for zh and ko, I’d say we can keep things legacy, use spacy as the fallback when no stanza model exists rather than the other way round, and not suffer too much from it; but it’s your project, so feel free to tell me what you favor.)

I think I’m going to revert my Spacy changes in the master branch and put them into a separate feature branch for now. I’ll make an Argos Translate 1.10 version without the Spacy changes.

I want to take a bit of a “back to the drawing board” approach, so to speak, and find a good sentence boundary detection solution that will work well for Argos Translate and LibreTranslate. I’ve thought about training a Spacy model myself; using multiple sentence segmentation systems for different languages; rules-based systems; or building my own system. I still have the SBD system I built using a CTranslate2 model, and it has a lot of room for easy improvement. I’ve also thought about trying to use rules-based systems to create synthetic data to train neural networks on.
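
As one illustration of the synthetic-data idea (pysbd here is just an example of a rules-based segmenter, not something settled on in this thread), a rules-based system can label raw text with sentence spans that a neural model could then be trained on:

    import pysbd

    # char_span=True makes segment() return TextSpan objects with
    # character offsets instead of plain strings
    segmenter = pysbd.Segmenter(language="en", clean=False, char_span=True)

    def make_training_example(text):
        # produce a (text, list of (start, end) sentence boundaries) pair
        spans = segmenter.segment(text)
        return text, [(span.start, span.end) for span in spans]

    print(make_training_example("Dr. Smith arrived. He was late."))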

I think that this would work pretty well.

Since I was the one who originally posted about the results on Discord, I should probably add that there might be a bit of bias in my testing. I’ve noticed that some translations are truncated compared to historical translations, and that led me to discover a bug: Degraded translation in main compared to packaged 1.9.6 for fr->en translation · Issue #456 · argosopentech/argos-translate · GitHub.

Looking at the sentences, they are split/tokenized the same between versions, but one produced a truncated result.
I played around with removing everything except letters, numbers, accents, and the -?!.,'" symbols, and the results were pretty similar between the main branch and 1.9.6 (about 2.5 documents/second, average length 1000 words).

I did some pre-processing on my end to replace suspicious characters with . and re-evaluated; the main branch is still ~25% faster than the 1.9.6 package (so it’s still a significant improvement).
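
Roughly, the pre-processing looked like this (a sketch; the exact character allowlist is an assumption):

    import re

    # replace any character outside \w, whitespace, or -?!.,'" with a period
    SUSPICIOUS = re.compile(r"""[^\w\s\-?!.,'"]""")

    def sanitize(text):
        return SUSPICIOUS.sub(".", text)

    print(sanitize("Hello • world…"))  # -> 'Hello . world.'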

Hi,
I recoded the SBD, forking the master branch to support both stanza and spacy SBD.
Since this was controlled through an environment variable in 1.9.6 and should now be package-dependent, I did as follows…

  1. In package.py, use an if branch to scan the package and define the value of a new “sbd_package” property. The code mirrors the one used for tokenizers.

  2. In sbd.py, defined an alternate stanza class “StanzaSentencizer” that mirrors “SpacySentencizerSmall” in its input and output (a rough sketch follows after this list).

  3. In translate.py, added a “sentencizer” property to the packaged translation class constructor that initializes itself from pkg.sbd_package (I used a setter construct before, but I decided to simplify this, since it duplicated the code in package.py).
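
For reference, a rough sketch of what the StanzaSentencizer in step 2 could look like (the split() interface and constructor here are assumptions, not the actual PR code):

    import stanza

    class StanzaSentencizer:
        """Stanza-backed sentencizer mirroring SpacySentencizerSmall."""

        def __init__(self, lang):
            # tokenize is the only processor needed for sentence splitting
            self.nlp = stanza.Pipeline(lang=lang, processors="tokenize")

        def split(self, text):
            return [sentence.text for sentence in self.nlp(text).sentences]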

I debugged the code with PyCharm Pro, but I can’t manage to run it from the CLI in the Windows conda env where I installed argos-translate. Could you lend me a hand?

Thanks in advance,

In the meantime, I realized that I can always execute the CLI directly on my self-hosted LibreTranslate lab, and I have been debugging further with it.
I opened a topic about the issue I ran into, because I am not a seasoned object-oriented programmer… but after some effort I can manage just fine.

The code I produced for Argos is in the PR pipeline; it does both Stanza and Spacy.
For performance, I included the possibility of packaging Argos models with a language-specific Spacy model when one exists, since it runs faster than stanza.
Since I had an internal error in my LT lab, I ran an up-to-date LT environment in debug mode locally on my workstation. It is functional with all three package types (Stanza/Spacy-packaged/Spacy-generic).
The problem with the online instance is probably down to a proxy or WSGI wrapper error.