Sentence Boundary Detection for Machine Translation

I manually tested a variety of current Argos Translate languages to see which ones the xx_sent_ud_sm model could successfully segment into sentences:

Arabic works
Chinese does not work
Dutch works
Irish works
Korean works
Russian works
Thai does not work
Turkish works
Urdu does not work
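
For anyone who wants to reproduce this, a minimal spot check looks roughly like the sketch below (it assumes the model has been installed with `python -m spacy download xx_sent_ud_sm`):

    import spacy

    # xx_sent_ud_sm is a small multilingual pipeline that only does
    # sentence segmentation
    nlp = spacy.load("xx_sent_ud_sm")
    doc = nlp("Привет, мир. Как дела?")  # Russian, one of the languages that works
    print([sent.text for sent in doc.sents])  # expect two sentences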

Chinese not working with this model is a known issue; however, there is a dedicated Chinese Spacy model that does work (zh_core_web_sm).

I try pretty hard to minimize the dependencies for Argos Translate, so I’m attracted to the idea of dropping the need for Stanza if possible. I think the xx_sent_ud_sm model is probably good enough for most languages, and I can try to fix languages like Thai that don’t work using one-off libraries (like pythainlp) that are lighter weight than Stanza. I’m also okay with dropping support for a small number of languages (like possibly Urdu) if it makes the codebase higher performance and easier to maintain. It may also be worth looking into adding support to Spacy for the languages we need and then contributing it back to the Spacy project so that we can use it (Spacy training documentation).
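
For example, the Thai special-casing could look something like this (a sketch only; it assumes pythainlp's sent_tokenize with its default engine is good enough):

    from pythainlp.tokenize import sent_tokenize

    # pythainlp handles Thai, which has no sentence-final punctuation
    text = "ฉันไปโรงเรียน วันนี้อากาศดีมาก"
    print(sent_tokenize(text))  # expect a list of Thai sentence strings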

Do you incorporate language detection into the process? If not, that’s a way to get more specialized solutions for different languages. A one-size-fits-all solution is difficult to find.

Argos Translate always knows what languages it’s translating from/to. Language detection (“auto” as the source language) is a feature in LibreTranslate, and LibreTranslate passes the source language code to Argos Translate. So for Chinese I think I’m just going to use a different Spacy model when the user is translating Chinese source text.
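
Something like this illustrative sketch would cover the Chinese case (this is not the actual Argos Translate code, and the mapping contents are just an assumption):

    import spacy

    # map ISO 639 codes that have a dedicated Spacy model to that model,
    # and fall back to the multilingual sentence segmenter otherwise
    DEDICATED_SBD_MODELS = {"zh": "zh_core_web_sm"}

    def load_sbd_model(source_lang_code):
        name = DEDICATED_SBD_MODELS.get(source_lang_code, "xx_sent_ud_sm")
        return spacy.load(name)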

From my Discord server:

I’ve done some testing on the latest main, and for my use case (larger documents) the results with Spacy are amazing! I’m not sure of the exact cause (I can guess it’s because the old Stanza version had suboptimal dependencies), but throughput is up ~2-3x on CPU. Another observation is that I’m no longer seeing any improvement running on GPU (it was ~2-3x faster before).

I’m planning to delay releasing Argos Translate 1.10 and the Spacy change for a bit. I want to do more testing and figure out solutions for the languages Spacy doesn’t support well. I’ve also thought about trying to train my own model for Spacy with the language support we need.

A 2-3x performance increase seems like a very good improvement!

I keep wondering if there’s a way to keep the Stanza models for languages that aren’t supported, but without using the stanza package, maybe via ONNX or similar. I’ll try to find some time one of these days to investigate.

I managed to run a proof of concept on Stanza’s Italian tokenizer, removing PyTorch inference from the loop. :partying_face:

  1. I edited Stanza’s code to run the ONNX exporter, which gave me an ONNX model.

  2. I removed the PyTorch inference code and replaced it with the newly exported ONNX model. The results were equivalent, apart from some small numerical differences:
    [[{'id': (1,), 'text': 'Questa', 'misc': 'start_char=0|end_char=6'}, {'id': (2,), 'text': "e'", 'misc': 'start_char=7|end_char=9'}, {'id': (3,), 'text': 'una', 'misc': 'start_char=10|end_char=13'}, {'id': (4,), 'text': 'frase', 'misc': 'start_char=14|end_char=19'}, {'id': (5,), 'text': '.', 'misc': 'start_char=19|end_char=20'}], [{'id': (1,), 'text': 'Questa', 'misc': 'start_char=21|end_char=27'}, {'id': (2,), 'text': "e'", 'misc': 'start_char=28|end_char=30'}, {'id': (3,), 'text': 'un', 'misc': 'start_char=31|end_char=33'}, {'id': (4,), 'text': 'altra', 'misc': 'start_char=34|end_char=39'}, {'id': (5,), 'text': '.', 'misc': 'start_char=39|end_char=40'}]]
    ["Questa e' una frase.", "Questa e' un altra."]

In a nutshell, I edited Stanza’s trainer.py as follows:

    def predict(self, inputs):
        self.model.eval()
        units, labels, features, _ = inputs
        if self.use_cuda:
            units = units.cuda()
            labels = labels.cuda()
            features = features.cuda()

        pred = self.model(units, features)

        # One-time export: uncomment this block (including the exit) to
        # write the ONNX model to disk, then rerun without it.
        # torch.onnx.export(self.model,  # model being run
        #     (units, features),  # model input (or a tuple for multiple inputs)
        #     "/home/piero/Downloads/staging/it.onnx",  # where to save the model
        #     export_params=True,  # store the trained parameter weights inside the model file
        #     opset_version=10,  # the ONNX version to export the model to
        #     do_constant_folding=True,  # whether to execute constant folding for optimization
        #     input_names=['units', 'features'],  # the model's input names
        #     output_names=['modelOutput'],  # the model's output names
        # )
        # exit(1)

        # Proof of concept only: in real code the InferenceSession should
        # be created once and reused, not rebuilt on every predict() call.
        import onnxruntime
        ort_session = onnxruntime.InferenceSession(
            "/home/piero/Downloads/staging/it.onnx",
            providers=["CPUExecutionProvider"],
        )

        def to_numpy(tensor):
            return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

        # compute ONNX Runtime output prediction
        ort_inputs = {'units': to_numpy(units), 'features': to_numpy(features)}
        ort_outs = ort_session.run(None, ort_inputs)

        # compare ONNX Runtime and PyTorch results
        # np.testing.assert_allclose(to_numpy(pred), ort_outs[0], rtol=1e-03, atol=1e-05)

        # return ort_outs[0]

        return pred.data.cpu().numpy()

Uncommenting the second-to-last line (return ort_outs[0]) makes predict() return the ONNX output rather than the PyTorch one.
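
As a follow-up check, the exported file can be loaded with onnxruntime alone, with no PyTorch in the environment at all (the path is shortened here; this sketch just prints what the exported graph expects as inputs):

    import onnxruntime

    # load the exported model and list its expected input names/shapes/types
    session = onnxruntime.InferenceSession("it.onnx", providers=["CPUExecutionProvider"])
    for inp in session.get_inputs():
        print(inp.name, inp.shape, inp.type)  # should show 'units' and 'features'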

Hello,
All the models I’ve made with Locomotive use stanza so far, and since they have to run primarily on Triton as a speech-to-text/translation/subtitling ensemble, we could not simply use LibreTranslate for inference. So we coded our own stanza/sp/ct2 pipeline.

However, we have to work out a complex strategy (BLS) for general pivoting, so I have decided to run an additional LT server for odd use cases and ensemble models for the few cases where we always need to pivot. For this to work as intended, I need to get LT to support stanza as a fallback to spacy, or vice versa, and you actually need it too.

As of two weeks ago, I found out that any language with unusual punctuation will not work with the multilingual Spacy model. Hence not only th and hy, but also hi and all the other languages written in India, are affected. This starts to add up quite a bit, and there is no “spacyfic” model (as there is for zh or ko) for any of them.

So far, I have managed to get Locomotive to fall back to spacy when there is no stanza model, and my teammate coded stanza segmentation on the Triton server. He will soon code spacy segmentation for language pairs that are in my pipeline but don’t have stanza, so I’ll clone my LT lab and work out with him in the coming weeks how to get the best of both worlds.
(With models to date already embedding the stanza lib for zh and ko, I’d say we can keep things legacy, use spacy as the fallback when no stanza model exists rather than the other way round, and not suffer too much from it; but it’s your project, so feel free to tell me what you favor.)

I think I’m going to revert my Spacy changes in the master branch and put them into a separate feature branch for now. I’ll make an Argos Translate 1.10 version without the Spacy changes.

I want to take a bit of a “back to the drawing board” approach, so to speak, and find a good sentence boundary detection solution that will work well for Argos Translate and LibreTranslate. I’ve thought about training a Spacy model myself; using multiple sentence segmentation systems for different languages; rules-based systems; or building my own system. I still have the SBD system I built using a CTranslate2 model, and it has a lot of room for easy improvement. I’ve also thought about trying to use rules-based systems to create synthetic data to train neural networks on.
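
As one illustration of the synthetic-data idea (pysbd here is just an example of a rules-based segmenter, not something settled on in this thread), a rules-based system can label raw text with sentence spans that a neural model could then be trained on:

    import pysbd

    # char_span=True makes segment() return TextSpan objects with
    # character offsets instead of plain strings
    segmenter = pysbd.Segmenter(language="en", clean=False, char_span=True)

    def make_training_example(text):
        # produce a (text, list of (start, end) sentence boundaries) pair
        spans = segmenter.segment(text)
        return text, [(span.start, span.end) for span in spans]

    print(make_training_example("Dr. Smith arrived. He was late."))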

I think that this would work pretty well.

Since I was the one who originally posted about the results on Discord, I should probably add that there might be a bit of bias in my testing. I’ve noticed that some translations are truncated compared to historical translations, and that led me to discover a bug: Degraded translation in main compared to packaged 1.9.6 for fr->en translation · Issue #456 · argosopentech/argos-translate · GitHub.

Looking at the sentences, they are split/tokenized the same between versions, but one produced a truncated result.
I played around with removing everything except letters, numbers, accents, and the -?!.,'" symbols, and the results were pretty similar between the main branch and 1.9.6 (about 2.5 documents/second, average length 1000 words).

I did some pre-processing on my end to replace suspicious characters with . and re-evaluated; the main branch is still ~25% faster than the 1.9.6 package (so it’s still a significant improvement).
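
Roughly, the pre-processing looked like this (a sketch; the exact character allowlist is an assumption):

    import re

    # replace any character outside \w, whitespace, or -?!.,'" with a period
    SUSPICIOUS = re.compile(r"""[^\w\s\-?!.,'"]""")

    def sanitize(text):
        return SUSPICIOUS.sub(".", text)

    print(sanitize("Hello • world…"))  # -> 'Hello . world.'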

Hi,
I recoded the SBD, forking the master branch to support both stanza and spacy SBD.
Since this was controlled through an environment variable in 1.9.6 and should now be package-dependent, I did as follows…

  1. In package.py, use an if branch to scan the package and define the value of a new “sbd_package” property. The code mirrors the one used for tokenizers.

  2. In sbd.py, defined an alternate stanza class “StanzaSentencizer” that mirrors “SpacySentencizerSmall” in its input and output (a rough sketch follows after this list).

  3. In translate.py, added a “sentencizer” property to the packaged translation class constructor that initializes itself from pkg.sbd_package (I used a setter construct before, but I decided to simplify this, since it duplicated the code in package.py).
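
For reference, a rough sketch of what the StanzaSentencizer in step 2 could look like (the split() interface and constructor here are assumptions, not the actual PR code):

    import stanza

    class StanzaSentencizer:
        """Stanza-backed sentencizer mirroring SpacySentencizerSmall."""

        def __init__(self, lang):
            # tokenize is the only processor needed for sentence splitting
            self.nlp = stanza.Pipeline(lang=lang, processors="tokenize")

        def split(self, text):
            return [sentence.text for sentence in self.nlp(text).sentences]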

I debugged the code with PyCharm Pro, but I can’t manage to run it from the CLI in the Windows conda env where I installed argos-translate. Could you lend me a hand?

Thanks in advance,

In the meantime, I realized that I can always execute the CLI directly on my self-hosted LibreTranslate lab, and I have been debugging further with it.
I opened a topic about the issue I ran into, because I am not a seasoned object-oriented programmer… but after some effort I can manage just fine.

The code I produced for Argos is in the PR pipeline; it does both Stanza and Spacy.
For performance, I included the possibility of packaging Argos models with a language-specific Spacy model when one exists, since it runs faster than stanza.
Since I had an internal error in my LT lab, I ran an up-to-date LT environment in debug mode locally on my workstation. It is functional with all three package types (Stanza/Spacy-packaged/Spacy-generic).
The problem with the online instance is probably down to a proxy or WSGI wrapper error.