Sentence Boundary Detection for Machine Translation

I tried manually testing a variety of current Argos Translate languages to see which ones the xx_sent_ud_sm model could successfully segment into sentences:

Arabic works
Chinese does not work
Dutch works
Irish works
Korean works
Russian works
Thai does not work
Turkish works
Urdu does not work

Chinese not working with this model is a known issue; however, there is a dedicated Chinese Spacy model that does work (zh_core_web_sm).
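For reference, here is roughly the kind of check involved (a minimal sketch, assuming the xx_sent_ud_sm package is installed and using a throwaway English example):

    import spacy

    # Multi-language sentence segmentation model trained on Universal Dependencies
    nlp = spacy.load("xx_sent_ud_sm")

    doc = nlp("This is the first sentence. Here is a second one.")
    print([sent.text for sent in doc.sents])
    # expected: ['This is the first sentence.', 'Here is a second one.']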

I try pretty hard to minimize the dependencies for Argos Translate so I'm attracted to the idea of dropping the need for Stanza if possible. I think the xx_sent_ud_sm model is probably good enough for most languages and I can try to fix languages like Thai that don't work using one-off libraries (like pythainlp) that are lighter weight than Stanza. I'm also okay with dropping support for a small number of languages (like possibly Urdu) if it makes the codebase higher performance and easier to maintain. It may also be worth looking into adding support to Spacy for the languages we need and then contributing it back to the Spacy project so that we can use it (Spacy training documentation).
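For example, pythainlp has its own sentence tokenizer that could cover Thai without pulling in Stanza (a rough sketch, not something wired into Argos Translate):

    from pythainlp.tokenize import sent_tokenize

    # Thai generally has no sentence-final punctuation, so a dedicated
    # sentence tokenizer is needed; pythainlp's default engine handles this.
    text = "ąøŖąø§ąø±ąøŖąø”ąøµąø„ąø£ąø±ąøš ąø‰ąø±ąø™ąøŠąø­ąøšąø­ąø²ąø«ąø²ąø£ą¹„ąø—ąø¢"  # example Thai input
    print(sent_tokenize(text))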


Do you incorporate language detection into the process? If not, that's a way to get more specialized solutions for different languages. A one-size-fits-all model is difficult to find.

Argos Translate always knows what languages it's translating from/to. Language detection ("auto" as the source language) is a feature in LibreTranslate, and LibreTranslate passes the source language code to Argos Translate. So for Chinese I think I'm just going to use a different Spacy model when the user is translating Chinese source text.
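Roughly something like this, where get_sentencizer is a hypothetical helper and the language code is what LibreTranslate passes in:

    import spacy

    # Hypothetical helper: pick a segmentation model per source language code
    def get_sentencizer(source_lang_code):
        if source_lang_code == "zh":
            return spacy.load("zh_core_web_sm")  # dedicated Chinese model
        return spacy.load("xx_sent_ud_sm")  # multi-language default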


From my Discord server:

I've done some testing on the latest main, and for my use case (larger documents) the results with spacy are amazing! I'm not sure what the exact cause is (I can guess that it's because the old stanza version had suboptimal dependencies), but throughput is up ~2-3x on CPU. Another observation is that I'm not seeing any improvement running on GPU anymore (it was ~2-3x before).

I'm planning to delay releasing Argos Translate 1.10 and the Spacy change for a bit. I want to do more testing and figure out solutions for the languages Spacy doesn't support well. I've also thought about trying to train my own model for Spacy with the language support we need.


A 2-3x performance gain seems like a very good improvement!

I keep wondering if there's a way to keep the stanza models for the languages that aren't supported, but without using the stanza package, maybe via onnx or similar. I'll try to find some time one of these days to investigate.


I managed to run a proof of concept on stanza's Italian tokenizer, removing pytorch inference from the loop. :partying_face:

  1. I edited stanza's code to run the ONNX exporter, which gave me an ONNX model.

  2. I removed the pytorch inference code and replaced it with the newly exported ONNX model. The results were identical (apart from some small numerical differences):
    [[{'id': (1,), 'text': 'Questa', 'misc': 'start_char=0|end_char=6'}, {'id': (2,), 'text': "e'", 'misc': 'start_char=7|end_char=9'}, {'id': (3,), 'text': 'una', 'misc': 'start_char=10|end_char=13'}, {'id': (4,), 'text': 'frase', 'misc': 'start_char=14|end_char=19'}, {'id': (5,), 'text': '.', 'misc': 'start_char=19|end_char=20'}], [{'id': (1,), 'text': 'Questa', 'misc': 'start_char=21|end_char=27'}, {'id': (2,), 'text': "e'", 'misc': 'start_char=28|end_char=30'}, {'id': (3,), 'text': 'un', 'misc': 'start_char=31|end_char=33'}, {'id': (4,), 'text': 'altra', 'misc': 'start_char=34|end_char=39'}, {'id': (5,), 'text': '.', 'misc': 'start_char=39|end_char=40'}]]
    ["Questa e' una frase.", "Questa e' un altra."]

In a nutshell, I edited Stanza's trainer.py as follows:

    def predict(self, inputs):
        self.model.eval()
        units, labels, features, _ = inputs
        if self.use_cuda:
            units = units.cuda()
            labels = labels.cuda()
            features = features.cuda()

        pred = self.model(units, features)

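        # One-off export step: dump the loaded pytorch model to ONNX and exit
        # (uncomment the block below to re-run the export)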
        # torch.onnx.export(self.model,         # model being run 
        #     (units, features),       # model input (or a tuple for multiple inputs) 
        #     "/home/piero/Downloads/staging/it.onnx",       # where to save the model  
        #     export_params=True,  # store the trained parameter weights inside the model file 
        #     opset_version=10,    # the ONNX version to export the model to 
        #     do_constant_folding=True,  # whether to execute constant folding for optimization 
        #     input_names = ['units', 'features'],   # the model's input names 
        #     output_names = ['modelOutput'], # the model's output names 
        # )
        # exit(1)
        
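        # Inference with the exported ONNX model via onnxruntime
        # (a real implementation would create the session once, not on every call)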
        import onnxruntime
        ort_session = onnxruntime.InferenceSession("/home/piero/Downloads/staging/it.onnx", providers=["CPUExecutionProvider"])

        def to_numpy(tensor):
            return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

        # compute ONNX Runtime output prediction
        ort_inputs = {'units': to_numpy(units), 'features': to_numpy(features)}
        ort_outs = ort_session.run(None, ort_inputs)

        # compare ONNX Runtime and PyTorch results
        # np.testing.assert_allclose(to_numpy(pred), ort_outs[0], rtol=1e-03, atol=1e-05)

        # return ort_outs[0]

        return pred.data.cpu().numpy()

Uncommenting the second-to-last line (return ort_outs[0]) makes the method return the ONNX Runtime output rather than the pytorch prediction.
