Sentence Boundary Detection for Machine Translation

NicoLe · January 4, 2025, 12:22pm

Hello,
All the models I’ve made with Locomotive so far use stanza. Since they have to run primarily on Triton as part of a speech-to-text/translation/subtitling ensemble, we could not simply use LibreTranslate for inference, so we coded our own stanza/sp/ct2 pipeline.

However, we still have to work out a complex strategy (BLS) for general pivoting, so I have decided to run an additional LT server for odd use cases and to use ensemble models for the few cases where we always need to pivot. For this to work as intended, I need LT to support stanza as a fallback to spacy or vice versa, and actually you need it too.

Two weeks ago, I found out that any language with unusual punctuation is not spacy_multilingual-compliant. So this affects not only th and hy, but also hi and all the other languages written in India. This starts to add up quite a bit, and there’s no “spacyfic” model for any of them, as there is for zh or ko.
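As a rough illustration of the punctuation issue (this is not how spaCy works internally, just a toy splitter), here is what happens when sentence enders are assumed to be `.`, `!` or `?`. Hindi ends sentences with the danda (U+0964) and Armenian with its own full stop (U+0589), so a Latin-only rule never finds a boundary. The names `LATIN_ENDERS`, `EXTENDED_ENDERS` and `split_sentences` are made up for this sketch.

```python
import re

# Toy rule: a sentence ends at ., ! or ? followed by whitespace.
LATIN_ENDERS = re.compile(r"(?<=[.!?])\s+")

# Same rule, extended with the Devanagari danda (U+0964)
# and the Armenian full stop (U+0589).
EXTENDED_ENDERS = re.compile(r"(?<=[.!?\u0964\u0589])\s+")


def split_sentences(text, pattern):
    """Split text on the given boundary pattern, dropping empty pieces."""
    return [s for s in pattern.split(text.strip()) if s]


# Two short Hindi sentences, each terminated by a danda.
hindi = "यह पहला वाक्य है\u0964 यह दूसरा वाक्य है\u0964"

latin_only = split_sentences(hindi, LATIN_ENDERS)   # no boundary found
extended = split_sentences(hindi, EXTENDED_ENDERS)  # boundary found at the danda
print(len(latin_only), len(extended))
```

Thai is a different problem again, since it usually has no sentence-final punctuation at all, which is why a trained model is needed there rather than a longer list of characters.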

So far, I have managed to get Locomotive to fall back to spacy when stanza is unavailable, and my teammate has coded stanza segmentation on Triton. He will soon code spacy segmentation for the language pairs in my pipeline that don’t have a stanza model, so I’ll clone my LT lab and work out with him in the coming weeks how to get the best of both worlds.
(With the models to date already embedding the stanza lib for zh and ko, I’d say we can keep things legacy and use spacy as the fallback when no stanza model exists, rather than the other way round, without suffering too much from it. But it’s your project, so feel free to tell me what you favor.)
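To make the proposal concrete, here is a minimal sketch of the stanza-first, spacy-as-fallback order in Python. It is not the current LT or Locomotive code: `load_stanza`, `load_spacy` and `get_segmenter` are hypothetical names, and I am assuming the spaCy multilingual model is `xx_sent_ud_sm` and that the stanza model, if any, is already on disk (hence `download_method=None`).

```python
from typing import Callable, List, Sequence

Segmenter = Callable[[str], List[str]]
Loader = Callable[[str], Segmenter]


def load_stanza(lang: str) -> Segmenter:
    """Build a stanza splitter; raises if stanza or the model is missing."""
    import stanza  # lazy import, so this module loads without stanza installed

    # download_method=None: only use models that are already available locally.
    nlp = stanza.Pipeline(lang=lang, processors="tokenize", download_method=None)
    return lambda text: [s.text for s in nlp(text).sentences]


def load_spacy(lang: str) -> Segmenter:
    """Build a spaCy splitter; raises if spaCy or the model is missing."""
    import spacy  # lazy import

    # Assumption: the multilingual sentence model, used whatever the language is.
    nlp = spacy.load("xx_sent_ud_sm")
    return lambda text: [s.text.strip() for s in nlp(text).sents]


def get_segmenter(
    lang: str, loaders: Sequence[Loader] = (load_stanza, load_spacy)
) -> Segmenter:
    """Return the first segmenter that loads, trying the loaders in order."""
    errors = []
    for loader in loaders:
        try:
            return loader(lang)
        except Exception as exc:  # missing library, missing model, etc.
            errors.append(f"{loader.__name__}: {exc}")
    raise RuntimeError(
        f"no sentence segmenter available for {lang!r}: " + "; ".join(errors)
    )
```

Swapping the preference is then only a matter of passing `loaders=(load_spacy, load_stanza)`, so whichever order you favor would not change much on the calling side.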