Continuing the discussion from Sentence Boundary Detection for Machine Translation - #55 by pierotofy, I’m happy to share MiniSBD, a subset port of Stanza’s tokenizer models that uses 8-bit quantized ONNX models for inference, making it extremely lightweight and fast.
It only depends on onnxruntime (or onnxruntime-gpu for GPU inference), which paves the way toward potentially removing argos-translate's dependency on pytorch (more on this below).
Code: GitHub - LibreTranslate/MiniSBD: Free and open source library for fast sentence boundary detection
Installation: pip install minisbd
Usage:
from minisbd import SBDetect
text = """
La Révolution française (1789-1799) est une période de bouleversements politiques et sociaux en France et dans ses colonies, ainsi qu'en Europe à la fin du XVIIIe siècle. Traditionnellement, on la fait commencer à l'ouverture des États généraux le 5 mai 1789 et finir au coup d'État de Napoléon Bonaparte le 9 novembre 1799 (18 brumaire de l'an VIII). En ce qui concerne l'histoire de France, elle met fin à l'Ancien Régime, notamment à la monarchie absolue, remplacée par la monarchie constitutionnelle (1789-1792) puis par la Première République.
"""
detector = SBDetect("fr", use_gpu=True)
for sent in detector.sentences(text):
    print(f"--> {sent}")
# --> La Révolution française (1789-1799) est une période de bouleversements politiques et sociaux en France et dans ses colonies, ainsi qu'en Europe à la fin du XVIIIe siècle.
# --> Traditionnellement, on la fait commencer à l'ouverture des États généraux le 5 mai 1789 et finir au coup d'État de Napoléon Bonaparte le 9 novembre 1799 (18 brumaire de l'an VIII).
# --> En ce qui concerne l'histoire de France, elle met fin à l'Ancien Régime, notamment à la monarchie absolue, remplacée par la monarchie constitutionnelle (1789-1792) puis par la Première République.
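Since GPU inference requires onnxruntime-gpu, it can be convenient to probe the available execution providers before setting the use_gpu flag. A minimal sketch (the gpu_available helper is my own illustration, not part of MiniSBD):

```python
def gpu_available():
    """Return True if onnxruntime reports a CUDA execution provider.

    Illustrative helper for choosing the use_gpu flag; assumes the
    onnxruntime package API (get_available_providers). Returns False
    when onnxruntime is not installed at all.
    """
    try:
        import onnxruntime
    except ImportError:
        return False
    return "CUDAExecutionProvider" in onnxruntime.get_available_providers()
```

One could then construct the detector with `SBDetect("fr", use_gpu=gpu_available())`.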
The models are quantized, so they might produce slightly different outputs than Stanza; in my tests, however, I was unable to detect any differences. I'd love for people to try it out and share what they find.
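For anyone who wants to check for quantization drift systematically, a small helper that diffs two sentence lists (e.g. MiniSBD's output against Stanza's for the same text) can flag mismatches. This is a generic sketch of my own, not part of either library:

```python
def diff_sentences(ours, reference):
    """Return (index, ours, reference) tuples wherever the two
    sentence lists disagree; extra trailing sentences on either
    side are reported with None standing in for the missing entry."""
    mismatches = []
    for i in range(max(len(ours), len(reference))):
        a = ours[i] if i < len(ours) else None
        b = reference[i] if i < len(reference) else None
        if a != b:
            mismatches.append((i, a, b))
    return mismatches
```

One could feed it `list(detector.sentences(text))` on one side and Stanza's `[s.text for s in pipeline(text).sentences]` on the other; an empty result means the outputs match exactly.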
Next steps
I would like to help integrate this library into argos-translate and replace stanza, but I’m unsure of the best way forward. In particular, I would like thoughts on how to best handle model storage.
Option A. Include a copy of the ONNX models in each argospackage, just as the current setup bundles a copy of the Stanza model. The downside is redundancy: every en => [lang] pair would carry the same "en" model, which is wasteful (though only mildly so, since each .onnx model is under 1 MB). I'm also unsure of the preferred migration path, since older versions of argos-translate would still require the Stanza models to work. Keeping both the Stanza and ONNX models in the packages for a while could provide an upgrade path. Publishing a new package index URL might also be necessary to avoid breaking older clients, which have no way of knowing that the Stanza models were removed.
Option B. Separate the ctranslate2 models from the SBD models and simply augment the argospackage metadata definition with a key specifying the MiniSBD language code to use for SBD (or have argos-translate maintain the lang_from <=> MiniSBD language code mapping itself).
Option C. ?
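To make Option B concrete, the resolution logic in argos-translate could look something like the sketch below. The MINISBD_LANG table, the "sbd_lang" metadata key, and the fallback policy are all hypothetical assumptions for illustration, not existing argos-translate or MiniSBD code:

```python
# Hypothetical lang_from => MiniSBD language code mapping (Option B).
# The entries and the fallback policy are illustrative assumptions.
MINISBD_LANG = {
    "en": "en",
    "fr": "fr",
    "pt": "pt",
}

def sbd_lang_for(lang_from, metadata=None):
    """Resolve the MiniSBD code for a package: prefer an explicit
    'sbd_lang' key in the package metadata, fall back to the built-in
    mapping, and finally to the source language code itself."""
    if metadata and "sbd_lang" in metadata:
        return metadata["sbd_lang"]
    return MINISBD_LANG.get(lang_from, lang_from)
```

The metadata key keeps per-package flexibility, while the built-in mapping avoids touching existing packages at all.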
I’m also unsure whether this library could remove the need for Spacy as well, which would further trim dependencies.