Introducing MiniSBD: Fast sentence boundary detection

Continuing the discussion from Sentence Boundary Detection for Machine Translation - #55 by pierotofy, I’m happy to share MiniSBD, a subset port of Stanza’s tokenizer models that uses 8-bit quantized ONNX models for inference, making it extremely lightweight and fast.

It only depends on onnxruntime (or onnxruntime-gpu for GPU inference), which means this paves the way for potentially removing argos-translate’s dependency on pytorch (more on this below).

Code: https://github.com/LibreTranslate/MiniSBD

Installation: pip install minisbd

Usage:

from minisbd import SBDetect

text = """
La Révolution française (1789-1799) est une période de bouleversements politiques et sociaux en France et dans ses colonies, ainsi qu'en Europe à la fin du XVIIIe siècle. Traditionnellement, on la fait commencer à l'ouverture des États généraux le 5 mai 1789 et finir au coup d'État de Napoléon Bonaparte le 9 novembre 1799 (18 brumaire de l'an VIII). En ce qui concerne l'histoire de France, elle met fin à l'Ancien Régime, notamment à la monarchie absolue, remplacée par la monarchie constitutionnelle (1789-1792) puis par la Première République.
"""

detector = SBDetect("fr", use_gpu=True)  # use_gpu=True requires onnxruntime-gpu
for sent in detector.sentences(text):
    print(f"--> {sent}")

# --> La Révolution française (1789-1799) est une période de bouleversements politiques et sociaux en France et dans ses colonies, ainsi qu'en Europe à la fin du XVIIIe siècle.
# --> Traditionnellement, on la fait commencer à l'ouverture des États généraux le 5 mai 1789 et finir au coup d'État de Napoléon Bonaparte le 9 novembre 1799 (18 brumaire de l'an VIII).
# --> En ce qui concerne l'histoire de France, elle met fin à l'Ancien Régime, notamment à la monarchie absolue, remplacée par la monarchie constitutionnelle (1789-1792) puis par la Première République.

The models are quantized, so they might produce slightly different outputs compared to Stanza; however, I was not able to detect any differences in my tests. I’d love for people to try it out and see what they think.

Next steps

I would like to help integrate this library into argos-translate and replace stanza, but I’m unsure of the best way forward. In particular, I’d welcome thoughts on how best to handle model storage.

Option A. Include copies of the ONNX models in each argospackage, just like the current setup includes a copy of the stanza model. The downside is redundancy: all en => [lang] pairs would include a copy of the same “en” model, which is wasteful (although somewhat minimal, since the .onnx models are less than 1MB each). I’m also unsure of the preferred way to migrate the packages, as older versions would continue to require the stanza models to work. Perhaps keeping both stanza and ONNX models for a while could provide an upgrade path. Publishing a new package index URL might be required in order not to break older clients, which would have no idea that the stanza models had been removed.

Option B. Separate the ctranslate2 models from the SBD models and simply augment the argospackage metadata definition with a key specifying the MiniSBD language code to use for SBD (or let argos-translate handle the lang_from <=> MiniSBD language code mapping). A rough sketch of what this could look like follows the list of options.

Option C. ?
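To make Option B concrete, here’s a minimal sketch of what reading the augmented metadata could look like. The "sbd" key name and the fallback behavior are assumptions on my part, not an existing argos-translate convention:

import json
from minisbd import SBDetect

# Hypothetical augmented package metadata; the "sbd" key is an assumption
metadata = json.loads("""
{
    "from_code": "fr",
    "to_code": "en",
    "sbd": "fr"
}
""")

# Fall back to from_code when the package doesn't declare an SBD code
detector = SBDetect(metadata.get("sbd", metadata["from_code"]))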

I’m also wondering whether this library might remove the need for Spacy, which could further help trim dependencies.


Awesome work! I love the idea of integrating this into Argos Translate to remove the need for PyTorch for inference.

The model storage is a tricky issue; let me think about it.


I also need to do more testing to evaluate what kind of loss the 8-bit quantization might introduce, and whether there’s a case for keeping the models in float format (or offering both).


Are you able to run some sort of regression test to evaluate both? As I remember, @argosopentech ran a test evaluating Stanza, Spacy, and some other sentence boundary detection pipelines a while ago.
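Something like this, maybe? A rough sketch, assuming stanza is installed, that SBDetect defaults to CPU inference, and that both libraries normalize sentence whitespace the same way:

import stanza
from minisbd import SBDetect

text = open("sample_fr.txt").read()  # hypothetical test file

# Sentence splits from the quantized ONNX model
minisbd_sents = list(SBDetect("fr").sentences(text))

# Reference splits from Stanza
stanza.download("fr")  # fetches the Stanza models on first run
nlp = stanza.Pipeline("fr", processors="tokenize")
stanza_sents = [sentence.text for sentence in nlp(text).sentences]

assert minisbd_sents == stanza_sents, "splits diverge from Stanza"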

Happy new year to the team! All hail Piero for this work.

Some thoughts on the exchange, and what I’ll do next.

Quantization:
The difference between 8-bit quantized and 32-bit specialized small models is often close to none (for instance, I previously tried quantizing the CTranslate2 binaries as float16 and float32, and ended up with the exact same metrics as int8). The real leap occurs with 4-bit quantization.
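For reference, here is roughly what an 8-bit dynamic quantization pass looks like with onnxruntime’s own tooling (the file names are placeholders, and I don’t know whether MiniSBD’s export uses exactly this path):

from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="tokenizer_fr.onnx",        # hypothetical float32 model
    model_output="tokenizer_fr.int8.onnx",  # 8-bit quantized output
    weight_type=QuantType.QInt8,
)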

Evaluation:
The example of Thai sets the boundary of the exercise: it features neither punctuation nor word separators, so SBD results cannot be measured by any means other than reading.

Configuration:
I really think the current vertical approach is OK: removing PyTorch will free almost 1GB of disk space, which leaves room for more duplicate ONNX models than there are languages to train from OPUS.

Our implementation uses the transversal approach instead; it’s (much) more complicated to manage, but since our SBD serves other tasks too (see below), we’re better off with it.

Spacy:
I will check this one out. For the past six months, I have been working almost full time on another project and didn’t have time to do much on my private fork of Locomotive. That project aims at vectorizing a huge library of multilingual PDF images.
Last summer, I established that superior vectorization was attained with chunks of around 512 characters, so I built a text processing pipeline that would yield such chunks made of full sentences, and eventually found out that:

  1. pysbd was better than both spacy & stanza in English, French, and some other languages (see the sketch after this list);
  2. only stanza could work out languages with non-standard punctuation (Armenian, Thai).
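For those who don’t know pysbd, this is what the rule-based route looks like (the snippet follows pysbd’s documented API; the sample text is from its README):

import pysbd

seg = pysbd.Segmenter(language="en", clean=False)
print(seg.segment("My name is Jonas E. Smith. Please turn to p. 55."))
# ['My name is Jonas E. Smith. ', 'Please turn to p. 55.']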

I still have to sort out some preprocessing issues (stray \ chars from the OCR) before I can run a full benchmark and we can reconfigure our SBD pipeline, but once I identify improvements to the current SBD in Argos that reduce dependencies and simplify the architecture, I will code them and send a PR.


Cool! I wasn’t aware of pysbd. I like the idea of using a rule-based library for Latin-script languages. It would be interesting to compare speeds.

Good idea to check the speed: right now I am running the lib asynchronously within a celery worker, but I can restart from my first synchronous commit and get speed and precision metrics for several languages and SBDs.
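For the speed side, a first pass could be as simple as this sketch (the sample file and language are placeholders, and model loading is kept outside the timed sections):

import time

import pysbd
from minisbd import SBDetect

text = open("sample_en.txt").read()  # hypothetical input

detector = SBDetect("en")  # load the ONNX model before timing
start = time.perf_counter()
_ = list(detector.sentences(text))
print(f"minisbd: {time.perf_counter() - start:.3f}s")

segmenter = pysbd.Segmenter(language="en", clean=False)
start = time.perf_counter()
_ = segmenter.segment(text)
print(f"pysbd:   {time.perf_counter() - start:.3f}s")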

By the way, I looked at the code, and although I’m not familiar with the ONNX extraction process itself, I saw that it measures the conformity of the ONNX model (quantized or not) to stanza over a given text.

For my benchmark experiment, I plan to use a modified FLORES dataset with the line break after each sentence removed; I’ll use that function to check stanza conformity over a wide range of languages. I’ll submit the results, and the code, some time between now and the end of February.
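The preprocessing itself is trivial: something like this (the file name is a placeholder) turns the one-sentence-per-line FLORES file into a single paragraph, so the SBD has to find the boundaries on its own:

# Join the one-sentence-per-line FLORES file into one paragraph
with open("flores200.devtest.fra") as f:  # hypothetical path
    paragraph = " ".join(line.strip() for line in f)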


I think the best way to handle model storage is to have a MiniSBD model in the .argosmodel package that the user can enable with a setting/flag, similar to how we did it for Spacy SBD, so it doesn’t break backwards compatibility.
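As a sketch, the opt-in could look something like this (the variable name is hypothetical, not an existing Argos Translate setting):

import os

# Hypothetical opt-in flag, disabled by default
use_minisbd = os.environ.get("ARGOS_USE_MINISBD", "0") == "1"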

Since updating all the .argosmodel packages (which are zip archives) and reuploading them to Cloudflare would be a huge hassle, we could release a beta version of this feature in Argos Translate where the MiniSBD models are downloaded from a server (maybe just a GitHub repo, like LibreTranslate originally did to distribute .argosmodel packages) instead of being bundled in the packages for now. That way it’s easy to experiment and test.
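Downloading on demand could then be as simple as this sketch (the URL and cache path are placeholder assumptions):

import os
import urllib.request

MODEL_URL = "https://example.com/minisbd/fr.onnx"  # placeholder URL
cache_path = os.path.expanduser("~/.cache/argos-translate/minisbd/fr.onnx")  # hypothetical location

os.makedirs(os.path.dirname(cache_path), exist_ok=True)
if not os.path.exists(cache_path):
    urllib.request.urlretrieve(MODEL_URL, cache_path)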


I think this is definitely worth testing. Quantization seemed to break some of the Opus-MT models. I think quantization is generally worth it though; Argos Translate uses 8-bit quantization, and the model sizes would be four times bigger if I used 32-bit. Ideally you could train in 8-bit, but that might not always be practical. Since the MiniSBD feature is aimed at power users who want to improve performance, maybe we should give them the option by offering both.

I agree. I think rule-based SBD could work well for a lot of languages.

I was working on building my own SBD system that would use a CTranslate2 model, listed as “Argos Translate 2” here.

Ok, so just to reiterate, something like:

  • Add a setting to enable MiniSBD inference (disabled by default in beta)
  • In beta, download the models from GitHub, but plan to eventually move them inside the .argosmodel packages

I could help write a PR.

This sounds great to me. I’m happy to merge it once you’re finished.


Early PR: https://github.com/argosopentech/argos-translate/pull/510 (feat: MiniSBD support)

Still need to do some more tests.


I’ve done more testing and it seems pretty solid / ready for review.


This has been released on PyPI in Argos Translate 1.11.0!


LibreTranslate updated/released as well. 🥳