Sentence Boundary Detection for Machine Translation

argosopentech · May 28, 2023, 4:17pm

I ran some basic benchmarks on different sentence-splitting libraries.

Here are the results:

Model	Average Accuracy	Average Runtime (seconds)
spaCy en_core_web_sm	0.924311498164287	0.0250468651453654
spaCy xx_sent_ud_sm	0.924311498164287	0.00476229190826416
Argos Translate 2 Beta	0.515548280365557	1.87798078854879
Stanza en	0.924311498164287	0.0219400326410929

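It looks like both spaCy and Stanza are accurate for English (about 0.92 average accuracy) and fast (about 0.02 seconds average runtime). The spaCy xx_sent_ud_sm model is faster still, at about 0.005 seconds, with no loss in accuracy. Argos Translate 2 Beta trails on both measures, at about 0.52 accuracy and 1.9 seconds.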
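For anyone who wants to run a similar comparison, here is a minimal sketch of a benchmark harness that reports the same two columns (average accuracy and average runtime in seconds). It is not the script behind the numbers above. The accuracy metric (fraction of reference sentences reproduced exactly), the sample data, and the names `naive_split`, `score`, `benchmark`, and `SAMPLES` are all assumptions made for illustration.

```python
import re
import time
from typing import Callable, List, Tuple

# A splitter is any function that turns a text into a list of sentences.
Splitter = Callable[[str], List[str]]


def naive_split(text: str) -> List[str]:
    """Hypothetical baseline: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def score(predicted: List[str], expected: List[str]) -> float:
    """Assumed metric: fraction of expected sentences reproduced exactly."""
    if not expected:
        return 1.0
    predicted_set = set(predicted)
    return sum(s in predicted_set for s in expected) / len(expected)


def benchmark(
    splitter: Splitter, samples: List[Tuple[str, List[str]]]
) -> Tuple[float, float]:
    """Return (average accuracy, average runtime in seconds) over the samples."""
    accuracies, runtimes = [], []
    for text, expected in samples:
        start = time.perf_counter()
        predicted = splitter(text)
        runtimes.append(time.perf_counter() - start)
        accuracies.append(score(predicted, expected))
    return sum(accuracies) / len(accuracies), sum(runtimes) / len(runtimes)


# Made-up sample data: (text, reference sentence split).
SAMPLES = [
    (
        "This is one sentence. This is another.",
        ["This is one sentence.", "This is another."],
    ),
    (
        "Dr. Smith arrived. He was late.",
        ["Dr. Smith arrived.", "He was late."],
    ),
]

if __name__ == "__main__":
    accuracy, runtime = benchmark(naive_split, SAMPLES)
    print(f"naive_split\t{accuracy}\t{runtime}")
```

Any library can be plugged in by wrapping it as a function from text to a list of sentence strings. With a loaded spaCy pipeline `nlp`, for example, that would be something like `lambda text: [s.text for s in nlp(text).sents]`. Loading the model should happen outside the timed call so that only the splitting itself is measured.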