Sentence Boundary Detection for Machine Translation

I did some basic benchmarking for different sentence splitting libraries.

Here are the results:

| Model | Average Accuracy | Average Runtime (seconds) |
|---|---|---|
| spaCy en_core_web_sm | 0.924311498164287 | 0.0250468651453654 |
| spaCy xx_sent_ud_sm | 0.924311498164287 | 0.00476229190826416 |
| Argos Translate 2 Beta | 0.515548280365557 | 1.87798078854879 |
| Stanza en | 0.924311498164287 | 0.0219400326410929 |

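It looks like both spaCy and Stanza are accurate for English (about 92.4% average accuracy each) and run quickly, at roughly 22 to 25 ms on average. The spaCy xx_sent_ud_sm model is about five times faster still (around 4.8 ms) with no loss in accuracy. Argos Translate 2 Beta is well behind on both measures, at about 51.6% accuracy and 1.88 seconds.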
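
For anyone who wants to run a similar comparison, here is a minimal sketch of a benchmarking harness. It is not the script that produced the numbers above. The accuracy metric (fraction of gold sentences reproduced exactly), the `benchmark` and `score` helpers, the regex baseline, and the two sample texts are all assumptions made for illustration.

```python
import re
import time
from typing import Callable, List, Tuple

# A splitter is any function that turns a text into a list of sentences.
Splitter = Callable[[str], List[str]]


def regex_splitter(text: str) -> List[str]:
    """Naive baseline: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def score(predicted: List[str], gold: List[str]) -> float:
    """Assumed metric: fraction of gold sentences reproduced exactly."""
    if not gold:
        return 1.0
    predicted_set = {s.strip() for s in predicted}
    hits = sum(1 for s in gold if s.strip() in predicted_set)
    return hits / len(gold)


def benchmark(splitter: Splitter, samples: List[List[str]]) -> Tuple[float, float]:
    """Return (average accuracy, average runtime in seconds) over all samples."""
    if not samples:
        raise ValueError("need at least one sample")
    accuracies, runtimes = [], []
    for gold in samples:
        text = " ".join(gold)  # rebuild the unsplit input from the gold sentences
        start = time.perf_counter()
        predicted = splitter(text)
        runtimes.append(time.perf_counter() - start)
        accuracies.append(score(predicted, gold))
    return sum(accuracies) / len(samples), sum(runtimes) / len(samples)


# Hypothetical test data: each sample is a list of gold sentences.
SAMPLES = [
    ["This is a sentence.", "Here is another one!"],
    ["Dr. Smith arrived late.", "Nobody minded."],
]

if __name__ == "__main__":
    accuracy, runtime = benchmark(regex_splitter, SAMPLES)
    print(f"regex baseline: accuracy={accuracy:.3f}, runtime={runtime:.6f}s")
```

A library can be plugged in by wrapping it in a function with the same shape. With spaCy installed and the model downloaded, for example, `nlp = spacy.load("xx_sent_ud_sm")` followed by `lambda text: [s.text for s in nlp(text).sents]` should work as a splitter. Loading the model outside the timed call keeps startup cost out of the runtime figure.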
