Sentence Boundary Detection for Machine Translation

The way Argos Translate currently translates strings of text is to first split the text into sentences and then translate each sentence independently. This allows us to use small and efficient neural networks that can only handle 150 characters of input. However, since each sentence is translated independently, there’s no mechanism for context from one sentence to influence the translation of nearby sentences. In most cases this is fine, and translating each sentence independently leads to understandable results. This approach does, however, require a way to split a string of text into discrete sentences.
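
As a rough illustration of that pipeline (the splitter and translator below are toy placeholders, not the actual Argos Translate internals):

```
import re

def split_sentences(text):
    # Naive placeholder splitter; Argos Translate actually uses Stanza here.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def translate_sentence(sentence):
    # Placeholder for the per-sentence neural translation step.
    return sentence.upper()

def translate_text(text):
    # Each sentence is translated independently, so no context is shared
    # between neighboring sentences.
    return " ".join(translate_sentence(s) for s in split_sentences(text))

print(translate_text("This is one sentence. This is another."))
```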

Argos Translate currently uses Stanza to detect sentence boundaries and split text into sentences. Stanza has worked very well for us, and supports a large number of languages, but it’s a bit clunky and slow. Stanza uses neural networks to detect sentence boundaries in text, but also does a lot of other things like identifying parts of speech.
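
For reference, splitting text into sentences with Stanza looks roughly like this (a minimal sketch using Stanza's standard pipeline API):

```
import stanza

# Download the English models once (cached afterwards).
stanza.download("en")

# Only the tokenize processor is needed for sentence splitting.
nlp = stanza.Pipeline(lang="en", processors="tokenize")

doc = nlp("This is the first sentence. Dr. Smith wrote the second one.")
print([sentence.text for sentence in doc.sentences])
```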

There are several other Python libraries available for detecting sentence boundaries. Many of them use a set of rules to determine which periods in a text mark a sentence boundary and which belong to an abbreviation like “Dr.” or “P.J.”. This works well in a lot of cases, but it can miss some nuance and doesn’t work well for non-European languages that don’t use periods.
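
NLTK’s sentence tokenizer is one example of this style of splitter that handles common abbreviations (shown here only as a sketch for comparison, not something Argos Translate uses):

```
import nltk
from nltk.tokenize import sent_tokenize

# The punkt sentence tokenizer models need to be downloaded once.
nltk.download("punkt")

text = "Dr. Smith arrived at 5 p.m. yesterday. He left immediately."
# The splitter avoids breaking on the periods in "Dr." and "p.m.".
print(sent_tokenize(text))
```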

My plan going forward is to use CTranslate2, which is the Transformer inference engine for Argos Translate, to detect sentence boundaries. I have this implemented for the v2 development version of Argos Translate but am not currently using it on the master branch.

The way this works is that, instead of translating between languages like “en”->“de” with a Transformer neural network, I “translate” from a string of text to the first sentence of that string. Then I use a string similarity metric to match the output text from the Transformer model to a substring of the original text.

Example:

```
<detect-sentence-boundary> This is the first sentence. This is more text that i
This is the first sentence. <sentence-boundary>
```
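
A rough sketch of how this can be wired up with CTranslate2 plus a string-similarity match (the model path, SentencePiece tokenizer, and input truncation here are assumptions for illustration, not the exact v2 implementation):

```
import ctranslate2
import sentencepiece as spm
from difflib import SequenceMatcher

# Hypothetical paths; the actual v2 model and tokenizer layout may differ.
translator = ctranslate2.Translator("sbd_model/")
sp = spm.SentencePieceProcessor(model_file="sbd_model/sentencepiece.model")

def detect_first_sentence(text):
    # Truncate the input, tokenize, and run the seq2seq model.
    tokens = sp.encode("<detect-sentence-boundary> " + text[:150], out_type=str)
    result = translator.translate_batch([tokens])
    detected = sp.decode(result[0].hypotheses[0])
    detected = detected.replace("<sentence-boundary>", "").strip()

    # Match the model output back to a substring of the original text,
    # since the seq2seq output may not copy the input exactly.
    match = SequenceMatcher(None, text, detected).find_longest_match(
        0, len(text), 0, len(detected)
    )
    return text[: match.a + match.size]
```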

In my experience this works pretty well, although for now the Stanza library still works better in most cases. Using CTranslate2 also lets me use one Transformer inference engine for both translation and sentence boundary detection. This is beneficial because it reduces the total dependencies required by Argos Translate (Stanza requires PyTorch, which is ~1GB to install).

One issue with using the CTranslate2 seq2seq Transformer model for sentence boundary detection is that it can be difficult to find high-quality training data, since this is a niche task. To create data, I appended unrelated sentences from Opus datasets together to create fake sentence boundaries to train the Transformer model on. For example:

```
<detect-sentence-boundary> This is a sentence from the Europarl dataset. This is an unrelated sentence in the same dataset 
This is a sentence from the Europarl dataset. <sentence-boundary>
```
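
A minimal sketch of that data generation step, assuming a plain-text file with one corpus sentence per line (the file names are hypothetical; the tags mirror the example above):

```
import random

# Hypothetical input: one sentence per line, e.g. extracted from an Opus corpus.
with open("europarl_sentences.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]

with open("sbd_source.txt", "w", encoding="utf-8") as src, \
     open("sbd_target.txt", "w", encoding="utf-8") as tgt:
    for first in sentences:
        # Append an unrelated sentence to create a fake sentence boundary.
        second = random.choice(sentences)
        src.write(f"<detect-sentence-boundary> {first} {second}\n")
        tgt.write(f"{first} <sentence-boundary>\n")
```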

Going forward I also want to experiment with using rules-based sentence boundary detection systems to generate synthetic data for neural-network-based sentence boundary detection. This could be done by taking unstructured text data, splitting it into sentences with a rules-based system, and then using the split sentences as training data for the neural-network-based system.
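
For example, a rules-based splitter like PySBD could segment raw text and emit the same source/target format as above (a sketch with hypothetical file names):

```
import pysbd

segmenter = pysbd.Segmenter(language="en", clean=False)

with open("raw_text.txt", encoding="utf-8") as f:
    raw = f.read()

# Split the unstructured text with the rules-based system...
sentences = [s.strip() for s in segmenter.segment(raw) if s.strip()]

# ...then turn consecutive sentence pairs into training examples
# for the neural sentence boundary detector.
with open("sbd_source.txt", "w", encoding="utf-8") as src, \
     open("sbd_target.txt", "w", encoding="utf-8") as tgt:
    for first, second in zip(sentences, sentences[1:]):
        src.write(f"<detect-sentence-boundary> {first} {second}\n")
        tgt.write(f"{first} <sentence-boundary>\n")
```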


I think https://spacy.io could be an interesting alternative to explore; it seems solid and in active development.

Like you say, generating training data might be a challenge, and the real issue is not the nominal cases (e.g. [sentence] dot [sentence]) but the more obscure ones (e.g. [He said: "… wait! ". Then Dr. Smith ran away]).


Spacy is a really cool library. They have a sentence splitter model that supports 27 languages on its own, with a very small file size.
For other languages you can install the pretrained models and use those as well.

Spacy has `senter` and `sentencizer` components, where `senter` is model-based and `sentencizer` is rule-based.
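
As a quick sketch of the model-based path (assuming the multi-language model has been installed with `python -m spacy download xx_sent_ud_sm`):

```
import spacy

# The multi-language sentence segmentation model; its pipeline only
# contains the model-based "senter" component.
nlp = spacy.load("xx_sent_ud_sm")

doc = nlp("This is the first sentence. Dr. Smith wrote the second one.")
print([sent.text for sent in doc.sents])
```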

PySBD is a Spacy component that can be added to the pipeline and supports the languages below:

```
LANGUAGE_CODES = {
    'en': English,
    'hi': Hindi,
    'mr': Marathi,
    'zh': Chinese,
    'es': Spanish,
    'am': Amharic,
    'ar': Arabic,
    'hy': Armenian,
    'bg': Bulgarian,
    'ur': Urdu,
    'ru': Russian,
    'pl': Polish,
    'fa': Persian,
    'nl': Dutch,
    'da': Danish,
    'fr': French,
    'my': Burmese,
    'el': Greek,
    'it': Italian,
    'ja': Japanese,
    'de': Deutsch,
    'kk': Kazakh,
    'sk': Slovak
}
```

And for languages that neither PySBD nor Spacy has model support for, you should be able to enable the `sentencizer`, which uses punctuation rules for each language (init using `spacy.blank("de")` for example).
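
For example, a minimal sketch of that rule-based fallback:

```
import spacy

# Blank German pipeline with only the rule-based sentencizer added.
nlp = spacy.blank("de")
nlp.add_pipe("sentencizer")

doc = nlp("Das ist der erste Satz. Das ist der zweite Satz.")
print([sent.text for sent in doc.sents])
```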

I did some basic benchmarking for different sentence splitting libraries.

Here are the results:

| Model | Average Accuracy | Average Runtime (seconds) |
| --- | --- | --- |
| Spacy en_core_web_sm | 0.924311498164287 | 0.0250468651453654 |
| Spacy xx_sent_ud_sm | 0.924311498164287 | 0.00476229190826416 |
| Argos Translate 2 Beta | 0.515548280365557 | 1.87798078854879 |
| Stanza en | 0.924311498164287 | 0.0219400326410929 |

It looks like both Spacy and Stanza are pretty accurate for English and can run quickly. The Spacy xx_sent_ud_sm model is even faster without a loss in accuracy.


The reason xx_sent_ud_sm is faster is that its only enabled component is `senter`, while en_core_web_sm (or any other language-specific Spacy model) has multiple components that all run (e.g. NER, tagging, tokenizing), which slows down computation.
You should disable all components in the Spacy pipeline except for `senter` (or `sentencizer`, which is rule-based, if the model/language isn’t supported with `senter`).
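
A sketch of that with spaCy’s standard loading options (in en_core_web_sm the parser normally sets sentence boundaries and `senter` ships disabled, so it has to be enabled explicitly after excluding the parser):

```
import spacy

# Exclude the dependency parser and switch off the other components
# that aren't needed for sentence splitting.
nlp = spacy.load(
    "en_core_web_sm",
    exclude=["parser"],
    disable=["ner", "tagger", "lemmatizer", "attribute_ruler"],
)
# senter is disabled by default, so enable it explicitly.
nlp.enable_pipe("senter")

doc = nlp("This is the first sentence. This is the second one.")
print([sent.text for sent in doc.sents])
```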
