Sentence Boundary Detection for Machine Translation

I merged this pull request into Argos Translate on GitHub, but I’m not going to release it on PyPI immediately. The pull request adds support for using either Stanza or spaCy to perform Sentence Boundary Detection (sentence segmentation).
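For readers unfamiliar with where sentence boundary detection sits in a translation pipeline, here is a minimal sketch. It is not Argos Translate's code: naive_split is a hypothetical regex stand-in for a Stanza or spaCy sentencizer, and translate_sentence is any callable you supply. The point is only that text is split into sentences first and each sentence is translated on its own.

```python
import re
from typing import Callable, List


def naive_split(text: str) -> List[str]:
    """Hypothetical stand-in splitter: break after '.', '!' or '?' followed by whitespace.

    A real sentencizer (Stanza, spaCy) handles abbreviations, quotes and
    languages without such punctuation far better than this regex.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def translate_text(
    text: str,
    split: Callable[[str], List[str]],
    translate_sentence: Callable[[str], str],
) -> str:
    """Split the input into sentences, translate each one, and rejoin them."""
    return " ".join(translate_sentence(sentence) for sentence in split(text))
```

Swapping the sentencizer then only means passing a different split callable, which is roughly the flexibility the pull request is after.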


lecoqnicolas commented Jan 31, 2025

  1. Created property sdb_package in package, whose value either points to an SBD subdirectory (Stanza or language-specific spaCy) or is None.
  2. Created class StanzaSentencizer in sbd, and rewrote SpacySentencizerSmall and the base class to be package-dependent: they load from the property described above or, if it is None, from the cache.
  3. Fixed the automatic spaCy download: created function cache_spacy in networking.
  4. To avoid a circular import, called the classes upon initializing the sentencizer in translate.
  5. Fixed the byte-fallback bug in tokenizer while keeping the rewriting of underscores as spaces active (legacy code commented out).
  6. Commented out all legacy code related to the “stanza_available” environment variable in settings, package and translate.
  7. Swapped a few lines and fixed spelling here and there for consistency.
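
Items 1 and 2 can be pictured roughly as follows. This is only a sketch under stated assumptions, not the code from the pull request: the Package class here is a hypothetical stand-in, the subdirectory names "stanza" and "spacy", the order in which they are checked, and the choice of SpacySentencizerSmall as the cache fallback are all guesses made for illustration. The function returns the name of the sentencizer class rather than instantiating anything, so it runs without Stanza or spaCy installed.

```python
from pathlib import Path
from typing import Optional, Tuple


class Package:
    """Hypothetical stand-in for an installed language package."""

    def __init__(self, package_path: Path):
        self.package_path = package_path

    @property
    def sdb_package(self) -> Optional[Path]:
        # Assumed layout: a bundled model lives in a "stanza" or "spacy" subdirectory.
        for name in ("stanza", "spacy"):
            candidate = self.package_path / name
            if candidate.is_dir():
                return candidate
        # Nothing bundled: the sentencizer has to load its model from the cache.
        return None


def choose_sentencizer(pkg: Package) -> Tuple[str, Optional[Path]]:
    """Return (sentencizer class name, model directory or None for the cache)."""
    sbd_dir = pkg.sdb_package
    if sbd_dir is None:
        # Assumption: fall back to the small spaCy model held in the cache.
        return ("SpacySentencizerSmall", None)
    if sbd_dir.name == "stanza":
        return ("StanzaSentencizer", sbd_dir)
    return ("SpacySentencizerSmall", sbd_dir)
```

Deferring this choice until the sentencizer is first needed is also one way to keep translate from importing sbd at module load time, which is the kind of circular import item 4 mentions.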