Sentence Boundary Detection for Machine Translation

argosopentech · October 22, 2023, 3:12pm

I think I’m going to start moving towards spaCy for sentence boundary detection. Stanza works pretty well, but it has a lot of bugs [1][2] and requires installing PyTorch, which is a ~700MB dependency.
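For anyone who hasn't used spaCy for this, here is a minimal sketch of sentence splitting with it. It assumes spaCy v3 and uses the rule-based `sentencizer` component on a blank English pipeline, so no trained model is downloaded. That choice is only for illustration and is not necessarily what Argos Translate would end up using; the `split_sentences` helper is a made-up name.

```python
import spacy


def split_sentences(text: str, lang: str = "en") -> list[str]:
    """Split text into sentences with spaCy's rule-based sentencizer.

    Illustrative helper, not part of Argos Translate.
    """
    # A blank pipeline has only a tokenizer, so nothing is downloaded.
    nlp = spacy.blank(lang)
    # The sentencizer marks sentence starts based on punctuation.
    nlp.add_pipe("sentencizer")
    doc = nlp(text)
    return [sent.text for sent in doc.sents]


if __name__ == "__main__":
    for sentence in split_sentences("Hello world. How are you? Fine."):
        print(sentence)
```

A trained pipeline loaded with `spacy.load(...)` would probably find boundaries more accurately than punctuation rules, at the cost of shipping or downloading model data.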

My tentative plan to make this backwards compatible is to:

Keep including Stanza models in .argosmodel packages when possible
Continue to support Stanza in Argos Translate with stanza==1.1.1
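
To make the compatibility idea concrete, here is a rough sketch of how a backend could be picked per package. Everything in it is an assumption for illustration: the `choose_sbd_backend` name, the `stanza` and `spacy` directory names inside an unpacked package, and the priority order are not taken from the actual Argos Translate code.

```python
from pathlib import Path

# Hypothetical directory names inside an unpacked .argosmodel package;
# the real package layout may differ.
STANZA_DIR = "stanza"
SPACY_DIR = "spacy"


def choose_sbd_backend(package_dir: Path) -> str:
    """Pick a sentence boundary detection backend for one package.

    Assumed priority: bundled spaCy data, then a bundled Stanza model
    (older packages), then a spaCy model fetched at runtime.
    """
    if (package_dir / SPACY_DIR).is_dir():
        return "spacy-bundled"
    if (package_dir / STANZA_DIR).is_dir():
        return "stanza"
    return "spacy-download"
```

With this ordering, older packages that only ship a Stanza model would keep working unchanged, while newer packages could move to spaCy whichever way its data ends up being distributed.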

I still need to figure out whether to put the spaCy data files in the .argosmodel packages or have spaCy download any models it needs on the first run.