Sentence Boundary Detection for Machine Translation

Yeah we definitely want to let users configure this choice for themselves. I have a “CHUNK_TYPE” setting that can be expanded to support Spacy/Stanza.


http://data.argosopentech.com/argospm/v1/translate-vi_en-1_9.argosmodel

I just merged this, thanks @yudelevi. The main branch on GitHub is now using Stanza 1.10.1.


Looks like the VI model is working; it passes the tests.
I checked the input from "ValueError: substring not found · Issue #531 · stanfordnlp/stanza · GitHub":

>>> text
'Vịt là tên gọi phổ thông cho một số loài chim thuộc họ Vịt (Anatidae) trong bộ Ngỗng (Anseriformes). Các loài này được chia thành một số phân họ trong toàn bộ các phân họ thuộc họ Anatidae. Vịt chủ yếu là một loài chim nước, sống được ở cả vùng nước ngọt lẫn nước mặn, có kích thước nhỏ hơn so với những loài bà con của chúng là ngan, ngỗng, và thiên nga.'
>>> trans.translate(text)
'Ducks are common names for several birds of the Duck family (Anatidae) in the Petrus (Anseriformes) set. These species are divided into a number of them in the entire family of Anatidae. Ducks are mainly a species of water, which live in both sweet water and salt water, which are smaller in size than those of their kins are swan, goose, and swans.'

I also checked the input from "ValueError when tokenize some inputs using `Vietnamese → English` · Issue #216 · argosopentech/argos-translate · GitHub":

>>> text = 'thuc luc di em trai <@!12345>'
>>> trans.translate(text)
'Retrieved younger brother.'

I also ran a random paragraph bidirectionally (EN → VI, then back to EN):

>>> text
"Outer space presents unique challenges to explorers, unlike those found on Earth. The vacuum of space means there's no air to breathe and no atmospheric pressure to support life. Without oxygen, the human body would quickly suffer, and without pressure, bodily fluids would boil. Furthermore, the absence of an atmosphere exposes explorers to harmful radiation from the sun, potentially causing radiation poisoning. Another significant danger is the presence of meteors and micrometeorites, space debris traveling at high speeds, which can cause catastrophic damage to spacecrafts. \nBeyond these fundamental hazards, space exploration also encounters other difficulties. The psychological impact of isolation and confinement on astronauts is a growing concern. There is also the challenge of navigating vast distances and the immense costs associated with space missions. Ultimately, while the allure of space is strong, the inherent dangers necessitate meticulous planning and robust safety measures to protect those who venture beyond Earth"
>>> res=en_vi.translate(text)
>>> res
'Không gian bên ngoài đưa ra những thách thức đặc biệt cho các nhà thám hiểm, không giống như những gì tìm thấy trên trái đất không gian có nghĩa là không có không khí để thở và không có áp suất khí để duy trì sự sống. không có oxy, cơ thể con người sẽ nhanh chóng chịu đựng, và nếu không có áp lực, các chất lỏng cơ thể sẽ sôi sục. hơn nữa, sự thiếu vắng không khí dẫn đến bức xạ độc hại từ mặt trời, có khả năng gây ra ngộ độc bức xạ một mối nguy khác là sự hiện diện của các thiên thạch và các mảnh không gian di chuyển với tốc độ cao có thể gây thiệt hại đến phi thuyền.\nNgoài những mối nguy cơ cơ căn bản này, thám hiểm không gian cũng gặp phải những khó khăn khác. tác động tâm lý của sự cô lập và giam giữ các phi hành gia là mối quan tâm ngày càng lớn. cũng là thách thức của việc định hướng khoảng cách rộng lớn và những chi phí khổng lồ liên quan đến các nhiệm vụ không gian. cuối cùng, trong khi sự hấp dẫn của không gian là mạnh mẽ, những mối nguy hiểm bẩm sinh cần thiết cho kế hoạch tỉ mỉ và những biện pháp an toàn mạnh mẽ để bảo vệ những người mạo hiểm vượt ra ngoài Trái Đất.'
>>> res2=vi_en.translate(res)
>>> res2
'Outdoor space offers special challenges to explorers, unlike what we find on the Earth space means no air to breathe and no gas pressure to sustain life. without oxygen, the human body will quickly suffer, and if there is no pressure, body fluids will boil. More than that, the absence of air leads to harmful radiation from the sun, capable of causing radiation another risk of the presence of asteroids and space particles that move at high speeds that can damage the vessel.\nIn addition to these fundamental risks, space exploration also encounters other difficulties. The psychological impact of isolation and detention of astronauts is increasing concern. It is also the challenge of navigationing large distances and enormous costs associated with space missions. Finally, while the attraction of space is strong, the natural dangers necessary for elaborate planning and strong safety measures to protect those who venture beyond Earth.'

It looks like ChunkType is a relic; I can't find any usage of it in the code. What should the enum values be? Happy to submit a PR.


Thanks! Yeah I don’t think it’s currently connected to anything.

“STANZA” and “SPACY” or something like that is probably good.
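A minimal sketch of what such an enum could look like, assuming the mode is read from an environment variable. The variable name `ARGOS_CHUNK_TYPE`, the `DEFAULT` member, and the `read_chunk_type` helper are illustrative assumptions here, not the project's actual code:

```python
import os
from enum import Enum


class ChunkType(Enum):
    """Hypothetical sentence boundary detection modes."""

    DEFAULT = "DEFAULT"  # assumed: let the library pick a splitter
    STANZA = "STANZA"    # split sentences with Stanza
    SPACY = "SPACY"      # split sentences with Spacy


def read_chunk_type(env=None):
    """Parse the mode from a mapping (os.environ when none is given)."""
    env = os.environ if env is None else env
    raw = env.get("ARGOS_CHUNK_TYPE", ChunkType.DEFAULT.value)
    try:
        # Enum lookup by value; normalise case and whitespace first.
        return ChunkType(raw.strip().upper())
    except ValueError:
        valid = ", ".join(member.value for member in ChunkType)
        raise ValueError(f"Unknown chunk type {raw!r}; expected one of: {valid}")
```

Failing loudly on an unknown value avoids silently falling back to a different splitter than the user asked for.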

Submitted a PR: Add STANZA_ONLY and SPACY_ONLY sentence boundary detection modes by yudelevi · Pull Request #490 · argosopentech/argos-translate · GitHub

Surprisingly, I didn’t see any improvement when using only Spacy and avoiding Stanza. I’ll look into that a bit.


Something isn’t working properly with this change. As you can see in this screenshot, the ‘sentences’ log shows that this paragraph isn’t being correctly split into sentences. As a result, too much text is fed into the neural network at once, which truncates the translation and lowers its quality. In this screenshot there are two versions of Argos Translate running: the released prod version on the left and the main branch on GitHub on the right.

I put the Python debugger in sbd.py:StanzaSentencizer and it seems to be calling Stanza correctly but something is broken.

(env) pj@pj-Latitude-5490:~/git/argos-translate$ TEXT="HMS Queen Mary was the last battlecruiser built by the Royal Navy before the First World War. The sole member of her class, Queen Mary was completed in 1913. She never left the North Sea once the war started, participating in the Battle of Heligoland Bight in 1914 as part of the Grand Fleet. Queen Mary unsuccessfully attempted to intercept a German force that bombarded the North Sea coast of England that December. She was refitting in early 1915 and missed the Battle of Dogger Bank in January. Queen Mary was sunk in the largest fleet action of the war, the Battle of Jutland, in mid-1916. Twice hit by the German battlecruiser Derfflinger during the early part of the battle, her magazines exploded, sinking her. The wreck was discovered in 1991 and rests in pieces on the floor of the North Sea. Her wreck is designated as a protected place under the Protection of Military Remains Act 1986 as it is the grave of 1,266 men."
(env) pj@pj-Latitude-5490:~/git/argos-translate$ export ARGOS_DEBUG=1
(env) pj@pj-Latitude-5490:~/git/argos-translate$ argos-translate -f en -t es "$TEXT"
('Looking for cached Spacy xx_sent_ud_sm.',)
('get_installed_languages',)
('paragraphs:', ['HMS Queen Mary was the last battlecruiser built by the Royal Navy before the First World War. The sole member of her class, Queen Mary was completed in 1913. She never left the North Sea once the war started, participating in the Battle of Heligoland Bight in 1914 as part of the Grand Fleet. Queen Mary unsuccessfully attempted to intercept a German force that bombarded the North Sea coast of England that December. She was refitting in early 1915 and missed the Battle of Dogger Bank in January. Queen Mary was sunk in the largest fleet action of the war, the Battle of Jutland, in mid-1916. Twice hit by the German battlecruiser Derfflinger during the early part of the battle, her magazines exploded, sinking her. The wreck was discovered in 1991 and rests in pieces on the floor of the North Sea. Her wreck is designated as a protected place under the Protection of Military Remains Act 1986 as it is the grave of 1,266 men.'])
('apply_packaged_translation', 'HMS Queen Mary was the last battlecruiser built by the Royal Navy before the First World War. The sole member of her class, Queen Mary was completed in 1913. She never left the North Sea once the war started, participating in the Battle of Heligoland Bight in 1914 as part of the Grand Fleet. Queen Mary unsuccessfully attempted to intercept a German force that bombarded the North Sea coast of England that December. She was refitting in early 1915 and missed the Battle of Dogger Bank in January. Queen Mary was sunk in the largest fleet action of the war, the Battle of Jutland, in mid-1916. Twice hit by the German battlecruiser Derfflinger during the early part of the battle, her magazines exploded, sinking her. The wreck was discovered in 1991 and rests in pieces on the floor of the North Sea. Her wreck is designated as a protected place under the Protection of Military Remains Act 1986 as it is the grave of 1,266 men.')
('sentences', ['HMS Queen Mary was the last battlecruiser built by the Royal Navy before the First World War. The sole member of her class, Queen Mary was completed in 1913. She never left the North Sea once the war started, participating in the Battle of Heligoland Bight in 1914 as part of the Grand Fleet. Queen Mary unsuccessfully attempted to intercept a German force that bombarded the North Sea coast of England that December. She was refitting in early 1915 and missed the Battle of Dogger Bank in January. Queen Mary was sunk in the largest fleet action of the war, the Battle of Jutland, in mid-1916. Twice hit by the German battlecruiser Derfflinger during the early part of the battle, her magazines exploded, sinking her. The wreck was discovered in 1991 and rests in pieces on the floor of the North Sea. Her wreck is designated as a protected place under the Protection of Military Remains Act 1986 as it is the grave of 1,266 men.'])
('tokenized', [['▁H', 'MS', '▁Queen', '▁Mary', '▁was', '▁the', '▁last', '▁battle', 'cru', 'is', 'er', '▁built', '▁by', '▁the', '▁Royal', '▁Navy', '▁before', '▁the', '▁First', '▁World', '▁War', '.', '▁The', '▁sole', '▁member', '▁of', '▁her', '▁class', ',', '▁Queen', '▁Mary', '▁was', '▁completed', '▁in', '▁19', '13', '.', '▁She', '▁never', '▁left', '▁the', '▁North', '▁Sea', '▁once', '▁the', '▁war', '▁started', ',', '▁participating', '▁in', '▁the', '▁Battle', '▁of', '▁He', 'lig', 'o', 'land', '▁B', 'ight', '▁in', '▁19', '14', '▁as', '▁part', '▁of', '▁the', '▁Grand', '▁Fle', 'et', '.', '▁Queen', '▁Mary', '▁un', 's', 'uc', 'ces', 's', 'fully', '▁attempted', '▁to', '▁intercept', '▁a', '▁German', '▁force', '▁that', '▁bombard', 'ed', '▁the', '▁North', '▁Sea', '▁coast', '▁of', '▁England', '▁that', '▁December', '.', '▁She', '▁was', '▁re', 'fit', 'ting', '▁in', '▁early', '▁19', '15', '▁and', '▁missed', '▁the', '▁Battle', '▁of', '▁Do', 'gger', '▁Bank', '▁in', '▁January', '.', '▁Queen', '▁Mary', '▁was', '▁su', 'nk', '▁in', '▁the', '▁largest', '▁fleet', '▁action', '▁of', '▁the', '▁war', ',', '▁the', '▁Battle', '▁of', '▁Ju', 't', 'land', ',', '▁in', '▁mid', '-19', '16', '.', '▁T', 'wi', 'ce', '▁hit', '▁by', '▁the', '▁German', '▁battle', 'cru', 'is', 'er', '▁Der', 'ff', 'ling', 'er', '▁during', '▁the', '▁early', '▁part', '▁of', '▁the', '▁battle', ',', '▁her', '▁magazine', 's', '▁explode', 'd', ',', '▁sin', 'king', '▁her', '.', '▁The', '▁wreck', '▁was', '▁discovered', '▁in', '▁1991', '▁and', '▁rest', 's', '▁in', '▁pieces', '▁on', '▁the', '▁floor', '▁of', '▁the', '▁North', '▁Sea', '.', '▁Her', '▁wreck', '▁is', '▁designated', '▁as', '▁a', '▁protected', '▁place', '▁under', '▁the', '▁Protection', '▁of', '▁Military', '▁Re', 'main', 's', '▁Act', '▁1986', '▁as', '▁it', '▁is', '▁the', '▁grave', '▁of', '▁1,2', '66', '▁men', '.']])
('translated_batches', [TranslationResult(hypotheses=[['▁El', '▁único', '▁miembro', '▁de', '▁su', '▁clase', ',', '▁la', '▁reina', '▁María', '▁fue', '▁termina', 'da', '▁en', '▁19', '13', '.', '▁Ella', '▁nunca', '▁abandon', 'ó', '▁el', '▁Mar', '▁del', '▁Norte', '▁una', '▁vez', '▁que', '▁comenzó', '▁la', '▁guerra', ',', '▁participando', '▁en', '▁la', '▁batalla', '▁de', '▁He', 'lig', 'o', 'land', '▁B', 'ight', '▁en', '▁19', '14', '▁como', '▁parte', '▁de', '▁la', '▁Gran', '▁Flo', 'ta', '.', '▁La', '▁reina', '▁María', '▁trató', '▁sin', '▁éxito', '▁de', '▁intercept', 'ar', '▁una', '▁fuerza', '▁alemana', '▁que', '▁bombard', 'e', 'ó', '▁la', '▁costa', '▁del', '▁Mar', '▁del', '▁Norte', '▁de', '▁Inglaterra', '▁en', '▁diciembre', '.']], scores=[-11.79053020477295], attention=[], logits=[])])
('value_hypotheses:', [('El único miembro de su clase, la reina María fue terminada en 1913. Ella nunca abandonó el Mar del Norte una vez que comenzó la guerra, participando en la batalla de Heligoland Bight en 1914 como parte de la Gran Flota. La reina María trató sin éxito de interceptar una fuerza alemana que bombardeó la costa del Mar del Norte de Inglaterra en diciembre.', -11.79053020477295)])
('translated_paragraphs:', [[('El único miembro de su clase, la reina María fue terminada en 1913. Ella nunca abandonó el Mar del Norte una vez que comenzó la guerra, participando en la batalla de Heligoland Bight en 1914 como parte de la Gran Flota. La reina María trató sin éxito de interceptar una fuerza alemana que bombardeó la costa del Mar del Norte de Inglaterra en diciembre.', -11.79053020477295)]])
('hypotheses_to_return:', [('El único miembro de su clase, la reina María fue terminada en 1913. Ella nunca abandonó el Mar del Norte una vez que comenzó la guerra, participando en la batalla de Heligoland Bight en 1914 como parte de la Gran Flota. La reina María trató sin éxito de interceptar una fuerza alemana que bombardeó la costa del Mar del Norte de Inglaterra en diciembre.', -11.79053020477295)])
El único miembro de su clase, la reina María fue terminada en 1913. Ella nunca abandonó el Mar del Norte una vez que comenzó la guerra, participando en la batalla de Heligoland Bight en 1914 como parte de la Gran Flota. La reina María trató sin éxito de interceptar una fuerza alemana que bombardeó la costa del Mar del Norte de Inglaterra en diciembre.
(env) pj@pj-Latitude-5490:~/git/argos-translate$ export ARGOS_DEBUG=0
(env) pj@pj-Latitude-5490:~/git/argos-translate$ argos-translate -f es -t en "El único miembro de su clase, la reina María fue terminada en 1913. Ella nunca abandonó el Mar del Norte una vez que comenzó la guerra, participando en la batalla de Heligoland Bight en 1914 como parte de la Gran Flota. La reina María trató sin éxito de interceptar una fuerza alemana que bombardeó la costa del Mar del Norte de Inglaterra en diciembre."
The only member of her class, Queen Mary was completed in 1913. She never left the North Sea once the war began, participating in the Battle of Heligoland Bight in 1914 as part of the Great Fleet. Queen Mary tried unsuccessfully to intercept a German force that bombed the North Sea coast of England in December.
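The failure in the log above, where a whole paragraph reaches the model as a single entry in 'sentences', could be caught with a cheap heuristic check. This is only a sketch: `naive_split`, `oversized_chunks`, and the 300-character threshold are illustrative assumptions, and the regex splitter is a rough stand-in rather than a replacement for Stanza or Spacy.

```python
import re


def naive_split(text):
    """Rough splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def oversized_chunks(sentences, max_chars=300):
    """Return chunks that are long AND still contain several sentences.

    Such chunks suggest the sentence boundary detector returned its input
    unsplit. Abbreviations can fool the naive splitter, so treat a hit as
    a warning to investigate, not as proof of a bug.
    """
    return [
        s for s in sentences
        if len(s) > max_chars and len(naive_split(s)) > 1
    ]
```

A test that runs a multi-sentence paragraph through the real sentencizer and asserts that `oversized_chunks` comes back empty would have flagged this regression even though the existing tests passed.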

Sorry, @argosopentech, this is my bad.
I had issues with get_installed_packages segfaulting locally when loading, so I added tokenize_pretokenized=True to the Stanza pipeline init (I misunderstood what the flag does). The tests passed, but sentence splitting was broken as a result.

It might be an issue with pytorch+cpu on Intel Mac (it wouldn’t be the first problem I’ve run into there). I’ll try to get to the bottom of it and submit a PR that resets tokenize_pretokenized.
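For context on why the flag breaks splitting: with tokenize_pretokenized=True, Stanza assumes the text is already tokenized by whitespace and sentence-split by newlines, and it does not run the tokenizer model. The sketch below only emulates that input contract in plain Python so it runs without Stanza or any downloaded models; `pretokenized_sentences` is a hypothetical helper, not a Stanza function.

```python
def pretokenized_sentences(text):
    """Emulate the pretokenized input contract.

    One sentence per non-empty line, tokens separated by whitespace.
    The real pipeline would be built roughly like:
        stanza.Pipeline(lang="en", processors="tokenize",
                        tokenize_pretokenized=True)
    """
    return [line.split() for line in text.split("\n") if line.strip()]


paragraph = "The wreck was discovered in 1991. It rests in the North Sea."

# No newlines in the input, so the whole paragraph counts as ONE sentence.
print(len(pretokenized_sentences(paragraph)))

# Only explicit newlines produce sentence boundaries.
print(len(pretokenized_sentences(paragraph.replace(". ", ".\n"))))
```

This matches the 'sentences' debug output earlier in the thread, where the full paragraph came back as a single list element.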

@argosopentech, please see the PR "Fix Stanza by yudelevi · Pull Request #494 · argosopentech/argos-translate · GitHub". I tested it with your sample.