Should HTML entities not be preserved?

veltrup · February 16, 2023, 9:45am

Examples:

de:

<span>Äpfel &amp; Birnen</span>

en:

<span>Apples & pears</span>

Even translate="no" does not help here:

de

<span>Äpfel <span translate="no">&amp;</span> Birnen</span>

en

<span>Apples <span translate="no">&</span> Pears</span>

pierotofy · February 16, 2023, 3:01pm

They should be, so this might be a bug.

argosopentech · February 16, 2023, 11:26pm

In general, HTML entities should sometimes be preserved; it depends on the specific language and sentence. The language model reads an HTML entity as a series of characters, like any other word.

If we want HTML entities to consistently be preserved by the seq2seq model we can make a dataset of examples.

With the translate="no" attribute the text shouldn’t be modified. It’s possible Beautiful Soup is escaping the HTML entities.

pierotofy · February 17, 2023, 3:37am

I’m thinking this is likely.

veltrup · February 17, 2023, 6:31am

If I see it correctly, translate-html is used for this.

I have adapted this example to my case.

#from_code = "es"
#to_code = "en"

# html_doc = """<div><h1>Perro</h1></div>"""

from_code = "de"
to_code = "en"

html_doc = """<span>Äpfel &amp; Birnen</span>"""

Here &amp; is still returned correctly:

<span>Apples &amp; pears</span>

veltrup · February 17, 2023, 7:12am

I think the bug is in app.py:

from html import unescape
...
                    results.append(unescape(translated_text))
...
                    return jsonify(
                        {
                            "translatedText": unescape(translated_text)
                        }
                    )
...

It is not correct to HTML-unescape the text here.

In my opinion it should only be encoded with json and this should be done by jsonify.
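
The effect can be reproduced with the Python standard library alone. This is only a minimal sketch, not the actual app.py code: the value of translated_text is an assumed stand-in for what the HTML translation returns.

```python
import json
from html import unescape

# Assumed stand-in for the HTML string returned by the translator.
translated_text = "<span>Apples &amp; pears</span>"

# html.unescape decodes the entity, so the output no longer matches
# the escaping of the input document.
unescaped = unescape(translated_text)
print(unescaped)  # <span>Apples & pears</span>

# JSON encoding on its own leaves the entity untouched.
encoded = json.dumps({"translatedText": translated_text})
decoded = json.loads(encoded)["translatedText"]
print(decoded)  # <span>Apples &amp; pears</span>
```

So the JSON layer does not need the unescape step to transport the HTML safely.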

veltrup · February 17, 2023, 7:24am

See this commit

dingedi · February 17, 2023, 8:04am

This commit normally fixes this issue: https://github.com/LibreTranslate/LibreTranslate/issues/203

But it surely causes a problem in your case. It cannot come from argostranslate, because with translate="no" argostranslate does not translate.

veltrup · February 17, 2023, 8:22am

Maybe it would be better to check whether the training data is in HTML format and unescape it first, or convert it to plain text?
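
A rough sketch of what such a cleanup step could look like, assuming the training data is available as (source, target) string pairs. The clean_pair helper and the sample pairs are hypothetical and not part of Argos Translate.

```python
from html import unescape
from typing import Tuple


def clean_pair(source: str, target: str) -> Tuple[str, str]:
    """Decode HTML entities in one sentence pair (hypothetical helper)."""
    # Cheap pre-check: a string without '&' cannot contain an entity,
    # so most lines skip the unescape call entirely.
    if "&" in source:
        source = unescape(source)
    if "&" in target:
        target = unescape(target)
    return source, target


# Made-up sample data for illustration only.
pairs = [("Äpfel &amp; Birnen", "Apples &amp; pears"), ("Hallo", "Hello")]
cleaned = [clean_pair(s, t) for s, t in pairs]
print(cleaned)
```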

dingedi · February 17, 2023, 9:09am

I believe this has been done; the data is now filtered for HTML.
But the problem is not with argostranslate. It is with the commit, which solves a problem when the translation is plain text but seems to create your problem in the case of HTML translation.
I’ll look at that when I have time.
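
One possible direction, sketched under assumptions: unescape only when the request is plain text and leave HTML output as it is. The postprocess function and its text_format parameter are hypothetical names for illustration, not the actual code in app.py.

```python
from html import unescape


def postprocess(translated_text: str, text_format: str) -> str:
    """Hypothetical helper: decode entities only for plain-text requests."""
    if text_format == "html":
        # HTML output must keep its entities, e.g. &amp; stays &amp;.
        return translated_text
    # Plain text should not contain entities introduced by the model.
    return unescape(translated_text)


print(postprocess("<span>Apples &amp; pears</span>", "html"))
print(postprocess("Apples &amp; pears", "text"))
```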

argosopentech · February 17, 2023, 1:59pm

I’m not doing any data filtering for HTML entities in the Argos Translate data anymore. I used to but I stopped because it was very slow.

veltrup · February 17, 2023, 2:45pm

From my point of view, it would be the right way to go. Maybe the performance can still be optimized?

I would be interested to know where this was removed.

Jourdelune · February 17, 2023, 8:26pm

Maybe the model must be trained to respect HTML tags or Markdown, or a special segment of text could be added to control the formatting of the translation. Currently we lose the context with Markdown if we want LibreTranslate to respect the formatting.
E.g.: to **invite it again** on https://google.com.
will be split into:
['invite it again', 'on']