Shouldn’t HTML entities be preserved?

Examples:

de:

<span>Äpfel &amp; Birnen</span>

en:

<span>Apples & pears</span>

Even translate="no" does not help here:

de

<span>Äpfel <span translate="no">&amp;</span> Birnen</span>

en

<span>Apples <span translate="no">&</span> Pears</span>

They should be, so this might be a bug.


Whether HTML entities should be preserved depends on the specific language and sentence. The language model reads an HTML entity as a series of characters, like any other word.

If we want HTML entities to consistently be preserved by the seq2seq model we can make a dataset of examples.

With the translate="no" attribute the text shouldn’t be modified. It’s possible Beautiful Soup is escaping the HTML entities.


I’m thinking this is likely.

If I understand correctly, translate-html is used for this.

I have adapted this example to my case.

#from_code = "es"
#to_code = "en"

# html_doc = """<div><h1>Perro</h1></div>"""

from_code = "de"
to_code = "en"

html_doc = """<span>Äpfel &amp; Birnen</span>"""

Here &amp; is still returned correctly:

<span>Apples &amp; pears</span>

I think the bug is in app.py:

from html import unescape
...
                    results.append(unescape(translated_text))
...
                    return jsonify(
                        {
                            "translatedText": unescape(translated_text)
                        }
                    )
...

It is not correct to run html.unescape on the text here.

In my opinion, the text should only be JSON-encoded, and jsonify already does that.
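To illustrate the point, here is a small stdlib-only sketch, not the actual LibreTranslate code. It shows that `html.unescape` turns `&amp;` into a bare `&`, while JSON encoding alone leaves the entity intact. Note that `json.dumps` only stands in for Flask's `jsonify` here, and the sample string is an assumed translator output.

```python
import json
from html import unescape

# Translated HTML as it might come back from the translator (assumed value).
translated_text = "<span>Apples &amp; pears</span>"

# Response body when the text is unescaped before encoding.
unescaped_payload = json.dumps({"translatedText": unescape(translated_text)})

# Response body when the text is only JSON-encoded.
plain_payload = json.dumps({"translatedText": translated_text})

print(unescaped_payload)
print(plain_payload)
```

After decoding the JSON on the client side, the first payload contains a bare `&`, which is no longer valid inside HTML markup, while the second still contains `&amp;`.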


See this commit

That commit was meant to fix this issue: https://github.com/LibreTranslate/LibreTranslate/issues/203

However, it is probably what causes the problem in your case. The problem should not come from argostranslate, because with translate="no" argostranslate does not translate the text.


Maybe it would be better to check whether the training data is in HTML format and unescape it first, or transform it into plain text?
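As a rough sketch of what I mean, assuming the training data is available as individual text lines: the standard-library `HTMLParser` can strip tags and decode entities in one pass. The names `_TextExtractor` and `to_plain_text` are made up for this example and are not part of Argos Translate.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text content and drops all tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; in text data.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def to_plain_text(line: str) -> str:
    """Hypothetical helper: return the line with tags removed and entities decoded."""
    parser = _TextExtractor()
    parser.feed(line)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


print(to_plain_text("<span>Äpfel &amp; Birnen</span>"))
print(to_plain_text("Apples &amp; pears"))
```

Lines without any markup pass through with only their entities decoded, so the same function could be applied to the whole dataset without a separate HTML check.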

I believe this has been done; the data is now filtered for HTML.
But the problem is not in argostranslate. It is in the commit that fixes a problem for plain-text translation but seems to create your problem for HTML translation.
I’ll look at that when I have time.
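One possible shape for a fix, purely as a sketch under the assumption that the request's format ("text" or "html") is known at the point where the result is returned: unescape only plain-text results and leave HTML results alone. The helper name `finalize_translation` and the parameter `text_format` are hypothetical and do not exist in app.py.

```python
from html import unescape


def finalize_translation(translated_text: str, text_format: str) -> str:
    """Hypothetical helper: decode entities only for plain-text results."""
    if text_format == "html":
        # Entities are part of the markup, so leave them untouched.
        return translated_text
    # In plain text, stray entities in the output are noise, so decode them.
    return unescape(translated_text)


print(finalize_translation("<span>Apples &amp; pears</span>", "html"))
print(finalize_translation("Apples &amp; pears", "text"))
```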


I’m not doing any data filtering for HTML entities in the Argos Translate data anymore. I used to but I stopped because it was very slow.


From my point of view, filtering the data would be the right way to go. Maybe the performance can still be optimized?

I would be interested to know where this was removed.

Maybe the model needs to be trained to respect HTML tags or Markdown, or a special text segment could be added to control the formatting of the translation. Currently, we lose the context with Markdown if we want LibreTranslate to respect the formatting.
E.g.: to **invite it again** on https://google.com.
is split into:
['invite it again', 'on']
