Examples:
de:
<span>Äpfel &amp; Birnen</span>
en:
<span>Apples & pears</span>
Even translate="no" does not help here:
de:
<span>Äpfel <span translate="no">&amp;</span> Birnen</span>
en:
<span>Apples <span translate="no">&</span> Pears</span>
They should be preserved, so this might be a bug.
In general, whether HTML entities should be preserved depends on the specific language and sentence. The language model reads an HTML entity as a series of characters, like any other word.
If we want HTML entities to be consistently preserved by the seq2seq model, we can build a dataset of examples.
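As a rough illustration, such a dataset could be generated by augmenting an existing parallel corpus. The file names and corpus layout below are assumptions for the sketch, not LibreTranslate's actual training pipeline:

# Hypothetical sketch: augment a parallel corpus so the seq2seq model sees
# HTML entities on both sides and learns to copy them through unchanged.
entities = ["&amp;", "&lt;", "&gt;", "&quot;"]

with open("source.de") as src, open("target.en") as tgt, \
        open("source_aug.de", "w") as src_out, open("target_aug.en", "w") as tgt_out:
    for de, en in zip(src, tgt):
        src_out.write(de)
        tgt_out.write(en)
        for ent in entities:
            # Add a copy of each sentence pair with the same entity appended
            # to both sides, so the entity maps to itself.
            src_out.write(f"{de.rstrip()} {ent}\n")
            tgt_out.write(f"{en.rstrip()} {ent}\n")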
With the translate="no" attribute the text shouldn't be modified. It's possible Beautiful Soup is escaping the HTML entities.
I’m thinking this is likely.
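A quick check of Beautiful Soup's behavior (a minimal sketch, assuming bs4 with the built-in html.parser) shows it decodes entities when parsing and re-escapes them when serializing, so a round trip through the parser alone should not lose &amp;:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<span>Äpfel &amp; Birnen</span>", "html.parser")
print(soup.span.string)  # Äpfel & Birnen  (entity decoded inside the tree)
print(str(soup))         # <span>Äpfel &amp; Birnen</span>  (re-escaped on output)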
If I see this correctly, translate-html is used for this.
I have adapted its example to my case:
#from_code = "es"
#to_code = "en"
# html_doc = """<div><h1>Perro</h1></div>"""
from_code = "de"
to_code = "en"
html_doc = """<span>Äpfel & Birnen</span>"""
Here &amp; is still returned correctly:
<span>Apples &amp; pears</span>
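For reference, the complete test looks roughly like this (a sketch following the translate-html README, assuming the de→en Argos Translate package is already installed; it reuses from_code, to_code and html_doc from above):

import argostranslate.translate
import translatehtml

langs = argostranslate.translate.get_installed_languages()
from_lang = next(l for l in langs if l.code == from_code)
to_lang = next(l for l in langs if l.code == to_code)
translation = from_lang.get_translation(to_lang)
print(translatehtml.translate_html(translation, html_doc))
# -> <span>Apples &amp; pears</span>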
I think the bug is in app.py:
from html import unescape
...
results.append(unescape(translated_text))
...
return jsonify(
    {
        "translatedText": unescape(translated_text)
    }
)
...
It is not correct to unescape the text with html.unescape here.
In my opinion, the text should only be JSON-encoded, and that is already done by jsonify.
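A hedged sketch of a fix along those lines: only unescape in plain-text mode and leave HTML responses alone (source_format is an assumed variable name here, not necessarily what app.py uses):

if source_format == "html":
    # HTML mode: keep entities such as &amp; intact; jsonify handles the
    # JSON encoding of the string.
    results.append(str(translated_text))
else:
    # Plain-text mode: keep the unescape behavior introduced for #203.
    results.append(unescape(translated_text))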
See this commit
This commit normally fixes this issue: https://github.com/LibreTranslate/LibreTranslate/issues/203
But it surely creates the problem in your case. It cannot come from argostranslate, because with translate="no" argostranslate does not translate.
Maybe it would be better to check whether the training data is in HTML format and unescape it first, or transform it to plain text?
I believe this has already been done; the data is now filtered for HTML.
But the problem is not with argostranslate; it is with the commit, which solves a problem when the translation is plain text but seems to create your problem in the case of HTML translation.
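The tension is easy to reproduce: the same call that cleans up escaped characters in plain-text output destroys entities that should survive in HTML output:

from html import unescape

print(unescape("It&#39;s raining"))    # It's raining   (the text-mode fix for #203)
print(unescape("Apples &amp; pears"))  # Apples & pears (the entity is lost in HTML mode)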
I’ll look at that when I have time.
I’m not doing any data filtering for HTML entities in the Argos Translate data anymore. I used to but I stopped because it was very slow.
From my point of view, that would be the right way to go. Maybe the performance can still be optimized?
I would be interested to know where this filtering was removed.
Maybe you must train the model to respect HTML tags or Markdown, or add a special text segment to control the formatting of the translation. Currently we lose the context with Markdown if we want LibreTranslate to respect the formatting.
E.g.: to **invite it again** on https://google.com.
will be split into:
['invite it again', 'on']
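For illustration, here is roughly how that split happens once the Markdown is rendered to HTML and each text node is extracted separately (a sketch assuming the markdown and bs4 packages; the exact segmentation in LibreTranslate may differ):

import markdown
from bs4 import BeautifulSoup

html = markdown.markdown("to **invite it again** on https://google.com.")
soup = BeautifulSoup(html, "html.parser")
print([t for t in soup.find_all(string=True) if t.strip()])
# e.g. ['to ', 'invite it again', ' on https://google.com.']
# Each text node is translated separately, so the model never sees
# the full sentence as one unit.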