Whether HTML entities should be preserved depends on the specific language and sentence. The language model reads an HTML entity as a series of characters, like any other word.
If we want the seq2seq model to preserve HTML entities consistently, we can make a dataset of examples.
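A minimal sketch of what generating such examples could look like. The seed pairs, the entity list, and the function names are all made up for illustration; this is not the actual training pipeline.

```python
import random

# Hypothetical seed pairs (English -> Spanish); a real dataset would be far larger.
PAIRS = [("Hello", "Hola"), ("Thank you", "Gracias")]
# Hypothetical selection of entities we want the model to copy through unchanged.
ENTITIES = ["&amp;", "&nbsp;", "&lt;", "&gt;", "&copy;"]

def add_entity(src, tgt, entity):
    """Append the same entity to both sides so the model sees it copied verbatim."""
    return f"{src} {entity}", f"{tgt} {entity}"

def build_examples(pairs, entities, seed=0):
    """Build one augmented example per seed pair, with a reproducible entity choice."""
    rng = random.Random(seed)
    return [add_entity(s, t, rng.choice(entities)) for s, t in pairs]

examples = build_examples(PAIRS, ENTITIES)
for src, tgt in examples:
    print(src, "->", tgt)
```

Appending the entity is the simplest placement. Entities in the middle of a sentence would need aligned positions in source and target, which this sketch does not attempt.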
With the translate="no" attribute, the text shouldn’t be modified. It’s possible Beautiful Soup is escaping the HTML entities.
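To illustrate the kind of round trip that could change entities, here is a sketch using only Python's standard `html` module. It is not Beautiful Soup, so it only shows the suspected mechanism (decode on parse, then re-escape a minimal set on output) and does not confirm what happens in this case.

```python
import html

# Sample input chosen for illustration: one named entity plus two "minimal" ones.
original = "caf&eacute; &amp; bar &lt;3"

# Decoding turns every entity into its plain character.
decoded = html.unescape(original)

# Re-escaping only covers &, < and >, so &eacute; does not come back.
reencoded = html.escape(decoded, quote=False)

print(decoded)
print(reencoded)
```

If something like this happens between parsing and serializing, `&amp;` and `&lt;` survive but other named entities come out as literal characters.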
I believe this has been done; the data is now filtered for HTML.
But the problem is not with argostranslate. It comes from the commit that fixes a problem with plain-text translation, which may be what causes your problem with HTML translation.
I’ll look at that when I have time.
Maybe the model needs to be trained to respect HTML tags or Markdown, or a special segment of text could be added to control the formatting of the translation. Currently, if we want LibreTranslate to respect the formatting, we lose the context with Markdown.
E.g.: to **invite it again** on https://google.com.
will be split into:
['invite it again', 'on']
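One way to keep the sentence in one piece would be to mask the formatting with placeholders before translating and restore it afterwards. This is only a sketch: the `<xN>` placeholder format, the pattern, and the function names are hypothetical, and it assumes the model copies placeholders through unchanged, which would have to be checked or trained for.

```python
import re

# Hypothetical pattern: bold markers and bare URLs count as protected spans.
# The URL part stops before trailing punctuation such as a final period.
PROTECTED = re.compile(r"\*\*|https?://\S+[^\s.,]")

def mask(text):
    """Replace each protected span with a numbered placeholder."""
    saved = []

    def repl(match):
        saved.append(match.group(0))
        return f"<x{len(saved) - 1}>"

    return PROTECTED.sub(repl, text), saved

def unmask(text, saved):
    """Put the original spans back in place of their placeholders."""
    for i, span in enumerate(saved):
        text = text.replace(f"<x{i}>", span)
    return text

sentence = "to **invite it again** on https://google.com."
masked, saved = mask(sentence)
print(masked)                  # the whole sentence would go to the translator
print(unmask(masked, saved))   # formatting restored after translation
```

The translator then sees the full sentence as context instead of separate fragments.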