Why am I getting emoji in responses?

I'm trying to translate some text as I always do on the main instance (libretranslate dot com) and I'm getting random emoji in the responses. I can't post any links (including screenshots), so…

Is that ok? Why does it happen? How can I fix it?
Have a nice day :slight_smile:

Did you buy an API Key? How are you sending translation requests (which software)?

Translating emoji, or any unusual character, is always tricky: the model learns from whatever parallel data is fed to it during training. If that data pairs an emoji with inadequate translations, the model cannot translate it properly.
No dataset can be comprehensive either; you'll only find usable translations for the most common emoji.
The best way to address this is to retrain the model using byte fallback. Any complex character (an emoji, a character from an unusual alphabet, an ideograph) is encoded as a sequence of bytes. Byte fallback adds the full set of byte tokens to the model's vocabulary and leaves the rarest characters in the dataset out of that vocabulary, so the model is trained on those seldom-occurring characters as byte sequences and learns to handle unknowns accurately, either reproducing them verbatim (a quoted Japanese name in an English text, say) or translating them (a transliterated phrase).
@lynxpda was the first to introduce byte fallback to the community in spring 2024, and I recently released PRs in both argos and Locomotive that include this feature.
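If you want to see what byte fallback does in practice, here is a minimal sketch of training a SentencePiece vocabulary with it enabled (file names and hyperparameters are placeholders, not the exact settings used in the argos/Locomotive PRs):

```python
import sentencepiece as spm

# Train a subword vocabulary with byte fallback: characters too rare to get
# their own token (character_coverage below 1.0) are split into raw bytes
# such as <0xF0> <0x9F> ..., so emoji and unseen scripts stay representable.
spm.SentencePieceTrainer.train(
    input="parallel_corpus.txt",   # placeholder: one sentence per line
    model_prefix="sp_bytefallback",
    vocab_size=32000,
    model_type="bpe",
    character_coverage=0.9995,     # drop the rarest characters from the vocab
    byte_fallback=True,            # ...and encode them as byte tokens instead
)

sp = spm.SentencePieceProcessor(model_file="sp_bytefallback.model")
# An emoji missing from the vocabulary comes out as a sequence of byte tokens
# instead of a single <unk>, so the translation model can learn to copy it.
print(sp.encode("hello 🙂", out_type=str))
```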

So, to answer your question… why? Because the model was fed inaccurate training data containing emoji.
As for how to remedy it, one has to retrain a new model using byte fallback. Or clean the dataset (which is much more difficult) and retrain anyway.

Another solution is to preprocess and postprocess the text around translation to handle special characters. With custom models I've trained, I've taught the model to carry a "<>" sequence through to the translation and treat it essentially as a noun. That lets me replace any special characters (or @-mentions on Discord) with "<>" before translating and add them back in afterwards, with the model handling the correct placement.
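A rough sketch of that idea (the emoji pattern, the mention rule and the stand-in model output here are just illustrative, not the exact code I run):

```python
import re

# Placeholder token the model has been taught to carry through unchanged.
PLACEHOLDER = "<>"

# Rough pattern for things the model should not try to translate:
# Discord-style mentions like <@1234> and a couple of emoji ranges.
SPECIAL = re.compile(r"<@!?\d+>|[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def protect(text):
    """Swap special characters for <> and remember them in order."""
    saved = SPECIAL.findall(text)
    return SPECIAL.sub(PLACEHOLDER, text), saved

def restore(translated, saved):
    """Put the saved characters back wherever the model kept the <> tokens."""
    parts = translated.split(PLACEHOLDER)
    out = parts[0]
    for i, part in enumerate(parts[1:]):
        out += (saved[i] if i < len(saved) else "") + part
    return out

src, saved = protect("Thanks <@1234>! 🙂 See you tomorrow")
# translated = translate(src, source="en", target="fr")  # whatever client you use
translated = "Merci <> ! <> À demain"                    # stand-in for model output
print(restore(translated, saved))                        # Merci <@1234> ! 🙂 À demain
```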

How did you do this? I am trying to preprocess PDF files and running into trouble with all the tags.

I created a regex to define illegal and legal chars (it has to be Unicode-aware and include Unicode punctuation, Unicode numbers etc., since everything is multilingual), and replaced all illegal chars with <>. Then in every src/tgt example I count how many <> there are; if the counts don't match, I strip them all and either prepend or append a single <> tag to both sides.
That alone should be enough. I had the added benefit of a lot of custom data from online messages, where things like pings (@-mentions) are already respected by translation providers with expensive models such as Google's, so replacing those already-respected sequences with the special <> sequence often resulted in the placeholder landing in the correct place.
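Very roughly, the cleaning step looks something like this (a sketch using the third-party regex package for Unicode property classes; the exact allow-list of characters is up to you):

```python
import regex  # third-party "regex" package, needed for \p{...} Unicode classes

PLACEHOLDER = "<>"
# Anything that is not a Unicode letter, number, mark, punctuation or
# separator is treated as "illegal"; tune the classes to your data.
ILLEGAL = regex.compile(r"[^\p{L}\p{N}\p{M}\p{P}\p{Z}]")

def clean_pair(src, tgt):
    """Replace illegal chars with <> and keep the counts balanced in src/tgt."""
    src = ILLEGAL.sub(PLACEHOLDER, src)
    tgt = ILLEGAL.sub(PLACEHOLDER, tgt)
    if src.count(PLACEHOLDER) != tgt.count(PLACEHOLDER):
        # Counts disagree: drop all placeholders and prepend a single one
        # to both sides so the pair still teaches a 1:1 <> mapping.
        src = PLACEHOLDER + " " + regex.sub(r"\s+", " ", src.replace(PLACEHOLDER, " ")).strip()
        tgt = PLACEHOLDER + " " + regex.sub(r"\s+", " ", tgt.replace(PLACEHOLDER, " ")).strip()
    return src, tgt

print(clean_pair("good morning 🙂", "bonjour"))
# -> ('<> good morning', '<> bonjour')
```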