Unfortunately, LibreTranslate often ignores the target language the user supplied and does not translate, nor does it give any error message.
So I kindly ask: How can I prevent LibreTranslate from just returning the query string, after corrupting it?
How can I find out with certainty that LibreTranslate failed?
% curl -X POST http://lthost:5000/translate -F 'q="The full text should contain all details __01911926098108448__.The full text should contain all details __01911926098108448__."' -F 'source="en"' -F 'target="ar"'
{"translatedText":"The full text should contain all details __01911926098108448_. The full text should contain all details __01911926098108448_."}
%
% curl -X POST http://lthost:5000/translate -F 'q="The full text should contain all details.The full text should contain all details ."' -F 'source="en"' -F 'target="ar"'
{"translatedText":"وينبغي أن يتضمن النص الكامل جميع التفاصيل. وينبغي أن يتضمن النص الكامل جميع التفاصيل."}
As you can see, it appears to be super easy to knock out LibreTranslate just by entering ordinary things like numbers.
Any idea how to at least make LibreTranslate error out, or whatever, but never give back false data?
As it is now, it is practically faking translation, returning almost unmodified input data, but only after corrupting it (see the damaged tags).
Not good.
Edit: One of the many things I hate about Discourse is that it is super annoying to edit texts in a microscopically small window. Tedious and error prone.
You’ll need to check for corrupted data yourself. A basic check for Arabic could involve making sure that the majority of characters are from the Arabic set.
This is caused by the black-box neural network giving a bad response. There’s no easy way to detect this, since the software itself isn’t failing. The only thing you can do is try to validate the translation yourself, for example by checking that the characters belong to the expected script.
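For what it’s worth, here is a minimal Python sketch of such a check; the 0.5 threshold and the function name are my own choices, not anything LibreTranslate provides:

```python
import unicodedata

def looks_arabic(text: str, threshold: float = 0.5) -> bool:
    """Return True if most alphabetic characters fall in the Arabic blocks."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False  # nothing translatable came back at all
    arabic = [c for c in letters if "ARABIC" in unicodedata.name(c, "")]
    return len(arabic) / len(letters) >= threshold

# Example: flag the "translation" from the first curl call above as suspect.
response = "The full text should contain all details __01911926098108448_."
if not looks_arabic(response):
    print("Translation likely failed; input was echoed back.")
```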
Argos is right. The models that show this behavior don’t know how to handle long numbers, so they get lost.
Either you introduce thorough pre- and post-processing, or you train your own models. The ones I trained keep numbers intact (in general; there might be the occasional glitch), but on the other hand, they scrap underscores…
يجب أن يحتوي النص الكامل على جميع التفاصيل 01911926098108448 . يجب أن يحتوي النص الكامل على جميع التفاصيل 01911926098108448 .
You misunderstood me…
The material I am testing the translation capabilities with consists of the resources for the CMS I am writing; it is about articles structured as title/abstract/full text.
Anyway, this made me decide to use Gemma directly, sending it my prompt and receiving the answer.
To my surprise I had to learn that Gemma also has some limitations: it seems to feel inhibited when asked about security-related material (logins, process lists, etc.) and then shows the same failure modes. Strangely, this phenomenon seems to get worse the more languages one translates in a single call; currently I am testing with 9 languages, 1 source and 8 target languages.
It took me many hours and many retries until I got my prompt to the point where Gemma no longer gives back either empty translations or just echoes the original-language string…
I would highly suggest having some pre- and post-processing for texts where the format should be respected. I currently strip out sequences such as markdown (or Discord @/#/emojis, etc.), put a placeholder into the text before translation, then put back the original sequence once the translation is returned.
It comes down to the translation model/provider respecting the placeholder you use. The obvious weakness of a static placeholder is that the order of items within the text may change, so you may need the placeholder itself to keep track of which original item corresponds to it (see the sketch after the example below).
Rob's story impressed Leah.
Leah se quedó impresionada con la historia de Rob.
<>'s story impressed <>.
<> se quedó impresionada con la historia de <>.
<1>'s story impressed <2>.
<2> se quedó impresionada con la historia de <1>.
Because the order of the placeholders changed (if the placeholders stood for the names in this example), the replacements would be incorrect if the placeholder were just static.
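Here is a rough Python sketch of that numbered-placeholder round trip; the pattern and function names are mine, and it assumes the engine passes tokens like <1> and <2> through untouched, which is exactly the part that cannot be taken for granted:

```python
import re

def protect(text: str, pattern: str = r"@\w+|#\w+|(?:__|==)[0-9A-Z_]+(?:__|==)"):
    """Replace protected sequences with numbered placeholders <1>, <2>, ..."""
    saved = []
    def stash(match):
        saved.append(match.group(0))
        return f"<{len(saved)}>"
    return re.sub(pattern, stash, text), saved

def restore(translated: str, saved: list) -> str:
    """Put the original sequences back, honoring any reordering."""
    return re.sub(r"<(\d+)>", lambda m: saved[int(m.group(1)) - 1], translated)

masked, saved = protect("The full text should contain all details __01911926098108448__.")
# ... send `masked` to the translator here ...
print(restore(masked, saved))  # placeholders swapped back for the originals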
My uses involve a hybrid of my own trained models and Google’s Translation API. Interestingly, Google’s updated NMT models have lately been refusing to obey this approach consistently, leading to some complaints from my users; however, I have run into no issues with my own models, which are trained on data where I’ve artificially added this behavior.
I’ve thought about training the ability to handle tags better into the models before. It should be possible to create synthetic data to get the models to respect tags/placeholder tokens in the seq2seq model.
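As a hedged illustration of what such synthetic data could look like (the <1> token and the helper are invented for this example): take an aligned sentence pair, find a number that appears verbatim on both sides, and replace it with the same placeholder token, so the model learns to copy such tokens through unchanged.

```python
import random
import re

def make_tag_example(src: str, tgt: str):
    """Replace a number occurring verbatim in both sides with <1>,
    teaching the seq2seq model to pass such tokens through unchanged."""
    shared = set(re.findall(r"\d+", src)) & set(re.findall(r"\d+", tgt))
    if not shared:
        return src, tgt  # nothing safely replaceable in this pair
    num = random.choice(sorted(shared))
    return src.replace(num, "<1>", 1), tgt.replace(num, "<1>", 1)

print(make_tag_example(
    "The full text should contain all details 01911926098108448.",
    "يجب أن يحتوي النص الكامل على جميع التفاصيل 01911926098108448.",
))
```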
This issue is why I now concentrate on using Gemma/LTEngine.
The LLM’s ability to place the placeholders correctly, treating them as placeholders or as separators respectively, has really surprised me. This is particularly evident when Gemma is instructed to output not just a single target language, but to take a list of languages and produce all the translations in one go.
And it is quite simple to achieve with Gemma, using this as part of the prompt:
Definition Xung: It is a string containing numbers and/or uppercase ASCII letters and/or underscore characters, which begins and ends with either double underscores or double equality signs.
Xung rules: NEVER interpret xungs. NEVER modify xungs. NEVER omit xungs when reproducing text in specified languages. ALWAYS preserve xungs in their unmodified original form. ALWAYS place them in reproduced texts EXACTLY where they were placed in the original text. Never decompose xungs into parts, in particular never separate the delimiting dual underscores or equal signs from the rest of the xung.
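To be able to reject a bad response instead of accepting corrupted output, a small post-check along these lines could help; the regex simply mirrors the xung definition above, and the names are my own:

```python
import re

# Matches the xung definition: digits/uppercase/underscores, delimited on
# both ends by double underscores or double equality signs.
XUNG = re.compile(r"(?:__|==)[0-9A-Z_]+(?:__|==)")

def xungs_intact(source: str, translation: str) -> bool:
    """True if every xung from the source survives unmodified,
    with the same multiplicity, in the translation."""
    return sorted(XUNG.findall(source)) == sorted(XUNG.findall(translation))

src = "The full text should contain all details __01911926098108448__."
bad = "The full text should contain all details __01911926098108448_."
print(xungs_intact(src, bad))  # False -> reject this response and retry
```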
In case you ask why ‘xung’ instead of ‘tag’: think of the conflicts that arise when you use terms that are already defined. I tried for a long time without success, until I got the idea to purposefully use neologisms to avoid such conflicts in the first place.
Thanks for this; it will be useful in the future. However, since I don’t have the resources to run an instance of my own, it would require significant spend at Digital Ocean (or similar). So I’m going to improve the HTML parser and re-assembler for the moment.