Unfortunately, LibreTranslate often ignores the target language the user supplied and does not translate, nor does it give any error message.
So I kindly ask: how can I prevent LibreTranslate from just returning the query string after corrupting it?
How can I find out with certainty that LibreTranslate failed?
% curl -X POST http://lthost:5000/translate -F 'q="The full text should contain all details __01911926098108448__.The full text should contain all details __01911926098108448__."' -F 'source="en"' -F 'target="ar"'
{"translatedText":"The full text should contain all details __01911926098108448_. The full text should contain all details __01911926098108448_."}
%
% curl -X POST http://lthost:5000/translate -F 'q="The full text should contain all details.The full text should contain all details ."' -F 'source="en"' -F 'target="ar"'
{"translatedText":"وينبغي أن يتضمن النص الكامل جميع التفاصيل. وينبغي أن يتضمن النص الكامل جميع التفاصيل."}
As you can see, it looks like it is super easy to trip up LibreTranslate just by entering ordinary things like long numbers.
Any idea how to at least make LibreTranslate error out or whatever, but never give back false data?
As it is now, it is practically faking the translation: it returns almost unmodified input data, and only after corrupting it (see the damaged tags).
Not good.
Edit: One of the many things I hate about Discourse is that it is super annoying to edit texts in a microscopically small window. Tedious and error-prone.
You’ll need to check for corrupted data yourself. A basic check for Arabic could involve making sure that the majority of characters are from the Arabic set.
This is caused by the black-box neural network giving a bad response. There's no easy way to detect this, since the software itself isn't failing. The only thing you can do is try to validate the translation yourself, for example by checking that the output uses the expected script.
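To make that concrete, here is a minimal Python sketch of such a check. It assumes LibreTranslate's standard /translate JSON API; the 50% threshold and the Unicode ranges are my own guesses, not anything the project prescribes:

import json
import urllib.request

def arabic_ratio(text):
    # Fraction of alphabetic characters that fall in the main Arabic Unicode blocks.
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    arabic = [c for c in letters if '\u0600' <= c <= '\u06FF' or '\u0750' <= c <= '\u077F']
    return len(arabic) / len(letters)

def translate_or_fail(q, source, target, url="http://lthost:5000/translate"):
    payload = json.dumps({"q": q, "source": source, "target": target}).encode()
    req = urllib.request.Request(url, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        translated = json.loads(resp.read())["translatedText"]
    # Heuristic: treat the result as a failure if most letters are not in the target script.
    if target == "ar" and arabic_ratio(translated) < 0.5:
        raise ValueError("suspect translation, looks untranslated: %r" % translated)
    return translated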
Argos is right. The models that show this behavior simply don't know how to handle long numbers, so they get lost.
Either you introduce thorough pre- and post-processing (see the sketch below the example output), or you train your own models. The ones I trained keep numbers intact (in general; there might be the occasional glitch), but on the other hand they scrap the underscores…
يجب أن يحتوي النص الكامل على جميع التفاصيل 01911926098108448 . يجب أن يحتوي النص الكامل على جميع التفاصيل 01911926098108448 .
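For the pre/post-processing route, one common trick is to pull long digit runs out of the text before sending it to the translator and put them back afterwards. This is only a sketch of the idea; the placeholder token and the "long number" threshold are arbitrary choices of mine, and a badly chosen placeholder can of course get mangled by the model as well:

import re

NUM_RE = re.compile(r"\d{5,}")  # treat runs of 5+ digits as "long" numbers

def protect_numbers(text):
    # Replace long digit runs with short placeholder tokens before translation.
    numbers = []
    def repl(match):
        numbers.append(match.group(0))
        return "N%d" % (len(numbers) - 1)
    return NUM_RE.sub(repl, text), numbers

def restore_numbers(text, numbers):
    # Put the original digit runs back in place of the placeholders.
    for i, num in enumerate(numbers):
        text = text.replace("N%d" % i, num)
    return text

masked, nums = protect_numbers("The full text should contain all details 01911926098108448.")
# send `masked` through the translator, then:
# restored = restore_numbers(translated_text, nums)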
You misunderstood me…
The data material I am testing the translation capabilities with consists of the resources for the CMS I am writing; it is about articles structured as title/abstract/full text.
Anyway, this made me decide to use Gemma directly, sending it my prompt and receiving the answer.
To my surprise I had to learn that Gemma also has some limitations; it seems to feel inhibited when asked to translate security-related material (login screens, process lists, etc.) and shows the same failure modes there. Strangely, this phenomenon seems to get worse the more languages one translates in a single call; currently I am testing with 9 languages, 1 source and 8 targets.
It took me many hours and many retries until I got my prompt to the point where Gemma no longer gives back empty translations or simply echoes the original-language string…
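For what it's worth, the "empty or echoed back" failure mode is at least cheap to detect mechanically, whichever model produces it. A minimal sketch, where the 0.9 similarity threshold is just my own guess:

from difflib import SequenceMatcher

def looks_untranslated(source, translated, threshold=0.9):
    # Flag empty outputs or outputs that are nearly identical to the input.
    if not translated.strip():
        return True
    return SequenceMatcher(None, source.strip(), translated.strip()).ratio() >= threshold

print(looks_untranslated("The full text should contain all details.", ""))   # True
print(looks_untranslated("The full text should contain all details.",
                         "The full text should contain all details."))       # True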