LibreTranslate community dataset

argosopentech · April 24, 2022, 6:30pm

The LibreTranslate API currently has functionality to accept translation suggestions but a manually assembled database could be helpful.

We often get feedback of the type “this translation wasn’t what I expected or wanted” but don’t do anything with it. We could assemble data by hand with a Git repository as we get feedback on recommended translations.

Any suggestions for doing this?

pierotofy · April 25, 2022, 8:50pm

We could open a Git repository with suggestions from both direct user contributions (users can open a pull request) and the feedback data from LibreTranslate (which needs to be moderated/reviewed before merging into the repository).

A simple JSON file could do for starters:

[{"q": input, "s": suggestion, "source": lang_code, "target": lang_code}, ...]

Or maybe multiple JSON files (named in chronological order), maybe <timestamp>.json so that reviews can be easier instead of looking at one giant file (we can always merge the files later).

The moderation/review of suggestions from the API might be tricky (how to avoid trolls or wrong translations?)

argosopentech · April 25, 2022, 10:55pm

The JSON schema you posted looks good to me. Would source_lang be better than source? It could be confusing that “source” is an ISO 639 code and not the text itself.

Maybe .jsonl for the file type, each line is its own JSON object. This would work well with Git and allows you to partially read files and not have to hold the entire JSON file in memory (The Pile and wikiextract both use jsonl). We could then use files for semantically different datasets or datasets from different sources.
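As a rough illustration of the .jsonl idea, here is a minimal sketch that writes and reads records using the keys proposed above (q, s, source, target). The helper names and the sample records are made up for the example.

```python
import io
import json

def write_jsonl(records, fp):
    """Write one JSON object per line. ensure_ascii=False keeps non-Latin text readable in diffs."""
    for rec in records:
        fp.write(json.dumps(rec, ensure_ascii=False) + "\n")

def read_jsonl(fp):
    """Yield records one at a time so the whole file never has to be held in memory."""
    for line in fp:
        line = line.strip()
        if line:  # tolerate blank lines
            yield json.loads(line)

# Made-up sample records, only for illustration.
records = [
    {"q": "Hello", "s": "Hola", "source": "en", "target": "es"},
    {"q": "Good night", "s": "Buenas noches", "source": "en", "target": "es"},
]

buf = io.StringIO()  # stands in for a real file on disk
write_jsonl(records, buf)
buf.seek(0)
loaded = list(read_jsonl(buf))
```

Because json.dumps escapes newlines inside strings, each record stays on a single line, which keeps Git diffs to one line per suggestion.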

Does the libretranslate.com instance accept /suggest requests?

pierotofy · April 26, 2022, 2:03am

I like the .jsonl idea as well.

LibreTranslate does accept /suggest requests. We actually have already ~1000 submissions in the database, I haven’t had a chance to review them, but at a quick glance a good ~80% of them are of high quality.
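For reference, a suggestion can be submitted with a POST to /suggest carrying the same four fields. The sketch below only builds the request and does not send it; the base URL is a placeholder for whatever instance you run, and the helper name is hypothetical. An instance may have suggestions disabled or require an API key, neither of which is handled here.

```python
import json
import urllib.request

def build_suggest_request(base_url, q, s, source, target):
    """Build (but do not send) a JSON POST request for the /suggest endpoint."""
    payload = json.dumps(
        {"q": q, "s": s, "source": source, "target": target}
    ).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/suggest",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Placeholder URL: point this at your own instance.
req = build_suggest_request("http://localhost:5000", "Hello", "Hola", "en", "es")
# To actually submit: urllib.request.urlopen(req)
```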

If you want a more compact JSON representation, we can also use arrays instead of objects (save on repeating the same keys).

Good ol’ CSV would also fit the purpose.

argosopentech · April 26, 2022, 12:26pm

Sounds good

I think arrays would be confusing in a jsonl file because you would have to combine arrays on each line into a larger array.

CSV would work well too.

argosopentech · April 28, 2022, 12:22am

Looking again we use “source” throughout the LibreTranslate API so it may be a better choice after all.

jefs42 · April 28, 2022, 1:13am

Yeah, since “lang” is irrelevant (so to speak, either implied or applied) there’s just from and to

pierotofy · April 28, 2022, 3:51am

Any preference for where the git repo should be hosted (libretranslate, argostranslate)?

I’m still thinking about a good way to filter out bad submissions from user provided contributions via the API… One idea could be to compare the user submission with the result of the automated translation: a good submission will (?) typically share a percentage of words with it, whereas a “troll” one might just be completely different. But I’m sure I’m over-simplifying and there’s going to be lots of other things to consider. Sounds like an ML problem in itself. I’m not even sure a human reviewer could really catch all bad submissions, since one would need to be fluent in so many languages.
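A minimal sketch of the word-overlap idea, using Jaccard similarity over lowercase word sets. The function names, the threshold, and the machine translation string are all assumptions made up for the example; nothing here is tuned on real data.

```python
def word_overlap(candidate, reference):
    """Jaccard similarity of lowercase word sets, from 0.0 (disjoint) to 1.0 (identical sets)."""
    a = set(candidate.lower().split())
    b = set(reference.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

THRESHOLD = 0.2  # hypothetical cut-off, would need tuning

def looks_suspicious(suggestion, machine_translation, threshold=THRESHOLD):
    """Flag a suggestion that shares almost no words with the machine translation."""
    return word_overlap(suggestion, machine_translation) < threshold

# Made-up machine output and two user suggestions for the same input.
machine = "I visited my grandmother on Friday afternoon"
plausible = "I visited my grandmother Friday afternoon"
unrelated = "No hablar inglés jajajaja"
```

Whitespace splitting will not work for languages written without spaces (Chinese, Japanese, Thai), and a good correction of a bad machine translation can legitimately score low, so this is better suited to flagging entries for review than to deleting them.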

dingedi · April 28, 2022, 8:24am

Comparing to another translation service might actually catch most trolls, but it could also filter out good translations that are bad in the service used.
It is indeed a complicated task.

argosopentech · April 28, 2022, 12:01pm

Any preference for where the git repo should be hosted (libretranslate, argostranslate)?

The LibreTranslate GitHub is probably the best place.

Comparing user submitted data to machine translations with something like a BLEU score is probably a good way to automatically detect spam. We can also just commit the data we have to the public repository and remove bad data as people notice it.
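To make the BLEU idea concrete, here is a simplified BLEU-style score in plain Python: clipped n-gram precision up to bigrams, add-one smoothing, and a brevity penalty. This is a sketch and not standard BLEU (which uses up to 4-grams and is normally computed over a whole corpus); for real use an established implementation such as sacreBLEU would be a safer choice. The example sentences are made up.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU-style score between 0 and 1."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clipped matches: a candidate n-gram counts at most as often as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        # Add-one smoothing avoids log(0) on short sentences.
        log_precisions.append(math.log((matches + 1) / (total + 1)))
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)

machine = "I visited my grandmother on Friday afternoon"   # made-up machine output
plausible = "I visited my grandmother Friday afternoon"
unrelated = "No hablar inglés jajajaja"

score_plausible = simple_bleu(plausible, machine)
score_unrelated = simple_bleu(unrelated, machine)
```

A low score only says the suggestion differs from the machine output, so any cut-off would have to be chosen by looking at real submissions.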

In the future we may want more sophisticated tooling, like a web app for native speakers to rate translations, but it’s probably best to do this after we’ve collected data for a bit and know what sort of issues we’re likely to have.

In general, training neural nets is reasonably robust to imperfections in the data. A lot of the data Argos Translate uses is messy and has never been manually reviewed by a human.

pierotofy · April 29, 2022, 2:11pm

pierotofy · May 22, 2022, 8:14pm

Finally took the time to upload the first round of user submitted suggestions:

github.com/LibreTranslate/CommunityDS/blob/main/1653250371.jsonl

{"q": "باشى", "s": "ظابطتلذابظلقذلقيتا", "source": "ar", "target": "ko"}
{"q": "قمت بزيارة جدتي عصر الجمعة", "s": "I on visited my grandmother Friday afternoon.", "source": "ar", "target": "en"}
{"q": "الرسم", "s": "Grea uation", "source": "ar", "target": "en"}
{"q": "Mice", "s": "فئران", "source": "ar", "target": "ar"}
{"q": "راين دروبز\n", "s": "Rain drops", "source": "ar", "target": "en"}
{"q": "\n\n", "s": "KSA", "source": "ar", "target": "en"}
{"q": "التلفق", "s": "ننحخ٧٧", "source": "ar", "target": "en"}
{"q": "ل", "s": "Yhrhgdhhdn\n", "source": "ar", "target": "en"}
{"q": " هلا ", "s": "Hi", "source": "ar", "target": "en"}
{"q": "انا احبك يا امي  ", "s": "Is breá liom tú, Mam.ggc", "source": "ar", "target": "ga"}
{"q": "Reblog", "s": "Reblog", "source": "ar", "target": "en"}
{"q": "\n\n\n\n\n\n\n\n\n", "s": "Cliff", "source": "ar", "target": "en"}
{"q": "استداريت\n", "s": "Turn around.", "source": "ar", "target": "en"}
{"q": "Apellidos", "s": "Apellidos", "source": "ar", "target": "en"}
{"q": "اناني\n", "s": "Selfish.", "source": "ar", "target": "en"}
{"q": "Square", "s": "Square", "source": "ar", "target": "en"}
{"q": "كيف حالك", "s": "No hablar inglés jajajaja", "source": "ar", "target": "en"}
{"q": " ", "s": "Hi\n", "source": "ar", "target": "en"}
{"q": "VERIF\n\n", "s": "Verif.\n\n", "source": "ar", "target": "en"}
{"q": "دانغو", "s": "Dango", "source": "ar", "target": "en"}

This file has been truncated.
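Some of the rows above can be rejected without knowing either language: an empty or whitespace-only q, identical source and target codes, or a suggestion that just repeats the input. A minimal sketch of such a pre-filter, with hypothetical function names and a few lines copied from the file above as test input:

```python
import json

def is_obvious_junk(rec):
    """True for records that can be rejected without understanding either language."""
    q = rec.get("q", "").strip()
    s = rec.get("s", "").strip()
    if not q or not s:
        return True  # empty or whitespace-only text
    if rec.get("source") == rec.get("target"):
        return True  # same language on both sides
    if q.lower() == s.lower():
        return True  # suggestion merely repeats the input
    return False

def clean(rec):
    """Return a copy of the record with surrounding whitespace stripped from q and s."""
    return {**rec, "q": rec["q"].strip(), "s": rec["s"].strip()}

# Raw strings so that \n reaches the JSON parser as an escape sequence.
sample_lines = [
    r'{"q": "\n\n", "s": "KSA", "source": "ar", "target": "en"}',
    r'{"q": "Mice", "s": "فئران", "source": "ar", "target": "ar"}',
    r'{"q": "Reblog", "s": "Reblog", "source": "ar", "target": "en"}',
    r'{"q": "اناني\n", "s": "Selfish.", "source": "ar", "target": "en"}',
    r'{"q": " هلا ", "s": "Hi", "source": "ar", "target": "en"}',
]

kept = [clean(r) for r in map(json.loads, sample_lines) if not is_obvious_junk(r)]
```

The "repeats the input" rule may also drop legitimate entries such as names that are spelled the same in both languages, so it might be better to flag those than to delete them. Gibberish and wrong-language suggestions are not caught by any of these checks.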

pierotofy · May 22, 2022, 8:18pm

One can use ./suggestions-to-jsonl.py to extract the info from the suggestions database of LT.

jefs42 · May 23, 2022, 2:49am

Do I need to reinstall LT? Where is ./suggestions-to-jsonl.py (current directory somewhere…)

Haven’t checked in a while, not sure I have many suggestions to add from my API’s /suggest endpoint other than from testing the PHP interface.

But if people are using my end API and do suggest, how would I then add that?

pierotofy · May 24, 2022, 2:49am

You just need to pull/download the script from the LT GitHub repository.

I don’t think you need to add anything (necessarily), this is a script and is probably invoked by administrators that have shell access.

jefs42 · May 24, 2022, 4:19am

Perfect thanks! But yeah, I don’t have any useful data in there to submit as yet

Raw Python Script

Yeah, you could put it anywhere and run as any user that has read access to the suggestions.db file using the --db /path/to/suggestions.db option.

I instead figured I’d just create it in the home directory of the user that my LT server is running under, where the suggestions.db file is located anyway. So I can just go there and run ./outputSuggestions.py (down the line I’d probably edit the script and set the default for --clear to True)