LibreTranslate community dataset

The LibreTranslate API currently has functionality to accept translation suggestions but a manually assembled database could be helpful.

We often get feedback of the type “this translation wasn’t what I expected or wanted” but don’t do anything with it. We could assemble data by hand with a Git repository as we get feedback on recommended translations.

Any suggestions for doing this?

We could open a Git repository with suggestions from both direct user contributions (contributors can open a pull request) and the feedback data from LibreTranslate (which needs to be moderated/reviewed before merging into the repository).

A simple JSON file could do for starters:

[{"q": input, "s": suggestion, "source": lang_code, "target": lang_code}, ...]

Or maybe multiple JSON files named in chronological order, e.g. <timestamp>.json, so that reviews are easier than looking through one giant file (we can always merge the files later).
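A minimal sketch of the timestamped-file idea, assuming integer Unix timestamps as file names. The helper names `write_batch` and `merge_batches` are hypothetical and only illustrate the layout:

```python
import json
import time
from pathlib import Path


def write_batch(suggestions, out_dir, timestamp=None):
    """Write one batch of suggestions to <timestamp>.json and return the path."""
    # Assumption: file names are integer Unix timestamps, e.g. 1650000000.json
    ts = int(time.time()) if timestamp is None else int(timestamp)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{ts}.json"
    with path.open("w", encoding="utf-8") as f:
        json.dump(suggestions, f, ensure_ascii=False, indent=2)
    return path


def merge_batches(out_dir):
    """Merge all batch files into one list, oldest batch first."""
    merged = []
    # Sort numerically by timestamp so the merge order is chronological
    for path in sorted(Path(out_dir).glob("*.json"), key=lambda p: int(p.stem)):
        with path.open(encoding="utf-8") as f:
            merged.extend(json.load(f))
    return merged
```

Each batch can then be reviewed as its own small file in a pull request, and `merge_batches` produces the single combined list when needed.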

The moderation/review of suggestions from the API might be tricky (how do we avoid trolls or wrong translations?).


The JSON schema you posted looks good to me. Would source_lang be better than source? It could be confusing that “source” is an ISO 639 code and not the text itself.

Maybe .jsonl for the file type, each line is its own JSON object. This would work well with Git and allows you to partially read files and not have to hold the entire JSON file in memory (The Pile and wikiextract both use jsonl). We could then use files for semantically different datasets or datasets from different sources.
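A rough sketch of what the .jsonl approach could look like, using the field names from the schema above. The function names are made up for illustration:

```python
import json


def append_suggestion(path, q, s, source, target):
    """Append one suggestion as a single JSON object on its own line."""
    record = {"q": q, "s": s, "source": source, "target": target}
    with open(path, "a", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-Latin text readable in Git diffs
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def iter_suggestions(path):
    """Yield suggestions one at a time without loading the whole file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Because every record sits on its own line, a new suggestion shows up as a one-line addition in a Git diff, and a reader can stop partway through the file.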

Does the libretranslate.com instance accept /suggest requests?

:+1:

I like the .jsonl idea as well.

LibreTranslate does accept /suggest requests. We actually already have ~1000 submissions in the database. I haven’t had a chance to review them, but at a quick glance a good ~80% of them are of high quality.
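For reference, a submission to a /suggest endpoint could be built roughly like this. This is a sketch that assumes the endpoint takes form-encoded `q`, `s`, `source`, and `target` fields, mirroring the schema proposed earlier in the thread; check the API docs of the instance you use. `build_suggest_request` and the example URL are hypothetical, and the request is only constructed here, not sent:

```python
import urllib.parse
import urllib.request


def build_suggest_request(base_url, q, s, source, target):
    """Build (but do not send) a POST request for a /suggest endpoint."""
    # Assumption: the endpoint reads form fields named q, s, source, target
    body = urllib.parse.urlencode(
        {"q": q, "s": s, "source": source, "target": target}
    ).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/suggest",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Example with a placeholder host; pass the result to urllib.request.urlopen to send it
req = build_suggest_request("https://example.org", "Hello", "Hola", "en", "es")
```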

If you want a more compact JSON representation, we can also use arrays instead of objects (save on repeating the same keys).

Good ol’ CSV would also fit the purpose.


Sounds good

I think arrays would be confusing in a jsonl file because you would have to combine arrays on each line into a larger array.

CSV would work well too.

Looking again we use “source” throughout the LibreTranslate API so it may be a better choice after all.


Yeah, since “lang” is irrelevant (so to speak, either implied or applied) there’s just from and to


Any preference for where the git repo should be hosted (libretranslate, argostranslate)?

I’m still thinking about a good way to filter out bad submissions from user-provided contributions via the API… One idea could be to compare the user submission with the result of the automated translation: a good submission will (?) typically share a percentage of words with it, whereas a “troll” one might just be completely different. But I’m sure I’m over-simplifying and there are going to be lots of other things to consider. Sounds like an ML problem in itself. I’m not even sure a human reviewer could really catch all bad submissions; one would need to be fluent in so many languages.
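The word-similarity idea could be prototyped along these lines. This is only a sketch: it splits on whitespace (which will not work for languages without spaces between words), and the `threshold` value of 0.2 is an arbitrary assumption, not something tuned on real data:

```python
def word_overlap(suggestion, machine_translation):
    """Jaccard similarity between the word sets of two translations."""
    a = set(suggestion.lower().split())
    b = set(machine_translation.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def looks_suspicious(suggestion, machine_translation, threshold=0.2):
    """Flag a suggestion for manual review if it shares too few words."""
    # Assumption: 0.2 is a placeholder cutoff chosen for illustration only
    return word_overlap(suggestion, machine_translation) < threshold
```

A flagged suggestion is not necessarily wrong, so it would make sense to queue flagged items for review rather than discard them.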


Comparing to another translation service might actually catch most trolls, but it could also filter out good translations that are bad in the service used.
It is indeed a complicated task. :sweat_smile:


Any preference for where the git repo should be hosted (libretranslate, argostranslate)?

The LibreTranslate GitHub is probably the best place.

Comparing user submitted data to machine translations with something like a BLEU score is probably a good way to automatically detect spam. We can also just commit the data we have to the public repository and remove bad data as people notice it.
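To make the BLEU idea concrete, here is a simplified, single-reference BLEU-style score in plain Python. It is not the reference BLEU implementation: it assumes whitespace tokenization, uses only 1-grams and 2-grams by default, and applies no smoothing, so any suggestion with zero matching n-grams at some order scores 0. The name `simple_bleu` is made up for this sketch:

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    """Simplified BLEU: clipped n-gram precision with a brevity penalty."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each n-gram count to how often it appears in the reference
        matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matches == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: penalize candidates shorter than the reference
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty
    return bp * math.exp(sum(log_precisions) / max_n)
```

Here `candidate` would be the user submission and `reference` the machine translation; a very low score suggests the submission deserves a closer look.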

In the future we may want more sophisticated tooling, like a web app for native speakers to rate translations, but it’s probably best to do this after we’ve collected data for a bit and know what sort of issues we’re likely to have.

In general, training neural nets is reasonably robust to imperfections in the data. A lot of the data Argos Translate uses is messy and has never been manually reviewed by a human.
