Command line tool to extract po files and create argostranslate suitable datasets

In order to create new open datasets for minoritized languages, I’ve created po2dataset, a command line tool that extracts strings from po files and creates argosdata datasets.

For example, in the Basque language we already have lots of open source applications, video games… translated, so I thought it would be a good idea to extract those strings, filter them if needed, and create open datasets for training future models.
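
To make the extraction step concrete, here is a rough sketch that pulls translated string pairs out of the text of a po file using only the Python standard library. It is not taken from po2dataset: the function name `parse_po` is hypothetical, plural forms and msgctxt are ignored, and entries are assumed to be separated by blank lines (which is how gettext tools normally write them).

```python
import ast


def parse_po(text):
    """Return (msgid, msgstr) pairs from the text of a PO file.

    Simplified on purpose: plural forms and msgctxt are ignored, and
    entries are assumed to be separated by blank lines.
    """
    pairs = []
    entry = {}
    current = None  # field that continuation lines belong to
    for raw in text.splitlines() + [""]:  # trailing "" flushes the last entry
        line = raw.strip()
        if not line:
            # End of entry: keep it only if translated, non-header, non-fuzzy.
            if entry.get("msgid") and entry.get("msgstr") and not entry.get("fuzzy"):
                pairs.append((entry["msgid"], entry["msgstr"]))
            entry, current = {}, None
        elif line.startswith("#"):
            # Comments are dropped, except the flag marking unreviewed entries.
            if line.startswith("#,") and "fuzzy" in line:
                entry["fuzzy"] = True
        elif line.startswith(("msgid ", "msgstr ")):
            current, _, quoted = line.partition(" ")
            # PO strings use C-style escapes; literal_eval covers the common ones.
            entry[current] = ast.literal_eval(quoted)
        elif line.startswith('"') and current:
            # Continuation line of a multi-line msgid or msgstr.
            entry[current] += ast.literal_eval(line)
        else:
            current = None  # msgctxt, msgid_plural, msgstr[n]: skipped here
    return pairs
```

Reading a file with `open(path, encoding="utf-8").read()` and passing the text to `parse_po` gives a list of (English, Basque) tuples. For real projects a dedicated PO parsing library would probably handle more edge cases than this sketch.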

The project is still under heavy development, but the latest versions can create files of the form data-datasetname-en_eu.argosdata (README files are not included yet) that contain everything needed to train argostranslate models.
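
As an illustration of the packaging step, the sketch below writes string pairs into a zip archive. It is not po2dataset's actual code, and the layout is an assumption: a folder holding line-aligned `source` and `target` text files plus a `metadata.json`. The function name `write_data_package` and the metadata fields are illustrative, so check them against the Argos Translate documentation before relying on them.

```python
import json
import zipfile


def write_data_package(path, name, from_code, to_code, pairs):
    """Write (source, target) pairs into a zip archive.

    ASSUMPTION: the archive layout and metadata fields below are a guess
    at what a data package looks like, not a verified specification.
    """
    folder = f"data-{name}-{from_code}_{to_code}"
    # Newlines inside a segment would break the line-by-line alignment.
    clean = [(s.replace("\n", " "), t.replace("\n", " ")) for s, t in pairs]
    metadata = {
        "name": name,
        "type": "data",
        "from_code": from_code,
        "to_code": to_code,
        "size": len(clean),
    }
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(f"{folder}/metadata.json", json.dumps(metadata, indent=2))
        zf.writestr(f"{folder}/source", "".join(s + "\n" for s, _ in clean))
        zf.writestr(f"{folder}/target", "".join(t + "\n" for _, t in clean))
```

Calling `write_data_package("data-demo-en_eu.argosdata", "demo", "en", "eu", pairs)` would produce one archive in which line N of `source` corresponds to line N of `target`.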

So, any contribution will be welcome.

Thank you!

Very cool! Thanks for sharing. How do you handle strings with placeholders (e.g. %(input)s)? Do you just filter those out?

Nope… still WIP. But that’s the idea. Any kind of contribution, idea, or suggestion will be welcome.
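
A minimal sketch of what such a filter could look like, assuming the goal is simply to drop every pair in which either side contains a placeholder. The regular expression is a heuristic that covers printf-style (`%s`, `%(input)s`, `%5.2f`) and brace-style (`{name}`, `{0}`) placeholders. The names `has_placeholder` and `filter_pairs` are hypothetical and not part of po2dataset.

```python
import re

# Heuristic pattern, not an exhaustive grammar of format strings.
PLACEHOLDER = re.compile(
    r"%(?:\([^)]+\))?[-+0#]*\d*(?:\.\d+)?[sdiuxXeEfgGcr]"  # %s, %(input)s, %5.2f
    r"|\{[^{}]*\}"                                         # {}, {0}, {name}
)


def has_placeholder(text):
    """Return True if the text appears to contain a format placeholder."""
    return PLACEHOLDER.search(text) is not None


def filter_pairs(pairs):
    """Keep only (source, target) pairs with no placeholder on either side."""
    return [
        (source, target)
        for source, target in pairs
        if not has_placeholder(source) and not has_placeholder(target)
    ]
```

A softer alternative would be to keep a pair when both sides contain the same set of placeholders, which preserves more data at the cost of leaving format codes in the training text.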

I’m not a machine translation expert, so I’m doing this with my common sense…