GitHub - tldr-pages/tldr-translation-pairs-gen: Generates a structured dataset in various formats

tldr-translation-pairs-gen


About

A CLI application that parses tldr pages from the tldr-pages/tldr repository and produces a dataset mapping strings across localized pages. The motivation is to provide an additional corpus for OPUS; see What is OPUS? for more context.
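To illustrate what "mapping strings across localized pages" can look like, here is a minimal sketch in Python. It is not this tool's actual implementation: the function names `parse_page` and `pair_strings` are hypothetical, and the positional alignment strategy is an assumption. It relies only on the tldr page layout (`# ` title, `> ` description, `- ` example description, backtick-wrapped command).

```python
from typing import List, Tuple


def parse_page(text: str) -> List[Tuple[str, str]]:
    """Split a tldr page into (kind, string) segments, in document order."""
    segments = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# "):
            segments.append(("title", line[2:]))
        elif line.startswith("> "):
            segments.append(("description", line[2:]))
        elif line.startswith("- "):
            segments.append(("example", line[2:]))
        elif len(line) > 1 and line.startswith("`") and line.endswith("`"):
            segments.append(("command", line[1:-1]))
    return segments


def pair_strings(source: str, target: str) -> List[Tuple[str, str]]:
    """Pair translatable strings by position (assumed strategy).

    Pages whose structure differs are skipped entirely, since a positional
    match would then be unreliable.
    """
    src, tgt = parse_page(source), parse_page(target)
    if [kind for kind, _ in src] != [kind for kind, _ in tgt]:
        return []
    return [
        (s, t)
        for (kind, s), (_, t) in zip(src, tgt)
        if kind in ("description", "example")  # titles and commands are left out
    ]


# Hypothetical sample pages, for illustration only.
english = "# tar\n\n> Archiving utility.\n\n- Extract an archive:\n\n`tar xf {{path/to/file.tar}}`\n"
spanish = "# tar\n\n> Utilidad de archivado.\n\n- Extrae un archivo:\n\n`tar xf {{ruta/al/archivo.tar}}`\n"

for pair in pair_strings(english, spanish):
    print(pair)
```

Skipping pages with mismatched structure trades coverage for precision: an outdated translation with a missing example would otherwise shift every later pair out of alignment.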

What is OPUS?

OPUS is a public dataset of translated resources from the web. All translations are derived from freely available and openly licensed sources, so the translations themselves are safe to use with minimal restrictions. These datasets are helpful for a variety of applications, such as research and machine learning.

A notable project that uses the OPUS corpora is LibreTranslate, powered by argos-translate. It’s a free, open-source, and self-hostable machine translation API that doesn’t depend on third-party services. By translating tldr pages, we’re collectively contributing more data to improve open-source machine translation!
