Training an Argos Translation Model Locally on Windows

Downloading Text for Training

Automatic Quick Setup

Download Text Automatically

Setup

git clone https://github.com/Interaction-Bot/opus-nlp-downloader.git
cd opus-nlp-downloader
pip install -r requirements.txt
python main.py get en th
python main.py download en th data/

Response

{
    'wikimedia': {'links': 'https://object.pouta.csc.fi/OPUS-wikimedia/v20210402/moses/en-th.txt.zip', 'sentences': 26597},
    'CCAligned': {'links': 'https://object.pouta.csc.fi/OPUS-CCAligned/v1/moses/en-th.txt.zip', 'sentences': 10746372},
    'OpenSubtitles': {'links': 'https://object.pouta.csc.fi/OPUS-OpenSubtitles/v2018/moses/en-th.txt.zip', 'sentences': 3281533},
    'XLEnt': {'links': 'https://object.pouta.csc.fi/OPUS-XLEnt/v1.2/moses/en-th.txt.zip', 'sentences': 1236145},
    'Tanzil': {'links': 'https://object.pouta.csc.fi/OPUS-Tanzil/v1/moses/en-th.txt.zip', 'sentences': 93540},
    'QED': {'links': 'https://object.pouta.csc.fi/OPUS-QED/v2.0a/moses/en-th.txt.zip', 'sentences': 264677},
    'GNOME': {'links': 'https://object.pouta.csc.fi/OPUS-GNOME/v1/moses/en-th.txt.zip', 'sentences': 78},
    'NeuLab-TedTalks': {'links': 'https://object.pouta.csc.fi/OPUS-NeuLab-TedTalks/v1/moses/en-th.txt.zip', 'sentences': 102773},
    'bible-uedin': {'links': 'https://object.pouta.csc.fi/OPUS-bible-uedin/v1/moses/en-th.txt.zip', 'sentences': 124386},
    'TED2020': {'links': 'https://object.pouta.csc.fi/OPUS-TED2020/v1/moses/en-th.txt.zip', 'sentences': 160762}
}
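
The response lists corpora of very different sizes (78 sentences for GNOME, over 10 million for CCAligned), so it can help to rank them before deciding what to package. The sketch below is not part of opus-nlp-downloader; `pick_corpora` and the `min_sentences` threshold are hypothetical names chosen for illustration, and the dictionary is a three-entry excerpt of the response above.

```python
# Excerpt of the response shown above (three of the ten corpora).
response = {
    "wikimedia": {
        "links": "https://object.pouta.csc.fi/OPUS-wikimedia/v20210402/moses/en-th.txt.zip",
        "sentences": 26597,
    },
    "GNOME": {
        "links": "https://object.pouta.csc.fi/OPUS-GNOME/v1/moses/en-th.txt.zip",
        "sentences": 78,
    },
    "TED2020": {
        "links": "https://object.pouta.csc.fi/OPUS-TED2020/v1/moses/en-th.txt.zip",
        "sentences": 160762,
    },
}


def pick_corpora(response, min_sentences=1000):
    """Hypothetical helper: keep corpora with at least `min_sentences`
    sentence pairs and return (name, sentences, link) tuples, largest first."""
    rows = [
        (name, info["sentences"], info["links"])
        for name, info in response.items()
        if info["sentences"] >= min_sentences
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


for name, sentences, link in pick_corpora(response):
    print(f"{name:<12} {sentences:>10,}  {link}")
```

The threshold of 1000 is arbitrary; it only serves to drop tiny corpora such as GNOME from the list.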

Alternative - Collect Translation Texts

Opus project

  • Gather data from the OPUS project linked above
  • Get the English text
  • Get the Thai text (the Thai copy of the English text)
  • Get the license information
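
When collecting the texts by hand, a quick sanity check is that the English and Thai files have the same number of lines. This assumes the usual parallel-corpus layout of one sentence per line, with line N in one file translating line N in the other. The sketch below is only an illustration: `check_parallel` and the sample file names are hypothetical, and the demo writes two tiny files to a temporary folder instead of using real OPUS data.

```python
import tempfile
from pathlib import Path


def count_lines(path):
    """Count lines without loading the whole file into memory."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_parallel(source_path, target_path):
    """Hypothetical helper: return the shared line count of two files,
    or raise ValueError if their line counts differ."""
    n_source = count_lines(source_path)
    n_target = count_lines(target_path)
    if n_source != n_target:
        raise ValueError(
            f"line count mismatch: {n_source} source vs {n_target} target"
        )
    return n_source


# Self-contained demo with two made-up sentence pairs.
with tempfile.TemporaryDirectory() as tmp:
    english = Path(tmp) / "sample.en"
    thai = Path(tmp) / "sample.th"
    english.write_text("Hello\nThank you\n", encoding="utf-8")
    thai.write_text("สวัสดี\nขอบคุณ\n", encoding="utf-8")
    print(check_parallel(english, thai))
```

The returned count is also the number that would go into the "size" field of a data package's metadata.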

Creating an Argos Data Package

  • Folder structure
data-<dataSource>-<codeFrom>_<codeTo>
	metadata.json
	README
	source
	target
  • metadata.json
{
	"name": "<dataSource>",
	"type": "data",
	"from_code": "<codeFrom>",
	"to_code": "<codeTo>",
	"size": <sentences>,
	"reference": ""
}
  • Zip each folder and change the extension from .zip to .argosdata
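
The packaging steps above can be sketched in Python as follows. This is a rough illustration, not an official Argos tool: `build_argosdata` and its parameters are hypothetical, the README is filled with whatever license text is passed in, and the archive is assumed to contain the package folder as its top-level entry (which is what zipping the folder produces).

```python
import json
import tempfile
import zipfile
from pathlib import Path


def write_lines(path, lines):
    """Write one sentence per line, forcing "\n" endings even on Windows."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write("\n".join(lines) + "\n")


def build_argosdata(out_dir, data_source, code_from, code_to,
                    source_lines, target_lines, readme_text=""):
    """Hypothetical helper: create the folder layout described above,
    then zip it into data-<dataSource>-<codeFrom>_<codeTo>.argosdata."""
    if len(source_lines) != len(target_lines):
        raise ValueError("source and target must have the same number of lines")

    name = f"data-{data_source}-{code_from}_{code_to}"
    package_dir = Path(out_dir) / name
    package_dir.mkdir(parents=True, exist_ok=True)

    metadata = {
        "name": data_source,
        "type": "data",
        "from_code": code_from,
        "to_code": code_to,
        "size": len(source_lines),
        "reference": "",
    }
    (package_dir / "metadata.json").write_text(
        json.dumps(metadata, indent=4), encoding="utf-8"
    )
    (package_dir / "README").write_text(readme_text, encoding="utf-8")
    write_lines(package_dir / "source", source_lines)
    write_lines(package_dir / "target", target_lines)

    # Writing straight to the .argosdata name replaces the "zip, then rename" step.
    archive = Path(out_dir) / f"{name}.argosdata"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as bundle:
        for item in sorted(package_dir.iterdir()):
            bundle.write(item, arcname=f"{name}/{item.name}")
    return archive


# Self-contained demo with two made-up sentence pairs.
with tempfile.TemporaryDirectory() as tmp:
    archive = build_argosdata(
        tmp, "example", "en", "th",
        ["Hello", "Thank you"], ["สวัสดี", "ขอบคุณ"],
        readme_text="License information goes here.",
    )
    print(archive.name)
    print(zipfile.ZipFile(archive).namelist())
```

Deriving "size" from the number of lines keeps metadata.json consistent with the source and target files.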

Hosting the Package Locally

  • Add all .argosdata files to one folder
  • Install Python 3
  • In that folder, run
python3 -m http.server
  • The links to add to the data-index inside Docker will have the format
http://host.docker.internal:8000/<your-file>.argosdata
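
To avoid typing each link by hand, the URLs can be generated from the contents of the served folder. This is a small sketch under stated assumptions: `local_links` is a hypothetical helper, the port is 8000 because that is the default for `python3 -m http.server`, and file names are percent-encoded in case they contain spaces.

```python
import tempfile
from pathlib import Path
from urllib.parse import quote


def local_links(folder, host="host.docker.internal", port=8000):
    """Hypothetical helper: build one URL per .argosdata file in `folder`,
    in the format shown above."""
    return [
        f"http://{host}:{port}/{quote(path.name)}"
        for path in sorted(Path(folder).glob("*.argosdata"))
    ]


# Self-contained demo using empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "data-example-en_th.argosdata").touch()
    (Path(tmp) / "notes.txt").touch()  # ignored: not an .argosdata file
    for link in local_links(tmp):
        print(link)
```

If the server is started on another port (for example `python3 -m http.server 9000`), pass the same value as `port`.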