Downloading Text for Training
Automatic Quick Setup
Download Text Automatically
Setup
git clone https://github.com/Interaction-Bot/opus-nlp-downloader.git
cd opus-nlp-downloader
pip install -r requirements.txt
python main.py get en th
python main.py download en th data/
Response
{
    'wikimedia': {'links': 'https://object.pouta.csc.fi/OPUS-wikimedia/v20210402/moses/en-th.txt.zip', 'sentences': 26597},
    'CCAligned': {'links': 'https://object.pouta.csc.fi/OPUS-CCAligned/v1/moses/en-th.txt.zip', 'sentences': 10746372},
    'OpenSubtitles': {'links': 'https://object.pouta.csc.fi/OPUS-OpenSubtitles/v2018/moses/en-th.txt.zip', 'sentences': 3281533},
    'XLEnt': {'links': 'https://object.pouta.csc.fi/OPUS-XLEnt/v1.2/moses/en-th.txt.zip', 'sentences': 1236145},
    'Tanzil': {'links': 'https://object.pouta.csc.fi/OPUS-Tanzil/v1/moses/en-th.txt.zip', 'sentences': 93540},
    'QED': {'links': 'https://object.pouta.csc.fi/OPUS-QED/v2.0a/moses/en-th.txt.zip', 'sentences': 264677},
    'GNOME': {'links': 'https://object.pouta.csc.fi/OPUS-GNOME/v1/moses/en-th.txt.zip', 'sentences': 78},
    'NeuLab-TedTalks': {'links': 'https://object.pouta.csc.fi/OPUS-NeuLab-TedTalks/v1/moses/en-th.txt.zip', 'sentences': 102773},
    'bible-uedin': {'links': 'https://object.pouta.csc.fi/OPUS-bible-uedin/v1/moses/en-th.txt.zip', 'sentences': 124386},
    'TED2020': {'links': 'https://object.pouta.csc.fi/OPUS-TED2020/v1/moses/en-th.txt.zip', 'sentences': 160762}
}
- Download data from links and unzip
- Skip to Creating an Argos Data Package
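If you would rather fetch the archives by hand than rely on `python main.py download`, the "download data from links and unzip" step could look roughly like the sketch below. It uses only the Python standard library. The helper names `select_links` and `download_and_unzip`, the `min_sentences` threshold, and the `data/<corpus>/` layout are assumptions made for illustration and are not part of opus-nlp-downloader.

```python
import urllib.request
import zipfile
from pathlib import Path


def select_links(response, min_sentences=0):
    """Return {corpus_name: url} for corpora with at least min_sentences pairs.

    `response` is a dictionary shaped like the Response shown above.
    """
    return {
        name: info["links"]
        for name, info in response.items()
        if info["sentences"] >= min_sentences
    }


def download_and_unzip(name, url, out_dir="data"):
    """Download one corpus archive and extract it into out_dir/<name>/."""
    target_dir = Path(out_dir) / name
    target_dir.mkdir(parents=True, exist_ok=True)

    # Keep the archive's own file name, e.g. "en-th.txt.zip".
    zip_path = target_dir / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, zip_path)

    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
    return target_dir
```

With the response above, `select_links(response, min_sentences=100000)` would drop wikimedia, Tanzil, and GNOME; each remaining name and URL can then be passed to `download_and_unzip`.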
Alternative - Collect Translation Texts
- Gather data from one of the links above
- Get the English text
- Get the Thai text (the Thai copy of the English text)
- Get the license information
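When the texts are collected by hand, it helps to confirm that the English and Thai files still line up before packaging them. The sketch below assumes one sentence per line in each file, with line N of the Thai file translating line N of the English file. The function name `count_aligned_pairs` is hypothetical.

```python
def count_aligned_pairs(source_path, target_path):
    """Return the number of sentence pairs, or raise if the files differ in length."""
    with open(source_path, encoding="utf-8") as source_file:
        source_count = sum(1 for _ in source_file)
    with open(target_path, encoding="utf-8") as target_file:
        target_count = sum(1 for _ in target_file)

    if source_count != target_count:
        raise ValueError(
            f"source has {source_count} lines but target has {target_count}"
        )
    return source_count
```

The returned count is also the value to use for the sentence count in the package metadata.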
Creating an Argos Data Package
- Folder structure
data-<dataSource>-<codeFrom>_<codeTo>
    metadata.json
    README
    source
    target
metadata.json
{
    "name": "<dataSource>",
    "type": "data",
    "from_code": "<codeFrom>",
    "to_code": "<codeTo>",
    "size": <sentences>,
    "reference": ""
}
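The folder and its `metadata.json` can also be generated with a short script. The following is a minimal sketch that follows the structure described above; the function name `create_package_dir` and its parameters are hypothetical, and writing the license information into `README` is an assumption about what that file should hold.

```python
import json
import shutil
from pathlib import Path


def create_package_dir(data_source, code_from, code_to,
                       source_file, target_file, sentences,
                       readme_text="", out_dir="."):
    """Create data-<dataSource>-<codeFrom>_<codeTo>/ with its four files."""
    package_dir = Path(out_dir) / f"data-{data_source}-{code_from}_{code_to}"
    package_dir.mkdir(parents=True, exist_ok=True)

    # Copy the parallel texts under the fixed names "source" and "target".
    shutil.copyfile(source_file, package_dir / "source")
    shutil.copyfile(target_file, package_dir / "target")

    # Assumption: README carries the license information for the corpus.
    (package_dir / "README").write_text(readme_text, encoding="utf-8")

    metadata = {
        "name": data_source,
        "type": "data",
        "from_code": code_from,
        "to_code": code_to,
        "size": sentences,
        "reference": "",
    }
    (package_dir / "metadata.json").write_text(
        json.dumps(metadata, indent=4), encoding="utf-8"
    )
    return package_dir
```

For the en-th wikimedia corpus this would produce a folder named `data-wikimedia-en_th` with `"size": 26597`.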
- Zip folders and change extension to .argosdata
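The zip-and-rename step can be scripted with the standard library as well. This sketch assumes the archive should contain the package folder itself as its top-level entry, which is what zipping the folder by hand normally produces; the function name `make_argosdata` is hypothetical.

```python
import shutil
from pathlib import Path


def make_argosdata(package_dir):
    """Zip a package folder and rename the archive to <folder>.argosdata."""
    package_dir = Path(package_dir).resolve()

    # Creates <folder>.zip next to the folder, with the folder as the top-level entry.
    zip_path = shutil.make_archive(
        base_name=str(package_dir),
        format="zip",
        root_dir=str(package_dir.parent),
        base_dir=package_dir.name,
    )

    argosdata_path = package_dir.with_name(package_dir.name + ".argosdata")
    Path(zip_path).replace(argosdata_path)
    return argosdata_path
```

Calling `make_argosdata("data-wikimedia-en_th")` would leave `data-wikimedia-en_th.argosdata` beside the original folder.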
Hosting the Package Locally
- Add all .argosdata files to a folder
- Install Python 3
- In the folder run
python3 -m http.server
- Links to add to data-index inside docker will be of the format
http://host.docker.internal:8000/<your-file>.argosdata
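To avoid typing the links by hand, they can be generated from the contents of the served folder. The sketch below assumes the server runs on port 8000, the default for `python3 -m http.server`, and that the container reaches the host as `host.docker.internal`; the function name `argosdata_links` is hypothetical.

```python
from pathlib import Path
from urllib.parse import quote


def argosdata_links(folder, host="host.docker.internal", port=8000):
    """Return one URL per .argosdata file in `folder`, sorted by file name."""
    return [
        f"http://{host}:{port}/{quote(path.name)}"
        for path in sorted(Path(folder).glob("*.argosdata"))
    ]
```

Running `print("\n".join(argosdata_links(".")))` inside the served folder lists the links to add to data-index.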