Proposal to extend the argospackage format

pierotofy · October 7, 2023, 2:21am

Currently all argos models share the sentencepiece tokenizer model between source and target.

Would it be of interest to extend the specification to allow for separate tokenizer models?

This could be simply:

If sentencepiece.model exists, then it’s a shared tokenizer.
Else if sentencepiece_src.model and sentencepiece_tgt.model both exist, then the tokenizers are separate (or something like that).
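
A minimal sketch of that lookup, assuming the file names proposed above. The function name and the returned labels are hypothetical and are not part of argos-translate:

```python
from pathlib import Path


def detect_tokenizer_layout(package_dir):
    """Return "shared" or "separate" depending on which model files exist.

    Hypothetical helper: the file names come from the proposal above,
    everything else is illustrative.
    """
    package_dir = Path(package_dir)

    # A single sentencepiece.model means source and target share a tokenizer.
    if (package_dir / "sentencepiece.model").is_file():
        return "shared"

    # Otherwise both per-side models must be present.
    src = package_dir / "sentencepiece_src.model"
    tgt = package_dir / "sentencepiece_tgt.model"
    if src.is_file() and tgt.is_file():
        return "separate"

    raise FileNotFoundError(f"no SentencePiece model found in {package_dir}")
```

Checking the shared file first keeps existing packages working unchanged.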

This could allow the inclusion of Helsinki NLP’s models into the argos package index: https://github.com/Helsinki-NLP/OPUS-MT-train. I just need to write a conversion script that adds the stanza model, converts the Marian model to ctranslate2 and packages it up. The models there have separate tokenizers though. This way we could add language pairs that are currently missing without needing to train from scratch.

I could open a PR if there’s interest.

pierotofy · October 7, 2023, 5:22pm

Actually, this might not be needed. I noticed that argos-translate does not decode tokens using SentencePiece; it simply uses:

        detokenized = "".join(translated_tokens)
        detokenized = detokenized.replace("▁", " ")
        value = detokenized

So it doesn’t need a target SP model.
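
To illustrate why that works: SentencePiece pieces mark word boundaries with "▁" (U+2581), so joining the pieces and swapping the marker for a space recovers the text. Here is a self-contained sketch of the same idea. The function name is hypothetical, and stripping the leading space is my assumption, not something shown in the snippet above:

```python
def detokenize(translated_tokens):
    """Join SentencePiece-style pieces without loading a target model."""
    detokenized = "".join(translated_tokens)
    # "▁" (U+2581) marks the start of a word in SentencePiece output.
    detokenized = detokenized.replace("▁", " ")
    # Assumption: drop the space produced by the first "▁" marker.
    return detokenized.lstrip(" ")


print(detokenize(["▁Hola", "▁mun", "do", "!"]))  # prints "Hola mundo!"
```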

pierotofy · October 7, 2023, 7:10pm

pierotofy · October 7, 2023, 7:40pm

Not all models on OPUS-MT use sentencepiece however; some use BPE via subword-nmt.

argosopentech · October 8, 2023, 1:34am

I’d be open to this idea if you want to make a pull request, but it sounds like it’s not needed.

pierotofy · October 8, 2023, 3:09am

Yeah, decoding doesn’t look like it’s needed. How about BPE encoding on the source side?

argosopentech · October 8, 2023, 1:25pm

I’m not sure of the best way to do Byte Pair Encoding. Maybe there’s a way to convert subword-nmt models to SentencePiece? Another option would be extending the package format with a new, optional file for BPE, but that would make the code more complex.

argosopentech · October 8, 2023, 1:31pm

I asked ChatGPT-3.5 if this is possible and this is what I get:

Me

I’m maintaining a machine translation system that uses SentencePiece to tokenize the source text. I would like to use a model trained by another research group that uses subword-nmt BPE. Is it possible to convert a BPE model into a SentencePiece model to maintain compatibility with my current system?

ChatGPT

Yes, it is possible to convert a subword-nmt BPE model into a SentencePiece model to maintain compatibility with your current system. However, the process involves several steps and may require some programming and data preprocessing. Here’s a high-level overview of the steps involved:

1. Understand the BPE Model: First, you need to understand the structure and format of the BPE model that you want to convert. BPE models typically consist of a vocabulary file and a merges file. The vocabulary file contains subword units, and the merges file contains the merge operations that were used to create those subword units.
2. Extract Subword Units: Extract the subword units (symbols) from the vocabulary file of the BPE model. These symbols are the basic units of the BPE model.
3. Create a SentencePiece Model: You will need to create a new SentencePiece model. You can use the extracted subword units as the initial vocabulary for your SentencePiece model.
4. Convert Merge Operations: You need to convert the merge operations from the BPE model into SentencePiece’s tokenization rules. This can be a bit tricky because BPE merges are not one-to-one mappings to SentencePiece tokenization. You may need to experiment with different approaches to approximate the tokenization behavior of the BPE model.
5. Train SentencePiece Model: Train the SentencePiece model using your existing training data. You can fine-tune the model to better match the tokenization behavior of the original BPE model. Be sure to use a similar-sized vocabulary if possible.
6. Integration: Replace the original BPE tokenizer in your machine translation system with the newly created SentencePiece tokenizer. You will need to modify the code that handles tokenization to use the SentencePiece model and its rules.
7. Testing and Validation: Test the new setup extensively to ensure that it produces similar tokenization and translation results as the original BPE-based system. You may need to fine-tune the SentencePiece model further to achieve the desired behavior.

Keep in mind that while it’s possible to convert a BPE model to SentencePiece, there may be subtle differences in tokenization behavior between the two methods. These differences could affect translation quality, so thorough testing and validation are crucial. Additionally, some loss of compatibility or performance may occur due to the differences in how BPE and SentencePiece handle subword tokenization.

Before proceeding, it’s a good idea to consult with the research group that trained the original BPE model to ensure you have the necessary permissions and to see if they have any specific recommendations for conversion.

pierotofy · October 9, 2023, 1:04am

I don’t think conversion to SentencePiece is viable; you might get an approximation, not a direct conversion. I think you would also need access to the original training data, but it’s not clear which subset of OPUS data was used for training.

It really doesn’t look too bad to add support for BPE, though:

1. If bpe.model is in the package folder (proposal), the model uses BPE encoding.
2. Add sacremoses (or if you don’t want to add dependencies, I can extract/refactor the two tokenizer/detokenizer classes): https://github.com/hplt-project/sacremoses
3. Add the code from https://github.com/rsennrich/subword-nmt/blob/master/subword_nmt/apply_bpe.py (which is a single module, dependency-free), optionally remove the unnecessary bits.
4. Use mosestokenizer → apply_bpe → tokens → ctranslate2 → tokens → replace "@@ " → mosesdetokenizer (https://github.com/Helsinki-NLP/Opus-MT/blob/master/content_processor.py#L51-L59).
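
The flow in the last step can be sketched with the standard library alone. This is a toy illustration, not the subword-nmt or argos-translate code: whitespace splitting stands in for the Moses tokenizer, an identity function stands in for the ctranslate2 model, the Moses detokenizer is left out, and all function names and the two-entry merge table are hypothetical. Only the "@@" continuation marker and the "</w>" end-of-word marker follow subword-nmt's conventions.

```python
def apply_bpe_word(word, merge_ranks):
    """Segment one word by repeatedly applying the best-ranked merge."""
    # The last character carries the end-of-word marker.
    symbols = list(word[:-1]) + [word[-1] + "</w>"]
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: merge_ranks.get(p, float("inf")))
        if best not in merge_ranks:
            break  # no learned merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    # Drop the end-of-word marker; mark every non-final piece with "@@".
    symbols[-1] = symbols[-1][: -len("</w>")]
    return [s + "@@" for s in symbols[:-1]] + [symbols[-1]]


def encode(text, merge_ranks):
    """Whitespace split (stand-in for the Moses tokenizer), then BPE."""
    tokens = []
    for word in text.split():
        tokens.extend(apply_bpe_word(word, merge_ranks))
    return tokens


def decode(tokens):
    """Undo BPE by removing the "@@ " joints (Moses detokenizer omitted)."""
    return " ".join(tokens).replace("@@ ", "")


def translate(text, merge_ranks, translator):
    """tokenize -> apply BPE -> translator(tokens) -> remove "@@ "."""
    return decode(translator(encode(text, merge_ranks)))


# Hypothetical merge table: lower rank means the merge is applied earlier.
ranks = {("l", "o"): 0, ("lo", "w</w>"): 1}
print(encode("low lower", ranks))  # ['low', 'lo@@', 'w@@', 'e@@', 'r']
print(translate("low lower", ranks, lambda tokens: tokens))  # low lower
```

The real apply_bpe.py also handles things like caching and glossaries, so treat this only as a picture of how the pieces fit together.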

With that, we could port all bilingual models from OPUS-MT. I could open a PR for argos-translate.

I wrote a proof-of-concept and it works:

(en)> When people don't see moose as potentially dangerous, they may approach too closely and put themselves at risk.
(gt)> Si los individuos no perciben que los alces son potencialmente agresivos, podrían acercárseles más de lo adecuado y ponerse en peligro.
(es)> Cuando la gente no ve a Moose como potencialmente peligroso, pueden acercarse demasiado y ponerse en riesgo.

argosopentech · October 10, 2023, 12:45pm

Sounds good, I can merge this. I think it’s pretty valuable to be able to use the OPUS-MT models. I prefer the style of factoring out the necessary code instead of adding a dependency when practical.

pierotofy · October 16, 2023, 4:11pm

PR opened: https://github.com/argosopentech/argos-translate/pull/369

pierotofy · October 19, 2023, 1:03am

Once a new version is released on PyPI, I can update LibreTranslate. Afterwards, I plan to troubleshoot the quantization issue that I found while working on the PR and then start converting models for possible inclusion in the argos-index.

argosopentech · October 19, 2023, 1:45am

Sounds great! I just released Argos Translate 1.9.0 with your BPE changes.

argosopentech · February 17, 2024, 8:02pm

Does anyone know if it’s possible/practical to run SentencePiece models in raw Python?

I want to continue using the SentencePiece C++ library in production Argos Translate but am curious if this is possible.

I’m also considering moving Argos Translate more towards using Byte Pair Encoding for tokenization. I like the simplicity of BPE but I’m guessing SentencePiece has slightly better performance.

pierotofy · February 17, 2024, 8:52pm

sentencepiece has Python bindings, but the core code is C++. I don’t think there’s a pure Python port available.

Has sentencepiece been giving trouble? I find it tends to work pretty well.

argosopentech · February 17, 2024, 9:18pm

No, SentencePiece works great in my experience; I’m just curious about what the other options are.