Currently, all Argos models share a single SentencePiece tokenizer model between source and target.
Would it be of interest to extend the specification to allow separate tokenizer models?
This could be simply:
If sentencepiece.model exists, then it’s a shared tokenizer.
Else, if sentencepiece_src.model and sentencepiece_tgt.model exist, then the tokenizers are separate. (Or something like that.)
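The lookup could be sketched roughly like this (the function name and return convention are mine, not part of any existing spec, just to make the proposal concrete):

```python
import os

def find_tokenizer_files(pkg_dir):
    """Resolve the tokenizer model(s) for a package directory.

    Returns (src_path, tgt_path); both point at the same file when the
    tokenizer is shared. Raises if neither layout is present.
    """
    shared = os.path.join(pkg_dir, "sentencepiece.model")
    if os.path.isfile(shared):
        return shared, shared  # shared tokenizer
    src = os.path.join(pkg_dir, "sentencepiece_src.model")
    tgt = os.path.join(pkg_dir, "sentencepiece_tgt.model")
    if os.path.isfile(src) and os.path.isfile(tgt):
        return src, tgt  # separate source/target tokenizers
    raise FileNotFoundError("No tokenizer model found in " + pkg_dir)
```

Existing shared-tokenizer packages keep working unchanged, since the shared file is checked first.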
This could allow the inclusion of Helsinki-NLP’s models in the Argos package index: https://github.com/Helsinki-NLP/OPUS-MT-train. I just need to write a conversion script that adds the Stanza model, converts the Marian model to CTranslate2, and packages everything up. The models there have separate tokenizers, though. This way we could add language pairs that are currently missing without needing to train from scratch.
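The packaging half of that script is mostly file shuffling. A rough sketch of the final step, where the package layout (metadata.json, a model/ directory, tokenizer files, a stanza/ directory) is my reading of existing packages and every path and metadata field is a placeholder, not a finished converter:

```python
import json
import os
import shutil
import zipfile

def build_argos_package(out_path, ct2_model_dir, sp_src, sp_tgt,
                        stanza_dir, metadata):
    """Assemble a zip package from already-converted pieces.

    Layout assumed here: metadata.json + model/ (CTranslate2) +
    separate tokenizer models + stanza/ data.
    """
    staging = out_path + ".staging"
    os.makedirs(staging, exist_ok=True)
    with open(os.path.join(staging, "metadata.json"), "w") as f:
        json.dump(metadata, f, indent=2)
    shutil.copytree(ct2_model_dir, os.path.join(staging, "model"))
    shutil.copy(sp_src, os.path.join(staging, "sentencepiece_src.model"))
    shutil.copy(sp_tgt, os.path.join(staging, "sentencepiece_tgt.model"))
    shutil.copytree(stanza_dir, os.path.join(staging, "stanza"))
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(staging):
            for name in files:
                full = os.path.join(root, name)
                zf.write(full, os.path.relpath(full, staging))
    shutil.rmtree(staging)
    return out_path
```

The Marian-to-CTranslate2 conversion itself is a separate step and is not shown here.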
I’m not sure of the best way to handle Byte Pair Encoding. Maybe there’s a way to convert subword-nmt models to SentencePiece? Another option would be extending the package format with a new optional file for BPE, but that would make the code more complex.
I asked ChatGPT-3.5 whether this is possible, and this is what I got:
I’m maintaining a machine translation system that uses SentencePiece to tokenize the source text. I would like to use a model trained by another research group that uses subword-nmt BPE. Is it possible to convert a BPE model into a SentencePiece model to maintain compatibility with my current system?
Yes, it is possible to convert a subword-nmt BPE model into a SentencePiece model to maintain compatibility with your current system. However, the process involves several steps and may require some programming and data preprocessing. Here’s a high-level overview of the steps involved:
1. Understand the BPE Model: First, you need to understand the structure and format of the BPE model that you want to convert. BPE models typically consist of a vocabulary file and a merges file. The vocabulary file contains subword units, and the merges file contains the merge operations that were used to create those subword units.
2. Extract Subword Units: Extract the subword units (symbols) from the vocabulary file of the BPE model. These symbols are the basic units of the BPE model.
3. Create a SentencePiece Model: You will need to create a new SentencePiece model. You can use the extracted subword units as the initial vocabulary for your SentencePiece model.
4. Convert Merge Operations: You need to convert the merge operations from the BPE model into SentencePiece’s tokenization rules. This can be a bit tricky because BPE merges are not one-to-one mappings to SentencePiece tokenization. You may need to experiment with different approaches to approximate the tokenization behavior of the BPE model.
5. Train SentencePiece Model: Train the SentencePiece model using your existing training data. You can fine-tune the model to better match the tokenization behavior of the original BPE model. Be sure to use a similar-sized vocabulary if possible.
6. Integration: Replace the original BPE tokenizer in your machine translation system with the newly created SentencePiece tokenizer. You will need to modify the code that handles tokenization to use the SentencePiece model and its rules.
7. Testing and Validation: Test the new setup extensively to ensure that it produces similar tokenization and translation results as the original BPE-based system. You may need to fine-tune the SentencePiece model further to achieve the desired behavior.
Keep in mind that while it’s possible to convert a BPE model to SentencePiece, there may be subtle differences in tokenization behavior between the two methods. These differences could affect translation quality, so thorough testing and validation are crucial. Additionally, some loss of compatibility or performance may occur due to the differences in how BPE and SentencePiece handle subword tokenization.
Before proceeding, it’s a good idea to consult with the research group that trained the original BPE model to ensure you have the necessary permissions and to see if they have any specific recommendations for conversion.
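For what it’s worth, the “Extract Subword Units” part of that answer is only a few lines in practice: a subword-nmt merges file is just a ranked list of symbol pairs. A minimal loader, assuming the format written by subword-nmt’s learn_bpe (the function name is mine):

```python
def load_bpe_merges(path):
    """Read a subword-nmt merges file into a dict mapping each
    symbol pair to its merge priority (lower rank = merged earlier)."""
    merges = {}
    with open(path, encoding="utf-8") as f:
        for rank, line in enumerate(f):
            if line.startswith("#version"):  # header written by learn_bpe
                continue
            a, b = line.split()
            merges[(a, b)] = rank
    return merges
```

Only the relative order of the ranks matters, so skipping the header line without renumbering is fine.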
I don’t think conversion to SentencePiece is viable; you might get an approximation, not a direct conversion. I think you would also need access to the original training data, but it’s not clear which subset of OPUS data was used for training.
It really doesn’t look too bad to add support for BPE directly, though:
If bpe.model is present in the package folder (this is the proposal), the model uses BPE encoding.
With that, we could port all bilingual models from OPUS-MT. I could open a PR for argos-translate.
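Applying BPE merges also wouldn’t have to pull in subword-nmt as a dependency. Here is a minimal sketch of greedy merge application; it follows the original BPE paper’s end-of-word handling rather than subword-nmt exactly, and the function name and the "@@" rendering are illustrative assumptions:

```python
def bpe_encode_word(word, merges):
    """Greedily apply BPE merges to one word: repeatedly merge the
    adjacent symbol pair with the best (lowest) rank in `merges`.
    Continuation pieces get a trailing '@@', subword-nmt style."""
    symbols = list(word) + ["</w>"]  # explicit end-of-word marker
    while len(symbols) > 1:
        # Rank every adjacent pair; unknown pairs get infinite rank.
        pairs = [(merges.get((symbols[i], symbols[i + 1]), float("inf")), i)
                 for i in range(len(symbols) - 1)]
        best_rank, i = min(pairs)
        if best_rank == float("inf"):
            break  # no applicable merges remain
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    # Drop the marker and render with '@@' joiners on non-final pieces.
    if symbols[-1] == "</w>":
        symbols = symbols[:-1]
    elif symbols[-1].endswith("</w>"):
        symbols[-1] = symbols[-1][:-len("</w>")]
    return [s + "@@" for s in symbols[:-1]] + [symbols[-1]]
```

With merges {("h", "e"): 0, ("he", "l"): 1, ("l", "o"): 2}, the word "hello" comes out as ["hel@@", "lo"], matching the space-joined "hel@@ lo" form subword-nmt would emit.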
I wrote a proof of concept and it works:
(en)> When people don't see moose as potentially dangerous, they may approach too closely and put themselves at risk.
(gt)> Si los individuos no perciben que los alces son potencialmente agresivos, podrían acercárseles más de lo adecuado y ponerse en peligro.
(es)> Cuando la gente no ve a Moose como potencialmente peligroso, pueden acercarse demasiado y ponerse en riesgo.
Once a new version is released on PyPI, I can update LibreTranslate. Afterwards, I plan to troubleshoot the quantization issue that I found while working on the PR and then start converting models for possible inclusion in the argos-index.