Is it possible to use a glossary? I currently need the API so that I can run my dictionary through it as a glossary. However, the problem I have right now is that those words still get translated, and completely incorrectly.
Currently not. It’s something that we should integrate at some point because it has come up before in conversations.
I started brainstorming some time ago what an API for glossaries would look like, but haven't settled on a design. One thing that I want to avoid is creating permanent glossaries, e.g. offering endpoints for creating, editing, and deleting glossaries and referencing them during a translation request. I would prefer that the LT API remain as stateless as possible, but I'm not sure what the right balance is (sending the entire glossary with each translation request might be too heavy).
Speaking of which, is there some sort of de-facto standard format for storing/transferring glossaries and how large are they usually?
Creating a system to manage glossaries will indeed add a fair amount of complexity.
I would have preferred the glossary to be sent with each request, but depending on the size and number of requests, it's probably unnecessary overhead; on the other hand, if the glossary is stored server-side and requests are made from multiple servers with the same API key, the user would have to duplicate their dictionary.
Both have advantages and disadvantages.
I’m leaning toward sending the glossary with each request, but with configurable limits (just like character limits) on how large the glossary can be.
Does argos-translate have an internal system for replacing words or not translating them?
If it doesn't, I think it should be done with the HTML attribute translate="no" from LibreTranslate.
This is a difficult thing to do. If we have a brand name we don't want to translate in the middle of a sentence, we may just have to translate the sentence in three parts. For example: ("George and I are planning to go shopping at ", "Acme Corp", " next Thursday after work.")
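The chunking idea above could be sketched roughly like this (a hypothetical helper, not part of the LibreTranslate or argos-translate codebase), splitting the text into segments that either should or should not be sent to the translator:

```python
def split_around_terms(text, protected):
    """Split text into (segment, should_translate) pairs so that
    protected terms (e.g. brand names) can be passed through untouched
    while the surrounding segments are translated normally."""
    segments = [(text, True)]
    for term in protected:
        out = []
        for seg, translate in segments:
            if not translate or term not in seg:
                out.append((seg, translate))
                continue
            parts = seg.split(term)
            for i, part in enumerate(parts):
                if part:
                    out.append((part, True))
                # Re-insert the protected term between the split parts.
                if i < len(parts) - 1:
                    out.append((term, False))
        segments = out
    return segments
```

Each translatable segment would then be sent through the model separately and the pieces re-joined, at the cost of losing cross-segment context.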
CTranslate2 could possibly have some functionality to tell the Transformer decoder to not translate certain tokens but I’m not aware of it.
I did a bit of research some time ago, and one suggested strategy was to replace the glossary words with numbered tokens such as `__1__`, `__2__`, etc., perform the translation (which typically preserves the tokens, though some checks are necessary), then substitute the tokens back.
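The mask-and-restore strategy could look roughly like this (a sketch with hypothetical helper names; a real implementation would need to handle case, overlapping terms, and tokenizer interactions):

```python
import re

def apply_glossary(text, glossary):
    """Replace glossary source terms with numbered placeholder tokens.

    Returns the masked text plus a token -> target-term mapping to be
    applied after translation.
    """
    mapping = {}
    for i, (src, dst) in enumerate(glossary.items(), start=1):
        token = f"__{i}__"
        # Whole-word, case-sensitive match; real code needs more care.
        text, count = re.subn(r"\b" + re.escape(src) + r"\b", token, text)
        if count:
            mapping[token] = dst
    return text, mapping

def restore_glossary(translated, mapping):
    """Swap the placeholder tokens back for the glossary target terms."""
    for token, dst in mapping.items():
        if token not in translated:
            # The model dropped or mangled a token; caller should fall
            # back to another strategy (e.g. chunked translation).
            raise ValueError(f"token {token} lost during translation")
        translated = translated.replace(token, dst)
    return translated
```

Note that this is exactly the mechanical find-and-replace that causes the gender/number problem described below: the substituted target term is inserted verbatim, with no morphological adaptation.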
There's a bit of an issue with languages that have plurals and feminine/masculine differences. For example, in Italian, with the sentence "the owner is Mary", if there's a glossary mapping "owner" => "proprietario", the result will be incorrect because Mary is feminine and the translation should thus be "proprietaria".
DeepL claims that "In comparison to a find-and-replace tool, our glossary can perform morphological adaptations that account for case, gender, and tense", but I'm unsure of how they do this.
I’ll answer myself here by saying that .tsv seems a pretty common format.
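A two-column TSV (source term, tab, target term, one entry per line) is trivial to parse; a minimal sketch, assuming that simple layout:

```python
import csv
import io

def load_glossary_tsv(tsv_text):
    """Parse a two-column TSV (source<TAB>target) into a dict,
    skipping malformed rows."""
    glossary = {}
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if len(row) >= 2:
            glossary[row[0].strip()] = row[1].strip()
    return glossary
```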
If it actually preserves the tokens, that's the best approach.
Hopefully there’s a way to do this reliably without having to retrain all of the models.
We could try getting multiple translation results from the CTranslate2 decoder and then checking them to make sure they reproduce the tokens correctly. If they don't, we can fall back to just breaking the text into chunks.
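That check-and-fall-back logic could be sketched as follows. The `candidates` list stands in for the decoder's n-best hypotheses (e.g. from CTranslate2's `num_hypotheses` option), and `fallback` is any zero-argument callable implementing the chunked strategy; both names are illustrative, not an existing API:

```python
def pick_translation(candidates, tokens, fallback):
    """Return the first candidate translation that kept every
    placeholder token intact; otherwise invoke the fallback
    (e.g. chunked translation)."""
    for cand in candidates:
        if all(tok in cand for tok in tokens):
            return cand
    return fallback()
```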