In my scenario, I was trying to translate a polite letter from English to Spanish, and I noticed that “you” was translated as “tú” instead of “usted”, and the verbs were also conjugated in the informal second person, which is not polite.
This is a good question. English doesn’t have a polite form (we’re straightforward and rude by default haha). So if you’re translating from English to a language with a formal/informal distinction, there currently isn’t a good way to tell Argos Translate which one you want.
I’m open to suggestions on how to handle this well. I’ve generally defaulted to a Unicode->Unicode architecture for Argos Translate where the neural network handles any complexity with specific languages. However, for this issue maybe it could be useful to pass some sort of metadata to the translation model about the type of translation the user wants (formal/informal etc.).
This could be a separate model (or a fine-tuned model?).
From a language model index perspective, something that is perhaps still missing is the ability to have different “variants” of the language models, whether it’s a formal/informal distinction or a regional variety (e.g. British or American English).
Yeah this is a hard problem. Different formal/informal language model variants for Spanish, French, etc. would probably be excessive and increase the bandwidth required to download all of the language models. With multilingual translation maybe it would be more efficient to have separate language codes for different flavors of the same language?
Plus I don’t know how we would find data for formal/informal; OPUS doesn’t make this distinction. I normally default to the ISO 639 language codes but I don’t think they have a formal/informal distinction.
We could pass metadata to the translation model to give it information about formal/informal, but that would add the overhead of requiring more infrastructure. The current system should try to infer whether you want formal/informal based on the context and how similar sentences were translated in the training data.
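To make the metadata idea a bit more concrete, here is a minimal sketch of one common approach: prepending a control token to the source text before it reaches the model. This is not how Argos Translate works today. The token strings and the `tag_source` helper are made up for illustration, and a model would only respond to such tokens if it had been trained on data tagged the same way.

```python
from typing import Optional

# Hypothetical control tokens (an assumption, not part of Argos Translate).
# They only have an effect if the model was trained on source sentences
# that were tagged with the same tokens.
FORMALITY_TOKENS = {"formal": "<formal>", "informal": "<informal>"}


def tag_source(text: str, formality: Optional[str] = None) -> str:
    """Prepend a formality control token to the source text, if requested."""
    if formality is None:
        # No preference given: leave the text alone and let the model
        # infer the register from context, as it does now.
        return text
    if formality not in FORMALITY_TOKENS:
        raise ValueError(f"unknown formality: {formality!r}")
    return f"{FORMALITY_TOKENS[formality]} {text}"


print(tag_source("How are you?", "formal"))
```

The infrastructure cost mentioned above shows up on the training side: the training data would need formality labels so the tags could be attached consistently.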
This problem could also be solved with few-shot translation models like ChatGPT that understand more of the context for the user’s translation, but that’s not currently how Argos Translate works.
I think there are standard language codes for regional dialects of a language. For example, British English is en-GB. However, there generally aren’t codes for specific situations in a language like formal/informal.
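As a rough illustration of how codes like these break down, here is a simplified sketch that splits a tag into its language and region parts. It also handles a private-use part introduced by `x`, which BCP 47 allows for non-standard extensions. The tag `es-x-formal` below is an invented example, not a registered code and not something Argos Translate supports, and `split_language_tag` is a hypothetical helper rather than a full BCP 47 parser.

```python
def split_language_tag(tag: str) -> dict:
    """Split a language tag into language, region, and private-use parts.

    Simplified on purpose: script, variant, and extension subtags are ignored.
    """
    parts = tag.replace("_", "-").split("-")
    result = {"language": parts[0].lower(), "region": None, "private": None}

    rest = parts[1:]
    lowered = [p.lower() for p in rest]
    if "x" in lowered:
        # Everything after the "x" singleton is private use.
        i = lowered.index("x")
        result["private"] = "-".join(lowered[i + 1:]) or None
        rest = rest[:i]

    for p in rest:
        # Region subtags are two letters (GB) or three digits (419).
        if (len(p) == 2 and p.isalpha()) or (len(p) == 3 and p.isdigit()):
            result["region"] = p.upper()
    return result


print(split_language_tag("en-GB"))
print(split_language_tag("es-x-formal"))
```

A private-use tag like this would only mean something inside a project that agrees on it, so it doesn't solve the problem of finding formal/informal training data.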