How do I make it output unbroken UTF-8?

Mcadow · April 13, 2023, 2:25pm

I don’t even know how to explain this problem. I’ve spent hours now trying to either return or first save the result to a file and then load that file, but no matter what I do, it contains weird errors.

There is some kind of encoding issue which I cannot fix with the functions I normally use to repair broken input data. It makes no sense. The text output is “special” in some way which I’ve never encountered before – and I’ve dealt with many weird CLI programs!

There is no --encoding parameter, and as soon as I try to do any string operation on the text from argostranslate, it fails or messes up the text in various ways which clearly have to do with the encoding or unexpected data of some sort.

Is it really ANSI? UTF-8? Varies? Does it inject weird invisible characters which my functions are unable to remove, and this is causing the issues? If so, why?

I’ve only tested Latin-based languages so far, so we’re not talking about Asian symbols and all that.

argosopentech · April 13, 2023, 10:22pm

That’s strange. I think Argos Translate should be encoding things as a Python str type which is Unicode/UTF-8.

Maybe try putting the text through a tool that shows non-printable characters:
https://freetools.textmagic.com/unicode-detector

Mcadow · April 14, 2023, 3:57pm

After a lot of head-scratching, experimentation and help-asking, I found an unsatisfying workaround by doing this in my PHP script:

$output_from_Argos_Translate = iconv('Windows-1252', 'UTF-8', $output_from_Argos_Translate);

As you can see, I’m converting it from ‘Windows-1252’ to ‘UTF-8’, because for some reason, Argos Translate appears to output as ‘Windows-1252’ for me instead of the expected UTF-8. This is in spite of me running this in a cmd.exe which has my standard chcp 65001 > NUL line in the beginning, which instructs cmd.exe to use UTF-8 as the “code page” (archaic term).
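For comparison, here is a minimal Python sketch of what that iconv call does, and of why UTF-8 string functions choke on the raw output. The helper names (is_valid_utf8, cp1252_bytes_to_utf8) and the sample word are made up for illustration; nothing here comes from Argos Translate itself.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def cp1252_bytes_to_utf8(raw: bytes) -> bytes:
    """Same idea as iconv('Windows-1252', 'UTF-8', ...): reinterpret, then re-encode."""
    return raw.decode("cp1252").encode("utf-8")


# Illustrative sample: the bytes a cp1252-encoding program would emit for "señal".
raw = "señal".encode("cp1252")

print(is_valid_utf8(raw))                        # False: a lone 0xF1 byte is not valid UTF-8
print(is_valid_utf8(cp1252_bytes_to_utf8(raw)))  # True after the conversion
```

This only helps while every character exists in Windows-1252, which is the case for the Latin-based languages tested so far.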

‘Windows-1252’ is not something which I have actively picked, but is set in some sense by Windows for my language, so maybe/apparently, Python or Argos Translate gets confused by this and thinks I wanted that charset instead of UTF-8?
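One way to check that suspicion is to ask Python directly which encodings it has picked. The sketch below is a hypothetical diagnostic script (describe_output_encoding is an invented name), to be run once normally and once with the output redirected to a file. On Windows, Python generally uses the system's ANSI code page for redirected output rather than the console code page set by chcp, so the two runs may well report different values.

```python
import locale
import sys


def describe_output_encoding() -> dict:
    """Collect the encodings Python has chosen, to help spot a mismatch."""
    return {
        # Encoding that print() will use for standard output.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding Python derives from the OS locale settings.
        "preferred": locale.getpreferredencoding(False),
        # False when the output is redirected to a file or a pipe.
        "stdout_is_tty": sys.stdout.isatty(),
    }


if __name__ == "__main__":
    for name, value in describe_output_encoding().items():
        # Write to stderr so the report stays visible when stdout is redirected.
        print(f"{name}: {value}", file=sys.stderr)
```

If the redirected run reports cp1252 for stdout, that would match the behaviour described here.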

So in other words: I have it working on my specific machine, but I would rather have the issue resolved, assuming it’s something within your control, maybe by adding an optional “charset” parameter or something like that?

argosopentech · April 15, 2023, 1:04am

Very strange. I’m guessing this is something with Python interfacing with Windows but I have no idea what.

Mcadow · April 15, 2023, 12:30pm

Sadly, my workaround falls apart whenever I try to translate to Japanese and stream the output (including when PHP CLI grabs it from the stdout):

	argos-translate.py --from-lang en --to-lang ja "Sailor Moon rules!" > out.txt
	Traceback (most recent call last):
	  File "C:\Users\John Doe\AppData\Local\Programs\Python\Python310\Scripts\argos-translate.py", line 5, in <module>
	    cli.main()
	  File "C:\Users\John Doe\AppData\Local\Programs\Python\Python310\lib\site-packages\argostranslate\cli.py", line 63, in main
	    print(translation.translate(text_to_translate))
	  File "C:\Users\John Doe\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py", line 19, in encode
	    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
	UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-9: character maps to <undefined>

However, this command works (makes the correct output in the file and doesn’t show any errors):

argos-translate.py --from-lang en --to-lang es "Sailor Moon rules!" > out.txt

Does this help in finding the issue? The cp1252.py part of the error dump seems extremely interesting, no?
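The cp1252.py frame suggests that print() is encoding the translated text as Windows-1252 before writing it to the redirected stdout, which would explain why Spanish works and Japanese does not. A minimal sketch of that behaviour follows, with the sample strings and both helper names (can_encode, force_utf8_stdout) invented for illustration. force_utf8_stdout is not part of Argos Translate; it only shows the kind of call a script could make before printing, assuming Python 3.7 or later.

```python
import sys


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def force_utf8_stdout() -> None:
    """Hypothetical helper: switch stdout to UTF-8 regardless of the locale."""
    # reconfigure() exists on text streams since Python 3.7; skip it otherwise.
    if hasattr(sys.stdout, "reconfigure"):
        sys.stdout.reconfigure(encoding="utf-8")


spanish = "¡Sailor Moon manda!"   # illustrative text, not real translator output
japanese = "セーラームーン最高！"    # illustrative text, not real translator output

print(can_encode(spanish, "cp1252"))   # True: every character is in Windows-1252
print(can_encode(japanese, "cp1252"))  # False: same failure as in the traceback
print(can_encode(japanese, "utf-8"))   # True: UTF-8 covers all of Unicode
```

Without editing any code, setting the standard Python environment variable PYTHONUTF8=1 (or PYTHONIOENCODING=utf-8) in cmd.exe before running the command may have the same effect, in which case the iconv conversion should no longer be needed.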

Mcadow · April 18, 2023, 3:29pm

I hope that this can be fixed, because I’m now stuck translating only from/to Latin-based languages, leaving me unable to translate to/from Japanese and Russian among others…