Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

internal/syscall/windows: GetACP returns wrong codepage #16857

Closed
tkausl opened this issue Aug 24, 2016 · 23 comments
Closed

internal/syscall/windows: GetACP returns wrong codepage #16857

tkausl opened this issue Aug 24, 2016 · 23 comments
Milestone

Comments

@tkausl
Copy link

tkausl commented Aug 24, 2016

  1. What version of Go are you using (go version)?
    1.7

  2. What operating system and processor architecture are you using (go env)?
    Windows, arm64
    go env:

    set GOARCH=amd64
    set GOBIN=
    set GOEXE=.exe
    set GOHOSTARCH=amd64
    set GOHOSTOS=windows
    set GOOS=windows
    set GOPATH=C:\dev\golang
    set GORACE=
    set GOROOT=C:\Go
    set GOTOOLDIR=C:\Go\pkg\tool\windows_amd64
    set CC=gcc
    set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0 -fdebug-prefix-map=C:\Users\Tobias\AppData\Local\Temp\go-build473402786=/tmp/go-build -gno-record-gcc-switches
    set CXX=g++
    set CGO_ENABLED=1
    
  3. What did you do?
    I wrote a program which just reads bytes from os.Stdin and prints their values as Hex to Stdout. I started the program, typed a ä to the console and pressed enter.

  4. What did you expect to see?
    I expected to see the correct UTF-8 sequence for a ä which is C3 A4

  5. What did you see instead?
    I see the Hex-sequence E2 80 9E

I found that internal/syscall/windows GetACP() returns 1252 even though I can verify that the ä is encoded in CP850. Because of this wrongly returned codepage, Stdin.readConsole tries to decode the character from 1252 to UTF-8 instead of from 850 to UTF-8. As you see in my Post on Stackoverflow, when I read from stdin through CGO I get the byte 0x84 which is the value for a ä in CP850, it should've been 0xE4 if it were in 1252. The value 0x84 decoded from golang.org/x/text/encoding/charmap.CodePage850 decodes to the correct UTF-8 character.

chcp tells me that the active codepage is 850.

@mattn
Copy link
Member

mattn commented Aug 24, 2016

When reading from console, go decode the bytes from stream to utf-8.
You can see that your code should work with below.

cat utf-8.txt | go run test.go

@tkausl
Copy link
Author

tkausl commented Aug 24, 2016

cat utf-8.txt | go run test.go prints the correct byte-sequence C3 A4. I understand that Go decodes the bytes to UTF-8 but it wrongly assumes the input is in CP1252 encoded (GetACP() returns 1252) when the input really is CP850 encoded.

@mattn
Copy link
Member

mattn commented Aug 24, 2016

What value return from GetConsoleCP() on your environment?

@tkausl
Copy link
Author

tkausl commented Aug 24, 2016

GetConsoleCP() returns 850 while GetACP() returns 1252

@alexbrainman
Copy link
Member

@mattn I will let you try and fix this, since I have nothing but English around me.

Alex

@gopherbot
Copy link

CL https://golang.org/cl/27575 mentions this issue.

@mattn
Copy link
Member

mattn commented Aug 24, 2016

@tkausl you always change console codepage with chcp 850 ? I don't know the windows settings to be possible to change this.

@tkausl
Copy link
Author

tkausl commented Aug 24, 2016

No I don't. I've googled a bit, there is not much Information about default charsets for Windows but according to this site, Windows uses charset 1252 as default for west-european countries but it uses usually 850 for the console for west-european countries. Not sure why, but these are the defaults it seems.

@alexbrainman
Copy link
Member

@tkausl can you, please, try this change https://golang.org/cl/27575 to see if it fixes your problem? (let use know if you need more instructions) Thank you.

Alex

@tkausl
Copy link
Author

tkausl commented Aug 24, 2016

This does change the input I get but it's still not the correct one. With this change I get the sequence 61 CC 88 which is according to the UTF-8 tool the small latin letter a followed by a "COMBINING DIAERESIS" ̈, it should've been the single character C3 A4. I don't know if - according to UTF-8 rules - the single character containing the diaeresis and the character followed by a combining diaeresis are equivalent, but it seems in this case the output doesn't like it as my console prints it as

@tkausl
Copy link
Author

tkausl commented Aug 24, 2016

Seems like they are equivalent, so this change gets us one step further at least.

@tkausl
Copy link
Author

tkausl commented Aug 24, 2016

Oh, you actually choose to convert them to decomposed characters (MB_COMPOSITE), not sure what the advantages are but it would be great if Stdout could actually re-encode them correctly.

@alexbrainman
Copy link
Member

@tkausl I know nothing about unicode, but please try https://go-review.googlesource.com/#/c/27576/ to see if that fixes your problem. Thank you.

Alex

@mattn
Copy link
Member

mattn commented Aug 24, 2016

@tkausl if changing MB_COMPOSITE to MB_PRECOMPOSED, it works?

@mattn
Copy link
Member

mattn commented Aug 24, 2016

please note merging CL27576 will reopen #6303

@tkausl
Copy link
Author

tkausl commented Aug 24, 2016

@alexbrainman Yes, with this change it works.
@mattn With MB_PRECOMPOSED it works, I'm worried if this breaks other languages though.

@mattn
Copy link
Member

mattn commented Aug 24, 2016

@tkausl thanks your confirming. MB_PRECOMPOSED will fix the issue. I'm working on vim-dev and vim.exe is implemented using MB_PRECOMPOSED not MB_COMPOSITE. I don't hear any issue about using MB_PRECOMPOSED. Thanks.

@alexbrainman
Copy link
Member

please note merging CL27576 will reopen #6303

I understand. But I think correct input character decoding is more important than ctrl+Z handling.

@alexbrainman Yes, with this change it works.

Thanks for checking.

I will let my https://go-review.googlesource.com/27576 as alternative to @mattn change. Given that I was against submitting of CL 4310 in the first place (and I didn't changed my mind), I don't see how I can make correct call here. So leaving for someone else to decide which CL to go with. Plus I know very little about Unicode.

Alex

@gopherbot
Copy link

CL https://golang.org/cl/27576 mentions this issue.

@mattn
Copy link
Member

mattn commented Aug 24, 2016

@alexbrainman The replacing MF_COMPOSED to MF_PRECOMPOSED and using GetConsoleCP() will fix this issue. I don't understand why you want to revert. If you know little about unicode, let's add another one for reviewing. :)

@alexbrainman
Copy link
Member

If you know little about unicode, let's add another one for reviewing.

That is what I suggested myself (#16857 (comment)).

Alex

@mattn
Copy link
Member

mattn commented Aug 24, 2016

@hirochachacha could you please take a look?

@hirochachacha
Copy link
Contributor

@mattn Sure, I'll try. but don't expect too much.

@quentinmit quentinmit changed the title internal/syscall/windows GetACP returns wrong codepage internal/syscall/windows: GetACP returns wrong codepage Sep 6, 2016
@quentinmit quentinmit added this to the Go1.8 milestone Sep 6, 2016
@golang golang locked and limited conversation to collaborators Sep 21, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

6 participants