Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os: Windows readConsole fails on short read of multi-byte rune #17097

Closed
mattn opened this issue Sep 14, 2016 · 17 comments
Closed

os: Windows readConsole fails on short read of multi-byte rune #17097

mattn opened this issue Sep 14, 2016 · 17 comments
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. OS-Windows
Milestone

Comments

@mattn
Copy link
Member

mattn commented Sep 14, 2016

What version of Go are you using (go version)?

go version devel +6fd8c00 Wed Sep 14 08:42:28 2016 +0000 windows/amd64

What operating system and processor architecture are you using (go env)?

windows/amd64

Windows7 64bit

What did you do?

package main

import (
    "fmt"
)

func main() {
    var foo string
    fmt.Scan(&foo)
    fmt.Println(foo)
}

And type multi-byte string for example 世界

What did you expect to see?

print 世界

What did you see instead?

print empty

@davecheney
Copy link
Contributor

Please always check your errors before reporting a bug. fmt.Scan returns an error value, it will tell you what went wrong.

@mattn
Copy link
Member Author

mattn commented Sep 14, 2016

yes, I will explaining in next comment. please wait.

@mattn
Copy link
Member Author

mattn commented Sep 14, 2016

fmt.Scan return EOF with multi-byte string because fmt.Scan call io.ReadFull() with sized 1.

go/src/fmt/scan.go

Lines 321 to 333 in 510fb63

func (r *readRune) readByte() (b byte, err error) {
if r.pending > 0 {
b = r.pendBuf[0]
copy(r.pendBuf[0:], r.pendBuf[1:])
r.pending--
return
}
n, err := io.ReadFull(r.reader, r.pendBuf[:1])
if n != 1 {
return 0, err
}
return r.pendBuf[0], err
}

os.Stdin.Read i.e. readConsole desn't decode the 1 byte to utf-8. So f.readbuf always nil.

go/src/os/file_windows.go

Lines 298 to 306 in 510fb63

for i, r := range f.readbuf {
if utf8.RuneLen(r) > len(b) {
f.readbuf = f.readbuf[i:]
return n, nil
}
nr := utf8.EncodeRune(b, r)
b = b[nr:]
n += nr
}

@davecheney
Copy link
Contributor

I think what you are seeing is your input is not utf8. A string only
accepts utf8, so if you're using big-5 or some other encoding you'll have
to wrap that input in a translator.

On Wed, 14 Sep 2016, 21:33 mattn notifications@github.com wrote:

fmt.Scan return EOF with multi-byte string because fmt.Scan call
io.ReadFull() with sized 1.

go/src/fmt/scan.go

Lines 321 to 333 in 510fb63

func (r *readRune) readByte() (b byte, err error) {
if r.pending > 0 {
b = r.pendBuf[0]
copy(r.pendBuf[0:], r.pendBuf[1:])
r.pending--
return
}
n, err := io.ReadFull(r.reader, r.pendBuf[:1])
if n != 1 {
return 0, err
}
return r.pendBuf[0], err
}

os.Stdin.Read i.e. readConsole desn't decode the 1 byte to utf-8. So
f.readbuf always nil.

go/src/os/file_windows.go

Lines 298 to 306 in 510fb63

for i, r := range f.readbuf {
if utf8.RuneLen(r) > len(b) {
f.readbuf = f.readbuf[i:]
return n, nil
}
nr := utf8.EncodeRune(b, r)
b = b[nr:]
n += nr
}


You are receiving this because you commented.

Reply to this email directly, view it on GitHub
#17097 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AAAcA1IZ0xA8EFvPlhsyptThTuXKp96Lks5qp9tsgaJpZM4J8qnv
.

@mattn
Copy link
Member Author

mattn commented Sep 14, 2016

On windows, the input bytes are encoded as Double Byte Character Set. For example, in japanese, it is cp932. So readConsole read bytes and convert it to utf-8. However, the buffer size is not enough to decode bytes. I guess, to fix this problem, os.File need to have buffer that can store multi-byte letter enough.

@mattn
Copy link
Member Author

mattn commented Sep 14, 2016

we can change code page of cmd.exe like chcp 65001. So the buffer size should be max bytes of utf-8 rune.

@mattn
Copy link
Member Author

mattn commented Sep 14, 2016

FYI, bufio.Buffer have bytes enough to read utf-8 bytes, so this can work correctly.

package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    var foo string
    fmt.Fscan(bufio.NewReader(os.Stdin), &foo)
    fmt.Println(foo)
}

But not same as non-Windows OSs.

@ianlancetaylor
Copy link
Contributor

At first glance I think that readConsole is doing the wrong thing. It buffers runes but it is supposed to be returning a UTF-8 sequence of bytes, so in the case utf8.RuneLen(r) > len(b) when n == 0 && len(b) > 0 it needs to break up r into a byte sequence, return the bytes that fit into b, and buffer the remaining bytes for the next call.

@ianlancetaylor ianlancetaylor changed the title fmt.Scan doesn't handle multi-byte string os: Windows readConsole fails on short read of multi-byte rune Sep 14, 2016
@ianlancetaylor ianlancetaylor added this to the Go1.8Maybe milestone Sep 14, 2016
@ianlancetaylor
Copy link
Contributor

CC @alexbrainman

@alexbrainman
Copy link
Member

so in the case utf8.RuneLen(r) > len(b) when n == 0 && len(b) > 0 it needs to break up r into a byte sequence, return the bytes that fit into b, and buffer the remaining bytes for the next call.

Are you saying that it is wrong for readConsole(b) to return 0, nil when len(b) > 0? But why?

Alex

@mattn
Copy link
Member Author

mattn commented Sep 15, 2016

var b [1]byte
readConsole(b[:])
fmt.Println("b1=%X", b[0])
readConsole(b[:])
fmt.Println("b2=%X", b[0])
readConsole(b[:])
fmt.Println("b3=%X", b[0])

Typing on cp932. (0x82 0xA0 in cp932, 0xE3 0x81 0x82 in utf-8)

I think right behavior is:

b1=E3
b2=81
b3=82

@ianlancetaylor
Copy link
Contributor

@alexbrainman It's not entirely wrong for readConsole to return 0 bytes for a 1 byte read. But it's confusing and unexpected. Go programs expect that when they try to read 1 byte, they will either get 1 byte or they will get 0 bytes and io.EOF. They don't expect to have to retry the read with a larger buffer.

As the documentation for io.Reader says: "Implementations of Read are discouraged from returning a zero byte count with a nil error, except when len(p) == 0."

@alexbrainman
Copy link
Member

@ianlancetaylor thank you for explaining. I think I see the problem now. I will try to build new test using @mattn example.

Thank you both again.

Alex

@gopherbot
Copy link

CL https://golang.org/cl/29493 mentions this issue.

@quentinmit quentinmit added the NeedsFix The path to resolution is known, but the work has not been done. label Oct 10, 2016
@mattn
Copy link
Member Author

mattn commented Oct 13, 2016

@alexbrainman thanks for fixing this. While trying this fix, I notice ^Z doesn't work. Maybe, this part is better to exit the loop.

diff --git a/src/os/file_windows.go b/src/os/file_windows.go
index ed06b55..aa5b218 100644
--- a/src/os/file_windows.go
+++ b/src/os/file_windows.go
@@ -233,6 +233,9 @@ func (f *File) readOneUTF16FromConsole() (uint16, error) {
            return 0, err
        }
        if nmb == 0 {
+           if len(mbytes) == 0 {
+               return 0, io.EOF
+           }
            continue
        }
        mbytes = append(mbytes, buf[0])

Or

diff --git a/src/os/file_windows.go b/src/os/file_windows.go
index ed06b55..e79fc5b 100644
--- a/src/os/file_windows.go
+++ b/src/os/file_windows.go
@@ -233,7 +233,7 @@ func (f *File) readOneUTF16FromConsole() (uint16, error) {
            return 0, err
        }
        if nmb == 0 {
-           continue
+           return 0, io.EOF
        }
        mbytes = append(mbytes, buf[0])

How do you think?

@bradfitz
Copy link
Contributor

@mattn, this is a closed bug, so you might want to file a new one to be sure it won't get lost.

@gopherbot
Copy link

CL https://golang.org/cl/33451 mentions this issue.

gopherbot pushed a commit that referenced this issue Nov 29, 2016
Go 1.5 worked with Unicode console input but not ^Z.
Go 1.6 did not work with Unicode console input but did handle one ^Z case.
Go 1.7 did not work with Unicode console input but did handle one ^Z case.

The intent of this CL is for Go 1.8 to work with Unicode console input
and also handle all ^Z cases.

Here's a simple test program for reading from the console.
It prints a "> " prompt, calls read, prints what it gets, and repeats.

	package main

	import (
	    "fmt"
	    "os"
	)

	func main() {
	    p := make([]byte, 100)
	    fmt.Printf("> ")
	    for {
	        n, err := os.Stdin.Read(p)
	        fmt.Printf("[%d %q %v]\n> ", n, p[:n], err)
	    }
	}

On Unix, typing a ^D produces a break in the input stream.
If the ^D is at the beginning of a line, then the 0 bytes returned
appear as an io.EOF:

	$ go run /tmp/x.go
	> hello
	[6 "hello\n" <nil>]
	> hello^D[5 "hello" <nil>]
	> ^D[0 "" EOF]
	> ^D[0 "" EOF]
	> hello^Dworld
	[5 "hello" <nil>]
	> [6 "world\n" <nil>]
	>

On Windows, the EOF character is ^Z, not ^D, and there has
been a long-standing problem that in Go programs, ^Z on Windows
does not behave in the expected way, namely like ^D on Unix.
Instead, the ^Z come through as literal ^Z characters:

	C:\>c:\go1.5.4\bin\go run x.go
	> ^Z
	[3 "\x1a\r\n" <nil>]
	> hello^Zworld
	[13 "hello\x1aworld\r\n" <nil>]
	>

CL 4310 attempted to fix this bug, then known as #6303,
by changing the use of ReadConsole to ReadFile.
This CL was released as part of Go 1.6 and did fix the case
of a ^Z by itself, but not as part of a larger input:

	C:\>c:\go1.6.3\bin\go run x.go
	> ^Z
	[0 "" EOF]
	> hello^Zworld
	[13 "hello\x1aworld\r\n" <nil>]
	>

So the fix was incomplete.
Worse, the fix broke Unicode console input.

ReadFile does not handle Unicode console input correctly.
To handle Unicode correctly, programs must use ReadConsole.
Early versions of Go used ReadFile to read the console,
leading to incorrect Unicode handling, which was filed as #4760
and fixed in CL 7312053, which switched to ReadConsole
and was released as part of Go 1.1 and still worked as of Go 1.5:

	C:\>c:\go1.5.4\bin\go run x.go
	> hello
	[7 "hello\r\n" <nil>]
	> hello world™
	[16 "hello world™\r\n" <nil>]
	>

But in Go 1.6:

	C:\>c:\go1.6.3\bin\go run x.go
	> hello
	[7 "hello\r\n" <nil>]
	> hello world™
	[0 "" EOF]
	>

That is, changing back to ReadFile in Go 1.6 reintroduced #4760,
which has been refiled as #17097. (We have no automated test
for this because we don't know how to simulate console input
in a test: it appears that one must actually type at a keyboard
to use the real APIs. This CL at least adds a comment warning
not to reintroduce ReadFile again.)

CL 29493 attempted to fix #17097, but it was not a complete fix:
the hello world™ example above still fails, as does Shift-JIS input,
which was filed as #17939.

CL 29493 also broke ^Z handling, which was filed as #17427.

This CL attempts the never before successfully performed trick
of simultaneously fixing Unicode console input and ^Z handling.
It changes the console input to use ReadConsole again,
as in Go 1.5, which seemed to work for all known Unicode input.
Then it adds explicit handling of ^Z in the input stream.
(In the case where standard input is a redirected file, ^Z processing
should not happen, and it does not, because this code path is only
invoked when standard input is the console.)

With this CL:

	C:\>go run x.go
	> hello
	[7 "hello\r\n" <nil>]
	> hello world™
	[16 "hello world™\r\n" <nil>]
	> ^Z
	[0 "" EOF]
	> [2 "\r\n" <nil>]
	> hello^Zworld
	[5 "hello" <nil>]
	> [0 "" EOF]
	> [7 "world\r\n" <nil>]

This almost matches Unix:

	$ go run /tmp/x.go
	> hello
	[6 "hello\n" <nil>]
	> hello world™
	[15 "hello world™\n" <nil>]
	> ^D
	[0 "" EOF]
	> [1 "\n" <nil>]
	> hello^Dworld
	[5 "hello" <nil>]
	> [6 "world\n" <nil>]
	>

The difference is in the handling of hello^Dworld / hello^Zworld.
On Unix, hello^Dworld terminates the read of hello but does not
result in a zero-length read between reading hello and world.
This is dictated by the tty driver, not any special Go code.

On Windows, in this CL, hello^Zworld inserts a zero length read
result between hello and world, which is treated as an interior EOF.
This is implemented by the Go code in this CL, but it matches the
handling of ^Z on the console in other programs:

	C:\>copy con x.txt
	hello^Zworld
	        1 file(s) copied.

	C:\>type x.txt
	hello
	C:\>

A natural question is how to test all this. As noted above, we don't
know how to write automated tests using the actual Windows console.
CL 29493 introduced the idea of substituting a different syscall.ReadFile
implementation for testing; this CL continues that idea but substituting
for syscall.ReadConsole instead. To avoid the regression of putting
ReadFile back, this CL adds a comment warning against that.

Fixes #17427.
Fixes #17939.

Change-Id: Ibaabd0ceb2d7af501d44ac66d53f64aba3944142
Reviewed-on: https://go-review.googlesource.com/33451
Run-TryBot: Russ Cox <rsc@golang.org>
Reviewed-by: Quentin Smith <quentin@golang.org>
Reviewed-by: Alex Brainman <alex.brainman@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@golang golang locked and limited conversation to collaborators Nov 22, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. OS-Windows
Projects
None yet
Development

No branches or pull requests

7 participants