Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/text: Problem with charset 'hz-gb-2312', program stuck in infinite loop and gets killed! #35118

Closed
ben-sab opened this issue Oct 23, 2019 · 8 comments
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@ben-sab
Copy link

ben-sab commented Oct 23, 2019

What version of Go are you using (go version)?

$ go version
go1.12.7 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build514111842=/tmp/go-build -gno-record-gcc-switches"

What did you do?

https://play.golang.org/p/V5OAmSmE9-r

package main

import (
	"bytes"
	"golang.org/x/text/encoding/ianaindex"
)
func main() {
	content := []byte{126}
	enc, _ := ianaindex.MIME.Encoding("hz-gb-2312")
	decoder := enc.NewDecoder()
	decoded, _ := decoder.Bytes(content)
	bytes.NewReader(decoded)
}

What did you expect to see?

For the program to exit normally.

What did you see instead?

Gets stuck in an infinite loop and gets killed. In playground you'll see: "Program exited: process took too long."

The culprit is the Tilde (~ or ascii=126) which if shows up at the end of a line, makes decoder.Bytes(content) get stuck in an infinite loop for this particular encoding.

@gopherbot gopherbot added this to the Unreleased milestone Oct 23, 2019
@dmitshur
Copy link
Contributor

/cc @mpvl per owners.

@dmitshur dmitshur added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Oct 23, 2019
@ben-sab
Copy link
Author

ben-sab commented Oct 23, 2019

Not sure if this is going to help in tracking down the issue or not but have a look at the following tests (just changing the value of content in the above code):

  1. content := []byte{126}: fails
  2. content := []byte{126,126}: OK
  3. content := []byte{126,126,126}: fails
  4. content := []byte{126,126,126,126}: OK
  5. content := []byte{126,126,126,126,126}: OK!

So it's not the odd number of tildes that's breaking it either, hope this helps!

@smasher164
Copy link
Member

  1. content := []byte{126,126,126,126,126}: OK!

In that last case, shouldn't the decoder return an error? Since ~ triggers an escape sequence, any input that ends with an incomplete escape sequence is definitely invalid.

The simplifiedchinese transformer seems to be returning ErrShortDst when it should be returning ErrShortSrc instead, since an incomplete escape sequence is insufficient data to complete a transformation. This causes doAppend to continuously grow the destination buffer, re-running the same transformation in an infinite loop.

@josharian
Copy link
Contributor

Aside: This sounds like the kind of bug fuzzing would catch. Might be worth writing a few fuzz functions for this repo. cc @yevgenypats

@yevgenypats
Copy link

Thanks @josharian , I'll look into this.

@ben-sab
Copy link
Author

ben-sab commented Nov 4, 2019

Any news about this issue at all?

@smasher164 smasher164 added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Nov 4, 2019
@garrmcnu
Copy link

garrmcnu commented Feb 24, 2020

Hi, I would like to give this one a shot.

As mentioned, in this encoding '~' is an escape character which must be followed by another character, from RFC 1843:
"In ASCII mode, a byte is interpreted as an ASCII character, unless a
'~' is encountered. The character '~' is an escape character. By
convention, it must be immediately followed ONLY by '~', '{' or '\n'
(<LF>), with the following special meaning."

As far as I can tell, Transform() only returns ErrShortSrc if atEOF is false. If atEOF is true (as is the case for a fixed buffer/string) then a single '~' is replaced with the unicode replacement character (U+FFFD / utf8.RuneError), this character requires 3 bytes to encode [239 191 189].
Would be good if someone more familiar with this area could confirm if that is the expected behaviour?

The reason for the infinite loop is when the replacement character is inserted, the size variable (used to iterate through the src buffer) is not updated, so this loop doesn't make progress.
Setting this variable to 1 when writing the replacement character (to indicate that a single source byte has been read) fixes the issue.

The reason for the different behaviour on the number of '~' characters is due to the size of the destination buffer. If this buffer is not big enough to store the replacement character, Transform() returns ErrShortDst (and nSrc to indicate how many source bytes have been read so far), Transform() is then called again with a larger buffer to continue where it left off, however on this call size starts at 0 again and is never updated (because the next character is a single '~') and it gets stuck in an infinite loop.
On the other hand, if the destination buffer is large enough to store the replacement character, it will correctly write the destination buffer, but return an incorrect nSrc (1 byte larger than len(src)).

For example, []byte("~~~~~") requires 5 bytes for the destination buffer (1 byte each for the 2 pairs of '~' + 3 bytes for the replacement character).
The Bytes() method initially sizes the destination buffer to match the source buffer (so 5 bytes in this case). The first iteration through the Transform() loop reads '~~' this is valid, so size is 2 and 1 byte is written to the destination (nSrc=2 nDst=1), next iteration is the same (nSrc=4 nDst=2), next iteration for the final '~' the replacement character is written to the destination buffer but size is not updated and is still 2 from the previous iteration (nSrc=6 nDst=5). This is sufficient to exit the loop but an incorrect nSrc is returned.

The incorrect nSrc doesn't seem to cause a problem when calling Bytes(), however calling String() this results in a panic (slice bounds out of range).

So to summarise behaviour
decoder.Bytes([]byte("~")) // infinite loop (initial destination buffer too small)
decoder.Bytes([]byte("~~")) // valid sequence, no problem
decoder.Bytes([]byte("~~~")) // infinite loop (initial destination buffer too small)
decoder.Bytes([]byte("~~~~~")) // correct transform but incorrect nSrc=6
decoder.Bytes([]byte("abcd~")) // infinite loop (initial destination buffer too small)
decoder.String("~") // infinite loop (size never updated)
decoder.String("~~~") // panic due to incorrect nSrc=4

Perhaps an additional improvement in String() would be a sanity check if nSrc > len(src) to return an error in this case?

@gopherbot
Copy link

Change https://golang.org/cl/220460 mentions this issue: encoding/simplifiedchinese: fix incorrect transform count to avoid infinite loop

brunnre8 added a commit to brunnre8/go-message that referenced this issue May 14, 2020
Upstream go's x/text/encoding package has a bug that makes the
'hz-gb-2312' encoding crash (golang/go#35118)

This was reported to go-message in
emersion#95

This commit removes the offending encoding so that go-message can't crash if
people import the charset package. It should be reverted once the upstream
package is fixed by the go developers.
brunnre8 added a commit to brunnre8/go-message that referenced this issue May 14, 2020
Upstream go's x/text/encoding package has a bug that makes the
'hz-gb-2312' encoding crash (golang/go#35118)

This was reported to go-message in
emersion#95

This commit removes the offending encoding so that go-message can't crash if
people import the charset package. It should be reverted once the upstream
package is fixed by the go developers.
emersion pushed a commit to emersion/go-message that referenced this issue May 25, 2020
Upstream go's x/text/encoding package has a bug that makes the
'hz-gb-2312' encoding crash (golang/go#35118)

This was reported to go-message in
#95

This commit removes the offending encoding so that go-message can't crash if
people import the charset package. It should be reverted once the upstream
package is fixed by the go developers.
emersion added a commit to emersion/go-message that referenced this issue Nov 25, 2020
The upstream Go issue has been fixed [1].

[1]: golang/go#35118

References: #95
@golang golang locked and limited conversation to collaborators Nov 25, 2021
xhit pushed a commit to xhit/text that referenced this issue Oct 10, 2022
…finite loop

If the final character in the source buffer is a single '~' escape
character, size is not updated. The loop either doesn't make progress if
size is zero, or size retains the value from a previous iteration which
may return an incorrect source bytes consumed count.
Count the single '~' as 1 byte consumed.

Fixes golang/go#35118

Change-Id: I3eadf1b4cb632a7c4dc4255325b467a6907c10c0
Reviewed-on: https://go-review.googlesource.com/c/text/+/220460
Reviewed-by: Nigel Tao <nigeltao@golang.org>
Trust: Nigel Tao <nigeltao@golang.org>
Trust: Marcel van Lohuizen <mpvl@golang.org>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

7 participants