Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/text/encoding: decoder aborts on some invalid encodings instead of using U+FFFD #18898

Closed
AsoSunag opened this issue Feb 2, 2017 · 15 comments
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@AsoSunag
Copy link

AsoSunag commented Feb 2, 2017

Hello,

I noticed that for some encodings (at least all the ones for Japanese, Korean, traditional and simplified Chinese) the Transform method of their respective decoder handle invalid character by failing the decoding. But for example the UTF8 decoder replace everything it does not know by the invalid one (\ufffd).

Wouldn't it be better to normalize all the decoders to replace unknown characters by the invalid one?

@AsoSunag AsoSunag changed the title x/text/encoding decoder invalid character handling normalization Proposal: x/text/encoding decoder invalid character handling normalization Feb 2, 2017
@AsoSunag
Copy link
Author

AsoSunag commented Feb 3, 2017

I had not seen it before but it seems that this issue was already address in a code comment (https://github.com/golang/text/blob/c27e06ce2911df9af962d3325d66ce05b7502c32/encoding/encoding.go#L23) :

// TODO:
// - There seems to be some inconsistency in when decoders return errors
//   and when not. Also documentation seems to suggest they shouldn't return
//   errors at all (except for UTF-16).

So here is a list of all the encodings returning errors :

  • EUC-JP
  • ISO-2022-JP
  • Shift_JS
  • EUC-KR
  • gb18030
  • GBK
  • HZ-GB-2312
  • Big5
  • UTF-16
  • UTF-32

And from what I see here https://encoding.spec.whatwg.org/#error the default error management for all decoders should be doing replacement instead of fatal error by default (even for UTF-16).

@ALTree ALTree changed the title Proposal: x/text/encoding decoder invalid character handling normalization proposal: x/text/encoding decoder invalid character handling normalization Feb 3, 2017
@ALTree ALTree added the Proposal label Feb 3, 2017
@ALTree ALTree added this to the Proposal milestone Feb 3, 2017
@ALTree
Copy link
Member

ALTree commented Feb 3, 2017

You have marked the issue as "proposal", but IMHO that's a bit of a stretch if there's a TODO in the code, you're actually asking to fix a known problem.

anyway cc @mpvl who owns x/text.

@AsoSunag
Copy link
Author

AsoSunag commented Feb 3, 2017

Sorry my bad I saw the comment when I was having a second look around.

@bradfitz bradfitz changed the title proposal: x/text/encoding decoder invalid character handling normalization x/text/encoding decoder invalid character handling normalization Feb 6, 2017
@rsc rsc removed the Proposal label Feb 6, 2017
@bradfitz bradfitz changed the title x/text/encoding decoder invalid character handling normalization x/text/encoding: decoder invalid character handling normalization Feb 6, 2017
@bradfitz bradfitz added the NeedsFix The path to resolution is known, but the work has not been done. label Feb 6, 2017
@bradfitz bradfitz modified the milestones: Go1.9, Proposal Feb 6, 2017
@rsc rsc changed the title x/text/encoding: decoder invalid character handling normalization x/text/encoding: decoder aborts on some invalid encodings instead of using U+FFFD Feb 6, 2017
@mpvl
Copy link
Contributor

mpvl commented Feb 8, 2017

@nigeltao Could you clarify why these decoders returned errors instead of substituting the replacement character?

@nigeltao
Copy link
Contributor

nigeltao commented Feb 9, 2017

In https://groups.google.com/d/msg/golang-nuts/pENT3i4zJYk/o35l8WmyBwAJ on 2015 December, I said:

I remember, a year or three ago, discussing whether or not Encoding transformers should return an error early, use a substitute character, or be configurable between the two. I can't remember the details, though.

I still can't remember why we return errors instead of substituting U+FFFD.

For the record, there's also an "Error Modes" CL proposal (since abandoned) at https://go-review.googlesource.com/#/c/8303/

But as I cannot seem to contact 2013-era-nigeltao, I'm OK with just always substituting U+FFFD and changing the implementations to match the documentation, as per the TODO.

@mpvl
Copy link
Contributor

mpvl commented Feb 9, 2017

Then substituting U+FFFD without the option to do otherwise seems the best approach to me.

@gopherbot
Copy link

CL https://golang.org/cl/37316 mentions this issue.

@gopherbot
Copy link

CL https://golang.org/cl/37317 mentions this issue.

@gopherbot
Copy link

CL https://golang.org/cl/37319 mentions this issue.

@gopherbot
Copy link

CL https://golang.org/cl/37322 mentions this issue.

@gopherbot
Copy link

CL https://golang.org/cl/37324 mentions this issue.

@gopherbot
Copy link

CL https://golang.org/cl/37325 mentions this issue.

gopherbot pushed a commit to golang/text that referenced this issue Feb 23, 2017
Updates golang/go#18898

Change-Id: Ic5fb77af67656889a387fa75a3e6efc9b9975817
Reviewed-on: https://go-review.googlesource.com/37316
Run-TryBot: Marcel van Lohuizen <mpvl@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Nigel Tao <nigeltao@golang.org>
gopherbot pushed a commit to golang/text that referenced this issue Feb 23, 2017
Updates golang/go#18898

Change-Id: I9868004acb11abbfee8492c9de8ba374f6dcb2ac
Reviewed-on: https://go-review.googlesource.com/37319
Run-TryBot: Marcel van Lohuizen <mpvl@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Nigel Tao <nigeltao@golang.org>
gopherbot pushed a commit to golang/text that referenced this issue Feb 23, 2017
Updates golang/go#18898

Change-Id: I0636d4ff317bbca015cf2a2a2ce9f452bdc601fd
Reviewed-on: https://go-review.googlesource.com/37322
Run-TryBot: Marcel van Lohuizen <mpvl@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Nigel Tao <nigeltao@golang.org>
gopherbot pushed a commit to golang/text that referenced this issue Feb 23, 2017
Not sure if the right number of replacement characters is
returned in all instances, but seems reasonable.

Updates golang/go#18898

Change-Id: Ibf6efdb079191aa6db4eb05b41b7dae593947bd0
Reviewed-on: https://go-review.googlesource.com/37324
Run-TryBot: Marcel van Lohuizen <mpvl@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Nigel Tao <nigeltao@golang.org>
gopherbot pushed a commit to golang/text that referenced this issue Feb 26, 2017
Updates golang/go#18898

Change-Id: I049e5ba1fca9529eeacc3aa58f7e5c2d17f22ecd
Reviewed-on: https://go-review.googlesource.com/37317
Run-TryBot: Marcel van Lohuizen <mpvl@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Nigel Tao <nigeltao@golang.org>
@mpvl
Copy link
Contributor

mpvl commented Feb 28, 2017

As for UTF-16, it will return only an error if ExpectBOM is given. This mode is not used by either the ianaindex or htmlindex, and can easily be avoided by users if they do not wish an error in such circumstances. As this is just an opt-in error, and an error for which it is unclear what replacement would mean, I'm not altering UTF-16. The same holds for UTF-32.

@gopherbot
Copy link

CL https://golang.org/cl/37595 mentions this issue.

gopherbot pushed a commit to golang/text that referenced this issue Mar 1, 2017
Updates golang/go#18898

Change-Id: If234aa5fdc35daf5ab02f49400462aa0c1ffa5ea
Reviewed-on: https://go-review.googlesource.com/37325
Run-TryBot: Marcel van Lohuizen <mpvl@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Nigel Tao <nigeltao@golang.org>
gopherbot pushed a commit to golang/text that referenced this issue Mar 3, 2017
This CL also fixes a bug in the JIS 212 mode.

Updates golang/go#18898

Change-Id: Idb10e375bbc5d278db4e8550f26986a5d2bb6caa
Reviewed-on: https://go-review.googlesource.com/37595
Run-TryBot: Marcel van Lohuizen <mpvl@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Nigel Tao <nigeltao@golang.org>
@mpvl
Copy link
Contributor

mpvl commented Mar 10, 2017

All decoders should no longer return errors except for the opt-in error for UTF-16/32.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

7 participants