Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/text/encoding: ianaindex.Index.Encoding returns nil for valid encoding names #19421

Closed
ymmt2005 opened this issue Mar 6, 2017 · 12 comments
Closed

Comments

@ymmt2005
Copy link

ymmt2005 commented Mar 6, 2017

ianaindex.MIME.Encoding("us-ascii") for example returns (nil, nil).

This is quite surprising for me because the documentation does not say that it may return nil for valid names.
Would you please consider updating the documentation or changing it to return encoding.Nop?

Thank you!

What did you do?

package main

import (
	"fmt"

	"golang.org/x/text/encoding/ianaindex"
)

func main() {
	enc, err := ianaindex.MIME.Encoding("us-ascii")
	fmt.Printf("%v, %v\n", enc, err)
}

What did you expect to see?

non-nil Encoding object and nil error.

What did you see instead?

nil, nil

@ymmt2005 ymmt2005 changed the title x/text/encoding: ianaindex.Index.Encoding returns nil for valid character sets x/text/encoding: ianaindex.Index.Encoding returns nil for valid encoding names Mar 6, 2017
@gopherbot gopherbot added this to the Unreleased milestone Mar 21, 2017
@stapelberg
Copy link
Contributor

cc @mpvl

I just ran into this issue as well.

@pam4
Copy link

pam4 commented May 13, 2019

Me too.
Currently Index.Encoding returns (nil, nil) for 201 out of 256 encodings, corresponding to 683 out of 1513 aliases. I assume these 201 encodings are not implemented.

The 55 implemented encodings are more than enough for me. The problem is that no function is supposed to fail with a nil error. Even if it were documented, it's not idiomatic.

(It *is* a failure, because Index.Encoding is supposed to return the requested encoding, not just to validate the alias.)

I think returning encoding.Nop is even worse than the current behavior, because that's not the encoding requested by the caller and it's harder to detect.

The best solution I can think of, is to return errUnsupported and improve the docs.
The downside is that there is no way to distinguish an unimplemented encoding (errUnsupported) from a bad alias (errInvalidName), because errors are not exported, which is inconvenient.

--- a/encoding/ianaindex/ianaindex.go
+++ b/encoding/ianaindex/ianaindex.go
@@ -79,7 +79,11 @@ func (x *Index) Encoding(name string) (encoding.Encoding, error) {
            return nil, errInvalidName
        }
    }
-   return x.enc[i], nil
+   enc := x.enc[i]
+   if enc == nil {
+       return nil, errUnsupported
+   }
+   return enc, nil
 }
 
 // Name reports the canonical name of the given Encoding. It will return an

While we are waiting for the right people to look into this bug, I propose to add a Bugs section to the docs (like the one in package bytes and others), briefly explaining the problem.

This bug is already two years old and users are going to get unexpected panics (like I did), just because the information is not at hand.

@emersion
Copy link

Opened a pull request to fix this issue: golang/text#10

@pam4
Copy link

pam4 commented Dec 20, 2019

@emersion, thanks.
On second thought I prefer your solution of just documenting the (nil, nil) case (partly because my code already depends on it :D). Anyway I wouldn't want to lose the ability to distinguish between alias not found and encoding not supported.

But to treat ASCII as encoding.Nop seems incorrect to me; see the docs about Encoder and Decoder:
An Encoder (UTF-8 to X) is supposed to translate invalid UTF-8 to the replacement character and to return an error for any rune that cannot be represented by X.
A Decoder (X to UTF-8) is supposed to translate source bytes that are not valid X encoding to the replacement character.
As a result, reading from a Decoder should always produce valid UTF-8, and writing to an Encoder should always produce valid X encoding.
This is not the case for encoding.Nop, which is just a byte copy.

I would prefer to either do nothing, or implement a "correct" ASCII Encoding.
If ianaindex.MIME.Encoding must return an encoding that doesn't satisfy the above conditions, it should at least be documented.
Often a different behavior is preferred but we should probably just let the user special-case it. Someone may want encoding.Nop while others may want unicode.UTF8, charmap.Windows1252 or even something else.

@emersion
Copy link

emersion commented Dec 20, 2019

That's true. I updated the PR to make the ASCII encoding:

  • Decode non-ASCII bytes as unicode.ReplacementChar
  • Encode non-ASCII runes as encoding.ASCIISub

@pam4
Copy link

pam4 commented Dec 20, 2019

@emersion, thanks for the update.

Unlike Decoders, it seems that an Encoder is supposed to return a real error for any rune that cannot be represented by the target encoding.
https://play.golang.org/p/Wj2SLov2GNG

Existing Encoders return internal.RepertoireError, not only for unrepresentable runes, but also for invalid UTF-8 and even a literal \uFFFD itself.
The documentation is not wrong in saying that invalid UTF-8 is treated as \uFFFD, since they both produce the same error, but I still find it a little surprising that \uFFFD is not considered "representable".

I haven't looked into it, but an alternative could be to add ASCII as a charmap.Charmap here (it's a generated file), with bytes >127 mapped to \uFFFD in the decode field, and omitted from the encode field.

@emersion
Copy link

Ah, thanks, I missed that. Updated again to return a RepertoireError as expected.

I considered adding ASCII as a charmap but decided against because the script generating tables fetches an ICU table from a remote repository.

@gopherbot
Copy link

Change https://golang.org/cl/212077 mentions this issue: encoding/ianaindex: add ASCII, document Index.Encoding

@pam4
Copy link

pam4 commented Jan 24, 2020

@mpvl

One question that remains is whether ASCII should be included as an exported encoding. [...] The disadvantage is that it may be used when people are better off with Latin1. But a comment could be to suggest use Latin1 for common cases.

I think it's a good idea to give users access to the ascii encoding.
I understand it would be inappropriate to use it for html, because html producers are required to use utf-8, and consumers are required to resolve labels as htmlindex does. But it's the user's responsibility to know that.

If a user thinks he needs an ascii Encoder he's probably right: if he didn't want ascii-only output he would have chosen something else.
If a user thinks he needs an ascii Decoder, he *may* be better off with a superset instead. It depends on what he wants to do if the input expected to be ascii is not actually ascii, and what such input is most likely to be.
Again, I think it's the user's responsibility to handle mislabeled/malformed input as most appropriate for his use case.

Note that latin1 == iso-8859-1 != windows-1252.
Bytes in the range 0x80-0x9f map differently to unicode, e.g. windows-1252 0x80 is U+20AC (€) while latin1 0x80 is U+0080.
It is usually safe to decode or to label latin1 as windows-1252 but not the other way around (W3C labels are intended for decoding legacy html).
(OT: Let's do whatever we can to speed up the agonizingly slow death of non-utf-8 encodings 😃)

One could add a Get method to ianaindex, similar to Encoding, but returning an Unsupported error if non-existing and deprecate Encoding.

SGTM (to be tested with errors.Is)

@makew0rld
Copy link

Are there any encodings compatible with Unicode that are returning nil, nil besides US-ASCII? It'd be helpful application-wise. To avoid displaying garbage data I have to bring up an error if the encoding returned is nil, but now I have to make a special case for US-ASCII, because it's compatible with Unicode so I should still display it.

@pam4

@pam4
Copy link

pam4 commented Jun 20, 2020

@makeworld-the-better-one, by "compatible with Unicode" I assume you actually mean "a subset of utf-8", otherwise you will have to explain better.
AFAIK, us-ascii is the only subset of utf-8 that has a name, i.e. the only encoding you can handle as utf-8 without loss.

But you may want to treat text labeled us-ascii as windows-1252, because in case of mislabeling there is usually a higher chance for it to be windows-1252 or latin1 than utf-8 (and in case of correct labeling it makes no difference). It depends on your use case.

@makew0rld
Copy link

Hmm, thanks. But it seems that in general I should just not be displaying text that has a nil encoding, with US-ASCII being the exception.

xhit pushed a commit to xhit/text that referenced this issue Oct 30, 2020
Index.Encoding returns a nil Encoding in case the charset is valid but
unsupported by the library. Document this behavior.

Because of this, US-ASCII is seen as unsupported.
Register it as a regular encoding. The decoder replaces non-ASCII bytes
with the unicode replacement character. The encoder returns a
RepertoireError when a non-ASCII rune is encountered.

Fixes golang/go#19421

Change-Id: I4c24ba2114a5012be88488e63aa6e57df955eb96
GitHub-Last-Rev: 418ee6d
GitHub-Pull-Request: golang#10
Reviewed-on: https://go-review.googlesource.com/c/text/+/212077
Reviewed-by: Daniel Martí <mvdan@mvdan.cc>
Reviewed-by: Marcel van Lohuizen <mpvl@golang.org>
Run-TryBot: Daniel Martí <mvdan@mvdan.cc>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@golang golang locked and limited conversation to collaborators Aug 26, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants