x/text/encoding: ianaindex.Index.Encoding returns nil for valid encoding names #19421

ymmt2005 · 2017-03-06T11:39:40Z

ianaindex.MIME.Encoding("us-ascii") for example returns (nil, nil).

This is quite surprising for me because the documentation does not say that it may return nil for valid names.
Would you please consider updating the documentation or changing it to return encoding.Nop?

Thank you!

What did you do?

package main

import (
	"fmt"

	"golang.org/x/text/encoding/ianaindex"
)

func main() {
	enc, err := ianaindex.MIME.Encoding("us-ascii")
	fmt.Printf("%v, %v\n", enc, err)
}

What did you expect to see?

non-nil Encoding object and nil error.

What did you see instead?

nil, nil

The text was updated successfully, but these errors were encountered:

stapelberg · 2017-11-10T22:18:03Z

cc @mpvl

I just ran into this issue as well.

pam4 · 2019-05-13T21:43:07Z

Me too.
Currently Index.Encoding returns (nil, nil) for 201 out of 256 encodings, corresponding to 683 out of 1513 aliases. I assume these 201 encodings are not implemented.

The 55 implemented encodings are more than enough for me. The problem is that no function is supposed to fail with a nil error. Even if it were documented, it's not idiomatic.

(It *is* a failure, because Index.Encoding is supposed to return the requested encoding, not just to validate the alias.)

I think returning encoding.Nop is even worse than the current behavior, because that's not the encoding requested by the caller and it's harder to detect.

The best solution I can think of, is to return errUnsupported and improve the docs.
The downside is that there is no way to distinguish an unimplemented encoding (errUnsupported) from a bad alias (errInvalidName), because errors are not exported, which is inconvenient.

--- a/encoding/ianaindex/ianaindex.go
+++ b/encoding/ianaindex/ianaindex.go
@@ -79,7 +79,11 @@ func (x *Index) Encoding(name string) (encoding.Encoding, error) {
            return nil, errInvalidName
        }
    }
-   return x.enc[i], nil
+   enc := x.enc[i]
+   if enc == nil {
+       return nil, errUnsupported
+   }
+   return enc, nil
 }
 
 // Name reports the canonical name of the given Encoding. It will return an

While we are waiting for the right people to look into this bug, I propose to add a Bugs section to the docs (like the one in package bytes and others), briefly explaining the problem.

This bug is already two years old and users are going to get unexpected panics (like I did), just because the information is not at hand.

emersion · 2019-12-19T13:03:08Z

Opened a pull request to fix this issue: golang/text#10

pam4 · 2019-12-20T00:21:07Z

@emersion, thanks.
On second thought I prefer your solution of just documenting the (nil, nil) case (partly because my code already depends on it :D). Anyway I wouldn't want to lose the ability to distinguish between alias not found and encoding not supported.

But to treat ASCII as encoding.Nop seems incorrect to me; see the docs about Encoder and Decoder:
An Encoder (UTF-8 to X) is supposed to translate invalid UTF-8 to the replacement character and to return an error for any rune that cannot be represented by X.
A Decoder (X to UTF-8) is supposed to translate source bytes that are not valid X encoding to the replacement character.
As a result, reading from a Decoder should always produce valid UTF-8, and writing to an Encoder should always produce valid X encoding.
This is not the case for encoding.Nop, which is just a byte copy.

I would prefer to either do nothing, or implement a "correct" ASCII Encoding.
If ianaindex.MIME.Encoding must return an encoding that doesn't satisfy the above conditions, it should at least be documented.
Often a different behavior is preferred but we should probably just let the user special-case it. Someone may want encoding.Nop while others may want unicode.UTF8, charmap.Windows1252 or even something else.

emersion · 2019-12-20T11:11:58Z

That's true. I updated the PR to make the ASCII encoding:

Decode non-ASCII bytes as unicode.ReplacementChar
Encode non-ASCII runes as encoding.ASCIISub

pam4 · 2019-12-20T14:44:54Z

@emersion, thanks for the update.

Unlike Decoders, it seems that an Encoder is supposed to return a real error for any rune that cannot be represented by the target encoding.
https://play.golang.org/p/Wj2SLov2GNG

Existing Encoders return internal.RepertoireError, not only for unrepresentable runes, but also for invalid UTF-8 and even a literal \uFFFD itself.
The documentation is not wrong in saying that invalid UTF-8 is treated as \uFFFD, since they both produce the same error, but I still find it a little surprising that \uFFFD is not considered "representable".

I haven't looked into it, but an alternative could be to add ASCII as a charmap.Charmap here (it's a generated file), with bytes >127 mapped to \uFFFD in the decode field, and omitted from the encode field.

emersion · 2019-12-20T17:23:02Z

Ah, thanks, I missed that. Updated again to return a RepertoireError as expected.

I considered adding ASCII as a charmap but decided against because the script generating tables fetches an ICU table from a remote repository.

gopherbot · 2019-12-20T17:34:23Z

Change https://golang.org/cl/212077 mentions this issue: encoding/ianaindex: add ASCII, document Index.Encoding

pam4 · 2020-01-24T18:54:13Z

@mpvl

One question that remains is whether ASCII should be included as an exported encoding. [...] The disadvantage is that it may be used when people are better off with Latin1. But a comment could be to suggest use Latin1 for common cases.

I think it's a good idea to give users access to the ascii encoding.
I understand it would be inappropriate to use it for html, because html producers are required to use utf-8, and consumers are required to resolve labels as htmlindex does. But it's the user's responsibility to know that.

If a user thinks he needs an ascii Encoder he's probably right: if he didn't want ascii-only output he would have chosen something else.
If a user thinks he needs an ascii Decoder, he *may* be better off with a superset instead. It depends on what he wants to do if the input expected to be ascii is not actually ascii, and what such input is most likely to be.
Again, I think it's the user's responsibility to handle mislabeled/malformed input as most appropriate for his use case.

Note that latin1 == iso-8859-1 != windows-1252.
Bytes in the range 0x80-0x9f map differently to unicode, e.g. windows-1252 0x80 is U+20AC (€) while latin1 0x80 is U+0080.
It is usually safe to decode or to label latin1 as windows-1252 but not the other way around (W3C labels are intended for decoding legacy html).
(OT: Let's do whatever we can to speed up the agonizingly slow death of non-utf-8 encodings 😃)

One could add a Get method to ianaindex, similar to Encoding, but returning an Unsupported error if non-existing and deprecate Encoding.

SGTM (to be tested with errors.Is)

makew0rld · 2020-06-19T22:31:52Z

Are there any encodings compatible with Unicode that are returning nil, nil besides US-ASCII? It'd be helpful application-wise. To avoid displaying garbage data I have to bring up an error if the encoding returned is nil, but now I have to make a special case for US-ASCII, because it's compatible with Unicode so I should still display it.

@pam4

pam4 · 2020-06-20T11:35:15Z

@makeworld-the-better-one, by "compatible with Unicode" I assume you actually mean "a subset of utf-8", otherwise you will have to explain better.
AFAIK, us-ascii is the only subset of utf-8 that has a name, i.e. the only encoding you can handle as utf-8 without loss.

But you may want to treat text labeled us-ascii as windows-1252, because in case of mislabeling there is usually a higher chance for it to be windows-1252 or latin1 than utf-8 (and in case of correct labeling it makes no difference). It depends on your use case.

makew0rld · 2020-06-20T14:51:30Z

Hmm, thanks. But it seems that in general I should just not be displaying text that has a nil encoding, with US-ASCII being the exception.

Index.Encoding returns a nil Encoding in case the charset is valid but unsupported by the library. Document this behavior. Because of this, US-ASCII is seen as unsupported. Register it as a regular encoding. The decoder replaces non-ASCII bytes with the unicode replacement character. The encoder returns a RepertoireError when a non-ASCII rune is encountered. Fixes golang/go#19421 Change-Id: I4c24ba2114a5012be88488e63aa6e57df955eb96 GitHub-Last-Rev: 418ee6d GitHub-Pull-Request: golang#10 Reviewed-on: https://go-review.googlesource.com/c/text/+/212077 Reviewed-by: Daniel Martí <mvdan@mvdan.cc> Reviewed-by: Marcel van Lohuizen <mpvl@golang.org> Run-TryBot: Daniel Martí <mvdan@mvdan.cc> TryBot-Result: Gobot Gobot <gobot@golang.org>

ymmt2005 changed the title ~~x/text/encoding: ianaindex.Index.Encoding returns nil for valid character sets~~ x/text/encoding: ianaindex.Index.Encoding returns nil for valid encoding names Mar 6, 2017

gopherbot added this to the Unreleased milestone Mar 21, 2017

emersion mentioned this issue Dec 19, 2019

encoding/ianaindex: add ASCII, document Index.Encoding golang/text#10

Closed

makew0rld mentioned this issue Jun 19, 2020

Support multiple charsets makew0rld/amfora#3

Closed

gopherbot closed this as completed in golang/text@a8b4671 Aug 26, 2020

golang locked and limited conversation to collaborators Aug 26, 2021

gopherbot added the FrozenDueToAge label Aug 26, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

x/text/encoding: ianaindex.Index.Encoding returns nil for valid encoding names #19421

x/text/encoding: ianaindex.Index.Encoding returns nil for valid encoding names #19421

ymmt2005 commented Mar 6, 2017

stapelberg commented Nov 10, 2017

pam4 commented May 13, 2019 •

edited

emersion commented Dec 19, 2019

pam4 commented Dec 20, 2019

emersion commented Dec 20, 2019 •

edited

pam4 commented Dec 20, 2019

emersion commented Dec 20, 2019

gopherbot commented Dec 20, 2019

pam4 commented Jan 24, 2020 •

edited

makew0rld commented Jun 19, 2020

pam4 commented Jun 20, 2020

makew0rld commented Jun 20, 2020

x/text/encoding: ianaindex.Index.Encoding returns nil for valid encoding names #19421

x/text/encoding: ianaindex.Index.Encoding returns nil for valid encoding names #19421

Comments

ymmt2005 commented Mar 6, 2017

What did you do?

What did you expect to see?

What did you see instead?

stapelberg commented Nov 10, 2017

pam4 commented May 13, 2019 • edited

emersion commented Dec 19, 2019

pam4 commented Dec 20, 2019

emersion commented Dec 20, 2019 • edited

pam4 commented Dec 20, 2019

emersion commented Dec 20, 2019

gopherbot commented Dec 20, 2019

pam4 commented Jan 24, 2020 • edited

makew0rld commented Jun 19, 2020

pam4 commented Jun 20, 2020

makew0rld commented Jun 20, 2020

pam4 commented May 13, 2019 •

edited

emersion commented Dec 20, 2019 •

edited

pam4 commented Jan 24, 2020 •

edited