Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fmt: inconsistent formatting of unicode with %c and %q #14569

Closed
martisch opened this issue Feb 29, 2016 · 4 comments
Closed

fmt: inconsistent formatting of unicode with %c and %q #14569

martisch opened this issue Feb 29, 2016 · 4 comments

Comments

@martisch
Copy link
Contributor

Go1.6

https://play.golang.org/p/XcrX5-8-om

invalid:
string     |�|
fmt      c |�|
fmt      q |%!q(int64=1114112)|
strconv  q |'�'|
fmt      U |U+110000|
fmt U+%04X |U+110000|
fmt     #U |U+110000|

non-printable:
string     ||
fmt      c ||
fmt      q |'\x00'|
strconv  q |'\x00'|
fmt      U |U+0000|
fmt U+%04X |U+0000|
fmt     #U |U+0000|

printable:
string     |⌘|
fmt      c |⌘|
fmt      q |'⌘'|
strconv  q |'⌘'|
fmt      U |U+2318|
fmt U+%04X |U+2318|
fmt     #U |U+2318 '⌘'|

maxInt64:
string     |�|
fmt      c |�|
fmt      q |%!q(int64=9223372036854775807)|
strconv  q |'�'|
fmt      U |U+7FFFFFFFFFFFFFFF|
fmt U+%04X |U+7FFFFFFFFFFFFFFF|
fmt     #U |U+7FFFFFFFFFFFFFFF|

negative:
string     |�|
fmt      c |�|
fmt      q |%!q(int64=-1)|
strconv  q |'�'|
fmt      U |U+FFFFFFFFFFFFFFFF|
fmt U+%04X |U+-001|
fmt     #U |U+FFFFFFFFFFFFFFFF|

minInt64:
string     ||
fmt      c |�|
fmt      q |%!q(int64=-9223372036854775808)|
strconv  q |'\x00'|
fmt      U |U+8000000000000000|
fmt U+%04X |U+-8000000000000000|
fmt     #U |U+8000000000000000|

( Note that strconv should only be looked at within rune/int32 range so minInt64 and maxInt64 here are not relevant due to the explicit rune type conversion in the call to strconv in this example program).

I would expect fmt %q to be similar to strconv.QuoteRune within rune/int32 range (including negative numbers). E.g. -1 and 1114112 print a quoted utf.RuneError.

For values outside int32 range i would expect fmt %c and %q to behave similar. Either both print a badVerb error string or both print an utf8.RuneError (quoted in case of %q). If they print an error string then so should probably %U too.

Another possibility is that any invalid unicode point could be rejected by fmt formatting with a badVerb error string for %c %q %U.

Can/Should fmt be changed to handle these cases more consistently?

@ianlancetaylor ianlancetaylor added this to the Unplanned milestone Feb 29, 2016
@ianlancetaylor
Copy link
Member

CC @robpike

@martisch
Copy link
Contributor Author

martisch commented Mar 6, 2016

Having looked at this a bit more i would argue the following:

Returning a badVerb error string for any integer and %q or %c is a bug in my opinion since the documentation defines these verbs are ok for any integer. I also dont see any other case in fmt where badVerb triggers bases on value and not based on type.

If however "the character represented by the corresponding Unicode code point" means if there exist no character for the Unicode code point it should be an error then the current behavior of returning utf8.RuneError for other invalid runes below utf8.MaxRune is a bug. Not returning an Error however is explicitly documented in the code as "// If the character is not valid Unicode, it will print '\ufffd'.".

Either way it seems inconsistent with the documentation to me.

My proposed resolution would therefore be to return utf8.RuneError (escaped for %q) for any invalid rune regardless of the integer type or if its > utf8.MaxRune.

This should make it also easier to check for an invalid Unicode code point since instead of checking for an error string and utf8.RuneError one can now only check for the later. The character for RuneError would be RuneError before and after the change.

Also this behavior can be implemented solely in the fmtC (better renamed and moved to fmt_c) and fmt_qc functions with no range checks outside these functions.

@martisch
Copy link
Contributor Author

As this came up in the report #40175 again I would like to continue the discussion here.

My idea would still be that badVerb does not trigger for value ranges but only for types and that all integers that do not map to a valid unicode point are printed as RuneError ('\uFFFD') rune. This would align with the c verb.

Otherwise I think it should be documented that %q is only valid for integers i with 0 <= i <= utf8.MaxRune.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/248759 mentions this issue: fmt: avoid badverb formatting for %q when used with integers

@golang golang locked and limited conversation to collaborators Aug 17, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants