encoding/json: how to marshal with unicode escape? #39137

Closed
cupen opened this issue May 19, 2020 · 23 comments
Labels: FrozenDueToAge, NeedsDecision

@cupen

cupen commented May 19, 2020

What version of Go are you using (go version)?

$ go version
go version go1.14.2 linux/amd64

What did you expect to see?

It would be useful for the json package to offer an API that marshals strings with Unicode escapes.

unicode escape
https://tools.ietf.org/html/rfc7159#section-7

package main

import (
	"fmt"
	"encoding/json"
)

type Object struct {
	Name string
}

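func main() {
	obj := Object{Name: "哇呀呀"}
	// json.MarshalUnicodeEscape does not exist; it is the API proposed in this issue.
	line, _ := json.MarshalUnicodeEscape(obj)
	fmt.Println(string(line))
}

Expected output:

{"Name": "\u54c7\u5440\u5440"}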
cupen changed the title from "json: how to marshal with unicode escape?" to "encoding/json: how to marshal with unicode escape?" on May 19, 2020
@mvdan
Member

mvdan commented May 19, 2020

Before we start talking about new API, we should first talk about why you need to do that in the first place. You shouldn't need to escape non-ASCII letters in json.

mvdan added the WaitingForInfo label on May 19, 2020
@cupen
Author

cupen commented May 19, 2020

@mvdan I have several runtime environments: Go, Python 2, and a JS runtime (a V8 app). They use different default character encodings: UTF-8 (Go, Python 2) and UCS-2 (the JS runtime). A JSON text has to be converted from UTF-8 to UCS-2, or back, whenever it is transferred between them, which is an error-prone job.

I think Unicode escapes are meant for this case. They are an ASCII-safe and legal JSON string encoding.

@mvdan
Member

mvdan commented May 19, 2020

That seems like a very narrow edge case, and I don't think it should be the json package's job to support producing ASCII output alone.

You could consider having a named string type that implements MarshalJSON and replaces all non-ASCII characters with escape codes, if you want. That should be under twenty lines of extra Go code.

mvdan added the NeedsDecision label and removed the WaitingForInfo label on May 19, 2020
mvdan added this to the Backlog milestone on May 19, 2020
@cupen
Author

cupen commented May 19, 2020

Yes, I see what you mean. Here is an example: https://play.golang.org/p/YVSQzad2Z2r

type UnicodeEscape string

func (ue UnicodeEscape) MarshalJSON() ([]byte, error) {
	text := strconv.QuoteToASCII(string(ue))
	return []byte(text), nil
}

type Object struct {
	Name UnicodeEscape
}

But I would like to think of Unicode escaping as something the JSON spec provides, not as a hack or monkey patch only for Python, JS (V8), or others.

BTW: json.MarshalUnicodeEscape is an ugly name; maybe I could add a new encoder option for it instead.

@seankhliao
Member

The JSON spec says the encoding is Unicode: https://tools.ietf.org/html/rfc7159#section-8.1

@cupen
Author

cupen commented May 20, 2020

@seankhliao Yes, it says that a JSON text can be encoded as UTF-8, UTF-16, or UTF-32, and that a JSON string can additionally use Unicode escapes. Sorry if this sounds like a word game. 😄
https://tools.ietf.org/html/rfc7159#section-7

For example:

  • JSON string: "\u54c7\u5440\u5440" is a JSON value of string type.
  • JSON text: {"Name": "\u54c7\u5440\u5440"} contains all of the JSON elements: the field name, the field value, and the characters ,:"{}.

The JSON text {"Name": "\u54c7\u5440\u5440"} is ASCII-safe no matter which UTF is used.
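
To illustrate the point about escaped and unescaped strings being the same JSON value, here is a minimal sketch that decodes both forms with encoding/json and compares the results. The decodeName helper is a name made up for this example; Object mirrors the struct from the first comment.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type Object struct {
	Name string
}

// decodeName is a hypothetical helper for this example: it unmarshals a
// JSON text into Object and returns the Name field.
func decodeName(text string) (string, error) {
	var obj Object
	if err := json.Unmarshal([]byte(text), &obj); err != nil {
		return "", err
	}
	return obj.Name, nil
}

func main() {
	// Raw string literals keep the backslashes, so this text is pure ASCII.
	escaped := `{"Name": "\u54c7\u5440\u5440"}`
	// The same value written as raw UTF-8.
	raw := `{"Name": "哇呀呀"}`

	a, errA := decodeName(escaped)
	b, errB := decodeName(raw)
	fmt.Println(a, errA)
	fmt.Println(b, errB)
	fmt.Println(a == b) // true: both texts carry the same string value
}
```

Any conforming decoder should treat the two texts as equivalent, which is why the escaped form is safe to pass between runtimes with different native encodings.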

@networkimprov

Maybe you could consider a `json:"ascii"` tag for this case?

@mvdan
Member

mvdan commented May 28, 2020

I think the few lines of code shown in #39137 (comment) are a completely acceptable solution to this. The json API should only cover common issues and needs. Trying to avoid utf-8 altogether in favor of ascii with unicode escapes certainly feels like an edge case that we shouldn't cover, especially given how easy it is to do with MarshalJSON.

@cupen
Author

cupen commented Jun 2, 2020

OK, I'll do it with a non-standard library.
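
For readers who want this outside the standard library, one possible sketch is to marshal normally and then rewrite the output. The marshalASCII name is an assumption made for this example, not an encoding/json API. It relies on the fact that in valid JSON, non-ASCII characters can only appear inside strings, so escaping them cannot damage the structure.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// marshalASCII is a hypothetical helper (not part of encoding/json).
// It marshals v as usual, then rewrites every non-ASCII rune in the
// output as \uXXXX escapes, using a surrogate pair above U+FFFF.
func marshalASCII(v interface{}) ([]byte, error) {
	data, err := json.Marshal(v)
	if err != nil {
		return nil, err
	}
	var buf bytes.Buffer
	for _, r := range string(data) {
		switch {
		case r < utf8.RuneSelf:
			// ASCII was already quoted and escaped by json.Marshal.
			buf.WriteByte(byte(r))
		case r < 0x10000:
			fmt.Fprintf(&buf, `\u%04x`, r)
		default:
			hi, lo := utf16.EncodeRune(r)
			fmt.Fprintf(&buf, `\u%04x\u%04x`, hi, lo)
		}
	}
	return buf.Bytes(), nil
}

type Object struct {
	Name string
}

func main() {
	line, err := marshalASCII(Object{Name: "哇呀呀"})
	fmt.Println(string(line), err)
}
```

Because the rewrite runs over the whole output, it needs no special string type on the struct fields, at the cost of a second pass over the encoded bytes.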

cupen closed this as completed on Jun 2, 2020
@Bogdaan

Bogdaan commented Jul 25, 2020

@cupen seems it doesn't work with emoji

package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

type building struct {
	Name encoded `json:"name"`
}

type encoded string

func (e encoded) MarshalJSON() ([]byte, error) {
	return []byte(strconv.QuoteToASCII(string(e))), nil
}

func main() {
	// json: error calling MarshalJSON for type main.encoded: invalid character 'U' in string escape code
	data, err := json.Marshal(building{Name: encoded("😁")})
	fmt.Println(string(data), err)
}

@networkimprov

strconv.QuoteToASCII produces a Go string that's not valid JSON: "\U0001f601"

JSON needs a pair of UTF-16 code units: "\uabcd\uabcd"
See https://play.golang.org/p/bC7QB27IGiP
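
The difference between the two escape syntaxes can be seen side by side in a small sketch. The surrogateEscape helper is a name invented for this example and only handles runes above U+FFFF.

```go
package main

import (
	"fmt"
	"strconv"
	"unicode/utf16"
)

// surrogateEscape is a hypothetical helper for this example. It formats a
// rune above U+FFFF as the pair of \uXXXX escapes that JSON expects.
func surrogateEscape(r rune) string {
	hi, lo := utf16.EncodeRune(r)
	return fmt.Sprintf(`\u%04x\u%04x`, hi, lo)
}

func main() {
	// Go string-literal syntax, which JSON parsers reject.
	fmt.Println(strconv.QuoteToASCII("😁")) // "\U0001f601"
	// JSON syntax: two UTF-16 code units.
	fmt.Println(surrogateEscape('😁')) // \ud83d\ude01
}
```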

@cupen
Author

cupen commented Sep 2, 2020

@Bogdaan Sorry for the delay.
This should work.

func (ue UnicodeEscape) MarshalJSON() ([]byte, error) {
	var result = []byte("\"")
	for _, r := range []rune(ue) {
		if r < 0x10000 {
			v := "\\u" + strconv.FormatInt(int64(r), 16)
			result = append(result, []byte(v)...)
			continue
		}
		r1, r2 := utf16.EncodeRune(r)
		v1 := "\\u" + strconv.FormatInt(int64(r1), 16)
		v2 := "\\u" + strconv.FormatInt(int64(r2), 16)
		result = append(append(result, []byte(v1)...), []byte(v2)...)
	}

	result = append(result, []byte("\"")...)
	return result, nil
}

https://play.golang.org/p/_435-VMzeY2

BTW: This is not very efficient; I'm just showing a case that works.

@pavelpatrin

Yes, I see what you mean. Here is an example: https://play.golang.org/p/YVSQzad2Z2r

type UnicodeEscape string

func (ue UnicodeEscape) MarshalJSON() ([]byte, error) {
	text := strconv.QuoteToASCII(string(ue))
	return []byte(text), nil
}

type Object struct {
	Name UnicodeEscape
}

But I would like to think of Unicode escaping as something the JSON spec provides, not as a hack or monkey patch only for Python, JS (V8), or others.

BTW: json.MarshalUnicodeEscape is an ugly name; maybe I could add a new encoder option for it instead.

This does not work with emoji. Test with 🤯.

@cupen
Author

cupen commented Jan 19, 2021

@pavelpatrin And this? #39137 (comment)

@pavelpatrin

@pavelpatrin And this? #39137 (comment)

I tested your example result with Chrome DevTools console and from IPython, and it looks fine with Emoji in both cases.

Instead of the \U0001f601 (from strconv.QuoteToASCII("😁")), your example produces two Unicode escapes, "\ud83d\ude01". I'm not sure which is right from the Unicode standard's point of view, but your version works correctly.

In [39]: json.loads(b'"\u54c7\u5440\u5440\ud83d\ude01"')
Out[39]: '哇呀呀😁'

@cupen
Author

cupen commented Jan 19, 2021

@pavelpatrin Yes, that's right. 1f601 is the UTF-32 form and d83d de01 is the UTF-16 form (a surrogate pair); both represent the same code point, 😁.
See it here.
https://codepoints.net/U+1F601

@pavelpatrin

pavelpatrin commented Jan 22, 2021

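@cupen I tested it again, and it looks like it does not work with ASCII-compatible characters.

UnicodeEscape("abc😁😁😁") returns an error.

You could replace the escape formatting with fmt.Sprintf(`\u%04x`, 1), which zero-pads the value to the four hex digits that a JSON \u escape requires. strconv.FormatInt does not pad, so 'a' becomes \u61, which is invalid.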

@schandra157

strconv.QuoteToASCII produces a Go string that's not valid JSON: "\U0001f601"

JSON needs a pair of UTF-16 code units: "\uabcd\uabcd"
See https://play.golang.org/p/bC7QB27IGiP

Is there any way I can get a valid JSON string? I am also stuck with a similar problem for non-printable Unicode characters.

@networkimprov

Use a Rune Literal: https://golang.org/ref/spec#Rune_literals

a, b := utf16.EncodeRune('rune_literal')
json := fmt.Sprintf(`"\u%x\u%x"`, a, b)

@schandra157

@cupen seems it doesn't work with emoji

package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

type building struct {
	Name encoded `json:"name"`
}

type encoded string

func (e encoded) MarshalJSON() ([]byte, error) {
	return []byte(strconv.QuoteToASCII(string(e))), nil
}

func main() {
	// json: error calling MarshalJSON for type main.encoded: invalid character 'U' in string escape code
	data, err := json.Marshal(building{Name: encoded("😁")})
	fmt.Println(string(data), err)
}

I am also stuck with a similar issue. Any help would be appreciated.

@schandra157

Use a Rune Literal: https://golang.org/ref/spec#Rune_literals

a, b := utf16.EncodeRune('rune_literal')
json := fmt.Sprintf(`"\u%x\u%x"`, a, b)

package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

type building struct {
	Name encoded `json:"name"`
}

type encoded string

func (e encoded) MarshalJSON() ([]byte, error) {
	return []byte(strconv.QuoteToASCII(string(e))), nil
}

func main() {
	// json: error calling MarshalJSON for type main.encoded: invalid character 'U' in string escape code
	data, err := json.Marshal(building{Name: encoded("😁")})
	fmt.Println(string(data), err)
}

How do we handle this case?

@networkimprov

@julien-may

Maybe this helps others: https://go.dev/play/p/eJfouGxeEzs
