encoding/json: unexpected handling of non-UTF-8 input #16282
Comments
Can you show us a complete standalone program that demonstrates the error? It's fine to start with a constant string, ideally one shorter than the ones above.
Here's a rudimentary program that captures the issue I was talking about - https://play.golang.org/p/uFcs2-0yEr
Thanks for the test case. When I run that program, I get
I cannot reproduce the exact issue with this program, but the result should roughly translate to "\xaa\xa8", because in the actual use case I am unmarshalling the data (please refer to the dumps in my earlier comments) so that protobuf can decode it later on. Since it resolves to the characters you gave instead, the protobuf Unmarshal bails out.
In your test program, the resulting string (
https://play.golang.org/p/JWgGUhzYBj (which is the same except for the output format) shows exactly what I would expect the program to produce.
Thanks for getting back. The example above was a very basic one that didn't capture the issue I first reported. Here is a snippet from my earlier comment: \u0018\u0004 \u0000\u0098\u0004\u0000\u00A8. When we Unmarshal this string, it should ideally resolve to \x18\x04 \x00\x98\x04\x00\xa8. Instead it is only partially converted, to \x18\x04 \x00\u0098\x04\x00¨: anything below \u0080 is converted to the corresponding hex, but anything from \u0080 up is preserved as is (\u0098 is preserved in the final output and ideally should have been \x98). I suspect the unmarshalling library only detects 2 bytes to convert to hex and leaves anything bigger as is. Sorry, I am not sure what I can provide to reproduce this apart from these details.
Please provide a complete working example. Without one it's difficult to understand the problem you are seeing. I trust you know that \u0098 is a Unicode character that when stored as text will not be the bytes \x00\x98 but instead the UTF-8 encoding \xc2\x98. When you say "ideally" \u0098 will become \x00\x98 you appear to be missing this point. To put it another way, in a string \uNNNN does not represent a byte, it represents the byte sequence necessary to represent the Unicode code point NNNN.
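The point about \uNNNN escapes can be checked with a short standalone program. This is only a sketch: decodeJSONString is a hypothetical helper named for this illustration, and the three literals are taken from the snippet quoted earlier in the thread.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decodeJSONString is a hypothetical helper: it unmarshals a JSON string
// literal into a Go string and returns it.
func decodeJSONString(lit string) (string, error) {
	var s string
	err := json.Unmarshal([]byte(lit), &s)
	return s, err
}

func main() {
	// Raw string literals keep the \uNNNN escapes intact for the JSON decoder.
	for _, lit := range []string{`"\u0018"`, `"\u0098"`, `"\u00A8"`} {
		s, err := decodeJSONString(lit)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		// "% x" prints the bytes of the decoded string in hex.
		fmt.Printf("%s -> % x (%d bytes)\n", lit, s, len(s))
	}
}
```

Code points below U+0080 come out as a single byte, while U+0098 and U+00A8 come out as the two-byte UTF-8 sequences c2 98 and c2 a8, which matches the explanation above.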
I understand the UTF-8 encoding, but in the working case I do not see it being applied. In the example above \u0098 apparently resolves to \x98, and when I send this data to proto.Unmarshal it decodes correctly to the intended protobuf, hence the confusion. I will try to get an isolated example with the actual protobuf I am using.
JSON is not appropriate for encoding arbitrary binary data. You need to do something else to it first, like base64 encode it.
Actually, if you make your field have type []byte instead of type string, json will do the base64 for you.
What version of Go are you using (go version)?
1.6.2
What operating system and processor architecture are you using (go env)?
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="*****/src/avi/go"
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GO15VENDOREXPERIMENT="1"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0"
CXX="g++"
CGO_ENABLED="1"
What did you do?
I am reporting the problem as seen from the surface and, if need be, will provide more details (which are complicated and hard to list here).
I have a protobuf called Pool (it is in a file called pool.pb.go).
Storage of this protobuf goes something like this:
{"obj": <Serialized/Marshalled string of the Pool protobuf>} and this is serialized and stored as one big JSON blob:
"{"obj":"\n)pool-6b9bad47-65ac-477c-a32a-6fd5c7a98c91\u0012\ttest-pool(P@\u0001H\nP\u0000Z2healthmonitor-665dbe8c-aecb-49df-97b1-a87214e65270Z2healthmonitor-f6052e41-616d-46d6-989d-0fdfe0c5a403b1\n\u0011\n\r10.160.161.11\u0010\u0000\u0010P\u001A\r10.160.161.11 \u0001(\u0001X\u0000h\u0000r\u0000\u0080\u0001\u0000h\u0001\u00A0\u0001\u0001\u00B0\u0002\u0001\u00B8\u0002\u0000\u00C2\u0002\u0002\b\u0003\u00D0\u0002\u0000\u00F0\u0002\u0000\u00A2\u0003/vrfcontext-bfec9a97-7df7-4cfd-bd6d-ee84415ed1ab\u00B0\u0003\n\u00B8\u0003\u0001\u00E8\u0003\u0000\u00F0\u0003\u0080\u0001\u0082\u0004\u0006\b\u0000\u0018\u0004 \u0000\u0098\u0004\u0000\u00A8\u0004\u0001\u00A2\u0006\u0005admin\u00AA\u0006*cloud-2c2c062a-9d10-4d3b-a2e5-6ac9af08a4c4\u0082\u00A6\u001D\npool.proto\u008A\u00A6\u001D\u0004Pool"}"
What did you expect to see?
When I Unmarshal the above blob, I expect to see this:
"obj": "\n)pool-6b9bad47-65ac-477c-a32a-6fd5c7a98c91\x12\ttest-pool(P@\x01H\nP\x00Z2healthmonitor-665dbe8c-aecb-49df-97b1-a87214e65270Z2healthmonitor-f6052e41-616d-46d6-989d-0fdfe0c5a403b1\n\x11\n\r10.160.161.11\x10\x00\x10P\x1a\r10.160.161.11 \x01(\x01X\x00h\x00r\x00\x80\x01\x00h\x01\xa0\x01\x01\xb0\x02\x01\xb8\x02\x00\xc2\x02\x02\x08\x03\xd0\x02\x00\xf0\x02\x00\xa2\x03/vrfcontext-bfec9a97-7df7-4cfd-bd6d-ee84415ed1ab\xb0\x03\n\xb8\x03\x01\xe8\x03\x00\xf0\x03\x80\x01\x82\x04\x06\x08\x00\x18\x04 \x00\x98\x04\x00\xa8\x04\x01\xa2\x06\x05admin\xaa\x06*cloud-2c2c062a-9d10-4d3b-a2e5-6ac9af08a4c4\x82\xa6\x1d\npool.proto\x8a\xa6\x1d\x04Pool"
All Unicode characters are decoded to corresponding hex versions correctly.
What did you see instead?
"obj": "\n)pool-6b9bad47-65ac-477c-a32a-6fd5c7a98c91\x12\ttest-pool(P@\x01H\nP\x00Z2healthmonitor-665dbe8c-aecb-49df-97b1-a87214e65270Z2healthmonitor-f6052e41-616d-46d6-989d-0fdfe0c5a403b1\n\x11\n\r10.160.161.11\x10\x00\x10P\x1a\r10.160.161.11 \x01(\x01X\x00h\x00r\x00\u0080\x01\x00h\x01\u00a0\x01\x01°\x02\x01¸\x02\x00Â\x02\x02\b\x03Ð\x02\x00ð\x02\x00¢\x03/vrfcontext-bfec9a97-7df7-4cfd-bd6d-ee84415ed1ab°\x03\n¸\x03\x01è\x03\x00ð\x03\u0080\x01\u0082\x04\x06\b\x00\x18\x04 \x00\u0098\x04\x00¨\x04\x01¢\x06\x05adminª\x06*cloud-2c2c062a-9d10-4d3b-a2e5-6ac9af08a4c4\u0082¦\x1d\npool.proto\u008a¦\x1d\x04Pool"
Only Unicode characters below \u0080 are decoded to the corresponding hex versions correctly. The rest are kept as is (look for \u in the above string, whereas in the correct case all of these should be decoded to the relevant hex values). Does the Unmarshal method in encoding/json decode only input that fits in 2 bytes?
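If the blob cannot be re-encoded at the source, one possible workaround is to map each decoded code point back to a single byte. This is a sketch under a loud assumption: the producer wrote every raw byte as the code point of the same value (a Latin-1 style mapping), which the report suggests but does not confirm. latin1Bytes is a hypothetical helper, not part of encoding/json.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// latin1Bytes is a hypothetical helper. It assumes every code point in s
// stands for one raw byte of the same value and fails on anything above U+00FF.
func latin1Bytes(s string) ([]byte, error) {
	out := make([]byte, 0, len(s))
	for _, r := range s {
		if r > 0xFF {
			return nil, fmt.Errorf("code point %U does not fit in one byte", r)
		}
		out = append(out, byte(r))
	}
	return out, nil
}

func main() {
	// The snippet quoted in the thread, as a JSON string literal.
	lit := `"\u0018\u0004 \u0000\u0098\u0004\u0000\u00A8"`
	var s string
	if err := json.Unmarshal([]byte(lit), &s); err != nil {
		fmt.Println("error:", err)
		return
	}
	raw, err := latin1Bytes(s)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("decoded string bytes: % x\n", s)
	fmt.Printf("recovered raw bytes:  % x\n", raw)
}
```

The second line of output shows the byte sequence the report expected, while the first shows the UTF-8 bytes that Unmarshal actually produces.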