unicode/utf8: Valid returns true for strings with unicode.ReplacementChar #21975

Closed
reusee opened this issue Sep 22, 2017 · 2 comments

reusee commented Sep 22, 2017

What did you do?

https://play.golang.org/p/yvUrxynoku

package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

func main() {
	s := []byte(string([]rune{unicode.ReplacementChar}))
	fmt.Printf("%v\n", utf8.Valid(s))
}

What did you expect to see?

false

What did you see instead?

true

System details

go version devel +2d69e9e259 Mon Sep 11 18:44:58 2017 +0000 linux/amd64
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/reus/go"
GORACE=""
GOROOT="/home/reus/gotip"
GOTOOLDIR="/home/reus/gotip/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build416645599=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOROOT/bin/go version: go version devel +2d69e9e259 Mon Sep 11 18:44:58 2017 +0000 linux/amd64
GOROOT/bin/go tool compile -V: compile version devel +2d69e9e259 Mon Sep 11 18:44:58 2017 +0000
uname -sr: Linux 4.12.13-1-ARCH
LSB Version:	1.4
Distributor ID:	Arch
Description:	Arch Linux
Release:	rolling
Codename:	n/a
/usr/lib/libc.so.6: GNU C Library (GNU libc) stable release version 2.26, by Roland McGrath et al.

as commented Sep 22, 2017

My possibly incorrect analysis: this isn't a bug. The Unicode replacement character (U+FFFD) is a valid code point, and its UTF-8 encoding is valid. Strings and []byte preserve invalid UTF-8 sequences because they are byte-based; the replacement happens when you convert to []rune. After that, the UTF-8 is valid, because the invalid bytes have been replaced with the replacement character.
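
A minimal sketch of that behavior, assuming a byte slice that contains the invalid byte 0xff. The helper name roundTrip is hypothetical and only serves the illustration:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// roundTrip (hypothetical helper) converts b to []rune and back to bytes.
// The string conversion keeps the bytes as they are; the []rune conversion
// decodes them, turning each invalid byte into U+FFFD.
func roundTrip(b []byte) []byte {
	return []byte(string([]rune(string(b))))
}

func main() {
	raw := []byte{'a', 0xff, 'b'} // 0xff never appears in valid UTF-8
	fmt.Println(utf8.Valid(raw))  // false

	fixed := roundTrip(raw)
	fmt.Printf("% x\n", fixed)     // 61 ef bf bd 62
	fmt.Println(utf8.Valid(fixed)) // true
}
```

The bytes ef bf bd are the UTF-8 encoding of U+FFFD, which is why utf8.Valid accepts the converted slice.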

@ianlancetaylor

I agree that this is correct behavior according to the docs. There are other ways to check for use of the replacement character. Closing.
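
The comment does not say which other ways are meant. One possible approach, sketched here as an assumption, is to decode rune by rune with utf8.DecodeRune, which returns (utf8.RuneError, 1) for an invalid byte and (U+FFFD, 3) for a properly encoded replacement character. The helper name hasEncodedReplacement is hypothetical:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// hasEncodedReplacement (hypothetical helper) reports whether b contains
// a properly encoded U+FFFD. Invalid bytes also decode to utf8.RuneError,
// but with size 1, so the size check tells the two cases apart.
func hasEncodedReplacement(b []byte) bool {
	for len(b) > 0 {
		r, size := utf8.DecodeRune(b)
		if r == utf8.RuneError && size == 3 {
			return true
		}
		b = b[size:]
	}
	return false
}

func main() {
	encoded := []byte("a\uFFFDb")     // contains an encoded U+FFFD
	invalid := []byte{'a', 0xff, 'b'} // contains an invalid byte

	fmt.Println(utf8.Valid(encoded), hasEncodedReplacement(encoded)) // true true
	fmt.Println(utf8.Valid(invalid), hasEncodedReplacement(invalid)) // false false
}
```

Combining utf8.Valid with a check like this lets a caller reject both invalid input and input that already carries replacement characters.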

@golang golang locked and limited conversation to collaborators Nov 6, 2018