unicode: does not document that ZERO WIDTH NO-BREAK SPACE (\uFEFF) is not considered whitespace #42274

Open
sethvargo opened this issue Oct 29, 2020 · 9 comments
Labels
Documentation NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made.
Milestone

Comments

@sethvargo
Contributor

What version of Go are you using (go version)?

$ go version
go version go1.15.3 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE="on"
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/sethvargo/Library/Caches/go-build"
GOENV="/Users/sethvargo/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/sethvargo/Development/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/sethvargo/Development/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/Users/sethvargo/.homebrew/Cellar/go/1.15.3/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/sethvargo/.homebrew/Cellar/go/1.15.3/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/cs/jc9pj94x493gb8jr49ys7cnc00gy5b/T/go-build188500672=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

https://play.golang.org/p/V3JHSB7kQX9

package main

import (
	"fmt"
	"strings"
	"unicode"
)

func main() {
	s := "hi\uFEFF"
	fmt.Println(len(s))
	
	s = strings.TrimSpace(s)
	fmt.Println(len(s))
	
	fmt.Printf("%t", unicode.IsSpace('\uFEFF'))
}

What did you expect to see?

5
2
true

What did you see instead?

5
5
false
@ianlancetaylor ianlancetaylor changed the title ZERO WIDTH NO-BREAK SPACE (\uFEFF) not considered whitespace unicode: ZERO WIDTH NO-BREAK SPACE (\uFEFF) not considered whitespace Oct 29, 2020
@ianlancetaylor
Contributor

As documented at https://golang.org/pkg/unicode/#IsSpace, this is determined by Unicode. Unicode character feff is not in the "space" category. The characters in that category can be found at http://www.fileformat.info/info/unicode/category/Zs/list.htm. So this seems like an issue to raise with the Unicode Consortium.
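The classification can be checked against the tables exported by the unicode package. The following is a minimal sketch; the helper name `classify` is hypothetical. The package documentation describes IsSpace in terms of Unicode's White Space property, so the sketch prints that property alongside the Zs (space separator) and Cf (format) general categories.

```go
package main

import (
	"fmt"
	"unicode"
)

// classify is a hypothetical helper that summarizes how the unicode
// package's tables classify a single rune.
func classify(r rune) string {
	return fmt.Sprintf("%U IsSpace=%t White_Space=%t Zs=%t Cf=%t",
		r,
		unicode.IsSpace(r),
		unicode.Is(unicode.White_Space, r),
		unicode.Is(unicode.Zs, r),
		unicode.Is(unicode.Cf, r))
}

func main() {
	// SPACE, NO-BREAK SPACE, ZERO WIDTH SPACE, ZERO WIDTH NO-BREAK SPACE.
	for _, r := range []rune{' ', '\u00A0', '\u200B', '\uFEFF'} {
		fmt.Println(classify(r))
	}
}
```

With the tables shipped in current Go releases, U+200B and U+FEFF should report Cf=true and false for everything else, which is why TrimSpace leaves them in place.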

@sethvargo
Contributor Author

@ianlancetaylor would you be open to a docs update to clarify this? I understand it's the spec, but I don't expect most Go developers to have completely read and understood the latest Unicode spec. The character has the word "space" in its name, and developers would incorrectly assume that TrimSpace removes it. Adding something like the following to IsSpace could save a future developer a lot of time without much maintenance overhead for the Go team:

Despite their name, the characters ZERO WIDTH SPACE (\u200B) and ZERO WIDTH NO-BREAK SPACE (\uFEFF) are not classified as space characters in Unicode.

@ianlancetaylor
Contributor

CC @mpvl for thoughts.

@ALTree
Member

ALTree commented Oct 29, 2020

For reference, there are 71 unicode characters that have "SPACE" in their name but for which IsSpace returns false. 62 of them actually have "MONOSPACE" (e.g. 0x1d670 MATHEMATICAL MONOSPACE CAPITAL A); the other 9 are:

0x1361 ETHIOPIC WORDSPACE
0x200b ZERO WIDTH SPACE
0x2408 SYMBOL FOR BACKSPACE
0x2420 SYMBOL FOR SPACE
0x303f IDEOGRAPHIC HALF FILL SPACE
0xfeff ZERO WIDTH NO-BREAK SPACE
0x1da7f SIGNWRITING LOCATION-WALLPLANE SPACE
0x1da80 SIGNWRITING LOCATION-FLOORPLANE SPACE
0xe0020 TAG SPACE
package main

import (
	"fmt"
	"strings"
	"unicode"

	"golang.org/x/text/unicode/runenames"
)

func main() {
	for r := rune(0); r < unicode.MaxRune; r++ {
		name := runenames.Name(r)
		if !unicode.IsSpace(r) && strings.Contains(name, "SPACE") {
			fmt.Printf("%#0x %s\n", r, name)
		}
	}
}

@dmitshur dmitshur added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Oct 30, 2020
@dmitshur dmitshur added this to the Backlog milestone Oct 30, 2020
@dmitshur dmitshur changed the title unicode: ZERO WIDTH NO-BREAK SPACE (\uFEFF) not considered whitespace unicode: does not document that ZERO WIDTH NO-BREAK SPACE (\uFEFF) is not considered whitespace Oct 30, 2020
@sethvargo
Contributor Author

I'm not suggesting we enumerate all of them, but ZERO WIDTH NO-BREAK SPACE is especially problematic because it frequently appears if you copy a value from Microsoft Excel 😐

@kodawah

kodawah commented Feb 8, 2021

I came to the issue tracker because I ran into a similar problem with \u00a0, the non-breaking space, in some text parsed from an HTML page. Having a bit more documentation would help. Instead of listing any particular value (because there are too many use cases), how about replacing "as defined by Unicode" with "as described by unicode.IsSpace" and adding a link to https://golang.org/pkg/unicode/#IsSpace next to it?

@ghost

ghost commented May 7, 2021

I was hit by this problem. The Go core library gives us strings.TrimSpace, but \u200b makes this function look almost useless.

We have to wrap it with strings.Replace so that strings.TrimSpace really trims spaces, including \u200b.
I guess this problem will hit everyone sooner or later, and they will need to know how to deal with the \u200b found in real-world input.
Boom!
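One way to do this in a single pass, without a separate strings.Replace call, is strings.TrimFunc with a predicate that extends unicode.IsSpace. This is a minimal sketch: the helper names are hypothetical, and the choice of extra characters (U+200B and U+FEFF) is an assumption based on the characters mentioned in this thread.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isSpaceOrZeroWidth reports whether r is Unicode white space or one of
// the two zero-width characters discussed in this issue. The extra set
// is an assumption; extend it to match your own input.
func isSpaceOrZeroWidth(r rune) bool {
	return unicode.IsSpace(r) || r == '\u200B' || r == '\uFEFF'
}

// trimSpaceAndZeroWidth is a hypothetical helper, not a standard
// library function. It trims leading and trailing runes only.
func trimSpaceAndZeroWidth(s string) string {
	return strings.TrimFunc(s, isSpaceOrZeroWidth)
}

func main() {
	s := "\uFEFF hi \u200B"
	fmt.Println(len(s), len(trimSpaceAndZeroWidth(s))) // prints 10 2
}
```

Like TrimSpace, this only touches the ends of the string; zero-width characters in the middle are left alone.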

@ghost

ghost commented May 7, 2021

I like the idea of

var (
    ExtraCutset = fmt.Sprintf("%v", '\uFEFF')
)
func trim(s string) string {
	return strings.Trim(strings.TrimSpace(s), ExtraCutset)
}

@TommyLeng

I like the idea of

var (
    ExtraCutset = fmt.Sprintf("%v", '\uFEFF')
)
func trim(s string) string {
	return strings.Trim(strings.TrimSpace(s), ExtraCutset)
}

This needs a slight change in order to work.

https://go.dev/play/p/aBITQorgZfm
