x/text: support API for Unicode word breaking and word extraction (Annex #29) #17256

Open
nightlyone opened this issue Sep 27, 2016 · 2 comments
Labels
help wanted, NeedsInvestigation (someone must examine and confirm this is a valid issue and not a duplicate of an existing one)

@nightlyone
Contributor

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

go version go1.7 linux/amd64

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/ioe/sources/go"
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/user/1000/go-build353744209=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"

What did you do?

I am trying to split text at word boundaries as described at http://unicode.org/reports/tr29/#Word_Boundaries, and to extract words from strings as described in the same document.

What did you expect to see?

Given the sentence "The quick (“brown”) fox can’t jump 32.3 feet, right?"

  • detecting word boundaries at all places marked with "|"
The| |quick| |(|“|brown|”|)| |fox| |can’t| |jump| |32.3| |feet|,| |right|?
  • support word extraction to a string slice with the following content
words := []string{"The", "quick", "brown", "fox", "can’t", "jump", "32.3", "feet", "right"}

What did you see instead?

That depends on the API used. For example, strings.Fields (see https://play.golang.org/p/dhJtlR-b3w) returns:

[]string{"The", "quick", "(“brown”)", "fox", "can’t", "jump", "32.3", "feet,", "right?"}

Note: Proper test vectors are here: http://www.unicode.org/Public/UCD/latest/ucd/auxiliary/WordBreakTest.txt

An implementation using Ruby magic and a state machine generated by Ragel can be found here: github.com/blevesearch/segment
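To make the requested word extraction more concrete, here is a rough sketch using only the standard library. It is not an implementation of Annex #29 and would not pass WordBreakTest.txt: the helper names (approxWords, isWordRune, isMid) are hypothetical, and the only rule it approximates is that an apostrophe or full stop stays inside a word when it sits between two letters or digits. Scripts without spaces, combining marks, emoji sequences and the remaining rules of the annex are ignored.

```go
package main

import (
	"fmt"
	"unicode"
)

// isWordRune reports whether r can be part of a word on its own.
func isWordRune(r rune) bool {
	return unicode.IsLetter(r) || unicode.IsDigit(r)
}

// isMid reports whether r may join two word runes. This is a crude
// stand-in for the MidNumLet idea in Annex #29, not the real property.
func isMid(r rune) bool {
	return r == '\'' || r == '’' || r == '.'
}

// approxWords is a hypothetical helper that extracts word-like runs of
// letters and digits, keeping a single joining rune between them.
func approxWords(s string) []string {
	rs := []rune(s)
	var words []string
	start := -1 // index of the first rune of the current word, or -1
	for i, r := range rs {
		inWord := isWordRune(r) ||
			(isMid(r) && start >= 0 && i+1 < len(rs) && isWordRune(rs[i+1]))
		switch {
		case inWord && start < 0:
			start = i
		case !inWord && start >= 0:
			words = append(words, string(rs[start:i]))
			start = -1
		}
	}
	if start >= 0 {
		words = append(words, string(rs[start:]))
	}
	return words
}

func main() {
	fmt.Printf("%q\n", approxWords("The quick (“brown”) fox can’t jump 32.3 feet, right?"))
}
```

For the example sentence in this issue the sketch yields the nine words listed under "What did you expect to see?". It does not report the boundary positions themselves, which is the other half of the request and the part that really needs the Unicode property tables.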

@nightlyone
Contributor Author

Do we have any conceptual progress here after 5 years or can this be closed as won't do or out of scope?

@ianlancetaylor added the help wanted and NeedsInvestigation labels on Jun 3, 2021
@ianlancetaylor
Contributor

This is a feature that should be added to x/text somewhere, but as far as I know nobody is working on it. There is no conceptual progress, but we should not close it. I've marked this as "help wanted". Perhaps somebody will volunteer to work on it.

@rsc unassigned mpvl on Jun 23, 2022