x/text: support API for Unicode word breaking and word extraction (Annex #29) #17256
Labels: help wanted, NeedsInvestigation
Please answer these questions before submitting your issue. Thanks!
What version of Go are you using (go version)?
go version go1.7 linux/amd64
What operating system and processor architecture are you using (go env)?
What did you do?
I am trying to split text at word boundaries as described at http://unicode.org/reports/tr29/#Word_Boundaries, and to extract words from strings as described in the same document.
What did you expect to see?
Given the sentence "The quick (“brown”) fox can’t jump 32.3 feet, right?"
What did you see instead?
That depends on the API used. An example with strings.Fields at https://play.golang.org/p/dhJtlR-b3w displays:
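For reference, here is a minimal sketch of a whitespace-only split of the sentence. It is not necessarily the code behind the playground link, and the helper name whitespaceTokens is made up for illustration. It only shows that strings.Fields splits on white space and knows nothing about Annex #29 word boundaries.

```go
package main

import (
	"fmt"
	"strings"
)

// sample is the sentence quoted in this issue.
const sample = "The quick (“brown”) fox can’t jump 32.3 feet, right?"

// whitespaceTokens is a hypothetical helper: it simply wraps strings.Fields,
// which splits around runs of Unicode white space and nothing else.
func whitespaceTokens(s string) []string {
	return strings.Fields(s)
}

func main() {
	// Print each token on its own line so attached punctuation is visible.
	for _, tok := range whitespaceTokens(sample) {
		fmt.Println(tok)
	}
}
```

Punctuation stays glued to the neighbouring word, so tokens such as (“brown”), feet, (with its comma) and right? come back as single fields instead of being separated at word boundaries.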
Note: Proper test vectors are here: http://www.unicode.org/Public/UCD/latest/ucd/auxiliary/WordBreakTest.txt
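Each data line in that file lists hexadecimal code points separated by ÷ (a break is allowed) or × (no break), followed by a # comment. As a rough sketch of how such a line could be turned into expected segments for a test, assuming that layout: the helper name parseBreakTest is hypothetical, and the input line in main is written by hand in the file's format rather than copied from it.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseBreakTest is a hypothetical helper that converts one line in the
// WordBreakTest.txt layout into the segments the line says to expect.
func parseBreakTest(line string) ([]string, error) {
	// Everything after '#' is a human-readable comment.
	if i := strings.IndexByte(line, '#'); i >= 0 {
		line = line[:i]
	}
	var segments []string
	var cur strings.Builder
	for _, field := range strings.Fields(line) {
		switch field {
		case "÷": // break allowed: close the current segment, if any
			if cur.Len() > 0 {
				segments = append(segments, cur.String())
				cur.Reset()
			}
		case "×": // no break: keep appending to the current segment
		default: // a code point written in hexadecimal
			cp, err := strconv.ParseUint(field, 16, 32)
			if err != nil {
				return nil, fmt.Errorf("bad code point %q: %w", field, err)
			}
			cur.WriteRune(rune(cp))
		}
	}
	// Tolerate a line that does not end with a break marker.
	if cur.Len() > 0 {
		segments = append(segments, cur.String())
	}
	return segments, nil
}

func main() {
	// Hand-written illustrative line: "a'a" followed by a space.
	segs, err := parseBreakTest("÷ 0061 × 0027 × 0061 ÷ 0020 ÷\t# illustrative only")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%q\n", segs)
}
```

A word-breaking API could then be checked by joining the segments into one input string and comparing its output against the parsed segments.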
An implementation using Ruby magic and a state machine generated by Ragel can be found here: github.com/blevesearch/segment