regexp/syntax: document that \b and \B are ASCII-only #5896

knuesel · 2013-07-16T16:17:10Z

Matching word boundaries with '\b' does not work when the first or last character in the
word is a non-ASCII character (multi-byte in UTF-8) such as 'é'.

Example:

http://play.golang.org/p/1to3IN9Mnf


What is the expected output?
Matching should succeed in all cases.

What do you see instead?
Matching fails when the string has "é" at the word boundary.

Which compiler are you using (5g, 6g, 8g, gccgo)?
6g


Which operating system are you using?
Debian Squeeze

Which version are you using?  (run 'go version')
go version go1.1.1 linux/amd64

rsc · 2013-07-16T17:42:30Z

Comment 1:

This is intentional: \b and \B are ASCII-only. Making them full Unicode
would require too much lookahead/lookbehind if we ever want to make a
faster byte-at-a-time matcher. This is the same tradeoff made by RE2. I
will update the regexp/syntax package doc.
Russ
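
For readers who cannot open the playground link, here is a minimal sketch of the behavior described above. It is not the original playground program; the helper name wholeWord and the sample strings are assumptions made for illustration. It uses only regexp.MustCompile, regexp.QuoteMeta and MatchString from the standard library.

```go
package main

import (
	"fmt"
	"regexp"
)

// wholeWord is a hypothetical helper: it reports whether word occurs in text
// with a \b assertion on both sides.
func wholeWord(word, text string) bool {
	re := regexp.MustCompile(`\b` + regexp.QuoteMeta(word) + `\b`)
	return re.MatchString(text)
}

func main() {
	// ASCII-only word: both boundaries are recognized.
	fmt.Println(wholeWord("cafe", "un cafe noir")) // true

	// Word ends in é (U+00E9): é is not an ASCII word character, so there is
	// no \b between é and the following space.
	fmt.Println(wholeWord("caf\u00e9", "un caf\u00e9 noir")) // false

	// Word starts with é: no \b between the preceding space and é.
	fmt.Println(wholeWord("\u00e9t\u00e9", "un \u00e9t\u00e9 chaud")) // false

	// Conversely, \b does hold between é and an ASCII letter, so this
	// matches even though "café" is not a whole word here.
	fmt.Println(wholeWord("caf\u00e9", "caf\u00e9x")) // true
}
```

The last case shows that the ASCII-only rule can produce unexpected matches as well as unexpected failures.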

knuesel · 2013-07-16T18:39:52Z

Comment 3:

I see. The syntax documentation on https://code.google.com/p/re2/wiki/Syntax defines 
\b as "at word boundary (\w on one side and \W, \A, or \z on the other)". Since \w is
defined as "word characters (≡ [0-9A-Za-z_])", I suppose the documentation is already
correct, but drawing attention to this behavior would probably not hurt.
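
The quoted definition can be spelled out by hand, which makes it easier to see why 'é' never counts as part of a word. The following is only a sketch of that definition, not the regexp package's implementation; isASCIIWord and atWordBoundary are hypothetical names.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isASCIIWord reports whether r is in [0-9A-Za-z_], the quoted meaning of \w.
func isASCIIWord(r rune) bool {
	return r == '_' ||
		('0' <= r && r <= '9') ||
		('A' <= r && r <= 'Z') ||
		('a' <= r && r <= 'z')
}

// atWordBoundary reports whether byte offset i in s satisfies the quoted
// definition of \b: \w on exactly one side. The start and end of the text
// count as non-word. i is assumed to fall on a rune boundary.
func atWordBoundary(s string, i int) bool {
	before, after := false, false
	if i > 0 {
		r, _ := utf8.DecodeLastRuneInString(s[:i])
		before = isASCIIWord(r)
	}
	if i < len(s) {
		r, _ := utf8.DecodeRuneInString(s[i:])
		after = isASCIIWord(r)
	}
	return before != after
}

func main() {
	s := "caf\u00e9 noir" // "café noir"; é occupies byte offsets 3 and 4

	fmt.Println(atWordBoundary(s, 0)) // start of text | 'c' -> true
	fmt.Println(atWordBoundary(s, 3)) // 'f' | 'é'           -> true
	fmt.Println(atWordBoundary(s, 5)) // 'é' | ' '           -> false
	fmt.Println(atWordBoundary(s, 6)) // ' ' | 'n'           -> true
}
```

Under this definition the boundary falls between 'f' and 'é', not after 'é', which is why a pattern such as \bcafé\b fails on this input.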

rsc · 2013-07-30T16:53:32Z

Comment 4:

Labels changed: added priority-later, go1.2, removed priority-triage.

Status changed to Accepted.

robpike · 2013-08-08T03:26:48Z

Comment 5:

This issue was closed by revision b4f370c.

Status changed to Fixed.

knuesel added fixed labels Aug 8, 2013

rsc added this to the Go1.2 milestone Apr 14, 2015

rsc removed the go1.2 label Apr 14, 2015

golang locked and limited conversation to collaborators Jun 24, 2016

gopherbot added the FrozenDueToAge label Jun 24, 2016

This issue was closed.
