regexp/syntax: \p{} should support unicode properties #10851

cyisfor · 2015-05-14T04:56:28Z

The "unicode.Properties" table seems arbitrarily left out of the parser in regexp/syntax/parse.go

Unicode properties are the most character neutral way I know to match on (or split on) whitespace, so there's no reason they shouldn't be considered. There also are properties for ideographs, hexadecimal numbers, radicals, quotation marks, and other things that would be really useful to match on. Since it just adds categories to the \u{...} syntax, it won't step on anyone's existing regular expressions. Since unicode properties in Go are implemented as unicode.RangeTables, just like categories and scripts already supported by regular expression matching, it wouldn't make the engine any slower than it is already.

So, here's the patch. It compiled and all tests passed, including the ones I added to match unicode properties. I'd appreciate if you could apply this, so other people could benefit from it too.

diff --git a/src/regexp/syntax/parse.go b/src/regexp/syntax/parse.go
index d579a40..d4f91f0 100644
--- a/src/regexp/syntax/parse.go
+++ b/src/regexp/syntax/parse.go
@@ -1436,6 +1436,9 @@ func unicodeTable(name string) (*unicode.RangeTable, *unicode.RangeTable) {
    if t := unicode.Scripts[name]; t != nil {
        return t, unicode.FoldScript[name]
    }
+   if t := unicode.Properties[name]; t != nil {
+       return t, unicode.FoldScript[name]
+   }
    return nil, nil
 }

diff --git a/src/regexp/syntax/parse_test.go b/src/regexp/syntax/parse_test.go
index c4a1117..e14ba9a 100644
--- a/src/regexp/syntax/parse_test.go
+++ b/src/regexp/syntax/parse_test.go
@@ -105,6 +105,9 @@ var parseTests = []parseTest{
    {`[\P{Braille}]`, `cc{0x0-0x27ff 0x2900-0x10ffff}`},
    {`[\p{^Braille}]`, `cc{0x0-0x27ff 0x2900-0x10ffff}`},
    {`[\P{^Braille}]`, `cc{0x2800-0x28ff}`},
+   {`[\p{White_Space}]`,`cc{0x9-0xd 0x20 0x85 0xa0 0x1680 0x2000-0x200a 0x2028-0x2029 0x202f 0x205f 0x3000}`},
+   {`[\p{^White_Space}]`,`cc{0x0-0x8 0xe-0x1f 0x21-0x84 0x86-0x9f 0xa1-0x167f 0x1681-0x1fff 0x200b-0x2027 0x202a-0x202e 0x2030-0x205e 0x2060-0x2fff 0x3001-0x10ffff}`},
+   {`[\P{^White_Space}]`,`cc{0x9-0xd 0x20 0x85 0xa0 0x1680 0x2000-0x200a 0x2028-0x2029 0x202f 0x205f 0x3000}`},
    {`[\pZ]`, `cc{0x20 0xa0 0x1680 0x2000-0x200a 0x2028-0x2029 0x202f 0x205f 0x3000}`},
    {`\p{Lu}`, mkCharClass(unicode.IsUpper)},
    {`[\p{Lu}]`, mkCharClass(unicode.IsUpper)},

minux · 2015-05-14T05:14:00Z

It seems RE2 also doesn't support this. /cc @rsc to decide.

cyisfor · 2015-05-14T05:23:14Z

I'm coming in from the go side, not the RE2 side, so I'm not as familiar with RE2 (though I did take a look). I'm not sure if the Space property is equivalent to the Zs character class. Perhaps it's redundant, and I didn't realize it? Apparantly neither the "Space" property nor Zs are the whitespace absolutely everyone agrees on outside the unicode consortium, since they don't include newlines or tabs.

minux · 2015-05-14T05:25:50Z

I talked about RE2 because Go's regexp packages are largely modeled on RE2. so if Go introduce a new syntax, it should also be propagated to RE2.

rsc · 2015-10-14T17:36:39Z

In general the Unicode package does not have the complete property database, and I don't think it makes sense to add it. The unicode.Properties symbol is only a small subset. Especially given that RE2 does not support this either, I think we can leave well enough alone.

s4y · 2016-02-25T13:32:31Z

Both the RE2 docs and the Go docs say that properties are supported when you use \p in a character class, but it doesn't seem to be implemented. Here's the snippet from the docs, it appears in both places:

`[\p{Name}]`	named Unicode property inside character class (≡ `\p{Name}`)
`[^\p{Name}]`	named Unicode property inside negated character class (≡ `\P{Name}`)

…but the implementation, here, just uses unicodeTable() which only looks at unicode.Categories and unicode.Scripts. Is this a bug?

s4y · 2016-02-25T13:34:22Z

…just for context, I'm writing a grammar and want to use properties like ID_Start and ID_Continue to match identifiers.

ianlancetaylor · 2016-02-25T17:05:36Z

@Sidnicious This issue is closed. For questions, please ask on a forum; see https://golang.org/wiki/Questoins . Thanks.

s4y · 2016-02-25T17:55:33Z

@ianlancetaylor This is a bug report, and I apologize for wording it more like a question. I linked to specific docs and source code in the Go standard library which disagrees with the docs. I just bumped this one because it's the same issue. I'll submit a fresh one.

ianlancetaylor · 2016-02-25T18:53:58Z

@Sidnicious Yes, in general, if you want to report a bug, please open a new issue, rather than commenting on one that is closed. Thanks.

minux changed the title ~~regexp should match on unicode properties too~~ regexp/syntax: \p{} should support unicode properties May 14, 2015

minux added the Thinking label May 14, 2015

minux added this to the Go1.6 milestone May 14, 2015

rsc closed this as completed Oct 14, 2015

golang locked and limited conversation to collaborators Feb 28, 2017

gopherbot added the FrozenDueToAge label Feb 28, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

regexp/syntax: \p{} should support unicode properties #10851

regexp/syntax: \p{} should support unicode properties #10851

cyisfor commented May 14, 2015

minux commented May 14, 2015 via email

cyisfor commented May 14, 2015

minux commented May 14, 2015 via email

rsc commented Oct 14, 2015

s4y commented Feb 25, 2016

s4y commented Feb 25, 2016

ianlancetaylor commented Feb 25, 2016

s4y commented Feb 25, 2016

ianlancetaylor commented Feb 25, 2016

regexp/syntax: \p{} should support unicode properties #10851

regexp/syntax: \p{} should support unicode properties #10851

Comments

cyisfor commented May 14, 2015

minux commented May 14, 2015 via email

cyisfor commented May 14, 2015

minux commented May 14, 2015 via email

rsc commented Oct 14, 2015

s4y commented Feb 25, 2016

s4y commented Feb 25, 2016

ianlancetaylor commented Feb 25, 2016

s4y commented Feb 25, 2016

ianlancetaylor commented Feb 25, 2016