Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regexp/syntax: \p{} should support unicode properties #10851

Closed
cyisfor opened this issue May 14, 2015 · 9 comments
Closed

regexp/syntax: \p{} should support unicode properties #10851

cyisfor opened this issue May 14, 2015 · 9 comments

Comments

@cyisfor
Copy link

cyisfor commented May 14, 2015

The "unicode.Properties" table seems arbitrarily left out of the parser in regexp/syntax/parse.go

Unicode properties are the most character neutral way I know to match on (or split on) whitespace, so there's no reason they shouldn't be considered. There also are properties for ideographs, hexadecimal numbers, radicals, quotation marks, and other things that would be really useful to match on. Since it just adds categories to the \u{...} syntax, it won't step on anyone's existing regular expressions. Since unicode properties in Go are implemented as unicode.RangeTables, just like categories and scripts already supported by regular expression matching, it wouldn't make the engine any slower than it is already.

So, here's the patch. It compiled and all tests passed, including the ones I added to match unicode properties. I'd appreciate if you could apply this, so other people could benefit from it too.

diff --git a/src/regexp/syntax/parse.go b/src/regexp/syntax/parse.go
index d579a40..d4f91f0 100644
--- a/src/regexp/syntax/parse.go
+++ b/src/regexp/syntax/parse.go
@@ -1436,6 +1436,9 @@ func unicodeTable(name string) (*unicode.RangeTable, *unicode.RangeTable) {
    if t := unicode.Scripts[name]; t != nil {
        return t, unicode.FoldScript[name]
    }
+   if t := unicode.Properties[name]; t != nil {
+       return t, unicode.FoldScript[name]
+   }
    return nil, nil
 }

diff --git a/src/regexp/syntax/parse_test.go b/src/regexp/syntax/parse_test.go
index c4a1117..e14ba9a 100644
--- a/src/regexp/syntax/parse_test.go
+++ b/src/regexp/syntax/parse_test.go
@@ -105,6 +105,9 @@ var parseTests = []parseTest{
    {`[\P{Braille}]`, `cc{0x0-0x27ff 0x2900-0x10ffff}`},
    {`[\p{^Braille}]`, `cc{0x0-0x27ff 0x2900-0x10ffff}`},
    {`[\P{^Braille}]`, `cc{0x2800-0x28ff}`},
+   {`[\p{White_Space}]`,`cc{0x9-0xd 0x20 0x85 0xa0 0x1680 0x2000-0x200a 0x2028-0x2029 0x202f 0x205f 0x3000}`},
+   {`[\p{^White_Space}]`,`cc{0x0-0x8 0xe-0x1f 0x21-0x84 0x86-0x9f 0xa1-0x167f 0x1681-0x1fff 0x200b-0x2027 0x202a-0x202e 0x2030-0x205e 0x2060-0x2fff 0x3001-0x10ffff}`},
+   {`[\P{^White_Space}]`,`cc{0x9-0xd 0x20 0x85 0xa0 0x1680 0x2000-0x200a 0x2028-0x2029 0x202f 0x205f 0x3000}`},
    {`[\pZ]`, `cc{0x20 0xa0 0x1680 0x2000-0x200a 0x2028-0x2029 0x202f 0x205f 0x3000}`},
    {`\p{Lu}`, mkCharClass(unicode.IsUpper)},
    {`[\p{Lu}]`, mkCharClass(unicode.IsUpper)},
@minux minux changed the title regexp should match on unicode properties too regexp/syntax: \p{} should support unicode properties May 14, 2015
@minux minux added the Thinking label May 14, 2015
@minux minux added this to the Go1.6 milestone May 14, 2015
@minux
Copy link
Member

minux commented May 14, 2015 via email

@cyisfor
Copy link
Author

cyisfor commented May 14, 2015

I'm coming in from the go side, not the RE2 side, so I'm not as familiar with RE2 (though I did take a look). I'm not sure if the Space property is equivalent to the Zs character class. Perhaps it's redundant, and I didn't realize it? Apparantly neither the "Space" property nor Zs are the whitespace absolutely everyone agrees on outside the unicode consortium, since they don't include newlines or tabs.

@minux
Copy link
Member

minux commented May 14, 2015 via email

@rsc
Copy link
Contributor

rsc commented Oct 14, 2015

In general the Unicode package does not have the complete property database, and I don't think it makes sense to add it. The unicode.Properties symbol is only a small subset. Especially given that RE2 does not support this either, I think we can leave well enough alone.

@rsc rsc closed this as completed Oct 14, 2015
@s4y
Copy link

s4y commented Feb 25, 2016

Both the RE2 docs and the Go docs say that properties are supported when you use \p in a character class, but it doesn't seem to be implemented. Here's the snippet from the docs, it appears in both places:

[\p{Name}]named Unicode property inside character class (≡ \p{Name})
[^\p{Name}]named Unicode property inside negated character class (≡ \P{Name})

…but the implementation, here, just uses unicodeTable() which only looks at unicode.Categories and unicode.Scripts. Is this a bug?

@s4y
Copy link

s4y commented Feb 25, 2016

…just for context, I'm writing a grammar and want to use properties like ID_Start and ID_Continue to match identifiers.

@ianlancetaylor
Copy link
Contributor

@Sidnicious This issue is closed. For questions, please ask on a forum; see https://golang.org/wiki/Questoins . Thanks.

@s4y
Copy link

s4y commented Feb 25, 2016

@ianlancetaylor This is a bug report, and I apologize for wording it more like a question. I linked to specific docs and source code in the Go standard library which disagrees with the docs. I just bumped this one because it's the same issue. I'll submit a fresh one.

@ianlancetaylor
Copy link
Contributor

@Sidnicious Yes, in general, if you want to report a bug, please open a new issue, rather than commenting on one that is closed. Thanks.

@golang golang locked and limited conversation to collaborators Feb 28, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

6 participants