regexp/syntax: \p{} should support unicode properties #10851
Comments
It seems RE2 also doesn't support this.
/cc @rsc to decide.
I'm coming in from the Go side, not the RE2 side, so I'm not as familiar with RE2 (though I did take a look). I'm not sure whether the Space property is equivalent to the Zs character class. Perhaps it's redundant and I didn't realize it? Apparently neither the "Space" property nor Zs is the whitespace that absolutely everyone outside the Unicode Consortium agrees on, since they don't include newlines or tabs.
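For anyone comparing the two, membership can be checked directly with the standard `unicode` package. This is only a minimal sketch; `classify` is a hypothetical helper name, not something from the thread or the library. As far as I can tell from the standard tables, `unicode.White_Space` does cover tab and newline, while `unicode.Zs` (space separators) does not.

```go
package main

import (
	"fmt"
	"unicode"
)

// classify is a hypothetical helper: it reports whether r is in the
// White_Space property table and whether it is in the Zs category table.
func classify(r rune) (whiteSpace, zs bool) {
	return unicode.Is(unicode.White_Space, r), unicode.Is(unicode.Zs, r)
}

func main() {
	// Space, tab, newline, no-break space, line separator.
	for _, r := range []rune{' ', '\t', '\n', '\u00a0', '\u2028'} {
		ws, zs := classify(r)
		fmt.Printf("%U White_Space=%t Zs=%t\n", r, ws, zs)
	}
}
```

So the two tables overlap but are not equivalent, which is why the property would not be redundant with `\p{Zs}`.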
I brought up RE2 because Go's regexp packages are largely modeled on RE2, so if Go introduces new syntax, it should also be propagated to RE2.
In general the unicode package does not have the complete property database, and I don't think it makes sense to add it. The unicode.Properties symbol is only a small subset. Especially given that RE2 does not support this either, I think we can leave well enough alone.
Both the RE2 docs and the Go docs say that properties are supported when you use
…but the implementation, here, just uses |
…just for context, I'm writing a grammar and want to use properties like |
@Sidnicious This issue is closed. For questions, please ask on a forum; see https://golang.org/wiki/Questions . Thanks.
@ianlancetaylor This is a bug report, and I apologize for wording it more like a question. I linked to specific docs and to source code in the Go standard library that disagrees with those docs. I just bumped this one because it's the same issue. I'll submit a fresh one.
@Sidnicious Yes, in general, if you want to report a bug, please open a new issue, rather than commenting on one that is closed. Thanks.
The "unicode.Properties" table seems arbitrarily left out of the parser in regexp/syntax/parse.go
Unicode properties are the most character-neutral way I know to match on (or split on) whitespace, so there's no reason they shouldn't be considered. There are also properties for ideographs, hexadecimal digits, radicals, quotation marks, and other things that would be really useful to match on. Since this just adds names to the \p{...} syntax, it won't step on anyone's existing regular expressions. Since Unicode properties in Go are implemented as unicode.RangeTables, just like the categories and scripts already supported by regular expression matching, it wouldn't make the engine any slower than it is already.
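To illustrate the point about range tables, here is a rough sketch (not the patch itself) of resolving a name against the three exported maps in the `unicode` package and using the result to split a string. `lookupTable` and `splitOn` are hypothetical names made up for this example; only `unicode.Categories`, `unicode.Scripts`, `unicode.Properties`, `unicode.Is`, and `strings.FieldsFunc` come from the standard library.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// lookupTable is a hypothetical helper. It resolves a name against the
// category, script, and property maps in turn; all three map names to
// *unicode.RangeTable values, so the caller treats them identically.
func lookupTable(name string) (*unicode.RangeTable, bool) {
	for _, m := range []map[string]*unicode.RangeTable{
		unicode.Categories, unicode.Scripts, unicode.Properties,
	} {
		if t, ok := m[name]; ok {
			return t, true
		}
	}
	return nil, false
}

// splitOn (also hypothetical) splits s at every run of runes that belong
// to the named table, without going through the regexp package.
func splitOn(s, name string) ([]string, error) {
	t, ok := lookupTable(name)
	if !ok {
		return nil, fmt.Errorf("unknown Unicode table %q", name)
	}
	isSep := func(r rune) bool { return unicode.Is(t, r) }
	return strings.FieldsFunc(s, isSep), nil
}

func main() {
	// Tab, ideographic space (U+3000), and newline are all White_Space.
	fields, err := splitOn("a\tb\u3000c\nd", "White_Space")
	fmt.Println(fields, err)
}
```

Until the regexp parser accepts property names, something along these lines works as a stopgap for splitting on property membership, though it obviously can't be combined with other regexp syntax.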
So, here's the patch. It compiled and all tests passed, including the ones I added to match Unicode properties. I'd appreciate it if you could apply this, so other people can benefit from it too.