regexp: match additional Unicode properties #14509
Comments
Can you provide a small self-contained test case showing the problem? Thanks. |
CC @mpvl |
Sure thing.

    package main

    import (
        "log"
        "os"
        "regexp"
    )

    func main() {
        matched, err := regexp.MatchString(`[\p{ASCII_Hex_Digit}]`, "A")
        if err != nil {
            log.Fatal(err)
        }
        if !matched {
            os.Exit(1)
        }
    }

The above program should exit cleanly but prints an error instead:
|
FWIW, the Unicode support in RE2 is also categories and scripts only. (google/re2@a6b34ea recently added the ability to build against ICU in order to have full Unicode properties support.) |
In principle I don't see an issue with extending unicodeTable to also check for references in Properties. There are a few drawbacks:
If we want to go the same route as Google re2, this means pulling in ALL Unicode binary property data. This adds up (all properties marked as binary in http://userguide.icu-project.org/strings/properties). My first estimate is that there are about 25 missing tables, roughly double what is in unicode.Properties now. A very wild guess is that supporting this would add about 30k worth of tables total. Not the worst, but still 30k.
So the possible solutions are:
It all depends on what is considered to be an acceptable table increase. My first wild estimate is that supporting all properties would add about 30k in tables for anyone using the regexp packages. Also, people using unicode.Properties would be saddled with 20k of additional (but possibly useful and welcome) data.
I think option 1 is the best choice, as it is the simplest and makes the sets of Properties defined in unicode a bit more comprehensive and consistent. |
At the very least, the documentation should be updated to clearly reflect the valid values for A plugin approach might be an option, similar to the way that |
Current behavior is intended (scripts and categories only) but maybe we should reconsider. Will chat with @mpvl. |
The docs (and the RE2 docs) describe how to match Unicode properties by using \p{…} inside a character class:
The current implementation doesn't seem to do it, though. The code which matches \p uses a single unicodeTable() helper function which only looks at unicode.Categories and unicode.Scripts. There probably needs to be a separate helper, or a flag passed to this one, which makes it look at properties.