Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regexp: match additional Unicode properties #14509

Open
s4y opened this issue Feb 25, 2016 · 7 comments
Open

regexp: match additional Unicode properties #14509

s4y opened this issue Feb 25, 2016 · 7 comments
Labels
early-in-cycle A change that should be done early in the 3 month dev cycle. FeatureRequest NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made.
Milestone

Comments

@s4y
Copy link

s4y commented Feb 25, 2016

The docs (and the RE2 docs) describe how to match Unicode properties by using \p{…} inside a character class:

[snip]

[\p{Name}]named Unicode property inside character class (≡ \p{Name})
[^\p{Name}]named Unicode property inside negated character class (≡ \P{Name})

[snip]

The current implementation doesn't seem to do it, though. The code which matches \p uses a single unicodeTable() helper function which only looks at unicode.Categories and unicode.Scripts. There probably needs to be a separate helper, or a flag passed to this one, which makes it look at properties.

@ianlancetaylor
Copy link
Contributor

Can you provide a small self-contained test case showing the problem? Thanks.

@ianlancetaylor ianlancetaylor added this to the Go1.7 milestone Feb 25, 2016
@ianlancetaylor
Copy link
Contributor

CC @mpvl

@s4y
Copy link
Author

s4y commented Feb 25, 2016

Sure thing.

package main

import (
    "log"
    "os"
    "regexp"
)

func main() {
    matched, err := regexp.MatchString(`[\p{ASCII_Hex_Digit}]`, "A")
    if err != nil {
        log.Fatal(err)
    }
    if !matched {
        os.Exit(1)
    }
}

The above program should exit cleanly but prints an error instead:

error parsing regexp: invalid character class range: `\p{ASCII_Hex_Digit}``

@junyer
Copy link
Contributor

junyer commented Feb 26, 2016

FWIW, the Unicode support in RE2 is also categories and scripts only. (google/re2@a6b34ea recently added the ability to build against ICU in order to have full Unicode properties support.)

@mpvl
Copy link
Contributor

mpvl commented Feb 26, 2016

In principle I don't see an issue with extending unicodetable to also check for references in Properties. There are a few drawbacks:

  1. it will unconditionally pull in additional data (about ~10k as a very rough first estimate)
  2. it doesn't stop there: there are many more property lists that are not in the core unicode package and would be weird not to support if the existing ones are supported (e.g. Alphabetic and Math are derived properties and are currently not defined in the unicode package).

If we want to go the same route as Google re2, this means pulling in ALL unicode binary property data. This adds up (all properties marked as binary in http://userguide.icu-project.org/strings/properties). My first estimate is that there are about 25 missing tables, roughly double what is in unicode.Properties now. A very wild guess is that supporting this would add about 30k worth of tables total. Not the worst, but still 30k.

So the possible solutions are:

  1. Included Properties and extend it to include the missing tables.
  2. Included Properties and compute the tables for derived properties other tables on the fly.
  3. implement a mechanism for users to add more tables.
  4. Don't support Properties at all.

It all depends what is considered to be an acceptable table increase. My first wild estimate is supporting all properties would add about 30k in tables to anyone using the regexp packages. Also people using unicode.Properties would be saddled up with 20k of additional (but possibly useful and welcome) data.

I think 1 is the best choice as it is the simplest and makes the sets of Properties defined in unicode a bit more comprehensive and consistent.

@rsc rsc modified the milestones: Go1.8, Go1.7 May 18, 2016
@rsc rsc changed the title regexp: Matching Unicode properties with [\p{...}] not supported even though docs say it is regexp: match additional Unicode properties May 18, 2016
@quentinmit quentinmit added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Oct 10, 2016
@rsc rsc modified the milestones: Go1.9Early, Go1.8 Oct 18, 2016
@bradfitz bradfitz modified the milestones: Go1.10Early, Go1.9Early May 3, 2017
@alercah
Copy link

alercah commented May 21, 2017

At the very least, the documentation should be updated to clearly reflect the valid values for \p, regardless of the implementation decision.

A plugin approach might be an option, similar to the way that image packages work, so that you only need to pull in the data for properties you care about. For instance, say, load emoji properties in unicode/properties/emoji, and only allow them to work if you load regexp/properties/emoji in your binary.

@rsc
Copy link
Contributor

rsc commented May 22, 2017

Current behavior is intended (scripts and categories only) but maybe we should reconsider. Will chat with @mpvl.

@bradfitz bradfitz added early-in-cycle A change that should be done early in the 3 month dev cycle. and removed early-in-cycle A change that should be done early in the 3 month dev cycle. labels Jun 14, 2017
@bradfitz bradfitz modified the milestones: Go1.10Early, Go1.10 Jun 14, 2017
@bradfitz bradfitz modified the milestones: Go1.10, Go1.11 Nov 29, 2017
@bradfitz bradfitz modified the milestones: Go1.11, Unplanned May 18, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
early-in-cycle A change that should be done early in the 3 month dev cycle. FeatureRequest NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made.
Projects
None yet
Development

No branches or pull requests

8 participants