proposal: regexp/syntax: relax named subexpressions (capture groups) name restrictions #60784
Comments
FWIW, you could pre-process the regexp string: map subexpression names to generated valid names, rewrite the regexp, and resolve the names back after matching, during evaluation. |
I am not sure if Go exposes regexp parsing at that level? Or are you suggesting I parse it and replace those myself? I think then I would have to reimplement at least some level of regexp parsing? |
You can parse regexp with regexp ;) https://go.dev/play/p/pYhPKGVby3D |
Oh, I do not think it is so simple. Such replacement would fail on regexp like (contrived) |
If you really want to embed a domain-specific language within regular expressions, you can do so with |
I would prefer not to have to pre-parse regex myself because then I also have to implement my own escaping mechanism and users will have to think about those two layers of syntax. (My understanding is that there is no existing comment syntax in Go regexp I could reuse here and that I could ideally even access comments after parsing. So your suggestion is the same as one by @AlexanderYastrebov and I would like to avoid such string munging. This is why I made this proposal.) |
The thing is that embedding a domain-specific language within regular expressions necessarily involves two layers of syntax; any users that you might have acquired over the past two weeks are already being forced to think about two layers of syntax. Moreover, the correct approach is not to abuse regular expressions, but to enable users to write parsers in a domain-specific language suited to writing parsers. With that in mind, it's unclear why the |
But they are already familiar with regular expressions and I think this is a huge advantage over asking them to learn another language.
I understand if you are reluctant to do so. To me it really looks like something that could enable interesting exploration of new ideas, like struct tags did. Preprocessing the regex is probably the right approach then. But I would really like to avoid having to maintain my own regexp parsing just to find and replace capture group names. Since Go regexp does not support lookbehinds, it looks to me like it is not possible to properly do what @AlexanderYastrebov suggested with just a regexp that also handles regexp escaping (i.e., one that does not replace something which only looks like a capture group but in fact is not one because something is escaped). Am I mistaken here? |
I would handle escaping by looping over the runes: if the rune is |
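A minimal sketch of the pre-processing workaround discussed above, written as a single loop that copies escaped characters and character classes verbatim and renames only real `(?P<name>` groups. The function `rewriteNames` and the generated names `g0`, `g1`, ... are hypothetical, not part of any library. The sketch walks bytes rather than runes (all the syntax characters involved are ASCII) and does not handle `\Q...\E` quoting or POSIX classes such as `[[:alpha:]]`.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// rewriteNames (hypothetical helper) replaces each (?P<name> with a generated
// valid name and returns the rewritten pattern plus a map from the generated
// name back to the original one.
func rewriteNames(pattern string) (string, map[string]string) {
	names := map[string]string{}
	var out strings.Builder
	inClass := false
	for i := 0; i < len(pattern); i++ {
		c := pattern[i]
		switch {
		case c == '\\' && i+1 < len(pattern):
			// Escaped character: copy the backslash and the next byte as-is.
			out.WriteString(pattern[i : i+2])
			i++
		case inClass:
			// Inside [...] nothing is a group; just look for the closing bracket.
			if c == ']' {
				inClass = false
			}
			out.WriteByte(c)
		case c == '[':
			inClass = true
			out.WriteByte(c)
			// A ']' directly after '[' or '[^' is a literal, not the end of the class.
			j := i + 1
			if j < len(pattern) && pattern[j] == '^' {
				j++
			}
			if j < len(pattern) && pattern[j] == ']' {
				out.WriteString(pattern[i+1 : j+1])
				i = j
			}
		case strings.HasPrefix(pattern[i:], "(?P<"):
			end := strings.IndexByte(pattern[i:], '>')
			if end < 0 {
				out.WriteByte(c) // Unterminated name: let regexp.Compile report it.
				continue
			}
			gen := fmt.Sprintf("g%d", len(names))
			names[gen] = pattern[i+4 : i+end]
			out.WriteString("(?P<" + gen + ">")
			i += end // Skip past the original name and its closing '>'.
		default:
			out.WriteByte(c)
		}
	}
	return out.String(), names
}

func main() {
	// The escaped \(\?P<x> is literal text and must not be renamed.
	pattern, names := rewriteNames(`(?P<foo.bar>\d+) \(\?P<x>`)
	re := regexp.MustCompile(pattern)
	m := re.FindStringSubmatch("42 (?P<x>")
	for i, gen := range re.SubexpNames() {
		if gen != "" {
			fmt.Printf("%s = %s\n", names[gen], m[i])
		}
	}
}
```

After matching, `re.SubexpNames()` yields the generated names, and the returned map resolves them back to the original expressions.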
re2 allows more unicode now: google/re2@6a99418 |
Neat. So based on Go documentation which says "More precisely, it is the syntax accepted by RE2", does this mean Go should also follow this re2 change? |
Go identifiers may contain Unicode letters and digits, so permitting more than just ASCII for capturing group names would be consistent with that. It wouldn't be "[allowing] all characters except |
I also wrote "but I would be also OK with less relaxation". :-) So those Unicode categories from re2 seem fine, but I would suggest also adding Ps, Pe, Pd, Po, and Sm. Would that be too much? |
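To make the categories in this comment concrete, here is a rough sketch of what such a check could look like using Go's `unicode` package. The function `isRelaxedCaptureName` is hypothetical and is not the check used by Go or RE2; letters, digits, and underscore stand in for the RE2 categories, which are not reproduced exactly here. Note that `>` is itself in category Sm, so it has to be excluded explicitly.

```go
package main

import (
	"fmt"
	"unicode"
)

// isRelaxedCaptureName is a hypothetical validator: letters, digits, and
// underscore, plus the Ps, Pe, Pd, Po, and Sm categories named in the
// comment above, but never '>'.
func isRelaxedCaptureName(name string) bool {
	if name == "" {
		return false
	}
	for _, r := range name {
		if r == '>' {
			return false // '>' is in Sm but must still terminate the name.
		}
		if r == '_' || unicode.IsLetter(r) || unicode.IsDigit(r) {
			continue
		}
		if !unicode.In(r, unicode.Ps, unicode.Pe, unicode.Pd, unicode.Po, unicode.Sm) {
			return false
		}
	}
	return true
}

func main() {
	for _, name := range []string{
		"foo.bar",
		"foo[]",
		`date("2006-01-02T15:04:05Z07:00")`,
		"Europe/Ljubljana",
		"a b", // A space is category Zs and is rejected.
	} {
		fmt.Printf("%q: %v\n", name, isRelaxedCaptureName(name))
	}
}
```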
Are you able to articulate why? Python identifiers are based on UAX #31, for example, and PEP 3131 goes into further details. If you have serious, technical reasons to permit additional Unicode punctuation and symbols, please share them. |
Oh, this is so that expressions (not just identifiers) would also be possible, which is what this issue is primarily about. I am just saying that I am proposing to open it up to some Unicode categories and not necessarily "[allowing] all characters except |
For the record, I thoroughly disagree that this would be nice. Moreover, I consider inventing a new kind of regular expression crime to constitute an argument against, not for. Having said that, although there's no reason to make this change upstream, you can always fork the |
Hm, the whole idea of this issue is to use existing regular expression language and familiarity people have instead of them having to learn some custom domain specific parsing language? So I am not sure what new kind of regular expression is here? Anyway, thank you for clearly stating your disagreement. |
Hmm? I said that it's inventing a new kind of regular expression crime. |
OK, I think there is some language barrier here for me. Anyway. Thanks. |
As an alternative to embedding a DSL into the regexp, the tool could receive the expressions out of band, like: regex2json "(?P<adatetime>.+)" -e adatetime=here.goes.your.dsl.expression |
I have been noticeably bitten by the existing behavior of treating |
Great point, @seebs. The proposal would have implications for template expansion that have so far not been considered. |
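For context on the template expansion point, a small sketch of how `regexp` templates read names today. The documentation of `Regexp.Expand` says a name in a template is a non-empty sequence of letters, digits, and underscores, and that in the `$name` form the name is taken to be as long as possible. This is not necessarily the exact case described in the comment above; the helper `expand` is only for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

var re = regexp.MustCompile(`(?P<word>\w+)`)

// expand (illustrative helper) replaces the single match in "abc" using the
// given template.
func expand(template string) string {
	return re.ReplaceAllString("abc", template)
}

func main() {
	// "$word_x" is read as the name "word_x", which does not exist,
	// so it expands to the empty string.
	fmt.Printf("%q\n", expand("$word_x"))
	// Braces delimit the name explicitly.
	fmt.Printf("%q\n", expand("${word}_x"))
	// '.' is not a name character today, so the name ends at "word".
	// If '.' were allowed in names, this template would become ambiguous.
	fmt.Printf("%q\n", expand("$word.x"))
}
```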
This proposal has been added to the active column of the proposals project |
The discussion here is maybe a little more heated than it needs to be. In general my approach is to try to (1) keep Go and RE2 in sync and (2) make them both a useful approximation to the intersection of other regexp libraries, limited to what's actually used. Embedding a new programming language inside the name tags seems like stretching them much farther than they were ever intended. I don't think that use case by itself would justify a change. The top comment said:
I am not sure what exact incompatibility is meant here. It appears that this refers only to the documented differences in the comment quoted above it, namely:
Those differences are quite a lot smaller than what is being proposed. PCRE accepts "2z" as a name, so Go does. Python accepts the 33-byte name "a23456789012345678901234567890123", so Go does. To my knowledge, neither accepts the much more expansive names contemplated here. |
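A quick sketch that checks these claims against the current `regexp` package. The helper `compiles` is hypothetical; the names tested are the ones mentioned in the comment above and in the proposal.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles (hypothetical helper) reports whether Go accepts the pattern.
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	patterns := []string{
		`(?P<2z>a)`, // Name starting with a digit.
		`(?P<a23456789012345678901234567890123>a)`, // 33-byte name.
		`(?P<foo.bar>a)`,                           // Proposed: dots.
		`(?P<foo[]>a)`,                             // Proposed: brackets.
	}
	for _, p := range patterns {
		fmt.Printf("%-45s accepted: %v\n", p, compiles(p))
	}
}
```

The first two patterns compile, while the two proposed forms are rejected because the name contains characters other than ASCII letters, digits, and underscore.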
Based on the discussion above, this proposal seems like a likely decline. |
No change in consensus, so declined. |
Currently, the Go regular expression language supports named subexpressions (also known as named capture groups), i.e., `(?P<name>re)`. The topic of this proposal is the restrictions on `name`. I have not found anything documented, but from looking at the code it looks like parsing takes everything from `<` up to the next `>` and then validates `name` using `isValidCaptureName`, which has comment:
I would like to suggest that this check be relaxed and that Go allow all characters except `>` (but I would also be OK with less relaxation). Capture names are already not fully compatible with PCRE or Python, so I think they could be relaxed further.
Motivation
I made a simple tool to convert text to JSON by providing a regexp. How this conversion happens is provided as the name of the capture group. The basic idea is that `(?P<foo>.*)` would create a field `foo` in JSON with the matched value. But I also want some transformations of matched values (parsing ints, floats, dates, supporting arrays). For that I had to use a very awkward syntax with `__` (double underscore) to separate arguments and `___` (triple underscore) to separate operators. E.g., `(?P<foo__bar___int>.*)` would parse the value into an int and store it into `{"foo": {"bar": <int>}}`. I think some standard syntax where I could use dots like `(?P<foo.bar>.*)`, arrays like `(?P<foo[]>.*)`, and parentheses and arguments like `(?P<date("2006-01-02T15:04:05Z07:00")>.*)` would be much nicer. The last example also shows another issue with the current restrictions on names: I cannot really pass an arbitrary date parsing layout; I can support only predefined ones. Similarly, I cannot pass a location for time parsing as `Europe/Ljubljana` because `/` is not allowed.
I know this may look like a niche use case, but to me the idea really opened a new way of working with data. Similarly to how struct tags enable various ways of converting data into structs, regexp could allow that too, so that both what text to extract and how to map it to a struct could be in the same string (which can then be passed to the program as a CLI argument).