proposal: regexp/syntax: relax named subexpressions (capture groups) name restrictions #60784

Closed
mitar opened this issue Jun 14, 2023 · 26 comments

@mitar
Contributor

mitar commented Jun 14, 2023

Currently, Go's regular expression language supports named subexpressions (also known as named capture groups), i.e., (?P<name>re). This proposal concerns the restrictions on name. I have not found them documented, but from the code it looks like the parser takes everything from < up to the next > and then validates the name using isValidCaptureName, which has this comment:

// isValidCaptureName reports whether name
// is a valid capture name: [A-Za-z0-9_]+.
// PCRE limits names to 32 bytes.
// Python rejects names starting with digits.
// We don't enforce either of those.

I would like to suggest that this check be relaxed so that Go allows all characters except > (though I would also be OK with less relaxation). Capture names are already not fully compatible with either PCRE or Python, so I think they could be relaxed further.
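For readers who want to see the current restriction in action, here is a minimal sketch. The helper name compiles is made up for this illustration; it only wraps regexp.Compile and reports whether the pattern was accepted.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles is a hypothetical helper: it reports whether the standard
// regexp package accepts pattern.
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	for _, p := range []string{
		`(?P<foo_bar>.*)`, // name uses only [A-Za-z0-9_]
		`(?P<foo.bar>.*)`, // name contains '.'
		`(?P<foo[]>.*)`,   // name contains '[' and ']'
	} {
		fmt.Printf("%-18s compiles: %v\n", p, compiles(p))
	}
}
```

Only the first pattern should compile; the other two fail the capture name check described above.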

Motivation

I made a simple tool to convert text to JSON by providing a regexp. How the conversion happens is specified in the name of the capture group. The basic idea is that (?P<foo>.*) creates a field foo in the JSON with the matched value. But I also want some transformations of matched values (parsing ints, floats, dates, supporting arrays). For that I had to use a very awkward syntax with __ (double underscore) to separate arguments and ___ (triple underscore) to separate operators. E.g., (?P<foo__bar___int>.*) would parse the value as an int and store it into {"foo": {"bar": <int>}}. I think a standard syntax where I could use dots like (?P<foo.bar>.*), arrays like (?P<foo[]>.*), and parentheses with arguments like (?P<date("2006-01-02T15:04:05Z07:00")>.*) would be much nicer. The last example also shows another issue with the current restrictions on names: I cannot pass an arbitrary date parsing layout and can support only predefined ones. Similarly, I cannot pass a location for time parsing such as Europe/Ljubljana because / is not allowed.

I know this may look like a niche use case, but to me the idea really opened a new way of working with data. Just as struct tags enable various ways of converting data into structs, regexp could allow both what text to extract and how to map it to a struct to live in the same string (which can then be passed to the program as a CLI argument).

@mitar mitar added the Proposal label Jun 14, 2023
@gopherbot gopherbot added this to the Proposal milestone Jun 14, 2023
@AlexanderYastrebov
Contributor

I think some standard syntax where I could use dots like (?P<foo.bar>.*) and arrays like (?P<foo[]>.*) and parenthesis and arguments like (?P<date("2006-01-02T15:04:05Z07:00")>.*) would be much nicer.

FWIW, you could pre-process the regexp string: map the subexpressions to generated valid names, rewrite the regexp, and resolve the names back after matching.
E.g., for (?P<foo.bar>.*) (?P<baz.qux>.*) you would rewrite it into (?P<subexp_0>.*) (?P<subexp_1>.*) and keep the mapping {"subexp_0": "foo.bar", "subexp_1": "baz.qux"}. After matching the regexp against the input, map subexp_0 and subexp_1 back to foo.bar and baz.qux respectively and evaluate them as you please.
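The "resolve names back" half of this suggestion could look roughly like the sketch below. The function name matchWithOriginalNames is invented for this illustration, and the rewritten pattern and mapping are assumed to have been produced already by some earlier rewriting step.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchWithOriginalNames is a hypothetical helper. It matches input against
// an already rewritten pattern and returns the captured values keyed by the
// original names, looked up through mapping (generated name -> original name).
// It returns a nil map when the pattern does not match.
func matchWithOriginalNames(pattern string, mapping map[string]string, input string) (map[string]string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	m := re.FindStringSubmatch(input)
	if m == nil {
		return nil, nil
	}
	out := make(map[string]string)
	for i, generated := range re.SubexpNames() {
		// Unnamed groups (and the whole match at index 0) have an empty
		// name and are skipped because they are not in the mapping.
		if original, ok := mapping[generated]; ok {
			out[original] = m[i]
		}
	}
	return out, nil
}

func main() {
	mapping := map[string]string{"subexp_0": "foo.bar", "subexp_1": "baz.qux"}
	got, err := matchWithOriginalNames(`(?P<subexp_0>.*) (?P<subexp_1>.*)`, mapping, "hello world")
	fmt.Println(got, err)
}
```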

@mitar
Contributor Author

mitar commented Jun 14, 2023

I am not sure if Go exposes regexp parsing at that level? Or are you suggesting I parse it and replace those myself? I think then I would have to reimplement at least some level of regexp parsing?

@AlexanderYastrebov
Contributor

I am not sure if Go exposes regexp parsing at that level?

You can parse regexp with regexp ;) https://go.dev/play/p/pYhPKGVby3D

@mitar
Contributor Author

mitar commented Jun 14, 2023

Oh, I do not think it is so simple. Such a replacement would fail on a (contrived) regexp like foo\?P<bar>baz, which does not use a capture group at all: it escapes the ?, so it simply matches a static string. But your code would replace bar with something else, so the regexp would no longer match that string.
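To make the objection concrete, here is a small sketch. It does not reproduce the linked playground code; naiveGroup and naiveRewrite are hypothetical stand-ins for any rewriter that searches for text resembling ?P<name> without tracking backslash escapes.

```go
package main

import (
	"fmt"
	"regexp"
)

// naiveGroup is a hypothetical pattern that finds anything resembling a
// named group header, with no awareness of backslash escapes.
var naiveGroup = regexp.MustCompile(`\?P<[^>]+>`)

// naiveRewrite replaces every apparent group name with a generated one.
func naiveRewrite(pattern string) string {
	i := 0
	return naiveGroup.ReplaceAllStringFunc(pattern, func(string) string {
		s := fmt.Sprintf("?P<subexp_%d>", i)
		i++
		return s
	})
}

// matches reports whether pattern compiles and matches input.
func matches(pattern, input string) bool {
	re, err := regexp.Compile(pattern)
	return err == nil && re.MatchString(input)
}

func main() {
	original := `foo\?P<bar>baz` // no capture group here: \? is a literal '?'
	rewritten := naiveRewrite(original)
	input := "foo?P<bar>baz"

	fmt.Println("rewritten:", rewritten)
	fmt.Println("original matches: ", matches(original, input))
	fmt.Println("rewritten matches:", matches(rewritten, input))
}
```

The original pattern matches the literal input, while the rewritten one no longer does, because the literal text bar was replaced.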

@junyer
Contributor

junyer commented Jun 18, 2023

If you really want to embed a domain-specific language within regular expressions, you can do so with (**) comments that your package recognises and removes before calling into the regexp package.

@mitar
Contributor Author

mitar commented Jun 21, 2023

I would prefer not to have to pre-parse regex myself because then I also have to implement my own escaping mechanism and users will have to think about those two layers of syntax.

(My understanding is that Go regexp has no existing comment syntax that I could reuse here, ideally one whose comments I could still access after parsing. So your suggestion is the same as the one by @AlexanderYastrebov, and I would like to avoid such string munging. This is why I made this proposal.)

@junyer
Contributor

junyer commented Jun 21, 2023

The thing is that embedding a domain-specific language within regular expressions necessarily involves two layers of syntax; any users that you might have acquired over the past two weeks are already being forced to think about two layers of syntax.

Moreover, the correct approach is not to abuse regular expressions, but to enable users to write parsers in a domain-specific language suited to writing parsers. With that in mind, it's unclear why the regexp package should facilitate such a use case.

@mitar
Contributor Author

mitar commented Jun 21, 2023

Moreover, the correct approach is not to abuse regular expressions, but to enable users to write parsers in a domain-specific language suited to writing parsers.

But they are already familiar with regular expressions and I think this is a huge advantage over asking them to learn another language.

With that in mind, it's unclear why the regexp package should facilitate such a use case.

I understand if you are reluctant to do so. To me it really looks like something which can enable interesting exploration of new ideas. Like how struct tags did.

Probably preprocessing the regex is the right approach then. But I would really like to avoid having to maintain my own regexp parsing just to find and replace capture group names. Since Go's regexp does not support lookbehinds, it looks to me like it is not possible to properly do what @AlexanderYastrebov suggested with just regexp in a way that also handles regexp escaping (i.e., does not replace something that merely looks like a capture group but is not one because part of it is escaped). Am I mistaken here?

@junyer
Contributor

junyer commented Jun 21, 2023

I would handle escaping by looping over the runes: if the rune is \, consume and output the rune and the next rune; else if the rune is ( and the next runes are ?P<, consume the runes and the following runes until (and including) >, then output (?P<, the generated capturing group name and >; else, consume and output the rune.
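A minimal sketch of the loop described above follows. The function name rewriteNames and the generated subexp_N naming scheme are assumptions for this illustration. Like the description, it only tracks backslash escapes, so it would still misread (?P< inside a character class or a \Q...\E span.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// rewriteNames is a hypothetical helper. It replaces each (?P<name> header
// in pattern with (?P<subexp_N> and returns the rewritten pattern together
// with a map from generated names to the original ones.
func rewriteNames(pattern string) (string, map[string]string) {
	var out strings.Builder
	names := make(map[string]string)
	for i := 0; i < len(pattern); {
		switch {
		case pattern[i] == '\\':
			// Copy the backslash and the rune it escapes, untouched.
			_, size := utf8.DecodeRuneInString(pattern[i+1:])
			end := i + 1 + size
			out.WriteString(pattern[i:end])
			i = end
		case strings.HasPrefix(pattern[i:], "(?P<"):
			j := strings.IndexByte(pattern[i:], '>')
			if j < 0 {
				// Unterminated name: copy the rest and let regexp report it.
				out.WriteString(pattern[i:])
				i = len(pattern)
				break
			}
			generated := fmt.Sprintf("subexp_%d", len(names))
			names[generated] = pattern[i+len("(?P<") : i+j]
			out.WriteString("(?P<" + generated + ">")
			i += j + 1
		default:
			// Any other rune is copied as is.
			_, size := utf8.DecodeRuneInString(pattern[i:])
			out.WriteString(pattern[i : i+size])
			i += size
		}
	}
	return out.String(), names
}

func main() {
	rewritten, names := rewriteNames(`(?P<foo.bar>.*) (?P<baz.qux>.*)`)
	fmt.Println(rewritten)
	fmt.Println(names)

	// The escaped example from earlier in the thread is left alone.
	unchanged, _ := rewriteNames(`foo\?P<bar>baz`)
	fmt.Println(unchanged)
}
```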

@seankhliao seankhliao changed the title proposal: regexp/syntax: Relax named subexpressions (capture groups) name restrictions proposal: regexp/syntax: relax named subexpressions (capture groups) name restrictions Jun 22, 2023
@seankhliao
Member

re2 allows more unicode now: google/re2@6a99418

@mitar
Contributor Author

mitar commented Jun 22, 2023

Neat. So based on Go documentation which says "More precisely, it is the syntax accepted by RE2", does this mean Go should also follow this re2 change?

@junyer
Contributor

junyer commented Jun 23, 2023

Go identifiers may contain Unicode letters and digits, so permitting more than just ASCII for capturing group names would be consistent with that. It wouldn't be "[allowing] all characters except >", which still seems to be what's being proposed here.

@mitar
Contributor Author

mitar commented Jun 23, 2023

I also wrote "but I would be also OK with less relaxation". :-)

So those Unicode categories from re2 seem fine, but I would suggest also adding Ps, Pe, Pd, Po, and Sm. Would that be too much?

@junyer
Contributor

junyer commented Jun 23, 2023

Are you able to articulate why? Python identifiers are based on UAX #31, for example, and PEP 3131 goes into further details. If you have serious, technical reasons to permit additional Unicode punctuation and symbols, please share them.

@mitar
Contributor Author

mitar commented Jun 23, 2023

Oh, this is so that expressions (not just identifiers) would also be possible, which is what this issue is primarily about. I am just saying that I am proposing to open it up to some Unicode categories and not necessarily "[allowing] all characters except >", but yes, it would be nice if expressions were possible as well (as I already argued above).

@junyer
Contributor

junyer commented Jun 23, 2023

For the record, I thoroughly disagree that this would be nice. Moreover, I consider inventing a new kind of regular expression crime to constitute an argument against, not for. Having said that, although there's no reason to make this change upstream, you can always fork the regexp package into your project and make this change there.

@mitar
Contributor Author

mitar commented Jun 23, 2023

inventing a new kind of regular expression

Hm, the whole idea of this issue is to use existing regular expression language and familiarity people have instead of them having to learn some custom domain specific parsing language? So I am not sure what new kind of regular expression is here?

Anyway, thank you for clearly stating your disagreement.

@junyer
Contributor

junyer commented Jun 23, 2023

Hmm? I said that it's inventing a new kind of regular expression crime.

@mitar
Contributor Author

mitar commented Jun 24, 2023

OK, I think there is some language barrier here for me. Anyway. Thanks.

@AlexanderYastrebov
Contributor

AlexanderYastrebov commented Jun 26, 2023

@mitar

I made a simple tool to convert text to JSON by providing a regexp.

As an alternative to embedding a DSL in the regexp, the tool could receive the expressions out of band, like:

regex2json "(?P<adatetime>.+)" -e adatetime=here.goes.your.dsl.expression

@seebs
Contributor

seebs commented Jun 27, 2023

I have been noticeably bitten by the existing behavior of treating $2z as a named subreference rather than as subreference 2 plus a literal z. I don't care much what the rules are for what goes inside ${...}, but I would very much like to not make the behavior of things after a $ any more aggressive than it already is.
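The behavior described here can be reproduced with a short sketch. The helper expand and its fixed pattern and input are made up for this illustration; the template rules themselves are the ones documented for Regexp.Expand.

```go
package main

import (
	"fmt"
	"regexp"
)

// expand is a hypothetical helper: it applies template as the replacement
// for a fixed two-group pattern matched against the input "ab-cd".
func expand(template string) string {
	re := regexp.MustCompile(`(\w+)-(\w+)`)
	return re.ReplaceAllString("ab-cd", template)
}

func main() {
	// $2z is read as the name "2z", which no group has, so it expands to "".
	fmt.Printf("%q\n", expand("$2z"))
	// Braces end the name explicitly: group 2 followed by a literal z.
	fmt.Printf("%q\n", expand("${2}z"))
}
```

Allowing more characters in names would widen what an unbraced $name can swallow, which is the concern raised above.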

@junyer
Contributor

junyer commented Jun 28, 2023

Great point, @seebs. The proposal would have implications for template expansion that have so far not been considered.

@rsc
Contributor

rsc commented Jul 5, 2023

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc
Contributor

rsc commented Jul 5, 2023

The discussion here is maybe a little more heated than it needs to be.

In general my approach is to try to (1) keep Go and RE2 in sync and (2) make them both a useful approximation to the intersection of other regexp libraries, limited to what's actually used.

Embedding a new programming language inside the name tags seems like stretching them much farther than they were ever intended. I don't think that use case by itself would justify a change.

The top comment said:

Capture names are already not fully compatible with PCRE nor Python, so I think they could be relaxed further.

I am not sure what exact incompatibility is meant here. It appears that this refers only to the documented differences in the comment quoted above it, namely:

// isValidCaptureName reports whether name
// is a valid capture name: [A-Za-z0-9_]+.
// PCRE limits names to 32 bytes.
// Python rejects names starting with digits.
// We don't enforce either of those.

Those differences are quite a lot smaller than what is being proposed. PCRE accepts "2z" as a name, so Go does. Python accepts the 33-byte name "a23456789012345678901234567890123", so Go does. To my knowledge, neither accepts the much more expansive names contemplated here.
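Both of those names can be checked against Go directly. In this sketch the helper captureNames is made up for the illustration; it compiles a pattern and returns the names of its capture groups.

```go
package main

import (
	"fmt"
	"regexp"
)

// captureNames is a hypothetical helper: it compiles pattern and returns the
// names of its capture groups (unnamed groups appear as empty strings).
func captureNames(pattern string) ([]string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	// Index 0 of SubexpNames is the whole match and is always unnamed.
	return re.SubexpNames()[1:], nil
}

func main() {
	for _, p := range []string{
		`(?P<2z>.)`, // name starts with a digit
		`(?P<a23456789012345678901234567890123>.)`, // 33-byte name
	} {
		names, err := captureNames(p)
		fmt.Println(names, err)
	}
}
```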

@rsc
Contributor

rsc commented Jul 12, 2023

Based on the discussion above, this proposal seems like a likely decline.
— rsc for the proposal review group

@rsc
Contributor

rsc commented Jul 19, 2023

No change in consensus, so declined.
— rsc for the proposal review group

Status: Declined

7 participants