proposal: regexp/syntax: relax named subexpressions (capture groups) name restrictions #60784

mitar · 2023-06-14T09:15:48Z

Currently, Go's regular expression language supports named subexpressions (also known as named capture groups), i.e., (?P<name>re). This proposal concerns the restrictions on name. I have not found them documented, but from the code it looks like the parser takes everything from < up to the next > and then validates the name using isValidCaptureName, which has this comment:

// isValidCaptureName reports whether name
// is a valid capture name: [A-Za-z0-9_]+.
// PCRE limits names to 32 bytes.
// Python rejects names starting with digits.
// We don't enforce either of those.

I would like to suggest that this check is relaxed and that Go allows all characters except > (but I would be also OK with less relaxation). Capture names are already not fully compatible with PCRE nor Python, so I think they could be relaxed further.
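
The current restriction can be observed directly. The following is a minimal sketch; the helper name `compiles` is made up for illustration, and it only relies on `regexp.Compile` returning an error for names outside `[A-Za-z0-9_]+`.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles reports whether the regexp package accepts the pattern.
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	patterns := []string{
		// Letters, digits, and underscores are accepted.
		`(?P<foo__bar___int>.*)`,
		// A leading digit is accepted by Go (Python would reject it).
		`(?P<2z>.*)`,
		// Dots, brackets, and slashes are rejected.
		`(?P<foo.bar>.*)`,
		`(?P<foo[]>.*)`,
		`(?P<Europe/Ljubljana>.*)`,
	}
	for _, p := range patterns {
		fmt.Printf("%-28s compiles: %v\n", p, compiles(p))
	}
}
```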

Motivation

I made a simple tool to convert text to JSON by providing a regexp. How this conversion happens is specified in the name of the capture group. The basic idea is that (?P<foo>.*) would create a field foo in JSON with the matched value. But I also want some transformations of matched values (parsing ints, floats, dates, supporting arrays). For that I had to use a very awkward syntax with __ (double underscore) to separate arguments and ___ (triple underscore) to separate operators. E.g.: (?P<foo__bar___int>.*) would parse the value into int and store it into {"foo": {"bar": <int>}}. I think some standard syntax where I could use dots like (?P<foo.bar>.*) and arrays like (?P<foo[]>.*) and parenthesis and arguments like (?P<date("2006-01-02T15:04:05Z07:00")>.*) would be much nicer. The last example also shows another issue with the current restrictions on names: I cannot pass an arbitrary date parsing layout; I can support only predefined ones. Similarly, I cannot pass a location for time parsing such as Europe/Ljubljana because / is not allowed.
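
To make the underscore convention concrete, here is a rough sketch of how a name like `foo__bar___int` could be decoded. It is not the tool's actual code: the functions `splitName` and `toJSON` are hypothetical, the grammar (path segments joined by `__`, followed by operators joined by `___`) is an assumption read off the single example above, and only an `int` operator is implemented.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// splitName splits a capture name into a field path and a list of operators.
// Assumed grammar: path segments joined by "__", then operators joined by "___".
func splitName(name string) (path, ops []string) {
	parts := strings.Split(name, "___")
	return strings.Split(parts[0], "__"), parts[1:]
}

// toJSON converts one matched value to nested JSON according to the name.
// Only the "int" operator is implemented in this sketch.
func toJSON(name, matched string) (string, error) {
	path, ops := splitName(name)
	var value any = matched
	for _, op := range ops {
		switch op {
		case "int":
			n, err := strconv.Atoi(matched)
			if err != nil {
				return "", err
			}
			value = n
		default:
			return "", fmt.Errorf("unknown operator %q", op)
		}
	}
	// Wrap the value in one object per path segment, innermost first.
	for i := len(path) - 1; i >= 0; i-- {
		value = map[string]any{path[i]: value}
	}
	b, err := json.Marshal(value)
	return string(b), err
}

func main() {
	out, err := toJSON("foo__bar___int", "42")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```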

I know this maybe looks like a niche use case, but to me the idea really opened a new way of working with data. Similar to how struct tags enable various ways of converting data into structs, regexp could allow that too, so that both what text to extract and how to map it to a struct could be in the same string (which can then be passed to the program as a CLI argument).


AlexanderYastrebov · 2023-06-14T20:39:55Z

> I think some standard syntax where I could use dots like (?P<foo.bar>.*) and arrays like (?P<foo[]>.*) and parenthesis and arguments like (?P<date("2006-01-02T15:04:05Z07:00")>.*) would be much nicer.

FWIW you may pre-process regexp string, map subexpressions to generated valid names, rewrite regexp and resolve names back after matching during evaluation.
E.g. for (?P<foo.bar>.*) (?P<baz.qux>.*) you rewrite it into (?P<subexp_0>.*) (?P<subexp_1>.*) and keep mapping {"subexp_0": "foo.bar", "subexp_1": "baz.qux"} then after matching regexp against input map subexp_0 and subexp_1 back to foo.bar and baz.qux respectively and evaluate them as you please.
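
The "resolve names back" half of this suggestion might look roughly like the sketch below. It assumes the pattern has already been rewritten to use generated names; `matchWithAliases` is a hypothetical helper, and only `FindStringSubmatch` and `SubexpNames` from the standard `regexp` package are used.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchWithAliases matches input against an already rewritten pattern and
// returns the captured values keyed by the original names from aliases.
// It returns nil when the pattern does not match.
func matchWithAliases(pattern string, aliases map[string]string, input string) map[string]string {
	re := regexp.MustCompile(pattern)
	m := re.FindStringSubmatch(input)
	if m == nil {
		return nil
	}
	out := make(map[string]string)
	for i, generated := range re.SubexpNames() {
		// Index 0 is the whole match and has an empty name, so it is skipped.
		if original, ok := aliases[generated]; ok {
			out[original] = m[i]
		}
	}
	return out
}

func main() {
	aliases := map[string]string{"subexp_0": "foo.bar", "subexp_1": "baz.qux"}
	got := matchWithAliases(`(?P<subexp_0>.*) (?P<subexp_1>.*)`, aliases, "hello world")
	fmt.Println(got["foo.bar"], got["baz.qux"])
}
```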

mitar · 2023-06-14T20:46:05Z

I am not sure if Go exposes regexp parsing at that level? Or are you suggesting I parse it and replace those myself? I think then I would have to reimplement at least some level of regexp parsing?

AlexanderYastrebov · 2023-06-14T21:48:20Z

> I am not sure if Go exposes regexp parsing at that level?

You can parse regexp with regexp ;) https://go.dev/play/p/pYhPKGVby3D

mitar · 2023-06-14T23:02:29Z

Oh, I do not think it is so simple. Such replacement would fail on regexp like (contrived) foo\?P<bar>baz, which does not use capture group at all, and it escapes ? so it simply matches a static string. But your code would replace bar with something else, making the regexp not match at all.

junyer · 2023-06-18T13:43:23Z

If you really want to embed a domain-specific language within regular expressions, you can do so with (* … *) comments that your package recognises and removes before calling into the regexp package.

mitar · 2023-06-21T12:33:50Z

I would prefer not to have to pre-parse regex myself because then I also have to implement my own escaping mechanism and users will have to think about those two layers of syntax.

(My understanding is that there is no existing comment syntax in Go regexp I could reuse here and that I could ideally even access comments after parsing. So your suggestion is the same as one by @AlexanderYastrebov and I would like to avoid such string munging. This is why I made this proposal.)

junyer · 2023-06-21T15:35:07Z

The thing is that embedding a domain-specific language within regular expressions necessarily involves two layers of syntax; any users that you might have acquired over the past two weeks are already being forced to think about two layers of syntax.

Moreover, the correct approach is not to abuse regular expressions, but to enable users to write parsers in a domain-specific language suited to writing parsers. With that in mind, it's unclear why the regexp package should facilitate such a use case.

mitar · 2023-06-21T19:42:21Z

> Moreover, the correct approach is not to abuse regular expressions, but to enable users to write parsers in a domain-specific language suited to writing parsers.

But they are already familiar with regular expressions and I think this is a huge advantage over asking them to learn another language.

> With that in mind, it's unclear why the regexp package should facilitate such a use case.

I understand if you are reluctant to do so. To me it really looks like something which can enable interesting exploration of new ideas. Like how struct tags did.

Probably preprocessing the regex is the right approach then. But I would really like to avoid having to maintain my own regexp parsing just to find/replace capture group names. With Go's regexp not supporting lookbehinds, it looks to me like it is not possible to properly do what @AlexanderYastrebov suggested with just a regexp that would also handle regexp escaping (i.e., not replace something which just looks like a capture group but in fact is not, because something is escaped). Am I mistaken here?

junyer · 2023-06-21T21:39:42Z

I would handle escaping by looping over the runes: if the rune is \, consume and output the rune and the next rune; else if the rune is ( and the next runes are ?P<, consume the runes and the following runes until (and including) >, then output (?P<, the generated capturing group name and >; else, consume and output the rune.
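
The loop described above can be sketched as follows. The function name `rewriteNames` and the generated `subexp_N` naming (borrowed from the earlier comment) are illustrative choices, not part of any library.

```go
package main

import (
	"fmt"
	"strings"
)

// rewriteNames replaces every (?P<name> with (?P<subexp_N> and returns the
// rewritten pattern plus a map from generated names back to the originals.
func rewriteNames(pattern string) (string, map[string]string) {
	runes := []rune(pattern)
	names := make(map[string]string)
	var out strings.Builder
	for i := 0; i < len(runes); {
		// Escape: copy the backslash and the next rune untouched.
		if runes[i] == '\\' && i+1 < len(runes) {
			out.WriteRune(runes[i])
			out.WriteRune(runes[i+1])
			i += 2
			continue
		}
		// Named group: look for "(?P<" and the closing ">".
		if i+3 < len(runes) && string(runes[i:i+4]) == "(?P<" {
			end := -1
			for j := i + 4; j < len(runes); j++ {
				if runes[j] == '>' {
					end = j
					break
				}
			}
			if end >= 0 {
				generated := fmt.Sprintf("subexp_%d", len(names))
				names[generated] = string(runes[i+4 : end])
				out.WriteString("(?P<" + generated + ">")
				i = end + 1
				continue
			}
		}
		// Anything else is copied as is.
		out.WriteRune(runes[i])
		i++
	}
	return out.String(), names
}

func main() {
	p, names := rewriteNames(`(?P<foo.bar>.*) (?P<baz.qux>.*)`)
	fmt.Println(p)
	fmt.Println(names)
}
```

Note that a loop this simple still does not understand character classes: in a pattern such as `[(?P<x>]` the text inside the brackets would be rewritten even though it is not a group. Handling that would need a little more state.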

seankhliao · 2023-06-22T19:52:27Z

re2 allows more unicode now: google/re2@6a99418

mitar · 2023-06-22T20:04:37Z

Neat. So based on Go documentation which says "More precisely, it is the syntax accepted by RE2", does this mean Go should also follow this re2 change?

junyer · 2023-06-23T09:55:39Z

Go identifiers may contain Unicode letters and digits, so permitting more than just ASCII for capturing group names would be consistent with that. It wouldn't be "[allowing] all characters except >", which still seems to be what's being proposed here.

mitar · 2023-06-23T10:13:28Z

I also wrote "but I would be also OK with less relaxation". :-)

So those Unicode categories from re2 seem fine, but I would suggest also adding Ps, Pe, Pd, Po, and Sm. Would that be too much?
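
For a sense of what those five categories would admit, here is a small sketch using Go's `unicode` package. The base rule (letters, digits, underscore) is an assumption standing in for whatever re2 actually permits, and `allowed` and `validName` are hypothetical names.

```go
package main

import (
	"fmt"
	"unicode"
)

// allowed reports whether r passes a hypothetical rule: letters, digits,
// underscore (an assumed base set), plus categories Ps, Pe, Pd, Po, and Sm.
func allowed(r rune) bool {
	return r == '_' || unicode.IsLetter(r) || unicode.IsDigit(r) ||
		unicode.In(r, unicode.Ps, unicode.Pe, unicode.Pd, unicode.Po, unicode.Sm)
}

// validName reports whether every rune of a non-empty name is allowed.
func validName(name string) bool {
	if name == "" {
		return false
	}
	for _, r := range name {
		if !allowed(r) {
			return false
		}
	}
	return true
}

func main() {
	for _, name := range []string{
		`date("2006-01-02T15:04:05Z07:00")`,
		"Europe/Ljubljana",
		"foo[]",
		"a b",
	} {
		fmt.Printf("%-36q %v\n", name, validName(name))
	}
	// '>' is itself in category Sm, so the terminator would still need
	// to be excluded explicitly.
	fmt.Println(allowed('>'))
}
```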

junyer · 2023-06-23T10:57:27Z

Are you able to articulate why? Python identifiers are based on UAX #31, for example, and PEP 3131 goes into further details. If you have serious, technical reasons to permit additional Unicode punctuation and symbols, please share them.

mitar · 2023-06-23T11:20:26Z

Oh, this is so that expressions (not just identifiers) would also be possible, which is what this issue is primarily about. I am just saying that I am proposing to open it up to some Unicode categories and not necessarily "[allowing] all characters except >", but yes, it would be nice if expressions were possible as well (as I argued above already).

junyer · 2023-06-23T13:46:58Z

For the record, I thoroughly disagree that this would be nice. Moreover, I consider inventing a new kind of regular expression crime to constitute an argument against, not for. Having said that, although there's no reason to make this change upstream, you can always fork the regexp package into your project and make this change there.

mitar · 2023-06-23T18:07:38Z

> inventing a new kind of regular expression

Hm, the whole idea of this issue is to use existing regular expression language and familiarity people have instead of them having to learn some custom domain specific parsing language? So I am not sure what new kind of regular expression is here?

Anyway, thank you for clearly stating your disagreement.

junyer · 2023-06-23T19:14:07Z

Hmm? I said that it's inventing a new kind of regular expression crime.

mitar · 2023-06-24T08:40:22Z

OK, I think there is some language barrier here for me. Anyway. Thanks.

AlexanderYastrebov · 2023-06-26T09:15:29Z

@mitar

> I made a simple tool to convert text to JSON by providing a regexp.

Alternatively to embedding DSL into regexp the tool may receive expressions out of band like:

regex2json "(?P<adatetime>.+)" -e adatetime=here.goes.your.dsl.expression
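
One possible shape for the out-of-band approach is sketched below. It is only an illustration: `bind` is a hypothetical helper, the `name=expression` format is taken from the command line above, and flag handling is left out.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// bind parses "name=expression" arguments and checks that each name refers
// to a capture group that exists in the pattern.
func bind(pattern string, args []string) (map[string]string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	exprs := make(map[string]string)
	for _, arg := range args {
		// Split at the first "=" so the expression may contain "=" itself.
		name, expr, ok := strings.Cut(arg, "=")
		if !ok {
			return nil, fmt.Errorf("expected name=expression, got %q", arg)
		}
		if re.SubexpIndex(name) < 0 {
			return nil, fmt.Errorf("no capture group named %q", name)
		}
		exprs[name] = expr
	}
	return exprs, nil
}

func main() {
	exprs, err := bind(`(?P<adatetime>.+)`, []string{"adatetime=here.goes.your.dsl.expression"})
	fmt.Println(exprs, err)
}
```

With this split, the capture name stays a plain identifier and the expression is free to use any characters.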

seebs · 2023-06-27T16:32:33Z

I have been noticeably bitten by the existing behavior of treating $2z as a named-subreference rather than as subreference 2 plus a literal z. I don't care much what the rules are for what goes inside ${...}, but I would very much like to not make the behavior of things after a $ any more aggressive than it already is.
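
The behavior in question can be reproduced with a few lines. In this sketch `expand` is a throwaway helper; the result follows from the documented rule that in the `$name` form the name is taken to be as long as possible.

```go
package main

import (
	"fmt"
	"regexp"
)

// expand applies a replacement template to the match of (a)(b) in "ab".
func expand(template string) string {
	re := regexp.MustCompile(`(a)(b)`)
	return re.ReplaceAllString("ab", template)
}

func main() {
	// "$2z" is read as a reference to a group named "2z", which does not
	// exist, so the match is replaced with the empty string.
	fmt.Printf("%q\n", expand("$2z"))
	// Braces limit the reference to group 2, followed by a literal z.
	fmt.Printf("%q\n", expand("${2}z"))
}
```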

junyer · 2023-06-28T08:25:07Z

Great point, @seebs. The proposal would have implications for template expansion that have so far not been considered.

rsc · 2023-07-05T18:22:40Z

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

rsc · 2023-07-05T18:34:04Z

The discussion here is maybe a little more heated than it needs to be.

In general my approach is to try to (1) keep Go and RE2 in sync and (2) make them both a useful approximation to the intersection of other regexp libraries, limited to what's actually used.

Embedding a new programming language inside the name tags seems like stretching them much farther than they were ever intended. I don't think that use case by itself would justify a change.

The top comment said:

> Capture names are already not fully compatible with PCRE nor Python, so I think they could be relaxed further.

I am not sure what exact incompatibility is meant here. It appears that this refers only to the documented differences in the comment quoted above it, namely:

// isValidCaptureName reports whether name
// is a valid capture name: [A-Za-z0-9_]+.
// PCRE limits names to 32 bytes.
// Python rejects names starting with digits.
// We don't enforce either of those.

Those differences are quite a lot smaller than what is being proposed. PCRE accepts "2z" as a name, so Go does. Python accepts the 33-byte name "a23456789012345678901234567890123", so Go does. To my knowledge, neither accepts the much more expansive names contemplated here.

rsc · 2023-07-12T19:54:19Z

Based on the discussion above, this proposal seems like a likely decline.
— rsc for the proposal review group

rsc · 2023-07-19T21:21:02Z

No change in consensus, so declined.
— rsc for the proposal review group

mitar added the Proposal label Jun 14, 2023

gopherbot added this to the Proposal milestone Jun 14, 2023

seankhliao changed the title ~~proposal: regexp/syntax: Relax named subexpressions (capture groups) name restrictions~~ proposal: regexp/syntax: relax named subexpressions (capture groups) name restrictions Jun 22, 2023

rsc added the Proposal-FinalCommentPeriod label Jul 12, 2023

rsc closed this as completed Jul 19, 2023

rsc removed the Proposal-FinalCommentPeriod label Jul 19, 2023

seankhliao mentioned this issue Dec 12, 2023

regexp/syntax: named capture groups don't support non-latin alphabets #64678

Open
