Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regexp: support encoding.TextMarshaler #46159

Closed
flimzy opened this issue May 13, 2021 · 32 comments
Closed

regexp: support encoding.TextMarshaler #46159

flimzy opened this issue May 13, 2021 · 32 comments

Comments

@flimzy
Copy link
Contributor

flimzy commented May 13, 2021

What did you do?

Marshaling a compiled regular expression, for example as part of a JSON object, results in an empty object. Example:

re := regexp.MustCompile("foo?")
data, _ := json.Marshal(map[string]interface{}{"foo":re})
fmt.Println(string(data))

Produces:

{"foo":{}}

(Playground)

What did you expect to see?

It would seem more intuitive for a compiled regexp to marshal to its string representation, as if .String() were called. Example:

{"foo":"foo?"}

I propose we implement the encoding.TextMarshaler interface to return the value produced by the String() method.

Strictly speaking, this would be a breaking change, so may need to be targeted to Go 2. On the other hand, in a practical sense, the current behavior is completely useless, so anyone depending on the current behavior probably already has a bug in their program. 🤷‍♂️

@heschi heschi added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label May 14, 2021
@heschi heschi added this to the Backlog milestone May 14, 2021
@heschi
Copy link
Contributor

heschi commented May 14, 2021

cc @rsc

@mvdan
Copy link
Member

mvdan commented May 14, 2021

This would also be nice for #45754. cc @dsnet

@rsc rsc changed the title regexp: Support encoding.TextMarshaler proposal: regexp: support encoding.TextMarshaler May 18, 2021
@gopherbot gopherbot added Proposal and removed NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. labels May 18, 2021
@rsc rsc modified the milestones: Backlog, Proposal May 18, 2021
@rsc rsc added this to Incoming in Proposals (old) May 18, 2021
@icholy
Copy link

icholy commented May 18, 2021

@mvdan for this to be useful with #45754 it would need to also implement encoding.TextUnmarshaler.

@dsnet
Copy link
Member

dsnet commented May 18, 2021

Regardless of #45754, I'd be surprised as a user to find a type that implements only either encoding.TextMarshaler or encoding.TextUnmarshaler but not both.

@mvdan
Copy link
Member

mvdan commented May 18, 2021

I share @dsnet's sentiment, though I should also note that in the case of regexp, marshal/unmarshal wouldn't be really symmetric: marshal would just show the original expression string, whereas unmarshal would do a comparatively more expensive regexp compile. A compile also takes options, so I imagine those wouldn't be exposed via TextUnmarshaler.

@icholy
Copy link

icholy commented May 18, 2021

Losing the options feels like a deal-breaker.

re := regexp.MustCompile(".*")
re.Longest()
data, _ := re.MarshalText()

// Longest() call is not preserved
_ = re.UnmarshalText(data)

@dsnet
Copy link
Member

dsnet commented May 18, 2021

This seems to suggest that MarshalText should not emit the same thing as String, but emits sufficient information to preserve options and whether it was constructed with Compile or CompilePOSIX.

That said, I would expect MarshalText to be identical to String for the common case where no options are involved and its built with Compile (assuming that's the more frequently used one).

@rsc rsc moved this from Incoming to Active in Proposals (old) May 19, 2021
@rsc
Copy link
Contributor

rsc commented May 19, 2021

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@dsnet
Copy link
Member

dsnet commented May 19, 2021

As a strawman proposal to spur discussion, MarshalText could output the provided regexp pattern ad-verbatim if there are no options used and the regexp pattern has no slashes, otherwise, the emits the regexp pattern ad-verbatim, but surrounded by "/" characters, and with a few single character flags at the end. This syntax draws inspiration from RegExp literal notation from JavaScript and seems to be match similar syntaxes elsewhere (e.g., Ruby).

Examples:

rx1 := regexp.MustCompile(`[0-9a-fA-F]{32}`)
rx1.MarshalText() // `[0-9a-fA-F]{32}`

rx2 := regexp.MustCompilePOSIX(`[0-9a-fA-F]{32}`)
rx2.MarshalText() // `/[0-9a-fA-F]{32}/p` with 'p' flag emitted to indicate POSIX mode

rx3 := regexp.MustCompile(`[0-9a-fA-F]{32}`)
rx3.Longest()
rx1.MarshalText() // `/[0-9a-fA-F]{32}/l` with 'l' flag emitted to indicate longest-prefix mode. 

rx4 := regexp.MustCompile(`/path/to/blah`)
rx4.MarshalText() // `//path/to/blah/` with surrounding '/' emitted to avoid possible parsing ambiguities

I arbitrarily used p to indicate POSIX and l to indicate longest prefix match. It could anything, and I don't know what options are commonly provided by regexp libraries in other languages and whether we can reuse common flags.

Alternatively, options could be hacked into the existing (?x) notation as special flags that must be specified in the top-level grouping. See the "grouping" section under regexp/syntax.

@frioux
Copy link

frioux commented May 19, 2021

Yeah at least Perl stringifies regexen such that they will produce the same thing via the grouping notation:

$ perl -E'say qr/[a-z]/'
(?^u:[a-z])

@rsc
Copy link
Contributor

rsc commented May 26, 2021

We could invent our own custom flags for the (?...) syntax, but that would be quite a change for a pretty limited use.
Or we could say that MarshalText only works with non-Posix, non-longest regexps.
None of these sound really awesome.

@rsc
Copy link
Contributor

rsc commented Jun 2, 2021

It sounds like we could invent a text format here, but that text format would not be the input to regexp.Compile, which negates most of the benefit. Is this still worth doing? It seems like maybe not...

@dsnet
Copy link
Member

dsnet commented Jun 2, 2021

I have mixed feelings about this. I'm not fond of the need to invent a new format, but adding text marshalers would be useful for #45754. I've written tools before that took in a regexp as a command-line flag.

@icholy
Copy link

icholy commented Jun 2, 2021

Custom (?...) flags seem like the cleaner way of doing this. Re #45754, I don't think it's a very compelling use-case considering there's flag.Func

@flimzy
Copy link
Contributor Author

flimzy commented Jun 2, 2021

It sounds like we could invent a text format here, but that text format would not be the input to regexp.Compile

Can you elaborate on the reasoning here? If we were to (say) add new (?...) flags, what is the reason that regexp.Compile could not honor them?

@icholy
Copy link

icholy commented Jun 4, 2021

@flimzy those are two separate possible solutions.

@rsc
Copy link
Contributor

rsc commented Jun 9, 2021

The reason not to add new (?...) flags is that there are too many regular expression syntaxes in the world. Today, RE2, Go, Rust, and RE2/J all agree on a single syntax. I would be reluctant for Go to deviate from that group and introduce yet another variant, however small.

@rsc
Copy link
Contributor

rsc commented Jun 9, 2021

Based on the discussion above, this proposal seems like a likely decline.
— rsc for the proposal review group

@rsc rsc moved this from Active to Likely Decline in Proposals (old) Jun 9, 2021
@mvdan
Copy link
Member

mvdan commented Jun 9, 2021

Perhaps I derailed this thread with my first comment, #46159 (comment). The original request was for TextMarshaler only, returning the same value as the String method. Is that still a bad idea, ignoring TextUnmarshaler? It could be unfortunate to implement marshaling and not unmarshaling, so maybe that's still enough reason to not pursue this further.

@flimzy
Copy link
Contributor Author

flimzy commented Jun 9, 2021

Perhaps I derailed this thread with my first comment, #46159 (comment). The original request was for TextMarshaler only, returning the same value as the String method. Is that still a bad idea, ignoring TextUnmarshaler? It could be unfortunate to implement marshaling and not unmarshaling, so maybe that's still enough reason to not pursue this further.

For my use case, TextMarshaler is all that matters. But my use case may well be unique enough not to warrant an asymmetric feature.

@flimzy
Copy link
Contributor Author

flimzy commented Jun 9, 2021

The reason not to add new (?...) flags is that there are too many regular expression syntaxes in the world. Today, RE2, Go, Rust, and RE2/J all agree on a single syntax. I would be reluctant for Go to deviate from that group and introduce yet another variant, however small.

I understand that, and agree that it would be unfortunate to add yet another weird Regex syntax.

Would it be reasonable to marshal only non-Posix, non-longest regexps, considering those seem non-controversial? The others could perhaps return an error when marshaled? This could easily leave the implementation open for modification later if we ever decide supporting those (by whatever means) is appropriate.

@rogpeppe
Copy link
Contributor

I think it would be find if it just marshaled to the equivalent of String and unmarshaled without "longest" or Posix options. Yes, that would mean it wouldn't fully round trip, but I think that's OK and easily explainable. I suspect that in any one domain it's unusual to have both "longest" and "non-longest" regexps sitting side by side, especially as there's no way to express those options in the string that's passed to Parse. If someone needs those attributes to round-trip, they can always express them out of band.

@rsc rsc moved this from Likely Decline to Active in Proposals (old) Jun 16, 2021
@rsc
Copy link
Contributor

rsc commented Jul 14, 2021

The current idea is to have MarshalText/UnmarshalText use the direct regexp syntax and just accept that MarshalText will be a lossy encoding for POSIX or Longest regexps, which are relatively rare.

Do we have any other lossy marshaling? Does anyone think we shouldn't do this?

@rsc
Copy link
Contributor

rsc commented Jul 21, 2021

Based on the discussion above, this proposal seems like a likely accept.
— rsc for the proposal review group

@rsc rsc moved this from Active to Likely Accept in Proposals (old) Jul 21, 2021
@martin-sucha
Copy link
Contributor

Do we have any other lossy marshaling?

  • big.Float does not marshal precision/accuracy, but it uses full precision for marshaling. I don't know whether this classifies as "lossy" or "not-lossy":
// MarshalText implements the encoding.TextMarshaler interface.
// Only the Float value is marshaled (in full precision), other
// attributes such as precision or accuracy are ignored.
  • big.Int is not lossy.
  • big.Rat is not lossy, it's UnmarshalText accepts floating point number in addition to representations returned by MarshalText
  • net.IP is not lossy, it returns an error from MarshalText when the IP address is invalid.
  • time.Time uses RFC3339Nano when marshaling and RFC3339 with special handling of fractional seconds when unmarshaling, so is not lossy.

There are no other non-test implementations of TextMarshaler in the standard library.


The documentation for TextUnmarshaler says:

UnmarshalText must be able to decode the form generated by MarshalText. UnmarshalText must copy the text if it wishes to retain the text after returning.

Seems that technically this will hold (unless there are any POSIX regexes that cannot be parsed as re2 regex).

@flimzy
Copy link
Contributor Author

flimzy commented Jul 22, 2021

  • time.Time uses RFC3339Nano when marshaling and RFC3339 with special handling of fractional seconds when unmarshaling, so is not lossy.

Marshaling of time.Time does omit the monotonic clock reading, which might be considered "lossy" in the strictest sense.

@rsc
Copy link
Contributor

rsc commented Jul 28, 2021

I'll do the implementation and see if I run into anything surprising.

@rsc rsc self-assigned this Jul 28, 2021
@rsc rsc moved this from Likely Accept to Accepted in Proposals (old) Jul 28, 2021
@rsc
Copy link
Contributor

rsc commented Jul 28, 2021

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

@rsc rsc changed the title proposal: regexp: support encoding.TextMarshaler regexp: support encoding.TextMarshaler Jul 28, 2021
@rsc rsc modified the milestones: Proposal, Backlog Jul 28, 2021
@flimzy
Copy link
Contributor Author

flimzy commented Jan 3, 2022

I would be interested in tackling this one.

@dsnet
Copy link
Member

dsnet commented Apr 14, 2022

I'll do the implementation and see if I run into anything surprising.

@rsc, are you still planning on doing this? I found a use-case for this and would like to move it along.

@gopherbot
Copy link

Change https://go.dev/cl/479401 mentions this issue: regexp: Make Regexp type satisfy the encoding.TextMarshaler/TextUnmarshaler interfaces

@gopherbot
Copy link

Change https://go.dev/cl/498756 mentions this issue: doc/go1.21: mention regexp.MarshalText and UnmarshalText

gopherbot pushed a commit that referenced this issue May 26, 2023
For #46159

Change-Id: Ia9cc0827a89d362532d1a662b791de8eebbfb2fe
Reviewed-on: https://go-review.googlesource.com/c/go/+/498756
Reviewed-by: Ian Lance Taylor <iant@google.com>
Reviewed-by: Eli Bendersky <eliben@google.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
Run-TryBot: Ian Lance Taylor <iant@google.com>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
@dmitshur dmitshur modified the milestones: Backlog, Go1.21 Jun 4, 2023
Nasfame pushed a commit to golangFame/go that referenced this issue Jun 4, 2023
For golang#46159

Change-Id: Ia9cc0827a89d362532d1a662b791de8eebbfb2fe
Reviewed-on: https://go-review.googlesource.com/c/go/+/498756
Reviewed-by: Ian Lance Taylor <iant@google.com>
Reviewed-by: Eli Bendersky <eliben@google.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
Run-TryBot: Ian Lance Taylor <iant@google.com>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Accepted
Development

No branches or pull requests