Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: regexp: add (*Regexp).SubexpIndex #32420

Closed
ajwerner opened this issue Jun 4, 2019 · 9 comments
Closed

proposal: regexp: add (*Regexp).SubexpIndex #32420

ajwerner opened this issue Jun 4, 2019 · 9 comments

Comments

@ajwerner
Copy link

ajwerner commented Jun 4, 2019

Regular expressions are handy in a variety of simple string parsing situations. Using named capture groups is a good way to document the structure of a regular expression and to eliminate bugs due to the introduction of additional capture groups. Mapping the name to the capture group index is currently quite heavy-weight. I regularly find myself writing code like namedSubexp in the below toy example when using regular expressions to parse strings. What's worse is that I often don't write this code and instead just rely on a brittle hard-coded index.

The rejected proposal in #24208 argued in favour of a much more heavyweight interface change that also does not appeal to me. github.com/ghemawat/re.Scan uses slices and reflection and thus is too inefficient for anything performance critical. I understand that the bar for changes here is high Furthermore, I do hear an argument in favour of using an external library as the regexp package is already quite large but I think it's exactly in those cases where I'd avoid writing this helper function that I'd also avoid pulling in a new dependency. The proposal here is compact, useful and encourages more maintainable code so I figured I'd float it and see if the it resonates.

This issue proposes a new method on *regexp.Regexp called NamedSubexp which takes a string and returns an integer. The open-ended portion of this proposal is how to deal with the case where no such capture group exists. I'm quite open to that integer being -1 if no such named capture group exists (as in the strings package) or to augmenting the method signature to additionally return a boolean. My inclination for the panic comes from a tendency to use this pattern to define global vars and it's probably bad practice to search the list of strings at runtime.

package main

import (
	"fmt"
	"regexp"
)

var (
	re     = regexp.MustCompile("foo (?P<bar>[0-9]+) (?P<baz>0x[0-9A-Fa-f]+)")
	barIdx = namedSubexp(re, "bar")
	bazIdx = namedSubexp(re, "baz")
)

func namedSubexp(re *regexp.Regexp, name string) int {
	for i, exp := range re.SubexpNames() {
		if exp == name {
			return i
		}
	}
        panic(fmt.Errorf("%v does not have a capture group named %s", re, name))
}

func main() {
	matches := re.FindStringSubmatch("foo 12 0xA123")
        // if matches == nil { ... }
	bar, baz := matches[barIdx], matches[bazIdx]
	fmt.Println(baz, bar)
}

https://play.golang.org/p/WT8dFyp1TCE

@ianlancetaylor ianlancetaylor changed the title regexp: add method to look up named capture group index by name proposal: regexp: add method to look up named capture group index by name Jun 4, 2019
@gopherbot gopherbot added this to the Proposal milestone Jun 4, 2019
@ajwerner
Copy link
Author

ajwerner commented Jun 4, 2019

Perhaps a more palatable solution would be to augment the strings library instead of adding another method to Regexp. The following function or something like it in the strings package would make me similarly happy:

package strings

// SliceIndex returns the index of the first instance of needle in haystack, or -1 if needle is not present in haystack.
// Slice index performs a linear search of haystack for needle.
func SliceIndex(haystack []string, needle string) int {
   for i, s := range haystack {
      if s == needle {
         return i
      }
   }
   return -1
}

Of course this then begs the question about the need for at least (Last)?SliceIndex(Func)? functions for symmetry with the regular string functions. It's also not obvious that this slice searching belongs in the strings package though I can't think of a more suitable package.

@rsc
Copy link
Contributor

rsc commented Jun 25, 2019

What do people think of this?

// SubexpIndex returns the index of the first subexpression with the given name,
// or else -1 if there is no subexpression with that name.
//
// Note that multiple subexpressions can be written using the same name, as in
// (?P<bob>a+)(?P<bob>b+), which declares two subexpressions named "bob".
// In this case SubexpIndex returns the index of the leftmost such subexpression
// in the regular expression.
func (*Regexp) SubexpIndex(name string) int

@robpike
Copy link
Contributor

robpike commented Jul 1, 2019

Leftmost match or leftmost in the expression?

(?Pa*)|(?Pb+) matching b

@rsc
Copy link
Contributor

rsc commented Jul 2, 2019

Leftmost in the expression, because the method is on Regexp. There's no input text involved when you ask the question. (Hopefully people will just not name the same subexpression twice but there has to be a clear rule.)

@rsc
Copy link
Contributor

rsc commented Jul 16, 2019

Based on mild happiness and no negativity, accepting for API in #32420 (comment).

@rsc rsc changed the title proposal: regexp: add method to look up named capture group index by name proposal: regexp: add (*Regexp).SubexpIndex Jul 16, 2019
@rsc rsc modified the milestones: Proposal, Go1.14 Jul 16, 2019
@ajwerner
Copy link
Author

Sounds good to me. I'm hopeful this will encourage people to name subexpressions. I'm especially on board if the lookup will be O(1). I worry a bit that a linear scan might be something of a footgun for somebody if they call it in a hot loop. The thought occurred to me that a detailed example on Regexp.SubexpNames which demonstrates looking up indexes at init time with a small function one could just copy out of the godoc might be good enough. Having this method is more likely to change behavior and thus is better. Thanks for taking this up!

@gopherbot
Copy link

Change https://golang.org/cl/187919 mentions this issue: regexp: add (*Regexp).SubexpIndex

@sylvinus
Copy link
Contributor

I just pushed a simple implementation of this, with tests and an example.

@smoyer64
Copy link

I went a little further than this but I certainly appreciate that support for named capture groups is being added! Here's what I created - https://github.com/PennState/subexp.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

6 participants