|
|
Created:
13 years, 9 months ago by rsc Modified:
13 years, 9 months ago Reviewers:
CC:
r, Sam, kevlar, eds2, golang-dev Visibility:
Public. |
Descriptionexp/regexp/syntax: syntax data structures, parser
Parser is a work in progress but can populate most of the
interesting parts of the data structure, so a good checkpoint.
All the complicated Perl syntax is missing, as are various
important optimizations made during parsing to the
syntax tree.
The plan is that exp/regexp's API will mimic regexp,
and exp/regexp/syntax provides the parser directly
for programs that need it (and for implementing exp/regexp).
Once finished, exp/regexp will replace regexp.
Patch Set 1 #Patch Set 2 : diff -r bece9e7a644b https://go.googlecode.com/hg/ #Patch Set 3 : diff -r bece9e7a644b https://go.googlecode.com/hg/ #
Total comments: 45
Patch Set 4 : diff -r 69f12bae0f09 https://go.googlecode.com/hg #Patch Set 5 : diff -r 69f12bae0f09 https://go.googlecode.com/hg #Patch Set 6 : diff -r 69f12bae0f09 https://go.googlecode.com/hg #
Total comments: 6
Patch Set 7 : diff -r d1e5d7548fee https://go.googlecode.com/hg #
Total comments: 11
Patch Set 8 : diff -r 881a0fc6528d https://go.googlecode.com/hg #
MessagesTotal messages: 18
Hello r, sam (cc: golang-dev@googlegroups.com), I'd like you to review this change to https://go.googlecode.com/hg/
Sign in to reply to this message.
Very nice. Just a few things I noticed while reading through and trying to understand the code. http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... File src/pkg/exp/regexp/syntax/parse.go (right): http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse.go:136: re.Sub = re.Sub0[:1] It could just be that I haven't seen this clever trick before, but it took me a bit to parse this. It may be more legible as: re.Sub0[0] = sub re.Sub = re.Sub0[:1] Feel free to disregard. http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse.go:150: p.stack = p.stack[:i] Could be combined into one statement? It may even improve clarity. p.stack, sub := p.stack[:i], p.stack[i:] http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... File src/pkg/exp/regexp/syntax/parse_test.go (right): http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse_test.go:124: // { "\\Q+|*?{[\\E", "str{+|*?{[}" }, I know they're not tested yet, but can these be ``-strings? I think the escapes become more legible, hopefully people will use them by example when writing these regexps.
Sign in to reply to this message.
On Wed, Jun 8, 2011 at 16:24, <kevlar@google.com> wrote: > Very nice. Just a few things I noticed while reading through and trying > to understand the code. Thanks. > http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... > src/pkg/exp/regexp/syntax/parse.go:150: p.stack = p.stack[:i] > Could be combined into one statement? It may even improve clarity. > > p.stack, sub := p.stack[:i], p.stack[i:] That := won't work. > http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... > src/pkg/exp/regexp/syntax/parse_test.go:124: // { "\\Q+|*?{[\\E", > "str{+|*?{[}" }, > I know they're not tested yet, but can these be ``-strings? I think the > escapes become more legible, hopefully people will use them by example > when writing these regexps. I will do that once all the tests are enabled. They are copy/pasted from C++ so for now I would like to leave them as is. Russ
Sign in to reply to this message.
I'm excited to see some work on an improved regexp package. Thanks! I noticed a couple things while reading through. http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... File src/pkg/exp/regexp/syntax/parse.go (right): http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse.go:113: p.push(re) return p.push(re) and remove the line below http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse.go:121: p.push(re) return p.push(re) and remove the line below http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/reg... File src/pkg/exp/regexp/syntax/regexp.go (right): http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/reg... src/pkg/exp/regexp/syntax/regexp.go:24: Sub0 [1]*Regexp // storage for short Sub It seems a little funny to export this field (and Rune0). Is it so other packages can copy Regexp values? Or to help other packages in building Regexp nodes?
Sign in to reply to this message.
> http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/reg... > src/pkg/exp/regexp/syntax/regexp.go:24: Sub0 [1]*Regexp // storage > for short Sub > It seems a little funny to export this field (and Rune0). Is it so other > packages can copy Regexp values? Or to help other packages in building > Regexp nodes? Yes, it is so that other packages can build Regexps too.
Sign in to reply to this message.
http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... File src/pkg/exp/regexp/syntax/parse.go (right): http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse.go:30: ErrInternalError ErrorCode = "regexp/syntax: internal error" you only need ErrorCode here, not below. http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse.go:60: NonGreedy // repetition operators are non-greedy by default I think you mean 'are greedy by default'. http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse.go:146: for i > 0 && p.stack[i-1].Op < opPseudo { I found this a tad confusing at first, maybe a comment like "searches for nearest '|' or '('" would help. http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse.go:168: for i > 0 && p.stack[i-1].Op < opPseudo { ditto here re: the comment searching for '|' or '('. http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse.go:402: You want to look for ':' between the negate and the beginning of the loop. If that's the case, it's a [:alnum:] class only. http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse.go:414: // TODO: Look for [:alnum:] if t[0] == '[' { // TODO: parse [:alnum:] continue } http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse.go:416: // TODO: Look for Perl group. You could replace these two lines with a call to parseClassChar. http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/reg... File src/pkg/exp/regexp/syntax/regexp.go (right): http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/reg... src/pkg/exp/regexp/syntax/regexp.go:40: OpLiteral // matches Runes sequence this seems ambiguous; can OpLiteral match a long string, or just a single UTF8 rune? It does both in your parser. This seems difficult to build a runner around, as it's not clear that each Op is a single char (e.g. OpAnyChar) or zero chars (e.g. OpWordBoundary). The state of the runner will need to be a set of (state, index_in_string). http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/reg... src/pkg/exp/regexp/syntax/regexp.go:51: OpStar // matches Sub[0] zero or more times So in general I feel like this is an interesting tradeoff between clarity in the in-memory representation versus actually parsing a regexp down to a simpler representation (e.g., why provide OpPlus when you can just duplicate the node, connect them, and use OpStar on the second). I suspect you have strong opinions on this (and it does allow you to render the RE completely back to the user). Anyway, I feel like it adds more complexity to any eventual runner. http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/reg... src/pkg/exp/regexp/syntax/regexp.go:149: b.WriteString(strconv.Itoa(re.Min)) re.Max (also, if re.Max < re.Min, I suspect that's "bad" from a user-point of view, but I suppose it may still be valid).
Sign in to reply to this message.
http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... File src/pkg/exp/regexp/syntax/parse.go (right): http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse.go:41: ErrMissingBracket = "missing closing ]" why does the ErrorCode type stop here? http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse.go:69: POSIX Flags = 0 // POSIX syntax how close is POSIX to egrep, that is, to the existing regexp package? http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse.go:113: p.push(re) the whole thing could be one line if you want, but it's a matter of taste how much you want to break it up. i don't care. http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse.go:467: // merges, and eliminates duplicates. s/merges/& them/ http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... File src/pkg/exp/regexp/syntax/parse_test.go (right): http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse_test.go:226: case OpEndText: default? http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/reg... File src/pkg/exp/regexp/syntax/regexp.go (right): http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/reg... src/pkg/exp/regexp/syntax/regexp.go:9: // Note to implementors: s/tors/ters/ http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/reg... src/pkg/exp/regexp/syntax/regexp.go:35: // Operators arrange in precedence order, tightest binding to weakest. s/arrange/&d/? http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/reg... src/pkg/exp/regexp/syntax/regexp.go:64: b.WriteString("<invalid op>") always good to indicate the value: + strconv.ItoA(re.Op)? maybe not worth it; not sure what this is for (no doc comment, for one thing). generally the error states in here don't say much http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/reg... src/pkg/exp/regexp/syntax/regexp.go:182: func escape(b *bytes.Buffer, r int, force bool) { i'd flip this around. the likeliest thing is IsPrint, so do that first and then fall through to the special cases. not important, though; leave it this way if you prefer
Sign in to reply to this message.
http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... File src/pkg/exp/regexp/syntax/parse.go (right): http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse.go:30: ErrInternalError ErrorCode = "regexp/syntax: internal error" On 2011/06/09 02:15:49, Sam wrote: > you only need ErrorCode here, not below. No, since there are initializers it's needed on each line. http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse.go:41: ErrMissingBracket = "missing closing ]" On 2011/06/09 02:56:04, r wrote: > why does the ErrorCode type stop here? I screwed up. http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse.go:69: POSIX Flags = 0 // POSIX syntax On 2011/06/09 02:56:04, r wrote: > how close is POSIX to egrep, that is, to the existing regexp package? POSIX really means POSIX egrep syntax, but that's a little more than what regexp does now. The differences compared to the current regexp are: - escapes like \a \f \n \t \v \x7F \023 - counted repetition like x{5} x{5,} x{5,10} - classes like [[:space:]] Except for those, I believe they are the same. http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse.go:113: p.push(re) On 2011/06/09 02:56:04, r wrote: > the whole thing could be one line if you want, but it's a matter of taste how > much you want to break it up. i don't care. It used to do more (and push used to not return a *Regexp). Will fix. http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse.go:402: On 2011/06/09 02:15:49, Sam wrote: > You want to look for ':' between the negate and the beginning of the loop. If > that's the case, it's a [:alnum:] class only. That will come. Even after I decided to stop I had to throw code away to make the code review a reasonable size. http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse.go:414: // TODO: Look for [:alnum:] On 2011/06/09 02:15:49, Sam wrote: > if t[0] == '[' { > // TODO: parse [:alnum:] > continue > } It will come later, I promise. http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse.go:416: // TODO: Look for Perl group. On 2011/06/09 02:15:49, Sam wrote: > You could replace these two lines with a call to parseClassChar. Possibly but possibly not; parseClassChar reads a single char that might be involved in an a-z range. It's not set up to return a giant class. Let's wait until I address the TODO. This code is following the RE2 parser fairly closely, and that code does work. http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... File src/pkg/exp/regexp/syntax/parse_test.go (right): http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse_test.go:226: case OpEndText: On 2011/06/09 02:56:04, r wrote: > default? Default is print nothing inside the braces, like emp{} and bol{}. http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/reg... File src/pkg/exp/regexp/syntax/regexp.go (right): http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/reg... src/pkg/exp/regexp/syntax/regexp.go:40: OpLiteral // matches Runes sequence On 2011/06/09 02:15:49, Sam wrote: > this seems ambiguous; can OpLiteral match a long string, or just a single UTF8 > rune? It does both in your parser. It matches the sequence of (one or more) Runes. > This seems difficult to build a runner around, as it's not clear that each Op is > a single char (e.g. OpAnyChar) or zero chars (e.g. OpWordBoundary). The state of > the runner will need to be a set of (state, index_in_string). This is just the parse tree. A separate compilation pass will have to turn this into a program to run. http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/reg... src/pkg/exp/regexp/syntax/regexp.go:51: OpStar // matches Sub[0] zero or more times On 2011/06/09 02:15:49, Sam wrote: > So in general I feel like this is an interesting tradeoff between clarity in the > in-memory representation versus actually parsing a regexp down to a simpler > representation (e.g., why provide OpPlus when you can just duplicate the node, > connect them, and use OpStar on the second). > I suspect you have strong opinions on this (and it does allow you to render the > RE completely back to the user). Anyway, I feel like it adds more complexity to > any eventual runner. This data structure is about parsing into a form that can be analyzed and processed. One of the ways to process it is to compile it into a program to run. But even during compilation + is important to handle on its own. Rewriting x+ into xx* during compilation makes a copy of x. If you keep it as x+ you can avoid that copy. See the diagrams in the "Converting Regular Expressions to NFAs" section of http://swtch.com/~rsc/regexp/regexp1.html and the instruction listings near the beginning of http://swtch.com/~rsc/regexp/regexp2.html Especially if x is some long expression, it's a big win not to make an unnecessary copy of it. http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/reg... src/pkg/exp/regexp/syntax/regexp.go:64: b.WriteString("<invalid op>") On 2011/06/09 02:56:04, r wrote: > always good to indicate the value: + strconv.ItoA(re.Op)? maybe not worth it; > not sure what this is for (no doc comment, for one thing). generally the error > states in here don't say much Thanks. I was trying not to use fmt and forgot I had strconv. http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/reg... src/pkg/exp/regexp/syntax/regexp.go:149: b.WriteString(strconv.Itoa(re.Min)) On 2011/06/09 02:15:49, Sam wrote: > re.Max > (also, if re.Max < re.Min, I suspect that's "bad" from a user-point of view, but > I suppose it may still be valid). Nice catch. Untested since the parser doesn't read that yet. :-)
Sign in to reply to this message.
Hello r@golang.org, sam.thorogood@gmail.com, kevlar@google.com, edsrzf@gmail.com (cc: golang-dev@googlegroups.com), Please take another look.
Sign in to reply to this message.
PTAL http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... File src/pkg/exp/regexp/syntax/parse.go (right): http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse.go:60: NonGreedy // repetition operators are non-greedy by default On 2011/06/09 02:15:49, Sam wrote: > I think you mean 'are greedy by default'. No, the NonGreedy flag makes them non-greedy by default. Changed to 'make repetition operators default to non-greedy'.
Sign in to reply to this message.
LGTM Although I'm hardly authoritative. http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... File src/pkg/exp/regexp/syntax/parse.go (right): http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse.go:60: NonGreedy // repetition operators are non-greedy by default On 2011/06/09 22:17:14, rsc wrote: > On 2011/06/09 02:15:49, Sam wrote: > > I think you mean 'are greedy by default'. > > No, the NonGreedy flag makes them non-greedy by default. > Changed to 'make repetition operators default to non-greedy'. Yes, the 'by default' part was confusing and seemed to imply default state without the flag. The new comment is much better :-) http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse.go:168: for i > 0 && p.stack[i-1].Op < opPseudo { On 2011/06/09 02:15:49, Sam wrote: > ditto here re: the comment searching for '|' or '('. Actually, we should never find a | here, because it should always be swapped and then popped before this call. Right? http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/reg... File src/pkg/exp/regexp/syntax/regexp.go (right): http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/reg... src/pkg/exp/regexp/syntax/regexp.go:40: OpLiteral // matches Runes sequence On 2011/06/09 14:05:13, rsc wrote: > On 2011/06/09 02:15:49, Sam wrote: > > this seems ambiguous; can OpLiteral match a long string, or just a single UTF8 > > rune? It does both in your parser. > > It matches the sequence of (one or more) Runes. > > > This seems difficult to build a runner around, as it's not clear that each Op > is > > a single char (e.g. OpAnyChar) or zero chars (e.g. OpWordBoundary). The state > of > > the runner will need to be a set of (state, index_in_string). > > This is just the parse tree. A separate compilation pass > will have to turn this into a program to run. Ok, this resolves my concerns with this data structure. I guess I'm a tad confused as to whether Regexp will be the canonical representation passed around by users (cost of compiling it on every use), or whether it's better to compile it to this program first (cost of memory). http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/reg... src/pkg/exp/regexp/syntax/regexp.go:51: OpStar // matches Sub[0] zero or more times > Especially if x is some long expression, it's a big win not to > make an unnecessary copy of it. Yeah, I get that. http://codereview.appspot.com/4538123/diff/20001/src/pkg/exp/regexp/syntax/pa... File src/pkg/exp/regexp/syntax/parse.go (right): http://codereview.appspot.com/4538123/diff/20001/src/pkg/exp/regexp/syntax/pa... src/pkg/exp/regexp/syntax/parse.go:63: WasDollar // regexp OpEndText was $, not \z I'm kind of confused as to why you need this in this way, and don't need a parallel for ^ vs \a - it's clearly for use in a follow-up CL. I suspect you know what you're doing :) http://codereview.appspot.com/4538123/diff/20001/src/pkg/exp/regexp/syntax/pa... File src/pkg/exp/regexp/syntax/parse_test.go (right): http://codereview.appspot.com/4538123/diff/20001/src/pkg/exp/regexp/syntax/pa... src/pkg/exp/regexp/syntax/parse_test.go:22: // { "ab", "str{ab}" }, won't these str{} cases work - as just an extension of lit{} to >1 char? http://codereview.appspot.com/4538123/diff/20001/src/pkg/exp/regexp/syntax/pa... src/pkg/exp/regexp/syntax/parse_test.go:108: // { "(ab)*", "star{cap{str{ab}}}" }, I think this one should work now
Sign in to reply to this message.
PTAL http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... File src/pkg/exp/regexp/syntax/parse.go (right): http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse.go:168: for i > 0 && p.stack[i-1].Op < opPseudo { On 2011/06/10 00:14:54, Sam wrote: > On 2011/06/09 02:15:49, Sam wrote: > > ditto here re: the comment searching for '|' or '('. > > Actually, we should never find a | here, because it should always be swapped and > then popped before this call. Right? Hey, it's not fair to ask for comments and then complain when they're not correct. That's entrapment or something. :-) Fixed. http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/reg... File src/pkg/exp/regexp/syntax/regexp.go (right): http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/reg... src/pkg/exp/regexp/syntax/regexp.go:40: OpLiteral // matches Runes sequence > Ok, this resolves my concerns with this data structure. I guess I'm a tad > confused as to whether Regexp will be the canonical representation passed around > by users (cost of compiling it on every use), or whether it's better to compile > it to this program first (cost of memory). The plan is that the regexp package would take an input string, call syntax.Parse to get a *syntax.Regexp, then call a compile function to get a *regexp.prog, throw away the *syntax.Regexp, and execute the *regexp.prog on each call. Most packages will have no idea that *syntax.Regexp exists, and it will just be garbage generated during regexp.Compile. But sometimes it is very useful, and when you see how complex the parser is going to get, you'll understand why we don't have a separate parser for generating the Prog directly. :-) Even though the parser will get more complex, this Regexp representation will not; the parser and the Regexp data structure serve to hide a the complexity of the Perl syntax from the compilation phase. http://codereview.appspot.com/4538123/diff/20001/src/pkg/exp/regexp/syntax/pa... File src/pkg/exp/regexp/syntax/parse.go (right): http://codereview.appspot.com/4538123/diff/20001/src/pkg/exp/regexp/syntax/pa... src/pkg/exp/regexp/syntax/parse.go:63: WasDollar // regexp OpEndText was $, not \z On 2011/06/10 00:14:55, Sam wrote: > I'm kind of confused as to why you need this in this way, and don't need a > parallel for ^ vs \a - it's clearly for use in a follow-up CL. I suspect you > know what you're doing :) In Perl, there is \Z (match end of text or right before \n at end of text) and then there is \z (end of text only). In Perl single-line mode $ means \Z, but we don't support \Z, so in our single-line mode we map $ to \z. If we every want to run tests against PCRE to make sure we get the same answers, it's important to know if a particular OpEndText was a $, because then we should expect our library and PCRE to disagree once in a while about whether something matches. There is no equivalent for ^ and \A because Perl didn't screw up \A. http://codereview.appspot.com/4538123/diff/20001/src/pkg/exp/regexp/syntax/pa... File src/pkg/exp/regexp/syntax/parse_test.go (right): http://codereview.appspot.com/4538123/diff/20001/src/pkg/exp/regexp/syntax/pa... src/pkg/exp/regexp/syntax/parse_test.go:22: // { "ab", "str{ab}" }, On 2011/06/10 00:14:55, Sam wrote: > won't these str{} cases work - as just an extension of lit{} to >1 char? That's the plan, but for now they compile into the next line below: cat{lit{a}lit{b}} instead of lit{ab}. The alternate and concat methods will get much more complicated before all this is done, and one of the complications is turning cat{lit{a}lit{b}} into str{ab}. http://codereview.appspot.com/4538123/diff/20001/src/pkg/exp/regexp/syntax/pa... src/pkg/exp/regexp/syntax/parse_test.go:108: // { "(ab)*", "star{cap{str{ab}}}" }, On 2011/06/10 00:14:55, Sam wrote: > I think this one should work now Same thing as above: the uncommented copy a few lines from now is the current parse tree, because nothing generates str yet.
Sign in to reply to this message.
http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... File src/pkg/exp/regexp/syntax/parse.go (right): http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse.go:168: for i > 0 && p.stack[i-1].Op < opPseudo { On 2011/06/10 00:27:38, rsc wrote: > On 2011/06/10 00:14:54, Sam wrote: > > On 2011/06/09 02:15:49, Sam wrote: > > > ditto here re: the comment searching for '|' or '('. > > > > Actually, we should never find a | here, because it should always be swapped > and > > then popped before this call. Right? > > Hey, it's not fair to ask for comments and then > complain when they're not correct. > That's entrapment or something. :-) > > Fixed. How do you feel about an error case if we find |? I'm not fussed, this is mostly internal, but might help later down the track.
Sign in to reply to this message.
On Thu, Jun 9, 2011 at 21:03, <sam.thorogood@gmail.com> wrote: > > http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... > File src/pkg/exp/regexp/syntax/parse.go (right): > > http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... > src/pkg/exp/regexp/syntax/parse.go:168: for i > 0 && p.stack[i-1].Op < > opPseudo { > On 2011/06/10 00:27:38, rsc wrote: >> >> On 2011/06/10 00:14:54, Sam wrote: >> > On 2011/06/09 02:15:49, Sam wrote: >> > > ditto here re: the comment searching for '|' or '('. >> > >> > Actually, we should never find a | here, because it should always be > > swapped >> >> and >> > then popped before this call. Right? > >> Hey, it's not fair to ask for comments and then >> complain when they're not correct. >> That's entrapment or something. :-) > >> Fixed. > > How do you feel about an error case if we find |? I'm not fussed, this > is mostly internal, but might help later down the track. You are not going to recognize this function anymore in a few CLs. I wouldn't worry about it. Russ
Sign in to reply to this message.
Sounds good. LGTM again. On Fri, Jun 10, 2011 at 11:08, Russ Cox <rsc@golang.org> wrote: > On Thu, Jun 9, 2011 at 21:03, <sam.thorogood@gmail.com> wrote: >> >> http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... >> File src/pkg/exp/regexp/syntax/parse.go (right): >> >> http://codereview.appspot.com/4538123/diff/4001/src/pkg/exp/regexp/syntax/par... >> src/pkg/exp/regexp/syntax/parse.go:168: for i > 0 && p.stack[i-1].Op < >> opPseudo { >> On 2011/06/10 00:27:38, rsc wrote: >>> >>> On 2011/06/10 00:14:54, Sam wrote: >>> > On 2011/06/09 02:15:49, Sam wrote: >>> > > ditto here re: the comment searching for '|' or '('. >>> > >>> > Actually, we should never find a | here, because it should always be >> >> swapped >>> >>> and >>> > then popped before this call. Right? >> >>> Hey, it's not fair to ask for comments and then >>> complain when they're not correct. >>> That's entrapment or something. :-) >> >>> Fixed. >> >> How do you feel about an error case if we find |? I'm not fussed, this >> is mostly internal, but might help later down the track. > > You are not going to recognize this function anymore in a few CLs. > I wouldn't worry about it. > > Russ >
Sign in to reply to this message.
LGTM for checkpoint http://codereview.appspot.com/4538123/diff/6006/src/pkg/exp/regexp/syntax/par... File src/pkg/exp/regexp/syntax/parse.go (right): http://codereview.appspot.com/4538123/diff/6006/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse.go:12: ) when the time comes, i'd like a doc.go with the definition of the syntax and how the flags affect syntax and behavior. it can wait, but not too long because it helps me understand the minutiae. http://codereview.appspot.com/4538123/diff/6006/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse.go:59: OneLine // treat ^ and $ as only matching at beginning and end of text this doesn't feel like the right name but there might not be a good one. it's not about the lines, it's about ^ and $ vs inner newline characters. i don't have a great idea. InnerAnchors? AnchorEnds? AnchorNLs? http://codereview.appspot.com/4538123/diff/6006/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse.go:61: PerlX // allow Perl extensions not a fan of this either, because of X. what's wrong with PerlExt? http://codereview.appspot.com/4538123/diff/6006/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse.go:70: d http://codereview.appspot.com/4538123/diff/6006/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse.go:81: stack []*Regexp // stack of parsed expressions if you used indices instead of pointers, the regexps could be inline instead of offline through a pointer. could mean fewer allocations. http://codereview.appspot.com/4538123/diff/6006/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse.go:213: // Otherwise, have to do real work. s/have to/must/ http://codereview.appspot.com/4538123/diff/6006/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse.go:407: first := true // ] is okay as first char in class '-' too? http://codereview.appspot.com/4538123/diff/6006/src/pkg/exp/regexp/syntax/par... File src/pkg/exp/regexp/syntax/parse_test.go (right): http://codereview.appspot.com/4538123/diff/6006/src/pkg/exp/regexp/syntax/par... src/pkg/exp/regexp/syntax/parse_test.go:197: func dumpRegexp(b *bytes.Buffer, re *Regexp) { how/why does this differ from writeRegexp? function comments would help. this looks like it's a dump of the tree; write looks like it recreates the expression. http://codereview.appspot.com/4538123/diff/6006/src/pkg/exp/regexp/syntax/reg... File src/pkg/exp/regexp/syntax/regexp.go (right): http://codereview.appspot.com/4538123/diff/6006/src/pkg/exp/regexp/syntax/reg... src/pkg/exp/regexp/syntax/regexp.go:21: Op Op // operator do you intend to keep all these fields public? http://codereview.appspot.com/4538123/diff/6006/src/pkg/exp/regexp/syntax/reg... src/pkg/exp/regexp/syntax/regexp.go:59: const opPseudo Op = 128 // where pseudo-ops start i'd move this up into the bock of consts above http://codereview.appspot.com/4538123/diff/6006/src/pkg/exp/regexp/syntax/reg... src/pkg/exp/regexp/syntax/regexp.go:180: const meta = `\.+*?()|[]^$` are {} meta?
Sign in to reply to this message.
> http://codereview.appspot.com/4538123/diff/6006/src/pkg/exp/regexp/syntax/par... > src/pkg/exp/regexp/syntax/parse.go:12: ) > when the time comes, i'd like a doc.go with the definition of the syntax > and how the flags affect syntax and behavior. it can wait, but not too > long because it helps me understand the minutiae. http://code.google.com/p/re2/wiki/Syntax > http://codereview.appspot.com/4538123/diff/6006/src/pkg/exp/regexp/syntax/par... > src/pkg/exp/regexp/syntax/parse.go:59: OneLine > // treat ^ and $ as only matching at beginning and end of text > this doesn't feel like the right name but there might not be a good one. > it's not about the lines, it's about ^ and $ vs inner newline > characters. > > i don't have a great idea. InnerAnchors? AnchorEnds? AnchorNLs? > > http://codereview.appspot.com/4538123/diff/6006/src/pkg/exp/regexp/syntax/par... > src/pkg/exp/regexp/syntax/parse.go:61: PerlX > // allow Perl extensions > not a fan of this either, because of X. what's wrong with PerlExt? I agree that these names are not great. They are what RE2 uses, though, so for now I'd like to leave them the same. Once the coding is done I'm happy to rename/rearrange things. > http://codereview.appspot.com/4538123/diff/6006/src/pkg/exp/regexp/syntax/par... > src/pkg/exp/regexp/syntax/parse_test.go:197: func dumpRegexp(b > *bytes.Buffer, re *Regexp) { > how/why does this differ from writeRegexp? function comments would help. > this looks like it's a dump of the tree; write looks like it recreates > the expression. Yes. Added. dump is needed to test that you get the syntax tree you expect (for example, a string node instead of a concatenation of two literal nodes; not that that works yet). > http://codereview.appspot.com/4538123/diff/6006/src/pkg/exp/regexp/syntax/reg... > src/pkg/exp/regexp/syntax/regexp.go:21: Op Op // operator > do you intend to keep all these fields public? Yes, but this is the regexp/syntax package, so clients of regexp won't see them. > http://codereview.appspot.com/4538123/diff/6006/src/pkg/exp/regexp/syntax/reg... > src/pkg/exp/regexp/syntax/regexp.go:59: const opPseudo Op = 128 // where > pseudo-ops start > i'd move this up into the bock of consts above If I do that godoc will tell people there are unexported constants. > http://codereview.appspot.com/4538123/diff/6006/src/pkg/exp/regexp/syntax/reg... > src/pkg/exp/regexp/syntax/regexp.go:180: const meta = `\.+*?()|[]^$` > are {} meta? Yes, fixed. Russ
Sign in to reply to this message.
*** Submitted as http://code.google.com/p/go/source/detail?r=5e5d55182675 *** exp/regexp/syntax: syntax data structures, parser Parser is a work in progress but can populate most of the interesting parts of the data structure, so a good checkpoint. All the complicated Perl syntax is missing, as are various important optimizations made during parsing to the syntax tree. The plan is that exp/regexp's API will mimic regexp, and exp/regexp/syntax provides the parser directly for programs that need it (and for implementing exp/regexp). Once finished, exp/regexp will replace regexp. R=r, sam.thorogood, kevlar, edsrzf CC=golang-dev http://codereview.appspot.com/4538123
Sign in to reply to this message.
|