regexp/syntax: document limit of 1000 in {n,m} quantifier forms #7252

Closed
athomason opened this issue Feb 3, 2014 · 13 comments

@athomason

What steps will reproduce the problem?
1. Using a value of n or m greater than 1000 in any of the x{n,m}, x{n,}, x{n}
quantifier forms produces an ErrInvalidRepeatSize. http://play.golang.org/p/tD9yMerkbn.
The documentation for ErrInvalidRepeatSize does not mention any such limit.


What is the expected output?
A working Regexp, or at least a more helpful error message.


What do you see instead?
"invalid repeat count". There is no mention in the documentation that there is
a limit of 1000; I had to look at regexp/syntax/parse.go to uncover it:
https://code.google.com/p/go/source/browse/src/pkg/regexp/syntax/parse.go#758
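For reference, a minimal reproduction sketch using only the standard regexp package:

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// A repeat count of exactly 1000 is still accepted.
	if _, err := regexp.Compile(`x{1000}`); err != nil {
		fmt.Println("x{1000}:", err)
	}

	// 1001 exceeds the limit and is rejected at parse time with
	// an error like: invalid repeat count: `{1001}`
	_, err := regexp.Compile(`x{1001}`)
	fmt.Println("x{1001}:", err)
}
```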


Which compiler are you using (5g, 6g, 8g, gccgo)?
6g


Which operating system are you using?
Ubuntu 12.04


Which version are you using?  (run 'go version')
go version go1.2 linux/amd64


Please provide any additional information below.
This code was pulled in from re2 in rev c0852f6d0549. It looks like the limit has been
in re2 itself since the first public commit:
https://code.google.com/p/re2/source/browse/re2/parse.cc?spec=svn166b9c1905c7b1bd26a20cbf2f5c1f66823af1c0&r=166b9c1905c7b1bd26a20cbf2f5c1f66823af1c0#435.
I can find no mention of why this limit is necessary or why 1000 is an appropriate value.
@athomason
Author

Comment 1:

For reference, other popular regular expression engines have significantly higher
limits, all close to maxima of the integer types used internally:

- Perl: "n and m are limited to non-negative integral values less than a preset limit
  defined when perl is built. This is usually 32766 on the most common platforms."
  (http://search.cpan.org/~rjbs/perl-5.18.2/pod/perlre.pod)
- PCRE: "All values in repeating quantifiers must be less than 65536."
  (http://man7.org/linux/man-pages/man3/pcrelimits.3.html)
- Python: no documented limit, architecture dependent. Observed limits include 2^16-1
  (https://gist.github.com/athomason/7432497b3f6f8ba73c26) and 2^32-1
  (https://5621528620171264-dot-shared-playground.appspot.com/?set_access_key_cookie=z1iuou3sszu4y3q9ntq03n81d).
- Ruby: undocumented limit of 100,000
  (https://github.com/ruby/ruby/blob/trunk/include/ruby/oniguruma.h#L341)
- Java: undocumented effective limit of 2^31-1, the maximum value of an int
  (http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7u40-b43/java/util/regex/Pattern.java?av=f#3067)
- Javascript: no documented limit; Chrome 33 and Firefox 27 accept m >= 2^31 (in dev
  console/firebug, `/x{1,2147483646}/.test("x")` ==> true), IE 11 accepts m <= 2^31.

@cznic
Contributor

cznic commented Feb 17, 2014

Comment 2:

Can you provide a [reasonable] example needing rep count > 1000?

@athomason
Author

Comment 3:

I have a program that accepts a regexp as a way to filter its line-based data stream. I
had a need to filter for lines with a minimum length of 1800 bytes, but /.{1800,}/ was
rejected as a pattern. Obviously there are much better ways to do this, but this would
have been one that didn't require a code change, rebuild, and deployment. I would argue
that this is a "[reasonable]" desire as long as there's no downside to supporting it.
Note that strings.Repeat(".", 1800) is accepted as a pattern, so there seems little
sense in prohibiting the equivalent quantified version.
Since the current value appears entirely arbitrary, it might as well be an equally
arbitrary value that is more permissive. I would guess that the check was originally
added to prevent runaway memory use and compilation time, which is fair to a point. But
with the check removed from tip, compiling /.{100000,}/ takes about 23ms and uses about
22MB, which isn't at all crazy. Both figures appear to grow roughly linearly with the
count; compiling /.{10000000,}/ takes a couple of seconds and consumes ~2.5GB.
Regardless of what magic value strikes the right balance, it ought to be documented. I'm
happy to submit patch(es) to change the value and/or add docs.
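To make that comparison concrete, a small sketch of the two forms, using the 1800-byte
figure from the use case above; only the quantified one is rejected:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

func main() {
	// The quantified form is rejected because 1800 exceeds the parser's limit.
	_, err := regexp.Compile(`.{1800,}`)
	fmt.Println(err) // invalid repeat count

	// The manually expanded equivalent (1800 literal dots) compiles fine and,
	// used unanchored, matches any line of at least 1800 bytes.
	re := regexp.MustCompile(strings.Repeat(".", 1800))
	fmt.Println(re.MatchString(strings.Repeat("a", 2000))) // true
	fmt.Println(re.MatchString("short line"))              // false
}
```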

@cznic
Contributor

cznic commented Feb 19, 2014

Comment 4:

I don't think a regular expression is the correct tool for checking line length when
there's a much better performing* built-in len(). IOW, this isn't IMO a compelling
reason to extend the limit.
Also, I suggest intentionally leaving the limit undocumented. Otherwise some code can
start to rely on this limit, which it should not.
(*): In your case in several orders of magnitude, both in space and time.

@athomason
Author

Comment 5:

> I don't think a regular expression is the correct tool for checking line length when
there's a much better performing* built-in len().
That's obvious, and ignores that the ability still would have been useful in at least
one real case. It's already allowed with worse syntax and identical performance. Clearly
there's not a policy prohibiting sub-optimal implementations.
> IOW, this isn't IMO a compelling reason to extend the limit.
That's not a reason to not extend it. No thoughtful basis for the current value has been
offered. Is your position that inertia alone is a reason to reject a change that would
be helpful for some people?
> Also, I suggest intentionally leaving the limit undocumented. Otherwise some code can
start to rely on this limit, which it should not.
Do you really prefer the current situation where the only explanation available is in
uncommented source code?

@cznic
Contributor

cznic commented Feb 19, 2014

Comment 6:

> That's obvious, and ignores that the ability still would have been
> useful in at least one real case. It's already allowed with worse
> syntax and identical performance. Clearly there's not a policy
> prohibiting sub-optimal implementations.
Multiplication is useful. Even when someone does it by repetitive addition. But there's
no reason to encourage the incorrect approach by providing support of it.
> That's not a reason to not extend it. No thoughtful basis for the
> current value has been offered. Is your position that inertia alone
> is a reason to reject a change that would be helpful for some people?
That's backward. There should be a reason to extend it.
> Do you really prefer the current situation where the only explanation
> available is in uncommented source code?
Yes, I really do and I even wrote why.

@athomason
Author

Comment 7:

I am dumbfounded by this commitment to needlessly hindering and confusing users. We'll
have to just disagree.

@cznic
Contributor

cznic commented Feb 20, 2014

Comment 8:

Disagreement is not a problem, is it? ;-)
Correct/proper use of the regexp package does not (and should not) run into the
limitation. If it did, there would be a reason to raise the limit. That's IMO why the
limit is already so impractically high. I'd guess ~100 would suffice for 99% of
legitimate cases.

@athomason
Author

Comment 9:

At least you're willing to admit to the existence of legitimate cases that are somehow
unworthy of maintenance-free support.

@rsc
Contributor

rsc commented Mar 3, 2014

Comment 10:

The limit is there because - unlike all the other implementations you mentioned -
package regexp implements x{1000} by making 1000 copies of x. There has to be a limit.
The other implementations use a counter, but using a counter precludes them from
providing a useful runtime guarantee. We chose predictable performance; they chose an
algorithm that allows input+regexp combinations that will run until the sun burns out.
But having paide that cost, they can support larger counts quite "cheaply".
See swtch.com/~rsc/regexp/regexp1.html for more details.
Will fix docs for Go 1.3.
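As an illustration of that copy-based expansion, the public regexp/syntax API can show
the compiled program growing with the repeat count; progSize below is an illustrative
helper, and the exact instruction counts may vary by Go version:

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// progSize reports the number of instructions in the compiled program
// for a pattern. (Illustrative helper; panics on bad input for brevity.)
func progSize(pattern string) int {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		panic(err)
	}
	// Simplify expands x{n} into n copies of x before compilation.
	prog, err := syntax.Compile(re.Simplify())
	if err != nil {
		panic(err)
	}
	return len(prog.Inst)
}

func main() {
	fmt.Println(progSize(`x{10}`))   // on the order of a dozen instructions
	fmt.Println(progSize(`x{1000}`)) // on the order of a thousand instructions
}
```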

Labels changed: added release-go1.3, documentation.

Status changed to Accepted.

@gopherbot

Comment 11 by adam@fastly.com:

Thanks for the note and link. A doc fix seems appropriate given the implementation.

@athomason
Author

Comment 12:

For the benefit of anyone coming across this issue in the future, the trivial workaround
is to split the quantified expression into repeated copies of the desired match. See
http://play.golang.org/p/XjdKuTKx3t for an example for /x{n}/. /x{n,}/ as originally
encountered is not much more complicated.
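A sketch of that workaround, in case the playground link stops resolving; the
repeatPattern helper below is illustrative, not part of the linked snippet:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// repeatPattern writes out n copies of sub as a non-capturing group,
// the manual equivalent of (?:sub){n}. (Illustrative helper.)
func repeatPattern(sub string, n int) string {
	return strings.Repeat("(?:"+sub+")", n)
}

func main() {
	// Equivalent of /x{1800}/, which regexp.Compile would reject.
	exact := regexp.MustCompile("^" + repeatPattern("x", 1800) + "$")
	fmt.Println(exact.MatchString(strings.Repeat("x", 1800))) // true

	// Equivalent of the original /x{1800,}/: 1800 copies followed by x*.
	atLeast := regexp.MustCompile(repeatPattern("x", 1800) + "x*")
	fmt.Println(atLeast.MatchString(strings.Repeat("x", 2500))) // true
	fmt.Println(atLeast.MatchString(strings.Repeat("x", 100)))  // false
}
```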

@robpike
Contributor

robpike commented Mar 25, 2014

Comment 13:

This issue was closed by revision 929ee59.

Status changed to Fixed.

@rsc rsc added this to the Go1.3 milestone Apr 14, 2015
@rsc rsc removed the release-go1.3 label Apr 14, 2015
@golang golang locked and limited conversation to collaborators Jun 25, 2016