regexp: behavior on invalid UTF-8 input is undocumented and inconsistent #38006

Closed
abacabadabacaba opened this issue Mar 22, 2020 · 3 comments

@abacabadabacaba

$ go version
go version go1.14.1 linux/amd64

The documentation for the regexp package says that "All characters are UTF-8-encoded code points". However, it says nothing about what happens when the input is not valid UTF-8.

The current behavior on invalid UTF-8 is not consistent. For example, when matching against the string "\xff", the pattern `a` doesn't match, the pattern `\x{fffd}` doesn't match either, but the pattern `a|\x{fffd}` surprisingly does match.

The documentation should be updated to specify the behavior of the regexp package on invalid UTF-8 input. Also, the behavior should be made more consistent (for example, always convert every undecodable byte into \ufffd).
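To make the suggested rule concrete, here is a caller-side sketch of "every undecodable byte becomes \ufffd". The function `replaceInvalidBytes` is hypothetical and only illustrates the proposed semantics. It is not what the regexp package does internally. It relies on `utf8.DecodeRuneInString` returning `(utf8.RuneError, 1)` for each invalid byte.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// replaceInvalidBytes returns s with every byte that is not part of a valid
// UTF-8 sequence replaced by U+FFFD. Hypothetical helper for illustration.
func replaceInvalidBytes(s string) string {
	if utf8.ValidString(s) {
		return s // nothing to replace
	}
	var b strings.Builder
	for i := 0; i < len(s); {
		// For an invalid byte, DecodeRuneInString returns (utf8.RuneError, 1),
		// so each bad byte is consumed and replaced individually.
		r, size := utf8.DecodeRuneInString(s[i:])
		b.WriteRune(r)
		i += size
	}
	return b.String()
}

func main() {
	fmt.Printf("%q\n", replaceInvalidBytes("a\xffb"))
}
```

Normalizing input this way before matching would give the same answer for `\x{fffd}` alone and for `a|\x{fffd}`, at the cost of changing byte offsets in the matched string.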

@andig
Contributor

andig commented Jun 28, 2020

Different issue but similar context: it would also be interesting to use regexes with "binary" patterns that operate on raw bytes, irrespective of UTF-8 code points. This does not seem to be supported at all (while it is in Python).

@davecheney
Contributor

@andig please open a new issue for binary regex. Thank you.

@rsc
Contributor

rsc commented Oct 6, 2021

Duplicate of #48749.

@rsc closed this as completed on Oct 6, 2021
@rsc removed their assignment on Jun 23, 2022
@golang locked and limited conversation to collaborators on Jun 23, 2023