regexp: unescape unicode sequences in regex strings #55884

Closed
anuraaga opened this issue Sep 27, 2022 · 2 comments

@anuraaga
Contributor

What version of Go are you using (go version)?

$ go version
1.19 (playground)

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

Playground

What did you do?

https://go.dev/play/p/KhkX1e9_0qg

I am trying to parse a file with regular expressions and found that, while \x5c-style escape sequences for ASCII bytes parse, Unicode escape sequences do not: they fail with a parse error. Also, a UTF-8 sequence of \x bytes seems to parse but does not actually match Unicode characters.
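For readers without access to the playground link, here is a minimal sketch of the three cases described above. It is not the code from the playground; the `matches` helper and the choice of ハ (U+30CF, UTF-8 bytes E3 83 8F) as the test character are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// matches is a hypothetical helper: it reports whether pattern compiles
// and, if it does, whether it matches s.
func matches(pattern, s string) (bool, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return false, err
	}
	return re.MatchString(s), nil
}

func main() {
	cases := []struct{ pattern, input string }{
		{`\x5c`, `\`},          // ASCII byte escape: compiles and matches a backslash
		{`\u30cf`, "ハ"},        // \u escape: rejected by the parser
		{`\xe3\x83\x8f`, "ハ"},  // UTF-8 bytes as \x escapes: compiles, does not match ハ
	}
	for _, c := range cases {
		ok, err := matches(c.pattern, c.input)
		fmt.Printf("%-14s matched=%v err=%v\n", c.pattern, ok, err)
	}
}
```

The third case compiles because each \xNN is read as its own code point (U+00E3, U+0083, U+008F) rather than as one byte of a UTF-8 encoding, which is why it never matches ハ.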

With string literals in code, it's simple to use a quoted string to build expressions, but when reading from a file everything is effectively a raw string. Unquoting the whole string before compiling the regex has the problem that, for example, \x5c is unquoted into \, which will generally cause the regex to fail (in regex syntax it is supposed to be \\). So the way to read patterns from a file that include both byte escapes like \x5c and Unicode sequences seems to be to unquote only the \uxxxx sequences and leave the others alone. It seems reasonable for regexp compilation itself to do this, though, instead of just failing to compile.

What did you expect to see?

The ability to specify Unicode characters with escape sequences in a regex.

What did you see instead?

Unicode sequences don't parse, and UTF-8 byte sequences don't match.

@anuraaga
Contributor Author

anuraaga commented Sep 27, 2022

Note that if nothing were unescaped at all, the behavior would at least be consistent. However, expressions like \x5c do get unescaped by the regex engine into a match on a backslash, so some unescaping is happening, just not of Unicode sequences or UTF-8 bytes.
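One way to see the unescaping happen is to look at what the parser produces, sketched here with the regexp/syntax package. `literalRunes` is a hypothetical helper, and this assumes the pattern parses down to a single literal node, which should hold for plain escape sequences like the ones discussed.

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// literalRunes (hypothetical) parses pattern with the same Perl flags that
// regexp.Compile uses and returns the code points of the resulting literal.
func literalRunes(pattern string) ([]rune, error) {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		return nil, err
	}
	if re.Op != syntax.OpLiteral {
		return nil, fmt.Errorf("pattern %q did not parse to a single literal", pattern)
	}
	return re.Rune, nil
}

func main() {
	for _, p := range []string{`\x5c`, `\xe3\x83\x8f`} {
		runes, err := literalRunes(p)
		if err != nil {
			fmt.Println(p, "error:", err)
			continue
		}
		// %U prints each rune as a Unicode code point.
		fmt.Printf("%-14s -> %U\n", p, runes)
	}
}
```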

@anuraaga
Contributor Author

Sorry, I just found that the syntax is \x{30cf}, not \u30cf or \x30cf, which are what I had tried. That works fine, so there's nothing wrong here.
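For anyone landing here with the same problem, a small sketch of the working syntax next to the \x30cf form that was tried. The `match` helper is hypothetical and only wraps regexp.MustCompile for brevity.

```go
package main

import (
	"fmt"
	"regexp"
)

// match (hypothetical) compiles pattern and reports whether it matches s.
// It panics on an invalid pattern, which is fine for a demo.
func match(pattern, s string) bool {
	return regexp.MustCompile(pattern).MatchString(s)
}

func main() {
	// Braced hex escape: names the code point U+30CF directly.
	fmt.Println(match(`\x{30cf}`, "ハ"))

	// Unbraced form: \x takes exactly two hex digits, so this is
	// \x30 (the digit 0) followed by the literal text "cf".
	fmt.Println(match(`\x30cf`, "ハ"))
	fmt.Println(match(`\x30cf`, "0cf"))
}
```

Because the braced form is plain regex syntax, a pattern such as \x5c\x{30cf} can be read from a file and passed to regexp.Compile without any unquoting step.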

@golang golang locked and limited conversation to collaborators Sep 27, 2023