regexp: unescape unicode sequences in regex strings #55884

anuraaga · 2022-09-27T03:29:10Z

What version of Go are you using (`go version`)?

$ go version
1.19 (playground)

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (`go env`)?

Playground

What did you do?

https://go.dev/play/p/KhkX1e9_0qg

I am trying to parse a file with regular expressions and found that `\x5c`-style escape sequences for ASCII bytes can be parsed, but Unicode sequences cannot: they fail with a parse error. Also, a UTF-8 series of `\x` bytes seems to parse but does not actually match Unicode characters.

With string literals in code, it's simple to use a quoted string to build expressions, but when reading from a file everything is effectively a raw string. Unquoting the whole string before compiling the regex has a problem: for example, `\x5c` is unquoted into `\`, which will generally cause the regex to fail (in regex syntax it is supposed to be `\\`). So the way to read patterns from a file that include byte escapes like `\x5c` and Unicode sequences seems to be to unquote only `\uxxxx` sequences and leave the others alone. It seems reasonable for the regexp compilation itself to do this, though, instead of just failing to compile.

What did you expect to see?

The ability to specify Unicode characters with escape sequences in a regex.

What did you see instead?

Unicode sequences don't parse, and UTF-8 byte sequences don't match.


anuraaga · 2022-09-27T03:31:10Z

Note that if nothing were unescaped, the behavior would at least be consistent. But currently expressions like `\x5c` do seem to be unescaped by the regex engine into a match on a backslash, so some unescaping is happening, just not of Unicode sequences or UTF-8 bytes.

anuraaga · 2022-09-27T07:54:00Z

Sorry, I just found that the syntax is `\x{30cf}`, not `\u30cf` or `\x30cf`, which is what I had tried. That works fine, so there's nothing wrong here.
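For anyone landing here with the same question, the difference between the escape forms can be checked with a short program. This is a sketch rather than the original playground code; `matches` is a hypothetical helper, and the patterns are written as Go raw strings to stand in for text read from a file.

```go
package main

import (
	"fmt"
	"regexp"
)

// matches is a hypothetical helper: it compiles pattern and reports
// whether it matches input, returning any compile error unchanged.
func matches(pattern, input string) (bool, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return false, err
	}
	return re.MatchString(input), nil
}

func main() {
	const ha = "ハ" // U+30CF, encoded in UTF-8 as e3 83 8f

	patterns := []string{
		`\x{30cf}`,     // braced hex escape: one code point
		`\u30cf`,       // \u is not part of Go's regexp syntax
		`\xe3\x83\x8f`, // each \xNN is a separate code point, not a raw byte
	}
	for _, p := range patterns {
		ok, err := matches(p, ha)
		if err != nil {
			fmt.Printf("%-14s compile error: %v\n", p, err)
			continue
		}
		fmt.Printf("%-14s matches %s: %v\n", p, ha, ok)
	}
}
```

Only the braced form compiles and matches the character. The `\u` form is rejected at compile time, and the three-escape form compiles but looks for the code points U+00E3, U+0083, and U+008F instead of the single character.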

anuraaga closed this as completed Sep 27, 2022

golang locked and limited conversation to collaborators Sep 27, 2023

gopherbot added the FrozenDueToAge label Sep 27, 2023
