I am trying to parse a file with regular expressions and found that `\x5c`-style escape sequences for ASCII bytes are accepted, but `\uXXXX` Unicode escape sequences are rejected with a parse error. In addition, a UTF-8 encoding written as a series of `\x` byte escapes parses but does not actually match the Unicode character it encodes.
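The behavior described above can be reproduced with a small program. This is only a sketch: `tryMatch` is a hypothetical helper, and the patterns are illustrative rather than the ones from the linked playground. The last case uses the braced form `\x{...}`, which the `regexp/syntax` documentation lists as a hex character code, so it may serve as a workaround.

```go
package main

import (
	"fmt"
	"regexp"
)

// tryMatch (hypothetical helper) reports whether pattern compiles and,
// if it does, whether it matches input.
func tryMatch(pattern, input string) (compiled, matched bool) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return false, false
	}
	return true, re.MatchString(input)
}

func main() {
	cases := []struct{ pattern, input string }{
		{`\x5c`, `\`},     // ASCII byte escape: compiles and matches a backslash
		{`\u00e9`, "é"},   // \u escape: rejected by the parser
		{`\xc3\xa9`, "é"}, // UTF-8 bytes of é: parsed as two separate code points
		{`\x{e9}`, "é"},   // braced hex code point: compiles and matches é
	}
	for _, c := range cases {
		compiled, matched := tryMatch(c.pattern, c.input)
		fmt.Printf("%s compiled=%t matched=%t\n", c.pattern, compiled, matched)
	}
}
```

The `\xc3\xa9` case compiles because each `\xNN` is read as one code point (U+00C3 followed by U+00A9), not as a raw byte, which is why it does not match `é` (U+00E9).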
With string literals in code, it is simple to build expressions from a quoted string, but when reading from a file everything is effectively a raw string. Unquoting the whole string before compiling the regex does not work either: for example, `\x5c` is unquoted into `\`, which will generally cause the regex to fail (in regex syntax it has to be `\\`). So the way to read patterns from a file that contain both byte escapes like `\x5c` and Unicode sequences seems to be to unquote only the `\uXXXX` sequences and leave the rest alone. It seems reasonable for the regexp compilation itself to do this instead of failing to compile.
What did you expect to see?
Ability to specify unicode characters with escape sequences in a regex.
What did you see instead?
Unicode escape sequences don't parse, and UTF-8 byte sequences don't match.
Note that if nothing were unquoted, the behavior would at least be consistent. Currently, though, expressions like `\x5c` do get unquoted by the regex engine into a match on a backslash, so some unescaping is happening, just not for Unicode sequences or UTF-8 bytes.
What version of Go are you using (`go version`)?
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (`go env`)?
Playground
What did you do?
https://go.dev/play/p/KhkX1e9_0qg