regexp: behavior on invalid UTF-8 input is undocumented and inconsistent #38006

abacabadabacaba · 2020-03-22T20:31:29Z

$ go version
go version go1.14.1 linux/amd64

The documentation for the regexp package says that "All characters are UTF-8-encoded code points". However, it says nothing about what happens when the input is not valid UTF-8.

The current behavior on invalid UTF-8 is inconsistent. For example, when matching against the string "\xff", the pattern a doesn't match and the pattern \x{fffd} doesn't match either, but the pattern a|\x{fffd} surprisingly does match.
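A minimal sketch for reproducing the three cases above. The helper name matchesInvalidByte is made up for illustration. The program only prints what the installed Go release does, since the results reported in this issue are for go1.14.1 and may differ on other versions.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchesInvalidByte is a hypothetical helper for this reproduction.
// It reports whether pattern matches the one-byte string "\xff",
// which is not valid UTF-8.
func matchesInvalidByte(pattern string) bool {
	return regexp.MustCompile(pattern).MatchString("\xff")
}

func main() {
	// The three patterns from the report: a literal, U+FFFD alone,
	// and an alternation of the two.
	for _, p := range []string{`a`, `\x{fffd}`, `a|\x{fffd}`} {
		fmt.Printf("%-14q matches \"\\xff\": %v\n", p, matchesInvalidByte(p))
	}
}
```

Comparing the second and third lines of output shows whether a given Go release treats \x{fffd} the same way on its own as it does inside an alternation.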

The documentation should be updated to specify the behavior of the regexp package on invalid UTF-8 input. Also, the behavior should be made more consistent (for example, always convert every undecodable byte into \ufffd).


andig · 2020-06-28T13:46:29Z

Different issue but similar context: it would also be interesting to use regexes with "binary" patterns, irrespective of UTF-8 code points. It seems this is not supported at all (while it is in Python).

davecheney · 2020-06-29T06:36:06Z

@andig please open a new issue for binary regex. Thank you.

rsc · 2021-10-06T17:57:39Z

Duplicate of #48749.

gopherbot added the Documentation label Mar 22, 2020

robpike assigned rsc Mar 22, 2020

rsc closed this as completed Oct 6, 2021

rsc mentioned this issue Oct 6, 2021

regexp: document and implement invalid UTF-8 treated as U+FFFD #48749

Closed

rsc removed their assignment Jun 23, 2022

golang locked and limited conversation to collaborators Jun 23, 2023

gopherbot added the FrozenDueToAge label Jun 23, 2023
