The documentation for the `regexp` package says that "All characters are UTF-8-encoded code points". However, it doesn't say what happens if the input is not valid UTF-8.
The current behavior on invalid UTF-8 is inconsistent. For example, when matching against the string `"\xff"`, the pattern `a` doesn't match, the pattern `\x{fffd}` doesn't match either, but the pattern `a|\x{fffd}` surprisingly does match.
The documentation should be updated to specify how the `regexp` package behaves on invalid UTF-8 input. The behavior should also be made more consistent (for example, by always converting every undecodable byte into `\ufffd`).
Different issue but similar context: it would also be interesting to use regexes with "binary" patterns, irrespective of UTF-8 code points. It seems this is not supported at all (while it is in Python).
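One possible workaround for byte-oriented matching, sketched under the assumption that only a yes/no match is needed: widen each byte to the code point with the same value (a Latin-1 style mapping), then write byte values in the pattern as `\x{..}`. `bytesToLatin1` is a hypothetical helper, not something `regexp` provides, and match offsets would refer to the widened string rather than the original bytes.

```go
package main

import (
	"fmt"
	"regexp"
)

// bytesToLatin1 is a hypothetical helper: it maps every byte b to the
// code point U+00bb, so the result is always valid UTF-8 and each
// original byte corresponds to exactly one rune.
func bytesToLatin1(data []byte) string {
	runes := make([]rune, len(data))
	for i, b := range data {
		runes[i] = rune(b)
	}
	return string(runes)
}

// startsWithFFFE matches the raw byte sequence 0xFF 0xFE at the start
// of the input, expressed as code points in the widened string.
var startsWithFFFE = regexp.MustCompile(`^\x{ff}\x{fe}`)

func main() {
	data := []byte{0xff, 0xfe, 0x41, 0x00}
	fmt.Println(startsWithFFFE.MatchString(bytesToLatin1(data)))
}
```

The cost is a copy of the input that can be up to twice its size, since bytes at or above 0x80 take two bytes once encoded as UTF-8.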