Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: spec: disallow LTR/RTL characters in string literals? #20209

Open
karalabe opened this issue May 2, 2017 · 34 comments
Open

proposal: spec: disallow LTR/RTL characters in string literals? #20209

karalabe opened this issue May 2, 2017 · 34 comments

Comments

@karalabe
Copy link
Contributor

karalabe commented May 2, 2017

TL;DR https://play.golang.org/p/LPkPTRF7fC

The above code looks quite plain and obvious, except it does something completely different than you'd expect (feel free to run it). The obvious thing that should happen is that it counts the number of bits set in the given string. The non-obvious thing that happens is that the mask is actually 0, not 0x01.

So, what happened there? The abuse in the above code is around the invisible Unicode characters that mark following text to be right-to-left or left-to-right. Since Go permits arbitrary Unicode characters to be present in string literals, it also allows me to have a string of the form "bla bla blaabc". Since we're dealing with valid Unicode sequences here, any modern editor/website will actually interpret those special characters, causing the content in between the two special marks to be reversed to the end of the line (alas still part of the string literal).

In my playground code this is abused by having the following source code:

str, mask := "Hello, World!<rtl>10x<ltr>", 0

Which will be displayed by all modern editors/websites as:

str, mask := "Hello, World!", 0x01

The security aspect of this issue is social engineering attacks. The "display" line of my sample code is obvious beyond doubt, so noone will ever inspect such a thing; however it managed to flip one bit in a mask (imagine doing this for file permission umasks). The issue is that such code could easily get past reviews and into a final product.

The tricky part is how to avoid these issues. My only meaningful suggestion would be to disallow these two special characters in string literals. This does break Go 1.0 compatibility guarantees, however I think it's worth it (I can't really figure out a meaningful use case for it). Using it in a single line string will screw up the display of the source code, so it doesn't make sense to do it, and using it in a multi line string is questionable at best. I think requiring users to explicitly use \x notation for adding these characters it a good compromise to protect source code sanity.

@bradfitz bradfitz changed the title Code obfuscation through invisible right-to-left/left-to-right characters spec: disallow LTR/RTL characters in string literals? May 2, 2017
@bradfitz bradfitz changed the title spec: disallow LTR/RTL characters in string literals? proposal: spec: disallow LTR/RTL characters in string literals? May 2, 2017
@bradfitz bradfitz added this to the Proposal milestone May 2, 2017
@cznic
Copy link
Contributor

cznic commented May 2, 2017

String literals must be able to hold any byte sequence whatsoever. Using a string to hold binary data is not uncommon and this proposal will introduce a forbidden byte sequence.

I understand the problem (nice trick BTW), but I don't think this is a way to solve it.

@bradfitz
Copy link
Contributor

bradfitz commented May 2, 2017

@cznic,

String literals must be able to hold any byte sequence whatsoever.

Yes, but not in their raw form. Only if they're escaped.

Even today:

      s := "`\"`"

... you need to escape the double quote. We could require that RTL/LTR characters also need to be escaped.

@cznic
Copy link
Contributor

cznic commented May 2, 2017

We could require that RTL/LTR characters also need to be escaped.

Yes, but that's unfortunately only a partial solution: https://play.golang.org/p/ZL_QF7xc-e.

@karalabe
Copy link
Contributor Author

karalabe commented May 2, 2017 via email

@bradfitz
Copy link
Contributor

bradfitz commented May 2, 2017

@karalabe, yes.

@cznic
Copy link
Contributor

cznic commented May 2, 2017

It would need to be forbidden from `` as well.

I don't like to say that, but then we are back to square 1 (modulo the backtick). To make things even worse, there's also the Go 1 compatibility promise and I'm quite unsure if such breaking change qualifies under it.

@karalabe
Copy link
Contributor Author

karalabe commented May 2, 2017

That's the catch 22 :)

@karalabe
Copy link
Contributor Author

karalabe commented May 2, 2017

Perhaps a nice compromise (alas it makes for ugly parsing rules and introduces corner cases) is to forbid the two characters in single line strings (irrelevant of the type) and only allow it in multiline strings for lines that are fully inside (i.e. not the same line as the opening or closing tick is on). In reality I think it's enough to disallow for the closing tick's line.

These "forbidden" cases would already even now screw up the code so I cannot imagine anyone benevoletly using the in such a way. Yet by limiting to the closing mark only, we can retain the functionality of rtl characters that some may have in backtick code.

All in all personally I'd just forbid everywhere but I'm not accustomed to rtl use cases so I can't really comment on how people use them in completely valid and meaningful ways.

@jimmyfrasche
Copy link
Member

This could be handled by a linter of some kind, but then people need to be aware of that tool and run it.

I know this doesn't match all the criteria for go vet, but adding it there may be best.

If go vet is added to go test per #18084 it would be run by default in a lot of situations making it harder to sneak this into a codebase.

Of course, that doesn't close the hole: it just makes it smaller.

Given how unlikely this is to be done on purpose without malicious intent, breaking Go 1 (which is allowed when it comes to security) and making it illegal across the board probably wouldn't cause anyone any pain.

@randall77
Copy link
Contributor

I think solving this problem in the language is the wrong place.
How about in your code review tool? It would be as simple as highlighting weird/unusual UTF8, homographs, ... Then a human can decide whether it is ok or not.

If you're not doing code reviews, you have bigger problems.

@karalabe
Copy link
Contributor Author

karalabe commented May 2, 2017

GitHub flat out stated they don't care about these kinds of attack vectors. Given that most Go code is hosted on GitHub, most reviews will be done there too.

Also these two characters are not homoglyphs, they are invisible.

@randall77
Copy link
Contributor

They are currently invisible. I am proposing to make them visible in a code review tool.
I don't feel the need to second-guess GitHub's decision. If you want to protect against this attack, don't use GitHub. There are plenty of code review tools to choose from, one may take up your cause.

@karalabe
Copy link
Contributor Author

karalabe commented May 3, 2017

Developers use GitHub, whether you like it or not. Telling the entire ecosystem to go use something else isn't really feasible.

But even if one project or another would actually go and use something else, you still have the issues when vendoring in code. Do you have the capacity to go over every line of every package you depend on to see if there's something malicious accidentally added to it? In a perfect world, the answer is yes, but we don't live in such a world.

It is easy to blame the user for using it wrong, but if something is easy to use wrong, is really the user to blame? In this instance it's easy to say "just use another tool", but this requires users to be aware of the issue in detail, and to know exactly what other tool to use, why and even then what to look for.

Compared to requiring everyone to change their entire tooling, a minor spec change can solve it elegantly for all, without needing to make the community aware of one more complexity that can end very badly if taken lightly.

@cznic
Copy link
Contributor

cznic commented May 3, 2017

Compared to requiring everyone to change their entire tooling, a minor spec change can solve it elegantly for all, without needing to make the community aware of one more complexity that can end very badly if taken lightly.

While the minor spec change may solve this particular problem, the root cause stays: If you don't properly review [untrusted] code with the proper tools you are to blame yourself for the consequences. There are many other ways how to sneak malicious code into a repository in a way which is obfuscated/hidden/confusing on the same level as what this issue is about. Resolving only one of the cases is not a solution.

@karalabe
Copy link
Contributor Author

karalabe commented May 3, 2017

If you don't properly review [untrusted] code with the proper tools

That's what I'm trying to accomplish here :) Add support for avoiding this into the proper tool, the language.

There are many other ways how to sneak malicious code into a repository in a way which is obfuscated/hidden/confusing on the same level as what this issue is about.

Please provide examples for such. I've yet to see anything so deceitful as this one.

Resolving only one of the cases is not a solution.

It is exactly infinitely better than resolving 0.

@cznic
Copy link
Contributor

cznic commented May 3, 2017

Please provide examples for such. I've yet to see anything so deceitful as this one.

For example, without going into details, any exec.Command(var1, var2, ...), ie. not using string literals. Level 0, admitted, but many people will simply never look at the code at all, less on where the values come from and/or how they are computed at runtime. rot13 would be enough to just move on for many out of sheer laziness.

(I've recently published something with a string encoded blob of gzip of gob of VM code of a C program, source of which is not even included in the repository. Moreover, gob outputs for the same input are not reproducible so is the gzip. Here not even a proper code review code tool can help.)

It is exactly infinitely better than resolving 0.

Well, the difference is 1, yes, but the ratio is actually NaN ;-)

@slrz
Copy link

slrz commented May 3, 2017

These things really seem more appropriate for code review or diff viewing tools. Git already marks up trailing whitespace in screaming red, it could do the same for LTR/RTL marks.

Restricting Go doesn't fullly solve the problem anyway: there are lots of shell scripts and Makefiles in people's repos that you could use instead.

@rsc
Copy link
Contributor

rsc commented May 15, 2017

The open questions for this issue, #20210, and #20115 are:

  • How much should the Go language & implementation be in the business of defending against Unicode-based visual similarity attacks?
  • What tools should do it? (Language spec, vet, something else?)

I don't think we know the answers to any of this.

Without careful thought, it seems like it might be OK for vet to contain these kinds of checks, provided the scope is limited to Unicode similarity problems and the implementation scope can be cleanly restricted. But I don't know.

/cc @robpike @mpvl @nigeltao @golang/proposal-review

@robpike
Copy link
Contributor

robpike commented May 16, 2017

Without discounting the dangers, I don't believe the language is the place to solve this, plus it's a very slippery slope to start stepping around the rich list of funny Unicode characters that can be exploited. The specification could become a mess.

Tooling perhaps, but not the language specification.

@nigeltao
Copy link
Contributor

/cc @SamWhited who also works on golang.org/x/text/secure/precis, amongst other things.

@nigeltao
Copy link
Contributor

@rsc #20215 "crypto/elliptic: different ecdsa.Verify result between p256 amd64 and generic implementations with a zero hash" seems unrelated. Did you have something else in mind?

@nigeltao
Copy link
Contributor

nigeltao commented May 17, 2017

If there's a clean, stable, simple specification of what to avoid, I'm open to bradfitz's idea of requiring escaping: "\u200F" for U+200F RIGHT-TO-LEFT MARK.

If there isn't, I agree with robpike's concerns.

I don't know whether there is, though. @mpvl?

@nigeltao
Copy link
Contributor

nigeltao commented May 17, 2017

@cznic I'm confused as to how prohibiting RTL inside backtick string literals (and regular string literals) brings us "back to square 1". Can you elaborate?

@SamWhited
Copy link
Member

Personally, I agree with Rob that the language itself should not deal with this. I suppose I could be convinced that there was a simple set of rules to apply, or a certain class of character that should never be allowed in a literal, but I suspect there won't be or you'll just run into a number of other places RTLO's or other characters could be put into the code and made to confuse people (maybe you could do something similar with an inline comment block, or with a rune literal, etc.)

This is a matter of display, not syntax, and security concerns related to the display of code are best addressed by your favorite editor or vetting tool (Vim for instance shows me that as "Hello, World!<202e>10x<202d>").

@cznic
Copy link
Contributor

cznic commented May 18, 2017

@nigeltao

I'm confused as to how prohibiting RTL inside backtick string literals (and regular string literals) brings us "back to square 1". Can you elaborate?

I think my "I don't like to say that, but then we are back to square 1 (modulo the backtick)." refers ty my earlier "String literals must be able to hold any byte sequence whatsoever. Using a string to hold binary data is not uncommon and this proposal will introduce a forbidden byte sequence.".

TBH, I'm not sure if I have maybe thought about something different at that time as now my post seems confusing to me too, or at least not clear.

@robpike
Copy link
Contributor

robpike commented May 18, 2017

Maybe something - perhaps the language, perhaps vet - should simply forbid literal non-printing characters above U+007E.

@nigeltao
Copy link
Contributor

Following a discussion with some colleagues, some notes in no particular order:

It's not merely RTL overrides. https://play.golang.org/p/ifwGIJbjBA looks confusing without having non-printing characters. That example is:
const s = "S123S" //456
but looks like:
const s = "456// "S123S
where S is U+0633 ARABIC LETTER SEEN.

For "go get" path confusion, BiDi in general is a pain, let alone punycode. There's RFC 3987 Example 5:
Logical representation: "http://ab.cd.EF/GH/ij/kl.html"
Visual representation: "http://ab.cd.HG/FE/ij/kl.html"

https://bugs.chromium.org/p/chromium/issues/detail?id=683314 is a Chromium bug about "аррӏе.com" where that first term is 5 Cyrillic letters, not the 5 latin letters "apple". In Go: https://play.golang.org/p/j-50CRKctO. I believe that, in Chromium, they have a narrowly defined filter for Cyrillic or Greek homographs for Latin glyphs.

For homograph attacks, not just LTR/RTL, there's http://www.unicode.org/reports/tr39/#Confusable_Detection although IIUC it may be too aggressive, as it marks "m" (em) and "rn" (arr en) as confusable.

The Unicode Cf "Other, Format" category is http://www.fileformat.info/info/unicode/category/Cf/list.htm

In summary, it's complicated, and therefore the response is probably in the tools rather than the language.

@rsc
Copy link
Contributor

rsc commented Jun 5, 2017

Thanks @nigeltao. It definitely sounds to me like we should be experimenting in a tool (maybe vet) and not in the spec. Does someone want to come up with a list of what a vet "unicode" check would check?

@rsc
Copy link
Contributor

rsc commented Jun 5, 2017

Updated my comment above - meant 20115 not 20215.

@rsc
Copy link
Contributor

rsc commented Jun 12, 2017

Does someone want to come up with a list of what a vet "unicode" check would check?

@rsc
Copy link
Contributor

rsc commented Oct 9, 2017

I think we agree that the first step is to add a vet check and only after building experience with it think about actual language restrictions (or not). Marking this proposal-hold until there is a proposal (a short list would be fine) of what a vet "unicode" check would check.

/cc @nigeltao @mpvl

@dolmen
Copy link
Contributor

dolmen commented Aug 13, 2020

In #40717 I propose 3 codepoints to add to that list for a vet check: U+115F, U+1160, U+3164

@holiman
Copy link

holiman commented Nov 1, 2021

"‘Trojan Source’ Bug Threatens the Security of All Code": https://krebsonsecurity.com/2021/11/trojan-source-bug-threatens-the-security-of-all-code/

Apparently some researchers with the University of Cambridge have given this (type of) bug a cool-sounding name now.

@komuw
Copy link
Contributor

komuw commented Nov 1, 2021

stevendanna added a commit to stevendanna/cockroach that referenced this issue Nov 1, 2021
This PR adds a linter to disallow the use of directional formatting
characters to help prevent them being used to get malicious code past
code review.

Ideally our code-review tool would highlight such characters for us
since such characters might routinely appear in binary artifacts.

See also:

- https://www.trojansource.codes/
- https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html
- golang/go#20209

Release note: None
@golang golang locked and limited conversation to collaborators Nov 2, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests