Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crypto/x509: Mac without CGO is very slow, times out with large numbers of certs #38215

Closed
deitch opened this issue Apr 2, 2020 · 14 comments
Closed

Comments

@deitch
Copy link

deitch commented Apr 2, 2020

What version of Go are you using (go version)?

$ go version
go version go1.14 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOHOSTOS="darwin"

What did you do?

  1. Run on a mac with a large but not unreasonable number of certificates.
  2. Try to connect to https endpoint with a binary compiled with CGO_ENABLED=0
  3. Set a connection timeout of a few seconds
  4. Watch the connection time out

Even without a timeout, the initial response can take many seconds. The default mac has ~22 certificates and it can take 3-4 seconds. Any corporate system with more certs can take many multiples.

What did you expect to see?

Reasonable return times

What did you see instead?

Long waits and timeouts.

Lots More Detail

I spent quite some time tracking it down. When compiling with CGO_ENABLED=1, which is fine, although may require more tools installed and, critically, makes cross-compiling and some CI nearly impossible, the lookup time is roughly well below O(n). When compiling with CGO_ENABLED=0, which is desired when possible, every cert in the system increases the time to verify the server cert.

The core code issue is in root_darwin.go, and specifically execSecurityRoots().

It does the following (simplified):

  1. gets the root certs via exec'ing /usr/bin/security with appropriate args
  2. checks each of the root certs by exec'ing /usr/bin/security on it with appropriate args

The above is, essentially, O(n), and rises linearly with the number of certs. The comments here indicate that it should be optimized to reasonably close to with CGO, but I haven't found that to be the case. I have had several runs where 160 certs and a timeout of 5 seconds would, well, time out.

I created a repo with a program which recreates it. It also includes directions and output from my mac. In my case, it was 244ms for CGO and 1.7s for no CGO. I worked with someone who had ~160 certs (not unreasonable); it took well over 5 seconds.

repo here

@deitch
Copy link
Author

deitch commented Apr 2, 2020

cc @justincormack @thaJeztah

@deitch
Copy link
Author

deitch commented Apr 2, 2020

Aha. Just came across #32604 (and am not at all surprised to find @zx2c4 on it :-) )

@deitch
Copy link
Author

deitch commented Apr 2, 2020

Is this a duplicate? In this case, it isn't just about optimization, but about real issues that arise because of that.

@thaJeztah
Copy link
Contributor

/cc @FiloSottile for triage / if it's a duplicate

@deitch
Copy link
Author

deitch commented Apr 2, 2020

I actually am unsure what the logic is doing and why it is necessary. As far as I can tell, it does the following:

  1. Get the ID of all the user certs for the current user and admin certs that have non-default trust policies, and save them
  2. Get all of the certs and system certs across, well, everywhere
  3. For each of those many certs:
    • if it is a system cert, keep it
    • if it is a user or admin cert and in the list of "certs with policies", verify it against the "ssl" policy; if it passes, keep it - this is the part that takes a lot of time and grows linearly
  4. return the ones that passed

The part that surprise me is:

  • is there not a single call (exec or system call) that just says, "give me all of the valid root certs for ssl"?
  • is there a way to just get that info from the output of /usr/bin/security trust-settings-export?

In the end, we just return the system certs and all of those from the two trust-settings-export calls that pass to be used for ssl.

@FiloSottile
Copy link
Contributor

Yeah, the nocgo fallback is absolutely not great. The policy logic is extremely complex and we've gotten it wrong a few times before. The APIs don't help us.

The two options here are:

  1. run trust-settings-export, parse XML, apply similar but not identical logic to the cgo one to evaluate the policies;
  2. figure out how to call into Security.framework without cgo.

I'd rather keep only one copy of this logic, so I want to try (2) before (1). So yeah, duplicate of #32604, but I appreciate the detailed user impact report, it helps with prioritization.

@deitch
Copy link
Author

deitch commented Apr 2, 2020

I'd rather keep only one copy of this logic

I agree. Given the material impact it can have on end-users, and it doesn't look like option 2 is happening too soon (I am very happy to be wrong), can we consider some optimization to the nocgo fallback to carry us through?

Also, is it worth documenting somewhere reasonably prominent, maybe in net/http and/or crypto/x509? You don't want to know how long I spent digging into it, trying to understand why I had issues with user on macbook A and not on macbook B. :-(
I figure I am not the only one.

So yeah, duplicate of #32604, but I appreciate the detailed user impact report, it helps with prioritization

No problem. :-)

@FiloSottile
Copy link
Contributor

it doesn't look like option 2 is happening too soon (I am very happy to be wrong)

Let me see what I can do =)

It really is a bad and inconsistent experience, sorry about that.

@deitch
Copy link
Author

deitch commented Apr 2, 2020

I can try and give a hand. I am not sure if I captured the logic and options correctly here

@FiloSottile
Copy link
Contributor

We'd basically drop the nocgo code and port the cgo one to hardcoded references to Security.framework. The hard part is in what the linker needs to do for that, so I reached out to the cmd/link owners. I'll keep #32604 up to date!

@deitch
Copy link
Author

deitch commented Apr 2, 2020

Thanks. I subscribed to that one.

If it really is hard, perhaps we can just do what you suggested and parse the xml, buy is some breathing room?

@FiloSottile
Copy link
Contributor

We'll see. I would really regret depending on encoding/xml in crypto/x509 and rewriting all that logic.

@gopherbot
Copy link

Change https://golang.org/cl/227037 mentions this issue: crypto/x509: use Security.framework without cgo for roots on macOS

@deitch
Copy link
Author

deitch commented Apr 16, 2020

FWIW, I ran the gotip version that @FiloSottile put up in the link. It passed all tests, and the time is effectively identical to cgo. Also had someone whose cert store is quite large and it took up to 10s before with nocgo. With this patch, 1.3s. Much better!

gopherbot pushed a commit that referenced this issue May 7, 2020
+----------------------------------------------------------------------+
| Hello, if you are reading this and run macOS, please test this code: |
|                                                                      |
| $ GO111MODULE=on go get golang.org/dl/gotip@latest                   |
| $ gotip download                                              |
| $ GODEBUG=x509roots=1 gotip test crypto/x509 -v -run TestSystemRoots |
+----------------------------------------------------------------------+

We currently have two code paths to extract system roots on macOS: one
uses cgo to invoke a maze of Security.framework APIs; the other is a
horrible fallback that runs "/usr/bin/security verify-cert" on every
root that has custom policies to check if it's trusted for SSL.

The fallback is not only terrifying because it shells out to a binary,
but also because it lets in certificates that are not trusted roots but
are signed by trusted roots, and because it applies some filters (EKUs
and expiration) only to roots with custom policies, as the others are
not passed to verify-cert. The other code path, of course, requires cgo,
so can't be used when cross-compiling and involves a large ball of C.

It's all a mess, and it broke oh-so-many times (#14514, #16532, #19436,
 #20990, #21416, #24437, #24652, #25649, #26073, #27958, #28025, #28092,
 #29497, #30471, #30672, #30763, #30889, #32891, #38215, #38365, ...).

Since macOS does not have a stable syscall ABI, we already dynamically
link and invoke libSystem.dylib regardless of cgo availability (#17490).

How that works is that functions in package syscall (like syscall.Open)
take the address of assembly trampolines (like libc_open_trampoline)
that jump to symbols imported with cgo_import_dynamic (like libc_open),
and pass them along with arguments to syscall.syscall (which is
implemented as runtime.syscall_syscall). syscall_syscall informs the
scheduler and profiler, and then uses asmcgocall to switch to a system
stack and invoke runtime.syscall. The latter is an assembly trampoline
that unpacks the Go ABI arguments passed to syscall.syscall, finally
calls the remote function, and puts the return value on the Go stack.
(This last bit is the part that cgo compiles from a C wrapper.)

We can do something similar to link and invoke Security.framework!

The one difference is that runtime.syscall and friends check errors
based on the errno convention, which Security doesn't follow, so I added
runtime.syscallNoErr which just skips interpreting the return value.
We only need a variant with six arguments because the calling convention
is register-based, and extra arguments simply zero out some registers.

That's plumbed through as crypto/x509/internal/macOS.syscall. The rest
of that package is a set of wrappers for Security.framework and Core
Foundation functions, like syscall is for libSystem. In theory, as long
as macOS respects ABI backwards compatibility (a.k.a. as long as
binaries built for a previous OS version keep running) this should be
stable, as the final result is not different from what a C compiler
would make. (One exception might be dictionary key strings, which we
make our own copy of instead of using the dynamic symbol. If they change
the value of those strings things might break. But why would they.)

Finally, I rewrote the crypto/x509 cgo logic in Go using those wrappers.
It works! I tried to make it match 1:1 the old logic, so that
root_darwin_amd64.go can be reviewed by comparing it to
root_cgo_darwin_amd64.go. The only difference is that we do proper error
handling now, and assume that if there is no error the return values are
there, while before we'd just check for nil pointers and move on.

I kept the cgo logic to help with review and testing, but we should
delete it once we are confident the new code works.

The nocgo logic is gone and we shall never speak of it again.

Fixes #32604
Fixes #19561
Fixes #38365
Awakens Cthulhu

Change-Id: Id850962bad667f71e3af594bdfebbbb1edfbcbb4
Reviewed-on: https://go-review.googlesource.com/c/go/+/227037
Reviewed-by: Katie Hockman <katie@golang.org>
@golang golang locked and limited conversation to collaborators Apr 16, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants