Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crypto/x509: intermittent failures on darwin/amd64 with ‘x509: “…” certificate is expired’ since 2021-11-06 #51209

Closed
bcmills opened this issue Feb 15, 2022 · 22 comments
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. OS-Darwin release-blocker
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Feb 15, 2022

2022/02/14 12:08:07 failed to run `go mod download -json github.com/google/wycheproof@v0.0.0-20191219022705-2196000605e4`, output: {
	"Path": "github.com/google/wycheproof",
	"Version": "v0.0.0-20191219022705-2196000605e4",
	"Error": "github.com/google/wycheproof@v0.0.0-20191219022705-2196000605e4: Get \"https://proxy.golang.org/github.com/google/wycheproof/@v/v0.0.0-20191219022705-2196000605e4.info\": x509: “misc-sni.google.com” certificate is expired"
}
FAIL	golang.org/x/crypto/internal/wycheproof	0.555s

greplogs --dashboard -md -l -e 'x509: “misc-sni.google.com” certificate is expired'

2022-02-14T20:07:02-8634188-b2cb1bd/darwin-amd64-11_0
2022-02-11T19:36:36-f4118a5-0bde2cf/darwin-amd64-11_0
2022-01-10T21:48:09-03fcf44-1f411e9/darwin-amd64-11_0
2021-12-09T19:01:08-03fcf44-8ff254e/darwin-amd64-11_0
2021-12-08T18:06:06-5770296-7b7efd7/darwin-amd64-11_0
2021-12-08T15:15:40-03fcf44-08025a9/darwin-amd64-11_0
2021-11-12T18:57:22-c33205f-b1b6d92/darwin-amd64-10_15
2021-11-06T19:41:15-ec00351-61d789d/darwin-amd64-11_0

These failures seem to always occur when fetching from proxy.golang.org, but not for any specific package. (They occur intermittently across many different repos and tests.) However, I have only observed them on darwin/amd64 builders. I'm not sure where in the stack these failures are introduced — whether they're due to a bug in the darwin kernel, a bug (perhaps a race?) in crypto/x509, some misconfiguration on the builders, or something else entirely.

These failures also seem to have started recently (during or shortly before the Go 1.18 code freeze).

@golang/security: since darwin/amd64 is a first class port, could someone assess whether this is a Go 1.18 regression, and whether and how we should mitigate these errors?

@bcmills bcmills added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Feb 15, 2022
@bcmills bcmills added this to the Go1.18 milestone Feb 15, 2022
@rsc
Copy link
Contributor

rsc commented Feb 15, 2022

Is it possible the time/date on those machines is occasionally very wrong?
If it was a consistent regression we'd see it more often.
The current misc-sni.google.com cert (served from proxy.golang.org)
seems to be issued Jan 16, expiring April 10.
I don't see why failures would have happened on Feb 11 and Feb 14,
unless the time/date is only intermittently wrong on the builders.

I use proxy.golang.org on my Mac a bunch (as do many many users)
and have not seen this error myself. I suspect it is a bad builder somehow.

@FiloSottile
Copy link
Contributor

@rolandshoemaker, can you have a look? I wonder if there's something horrible happening at the macOS API call level, since that error is coming from the platform, I believe.

@rsc
Copy link
Contributor

rsc commented Feb 15, 2022

Thanks. The new ones seem to be *.storage.googleapis.com. The pair at 2021-12-08T18:06:06 is interesting. The non-misc-sni one - https://build.golang.org/log/a88423e7aee0231946f46eebf180dbfb8488aa7e - is a bunch of errors like:

kubernetes/gke/gke.go:26:2: google.golang.org/api@v0.51.0: Get "https://storage.googleapis.com/proxy-golang-org-prod/58f63afcba03a548-google.golang.org:api-v0.51.0.zip?Expires=1639073973&GoogleAccessId=gcs-urlsigner-prod%40golang-modproxy.iam.gserviceaccount.com&Signature=VlDXYR3ipiceXj5QbsCgxfwQ4k74yqzSnBgyaKnoOYe16Z4DWY4y1dUV9hHnP0JzQunHWV8BFkrtBKoACBIorlnt2NyoZAKdYw%2BXcJZd5dLxON4rrDPcKE1%2B8KmX%2B78EiifNJNIcd96oc7VqCNfhNF%2BiZk%2FDlS8twos4VOTwd5jNaCVyPeFmCH%2ByDHmKTbvdslJBsXDgec1WXN9QsY5hTyKQw5ukUGQ%2F9%2FSpBEaUpKe8drzEzvqCI0kRhqP38btgrpiEGFFrT4YQGDUMqG4GB%2BeGpz%2BNcp8zqTWz1uVS16NkT9qmzS5V7Bm6Et3cCR4xuXqTiKTYLoLPXJpjRWlFXA%3D%3D": x509: “*.storage.googleapis.com” certificate is expired

I guess the proxy must be issuing redirects for those packages rather than serving bytes itself? The final error still has the Unicode quotes indicative of a macOS error.

@FiloSottile
Copy link
Contributor

If for some reason Go knows the correct time (which it does because the time in the log lines is correct), while macOS somehow doesn't, this being noticed for the first time would make sense because we switched to using the macOS verifier in Go 1.18.

Alternatively, there might be an intermittent bug that corrupts the time provided to the verification API. (For example if we're getting the ABI wrong it might depend on uninitialized memory. There's an assembly trampoline involved.)

@rsc
Copy link
Contributor

rsc commented Feb 15, 2022

I don't see any evidence of what Go thinks the time is. I don't see a single timestamp in that entire log. Maybe I am missing it?

@rsc
Copy link
Contributor

rsc commented Feb 15, 2022

(The file names are assigned later, by the coordinator.)

@FiloSottile
Copy link
Contributor

2022/02/14 12:08:07 in 2022-02-14T20:07:02-8634188-b2cb1bd/darwin-amd64-11_0

@bcmills bcmills changed the title crypto/x509: intermittent failures on darwin/amd64 with ‘x509: “misc-sni.google.com” certificate is expired’ since 2021-11-06 crypto/x509: intermittent failures on darwin/amd64 with ‘x509: “…” certificate is expired’ since 2021-11-06 Feb 15, 2022
@rsc
Copy link
Contributor

rsc commented Feb 15, 2022

Looks like we only pass an explicit time in during testing. If we can reproduce it, we could pass a time in unconditionally.

@rsc
Copy link
Contributor

rsc commented Feb 18, 2022

I can reproduce this on my M1 Mac Mini. I will try to get more information about the failure.

@rsc
Copy link
Contributor

rsc commented Feb 18, 2022

I am working on establishing an accurate failure rate, but anecdotally I've never seen this program run for longer than 30 minutes without provoking a failure.

I added debugging to crypto/x509 to print the failed certs and as expected they are the right ones, with valid NotBefore/NotAfter ranges nowhere near the current time.

I am going to let it run for the rest of the day to establish a reliable baseline failure rate, and then I will try setting the time unconditionally when invoking the macOS API.

@rsc
Copy link
Contributor

rsc commented Feb 18, 2022

This program calls x509.Verify directly, from a bunch of goroutines, over and over, on the cert chain from proxy.golang.org. It fails pretty often:

2022/02/18 16:24:56 FAIL 1 (in 2.109s): x509: “misc-sni.google.com” certificate is expired
2022/02/18 16:24:57 FAIL 2 (in 3.065s): x509: “misc-sni.google.com” certificate is expired
2022/02/18 16:25:00 FAIL 3 (in 6.659s): x509: “misc-sni.google.com” certificate is expired
2022/02/18 16:25:02 FAIL 4 (in 8.741s): x509: “misc-sni.google.com” certificate is expired
2022/02/18 16:25:13 FAIL 5 (in 19.139s): x509: “misc-sni.google.com” certificate is expired
2022/02/18 16:25:17 FAIL 6 (in 23.597s): x509: “misc-sni.google.com” certificate is expired
2022/02/18 16:25:31 FAIL 7 (in 37.134s): x509: “misc-sni.google.com” certificate is expired
2022/02/18 16:25:31 FAIL 8 (in 37.686s): x509: “misc-sni.google.com” certificate is expired
2022/02/18 16:25:42 FAIL 9 (in 48.810s): x509: “misc-sni.google.com” certificate is expired
2022/02/18 16:25:44 FAIL 10 (in 50.137s): x509: “misc-sni.google.com” certificate is expired
2022/02/18 16:25:45 FAIL 11 (in 51.371s): x509: “misc-sni.google.com” certificate is expired
2022/02/18 16:25:57 FAIL 12 (in 63.067s): x509: “misc-sni.google.com” certificate is expired

The call is

_, err = c1.Verify(x509.VerifyOptions{
	DNSName:       "proxy.golang.org",
	Intermediates: pool,
	CurrentTime:   time.Now(),
})

If I remove the CurrentTime line, then the problem goes away. This strongly suggests that the time conversion is the problem. And looking more carefully at crypto/tls, it does set CurrentTime unconditionally in handshake_client.go.

The time conversion is done in crypto/x509/internal/macos:

func TimeToCFDateRef(t time.Time) CFRef {
	secs := t.Sub(time.Date(2001, 1, 1, 0, 0, 0, 0, time.UTC)).Seconds()
	ref := CFDateCreate(int(secs))
	return ref
}

The only problem is that CFDateCreate takes a CFAbsoluteTime and a CFAbsoluteTime is seconds since 2001-01-01, yes, but as a CFTimeInterval, which in turn is a double (float64).

We are passing an integer, in an integer register, and leaving the floating-point register where CFDateCreate expects to find its argument completely uninitialized.

Honestly it is amazing that it works as often as it does (most of the time!). I wonder if SecTrustSetVerifyDate treats super weird dates as if the call never happened and falls back to the current date. Or maybe X0 is very often zero, and zero is special, or maybe something else...

In any event, this hack makes the program run without failure for many minutes, when before it failed every few seconds:

% git diff 
diff --git a/src/crypto/x509/internal/macos/corefoundation.go b/src/crypto/x509/internal/macos/corefoundation.go
index eedc41a183..00925292a5 100644
--- a/src/crypto/x509/internal/macos/corefoundation.go
+++ b/src/crypto/x509/internal/macos/corefoundation.go
@@ -12,6 +12,7 @@ package macOS
 import (
 	"errors"
 	"internal/abi"
+	"math"
 	"reflect"
 	"runtime"
 	"time"
@@ -49,7 +50,7 @@ func CFStringToString(ref CFRef) string {
 // TimeToCFDateRef converts a time.Time into an apple CFDateRef
 func TimeToCFDateRef(t time.Time) CFRef {
 	secs := t.Sub(time.Date(2001, 1, 1, 0, 0, 0, 0, time.UTC)).Seconds()
-	ref := CFDateCreate(int(secs))
+	ref := CFDateCreate(secs)
 	return ref
 }
 
@@ -171,8 +172,8 @@ func x509_CFArrayAppendValue_trampoline()
 
 //go:cgo_import_dynamic x509_CFDateCreate CFDateCreate "/System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation"
 
-func CFDateCreate(seconds int) CFRef {
-	ret := syscall(abi.FuncPCABI0(x509_CFDateCreate_trampoline), kCFAllocatorDefault, uintptr(seconds), 0, 0, 0, 0)
+func CFDateCreate(seconds float64) CFRef {
+	ret := syscall(abi.FuncPCABI0(x509_CFDateCreate_trampoline), kCFAllocatorDefault, uintptr(math.Float64bits(seconds)), 0, 0, 0, 0)
 	return CFRef(ret)
 }
 func x509_CFDateCreate_trampoline()
diff --git a/src/runtime/sys_darwin_amd64.s b/src/runtime/sys_darwin_amd64.s
index 5d89cda8e6..6a65a80f2a 100644
--- a/src/runtime/sys_darwin_amd64.s
+++ b/src/runtime/sys_darwin_amd64.s
@@ -839,6 +839,7 @@ TEXT runtime·syscallNoErr(SB),NOSPLIT,$0
 	SUBQ	$16, SP
 	MOVQ	(0*8)(DI), R11// fn
 	MOVQ	(2*8)(DI), SI // a2
+	MOVQ	SI, X0 // hack for CFDateCreate
 	MOVQ	(3*8)(DI), DX // a3
 	MOVQ	(4*8)(DI), CX // a4
 	MOVQ	(5*8)(DI), R8 // a5
diff --git a/src/runtime/sys_darwin_arm64.s b/src/runtime/sys_darwin_arm64.s
index 96d2ed1076..9a3b3cc4f8 100644
--- a/src/runtime/sys_darwin_arm64.s
+++ b/src/runtime/sys_darwin_arm64.s
@@ -744,6 +744,7 @@ TEXT runtime·syscallNoErr(SB),NOSPLIT,$0
 
 	MOVD	0(R0), R12	// fn
 	MOVD	16(R0), R1	// a2
+	FMOVD	16(R0), F0 // hack for CFDateCreate
 	MOVD	24(R0), R2	// a3
 	MOVD	32(R0), R3	// a4
 	MOVD	40(R0), R4	// a5
% 

That seems to confirm the diagnosis. I will think more about the right fix.

@rsc rsc added release-blocker NeedsFix The path to resolution is known, but the work has not been done. labels Feb 18, 2022
@gopherbot gopherbot removed the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Feb 18, 2022
@rsc
Copy link
Contributor

rsc commented Feb 18, 2022

Honestly it is amazing that it works as often as it does (most of the time!).

It works so well because of the register ABI or the generated code, which caused t.Sub(...).Seconds() inside x509 to leave the number we wanted in the floating-point register. As long as it doesn't get overwritten due to goroutine rescheduling before the libc call, all is well.

I have confirmed this by putting a spurious math.Sin call in the middle.

@rsc
Copy link
Contributor

rsc commented Feb 18, 2022

Will send a fix after the weekend.

@rsc rsc self-assigned this Feb 18, 2022
@FiloSottile
Copy link
Contributor

It works so well because of the register ABI or the generated code, which caused t.Sub(...).Seconds() inside x509 to leave the number we wanted in the floating-point register.

(I'm on vacation, but this is amazing.)

@gopherbot
Copy link

Change https://go.dev/cl/387255 mentions this issue: crypto/x509, runtime: fix occasional spurious “certificate is expired”

@rsc
Copy link
Contributor

rsc commented Feb 22, 2022

Reopening to track cherry-pick to Go 1.18 release branch.

@rsc rsc reopened this Feb 22, 2022
@ericlagergren
Copy link
Contributor

Does this fail open or closed? In other words, could this allow Verify to return nil for an expired cert?

@dmitshur
Copy link
Contributor

dmitshur commented Mar 3, 2022

Merged to Go 1.18 release branch via https://go.dev/cl/389554.

@dmitshur dmitshur closed this as completed Mar 3, 2022
@ericlagergren
Copy link
Contributor

hey @FiloSottile friendly ping about

Does this fail open or closed? In other words, could this allow Verify to return nil for an expired cert?

@FiloSottile
Copy link
Contributor

This could result in Verify returning nil for an expired cert, yes, but it was never shipped in any release, so we are not handling it as a security issue.

@ericlagergren
Copy link
Contributor

Thanks!

@rsc rsc removed their assignment Jun 22, 2022
@golang golang locked and limited conversation to collaborators Jun 22, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. OS-Darwin release-blocker
Projects
None yet
Development

No branches or pull requests

6 participants