Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/build/devapp: dev.golang.org cert has expired #32272

Closed
agnivade opened this issue May 27, 2019 · 21 comments
Closed

x/build/devapp: dev.golang.org cert has expired #32272

agnivade opened this issue May 27, 2019 · 21 comments
Labels
Builders x/build issues (builders, bots, dashboards) FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@agnivade
Copy link
Contributor

Issued On Tuesday, 26 February 2019 at 17:05:27
Expires On Monday, 27 May 2019 at 18:05:27

Probably needs a certbot renew run.

@dmitshur @bradfitz

@agnivade agnivade added the NeedsFix The path to resolution is known, but the work has not been done. label May 27, 2019
@gopherbot gopherbot added this to the Unreleased milestone May 27, 2019
@gopherbot gopherbot added the Builders x/build issues (builders, bots, dashboards) label May 27, 2019
@bradfitz
Copy link
Contributor

We don't use certbot. We use Go's acme autocert package which should handle renewal automatically.

@agnivade
Copy link
Contributor Author

Ah I see. Well, it's something else then.

@dmitshur
Copy link
Contributor

dmitshur commented May 27, 2019

The certificate of https://dev-staging.golang.org expires on Tuesday, May 28, 2019 at 6:20:45 AM Eastern Daylight Time, which is just 1~ day away.

dev.golang.org is powered by golang.org/x/build/app/appengine, which is an App Engine app. I don't think it uses autocert itself directly.

Edit: It's actually powered by golang.org/x/build/devapp, which does use autocert directly. I mixed them up. :(

@dmitshur dmitshur added the Soon This needs to be done soon. (regressions, serious bugs, outages) label May 27, 2019
@dmitshur dmitshur changed the title x/build: dev.golang.org cert has expired x/build/devapp: dev.golang.org cert has expired May 27, 2019
@dmitshur dmitshur self-assigned this May 27, 2019
@dmitshur
Copy link
Contributor

I've looked at devapp source code and I see it has some complex logic for redirecting HTTP to HTTPS. I saw some "acme/autocert: missing server name" errors in the logs. There's a chance the problem is in the code, perhaps related to HTTP handler redirection and the autocert package.

I thought simply redeploying would resolve the immediate problem of expired certificate, so I tried that on the staging instance first. The newly deployed version is currently in a crash-loop. Its logs say:

2019/05/27 21:40:18 Loading data from log *maintner.netMutSource ...
2019/05/27 21:40:18 Unknown error type x509.UnknownAuthorityError: x509.UnknownAuthorityError{Cert:(*x509.Certificate)(0xc0004e0580), hintErr:error(nil), hintCert:(*x509.Certificate)(nil)}
2019/05/27 21:40:18 Corpus GetMutations: Get https://maintner.golang.org/logs: x509: certificate signed by unknown authority
2019/05/27 21:40:18 Could not init corpus: godata.Get: Get https://maintner.golang.org/logs: x509: certificate signed by unknown authority

I got worried thinking https://maintner.golang.org/logs is also expired, but when I checked it, it seems to be using a healthy valid HTTPS cert that expires on July 19, 2019. So I can't explain that devapp error right now.

@dmitshur dmitshur added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. and removed NeedsFix The path to resolution is known, but the work has not been done. labels May 27, 2019
@x1ddos
Copy link

x1ddos commented May 27, 2019

FWIW "missing server name" seems normal. It just means the client didn't provide a name in the SNI in ClientHelloInfo. Intentionally or otherwise.

@dmitshur
Copy link
Contributor

I was able to deploy the staging instance of devapp using the parent commit of CL 176257, and it resolved the unknown authority error when fetching https://maintner.golang.org/logs. That CL removed the step that copied ca-certificates.crt into /etc/ssl/certs/, maybe that's related?

However, the certificate at https://dev-staging.golang.org hasn't been updated, so that still needs investigation and fixing.

@bradfitz
Copy link
Contributor

Found it. It's fallout from my gitlock removal CLs. We're missing ca-certificates. That explains both errors.

@gopherbot
Copy link

Change https://golang.org/cl/179077 mentions this issue: devapp: add ca-certificates to Dockerfile

@bradfitz
Copy link
Contributor

Nevermind. The CL I sent only explains the latter error and not the LetsEncrypt cert issue. I had the timeline wrong.

In the logs I see:

2019/05/27 22:36:52 http: TLS handshake error from 10.240.0.156:55541: acme/autocert: unable to authorize "dev.golang.org"; challenge "tls-alpn-01" failed with error: acme: authorization error for dev.golang.org: 403 urn:acme:error:unauthorized: Cannot negotiate ALPN protocol "acme-tls/1" for tls-alpn-01 challenge

gopherbot pushed a commit to golang/build that referenced this issue May 27, 2019
CL 176257 refactored the Dockerfile to remove the use of gitlock but I
forgot to include ca-certificates here.

Updates golang/go#32272 (fixes maybe)
Updates golang/go#26872

Change-Id: I7b0e3a756bc9805e81e499b8b7d7c6ed0defb871
Reviewed-on: https://go-review.googlesource.com/c/build/+/179077
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
@bradfitz
Copy link
Contributor

The LetsEncrypt problem is because we have 2 replicas now.

/cc @andybons, who increased it from 1 to 2 in https://golang.org/cl/127036

That means renewals only have a 50% chance of working, as challenge info is kept in memory with x/crypto/acme/autocert.

@bradfitz
Copy link
Contributor

As a temporary fix I've reduced it from 2 to 1.

There are several better fixes that will take more time.

@bradfitz
Copy link
Contributor

Also, the readiness probe added in CL 127036 doesn't work, because we're redirecting it to https.

kubectl describe says:

Readiness probe failed: Get https://10.200.1.8:80/healthz: http: server gave HTTP response to HTTPS client

@bradfitz
Copy link
Contributor

Actually I don't know what's up with that /healthz https error. The http-to-https redirect handler does appear to special case (allow) that path over http.

But I don't understand this output. It's like we're running some code that's not in the repo and I can't find....

bradfitz@go:~/go/src$ k logs devapp-deployment-798769dd55-rvqgp
2019/05/27 22:46:50 Updating dashboard data...
2019/05/27 22:46:50 Listening on [::]:80
2019/05/27 22:46:50 Serving TLS at [::]:443
2019/05/27 22:49:24 Fetched 494 open CLs in 15.951 seconds
2019/05/27 22:50:34 Fetched 4464 open issues in 70.638 seconds
2019/05/27 22:50:36 Fetched 99 closed issues in 1.499 seconds
2019/05/27 22:50:36 Cache "gzdata" update finished; writing 332116 bytes
2019/05/27 22:50:36 Sent 453 requests to GitHub

Sending requests to GitHub!? gzdata cache? What is all that?

I don't know what code logs that and I can't find it.

@bradfitz
Copy link
Contributor

A "make deploy-prod" got the right code running again.

When I changed its config from 2 to 1 replicas and re-applied the deployment-prod.yaml file, it picked up the latest from image: gcr.io/symbolic-datum-552/devapp:latest... but we don't use latest anymore. (The make push-prod rule doesn't push the MUTABLE_VERSION (latest)). So that's why it was temporarily running ancient code.

That's another thing to fix here.

@bradfitz
Copy link
Contributor

Well, it's running now and logs look happy, but I can't connect to it.

I have to run, though, so no time to debug further. Either @dmitshur can take over or we can wait until tomorrow.

@dmitshur
Copy link
Contributor

dmitshur commented May 27, 2019

GetCertificate doc says:

// If GetCertificate is used directly, instead of via Manager.TLSConfig, package users will
// also have to add acme.ALPNProto to NextProtos for tls-alpn-01, or use HTTPHandler
// for http-01. (The tls-sni-* challenges have been deprecated by popular ACME providers
// due to security issues in the ecosystem.)

It looks like we are using GetCertificate and not adding acme.ALPNProto:

https://github.com/golang/build/blob/78beebf19480669724269a7b3ee6c2eed3d2c64b/devapp/devapp.go#L137

If there's some HTTPS redirection bug preventing the "http-01" challenge from working via HTTPHandler, that could explain why it's not able to renew the LE cert. Edit: I noticed we're not using HTTPHandler anywhere, so I don't think "http-01" challenge was even enabled.

I can try adding acme.ALPNProto.

@bradfitz
Copy link
Contributor

Good find. I bet that's it. We should audit all our services for that.

@dmitshur
Copy link
Contributor

Looks like it was, I tried on https://dev-staging.golang.org/ and its cert has now been updated to expire on August 25, 2019 (previously, it was May 28, 2019).

I'll apply the same quick fix to dev.golang.org and leave this issue open for all the followup cleanup work here.

@gopherbot
Copy link

Change https://golang.org/cl/179097 mentions this issue: devapp: add acme.ALPNProto to NextProtos

@dmitshur
Copy link
Contributor

https://dev.golang.org/ is working now. Thanks @bradfitz.


Although it didn't work right away. Specifically, I saw the following in logs:

http: TLS handshake error from 10.240.0.7:52609: refusing to serve autocert on provided domain

The "refusing to serve autocert on provided domain" error is from our HostPolicy function:

https://github.com/golang/build/blob/78beebf19480669724269a7b3ee6c2eed3d2c64b/devapp/devapp.go#L128-L130

Then later a bunch of:

http: TLS handshake error from 10.240.0.7:47971: acme/autocert: missing certificate
http: TLS handshake error from 10.240.0.48:57269: acme/autocert: missing certificate
http: TLS handshake error from 10.240.0.96:50177: acme/autocert: missing certificate
http: TLS handshake error from 10.240.0.48:57127: acme/autocert: missing certificate

Then a rate limit error from LE:

http: TLS handshake error from 10.240.0.48:52209: 429 urn:acme:error:rateLimited: Error creating new authz :: too many failed authorizations recently: see https://letsencrypt.org/docs/rate-limits/

Then more EOFs:

http: TLS handshake error from 10.200.10.1:46531: EOF
http: TLS handshake error from 10.240.0.96:56001: EOF
http: TLS handshake error from 10.200.10.1:49963: EOF

And after a while it did renew the cert successfully. I don't know about the details of how Let's Encrypt works to explain this off the top of my head.

@dmitshur dmitshur removed the Soon This needs to be done soon. (regressions, serious bugs, outages) label May 27, 2019
gopherbot pushed a commit to golang/build that referenced this issue May 27, 2019
We need to add this manually in order to enable the tls-alpn-01
challenge, since we're using GetCertificate directly instead of
via Manager.TLSConfig. We also don't have the http-01 challenge
enabled (HTTPHandler isn't being used), so this is the only way
for a Let's Encrypt certificate to be acquired now that tls-sni-*
challenges have been deprecated.

In the future, this code can probably be simplified by using
higher-level APIs of autocert, but this fixes the immediate issue.

Updates golang/go#32272

Change-Id: Ia72bca3e44bc585b0dfe5c7bcd3e4f544272d1ab
Reviewed-on: https://go-review.googlesource.com/c/build/+/179097
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@dmitshur dmitshur removed their assignment May 28, 2019
@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels May 28, 2019
@bradfitz
Copy link
Contributor

bradfitz commented Sep 9, 2019

Old.

@bradfitz bradfitz closed this as completed Sep 9, 2019
@golang golang locked and limited conversation to collaborators Sep 8, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Builders x/build issues (builders, bots, dashboards) FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

5 participants