Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crypto/x509: Go opens many more CA cert files than Python #32172

Closed
yyq2013 opened this issue May 21, 2019 · 14 comments
Closed

crypto/x509: Go opens many more CA cert files than Python #32172

yyq2013 opened this issue May 21, 2019 · 14 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@yyq2013
Copy link

yyq2013 commented May 21, 2019

What version of Go are you using (go version)?

$ go version
go version go1.12.5 windows/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
set GOARCH=amd64
set GOBIN=
set GOCACHE=C:\Users\Administrator\AppData\Local\go-build
set GOEXE=.exe
set GOFLAGS=
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOOS=windows
set GOPATH=C:\Users\Administrator\go
set GOPROXY=
set GORACE=
set GOROOT=C:\Go
set GOTMPDIR=
set GOTOOLDIR=C:\Go\pkg\tool\windows_amd64
set GCCGO=gccgo
set CC=gcc
set CXX=g++
set CGO_ENABLED=1
set GOMOD=
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0 -fdebug-prefix-map=C:\Users\ADMINI~1\AppData\Local\Temp\go-build185645434=/tmp/go-build -gno-record-gcc-switches

What did you do?

Just get html content from a site, but it is slower a lot than patyhon client, I want to know what need to optimize or bug for the language itself

package main

import (
	"fmt"
	"io"
	"io/ioutil"
	"log"
	"net/http"
	"time"
)

type AddHeaderTransport struct {
	http.RoundTripper
}

func (adt *AddHeaderTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	req.Header.Add("User-Agent", "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1")
	req.Header.Add("accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3")
	req.Header.Add("cache-control", "max-age=0")
	req.Header.Add("accept-language", "en,zh-CN;q=0.9,zh;q=0.8*")
	req.Header.Add("referer", "https://www.bing.com/")
	return adt.RoundTripper.RoundTrip(req)
}
func main() {
	tr := AddHeaderTransport{http.DefaultTransport}
	http.DefaultClient = &http.Client{Transport: &tr}
	start := time.Now()
	res, err := http.Get("https://www.bing.com/")
	if err != nil {
		log.Println(err)
		return
	}
	fmt.Println(time.Since(start))
	defer res.Body.Close()
	io.CopyN(ioutil.Discard, res.Body, 1024)
}

213.2271ms

import urllib3
import time
from urllib3 import Retry


headers = {
    'user-agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1',
    'referer': 'https://www.bing.com/',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en,zh-CN;q=0.9,zh;q=0.8',
    'cache-control': 'max-age=0'
}

retries = Retry(connect=10, read=5, redirect=5, backoff_factor=1.2)
urllib3.disable_warnings()
c = urllib3.HTTPSConnectionPool('www.bing.com', port=443, cert_reqs='CERT_NONE',
                                assert_hostname=False, retries=retries)
start_time = time.time()
r = c.request('GET', "/", headers=headers)
elapsed_time = time.time() - start_time
print(elapsed_time)
r.encoding = 'utf-8'
content_type = r.headers['content-type']
index = content_type.find('=')
encode = 'utf-8'
if index > -1:
    encode = content_type[index + 1:]
print(r.data.decode(encode))

0.14572596549987793s

What did you expect to see?

The time elapsed should about the same between golang and python client

What did you see instead?

The time elapsed for golang http client is more than python client, nearly double.

@bcmills
Copy link
Contributor

bcmills commented May 21, 2019

Please provide more detail: what commands and flags are you using to build and run the Go program?

@bcmills bcmills added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label May 21, 2019
@yyq2013
Copy link
Author

yyq2013 commented May 21, 2019

Please provide more detail: what commands and flags are you using to build and run the Go program?

Run it both "go run test.go" and "go build test.go && test.exe", time elapsed about the same for golang: all 200+ms

@yyq2013
Copy link
Author

yyq2013 commented May 21, 2019

TCP package different for same url:
python client:
image
golang client:
image

Don't know what setting can impact the length of TCP application data,

@bradfitz
Copy link
Contributor

Go is a compiled language. You shouldn't include the compilation time in your benchmark comparisons. (At least, not with http bug reports.) If you have an example program where the compiler is too slow, feel free to file a compiler bug.

@davecheney
Copy link
Contributor

@bradfitz the op is not reporting the wall time to run the program, they are using a timer inside their program.

@bradfitz
Copy link
Contributor

@davecheney, ah, sorry, missed that.

@bradfitz bradfitz reopened this May 22, 2019
@bradfitz
Copy link
Contributor

Go opens up 912 files under /etc/ssl/*. Python opens 1.

bradfitz@go:~$ grep /etc/ssl/certs trace.go | head -20
29827 openat(AT_FDCWD, "/etc/ssl/certs/ca-certificates.crt", O_RDONLY|O_CLOEXEC) = 5
29827 openat(AT_FDCWD, "/etc/ssl/certs", O_RDONLY|O_CLOEXEC) = 5
29827 newfstatat(AT_FDCWD, "/etc/ssl/certs/157753a5.0", {st_mode=S_IFLNK|0777, st_size=26, ...}, AT_SYMLINK_NOFOLLOW) = 0
29827 newfstatat(AT_FDCWD, "/etc/ssl/certs/5a3f0ff8.0", {st_mode=S_IFLNK|0777, st_size=34, ...}, AT_SYMLINK_NOFOLLOW) = 0
29827 newfstatat(AT_FDCWD, "/etc/ssl/certs/aee5f10d.0", {st_mode=S_IFLNK|0777, st_size=45, ...}, AT_SYMLINK_NOFOLLOW) = 0
29827 newfstatat(AT_FDCWD, "/etc/ssl/certs/f081611a.0", {st_mode=S_IFLNK|0777, st_size=23, ...}, AT_SYMLINK_NOFOLLOW) = 0
29827 newfstatat(AT_FDCWD, "/etc/ssl/certs/9479c8c3.0", {st_mode=S_IFLNK|0777, st_size=63, ...}, AT_SYMLINK_NOFOLLOW) = 0
29827 newfstatat(AT_FDCWD, "/etc/ssl/certs/f3377b1b.0",  <unfinished ...>
29827 newfstatat(AT_FDCWD, "/etc/ssl/certs/TWCA_Global_Root_CA.pem", {st_mode=S_IFLNK|0777, st_size=58, ...}, AT_SYMLINK_NOFOLLOW) = 0
29827 newfstatat(AT_FDCWD, "/etc/ssl/certs/455f1b52.0", {st_mode=S_IFLNK|0777, st_size=45, ...}, AT_SYMLINK_NOFOLLOW) = 0
29827 newfstatat(AT_FDCWD, "/etc/ssl/certs/4f316efb.0",  <unfinished ...>
29827 newfstatat(AT_FDCWD, "/etc/ssl/certs/4be590e0.0", {st_mode=S_IFLNK|0777, st_size=37, ...}, AT_SYMLINK_NOFOLLOW) = 0
29827 newfstatat(AT_FDCWD, "/etc/ssl/certs/TrustCor_RootCert_CA-2.pem", {st_mode=S_IFLNK|0777, st_size=61, ...}, AT_SYMLINK_NOFOLLOW) = 0
29827 newfstatat(AT_FDCWD, "/etc/ssl/certs/1d3472b9.0",  <unfinished ...>
29827 newfstatat(AT_FDCWD, "/etc/ssl/certs/facacbc6.0", {st_mode=S_IFLNK|0777, st_size=64, ...}, AT_SYMLINK_NOFOLLOW) = 0
29827 newfstatat(AT_FDCWD, "/etc/ssl/certs/0708417d.0", {st_mode=S_IFLNK|0777, st_size=64, ...}, AT_SYMLINK_NOFOLLOW) = 0
29827 newfstatat(AT_FDCWD, "/etc/ssl/certs/Izenpe.com.pem",  <unfinished ...>
29827 newfstatat(AT_FDCWD, "/etc/ssl/certs/fb5fa911.0", {st_mode=S_IFLNK|0777, st_size=20, ...}, AT_SYMLINK_NOFOLLOW) = 0
29827 newfstatat(AT_FDCWD, "/etc/ssl/certs/ef954a4e.0",  <unfinished ...>
29827 newfstatat(AT_FDCWD, "/etc/ssl/certs/Starfield_Root_Certificate_Authority_-_G2.pem", {st_mode=S_IFLNK|0777, st_size=80, ...}, AT_SYMLINK_NOFOLLOW) = 0
bradfitz@go:~$ grep /etc/ssl/certs trace.go | wc -l
912

bradfitz@go:~$ grep /etc/ssl/certs trace.py | head -20
29676 open("/etc/ssl/certs/ca-certificates.crt", O_RDONLY) = 4
bradfitz@go:~$ grep /etc/ssl/certs trace.py | wc -l
1

If I modify both the Python & Go programs to first do one warm-up https request to force those files to be preloaded, then the performance difference disappears and Go is actually faster:

bradfitz@go:~$ ./get-py.py 
0.04723000526428223
bradfitz@go:~$ ./get-py.py 
0.04802656173706055
bradfitz@go:~$ ./get-py.py 
0.04329371452331543

bradfitz@go:~$ ./get-go 
0.046565191
bradfitz@go:~$ ./get-go 
0.045329321
bradfitz@go:~$ ./get-go 
0.041736011

@bradfitz bradfitz changed the title http client is slow, compare with python crypto/tls: Go opens many more CA cert files than Python May 22, 2019
@bradfitz bradfitz added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. and removed WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. labels May 22, 2019
@bradfitz bradfitz added this to the Go1.14 milestone May 22, 2019
@bradfitz
Copy link
Contributor

Moving to crypto/tls. (Or maybe this is crypto/x509?). Maybe rather than reading all files found we should read them lazily as needed.

/cc @FiloSottile @agl

@FiloSottile FiloSottile changed the title crypto/tls: Go opens many more CA cert files than Python crypto/x509: Go opens many more CA cert files than Python May 22, 2019
@FiloSottile
Copy link
Contributor

The OP was on Windows, so the Linux behavior has nothing to do with it. Want to open a separate issue? We can probably stop at the aggregate file like Python does if we find it.

@yyq2013
Copy link
Author

yyq2013 commented May 24, 2019

I put the http.Get(myurl) and pythonHttpClient.request(myurl) in a loop, and it is slower a half than python client every round, not just first time slow.

I don't konw the detail of golang http, it will initialize the tls related thing every time in a loop?
And like the snapshot of TCP I pasted above, seems the length of TCP packet(Application data) from golang client is small, and the same data, will send more times to remote server.

@bradfitz
Copy link
Contributor

I put the http.Get(myurl) and pythonHttpClient.request(myurl) in a loop

Those are not equivalent. http.Get returns the headers with a streamed response, and pythonHttpClient.request (at least in the code you used earlier) fetched the whole response. So in a loop Python will be able to reuse the same connection, but Go would need to set up a new connection, at least if the server didn't do HTTP/2. You'd need to consume the body in the Go code. But really you need to show the code.

@yyq2013
Copy link
Author

yyq2013 commented May 24, 2019

The whole code I have tested as following:
golang:

package main

import (
	"fmt"
	"github.com/PuerkitoBio/goquery"
	"golang.org/x/text/encoding/simplifiedchinese"
	"golang.org/x/text/transform"
	"log"
	"net/http"
	"os"
	"strings"
	"time"
)

type Process func(txt string) string

type CustomHeaderTransport struct {
	http.RoundTripper
}

func (adt *CustomHeaderTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	req.Header.Add("User-Agent", "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1")
	req.Header.Add("accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3")
	req.Header.Add("cache-control", "max-age=0")
	req.Header.Add("accept-language", "en,zh-CN;q=0.9,zh;q=0.8*")
	req.Header.Add("referer", "https://xxx/")
	return adt.RoundTripper.RoundTrip(req)
}

func main() {
	os.Setenv("http_proxy", "")
	os.Setenv("https_proxy", "")
	tr := CustomHeaderTransport{http.DefaultTransport}
	http.DefaultClient = &http.Client{Transport: &tr}
	processes := []Process{
		func(txt string) string {
			return strings.ReplaceAll(txt, "&nbsp;", " ")
		},
		func(txt string) string {
			s := strings.ReplaceAll(txt, "<br>", "")
			return strings.ReplaceAll(s, "<br/>", "")
		},
		func(txt string) string {
			start := strings.Index(txt, "<script>")
			end := strings.LastIndex(txt, "</script>")
			if start > 0 && end > start {
				txt = txt[0:start] + txt[end+9:]
			}
			return txt
		},
	}

	loadPages("https://xxxx/test.html", "D:/temp/notes.txt", true, processes)
}

func loadPages(targetUrl string, file string, isChapter bool, processes []Process) {
	fmt.Println("process link:", targetUrl)
	var res *http.Response
	var err error
	backOffFactor := 1.25
	retry := 10
	waitTime := float64(5)
	start := time.Now()
	log.Println("starting request html")
	for i := 1; i <= retry; i++ {
		res, err = http.Get(targetUrl)
		if err != nil {
			waitTime = waitTime * backOffFactor
			log.Println("waiting:", waitTime, "seconds")
			time.Sleep(time.Duration(waitTime) * time.Second)
			log.Println(err)
			log.Println("error when request, retry", i, "time(s)")
			continue
		}
	}

	if err != nil {
		fmt.Println("reach max retry[", retry, "] times")
		return
	}
	log.Println("got html,", time.Since(start))
	defer res.Body.Close()
	reader := transform.NewReader(res.Body, simplifiedchinese.GBK.NewDecoder())
	start = time.Now()
	log.Println("starting parse html")
	doc, err := goquery.NewDocumentFromReader(reader)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("parsed html,", time.Since(start))
	f, err := os.OpenFile(file, os.O_APPEND|os.O_WRONLY, 0755)
	if err != nil {
		panic(err)
	}

	log.Println("starting find title")
	start = time.Now()
	title := strings.TrimSpace(doc.Find("div#nr_title").Text())
	if isChapter {
		f.WriteString(strings.Repeat("--", 40))
		f.WriteString("\n")
		f.WriteString(title)
		f.WriteString("\n")
	}
	log.Println("found title,", time.Since(start))
	log.Println("starting find content")
	// Find the chapter items
	content := doc.Find("div#nr1")

	text := content.Text()
	log.Println("found content,", time.Since(start))
	log.Println("starting process content")
	start = time.Now()
	for _, process := range processes {
		text = process(text)
	}
	log.Println("processed content,", time.Since(start))
	start = time.Now()
	f.WriteString(text)
	f.Close()
	log.Println("write file finished,", time.Since(start))
	log.Println("find next link")
	start = time.Now()
	nextLinkDiv := doc.Find("div .nr_page")
	nextLink := nextLinkDiv.Find("table > tbody > tr > td:nth-child(3) > a[class='pb_next']")
	if nextLink == nil {
		fmt.Println("no next page on link:", targetUrl)
		return
	}
	nextPage := nextLink.Text()
	nextUrl, ok := nextLink.Attr("href")
	if ok {
		lastSlash := strings.LastIndex(targetUrl, "/")
		nextUrl = targetUrl[0:lastSlash+1] + nextUrl
		nextUrl = targetUrl[0:lastSlash+1] + "test.html"
		log.Println("found next link,", time.Since(start))
		if nextPage != "" {
			loadPages(nextUrl, file, true, processes)
		}
	} else {
		fmt.Println("no next page on link:", targetUrl)
	}
}

process link: https://xxx/test.html
2019/05/24 23:35:40 starting request html
2019/05/24 23:35:44 got html, 3.3366429s
2019/05/24 23:35:44 starting parse html
...
process link: https://xxx/test.html
2019/05/24 23:35:44 found next link, 0s
2019/05/24 23:35:44 starting request html
2019/05/24 23:35:46 got html, 2.4598189s
2019/05/24 23:35:46 starting parse html

python

import urllib3
from urllib3 import Retry
from lxml import html

headers = {
    'user-agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1',
    'referer': 'https://xxx.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en,zh-CN;q=0.9,zh;q=0.8',
    'cache-control': 'max-age=0'}

retries = Retry(connect=10, read=5, redirect=5, backoff_factor=1.2)
urllib3.disable_warnings()
c = urllib3.HTTPSConnectionPool('xxx', port=443, cert_reqs='CERT_NONE',
                                assert_hostname=False, retries=retries)

import time


def loadPage(relative_path, file):
    print("process link " + relative_path)
    start_time = time.time()
    r = c.request('GET', relative_path, headers=headers)
    elapsed_time = time.time() - start_time
    print("got html:", elapsed_time)
    r.encoding = 'utf-8'
    content_type = r.headers['content-type']
    encode = content_type[content_type.index('=') + 1:]
    tree = html.fromstring(r.data.decode(encode))
    title = tree.xpath('//div[@id="nr_title"]/text()')[0].strip()
    # print(title)
    text = tree.xpath('//div[@id="nr1"]/text()')
    with open(file, "a", encoding="utf-8") as myfile:
        myfile.write(title)
        myfile.write('\n')
        for line in text:
            myfile.write(line)

    next_page = tree.xpath('//a[@class="pb_next"]')
    if len(next_page) > 0:
        next_page = next_page[0]
        next_link = next_page.attrib['href']
        index = relative_path.rfind('/')
        next_page_link = relative_path[0:index + 1] + next_link
        loadPage('/test.html', file)
    else:
        print("not next page on page:", relative_path)


loadPage('/test.html', 'D:/temp/test.py.txt')

process link /test.html
got html: 0.26856040954589844
process link /test.html
got html: 0.2736015319824219

@bradfitz
Copy link
Contributor

That Go code just looks buggy.

Your retry loop loops 10 times, even on success, and it starts 10 HTTP requests but doesn't consume their bodies.

You're not comparing two equivalent programs.

@yyq2013
Copy link
Author

yyq2013 commented May 24, 2019

Yes, you are right. My fault miss the break at the end of the loop.

After your tip. Now the time seems the same between golang client and python client

Many thanks.

@golang golang locked and limited conversation to collaborators May 23, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
None yet
Development

No branches or pull requests

6 participants