crypto/x509: Go opens many more CA cert files than Python #32172
Comments
Please provide more detail: what commands and flags are you using to build and run the Go program? |
I ran it both with "go run test.go" and with "go build test.go && test.exe"; the elapsed time is about the same for Go either way: 200+ms. |
Go is a compiled language. You shouldn't include the compilation time in your benchmark comparisons. (At least, not with http bug reports.) If you have an example program where the compiler is too slow, feel free to file a compiler bug. |
@bradfitz the op is not reporting the wall time to run the program, they are using a timer inside their program. |
@davecheney, ah, sorry, missed that. |
Go opens up 912 files under /etc/ssl/*. Python opens 1.
If I modify both the Python & Go programs to first do one warm-up https request to force those files to be preloaded, then the performance difference disappears and Go is actually faster:
|
Moving to crypto/tls. (Or maybe this is crypto/x509?). Maybe rather than reading all files found we should read them lazily as needed. /cc @FiloSottile @agl |
The OP was on Windows, so the Linux behavior has nothing to do with it. Want to open a separate issue? We can probably stop at the aggregate file like Python does if we find it. |
I put the http.Get(myurl) and pythonHttpClient.request(myurl) calls in a loop, and the Go client is about half as fast as the Python client on every round, not just the first. I don't know the details of Go's net/http: does it initialize the TLS-related state again on every iteration of the loop? |
Those are not equivalent. http.Get returns the headers with a streamed response, and pythonHttpClient.request (at least in the code you used earlier) fetched the whole response. So in a loop Python will be able to reuse the same connection, but Go would need to set up a new connection, at least if the server didn't do HTTP/2. You'd need to consume the body in the Go code. But really you need to show the code. |
The whole code I have tested is as follows:
package main
import (
"fmt"
"github.com/PuerkitoBio/goquery"
"golang.org/x/text/encoding/simplifiedchinese"
"golang.org/x/text/transform"
"log"
"net/http"
"os"
"strings"
"time"
)
type Process func(txt string) string
type CustomHeaderTransport struct {
http.RoundTripper
}
func (adt *CustomHeaderTransport) RoundTrip(req *http.Request) (*http.Response, error) {
req.Header.Add("User-Agent", "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1")
req.Header.Add("accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3")
req.Header.Add("cache-control", "max-age=0")
req.Header.Add("accept-language", "en,zh-CN;q=0.9,zh;q=0.8*")
req.Header.Add("referer", "https://xxx/")
return adt.RoundTripper.RoundTrip(req)
}
func main() {
os.Setenv("http_proxy", "")
os.Setenv("https_proxy", "")
tr := CustomHeaderTransport{http.DefaultTransport}
http.DefaultClient = &http.Client{Transport: &tr}
processes := []Process{
func(txt string) string {
return strings.ReplaceAll(txt, " ", " ")
},
func(txt string) string {
s := strings.ReplaceAll(txt, "<br>", "")
return strings.ReplaceAll(s, "<br/>", "")
},
func(txt string) string {
start := strings.Index(txt, "<script>")
end := strings.LastIndex(txt, "</script>")
if start > 0 && end > start {
txt = txt[0:start] + txt[end+9:]
}
return txt
},
}
loadPages("https://xxxx/test.html", "D:/temp/notes.txt", true, processes)
}
func loadPages(targetUrl string, file string, isChapter bool, processes []Process) {
fmt.Println("process link:", targetUrl)
var res *http.Response
var err error
backOffFactor := 1.25
retry := 10
waitTime := float64(5)
start := time.Now()
log.Println("starting request html")
for i := 1; i <= retry; i++ {
res, err = http.Get(targetUrl)
if err != nil {
waitTime = waitTime * backOffFactor
log.Println("waiting:", waitTime, "seconds")
time.Sleep(time.Duration(waitTime) * time.Second)
log.Println(err)
log.Println("error when request, retry", i, "time(s)")
continue
}
}
if err != nil {
fmt.Println("reach max retry[", retry, "] times")
return
}
log.Println("got html,", time.Since(start))
defer res.Body.Close()
reader := transform.NewReader(res.Body, simplifiedchinese.GBK.NewDecoder())
start = time.Now()
log.Println("starting parse html")
doc, err := goquery.NewDocumentFromReader(reader)
if err != nil {
log.Fatal(err)
}
log.Println("parsed html,", time.Since(start))
f, err := os.OpenFile(file, os.O_APPEND|os.O_WRONLY, 0755)
if err != nil {
panic(err)
}
log.Println("starting find title")
start = time.Now()
title := strings.TrimSpace(doc.Find("div#nr_title").Text())
if isChapter {
f.WriteString(strings.Repeat("--", 40))
f.WriteString("\n")
f.WriteString(title)
f.WriteString("\n")
}
log.Println("found title,", time.Since(start))
log.Println("starting find content")
// Find the chapter items
content := doc.Find("div#nr1")
text := content.Text()
log.Println("found content,", time.Since(start))
log.Println("starting process content")
start = time.Now()
for _, process := range processes {
text = process(text)
}
log.Println("processed content,", time.Since(start))
start = time.Now()
f.WriteString(text)
f.Close()
log.Println("write file finished,", time.Since(start))
log.Println("find next link")
start = time.Now()
nextLinkDiv := doc.Find("div .nr_page")
nextLink := nextLinkDiv.Find("table > tbody > tr > td:nth-child(3) > a[class='pb_next']")
if nextLink == nil {
fmt.Println("no next page on link:", targetUrl)
return
}
nextPage := nextLink.Text()
nextUrl, ok := nextLink.Attr("href")
if ok {
lastSlash := strings.LastIndex(targetUrl, "/")
nextUrl = targetUrl[0:lastSlash+1] + nextUrl
nextUrl = targetUrl[0:lastSlash+1] + "test.html"
log.Println("found next link,", time.Since(start))
if nextPage != "" {
loadPages(nextUrl, file, true, processes)
}
} else {
fmt.Println("no next page on link:", targetUrl)
}
}

process link: https://xxx/test.html

Python:

import urllib3
from urllib3 import Retry
from lxml import html
headers = {
'user-agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1',
'referer': 'https://xxx.com',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en,zh-CN;q=0.9,zh;q=0.8',
'cache-control': 'max-age=0'}
retries = Retry(connect=10, read=5, redirect=5, backoff_factor=1.2)
urllib3.disable_warnings()
c = urllib3.HTTPSConnectionPool('xxx', port=443, cert_reqs='CERT_NONE',
assert_hostname=False, retries=retries)
import time
def loadPage(relative_path, file):
print("process link " + relative_path)
start_time = time.time()
r = c.request('GET', relative_path, headers=headers)
elapsed_time = time.time() - start_time
print("got html:", elapsed_time)
r.encoding = 'utf-8'
content_type = r.headers['content-type']
encode = content_type[content_type.index('=') + 1:]
tree = html.fromstring(r.data.decode(encode))
title = tree.xpath('//div[@id="nr_title"]/text()')[0].strip()
# print(title)
text = tree.xpath('//div[@id="nr1"]/text()')
with open(file, "a", encoding="utf-8") as myfile:
myfile.write(title)
myfile.write('\n')
for line in text:
myfile.write(line)
next_page = tree.xpath('//a[@class="pb_next"]')
if len(next_page) > 0:
next_page = next_page[0]
next_link = next_page.attrib['href']
index = relative_path.rfind('/')
next_page_link = relative_path[0:index + 1] + next_link
loadPage('/test.html', file)
else:
print("not next page on page:", relative_path)
loadPage('/test.html', 'D:/temp/test.py.txt')

process link /test.html |
That Go code just looks buggy. Your retry loop loops 10 times, even on success, and it starts 10 HTTP requests but doesn't consume their bodies. You're not comparing two equivalent programs. |
Yes, you are right. My mistake: I missed the break at the end of the loop. After applying your tip, the times are now about the same between the Go client and the Python client. Many thanks. |
What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

What did you do?
I just fetch HTML content from a site, but it is a lot slower than the Python client. I want to know what I need to optimize, or whether this is a bug in the language itself.
Go: 213.2271ms
Python: 0.14572596549987793s
What did you expect to see?
The time elapsed should be about the same between the Go and Python clients.
What did you see instead?
The time elapsed for the Go HTTP client is more than for the Python client, nearly double.