archive/zip: compression performance #20031

Closed
bobjalex opened this issue Apr 18, 2017 · 12 comments
Labels
FrozenDueToAge help wanted NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.

@bobjalex

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

go version go1.8.1 windows/amd64

What operating system and processor architecture are you using (go env)?

set GOARCH=amd64
set GOBIN=
set GOEXE=.exe
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOOS=windows
set GOPATH=C:\GoLib
set GORACE=
set GOROOT=C:\Go
set GOTOOLDIR=C:\Go\pkg\tool\windows_amd64
set GCCGO=gccgo
set CC=gcc
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0 -fdebug-prefix-map=C:\Users\Bob\AppData\Local\Temp\go-build786174103=/tmp/go-build -gno-record-gcc-switches
set CXX=g++
set CGO_ENABLED=1
set PKG_CONFIG=pkg-config
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2

What did you do?

Create a zip archive using archive/zip.

If possible, provide a recipe for reproducing the error.
A complete runnable program is good.
A link on play.golang.org is best.

Not sure how to provide a program for this. Recipe is, in general, create an archive using archive/zip and compare its run time with other implementations. I provided information on my timing comparisons below.

What did you expect to see?

Timing in line with other existing implementations.

What did you see instead?

Writing the ZIP archive is several times slower than other ZIP implementations.
At least for large archives. Not so noticeable with small archives, but painfully slow for large ones.

Based on comparison of Go archive/zip with archive/tar and with the included libraries of Python and Java distributions. For most operations, comparisons are pretty close, but for writing a ZIP archive, Go is way slower than the others.

Of course, the problem could be with my code. But the archive-writing part is pretty simple and is based on the documentation examples. And, my similar code that does the TGZ archiving performs OK.

Here is a table of the results of my experiments, followed by a profile of the archived hierarchy.

ZIP (468.2 MB archive file size)
Read all metadata
Go 1ms
Java 32ms
Python 210ms
Unpack all data
Go 31s
Java 43s
Python 27s
Pack all data
Go 5m3s !!!!
Java 38s
Python 28s

TGZ (466.9 MB archive file size)
(Java JDK does not have a tar module in its distribution so a 3rd party
package org.apache.commons.compress.archivers.tar is used.)
Read all metadata
Go 7.4s
Java 4.9s
Python 3.4s
Unpack all data
Go 29s
Java 38s
Python 34s
Pack all data
Go 23s
Java 29s
Python 34s

Profile of archived hierarchy:

Directory count: 416
File count: 2918
Total size: 1.1G
Average size: 361K
Median size: 3898
Maximum size: 53M
Size distribution:
0 : 19
1..10 : 8
10..100 : 168
100..1000: 639
1000..10K : 1036
10K..100K: 513
100K..1M : 96
1M..10M : 428
10M..100M: 11

@davecheney
Contributor

davecheney commented Apr 18, 2017 via email

@bradfitz bradfitz added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Apr 18, 2017
@bradfitz bradfitz changed the title Compress performance of archive/zip is slow archive/zip: compression performance Apr 18, 2017
@bradfitz bradfitz added this to the Unplanned milestone Apr 18, 2017
@bradfitz bradfitz added help wanted NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Apr 18, 2017
@ianlancetaylor
Contributor

CC @dsnet

@dsnet
Member

dsnet commented Apr 18, 2017

It's really hard to optimize something without a performance test to optimize for. I'm inclined to blame compression, but the compress/flate optimization work that occurred in the Go 1.7 cycle has brought Go's implementation on par with the C zlib implementation.

@bobjalex
Author

I learned how to use markdown, so here is part of my original submission, formatted. Didn't know that my cool, indented text would be flattened :)

What did you see instead?

Writing the ZIP archive is several times slower than other ZIP implementations.
At least for large archives. Not so noticeable with small archives, but painfully slow for large ones.

Based on comparison of Go archive/zip with archive/tar and with the included libraries of Python and Java distributions. For most operations, comparisons are pretty close, but for writing a ZIP archive, Go is way slower than the others.

Of course, the problem could be with my code. But the archive-writing part is pretty simple and is based on the documentation examples. And, my similar code that does the TGZ archiving performs OK.

Here is a table of the results of my experiments, followed by a profile of the archived hierarchy.

ZIP (468.2 MB archive file size)

  • Read all metadata
    • Go 1ms
    • Java 32ms
    • Python 210ms
  • Unpack all data
    • Go 31s
    • Java 43s
    • Python 27s
  • Pack all data
    • Go 5m3s !!!!
    • Java 38s
    • Python 28s

TGZ (466.9 MB archive file size)
(Java JDK does not have a tar module in its distribution so a 3rd party
package org.apache.commons.compress.archivers.tar is used.)

  • Read all metadata
    • Go 7.4s
    • Java 4.9s
    • Python 3.4s
  • Unpack all data
    • Go 29s
    • Java 38s
    • Python 34s
  • Pack all data
    • Go 23s
    • Java 29s
    • Python 34s

Profile of archived hierarchy:

Directory count: 416
File count:      2918
Total size:      1.1G
Average size:    361K
Median size:     3898
Maximum size:    53M
Size distribution:
     0      :   19
     1..10  :    8
    10..100 :  168
   100..1000:  639
  1000..10K : 1036
   10K..100K:  513
  100K..1M  :   96
    1M..10M :  428
   10M..100M:   11

@bradfitz
Contributor

@bobjalex, Markdown is cool but code so we could get reproducible numbers would be even better. Also, do you have a CPU profile?
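For readers unsure how to capture the CPU profile requested here, the following is a minimal sketch using the standard `runtime/pprof` package. The helper names `cpuProfileOf` and `zipToDiscard`, the placeholder input, and the output file name `cpu.prof` are assumptions made for illustration; they are not taken from the reporter's program.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
	"os"
	"runtime/pprof"
)

// cpuProfileOf (hypothetical helper) runs work with the CPU profiler
// enabled and returns the collected profile bytes.
func cpuProfileOf(work func() error) ([]byte, error) {
	var prof bytes.Buffer
	if err := pprof.StartCPUProfile(&prof); err != nil {
		return nil, err
	}
	err := work()
	pprof.StopCPUProfile() // flushes the remaining profile data into prof
	return prof.Bytes(), err
}

// zipToDiscard (hypothetical helper) compresses data into an archive that
// is thrown away, so only the CPU cost of archive/zip is exercised.
func zipToDiscard(data []byte) error {
	zw := zip.NewWriter(io.Discard)
	f, err := zw.Create("sample.bin") // Create uses Deflate by default
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		return err
	}
	return zw.Close()
}

func main() {
	// Placeholder input; substitute the real workload being investigated.
	data := bytes.Repeat([]byte("profile me\n"), 1<<20)
	prof, err := cpuProfileOf(func() error { return zipToDiscard(data) })
	if err != nil {
		fmt.Println("profiling failed:", err)
		return
	}
	if err := os.WriteFile("cpu.prof", prof, 0o644); err != nil {
		fmt.Println("could not save profile:", err)
		return
	}
	fmt.Println("wrote cpu.prof; inspect it with: go tool pprof cpu.prof")
}
```

If the time is dominated by waiting on the output device rather than by compression, a CPU profile like this would be expected to show comparatively little work, which is itself a useful hint.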

Could you share your code & data? We can host it if you don't have a place.

@dsnet
Member

dsnet commented Apr 18, 2017

Furthermore, it's great that you have a distribution based on file sizes, but I'm willing to bet that a single file from that distribution is sufficient to demonstrate the performance slow-down.
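A single-file reproduction along these lines could look like the sketch below. It writes one entry to an in-memory buffer, once with `zip.Store` and once with `zip.Deflate`, so that compression cost can be separated from the speed of the output device. The helper `zipOnce` and the placeholder input are assumptions; a real test would read one of the slow files instead.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"time"
)

// zipOnce (hypothetical helper) writes data as a single entry into an
// in-memory archive using the given method and reports the elapsed time
// and the resulting archive size in bytes.
func zipOnce(data []byte, method uint16) (time.Duration, int, error) {
	var buf bytes.Buffer
	start := time.Now()
	zw := zip.NewWriter(&buf)
	f, err := zw.CreateHeader(&zip.FileHeader{Name: "sample.bin", Method: method})
	if err != nil {
		return 0, 0, err
	}
	if _, err := f.Write(data); err != nil {
		return 0, 0, err
	}
	if err := zw.Close(); err != nil {
		return 0, 0, err
	}
	return time.Since(start), buf.Len(), nil
}

func main() {
	// Placeholder input: roughly 8 MiB of repeated text. Substitute one
	// real file from the slow data set to get meaningful numbers.
	data := bytes.Repeat([]byte("archive/zip timing sample\n"), 8<<20/26)
	for _, m := range []struct {
		name   string
		method uint16
	}{{"store", zip.Store}, {"deflate", zip.Deflate}} {
		d, n, err := zipOnce(data, m.method)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("%-8s %v, %d bytes\n", m.name, d, n)
	}
}
```

If the stored and deflated runs both finish quickly in memory while the real program is slow, the bottleneck is more likely on the output side than in compress/flate.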

@bobjalex
Author

Here is the code snippet that archives a specified directory. I did try using buffered streams (bufio) in both directions, but it made no difference. It seems that the underlying code does a decent job of buffering. Suggestions???

BTW, I agree that this smells like a compression issue. It seems to be exacerbated by large archives/files. But maybe that's because it's just not as frustrating with small files :)

func (a *ArchiverZip) Write(zipPath, rootPath string) error {
	file, err := os.Create(zipPath)
	if err != nil {
		return err
	}
	defer file.Close()
	w := zip.NewWriter(file)

	filepath.Walk(rootPath,
		func(path string, info os.FileInfo, err error) error {
			if err != nil {
				fmt.Printf("Error, not stored: %s, %s\n", path, err)
				return nil
			}
			if info.IsDir() {
				return nil
			}
			relPath, err := filepath.Rel(rootPath, path)
			if err != nil {
				fmt.Printf(
					"Error creating file relative path, not stored: %s, %s\n",
					path, err)
				return nil
			}
			inFile, err := os.Open(path)
			if err != nil {
				fmt.Printf("Error opening file, not stored: %s, %s\n",
					path, err)
				return nil
			}
			defer inFile.Close()
			hdr, err := zip.FileInfoHeader(info)
			if err != nil {
				fmt.Printf("Error getting ZIP entry info, not stored: %s, %s\n",
					path, err)
				return nil
			}
			hdr.Name = filepath.ToSlash(relPath)
			hdr.Method = zip.Deflate
			// Mod time is offset such that local time ends up being stored
			// in the DOS-like time field, to be compatible with most
			// existing ZIP implementations.
			hdr.SetModTime(localAsUtc(info.ModTime()))
			//hdr.SetModTime(info.ModTime())
			//hdr.Flags |= 2 // set "max compression" bit (no effect in Go 1.6)
			f, err := w.CreateHeader(hdr)
			if err != nil {
				fmt.Printf("Error creating ZIP entry, not stored: %s, %s\n",
					path, err)
				return nil
			}
			_, err = io.Copy(f, inFile)
			if err != nil {
				fmt.Printf("Error writing file, not stored: %s, %s\n",
					path, err)
				return nil
			}
			err = inFile.Close()
			if err != nil {
				fmt.Printf("Error closing file: %s, %s\n", path, err)
				return nil
			}
			return nil
		})

	// Close the archive.
	return w.Close()
}

func localAsUtc(local time.Time) time.Time {
	_, offset := local.Zone()
	return local.UTC().Add(time.Duration(offset) * time.Second)
}
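To make the effect of `localAsUtc` concrete, here is a small sketch that applies it to a time in a fixed zone. The zone, the date, and the `example` helper are made up for illustration. The point is that the returned value is in UTC but carries the same wall-clock fields as the local time, which is what ends up in the DOS-style field.

```go
package main

import (
	"fmt"
	"time"
)

// localAsUtc is copied from the snippet above.
func localAsUtc(local time.Time) time.Time {
	_, offset := local.Zone()
	return local.UTC().Add(time.Duration(offset) * time.Second)
}

// example (hypothetical helper) returns a local time in a made-up zone
// seven hours behind UTC, together with its shifted counterpart.
func example() (local, shifted time.Time) {
	zone := time.FixedZone("UTC-7", -7*60*60)
	local = time.Date(2017, time.April, 18, 12, 30, 0, 0, zone)
	return local, localAsUtc(local)
}

func main() {
	local, shifted := example()
	// Both lines show the same date and clock time; only the zone differs.
	fmt.Println(local.Format("2006-01-02 15:04 MST"))
	fmt.Println(shifted.Format("2006-01-02 15:04 MST"))
}
```

Note that in later Go releases (1.10 and up) `SetModTime` is deprecated in favor of the `FileHeader.Modified` field, so this workaround applies to the Go 1.8 code discussed in this thread.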

@bobjalex
Author

I made a little test program from the code I sent:

zip_tst.zip

@bobjalex
Author

Experimenting with the little test program I posted, I learned something that changes the nature of the problem. The timings I sent were writing the archive to a USB3 drive (I failed to mention that). When writing to a real disk, the time is much better. BUT, the difference between disk and USB performance is unique to zip. While the Java and Python implementations took only a little bit longer writing to the USB drive, the Go implementation took a lot longer.

The issue now is: why is the zip-archive-writing I/O so much slower when writing to a USB drive? It seems to be something specific to zip, since tgz does not show much difference for USB.

New ZIP timings on disk and USB drive:

| Implementation | Disk | USB3 drive |
|----------------|------|------------|
| Go             | 20s  | 4m58s      |
| Java           | 49s  | 51s        |
| Python         | 32s  | 35s        |

New TGZ timings on disk and USB drive:

| Implementation | Disk | USB3 drive |
|----------------|------|------------|
| Go             | 44s  | 40s        |
| Java           | 46s  | 38s        |
| Python         | 51s  | 56s        |

@bradfitz
Contributor

Your USB3 stick's filesystem is probably mounted sync (default safe option), so every VFS write is one end-to-end write to flash. You want a bufio.Writer to do a few big writes to the OS.
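The suggestion can be sketched as follows. This is an illustration under stated assumptions, not the reporter's program: `countingWriter` stands in for the file on the slow device and simply counts how many `Write` calls reach it, and the 1 MiB buffer size is an arbitrary choice. My understanding is that `zip.NewWriter` already buffers internally with bufio's small default size, which may be why a default-sized `bufio.Writer` made no visible difference earlier in the thread; the sketch therefore uses `bufio.NewWriterSize` with a much larger buffer.

```go
package main

import (
	"archive/zip"
	"bufio"
	"fmt"
	"io"
	"math/rand"
)

// countingWriter (hypothetical) counts the Write calls that reach the
// destination, standing in for a file on a synchronously mounted device.
type countingWriter struct {
	w     io.Writer
	calls int
}

func (c *countingWriter) Write(p []byte) (int, error) {
	c.calls++
	return c.w.Write(p)
}

// writeArchive writes one deflated entry to dst. If bufSize > 0, dst is
// wrapped in a bufio.Writer of that size, which is flushed at the end.
func writeArchive(dst io.Writer, data []byte, bufSize int) error {
	out := dst
	var bw *bufio.Writer
	if bufSize > 0 {
		bw = bufio.NewWriterSize(dst, bufSize)
		out = bw
	}
	zw := zip.NewWriter(out)
	f, err := zw.CreateHeader(&zip.FileHeader{Name: "data.bin", Method: zip.Deflate})
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		return err
	}
	if err := zw.Close(); err != nil {
		return err
	}
	if bw != nil {
		return bw.Flush() // without this, the tail of the archive is lost
	}
	return nil
}

// countWrites reports how many writes reach the destination.
func countWrites(data []byte, bufSize int) (int, error) {
	cw := &countingWriter{w: io.Discard}
	err := writeArchive(cw, data, bufSize)
	return cw.calls, err
}

// sampleData returns n bytes of deterministic, poorly compressible data.
func sampleData(n int) []byte {
	b := make([]byte, n)
	rand.New(rand.NewSource(1)).Read(b)
	return b
}

func main() {
	data := sampleData(2 << 20)
	direct, err := countWrites(data, 0)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	buffered, err := countWrites(data, 1<<20)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("writes reaching the destination: direct=%d buffered=%d\n", direct, buffered)
}
```

In a real program the destination would be the `*os.File`, and the order at shutdown matters: close the `zip.Writer`, then flush the `bufio.Writer`, then close the file.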

@bradfitz
Contributor

Looks like there's nothing to do here. Can we close this bug?

@bobjalex
Author

OK with me. Thanks.

dgryski added a commit to Sereal/Sereal that referenced this issue Apr 29, 2017
According to golang/go#20031 (comment)
the Go zlib is now on par with the C zlib implementation in terms of
speed.
@golang golang locked and limited conversation to collaborators Apr 19, 2018