Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: compress/lzw: extend package for correctly handling tiff files and pdf LZWDecode filters #25409

Closed
hhrutter opened this issue May 15, 2018 · 37 comments

Comments

@hhrutter
Copy link

We have both lzw reader and writer under compress/lzw.

There is also an lzw reader hidden under https://github.com/golang/image/blob/master/tiff/lzw/reader.go
that is supposed to be used for reading tiff files. (There is no writer.)
The difference being a slight modification of the algorithm as stated in https://github.com/golang/image/blob/master/tiff/lzw/reader.go

I am working on a pdf processor and have a use case for encoding/decoding LZW data using both variations. The PDF spec defines the LZWDecode filter with the parameter EarlyChange that decides about the variation of the algorithm to use:

if EarlyChange = 1 => use golang/image/tiff/lzw/reader.go from the experimental repos at golang.org/x and a correspondingwriter.

if EarlyChange = 0 => use compress.lzw.reader.go and compress/lzw/writer.go from the std lib.

The default value for EarlyChange is 1.

This proposal is about extending compress/lzw so it can be used by any package that needs to implement lzw compression in any way. While reading/writing tiff files may be an exotic usecase these days, upcoming PDF readers,writers,processors would greatly benefit from this. In addition this would be the first time to introduce tiff encoding and decoding within the stdlib.

These are the proposed code changes. The modifications of the algorithm are minor but
but for now I don't see any better way than extending the API:

  • Introduce a bool parameter oneOff :
// oneOff makes code length increases occur one code early. It should be true
// for tiff files or pdf LZWDecode filters with earlyChange=1 which is also the default.
func NewReader(r io.Reader, order Order, litWidth int, oneOff bool) io.ReadCloser
func NewWriter(w io.Writer, order Order, litWidth int, oneOff bool) io.WriteCloser
  • Extend encoder and decoder structs:
// oneOff makes code length increases occur one code early.
oneOff bool
  • Extend the encoding algorithm:
func (e *encoder) incHi() error {
	e.hi++

        // Code length increases dependent on d.oneOff.
	ui := e.hi
	if e.oneOff {
		ui++
	}

	if ui == e.overflow {
		e.width++
		e.overflow <<= 1
	}

	if ui == maxCode {
		clear := uint32(1) << e.litWidth
		if err := e.write(e, clear); err != nil {
			return err
		}
		e.width = e.litWidth + 1
		e.hi = clear + 1
		e.overflow = clear << 1
		for i := range e.table {
			e.table[i] = invalidEntry
		}
		return errOutOfCodes
	}
	return nil
}
  • Extend the decoding algorithm:
// decode decompresses bytes from r and leaves them in d.toRead.
// read specifies how to decode bytes into codes.
// litWidth is the width in bits of literal codes.
func (d *decoder) decode() {
	// Loop over the code stream, converting codes into decompressed bytes.
loop:
	for {
		code, err := d.read(d)
		if err != nil {
			if err == io.EOF {
				err = io.ErrUnexpectedEOF
			}
			d.err = err
			break
		}
		switch {
		case code < d.clear:
			// We have a literal code.
			d.output[d.o] = uint8(code)
			d.o++
			if d.last != decoderInvalidCode {
				// Save what the hi code expands to.
				d.suffix[d.hi] = uint8(code)
				d.prefix[d.hi] = d.last
			}
		case code == d.clear:
			d.width = 1 + uint(d.litWidth)
			d.hi = d.eof
			d.overflow = 1 << d.width
			d.last = decoderInvalidCode
			continue
		case code == d.eof:
			d.err = io.EOF
			break loop
		case code <= d.hi:
			c, i := code, len(d.output)-1
			if code == d.hi && d.last != decoderInvalidCode {
				// code == hi is a special case which expands to the last expansion
				// followed by the head of the last expansion. To find the head, we walk
				// the prefix chain until we find a literal code.
				c = d.last
				for c >= d.clear {
					c = d.prefix[c]
				}
				d.output[i] = uint8(c)
				i--
				c = d.last
			}
			// Copy the suffix chain into output and then write that to w.
			for c >= d.clear {
				d.output[i] = d.suffix[c]
				i--
				c = d.prefix[c]
			}
			d.output[i] = uint8(c)
			d.o += copy(d.output[d.o:], d.output[i:])
			if d.last != decoderInvalidCode {
				// Save what the hi code expands to.
				d.suffix[d.hi] = uint8(c)
				d.prefix[d.hi] = d.last
			}
		default:
			d.err = errors.New("lzw: invalid code")
			break loop
		}
		d.last, d.hi = code, d.hi+1
                 
                // Code length increases dependent on d.oneOff.
		ui := d.hi
		if d.oneOff {
			ui++
		}
		if ui >= d.overflow {
			if d.width == maxWidth {
				d.last = decoderInvalidCode
				// Undo the d.hi++ a few lines above, so that (1) we maintain
				// the invariant that d.hi <= d.overflow, and (2) d.hi does not
				// eventually overflow a uint16.
				if !d.oneOff {
					d.hi--
				}
			} else {
				d.width++
				d.overflow <<= 1
			}
		}
		if d.o >= flushBuffer {
			break
		}
	}
	// Flush pending output.
	d.toRead = d.output[:d.o]
	d.o = 0
}
@gopherbot gopherbot added this to the Proposal milestone May 15, 2018
@josharian
Copy link
Contributor

cc @dsnet

Note that due to the Go 1 compatibility promise, we cannot change the signature of any exported function or method. (At least for Go 1.)

@bradfitz bradfitz changed the title Proposal: extend compress/lzw for correctly handling tiff files and pdf LZWDecode filters Proposal: compress/lzw: extend package for correctly handling tiff files and pdf LZWDecode filters May 16, 2018
@hhrutter
Copy link
Author

👍
I guess that means waiting for an opportunity to make this happen whenever we reach Go 2.0.
It makes a lot of sense to consolidate all lzw code into compress.lzw.

@jamiec7919
Copy link

jamiec7919 commented May 16, 2018

Would it be possible to add a public Reader to compress/lzw and change the NewReader signature to return a *Reader? Does that still violate the Go1 promise?

EDIT: I suppose anywhere a type switch or similar is done as the result of := NewReader would fail so technically not a drop in replacement and can violate.

@ianlancetaylor
Copy link
Contributor

Yes, the Go 1 compatibility guarantee prevents us from changing the result type of NewReader.

@dsnet
Copy link
Member

dsnet commented May 16, 2018

My counter proposal is #25429, which is to actually freeze the lzw package. The lzw package is not a easy package to use for users as it accepts too many options that subtly change how the compressed format is interpreted.

@hhrutter
Copy link
Author

hhrutter commented May 16, 2018

I may not have been clear but compress/lzw does not work with PDFs right now the way it is implemented.

There is even a hinting TODO comment about that at the beginning of reader.go
// TODO(nigeltao): check that PDF uses LZW in the same way as GIF,

There are 2 ways lzw compression is used depending on the filter parameter EarlyChange.
Only if this parameter = 0, the current lzw reader/writer implementation works.
If this parameter = 1 or missing which is not only the default but also used in most of the cases I have seen, using the current implementation does not work.

There is a test with an lzw encoded string in reader_test.go:

// This example comes from http://compgroups.net/comp.lang.ruby/Decompressing-LZW-compression-from-PDF-file
	{
		"pdf;MSB;8",
		"-----A---B",
		"\x80\x0b\x60\x50\x22\x0c\x0c\x85\x01",
		nil,
	},

The way I see it this test is not representative and only passes because the chosen string is short.
The difference between the two variations of the algorithm only plays out if all codes are exhausted for a particular code length.

I think something needs to be done here.

@jamiec7919
Copy link

I'm not sure if I'm missing something but if the signature of NewReader can't be changed why not suggest adding a NewReaderEarlyChange() or NewReaderExt(oneOff bool)? (not necessarily advocating but it seems like that is the only issue with the change, the rest is internal?)

@dsnet
Copy link
Member

dsnet commented May 18, 2018

The lzw "format" as originally described in the paper "A Technique for High-Performance Data Compression" did not originally have the various configuration options that exist today. The variations on lzw, which are not well-specified can actually be thought as independent formats that are specific to GIF, TIFF, or PDF. The API as it stands today provides more variations on how to alter the output format than is necessary, which makes testing difficult as the surface is large. With 2 possible values of order and and 6 possible values of litWidth, this is 12 possible configurations. Adding oneOff into the mix increasing the complexity to 24.

To avoid the explosion of configuration cross products, you could make the case the API should have been:

// Format determines the variation of LZW used in certain file formats.
type Format int

const (
    _ Format = 1 << iota
    GIF
    TIFF
    PDF
)

func NewReader(r io.Reader, f Format) (*Reader, error)
func NewWriter(r io.Writer, f Format) (*Writer, error)

However, given that the number of formats that use "lzw" is a limited and specific set of formats, I argue that this functionality should not have been a general purpose package in the standard library.

If anything, we should document that the lzw package only works with GIF and stop advertising that it handles TIFF and PDF.

I understand the need for this functionality, but I don't understand why it has the be done in the standard library. Forking the lzw code seems entirely reasonable (as was done for TIFF).

@hhrutter
Copy link
Author

hhrutter commented May 19, 2018

👍

In addition to the oneOff issue PDF expects any lzw encoded stream to start with clear-table (which is the decimal code 256 or binary 1000 0000 0) - and for PDF litWidth has to be 8 and MSB is mandatory.

The few PDF files I could get a hand on that use the LZWDecode filter with or w/o EarlyChange are inconsistent as far as usage of EOD. Some use it and others don't.

The overall too much subtleties of the PDF LZWDecode filter justify to be separated from the standard lib compress/lzw package.

+1 also for adding a comment to compress/lzw stating it should not be used for PDF.

@rsc
Copy link
Contributor

rsc commented May 21, 2018

I'd rather have one lzw package that works for the three major formats (GIF, PDF, TIFF) than have two different ones. It looks like the code changes here are small. Obviously the API change must be done in a compatible way. But assuming it is, what's the harm in merging the x/ lzw functionality into the standard one?

Are there yet more variants in common use besides these two?

@dsnet
Copy link
Member

dsnet commented May 21, 2018

Are there yet more variants in common use besides these two?

Three. GIF, TIFF, and PDF are different in subtle ways.

what's the harm in merging the x/ lzw functionality into the standard one?

Let me ask it the opposite way, wrong's wrong with pairing each lzw implementation with the Go implementation of the respective formats? I don't imagine that there should be many GIF, TIFF, or PDF implementations.

As a format, LZW is not that complicated, being around 200 lines (per reader or writer) if you strip away the logic that already exists to vary based on the existing options.

Specializing on each LZW variant has benefits:

  • Being able to ship bug fixes for a given format's LZW variant with just a snapshot version of the library itself without depending on a specific version of the standard library. For example, given the state of affairs with PDF, you can imagine making more tweaks to PDF's version of LZW in the future. If this were in the standard library, then bug fixes for PDF are now tied with the release cycle of the standard library. If the code for PDF's LZW was with the Go implementation of PDF, then they can evolve together.
  • Keeping the LZW compression with the format keeps the domain expertise for the format in one place. For example, I'm familiar with DEFLATE and bzip2 as they are fairly common general purpose formats. I'm also familiar with LZW as originally described in the 1984 paper, but I'm not familiar with the format specific intricacies of each LZW variant. This separation of domain expertise makes it harder to review changes to lzw. A person (or team) implementing PDF already needs to know PDF's variant of LZW (as @hhrutter already demonstrates).
  • Each variant can be an internal package for the given format, preventing people from depending on compression format that should only be used for one format.
  • Specialization allows for more simplicity. By getting rid of GIF and TIFF specifics, the PDF variant of LZW can be simpler to understand (same goes for the other variants).
  • Specialization also allows each variant to write optimized versions (if they care) without wasting CPU cycles checking a runtime flag for what variant they are working with.

@hhrutter
Copy link
Author

I can only chip in on the PDF variant. I may be wrong but there are at least two reasons why I tend to agree with @dsnet it would be a good idea to keep the implementation of the PDF LZW variant separate from the standard library compress/lzw:

  1. Testing
  2. Stability

It seems to be true that the LZWDecode filter has been abandoned for PDF writers in favour of the FlateDecode filter by the major PDF tool vendors for performance reasons. Still PDF readers want to be able to digest lzw encoded content for backwards compatibility and PDF processors want to be able to handle any filter that's part of the PDF spec for the same reason. Therefore it would be nice to make an implementation of the PDF lzw variant available to the community. I think that's undisputed.

As described above and here PDFs published on the internet indicate not all PDF writers implement(ed) lzw like it is laid out in the PDF spec. I implemented an lzw Reader and Writer for pdfcpu based on compress/lzw and the PDF content I could find online. This is an implementation of PDF's LZWDecode plus it is relaxed about and able to handle the so far encountered anomalies.

For this reason testing focused on a particular PDF filter is cumbersome. The basic encode/decode and compare against golden cycle is not enough when you want to make sure you get the same result coming from either direction. Unfortunately we are in this position with PDF LZWDecode because there is not much content available out there. You have to identify or generate a PDF file using this filter for compression, then dissect the file, identify a fitting object stream and extract its compressed and decompressed content for testing.

Other than the testing issue I can see code adaptions coming up as the implementation gets exposure and people start testing against it. While bugfixes are part of the software development life cycle we do not want to have a fix for compress/lzw/pdf every 6 months if it makes it in the stdlib. I would like to see it in an experimental repo for some time and then maybe later move it into stdlib if at all.

@nigeltao
Copy link
Contributor

Just on a possible 2.0 API (or even an additional 1.x API if it wasn't called NewReader),

func NewReader(r io.Reader, f Format) (*Reader, error)

won't work for GIF, as a valid GIF's litWidth can vary between 2 and 8 (inclusive). See "LZW Code Size" in Appendix F of https://www.w3.org/Graphics/GIF/spec-gif89a.txt

As for bikeshedding on freezing or growing the stdlib's compress/lzw, I'm OK with growing the stdlib version to accomodate PDF and TIFF. Sure, the new API will be awkward, due to 1.x compatibility, but so be it.

But I don't have the spare time to work on it myself. If @dsnet is the de facto compress/lzw maintainer these days, then I'm happy for him to make the call.

@hhrutter
Copy link
Author

hhrutter commented May 24, 2018

Does the following make any sense?

func NewGIFReader(r io.Reader, order Order, litWidth int) (*Reader , error)
func NewTIFFReader(r io.Reader, order Order, litWidth int) (*Reader , error)
func NewPDFReader(r io.Reader, oneOff bool) (*Reader , error)

func NewGIFWriter(w io.Writer, order Order, litWidth int) (*Writer , error)
func NewTIFFWriter(w io.Writer, order Order, litWidth int) (*Writer , error)
func NewPDFWriter(w io.Writer, oneOff bool) (*Writer , error)

@nigeltao
Copy link
Contributor

I'm still undecided whether that's actually the best API, as opposed to e.g. NewReaderExt, but it does make sense.

Note that the Order args are no longer necessary if you're specifing GIF / TIFF / PDF. The TIFF reader and writer also doesn't need the litWidth arg. I think it's always 8 in TIFF, like PDF.

@hhrutter
Copy link
Author

hhrutter commented May 26, 2018

OK.
I have to agree the function signatures look a little bit awkward, still in lack of a better alternative though.
Let me see if I got that right:

func NewReaderGIF(r io.Reader, litWidth int) (*Reader , error)
func NewReaderTIFF(r io.Reader) (*Reader , error)
func NewReaderPDF(r io.Reader, oneOff bool) (*Reader , error)

func NewWriterGIF(w io.Writer, litWidth int) (*Writer , error)
func NewWriterTIFF(w io.Writer) (*Writer , error)
func NewWriterPDF(w io.Writer, oneOff bool) (*Writer , error)

So that means litWidth remains for GIF only and for GIF we default to LSB for order?

@nigeltao
Copy link
Contributor

nigeltao commented May 28, 2018

Yeah, you got it right. LitWidth is GIF only, and GIF only uses LSB.

On balance though, I think I like the NewReaderExt or NewReader2 function, with an extra oneOff or earlyChange argument, better.

@hhrutter
Copy link
Author

hhrutter commented May 28, 2018

I am not sure I understand what you are suggesting.

You want to have a separate ReaderExt that uses oneOff?
You are aware both TIFF and PDF use oneOff yet you cannot throw them together for other reasons.

Are we talking 1.11 or 2.0?
Can you share the function signatures the way you see them?

Has it been decided who is going to take care of the implementation and the corresponding CL?
I know you made the call for @dsnet but he has not declared himself.
I could chip in with the implementation for compress/lzw in case he's too busy
or at least with a contribution for the PDF part.

@nigeltao
Copy link
Contributor

func NewReaderExt(r io.Reader, order Order, litWidth int, earlyChange bool) io.ReadCloser

Ditto for NewWriterExt.

This can be for Go 1.x.

@hhrutter
Copy link
Author

I think I got you.

Keep the original Reader/Writer in order to comply with the Go 1 compatibility promise
and supply a flexible ReaderExt/WriterExt pair that can be used for GIF, TIFF and PDF.

Is that it?

Then of course supply proper documentation on the parameter usage for each format.
I like that.

@dsnet
Copy link
Member

dsnet commented May 29, 2018

In #25409 (comment), @hhrutter mentioned that PDF's variant of LZW expects streams to start with a clear-table symbol. Is this true of the other LZW variants? I still suspect that it really is better to have New{GIF,TIFF,PDF}{Reader,Writer} instead of a generalized New{Reader,Writer}Ext. Further, I still support developing these externally and waiting for the implementations to stabilize before adding new APIs (if ever).

@hhrutter
Copy link
Author

Good catch!
The initial clear-table has to be written during the call to NewWriter.

@rsc
Copy link
Contributor

rsc commented Jun 4, 2018

Maybe it would make sense to develop this in golang.org/x/image/lzw for now? Only GIF is in the standard library - TIFF and PDF are not. There's still the problem of who will review the code.

@hhrutter
Copy link
Author

hhrutter commented Jun 4, 2018

For PDF lzw is one of several options to compress data, not necessarily image data.
golang.org/x/compress/lzw maybe a better location still not ideal because there is compress/lzw in the standard library.

There is also already a golang.org/x/image/tiff package in need of a Writer and some minor modifications (default packing order to MSB, remove litWidth) according to this thread.
Maybe this package could leverage a new golang.org/x/compress/lzw.

@nigeltao
Copy link
Contributor

nigeltao commented Jun 6, 2018

some minor modifications (default packing order to MSB, remove litWidth) according to this thread.

To be clear, I didn't say that, if you're changing x/image/tiff/lzw, that you should make a backwards incompatible change by removing that package's litWidth adjustability (by modifying or removing NewReader). I said that, if you wanted a NewTIFFReader method, you don't need a litWidth arg.

What I would recommend is to fork the stdlib's compress/lzw to a new location, e.g. github.com/username/lzw to make a package that's acceptable to (1) the stdlib's image/gif, (2) the extended lib's x/image/tiff, and (3) whatever pdf packge you're working on. I'd consider also making username/gif and username/tiff forks that used username/lzw. I'd also consider organizing it into 1 repo instead of 3: username/foo/lzw, username/foo/gif and username/foo/tiff for some good value of foo.

The point of this is, nobody should be blocked on e.g. the Go 1.x release cadence, and once we have some concrete code, with some experience with how that API works in practice, then we can have a much more informed discussion of what to do with e.g. the stdlib's compress/lzw package (and its backwards compat guarantees), versus new packages under x.

@hhrutter
Copy link
Author

hhrutter commented Jun 6, 2018

On top of these recommendations I'd like to follow the stdlib's hierarchy and keep compress/ and image/ separate.

Repo 1:
hhrutter/compress/lzw (forked off compress/lzw)

Repo 2:
hhrutter/image/gif (forked off image/gif) using repo 1.
hhrutter/image/tiff (forked off x/image/tiff) using repo 1.

Repo 3:
hhrutter/pdfcpu using repo 1.

As far as the reader/writer implementation in repo 1 goes I am undecided but I am thinking
if I need an lzw Reader for PDF it's easier if there is a PDFReader in place where I only need to worry about the parameters concerning PDF, which is earlyChange. For the same reason I would prefer a TIFF Reader with no parameters other than the io.Reader.

Hence I am leaning towards:

func NewReaderGIF(r io.Reader, litWidth int) (*Reader , error)
func NewReaderTIFF(r io.Reader) (*Reader , error)
func NewReaderPDF(r io.Reader, earlyChange bool) (*Reader , error)

func NewWriterGIF(w io.Writer, litWidth int) (*Writer , error)
func NewWriterTIFF(w io.Writer) (*Writer , error)
func NewWriterPDF(w io.Writer, earlyChange bool) (*Writer , error)

It would definitely make sense though to put the revised compress/lzw into x/compress/lzw along with modifications to x/image/tiff in order to get proper visibility and feedback. That is because there is not much TIFF related activity out there. TIFF is an old image format and the same applies to the usage of the PDF lzw compression filter.

Please share your opinions.

@nigeltao
Copy link
Contributor

nigeltao commented Jun 7, 2018

Please share your opinions.

I'm repeating myself, but I still prefer

func NewReaderExt(r io.Reader, order Order, litWidth int, earlyChange bool) io.ReadCloser

To me, it seems like a layering violation for lzw (which is a low layer) to have API named after things like GIF and PDF (which are high layers).

@hhrutter
Copy link
Author

hhrutter commented Jun 7, 2018

I see where you are coming from.

If lzw/compress provides a

func NewReaderExt(r io.Reader, order Order, litWidth int, earlyChange bool) io.ReadCloser

you have higher level packages like image/gif and x/image/tiff calling NewReader with corresponding arguments as discussed above.

The same applies for PDF consumers only there is nothing PDF related in the stdlib.

@ianlancetaylor
Copy link
Contributor

@dsnet Any thoughts on @nigeltao 's comment #25409 (comment) ?

@dsnet
Copy link
Member

dsnet commented Jun 18, 2018

I still hold to my opinion in #25409 (comment) that I suspect specialized APIs are better.

Per @nigeltao's comment in #25409 (comment), it seems that we both support development of this outside of standard library. Perhaps the proposal should be on hold until the API for LZW as used in PDF stabilizes and we can see what the subtle differences are with GIF and TIFF.

@hhrutter
Copy link
Author

In order to let the PDF version stabilize it would be good to put it somewhere it gets proper exposure.
I can't think of a better place than an official experimental go repo.

@dsnet
Copy link
Member

dsnet commented Jun 18, 2018

I don't think that's necessary. Wouldn't the repo that is developing an implementation of Go PDF be the perfect place to put that?

@hhrutter
Copy link
Author

hhrutter commented Jun 19, 2018

Basically yes, but I can give you two reasons:

  • I am not going to assume pdfcpu is or will be the sole consumer for a golang.org/x/compress/lzw/pdf or however you want to name it.

  • Visibility. If pdfcpu delegates to a golang.org/x package for lzw compression we can get more/quicker feedback.

@dsnet
Copy link
Member

dsnet commented Jun 19, 2018

I am not going to assume pdfcpu is or will be the sole consumer for a golang.org/x/compress/lzw/pdf or however you want to name it.

If you create github.com/hhrutter/pdfcpu/lzw, nothing prevents others from using it.

Visibility. If pdfcpu delegates to a golang.org/x package for lzw compression we can get more/quicker feedback.

I'm not fond of adding more to golang.org/x; see #17244. Furthermore, the two most likely candidates to review CLs for that code would be me or @nigeltao, and I don't think either of us have the bandwidth to take on ownership of more packages.

@nigeltao
Copy link
Contributor

I would also recommend github.com/hhrutter/pdfcpu/lzw for the same reasons @dsnet gave.

@hhrutter
Copy link
Author

hhrutter commented Jun 20, 2018

Ok then. I will provide github.com/hhrutter/pdfcpu/lzw.

Still, somebody needs to update the comments in compress/lzw so it is clear as of now this package is not suited for PDF.

@rsc
Copy link
Contributor

rsc commented Jul 9, 2018

It seems like this discussion converged on having it outside the standard library (and outside golang.org/x) which sounds fine.

@rsc rsc closed this as completed Jul 9, 2018
@golang golang locked and limited conversation to collaborators Jul 9, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

8 participants