Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

archive/zip: need new api to support local file name encoding #10741

Closed
chai2010 opened this issue May 7, 2015 · 21 comments
Closed

archive/zip: need new api to support local file name encoding #10741

chai2010 opened this issue May 7, 2015 · 21 comments
Milestone

Comments

@chai2010
Copy link
Contributor

chai2010 commented May 7, 2015

The archive/zip can't support gbk filename on chinese windows.

This is a simple test (test-gbk-zip.go):

// only for go1.4

// Go output on Chinese Win7/64bit
// go run test-gbk-zip.go
//
// File name(%s): ����/���¶���.txt
// File name(%q): "\xc0\xee\xb0\xd7/\xd4\xc2\xcf¶\xc0\xd7\xc3.txt"
// File name(%s): �Ÿ�/
// File name(%q): "\xb6Ÿ\xa6/"
//

// 7z output on Chinese Win7/64bit
//
// 7z l -r 中文-测试.zip
//
// 7-Zip [64] 9.25 alpha  Copyright (c) 1999-2011 Igor Pavlov  2011-09-16
//
// Listing archive: 中文-测试.zip
//
// --
// Path = 中文-测试.zip
// Type = zip
// Physical Size = 483
//
//    Date      Time    Attr         Size   Compressed  Name
// ------------------- ----- ------------ ------------  ------------------------
//                     .....          282          228  李白\月下独酌.txt
//                     D....            0            5  杜甫
// ------------------- ----- ------------ ------------  ------------------------
//                                    282          233  1 files, 1 folders

package main

import (
    "archive/zip"
    "fmt"
    "log"
    "os"

    "golang.org/x/text/encoding/simplifiedchinese"
)

type ZipTest struct {
    Name string
    File []ZipTestFile
}

type ZipTestFile struct {
    Name string
    Body string
}

func main() {
    makeGbkTestfile()

    r, err := zip.OpenReader(gbkZipTest.Name)
    if err != nil {
        log.Fatal(err)
    }
    defer r.Close()

    for _, f := range r.File {
        fmt.Printf("File name(%%s): %s\n", f.Name)
        fmt.Printf("File name(%%q): %q\n", f.Name)
    }
}

func makeGbkTestfile() {
    file, err := os.Create(gbkZipTest.Name)
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    wzip := zip.NewWriter(file)
    defer func() {
        if err := wzip.Close(); err != nil {
            log.Fatal(err)
        }
    }()

    for _, file := range gbkZipTest.File {
        localGbkName, _ := utf8ToGBK(file.Name) // 文件名转换为 GBK编码
        f, err := wzip.Create(localGbkName)
        if err != nil {
            log.Fatal(err)
        }
        _, err = f.Write([]byte(file.Body))
        if err != nil {
            log.Fatal(err)
        }
    }
}

func utf8ToGBK(text string) (string, error) {
    dst := make([]byte, len(text)*2)
    tr := simplifiedchinese.GB18030.NewEncoder()
    nDst, _, err := tr.Transform(dst, []byte(text), true)
    if err != nil {
        return text, err
    }
    return string(dst[:nDst]), nil
}

var gbkZipTest = ZipTest{
    Name: "中文-测试.zip",
    File: []ZipTestFile{
        {
            Name: "李白/月下独酌.txt",
            Body: `月下独酌 - 李白

花间一壶酒,独酌无相亲。
举杯邀明月,对影成三人。
月既不解饮,影徒随我身。
暂伴月将影,行乐须及春。
我歌月徘徊,我舞影零乱。
醒时同交欢,醉后各分散。
永结无情游,相期邈云汉。
`,
        },
        {
            Name: "杜甫/",
            Body: "",
        },
    },
}

So i create CL9381 try to fix this poblem.
But CL9381 can only support utf8 encoding,
it can't create local gbk encoding zip.

I think we need new api for user defined filename encoding:

func OpenReaderEx(name string, decoder func(x string)(utf8Name string)) (*ReadCloser, error)
func (w *Writer) CreateEx(name string, encoder func(x string)(localName string)) (io.Writer, error)

If the decoder or encoder is nil, the local encoding is utf8.

The func (w *Writer) Create(name string) (io.Writer, error) force use utf8 encoding.

@minux minux added this to the Go1.6 milestone May 7, 2015
@minux
Copy link
Member

minux commented May 7, 2015 via email

@chai2010
Copy link
Contributor Author

chai2010 commented May 7, 2015

@minux If a filename is GBK and valid utf8, and i want to use GBK(not utf8),
How to do this?

@bradfitz
Copy link
Contributor

bradfitz commented May 7, 2015

@chai2010, use CreateHeader directly, and not Create.

@chai2010
Copy link
Contributor Author

chai2010 commented May 7, 2015

@bradfitz If we force use utf8 encoding for valid utf8 gbkFilename (as @minux suggest), CreateHeader will set the flag bit11,
and will call

if err := writeHeader(w.cw, fh); err != nil {
    return nil, err
}

we can't clear flag's bit11 before writeHeader(w.cw, fh).

But the filename is gbk, NOT utf8string(though utf8.ValidString test is true).
The user need local encoding, not utf8.

@minux
Copy link
Member

minux commented May 7, 2015 via email

@chai2010
Copy link
Contributor Author

chai2010 commented May 7, 2015

@minux I agree with you. Only Create will set fh.Flag.

@rsc rsc modified the milestones: Unplanned, Go1.6 Nov 26, 2015
@tgulacsi
Copy link
Contributor

What's the status of this issue?

At least we should document the bit11 - I've had to read through the APPNOTE.TXT of the ZIP spec to find out how t ostore UTF-8 names in the zip file.

The best (IMHO) would be to document & make Create set that bit, as in Go, every valid string is UTF-8, if one wants something else, should use CreateHeader.

@dsnet
Copy link
Member

dsnet commented Oct 20, 2017

The fact that many zip writers use local character encoding is a hack since the specification only seems to support two different encodings: CP-437 or UTF-8.

CL/39570 (exposed in Go1.9) changed it such that the UTF-8 flag was set automatically if the string looked like valid UTF-8, but it seems that the check was too liberal.

It seems to me that the thing that can be practically done here is to provide the user with a way to ensure bit 11 (which controls UTF-8) is cleared. The automatic detection scheme adding in Go1.9 is problematic, so we can either revert it for Go1.10 or explicitly add API to control clearing it.

If we revert CL/39570, users were always able to set UTF8 by manually setting bit 11 on FileHeader.Flags. To assist with that, we can document it and/or export a constant for it.

\cc @mattn

@mattn
Copy link
Member

mattn commented Oct 21, 2017

When set utf-8 string in filename, it works fine.

package main

import (
	"archive/zip"
	"io"
	"log"
	"os"
)

func init() {
	f, err := os.Create("sample.txt")
	if err != nil {
		panic(err)
	}
	f.Write([]byte("hello world"))
	f.Close()
}

func main() {
	zf, err := os.Create("sample.zip")
	if err != nil {
		log.Fatal(err)
	}
	defer zf.Close()

	zw := zip.NewWriter(zf)
	defer zw.Close()

	info, err := os.Stat("sample.txt")
	if err != nil {
		log.Fatal(err)
	}

	hdr, err := zip.FileInfoHeader(info)
	if err != nil {
		log.Fatal(err)
	}

	hdr.Name = "サンプル.txt"
	zh, err := zw.CreateHeader(hdr)
	if err != nil {
		log.Fatal(err)
	}

	f, err := os.Open("sample.txt")
	if err != nil {
		log.Fatal(err)
	}
	io.Copy(zh, f)
}

Windows command line.

zip

Linux

zip

Also Windows explorer.

explorer

Sorry, I don't know how this works on OSX.

When set non-utf-8 multi-byte string, user should handle himself.

sample.zip

This zip archive contain filename in Shift_JIS. Windows explorer (japanese environment) works well. But zip command line doesn't since the archive doesn't have encoding name. You must convert filename manually.

package main

import (
	"archive/zip"
	"io"
	"log"
	"os"

	"golang.org/x/text/encoding/japanese"
)

func main() {
	zr, err := zip.OpenReader("sample.zip")
	if err != nil {
		log.Fatal(err)
	}
	defer zr.Close()

	name, err := japanese.ShiftJIS.NewDecoder().String(zr.File[0].Name)
	if err != nil {
		log.Fatal(err)
	}
	f, err := os.Create(name)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	r, err := zr.File[0].Open()
	if err != nil {
		log.Fatal(err)
	}
	io.Copy(f, r)
}

@k-hiro
Copy link

k-hiro commented Oct 21, 2017

hi!
I think attached sample.zip was created by zip command.
As i mentioned in #22367, my problem is i can't make same type of zip (using ShiftJIS encoding) with Go 1.9.
Sorry about that my issue post was unclear.

If i call zip.CreateHeader with local encoding, Any non ascii characters is recognized as valid utf8 in hasValidUTF8 function.

package main

import (
  "fmt"
  "log"
  "unicode/utf8"

  "golang.org/x/text/encoding/japanese"
  "golang.org/x/text/transform"
)

func main() {
  name, _, err := transform.String(japanese.ShiftJIS.NewEncoder(), "日本語")
  if err != nil {
    log.Fatal(err)
  }

  for _, s := range name {
    b := utf8.ValidRune(s)
    fmt.Printf("%x = %v\n", s, b)
  }
}

output is

fffd = true
fffd = true
fffd = true
7b = true
fffd = true
fffd = true

So zip.CreateHeader set bit 11 and results corrupted zip file.
(file name is encoded ShiftJIS but utf-8 flag is on)

And Windows7's explorer without optional hotfix can't handle utf-8 encoded zip.

@mattn
Copy link
Member

mattn commented Oct 21, 2017

I created it with 7-zip.

I'm thinking it is not neccesary that Go can create zip file with non-utf-8. It's enough that Go can read/write utf-8 file and read non-utf-8 filename as bytes.

@k-hiro
Copy link

k-hiro commented Oct 21, 2017

I can agree. I think using utf-8 encoding with bit 11 is a standard way.
Please don't revert CL/39570.
but Go 1.8 had a way to handle custom encoding.
so add simple api to clear bit 11 is also welcome.

@mattn
Copy link
Member

mattn commented Oct 21, 2017

Or one another idea. Add new field into ZipReader/ZipWriter like below.

NameEncoder  func(string) (string error)
NameDecoder  func(string) (string error)

This translate filename and comments in zip file. NameEncoder can be set japanese.ShiftJIS.NewEncoder().String. And NameDecoder can be set japanese.ShiftJIS.NewDecoder().String. @khiro, @dsnet How do you think?

@mattn
Copy link
Member

mattn commented Oct 21, 2017

little mistake. Reader.File can not provide way to do it. So only NameEncoder for Writer can be useful.

@gopherbot
Copy link

Change https://golang.org/cl/72410 mentions this issue: archive/zip: add NameEncoder to Writer

@k-hiro
Copy link

k-hiro commented Oct 21, 2017

👍 for CL/72410
Thanks for taking time for this.
This is better way than provide api to edit utf8 bit manually.

@gopherbot
Copy link

Change https://golang.org/cl/72430 mentions this issue: archive/zip: add NonUTF8 flag

@mattn
Copy link
Member

mattn commented Oct 21, 2017

I post one another CL. Which do you like?

@k-hiro
Copy link

k-hiro commented Oct 21, 2017

CL/72410
There is no possibility to set wrong flag. easy to use.
But only for writing. if i want read same zip, i need to manually decode string.

CL/72430
Less modification.

I think most people are fine with utf-8 encoding.
I just needed local encoding for compatibility to old windows os and this is special case.
so CL/72430 is more acceptable for anyone?

@gopherbot
Copy link

Change https://golang.org/cl/72791 mentions this issue: archive/zip: restrict UTF8 detection for comment and name fields

gopherbot pushed a commit that referenced this issue Oct 25, 2017
CL 39570 added support for automatically setting flag bit 11 to
indicate that the filename and comment fields are encoded in UTF-8,
which is (conventionally) the encoding using for most Go strings.

However, the detection added is too lose for two reasons:
* We need to ensure both fields are at least possibly UTF-8.
That is, if any field is definitely not UTF-8, then we can't set the bit.
* The utf8.ValidRune returns true for utf8.RuneError, which iterating
over a Go string automatically returns for invalid UTF-8.
Thus, we manually check for that value.

Updates #22367
Updates #10741

Change-Id: Ie8aae388432e546e44c6bebd06a00434373ca99e
Reviewed-on: https://go-review.googlesource.com/72791
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Joe Tsai <thebrokentoaster@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@gopherbot
Copy link

Change https://golang.org/cl/75592 mentions this issue: archive/zip: add FileHeader.NonUTF8 field

@dsnet dsnet self-assigned this Nov 2, 2017
@dsnet dsnet modified the milestones: Unplanned, Go1.10 Nov 2, 2017
@golang golang locked and limited conversation to collaborators Nov 6, 2018
@rsc rsc unassigned dsnet Jun 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

9 participants