Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: strings: add CompareFold function #22425

Closed
akutz opened this issue Oct 24, 2017 · 10 comments
Closed

proposal: strings: add CompareFold function #22425

akutz opened this issue Oct 24, 2017 · 10 comments

Comments

@akutz
Copy link

akutz commented Oct 24, 2017

What version of Go are you using (go version)?

$ go version
go version go1.9.1 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

$ go env
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/akutz/Projects/go"
GORACE=""
GOROOT="/Users/akutz/.go/1.9.1"
GOTOOLDIR="/Users/akutz/.go/1.9.1/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/48/52fjq9r10zx3gnm7j57th0500000gn/T/go-build196061406=/tmp/go-build -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"

CompareFold

I recently wrote a blog post about the overwhelming performance increase that comes with sorting a slice of strings in a case-insensitive manner using an implementation of CompareFold. For example:

542000 Word Benchmark

Ultimately the implementation of CompareFold I used is really just a modification of the existing strings.EqualFold:

// CompareFold returns an integer comparing two strings lexicographically
// using Unicode case-folding.
// The result will be 0 if a==b, -1 if a < b, and +1 if a > b.
func CompareFold(s, t string) int {
	for s != "" && t != "" {

		// Extract first rune from each string.
		var sr, tr rune
		if s[0] < utf8.RuneSelf {
			sr, s = rune(s[0]), s[1:]
		} else {
			r, size := utf8.DecodeRuneInString(s)
			sr, s = r, s[size:]
		}
		if t[0] < utf8.RuneSelf {
			tr, t = rune(t[0]), t[1:]
		} else {
			r, size := utf8.DecodeRuneInString(t)
			tr, t = r, t[size:]
		}

		// If they match, keep going; if not, return false.

		// Easy case.
		if tr == sr {
			continue
		}

		// Make sr < tr to simplify what follows.
		result := 1
		if tr < sr {
			result = -result
			tr, sr = sr, tr
		}
		// Fast check for ASCII.
		if tr < utf8.RuneSelf && 'A' <= sr && sr <= 'Z' {
			// ASCII, and sr is upper case.  tr must be lower case.
			srr := sr + 'a' - 'A'
			if tr == srr {
				continue
			}
			if tr < srr {
				return result
			}
			if tr > srr {
				return -result
			}
		}

		// General case. SimpleFold(x) returns the next equivalent rune > x
		// or wraps around to smaller values.
		r := unicode.SimpleFold(sr)
		for r != sr && r < tr {
			r = unicode.SimpleFold(r)
		}
		if r == tr {
			continue
		}
		if tr < r {
			return result
		}
		if tr > r {
			return -result
		}
	}

	// One string is empty. Are both?
	if s == "" && t != "" {
		return -1
	}
	if s != "" && t == "" {
		return 1
	}
	return 0
}

However, because writing CompareFold from scratch (even basing it off of an existing function) isn't necessarily a simple task, I'd like to propose its inclusion in the Go standard library strings package. To that end, the implementation of strings.EqualFold could then be replaced with the following:

// EqualFold reports whether s and t, interpreted as UTF-8 strings,
// are equal under Unicode case-folding.
func EqualFold(s, t string) bool {
	return CompareFold(s, t) == 0
}

Sorting slices of strings in a case-insensitive manner is not an obscure use case. Therefore I'd posit there is great value in providing an easy-to-use implementation of a comparison function that both uses Unicode case-folding and returns more than a binary result relating to equality.

Please note that akutz/sortfold has the full implementation of CompareFold as well as Go unit tests, examples, and benchmarks.

@ianlancetaylor
Copy link
Contributor

I'm not sure this is a good idea. Case-insensitive lexicographic comparison gets very close to collation, but case-insensitive collation is much more complex than this function. People should be using golang.org/x/text/collate instead. I'm concerned that adding CompareFold will be an attractive nuisance. If you want it, it can be in an external library.

@ianlancetaylor ianlancetaylor changed the title strings: CompareFold proposal: strings: add CompareFold function Oct 24, 2017
@gopherbot gopherbot added this to the Proposal milestone Oct 24, 2017
@ianlancetaylor
Copy link
Contributor

CC @mpvl

@akutz
Copy link
Author

akutz commented Oct 24, 2017

Hi @ianlancetaylor,

First, thank you for taking a look at this issue. Second of all, I was unaware of golang.org/x/text/collate, and I agree it would be trivial to provide a case-insensitive comparator function using a Collator with the IgnoreCase option enabled.

@akutz
Copy link
Author

akutz commented Oct 24, 2017

Hi @ianlancetaylor,

I am interested in this statement:

case-insensitive collation is much more complex than this function

I can see how the modifications I made to strings.EqualFold would work for the test cases and benchmarks I created but may still fail the broader Unicode character set. However, I'm not as informed or well-versed in this area as someone like yourself. Would you mind, for my own education, providing an example of what you mean?

@akutz
Copy link
Author

akutz commented Oct 24, 2017

Hi @ianlancetaylor,

FWIW, I added a 542,000 word benchmark that uses golang.org/x/text/collate.Collator with the IgnoreCase option, and it does not perform well. It uses more memory than an implementation of sort.Interface that uses strings.ToLower, and the sort using the collator is much, much slower, even when run by itself.

$ go test -benchmem -run Bench -bench Benchmark_.*542000 -v
goos: darwin
goarch: amd64
pkg: github.com/akutz/sortfold
Benchmark_FoldedSort_542000_Words____MixedCase_Shuffled-8   	      10	 163598029 ns/op	       0 B/op	       0 allocs/op
Benchmark_LCasedSort_542000_Words____MixedCase_Shuffled-8   	       2	 531652914 ns/op	30080968 B/op	 3180838 allocs/op
Benchmark_CollatSort_542000_Words____MixedCase_Shuffled-8   	       1	1703743345 ns/op	307747456 B/op	 2404277 allocs/op
Benchmark_ColatDSort_542000_Words____MixedCase_Shuffled-8   	       3	 407087316 ns/op	94040096 B/op	  196002 allocs/op
PASS
ok  	github.com/akutz/sortfold	8.880s

Please note that Benchmark_CollatSort uses an implementation of sort.Interface that uses a case-insensitive Collator. The Benchmark_ColatDSort benchmark is a direct sort using the Collator's SortStrings function.

Both collator-backed sorts are far less performant than the above implementation of CompareFold, which, as you said, may be incorrect with regards to comprehensiveness.

@ianlancetaylor
Copy link
Contributor

Collation is language and even region specific. It's not at all the same as simple sorting by character values. There is a use for sorting by case dependent character values, which is why it is in the language. There is much less of a use for sorting by case independent character values. If you want to dig into the complexities of collation perhaps you could start with http://unicode.org/reports/tr10/ .

@akutz
Copy link
Author

akutz commented Oct 24, 2017

Thank you. I guess the next logical question is how EqualFold is valid if the modification of it isn’t? Would the CompareFold implementation be valid for UTF-8 strings as documented for EqualFold?

@akutz
Copy link
Author

akutz commented Oct 24, 2017

Never mind. I get it now. It’s not the CompareFold so much as the sort itself. Got it.

@akutz
Copy link
Author

akutz commented Oct 24, 2017

Thank you for your patience. I do think there is still a place for my work in a use-case specific situation, dependent on the input data. But you’re correct, given the explanation you’ve provided it’s clear that CompareFold makes no sense as part of stdlib.

I’m happy to close this issue unless you want the person you CCd to review it first.

@ianlancetaylor
Copy link
Contributor

OK, closing. Thanks for following up.

@golang golang locked and limited conversation to collaborators Oct 25, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants