New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
proposal: strings: add CompareFold function #22425
Comments
I'm not sure this is a good idea. Case-insensitive lexicographic comparison gets very close to collation, but case-insensitive collation is much more complex than this function. People should be using golang.org/x/text/collate instead. I'm concerned that adding |
CC @mpvl |
Hi @ianlancetaylor, First, thank you for taking a look at this issue. Second of all, I was unaware of |
Hi @ianlancetaylor, I am interested in this statement:
I can see how the modifications I made to |
Hi @ianlancetaylor, FWIW, I added a 542,000 word benchmark that uses $ go test -benchmem -run Bench -bench Benchmark_.*542000 -v
goos: darwin
goarch: amd64
pkg: github.com/akutz/sortfold
Benchmark_FoldedSort_542000_Words____MixedCase_Shuffled-8 10 163598029 ns/op 0 B/op 0 allocs/op
Benchmark_LCasedSort_542000_Words____MixedCase_Shuffled-8 2 531652914 ns/op 30080968 B/op 3180838 allocs/op
Benchmark_CollatSort_542000_Words____MixedCase_Shuffled-8 1 1703743345 ns/op 307747456 B/op 2404277 allocs/op
Benchmark_ColatDSort_542000_Words____MixedCase_Shuffled-8 3 407087316 ns/op 94040096 B/op 196002 allocs/op
PASS
ok github.com/akutz/sortfold 8.880s Please note that Both collator-backed sorts are far less performant than the above implementation of |
Collation is language and even region specific. It's not at all the same as simple sorting by character values. There is a use for sorting by case dependent character values, which is why it is in the language. There is much less of a use for sorting by case independent character values. If you want to dig into the complexities of collation perhaps you could start with http://unicode.org/reports/tr10/ . |
Thank you. I guess the next logical question is how EqualFold is valid if the modification of it isn’t? Would the CompareFold implementation be valid for UTF-8 strings as documented for EqualFold? |
Never mind. I get it now. It’s not the CompareFold so much as the sort itself. Got it. |
Thank you for your patience. I do think there is still a place for my work in a use-case specific situation, dependent on the input data. But you’re correct, given the explanation you’ve provided it’s clear that CompareFold makes no sense as part of stdlib. I’m happy to close this issue unless you want the person you CCd to review it first. |
OK, closing. Thanks for following up. |
What version of Go are you using (
go version
)?Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (
go env
)?CompareFold
I recently wrote a blog post about the overwhelming performance increase that comes with sorting a slice of strings in a case-insensitive manner using an implementation of
CompareFold
. For example:Ultimately the implementation of
CompareFold
I used is really just a modification of the existingstrings.EqualFold
:However, because writing
CompareFold
from scratch (even basing it off of an existing function) isn't necessarily a simple task, I'd like to propose its inclusion in the Go standard librarystrings
package. To that end, the implementation ofstrings.EqualFold
could then be replaced with the following:Sorting slices of strings in a case-insensitive manner is not an obscure use case. Therefore I'd posit there is great value in providing an easy-to-use implementation of a comparison function that both uses Unicode case-folding and returns more than a binary result relating to equality.
Please note that akutz/sortfold has the full implementation of
CompareFold
as well as Go unit tests, examples, and benchmarks.The text was updated successfully, but these errors were encountered: