strings: ToLower gives wrong result for uppercase Σ in the word-final position #33005

Open
zurk opened this issue Jul 9, 2019 · 3 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Comments


zurk commented Jul 9, 2019

What version of Go are you using (go version)?

$ go version
go version go1.12.5 darwin/amd64

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/k/Library/Caches/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/k/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/kw/93jybvs16_954hytgsq6ld7r0000gn/T/go-build305684975=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

https://play.golang.org/p/fEDCPSV7Dqi

What did you expect to see?

The program output should be β︎δℕ︎ς because if you lowercase Σ at the last position of the word it becomes ς. See https://en.wikipedia.org/wiki/Sigma

Sigma (uppercase Σ, lowercase σ, lowercase in word-final position ς;

What did you see instead?

The output is β︎δℕ︎σ.
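The playground program itself is not reproduced in this thread. A minimal sketch of the same behaviour, assuming a plain Greek word (ΟΔΟΣ, picked here only for illustration) instead of the string above, could look like this. The `lower` helper is a hypothetical wrapper added for the example:

```go
package main

import (
	"fmt"
	"strings"
)

// lower is a thin, hypothetical wrapper around strings.ToLower, which maps
// each rune on its own, without looking at its position in the word.
func lower(s string) string {
	return strings.ToLower(s)
}

func main() {
	// The trailing Σ (U+03A3) becomes σ (U+03C3), not the word-final ς (U+03C2).
	fmt.Println(lower("ΟΔΟΣ")) // prints οδοσ
}
```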


I am not sure whether this is the only case, across all languages, where the lowercase form depends on position. I just ran into different behavior with Python code:

t = "β︎Δℕ︎Σ"
print(t.lower()) # output: β︎δℕ︎ς

agnivade changed the title from "strings.ToLower gives wrong result for uppercase Σ in the word-final position" to "strings: ToLower gives wrong result for uppercase Σ in the word-final position" Jul 9, 2019
agnivade (Contributor) commented Jul 9, 2019

Does this need another unicode.SpecialCase in https://golang.org/pkg/strings/#ToLowerSpecial ?

I do see a TODO in unicode/casetables.go.

@robpike @ianlancetaylor

agnivade added the NeedsInvestigation label Jul 9, 2019
ianlancetaylor (Contributor) commented

CC @mpvl

ianlancetaylor added this to the Go1.14 milestone Jul 9, 2019
rsc changed the milestone from Go1.14 to Backlog Oct 9, 2019
ALTree (Member) commented Sep 16, 2021

Unicode lowercasing requires handling the final sigma special case, but the rule is overridden in a few standards; for example, Appendix C of RFC 7790 (PRECIS) says:

local case mapping is not applicable to small sigma or final sigma, so case mapping in the PRECIS framework always maps final sigma to small sigma, independent of context

Changing the strings.ToLower function to handle the final sigma (in full compliance with the Unicode case mapping rules) may break existing code relying on the current behaviour. Also, from a cursory look (but I may be wrong), the current special-case mechanism in unicode does not support context-sensitive replacement rules, so it may not be trivial to implement the rule in a non-hacky way.
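To illustrate the point about context, here is a rough sketch of what a per-rune special case can and cannot express. The `alwaysFinalSigma` table and the `lowerSpecial` helper are hypothetical names made up for this example; they are not part of the standard library. A `unicode.CaseRange` only carries a fixed delta per rune, so the table below has to map every Σ to ς, whatever its position:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// alwaysFinalSigma is a hypothetical SpecialCase table. It lowercases
// Σ (U+03A3) to ς (U+03C2) unconditionally, because a CaseRange holds only
// per-rune deltas and has no way to inspect neighbouring runes.
var alwaysFinalSigma = unicode.SpecialCase{
	unicode.CaseRange{
		Lo: 0x03A3,
		Hi: 0x03A3,
		Delta: [unicode.MaxCase]rune{
			unicode.UpperCase: 0,
			unicode.LowerCase: 0x03C2 - 0x03A3,
			unicode.TitleCase: 0,
		},
	},
}

// lowerSpecial applies the table above through strings.ToLowerSpecial.
func lowerSpecial(s string) string {
	return strings.ToLowerSpecial(alwaysFinalSigma, s)
}

func main() {
	// The final Σ is now right, but the initial Σ is wrongly mapped to ς too.
	fmt.Println(lowerSpecial("ΣΟΣ")) // prints ςος
}
```

Under these assumptions the word-final case is fixed only at the cost of breaking every non-final Σ, which is why a context-free table does not solve the problem by itself.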

On the other hand, the golang.org/x/text/cases package handles the final sigma special case, and also provides a way to get PRECIS-compliant case mapping:

package main

import (
	"fmt"

	"golang.org/x/text/cases"
	"golang.org/x/text/language"
)

func main() {
	greekLower1 := cases.Lower(language.Greek)
	greekLower2 := cases.Lower(language.Greek, cases.HandleFinalSigma(false))

	fmt.Println(greekLower1.String("β︎Δℕ︎Σ"))   // prints β︎δℕ︎ς
	fmt.Println(greekLower2.String("β︎Δℕ︎Σ"))   // prints β︎δℕ︎σ
}

My proposal is to preserve the existing strings behaviour, perhaps add a small note about final sigma handling to the documentation, and point users to the golang.org/x/text/cases package for fully Unicode-compliant case mapping.
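For code that cannot take the extra dependency, the rule can be approximated by hand. The following is only a simplified sketch, not the full Unicode rule: `lowerFinalSigma` is a hypothetical helper, it uses `unicode.IsLetter` as a stand-in for "cased letter", and it does not skip case-ignorable characters such as the variation selectors in the string from the original report.

```go
package main

import (
	"fmt"
	"unicode"
)

// lowerFinalSigma is a hypothetical helper. It lowercases s rune by rune and
// maps Σ to ς when the rune directly before it is a letter and the rune
// directly after it (if any) is not. This approximates the final sigma rule.
func lowerFinalSigma(s string) string {
	runes := []rune(s)
	out := make([]rune, len(runes))
	for i, r := range runes {
		precededByLetter := i > 0 && unicode.IsLetter(runes[i-1])
		followedByLetter := i+1 < len(runes) && unicode.IsLetter(runes[i+1])
		if r == 'Σ' && precededByLetter && !followedByLetter {
			out[i] = 'ς'
			continue
		}
		out[i] = unicode.ToLower(r)
	}
	return string(out)
}

func main() {
	fmt.Println(lowerFinalSigma("ΟΔΟΣ")) // prints οδος
	fmt.Println(lowerFinalSigma("ΣΟΣ"))  // prints σος
}
```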
