x/text/collate: Norwegian collation order differs from Danish #59908

Open
flwyd opened this issue May 1, 2023 · 2 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
@flwyd

flwyd commented May 1, 2023

What version of Go are you using (go version)?

$ go version
go version go1.20.3 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/tstone/Library/Caches/go-build"
GOENV="/Users/tstone/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/tstone/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/tstone/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/opt/local/lib/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/opt/local/lib/go/pkg/tool/darwin_amd64"
GOVCS=""
GOVERSION="go1.20.3"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="/usr/bin/clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/tstone/devel/adif-multitool/go.mod"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/jl/971w22kn1_l85jzmswnpn3tw0000gn/T/go-build2430323303=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

Code on playground

The collation order for language.Norwegian sorts the letters æ, ø, and å after a, o, and a respectively, rather than as the last three letters of the alphabet (in that order). The collation for language.Danish puts those three letters at the end of the alphabet, as expected. It's my understanding that Norwegian and Danish use the same alphabetic order: the same initial 26-letter order as English, followed by the three additional letters, which are not treated as letters with diacritics. This ordering for both Norwegian and Danish is called out in the introduction to Unicode Technical Standard #10: Unicode Collation Algorithm and is also described in the "Danish and Norwegian alphabet" Wikipedia page.

What did you expect to see?

Norwegian and Danish should collate the same, with Æ, Ø, and Å at the end of the alphabet. These are U+00C6 LATIN CAPITAL LETTER AE, U+00D8 LATIN CAPITAL LETTER O WITH STROKE, U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE, and "SMALL" variants for lower case.

What did you see instead?

Norwegian (but not Danish) sorts these letters the way letters with diacritics are sorted in other European languages, rather than treating them as independent letters.

@gopherbot gopherbot added this to the Unreleased milestone May 1, 2023
@cagedmantis cagedmantis added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label May 1, 2023
@cagedmantis cagedmantis modified the milestones: Unreleased, Backlog May 1, 2023
@cagedmantis
Contributor

cc @mpvl

@danderson
Contributor

Smaller demonstration of the issue: https://go.dev/play/p/0IzO5PDupZU

Per https://icu4c-demos.unicode.org/icu-bin/collation.html, Go's sort of Norwegian is indeed incorrect for the additional letters of the alphabet: Go sorts the additional letters according to the root collation order (language.Und), rather than placing them after Z.

Both the Danish and Norwegian collation definitions in CLDR (https://github.com/unicode-org/cldr/blob/main/common/collation/da.xml and https://github.com/unicode-org/cldr/blob/main/common/collation/no.xml) specify a comparable override of the root collation for these characters:

# Danish
&[before 1]ǀ<æ<<<Æ<<ä<<<Ä<ø<<<Ø<<ö<<<Ö<<ő<<<Ő<å<<<Å<<<aa<<<Aa<<<AA
&oe<<œ<<<Œ

# Norwegian
&[before 1]ǀ<æ<<<Æ<<ä<<<Ä<<ę<<<Ę<ø<<<Ø<<ö<<<Ö<<ő<<<Ő<<œ<<<Œ<å<<<Å<<aa<<<Aa<<<AA

The two languages differ in the details of the ordering, but I don't see a smoking-gun syntactic difference that would cause colltab's builder to get Danish right and Norwegian wrong. That said, my parsing-fu for these collation rules is still weak, and the rules don't make obvious sense to me. I think it's saying that æ should sort before |, which is odd since all three of und, da and no sort | before A, and neither Go nor libicu sorts æ before |. I assume the | means something else I haven't found, in addition to being a previous context indicator elsewhere in rules (but not in first position, according to the icu4j parser at least).

I have no explanation, just found this bug while searching for a different collation bug and thought I'd dig a tiny bit more.
