Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/text/collate: incorrect sorting of Slovak words #48061

Open
janbodnar opened this issue Aug 30, 2021 · 7 comments
Open

x/text/collate: incorrect sorting of Slovak words #48061

janbodnar opened this issue Aug 30, 2021 · 7 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@janbodnar
Copy link

What version of Go are you using (go version)?

$ go version
go version go1.17 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/jano7/.cache/go-build"
GOENV="/home/jano7/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/jano7/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/jano7/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.17"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/jano7/Documents/prog/golang/tmp/sortwords/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build2259433300=/tmp/go-build -gno-record-gcc-switches"

What did you do?

package main

import (
	"fmt"

	"golang.org/x/text/collate"
	"golang.org/x/text/language"
)

func main() {

	words := []string{"čaj", "auto", "pot", "márny", "kľak", "chyba", "drevo",
		"cibuľa", "džíp", "džem", "šum", "pól", "čučoriedka", "banán", "čerešňa",
		"červený", "klam", "čierny", "tŕň", "pôst", "hôrny", "mat", "chobot",
		"cesnak", "kĺb", "mäta", "ďateľ", "troska", "sýkorka", "elektrón",
		"fuj", "zem", "guma", "hora", "gejzír", "ihla", "pýr", "hrozno",
		"jazva", "džavot", "lom"}

	c := collate.New(language.Slovak)
	c.SortStrings(words)

	fmt.Println(words)
}
$ go run sort_accented_words.go 
[auto banán cesnak cibuľa čaj čerešňa červený čierny čučoriedka ďateľ drevo džavot džem 
džíp elektrón fuj gejzír guma hora hôrny hrozno chobot chyba ihla jazva kľak klam kĺb lom 
márny mat mäta pól pot pôst pýr sýkorka šum tŕň troska zem]

This is not 100% correct solution. (Still quite good in comparison to other languages such as Java, C#, or Python.)
From the languages I have tested, Raku and Go provide the closest solutions. (Raku does accents right, but fails with
digraphs.)

The following is the correct ordering of characters in Slovak language:

a, á, ä, b, c, č, d, ď, dz, dž, e, é, f, g, h, ch, i, í, j, k, l, ĺ, ľ, m, n, ň, o, ó, ô, p, q, r, ŕ, s, š, t, ť, u, ú, v, w, x, y, ý, z, ž

Note that plain characters always precede the accented ones. So the word ďateľ should go after drevo, the sequence
kľak klam kĺb should be klam kĺb kľak, márny mat mäta should be mat márny mäta, pól pot pôst should be
pot pól pôst, and tŕň troska should be troska tŕň.

The words with digraph ch are correct. (The dž digraphs are probably also good. Not 100% sure because there is
an issue with ď.)

So the correct ordering should be:

$ go run sort_accented_words.go 
[auto banán cesnak cibuľa čaj čerešňa červený čierny čučoriedka drevo ďateľ džavot džem 
džíp elektrón fuj gejzír guma hora hôrny hrozno chobot chyba ihla jazva klam kĺb kľak lom 
mat márny mäta pot pól pôst pýr sýkorka šum troska tŕň zem]
@mdlayher mdlayher changed the title Incorrect sorting of Slovak words x/text/collate: incorrect sorting of Slovak words Aug 30, 2021
@gopherbot gopherbot added this to the Unreleased milestone Aug 30, 2021
@randall77
Copy link
Contributor

Go string ordering is not intended to be the order things would sort in any language. It's just ordering the raw bytes of the utf8 encoding.
You might want to use golang.org/x/text/collate . See https://webdevstation.com/posts/how-to-sort-strings-with-go-alphabetically-in-any-language/ for a sample.

@randall77 randall77 reopened this Aug 30, 2021
@randall77
Copy link
Contributor

Sorry, you are using that. I guess this is a bug in x/text/collate?

@janbodnar
Copy link
Author

Yes, it looks like it's x/text/collate.

@cherrymui
Copy link
Member

cc @mpvl

@cherrymui cherrymui added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Sep 1, 2021
@ameowlia
Copy link
Contributor

ameowlia commented Nov 9, 2021

👋 Hello,

I tried to look into this, but didn't make it down the rabbit hole deep enough to make a fix. Here is what I discovered (buckle up, it's a long ride):

First I want to start with the fact that the alphabet seems to be defined correctly here. So that is not the problem.

The collate pkg uses this unicode algorithm to collate words:

The main algorithm has four steps:

1. Normalize each input string.
2. Produce an array of collation elements for each string.
3. Produce a sort key for each string from the arrays of collation elements.
4. Compare the two sort keys with a binary comparison operation.

The letter č sorts correctly.

  • With č, the algorithm produces an array with one collation element (0x402c3c20).
  • This element has primary weight 5662, secondary weight 32, tertiary weight 2.
  • This produces a sort string of [22 30 0 0 0 32 0 0 2]

The letter ď does NOT sort correctly.

  • With ď , the algorithm produces an array with one collation element (0xe0000a83). Code.
  • Then, because of this value, this element gets split into two elements: one for the letter d (0x402c6220) and the accent (0xae604102). Code.
  • The letter d (no accent) has primary weight 5681, secondary weight 32, tertiary weight 2.
  • The accent has primary weight 0, secondary weight 65, tertiary weight 2.
  • To form the sort string, all of the primary weights are added to an array, then all of the secondary weights, then all of the tertiary weights. If a value is 0 then it is skipped.
  • This produces a sort string of [22 49 0 0 0 32 0 65 0 0 2 2]
  • The splitting of the one element into two has the effect of treating d as nearly identical to ď.
  • You can see this if you sort: [d da dat d ď ďat] you get [d ď da dat ďat dr], which acts as if d'== d.

A more in-depth example

The sorting string for ďat is: [22 49 21 239 24 22 0 0 0 32 0 65 0 32 0 32 0 0 2 2 2 2]:
1 = primary weight
2 = secondary weight
3 = tertiary weight

d 1 a 1 t 1 d 2 ' 2 a 2 t 2 d 3 ' 3 a 3 t 3
22 49 21 239 24 22 0 0 0 32 0 65 0 32 0 32 0 0 2 2 2 2

The sorting string for dre: [22 49 23 189 22 76 0 0 0 32 0 32 0 32 0 0 2 2 2]:

d 1 r 1 e 1 d 2 r 2 e d 3 r 3 e 3
22 49 23 189 22 76 0 0 0 32 0 32 0 32 0 0 2 2 2

So the reason that ďat is (incorrectly) sorted before dre is because d' does not have a different primary value from d. Then it sorts based on the next letter. And in this case a<r.

Other letters
I used d' as an example to dig into, but I found similar behavior with the following accented letters: á, ď, é, í, ĺ, ľ, ň, ó, ŕ, ť, ú, ý
. If you collate the following words you see that they all sort into the wrong order following the same pattern I laid out above for d':

package main

import (
	"fmt"
	"golang.org/x/text/collate"
	"golang.org/x/text/language"
)

func main() {

	words := []string{"ab", "áa", "eb", "éa", "ib", "ía", "lb", "ĺa", "lb", "ľa", "nb", "ňa", "ob", "óa", "rb", "ŕa", "tb", "ťa", "ub", "úa", "yb", "ýa"}
	c := collate.New(language.Slovak)
	c.SortStrings(words)
	fmt.Println(words)
}

Result:

[áa ab éa eb ía ib ĺa ľa lb lb ňa nb óa ob ŕa rb ťa tb úa ub ýa yb]

Next Steps
I'm not sure why the lookup for d' returns an element that ends up being split into two elements, but the element for č does not. I tried to figure out how the trie gets made, but couldn't figure it out.

❓ Maybe @mpvl or @bcmills could help point me in the right direction to go from here?

@bcmills
Copy link
Contributor

bcmills commented Nov 10, 2021

@ameowlia, thanks for the analysis! Unfortunately I have very little context on x/text/collate, so I'm not sure that I can provide useful advice on where to go from here.

I do have one question, though: is the element-splitting problem present in the Unicode TR10 algorithm itself, or does the Go implementation deviate from the algorithm? (If the problem is in the TR10 algorithm itself, the Go project can't unilaterally fix it...)

@ameowlia
Copy link
Contributor

Hi @bcmills,

is the element-splitting problem present in the Unicode TR10 algorithm itself, or does the Go implementation deviate from the algorithm

🐛 I think it is a bug in Go
In Section 7.5 it uses the following examples:
... the following incorrect ordering would result: "aa" < "àa" < "ab"
... the following correct ordering would result: "aa" < "ab" < "àa"
This is the exact same incorrect result that I describe above. I think this section is describing a different (but maybe related?) problem, not the issue we are having here. However, my point is that the doc clearly states that the outcome we are getting is incorrect.

🇸🇰 It calls out ch specifically
ch is considered one character in Slovak that is sorted after h and before i. In Section 3.3.3 Contractions the doc says:

Similarly, where the sequence ch is treated as a single digraph letter, as for instance in Slovak, it is represented as a mapping from two characters to a single collation element, as shown in the following example...

In this example, the collation element [.1D19.0020.0002] has a primary weight one greater than the primary weight for the letter h by itself, so that the sequence ch will collate after h and before i. This example shows the result of a tailoring of collation element mappings to weight sequences of letters as a single unit.

Above I said that the issue is because d' does not have a different primary value from d. I think this section of the doc helps confirm that different letters should have different primary values. And because Slovak treats accented letters as different letters then they should have different primary values.

👀 Where does Go go wrong?

  • Expanding and contracting elements is not necessarily wrong, but I think it is incorrect in this case.
  • In order for letters to be sorted so that "aa" < "ab" < "àa" then a, à, and b all need different primary values.
  • This is not happening for Slovak for many of its accented letters.

Open Questions

  • I don't know where to start for fixing this. Some questions I have are:
    • How do these tries get made?
    • Are they 100% identical to some table from unicode?
    • What are the boundaries between what go is responsible for and what unicode provides?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

6 participants