x/text/language: Match does not work as expected for artificial languages #45749

bep · 2021-04-24T15:30:45Z

golang.org/x/text@v0.3.3 golang.org/x/tools@v0.0.0-20180917221912-90fa682c2a6e

The following test failes with got art-x-klingon; want art-x-elvish.

func TestMatchArtificialLanguages(t *testing.T) {
	// See https://en.wikipedia.org/wiki/Codes_for_constructed_languages
	klingon, err := language.Parse("art-x-klingon")
	if err != nil {
		t.Fatal(err)
	}
	elvish, err := language.Parse("art-x-elvish")
	if err != nil {
		t.Fatal(err)
	}

	matcher := language.NewMatcher([]language.Tag{klingon, elvish})

	m, _, _ := matcher.Match(elvish)

	if m.String() != "art-x-elvish" {
		t.Errorf("got %s; want art-x-elvish", m.String())
	}
}

nicksnyder/go-i18n#252

The text was updated successfully, but these errors were encountered:

nicksnyder · 2021-04-25T20:15:53Z

@bep Does this diff change the results?

func TestMatchArtificialLanguages(t *testing.T) {
	// See https://en.wikipedia.org/wiki/Codes_for_constructed_languages
	klingon, err := language.Parse("art-x-klingon")
	if err != nil {
		t.Fatal(err)
	}
	elvish, err := language.Parse("art-x-elvish")
	if err != nil {
		t.Fatal(err)
	}

+	tags := []language.Tag{klingon, elvish}
-	matcher := language.NewMatcher([]language.Tag{klingon, elvish})
+	matcher := language.NewMatcher(tags)

-	m, _, _ := matcher.Match(elvish)
+	_, i, _ := matcher.Match(elvish)
+	m := tags[i]

	if m.String() != "art-x-elvish" {
		t.Errorf("got %s; want art-x-elvish", m.String())
	}
}

(I am curious if this is related to #24211 (comment))

bep · 2021-04-26T07:55:36Z

@bep Does this diff change the results?

No. It sill returns the first index.

cherrymui · 2021-04-26T16:48:20Z

cc @mpvl

ameowlia · 2021-11-16T22:18:29Z

👋 hello, I'm no expert, but I've been poking around in /x/text recently, so I thought I would look into this. I think the root cause of this is that related to the private tags.

🏷️ Go is not sorting properly based on private tags.

Here is an example where non-artificial languages have the same issue.
https://play.golang.org/p/ZtyNz_CbSgO

	enA, _ := language.Parse("en-x-a")
	enB, _ := language.Parse("en-x-b")
	matcher := language.NewMatcher([]language.Tag{enA, enB})

	m, _, _ := matcher.Match(enB)
	if m.String() != "en-x-b" {
		fmt.Printf("got %s; want en-x-b\n", m.String())
	}

This returns: got en-x-a; want en-x-b

✏️ Go should sort based on private tags
According to rfc4647 "Matching of Language Tags"

In the lookup scheme, the language range is progressively truncated
from the end until a matching language tag is located. Single letter
or digit subtags (including both the letter 'x', which introduces
private-use sequences, and the subtags that introduce extensions) are
removed at the same time as their closest trailing subtag. For
example, starting with the range "zh-Hant-CN-x-private1-private2"
(Chinese, Traditional script, China, two private-use tags) the lookup
progressively searches for content as shown below:

Example of a Lookup Fallback Pattern

Range to match: zh-Hant-CN-x-private1-private2

zh-Hant-CN-x-private1-private2

zh-Hant-CN-x-private1

zh-Hant-CN

zh-Hant

zh

(default)

💻 The code looks like it inspects private tags... but it doesn't

Match eventually calls a function called VariantOrPrivateUseTags and compares the result of the two languages. I added some print lines and it only ever returned an empty string, even when my languages did have private tags.
I made a hacky fix locally so that it did return private use tags. But this didn't change the outcome. This is because for en-x-a and en-x-b they are initially given a confidence of "Exact" because they are the same language. Even when equalsRest returns false (with my hacky fix), it never lowers the confidence level based on private tags not matching.

✏️ I'm working on a fix to both return the private use tags properly and lower the confidence level when they do not match, which should fix this issue.

gopherbot · 2021-11-17T21:30:20Z

Change https://golang.org/cl/364855 mentions this issue: language/match: match based on all subtags

mpvl · 2021-11-18T07:33:53Z

The use of private use tags is more or less deprecated. The official tag for Klingon is tlh and for Elvish, if I’m not mistaken, qya. I can imagine preprocessors to normalize user input tags, but I don’t think this normalization should be applied to tags passed to the normalizer by the developer, for instance.

The marcher does not, nor does it aim to, implement RFC 4647. Instead, the algorithm is an enhancement of a matcher used internally at google, which is generally believed to give more reliable results.

I no longer work at Google, but I doubt the move away from -x support will have reverted.

mpvl · 2021-11-18T08:12:56Z

Note, the compliance test set is in language/testdata. The CLDR test set is the golden standard.

The two are mostly the same, but there are some notable differences. The Go algorithm uses rule based matching, instead of the scoring algorithm used by Google's algorithm. This has the advantage that it

preserves local information across matches (this is the whole point of these BCP 47 tags, so I wouldn't be surprised if other implementations are catching up to this),
observes user preferences where a language is preferred interstitially to two dialects of another language (for instance, for supported languages en-GB, nl and a user preference of en, nl, en-GB it matches nl, not en-GB,
is much easier to maintain and tweak, and
is much faster (due to other performance improvements not related to the rule-based system per se).

Someone of the internationalization team in Google could verify if the Golden set is up to date. But I doubt support for -x has made its inroads, as the world is moving away from this as it was too problematic.

I would not, however, modify the algorithm based on isolated use cases. There are very specific reasons why it is the way it is with years of tweaking and testing based data-driven design.

mpvl · 2021-11-18T09:04:42Z

As another note. The OP mentions the desire to support art-x-klingon, but this was never a valid language tag for Klingon. The (now deprecated) i-klingon predates the introduction of art. In turn, art was introduced on the same date as tlh (the modern canonical form of Klingon). The tag art only serves as a fallback for artificial languages, so art-x-klingon was never a valid tag, and a matcher relying on matching private use tags would miss this.

This already shows the problem of the use of private use tags: their usage is not stable and changes over time. This is one of the reasons why there was a move away from supporting private use tags.

Note that many artificial languages have their own language code. Here is a list:
https://en.wikipedia.org/wiki/Codes_for_constructed_languages

I can imagine supporting private use tags, though, if it does not interact with the rest of tag matching and if thorough consideration of the implications of canonicalization are given. But, as the use of -x is largely deprecated, the impact of considering them for matching regular languages will have to be tested against a large body of user data. My suspicion is that including this information in matches generally will make things worse, not better. This means that supporting matching on private use tags, should probably be limited to specifically supporting art-x-X and maybe tags starting with x-, if at all.

bep · 2021-11-18T10:04:27Z

The official tag for Klingon is tlh and for Elvish, if I’m not mistaken,

Those were just two random examples. My original problem is that of handling of "unknown language tags", but you can also easily make the argument for unofficial artificial languages (like Foozzy, a new language i just made up).

ameowlia · 2021-11-18T21:25:25Z

😮 Wow, thank you so much for all of this information @mpvl . It has been very hard to discover all of this. (Or maybe there are some docs I am missing?)

What I am taking from this is:

This is deprecated and not supported on purpose.
Therefore I will close my PR.
There might be a case for allowing this for art-x-X.
If I have the time I might reopen one just for the art-x-X case.
I will look for places to document this useful information you provided.

🙏 Thanks again!

gopherbot added this to the Unreleased milestone Apr 24, 2021

bep closed this as completed Apr 24, 2021

bep reopened this Apr 24, 2021

cherrymui added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Apr 26, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

x/text/language: Match does not work as expected for artificial languages #45749

x/text/language: Match does not work as expected for artificial languages #45749

bep commented Apr 24, 2021

nicksnyder commented Apr 25, 2021

bep commented Apr 26, 2021

cherrymui commented Apr 26, 2021

ameowlia commented Nov 16, 2021 •

edited

gopherbot commented Nov 17, 2021

mpvl commented Nov 18, 2021

mpvl commented Nov 18, 2021

mpvl commented Nov 18, 2021 •

edited

bep commented Nov 18, 2021

ameowlia commented Nov 18, 2021

x/text/language: Match does not work as expected for artificial languages #45749

x/text/language: Match does not work as expected for artificial languages #45749

Comments

bep commented Apr 24, 2021

nicksnyder commented Apr 25, 2021

bep commented Apr 26, 2021

cherrymui commented Apr 26, 2021

ameowlia commented Nov 16, 2021 • edited

gopherbot commented Nov 17, 2021

mpvl commented Nov 18, 2021

mpvl commented Nov 18, 2021

mpvl commented Nov 18, 2021 • edited

bep commented Nov 18, 2021

ameowlia commented Nov 18, 2021

ameowlia commented Nov 16, 2021 •

edited

mpvl commented Nov 18, 2021 •

edited