x/text/language: Match detects wrong language #49176

Open
eyudkin opened this issue Oct 27, 2021 · 5 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
eyudkin commented Oct 27, 2021

What version of Go are you using (go version)?

$ go version
go1.17.2

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
darwin_amd64

What did you do?

I'm trying to resolve a locale language from the "Accept-Language" HTTP header: en-GB;q=1.0, fr-DE;q=0.9, fr-CA;q=0.8, en-DE;q=0.7

by using golang.org/x/text/language:

m := language.NewMatcher([]language.Tag{language.English, language.French})
desired, _, _ := language.ParseAcceptLanguage("en-GB;q=1.0, fr-DE;q=0.9, fr-CA;q=0.8, en-DE;q=0.7")
tag, i, conf := m.Match(desired...)
fmt.Println("case B", tag, i, conf) // fr-u-rg-dezzzz instead of en-u-rg-gbzzzz

What did you expect to see?

en-u-rg-gbzzzz

What did you see instead?

fr-u-rg-dezzzz

Other comments

This is reproducible in the playground: https://play.golang.org/p/puiT36mYjiU
Please note that English is detected correctly if we remove en-DE from the end of the string: en-GB;q=1.0, fr-DE;q=0.9, fr-CA;q=0.8 (I don't understand how adding another low-priority language affects French detection, but it does.)

Implementations in other programming languages

I wrote similar code for other languages to compare the outputs and both of them detected "en" instead of "fr" so probably we have a bug in golang.org/x/text/language package.

javascript

 var parser = require('accept-language-parser');
 
 var language = parser.pick(['en', 'fr'], 'en-GB;q=1.0, fr-DE;q=0.9, fr-CA;q=0.8, en-DE;q=0.7', { loose: true });
 
 console.log(language); // en

java

 import java.util.Arrays;
 import java.util.List;
 import java.util.Locale;

 public class MyClass {
     public static void main(String[] args) {
         List<Locale> locales = Arrays.asList(new Locale("en"), new Locale("fr"));
         List<Locale.LanguageRange> list = Locale.LanguageRange.parse("en-GB;q=1.0, fr-DE;q=0.9, fr-CA;q=0.8, en-DE;q=0.7");
         Locale locale = Locale.lookup(list, locales);

         System.out.println(locale); // en
     }
 }

seankhliao changed the title from "golang.org/x/text/language detects wrong language" to "x/text/language: Match detects wrong language" Oct 27, 2021
seankhliao added the NeedsInvestigation label Oct 27, 2021

seankhliao (Member) commented

cc @mpvl

gopherbot added this to the Unreleased milestone Oct 27, 2021

ameowlia (Contributor) commented Oct 27, 2021

I tried to look into this. I wasn't able to get to the root cause today, but here's what I have found so far:

Some background context

  1. Match calls getBest.
  2. getBest loops through all of the provided tags and keeps track of a best option. It calls best.update for each provided tag. If the tag is a better match than the current best, it updates best.

For the working as expected case with desired, _, err := language.ParseAcceptLanguage("en-GB;q=1.0, fr-DE;q=0.9, fr-CA;q=0.8")

  1. Because English is only in the list of tags once, pin for en-GB is true.
  2. This sets m.LanguagePin to true.
  3. en-GB is initially made the best option because it is first, and beaten is set to true.
  4. Then the loop continues and best.update is called for the next tags one at a time.
  5. Because m.LanguagePin is true, update returns early for every tag after en-GB, so none of them replaces en-GB as best.
  6. Thus en-GB is returned as the best match.

For the buggy case with desired, _, err := language.ParseAcceptLanguage("en-GB;q=1.0, fr-DE;q=0.9, fr-CA;q=0.8, en-DE;q=0.7")

  1. Because English is in the list of tags twice, pin for en-GB is false.
  2. This sets m.LanguagePin to false.
  3. en-GB is initially made the best option because it is first, and beaten is set to true.
  4. Then the loop continues and best.update is called for the next tag, fr-DE.
  5. Since m.LanguagePin is false, update does NOT return early as it did in the example above.
  6. This tag makes it into a tie-breaker with en-GB, and fr-DE ends up winning the tie-breaker.
  7. Thus fr-DE is returned as the best match.
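
The difference between the two walkthroughs comes down to whether a later tag repeats the base language of an earlier one. The sketch below models only that pin decision as it is described in this comment. It is a simplification using the standard library, not the code in golang.org/x/text/language, and the helper names baseLang and pinned are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// baseLang returns the primary language subtag of a tag,
// for example "en" for "en-GB". Hypothetical helper, not part of x/text.
func baseLang(tag string) string {
	return strings.ToLower(strings.SplitN(tag, "-", 2)[0])
}

// pinned reports whether the tag at index i would be pinned under the
// simplified rule described above: no later tag shares its base language.
func pinned(desired []string, i int) bool {
	for _, later := range desired[i+1:] {
		if baseLang(later) == baseLang(desired[i]) {
			return false
		}
	}
	return true
}

func main() {
	working := []string{"en-GB", "fr-DE", "fr-CA"}
	buggy := []string{"en-GB", "fr-DE", "fr-CA", "en-DE"}

	fmt.Println(pinned(working, 0)) // prints true: English appears only once
	fmt.Println(pinned(buggy, 0))   // prints false: en-DE repeats English later
}
```

Under this model, appending en-DE is what turns the pin off for en-GB, which matches the observation that removing en-DE from the header restores the English result.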

ameowlia (Contributor) commented

Strangely enough, I think this is working as intended. @mpvl should definitely check my work though 😅

Here is a playground with my examples: https://play.golang.org/p/D4ucCTJ9LfY

Key assumptions the algorithm is making:

1️⃣ Most people do not speak multiple languages equally. A heavy emphasis is put on finding the first exact language match and "pinning" that language. Docs about assumption 1.

m := language.NewMatcher([]language.Tag{language.English, language.French})

desired, _, _ := language.ParseAcceptLanguage("en-GB, fr")
tag, i, conf := m.Match(desired...)
fmt.Println("case A", tag, i, conf) // en-u-rg-gbzzzz
// Returns English because it is the first exact language match, 
// even though "fr" is an exact region and language match.

2️⃣ If a person lists the dialects they know and the languages are not contiguous, then it is assumed they speak all of those languages equally well, and no extra preference is given to the first exact language match.

For example, if your list of desired language tags is en-GB, fr-DE, fr-CA, en-DE, then the algorithm assumes that you speak English and French very well. In this case, it will not "pin" English even though it is first. Instead it runs the whole algorithm for all of your options and decides that fr-DE is a better match because of regionality (Germany is closer to France than Great Britain is to the US). Docs about assumption 2.

desired, _, _ = language.ParseAcceptLanguage("en-GB, fr-DE, en-DE")
tag, i, conf = m.Match(desired...)
fmt.Println("case B", tag, i, conf) // fr-u-rg-dezzzz
	
// Because of the ordering of dialects, the algorithm
// assumes this speaker knows English and French well.
// Returns French because Germany is closer to France
// than Great Britain is to the US.

❓ You may be wondering: "But what about those q values I set???" (At least that is what I was wondering.)

It turns out those values are only used for ordering the desired language tags in ParseAcceptLanguage. They are not passed into Match at all.
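
To illustrate what "only for ordering" means, here is a minimal standard-library sketch that sorts the tags of a header by descending q value and then discards the weights. The function orderByQ and the weighted type are hypothetical names for this example. The sketch skips the tag validation that ParseAcceptLanguage performs, so treat it as a model of the ordering step only.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// weighted pairs a language tag with its q value.
type weighted struct {
	tag string
	q   float64
}

// orderByQ returns the tags of an Accept-Language header sorted by
// descending q value. Tags with equal q keep their header order.
func orderByQ(header string) []string {
	var items []weighted
	for _, part := range strings.Split(header, ",") {
		fields := strings.Split(part, ";")
		tag := strings.TrimSpace(fields[0])
		if tag == "" {
			continue
		}
		q := 1.0 // a missing q value is the same as q=1
		for _, param := range fields[1:] {
			param = strings.TrimSpace(param)
			if strings.HasPrefix(param, "q=") {
				if v, err := strconv.ParseFloat(param[2:], 64); err == nil {
					q = v
				}
			}
		}
		items = append(items, weighted{tag, q})
	}
	sort.SliceStable(items, func(i, j int) bool { return items[i].q > items[j].q })

	tags := make([]string, len(items))
	for i, it := range items {
		tags[i] = it.tag // the q value is dropped here
	}
	return tags
}

func main() {
	// The header is deliberately out of order to make the sort visible.
	fmt.Println(orderByQ("en-DE;q=0.7, fr-CA;q=0.8, en-GB;q=1.0, fr-DE;q=0.9"))
	// prints [en-GB fr-DE fr-CA en-DE]
}
```

After this step, only the sorted list reaches the matching stage, so q=1.0 versus q=0.9 carries no more information than "en-GB comes before fr-DE".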

eyudkin (Author) commented Nov 8, 2021

> Strangely enough, I think this is working as intended.

Thanks for your reply and investigation.
To me, it should not work that way, because:

  1. The algorithm should take q values into account, because otherwise the Go implementation is not compatible with the RFCs (RFC 7231 and others);
  2. It is very weird that the Go implementation differs from others. In my case, there is a mobile app with its own language detection and a backend with its own language detection, and obviously I don't want my users to "switch" from one language to another;
  3. Logically, I've said that "I speak English and French well" (en-GB, fr-DE), English is preferred by the user (because it goes first and, well, it has q=1.0), and English is also preferred by my app (it goes first for the matcher). So I expect the final language to resolve to "English";
  4. Also logically, this is a very weird case: "the algorithm resolves the language to French if I add another English option".

ameowlia (Contributor) commented Nov 9, 2021

Hi @eyudkin,

🇫🇷 🇬🇧 UI

> It is very weird that the Go implementation differs from others. In my case, there is a mobile app with its own language detection and a backend with its own language detection, and obviously I don't want my users to "switch" from one language to another.

I agree that this sounds like a very bad user experience and an unexpected bug.

🔢 q values

> The algorithm should take q values into account, because otherwise the Go implementation is not compatible with the RFCs (RFC 7231 and others);

Go does take q values into account, but just not in the way that I (and maybe you?) assumed when using them initially. According to rfc7231 section 5.3.5:

Note that some recipients treat the order in which language tags are
listed as an indication of descending priority, particularly for tags
that are assigned equal quality values (no value is the same as q=1).
However, this behavior cannot be relied upon. For consistency and to
maximize interoperability, many user agents assign each language tag
a unique quality value while also listing them in order of decreasing
quality.

The q values are for ordering the preferred language tags correctly, which is what Go does when you use ParseAcceptLanguage.

✏️ RFC Guidance

I found this quote also in rfc7231 section 5.3.5 that seems very related to your example:

Note: User agents ought to provide guidance to users when setting
a preference, since users are rarely familiar with the details of
language matching as described above. For example, users might
assume that on selecting "en-gb", they will be served any kind of
English document if British English is not available. A user
agent might suggest, in such a case, to add "en" to the list for
better matching behavior.

However, I am guessing that you are the server and you do not control the client and so you cannot change which desired language tags are being sent by the user.

🥾 Next Steps

While I agree that the logic is not initially intuitive for this case, it is how the algorithm is purposefully written. I am not an expert in this area, just someone who was interested in digging into this case and learning more. While you can argue with @mpvl about changing the algorithm, I doubt that will happen quickly or at all. I think the best and quickest fix is to change your code to detect the language once so that there is not a mismatch between the frontend and the backend.
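
If a backend has to agree with a client that uses first-match semantics, like the JavaScript and Java snippets earlier in this thread, one option is a small lookup-style matcher on the server. The sketch below uses only the standard library and a hypothetical function name, firstSupported. It compares primary language subtags only and ignores scripts, regions, and everything else that golang.org/x/text/language weighs, so it is a deliberately naive alternative, not a drop-in replacement.

```go
package main

import (
	"fmt"
	"strings"
)

// firstSupported walks desired in preference order and returns the first
// supported language whose primary subtag matches. It returns fallback
// when nothing matches.
func firstSupported(desired, supported []string, fallback string) string {
	for _, want := range desired {
		base := strings.ToLower(strings.SplitN(want, "-", 2)[0])
		for _, have := range supported {
			if strings.ToLower(have) == base {
				return have
			}
		}
	}
	return fallback
}

func main() {
	// Tags are assumed to be already ordered by q value.
	desired := []string{"en-GB", "fr-DE", "fr-CA", "en-DE"}
	fmt.Println(firstSupported(desired, []string{"en", "fr"}, "en")) // prints en
}
```

The trade-off is that this approach loses the region and script fallbacks that Match provides, so it fits best when the supported list contains only bare language codes.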
