Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/text/language: fails to parse valid BCP47 -t extension string; mistakes field for a region #54316

Open
golightlyb opened this issue Aug 6, 2022 · 6 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@golightlyb
Copy link
Contributor

golightlyb commented Aug 6, 2022

What did you do?

package main

import (
	"fmt"

	"golang.org/x/text/language"
)

func main() {
	t, err := language.Parse("en-t-en-m0-ungegn")
	fmt.Printf("got %s, err %s\n", t, err)
}

What did you expect to see?

got en-t-en-m0-ungegn, err nil

What did you see instead?

got en-t-en, err language: tag is not well-formed

Cause

In /x/text/internal/language, this error happens whenever parseExtension calls parseTag to parse language, script, region and variants if they appear after "-t".

(e.g. "en-t-m0-ungegn" is fine and does not trigger this error, but "en-t-en-m0-..." does.)

rfc6497 gives another example, und-Cyrl-t-und-latn-m0-ungegn-2007, which Go also fails to parse for the same reason.

"The field separator subtags, such as 'm0', were chosen because they are short, visually distinctive, and cannot occur in a language subtag". But parseTag believes "m0" is part of the (language, script, region, variants). It thinks "m0" is like the "GB" in "en-GB".

Note also that a language code like "en-001" is a valid part of the language/script/region/variants, so we have to be very specific that Go is parsing the tag incorrectly only when the subtag is exactly two characters long and contains a digit.

Fix

Trivial small pull request incoming. Passes all existing tests, adds one new test case.

@gopherbot gopherbot added this to the Unreleased milestone Aug 6, 2022
@gopherbot
Copy link

Change https://go.dev/cl/421914 mentions this issue: x/text/language: fix parser treating BCP47 extension field as region

@thanm thanm added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Aug 8, 2022
@golightlyb
Copy link
Contributor Author

Can I get a second reviewer on this? Thanks

@fgm
Copy link

fgm commented Sep 27, 2022

Reviewing the RFC-level aspect for now: this format uses the T (transform content) extension in RFC6497, which is only informational, and does not have to be implemented, although it's still nice for us to do.

@golightlyb
Copy link
Contributor Author

Reviewing the RFC-level aspect for now: this format uses the T (transform content) extension in RFC6497, which is only informational, and does not have to be implemented, although it's still nice for us to do.

This is true, but the problem is it is implemented, just incorrectly. Following "-t" it parses a language tag (short on time so forgive me if I use Go terminology rather than RFC terminology here). At least to my recollection.

The two solutions are either consume the whole -t-.... until the next single-letter extension, without parsing it, or to fix the language tag parsing that follows -t. Honestly either are fine, as long as a valid locale string doesn't error.

@golightlyb
Copy link
Contributor Author

@thanm pinging to see if you can get a 2nd reviewer as I am cognisant there is a release freeze coming up. Many thanks

@thanm
Copy link
Contributor

thanm commented Oct 20, 2022

Happy to recruit other Googlers to +1, but for +2 we need someone familiar with the x/text repo.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

4 participants