x/text/language: fails to parse valid BCP47 -t extension string; mistakes field for a region #54316

golightlyb · 2022-08-06T14:43:55Z

What did you do?

package main

import (
	"fmt"

	"golang.org/x/text/language"
)

func main() {
	t, err := language.Parse("en-t-en-m0-ungegn")
	// %v prints a nil error as <nil>; %s would print %!s(<nil>).
	fmt.Printf("got %s, err %v\n", t, err)
}

What did you expect to see?

got en-t-en-m0-ungegn, err nil

What did you see instead?

got en-t-en, err language: tag is not well-formed

Cause

In x/text/internal/language, this error occurs when parseExtension calls parseTag to parse the language, script, region and variants that may follow "-t".

(e.g. "en-t-m0-ungegn" is fine and does not trigger this error, but "en-t-en-m0-..." does.)

RFC 6497 gives another example, und-Cyrl-t-und-latn-m0-ungegn-2007, which Go also fails to parse for the same reason.

RFC 6497 says: "The field separator subtags, such as 'm0', were chosen because they are short, visually distinctive, and cannot occur in a language subtag". But parseTag treats "m0" as part of the language/script/region/variants sequence, as if it were like the "GB" in "en-GB".

Note also that a tag like "en-001" is valid, with "001" as the region, so the fix has to be specific: Go parses the tag incorrectly only when the subtag is exactly two characters long and contains a digit.
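To illustrate the distinction described above, here is a minimal sketch of how a two-character subtag could be classified. The helper names isTFieldKey and isRegion are hypothetical and are not part of x/text; the sketch assumes the RFC 6497 shape of a field separator (one letter followed by one digit) and the BCP 47 shape of a region (two letters or three digits). It is not the code from the pull request.

```go
package main

import "fmt"

func isAlpha(c byte) bool {
	return ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z')
}

func isDigit(c byte) bool {
	return '0' <= c && c <= '9'
}

// isTFieldKey (hypothetical helper) reports whether s has the shape of a
// -t extension field separator: one letter followed by one digit, e.g. "m0".
func isTFieldKey(s string) bool {
	return len(s) == 2 && isAlpha(s[0]) && isDigit(s[1])
}

// isRegion (hypothetical helper) reports whether s has the shape of a
// BCP 47 region subtag: two letters ("GB") or three digits ("001").
// It checks shape only, not whether the region is registered.
func isRegion(s string) bool {
	switch len(s) {
	case 2:
		return isAlpha(s[0]) && isAlpha(s[1])
	case 3:
		return isDigit(s[0]) && isDigit(s[1]) && isDigit(s[2])
	}
	return false
}

func main() {
	for _, s := range []string{"GB", "001", "m0"} {
		fmt.Printf("%-4s region=%t tfieldKey=%t\n", s, isRegion(s), isTFieldKey(s))
	}
}
```

With a check like this, a parser reading the language tag after "-t" could stop as soon as it sees a field separator instead of treating it as a region.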

Fix

A small pull request is incoming. It passes all existing tests and adds one new test case.


gopherbot · 2022-08-06T14:48:11Z

Change https://go.dev/cl/421914 mentions this issue: x/text/language: fix parser treating BCP47 extension field as region

golightlyb · 2022-08-30T09:56:43Z

Can I get a second reviewer on this? Thanks

fgm · 2022-09-27T16:27:10Z

Reviewing the RFC-level aspect for now: this format uses the T (transform content) extension in RFC6497, which is only informational, and does not have to be implemented, although it's still nice for us to do.

golightlyb · 2022-09-27T19:49:44Z

> Reviewing the RFC-level aspect for now: this format uses the T (transform content) extension in RFC6497, which is only informational, and does not have to be implemented, although it's still nice for us to do.

This is true, but the problem is that it is implemented, just incorrectly. Following "-t", it parses a language tag (I'm short on time, so forgive me if I use Go terminology rather than RFC terminology here), at least as far as I recall.

The two solutions are either to consume the whole -t-... span up to the next single-letter extension without parsing it, or to fix the language tag parsing that follows -t. Honestly, either is fine, as long as a valid locale string doesn't error.
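As a rough sketch of the first option (consuming the -t extension without parsing it), the following standalone code extracts the raw span from the "t" singleton up to the next single-character subtag. The function name tExtension is hypothetical and is not an x/text API. It assumes subtags are separated by "-", and it does no validation of the extension's contents.

```go
package main

import (
	"fmt"
	"strings"
)

// tExtension (hypothetical helper) returns the raw -t extension of tag,
// from the "t" singleton up to, but not including, the next singleton.
// It does not parse or validate the subtags inside the extension.
func tExtension(tag string) (string, bool) {
	subtags := strings.Split(tag, "-")
	start := -1
	for i, s := range subtags {
		if len(s) != 1 {
			continue // only single-character subtags delimit extensions
		}
		if start >= 0 {
			// The next singleton ends the -t extension.
			return strings.Join(subtags[start:i], "-"), true
		}
		if strings.EqualFold(s, "x") {
			// Private use: nothing after "x" is an extension.
			return "", false
		}
		if i > 0 && strings.EqualFold(s, "t") {
			start = i
		}
	}
	if start < 0 {
		return "", false
	}
	return strings.Join(subtags[start:], "-"), true
}

func main() {
	for _, tag := range []string{
		"en-t-en-m0-ungegn",
		"und-Cyrl-t-und-latn-m0-ungegn-2007",
		"en-GB",
	} {
		ext, ok := tExtension(tag)
		fmt.Printf("%q -> %q (found=%t)\n", tag, ext, ok)
	}
}
```

The trade-off is that this approach accepts any content between "-t" and the next singleton, so malformed extensions would pass through unchecked, whereas fixing the language tag parsing keeps validation in place.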

golightlyb · 2022-10-19T19:35:58Z

@thanm pinging to see if you can get a 2nd reviewer as I am cognisant there is a release freeze coming up. Many thanks

thanm · 2022-10-20T12:09:22Z

Happy to recruit other Googlers to +1, but for +2 we need someone familiar with the x/text repo.

gopherbot added this to the Unreleased milestone Aug 6, 2022

thanm added the NeedsInvestigation label Aug 8, 2022
