x/text/encoding: UTF-16 decoder handles unpaired surrogates incorrectly #39492

abacabadabacaba · 2020-06-09T22:39:38Z

When decoding some strings containing unpaired surrogates, the UTF-16 decoder produces the wrong number of \ufffd runes. Some examples:

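On the big-endian string \xdc\x00\xdc\x00 (two unpaired low surrogates): expected result \ufffd\ufffd (two copies of \ufffd), actual result \ufffd.
On the big-endian string \xd8\x00\x00 (an unpaired high surrogate followed by one stray byte): expected result \ufffd, actual result \ufffd\ufffd.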

The expected results are derived from the WHATWG Encoding Standard.
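The expected counts above can be reproduced with a small reference decoder. This is only a sketch based on my reading of the shared UTF-16 decoder in the WHATWG Encoding Standard; decodeUTF16BE is a hypothetical helper written for this illustration (big-endian only, no BOM handling) and is not code from x/text.

```go
package main

import "fmt"

// decodeUTF16BE is a hypothetical reference decoder (not part of x/text).
// It decodes big-endian UTF-16 and emits one U+FFFD per error, following
// my reading of the WHATWG shared UTF-16 decoder.
func decodeUTF16BE(b []byte) string {
	const replacement = '\uFFFD'
	var out []rune
	var lead rune // pending high surrogate; 0 means none
	i := 0
	for i+1 < len(b) {
		unit := rune(b[i])<<8 | rune(b[i+1])
		i += 2
		if lead != 0 {
			if unit >= 0xDC00 && unit <= 0xDFFF {
				// Valid pair: combine into a supplementary code point.
				out = append(out, 0x10000+((lead-0xD800)<<10)+(unit-0xDC00))
				lead = 0
				continue
			}
			// Unpaired high surrogate: one error, then reprocess unit below.
			out = append(out, replacement)
			lead = 0
		}
		switch {
		case unit >= 0xD800 && unit <= 0xDBFF:
			lead = unit // wait for a low surrogate
		case unit >= 0xDC00 && unit <= 0xDFFF:
			out = append(out, replacement) // unpaired low surrogate
		default:
			out = append(out, unit)
		}
	}
	// End of input: a pending high surrogate, a stray trailing byte,
	// or both together count as a single error.
	if lead != 0 || i < len(b) {
		out = append(out, replacement)
	}
	return string(out)
}

func main() {
	fmt.Printf("%+q\n", decodeUTF16BE([]byte("\xdc\x00\xdc\x00"))) // "\ufffd\ufffd"
	fmt.Printf("%+q\n", decodeUTF16BE([]byte("\xd8\x00\x00")))     // "\ufffd"
}
```

The end-of-input check is the part that matters for the second example: a dangling high surrogate and a dangling byte are reported together as one error, not two.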

Also, the name of the internal function isHighSurrogate is misleading: it actually checks whether the argument is a low surrogate.
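For reference, the two surrogate ranges are disjoint, so a predicate named after one should reject the other. The functions below are a hypothetical sketch of what the names would suggest, written for this illustration; they are not the x/text source.

```go
package main

import "fmt"

// isHighSurrogate reports whether r is a high (leading) surrogate,
// U+D800..U+DBFF. Hypothetical sketch, not the x/text implementation.
func isHighSurrogate(r rune) bool {
	return 0xD800 <= r && r <= 0xDBFF
}

// isLowSurrogate reports whether r is a low (trailing) surrogate,
// U+DC00..U+DFFF. Hypothetical sketch, not the x/text implementation.
func isLowSurrogate(r rune) bool {
	return 0xDC00 <= r && r <= 0xDFFF
}

func main() {
	fmt.Println(isHighSurrogate(0xD800), isLowSurrogate(0xD800)) // true false
	fmt.Println(isHighSurrogate(0xDC00), isLowSurrogate(0xDC00)) // false true
}
```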

Code to reproduce:

package main

import (
	"fmt"

	"golang.org/x/text/encoding/unicode"
)

func main() {
	// Big-endian input: the high surrogate U+D800 followed by one stray byte (0x20).
	dec := unicode.UTF16(unicode.BigEndian, unicode.IgnoreBOM).NewDecoder()
	res, err := dec.String("\xd8\x00 ")
	fmt.Println(res, err)
}

odeke-em · 2020-06-10T03:23:47Z

Thank you for this report @abacabadabacaba!

Kindly cc-ing @mpvl.

gopherbot added this to the Unreleased milestone Jun 9, 2020

andybons added the NeedsInvestigation label (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.) Jun 15, 2020