Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/text/cases: Upper drops the 129th character #11460

Closed
chowey opened this issue Jun 29, 2015 · 3 comments
Closed

x/text/cases: Upper drops the 129th character #11460

chowey opened this issue Jun 29, 2015 · 3 comments

Comments

@chowey
Copy link

chowey commented Jun 29, 2015

Here is a simple program that should uppercase some text:

package main

import (
    "golang.org/x/text/cases"
    "golang.org/x/text/language"
)

func main() {
    const a = "abcdefghijklmnopqrstuvwx\n"
    text := a + a + a + a + a + a
    print(cases.Upper(language.Make("en")).String(text))
}

I would expect to get:

ABCDEFGHIJKLMNOPQRSTUVWX
ABCDEFGHIJKLMNOPQRSTUVWX
ABCDEFGHIJKLMNOPQRSTUVWX
ABCDEFGHIJKLMNOPQRSTUVWX
ABCDEFGHIJKLMNOPQRSTUVWX
ABCDEFGHIJKLMNOPQRSTUVWX

but instead I get:

ABCDEFGHIJKLMNOPQRSTUVWX
ABCDEFGHIJKLMNOPQRSTUVWX
ABCDEFGHIJKLMNOPQRSTUVWX
ABCDEFGHIJKLMNOPQRSTUVWX
ABCDEFGHIJKLMNOPQRSTUVWX
ABCEFGHIJKLMNOPQRSTUVWX

The last "D" is dropped, which is the 129th character.

@chowey chowey changed the title x/text/cases: Upper drops the 128th character after a non-breaking space (nbsp) x/text/cases: Upper drops the 129th character Jun 29, 2015
@chowey
Copy link
Author

chowey commented Jun 29, 2015

Greek works fine.

print(cases.Upper(language.Make("el")).String(text))
ABCDEFGHIJKLMNOPQRSTUVWX
ABCDEFGHIJKLMNOPQRSTUVWX
ABCDEFGHIJKLMNOPQRSTUVWX
ABCDEFGHIJKLMNOPQRSTUVWX
ABCDEFGHIJKLMNOPQRSTUVWX
ABCDEFGHIJKLMNOPQRSTUVWX

@ianlancetaylor ianlancetaylor added this to the Unreleased milestone Jun 29, 2015
@chowey
Copy link
Author

chowey commented Jul 16, 2015

The problem starts in transform/transform.go (line 571):

        nDst, nSrc, err = t.Transform(dst[pDst:], src[:n], pSrc+n == len(s))

At this point, dst[pDst:] will be a zero-length slice. This calls cases/map.go (lines 135-144):

func (t *undUpperCaser) Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error) {
    c := context{dst: dst, src: src, atEOF: atEOF}
    for c.next() {
        upper(&c)
    }
    // Standard upper case does not need any lookahead so we can safely not use
    // the checkpointing mechanism. pDst and pSrc will always point to the
    // furthest possible position.
    return c.pDst, c.pSrc, c.err
}

But unfortunately c.next() will, after two loops, increment c.pSrc. See cases/context.go (lines 73-93):

func (c *context) next() bool {
    c.pSrc += c.sz
    if c.pSrc == len(c.src) || c.err != nil {
        c.info, c.sz = 0, 0
        return false
    }
    v, sz := trie.lookup(c.src[c.pSrc:])
    c.info, c.sz = info(v), sz
    if c.sz == 0 {
        if c.atEOF {
            // A zero size means we have an incomplete rune. If we are atEOF,
            // this means it is an illegal rune, which we will consume one
            // byte at a time.
            c.sz = 1
        } else {
            c.err = transform.ErrShortSrc
            return false
        }
    }
    return true
}

Back in transform/transform.go, this means dst will grow and then start reading from pSrc+1. This is wrong and leads to the letter being skipped.

I see two solutions. First, if passing an empty dst is supposed to result in undefined behavior, then a simple fix is to avoid it by changing transform/transform.go (line 554) to:

    if pDst+nDst < initialBufSize {

If instead an empty dst should be checked for in a robust way, then context.next() should be fixed like so:

func (c *context) next() bool {
    if len(c.dst) == 0 {
        c.err = transform.ErrShortDst
        return false
    }
    c.pSrc += c.sz
    if c.pSrc == len(c.src) || c.err != nil {
        c.info, c.sz = 0, 0
        return false
    }
    v, sz := trie.lookup(c.src[c.pSrc:])
    c.info, c.sz = info(v), sz
    if c.sz == 0 {
        if c.atEOF {
            // A zero size means we have an incomplete rune. If we are atEOF,
            // this means it is an illegal rune, which we will consume one
            // byte at a time.
            c.sz = 1
        } else {
            c.err = transform.ErrShortSrc
            return false
        }
    }
    return true
}

@chowey
Copy link
Author

chowey commented Aug 6, 2015

Resolved, see https://go-review.googlesource.com/#/c/13076/.

@chowey chowey closed this as completed Aug 6, 2015
@golang golang locked and limited conversation to collaborators Aug 5, 2016
@rsc rsc unassigned mpvl Jun 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants