Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: net/url: adhere to RFC3986 #16127

Closed
gmccue opened this issue Jun 20, 2016 · 7 comments
Closed

proposal: net/url: adhere to RFC3986 #16127

gmccue opened this issue Jun 20, 2016 · 7 comments
Labels
FrozenDueToAge Proposal WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Milestone

Comments

@gmccue
Copy link

gmccue commented Jun 20, 2016

As detailed in #15891 , net/url does not properly follow RFC3986 when encoding URL strings. Certain characters, such as ( and ) are URL-encoded when they are not required to be by the RFC.

I propose updating the logic in net/url to allow for valid URL characters to remain un-encoded, possibly adding a new case statement for differentiating between escaping URL strings and encoding URL components.

@ianlancetaylor
Copy link
Contributor

Please be more specific about precisely which functions you want to change, and how. Thanks.

@ianlancetaylor ianlancetaylor added this to the Proposal milestone Jun 20, 2016
@ianlancetaylor ianlancetaylor changed the title proposal: adhere to RFC3986 in net/url proposal: net/url: adhere to RFC3986 Jun 20, 2016
@gmccue
Copy link
Author

gmccue commented Jun 21, 2016

Add a new const called escapeQueryComponent which would be used to escape "everything" by default when constructing safe URL strings.

const (
    encodePath encoding = 1 + iota
    encodeHost
    encodeZone
    encodeUserPassword
    encodeQueryComponent
    escapeQueryComponent
    encodeFragment
)

Adhere properly to RFC3986 in the shouldEscape() function.

// Return true if the specified character should be escaped when
// appearing in a URL string, according to RFC 3986.
//
// Please be informed that for now shouldEscape does not check all
// reserved characters correctly. See golang.org/issue/5684.
func shouldEscape(c byte, mode encoding) bool {
    // §2.3 Unreserved characters (alphanum)
    if 'A' <= c && c <= 'Z' || 'a' <= c && c <= 'z' || '0' <= c && c <= '9' {
        return false
    }

    if mode == encodeHost || mode == encodeZone {
        // §3.2.2 Host allows
        // sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
        // as part of reg-name.
        // We add : because we include :port as part of host.
        // We add [ ] because we include [ipv6]:port as part of host.
        // We add < > because they're the only characters left that
        // we could possibly allow, and Parse will reject them if we
        // escape them (because hosts can't use %-encoding for
        // ASCII bytes).
        switch c {
        case '!', '$', '&', '\'', '(', ')', '*', '+', ',', ';', '=', ':', '[', ']', '<', '>', '"':
            return false
        }
    }

    switch c {
    case '-', '_', '.', '~': // §2.3 Unreserved characters (mark)
        return false

    // §2.2 Reserved characters (reserved)
    // reserved = gen-delims / sub-delims
    // gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
    // sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*"
    // sub-delims (contd.) =  "+" / "," / ";" / "="
    case '$', '&', '!', '#', '+', ',', '\'', '/', ':', ';', '=', '?', '@',
        '(', ')', '[', ']', '*':
        // Different sections of the URL allow a few of
        // the reserved characters to appear unescaped.
        switch mode {
        case encodePath: // §3.3
            // The RFC allows sub-delims and : @ but saves / ; , for assigning
            // meaning to individual path segments. This package
            // only manipulates the path as a whole, so we allow those
            // last two as well. That leaves only ? to escape.
            return c == '?' || c == '#' || c == '[' || c == ']'

        case encodeUserPassword: // §3.2.1
            // The RFC allows ';', ':', '&', '=', '+', '$', and ',' in
            // userinfo, so we must escape only '@', '/', and '?'.
            // The parsing of userinfo treats ':' as special so we must escape
            // that too.
            return c == '@' || c == '/' || c == '?' || c == ':'

        case encodeQueryComponent: // §3.4
            // The RFC allows : @ and sub-delims. Escape all gen-delims except
            // : and @. Additionally escape any superfluous &.
            // pchar = unreserved / pct-encoded / sub-delims / ":" / "@".
            // RFC 3986, Appendix A.
            return c == '&' || c == '?' || c == '/' || c == '#' || c == '[' || c == ']'

        case encodeFragment: // §4.1
            // The RFC text is silent but the grammar allows
            // everything, so escape nothing.
            return false

        case escapeQueryComponent:
            // As a special case, escape everything by default when
            // constructing URLs.
            return true
        }
    }

    // Everything else must be escaped.
    return true
}

Use the escapeQueryComponent case for the QueryEscape() function.

// QueryEscape escapes the string so it can be safely placed
// inside a URL query.
func QueryEscape(s string) string {
    return escape(s, escapeQueryComponent)
}

Add support for the escapeQueryComponent case to the escape() function.

func escape(s string, mode encoding) string {
    spaceCount, hexCount := 0, 0
    for i := 0; i < len(s); i++ {
        c := s[i]
        if shouldEscape(c, mode) {
            if c == ' ' && mode == encodeQueryComponent {
                spaceCount++
            } else if c == ' ' && mode == escapeQueryComponent {
                spaceCount++
            } else {
                hexCount++
            }
        }
    }

    if spaceCount == 0 && hexCount == 0 {
        return s
    }

    t := make([]byte, len(s)+2*hexCount)
    j := 0
    for i := 0; i < len(s); i++ {
        switch c := s[i]; {
        case c == ' ' && (mode == encodeQueryComponent || mode == escapeQueryComponent):
            t[j] = '+'
            j++
        case shouldEscape(c, mode):
            t[j] = '%'
            t[j+1] = "0123456789ABCDEF"[c>>4]
            t[j+2] = "0123456789ABCDEF"[c&15]
            j += 3
        default:
            t[j] = s[i]
            j++
        }
    }
    return string(t)
}

Remove the work-around for RFC3986 characters in the validEncodedPath() function

// validEncodedPath reports whether s is a valid encoded path.
// It must not contain any bytes that require escaping during path encoding.
func validEncodedPath(s string) bool {
    for i := 0; i < len(s); i++ {
        switch s[i] {
        case '[', ']':
            // ok - not specified in RFC 3986 but left alone by modern browsers
        case '%':
            // ok - percent encoded, will decode
        default:
            if shouldEscape(s[i], encodePath) {
                return false
            }
        }
    }
    return true
}

@ianlancetaylor
Copy link
Contributor

Thanks. I apologize if I am missing something, but the only change I see to an exported function is that you want to change QueryEscape, although actually I don't quite see how its behavior will change. And it seems that you are suggesting that QueryEscape and QueryUnescape will no longer be inverse transforms.

Are you suggesting that any changes to the current tests?

@gmccue
Copy link
Author

gmccue commented Jun 24, 2016

I did leave out QueryUnescape as well, but the only exported functions that would change are QueryEscape and QueryUnescape.

// QueryUnescape does the inverse transformation of QueryEscape, converting
// %AB into the byte 0xAB and '+' into ' ' (space). It returns an error if
// any % is not followed by two hexadecimal digits.
func QueryUnescape(s string) (string, error) {
    return unescape(s, escapeQueryComponent)
}

// unescape unescapes a string; the mode specifies
// which section of the URL string is being unescaped.
func unescape(s string, mode encoding) (string, error) {
    // Count %, check that they're well-formed.
    n := 0
    hasPlus := false
    for i := 0; i < len(s); {
        switch s[i] {
        case '%':
            n++
            if i+2 >= len(s) || !ishex(s[i+1]) || !ishex(s[i+2]) {
                s = s[i:]
                if len(s) > 3 {
                    s = s[:3]
                }
                return "", EscapeError(s)
            }
            // Per https://tools.ietf.org/html/rfc3986#page-21
            // in the host component %-encoding can only be used
            // for non-ASCII bytes.
            // But https://tools.ietf.org/html/rfc6874#section-2
            // introduces %25 being allowed to escape a percent sign
            // in IPv6 scoped-address literals. Yay.
            if mode == encodeHost && unhex(s[i+1]) < 8 && s[i:i+3] != "%25" {
                return "", EscapeError(s[i : i+3])
            }
            if mode == encodeZone {
                // RFC 6874 says basically "anything goes" for zone identifiers
                // and that even non-ASCII can be redundantly escaped,
                // but it seems prudent to restrict %-escaped bytes here to those
                // that are valid host name bytes in their unescaped form.
                // That is, you can use escaping in the zone identifier but not
                // to introduce bytes you couldn't just write directly.
                // But Windows puts spaces here! Yay.
                v := unhex(s[i+1])<<4 | unhex(s[i+2])
                if s[i:i+3] != "%25" && v != ' ' && shouldEscape(v, encodeHost) {
                    return "", EscapeError(s[i : i+3])
                }
            }
            i += 3
        case '+':
            if mode == encodeQueryComponent || mode == escapeQueryComponent {
                hasPlus = true
            }

            i++
        default:
            if (mode == encodeHost || mode == encodeZone) && s[i] < 0x80 && shouldEscape(s[i], mode) {
                return "", InvalidHostError(s[i : i+1])
            }
            i++
        }
    }

    if n == 0 && !hasPlus {
        return s, nil
    }

    t := make([]byte, len(s)-2*n)
    j := 0
    for i := 0; i < len(s); {
        switch s[i] {
        case '%':
            t[j] = unhex(s[i+1])<<4 | unhex(s[i+2])
            j++
            i += 3
        case '+':
            if mode == encodeQueryComponent || mode == escapeQueryComponent {
                t[j] = ' '
            } else {
                t[j] = '+'
            }
            j++
            i++
        default:
            t[j] = s[i]
            j++
            i++
        }
    }
    return string(t), nil
}

net/url_test.go would additionally change.

type shouldEscapeTest struct {
    in     byte
    mode   encoding
    escape bool
}

var shouldEscapeTests = []shouldEscapeTest{
    // Unreserved characters (§2.3)
    {'a', encodePath, false},
    {'a', encodeUserPassword, false},
    {'a', encodeQueryComponent, false},
    {'a', encodeFragment, false},
    {'a', escapeQueryComponent, false},
    {'a', encodeHost, false},
    {'z', encodePath, false},
    {'A', encodePath, false},
    {'Z', encodePath, false},
    {'0', encodePath, false},
    {'9', encodePath, false},
    {'-', encodePath, false},
    {'-', encodeUserPassword, false},
    {'-', encodeQueryComponent, false},
    {'-', encodeFragment, false},
    {'-', escapeQueryComponent, false},
    {'.', encodePath, false},
    {'_', encodePath, false},
    {'~', encodePath, false},

    // Reserved characters (§2.2)
    {'?', encodePath, true},
    {'#', encodePath, true},
    {'[', encodePath, true},
    {']', encodePath, true},
    {'$', encodePath, false},
    {'&', encodePath, false},
    {'!', encodePath, false},
    {'+', encodePath, false},
    {',', encodePath, false},
    {'\'', encodePath, false},
    {':', encodePath, false},
    {';', encodePath, false},
    {'=', encodePath, false},
    {'@', encodePath, false},
    {'(', encodePath, false},
    {')', encodePath, false},
    {'*', encodePath, false},

    // Query component (§3.4)
    {'&', encodeQueryComponent, true},
    {'?', encodeQueryComponent, true},
    {'/', encodeQueryComponent, true},
    {'#', encodeQueryComponent, true},
    {'[', encodeQueryComponent, true},
    {']', encodeQueryComponent, true},
    {'$', encodeQueryComponent, false},
    {'!', encodeQueryComponent, false},
    {'+', encodeQueryComponent, false},
    {',', encodeQueryComponent, false},
    {'\'', encodeQueryComponent, false},
    {':', encodeQueryComponent, false},
    {';', encodeQueryComponent, false},
    {'=', encodeQueryComponent, false},
    {'@', encodeQueryComponent, false},
    {'(', encodeQueryComponent, false},
    {')', encodeQueryComponent, false},
    {'*', encodeQueryComponent, false},

    // User information (§3.2.1)
    {':', encodeUserPassword, true},
    {'/', encodeUserPassword, true},
    {'?', encodeUserPassword, true},
    {'@', encodeUserPassword, true},
    {'$', encodeUserPassword, false},
    {'&', encodeUserPassword, false},
    {'+', encodeUserPassword, false},
    {',', encodeUserPassword, false},
    {';', encodeUserPassword, false},
    {'=', encodeUserPassword, false},

    // Host (IP address, IPv6 address, registered name, port suffix; §3.2.2)
    {'!', encodeHost, false},
    {'$', encodeHost, false},
    {'&', encodeHost, false},
    {'\'', encodeHost, false},
    {'(', encodeHost, false},
    {')', encodeHost, false},
    {'*', encodeHost, false},
    {'+', encodeHost, false},
    {',', encodeHost, false},
    {';', encodeHost, false},
    {'=', encodeHost, false},
    {':', encodeHost, false},
    {'[', encodeHost, false},
    {']', encodeHost, false},
    {'0', encodeHost, false},
    {'9', encodeHost, false},
    {'A', encodeHost, false},
    {'z', encodeHost, false},
    {'_', encodeHost, false},
    {'-', encodeHost, false},
    {'.', encodeHost, false},
}

func TestShouldEscape(t *testing.T) {
    for _, tt := range shouldEscapeTests {
        if shouldEscape(tt.in, tt.mode) != tt.escape {
            t.Errorf("shouldEscape(%q, %v) returned %v; expected %v", tt.in, tt.mode, !tt.escape, tt.escape)
        }
    }
}

@bradfitz
Copy link
Contributor

Maybe it's best if you just send a change (in Gerrit) so we can see differences and comment there, since large code dumps in comments aren't easy to read. See https://golang.org/doc/contribute.html

But be warned that changes to net/url have become increasingly hard (nearly impossible) for compatibility reasons. There's a large chance the change won't make it in or will be reverted. Let's see what the diff looks like, though.

@bradfitz
Copy link
Contributor

Also, note that I removed the RFC mention in 83676d6, specifically because we've accepted that we can't fully implement the RFC at this point for compatibility reasons.

@bradfitz bradfitz added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Jun 29, 2016
@bradfitz
Copy link
Contributor

Closing due to timeout waiting for reply.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge Proposal WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Projects
None yet
Development

No branches or pull requests

4 participants