Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net/url: URL.String() hashbang "#!" encodes as "#%21" #19917

Closed
dragonsinth opened this issue Apr 10, 2017 · 21 comments
Closed

net/url: URL.String() hashbang "#!" encodes as "#%21" #19917

dragonsinth opened this issue Apr 10, 2017 · 21 comments
Milestone

Comments

@dragonsinth
Copy link

dragonsinth commented Apr 10, 2017

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

1.7.4

What did you do?

https://play.golang.org/p/R-4KVNXa9Z

        u, _ := url.Parse("http://example.org/#!wut")
	fmt.Println(u.String())

What did you expect to see?

http://example.org/#!wut

What did you see instead?

http://example.org/#%21wut

Extra info

#6784
#5684

Looks like the first issue was marked a duplicate of the second, but the underlying problem was never fixed.

@odeke-em
Copy link
Member

/cc @rsc @bradfitz

@dragonsinth is there some RFC that we could quote/refer to that defines
that expectation on how to encode those values?

@bradfitz bradfitz added this to the Go1.10 milestone May 10, 2017
@jhump
Copy link

jhump commented May 11, 2017

@odeke-em, the summary for #5684 has the details, from RFC 3986.

@dragonsinth
Copy link
Author

dragonsinth commented May 11, 2017

Sure, in RFC 2396 (https://www.ietf.org/rfc/rfc2396.txt):

2.3. Unreserved Characters

   Data characters that are allowed in a URI but do not have a reserved
   purpose are called unreserved.  These include upper and lower case
   letters, decimal digits, and a limited set of punctuation marks and
   symbols.

      unreserved  = alphanum | mark

      mark        = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"

   Unreserved characters can be escaped without changing the semantics
   of the URI, but this should not be done unless the URI is being used
   in a context that does not allow the unescaped character to appear.

RFC3986 (https://www.ietf.org/rfc/rfc3986.txt) then later redefines several of those characters (including !) as sub-delims in 2.2 and 2.3, which means reserved for using within indivdual components, such as the fragment.

If you read down to appendix A of RFC3986, "Collected ABNF for URI", you'll find the following rules:

fragment      = *( pchar / "/" / "?" )

pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

As you can see, sub-delims is distinct from pct-encoded and they are allow in the fragment without encoding.

For further reference, Google Search build a whole spec on it a while ago (though it's now deprecated, but lots of app still use this): https://developers.google.com/webmasters/ajax-crawling/docs/getting-started

@fraenkel
Copy link
Contributor

fraenkel commented May 12, 2017

The sub-delims isn't handled properly which affects more than just fragments.

@gopherbot
Copy link

Change https://golang.org/cl/61650 mentions this issue: net/url: don't escape sub-delims in fragment

@tombergan
Copy link
Contributor

@fraenkel The sub-delims isn't handled properly which affects more than just fragments.

Can you provide a specific example?

@fraenkel
Copy link
Contributor

Unless my reading of RFC 3986 is incorrect, sub-delims is part of the reserved set as well.
The shouldEscape method lists the set of reserved characters which does not follow rfc 3986.
I am a bit more concerned that some of the others may be different as well. I didn't do an exhaustive comparison but noticed that the rules did change.

@tombergan
Copy link
Contributor

tombergan commented Sep 12, 2017

@fraenkel I should have been more clear: I meant, are there any examples where programs are broken because Go escapes sub-delims in the Path? You've observed that shouldEscape does not handle sub-delims according to RFC 3986, but see URL.EscapedPath and how that is used in URL.String. I believe Go is correctly handling sub-delims in Path when the caller uses those two methods. If that is wrong, it would help to see a repro.

This issue is about sub-delims in fragments. @dragonsinth has convinced me that we need to support unescaped ! in fragments, at the very least. There are two ways we could do this:

  1. Update shouldEscape to return false for sub-delims when mode is escapeFragment.

  2. Add URL.RawFragment and URL.EscapedFragment(), following the existing pattern for RawPath and EscapedPath. This makes my stomach churn, but at least it is symmetric with the approach we've already taken for Path.

The first approach avoids adding public APIs. Both proposals might possibly break user code, but the second approach breaks less code. The first approach changes the semantics of URL.Fragment, while the second approach keeps the current semantics of URL.Fragment -- a caller must switch to RawFragment and EscapedFragment to get the new escaping behavior. The only code potentially broken by the second approach is code like the following, where the caller is expecting that u.String() will never contain an unescaped sub-delim:

u, err := url.Parse(...)
foo(u.String())

I lean towards the second approach, despite my general dislike of the API, because it's the same pattern as RawPath and that change seemed to work out OK. Also, I'm paranoid of breaking user code. Thoughts?

/cc @rsc for the API question

@dragonsinth
Copy link
Author

i strongly prefer the first approach; a user should have no reasonable expectation that u.String() won't contain unescaped sub-delims in the fragment portion, and in fact the current problematic u.String() rendering is exactly what we've had to code around in our system.

Here's how we're working around this today:

func fixUrlEscaping(ctx context.Context, urlString string) string {
	// Parse + String preserves the original encoding.
	u, err := url.Parse(urlString)
	if err != nil {
		log.Warningf(ctx, "url could not be parsed: %s", urlString)
		return urlString
	}
	// Special fixup for spaces in query string.
	// TODO: something more sophisticated?
	u.RawQuery = strings.Replace(u.RawQuery, " ", "+", -1)

	// Go's  built-in fragment escaping is overly ambitious, replace theirs with a minimalist escaping
	if u.Fragment != "" {
		frag := u.Fragment
		u.Fragment = ""
		return fmt.Sprintf("%s#%s", u.String(), fragmentEscape(frag))
	}

	return u.String()
}

// Scavanged from net/url/url.go
func fragmentEscape(f string) string {
	shouldEscape := func(c byte, curr int, s string) bool {
		// §2.3 Unreserved characters (alphanum)
		if 'A' <= c && c <= 'Z' || 'a' <= c && c <= 'z' || '0' <= c && c <= '9' {
			return false
		}
		switch c {
		case '-', '_', '.', '~': // §2.3 Unreserved characters (mark)
			return false
		case '$', '&', '+', ',', '/', ':', ';', '=', '?', '@', '#', '[', ']', '!', '\'', '(', ')', '*':
			return false
		}

		return true
	}

	escape := func(s string) string {
		spaceCount, hexCount := 0, 0
		for i := 0; i < len(s); i++ {
			c := s[i]
			if shouldEscape(c, i, s) {
				hexCount++
			}
		}

		if spaceCount == 0 && hexCount == 0 {
			return s
		}

		t := make([]byte, len(s)+2*hexCount)
		j := 0
		for i := 0; i < len(s); i++ {
			switch c := s[i]; {
			case shouldEscape(c, i, s):
				t[j] = '%'
				t[j+1] = "0123456789ABCDEF"[c>>4]
				t[j+2] = "0123456789ABCDEF"[c&15]
				j += 3
			default:
				t[j] = s[i]
				j++
			}
		}
		return string(t)
	}

	return escape(f)
}

@tombergan
Copy link
Contributor

tombergan commented Sep 12, 2017

In case I wasn't clear, in both proposals, u.String() will contain unescaped fragment sub-delims when u is generated from url.Parse. Your munging of Fragment in fixUrlEscaping will be unnecessary with both proposals. The only difference is that, with the second proposal, you will need to access EscapedFragment or RawFragment to see the unescaped sub-delims. This is the same way this works today for Path. With the first proposal, you can use Fragment directly (at the potential cost of breaking other users who are not expecting this behavior).

Also, here is some code that would be broken by the first proposal:

var u url.URL
u.Fragment = `'x`
fmt.Sprintf("<a href='%s'>", u.String()) // prints <a href=''x'> because ' is a sub-delim

You could argue that no one should write code like this. I mention this because we actually found code similar to the above in net/http/fs.go. It's possible (likely, even) that code like the above exists in the wild today. This code happens to work today because u.String() will escape the single quote. My first proposal will break the above code. My second proposal will not.

@dragonsinth
Copy link
Author

dragonsinth commented Sep 12, 2017

Ahh, that makes sense. Thanks! In that case, Fragment + RawFragment seems good to me; it mirrors Path and Query. It seems like a tiny break in behavior to make it net more sensible. But I'm not super strongly opinionated here as long as u.String() is great.

@fraenkel
Copy link
Contributor

I can make something up.

 u, _ := url.Parse("http://a!:b!@example.org/#!wut")
fmt.Println(u.String())

userinfo should not escape ! since it too allows sub-delims. I haven't seen anyone complain.
I think the likelihood of a ! in a path is rare which is why we haven't seen anything break there.
And IPVFuture is another case where we won't see any issue.

@tombergan
Copy link
Contributor

Sub-delims in Path work just fine. That was fixed in go1.5, via #5684, which added RawPath and EncodedPath, similarly to a solution proposed above for Fragment. I am not concerned about userinfo given that userinfo is essentially deprecated.

@fraenkel
Copy link
Contributor

What would be the replacement for userinfo?

@tombergan
Copy link
Contributor

I would wait for an actual report (rather than a hypothetical) before doing anything about userinfo, especially given that userinfo is essentially deprecated.

@namusyaka
Copy link
Member

Vote to second approach. Looks reasonable.

@rsc
Copy link
Contributor

rsc commented Nov 22, 2017

I didn't get to thinking about this for Go 1.10, sorry.

To go down the RawFragment/EscapedFragment path I think we would need to have evidence that something cares about the distinction between #!wut and #%21wut. Does anything care? It really really really should not.

(This is in contrast to paths, where the RFCs explicitly allow caring about the distinction between %2F and /, even though it's impossible to tell them apart if you only record a single string.)

@rsc rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017
@dragonsinth
Copy link
Author

dragonsinth commented Nov 22, 2017

I super care, because browsers / javascript see the two as different values.

https://www.google.com/#!wat
https://www.google.com/#%21wat

screen shot 2017-11-22 at 3 52 02 pm

(Ultimately that means round tripping a url from a browser -> go URL -> browser changes the value seen by javascript code.)

@crvv
Copy link
Contributor

crvv commented Nov 23, 2017

caring about the distinction between %2F and /

There is an issue for this.
#21955

@namusyaka
Copy link
Member

Any progress on this?

@dragonsinth
Copy link
Author

Sweet!

dmitshur added a commit to shurcooL/home that referenced this issue Oct 7, 2018
In Go 1.11, the issue golang/go#19917 has been resolved,
and "#!" is now encoded as "#!" rather than "#%21".

Follows golang/go@8a33045.
@golang golang locked and limited conversation to collaborators Jul 13, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

10 participants