Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: x/net/html: Allow getting raw HTML attribute values on Tokenizer #52911

Closed
paulo opened this issue May 15, 2022 · 16 comments
Closed

proposal: x/net/html: Allow getting raw HTML attribute values on Tokenizer #52911

paulo opened this issue May 15, 2022 · 16 comments

Comments

@paulo
Copy link

paulo commented May 15, 2022

Related to #17667

The current Tokenizer API does not provide a way to get the raw tag attribute values when parsing, as it always unescapes the value.

My proposal is to configure such behavior by providing a new API method UnescapeAttr which allows us to do it while keeping consistency across the package. There is also the option of implementing Raw... API methods that replicate the logic of the existing ones while maintaining the original parsed value.

A tentative PR can be found at https://go-review.googlesource.com/c/net/+/405034

@paulo paulo added the Proposal label May 15, 2022
@gopherbot gopherbot added this to the Proposal milestone May 15, 2022
@ianlancetaylor
Copy link
Contributor

CC @neild @bradfitz

@ianlancetaylor ianlancetaylor added this to Incoming in Proposals (old) May 18, 2022
@rsc
Copy link
Contributor

rsc commented May 18, 2022

CC @nigeltao

@nigeltao
Copy link
Contributor

What's the use case? I see that https://go-review.googlesource.com/c/net/+/405034 says it fixes #17667 but that issue notes that "Escaping again the unescaped attribute values can be a solution" that doesn't require new x/net/html code or API changes.

@paulo
Copy link
Author

paulo commented May 20, 2022

@nigeltao I'm not the author of the original issue, but the statement of Escaping again the unescaped attribute values can be a solution isn't true. Two situations where that doesn't work are:

  • when the value is unescaped to begin with, so you're escaping something that wasn't before
  • when the value contains both escaped and unescaped characters

The goal here is to get the raw value, not an escaped/unescaped version of it. In addition, as per the docs: UnescapeString(EscapeString(s)) == s always holds, but the converse isn't always true.

@nigeltao
Copy link
Contributor

The goal here is to get the raw value

This sounds like an XY Problem, where "get the raw value" is a solution but it's not clear (1) what the underlying problem is and (2) whether "get the raw value" is the best solution to that.

What's the actual problem?

@paulo
Copy link
Author

paulo commented May 20, 2022

For context, I'm working on a tool that:

  • parses html
  • possibly does some transformation on the content depending on a set of conditions. None of those conditions are "unescape the value of the tag attributes if X", so those values should stay the same as the original input.
  • outputs said transformed html

This sounds like an XY Problem

I guess I could argue the same for the current implementation. Why are the values of the current implementation unescaped, considering the lib provides a method to do so if needed.

@nigeltao
Copy link
Contributor

nigeltao commented May 21, 2022

possibly does some transformation on the content depending on a set of conditions. None of those conditions are "unescape the value of the tag attributes if X", so those values should stay the same as the original input.

Why do the attribute values have to stay the same bytes? Would it work if your tool outputs equivalent attributes (in that both before and after's unescaped forms are equal) instead of identical attributes?

In other words, what breaks if passing <div bar="&lt;"> to your tool produces <div bar="<">?

Why are the values of the current implementation unescaped, considering the lib provides a method to do so if needed.

Unescaping for text nodes is unfortunately different from unescaping for attributes. Go's Unescape function's documentation could admittedly be better, but it is only for text nodes' raw bytes and it would be incorrect to apply the Unescape function to attributes' raw bytes.

Specifically, look for "&pound=" in https://github.com/WebKit/WebKit/blob/6b07b8bc6e0e5aaac87b1c8373d52e8fe1f942c1/LayoutTests/html5lib/resources/entities02.dat

Focus on lines 195, 204, 263 and 272:

195: <div bar="ZZ&pound=23"></div>
204: |       bar="ZZ&pound=23"

263: <div>ZZ&pound=23</div>
272: |       "ZZ£=23"

For "&pound=", which does not contain a semi-colon, this is not unescaped when in an attribute context but is unescaped (to become "£=") when in a text node context.

Yes, it's maddeningly inconsistent. There's different escaping rules again for <script> and <textarea> tags. Welcome to HTML parsing.

@nigeltao
Copy link
Contributor

Unescaping for text nodes is unfortunately different from unescaping for attributes

For the record, section 13.2.5.73 Named character reference state is the relevant part of the HTML spec:

If the character reference was consumed as part of an attribute, and the last character matched is not a U+003B SEMICOLON character (;), and the next input character is either a U+003D EQUALS SIGN character (=) or an ASCII alphanumeric, then, for historical reasons, flush code points consumed as a character reference and switch to the return state.

@paulo
Copy link
Author

paulo commented May 22, 2022

Why do the attribute values have to stay the same bytes? Would it work if your tool outputs equivalent attributes (in that both before and after's unescaped forms are equal) instead of identical attributes?

Unfortunately, I don't have a good answer for this other than that "in my use case, we can't assume that equivalent attributes would work because we're not the end-user of that output, we're just the ones serving it".

Unescaping for text nodes is unfortunately different from unescaping for attributes.

That's a fair point, thank you for the great example! Indeed escaping/unescaping is not as straightforward as I thought in HTML parsing, but I'm not sure I see it as a blocker to the proposal.

I guess the question we should be asking here then is "why not provide a way to get the raw bytes of a parsed tag?", considering there are use cases for it.

@nigeltao
Copy link
Contributor

"why not provide a way to get the raw bytes of a parsed tag?", considering there are use cases for it.

Because I'm hesitant to increase complexity without first understanding the use case, especially as (1) we'd probably have to maintain this feature forever if we add it and (2) it silently breaks a previous guarantee that Token attributes are always escaped.

I would recommend that your tool canonicalizes the HTML, in addition to whatever other transformations it makes.

If you can't do that, and the existing Tokenizer.Raw method also doesn't help, then fork the html package (or just its Tokenizer).

@rsc
Copy link
Contributor

rsc commented May 25, 2022

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc rsc moved this from Incoming to Active in Proposals (old) May 25, 2022
@paulo
Copy link
Author

paulo commented May 29, 2022

Because I'm hesitant to increase complexity without first understanding the use case, especially as (1) we'd probably have to maintain this feature forever if we add it and (2) it silently breaks a previous guarantee that Token attributes are always escaped.

I can empathize with that. Would a less pervasive change like adding the RawTagAttr method instead of my initial proposal help with both of these concerns?

I would recommend that your tool canonicalizes the HTML, in addition to whatever other transformations it makes. If you can't do that, and the existing Tokenizer.Raw method also doesn't help, then fork the html package (or just its Tokenizer).

I'd prefer to avoid both of these as the first goes against a hard requirement of my tool, and the second, as you'd say, would force me to maintain it and it would be an (in my opinion) unnecessary added dependency on the codebase.

@rsc
Copy link
Contributor

rsc commented Jun 8, 2022

It sounds like there is no consensus on adding this, and it would be best not to add to x/net/html piecemeal.

@rsc rsc moved this from Active to Likely Decline in Proposals (old) Jun 8, 2022
@rsc
Copy link
Contributor

rsc commented Jun 8, 2022

Based on the discussion above, this proposal seems like a likely decline.
— rsc for the proposal review group

@paulo
Copy link
Author

paulo commented Jun 9, 2022

I understand, thank you @nigeltao for your time!

@rsc rsc moved this from Likely Decline to Declined in Proposals (old) Jun 15, 2022
@rsc
Copy link
Contributor

rsc commented Jun 15, 2022

No change in consensus, so declined.
— rsc for the proposal review group

@rsc rsc closed this as completed Jun 15, 2022
@golang golang locked and limited conversation to collaborators Jun 15, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
No open projects
Development

No branches or pull requests

5 participants