proposal: x/net/html: Allow getting raw HTML attribute values on Tokenizer #52911

paulo · 2022-05-15T07:59:19Z

Related to #17667

The current Tokenizer API does not provide a way to get the raw tag attribute values when parsing, as it always unescapes the value.

My proposal is to configure such behavior by providing a new API method UnescapeAttr which allows us to do it while keeping consistency across the package. There is also the option of implementing Raw... API methods that replicate the logic of the existing ones while maintaining the original parsed value.

A tentative PR can be found at https://go-review.googlesource.com/c/net/+/405034


ianlancetaylor · 2022-05-18T16:40:58Z

CC @neild @bradfitz

rsc · 2022-05-18T17:41:44Z

CC @nigeltao

nigeltao · 2022-05-19T23:47:29Z

What's the use case? I see that https://go-review.googlesource.com/c/net/+/405034 says it fixes #17667 but that issue notes that "Escaping again the unescaped attribute values can be a solution" that doesn't require new x/net/html code or API changes.

paulo · 2022-05-20T07:13:37Z

@nigeltao I'm not the author of the original issue, but the statement that "Escaping again the unescaped attribute values can be a solution" isn't true. Two situations where that doesn't work are:

- when the value is unescaped to begin with, so you're escaping something that wasn't escaped before
- when the value contains both escaped and unescaped characters

The goal here is to get the raw value, not an escaped/unescaped version of it. In addition, as per the docs: UnescapeString(EscapeString(s)) == s always holds, but the converse isn't always true.
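The two situations above can be sketched as follows. This is a minimal illustration, assuming the standard library's `html` package as a dependency-free stand-in for x/net/html; the helper names `reescape` and `roundTrips` are hypothetical and not part of either package.

```go
package main

import (
	"fmt"
	"html"
)

// reescape unescapes a raw value and escapes it again, which is the
// workaround mentioned in #17667. (Hypothetical helper for illustration.)
func reescape(raw string) string {
	return html.EscapeString(html.UnescapeString(raw))
}

// roundTrips reports whether the workaround reproduces the raw bytes.
func roundTrips(raw string) bool {
	return reescape(raw) == raw
}

func main() {
	for _, raw := range []string{
		"a &lt; b", // fully escaped input
		"a < b",    // unescaped to begin with
		"&lt; <",   // mix of escaped and unescaped characters
	} {
		fmt.Printf("%q -> %q (same bytes: %v)\n", raw, reescape(raw), roundTrips(raw))
	}
}
```

Only the fully escaped input comes back byte-for-byte; in the other two cases the raw `<` gains an escape that was not in the input.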

nigeltao · 2022-05-20T10:54:47Z

> The goal here is to get the raw value

This sounds like an XY Problem, where "get the raw value" is a solution but it's not clear (1) what the underlying problem is and (2) whether "get the raw value" is the best solution to that.

What's the actual problem?

paulo · 2022-05-20T11:18:52Z

For context, I'm working on a tool that:

- parses HTML
- possibly does some transformation on the content depending on a set of conditions. None of those conditions are "unescape the value of the tag attributes if X", so those values should stay the same as the original input.
- outputs said transformed HTML

> This sounds like an XY Problem

I guess I could argue the same for the current implementation. Why are the values of the current implementation unescaped, considering the lib provides a method to do so if needed?

nigeltao · 2022-05-21T00:39:43Z

> possibly does some transformation on the content depending on a set of conditions. None of those conditions are "unescape the value of the tag attributes if X", so those values should stay the same as the original input.

Why do the attribute values have to stay the same bytes? Would it work if your tool outputs equivalent attributes (in that both before and after's unescaped forms are equal) instead of identical attributes?

In other words, what breaks if passing <div bar="<"> to your tool produces <div bar="&lt;">?
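The difference between identical and equivalent attribute values can be sketched as below. This is only an approximation: it assumes the standard library's `html.UnescapeString`, which applies text-node rules, so it may misjudge values that contain semicolon-less character references. The helper names `identical` and `equivalent` are hypothetical.

```go
package main

import (
	"fmt"
	"html"
)

// identical reports whether two raw attribute values are the same bytes.
func identical(a, b string) bool {
	return a == b
}

// equivalent reports whether two raw values unescape to the same string.
// Assumption: html.UnescapeString uses text-node rules, so this is only an
// approximation for attribute values with semicolon-less references.
func equivalent(a, b string) bool {
	return html.UnescapeString(a) == html.UnescapeString(b)
}

func main() {
	in, out := "<", "&lt;"
	fmt.Println("identical: ", identical(in, out))
	fmt.Println("equivalent:", equivalent(in, out))
}
```

A tool that only needs equivalence can re-escape freely; a tool that needs identity has to carry the original bytes through.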

> Why are the values of the current implementation unescaped, considering the lib provides a method to do so if needed?

Unescaping for text nodes is unfortunately different from unescaping for attributes. The documentation for Go's UnescapeString function could admittedly be better, but that function is only for text nodes' raw bytes, and it would be incorrect to apply it to attributes' raw bytes.

Specifically, look for "&pound=" in https://github.com/WebKit/WebKit/blob/6b07b8bc6e0e5aaac87b1c8373d52e8fe1f942c1/LayoutTests/html5lib/resources/entities02.dat

Focus on lines 195, 204, 263 and 272:

195: <div bar="ZZ&pound=23"></div>
204: |       bar="ZZ&pound=23"

263: <div>ZZ&pound=23</div>
272: |       "ZZ£=23"

For "&pound=", which does not contain a semi-colon, this is not unescaped when in an attribute context but is unescaped (to become "£=") when in a text node context.

Yes, it's maddeningly inconsistent. There's different escaping rules again for <script> and <textarea> tags. Welcome to HTML parsing.
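The text-node half of that example can be reproduced with a short sketch. It assumes the standard library's `html.UnescapeString` (used here to avoid a dependency on x/net/html); the helper name `decodeAsText` is hypothetical.

```go
package main

import (
	"fmt"
	"html"
)

// decodeAsText decodes character references using text-node rules.
// Applying it to an attribute's raw bytes would wrongly decode "&pound=".
func decodeAsText(raw string) string {
	return html.UnescapeString(raw)
}

func main() {
	raw := "ZZ&pound=23"
	fmt.Printf("as text node: %q\n", decodeAsText(raw))
	// Per the test file above, the attribute value keeps the raw reference.
	fmt.Printf("as attribute: %q\n", raw)
}
```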

nigeltao · 2022-05-21T01:25:20Z

> Unescaping for text nodes is unfortunately different from unescaping for attributes

For the record, section 13.2.5.73 Named character reference state is the relevant part of the HTML spec:

> If the character reference was consumed as part of an attribute, and the last character matched is not a U+003B SEMICOLON character (;), and the next input character is either a U+003D EQUALS SIGN character (=) or an ASCII alphanumeric, then, for historical reasons, flush code points consumed as a character reference and switch to the return state.
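The condition in that paragraph can be written as a small predicate. This is a sketch of the quoted rule only, not the code x/net/html uses; the function name `leftLiteral` and its parameters are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// isASCIIAlnum reports whether c is an ASCII letter or digit.
func isASCIIAlnum(c byte) bool {
	return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
}

// leftLiteral reports whether a matched named reference must be left
// undecoded. matched is the text consumed so far (for example "&pound"),
// and rest is the input that follows it.
func leftLiteral(inAttribute bool, matched, rest string) bool {
	if !inAttribute || strings.HasSuffix(matched, ";") || rest == "" {
		return false
	}
	return rest[0] == '=' || isASCIIAlnum(rest[0])
}

func main() {
	fmt.Println("attribute:", leftLiteral(true, "&pound", "=23"))
	fmt.Println("text node:", leftLiteral(false, "&pound", "=23"))
}
```

The `inAttribute` flag is what makes the same bytes decode differently in the two contexts.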

paulo · 2022-05-22T09:50:36Z

> Why do the attribute values have to stay the same bytes? Would it work if your tool outputs equivalent attributes (in that both before and after's unescaped forms are equal) instead of identical attributes?

Unfortunately, I don't have a good answer for this other than that "in my use case, we can't assume that equivalent attributes would work because we're not the end-user of that output, we're just the ones serving it".

> Unescaping for text nodes is unfortunately different from unescaping for attributes.

That's a fair point, thank you for the great example! Indeed escaping/unescaping is not as straightforward as I thought in HTML parsing, but I'm not sure I see it as a blocker to the proposal.

I guess the question we should be asking here then is "why not provide a way to get the raw bytes of a parsed tag?", considering there are use cases for it.

nigeltao · 2022-05-23T08:58:54Z

> "why not provide a way to get the raw bytes of a parsed tag?", considering there are use cases for it.

Because I'm hesitant to increase complexity without first understanding the use case, especially as (1) we'd probably have to maintain this feature forever if we add it and (2) it silently breaks a previous guarantee that Token attributes are always unescaped.

I would recommend that your tool canonicalizes the HTML, in addition to whatever other transformations it makes.

If you can't do that, and the existing Tokenizer.Raw method also doesn't help, then fork the html package (or just its Tokenizer).

rsc · 2022-05-25T18:05:05Z

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

paulo · 2022-05-29T09:28:32Z

> Because I'm hesitant to increase complexity without first understanding the use case, especially as (1) we'd probably have to maintain this feature forever if we add it and (2) it silently breaks a previous guarantee that Token attributes are always unescaped.

I can empathize with that. Would a less pervasive change like adding the RawTagAttr method instead of my initial proposal help with both of these concerns?

> I would recommend that your tool canonicalizes the HTML, in addition to whatever other transformations it makes. If you can't do that, and the existing Tokenizer.Raw method also doesn't help, then fork the html package (or just its Tokenizer).

I'd prefer to avoid both of these as the first goes against a hard requirement of my tool, and the second, as you'd say, would force me to maintain it and it would be an (in my opinion) unnecessary added dependency on the codebase.

rsc · 2022-06-08T17:31:09Z

It sounds like there is no consensus on adding this, and it would be best not to add to x/net/html piecemeal.

rsc · 2022-06-08T18:00:30Z

Based on the discussion above, this proposal seems like a likely decline.
— rsc for the proposal review group

paulo · 2022-06-09T18:30:54Z

I understand, thank you @nigeltao for your time!

rsc · 2022-06-15T18:35:10Z

No change in consensus, so declined.
— rsc for the proposal review group

paulo added the Proposal label May 15, 2022

gopherbot added this to the Proposal milestone May 15, 2022

rsc added the Proposal-FinalCommentPeriod label Jun 8, 2022

rsc removed the Proposal-FinalCommentPeriod label Jun 15, 2022

rsc closed this as completed Jun 15, 2022

rsc added this to Proposals Aug 10, 2022

rsc moved this to Declined in Proposals Aug 10, 2022

golang locked and limited conversation to collaborators Jun 15, 2023

gopherbot added the FrozenDueToAge label Jun 15, 2023

rsc removed this from Proposals Jun 21, 2023
