proposal: x/net/html: Allow getting raw HTML attribute values on Tokenizer #52911
Comments
CC @nigeltao
What's the use case? I see that https://go-review.googlesource.com/c/net/+/405034 says it fixes #17667 but that issue notes that "Escaping again the unescaped attribute values can be a solution" that doesn't require new x/net/html code or API changes.
@nigeltao I'm not the author of the original issue, but the statement of
The goal here is to get the raw value, not an escaped/unescaped version of it. In addition, as per the docs:
This sounds like an XY Problem, where "get the raw value" is a solution but it's not clear (1) what the underlying problem is and (2) whether "get the raw value" is the best solution to that. What's the actual problem?
For context, I'm working on a tool that:
I guess I could argue the same for the current implementation. Why does the current implementation unescape the values, considering the library provides a method to do so if needed?
Why do the attribute values have to stay the same bytes? Would it work if your tool outputs equivalent attributes (in that both before and after's unescaped forms are equal) instead of identical attributes? In other words, what breaks if passing
Unescaping for text nodes is unfortunately different from unescaping for attributes. Go's Unescape function's documentation could admittedly be better, but it is only for text nodes' raw bytes and it would be incorrect to apply the Unescape function to attributes' raw bytes. Specifically, look for "&pound=" in https://github.com/WebKit/WebKit/blob/6b07b8bc6e0e5aaac87b1c8373d52e8fe1f942c1/LayoutTests/html5lib/resources/entities02.dat Focus on lines 195, 204, 263 and 272:
For "&pound=", which does not contain a semicolon, this is not unescaped when in an attribute context but is unescaped (to become "£=") when in a text node context. Yes, it's maddeningly inconsistent. There are different escaping rules again for
For the record, section 13.2.5.73 Named character reference state is the relevant part of the HTML spec:
Unfortunately, I don't have a good answer for this other than that "in my use case, we can't assume that equivalent attributes would work because we're not the end-user of that output, we're just the ones serving it".
That's a fair point, thank you for the great example! Indeed escaping/unescaping is not as straightforward as I thought in HTML parsing, but I'm not sure I see it as a blocker to the proposal. I guess the question we should be asking here then is "why not provide a way to get the raw bytes of a parsed tag?", considering there are use cases for it.
Because I'm hesitant to increase complexity without first understanding the use case, especially as (1) we'd probably have to maintain this feature forever if we add it and (2) it silently breaks a previous guarantee that Token attributes are always escaped. I would recommend that your tool canonicalizes the HTML, in addition to whatever other transformations it makes. If you can't do that, and the existing
This proposal has been added to the active column of the proposals project
I can empathize with that. Would a less pervasive change like adding the
I'd prefer to avoid both of these, as the first goes against a hard requirement of my tool, and the second, as you'd say, would force me to maintain it and would be an (in my opinion) unnecessary added dependency in the codebase.
It sounds like there is no consensus on adding this, and it would be best not to add to x/net/html piecemeal.
Based on the discussion above, this proposal seems like a likely decline.
I understand, thank you @nigeltao for your time!
No change in consensus, so declined.
Related to #17667
The current Tokenizer API does not provide a way to get the raw tag attribute values when parsing, as it always unescapes the value.
My proposal is to configure such behavior by providing a new API method UnescapeAttr which allows us to do it while keeping consistency across the package. There is also the option of implementing Raw... API methods that replicate the logic of the existing ones while maintaining the original parsed value.
A tentative PR can be found at https://go-review.googlesource.com/c/net/+/405034