Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/net/html: 'noscript' inner HTML parsing issue #16318

Closed
bearburger opened this issue Jul 11, 2016 · 23 comments
Closed

x/net/html: 'noscript' inner HTML parsing issue #16318

bearburger opened this issue Jul 11, 2016 · 23 comments
Labels
FrozenDueToAge NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made.
Milestone

Comments

@bearburger
Copy link

bearburger commented Jul 11, 2016

  1. What version of Go are you using (go version)?

    go1.6 windows/amd64
    
  2. What operating system and processor architecture are you using (go env)?

    set GOARCH=amd64                                  
    set GOBIN=                                        
    set GOEXE=.exe                                    
    set GOHOSTARCH=amd64                              
    set GOHOSTOS=windows                              
    set GOOS=windows                                  
    set GOPATH=F:\go\path64                           
    set GORACE=                                       
    set GOROOT=F:\go\root16                           
    set GOTOOLDIR=F:\go\root16\pkg\tool\windows_amd64 
    set GO15VENDOREXPERIMENT=1                        
    set CC=gcc                                        
    set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0  
    set CXX=g++                                       
    set CGO_ENABLED=1                                 
    
  3. What did you do?

    package main
    
    import (
        "golang.org/x/net/html"
        "log"
        "bytes"
    )
    
    
    func main() {
        data := "<noscript><img src='https://golang.org/doc/gopher/frontpage.png' /></noscript><p><img src='https://golang.org/doc/gopher/doc.png' /></p>"
        if document, err := html.Parse(bytes.NewReader([]byte(data))); err == nil {
            var parser func(*html.Node)
    
            parser = func(n *html.Node) {
                if n.Data == "img" {
                    log.Println(n.Attr[0].Val)
                }
                if n.Data == "noscript" {
                    // here is the issue - noscript tag inner html is represented as TextNode and can't be used as ElementNode
                    log.Println("noscript", n.FirstChild.Type == html.TextNode, n.FirstChild.Data)
                }
                for c := n.FirstChild; c != nil; c = c.NextSibling {
                    parser(c)
                }
            }
    
            parser(document)
        } else {
            log.Panicln("Parse html error", err)
        }
    }
  4. What did you expect to see?

    2016/07/11 13:47:33 noscript false img
    2016/07/11 13:47:33 https://golang.org/doc/gopher/frontpage.png
    2016/07/11 13:47:33 https://golang.org/doc/gopher/doc.png
    
  5. What did you see instead?

    2016/07/11 13:39:57 noscript true <img src='https://golang.org/doc/gopher/frontpage.png' />
    2016/07/11 13:39:57 https://golang.org/doc/gopher/doc.png
    
@bearburger bearburger changed the title x/net/html 'noscript' inner HTML parsing issue x/net/html: 'noscript' inner HTML parsing issue Jul 11, 2016
@roma86
Copy link

roma86 commented Jul 11, 2016

Same here

Mac OS 10.11.5

  1. What version of Go are you using (go version)?
go version go1.6.2 darwin/amd64
  1. What operating system and processor architecture are you using (go env)?
OARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/Georg/Develop/Go"
GORACE=""
GOROOT="/usr/local/Cellar/go/1.6.2/libexec"
GOTOOLDIR="/usr/local/Cellar/go/1.6.2/libexec/pkg/tool/darwin_amd64"
GO15VENDOREXPERIMENT="1"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fno-common"
CXX="clang++"
CGO_ENABLED="1"

Detailed explanation with code and source html

@bearburger
Copy link
Author

bearburger commented Jul 11, 2016

Here is the fix but I need someone who completely understand how parser and tokenizer works as I'm not sure if there is no side effects of my changes. Also I'm not sure if it works as it should in edge case tests like <noscript><iframe></noscript>. All tests are updated to new version so I think for one of the core package maintainers shouldn't be hard to understand what has been changed and check if it works OK.

@nathan-osman
Copy link

nathan-osman commented Dec 5, 2016

According to W3C, <noscript> may contain other elements, so this is clearly incorrect behavior. Can we please apply @bearburger's commit and get this fixed?

@riking
Copy link

riking commented May 3, 2017

@bearburger the test changes for that seem wrong: I'm pretty sure that

<!doctype html><noscript><iframe></noscript>X

should be parsed as

| <!DOCTYPE html>
| <html>
|   <body>
|     <noscript>
|       <iframe>
|     "X"

@bearburger
Copy link
Author

@riking you are right - iframe can't include other tags.

@riking
Copy link

riking commented May 3, 2017

I meant that the </noscript> should force closing of all unclosed tags until the noscript open tag. There's a mechanism for that, right? Like how closing a </table> would close still-open <td>s.

<!doctype html><noscript><div></noscript>X
should give the same parse structure.

@andlabs
Copy link
Contributor

andlabs commented Sep 1, 2017

Is there a workaround I can use until this is resolved? Should I recursively tokenize z.Text() for a <noscript>? And if so, would NewTokenizerFragment("noscript") just do the same thing as this bug?

@hryamzik
Copy link

hryamzik commented Sep 1, 2017

reg, err := regexp.Compile("</?noscript>")
safeHTMLBytes := reg.ReplaceAll(data, []byte{})

@andlabs
Copy link
Contributor

andlabs commented Sep 2, 2017

That is such a bad solution that I am tempted to file an issue on the project linked about it. It will not capture every case (uppercase, attributes, whitespace, etc.) and it will capture false positives. A solution must exist that does not attempt to transform input outside the scope of x/net/html.

@nathan-osman
Copy link

nathan-osman commented Sep 2, 2017

I have a package affected by this issue as well. The simplest solution is to just feed the content back into the parser:

https://github.com/nathan-osman/go-sechat/blob/563300db9224ed01797dfe1f89f489e8eb9dd5ba/auth.go#L65

Assuming this bug is eventually fixed, I'll need to come up with a way to determine whether the recursive parsing is necessary at runtime.

@winteraz
Copy link

winteraz commented Nov 3, 2017

Is this accepted as a valid bug? I just found that noscript children are not parsed and I've found this issue.

@ghost
Copy link

ghost commented Dec 1, 2017

Discovered this yesterday working with goquery. While I could parse the noscript tag contents (get document.Find("noscript").Text(), pass it to strings.NewReader(), then goquery.NewDocumentFromReader()), this is clumsy.

@mmasouras
Copy link

I stumbled upon this as well, everything under in my document is returned as one text node.

@alin04
Copy link

alin04 commented Oct 3, 2018

It's odd that 12.2.6.4.5 isn't implemented in parse.go
https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inheadnoscript

Ah, upon further inspection, parser assumes scripting is always enabled. There is no support for scripting = false (that I can see). Therefore, this behavior (to treat noscript contents as raw text) is correct.

https://html.spec.whatwg.org/multipage/scripting.html#the-noscript-element

The ask, then, is that parse.go supports scripting disabled mode.

@driusan
Copy link

driusan commented Dec 7, 2018

Even if the parser does assume scripting is enabled the behaviour is still wrong. The spec says "it represents nothing if scripting is enabled", and it's being represented as a TextNode by the parser.

The documentation for the package also doesn't state anywhere that it acts as if scripting is enabled (which is the opposite of what most people would for assume for a Go HTML parser given the lack of javascript engines written in Go.)

@gopherbot
Copy link

Change https://golang.org/cl/172557 mentions this issue: html: add "in head noscript" im support

gopherbot pushed a commit to golang/net that referenced this issue Apr 24, 2019
In the spec 12.2.6.4.5, the "in head noscript" insertion mode is defined.
However, this package and its parser doesn't have the insertion mode,
because the scripting=false case is not considered currently.

This commit adds a test and a support for the "in head noscript"
insertion mode. This change has no effect on the actual behavior.

Updates golang/go#16318

Change-Id: I9314c3342bea27fa2acf2fa7d980a127ee0fbf91
Reviewed-on: https://go-review.googlesource.com/c/net/+/172557
Reviewed-by: Nigel Tao <nigeltao@golang.org>
@namusyaka
Copy link
Member

namusyaka commented Apr 24, 2019

This behavior can be considered correct given that scripting is always true.
On the other hand, we have now incorporated the implementation of in head noscript im into the x/net/html package.
There is no way to use this im at this time, but I think it would be nice to provide some way to configure the scripting property as a public API.
I ask @nigeltao for making decision on this.

@namusyaka namusyaka added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Apr 24, 2019
@jimmyfrasche
Copy link
Member

Isn't scripting being true itself a deviation from the spec?

@alin04
Copy link

alin04 commented Apr 24, 2019

Scripting enabled/disabled should be configurable in order to support emulating different browser contexts.

https://html.spec.whatwg.org/multipage/webappapis.html#enabling-and-disabling-scripting

@namusyaka
Copy link
Member

namusyaka commented Apr 24, 2019

Yes, I'm sorry if my previous comment made you confusing, but I was thinking about how to make the scripting property configurable.

@nigeltao
Copy link
Contributor

Yeah, we should have some way to configure scripting. I'd also like to keep API compatibility, and room for future configuration options. Perhaps some new functions like

func ParseWithOptions(r io.Reader, opts ...ParseOptions) (*Node, error) { etc }
func ParseFragmentWithOptions(r io.Reader, context *Node, opts ...ParseOptions) (*Node, error) { etc }

type ParseOption func(*parser)

func ParseOptionEnableScripting(enable bool) ParseOption {
  return func(p *parser) {
    p.scripting = enable
  }
}

Call sites would look like:

doc, err := html.ParseWithOptions(r, html.ParseOptionEnableScripting(false))

The "options are functions" pattern here is the same as e.g. https://github.com/golang/text/blob/master/internal/ucd/ucd.go#L134

@gopherbot
Copy link

Change https://golang.org/cl/174157 mentions this issue: html: implement ParseWithOptions and ParseFragmentWithOptions

@marcobeierer
Copy link

@nigeltao
Thank you for the fix for this issue and the work done.

I just ran into this issue and investigated it and if I see it right, the fix is not complete yet. It just fixes parsing noscript elements in the head, but not outside of head. Outside of head the content of the noscript element is still treated as text.

According to the specs the content of noscript elements should be parsed normally outside of head.

Outside of head elements, if scripting is disabled for the noscript element
The noscript element's content model is transparent, with the additional restriction that a noscript element must not have a noscript element as an ancestor (that is, noscript can't be nested).

I may create a pull request to fix this, but have to work through the Contribution Guide first next week.

@golang golang locked and limited conversation to collaborators Sep 25, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made.
Projects
None yet
Development

No branches or pull requests