x/net/html: add StripTags example #40345

lesichkovm · 2020-07-22T12:23:51Z

Stripping HTML tags is absolutely required for every web related development task.

This was previously raised in: #5884

empijei · 2020-08-04T14:16:53Z

Can you please expand on the requirements you have and the problem you're trying to solve that would require the use of stripTags?

(Also a duplicate of #22639)

lesichkovm · 2020-08-07T21:41:42Z

@empijei Please makes sure you read the first message carefully.

We are not talking about specific issue at hand.

Its about being able to use a function that should be part of the standard library, instead of relying on 3rd party implementation.

empijei · 2020-08-08T15:11:35Z

Why do you think it should be part of the standard library? I don't think it should and I think it might be even dangerous to expose.

If you ask me about implementing and exposing a sanitizer I might be on board with that, I am just not sure stripTags would be the best thing to have in our public API.

When you say "Please makes sure you read the first message carefully." I read all 18 words of the initial report very carefully and I see 0 reasons that support stripping tags as "absolutely required for every web related development task". Moreover the initial report you linked does not have that information.

I have been working on the web for quite a while and never needed such a function. I might have needed a textual representation of a page (which is not achievable by simply stripping tags) or I might have needed a sanitizer (which, again, cannot be done by just stripping tags) but not once I just needed to strip tags and not do anything else.

That said I might be missing the obvious use cases here, so if you want this in the standard library we would need to build a case and gather some info on why and when this would be useful, so can you please provide some use cases? I really need more information to propose a change the stdlib.

Edit

If you, for example, rely on that function to sanitize your input (e.g. for writing an email client and display text of emails without potential HTML injection) you might find yourself in trouble.

Given, for example, this input: Foo<<template>script</template>>alert(1)<<svg>/script> the output is Foo<script>alert(1)</script> which would execute code if embedded in a page, so please try to understand why I am not so keen to expose this function: it is quite easy to misuse.

If what you need is a sanitizer please take a look at bluemonday or similar packages, or open a feature request to add a sanitizer to Go.

To conclude: stripTags strips tags that already are in the input that is passed to it, but it does not attempt to remove tags that might be in the output after the function is called, which makes it a pretty scary footgun.

/cc @kele

lesichkovm · 2020-08-16T21:31:47Z

Sorry to disagree. Stipping HTML tags is used regularly in web development, and I find myself using it in absolutely every web project.

Definitely there is a reason why its in the PHP standard library:
https://www.php.net/manual/en/function.strip-tags.php

The easiest case to make is to go to https://pkg.go.dev/search?page=1&q=strip+tags and check how many times people have implemented it, and will keep re-implementing it as its a must have function.

When all it takes is to make a function that already exists in the library public.

kele · 2020-08-17T05:15:09Z

It's okay to disagree.

Please provide some use cases, though. The argument of having others implemented such a function is not enough, as we still don't know why they have implemented it. They might have had perfectly valid reasons that simply don't fit html/template. Could you help us with some examples of usage?

empijei · 2020-08-17T08:31:39Z

Stipping HTML tags is used regularly in web development, and I find myself using it in absolutely every web project.

If that is the case can you please provide one use case for it? Please, just one so that we know what we are talking about.
Even the standard library never uses that function directly: there are only two calls to it and both of them are used to pass the result to htmlReplacer. Both of the results are then used for the contextual autoescaper in the template package. I still think there is no good way to use this function other than in a precise context so I am asking you to please provide one valid use case for this. If you find yourself using it so much you might have vulnerabilities in your code, so you might want to have your code reviewed by a security expert.

Definitely there is a reason why its in the PHP standard library:

Thanks for bringing PHP to a security discussion, let's take a look at what exposing this function caused in the PHP ecosystem:

People are asking if it is enough to protect their site (spoiler, it is not). The first answer correctly says that you should then take the output and pass it to an escaping funciton that needs to be correct for the context you are going to put the data in. This is what the html/template package does for you.
Let's look at the doc you linked, note section, which is quite telling on the dangers that come with this funciton:
So one might ask, has this ever caused real vulnerabilities in practice? Judging by all mentions from several sources [1][2][3][4][5] I would be inclined to say it did, and it even allows to bypass protections made exactly for that
PHP and ruby on rails had CVEs assigned because they were not considering non-printable or null bytes in the strip-tags and strip_tags helpers. Had they exposed actual escapers this wouldn't have been an issue since these characters would have been encoded, but that was not the case

The easiest case to make is to go to https://pkg.go.dev/search?page=1&q=strip+tags and check how many times people have implemented it, and will keep re-implementing it as its a must have function.

Let's now take a look at the Go ecosystem as you suggested. Searching for "go strip tags" on most search engines takes you to a stackoverflow thread that suggests to use an actual sanitizer, which is what I wrote initially we could decide to implement here.
The second result is a package that does exactly that and that has the same warning I mentioned in the description:

This package has 40 dependents according to pkg.go.dev, which doesn't seem like a lot considering how important you say this functionality is, but I decided to take a look anyways. All projects I could find on github that seem to be stripping tags are personal projects, CLI utils, crawlers or email services. The only use I could find of it in a web app or framework is in gobuffalo/docs and it uses it in conjunction with bluemonday.

Thanks for pointing out all this data, because it looks like it is building a good case against exporting this function. We might decide to provide a sanitizer in the future, but stripTags looks like a no-no to me. /cc @FiloSottile for one last security opinion here.

Before I close this bug I would like to ask you if you have any actual example we can look at to be sure we didn't overlook any potential good uses of tag stripping for web application development.

lesichkovm · 2020-08-18T06:53:02Z

First, enough with the scare mongering @empijei. This is your third post where you just try to scare the community. Every function can be used in a wrong context depending on the programmers experience. This doesn't mean the function has a security flaw. Your presumption is you have untrusted input. If this is the case you will need extra validation and not only for XSS injections, but also SQL injections. But this is not the responsibility of the StripTags function. And most usages for StripTags are not for securing untrusted sources anyway. All StripTags does is remove <WORDS_SURROUNDED_WITH_ANGLE_BRACKETS> from a TEXT string. Period nothing more no less.

Second. Use cases (non-exhaustive list), as I said its an everyday occurrence in the live of a web developer.

Convert HTML to text for news articles, blog posts, books in HTML format, RSS feeds, etc.
Generating summaries for bulettins, news letters
Counting real text characters in an HTML block
Convert HTML emails to TEXT email content. Especially for mail clients not supporting HTML.
Cleanup HTML from rich text editors, where browsers add unwanted tags
Convert HTML output to text for text readers
Creating audio from subtitles in HTML format
Converting HTML formatted books to text for better readability
Converting HTML pages to text, for indexing for search engines (Solr, ElasticSearch, etc)

Third. Lets see what happens when this small and useful utility function is not a standard
A. Everyone re-invents it

see this thread here for C#: https://stackoverflow.com/questions/785715/how-can-i-strip-html-tags-from-a-string-in-asp-net
GoLang - https://stackoverflow.com/questions/55036156/how-to-replace-all-html-tag-with-empty-string-in-golang
GoLang - https://www.reddit.com/r/golang/comments/7l1vqj/strip_html_tags_from_strings/ (pay attention to the last post, where the author wrote one himself)
Java - https://stackoverflow.com/questions/240546/remove-html-tags-from-a-string
GoLang - https://gowalker.org/github.com/kennygrant/sanitize
GoLang - https://www.php2golang.com/method/function.strip-tags.html
B. Everyone reinvents it poorly (99%). Here @empijei. How much of the above are secure?
C. Having to rely on external 3rd party package (bluemonday, redtuesday, greenwednesday, you name it) is a security issue by itself

empijei · 2020-08-18T07:51:44Z

This is your third post where you just try to scare the community

I am just trying to showcase why adding such a function would be dangerous. I care about the security of the ecosystem.

If this is the case you will need extra validation and not only for XSS injections, but also SQL injections

Let's leave SQL out of this please. It has nothing to do with tag stripping.

Convert HTML to text for news articles, blog posts, books in HTML format, RSS feeds, etc.

Converting HTML pages to text, for indexing for search engines (Solr, ElasticSearch, etc)

Generating summaries for bulettins, news letters

Convert HTML emails to TEXT email content. Especially for mail clients not supporting HTML.

For these you would then need to encode the result in HTML, so after calling stripTags you would have to call an html replacer, like html/template does. This is the exact concept of "putting untrusted input in your HTML page". If instead the output goes in something that is not rendering HTML (e.g. an RSS feed, an email client or a CLI application) this is exactly what I said: "CLI utils, crawlers or email services" (see my previous message).
That said email, CLI and non-web-related technologies are not strictly web development and we might want to add those functions where they belong, like an HTML manipulation library instead of a templating one. I am just arguing this function might not belong to this package.

Counting real text characters in an HTML block

Convert HTML output to text for text readers

Creating audio from subtitles in HTML format

Converting HTML formatted books to text for better readability

This doesn't work, stripTags would count characters in hidden blocks too and would ignore positioning: you might end up with a completely different text than the one in the page that would be displayed by rendering the HTML.

Cleanup HTML from rich text editors, where browsers add unwanted tags

Again, a sanitizer would do this job.

So, let's recap if I got this right, you are arguing this is a much needed feature because:

HTML-to-text for email, crawlling or RSS-like tools are a common enough use case
Providing an inherently unsafe function is okay as long as it doesn't get confused for something security related
Not providing this feature would make people write their own, which might even be worse
Using a publicly available sanitizer that most of the go ecosystem relies on is a security issue

Let's work on this together.

The stripTags function is not exactly related to generating HTML or producing web pages, as you said most use cases are RSS, crawling, ebooks, audio and such. Luckily we already have a library dedicated to HTML processing: /x/net/html. Would it be okay for you to provide either of the following:

an example of how to use this library to strip tags
a function that strips tags from the input
?

Edit

This is more or less what StripTags would look like if you wanted to use the html package:

func StripTags(htmlin io.Reader) (string, error) {
	doc, err := html.Parse(htmlin)
	if err != nil {
		return "", err
	}
	skip := map[string]bool{
		"script":   true,
		"style":    true,
		"textarea": true,
		"title":    true,
	}
	var sb strings.Builder
	var f func(*html.Node)
	f = func(n *html.Node) {
		if n.Type == html.TextNode {
			if n.Parent.Type == html.ElementNode && !skip[strings.ToLower(n.Parent.Data)] {
				sb.WriteString(n.Data)
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			f(c)
		}
	}
	f(doc)
	return sb.String(), nil
}

And this is a couple of test cases I wrote:

func TestStripTags(t *testing.T) {
	var tests = []struct {
		name, in, want string
	}{
		{
			name: "stripTags example",
			in:   `<b>Hi&excl;</b> <script>...</script>`,
			want: `Hi! `,
		},
		{
			name: "stripTags example",
			in:   `<div><p>foo <b>bold</b> bar</p><div><style>ignored</style>`,
			want: `foo bold bar`,
		},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			got, err := StripTags(strings.NewReader(tt.in))
			if err != nil {
				t.Fatal(err)
			}
			if got != tt.want {
				t.Errorf("got: %v, want: %v", got, tt.want)
			}
		})
	}
}

Is this what you are asking to implement/export?

lesichkovm · 2020-08-21T03:57:09Z

Hi @empijei, thank you for taking the time to think the code over:

Its similar but I would expect to pass a string as parameter instead of io.Reader, The most HTML one would receive would be a string (i.e. database text field), converting it to io.Reader may be a step over what one would like to do, but is not a deal breaker.
Initially I did not like the skip map, thinking its an overkill for such a simple function. But the more I thought about it the more I liked it, and now I think its really good. I would probably add the template tag in the map too.
About using html.Parse, I quite like how you approached it. I am sure it would be good for properly validated nested HTML. What about not properly nested tags "[b]....[u]....[/b]...[/u]". Would html.Parse cope with it?

About actual implementation not sure I am the right person. I have seen this previously implemented here:
https://storage.googleapis.com/go-attachment/5884/7/strip.go

The two approaches are much different, not sure which is correct, and what are the pros and cons of both. I personally like yours as is more readable, while the other seems more convoluted.

empijei · 2020-08-21T07:17:58Z

Its similar but I would expect to pass a string as parameter instead of io.Reader, The most HTML one would receive would be a string (i.e. database text field), converting it to io.Reader may be a step over what one would like to do, but is not a deal breaker.

A very nice thing about Go is that any string can be used as an io.Reader, so if we wanted to build a wrapper function that takes a string it would be

StripTagsString(htmlin string)(string, error) {
  // This is the same StripTags that I wrote in the previous comment
  return StripTags(strings.NewReader(htmlin))
}

Initially I did not like the skip map, thinking its an overkill for such a simple function. But the more I thought about it the more I liked it, and now I think its really good. I would probably add the template tag in the map too.

The skip map is necessary and a similar thing is used in the stripTag in the template package that you were asking to have exported:

go/src/html/template/context.go

Lines 227 to 239 in 2cd2ff6

    
           const ( 
        
           	// elementNone occurs outside a special tag or special element body. 
        
           	elementNone element = iota 
        
           	// elementScript corresponds to the raw text <script> element 
        
           	// with JS MIME type or no type attribute. 
        
           	elementScript 
        
           	// elementStyle corresponds to the raw text <style> element. 
        
           	elementStyle 
        
           	// elementTextarea corresponds to the RCDATA <textarea> element. 
        
           	elementTextarea 
        
           	// elementTitle corresponds to the RCDATA <title> element. 
        
           	elementTitle 
        
           )

About using html.Parse, I quite like how you approached it. I am sure it would be good for properly validated nested HTML. What about not properly nested tags "[b]....[u]....[/b]...[/u]". Would html.Parse cope with it?

It would work, but if you want to have the matching even more relaxed you could change the code of `f`` to be:

	f = func(n *html.Node) {
		if n.Type == html.TextNode {
			if n.Parent.Type == html.ElementNode && skip[strings.ToLower(n.Parent.Data)] {
				return
			}
			sb.WriteString(n.Data)
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			f(c)
		}
	}

About actual implementation not sure I am the right person. I have seen this previously implemented here:
https://storage.googleapis.com/go-attachment/5884/7/strip.go

This code is an overly complicated implementation that basically copy-pasted the entire source of the template package just to export stripTag. The issue is that the implementation of stripTag in this case is so complicated because it needs to be in the context of a templating package, which is not what you seem to be asking for here.

I would honestly think that the template package is not a good place to put the stripTags function in. The html package seems like a better place. Maybe we could add an example there that explains how to extract text from the given HTML? (which is what I did in my example above).
Would you find that useful?

lesichkovm · 2020-08-22T08:15:25Z

@empijei I agree with you the "template" package is not the correct place for it to be

The first place I would look for a StripTag function would be the "strings" package, simply because the HTML will be a string.
The second place I would look into will be the "html" package, as this would be an intuitive place also, where such a function may reside.

Yes, an example of stripping tags would be useful. Also a warning that this is not a "safe sanitizing" function would be required.

BTW. How difficult would be to implement an allowable tags option similar to this one here: https://www.php.net/manual/en/function.strip-tags.php, probably with a signature simlar to this:
StripTags("[span]Hello[/span] [b][u]World[/u][/b]", "b,u") - where [b] and [u] tags will be left over
For instance some older HTML renderers support only a limited subset of tags.

empijei · 2020-08-26T06:45:53Z

Yes, an example of stripping tags would be useful. Also a warning that this is not a "safe sanitizing" function would be required.

I could add this to the "html" docs, since we now have an example to extract links from a page. @bradfitz what do you think about adding an example to extract text from html to the x/ html package?

How difficult would be to implement an allowable tags option similar to this one here:

It would not be difficult but since at that point whatever is produced will end up being rendered as HTML I would really prefer if people used a sanitizer instead (there would be no safe way to use a StripTag function that leaves some tags untouched, it would require an allowlist of per-tag attributes, and at that point you'd be rewriting a full sanitizer).

FiloSottile · 2020-08-26T14:24:20Z

I agree with @empijei that html/template is a security focused package that should not provide tools for developers to shoot themselves in the foot. Sounds like this can live as an example in golang.org/x/net/html, so different requirements can be handled by modifying the code.

lesichkovm · 2023-01-11T19:47:03Z

Too old, closing

ALTree changed the title ~~html/template: "stripTags" to be made public~~ html/template: export stripTags Jul 22, 2020

ALTree added FeatureRequest NeedsDecision labels Jul 22, 2020

FiloSottile changed the title ~~html/template: export stripTags~~ golang.org/x/net/html: add StripTags example Aug 26, 2020

FiloSottile added Documentation NeedsFix and removed FeatureRequest NeedsDecision labels Aug 26, 2020

FiloSottile added this to the Unreleased milestone Aug 26, 2020

seankhliao changed the title ~~golang.org/x/net/html: add StripTags example~~ x/net/html: add StripTags example Sep 8, 2021

jmooring mentioned this issue Jan 19, 2022

Symbol '>' breaks plainify function gohugoio/hugo#9199

Closed

lesichkovm closed this as not planned Jan 11, 2023

golang locked and limited conversation to collaborators Jan 11, 2024

gopherbot added the FrozenDueToAge label Jan 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

x/net/html: add StripTags example #40345

x/net/html: add StripTags example #40345

lesichkovm commented Jul 22, 2020

empijei commented Aug 4, 2020 •

edited

Loading

lesichkovm commented Aug 7, 2020 •

edited

Loading

empijei commented Aug 8, 2020 •

edited

Loading

lesichkovm commented Aug 16, 2020 •

edited

Loading

kele commented Aug 17, 2020

empijei commented Aug 17, 2020

lesichkovm commented Aug 18, 2020

empijei commented Aug 18, 2020 •

edited

Loading

lesichkovm commented Aug 21, 2020 •

edited

Loading

empijei commented Aug 21, 2020

lesichkovm commented Aug 22, 2020 •

edited

Loading

empijei commented Aug 26, 2020

FiloSottile commented Aug 26, 2020

lesichkovm commented Jan 11, 2023

x/net/html: add StripTags example #40345

x/net/html: add StripTags example #40345

Comments

lesichkovm commented Jul 22, 2020

empijei commented Aug 4, 2020 • edited Loading

lesichkovm commented Aug 7, 2020 • edited Loading

empijei commented Aug 8, 2020 • edited Loading

Edit

lesichkovm commented Aug 16, 2020 • edited Loading

kele commented Aug 17, 2020

empijei commented Aug 17, 2020

lesichkovm commented Aug 18, 2020

empijei commented Aug 18, 2020 • edited Loading

Edit

lesichkovm commented Aug 21, 2020 • edited Loading

empijei commented Aug 21, 2020

lesichkovm commented Aug 22, 2020 • edited Loading

empijei commented Aug 26, 2020

FiloSottile commented Aug 26, 2020

lesichkovm commented Jan 11, 2023

empijei commented Aug 4, 2020 •

edited

Loading

lesichkovm commented Aug 7, 2020 •

edited

Loading

empijei commented Aug 8, 2020 •

edited

Loading

lesichkovm commented Aug 16, 2020 •

edited

Loading

empijei commented Aug 18, 2020 •

edited

Loading

lesichkovm commented Aug 21, 2020 •

edited

Loading

lesichkovm commented Aug 22, 2020 •

edited

Loading