Navigation Menu

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/net/html: add StripTags example #40345

Closed
lesichkovm opened this issue Jul 22, 2020 · 14 comments
Closed

x/net/html: add StripTags example #40345

lesichkovm opened this issue Jul 22, 2020 · 14 comments
Labels
Documentation FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@lesichkovm
Copy link

Stripping HTML tags is absolutely required for every web related development task.

This was previously raised in: #5884

@ALTree ALTree changed the title html/template: "stripTags" to be made public html/template: export stripTags Jul 22, 2020
@ALTree ALTree added FeatureRequest NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. labels Jul 22, 2020
@empijei
Copy link
Contributor

empijei commented Aug 4, 2020

Can you please expand on the requirements you have and the problem you're trying to solve that would require the use of stripTags?

(Also a duplicate of #22639)

@lesichkovm
Copy link
Author

lesichkovm commented Aug 7, 2020

@empijei Please makes sure you read the first message carefully.

We are not talking about specific issue at hand.

Its about being able to use a function that should be part of the standard library, instead of relying on 3rd party implementation.

@empijei
Copy link
Contributor

empijei commented Aug 8, 2020

Why do you think it should be part of the standard library? I don't think it should and I think it might be even dangerous to expose.

If you ask me about implementing and exposing a sanitizer I might be on board with that, I am just not sure stripTags would be the best thing to have in our public API.

When you say "Please makes sure you read the first message carefully." I read all 18 words of the initial report very carefully and I see 0 reasons that support stripping tags as "absolutely required for every web related development task". Moreover the initial report you linked does not have that information.

I have been working on the web for quite a while and never needed such a function. I might have needed a textual representation of a page (which is not achievable by simply stripping tags) or I might have needed a sanitizer (which, again, cannot be done by just stripping tags) but not once I just needed to strip tags and not do anything else.

That said I might be missing the obvious use cases here, so if you want this in the standard library we would need to build a case and gather some info on why and when this would be useful, so can you please provide some use cases? I really need more information to propose a change the stdlib.

Edit

If you, for example, rely on that function to sanitize your input (e.g. for writing an email client and display text of emails without potential HTML injection) you might find yourself in trouble.

Given, for example, this input: Foo<<template>script</template>>alert(1)<<svg>/script> the output is Foo<script>alert(1)</script> which would execute code if embedded in a page, so please try to understand why I am not so keen to expose this function: it is quite easy to misuse.

If what you need is a sanitizer please take a look at bluemonday or similar packages, or open a feature request to add a sanitizer to Go.

To conclude: stripTags strips tags that already are in the input that is passed to it, but it does not attempt to remove tags that might be in the output after the function is called, which makes it a pretty scary footgun.

/cc @kele

@lesichkovm
Copy link
Author

lesichkovm commented Aug 16, 2020

Sorry to disagree. Stipping HTML tags is used regularly in web development, and I find myself using it in absolutely every web project.

Definitely there is a reason why its in the PHP standard library:
https://www.php.net/manual/en/function.strip-tags.php

The easiest case to make is to go to https://pkg.go.dev/search?page=1&q=strip+tags and check how many times people have implemented it, and will keep re-implementing it as its a must have function.

When all it takes is to make a function that already exists in the library public.

@kele
Copy link

kele commented Aug 17, 2020

It's okay to disagree.

Please provide some use cases, though. The argument of having others implemented such a function is not enough, as we still don't know why they have implemented it. They might have had perfectly valid reasons that simply don't fit html/template. Could you help us with some examples of usage?

@empijei
Copy link
Contributor

empijei commented Aug 17, 2020

Stipping HTML tags is used regularly in web development, and I find myself using it in absolutely every web project.

If that is the case can you please provide one use case for it? Please, just one so that we know what we are talking about.
Even the standard library never uses that function directly: there are only two calls to it and both of them are used to pass the result to htmlReplacer. Both of the results are then used for the contextual autoescaper in the template package. I still think there is no good way to use this function other than in a precise context so I am asking you to please provide one valid use case for this. If you find yourself using it so much you might have vulnerabilities in your code, so you might want to have your code reviewed by a security expert.

Definitely there is a reason why its in the PHP standard library:

Thanks for bringing PHP to a security discussion, let's take a look at what exposing this function caused in the PHP ecosystem:

  • People are asking if it is enough to protect their site (spoiler, it is not). The first answer correctly says that you should then take the output and pass it to an escaping funciton that needs to be correct for the context you are going to put the data in. This is what the html/template package does for you.
  • Let's look at the doc you linked, note section, which is quite telling on the dangers that come with this funciton:
    image
  • So one might ask, has this ever caused real vulnerabilities in practice? Judging by all mentions from several sources [1][2][3][4][5] I would be inclined to say it did, and it even allows to bypass protections made exactly for that
  • PHP and ruby on rails had CVEs assigned because they were not considering non-printable or null bytes in the strip-tags and strip_tags helpers. Had they exposed actual escapers this wouldn't have been an issue since these characters would have been encoded, but that was not the case

The easiest case to make is to go to https://pkg.go.dev/search?page=1&q=strip+tags and check how many times people have implemented it, and will keep re-implementing it as its a must have function.

Let's now take a look at the Go ecosystem as you suggested. Searching for "go strip tags" on most search engines takes you to a stackoverflow thread that suggests to use an actual sanitizer, which is what I wrote initially we could decide to implement here.
The second result is a package that does exactly that and that has the same warning I mentioned in the description:
image
This package has 40 dependents according to pkg.go.dev, which doesn't seem like a lot considering how important you say this functionality is, but I decided to take a look anyways. All projects I could find on github that seem to be stripping tags are personal projects, CLI utils, crawlers or email services. The only use I could find of it in a web app or framework is in gobuffalo/docs and it uses it in conjunction with bluemonday.

Thanks for pointing out all this data, because it looks like it is building a good case against exporting this function. We might decide to provide a sanitizer in the future, but stripTags looks like a no-no to me. /cc @FiloSottile for one last security opinion here.

Before I close this bug I would like to ask you if you have any actual example we can look at to be sure we didn't overlook any potential good uses of tag stripping for web application development.

@lesichkovm
Copy link
Author

First, enough with the scare mongering @empijei. This is your third post where you just try to scare the community. Every function can be used in a wrong context depending on the programmers experience. This doesn't mean the function has a security flaw. Your presumption is you have untrusted input. If this is the case you will need extra validation and not only for XSS injections, but also SQL injections. But this is not the responsibility of the StripTags function. And most usages for StripTags are not for securing untrusted sources anyway. All StripTags does is remove <WORDS_SURROUNDED_WITH_ANGLE_BRACKETS> from a TEXT string. Period nothing more no less.

Second. Use cases (non-exhaustive list), as I said its an everyday occurrence in the live of a web developer.

  • Convert HTML to text for news articles, blog posts, books in HTML format, RSS feeds, etc.
  • Generating summaries for bulettins, news letters
  • Counting real text characters in an HTML block
  • Convert HTML emails to TEXT email content. Especially for mail clients not supporting HTML.
  • Cleanup HTML from rich text editors, where browsers add unwanted tags
  • Convert HTML output to text for text readers
  • Creating audio from subtitles in HTML format
  • Converting HTML formatted books to text for better readability
  • Converting HTML pages to text, for indexing for search engines (Solr, ElasticSearch, etc)

Third. Lets see what happens when this small and useful utility function is not a standard
A. Everyone re-invents it

@empijei
Copy link
Contributor

empijei commented Aug 18, 2020

This is your third post where you just try to scare the community

I am just trying to showcase why adding such a function would be dangerous. I care about the security of the ecosystem.

If this is the case you will need extra validation and not only for XSS injections, but also SQL injections

Let's leave SQL out of this please. It has nothing to do with tag stripping.

  • Convert HTML to text for news articles, blog posts, books in HTML format, RSS feeds, etc.
  • Converting HTML pages to text, for indexing for search engines (Solr, ElasticSearch, etc)
  • Generating summaries for bulettins, news letters
  • Convert HTML emails to TEXT email content. Especially for mail clients not supporting HTML.

For these you would then need to encode the result in HTML, so after calling stripTags you would have to call an html replacer, like html/template does. This is the exact concept of "putting untrusted input in your HTML page". If instead the output goes in something that is not rendering HTML (e.g. an RSS feed, an email client or a CLI application) this is exactly what I said: "CLI utils, crawlers or email services" (see my previous message).
That said email, CLI and non-web-related technologies are not strictly web development and we might want to add those functions where they belong, like an HTML manipulation library instead of a templating one. I am just arguing this function might not belong to this package.

  • Counting real text characters in an HTML block
  • Convert HTML output to text for text readers
  • Creating audio from subtitles in HTML format
  • Converting HTML formatted books to text for better readability

This doesn't work, stripTags would count characters in hidden blocks too and would ignore positioning: you might end up with a completely different text than the one in the page that would be displayed by rendering the HTML.

  • Cleanup HTML from rich text editors, where browsers add unwanted tags

Again, a sanitizer would do this job.

So, let's recap if I got this right, you are arguing this is a much needed feature because:

  • HTML-to-text for email, crawlling or RSS-like tools are a common enough use case
  • Providing an inherently unsafe function is okay as long as it doesn't get confused for something security related
  • Not providing this feature would make people write their own, which might even be worse
  • Using a publicly available sanitizer that most of the go ecosystem relies on is a security issue

Let's work on this together.

The stripTags function is not exactly related to generating HTML or producing web pages, as you said most use cases are RSS, crawling, ebooks, audio and such. Luckily we already have a library dedicated to HTML processing: /x/net/html. Would it be okay for you to provide either of the following:

  • an example of how to use this library to strip tags
  • a function that strips tags from the input
    ?

Edit

This is more or less what StripTags would look like if you wanted to use the html package:

func StripTags(htmlin io.Reader) (string, error) {
	doc, err := html.Parse(htmlin)
	if err != nil {
		return "", err
	}
	skip := map[string]bool{
		"script":   true,
		"style":    true,
		"textarea": true,
		"title":    true,
	}
	var sb strings.Builder
	var f func(*html.Node)
	f = func(n *html.Node) {
		if n.Type == html.TextNode {
			if n.Parent.Type == html.ElementNode && !skip[strings.ToLower(n.Parent.Data)] {
				sb.WriteString(n.Data)
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			f(c)
		}
	}
	f(doc)
	return sb.String(), nil
}

And this is a couple of test cases I wrote:

func TestStripTags(t *testing.T) {
	var tests = []struct {
		name, in, want string
	}{
		{
			name: "stripTags example",
			in:   `<b>Hi&excl;</b> <script>...</script>`,
			want: `Hi! `,
		},
		{
			name: "stripTags example",
			in:   `<div><p>foo <b>bold</b> bar</p><div><style>ignored</style>`,
			want: `foo bold bar`,
		},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			got, err := StripTags(strings.NewReader(tt.in))
			if err != nil {
				t.Fatal(err)
			}
			if got != tt.want {
				t.Errorf("got: %v, want: %v", got, tt.want)
			}
		})
	}
}

Is this what you are asking to implement/export?

@lesichkovm
Copy link
Author

lesichkovm commented Aug 21, 2020

Hi @empijei, thank you for taking the time to think the code over:

  1. Its similar but I would expect to pass a string as parameter instead of io.Reader, The most HTML one would receive would be a string (i.e. database text field), converting it to io.Reader may be a step over what one would like to do, but is not a deal breaker.

  2. Initially I did not like the skip map, thinking its an overkill for such a simple function. But the more I thought about it the more I liked it, and now I think its really good. I would probably add the template tag in the map too.

  3. About using html.Parse, I quite like how you approached it. I am sure it would be good for properly validated nested HTML. What about not properly nested tags "[b]....[u]....[/b]...[/u]". Would html.Parse cope with it?

About actual implementation not sure I am the right person. I have seen this previously implemented here:
https://storage.googleapis.com/go-attachment/5884/7/strip.go

The two approaches are much different, not sure which is correct, and what are the pros and cons of both. I personally like yours as is more readable, while the other seems more convoluted.

@empijei
Copy link
Contributor

empijei commented Aug 21, 2020

Its similar but I would expect to pass a string as parameter instead of io.Reader, The most HTML one would receive would be a string (i.e. database text field), converting it to io.Reader may be a step over what one would like to do, but is not a deal breaker.

A very nice thing about Go is that any string can be used as an io.Reader, so if we wanted to build a wrapper function that takes a string it would be

StripTagsString(htmlin string)(string, error) {
  // This is the same StripTags that I wrote in the previous comment
  return StripTags(strings.NewReader(htmlin))
}

Initially I did not like the skip map, thinking its an overkill for such a simple function. But the more I thought about it the more I liked it, and now I think its really good. I would probably add the template tag in the map too.

The skip map is necessary and a similar thing is used in the stripTag in the template package that you were asking to have exported:

const (
// elementNone occurs outside a special tag or special element body.
elementNone element = iota
// elementScript corresponds to the raw text <script> element
// with JS MIME type or no type attribute.
elementScript
// elementStyle corresponds to the raw text <style> element.
elementStyle
// elementTextarea corresponds to the RCDATA <textarea> element.
elementTextarea
// elementTitle corresponds to the RCDATA <title> element.
elementTitle
)

About using html.Parse, I quite like how you approached it. I am sure it would be good for properly validated nested HTML. What about not properly nested tags "[b]....[u]....[/b]...[/u]". Would html.Parse cope with it?

It would work, but if you want to have the matching even more relaxed you could change the code of `f`` to be:

	f = func(n *html.Node) {
		if n.Type == html.TextNode {
			if n.Parent.Type == html.ElementNode && skip[strings.ToLower(n.Parent.Data)] {
				return
			}
			sb.WriteString(n.Data)
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			f(c)
		}
	}

About actual implementation not sure I am the right person. I have seen this previously implemented here:
https://storage.googleapis.com/go-attachment/5884/7/strip.go

This code is an overly complicated implementation that basically copy-pasted the entire source of the template package just to export stripTag. The issue is that the implementation of stripTag in this case is so complicated because it needs to be in the context of a templating package, which is not what you seem to be asking for here.

I would honestly think that the template package is not a good place to put the stripTags function in. The html package seems like a better place. Maybe we could add an example there that explains how to extract text from the given HTML? (which is what I did in my example above).
Would you find that useful?

@lesichkovm
Copy link
Author

lesichkovm commented Aug 22, 2020

@empijei I agree with you the "template" package is not the correct place for it to be

The first place I would look for a StripTag function would be the "strings" package, simply because the HTML will be a string.
The second place I would look into will be the "html" package, as this would be an intuitive place also, where such a function may reside.

Yes, an example of stripping tags would be useful. Also a warning that this is not a "safe sanitizing" function would be required.

BTW. How difficult would be to implement an allowable tags option similar to this one here: https://www.php.net/manual/en/function.strip-tags.php, probably with a signature simlar to this:
StripTags("[span]Hello[/span] [b][u]World[/u][/b]", "b,u") - where [b] and [u] tags will be left over
For instance some older HTML renderers support only a limited subset of tags.

@empijei
Copy link
Contributor

empijei commented Aug 26, 2020

Yes, an example of stripping tags would be useful. Also a warning that this is not a "safe sanitizing" function would be required.

I could add this to the "html" docs, since we now have an example to extract links from a page. @bradfitz what do you think about adding an example to extract text from html to the x/ html package?

How difficult would be to implement an allowable tags option similar to this one here:

It would not be difficult but since at that point whatever is produced will end up being rendered as HTML I would really prefer if people used a sanitizer instead (there would be no safe way to use a StripTag function that leaves some tags untouched, it would require an allowlist of per-tag attributes, and at that point you'd be rewriting a full sanitizer).

@FiloSottile
Copy link
Contributor

I agree with @empijei that html/template is a security focused package that should not provide tools for developers to shoot themselves in the foot. Sounds like this can live as an example in golang.org/x/net/html, so different requirements can be handled by modifying the code.

@FiloSottile FiloSottile changed the title html/template: export stripTags golang.org/x/net/html: add StripTags example Aug 26, 2020
@FiloSottile FiloSottile added Documentation NeedsFix The path to resolution is known, but the work has not been done. and removed FeatureRequest NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. labels Aug 26, 2020
@FiloSottile FiloSottile added this to the Unreleased milestone Aug 26, 2020
@seankhliao seankhliao changed the title golang.org/x/net/html: add StripTags example x/net/html: add StripTags example Sep 8, 2021
@lesichkovm
Copy link
Author

Too old, closing

@lesichkovm lesichkovm closed this as not planned Won't fix, can't repro, duplicate, stale Jan 11, 2023
@golang golang locked and limited conversation to collaborators Jan 11, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Documentation FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

6 participants