-
Notifications
You must be signed in to change notification settings - Fork 18k
x/net/html: add StripTags example #40345
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Can you please expand on the requirements you have and the problem you're trying to solve that would require the use of stripTags? (Also a duplicate of #22639) |
@empijei Please makes sure you read the first message carefully. We are not talking about specific issue at hand. Its about being able to use a function that should be part of the standard library, instead of relying on 3rd party implementation. |
Why do you think it should be part of the standard library? I don't think it should and I think it might be even dangerous to expose. If you ask me about implementing and exposing a sanitizer I might be on board with that, I am just not sure When you say "Please makes sure you read the first message carefully." I read all 18 words of the initial report very carefully and I see 0 reasons that support stripping tags as "absolutely required for every web related development task". Moreover the initial report you linked does not have that information. I have been working on the web for quite a while and never needed such a function. I might have needed a textual representation of a page (which is not achievable by simply stripping tags) or I might have needed a sanitizer (which, again, cannot be done by just stripping tags) but not once I just needed to strip tags and not do anything else. That said I might be missing the obvious use cases here, so if you want this in the standard library we would need to build a case and gather some info on why and when this would be useful, so can you please provide some use cases? I really need more information to propose a change the stdlib. EditIf you, for example, rely on that function to sanitize your input (e.g. for writing an email client and display text of emails without potential HTML injection) you might find yourself in trouble. Given, for example, this input: If what you need is a sanitizer please take a look at bluemonday or similar packages, or open a feature request to add a sanitizer to Go. To conclude: /cc @kele |
Sorry to disagree. Stipping HTML tags is used regularly in web development, and I find myself using it in absolutely every web project. Definitely there is a reason why its in the PHP standard library: The easiest case to make is to go to https://pkg.go.dev/search?page=1&q=strip+tags and check how many times people have implemented it, and will keep re-implementing it as its a must have function. When all it takes is to make a function that already exists in the library public. |
It's okay to disagree. Please provide some use cases, though. The argument of having others implemented such a function is not enough, as we still don't know why they have implemented it. They might have had perfectly valid reasons that simply don't fit |
If that is the case can you please provide one use case for it? Please, just one so that we know what we are talking about.
Thanks for bringing PHP to a security discussion, let's take a look at what exposing this function caused in the PHP ecosystem:
Let's now take a look at the Go ecosystem as you suggested. Searching for "go strip tags" on most search engines takes you to a stackoverflow thread that suggests to use an actual sanitizer, which is what I wrote initially we could decide to implement here. Thanks for pointing out all this data, because it looks like it is building a good case against exporting this function. We might decide to provide a sanitizer in the future, but Before I close this bug I would like to ask you if you have any actual example we can look at to be sure we didn't overlook any potential good uses of tag stripping for web application development. |
First, enough with the scare mongering @empijei. This is your third post where you just try to scare the community. Every function can be used in a wrong context depending on the programmers experience. This doesn't mean the function has a security flaw. Your presumption is you have untrusted input. If this is the case you will need extra validation and not only for XSS injections, but also SQL injections. But this is not the responsibility of the StripTags function. And most usages for StripTags are not for securing untrusted sources anyway. All StripTags does is remove <WORDS_SURROUNDED_WITH_ANGLE_BRACKETS> from a TEXT string. Period nothing more no less. Second. Use cases (non-exhaustive list), as I said its an everyday occurrence in the live of a web developer.
Third. Lets see what happens when this small and useful utility function is not a standard
|
I am just trying to showcase why adding such a function would be dangerous. I care about the security of the ecosystem.
Let's leave SQL out of this please. It has nothing to do with tag stripping.
For these you would then need to encode the result in HTML, so after calling
This doesn't work,
Again, a sanitizer would do this job. So, let's recap if I got this right, you are arguing this is a much needed feature because:
Let's work on this together. The
EditThis is more or less what StripTags would look like if you wanted to use the html package: func StripTags(htmlin io.Reader) (string, error) {
doc, err := html.Parse(htmlin)
if err != nil {
return "", err
}
skip := map[string]bool{
"script": true,
"style": true,
"textarea": true,
"title": true,
}
var sb strings.Builder
var f func(*html.Node)
f = func(n *html.Node) {
if n.Type == html.TextNode {
if n.Parent.Type == html.ElementNode && !skip[strings.ToLower(n.Parent.Data)] {
sb.WriteString(n.Data)
}
}
for c := n.FirstChild; c != nil; c = c.NextSibling {
f(c)
}
}
f(doc)
return sb.String(), nil
} And this is a couple of test cases I wrote: func TestStripTags(t *testing.T) {
var tests = []struct {
name, in, want string
}{
{
name: "stripTags example",
in: `<b>Hi!</b> <script>...</script>`,
want: `Hi! `,
},
{
name: "stripTags example",
in: `<div><p>foo <b>bold</b> bar</p><div><style>ignored</style>`,
want: `foo bold bar`,
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
got, err := StripTags(strings.NewReader(tt.in))
if err != nil {
t.Fatal(err)
}
if got != tt.want {
t.Errorf("got: %v, want: %v", got, tt.want)
}
})
}
} Is this what you are asking to implement/export? |
Hi @empijei, thank you for taking the time to think the code over:
About actual implementation not sure I am the right person. I have seen this previously implemented here: The two approaches are much different, not sure which is correct, and what are the pros and cons of both. I personally like yours as is more readable, while the other seems more convoluted. |
A very nice thing about Go is that any string can be used as an StripTagsString(htmlin string)(string, error) {
// This is the same StripTags that I wrote in the previous comment
return StripTags(strings.NewReader(htmlin))
}
The skip map is necessary and a similar thing is used in the go/src/html/template/context.go Lines 227 to 239 in 2cd2ff6
It would work, but if you want to have the matching even more relaxed you could change the code of `f`` to be: f = func(n *html.Node) {
if n.Type == html.TextNode {
if n.Parent.Type == html.ElementNode && skip[strings.ToLower(n.Parent.Data)] {
return
}
sb.WriteString(n.Data)
}
for c := n.FirstChild; c != nil; c = c.NextSibling {
f(c)
}
}
This code is an overly complicated implementation that basically copy-pasted the entire source of the template package just to export I would honestly think that the template package is not a good place to put the |
@empijei I agree with you the "template" package is not the correct place for it to be The first place I would look for a StripTag function would be the "strings" package, simply because the HTML will be a string. Yes, an example of stripping tags would be useful. Also a warning that this is not a "safe sanitizing" function would be required. BTW. How difficult would be to implement an allowable tags option similar to this one here: https://www.php.net/manual/en/function.strip-tags.php, probably with a signature simlar to this: |
I could add this to the "html" docs, since we now have an example to extract links from a page. @bradfitz what do you think about adding an example to extract text from html to the x/ html package?
It would not be difficult but since at that point whatever is produced will end up being rendered as HTML I would really prefer if people used a sanitizer instead (there would be no safe way to use a StripTag function that leaves some tags untouched, it would require an allowlist of per-tag attributes, and at that point you'd be rewriting a full sanitizer). |
I agree with @empijei that html/template is a security focused package that should not provide tools for developers to shoot themselves in the foot. Sounds like this can live as an example in golang.org/x/net/html, so different requirements can be handled by modifying the code. |
Too old, closing |
Stripping HTML tags is absolutely required for every web related development task.
This was previously raised in: #5884
The text was updated successfully, but these errors were encountered: