encoding/xml: Parser accepts invalid prolog, parser.text may accept disallowed characters #1259

gopherbot · 2010-11-06T02:31:42Z

by nigel.kerr:

Observed against revision 6725, and as early as the 2010-11-02 release.

What steps will reproduce the problem?
1. create an xml.Parser on a document that has disallowed characters between start of
document and root element, or between start of document and XML declaration, or between
start of document and DOCTYPE declaration
2. exhaust the parser with _, err := parser.Token() until err == os.EOF
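
The two steps above can be sketched roughly as follows. This is only an illustration, not the attached badprolog.go: it assumes the current encoding/xml API, where xml.Parser has become xml.Decoder and os.EOF has become io.EOF, and the helper name drain and the sample inputs are hypothetical.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// drain is a hypothetical helper. It reads tokens until the decoder reports
// end of input or an error, and returns the Go type name of each token seen
// together with the first non-EOF error (nil if the input was exhausted).
func drain(doc string) ([]string, error) {
	d := xml.NewDecoder(strings.NewReader(doc))
	var kinds []string
	for {
		tok, err := d.Token()
		if err == io.EOF {
			return kinds, nil
		}
		if err != nil {
			return kinds, err
		}
		kinds = append(kinds, fmt.Sprintf("%T", tok))
	}
}

func main() {
	// Illustrative inputs only: a well-formed document, text before the
	// root element, and a single 0x12 byte before the root element.
	for _, doc := range []string{"<a></a>", "junk<a></a>", "\x12<a></a>"} {
		kinds, err := drain(doc)
		fmt.Printf("%q: tokens=%v err=%v\n", doc, kinds, err)
	}
}
```

What is printed for the second and third inputs depends on the Go release in use, so no output is shown here.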

What is the expected output?

The first call to Token() should return an error of some sort about the disallowed
characters, and parsing should not continue.

What do you see instead?

xml.Parser returns tokens all the way to EOF.  The disallowed characters come back as the
first token, of type xml.CharData, followed by the rest of the document.


Which compiler are you using (5g, 6g, 8g, gccgo)?

6g


Which operating system are you using?

darwin
( uname -a
Darwin host-elided.local 10.4.0 Darwin Kernel Version 10.4.0: Fri Apr 23 18:28:53 PDT
2010; root:xnu-1504.7.4~1/RELEASE_I386 i386)


Which revision are you using?  (hg identify)

68aae563fd33+ tip


Please provide any additional information below.

I am attaching as small a Go program as I can think of, with examples inside it.

This came out of a conversation with Russ Cox on golang-nuts (
http://groups.google.com/group/golang-nuts/browse_thread/thread/ddabf01fdbe57c9f# )
about xml.Parser and the results of running it against the XML Conformance Suite.

I believe that there are two specific difficulties: 

First, the initial call to RawToken() does not know that we are at the beginning of the
document, before the prolog.  At line 430, we check whether the first byte read is
something other than '<', and if so we call p.text and create a CharData from the result.
This if block would need to handle the before-prolog case using some state (and whatever
detects that we are no longer before the prolog would need to update that state...).
Some character sequences are permissible at this location, provided there isn't an XML
declaration (it seems that DOCTYPE and the root element can have whitespace, comments and
processing instructions before them).
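
As a rough illustration of the before-prolog state described above, here is a sketch that applies the check from outside the parser rather than inside RawToken(). It assumes the current encoding/xml API (xml.Decoder, io.EOF); checkProlog and its error messages are hypothetical, and this is not the change that was eventually made to the package.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"errors"
	"fmt"
	"io"
	"strings"
)

// checkProlog is a hypothetical helper. It reads raw tokens up to the first
// start tag and rejects content that is not allowed before the root element:
// non-whitespace character data, or an XML declaration that is not the very
// first token. Comments, other processing instructions and directives
// (such as DOCTYPE) are let through.
func checkProlog(doc string) error {
	d := xml.NewDecoder(strings.NewReader(doc))
	first := true // true until the first token has been consumed
	for {
		tok, err := d.RawToken()
		if err == io.EOF {
			return errors.New("no root element")
		}
		if err != nil {
			return err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			return nil // reached the root element, prolog is over
		case xml.CharData:
			// Only space, tab, CR and LF count as XML whitespace.
			if len(bytes.Trim(t, " \t\r\n")) != 0 {
				return fmt.Errorf("character data %q before root element", []byte(t))
			}
		case xml.ProcInst:
			if t.Target == "xml" && !first {
				return errors.New("XML declaration is not at the start of the document")
			}
		}
		first = false
	}
}

func main() {
	for _, doc := range []string{"<!-- c --><a/>", "junk<a/>", " <?xml version=\"1.0\"?><a/>"} {
		fmt.Printf("%q: %v\n", doc, checkProlog(doc))
	}
}
```

Keeping the flag outside the decoder avoids touching parser internals, at the cost of not stopping the parser itself from handing out the offending CharData.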

Second, parser.text() accepts at least some byte sequences that we don't think it should
(my attached example has a single byte 0x12 at the beginning of the document, which
isn't in the XML Character Range).  I haven't analyzed this at all beyond that.
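
For reference, the character range in question is the Char production of XML 1.0: #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF. A minimal sketch of such a check follows; isXMLChar is a hypothetical name, not a function taken from the xml package.

```go
package main

import "fmt"

// isXMLChar reports whether r is allowed by the XML 1.0 Char production.
// The name is hypothetical; it only illustrates the range check.
func isXMLChar(r rune) bool {
	return r == 0x09 || r == 0x0A || r == 0x0D ||
		(r >= 0x20 && r <= 0xD7FF) ||
		(r >= 0xE000 && r <= 0xFFFD) ||
		(r >= 0x10000 && r <= 0x10FFFF)
}

func main() {
	// 0x12 is a control character outside the range; tab and 'a' are inside.
	fmt.Println(isXMLChar(0x12), isXMLChar('\t'), isXMLChar('a')) // prints "false true true"
}
```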

Respectfully submitted,
Nigel Kerr

Attachments:

badprolog.go (471 bytes)

robpike · 2010-11-06T16:00:46Z

Comment 1:

Owner changed to r...@golang.org.

Status changed to Accepted.

rsc · 2010-12-09T19:51:10Z

Comment 2:

This issue was closed by revision 27f2d5c.

Status changed to Fixed.

gopherbot added fixed labels Dec 9, 2010

gopherbot assigned rsc Dec 9, 2010

mikioh changed the title ~~xml.Parser accepts invalid prolog, parser.text may accept disallowed characters~~ encoding/xml: Parser accepts invalid prolog, parser.text may accept disallowed characters Jan 9, 2015

golang locked and limited conversation to collaborators Jun 24, 2016

gopherbot added the FrozenDueToAge label Jun 24, 2016

rsc removed their assignment Jun 22, 2022

This issue was closed.
