versus revision 6725 and as early as the 2010-11-02 release.
What steps will reproduce the problem?
1. create an xml.Parser on a document that has disallowed characters between start of
document and root element, or between start of document and XML declaration, or between
start of document and DOCTYPE declaration
2. exhaust the parser by calling _, err := parser.Token() until err == os.EOF
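The steps above can be sketched as a small helper. This is a minimal sketch, not the attached program: it assumes the present-day encoding/xml API (xml.NewDecoder and io.EOF) in place of the xml.Parser and os.EOF named in this report, and drainTokens is a hypothetical name. Whether a current decoder still shows the behavior described here is not claimed.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// drainTokens is a hypothetical helper for reproducing the report.
// It reads every token from doc and records each token's Go type,
// stopping at io.EOF or at the first error the decoder returns.
func drainTokens(doc string) ([]string, error) {
	d := xml.NewDecoder(strings.NewReader(doc))
	var kinds []string
	for {
		tok, err := d.Token()
		if err == io.EOF {
			return kinds, nil // whole document consumed without error
		}
		if err != nil {
			return kinds, err // tokens seen so far, plus the error
		}
		kinds = append(kinds, fmt.Sprintf("%T", tok))
	}
}
```

Calling drainTokens on a document with a leading 0x12 byte, or with stray text before the root element, shows whether the first Token() call fails or whether the stray bytes come back as a CharData token.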
What is the expected output?
The first call to Token() should return an error of some sort about the disallowed
characters, and the parser should not parse any further.
What do you see instead?
xml.Parser returns tokens all the way to EOF. The disallowed characters come back as the
first token, of type xml.CharData, followed by the rest of the document.
Which compiler are you using (5g, 6g, 8g, gccgo)?
6g
Which operating system are you using?
darwin
( uname -a
Darwin host-elided.local 10.4.0 Darwin Kernel Version 10.4.0: Fri Apr 23 18:28:53 PDT
2010; root:xnu-1504.7.4~1/RELEASE_I386 i386)
Which revision are you using? (hg identify)
68aae563fd33+ tip
Please provide any additional information below.
I am attaching as small a go program as I can think of with examples inside it.
This follows from a conversation with Russ Cox on golang-nuts (
http://groups.google.com/group/golang-nuts/browse_thread/thread/ddabf01fdbe57c9f# )
about xml.Parser and results of using it with the XML Conformance Suite.
I believe that there are two specific difficulties:
First, the initial call to RawToken() does not know that we are at the beginning of the
document, before the prolog. At line 430, we check whether the first byte read is not
'<', in which case we call p.text and create a CharData from the result. This if
block would need to handle the before-prolog case with some state (and the detection of
no longer being before the prolog would need to change as well...). Some character
sequences are permissible at this location, provided there isn't an XML declaration
(the DOCTYPE and root element can have whitespace, comments, and processing instructions
before them, it seems).
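As a rough illustration of the first difficulty, the check on character data seen before the root element could look like the sketch below. checkPrologText is a hypothetical helper, not code from the xml package. It assumes the parser already tracks that it is still before the root element, and it deliberately ignores the stricter case where nothing at all may precede an XML declaration.

```go
package main

import "fmt"

// checkPrologText is a hypothetical helper, not part of the xml package.
// It validates character data seen before the root element: only XML
// whitespace (space, tab, CR, LF) is accepted there. Comments and
// processing instructions are markup, so they never reach this check.
func checkPrologText(text []byte) error {
	for i, b := range text {
		switch b {
		case ' ', '\t', '\r', '\n':
			continue // whitespace is allowed before the root element
		default:
			return fmt.Errorf("xml: disallowed byte 0x%02x at offset %d before root element", b, i)
		}
	}
	return nil
}
```

With a check like this, RawToken() could return the error instead of wrapping the bytes in a CharData while the before-root state is set.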
Second, parser.text() accepts at least some byte sequences that we don't think it should
(my attached example has a single byte 0x12 at the beginning of the document, which
isn't in the XML Character Range). I haven't analyzed this any further.
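For the second difficulty, the XML Character Range can be tested with a small predicate. The sketch below follows the Char production of the XML 1.0 specification. isXMLChar is a hypothetical name, and where such a check would hook into parser.text() is left open.

```go
package main

// isXMLChar reports whether r is in the XML 1.0 Char production:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF].
// It is a hypothetical helper for illustration only.
func isXMLChar(r rune) bool {
	return r == 0x09 || r == 0x0A || r == 0x0D ||
		(r >= 0x20 && r <= 0xD7FF) ||
		(r >= 0xE000 && r <= 0xFFFD) ||
		(r >= 0x10000 && r <= 0x10FFFF)
}
```

Under this predicate the byte 0x12 from the attached example is rejected, while tab, newline, and carriage return are accepted.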
Respectfully submitted,
Nigel Kerr
by nigel.kerr: