Files · a49b8b9875fd43774e96ed81ce1316c32fb48fa0 · go / golang

html: rewrite the tokenizer to be more consistent. · a49b8b98

Nigel Tao authored Oct 13, 2011

Previously, the tokenizer made two passes per token. The first pass
established the token boundary. The second pass picked out the tag name
and attributes inside that boundary. This was problematic when the two
passes disagreed. For example, "<p id=can't><p id=won't>" caused an
infinite loop because the first pass skipped everything inside the
single quotes, and recognized only one token, but the second pass never
got past the first '>'.
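
The disagreement can be reproduced with a small sketch. This is not the original tokenizer code: boundaryPass and attrPass are hypothetical, heavily simplified stand-ins for the two passes, written only to show how the two rules arrive at different end points for the same input.

```go
package main

import "fmt"

// boundaryPass is a hypothetical stand-in for the old first pass: it looks
// for the closing '>' but treats everything between a pair of matching
// quotes as opaque.
func boundaryPass(s string) int {
	var quote byte
	for i := 0; i < len(s); i++ {
		c := s[i]
		switch {
		case quote != 0:
			if c == quote {
				quote = 0 // closing quote found
			}
		case c == '\'' || c == '"':
			quote = c // opening quote: skip until it is matched
		case c == '>':
			return i + 1
		}
	}
	return len(s)
}

// attrPass is a hypothetical stand-in for the old second pass: it stops at
// the first '>' regardless of any quote characters before it.
func attrPass(s string) int {
	for i := 0; i < len(s); i++ {
		if s[i] == '>' {
			return i + 1
		}
	}
	return len(s)
}

func main() {
	in := `<p id=can't><p id=won't>`
	// The two passes report different end offsets for the first token.
	fmt.Println("boundary pass ends at", boundaryPass(in))
	fmt.Println("attribute pass ends at", attrPass(in))
}
```

Under these simplified rules, the first pass pairs the apostrophe in "can't" with the one in "won't" and treats the whole input as a single token, while the second pass stops at the end of the first tag.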

This change rewrites the tokenizer to use one pass, accumulating the
boundary points of token text, tag names, attribute keys and attribute
values as it looks for the token endpoint.
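
A minimal sketch of the one-pass idea follows. The span, attr and startTag types and the scanStartTag function are hypothetical names chosen for illustration, not the identifiers used in the html package. The sketch only handles start tags, and it treats a trailing "/" as an ordinary attribute key.

```go
package main

import "fmt"

// span marks a half-open byte range [start, end) in the input buffer.
type span struct{ start, end int }

// attr holds the key and value spans of one attribute.
type attr struct{ key, val span }

// startTag is a hypothetical result type. Every field is an offset into the
// original buffer, so nothing is copied while scanning.
type startTag struct {
	name  span
	attrs []attr
	end   int // index just past the closing '>'
}

func isSpace(c byte) bool {
	return c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f'
}

// scanStartTag reads one start tag beginning at buf[pos] == '<' in a single
// left-to-right pass, recording boundary points as it looks for the end of
// the token. It reports false if the tag is not terminated.
func scanStartTag(buf []byte, pos int) (startTag, bool) {
	var t startTag
	i := pos + 1
	t.name.start = i
	for i < len(buf) && !isSpace(buf[i]) && buf[i] != '>' {
		i++
	}
	t.name.end = i
	for {
		for i < len(buf) && isSpace(buf[i]) {
			i++
		}
		if i >= len(buf) {
			return t, false
		}
		if buf[i] == '>' {
			t.end = i + 1
			return t, true
		}
		var a attr
		a.key.start = i
		for i < len(buf) && !isSpace(buf[i]) && buf[i] != '=' && buf[i] != '>' {
			i++
		}
		a.key.end = i
		a.val = span{i, i} // empty value unless '=' follows
		if i < len(buf) && buf[i] == '=' {
			i++
			if i < len(buf) && (buf[i] == '"' || buf[i] == '\'') {
				// Quoted value: runs to the matching quote.
				q := buf[i]
				i++
				a.val.start = i
				for i < len(buf) && buf[i] != q {
					i++
				}
				a.val.end = i
				if i < len(buf) {
					i++ // skip the closing quote
				}
			} else {
				// Unquoted value: runs to whitespace or '>'.
				a.val.start = i
				for i < len(buf) && !isSpace(buf[i]) && buf[i] != '>' {
					i++
				}
				a.val.end = i
			}
		}
		t.attrs = append(t.attrs, a)
	}
}

func main() {
	buf := []byte(`<p id=can't><p id=won't>`)
	for pos := 0; pos < len(buf); {
		tag, ok := scanStartTag(buf, pos)
		if !ok {
			break
		}
		fmt.Printf("tag %q", buf[tag.name.start:tag.name.end])
		for _, a := range tag.attrs {
			fmt.Printf(" %q=%q", buf[a.key.start:a.key.end], buf[a.val.start:a.val.end])
		}
		fmt.Println()
		pos = tag.end
	}
}
```

Because the same loop both finds the end of the token and records the spans, the boundary and the attribute positions cannot disagree, and each call advances past at least one byte.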

It should still be reasonably efficient: text, names, keys and values
are not lower-cased or unescaped (and converted from []byte to string)
until asked for.
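
The deferred conversion can be sketched as follows. The lazyToken type and its methods are hypothetical, and the sketch uses strings.ToLower and html.UnescapeString from today's standard library rather than whatever helpers the tokenizer used at the time of this commit.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// span marks a half-open byte range [start, end) in the input buffer.
type span struct{ start, end int }

// lazyToken is a hypothetical token that stores only offsets into buf.
// No string is allocated until an accessor is called.
type lazyToken struct {
	buf  []byte
	name span
	val  span
}

// TagName converts the name span to a lower-cased string on demand.
func (t lazyToken) TagName() string {
	return strings.ToLower(string(t.buf[t.name.start:t.name.end]))
}

// AttrVal converts the value span to a string and unescapes entities
// on demand.
func (t lazyToken) AttrVal() string {
	return html.UnescapeString(string(t.buf[t.val.start:t.val.end]))
}

func main() {
	buf := []byte(`<P title="a &amp; b">`)
	tok := lazyToken{buf: buf, name: span{1, 2}, val: span{10, 19}}
	// The conversions run only here, when the caller asks for them.
	fmt.Println(tok.TagName(), tok.AttrVal())
}
```

A caller that only counts tokens, or only inspects tag names, never pays for unescaping attribute values it does not read.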

One of the token_test test cases was fixed to be consistent with
html5lib. Three more test cases were temporarily disabled, and will be
re-enabled in a follow-up CL. All the parse_test test cases pass.

R=andybalholm, gri
CC=golang-dev
https://golang.org/cl/5244061


Name
doc
include
lib
misc
src
test
.hgignore
.hgtags
AUTHORS
CONTRIBUTORS
LICENSE
PATENTS
README
favicon.ico
robots.txt
