A Lisp Based HTML Parser
Introduction/Simple Example
LHTML parse output format
Case mode notes
Parsing HTML comments
Parsing <SCRIPT> and <STYLE> tags
Parsing SGML <! tags
Parsing Illegal and Deprecated Tags
Default Attribute Values
Parsing Interleaved Character Formatting Tags
parse-html reference
methods
phtml-internal
The parse-html generic function processes HTML
input, returning a list of HTML tags, attributes, and text. Here is a simple example:
(parse-html "<HTML>
<HEAD>
<TITLE>Example HTML input</TITLE>
<BODY>
<P>Here is some text with a <B>bold</B> word<br>and a <A
HREF=\"help.html\">link</P>
</HTML>")
generates:
((:html (:head (:title "Example HTML input"))
(:body (:p "Here is some text with a " (:b "bold") "
word" :br "and a "
((:a :href "help.html") "link")))))
The output format is known as LHTML format; it is the same format that the
aserve htmlgen macro accepts.
LHTML format
LHTML is a list representation of HTML tags and content.
Each list member may be:
If excl:*current-case-mode* is :CASE-INSENSITIVE-UPPER, keyword package symbols will be
in upper case; otherwise, they will be in lower case.
HTML comments are represented use a :comment symbol. For example,
(parse-html "<!-- this is a comment-->")
--> ((:comment " this is a comment"))
HTML <SCRIPT> and <STYLE> tags
All <SCRIPT> and <STYLE> content is not parsed; it is returned as text
content.
For example,
(parse-html "<SCRIPT>this <B>will not</B> be
parsed</SCRIPT>")
--> ((:script "this <B>will not</B> be parsed"))
Since, some HTML pages contain special XML/SGML tags, non-comment tags
starting with '<!' are treated specially:
(parse-html "<!doctype this is some text>")
--> ((:!doctype " this is some text"))
There is plenty of illegal and deprecated HTML on the web that popular browsers
nonetheless successfully display. The parse-html parser is generous - it will not
raise an error condition upon encountering most input. In particular, it does not
maintain a list of legal HTML tags and will successfully parse nonsense input.
For example,
(parse-html "<this> <is> <some> <nonsense>
<input>")
--> ((:this (:is (:some (:nonsense :input)))))
In some situations, you may prefer a two-pass parse that results in a parse where
deep nesting related to unrecognized tags is minimized:
(let ((string "<this> <is> <some> <nonsense> </some>
<input>"))
(multiple-value-bind (res rogues)
(parse-html string
:collect-rogue-tags t)
(declare (ignorable
res))
(parse-html string
:no-body-tags rogues)))
--> (:this :is (:some (:nonsense)) :input)
See the :collect-rogue-tags and :no-body-tags argument
descriptions in the reference
section below for more information.
As per the HTML 4.0 specification, attributes without specified values are given a
lower case
string value that matches the attribute name.
For example,
(parse-html "<P here ARE some attributes>")
--> (((:p :here "here" :are "are" :some "some"
:attributes "attributes")))
Interleaved Character Formatting Tags
Existing HTML pages often have character format tags that are interleaved among
other tags. Such interleaving is removed in a manner consistent with the HTML 4.0
specification.
For example,
(parse-html "<P>Here is <B>bold text<P>that spans</B>two
paragraphs")
--> ((:p "Here is " (:b "bold text")) (:p (:b "that
spans") "two paragraphs"))
parse-html Reference
parse-html [Generic function]
Arguments: input-source &key callbacks callback-only
collect-rogue-tags
no-body-tags parse-entities
Returns LHTML output, as described above.
The callbacks argument, if non-nil, should be an association list. Each list member's
car (first) element specifies a keyword package symbol, and each list member's cdr (rest)
element specifies a function object or a symbol naming a function. The function should
expect one argument. The function will be invoked once for each time the HTML tag
corresponding to the specified keyword package symbol is encountered in the HTML input;
the
argument will be an LHTML list containing the tag, along with associated attributes and
content. The default callbacks argument value is nil.
The callback-only argument, if non-nil, directs parse-html to not generate a complete
LHTML
output. Instead, LHTML lists will only be generated when necessary as arguments for
functions
specified in the callbacks association list. This results in faster parser execution. The
default
callback-only argument value is nil.
The collect-rogue-tags argument, if non-nil, directs parse-html to return an additional
value,
a list containing any unrecognized tags closed by the end of input.
The no-body-tags argument, if non-nil, should be a list containing unknown tags that, if
encountered, will be treated as a tag with no body or content, and thus, no associated end
tag. Typically, the argument is a list or modified list resulting from an earlier
parse-html
execution with the :collect-rogue-tags argument specified as non-nil.
If the parse-entities argument is true then entities are converted to the character
they name. Thus for example the < entity is converted to the less than sign.
parse-html Methods
parse-html (p stream) &key callbacks callback-only
collect-rogue-tags
no-body-tags parse-entities
parse-html (str string) &key callbacks callback-only
collect-rogue-tags
no-body-tags parse-entities
parse-html (file t) &key callbacks callback-only
collect-rogue-tags
no-body-tags parse-entities
The t method assumes the argument is a pathname suitable
for use with the with-open-file macro.
phtml-internal [Function]
Arguments: stream read-sequence-func callback-only callbacks
collect-rogue-tags no-body-tags parse-entities
This function may be used when more control is needed for supplying
the HTML input. The read-sequence-func argument, if non-nil, should be a function
object or a symbol naming a function. When phtml-internal requires another buffer
of HTML input, it will invoke the read-sequence-func function with two arguments -
the first argument is an internal buffer character array and the second argument is
the phtml-internal stream argument. If read-sequence-fun is nil, phtml-internal
will invoke read-sequence to fill the buffer. The read-sequence-func function must
return the number of character array elements successfully stored in the buffer.