Chapter 7. HTML Processing with Tokens
Regular expressions are powerful, but they're a
painfully low-level way of dealing with HTML. You're
forced to worry about spaces and newlines, single and double quotes,
HTML comments, and a lot more. The next step up from a regular
expression is an HTML tokenizer. In this chapter, we'll use
HTML::TokeParser to extract information from HTML files. Using these
techniques, you can extract information from any HTML file, and never
again have to worry about character-level trivia of HTML markup.
7.1. HTML as Tokens
Your experience with HTML code probably involves seeing raw text such
as this:
<p>Dear Diary,
<br>I'm gonna be a superstar, because I'm learning to play
the <a href="http://MyBalalaika.com">balalaika</a> &amp; the <a
href='http://MyBazouki.com'>bazouki</a>!!!
The HTML::TokeParser module divides the HTML into units called
tokens, the basic units of parsing. The above
source code is parsed as this series of tokens:
- start-tag token: p, with no attributes
- text token: "Dear Diary,\n"
- start-tag token: br, with no attributes
- text token: "I'm gonna be a superstar, because I'm learning to play\nthe "
- start-tag token: a, with attribute href whose value is http://MyBalalaika.com
- text token: "balalaika"
- end-tag token: a
- text token: " &amp; the ", which means " & the "
- start-tag token: a, with attribute href whose value is http://MyBazouki.com
- text token: "bazouki"
- end-tag token: a
- text token: "!!!\n"
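To make the token list concrete, here is a minimal sketch that feeds the sample HTML to HTML::TokeParser and prints one line per token. It assumes the HTML::Parser distribution (which provides HTML::TokeParser) is installed. The helper name describe_tokens is made up for this illustration and is not part of the module. Comment and declaration tokens are ignored because the sample contains none.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# The sample HTML from above, with the ampersand written as &amp;.
my $html = <<'END_HTML';
<p>Dear Diary,
<br>I'm gonna be a superstar, because I'm learning to play
the <a href="http://MyBalalaika.com">balalaika</a> &amp; the <a
href='http://MyBazouki.com'>bazouki</a>!!!
END_HTML

# describe_tokens() is a hypothetical helper for this sketch only.
# It returns one human-readable line per start-tag, end-tag, or text token.
sub describe_tokens {
    my ($source) = @_;
    my $stream = HTML::TokeParser->new(\$source)
        or die "Couldn't create a token stream";
    my @lines;
    while (my $token = $stream->get_token) {
        my $type = $token->[0];
        if ($type eq 'S') {
            # Start-tag token: ['S', $tag, \%attr, \@attrseq, $source_text]
            my ($tag, $attr, $attrseq) = @{$token}[1, 2, 3];
            my $attrs = join ' ', map { "$_=$attr->{$_}" } @$attrseq;
            push @lines, "start-tag: $tag" . (length($attrs) ? " ($attrs)" : '');
        }
        elsif ($type eq 'E') {
            # End-tag token: ['E', $tag, $source_text]
            push @lines, "end-tag: $token->[1]";
        }
        elsif ($type eq 'T') {
            # Text token: ['T', $text, $is_data]; show newlines visibly.
            (my $text = $token->[1]) =~ s/\n/\\n/g;
            push @lines, "text: [$text]";
        }
    }
    return @lines;
}

print "$_\n" for describe_tokens($html);
```

Note that the text token between the two links is printed as " &amp; the ": get_token hands back text tokens undecoded, so turning that into " & the " is a separate step (for example, with decode_entities from HTML::Entities).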
This representation is more abstract, focusing on markup
concepts rather than individual characters. Although the two
<a> tags use different kinds of quotes
around their attribute values in the raw HTML, as tokens each is simply a
start-tag of type a with an
href attribute of a particular value. A program
that extracts information by working with a stream of tokens
doesn't have to worry about the idiosyncrasies of
entity encoding, whitespace, and quotes, or about working out where a
tag ends.
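As a small illustration of that point, the following sketch pulls the URL and link text out of both links without caring which quote style or line break appears inside the tag. It uses the module's get_tag and get_trimmed_text methods. The helper name extract_links and the shortened sample string are assumptions made for this example.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Two links that differ in quote style; the second also has a newline
# between the tag name and its attribute.
my $html = <<'END_HTML';
the <a href="http://MyBalalaika.com">balalaika</a> &amp; the <a
href='http://MyBazouki.com'>bazouki</a>!!!
END_HTML

# extract_links() is a hypothetical helper for this sketch only.
# It returns a list of [url, link text] pairs, one per <a href=...> tag.
sub extract_links {
    my ($source) = @_;
    my $stream = HTML::TokeParser->new(\$source)
        or die "Couldn't create a token stream";
    my @links;
    # get_tag('a') skips ahead to the next <a ...> start-tag, if any.
    while (my $tag = $stream->get_tag('a')) {
        my $url = $tag->[1]{href};      # attribute hash is element 1
        next unless defined $url;       # skip anchors without an href
        # Collect the text up to the matching </a>, trimmed of outer whitespace.
        my $label = $stream->get_trimmed_text('/a');
        push @links, [$url, $label];
    }
    return @links;
}

for my $link (extract_links($html)) {
    print "$link->[1] => $link->[0]\n";
}
```

The loop never inspects a quote character or counts whitespace. By the time the program sees a tag, the tokenizer has already reduced it to a tag name and a hash of attribute values.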
Copyright © 2002 O'Reilly & Associates. All rights reserved.