7.3. Individual Tokens
Now that you know the composition of the various types of tokens, let's see how to use HTML::TokeParser to write useful programs. Many problems are quite simple and require only one token at a time. Programs to solve these problems consist of a loop over all the tokens, with an if statement in the body of the loop identifying the interesting parts of the HTML:
use HTML::TokeParser;
my $stream = HTML::TokeParser->new($filename)
  || die "Couldn't read HTML file $filename: $!";
# For a string: HTML::TokeParser->new( \$string_of_html );

while (my $token = $stream->get_token) {
  if ($token->[0] eq 'T') { # text
    # process the text in $token->[1]
  } elsif ($token->[0] eq 'S') { # start-tag
    my($tagname, $attr) = @$token[1,2];
    # consider this start-tag...
  } elsif ($token->[0] eq 'E') { # end-tag
    my $tagname = $token->[1];
    # consider this end-tag
  }
  # ignoring comments, declarations, and PIs
}
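To see the loop run end to end, here is a minimal sketch that parses a string instead of a file and simply counts how many tokens of each type come back. The sample HTML and the count_token_types name are made up for illustration; only new and get_token come from HTML::TokeParser.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Count how many tokens of each type the parser reports for $html.
# (count_token_types is a hypothetical helper, not part of the module.)
sub count_token_types {
    my ($html) = @_;
    my $stream = HTML::TokeParser->new(\$html)
        || die "Couldn't parse the HTML string";
    my %count;
    while (my $token = $stream->get_token) {
        $count{ $token->[0] }++;    # $token->[0] is 'S', 'E', 'T', 'C', 'D', or 'PI'
    }
    return \%count;
}

# Hypothetical sample input.
my $sample = '<p>Hello, <b>world</b>!</p><!-- done -->';
my $count  = count_token_types($sample);
print "$_: $count->{$_}\n" for sort keys %$count;
```

A tally like this is a quick way to check which token types a document actually contains before deciding which branches of the if statement you need.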
7.3.1. Checking Image Tags
Example 7-1 complains about any img tags in a document that are missing alt, height, or width attributes:
Example 7-1. Check <img> tags
while(my $token = $stream->get_token) {
  if($token->[0] eq 'S' and $token->[1] eq 'img') {
    my $i = $token->[2]; # attributes of this img tag
    my @lack = grep !exists $i->{$_}, qw(alt height width);
    print "Missing for ", $i->{'src'} || "????", ": @lack\n" if @lack;
  }
}
When run on an HTML stream (whether from a file or a string), this outputs one line for each incomplete img tag, such as:
Missing for liza.jpg: height width
Missing for aimee.jpg: alt
Missing for laurie.jpg: alt height width
Identifying images has many applications: making HEAD requests to
ensure the URLs are valid, or making a GET request to fetch the image
and using Image::Size from CPAN to check or insert the
height and width attributes.
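As a first step toward either application, you need the absolute URL of each image. The following is a rough sketch under a few assumptions: the image_urls helper, the sample HTML, and the base URL are all hypothetical, and the URI module's new_abs constructor is used to resolve each src value against the base.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use URI;

# Return the absolute URL of every <img> in $html, in document order.
# (image_urls is a hypothetical helper written for this sketch.)
sub image_urls {
    my ($html, $base) = @_;
    my $stream = HTML::TokeParser->new(\$html)
        || die "Couldn't parse the HTML string";
    my @urls;
    while (my $token = $stream->get_token) {
        next unless $token->[0] eq 'S' and $token->[1] eq 'img';
        my $src = $token->[2]{'src'};
        next unless defined $src and length $src;   # skip img tags with no src
        push @urls, URI->new_abs($src, $base)->as_string;
    }
    return @urls;
}

# Hypothetical page content and base URL, for illustration only.
my $page = '<p><img src="liza.jpg" alt="Liza"> <img src="/pix/aimee.jpg"></p>';
print "$_\n" for image_urls($page, 'http://www.example.com/family/index.html');
```

Each URL returned could then be handed to an LWP::UserAgent object's head or get method, depending on whether you only want to validate the link or actually fetch the image.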
7.3.2. HTML Filters
A similar while loop can use HTML::TokeParser as a simple code filter. You just pass through the $source from each token you don't mean to alter. Here's one that passes through every token that it sees (by just printing its source as HTML::TokeParser passes it in), except for img start-tags, which get replaced with the content of their alt attributes:
while (my $token = $stream->get_token) {
  if ($token->[0] eq 'S') {
    if ($token->[1] eq 'img') {
      print $token->[2]{'alt'} || '';
    } else {
      print $token->[4];
    }
  }
  elsif($token->[0] eq 'E' ) { print $token->[2] }
  elsif($token->[0] eq 'T' ) { print $token->[1] }
  elsif($token->[0] eq 'C' ) { print $token->[1] }
  elsif($token->[0] eq 'D' ) { print $token->[1] }
  elsif($token->[0] eq 'PI') { print $token->[2] }
}
So, for example, a document consisting just of this:
<!-- new entry -->
<p>Dear Diary,
<br>This is me & my balalaika, at BalalaikaCon 1998:
<img src="mybc1998.jpg" alt="BC1998! WHOOO!"> Rock on!</p>
is then spat out as this:
<!-- new entry -->
<p>Dear Diary,
<br>This is me & my balalaika, at BalalaikaCon 1998:
BC1998! WHOOO! Rock on!</p>
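The same filter can be packaged as a function that returns the filtered HTML as a string, which makes it easier to reuse and to check. This is only a sketch: the img_to_alt name and the %source_index lookup table are inventions for this example, with the table recording which element of each token type holds its original source text (the same indexes the loop above uses).

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Position of the original source text within each token type.
my %source_index = (S => 4, E => 2, T => 1, C => 1, D => 1, PI => 2);

# Return $html with every <img> start-tag replaced by its alt text.
# (img_to_alt is a hypothetical helper written for this sketch.)
sub img_to_alt {
    my ($html) = @_;
    my $stream = HTML::TokeParser->new(\$html)
        || die "Couldn't parse the HTML string";
    my $out = '';
    while (my $token = $stream->get_token) {
        my $type = $token->[0];
        if ($type eq 'S' and $token->[1] eq 'img') {
            my $alt = $token->[2]{'alt'};
            $out .= defined $alt ? $alt : '';   # img with no alt contributes nothing
        } else {
            $out .= $token->[ $source_index{$type} ];   # pass through unchanged
        }
    }
    return $out;
}

print img_to_alt('<p>At the con: <img src="mybc1998.jpg" alt="BC1998! WHOOO!"> Rock on!</p>'), "\n";
```

One small difference from the loop above: testing with defined rather than || keeps an alt value of "0" instead of discarding it as false.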
Copyright © 2002 O'Reilly & Associates. All rights reserved.