7.4. Token Sequences
Some problems cannot be solved with a single-token approach. Often
you need to scan for a sequence of tokens. For example, in Chapter 4,
"URLs", we extracted the Amazon sales rank from HTML like this:
<b>Amazon.com Sales Rank: </b> 4,070 </font><br>
Here we're looking for the text Amazon.com Sales Rank: (note the
trailing space), an end-tag for b, and then a text token containing
the sales rank. To solve this, we need to check the next few tokens
while being able to put them back if they're not what we expect.
To put tokens back into the stream, use the unget_token( ) method:
$stream->unget_token(@next);
The tokens stored in @next will be returned to the
stream. For example, to solve our Amazon problem:
my $sales_rank;
while (my $token = $stream->get_token( )) {
  if ($token->[0] eq 'T' and
      $token->[1] eq 'Amazon.com Sales Rank: ') {
    my @next;
    push @next, $stream->get_token( );
    my $found = 0;
    # Guard each lookahead: get_token( ) returns undef at end of stream.
    if ($next[0] and $next[0][0] eq 'E' and $next[0][1] eq 'b') {
      push @next, $stream->get_token( );
      if ($next[1] and $next[1][0] eq 'T') {
        $sales_rank = $next[1][1];
        $found = 1;
      }
    }
    $stream->unget_token(@next) unless $found;
  }
}
If the current token is the text we're looking for, we cautiously
explore the next tokens. If the next one is a </b> end-tag, we check
the token after it to ensure that it's text. If it is, then that's
the sales rank. If any of the tests fail, we put the tokens back on
the stream and go back to processing.
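To try the put-back logic without fetching a real page, here is a
minimal, self-contained sketch. It assumes the markup looks exactly
like the fragment above. The find_sales_rank( ) name and the decoy
<i> line in the sample input are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return the sales rank found in $html, or nothing.
sub find_sales_rank {
    my ($html) = @_;
    my $stream = HTML::TokeParser->new(\$html)
        or die "Couldn't make a token stream";

    while (my $token = $stream->get_token) {
        next unless $token->[0] eq 'T'
            and $token->[1] eq 'Amazon.com Sales Rank: ';

        # Look ahead: we expect </b> and then a text token.
        my @next = ($stream->get_token);
        if ($next[0] and $next[0][0] eq 'E' and $next[0][1] eq 'b') {
            push @next, $stream->get_token;
            if ($next[1] and $next[1][0] eq 'T') {
                (my $rank = $next[1][1]) =~ s/^\s+|\s+$//g;  # trim spaces
                return $rank;
            }
        }

        # No match: return whatever we pulled (skipping an undef that
        # marks end of stream) so the main loop sees those tokens again.
        $stream->unget_token(grep { defined } @next);
    }
    return;
}

# Sample input (an assumption): a decoy line, then the real fragment.
my $html = <<'END_HTML';
<i>Amazon.com Sales Rank: </i> n/a<br>
<b>Amazon.com Sales Rank: </b> 4,070 </font><br>
END_HTML

print find_sales_rank($html), "\n";   # prints 4,070
```

The decoy line shows why unget_token( ) matters: its </i> token is
pulled, rejected, and returned to the stream, so the main loop still
gets to see it.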
7.4.1. Example: BBC Headlines
Suppose, for example, that your morning ritual is to have the
help come and wake you at about 11 a.m. as they bring two serving
trays to your bed. On one tray there's a croissant,
some pain au chocolat, and of course some
café au lait, and on the other tray,
your laptop with a browser window already open on each story from BBC
News's front page
(http://news.bbc.co.uk). However, the help have
been getting mixed up lately and opening the stories on The
Guardian's web site, and
that's a bit awkward, since clearly The
Guardian is an after-lunch paper. You'd
say something about it, but one doesn't want to make
a scene, so you just decide to write a program that the help can run
on the laptop to find all the BBC story URLs.
So you look at the source of
http://news.bbc.co.uk and discover that each
headline link is wrapped in one of two kinds of code. There are lots
of headlines in code such as these:
<B CLASS="h3"><A href="/hi/english/business/newsid_1576000/1576290.stm">Bank
of England mulls rate cut</A></B><BR>
<B CLASS="h3"><A href="/hi/english/uk_politics/newsid_1576000/1576541.stm">Euro
battle revived by Blair speech</A></B><BR>
and also some headlines in code like this:
<A href="/hi/english/business/newsid_1576000/1576636.stm">
<B class="h2"> Swissair shares wiped out</B><BR>
</A>
<A href="/hi/english/world/middle_east/newsid_1576000/1576113.stm">
<B class="h1">Mid-East blow to US anti-terror drive</B><BR>
</A>
(Note that the B start-tag's
class value can be h1 or h2.)
Studying this, you realize that this is how you find the story URLs:
- Every time there's a B start-tag with class value of h3, and then
  an A start-tag with an href value, save that href.
- Every time there's an A start-tag with an href value, a text token
  consisting of just whitespace, and then a B start-tag with a class
  value of h1 or h2, save the first token's href value.
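Before writing the rules as code, it can help to see what the token
stream actually looks like for both kinds of markup. The following is
only a sketch: describe_tokens( ) is a hypothetical helper, and the
sample HTML is one headline of each kind copied from above.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# One headline of each kind, copied from the page source shown above.
my $html = <<'END_HTML';
<B CLASS="h3"><A href="/hi/english/business/newsid_1576000/1576290.stm">Bank
of England mulls rate cut</A></B><BR>
<A href="/hi/english/business/newsid_1576000/1576636.stm">
<B class="h2"> Swissair shares wiped out</B><BR>
</A>
END_HTML

# Hypothetical helper: one short description per token in $html.
sub describe_tokens {
    my ($html) = @_;
    my $stream = HTML::TokeParser->new(\$html)
        or die "Couldn't make a token stream";
    my @lines;
    while (my $token = $stream->get_token) {
        my ($type, @rest) = @$token;
        if ($type eq 'S') {            # start-tag: name, then attribute hash
            my ($tag, $attr) = @rest;
            my $attrs = join ' ',
                map { "$_=$attr->{$_}" } sort keys %$attr;
            push @lines, "S $tag" . (length $attrs ? " [$attrs]" : '');
        }
        elsif ($type eq 'E') {         # end-tag: name
            push @lines, "E $rest[0]";
        }
        elsif ($type eq 'T') {         # text: note whitespace-only tokens
            push @lines, $rest[0] =~ /\S/ ? 'T text' : 'T whitespace';
        }
        else {                         # comments, declarations, and so on
            push @lines, $type;
        }
    }
    return @lines;
}

print "$_\n" for describe_tokens($html);
```

Two things stand out in the listing. Tag and attribute names come back
in lowercase, so CLASS="h3" is found under the key class. And the
newline between the A start-tag and the B start-tag in the second kind
of markup shows up as its own whitespace-only text token, which is why
the second rule has to allow for it.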
7.4.2. Translating the Problem into Code
We can take some shortcuts when translating this into
$stream->unget_token($token) code. The
following HTML is typical:
<B CLASS="h3">Top Stories</B><BR>
...
<B CLASS="h3"><A href="/hi/english/business/newsid_1576000/1576290.stm">Bank
of England mulls rate cut</A></B><BR>
When we see the first B-h3 start-tag token, we
think it might be the start of a B-h3-A-href
pattern. So we get another token and see if it's an
A-href token. It's not
(it's the text token Top
Stories), so we put it back into the stream
(useful in case some other pattern we're looking for
involves that being the first token), and we keep looping. Later, we
see another B-h3, we get another token, and we
inspect it to see if it's an
A-href token. This time it is, so we process its
href value and resume looping.
There's no reason for us to put that A-href back, so the next
iteration of the loop will resume with the next token being the text
Bank of England mulls rate cut.
sub scan_bbc_stream {
  my($stream, $docbase) = @_;
  Token:
  while(my $token = $stream->get_token) {
    if ($token->[0] eq 'S' and $token->[1] eq 'b' and
      ($token->[2]{'class'} || '') eq 'h3') {
      # The href we want is in the NEXT token... probably.
      # Like: <B CLASS="h3"><A href="magic_url_here">
      my(@next) = ($stream->get_token);
      if ($next[0] and $next[0][0] eq 'S' and $next[0][1] eq 'a' and
        defined $next[0][2]{'href'} ) {
        # We found <a href="...">! This rule matches!
        print URI->new_abs($next[0][2]{'href'}, $docbase), "\n";
        next Token;
      }
      # We get here only if we've given up on this rule:
      $stream->unget_token(@next);
    }
    # fall thru to subsequent rules here...
  }
  return;
}
The general form of the rule above is this: if the current token
looks promising, pull off a token and see if that looks promising
too. If, at any point, we see an unexpected token or hit the end of
the stream, we restore what we've pulled off (held
in the temporary array @next), and continue to try
other rules. But if all the expectations in this rule are met, we
make it to the part that processes this bunch of tokens (here
it's just a single line, which prints the URL), and
then call next Token to start
another iteration of this loop without restoring
the tokens that have matched this pattern. (If you are disturbed by
this use of a named block and last-ing and next-ing around, consider
that this could be written as a giant if/else statement at the risk of
potentially greater damage to what's left of your sanity.)
Each such rule, then, can pull from the stream however many tokens it
needs to either match or reject the pattern it's
after. Either it matches and starts another iteration of this loop,
or it restores the stream to exactly the way it was before this rule
started pulling from it. This business of a temporary
@next list may seem like overkill when we only
have to look one token ahead, only ever looking at
$next[0]. However, the if block
for the next pattern (which requires looking two tokens ahead) shows
how the same framework can be accommodating:
# Add this right after the first if-block ends.
if($token->[0] eq 'S' and $token->[1] eq 'a' and
  defined $token->[2]{'href'} ) {
  # Like: <A href="magic_url_here"> <B class="h2">
  my(@next) = ($stream->get_token);
  # The text token must be whitespace and nothing else, hence the $.
  if ($next[0] and $next[0][0] eq 'T' and $next[0][1] =~ m/^\s+$/s ) {
    # We found whitespace.
    push @next, $stream->get_token;
    if ($next[1] and $next[1][0] eq 'S' and $next[1][1] eq 'b' and
      ($next[1][2]{'class'} || '') =~ m/^h[12]$/s ) {
      # We found <b class="h1"> or <b class="h2">! This rule matches!
      print URI->new_abs( $token->[2]{'href'}, $docbase ), "\n";
      next Token;
    }
  }
  # We get here only if we've given up on this rule:
  $stream->unget_token(@next);
}
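Both rules have the same shape: pull a token, test it, and either go
on or put everything back. As a sketch of how that shape could be
factored out, the hypothetical helper match_ahead( ) below (it is not
part of HTML::TokeParser) takes one test per token of lookahead. The
story_hrefs( ) function and the sample HTML are likewise made up for
illustration, and only the second rule is shown.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: pull one token per test. If every test passes,
# return the pulled tokens. Otherwise put them all back and return
# nothing, leaving the stream as it was.
sub match_ahead {
    my ($stream, @tests) = @_;
    my @next;
    for my $test (@tests) {
        my $token = $stream->get_token;
        push @next, $token if defined $token;   # undef means end of stream
        next if defined $token and $test->($token);
        $stream->unget_token(@next);
        return;
    }
    return @next;
}

# Tests for the second rule: whitespace-only text, then <b class="h1|h2">.
my $is_space = sub { $_[0][0] eq 'T' and $_[0][1] =~ m/^\s+$/ };
my $is_b_h12 = sub {
    $_[0][0] eq 'S' and $_[0][1] eq 'b'
        and ($_[0][2]{'class'} || '') =~ m/^h[12]$/;
};

# Collect the href of every <a> that the second rule accepts.
sub story_hrefs {
    my ($html) = @_;
    my $stream = HTML::TokeParser->new(\$html)
        or die "Couldn't make a token stream";
    my @hrefs;
    while (my $token = $stream->get_token) {
        next unless $token->[0] eq 'S' and $token->[1] eq 'a'
            and defined $token->[2]{'href'};
        push @hrefs, $token->[2]{'href'}
            if match_ahead($stream, $is_space, $is_b_h12);
    }
    return @hrefs;
}

# Made-up sample: only the first link fits the pattern.
my $sample = <<'END_HTML';
<A href="/a.stm">
<B class="h2">One</B></A>
<A href="/b.stm">Two</A>
<A href="/c.stm">
<I>Three</I></A>
END_HTML

print "$_\n" for story_hrefs($sample);   # prints /a.stm
```

The cost of this design is a layer of indirection. The benefit is
that a rule needing three or four tokens of lookahead is no longer
any more deeply nested than a rule needing one.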
7.4.3. Bundling into a Program
With all that wrapped up in a single function, scan_bbc_stream( ), we
can test it by first saving the contents of
http://news.bbc.co.uk locally as
bbc.html (which we probably already did to
scrutinize its source code and figure out what HTML patterns surround
headlines), and then calling this:
use strict;
use HTML::TokeParser;
use URI;
scan_bbc_stream(
  HTML::TokeParser->new('bbc.html') || die($!),
  'http://news.bbc.co.uk/' # base URL
);
When run, this merrily scans the local copy and says:
http://news.bbc.co.uk/hi/english/world/middle_east/newsid_1576000/1576113.stm
http://news.bbc.co.uk/hi/english/world/south_asia/newsid_1576000/1576186.stm
http://news.bbc.co.uk/hi/english/uk_politics/newsid_1576000/1576051.stm
http://news.bbc.co.uk/hi/english/uk/newsid_1576000/1576379.stm
http://news.bbc.co.uk/hi/english/business/newsid_1576000/1576636.stm
http://news.bbc.co.uk/sport/hi/english/in_depth/2001/england_in_zimbabwe/newsid_1574000/1574824.stm
http://news.bbc.co.uk/hi/english/business/newsid_1576000/1576546.stm
http://news.bbc.co.uk/hi/english/uk/newsid_1576000/1576313.stm
http://news.bbc.co.uk/hi/english/uk_politics/newsid_1576000/1576541.stm
http://news.bbc.co.uk/hi/english/business/newsid_1576000/1576290.stm
http://news.bbc.co.uk/hi/english/entertainment/music/newsid_1576000/1576599.stm
http://news.bbc.co.uk/hi/english/sci/tech/newsid_1574000/1574048.stm
http://news.bbc.co.uk/hi/english/health/newsid_1576000/1576776.stm
http://news.bbc.co.uk/hi/english/in_depth/uk_politics/2001/conferences_2001/labour/newsid_1576000/1576086.stm
At least that's what the program said once I got
scan_bbc_stream( ) in its final working state
shown above. As I was writing it and testing bits of it, I could run
and re-run the program, scanning the same local file. Then once
it's working on the local file (or files, depending
on how many test cases you have), you can write the routine that gets
what's at a URL, makes a stream pointing to its
content, and runs a given scanner routine (such as
scan_bbc_stream( )) on it:
my $browser;
BEGIN {
  use LWP::UserAgent;
  $browser = LWP::UserAgent->new;
  # and any other $browser initialization code here
}
sub url_scan {
  my($scanner, $url) = @_;
  die "What scanner function?" unless $scanner and ref($scanner) eq 'CODE';
  die "What URL?" unless $url;
  my $resp = $browser->get( $url );
  die "Error getting $url: ", $resp->status_line
    unless $resp->is_success;
  die "It's not HTML, it's ", $resp->content_type
    unless $resp->content_type eq 'text/html';
  my $stream = HTML::TokeParser->new( $resp->content_ref )
    || die "Couldn't make a stream from $url\'s content!?";
  # new( ) on a string wants a reference, and so that's what
  # we give it! HTTP::Response objects just happen to
  # offer a method that returns a reference to the content.
  $scanner->($stream, $resp->base);
}
If you thought the contents of $url could be very
large, you could save the contents to a temporary file and start the
stream off with
HTML::TokeParser->new($tempfile). With the
above url_scan( ), to
retrieve the BBC main page and scan it, you
need only replace our test statement that scans the input stream
with this:
url_scan(\&scan_bbc_stream, 'http://news.bbc.co.uk/');
And then the program outputs the URLs from the live BBC main page (or
will die with an error message if it can't get it).
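For the very-large-document case mentioned earlier, here is one way
the temporary-file variant might look. This is a sketch under
assumptions, not the only way to do it: url_scan_via_tempfile( ) is a
hypothetical name, the browser object is passed in as an optional
third argument, and File::Temp is used to pick the filename. It
relies on the ':content_file' option of LWP::UserAgent's get( )
method, which writes the response body to a file instead of holding
it in memory.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTML::TokeParser;
use LWP::UserAgent;

# Hypothetical variant of url_scan( ) that spools the response body to
# a temporary file and points the token stream at that file.
sub url_scan_via_tempfile {
    my ($scanner, $url, $browser) = @_;
    die "What scanner function?"
        unless $scanner and ref($scanner) eq 'CODE';
    die "What URL?" unless $url;
    $browser ||= LWP::UserAgent->new;

    # Reserve a filename; it is removed when the program exits.
    my ($fh, $tempfile) = tempfile(SUFFIX => '.html', UNLINK => 1);
    close $fh;    # LWP opens the file itself

    # ':content_file' makes LWP write the body to $tempfile.
    my $resp = $browser->get($url, ':content_file' => $tempfile);
    die "Error getting $url: ", $resp->status_line
        unless $resp->is_success;
    die "It's not HTML, it's ", $resp->content_type
        unless $resp->content_type eq 'text/html';

    # A plain string argument to new( ) is treated as a filename.
    my $stream = HTML::TokeParser->new($tempfile)
        or die "Couldn't make a stream from $tempfile: $!";
    return $scanner->($stream, $resp->base);
}
```

It would be called just like url_scan( ), for example as
url_scan_via_tempfile(\&scan_bbc_stream, 'http://news.bbc.co.uk/').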
To actually complete the task of getting the printed URLs to each
open a new browser instance, well, this depends on your browser and
OS, but for my MS Windows laptop and Netscape, this Perl program will
do it:
my $ns = "c:\\program files\\netscape\\communicator\\program\\netscape.exe";
die "$ns doesn't exist" unless -e $ns;
die "$ns isn't executable" unless -x $ns;
while (<>) { chomp; m/\S/ and system($ns, $_) and die $!; }
This is then called as:
C:\perlstuff> perl bbc_urls.pl | perl urls2ns.pl
Under Unix, the correct system( ) command is:
system("netscape '$url' &")
Copyright © 2002 O'Reilly & Associates. All rights reserved.