7.5. More HTML::TokeParser Methods
Example 7-1 illustrates that often you aren't interested in every
kind of token in a stream, but care only about tokens of a certain
kind. The HTML::TokeParser interface supports this with three methods,
get_tag( ), get_text( ), and get_trimmed_text( ), that do something
other than simply get the next token.
- $text_string = $stream->get_text( );
-
If the next token is text, return its value.
- $text_string = $stream->get_text('foo');
-
Return all text up to the next foo start-tag.
- $text_string = $stream->get_text('/bar');
-
Return all text up to the next /bar end-tag.
- $text = $stream->get_trimmed_text( );
- $text = $stream->get_trimmed_text('foo');
- $text = $stream->get_trimmed_text('/bar');
-
Like get_text( ) calls, except with initial and
final whitespace removed, and all other whitespace collapsed.
- $tag_ref = $stream->get_tag( );
-
Return the next start-tag or end-tag token.
- $tag_ref = $stream->get_tag('foo', '/bar', 'baz');
-
Return the next foo start-tag,
/bar end-tag, or baz start-tag.
We will explain these methods in detail in the following sections.
7.5.1. The get_text( ) Method
The get_text( ) syntax is:
$text_string = $stream->get_text( );
If $stream's next token is
text, this gets it, resolves any entities in
it, and returns its string value. Otherwise, this returns an empty
string.
For example, if you are parsing this snippet:
<h1 lang='en-GB'>Shatner Reprises Kirk R&ocirc;le</h1>
and have just parsed the token for h1,
$stream->get_text( ) returns
"Shatner Reprises Kirk Rôle". If you call it again (and again
and again), it will return the empty string, because the next token
waiting is not a text token but an h1 end-tag
token.
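The behavior just described can be sketched as follows. This is only an illustration: the variable names are made up, and the sample heading is assumed to spell its ô as the entity &ocirc; so that there is something for get_text( ) to resolve.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample heading; the &ocirc; entity is an assumption made for this sketch.
my $html   = "<h1 lang='en-GB'>Shatner Reprises Kirk R&ocirc;le</h1>";
my $stream = HTML::TokeParser->new(\$html) or die "Couldn't create parser";

$stream->get_token( );                # consume the h1 start-tag token

my $first  = $stream->get_text( );    # the text token, with &ocirc; resolved
my $second = $stream->get_text( );    # '' because the </h1> token is next
my $third  = $stream->get_text( );    # still '': </h1> was left in the stream

print "Got ", length($first), " characters of heading text\n";
print "Repeated calls returned nothing more\n"
  if $second eq '' and $third eq '';
```

Because get_text( ) leaves the end-tag in the stream, a later get_token( ) or get_tag( ) call will still see it.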
7.5.2. The get_text( ) Method with Parameters
The syntax for get_text( ) with parameters is:
$text_string = $stream->get_text('foo');
$text_string = $stream->get_text('/bar');
Specifying a foo or /bar parameter changes the
meaning of get_text( ). If you specify a tag, you
get all the text up to the next time that tag occurs (or until the
end of the file, if that tag never occurs).
The text values of all the text tokens found along the way are taken,
their entity sequences are resolved, and the results are combined and
returned. (All the other sorts of tokens seen along the way are just
ignored.)
Note that the tag name that you specify (whether
foo or /bar) must be in
lowercase.
This sounds complex, but it works out well in real use. For example,
imagine you've got this snippet:
<h1 lang='en-GB'>Star of <cite>Star Trek</cite> in New R&ocirc;le</h1>
  <cite>American Psycho II</cite> in Production.
  <!-- I'm not making this up, folks. -->
  <br>Shatner to play FBI profiler.
and that you've just parsed the token for
h1. Calling $stream->get_text( ) simply gets
"Star of " (note the trailing space). If, however, the task you're
performing is the extraction of the text content of
<h1> elements, then what's
called for is:
$stream->get_text('/h1')
This returns Star of Star Trek in New
Rôle.
Calling:
$stream->get_text('br')
returns:
"Star of Star Trek in New Rôle\n  American Psycho II in Production.\n  \n  "
And if you instead called
$stream->get_text('schlock') and there is no
<schlock...> in the rest of the document,
you will get
"Star of Star Trek in New Rôle\n  American Psycho II in Production.\n  \n  Shatner to play FBI profiler.\n",
plus whatever text there is in the rest of the document.
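To try these calls side by side, something like the following sketch may help. The text_after_h1 helper is hypothetical (it is not part of HTML::TokeParser), and the sample's two-space indentation and &ocirc; entity are assumptions about how the snippet was written. The results are printed rather than compared character by character, because the exact whitespace returned when tags are skipped may differ between versions of the module.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = <<'EOF';
<h1 lang='en-GB'>Star of <cite>Star Trek</cite> in New R&ocirc;le</h1>
  <cite>American Psycho II</cite> in Production.
  <!-- I'm not making this up, folks. -->
  <br>Shatner to play FBI profiler.
EOF

# Hypothetical helper: start a fresh stream, step past the h1 start-tag,
# and call get_text( ) with whatever stop tag (if any) was passed in.
sub text_after_h1 {
    my @stop_at = @_;
    my $stream = HTML::TokeParser->new(\$html) or die "Couldn't create parser";
    $stream->get_token( );
    return $stream->get_text(@stop_at);
}

my $to_next_tag = text_after_h1( );        # stops at <cite>
my $whole_h1    = text_after_h1('/h1');    # skips <cite> and </cite>
my $to_br       = text_after_h1('br');     # runs on past </h1>

binmode STDOUT, ':encoding(UTF-8)';        # the decoded ô is not ASCII
print "[$_]\n" for $to_next_tag, $whole_h1, $to_br;
```

Each call gets its own stream here only so that the three results can be compared from the same starting point; in real code you would normally keep reading from a single stream.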
Note that this never introduces whitespace where
it's not there in the original. So if
you're parsing this:
<table>
<tr><th>Height<th>Weight<th>Shoe Size</tr>
<tr><th>6' 2"<th>180lbs<th>n/a</tr>
</table>
and you've just parsed the table
token, if you call:
$stream->get_text('/table')
you'll get back:
"\nHeightWeightShoe Size\n6' 2"180lbsn/a\n"
Not all nontext tokens are ignored by $stream->get_text(
). Some tags receive special treatment: if an
img or applet tag is seen, it
is treated as if it were a text token; if it has an
alt attribute, its value is used as the content of
the virtual text token; otherwise, you get just the uppercase tag
name in brackets: [IMG] or
[APPLET]. For further information on altering and
expanding this feature, see the documentation for the get_text
method in perldoc HTML::TokeParser, and
possibly even the surprisingly short HTML::TokeParser source code.
If you just want to turn off such special treatment for all tags:
$stream->{'textify'} = {};
This is the only case of the $object->{'thing'}
syntax we'll discuss in this book. It is needed here only because the
object provides no method for changing this setting; in no other case
will we have to access an object's internals directly like this. For
more information on this particular syntax, see the documentation on
hash references in perldoc perlref.
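Here is a small sketch of the difference that setting makes. The markup and the paragraph_text helper are invented for illustration, and get_trimmed_text( ) is used instead of get_text( ) only so that the comparison doesn't hinge on stray spaces around the skipped tags.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Made-up markup: one img with an alt attribute, one without.
my $html = '<p>Cast: <img src="kirk.jpg" alt="Kirk"> and <img src="spock.jpg"></p>';

# Hypothetical helper: return the text of the p element, optionally
# switching off the special treatment of img and applet first.
sub paragraph_text {
    my ($textify_off) = @_;
    my $stream = HTML::TokeParser->new(\$html) or die "Couldn't create parser";
    $stream->{'textify'} = {} if $textify_off;
    $stream->get_token( );                       # consume the p start-tag
    return $stream->get_trimmed_text('/p');
}

my $with_textify    = paragraph_text(0);   # img tags become text
my $without_textify = paragraph_text(1);   # img tags are skipped like any other

print "default:     $with_textify\n";
print "textify off: $without_textify\n";
```

With the default setting, the first img contributes its alt value and the second contributes [IMG]; with the setting emptied, neither contributes anything.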
7.5.3. The get_trimmed_text( ) Method
The syntax for the get_trimmed_text( ) method is:
$text = $stream->get_trimmed_text( );
$text = $stream->get_trimmed_text('foo');
$text = $stream->get_trimmed_text('/bar');
These work exactly like the
corresponding $stream->get_text(
) calls, except any leading and trailing whitespace is
removed and each sequence of whitespace is replaced with a single
space.
Returning to our news example:
use HTML::TokeParser;

my $html = <<'EOF';
<h1 lang='en-GB'>Star of <cite>Star Trek</cite> in New R&ocirc;le</h1>
  <cite>American Psycho II</cite> in Production.
  <!-- I'm not making this up, folks. -->
  <br>Shatner to play FBI profiler.
EOF
my $stream = HTML::TokeParser->new(\$html);
$stream->get_token( ); # skip h1
The get_text( ) method would return
"Star of " (with the trailing
space), while the get_trimmed_text( ) method would
return "Star of" (no trailing
space).
Similarly, $stream->get_text('br') would return:
"Star of Star Trek in New Rôle\n  American Psycho II in Production.\n  \n  "
whereas $stream->get_trimmed_text('br') would
return:
"Star of Star Trek in New Rôle American Psycho II in Production."
Notice that the medial newline-space-space became a single space, and
the final newline-space-space-newline-space-space was simply removed.
The caveat that get_text( ) does not introduce any
new whitespace applies also to get_trimmed_text( ).
So where, in the last example for get_text( ), you would have gotten
"\nHeightWeightShoe Size\n6' 2"180lbsn/a\n",
get_trimmed_text( ) would return
"HeightWeightShoe Size 6' 2"180lbsn/a".
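One way to picture get_trimmed_text( ) is as get_text( ) followed by three substitutions. The sketch below checks that picture against the news example; the tidy helper is hypothetical, written here only to spell out the cleanup steps, and the sample's indentation and &ocirc; entity are assumptions.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = <<'EOF';
<h1 lang='en-GB'>Star of <cite>Star Trek</cite> in New R&ocirc;le</h1>
  <cite>American Psycho II</cite> in Production.
  <!-- I'm not making this up, folks. -->
  <br>Shatner to play FBI profiler.
EOF

# Hypothetical helper that applies the cleanup described above by hand.
sub tidy {
    my ($text) = @_;
    $text =~ s/^\s+//;      # drop leading whitespace
    $text =~ s/\s+$//;      # drop trailing whitespace
    $text =~ s/\s+/ /g;     # collapse each remaining run to one space
    return $text;
}

# Two independent streams over the same document, both moved past <h1>.
my $raw_stream     = HTML::TokeParser->new(\$html) or die "Couldn't create parser";
my $trimmed_stream = HTML::TokeParser->new(\$html) or die "Couldn't create parser";
$_->get_token( ) for $raw_stream, $trimmed_stream;

my $raw     = $raw_stream->get_text('br');
my $trimmed = $trimmed_stream->get_trimmed_text('br');

print "Same result either way\n" if tidy($raw) eq $trimmed;
```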
7.5.4. The get_tag( ) Method
The syntax for the get_tag( ) method
is:
$tag_reference = $stream->get_tag( );
This returns the next start-tag or end-tag token (throwing out
anything else it has to skip to get there). However, whereas
get_token( ) would return start-tags and end-tags in
these formats:
['S', 'hr', {'class','Ginormous'}, ['class'], '<hr class=Ginormous>']
['E', 'p' , '</P>']
get_tag( ) instead returns them in this format:
['hr', {'class','Ginormous'}, ['class'], '<hr class=Ginormous>']
['/p' , '</P>']
That is, the first item has been taken away, and end-tag names start
with /.
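The two formats can be compared directly by reading the same markup through two streams, one with get_token( ) and one with get_tag( ). This is just a sketch; the sample markup is pieced together from the two tags shown above.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = '<hr class=Ginormous></P>';

my $token_stream = HTML::TokeParser->new(\$html) or die "Couldn't create parser";
my $tag_stream   = HTML::TokeParser->new(\$html) or die "Couldn't create parser";

# Read both tags from each stream.
my @tokens = map { $token_stream->get_token( ) } 1 .. 2;
my @tags   = map { $tag_stream->get_tag( ) }     1 .. 2;

for my $i (0, 1) {
    my ($type, $name, @rest) = @{ $tokens[$i] };

    # What get_tag( ) should call this tag, given the get_token( ) form.
    my $expected_name = $type eq 'E' ? "/$name" : $name;

    print "get_token: $type $name    get_tag: $tags[$i][0]\n";
    print "  names agree\n" if $tags[$i][0] eq $expected_name;
    print "  items after the name: ", scalar(@rest),
          " vs ", scalar(@{ $tags[$i] }) - 1, "\n";
}
```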
7.5.4.1. Start-tags
Unless $tag->[0] begins with a
/, the tag represents a start-tag:
[$tag, $attribute_hashref, $attribute_order_arrayref, $source]
The components of this token are:
- $tag
-
The tag name, in lowercase.
- $attribute_hashref
-
A reference to a hash encoding the attributes of this tag. The
(lowercase) attribute names are the keys of the hash.
- $attribute_order_arrayref
-
A reference to an array of (lowercase) attribute names, in case you
need to access elements in order.
- $source
-
The original HTML for this token.
The first two values are the most interesting ones, for most purposes.
For example, parsing this HTML with $stream->get_tag(
) :
<IMG SRC="kirk.jpg" alt="Shatner in r&ocirc;le of Kirk" WIDTH=352 height=522>
gives this tag:
[
  'img',
  { 'alt' => 'Shatner in rôle of Kirk',
    'height' => '522', 'src' => 'kirk.jpg', 'width' => '352'
  },
  [ 'src', 'alt', 'width', 'height' ],
  '<IMG SRC="kirk.jpg" alt="Shatner in r&ocirc;le of Kirk" WIDTH=352 height=522>'
]
Notice that the tag and attribute names have been lowercased, and the
&ocirc; entity decoded within the
alt attribute.
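In code, the four components are usually unpacked into separate variables. The following sketch does that for the img tag above (assuming its alt text was written with the &ocirc; entity), using the hash for a direct lookup and the order array to walk the attributes as they were written; the variable names are only illustrative.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = '<IMG SRC="kirk.jpg" alt="Shatner in r&ocirc;le of Kirk" WIDTH=352 height=522>';
my $stream = HTML::TokeParser->new(\$html) or die "Couldn't create parser";

my $tag = $stream->get_tag( );
my ($name, $attr, $order, $source) = @$tag;

# Hash lookup when you know which attribute you want...
my $width = $attr->{'width'};

# ...and the order array when the original sequence matters.
my @pairs = map { "$_=$attr->{$_}" } @$order;

binmode STDOUT, ':encoding(UTF-8)';    # the decoded ô is not ASCII
print "$name is $width pixels wide\n";
print "Attributes as written: @pairs\n";
```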
7.5.4.2. End-tags
When $tag->[0] does begin with a
/, the token represents an
end-tag:
[ "/$tag", $source ]
The components of this tag are:
- "/$tag"
-
The lowercase name of the tag being closed, with a leading
/.
- $source
-
The original HTML for this token.
Parsing this HTML with $stream->get_tag( ) :
</A>
gives this tag:
[ '/a', '</A>' ]
Note that if get_tag( ) reads to the end of the
stream and finds no tag tokens, it will return
undef.
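Since undef is false, get_tag( ) fits naturally in a while loop's condition. As a rough sketch (the markup is invented for illustration), this loop walks every tag in a snippet and sorts the names into start-tags and end-tags by checking for the leading /:

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Made-up markup with nested elements and mixed-case tag names.
my $html = '<p>See <A HREF="tos.html">the <b>original</b> series</A>.</p>';
my $stream = HTML::TokeParser->new(\$html) or die "Couldn't create parser";

my (@opened, @closed);
while (my $tag = $stream->get_tag( )) {
    if ($tag->[0] =~ m{^/(.+)}) {
        push @closed, $1;            # end-tag: record the name without the /
    }
    else {
        push @opened, $tag->[0];     # start-tag
    }
}

print "Opened: @opened\n";
print "Closed: @closed\n";
```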
7.5.5. The get_tag( ) Method with Parameters
Pass a list of tags to skip through
the tokens
until a matching tag is found:
$tag_reference = $stream->get_tag('foo', '/bar', 'baz');
This returns the next start-tag or end-tag that matches any of the
strings you provide (throwing out anything it has to skip to get
there). Note that the tag name(s) that you provide as parameters must
be in lowercase.
If get_tag( ) reads to the end of the stream and
finds no matching tag tokens, it will return
undef. For example, this code's
get_tag( ) looks for img
start-tags:
while (my $img_tag = $stream->get_tag('img')) {
  my $i = $img_tag->[1]; # attributes of this img tag
  my @lack = grep !exists $i->{$_}, qw(alt height width);
  print "Missing for ", $i->{'src'} || "????", ": @lack\n" if @lack;
}
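Passing several tag names is handy when more than one kind of element is of interest. The sketch below, using an invented document, collects the text of every h1 and h2 by pairing get_tag( ) with get_trimmed_text( ); because get_tag( ) reports which tag it found, the matching end-tag name can be built from it.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Made-up document; note the uppercase H2 and the doubled space.
my $html = <<'EOF';
<h1>Shatner News</h1>
<p>Some <b>text</b> here.</p>
<h2>Film  Roles</h2>
<p>More text.</p>
<H2>Television</H2>
EOF

my $stream = HTML::TokeParser->new(\$html) or die "Couldn't create parser";

my @headings;
while (my $tag = $stream->get_tag('h1', 'h2')) {
    my $name = $tag->[0];                            # 'h1' or 'h2'
    my $text = $stream->get_trimmed_text("/$name");  # up to the matching end-tag
    push @headings, "$name: $text";
}

print "$_\n" for @headings;
```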
Copyright © 2002 O'Reilly & Associates. All rights reserved.