10.1. Changing Attributes
Suppose that in your new role as fixer of large sets of
HTML documents, you are given a bunch of documents that have headings
like this:
<h3 align=center>Free Monkey</h3>
<h3 color=red>Inquire Within</h3>
that need to be changed like this:
<h2 class=scream>Free Monkey</h2>
<h4 class=mutter>Inquire Within</h4>
Before you start phrasing this in terms of HTML::Element methods, you
should consider whether this can be done with a search-and-replace
operation in an editor. In this case, it cannot, because
you're not just changing every <h3 align=center> to <h2 class=scream>
and every <h3 color=red> to <h4 class=mutter> (which are apparently
simple search-and-replace operations); you also have to change
</h3> to </h2> or to </h4>,
depending on what you did to the element that it closes. That sort of
context dependency puts this well outside the realm of simple
search-and-replace operations. One could try to implement this with
HTML::TokeParser, reading every token and printing it back out, after
having possibly altered it. In such a program, every time we see an
<h3 ...> and maybe alter it, we'd have to set a flag indicating what
the next </h3> should be changed to.
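To make the bookkeeping concrete, here is a minimal sketch of that token-based approach, assuming the two sample headings as input in a string. The variable names ($pending_close, $out) are ours, chosen for illustration; this is not a listing from elsewhere in the book.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = qq{<h3 align=center>Free Monkey</h3>\n}
         . qq{<h3 color=red>Inquire Within</h3>\n};
my $parser = HTML::TokeParser->new(\$html) or die "Can't parse: $!";

my $out = '';
my $pending_close;   # the flag: what the next </h3> should become, if anything

while (my $token = $parser->get_token) {
    my $type = $token->[0];
    if ($type eq 'S' and $token->[1] eq 'h3') {
        my $attr = $token->[2];
        if (($attr->{align} // '') eq 'center') {
            $out .= '<h2 class=scream>';
            $pending_close = 'h2';
        } elsif (($attr->{color} // '') eq 'red') {
            $out .= '<h4 class=mutter>';
            $pending_close = 'h4';
        } else {
            $out .= $token->[4];          # some other h3: pass it through
            $pending_close = undef;
        }
    } elsif ($type eq 'E' and $token->[1] eq 'h3') {
        $out .= defined $pending_close ? "</$pending_close>" : $token->[2];
        $pending_close = undef;
    } else {
        # Text tokens keep their source text at index 1; for every other
        # token type, the original source text is the last item.
        $out .= $type eq 'T' ? $token->[1] : $token->[-1];
    }
}
print $out;
```

Even for this tiny case the flag has to be set, consulted, and cleared in the right places, and nested or unclosed headings would complicate it further. In a tree, the start-tag and end-tag are one node, so that bookkeeping goes away.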
So far, you've seen the method $element->attr(attrname)
to get the value of an attribute (returning undef
if there is no such attribute). To alter attribute values, you need
only two additional syntaxes:
$element->attr(attrname, newval) sets a value
(regardless of whether that attribute had a previous value), and
$element->attr(attrname, undef) deletes an attribute. That works
even for changing the _tag attribute (for which the
$element->tag method is a shortcut).
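The three forms of attr can be tried on a standalone element before touching a whole tree. The following is a small sketch; the element is built by hand with HTML::Element->new purely for illustration.

```perl
use strict;
use warnings;
use HTML::Element;

# A lone <h3 align="center"> element, not attached to any tree.
my $h = HTML::Element->new('h3', 'align' => 'center');

my $align = $h->attr('align');   # get: returns 'center'
my $class = $h->attr('class');   # get: returns undef, no such attribute

$h->attr('class', 'scream');     # set a new attribute
$h->attr('align', undef);        # delete an existing attribute
$h->attr('_tag',  'h2');         # rename the element itself

print $h->tag, "\n";             # the tag name is now h2
print $h->starttag, "\n";        # start-tag with the remaining attributes
```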
That said, it's just a matter of knowing what nodes to change and
then changing them, as in Example 10-1.
Example 10-1. Modifying attributes
use strict;
use HTML::TreeBuilder;
my $root = HTML::TreeBuilder->new;
$root->parse_file('rewriters1/in1.html') || die $!;
print "Before:\n";
$root->dump;
# Find the two kinds of headings we need to rewrite.
my @h3_center = $root->look_down('_tag', 'h3', 'align', 'center');
my @h3_red = $root->look_down('_tag', 'h3', 'color', 'red');
foreach my $h3c (@h3_center) {
  $h3c->attr('_tag', 'h2');        # <h3 align=center> becomes <h2 ...>
  $h3c->attr('class', 'scream');
  $h3c->attr('align', undef);      # drop the old attribute
}
foreach my $h3r (@h3_red) {
  $h3r->attr('_tag', 'h4');        # <h3 color=red> becomes <h4 ...>
  $h3r->attr('class', 'mutter');
  $h3r->attr('color', undef);      # drop the old attribute
}
print "\n\nAfter:\n";
$root->dump;
Suppose that the input file consists of this:
<html><body>
<h3 align=center>Free Monkey</h3>
<h3 color=red>Inquire Within</h3>
<p>It's a monkey! <em>And it's free!</em></html>
When we run the program, we can see the tree dump before and after
the modifications happen:
Before:
<html> @0
  <head> @0.0 (IMPLICIT)
  <body> @0.1
    <h3 align="center"> @0.1.0
      "Free Monkey"
    <h3 color="red"> @0.1.1
      "Inquire Within"
    <p> @0.1.2
      "It's a monkey! "
      <em> @0.1.2.1
        "And it's free!"
After:
<html> @0
  <head> @0.0 (IMPLICIT)
  <body> @0.1
    <h2 class="scream"> @0.1.0
      "Free Monkey"
    <h4 class="mutter"> @0.1.1
      "Inquire Within"
    <p> @0.1.2
      "It's a monkey! "
      <em> @0.1.2.1
        "And it's free!"
The changes applied correctly, so we can go ahead and add this code
to the end of the program, to dump the tree to disk:
open(OUT, ">rewriters1/out1.html") || die "Can't write: $!";
print OUT $root->as_HTML;
close(OUT);
$root->delete; # done with it, so delete it
10.1.1. Whitespace
Examining the output file shows it to be one single line, consisting
of this (wrapped so it will fit on the page):
<html><head></head><body><h2 class="scream">Free Monkey</h2><h4
class="mutter">Inquire Within</h4><p>It's a monkey! <em>And it's
free!</em></body></html>
Where did all the nice whitespace from the original go, such as the
newline after each </h3>?
Whitespace in HTML (except in pre elements and a
few others) isn't contrastive. That is, any amount
of whitespace is as good as just one space. So whenever
HTML::TreeBuilder sees whitespace tokens as it is parsing the HTML
source, it compacts each group into a single space. Furthermore,
whitespace between some kinds of tags (such as between </h3> and
<h3>, or between </h3> and <p>) isn't meaningful at all, so when
HTML::TreeBuilder sees such whitespace, it just discards it.
This whitespace mangling is the default behavior of an
HTML::TreeBuilder tree and can be changed by two options that you set
before parsing from a file:
my $root = HTML::TreeBuilder->new;
# Don't try to delete whitespace between block-level elements.
$root->ignore_ignorable_whitespace(0);
# Don't smash every whitespace sequence into a single space.
$root->no_space_compacting(1);
With those lines added to our program, the output file ends up with
the original whitespace:
<html><head></head><body>
<h2 class="scream">Free Monkey</h2>
<h4 class="mutter">Inquire Within</h4>
<p>It's a monkey! <em>And it's free!</em></body>
</html>
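The effect of the two options can be checked side by side without any files. This is a rough sketch that parses a short string twice; the render helper and its keep_whitespace flag are hypothetical names used only here.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $html = "<html><body>\n<h3>One</h3>\n<h3>Two</h3>\n</body></html>\n";

# Parse $html and return it as HTML again, optionally with both
# whitespace-preserving options turned on before parsing starts.
sub render {
    my (%opt) = @_;
    my $root = HTML::TreeBuilder->new;
    if ($opt{keep_whitespace}) {
        $root->ignore_ignorable_whitespace(0);
        $root->no_space_compacting(1);
    }
    $root->parse($html);
    $root->eof;
    my $out = $root->as_HTML;
    $root->delete;
    return $out;
}

my $compact   = render();
my $preserved = render(keep_whitespace => 1);

print "Default:\n$compact\n\n";
print "Whitespace kept:\n$preserved\n";
```

The options must be set before any content is parsed, which is why the sketch uses new followed by parse and eof rather than a one-step constructor.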
An alternative is to have the as_HTML( ) method
try to indent the HTML as it prints it. This is achieved by calling
as_HTML like so:
print OUT $root->as_HTML(undef, " ");
This feature is still somewhat experimental, and its implementation
might change, but at the time of this writing, it makes the output
file's code look like this:
<html>
 <head>
 </head>
 <body>
  <h2 class="scream">Free Monkey</h2>
  <h4 class="mutter">Inquire Within</h4>
  <p>It's a monkey! <em>And it's free!</em></body>
</html>
10.1.2. Other HTML Options
Besides this indenting option, there are further options to
as_HTML( ), as described in Chapter 9, "HTML Processing with Trees".
One option controls whether omissible end-tags (such as </p> and
</li>) are printed. Another controls what characters are escaped using
&foo; sequences. Notably, by default, this
encodes all characters over ASCII 126, so for example,
as_HTML will print an é in the parse tree as
&eacute; (whether it came from a literal é or from an
&eacute;). This is always safe, but in cases
where you're dealing with text with a lot of Latin-1
or Unicode characters, having every one of those characters encoded
as a &foo; sequence might be bothersome to anyone reading the HTML
markup output.
In that case, your call to as_HTML can consist of
$root->as_HTML('<>&'), in which case
only the minimum of characters (<, >, and &) will be
escaped. There's no point in using these options (or
in preserving whitespace with ignore_ignorable_whitespace and
no_space_compacting) if you're
reasonably sure nobody will ever be looking at the resulting HTML.
But for cases where people might need to look at the HTML, these
options will make the code more inviting than just one huge block
of HTML.
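As a quick illustration of the escaping and end-tag options, the sketch below serializes one hand-built paragraph three ways. The sample text is made up, and passing an empty hash as the third argument (meaning "no end-tags are omissible") reflects our reading of the HTML::Element documentation.

```perl
use strict;
use warnings;
use HTML::Element;

binmode(STDOUT, ':encoding(UTF-8)');

# A paragraph containing two Latin-1 characters and an ampersand.
my $p = HTML::Element->new('p');
$p->push_content("caf\x{E9} & cr\x{E8}me");

my $default      = $p->as_HTML;                   # escape everything unsafe
my $minimal      = $p->as_HTML('<>&');            # escape only <, >, and &
my $all_end_tags = $p->as_HTML('<>&', undef, {}); # also print every end-tag

print "Default: $default\n";
print "Minimal: $minimal\n";
print "With all end-tags: $all_end_tags\n";
```

In the default form the accented letters come out as named entities; in the minimal form they are left as literal characters, so the output file must be written in an encoding that can represent them.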