Chapter 7. HTML Processing with Tokens

Regular expressions are powerful, but they're a painfully low-level way of dealing with HTML. You're forced to worry about spaces and newlines, single and double quotes, HTML comments, and a lot more. The next step up from a regular expression is an HTML tokenizer. In this chapter, we'll use HTML::TokeParser to extract information from HTML files. Using these techniques, you can extract information from any HTML file, and never again have to worry about character-level trivia of HTML markup.

7.1. HTML as Tokens

Your experience with HTML code probably involves seeing raw text such as this:

<p>Dear Diary,
<br>I'm gonna be a superstar, because I'm learning to play
the <a href="http://MyBalalaika.com">balalaika</a> &amp; the <a
href='http://MyBazouki.com'>bazouki</a>!!!

The HTML::TokeParser module divides HTML into parsing units called tokens. The source code above is parsed as this series of tokens:

start-tag token
    p with no attributes
text token
    Dear Diary,\n
start-tag token
    br with no attributes
text token
    I'm gonna be a superstar, because I'm learning to play\nthe (with a trailing space)
start-tag token
    a, with attribute href whose value is http://MyBalalaika.com
text token
    balalaika
end-tag token
    a
text token
    &amp; the (with a space on either side), which means & the
start-tag token
    a, with attribute href whose value is http://MyBazouki.com
text token
    bazouki
end-tag token
    a
text token
    !!!\n
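
The token list above can be reproduced with a short script. What follows is a minimal sketch, assuming the HTML::TokeParser module (part of the HTML-Parser distribution) is installed. The helper name describe_tokens and the output format are illustrative choices, not something the module defines. Note that get_token returns the text of a text token as it appears in the source, so &amp; is still undecoded at this level.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = <<'END_HTML';
<p>Dear Diary,
<br>I'm gonna be a superstar, because I'm learning to play
the <a href="http://MyBalalaika.com">balalaika</a> &amp; the <a
href='http://MyBazouki.com'>bazouki</a>!!!
END_HTML

# Hypothetical helper: returns one description string per token in $source.
sub describe_tokens {
    my ($source) = @_;
    my $stream = HTML::TokeParser->new(\$source)
        or die "Couldn't create a token stream";
    my @lines;
    while (my $token = $stream->get_token) {
        my $type = $token->[0];
        if ($type eq 'S') {            # start-tag: ['S', $tag, \%attr, ...]
            my ($tag, $attr) = @{$token}[1, 2];
            my $attrs = join ', ',
                map { "$_=$attr->{$_}" } sort keys %$attr;
            push @lines, "start-tag: $tag"
                . (length($attrs) ? " ($attrs)" : '');
        }
        elsif ($type eq 'E') {         # end-tag: ['E', $tag, $text]
            push @lines, "end-tag: $token->[1]";
        }
        elsif ($type eq 'T') {         # text: ['T', $text, $is_data]
            (my $text = $token->[1]) =~ s/\n/\\n/g;   # make newlines visible
            push @lines, "text: $text";
        }
        # Comments, declarations, and processing instructions are skipped.
    }
    return @lines;
}

print "$_\n" for describe_tokens($html);
```

Both <a> start-tags are reported with a plain href value, with no trace of which quote character surrounded it in the source.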

This representation is more abstract: it focuses on markup concepts rather than individual characters. Although the two <a> tags use different quote characters around their attribute values in the raw HTML, as tokens each is simply a start-tag of type a with an href attribute of a particular value. A program that extracts information from a stream of tokens doesn't have to worry about the idiosyncrasies of entity encoding, whitespace, or quoting, or work out where a tag ends.
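
As a concrete illustration of that point, here is a small sketch that pulls every link out of the diary example without looking at quotes or whitespace. It assumes HTML::TokeParser is installed and uses its get_tag and get_trimmed_text methods. The function name extract_links is hypothetical, chosen for this example only.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: returns [url, link text] pairs,
# one per <a> element that has an href attribute.
sub extract_links {
    my ($source) = @_;
    my $stream = HTML::TokeParser->new(\$source)
        or die "Couldn't create a token stream";
    my @links;
    # get_tag('a') skips ahead to the next <a ...> start-tag and
    # returns [$tag, \%attr, \@attrseq, $text], or undef at the end.
    while (my $tag = $stream->get_tag('a')) {
        my $href = $tag->[1]{href};
        next unless defined $href;     # e.g. <a name="..."> anchors
        # Collect the text up to the closing </a>, entities decoded.
        push @links, [ $href, $stream->get_trimmed_text('/a') ];
    }
    return @links;
}

my $html = <<'END_HTML';
<p>Dear Diary,
<br>I'm gonna be a superstar, because I'm learning to play
the <a href="http://MyBalalaika.com">balalaika</a> &amp; the <a
href='http://MyBazouki.com'>bazouki</a>!!!
END_HTML

for my $link (extract_links($html)) {
    my ($url, $text) = @$link;
    print "$text => $url\n";
}
```

The double-quoted and single-quoted href values come back in the same form, and the line break inside the second <a> tag needs no special handling.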




Copyright © 2002 O'Reilly & Associates. All rights reserved.

 
 

