"Fossies" - the Fresh Open Source Software Archive

Member "install-tl-20200916/tlpkg/tlperl/site/lib/HTML/HeadParser.pm" (5 Apr 2016, 8526 Bytes) of package /windows/misc/install-tl.zip:


As a special service "Fossies" has tried to format the requested source page into HTML format using (guessed) Perl source code syntax highlighting (style: standard) with prefixed line numbers and code folding option. Alternatively you can here view or download the uninterpreted source code file.

    1 package HTML::HeadParser;
    2 
    3 =head1 NAME
    4 
    5 HTML::HeadParser - Parse <HEAD> section of a HTML document
    6 
    7 =head1 SYNOPSIS
    8 
    9  require HTML::HeadParser;
   10  $p = HTML::HeadParser->new;
   11  $p->parse($text) and  print "not finished";
   12 
   13  $p->header('Title')          # to access <title>....</title>
   14  $p->header('Content-Base')   # to access <base href="http://...">
   15  $p->header('Foo')            # to access <meta http-equiv="Foo" content="...">
   16  $p->header('X-Meta-Author')  # to access <meta name="author" content="...">
   17  $p->header('X-Meta-Charset') # to access <meta charset="...">
   18 
   19 =head1 DESCRIPTION
   20 
   21 The C<HTML::HeadParser> is a specialized (and lightweight)
   22 C<HTML::Parser> that will only parse the E<lt>HEAD>...E<lt>/HEAD>
   23 section of an HTML document.  The parse() method
   24 will return a FALSE value as soon as some E<lt>BODY> element or body
   25 text are found, and should not be called again after this.
   26 
   27 Note that the C<HTML::HeadParser> might get confused if raw undecoded
   28 UTF-8 is passed to the parse() method.  Make sure the strings are
   29 properly decoded before passing them on.
   30 
   31 The C<HTML::HeadParser> keeps a reference to a header object, and the
   32 parser will update this header object as the various elements of the
   33 E<lt>HEAD> section of the HTML document are recognized.  The following
   34 header fields are affected:
   35 
   36 =over 4
   37 
   38 =item Content-Base:
   39 
   40 The I<Content-Base> header is initialized from the E<lt>base
   41 href="..."> element.
   42 
   43 =item Title:
   44 
   45 The I<Title> header is initialized from the E<lt>title>...E<lt>/title>
   46 element.
   47 
   48 =item Isindex:
   49 
   50 The I<Isindex> header will be added if there is a E<lt>isindex>
   51 element in the E<lt>head>.  The header value is initialized from the
   52 I<prompt> attribute if it is present.  If no I<prompt> attribute is
   53 given it will have '?' as the value.
   54 
   55 =item X-Meta-Foo:
   56 
   57 All E<lt>meta> elements containing a C<name> attribute will result in
   58 headers using the prefix C<X-Meta-> appended with the value of the
   59 C<name> attribute as the name of the header, and the value of the
   60 C<content> attribute as the pushed header value.
   61 
   62 E<lt>meta> elements containing a C<http-equiv> attribute will result
   63 in headers as in above, but without the C<X-Meta-> prefix in the
   64 header name.
   65 
   66 E<lt>meta> elements containing a C<charset> attribute will result in
   67 an C<X-Meta-Charset> header, using the value of the C<charset>
   68 attribute as the pushed header value.
   69 
   70 The ':' character can't be represented in header field names, so
   71 if the meta element contains this char it's substituted with '-'
   72 before forming the field name.
   73 
   74 =back
   75 
   76 =head1 METHODS
   77 
   78 The following methods (in addition to those provided by the
   79 superclass) are available:
   80 
   81 =over 4
   82 
   83 =cut
   84 
   85 
   86 require HTML::Parser;
   87 @ISA = qw(HTML::Parser);
   88 
   89 use HTML::Entities ();
   90 
   91 use strict;
   92 use vars qw($VERSION $DEBUG);
   93 #$DEBUG = 1;
   94 $VERSION = "3.71";
   95 
   96 =item $hp = HTML::HeadParser->new
   97 
   98 =item $hp = HTML::HeadParser->new( $header )
   99 
  100 The object constructor.  The optional $header argument should be a
  101 reference to an object that implement the header() and push_header()
  102 methods as defined by the C<HTTP::Headers> class.  Normally it will be
  103 of some class that is a or delegates to the C<HTTP::Headers> class.
  104 
  105 If no $header is given C<HTML::HeadParser> will create an
  106 C<HTTP::Headers> object by itself (initially empty).
  107 
  108 =cut
  109 
  110 sub new
  111 {
  112     my($class, $header) = @_;
  113     unless ($header) {
  114     require HTTP::Headers;
  115     $header = HTTP::Headers->new;
  116     }
  117 
  118     my $self = $class->SUPER::new(api_version => 3,
  119                   start_h => ["start", "self,tagname,attr"],
  120                   end_h   => ["end",   "self,tagname"],
  121                   text_h  => ["text",  "self,text"],
  122                   ignore_elements => [qw(script style)],
  123                  );
  124     $self->{'header'} = $header;
  125     $self->{'tag'} = '';   # name of active element that takes textual content
  126     $self->{'text'} = '';  # the accumulated text associated with the element
  127     $self;
  128 }
  129 
  130 =item $hp->header;
  131 
  132 Returns a reference to the header object.
  133 
  134 =item $hp->header( $key )
  135 
  136 Returns a header value.  It is just a shorter way to write
  137 C<$hp-E<gt>header-E<gt>header($key)>.
  138 
  139 =cut
  140 
  141 sub header
  142 {
  143     my $self = shift;
  144     return $self->{'header'} unless @_;
  145     $self->{'header'}->header(@_);
  146 }
  147 
  148 sub as_string    # legacy
  149 {
  150     my $self = shift;
  151     $self->{'header'}->as_string;
  152 }
  153 
  154 sub flush_text   # internal
  155 {
  156     my $self = shift;
  157     my $tag  = $self->{'tag'};
  158     my $text = $self->{'text'};
  159     $text =~ s/^\s+//;
  160     $text =~ s/\s+$//;
  161     $text =~ s/\s+/ /g;
  162     print "FLUSH $tag => '$text'\n"  if $DEBUG;
  163     if ($tag eq 'title') {
  164     my $decoded;
  165     $decoded = utf8::decode($text) if $self->utf8_mode && defined &utf8::decode;
  166     HTML::Entities::decode($text);
  167     utf8::encode($text) if $decoded;
  168     $self->{'header'}->push_header(Title => $text);
  169     }
  170     $self->{'tag'} = $self->{'text'} = '';
  171 }
  172 
  173 # This is an quote from the HTML3.2 DTD which shows which elements
  174 # that might be present in a <HEAD>...</HEAD>.  Also note that the
  175 # <HEAD> tags themselves might be missing:
  176 #
  177 # <!ENTITY % head.content "TITLE & ISINDEX? & BASE? & STYLE? &
  178 #                            SCRIPT* & META* & LINK*">
  179 #
  180 # <!ELEMENT HEAD O O  (%head.content)>
  181 #
  182 # From HTML 4.01:
  183 #
  184 # <!ENTITY % head.misc "SCRIPT|STYLE|META|LINK|OBJECT">
  185 # <!ENTITY % head.content "TITLE & BASE?">
  186 # <!ELEMENT HEAD O O (%head.content;) +(%head.misc;)>
  187 #
  188 # From HTML 5 as of WD-html5-20090825:
  189 #
  190 # One or more elements of metadata content, [...]
  191 # => base, command, link, meta, noscript, script, style, title
  192 
  193 sub start
  194 {
  195     my($self, $tag, $attr) = @_;  # $attr is reference to a HASH
  196     print "START[$tag]\n" if $DEBUG;
  197     $self->flush_text if $self->{'tag'};
  198     if ($tag eq 'meta') {
  199     my $key = $attr->{'http-equiv'};
  200     if (!defined($key) || !length($key)) {
  201         if ($attr->{name}) {
  202         $key = "X-Meta-\u$attr->{name}";
  203         } elsif ($attr->{charset}) { # HTML 5 <meta charset="...">
  204         $key = "X-Meta-Charset";
  205         $self->{header}->push_header($key => $attr->{charset});
  206         return;
  207         } else {
  208         return;
  209         }
  210     }
  211     $key =~ s/:/-/g;
  212     $self->{'header'}->push_header($key => $attr->{content});
  213     } elsif ($tag eq 'base') {
  214     return unless exists $attr->{href};
  215     (my $base = $attr->{href}) =~ s/^\s+//; $base =~ s/\s+$//; # HTML5
  216     $self->{'header'}->push_header('Content-Base' => $base);
  217     } elsif ($tag eq 'isindex') {
  218     # This is a non-standard header.  Perhaps we should just ignore
  219     # this element
  220     $self->{'header'}->push_header(Isindex => $attr->{prompt} || '?');
  221     } elsif ($tag =~ /^(?:title|noscript|object|command)$/) {
  222     # Just remember tag.  Initialize header when we see the end tag.
  223     $self->{'tag'} = $tag;
  224     } elsif ($tag eq 'link') {
  225     return unless exists $attr->{href};
  226     # <link href="http:..." rel="xxx" rev="xxx" title="xxx">
  227     my $href = delete($attr->{href});
  228     $href =~ s/^\s+//; $href =~ s/\s+$//; # HTML5
  229     my $h_val = "<$href>";
  230     for (sort keys %{$attr}) {
  231         next if $_ eq "/";  # XHTML junk
  232         $h_val .= qq(; $_="$attr->{$_}");
  233     }
  234     $self->{'header'}->push_header(Link => $h_val);
  235     } elsif ($tag eq 'head' || $tag eq 'html') {
  236     # ignore
  237     } else {
  238      # stop parsing
  239     $self->eof;
  240     }
  241 }
  242 
  243 sub end
  244 {
  245     my($self, $tag) = @_;
  246     print "END[$tag]\n" if $DEBUG;
  247     $self->flush_text if $self->{'tag'};
  248     $self->eof if $tag eq 'head';
  249 }
  250 
  251 sub text
  252 {
  253     my($self, $text) = @_;
  254     print "TEXT[$text]\n" if $DEBUG;
  255     unless ($self->{first_chunk}) {
  256     # drop Unicode BOM if found
  257     if ($self->utf8_mode) {
  258         $text =~ s/^\xEF\xBB\xBF//;
  259     }
  260     else {
  261         $text =~ s/^\x{FEFF}//;
  262     }
  263     $self->{first_chunk}++;
  264     }
  265     my $tag = $self->{tag};
  266     if (!$tag && $text =~ /\S/) {
  267     # Normal text means start of body
  268         $self->eof;
  269     return;
  270     }
  271     return if $tag ne 'title';
  272     $self->{'text'} .= $text;
  273 }
  274 
  275 BEGIN {
  276     *utf8_mode = sub { 1 } unless HTML::Entities::UNICODE_SUPPORT;
  277 }
  278 
  279 1;
  280 
  281 __END__
  282 
  283 =back
  284 
  285 =head1 EXAMPLE
  286 
  287  $h = HTTP::Headers->new;
  288  $p = HTML::HeadParser->new($h);
  289  $p->parse(<<EOT);
  290  <title>Stupid example</title>
  291  <base href="http://www.linpro.no/lwp/">
  292  Normal text starts here.
  293  EOT
  294  undef $p;
  295  print $h->title;   # should print "Stupid example"
  296 
  297 =head1 SEE ALSO
  298 
  299 L<HTML::Parser>, L<HTTP::Headers>
  300 
  301 The C<HTTP::Headers> class is distributed as part of the
  302 I<libwww-perl> package.  If you don't have that distribution installed
  303 you need to provide the $header argument to the C<HTML::HeadParser>
  304 constructor with your own object that implements the documented
  305 protocol.
  306 
  307 =head1 COPYRIGHT
  308 
  309 Copyright 1996-2001 Gisle Aas. All rights reserved.
  310 
  311 This library is free software; you can redistribute it and/or
  312 modify it under the same terms as Perl itself.
  313 
  314 =cut
  315