"Fossies" - the Fresh Open Source Software Archive

Member "dada/DADA/perllib/MIME/EncWords 2.pm" (6 Aug 2020, 34842 Bytes) of package /linux/www/dada-11_11_0.tar.gz:


As a special service "Fossies" has tried to format the requested source page into HTML format using (guessed) Perl source code syntax highlighting (style: standard) with prefixed line numbers and code folding option. Alternatively you can here view or download the uninterpreted source code file.

    1 #-*- perl -*-
    2 
    3 package MIME::EncWords;
    4 require 5.005;
    5 
    6 =head1 NAME
    7 
    8 MIME::EncWords - deal with RFC 2047 encoded words (improved)
    9 
   10 =head1 SYNOPSIS
   11 
   12 I<L<MIME::EncWords> is aimed to be another implimentation
   13 of L<MIME::Words> so that it will achieve more exact conformance with
   14 RFC 2047 (formerly RFC 1522) specifications.  Additionally, it contains
   15 some improvements.
   16 Following synopsis and descriptions are inherited from its inspirer,
   17 then added descriptions on improvements (B<**>) or changes and
   18 clarifications (B<*>).>
   19 
   20 Before reading further, you should see L<MIME::Tools> to make sure that
   21 you understand where this module fits into the grand scheme of things.
   22 Go on, do it now.  I'll wait.
   23 
   24 Ready?  Ok...
   25 
   26     use MIME::EncWords qw(:all);
   27 
   28     ### Decode the string into another string, forgetting the charsets:
   29     $decoded = decode_mimewords(
   30           'To: =?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= <keld@dkuug.dk>',
   31           );
   32 
   33     ### Split string into array of decoded [DATA,CHARSET] pairs:
   34     @decoded = decode_mimewords(
   35           'To: =?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= <keld@dkuug.dk>',
   36           );
   37 
   38     ### Encode a single unsafe word:
   39     $encoded = encode_mimeword("\xABFran\xE7ois\xBB");
   40 
   41     ### Encode a string, trying to find the unsafe words inside it:
   42     $encoded = encode_mimewords("Me and \xABFran\xE7ois\xBB in town");
   43 
   44 =head1 DESCRIPTION
   45 
   46 Fellow Americans, you probably won't know what the hell this module
   47 is for.  Europeans, Russians, et al, you probably do.  C<:-)>.
   48 
   49 For example, here's a valid MIME header you might get:
   50 
   51       From: =?US-ASCII?Q?Keith_Moore?= <moore@cs.utk.edu>
   52       To: =?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= <keld@dkuug.dk>
   53       CC: =?ISO-8859-1?Q?Andr=E9_?= Pirard <PIRARD@vm1.ulg.ac.be>
   54       Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
   55        =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=
   56        =?US-ASCII?Q?.._cool!?=
   57 
   58 The fields basically decode to (sorry, I can only approximate the
   59 Latin characters with 7 bit sequences /o and 'e):
   60 
   61       From: Keith Moore <moore@cs.utk.edu>
   62       To: Keld J/orn Simonsen <keld@dkuug.dk>
   63       CC: Andr'e  Pirard <PIRARD@vm1.ulg.ac.be>
   64       Subject: If you can read this you understand the example... cool!
   65 
   66 B<Supplement>: Fellow Americans, Europeans, you probably won't know
   67 what the hell this module is for.  East Asians, et al, you probably do.
   68 C<(^_^)>.
   69 
   70 For example, here's a valid MIME header you might get:
   71 
   72       Subject: =?EUC-KR?B?sNTAuLinKGxhemluZXNzKSwgwvzB9ri7seIoaW1w?=
   73        =?EUC-KR?B?YXRpZW5jZSksILGzuLgoaHVicmlzKQ==?=
   74 
   75 The fields basically decode to (sorry, I cannot approximate the
   76 non-Latin multibyte characters with any 7 bit sequences):
   77 
   78       Subject: ???(laziness), ????(impatience), ??(hubris)
   79 
   80 =head1 PUBLIC INTERFACE
   81 
   82 =over 4
   83 
   84 =cut
   85 
   86 ### Pragmas:
   87 use strict;
   88 use vars qw($VERSION @EXPORT_OK %EXPORT_TAGS @ISA $Config);
   89 
   90 ### Exporting:
   91 use Exporter;
   92 
   93 %EXPORT_TAGS = (all => [qw(decode_mimewords
   94                encode_mimeword
   95                encode_mimewords)]);
   96 Exporter::export_ok_tags(qw(all));
   97 
   98 ### Inheritance:
   99 @ISA = qw(Exporter);
  100 
  101 ### Other modules:
  102 use Carp qw(croak carp);
  103 use MIME::Base64;
  104 use MIME::Charset qw(:trans);
  105 
  106 my @ENCODE_SUBS = qw(FB_CROAK is_utf8 resolve_alias);
  107 if (MIME::Charset::USE_ENCODE) {
  108     eval "use ".MIME::Charset::USE_ENCODE." \@ENCODE_SUBS;";
  109     if ($@) { # Perl 5.7.3 + Encode 0.40
  110     eval "use ".MIME::Charset::USE_ENCODE." qw(is_utf8);";
  111     require MIME::Charset::_Compat;
  112     for my $sub (@ENCODE_SUBS) {
  113         no strict "refs";
  114         *{$sub} = \&{"MIME::Charset::_Compat::$sub"}
  115         unless $sub eq 'is_utf8';
  116     }
  117     }
  118 } else {
  119     require MIME::Charset::_Compat;
  120     for my $sub (@ENCODE_SUBS) {
  121         no strict "refs";
  122         *{$sub} = \&{"MIME::Charset::_Compat::$sub"};
  123     }
  124 }
  125 
  126 #------------------------------
  127 #
  128 # Globals...
  129 #
  130 #------------------------------
  131 
  132 ### The package version, both in 1.23 style *and* usable by MakeMaker:
  133 $VERSION = '1.012.6';
  134 
  135 ### Public Configuration Attributes
  136 $Config = {
  137     %{$MIME::Charset::Config}, # Detect7bit, Replacement, Mapping
  138     Charset => 'ISO-8859-1',
  139     Encoding => 'A',
  140     Field => undef,
  141     Folding => "\n",
  142     MaxLineLen => 76,
  143     Minimal => 'YES',
  144 };
  145 eval { require MIME::EncWords::Defaults; };
  146 
  147 ### Private Constants
  148 
  149 my $PRINTABLE = "\\x21-\\x7E";
  150 #my $NONPRINT = "\\x00-\\x1F\\x7F-\\xFF";
  151 my $NONPRINT = qr{[^$PRINTABLE]}; # Improvement: Unicode support.
  152 my $UNSAFE = qr{[^\x01-\x20$PRINTABLE]};
  153 my $WIDECHAR = qr{[^\x00-\xFF]};
  154 my $ASCIITRANS = qr{^(?:HZ-GB-2312|UTF-7)$}i;
  155 my $DISPNAMESPECIAL = "\\x22(),:;<>\\x40\\x5C"; # RFC5322 name-addr specials.
  156 
  157 #------------------------------
  158 
  159 # _decode_B STRING
  160 #     Private: used by _decode_header() to decode "B" encoding.
  161 #     Improvement by this module: sanity check on encoded sequence.
  162 sub _decode_B {
  163     my $str = shift;
  164     unless ((length($str) % 4 == 0) and
  165     $str =~ m|^[A-Za-z0-9+/]+={0,2}$|) {
  166     return undef;
  167     }
  168     return decode_base64($str);
  169 }
  170 
  171 # _decode_Q STRING
  172 #     Private: used by _decode_header() to decode "Q" encoding, which is
  173 #     almost, but not exactly, quoted-printable.  :-P
  174 #     Improvement by this module: sanity check on encoded sequence (>=1.012.3).
  175 sub _decode_Q {
  176     my $str = shift;
  177     if ($str =~ /=(?![0-9a-fA-F][0-9a-fA-F])/) { #XXX:" " and "\t" are allowed
  178     return undef;
  179     }
  180     $str =~ s/_/\x20/g;                 # RFC 2047, Q rule 2
  181     $str =~ s/=([0-9a-fA-F]{2})/pack("C", hex($1))/ge;  # RFC 2047, Q rule 1
  182     $str;
  183 }
  184 
  185 # _encode_B STRING
  186 #     Private: used by encode_mimeword() to encode "B" encoding.
  187 sub _encode_B {
  188     my $str = shift;
  189     encode_base64($str, '');
  190 }
  191 
  192 # _encode_Q STRING
  193 #     Private: used by encode_mimeword() to encode "Q" encoding, which is
  194 #     almost, but not exactly, quoted-printable.  :-P
  195 #     Improvement by this module: Spaces are escaped by ``_''.
  196 sub _encode_Q {
  197     my $str = shift;
  198     # Restrict characters to those listed in RFC 2047 section 5 (3)
  199     $str =~ s{[^-!*+/0-9A-Za-z]}{
  200     $& eq "\x20"? "_": sprintf("=%02X", ord($&))
  201     }eog;
  202     $str;
  203 }
  204 
  205 #------------------------------
  206 
  207 =item decode_mimewords ENCODED, [OPTS...]
  208 
  209 I<Function.>
  210 Go through the string looking for RFC 2047-style "Q"
  211 (quoted-printable, sort of) or "B" (base64) encoding, and decode them.
  212 
  213 B<In an array context,> splits the ENCODED string into a list of decoded
  214 C<[DATA, CHARSET]> pairs, and returns that list.  Unencoded
  215 data are returned in a 1-element array C<[DATA]>, giving an effective
  216 CHARSET of C<undef>.
  217 
  218     $enc = '=?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= <keld@dkuug.dk>';
  219     foreach (decode_mimewords($enc)) {
  220         print "", ($_[1] || 'US-ASCII'), ": ", $_[0], "\n";
  221     }
  222 
  223 B<**>
  224 However, adjacent encoded-words with same charset will be concatenated
  225 to handle multibyte sequences safely.
  226 
  227 B<**>
  228 Language information defined by RFC2231, section 5 will be additonal
  229 third element, if any.
  230 
  231 B<*>
  232 Whitespaces surrounding unencoded data will not be stripped so that
  233 compatibility with L<MIME::Words> will be ensured.
  234 
  235 B<In a scalar context,> joins the "data" elements of the above
  236 list together, and returns that.  I<Warning: this is information-lossy,>
  237 and probably I<not> what you want, but if you know that all charsets
  238 in the ENCODED string are identical, it might be useful to you.
  239 (Before you use this, please see L<MIME::WordDecoder/unmime>,
  240 which is probably what you want.)
  241 B<**>
  242 See also "Charset" option below.
  243 
  244 In the event of a syntax error, $@ will be set to a description
  245 of the error, but parsing will continue as best as possible (so as to
  246 get I<something> back when decoding headers).
  247 $@ will be false if no error was detected.
  248 
  249 B<*>
  250 Malformed encoded-words will be kept encoded.
  251 In this case $@ will be set.
  252 
  253 Any arguments past the ENCODED string are taken to define a hash of options.
  254 B<**>
  255 When Unicode/multibyte support is disabled
  256 (see L<MIME::Charset/USE_ENCODE>),
  257 these options will not have any effects.
  258 
  259 =over 4
  260 
  261 =item Charset
  262 B<**>
  263 
  264 Name of character set by which data elements in scalar context
  265 will be converted.
  266 The default is no conversion.
  267 If this option is specified as special value C<"_UNICODE_">,
  268 returned value will be Unicode string.
  269 
  270 B<Note>:
  271 This feature is still information-lossy, I<except> when C<"_UNICODE_"> is
  272 specified.
  273 
  274 =item Detect7bit
  275 B<**>
  276 
  277 Try to detect 7-bit charset on unencoded portions.
  278 Default is C<"YES">.
  279 
  280 =cut
  281 
  282 #=item Field
  283 #
  284 #Name of the mail field this string came from.  I<Currently ignored.>
  285 
  286 =item Mapping
  287 B<**>
  288 
  289 In scalar context, specify mappings actually used for charset names.
  290 C<"EXTENDED"> uses extended mappings.
  291 C<"STANDARD"> uses standardized strict mappings.
  292 Default is C<"EXTENDED">.
  293 
  294 =back
  295 
  296 =cut
  297 
  298 sub decode_mimewords {
  299     my $encstr = shift;
  300     my %params = @_;
  301     my %Params = &_getparams(\%params,
  302                  NoDefault => [qw(Charset)], # default is no conv.
  303                  YesNo => [qw(Detect7bit)],
  304                  Others => [qw(Mapping)],
  305                  Obsoleted => [qw(Field)],
  306                  ToUpper => [qw(Charset Mapping)],
  307                 );
  308     my $cset = MIME::Charset->new($Params{Charset},
  309                   Mapping => $Params{Mapping});
  310     # unfolding: normalize linear-white-spaces and orphan newlines.
  311     $encstr =~ s/(?:[\r\n]+[\t ])*[\r\n]+([\t ]|\Z)/$1? " ": ""/eg;
  312     $encstr =~ s/[\r\n]+/ /g;
  313 
  314     my @tokens;
  315     $@ = '';           ### error-return
  316 
  317     ### Decode:
  318     my ($word, $charset, $language, $encoding, $enc, $dec);
  319     my $spc = '';
  320     pos($encstr) = 0;
  321     while (1) {
  322         last if (pos($encstr) >= length($encstr));
  323         my $pos = pos($encstr);               ### save it
  324 
  325         ### Case 1: are we looking at "=?..?..?="?
  326         if ($encstr =~    m{\G             # from where we left off..
  327                             =\?([^?]*)     # "=?" + charset +
  328                              \?([bq])      #  "?" + encoding +
  329                              \?([^?]+)     #  "?" + data maybe with spcs +
  330                              \?=           #  "?="
  331                  ([\r\n\t ]*)
  332                             }xgi) {
  333         ($word, $charset, $encoding, $enc) = ($&, $1, lc($2), $3);
  334         my $tspc = $4;
  335 
  336         # RFC 2231 section 5 extension
  337         if ($charset =~ s/^([^\*]*)\*(.*)/$1/) {
  338         $language = $2 || undef;
  339         $charset ||= undef;
  340         } else {
  341         $language = undef;
  342         }
  343 
  344         if ($encoding eq 'q') {
  345         $dec = _decode_Q($enc);
  346         } else {
  347         $dec = _decode_B($enc);
  348         }
  349         unless (defined $dec) {
  350         $@ .= qq|Illegal sequence in "$word" (pos $pos)\n|;
  351         push @tokens, [$spc.$word];
  352         $spc = '';
  353         next;
  354         }
  355 
  356       { local $@;
  357         if (scalar(@tokens) and
  358         lc($charset || "") eq lc($tokens[-1]->[1] || "") and
  359         resolve_alias($charset) and
  360         (!${tokens[-1]}[2] and !$language or
  361          lc(${tokens[-1]}[2]) eq lc($language))) { # Concat words if possible.
  362         $tokens[-1]->[0] .= $dec;
  363         } elsif ($language) {
  364         push @tokens, [$dec, $charset, $language];
  365         } elsif ($charset) {
  366         push @tokens, [$dec, $charset];
  367         } else {
  368         push @tokens, [$dec];
  369         }
  370         $spc = $tspc;
  371       }
  372             next;
  373         }
  374 
  375         ### Case 2: are we looking at a bad "=?..." prefix?
  376         ### We need this to detect problems for case 3, which stops at "=?":
  377         pos($encstr) = $pos;               # reset the pointer.
  378         if ($encstr =~ m{\G=\?}xg) {
  379             $@ .= qq|unterminated "=?..?..?=" in "$encstr" (pos $pos)\n|;
  380             push @tokens, [$spc.'=?'];
  381         $spc = '';
  382             next;
  383         }
  384 
  385         ### Case 3: are we looking at ordinary text?
  386         pos($encstr) = $pos;               # reset the pointer.
  387         if ($encstr =~ m{\G                # from where we left off...
  388                          (.*?              #   shortest possible string,
  389                           \n*)             #   followed by 0 or more NLs,
  390                          (?=(\Z|=\?))      # terminated by "=?" or EOS
  391                         }xgs) {
  392             length($1) or croak "MIME::EncWords: internal logic err: empty token\n";
  393             push @tokens, [$spc.$1];
  394         $spc = '';
  395             next;
  396         }
  397 
  398         ### Case 4: bug!
  399         croak "MIME::EncWords: unexpected case:\n($encstr) pos $pos\n\t".
  400             "Please alert developer.\n";
  401     }
  402     push @tokens, [$spc] if $spc;
  403 
  404     # Detect 7-bit charset
  405     if ($Params{Detect7bit} ne "NO") {
  406     local $@;
  407     foreach my $t (@tokens) {
  408         unless ($t->[0] =~ $UNSAFE or $t->[1]) {
  409         my $charset = MIME::Charset::_detect_7bit_charset($t->[0]);
  410         if ($charset and $charset ne &MIME::Charset::default()) {
  411             $t->[1] = $charset;
  412         }
  413         }
  414     }
  415     }
  416 
  417     if (wantarray) {
  418     @tokens;
  419     } else {
  420     join('', map {
  421         &_convert($_->[0], $_->[1], $cset, $Params{Mapping})
  422     } @tokens);
  423     }
  424 }
  425 
  426 #------------------------------
  427 
  428 # _convert RAW, FROMCHARSET, TOCHARSET, MAPPING
  429 #     Private: used by decode_mimewords() to convert string by other charset
  430 #     or to decode to Unicode.
  431 #     When source charset is unknown and Unicode string is requested, at first
  432 #     try well-formed UTF-8 then fallback to ISO-8859-1 so that almost all
  433 #     non-ASCII bytes will be preserved.
  434 sub _convert($$$$) {
  435     my $s = shift;
  436     my $charset = shift;
  437     my $cset = shift;
  438     my $mapping = shift;
  439     return $s unless &MIME::Charset::USE_ENCODE;
  440     return $s unless $cset->as_string;
  441     croak "unsupported charset ``".$cset->as_string."''"
  442     unless $cset->decoder or $cset->as_string eq "_UNICODE_";
  443 
  444     local($@);
  445     $charset = MIME::Charset->new($charset, Mapping => $mapping);
  446     if ($charset->as_string and $charset->as_string eq $cset->as_string) {
  447     return $s;
  448     }
  449     # build charset object to transform string from $charset to $cset.
  450     $charset->encoder($cset);
  451 
  452     my $converted = $s;
  453     if (is_utf8($s) or $s =~ $WIDECHAR) {
  454     if ($charset->output_charset ne "_UNICODE_") {
  455         $converted = $charset->encode($s);
  456     }
  457     } elsif ($charset->output_charset eq "_UNICODE_") {
  458     if (!$charset->decoder) {
  459         if ($s =~ $UNSAFE) {
  460         $@ = '';
  461         eval {
  462             $charset = MIME::Charset->new("UTF-8",
  463                           Mapping => 'STANDARD');
  464             $converted = $charset->decode($converted, FB_CROAK());
  465         };
  466         if ($@) {
  467             $converted = $s;
  468             $charset = MIME::Charset->new("ISO-8859-1",
  469                           Mapping => 'STANDARD');
  470             $converted = $charset->decode($converted, 0);
  471         }
  472         }
  473     } else {
  474         $converted = $charset->decode($s);
  475     }
  476     } elsif ($charset->decoder) {
  477     $converted = $charset->encode($s);
  478     }
  479     return $converted;
  480 }
  481 
  482 #------------------------------
  483 
  484 =item encode_mimeword RAW, [ENCODING], [CHARSET]
  485 
  486 I<Function.>
  487 Encode a single RAW "word" that has unsafe characters.
  488 The "word" will be encoded in its entirety.
  489 
  490     ### Encode "<<Franc,ois>>":
  491     $encoded = encode_mimeword("\xABFran\xE7ois\xBB");
  492 
  493 You may specify the ENCODING (C<"Q"> or C<"B">), which defaults to C<"Q">.
  494 B<**>
  495 You may also specify it as ``special'' value: C<"S"> to choose shorter
  496 one of either C<"Q"> or C<"B">.
  497 
  498 You may specify the CHARSET, which defaults to C<iso-8859-1>.
  499 
  500 B<*>
  501 Spaces will be escaped with ``_'' by C<"Q"> encoding.
  502 
  503 =cut
  504 
  505 sub encode_mimeword {
  506     my $word = shift;
  507     my $encoding = uc(shift || 'Q');          # not overridden.
  508     my $charset  = shift || 'ISO-8859-1';     # ditto.
  509     my $language = uc(shift || "");       # ditto.
  510 
  511     if (ref $charset) {
  512     if (is_utf8($word) or $word =~ /$WIDECHAR/) {
  513         $word = $charset->undecode($word, 0);
  514     }
  515     $charset = $charset->as_string;
  516     } else {
  517     $charset = uc($charset);
  518     }
  519     my $encstr;
  520     if ($encoding eq 'Q') {
  521     $encstr = &_encode_Q($word);
  522     } elsif ($encoding eq "S") {
  523     my ($B, $Q) = (&_encode_B($word), &_encode_Q($word));
  524     if (length($B) < length($Q)) {
  525         $encoding = "B";
  526         $encstr = $B;
  527     } else {
  528         $encoding = "Q";
  529         $encstr = $Q;
  530     }
  531     } else { # "B"
  532     $encoding = "B";
  533     $encstr = &_encode_B($word);
  534     }
  535 
  536     if ($language) {
  537     return "=?$charset*$language?$encoding?$encstr?=";
  538     } else {
  539     return "=?$charset?$encoding?$encstr?=";
  540     }
  541 }
  542 
  543 #------------------------------
  544 
  545 =item encode_mimewords RAW, [OPTS]
  546 
  547 I<Function.>
  548 Given a RAW string, try to find and encode all "unsafe" sequences
  549 of characters:
  550 
  551     ### Encode a string with some unsafe "words":
  552     $encoded = encode_mimewords("Me and \xABFran\xE7ois\xBB");
  553 
  554 Returns the encoded string.
  555 
  556 B<**>
  557 RAW may be a Unicode string when Unicode/multibyte support is enabled
  558 (see L<MIME::Charset/USE_ENCODE>).
  559 Furthermore, RAW may be a reference to that returned
  560 by L</decode_mimewords> on array context.  In latter case "Charset"
  561 option (see below) will be overridden (see also a note below).
  562 
  563 B<Note>:
  564 B<*>
  565 When RAW is an arrayref,
  566 adjacent encoded-words (i.e. elements having non-ASCII charset element)
  567 are concatenated.  Then they are split taking
  568 care of character boundaries of multibyte sequences when Unicode/multibyte
  569 support is enabled.
  570 Portions for unencoded data should include surrounding whitespace(s), or
  571 they will be merged into adjoining encoded-word(s).
  572 
  573 Any arguments past the RAW string are taken to define a hash of options:
  574 
  575 =over 4
  576 
  577 =item Charset
  578 
  579 Encode all unsafe stuff with this charset.  Default is 'ISO-8859-1',
  580 a.k.a. "Latin-1".
  581 
  582 =item Detect7bit
  583 B<**>
  584 
  585 When "Encoding" option (see below) is specified as C<"a"> and "Charset"
  586 option is unknown, try to detect 7-bit charset on given RAW string.
  587 Default is C<"YES">.
  588 When Unicode/multibyte support is disabled,
  589 this option will not have any effects
  590 (see L<MIME::Charset/USE_ENCODE>).
  591 
  592 =item Encoding
  593 
  594 The encoding to use, C<"q"> or C<"b">.
  595 B<**>
  596 You may also specify ``special'' values: C<"a"> will automatically choose
  597 recommended encoding to use (with charset conversion if alternative
  598 charset is recommended: see L<MIME::Charset>);
  599 C<"s"> will choose shorter one of either C<"q"> or C<"b">.
  600 B<Note>:
  601 B<*>
  602 As of release 1.005, The default was changed from C<"q">
  603 (the default on MIME::Words) to C<"a">.
  604 
  605 =item Field
  606 
  607 Name of the mail field this string will be used in.
  608 B<**>
  609 Length of mail field name will be considered in the first line of
  610 encoded header.
  611 
  612 =item Folding
  613 B<**>
  614 
  615 A Sequence to fold encoded lines.  The default is C<"\n">.
  616 If empty string C<""> is specified, encoded-words exceeding line length
  617 (see L</MaxLineLen> below) will be split by SPACE.
  618 
  619 B<Note>:
  620 B<*>
  621 Though RFC 5322 (formerly RFC 2822) states that the lines in
  622 Internet messages are delimited by CRLF (C<"\r\n">), 
  623 this module chose LF (C<"\n">) as a default to keep backward compatibility.
  624 When you use the default, you might need converting newlines
  625 before encoded headers are thrown into session.
  626 
  627 =item Mapping
  628 B<**>
  629 
  630 Specify mappings actually used for charset names.
  631 C<"EXTENDED"> uses extended mappings.
  632 C<"STANDARD"> uses standardized strict mappings.
  633 The default is C<"EXTENDED">.
  634 When Unicode/multibyte support is disabled,
  635 this option will not have any effects
  636 (see L<MIME::Charset/USE_ENCODE>).
  637 
  638 =item MaxLineLen
  639 B<**>
  640 
  641 Maximum line length excluding newline.
  642 The default is 76.
  643 Negative value means unlimited line length (as of release 1.012.3).
  644 
  645 =item Minimal
  646 B<**>
  647 
  648 Takes care of natural word separators (i.e. whitespaces)
  649 in the text to be encoded.
  650 If C<"NO"> is specified, this module will encode whole text
  651 (if encoding needed) not regarding whitespaces;
  652 encoded-words exceeding line length will be split based only on their
  653 lengths.
  654 Default is C<"YES"> by which minimal portions of text are encoded.
  655 If C<"DISPNAME"> is specified, portions including special characters
  656 described in RFC5322 (formerly RFC2822, RFC822) address specification
  657 (section 3.4) are also encoded.
  658 This is useful for encoding display-name of address fields.
  659 
  660 B<Note>:
  661 As of release 0.040, default has been changed to C<"YES"> to ensure
  662 compatibility with MIME::Words.
  663 On earlier releases, this option was fixed to be C<"NO">.
  664 
  665 B<Note>:
  666 C<"DISPNAME"> option was introduced at release 1.012.
  667 
  668 =item Replacement
  669 B<**>
  670 
  671 See L<MIME::Charset/Error Handling>.
  672 
  673 =back
  674 
  675 =cut
  676 
  677 sub encode_mimewords  {
  678     my $words = shift;
  679     my %params = @_;
  680     my %Params = &_getparams(\%params,
  681                  YesNo => [qw(Detect7bit)],
  682                  Others => [qw(Charset Encoding Field Folding
  683                        Mapping MaxLineLen Minimal
  684                        Replacement)],
  685                  ToUpper => [qw(Charset Encoding Mapping Minimal
  686                         Replacement)],
  687                 );
  688     croak "unsupported encoding ``$Params{Encoding}''"
  689     unless $Params{Encoding} =~ /^[ABQS]$/;
  690     # newline and following WSP
  691     my ($fwsbrk, $fwsspc);
  692     if ($Params{Folding} =~ m/^([\r\n]*)([\t ]?)$/) {
  693     $fwsbrk = $1;
  694     $fwsspc = $2 || " ";
  695     } else {
  696     croak sprintf "illegal folding sequence ``\\x%*v02X''", '\\x',
  697               $Params{Folding};
  698     }
  699     # charset objects
  700     my $charsetobj = MIME::Charset->new($Params{Charset},
  701                     Mapping => $Params{Mapping});
  702     my $ascii = MIME::Charset->new("US-ASCII", Mapping => 'STANDARD');
  703     $ascii->encoder($ascii);
  704     # lengths
  705     my $firstlinelen = $Params{MaxLineLen} -
  706     ($Params{Field}? length("$Params{Field}: "): 0);
  707     my $maxrestlen = $Params{MaxLineLen} - length($fwsspc);
  708     # minimal encoding flag
  709     if (!$Params{Minimal}) {
  710     $Params{Minimal} = 'NO';
  711     } elsif ($Params{Minimal} !~ /^(NO|DISPNAME)$/) {
  712     $Params{Minimal} = 'YES';
  713     }
  714     # unsafe ASCII sequences
  715     my $UNSAFEASCII = ($maxrestlen <= 1)?
  716     qr{(?: =\? )}ox:
  717     qr{(?: =\? | [$PRINTABLE]{$Params{MaxLineLen}} )}ox;
  718     $UNSAFEASCII = qr{(?: [$DISPNAMESPECIAL] | $UNSAFEASCII )}ox
  719     if $Params{Minimal} eq 'DISPNAME';
  720 
  721     unless (ref($words) eq "ARRAY") {
  722     my @words = ();
  723     # unfolding: normalize linear-white-spaces and orphan newlines.
  724     $words =~ s/(?:[\r\n]+[\t ])*[\r\n]+([\t ]|\Z)/$1? " ": ""/eg;
  725     $words =~ s/[\r\n]+/ /g;
  726     # split if required
  727     if ($Params{Minimal} =~ /YES|DISPNAME/) {
  728         my ($spc, $unsafe_last) = ('', 0);
  729         foreach my $w (split(/([\t ]+)/, $words)) {
  730         next unless scalar(@words) or length($w); # skip garbage
  731         if ($w =~ /[\t ]/) {
  732             $spc = $w;
  733             next;
  734         }
  735 
  736         # workaround for ``ASCII transformation'' charsets
  737         my $u = $w;
  738         if ($charsetobj->as_string =~ /$ASCIITRANS/) {
  739             if (MIME::Charset::USE_ENCODE) {
  740             if (is_utf8($w) or $w =~ /$WIDECHAR/) {
  741                 $w = $charsetobj->undecode($u);
  742             } else {
  743                 $u = $charsetobj->decode($w);
  744             }
  745             } elsif ($w =~ /[+~]/) { #FIXME: for pre-Encode environment
  746                 $u = "x$w";
  747             }
  748         }
  749         if (scalar(@words)) {
  750             if (($w =~ /$NONPRINT|$UNSAFEASCII/ or $u ne $w) xor
  751             $unsafe_last) {
  752             if ($unsafe_last) {
  753                 push @words, $spc.$w;
  754             } else {
  755                 $words[-1] .= $spc;
  756                 push @words, $w;
  757             }
  758             $unsafe_last = not $unsafe_last;
  759             } else {
  760             $words[-1] .= $spc.$w;
  761             }
  762         } else {
  763             push @words, $spc.$w;
  764             $unsafe_last =
  765             ($w =~ /$NONPRINT|$UNSAFEASCII/ or $u ne $w);
  766         }
  767         $spc = '';
  768         }
  769         if ($spc) {
  770         if (scalar(@words)) {
  771             $words[-1] .= $spc;
  772         } else { # only WSPs
  773             push @words, $spc;
  774         }
  775         }
  776     } else {
  777         @words = ($words);
  778     }
  779     $words = [map { [$_, $Params{Charset}] } @words];
  780     }
  781 
  782     # Translate / concatenate words.
  783     my @triplets;
  784     foreach (@$words) {
  785     my ($s, $cset) = @$_;
  786     next unless length($s);
  787     my $csetobj = MIME::Charset->new($cset || "",
  788                      Mapping => $Params{Mapping});
  789     # determine charset and encoding
  790     # try defaults only if 7-bit charset detection is not required
  791     my $enc;
  792     my $obj = $csetobj;
  793     unless ($obj->as_string) {
  794         if ($Params{Encoding} ne "A" or $Params{Detect7bit} eq "NO" or
  795         $s =~ /$UNSAFE/) {
  796         $obj = $charsetobj;
  797         }
  798     }
  799     ($s, $cset, $enc) =
  800         $obj->header_encode($s,
  801                 Detect7bit => $Params{Detect7bit},
  802                 Replacement => $Params{Replacement},
  803                 Encoding => $Params{Encoding});
  804     # Resolve 'S' encoding based on global length. See (*).
  805     $enc = 'S'
  806         if defined $enc and
  807            ($Params{Encoding} eq 'S' or
  808         $Params{Encoding} eq 'A' and $obj->header_encoding eq 'S');
  809 
  810     # pure ASCII
  811     if ($cset eq "US-ASCII" and !$enc and $s =~ /$UNSAFEASCII/) {
  812         # pure ASCII with unsafe sequences should be encoded
  813         $cset = $csetobj->output_charset ||
  814         $charsetobj->output_charset ||
  815         $ascii->output_charset;
  816         $csetobj = MIME::Charset->new($cset,
  817                       Mapping => $Params{Mapping});
  818         # Preserve original Encoding option unless it was 'A'.
  819         $enc = ($Params{Encoding} eq 'A') ?
  820            ($csetobj->header_encoding || 'Q') :
  821            $Params{Encoding};
  822     } else {
  823         $csetobj = MIME::Charset->new($cset,
  824                       Mapping => $Params{Mapping});
  825     }
  826 
  827     # Now no charset translations are needed.
  828     $csetobj->encoder($csetobj);
  829 
  830     # Concatenate adjacent ``words'' so that multibyte sequences will
  831     # be handled safely.
  832     # Note: Encoded-word and unencoded text must not adjoin without
  833     # separating whitespace(s).
  834     if (scalar(@triplets)) {
  835         my ($last, $lastenc, $lastcsetobj) = @{$triplets[-1]};
  836         if ($csetobj->decoder and
  837         ($lastcsetobj->as_string || "") eq $csetobj->as_string and
  838         ($lastenc || "") eq ($enc || "")) {
  839         $triplets[-1]->[0] .= $s;
  840         next;
  841         } elsif (!$lastenc and $enc and $last !~ /[\r\n\t ]$/) {
  842         if ($last =~ /^(.*)([\r\n\t ])([$PRINTABLE]+)$/s) {
  843             $triplets[-1]->[0] = $1.$2;
  844             $s = $3.$s;
  845         } elsif ($lastcsetobj->as_string eq "US-ASCII") {
  846             $triplets[-1]->[0] .= $s;
  847             $triplets[-1]->[1] = $enc;
  848             $triplets[-1]->[2] = $csetobj;
  849             next;
  850         }
  851         } elsif ($lastenc and !$enc and $s !~ /^[\r\n\t ]/) {
  852         if ($s =~ /^([$PRINTABLE]+)([\r\n\t ])(.*)$/s) {
  853             $triplets[-1]->[0] .= $1;
  854             $s = $2.$3;
  855         } elsif ($csetobj->as_string eq "US-ASCII") {
  856             $triplets[-1]->[0] .= $s;
  857             next;
  858         }
  859         }
  860     }
  861     push @triplets, [$s, $enc, $csetobj];
  862     }
  863 
  864     # (*) Resolve 'S' encoding based on global length.
  865     my @s_enc = grep { $_->[1] and $_->[1] eq 'S' } @triplets;
  866     if (scalar @s_enc) {
  867     my $enc;
  868     my $b = scalar grep { $_->[1] and $_->[1] eq 'B' } @triplets;
  869     my $q = scalar grep { $_->[1] and $_->[1] eq 'Q' } @triplets;
  870     # 'A' chooses 'B' or 'Q' when all other encoded-words have same enc.
  871     if ($Params{Encoding} eq 'A' and $b and ! $q) {
  872         $enc = 'B';
  873     } elsif ($Params{Encoding} eq 'A' and ! $b and $q) {
  874         $enc = 'Q';
  875     # Otherwise, assuming 'Q', when characters to be encoded are more than
  876     # 6th of total (plus a little fraction), 'B' will win.
  877     # Note: This might give 'Q' so great advantage...
  878     } else {
  879         my @no_enc = grep { ! $_->[1] } @triplets;
  880         my $total = length join('', map { $_->[0] } (@no_enc, @s_enc));
  881         my $q = scalar(() = join('', map { $_->[0] } @s_enc) =~
  882                m{[^- !*+/0-9A-Za-z]}g);
  883         if ($total + 8 < $q * 6) {
  884         $enc = 'B';
  885         } else {
  886         $enc = 'Q';
  887         }
  888     }
  889         foreach (@triplets) {
  890         $_->[1] = $enc if $_->[1] and $_->[1] eq 'S';
  891     }
  892     }
  893 
  894     # chop leading FWS
  895     while (scalar(@triplets) and $triplets[0]->[0] =~ s/^[\r\n\t ]+//) {
  896     shift @triplets unless length($triplets[0]->[0]);
  897     }
  898 
  899     # Split long ``words''.
  900     my @splitwords;
  901     my $restlen;
  902     if ($Params{MaxLineLen} < 0) {
  903       @splitwords = @triplets;
  904     } else {
  905       $restlen = $firstlinelen;
  906       foreach (@triplets) {
  907     my ($s, $enc, $csetobj) = @$_;
  908 
  909     my @s = &_split($s, $enc, $csetobj, $restlen, $maxrestlen);
  910     push @splitwords, @s;
  911     my ($last, $lastenc, $lastcsetobj) = @{$s[-1]};
  912     my $lastlen;
  913     if ($lastenc) {
  914         $lastlen = $lastcsetobj->encoded_header_len($last, $lastenc);
  915     } else {
  916         $lastlen = length($last);
  917     }
  918     $restlen = $maxrestlen if scalar @s > 1; # has split; new line(s) fed
  919     $restlen -= $lastlen;
  920     $restlen = $maxrestlen if $restlen <= 1;
  921       }
  922     }
  923 
  924     # Do encoding.
  925     my @lines;
  926     $restlen = $firstlinelen;
  927     foreach (@splitwords) {
  928     my ($str, $encoding, $charsetobj) = @$_;
  929     next unless length($str);
  930 
  931     my $s;
  932     if (!$encoding) {
  933         $s = $str;
  934     } else {
  935         $s = encode_mimeword($str, $encoding, $charsetobj);
  936     }
  937 
  938     my $spc = (scalar(@lines) and $lines[-1] =~ /[\r\n\t ]$/ or
  939            $s =~ /^[\r\n\t ]/)? '': ' ';
  940     if (!scalar(@lines)) {
  941         push @lines, $s;
  942     } elsif ($Params{MaxLineLen} < 0) {
  943         $lines[-1] .= $spc.$s;
  944     } elsif (length($lines[-1].$spc.$s) <= $restlen) {
  945         $lines[-1] .= $spc.$s;
  946     } else {
  947         if ($lines[-1] =~ s/([\r\n\t ]+)$//) {
  948         $s = $1.$s;
  949         }
  950         $s =~ s/^[\r\n]*[\t ]//; # strip only one WSP replaced by FWS
  951         push @lines, $s;
  952         $restlen = $maxrestlen;
  953     }
  954     }
  955 
  956     join($fwsbrk.$fwsspc, @lines);
  957 }
  958 
  959 #------------------------------
  960 
  961 # _split RAW, ENCODING, CHARSET_OBJECT, ROOM_OF_FIRST_LINE, MAXRESTLEN
  962 #     Private: used by encode_mimewords() to split a string into
  963 #     (encoded or non-encoded) words.
  964 #     Returns an array of arrayrefs [SUBSTRING, ENCODING, CHARSET].
  965 sub _split {
  966     my $str = shift;
  967     my $encoding = shift;
  968     my $charset = shift;
  969     my $restlen = shift;
  970     my $maxrestlen = shift;
  971 
  972     if (!$charset->as_string or $charset->as_string eq '8BIT') {# Undecodable.
  973     $str =~ s/[\r\n]+[\t ]*|\x00/ /g;   # Eliminate hostile characters.
  974     return ([$str, undef, $charset]);
  975     }
  976     if (!$encoding and $charset->as_string eq 'US-ASCII') { # Pure ASCII.
  977     return &_split_ascii($str, $restlen, $maxrestlen);
  978     }
  979     if (!$charset->decoder and MIME::Charset::USE_ENCODE) { # Unsupported.
  980     return ([$str, $encoding, $charset]);
  981     }
  982 
  983     my (@splitwords, $ustr, $first);
  984     while (length($str)) {
  985     if ($charset->encoded_header_len($str, $encoding) <= $restlen) {
  986         push @splitwords, [$str, $encoding, $charset];
  987         last;
  988     }
  989     $ustr = $str;
  990     if (!(is_utf8($ustr) or $ustr =~ /$WIDECHAR/) and
  991         MIME::Charset::USE_ENCODE) {
  992         $ustr = $charset->decode($ustr);
  993     }
  994     ($first, $str) = &_clip_unsafe($ustr, $encoding, $charset, $restlen);
  995     # retry splitting if failed
  996     if ($first and !$str and
  997         $maxrestlen < $charset->encoded_header_len($first, $encoding)) {
  998         ($first, $str) = &_clip_unsafe($ustr, $encoding, $charset,
  999                        $maxrestlen);
 1000     }
 1001     push @splitwords, [$first, $encoding, $charset];
 1002     $restlen = $maxrestlen;
 1003     }
 1004     return @splitwords;
 1005 }
 1006 
 1007 # _split_ascii RAW, ROOM_OF_FIRST_LINE, MAXRESTLEN
 1008 #     Private: used by encode_mimewords() to split an US-ASCII string into
 1009 #     (encoded or non-encoded) words.
 1010 #     Returns an array of arrayrefs [SUBSTRING, undef, "US-ASCII"].
 1011 sub _split_ascii {
 1012     my $s = shift;
 1013     my $restlen = shift;
 1014     my $maxrestlen = shift;
 1015     $restlen ||= $maxrestlen;
 1016 
 1017     my @splitwords;
 1018     my $ascii = MIME::Charset->new("US-ASCII", Mapping => 'STANDARD');
 1019     foreach my $line (split(/(?:[\t ]*[\r\n]+)+/, $s)) {
 1020         my $spc = '';
 1021     foreach my $word (split(/([\t ]+)/, $line)) {
 1022         next unless scalar(@splitwords) or $word; # skip first garbage
 1023         if ($word =~ /[\t ]/) {
 1024         $spc = $word;
 1025         next;
 1026         }
 1027 
 1028         my $cont = $spc.$word;
 1029         my $elen = length($cont);
 1030         next unless $elen;
 1031         if (scalar(@splitwords)) {
 1032         # Concatenate adjacent words so that encoded-word and
 1033         # unencoded text will adjoin with separating whitespace.
 1034         if ($elen <= $restlen) {
 1035             $splitwords[-1]->[0] .= $cont;
 1036             $restlen -= $elen;
 1037         } else {
 1038             push @splitwords, [$cont, undef, $ascii];
 1039             $restlen = $maxrestlen - $elen;
 1040         }
 1041         } else {
 1042         push @splitwords, [$cont, undef, $ascii];
 1043         $restlen -= $elen;
 1044         }
 1045         $spc = '';
 1046     }
 1047     if ($spc) {
 1048         if (scalar(@splitwords)) {
 1049         $splitwords[-1]->[0] .= $spc;
 1050         $restlen -= length($spc);
 1051         } else { # only WSPs
 1052         push @splitwords, [$spc, undef, $ascii];
 1053         $restlen = $maxrestlen - length($spc);
 1054         }
 1055     }
 1056     }
 1057     return @splitwords;
 1058 }
 1059 
 1060 # _clip_unsafe UNICODE, ENCODING, CHARSET_OBJECT, ROOM_OF_FIRST_LINE
 1061 #     Private: used by encode_mimewords() to bite off one encodable
 1062 #     ``word'' from a Unicode string.
 1063 #     Note: When Unicode/multibyte support is not enabled, character
 1064 #     boundaries of multibyte string shall be broken!
 1065 sub _clip_unsafe {
 1066     my $ustr = shift;
 1067     my $encoding = shift;
 1068     my $charset = shift;
 1069     my $restlen = shift;
 1070     return ("", "") unless length($ustr);
 1071 
 1072     # Seek maximal division point.
 1073     my ($shorter, $longer) = (0, length($ustr));
 1074     while ($shorter < $longer) {
 1075     my $cur = ($shorter + $longer + 1) >> 1;
 1076     my $enc = substr($ustr, 0, $cur);
 1077     if (MIME::Charset::USE_ENCODE ne '') {
 1078         $enc = $charset->undecode($enc);
 1079     }
 1080     my $elen = $charset->encoded_header_len($enc, $encoding);
 1081     if ($elen <= $restlen) {
 1082         $shorter = $cur;
 1083     } else {
 1084         $longer = $cur - 1;
 1085     }
 1086     }
 1087 
 1088     # Make sure that combined characters won't be divided.
 1089     my ($fenc, $renc);
 1090     my $max = length($ustr);
 1091     while (1) {
 1092     $@ = '';
 1093     eval {
 1094         ($fenc, $renc) =
 1095         (substr($ustr, 0, $shorter), substr($ustr, $shorter));
 1096         if (MIME::Charset::USE_ENCODE ne '') {
 1097         # FIXME: croak if $renc =~ /^\p{M}/
 1098         $fenc = $charset->undecode($fenc, FB_CROAK());
 1099         $renc = $charset->undecode($renc, FB_CROAK());
 1100         }
 1101     };
 1102     last unless ($@);
 1103 
 1104     $shorter++;
 1105     unless ($shorter < $max) { # Unencodable character(s) may be included.
 1106         return ($charset->undecode($ustr), "");
 1107     }
 1108     }
 1109 
 1110     if (length($fenc)) {
 1111     return ($fenc, $renc);
 1112     } else {
 1113     return ($renc, "");
 1114     }
 1115 }
 1116 
 1117 #------------------------------
 1118 
 1119 # _getparams HASHREF, OPTS
 1120 #     Private: used to get option parameters.
 1121 sub _getparams {
 1122     my $params = shift;
 1123     my %params = @_;
 1124     my %Params;
 1125     my %GotParams;
 1126     foreach my $k (qw(NoDefault YesNo Others Obsoleted ToUpper)) {
 1127     $Params{$k} = $params{$k} || [];
 1128     }
 1129     foreach my $k (keys %$params) {
 1130     my $supported = 0;
 1131     foreach my $i (@{$Params{NoDefault}}, @{$Params{YesNo}},
 1132                @{$Params{Others}}, @{$Params{Obsoleted}}) {
 1133         if (lc $i eq lc $k) {
 1134         $GotParams{$i} = $params->{$k};
 1135         $supported = 1;
 1136         last;
 1137         }
 1138     }
 1139     carp "unknown or deprecated option ``$k''" unless $supported;
 1140     }
 1141     # get defaults
 1142     foreach my $i (@{$Params{YesNo}}, @{$Params{Others}}) {
 1143     $GotParams{$i} = $Config->{$i} unless defined $GotParams{$i};
 1144     }
 1145     # yesno params
 1146     foreach my $i (@{$Params{YesNo}}) {
 1147         if (!$GotParams{$i} or uc $GotParams{$i} eq "NO") {
 1148             $GotParams{$i} = "NO";
 1149         } else {
 1150             $GotParams{$i} = "YES";
 1151         }
 1152     }
 1153     # normalize case
 1154     foreach my $i (@{$Params{ToUpper}}) {
 1155         $GotParams{$i} &&= uc $GotParams{$i};
 1156     }
 1157     return %GotParams;
 1158 }
 1159 
 1160 
 1161 #------------------------------
 1162 
 1163 =back
 1164 
 1165 =head2 Configuration Files
 1166 B<**>
 1167 
 1168 Built-in defaults of option parameters for L</decode_mimewords>
 1169 (except 'Charset' option) and
 1170 L</encode_mimewords> can be overridden by configuration files:
 1171 F<MIME/Charset/Defaults.pm> and F<MIME/EncWords/Defaults.pm>.
 1172 For more details read F<MIME/EncWords/Defaults.pm.sample>.
 1173 
 1174 =head1 VERSION
 1175 
 1176 Consult C<$VERSION> variable.
 1177 
 1178 Development versions of this module may be found at
 1179 L<http://hatuka.nezumi.nu/repos/MIME-EncWords/>.
 1180 
 1181 =head1 SEE ALSO
 1182 
 1183 L<MIME::Charset>,
 1184 L<MIME::Tools>
 1185 
 1186 =head1 AUTHORS
 1187 
 1188 The original version of function decode_mimewords() is derived from
 1189 L<MIME::Words> module that was written by:
 1190     Eryq (F<eryq@zeegee.com>), ZeeGee Software Inc (F<http://www.zeegee.com>).
 1191     David F. Skoll (dfs@roaringpenguin.com) http://www.roaringpenguin.com
 1192 
 1193 Other stuff are rewritten or added by:
 1194     Hatuka*nezumi - IKEDA Soji <hatuka(at)nezumi.nu>.
 1195 
 1196 This program is free software; you can redistribute
 1197 it and/or modify it under the same terms as Perl itself.
 1198 
 1199 =cut
 1200 
 1201 1;