"Fossies" - the Fresh Open Source Software Archive

Member "spambayes-1.1a6/TESTING.txt" (15 Jul 2007, 10278 Bytes) of archive /windows/mail/spambayes-1.1a6.zip:


As a special service "Fossies" has tried to format the requested text file into HTML format (style: standard) with prefixed line numbers. Alternatively you can here view or download the uninterpreted source code file.

[Clues about the practice of statistical testing, adapted from Tim's
 comments on python-dev.]

Combining pairs of words is called "word bigrams".  My intuition at the
start was that it would do better.  OTOH, my intuition also was that
character n-grams for a relatively large n would do better still.  The
latter may be so for "foreign" languages, but for this particular task,
using Graham's scheme on the c.l.py tests, it turns out they sucked.  A
comment block in timtest.py explains why.
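
For concreteness, here's a rough sketch of the two tokenizations being
compared; the helper names are mine, not timtest.py's:

    def word_bigrams(text):
        # Pair each word with its successor: "special offers inside"
        # yields "special offers" and "offers inside".
        words = text.split()
        return [" ".join(pair) for pair in zip(words, words[1:])]

    def char_ngrams(text, n=5):
        # Slide an n-character window over the raw text; "offers"
        # with n=5 yields "offer" and "ffers".
        return [text[i:i + n] for i in range(len(text) - n + 1)]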

I didn't try word bigrams because the f-p rate is already supernaturally
low, so there doesn't seem to be anything left to gain there.  This echoes
what Graham sez on his web page:

    One idea that I haven't tried yet is to filter based on word pairs, or
    even triples, rather than individual words.  This should yield a much
    sharper estimate of the probability.

My comment with benefit of hindsight:  it doesn't.  Because the scoring
scheme throws away everything except about a dozen extremes, the
"probabilities" that come out are almost always very near 0 or very near 1;
only very short or (or especially "and") very bland msgs come out in
between.  This outcome is largely independent of the tokenization scheme --
the scoring scheme forces it, provided only that the tokenization scheme
produces stuff *some* of which *does* vary in frequency between spam and
ham.
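
To make that mechanism concrete, here's a stripped-down sketch of
Graham-style scoring; the 15-clue cutoff and the combining rule are
illustrative, not the exact constants in this code:

    def score(clue_probs, max_discriminators=15):
        # Keep only the clues farthest from 0.5; everything else in
        # the message is simply ignored.
        clues = sorted(clue_probs, key=lambda p: abs(p - 0.5),
                       reverse=True)[:max_discriminators]
        # Combine naive-Bayes style.  A product of a dozen-plus
        # extreme probabilities is driven hard toward 0.0 or 1.0,
        # no matter how the tokens were produced.
        prod_p = prod_q = 1.0
        for p in clues:
            prod_p *= p
            prod_q *= 1.0 - p
        return prod_p / (prod_p + prod_q)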

    For example, in my current database, the word "offers" has a
    probability of .96. If you based the probabilities on word pairs, you'd
    end up with "special offers" and "valuable offers" having probabilities
    of .99 and, say, "approach offers" (as in "this approach offers")
    having a probability of .1 or less.

The theory is indeed appealing <wink>.

    The reason I haven't done this is that filtering based on individual
    words already works so well.

Which is also the reason I didn't pursue it.

    But it does mean that there is room to tighten the filters if spam gets
    harder to detect.

I expect it would also need a different scoring scheme then.

OK, I ran a full test using word bigrams.  It gets one strike against it at
the start because the database size grows by a factor between 2 and 3.
That's only justified if the results are better.  Before-and-after f-p
(false positive) percentages:

   before   bigrams
    0.000   0.025
    0.000   0.025
    0.050   0.050
    0.000   0.025
    0.025   0.050
    0.025   0.100
    0.050   0.075
    0.025   0.025
    0.025   0.050
    0.000   0.025
    0.075   0.050
    0.050   0.000
    0.025   0.050
    0.000   0.025
    0.050   0.075
    0.025   0.025
    0.025   0.025
    0.000   0.000
    0.025   0.050
    0.050   0.025

Lost on 12 runs
Tied on  5 runs
Won  on  3 runs

total # of unique fps across all runs rose from 8 to 17
The f-n percentages on the same runs:

   before   bigrams
    1.236   1.091
    1.164   1.091
    1.454   1.708
    1.599   1.563
    1.527   1.491
    1.236   1.127
    1.163   1.345
    1.309   1.309
    1.891   1.927
    1.418   1.382
    1.745   1.927
    1.708   1.963
    1.491   1.782
    0.836   0.800
    1.091   1.127
    1.309   1.309
    1.491   1.709
    1.127   1.018
    1.309   1.018
    1.636   1.672

Lost on  9 runs
Tied on  2 runs
Won  on  9 runs

total # of unique fns across all runs rose from 336 to 350

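The win/tie/loss tallies are just per-run paired comparisons; here's a
sketch of the bookkeeping, assuming each column above is given as a
list of error rates:

    def tally(before, after):
        # Lower error rate wins; applied to the f-p columns above,
        # this returns (3, 5, 12) from bigrams' point of view.
        won = sum(1 for b, a in zip(before, after) if a < b)
        tied = sum(1 for b, a in zip(before, after) if a == b)
        lost = sum(1 for b, a in zip(before, after) if a > b)
        return won, tied, lost
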
This doesn't need deep analysis:  it costs more, and on the face of it
either doesn't help, or helps so little it's not worth the cost.

Now I'll tell you in confidence <wink> that the way to make a scheme like
this excellent is to keep your ego out of it and let the data *tell* you
what works:  getting the best test setup you can is the most important thing
you can possibly do.  It must include multiple training and test corpora
(e.g., if I had used only one pair, I would have had a 3/20 chance of
erroneously concluding that bigrams might help the f-p rate, when running
across 20 pairs shows that they almost certainly do it harm; while I would
have had an even chance of drawing a wrong conclusion -- in either
direction -- about the effect on the f-n rate).

The second most important thing is to run a fat test all the way to the end
before concluding anything.  A subtler point is that you should never keep
a change that doesn't *prove* itself a winner:  neutral changes bloat your
code with proven irrelevancies that will come back to make your life harder
later, in part because they'll randomly interfere with future changes in
ways that make it harder to recognize a significant change when you stumble
into one.

Most things you try won't help -- indeed, many of them will deliver worse
results.  I dare say my intuition for this kind of classification task is
better than most programmers' (in part because I had years of professional
experience in a related field), and most of the things I tried I had to
throw away.  BFD -- then you try something else.  When I find something
that works I can rationalize it, but when I try something that doesn't, no
amount of argument can change that the data said it sucked <wink>.

Two things about *this* task have fooled me repeatedly:

1. The "only look at smoking guns" nature of the scoring step makes many
   kinds of "on average" intuitions worthless:  "on average" almost
   everything is thrown away!  For example, you're not going to find bad
   results reported for n-grams (neither character- nor word-based) in the
   literature, because most scoring schemes throw much less away.
   Graham's scheme strikes me as brilliant in this specific respect:  it's
   worth enduring the ego humiliation to get such a spectacularly
   low f-p rate from such simple and fast code.  Graham's assumption
   that the spam-vs-ham distinction should be *easy* pays off big.

2. Most mailing-list messages are much shorter than this one.  This
   systematically frustrates "well, averaged over enough words" intuitions
   too.

Cute:  In particular, word bigrams systematically hate conference
announcements.  The current word one-gram scheme hated them too, until I
started folding case.  Then their SCREAMING stopped acting against them.
But they're still using the language of advertisement, and word bigrams
can't help but notice that more strongly than individual words do.

Here from the TOOLS Europe '99 announcement:

prob('more information') = 0.916003
prob('web site') = 0.895518
prob('please write') = 0.99
prob('you wish') = 0.984494
prob('our web') = 0.985578
prob('visit our') = 0.99

Here from the XP2001 - FINAL CALL FOR PAPERS:

prob('web site:') = 0.926174
prob('receive this') = 0.945813
prob('you receive') = 0.987542
prob('most exciting') = 0.99
prob('alberta, canada') = 0.99
prob('e-mail to:') = 0.99

Here from the XP2002 - CALL FOR PRACTITIONER'S REPORTS ('BOM' is an
artificial token I made up for "beginning of message", to give something
for the first word in the message to pair up with):

prob('web site:') = 0.926174
prob('this announcement') = 0.94359
prob('receive this') = 0.945813
prob('forward this') = 0.99
prob('e-mail to:') = 0.99
prob('BOM *****') = 0.99
prob('you receive') = 0.987542

Here from the TOOLS Europe 2000 announcement:

prob('visit the') = 0.96
prob('you receive') = 0.967805
prob('accept our') = 0.99
prob('our apologies') = 0.99
prob('quality and') = 0.99
prob('receive more') = 0.99
prob('asia and') = 0.99

A vanilla f-p showing where bigrams can hurt was a short msg about setting
up a Python user's group.  Bigrams gave it large penalties for phrases like
"fully functional" (most often seen in spams for bootleg software, but here
applied to the proposed user group's web site -- and "web site" is also a
strong spam indicator!).  OTOH, the poster also said "Aahz rocks".  As a
bigram, that neither helped nor hurt (that 2-word phrase is unique in the
corpus); but as an individual word, "Aahz" is a strong non-spam indicator
on c.l.py (and will probably remain so until he starts spamming <wink>).
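
Here's a sketch of why a one-off phrase washes out.  The shrinkage
toward a neutral prior is Robinson-style, of the kind this classifier
grew later; the constants are illustrative, but any sane estimator acts
the same on a token with no training history:

    def spamprob(spam_count, ham_count, s=0.45, x=0.5):
        # With no training evidence (n == 0) the estimate is exactly
        # the neutral prior x, so a unique bigram like "Aahz rocks"
        # neither helps nor hurts; a heavily attested unigram like
        # "Aahz" gets a strong estimate that the extreme-clue
        # selection keeps.
        n = spam_count + ham_count
        p_raw = spam_count / n if n else x
        return (s * x + n * p_raw) / (s + n)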

It did find one spam hiding in a ham corpus:

"""
NNTP-Posting-Host: 212.64.45.236
Newsgroups: comp.lang.python,comp.lang.rexx
Date: Thu, 21 Oct 1999 10:18:52 -0700
Message-ID: <67821AB23987D311ADB100A0241979E5396955@news.ykm.com>
From: znblrn@hetronet.com
Subject: Rudolph The Rednose Hooters Here
Lines: 4
Path: news!uunet!ffx.uu.net!newsfeed.fast.net!howland.erols.net!newsfeed.cwix.com!news.cfw.com!paxfeed.eni.net!DAIPUB.DataAssociatesInc..com
Xref: news comp.lang.python:74468 comp.lang.rexx:31946
To: python-list@python.org

THis IS it: The site where they talk about when you are 50 years old.

http://huizen.dds.nl/~jansen20
"""

there's-no-substitute-for-experiment-except-drugs-ly y'rs  - tim


Other points:

+ Something I didn't do but should have:  keep a detailed log of every
  experiment run, and of the results you got.  The only clues about dozens
  of experiments with the current code are in brief "XXX" comment blocks,
  and a bunch of test results were lost when we dropped the old checkin
  comments on the way to moving this code to SourceForge.

+ Every time you check in an algorithmic change that proved to be a
  winner, in theory you should also reconsider every previous change.
  You really can't guess whether, e.g., tokenization changes are all
  independent of each other, or whether some reinforce others in
  helpful ways.  In practice there's not enough time to reconsider
  everything every time, but do make a habit of reconsidering *something*
  each time you've had a success.  Nothing is sacred except the results
  in the end, and heresy can pay; every decision remains suspect forever.

+ Any sufficiently general scheme with enough free parameters can eventually
  be trained to recognize any specific dataset exactly.  It's wonderful
  if other people test your changes against other datasets too.  That's
  hard to arrange, so at least change your own data periodically.  I'm
  suspicious that some of the weirder "proven winner" changes I've made
  are really specific to statistical anomalies in my test data; and as
  the error rates get closer to 0%, the chance that a winning change helped
  only a few specific msgs zooms (of course sometimes that's intentional!
  I haven't been shy about adding changes specifically geared toward
  squashing very narrow classes of false positives).