Member "spambayes-1.1a6/README-DEVEL.txt" (23 Feb 2009, 31896 Bytes) of archive /windows/mail/spambayes-1.1a6.zip:


    1 Copyright (C) 2002-2009 Python Software Foundation; All Rights Reserved
    2 
    3 The Python Software Foundation (PSF) holds copyright on all material
    4 in this project.  You may use it under the terms of the PSF license;
    5 see LICENSE.txt.
    6 
    7 
    8 Assorted clues.
    9 
   10 
   11 What's Here?
   12 ============
   13 Lots of mondo cool partially documented code.  What else could there be <wink>?
   14 
   15 The focus of this project so far has not been to produce the fastest or
   16 smallest filters, but to set up a flexible pure-Python implementation
   17 for doing algorithm research.  Lots of people are making fast/small
   18 implementations, and it takes an entirely different kind of effort to
   19 make genuine algorithm improvements.  I think we've done quite well at
   20 that so far.  The focus of this codebase may change to small/fast
   21 later -- as is, the false positive rate has gotten too small to measure
   22 reliably across test sets with 4000 hams + 2750 spams, and the f-n rate
   23 has also gotten too small to measure reliably across that much training data.
   24 
   25 The code in this project requires Python 2.2 (or later).
   26 
   27 You should definitely check out the FAQ:
   28 http://spambayes.org/faq.html
   29 
   30 Getting Source Code
   31 ===================
   32 
   33 The SpamBayes project source code is hosted at SourceForge
   34 (http://spambayes.sourceforge.net/).  Access is via Subversion.
   35 
   36 Running Unit Tests
   37 ==================
   38 
   39 SpamBayes has a currently incomplete set of unit tests, not all of which
   40 pass, due, in part, to bit rot.  We are working on getting the unit tests to
   41 run using the `nose <http://somethingaboutorange.com/mrl/projects/nose/>`_
   42 package.  After downloading and installing nose, you can run the current
   43 unit tests on Unix-like systems like so from the SpamBayes top-level
   44 directory::
   45 
   46     TMPDIR=/tmp BAYESCUSTOMIZE= nosetests -v . 2>&1 \
   47     | sed -e "s:$(pwd)/::" \
   48           -e "s:$(python -c 'import sys ; print sys.exec_prefix')/::" \
   49     | tee failing-unit-tests.txt
   50 
    51 The resulting file, failing-unit-tests.txt, is checked into the Subversion
    52 repository at the top level.  The checked-in copy was generated using Python
    53 built from Subversion (currently 2.7a0).  You can look at it for any failing
    54 unit tests and work to get them passing, or write new tests.
   55 
   56 Primary Core Files
   57 ==================
   58 Options.py
   59     Uses ConfigParser to allow fiddling various aspects of the classifier,
   60     tokenizer, and test drivers.  Create a file named bayescustomize.ini to
   61     alter the defaults.  Modules wishing to control aspects of their
   62     operation merely do
   63 
   64         from Options import options
   65 
   66     near the start, and consult attributes of options.  To see what options
   67     are available, import Options.py and do
   68 
   69         print Options.options.display_full()
   70 
   71     This will print out a detailed description of each option, the allowed
   72     values, and so on.  (You can pass in a section or section and option
   73     name to display_full if you don't want the whole list).
   74 
   75     As an alternative to bayescustomize.ini, you can set the environment
    76     variable BAYESCUSTOMIZE to a list of one or more .ini files; these will
   77     be read in, in order, and applied to the options. This allows you to
   78     tweak individual runs by combining fragments of .ini files.  The
   79     character used to separate different .ini files is platform-dependent.
   80     On Unix, Linux and Mac OS X systems it is ':'.  On Windows it is ';'.
   81     On Mac OS 9 and earlier systems it is a NL character.
   82 
   83 classifier.py
   84     The classifier, which is the soul of the method.
   85 
   86 tokenizer.py
   87     An implementation of tokenize() that Tim can't seem to help but keep
   88     working on <wink>.  Generates a token stream from a message, which
   89     the classifier trains on or predicts against.
   90 
   91 chi2.py
   92     A collection of statistics functions.
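
The BAYESCUSTOMIZE handling described under Options.py can be sketched
roughly as follows.  This is a minimal illustration, not the code in
Options.py: the function name read_option_files is made up for this
example, and it uses the Python 3 configparser module rather than the
ConfigParser module of the Python versions this project targets.  It
assumes os.pathsep matches the separators listed above (':' on Unix-like
systems, ';' on Windows).

```python
import configparser
import os


def read_option_files(env_value, sep=os.pathsep):
    """Read the .ini files named in env_value, in order; later files win.

    Hypothetical helper for illustration only.
    """
    parser = configparser.ConfigParser()
    # Skip empty entries so an empty or unset variable reads nothing.
    paths = [p for p in env_value.split(sep) if p]
    for path in paths:
        # read() silently ignores files that do not exist.
        parser.read(path)
    return parser


if __name__ == "__main__":
    merged = read_option_files(os.environ.get("BAYESCUSTOMIZE", ""))
    for section in merged.sections():
        print(section, dict(merged[section]))
```

Because each file is applied on top of the previous ones, a small
fragment listed last can override a single option for one run.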
   93 
   94 Apps
   95 ====
   96 sb_filter.py
   97     A simpler hammie front-end that doesn't print anything.  Useful for
   98     procmail filtering and scoring from your MUA.
   99 
  100 sb_mboxtrain.py
  101     Trainer for Maildir, MH, or mbox mailboxes.  Remembers which
  102     messages it saw the last time you ran it, and will only train on new
  103     messages or messages which should be retrained.  
  104 
  105     The idea is to run this automatically every night on your Inbox and
  106     Spam folders, and then sort misclassified messages by hand.  This
  107     will work with any IMAP4 mail client, or any client running on the
  108     server.
  109 
  110 sb_server.py
  111     A spam-classifying POP3 proxy.  It adds a spam-judgment header to
  112     each mail as it's retrieved, so you can use your email client's
  113     filters to deal with them without needing to fiddle with your email
  114     delivery system.
  115 
  116     Also acts as a web server providing a user interface that allows you
  117     to train the classifier, classify messages interactively, and query
  118     the token database.  This piece may at some point be split out into
  119     a separate module.
  120 
  121     If the appropriate options are set, also serves a message training
  122     SMTP proxy.  It sits between your email client and your SMTP server
  123     and intercepts mail to set ham and spam addresses.
  124     All other mail is simply passed through to the SMTP server.
  125 
  126 sb_mailsort.py
  127     A delivery agent that uses a CDB of word probabilities and delivers
  128     a message to one of two Maildir message folders, depending on the
  129     classifier score.  Note that both Maildirs must be on the same
  130     device.
  131 
  132 sb_xmlrpcserver.py
  133     A stab at making hammie into a client/server model, using XML-RPC.
  134 
  135 sb_client.py
  136     A client for sb_xmlrpcserver.py.
  137 
  138 sb_imapfilter.py
  139     A spam-classifying and training application for use with IMAP servers.
  140     You can specify folders that contain mail to train as ham/spam, and
  141     folders that contain mail to classify, and the filter will do so.
  142 
  143 
  144 Test Driver Core
  145 ================
  146 Tester.py
  147     A test-driver class that feeds streams of msgs to a classifier
  148     instance, and keeps track of right/wrong percentages and lists
  149     of false positives and false negatives.
  150 
  151 TestDriver.py
  152     A flexible higher layer of test helpers, building on Tester above.
  153     For example, it's usable for building simple test drivers, NxN test
  154     grids, and N-fold cross-validation drivers.  See also rates.py,
  155     cmp.py, and table.py below.
  156 
  157 msgs.py
  158     Some simple classes to wrap raw msgs, and to produce streams of
  159     msgs.  The test drivers use these.
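
The kind of bookkeeping described for Tester.py can be sketched as below.
This is only an illustration of the idea (counting right/wrong
predictions and remembering which messages were false positives or false
negatives).  The class name RateTracker and its methods are invented for
this example and are not Tester.py's interface, and the sketch assumes a
plain ham/spam decision with no "unsure" category.

```python
class RateTracker:
    """Tally predictions and report false positive/negative rates."""

    def __init__(self):
        self.nham = 0
        self.nspam = 0
        self.false_positives = []   # hams that were called spam
        self.false_negatives = []   # spams that were called ham

    def record(self, name, is_spam, predicted_spam):
        """Record one prediction for the message called name."""
        if is_spam:
            self.nspam += 1
            if not predicted_spam:
                self.false_negatives.append(name)
        else:
            self.nham += 1
            if predicted_spam:
                self.false_positives.append(name)

    def fp_percent(self):
        """False positives as a percentage of hams tested."""
        return 100.0 * len(self.false_positives) / self.nham if self.nham else 0.0

    def fn_percent(self):
        """False negatives as a percentage of spams tested."""
        return 100.0 * len(self.false_negatives) / self.nspam if self.nspam else 0.0
```

Keeping the message names, not just the counts, is what makes it
possible to go back and look at each misclassified message.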
  160 
  161 
  162 Concrete Test Drivers
  163 =====================
  164 mboxtest.py
  165     A concrete test driver like timtest.py, but working with a pair of
  166     mailbox files rather than the specialized timtest setup.
  167 
  168 timcv.py
  169     An N-fold cross-validating test driver.  Assumes "a standard" data
   170         directory setup (see below) rather than the specialized mboxtest
  171         setup.
  172     N classifiers are built.
  173     1 run is done with each classifier.
  174     Each classifier is trained on N-1 sets, and predicts against the sole
  175         remaining set (the set not used to train the classifier).
  176     mboxtest does the same.
  177     This (or mboxtest) is the preferred way to test when possible:  it
  178         makes best use of limited data, and interpreting results is
  179         straightforward.
  180 
  181 timtest.py
  182     A concrete test driver like mboxtest.py, but working with "a standard"
  183         test data setup (see below).  This runs an NxN test grid, skipping
  184         the diagonal.
  185     N classifiers are built.
  186     N-1 runs are done with each classifier.
  187     Each classifier is trained on 1 set, and predicts against each of
  188         the N-1 remaining sets (those not used to train the classifier).
  189     This is a much harder test than timcv, because it trains on N-1 times
  190         less data, and makes each classifier predict against N-1 times
  191         more data than it's been taught about.
  192     It's harder to interpret the results of timtest (than timcv) correctly,
  193         because each msg is predicted against N-1 times overall.  So, e.g.,
  194         one terribly difficult spam or ham can count against you N-1 times.
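
The difference between the two schemes is easiest to see by listing which
sets each classifier trains on and predicts against.  The sketch below is
not taken from timcv.py or timtest.py; the function names are invented
for illustration, and sets are represented simply by their numbers.

```python
def cross_validation_runs(n):
    """Yield (training_sets, test_sets) for N-fold cross-validation."""
    sets = list(range(1, n + 1))
    for held_out in sets:
        # Train on the other N-1 sets, predict against the one held out.
        yield [s for s in sets if s != held_out], [held_out]


def grid_runs(n):
    """Yield (training_sets, test_sets) for an NxN grid, skipping the diagonal."""
    sets = list(range(1, n + 1))
    for trained_on in sets:
        for tested_on in sets:
            if tested_on != trained_on:
                # Train on one set, predict against one of the others.
                yield [trained_on], [tested_on]


if __name__ == "__main__":
    n = 5
    cv = list(cross_validation_runs(n))
    grid = list(grid_runs(n))
    print(len(cv), "cross-validation runs;", len(grid), "grid runs")
```

With N = 5 this gives 5 cross-validation runs and 20 grid runs, and in
the grid every set is predicted against N-1 = 4 times, which is why one
difficult message can count against you N-1 times.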
  195 
  196 
  197 Test Utilities
  198 ==============
  199 rates.py
  200     Scans the output (so far) produced by TestDriver.Drive(), and captures
  201     summary statistics.
  202 
  203 cmp.py
  204     Given two summary files produced by rates.py, displays an account
  205     of all the f-p and f-n rates side-by-side, along with who won which
  206     (etc), the change in total # of unique false positives and negatives,
  207     and the change in average f-p and f-n rates.
  208 
  209 table.py
  210     Summarizes the high-order bits from any number of summary files,
  211     in a compact table.
  212 
  213 fpfn.py
  214     Given one or more TestDriver output files, prints list of false
  215     positive and false negative filenames, one per line.
  216 
  217 
  218 Test Data Utilities
  219 ===================
  220 cleanarch.py
  221     A script to repair mbox archives by finding "Unix From" lines that
  222     should have been escaped, and escaping them.
  223 
  224 unheader.py
  225     A script to remove unwanted headers from an mbox file.  This is mostly
  226     useful to delete headers which incorrectly might bias the results.
  227     In default mode, this is similar to 'spamassassin -d', but much, much
  228     faster.
  229 
  230 loosecksum.py
  231     A script to calculate a "loose" checksum for a message.  See the text of
  232     the script for an operational definition of "loose".
  233 
  234 rebal.py
  235     Evens out the number of messages in "standard" test data folders (see
  236     below).  Needs generalization (e.g., Ham and 4000 are hardcoded now).
  237 
  238 mboxcount.py
  239     Count the number of messages (both parseable and unparseable) in
  240     mbox archives.
  241 
  242 split.py
  243 splitn.py
  244     Split an mbox into random pieces in various ways.  Tim recommends
  245     using "the standard" test data set up instead (see below).
  246 
  247 splitndirs.py
  248     Like splitn.py (above), but splits an mbox into one message per file in
  249     "the standard" directory structure (see below).  This does an
  250     approximate split; rebal.py (above) can be used afterwards to even out
  251     the number of messages per folder.
  252 
  253 runtest.sh
  254     A Bourne shell script (for Unix) which will run some test or other.
  255     I (Neale) will try to keep this updated to test whatever Tim is
  256     currently asking for.  The idea is, if you have a standard directory
  257     structure (below), you can run this thing, go have some tea while it
  258     works, then paste the output to the SpamBayes list for good karma.
  259 
  260 
  261 Standard Test Data Setup
  262 ========================
  263 Barry gave Tim mboxes, but the spam corpus he got off the web had one spam
  264 per file, and it only took two days of extreme pain to realize that one msg
  265 per file is enormously easier to work with when testing:  you want to split
  266 these at random into random collections, you may need to replace some at
  267 random when testing reveals spam mistakenly called ham (and vice versa),
  268 etc -- even pasting examples into email is much easier when it's one msg
  269 per file (and the test drivers make it easy to print a msg's file path).
  270 
  271 The directory structure under my spambayes directory looks like so:
  272 
  273 Data/
  274     Spam/
  275         Set1/ (contains 1375 spam .txt files)
  276         Set2/            ""
  277         Set3/            ""
  278         Set4/            ""
  279         Set5/            ""
  280         Set6/            ""
  281         Set7/            ""
   282         Set8/            ""
  283         Set9/            ""
  284         Set10/           ""
   285         reservoir/ (contains "backup spam")
  286     Ham/
  287         Set1/ (contains 2000 ham .txt files)
  288         Set2/            ""
  289         Set3/            ""
  290         Set4/            ""
  291         Set5/            ""
  292         Set6/            ""
  293         Set7/            ""
  294         Set8/            ""
  295         Set9/            ""
  296         Set10/           ""
  297         reservoir/ (contains "backup ham")
  298 
  299 Every file at the deepest level is used (not just files with .txt
  300 extensions).  The files don't need to have a "Unix From"
  301 header before the RFC-822 message (i.e. a line of the form "From
  302 <address> <date>").
  303 
  304 If you use the same names and structure, huge mounds of the tedious testing
  305 code will work as-is.  The more Set directories the merrier, although you
  306 want at least a few hundred messages in each one.  The "reservoir"
  307 directories contain a few thousand other random hams and spams.  When a ham
  308 is found that's really spam, move it into a spam directory, then use the
  309 rebal.py utility to rebalance the Set directories moving random message(s)
  310 into and/or out of the reservoir directories.  The reverse works as well
  311 (finding ham in your spam directories).
  312 
  313 The hams are 20,000 msgs selected at random from a python-list archive.
  314 The spams are essentially all of Bruce Guenter's 2002 spam archive:
  315 
  316     <http://www.em.ca/~bruceg/spam/>
  317 
  318 The sets are grouped into pairs in the obvious way:  Spam/Set1 with
  319 Ham/Set1, and so on.  For each such pair, timtest trains a classifier on
  320 that pair, then runs predictions on each of the other pairs.  In effect,
  321 it's a NxN test grid, skipping the diagonal.  There's no particular reason
  322 to avoid predicting against the same set trained on, except that it
  323 takes more time and seems the least interesting thing to try.
  324 
  325 Later, support for N-fold cross validation testing was added, which allows
  326 more accurate measurement of error rates with smaller amounts of training
  327 data.  That's recommended now.  timcv.py is to cross-validation testing
  328 as the older timtest.py is to grid testing.  timcv.py has grown additional
  329 arguments to allow using only a random subset of messages in each Set.
  330 
  331 CAUTION:  The partitioning of your corpora across directories should
  332 be random.  If it isn't, bias creeps in to the test results.  This is
  333 usually screamingly obvious under the NxN grid method (rates vary by a
  334 factor of 10 or more across training sets, and even within runs against
  335 a single training set), but harder to spot using N-fold c-v.
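
One simple way to get a random partition is to shuffle the list of
message files and deal them out round-robin.  The sketch below is an
illustration only and is not how splitndirs.py or rebal.py are
implemented; the function name deal_into_sets and the file names in the
usage example are hypothetical.

```python
import random


def deal_into_sets(paths, n_sets, seed=None):
    """Shuffle paths and deal them round-robin into n_sets lists."""
    rng = random.Random(seed)
    shuffled = list(paths)
    rng.shuffle(shuffled)
    # Slicing with a stride keeps set sizes within one message of each other.
    return [shuffled[i::n_sets] for i in range(n_sets)]


if __name__ == "__main__":
    # Hypothetical file names standing in for a one-message-per-file corpus.
    spam = ["spam%04d.txt" % i for i in range(25)]
    for number, members in enumerate(deal_into_sets(spam, 10, seed=1), 1):
        print("Set%d/" % number, len(members), "messages")
```

Passing a fixed seed makes the partition repeatable, which helps when
comparing runs before and after a change.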
  336 
  337 Testing a change and posting the results
  338 ========================================
  339 
  340 (Adapted from clues Tim posted on the spambayes and spambayes-dev lists)
  341 
   342 First, set up your data as above; it's really not worth the hassle to
  343 come up with a different scheme.  If you use the Outlook plug-in, the
  344 export.py script in the Outlook2000 directory will export all the spam
  345 and ham in your 'training' folders for you into this format (or close
  346 enough).
  347 
  348 Basically the idea is that you should have 10 sets of data, each with
  349 200 to 500 messages in them.  Obviously if you're testing something to
  350 do with the size of a corpus, you'll want to change that.  You then want
  351 to run
  352     timcv.py -n 10 > std.txt
  353 (call std.txt whatever you like), and then
  354     rates.py std.txt
  355 You end up with two files, std.txt, which has the raw results, and stds.txt,
  356 which has more of a summary of the results.
  357 
  358 Now make the change to the code or options, and repeat the process,
  359 giving the files different names (note that rates.py will automatically
  360 choose the name for the output file, based on the input one).
  361 
  362 You've now got the data you need, but you have to interpret it.  The
  363 simplest way of all is just to post it to spambayes-dev@python.org and let
  364 someone else do it for you <wink>.  The data you should post is the output of
  365     cmp.py stds.txt alts.txt
  366 along with the output of
  367     table.py stds.txt alts.txt
  368 (note that these just print to stdout).
  369 
  370 Other information you can find in the 'raw' output (std.txt, above) are
  371 histograms of the ham/spam spread, and a copy of the options settings.
  372 
  373 Interpreting cmp.py output
  374 --------------------------
  375 
  376 (Using an example from Tim on spambayes-dev)
  377 
  378 > cv_octs.txt -> cv_oct_subjs.txt
  379 > -> <stat> tested 488 hams & 897 spams against 1824 hams & 3501 spams 
  380 > -> <stat> tested 462 hams & 863 spams against 1850 hams & 3535 spams 
  381 > -> <stat> tested 475 hams & 863 spams against 1837 hams & 3535 spams 
  382 > -> <stat> tested 430 hams & 887 spams against 1882 hams & 3511 spams 
  383 > -> <stat> tested 457 hams & 888 spams against 1855 hams & 3510 spams 
  384 > -> <stat> tested 488 hams & 897 spams against 1824 hams & 3501 spams 
  385 > -> <stat> tested 462 hams & 863 spams against 1850 hams & 3535 spams 
  386 > -> <stat> tested 475 hams & 863 spams against 1837 hams & 3535 spams 
  387 > -> <stat> tested 430 hams & 887 spams against 1882 hams & 3511 spams 
  388 > -> <stat> tested 457 hams & 888 spams against 1855 hams & 3510 spams
  389 >
  390 > false positive percentages
  391 >     0.000  0.000  tied
  392 >     0.000  0.000  tied
  393 >     0.000  0.000  tied
  394 >     0.000  0.000  tied
  395 >     0.219  0.219  tied
  396 >
  397 > won   0 times
  398 > tied  5 times
  399 > lost  0 times
  400 
  401 So all 5 runs tied on FP.  That tells us much more than that the *net*
  402 effect across 5 runs was nil on FP:  it tells us that there are no hidden
  403 glitches hiding behind that "net nothing" -- it was no change across the board.
  404 
  405 > total unique fp went from 1 to 1 tied
  406 > mean fp % went from 0.0437636761488 to 0.0437636761488 tied
  407 >
  408 > false negative percentages
  409 >     2.007  2.007  tied
  410 >     1.390  1.390  tied
  411 >     1.622  1.622  tied
  412 >     2.029  1.917  won     -5.52%
  413 >     2.703  2.477  won     -8.36%
  414 >
  415 > won   2 times
  416 > tied  3 times
  417 > lost  0 times
  418 
  419 When evaluating a small change, I'm heartened to see that in no run did it lose.
  420 At worst it tied, and twice it helped a little.  That's encouraging.
  421 
  422 What the histograms would tell us that we can't tell from this is whether you
  423 could have done just as well without the change by raising your ham cutoff a little.
  424 That would also tie on FP, and *may* also get rid of the same number (or even
  425 more) of FN.
  426 
  427 > total unique fn went from 86 to 83 won     -3.49%
  428 > mean fn % went from 1.95029003772 to 1.88269707836 won     -3.47%
  429 >
  430 > ham mean                     ham sdev
  431 >    0.57    0.58   +1.75%        4.63    4.77   +3.02%
  432 >    0.08    0.07  -12.50%        1.20    1.01  -15.83%
  433 >    0.36    0.29  -19.44%        3.61    3.23  -10.53%
  434 >    0.08    0.11  +37.50%        0.89    1.18  +32.58%
  435 >    0.72    0.76   +5.56%        6.80    7.06   +3.82%
  436 >
  437 > ham mean and sdev for all runs
  438 >    0.37    0.37   +0.00%        4.10    4.16   +1.46%
  439 
  440 That's a good example of grand averages hiding the truth:  the averaged change
  441 in the mean ham score was 0 across all 5 runs, but *within* the 5 runs it slobbered
  442 around wildly, from decreasing 20% to increasing 40%(!).
  443 
  444 > spam mean                    spam sdev
  445 >   96.43   96.44   +0.01%       15.89   15.89   +0.00%
  446 >   97.01   97.07   +0.06%       13.79   13.70   -0.65%
  447 >   97.14   97.16   +0.02%       14.05   14.02   -0.21%
  448 >   96.52   96.56   +0.04%       15.65   15.52   -0.83%
  449 >   95.53   95.63   +0.10%       17.47   17.31   -0.92%
  450 >
  451 > spam mean and sdev for all runs
  452 >   96.52   96.57   +0.05%       15.46   15.37   -0.58%
  453 
  454 That's good to see:  it's a consistent win for spam scores across runs,
  455 although an almost imperceptible one.  It's good when the mean spam score rises,
  456 and it's good when sdev (for ham or spam) decreases.
  457 
  458 > ham/spam mean difference: 96.15 96.20 +0.05
  459 
   460 This is a slight win for the change, although seeing the details gives cause
  461 to worry some about the effect on ham:  the ham sdev increased overall, and
  462 the effects on ham mean and ham sdev varied wildly across runs.  OTOH, the
  463 "before" numbers for ham mean and ham sdev varied wildly across runs already.
  464 That gives cause to worry some about the data <wink>.
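
The percentage columns in the example above are plain relative changes,
and the arithmetic can be checked with a few lines of Python.  This is a
sketch for checking the numbers, not code from cmp.py; the function
names are invented here, and percent_change assumes the "before" value
is non-zero.

```python
def percent_change(before, after):
    """Relative change from before to after, as a percentage.

    Assumes before is non-zero.
    """
    return (after - before) / before * 100.0


def verdict(before, after):
    """Classify a change in an error rate, where lower is better."""
    if after < before:
        return "won"
    if after > before:
        return "lost"
    return "tied"


if __name__ == "__main__":
    # (before, after) false negative percentages from the example above.
    for before, after in [(2.007, 2.007), (2.029, 1.917), (2.703, 2.477)]:
        line = "%9.3f %6.3f  %s" % (before, after, verdict(before, after))
        if before != after:
            line += "    %+.2f%%" % percent_change(before, after)
        print(line)
```

For instance, the drop in total unique false negatives from 86 to 83 is
(83 - 86) / 86, or about -3.49%, matching the figure in the output.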
  465 
  466 
  467 Making a source release
  468 =======================
  469 
  470 Source releases are built with distutils.  Here's how I (Richie) have been
  471 building them.  I do this on a Windows box, partly so that the zip release
  472 can have Windows line endings without needing to run a conversion script.
  473 I don't think that's actually necessary, because everything would work on
  474 Windows even with Unix line endings, but you couldn't load the files into
  475 Notepad and sometimes it's convenient to do so.  End users might not even
   476 have any other text editor, so it makes things like the README unREADable.
  477 8-)
  478 
   479 Anthony would rather eat live worms than try to get a sane environment
  480 on Windows, so his approach to building the zip file is at the end.
  481 
  482  o If any new file types have been added since last time (eg. 1.0a5 went
  483    out without the Windows .rc and .h files) then add them to MANIFEST.in.
  484    If there are any new scripts or packages, add them to setup.py.  Test
  485    these changes (by building source packages according to the instructions
  486    below) then commit your edits.
   487  o Check out the 'spambayes' module twice, once with Windows line endings
  488    and once with Unix line endings (I use WinCVS for this, using "Admin /
  489    Preferences / Globals / Checkout text files with the Unix LF".  If you
  490    use TortoiseCVS, like Tony, then the option is on the Options tab in
  491    the checkout dialog).
  492  o Change spambayes/__init__.py to contain the new version number but don't
  493    commit it yet, just in case something goes wrong.
  494  o Note that if you cheated above, and used an existing checkout, you need
  495    to ensure that you don't have extra files in there.  For example, if you
  496    have a few thousand email messages in testtools/Data, setup.py will take
  497    a *very* long time.
  498  o In the Windows checkout, run "python setup.py sdist --formats zip"
  499  o In the Unix checkout, run "python setup.py sdist --formats gztar"
  500  o Take the resulting spambayes-1.0a5.zip and spambayes-1.0a5.tar.gz, and
  501    test the former on Windows (ideally in a freshly-installed Python
  502    environment; I keep a VMWare snapshot of a clean Windows installation
  503    for this, but that's probably overkill 8-) and test the latter on Unix
  504    (a Debian VMWare box in my case).
  505  o If you can, rename these with "rc" at the end, and make them available
  506    to the spambayes-dev crowd as release candidates.  If all is OK, then
  507    fix the names (or redo this) and keep going.
  508  o Dance the SourceForge release dance:
  509    http://sourceforge.net/docman/display_doc.php?docid=6445&group_id=1#filereleasesteps
  510    When it comes to the "what's new" and the ChangeLog, I cut'n'paste the
  511    relevant pieces of WHAT_IS_NEW.txt and CHANGELOG.txt into the form, and
  512    check the "Keep my preformatted text" checkbox.
  513  o Now commit spambayes/__init__.py and tag the whole checkout - see the
  514    existing tag names for the tag name format.
  515  o In either checkout, run "python setup.py register" to register the new
  516    version with PyPI.
  517  o Update download.ht with checksums, links, and sizes for the files.
  518    From release 1.1 doing a "setup.py sdist" will generate checksums
  519    and sizes for you, and print out the results to stdout.
  520  o Create OpenPGP/PGP signatures for the files.  Using GnuPG:
  521       % gpg -sab spambayes-1.0.1.zip
  522       % gpg -sab spambayes-1.0.1.tar.gz
  523       % gpg -sab spambayes-1.0.1.exe
  524    Put the created *.asc files in the "sigs" directory of the website.
  525    (Note that when you update the website, you will need to manually ssh
  526    to shell1.sourceforge.net and chmod these files so that people can
  527    access them.)
  528  o If your public key isn't already linked to on the Download page, put
  529    it there.
  530  o Update the website News, Download and Windows sections.
  531  o Update reply.txt in the website repository as needed (it specifies the
  532    latest version).  Then let Tim, Barry, Tony, or Skip know that they need
  533    to update the autoresponder.
  534  o Run "make install version" in the website directory to push the new
  535    version file, so that "Check for new version" works.
  536  o Add '+' to the end of spambayes/__init__.py's __version__, to
  537    differentiate CVS users, and check this change in.  After a number of
  538    changes have been checked in, this can be incremented and have "a0"
  539    added to the end. For example, with a 1.1 release:
  540        [before the release process] '1.1rc1'
  541        [during the release process] '1.1'
  542        [after the release process]  '1.1+'
  543        [later]                      '1.2a0'
  544        
  545 Then announce the release on the mailing lists and watch the bug reports
  546 roll in.  8-)
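
For releases where the checksums and sizes for download.ht have to be
worked out by hand, something like the following would do.  It is a
sketch, not part of setup.py: the function name is made up, and because
the steps above do not say which checksum algorithm the download page
uses, SHA-256 is only a placeholder default here.

```python
import hashlib
import os


def checksum_and_size(path, algorithm="sha256", chunk_size=65536):
    """Return (hex digest, size in bytes) for the file at path.

    The algorithm default is an assumption; pass whichever one the
    download page actually lists.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read in chunks so large release files are not held in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest(), os.path.getsize(path)
```

Calling it once per release file (the .zip, the .tar.gz and the .exe)
gives the values to paste into the page.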
  547 
  548 Anthony's Alternate Approach to Building the Zipfile
  549 
  550  o Unpack the tarball somewhere, making a spambayes-1.0a7 directory
  551    (version number will obviously change in future releases)
  552  o Run the following two commands:
  553 
  554      find spambayes-1.0a7 -type f -name '*.txt' | xargs zip -l sb107.zip 
  555      find spambayes-1.0a7 -type f \! -name '*.txt' | xargs zip sb107.zip 
  556 
   557  o This makes a zipfile where the .txt files get Windows line endings
   558    (zip -l converts LF to CR LF), but everything else is left alone.
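
On a system without find, xargs and zip, roughly the same result can be
had from Python's zipfile module.  This is an untested-in-anger sketch
rather than anything used for real releases: the function name build_zip
is invented, file timestamps and permissions are not preserved, and the
zip file should be written outside the directory being archived.

```python
import os
import zipfile


def build_zip(source_dir, zip_path):
    """Zip source_dir, giving *.txt files CR LF line endings."""
    parent = os.path.dirname(os.path.abspath(source_dir))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for dirpath, _dirnames, filenames in os.walk(source_dir):
            for name in sorted(filenames):
                full = os.path.join(dirpath, name)
                # Store paths relative to the parent, so the archive
                # unpacks into a single top-level directory.
                arcname = os.path.relpath(os.path.abspath(full), parent)
                with open(full, "rb") as f:
                    data = f.read()
                if name.endswith(".txt"):
                    # Normalise first so existing CR LF pairs are not doubled.
                    data = data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")
                archive.writestr(arcname, data)
```

For example, build_zip("spambayes-1.0a7", "sb107.zip") would mirror the
two commands above.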
  559 
  560 Making a binary release
  561 =======================
  562 
  563 The binary release includes both sb_server and the Outlook plug-in and
  564 is an installer for Windows (98 and above) systems.  In order to have
  565 COM typelibs that work with Outlook 2000, 2002 and 2003, you need to
  566 build the installer on a system that has Outlook 2000 (not a more recent
  567 version).  You also need to have InnoSetup, pywin32, resourcepackage and
  568 py2exe installed.
  569 
  570  o Get hold of a fresh copy of the source (Windows line endings,
  571    presumably).
  572  o Run the setup.py file in the spambayes/Outlook2000/docs directory
  573    to generate the dynamic documentation.
  574  o Run sb_server and open the web interface.  This gets resourcepackage
  575    to generate the needed files.
  576  o Replace the __init__.py file in spambayes/spambayes/resources with
  577    a blank file to disable resourcepackage.
  578  o Ensure that the version numbers in spambayes/spambayes/__init__.py
  579    and spambayes/spambayes/Version.py are up-to-date.
  580  o Ensure that you don't have any other copies of spambayes in your
  581    PYTHONPATH, or py2exe will pick these up!  If in doubt, run
  582    setup.py install.
  583  o Run the "setup_all.py" script in the spambayes/windows/py2exe/
  584    directory. This uses py2exe to create the files that Inno will install.
  585  o Open (in InnoSetup) the spambayes.iss file in the spambayes/windows/
  586    directory.  Change the version number in the AppVerName and
  587    OutputBaseFilename lines to the new number.
  588  o Compile the spambayes.iss script to get the executable.
  589  o You can now follow the steps in the source release description above,
  590    from the testing step.
  591 
  592 Making a translation
  593 ====================
  594 
  595 Note that it is, in general, best to translate against a stable version.
   596 That way you avoid having to repeatedly re-translate text as the code
   597 changes.  A stable version is one that has been released via the
   598 SourceForge system and that does not have a letter code at the end of
   599 its version number (e.g. 1.0.1, 1.1.2, but not 1.0a1, 1.1b1, or 2.1rc2).
   600 If you do want to translate a more recent version, be sure to discuss your
   601 plans first on spambayes-dev so that you can be warned about planned changes.
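
The "no letter code" rule can be expressed as a small check.  This is
just an illustration of the rule as stated above; the helper name
is_stable_version is hypothetical and nothing in SpamBayes is known to
use it.

```python
import re

# Digits separated by dots and nothing else: 1.0.1 matches, 1.0a1 does not.
_STABLE = re.compile(r"\d+(\.\d+)*\Z")


def is_stable_version(version):
    """Return True if version has no letter code (a, b, rc, ...) or suffix."""
    return _STABLE.match(version) is not None


if __name__ == "__main__":
    for candidate in ["1.0.1", "1.1.2", "1.0a1", "1.1b1", "2.1rc2"]:
        print(candidate, is_stable_version(candidate))
```

Note that a development version string with a trailing '+' is also
rejected by this check, since it is not a released version.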
  602 
  603 Translation is only feasible for 1.1 and above.  No translation effort
  604 is planned for the 1.0.x series of releases.
  605 
  606 To translate, you will need:
  607 
  608  o A suitable version of Python (2.2 or greater) installed.
  609    See http://python.org/download
  610 
  611  o A copy of the SpamBayes source that you wish to translate.
  612 
  613  o Resourcepackage installed.
  614    See http://resourcepackage.sourceforge.net
  615 
  616 Optional tools that may make translation easier include:
  617 
  618  o A copy of VC++, Visual Studio, or some other GUI tool that allows
  619    editing of VC++ dialog resource files.
  620 
  621  o A GUI HTML editor.
  622 
  623  o A GUI gettext editor, such as poEdit.
  624    http://poedit.sourceforge.net
  625 
  626 Setup
  627 -----
  628 
  629 You will need to create a directory structure as follows:
  630 
  631 spambayes/                                    # spambayes package directory
  632                                               # containing classifier.py, tokenizer.py, etc
  633           languages/                          # root languages directory,
  634                                               # possibly already containing
  635                                               # other translations
  636                     {lang_code}/              # directory for the specific
  637                                               # translation - {lang_code} is
  638                                               # described below
  639                                 DIALOGS/      # directory for Outlook plug-in
  640                                               # dialog resources, which should contain an
  641                                               # empty __init__.py file, so that py2exe can
  642                                               # include the directory
  643                                 LC_MESSAGES/  # directory for gettext managed
  644                                               # strings, which should also contain an
  645                                               # empty __init__.py file
  646                                 __init__.py   # Copy of spambayes/spambayes/resources/__init__.py
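
The layout above could be created with a short script along the following
lines.  This is only a sketch, written for a current Python, and is not part
of SpamBayes: the helper name make_language_skeleton is hypothetical, and it
assumes you pass it the path of the spambayes package directory.  The copy of
resources/__init__.py is skipped if that file cannot be found.

```python
import os
import shutil


def make_language_skeleton(package_dir, lang_code):
    """Create languages/{lang_code}/ with DIALOGS and LC_MESSAGES inside.

    Hypothetical helper, not part of SpamBayes.  package_dir is the
    spambayes package directory (the one containing classifier.py).
    """
    lang_dir = os.path.join(package_dir, "languages", lang_code)
    for sub in ("DIALOGS", "LC_MESSAGES"):
        sub_dir = os.path.join(lang_dir, sub)
        if not os.path.isdir(sub_dir):
            os.makedirs(sub_dir)
        # Empty __init__.py so that py2exe can include the directory.
        init_py = os.path.join(sub_dir, "__init__.py")
        if not os.path.exists(init_py):
            open(init_py, "w").close()
    # The language directory's own __init__.py is a copy of the one
    # in the resources directory; existing files are left untouched.
    src = os.path.join(package_dir, "resources", "__init__.py")
    dst = os.path.join(lang_dir, "__init__.py")
    if os.path.exists(src) and not os.path.exists(dst):
        shutil.copyfile(src, dst)
    return lang_dir
```

For example, make_language_skeleton("spambayes", "de_DE") would prepare
spambayes/languages/de_DE/.  Running it a second time should leave existing
files as they are.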


Translation Tasks
-----------------

There are four translation tasks:

 o Documentation.  This is the least exciting, but the most important.
   If the documentation is appropriately translated, then even if elements
   of the interface are not translated, users should be able to manage.

   A method of managing translated documents has yet to be created.  If you
   are interested in translating documentation, please contact
   spambayes-dev@python.org.

 o Outlook dialogs.  The majority of the Outlook plug-in interface is
   handled by a VC++/Visual Studio dialog resource file pair (dialogs.h
   and dialogs.rc).  The plug-in code then manipulates these resources to
   create the actual dialogs.

   The easiest method of translating these dialogs is to use a tool like
   VC++ or Visual Studio.  Simply open the
   'Outlook2000\dialogs\resources\dialogs.rc' file, translate the dialogs,
   and save the file as
   'spambayes\languages\{lang_code}\DIALOGS\dialogs.rc', where {lang_code}
   is the appropriate language code for the language you have translated
   into (e.g. 'en_GB', 'es', 'de_DE').  If you do not have a GUI tool to
   edit the dialogs, simply open the dialogs.rc file in a text editor,
   manually change the appropriate strings, and save the file as above.

   Once the dialogs are translated, you need to use the rc2py.py utility
   to create the i18n_dialogs.py file.  For example, in the
   'Outlook2000\dialogs\resources' directory:
     > rc2py.py {base}\spambayes\languages\de_DE\DIALOGS\dialogs.rc
       {base}\spambayes\languages\de_DE\DIALOGS\i18n_dialogs.py 1
   where {base} is the directory that contains the spambayes package directory.
   This should create an 'i18n_dialogs.py' file in the same directory as your
   translated dialogs.rc file; this is the file that the Outlook plug-in
   uses.

 o Web interface template file.  The majority of the web interface is
   created by dynamic use of an HTML template file.

   The easiest method of translating this file is to use a GUI HTML editor.
   Simply open the 'spambayes/resources/ui.html' file, translate
   it as described within, and save the file as
   'spambayes/languages/{lang_code}/i18n.ui.html', where {lang_code} is
   the appropriate language code as described above.  If you do not have
   a GUI HTML editor, or are happy editing HTML by hand, simply use your
   favorite text editor to do this task.

   Once the template file is created, resourcepackage will automatically
   create the required ui_html.py file when SpamBayes is run with that
   language selected.

 o Gettext managed strings.  The remaining strings in both the Outlook
   plug-in and the web interface are contained within the various Python
   files that make up SpamBayes.  The Python gettext module (very similar
   to the GNU gettext system) is used to manage translation of these strings.

   To translate these strings, use the translation template
   'spambayes/languages/messages.pot'.  You can regenerate that file, if
   necessary, by running this command in the spambayes package directory:
     > {python dir}\tools\i18n\pygettext.py -o languages\messages.pot
       ..\contrib\*.py ..\Outlook2000\*.py ..\scripts\*.py *.py
       ..\testtools\*.py ..\utilities\*.py ..\windows\*.py

   You may wish to use a GUI system, such as poEdit, to create the required
   messages.po file, but you can also do this manually with a text editor.
   If your utility does not do it for you, you will also need to
   compile the .po file to a .mo file.  The utility msgfmt.py will do
   this for you; it should be located in '{python dir}\tools\i18n'.
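
As a quick check that a compiled catalog can be found, the standard gettext
module can be asked to load it directly.  The following is a minimal sketch
for a current Python, not the code SpamBayes itself uses to load
translations: the helper name load_translation is hypothetical, and the
'messages' domain (that is, a file named messages.mo inside LC_MESSAGES) is
an assumption based on the messages.pot and messages.po names above.

```python
import gettext


def load_translation(languages_dir, lang_codes, domain="messages"):
    """Return a translations object for the first available language.

    Hypothetical helper, not part of SpamBayes.  gettext looks for
    {languages_dir}/{lang_code}/LC_MESSAGES/{domain}.mo for each code in
    lang_codes, in order.  With fallback=True, a missing catalog yields
    an object that simply returns the original, untranslated strings.
    """
    return gettext.translation(domain, localedir=languages_dir,
                               languages=lang_codes, fallback=True)
```

For example, after load_translation("spambayes/languages", ["de_DE"]), calling
the returned object's gettext() method with a string from messages.pot should
give the German text if the .mo file was found, and the original string
otherwise.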

Testing the translation
-----------------------

There are two ways to set the language that SpamBayes will use:

 o If you are using Windows, change the preferred Windows language using
   the Control Panel.

 o Set the '[globals] language' SpamBayes option to a list of the
   preferred language(s).
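
The second method could be scripted roughly as follows.  This is a sketch
under stated assumptions rather than a SpamBayes tool: the helper name
set_language is hypothetical, the path of the configuration file is left to
the caller, and writing the list as space-separated language codes is an
assumption that should be checked against the option's documentation.  Note
that rewriting the file this way drops any comments it contained.

```python
import configparser


def set_language(ini_path, lang_codes):
    """Write the '[globals] language' option to an INI-style file.

    Hypothetical helper, not part of SpamBayes.
    ASSUMPTION: the list of languages is stored space-separated.
    """
    # interpolation=None keeps any '%' characters in existing values intact.
    config = configparser.ConfigParser(interpolation=None)
    config.read(ini_path)  # a missing file is silently ignored
    if not config.has_section("globals"):
        config.add_section("globals")
    config.set("globals", "language", " ".join(lang_codes))
    with open(ini_path, "w") as f:
        config.write(f)
```

For example, set_language(path, ["de_DE", "en_US"]) would request German
first, with US English as the second choice.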