
Member "statist-1.4.2/doc/manual-en.tex" (15 Nov 2006, 29496 Bytes) of package /linux/privat/old/statist-1.4.2.tar.gz:



    1 \documentclass[12pt,english]{article}
    2 \usepackage[T1]{fontenc}
    3 \usepackage[latin1]{inputenc}
    4 \usepackage{times}
    5 \usepackage{a4wide}
    6 \usepackage{graphicx}
    7 \usepackage{babel}
    8 \usepackage[pdftex,bookmarks=false,linkbordercolor={0.5 1 0.5}]{hyperref}
    9 
   10 \newcommand{\st}{{\tt sta\-tist} }
   11 
   12 \begin{document}
   13 
   14 \title{STATIST 1.4.1\\User Manual}
   15 \author{Jakson Alves de Aquino\\
   16 {\small {\tt jalvesaq@gmail.com}}}
   17 \date{September 5, 2006}
   18 
   19 \maketitle
   20 
   21 \tableofcontents
   22  
   23 \section{Introduction}
   24 
   25 {\tt Statist} is an easy to use, light weight statistics
   26 program.  Everything is in an interactive menu: you have
   27 just to choose what you need. {\tt Statist} is Free Software
   28 under GNU GPL and comes with absolutely no guarantee. 
   29 
   30 This manual is an incomplete and non literal translation
   31 from the original text written by Dirk Melcher, but with the
   32 addition of new material. I'm grateful to Bernhard Reiter
   33 for his suggestions of improvements to this document.
   34 
   35 \section{Warnings for Windows users}
   36 
Users of GNU/Linux are much more accustomed to using console
applications.  One helpful feature is command line
completion, where a long file name is completed after
typing its first letters and pressing tab.  The
terminal emulators, where you type in commands, can save and
scroll back over many lines of output.  Most
importantly, GNU/Linux is Free Software: anybody can
inspect what the computer does and many people can fix bugs
to make it more secure.  Please, as soon as you can, try
\st on a Free Software operating system like GNU/Linux or
FreeBSD.
   48 
   49 To create graphics with \st you will need a version of {\tt
   50 gnuplot} that comes with {\tt pgnuplot}. Under Windows, you
   51 can't send commands to {\tt gnuplot} through {\tt sta\-tist}, as it is
   52 possible under Linux, but you can type the commands in the
   53 {\tt gnuplot} window.
   54 
   55 Be careful: Don't close the {\tt gnuplot} window.  You can
   56 close only the graphic! If you close the {\tt gnuplot}
   57 window you will have to restart \st to be able to create
   58 graphics again.
   59 
Some of the software used to manipulate data files isn't part of
{\tt sta\-tist}, but it is available for Windows. Please search the
Internet for the package gnucoreutils, which is one
of the GnuWin32 packages. Note, however, that its
installation and use might not be trivial for a Windows
user.  Like {\tt sta\-tist}, these tools are easier to use in a
Linux terminal emulator than in a DOS window.
   67 
   68 The \st documentation can be found at {\tt
   69 C:$\backslash$Program Files$\backslash$statist}, where there
   70 is also a sample configuration file for {\tt sta\-tist}. You
   71 can rename it to {\tt statistrc.txt} and edit it according
   72 to your preferences.
   73 
   74 Unfortunately, \st can't produce colorized output under DOS.
   75 
   76 \section{Installation from source code}
   77 
   78 \begin{enumerate}
   79 
   80 \item Open a terminal.
   81 
   82 \item Unpack the source code, compile the program, and
   83 become root to install it. That is, type:
   84 
   85 \end{enumerate}
   86 
   87 \begin{verbatim}
   88     tar -xvzf statist-1.4.1.tar.gz
   89     cd statist-1.4.1
   90     make
   91     # optional, if you have "check" installed
   92     make check
   93 
   94     # install for all users as root
   95     su -
   96     cd path-to/statist-1.4.1
   97     make install
   98     exit
   99 
  100 \end{verbatim}
  101 
  102 This is the default installation that should work in most
  103 GNU/Linux distributions. If the above instructions are not
  104 enough for your case, please see the file README for details
  105 on how to install \st from source code.
  106 
  107 \section{Invocation}
  108 
  109 You can simply type:
  110 \begin{verbatim}
  111     statist data_file
  112 \end{verbatim}
  113 
However, there are also some options that you might find
useful. The general form of the invocation is:
  116 
  117 \begin{verbatim}
  118  statist [ options ] data_file [ options ] 
  119 \end{verbatim}
  120 
  121 The only option that you need to memorise is {\tt {-}-help},
  122 or simply {\tt -h}, which will output the list of options.
  123 
  124 You can also create and edit the file {\verb ~/.statistrc }
  125 and set some options there.  If you have root privileges,
  126 you can also create the file {\tt /etc/statistrc}.  Options
  127 passed by the command line override the ones read from the
  128 {\tt statistrc} file.  You can find a sample {\tt statistrc}
  129 in the documentation directory (usually {\tt
  130 /usr/share/doc/statist}). Finally, if you choose the menu
  131 item {\em Preferences}, you can modify some options during \st
  132 execution.
  133 
  134 \section{Menu}
  135 
The program has a simple menu that makes it very easy to
use. There is no need to remember commands. Typing `0'
takes you to the next higher menu level, or quits the
program if you are already in the {\em Main menu}. One tip
is important: if you have chosen a menu entry by mistake,
you can always cancel the process by pressing the <Return>
key before entering any value or answering any question.
Then, the last menu will be printed again.
  144 
If you choose a statistical procedure from the menu, you
will be asked to choose the variables. Often, it's not
necessary to type the entire name of a column when inputting
variable names for analyses. For example, if you have a
column named
  150 
  151 \begin{center}
  152 this\_really\_is\_a\_big\_name
  153 \end{center}
  154 
  155 \noindent and there is no other column starting with the
  156 letter `t', you can simply type `t'. Finally, if you want to
  157 select all columns, you might simply type ``all'' as the
  158 name of the first column.
  159 
  160 Actually, the whole process is self-explanatory, and you
  161 would be able to use the program even without reading this
  162 short explanation.
  163 %Click <a href="menulist.html">here</a> to see the complete
  164 %menu.
  165 
  166  
  167 \section{Statist and Gnuplot}
  168 
  169 Gnuplot is an interactive program that makes graphical
  170 presentations from data and functions, and \st creates {\tt
  171 gnuplot} graphics for some functions. Normally, you will not
  172 have to open {\tt gnuplot} manually. The prerequisite to use
  173 it is simply that the program is installed and in the PATH.
  174 
If you know {\tt gnuplot} syntax, you can refine or
personalize your graphics by entering {\tt gnu\-plot}
commands. To do that, choose the menu option {\em
Miscellaneous} {\textbar} {\em Enter gnuplot commands}. You can change many
things in the graphic, like line colors and types, axes
labels, etc. Even if you don't know {\tt gnuplot} syntax,
you can at least change the graphic's title and axes labels
because a list of the last commands sent to {\tt gnuplot}
will be printed on the screen.  The changes will be applied
to the graphic currently being displayed with the {\tt
gnuplot} command ``replot''.
  186 
The {\tt gnuplot} graphics can be disabled by invoking the
program with the option {\verb --noplot }. This can be
useful if, for example, you work with batch processing
or if your database is so big that {\tt gnuplot}
graphics are generated too slowly.
  192 
  193 \subsection{Box-plot}
  194 
  195 You probably will have no problem interpreting \st graphics.
  196 The only one that might need some explanation is the {\em
  197 Box-and-Whisker Plot}. The picture below shows the meaning
  198 of each piece of this graphic:
  199 
  200 \begin{center}
  201 \includegraphics{boxplot-en}
  202 \end{center}
  203 
  204 \subsection{UTF-8}
  205 
You might experience some problems with \st graphics
made through gnuplot if your locale environment is set to
UTF-8 and your language has non-ASCII characters. The
problem is that gnuplot will normally interpret titles and
labels as if they were encoded in a single-byte character set,
like ISO-8859-1 (Latin 1), even if the terminal emulator
charmap is set to UTF-8. It's possible to mix letters of
different character sets (Greek and Latin 1, for example) in
a single graphic. Please see the web page below for
the details:
  216 
  217 \href{http://statist.wald.intevation.org/utf8.html}
  218 {http://statist.wald.intevation.org/utf8.html}
  219 
  220 \section{Data}
  221 
  222 \subsection{The file format}
  223 
{\tt Statist} reads data from simple ASCII files (text
files).  If the program is not invoked with an ASCII file
name, it will immediately ask for the name of a data file.
Without a data file, there is nothing to do, unless you
give the option \verb --nofile  while invoking the
program in order to use the keyboard to input data manually
(choose from the menu: {\em Data management} {\textbar} {\em Read column
from terminal}). However, it is rarely reasonable to do
this. It is more comfortable to use a text editor or a
spreadsheet program like {\em OpenOffice Calc} or {\em
Gnumeric}. In this case, save the file as .csv.
  235 
  236 But be careful, because \st always uses a dot as decimal
  237 delimiter while working with data files. If the decimal
  238 delimiter in your language is a comma, \st might fail to
  239 correctly read the file. Thus, before typing your
  240 data, you can try to open the spreadsheet program in a
  241 terminal with locale set to ``C'', as below:
  242 
  243 \begin{verbatim}
  244   export LC_ALL=C
  245   oocalc &
  246 \end{verbatim}
  247 
If you really need to use a data file with commas as decimal
delimiters, \st will convert each comma inside a quoted
number into a dot. If the numbers using commas as decimal
delimiter are not between double quotes, it will be
necessary to set the decimal delimiter manually.  You might
be asked to set the file format. If not, choose the menu
item {\em Data management} {\textbar} {\em File format
options}. Alternatively, you can run \st as in the example:
  256 
  257 \begin{verbatim}
  258   statist datafile.csv --dec ","
  259 \end{verbatim}
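
Another possibility, shown here only as a sketch, is to convert
the decimal commas before giving the file to {\tt sta\-tist}.
The example assumes that the fields are separated by
semicolons, so that every comma between two digits is a
decimal delimiter; the file name {\tt comma.csv} is
hypothetical.

```shell
# Sketch: turn decimal commas into dots with sed.
# Assumption: fields are separated by semicolons, so a comma that
# sits between two digits can only be a decimal delimiter.
tmp=$(mktemp -d)
printf '1,5;2,25\n3;4,75\n' > "$tmp/comma.csv"   # hypothetical sample file

converted=$(sed 's/\([0-9]\),\([0-9]\)/\1.\2/g' "$tmp/comma.csv")
printf '%s\n' "$converted"

rm -r "$tmp"
```

If the file uses commas both as field separators and as
decimal delimiters, this substitution is not safe and the
{\tt {-}-dec} option is the better choice.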
  260 
  261 A data-file for \st consists of one or several columns of
  262 data.  The columns of numbers must be separated from each
  263 other by double quotes, tab characters, empty spaces, commas
  264 or semi-colons. These characters are ignored and, thus, it's
  265 possible to have any amount of them between two fields. For
  266 example, \st will read the same data from the two files
  267 below:
  268 
  269 \begin{verbatim}
  270 #Example data-file for statist  #Example data-file for statist
  271   1  3  5  6                     1,3,"5",6
  272   7  8  9 10                     ,7 8 ;, 9 10
  273  11 12 13 14                     11;12;13;14;;
  274 \end{verbatim}
  275 
As you can infer from the above examples, comments begin
with the symbol `\#' and are ignored. Empty lines are also
ignored.
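
The rule above can be imitated with standard tools. The
following sketch is not the parser used by {\tt sta\-tist}; it
only shows, under the assumptions stated in the comments,
how the second example file reduces to the first one. The
file name {\tt messy.dat} is hypothetical.

```shell
# Sketch: imitate how runs of separator characters collapse into one.
# messy.dat is a hypothetical file name used only for this example.
tmp=$(mktemp -d)
cat > "$tmp/messy.dat" <<'EOF'
#Example data-file for statist
1,3,"5",6
,7 8 ;, 9 10

11;12;13;14;;
EOF

# 1. drop comment lines and empty lines
# 2. turn every run of space, tab, quote, comma or semicolon into one space
# 3. trim the single space that may remain at either end of a line
normalized=$(grep -v -e '^#' -e '^$' "$tmp/messy.dat" \
  | tr -s ' \t",;' ' ' \
  | sed -e 's/^ //' -e 's/ $//')
printf '%s\n' "$normalized"

rm -r "$tmp"
```

Every line ends up with the same four fields as the
corresponding line of the first example file.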
  279 
  280 \subsection{Column names and variable labels}
  281 
When \st reads the data file, each column is assigned a
name. The first column will be column `a', the second will
be `b', etc. However, it will be easier to understand a data
file with many variables if its columns have more meaningful
names. The first non-comment line of the data file might
contain the column names. {\tt Statist} tries to detect
the names using a very simple algorithm: it checks whether
all fields in the first non-comment line begin with a letter
of the English alphabet. If any of the fields begins with a
character that isn't between `a' and `z' or `A' and `Z', it
will consider that the data file doesn't have a header.  If
\st fails in this task, you can set the correct file format
by choosing the menu item {\em Data management}
{\textbar} {\em File format options}.  Another solution to
this problem is the use of the command line options {\tt
{-}-header} or {\tt {-}-noheader}.
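
To predict what the detection will decide for a given file,
the test described above can be sketched in {\tt awk}. This
is an illustration of the description, not the code used by
{\tt sta\-tist}; the function name {\tt has\_header} and the
file names are hypothetical, and the sketch assumes that the
fields are separated by blanks.

```shell
# Sketch of the header test described above (not statist's real code).
# has_header prints "yes" or "no" for the file given as its argument.
has_header() {
  LC_ALL=C awk '
    /^#%/           { print "yes"; exit }   # explicit header marker
    /^#/ || NF == 0 { next }                # skip comments and empty lines
    {
      # first non-comment line: every field must start with a-z or A-Z
      answer = "yes"
      for (i = 1; i <= NF; i++)
        if ($i !~ /^[A-Za-z]/) answer = "no"
      print answer
      exit
    }' "$1"
}

tmp=$(mktemp -d)
printf 'kow kaw ec50\n0.34 4.56 0.23\n'   > "$tmp/with_names.dat"
printf '# only numbers\n0.34 4.56 0.23\n' > "$tmp/no_names.dat"
r1=$(has_header "$tmp/with_names.dat")
r2=$(has_header "$tmp/no_names.dat")
printf '%s %s\n' "$r1" "$r2"
rm -r "$tmp"
```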
  299 
Alternatively, you can explicitly state in the data file
that the header is present by putting the
``\#\%'' string at the beginning of the line. In this last
alternative, as in comment lines, the line must begin with
one `\#', but this symbol must be followed by one `\%'.
With its default configuration, \st can read the two
examples of data file below simply by typing ``{\tt
statist~file}'':
  308 
  309 \begin{verbatim}
  310 #%kow kaw ec50              kow kaw ec50
  311 0.34 4.56 0.23              0.34 4.56 0.23
  312 1.23 5.45 6.76              1.23 5.45 6.76
  313 6.78 1.34 9.60              6.78 1.34 9.60
  314 \end{verbatim}
  315 
The number of variable names declared must be exactly the
same as the number of columns. Only letters, digits, and
`\_' are allowed in names, and letters with
accents may cause problems.  If you use the option
\verb --labels  {\tt labels\_file} \st will use the value
labels and the column titles present in {\tt labels\_file}.
When producing some graphics and analyses, \st will replace
column names and variable values with their labels. A {\tt
labels\_file} is a list of column names plus their labels,
each followed by a list of values with their labels. Information
for different columns is separated by a blank line, as in
the example:
  328 
  329 \begin{verbatim}
  330 stat Do you like statistics?
  331 0 No
  332 1 Yes
  333 2 No answer
  334 
  335 color What's your favorite color?
  336 0 Red
  337 1 Green
  338 2 Blue
  339 3 Other
  340 \end{verbatim}
  341 
In the above example, the data file has a column named
``stat'' and another named ``color''. The values of the
variable ``stat'' are always ``0'', ``1'', or ``2''. You can
use the same labels file for different data files. There
is no problem if some columns remain without labels, or if
some labels don't find their column in the database. Thus,
if you have a database with hundreds of columns and want to
work with various subsets that share some columns, you can
write one single labels file. If you choose in the menu the
option {\em Read another file}, the labels will be applied to
the appended columns. Note: long value labels need a lot of
space and the table of {\em Compare means} may no longer
fit on the screen; if you have long labels, you will be
able to run {\em Compare means} with only a few columns at
a time.
  357 
  358 \subsection{Missing values}
  359 
{\tt Statist} can deal with data files with missing values
({\em not available} values), and there are two ways of
indicating that a value is missing. The first one is to use
a specific string where the value is missing. By default,
\st interprets the string ``M'' as the indicator of a missing
value, but you can choose a different string in the {\tt
statistrc} file, using the argument {\tt
{-}-na-string~<string>} in the command line, or in the menu
item {\em Data management} {\textbar} {\em File format
options}.
  370 
Because \st interprets any number of ignored characters
(``{\tt ~",;}$\backslash${\tt t}'') as one single field
separator, two adjacent field separators will not be
interpreted as a missing value. Instead, \st will
report that the line has fewer columns than it should.
This is the default behavior, but it can be changed
in the {\tt statistrc}, with the command line option {\tt
{-}-sep~<char>}, or, again, in the menu item {\em Data
management} {\textbar} {\em File format options}. With this
option, only one specific character will be interpreted as
field separator.  Thus, the following data files will be
read as the same, but the second one needs the option
\verb --sep \verb "," :\footnote{Even with the option {\tt {-}-sep},
the default algorithm is used to parse the line
with column names. Hence, it's not allowed to have missing
column names.}
  387 
  388 \begin{verbatim}
  389   1  3  5  6                     1,3,5,6
  390   7  M  9 10                     7,,9,10
  391  11 12  M 14                     11,12,,14
  392 \end{verbatim}
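
If you prefer to keep the default separator handling, the
second file can be rewritten in the first form beforehand.
The following is a minimal sketch, assuming that the comma is
the only field separator in the file and that ``M'' is the
missing value string; the file name {\tt sep.csv} is
hypothetical.

```shell
# Sketch: replace empty comma-separated fields by the string M.
tmp=$(mktemp -d)
printf '1,3,5,6\n7,,9,10\n11,12,,14\n' > "$tmp/sep.csv"   # hypothetical file

# With -F, awk treats every single comma as a separator, so the empty
# field between two adjacent commas is kept and can be replaced.
marked=$(awk -F, -v OFS=' ' '{
  $1 = $1                              # force awk to rebuild the line with OFS
  for (i = 1; i <= NF; i++)
    if ($i == "") $i = "M"             # empty field: missing value
  print
}' "$tmp/sep.csv")
printf '%s\n' "$marked"

rm -r "$tmp"
```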
  393 
  394 Each column of the database is saved as a temporary binary
  395 file, where all values are stored as double precision
  396 floating point numbers (real numbers). These files are
  397 erased when you quit {\tt sta\-tist}. The missing values are
  398 stored as the smallest possible number, that is: $-1.79769
  399 \times 10^{308}$.  You have to be sure that this number
  400 isn't in your data file as a valid number, because it would
  401 not be treated as a very small number; it would be
  402 interpreted as a missing value.
  403 
Before each analysis, \st reads the selected columns from
temporary files into RAM, and, if necessary, either deletes
the rows that have at least one missing value or simply
deletes missing values. However, the deletions occur only in
a copy of the temporary files that is created in the
computer memory.  The temporary files remain intact until
you quit the program. For example, the menu option {\em
Regressions and correlations} {\textbar} {\em Multiple linear correlation}
will delete all rows that have missing values in any one of
the chosen columns. You should do this analysis if each row
in your database represents a single case, which is very
common in social sciences. The menu option {\em Tests}
{\textbar} {\em t-test for comparison of two means of two samples} will
delete every missing value, but a missing value in a column
will not cause the entire row to be deleted. You should use
this analysis if, for example, the columns in your database
represent different series of similar experiments, and you
would like to compare the two sets of results.
  422 
  423 \subsection{Reading and saving files}
  424 
  425 If you want to work only with subsets of your database, you
  426 can write columns into a text file (ASCII file), choosing
  427 the menu option {\em Data Management} {\textbar} {\em Export columns as
  428 ASCII-data}. You can also read data from several files
  429 simultaneously ({\em Data Management} {\textbar} {\em Read another file}).
  430 When you {\em Read another file}, new columns are added to
  431 the database, and if a column name in the new file is
  432 already in use in the current database, the symbol ``\_''
  433 will be appended to it.
  434 
  435 Another possibility is to join columns ({\em Data
  436 manipulation} {\textbar} {\em Join columns}). In this case, the selected
  437 columns will be concatenated in a bigger one.
  438 
  439 \section{Manipulating databases}
  440 
  441 \subsection{Extracting columns from fixed width data files}
  442 
  443 To extract columns from a fixed width data file, and save
  444 them in a \st data file, type:
  445 
  446 \begin{verbatim}
  447     statist --xcols config_file original_datafile new_datafile
  448 \end{verbatim}
  449 
  450 The content of a {\tt config\_file} is simply a list of
  451 variable names and their position in the fixed width data
  452 file, as in the example below:
  453 
  454 \begin{verbatim}
  455 born 1-4
  456 sex 8
  457 income 11-15
  458 \end{verbatim}
  459 
  460 With the above config\_file, \st would read the following
  461 database:
  462 
  463 \begin{verbatim}
  464 1971 522   2365
  465 19609991  32658
  466 19455632       
  467 19674131  32684
  468 \end{verbatim}
  469 
  470 And output:
  471 
  472 \begin{verbatim}
  473 #%born sex     income
  474 1971    2       2365
  475 1960    1       32658
  476 1945    2       M
  477 1967    1       32684
  478 \end{verbatim}
  479 
{\tt Statist} will not add the ``\verb #% '' string to the
first line if either it was called with the command line
option {\tt {-}-header} or the {\tt statistrc} file has the
option {\tt autodetect\_header = yes}. The string used to
mark missing values can also be defined in the {\tt
statistrc} file or with the command line options. The columns
are separated by a blank space, unless you have chosen
something different with the command line option {\tt
{-}-sep}. Non-numeric values are extracted and put between
double quotes in the {\tt new\_datafile}, although \st is
unable to read them. You would need to replace them with
numeric codes.
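
To make the meaning of the {\tt config\_file} positions
concrete, the extraction can be imitated with {\tt awk}. This
is only a sketch of the behavior described above, not the
implementation of {\tt {-}-xcols}: it always writes the
``\verb #% '' header, always uses ``M'' and a blank space, and
does not quote non-numeric values.

```shell
# Sketch: imitate the fixed width extraction with awk.
tmp=$(mktemp -d)
cat > "$tmp/config_file" <<'EOF'
born 1-4
sex 8
income 11-15
EOF
cat > "$tmp/original_datafile" <<'EOF'
1971 522   2365
19609991  32658
19455632       
19674131  32684
EOF

extracted=$(awk '
  NR == FNR {                          # first file: column definitions
    n++
    name[n] = $1
    split($2, range, "-")
    first[n] = range[1]
    # a single position such as "8" has no "-": the width is 1
    width[n] = (range[2] == "" ? 1 : range[2] - range[1] + 1)
    next
  }
  FNR == 1 {                           # before the first data line: header
    header = "#%" name[1]
    for (i = 2; i <= n; i++) header = header " " name[i]
    print header
  }
  {
    line = ""
    for (i = 1; i <= n; i++) {
      value = substr($0, first[i], width[i])
      gsub(/ /, "", value)             # drop padding blanks
      if (value == "") value = "M"     # nothing left: missing value
      line = line (i > 1 ? " " : "") value
    }
    print line
  }' "$tmp/config_file" "$tmp/original_datafile")
printf '%s\n' "$extracted"

rm -r "$tmp"
```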
  492 
  493 \subsection{Extracting a sample from a database}
  494 
If you work with a very big database that you don't yet
know very well, you may find it useful to begin
exploring it using a sample, which
would be faster than using the entire database. After
discovering which analyses are most relevant for your
research, you could re-run these analyses with the original
database.
  502 
  503 To extract a percentage of the database rows, invoke \st in
  504 the following way:
  505 
  506 \begin{verbatim}
  507    statist --xsample percentage database dest_file
  508 \end{verbatim}
  509 
\noindent where {\tt percentage} must be an integer
between 1 and 99. The new database,\linebreak {\tt dest\_file}, will be
created with {\em approximately} the requested percentage of
rows extracted from {\tt database}.
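
The word {\em approximately} is easier to understand with a
sketch of one possible way of drawing such a sample: each row
is kept independently with the requested probability, so the
number of rows varies around the target. This is an
assumption made for illustration, not necessarily the method
used by {\tt sta\-tist}. The sketch also assumes that the
first line holds the column names.

```shell
# Sketch: keep each row with probability percentage/100.
tmp=$(mktemp -d)
# hypothetical database: one line of column names plus 1000 rows
{
  echo "id value"
  awk 'BEGIN { for (i = 1; i <= 1000; i++) print i, i * 2 }'
} > "$tmp/database"

percentage=10
awk -v p="$percentage" '
  BEGIN { srand(42) }                  # fixed seed: reproducible sample
  NR == 1 { print; next }              # always keep the column names
  rand() * 100 < p                     # keep the row with probability p/100
' "$tmp/database" > "$tmp/dest_file"

first_line=$(head -n 1 "$tmp/dest_file")
rows=$(($(wc -l < "$tmp/dest_file") - 1))
printf 'kept %s of 1000 rows\n' "$rows"

rm -r "$tmp"
```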
  514 
  515 \subsection{Recoding a data base}
  516 
  517 For some kinds of data manipulation we will need some
  518 programs that are not part of {\tt sta\-tist}, but are
  519 available in most GNU/\-Linux distributions (and are also
  520 installable under DOS/\-Win\-dows). For small data files, with
  521 few variables, you can use your preferred text editor or
  522 spreadsheet program. However, if your file is too big, or
  523 has too many variables, it might be more convenient to use
  524 the tools described here and in the following sections.
  525 
Sometimes, we need to recode some values in a database.
Suppose, for example, that in a given data file, the value
``999'' means missing value for the variable age, and in
some analyses we want ``age classes'' and not ``age''.  We
still want to use the variable ``age'' in other analyses,
and, thus, we need to recode ``age'' into a different
variable. To create the new database with the recoded
variable we could use {\tt awk}, an external program.
Suppose that the column ``age'' is the second one:
  535 
  536 \begin{verbatim}
awk '{if(/age/) {print $0 "\t" "AGE1"}
        else {
          if(NF == 0) {print $0}
            else {
              if ($2 <= 20){age1 = 1} else
                if ($2 > 20 && $2 <= 50){age1 = 2} else
                if ($2 > 50 && $2 < 999){age1 = 3} else
                {age1 = "M"}
              {print $0 "\t" age1}
            }
       }
     }' datafile.csv > newfile.csv
  549 \end{verbatim}
  550 
  551 
The expressions inside the quotes are {\tt awk} commands.
With this command, {\tt awk} would read the following data
file:
  555 
  556 \begin{verbatim}
  557 sex age
  558 2     23
  559 1     88
  560 2     10
  561 2     36
  562 3     999
  563 1     55
  564 \end{verbatim}
  565 
  566 And output:
  567 
  568 \begin{verbatim}
sex  age     AGE1
2       23      2
1       88      3
2       10      1
2       36      2
3       999     M
1       55      3
  576 \end{verbatim}
  577 
At first, the {\tt awk} command might look complex,
but each of its parts is simple:
  580 
  581 \begin{description}
  582 
  583 \item {\tt \$}: The symbol  `{\tt \$}' means ``field'', that
  584 is, a column of a \st data file.
  585 
  586 \item {\tt \$0}: has a special meaning: the {\em entire
  587 line}.
  588 
\item {\tt if(/age/) \{print \$0 ``$\backslash$t''
``AGE1''\}}: If the line contains the string ``age'', print the
entire line plus a tab character plus the string ``AGE1''.
This line contains our column names (unless a comment in
the data file also contains the string ``age'').
  594 
  595 \item {\tt if(NF == 0) \{print \$0\}}: If the number of
  596 fields is zero, simply print the entire line.
  597 
\item {\tt if (\$2 > 20 \&\& \$2 <= 50)\{age1 = 2\}}: If the
second field has a value higher than 20 and lower than or equal
to 50, the value of the variable ``age1'' will be 2.
  601 
  602 \item {\tt print \$0 ``$\backslash$t'' age1}: Print the
  603 entire line plus a tab character plus the value of the
  604 variable {\tt age1}.
  605 
  606 \end{description}
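
The command above depends on ``age'' being the second column.
As a variation, the following sketch looks up the position of
the column in the line with the column names. It assumes that
this line is the first line of the file and that a column
named ``age'' exists; the variable names {\tt col}, {\tt pos}
and {\tt c} are arbitrary.

```shell
# Sketch: recode "age" into classes, finding the column by its name.
tmp=$(mktemp -d)
printf 'sex age\n2 23\n1 88\n2 10\n2 36\n3 999\n1 55\n' > "$tmp/datafile.csv"

recoded=$(awk -v col=age '
  NR == 1 {                            # column names: find where "age" is
    for (i = 1; i <= NF; i++) if ($i == col) pos = i
    print $0 "\t" "AGE1"
    next
  }
  NF == 0 { print; next }              # keep empty lines unchanged
  {
    v = $pos
    if (v <= 20)       c = 1
    else if (v <= 50)  c = 2
    else if (v < 999)  c = 3
    else               c = "M"         # 999 means missing value
    print $0 "\t" c
  }' "$tmp/datafile.csv")
printf '%s\n' "$recoded"

rm -r "$tmp"
```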
  607 
We also use {\tt awk} to select cases and compute new
variables. So, please refer to its manual or info page for
more details on its usage (in a terminal, type {\tt info
awk}). Frequently, our {\tt awk} commands will begin by testing
whether the line contains the column names and whether it is
an empty line.
  614 
  615 \subsection{Selecting cases and computing new variables}
  616 
We can use {\tt awk} to accomplish two other tasks: (1)
create a new database by selecting only some cases from an
existing data file, and (2) compute a new variable using the
values of some existing variables. Here we show only two
examples of {\tt awk} usage.

Suppose that the second column of a data file has the
variable ``sex'', coded `0' for males and `1' for females,
and that we want to include only females in some analyses.
Typing the following command in a terminal would create the
new data file we need:
  628 
  629 \begin{verbatim}
  630   awk '{if(/sex/ || /#/ || $2 > 0) {print $0}
  631   }' data_file.csv > new_data_file.csv
  632 \end{verbatim}
  633 
We are telling {\tt awk} that if either it finds the string
``sex'' or the symbol `\#' in a line (because such a line
contains our column names or a comment), or the second field
of a line has a number bigger than $0$, it has to output the
entire line (``{\tt ||}'' means ``or''). Finally, we are also
telling the shell that we want the output redirected from the
screen to the file new\_data\_file.csv.
  641 
  642 Now, suppose that you want to calculate an index using three
  643 variables from your data base, and that the index would be
  644 the sum of columns 1 and 2 divided by the value of the third
  645 column:
  646 
  647 \begin{verbatim}
  648   awk '{if(/#/ || /var1/) {print $0 "\tidx"} else
  649   {{idx = ($1 + $2) / $3} 
  650   {print $0 "\t" idx}}}' datafile.dat > newfile.dat
  651 \end{verbatim}
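
The command above stops with an error if the third column of
some row is zero, and it gives a meaningless result if one of
the three values is missing. A slightly more defensive
variation is sketched below. It assumes that ``M'' is the
missing value string; the sample file and its column names are
hypothetical.

```shell
# Sketch: compute the index, writing M where it cannot be computed.
tmp=$(mktemp -d)
printf 'var1 var2 var3\n1 3 2\n4 4 0\nM 2 4\n' > "$tmp/datafile.dat"

indexed=$(awk '
  /#/ || /var1/ { print $0 "\tidx"; next }      # comments and column names
  NF == 0       { print; next }                 # empty lines
  $1 == "M" || $2 == "M" || $3 == "M" || $3 == 0 {
    print $0 "\tM"                              # no index for this row
    next
  }
  {
    idx = ($1 + $2) / $3
    print $0 "\t" idx
  }' "$tmp/datafile.dat")
printf '%s\n' "$indexed"

rm -r "$tmp"
```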
  652 
  653 Warning: \st always uses dot as decimal separator while
  654 working with data files. But if the decimal separator in
  655 your language is a comma, {\tt awk} will use it in the
  656 outputs. To avoid this, type the following command in the
  657 terminal before using {\tt awk}:
  658 
  659 \begin{verbatim}
  660   export LC_ALL=C
  661 \end{verbatim}
  662 
  663 With the above command, the language, numbers, etc will be
  664 set to English.  Note that programs started in this terminal
  665 will also run in English. To reset the terminal you have to
  666 ``export LC\_ALL=xx'' again, using your language code
  667 instead of ``xx'' (or close the terminal and open another).
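
If you don't want to change the whole terminal session, the
shell also accepts the variable assignment in front of a
single command. In the sketch below only this one {\tt awk}
process runs with the ``C'' locale.

```shell
# The assignment written in front of the command applies to that
# command only; the rest of the session keeps its locale.
half=$(LC_ALL=C awk 'BEGIN { print 1 / 2 }')
printf '%s\n' "$half"
```

In this case there is nothing to reset afterwards.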
  668 
  669 \subsection{Sorting the data base}
  670 
We can use some other programs if we want to sort the rows
of the entire database using one or more columns as keys.
Suppose, for example, that we want to sort our database
using the 12th column as key.  The following commands would
do the job:
  676 
  677 \begin{verbatim}  
  head -n 1 datafile.csv > columnnames
  tail -n +2 datafile.csv | sort -g -k 12,12 > sorted
  cat columnnames sorted > sorted_datafile.csv
\end{verbatim}

With the above commands we have sorted our file in three
steps: (1) We created the file {\tt columnnames} containing
the first line of {\tt datafile.csv}. (2) We created the file
{\tt sorted}, a sorted version of our database without its
first line. If the line with the column names were sorted
together with the data, it would be treated as a number and
might no longer be the first line of the file, which is why
{\tt tail} removes it before sorting. (3) We
concatenated the files {\tt columnnames} and {\tt sorted} to
create {\tt sorted\_datafile.csv}. Please see the manual pages of
{\tt head}, {\tt tail}, {\tt sort}, and {\tt cat} for details on how to
use them.
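
The same result can be obtained without intermediate files by
grouping the commands. The sketch below uses a small
hypothetical file with two columns and sorts it by the second
one; it assumes that the first line holds the column names.
It uses {\tt sort -n}, which is available everywhere, instead
of {\tt sort -g}.

```shell
# Sketch: sort the rows by one column and keep the column names on top.
tmp=$(mktemp -d)
printf 'name score\n3 20\n1 5\n2 100\n' > "$tmp/datafile.csv"

key=2                                  # number of the key column
{
  head -n 1 "$tmp/datafile.csv"                                      # column names
  tail -n +2 "$tmp/datafile.csv" | LC_ALL=C sort -n -k "$key,$key"   # sorted rows
} > "$tmp/sorted_datafile.csv"

sorted=$(cat "$tmp/sorted_datafile.csv")
printf '%s\n' "$sorted"

rm -r "$tmp"
```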
  695 
  696 \subsection{Merging data files}
  697 
To merge data files using a variable as key, we use another
external program: {\tt join}. Suppose that you have a data
file containing information about people, and that some
of them are married to each other. You want to
know the mean age difference between husbands and wives. You
can't run analyses to compare people in different rows, only
variables in different columns. However, your database has
a variable that might be used as key: {\em house}. People
who have the same value for the variable ``house'' and
are married are actually married to each other. You
should follow some steps to achieve your goal: (1) Use {\tt
awk} to create two different data files, one only with
married men and the other only with married women. (2) Use {\tt
join} to merge the two data files into a new one. If the house
variable is the first column in both data files, you should
simply type:
  714 
  715 \begin{verbatim}
  716   join -e "" women.csv men.csv > couples.csv
  717 \end{verbatim}
  718 
  719 The above command would get the two following files:
  720 
  721 \begin{verbatim}
  722 house income age                house income age
  723 123     4215   23               123     3256   27
  724 124     3251   35               125     4126   25
  725 126     0      20               126     4261   22
  726 127     1241   45               128     3426   60
  727 \end{verbatim}
  728 
  729 And would output:
  730 
  731 \begin{verbatim}
  732 house income age income age
  733 123 4215 23 3256 27
  734 126 0 20 4261 22
  735 \end{verbatim}
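
Both steps can be sketched together. Note that {\tt join}
expects its two input files to be sorted by the key column.
The layout of {\tt people.csv} below (house, sex, married,
age) and its codes are hypothetical, and the line with the
column names is written separately so that it is not sorted
with the data.

```shell
# Sketch: split a file of people by sex, then join the halves by house.
tmp=$(mktemp -d)
# hypothetical layout: house, sex (0 = male, 1 = female),
# married (1 = yes), age
cat > "$tmp/people.csv" <<'EOF'
house sex married age
126 1 1 20
123 0 1 27
123 1 1 23
126 0 1 22
124 1 0 35
EOF

# Step 1: married people only, one file per sex, sorted by house
awk 'NR > 1 && $3 == 1 && $2 == 1 { print $1, $4 }' "$tmp/people.csv" \
  | LC_ALL=C sort -k 1,1 > "$tmp/women.csv"
awk 'NR > 1 && $3 == 1 && $2 == 0 { print $1, $4 }' "$tmp/people.csv" \
  | LC_ALL=C sort -k 1,1 > "$tmp/men.csv"

# Step 2: merge the two files using the first column as key
{
  echo "house age_w age_m"
  LC_ALL=C join "$tmp/women.csv" "$tmp/men.csv"
} > "$tmp/couples.csv"

couples=$(cat "$tmp/couples.csv")
printf '%s\n' "$couples"

rm -r "$tmp"
```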
  736 
There is no problem with the duplicate occurrence of
``income'' and ``age'', because \st will append `\_' to the
second one. If you have to merge files using more than one
column as key, you can use {\tt awk} to create a single key
column that concatenates the characters of all keys. For
example, if your key variables are the columns 2 and 3:
  743 
  744 \begin{verbatim}
  745 awk '{if(/income/) {print "key" "\t" $0} else {
  746        if(NF == 0) {print $0} else {
  747           {print $2$3 "\t" $0}
  748         }
  749        }
  750       }' people.csv > people_with_key.csv
  751 \end{verbatim}
  752 
  753 \section{Batch/script}
  754 
If you have to repeat the same analysis many times, you
would become bored with starting {\tt sta\-tist} and, again and again,
choosing the same options from the menu. If this is your
case, you can use the batch mode. You have to invoke \st
with the option {\verb --silent }, and give it a file
containing what you would have to type if \st were running in
the normal mode. The only difference is that while in silent
mode \st doesn't print the message ``Please, continue with
<RETURN>'', and, thus, you don't have to include these
<RETURN> keys. For example, if you want to run a correlation
between variables ``a'' and ``b'' in a data file called
{\tt day365.csv} you could create a file named, for example,
{\tt cmds\_file} with the following content:
  768 
  769 \begin{verbatim}
  770 2
  771 1
  772 a
  773 b
  774 0
  775 0
  776 \end{verbatim}
  777 
  778 The next step would be to invoke \st with the following
  779 command:
  780 
  781 \begin{verbatim}
  782    statist --silent --noplot day365.csv < cmds_file
  783 \end{verbatim}
  784 
The result will be printed on the screen. However, if you
prefer the results saved in a file called, say, report365,
type:
  788 
  789 \begin{verbatim}
  790    statist --silent --noplot day365.csv < cmds_file > report365
  791 \end{verbatim}
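
When the same analysis has to be run on several data files,
the invocation can be wrapped in a shell loop. The following
is only a sketch: the function name {\tt run\_batch}, the
suffix {\tt .report} and the second file name in the usage
comment are hypothetical, and the loop naturally requires \st
to be installed.

```shell
# Sketch: run the same batch analysis on several data files.
tmp=$(mktemp -d)

# the six answers from the example above, one per line
printf '%s\n' 2 1 a b 0 0 > "$tmp/cmds_file"
answers=$(($(wc -l < "$tmp/cmds_file")))

# run_batch FILE...: writes the results of each FILE to FILE.report
run_batch() {
  for data in "$@"; do
    statist --silent --noplot "$data" < "$tmp/cmds_file" > "$data.report"
  done
}

# usage (requires statist): run_batch day365.csv day366.csv
```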
  792 
  793 \section{Useful tips}
  794 
  795 \begin{itemize}
  796 
\item Please report any problem that you find (program
  bugs, documentation faults, grammar mistakes, etc.) to:
  statist-list@intevation.de. If you prefer, you can write
  directly to me: jalvesaq@gmail.com. You are also
  invited to make suggestions and ask for new features.
  802 
  803 \item When you see a question like ``Do something? (y/N),''
  804   the upper case ``N'' means that if you type any letter
  805   other than ``y'', and even if you simply press <Enter>, it
  806   will be assumed that your answer is ``No''.
  807   
\item You can get the latest version of \st on its
website:
  810 
  811 \end{itemize}
  812 
  813 \begin{center}
  814 \href{http://statist.wald.intevation.org/}
  815 {http://statist.wald.intevation.org/}
  816 \end{center}
  817 
  818 \end{document}
  819 
  820 % vim:tw=60