"Fossies" - the Fresh Open Source Software Archive 
Member "statist-1.4.2/doc/manual-en.tex" (15 Nov 2006, 29496 Bytes) of package /linux/privat/old/statist-1.4.2.tar.gz:
As a special service "Fossies" has tried to format the requested source page into HTML format using (guessed) TeX and LaTeX source code syntax highlighting (style:
standard) with prefixed line numbers.
Alternatively you can here
view or
download the uninterpreted source code file.
\documentclass[12pt,english]{article}
\usepackage[T1]{fontenc}
\usepackage[latin1]{inputenc}
\usepackage{times}
\usepackage{a4wide}
\usepackage{graphicx}
\usepackage{babel}
\usepackage[pdftex,bookmarks=false,linkbordercolor={0.5 1 0.5}]{hyperref}

\newcommand{\st}{{\tt sta\-tist} }

\begin{document}

\title{STATIST 1.4.1\\User Manual}
\author{Jakson Alves de Aquino\\
{\small {\tt jalvesaq@gmail.com}}}
\date{September 5, 2006}

\maketitle

\tableofcontents

\section{Introduction}

{\tt Statist} is an easy-to-use, lightweight statistics
program. Everything is in an interactive menu: you just
have to choose what you need. {\tt Statist} is Free Software
under the GNU GPL and comes with absolutely no warranty.

This manual is an incomplete and non-literal translation of
the original text written by Dirk Melcher, with the
addition of new material. I'm grateful to Bernhard Reiter
for his suggestions for improving this document.

\section{Warnings for Windows users}

Users on GNU/Linux are much more accustomed to using console
applications. One helpful feature is command-line
completion: a long file name is completed after you type
its first letters and press the Tab key. The terminal
emulators, where you type commands, can save and scroll
back over many lines of output. And, most importantly,
GNU/Linux is Free Software, where anybody can inspect what
the computer does and many people can fix bugs, which makes
it more secure. Please, as soon as you can, try
\st on a Free Software operating system like GNU/Linux or
FreeBSD.

To create graphics with \st you will need a version of {\tt
gnuplot} that comes with {\tt pgnuplot}. Under Windows, you
can't send commands to {\tt gnuplot} through {\tt sta\-tist}, as is
possible under Linux, but you can type the commands in the
{\tt gnuplot} window.

Be careful: don't close the {\tt gnuplot} window. You may
close only the graphic! If you close the {\tt gnuplot}
window, you will have to restart \st to be able to create
graphics again.

Some programs used to manipulate data files are not part of
{\tt sta\-tist}, but they are available for Windows. Please, search the
Internet for the package gnucoreutils, which is one
of the GnuWin32 packages. Note, however, that their
installation and use might not be trivial for a Windows
user. Like {\tt sta\-tist}, they are easier to use in a
Linux terminal emulator than in a DOS window.

The \st documentation can be found at {\tt
C:$\backslash$Program Files$\backslash$statist}, where there
is also a sample configuration file for {\tt sta\-tist}. You
can rename it to {\tt statistrc.txt} and edit it according
to your preferences.

Unfortunately, \st can't produce colorized output under DOS.

\section{Installation from source code}

\begin{enumerate}

\item Open a terminal.

\item Unpack the source code, compile the program, and
become root to install it. That is, type:

\end{enumerate}

\begin{verbatim}
tar -xvzf statist-1.4.1.tar.gz
cd statist-1.4.1
make
# optional, if you have "check" installed
make check

# install for all users as root
su -
cd path-to/statist-1.4.1
make install
exit

\end{verbatim}

This is the default installation that should work in most
GNU/Linux distributions. If the above instructions are not
enough for your case, please see the file README for details
on how to install \st from source code.

\section{Invocation}

You can simply type:
\begin{verbatim}
statist data_file
\end{verbatim}

However, there are also some options that you might find
useful; in that case, the invocation will be:

\begin{verbatim}
statist [ options ] data_file [ options ]
\end{verbatim}

The only option that you need to memorise is {\tt {-}-help},
or simply {\tt -h}, which will output the list of options.
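
For example, to see the complete list of options, type:

\begin{verbatim}
statist --help
\end{verbatim}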

You can also create and edit the file {\verb ~/.statistrc }
and set some options there. If you have root privileges,
you can also create the file {\tt /etc/statistrc}. Options
passed on the command line override the ones read from the
{\tt statistrc} file. You can find a sample {\tt statistrc}
in the documentation directory (usually {\tt
/usr/share/doc/statist}). Finally, if you choose the menu
item {\em Preferences}, you can modify some options during \st
execution.
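
As a minimal sketch of what a {\tt statistrc} can look like
(only {\tt autodetect\_header} is mentioned elsewhere in
this manual; the {\tt na\_string} line is merely an assumed
example, so check the sample file shipped with \st for the
exact option names accepted by your version):

\begin{verbatim}
# ~/.statistrc -- personal statist configuration (sketch)
autodetect_header = yes
# assumed entry: string that marks missing values
na_string = M
\end{verbatim}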

\section{Menu}

The program has a simple menu that makes it very easy to
use. There is no need to remember commands. Typing `0'
takes you to the next higher menu level, or quits the
program if you are already in the {\em Main menu}. One
important tip: if you have chosen a menu entry by mistake,
you can always cancel the process by pressing the <Return>
key before entering any value or answering any question.
Then, the last menu will be printed again.

If you choose a statistical procedure from the menu, you
will be asked to choose the variables. Often, it's not
necessary to type the entire name of a column when inputting
variable names for analyses. For example, if you have a
column named

\begin{center}
this\_really\_is\_a\_big\_name
\end{center}

\noindent and there is no other column starting with the
letter `t', you can simply type `t'. Finally, if you want to
select all columns, you can simply type ``all'' as the
name of the first column.

Actually, the whole process is self-explanatory, and you
would be able to use the program even without reading this
short explanation.
%Click <a href="menulist.html">here</a> to see the complete
%menu.

\section{Statist and Gnuplot}

Gnuplot is an interactive program that makes graphical
presentations from data and functions, and \st creates {\tt
gnuplot} graphics for some of its functions. Normally, you
will not have to open {\tt gnuplot} manually. The only
prerequisite is that the program is installed and in the PATH.

If you know {\tt gnuplot} syntax, you can refine or
personalize your graphics by entering {\tt gnu\-plot}
commands. To do that, choose the menu option {\em
Miscellaneous} {\textbar} {\em Enter gnuplot commands}. You can change many
things in the graphic, like line colors and types, axis
labels, etc. Even if you don't know {\tt gnuplot} syntax,
you can at least change the graphic's title and axis labels,
because a list of the last commands sent to {\tt gnuplot}
will be printed on the screen. The changes will be applied
to the graphic currently being displayed by means of the {\tt
gnuplot} command ``replot''.
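
For example, with a graphic on display, commands like the
following (standard {\tt gnuplot} syntax; the title and
label texts are just placeholders) could be entered through
that menu option:

\begin{verbatim}
set title "Income by age"
set xlabel "Age (years)"
set ylabel "Income"
replot
\end{verbatim}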

The {\tt gnuplot} graphics can be disabled by invoking the
program with the option {\verb --noplot }. This can be
useful, for example, if you are going to work with batch
processing, or if your database is so big that the {\tt
gnuplot} graphics are generated too slowly.
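
For example (the file name is only an illustration):

\begin{verbatim}
statist --noplot datafile.csv
\end{verbatim}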

\subsection{Box-plot}

You probably will have no problem interpreting \st graphics.
The only one that might need some explanation is the {\em
Box-and-Whisker Plot}. The picture below shows the meaning
of each piece of this graphic:

\begin{center}
\includegraphics{boxplot-en}
\end{center}

\subsection{UTF-8}

You might experience some problems with \st graphics
made through gnuplot if your locale environment is set to
UTF-8 and your language has non-ASCII characters. The
problem is that gnuplot will normally interpret titles and
labels as if they were encoded in a single-byte character
set, like ISO-8859-1 (Latin 1), even if the terminal
emulator charmap is set to UTF-8. It's possible to mix
letters of different character sets (Greek and Latin 1, for
example) in a single graphic. Please visit the web page
below for the details:

\href{http://statist.wald.intevation.org/utf8.html}
{http://statist.wald.intevation.org/utf8.html}

\section{Data}

\subsection{The file format}

{\tt Statist} reads data from simple ASCII files (text
files). If the program is not invoked with an ASCII file
name, it will immediately ask for the name of a data file.
Without a data file, there is nothing to do, unless you
pass the option \verb --nofile while invoking the
program in order to input data manually from the keyboard
(choose from the menu: {\em Data management} {\textbar} {\em Read column
from terminal}). However, it is only rarely reasonable to do
this. It would be more comfortable to use a text editor or a
spreadsheet program like {\em OpenOffice Calc} or {\em
Gnumeric}. In this case, save the file as .csv.
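
For example, to start \st without a data file and type the
values in by hand:

\begin{verbatim}
statist --nofile
\end{verbatim}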

But be careful, because \st always uses a dot as the decimal
delimiter while working with data files. If the decimal
delimiter in your language is a comma, \st might fail to
read the file correctly. Thus, before typing your
data, you can try to open the spreadsheet program in a
terminal with the locale set to ``C'', as below:

\begin{verbatim}
export LC_ALL=C
oocalc &
\end{verbatim}

If you really need to use a data file with commas as decimal
delimiters, \st will convert each comma that is in a quoted
number into a dot. If the numbers that use a comma as
decimal delimiter are not between double quotes, it will be
necessary to set the decimal delimiter manually. You might
be asked to set the file format. If not, choose the menu
item {\em Data management} {\textbar} {\em File format
options}. Alternatively, you can run \st as in the example:

\begin{verbatim}
statist datafile.csv --dec ","
\end{verbatim}

A data file for \st consists of one or several columns of
data. The columns of numbers must be separated from each
other by double quotes, tab characters, empty spaces, commas
or semicolons. These characters are ignored and, thus, it's
possible to have any number of them between two fields. For
example, \st will read the same data from the two files
below:

\begin{verbatim}
#Example data-file for statist #Example data-file for statist
1 3 5 6 1,3,"5",6
7 8 9 10 ,7 8 ;, 9 10
11 12 13 14 11;12;13;14;;
\end{verbatim}

As you can infer from the above examples, comments begin
with the symbol `\#' and are ignored. Empty lines are also
ignored.

\subsection{Column names and variable labels}

When \st reads the data file, each column is assigned a
name. The first column will be column `a', the second will
be `b', etc. However, it will be easier to understand a data
file with many variables if its columns have more meaningful
names. The first non-comment line of the data file may
contain the column names. {\tt Statist} will try to detect
the names using a very simple algorithm: it checks whether
all fields in the first non-comment line begin with a letter
of the English alphabet. If any of the fields begins with a
character that isn't between `a' and `z' or `A' and `Z', it
will consider that the data file doesn't have a header. If
\st fails in this task, you can set the correct file format
by choosing the menu item {\em Data management}
{\textbar} {\em File format options}. Another solution to
this problem is to use the command-line options {\tt
{-}-header} or {\tt {-}-noheader}.

Alternatively, you can explicitly put in the data file the
information that the header is present, by including the
``\#\%'' string at the beginning of the line. In this last
alternative, like comment lines, the line must begin with
one `\#', but this symbol must be followed by one `\%'.
With its default configuration, \st can read the two
examples of data file below by simply typing ``{\tt
statist~file}'':

\begin{verbatim}
#%kow kaw ec50 kow kaw ec50
0.34 4.56 0.23 0.34 4.56 0.23
1.23 5.45 6.76 1.23 5.45 6.76
6.78 1.34 9.60 6.78 1.34 9.60
\end{verbatim}

The number of variable names declared must be exactly the
same as the number of columns. Only letters, digits, and
`\_' are allowed in names, and letters with accents may
cause problems. If you use the option
\verb --labels {\tt labels\_file}, \st will use the value
labels and the column titles present in {\tt labels\_file}.
When running some graphics and analyses, \st will replace
column names and variable values with their labels. A {\tt
labels\_file} is a list of column names plus their labels,
followed by a list of values with their labels. Information
for different columns is separated by a blank line, as in
the example:

\begin{verbatim}
stat Do you like statistics?
0 No
1 Yes
2 No answer

color What's your favorite color?
0 Red
1 Green
2 Blue
3 Other
\end{verbatim}

In the above example, the data file has a column named
``stat'' and another named ``color''. The values of the
variable ``stat'' are always ``0'', ``1'', or ``2''. You can
use the same labels file with different data files. There
is no problem if some columns remain without labels, or if
some labels don't match any column in the database. Thus,
if you have a database with hundreds of columns and want to
work with various subsets that share some columns, you can
write one single labels file. If you choose the menu
option {\em Read another file}, the labels will be applied to
the appended columns. Note: long value labels need a lot of
space, and the table of {\em Compare means} may no longer
fit on the screen; if you have long labels, you will be
able to run {\em Compare means} with only a few columns at
a time.
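
For example, if the labels above were saved in a file
called {\tt labels\_file}, \st could be started with
something like (the data file name is just an illustration):

\begin{verbatim}
statist survey.csv --labels labels_file
\end{verbatim}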

\subsection{Missing values}

{\tt Statist} can deal with data files with missing values
({\em not available} values), and there are two ways of
indicating that a value is missing. The first one is to use
a specific string where the value is missing. By default,
\st interprets the string ``M'' as the indicator of a
missing value, but you can choose a different string in the
{\tt statistrc} file, with the argument {\tt
{-}-na-string~<string>} on the command line, or in the menu
item {\em Data management} {\textbar} {\em File format
options}.
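
For example, to read a file whose missing values are marked
with the string ``NA'' (both names below are just
illustrations):

\begin{verbatim}
statist datafile.csv --na-string "NA"
\end{verbatim}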

Because \st interprets any amount of ignore characters
(``{\tt ~",;}$\backslash${\tt t}'') as one single field
separator, two adjacent field separators will not be
interpreted as a missing value. Instead, \st will
report that the line has fewer columns than it should.
This is the default behavior, but it can be changed either
in the {\tt statistrc}, with the command-line option {\tt
{-}-sep~<char>}, or, again, in the menu item {\em Data
management} {\textbar} {\em File format options}. With this
option, only the specified character will be interpreted as
the field separator. Thus, the following data files will be
read as the same, but the second one needs the option
\verb --sep \verb "," :\footnote{Even with the option {\tt {-}-sep},
the default algorithm is used to parse the line
with column names. Hence, it's not allowed to have missing
column names.}

\begin{verbatim}
1 3 5 6 1,3,5,6
7 M 9 10 7,,9,10
11 12 M 14 11,12,,14
\end{verbatim}
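
The second file above would then be read with something
like (the file name is just an illustration):

\begin{verbatim}
statist datafile2.csv --sep ","
\end{verbatim}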

Each column of the database is saved as a temporary binary
file, where all values are stored as double precision
floating point numbers (real numbers). These files are
erased when you quit {\tt sta\-tist}. The missing values are
stored as the smallest possible number, that is, $-1.79769
\times 10^{308}$. You have to be sure that this number
isn't in your data file as a valid number, because it would
not be treated as a very small number; it would be
interpreted as a missing value.

Before each analysis, \st reads the selected columns from
the temporary files into RAM, and, if necessary, either
deletes the rows that have at least one missing value or
simply deletes the missing values. However, the deletions
occur only in a copy of the temporary files that is created
in the computer's memory. The temporary files remain intact
until you quit the program. For example, the menu option {\em
Regressions and correlations} {\textbar} {\em Multiple linear correlation}
will delete all rows that have missing values in any one of
the chosen columns. You should do this analysis if each row
in your database represents a single case, which is very
common in the social sciences. The menu option {\em Tests}
{\textbar} {\em t-test for comparison of two means of two samples} will
delete every missing value, but a missing value in a column
will not cause the entire row to be deleted. You should use
this analysis if, for example, the columns in your database
represent different series of similar experiments, and you
would like to compare the two sets of results.

\subsection{Reading and saving files}

If you want to work only with subsets of your database, you
can write columns into a text file (ASCII file) by choosing
the menu option {\em Data Management} {\textbar} {\em Export columns as
ASCII-data}. You can also read data from several files
simultaneously ({\em Data Management} {\textbar} {\em Read another file}).
When you {\em Read another file}, new columns are added to
the database, and if a column name in the new file is
already in use in the current database, the symbol ``\_''
will be appended to it.

Another possibility is to join columns ({\em Data
manipulation} {\textbar} {\em Join columns}). In this case, the selected
columns will be concatenated into a bigger one.

\section{Manipulating databases}

\subsection{Extracting columns from fixed-width data files}

To extract columns from a fixed-width data file and save
them in a \st data file, type:

\begin{verbatim}
statist --xcols config_file original_datafile new_datafile
\end{verbatim}

The content of a {\tt config\_file} is simply a list of
variable names and their positions in the fixed-width data
file, as in the example below:

\begin{verbatim}
born 1-4
sex 8
income 11-15
\end{verbatim}

With the above config\_file, \st would read the following
database:

\begin{verbatim}
1971 522 2365
19609991 32658
19455632
19674131 32684
\end{verbatim}

And output:

\begin{verbatim}
#%born sex income
1971 2 2365
1960 1 32658
1945 2 M
1967 1 32684
\end{verbatim}

{\tt Statist} will not add the ``\verb #% '' string to the
first line if either it was called with the command-line
option {\tt {-}-header} or the {\tt statistrc} file has the
option {\tt autodetect\_header = yes}. The string used to
indicate missing values can also be defined in the {\tt
statistrc} and with the command-line options. The columns
are separated by a blank space, unless you have chosen
something different with the command-line option {\tt
{-}-sep}. Non-numeric values are extracted and put between
double quotes in the {\tt new\_datafile}, although \st is
unable to read them. You would need to replace them with
numeric codes.

\subsection{Extracting a sample from a database}

If you are going to work with a very big database that you
don't know very well yet, you may find it useful to begin
the exploration of the database using a sample of it, which
is faster than using the entire database. After
discovering which analyses are the most relevant for your
research, you can re-run them with the original database.

To extract a percentage of the database rows, invoke \st in
the following way:

\begin{verbatim}
statist --xsample percentage database dest_file
\end{verbatim}

\noindent where {\tt percentage} must be an integer number
between 1 and 99. The new database, {\tt dest\_file}, will be
created with {\em approximately} the requested percentage of
rows extracted from {\tt database}.
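
For example, to create a new file with approximately 10\%
of the rows of a big database (the file names are just
illustrations):

\begin{verbatim}
statist --xsample 10 big_database.csv sample.csv
\end{verbatim}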

\subsection{Recoding a database}

For some kinds of data manipulation we will need some
programs that are not part of {\tt sta\-tist}, but are
available in most GNU/\-Linux distributions (and are also
installable under DOS/\-Win\-dows). For small data files, with
few variables, you can use your preferred text editor or
spreadsheet program. However, if your file is too big, or
has too many variables, it might be more convenient to use
the tools described here and in the following sections.

Sometimes, we need to recode some values in a database.
Suppose, for example, that in a given data file the value
``999'' means missing value for the variable age, and that
in some analyses we want ``age classes'' and not ``age''. We
still want to use the variable ``age'' in other analyses,
and, thus, we need to recode ``age'' into a different
variable. To create the new database with the recoded
variable we could use {\tt awk}, an external program.
Suppose that the column ``age'' is the second one:

\begin{verbatim}
awk '{if(/age/) {print $0 "\t" "AGE1"}
      else {
        if(NF == 0) {print $0}
        else {
          if ($2 <= 20){age1 = 1} else
          if ($2 > 20 && $2 <= 50){age1 = 2} else
          if ($2 > 50 && $2 < 999){age1 = 3} else
          {age1 = "M"}
          {print $0 "\t" age1}
        }
      }
}' datafile.csv > newfile.csv
\end{verbatim}

The expressions inside the quotes are {\tt awk} commands.
With this command, {\tt awk} would read the following data
file:

\begin{verbatim}
sex age
0 23
1 88
0 10
0 36
M 999
1 55
\end{verbatim}

And output:

\begin{verbatim}
sex age AGE1
0 23 2
1 88 3
0 10 1
0 36 2
M 999 M
1 55 3
\end{verbatim}

At first, the {\tt awk} command might look complex,
but let me explain it:

\begin{description}

\item {\tt \$}: The symbol `{\tt \$}' means ``field'', that
is, a column of a \st data file.

\item {\tt \$0}: has a special meaning: the {\em entire
line}.

\item {\tt if(/age/) \{print \$0 ``$\backslash$t''
``AGE1''\}}: If the line contains the string ``age'', print
the entire line plus a tab character plus the string
``AGE1''. This line contains our column names (unless the
string also appears in a comment in the data file).

\item {\tt if(NF == 0) \{print \$0\}}: If the number of
fields is zero, simply print the entire line.

\item {\tt if (\$2 > 20 \&\& \$2 <= 50)\{age1 = 2\}}: If the
second field has a value higher than 20 and lower than or
equal to 50, the value of the variable ``age1'' will be 2.

\item {\tt print \$0 ``$\backslash$t'' age1}: Print the
entire line plus a tab character plus the value of the
variable {\tt age1}.

\end{description}

We can also use {\tt awk} to select cases and compute new
variables. So, please refer to its manual or info page for
more details on its usage (in a terminal, type {\tt info
awk}). Frequently, our {\tt awk} commands will begin by
testing whether the line contains the column names and
whether it is an empty line.

\subsection{Selecting cases and computing new variables}

We can use {\tt awk} to accomplish two other tasks: (1)
create a new database by selecting only some cases from an
existing data file, and (2) compute a new variable using the
values of some existing variables. Here we show only two
examples of {\tt awk} usage.

Suppose that the second column of a data file has the
variable ``sex'', coded `0' for males and `1' for females,
and that we want to include only females in some analyses.
Typing the following command in a terminal would create the
new data file we need:

\begin{verbatim}
awk '{if(/sex/ || /#/ || $2 > 0) {print $0}
}' data_file.csv > new_data_file.csv
\end{verbatim}

We are telling {\tt awk} that if a line contains the string
``sex'' or the symbol `\#' (because it then certainly
contains our column names or a comment), or if the second
field of the line has a number bigger than $0$, it has to
output the entire line (``{\tt ||}'' means ``or''). Finally,
we are also telling the shell that we want the output
redirected from the screen to the file new\_data\_file.csv.

Now, suppose that you want to calculate an index using three
variables from your database, and that the index would be
the sum of columns 1 and 2 divided by the value of the third
column:

\begin{verbatim}
awk '{if(/#/ || /var1/) {print $0 "\tidx"} else
      {{idx = ($1 + $2) / $3}
       {print $0 "\t" idx}}}' datafile.dat > newfile.dat
\end{verbatim}

Warning: \st always uses a dot as the decimal separator
while working with data files. But if the decimal separator
in your language is a comma, {\tt awk} will use it in the
output. To avoid this, type the following command in the
terminal before using {\tt awk}:

\begin{verbatim}
export LC_ALL=C
\end{verbatim}

With the above command, the language, numbers, etc.\ will be
set to English. Note that programs started in this terminal
will also run in English. To reset the terminal you have to
``export LC\_ALL=xx'' again, using your language code
instead of ``xx'' (or close the terminal and open another
one).

\subsection{Sorting the database}

We can use some other programs if we want to sort the rows
of the entire database using one or more columns as keys.
Suppose, for example, that we want to sort our database
using the 12th column as key. The following commands would
do the job:

\begin{verbatim}
head -n 1 datafile.csv > columnnames
tail -n +2 datafile.csv | sort -g -k 12,12 > sorted
cat columnnames sorted > sorted_datafile.csv
\end{verbatim}

With the above commands we have sorted our file in three
steps: (1) We created the file {\tt columnnames} containing
the first line of {\tt datafile.csv}. (2) We created the
file {\tt sorted}, a sorted version of our database; the
{\tt tail} command skips the first line, so that the line
with the column names is not treated as data and sorted
into the middle of the file. (3) We concatenated the files
{\tt columnnames} and {\tt sorted} to create {\tt
sorted\_datafile.csv}. Please see the manual pages of
{\tt head}, {\tt tail}, {\tt sort}, and {\tt cat} for
details on how to use them.

\subsection{Merging data files}

To merge data files using a variable as key, we use another
external program: {\tt join}. Suppose that you have a data
file containing information about people, and that some
people are actually married to each other. You want to
know the mean age difference between husbands and wives. You
can't run analyses to compare people in different rows, only
variables in different columns. However, your database has
a variable that can be used as key: {\em house}. People
who have the same value for the variable ``house'' and
are married, are actually married to each other. You
should follow some steps to achieve your goal: (1) Use {\tt
awk} to create two different data files, one with only
married men and the other with only married women. (2) Use
{\tt join} to merge the two data files into a new one. If
the house variable is the first column in both data files,
you can simply type:

\begin{verbatim}
join -e "" women.csv men.csv > couples.csv
\end{verbatim}

The above command would take the two following files:

\begin{verbatim}
house income age house income age
123 4215 23 123 3256 27
124 3251 35 125 4126 25
126 0 20 126 4261 22
127 1241 45 128 3426 60
\end{verbatim}

And would output:

\begin{verbatim}
house income age income age
123 4215 23 3256 27
126 0 20 4261 22
\end{verbatim}

There is no problem with the duplicate occurrence of
``income'' and ``age'', because \st will append `\_' to the
second one. If you have to merge files using more than one
column as key, you can use {\tt awk} to create a single key
column that concatenates all the keys. For
example, if your key variables are columns 2 and 3:

\begin{verbatim}
awk '{if(/income/) {print "key" "\t" $0} else {
        if(NF == 0) {print $0} else {
          {print $2$3 "\t" $0}
        }
      }
}' people.csv > people_with_key.csv
\end{verbatim}

\section{Batch/script}

If you have to repeat the same analysis many times, you
would become bored of starting {\tt sta\-tist} and choosing
the same options from the menu again and again. If this is
your case, you can use the batch mode. You have to invoke
\st with the option {\verb --silent }, and give it a file
containing what you would have to type if \st were running
in normal mode. The only difference is that in silent
mode \st doesn't print the message ``Please, continue with
<RETURN>'', and, thus, you don't have to include these
<RETURN> key presses. For example, if you want to run a
correlation between the variables ``a'' and ``b'' in a data
file called {\tt day365.csv}, you could create a file named,
for example, {\tt cmds\_file} with the following content:

\begin{verbatim}
2
1
a
b
0
0
\end{verbatim}

The next step would be to invoke \st with the following
command:

\begin{verbatim}
statist --silent --noplot day365.csv < cmds_file
\end{verbatim}

The result will be printed on the screen. However, if you
prefer to have the results saved in a file called, say,
report365, type:

\begin{verbatim}
statist --silent --noplot day365.csv < cmds_file > report365
\end{verbatim}

\section{Useful tips}

\begin{itemize}

\item Please report any problem that you find (program
bugs, documentation faults, grammar mistakes, etc.) to:
statist-list@intevation.de. If you prefer, you can write
directly to me: jalvesaq@gmail.com. You are also
invited to make suggestions and ask for new features.

\item When you see a question like ``Do something? (y/N),''
the upper-case ``N'' means that if you type any letter
other than ``y'', or even if you simply press <Enter>, it
will be assumed that your answer is ``No''.

\item You can get the latest version of \st from its
website:

\end{itemize}

\begin{center}
\href{http://statist.wald.intevation.org/}
{http://statist.wald.intevation.org/}
\end{center}

\end{document}

% vim:tw=60