## "Fossies" - the Fresh Open Source Software Archive

### Member "gretl-2020b/doc/tex/datafiles.tex" (1 Apr 2020, 44332 Bytes) of package /linux/misc/gretl-2020b.tar.xz:


\chapter{Data files}
\label{chap:datafiles}

\section{Data file formats}
\label{sec:data-formats}

Gretl has its own native format for data files.  Most users will
probably not want to read or write such files outside of gretl itself,
but occasionally this may be useful and details on the file formats
are given in Appendix~\ref{app-datafile}. The program can also import
data from a variety of other formats. In the GUI program this can be
done via the ``File, Open Data, User file'' menu---note the drop-down
list of acceptable file types. In script mode, simply use the
\cmd{open} command. The supported import formats are as follows.

\begin{itemize}
\item Plain text files (comma-separated or ``CSV'' being the most
  common type).  For details on what gretl expects of such files, see
  Section~\ref{scratch}.
\item Spreadsheets: MS \app{Excel}, \app{Gnumeric} and Open Document
  (ODS). The requirements for such files are given in
  Section~\ref{scratch}.
\item \app{Stata} data files (\texttt{.dta}).
\item \app{SPSS} data files (\texttt{.sav}).
\item \app{SAS} ``xport'' files (\texttt{.xpt}).
\item \app{Eviews} workfiles (\texttt{.wf1}).\footnote{See
    \url{http://ricardo.ecn.wfu.edu/~cottrell/eviews_format/}.}
\item \app{JMulTi} data files.
\end{itemize}

When you import data from a plain text format, gretl opens a
``diagnostic'' window, reporting on its progress in reading the data.
If you encounter a problem with ill-formatted data, the messages in
this window should give you a handle on fixing the problem.

Note that gretl has a facility for writing out data in the
native formats of GNU \app{R}, \app{Octave}, \app{JMulTi} and
\app{PcGive} (see Appendix~\ref{app-advanced}).  In the GUI client
this option is found under the ``File, Export data'' menu; in the
command-line client use the \cmd{store} command with the appropriate
option flag.

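For example, with a dataset open, a script along the following lines
should write the data in \app{Octave} and \app{R} formats.  The
option-flag names shown here are from memory and should be checked
against \cmd{help store}; \texttt{data4-1} is one of the sample
datasets shipped with gretl.

\begin{code}
open data4-1
store mydata.m --gnu-octave
store mydata.R --gnu-R
\end{code}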
\section{Databases}
\label{dbase}

For working with large amounts of data gretl is supplied with a
database-handling routine.  A \emph{database}, as opposed to a
\emph{data file}, is not read directly into the program's workspace.
A database can contain series of mixed frequencies and sample ranges.
You open the database and select series to import into the working
dataset.  You can then save those series in a native format data file
if you wish. Databases can be accessed via the menu item ``File,
Databases''.

For details on the format of gretl databases, see
Appendix~\ref{app-datafile}.

\subsection{Online databases}
\label{online-data}

Several gretl databases are available from Wake Forest University.
Your computer must be connected to the internet for this option to
work.  Please see the description of the \cmd{data} command for
details.

\tip{Visit the gretl
  \href{http://gretl.sourceforge.net/gretl_data.html}{data page} for
  details and updates on available data.}

\subsection{Foreign database formats}
\label{RATS}

Thanks to Thomas Doan of \emph{Estima}, who made available the
specification of the database format used by RATS 4 (Regression
Analysis of Time Series), gretl can handle such databases---or at
least a subset of them, namely time-series databases containing
monthly and quarterly series.

Gretl can also import data from \app{PcGive} databases.  These
take the form of a pair of files, one containing the actual data (with
suffix \texttt{.bn7}) and one containing supplementary information
(\texttt{.in7}).

In addition, gretl offers ODBC connectivity. Be warned: this feature
is meant for somewhat advanced users; there is currently no graphical
interface.  Details on its use are given in chapter~\ref{chap:odbc}.

\section{Creating a dataset from scratch}
\label{scratch}

There are several ways of doing this:

\begin{enumerate}
\item Find, or create using a text editor, a plain text data file and
  open it via ``Import''.
\item Use your favorite spreadsheet to establish the data file, save
  it in comma-separated format if necessary (this may not be
  necessary if the spreadsheet format is MS Excel, Gnumeric or Open
  Document), then use one of the ``Import'' options.
\item Use gretl's built-in spreadsheet.
\item Select data series from a suitable database.
\item Use your favorite text editor or other software tools to create
  a data file in gretl format independently.
\end{enumerate}

Here are a few comments and details on these methods.

\subsection{Common points on imported data}

Options (1) and (2) involve using gretl's ``import'' mechanism.
For the program to read such data successfully, certain general
conditions must be satisfied:

\begin{itemize}

\item The first row must contain valid variable names.  A valid
  variable name is of 31 characters maximum; starts with a letter; and
  contains nothing but letters, numbers and the underscore character,
  \verb+_+.  (Longer variable names will be truncated to 31
  characters.)  Qualifications to the above: First, in the case of a
  plain text import, if the file contains no row with variable names
  the program will automatically add names, \verb+v1+, \verb+v2+ and
  so on.  Second, by ``the first row'' is meant the first
  \emph{relevant} row.  In the case of plain text imports, blank
  rows and rows beginning with a hash mark, \verb+#+, are ignored.  In
  the case of Excel, Gnumeric and ODS imports, you are presented with a
  dialog box where you can select an offset into the spreadsheet, so
  that gretl will ignore a specified number of rows and/or
  columns.

\item Data values: these should constitute a rectangular block, with
  one variable per column (and one observation per row).  The number
  of variables (data columns) must match the number of variable names
  in the first relevant row.  In general, numerical data are
  expected, but in the case of importing from plain text, the program
  offers limited handling of character (string) data: if a given
  column contains character data only, consecutive numeric codes are
  substituted for the strings, and once the import is complete a table
  is printed showing the correspondence between the strings and the
  codes.

\item Dates (or observation labels): Optionally, the \emph{first}
  column may contain strings such as dates, or labels for
  cross-sectional observations.  Such strings have a maximum of 15
  characters (as with variable names, longer strings will be
  truncated).  A column of this sort should be headed with the string
  \verb+obs+ or \verb+date+, or the first row entry may be left
  blank.

  For dates to be recognized as such, the date strings should adhere
  to one or other of a set of specific formats, as follows.  For
  \emph{annual} data: 4-digit years.  For \emph{quarterly} data: a
  4-digit year, followed by a separator (either a period, a colon, or
  the letter \verb+Q+), followed by a 1-digit quarter.  Examples:
  \verb+1997.1+, \verb+2002:3+, \verb+1947Q1+.  For \emph{monthly}
  data: a 4-digit year, followed by a period or a colon, followed by a
  two-digit month.  Examples: \verb+1997.01+, \verb+2002:10+.

\end{itemize}

Plain text (``CSV'') files can use comma, space, tab or semicolon as
the column separator.  When you open such a file via the GUI you are
given the option of specifying the separator, though in most cases it
should be detected automatically.

If you use a spreadsheet to prepare your data you are able to carry
out various transformations of the ``raw'' data with ease (adding
things up, taking percentages or whatever): note, however, that you
can also do this sort of thing easily---perhaps more easily---within
gretl, by using the tools under the ``Add'' menu.

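Putting these conditions together, a small well-formed plain-text data
file for quarterly data (with hypothetical series names) might look
like this:

\begin{code}
obs,income,consump
1990:1,512.3,480.1
1990:2,514.9,482.7
1990:3,519.2,485.0
1990:4,523.8,488.6
\end{code}

The first column is headed \verb+obs+ and uses one of the recognized
quarterly date formats, so on import gretl should establish a
quarterly time series starting in 1990:1.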
\subsection{Appending imported data}

You may wish to establish a dataset piece by piece, by incremental
importation of data from other sources.  This is supported via the
``File, Append data'' menu items: gretl will check the new data for
conformability with the existing dataset and, if everything seems OK,
will merge the data.  You can add new variables in this way, provided
the data frequency matches that of the existing dataset.  Or you can
append new observations for data series that are already present; in
this case the variable names must match up correctly.  Note that by
default (that is, if you choose ``Open data'' rather than ``Append
data''), opening a new data file closes the current one.

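In script mode the counterpart is the \cmd{append} command.  A minimal
sketch, supposing the hypothetical files \texttt{base.gdt} and
\texttt{extra.csv} are conformable:

\begin{code}
open base.gdt
append extra.csv
\end{code}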
\subsection{Using the built-in spreadsheet}

Under the ``File, New data set'' menu you can choose the sort of
dataset you want to establish (e.g.\ quarterly time series,
cross-sectional).  You will then be prompted for starting and ending
dates (or observation numbers) and the name of the first variable to
add to the dataset. After supplying this information you will be faced
with a simple spreadsheet into which you can type data values.  In the
spreadsheet window, clicking the right mouse button will invoke a
popup menu enabling you to add a new variable (append a column), to add an
observation (append a row at the foot of the sheet), or to insert an
observation at the selected point (move the data down and insert a
blank row.)

Once you have entered data into the spreadsheet you import these into
gretl's workspace using the spreadsheet's ``Apply changes''
button.

Please note that gretl's spreadsheet is quite basic and has no
support for functions or formulas.  Data transformations are done via
the ``Add'' or ``Variable'' menus in the main window.

\subsection{Selecting from a database}

Another alternative is to establish your dataset by selecting
variables from a database.

Begin with the ``File, Databases'' menu item. This has four forks:
``Gretl native'', ``RATS 4'', ``PcGive'' and ``On database server''.
You should be able to find the file \verb+fedstl.bin+ in the file
selector that opens if you choose the ``Gretl native'' option since
this file, which contains a large collection of US macroeconomic time
series, is supplied with the distribution.

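The same can be done in a script: \cmd{open} can target a database
file, after which the \cmd{data} command imports selected series.  For
example (the series name below is hypothetical; browse the database
window, or see \cmd{help data}, for actual identifiers):

\begin{code}
open fedstl.bin
data unrate
\end{code}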
You won't find anything under ``RATS 4'' unless you have purchased
RATS data.\footnote{See \href{http://www.estima.com/}{www.estima.com}}
If you do possess RATS data you should go into the ``Tools,
Preferences, General'' dialog, select the Databases tab, and fill in
the correct path to your RATS files.

If your computer is connected to the internet you should find several
databases (at Wake Forest University) under ``On database server''.
You can browse these remotely; you also have the option of installing
them onto your own computer.  The initial remote databases window has
an item showing, for each file, whether it is already installed
locally (and if so, if the local version is up to date with the
version at Wake Forest).

Assuming you have managed to open a database you can import selected
series into gretl's workspace by using the ``Series, Import''
menu item in the database window, or via the popup menu that appears
if you click the right mouse button, or by dragging the series into
the program's main window.

\subsection{Creating a gretl data file independently}

It is possible to create a data file in one or other of gretl's own
formats using a text editor or software tools such as \app{awk},
\app{sed} or \app{perl}.  This may be a good choice if you have large
amounts of data already in machine-readable form. You will, of course,
need to study these data formats (XML-based or ``traditional'') as
described in Appendix~\ref{app-datafile}.

\section{Structuring a dataset}
\label{sec:data-structure}

Once your data are read by gretl, it may be necessary to supply
some information on the nature of the data. We distinguish between
three kinds of datasets:
\begin{enumerate}
\item Cross section
\item Time series
\item Panel data
\end{enumerate}

The primary tool for doing this is the ``Data, Dataset structure''
menu entry in the graphical interface, or the \texttt{setobs} command
for scripts and the command-line interface.

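For instance, having opened a dataset of undated observations, a
script line such as the following imposes a quarterly time-series
interpretation starting in 1990:1:

\begin{code}
setobs 4 1990:1 --time-series
\end{code}

A cross-sectional interpretation can be imposed in similar fashion
with \cmd{setobs} \texttt{1 1 --cross-section}.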
\subsection{Cross sectional data}
\label{sec:cross-section-data}

By a cross section we mean observations on a set of ``units'' (which
may be firms, countries, individuals, or whatever) at a common point
in time.  This is the default interpretation for a data file: if there
is insufficient information to interpret data as time-series or panel
data, they are automatically interpreted as a cross section.  In the
unlikely event that cross-sectional data are wrongly interpreted as
time series, you can correct this by selecting the ``Data, Dataset
structure'' menu item.  Click the ``cross-sectional'' radio button in
the dialog box that appears, then click ``Forward''.  Click ``OK'' to
confirm your choice.

\subsection{Time series data}
\label{sec:timeser-data}

When you import data from a spreadsheet or plain text file,
gretl will make fairly strenuous efforts to glean time-series
information from the first column of the data, if it looks at all
plausible that such information may be present.  If time-series
structure is present but not recognized, again you can use the ``Data,
Dataset structure'' menu item.  Select ``Time series'' and click
``Forward''; select the appropriate data frequency and click
``Forward'' again; then select or enter the starting observation and
click ``Forward'' once more.  Finally, click ``OK'' to confirm the
time-series interpretation if it is correct (or click ``Back'' to make
adjustments if need be).

Besides the basic business of getting a data set interpreted as time
series, further issues may arise relating to the frequency of
time-series data.  In a gretl time-series data set, all the series
must have the same frequency.  Suppose you wish to make a combined
dataset using series that, in their original state, are not all of the
same frequency.  For example, some series are monthly and some are
quarterly.

Your first step is to formulate a strategy: Do you want to end up with
a quarterly or a monthly data set?  A basic point to note here is
that ``compacting'' data from a higher frequency (e.g.\ monthly) to
a lower frequency (e.g.\ quarterly) is usually unproblematic.  You
lose information in doing so, but in general it is perfectly
legitimate to take (say) the average of three monthly observations to
create a quarterly observation.  On the other hand, ``expanding'' data
from a lower to a higher frequency is not, in general, a valid
operation.

In most cases, then, the best strategy is to start by creating a data
set of the \textit{lower} frequency, and then to compact the higher
frequency data to match.  When you import higher-frequency data from a
database into the current data set, you are given a choice of
compaction method (average, sum, start of period, or end of period).
In most instances ``average'' is likely to be appropriate.

You \textit{can} also import lower-frequency data into a
high-frequency data set, but this is generally not recommended.  What
gretl does in this case is simply replicate the values of the
lower-frequency series as many times as required. For example, suppose
we have a quarterly series with the value 35.5 in 1990:1, the first
quarter of 1990.  On expansion to monthly, the value 35.5 will be
assigned to the observations for January, February and March of 1990.
The expanded variable is therefore useless for fine-grained
time-series analysis, outside of the special case where you know that
the variable in question does in fact remain constant over the
sub-periods.

When the current data frequency is appropriate, gretl offers
both ``Compact data'' and ``Expand data'' options under the ``Data''
menu.  These options operate on the whole data set, compacting or
expanding all series.  They should be considered ``expert'' options
and should be used with caution.

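In a script, whole-dataset compaction can be performed with the
\cmd{dataset} command.  A sketch, assuming a monthly dataset is open
(the default method averages the monthly values; see \cmd{help
dataset} for the available alternatives):

\begin{code}
dataset compact 4    # monthly to quarterly
\end{code}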
\subsection{Panel data}
\label{sec:panel-data}

Panel data are inherently three dimensional---the dimensions being
variable, cross-sectional unit, and time-period.  For example, a
particular number in a panel data set might be identified as the
observation on capital stock for General Motors in 1980.  (A note on
terminology: we use the terms ``cross-sectional unit'', ``unit'' and
``group'' interchangeably below to refer to the entities that compose
the cross-sectional dimension of the panel.  These might, for
instance, be firms, countries or persons.)

For representation in a textual computer file (and also for gretl's
internal calculations) the three dimensions must somehow be flattened
into two.  This ``flattening'' involves taking layers of the data that
would naturally stack in a third dimension, and stacking them in the
vertical dimension.

gretl always expects data to be arranged ``by observation'',
that is, such that each row represents an observation (and each
variable occupies one and only one column).  In this context the
flattening of a panel data set can be done in either of two ways:

\begin{itemize}
\item Stacked time series: the successive vertical blocks each
  comprise a time series for a given unit.
\item Stacked cross sections: the successive vertical blocks each
  comprise a cross-section for a given period.
\end{itemize}

You may input data in whichever arrangement is more convenient.
Internally, however, gretl always stores panel data in
the form of stacked time series.

\section{Panel data specifics}
\label{sec:more-panel}

When you import panel data into gretl from a spreadsheet or
comma separated format, the panel nature of the data will not be
recognized automatically (most likely the data will be treated as
``undated'').  A panel interpretation can be imposed on the data
using the graphical interface or via the \cmd{setobs} command.

In the graphical interface, use the menu item ``Data, Dataset
structure''.  In the first dialog box that appears, select ``Panel''.
In the next dialog you have a three-way choice.  The first two
options, ``Stacked time series'' and ``Stacked cross sections'', are
applicable if the data set is already organized in one of these two
ways.  If you select either of these options, the next step is to
specify the number of cross-sectional units in the data set.  The
third option, ``Use index variables'', is applicable if the data set
contains two variables that index the units and the time periods
respectively; the next step is then to select those variables.  For
example, a data file might contain a country code variable and a
variable representing the year of the observation.  In that case
gretl can reconstruct the panel structure of the data regardless
of how the observation rows are organized.

The \cmd{setobs} command has options that parallel those in the
graphical interface.  If suitable index variables are available
you can do, for example
%
\begin{code}
setobs unitvar timevar --panel-vars
\end{code}
%
where \texttt{unitvar} is a variable that indexes the units and
\texttt{timevar} is a variable indexing the periods.  Alternatively
you can use the form \verb+setobs+ \textsl{freq} \verb+1:1+
\textsl{structure}, where \textsl{freq} is replaced by the ``block
size'' of the data (that is, the number of periods in the case of
stacked time series, or the number of units in the case of stacked
cross-sections) and \textsl{structure} is either
\option{stacked-time-series} or \option{stacked-cross-section}.  Two
examples are given below: the first is suitable for a panel in the
form of stacked time series with observations from 20 periods; the
second for stacked cross sections with 5 units.
%
\begin{code}
setobs 20 1:1 --stacked-time-series
setobs 5 1:1 --stacked-cross-section
\end{code}

\subsection{Panel data arranged by variable}

Publicly available panel data sometimes come arranged ``by variable''.
Suppose we have data on two variables, \varname{x1} and \varname{x2},
for each of 50 states in each of 5 years (giving a total of 250
observations per variable).  One textual representation of such a data
set would start with a block for \varname{x1}, with 50 rows
corresponding to the states and 5 columns corresponding to the years.
This would be followed, vertically, by a block with the same structure
for variable \varname{x2}.  A fragment of such a data file is shown
below, with quinquennial observations 1965--1985.  Imagine the table
continued for 48 more states, followed by another 50 rows for variable
\varname{x2}.

\begin{center}
  \begin{tabular}{rrrrrr}
  \varname{x1} \\
     & 1965 & 1970 & 1975 & 1980 & 1985 \\
  AR & 100.0 & 110.5 & 118.7 & 131.2 & 160.4\\
  AZ & 100.0 & 104.3 & 113.8 & 120.9 & 140.6\\
  \end{tabular}
\end{center}

If a datafile with this sort of structure is read into
gretl,\footnote{Note that you will have to modify such a
  datafile slightly before it can be read at all.  The line containing
  the variable name (in this example \varname{x1}) will have to be
  removed, and so will the initial row containing the years,
  otherwise they will be taken as numerical data.}  the program
will interpret the columns as distinct variables, so the data will not
be usable ``as is''.  But there is a mechanism for correcting the
situation, namely the \cmd{stack} function.

Consider the first data column in the fragment above: the first 50 rows
of this column constitute a cross-section for the variable \varname{x1}
in the year 1965.  If we could create a new series by stacking the
first 50 entries in the second column underneath the first 50 entries
in the first, we would be on the way to making a data set ``by
observation'' (in the first of the two forms mentioned above, stacked
cross-sections).  That is, we'd have a column comprising a
cross-section for \varname{x1} in 1965, followed by a cross-section for
the same variable in 1970.

The following gretl script illustrates how we can accomplish the
stacking, for both \varname{x1} and \varname{x2}.  We assume
that the original data file is called \texttt{panel.txt}, and that in
this file the columns are headed with ``variable names'' \varname{v1},
\varname{v2}, \dots, \varname{v5}.  (The columns are not really
variables, but in the first instance we ``pretend'' that they are.)

\begin{code}
open panel.txt
series x1 = stack(v1..v5, 50)
series x2 = stack(v1..v5, 50, 50)
setobs 50 1:1 --stacked-cross-section
store panel.gdt x1 x2
\end{code}

The second and third lines illustrate the syntax of the \cmd{stack}
function, which takes up to three arguments.  The double dots in the
first argument indicate a range of variables to be stacked: here we
want to stack all 5 columns (for all 5 years). More generally, you can
define a named list of series and pass that as the first argument to
\texttt{stack} (see chapter~\ref{chap:lists-strings}). In this
example we're supposing that the full data set contains 100 rows, and
that in the stacking of variable \varname{x1} we wish to read only the
first 50 rows from each column: we achieve this by adding \texttt{50}
as a second (\texttt{length}) argument.

On line 3 we do the stacking for variable \varname{x2}.  Again we want
a \texttt{length} of 50 for the components of the stacked series, but
this time we want to start reading from the 50th row of the original
data, and so we add a third \texttt{offset} argument of 50.  The
signature of the stack function is shown below; the second and third
arguments are optional, defaulting to ``automatic'' and 0
respectively.
\begin{code}
series stack(list L, int length n, int offset k)
\end{code}
Line 4 then imposes a panel interpretation on the data. Finally, we
save the stacked data to file, with the panel interpretation.

The illustrative script above is appropriate when the number of
variables to be processed is small.  When there are many variables in
the dataset it's more convenient to use a loop to accomplish the
stacking, as shown in the following script.  The setup is presumed to
be the same as in the previous case (50 units, 5 periods), but with 20
variables rather than 2.

\begin{code}
open panel.txt
list L = v1..v5 # predefine a list of series
scalar length = 50
loop i=1..20 --quiet
  scalar offset = (i - 1) * length
  series x$i = stack(L, length, offset)
endloop
setobs 50 1:1 --stacked-cross-section
store panel.gdt x1..x20
\end{code}

\subsection{Side-by-side time series}

There's a second sort of data that you may wish to convert to gretl's
panel format, namely side-by-side time series for a number of
cross-sectional units.  For example, a data file might contain separate
GDP series of common length $T$ for each of $N$ countries.  To turn
these into a single stacked time series the \texttt{stack} function
can again be used. An example follows, where we suppose the original
data source is a comma-separated file named \texttt{GDP.csv},
containing GDP data for countries from Austria (\texttt{GDP\_AT}) to
Zimbabwe (\texttt{GDP\_ZW}) in consecutive columns.

\begin{code}
open GDP.csv
scalar T = $nobs # the number of periods
list L = GDP_AT..GDP_ZW
series GDP = stack(L, T)
setobs T 1:1 --stacked-time-series
store panel.gdt GDP
\end{code}

The resulting data file, \texttt{panel.gdt}, will contain a single
series of length $NT$ where $N$ is the number of countries and
$T$ is the length of the original dataset. One could insert revised
variants of lines 3 and 4 of the script if the original file contained
additional side-by-side per-country series for investment, consumption
or whatever.

\subsection{Panel data marker strings}

It can be helpful with panel data to have the observations identified
by mnemonic markers.  A special function in the \texttt{genr} command
is available for this purpose.

In the example under the heading ``Panel data arranged by variable''
above, suppose all the states are identified by two-letter codes in
the left-most column of the original datafile.  When the
\texttt{stack} function is invoked as shown, these codes will be
stacked along with the data values.  If the first row is marked
\texttt{AR} for Arkansas, then the marker \texttt{AR} will end up
being shown on each row containing an observation for Arkansas.
That's all very well, but these markers don't tell us anything about
the date of the observation.  To rectify this we could do:

\begin{code}
genr time
series year = 1960 + (5 * time)
genr markers = "%s:%d", marker, year
\end{code}

The first line generates a 1-based index representing the period of
each observation, and the second line uses the \texttt{time} variable
to generate a variable representing the year of the observation.  The
third line contains this special feature: if (and only if) the name of
the new ``variable'' to generate is \texttt{markers}, the portion of
the command following the equals sign is taken as a C-style format
string (which must be wrapped in double quotes), followed by a
comma-separated list of arguments.  The arguments will be printed
according to the given format to create a new set of observation
markers.  Valid arguments are either the names of variables in the
dataset, or the string \texttt{marker} which denotes the pre-existing
observation marker.  The format specifiers which are likely to be
useful in this context are \texttt{\%s} for a string and \texttt{\%d}
for an integer.  Strings can be truncated: for example \texttt{\%.3s}
will use just the first three characters of the string.  To chop
initial characters off an existing observation marker when
constructing a new one, you can use the syntax \texttt{marker + n},
where \texttt{n} is a positive integer: in this case the first
\texttt{n} characters will be skipped.

After the commands above are processed, then, the observation markers
will look like, for example, \texttt{AR:1965}, where the two-letter
state code and the year of the observation are spliced together with a
colon.

\subsection{Panel dummy variables}
\label{panel-dummies}

In a panel study you may wish to construct dummy variables of one or
both of the following sorts: (a) dummies as unique identifiers for the
units or groups, and (b) dummies as unique identifiers for the time
periods.  The former may be used to allow the intercept of the
regression to differ across the units, the latter to allow the
intercept to differ across periods.

Two special functions are available to create such dummies.  These are
found under the ``Add'' menu in the GUI, or under the \cmd{genr}
command in script mode or \app{gretlcli}.

\begin{enumerate}
\item ``unit dummies'' (script command \cmd{genr unitdum}).  This
  command creates a set of dummy variables identifying the
  cross-sectional units.  The variable \verb+du_1+ will have value 1
  in each row corresponding to a unit 1 observation, 0 otherwise;
  \verb+du_2+ will have value 1 in each row corresponding to a unit 2
  observation, 0 otherwise; and so on.
\item ``time dummies'' (script command \cmd{genr timedum}).  This
  command creates a set of dummy variables identifying the periods.
  The variable \verb+dt_1+ will have value 1 in each row
  corresponding to a period 1 observation, 0 otherwise; \verb+dt_2+
  will have value 1 in each row corresponding to a period 2
  observation, 0 otherwise; and so on.
\end{enumerate}

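For instance, unit dummies can be used to allow unit-specific
intercepts in a regression.  A sketch for a panel with four units (the
dependent variable \varname{y} and regressor \varname{x1} are
hypothetical; \verb+du_1+ is omitted to avoid exact collinearity with
the constant):

\begin{code}
genr unitdum
ols y const x1 du_2 du_3 du_4
\end{code}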
If a panel data set has the \verb+YEAR+ of the observation entered as
one of the variables you can create a periodic dummy to pick out a
particular year, e.g.\ \cmd{genr dum = (YEAR==1960)}.  You can also
create periodic dummy variables using the modulus operator,
\verb+%+.  For instance, to create a dummy with
value 1 for the first observation and every thirtieth observation
thereafter, 0 otherwise, do
%
\begin{code}
genr index
series dum = ((index-1) % 30) == 0
\end{code}

640 \subsection{Lags, differences, trends}
641 \label{panel-lagged}
642
643 If the time periods are evenly spaced you may want to use lagged
645 chapter~\ref{chap:dpanel}); you may also wish to construct first
646 differences of variables of interest.

Once a dataset is identified as a panel, gretl will handle the
generation of such variables correctly.  For example the command
\verb+genr x1_1 = x1(-1)+ will create a variable that contains the
first lag of \verb+x1+ where available, and the missing value code
where the lag is not available (e.g.\ at the start of the time series
for each group).  When you run a regression using such variables, the
program will automatically skip the missing observations.
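For instance (a sketch, assuming a panel dataset containing a series
\verb+x1+):

\begin{code}
series x1_1 = x1(-1)   # first lag: NA at the first period of each unit
series dx1 = diff(x1)  # first difference: likewise NA at each unit's start
\end{code}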

When a panel data set has a fairly substantial time dimension, you may
wish to include a trend in the analysis.  The command \cmd{genr time}
creates a variable named \varname{time} which runs from 1 to $T$ for
each unit, where $T$ is the length of the time-series dimension of the
panel.  If you want to create an index that runs consecutively from 1
to $m\times T$, where $m$ is the number of units in the panel, use
\cmd{genr index}.

\subsection{Basic statistics by unit}
\label{panel-stats}

Gretl contains functions which can be used to generate basic
descriptive statistics for a given variable, on a per-unit basis;
these are \texttt{pnobs()} (number of valid cases), \texttt{pmin()}
and \texttt{pmax()} (minimum and maximum), and \texttt{pmean()} and
\texttt{psd()} (mean and standard deviation).

As a brief illustration, suppose we have a panel data set comprising 8
time-series observations on each of $N$ units or groups.  Then the
command
%
\begin{code}
series pmx = pmean(x)
\end{code}
%
creates a series of this form: the first 8 values (corresponding to
unit 1) contain the mean of \varname{x} for unit 1, the next 8 values
contain the mean for unit 2, and so on.  The \texttt{psd()} function
works in a similar manner.  The sample standard deviation for group
$i$ is computed as
\[
s_i = \sqrt{\frac{\sum(x-\bar{x}_i)^2}{T_i-1}}
\]
where $T_i$ denotes the number of valid observations on \varname{x}
for the given unit, $\bar{x}_i$ denotes the group mean, and the
summation is across valid observations for the group.  If $T_i < 2$,
however, the standard deviation is recorded as 0.
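The other per-unit functions work analogously.  A brief sketch, again
assuming a series \varname{x} in an open panel dataset:

\begin{code}
series Ti = pnobs(x)   # valid observations per unit, repeated within the unit
series xmin = pmin(x)  # per-unit minimum
series xmax = pmax(x)  # per-unit maximum
\end{code}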

One particular use of \texttt{psd()} may be worth noting.  If you want
to form a sub-sample of a panel that contains only those units for
which the variable \varname{x} is time-varying, you can either use
%
\begin{code}
smpl pmin(x) < pmax(x) --restrict
\end{code}
or
%
\begin{code}
smpl psd(x) > 0 --restrict
\end{code}

\section{Missing data values}
\label{missing-data}

\subsection{Representation and handling}

Missing values are represented internally as \verb+NaN+ (``not a
number''), as defined in the IEEE 754 floating-point standard. In a
native-format data file they should be represented as \verb+NA+. When
importing CSV data gretl accepts several common representations of
missing values including $-$999, the string \verb+NA+ (in upper or
lower case), a single dot, or simply a blank cell.  Blank cells
should, of course, be properly delimited, e.g.\ \verb+120.6,,5.38+, in
which the middle value is presumed missing.
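For illustration, each of the following rows of a hypothetical CSV
file would be read with its second field missing:

\begin{code}
x1,x2,x3
120.6,,5.38
4.2,NA,1.7
9.1,.,3.3
\end{code}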

As for handling of missing values in the course of statistical
analysis, gretl does the following:

\begin{itemize}
\item In calculating descriptive statistics (mean, standard deviation,
  etc.) under the \cmd{summary} command, missing values are simply
  skipped and the sample size adjusted appropriately.
\item In running regressions gretl first adjusts the beginning
  and end of the sample range, truncating the sample if need be.
  Missing values at the beginning of the sample are common in time
  series work due to the inclusion of lags, first differences and so
  on; missing values at the end of the range are not uncommon due to
  differential updating of series and possibly the inclusion of leads.
\end{itemize}

If gretl detects any missing values ``inside'' the (possibly
truncated) sample range for a regression, the result depends on the
character of the dataset and the estimator chosen.  In many cases, the
program will automatically skip the missing observations when
calculating the regression results.  In this situation a message is
printed stating how many observations were dropped.  On the other
hand, the skipping of missing observations is not supported for all
procedures: exceptions include all autoregressive estimators, system
estimators such as SUR, and nonlinear least squares.  In the case of
panel data, the skipping of missing observations is supported only if
their omission leaves a balanced panel. If missing observations are
found in cases where they are not supported, gretl gives an
error message and refuses to produce estimates.

\subsection{Manipulating missing values}
\label{sec:genr-missing}

Some special functions are available for the handling of missing
values.  The Boolean function \verb+missing()+ takes the name of a
variable as its single argument; it returns a series with value 1 for
each observation at which the given variable has a missing value, and
value 0 otherwise (that is, if the given variable has a valid value at
that observation).  The function \verb+ok()+ is complementary to
\verb+missing+; it is just a shorthand for \verb+!missing+ (where
\verb+!+ is the Boolean NOT operator).  For example, one can count the
missing values for variable \verb+x+ using

\begin{code}
scalar nmiss_x = sum(missing(x))
\end{code}

The function \verb+zeromiss()+, which again takes a single series as
its argument, returns a series where all zero values are set to the
missing code.  This should be used with caution---one does not want to
confuse missing values and zeros---but it can be useful in some
contexts.  For example, one can determine the first valid observation
for a variable \verb+x+ using

\begin{code}
genr time
scalar x0 = min(zeromiss(time * ok(x)))
\end{code}

The function \verb+misszero()+ does the opposite of \verb+zeromiss+,
that is, it converts all missing values to zero.
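As a sketch, the two functions are inverses in the following sense
(assuming a series \verb+x+ with some missing values but no genuine
zeros):

\begin{code}
series xz = misszero(x)  # missing values become 0
series xr = zeromiss(xz) # zeros back to NA, recovering x
\end{code}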

If missing values get involved in calculations, they propagate
according to the IEEE rules: notably, if one of the operands to an
arithmetical operation is a \texttt{NaN}, the result will also be
\texttt{NaN}.
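For example, given series \verb+x+ and \verb+y+:

\begin{code}
series z = x + y   # z is NA at any observation where x or y is NA
\end{code}

If treating missing values as zeros is appropriate in context,
\verb+misszero()+ can be applied to the operands first.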

\section{Maximum size of data sets}
\label{data-limits}

Basically, the size of data sets (both the number of variables and the
number of observations per variable) is limited only by the
characteristics of your computer.  Gretl allocates memory
dynamically, and will ask the operating system for as much memory as
your data require.  Obviously, then, you are ultimately limited by the
size of RAM.

Aside from the multiple-precision OLS option, gretl uses
double-precision floating-point numbers throughout.  The size of such
numbers in bytes depends on the computer platform, but is typically
eight.  To give a rough notion of magnitudes, suppose we have a data
set with 10,000 observations on 500 variables.  That's 5 million
floating-point numbers or 40 million bytes.  If we define the megabyte
(MB) as $1024 \times 1024$ bytes, as is standard in talking about RAM,
it's slightly over 38 MB.  The program needs additional memory for
workspace, but even so, handling a data set of this size should be
quite feasible on a current PC, which at the time of writing is likely
to have at least 256 MB of RAM.
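The arithmetic can be checked directly (a sketch; the numbers are just
those of the worked example, not limits imposed by gretl):

\begin{code}
scalar bytes = 10000 * 500 * 8     # observations x variables x 8 bytes
scalar mb = bytes / (1024 * 1024)  # slightly over 38 MB
\end{code}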

If RAM is not an issue, there is one further limitation on data size
(though it's very unlikely to be a binding constraint).  That is,
variables and observations are indexed by signed integers, and on a
typical PC these will be 32-bit values, capable of representing
a maximum positive value of $2^{31} - 1 = 2,147,483,647$.

The limits mentioned above apply to gretl's ``native''
functionality.  There are tighter limits with regard to two
third-party programs that are available as add-ons to gretl for
certain sorts of time-series analysis including seasonal adjustment,
namely \app{TRAMO/SEATS} and \app{X-12-ARIMA}.  These programs employ
a fixed-size memory allocation, and can't handle series of more than
600 observations.


\section{Data file collections}
\label{collections}

If you're using gretl in a teaching context you may be
interested in adding a collection of data files and/or scripts that
relate specifically to your course, in such a way that students can
browse and access them easily.

There are three ways to access such collections of files:

\begin{itemize}
\item For data files: select the menu item ``File, Open data, Sample
  file'', or click on the folder icon on the gretl toolbar.
\item For script files: select the menu item ``File, Script
  files, Practice file''.
\end{itemize}

When a user selects one of the items:

\begin{itemize}
\item The data or script files included in the gretl distribution are
  automatically shown (this includes files relating to Ramanathan's
  \emph{Introductory Econometrics} and Greene's \emph{Econometric
    Analysis}).
\item The program looks for certain known collections of data files
  available as optional extras, for instance the datafiles from
  various econometrics textbooks (Davidson and MacKinnon, Gujarati,
  Stock and Watson, Verbeek, Wooldridge) and the Penn World Table (PWT
  5.6).  (See \href{http://gretl.sourceforge.net/gretl_data.html}{the
    data page} at the gretl website for information on these
  collections.)  If the additional files are found, they are added to
  the selection windows.
\item The program then searches for valid file collections (not
  necessarily known in advance) in these places: the ``system'' data
  directory, the system script directory, the user directory, and all
  first-level subdirectories of these.  For reference, typical values
  for these directories are shown in Table~\ref{tab-colls}.  (Note that
  \texttt{PERSONAL} is a placeholder that is expanded by Windows,
  corresponding to ``My Documents'' on English-language systems.)
\end{itemize}

\begin{table}[htbp]
  \begin{center}
    \begin{tabular}{lll}
      & \multicolumn{1}{c}{\textit{Linux}} &
      \multicolumn{1}{c}{\textit{MS Windows}} \\
        system data dir &
        {\small \verb+/usr/share/gretl/data+} &
        {\small \verb+c:\Program Files\gretl\data+} \\
        system script dir &
        {\small \verb+/usr/share/gretl/scripts+} &
        {\small \verb+c:\Program Files\gretl\scripts+} \\
        user dir &
        {\small \verb+$HOME/gretl+} &
        {\small \verb+PERSONAL\gretl+}\\
    \end{tabular}
  \end{center}
  \caption{Typical locations for file collections}
  \label{tab-colls}
\end{table}

Any valid collections will be added to the selection windows. So what
constitutes a valid file collection?  This comprises either a set of
data files in gretl XML format (with the \verb+.gdt+ suffix) or
a set of script files containing gretl commands (with \verb+.inp+
suffix), in each case accompanied by a ``master file'' or catalog.
The gretl distribution contains several example catalog files,
for instance the file \verb+descriptions+ in the \verb+misc+
sub-directory of the gretl data directory and
\verb+ps_descriptions+ in the \verb+misc+ sub-directory of the scripts
directory.

If you are adding your own collection, data catalogs should be named
\verb+descriptions+ and script catalogs should be named
\verb+ps_descriptions+.  In each case the catalog should be placed
(along with the associated data or script files) in its own specific
sub-directory (e.g.\ \url{/usr/share/gretl/data/mydata} or
\verb+c:\userdata\gretl\data\mydata+).

The catalog files are plain text; if they contain non-ASCII characters
they must be encoded as UTF-8. The syntax of such files is
straightforward.  Here, for example, are the first few lines of
gretl's ``misc'' data catalog:

\begin{code}
# Gretl: various illustrative datafiles
"arma","artificial data for ARMA script example"
"ects_nls","Nonlinear least squares example"
"hamilton","Prices and exchange rate, U.S. and Italy"
\end{code}

The first line, which must start with a hash mark, contains a short
name, here ``Gretl'', which will appear as the label for this
collection's tab in the data browser window, followed by a colon,
followed by an optional short description of the collection.

Subsequent lines contain two elements, separated by a comma and
wrapped in double quotation marks.  The first is a datafile name
(leave off the \verb+.gdt+ suffix here) and the second is a short
description of the content of that datafile.  There should be one such
line for each datafile in the collection.

A script catalog file looks very similar, except that there are three
fields in the file lines: a filename (without its \verb+.inp+ suffix),
a brief description of the econometric point illustrated in the
script, and a brief indication of the nature of the data used.  Again,
here are the first few lines of the supplied ``misc'' script catalog:

\begin{code}
# Gretl: various sample scripts
"arma","ARMA modeling","artificial data"
"ects_nls","Nonlinear least squares (Davidson)","artificial data"
"leverage","Influential observations","artificial data"
"longley","Multicollinearity","US employment"
\end{code}

If you want to make your own data collection available to users, these
are the steps:

\begin{enumerate}
\item Assemble the data, in whatever format is convenient.
\item Convert the data to gretl format and save as \verb+gdt+
  files.  It is probably easiest to convert the data by importing them
  into the program from plain text, CSV, or a spreadsheet format (MS
  Excel or Gnumeric) then saving them. You may wish to add
  descriptions of the individual variables (the ``Variable, Edit
  attributes'' menu item), and add information on the source of the
  data (the ``Data, Edit info'' menu item).
\item Write a descriptions file for the collection using a text
  editor.
\item Put the datafiles plus the descriptions file in a subdirectory
  of the gretl data directory (or user directory).
\item If the collection is to be distributed to other people, package
  the data files and catalog in some suitable manner, e.g.\ as a
  zipfile.
\end{enumerate}

If you assemble such a collection, and the data are not proprietary, we
would encourage you to submit the collection for packaging as a
gretl optional extra.

\section{Assembling data from multiple sources}

In many contexts researchers need to bring together data from multiple
source files, and in some cases these sources are not organized such
that the data can simply be ``stuck together'' by appending rows or
columns to a base dataset. In gretl, the \texttt{join} command
can be used for this purpose; this command is discussed in detail in
chapter~\ref{chap:join}.


%%% Local Variables:
%%% mode: latex
%%% TeX-master: "gretl-guide"
%%% End: