
% Source: gretl-2020b/doc/tex/datafiles.tex (1 Apr 2020, 44332 bytes), from package gretl-2020b.tar.xz

\chapter{Data files}
\label{chap:datafiles}

\section{Data file formats}
\label{sec:data-formats}

Gretl has its own native format for data files.  Most users will
probably not want to read or write such files outside of gretl itself,
but occasionally this may be useful and details on the file formats
are given in Appendix~\ref{app-datafile}. The program can also import
data from a variety of other formats. In the GUI program this can be
done via the ``File, Open Data, User file'' menu---note the drop-down
list of acceptable file types. In script mode, simply use the
\cmd{open} command. The supported import formats are as follows.

\begin{itemize}
\item Plain text files (comma-separated or ``CSV'' being the most
  common type).  For details on what gretl expects of such files, see
  Section~\ref{scratch}.
\item Spreadsheets: MS \app{Excel}, \app{Gnumeric} and Open Document
  (ODS). The requirements for such files are given in
  Section~\ref{scratch}.
\item \app{Stata} data files (\texttt{.dta}).
\item \app{SPSS} data files (\texttt{.sav}).
\item \app{SAS} ``xport'' files (\texttt{.xpt}).
\item \app{Eviews} workfiles (\texttt{.wf1}).\footnote{See
    \url{http://ricardo.ecn.wfu.edu/~cottrell/eviews_format/}.}
\item \app{JMulTi} data files.
\end{itemize}

When you import data from a plain text format, gretl opens a
``diagnostic'' window, reporting on its progress in reading the data.
If you encounter a problem with ill-formatted data, the messages in
this window should give you a handle on fixing the problem.

Note that gretl has a facility for writing out data in the
native formats of GNU \app{R}, \app{Octave}, \app{JMulTi} and
\app{PcGive} (see Appendix~\ref{app-advanced}).  In the GUI client
this option is found under the ``File, Export data'' menu; in the
command-line client use the \cmd{store} command with the appropriate
option flag.
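
As a rough sketch of the script-mode equivalents, the following uses
\cmd{open} to import foreign files and \cmd{store} to export data.  The
file names are hypothetical, and the option flags shown are the ones we
believe correspond to the formats named above; check the command
reference for the flags supported by your version.

\begin{code}
# import a CSV file (hypothetical file name)
open mydata.csv
# import the second worksheet of a spreadsheet, skipping 2 leading rows
open mydata.xlsx --sheet=2 --rowoffset=2
# export the current dataset in GNU R format
store mydata.R --gnu-R
# export the current dataset as CSV
store mydata_out.csv --csv
\end{code}
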
\section{Databases}
\label{dbase}

For working with large amounts of data gretl is supplied with a
database-handling routine.  A \emph{database}, as opposed to a
\emph{data file}, is not read directly into the program's workspace.
A database can contain series of mixed frequencies and sample ranges.
You open the database and select series to import into the working
dataset.  You can then save those series in a native format data file
if you wish. Databases can be accessed via the menu item ``File,
Databases''.

For details on the format of gretl databases, see
Appendix~\ref{app-datafile}.
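
In script mode a database is opened with \cmd{open}, and individual
series are then pulled into the working dataset with the \cmd{data}
command.  The sketch below assumes that a gretl database named
\texttt{fedstl.bin} is installed and that it contains a series called
\texttt{unrate}; the series name is an assumption, so substitute one
that exists in your copy.

\begin{code}
open fedstl.bin    # opens the database; no data are loaded yet
data unrate        # import one series into the working dataset
store unrate.gdt   # optionally save the result as a native data file
\end{code}
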
\subsection{Online access to databases}
\label{online-data}

Several gretl databases are available from Wake Forest University.
Your computer must be connected to the internet for this option to
work.  Please see the description of the ``data'' command under
the Help menu.

\tip{Visit the gretl
  \href{http://gretl.sourceforge.net/gretl_data.html}{data page} for
  details and updates on available data.}

\subsection{Foreign database formats}
\label{RATS}

Thanks to Thomas Doan of \emph{Estima}, who made available the
specification of the database format used by RATS 4 (Regression
Analysis of Time Series), gretl can handle such databases---or at
least, a subset of same, namely time-series databases containing
monthly and quarterly series.

Gretl can also import data from \app{PcGive} databases.  These
take the form of a pair of files, one containing the actual data (with
suffix \texttt{.bn7}) and one containing supplementary information
(\texttt{.in7}).

In addition, gretl offers ODBC connectivity. Be warned: this feature
is meant for somewhat advanced users; there is currently no graphical
interface.  Interested readers will find more info in appendix
\ref{chap:odbc}.

\section{Creating a dataset from scratch}
\label{scratch}

There are several ways of doing this:

\begin{enumerate}
\item Find, or create using a text editor, a plain text data file and
  open it via ``Import''.
\item Use your favorite spreadsheet to establish the data file, save
  it in comma-separated format if necessary (this may not be
  necessary if the spreadsheet format is MS Excel, Gnumeric or Open
  Document), then use one of the ``Import'' options.
\item Use gretl's built-in spreadsheet.
\item Select data series from a suitable database.
\item Use your favorite text editor or other software tools to
  create a data file in gretl format independently.
\end{enumerate}

Here are a few comments and details on these methods.

\subsection{Common points on imported data}

Options (1) and (2) involve using gretl's ``import'' mechanism.
For the program to read such data successfully, certain general
conditions must be satisfied:

\begin{itemize}

\item The first row must contain valid variable names.  A valid
  variable name is of 31 characters maximum; starts with a letter; and
  contains nothing but letters, numbers and the underscore character,
  \verb+_+.  (Longer variable names will be truncated to 31
  characters.)  Qualifications to the above: First, in the case of a
  plain text import, if the file contains no row with variable names
  the program will automatically add names, \verb+v1+, \verb+v2+ and
  so on.  Second, by ``the first row'' is meant the first
  \emph{relevant} row.  In the case of plain text imports, blank
  rows and rows beginning with a hash mark, \verb+#+, are ignored.  In
  the case of Excel, Gnumeric and ODS imports, you are presented with a
  dialog box where you can select an offset into the spreadsheet, so
  that gretl will ignore a specified number of rows and/or
  columns.

\item Data values: these should constitute a rectangular block, with
  one variable per column (and one observation per row).  The number
  of variables (data columns) must match the number of variable names
  given. See also section~\ref{missing-data}.  Numeric data are
  expected, but in the case of importing from plain text, the program
  offers limited handling of character (string) data: if a given
  column contains character data only, consecutive numeric codes are
  substituted for the strings, and once the import is complete a table
  is printed showing the correspondence between the strings and the
  codes.

\item Dates (or observation labels): Optionally, the \emph{first}
  column may contain strings such as dates, or labels for
  cross-sectional observations.  Such strings have a maximum of 15
  characters (as with variable names, longer strings will be
  truncated).  A column of this sort should be headed with the string
  \verb+obs+ or \verb+date+, or the first row entry may be left
  blank.

  For dates to be recognized as such, the date strings should adhere
  to one or other of a set of specific formats, as follows.  For
  \emph{annual} data: 4-digit years.  For \emph{quarterly} data: a
  4-digit year, followed by a separator (either a period, a colon, or
  the letter \verb+Q+), followed by a 1-digit quarter.  Examples:
  \verb+1997.1+, \verb+2002:3+, \verb+1947Q1+.  For \emph{monthly}
  data: a 4-digit year, followed by a period or a colon, followed by a
  two-digit month.  Examples: \verb+1997.01+, \verb+2002:10+.

\end{itemize}
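
To make these conditions concrete, here is a small made-up file that
should satisfy them: the first column is headed \verb+obs+ and holds
quarterly dates, the remaining columns carry valid variable names, and
the values form a rectangular block.  The variable names and numbers
are purely illustrative.

\begin{code}
# hypothetical quarterly data; this comment row is ignored on import
obs,income,price
1990:1,100.0,1.00
1990:2,101.5,1.01
1990:3,103.2,1.03
1990:4,104.0,1.04
\end{code}
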

Plain text (``CSV'') files can use comma, space, tab or semicolon as
the column separator.  When you open such a file via the GUI you are
given the option of specifying the separator, though in most cases it
should be detected automatically.

If you use a spreadsheet to prepare your data you are able to carry
out various transformations of the ``raw'' data with ease (adding
things up, taking percentages or whatever): note, however, that you
can also do this sort of thing easily---perhaps more easily---within
gretl, by using the tools under the ``Add'' menu.

\subsection{Appending imported data}

You may wish to establish a dataset piece by piece, by incremental
importation of data from other sources.  This is supported via the
``File, Append data'' menu items: gretl will check the new data for
conformability with the existing dataset and, if everything seems OK,
will merge the data.  You can add new variables in this way, provided
the data frequency matches that of the existing dataset.  Or you can
append new observations for data series that are already present; in
this case the variable names must match up correctly.  Note that by
default (that is, if you choose ``Open data'' rather than ``Append
data''), opening a new data file closes the current one.
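
In a script the same operation is carried out by the \cmd{append}
command.  A minimal sketch, with hypothetical file names:

\begin{code}
open base.csv       # establishes the working dataset
append extra.csv    # merges new series or observations, if conformable
\end{code}
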
\subsection{Using the built-in spreadsheet}

Under the ``File, New data set'' menu you can choose the sort of
dataset you want to establish (e.g.\ quarterly time series,
cross-sectional).  You will then be prompted for starting and ending
dates (or observation numbers) and the name of the first variable to
add to the dataset. After supplying this information you will be faced
with a simple spreadsheet into which you can type data values.  In the
spreadsheet window, clicking the right mouse button will invoke a
popup menu which enables you to add a new variable (column), to add an
observation (append a row at the foot of the sheet), or to insert an
observation at the selected point (move the data down and insert a
blank row).

Once you have entered data into the spreadsheet you import these into
gretl's workspace using the spreadsheet's ``Apply changes''
button.

Please note that gretl's spreadsheet is quite basic and has no
support for functions or formulas.  Data transformations are done via
the ``Add'' or ``Variable'' menus in the main window.

\subsection{Selecting from a database}

Another alternative is to establish your dataset by selecting
variables from a database.

Begin with the ``File, Databases'' menu item. This has four forks:
``Gretl native'', ``RATS 4'', ``PcGive'' and ``On database server''.
You should be able to find the file \verb+fedstl.bin+ in the file
selector that opens if you choose the ``Gretl native'' option since
this file, which contains a large collection of US macroeconomic time
series, is supplied with the distribution.

You won't find anything under ``RATS 4'' unless you have purchased
RATS data.\footnote{See \href{http://www.estima.com/}{www.estima.com}}
If you do possess RATS data you should go into the ``Tools,
Preferences, General'' dialog, select the Databases tab, and fill in
the correct path to your RATS files.

If your computer is connected to the internet you should find several
databases (at Wake Forest University) under ``On database server''.
You can browse these remotely; you also have the option of installing
them onto your own computer.  The initial remote databases window has
an item showing, for each file, whether it is already installed
locally (and if so, whether the local version is up to date with the
version at Wake Forest).

Assuming you have managed to open a database you can import selected
series into gretl's workspace by using the ``Series, Import''
menu item in the database window, or via the popup menu that appears
if you click the right mouse button, or by dragging the series into
the program's main window.

\subsection{Creating a gretl data file independently}

It is possible to create a data file in one or other of gretl's own
formats using a text editor or software tools such as \app{awk},
\app{sed} or \app{perl}.  This may be a good choice if you have large
amounts of data already in machine readable form. You will, of course,
need to study these data formats (XML-based or ``traditional'') as
described in Appendix~\ref{app-datafile}.

\section{Structuring a dataset}
\label{sec:data-structure}

Once your data are read by gretl, it may be necessary to supply
some information on the nature of the data. We distinguish between
three kinds of datasets:
\begin{enumerate}
\item Cross section
\item Time series
\item Panel data
\end{enumerate}

The primary tool for doing this is the ``Data, Dataset structure''
menu entry in the graphical interface, or the \texttt{setobs} command
for scripts and the command-line interface.

\subsection{Cross sectional data}
\label{sec:cross-section-data}

By a cross section we mean observations on a set of ``units'' (which
may be firms, countries, individuals, or whatever) at a common point
in time.  This is the default interpretation for a data file: if there
is insufficient information to interpret data as time-series or panel
data, they are automatically interpreted as a cross section.  In the
unlikely event that cross-sectional data are wrongly interpreted as
time series, you can correct this by selecting the ``Data, Dataset
structure'' menu item.  Click the ``cross-sectional'' radio button in
the dialog box that appears, then click ``Forward''.  Click ``OK'' to
confirm your selection.
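
The script equivalent is the \cmd{setobs} command.  A minimal sketch,
assuming a dataset is already open:

\begin{code}
# treat the current dataset as undated cross-sectional data
setobs 1 1 --cross-section
\end{code}
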
\subsection{Time series data}
\label{sec:timeser-data}

When you import data from a spreadsheet or plain text file,
gretl will make fairly strenuous efforts to glean time-series
information from the first column of the data, if it looks at all
plausible that such information may be present.  If time-series
structure is present but not recognized, again you can use the ``Data,
Dataset structure'' menu item.  Select ``Time series'' and click
``Forward''; select the appropriate data frequency and click
``Forward'' again; then select or enter the starting observation and
click ``Forward'' once more.  Finally, click ``OK'' to confirm the
time-series interpretation if it is correct (or click ``Back'' to make
adjustments if need be).
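
In a script, these steps reduce to a single \cmd{setobs} call that
gives the data frequency and the starting observation.  The dates
below are examples only.

\begin{code}
# quarterly data starting in the first quarter of 1990
setobs 4 1990:1 --time-series
# alternative: monthly data starting in January 1990
# setobs 12 1990:01 --time-series
# alternative: annual data starting in 1950
# setobs 1 1950 --time-series
\end{code}
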

Besides the basic business of getting a data set interpreted as time
series, further issues may arise relating to the frequency of
time-series data.  In a gretl time-series data set, all the series
must have the same frequency.  Suppose you wish to make a combined
dataset using series that, in their original state, are not all of the
same frequency.  For example, some series are monthly and some are
quarterly.

Your first step is to formulate a strategy: Do you want to end up with
a quarterly or a monthly data set?  A basic point to note here is
that ``compacting'' data from a higher frequency (e.g.\ monthly) to
a lower frequency (e.g.\ quarterly) is usually unproblematic.  You
lose information in doing so, but in general it is perfectly
legitimate to take (say) the average of three monthly observations to
create a quarterly observation.  On the other hand, ``expanding'' data
from a lower to a higher frequency is not, in general, a valid
operation.

In most cases, then, the best strategy is to start by creating a data
set of the \textit{lower} frequency, and then to compact the higher
frequency data to match.  When you import higher-frequency data from a
database into the current data set, you are given a choice of
compaction method (average, sum, start of period, or end of period).
In most instances ``average'' is likely to be appropriate.
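
A sketch of this strategy in script form follows.  It first sets up an
empty quarterly dataset and then imports a monthly series from a
database, compacting it by averaging.  The series name
\texttt{unrate} and the \option{compact} option value are assumptions
on our part; check the help for the \cmd{data} command in your version.

\begin{code}
nulldata 80                      # empty dataset with 80 observations
setobs 4 1990:1 --time-series    # declare it quarterly, from 1990:1
open fedstl.bin                  # open the database
data unrate --compact=average    # monthly values averaged to quarterly
\end{code}
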

You \textit{can} also import lower-frequency data into a
high-frequency data set, but this is generally not recommended.  What
gretl does in this case is simply replicate the values of the
lower-frequency series as many times as required. For example, suppose
we have a quarterly series with the value 35.5 in 1990:1, the first
quarter of 1990.  On expansion to monthly, the value 35.5 will be
assigned to the observations for January, February and March of 1990.
The expanded variable is therefore useless for fine-grained
time-series analysis, outside of the special case where you know that
the variable in question does in fact remain constant over the
sub-periods.

When the current data frequency is appropriate, gretl offers
both ``Compact data'' and ``Expand data'' options under the ``Data''
menu.  These options operate on the whole data set, compacting or
expanding all series.  They should be considered ``expert'' options
and should be used with caution.

\subsection{Panel data}
\label{sec:panel-data}

Panel data are inherently three dimensional---the dimensions being
variable, cross-sectional unit, and time-period.  For example, a
particular number in a panel data set might be identified as the
observation on capital stock for General Motors in 1980.  (A note on
terminology: we use the terms ``cross-sectional unit'', ``unit'' and
``group'' interchangeably below to refer to the entities that compose
the cross-sectional dimension of the panel.  These might, for
instance, be firms, countries or persons.)

For representation in a textual computer file (and also for gretl's
internal calculations) the three dimensions must somehow be flattened
into two.  This ``flattening'' involves taking layers of the data that
would naturally stack in a third dimension, and stacking them in the
vertical dimension.

Gretl always expects data to be arranged ``by observation'',
that is, such that each row represents an observation (and each
variable occupies one and only one column).  In this context the
flattening of a panel data set can be done in either of two ways:

\begin{itemize}
\item Stacked time series: the successive vertical blocks each
  comprise a time series for a given unit.
\item Stacked cross sections: the successive vertical blocks each
  comprise a cross-section for a given period.
\end{itemize}

You may input data in whichever arrangement is more convenient.
Internally, however, gretl always stores panel data in
the form of stacked time series.
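
As an illustration, consider a made-up panel with two units (A and B)
observed in three periods.  The two arrangements order the rows of a
single variable \varname{x} as shown below, where an entry such as
\texttt{x\_A2} stands for the value for unit A in period 2.

\begin{code}
row   stacked time series   stacked cross sections
 1          x_A1                    x_A1
 2          x_A2                    x_B1
 3          x_A3                    x_A2
 4          x_B1                    x_B2
 5          x_B2                    x_A3
 6          x_B3                    x_B3
\end{code}
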
\section{Panel data specifics}
\label{sec:more-panel}

When you import panel data into gretl from a spreadsheet or
comma separated format, the panel nature of the data will not be
recognized automatically (most likely the data will be treated as
``undated'').  A panel interpretation can be imposed on the data
using the graphical interface or via the \cmd{setobs} command.

In the graphical interface, use the menu item ``Data, Dataset
structure''.  In the first dialog box that appears, select ``Panel''.
In the next dialog you have a three-way choice.  The first two
options, ``Stacked time series'' and ``Stacked cross sections'' are
applicable if the data set is already organized in one of these two
ways.  If you select either of these options, the next step is to
specify the number of cross-sectional units in the data set.  The
third option, ``Use index variables'', is applicable if the data set
contains two variables that index the units and the time periods
respectively; the next step is then to select those variables.  For
example, a data file might contain a country code variable and a
variable representing the year of the observation.  In that case
gretl can reconstruct the panel structure of the data regardless
of how the observation rows are organized.

The \cmd{setobs} command has options that parallel those in the
graphical interface.  If suitable index variables are available
you can do, for example
%
\begin{code}
setobs unitvar timevar --panel-vars
\end{code}
%
where \texttt{unitvar} is a variable that indexes the units and
\texttt{timevar} is a variable indexing the periods.  Alternatively
you can use the form \verb+setobs+ \textsl{freq} \verb+1:1+
\textsl{structure}, where \textsl{freq} is replaced by the ``block
size'' of the data (that is, the number of periods in the case of
stacked time series, or the number of units in the case of stacked
cross-sections) and \textsl{structure} is either \option{stacked-time-series}
or \option{stacked-cross-section}.  Two examples are given below: the
first is suitable for a panel in the form of stacked time series with
observations from 20 periods; the second for stacked cross sections
with 5 units.
%
\begin{code}
setobs 20 1:1 --stacked-time-series
setobs 5 1:1 --stacked-cross-section
\end{code}
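
To spell out what the block size means, suppose there are $N$ units
and $T$ periods (our notation for this illustration).  The observation
for unit $i$ in period $t$ then occupies row
\[
r = (i-1)T + t \quad \mbox{(stacked time series)}, \qquad
r = (t-1)N + i \quad \mbox{(stacked cross sections)},
\]
so the block size passed to \cmd{setobs} is $T$ in the first case and
$N$ in the second.
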
\subsection{Panel data arranged by variable}

Publicly available panel data sometimes come arranged ``by variable.''
Suppose we have data on two variables, \varname{x1} and \varname{x2},
for each of 50 states in each of 5 years (giving a total of 250
observations per variable).  One textual representation of such a data
set would start with a block for \varname{x1}, with 50 rows
corresponding to the states and 5 columns corresponding to the years.
This would be followed, vertically, by a block with the same structure
for variable \varname{x2}.  A fragment of such a data file is shown
below, with quinquennial observations 1965--1985.  Imagine the table
continued for 48 more states, followed by another 50 rows for variable
\varname{x2}.

\begin{center}
  \begin{tabular}{rrrrrr}
  \varname{x1} \\
     & 1965 & 1970 & 1975 & 1980 & 1985 \\
  AR & 100.0 & 110.5 & 118.7 & 131.2 & 160.4\\
  AZ & 100.0 & 104.3 & 113.8 & 120.9 & 140.6\\
  \end{tabular}
\end{center}

If a datafile with this sort of structure is read into
gretl,\footnote{Note that you will have to modify such a
  datafile slightly before it can be read at all.  The line containing
  the variable name (in this example \varname{x1}) will have to be
  removed, and so will the initial row containing the years,
  otherwise they will be taken as numerical data.}  the program
will interpret the columns as distinct variables, so the data will not
be usable ``as is.''  But there is a mechanism for correcting the
situation, namely the \cmd{stack} function.

Consider the first data column in the fragment above: the first 50 rows
of this column constitute a cross-section for the variable \varname{x1}
in the year 1965.  If we could create a new series by stacking the
first 50 entries in the second column underneath the first 50 entries
in the first, we would be on the way to making a data set ``by
observation'' (in the second of the two forms mentioned above, stacked
cross-sections).  That is, we'd have a column comprising a
cross-section for \varname{x1} in 1965, followed by a cross-section for
the same variable in 1970.

The following gretl script illustrates how we can accomplish the
stacking, for both \varname{x1} and \varname{x2}.  We assume
that the original data file is called \texttt{panel.txt}, and that in
this file the columns are headed with ``variable names'' \varname{v1},
\varname{v2}, \dots, \varname{v5}.  (The columns are not really
variables, but in the first instance we ``pretend'' that they are.)

\begin{code}
open panel.txt
series x1 = stack(v1..v5, 50)
series x2 = stack(v1..v5, 50, 50)
setobs 50 1:1 --stacked-cross-section
store panel.gdt x1 x2
\end{code}

The second and third lines illustrate the syntax of the \cmd{stack}
function, which takes up to three arguments.  The double dots in the
first argument indicate a range of variables to be stacked: here we
want to stack all 5 columns (for all 5 years). More generally, you can
define a named list of series and pass that as the first argument to
\texttt{stack} (see chapter~\ref{chap:lists-strings}). In this
example we're supposing that the full data set contains 100 rows, and
that in the stacking of variable \varname{x1} we wish to read only the
first 50 rows from each column: we achieve this by adding \texttt{50}
as a second (\texttt{length}) argument.

On line 3 we do the stacking for variable \varname{x2}.  Again we want
a \texttt{length} of 50 for the components of the stacked series, but
this time we want to skip the first 50 rows of the original data and
start reading from row 51, and so we add a third \texttt{offset}
argument of 50.  The signature of the \texttt{stack} function is shown
below; the second and third arguments are optional, defaulting to
``automatic'' and 0 respectively.
\begin{code}
series stack(list L, int length n, int offset k)
\end{code}
Line 4 then imposes a panel interpretation on the data. Finally, we
save the stacked data to file, with the panel interpretation.
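
The following self-contained sketch may help in checking how the
\texttt{length} and \texttt{offset} arguments select rows.  It builds
a tiny artificial dataset instead of reading a file; the series names
and values are made up, and the results noted in the comments are what
we would expect from the description above.

\begin{code}
nulldata 4
series v1 = {1; 2; 3; 4}
series v2 = {11; 12; 13; 14}
list L = v1 v2
# rows 1-2 of v1 followed by rows 1-2 of v2: expect 1, 2, 11, 12
series a = stack(L, 2)
# skip 2 rows, then take 2 rows from each: expect 3, 4, 13, 14
series b = stack(L, 2, 2)
print a b --byobs
\end{code}
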

The illustrative script above is appropriate when the number of
variables to be processed is small.  When there are many variables in
the dataset it's more convenient to use a loop to accomplish the
stacking, as shown in the following script.  The setup is presumed to
be the same as in the previous case (50 units, 5 periods), but with 20
variables rather than 2.

\begin{code}
open panel.txt
list L = v1..v5 # predefine a list of series
scalar length = 50
loop i=1..20 --quiet
  scalar offset = (i - 1) * length
  series x$i = stack(L, length, offset)
endloop
setobs 50 1:1 --stacked-cross-section
store panel.gdt x1..x20
\end{code}

\subsection{Side-by-side time series}

There's a second sort of data that you may wish to convert to gretl's
panel format, namely side-by-side time series for a number of
cross-sectional units. For example, a data file might contain separate
GDP series of common length $T$ for each of $N$ countries. To turn
these into a single stacked time series the \texttt{stack} function
can again be used. An example follows, where we suppose the original
data source is a comma-separated file named \texttt{GDP.csv},
containing GDP data for countries from Austria (\texttt{GDP\_AT}) to
Zimbabwe (\texttt{GDP\_ZW}) in consecutive columns.

\begin{code}
open GDP.csv
scalar T = $nobs # the number of periods
list L = GDP_AT..GDP_ZW
series GDP = stack(L, T)
setobs T 1:01 --stacked-time-series
store panel.gdt GDP
\end{code}

The resulting data file, \texttt{panel.gdt}, will contain a single
series of length $NT$ where $N$ is the number of countries and
$T$ is the length of the original dataset. One could insert revised
variants of lines 3 and 4 of the script if the original file contained
additional side-by-side per-country series for investment, consumption
or whatever.
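
For instance, if the same file also held investment series named
\texttt{INV\_AT} to \texttt{INV\_ZW} in consecutive columns (these
names are hypothetical), the script might be extended along the
following lines.  Note that \texttt{T} is recorded before any stacking
is done, so that the same block length is used for both variables.

\begin{code}
open GDP.csv
scalar T = $nobs # the number of periods
list L = GDP_AT..GDP_ZW
series GDP = stack(L, T)
list LI = INV_AT..INV_ZW # hypothetical investment columns
series INV = stack(LI, T)
setobs T 1:01 --stacked-time-series
store panel.gdt GDP INV
\end{code}
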
\subsection{Panel data marker strings}

It can be helpful with panel data to have the observations identified
by mnemonic markers.  A special function in the \texttt{genr} command
is available for this purpose.

In the example under the heading ``Panel data arranged by variable''
above, suppose all the states are identified by two-letter codes in
the left-most column of the original datafile.  When the
\texttt{stack} function is invoked as shown, these codes will be
stacked along with the data values.  If the first row is marked
\texttt{AR} for Arkansas, then the marker \texttt{AR} will end up
being shown on each row containing an observation for Arkansas.
That's all very well, but these markers don't tell us anything about
the date of the observation.  To rectify this we could do:

\begin{code}
genr time
series year = 1960 + (5 * time)
genr markers = "%s:%d", marker, year
\end{code}

The first line generates a 1-based index representing the period of
each observation, and the second line uses the \texttt{time} variable
to generate a variable representing the year of the observation.  The
third line contains this special feature: if (and only if) the name of
the new ``variable'' to generate is \texttt{markers}, the portion of
the command following the equals sign is taken as a C-style format
string (which must be wrapped in double quotes), followed by a
comma-separated list of arguments.  The arguments will be printed
according to the given format to create a new set of observation
markers.  Valid arguments are either the names of variables in the
dataset, or the string \texttt{marker} which denotes the pre-existing
observation marker.  The format specifiers which are likely to be
useful in this context are \texttt{\%s} for a string and \texttt{\%d}
for an integer.  Strings can be truncated: for example \texttt{\%.3s}
will use just the first three characters of the string.  To chop
initial characters off an existing observation marker when
constructing a new one, you can use the syntax \texttt{marker + n},
where \texttt{n} is a positive integer: in that case the first
\texttt{n} characters will be skipped.

After the commands above are processed, then, the observation markers
will look like, for example, \texttt{AR:1965}, where the two-letter
state code and the year of the observation are spliced together with a
colon.
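
Two further variations are sketched below, assuming the series
\texttt{year} and markers of the form \texttt{AR:1965} created by the
commands above.  Going by the rules just described, the first command
should produce markers such as \texttt{AR-1965}, and the second,
applied afterwards, should reduce them to \texttt{1965}.

\begin{code}
# keep only the first two characters of the existing marker
genr markers = "%.2s-%d", marker, year
# skip the first three characters of the existing marker
genr markers = "%s", marker + 3
\end{code}
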
  598 \subsection{Panel dummy variables}
  599 \label{panel-dummies}
  601 In a panel study you may wish to construct dummy variables of one or
  602 both of the following sorts: (a) dummies as unique identifiers for the
  603 units or groups, and (b) dummies as unique identifiers for the time
  604 periods.  The former may be used to allow the intercept of the
  605 regression to differ across the units, the latter to allow the
  606 intercept to differ across periods.
  608 Two special functions are available to create such dummies.  These are
  609 found under the ``Add'' menu in the GUI, or under the \cmd{genr}
  610 command in script mode or \app{gretlcli}.
  612 \begin{enumerate}
  613 \item ``unit dummies'' (script command \cmd{genr unitdum}).  This
  614   command creates a set of dummy variables identifying the
  615   cross-sectional units.  The variable \verb+du_1+ will have value 1
  616   in each row corresponding to a unit 1 observation, 0 otherwise;
  617   \verb+du_2+ will have value 1 in each row corresponding to a unit 2
  618   observation, 0 otherwise; and so on.
  619 \item ``time dummies'' (script command \cmd{genr timedum}).  This
  620   command creates a set of dummy variables identifying the periods.
  621   The variable \verb+dt_1+ will have value 1 in each row
  622   corresponding to a period 1 observation, 0 otherwise; \verb+dt_2+
  623   will have value 1 in each row corresponding to a period 2
  624   observation, 0 otherwise; and so on.
  625 \end{enumerate}
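
As a minimal sketch of how the unit dummies might be used, the
following script generates them, collects them in a list, and includes
them in a pooled regression.  The series names \varname{y} and
\varname{x} and the list name \varname{UD} are hypothetical, chosen
only for illustration; one dummy is dropped by hand so that the set is
not collinear with the constant.
%
\begin{code}
# create du_1, du_2, ... for the cross-sectional units
genr unitdum
# gather the dummies in a named list (UD is an arbitrary name)
list UD = du_*
# omit the first dummy to avoid collinearity with the constant
UD -= du_1
# y and x are assumed to exist in the panel dataset
ols y const x UD
\end{code}
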
  627 If a panel data set has the \verb+YEAR+ of the observation entered as
  628 one of the variables you can create a periodic dummy to pick out a
  629 particular year, e.g.\ \cmd{genr dum = (YEAR==1960)}.  You can also
  630 create periodic dummy variables using the modulus operator,
  631 \verb+%+.  For instance, to create a dummy with
  632 value 1 for the first observation and every thirtieth observation
  633 thereafter, 0 otherwise, do
  634 %
  635 \begin{code}
  636 genr index 
  637 series dum = ((index-1) % 30) == 0
  638 \end{code}
  640 \subsection{Lags, differences, trends}
  641 \label{panel-lagged}
  643 If the time periods are evenly spaced you may want to use lagged
  644 values of variables in a panel regression (but see also
  645 chapter~\ref{chap:dpanel}); you may also wish to construct first
  646 differences of variables of interest.
  648 Once a dataset is identified as a panel, gretl will handle the
  649 generation of such variables correctly.  For example the command
  650 \verb+genr x1_1 = x1(-1)+ will create a variable that contains the
  651 first lag of \verb+x1+ where available, and the missing value code
  652 where the lag is not available (e.g.\ at the start of the time series
  653 for each group).  When you run a regression using such variables, the
  654 program will automatically skip the missing observations.
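
The following sketch illustrates the point, assuming a panel dataset
that contains series named \varname{y} and \varname{x1} (hypothetical
names).  It creates a first lag and a first difference, then uses both
in a regression.
%
\begin{code}
# first lag of x1: missing at the first period of each unit
series x1_1 = x1(-1)
# first difference of x1: also missing at the first period of each unit
series d_x1 = diff(x1)
# observations with missing values are skipped automatically
ols y const x1_1 d_x1
\end{code}
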
  656 When a panel data set has a fairly substantial time dimension, you may
  657 wish to include a trend in the analysis.  The command \cmd{genr time} 
  658 creates a variable named \varname{time} which runs from 1 to $T$ for
  659 each unit, where $T$ is the length of the time-series dimension of the
  660 panel.  If you want to create an index that runs consecutively from 1
  661 to $m\times T$, where $m$ is the number of units in the panel, use
  662 \cmd{genr index}.
  664 \subsection{Basic statistics by unit}
  665 \label{panel-stats}
  667 gretl contains functions which can be used to generate basic
  668 descriptive statistics for a given variable, on a per-unit basis;
  669 these are \texttt{pnobs()} (number of valid cases), \texttt{pmin()}
  670 and \texttt{pmax()} (minimum and maximum) and \texttt{pmean()} and
  671 \texttt{psd()} (mean and standard deviation).
  673 As a brief illustration, suppose we have a panel data set comprising 8
  674 time-series observations on each of $N$ units or groups.  Then the
  675 command
  676 %
  677 \begin{code}
  678 series pmx = pmean(x)
  679 \end{code}
  680 %
  681 creates a series of this form: the first 8 values (corresponding to
  682 unit 1) contain the mean of \varname{x} for unit 1, the next 8 values
  683 contain the mean for unit 2, and so on.  The \texttt{psd()} function
  684 works in a similar manner.  The sample standard deviation for group
  685 $i$ is computed as
  686 \[
  687 s_i = \sqrt{\frac{\sum(x-\bar{x}_i)^2}{T_i-1}}
  688 \]
  689 where $T_i$ denotes the number of valid observations on \varname{x}
  690 for the given unit, $\bar{x}_i$ denotes the group mean, and the
  691 summation is across valid observations for the group.  If $T_i < 2$,
  692 however, the standard deviation is recorded as 0.
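
The other per-unit functions are used in the same way.  As a brief
sketch, again assuming a panel series named \varname{x} (the names on
the left-hand side are arbitrary):
%
\begin{code}
# number of valid observations on x, per unit
series n_x = pnobs(x)
# per-unit minimum and maximum of x
series lo_x = pmin(x)
series hi_x = pmax(x)
# per-unit range, constant within each unit
series range_x = hi_x - lo_x
\end{code}
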
  694 One particular use of \texttt{psd()} may be worth noting.  If you want
  695 to form a sub-sample of a panel that contains only those units for
  696 which the variable \varname{x} is time-varying, you can either use 
  697 %
  698 \begin{code}
  699 smpl pmin(x) < pmax(x) --restrict
  700 \end{code}
  701 or
  702 %
  703 \begin{code}
  704 smpl psd(x) > 0 --restrict
  705 \end{code}
  707 \section{Missing data values}
  708 \label{missing-data}
  710 \subsection{Representation and handling}
  712 Missing values are represented internally as \verb+NaN+ (``not a
  713 number''), as defined in the IEEE 754 floating-point standard. In a
  714 native-format data file they should be represented as \verb+NA+. When
  715 importing CSV data gretl accepts several common representations of
  716 missing values including $-$999, the string \verb+NA+ (in upper or
  717 lower case), a single dot, or simply a blank cell.  Blank cells
  718 should, of course, be properly delimited, e.g.\ \verb+120.6,,5.38+, in
  719 which the middle value is presumed missing.
  721 As for handling of missing values in the course of statistical
  722 analysis, gretl does the following:
  724 \begin{itemize}
  725 \item In calculating descriptive statistics (mean, standard deviation,
  726   etc.) under the \cmd{summary} command, missing values are simply
  727   skipped and the sample size adjusted appropriately.
  728 \item In running regressions gretl first adjusts the beginning
  729   and end of the sample range, truncating the sample if need be.
  730   Missing values at the beginning of the sample are common in time
  731   series work due to the inclusion of lags, first differences and so
  732   on; missing values at the end of the range are not uncommon due to
  733   differential updating of series and possibly the inclusion of leads.
  734 \end{itemize}
  736 If gretl detects any missing values ``inside'' the (possibly
  737 truncated) sample range for a regression, the result depends on the
  738 character of the dataset and the estimator chosen.  In many cases, the
  739 program will automatically skip the missing observations when
  740 calculating the regression results.  In this situation a message is
  741 printed stating how many observations were dropped.  On the other
  742 hand, the skipping of missing observations is not supported for all
  743 procedures: exceptions include all autoregressive estimators, system
  744 estimators such as SUR, and nonlinear least squares.  In the case of
  745 panel data, the skipping of missing observations is supported only if
  746 their omission leaves a balanced panel. If missing observations are
  747 found in cases where they are not supported, gretl gives an
  748 error message and refuses to produce estimates.
  750 \subsection{Manipulating missing values}
  751 \label{sec:genr-missing}
  753 Some special functions are available for the handling of missing
  754 values.  The Boolean function \verb+missing()+ takes the name of a
  755 variable as its single argument; it returns a series with value 1 for
  756 each observation at which the given variable has a missing value, and
  757 value 0 otherwise (that is, if the given variable has a valid value at
  758 that observation).  The function \verb+ok()+ is complementary to
  759 \verb+missing+; it is just a shorthand for \verb+!missing+ (where
  760 \verb+!+ is the Boolean NOT operator).  For example, one can count the
  761 missing values for variable \verb+x+ using
  763 \begin{code}
  764 scalar nmiss_x = sum(missing(x))
  765 \end{code}
  767 The function \verb+zeromiss()+, which again takes a single series as
  768 its argument, returns a series where all zero values are set to the
  769 missing code.  This should be used with caution---one does not want to
  770 confuse missing values and zeros---but it can be useful in some
  771 contexts.  For example, one can determine the first valid observation
  772 for a variable \verb+x+ using
  774 \begin{code}
  775 genr time
  776 scalar x0 = min(zeromiss(time * ok(x)))
  777 \end{code}
  779 The function \verb+misszero()+ does the opposite of \verb+zeromiss+,
  780 that is, it converts all missing values to zero.
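
The following sketch puts these functions together for a series
\varname{x} (the names on the left-hand side are arbitrary).  The last
line shows why caution is needed: converting zeros back to missing
values does not recover the original series if \varname{x} contained
any genuine zeros.
%
\begin{code}
# count the valid (non-missing) observations on x
scalar nvalid_x = sum(ok(x))
# replace missing values with zeros
series x_filled = misszero(x)
# convert all zeros to missing: any true zeros in x are lost too
series x_back = zeromiss(x_filled)
\end{code}
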
  782 If missing values get involved in calculations, they propagate
  783 according to the IEEE rules: notably, if one of the operands to an
  784 arithmetical operation is a \texttt{NaN}, the result will also be
  785 \texttt{NaN}.
  787 \section{Maximum size of data sets}
  788 \label{data-limits}
  790 Basically, the size of data sets (both the number of variables and the
  791 number of observations per variable) is limited only by the
  792 characteristics of your computer.  Gretl allocates memory
  793 dynamically, and will ask the operating system for as much memory as
  794 your data require.  Obviously, then, you are ultimately limited by the
  795 size of RAM.
  797 Aside from the multiple-precision OLS option, gretl uses
  798 double-precision floating-point numbers throughout.  The size of such
  799 numbers in bytes depends on the computer platform, but is typically
  800 eight.  To give a rough notion of magnitudes, suppose we have a data
  801 set with 10,000 observations on 500 variables.  That's 5 million
  802 floating-point numbers or 40 million bytes.  If we define the megabyte
  803 (MB) as $1024 \times 1024$ bytes, as is standard in talking about RAM,
  804 it's slightly over 38 MB.  The program needs additional memory for
  805 workspace, but even so, handling a data set of this size should be
  806 quite feasible on a current PC, which at the time of writing is likely
  807 to have at least 256 MB of RAM.  
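
The arithmetic above can be reproduced in a short script.  This is
just a sketch, with arbitrary variable names, which assumes eight
bytes per value as in the text; the result is roughly 38.15 MB.
%
\begin{code}
scalar nobs = 10000
scalar nvars = 500
# eight bytes per double-precision value (assumed, as in the text)
scalar bytes = nobs * nvars * 8
# megabyte defined as 1024 x 1024 bytes
scalar mb = bytes / (1024 * 1024)
printf "%d values, %d bytes, %.2f MB\n", nobs * nvars, bytes, mb
\end{code}
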
  809 If RAM is not an issue, there is one further limitation on data size
  810 (though it's very unlikely to be a binding constraint).  That is,
  811 variables and observations are indexed by signed integers, and on a
  812 typical PC these will be 32-bit values, capable of representing
  813 a maximum positive value of $2^{31} - 1 = 2,147,483,647$.
  815 The limits mentioned above apply to gretl's ``native''
  816 functionality.  There are tighter limits with regard to two
  817 third-party programs that are available as add-ons to gretl for
  818 certain sorts of time-series analysis including seasonal adjustment,
  819 namely \app{TRAMO/SEATS} and \app{X-12-ARIMA}.  These programs employ
  820 a fixed-size memory allocation, and can't handle series of more than
  821 600 observations.
  824 \section{Data file collections}
  825 \label{collections}
  827 If you're using gretl in a teaching context you may be
  828 interested in adding a collection of data files and/or scripts that
  829 relate specifically to your course, in such a way that students can
  830 browse and access them easily.
There are two ways to access such collections of files:
  834 \begin{itemize}
  835 \item For data files: select the menu item ``File, Open data, Sample
  836   file'', or click on the folder icon on the gretl toolbar.
  837 \item For script files: select the menu item ``File, Script
  838   files, Practice file''.
  839 \end{itemize}
  841 When a user selects one of the items:
  843 \begin{itemize}
  844 \item The data or script files included in the gretl distribution are
  845   automatically shown (this includes files relating to Ramanathan's
  846   \emph{Introductory Econometrics} and Greene's \emph{Econometric
  847     Analysis}).
  848 \item The program looks for certain known collections of data files
  849   available as optional extras, for instance the datafiles from
  850   various econometrics textbooks (Davidson and MacKinnon, Gujarati,
  851   Stock and Watson, Verbeek, Wooldridge) and the Penn World Table (PWT
  852   5.6).  (See \href{http://gretl.sourceforge.net/gretl_data.html}{the
  853     data page} at the gretl website for information on these
  854   collections.)  If the additional files are found, they are added to
  855   the selection windows.
  856 \item The program then searches for valid file collections (not
  857   necessarily known in advance) in these places: the ``system'' data
  858   directory, the system script directory, the user directory, and all
  859   first-level subdirectories of these.  For reference, typical values
  860   for these directories are shown in Table~\ref{tab-colls}.  (Note that
  861   \texttt{PERSONAL} is a placeholder that is expanded by Windows,
  862   corresponding to ``My Documents'' on English-language systems.)
  863 \end{itemize}
  865 \begin{table}[htbp]
  866   \begin{center}
  867     \begin{tabular}{lll}
  868       & \multicolumn{1}{c}{\textit{Linux}} & 
  869       \multicolumn{1}{c}{\textit{MS Windows}} \\
  870         system data dir & 
  871         {\small \verb+/usr/share/gretl/data+} &
  872         {\small \verb+c:\Program Files\gretl\data+} \\
  873         system script dir & 
  874         {\small \verb+/usr/share/gretl/scripts+} &
  875         {\small \verb+c:\Program Files\gretl\scripts+} \\
  876         user dir & 
  877         {\small \verb+$HOME/gretl+} &
  878         {\small \verb+PERSONAL\gretl+}\\
  879   \end{tabular}
  880  \end{center}
  881  \caption{Typical locations for file collections}
  882  \label{tab-colls}
  883 \end{table}
  885 Any valid collections will be added to the selection windows. So what
constitutes a valid file collection?  It comprises either a set of
  887 data files in gretl XML format (with the \verb+.gdt+ suffix) or
  888 a set of script files containing gretl commands (with \verb+.inp+
  889 suffix), in each case accompanied by a ``master file'' or catalog.
  890 The gretl distribution contains several example catalog files,
  891 for instance the file \verb+descriptions+ in the \verb+misc+
  892 sub-directory of the gretl data directory and
  893 \verb+ps_descriptions+ in the \verb+misc+ sub-directory of the scripts
  894 directory.
If you are adding your own collection, data catalogs should be named
\verb+descriptions+ and script catalogs should be named
\verb+ps_descriptions+.  In each case the catalog should be placed
  899 (along with the associated data or script files) in its own specific
  900 sub-directory (e.g.\ \url{/usr/share/gretl/data/mydata} or
  901 \verb+c:\userdata\gretl\data\mydata+).
  903 The catalog files are plain text; if they contain non-ASCII characters
  904 they must be encoded as UTF-8. The syntax of such files is
  905 straightforward.  Here, for example, are the first few lines of
  906 gretl's ``misc'' data catalog:
  908 \begin{code}
  909 # Gretl: various illustrative datafiles
  910 "arma","artificial data for ARMA script example"
  911 "ects_nls","Nonlinear least squares example"
  912 "hamilton","Prices and exchange rate, U.S. and Italy"
  913 \end{code}
  915 The first line, which must start with a hash mark, contains a short
  916 name, here ``Gretl'', which will appear as the label for this
  917 collection's tab in the data browser window, followed by a colon,
  918 followed by an optional short description of the collection.
  920 Subsequent lines contain two elements, separated by a comma and
  921 wrapped in double quotation marks.  The first is a datafile name
  922 (leave off the \verb+.gdt+ suffix here) and the second is a short
  923 description of the content of that datafile.  There should be one such
  924 line for each datafile in the collection.
  926 A script catalog file looks very similar, except that there are three
  927 fields in the file lines: a filename (without its \verb+.inp+ suffix),
  928 a brief description of the econometric point illustrated in the
  929 script, and a brief indication of the nature of the data used.  Again,
  930 here are the first few lines of the supplied ``misc'' script catalog:
  932 \begin{code}
  933 # Gretl: various sample scripts
  934 "arma","ARMA modeling","artificial data"
  935 "ects_nls","Nonlinear least squares (Davidson)","artificial data"
  936 "leverage","Influential observations","artificial data"
  937 "longley","Multicollinearity","US employment"
  938 \end{code}
  940 If you want to make your own data collection available to users, these
  941 are the steps:
  943 \begin{enumerate}
  944 \item Assemble the data, in whatever format is convenient.
  945 \item Convert the data to gretl format and save as \verb+gdt+
  946   files.  It is probably easiest to convert the data by importing them
  947   into the program from plain text, CSV, or a spreadsheet format (MS
  948   Excel or Gnumeric) then saving them. You may wish to add
  949   descriptions of the individual variables (the ``Variable, Edit
  950   attributes'' menu item), and add information on the source of the
  951   data (the ``Data, Edit info'' menu item).
  952 \item Write a descriptions file for the collection using a text
  953   editor.
  954 \item Put the datafiles plus the descriptions file in a subdirectory
  955   of the gretl data directory (or user directory).
  956 \item If the collection is to be distributed to other people, package
  957   the data files and catalog in some suitable manner, e.g.\ as a
  958   zipfile.
  959 \end{enumerate}
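
The conversion in step 2 can also be scripted rather than done via the
menus.  The following is a minimal sketch; the file name
\texttt{mydata.csv}, the series name \varname{price} and the
description text are all hypothetical.
%
\begin{code}
# import the raw data from CSV
open mydata.csv
# attach a description to one of the series
setinfo price --description="Sale price, US dollars"
# save in gretl's native format
store mydata.gdt
\end{code}
%
A matching \verb+descriptions+ file for step 3 might then look like
this (the collection name ``MyCourse'' and the description are again
invented for illustration):
%
\begin{code}
# MyCourse: data files for a hypothetical course
"mydata","Hypothetical sales data"
\end{code}
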
  961 If you assemble such a collection, and the data are not proprietary, we
  962 would encourage you to submit the collection for packaging as a
  963 gretl optional extra.
  965 \section{Assembling data from multiple sources}
  967 In many contexts researchers need to bring together data from multiple
  968 source files, and in some cases these sources are not organized such
  969 that the data can simply be ``stuck together'' by appending rows or
  970 columns to a base dataset. In gretl, the \texttt{join} command
  971 can be used for this purpose; this command is discussed in detail in
  972 chapter~\ref{chap:join}.
  975 %%% Local Variables: 
  976 %%% mode: latex
  977 %%% TeX-master: "gretl-guide"
  978 %%% End: