"Fossies" - the Fresh Open Source Software Archive

% Source: gretl-2020e/doc/tex/sampling.tex (17 May 2020), from the gretl-2020e source package

\chapter{Sub-sampling a dataset}
\label{chap:sampling}

\section{Introduction}
\label{sample-intro}

Sub-sampling a dataset can raise some subtle issues; this chapter
attempts to explain them.

A sub-sample may be defined in relation to a full dataset in two
different ways: we will refer to these as ``setting'' the sample and
``restricting'' the sample; these methods are discussed in
sections~\ref{sec:sample-set} and~\ref{sec:sample-restrict}
respectively. In addition section~\ref{sec:smpl-panel} discusses some
special issues relating to panel data, and
section~\ref{sec:resampling} covers resampling with replacement,
which is useful in the context of bootstrapping test statistics.

The following discussion focuses on the command-line approach. But you
can also invoke the methods outlined here via the items under the
\textsf{Sample} menu in the GUI program.


\section{Setting the sample}
\label{sec:sample-set}

By ``setting'' the sample we mean defining a sub-sample simply by
means of adjusting the starting and/or ending point of the current
sample range.  This is likely to be most relevant for time-series
data.  For example, one has quarterly data from 1960:1 to 2003:4, and
one wants to run a regression using only data from the 1970s.  A
suitable command is then

\begin{code}
smpl 1970:1 1979:4
\end{code}

Or one wishes to set aside a block of observations at the end of the
data period for out-of-sample forecasting.  In that case one might do

\begin{code}
smpl ; 2000:4
\end{code}

where the semicolon is shorthand for ``leave the starting observation
unchanged''.  (The semicolon may also be used in place of the second
parameter, to mean that the ending observation should be unchanged.)
By ``unchanged'' here, we mean unchanged relative to the last
\verb+smpl+ setting, or relative to the full dataset if no sub-sample
has been defined up to this point. For example, after

\begin{code}
smpl 1970:1 2003:4
smpl ; 2000:4
\end{code}

the sample range will be 1970:1 to 2000:4.

An incremental or relative form of setting the sample range is also
supported.  In this case a relative offset should be given, in the
form of a signed integer (or a semicolon to indicate no change), for
both the starting and ending point. For example

\begin{code}
smpl +1 ;
\end{code}

will advance the starting observation by one while preserving the
ending observation, and

\begin{code}
smpl +2 -1
\end{code}

will both advance the starting observation by two and retard the
ending observation by one.

An important feature of ``setting'' the sample as described above is
that it necessarily results in the selection of a subset of
observations that are contiguous in the full dataset. The structure of
the dataset is therefore unaffected (for example, if it is a quarterly
time series before setting the sample, it remains a quarterly time
series afterwards).
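
As a minimal sketch of how absolute and relative settings combine,
assume the quarterly dataset mentioned above, running from 1960:1 to
2003:4 (the particular offset is chosen purely for illustration):

\begin{code}
# absolute setting: the 1970s
smpl 1970:1 1979:4
# relative setting: drop the first four quarters, keep the end point
smpl +4 ;
# the sample range is now 1971:1 to 1979:4
smpl --full
# the full range, 1960:1 to 2003:4, is restored
\end{code}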

\section{Restricting the sample}
\label{sec:sample-restrict}

By ``restricting'' the sample we mean selecting observations on the
basis of some Boolean (logical) criterion, or by means of a random
number generator.  This is likely to be most relevant for
cross-sectional or panel data.

Suppose we have data on a cross-section of individuals, recording
their gender, income and other characteristics.  We wish to select for
analysis only the women.  If we have a \verb+male+ dummy variable
with value 1 for men and 0 for women we could do
%
\begin{code}
smpl male==0 --restrict
\end{code}
%
to this effect.  Or suppose we want to restrict the sample to
respondents with incomes over \$50,000.  Then we could use
%
\begin{code}
smpl income>50000 --restrict
\end{code}

A question arises: if we issue the two commands above in sequence,
what do we end up with in our sub-sample: all cases with income over
50000, or just women with income over 50000? By default, the answer is
the latter: women with income over 50000.  The second restriction
augments the first, or in other words the final restriction is the
logical product of the new restriction and any restriction that is
already in place.  If you want a new restriction to replace any
existing restrictions you can first recreate the full dataset using
%
\begin{code}
smpl --full
\end{code}
%
Alternatively, you can add the \verb+replace+ option to the
\verb+smpl+ command:
%
\begin{code}
smpl income>50000 --restrict --replace
\end{code}

This option has the effect of automatically re-establishing the full
dataset before applying the new restriction.
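
To illustrate the default, cumulative behaviour described earlier in
this section: starting from the full dataset, issuing the
\verb+male+ and \verb+income+ restrictions in sequence should select
the same observations as a single restriction that joins the two
conditions with the logical AND operator.  A sketch, reusing the
hypothetical \verb+male+ and \verb+income+ series:

\begin{code}
# start from the full dataset
smpl --full
# women with income over 50000, in a single step
smpl male==0 && income>50000 --restrict
\end{code}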

Unlike a simple ``setting'' of the sample, ``restricting'' the sample
may result in selection of non-contiguous observations from the full
dataset.  It may therefore change the structure of the dataset.

This can be seen in the case of panel data.  Say we have a panel of
five firms (indexed by the variable \verb+firm+) observed in each of
several years (identified by the variable \verb+year+).  Then the
restriction
%
\begin{code}
smpl year==1995 --restrict
\end{code}
%
produces a dataset that is not a panel, but a cross-section for the
year 1995.  Similarly
%
\begin{code}
smpl firm==3 --restrict
\end{code}
%
produces a time-series dataset for firm number 3.

For these reasons (possible non-contiguity in the observations,
possible change in the structure of the data), gretl acts differently
when you ``restrict'' the sample as opposed to simply ``setting'' it.
In the case of setting, the program merely records the starting and
ending observations and uses these as parameters to the various
commands calling for the estimation of models, the computation of
statistics, and so on. In the case of restriction, the program makes a
reduced copy of the dataset and by default treats this reduced copy as
a simple, undated cross-section---but see the further discussion of
panel data in section~\ref{sec:smpl-panel}.

If you wish to re-impose a time-series interpretation of the reduced
dataset you can do so using the \cmd{setobs} command, or the GUI menu
item ``Data, Dataset structure''.
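
The following is only a sketch of the \cmd{setobs} route.  It assumes
an annual dataset containing a hypothetical series \verb+year+, and a
restriction that happens to select consecutive years; the starting
year 1950 is likewise an assumption made for illustration.

\begin{code}
# keep only the observations from 1950 onwards (hypothetical series)
smpl year>=1950 --restrict
# re-impose an annual time-series structure starting in 1950
setobs 1 1950 --time-series
\end{code}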

The fact that ``restricting'' the sample results in the creation of a
reduced copy of the original dataset may raise an issue when the
dataset is very large.  With such a dataset in memory, the creation of
a copy may lead to a situation where the computer runs low on memory
for calculating regression results.  You can work around this as
follows:

\begin{enumerate}
\item Open the full dataset, and impose the sample restriction.
\item Save a copy of the reduced dataset to disk.
\item Close the full dataset and open the reduced one.
\item Proceed with your analysis.
\end{enumerate}
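
In script form the steps above might look something like the
following.  This is a sketch only: the file names
\texttt{bigdata.gdt} and \texttt{reduced.gdt} and the series
\verb+income+ are hypothetical.

\begin{code}
# open the full dataset and impose the restriction
open bigdata.gdt
smpl income>50000 --restrict
# save the reduced dataset to disk
store reduced.gdt
# replace the full dataset in memory with the reduced one
open reduced.gdt
\end{code}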

\subsection{Random sub-sampling}
\label{sample-random}

Besides restricting the sample on some deterministic criterion, it may
sometimes be useful (when working with very large datasets, or perhaps
to study the properties of an estimator) to draw a random sub-sample
from the full dataset.  This can be done using, for example,
%
\begin{code}
smpl 100 --random
\end{code}
%
to select 100 cases.  If you want the sample to be reproducible, you
should set the seed for the random number generator first, using the
\cmd{set} command.  This sort of sampling falls under the
``restriction'' category: a reduced copy of the dataset is made.
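
A reproducible random sub-sample could be obtained along these lines
(the seed value shown is arbitrary):

\begin{code}
# fix the seed so that the same 100 cases are drawn on each run
set seed 371
smpl 100 --random
\end{code}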

\section{Panel data}
\label{sec:smpl-panel}

Consider for concreteness the Arellano--Bond dataset supplied with
gretl (\texttt{abdata.gdt}). This comprises data on 140 firms
($n=140$) observed over the years 1976--1984 ($T=9$). The dataset is
``nominally balanced'' in the sense that the time-series length
is the same for all firms (this being a requirement for a dataset
to count as a panel in gretl), but in fact there are many missing
values (\texttt{NA}s).

You may want to sub-sample such a dataset in either the
cross-sectional dimension (limit the sample to a subset of firms) or
the time dimension (e.g.\ use data from the 1980s only). One way to
sub-sample on firms keys off the notation used by gretl for panel
observations. The full data range is printed as \texttt{1:1} (firm 1,
period 1) to \texttt{140:9} (firm 140, period 9). The effect of
%
\begin{code}
smpl 1:1 80:9
\end{code}
%
is to limit the sample to the first 80 firms. Note that if you instead
tried \texttt{smpl 1:1 80:4} this would provoke an error: you cannot
use this syntax to sub-sample in the time dimension of the
panel. Alternatively, and perhaps more naturally, you can use the
\option{unit} option with the \cmd{smpl} command to limit the sample
in the cross-sectional dimension, as in
%
\begin{code}
smpl 1 80 --unit
\end{code}

The firms in the Arellano--Bond dataset are anonymous, but suppose you
had a panel with five named countries. With such a panel you can
inform gretl of the names of the groups using the \cmd{setobs}
command. For example, given
%
\begin{code}
string cstr = "Portugal Italy Ireland Greece Spain"
setobs country cstr --panel-groups
\end{code}
%
gretl creates a string-valued series named \texttt{country} with group
names taken from the variable \texttt{cstr}. Then, to include only
Italy and Spain you could do
%
\begin{code}
smpl country=="Italy" || country=="Spain" --restrict
\end{code}
%
or to exclude one country,
%
\begin{code}
smpl country!="Ireland" --restrict
\end{code}

To sub-sample in the time dimension, use of \option{restrict} is
required. For example, the Arellano--Bond dataset contains a variable
named \texttt{YEAR} that records the year of the observations and if
one wanted to omit the first two years of data one could do
%
\begin{code}
smpl YEAR >= 1978 --restrict
\end{code}
%
If a dataset does not already include a suitable variable for this
purpose one can use the command \texttt{genr time} to create a simple
1-based time index.
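
For example, the following sketch should have the same effect on the
Arellano--Bond data as the \texttt{YEAR}-based restriction, since the
first two periods correspond to 1976 and 1977:

\begin{code}
# create a 1-based time index
genr time
# omit the first two periods for every unit
smpl time >= 3 --restrict
\end{code}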

Note that if you apply a sample restriction that just selects certain
units (firms, countries or whatever), or selects certain contiguous
time-periods---such that $n>1$, $T>1$ and the time-series length is
still the same across all included units---your sub-sample will still
be interpreted by gretl as a panel.


\subsection{Unbalancing restrictions}

In some cases one wants to sub-sample according to a criterion that
``cuts across the grain'' of a panel dataset. For instance, suppose you
have a micro dataset with thousands of individuals observed over
several years and you want to restrict the sample to observations on
employed women.

If we simply extracted from the total $nT$ rows of the dataset those
that pertain to women who were employed at time $t$ $(t = 1,\dots,T)$
we would likely end up with a dataset that doesn't count as a panel in
gretl (because the specific time-series length, $T_i$, would differ
across individuals). In some contexts it might be OK that gretl
doesn't take your sub-sample to be a panel, but if you want to apply
panel-specific methods this is a problem. You can solve it by giving
the \option{balanced} option with \cmd{smpl}. For example, supposing
your dataset contained dummy variables \texttt{gender} (with the value
1 coding for women) and \texttt{employed}, you could do
%
\begin{code}
smpl gender==1 && employed==1 --restrict --balanced
\end{code}
%
What exactly does this do? Well, let's say the years of your data are
2000, 2005 and 2010, and that some women were employed in all of those
years, giving a maximum $T_i$ value of 3. But individual 526 is a
woman who was employed only in the year 2000 ($T_i = 1$). The effect
of the \option{balanced} option is then to insert ``padding rows'' of
\texttt{NA}s for the years 2005 and 2010 for individual 526, and
similarly for all individuals with $0 < T_i < 3$. Your sub-sample
then qualifies as a panel.


\section{Resampling and bootstrapping}
\label{sec:resampling}

Given an original data series \varname{x}, the command
%
\begin{code}
series xr = resample(x)
\end{code}
%
creates a new series each of whose elements is drawn at random from
the elements of \varname{x}.  If the original series has 100
observations, each element of \varname{x} is selected with probability
$1/100$ at each drawing.  Thus the effect is to ``shuffle'' the
elements of \varname{x}, with the twist that each element of
\varname{x} may appear more than once, or not at all, in \varname{xr}.

The primary use of this function is in the construction of bootstrap
confidence intervals or p-values.  Here is a simple example.  Suppose
we estimate a simple regression of $y$ on $x$ via OLS and find that
the slope coefficient has a reported $t$-ratio of 2.5 with 40 degrees
of freedom.  The two-tailed p-value for the null hypothesis that the
slope parameter equals zero is then 0.0166, using the $t(40)$
distribution.  Depending on the context, however, we may doubt whether
the ratio of coefficient to standard error truly follows the $t(40)$
distribution.  In that case we could derive a bootstrap p-value as
shown in Listing~\ref{resampling-loop}.

Under the null hypothesis that the slope with respect to $x$ is zero,
$y$ is simply equal to its mean plus an error term.  We simulate $y$
by resampling the residuals from the initial OLS and re-estimate the
model.  We repeat this procedure a large number of times, and record
the number of cases where the absolute value of the $t$-ratio is
greater than 2.5: the proportion of such cases is our bootstrap
p-value.  For a good discussion of simulation-based tests and
bootstrapping, see Davidson and MacKinnon
(\citeyear{davidson-mackinnon04}, chapter 4); Davidson and Flachaire
(\citeyear{davidson-flachaire01}) is also instructive.
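
In symbols (the notation here is introduced only for this summary and
does not appear in the script): if $B$ denotes the number of bootstrap
replications and $t^*_b$ the $t$-ratio obtained in replication $b$,
the bootstrap p-value described above is
\[
\hat{p}^* = \frac{1}{B} \sum_{b=1}^{B} I\left(|t^*_b| > 2.5\right)
\]
where $I(\cdot)$ is the indicator function, equal to 1 when its
argument is true and 0 otherwise.  In
Listing~\ref{resampling-loop}, $B$ corresponds to \texttt{replics} and
the running sum to \texttt{tcount}.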

\begin{script}[htbp]
  \caption{Calculation of bootstrap p-value}
  \label{resampling-loop}
\begin{scode}
ols y 0 x
# save the residuals
genr ui = $uhat
scalar ybar = mean(y)
# number of replications for bootstrap
scalar replics = 10000
scalar tcount = 0
series ysim
loop replics
  # generate simulated y by resampling
  ysim = ybar + resample(ui)
  ols ysim 0 x
  scalar tsim = abs($coeff(x) / $stderr(x))
  tcount += (tsim > 2.5)
endloop
printf "proportion of cases with |t| > 2.5 = %g\n", tcount / replics
\end{scode}
%$
\end{script}


%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "gretl-guide"
%%% End: 