\chapter{Sub-sampling a dataset}
\label{chap:sampling}

\section{Introduction}
\label{sample-intro}

Some subtle issues can arise when sub-sampling a dataset; this
chapter attempts to explain them.

A sub-sample may be defined in relation to a full dataset in two
different ways: we will refer to these as ``setting'' the sample and
``restricting'' the sample; these methods are discussed in
sections~\ref{sec:sample-set} and~\ref{sec:sample-restrict}
respectively. In addition section~\ref{sec:smpl-panel} discusses some
special issues relating to panel data, and
section~\ref{sec:resampling} covers resampling with replacement,
which is useful in the context of bootstrapping test statistics.

The following discussion focuses on the command-line approach. But you
can also invoke the methods outlined here via the items under the
\textsf{Sample} menu in the GUI program.


\section{Setting the sample}
\label{sec:sample-set}

By ``setting'' the sample we mean defining a sub-sample simply by
means of adjusting the starting and/or ending point of the current
sample range. This is likely to be most relevant for time-series
data. For example, suppose one has quarterly data from 1960:1 to
2003:4 and wants to run a regression using only data from the
1970s. A suitable command is then

\begin{code}
smpl 1970:1 1979:4
\end{code}

Or one wishes to set aside a block of observations at the end of the
data period for out-of-sample forecasting. In that case one might do

\begin{code}
smpl ; 2000:4
\end{code}

where the semicolon is shorthand for ``leave the starting observation
unchanged''. (The semicolon may also be used in place of the second
parameter, to mean that the ending observation should be unchanged.)
By ``unchanged'' here, we mean unchanged relative to the last
\verb+smpl+ setting, or relative to the full dataset if no sub-sample
has been defined up to this point.
For example, after

\begin{code}
smpl 1970:1 2003:4
smpl ; 2000:4
\end{code}

the sample range will be 1970:1 to 2000:4.

An incremental or relative form of setting the sample range is also
supported. In this case a relative offset should be given, in the
form of a signed integer (or a semicolon to indicate no change), for
both the starting and ending point. For example

\begin{code}
smpl +1 ;
\end{code}

will advance the starting observation by one while preserving the
ending observation, and

\begin{code}
smpl +2 -1
\end{code}

will both advance the starting observation by two and retard the
ending observation by one.

An important feature of ``setting'' the sample as described above is
that it necessarily results in the selection of a subset of
observations that are contiguous in the full dataset. The structure of
the dataset is therefore unaffected (for example, if it is a quarterly
time series before setting the sample, it remains a quarterly time
series afterwards).

\section{Restricting the sample}
\label{sec:sample-restrict}

By ``restricting'' the sample we mean selecting observations on the
basis of some Boolean (logical) criterion, or by means of a random
number generator. This is likely to be most relevant for
cross-sectional or panel data.

Suppose we have data on a cross-section of individuals, recording
their gender, income and other characteristics. We wish to select for
analysis only the women. If we have a \verb+male+ dummy variable
with value 1 for men and 0 for women we could do
%
\begin{code}
smpl male==0 --restrict
\end{code}
%
to this effect. Or suppose we want to restrict the sample to
respondents with incomes over \$50,000.
Then we could use
%
\begin{code}
smpl income>50000 --restrict
\end{code}

A question arises: if we issue the two commands above in sequence,
what do we end up with in our sub-sample: all cases with income over
50000, or just women with income over 50000? By default, the answer is
the latter: women with income over 50000. The second restriction
augments the first; in other words the final restriction is the
logical product of the new restriction and any restriction that is
already in place. If you want a new restriction to replace any
existing restrictions you can first recreate the full dataset using
%
\begin{code}
smpl --full
\end{code}
%
Alternatively, you can add the \verb+replace+ option to the
\verb+smpl+ command:
%
\begin{code}
smpl income>50000 --restrict --replace
\end{code}

This option has the effect of automatically re-establishing the full
dataset before applying the new restriction.

Unlike a simple ``setting'' of the sample, ``restricting'' the sample
may result in selection of non-contiguous observations from the full
data set. It may therefore change the structure of the data set.

This can be seen in the case of panel data. Say we have a panel of
five firms (indexed by the variable \verb+firm+) observed in each of
several years (identified by the variable \verb+year+). Then the
restriction
%
\begin{code}
smpl year==1995 --restrict
\end{code}
%
produces a dataset that is not a panel, but a cross-section for the
year 1995. Similarly
%
\begin{code}
smpl firm==3 --restrict
\end{code}
%
produces a time-series dataset for firm number 3.

For these reasons (possible non-contiguity in the observations,
possible change in the structure of the data), gretl acts differently
when you ``restrict'' the sample as opposed to simply ``setting'' it.
In the case of setting, the program merely records the starting and
ending observations and uses these as parameters to the various
commands calling for the estimation of models, the computation of
statistics, and so on. In the case of restriction, the program makes a
reduced copy of the dataset and by default treats this reduced copy as
a simple, undated cross-section---but see the further discussion of
panel data in section~\ref{sec:smpl-panel}.

If you wish to re-impose a time-series interpretation of the reduced
dataset you can do so using the \cmd{setobs} command, or the GUI menu
item ``Data, Dataset structure''.

The fact that ``restricting'' the sample results in the creation of a
reduced copy of the original dataset may raise an issue when the
dataset is very large. With such a dataset in memory, the creation of
a copy may lead to a situation where the computer runs low on memory
for calculating regression results. You can work around this as
follows:

\begin{enumerate}
\item Open the full data set, and impose the sample restriction.
\item Save a copy of the reduced data set to disk.
\item Close the full dataset and open the reduced one.
\item Proceed with your analysis.
\end{enumerate}

\subsection{Random sub-sampling}
\label{sample-random}

Besides restricting the sample on some deterministic criterion, it may
sometimes be useful (when working with very large datasets, or perhaps
to study the properties of an estimator) to draw a random sub-sample
from the full dataset. This can be done using, for example,
%
\begin{code}
smpl 100 --random
\end{code}
%
to select 100 cases. If you want the sample to be reproducible, you
should set the seed for the random number generator first, using the
\cmd{set} command.
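For instance, one might do something like the following (a minimal
sketch; the seed value shown here is arbitrary):
%
\begin{code}
# fix the RNG seed so the random selection is reproducible
set seed 54321
smpl 100 --random
\end{code}
%
Re-running a script containing these lines will then select the same
100 cases each time.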
This sort of sampling falls under the
``restriction'' category: a reduced copy of the dataset is made.

\section{Panel data}
\label{sec:smpl-panel}

Consider for concreteness the Arellano--Bond dataset supplied with
gretl (\texttt{abdata.gdt}). This comprises data on 140 firms
($n=140$) observed over the years 1976--1984 ($T=9$). The dataset is
``nominally balanced'' in the sense that the time-series length
is the same for all firms (this being a requirement for a dataset
to count as a panel in gretl), but in fact there are many missing
values (\texttt{NA}s).

You may want to sub-sample such a dataset in either the
cross-sectional dimension (limit the sample to a subset of firms) or
the time dimension (e.g.\ use data from the 1980s only). One way to
sub-sample on firms keys off the notation used by gretl for panel
observations. The full data range is printed as \texttt{1:1} (firm 1,
period 1) to \texttt{140:9} (firm 140, period 9). The effect of
%
\begin{code}
smpl 1:1 80:9
\end{code}
%
is to limit the sample to the first 80 firms. Note that if you instead
tried \texttt{smpl 1:1 80:4} this would provoke an error: you cannot
use this syntax to sub-sample in the time dimension of the
panel. Alternatively, and perhaps more naturally, you can use the
\option{unit} option with the \cmd{smpl} command to limit the sample
in the cross-sectional dimension, as in
%
\begin{code}
smpl 1 80 --unit
\end{code}

The firms in the Arellano--Bond dataset are anonymous, but suppose you
had a panel with five named countries. With such a panel you can
inform gretl of the names of the groups using the \cmd{setobs}
command.
For example, given
%
\begin{code}
string cstr = "Portugal Italy Ireland Greece Spain"
setobs country cstr --panel-groups
\end{code}
%
gretl creates a string-valued series named \texttt{country} with group
names taken from the variable \texttt{cstr}. Then, to include only
Italy and Spain you could do
%
\begin{code}
smpl country=="Italy" || country=="Spain" --restrict
\end{code}
%
or to exclude one country,
%
\begin{code}
smpl country!="Ireland" --restrict
\end{code}

To sub-sample in the time dimension, use of \option{restrict} is
required. For example, the Arellano--Bond dataset contains a variable
named \texttt{YEAR} that records the year of the observations, and if
one wanted to omit the first two years of data one could do
%
\begin{code}
smpl YEAR >= 1978 --restrict
\end{code}
%
If a dataset does not already include a suitable variable for this
purpose one can use the command \texttt{genr time} to create a simple
1-based time index.

Note that if you apply a sample restriction that just selects certain
units (firms, countries or whatever), or selects certain contiguous
time-periods---such that $n>1$, $T>1$ and the time-series length is
still the same across all included units---your sub-sample will still
be interpreted by gretl as a panel.


\subsection{Unbalancing restrictions}

In some cases one wants to sub-sample according to a criterion that
``cuts across the grain'' of a panel dataset. For instance, suppose you
have a micro dataset with thousands of individuals observed over
several years and you want to restrict the sample to observations on
employed women.
If we simply extracted from the total $nT$ rows of the dataset those
that pertain to women who were employed at time $t$ $(t = 1,\dots,T)$
we would likely end up with a dataset that doesn't count as a panel in
gretl (because the specific time-series length, $T_i$, would differ
across individuals). In some contexts it might be OK that gretl
doesn't take your sub-sample to be a panel, but if you want to apply
panel-specific methods this is a problem. You can solve it by giving
the \option{balanced} option with \cmd{smpl}. For example, supposing
your dataset contained dummy variables \texttt{gender} (with the value
1 coding for women) and \texttt{employed}, you could do
%
\begin{code}
smpl gender==1 && employed==1 --restrict --balanced
\end{code}
%
What exactly does this do? Well, let's say the years of your data are
2000, 2005 and 2010, and that some women were employed in all of those
years, giving a maximum $T_i$ value of 3. But individual 526 is a
woman who was employed only in the year 2000 ($T_i = 1$). The effect
of the \option{balanced} option is then to insert ``padding rows'' of
\texttt{NA}s for the years 2005 and 2010 for individual 526, and
similarly for all individuals with $0 < T_i < 3$. Your sub-sample
then qualifies as a panel.


\section{Resampling and bootstrapping}
\label{sec:resampling}

Given an original data series \varname{x}, the command
%
\begin{code}
series xr = resample(x)
\end{code}
%
creates a new series each of whose elements is drawn at random from
the elements of \varname{x}. If the original series has 100
observations, each element of \varname{x} is selected with probability
$1/100$ at each drawing. Thus the effect is to ``shuffle'' the
elements of \varname{x}, with the twist that each element of
\varname{x} may appear more than once, or not at all, in \varname{xr}.
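The following is a minimal sketch of \texttt{resample} in action,
using a small artificial dataset (the dataset size, seed and series
are arbitrary choices for illustration):
%
\begin{code}
nulldata 10
set seed 1234
series x = normal()
# xr holds a with-replacement redraw of the values in x
series xr = resample(x)
print x xr --byobs
\end{code}
%
Comparing the two columns of the printout, you should see that some
values of \texttt{x} recur in \texttt{xr} while others are absent.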
The primary use of this function is in the construction of bootstrap
confidence intervals or p-values. Here is a simple example. Suppose
we estimate a simple regression of $y$ on $x$ via OLS and find that
the slope coefficient has a reported $t$-ratio of 2.5 with 40 degrees
of freedom. The two-tailed p-value for the null hypothesis that the
slope parameter equals zero is then 0.0166, using the $t(40)$
distribution. Depending on the context, however, we may doubt whether
the ratio of coefficient to standard error truly follows the $t(40)$
distribution. In that case we could derive a bootstrap p-value as
shown in Listing~\ref{resampling-loop}.

Under the null hypothesis that the slope with respect to $x$ is zero,
$y$ is simply equal to its mean plus an error term. We simulate $y$
by resampling the residuals from the initial OLS and re-estimate the
model. We repeat this procedure a large number of times, and record
the number of cases where the absolute value of the $t$-ratio is
greater than 2.5: the proportion of such cases is our bootstrap
p-value. For a good discussion of simulation-based tests and
bootstrapping, see Davidson and MacKinnon
(\citeyear{davidson-mackinnon04}, chapter 4); Davidson and Flachaire
(\citeyear{davidson-flachaire01}) is also instructive.
\begin{script}[htbp]
\caption{Calculation of bootstrap p-value}
\label{resampling-loop}
\begin{scode}
ols y 0 x
# save the residuals
genr ui = $uhat
scalar ybar = mean(y)
# number of replications for bootstrap
scalar replics = 10000
scalar tcount = 0
series ysim
loop replics
    # generate simulated y by resampling
    ysim = ybar + resample(ui)
    ols ysim 0 x
    scalar tsim = abs($coeff(x) / $stderr(x))
    tcount += (tsim > 2.5)
endloop
printf "proportion of cases with |t| > 2.5 = %g\n", tcount / replics
\end{scode}
%$
\end{script}


%%% Local Variables:
%%% mode: latex
%%% TeX-master: "gretl-guide"
%%% End: