
\chapter{Maximum likelihood estimation}
\label{chap:mle}

\section{Generic ML estimation with gretl}
\label{sec:mle-intro}

Maximum likelihood estimation is a cornerstone of modern inferential
procedures. Gretl provides a way to implement this method for a wide
range of estimation problems, by use of the \texttt{mle} command. We
give here a few examples.

To give a foundation for the examples that follow, we start from a
brief reminder on the basics of ML estimation. Given a sample of size
$T$, it is possible to define the density function\footnote{We are
  supposing here that our data are a realization of continuous random
  variables. For discrete random variables, everything continues to
  apply by referring to the probability function instead of the
  density. In both cases, the distribution may be conditional on some
  exogenous variables.} for the whole sample, namely the joint
distribution of all the observations $f(\mathbf{Y} ; \theta)$, where
$\mathbf{Y} = \left\{ y_1, \ldots, y_T \right\}$. Its shape is
determined by a $k$-vector of unknown parameters $\theta$, which we
assume is contained in a set $\Theta$, and which can be used to
evaluate the probability of observing a sample with any given
characteristics.

After observing the data, the values $\mathbf{Y}$ are given, and this
function can be evaluated for any legitimate value of $\theta$. In
this case, we prefer to call it the \emph{likelihood} function; the
need for another name stems from the fact that this function works as
a density when we use the $y_t$s as arguments and $\theta$ as
parameters, whereas in this context $\theta$ is taken as the
function's argument, and the data $\mathbf{Y}$ only have the role of
determining its shape.

In standard cases, this function has a unique maximum.
The location
of the maximum is unaffected if we consider the logarithm of the
likelihood (or log-likelihood for short): this function will be
denoted as
\[
  \LogLik(\theta) = \log f(\mathbf{Y}; \theta)
\]
The log-likelihood functions that gretl can handle are those where
$\LogLik(\theta)$ can be written as
\[
  \LogLik(\theta) = \sum_{t=1}^T \ell_t(\theta)
\]
which is true in most cases of interest. The functions
$\ell_t(\theta)$ are called the log-likelihood contributions.

Moreover, the location of the maximum is obviously determined by the
data $\mathbf{Y}$. This means that the value
\begin{equation}
  \label{eq:maxlik}
  \hat{\theta}(\mathbf{Y}) = \argmax_{\theta \in \Theta} \LogLik(\theta)
\end{equation}
is some function of the observed data (a statistic), which has the
property, under mild conditions, of being a consistent, asymptotically
normal and asymptotically efficient estimator of $\theta$.

Sometimes it is possible to write down explicitly the function
$\hat{\theta}(\mathbf{Y})$; in general, it need not be so. In these
circumstances, the maximum can be found by means of numerical
techniques. These often rely on the fact that the log-likelihood is a
smooth function of $\theta$, and therefore at the maximum its partial
derivatives should all be 0. The \textsl{gradient vector}, or
\textsl{score vector}, is a function that enjoys many interesting
statistical properties in its own right; it will be denoted here as
$\mathbf{g}(\theta)$. It is a $k$-vector with typical element
\[
  g_i(\theta) = \frac{\partial\LogLik(\theta)}{\partial\theta_i}
  = \sum_{t=1}^T \frac{\partial\ell_t(\theta)}{\partial\theta_i}
\]

Gradient-based methods can be briefly illustrated as follows:

\begin{enumerate}
\item pick a point $\theta_0 \in \Theta$;
\item evaluate $\mathbf{g}(\theta_0)$;
\item if $\mathbf{g}(\theta_0)$ is ``small'', stop.
Otherwise, compute
  a direction vector $d(\mathbf{g}(\theta_0))$;
\item evaluate $\theta_1 = \theta_0 + d(\mathbf{g}(\theta_0))$;
\item substitute $\theta_0$ with $\theta_1$;
\item restart from 2.
\end{enumerate}

Many algorithms of this kind exist; they basically differ from one
another in the way they compute the direction vector
$d(\mathbf{g}(\theta_0))$, to ensure that $\LogLik(\theta_1) >
\LogLik(\theta_0)$ (so that we eventually end up on the maximum).

The default method gretl uses to maximize the log-likelihood is a
gradient-based algorithm known as the \textbf{BFGS} (Broyden,
Fletcher, Goldfarb and Shanno) method. This technique is used in most
econometric and statistical packages, as it is well-established and
remarkably powerful. Clearly, in order to make this technique
operational, it must be possible to compute the vector
$\mathbf{g}(\theta)$ for any value of $\theta$. In some cases this
vector can be written explicitly as a function of $\mathbf{Y}$. If
this is not possible, or too difficult, the gradient may be evaluated
numerically. The alternative \textbf{Newton--Raphson} algorithm is also
available. This method is more effective under some circumstances but
is also more fragile; see section \ref{sec:mle-adv} and chapter
\ref{chap:numerical} for details.\footnote{Note that some of the
  statements made below (for example, regarding estimation of the
  covariance matrix) have to be modified when Newton's method is
  used.}

The choice of the starting value, $\theta_0$, is crucial in some
contexts and inconsequential in others. In general, however, it is
advisable to start the algorithm from ``sensible'' values whenever
possible.
If a consistent estimator is available, this is usually a
safe and efficient choice: it ensures that in large samples the
starting point will likely be close to $\hat{\theta}$ and convergence
can be achieved in few iterations.

The maximum number of iterations allowed for the BFGS procedure, and
the relative tolerance for assessing convergence, can be adjusted
using the \cmd{set} command: the relevant variables are
\verb+bfgs_maxiter+ (default value 500) and \verb+bfgs_toler+ (default
value, the machine precision to the power 3/4).

\subsection{Covariance matrix and standard errors}

By default the covariance matrix of the parameter estimates is
based on the Outer Product of the Gradient (OPG). That is,
\begin{equation}
  \label{eq:OPGmat}
  \widehat{\mbox{Var}}_{\mbox{\scriptsize OPG}}(\hat{\theta}) =
  \left( G'(\hat{\theta}) G(\hat{\theta}) \right)^{-1}
\end{equation}
where $G(\hat{\theta})$ is the $T \times k$ matrix of contributions to
the gradient. Two other options are available. If the
\option{hessian} flag is given, the covariance matrix is computed from
a numerical approximation to the Hessian at convergence. If the
\option{robust} option is selected, the quasi-ML ``sandwich''
estimator is used:
\[
  \widehat{\mbox{Var}}_{\mbox{\scriptsize QML}}(\hat{\theta}) =
  H(\hat{\theta})^{-1} G'(\hat{\theta}) G(\hat{\theta}) H(\hat{\theta})^{-1}
\]
where $H$ denotes the numerical approximation to the Hessian.

Cluster-robust estimation is also available: in order to activate it,
use the \option{cluster=}\emph{\texttt{clustvar}} option, where
\texttt{clustvar} should be a discrete series. See section
\ref{sec:vcv-cluster} for more details.
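To fix ideas, the three estimators can be illustrated numerically in
the simplest possible setting: a single parameter, the mean of a
normal sample with known unit variance, where the score contribution
of observation $t$ is $g_t = y_t - \hat{\mu}$ and the Hessian equals
$-T$. The sketch below is plain Python rather than gretl, and the
sample values are arbitrary made-up numbers, purely for illustration:

```python
# Toy illustration (plain Python, not gretl): the three covariance
# estimators for one parameter, the mean of a normal sample with
# known unit variance. The data are arbitrary made-up values.
y = [0.5, -1.2, 0.3, 1.1, -0.4, 0.9, -0.2, 0.6]
T = len(y)
mu_hat = sum(y) / T                  # ML estimate of the mean

# per-observation score contributions at the maximum: g_t = y_t - mu_hat
g = [yt - mu_hat for yt in y]
opg = sum(gt * gt for gt in g)       # G'G, a scalar here

var_opg = 1.0 / opg                  # OPG: (G'G)^{-1}
var_hessian = 1.0 / T                # Hessian-based: (-H)^{-1}, with H = -T
var_sandwich = (1.0 / T) * opg * (1.0 / T)   # QML: H^{-1} G'G H^{-1}
```

With a correctly specified model the three estimators are
asymptotically equivalent, but in a finite sample they generally
differ, as a run of this fragment shows.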
Note, however, that if the log-likelihood function supplied by the
user just returns a scalar value---as opposed to a series or vector
holding per-observation contributions---then the OPG method is not
applicable and so the covariance matrix must be estimated via a
numerical approximation to the Hessian.

\section{Gamma estimation}
\label{sec:ml-gamma}

Suppose we have a sample of $T$ independent and identically
distributed observations from a Gamma distribution. The density
function for each observation $x_t$ is
\begin{equation}
  \label{eq:gammadens}
  f(x_t) = \frac{\alpha^p}{\Gamma(p)} x_t^{p-1}
  \exp\left(-\alpha x_t\right)
\end{equation}
The log-likelihood for the entire sample can be written as the
logarithm of the joint density of all the observations. Since these
are independent and identically distributed, the joint density is the
product of the individual densities, and hence its log is
\begin{equation}
  \label{eq:gammaloglik}
  \LogLik(\alpha, p) = \sum_{t=1}^T \log \left[
    \frac{\alpha^p}{\Gamma(p)} x_t^{p-1} \exp\left(-\alpha x_t\right)
  \right] = \sum_{t=1}^T \ell_t
\end{equation}
where
\[
  \ell_t = p \cdot \log (\alpha x_t) - \gamma(p) - \log x_t - \alpha x_t
\]
and $\gamma(\cdot)$ is the log of the gamma function. In order to
estimate the parameters $\alpha$ and $p$ via ML, we need to maximize
(\ref{eq:gammaloglik}) with respect to them. The corresponding
gretl code snippet is

\begin{code}
scalar alpha = 1
scalar p = 1

mle logl = p*ln(alpha * x) - lngamma(p) - ln(x) - alpha * x
    params alpha p
end mle
\end{code}

The first two statements

\begin{code}
alpha = 1
p = 1
\end{code}

are necessary to ensure that the variables \texttt{alpha} and
\texttt{p} exist before the computation of \texttt{logl} is
attempted.
Inside the \texttt{mle} block these variables (which could
be either scalars, vectors or a combination of the two---see below
for an example) are identified as the parameters that should be
adjusted to maximize the likelihood via the \texttt{params} keyword.
Their values will be changed by the execution of the \texttt{mle}
command; upon successful completion, they will be replaced by the ML
estimates. The starting value is 1 for both; this is arbitrary and
does not matter much in this example (more on this later).

The above code can be made more readable, and marginally more
efficient, by defining a variable to hold $\alpha \cdot x_t$. This
command can be embedded in the \texttt{mle} block as follows:
\begin{code}
mle logl = p*ln(ax) - lngamma(p) - ln(x) - ax
    series ax = alpha*x
    params alpha p
end mle
\end{code}
The variable \texttt{ax} is not added to the \texttt{params} list,
of course, since it is just an auxiliary variable to facilitate
the calculations. You can insert as many such auxiliary lines
as you require before the \texttt{params} line, with the restriction
that they must contain either (a) commands to generate series,
scalars or matrices or (b) print commands (which may be used to
aid in debugging).

In a simple example like this, the choice of the starting values is
almost inconsequential; the algorithm is likely to converge no
matter what the starting values are.
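Incidentally, the expression entered for \texttt{logl} can be checked
mechanically: $\ell_t$ must coincide with the logarithm of the density
(\ref{eq:gammadens}). The following sketch does the check in plain
Python with arbitrary test values (the \texttt{math} module's
\texttt{lgamma} plays the role of gretl's \texttt{lngamma}); it is
purely illustrative and involves no gretl code:

```python
from math import lgamma, log

def ell(x, alpha, p):
    # per-observation log-likelihood as written in the mle block
    return p * log(alpha * x) - lgamma(p) - log(x) - alpha * x

def log_density(x, alpha, p):
    # log of alpha^p / Gamma(p) * x^(p-1) * exp(-alpha * x)
    return p * log(alpha) - lgamma(p) + (p - 1) * log(x) - alpha * x

# the two expressions agree for any admissible (x, alpha, p)
cases = [(0.5, 1.0, 1.0), (2.3, 0.7, 3.2), (1.1, 2.5, 0.9)]
diffs = [abs(ell(x, a, p) - log_density(x, a, p)) for x, a, p in cases]
```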
However, consistent
method-of-moments estimators of $p$ and $\alpha$ can be simply
recovered from the sample mean $m$ and variance $V$: since it can be
shown that
\[
  E(x_t) = p/\alpha \qquad V(x_t) = p/\alpha^2
\]
it follows that the estimators
\begin{eqnarray*}
  \bar{\alpha} & = & m/V \\
  \bar{p} & = & m \cdot \bar{\alpha}
\end{eqnarray*}
are consistent, and therefore suitable for use as a starting point
for the algorithm. The gretl script code then becomes
\begin{code}
scalar m = mean(x)
scalar alpha = m/var(x)
scalar p = m*alpha

mle logl = p*ln(ax) - lngamma(p) - ln(x) - ax
    series ax = alpha*x
    params alpha p
end mle
\end{code}

Another thing to note is that sometimes parameters are constrained
within certain boundaries: in this case, for example, both $\alpha$
and $p$ must be positive numbers. Gretl does not check for this:
it is the user's responsibility to ensure that the function is
always evaluated at an admissible point in the parameter space during
the iterative search for the maximum. An effective technique is to
define a variable for checking that the parameters are admissible,
setting the log-likelihood as undefined if the check fails. An
example, which uses the conditional assignment operator, follows:
\begin{code}
scalar m = mean(x)
scalar alpha = m/var(x)
scalar p = m*alpha

mle logl = check ? p*ln(ax) - lngamma(p) - ln(x) - ax : NA
    series ax = alpha*x
    scalar check = (alpha>0) && (p>0)
    params alpha p
end mle
\end{code}

\section{Stochastic frontier cost function}
\label{sec:frontier}

\emph{%
Note: this section has the sole purpose of illustrating the \cmd{mle}
command.
For the estimation of stochastic frontier cost or production
functions, you may want to use the \package{frontier} function package.}

When modeling a cost function, it is sometimes worthwhile to
incorporate explicitly into the statistical model the notion that
firms may be inefficient, so that the observed cost deviates from the
theoretical figure not only because of unobserved heterogeneity
between firms, but also because two firms could be operating at
different efficiency levels, despite being identical in all other
respects. In this case we may write
\[
  C_i = C^*_i + u_i + v_i
\]
where $C_i$ is some variable cost indicator, $C_i^*$ is its
``theoretical'' value, $u_i$ is a zero-mean disturbance term and $v_i$
is the inefficiency term, which is supposed to be nonnegative by its
very nature. A linear specification for $C_i^*$ is often chosen. For
example, the Cobb--Douglas cost function arises when $C_i^*$ is a
linear function of the logarithms of the input prices and the output
quantities.

The \emph{stochastic frontier} model is a linear model of the form
$y_i = x_i \beta + \varepsilon_i$ in which the error term
$\varepsilon_i$ is the sum of $u_i$ and $v_i$.

A common postulate is that $u_i \sim N(0,\sigma_u^2)$ and
$v_i \sim \left|N(0,\sigma_v^2)\right|$. If independence between $u_i$
and $v_i$ is also assumed, then it is possible to show that the
density function of $\varepsilon_i$ has the form
\begin{equation}
  \label{eq:frontdens}
  f(\varepsilon_i) = 2 \,
  \Phi\left(\frac{\lambda \varepsilon_i}{\sigma}\right)
  \frac{1}{\sigma} \phi\left(\frac{\varepsilon_i}{\sigma}\right)
\end{equation}
where $\Phi(\cdot)$ and $\phi(\cdot)$ are, respectively, the
distribution and density functions of the standard normal,
$\sigma = \sqrt{\sigma^2_u + \sigma^2_v}$ and
$\lambda = \frac{\sigma_u}{\sigma_v}$.
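A quick numerical check may be reassuring here: the density of
$\varepsilon_i = u_i + v_i$ belongs to the skew-normal family and can
be written compactly as
$2 \sigma^{-1} \phi(\varepsilon_i/\sigma) \Phi(\lambda \varepsilon_i/\sigma)$,
which must integrate to one. The sketch below verifies this in plain
Python via the trapezoidal rule, using the arbitrary illustrative
values $\sigma_u = \sigma_v = 1$ (no gretl involved):

```python
from math import erf, exp, pi, sqrt

def phi(x):                     # standard normal density
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def Phi(x):                     # standard normal cdf
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

su, sv = 1.0, 1.0               # illustrative values, not estimates
sigma = sqrt(su**2 + sv**2)
lam = su / sv

def f(e):                       # skew-normal density of epsilon = u + v
    return 2.0 / sigma * phi(e / sigma) * Phi(lam * e / sigma)

# composite trapezoidal rule on a wide grid
n, lo, hi = 20000, -12.0, 12.0
h = (hi - lo) / n
total = h * (sum(f(lo + i * h) for i in range(1, n))
             + 0.5 * (f(lo) + f(hi)))
# total is approximately 1
```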
As a consequence, the log-likelihood for one observation takes the
form (apart from an irrelevant constant)
\[
  \ell_t =
  \log\Phi\left(\frac{\lambda \varepsilon_i}{\sigma}\right) -
  \left[ \log(\sigma) + \frac{\varepsilon_i^2}{2 \sigma^2} \right]
\]
Therefore, a Cobb--Douglas cost function with stochastic frontier is
the model described by the following equations:
\begin{eqnarray*}
  \log C_i & = & \log C^*_i + \varepsilon_i \\
  \log C^*_i & = & c + \sum_{j=1}^m \beta_j \log y_{ij} +
    \sum_{j=1}^n \alpha_j \log p_{ij} \\
  \varepsilon_i & = & u_i + v_i \\
  u_i & \sim & N(0,\sigma_u^2) \\
  v_i & \sim & \left|N(0,\sigma_v^2)\right|
\end{eqnarray*}

In most cases, one wants to ensure that the homogeneity of the cost
function with respect to the prices holds by construction. Since this
requirement is equivalent to $\sum_{j=1}^n \alpha_j = 1$, the above
equation for $C^*_i$ can be rewritten as
\begin{equation}
  \label{eq:CobbDouglasFrontier}
  \log C_i - \log p_{in} = c + \sum_{j=1}^m \beta_j \log y_{ij} +
  \sum_{j=2}^n \alpha_j (\log p_{ij} - \log p_{in}) + \varepsilon_i
\end{equation}

The above equation could be estimated by OLS, but it would suffer from
two drawbacks: first, the OLS estimator for the intercept $c$ is
inconsistent because the disturbance term has a non-zero expected
value; second, the OLS estimators for the other parameters are
consistent, but inefficient in view of the non-normality of
$\varepsilon_i$. Both issues can be addressed by estimating
(\ref{eq:CobbDouglasFrontier}) by maximum likelihood. Nevertheless,
OLS estimation is a quick and convenient way to provide starting
values for the MLE algorithm.

Listing \ref{cost-estimation} shows how to implement the model
described so far. The \texttt{banks91} file contains part of the data
used in \citet*{lucchetti01}.
\begin{script}[htbp]
  \caption{Estimation of stochastic frontier cost function (with
    scalar parameters)}
  \label{cost-estimation}
\begin{scode}
open banks91.gdt

# transformations
series cost = ln(VC)
series q1 = ln(Q1)
series q2 = ln(Q2)
series p1 = ln(P1)
series p2 = ln(P2)
series p3 = ln(P3)

# Cobb-Douglas cost function with homogeneity restrictions
# (for initialization)
genr rcost = cost - p1
genr rp2 = p2 - p1
genr rp3 = p3 - p1

ols rcost const q1 q2 rp2 rp3

# Cobb-Douglas cost function with homogeneity restrictions
# and inefficiency

scalar b0 = $coeff(const)
scalar b1 = $coeff(q1)
scalar b2 = $coeff(q2)
scalar b3 = $coeff(rp2)
scalar b4 = $coeff(rp3)

scalar su = 0.1
scalar sv = 0.1

mle logl = ln(cnorm(e*lambda/ss)) - (ln(ss) + 0.5*(e/ss)^2)
    scalar ss = sqrt(su^2 + sv^2)
    scalar lambda = su/sv
    series e = rcost - b0*const - b1*q1 - b2*q2 - b3*rp2 - b4*rp3
    params b0 b1 b2 b3 b4 su sv
end mle
\end{scode}
\end{script}

The script in Listing \ref{cost-estimation} is relatively easy to
modify to show how one can use vectors (that is, 1-dimensional
matrices) for storing the parameters to optimize: Listing
\ref{cost-estimation-vec} holds essentially the same script, in which
the parameters of the cost function are stored together in a
vector. Of course, this also makes it possible to use variable lists
and other refinements which make the code more compact and readable.
\begin{script}[htbp]
  \caption{Estimation of stochastic frontier cost function (with
    matrix parameters)}
  \label{cost-estimation-vec}
\begin{scode}
open banks91.gdt

# transformations
series cost = ln(VC)
series q1 = ln(Q1)
series q2 = ln(Q2)
series p1 = ln(P1)
series p2 = ln(P2)
series p3 = ln(P3)

# Cobb-Douglas cost function with homogeneity restrictions
# (for initialization)
genr rcost = cost - p1
genr rp2 = p2 - p1
genr rp3 = p3 - p1
list X = const q1 q2 rp2 rp3

ols rcost X

# Cobb-Douglas cost function with homogeneity restrictions
# and inefficiency

matrix b = $coeff
scalar su = 0.1
scalar sv = 0.1

mle logl = ln(cnorm(e*lambda/ss)) - (ln(ss) + 0.5*(e/ss)^2)
    scalar ss = sqrt(su^2 + sv^2)
    scalar lambda = su/sv
    series e = rcost - lincomb(X, b)
    params b su sv
end mle
\end{scode}
\end{script}
%$
\section{GARCH models}
\label{sec:garch}

GARCH models are handled by gretl via a native function. However, it is
instructive to see how they can be estimated through the \texttt{mle}
command.\footnote{The \package{gig} addon, which handles other variants
  of conditionally heteroskedastic models, uses \texttt{mle} as its
  internal engine.}

The following equations provide the simplest example of a GARCH(1,1)
model:
\begin{eqnarray*}
  y_t & = & \mu + \varepsilon_t \\
  \varepsilon_t & = & u_t \cdot \sigma_t \\
  u_t & \sim & N(0,1) \\
  h_t & = & \omega + \alpha \varepsilon^2_{t-1} + \beta h_{t-1}.
\end{eqnarray*}
Since the variance of $y_t$ depends on past values, writing down the
log-likelihood function is not simply a matter of summing the log
densities for individual observations.
As is common in time series
models, $y_t$ cannot be considered independent of the other
observations in our sample, and consequently the density function for
the whole sample (the joint density for all observations) is not just
the product of the marginal densities.

Maximum likelihood estimation, in these cases, is achieved by
considering \emph{conditional} densities, so what we maximize is a
conditional likelihood function. If we define the information set at
time $t$ as
\[
  F_t = \left\{ y_t, y_{t-1}, \ldots \right\} ,
\]
then the density of $y_t$ conditional on $F_{t-1}$ is normal:
\[
  y_t | F_{t-1} \sim N\left[ \mu, h_{t} \right].
\]

By means of the properties of conditional distributions, the joint
density can be factorized as follows
\[
  f(y_t, y_{t-1}, \ldots) = \left[ \prod_{t=1}^T f(y_t | F_{t-1})
  \right] \cdot f(y_0)
\]
If we treat $y_0$ as fixed, then the term $f(y_0)$ does not depend on
the unknown parameters, and the conditional log-likelihood can then be
written as the sum of the individual contributions as
\begin{equation}
  \label{eq:garchloglik}
  \LogLik(\mu,\omega,\alpha,\beta) = \sum_{t=1}^T \ell_t
\end{equation}
where, up to an additive constant,
\[
  \ell_t = \log \left[ \frac{1}{\sqrt{h_t}}
    \phi\left( \frac{y_t - \mu}{\sqrt{h_t}} \right) \right] =
  - \frac{1}{2} \left[ \log(h_t) + \frac{(y_t - \mu)^2}{h_t} \right]
\]

The following script shows a simple application of this technique,
which uses the data file \texttt{djclose}; it is one of the example
datasets supplied with gretl and contains daily data from the Dow
Jones stock index.
\begin{code}
open djclose

series y = 100*ldiff(djclose)

scalar mu = 0.0
scalar omega = 1
scalar alpha = 0.4
scalar beta = 0.0

mle ll = -0.5*(log(h) + (e^2)/h)
    series e = y - mu
    series h = var(y)
    series h = omega + alpha*(e(-1))^2 + beta*h(-1)
    params mu omega alpha beta
end mle
\end{code}

\section{Analytical derivatives}
\label{sec:anal-der}

Computation of the score vector is essential for the working of the
BFGS method. In all the previous examples, no explicit formula for the
computation of the score was given, so the algorithm was fed
numerically evaluated gradients. Numerical computation of the score
for the $i$-th parameter is performed via a finite approximation of
the derivative, namely
\[
  \pder{\LogLik(\theta_1, \ldots, \theta_n)}{\theta_i} \simeq
  \frac{\LogLik(\theta_1, \ldots, \theta_i + h, \ldots, \theta_n) -
    \LogLik(\theta_1, \ldots, \theta_i - h, \ldots, \theta_n)}{2h}
\]
where $h$ is a small number.

In many situations, this is rather efficient and accurate. A better
approximation to the true derivative may be obtained by forcing
\cmd{mle} to use a technique known as \emph{Richardson extrapolation},
which gives extremely precise results, but is considerably more
CPU-intensive. This feature may be turned on by using the \cmd{set}
command as in
\begin{code}
set bfgs_richardson on
\end{code}

However, one might want to avoid the approximation and specify an
exact function for the derivatives.
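To see the central-difference formula at work outside gretl, the
following plain-Python sketch (a hypothetical toy case, not gretl's
internal code) applies it to the log-likelihood of a small exponential
sample, for which the exact score is $T/\theta - \sum_t x_t$:

```python
from math import log

def num_grad(f, theta, h=1e-6):
    # central finite-difference approximation to f'(theta)
    return (f(theta + h) - f(theta - h)) / (2.0 * h)

# arbitrary illustrative sample, treated as exponential with rate theta
x = [0.8, 1.3, 0.4, 2.1, 0.9]
T = len(x)

def loglik(theta):
    return sum(log(theta) - theta * xt for xt in x)

theta0 = 1.5
exact = T / theta0 - sum(x)          # analytical score
approx = num_grad(loglik, theta0)    # numerical score
# approx agrees with exact to several decimal places
```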
As an example, consider the
following script:
%
\begin{code}
nulldata 1000

genr x1 = normal()
genr x2 = normal()
genr x3 = normal()

genr ystar = x1 + x2 + x3 + normal()
genr y = (ystar > 0)

scalar b0 = 0
scalar b1 = 0
scalar b2 = 0
scalar b3 = 0

mle logl = y*ln(P) + (1-y)*ln(1-P)
    series ndx = b0 + b1*x1 + b2*x2 + b3*x3
    series P = cnorm(ndx)
    params b0 b1 b2 b3
end mle --verbose
\end{code}

Here, 1000 data points are artificially generated for an ordinary
probit model:\footnote{Again, gretl does provide a native
  \texttt{probit} command (see section \ref{sec:logit-probit}), but a
  probit model makes for a nice example here.} $y_t$ is a binary
variable, which takes the value 1 if $y_t^* = \beta_0 + \beta_1 x_{1t}
+ \beta_2 x_{2t} + \beta_3 x_{3t} + \varepsilon_t > 0$ and 0
otherwise. Therefore, $y_t = 1$ with probability $\Phi(\beta_0 +
\beta_1 x_{1t} + \beta_2 x_{2t} + \beta_3 x_{3t}) = \pi_t$. The
probability function for one observation can be written as
\[
  P(y_t) = \pi_t^{y_t} ( 1 - \pi_t )^{1-y_t}
\]
Since the observations are independent and identically distributed,
the log-likelihood is simply the sum of the individual
contributions. Hence
\[
  \LogLik = \sum_{t=1}^T y_t \log(\pi_t) + (1 - y_t) \log(1 - \pi_t)
\]
The \option{verbose} switch at the end of the \texttt{end mle}
statement produces a detailed account of the iterations done by the
BFGS algorithm.
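As an aside, the mechanics can be replicated outside gretl. The
following plain-Python sketch evaluates the same kind of probit
log-likelihood on a tiny hard-coded sample with a single regressor
(all values arbitrary and purely illustrative, unrelated to the script
above), together with its score, and checks that one small step uphill
along the gradient increases the log-likelihood, which is the essence
of the iterations reported under \option{verbose}:

```python
from math import erf, exp, log, pi, sqrt

def Phi(z):                     # standard normal cdf
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def phi(z):                     # standard normal density
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

# tiny illustrative sample: one regressor, one coefficient b
x = [-1.2, -0.5, 0.3, 0.8, 1.5, 2.0]
y = [0, 0, 1, 0, 1, 1]

def loglik(b):
    return sum(yt * log(Phi(b * xt)) + (1 - yt) * log(1.0 - Phi(b * xt))
               for xt, yt in zip(x, y))

def score(b):
    # score: phi(ndx) * (y/P - (1-y)/(1-P)) * x, summed over observations
    return sum(phi(b * xt)
               * (yt / Phi(b * xt) - (1 - yt) / (1.0 - Phi(b * xt)))
               * xt
               for xt, yt in zip(x, y))

# one small gradient step from b = 0 increases the log-likelihood
b0 = 0.0
b1 = b0 + 0.1 * score(b0)
```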
In this case, numerical differentiation works rather well;
nevertheless, computation of the analytical score is straightforward,
since the derivative $\pder{\LogLik}{\beta_i}$ can be written as
\[
  \pder{\LogLik}{\beta_i} = \pder{\LogLik}{\pi_t} \cdot
  \pder{\pi_t}{\beta_i}
\]
via the chain rule, and it is easy to see that
\begin{eqnarray*}
  \pder{\LogLik}{\pi_t} & = & \frac{y_t}{\pi_t} -
    \frac{1 - y_t}{1 - \pi_t} \\
  \pder{\pi_t}{\beta_i} & = & \phi(\beta_0 + \beta_1 x_{1t} +
    \beta_2 x_{2t} + \beta_3 x_{3t}) \cdot x_{it}
\end{eqnarray*}

The \texttt{mle} block in the above script can therefore be modified
as follows:
%
\begin{code}
mle logl = y*ln(P) + (1-y)*ln(1-P)
    series ndx = b0 + b1*x1 + b2*x2 + b3*x3
    series P = cnorm(ndx)
    series m = dnorm(ndx)*(y/P - (1-y)/(1-P))
    deriv b0 = m
    deriv b1 = m*x1
    deriv b2 = m*x2
    deriv b3 = m*x3
end mle --verbose
\end{code}

Note that the \texttt{params} statement has been replaced by a series
of \texttt{deriv} statements; these have the double function of
identifying the parameters over which to optimize and providing an
analytical expression for their respective score elements.
This facility is more restricted than
the regular \texttt{print} command. The command word \texttt{print}
should be followed by the name of just one variable (a scalar, series
or matrix).

In the last example above a key variable named \texttt{m} was
generated, forming the basis for the analytical derivatives. To track
the progress of this variable one could add a print statement within
the ML block, as in
%
\begin{code}
series m = dnorm(ndx)*(y/P - (1-y)/(1-P))
print m
\end{code}

\section{Using functions}
\label{sec:mle-func}

The \texttt{mle} command allows you to estimate models that
gretl does not provide natively: in some cases, it may be a good
idea to wrap up the \texttt{mle} block in a user-defined function (see
Chapter \ref{chap:functions}), so as to extend gretl's
capabilities in a modular and flexible way.

As an example, we will take a simple case of a model that gretl
does not yet provide natively: the zero-inflated Poisson model, or ZIP
for short.\footnote{The actual ZIP model is in fact a bit more general
  than the one presented here. The specialized version discussed in
  this section was chosen for the sake of simplicity. For further
  details, see \cite{greene03}.} In this model, we assume that we
observe a mixed population: for some individuals, the variable $y_t$
is (conditionally on a vector of exogenous covariates $x_t$)
distributed as a Poisson random variate; for some others, $y_t$ is
identically 0. The trouble is, we don't know which category a given
individual belongs to.

For instance, suppose we have a sample of women, and the variable
$y_t$ represents the number of children that woman $t$ has. There may
be a certain proportion, $\alpha$, of women for whom $y_t = 0$ with
certainty (maybe out of a personal choice, or due to physical
impossibility).
But there may be other women for whom $y_t = 0$ just as a matter of
chance---they haven't happened to have any children at the time
of observation.

In formulae:
\begin{eqnarray*}
  P(y_t | x_t) & = & \alpha d_t + (1 - \alpha)
    \left[e^{-\mu_t} \frac{\mu_t^{y_t}}{y_t!}\right] \\
  \mu_t & = & \exp(x_t \beta) \\
  d_t & = &
  \left\{
    \begin{array}{ll}
      1 & \mathrm{for} \quad y_t = 0 \\
      0 & \mathrm{for} \quad y_t > 0
    \end{array}
  \right.
\end{eqnarray*}

Writing a \texttt{mle} block for this model is not difficult:
\begin{code}
mle ll = logprob
    series xb = exp(b0 + b1 * x)
    series d = (y==0)
    series poiprob = exp(-xb) * xb^y / gamma(y+1)
    series logprob = (alpha>0) && (alpha<1) ? \
      log(alpha*d + (1-alpha)*poiprob) : NA
    params alpha b0 b1
end mle -v
\end{code}

However, the code above has to be modified each time we change our
specification by, say, adding an explanatory variable. Using
functions, we can simplify this task considerably and eventually be
able to write something as easy as
\begin{code}
list X = const x
zip(y, X)
\end{code}

\begin{script}[htbp]
  \caption{Zero-inflated Poisson Model --- user-level function}
  \label{mle-zip-main}
\begin{scode}
/*
   user-level function: estimate the model and print out
   the results
*/
function void zip(series y, list X)
    matrix coef_stde = zip_estimate(y, X)
    printf "\nZero-inflated Poisson model:\n"
    string parnames = "alpha,"
    string parnames += varname(X)
    modprint coef_stde parnames
end function
\end{scode}
\end{script}
%$

Let's see how this can be done. First we need to define a function
called \texttt{zip()} that will take two arguments: a dependent
variable \texttt{y} and a list of explanatory variables \texttt{X}. An
example of such a function can be seen in Listing~\ref{mle-zip-main}.
By
inspecting the function code, you can see that the actual estimation
does not happen here: rather, the \texttt{zip()} function merely
uses the built-in \cmd{modprint} command to print out the results
coming from another user-written function, namely
\texttt{zip\_estimate()}.

\begin{script}[htbp]
  \caption{Zero-inflated Poisson Model --- internal functions}
  \label{mle-zip-inc}
\begin{scode}
/* compute log probabilities for the plain Poisson model */
function series ln_poi_prob(series y, list X, matrix beta)
    series xb = lincomb(X, beta)
    return -exp(xb) + y*xb - lngamma(y+1)
end function

/* compute log probabilities for the zero-inflated Poisson model */
function series ln_zip_prob(series y, list X, matrix beta, scalar p0)
    # check if the probability is in [0,1]; otherwise, return NA
    if p0 > 1 || p0 < 0
        series ret = NA
    else
        series ret = ln_poi_prob(y, X, beta) + ln(1-p0)
        series ret = y==0 ? ln(p0 + exp(ret)) : ret
    endif
    return ret
end function

/* do the actual estimation (silently) */
function matrix zip_estimate(series y, list X)
    # initialize alpha to a "sensible" value: half the frequency
    # of zeros in the sample
    scalar alpha = mean(y==0)/2
    # initialize the coeffs (we assume the first explanatory
    # variable is the constant here)
    matrix coef = zeros(nelem(X), 1)
    coef[1] = mean(y) / (1-alpha)
    # do the actual ML estimation
    mle ll = ln_zip_prob(y, X, coef, alpha)
        params alpha coef
    end mle --hessian --quiet
    return $coeff ~ $stderr
end function
\end{scode}
\end{script}
%$

The function \texttt{zip\_estimate()} is not meant to be executed
directly; it just contains the number-crunching part of the job, whose
results are then picked up by the calling function \texttt{zip()}. In
turn, \texttt{zip\_estimate()} calls other user-written functions to
perform other tasks.
The whole set of ``internal'' functions is shown in
script~\ref{mle-zip-inc}.

All the functions shown in scripts \ref{mle-zip-main} and
\ref{mle-zip-inc} can be stored in a separate \texttt{inp} file and
executed once, at the beginning of our job, by means of the
\texttt{include} command. Assuming the name of this script file is
\texttt{zip\_est.inp}, the following is an example script which
(a) includes the script file, (b) generates a simulated dataset,
and (c) performs the estimation of a ZIP model on the artificial data.

\begin{code}
set verbose off

# include the user-written functions
include zip_est.inp

# generate the artificial data
nulldata 1000
set seed 732237
scalar truep = 0.2
scalar b0 = 0.2
scalar b1 = 0.5
series x = normal()
series y = (uniform()<truep) ? 0 : randgen(p, exp(b0 + b1*x))
list X = const x

# estimate the zero-inflated Poisson model
zip(y, X)
\end{code}

The results are as follows:

\begin{code}
Zero-inflated Poisson model:

             coefficient   std. error   z-stat    p-value
  -------------------------------------------------------
  alpha       0.209738     0.0261746     8.013   1.12e-15  ***
  const       0.167847     0.0449693     3.732   0.0002    ***
  x           0.452390     0.0340836    13.27    3.32e-40  ***
\end{code}

A further step may then be creating a function package for accessing
your new \texttt{zip()} function via gretl's graphical interface. For
details on how to do this, see section \ref{sec:func-packages}.

\section{Advanced use of \texttt{mle}: functions, analytical
  derivatives, algorithm choice}
\label{sec:mle-adv}

All the techniques described in the previous sections may be combined,
and \cmd{mle} can be used for solving non-standard estimation problems
(provided, of course, that one chooses maximum likelihood as the
preferred inference method).
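Part of designing a reliable \cmd{mle} script is verifying the
loglikelihood code independently of the optimizer. For instance, the
ZIP probabilities of the previous section must sum to one over all
counts. The quick check below (written in Python rather than hansl,
purely for illustration; the values of $\mu$ and $\alpha$ are
arbitrary) mirrors the logic of \texttt{ln\_zip\_prob()}:

```python
import math

def zip_logprob(y, mu, alpha):
    """log P(y) for the zero-inflated Poisson:
    P(y) = alpha * 1{y=0} + (1 - alpha) * exp(-mu) * mu^y / y!
    (same logic as the hansl function ln_zip_prob)"""
    lp = math.log(1 - alpha) - mu + y * math.log(mu) - math.lgamma(y + 1)
    return math.log(alpha + math.exp(lp)) if y == 0 else lp

mu, alpha = 1.7, 0.2   # arbitrary illustrative values

# the probabilities must sum to one over k = 0, 1, 2, ...
total = sum(math.exp(zip_logprob(k, mu, alpha)) for k in range(100))
print(round(total, 10))  # -> 1.0

# and P(0) decomposes as alpha + (1 - alpha) * exp(-mu)
p0 = math.exp(zip_logprob(0, mu, alpha))
print(round(abs(p0 - (alpha + (1 - alpha) * math.exp(-mu))), 10))  # -> 0.0
```

Running such a check once, with fixed parameter values, is cheap
insurance against sign or scaling errors in the hand-written
log-probability.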
The strategy that, as of this writing, has proven most successful in
designing scripts for this purpose is:
\begin{itemize}
\item Modularize your code as much as possible.
\item Use analytical derivatives whenever possible.
\item Choose your optimization method wisely.
\end{itemize}

In the rest of this section, we will expand on the probit example of
section \ref{sec:anal-der} to give the reader an idea of what a
``heavy-duty'' application of \cmd{mle} looks like. Most of the code
fragments come from \verb|mle-advanced.inp|, which is one of the
sample scripts supplied with the standard installation of gretl
(see under \emph{File $>$ Script files $>$ Practice File}).

\subsection{BFGS with and without analytical derivatives}
\label{sec:mle-adv-bfgs}

The example in section \ref{sec:anal-der} can be made more general by
using matrices and user-written functions. Consider the following code
fragment:
\begin{code}
list X = const x1 x2 x3
matrix b = zeros(nelem(X),1)

mle logl = y*ln(P) + (1-y)*ln(1-P)
    series ndx = lincomb(X, b)
    series P = cnorm(ndx)
    params b
end mle
\end{code}

In this context, the fact that the model we are estimating has four
explanatory variables is totally incidental: the code is written in
such a way that we could change the content of the list \texttt{X}
without having to make any other modification. This was made possible
by:
\begin{enumerate}
\item gathering the parameters to estimate into a single vector $b$
  rather than using separate scalars;
\item using the \texttt{nelem()} function to initialize $b$, so that
  its dimension is kept track of automatically;
\item using the \texttt{lincomb()} function to compute the index
  function.
\end{enumerate}

A parallel enhancement could be achieved in the case of analytically
computed derivatives: since $b$ is now a vector, \cmd{mle} expects the
argument to the \cmd{deriv} keyword to be a matrix, in which each
column is the partial derivative with respect to the corresponding
element of $b$. It is useful to re-write the score for the $i$-th
observation as
\begin{equation}
  \label{eq:mle-probscore}
  \pder{\LogLik_i}{\beta} = m_i \mathbf{x}_i'
\end{equation}
where $m_i$ is the ``signed Mills' ratio'', that is
\[
  m_i = y_i \frac{\phi(\mathbf{x}_i'\beta)}{\Phi(\mathbf{x}_i'\beta)} -
  (1-y_i) \frac{\phi(\mathbf{x}_i'\beta)}{1 - \Phi(\mathbf{x}_i'\beta)} ,
\]
which was computed in section \ref{sec:anal-der} via
\begin{code}
series P = cnorm(ndx)
series m = dnorm(ndx)*(y/P - (1-y)/(1-P))
\end{code}
Here, we will code it in a somewhat terser way as
\begin{code}
series m = y ? invmills(-ndx) : -invmills(ndx)
\end{code}
making use of the conditional assignment operator and of the
specialized function \cmd{invmills()} for efficiency. Building the
score matrix is now easily achieved via
\begin{code}
mle logl = y*ln(P) + (1-y)*ln(1-P)
    series ndx = lincomb(X, b)
    series P = cnorm(ndx)
    series m = y ? invmills(-ndx) : -invmills(ndx)
    matrix mX = {X}
    deriv b = mX .* {m}
end mle
\end{code}
in which the \verb|{}| operator was used to turn series and lists into
matrices (see chapter \ref{chap:matrices}). However, proceeding in
this way for more complex models than probit may imply inserting into
the \cmd{mle} block a long series of instructions; the example above
merely happens to be short because the score matrix for the probit
model is very easy to write in matrix form.
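It is easy to confirm numerically that the terse one-liner above
really computes the score. The sketch below does so in Python rather
than hansl, purely for illustration; \texttt{invmills} is
re-implemented from its definition $\phi(z)/\Phi(-z)$, and the index
value is arbitrary. The signed Mills ratio should match the numerical
derivative of the per-observation loglikelihood with respect to the
index:

```python
import math

def Phi(z):  # standard normal cdf
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi(z):  # standard normal density
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def invmills(z):
    """Inverse Mills ratio phi(z) / Phi(-z), as gretl defines it."""
    return phi(z) / Phi(-z)

def loglik(y, ndx):
    """Per-observation probit loglikelihood as a function of the index."""
    return math.log(Phi(ndx)) if y else math.log(1.0 - Phi(ndx))

ndx, h = 0.3, 1e-6
for y in (0, 1):
    # signed Mills ratio, coded as in the hansl snippet above
    m = invmills(-ndx) if y else -invmills(ndx)
    # central-difference derivative of the loglikelihood wrt the index
    num = (loglik(y, ndx + h) - loglik(y, ndx - h)) / (2 * h)
    print(y, round(m - num, 6))  # difference -> 0.0 in both cases
```

The same two-sided check can be repeated at a few index values to
guard against sign errors in the two branches.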
A better solution is writing a user-level function to compute the
score and using that inside the \cmd{mle} block, as in
\begin{code}
function matrix score(matrix b, series y, list X)
    series ndx = lincomb(X, b)
    series m = y ? invmills(-ndx) : -invmills(ndx)
    return {m} .* {X}
end function

[...]

mle logl = y*ln(P) + (1-y)*ln(1-P)
    series ndx = lincomb(X, b)
    series P = cnorm(ndx)
    deriv b = score(b, y, X)
end mle
\end{code}
In this way, no matter how complex the computation of the score is,
the \cmd{mle} block remains nicely compact.

\subsection{Newton's method and the analytical Hessian}
\label{sec:mle-adv-hessian}

As mentioned above, gretl offers the user the option of using
Newton's method for maximizing the log-likelihood. In terms of the
notation used in section \ref{sec:mle-intro}, the direction for
updating the initial parameter vector $\theta_0$ is given by
\begin{equation}
  \label{eq:mle-newton}
  d\left[\mathbf{g}(\theta_0)\right] = -\lambda
  \mathbf{H}(\theta_0)^{-1}\mathbf{g}(\theta_0) ,
\end{equation}
where $\mathbf{H}(\theta)$ is the Hessian of the total loglikelihood
computed at $\theta$ and $0 < \lambda < 1$ is a scalar called the
\emph{step length}.

The above expression makes a few points clear:
\begin{enumerate}
\item At each step, it must be possible to compute not only the score
  $\mathbf{g}(\theta)$, but also its derivative $\mathbf{H}(\theta)$;
\item the matrix $\mathbf{H}(\theta)$ should be nonsingular;
\item it is assumed that for some positive value of $\lambda$,
  $\LogLik(\theta_1) > \LogLik(\theta_0)$; in other words, that going
  in the direction $d\left[\mathbf{g}(\theta_0)\right]$ leads upwards
  for some step length.
\end{enumerate}

The strength of Newton's method lies in the fact that, if the
loglikelihood is globally concave, then \eqref{eq:mle-newton} enjoys
certain optimality properties and the number of iterations required to
reach the maximum is often much smaller than it would be with other
methods, such as BFGS. However, it may have some disadvantages: for a
start, the Hessian $\mathbf{H}(\theta)$ may be difficult or very
expensive to compute; moreover, the loglikelihood may not be globally
concave, so for some values of $\theta$, the matrix
$\mathbf{H}(\theta)$ is not negative definite or perhaps even
singular. Those cases are handled by gretl's implementation of
Newton's algorithm by means of several heuristic
techniques\footnote{The gist of it is that, if $\mathbf{H}$ is not
  negative definite, it is substituted by $k \cdot
  \mathrm{dg}(\mathbf{H}) + (1-k) \cdot \mathbf{H}$, where $k$ is a
  suitable scalar; however, if you're interested in the precise
  details, you'll be much better off looking at the source code: the
  file you'll want to look at is \texttt{lib/src/gretl\_bfgs.c}.}, but
a number of adverse consequences may occur, which range from
\emph{longer} computation time for optimization to non-convergence of
the algorithm.

As a consequence, using Newton's method is advisable only when the
computation of the Hessian is not too CPU-intensive and the nature of
the estimator is such that it is known in advance that the
loglikelihood is globally concave. The probit model satisfies both
requisites, so we will expand the preceding example to illustrate how
to use Newton's method in gretl.
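As a stylized preview of the iteration in \eqref{eq:mle-newton}, the
following sketch runs Newton steps with step-length halving on a
one-parameter probit loglikelihood. It is written in Python rather
than hansl, purely for illustration, and the six-observation dataset
is made up (chosen so the outcomes overlap and the MLE is finite):

```python
import math

def Phi(z):  # standard normal cdf
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi(z):  # standard normal density
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

# tiny made-up sample: one regressor, binary outcome, no separation
x = [0.5, -1.2, 0.3, 2.0, -0.7, 1.1]
y = [1, 0, 0, 1, 1, 1]

def loglik(b):
    return sum(math.log(Phi(xi * b)) if yi else math.log(1.0 - Phi(xi * b))
               for xi, yi in zip(x, y))

def score_hess(b):
    g, H = 0.0, 0.0
    for xi, yi in zip(x, y):
        z = xi * b
        m = phi(z) / Phi(z) if yi else -phi(z) / (1.0 - Phi(z))
        g += m * xi                   # score g(b)
        H -= m * (m + z) * xi * xi    # Hessian H(b), negative definite
    return g, H

b, niter = 0.0, 0
while niter < 100:
    g, H = score_hess(b)
    if abs(g) < 1e-9:
        break                        # gradient ~ 0: converged
    d = -g / H                       # Newton direction
    lam = 1.0                        # step length lambda
    while loglik(b + lam * d) <= loglik(b) and lam > 1e-10:
        lam /= 2                     # shrink until the loglik improves
    b += lam * d
    niter += 1
print("converged in", niter, "iterations")
```

With a globally concave loglikelihood, the halving loop almost never
fires and the gradient dies off quadratically, which is the behavior
the text attributes to Newton's method.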
A first example may be given simply by issuing the command
\begin{code}
set optimizer newton
\end{code}
before the \cmd{mle} block.\footnote{To go back to BFGS, you use
  \cmd{set optimizer bfgs}.} This will instruct gretl to use
Newton's method instead of BFGS. If the \cmd{deriv} keyword is used,
gretl will differentiate the score function numerically; otherwise,
if the score itself has to be computed numerically, gretl will
calculate $\mathbf{H}(\theta)$ by differentiating the loglikelihood
numerically twice. The latter solution, though, is generally to be
avoided, as it may be extremely time-consuming and may yield
imprecise results.

A much better option is to calculate the Hessian analytically and have
gretl use its true value rather than a numerical approximation. In
most cases, this is both much faster and numerically stable, but of
course it comes at the price of having to differentiate the
loglikelihood twice with respect to the parameters and translate the
resulting expressions into efficient \app{hansl} code.

Luckily, both tasks are relatively easy in the probit case: the matrix
of second derivatives of $\LogLik_i$ may be written as
\[
  \pder{{}^2\LogLik_i}{\beta \partial \beta'} =
  - m_i \left( m_i + \mathbf{x}_i'\beta \right) \mathbf{x}_i \mathbf{x}_i'
\]
so the total Hessian is
\begin{equation}
  \label{eq:mle-tothess}
  \sum_{i=1}^n \pder{{}^2\LogLik_i}{\beta \partial \beta'} =
  - X' \left[
    \begin{array}{cccc}
      w_1 & & & \\
      & w_2 & & \\
      & & \ddots & \\
      & & & w_n
    \end{array}
  \right] X
\end{equation}
where $w_i = m_i \left( m_i + \mathbf{x}_i'\beta \right)$. It can be
shown that $w_i > 0$, so the Hessian is guaranteed to be negative
definite in all sensible cases and the conditions are ideal for
applying Newton's method.
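The two claims just made --- that $w_i > 0$ and that the second
derivative of $\LogLik_i$ with respect to the index equals $-w_i$ ---
are easy to check numerically. The sketch below (Python rather than
hansl, purely for illustration; the grid of index values is arbitrary)
compares $w_i$ with a second difference of the per-observation
loglikelihood:

```python
import math

def Phi(z):  # standard normal cdf
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi(z):  # standard normal density
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def mills(y, z):
    """Signed Mills ratio m_i at the index z = x_i'beta."""
    return phi(z) / Phi(z) if y else -phi(z) / (1.0 - Phi(z))

def ll(y, z):
    """Per-observation probit loglikelihood as a function of the index."""
    return math.log(Phi(z)) if y else math.log(1.0 - Phi(z))

h = 1e-4
for y in (0, 1):
    for z in (-2.0, -0.5, 0.0, 1.0, 3.0):
        m = mills(y, z)
        w = m * (m + z)                 # the weight w_i from the text
        # second difference of the loglikelihood in the index
        d2 = (ll(y, z + h) - 2 * ll(y, z) + ll(y, z - h)) / h**2
        assert w > 0 and abs(w + d2) < 1e-4   # w_i > 0 and d2 = -w_i
print("w_i > 0 and w_i = -d2(loglik) at all test points")
```

A check of this kind is worth running once whenever you derive a
Hessian by hand, before wiring it into \cmd{mle}.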
A \app{hansl} translation of equation \eqref{eq:mle-tothess} may look
like
\begin{code}
function void Hess(matrix *H, matrix b, series y, list X)
    /* computes the negative Hessian for a Probit model */
    series ndx = lincomb(X, b)
    series m = y ? invmills(-ndx) : -invmills(ndx)
    series w = m*(m+ndx)
    matrix mX = {X}
    H = (mX .* {w})'mX
end function
\end{code}

There are two characteristics of the function above worth noting. For
a start, it doesn't return anything: the result of the computation is
simply stored in the matrix pointed at by the first argument of the
function. Second, the result is not the Hessian proper, but rather its
negative. This function becomes usable from within an \cmd{mle} block
via the keyword \cmd{hessian}. The syntax is
\begin{code}
mle ...
    ...
    hessian funcname(&mat_addr, ...)
end mle
\end{code}
In other words, the \cmd{hessian} keyword must be followed by the call
to a function whose first argument is a matrix pointer which is
supposed to be filled with the \emph{negative} of the Hessian at
$\theta$.

We said above (section~\ref{sec:mle-intro}) that the covariance matrix
of the parameter estimates is by default estimated using the Outer
Product of the Gradient (so long as the log-likelihood function
returns the per-observation contributions). However, if you supply a
function that computes the Hessian then by default it is used in
estimating the covariance matrix. If you wish to impose the use of OPG
instead, append the \option{opg} option to the end of the \cmd{mle}
block.

Note that gretl does not perform any numerical check on whether a
user-supplied function computes the Hessian correctly. On the one
hand, this means that you can trick \cmd{mle} into using alternatives
to the Hessian and thereby implement other optimization methods.
For example, if you substitute in equation \eqref{eq:mle-newton} the
Hessian $\mathbf{H}$ with the negative of the OPG matrix
$-\mathbf{G}'\mathbf{G}$, as defined in \eqref{eq:OPGmat}, you get the
so-called BHHH optimization method (see \cite{bhhh74}). Again, the
sample file \verb|mle-advanced.inp| provides an example. On the other
hand, you may want to check your analytically-computed $\mathbf{H}$
matrix against a numerical approximation.

If you have a function that computes the score, this is relatively
simple to do by using the \cmd{fdjac} function, briefly described in
section \ref{sec:fdjac}, which computes a numerical approximation to a
derivative. In practice, you need a function that computes
$\mathbf{g}(\theta)$ as a row vector; you then use \cmd{fdjac} to
differentiate it numerically with respect to $\theta$. The result can
then be compared to your analytically-computed Hessian. The code
fragment below shows an example of how this can be done in the probit
case:
\begin{code}
function matrix totalscore(matrix *b, series y, list X)
    /* computes the total score */
    return sumc(score(b, y, X))
end function

function void check(matrix b, series y, list X)
    /* compares the analytical Hessian to its numerical
       approximation obtained via fdjac */
    matrix aH
    Hess(&aH, b, y, X) # stores the analytical Hessian into aH

    matrix nH = fdjac(b, "totalscore(&b, y, X)")
    nH = 0.5*(nH + nH') # force symmetry

    printf "Numerical Hessian\n%16.6f\n", nH
    printf "Analytical Hessian (negative)\n%16.6f\n", aH
    printf "Check (should be zero)\n%16.6f\n", nH + aH
end function
\end{code}

\section{Handling non-convergence gracefully}
\label{sec:mle-nonconv}

If the numerical aspects of the estimation procedure are complex, it
is possible that \cmd{mle} fails to find the maximum
within the number
of iterations stipulated via the \verb|bfgs_maxiter| state variable
(which defaults to 500).

In these cases, \cmd{mle} will exit with an error and it is up to the
user to handle the situation appropriately. For example, it is
possible that \cmd{mle} is used inside a loop and you don't want the
loop to stop in case convergence is not achieved. The \cmd{catch}
command modifier (see also the \GCR) is an excellent tool for this
purpose.

The example provided in listing \ref{ex:catch-mle} illustrates the use
of \cmd{catch} in an artificially simple context: we use the \cmd{mle}
command for estimating the mean and variance of a Gaussian random
variable (of course you don't need the \cmd{mle} apparatus for this,
but it makes for a nice example). The gist of the example is using the
\verb|set bfgs_maxiter| command to force \cmd{mle} to abort after a
very small number of iterations, so that you can get an idea of how to
use the \cmd{catch} modifier and the associated \dollar{error}
accessor to handle the situation.

You may want to increase the maximum number of BFGS iterations in the
example to check what happens if the algorithm is allowed to
converge. Note that, upon successful completion of \cmd{mle}, a bundle
named \dollar{model} is available, containing several quantities that
may be of interest, including the total number of function
evaluations.
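It also helps to know where the algorithm should end up: for this
Gaussian likelihood the ML estimates have well-known closed forms, the
sample mean and the mean squared deviation. The sketch below verifies
that these values maximize the loglikelihood used in the listing; it
is written in Python rather than hansl, purely for illustration, with
an arbitrary seed and the same sample size and population parameters
as the listing:

```python
import math
import random

# simulated N(3, 4) sample, mirroring the listing's setup
random.seed(42)
x = [random.gauss(3.0, 2.0) for _ in range(200)]

# closed-form ML estimates: sample mean and mean squared deviation
m_hat = sum(x) / len(x)
s2_hat = sum((xi - m_hat) ** 2 for xi in x) / len(x)

def loglik(m, s2):
    # same per-observation form as in the listing:
    # -0.5 * (log(2*pi) + log(s2) + (x - m)^2 / s2)
    return sum(-0.5 * (math.log(2 * math.pi) + math.log(s2)
                       + (xi - m) ** 2 / s2) for xi in x)

best = loglik(m_hat, s2_hat)
# moving either parameter away from the closed form lowers the loglik
for dm, ds in ((0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)):
    assert loglik(m_hat + dm, s2_hat + ds) < best
print("closed-form ML estimates maximize the loglikelihood")
```

When \cmd{mle} does converge in the listing, its estimates of
\texttt{m} and \texttt{s2} should agree with these closed-form values
up to numerical tolerance.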
\begin{script}[htbp]
  \caption{Handling non-convergence via \cmd{catch}}
  \label{ex:catch-mle}
\begin{scodebit}
set verbose off
nulldata 200
set seed 8118

# generate simulated data from a N(3,4) variate
series x = normal(3,2)

# set starting values
scalar m = 0
scalar s2 = 1

# set iteration limit to a ridiculously low value
set bfgs_maxiter 10

# perform ML estimation; note the "catch" modifier
catch mle loglik = -0.5* (log(2*$pi) + log(s2) + e2/s2)
    series e2 = (x - m)^2
    params m s2
end mle --quiet

# grab the error and proceed as needed
err = $error
if err
    printf "Not converged! (m = %g, s2 = %g)\n", m, s2
else
    printf "Converged after %d iterations\n", $model.grcount
    cs = $coeff ~ sqrt(diag($vcv))
    pn = "m s2"
    modprint cs pn
endif
\end{scodebit}
\end{script}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "gretl-guide"
%%% End: