## "Fossies" - the Fresh Open Source Software Archive

### Member "gretl-2019c/doc/tex/mle.tex" (26 Feb 2019, 46599 Bytes) of package /linux/misc/gretl-2019c.tar.xz:

As a special service "Fossies" has tried to format the requested source page into HTML format using (guessed) TeX and LaTeX source code syntax highlighting (style: standard) with prefixed line numbers. Alternatively you can here view or download the uninterpreted source code file.

    1 \chapter{Maximum likelihood estimation}
2 \label{chap:mle}
3
4 \section{Generic ML estimation with gretl}
5 \label{sec:mle-intro}
6
7
8 Maximum likelihood estimation is a cornerstone of modern inferential
9 procedures. Gretl provides a way to implement this method for a wide
10 range of estimation problems, by use of the \texttt{mle} command. We
11 give here a few examples.
12
13 To give a foundation for the examples that follow, we start from a
14 brief reminder on the basics of ML estimation.  Given a sample of size
15 $T$, it is possible to define the density function\footnote{We are
16   supposing here that our data are a realization of continuous random
17   variables. For discrete random variables, everything continues to
18   apply by referring to the probability function instead of the
19   density. In both cases, the distribution may be conditional on some
20   exogenous variables.} for the whole sample, namely the joint
21 distribution of all the observations $f(\mathbf{Y} ; \theta)$, where
22 $\mathbf{Y} = \left\{ y_1, \ldots, y_T \right\}$.  Its shape is
23 determined by a $k$-vector of unknown parameters $\theta$, which we
24 assume is contained in a set $\Theta$, and which can be used to
25 evaluate the probability of observing a sample with any given
26 characteristics.
27
28 After observing the data, the values $\mathbf{Y}$ are given, and this
29 function can be evaluated for any legitimate value of $\theta$. In
30 this case, we prefer to call it the \emph{likelihood} function; the
31 need for another name stems from the fact that this function works as
32 a density when we use the $y_t$s as arguments and $\theta$ as
33 parameters, whereas in this context $\theta$ is taken as the
34 function's argument, and the data $\mathbf{Y}$ only have the role of
35 determining its shape.
36
37 In standard cases, this function has a unique maximum.  The location
38 of the maximum is unaffected if we consider the logarithm of the
39 likelihood (or log-likelihood for short): this function will be
40 denoted as
41 $42 \LogLik(\theta) = \log f(\mathbf{Y}; \theta) 43$
44 The log-likelihood functions that gretl can handle are those where
45 $\LogLik(\theta)$ can be written as
46 $47 \LogLik(\theta) = \sum_{t=1}^T \ell_t(\theta) 48$
49 which is true in most cases of interest. The functions $\ell_t(\theta)$
50 are called the log-likelihood contributions.
51
52 Moreover, the location of the maximum is obviously determined by the
53 data $\mathbf{Y}$. This means that the value
54 \begin{equation}
55   \label{eq:maxlik}
56   \hat{\theta}(\mathbf{Y}) = \argmax_{\theta \in \Theta} \LogLik(\theta)
57 \end{equation}
58 is some function of the observed data (a statistic), which has the
59 property, under mild conditions, of being a consistent, asymptotically
60 normal and asymptotically efficient estimator of $\theta$.
61
62 Sometimes it is possible to write down explicitly the function
63 $\hat{\theta}(\mathbf{Y})$; in general, it need not be so. In these
64 circumstances, the maximum can be found by means of numerical
65 techniques. These often rely on the fact that the log-likelihood is a
66 smooth function of $\theta$, and therefore on the maximum its partial
67 derivatives should all be 0.  The \textsl{gradient vector}, or
68 \textsl{score vector}, is a function that enjoys many interesting
69 statistical properties in its own right; it will be denoted here as
70 $\mathbf{g}(\theta)$.  It is a $k$-vector with typical element
71 $72 g_i(\theta) = \frac{\partial\LogLik(\theta)}{\partial\theta_i} 73 = \sum_{t=1}^T \frac{\partial\ell_t(\theta)}{\partial\theta_i} 74$
75
76 Gradient-based methods can be briefly illustrated as follows:
77
78 \begin{enumerate}
79 \item pick a point $\theta_0 \in \Theta$;
80 \item evaluate $\mathbf{g}(\theta_0)$;
81 \item if $\mathbf{g}(\theta_0)$ is small'', stop. Otherwise, compute
82   a direction vector $d(\mathbf{g}(\theta_0))$;
83 \item evaluate $\theta_1 = \theta_0 + d(\mathbf{g}(\theta_0))$;
84 \item substitute $\theta_0$ with $\theta_1$;
85 \item restart from 2.
86 \end{enumerate}
87
88 Many algorithms of this kind exist; they basically differ from one
89 another in the way they compute the direction vector
90 $d(\mathbf{g}(\theta_0))$, to ensure that $\LogLik(\theta_1) > 91 \LogLik(\theta_0)$ (so that we eventually end up on the maximum).
92
93 The default method gretl uses to maximize the log-likelihood is a
94 gradient-based algorithm known as the \textbf{BFGS} (Broyden,
95 Fletcher, Goldfarb and Shanno) method. This technique is used in most
96 econometric and statistical packages, as it is well-established and
97 remarkably powerful. Clearly, in order to make this technique
98 operational, it must be possible to compute the vector
99 $\mathbf{g}(\theta)$ for any value of $\theta$. In some cases this
100 vector can be written explicitly as a function of $\mathbf{Y}$. If
101 this is not possible or too difficult the gradient may be evaluated
102 numerically. The alternative \textbf{Newton-Raphson} algorithm is also
103 available. This method is more effective under some circumstances but
104 is also more fragile; see section \ref{sec:mle-adv} and chapter
105 \ref{chap:numerical} for details.\footnote{Note that some of the
106   statements made below (for example, regarding estimation of the
107   covariance matrix) have to be modified when Newton's method is
108   used.}
109
110 The choice of the starting value, $\theta_0$, is crucial in some contexts
111 and inconsequential in others. In general, however, it is
112 advisable to start the algorithm from sensible'' values whenever
113 possible. If a consistent estimator is available, this is usually a
114 safe and efficient choice: this ensures that in large samples the
115 starting point will be likely close to $\hat{\theta}$ and convergence
116 can be achieved in few iterations.
117
118 The maximum number of iterations allowed for the BFGS procedure, and
119 the relative tolerance for assessing convergence, can be adjusted
120 using the \cmd{set} command: the relevant variables are
121 \verb+bfgs_maxiter+ (default value 500) and \verb+bfgs_toler+ (default
122 value, the machine precision to the power 3/4).
123
124 \subsection{Covariance matrix and standard errors}
125
126 By default the covariance matrix of the parameter estimates is
127 based on the Outer Product of the Gradient (OPG).  That is,
128 \begin{equation}
129   \label{eq:OPGmat}
130   \widehat{\mbox{Var}}_{\mbox{\scriptsize OPG}}(\hat{\theta}) =
131   \left(G'(\hat{\theta}) G(\hat{\theta}) \right)^{-1}
132 \end{equation}
133 where $G(\hat{\theta})$ is the $T \times k$ matrix of contributions to
134 the gradient.  Two other options are available.  If the
135 \option{hessian} flag is given, the covariance matrix is computed from
136 a numerical approximation to the Hessian at convergence.  If the
137 \option{robust} option is selected, the quasi-ML sandwich''
138 estimator is used:
139 $140 \widehat{\mbox{Var}}_{\mbox{\scriptsize QML}}(\hat{\theta}) = H(\hat{\theta})^{-1} 141 G'(\hat{\theta}) G(\hat{\theta}) H(\hat{\theta})^{-1} 142$
143 where $H$ denotes the numerical approximation to the Hessian.
144
145 Cluster-robust estimation is also available: in order to activate it,
146 use the \option{cluster=}\emph{\texttt{clustvar}}, where \cmd{clustvar}
147 should be a discrete series. See section \ref{sec:vcv-cluster} for
148 more details.
149
150 Note, however, that if the log-likelihood function supplied by the
151 user just returns a scalar value---as opposed to a series or vector
152 holding per-observation contributions---then the OPG method is not
153 applicable and so the covariance matrix must be estimated via a
154 numerical approximation to the Hessian.
155
156 \section{Gamma estimation}
157 \label{sec:ml-gamma}
158
159 Suppose we have a sample of $T$ independent and identically
160 distributed observations from a Gamma distribution. The density
161 function for each observation $x_t$ is
162 \begin{equation}
164   f(x_t) = \frac{\alpha^p}{\Gamma(p)} x_t^{p-1} \exp\left({-\alpha
165       x_t}\right)
166 \end{equation}
167 The log-likelihood for the entire sample can be written as the
168 logarithm of the joint density of all the observations. Since these
169 are independent and identical, the joint density is the product of the
170 individual densities, and hence its log is
171 \begin{equation}
172   \label{eq:gammaloglik}
173   \LogLik(\alpha, p) = \sum_{t=1}^T \log \left[ \frac{\alpha^p}{\Gamma(p)} x_t^{p-1} \exp\left({-\alpha
174       x_t}\right) \right] =
175       \sum_{t=1}^T \ell_t
176 \end{equation}
177 where
178 $179 \ell_t = p \cdot \log (\alpha x_t) - \gamma(p) - \log x_t - \alpha x_t 180$
181 and $\gamma(\cdot)$ is the log of the gamma function.  In order to
182 estimate the parameters $\alpha$ and $p$ via ML, we need to maximize
183 (\ref{eq:gammaloglik}) with respect to them. The corresponding
184 gretl code snippet is
185
186 \begin{code}
187 scalar alpha = 1
188 scalar p = 1
189
190 mle logl =  p*ln(alpha * x) - lngamma(p) - ln(x) - alpha * x
191   params alpha p
192 end mle
193 \end{code}
194
195 The first two statements
196
197 \begin{code}
198 alpha = 1
199 p = 1
200 \end{code}
201
202 are necessary to ensure that the variables \texttt{alpha} and
203 \texttt{p} exist before the computation of \texttt{logl} is
204 attempted. Inside the \texttt{mle} block these variables (which could
205 be either scalars, vectors or a combination of the two --- see below
206 for an example) are identified as the parameters that should be
207 adjusted to maximize the likelihood via the \texttt{params} keyword.
208 Their values will be changed by the execution of the \texttt{mle}
209 command; upon successful completion, they will be replaced by the ML
210 estimates. The starting value is 1 for both; this is arbitrary and
211 does not matter much in this example (more on this later).
212
213 The above code can be made more readable, and marginally more
214 efficient, by defining a variable to hold $\alpha \cdot x_t$. This
215 command can be embedded in the \texttt{mle} block as follows:
216 \begin{code}
217 mle logl =  p*ln(ax) - lngamma(p) - ln(x) - ax
218   series ax = alpha*x
219   params alpha p
220 end mle
221 \end{code}
222 The variable \texttt{ax} is not added to the \texttt{params} list,
223 of course, since it is just an auxiliary variable to facilitate
224 the calculations.  You can insert as many such auxiliary lines
225 as you require before the \texttt{params} line, with the restriction
226 that they must contain either (a) commands to generate series,
227 scalars or matrices or (b) print commands (which may be used to
228 aid in debugging).
229
230 In a simple example like this, the choice of the starting values is
231 almost inconsequential; the algorithm is likely to converge no
232 matter what the starting values are. However, consistent
233 method-of-moments estimators of $p$ and $\alpha$ can be simply
234 recovered from the sample mean $m$ and variance $V$: since it can be
235 shown that
236 $237 E(x_t) = p/\alpha \qquad V(x_t) = p/\alpha^2 238$
239 it follows that the following estimators
240 \begin{eqnarray*}
241   \bar{\alpha} & = &  m/V \\
242   \bar{p} & = & m \cdot \bar{\alpha}
243 \end{eqnarray*}
244 are consistent, and therefore suitable to be used as starting point
245 for the algorithm.  The gretl script code then becomes
246 \begin{code}
247 scalar m = mean(x)
248 scalar alpha = m/var(x)
249 scalar p = m*alpha
250
251 mle logl =  p*ln(ax) - lngamma(p) - ln(x) - ax
252   series ax = alpha*x
253   params alpha p
254 end mle
255 \end{code}
256
257 Another thing to note is that sometimes parameters are constrained
258 within certain boundaries: in this case, for example, both $\alpha$
259 and $p$ must be positive numbers. Gretl does not check for this:
260 it is the user's responsibility to ensure that the function is
261 always evaluated at an admissible point in the parameter space during
262 the iterative search for the maximum. An effective technique is to
263 define a variable for checking that the parameters are admissible and
264 setting the log-likelihood as undefined if the check fails. An
265 example, which uses the conditional assignment operator, follows:
266 \begin{code}
267 scalar m = mean(x)
268 scalar alpha = m/var(x)
269 scalar p = m*alpha
270
271 mle logl = check ? p*ln(ax) - lngamma(p) - ln(x) - ax : NA
272   series ax = alpha*x
273   scalar check = (alpha>0) && (p>0)
274   params alpha p
275 end mle
276 \end{code}
277
278 \section{Stochastic frontier cost function}
279 \label{sec:frontier}
280
281 \emph{%
282 Note: this section has the only purpose of illustrating the \cmd{mle}
283 command. For the estimation of stochastic frontier cost or production
284 functions, you may want to use the \package{frontier} function package.}
285
286 When modeling a cost function, it is sometimes worthwhile to
287 incorporate explicitly into the statistical model the notion that
288 firms may be inefficient, so that the observed cost deviates from the
289 theoretical figure not only because of unobserved heterogeneity
290 between firms, but also because two firms could be operating at a
291 different efficiency level, despite being identical under all other
292 respects. In this case we may write
293 $294 C_i = C^*_i + u_i + v_i 295$
296 where $C_i$ is some variable cost indicator, $C_i^*$ is its
297 theoretical'' value, $u_i$ is a zero-mean disturbance term and $v_i$
298 is the inefficiency term, which is supposed to be nonnegative by its
299 very nature. A linear specification for $C_i^*$ is often chosen. For
300 example, the Cobb--Douglas cost function arises when $C_i^*$ is a
301 linear function of the logarithms of the input prices and the output
302 quantities.
303
304 The \emph{stochastic frontier} model is a linear model of the form
305 $y_i = x_i \beta + \varepsilon_i$ in which the error term
306 $\varepsilon_i$ is the sum of $u_i$ and $v_i$.
307
308 A common postulate is that $u_i \sim N(0,\sigma_u^2)$ and
309 $v_i \sim \left|N(0,\sigma_v^2)\right|$. If independence between $u_i$
310 and $v_i$ is also assumed, then it is possible to show that the
311 density function of $\varepsilon_i$ has the form:
312 \begin{equation}
313   \label{eq:frontdens}
314   f(\varepsilon_i) =
315    \sqrt{\frac{2}{\pi}}
316    \Phi\left(\frac{\lambda \varepsilon_i}{\sigma}\right)
317    \frac{1}{\sigma} \phi\left(\frac{\varepsilon_i}{\sigma}\right)
318 \end{equation}
319 where $\Phi(\cdot)$ and $\phi(\cdot)$ are, respectively, the distribution and density
320 function of the standard normal, $\sigma = 321 \sqrt{\sigma^2_u + \sigma^2_v}$ and $\lambda = \frac{\sigma_u}{\sigma_v}$.
322
323 As a consequence, the log-likelihood for one observation takes the
324 form (apart form an irrelevant constant)
325 $326 \ell_t = 327 \log\Phi\left(\frac{\lambda \varepsilon_i}{\sigma}\right) - 328 \left[ \log(\sigma) + \frac{\varepsilon_i^2}{2 \sigma^2} \right] 329$
330 Therefore, a Cobb--Douglas cost function with stochastic frontier is the
331 model described by the following equations:
332 \begin{eqnarray*}
333   \log C_i & = & \log C^*_i + \varepsilon_i \\
334   \log C^*_i & = & c + \sum_{j=1}^m \beta_j \log y_{ij} + \sum_{j=1}^n \alpha_j \log p_{ij} \\
335   \varepsilon_i & = & u_i + v_i \\
336   u_i & \sim & N(0,\sigma_u^2) \\
337   v_i & \sim & \left|N(0,\sigma_v^2)\right|
338 \end{eqnarray*}
339
340 In most cases, one wants to ensure that the homogeneity of the cost
341 function with respect to the prices holds by construction. Since this
342 requirement is equivalent to $\sum_{j=1}^n \alpha_j = 1$, the above
343 equation for $C^*_i$ can be rewritten as
344
345 \begin{equation}
346   \label{eq:CobbDouglasFrontier}
347   \log C_i - \log p_{in}  = c + \sum_{j=1}^m \beta_j \log y_{ij} +
348   \sum_{j=2}^n \alpha_j (\log p_{ij} - \log p_{in})  + \varepsilon_i
349 \end{equation}
350
351 The above equation could be estimated by OLS, but it would suffer from
352 two drawbacks: first, the OLS estimator for the intercept $c$ is
353 inconsistent because the disturbance term has a non-zero expected
354 value; second, the OLS estimators for the other parameters are
355 consistent, but inefficient in view of the non-normality of
356 $\varepsilon_i$. Both issues can be addressed by estimating
357 (\ref{eq:CobbDouglasFrontier}) by maximum likelihood. Nevertheless,
358 OLS estimation is a quick and convenient way to provide starting
359 values for the MLE algorithm.
360
361 Listing \ref{cost-estimation} shows how to implement the model
362 described so far. The \texttt{banks91} file contains part of the data
363 used in \citet*{lucchetti01}.
364
365 \begin{script}[htbp]
366   \caption{Estimation of stochastic frontier cost function (with
367     scalar parameters)}
368   \label{cost-estimation}
369 \begin{scode}
370 open banks91.gdt
371
372 # transformations
373 series cost = ln(VC)
374 series q1 = ln(Q1)
375 series q2 = ln(Q2)
376 series p1 = ln(P1)
377 series p2 = ln(P2)
378 series p3 = ln(P3)
379
380 # Cobb-Douglas cost function with homogeneity restrictions
381 # (for initialization)
382 genr rcost = cost - p1
383 genr rp2 = p2 - p1
384 genr rp3 = p3 - p1
385
386 ols rcost const q1 q2 rp2 rp3
387
388 # Cobb-Douglas cost function with homogeneity restrictions
389 # and inefficiency
390
391 scalar b0 = $coeff(const) 392 scalar b1 =$coeff(q1)
393 scalar b2 = $coeff(q2) 394 scalar b3 =$coeff(rp2)
395 scalar b4 = $coeff(rp3) 396 397 scalar su = 0.1 398 scalar sv = 0.1 399 400 mle logl = ln(cnorm(e*lambda/ss)) - (ln(ss) + 0.5*(e/ss)^2) 401 scalar ss = sqrt(su^2 + sv^2) 402 scalar lambda = su/sv 403 series e = rcost - b0*const - b1*q1 - b2*q2 - b3*rp2 - b4*rp3 404 params b0 b1 b2 b3 b4 su sv 405 end mle 406 \end{scode} 407 \end{script} 408 409 The script in example \ref{cost-estimation} is relatively easy to 410 modify to show how one can use vectors (that is, 1-dimensional 411 matrices) for storing the parameters to optimize: example 412 \ref{cost-estimation-vec} holds essentially the same script in which 413 the parameters of the cost function are stored together in a 414 vector. Of course, this makes also possible to use variable lists and 415 other refinements which make the code more compact and readable. 416 417 \begin{script}[htbp] 418 \caption{Estimation of stochastic frontier cost function (with 419 matrix parameters)} 420 \label{cost-estimation-vec} 421 \begin{scode} 422 open banks91.gdt 423 424 # transformations 425 series cost = ln(VC) 426 series q1 = ln(Q1) 427 series q2 = ln(Q2) 428 series p1 = ln(P1) 429 series p2 = ln(P2) 430 series p3 = ln(P3) 431 432 # Cobb-Douglas cost function with homogeneity restrictions 433 # (for initialization) 434 genr rcost = cost - p1 435 genr rp2 = p2 - p1 436 genr rp3 = p3 - p1 437 list X = const q1 q2 rp2 rp3 438 439 ols rcost X 440 X = const q1 q2 rp2 rp3 441 # Cobb-Douglas cost function with homogeneity restrictions 442 # and inefficiency 443 444 matrix b =$coeff
445 scalar su = 0.1
446 scalar sv = 0.1
447
448 mle logl = ln(cnorm(e*lambda/ss)) - (ln(ss) + 0.5*(e/ss)^2)
449   scalar ss = sqrt(su^2 + sv^2)
450   scalar lambda = su/sv
451   series e = rcost - lincomb(X, b)
452   params b su sv
453 end mle
454 \end{scode}
455 \end{script}
456 %$457 \section{GARCH models} 458 \label{sec:garch} 459 460 GARCH models are handled by gretl via a native function. However, it is 461 instructive to see how they can be estimated through the \texttt{mle} 462 command.\footnote{The \package{gig} addon, which handles other variants 463 of conditionally heteroskedastic models, uses \texttt{mle} as its 464 internal engine.} 465 466 The following equations provide the simplest example of a GARCH(1,1) 467 model: 468 \begin{eqnarray*} 469 y_t & = & \mu + \varepsilon_t \\ 470 \varepsilon_t & = & u_t \cdot \sigma_t \\ 471 u_t & \sim & N(0,1) \\ 472 h_t & = & \omega + \alpha \varepsilon^2_{t-1} + \beta h_{t-1}. 473 \end{eqnarray*} 474 Since the variance of$y_t$depends on past values, writing down the 475 log-likelihood function is not simply a matter of summing the log 476 densities for individual observations. As is common in time series 477 models,$y_t$cannot be considered independent of the other 478 observations in our sample, and consequently the density function for 479 the whole sample (the joint density for all observations) is not just 480 the product of the marginal densities. 481 482 Maximum likelihood estimation, in these cases, is achieved by 483 considering \emph{conditional} densities, so what we maximize is a 484 conditional likelihood function. If we define the information set at 485 time$t$as 486 $487 F_t = \left\{ y_t, y_{t-1}, \ldots \right\} , 488$ 489 then the density of$y_t$conditional on$F_{t-1}$is normal: 490 $491 y_t | F_{t-1} \sim N\left[ \mu, h_{t} \right]. 492$ 493 494 By means of the properties of conditional distributions, the joint 495 density can be factorized as follows 496 $497 f(y_t, y_{t-1}, \ldots) = \left[ \prod_{t=1}^T f(y_t |F_{t-1}) 498 \right] \cdot f(y_0) 499$ 500 If we treat$y_0$as fixed, then the term$f(y_0)$does not depend on 501 the unknown parameters, and therefore the conditional log-likelihood 502 can then be written as the sum of the individual contributions as 503 \begin{equation} 504 \label{eq:garchloglik} 505 \LogLik(\mu,\omega,\alpha,\beta) = \sum_{t=1}^T \ell_t 506 \end{equation} 507 where 508 $509 \ell_t = \log \left[ \frac{1}{\sqrt{h_t}} \phi\left( \frac{y_t - \mu}{\sqrt{h_t}} 510 \right) \right] = 511 - \frac{1}{2} \left[ \log(h_t) + \frac{(y_t - \mu)^2}{h_t} \right] 512$ 513 514 The following script shows a simple application of this technique, 515 which uses the data file \texttt{djclose}; it is one of the example 516 dataset supplied with gretl and contains daily data from the Dow Jones 517 stock index. 518 519 \begin{code} 520 open djclose 521 522 series y = 100*ldiff(djclose) 523 524 scalar mu = 0.0 525 scalar omega = 1 526 scalar alpha = 0.4 527 scalar beta = 0.0 528 529 mle ll = -0.5*(log(h) + (e^2)/h) 530 series e = y - mu 531 series h = var(y) 532 series h = omega + alpha*(e(-1))^2 + beta*h(-1) 533 params mu omega alpha beta 534 end mle 535 \end{code} 536 537 \section{Analytical derivatives} 538 \label{sec:anal-der} 539 540 Computation of the score vector is essential for the working of the 541 BFGS method. In all the previous examples, no explicit formula for the 542 computation of the score was given, so the algorithm was fed 543 numerically evaluated gradients. Numerical computation of the score for 544 the$i$-th parameter is performed via a finite approximation of the 545 derivative, namely 546 $547 \pder{\LogLik(\theta_1, \ldots, \theta_n)}{\theta_i} \simeq 548 \frac{\LogLik(\theta_1, \ldots, \theta_i + h, \ldots, \theta_n) - 549 \LogLik(\theta_1, \ldots, \theta_i - h, \ldots, \theta_n)}{2h} 550$ 551 where$h$is a small number. 552 553 In many situations, this is rather efficient and accurate. A better 554 approximation to the true derivative may be obtained by forcing 555 \cmd{mle} to use a technique known as \emph{Richardson Extrapolation}, 556 which gives extremely precise results, but is considerably more 557 CPU-intensive. This feature may be turned on by using the \cmd{set} 558 command as in 559 \begin{code} 560 set bfgs_richardson on 561 \end{code} 562 563 However, one might want to avoid the approximation and specify an 564 exact function for the derivatives. As an example, consider the 565 following script: 566 % 567 \begin{code} 568 nulldata 1000 569 570 genr x1 = normal() 571 genr x2 = normal() 572 genr x3 = normal() 573 574 genr ystar = x1 + x2 + x3 + normal() 575 genr y = (ystar > 0) 576 577 scalar b0 = 0 578 scalar b1 = 0 579 scalar b2 = 0 580 scalar b3 = 0 581 582 mle logl = y*ln(P) + (1-y)*ln(1-P) 583 series ndx = b0 + b1*x1 + b2*x2 + b3*x3 584 series P = cnorm(ndx) 585 params b0 b1 b2 b3 586 end mle --verbose 587 \end{code} 588 589 Here, 1000 data points are artificially generated for an ordinary 590 probit model:\footnote{Again, gretl does provide a native 591 \texttt{probit} command (see section \ref{sec:logit-probit}), but a 592 probit model makes for a nice example here.}$y_t$is a binary 593 variable, which takes the value 1 if$y_t^* = \beta_1 x_{1t} + \beta_2
594 x_{2t} + \beta_3 x_{3t} + \varepsilon_t > 0$and 0 otherwise. 595 Therefore,$y_t = 1$with probability$\Phi(\beta_1 x_{1t} + \beta_2
596 x_{2t} + \beta_3 x_{3t}) = \pi_t$. The probability function for one 597 observation can be written as 598 $599 P(y_t) = \pi_t^{y_t} ( 1 -\pi_t )^{1-y_t} 600$ 601 Since the observations are independent and identically distributed, 602 the log-likelihood is simply the sum of the individual 603 contributions. Hence 604 $605 \LogLik = \sum_{t=1}^T y_t \log(\pi_t) + (1 - y_t) \log(1 - \pi_t) 606$ 607 The \option{verbose} switch at the end of the \texttt{end mle} 608 statement produces a detailed account of the iterations done by the 609 BFGS algorithm. 610 611 In this case, numerical differentiation works rather well; 612 nevertheless, computation of the analytical score is straightforward, 613 since the derivative$\pder{\LogLik}{\beta_i}$can be written as 614 $615 \pder{\LogLik}{\beta_i} = \pder{\LogLik}{\pi_t} \cdot \pder{\pi_t}{\beta_i} 616$ 617 via the chain rule, and it is easy to see that 618 \begin{eqnarray*} 619 \pder{\LogLik}{\pi_t} & = & \frac{y_t}{\pi_t} - \frac{1 - y_t}{1 - 620 \pi_t} \\ 621 \pder{\pi_t}{\beta_i} & = & \phi(\beta_1 x_{1t} + \beta_2 x_{2t} + 622 \beta_3 x_{3t}) \cdot x_{it} 623 \end{eqnarray*} 624 625 The \texttt{mle} block in the above script can therefore be modified 626 as follows: 627 % 628 \begin{code} 629 mle logl = y*ln(P) + (1-y)*ln(1-P) 630 series ndx = b0 + b1*x1 + b2*x2 + b3*x3 631 series P = cnorm(ndx) 632 series m = dnorm(ndx)*(y/P - (1-y)/(1-P)) 633 deriv b0 = m 634 deriv b1 = m*x1 635 deriv b2 = m*x2 636 deriv b3 = m*x3 637 end mle --verbose 638 \end{code} 639 640 Note that the \texttt{params} statement has been replaced by a series 641 of \texttt{deriv} statements; these have the double function of 642 identifying the parameters over which to optimize and providing an 643 analytical expression for their respective score elements. 644 645 \section{Debugging ML scripts} 646 \label{sec:mle-debug} 647 648 We have discussed above the main sorts of statements that are 649 permitted within an \texttt{mle} block, namely 650 % 651 \begin{itemize} 652 \item auxiliary commands to generate helper variables; 653 \item \texttt{deriv} statements to specify the gradient with respect 654 to each of the parameters; and 655 \item a \texttt{params} statement to identify the parameters in case 656 analytical derivatives are not given. 657 \end{itemize} 658 659 For the purpose of debugging ML estimators one additional sort of 660 statement is allowed: you can print the value of a relevant variable 661 at each step of the iteration. This facility is more restricted then 662 the regular \texttt{print} command. The command word \texttt{print} 663 should be followed by the name of just one variable (a scalar, series 664 or matrix). 665 666 In the last example above a key variable named \texttt{m} was 667 generated, forming the basis for the analytical derivatives. To track 668 the progress of this variable one could add a print statement within 669 the ML block, as in 670 % 671 \begin{code} 672 series m = dnorm(ndx)*(y/P - (1-y)/(1-P)) 673 print m 674 \end{code} 675 676 \section{Using functions} 677 \label{sec:mle-func} 678 679 The \texttt{mle} command allows you to estimate models that 680 gretl does not provide natively: in some cases, it may be a good 681 idea to wrap up the \texttt{mle} block in a user-defined function (see 682 Chapter \ref{chap:functions}), so as to extend gretl's 683 capabilities in a modular and flexible way. 684 685 As an example, we will take a simple case of a model that gretl 686 does not yet provide natively: the zero-inflated Poisson model, or ZIP 687 for short.\footnote{The actual ZIP model is in fact a bit more general 688 than the one presented here. The specialized version discussed in this 689 section was chosen for the sake of simplicity. For futher details, see 690 \cite{greene03}.} In this model, we assume that we observe a mixed 691 population: for some individuals, the variable$y_t$is (conditionally 692 on a vector of exogenous covariates$x_t$) distributed as a Poisson 693 random variate; for some others,$y_t$is identically 0. The trouble 694 is, we don't know which category a given individual belongs to. 695 696 For instance, suppose we have a sample of women, and the variable$y_t$697 represents the number of children that woman$t$has. There may be a 698 certain proportion,$\alpha$, of women for whom$y_t = 0$with certainty 699 (maybe out of a personal choice, or due to physical impossibility). 700 But there may be other women for whom$y_t = 0$just as a matter of 701 chance --- they haven't happened to have any children at the time 702 of observation. 703 704 In formulae: 705 \begin{eqnarray*} 706 P(y_t = k | x_t) & = & \alpha d_t + (1 - \alpha) 707 \left[e^{-\mu_t} \frac{\mu_t^{y_t}}{y_t!}\right] \\ 708 \mu_t & = & \exp(x_t \beta) \\ 709 d_t & = & 710 \left\{ 711 \begin{array}{ll} 712 1 & \mathrm{for} \quad y_t = 0 \\ 713 0 & \mathrm{for} \quad y_t > 0 714 \end{array} 715 \right. 716 \end{eqnarray*} 717 718 Writing a \texttt{mle} block for this model is not difficult: 719 \begin{code} 720 mle ll = logprob 721 series xb = exp(b0 + b1 * x) 722 series d = (y=0) 723 series poiprob = exp(-xb) * xb^y / gamma(y+1) 724 series logprob = (alpha>0) && (alpha<1) ? \ 725 log(alpha*d + (1-alpha)*poiprob) : NA 726 params alpha b0 b1 727 end mle -v 728 \end{code} 729 730 However, the code above has to be modified each time we change our 731 specification by, say, adding an explanatory variable. Using 732 functions, we can simplify this task considerably and eventually be 733 able to write something easy like 734 \begin{code} 735 list X = const x 736 zip(y, X) 737 \end{code} 738 739 \begin{script}[htbp] 740 \caption{Zero-inflated Poisson Model --- user-level function} 741 \label{mle-zip-main} 742 \begin{scode} 743 /* 744 user-level function: estimate the model and print out 745 the results 746 */ 747 function void zip(series y, list X) 748 matrix coef_stde = zip_estimate(y, X) 749 printf "\nZero-inflated Poisson model:\n" 750 string parnames = "alpha," 751 string parnames += varname(X) 752 modprint coef_stde parnames 753 end function 754 \end{scode} 755 \end{script} 756 %$
757
758 Let's see how this can be done.  First we need to define a function
759 called \texttt{zip()} that will take two arguments: a dependent
760 variable \texttt{y} and a list of explanatory variables \texttt{X}. An
761 example of such function can be seen in script~\ref{mle-zip-main}. By
762 inspecting the function code, you can see that the actual estimation
763 does not happen here: rather, the \texttt{zip()} function merely
764 uses the built-in \cmd{modprint} command to print out the results
765 coming from another user-written function, namely
766 \texttt{zip\_estimate()}.
767
768 \begin{script}[htbp]
769   \caption{Zero-inflated Poisson Model --- internal functions}
770   \label{mle-zip-inc}
771 \begin{scode}
772 /* compute log probabilities for the plain Poisson model */
773 function series ln_poi_prob(series y, list X, matrix beta)
774     series xb = lincomb(X, beta)
775     return -exp(xb) + y*xb - lngamma(y+1)
776 end function
777
778 /* compute log probabilities for the zero-inflated Poisson model */
779 function series ln_zip_prob(series y, list X, matrix beta, scalar p0)
780     # check if the probability is in [0,1]; otherwise, return NA
781     if p0 > 1 || p0 < 0
782         series ret = NA
783     else
784         series ret = ln_poi_prob(y, X, beta) + ln(1-p0)
785         series ret = y==0 ? ln(p0 + exp(ret)) : ret
786     endif
787     return ret
788 end function
789
790 /* do the actual estimation (silently) */
791 function matrix zip_estimate(series y, list X)
792     # initialize alpha to a "sensible" value: half the frequency
793     # of zeros in the sample
794     scalar alpha = mean(y==0)/2
795     # initialize the coeffs (we assume the first explanatory
796     # variable is the constant here)
797     matrix coef = zeros(nelem(X), 1)
798     coef[1] = mean(y) / (1-alpha)
799     # do the actual ML estimation
800     mle ll = ln_zip_prob(y, X, coef, alpha)
801         params alpha coef
802     end mle --hessian --quiet
803     return $coeff ~$stderr
804 end function
805 \end{scode}
806 \end{script}
807 %$808 809 The function \texttt{zip\_estimate()} is not meant to be executed 810 directly; it just contains the number-crunching part of the job, whose 811 results are then picked up by the end function \texttt{zip()}. In 812 turn, \texttt{zip\_estimate()} calls other user-written functions to 813 perform other tasks. The whole set of internal'' functions is shown 814 in the panel \ref{mle-zip-inc}. 815 816 All the functions shown in \ref{mle-zip-main} and \ref{mle-zip-inc} can 817 be stored in a separate \texttt{inp} file and executed once, at the 818 beginning of our job, by means of the \texttt{include} 819 command. Assuming the name of this script file is 820 \texttt{zip\_est.inp}, the following is an example script which 821 (a) includes the script file, (b) generates a simulated dataset, 822 and (c) performs the estimation of a ZIP model on the artificial data. 823 824 \begin{code} 825 set verbose off 826 827 # include the user-written functions 828 include zip_est.inp 829 830 # generate the artificial data 831 nulldata 1000 832 set seed 732237 833 scalar truep = 0.2 834 scalar b0 = 0.2 835 scalar b1 = 0.5 836 series x = normal() 837 series y = (uniform()<truep) ? 0 : randgen(p, exp(b0 + b1*x)) 838 list X = const x 839 840 # estimate the zero-inflated Poisson model 841 zip(y, X) 842 \end{code} 843 844 The results are as follows: 845 846 \begin{code} 847 Zero-inflated Poisson model: 848 849 coefficient std. error z-stat p-value 850 ------------------------------------------------------- 851 alpha 0.209738 0.0261746 8.013 1.12e-15 *** 852 const 0.167847 0.0449693 3.732 0.0002 *** 853 x 0.452390 0.0340836 13.27 3.32e-40 *** 854 \end{code} 855 856 A further step may then be creating a function package for accessing 857 your new \texttt{zip()} function via gretl's graphical interface. For 858 details on how to do this, see section \ref{sec:func-packages}. 859 860 \section{Advanced use of \texttt{mle}: functions, analytical 861 derivatives, algorithm choice} 862 \label{sec:mle-adv} 863 864 All the techniques decribed in the previous sections may be combined, 865 and \cmd{mle} can be used for solving non-standard estimation problems 866 (provided, of course, that one chooses maximum likelihood as the 867 preferred inference method). 868 869 The strategy that, as of this writing, has proven most successful in 870 designing scripts for this purpose is: 871 \begin{itemize} 872 \item Modularize your code as much as possible. 873 \item Use analytical derivatives whenever possible. 874 \item Choose your optimization method wisely. 875 \end{itemize} 876 877 In the rest of this section, we will expand on the probit example of 878 section \ref{sec:anal-der} to give the reader an idea of what a 879 heavy-duty'' application of \cmd{mle} looks like. Most of the code 880 fragments come from \verb|mle-advanced.inp|, which is one of the 881 sample scripts supplied with the standard installation of gretl 882 (see under \emph{File$>$Script files$>$Practice File}). 883 884 \subsection{BFGS with and without analytical derivatives} 885 \label{sec:mle-adv-bfgs} 886 887 The example in section \ref{sec:anal-der} can be made more general by 888 using matrices and user-written functions. Consider the following code 889 fragment: 890 \begin{code} 891 list X = const x1 x2 x3 892 matrix b = zeros(nelem(X),1) 893 894 mle logl = y*ln(P) + (1-y)*ln(1-P) 895 series ndx = lincomb(X, b) 896 series P = cnorm(ndx) 897 params b 898 end mle 899 \end{code} 900 901 In this context, the fact that the model we are estimating has four 902 explanatory variables is totally incidental: the code is written in 903 such a way that we could change the content of the list \texttt{X} 904 without having to make any other modification. This was made possible 905 by: 906 \begin{enumerate} 907 \item gathering the parameters to estimate into a single vector$b$908 rather than using separate scalars; 909 \item using the \texttt{nelem()} function to initialize$b$, so that 910 its dimension is kept track of automatically; 911 \item using the \texttt{lincomb()} function to compute the index 912 function. 913 \end{enumerate} 914 915 A parallel enhancement could be achieved in the case of analytically 916 computed derivatives: since$b$is now a vector, \cmd{mle} expects the 917 argument to the \cmd{deriv} keyword to be a matrix, in which each 918 column is the partial derivative to the corresponding element of 919$b$. It is useful to re-write the score for the$i$-th observation as 920 \begin{equation} 921 \label{eq:mle-probscore} 922 \pder{\LogLik_i}{\beta} = m_i \mathbf{x}_i' 923 \end{equation} 924 where$m_i$is the signed Mills' ratio'', that is 925 $926 m_i = y_i \frac{\phi(\mathbf{x}_i'\beta)}{\Phi(\mathbf{x}_i'\beta)} - 927 (1-y_i) \frac{\phi(\mathbf{x}_i'\beta)}{1 - \Phi(\mathbf{x}_i'\beta)} , 928$ 929 which was computed in section \ref{sec:anal-der} via 930 \begin{code} 931 series P = cnorm(ndx) 932 series m = dnorm(ndx)*(y/P - (1-y)/(1-P)) 933 \end{code} 934 Here, we will code it in a somewhat terser way as 935 \begin{code} 936 series m = y ? invmills(-ndx) : -invmills(ndx) 937 \end{code} 938 and make use of the conditional assignment operator and of the 939 specialized function \cmd{invmills()} for efficiency. Building the 940 score matrix is now easily achieved via 941 \begin{code} 942 mle logl = y*ln(P) + (1-y)*ln(1-P) 943 series ndx = lincomb(X, b) 944 series P = cnorm(ndx) 945 series m = y ? invmills(-ndx) : -invmills(ndx) 946 matrix mX = {X} 947 deriv b = mX .* {m} 948 end mle 949 \end{code} 950 in which the \verb|{}| operator was used to turn series and lists into 951 matrices (see chapter \ref{chap:matrices}). However, proceeding in 952 this way for more complex models than probit may imply inserting into 953 the \cmd{mle} block a long series of instructions; the example above 954 merely happens to be short because the score matrix for the probit 955 model is very easy to write in matrix form. 956 957 A better solution is writing a user-level function to compute the 958 score and using that inside the \cmd{mle} block, as in 959 \begin{code} 960 function matrix score(matrix b, series y, list X) 961 series ndx = lincomb(X, b) 962 series m = y ? invmills(-ndx) : -invmills(ndx) 963 return {m} .* {X} 964 end function 965 966 [...] 967 968 mle logl = y*ln(P) + (1-y)*ln(1-P) 969 series ndx = lincomb(X, b) 970 series P = cnorm(ndx) 971 deriv b = score(b, y, X) 972 end mle 973 \end{code} 974 In this way, no matter how complex the computation of the score is, 975 the \cmd{mle} block remains nicely compact. 976 977 \subsection{Newton's method and the analytical Hessian} 978 \label{sec:mle-adv-hessian} 979 980 As mentioned above, gretl offers the user the option of using 981 Newton's method for maximizing the log-likelihood. In terms of the 982 notation used in section \ref{sec:mle-intro}, the direction for 983 updating the inital parameter vector$\theta_0$is given by 984 \begin{equation} 985 \label{eq:mle-newton} 986 d\left[\mathbf{g}(\theta_0)\right] = -\lambda 987 \mathbf{H}(\theta_0)^{-1}\mathbf{g}(\theta_0) , 988 \end{equation} 989 where$\mathbf{H}(\theta)$is the Hessian of the total loglikelihood 990 computed at$\theta$and$0 < \lambda < 1$is a scalar called the 991 \emph{step length}. 992 993 The above expression makes a few points clear: 994 \begin{enumerate} 995 \item At each step, it must be possible to compute not only the score 996$\mathbf{g}(\theta)$, but also its derivative$\mathbf{H}(\theta)$; 997 \item the matrix$\mathbf{H}(\theta)$should be nonsingular; 998 \item it is assumed that for some positive value of$\lambda$, 999$\LogLik(\theta_1) > \LogLik(\theta_0)$; in other words, that going 1000 in the direction$d\left[\mathbf{g}(\theta_0)\right]$leads upwards 1001 for some step length. 1002 \end{enumerate} 1003 1004 The strength of Newton's method lies in the fact that, if the 1005 loglikelihood is globally concave, then \eqref{eq:mle-newton} enjoys 1006 certain optimality properties and the number of iterations required to 1007 reach the maximum is often much smaller than it would be with other 1008 methods, such as BFGS. However, it may have some disadvantages: for a 1009 start, the Hessian$\mathbf{H}(\theta)$may be difficult or very 1010 expensive to compute; moreover, the loglikelihood may not be globally 1011 concave, so for some values of$\theta$, the matrix 1012$\mathbf{H}(\theta)$is not negative definite or perhaps even 1013 singular. Those cases are handled by gretl's implementation of 1014 Newton's algorithm by means of several heuristic 1015 techniques\footnote{The gist to it is that, if$\mathbf{H}$is not 1016 negative definite, it is substituted by$k \cdot
1017   \mathrm{dg}(\mathbf{H}) + (1-k) \cdot \mathbf{H}$, where$k$is a 1018 suitable scalar; however, if you're interested in the precise 1019 details, you'll be much better off looking at the source code: the 1020 file you'll want to look at is \texttt{lib/src/gretl\_bfgs.c}.}, but 1021 a number of adverse consequences may occur, which range from 1022 \emph{longer} computation time for optimization to non-convergence of 1023 the algorithm. 1024 1025 As a consequence, using Newton's method is advisable only when the 1026 computation of the Hessian is not too CPU-intensive and the nature of 1027 the estimator is such that it is known in advance that the 1028 loglikelihood is globally concave. The probit models satisfies both 1029 requisites, so we will expand the preceding example to illustrate how 1030 to use Newton's method in gretl. 1031 1032 A first example may be given simply by issuing the command 1033 \begin{code} 1034 set optimizer newton 1035 \end{code} 1036 before the \cmd{mle} block.\footnote{To go back to BFGS, you use 1037 \cmd{set optimizer bfgs}.} This will instruct gretl to use 1038 Newton's method instead of BFGS. If the \cmd{deriv} keyword is used, 1039 gretl will differentiate the score function numerically; 1040 otherwise, if the score has to be computed itself numerically, 1041 gretl will calculate$\mathbf{H}(\theta)$by differentiating the 1042 loglikelihood numerically twice. The latter solution, though, is 1043 generally to be avoided, as may be extremely time-consuming and may 1044 yield imprecise results. 1045 1046 A much better option is to calculate the Hessian analytically and have 1047 gretl use its true value rather than a numerical approximation. In 1048 most cases, this is both much faster and numerically stable, but of 1049 course comes at the price of having to differentiate the loglikelihood 1050 twice to respect with the parameters and translate the resulting 1051 expressions into efficient \app{hansl} code. 1052 1053 Luckily, both tasks are relatively easy in the probit case: the matrix 1054 of second derivatives of$\LogLik_i$may be written as 1055 $1056 \pder{{}^2\LogLik_i}{\beta \partial \beta'} = 1057 - m_i \left( m_i + \mathbf{x}_i'\beta \right) \mathbf{x}_i \mathbf{x}_i' 1058$ 1059 so the total Hessian is 1060 \begin{equation} 1061 \label{eq:mle-tothess} 1062 \sum_{i=1}^n \pder{{}^2\LogLik_i}{\beta \partial \beta'} = 1063 - X' \left[ 1064 \begin{array}{cccc} 1065 w_1 & & & \\ 1066 & w_2 & & \\ 1067 & & \ddots & \\ 1068 & & & w_n 1069 \end{array} 1070 \right] X 1071 \end{equation} 1072 where$w_i = m_i \left( m_i + \mathbf{x}_i'\beta \right)$. It can be 1073 shown that$w_i > 0$, so the Hessian is guaranteed to be negative 1074 definite in all sensible cases and the conditions are ideal for 1075 applying Newton's method. 1076 1077 A \app{hansl} translation of equation \eqref{eq:mle-tothess} may look 1078 like 1079 \begin{code} 1080 function void Hess(matrix *H, matrix b, series y, list X) 1081 /* computes the negative Hessian for a Probit model */ 1082 series ndx = lincomb(X, b) 1083 series m = y ? invmills(-ndx) : -invmills(ndx) 1084 series w = m*(m+ndx) 1085 matrix mX = {X} 1086 H = (mX .* {w})'mX 1087 end function 1088 \end{code} 1089 1090 There are two characteristics worth noting of the function above. For 1091 a start, it doesn't return anything: the result of the computation is 1092 simply stored in the matrix pointed at by the first argument of the 1093 function. Second, the result is not the Hessian proper, but rather its 1094 negative. This function becomes usable from within an \cmd{mle} block 1095 by the keyword \cmd{hessian}. The syntax is 1096 \begin{code} 1097 mle ... 1098 ... 1099 hessian funcname(&mat_addr, ...) 1100 end mle 1101 \end{code} 1102 In other words, the \cmd{hessian} keyword must be followed by the call 1103 to a function whose first argument is a matrix pointer which is 1104 supposed to be filled with the \emph{negative} of the Hessian at 1105$\theta$. 1106 1107 We said above (section~\ref{sec:mle-intro}) that the covariance matrix 1108 of the parameter estimates is by default estimated using the Outer 1109 Product of the Gradient (so long as the log-likelihood function 1110 returns the per-observation contributions). However, if you supply a 1111 function that computes the Hessian then by default it is used in 1112 estimating the covariance matrix. If you wish to impose use of OPG 1113 instead, append the \option{opg} option to the end of the \cmd{mle} 1114 block. 1115 1116 Note that gretl does not perform any numerical check on whether a 1117 user-supplied function computes the Hessian correctly. On the one 1118 hand, this means that you can trick \cmd{mle} into using alternatives 1119 to the Hessian and thereby implement other optimization methods. For 1120 example, if you substitute in equation \ref{eq:mle-newton} the Hessian 1121$\mathbf{H}$with the negative of the OPG matrix 1122$-\mathbf{G}'\mathbf{G}$, as defined in \eqref{eq:OPGmat}, you get the 1123 so-called BHHH optimization method (see \cite{bhhh74}). Again, the 1124 sample file \verb|mle-advanced.inp| provides an example. On the other 1125 hand, you may want to perform a check of your analytically-computed 1126$\mathbf{H}$matrix versus a numerical approximation. 1127 1128 If you have a function that computes the score, this is relatively 1129 simple to do by using the \cmd{fdjac} function, briefly described in 1130 section \ref{sec:fdjac}, which computes a numerical approximation to a 1131 derivative. In practice, you need a function computing 1132$\mathbf{g}(\theta)$as a row vector and then use \cmd{fdjac} to 1133 differentiate it numerically with respect to$\theta$. The result can 1134 then be compared to your analytically-computed Hessian. The code 1135 fragment below shows an example of how this can be done in the probit 1136 case: 1137 \begin{code} 1138 function matrix totalscore(matrix *b, series y, list X) 1139 /* computes the total score */ 1140 return sumc(score(b, y, X)) 1141 end function 1142 1143 function void check(matrix b, series y, list X) 1144 /* compares the analytical Hessian to its numerical 1145 approximation obtained via fdjac */ 1146 matrix aH 1147 Hess(&aH, b, y, X) # stores the analytical Hessian into aH 1148 1149 matrix nH = fdjac(b, "totalscore(&b, y, X)") 1150 nH = 0.5*(nH + nH') # force symmetry 1151 1152 printf "Numerical Hessian\n%16.6f\n", nH 1153 printf "Analytical Hessian (negative)\n%16.6f\n", aH 1154 printf "Check (should be zero)\n%16.6f\n", nH + aH 1155 end function 1156 \end{code} 1157 1158 \section{Handling non-convergence gracefully} 1159 \label{sec:mle-nonconv} 1160 1161 If the numerical aspects of the estimation procedure are complex, it 1162 is possible that \cmd{mle} fails to find the maximum within the number 1163 of iterations stipulated via the \verb|bfgs_maxiter| state variable 1164 (which defaults to 500). 1165 1166 In these cases, \cmd{mle} will exit with error and it's up to the user 1167 to handle the situation appropriately. For example, it is possible 1168 that \cmd{mle} is used inside a loop and you don't want the loop to 1169 stop in case convergence is not achieved. The \cmd{catch} command 1170 modifier (see also the \GCR) is an excellent tool for this purpose. 1171 1172 The example provided in listing \ref{ex:catch-mle} illustrates the 1173 usage of \cmd{catch} in an artificially simple context: we use the 1174 \cmd{mle} command for estimating mean and variance of a Gaussian rv 1175 (of course you don't need the \cmd{mle} apparatus for this, but it 1176 makes for a nice example). The gist of the example is using the 1177 \verb|set bfgs_maxiter| command to force \cmd{mle} to abort after a 1178 very small number of iterations, so that you can have an idea on how 1179 to use the \cmd{catch} modifier and the associated \dollar{error} 1180 accessor to handle the situation. 1181 1182 You may want to increase the maximum number if BFGS iterations in the 1183 example to check what happens if the algorithm is allowed to 1184 converge. Note that, upon successful completion of \cmd{mle}, a bundle 1185 named \dollar{model} is available, containing several quantities that 1186 may be of your interest, including the total number of function 1187 evaluations. 1188 1189 \begin{script}[htbp] 1190 \caption{Handling non-convergence via \cmd{catch}} 1191 \label{ex:catch-mle} 1192 \begin{scodebit} 1193 set verbose off 1194 nulldata 200 1195 set seed 8118 1196 1197 # generate simulated data from a N(3,4) variate 1198 series x = normal(3,2) 1199 1200 # set starting values 1201 scalar m = 0 1202 scalar s2 = 1 1203 1204 # set iteration limit to a ridiculously low value 1205 set bfgs_maxiter 10 1206 1207 # perform ML estimation; note the "catch" modifier 1208 catch mle loglik = -0.5* (log(2*$pi) + log(s2) + e2/s2)
1209     series e2 = (x - m)^2
1210     params m s2
1211 end mle --quiet
1212
1213 # grab the error and proceed as needed
1214 err = $error 1215 if err 1216 printf "Not converged! (m = %g, s2 = %g)\n", m, s2 1217 else 1218 printf "Converged after %d iterations\n",$model.grcount
1219     cs = $coeff ~ sqrt(diag($vcv))
1220     pn = "m s2"
1221     modprint cs pn
1222 endif
1223 \end{scodebit}
1224 \end{script}
1225
1226
1227
1228 %%% Local Variables:
1229 %%% mode: latex
1230 %%% TeX-master: "gretl-guide"
1231 %%% End: