"Fossies" - the Fresh Open Source Software Archive

Member "gretl-2019c/doc/tex/mle.tex" (26 Feb 2019, 46599 Bytes) of package /linux/misc/gretl-2019c.tar.xz:


As a special service "Fossies" has tried to format the requested source page into HTML format using (guessed) TeX and LaTeX source code syntax highlighting (style: standard) with prefixed line numbers. Alternatively you can here view or download the uninterpreted source code file.

\chapter{Maximum likelihood estimation}
\label{chap:mle}

\section{Generic ML estimation with gretl}
\label{sec:mle-intro}


Maximum likelihood estimation is a cornerstone of modern inferential
procedures. Gretl provides a way to implement this method for a wide
range of estimation problems, by use of the \texttt{mle} command. We
give here a few examples.

To give a foundation for the examples that follow, we start from a
brief reminder on the basics of ML estimation.  Given a sample of size
$T$, it is possible to define the density function\footnote{We are
  supposing here that our data are a realization of continuous random
  variables. For discrete random variables, everything continues to
  apply by referring to the probability function instead of the
  density. In both cases, the distribution may be conditional on some
  exogenous variables.} for the whole sample, namely the joint
distribution of all the observations $f(\mathbf{Y} ; \theta)$, where
$\mathbf{Y} = \left\{ y_1, \ldots, y_T \right\}$.  Its shape is
determined by a $k$-vector of unknown parameters $\theta$, which we
assume is contained in a set $\Theta$, and which can be used to
evaluate the probability of observing a sample with any given
characteristics.

After observing the data, the values $\mathbf{Y}$ are given, and this
function can be evaluated for any legitimate value of $\theta$. In
this case, we prefer to call it the \emph{likelihood} function; the
need for another name stems from the fact that this function works as
a density when we use the $y_t$s as arguments and $\theta$ as
parameters, whereas in this context $\theta$ is taken as the
function's argument, and the data $\mathbf{Y}$ only have the role of
determining its shape.

In standard cases, this function has a unique maximum.  The location
of the maximum is unaffected if we consider the logarithm of the
likelihood (or log-likelihood for short): this function will be
denoted as
\[
  \LogLik(\theta) = \log  f(\mathbf{Y}; \theta)
\]
The log-likelihood functions that gretl can handle are those where
$\LogLik(\theta)$ can be written as
\[
  \LogLik(\theta) = \sum_{t=1}^T \ell_t(\theta)
\]
which is true in most cases of interest. The functions $\ell_t(\theta)$
are called the log-likelihood contributions.

Moreover, the location of the maximum is obviously determined by the
data $\mathbf{Y}$. This means that the value
\begin{equation}
  \label{eq:maxlik}
  \hat{\theta}(\mathbf{Y}) = \argmax_{\theta \in \Theta} \LogLik(\theta)
\end{equation}
is some function of the observed data (a statistic), which has the
property, under mild conditions, of being a consistent, asymptotically
normal and asymptotically efficient estimator of $\theta$.

Sometimes it is possible to write down explicitly the function
$\hat{\theta}(\mathbf{Y})$; in general, it need not be so. In these
circumstances, the maximum can be found by means of numerical
techniques. These often rely on the fact that the log-likelihood is a
smooth function of $\theta$, and therefore at the maximum its partial
derivatives should all be 0.  The \textsl{gradient vector}, or
\textsl{score vector}, is a function that enjoys many interesting
statistical properties in its own right; it will be denoted here as
$\mathbf{g}(\theta)$.  It is a $k$-vector with typical element
\[
g_i(\theta) = \frac{\partial\LogLik(\theta)}{\partial\theta_i}
  = \sum_{t=1}^T \frac{\partial\ell_t(\theta)}{\partial\theta_i}
\]

Gradient-based methods can be briefly illustrated as follows:

\begin{enumerate}
\item pick a point $\theta_0 \in \Theta$;
\item evaluate $\mathbf{g}(\theta_0)$;
\item if $\mathbf{g}(\theta_0)$ is ``small'', stop. Otherwise, compute
  a direction vector $d(\mathbf{g}(\theta_0))$;
\item evaluate $\theta_1 = \theta_0 + d(\mathbf{g}(\theta_0))$;
\item substitute $\theta_0$ with $\theta_1$;
\item restart from 2.
\end{enumerate}
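
By way of illustration, the following \app{hansl} fragment implements
the scheme above in its most naive form, that is a fixed-step ascent
with a numerically evaluated gradient, for a toy problem: the mean of
a normal sample with known unit variance. This is only a sketch of
the general idea and not a description of what \texttt{mle} does
internally; the function name \texttt{ll}, the step length and the
stopping threshold are arbitrary choices.
\begin{code}
function matrix ll(matrix *theta, series x)
    # log-likelihood (up to a constant) of N(theta[1], 1) data
    return {sum(-0.5 * (x - theta[1])^2)}
end function

nulldata 100
series x = normal()
matrix theta = {0}        # starting point
loop 50 --quiet
    # numerical score at the current point
    matrix g = fdjac(theta, "ll(&theta, x)")
    if abs(g[1]) < 1.0e-6  # "small" gradient: stop
        break
    endif
    theta = theta + 0.005 * g'   # fixed-step move uphill
endloop
print theta
\end{code}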

Many algorithms of this kind exist; they basically differ from one
another in the way they compute the direction vector
$d(\mathbf{g}(\theta_0))$, to ensure that $\LogLik(\theta_1) >
\LogLik(\theta_0)$ (so that we eventually end up at the maximum).

The default method gretl uses to maximize the log-likelihood is a
gradient-based algorithm known as the \textbf{BFGS} (Broyden,
Fletcher, Goldfarb and Shanno) method. This technique is used in most
econometric and statistical packages, as it is well-established and
remarkably powerful. Clearly, in order to make this technique
operational, it must be possible to compute the vector
$\mathbf{g}(\theta)$ for any value of $\theta$. In some cases this
vector can be written explicitly as a function of $\mathbf{Y}$. If
this is not possible or too difficult the gradient may be evaluated
numerically. The alternative \textbf{Newton--Raphson} algorithm is also
available. This method is more effective under some circumstances but
is also more fragile; see section \ref{sec:mle-adv} and chapter
\ref{chap:numerical} for details.\footnote{Note that some of the
  statements made below (for example, regarding estimation of the
  covariance matrix) have to be modified when Newton's method is
  used.}

The choice of the starting value, $\theta_0$, is crucial in some contexts
and inconsequential in others. In general, however, it is
advisable to start the algorithm from ``sensible'' values whenever
possible. If a consistent estimator is available, this is usually a
safe and efficient choice: this ensures that in large samples the
starting point will likely be close to $\hat{\theta}$ and convergence
can be achieved in a few iterations.

The maximum number of iterations allowed for the BFGS procedure, and
the relative tolerance for assessing convergence, can be adjusted
using the \cmd{set} command: the relevant variables are
\verb+bfgs_maxiter+ (default value 500) and \verb+bfgs_toler+ (default
value, the machine precision to the power 3/4).
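For example, the following lines (with values chosen purely for
illustration) raise the iteration limit and tighten the convergence
tolerance before an \texttt{mle} block is run:
\begin{code}
set bfgs_maxiter 1000
set bfgs_toler 1.0e-12
\end{code}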

\subsection{Covariance matrix and standard errors}

By default the covariance matrix of the parameter estimates is
based on the Outer Product of the Gradient (OPG).  That is,
\begin{equation}
  \label{eq:OPGmat}
  \widehat{\mbox{Var}}_{\mbox{\scriptsize OPG}}(\hat{\theta}) =
  \left(G'(\hat{\theta}) G(\hat{\theta}) \right)^{-1}
\end{equation}
where $G(\hat{\theta})$ is the $T \times k$ matrix of contributions to
the gradient.  Two other options are available.  If the
\option{hessian} flag is given, the covariance matrix is computed from
a numerical approximation to the Hessian at convergence.  If the
\option{robust} option is selected, the quasi-ML ``sandwich''
estimator is used:
\[
\widehat{\mbox{Var}}_{\mbox{\scriptsize QML}}(\hat{\theta}) = H(\hat{\theta})^{-1}
  G'(\hat{\theta}) G(\hat{\theta}) H(\hat{\theta})^{-1}
\]
where $H$ denotes the numerical approximation to the Hessian.

Cluster-robust estimation is also available: in order to activate it,
use the \option{cluster=}\emph{\texttt{clustvar}} option, where
\texttt{clustvar} should be a discrete series. See section
\ref{sec:vcv-cluster} for more details.

Note, however, that if the log-likelihood function supplied by the
user just returns a scalar value---as opposed to a series or vector
holding per-observation contributions---then the OPG method is not
applicable and so the covariance matrix must be estimated via a
numerical approximation to the Hessian.
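
As an illustration of how these options are requested, here is a
minimal sketch (a normal sample, with arbitrary starting values) in
which the sandwich estimator is selected by appending the
\option{robust} flag to the \texttt{end mle} line; \option{hessian}
or \option{cluster=}\emph{\texttt{clustvar}} would go in the same
position.
\begin{code}
nulldata 100
series x = normal()
scalar m = 0
scalar s2 = 1

mle ll = -0.5*(log(2*$pi) + log(s2) + (x - m)^2/s2)
    params m s2
end mle --robust
\end{code}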

\section{Gamma estimation}
\label{sec:ml-gamma}

Suppose we have a sample of $T$ independent and identically
distributed observations from a Gamma distribution. The density
function for each observation $x_t$ is
\begin{equation}
  \label{eq:gammadens}
  f(x_t) = \frac{\alpha^p}{\Gamma(p)} x_t^{p-1} \exp\left({-\alpha
      x_t}\right)
\end{equation}
The log-likelihood for the entire sample can be written as the
logarithm of the joint density of all the observations. Since these
are independent and identically distributed, the joint density is the
product of the individual densities, and hence its log is
\begin{equation}
  \label{eq:gammaloglik}
  \LogLik(\alpha, p) = \sum_{t=1}^T \log \left[ \frac{\alpha^p}{\Gamma(p)} x_t^{p-1} \exp\left({-\alpha
      x_t}\right) \right] =
      \sum_{t=1}^T \ell_t
\end{equation}
where
\[
  \ell_t = p \cdot \log (\alpha x_t) - \gamma(p) - \log x_t - \alpha x_t
\]
and $\gamma(\cdot)$ is the log of the gamma function.  In order to
estimate the parameters $\alpha$ and $p$ via ML, we need to maximize
(\ref{eq:gammaloglik}) with respect to them. The corresponding
gretl code snippet is

\begin{code}
scalar alpha = 1
scalar p = 1

mle logl =  p*ln(alpha * x) - lngamma(p) - ln(x) - alpha * x
  params alpha p
end mle
\end{code}

The first two statements

\begin{code}
scalar alpha = 1
scalar p = 1
\end{code}

are necessary to ensure that the variables \texttt{alpha} and
\texttt{p} exist before the computation of \texttt{logl} is
attempted. Inside the \texttt{mle} block these variables (which could
be either scalars, vectors or a combination of the two --- see below
for an example) are identified as the parameters that should be
adjusted to maximize the likelihood via the \texttt{params} keyword.
Their values will be changed by the execution of the \texttt{mle}
command; upon successful completion, they will be replaced by the ML
estimates. The starting value is 1 for both; this is arbitrary and
does not matter much in this example (more on this later).

The above code can be made more readable, and marginally more
efficient, by defining a variable to hold $\alpha \cdot x_t$. The
command defining this auxiliary variable can be embedded in the
\texttt{mle} block as follows:
\begin{code}
mle logl =  p*ln(ax) - lngamma(p) - ln(x) - ax
  series ax = alpha*x
  params alpha p
end mle
\end{code}
The variable \texttt{ax} is not added to the \texttt{params} list,
of course, since it is just an auxiliary variable to facilitate
the calculations.  You can insert as many such auxiliary lines
as you require before the \texttt{params} line, with the restriction
that they must contain either (a) commands to generate series,
scalars or matrices or (b) print commands (which may be used to
aid in debugging).

In a simple example like this, the choice of the starting values is
almost inconsequential; the algorithm is likely to converge no
matter what the starting values are. However, consistent
method-of-moments estimators of $p$ and $\alpha$ can be simply
recovered from the sample mean $m$ and variance $V$: since it can be
shown that
\[
  E(x_t) = p/\alpha \qquad  V(x_t) = p/\alpha^2
\]
it follows that the estimators
\begin{eqnarray*}
  \bar{\alpha} & = &  m/V \\
  \bar{p} & = & m \cdot \bar{\alpha}
\end{eqnarray*}
are consistent, and therefore suitable to be used as a starting point
for the algorithm.  The gretl script code then becomes
\begin{code}
scalar m = mean(x)
scalar alpha = m/var(x)
scalar p = m*alpha

mle logl =  p*ln(ax) - lngamma(p) - ln(x) - ax
  series ax = alpha*x
  params alpha p
end mle
\end{code}

Another thing to note is that sometimes parameters are constrained
within certain boundaries: in this case, for example, both $\alpha$
and $p$ must be positive numbers. Gretl does not check for this:
it is the user's responsibility to ensure that the function is
always evaluated at an admissible point in the parameter space during
the iterative search for the maximum. An effective technique is to
define a variable for checking that the parameters are admissible and
setting the log-likelihood as undefined if the check fails. An
example, which uses the conditional assignment operator, follows:
\begin{code}
scalar m = mean(x)
scalar alpha = m/var(x)
scalar p = m*alpha

mle logl = check ? p*ln(ax) - lngamma(p) - ln(x) - ax : NA
  series ax = alpha*x
  scalar check = (alpha>0) && (p>0)
  params alpha p
end mle
\end{code}
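
To try the snippets above without an empirical dataset at hand, one
can generate artificial gamma data; a possible (purely illustrative)
way of doing so is shown below. Note that gretl's gamma generator is
parametrized by shape and scale, so the scale argument corresponds to
$1/\alpha$ in the notation of this section; the seed and the true
parameter values chosen here are arbitrary.
\begin{code}
nulldata 200
set seed 101
# shape p = 3, scale 1/alpha = 0.5, i.e. alpha = 2
series x = randgen(G, 3, 0.5)
\end{code}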

\section{Stochastic frontier cost function}
\label{sec:frontier}

\emph{%
Note: the only purpose of this section is to illustrate the \cmd{mle}
command. For the estimation of stochastic frontier cost or production
functions, you may want to use the \package{frontier} function package.}

When modeling a cost function, it is sometimes worthwhile to
incorporate explicitly into the statistical model the notion that
firms may be inefficient, so that the observed cost deviates from the
theoretical figure not only because of unobserved heterogeneity
between firms, but also because two firms could be operating at a
different efficiency level, despite being identical in all other
respects. In this case we may write
\[
  C_i = C^*_i + u_i + v_i
\]
where $C_i$ is some variable cost indicator, $C_i^*$ is its
``theoretical'' value, $u_i$ is a zero-mean disturbance term and $v_i$
is the inefficiency term, which is supposed to be nonnegative by its
very nature. A linear specification for $C_i^*$ is often chosen. For
example, the Cobb--Douglas cost function arises when $C_i^*$ is a
linear function of the logarithms of the input prices and the output
quantities.

The \emph{stochastic frontier} model is a linear model of the form
$y_i = x_i \beta + \varepsilon_i$ in which the error term
$\varepsilon_i$ is the sum of $u_i$ and $v_i$.

A common postulate is that $u_i \sim N(0,\sigma_u^2)$ and
$v_i \sim \left|N(0,\sigma_v^2)\right|$. If independence between $u_i$
and $v_i$ is also assumed, then it is possible to show that the
density function of $\varepsilon_i$ has the form:
\begin{equation}
  \label{eq:frontdens}
  f(\varepsilon_i) =
   \sqrt{\frac{2}{\pi}}
   \Phi\left(\frac{\lambda \varepsilon_i}{\sigma}\right)
   \frac{1}{\sigma} \phi\left(\frac{\varepsilon_i}{\sigma}\right)
\end{equation}
where $\Phi(\cdot)$ and $\phi(\cdot)$ are, respectively, the
distribution and density functions of the standard normal, $\sigma =
\sqrt{\sigma^2_u + \sigma^2_v}$ and $\lambda = \frac{\sigma_v}{\sigma_u}$.

As a consequence, the log-likelihood for one observation takes the
form (apart from an irrelevant constant)
\[
  \ell_i =
  \log\Phi\left(\frac{\lambda \varepsilon_i}{\sigma}\right) -
  \left[ \log(\sigma) + \frac{\varepsilon_i^2}{2 \sigma^2} \right]
\]
Therefore, a Cobb--Douglas cost function with stochastic frontier is the
model described by the following equations:
\begin{eqnarray*}
  \log C_i & = & \log C^*_i + \varepsilon_i \\
  \log C^*_i & = & c + \sum_{j=1}^m \beta_j \log y_{ij} + \sum_{j=1}^n \alpha_j \log p_{ij} \\
  \varepsilon_i & = & u_i + v_i \\
  u_i & \sim & N(0,\sigma_u^2) \\
  v_i & \sim & \left|N(0,\sigma_v^2)\right|
\end{eqnarray*}

In most cases, one wants to ensure that the homogeneity of the cost
function with respect to the prices holds by construction. Since this
requirement is equivalent to $\sum_{j=1}^n \alpha_j = 1$, the above
equation for $C^*_i$ can be rewritten as

\begin{equation}
  \label{eq:CobbDouglasFrontier}
  \log C_i - \log p_{in}  = c + \sum_{j=1}^m \beta_j \log y_{ij} +
  \sum_{j=2}^n \alpha_j (\log p_{ij} - \log p_{in})  + \varepsilon_i
\end{equation}

The above equation could be estimated by OLS, but it would suffer from
two drawbacks: first, the OLS estimator for the intercept $c$ is
inconsistent because the disturbance term has a non-zero expected
value; second, the OLS estimators for the other parameters are
consistent, but inefficient in view of the non-normality of
$\varepsilon_i$. Both issues can be addressed by estimating
(\ref{eq:CobbDouglasFrontier}) by maximum likelihood. Nevertheless,
OLS estimation is a quick and convenient way to provide starting
values for the MLE algorithm.

Listing \ref{cost-estimation} shows how to implement the model
described so far. The \texttt{banks91} file contains part of the data
used in \citet*{lucchetti01}.

\begin{script}[htbp]
  \caption{Estimation of stochastic frontier cost function (with
    scalar parameters)}
  \label{cost-estimation}
\begin{scode}
open banks91.gdt

# transformations
series cost = ln(VC)
series q1 = ln(Q1)
series q2 = ln(Q2)
series p1 = ln(P1)
series p2 = ln(P2)
series p3 = ln(P3)

# Cobb-Douglas cost function with homogeneity restrictions
# (for initialization)
genr rcost = cost - p1
genr rp2 = p2 - p1
genr rp3 = p3 - p1

ols rcost const q1 q2 rp2 rp3

# Cobb-Douglas cost function with homogeneity restrictions
# and inefficiency

scalar b0 = $coeff(const)
scalar b1 = $coeff(q1)
scalar b2 = $coeff(q2)
scalar b3 = $coeff(rp2)
scalar b4 = $coeff(rp3)

scalar su = 0.1
scalar sv = 0.1

mle logl = ln(cnorm(e*lambda/ss)) - (ln(ss) + 0.5*(e/ss)^2)
  scalar ss = sqrt(su^2 + sv^2)
  scalar lambda = sv/su
  series e = rcost - b0*const - b1*q1 - b2*q2 - b3*rp2 - b4*rp3
  params b0 b1 b2 b3 b4 su sv
end mle
\end{scode}
\end{script}

The script in Listing \ref{cost-estimation} is relatively easy to
modify to show how one can use vectors (that is, 1-dimensional
matrices) for storing the parameters to optimize: Listing
\ref{cost-estimation-vec} contains essentially the same script, with
the parameters of the cost function stored together in a vector. Of
course, this also makes it possible to use variable lists and other
refinements which make the code more compact and readable.

\begin{script}[htbp]
  \caption{Estimation of stochastic frontier cost function (with
    matrix parameters)}
  \label{cost-estimation-vec}
\begin{scode}
open banks91.gdt

# transformations
series cost = ln(VC)
series q1 = ln(Q1)
series q2 = ln(Q2)
series p1 = ln(P1)
series p2 = ln(P2)
series p3 = ln(P3)

# Cobb-Douglas cost function with homogeneity restrictions
# (for initialization)
genr rcost = cost - p1
genr rp2 = p2 - p1
genr rp3 = p3 - p1
list X = const q1 q2 rp2 rp3

ols rcost X

# Cobb-Douglas cost function with homogeneity restrictions
# and inefficiency

matrix b = $coeff
scalar su = 0.1
scalar sv = 0.1

mle logl = ln(cnorm(e*lambda/ss)) - (ln(ss) + 0.5*(e/ss)^2)
  scalar ss = sqrt(su^2 + sv^2)
  scalar lambda = sv/su
  series e = rcost - lincomb(X, b)
  params b su sv
end mle
\end{scode}
\end{script}
%$
\section{GARCH models}
\label{sec:garch}

GARCH models are handled by gretl via a native function. However, it is
instructive to see how they can be estimated through the \texttt{mle}
command.\footnote{The \package{gig} addon, which handles other variants
of conditionally heteroskedastic models, uses \texttt{mle} as its
internal engine.}

The following equations provide the simplest example of a GARCH(1,1)
model:
\begin{eqnarray*}
  y_t & = & \mu + \varepsilon_t \\
  \varepsilon_t & = & u_t \cdot \sqrt{h_t} \\
  u_t & \sim & N(0,1) \\
  h_t & = & \omega + \alpha \varepsilon^2_{t-1} + \beta h_{t-1}.
\end{eqnarray*}
Since the variance of $y_t$ depends on past values, writing down the
log-likelihood function is not simply a matter of summing the log
densities for individual observations. As is common in time series
models, $y_t$ cannot be considered independent of the other
observations in our sample, and consequently the density function for
the whole sample (the joint density for all observations) is not just
the product of the marginal densities.

Maximum likelihood estimation, in these cases, is achieved by
considering \emph{conditional} densities, so what we maximize is a
conditional likelihood function. If we define the information set at
time $t$ as
\[
  F_t = \left\{ y_t, y_{t-1}, \ldots \right\} ,
\]
then the density of $y_t$ conditional on $F_{t-1}$ is normal:
\[
  y_t | F_{t-1} \sim N\left[ \mu, h_{t} \right].
\]

By means of the properties of conditional distributions, the joint
density can be factorized as follows
\[
  f(y_t, y_{t-1}, \ldots) = \left[ \prod_{t=1}^T f(y_t |F_{t-1})
  \right] \cdot f(y_0)
\]
If we treat $y_0$ as fixed, then the term $f(y_0)$ does not depend on
the unknown parameters, and the conditional log-likelihood can
therefore be written as the sum of the individual contributions as
\begin{equation}
  \label{eq:garchloglik}
  \LogLik(\mu,\omega,\alpha,\beta) = \sum_{t=1}^T \ell_t
\end{equation}
where
\[
  \ell_t = \log \left[ \frac{1}{\sqrt{h_t}} \phi\left( \frac{y_t - \mu}{\sqrt{h_t}}
    \right) \right] =
    - \frac{1}{2} \left[ \log(h_t) + \frac{(y_t - \mu)^2}{h_t} \right]
\]

The following script shows a simple application of this technique,
which uses the data file \texttt{djclose}; it is one of the example
datasets supplied with gretl and contains daily data from the Dow Jones
stock index.

\begin{code}
open djclose

series y = 100*ldiff(djclose)

scalar mu = 0.0
scalar omega = 1
scalar alpha = 0.4
scalar beta = 0.0

mle ll = -0.5*(log(h) + (e^2)/h)
  series e = y - mu
  series h = var(y)
  series h = omega + alpha*(e(-1))^2 + beta*h(-1)
  params mu omega alpha beta
end mle
\end{code}
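
As a cross-check, one might compare the estimates obtained in this
way with those produced by gretl's built-in command for the same
model; the two sets of results need not coincide exactly, since the
native estimator uses its own initialization and algorithm details.
\begin{code}
garch 1 1 ; y const
\end{code}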

\section{Analytical derivatives}
\label{sec:anal-der}

Computation of the score vector is essential for the working of the
BFGS method. In all the previous examples, no explicit formula for the
computation of the score was given, so the algorithm was fed
numerically evaluated gradients. Numerical computation of the score for
the $i$-th parameter is performed via a finite approximation of the
derivative, namely
\[
  \pder{\LogLik(\theta_1, \ldots, \theta_n)}{\theta_i} \simeq
  \frac{\LogLik(\theta_1, \ldots, \theta_i + h, \ldots, \theta_n) -
    \LogLik(\theta_1, \ldots, \theta_i - h, \ldots, \theta_n)}{2h}
\]
where $h$ is a small number.
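
As a toy illustration of the formula (with values picked purely for
the example), the central difference below approximates the
derivative of $\log x$ at $x=2$, whose exact value is $0.5$:
\begin{code}
scalar h = 1.0e-6
scalar x0 = 2
scalar dnum = (ln(x0 + h) - ln(x0 - h)) / (2*h)
printf "numerical: %g  exact: %g\n", dnum, 1/x0
\end{code}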

In many situations, this is rather efficient and accurate. A better
approximation to the true derivative may be obtained by forcing
\cmd{mle} to use a technique known as \emph{Richardson Extrapolation},
which gives extremely precise results, but is considerably more
CPU-intensive. This feature may be turned on by using the \cmd{set}
command as in
\begin{code}
  set bfgs_richardson on
\end{code}

However, one might want to avoid the approximation and specify an
exact function for the derivatives. As an example, consider the
following script:
%
\begin{code}
nulldata 1000

genr x1 = normal()
genr x2 = normal()
genr x3 = normal()

genr ystar = x1 + x2 + x3 + normal()
genr y = (ystar > 0)

scalar b0 = 0
scalar b1 = 0
scalar b2 = 0
scalar b3 = 0

mle logl = y*ln(P) + (1-y)*ln(1-P)
  series ndx = b0 + b1*x1 + b2*x2 + b3*x3
  series P = cnorm(ndx)
  params b0 b1 b2 b3
end mle --verbose
\end{code}

Here, 1000 data points are artificially generated for an ordinary
probit model:\footnote{Again, gretl does provide a native
  \texttt{probit} command (see section \ref{sec:logit-probit}), but a
  probit model makes for a nice example here.} $y_t$ is a binary
variable, which takes the value 1 if $y_t^* = \beta_1 x_{1t} + \beta_2
x_{2t} + \beta_3 x_{3t} + \varepsilon_t > 0$ and 0 otherwise.
Therefore, $y_t = 1$ with probability $\Phi(\beta_1 x_{1t} + \beta_2
x_{2t} + \beta_3 x_{3t}) = \pi_t$.  The probability function for one
observation can be written as
\[
  P(y_t) = \pi_t^{y_t} ( 1 -\pi_t )^{1-y_t}
\]
Since the observations are independent and identically distributed,
the log-likelihood is simply the sum of the individual
contributions. Hence
\[
  \LogLik = \sum_{t=1}^T y_t \log(\pi_t) + (1 - y_t) \log(1 - \pi_t)
\]
The \option{verbose} switch at the end of the \texttt{end mle}
statement produces a detailed account of the iterations done by the
BFGS algorithm.

In this case, numerical differentiation works rather well;
nevertheless, computation of the analytical score is straightforward,
since the derivative $\pder{\LogLik}{\beta_i}$ can be written as
\[
  \pder{\LogLik}{\beta_i} = \pder{\LogLik}{\pi_t} \cdot \pder{\pi_t}{\beta_i}
\]
via the chain rule, and it is easy to see that
\begin{eqnarray*}
  \pder{\LogLik}{\pi_t} & = & \frac{y_t}{\pi_t} - \frac{1 - y_t}{1 -
    \pi_t} \\
  \pder{\pi_t}{\beta_i} & = & \phi(\beta_1 x_{1t} + \beta_2 x_{2t} +
  \beta_3 x_{3t}) \cdot x_{it}
\end{eqnarray*}

The \texttt{mle} block in the above script can therefore be modified
as follows:
%
\begin{code}
mle logl = y*ln(P) + (1-y)*ln(1-P)
  series ndx = b0 + b1*x1 + b2*x2 + b3*x3
  series P = cnorm(ndx)
  series m = dnorm(ndx)*(y/P - (1-y)/(1-P))
  deriv b0 = m
  deriv b1 = m*x1
  deriv b2 = m*x2
  deriv b3 = m*x3
end mle --verbose
\end{code}

Note that the \texttt{params} statement has been replaced by a series
of \texttt{deriv} statements; these have the double function of
identifying the parameters over which to optimize and providing an
analytical expression for their respective score elements.

\section{Debugging ML scripts}
\label{sec:mle-debug}

We have discussed above the main sorts of statements that are
permitted within an \texttt{mle} block, namely
%
\begin{itemize}
\item auxiliary commands to generate helper variables;
\item \texttt{deriv} statements to specify the gradient with respect
  to each of the parameters; and
\item a \texttt{params} statement to identify the parameters in case
  analytical derivatives are not given.
\end{itemize}

For the purpose of debugging ML estimators one additional sort of
statement is allowed: you can print the value of a relevant variable
at each step of the iteration.  This facility is more restricted than
the regular \texttt{print} command.  The command word \texttt{print}
should be followed by the name of just one variable (a scalar, series
or matrix).

In the last example above a key variable named \texttt{m} was
generated, forming the basis for the analytical derivatives.  To track
the progress of this variable one could add a print statement within
the ML block, as in
%
\begin{code}
series m = dnorm(ndx)*(y/P - (1-y)/(1-P))
print m
\end{code}

\section{Using functions}
\label{sec:mle-func}

The \texttt{mle} command allows you to estimate models that
gretl does not provide natively: in some cases, it may be a good
idea to wrap up the \texttt{mle} block in a user-defined function (see
Chapter \ref{chap:functions}), so as to extend gretl's
capabilities in a modular and flexible way.

As an example, we will take a simple case of a model that gretl
does not yet provide natively: the zero-inflated Poisson model, or ZIP
for short.\footnote{The actual ZIP model is in fact a bit more general
than the one presented here. The specialized version discussed in this
section was chosen for the sake of simplicity. For further details, see
\cite{greene03}.} In this model, we assume that we observe a mixed
population: for some individuals, the variable $y_t$ is (conditionally
on a vector of exogenous covariates $x_t$) distributed as a Poisson
random variate; for some others, $y_t$ is identically 0. The trouble
is, we don't know which category a given individual belongs to.

For instance, suppose we have a sample of women, and the variable $y_t$
represents the number of children that woman $t$ has. There may be a
certain proportion, $\alpha$, of women for whom $y_t = 0$ with certainty
(maybe out of a personal choice, or due to physical impossibility).
But there may be other women for whom $y_t = 0$ just as a matter of
chance --- they haven't happened to have any children at the time
of observation.

In formulae:
\begin{eqnarray*}
  P(y_t = k | x_t) & = & \alpha d_t + (1 - \alpha)
  \left[e^{-\mu_t} \frac{\mu_t^{y_t}}{y_t!}\right] \\
    \mu_t & = & \exp(x_t \beta) \\
    d_t & = &
    \left\{
      \begin{array}{ll}
        1 & \mathrm{for} \quad y_t = 0 \\
        0 & \mathrm{for} \quad y_t > 0
      \end{array}
    \right.
\end{eqnarray*}

Writing an \texttt{mle} block for this model is not difficult:
\begin{code}
mle ll = logprob
  series xb = exp(b0 + b1 * x)
  series d = (y==0)
  series poiprob = exp(-xb) * xb^y / gamma(y+1)
  series logprob = (alpha>0) && (alpha<1) ? \
    log(alpha*d + (1-alpha)*poiprob) : NA
  params alpha b0 b1
end mle -v
\end{code}

However, the code above has to be modified each time we change our
specification by, say, adding an explanatory variable.  Using
functions, we can simplify this task considerably and eventually be
able to write something as simple as
\begin{code}
list X = const x
zip(y, X)
\end{code}

\begin{script}[htbp]
  \caption{Zero-inflated Poisson Model --- user-level function}
  \label{mle-zip-main}
\begin{scode}
/*
  user-level function: estimate the model and print out
  the results
*/
function void zip(series y, list X)
    matrix coef_stde = zip_estimate(y, X)
    printf "\nZero-inflated Poisson model:\n"
    string parnames = "alpha,"
    string parnames += varname(X)
    modprint coef_stde parnames
end function
\end{scode}
\end{script}
%$

Let's see how this can be done.  First we need to define a function
called \texttt{zip()} that will take two arguments: a dependent
variable \texttt{y} and a list of explanatory variables \texttt{X}. An
example of such a function can be seen in Listing~\ref{mle-zip-main}. By
inspecting the function code, you can see that the actual estimation
does not happen here: rather, the \texttt{zip()} function merely
uses the built-in \cmd{modprint} command to print out the results
coming from another user-written function, namely
\texttt{zip\_estimate()}.

\begin{script}[htbp]
  \caption{Zero-inflated Poisson Model --- internal functions}
  \label{mle-zip-inc}
\begin{scode}
/* compute log probabilities for the plain Poisson model */
function series ln_poi_prob(series y, list X, matrix beta)
    series xb = lincomb(X, beta)
    return -exp(xb) + y*xb - lngamma(y+1)
end function

/* compute log probabilities for the zero-inflated Poisson model */
function series ln_zip_prob(series y, list X, matrix beta, scalar p0)
    # check if the probability is in [0,1]; otherwise, return NA
    if p0 > 1 || p0 < 0
        series ret = NA
    else
        series ret = ln_poi_prob(y, X, beta) + ln(1-p0)
        series ret = y==0 ? ln(p0 + exp(ret)) : ret
    endif
    return ret
end function

/* do the actual estimation (silently) */
function matrix zip_estimate(series y, list X)
    # initialize alpha to a "sensible" value: half the frequency
    # of zeros in the sample
    scalar alpha = mean(y==0)/2
    # initialize the coeffs (we assume the first explanatory
    # variable is the constant here)
    matrix coef = zeros(nelem(X), 1)
    coef[1] = mean(y) / (1-alpha)
    # do the actual ML estimation
    mle ll = ln_zip_prob(y, X, coef, alpha)
        params alpha coef
    end mle --hessian --quiet
    return $coeff ~ $stderr
end function
\end{scode}
\end{script}
%$

The function \texttt{zip\_estimate()} is not meant to be executed
directly; it just contains the number-crunching part of the job, whose
results are then picked up by the user-level function \texttt{zip()}. In
turn, \texttt{zip\_estimate()} calls other user-written functions to
perform other tasks. The whole set of ``internal'' functions is shown
in Listing \ref{mle-zip-inc}.

All the functions shown in Listings \ref{mle-zip-main} and
\ref{mle-zip-inc} can be stored in a separate \texttt{inp} file and
executed once, at the beginning of our job, by means of the
\texttt{include} command.  Assuming the name of this script file is
\texttt{zip\_est.inp}, the following is an example script which
(a) includes the script file, (b) generates a simulated dataset,
and (c) performs the estimation of a ZIP model on the artificial data.

\begin{code}
set verbose off

# include the user-written functions
include zip_est.inp

# generate the artificial data
nulldata 1000
set seed 732237
scalar truep = 0.2
scalar b0 = 0.2
scalar b1 = 0.5
series x = normal()
series y = (uniform()<truep) ? 0 : randgen(p, exp(b0 + b1*x))
list X = const x

# estimate the zero-inflated Poisson model
zip(y, X)
\end{code}

The results are as follows:

\begin{code}
Zero-inflated Poisson model:

             coefficient   std. error   z-stat   p-value
  -------------------------------------------------------
  alpha       0.209738     0.0261746     8.013   1.12e-15 ***
  const       0.167847     0.0449693     3.732   0.0002   ***
  x           0.452390     0.0340836    13.27    3.32e-40 ***
\end{code}

A further step may then be creating a function package for accessing
your new \texttt{zip()} function via gretl's graphical interface. For
details on how to do this, see section \ref{sec:func-packages}.

\section{Advanced use of \texttt{mle}: functions, analytical
  derivatives, algorithm choice}
\label{sec:mle-adv}

All the techniques described in the previous sections may be combined,
and \cmd{mle} can be used for solving non-standard estimation problems
(provided, of course, that one chooses maximum likelihood as the
preferred inference method).

The strategy that, as of this writing, has proven most successful in
designing scripts for this purpose is:
\begin{itemize}
\item Modularize your code as much as possible.
\item Use analytical derivatives whenever possible.
\item Choose your optimization method wisely.
\end{itemize}

In the rest of this section, we will expand on the probit example of
section \ref{sec:anal-der} to give the reader an idea of what a
``heavy-duty'' application of \cmd{mle} looks like. Most of the code
fragments come from \verb|mle-advanced.inp|, which is one of the
sample scripts supplied with the standard installation of gretl
(see under \emph{File $>$ Script files $>$ Practice File}).

\subsection{BFGS with and without analytical derivatives}
\label{sec:mle-adv-bfgs}

The example in section \ref{sec:anal-der} can be made more general by
using matrices and user-written functions. Consider the following code
fragment:
\begin{code}
list X = const x1 x2 x3
matrix b = zeros(nelem(X),1)

mle logl = y*ln(P) + (1-y)*ln(1-P)
    series ndx = lincomb(X, b)
    series P = cnorm(ndx)
    params b
end mle
\end{code}

In this context, the fact that the model we are estimating has four
explanatory variables is totally incidental: the code is written in
such a way that we could change the content of the list \texttt{X}
without having to make any other modification. This was made possible
by:
\begin{enumerate}
\item gathering the parameters to estimate into a single vector $b$
  rather than using separate scalars;
\item using the \texttt{nelem()} function to initialize $b$, so that
  its dimension is kept track of automatically;
\item using the \texttt{lincomb()} function to compute the index
  function.
\end{enumerate}

A parallel enhancement could be achieved in the case of analytically
computed derivatives: since $b$ is now a vector, \cmd{mle} expects the
argument to the \cmd{deriv} keyword to be a matrix, in which each
column is the partial derivative with respect to the corresponding
element of $b$. It is useful to re-write the score for the $i$-th
observation as
\begin{equation}
  \label{eq:mle-probscore}
  \pder{\LogLik_i}{\beta} = m_i \mathbf{x}_i'
\end{equation}
where $m_i$ is the ``signed Mills' ratio'', that is
\[
m_i = y_i \frac{\phi(\mathbf{x}_i'\beta)}{\Phi(\mathbf{x}_i'\beta)} -
(1-y_i) \frac{\phi(\mathbf{x}_i'\beta)}{1 - \Phi(\mathbf{x}_i'\beta)} ,
\]
which was computed in section \ref{sec:anal-der} via
\begin{code}
  series P = cnorm(ndx)
  series m = dnorm(ndx)*(y/P - (1-y)/(1-P))
\end{code}
Here, we will code it in a somewhat terser way as
\begin{code}
  series m = y ? invmills(-ndx) : -invmills(ndx)
\end{code}
and make use of the conditional assignment operator and of the
specialized function \cmd{invmills()} for efficiency.  Building the
score matrix is now easily achieved via
\begin{code}
mle logl = y*ln(P) + (1-y)*ln(1-P)
    series ndx = lincomb(X, b)
    series P = cnorm(ndx)
    series m = y ? invmills(-ndx) : -invmills(ndx)
    matrix mX = {X}
    deriv b = mX .* {m}
end mle
\end{code}
in which the \verb|{}| operator was used to turn series and lists into
matrices (see chapter \ref{chap:matrices}). However, proceeding in
this way for more complex models than probit may imply inserting into
the \cmd{mle} block a long series of instructions; the example above
merely happens to be short because the score matrix for the probit
model is very easy to write in matrix form.

A better solution is writing a user-level function to compute the
score and using that inside the \cmd{mle} block, as in
\begin{code}
function matrix score(matrix b, series y, list X)
    series ndx = lincomb(X, b)
    series m = y ? invmills(-ndx) : -invmills(ndx)
    return {m} .* {X}
end function

[...]

mle logl = y*ln(P) + (1-y)*ln(1-P)
    series ndx = lincomb(X, b)
    series P = cnorm(ndx)
    deriv b = score(b, y, X)
end mle
\end{code}
In this way, no matter how complex the computation of the score is,
the \cmd{mle} block remains nicely compact.

\subsection{Newton's method and the analytical Hessian}
\label{sec:mle-adv-hessian}

As mentioned above, gretl offers the user the option of using
Newton's method for maximizing the log-likelihood. In terms of the
notation used in section \ref{sec:mle-intro}, the direction for
updating the initial parameter vector $\theta_0$ is given by
\begin{equation}
  \label{eq:mle-newton}
  d\left[\mathbf{g}(\theta_0)\right] = -\lambda
  \mathbf{H}(\theta_0)^{-1}\mathbf{g}(\theta_0) ,
\end{equation}
where $\mathbf{H}(\theta)$ is the Hessian of the total loglikelihood
computed at $\theta$ and $0 < \lambda < 1$ is a scalar called the
\emph{step length}.

The above expression makes a few points clear:
\begin{enumerate}
\item At each step, it must be possible to compute not only the score
  $\mathbf{g}(\theta)$, but also its derivative $\mathbf{H}(\theta)$;
\item the matrix $\mathbf{H}(\theta)$ should be nonsingular;
\item it is assumed that for some positive value of $\lambda$,
  $\LogLik(\theta_1) > \LogLik(\theta_0)$; in other words, that going
  in the direction $d\left[\mathbf{g}(\theta_0)\right]$ leads upwards
  for some step length.
\end{enumerate}

The strength of Newton's method lies in the fact that, if the
loglikelihood is globally concave, then \eqref{eq:mle-newton} enjoys
certain optimality properties and the number of iterations required to
reach the maximum is often much smaller than it would be with other
methods, such as BFGS. However, it may have some disadvantages: for a
start, the Hessian $\mathbf{H}(\theta)$ may be difficult or very
expensive to compute; moreover, the loglikelihood may not be globally
concave, so for some values of $\theta$, the matrix
$\mathbf{H}(\theta)$ is not negative definite or perhaps even
singular.  Those cases are handled by gretl's implementation of
Newton's algorithm by means of several heuristic
techniques\footnote{The gist of it is that, if $\mathbf{H}$ is not
  negative definite, it is substituted by $k \cdot
  \mathrm{dg}(\mathbf{H}) + (1-k) \cdot \mathbf{H}$, where $k$ is a
  suitable scalar; however, if you're interested in the precise
  details, you'll be much better off looking at the source code: the
  file you'll want to look at is \texttt{lib/src/gretl\_bfgs.c}.}, but
a number of adverse consequences may occur, which range from
\emph{longer} computation time for optimization to non-convergence of
the algorithm.

As a consequence, using Newton's method is advisable only when the
computation of the Hessian is not too CPU-intensive and the nature of
the estimator is such that it is known in advance that the
loglikelihood is globally concave. The probit model satisfies both
requisites, so we will expand the preceding example to illustrate how
to use Newton's method in gretl.

A first example may be given simply by issuing the command
\begin{code}
  set optimizer newton
\end{code}
before the \cmd{mle} block.\footnote{To go back to BFGS, you use
  \cmd{set optimizer bfgs}.} This will instruct gretl to use
Newton's method instead of BFGS. If the \cmd{deriv} keyword is used,
gretl will differentiate the score function numerically;
otherwise, if the score itself has to be computed numerically,
gretl will calculate $\mathbf{H}(\theta)$ by differentiating the
loglikelihood numerically twice. The latter solution, though, is
generally to be avoided, as it may be extremely time-consuming and may
yield imprecise results.

A much better option is to calculate the Hessian analytically and have
gretl use its true value rather than a numerical approximation. In
most cases, this is both much faster and numerically stable, but of
course comes at the price of having to differentiate the loglikelihood
twice with respect to the parameters and translate the resulting
expressions into efficient \app{hansl} code.

Luckily, both tasks are relatively easy in the probit case: the matrix
of second derivatives of $\LogLik_i$ may be written as
\[
  \pder{{}^2\LogLik_i}{\beta \partial \beta'} =
  - m_i \left( m_i + \mathbf{x}_i'\beta \right) \mathbf{x}_i \mathbf{x}_i'
\]
so the total Hessian is
\begin{equation}
  \label{eq:mle-tothess}
  \sum_{i=1}^n \pder{{}^2\LogLik_i}{\beta \partial \beta'} =
  - X' \left[
    \begin{array}{cccc}
      w_1 & & & \\
      & w_2 & & \\
      & & \ddots & \\
      & & & w_n
    \end{array}
  \right] X
\end{equation}
where $w_i = m_i \left( m_i + \mathbf{x}_i'\beta \right)$. It can be
shown that $w_i > 0$, so the Hessian is guaranteed to be negative
definite in all sensible cases and the conditions are ideal for
applying Newton's method.

A \app{hansl} translation of equation \eqref{eq:mle-tothess} may look
like
\begin{code}
function void Hess(matrix *H, matrix b, series y, list X)
    /* computes the negative Hessian for a Probit model */
    series ndx = lincomb(X, b)
    series m = y ? invmills(-ndx) : -invmills(ndx)
    series w = m*(m+ndx)
    matrix mX = {X}
    H = (mX .* {w})'mX
end function
\end{code}

There are two characteristics of the function above worth noting. For
a start, it doesn't return anything: the result of the computation is
simply stored in the matrix pointed at by the first argument of the
function. Second, the result is not the Hessian proper, but rather its
negative. This function becomes usable from within an \cmd{mle} block
by means of the \cmd{hessian} keyword. The syntax is
\begin{code}
mle ...
    ...
    hessian funcname(&mat_addr, ...)
end mle
\end{code}
In other words, the \cmd{hessian} keyword must be followed by the call
to a function whose first argument is a matrix pointer which is
supposed to be filled with the \emph{negative} of the Hessian at
$\theta$.
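
As a sketch of how this fits together with the probit example above
(assuming the \texttt{score()} and \texttt{Hess()} functions and the
objects \texttt{y}, \texttt{X} and \texttt{b} defined earlier in this
section), one might write:
\begin{code}
set optimizer newton
matrix H = {}

mle logl = y*ln(P) + (1-y)*ln(1-P)
    series ndx = lincomb(X, b)
    series P = cnorm(ndx)
    deriv b = score(b, y, X)
    hessian Hess(&H, b, y, X)
end mle
\end{code}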

We said above (section~\ref{sec:mle-intro}) that the covariance matrix
of the parameter estimates is by default estimated using the Outer
Product of the Gradient (so long as the log-likelihood function
returns the per-observation contributions). However, if you supply a
function that computes the Hessian then by default it is used in
estimating the covariance matrix. If you wish to impose use of OPG
instead, append the \option{opg} option to the end of the \cmd{mle}
block.

Note that gretl does not perform any numerical check on whether a
user-supplied function computes the Hessian correctly. On the one
hand, this means that you can trick \cmd{mle} into using alternatives
to the Hessian and thereby implement other optimization methods. For
example, if you substitute in equation \ref{eq:mle-newton} the Hessian
$\mathbf{H}$ with the negative of the OPG matrix
$-\mathbf{G}'\mathbf{G}$, as defined in \eqref{eq:OPGmat}, you get the
so-called BHHH optimization method (see \cite{bhhh74}). Again, the
sample file \verb|mle-advanced.inp| provides an example. On the other
hand, you may want to perform a check of your analytically-computed
$\mathbf{H}$ matrix versus a numerical approximation.

If you have a function that computes the score, this is relatively
simple to do by using the \cmd{fdjac} function, briefly described in
section \ref{sec:fdjac}, which computes a numerical approximation to a
derivative. In practice, you need a function computing
$\mathbf{g}(\theta)$ as a row vector and then use \cmd{fdjac} to
differentiate it numerically with respect to $\theta$. The result can
then be compared to your analytically-computed Hessian. The code
fragment below shows an example of how this can be done in the probit
case:
\begin{code}
function matrix totalscore(matrix *b, series y, list X)
    /* computes the total score */
    return sumc(score(b, y, X))
end function

function void check(matrix b, series y, list X)
    /* compares the analytical Hessian to its numerical
    approximation obtained via fdjac */
    matrix aH
    Hess(&aH, b, y, X) # stores the analytical Hessian into aH

    matrix nH = fdjac(b, "totalscore(&b, y, X)")
    nH = 0.5*(nH + nH') # force symmetry

    printf "Numerical Hessian\n%16.6f\n", nH
    printf "Analytical Hessian (negative)\n%16.6f\n", aH
    printf "Check (should be zero)\n%16.6f\n", nH + aH
end function
\end{code}
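
Going back to the BHHH idea mentioned above, a minimal sketch of how
one might exploit the \cmd{hessian} keyword for that purpose is the
following function, which fills the target matrix with
$\mathbf{G}'\mathbf{G}$ (the negative of $-\mathbf{G}'\mathbf{G}$, as
required by the keyword's convention), re-using the \texttt{score()}
function defined earlier. The function name is of course arbitrary,
and \verb|mle-advanced.inp| should be consulted for the authors' own
version.
\begin{code}
function void negOPG(matrix *H, matrix b, series y, list X)
    # fill H with G'G, so that the Newton update uses the OPG
    # matrix in place of the negative Hessian, i.e. BHHH
    matrix G = score(b, y, X)
    H = G'G
end function
\end{code}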

\section{Handling non-convergence gracefully}
\label{sec:mle-nonconv}

If the numerical aspects of the estimation procedure are complex, it
is possible that \cmd{mle} fails to find the maximum within the number
of iterations stipulated via the \verb|bfgs_maxiter| state variable
(which defaults to 500).

In these cases, \cmd{mle} will exit with an error and it's up to the
user to handle the situation appropriately. For example, it is possible
that \cmd{mle} is used inside a loop and you don't want the loop to
stop in case convergence is not achieved. The \cmd{catch} command
modifier (see also the \GCR) is an excellent tool for this purpose.

The example provided in Listing \ref{ex:catch-mle} illustrates the
usage of \cmd{catch} in an artificially simple context: we use the
\cmd{mle} command for estimating the mean and variance of a Gaussian rv
(of course you don't need the \cmd{mle} apparatus for this, but it
makes for a nice example). The gist of the example is using the
\verb|set bfgs_maxiter| command to force \cmd{mle} to abort after a
very small number of iterations, so that you can have an idea of how
to use the \cmd{catch} modifier and the associated \dollar{error}
accessor to handle the situation.

You may want to increase the maximum number of BFGS iterations in the
example to check what happens if the algorithm is allowed to
converge. Note that, upon successful completion of \cmd{mle}, a bundle
named \dollar{model} is available, containing several quantities that
may be of interest, including the total number of function
evaluations.

\begin{script}[htbp]
  \caption{Handling non-convergence via \cmd{catch}}
  \label{ex:catch-mle}
\begin{scodebit}
set verbose off
nulldata 200
set seed 8118

# generate simulated data from a N(3,4) variate
series x = normal(3,2)

# set starting values
scalar m = 0
scalar s2 = 1

# set iteration limit to a ridiculously low value
set bfgs_maxiter 10

# perform ML estimation; note the "catch" modifier
catch mle loglik = -0.5* (log(2*$pi) + log(s2) + e2/s2)
    series e2 = (x - m)^2
    params m s2
end mle --quiet

# grab the error and proceed as needed
err = $error
if err
    printf "Not converged! (m = %g, s2 = %g)\n", m, s2
else
    printf "Converged after %d iterations\n", $model.grcount
    cs = $coeff ~ sqrt(diag($vcv))
    pn = "m s2"
    modprint cs pn
endif
\end{scodebit}
\end{script}


%%% Local Variables:
%%% mode: latex
%%% TeX-master: "gretl-guide"
%%% End: