Gretl User's Guide
Gnu Regression, Econometrics and Time-series Library

Allin Cottrell, Department of Economics, Wake Forest University
Riccardo "Jack" Lucchetti, Dipartimento di Economia, Università Politecnica delle Marche

September 2023

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation (see http://www.gnu.org/licenses/fdl.html).

Contents

1 Introduction: 1.1 Features at a glance; 1.2 Acknowledgements; 1.3 Installing the programs

Part I: Running the program

2 Getting started: 2.1 Let's run a regression; 2.2 Estimation output; 2.3 The main window menus; 2.4 Keyboard shortcuts; 2.5 The gretl toolbar

3 Modes of working: 3.1 Command scripts; 3.2 Saving script objects; 3.3 The gretl console; 3.4 The Session concept

4 Data files: 4.1 Data file formats; 4.2 Databases; 4.3 Creating a dataset from scratch; 4.4 Structuring a dataset; 4.5 Panel data specifics; 4.6 Missing data values; 4.7 Maximum size of data sets; 4.8 Data file collections; 4.9 Assembling data from multiple sources

5 Subsampling a dataset: 5.1 Introduction; 5.2 Setting the sample; 5.3 Restricting the sample; 5.4 Panel data; 5.5 Resampling and bootstrapping

6 Graphs and plots: 6.1 Gnuplot graphs; 6.2 Plotting graphs from scripts; 6.3 Boxplots

7 Joining data sources: 7.1 Introduction; 7.2 Basic syntax; 7.3 Filtering; 7.4 Matching with keys; 7.5 Aggregation; 7.6 String-valued key variables; 7.7 Importing multiple series; 7.8 A real-world case; 7.9 The representation of dates; 7.10 Time-series data; 7.11 Special handling of time columns; 7.12 Panel data; 7.13 Memo: join options

8 Real-time data: 8.1 Introduction; 8.2 Atomic format for real-time data; 8.3 More on time-related options; 8.4 Getting a certain data vintage; 8.5 Getting the n-th release for each observation period; 8.6 Getting the values at a fixed lag after the observation period; 8.7 Getting the revision history for an observation

9 Temporal disaggregation: 9.1 Introduction; 9.2 Notation and design; 9.3 Overview of data handling; 9.4 Extrapolation; 9.5 Function signature; 9.6 Handling of deterministic terms; 9.7 Some technical details; 9.8 The plot option; 9.9 Multiple low-frequency series; 9.10 Examples

10 Special functions in genr: 10.1 Introduction; 10.2 Cumulative densities and p-values; 10.3 Retrieving internal variables (dollar accessors)

11 Gretl data types: 11.1 Introduction; 11.2 Series; 11.3 Scalars; 11.4 Matrices; 11.5 Lists; 11.6 Strings; 11.7 Bundles; 11.8 Arrays; 11.9 The life cycle of gretl objects

12 Discrete variables: 12.1 Declaring variables as discrete; 12.2 Commands for discrete variables

13 Loop constructs: 13.1 Introduction; 13.2 Loop control variants; 13.3 Special controls; 13.4 Progressive mode; 13.5 Loop examples

14 User-defined functions: 14.1 Defining a function; 14.2 Calling a function; 14.3 Deleting a function; 14.4 Function programming details; 14.5 Function packages

15 Named lists and strings: 15.1 Named lists; 15.2 Named strings

16 String-valued series: 16.1 Introduction; 16.2 Creating a string-valued series; 16.3 Permitted operations; 16.4 String-valued series and functions; 16.5 Other import formats

17 Matrix manipulation: 17.1 Creating matrices; 17.2 Empty matrices; 17.3 Selecting submatrices; 17.4 Deleting rows or columns; 17.5 Matrix operators; 17.6 Matrix-scalar operators; 17.7 Matrix functions; 17.8 Matrix accessors; 17.9 Namespace issues; 17.10 Creating a data series from a matrix; 17.11 Matrices and lists; 17.12 Deleting a matrix; 17.13 Printing a matrix; 17.14 Example: OLS using matrices

18 Complex matrices: 18.1 Introduction; 18.2 Creating a complex matrix; 18.3 Indexation; 18.4 Operators; 18.5 Functions; 18.6 File input/output; 18.7 Backward incompatibility

19 Calendar dates: 19.1 Introduction; 19.2 Date and time representations; 19.3 Converting between representations; 19.4 Epoch day arithmetic; 19.5 Other accessors and functions; 19.6 Working with pre-Gregorian dates

20 Handling mixed-frequency data: 20.1 Basics; 20.2 The notion of a MIDAS list; 20.3 High-frequency lag lists; 20.4 High-frequency first differences; 20.5 MIDAS-related plots; 20.6 Alternative MIDAS data methods

21 Cheat sheet: 21.1 Dataset handling; 21.2 Creating/modifying variables; 21.3 Neat tricks

Part II: Econometric methods

22 Robust covariance matrix estimation: 22.1 Introduction; 22.2 Cross-sectional data and the HCCME; 22.3 Time series data and HAC covariance matrices; 22.4 Special issues with panel data; 22.5 The cluster-robust estimator

23 Panel data: 23.1 Estimation of panel models; 23.2 Autoregressive panel models

24 Dynamic panel models: 24.1 Introduction; 24.2 Usage; 24.3 Replication of DPD results; 24.4 Cross-country growth example; 24.5 Auxiliary test statistics; 24.6 Post-estimation available statistics; 24.7 Memo: dpanel options

25 Nonlinear least squares: 25.1 Introduction and examples; 25.2 Initializing the parameters; 25.3 NLS dialog window; 25.4 Analytical and numerical derivatives; 25.5 Advanced use; 25.6 Controlling termination; 25.7 Details on the code; 25.8 Numerical accuracy

26 Maximum likelihood estimation: 26.1 Generic ML estimation with gretl; 26.2 Syntax; 26.3 Covariance matrix and standard errors; 26.4 Gamma estimation; 26.5 Stochastic frontier cost function; 26.6 GARCH models; 26.7 Analytical derivatives; 26.8 Debugging ML scripts; 26.9 Using functions; 26.10 Advanced use of mle: functions, analytical derivatives, algorithm choice; 26.11 Estimating constrained models; 26.12 Handling non-convergence gracefully

27 GMM estimation: 27.1 Introduction and terminology; 27.2 GMM as Method of Moments; 27.3 OLS as GMM; 27.4 TSLS as GMM; 27.5 Covariance matrix options; 27.6 A real example: the Consumption Based Asset Pricing Model; 27.7 Caveats

28 Model selection criteria: 28.1 Introduction; 28.2 Information criteria

29 Degrees of freedom correction: 29.1 Introduction; 29.2 Back to basics; 29.3 Application to OLS regression; 29.4 Beyond OLS; 29.5 Consistency and awkward cases; 29.6 What gretl does

30 Time series filters: 30.1 Fractional differencing; 30.2 The Hodrick-Prescott filter; 30.3 The Baxter and King filter; 30.4 The Butterworth filter; 30.5 The discrete Fourier transform

31 Univariate time series models: 31.1 Introduction; 31.2 ARIMA models; 31.3 Unit root tests; 31.4 Cointegration test; 31.5 ARCH and GARCH

32 Vector Autoregressions: 32.1 Notation; 32.2 Estimation; 32.3 Structural VARs; 32.4 Residual-based diagnostic tests

33 Cointegration and Vector Error Correction Models: 33.1 Introduction; 33.2 Vector Error Correction Models as representation of a cointegrated system; 33.3 Interpretation of the deterministic components; 33.4 The Johansen cointegration tests; 33.5 Identification of the cointegration vectors; 33.6 Over-identifying restrictions; 33.7 Numerical solution methods

34 Multivariate models: 34.1 The system command; 34.2 Equation systems within functions; 34.3 Restriction and estimation; 34.4 System accessors

35 Forecasting: 35.1 Introduction; 35.2 Saving and inspecting fitted values; 35.3 The fcast command; 35.4 Univariate forecast evaluation statistics; 35.5 Forecasts based on VAR models; 35.6 Forecasting from simultaneous systems

36 State Space Modeling: 36.1 Introduction; 36.2 Notation; 36.3 Defining the model as a bundle; 36.4 Special features of state-space bundles; 36.5 The kfilter function; 36.6 The ksmooth function; 36.7 The kdsmooth function; 36.8 Diffuse initialization of the state vector; 36.9 Extensions and refinements; 36.10 The ksimul function; 36.11 Numerical optimization; 36.12 Example scripts; 36.13 Graphical interface

37 Numerical methods: 37.1 Derivative-based optimization methods; 37.2 Derivative-free optimization methods; 37.3 Numerical differentiation; 37.4 Numerical integration

38 Discrete and censored dependent variables: 38.1 Logit and probit models; 38.2 Ordered response models; 38.3 Multinomial logit; 38.4 Bivariate probit; 38.5 Panel estimators; 38.6 The Tobit model; 38.7 Interval regression; 38.8 Sample selection model; 38.9 Count data; 38.10 Duration models

39 Quantile regression: 39.1 Introduction; 39.2 Basic syntax; 39.3 Confidence intervals; 39.4 Multiple quantiles; 39.5 Large datasets

40 Nonparametric methods: 40.1 Locally weighted regression (loess); 40.2 The Nadaraya-Watson estimator

41 MIDAS models: 41.1 Parsimonious parameterizations; 41.2 Estimating MIDAS models; 41.3 Parameterization functions

Part III: Technical details

42 Gretl and ODBC: 42.1 ODBC support; 42.2 ODBC base concepts; 42.3 Syntax; 42.4 Examples; 42.5 Connectivity details

43 Gretl and TeX: 43.1 Introduction; 43.2 TeX-related menu items; 43.3 Fine-tuning typeset output; 43.4 Installing and learning TeX

44 Gretl and R: 44.1 Introduction; 44.2 Starting an interactive R session; 44.3 Running an R script; 44.4 Sending data back and forth; 44.5 Interacting with R from the command line; 44.6 Performance issues with R; 44.7 Further use of the R library

45 Gretl and Ox: 45.1 Introduction; 45.2 Ox support in gretl; 45.3 Illustration: replication of DPD model

46 Gretl and Octave: 46.1 Introduction; 46.2 Octave support in gretl; 46.3 Illustration: spectral methods

47 Gretl and Stata

48 Gretl and Python: 48.1 Introduction; 48.2 Python support in gretl; 48.3 Illustration: linear regression with multicollinearity

49 Gretl and Julia: 49.1 Introduction; 49.2 Julia support in gretl; 49.3 Illustration

50 Troubleshooting gretl: 50.1 Bug reports; 50.2 Auxiliary programs

51 The command line interface

Part IV: Appendices

A Data file details: A.1 Basic native format; A.2 Binary data file format; A.3 Native database format

B Building gretl: B.1 Installing the prerequisites; B.2 Getting the source: release or git; B.3 Configure the source; B.4 Build and install

C Numerical accuracy

D Related free software

E Listing of URLs

Bibliography

Chapter 1
Introduction
1.1 Features at a glance

Gretl is an econometrics package, including a shared library, a command-line client program and a graphical user interface.

User-friendly: Gretl offers an intuitive user interface; it is very easy to get up and running with econometric analysis. Thanks to its association with the econometrics textbooks by Ramu Ramanathan, Jeffrey Wooldridge, and James Stock and Mark Watson, the package offers many practice data files and command scripts. These are well annotated and accessible. Two other useful resources for gretl users are the available documentation and the gretl-users mailing list.

Flexible: You can choose your preferred point on the spectrum from interactive point-and-click to complex scripting, and can easily combine these approaches.

Cross-platform: Gretl's "home" platform is Linux but it is also available for MS Windows and Mac OS X, and should work on any unix-like system that has the appropriate basic libraries (see Appendix B).

Open source: The full source code for gretl is available to anyone who wants to critique it, patch it, or extend it. See Appendix B.

Sophisticated: Gretl offers a full range of least-squares based estimators, both for single equations and for systems, including vector autoregressions and vector error correction models. Several specific maximum likelihood estimators (e.g. probit, ARIMA, GARCH) are also provided natively; more advanced estimation methods can be implemented by the user via generic maximum likelihood or nonlinear GMM.

Extensible: Users can enhance gretl by writing their own functions and procedures in gretl's scripting language, which includes a wide range of matrix functions.

Accurate: Gretl has been thoroughly tested on several benchmarks, among which the NIST reference datasets. See Appendix C.

Internet ready: Gretl can fetch materials such as databases, collections of textbook datafiles and add-on packages over the internet.

International: Gretl will produce its output in English, French, Italian, Spanish, Polish, Portuguese, German, Basque, Turkish, Russian, Albanian or Greek, depending on your computer's native language setting.

1.2 Acknowledgements

The gretl code base originally derived from the program ESL ("Econometrics Software Library"), written by Professor Ramu Ramanathan of the University of California, San Diego. We are much in debt to Professor Ramanathan for making this code available under the GNU General Public Licence and for helping to steer gretl's early development.

We are also grateful to the authors of several econometrics textbooks for permission to package for gretl various datasets associated with their texts. This list currently includes William Greene, author of Econometric Analysis; Jeffrey Wooldridge (Introductory Econometrics: A Modern Approach); James Stock and Mark Watson (Introduction to Econometrics); Damodar Gujarati (Basic Econometrics); Russell Davidson and James MacKinnon (Econometric Theory and Methods); and Marno Verbeek (A Guide to Modern Econometrics).

GARCH estimation in gretl is based on code deposited in the archive of the Journal of Applied Econometrics by Professors Fiorentini, Calzolari and Panattoni, and the code to generate p-values for Dickey-Fuller tests is due to James MacKinnon. In each case we are grateful to the authors for permission to use their work.

With regard to the internationalization of gretl, thanks go to Ignacio Díaz-Emparanza (Spanish), Michel Robitaille and Florent Bresson (French), Cristian Rigamonti (Italian), Tadeusz Kufel and Pawel Kufel (Polish), Markus Hahn and Sven Schreiber (German), Hélio Guilherme and Henrique Andrade (Portuguese),
Susan Orbe (Basque), Talha Yalta (Turkish) and Alexander Gedranovich (Russian).

Gretl has benefitted greatly from the work of numerous developers of free, open-source software; for specifics please see Appendix B. Our thanks are due to Richard Stallman of the Free Software Foundation, for his support of free software in general and for agreeing to "adopt" gretl as a GNU program in particular.

Many users of gretl have submitted useful suggestions and bug reports. In this connection particular thanks are due to Ignacio Díaz-Emparanza, Tadeusz Kufel, Pawel Kufel, Alan Isaac, Cri Rigamonti, Sven Schreiber, Talha Yalta, Andreas Rosenblad and Dirk Eddelbuettel, who maintains the gretl package for Debian GNU/Linux.

1.3 Installing the programs

Linux

On the Linux[1] platform you have the choice of compiling the gretl code yourself or making use of a pre-built package. Building gretl from the source is necessary if you want to access the development version or customize gretl to your needs, but this takes quite a few skills; most users will want to go for a pre-built package.

Some Linux distributions feature gretl as part of their standard offering: Debian, Ubuntu and Fedora, for example. If this is the case, all you need to do is install gretl through your package manager of choice. In addition the gretl webpage at http://gretl.sourceforge.net offers a generic package in rpm format for modern Linux systems.

If you prefer to compile your own (or are using a unix system for which pre-built packages are not available), instructions on building gretl can be found in Appendix B.

[1] In this manual we use "Linux" as shorthand to refer to the GNU/Linux operating system. What is said herein about Linux mostly applies to other unix-type systems too, though some local modifications may be needed.

MS Windows

The MS Windows version comes as a self-extracting executable. Installation is just a matter of downloading gretl_install.exe and running this program. You will be prompted for a location to install the package.

Mac OS X

The Mac version comes as a gzipped disk image. Installation is a matter of downloading the image file, opening it in the Finder, and dragging Gretl.app to the "Applications" folder. However, when installing for the first time two prerequisite packages must be put in place first; details are given on the gretl website.

Part I: Running the program

Chapter 2
Getting started

2.1 Let's run a regression

This introduction is mostly angled towards the graphical client program; please see Chapter 51 below and the Gretl Command Reference for details on the command-line program, gretlcli.

You can supply the name of a data file to open as an argument to gretl, but for the moment let's not do that: just fire up the program.[1] You should see a main window (which will hold information on the data set, but which is at first blank) and various menus, some of them disabled at first.

[1] For convenience we refer to the graphical client program simply as gretl in this manual. Note, however, that the specific name of the program differs according to the computer platform. On Linux it is called gretl_x11, while on MS Windows it is gretl.exe. On Linux systems a wrapper script named gretl is also installed; see also the Gretl Command Reference.

What can you do at this point? You can browse the supplied data files (or databases), open a data file, create a new data file, read the help items, or open a command script. For now let's browse the supplied data files. Under the File menu choose "Open data, Sample file". A second notebook-type window will open, presenting the sets of data files supplied with the package (see Figure 2.1). Select the first tab, "Ramanathan". The numbering of the files in this section corresponds to the chapter organization of Ramanathan (2002), which contains discussion of the analysis of these data. The data will be useful for practice purposes even without the text.

Figure 2.1: Practice data files window

If you select a row in this window and click on "Info" this opens
a window showing information on the data set in question (for example, on the sources and definitions of the variables). If you find a file that is of interest, you may open it by clicking on "Open", or just double-clicking on the file name. For the moment let's open data3-6.

In gretl windows containing lists, double-clicking on a line launches a default action for the associated list entry: e.g. displaying the values of a data series, opening a file.

This file contains data pertaining to a classic econometric "chestnut", the consumption function. The data window should now display the name of the current data file, the overall data range and sample range, and the names of the variables along with brief descriptive tags; see Figure 2.2.

Figure 2.2: Main window with a practice data file open

OK, what can we do now? Hopefully the various menu options should be fairly self-explanatory. For now we'll dip into the Model menu; a brief tour of all the main window menus is given in Section 2.3 below.

Gretl's Model menu offers various econometric estimation routines. The simplest and most standard is Ordinary Least Squares (OLS). Selecting OLS pops up a dialog box calling for a model specification; see Figure 2.3.

Figure 2.3: Model specification dialog

To select the dependent variable, highlight the variable you want in the list on the left and click the arrow that points to the "Dependent variable" slot. If you check the "Set as default" box, this variable will be pre-selected as dependent when you next open the model dialog box. Shortcut: double-clicking on a variable on the left selects it as dependent and also sets it as the default. To select independent variables, highlight them on the left and click the green arrow (or right-click the highlighted variable); to remove variables from the selected list, use the red arrow. To select several variables in the list box, drag the mouse over them; to select several non-contiguous variables, hold down the Ctrl key and click on the variables you want.

To run a regression with consumption as the dependent variable and income as independent, click Ct into the "Dependent" slot and add Yt to the "Independent variables" list.
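The same regression can also be run non-interactively. As a minimal sketch (the scripting interface is introduced in Chapter 3; Ct and Yt are the series names in the practice file opened above):

    open data3-6    # load the consumption-function practice file
    ols Ct 0 Yt     # regress Ct on a constant (0) and Yt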
2.2 Estimation output

Once you've specified a model, a window displaying the regression output will appear. The output is reasonably comprehensive and in a standard format (Figure 2.4).

Figure 2.4: Model output window

The output window contains menus that allow you to inspect or graph the residuals and fitted values, and to run various diagnostic tests on the model.

For most models there is also an option to print the regression output in LaTeX format. See Chapter 43 for details.

To import gretl output into a word processor, you may copy and paste from an output window, using its Edit menu (or Copy button, in some contexts), to the target program. Many (not all) gretl windows offer the option of copying in RTF (Microsoft's "Rich Text Format") or as LaTeX. If you are pasting into a word processor, RTF may be a good option because the tabular formatting of the output is preserved.[2] Alternatively, you can save the output to a plain text file, then import the file into the target program. When you finish a gretl session you are given the option of saving all the output from the session to a single file.

[2] Note that when you copy as RTF under MS Windows, Windows will only allow you to paste the material into applications that "understand" RTF. Thus you will be able to paste into MS Word, but not into notepad. Note also that there appears to be a bug in some versions of Windows, whereby the paste will not work properly unless the "target" application (e.g. MS Word) is already running prior to copying the material in question.

Note that on the gnome desktop and under MS Windows, the File menu includes a command to send the output directly to a printer.

When pasting or importing plain text gretl output into a word processor, select a monospaced or typewriter-style font (e.g. Courier) to preserve the output's tabular formatting. Select a small font (10-point Courier should do) to prevent the output lines from being broken in the wrong place.

2.3 The main window menus

Reading left to right along the main window's menu bar, we find the File, Tools, Data, View, Add, Sample, Variable, Model and Help menus.

File menu

- Open data: Open a native gretl data file or import from other formats. See Chapter 4.
- Append data: Add data to the current working data set, from a gretl data file, a comma-separated values file or a spreadsheet file.
- Save data: Save the currently open native gretl data file.
- Save data as: Write out the current data set in native format, with the option of using gzip data compression. See Chapter 4.
- Export data: Write out the current data set in Comma Separated Values (CSV) format, or the formats of GNU R or GNU Octave. See Chapter 4 and also Appendix D.
- Send to: Send the current data set as an e-mail attachment.
- New data set: Allows you to create a blank data set, ready for typing in values or for importing series from a database. See below for more on databases.
- Clear data set: Clear the current data set out of memory. Generally you don't have to do this (since opening a new data file automatically clears the old one) but sometimes it's useful.
- Working directory: Change the current working directory ("workdir") and specify related options. For an explanation of the role of the workdir, click the Help button in the dialog window which is presented, or refer to the documentation of the set command with the workdir option in the command reference.
- Script files: A "script" is a file containing a sequence of gretl commands. This item contains entries that let you open a script you have created previously ("User file"), open a sample script, or open an editor window in which you can create a new script.
- Session files: A "session" file contains a snapshot of a previous gretl session, including the data set used and any models or graphs that you saved. Under this item you can open a saved session or save the current session.
- Databases: Allows you to browse various large databases, either on your own computer or, if you are connected to the internet, on the gretl database server. See Section 4.2 for details.
- Function packages: Manage user-contributed function packages that extend gretl's capabilities. To learn more about such packages, written in gretl's built-in matrix and scripting language "hansl", please refer to the "Packages" entry in the Help menu.
- Resource from addon: Access example scripts and datafiles that are shipped as part of gretl's official "addons". (Addons are function packages that are more tightly integrated with the gretl program than standard user-contributed packages.)
- Exit: Quit the program. You'll be prompted to save any unsaved work.
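In script use, the working directory mentioned above is likewise controlled by the set command; a one-line sketch (the path shown is purely hypothetical):

    set workdir "/home/user/econ_projects"   # hypothetical path; relative filenames resolve here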
Tools menu

- Statistical tables: Look up critical values for commonly used distributions (normal or Gaussian, t, chi-square, F and Durbin-Watson).
- P-value finder: Look up p-values from the Gaussian, t, chi-square, F, gamma, binomial or Poisson distributions. See also the pvalue command in the Gretl Command Reference.
- Distribution graphs: Produce graphs of various probability distributions. In the resulting graph window, the pop-up menu includes an item "Add another curve", which enables you to superimpose a further plot (for example, you can draw the t distribution with various different degrees of freedom).
- Test statistic calculator: Calculate test statistics and p-values for a range of common hypothesis tests (population mean, variance and proportion; difference of means, variances and proportions).
- Nonparametric tests: Calculate test statistics for various nonparametric tests (Sign test, Wilcoxon rank sum test, Wilcoxon signed rank test, Runs test).
- Seed for random numbers: Set the seed for the random number generator (by default this is set based on the system time when the program is started).
- Command log: Open a window containing a record of the commands executed so far.
- Gretl console: Open a "console" window into which you can type commands, as you would using the command-line program gretlcli (as opposed to using point-and-click).
- Start Gnu R: Start R (if it is installed on your system), and load a copy of the data set currently open in gretl. See Appendix D.
- Sort variables: Rearrange the listing of variables in the main window, either by ID number or alphabetically by name.
- Function packages: Handles function packages (see Section 14.5), which allow you to access functions written by other users and share the ones written by you.
- NIST test suite: Check the numerical accuracy of gretl against the reference results for linear regression made available by the (US) National Institute of Standards and Technology.
- Preferences: Set the paths to various files gretl needs to access. Choose the font in which gretl displays text output. Activate or suppress gretl's "messaging" about the availability of program updates, and so on. See the Gretl Command Reference for further details.
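The p-value finder and random-number seed have direct script counterparts; a brief sketch (the degrees of freedom, test value and seed are arbitrary illustrations):

    set seed 371204     # make random draws reproducible
    pvalue t 20 1.72    # p-value for a t(20) statistic of 1.72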
Data menu

- Select all: Several menu items act upon those variables that are currently selected in the main window. This item lets you select all the variables.
- Display values: Pops up a window with a simple (not editable) printout of the values of the selected variable or variables.
- Edit values: Opens a spreadsheet window where you can edit the values of the selected variables.
- Add observations: Gives a dialog box in which you can choose a number of observations to add at the end of the current dataset; for use with forecasting.
- Remove extra observations: Active only if extra observations have been added automatically in the process of forecasting; deletes these extra observations.
- Read info, Edit info: "Read info" just displays the summary information for the current data file; "Edit info" allows you to make changes to it (if you have permission to do so).
- Print description: Opens a window containing a full account of the current dataset, including the summary information and any specific information on each of the variables.
- Add case markers: Prompts for the name of a text file containing "case markers" (short strings identifying the individual observations) and adds this information to the data set. See Chapter 4.
- Remove case markers: Active only if the dataset has case markers identifying the observations; removes these case markers.
- Dataset structure: Invokes a series of dialog boxes which allow you to change the structural interpretation of the current dataset. For example, if data were read in as a cross section you can get the program to interpret them as time series or as a panel. See also section 4.4.
- Compact data: For time-series data of higher than annual frequency, gives you the option of compacting the data to a lower frequency, using one of four compaction methods (average, sum, start of period or end of period).
- Expand data: For time-series data, gives you the option of expanding the data to a higher frequency.
- Transpose data: Turn each observation into a variable and vice versa (or in other words, each row of the data matrix becomes a column in the modified data matrix); can be useful with imported data that have been read in "sideways".

View menu

- Icon view: Opens a window showing the content of the current session as a set of icons; see section 3.4.
- Graph specified vars: Gives a choice between a time series plot, a regular X-Y scatter plot, an X-Y plot using impulses (vertical bars), an X-Y plot "with factor separation" (i.e. with the points colored differently depending on the value of a given dummy variable), boxplots, and a 3-D graph. Serves up a dialog box where you specify the variables to graph. See Chapter 6 for details.
- Multiple graphs: Allows you to compose a set of up to six small graphs, either pairwise scatter-plots or time-series graphs. These are displayed together in a single window.
- Summary statistics: Shows a full set of descriptive statistics for the variables selected in the main window.
- Correlation matrix: Shows the pairwise correlation coefficients for the selected variables.
- Cross Tabulation: Shows a cross-tabulation of the selected variables. This works only if at least two variables in the data set have been marked as discrete (see Chapter 12).
- Principal components: Produces a Principal Components Analysis for the selected variables.
- Mahalanobis distances: Computes the Mahalanobis distance of each observation from the centroid of the selected set of variables.
- Cross-correlogram: Computes and graphs the cross-correlogram for two selected variables.

Add menu: Offers various standard transformations of variables (logs, lags, squares, etc.) that you may wish to add to the data set. Also gives the option of adding random variables and (for time-series data) adding seasonal dummy variables (e.g. quarterly dummy variables for quarterly data).

Sample menu

- Set range: Select a different starting and/or ending point for the current sample, within the range of data available.
- Restore full range: Self-explanatory.
- Define, based on dummy: Given a dummy (indicator) variable with values 0 or 1, this drops from the current sample all observations for which the dummy variable has value 0.
- Restrict, based on criterion: Similar to the item above, except that you don't need a pre-defined variable: you supply a Boolean expression (e.g. sqft > 1400) and the sample is restricted to observations satisfying that condition. See the entry for genr in the Gretl Command Reference for details on the Boolean operators that can be used. (A scripted equivalent using the smpl command is sketched after this list.)
- Random sub-sample: Draw a random sample from the full dataset.
- Drop all obs with missing values: Drop from the current sample all observations for which at least one variable has a missing value (see Section 4.6).
- Count missing values: Give a report on observations where data values are missing. May be useful in examining a panel data set, where it's quite common to encounter missing values.
- Set missing value code: Set a numerical value that will be interpreted as "missing" or "not available". This is intended for use with imported data, when gretl has not recognized the missing-value code used.
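As referenced in the "Restrict, based on criterion" item above, the same restriction can be imposed in a script via smpl; a minimal sketch (sqft is a variable found in one of the Ramanathan practice files):

    smpl sqft > 1400 --restrict   # keep only observations satisfying the condition
    smpl full                     # restore the full sample afterwards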
Variable menu

Most items under here operate on a single variable at a time. The "active" variable is set by highlighting it (clicking on its row) in the main data window. Most options will be self-explanatory. Note that you can rename a variable and can edit its descriptive label under "Edit attributes". You can also "Define a new variable" via a formula (e.g. involving some function of one or more existing variables). For the syntax of such formulae, look at the online help for "Generate variable syntax" or see the genr command in the Gretl Command Reference. One simple example:

    foo = x1 * x2

will create a new variable foo as the product of the existing variables x1 and x2. In these formulae, variables must be referenced by name, not number.
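A few more formulae in the same vein may be useful as a sketch (the series names are illustrative; log, diff and the lag notation are standard genr constructs):

    lY = log(Yt)      # natural logarithm
    dY = diff(Yt)     # first difference
    Yt_1 = Yt(-1)     # one-period lag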
Model menu: For details on the various estimators offered under this menu, please consult the Gretl Command Reference. Also see Chapter 25 regarding the estimation of nonlinear models.

Help menu: Please use this as needed! It gives details on the syntax required in various dialog entries.

2.4 Keyboard shortcuts

When working in the main gretl window, some common operations may be performed using the keyboard, as follows.

- Return: Opens a window displaying the values of the currently selected variables; it is the same as selecting "Data, Display Values".
- Delete: Pressing this key has the effect of deleting the selected variables. A confirmation is required, to prevent accidental deletions.
- e: Has the same effect as selecting "Edit attributes" from the Variable menu.
- F2: Same as "e". Included for compatibility with other programs.
- g: Has the same effect as selecting "Define new variable" from the Variable menu (which maps onto the genr command).
- h: Opens a help window for gretl commands.
- F1: Same as "h". Included for compatibility with other programs.
- r: Refreshes the variable list in the main window.
- t: Graphs the selected variable; a line graph is used for time-series datasets, whereas a distribution plot is used for cross-sectional data.

2.5 The gretl toolbar

At the bottom left of the main window sits the toolbar. The icons have the following functions, reading from left to right:

1. Launch a calculator program. A convenience function in case you want quick access to a calculator when you're working in gretl. The default program is calc.exe under MS Windows, or xcalc under the X window system. You can change the program under the "Tools, Preferences, General" menu, "Programs" tab.
2. Start a new script. Opens an editor window in which you can type a series of commands to be sent to the program as a batch.
3. Open the gretl console. A shortcut to the "Gretl console" menu item (Section 2.3 above).
4. Open the session icon window.
5. Open a window displaying available gretl function packages.
6. Open this manual in PDF format.
7. Open the help item for script commands syntax (i.e. a listing with details of all available commands).
8. Open the dialog box for defining a graph.
9. Open the dialog box for estimating a model using ordinary least squares.
10. Open a window listing the sample datasets supplied with gretl (and any other data file collections that have been installed).

Chapter 3
Modes of working

3.1 Command scripts

As you execute commands in gretl, using the GUI and filling in dialog entries, those commands are recorded in the form of a "script" or batch file. Such scripts can be edited and re-run, using either gretl or the command-line client, gretlcli.

To view the current state of the script at any point in a gretl session, choose "Command log" under the Tools menu. This log file is called session.inp and it is overwritten whenever you start a new session. To preserve it, save the script under a different name. Script files will be found most easily, using the GUI file selector, if you name them with the extension ".inp".

To open a script you have written independently, use the "File, Script files" menu item; to create a script from scratch use the "File, Script files, New script" item or the "new script" toolbar button. In either case a "script window" will open (see Figure 3.1).

Figure 3.1: Script window, editing a command file

The toolbar at the top of the script window offers the following functions (left to right): (1) Save the file; (2) Save the file under a specified name; (3) Print the file (this option is not available on all platforms); (4) Execute the commands in the file; (5) Copy selected text; (6) Paste the selected text; (7) Find and replace text; (8) Undo the last Paste or Replace action; (9) Help (if you place the cursor in a command word and press the question mark you will get help on that command); (10) Close the window.

When you execute the script, by clicking on the Execute icon or by pressing Ctrl-r, all output is directed to a single window, where it can be edited, saved or copied to the clipboard. To learn more about the possibilities of scripting, take a look at the gretl Help item "Command reference", or start up the command-line program gretlcli and consult its help, or consult the Gretl Command Reference.

If you run the script when part of it is highlighted, gretl will only run that portion. Moreover, if you want to run just the current line, you can do so by pressing Ctrl-Enter.[1]

[1] This feature is not unique to gretl; other econometric packages offer the same facility. However, experience shows that while this can be remarkably useful, it can also lead to writing "dinosaur" scripts that are never meant to be executed all at once, but rather used as a chaotic repository to cherry-pick snippets from. Since gretl allows you to have several script windows open at the same time, you may want to keep your scripts tidy and reasonably small.

Clicking the right mouse button in the script editor window produces a pop-up menu. This gives you the option of executing either the line on which the cursor is located, or the selected region of the script if there's a selection in place. If the script is editable, this menu also gives the option of adding or removing comment markers from the start of the line or lines.

The gretl package includes over 70 example scripts. Many of these relate to Ramanathan (2002), but they may also be used as a free-standing introduction to scripting in gretl and to various points of econometric theory. You can explore the example files under "File, Script files, Example scripts". There you will find a listing of the files along with a brief description of the points they illustrate and the data they employ. Open any file and run it to see the output.

Note that long commands in a script can be broken over two or more lines, using backslash as a continuation character.
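For instance, a sketch of a continued command (--robust is a standard option flag for ols):

    ols Ct 0 Yt \
      --robust    # the backslash continues the command onto a second line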
You can, if you wish, use the GUI controls and the scripting approach in tandem, exploiting each method where it offers greater convenience. Here are two suggestions.

- Open a data file in the GUI. Explore the data: generate graphs, run regressions, perform tests. Then open the Command log, edit out any redundant commands, and save it under a specific name. Run the script to generate a single file containing a concise record of your work.
- Start by establishing a new script file. Type in any commands that may be required to set up transformations of the data (see the genr command in the Gretl Command Reference). Typically this sort of thing can be accomplished more efficiently via commands assembled with forethought rather than point-and-click. Then save and run the script: the GUI data window will be updated accordingly. Now you can carry out further exploration of the data via the GUI. To revisit the data at a later point, open and re-run the "preparatory" script first.

Scripts and data files

One common way of doing econometric research with gretl is as follows: compose a script; execute the script; inspect the output; modify the script; run it again, with the last three steps repeated as many times as necessary. In this context, note that when you open a data file this clears out most of gretl's internal state. It's therefore probably a good idea to have your script start with an open command: the data file will be re-opened each time, and you can be confident you're getting "fresh" results.

One further point should be noted. When you go to open a new data file via the graphical interface, you are always prompted: opening a new data file will lose any unsaved work, do you really want to do this? When you execute a script that opens a data file, however, you are not prompted. The assumption is that in this case you're not going to lose any work, because the work is embodied in the script itself (and it would be annoying to be prompted at each iteration of the work cycle described above). This means you should be careful if you've done work using the graphical interface and then decide to run a script: the current data file will be replaced without any questions asked, and it's your responsibility to save any changes to your data first.

3.2 Saving script objects

When you estimate a model using point-and-click, the model results are displayed in a separate window, offering menus which let you perform tests, draw graphs, save data from the model, and so on. Ordinarily, when you estimate a model using a script you just get a non-interactive printout of the results. You can, however, arrange for models estimated in a script to be "captured", so that you can examine them interactively when the script is finished. Here is an example of the syntax for achieving this effect:

    Model1 <- ols Ct 0 Yt

That is, you type a name for the model to be saved under, then a back-pointing "assignment arrow", then the model command. The assignment arrow is composed of the less-than sign followed by a dash; it must be separated by spaces from both the preceding name and the following command. The name for a saved object may include spaces, but in that case it must be wrapped in double quotes:

    "Model 1" <- ols Ct 0 Yt

Models saved in this way will appear as icons in the gretl icon view window (see Section 3.4) after the script is executed. In addition, you can arrange to have a named model displayed (in its own window) automatically as follows:

    Model1.show

Again, if the name contains spaces it must be quoted:

    "Model 1".show

The same facility can be used for graphs. For example the following will create a plot of Ct against Yt, save it under the name "CrossPlot" (it will appear under this name in the icon view window), and have it displayed:

    CrossPlot <- gnuplot Ct Yt
    CrossPlot.show

You can also save the output from selected commands as named pieces of text (again, these will appear in the session icon window, from where you can open them later). For example, this command sends the output from an augmented Dickey-Fuller test to a "text object" named ADF1 and displays it in a window:

    ADF1 <- adf 2 x1
    ADF1.show

Objects saved in this way (whether models, graphs or pieces of text output) can be destroyed using the command .free appended to the name of the object, as in ADF1.free.
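Putting these pieces together, here is a sketch of the full life cycle of a saved object in a script (series names as in the consumption example above):

    open data3-6
    Model1 <- ols Ct 0 Yt   # estimate and save as a session icon
    Model1.show             # display the saved model in its own window
    Model1.free             # destroy the object when done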
3.3 The gretl console

A further option is available for your computing convenience. Under gretl's "Tools" menu you will find the item "Gretl console" (there is also an "open gretl console" button on the toolbar in the main window). This opens up a window in which you can type commands and execute them one by one (by pressing the Enter key) interactively. This is essentially the same as gretlcli's mode of operation, except that the GUI is updated based on commands executed from the console, enabling you to work back and forth as you wish.

In the console you have "command history"; that is, you can use the up and down arrow keys to navigate the list of commands you have entered to date. You can retrieve, edit and then re-enter a previous command.

In console mode, you can create, display and free objects (models, graphs or text) as described above for script mode.

3.4 The Session concept

Gretl offers the idea of a "session" as a way of keeping track of your work and revisiting it later. The basic idea is to provide an iconic space containing various objects pertaining to your current working session (see Figure 3.2). You can add objects (represented by icons) to this space as you go along. If you save the session, these added objects should be available again if you re-open the session later.

Figure 3.2: Icon view: one model and one graph have been added to the default icons

If you start gretl and open a data set, then select "Icon view" from the View menu, you should see the basic default set of icons: these give you quick access to information on the data set (if any), the correlation matrix ("Correlations") and descriptive summary statistics ("Summary"). All of these are activated by double-clicking the relevant icon. The "Data set" icon is a little more complex: double-clicking opens up the data in the built-in spreadsheet, but you can also right-click on the icon for a menu of other actions.

To add a model to the Icon view, first estimate it using the Model menu. Then pull down the File menu in the model window and select "Save to session as icon..." or "Save as icon and close". Simply hitting the S key over the model window is a shortcut to the latter action.

To add a graph, first create it (under the View menu, "Graph specified vars", or via one of gretl's other graph-generating commands). Click on the graph window to bring up the graph menu, and select "Save to session as icon".

Once a model or graph is added, its icon will appear in the Icon view window. Double-clicking on the icon redisplays the object, while right-clicking brings up a menu which lets you display or delete the object. This popup menu also gives you the option of editing graphs.

The model table

In econometric research it is common to estimate several models with a common dependent variable, the models differing in respect of which independent variables are included, or perhaps in respect of the estimator used. In this situation it is convenient to present the regression results in the form of a table, where each column contains the results (coefficient estimates and standard errors) for a given model, and each row contains the estimates for a given variable across the models.

Note that some estimation methods are not compatible with the straightforward model table format; therefore gretl will not let those models be added to the model table. These methods include nonlinear least squares (nls), generic maximum-likelihood estimators (mle), generic GMM (gmm), dynamic panel models (dpanel), interval regressions (intreg), bivariate probit models (biprobit), ARIMA
models (arima or arma) and GARCH models (garch and arch).

In the Icon view window gretl provides a means of constructing such a table (and copying it in plain text, LaTeX or Rich Text Format). The procedure is outlined below. (The model table can also be built non-interactively, in script mode; see the entry for modeltab in the Gretl Command Reference.)

1. Estimate a model which you wish to include in the table, and in the model display window, under the File menu, select "Save to session as icon" or "Save as icon and close".
2. Repeat step 1 for the other models to be included in the table (up to a total of six models).
3. When you are done estimating the models, open the icon view of your gretl session, by selecting "Icon view" under the View menu in the main gretl window, or by clicking the "session icon view" icon on the gretl toolbar.
4. In the Icon view, there is an icon labeled "Model table". Decide which model you wish to appear in the left-most column of the model table and add it to the table, either by dragging its icon onto the Model table icon, or by right-clicking on the model icon and selecting "Add to model table" from the pop-up menu.
5. Repeat step 4 for the other models you wish to include in the table. The second model selected will appear in the second column from the left, and so on.
6. When you are finished composing the model table, display it by double-clicking on its icon. Under the Edit menu in the window which appears, you have the option of copying the table to the clipboard in various formats.
7. If the ordering of the models in the table is not what you wanted, right-click on the model table icon and select "Clear table". Then go back to step 4 above and try again.

A simple instance of gretl's model table is shown in Figure 3.3.

Figure 3.3: Example of model table

The graph page

The "graph page" icon in the session window offers a means of putting together several graphs for printing on a single page. This facility will work only if you have the LaTeX typesetting system installed, and are able to generate and view either PDF or PostScript output. The output format is controlled by your choice of program for compiling TeX files, which can be found under the "Programs" tab in the Preferences dialog box (under the Tools menu in the main window). Usually this should be pdflatex for PDF output or latex for PostScript. In the latter case you must have a working set-up for handling PostScript, which will usually include dvips, ghostscript and a viewer such as gv, ggv or kghostview.

In the Icon view window, you can drag up to eight graphs onto the graph page icon. When you double-click on the icon (or right-click and select "Display"), a page containing the selected graphs (in PDF or EPS format) will be composed and opened in your viewer. From there you should be able to print the page.

To clear the graph page, right-click on its icon and select "Clear".

As with the model table, it is also possible to manipulate the graph page via commands in script or console mode; see the entry for the graphpg command in the Gretl Command Reference.
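As a hedged sketch of the script-mode counterpart of the model table just described (the second specification, adding a lag of Yt, is purely illustrative):

    ols Ct 0 Yt
    modeltab add        # put the last estimated model in the table
    Yt_1 = Yt(-1)       # illustrative second specification with a lag
    ols Ct 0 Yt Yt_1
    modeltab add
    modeltab show       # display the model table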
Saving and re-opening sessions

If you create models or graphs that you think you may wish to re-examine later, then before quitting gretl select "Session files, Save session" from the File menu and give a name under which to save the session. To re-open the session later, either

- start gretl, then re-open the session file by going to "File, Session files, Open session"; or
- from the command line, type gretl -r sessionfile, where sessionfile is the name under which the session was saved; or
- drag the icon representing a session file onto gretl.

Chapter 4
Data files

4.1 Data file formats

Gretl has its own native format for data files. Most users will probably not want to read or write such files outside of gretl itself, but occasionally this may be useful, and details on the file formats are given in Appendix A.

The program can also import data from a variety of other formats. In the GUI program this can be done via the "File, Open Data, User file" menu; note the drop-down list of acceptable file types. In script mode, simply use the open command. The supported import formats are as follows.

- Plain text files (comma-separated or "CSV" being the most common type). For details on what gretl expects of such files, see Section 4.3.
- Spreadsheets: MS Excel, Gnumeric and Open Document (ODS). The requirements for such files are given in Section 4.3.
- Stata data files (.dta).
- SPSS data files (.sav).
- SAS "xport" files (.xpt).
- Eviews workfiles (.wf1).[1]
- JMulTi data files.

[1] See http://users.wfu.edu/cottrell/eviews_format/.

When you import data from a plain text format, gretl opens a "diagnostic" window, reporting on its progress in reading the data. If you encounter a problem with ill-formatted data, the messages in this window should give you a handle on fixing the problem.

Note that gretl has a facility for writing out data in the native formats of GNU R, Octave, JMulTi and PcGive (see Appendix D). In the GUI client this option is found under the "File, Export data" menu; in the command-line client use the store command with the appropriate option flag.

4.2 Databases

For working with large amounts of data, gretl is supplied with a database-handling routine. A database, as opposed to a data file, is not read directly into the program's workspace. A database can contain series of mixed frequencies and sample ranges. You open the database and select series to import into the working dataset. You can then save those series in a native format data file if you wish. Databases can be accessed via the menu item "File, Databases".

For details on the format of gretl databases, see Appendix A.

Online access to databases

Several gretl databases are available from Wake Forest University. Your computer must be connected to the internet for this option to work. Please see the description of the "data" command under the Help menu. Visit the gretl data page for details and updates on available data.

Foreign database formats

Thanks to Thomas Doan of Estima, who made available the specification of the database format used by RATS 4 (Regression Analysis of Time Series), gretl can handle such databases, or at least a subset of same, namely time-series databases containing monthly and quarterly series.

Gretl can also import data from PcGive databases. These take the form of a pair of files, one containing the actual data (with suffix .bn7) and one containing supplementary information (.in7).

In addition, gretl offers ODBC connectivity. Be warned: this feature is meant for somewhat advanced users; there is currently no graphical interface. Interested readers will find more info in chapter 42.

4.3 Creating a dataset from scratch

There are several ways of doing this:

1. Find, or create using a text editor, a plain text data file and open it via "Import".
2. Use your favorite spreadsheet to establish the data file, save it in comma-separated format if necessary (this may not be necessary if the spreadsheet format is MS Excel, Gnumeric or Open Document), then use one of the "Import" options.
3. Use gretl's built-in spreadsheet.
4. Select data series from a suitable database.
5. Use your favorite text editor or other software tools to create a data file in gretl format independently.
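Before turning to details: in script mode, method 1 reduces to the open command mentioned in Section 4.1, and the result can be saved in native format with store. A minimal sketch (the filenames are hypothetical):

    open mydata.csv    # import a plain text data file (hypothetical name)
    store mydata.gdt   # save the imported data in gretl's native format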
Here are a few comments and details on these methods.

Common points on imported data

Options (1) and (2) involve using gretl's "import" mechanism. For the program to read such data successfully, certain general conditions must be satisfied.

- The first row must contain valid variable names. A valid variable name is of 31 characters maximum; starts with a letter; and contains nothing but letters, numbers and the underscore character. Longer variable names will be truncated to 31 characters. Qualifications to the above: first, in the case of a plain text import, if the file contains no row with variable names the program will automatically add names, v1, v2 and so on. Second, by "the first row" is meant the first relevant row. In the case of plain text imports, blank rows and rows beginning with a hash mark are ignored. In the case of Excel, Gnumeric and ODS imports, you are presented with a dialog box where you can select an offset into the spreadsheet, so that gretl will ignore a specified number of rows and/or columns.
- Data values: these should constitute a rectangular block, with one variable per column and one observation per row. The number of variables (data columns) must match the number of variable names given. See also section 4.6. Numeric data are expected, but in the case of importing from plain text the program offers limited handling of character (string) data: if a given column contains character data only, consecutive numeric codes are substituted for the strings, and once the import is complete a table is printed showing the correspondence between the strings and the codes.
- Dates (or observation labels): optionally, the first column may contain strings such as dates, or labels for cross-sectional observations. Such strings have a maximum of 15 characters (as with variable names, longer strings will be truncated). A column of this sort should be headed with the string obs or date, or the first row entry may be left blank.

For dates to be recognized as such, the date strings should adhere to one or other of a set of specific formats, as follows:

- For annual data: 4-digit years.
- For quarterly data: a 4-digit year, followed by a separator (either a period, a colon, or the letter Q), followed by a 1-digit quarter. Examples: 1997.1, 2002:3, 1947Q1.
- For monthly data: a 4-digit year, followed by a period or a colon, followed by a two-digit month. Examples: 1997.01, 2002:10.

Plain text ("CSV") files can use comma, space, tab or semicolon as the column separator. When you open such a file via the GUI, you are given the option of specifying the separator, though in most cases it should be detected automatically.
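For concreteness, a small importable file satisfying these rules might look as follows (variable names reuse the consumption example; the values are invented for illustration, and the first column shows the quarterly date format just described):

    obs,Ct,Yt
    1990.1,355.0,420.8
    1990.2,358.4,426.1
    1990.3,361.2,430.5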
Using the built-in spreadsheet

Under the File/New data set menu you can choose the sort of dataset you want to establish (e.g. quarterly time series, cross-sectional). You will then be prompted for starting and ending dates (or observation numbers) and the name of the first variable to add to the dataset. After supplying this information you will be faced with a simple spreadsheet into which you can type data values. In the spreadsheet window, clicking the right mouse button will invoke a popup menu which enables you to add a new variable (column), to add an observation (append a row at the foot of the sheet), or to insert an observation at the selected point (move the data down and insert a blank row).

Once you have entered data into the spreadsheet you import these into gretl's workspace using the spreadsheet's "Apply changes" button.

Please note that gretl's spreadsheet is quite basic and has no support for functions or formulas. Data transformations are done via the Add or Variable menus in the main window.

Selecting from a database

Another alternative is to establish your dataset by selecting variables from a database. Begin with the File/Databases menu item. This has four forks: "Gretl native", "RATS 4", "PcGive" and "On database server". You should be able to find the file fedstl.bin in the file selector that opens if you choose the "Gretl native" option, since this file, which contains a large collection of US macroeconomic time series, is supplied with the distribution.

You won't find anything under "RATS 4" unless you have purchased RATS data (see www.estima.com). If you do possess RATS data, you should go into the Tools/Preferences/General dialog, select the Databases tab, and fill in the correct path to your RATS files.

If your computer is connected to the internet you should find several databases (at Wake Forest University) under "On database server". You can browse these remotely; you also have the option of installing them onto your own computer. The initial remote databases window has an item showing, for each file, whether it is already installed locally and, if so, whether the local version is up to date with the version at Wake Forest.

Assuming you have managed to open a database, you can import selected series into gretl's workspace by using the "Series, Import" menu item in the database window, or via the popup menu that appears if you click the right mouse button, or by dragging the series into the program's main window.

Creating a gretl data file independently

It is possible to create a data file in one or other of gretl's own formats using a text editor or software tools such as awk, sed or perl. This may be a good choice if you have large amounts of data already in machine-readable form. You will, of course, need to study these data formats (XML-based or "traditional") as described in Appendix A.

4.4 Structuring a dataset

Once your data are read by gretl, it may be necessary to supply some information on the nature of the data. We distinguish between three kinds of datasets:

1. Cross section
2. Time series
3. Panel data

The primary tool for doing this is the Data/Dataset structure menu entry in the graphical interface, or the setobs command for scripts and the command-line interface.

Cross-sectional data

By a cross section we mean observations on a set of units (which may be firms, countries, individuals, or whatever) at a common point in time. This is the default interpretation for a data file: if there is insufficient information to interpret data as time-series or panel data, they are automatically interpreted as a cross section.
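In a script, the cross-sectional interpretation can also be imposed (or restored) explicitly via setobs; a minimal sketch:

setobs 1 1 --cross-section    # frequency 1, starting observation 1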
In the unlikely event that cross-sectional data are wrongly interpreted as time series, you can correct this by selecting the Data/Dataset structure menu item. Click the "cross-sectional" radio button in the dialog box that appears, then click "Forward". Click "OK" to confirm your selection.

Time series data

When you import data from a spreadsheet or plain text file, gretl will make fairly strenuous efforts to glean time-series information from the first column of the data, if it looks at all plausible that such information may be present. If time-series structure is present but not recognized, again you can use the Data/Dataset structure menu item. Select "Time series" and click "Forward"; select the appropriate data frequency and click "Forward" again; then select or enter the starting observation and click "Forward" once more. Finally, click "OK" to confirm the time-series interpretation if it is correct (or click "Back" to make adjustments if need be).

Besides the basic business of getting a data set interpreted as time series, further issues may arise relating to the frequency of time-series data. In a gretl time-series data set, all the series must have the same frequency. Suppose you wish to make a combined dataset using series that, in their original state, are not all of the same frequency. For example, some series are monthly and some are quarterly.

Your first step is to formulate a strategy: do you want to end up with a quarterly or a monthly data set? A basic point to note here is that "compacting" data from a higher frequency (e.g. monthly) to a lower frequency (e.g. quarterly) is usually unproblematic. You lose information in doing so, but in general it is perfectly legitimate to take, say, the average of three monthly observations to create a quarterly observation. On the other hand, "expanding" data from a lower to a higher frequency is not, in general, a valid operation.

In most cases, then, the best strategy is to start by creating a data set of the lower frequency, and then to compact the higher-frequency data to match. When you import higher-frequency data from a database into the current data set, you are given a choice of compaction method (average, sum, start of period, or end of period). In most instances "average" is likely to be appropriate.

You can also import lower-frequency data into a high-frequency data set, but this is generally not recommended. What gretl does in this case is simply replicate the values of the lower-frequency series as many times as required. For example, suppose we have a quarterly series with the value 35.5 in 1990:1, the first quarter of 1990. On expansion to monthly, the value 35.5 will be assigned to the observations for January, February and March of 1990. The expanded variable is therefore useless for fine-grained time-series analysis, outside of the special case where you know that the variable in question does in fact remain constant over the sub-periods.

When the current data frequency is appropriate, gretl offers both "Compact data" and "Expand data" options under the Data menu. These options operate on the whole data set, compacting or expanding all series. They should be considered "expert" options and should be used with caution.
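In script mode the whole-dataset operation is performed by the dataset command; a minimal sketch, assuming a monthly dataset is in place (the filename is hypothetical, and the optional method argument mentioned in the comment should be checked against the command reference):

open mymonthly.gdt
dataset compact 4    # monthly to quarterly; averages by default
                     # (a method word such as "sum" may be appended)

A corresponding "dataset expand" variant exists for the reverse, replicating operation.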
Panel data

Panel data are inherently three-dimensional, the dimensions being variable, cross-sectional unit, and time period. For example, a particular number in a panel data set might be identified as the observation on capital stock for General Motors in 1980. A note on terminology: we use the terms "cross-sectional unit", "unit" and "group" interchangeably below to refer to the entities that compose the cross-sectional dimension of the panel. These might, for instance, be firms, countries or persons.

For representation in a textual computer file (and also for gretl's internal calculations), the three dimensions must somehow be flattened into two. This "flattening" involves taking layers of the data that would naturally stack in a third dimension, and stacking them in the vertical dimension.

gretl always expects data to be arranged "by observation", that is, such that each row represents an observation (and each variable occupies one and only one column). In this context the flattening of a panel data set can be done in either of two ways:

- Stacked time series: the successive vertical blocks each comprise a time series for a given unit.
- Stacked cross sections: the successive vertical blocks each comprise a cross-section for a given period.

You may input data in whichever arrangement is more convenient. Internally, however, gretl always stores panel data in the form of stacked time series.

4.5 Panel data specifics

When you import panel data into gretl from a spreadsheet or comma-separated format, the panel nature of the data will not be recognized automatically (most likely the data will be treated as "undated"). A panel interpretation can be imposed on the data using the graphical interface or via the setobs command.

In the graphical interface, use the menu item Data/Dataset structure. In the first dialog box that appears, select "Panel". In the next dialog you have a three-way choice. The first two options, "Stacked time series" and "Stacked cross sections", are applicable if the data set is already organized in one of these two ways. If you select either of these options, the next step is to specify the number of cross-sectional units in the data set. The third option, "Use index variables", is applicable if the data set contains two variables that index the units and the time periods respectively; the next step is then to select those variables. For example, a data file might contain a country code variable and a variable representing the year of the observation. In that case gretl can reconstruct the panel structure of the data regardless of how the observation rows are organized.

The setobs command has options that parallel those in the graphical interface. If suitable index variables are available you can do, for example,

setobs unitvar timevar --panel-vars

where unitvar is a variable that indexes the units and timevar is a variable indexing the periods. Alternatively you can use the form

setobs freq 1:1 structure

where freq is replaced by the "block size" of the data (that is, the number of periods in the case of stacked time series, or the number of units in the case of stacked cross-sections) and structure is either --stacked-time-series or --stacked-cross-section. Two examples are given below: the first is suitable for a panel in the form of stacked time series with observations from 20 periods; the second for stacked cross sections with 5 units.

setobs 20 1:1 --stacked-time-series
setobs 5 1:1 --stacked-cross-section
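Putting the pieces together: if, say, a CSV file carries columns identifying the country and the year of each row, the import-plus-structuring step might look like the following sketch (the file and variable names are hypothetical):

open mypanel.csv
setobs country year --panel-vars    # units indexed by country, periods by year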
Panel data arranged by variable

Publicly available panel data sometimes come arranged "by variable". Suppose we have data on two variables, x1 and x2, for each of 50 states in each of 5 years (giving a total of 250 observations per variable). One textual representation of such a data set would start with a block for x1, with 50 rows corresponding to the states and 5 columns corresponding to the years. This would be followed, vertically, by a block with the same structure for variable x2. A fragment of such a data file is shown below, with quinquennial observations 1965–1985. Imagine the table continued for 48 more states, followed by another 50 rows for variable x2.

x1
      1965   1970   1975   1980   1985
AR   100.0  110.5  118.7  131.2  160.4
AZ   100.0  104.3  113.8  120.9  140.6

If a datafile with this sort of structure is read into gretl, the program will interpret the columns as distinct variables, so the data will not be usable "as is". (Note that you will have to modify such a datafile slightly before it can be read at all. The line containing the variable name, in this example x1, will have to be removed, and so will the initial row containing the years, otherwise they will be taken as numerical data.) But there is a mechanism for correcting the situation, namely the stack function.

Consider the first data column in the fragment above: the first 50 rows of this column constitute a cross-section for the variable x1 in the year 1965. If we could create a new series by stacking the first 50 entries in the second column underneath the first 50 entries in the first, we would be on the way to making a data set "by observation" (in the first of the two forms mentioned above, stacked cross-sections). That is, we'd have a column comprising a cross-section for x1 in 1965, followed by a cross-section for the same variable in 1970.

The following gretl script illustrates how we can accomplish the stacking, for both x1 and x2. We assume that the original data file is called panel.txt, and that in this file the columns are headed with "variable names" v1, v2, ..., v5. (The columns are not really variables, but in the first instance we "pretend" that they are.)

open panel.txt
series x1 = stack(v1..v5, 50)
series x2 = stack(v1..v5, 50, 50)
setobs 50 1:1 --stacked-cross-section
store panel.gdt x1 x2

The second and third lines illustrate the syntax of the stack function, which has this signature:

series stack(list L, scalar length, scalar offset)

- L: a list of series on which to operate.
- length: an integer giving the number of observations to take from each series.
- offset: an integer giving the offset from the top of the dataset at which to start taking values (optional, defaults to 0).

The syntax in the example above constructs a list of the 5 contiguous series to be stacked. More generally, you can define a named list of series and pass that as the first argument to stack (see chapter 15). In this example we're supposing that the full data set contains 100 rows, and that in the stacking of variable x1 we wish to read only the first 50 rows from each column, so we give 50 as the second argument.

On line 3 we do the stacking for variable x2. Again we want a length of 50 for the components of the stacked series, but this time we want to start reading from the 50th row of the original data, and so we add a third "offset" argument of 50. Line 4 then imposes a panel interpretation on the data. Finally, we save the stacked data to file, with the panel interpretation.

The illustrative script above is appropriate when the number of variables to be processed is small. When there are many variables in the dataset it will be more convenient to use a loop to accomplish the stacking, as shown in the following script. The setup is presumed to be the same as in the previous case (50 units, 5 periods), but with 20 variables rather than 2.

open panel.txt
list L = v1..v5    # predefine a list of series
scalar length = 50
loop i=1..20
    scalar offset = (i-1) * length
    series x$i = stack(L, length, offset)
endloop
setobs 50 1:1 --stacked-cross-section
store panel.gdt x1..x20
Side-by-side time series

There's a second sort of data that you may wish to convert to gretl's panel format, namely side-by-side time series for a number of cross-sectional units. For example, a data file might contain separate GDP series of common length T for each of N countries. To turn these into a single stacked time series, the stack function can again be used. An example follows, where we suppose the original data source is a comma-separated file named GDP.csv, containing GDP data for countries from Austria (GDP_AT) to Zimbabwe (GDP_ZW) in consecutive columns.

open GDP.csv
scalar T = $nobs    # the number of periods
list L = GDP_AT..GDP_ZW
series GDP = stack(L, T)
setobs T 1:1 --stacked-time-series
store panel.gdt GDP

The resulting data file, panel.gdt, will contain a single series of length N×T, where N is the number of countries and T is the length of the original dataset. One could insert revised variants of lines 3 and 4 of the script if the original file contained additional side-by-side per-country series, for investment, consumption or whatever.
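For instance, if the same file also carried per-country investment series (the INV_ names below are hypothetical), the revised variants might read:

list L2 = INV_AT..INV_ZW
series INV = stack(L2, T)
store panel.gdt GDP INV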
Panel data marker strings

It can be helpful with panel data to have the observations identified by mnemonic markers. A special function in the genr command is available for this purpose.

In the example under the heading "Panel data arranged by variable" above, suppose all the states are identified by two-letter codes in the left-most column of the original datafile. When the stack function is invoked as shown, these codes will be stacked along with the data values. If the first row is marked AR, for Arkansas, then the marker AR will end up being shown on each row containing an observation for Arkansas. That's all very well, but these markers don't tell us anything about the date of the observation. To rectify this we could do

genr time
series year = 1960 + (5 * time)
genr markers = "%s:%d", marker, year

The first line generates a 1-based index representing the period of each observation, and the second line uses the time variable to generate a variable representing the year of the observation. The third line contains this special feature: if (and only if) the name of the new "variable" to generate is markers, the portion of the command following the equals sign is taken as a C-style format string (which must be wrapped in double quotes), followed by a comma-separated list of arguments. The arguments will be printed according to the given format to create a new set of observation markers. Valid arguments are either the names of variables in the dataset, or the string marker, which denotes the pre-existing observation marker. The format specifiers likely to be useful in this context are %s for a string and %d for an integer. Strings can be truncated: for example, %3s will use just the first three characters of the string. To chop initial characters off an existing observation marker when constructing a new one, you can use the syntax marker + n, where n is a positive integer; in this case the first n characters will be skipped.

After the commands above are processed, the observation markers will look like, for example, AR:1965, where the two-letter state code and the year of the observation are spliced together with a colon.

Panel dummy variables

In a panel study you may wish to construct dummy variables of one or both of the following sorts: (a) dummies as unique identifiers for the units or groups, and (b) dummies as unique identifiers for the time periods. The former may be used to allow the intercept of the regression to differ across the units, the latter to allow the intercept to differ across periods.

Two special functions are available to create such dummies. These are found under the Add menu in the GUI, or under the genr command in script mode or gretlcli.

1. "unit dummies" (script command genr unitdum). This command creates a set of dummy variables identifying the cross-sectional units. The variable du_1 will have value 1 in each row corresponding to a unit 1 observation, 0 otherwise; du_2 will have value 1 in each row corresponding to a unit 2 observation, 0 otherwise; and so on.

2. "time dummies" (script command genr timedum). This command creates a set of dummy variables identifying the periods. The variable dt_1 will have value 1 in each row corresponding to a period 1 observation, 0 otherwise; dt_2 will have value 1 in each row corresponding to a period 2 observation, 0 otherwise; and so on.

If a panel data set has the YEAR of the observation entered as one of the variables, you can create a periodic dummy to pick out a particular year, e.g. genr dum = (YEAR==1960). You can also create periodic dummy variables using the modulus operator. For instance, to create a dummy with value 1 for the first observation and every thirtieth observation thereafter, 0 otherwise, do

genr index
series dum = ((index-1) % 30) == 0

Lags, differences, trends

If the time periods are evenly spaced, you may want to use lagged values of variables in a panel regression (but see also chapter 24); you may also wish to construct first differences of variables of interest.

Once a dataset is identified as a panel, gretl will handle the generation of such variables correctly. For example, the command genr x1_1 = x1(-1) will create a variable that contains the first lag of x1 where available, and the missing value code where the lag is not available (e.g. at the start of the time series for each group). When you run a regression using such variables, the program will automatically skip the missing observations.

When a panel data set has a fairly substantial time dimension, you may wish to include a trend in the analysis. The command genr time creates a variable named time which runs from 1 to T for each unit, where T is the length of the time-series dimension of the panel. If you want to create an index that runs consecutively from 1 to m×T, where m is the number of units in the panel, use genr index.

Basic statistics by unit

gretl contains functions which can be used to generate basic descriptive statistics for a given variable on a per-unit basis: these are pnobs() (number of valid cases), pmin() and pmax() (minimum and maximum), and pmean() and psd() (mean and standard deviation).

As a brief illustration, suppose we have a panel data set comprising 8 time-series observations on each of N units or groups. Then the command

series pmx = pmean(x)

creates a series of this form: the first 8 values (corresponding to unit 1) contain the mean of x for unit 1, the next 8 values contain the mean for unit 2, and so on. The psd() function works in a similar manner. The sample standard deviation for group i is computed as

s_i = sqrt( Σ(x − x̄_i)² / (T_i − 1) )

where T_i denotes the number of valid observations on x for the given unit, x̄_i denotes the group mean, and the summation is across valid observations for the group. If T_i < 2, however, the standard deviation is recorded as 0.

One particular use of psd() may be worth noting. If you want to form a sub-sample of a panel that contains only those units for which the variable x is time-varying, you can either use

smpl pmin(x) < pmax(x) --restrict

or

smpl psd(x) > 0 --restrict

4.6 Missing data values

Representation and handling

Missing values are represented internally as NaN ("not a number"), as defined in the IEEE 754 floating-point standard. In a native-format data file they should be represented as NA. When importing CSV data, gretl accepts several common representations of missing values, including -999, the string NA (in upper or lower case), a single dot, or simply a blank cell.
Blank cells should, of course, be properly delimited; e.g. 120.6,,5.38, in which the middle value is presumed missing.

As for handling of missing values in the course of statistical analysis, gretl does the following:

- In calculating descriptive statistics (mean, standard deviation, etc.) under the summary command, missing values are simply skipped and the sample size adjusted appropriately.

- In running regressions, gretl first adjusts the beginning and end of the sample range, truncating the sample if need be. (Missing values at the beginning of the sample are common in time-series work due to the inclusion of lags, first differences and so on; missing values at the end of the range are not uncommon due to differential updating of series, and possibly the inclusion of leads.)

If gretl detects any missing values "inside" the (possibly truncated) sample range for a regression, the result depends on the character of the dataset and the estimator chosen. In many cases, the program will automatically skip the missing observations when calculating the regression results. In this situation a message is printed stating how many observations were dropped. On the other hand, the skipping of missing observations is not supported for all procedures: exceptions include all autoregressive estimators, system estimators such as SUR, and nonlinear least squares. In the case of panel data, the skipping of missing observations is supported only if their omission leaves a balanced panel. If missing observations are found in cases where they are not supported, gretl gives an error message and refuses to produce estimates.

Manipulating missing values

Some special functions are available for the handling of missing values. The Boolean function missing() takes the name of a variable as its single argument; it returns a series with value 1 for each observation at which the given variable has a missing value, and value 0 otherwise (that is, if the given variable has a valid value at that observation). The function ok() is complementary to missing(); it is just a shorthand for !missing (where ! is the Boolean NOT operator). For example, one can count the missing values for variable x using

scalar nmiss_x = sum(missing(x))

The function zeromiss(), which again takes a single series as its argument, returns a series where all zero values are set to the missing code. This should be used with caution (one does not want to confuse missing values and zeros), but it can be useful in some contexts. For example, one can determine the first valid observation for a variable x using

genr time
scalar x0 = min(zeromiss(time * ok(x)))

The function misszero() does the opposite of zeromiss(); that is, it converts all missing values to zero.

If missing values get involved in calculations, they propagate according to the IEEE rules: notably, if one of the operands to an arithmetical operation is a NaN, the result will also be NaN.
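For instance, the sum of two series is NA wherever either operand is NA. If, as a substantive modeling choice, you want NAs in one of the series to count as zeros, misszero() gives a one-line workaround; a minimal sketch:

series z = x + y              # NA wherever x or y is missing
series z2 = x + misszero(y)   # NAs in y treated as zeros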
4.7 Maximum size of data sets

Basically, the size of data sets (both the number of variables and the number of observations per variable) is limited only by the characteristics of your computer. gretl allocates memory dynamically, and will ask the operating system for as much memory as your data require. Obviously, then, you are ultimately limited by the size of RAM.

Aside from the multiple-precision OLS option, gretl uses double-precision floating-point numbers throughout. The size of such numbers in bytes depends on the computer platform, but is typically eight. To give a rough notion of magnitudes, suppose we have a data set with 10,000 observations on 500 variables. That's 5 million floating-point numbers, or 40 million bytes. If we define the megabyte (MB) as 1024 × 1024 bytes, as is standard in talking about RAM, it's slightly over 38 MB. The program needs additional memory for workspace, but even so, handling a data set of this size should be quite feasible on a current PC, which at the time of writing is likely to have at least 256 MB of RAM.

If RAM is not an issue, there is one further limitation on data size (though it's very unlikely to be a binding constraint). That is, variables and observations are indexed by signed integers, and on a typical PC these will be 32-bit values, capable of representing a maximum positive value of 2^31 − 1 = 2,147,483,647.

The limits mentioned above apply to gretl's "native" functionality. There are tighter limits with regard to two third-party programs that are available as add-ons to gretl for certain sorts of time-series analysis, including seasonal adjustment, namely TRAMO/SEATS and X-12-ARIMA. These programs employ a fixed-size memory allocation, and can't handle series of more than 600 observations.

4.8 Data file collections

If you're using gretl in a teaching context, you may be interested in adding a collection of data files and/or scripts that relate specifically to your course, in such a way that students can browse and access them easily.

There are three ways to access such collections of files:

- For data files: select the menu item File/Open data/Sample file, or click on the folder icon on the gretl toolbar.
- For script files: select the menu item File/Script files/Example scripts.

When a user selects one of these items:

- The data or script files included in the gretl distribution are automatically shown (this includes files relating to Ramanathan's Introductory Econometrics and Greene's Econometric Analysis).

- The program looks for certain known collections of data files available as optional extras, for instance the datafiles from various econometrics textbooks (Davidson and MacKinnon, Gujarati, Stock and Watson, Verbeek, Wooldridge) and the Penn World Table (PWT 5.6). See the data page at the gretl website for information on these collections. If the additional files are found, they are added to the selection windows.

- The program then searches for valid file collections (not necessarily known in advance) in these places: the "system" data directory, the "system" script directory, the user directory, and all first-level subdirectories of these. For reference, typical values for these directories are shown in Table 4.1. (Note that PERSONAL is a placeholder that is expanded by Windows, corresponding to "My Documents" on English-language systems.)

                     Linux                      MS Windows
 system data dir     /usr/share/gretl/data      c:\Program Files\gretl\data
 system script dir   /usr/share/gretl/scripts   c:\Program Files\gretl\scripts
 user dir            $HOME/gretl                PERSONAL\gretl

Table 4.1: Typical locations for file collections

Any valid collections will be added to the selection windows. So what constitutes a valid file collection? This comprises either a set of data files in gretl XML format (with the .gdt suffix) or a set of script files containing gretl commands (with .inp suffix), in each case accompanied by a "master file" or catalog. The gretl distribution contains several example catalog files, for instance the file "descriptions" in the misc sub-directory of the gretl data directory, and "ps_descriptions" in the misc sub-directory of the scripts directory.

If you are adding your own collection, data catalogs should be named "descriptions" and script catalogs should be named "ps_descriptions". In each case the catalog should be placed (along with the associated data or script files) in its own specific sub-directory (e.g. /usr/share/gretl/data/mydata or c:\userdata\gretl\data\mydata).
The catalog files are plain text; if they contain non-ASCII characters they must be encoded as UTF-8. The syntax of such files is straightforward. Here, for example, are the first few lines of gretl's "misc" data catalog:

# Gretl: various illustrative datafiles
"arma","artificial data for ARMA script example"
"ects_nls","Nonlinear least squares example"
"hamilton","Prices and exchange rate, US and Italy"

The first line, which must start with a hash mark, contains a short name (here "Gretl") which will appear as the label for this collection's tab in the data browser window, followed by a colon, followed by an optional short description of the collection.

Subsequent lines contain two elements, separated by a comma and wrapped in double quotation marks. The first is a datafile name (leave off the .gdt suffix here) and the second is a short description of the content of that datafile. There should be one such line for each datafile in the collection.

A script catalog file looks very similar, except that there are three fields in the file lines: a filename (without its .inp suffix), a brief description of the econometric point illustrated in the script, and a brief indication of the nature of the data used. Again, here are the first few lines of the supplied "misc" script catalog:

# Gretl: various sample scripts
"arma","ARMA modeling","artificial data"
"ects_nls","Nonlinear least squares (Davidson)","artificial data"
"leverage","Influential observations","artificial data"
"longley","Multicollinearity","US employment"

If you want to make your own data collection available to users, these are the steps:

1. Assemble the data, in whatever format is convenient.

2. Convert the data to gretl format and save as gdt files. It is probably easiest to convert the data by importing them into the program from plain text, CSV, or a spreadsheet format (MS Excel or Gnumeric), then saving them. You may wish to add descriptions of the individual variables (the Variable/Edit attributes menu item) and add information on the source of the data (the Data/Edit info menu item).

3. Write a descriptions file for the collection using a text editor.

4. Put the datafiles plus the descriptions file in a subdirectory of the gretl data directory or user directory.

5. If the collection is to be distributed to other people, package the data files and catalog in some suitable manner, e.g. as a zipfile.

If you assemble such a collection, and the data are not proprietary, we would encourage you to submit the collection for packaging as a gretl optional extra.

4.9 Assembling data from multiple sources

In many contexts researchers need to bring together data from multiple source files, and in some cases these sources are not organized such that the data can simply be "stuck together" by appending rows or columns to a base dataset. In gretl, the join command can be used for this purpose; this command is discussed in detail in chapter 7.

Chapter 5
Subsampling a dataset

5.1 Introduction

Some subtle issues can arise here; this chapter attempts to explain the issues.

A sub-sample may be defined in relation to a full dataset in two different ways: we will refer to these as "setting" the sample and "restricting" the sample; these methods are discussed in sections 5.2 and 5.3 respectively. In addition, section 5.4 discusses some special issues relating to panel data, and section 5.5 covers resampling with replacement, which is useful in the context of bootstrapping test statistics.

The following discussion focuses on the command-line approach.
But you can also invoke the methods outlined here via the items under the Sample menu in the GUI program.

5.2 Setting the sample

By "setting" the sample we mean defining a sub-sample simply by means of adjusting the starting and/or ending point of the current sample range. This is likely to be most relevant for time-series data. For example, one has quarterly data from 1960:1 to 2003:4, and one wants to run a regression using only data from the 1970s. A suitable command is then

smpl 1970:1 1979:4

Or one wishes to set aside a block of observations at the end of the data period for out-of-sample forecasting. In that case one might do

smpl ; 2000:4

where the semicolon is shorthand for "leave the starting observation unchanged". (The semicolon may also be used in place of the second parameter, to mean that the ending observation should be unchanged.) By "unchanged" here we mean unchanged relative to the last smpl setting, or relative to the full dataset if no sub-sample has been defined up to this point. For example, after

smpl 1970:1 2003:4
smpl ; 2000:4

the sample range will be 1970:1 to 2000:4.

An incremental or relative form of setting the sample range is also supported. In this case a relative offset should be given, in the form of a signed integer (or a semicolon to indicate no change), for both the starting and ending point. For example

smpl +1 ;

will advance the starting observation by one while preserving the ending observation, and

smpl +2 -1

will both advance the starting observation by two and retard the ending observation by one.

An important feature of "setting" the sample as described above is that it necessarily results in the selection of a subset of observations that are contiguous in the full dataset. The structure of the dataset is therefore unaffected (for example, if it is a quarterly time series before setting the sample, it remains a quarterly time series afterwards).

5.3 Restricting the sample

By "restricting" the sample we mean selecting observations on the basis of some Boolean (logical) criterion, or by means of a random number generator. This is likely to be most relevant for cross-sectional or panel data.

Suppose we have data on a cross-section of individuals, recording their gender, income and other characteristics. We wish to select for analysis only the women. If we have a male dummy variable with value 1 for men and 0 for women, we could do

smpl male==0 --restrict

to this effect. Or suppose we want to restrict the sample to respondents with incomes over 50,000. Then we could use

smpl income>50000 --restrict

A question arises: if we issue the two commands above in sequence, what do we end up with in our sub-sample, all cases with income over 50,000, or just women with income over 50,000? By default, the answer is the latter: women with income over 50,000. The second restriction augments the first; or, in other words, the final restriction is the logical product of the new restriction and any restriction that is already in place. If you want a new restriction to replace any existing restrictions, you can first recreate the full dataset using

smpl full

Alternatively, you can add the replace option to the smpl command:

smpl income>50000 --restrict --replace

This option has the effect of automatically re-establishing the full dataset before applying the new restriction.
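Conversely, if the cumulative effect is what you want, the two criteria can also be combined in a single command via the Boolean AND operator:

smpl male==0 && income>50000 --restrict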
Unlike a simple "setting" of the sample, "restricting" the sample may result in selection of non-contiguous observations from the full data set. It may therefore change the structure of the data set. This can be seen in the case of panel data. Say we have a panel of five firms (indexed by the variable firm) observed in each of several years (identified by the variable year). Then the restriction

smpl year==1995 --restrict

produces a dataset that is not a panel, but a cross-section for the year 1995. Similarly,

smpl firm==3 --restrict

produces a time-series dataset for firm number 3.

For these reasons (possible non-contiguity in the observations, possible change in the structure of the data set), gretl acts differently when you restrict the sample as opposed to simply setting it. In the case of setting, the program merely records the starting and ending observations and uses these as parameters to the various commands calling for the estimation of models, the computation of statistics, and so on. In the case of restriction, the program makes a reduced copy of the dataset and by default treats this reduced copy as a simple, undated cross-section (but see the further discussion of panel data in section 5.4).

If you wish to re-impose a time-series interpretation of the reduced dataset, you can do so using the setobs command, or the GUI menu item Data/Dataset structure.

The fact that restricting the sample results in the creation of a reduced copy of the original dataset may raise an issue when the dataset is very large. With such a dataset in memory, the creation of a copy may lead to a situation where the computer runs low on memory for calculating regression results. You can work around this as follows:

1. Open the full data set, and impose the sample restriction.
2. Save a copy of the reduced data set to disk.
3. Close the full dataset and open the reduced one.
4. Proceed with your analysis.

Random sub-sampling

Besides restricting the sample on some deterministic criterion, it may sometimes be useful (when working with very large datasets, or perhaps to study the properties of an estimator) to draw a random sub-sample from the full dataset. This can be done using, for example,

smpl 100 --random

to select 100 cases. If you want the sample to be reproducible, you should set the seed for the random number generator first, using the set command. This sort of sampling falls under the "restriction" category: a reduced copy of the dataset is made.
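For example, the following draw can be reproduced exactly on a later run (the seed value is arbitrary):

set seed 20130527    # any fixed integer will do
smpl 100 --random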
5.4 Panel data

Consider, for concreteness, the Arellano-Bond dataset supplied with gretl (abdata.gdt). This comprises data on 140 firms (n = 140) observed over the years 1976–1984 (T = 9). The dataset is nominally balanced, in the sense that the time-series length is the same for all units (this being a requirement for a dataset to count as a panel in gretl), but in fact there are many missing values (NAs).

You may want to sub-sample such a dataset in either the cross-sectional dimension (limit the sample to a subset of firms) or the time dimension (e.g. use data from the 1980s only).

One way to sub-sample on firms keys off the notation used by gretl for panel observations. The full data range is printed as 1:1 (firm 1, period 1) to 140:9 (firm 140, period 9). The effect of

smpl 1:1 80:9

is to limit the sample to the first 80 firms. Note that if you instead tried smpl 1:1 80:4, this would provoke an error: you cannot use this syntax to sub-sample in the time dimension of the panel.

Alternatively, and perhaps more naturally, you can use the --unit option with the smpl command to limit the sample in the cross-sectional dimension, as in

smpl 1 80 --unit

The firms in the Arellano-Bond dataset are anonymous, but suppose you had a panel with five named countries. With such a panel you can inform gretl of the names of the groups using the setobs command. For example, given

string cstr = "Portugal Italy Ireland Greece Spain"
setobs country cstr --panel-groups

gretl creates a string-valued series named country with group names taken from the variable cstr. Then, to include only Italy and Spain you could do

smpl country=="Italy" || country=="Spain" --restrict

or, to exclude one country,

smpl country!="Ireland" --restrict

Sub-sampling a panel in the time dimension can be done via --restrict. For example, the Arellano-Bond dataset contains a variable named YEAR that records the year of the observations, and if one wanted to omit the first two years of data one could do

smpl YEAR >= 1978 --restrict

If a dataset does not already include a suitable variable for this purpose, one can use the command genr time to create a simple 1-based time index.

Another way to sub-sample in the time dimension of a panel starts with a specification of time via the setobs command, as in

setobs 1 1976 --panel-time

This tells gretl that "panel time" is annual (frequency 1), starting in 1976. (In fact this is already done for abdata.gdt.) Then to restrict the sample range to 1979–1982 you can do

smpl 1979 1982 --time

Note that if you apply a sample restriction that just selects certain units (firms, countries or whatever), or selects certain contiguous time periods, such that n > 1, T > 1 and the time-series length is still the same across all included units, your sub-sample will still be interpreted by gretl as a panel.

Unbalancing restrictions

In some cases one wants to sub-sample according to a criterion that "cuts across the grain" of a panel dataset. For instance, suppose you have a micro dataset with thousands of individuals observed over several years, and you want to restrict the sample to observations on employed women.

If we simply extracted from the total of nT rows of the dataset those that pertain to women who were employed at time t (t = 1, ..., T), we would likely end up with a dataset that doesn't count as a panel in gretl (because the specific time-series length, Ti, would differ across individuals). In some contexts it might be OK that gretl doesn't take your sub-sample to be a panel, but if you want to apply panel-specific methods this is a problem. You can solve it by giving the --preserve-panel option with smpl. For example, supposing your dataset contained dummy variables gender (with the value 1 coding for women) and employed, you could do

smpl gender==1 && employed==1 --restrict --preserve-panel

What exactly does this do? Well, let's say the years of your data are 2000, 2005 and 2010, and that some women were employed in all of those years, giving a maximum Ti value of 3. But individual 526 is a woman who was employed only in the year 2000 (Ti = 1). The effect of the --preserve-panel option is then to insert "padding rows" of NAs for the years 2005 and 2010 for individual 526 (and similarly for all individuals with 0 < Ti < 3). Your sub-sample then qualifies as a panel.

5.5 Resampling and bootstrapping

Given an original data series x, the command

series xr = resample(x)

creates a new series each of whose elements is drawn at random from the elements of x. If the original series has 100 observations, each element of x is selected with probability 1/100 at each drawing. Thus the effect is to "shuffle" the elements of x, with the twist that each element of x may appear more than once, or not at all, in xr.

The primary use of this function is in the construction of bootstrap confidence intervals or p-values. Here is a simple example. Suppose we estimate a simple regression of y on x via OLS, and find that the slope coefficient has a reported t-ratio of t0, with ν degrees of freedom.
A two-tailed p-value for the null hypothesis that the slope parameter equals zero can then be found using the t(ν) distribution. Depending on the context, however, we may doubt whether the ratio of coefficient to standard error truly follows the t(ν) distribution. In that case we could derive a bootstrap p-value as shown in Listing 5.1.

Under the null hypothesis that the slope with respect to x is zero, y is simply equal to its mean plus an error term. We simulate y by resampling the residuals from the initial OLS and re-estimate the model. We repeat this procedure a large number of times, and record the number of cases where the absolute value of the t-ratio is greater than t0: the proportion of such cases is our bootstrap p-value. For a good discussion of simulation-based tests and bootstrapping, see Davidson and MacKinnon (2004, chapter 4); Davidson and Flachaire (2001) is also instructive.

Listing 5.1: Calculation of bootstrap p-value

nulldata 50
set seed 54321
series x = normal()
series y = 10 + x + 2*normal()
ols y 0 x
# the reported t-stat
t0 = abs($coeff[2] / $stderr[2])
# save the residuals
series u = $uhat
scalar ybar = mean(y)
# number of replications for bootstrap
scalar B = 1000
scalar tcount = 0
series ysim
loop B
    # generate simulated y by resampling
    ysim = ybar + resample(u)
    ols ysim 0 x --quiet
    scalar tsim = abs($coeff[2] / $stderr[2])
    tcount += (tsim > t0)
endloop
printf "proportion of cases with |t| > %.3f = %g\n", t0, tcount / B

Chapter 6
Graphs and plots

6.1 Gnuplot graphs

A separate program, gnuplot, is called to generate graphs. Gnuplot is a very full-featured graphing program with myriad options. It is available from www.gnuplot.info (but note that a suitable copy of gnuplot is bundled with the packaged versions of gretl for MS Windows and Mac OS X). gretl gives you direct access (via a graphical interface) to a subset of gnuplot's options, and it tries to choose sensible values for you; it also allows you to take complete control over graph details if you wish.

With a graph displayed, you can click on the graph window for a pop-up menu with the following options:

- Save as PNG: Save the graph in Portable Network Graphics format (the same format that you see on screen).
- Save as postscript: Save in encapsulated postscript (EPS) format.
- Save as Windows metafile: Save in Enhanced Metafile (EMF) format.
- Save to session as icon: The graph will appear in iconic form when you select "Icon view" from the View menu.
- Zoom: Lets you select an area within the graph for closer inspection (not available for all graphs).
- Print: (Current GTK or MS Windows only) lets you print the graph directly.
- Copy to clipboard: (MS Windows only) lets you paste the graph into Windows applications such as MS Word.
- Edit: Opens a controller for the plot which lets you adjust many aspects of its appearance.
- Close: Closes the graph window.

Displaying data labels

For simple X-Y scatter plots, some further options are available if the dataset includes "case markers", that is, labels identifying each observation. (For an example of such a dataset, see the Ramanathan file data4-10: this contains data on private school enrollment for the 50 states of the USA plus Washington, DC, the case markers being the two-letter codes for the states.)

With a scatter plot displayed, when you move the mouse pointer over a data point its label is shown on the graph. By default these labels are transient: they do not appear in the printed or copied version of the graph. They can be removed by selecting "Clear data labels" from the graph pop-up menu. If you want the labels to be affixed permanently (so they will show up when the graph is printed or copied), select the option "Freeze data labels" from the pop-up menu; "Clear data labels" cancels this operation. The other label-related option, "All data labels", requests that case markers be shown for all observations. At present the display of case markers is disabled for graphs containing more than 250 data points.
GUI plot editor

Selecting the Edit option in the graph popup menu opens an editing dialog box, shown in Figure 6.1. Notice that there are several tabs, allowing you to adjust many aspects of a graph's appearance: font, title, axis scaling, line colors and types, and so on. You can also add lines or descriptive labels to a graph (under the Lines and Labels tabs). The "Apply" button applies your changes without closing the editor; "OK" applies the changes and closes the dialog.

[Figure 6.1: gretl's gnuplot controller]

Publication-quality graphics: advanced options

The GUI plot editor has two limitations. First, it cannot represent all the myriad options that gnuplot offers. Users who are sufficiently familiar with gnuplot to know what they're missing in the plot editor presumably don't need much help from gretl, so long as they can get hold of the gnuplot command file that gretl has put together. Second, even if the plot editor meets your needs in terms of fine-tuning the graph you see on screen, a few details may need further work in order to get optimal results for publication.

Either way, the first step in advanced tweaking of a graph is to get access to the graph command file. In the graph display window, right-click and choose "Save to session as icon". If it's not already open, open the icon view window, either via the menu item View/Icon view or by clicking the "session icon view" button on the main-window toolbar. Right-click on the icon representing the newly added graph and select "Edit plot commands" from the pop-up menu. You get a window displaying the plot file (Figure 6.2).

[Figure 6.2: Plot commands editor]

Here are the basic things you can do in this window. Obviously, you can edit the file you just opened. You can also send it for processing by gnuplot, by clicking the "Execute" (cogwheel) icon in the toolbar. Or you can use the "Save as" button to save a copy for editing and processing as you wish.

Unless you're a gnuplot expert, most likely you'll only need to edit a couple of lines at the top of the file, specifying a driver (plus options) and an output file. We offer here a brief summary of some points that may be useful.

First, gnuplot's output mode is set via the command set term, followed by the name of a supported driver ("terminal" in gnuplot parlance) plus various possible options. (The top line in the plot commands window shows the set term line that gretl used to make a PNG file, commented out.) The graphic formats that are most suitable for publication are PDF and EPS. These are supported by the gnuplot term types pdf, pdfcairo and postscript (with the eps option). The pdfcairo driver has the virtue that it behaves in a very similar manner to the PNG one, the output of which you see on screen. This driver is provided by the version of gnuplot that is included in the gretl packages for MS Windows and Mac OS X; if you're on Linux it may or may not be supported. If pdfcairo is not available, the pdf terminal may be available; the postscript terminal is almost certainly available.

Besides selecting a term type, if you want to get gnuplot to write the actual output file you need to append a set output line giving a filename. Here are a few examples of the first two lines you might type in the window editing your plot commands (we'll make these more "realistic" shortly):

set term pdfcairo
set output 'mygraph.pdf'

set term pdf
set output 'mygraph.pdf'

set term postscript eps
set output 'mygraph.eps'
There are a couple of things worth remarking here. First, you may want to adjust the size of the graph, and second, you may want to change the font. The default sizes produced by the above drivers are 5 inches by 3 inches for pdfcairo and pdf, and 5 inches by 3.5 inches for postscript eps. In each case you can change this by giving a size specification, which takes the form XX,YY (examples below).

You may ask, why bother changing the size in the gnuplot command file? After all, PDF and EPS are both vector formats, so the graphs can be scaled at will. True, but a uniform scaling will also affect the font size, which may end up looking wrong. You can get optimal results by experimenting with the font and size options to gnuplot's set term command. Here are some examples (comments follow below).

# pdfcairo, regular size, slightly amended
set term pdfcairo font "Sans,6" size 5in,3.5in
# or small size
set term pdfcairo font "Sans,5" size 3in,2in

# pdf, regular size, slightly amended
set term pdf font "Helvetica,8" size 5in,3.5in
# or small
set term pdf font "Helvetica,6" size 3in,2in

# postscript, regular
set term post eps solid font "Helvetica,16"
# or small
set term post eps solid font "Helvetica,12" size 3in,2in

On the first line we set a sans serif font for pdfcairo at a suitable size for a 5 × 3.5 inch plot (which you may find looks better than the rather "letterboxy" default of 5 × 3). And on the second we illustrate what you might do to get a smaller 3 × 2 inch plot. You can specify the plot size in centimeters if you prefer, as in

set term pdfcairo font "Sans,6" size 6cm,4cm

We then repeat the exercise for the pdf terminal. Notice that here we're specifying one of the 35 standard PostScript fonts, namely Helvetica. Unlike pdfcairo, the plain pdf driver is unlikely to be able to find fonts other than these.

In the third pair of lines we illustrate options for the postscript driver (which, as you see, can be abbreviated as post). Note that here we have added the option solid. Unlike most other drivers, this one uses dashed lines unless you specify the solid option. Also note that we've (apparently) specified a much larger font in this case. That's because the eps option in effect tells the postscript driver to work at half-size (among other things), so we need to double the font size.

Table 6.1 summarizes the basics for the three drivers we have mentioned.

 Terminal    default size (inches)   suggested font
 pdfcairo    5 × 3                   Sans,6
 pdf         5 × 3                   Helvetica,8
 post eps    5 × 3.5                 Helvetica,16

Table 6.1: Drivers for publication-quality graphics

To find out more about gnuplot, visit www.gnuplot.info. This site has documentation for the current version of the program in various formats.

Additional tips

To be written. Line widths, enhanced text. Show a "before and after" example.
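Pending that fuller treatment, here is a provisional sketch of the sort of thing these tips will cover. The linewidth and enhanced keywords are standard gnuplot terminal options (the first scales all line widths, the second enables markup such as superscripts in labels), though the exact option set depends on your gnuplot build:

set term pdfcairo font "Sans,6" size 5in,3.5in linewidth 2 enhanced
set output 'mygraph.pdf'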
6.2 Plotting graphs from scripts

When working with scripts, you may want to have a graph shown on your display or saved into a file. In fact, if in your usual workflow you find yourself creating similar graphs over and over again, you might want to consider the option of writing a script which automates this process for you. gretl gives you two main tools for doing this: one is a command called gnuplot, whose main use is to create standard plots quickly; the other one is the plot command block, which has a more elaborate syntax but offers you more control on output.

The gnuplot command

The gnuplot command is described at length in the Gretl Command Reference and the online help system. Here we just summarize its main features: basically, it consists of the gnuplot keyword followed by a list of items, telling the command what you want plotted, and a list of options, telling it how you want it plotted. For example, the line

gnuplot y1 y2 x

will give you a basic XY plot of the two series y1 and y2 on the vertical axis versus the series x on the horizontal axis. In general, the arguments to the gnuplot command are a list of series, the last of which goes on the x-axis, while all the other ones go onto the y-axis. By default, the gnuplot command gives you a scatterplot. If you just have one variable on the y-axis, then gretl will also draw the OLS interpolation, if the fit is good enough. (The technical condition for this is that the two-tailed p-value for the slope coefficient should be under 10 percent.)

Several aspects of the behavior described above can be modified. You do this by appending options to the command. Most options can be broadly grouped in three categories:

1. Plot styles: we support points (the default choice), lines, lines and points together, and impulses (vertical lines).

2. Algorithm for the fitted line: here you can choose between linear, quadratic and cubic interpolation, but also more exotic choices, such as semi-log, inverse or loess (non-parametric). Of course, you can also turn this feature off.

3. Input and output: you can choose whether you want your graph on your computer screen (and possibly use the built-in graphical widget to further customize it, see above, page 37), or rather save it to a file. We support several graphical formats, among which PNG and PDF, to make it easy to incorporate your plots into text documents.

The following script uses the AWM dataset to exemplify some traditional plots in macroeconomics:

open AWM.gdt --quiet

# consumption and income, different styles
gnuplot PCR YER
gnuplot PCR YER --output=display
gnuplot PCR YER --output=display --time-series
gnuplot PCR YER --output=display --time-series --with-lines

# Phillips curve, different fitted lines
gnuplot INFQ URX --output=display
gnuplot INFQ URX --fit=none --output=display
gnuplot INFQ URX --fit=inverse --output=display
gnuplot INFQ URX --fit=loess --output=display

These examples use variables from the "area-wide model" dataset by the European Central Bank (ECB), which is shipped with gretl in the AWM.gdt file: PCR is aggregate private real consumption and YER is real GDP. The first command line above thus plots consumption against income, as a kind of "Keynesian consumption function". More precisely, it produces a simple scatter plot with an automatically computed linear fitted line. If this is executed in the gretl console, the plot will be directly shown in a new window; but if this line is contained in a script, then instead a file with the plot commands will be saved for later execution. The second example line changes this behavior for a script command and forces the plot to be shown directly.

The third line instead asks for a plot of the two variables as two separate curves against time (on the x-axis). Each observation point is drawn separately, with a certain symbol determined by gnuplot defaults. If you add the option --with-lines, the points will be connected with a continuous line and the symbols omitted.

The second set of example lines above demonstrates how the fitted line in the scatter plot can be controlled from gretl's side. The option --fit=none overrides gnuplot's default of drawing a line if it deems the fit to be "good enough". The effect of --fit=inverse is to consider the variable on the y-axis as a function of 1/X instead of X, and draw the corresponding hyperbolic branch. For the workings of a loess fit (locally-weighted polynomial regression), please refer to the documentation of the loess function.
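To save any of these plots to a file rather than showing it on screen, give a filename with the output option; the output format is chosen to match the filename extension. A minimal sketch (the filename is hypothetical):

gnuplot INFQ URX --fit=loess --output=phillips.pdf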
For more detail, consult the Gretl Command Reference.

The plot command block

The plot environment is a way to pass information to gnuplot in a more structured way, so that customization of basic plots becomes easier. It has the following characteristics:

- The block starts with the plot keyword, followed by a required parameter: the name of a list, a single series or a matrix. This parameter specifies the data to be plotted. The starting line may be prefixed with the "savename <-" apparatus to save a plot as an icon in the GUI program.

- The block ends with end plot.

- Inside the block you have zero or more lines of these types, identified by an initial keyword:

  option: specify a single option (details below).
  options: specify multiple options on a single line; if more than one option is given on a line, the options should be separated by spaces.
  literal: a command to be passed to gnuplot literally.
  printf: a printf statement whose result will be passed to gnuplot literally; this allows the use of string variables without having to resort to "@"-style string substitution.

The options available are basically those of the current gnuplot command, but with a few differences. For one thing, you don't need the leading double-dash in an "option" (or "options") line. Besides that:

- You can't use the option --matrix=whatever with plot: that possibility is handled by providing the name of a matrix on the initial plot line.

- The --input=filename option is not supported; use gnuplot for the case where you're supplying the entire plot specification yourself.

- The several options pertaining to the presence and type of a fitted line are replaced in plot by a single option, fit, which requires a parameter. Supported values for the parameter are: none, linear, quadratic, cubic, inverse, semilog and loess. Example:

  option fit=quadratic

As with gnuplot, the default is to show a linear fit in an XY scatter if it's significant at the 10 percent level.

Here's a simple example, the plot specification from the "bandplot" package, which shows how to achieve the same result via the gnuplot command and a plot block, respectively; the latter occupies a few more lines but is clearer:

gnuplot 1 2 3 4 --with-lines --matrix=plotmat --fit=none --output=display \
  { set linetype 3 lc rgb "#0000ff"; set title "@title"; set nokey; set xlabel "@xname"; }

plot plotmat
    options with-lines fit=none
    literal set linetype 3 lc rgb "#0000ff"
    literal set nokey
    printf "set title \"%s\"", title
    printf "set xlabel \"%s\"", xname
end plot --output=display

Note that --output=display is appended to end plot; also note that if you give a matrix to plot, it's assumed you want to plot all the columns. In addition, if you give a single series and the dataset is time series, it's assumed you want a time-series plot.
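So, for instance, assuming a time-series dataset containing a series named GDP (a hypothetical name), a minimal block producing a line plot against time might be:

plot GDP
    options with-lines
end plot --output=display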
Example: plotting a histogram together with a density

Listing 6.1 contains a slightly more elaborate example: here we load the Mroz example dataset and calculate the log of the individual's wage. Then we compare the histogram of a discretized version of the same variable, obtained via the aggregate function, with the theoretical density it would have if the data were Gaussian. There are a few points to note:

- The data for the plot are passed through a matrix in which we set column names via the cnameset function; those names are then automatically used by the plot environment.
- In this example we make extensive use of the literal construct for refining the plot by passing instructions to gnuplot (the power of gnuplot is impossible to overstate). We encourage you to visit the "demos" version of gnuplot's website, http://gnuplot.sourceforge.net, and revel in amazement.
- In the plot environment you can use all the quantities you have in your script. This is the way we calibrate the histogram width (try setting the scalar k in the script to different values). Note that the printf command has a special meaning inside a plot environment.
- The script displays the plot on your screen. If you want to save it to a file instead, replace --output=display at the end with --output=filename.
- It's OK to insert comments in the plot environment; actually, it's a rather good idea to comment as much as possible (as always).

The output from the script is shown in Figure 6.3.

Listing 6.1: Plotting the log wage from the Mroz example dataset

    set verbose off
    open mroz87.gdt --quiet
    series lWW = log(WW)
    scalar m = mean(lWW)
    scalar s = sd(lWW)

    # prepare matrix with data for plot

    # number of valid observations
    scalar n = nobs(lWW)
    # discretize log wage
    scalar k = 4
    series disclWW = round(lWW * k) / k
    # get frequencies
    matrix f = aggregate(null, disclWW)
    # add density
    phi = dnorm((f[,1] - m)/s) / (s*k)
    # put columns together and add labels
    plotmat = (f[,2] / n) ~ phi ~ f[,1]
    strings cnames = defarray("frequency", "density", "log wage")
    cnameset(plotmat, cnames)

    # create plot
    plot plotmat
        # move legend
        literal set key outside rmargin
        # set line style
        literal set linetype 2 dashtype 2 linewidth 2
        # set histogram color
        literal set linetype 1 lc rgb "#777777"
        # set histogram style
        literal set style fill solid 0.25 border
        # set histogram width
        printf "set boxwidth %4.2f", 0.5/k
        options with-lines=2 with-boxes=1
    end plot --output=display

Figure 6.3: Output from Listing 6.1 (histogram of the discretized log wage, with the Gaussian density overlaid)

Listing 6.2: Plotting t densities for varying degrees of freedom

    set verbose off
    function string tplot(scalar m)
        return sprintf("stud(x,%d) title 't(%d)'", m, m)
    end function

    matrix dfs = {2, 4, 16}

    plot
        literal set xrange [-4.5:4.5]
        literal set yrange [0:0.45]
        literal Binv(p,q) = exp(lgamma(p+q) - lgamma(p) - lgamma(q))
        literal stud(x,m) = Binv(0.5*m,0.5)/sqrt(m) * (1.0+(x*x)/m)**(-0.5*(m+1.0))
        printf "plot %s, %s, %s", tplot(dfs[1]), tplot(dfs[2]), tplot(dfs[3])
    end plot --output=display

Example: plotting Student's t densities

The power of the printf statement in a plot block becomes apparent when it is used jointly with user-defined functions, as exemplified in Listing 6.2, in which we create a plot showing the density functions of Student's t distribution for three different settings of the degrees of freedom parameter. (Note that plotting a t density is very easy to do from the GUI: just go to the Tools > Distribution graphs menu.)

First we define a user function called tplot, which returns a string with the "ingredients" to pass to the gnuplot plot statement, as a function of a scalar parameter (the degrees of freedom, in our case). Next, this function is used within the plot block to plot the appropriate density. Note that most of the statements that define mathematically the function to plot are outsourced to gnuplot via the literal command. The output from the script is shown in Figure 6.4.

Figure 6.4: Output from Listing 6.2 (t densities for 2, 4 and 16 degrees of freedom)
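As noted in the points above, writing any of these plots to a file rather than the screen is just a matter of changing the output specification on the closing line; a minimal sketch (the PDF filename here is arbitrary):

    plot plotmat
        option with-lines
    end plot --output=tdens.pdf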
6.3 Boxplots

These plots (after Tukey and Chambers) display the distribution of a variable. Their shape depends on a few quantities, defined as follows:

    xmin          sample minimum
    Q1            first quartile
    m             median
    xbar          mean
    Q3            third quartile
    xmax          sample maximum
    R = Q3 - Q1   interquartile range

The central box encloses the middle 50 percent of the data, i.e. it goes from Q1 to Q3; therefore, its height equals R. A line is drawn across the box at the median m, and a "+" sign identifies the mean xbar.

The length of the whiskers depends on the presence of outliers. The top whisker extends from the top of the box up to a maximum of 1.5 times the interquartile range, but can be shorter if the sample maximum is lower than that value; that is, it reaches min(xmax, Q3 + 1.5 R). Observations larger than Q3 + 1.5 R, if any, are considered outliers and represented individually via dots.[3] The bottom whisker obeys the same logic, with obvious adjustments. Figure 6.5 provides an example of all this, using the variable FAMINC from the sample dataset mroz87.

Figure 6.5: Sample boxplot of FAMINC, with xmin, Q1, m, xbar, Q3, xmax and the outliers marked

[3] To give you an intuitive idea: if a variable is normally distributed, the chances of picking an outlier by this definition are slightly below 0.7 percent.

In the case of boxplots with confidence intervals, dotted lines show the limits of an approximate 90 percent confidence interval for the median. This is obtained by the bootstrap method, which can take a while if the data series is very long. For details on constructing boxplots, see the entry for boxplot in the Gretl Command Reference, or use the Help button that appears when you select one of the boxplot items under the menu item "View, Graph specified vars" in the main gretl window.

Factorized boxplots

A nice feature which is quite useful for data visualization is the conditional, or factorized, boxplot. This type of plot allows you to examine the distribution of a variable conditional on the value of some discrete factor. As an example, we'll use one of the datasets supplied with gretl, that is rac3d, which contains an example taken from Cameron and Trivedi (2013) on the health conditions of 5190 people. The script below compares the unconditional (marginal) distribution of the number of illnesses in the past 2 weeks with the distribution of the same variable conditional on age classes:

    open rac3d.gdt
    # unconditional boxplot
    boxplot ILLNESS --output=display
    # create a discrete variable for age class:
    # 0 = below 20, 1 = between 20 and 39, etc.
    series ageclass = floor(AGE/0.2)
    # conditional boxplot
    boxplot ILLNESS ageclass --factorized --output=display

After running the code above, you should see two graphs similar to Figure 6.6. By comparing the marginal plot to the factorized one, the effect of age on the mean number of illnesses is quite evident: by joining the green crosses you get what is technically known as the conditional mean function, or regression function if you prefer.

Figure 6.6: Conditional and unconditional distribution of ILLNESS by ageclass
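As a postscript on the confidence-interval variant mentioned above, here is a minimal sketch; we assume the --notches option of the boxplot command, as documented in the Gretl Command Reference:

    open mroz87.gdt --quiet
    # boxplot with an approximate 90 percent interval for the median
    boxplot FAMINC --notches --output=display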
7 Joining data sources

7.1 Introduction

Gretl provides two commands for adding data from file to an existing dataset in the program's workspace, namely append and join. The append command, which has been available for a long time, is relatively simple and is described in the Gretl Command Reference. Here we focus on the join command, which is much more flexible and sophisticated. This chapter gives an overview of the functionality of join along with a detailed account of its syntax and options. We provide several toy examples and discuss one real-world case at length.

First, a note on terminology: in the following we use the terms "left-hand" and "inner" to refer to the dataset that is already in memory, and the terms "right-hand" and "outer" to refer to the dataset in the file from which additional data are to be drawn.

Two main features of join are worth emphasizing at the outset:

- "Key" variables can be used to match specific observations (rows) in the inner and outer datasets, and this match need not be 1 to 1.
- A row filter may be applied to screen out unwanted observations in the outer dataset.

As will be explained below, these features support rather complex concatenation and manipulation of data from different sources.

A further aspect of join should be noted, one that makes this command particularly useful when dealing with very large data files. That is, when gretl executes a join operation it does not, in general, read into memory the entire content of the right-hand side dataset: only those columns that are actually needed for the operation are read in full. This makes join faster and less demanding of computer memory than the methods available in most other software. On the other hand, gretl's asymmetrical treatment of the "inner" and "outer" datasets in join may require some getting used to, for users of other packages.

7.2 Basic syntax

The minimal invocation of join is

    join filename varname

where filename is the name of a data file and varname is the name of a series to be imported. Only two sorts of data file are supported at present: delimited text files (where the delimiter may be comma, space, tab or semicolon) and "native" gretl data files (gdt or gdtb). A series named varname may already be present in the left-hand dataset, but that is not required. The series to be imported may be numerical or string-valued. For most of the discussion below we assume that just a single series is imported by each join command, but see section 7.7 for an account of multiple imports.

The effect of the minimal version of join is this: gretl looks for a data column labeled varname in the specified file; if such a column is found and the number of observations on the right matches the number of observations in the current sample range on the left, then the values from the right are copied into the relevant range of observations on the left. If varname does not already exist on the left, any observations outside of the current sample are set to NA; if it exists already, then observations outside of the current sample are left unchanged.

The case where you want to rename a series on import is handled by the --data option. This option has one required argument, the name by which the series is known on the right. At this point we need to explain something about right-hand variable names (column headings).

Right-hand names

We accept on input arbitrary column heading strings, but if these strings do not qualify as valid gretl identifiers they are automatically converted, and in the context of join you must use the converted names. A gretl identifier must start with a letter, contain nothing but (ASCII) letters, digits and the underscore character, and must not exceed 31 characters. The rules used in name conversion are:

1. Skip any leading non-letters.
2. Until the 31-character limit is reached or the input is exhausted: transcribe "legal" characters; skip "illegal" characters apart from spaces; and replace one or more consecutive spaces with an underscore, unless the last character transcribed is an underscore, in which case the space is skipped.

In the unlikely event that this policy yields an empty string, we replace the original with "coln", where n is replaced by the 1-based index of the column in question among those used in the join operation. If you are in doubt regarding the converted name of a given column, the function fixname() can be used as a check: it takes the original string as an argument and returns the converted name. Examples:

    ? eval fixname("valid_identifier")
    valid_identifier
    ? eval fixname("12. Some name")
    Some_name
Returning to the use of the --data option, suppose we have a column headed "12. Some name" on the right and wish to import it as x. After figuring out how the right-hand name converts, we can do

    join foo.csv x --data="Some_name"

No right-hand names?

Some data files have no column headings; they jump straight into the data (and you need to determine from accompanying documentation what the columns represent). Since gretl expects column headings, you have to take steps to get the importation right. It is generally a good idea to insert a suitable header row into the data file. However, if for some reason that's not practical, you should give the --no-header option, in which case gretl will name the columns on the right as col1, col2 and so on. If you do not do either of these things you will likely lose the first row of data, since gretl will attempt to make variable names out of it, as described above.

7.3 Filtering

Rows from the outer dataset can be filtered using the --filter option. The required parameter for this option is a Boolean condition, that is, an expression which evaluates to non-zero (true: include the row) or zero (false: skip the row) for each of the outer rows. The filter expression may include any of the following terms: up to three right-hand series (under their converted names, as explained above); scalar or string variables defined "on the left"; any of the operators and functions available in gretl (including user-defined functions); and numeric or string constants.

Here are a few simple examples of potentially valid filter options (assuming that the specified right-hand side columns are found):

    # 1. relationship between two right-hand variables
    --filter="x15<=x17"

    # 2. comparison of right-hand variable with constant
    --filter="nkids>2"

    # 3. comparison of string-valued right-hand variable with string constant
    --filter="SEX==\"F\""

    # 4. filter on valid values of a right-hand variable
    --filter=!missing(income)

    # 5. compound condition
    --filter="x<100 && (x>0 || y>0)"

Note that if you are comparing against a string constant (as in example 3 above) it is necessary to put the string in "escaped" double-quotes (each double-quote preceded by a backslash) so the interpreter knows that F is not supposed to be the name of a variable.

It is safest to enclose the whole filter expression in double quotes; however, this is not strictly required unless the expression contains spaces or the equals sign.

In general, an error is flagged if a missing value is encountered in a series referenced in a filter expression. This is because the condition then becomes indeterminate: taking example 2 above, if the nkids value is NA on any given row we are not in a position to evaluate the condition nkids>2. However, you can use the missing() function, or ok(), which is a shorthand for !missing(), if you need a filter that keys off the missing or non-missing status of a variable.
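To show one of these filters in a complete command, here is a minimal sketch; the file and column names are hypothetical:

    # import only the rows on which the outer column income is valid
    join survey.csv income --filter=!missing(income)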
7.4 Matching with keys

Things get interesting when we come to key-matching. The purpose of this facility is perhaps best introduced by example. Suppose that, as with many survey and census-based datasets, we have a dataset that is composed of two or more related files, each having a different unit of observation; for example, we have a "persons" data file and a "households" data file. Table 7.1 shows a simple, artificial case. The file people.csv contains a unique identifier for the individuals, pid. The households file, hholds.csv, contains the unique household identifier hid, which is also present in the persons file.

As a first example of join with keys, let's add the household-level variable xh to the persons dataset:

    open people.csv --quiet
    join hholds.csv xh --ikey=hid
    print --byobs

The basic key option is named --ikey; this indicates "inner key", that is, the key variable found in the left-hand or inner dataset. By default it is assumed that the right-hand dataset contains a column of the same name, though as we'll see below that assumption can be overridden. The join command above says, "find a series named xh in the right-hand dataset and add it to the left-hand one, using the values of hid to match rows". Looking at the data in Table 7.1 we can see how this should work: persons 1 and 2 are both members of household 1, so they should both get values of 1 for xh; persons 3 and 4 are members of household 2, so that xh = 4; and so on. Note that the order in which the key values occur on the right-hand side does not matter. The gretl output from the print command is shown in the lower panel of Table 7.1.

    people.csv                hholds.csv
    pid,hid,gender,age,xp     hid,country,xh
    1,1,M,50,1                1,US,1
    2,1,F,40,2                6,IT,12
    3,2,M,30,3                3,UK,6
    4,2,F,25,2                4,IT,8
    5,3,M,40,3                2,US,4
    6,4,F,35,4                5,IT,10
    7,4,M,70,3
    8,4,F,60,3
    9,5,F,20,4
    10,6,M,40,4

    pid   hid   xh
    1     1     1
    2     1     1
    3     2     4
    4     2     4
    5     3     6
    6     4     8
    7     4     8
    8     4     8
    9     5     10
    10    6     12

Table 7.1: Two linked CSV data files, and the effect of a join

Note that key variables are treated conceptually as integers: if a specified key contains fractional values, these are truncated.

Two extensions of the basic key mechanism are available.

- If the outer dataset contains a relevant key variable but it goes under a different name from the inner key, you can use the --okey option to specify the outer key. (As with other right-hand names, this does not have to be a valid gretl identifier.) So, for example, if hholds.csv contained the hid information, but under the name HHOLD, the join command above could be modified as

    join hholds.csv xh --ikey=hid --okey=HHOLD

- If a single key is not sufficient to generate the matches you want, you can specify a double key in the form of two series names separated by a comma; in this case the importation of data is restricted to those rows on which both keys match. The syntax here is, for example,

    join foo.csv x --ikey=key1,key2

Again, the --okey option may be used if the corresponding right-hand columns are named differently. The same number of keys must be given on the left and the right, but when a double key is used and only one of the key names differs on the right, the name that is in common may be omitted (although the comma separator must be retained). For example, the second of the following lines is acceptable shorthand for the first:

    join foo.csv x --ikey=key1,Lkey2 --okey=key1,Rkey2
    join foo.csv x --ikey=key1,Lkey2 --okey=,Rkey2
The number of key-matches

The example shown in Table 7.1 is an instance of a 1 to 1 match: applying the matching criterion produces exactly one value of the variable xh corresponding to each row of the inner dataset. Three other possibilities arise:

- Some rows on the left have multiple matches on the right (1 to n matching).
- Some rows on the right have multiple matches on the left (n to 1 matching).
- Some rows in the inner dataset have no match on the right.

The first case is addressed in detail in the next section; here we discuss the others.

The n to 1 case is straightforward: if a particular key value (or combination of key values) occurs at each of n > 1 observations on the left, but at a single observation on the right, then the right-hand value is entered at each of the matching slots on the left.

The handling of the case where there's no match on the right depends on whether the join operation is adding a new series to the inner dataset or modifying an existing one. If it's a new series, then unmatched rows automatically get NA for the imported data. However, if join is pulling in values for a series already present on the left, only matched rows will be updated. In other words, we do not overwrite an existing value on the left with NA when there's no match on the right.

These defaults may not produce the desired results in every case, but gretl provides the means to modify the effect if need be. We will illustrate with two scenarios.

First, consider adding a new series recording "number of hours worked" when the inner dataset contains individuals and the outer file contains data on jobs. If an individual does not appear in the jobs file, we may want to take her hours worked as implicitly zero rather than NA. In this case gretl's misszero() function can be used to turn NA into 0 in the imported series.

Second, consider updating a series via join, when the outer file is presumed to contain all available updated values, such that "no match" should be taken as an implicit NA. In that case we want the (presumably out-of-date) values on any unmatched rows to be overwritten with NA. Let the series in question be called x (both on the left and the right) and let the common key be called pid. The solution is then

    join update.csv tmpvar --data=x --ikey=pid
    x = tmpvar

As a new variable, tmpvar will get NA for all unmatched rows; we then transcribe its values into x. In a more complicated case one might use the smpl command to limit the sample range before assigning tmpvar to x, or use the conditional assignment operator.

One further point: given some missing values in an imported series, you may want to know whether (a) the NAs were explicitly represented in the outer data file or (b) they arose due to "no match". You can find this out by using a method described in the following section, namely the count variant of the --aggr (aggregation) option: this will give you a series with 0 values for all, and only, unmatched rows.
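To make the first scenario concrete, here is a minimal sketch; the file names, the pid key and the hrs column are hypothetical:

    # individuals on the left, one row per job in jobs.csv on the right
    join jobs.csv hours --ikey=pid --data=hrs --aggr=sum
    # individuals with no job record: turn NA into zero hours
    series hours = misszero(hours)
    # count matches: 0 flags the rows that had no match at all
    join jobs.csv nmatch --ikey=pid --aggr=count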
7.5 Aggregation

In the case of 1 to n matching of rows (n > 1), the user must specify an "aggregation method", that is, a method for mapping from n rows down to one. This is handled by the --aggr option, which requires a single argument from the following list:

    Code        Value returned
    count       count of matches
    avg         mean of matching values
    sum         sum of matching values
    min         minimum of matching values
    max         maximum of matching values
    seq:i       the i-th matching value (e.g. seq:2)
    min(aux)    minimum of matching values of auxiliary variable
    max(aux)    maximum of matching values of auxiliary variable

Note that the count aggregation method is special, in that there is no need for a "data series" on the right: the imported series is simply a function of the specified key(s). All the other methods require that "actual data" are found on the right. Also note that when count is used, the value returned when no match is found is (as one might expect) zero rather than NA.

The basic use of the seq method is shown above: following the colon you give a positive integer representing the (1-based) position of the observation in the sequence of matched rows. Alternatively, a negative integer can be used to count down from the last match (seq:-1 selects the last match, seq:-2 the second-last match, and so on). If the specified sequence number is out of bounds for a given observation, this method returns NA.

Referring again to the data in Table 7.1, suppose we want to import data from the persons file into a dataset established at household level. Here's an example where we use the individual age data from people.csv to add the average and minimum age of household members:

    open hholds.csv --quiet
    join people.csv avgage --ikey=hid --data=age --aggr=avg
    join people.csv minage --ikey=hid --data=age --aggr=min

Here's a further example, where we add to the household data the sum of the personal data xp, with the twist that we apply filters to get the sum specifically for household members under the age of 40, and for women:

    open hholds.csv --quiet
    join people.csv youngxp --ikey=hid --filter="age<40" --data=xp --aggr=sum
    join people.csv femalexp --ikey=hid --filter="gender==\"F\"" --data=xp --aggr=sum

The possibility of using an auxiliary variable with the min and max modes of aggregation gives extra flexibility. For example, suppose we want for each household the income of its oldest member:

    open hholds.csv --quiet
    join people.csv oldestxp --ikey=hid --data=xp --aggr=max(age)

7.6 String-valued key variables

The examples above use numerical variables (household and individual ID numbers) in the matching process. It is also possible to use string-valued variables, in which case a match means that the string values of the key variables compare equal (with case sensitivity). When using double keys, you can mix numerical and string keys, but naturally you cannot mix a string variable on the left (via --ikey) with a numerical one on the right (via --okey), or vice versa.

Here's a simple example. Suppose that alongside hholds.csv we have a file countries.csv with the following content:

    country,GDP
    UK,100
    US,500
    IT,150
    FR,180

The variable country, which is also found in hholds.csv, is string-valued. We can pull the GDP of the country in which each household resides into our households dataset with

    open hholds.csv -q
    join countries.csv GDP --ikey=country

which gives

         hid  country      GDP
    1      1        1      500
    2      6        2      150
    3      3        3      100
    4      4        2      150
    5      2        1      500
    6      5        2      150

7.7 Importing multiple series

The examples given so far have been limited in one respect: while several columns in the outer data file may be referenced (as keys, or in filtering or aggregation), only one column has actually provided data, and correspondingly only one series in the inner dataset has been created or modified, per invocation of join. However, join can handle the importation of several series at once. This section gives an account of the required syntax along with certain restrictions that apply to the multiple-import case.

There are two ways to specify more than one series for importation:

1. The varname field in the command can take the form of a space-separated list of names rather than a single name.
2. Alternatively, you can give the name of an array of strings in place of varname: the elements of this array should be the names of the series to import.

Here are the limitations:

1. The --data option, which permits the renaming of a series on import, is not available. When importing multiple series you are obliged to accept their "outer" names, fixed up as described in section 7.2.
2. While the other join options are available, they necessarily apply uniformly to all the series imported via a given command. This means that if you want to import several series but using different keys, filters or aggregation methods, you must use a sequence of commands.

Here are a couple of examples of multiple imports:

    # open base datafile containing keys
    open PUMSdata.gdt
    # join using a list of import names
    join ss13pnc.csv SCHL WAGP WKHP --ikey=SERIALNO,SPORDER

    # using a strings array: may be worthwhile if the array
    # will be used for more than one purpose
    strings S = defarray("SCHL", "WAGP", "WKHP")
    join ss13pnc.csv S --ikey=SERIALNO,SPORDER
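As an illustration of the second limitation, a minimal sketch (names hypothetical) of the sequential approach needed when the aggregation method differs across imports:

    # same outer file and key, but a different aggregation per series
    join jobs.csv hrs_tot --ikey=pid --data=hrs --aggr=sum
    join jobs.csv hrs_max --ikey=pid --data=hrs --aggr=max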
7.8 A real-world case

For a real use-case for join with cross-sectional data, we turn to the Bank of Italy's Survey on Household Income and Wealth (SHIW).[1] In ASCII form, the 2010 survey results comprise 47 MB of data in 29 files. In this exercise we will draw on five of the SHIW files to construct a replica of the dataset used in Thomas Mroz's famous paper (Mroz, 1987) on women's labor force participation, which contains data on married women between the ages of 30 and 60, along with certain characteristics of their households and husbands.

[1] Details of the survey can be found at http://www.bancaditalia.it/statistiche/indcamp/bilfait/dismicro. The ASCII (CSV) data files for the 2010 survey are available at http://www.bancaditalia.it/statistiche/indcamp/bilfait/dismicro/annuale/ascii/ind10_ascii.zip.

Our general strategy is as follows: we create a "core" dataset by opening the file carcom10.csv, which contains basic data on the individuals. After dropping unwanted individuals (all but married women), we use the resulting dataset as a base for pulling in further data via the join command. The complete script to do the job is given in the Appendix to this chapter; here we walk through the script with comments interspersed. We assume that all the relevant files from the Bank of Italy survey are contained in a subdirectory called SHIW.

Starting with carcom10.csv, we use the --cols option to the open command to import specific series, namely NQUEST (household ID number), NORD (sequence number for individuals within each household), SEX (male = 1, female = 2), PARENT (status in household: 1 = head of household, 2 = spouse of head, etc.), STACIV (marital status: married = 1), STUDIO (educational level, coded from 1 to 8), ETA (age in years) and ACOM4C (size of town):

    open SHIW/carcom10.csv --cols=1,2,3,4,9,10,29,41

We then restrict the sample to married women from 30 to 60 years of age, and additionally restrict the sample of women to those who are either heads of households or spouses of the head:

    smpl SEX==2 && ETA>=30 && ETA<=60 && STACIV==1 --restrict
    smpl PARENT<3 --restrict

For compatibility with the Mroz dataset as presented in the gretl data file mroz87.gdt, we rename the age and education variables as WA and WE respectively, we compute the CIT dummy, and finally we store the reduced base dataset in gretl format:

    rename ETA WA
    rename STUDIO WE
    series CIT = (ACOM4C > 2)
    store mrozrep.gdt

The next step will be to get data on working hours from the jobs file allb1.csv. There's a complication here: we need the total hours worked over the course of the year, for both the women and their husbands. This is not available as such, but the variables ORETOT and MESILAV give, respectively, average hours worked per week and the number of months worked in 2010, each on a per-job basis. If each person held at most one job over the year we could compute his or her annual hours as

    HRS = ORETOT * 52 * MESILAV/12

However, some people had more than one job, and in this case what we want is the sum of annual hours across their jobs. We could use join with the seq aggregation method to construct this sum, but it is probably more straightforward to read the allb1 data, compute the HRS values per job as shown above, and save the results to a temporary CSV file:

    open SHIW/allb1.csv --cols=1,2,8,11 --quiet
    series HRS = misszero(ORETOT) * 52 * misszero(MESILAV)/12
    store HRS.csv NQUEST NORD HRS

Now we can reopen the base dataset and join the hours variable from HRS.csv. Note that we need a double key here: the women are uniquely identified by the combination of NQUEST and NORD. We don't need an --okey specification, since these keys go under the same names in the right-hand file. We define labor force participation, LFP, based on hours:

    open mrozrep.gdt
    join HRS.csv WHRS --ikey=NQUEST,NORD --data=HRS --aggr=sum
    WHRS = misszero(WHRS)
    LFP = WHRS > 0
For reference, here's how we could have used seq to avoid writing a temporary file:

    join SHIW/allb1.csv njobs --ikey=NQUEST,NORD --data=ORETOT --aggr=count
    series WHRS = 0
    loop i=1..max(njobs)
        join SHIW/allb1.csv htmp --ikey=NQUEST,NORD --data=ORETOT --aggr=seq:i
        join SHIW/allb1.csv mtmp --ikey=NQUEST,NORD --data=MESILAV --aggr=seq:i
        WHRS += misszero(htmp) * 52 * misszero(mtmp)/12
    endloop

To generate the work experience variable, AX, we use the file lavoro.csv: this contains a variable named ETALAV which records the age at which the person first started work.

    join SHIW/lavoro.csv ETALAV --ikey=NQUEST,NORD
    series AX = misszero(WA - ETALAV)

We compute the woman's hourly wage, WW, as the ratio of total employment income to annual working hours. This requires drawing the series YL (payroll income) and YM (net self-employment income) from the persons file rper10.csv:

    join SHIW/rper10.csv YL YM --ikey=NQUEST,NORD --aggr=sum
    series WW = LFP ? (YL + YM)/WHRS : 0

The family's net disposable income is available as Y in the file rfam10.csv; we import this as FAMINC:

    join SHIW/rfam10.csv FAMINC --ikey=NQUEST --data=Y

Data on the number of children are now obtained by applying the count method. For the Mroz replication we want the number of children under the age of 6, and also the number aged 6 to 18:

    join SHIW/carcom10.csv KIDS --ikey=NQUEST --aggr=count --filter="ETA<=18"
    join SHIW/carcom10.csv KL6 --ikey=NQUEST --aggr=count --filter="ETA<6"
    series K618 = KIDS - KL6

We want to add data on the women's husbands, but how do we find them? To do this we create an additional inner key, which we'll call HID (husband ID), by subsampling in turn on the observations falling into each of two classes: (a) those where the woman is recorded as head of household, and (b) those where the husband has that status. In each case we want the individual ID (NORD) of the household member whose status is complementary to that of the woman in question. So for case (a) we subsample using PARENT==1 (head of household) and filter the join using PARENT==2 (spouse of head); in case (b) we do the converse. We thus construct HID piece-wise:

    # for women who are household heads
    smpl PARENT==1 --restrict --replace
    join SHIW/carcom10.csv HID --ikey=NQUEST --data=NORD --filter="PARENT==2"
    # for women who are not household heads
    smpl PARENT==2 --restrict --replace
    join SHIW/carcom10.csv HID --ikey=NQUEST --data=NORD --filter="PARENT==1"
    smpl full

Now we can use our new inner key to retrieve the husbands' data, matching HID on the left with NORD on the right within each household:

    join SHIW/carcom10.csv HA --ikey=NQUEST,HID --okey=NQUEST,NORD --data=ETA
    join SHIW/carcom10.csv HE --ikey=NQUEST,HID --okey=NQUEST,NORD --data=STUDIO
    join HRS.csv HHRS --ikey=NQUEST,HID --okey=NQUEST,NORD --data=HRS --aggr=sum
    HHRS = misszero(HHRS)

The remainder of the script is straightforward and does not require discussion here: we recode the education variables for compatibility, delete some intermediate series that are not needed any more, add informative labels, and save the final product. See the Appendix for details.

To compare the results from this dataset with those from the earlier US data used by Mroz, one can copy the input file heckit.inp (supplied with the gretl package) and substitute mrozrep.gdt for mroz87.gdt. It turns out that the results are qualitatively very similar.

7.9 The representation of dates

Up to this point, all the data we have considered have been cross-sectional. In the following sections we discuss data that have a time dimension, and before proceeding it may be useful to say something about the representation of dates. Gretl takes the ISO 8601 standard as its reference point, but provides means of converting dates provided in other formats; it also offers a set of calendrical functions for manipulating dates (isodate, isoconv, epochday and others).
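As a small, hedged illustration of those helper functions (check the Gretl Function Reference for the exact signatures):

    # epochday gives the number of days since January 1, AD 1
    scalar ed = epochday(2013, 10, 21)
    eval isodate(ed)      # 8-digit "basic" date, 20131021
    eval isodate(ed, 1)   # string in extended format, "2013-10-21"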
ISO 8601 recognizes two formats for daily dates, "extended" and "basic". In both formats dates are given as 4-digit year, 2-digit month and 2-digit day, in that order. In extended format a dash is inserted between the fields, as in 2013-10-21 (or more generally, YYYY-MM-DD), while in basic format the fields are run together (YYYYMMDD). Extended format is more easily parsed by human readers, while basic format is more suitable for computer processing, since one can apply ordinary arithmetic to compare dates as equal, earlier or later. The standard also recognizes YYYY-MM as representing year and month, e.g. 2010-11 for November 2010,[2] as well as a plain four-digit number for year alone.

[2] The corresponding "basic" form, YYYYMM, is not recognized for year and month.

One problem for economists is that the quarter is not a period covered by ISO 8601. This could be represented by YYYY-Q (with only one digit following the dash), but in gretl output we in fact use a colon, as in 2013:2 for the second quarter of 2013. For printed output of months gretl also uses a colon, as in 2013:06. A difficulty with following ISO here is that, in a statistical context, a string such as 1980-10 may look more like a subtraction than a date. Anyway, at present we are more interested in the parsing of dates on input than in what gretl prints. And in that context note that "excess precision" is acceptable: a month may be represented by its first day (e.g. 2005-05-01 for May 2005), and a quarter may be represented by its first month and day (2005-07-01 for the third quarter of 2005).

Some additional points regarding dates will be taken up as they become relevant in practical cases of joining data.

7.10 Time-series data

Suppose our left-hand dataset is recognized by gretl as time series with a supported frequency (annual, quarterly, monthly, weekly, daily, or hourly). This will be the case if the original data were read from a file that contained suitable time or date information, or if a time-series interpretation has been imposed using either the setobs command or its GUI equivalent. Then, apart perhaps from some very special cases, joining additional data is bound to involve matching observations by time period. In this case, contrary to the cross-sectional case, the inner dataset has a natural ordering of which gretl is aware; hence no "inner key" is required. If, in addition, the file from which data are to be joined is in native gretl format and contains time series information, keys are not needed at all.

Three cases can arise: the frequency of the outer dataset may be the same as, lower than, or higher than that of the inner dataset. In the first two cases join should work without any special apparatus: lower-frequency values will be repeated for each high-frequency period. In the third case, however, an aggregation method must be specified: gretl needs to know how to map the higher-frequency data into the existing dataset, by averaging, summing or whatever.
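For instance, a minimal sketch of the third case, under the assumptions that the working dataset is quarterly and that the hypothetical file monthly.csv holds monthly observations with ISO dates in its first column (so the default time key applies):

    # map monthly x down to the quarterly frequency by averaging
    join monthly.csv x --aggr=avg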
If the outer data file is not in native gretl format, we need a means of identifying the period of each observation on the right: an "outer key", which we'll call a time key. The join command provides a simple (but limited) default for extracting period information from the outer data file, plus an option that can be used if the default is not applicable, as follows.

The default assumptions are: (1) the time key appears in the first column; (2) the heading of this column is either left blank or is one of "obs", "date", "year", "period", "observation", or "observation_date" (on a case-insensitive comparison); and (3) the time format conforms to ISO 8601 where applicable: extended daily date format YYYY-MM-DD, monthly format YYYY-MM, or annual format YYYY.

If dates do not appear in the first column of the outer file, or if the column heading or format is not as just described, the --tkey option can be used to indicate which column should be used and/or what format should be assumed.

Setting the time-key column and/or format

The --tkey option requires a parameter holding the name of the column in which the time key is located, and/or a string specifying the format in which dates/times are written in the time-key column. This parameter should be enclosed in double-quotes. If both elements are present they should be separated by a comma; if only a format is given it should be preceded by a comma. Some examples:

    --tkey="Period,%m/%d/%Y"
    --tkey="Period"
    --tkey="obsperiod"
    --tkey=",%Ym%m"

The first of these applies if Period is not the first column on the right, and dates are given in the US format of month, day, year, separated by slashes. The second implies that, although Period is not the first column, the date format is ISO 8601. The third again implies that the date format is OK; here the name is required even if obsperiod is the first column, since this heading is not one recognized by gretl's heuristic. The last example implies that dates are in the first column (with one of the recognized headings), but are given in the non-standard format "year, m, month".

The date format string should be composed using the codes employed by the POSIX function strptime; Table 7.2 contains a list of the most relevant codes.[3]

    Code   Meaning
    %%     A literal "%" character
    %b     The month name according to the current locale, either
           abbreviated or in full
    %C     The century number (0-99)
    %d     The day of month (1-31)
    %D     Equivalent to %m/%d/%y. This is the American style date, very
           confusing to non-Americans, especially since %d/%m/%y is widely
           used in Europe. The ISO 8601 standard format is %Y-%m-%d.
    %H     The hour (0-23)
    %j     The day number in the year (1-366)
    %m     The month number (1-12)
    %n     Arbitrary whitespace
    %q     The quarter (1-4)
    %w     The weekday number (0-6), with Sunday = 0
    %y     The year within century (0-99). When a century is not otherwise
           specified, values in the range 69-99 refer to years in the
           twentieth century (1969-1999); values in the range 00-68 refer
           to years in the twenty-first century (2000-2068).
    %Y     The year, including century (for example, 1991)

Table 7.2: Date format codes

[3] The %q code for quarter is not present in strptime; it is added for use with join, since quarterly data are common in macroeconomics.

Example: daily stock prices

We show below the first few lines of a file named IBM.csv, containing stock-price data for IBM corporation:

    Date,Open,High,Low,Close,Volume,Adj Close
    2013-08-02,195.50,195.50,193.22,195.16,3861000,195.16
    2013-08-01,196.65,197.17,195.41,195.81,2856900,195.81
    2013-07-31,194.49,196.91,194.49,195.04,3810000,195.04

Note that the data are in reverse time-series order; that won't matter to join, the data can appear in any order. Also note that the first column is headed Date and holds daily dates as ISO 8601 extended. That means we can pull the data into gretl very easily. In the following fragment we create a suitably dimensioned empty daily dataset, then rely on the default behavior of join with time-series data to import the closing stock price:

    nulldata 500
    setobs 5 2012-01-01
    join IBM.csv Close

To make explicit what we're doing, we could accomplish exactly the same thing using the --tkey option:

    join IBM.csv Close --tkey="Date,%Y-%m-%d"
Example: OECD quarterly data

Table 7.3 shows an excerpt from a CSV file provided by the OECD statistical site (stat.oecd.org) in response to a request for GDP at constant prices for several countries.[4]

    Frequency,Period,Country,Value,Flags
    Quarterly,Q1-1960,France,463876.148126845,E
    Quarterly,Q1-1960,Germany,768802.119278467,E
    Quarterly,Q1-1960,Italy,414629.791450547,E
    Quarterly,Q1-1960,United Kingdom,578437.090291889,E
    Quarterly,Q2-1960,France,465618.977328614,E
    Quarterly,Q2-1960,Germany,782484.138122549,E
    Quarterly,Q2-1960,Italy,420714.910290157,E
    Quarterly,Q2-1960,United Kingdom,572853.474696578,E
    Quarterly,Q3-1960,France,469104.41925852,E
    Quarterly,Q3-1960,Germany,809532.161494483,E
    Quarterly,Q3-1960,Italy,426893.675840156,E
    Quarterly,Q3-1960,United Kingdom,581252.066618986,E
    Quarterly,Q4-1960,France,474664.327992619,E
    Quarterly,Q4-1960,Germany,817806.132384948,E
    Quarterly,Q4-1960,Italy,427221.338414114,E

Table 7.3: Example of CSV file as provided by the OECD statistical website

[4] Retrieved 2013-08-05. The OECD files in fact contain two leading columns with very long labels; these are irrelevant to the present example and can be omitted without altering the sample script.

This is an instance of data in what we call "atomic" format, that is, a format in which each line of the outer file contains a single data-point, and extracting data mainly requires filtering the appropriate lines. The outer time key is under the Period heading, and has the format Q<quarter>-<year>.

Assuming that the file in Table 7.3 has the name oecd.csv, the following script reconstructs the time series of Gross Domestic Product for several countries:

    nulldata 220
    setobs 4 1960:1
    join oecd.csv FRA --tkey="Period,Q%q-%Y" --data=Value --filter="Country==\"France\""
    join oecd.csv GER --tkey="Period,Q%q-%Y" --data=Value --filter="Country==\"Germany\""
    join oecd.csv ITA --tkey="Period,Q%q-%Y" --data=Value --filter="Country==\"Italy\""
    join oecd.csv UK --tkey="Period,Q%q-%Y" --data=Value --filter="Country==\"United Kingdom\""

Note the use of the format codes %q for the quarter and %Y for the 4-digit year. A touch of elegance could have been added by storing the invariant options to join using the setopt command, as in

    setopt join persist --tkey="Period,Q%q-%Y" --data=Value
    join oecd.csv FRA --filter="Country==\"France\""
    join oecd.csv GER --filter="Country==\"Germany\""
    join oecd.csv ITA --filter="Country==\"Italy\""
    join oecd.csv UK --filter="Country==\"United Kingdom\""
    setopt join clear

If one were importing a large number of such series it might be worth rewriting the sequence of joins as a loop, as in

    strings countries = defarray("France", "Germany", "Italy", "United Kingdom")
    strings vnames = defarray("FRA", "GER", "ITA", "UK")
    setopt join persist --tkey="Period,Q%q-%Y" --data=Value
    loop foreach i countries
        vname = vnames[i]
        join oecd.csv @vname --filter="Country==\"$i\""
    endloop
    setopt join clear

7.11 Special handling of time columns

When dealing with straight time-series data, the tkey mechanism described above should suffice in almost all cases. In some contexts, however, time enters the picture in a more complex way; examples include panel data (see section 7.12) and so-called realtime data (see chapter 8). To handle such cases join provides the --tconvert option. This can be used to select certain columns in the right-hand data file for special treatment: strings representing dates in these columns will be converted to numerical values, namely 8-digit numbers on the pattern YYYYMMDD (ISO basic daily format). Once dates are in this form, it is easy to use them in key-matching or filtering.

By default it is assumed that the strings in the selected columns are in ISO extended format, YYYY-MM-DD. If that is not the case, you can supply a time-format string using the --tconvfmt option. The format string should be written using the codes shown in Table 7.2. Here are some examples:

    # select one column for treatment
    --tconvert=start_date

    # select two columns for treatment
    --tconvert="start_date,end_date"

    # specify a US-style daily date format
    --tconvfmt="%m/%d/%Y"

    # specify quarterly date-strings (as in 2004q1)
    --tconvfmt="%Yq%q"
Some points to note:

- If a specified column is not selected for a substantive role in the join operation (as data to be imported, as a key, or as an auxiliary variable for use in aggregation), the column in question is not read, and so no conversion is carried out.
- If a specified column contains numerical rather than string values, no conversion is carried out.
- If a string value in a selected column fails to parse using the relevant time format (user-specified or default), the converted value is NA.
- On successful conversion, the output is always in daily-date form, as stated above. If you specify a monthly or quarterly time format, the converted date is the first day of the month or quarter.

7.12 Panel data

In section 7.10 we gave an example of reading quarterly GDP data for several countries from an OECD file. In that context we imported each country's data as a distinct time-series variable. Now suppose we want the GDP data in panel format instead (stacked time series). How can we do this with join?

As a reminder, here's what the OECD data look like:

    Frequency,Period,Country,Value,Flags
    Quarterly,Q1-1960,France,463876.148126845,E
    Quarterly,Q1-1960,Germany,768802.119278467,E
    Quarterly,Q1-1960,Italy,414629.791450547,E
    Quarterly,Q1-1960,United Kingdom,578437.090291889,E
    Quarterly,Q2-1960,France,465618.977328614,E

and so on. If we have four countries and quarterly observations running from 1960:1 to 2013:2 (T = 214 quarters), we might set up our panel workspace like this:

    scalar N = 4
    scalar T = 214
    scalar NT = N*T
    nulldata NT --preserve
    setobs T 1:1 --stacked-time-series

The relevant outer keys are obvious: Country for the country and Period for the time period. Our task is now to construct matching keys in the inner dataset. This can be done via two panel-specific options to the setobs command. Let's work on the time dimension first:

    setobs 4 1960:1 --panel-time
    series quarter = $obsdate

This variant of setobs allows us to tell gretl that time in our panel is quarterly, starting in the first quarter of 1960. Having set that, the accessor $obsdate will give us a series of 8-digit dates representing the first day of each quarter: 19600101, 19600401, 19600701, and so on, repeating for each country. As we explained in section 7.11, we can use the --tconvert option on the outer series Period to get exactly matching values (in this case using a format of Q%q-%Y for parsing the Period values).

Now for the country names:

    string cstrs = sprintf("France Germany Italy \"United Kingdom\"")
    setobs country cstrs --panel-groups

Here we write into the string cstrs the names of the countries, using escaped double-quotes to handle the space in "United Kingdom", then pass this string to setobs with the --panel-groups option, preceded by the identifier country. This asks gretl to construct a string-valued series named country, in which each name will repeat T times.

We're now ready to join. Assuming the OECD file is named oecd.csv, we do

    join oecd.csv GDP --data=Value \
      --ikey=country,quarter --okey=Country,Period \
      --tconvert=Period --tconvfmt="Q%q-%Y"

Other input formats

The OECD file discussed above is in the most convenient format for join, with one data-point per line. But sometimes we may want to make a panel from a data file structured like this:

    Real GDP
    Period,France,Germany,Italy,United Kingdom
    Q1-1960,463863,768757,414630,578437
    Q2-1960,465605,782438,420715,572853
    Q3-1960,469091,809484,426894,581252
    Q4-1960,474651,817758,427221,584779
    Q1-1961,482285,826031,442528,594684
Call this file side_by_side.csv. Assuming the same initial setup as above, we can "panelize" the data by setting the sample to each country's time series in turn and importing the relevant column. The only point to watch here is that the string "United Kingdom", being a column heading, will become United_Kingdom on importing (see section 7.2), so we'll need a slightly different set of country strings:

    strings cstrs = defarray("France", "Germany", "Italy", "United_Kingdom")
    setobs country cstrs --panel-groups
    loop foreach i cstrs
        smpl country=="$i" --restrict --replace
        join side_by_side.csv GDP --data=$i \
          --ikey=quarter --okey=Period \
          --tconvert=Period --tconvfmt="Q%q-%Y"
    endloop
    smpl full

If our working dataset and the outer data file are dimensioned such that there are just as many time-series observations on the right as there are time slots on the left, and the observations on the right are contiguous, in chronological order, and start on the same date as the working dataset, we could dispense with the key apparatus and just use the first line of the join command shown above. However, in general it is safer to use keys to ensure that the data end up in correct registration.

7.13 Memo: join options

Basic syntax:

    join filename varname(s) [options]

    flag          effect
    --data        Give the name of the data column on the right, in case
                  it differs from varname (7.2); single import only
    --filter      Specify a condition for filtering data rows (7.3)
    --ikey        Specify up to two keys for matching data rows (7.4)
    --okey        Specify outer key name(s) in case they differ from the
                  inner ones (7.4)
    --aggr        Select an aggregation method for 1 to n joins (7.5)
    --tkey        Specify right-hand time key (7.10)
    --tconvert    Select outer date columns for conversion to numeric
                  form (7.11)
    --tconvfmt    Specify a format for use with --tconvert (7.11)
    --no-header   Treat the first row on the right as data (7.2)
    --verbose     Report on progress in reading the outer data
Appendix: the full Mroz data script

    # start with everybody; get gender, age and a few other variables
    # directly while we're at it
    open SHIW/carcom10.csv --cols=1,2,3,4,9,10,29,41

    # subsample on married women between the ages of 30 and 60
    smpl SEX==2 && ETA>=30 && ETA<=60 && STACIV==1 --restrict

    # for simplicity, restrict to heads of households and their spouses
    smpl PARENT<3 --restrict

    # rename the age and education variables for compatibility;
    # compute the "city" dummy and finally save the reduced base dataset
    rename ETA WA
    rename STUDIO WE
    series CIT = (ACOM4C > 2)
    store mrozrep.gdt

    # make a temp file holding annual hours worked per job
    open SHIW/allb1.csv --cols=1,2,8,11 --quiet
    series HRS = misszero(ORETOT) * 52 * misszero(MESILAV)/12
    store HRS.csv NQUEST NORD HRS

    # reopen the base dataset and begin drawing assorted data in
    open mrozrep.gdt

    # women's annual hours (summed across jobs)
    join HRS.csv WHRS --ikey=NQUEST,NORD --data=HRS --aggr=sum
    WHRS = misszero(WHRS)

    # labor force participation
    LFP = WHRS > 0

    # work experience: ETALAV = age when started first job
    join SHIW/lavoro.csv ETALAV --ikey=NQUEST,NORD
    series AX = misszero(WA - ETALAV)

    # women's hourly wages
    join SHIW/rper10.csv YL YM --ikey=NQUEST,NORD --aggr=sum
    series WW = LFP ? (YL + YM)/WHRS : 0

    # family income (Y = net disposable income)
    join SHIW/rfam10.csv FAMINC --ikey=NQUEST --data=Y

    # get data on children using the "count" method
    join SHIW/carcom10.csv KIDS --ikey=NQUEST --aggr=count --filter="ETA<=18"
    join SHIW/carcom10.csv KL6 --ikey=NQUEST --aggr=count --filter="ETA<6"
    series K618 = KIDS - KL6

    # data on husbands: we first construct an auxiliary inner key for
    # husbands, using the little trick of subsampling the inner dataset

    # for women who are household heads
    smpl PARENT==1 --restrict --replace
    join SHIW/carcom10.csv HID --ikey=NQUEST --data=NORD --filter="PARENT==2"

    # for women who are not household heads
    smpl PARENT==2 --restrict --replace
    join SHIW/carcom10.csv HID --ikey=NQUEST --data=NORD --filter="PARENT==1"
    smpl full

    # add husbands' data via the newly-added secondary inner key
    join SHIW/carcom10.csv HA --ikey=NQUEST,HID --okey=NQUEST,NORD --data=ETA
    join SHIW/carcom10.csv HE --ikey=NQUEST,HID --okey=NQUEST,NORD --data=STUDIO
    join HRS.csv HHRS --ikey=NQUEST,HID --okey=NQUEST,NORD --data=HRS --aggr=sum
    HHRS = misszero(HHRS)

    # final cleanup begins

    # recode educational attainment as years of education
    matrix eduyrs = {0, 5, 8, 11, 13, 16, 18, 21}
    series WE = replace(WE, seq(1,8), eduyrs)
    series HE = replace(HE, seq(1,8), eduyrs)

    # cut some cruft
    delete SEX STACIV KIDS YL YM PARENT HID ETALAV

    # add some labels for the series
    setinfo LFP -d "1 if woman worked in 2010"
    setinfo WHRS -d "Wife's hours of work in 2010"
    setinfo KL6 -d "Number of children less than 6 years old in household"
    setinfo K618 -d "Number of children between ages 6 and 18 in household"
    setinfo WA -d "Wife's age"
    setinfo WE -d "Wife's educational attainment, in years"
    setinfo WW -d "Wife's average hourly earnings, in 2010 euros"
    setinfo HHRS -d "Husband's hours worked in 2010"
    setinfo HA -d "Husband's age"
    setinfo HE -d "Husband's educational attainment, in years"
    setinfo FAMINC -d "Family income, in 2010 euros"
    setinfo AX -d "Actual years of wife's previous labor market experience"
    setinfo CIT -d "1 if live in large city"

    # save the final product
    store mrozrep.gdt

8 Realtime data

8.1 Introduction

As of gretl version 1.9.13 the join command (see chapter 7) has been enhanced to deal with so-called realtime datasets in a straightforward manner. Such datasets contain information on when the observations in a time series were actually published by the relevant statistical agency, and how they have been revised over time. Probably the most popular sources of such data are the "Alfred" online database at the St. Louis Fed (http://alfred.stlouisfed.org) and the OECD's StatExtracts site (http://stats.oecd.org). The examples in this chapter deal with files downloaded from these sources, but should be easy to adapt to files with a slightly different format.

As already stated, join requires a column-oriented plain text file, where the columns may be separated by commas, tabs, spaces or semicolons. Alfred and the OECD provide the option to download realtime data in this format (tab-delimited files from Alfred, comma-delimited from the OECD). If you have a realtime dataset in a spreadsheet file, you must export it to a delimited text file before using it with join.

Representing revision histories is more complex than just storing a standard time series, because for each observation period you have, in general, more than one published value over time, along with the information on when each of these values was valid or current. Sometimes this is represented in spreadsheets with two time axes, one for the observation period and another one for the publication date or "vintage". The filled cells then form an upper triangle (or a "guillotine blade" shape, if the publication dates do not reach back far enough to complete the triangle). This format can be useful for giving a human reader an overview of realtime data, but it is not optimal for automatic processing; for that purpose "atomic" format is best.
8.2 Atomic format for realtime data

What we are calling atomic format is exactly the format used by Alfred if you choose the option "Observations by Real-Time Period", and by the OECD if you select all editions of a series for download as plain text (CSV).[1] A file in this format contains one actual data-point per line, together with associated metadata. This is illustrated in Table 8.1, where we show the first three lines from an Alfred file and an OECD file (slightly modified).[2]

    Alfred: monthly US industrial production

    observation_date,INDPRO,realtime_start_date,realtime_end_date
    1960-01-01,112.0000,1960-02-16,1960-03-15
    1960-01-01,111.0000,1960-03-16,1961-10-15

    OECD: monthly UK industrial production

    Country,Variable,Frequency,Time,Edition,Value,Flags
    "United Kingdom","INDPRO","Monthly","Jan-1990","February 1999",100,
    "United Kingdom","INDPRO","Monthly","Feb-1990","February 1999",99.3,

Table 8.1: Variant atomic formats for realtime data

[1] If you choose to download in Excel format from OECD you get a file in the triangular or "guillotine" format mentioned above.
[2] In the Alfred file we have used commas rather than tabs as the column delimiter; in the OECD example we have shortened the name in the Variable column.

Consider the first data line in the Alfred file. In the observation_date column we find 1960-01-01, indicating that the data-point on this line, namely 112.0, is an observation or measurement (in this case, of the US index of industrial production) that refers to the period starting on January 1st 1960. The realtime_start_date value of 1960-02-16 tells us that this value was published on February 16th 1960, and the realtime_end_date value says that this vintage remained current through March 15th 1960. On the next day (as we can see from the following line) this data-point was revised slightly downward, to 111.0.

Daily dates in Alfred files are given in ISO extended format, YYYY-MM-DD, but below we describe how to deal with differently formatted dates. Note that daily dates are appropriate for the last two columns, which jointly record the interval over which a given data vintage was current. Daily dates might, however, be considered overly precise for the first column, since the data period may well be the year, quarter or month (as it is in fact here). However, following Alfred's practice, it is acceptable to specify a daily date, indicating the first day of the period, even for non-daily data.[3]

[3] Notice that this implies that in the Alfred example it is not clear, without further information, whether the observation period is the first quarter of 1960, the month January 1960, or the day January 1st 1960. However, we assume that this information is always available in context.

Compare the first data line of the OECD example. There's a greater amount of leading metadata, which is left implicit in the Alfred file. Here Time is the equivalent of Alfred's observation_date, and Edition the equivalent of Alfred's realtime_start_date. So we read that in February 1999 a value of 100 was current for the UK index of industrial production for January 1990, and from the next line we see that in the same vintage month a value of 99.3 was current for industrial production in February 1990.

Besides the different names and ordering of the columns, there are a few more substantive differences between Alfred and OECD files, most of which are irrelevant for join, but some of which are (possibly) relevant.

The first (irrelevant) difference is the ordering of the lines. It appears (though we're not sure how consistent this is) that in Alfred files the lines are sorted by observation date first and then by publication date, so that all revisions of a given observation are grouped together, while OECD files are sorted first by revision date ("Edition") and then by observation date ("Time"). If we want the next revision of UK industrial production for January 1990 in the OECD file, we have to scan down several lines until we find

    "United Kingdom","INDPRO","Monthly","Jan-1990","March 1999",100,

This difference is basically irrelevant because join can handle the case where the lines appear in random order, although some operations can be coded more conveniently if we're able to assume chronological ordering (either on the Alfred or the OECD pattern, it doesn't matter).

The second (also irrelevant) difference is that the OECD seems to include periodic "Edition" lines even when there is no change from the previous value (as illustrated above, where the UK industrial production index for January 1990 is reported as 100 as of March 1999, the same value that we saw to be current in February 1999), while Alfred reports a new value only when it differs from what was previously current.
A third difference lies in the dating of the revisions or editions. As we have seen, Alfred gives a specific daily date while (in the UK industrial production file, at any rate) the OECD just dates each edition to a month. This is not necessarily relevant for join, but it does raise the question of whether the OECD might date revisions to a finer granularity in some of their files, in which case one would have to be on the lookout for a different date format.

The final difference is that Alfred supplies an "end date" for each data vintage, while the OECD supplies only a starting date. But there is less to this difference than meets the eye: according to the Alfred webmaster, by design a new vintage must start immediately following (the day after) the lapse of the old vintage, so the end date conveys no independent information.[4]

[4] Email received from Travis May of the Federal Reserve Bank of St. Louis, 2013-10-17. This closes off the possibility that a given vintage could lapse or expire some time before the next vintage becomes available, hence giving rise to a "hole" in an Alfred realtime file.

8.3 More on time-related options

Before we get properly started, it is worth saying a little more about the --tkey and --tconvert options to join (first introduced in section 7.11), as they apply in the case of realtime data.

When you're working with regular time series data, tkey is likely to be useful while tconvert is unlikely to be applicable (see section 7.10). On the other hand, when you're working with panel data, tkey is definitely not applicable but tconvert may well be helpful (section 7.12). When working with realtime data, however, depending on the task in hand, both options may be useful. You will likely need tkey; you may well wish to select at least one column for tconvert treatment; and in fact you may want to name a given column in both contexts, that is, include the tkey variable among the tconvert columns.

Why might this make sense? Well, think of the tconvert option as a "preprocessing" directive: it asks gretl to convert date strings to numerical values (8-digit ISO basic dates) "at source", as they are read from the outer datafile. The tkey option, on the other hand, singles out a column as the one to use for matching rows with the inner dataset. So you would want to name a column in both roles if (a) it should be used for matching periods and also (b) it is desirable to have the values from this column in numerical form, most likely for use in filtering.

As we have seen, you can supply specific formats in connection with both tkey and tconvert (in the latter case via the companion option --tconvfmt) to handle the case where the date strings on the right are not ISO-friendly at source. This raises the question of how the format specifications work if a given column is named under both options. Here are the rules that gretl applies:

1. If a format is given with the --tkey option it always applies to the tkey column alone; and for that column, it overrides any format given via the --tconvfmt option.
2. If a format is given via --tconvfmt, it is assumed to apply to all the tconvert columns, unless this assumption is overridden by rule 1.
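A minimal sketch of the "both roles" case described above (the file and column names are hypothetical):

    # obs_date serves as the time key for matching; in converted
    # (numeric) form it can also be used in the filter expression
    join rtdata.csv x --tkey=obs_date --tconvert=obs_date \
      --filter="obs_date>=20000101"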
8.4 Getting a certain data vintage

The most common application of real-time data is to "travel back in time" and retrieve the data that were current as of a certain date in the past. This would enable you to replicate a forecast or other statistical result that could have been produced at that date.

For example, suppose we are interested in a variable of monthly frequency named INDPRO, real-time data on which is stored in an Alfred file named INDPRO.txt, and we want to check the status quo as of June 15th, 2011. If we don't already have a suitable dataset into which to import the INDPRO data, our first steps will be to create an appropriately dimensioned empty dataset using the nulldata command and then specify its time-series character via setobs, as in

  nulldata 132
  setobs 12 2004:01

For convenience we can put the name of our real-time file into a string variable. On Windows this might look like

  string fname = "C:/Users/yourname/Downloads/INDPRO.txt"

We can then import the data vintage 2011-06-15 using join, arbitrarily choosing the self-explanatory identifier ip_asof_20110615:

  join @fname ip_asof_20110615 --tkey=observation_date --data=INDPRO \
    --tconvert=realtime_start_date \
    --filter="realtime_start_date<=20110615" --aggr=max(realtime_start_date)

Here some detailed explanations of the various options are warranted.

The --tkey option specifies the column which should be treated as holding the observation period identifiers, to be matched against the periods in the current gretl dataset.[5] The more general form of this option is --tkey="colname,format" (note the double quotes here), so if the dates do not come in standard format we can tell gretl how to parse them by using the appropriate conversion specifiers, as shown in Table 7.2. For example, here we could have written

  --tkey="observation_date,%Y-%m-%d"

[5] Strictly speaking, using --tkey is unnecessary in this example, because we could just have relied on the default, which is to use the first column in the source file for the periods. However, being explicit is often a good idea.

Next, --data=INDPRO tells gretl that we want to retrieve the entries stored in the column named INDPRO.

As explained in section 7.11, the --tconvert option selects certain columns in the right-hand data file for conversion from date strings to 8-digit numbers on the pattern YYYYMMDD. We'll need this for the next step, filtering, since the transformation to numerical values makes it possible to perform basic arithmetic on dates. Note that since date strings in Alfred files conform to gretl's default assumption, it is not necessary to use the --tconvfmt option here.

The --filter option specification, in combination with the subsequent --aggr aggregation treatment, is the central piece of our data retrieval: notice how we use the date constant 20110615 in ISO basic form to do numerical comparisons, and how we perform the numerical max operation on the converted column realtime_start_date. It would also have been possible to predefine a scalar variable, as in vintage = 20110615, and then use vintage in the join command instead. Here we tell join that we only want to extract those publications that (1) already appeared before (and including) June 15th, 2011, and (2) were not yet obsoleted by a newer release.[6]

[6] By implementing the second condition through the max aggregation on the realtime_start_date column alone (without using the realtime_end_date column), we make use of the fact that Alfred files cannot have "holes", as explained before.

As a result, your dataset will now contain a time series named ip_asof_20110615 with the values that a researcher would have had available on June 15th, 2011. Of course, all values for the observations after June 2011 will be missing (and probably a few before that, too), because they only became available later on.
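A minimal sketch of the scalar-variable variant just mentioned (assuming fname is defined as above; the target name ip_asof is arbitrary):

  scalar vintage = 20110615
  join @fname ip_asof --tkey=observation_date --data=INDPRO \
    --tconvert=realtime_start_date \
    --filter="realtime_start_date<=vintage" --aggr=max(realtime_start_date)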
8.5 Getting the n-th release for each observation period

For some purposes it may be useful to retrieve the n-th published value of each observation, where n is a fixed positive integer, irrespective of when each of these n-th releases was published. Suppose we are interested in the third release; then the relevant join command becomes

  join @fname ip_3rdpub --tkey=observation_date --data=INDPRO --aggr=seq:3

Since we do not need the realtime_start_date information for this retrieval, we have dropped the --tconvert option here. Note that this formulation assumes that the source file is ordered chronologically; otherwise using the option --aggr=seq:3, which retrieves the third value from each sequence of matches, could have yielded a result different from the one intended. However, this assumption holds for Alfred files and is probably rather safe in general.

The values of the variable imported as ip_3rdpub in this way were published at different dates, so the variable is effectively a mix of different vintages. Depending on the type of variable, this may also imply drastic jumps in the values; for example, index numbers are regularly re-based to different base periods. This problem also carries over to inflation-adjusted economic variables, where the base period of the price index changes over time. Mixing vintages in general also means mixing different scales in the output, with which you would have to deal appropriately.[7]

[7] Some user-contributed functions may be available that address this issue, but it is beyond our scope here. Another, even more complicated, issue in the real-time context is that of benchmark revisions applied by statistical agencies, where the underlying definition or composition of a variable changes on some date, which goes beyond a mere rescaling. However, this type of structural change is not in principle a feature of real-time data alone, but applies to any time-series data.

8.6 Getting the values at a fixed lag after the observation period

New data releases may take place on any day of the month and, as we have seen, the specific day of each release is recorded in real-time files from Alfred. However, if you are working with, say, monthly or quarterly data, you may sometimes want to adjust the granularity of your real-time axis to a monthly or quarterly frequency. For example, in order to analyse the data revision process for monthly industrial production you might be interested in the extent of revisions between the data available two and three months after each observation period.

This is a relatively complicated task and there is more than one way of accomplishing it. Either you have to make several passes through the outer dataset or you need a sophisticated filter, written as a hansl function. Either way, you will want to make use of some of gretl's built-in calendrical functions. We'll assume that a suitably dimensioned workspace has been set up as described above. Given that, the key ingredients of the join are a filtering function, which we'll call rel_ok (for "release is OK"), and the join command which calls it. Here's the function:

  function series rel_ok (series obsdate, series reldate, int p)
      series y_obs, m_obs, y_rel, m_rel
      # get year and month from observation date
      isoconv(obsdate, &y_obs, &m_obs)
      # get year and month from release date
      isoconv(reldate, &y_rel, &m_rel)
      # find the delta in months
      series dm = (12*y_rel + m_rel) - (12*y_obs + m_obs)
      # and implement the filter
      return dm <= p
  end function

And here's the command:

  scalar lag = 3 # choose your fixed lag here
  join @fname ip_plus3 --data=INDPRO --tkey=observation_date \
    --tconvert="observation_date,realtime_start_date" \
    --filter="rel_ok(observation_date, realtime_start_date, lag)" \
    --aggr=max(realtime_start_date)

Note that we use --tconvert to convert to numerical values both the observation date and the real-time start date (or release date). Both of these series are passed to the filter, which uses the built-in function isoconv to extract year and month. We can then calculate dm, the "delta months" since the observation date, for each release. The filter condition is that this delta should be no greater than the specified lag, p.[8]

[8] The filter is written on the assumption that the lag is expressed in months; on that understanding, it could be used with annual or quarterly data as well as monthly. The idea could be generalized to cover weekly or daily data without much difficulty.

This filter condition may be satisfied by more than one release, but only the latest of those will actually be the vintage that was current at the end of the n-th month after the observation period, so we add the option --aggr=max(realtime_start_date). If instead you want to target the release at the beginning of the n-th month, you would have to use a slightly more complicated filter function.
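To quantify the extent of revisions mentioned at the start of this section, one can simply run such a join at two different lags and take the difference; a minimal sketch (the names ip_plus2 and rev23 are hypothetical; rel_ok and fname are as above):

  join @fname ip_plus2 --data=INDPRO --tkey=observation_date \
    --tconvert="observation_date,realtime_start_date" \
    --filter="rel_ok(observation_date, realtime_start_date, 2)" \
    --aggr=max(realtime_start_date)
  # revision between two and three months after the observation period
  series rev23 = ip_plus3 - ip_plus2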
An illustration: Figure 8.1 shows four time series for the monthly index of US industrial production, from October 2005 to June 2009: the value as of first publication, plus the values current 3, 6 and 12 months out from the observation date.[9] From visual inspection it would seem that over much of this period the Federal Reserve was fairly consistently overestimating industrial production at first release and shortly thereafter, relative to the figure they arrived at with a lag of a year.

[9] Why not a longer series? Because if we try to extend it in either direction we immediately run into the index re-basing problem mentioned in section 8.5, with big staggered leaps downward in all the series.

[Figure 8.1: Successive revisions to US industrial production. Four time-series lines over 2006-2009: "First publication", "Plus 3 months", "Plus 6 months", "Plus 12 months".]

The script that produced this figure is shown in full in Listing 8.1. Note that in this script we are using a somewhat more efficient version of the rel_ok function shown above, where we pass the required series arguments in "pointer" form to avoid having to copy them (see chapter 14).

Listing 8.1: Retrieving successive real-time lags of US industrial production

  function series rel_ok (series *obsdate, series *reldate, int p)
      series y_obs, m_obs, d_obs, y_rel, m_rel, d_rel
      isoconv(obsdate, &y_obs, &m_obs, &d_obs)
      isoconv(reldate, &y_rel, &m_rel, &d_rel)
      series dm = (12*y_rel + m_rel) - (12*y_obs + m_obs)
      return dm < p || (dm == p && d_rel <= d_obs)
  end function

  nulldata 45
  setobs 12 2005:10
  string fname = "INDPRO.txt"

  # initial published values
  join @fname firstpub --data=INDPRO --tkey=observation_date \
    --tconvert=realtime_start_date --aggr=min(realtime_start_date)

  # plus 3 months
  join @fname plus3 --data=INDPRO --tkey=observation_date \
    --tconvert="observation_date,realtime_start_date" \
    --filter="rel_ok(&observation_date, &realtime_start_date, 3)" \
    --aggr=max(realtime_start_date)

  # plus 6 months
  join @fname plus6 --data=INDPRO --tkey=observation_date \
    --tconvert="observation_date,realtime_start_date" \
    --filter="rel_ok(&observation_date, &realtime_start_date, 6)" \
    --aggr=max(realtime_start_date)

  # plus 12 months
  join @fname plus12 --data=INDPRO --tkey=observation_date \
    --tconvert="observation_date,realtime_start_date" \
    --filter="rel_ok(&observation_date, &realtime_start_date, 12)" \
    --aggr=max(realtime_start_date)

  setinfo firstpub --graph-name="First publication"
  setinfo plus3 --graph-name="Plus 3 months"
  setinfo plus6 --graph-name="Plus 6 months"
  setinfo plus12 --graph-name="Plus 12 months"

  # set output=realtime.pdf for PDF
  gnuplot firstpub plus3 plus6 plus12 --time-series --with-lines \
    --output=display { set key left bottom ; }
8.7 Getting the revision history for an observation

For our final example we show how to retrieve the revision history for a given observation (again using Alfred data on US industrial production). In this exercise we are switching the time axis: the observation period is a fixed point, and time is vintage time.

A suitable script is shown in Listing 8.2. We first select an observation to track (January 1970). We start the clock in the following month, when a data-point for this period was first published, and let it run to the end of the vintage history in this file (March 2013). Our outer time key is the real-time start date and we filter on the observation date; we name the imported INDPRO values as ip_jan70.

Since it sometimes happens that more than one revision occurs in a given month, we need to select an aggregation method: here we choose to take the last revision in the month.

Recall from section 8.2 that Alfred records a new revision only when the data-point in question actually changes. This means that our imported series will contain missing values for all months when no real revision took place. However, we can apply a simple autoregressive rule to fill in the blanks: each missing value equals the prior non-missing value.

Figure 8.2 displays the revision history. Over this sample period the periodic re-basing of the index overshadows amendments due to the accrual of new information.

Listing 8.2: Retrieving a revision history

  # choose the observation to track here (YYYYMMDD)
  scalar target = 19700101

  nulldata 518 --preserve
  setobs 12 1970:02

  join INDPRO.txt ip_jan70 --data=INDPRO --tkey=realtime_start_date \
    --tconvert=observation_date \
    --filter="observation_date==target" --aggr=seq:-1

  ip_jan70 = ok(ip_jan70) ? ip_jan70 : ip_jan70(-1)

  gnuplot ip_jan70 --time-series --with-lines --output=display

[Figure 8.2: Vintages of the index of US industrial production for January 1970; the series ip_jan70 plotted over 1970-2013.]

Chapter 9: Temporal disaggregation

9.1 Introduction

This chapter describes and explains the facility for temporal disaggregation in gretl.[1] This is implemented by the tdisagg function, which supports three variants of the method of Chow and Lin (1971); the method of Fernández (1981); and two variants of the method of Denton (1971), as modified by Cholette (1984). Given the analytical similarities between them, the three Chow-Lin variants and the Fernández method will be grouped in the discussion below as "Chow-Lin methods".

[1] We are grateful to Tommaso Di Fonzo, Professor of Statistical Science at the University of Padua, for detailed and precise comments on earlier drafts. Any remaining errors are, of course, our responsibility.

The balance of this section provides a gentle introduction to the idea of temporal disaggregation; experts may wish to skip to the next section.

Basically, temporal disaggregation is the business of taking time-series data observed at some given frequency (say, annually) and producing a counterpart series at a higher frequency (say, quarterly). The term "disaggregation" indicates the inverse operation of aggregation, and to understand temporal disaggregation it's helpful first to understand temporal aggregation. In aggregating a high frequency series to a lower frequency, there are three basic methods, the appropriate method depending on the nature of the data. Here are some illustrative examples; a small numerical sketch of the three methods follows the list.

GDP: say we have quarterly GDP data and wish to produce an annual series. This is a flow variable, and the annual flow will be the sum of the quarterly values (unless the quarterly values are annualized, in which case we would aggregate by taking their mean).

Industrial Production: this takes the form of an index, reporting the level of production over some period relative to that in a base period, in which the index is by construction 100. To aggregate from (for example) monthly to quarterly we should take the average of the monthly values. (The sum would give a nonsense result.) The same goes for price indices, and also for ratios of stocks to flows or vice versa (inventory to sales, debt to GDP, capacity utilization).

Money stock: this is typically reported as an end-of-period value, so in aggregating from monthly to quarterly we'd take the value from the final month of each quarter. In case a stock variable is reported as a start-of-period value, the aggregated version would be that of the first month of the quarter.
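The sketch below (with made-up numbers) shows the three schemes in miniature, aggregating eight quarterly values to two annual values by reshaping a vector into one column per year:

  matrix q = {100, 102, 104, 106, 108, 110, 112, 114}'
  matrix M = mshape(q, 4, 2)   # one column per year
  eval sumc(M)    # flow: annual value = sum of the quarters
  eval meanc(M)   # index: annual value = average of the quarters
  eval M[4,]      # end-of-period stock: last quarter of each year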
A central idea in temporal disaggregation is that the high frequency series must respect both the given low frequency data and the aggregation method. So, for example, whatever numbers we come up with for quarterly GDP, given an annual series as starting point, our numbers must sum to the annual total. If money stock is measured at the end of the period, then whatever numbers we come up with for monthly money stock, given quarterly data, the figure for the last month of the quarter must match that for the quarter as a whole. This is why temporal disaggregation is sometimes called "benchmarking": the given low frequency data constitute a benchmark which the constructed high frequency data must match, in a well defined sense that depends on the nature of the data.

Colloquially, we might describe temporal disaggregation as "interpolation", but strictly speaking interpolation applies only to stock variables. We have a known end-of-quarter value (say), which is also the value at the end of the last month of the quarter, and we're trying to figure out what the value might have been at the end of months 1 and 2. We're filling in the blanks, or interpolating. In the GDP case, however, the procedure is distribution rather than interpolation. We have a given annual total and we're trying to figure out how it should be distributed over the quarters. We're also doing distribution for variables taking the form of indices or ratios, except that in this case we're seeking plausible values whose mean equals the given low-frequency value.

While matching the low frequency benchmark is an important constraint, it obviously does not tie down the high frequency values. That is a job for either regression-based methods such as Chow-Lin, or non-regression methods such as Denton. Details are provided in section 9.7.

9.2 Notation and design

Some notation first: the two main ingredients in temporal disaggregation are a T × g matrix Y, holding the series to be disaggregated, and a matrix X with k columns and (sT + m) rows, to aid in the disaggregation.

The idea is that Y contains time series data sampled at some frequency f, while each column of X contains time series data at a higher frequency, sf. So for each observation Y_t we have s corresponding rows in X. The object is to produce a transformation of Y to frequency sf, with the help of X (whose columns are typically called "related series" or "indicators" in the temporal disaggregation literature), via either distribution or interpolation, depending on the nature of the data.

For most of this document we will assume that g = 1 or, in other words, that we are performing temporal disaggregation on a single low-frequency series; but tdisagg supports "batch processing" of several series, and we return to this point in section 9.9.
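In this notation, the benchmark constraints corresponding to the three aggregation schemes discussed above can be written compactly (a sketch for the g = 1 case, with y denoting the constructed series at frequency sf; the notation is ours, not the manual's):

  $\sum_{j=1}^{s} y_{s(t-1)+j} = Y_t$   (flows)
  $\frac{1}{s} \sum_{j=1}^{s} y_{s(t-1)+j} = Y_t$   (indices and ratios)
  $y_{st} = Y_t$   (end-of-period stocks)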
If the m in (sT + m) is greater than zero, that implies that there are some "extra" high-frequency observations available for extrapolation; see section 9.4 for details.

We need to say something more about what goes into X. Under the Denton methods this must be a single series, generally known as the "preliminary series".[2] For the Chow-Lin methods, X can hold a combination of deterministic terms (e.g. constant, trend) and stochastic series. Naturally, suitable candidates for the role of preliminary series or indicator will be variables that are correlated with Y and, in particular, might be expected to share short-run dynamics with Y. However, it is possible to carry out disaggregation using deterministic terms only; in the simplest case, with X containing nothing but a constant. Experts in the field tend to frown on this, with reason: in the absence of any genuine high-frequency information, disaggregation just amounts to a mechanical smoothing. But some people may have a use for such smoothing, and it's permitted by tdisagg.

[2] There's nothing to stop a user from constructing such a series using several primary series as input (by taking the first principal component, or some other means), but that possibility is beyond our scope here.

We should draw attention to a design decision in tdisagg: we have separated the specification of indicators in X from certain standard deterministic terms that might be wanted, namely a constant, linear trend or quadratic trend. If you want a disaggregation without stochastic indicators, you can omit (or set to null) the argument corresponding to X. In that case a constant (only) will be employed automatically, but for the Chow-Lin methods one can adjust the deterministic terms used via an option named det, described below. In other words, the content of X becomes implicit. See section 9.6 for more detail.

Here's an important point to note when X is given explicitly: although this matrix may contain extra observations at the end, we assume that Y and X are correctly aligned at the start. Take, for example, the annual to quarterly case: if the first observation in annual Y is for 1980, then the first observation in quarterly X must be for the first quarter of 1980. Ensuring this is the user's responsibility. We will have some more to say about this in the following section.

9.3 Overview of data handling

The tdisagg function has three basic arguments, representing Y, X and s respectively, plus several options (see below). The first two arguments can be given either in matrix form as such, or as "dataset objects", that is, a series for Y and a series or list of series for X. Or, as mentioned above, X can be omitted (left implicit). This gives rise to five cases; which is most convenient will depend on the user's workflow.

1. Both Y and X are matrices. In this case, the size and periodicity of the currently open dataset (if any) are irrelevant. If Y has T rows, X must of course have at least sT rows; if that condition is not satisfied an "Invalid argument" error will be flagged.

2. Y is a series or list and X a matrix. In this case we assume that the periodicity of the currently open dataset is the lower one, and T will be taken as equal to $nobs, the number of observations in the current sample range. Again, X must have at least sT rows.

3. Y is a matrix and X a series or list. We then assume that the periodicity of the currently open dataset is the higher one, so that $nobs defines sT + m. And Y is supposed to be at the lower frequency, so its number of rows gives T. We should then be able to find m as $nobs minus sT; if m < 0, an error is flagged.
4. Both Y and X are "dataset objects". We have two sub-cases here.

(a) If X is a series or an ordinary list of series, the periodicity of the currently open dataset is taken to be the higher one. The series or list containing Y should hold the appropriate entries every s elements. For example, if s = 4, Y_1 will be taken from the first observation, Y_2 from the fifth, Y_3 from the ninth, and so on. In practical terms, series of this sort are likely to be composed by repeating each element of a low-frequency variable s times.

(b) Alternatively, X could be a MIDAS list. The concept of a MIDAS list is fully explained in chapter 20 but, for example, in a quarterly dataset a MIDAS list would be a list of three series, for the third, second and first month (note the ordering). In this case the current periodicity is taken to be the lower one, and X will contain one column, corresponding to the high-frequency representation of the MIDAS list.

5. X is omitted. If Y is given as a matrix, it is taken to have T rows. Otherwise the interpretation is determined heuristically: if the Y series is recognized by gretl as composed of repeated low-frequency observations, or if a series result is requested, it is taken as having length sT; otherwise its length is taken to be T.

In the previous section we flagged the importance of correct alignment of X and Y at the start of the data; we're now in a position to say a little more about this. If either X or Y are given in matrix form, alignment is truly the user's responsibility. But if they are dataset objects, gretl can be more helpful. We automatically advance the start of the sample range to exclude any leading missing values, and retard the end of the sample ranges for X and Y to exclude trailing missing values (allowing for the possibility that X may extend beyond Y). In addition, we further advance the sample start if this is required to ensure that the X data begin in the first high-frequency sub-period (e.g. the first quarter of a year, or the first month of a quarter). But please note: while gretl automatically excludes leading or trailing missing values, intra-sample missing values will still provoke an error.

9.4 Extrapolation

As mentioned above, if X holds covariate data which extend beyond the range of the original series to be disaggregated, then extrapolation is supported. But this is inherently risky, and becomes riskier the longer the horizon over which it is attempted. In tdisagg, extrapolation is by default limited to one low-frequency period (s high-frequency periods) beyond the end of the original data. The user can adjust this behavior via the extmax member of the opts bundle passed to tdisagg, described in the next section.
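For instance, here is a minimal sketch of relaxing that default; the series names y_ann (annual) and x_q (quarterly indicator) are hypothetical, and extrapolation is allowed over up to eight quarters of extra indicator data:

  bundle opts = _(aggtype="sum", extmax=8)
  series yq = tdisagg(y_ann, x_q, 4, opts)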
9.5 Function signature

The signature of tdisagg is

  matrix tdisagg(Y0, [X], int s, [bundle opts], [bundle results])

where square brackets indicate optional arguments. Note that while the return value is a matrix, if Y0 contains a single column or series it can be assigned to a series, as in

  series ys = tdisagg(Y0, ...)

provided it's of the right length to match the current dataset, or the current sample range. Details on the arguments follow.

Y0: Y, as a matrix, series or list.

X (optional): X, as a matrix, series or list. This should not contain standard deterministic terms, since they are handled separately (see det under opts below). If this matrix is omitted, then disaggregation will be performed using deterministic terms only.

s (int): the temporal expansion factor; for example, 3 for quarterly to monthly, 4 for annual to quarterly, or 12 for annual to monthly. We do not support cases such as monthly to weekly or monthly to daily, where s is not a fixed integer value common to all observations; otherwise, "anything goes".

opts (bundle, optional): a bundle holding additional options. The recognized keys are, in alphabetical order:

- aggtype (string): specifies the type of temporal aggregation appropriate to the series in question. The value must be one of sum (each low-frequency value is a sum of s high-frequency values, the default); avg (each low-frequency value is the average of s high-frequency values); or last or first, indicating respectively that each low-frequency value is the last or the first of s high-frequency values.

- det (int): relevant only when one of the Chow-Lin methods is selected. This is a numeric code for the deterministic terms to be included in the regressions: 0 means none; 1, constant only; 2, constant and linear trend; 3, constant and quadratic trend. The default is 1.

- extmax (int): the maximum number of high-frequency periods over which extrapolation should be carried out, conditional on the availability of covariate data. A zero value means no extrapolation; a value of -1 means "as many periods as possible"; and a positive value limits extrapolation to the specified number of periods. See section 9.4 for a statement of the default value.

- method (string): selects the method of disaggregation (see the listing below). Note that the Chow-Lin methods employ an autoregression coefficient, ρ, which captures the persistence of the target series at the higher frequency and is used in GLS estimation of the parameters linking X to Y.

  chow-lin (the default) is modeled on the original method proposed by Chow and Lin. It uses a value of ρ computed as the transformation of a maximum-likelihood estimate of the low-frequency autocorrelation coefficient.

  chow-lin-mle is equivalent to the method called chow-lin-maxlog in the tempdisagg package for R: ρ is estimated by iterated GLS, using the log-likelihood as criterion, as recommended by Bournay and Laroque (1979). The BFGS algorithm is used internally.

  chow-lin-ssr is equivalent to the method called chow-lin-minrss-quilis in tempdisagg: ρ is estimated by iterated GLS, using the sum of squared GLS residuals as criterion. L-BFGS is used internally.

  fernandez is basically "Chow-Lin with ρ = 1". It is suitable if the target series has a unit root and is not cointegrated with the indicator series.

  denton-pfd is the proportional first differences variant of Denton, as modified by Cholette. See Di Fonzo and Marini (2012) for details.

  denton-afd is the additive first differences variant of Denton, again as modified by Cholette.

  In contrast to the Chow-Lin methods, neither Denton procedure involves regression.

- plot (int): if a non-zero value is given, a simple plot is displayed by way of a "sanity check" on the final series. See section 9.8 for details.

- rho (scalar): relevant only when one of the Chow-Lin methods is selected. If the method is chow-lin, then rho is treated as a fixed value for ρ, thus enabling the user to bypass the default estimation procedure altogether. If the method is chow-lin-mle or chow-lin-ssr, on the other hand, the supplied ρ value is used to initialize the numerical optimization algorithm.

- verbose (int): controls the verbosity of Chow-Lin or Fernández output. If 0 (the default), nothing is printed unless an error occurs; if 1, summary output from the relevant regression is shown; if 2, in addition, output from the optimizer for the iterated GLS procedure is shown, if applicable.

results (bundle, optional): if present, this argument must be a previously defined bundle. Upon successful completion of any of the methods other than the Denton ones, it contains details of the disaggregation under the following keys:

  method: the method employed
  rho: the value of ρ used
  lnl: log-likelihood (maximized by the chow-lin-mle method)
  SSR: sum of squared residuals (minimized by the chow-lin-ssr method)
  coeff: the GLS (or OLS) coefficients
  stderr: standard errors for the coefficients

If ρ is set to zero (either by specification of the user, or because the estimate turned out to be non-positive), then estimation of the coefficients is via OLS. In that case the lnl and SSR values are calculated using the OLS residuals, which will be on a different scale from the weighted residuals in GLS.
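Pulling the pieces together, here is a minimal sketch of a call that selects a method, supplies a starting ρ, and collects the estimation details (y_ann and x_q hypothetical, as before):

  bundle opts = _(method="chow-lin-mle", rho=0.5, det=2, verbose=1)
  bundle res   # must be defined before the call
  series yq = tdisagg(y_ann, x_q, 4, opts, res)
  printf "rho = %g, lnl = %g\n", res.rho, res.lnl
  eval res.coeff ~ res.stderr   # coefficients beside their standard errors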
9.6 Handling of deterministic terms

It may be helpful to set out clearly, in one place, how deterministic terms are handled by tdisagg.

If X is given explicitly: no deterministic term is added when the Denton method is used (since a single preliminary series is wanted), but a constant is added when one of the Chow-Lin methods is selected. The latter default can be overridden (i.e. the constant removed, or a trend added) by means of the det entry in the options bundle.

If X is omitted: by default a constant is used, for all methods. Again, for Chow-Lin this can be overridden by specifying a det value. If for some reason you wanted Denton with just a trend, you would have to supply X containing a trend.

9.7 Some technical details

In this section we provide some technical details on the methods used by tdisagg. We will refer to the version of Y converted to the high frequency sf as the "final series".

As regards the Cholette-modified Denton methods: for the proportional first difference variant we calculate the final series using the solution described by Di Fonzo and Marini (2012), specifically equation (4) on page 5; and for the additive variant we draw on Di Fonzo (2003), pages 3 and 5 in particular. Note that these procedures require the construction and inversion of a matrix of order (s + 1)T. If both s and T are large, it can therefore take some time, and be quite demanding of RAM.

As regards Chow-Lin, let ρ₀ indicate the rho value passed via the options bundle, if applicable. We then take these steps:

1. If ρ₀ > 0, set ρ = ρ₀ and go to step 6 if the method is chow-lin, or step 7 otherwise. But if ρ₀ < 0, set ρ₀ = 0.

2. Estimate, via OLS, a regression of Y on CX,[3] where C is the appropriate aggregation matrix. Let $\hat{\beta}_{OLS}$ equal the coefficients from this regression. If ρ₀ = 0 and the method is chow-lin, go to step 8.

[3] Strictly speaking, CX uses only the first sT rows of X if m > 0.

3. Calculate the low frequency first order autocorrelation of the OLS residuals, $\hat{\rho}_L$. If $\hat{\rho}_L \geq 10^{-6}$, go to step 4. Otherwise: if the method is chow-lin, set ρ = 0 and go to step 8; else set ρ = 0.5 and go to step 7.

4. Refine the positive estimate $\hat{\rho}_L$ via Maximum Likelihood estimation of the AR(1) specification, as described in Davidson and MacKinnon (2004).

5. If $\hat{\rho}_L < 0.999$, set ρ to the high-frequency counterpart of $\hat{\rho}_L$, using the approach given in Chow and Lin (1971). Otherwise set ρ = 0.999. If the method is chow-lin, go to step 6; otherwise go to step 7.

6. Perform GLS with the given value of ρ, store the coefficients as $\hat{\beta}_{GLS}$ and go to step 9.

7. Perform iterated GLS starting from the prior value of ρ, adjusting ρ with the goal of either maximizing the log-likelihood (method chow-lin-mle) or minimizing the sum of squared GLS residuals (chow-lin-ssr); set $\hat{\beta}_{GLS}$ to the final coefficient estimates and go to step 9.

8. Calculate the final series as

  $X \hat{\beta}_{OLS} + C'(CC')^{-1} \hat{u}_{OLS}$

where $\hat{u}_{OLS}$ indicates the OLS residuals, and stop.

9. Calculate the final series as

  $X \hat{\beta}_{GLS} + V C'(C V C')^{-1} \hat{u}_{GLS}$

where $\hat{u}_{GLS}$ indicates the GLS residuals and V is the estimated high-frequency covariance matrix.
A few notes on our Chow-Lin algorithm follow.

One might question the value of performing steps 2 to 5 when the method is one that calls for GLS iteration (chow-lin-mle or chow-lin-ssr), but our testing indicates that it can be helpful to have a reasonably good estimate of ρ in hand before embarking on these iterations.

Conversely, one might wonder why we bother with GLS iterations if we find $\hat{\rho}_L < 10^{-6}$. But this allows for the possibility (most likely associated with small sample size) that iteration will lead to ρ > 0, even when the estimate based on the initial OLS residuals is zero or negative.

Note that in all cases we are discarding an estimate of ρ < 0 (truncating to 0), which we take to be standard in this field. In our iterated GLS we achieve this by having the optimizer pick values x in (-∞, +∞), which are translated to (0, 1) via the logistic CDF, ρ = 1/(1 + exp(-x)). To be precise, that's the case with chow-lin-mle. But we find that the chow-lin-ssr method is liable to overestimate ρ and proceed to values arbitrarily close to 1, resulting in numerical problems. We therefore bound this method to x in (-20, 6.9), corresponding to ρ values between near-zero and approximately 0.999.[4]

[4] It may be worth noting that the tempdisagg package for R limits both methods to a maximum ρ of 0.999. We find, however, that the ML method can "look after itself" and does not require a fixed upper bound short of 1.0.

As for the Fernández method, this is quite straightforward. The place of the high-frequency covariance matrix V in Chow-Lin is taken by $(D'D)^{-1}$, where D is the approximate first-differencing matrix, with 1 on the diagonal and -1 on the first sub-diagonal. For efficient computation, however, we store neither D nor D'D as such, and do not perform any explicit inversion. The special structure of $(D'D)^{-1}$ makes it possible to produce the effect of pre-multiplication by this matrix with O(T²) floating-point operations. Estimation of ρ is not an issue, since it equals 1 by assumption.

9.8 The plot option

The semantics of this option may be enriched in future, but for now it's a simple boolean switch. The effect is to produce a time series plot of the final series along with the original low-frequency series, shown in "step" form. (If aggregation is by sum, the final series is multiplied by s, for comparability with the original.) If the disaggregation has been successful, these two series should track closely together, with the final series showing plausible short-run dynamics. An example is shown in Figure 9.1.

[Figure 9.1: Example output from the plot option, showing annual GNP ("original data") and the quarterly final series, 1955-1965, using quarterly industrial production as indicator; method chow-lin.]

If there are many observations, the two lines may appear virtually coincident. In that case one can see what's going on in more detail by exploiting the "Zoom" functionality of the plot, which is accessed via the right-click menu in the plot window.
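Requesting the check is a one-liner; a sketch with the same hypothetical series as before:

  # display the step-plot comparison alongside the result
  series yq = tdisagg(y_ann, x_q, 4, _(aggtype="sum", plot=1))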
9.9 Multiple low-frequency series

We now return to a point mentioned in section 9.2, namely that Y may be given as a T × g matrix with g > 1, or a list of g series. This means that a single call to tdisagg can be used to process several input series ("batch processing"), in which case the return value is a matrix with (sT + m) rows and g columns.

There are some restrictions. First, and most obviously, a single call to tdisagg implies a single selection of indicators or related series, X, and a single selection of options (aggregation type of the data, deterministic terms, disaggregation method, and so on). So this possibility will be relevant only if you have several series that want the same treatment. In addition, if g > 1, the plot and verbose options are ignored, and the results bundle is not filled; if you need those features you should supply a single series or vector in Y.

The advantage of batch processing lies in the spreading of fixed computational cost, leading to shorter execution time. However, the relative importance of the fixed cost differs substantially according to the disaggregation method. For the Chow-Lin methods the fixed cost is relatively small, and so little speed-up can be expected; but for the Denton methods it dominates, and in our testing you can process g > 1 series in little more time than it takes to process a single series. As they say, "Your mileage may vary", but if you have a large number of series to be disaggregated via one of the Denton methods, you may well find it much faster to use the batch facility of tdisagg.

9.10 Examples

Listing 9.1 shows an example of usage, with its output. The data are drawn from the St Louis Fed; we disaggregate quarterly GDP to monthly, with the help of industrial production and payroll employment, using the default Chow-Lin method. Several other example scripts are available from http://gretl.sourceforge.net/tdisagg/.

Listing 9.1: Example of tdisagg usage

Input:

  # Traditional Chow-Lin: y is a series with repetition and
  # X is a list of series. This corresponds to case 4(a) as
  # described in section 9.3 above.

  # ensure that no data are in place
  clear
  # open gretl's St Louis Fed database
  open fedstl.bin
  # import two monthly series
  data indpro payems
  # import quarterly GDP (values are repeated)
  data gdpc1
  # restrict sample to complete data
  smpl --no-missing
  # disaggregate GDP from quarterly to monthly, using industrial
  # production and payroll employment as indicators
  scalar s = 3
  list X = indpro payems
  series gdpm = tdisagg(gdpc1, X, s, _(verbose=1, aggtype="sum"))

Output:

  Aggregation type sum
  GLS estimates (chow-lin) T = 294
  Dependent variable: gdpc1

               coefficient    std. error    t-ratio    p-value
    const       312.394       263.372        1.186     0.2365
    indpro       10.9158        1.75785      6.210     1.83e-09
    payems        0.0242860     0.00171935  14.13      7.39e-35

  rho = 0.999, SSR = 515439, lnl = -1604.98

  Generated series gdpm (ID 4)

Chapter 10: Special functions in genr

10.1 Introduction

The genr command provides a flexible means of defining new variables. At the same time, the somewhat paradoxical situation is that the genr keyword is almost never visible in gretl scripts. For example, it is not really recommended to write a line such as genr b = 2.5, because there are the following alternatives:

- scalar b = 2.5, which also invokes the genr apparatus in gretl but provides explicit type information about the variable b, which is usually preferable. (Gretl's language hansl is statically typed, so b cannot switch from scalar to string or matrix, for example.)

- b = 2.5, leaving it to gretl to infer the admissible or most natural type for the new object, which would again be a scalar in this case.

- matrix b = {2.5}. This formulation is required if b is going to be expanded with additional rows or columns later on. Otherwise gretl's static typing would not allow b to be promoted from scalar to matrix, so it must be a matrix right from the start, even if it is of dimension 1 × 1 initially. (This definition could also be written as matrix b = 2.5, but the more explicit form is recommended.)

In addition to scalar or matrix, the other type keywords that can be used to substitute the generic genr term are those enumerated in chapter 11. In the case of an array, the concrete specification should be used: so one of matrices, strings, lists or bundles.[1]

[1] A recently added "advanced" datatype is an array of arrays, with the associated type specifier arrays.
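A quick sketch of the static-typing point (the commented line shows what would fail):

  scalar b = 2.5
  # string b = "text"   # error: b already exists as a scalar
  matrix m = {2.5}      # a 1 x 1 matrix, free to grow
  m = m ~ {3.5}         # now 1 x 2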
Therefore, there's only a handful of special cases where it is really necessary to use the genr keyword:

genr time: creates a time trend variable (1, 2, 3, ...) under the name time. Note that within an appropriately defined panel dataset this variable honors the panel structure and is a true time index. In a cross-sectional dataset the command will still work, and produces the same result as genr index (below), but of course no temporal meaning exists.

genr index: creates an observation variable named index, running from 1 to the sample size.

genr unitdum: in the context of panel data, creates a set of dummies for the panel groups or "units". These are named du_1, du_2, and so forth. Actually, this particular genr usage is not strictly necessary, because a list of group dummies can also be obtained as

  series gr = $unit
  list groupdums = dummify(gr, NA)

The NA argument to the dummify function has the effect of not skipping any unit as the reference group, thus producing the full set of dummies.

genr timedum: again for panel data, creates a set of dummies for the time periods, named dt_1, dt_2, etc. And again, a list-producing variant without genr exists, using the special accessor $obsminor, which indexes time in the panel context and can be used as a substitute for time from above:

  series tindex = $obsminor
  list timedums = dummify(tindex, NA)

genr markers: see section 4.5 for an explanation and example of this panel-related feature.

Finally, there also exists genr dummy, which produces a set of seasonal dummies. However, it is recommended to use the seasonals function instead, which can also return centered dummies. The rest of this chapter discusses other special function aspects.
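A minimal sketch of that recommendation; we assume here that the defaults of seasonals() yield the full set of dummies and that a non-zero second argument requests the centered variant (check the Function Reference for the exact signature):

  list S = seasonals()        # plain per-period dummies
  list Sc = seasonals(0, 1)   # centered dummies (assumed arguments)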
10.2 Cumulative densities and p-values

The two functions cdf and pvalue provide complementary means of examining values from 17 probability distributions (as of July 2021), among which the most important ones: standard normal, Student's t, χ², F, gamma and binomial. The syntax of these functions is set out in the Gretl Command Reference; here we expand on some subtleties.

The cumulative density function or CDF for a random variable is the integral of the variable's density from its lower limit (typically either minus infinity or 0) to any specified value x. The p-value (at least the one-tailed, right-hand p-value as returned by the pvalue function) is the complementary probability: the integral from x to the upper limit of the distribution, typically plus infinity.

In principle, therefore, there is no need for two distinct functions: given a CDF value p0 you could easily find the corresponding p-value as 1 - p0, or vice versa. In practice, with finite-precision computer arithmetic, the two functions are not redundant. This requires a little explanation. In gretl, as in most statistical programs, floating point numbers are represented as "doubles": double-precision values that typically have a storage size of eight bytes or 64 bits. Since there are only so many bits available, only so many floating-point numbers can be represented: doubles do not model the real line. Typically doubles can represent numbers over the range (roughly) ±1.7977e308, but only to about 15 digits of precision.

Suppose you're interested in the left tail of the χ² distribution with 50 degrees of freedom: you'd like to know the CDF value for x = 0.9. Take a look at the following interactive session:

  ? scalar p1 = cdf(X, 50, 0.9)
  Generated scalar p1 = 8.94977e-35
  ? scalar p2 = pvalue(X, 50, 0.9)
  Generated scalar p2 = 1
  ? scalar test = 1 - p2
  Generated scalar test = 0

The cdf function has produced an accurate value, but the pvalue function gives an answer of 1, from which it is not possible to retrieve the answer to the CDF question. This may seem surprising at first, but consider: if the value of p1 above is correct, then the correct value for p2 is 1 - 8.94977e-35. But there's no way that value can be represented as a double: that would require over 30 digits of precision.

Of course, this is an extreme example. If the x in question is not too far off into one or other tail of the distribution, the cdf and pvalue functions will in fact produce complementary answers, as shown below:

  ? scalar p1 = cdf(X, 50, 30)
  Generated scalar p1 = 0.0111648
  ? scalar p2 = pvalue(X, 50, 30)
  Generated scalar p2 = 0.988835
  ? scalar test = 1 - p2
  Generated scalar test = 0.0111648
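The mirror case arises in the far right tail, where pvalue is the function that retains precision; a small sketch (the cutoff of 200 is arbitrary, chosen so that the true CDF lies closer to 1 than the nearest representable double):

  scalar q1 = pvalue(X, 50, 200)   # tiny, but representable
  scalar q2 = 1 - cdf(X, 50, 200)  # cdf rounds to 1, so this comes out 0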
Gretl data types 111 Introduction Gretl offers the following data types scalar holds a single numerical value series holds n numerical values where n is the number of observations in the current dataset matrix holds a rectangular array of numerical values of any two dimensions list holds the ID numbers of a set of series string holds an array of characters bundle holds zero or more objects of various types array holds zero or more objects of a given type The numerical values mentioned above are all doubleprecision floating point numbers In this chapter we give a rundown of the basic characteristics of each of these types and also explain their life cycle creation modification and destruction The list and matrix types whose uses are relatively complex are discussed at greater length in chapters 15 and 17 respectively 112 Series We begin with the series type which is the oldest and in a sense the most basic type in gretl When you open a data file in the gretl GUI what you see in the main window are the ID numbers names and descriptions if available of the series read from the file All the series existing at any point in a gretl session are of the same length although some may have missing values The variables that can be added via the items under the Add menu in the main window logs squares and so on are also series For a gretl session to contain any series a common series length must be established This is usually achieved by opening a data file or importing a series from a database in which case the length is set by the first import But one can also use the nulldata command which takes as it single argument the desired length a positive integer Each series has these basic attributes an ID number a name and of course n numerical values A series may also have a description which is shown in the main window and is also accessible via the labels command a display name for use in graphs a record of the compaction method used in reducing the variables frequency for timeseries data only and flags marking the variable as discrete andor as a numeric encoding of a qualitative characteristic These attributes can be edited in the GUI by choosing Edit Attributes either under the Variable menu or via rightclick or by means of the setinfo command In the context of most commands you are able to reference series by name or by ID number as you wish The main exception is the definition or modification of variables via a formula here you must use names since ID numbers would get confused with numerical constants Note that series ID numbers are always consecutive and the ID number for a given series will change if you delete a lowernumbered series In some contexts where gretl is liable to get confused by 85 Chapter 11 Gretl data types 86 such changes deletion of lownumbered series is disallowed Discrete series It is possible to mark variables of the series type as discrete The meaning and uses of this facility are explained in chapter 12 Stringvalued series It is generally expected that series in gretl will be properly numeric on a ratio or at least an ordinal scale or the sort of numerical indicator variables 01 dummies that are traditional in econometrics However stringvalued series are also supportedsee chapter 16 for details 113 Scalars The scalar type is relatively simple just a convenient named holder for a single numerical value Scalars have none of the additional attributes pertaining to series do not have ID numbers and must be referenced by name A common use of scalar variables is to record 
information made available by gretl commands for further processing as in scalar s2 sigmaˆ2 to record the square of the standard error of the regression following an estimation command such as ols You can define and work with scalars in gretl without having any dataset in place In the gretl GUI scalar variables can be inspected and their values edited via the Icon view see the View menu in the main window 114 Matrices Matrices in gretl work much as in other mathematical software eg MATLAB Octave Like scalars they have no ID numbers and must be referenced by name and they can be used without any dataset in place Matrix indexing is 1based the topleft element of matrix A is A11 Matrices are discussed at length in chapter 17 advanced users of gretl will want to study this chapter in detail Matrices have two optional attribute beyond their numerical content they may have column andor row names attached these are displayed when the matrix is printed See the cnameset and rnameset functions for details In the gretl GUI matrices can be inspected analysed and edited via the Icon view item under the View menu in the main window each currently defined matrix is represented by an icon 115 Lists As with matrices lists merit an explication of their own see chapter 15 Briefly named lists can and should be used to make command scripts less verbose and repetitious and more easily modifiable Since lists are in fact lists of series ID numbers they can be used only when a dataset is in place In the gretl GUI named lists can be inspected and edited under the Data menu in the main window via the item Define or edit list 116 Strings String variables may be used for labeling or for constructing commands They are discussed in chapter 15 They must be referenced by name they can be defined in the absence of a dataset Chapter 11 Gretl data types 87 Such variables can be created and modified via the commandline in the gretl console or via script there is no means of editing them via the gretl GUI 117 Bundles A bundle is a container or wrapper for various sorts of objectsprimarily scalars matrices strings arrays and bundles Yes a bundle can contain other bundles Secondarily series and lists can be placed in bundles but this is subject to important qualifications noted below A bundle takes the form of a hash table or associative array each item placed in the bundle is associated with a key which can used to retrieve it subsequently We begin by explaining the mechanics of bundles then offer some thoughts on what they are good for There are several ways of creating a bundle Here are the first two Just declare it as in bundle foo or define an empty bundle using the defbundle function without any arguments bundle foo defbundle These formulations are basically equivalent in that they both create an empty bundle The differ ence is that the second variant may be reusedif a bundle named foo already exists the effect is to empty itwhile the first may only be used once in a given gretl session it is an error to attempt to declare a variable that already exists To create a bundle and populate it with some members in one go you can use the defbundle function with some arguments For example bundle foo defbundlex 13 mat I3 str some string Here the arguments come in pairs key followed by the object to be associated with the key with all terms commaseparated However you may prefer to use one or other of the alternative idioms introduced in gretl 2021a The first of these looks like this bundle foo x 13 mat I3 str some string Its more 
streamlined than defbundle but not quite so flexible You dont have to quote the keys but that also means that you cant give the name of a key as a string variable its always taken as a string literal Yet more streamlined but also less flexible is this variant bundle foo x mat str which works if and only if there are existing objects x mat and str in scope and you want to add them to the bundle under keys equal to their own names For more on the defbundle function see the Gretl Command Reference or the Function Reference under Help in the GUI program To add an object to a bundle you assign to a compound lefthand value the name of the bundle followed by the key Two forms of syntax are acceptable in this context The recommended syntax for most uses is bundlenamekey that is the name of the bundle followed by a dot then the key Both the bundle name and the key must be valid gretl identifiers1 For example the statement foomatrix1 m 1As a reminder 31 characters maximum starting with a letter and composed of just letters numbers or underscore Chapter 11 Gretl data types 88 adds an object called m presumably a matrix to bundle foo under the key matrix1 If you wish to make it explicit that m is supposed to be a matrix you can use the form matrix foomatrix1 m Alternatively a bundle key may be given as a string enclosed in square brackets as in foomatrix1 m This syntax offers greater flexibility in that the key string does not have to be a valid identifier for example it can include spaces In addition when using the square bracket syntax it is possible to use a string variable to define or access the key in question For example string s matrix 1 foos m matrix is added under key matrix 1 To get an item out of a bundle again use the name of the bundle followed by the key as in matrix bm foomatrix1 or using the alternative notation matrix bm foomatrix1 or using a string variable matrix bm foos Note that the key identifying an object within a given bundle is necessarily unique If you reuse an existing key in a new assignment the effect is to replace the object which was previously stored under the given key It is not required that the type of the replacement object is the same as that of the original Also note that when you add an object to a bundle what in fact happens is that the bundle acquires a copy of the object The external object retains its own identity and is unaffected if the bundled object is replaced by another Consider the following script fragment bundle foo matrix m I3 foomykey m scalar x 20 foomykey x After the above commands are completed bundle foo does not contain a matrix under mykey but the original matrix m is still in good health To delete an object from a bundle use the delete command with the bundlekey combination as in delete foomykey This destroys the object associated with mykey and removes the key from the hash table To determine whether a bundle contains an object associated with a given key use the inbundle function This takes two arguments the name of the bundle and the key string The value returned by this function is an integer which codes for the type of the object 0 for no match 1 for scalar 2 for series 3 for matrix 4 for string 5 for bundle and 6 for array The function typestr may be used to get the string corresponding to this code For example scalar type inbundlefoo x if type 0 Chapter 11 Gretl data types 89 print x no such object else printf x is of type s typestrtype endif Besides adding accessing replacing and deleting individual items the other operations that 
are supported for bundles are union printing and deletion As regards union if bundles b1 and b2 are defined you can say bundle b3 b1 b2 to create a new bundle that is the union of the two others The algorithm is create a new bundle that is a copy of b1 then add any items from b2 whose keys are not already present in the new bundle This means that bundle union is not commutative if the bundles have one or more key strings in common If b is a bundle and you say print b you get a listing of the bundles keys along with the types of the corresponding objects as in print b bundle b x scalar mat matrix inside bundle Note that in the example above the bundle b nests a bundle named inside If you want to see whats inside nested bundles with a single command you can append the tree option to the print command Series and lists as bundle members It is possible to add both series and lists to a bundle as in open data410 list X const CATHOL INCOME bundle b by ENROLL bX X eval by eval bX However it is important to bear in mind the following limitations A series as such is inherently a member of a dataset and a bundle can survive the replace ment or destruction of the dataset from which a series was added It may then be impossible or even if possible meaningless to extract a bundled series as a series However its always possible to retrieve the values of the series in the form of a matrix column vector In gretl commands that call for series arguments you cannot give a bundled series without first extracting it In the little example above the series ENROLL was added to bundle b under the key y but by is not itself a series member of a dataset its just an anonymous array of values It therefore cannot be given as say the dependent variable in a call to gretls ols command A gretl list is just an array of ID numbers of series in a given dataset a macro if you like So as with series theres no guarantee that a bundled list can be extracted as a list though it can always be extracted as a row vector Chapter 11 Gretl data types 90 The points made above are illustrated in Listing 111 In Case 1 we open a little dataset with just 14 crosssectional observations and put a series into a bundle We then open a timeseries dataset with 64 observations while preserving the bundle and extract the bundled series This instance is legal since the stored series does not overflow the length of the new dataset it gets written into the first 14 observations but its probably not meaningful Its up to the user to decide if such operations make sense In Case 2 a similar sequence of statements leads to an error trapped by catch because this time the stored series will not fit into the new dataset We can nonetheless grab the data as a vector In Case 3 we put a list of three series into a bundle This does not put any actual data values into the bundle just the ID numbers of the specified series which happen to be 4 5 and 6 We then switch to a dataset that contains just 4 series so the list cannot be extracted as such IDs 5 and 6 are out of bounds Once again however we can retrieve the ID numbers in matrix form if we want In some cases putting a gretl list as such into a bundle may be appropriate but in others you are better off adding the content of the list in matrix form as in open data410 list X const CATHOL INCOME bundle b matrix bX X In this case were adding a matrix with three columns and as many rows as there are in the dataset we have the actual data not just a reference to the data that might go bad See chapter 17 for more on this 
If b is a bundle and you say print b, you get a listing of the bundle's keys along with the types of the corresponding objects, as in

    ? print b
    bundle b:
     x (scalar)
     mat (matrix)
     inside (bundle)

Note that in the example above the bundle b nests a bundle named inside. If you want to see what's inside nested bundles with a single command, you can append the --tree option to the print command.

Series and lists as bundle members

It is possible to add both series and lists to a bundle, as in

    open data4-10
    list X = const CATHOL INCOME
    bundle b
    series b.y = ENROLL
    list b.X = X
    eval b.y
    eval b.X

However, it is important to bear in mind the following limitations.

- A series, as such, is inherently a member of a dataset, and a bundle can survive the replacement or destruction of the dataset from which a series was added. It may then be impossible (or, even if possible, meaningless) to extract a bundled series as a series. However, it is always possible to retrieve the values of the series in the form of a matrix (column vector).

- In gretl commands that call for series arguments, you cannot give a bundled series without first extracting it. In the little example above the series ENROLL was added to bundle b under the key y, but b.y is not itself a series (a member of a dataset); it is just an anonymous array of values. It therefore cannot be given as, say, the dependent variable in a call to gretl's ols command.

- A gretl list is just an array of ID numbers of series in a given dataset (a "macro", if you like). So, as with series, there is no guarantee that a bundled list can be extracted as a list, though it can always be extracted as a row vector.

The points made above are illustrated in Listing 11.1. In Case 1 we open a little dataset with just 14 cross-sectional observations and put a series into a bundle. We then open a time-series dataset with 64 observations (while preserving the bundle) and extract the bundled series. This instance is legal, since the stored series does not overflow the length of the new dataset (it gets written into the first 14 observations), but it is probably not meaningful; it is up to the user to decide if such operations make sense. In Case 2 a similar sequence of statements leads to an error, trapped by catch, because this time the stored series will not fit into the new dataset. We can nonetheless grab the data as a vector. In Case 3 we put a list of three series into a bundle. This does not put any actual data values into the bundle, just the ID numbers of the specified series, which happen to be 4, 5 and 6. We then switch to a dataset that contains just 4 series, so the list cannot be extracted as such (IDs 5 and 6 are out of bounds). Once again, however, we can retrieve the ID numbers in matrix form if we want.

Listing 11.1: Series and lists in bundles

    # Case 1: store and retrieve series: OK
    open data4-1
    bundle b
    series b.x = sqft
    open data9-7 --preserve
    series x = b.x
    print x --byobs

    # Case 2: store and retrieve series: gives an error,
    # but the data can be retrieved in matrix form
    open data9-7
    bundle b
    series b.x = QNC
    open data4-1 --preserve
    catch series x = b.x # wrong, won't fit!
    if $error
        matrix mx = b.x
        print mx
    else
        print x
    endif

    # Case 3: store and retrieve list: gives an error, but the
    # ID numbers in the list can be retrieved as a row vector
    open data9-7
    list L = PRIME UNEMP STOCK
    bundle b
    list b.L = L
    open data4-1 --preserve
    catch list L = b.L
    if $error
        matrix mL = b.L
        print mL # prints "4 5 6"
    endif

In some cases putting a gretl list as such into a bundle may be appropriate, but in others you are better off adding the content of the list in matrix form, as in

    open data4-10
    list X = const CATHOL INCOME
    bundle b
    matrix b.X = {X}

In this case we are adding a matrix with three columns (and as many rows as there are in the dataset): we have the actual data, not just a reference to the data that might "go bad". See chapter 17 for more on this.

What are bundles good for?

Bundles are unlikely to be of interest in the context of standalone gretl scripts, but they can be very useful in the context of complex function packages, where a good deal of information has to be passed around between the component functions (see Cottrell and Lucchetti, 2016). Instead of using a lengthy list of individual arguments, function A can bundle up the required data and pass it to functions B and C, where relevant information can be extracted via a mnemonic key.

In this context bundles should be passed in "pointer" form (see chapter 14), as illustrated in the following trivial example, where a bundle is created at one level then filled out by a separate function.

    # modification of bundle pointer by user function

    function void fill_out_bundle (bundle *b)
        b.mat = I(3)
        b.str = "foo"
        b.x = 32
    end function

    bundle my_bundle
    fill_out_bundle(&my_bundle)

The bundle type can also be used to advantage as the return value from a packaged function, in cases where a package writer wants to give the user the option of accessing various results. In the gretl GUI, function packages that return a bundle are treated specially: the output window that displays the printed results acquires a menu showing the bundled items (their names and types), from which the user can save items of interest. For example, a function package that estimates a model might return a bundle containing a vector of parameter estimates, a residual series and a covariance matrix for the parameter estimates, among other possibilities.

As a refinement to support the use of bundles as a function return type, the setnote function can be used to add a brief explanatory note to a bundled item; such notes will then be shown in the GUI menu. This function takes three arguments: the name of a bundle, a key string, and the note. For example,

    setnote(b, "vcv", "covariance matrix")

After this, the object under the key vcv in bundle b will be shown as "covariance matrix" in a GUI menu.
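To make the return-value pattern concrete, here is a hypothetical sketch of a small estimation function in the spirit just described. The function name ols_results and the bundle keys are our own invention, not part of any existing package; the accessors $coeff, $vcv and $uhat, and the setnote function, are as documented above.

    function bundle ols_results (series y, list X)
        bundle ret
        ols y X --quiet
        ret.coeff = $coeff
        ret.vcv = $vcv
        series ret.uhat = $uhat
        setnote(ret, "vcv", "covariance matrix")
        return ret
    end function

    open data4-1
    list X = const sqft
    bundle r = ols_results(price, X)
    eval r.coeff

If such a function were packaged, the GUI menu would offer the items coeff, vcv and uhat for saving, with the explanatory note shown for vcv.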
11.8 Arrays

The gretl array type is intended for scripting use. Arrays have no GUI representation and they are unlikely to acquire one. (It is, however, possible to save arrays "invisibly" in the context of a GUI session, by virtue of the fact that they can be packed into bundles, as described below, and bundles can be saved as part of a session.)

A gretl array is, as you might expect, a container which can hold zero or more objects of a certain type, indexed by consecutive integers starting at 1. It is one-dimensional. This type is implemented by a quite generic "back-end". The types of object that can be put into arrays are strings, matrices, lists, bundles and arrays. (It was not possible to nest arrays prior to version 2019d of gretl.) Of gretl's "primary" types, then, neither scalars nor series are supported by the array mechanism. There would be little point in supporting arrays of scalars as such, since the matrix type already plays that role, and more flexibly. As for series, they have a special status as elements of a dataset (which is in a sense an "array of series" already), and in addition we have the list type, which already functions as a sort of array for subsets of the series in a dataset.

Creating an array

An array can be brought into existence in any of three ways: bare declaration, or using one of the functions array() or defarray(). In each case one of the specific type-words strings, matrices, lists, bundles or arrays must be used. Here are some examples:

    # declare an empty array of strings
    strings S
    # make an empty array of matrices
    matrices M = array(0)
    # make an array with space for four bundles
    bundles B = array(4)
    # make an array with three specified strings
    strings P = defarray("foo", "bar", "baz")

The bare declaration form and the function form with array(0) have the same effect of creating an empty array, but the second can be used in contexts where bare declaration is not allowed (and it can also be used to destroy the content of an existing array and reduce it to size zero). The array() function expects a non-negative integer argument and can be used to create an array of pre-given size; in this case the elements are initialized appropriately, as empty strings, empty matrices, empty lists, empty bundles or empty arrays. The defarray() function takes a variable number of arguments (one or more), each of which may be the name of a variable of the appropriate type or an expression which evaluates to an object of the appropriate type.

Setting and getting elements

There are two ways to set the value of an array element: you can set a particular element using the array index, or you can append an element using the += operator:

    # first case
    strings S = array(3)
    S[2] = "string the second"
    # the alternative
    matrices M = array(0)
    M += mnormal(T, k)

In the first method the index must (of course) be within bounds; that is, greater than zero and no greater than the current length of the array. When the second method is used, it automatically extends the length of the array by 1.

To get hold of an element, the array index must be used:

    # for S an array of strings
    string s = S[5]
    # for M an array of matrices
    printf "%12.5g\n", M[1]

Removing elements

There is a counterpart to the += operator mentioned above: -= can be used to remove one or more elements, specified by content, from an array of strings. Note that -= works on all matching elements, so after the following statements

    strings S = defarray("a", "a", "b", "a")
    S -= "a"

S becomes a one-element array, holding only the original third element. More generally, a negative index can be used to remove a specified element from an array of any type, as in

    strings S = defarray("a", "a", "b", "a")
    S = S[-1]

where only the first element is removed. See chapter 17 for more on the semantics of negative indices.

Operations on whole arrays

Three operators are applicable to whole arrays, but only one to arrays of arbitrary type, the other two being restricted to arrays of strings. The generally available operation is appending. You can do, for example, for M1 and M2 both arrays of matrices,

    matrices BigM = M1 + M2

or, if you wish to augment M1,

    M1 += M2

In each case the result is an array of matrices whose length is the sum of the lengths of M1 and M2; and similarly for the other supported types.

The operators specific to strings are union, via ||, and intersection, via &&. Given the following code, for S1 and S2 both arrays of strings,

    strings Su = S1 || S2
    strings Si = S1 && S2

the array Su will contain all the strings in S1 plus any in S2 that are not in S1, while Si will contain all and only the strings that appear in both S1 and S2.
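Here is a minimal sketch of the set operations, with small literal arrays so the expected results can be stated in the comments:

    strings S1 = defarray("a", "b", "c")
    strings S2 = defarray("b", "c", "d")
    strings Su = S1 || S2
    strings Si = S1 && S2
    print Su # union: "a", "b", "c", "d"
    print Si # intersection: "b", "c"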
Arrays as function arguments

One can write hansl functions that take as arguments any of the array types, and it is possible to pass arrays as function arguments in "pointerized" form. In addition, hansl functions may return any of the array types. Here is a trivial example for strings:

    function void printstrings (strings *S)
        loop i=1..nelem(S)
            printf "element %d: '%s'\n", i, S[i]
        endloop
    end function

    function strings mkstrs (int n)
        strings S = array(n)
        loop i=1..n
            S[i] = sprintf("member %d", i)
        endloop
        return S
    end function

    strings Foo = mkstrs(5)
    print Foo
    printstrings(&Foo)

A couple of points are worth noting here. First, the nelem() function works to give the number of elements in any of the "container" types (lists, arrays, bundles, matrices). Second, if you do print Foo for Foo an array, you will see something like this:

    ? print Foo
    Array of strings, length 5

Nesting arrays

While gretl's array structure is in itself one-dimensional, you can add extra dimensions by nesting. For example, the code below creates an array holding n arrays of m bundles:

    arrays BB = array(n)
    loop i=1..n
        bundles BB[i] = array(m)
    endloop

The syntax for setting or accessing any of the n x m bundles, or their members, is then on the following pattern:

    BB[i][j].m = I(3)
    eval BB[i][j]
    eval BB[i][j].m

where the respective array subscripts are each put into square brackets.

The elements of an array of arrays must obviously all be arrays, but it is not required that they have a common content-type. For example, the following code creates an array holding an array of matrices plus an array of strings:

    arrays AA = array(2)
    matrices AA[1] = array(3)
    strings AA[2] = array(3)

Arrays and bundles

As mentioned, the bundle type is supported by the array mechanism. In addition, arrays of whatever type can be put into bundles:

    matrices M = array(8)
    # set values of M[i] here...
    bundle b
    b.M = M

The mutual "packability" of bundles and arrays means that it is possible to go quite far down the rabbit-hole; users are advised not to get carried away.
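A minimal sketch of the round trip, packing an array of matrices into a bundle and getting it back out:

    matrices M = array(2)
    M[1] = I(2)
    M[2] = ones(2,2)
    bundle b
    b.M = M
    # retrieve the array from the bundle under its key
    matrices M2 = b.M
    eval M2[2]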
11.9 The life cycle of gretl objects

Creation

The most basic way to create a new variable of any type is by declaration, where one states the type followed by the name of the variable to create, as in

    scalar x
    series y
    matrix A

and so forth. In that case the object in question is given a default initialization, as follows: a new scalar has value NA (missing); a new series is filled with NAs; a new matrix is empty (zero rows and columns); a new string is empty; a new list has no members; and new bundles and new arrays are empty.

Declaration can be supplemented by a definite initialization, as in

    scalar x = pi
    series y = log(x)
    matrix A = zeros(10,4)

The type of a new variable can be left implicit, as in

    x = y/100
    z = 3.5

Here the type of x will be determined automatically, depending on the context. If y is a scalar, a series or a matrix, x will inherit y's type (otherwise an error will be generated, since division is applicable to these types only). The new variable z will naturally be of scalar type.

In general, however, we recommend that you state the type of a new variable explicitly. This makes the intent clearer to a reader of the script and also guards against errors that might otherwise be difficult to understand (i.e. a certain variable turns out to be of the wrong type for some subsequent calculation, but you don't notice at first because you didn't say what type you wanted). Exceptions to this rule might reasonably be granted for clear and simple cases where there is little possibility of confusion.

Modification

Typically, the values of variables of all types are modified by assignment, using the = operator with the name of the variable on the left and a suitable value or formula on the right:

    z = normal()
    x = 100 * log(y/y(-1))
    M = qform(a, X)

By a "suitable" value we mean one that is conformable for the type in question. A gretl variable acquires its type when it is first created and this cannot be changed via assignment; for example, if you have a matrix A and later want a string A, you will have to delete the matrix first.

One point to watch out for in gretl scripting is type conflicts having to do with the names of series brought in from a data file. For example, in setting up a command loop (see chapter 13) it is very common to call the loop index i. Now a loop index is a scalar (typically incremented each time round the loop). If you open a data file that happens to contain a series named i, you will get a type error ("Types not conformable for operation") when you try to use i as a loop index.

Although the type of an existing variable cannot be changed on the fly, gretl nonetheless tries to be as "understanding" as possible. For example, if x is an existing series and you say

    x = 100

gretl will give the series a constant value of 100, rather than complaining that you are trying to assign a scalar to a series. This issue is particularly relevant for the matrix type; see chapter 17 for details.

Besides using the regular assignment operator, you also have the option of using an "inflected" equals sign, as in the C programming language. This is shorthand for the case where the new value of the variable is a function of the old value. For example,

    x += 100 # in longhand: x = x + 100
    x *= 100 # in longhand: x = x * 100

For scalar variables, you can use a more condensed shorthand for simple increment or decrement by 1, namely trailing ++ or -- respectively:

    x = 100
    x-- # x now equals 99
    x++ # x now equals 100

In the case of objects holding more than one value (series, matrices and bundles), you can modify particular values within the object using an expression within square brackets to identify the elements to access. We have discussed this above for the bundle type, and chapter 17 goes into details for matrices. As for series, there are two ways to specify particular values for modification: you can use a simple 1-based index, or, if the dataset is a time series or panel (or if it has marker strings that identify the observations), you can use an appropriate observation string. Such strings are displayed by gretl when you print data with the --byobs flag. Examples:

    x[13] = 100      # simple index: the 13th observation
    x[1995:4] = 100  # date: quarterly time series
    x[2003:08] = 100 # date: monthly time series
    x["AZ"] = 100    # the observation with marker string "AZ"
    x[3:15] = 100    # panel: the 15th observation for the 3rd unit

Note that with quarterly or monthly time series there is no ambiguity between a simple index number and a date, since dates always contain a colon. With annual time-series data, however, such ambiguity exists, and it is resolved by the rule that a number in brackets is always read as a simple index: x[1905] means the nineteen-hundred-and-fifth observation, not the observation for the year 1905. You can specify a year by quotation, as in x["1905"].
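To see the annual-data rule in action, here is a minimal sketch; the use of nulldata plus setobs to mock up an annual dataset starting in 1900 is our own device for illustration.

    nulldata 120
    setobs 1 1900   # annual data, starting in 1900
    series x = 0
    x["1905"] = 1   # the year 1905, i.e. the 6th observation
    x[10] = 2       # a plain index: the 10th observation, 1909
    print x --byobs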
Destruction

Objects of the types discussed above, with the important exception of named lists, are all destroyed using the delete command:

    delete objectname

Lists are an exception for this reason: in the context of gretl commands, a named list expands to the ID numbers of the member series, so if you say

    delete L

for L a list, the effect is to delete all the series in L; the list itself is not destroyed, but ends up empty. To delete the list itself (without deleting the member series), you must invert the command and use the list keyword:

    list L delete

Note that the delete command cannot be used within a loop construct (see chapter 13).
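The asymmetry just described is easy to check. The following sketch uses the data4-1 practice file (with series price and sqft); treat it as illustrative only.

    open data4-1
    list L = price sqft
    list L delete      # removes the list; its member series survive
    series chk = price # price is still available
    delete chk         # an ordinary object is removed outright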
Chapter 12: Discrete variables

When a variable can take only a finite, typically small, number of values, then it is said to be discrete. In gretl, variables of the series type (only) can be marked as discrete. (When we speak of "variables" below, this should be understood as referring to series.) Some gretl commands act in a slightly different way when applied to discrete variables; moreover, gretl provides a few commands that only apply to discrete variables. Specifically, the dummify and xtab commands (see below) are available only for discrete variables, while the freq (frequency distribution) command produces different output for discrete variables.

12.1 Declaring variables as discrete

Gretl uses a simple heuristic to judge whether a given variable should be treated as discrete, but you also have the option of explicitly marking a variable as discrete, in which case the heuristic check is bypassed.

The heuristic is as follows. First, are all the values of the variable "reasonably round", where this is taken to mean that they are all integer multiples of 0.25? If this criterion is met, we then ask whether the variable takes on a "fairly small" set of distinct values, where "fairly small" is defined as less than or equal to 8. If both conditions are satisfied, the variable is automatically considered discrete.

To mark a variable as discrete you have two options.

1. From the graphical interface, select Variable > Edit Attributes from the menu. A dialog box will appear and, if the variable seems suitable, you will see a tick box labeled "Treat this variable as discrete". This dialog box can also be invoked via the context menu (right-click on a variable) or by pressing the F2 key.

2. From the command-line interface, via the discrete command. The command takes one or more arguments, which can be either variables or lists of variables. For example:

    list xlist = x1 x2 x3
    discrete z1 xlist z2

This syntax makes it possible to declare many variables as discrete at once, which cannot presently be done via the graphical interface. The switch --reverse reverses the declaration of a variable as discrete, or in other words marks it as continuous. For example:

    discrete foo
    # now foo is discrete
    discrete foo --reverse
    # now foo is continuous

The command-line variant is more powerful, in that you can mark a variable as discrete even if it does not seem to be suitable for this treatment.

Note that marking a variable as discrete does not affect its content. It is the user's responsibility to make sure that marking a variable as discrete is a sensible thing to do. Note that if you want to recode a continuous variable into classes, you can use gretl's arithmetical functionality, as in the following example:

    nulldata 100
    # generate a series with mean 2 and variance 1
    series x = normal() + 2
    # split into 4 classes
    series z = (x>0) + (x>2) + (x>4)
    # now declare z as discrete
    discrete z

Once a variable is marked as discrete, this setting is remembered when you save the data file.

12.2 Commands for discrete variables

The dummify command

The dummify command takes as argument a series x and creates dummy variables for each distinct value present in x, which must have already been declared as discrete. Example:

    open greene22_2
    discrete Z5 # mark Z5 as discrete
    dummify Z5

The effect of the above command is to generate 5 new dummy variables, labeled DZ5_1 through DZ5_5, which correspond to the different values in Z5. Hence, the variable DZ5_4 is 1 if Z5 equals 4 and 0 otherwise. This functionality is also available through the graphical interface, by selecting the menu item Add > Dummies for selected discrete variables.

The dummify command can also be used with the following syntax:

    list dlist = dummify(x)

This not only creates the dummy variables, but also a named list (see section 15.1) that can be used afterwards. The following example computes summary statistics for the variable Y for each value of Z5:

    open greene22_2
    discrete Z5 # mark Z5 as discrete
    list foo = dummify(Z5)
    loop foreach i foo
        smpl $i --restrict --replace
        summary Y
    endloop
    smpl --full

Since dummify generates a list, it can be used directly in commands that call for a list as input, such as ols. For example:

    open greene22_2
    discrete Z5 # mark Z5 as discrete
    ols Y 0 dummify(Z5)

The freq command

The freq command displays absolute and relative frequencies for a given variable. The way frequencies are counted depends on whether the variable is continuous or discrete. This command is also available via the graphical interface, by selecting the Variable > Frequency distribution menu entry.

For discrete variables, frequencies are counted for each distinct value that the variable takes. For continuous variables, values are grouped into "bins" and then the frequencies are counted for each bin. The number of bins, by default, is computed as a function of the number of valid observations in the currently selected sample, via the rule shown in Table 12.1. However, when the command is invoked through the menu item Variable > Frequency plot, this default can be overridden by the user.

Table 12.1: Number of bins for various sample sizes

    Observations       Bins
    8 <= n < 16        5
    16 <= n < 50       7
    50 <= n <= 850     ⌈√n⌉
    n > 850            29

For example, the following code

    open greene19_1
    freq TUCE
    discrete TUCE # mark TUCE as discrete
    freq TUCE

yields

    Read datafile /usr/local/share/gretl/data/greene/greene19_1.gdt
    periodicity: 1, maxobs: 32
    observations range: 1-32

    Listing 5 variables:
      0) const    1) GPA    2) TUCE    3) PSI    4) GRADE

    ? freq TUCE

    Frequency distribution for TUCE, obs 1-32
    number of bins = 7, mean = 21.9375, sd = 3.90151

           interval          midpt   frequency    rel.     cum.

               < 13.417     12.000        1       3.12%    3.12%
        13.417 - 16.250     14.833        1       3.12%    6.25%
        16.250 - 19.083     17.667        6      18.75%   25.00%
        19.083 - 21.917     20.500        6      18.75%   43.75%
        21.917 - 24.750     23.333        9      28.12%   71.88%
        24.750 - 27.583     26.167        7      21.88%   93.75%
               > 27.583     29.000        2       6.25%  100.00%

    Test for null hypothesis of normal distribution:
    Chi-square(2) = 1.872 with p-value 0.39211

    ? discrete TUCE # mark TUCE as discrete
    ? freq TUCE

    Frequency distribution for TUCE, obs 1-32

              frequency    rel.     cum.

      12           1       3.12%    3.12%
      14           1       3.12%    6.25%
      17           3       9.38%   15.62%
      19           3       9.38%   25.00%
      20           2       6.25%   31.25%
      21           4      12.50%   43.75%
      22           2       6.25%   50.00%
      23           4      12.50%   62.50%
      24           3       9.38%   71.88%
      25           4      12.50%   84.38%
      26           2       6.25%   90.62%
      27           1       3.12%   93.75%
      28           1       3.12%   96.88%
      29           1       3.12%  100.00%

    Test for null hypothesis of normal distribution:
    Chi-square(2) = 1.872 with p-value 0.39211

As can be seen from the sample output, a Doornik-Hansen test for normality is computed automatically. This test is suppressed for discrete variables where the number of distinct values is less than 10.

This command accepts two options: --quiet, to avoid generation of the histogram when invoked from the command line, and --gamma, for replacing the normality test with Locke's nonparametric test, whose null hypothesis is that the data follow a Gamma distribution.

If the distinct values of a discrete variable need to be saved, the values() matrix construct can be used (see chapter 17).
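As a minimal sketch of values() in this context, using the same practice file as above:

    open greene19_1
    discrete TUCE
    # distinct values of TUCE, in ascending order
    matrix v = values(TUCE)
    print v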
The xtab command

The xtab command can be invoked in either of the following ways. First,

    xtab ylist ; xlist

where ylist and xlist are lists of discrete variables. This produces cross-tabulations (two-way frequencies) of each of the variables in ylist (by row) against each of the variables in xlist (by column). Or second,

    xtab xlist

In the second case, a full set of cross-tabulations is generated; that is, each variable in xlist is tabulated against each other variable in the list. In the graphical interface, this command is represented by the "Cross Tabulation" item under the View menu, which is active if at least two variables are selected.

Here is an example of use:

    open greene22_2
    discrete Z* # mark Z1-Z8 as discrete
    xtab Z1 Z4 ; Z5 Z6

which produces

    Cross-tabulation of Z1 (rows) against Z5 (columns)

                1     2     3     4     5   TOT.
       0       20    91    75    93    36    315
       1       28    73    54    97    34    286
      TOTAL    48   164   129   190    70    601

    Pearson chi-square test = 5.48233 (4 df, p-value = 0.241287)

    Cross-tabulation of Z1 (rows) against Z6 (columns)

                9    12    14    16    17    18    20   TOT.
       0        4    36   106    70    52    45     2    315
       1        3     8    48    45    37    67    78    286
      TOTAL     7    44   154   115    89   112    80    601

    Pearson chi-square test = 123.177 (6 df, p-value = 3.50375e-24)

    Cross-tabulation of Z4 (rows) against Z5 (columns)

                1     2     3     4     5   TOT.
       0       17    60    35    45    14    171
       1       31   104    94   145    56    430
      TOTAL    48   164   129   190    70    601

    Pearson chi-square test = 11.1615 (4 df, p-value = 0.0248074)

    Cross-tabulation of Z4 (rows) against Z6 (columns)

                9    12    14    16    17    18    20   TOT.
       0        1     8    39    47    30    32    14    171
       1        6    36   115    68    59    80    66    430
      TOTAL     7    44   154   115    89   112    80    601

    Pearson chi-square test = 18.3426 (6 df, p-value = 0.0054306)

Pearson's chi-square test for independence is automatically displayed, provided that all cells have expected frequencies under independence greater than 10^-7. However, a common rule of thumb states that this statistic is valid only if the expected frequency is 5 or greater for at least 80 percent of the cells. If this condition is not met, a warning is printed.

Additionally, the --row or --column options can be given: in this case, the output displays row or column percentages, respectively.

If you want to cut and paste the output of xtab to some other program, e.g. a spreadsheet, you may want to use the --zeros option; this option causes cells with zero frequency to display the number 0 instead of being empty.
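For instance, a minimal sketch of the --row option, reusing the dataset above (treat the exact invocation as illustrative):

    open greene22_2
    discrete Z*        # mark the Z variables as discrete
    xtab Z1 ; Z5 --row # report row percentages instead of counts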
Chapter 13: Loop constructs

13.1 Introduction

The command loop opens a special mode in which gretl accepts a block of commands to be repeated zero or more times. This feature may be useful for, among other things, Monte Carlo simulations, bootstrapping of test statistics and iterative estimation procedures. The general form of a loop is:

    loop control-expression [ --progressive | --verbose ]
        loop body
    endloop

Five forms of control-expression are available, as explained in section 13.2.

Not all gretl commands are available within loops; the commands that are not presently accepted in this context are shown in Table 13.1.

Table 13.1: Commands not usable in loops

    function    include    nulldata    quit    run    setmiss

By default, the genr command operates quietly in the context of a loop (without printing information on the variable generated). To force the printing of feedback you may specify the --verbose option to loop.

The --progressive option to loop modifies the behavior of the commands print and store, and certain estimation commands, in a manner that may be useful with Monte Carlo analyses (see Section 13.4).

The following sections explain the various forms of the loop control expression and provide some examples of use of loops.

If you are carrying out a substantial Monte Carlo analysis with many thousands of repetitions, memory capacity and processing time may be an issue. To minimize the use of computer resources, run your script using the command-line program, gretlcli, with output redirected to a file.

13.2 Loop control variants

Count loop

The simplest form of loop control is a direct specification of the number of times the loop should be repeated. We refer to this as a "count loop". The number of repetitions may be a numerical constant, as in loop 1000, or may be read from a scalar variable, as in loop replics.

In the case where the loop count is given by a variable, say replics, in concept replics is an integer; if the value is not integral, it is converted to an integer by truncation. Note that replics is evaluated only once, when the loop is initially compiled.

While loop

A second sort of control expression takes the form of the keyword while followed by a Boolean expression. For example,

    loop while essdiff > .00001

Execution of the commands within the loop will continue so long as (a) the specified condition evaluates as true and (b) the number of iterations does not exceed the value of the internal variable loop_maxiter. By default this equals 100000, but you can specify a different value (or remove the limit) via the set command (see the Gretl Command Reference).
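A minimal, self-contained sketch of the while form (the variable name s is arbitrary):

    scalar s = 1
    loop while s < 100
        s *= 2   # doubles s on each pass
    endloop
    print s      # 128: the first power of 2 not less than 100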
Index loop

A third form of loop control uses an index variable, for example i. (It is common programming practice to use simple one-character names for such variables.) In this case you specify starting and ending values for the index, as in

    loop i=1..20

The index variable may be a pre-existing scalar; if this is not the case, the variable is created automatically and is destroyed on exit from the loop. The index may be used within the loop body in either of two ways: you can access the integer value of i, or you can use its string representation, $i.

The starting and ending values for the index can be given in numerical form, by reference to predefined scalar variables, or as expressions that evaluate to scalars. In the latter two cases the variables are evaluated once, at the start of the loop. In addition, with time series data you can give the starting and ending values in the form of dates, as in

    loop i=1950:1..1999:4

for quarterly data.

This form of loop control is intended to be quick and easy, and as such it is subject to certain limitations. In particular, standard behavior is to increment the index variable by one at each iteration. So, for example, if you have

    loop i=m..n

where m and n are scalar variables with values m > n at the time of execution, the index will not be decremented; rather, the loop will simply be bypassed.

One modification of this behavior is supported, via the option flag --decr (or -d for short). This causes the index to be decremented by one at each iteration. For example,

    loop i=m..n --decr

In this case the loop will be bypassed if m < n. If you need more flexible control, see the "for" form below.

The index loop is particularly useful in conjunction with the values() matrix function, when some operation must be carried out for each value of some discrete variable (see chapter 12). Consider the following example:

    open greene22_2
    discrete Z8
    v8 = values(Z8)
    loop i=1..rows(v8)
        scalar xi = v8[i]
        smpl Z8==xi --restrict --replace
        printf "mean(Y | Z8 = %g) = %8.5f, sd(Y | Z8 = %g) = %g\n", \
          xi, mean(Y), xi, sd(Y)
    endloop

In this case we evaluate the conditional mean and standard deviation of the variable Y for each value of Z8.

Foreach loop

The fourth form of loop control also uses an index variable, in this case to index a specified set of strings. The loop is executed once for each string in the list. This can be useful for performing repetitive operations on a list of variables. Here is an example of the syntax:

    loop foreach i peach pear plum
        print "$i"
    endloop

This loop will execute three times, printing out "peach", "pear" and "plum" on the respective iterations. The numerical value of the index starts at 1 and is incremented by 1 at each iteration.

If you wish to loop across a list of variables that are contiguous in the dataset, you can give the names of the first and last variables in the list, separated by "..", rather than having to type all the names. For example, say we have 50 variables AK, AL, ..., WY, containing income levels for the states of the US. To run a regression of income on time for each of the states we could do:

    genr time
    loop foreach i AL..WY
        ols $i const time
    endloop

This loop variant can also be used for looping across the elements in a named list (see chapter 15). For example:

    list ylist = y1 y2 y3
    loop foreach i ylist
        ols $i const x1 x2
    endloop

Note that if you use this idiom inside a function (see chapter 14), looping across a list that has been supplied to the function as an argument, it is necessary to use the syntax listname.$i to reference the list-member variables. In the context of the example above, this would mean replacing the third line with

    ols ylist.$i const x1 x2

Two other cases are supported: the target of foreach can be a named array of strings or a bundle (see chapter 11). In the array case, $i naturally gets the string at position i in the array, from 1 to the number of elements; in the bundle case it gets the key-strings of all bundle members (in no particular order). For a bundle b, the command print b gives a fairly terse account of the bundle's membership; for a full account you can do:

    loop foreach i b
        print "$i:"
        eval b["$i"]
    endloop
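Pulling the list variant together, here is a minimal sketch using the data4-1 practice file (series price and sqft); the loop prints the name and mean of each member series.

    open data4-1
    list X = price sqft
    loop foreach i X
        printf "%s: mean = %g\n", "$i", mean($i)
    endloop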
For loop

The final form of loop control emulates the for statement in the C programming language. The syntax is loop for, followed by three component expressions, separated by semicolons and surrounded by parentheses. The three components are as follows:

1. Initialization: this is evaluated only once, at the start of the loop. Common example: setting a scalar control variable to some starting value.

2. Continuation condition: this is evaluated at the top of each iteration (including the first). If the expression evaluates as true (non-zero), iteration continues; otherwise it stops. Common example: an inequality expressing a bound on a control variable.

3. Modifier: an expression which modifies the value of some variable. This is evaluated prior to checking the continuation condition, on each iteration after the first. Common example: a control variable is incremented or decremented.

Here is a simple example:

    loop for (r=0.01; r<.991; r+=.01)

In this example the variable r will take on the values 0.01, 0.02, ..., 0.99 across the 99 iterations. Note that, due to the finite precision of floating point arithmetic on computers, it may be necessary to use a continuation condition such as the above (r<.991) rather than the more "natural" r<=.99. (Using double-precision numbers on an x86 processor, at the point where you would expect r to equal 0.99 it may in fact have value 0.990000000000001.)

Any or all of the three expressions governing a for loop may be omitted; the minimal form is (;;). If the continuation test is omitted it is implicitly true, so you have an infinite loop unless you arrange for some other way out, such as a break statement (see section 13.3 below).

If the initialization expression in a for loop takes the common form of setting a scalar variable to a given value, the string representation of that scalar's value is made available within the loop via the accessor $varname.

13.3 Special controls

Besides the control afforded by the governing expression at the top of a loop, the flow of execution can be adjusted via the keywords break and continue. The break keyword terminates execution of the current loop immediately, while continue has the effect of skipping any subsequent statements within the loop on the current iteration; execution will then proceed to the next iteration if the condition for continuation is still satisfied.
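A minimal sketch of the two keywords in combination; mnormal(1,1) is used here to obtain a single N(0,1) draw (a 1x1 matrix, which gretl casts to a scalar on assignment).

    scalar count = 0
    loop 1000
        scalar z = mnormal(1,1)
        if abs(z) < 1
            continue # skip "small" draws
        endif
        count++
        if count == 5
            break    # stop after five "large" draws
        endif
    endloop
    printf "collected %d draws with |z| >= 1\n", count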
13.4 Progressive mode

If the --progressive option is given for a command loop, special behavior is invoked for certain commands, namely print, store and simple estimation commands. By "simple" here we mean commands which (a) estimate a single equation (as opposed to a system of equations) and (b) do so by means of a single command statement (as opposed to a block of statements, as with nls and mle). The paradigm is ols; other possibilities include tsls, wls, logit and so on. The special behavior is as follows.

Estimators: the results from each individual iteration of the estimator are not printed. Instead, after the loop is completed you get a printout of (a) the mean value of each estimated coefficient across all the repetitions, (b) the standard deviation of those coefficient estimates, (c) the mean value of the estimated standard error for each coefficient, and (d) the standard deviation of the estimated standard errors. Note that this is useful only if there is some random input at each step.

print: when this command is used to print the value of a variable, its value is not printed each time round the loop. Rather, when the loop is terminated you get a printout of the mean and standard deviation of the variable, across the repetitions of the loop. This mode is intended for use with variables that have a scalar value at each iteration, for example the sum of squared residuals from a regression. Series cannot be printed in this way, and neither can matrices.

store: this command writes out the values of the specified scalars, from each time round the loop, to a specified file. Thus it keeps a complete record of their values across the iterations. For example, coefficient estimates could be saved in this way so as to permit subsequent examination of their frequency distribution. Only one such store can be used in a given loop.

13.5 Loop examples

Monte Carlo example

A simple example of a Monte Carlo loop in "progressive" mode is shown in Listing 13.1.

Listing 13.1: Simple Monte Carlo loop

    nulldata 50
    set seed 547
    series x = 100 * uniform()
    # open a "progressive" loop, to be repeated 100 times
    loop 100 --progressive
        series u = 10 * normal()
        # construct the dependent variable
        series y = 10*x + u
        # run OLS regression
        ols y const x
        # grab the coefficient estimates and R-squared
        scalar a = $coeff(const)
        scalar b = $coeff(x)
        scalar r2 = $rsq
        # arrange for printing of stats on these
        print a b r2
        # and save the coefficients to file
        store coeffs.gdt a b
    endloop

This loop will print out summary statistics for the a and b estimates and R-squared across the 100 repetitions. After running the loop, coeffs.gdt, which contains the individual coefficient estimates from all the runs, can be opened in gretl to examine the frequency distribution of the estimates in detail.

The nulldata command is useful for Monte Carlo work. Instead of opening a "real" data set, nulldata 50 (for instance) creates an artificial dataset containing just a constant and an index variable, with 50 observations. Constructed variables can then be added. See the set command for information on generating repeatable pseudo-random series.

Iterated least squares

Listing 13.2 uses a "while" loop to replicate the estimation of a nonlinear consumption function of the form C = α + βY^γ + ε, as presented in Greene (2000), Example 11.3. This script is included in the gretl distribution under the name greene11_3.inp; you can find it in gretl under the menu item File > Script files > Example scripts > Greene...

The option --print-final for the ols command arranges matters so that the regression results will not be printed each time round the loop, but the results from the regression on the last iteration will be printed when the loop terminates.

Listing 13.2: Nonlinear consumption function

    open greene11_3.gdt
    # run initial OLS
    ols C 0 Y
    scalar essbak = $ess
    scalar essdiff = 1
    scalar beta = $coeff(Y)
    scalar gamma = 1
    # iterate OLS till the error sum of squares converges
    loop while essdiff > .00001
        # form the linearized variables
        series C0 = C + gamma * beta * Y^gamma * log(Y)
        series x1 = Y^gamma
        series x2 = beta * Y^gamma * log(Y)
        # run OLS
        ols C0 0 x1 x2 --print-final --no-df-corr --vcv
        beta = $coeff[2]
        gamma = $coeff[3]
        ess = $ess
        essdiff = abs(ess - essbak)/essbak
        essbak = ess
    endloop
    # print parameter estimates using their "proper names"
    printf "alpha = %g\n", $coeff[1]
    printf "beta = %g\n", beta
    printf "gamma = %g\n", gamma

Listing 13.3 shows how a loop can be used to estimate an ARMA model, exploiting the "outer product of the gradient" (OPG) regression discussed by Davidson and MacKinnon (1993).

Further examples of loop usage that may be of interest can be found in chapter 21.

Listing 13.3: ARMA 1, 1

    # Estimation of an ARMA(1,1) model "manually", using a loop

    open arma.gdt

    scalar c = 0
    scalar a = 0.1
    scalar m = 0.1

    series e = 0.0
    series de_c = e
    series de_a = e
    series de_m = e

    scalar crit = 1

    loop while crit > 1.0e-9
        # one-step forecast errors
        e = y - c - a*y(-1) - m*e(-1)

        # log-likelihood
        scalar loglik = -0.5 * sum(e^2)
        print loglik

        # partials of e with respect to c, a and m
        de_c = -1 - m * de_c(-1)
        de_a = -y(-1) - m * de_a(-1)
        de_m = -e(-1) - m * de_m(-1)

        # partials of l with respect to c, a and m
        series sc_c = -de_c * e
        series sc_a = -de_a * e
        series sc_m = -de_m * e

        # OPG regression
        ols const sc_c sc_a sc_m --print-final --no-df-corr --vcv

        # update the parameters
        c += $coeff[1]
        a += $coeff[2]
        m += $coeff[3]

        # show progress
        printf " constant        = %.8g (gradient = %#.6g)\n", c, $coeff[1]
        printf " ar1 coefficient = %.8g (gradient = %#.6g)\n", a, $coeff[2]
        printf " ma1 coefficient = %.8g (gradient = %#.6g)\n", m, $coeff[3]

        crit = $T - $ess
        print crit
    endloop

    scalar se_c = $stderr[1]
    scalar se_a = $stderr[2]
    scalar se_m = $stderr[3]

    printf "\n"
    printf "constant        = %.8g (se = %#.6g, t = %.4f)\n", c, se_c, c/se_c
    printf "ar1 coefficient = %.8g (se = %#.6g, t = %.4f)\n", a, se_a, a/se_a
    printf "ma1 coefficient = %.8g (se = %#.6g, t = %.4f)\n", m, se_m, m/se_m
Chapter 14: User-defined functions

14.1 Defining a function

Gretl offers a mechanism for defining functions, which may be called via the command line, in the context of a script, or (if packaged appropriately; see section 14.5) via the program's graphical interface. The syntax for defining a function looks like this:

    function type funcname (parameters)
        function body
    end function

The opening line of a function definition contains these elements, in strict order:

1. The keyword function.

2. type, which states the type of value returned by the function, if any. This must be one of void (if the function does not return anything), scalar, series, matrix, list, string, bundle, or one of gretl's array types, matrices, bundles or strings (see section 11.8).

3. funcname, the unique identifier for the function. Function names have a maximum length of 31 characters; they must start with a letter and can contain only letters, numerals and the underscore character. You will get an error if you try to define a function having the same name as an existing gretl command.

4. The function's parameters, in the form of a comma-separated list enclosed in parentheses. This may be run into the function name, or separated by white space as shown. In case the function takes no arguments (unusual, but acceptable) this should be indicated by placing the keyword void between the parameter-list parentheses.

Function parameters can be of any of the types shown below. (An additional parameter type is available for GUI use, namely obs; this is equivalent to int except for the way it is represented in the graphical interface for calling a function.)

    Type        Description
    bool        scalar variable acting as a Boolean switch
    int         scalar variable acting as an integer
    scalar      scalar variable
    series      data series
    list        named list of series
    matrix      matrix or vector
    string      string variable or string literal
    bundle      all-purpose container (see section 11.7)
    matrices    array of matrices (see section 11.8)
    bundles     array of bundles
    strings     array of strings

Each element in the listing of parameters must include two terms: a type specifier, and the name by which the parameter shall be known within the function. An example follows:

    function scalar myfunc (series y, list xvars, bool verbose)

Each of the type-specifiers, with the exception of list and string, may be modified by prepending an asterisk to the associated parameter name, as in

    function scalar myfunc (series *y, scalar *b)

The meaning of this modification is explained below (see section 14.4); it is related to the use of pointer arguments in the C programming language.

Function parameters: optional refinements

Besides the required elements mentioned above, the specification of a function parameter may include some additional fields, as follows:

- The const modifier.
- For scalar or int parameters: minimum, maximum and/or default values; or for bool parameters, just a default value.
- For optional arguments other than scalar, int and bool, the special default value null.
- For all parameters, a descriptive string.
- For int parameters with minimum and maximum values specified, a set of strings to associate with the allowed numerical values (value labels).

The first three of these options may be useful in many contexts; the last two may be helpful if a function is to be packaged for use in the gretl GUI (but probably not otherwise). We now expand on each of the options.

The const modifier must be given as a prefix to the basic parameter specification, as in

    const matrix M

This constitutes a promise that the corresponding argument will not be modified within the function; gretl will flag an error if the function attempts to modify the argument.

Minimum, maximum and default values for scalar or int types: these values should directly follow the name of the parameter, enclosed in square brackets and with the individual elements separated by colons. For example, suppose we have an integer parameter order for which we wish to specify a minimum of 1, a maximum of 12, and a default of 4. We can write

    int order[1:12:4]

If you wish to omit any of the three specifiers, leave the corresponding field empty. For example, [1::4] would specify a minimum of 1 and a default of 4, while leaving the maximum unlimited. However, as a special case, it is acceptable to give just one value (with no colons), in which case the value is interpreted as a default. So, for example,

    int k[0]

designates a default value of 0 for the parameter k, with no minimum or maximum specified. If you wished to specify a minimum of zero with no maximum or default, you would have to write

    int k[0::]

For a parameter of type bool (whose values are just zero or non-zero), you can specify a default of 1 (true) or 0 (false), as in

    bool verbose[0]
Descriptive string: this will show up as an aid to the user if the function is packaged (see section 14.5 below) and called via gretl's graphical interface. The string should be enclosed in double quotes and separated from the preceding elements of the parameter specification with a space, as in

    series y "dependent variable"

Value labels: these may be used only with int parameters for which minimum and maximum values have been specified (so that there is a fixed number of admissible values), and the number of labels must match the number of values. They will show up in the graphical interface in the form of a drop-down list, making the function writer's intent clearer when an integer argument represents a categorical selection. A set of value labels must be enclosed in braces, and the individual labels must be enclosed in double quotes and separated by commas or spaces. For example:

    int case[1:3:1] {"Fixed effects", "Between model", "Random effects"}

If two or more of the trailing optional fields are given in a parameter specification, they must be given in the order shown above: min/max/default, description, value labels. Note that there is no facility for "escaping" characters within descriptive strings or value labels; these may contain spaces, but they cannot contain the double-quote character.

Here is an example of a well-formed function specification using all the elements mentioned above:

    function matrix myfunc (series y "dependent variable",
                            list X "regressors",
                            int p[0::1] "lag order",
                            int c[1:2:1] "criterion" {"AIC", "BIC"},
                            bool quiet[0])

One advantage of specifying default values for parameters, where applicable, is that in script or command-line mode users may omit trailing arguments that have defaults. For example, myfunc above could be invoked with just two arguments, corresponding to y and X; implicitly p = 1, c = 1 and quiet is false.

Functions taking no parameters

You may define a function that has no parameters (these are called "routines" in some programming languages). In this case, use the keyword void in place of the listing of parameters:

    function matrix myfunc2 (void)

The function body

The function body is composed of gretl commands, or calls to user-defined functions (that is, function calls may be nested). A function may call itself (that is, functions may be recursive). While the function body may contain function calls, it may not contain function definitions. That is, you cannot define a function inside another function. For further details, see section 14.4.

14.2 Calling a function

A user function is called by typing its name followed by zero or more arguments enclosed in parentheses. If there are two or more arguments, they must be separated by commas.

There are automatic checks in place to ensure that the number of arguments given in a function call matches the number of parameters, and that the types of the given arguments match the types specified in the definition of the function. An error is flagged if either of these conditions is violated. One qualification: allowance is made for omitting arguments at the end of the list, provided that default values are specified in the function definition. To be precise, the check is that the number of arguments is at least equal to the number of required parameters, and is no greater than the total number of parameters.

In general, an argument to a function may be given either as the name of a pre-existing variable or as an expression which evaluates to a variable of the appropriate type.

The following trivial example illustrates a function call that correctly matches the corresponding function definition.

    # function definition
    function scalar ols_ess (series y, list xvars)
        ols y 0 xvars --quiet
        printf "ESS = %g\n", $ess
        return $ess
    end function

    # main script
    open data4-1
    list xlist = 2 3 4
    # function call (the return value is ignored here)
    ols_ess(price, xlist)

The function call gives two arguments: the first is a data series specified by name and the second is a named list of regressors. Note that while the function offers the Error Sum of Squares as a return value, it is ignored by the caller in this instance. (As a side note here: if you want a function to calculate some value having to do with a regression, but are not interested in the full results of the regression, you may wish to use the --quiet flag with the estimation command, as shown above.)
A second example shows how to write a function call that assigns a return value to a variable in the caller:

    # function definition
    function series get_uhat (series y, list xvars)
        ols y 0 xvars --quiet
        return $uhat
    end function

    # main script
    open data4-1
    list xlist = 2 3 4
    # function call
    series resid = get_uhat(price, xlist)

14.3 Deleting a function

If you have defined a function and subsequently wish to clear it out of memory, you can do so using the keywords delete or clear, as in

    function myfunc delete
    function get_uhat clear

Note, however, that if myfunc is already a defined function, providing a new definition automatically overwrites the previous one, so it should rarely be necessary to delete functions explicitly.

14.4 Function programming details

Variables versus pointers

Most arguments to functions can be passed in two ways: "as they are", or via pointers (the exception is the list type, which cannot be passed as a pointer). First consider the following rather artificial example:

    function series triple1 (series x)
        return 3*x
    end function

    function void triple2 (series *x)
        x *= 3
    end function

    nulldata 10
    series y = normal()
    series y3 = triple1(y)
    print y3
    triple2(&y)
    print y

These two functions produce essentially the same result (the two print statements in the caller will show the same values) but in quite different ways. The first explicitly returns a modified version of its input, which must be a plain series: after the call to triple1, y is unaltered; it would have been altered only if the return value were assigned back to y rather than y3. The second function modifies its input (given as a pointer to a series) in place, without actually returning anything. It is worth noting that triple2 as it stands would not be considered idiomatic as a gretl function (although it is formally OK); the point here is just to illustrate the distinction between passing an argument in the default way and in pointer form.

Why make this distinction? There are two main reasons for doing so: modularity, and performance.

By modularity we mean the insulation of a function from the rest of the script which calls it. One of the many benefits of this approach is that your functions are easily reusable in other contexts. To achieve modularity, variables created within a function are local to that function, and are destroyed when the function exits, unless they are made available as return values and these values are "picked up" or assigned by the caller. In addition, functions do not have access to variables in "outer scope" (that is, variables that exist in the script from which the function is called) except insofar as these are explicitly passed to the function as arguments.

By default, when a variable is passed to a function as an argument, what the function actually "gets" is a copy of the outer variable, which means that the value of the outer variable is not modified by anything that goes on inside the function. This means that you can pass arguments to a function without worrying about possible side effects; at the same time, the function writer can use argument variables as workspace without fear of disruptive effects at the level of the caller.
The use of pointers, however, allows a function and its caller to cooperate, such that an outer variable can be modified by the function. In effect, this allows a function to return more than one value (although only one variable can be returned directly; see below). To indicate that a particular object is to be passed as a pointer, the parameter in question is marked with a prefix of * in the function definition, and the corresponding argument is marked with the complementary prefix & in the caller. For example,

    function series get_uhat_and_ess (series y, list xvars, scalar *ess)
        ols y 0 xvars --quiet
        ess = $ess
        series uh = $uhat
        return uh
    end function

    open data4-1
    list xlist = 2 3 4
    # function call
    scalar SSR
    series resid = get_uhat_and_ess(price, xlist, &SSR)

In the above, we may say that the function is given the address of the scalar variable SSR, and it assigns a value to that variable (under the local name ess). For anyone used to programming in C: note that it is not necessary, or even possible, to "dereference" the variable in question within the function, using the * operator. Unadorned use of the name of the variable is sufficient to access the variable in outer scope.

An "address" parameter of this sort can be used as a means of offering optional information to the caller. (That is, the corresponding argument is not strictly needed, but will be used if present.) In that case the parameter should be given a default value of null, and the function should test to see whether the caller supplied a corresponding argument or not, using the built-in function exists(). For example, here is the simple function shown above, modified to make the filling out of the ess value optional:

    function series get_uhat_and_ess (series y, list xvars, scalar *ess[null])
        ols y 0 xvars --quiet
        if exists(ess)
            ess = $ess
        endif
        return $uhat
    end function

If the caller does not care to get the ess value, it can use null in place of a "real" argument:

    series resid = get_uhat_and_ess(price, xlist, null)

Alternatively, trailing function arguments that have default values may be omitted, so the following would also be a valid call:

    series resid = get_uhat_and_ess(price, xlist)

One limitation on the use of pointer-type arguments should be noted: you cannot supply a given variable as a pointer argument more than once in any given function call. For example, suppose we have a function that takes two matrix-pointer arguments,

    function scalar pointfunc (matrix *a, matrix *b)

And suppose we have two matrices, x and y, at the caller level. The call

    pointfunc(&x, &y)

is OK, but the call

    pointfunc(&x, &x) # will not work

will generate an error. That is because the situation inside the function would become too confusing, with what is really the same object existing under two names.
Const parameters

Pointer-type arguments may also be useful for optimizing performance. Even if a variable is not modified inside the function, it may be a good idea to pass it as a pointer if it occupies a lot of memory. Otherwise, the time gretl spends transcribing the value of the variable to the local copy may be non-negligible, compared to the time the function spends doing the job it was written for.

Listing 14.1 takes this to the extreme. We define two functions which return the number of rows of a matrix (a pretty fast operation). The first gets a matrix as argument; the second gets a pointer to a matrix. The functions are evaluated 500 times on a matrix with 2000 rows and 2000 columns; on a typical system, floating-point numbers take 8 bytes of memory, so the total size of the matrix is roughly 32 megabytes.

Running the code in Listing 14.1 will produce output similar to the following (the actual numbers, of course, depend on the machine you are using):

    Elapsed time:
        without pointers (copy) = 2.47197 seconds,
        with pointers (no copy) = 0.00378627 seconds

Listing 14.1: Performance comparison: values versus pointer

    function scalar rowcount1 (matrix X)
        return rows(X)
    end function

    function scalar rowcount2 (const matrix *X)
        return rows(X)
    end function

    set verbose off
    X = zeros(2000, 2000)
    scalar r

    set stopwatch
    loop 500
        r = rowcount1(X)
    endloop
    e1 = $stopwatch

    set stopwatch
    loop 500
        r = rowcount2(&X)
    endloop
    e2 = $stopwatch

    printf "Elapsed time:\n\twithout pointers (copy) = %g seconds,\n\twith pointers (no copy) = %g seconds\n", e1, e2

If a pointer argument is used for this sort of purpose, and the object to which the pointer points is not modified (is treated as read-only) by the function, one can signal this to the user by adding the const qualifier, as shown for function rowcount2 in Listing 14.1. When a pointer argument is qualified in this way, any attempt to modify the object within the function will generate an error.

However, combining the const flag with the pointer mechanism is technically redundant, for the following reason: if you mark a matrix argument as const, then gretl will in fact pass it in pointer mode internally (since it cannot be modified within the function, there is no downside to simply making it available to the function rather than copying it). So in the example above we could revise the signature of the second function as

    function scalar rowcount2a (const matrix X)

and call it with r = rowcount2a(X), for the same speed-up relative to rowcount1.

From the caller's point of view, the second option (using the const modifier without pointer notation) is preferable, as it allows the caller to pass an object created "on the fly". Suppose the caller has two matrices, A and B, in scope, and wishes to pass their vertical concatenation as an argument. The following call would work fine:

    r = rowcount2a(A | B)

To use rowcount2, on the other hand, the caller would have to create a named variable first, since you cannot give the address of an anonymous object such as A | B:

    matrix AB = A | B
    r = rowcount2(&AB)

This requires an extra line of code, and leaves AB occupying memory after the call.

We have illustrated using a matrix parameter, but the const modifier may be used with the same effect (namely, the argument is passed directly, without being copied, but is protected against modification within the function) for all the types that support the pointer apparatus.
List arguments

The use of a named list as an argument to a function gives a means of supplying a function with a set of variables whose number is unknown when the function is written; for example, sets of regressors or instruments. Within the function, the list can be passed on to commands such as ols.

A list argument can also be "unpacked" using a foreach loop construct, but this requires some care. For example, suppose you have a list X and want to calculate the standard deviation of each variable in the list. You can do:

    loop foreach i X
        scalar sd_$i = sd(X.$i)
    endloop

Please note: a special piece of syntax is needed in this context. If we wanted to perform the above task on a list in a regular script (not inside a function), we could do

    loop foreach i X
        scalar sd_$i = sd($i)
    endloop

where $i gets the name of the variable at position i in the list, and sd_$i gets its standard deviation. But inside a function, working on a list supplied as an argument, if we want to reference an individual variable in the list we must use the syntax listname.varname. Hence in the example above we write sd(X.$i).

This is necessary to avoid possible collisions between the name-space of the function and the name-space of the caller script. For example, suppose we have a function that takes a list argument, and that defines a local variable called y. Now suppose that this function is passed a list containing a variable named y. If the two name-spaces were not separated, either we would get an error or the external variable y would be silently overwritten by the local one. It is important, therefore, that list-argument variables should not be visible by name within functions. To "get hold of" such variables, you need to use the form of identification just mentioned: the name of the list, followed by a dot, followed by the name of the variable.

Constancy of list arguments

When a named list of variables is passed to a function, the function is actually provided with a copy of the list. The function may modify this copy (for instance, adding or removing members), but the original list at the level of the caller is not modified.

Optional list arguments

If a list argument to a function is optional, this should be indicated by appending a default value of null, as in

    function scalar myfunc (scalar y, list X[null])

In that case, if the caller gives null as the list argument (or simply omits the last argument), the named list X inside the function will be empty. This possibility can be detected using the nelem() function, which returns 0 for an empty list.
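A minimal sketch of the nelem() check; the function name n_regressors is our own invention, used purely for illustration:

    function scalar n_regressors (scalar y, list X[null])
        # nelem(X) is 0 if the caller passed null or omitted X
        return nelem(X)
    end function

    open data4-1
    eval n_regressors(1)                # prints 0
    eval n_regressors(1, deflist(sqft)) # prints 1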
Return values

Functions can return nothing (just printing a result, perhaps), or they can return a single variable. The return value, if any, is specified via a statement within the function body beginning with the keyword return, followed by either the name of a variable (which must be of the type announced on the first line of the function definition) or an expression which produces a value of the correct type.

Having a function return a list or bundle is a way of permitting the "return" of more than one variable. For example, you can define several series inside a function and package them as a list; in this case they are not destroyed when the function exits. Here is a simple example, which also illustrates the possibility of setting the descriptive labels for variables generated in a function.

  function list makecubes (list xlist)
      list cubes = deflist()
      loop foreach i xlist
          series $i3 = (xlist.$i)^3
          setinfo $i3 --description="cube of $i"
          list cubes += $i3
      endloop
      return cubes
  end function

  open data4-1
  list xlist = price sqft
  list cubelist = makecubes(xlist)
  print xlist cubelist --byobs
  labels

A return statement causes the function to return (exit) at the point where it appears within the body of the function. A function may also exit when (a) the end of the function code is reached (in the case of a function with no return value), (b) a gretl error occurs, or (c) a funcerr statement is reached.

The funcerr keyword — which may be followed by a string enclosed in double quotes, or the name of a string variable, or nothing — causes a function to exit with an error flagged. If a string is provided (either literally or via a variable), this is printed on exit; otherwise a generic error message is printed. This mechanism enables the author of a function to pre-empt an ordinary execution error and/or offer a more specific and helpful error message. For example:

  if nelem(xlist) == 0
      funcerr "xlist must not be empty"
  endif

A function may contain more than one return statement, as in

  function scalar multi (bool s)
      if s
          return 1000
      else
          return 10
      endif
  end function

However, it is recommended programming practice to have a single return point from a function, unless this is very inconvenient. The simple example above would be better written as

  function scalar multi (bool s)
      return s ? 1000 : 10
  end function

Overloading

You may have noticed that several built-in functions in gretl are "overloaded" — that is, a given argument slot may accept more than one type of argument, and the return value may depend on the type of the argument in question. For instance, the argument x for the pdf() function may be a scalar, series or matrix, and the return type will match that choice on the caller's part. Since gretl 2021b this possibility also exists for user-defined functions: the meta-type numeric can be used in place of a specific type to accept a scalar, series or matrix argument, and similarly the return type of a function can be marked as numeric.

As a function writer you can choose to be more restrictive than the default, which allows scalar, series or matrix for any numeric argument. For instance, if you write a function in which two arguments, x and y, are specified as numeric, you might decide to disallow the case where x is a matrix and y a series, or vice versa, as too complicated. You can use the typeof() function to determine what types of arguments were supplied, and the funcerr command or errorif() function to reject an unsupported combination.

If your function is going to return a certain specific type (say, matrix) regardless of the type of the input, then the return value should be labeled accordingly: use numeric for the return only if it's truly unknown in advance.
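As a minimal sketch of such type-gating (the function name is ours; the gating relies on the fact, used in Listing 14.2 below, that typeof() reports a value of at most 2 for scalars and series — consult the Gretl Command Reference for the authoritative type coding):

  function numeric double_it (numeric x)
      if typeof(x) > 2   # a matrix was supplied
          errorif(cols(x) > 1, "only column vectors are accepted")
      endif
      return 2 * x
  end function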
Listing 14.2 offers an (admittedly artificial) example: its numeric inputs can be scalars, series or column vectors, but they must be of a single type. Naturally, if your overloaded function is intended for public use, you should state clearly in its documentation what is supported and what is not.

Listing 14.2: Example of overloaded function

  function numeric x_plus_b_y (numeric x, scalar b, numeric y)
      errorif(typeof(x) != typeof(y), "x and y must be of the same type")
      if typeof(x) <= 2 # scalar or series
          return x + b*y
      elif rows(x) == rows(y) && cols(x) == 1 && cols(y) == 1
          return x + b*y
      else
          funcerr "x and y should be column vectors"
      endif
  end function

  # call 1: x and y are scalars
  eval x_plus_b_y(10, 3, 2)

  # call 2: x and y are vectors
  matrix x = mnormal(10, 1)
  matrix y = mnormal(10, 1)
  eval x_plus_b_y(x, 2, y)

  open data4-1
  # call 3: x and y are series
  series bb = x_plus_b_y(bedrms, 0.5, baths)
  print bb --byobs

Error checking

When gretl first reads and "compiles" a function definition, there is minimal error-checking: the only checks are that the function name is acceptable and, so far as the body is concerned, that you are not trying to define a function inside a function (see Section 14.1). Otherwise, if the function body contains invalid commands, this will become apparent only when the function is called and its commands are executed.

Debugging

The usual mechanism whereby gretl echoes commands and reports on the creation of new variables is by default suppressed when a function is being executed. If you want more verbose output from a particular function, you can use either or both of the following commands within the function:

  set echo on
  set messages on

Alternatively, you can achieve this effect for all functions via the command

  set debug 1

Usually, when you set the value of a state variable using the set command, the effect applies only to the current level of function execution. For instance, if you do "set messages on" within function f1, which in turn calls function f2, then messages will be printed for f1 but not f2. The debug variable, however, acts globally: all functions become verbose regardless of their level.

Further, you can do

  set debug 2

In addition to command echo and the printing of messages, this is equivalent to setting max_verbose (which produces verbose output from the BFGS maximizer) at all levels of function execution.

14.5 Function packages

At various points above we have alluded to function packages, and the use of these via the gretl GUI. This topic is covered in depth by the Gretl Function Package Guide. If you're running gretl, you can find this under the Help menu; alternatively you may download it from

  https://sourceforge.net/projects/gretl/files/manual/

Chapter 15: Named lists and strings

15.1 Named lists

Many gretl commands take one or more lists of series as arguments. To make this easier to handle in the context of command scripts, and in particular within user-defined functions, gretl offers the possibility of named lists.

Creating and modifying named lists

A named list is created using the keyword list, followed by the name of the list, an equals sign, and an expression that forms a list. The most basic sort of expression that works in this context is a space-separated list of variables, given either by name or by ID number. For example:

  list xlist = 1 2 3 4
  list reglist = income price

Note that the variables in question must be of the series type.

Two abbreviations are available in defining lists:

- You can use the wildcard character, "*", to create a list of variables by name. For example, dum* can be used to indicate all variables whose names begin with dum.

- You can use two dots to indicate a range of variables. For example, income..price indicates the set of variables whose ID numbers are greater than or equal to that of income and less than or equal to that of price.
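For instance (the series names below are hypothetical, chosen only to illustrate the two abbreviations):

  # with series dum_north, dum_south and dum_east in the dataset:
  list D = dum*              # all three dummies, matched by name
  list R = income..price     # all series with IDs between the two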
In addition, there are two special forms:

- If you use the keyword null on the right-hand side, you get an empty list.

- If you use the keyword dataset on the right, you get a list containing all the series in the current dataset (except the predefined const).

The name of the list must start with a letter and must be composed entirely of letters, numbers or the underscore character. The maximum length of the name is 31 characters; list names cannot contain spaces.

Once a named list has been created, it will be "remembered" for the duration of the gretl session (unless you delete it), and can be used in the context of any gretl command where a list of variables is expected. One simple example is the specification of a list of regressors:

  list xlist = x1 x2 x3 x4
  ols y 0 xlist

To get rid of a list, you use the following syntax:

  list xlist delete

Be careful: "delete xlist" will delete the series contained in the list, so it implies data loss (which may not be what you want). On the other hand, "list xlist delete" will simply "undefine" the xlist identifier; the series themselves will not be affected.

Similarly, to print the names of the members of a list you have to invert the usual print command, as in

  list xlist print

If you just say "print xlist", the list will be expanded and the values of all the member series will be printed.

Lists can be modified in various ways. To redefine an existing list altogether, use the same syntax as for creating a list. For example:

  list xlist = 1 2 3
  xlist = 4 5 6

After the second assignment, xlist contains just variables 4, 5 and 6.

To append or prepend variables to an existing list, we can make use of the fact that a named list stands in for a "longhand" list. For example, we can do

  list xlist = xlist 5 6 7
  xlist = 9 10 xlist 11 12

Another option for appending a term (or a list) to an existing list is to use +=, as in

  xlist += cpi

To drop a variable from a list, use -=:

  xlist -= cpi

In most contexts where lists are used in gretl, it is expected that they do not contain any duplicated elements. If you form a new list by simple concatenation, as in

  list L3 = L1 L2

(where L1 and L2 are existing lists), it's possible that the result may contain duplicates. To guard against this you can form a new list as the union of two existing ones:

  list L3 = L1 || L2

The result is a list that contains all the members of L1, plus any members of L2 that are not already in L1.

In the same vein, you can construct a new list as the intersection of two existing ones:

  list L3 = L1 && L2

Here L3 contains all the elements that are present in both L1 and L2.

You can also subtract one list from another:

  list L3 = L1 - L2

The result contains all the elements of L1 that are not present in L2.

Indexing into a defined list is also possible, as if it were a vector:

  list L2 = L1[1:4]

This leaves L2 with the first four members of L1. Notice that the ordering of list members is path-dependent.
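To fix ideas, here is a small worked sketch of the three set operations (the series names are hypothetical):

  list L1 = x1 x2 x3
  list L2 = x2 x4
  list U = L1 || L2   # union: x1 x2 x3 x4
  list I = L1 && L2   # intersection: x2
  list D = L1 - L2    # difference: x1 x3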
Lists and matrices

There are two ways one can think of lists and matrices as being interchangeable: either you think of a list as a collection of references to series, or you may consider the rectangle of data given by the series that the list contains.

In the former case, a list may be translated into, or created from, a one-dimensional matrix, that is, a vector. The matrix in question must therefore be interpretable as a vector containing ID numbers of data series. It may be either a row or a column vector, and each of its elements must have an integer part that is no greater than the number of variables in the data set. For example:

  matrix m = {1, 2, 3, 4}
  list L = m

The above is OK, provided the data set contains at least 4 variables. Conversely, the command

  matrix m = L

will create a row vector with the ID numbers of the series referenced by L.

The latter case occurs when the matrix is assumed to contain valid data. To create a matrix from the list, simply assign to a matrix the list name surrounded by curly brackets, as in

  matrix m = {L}

Note the difference from the above: without the curly brackets, matrix m would have been just a vector. Also note that any row corresponding to one or more missing entries will be dropped, unless the skip_missing set variable is set to "off".

For the reverse operation, gretl provides the mat2list() function, which takes a matrix (say, X) as argument and creates new series, as well as a list containing them. The row dimension of X must equal either the length of the current dataset or the number of observations in the current sample range. The naming of the series in the returned list proceeds as follows. First, if the optional "prefix" argument is supplied, the series created from column i of X is named by appending i to the given string. Otherwise, if X has column names set, these names are used. Finally, if neither of the above conditions is satisfied, the names are column1, column2 and so on. For example:

  matrix X = mnormal($nobs, 8)
  list L = mat2list(X, "xnorm")

This will add to the dataset eight full-length series named xnorm1, xnorm2 and so on.

Querying a list

You can determine the number of variables or elements in a list using the function nelem().

  list xlist = 1 2 3
  nl = nelem(xlist)

The scalar variable nl will be assigned a value of 3, since xlist contains 3 members.

You can determine whether a given series is a member of a specified list using the function inlist(), as in

  scalar k = inlist(L, y)

where L is a list and y a series. The series may be specified by name or ID number. The return value is the (1-based) position of the series in the list, or zero if the series is not present in the list.

Generating lists of transformed variables

Given a named list of series, you are able to generate lists of transformations of these series using the functions log, lags, diff, ldiff, sdiff or dummify. For example:

  list xlist = x1 x2 x3
  list lxlist = log(xlist)
  list difflist = diff(xlist)

When generating a list of lags in this way, you specify the maximum lag order inside the parentheses, before the list name and separated by a comma. For example:

  list xlist = x1 x2 x3
  list laglist = lags(2, xlist)

or

  scalar order = 4
  list laglist = lags(order, xlist)

These commands will populate laglist with the specified number of lags of the variables in xlist. You can give the name of a single series in place of a list as the second argument to lags: this is equivalent to giving a list with just one member.

The dummify function creates a set of dummy variables coding for all but one of the distinct values taken on by the original variable, which should be discrete. (The smallest value is taken as the omitted category.) Like lags, this function returns a list even if the input is a single series.
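As a sketch of dummify in action — the construction of the discrete series below is our own illustration, using the supplied dataset data4-1:

  open data4-1
  # a discrete size class with values 1, 2, 3
  series sizeclass = 1 + (sqft > 1500) + (sqft > 2500)
  setinfo sizeclass --discrete
  list SD = dummify(sizeclass)   # dummies for all but the smallest value

The resulting list SD can then be used directly as a set of regressors.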
Another useful operation you can perform with lists is creating interaction variables. Suppose you have a discrete variable x_i, taking values from 1 to n, and a variable z_i, which could be continuous or discrete. In many cases you want to "split" z_i into a set of n variables via the rule

  z_ij = z_i when x_i = j, and 0 otherwise;

in practice, you create dummies for the x_i variable first and then you multiply them all by z_i; these are commonly called the interactions between x_i and z_i. In gretl you can do

  list H = D ^ Z

where D is a list of discrete series (or a single discrete series) and Z is a list (or a single series)¹; all the interactions will be created and listed together under the name H. An example is provided in Listing 15.1.

¹ Warning: this construct does not work if neither D nor Z is of the list type.

Generating series from lists

There are various ways of retrieving or generating individual series from a named list. The most basic method is indexing into the list. For example,

  series x3 = Xlist[3]

will retrieve the third element of the list Xlist under the name x3, or will generate an error if Xlist has fewer than three members.

In addition, gretl offers several functions that apply to a list and return a series. In most cases these functions also apply to single series, and behave as natural extensions when applied to lists, but this is not always the case.

Listing 15.1: Usage of interaction lists

Input:

  open mroz87.gdt --quiet
  # the coding below makes it so that
  #   KIDS = 0 -> no kids
  #   KIDS = 1 -> older kids only
  #   KIDS = 2 -> young kids (possibly with older ones)
  series KIDS = (KL6 > 0) + ((KL6 > 0) || (K618 > 0))
  list D = CIT KIDS    # interaction discrete variables
  list X = WE WA       # variables to split
  list INTER = D ^ X
  smpl 1 6
  print D X INTER -o

Output (selected portions):

           CIT       KIDS         WE         WA   WE_CIT_0

  1          0          2         12         32         12
  2          1          1         12         30          0
  3          0          2         12         35         12
  4          0          1         12         34         12
  5          1          2         14         31          0
  6          1          0         12         54          0

      WE_CIT_1   WA_CIT_0   WA_CIT_1  WE_KIDS_0  WE_KIDS_1

  1          0         32          0          0          0
  2         12          0         30          0         12
  3          0         35          0          0          0
  4          0         34          0          0         12
  5         14          0         31          0          0
  6         12          0         54         12          0

     WE_KIDS_2  WA_KIDS_0  WA_KIDS_1  WA_KIDS_2

  1         12          0          0         32
  2          0          0         30          0
  3         12          0          0         35
  4          0          0         34          0
  5         14          0          0         31
  6          0         54          0          0

For recognizing and handling missing values, gretl offers several functions (see the Gretl Command Reference for details). In this context it is worth remarking that the ok() function can be used with a list argument. For example,

  list xlist = x1 x2 x3
  series xok = ok(xlist)

After these commands, the series xok will have value 1 for observations where none of x1, x2 or x3 has a missing value, and value 0 for any observations where this condition is not met.

The functions max, min, mean, sd, sum and var behave "horizontally" rather than "vertically" when their argument is a list. For instance, the following commands

  list Xlist = x1 x2 x3
  series m = mean(Xlist)

produce a series m whose i-th element is the average of x1_i, x2_i and x3_i; missing values, if any, are implicitly discarded.

        YpcFR   YpcGE   YpcIT        NFR        NGE        NIT

  1997  114.9   124.6   119.3   59830635   82034771   56890372
  1998  115.3   122.7   120.0   60046709   82047195   56906744
  1999  115.0   122.4   117.8   60348255   82100243   56916317
  2000  115.6   118.8   117.2   60750876   82211508   56942108
  2001  116.0   116.9   118.1   61181560   82349925   56977217
  2002  116.3   115.5   112.2   61615562   82488495   57157406
  2003  112.1   116.9   111.0   62041798   82534176   57604658
  2004  110.3   116.6   106.9   62444707   82516260   58175310
  2005  112.4   115.1   105.1   62818185   82469422   58607043
  2006  111.9   114.2   103.3   63195457   82376451   58941499

  Table 15.1: GDP per capita and population in 3 European countries (Source: Eurostat)

In addition, gretl provides three functions for weighted operations: wmean, wsd and wvar. Consider as an illustration Table 15.1: the first three columns are GDP per capita for France, Germany and Italy; columns 4 to 6 contain the population for each country. If we want to compute an aggregate indicator of per capita GDP, all we have to do is

  list Ypc = YpcFR YpcGE YpcIT
  list N = NFR NGE NIT
  y = wmean(Ypc, N)

so that, for example, using the 1997 values,

  y = (114.9*59830635 + 124.6*82034771 + 119.3*56890372) /
      (59830635 + 82034771 + 56890372) = 120.163

See the Gretl Command Reference for more details.
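As a quick sanity check — assuming the data of Table 15.1 are loaded, with 1997 as the first observation — one can compare wmean against the computation by hand:

  list Ypc = YpcFR YpcGE YpcIT
  list N = NFR NGE NIT
  series y = wmean(Ypc, N)
  scalar num = 114.9*59830635 + 124.6*82034771 + 119.3*56890372
  scalar den = 59830635 + 82034771 + 56890372
  printf "wmean: %g, by hand: %g\n", y[1], num/den

Both numbers should print as 120.163.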
15.2 Named strings

For some purposes it may be useful to save a string (that is, a sequence of characters) as a named variable that can be reused. Some examples of the definition of a string variable are shown below.

  string s1 = "some stuff I want to save"
  string s2 = getenv("HOME")
  string s3 = s1 + 11

The first field after the type-name "string" is the name under which the string should be saved; then comes an equals sign; then comes a specification of the string to be saved. This may take any of the following forms:

- a string literal (enclosed in double quotes); or
- the name of an existing string variable; or
- a function that returns a string (see below); or
- any of the above followed by + and an integer offset.

The role of the integer offset is to use a substring of the preceding element, starting at the given character offset. An empty string is returned if the offset is greater than the length of the string in question.

To add to the end of an existing string you can use the operator ~=, as in

  string s1 = "some stuff I want to "
  string s1 ~= "save"

or you can use the ~ operator to join two or more strings, as in

  string s1 = "sweet"
  string s2 = "Home, " ~ s1 ~ " home."

Note that when you define a string variable using a string literal, no characters are treated as "special" (other than the double quotes that delimit the string). Specifically, the backslash is not used as an escape character. So, for example,

  string s = "\"

is a valid assignment, producing a string that contains a single backslash character.

If you wish to use backslash-escapes to denote newlines, tabs, embedded double quotes and so on, use the sprintf() function instead (see the printf command for an account of the escape characters). This function can also be used to produce a string variable whose definition involves the values of other variables, as in

  scalar x = 8
  foo = sprintf("var%d", x)   # produces "var8"

String variables and string substitution

String variables can be used in two ways in scripting: the name of the variable can be typed "as is", or it may be preceded by the "at" sign, @. In the first variant the named string is treated as a variable in its own right, while the second calls for "string substitution". The context determines which of these variants is appropriate.

In the following contexts the names of string variables should be given in plain form (without the "at" sign):

- when such a variable appears among the arguments to the printf command or sprintf function;
- when such a variable is given as the argument to a function;
- on the right-hand side of a string assignment.

Here is an illustration of the use of a named string argument with printf:

  ? string vstr = "variance"
  Generated string vstr
  ? printf "vstr: %12s\n", vstr
  vstr:     variance

String substitution can be used in contexts where a string variable is not acceptable as such. If gretl encounters the symbol @ followed directly by the name of a string variable, this notation is treated as a "macro": the value of the variable is substituted literally into the command line before the regular parsing of the command is carried out.

One common use of string substitution is when you want to construct and use the name of a series programmatically. For example, suppose you want to create 10 random normal series named norm1 to norm10. This can be accomplished as follows:

  string sname
  loop i=1..10
      sname = sprintf("norm%d", i)
      series @sname = normal()
  endloop

Note that plain sname could not be used in the second line within the loop: the effect would be to attempt to overwrite the string variable named sname with a series of the same name. What we want is for the current value of sname to be dumped directly into the command that defines a series, and the @ notation achieves that.
Another typical use of string substitution is when you want the options used with a particular command to vary depending on some condition. For example:

  function void use_optstr (series y, list xlist, int verbose)
      string optstr = verbose ? "" : "--simple-print"
      ols y xlist @optstr
  end function

  open data4-1
  list X = const sqft
  use_optstr(price, X, 1)
  use_optstr(price, X, 0)

When printing the value of a string variable using the print command, the plain variable name should generally be used, as in

  string s = "Just testing"
  print s

The following variant is equivalent, though clumsy:

  string s = "Just testing"
  print "@s"

But note that this next variant does something quite different:

  string s = "Just testing"
  print @s

After string substitution, the print command reads

  print Just testing

which attempts to print the values of two variables, Just and testing.

Built-in strings

Apart from any strings that the user may define, some string variables are defined by gretl itself. These may be useful for people writing functions that include shell commands. The built-in strings are as shown in Table 15.2.

  gretldir   the gretl installation directory
  workdir    user's current gretl working directory
  dotdir     the directory gretl uses for temporary files
  gnuplot    path to, or name of, the gnuplot executable
  tramo      path to, or name of, the tramo executable
  x12a       path to, or name of, the x-12-arima executable
  tramodir   tramo data directory
  x12adir    x-12-arima data directory

  Table 15.2: Built-in string variables

To access these as ordinary string variables, prepend a dollar sign (as in $dotdir); to use them in string-substitution mode, prepend the at-sign (@dotdir).

Reading strings from the environment

It is possible to read into gretl's named strings values that are defined in the external environment. To do this you use the function getenv(), which takes the name of an environment variable as its argument. For example:

  ? string user = getenv("USER")
  Generated string user
  ? string home = getenv("HOME")
  Generated string home
  ? printf "%s's home directory is %s\n", user, home
  cottrell's home directory is /home/cottrell

To check whether you got a non-empty value from a given call to getenv, you can use the function strlen(), which retrieves the length of the string, as in

  ? string temp = getenv("TEMP")
  Generated string temp
  ? scalar x = strlen(temp)
  Generated scalar x = 0

Capturing strings via the shell

If shell commands are enabled in gretl, you can capture the output from such commands using the syntax

  string stringname = $(shellcommand)

That is, you enclose a shell command in parentheses, preceded by a dollar sign.

Reading from a file into a string

You can read the content of a file into a string variable using the syntax

  string stringname = readfile(filename)

The filename field may be given as a string variable. For example:

  ? fname = sprintf("%s/QNC.rts", $x12adir)
  Generated string fname
  ? string foo = readfile(fname)
  Generated string foo

More string functions

Gretl offers several functions for creating or manipulating strings. You can find these listed and explained in the Function Reference, under the category Strings.
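For instance, combining two of these functions (a minimal sketch; strsub replaces a substring, toupper converts to upper case):

  string s = toupper(strsub("hello world", "world", "gretl"))
  print s    # prints HELLO GRETL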
Chapter 16: String-valued series

16.1 Introduction

By a string-valued series we mean a series whose primary values are strings (though internally such series comprise an integer coding plus a "dictionary" mapping from the integer values to strings). This chapter explains how to create such series and describes the operations that are supported for them.

16.2 Creating a string-valued series

This can be done in three ways:

- by reading such a series from a suitable source file;
- by taking a suitable numerical series within gretl and adding string values using the stringify() function; and
- by direct assignment to a series from an array of strings.

In each case, string values will be preserved when such a series is saved in a gretl-native data file.

Reading string-valued series

The primary "suitable source" for string-valued series is a delimited text data file (but see section 16.5 below). Here's a little example. The following is the content of a file named gc.csv:

  city,year
  Bilbao,2009
  Torun,2011
  Oklahoma City,2013
  Berlin,2015
  Athens,2017
  Naples,2019

A script to read this file, and its output, are shown in Listing 16.1, from which we can see a few things.

- By default the print command shows us the string values of the series city, and it handles non-ASCII characters provided they're in UTF-8 (but it doesn't handle longer strings very elegantly).

- The --numeric option to print exposes the integer codes for a string-valued series.

- The syntax seriesname[obs] yields a string when a series is string-valued.

Listing 16.1: Working with a string-valued series

Input:

  open gc.csv --quiet
  print --byobs
  print city --byobs --numeric
  printf "The third gretl conference took place in %s.\n", city[3]

Output:

  ? print --byobs

            city         year

  1       Bilbao         2009
  2        Torun         2011
  3   Oklahoma C         2013
  4       Berlin         2015
  5       Athens         2017
  6       Naples         2019

  ? print city --byobs --numeric

          city

  1            1
  2            2
  3            3
  4            4
  5            5
  6            6

  The third gretl conference took place in Oklahoma City.

If you want to access the numeric code for a particular string-valued observation, you can get it by casting the series in question to a vector, by wrapping the identifier in curly brackets. So, for example,

  printf "The code for %s is %d.\n", city[3], {city}[3]

gives

  The code for Oklahoma City is 3.

The numeric codes for string-valued series are always assigned thus: reading the data file row by row, the first string value is assigned 1, the next distinct string value is assigned 2, and so on.

Assigning string values to a numeric series

This is done via the stringify() function, which takes two arguments: the name of a series and an array of strings. For this to work, two conditions must be met:

1. The series must have only integer values, and the smallest value must be 1 or greater.

2. The array of strings must have at least n distinct members, where n is the largest value found in the series.

The logic of these conditions is that we're looking to create a mapping, as described above, from a 1-based sequence of integers to a set of strings. However, we're allowing for the possibility that the series in question is an incomplete sample from an associated population. Suppose we have a series that goes 2, 3, 5, 9, 10. This is taken to be a sample from a population that has at least 10 discrete values, 1, 2, ..., 10, and so requires at least 10 value-strings.

Here's a simplified version of an example that one of the authors has had cause to use: deriving US-style letter grades from a series containing percentage scores for students. Call the percentage series x, and say we want to create a series with values "A" for x > 90, "B" for 80 < x <= 90, and so on down to "F" for x <= 60. Then we can do:

  series grade = 1   # F, the least value
  grade += x > 60    # D
  grade += x > 70    # C
  grade += x > 80    # B
  grade += x > 90    # A
  stringify(grade, strsplit("F D C B A"))

The way the grade series is constructed is not the most compact, but it's nice and explicit, and easy to amend if one wants to adjust the threshold values. Note the use of strsplit() to create an on-the-fly array of strings from a string literal; this is convenient when the array contains a moderate number of elements with no embedded spaces.
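To make the grade example self-contained, one might first generate some hypothetical percentage scores (the data below are invented purely for illustration):

  nulldata 10
  set seed 123
  series x = randgen(u, 50, 100)   # uniform scores between 50 and 100
  series grade = 1
  grade += x > 60
  grade += x > 70
  grade += x > 80
  grade += x > 90
  stringify(grade, strsplit("F D C B A"))
  print x grade --byobs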
An alternative way to get the same result is to define the array of strings via the defarray() function, as in

  stringify(grade, defarray("F", "D", "C", "B", "A"))

The inverse operation of stringify() is performed by the strvals() function: this retrieves the array of distinct string values from a series (or returns an empty array if the series is not string-valued).

Assigning from an array of strings

Given an array of strings whose length matches the full length of the current dataset, you can assign directly to a series result, provided these conditions are satisfied: the dataset is not subsampled, and if the assignment is to a pre-existing series, it is not already string-valued.

Here's a trivial example:

  nulldata 6
  strings S = defarray("a", "b", "c", "b", "a", "d")
  series sx = S
  print sx --byobs

Here's a second example, where we create a string-valued series using the observation markers from the current dataset, after grabbing them as an array via the markers command:

  open data4-10
  markers --to-array=S
  series state = S
  print state --byobs

And here's a third example, where we construct the array of strings by reading from a text file:

  nulldata 8
  series sv = strsplit(readfile("ABCD.txt"))
  print sv --byobs

This will work fine if the content of ABCD.txt is something like

  A B C D D C B A

(containing 8 space-separated values, with or without line breaks). If the strings in question contain embedded spaces, you would have to make use of the optional second argument to strsplit().

16.3 Permitted operations

One question that arises with string-valued series is, what exactly are you allowed to do with them? The optimal policy may be debatable, but here we set out the current state of things.

Setting values per observation

You can set particular values in a string-valued series, either by string or by numeric code. For example, suppose (in relation to the example in section 16.2) that for some reason student number 31, with a percentage score of 88, nonetheless merits an A grade. We could do

  grade[31] = "A"

or, if we're confident about the mapping,

  grade[31] = 5

Or, to raise the student's grade by one letter:

  grade[31] += 1

What you're not allowed to do here is make a numerical adjustment that would put the value out of bounds in relation to the set of string values. For example, if we tried

  grade[31] = 6

we'd get an error.

On the other hand, you can implicitly extend the set of string values. This wouldn't make sense for the letter grades example, but it might for, say, city names. Returning to the example in section 16.2, suppose we try

  dataset addobs 1
  year[7] = 2023
  city[7] = "Gdansk"

This will work: we're implicitly adding another member to the string table for city, and the associated numeric code will be the next available integer.¹

¹ So please be careful: one may inadvertently add a new string value by mistyping a string that's already present.

Logical product of two string-valued series

The ^ operator can be used to produce what we might call the logical product of two string-valued series, as in

  series sv3 = sv1 ^ sv2

The result is another string-valued series, with value "si.sj" at observations where sv1 has value si and sv2 has value sj. For example, if at a given observation sv1 has value "A" and sv2 has value "X", then sv3 will have value "A.X". The set of strings attached to the resulting series will include all such string combinations, even if they are not all represented in the given sample.
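As a concrete sketch (the two series are built on the fly via array assignment, and the "A.X"-style values follow the description just given):

  nulldata 4
  series sv1 = defarray("A", "A", "B", "B")
  series sv2 = defarray("X", "Y", "X", "Y")
  series sv3 = sv1 ^ sv2   # values A.X, A.Y, B.X, B.Y
  print sv3 --byobs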
Assignment to a string-valued series

In an assignment statement where the left-hand side (LHS) term is an existing string-valued series, two general conditions must be met. First, the right-hand side (RHS) term must be a series, either numeric or string-valued; and second, the assignment operator must be the plain "=" — inflected operators such as "+=" and "*=" are not supported.

When the RHS series is numeric, all its values must be either integers between 1 and the number of strings attached to the LHS series, or NA. This is required to preserve the integrity of the LHS.

When the RHS series is itself string-valued, there are two cases to consider: there's no sample restriction in place, or there is such a restriction. In the unrestricted case, the LHS series is in effect destroyed and replaced by a clone of the RHS. Otherwise, string values on the RHS are written into the LHS only within the current sample range. If an RHS string is already present on the left, its numerical code is adjusted, if necessary, to match the LHS string table; if it is not present on the left, it is appended to the LHS string table.

Missing values

We support one exception to the general rule, never break the mapping between strings and numeric codes for string-valued series: you can mark particular observations as missing. This is done in the usual way, e.g.

  grade[31] = NA

Note, however, that on importing a string series from a delimited text file any non-blank strings (including "NA") will be interpreted as valid values; any missing values in such a file should therefore be represented by blank cells.

Copying a string-valued series

If you make a copy of a string-valued series, as in

  series foo = city

the string values are not copied over: you get a purely numerical series holding the codes of the original series. But if you want a full copy with the string values, that can easily be arranged:

  series citycopy = city
  stringify(citycopy, strvals(city))

String-valued series in other contexts

String-valued series can be used on the right-hand side of assignment statements at will, and in that context their numerical values are taken. For example,

  series y = sqrt(city)

will elicit no complaint and generate a numerical series 1, 1.41421, .... It's up to the user to judge whether this sort of thing makes any sense.

Similarly, it's up to the user to decide if it makes sense to use a string-valued series "as is" in a regression model, whether as regressand or regressor — again, the numerical values of the series are taken. Often this will not make sense, but sometimes it may: the numerical values may by design form an ordinal, or even a cardinal, scale (as in the "grade" example in section 16.2).

More likely, one would want to use dummify() on a string-valued series before using it in statistical modeling. In that context gretl's series labels are suitably informative. For example, suppose we have a series race with numerical values 1, 2 and 3 and associated strings "White", "Black" and "Other". Then the hansl code

  list D = dummify(race)
  labels

will show these labels:

  D_race_2: dummy for race = 'Black'
  D_race_3: dummy for race = 'Other'

Given a series such as race, you can use its string values in a sample restriction, as in

  smpl race == "Black" --restrict

(although race == 2 would also be acceptable).

Accessing string values

We have mentioned above two ways of accessing string values from a given series: via the syntax seriesname[obs], to obtain a single such value; and via the strvals() function, to obtain an array holding all its distinct values. Here we note a third option: direct assignment from a string-valued series to an array of strings, as in

  strings S = sv

where sv is a suitable series. In this case you get an array holding all the sv strings for observations in the current sample range — not just the distinct values, as with strvals.
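To see the difference between the two array-returning methods, a quick sketch (assuming the grade series from section 16.2 is in place):

  strings all = grade                  # one string per observation
  strings distinct = strvals(grade)    # just the five letter grades
  printf "%d observations, %d distinct values\n", nelem(all), nelem(distinct)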
16.4 String-valued series and functions

We first offer a few words on built-in functions that can be applied to string-valued series.

The five functions substr, strsub, regsub, tolower and toupper all perform transformations on strings — respectively, extraction of a substring, replacement of a substring, replacement via regular expression, conversion to all lower case, and conversion to all upper case (see the Gretl Command Reference for details). These functions work on single strings, arrays of strings, and also string-valued series. Note that when applied to a string-valued series these functions may reduce the number of distinct strings attached to the series. For example, some string values that are originally distinct may "collapse" into identity when converted to all lower case. This possibility is handled by adjustment of the integer codes as needed.

A special case is presented by the built-in strvsort() function: this does not return a modified string-valued series, but rather modifies such a series in place. It puts the string values into alphabetical order and recalculates the integer codes so as to preserve the original association between observation number and string. If, for example, the first observation had a string value of "X", coded as 1, it will still have value "X", but its code will reflect the position of "X" in the alphabetized ordering. This can be particularly useful if a dataset comprises several series having the same string values, but occurring in various orders: the effect of running strvsort on such series will be to impose a common numerical encoding.

User-defined hansl functions can also deal with string-valued series. If you supply such a series as an argument to a hansl function, its string values will be accessible within the function. One can test whether a given series arg is string-valued as follows:

  if nelem(strvals(arg)) > 0
      # yes, arg is string-valued
  else
      # no, it is not
  endif

It's also possible, since gretl version 2023c, to put something like the code that generated the grade series in section 16.2 into a function and return the "stringified" series, as in the following (where we assume that x contains percentage scores):

  function series letter_grade (series x)
      series grade
      # define grade based on x and stringify it,
      # as shown above
      return grade
  end function

An alternative means of achieving the same effect — and the only means available prior to gretl 2023c — is to define grade as a series at the level of the caller and pass it in "pointer" form to letter_grade(), as in

  function void letter_grade (series x, series *grade)
      # define grade based on x and stringify it
  end function

  # caller
  series grade
  letter_grade(x, &grade)

As you'll see from the account above, we don't offer any very fancy facilities for string-valued series. We'll read them from suitable sources, and we'll create them natively via stringify — and we'll try to ensure that they retain their integrity — but we don't, for example, take the specification of a string-valued series as a regressor as an implicit request to include the "dummification" of its distinct values.
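Before moving on, here is a small sketch of strvsort in action, using the gc.csv data from section 16.2 (the comment reflects our reading of the description above):

  open gc.csv --quiet
  strvsort(city)
  # the codes are now alphabetical: Athens=1, Berlin=2, Bilbao=3, ...
  print city --byobs --numeric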
Stata variables of string type have no numeric representation: their values are literally strings, and that's all. Stata's numeric variables with value labels do not have to be integer-valued, and their least value does not have to be 1; however, you can't define a label for a value that is not an integer. Thus in Stata you can have a series that comprises both integer and non-integer values, but only the integer values can be labeled.²

² Verified in Stata 12.

This means that on import to gretl we can readily handle variables of string type from Stata's dta files. We give them a 1-based numeric encoding; this is arbitrary, but does not conflict with any information in the dta file. On the other hand, in general we're not able to handle Stata's numeric variables with value labels: currently we report the value labels to the user, but do not attempt to store them in the gretl dataset. We could check such variables and import them as string-valued series if they satisfy the criteria stated in section 16.2, but we don't at present.

SAS and SPSS files

Gretl is able to read and preserve string values associated with variables from SAS "export" (xpt) files, and also from SPSS sav files. Such variables seem to be on the same pattern as Stata variables of string type.

Chapter 17: Matrix manipulation

Together with the other two basic types of data (series and scalars), gretl offers a quite comprehensive array of matrix methods. This chapter illustrates the peculiarities of matrix syntax and discusses briefly some of the more advanced matrix functions. For a full listing of matrix functions, and a comprehensive account of their syntax, please refer to the Gretl Command Reference.

In this chapter we're concerned with real matrices; most of the points made here also apply to complex matrices, but see the following chapter for additional specifics on the complex case.

17.1 Creating matrices

Matrices can be created using any of these methods:

1. By direct specification of the scalar values that compose the matrix — either in numerical form, or by reference to pre-existing scalar variables, or using computed values.

2. By providing a list of data series.

3. By providing a named list of series.

4. Via a suitable expression that references existing matrices and/or scalars, or via some special functions.

To specify a matrix directly in terms of scalars, the syntax is, for example:

  matrix A = {1, 2, 3; 4, 5, 6}

The matrix is defined by rows; the elements on each row are separated by commas, and the rows are separated by semicolons. The whole expression must be wrapped in braces. Spaces within the braces are not significant. The above expression defines a 2 x 3 matrix. Each element should be a numerical value, the name of a scalar variable, or an expression that evaluates to a scalar. Directly after the closing brace you can append a single quote to obtain the transpose.

To specify a matrix in terms of data series, the syntax is, for example:

  matrix A = {x1, x2, x3}

where the names of the variables are separated by commas. Besides names of existing variables, you can use expressions that evaluate to a series. For example, given a series x you could do

  matrix A = {x, x^2}

Each variable occupies a column (and there can only be one variable per column). You cannot use the semicolon as a row separator in this case: if you want the series arranged in rows, append the transpose symbol. The range of data values included in the matrix depends on the current setting of the sample range.
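For instance, using the supplied dataset data4-1 (a minimal sketch):

  open data4-1
  matrix A = {price, sqft}    # one series per column
  matrix B = {price, sqft}'   # transposed: the series occupy rows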
Instead of giving an explicit list of variables, you may instead provide the name of a saved list (see Chapter 15), as in

  list xlist = x1 x2 x3
  matrix A = {xlist}

When you provide a named list, the data series are by default placed in columns, as is natural in an econometric context; if you want them in rows, append the transpose symbol.

As a special case of constructing a matrix from a list of variables, you can say

  matrix A = {dataset}

This builds a matrix using all the series in the current dataset, apart from the constant (variable 0). When this dummy list is used, it must be the sole element in the matrix definition. You can, however, create a matrix that includes the constant along with all other variables, using horizontal concatenation (see below), as in

  matrix A = {const} ~ {dataset}

By default, when you build a matrix from series that include missing values, the data rows that contain NAs are skipped. But you can modify this behavior via the command "set skip_missing off". In that case NAs are converted to NaN ("Not a Number"). In the IEEE floating-point standard, arithmetic operations involving NaN always produce NaN. Alternatively, you can take greater control over the observations (data rows) that are included in the matrix, using the "set" variable matrix_mask, as in

  set matrix_mask msk

where msk is the name of a series. Subsequent commands that form matrices from series or lists will include only observations for which msk has non-zero (and non-missing) values. You can remove this mask via the command "set matrix_mask null".

Names of matrices must satisfy the same requirements as names of gretl variables in general: the name can be no longer than 31 characters, must start with a letter, and must be composed of nothing but letters, numbers and the underscore character.

17.2 Empty matrices

The syntax

  matrix A = {}

creates an empty matrix — a matrix with zero rows and zero columns. The main purpose of the concept of an empty matrix is to enable the user to define a starting point for subsequent concatenation operations. For instance, if X is an already defined matrix of any size, the commands

  matrix A = {}
  matrix B = A ~ X

result in a matrix B identical to X.

From an algebraic point of view, one can make sense of the idea of an empty matrix in terms of vector spaces: if a matrix is an ordered set of vectors, then A = {} is the empty set. As a consequence, operations involving addition and multiplication don't have any clear meaning (arguably, they have none at all), but operations involving the cardinality of this set (that is, the dimension of the space spanned by A) are meaningful.

Legal operations on empty matrices are listed in Table 17.1. (All other matrix operations generate an error when an empty matrix is given as an argument.) In line with the above interpretation, some matrix functions return an empty matrix under certain conditions: the functions diag, vec, vech and unvech, when the argument is an empty matrix; the functions I, ones, zeros, mnormal and muniform, when one or more of the arguments is 0; and the function nullspace, when its argument has full column rank.

  Function      Return value
  A'            A
  rows(A)       0
  cols(A)       0
  rank(A)       0
  det(A)        NA
  ldet(A)       NA
  tr(A)         NA
  onenorm(A)    NA
  infnorm(A)    NA
  rcond(A)      NA

  Table 17.1: Valid functions on an empty matrix, A
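A typical use of the empty matrix is to grow a matrix inside a loop, starting from nothing (a minimal sketch):

  matrix A = {}
  loop i=1..3
      A |= i * ones(1, 2)   # append a row at each pass
  endloop
  print A

After the loop, A is 3 x 2, with rows (1,1), (2,2) and (3,3).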
17.3 Selecting sub-matrices

You can select sub-matrices of a given matrix using the syntax

  A[rows, cols]

where "rows" can take any of these forms:

1. empty: selects all rows;
2. a single integer: selects the single specified row;
3. two integers separated by a colon: selects a range of rows;
4. the name of a matrix: selects the specified rows.

With regard to option 2, the integer value can be given numerically, as the name of an existing scalar variable, or as an expression that evaluates to a scalar. With option 4, the index matrix given in the "rows" field must be either p x 1 or 1 x p, and should contain integer values in the range 1 to n, where n is the number of rows in the matrix from which the selection is to be made. The "cols" specification works in the same way, mutatis mutandis. Here are some examples.

  matrix B = A[1,]
  matrix B = A[2:3,3:5]
  matrix B = A[2,2]
  matrix idx = {1, 2, 6}
  matrix B = A[idx,]

The first example selects row 1 from matrix A; the second selects a 2 x 3 sub-matrix; the third selects a scalar; and the fourth selects rows 1, 2 and 6 from matrix A.

If the matrix in question is n x 1 or 1 x m, it is OK to give just one index specifier and omit the comma. For example, A[2] selects the second element of A if A is a vector. Otherwise the comma is mandatory.

In addition there are some predefined index specifications, represented by the keywords diag, lower, upper, real, imag and end. With the exception of end, these keywords imply specific row and column selections, and therefore cannot be combined with any additional comma-separated term.

- The diag specification selects the principal diagonal of a matrix.
- lower and upper select, respectively, the elements of a matrix below and those above the principal diagonal.
- real and imag are specific to complex matrices, and are described in chapter 18.
- end selects the last element in a given row or column. It can be employed in arithmetical expressions, so for example end-1 accesses the second-last element in a row or column.

You can use sub-matrix selections on either the right-hand side of a matrix-generating formula, or the left. Here is an example of use of a selection on the right, to extract a 2 x 2 sub-matrix B from a 3 x 3 matrix A, then the lower triangle of A:

  matrix A = {1, 2, 3; 4, 5, 6; 7, 8, 9}
  matrix B = A[1:2,2:3]
  matrix C = A[lower]

And here are examples of selection on the left. The second line below writes a 2 x 2 identity matrix into the bottom right corner of the 3 x 3 matrix A. The fourth line replaces the diagonal of A with 1s.

  matrix A = {1, 2, 3; 4, 5, 6; 7, 8, 9}
  matrix A[2:3,2:3] = I(2)
  matrix d = {1, 1, 1}
  matrix A[diag] = d

When the lower and upper selections are used on the right, they yield a vector holding the elements in their scope. The ordering of the elements is column-major in both cases, as illustrated below for the 4 x 4 case (the entries below the diagonal show each element's position in the vector produced by lower, and those above the diagonal show the positions for upper):

  [ d 1 2 4 ]
  [ 1 d 3 5 ]
  [ 2 4 d 6 ]
  [ 3 5 6 d ]

This means that lower and upper do not produce the same result for symmetric matrices bigger than 3 x 3, which may seem unfortunate, but it gives the user a degree of flexibility in respect of the ordering of the elements. Suppose you have a non-symmetric matrix M and you'd like to extract the infra-diagonal elements in row-major order: M'[upper] will do the job.

When lower and upper are used on the left, the replacement must be either (a) a vector of length equal to the number of elements in the selection, or (b) a scalar value. In case (a) the elements of the target matrix are filled in column-major order; in case (b) they are all set using the scalar.

One possible use of these tools is taking, say, a lower triangular matrix and rendering it symmetric by copying the elements from beneath the diagonal to above. The way to get this right (assuming you have a lower triangular matrix L) is

  L[upper] = L'[upper]

(note: not L[upper] = L[lower]).
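A compact check of the symmetrizing recipe just given (toy matrix; cholesky yields a lower triangular factor):

  matrix L = cholesky({4, 2; 2, 3})   # lower triangular
  L[upper] = L'[upper]                # copy the sub-diagonal elements above
  print L                             # now symmetric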
and column 3 from A. Negative indices can also be given in the form of an index vector:

  matrix rdrop = {-1, -3, -5}
  matrix B = A[rdrop,]

In this case B is formed by dropping rows 1, 3 and 5 from A (which must have at least 5 rows), but retaining the column dimension of A.

There are two limitations on the use of negative indices. First, the from:to range syntax described in the previous section is not available, but you can use the seq() function to achieve an equivalent effect, as in

  matrix A = muniform(1, 10)
  matrix B = A[-seq(3,7)]

where B drops columns 3 to 7 from A. Second, use of negative indices is valid only on the right-hand side of a matrix calculation; there is no negative-index equivalent of an assignment to a sub-matrix, as in

  A[1:3,] = ones(3, cols(A))

17.5 Matrix operators

The following binary operators are available for matrices:

  +    addition
  -    subtraction
  *    ordinary matrix multiplication
  '    pre-multiplication by transpose
  \    matrix "left division" (see below)
  /    matrix "right division" (see below)
  ~    column-wise concatenation
  |    row-wise concatenation
  **   Kronecker product
  ==   test for equality
  !=   test for inequality

In addition, the following operators ("dot" operators) apply on an element-by-element basis: .+ .- .* ./ .^ .= .> .< .>= .<= .!=

Here are explanations of the less obvious cases.

For matrix addition and subtraction, in general the two matrices have to be of the same dimensions, but an exception to this rule is granted if one of the operands is a 1 x 1 matrix or scalar. The scalar is implicitly promoted to the status of a matrix of the correct dimensions, all of whose elements are equal to the given scalar value. For example, if A is an m x n matrix and k a scalar, then the commands

  matrix C = A + k
  matrix D = A - k

both produce m x n matrices, with elements c_ij = a_ij + k and d_ij = a_ij - k respectively.

By "pre-multiplication by transpose" we mean, for example, that

  matrix C = X'Y

produces the product of X-transpose and Y. In effect, the expression X'Y is shorthand for X'*Y, which is also valid syntax. In the special case X = Y, however, the two are not exactly equivalent. The former expression uses a specialized algorithm, with two advantages: it is more efficient computationally, and it ensures that the result is free of machine-precision artifacts that may render it numerically non-symmetric. This, however, is unlikely to be an issue unless your X matrix is rather large (at least several hundred rows/columns).

In matrix "left division", the statement

  matrix X = A \ B

is interpreted as a request to find the matrix X that solves AX = B, so A and B must have the same number of rows. If A is a square matrix, this is in principle equivalent to A^(-1) B, which fails if A is singular; the numerical method employed here is the LU decomposition. If A is a T x k matrix with T > k, then X is the least-squares solution, X = (A'A)^(-1) A'B, which fails if A'A is singular; the numerical method is the QR decomposition. Otherwise, the operation fails.

For matrix "right division", as in

  matrix X = A / B

X is the matrix that solves XB = A, so A and B must have the same number of columns. If B is non-singular, this is in principle equivalent to A B^(-1); otherwise X is the least-squares solution.

In "dot" operations, a binary operation is applied element by element; the result of this operation is obvious if the matrices are of the same size. However, there are several other cases where such operators may be applied. For example, if we write

  matrix C = A .* B

then the result C depends on the dimensions of A and B. Let A be an m x n matrix and let B be p x q; the result is as follows:

- Dimensions match (m = p and n = q): c_ij = a_ij * b_ij
- A is a column vector; rows match (m = p; n = 1): c_ij = a_i * b_ij
- B is a column vector; rows match (m = p; q = 1): c_ij = a_ij * b_i
- A is a row vector; columns match (m = 1; n = q): c_ij = a_j * b_ij
- B is a row vector; columns match (p = 1; n = q): c_ij = a_ij * b_j
- A is a column vector, B is a row vector (n = 1; p = 1): c_ij = a_i * b_j
- A is a row vector, B is a column vector (m = 1; q = 1): c_ij = a_j * b_i
- A is a scalar (m = 1 and n = 1): c_ij = a * b_ij
- B is a scalar (p = 1 and q = 1): c_ij = a_ij * b

If none of the above conditions are satisfied, the result is undefined and an error is flagged.
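These "broadcasting" rules come in handy for column-wise transformations. For example, to demean each column of a matrix (a minimal sketch):

  matrix A = mnormal(4, 3)
  matrix m = meanc(A)   # 1x3 row vector of column means
  matrix D = A .- m     # the row vector is recycled across the rows of A
  print D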
Note that this convention makes it unnecessary, in most cases, to use diagonal matrices to perform transformations by means of ordinary matrix multiplication: if Y = XV, where V is diagonal, it is computationally much more convenient to obtain Y via the instruction

  matrix Y = X .* v

where v is a row vector containing the diagonal of V.

In column-wise concatenation of an m x n matrix A and an m x p matrix B, the result is an m x (n+p) matrix. That is,

  matrix C = A ~ B

produces C = [ A  B ].

Row-wise concatenation of an m x n matrix A and a p x n matrix B produces an (m+p) x n matrix. That is,

  matrix C = A | B

produces C = [ A ; B ].

17.6 Matrix–scalar operators

For matrix A and scalar k, the operators shown in Table 17.2 are available. (Addition and subtraction were discussed in section 17.5, but we include them in the table for completeness.) In addition, for square A and scalar x,

  B = A^x

produces a matrix B which is A raised to the power x, but only if either of two conditions is satisfied. First, if x is a non-negative integer, then Golub and Van Loan's "Binary Powering" Algorithm 11.2.2 is used — see Golub and Van Loan (1996) — and A can then be a generic square matrix. Second, if A is positive semidefinite, the power is computed via its eigendecomposition, and x can be a real number, subject to the constraint that x can be negative only if A is invertible.

  Expression          Effect
  matrix B = A * k    b_ij = k * a_ij
  matrix B = A / k    b_ij = a_ij / k
  matrix B = k / A    b_ij = k / a_ij
  matrix B = A + k    b_ij = a_ij + k
  matrix B = A - k    b_ij = a_ij - k
  matrix B = k - A    b_ij = k - a_ij
  matrix B = A % k    b_ij = a_ij modulo k

  Table 17.2: Matrix–scalar operators

17.7 Matrix functions

Most of the functions available for scalars and series also apply to matrices on an element-by-element basis. This is the case for log, exp, sqrt, sin and many others. For example, if a matrix A is already defined, then

  matrix B = sqrt(A)

generates a matrix such that b_ij = sqrt(a_ij). All such functions require a single matrix as argument, or an expression which evaluates to a single matrix.¹

In this section, we review some aspects of functions that apply specifically to matrices. A full account of each function is available in the Gretl Command Reference.

¹ Note that to find the "matrix square root" you need the cholesky function (see below). And since the exp function computes the exponential element by element, it does not return the matrix exponential unless the matrix is diagonal. To get the matrix exponential, use mexp.

  Matrix manipulation:
  bin2dec, cnameset, cols, dec2bin, diag, diagcat, halton, I, lower, mlag,
  mnormal, mrandgen, mreverse, mshape, msortby, msplitby, muniform, ones,
  rnameset, rows, selifc, selifr, seq, trimr, unvech, upper, vec, vech, zeros

  Matrix algebra:
  cholesky, cnumber, commute, conv2d, det, eigen, eigengen, eigensym,
  eigsolve, fft, ffti, ginv, hdprod, infnorm, inv, invpd, ldet, Lsolve,
  mexp, mlog, nullspace, onenorm, psdroot, qform, qrdecomp, rank, rcond,
  svd, toepsolv, tr, transp

  Statistics/transformations:
  aggregate, bkw, corr, cov, ecdf, fcstats, ghk, gini, imaxc, imaxr, iminc,
  iminr, kpsscrit, maxc, maxr, mcorr, mcov, mcovg, meanc, meanr, minc,
  minr, mols, mpols, mrls, mxtab, normtest, npcorr, princomp, prodc, prodr,
  quadtable, quantile, ranking, resample, sdc, sphericorr, sst, sumc, sumr,
  uniq, values

  Numerical methods:
  BFGSmax, BFGScmax, fdjac, fzero, GSSmax, NMmax, NRmax, numhess, simann

  Table 17.3: Matrix functions by category
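To underline the point made in the footnote above — sqrt works element-wise, while cholesky provides a "matrix square root" — here is a quick sketch with a toy matrix:

  matrix A = {2, 1; 1, 2}
  eval sqrt(A)        # element-wise square roots
  eval cholesky(A)    # lower triangular L such that L*L' = A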
Matrix reshaping

In addition to the methods discussed in sections 17.1 and 17.3, a matrix can also be created by re-arranging the elements of a pre-existing matrix. This is accomplished via the mshape() function. It takes three arguments: the input matrix, A, and the rows and columns of the target matrix, r and c respectively. Elements are read from A and written to the target in column-major order. If A contains fewer elements than n = r*c, they are repeated cyclically; if A has more elements, only the first n are used.

For example:

  matrix a = mnormal(2, 3)
  a
  matrix b = mshape(a, 3, 1)
  b
  matrix b = mshape(a, 5, 2)
  b

produces

  ? matrix a = mnormal(2,3)
  Generated matrix a
  ? a
  a

  1.2323    0.99714   0.39078
  0.54363   0.43928   0.48467

  ? matrix b = mshape(a, 3, 1)
  Generated matrix b
  ? b
  b

  1.2323
  0.54363
  0.99714

  ? matrix b = mshape(a, 5, 2)
  Replaced matrix b
  ? b
  b

  1.2323    0.48467
  0.54363   1.2323
  0.99714   0.54363
  0.43928   0.99714
  0.39078   0.43928

Multiple returns and the null keyword

Some functions take one or more matrices as arguments and compute one or more matrices. These are:

  eigensym   Eigen-analysis of symmetric matrix
  eigen      Eigen-analysis of general matrix
  mols       Matrix OLS
  qrdecomp   QR decomposition
  svd        Singular value decomposition (SVD)

The general rule is: the "main" result of the function is always returned as the result proper. Auxiliary returns, if needed, are retrieved using pre-existing matrices, which are passed to the function as pointers (see 14.4). If such values are not needed, the pointer may be substituted with the keyword null.

The syntax for qrdecomp and eigensym is of the form

  matrix B = func(A, &C)

The first argument, A, represents the input data, that is, the matrix whose decomposition or analysis is required. The second argument must be either the name of an existing matrix preceded by & (to indicate the "address" of the matrix in question), in which case an auxiliary result is written to that matrix, or the keyword null, in which case the auxiliary result is not produced. In case a non-null second argument is given, the specified matrix will be over-written with the auxiliary result. (It is not required that the existing matrix be of the right dimensions to receive the result.)

The function eigensym computes the eigenvalues, and optionally the right eigenvectors, of a symmetric n x n matrix. The eigenvalues are returned directly in a column vector of length n; if the eigenvectors are required, they are returned in an n x n matrix. For example:

  matrix V
  matrix E = eigensym(M, &V)
  matrix E = eigensym(M, null)

In the first case E holds the eigenvalues of M and V holds the eigenvectors. In the second, E holds the eigenvalues but the eigenvectors are not computed.

The function eigen computes the eigenvalues, and optionally the right and/or left eigenvectors, of a general n x n matrix.² Following the input matrix argument, there are two slots for matrix addresses: the first to retrieve the right eigenvectors and the second for the left. Calls to this function should therefore conform to one of the following patterns:

  # get the eigenvalues only
  matrix E = eigen(M)

  # get the right eigenvectors as well
  matrix V
  matrix E = eigen(M, &V)

  # get both sets of eigenvectors
  matrix V
  matrix W
  matrix E = eigen(M, &V, &W)

  # get the left eigenvectors but not the right
  matrix W
  matrix E = eigen(M, null, &W)

The eigenvalues are returned directly in a complex n-vector. If the eigenvectors are wanted, they are returned in an n x n complex matrix.

² The "legacy" function eigengen used to be the way to do this, prior to gretl 2019d.
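A quick numerical check of the eigensym conventions (our own toy matrix; the reconstruction relies on the dot-operator broadcasting rules from section 17.5):

  matrix M = {4, 1; 1, 3}
  matrix V
  matrix E = eigensym(M, &V)
  eval M*V - V .* E'   # should be numerically zero, since M V = V diag(E)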
The qrdecomp function computes the QR decomposition of an m × n matrix A: A = QR, where Q is an m × n orthogonal matrix and R is an n × n upper triangular matrix. The matrix Q is returned directly, while R can be retrieved via the second argument. Here are two examples:

  matrix R
  matrix Q = qrdecomp(M, &R)
  matrix Q = qrdecomp(M, null)

In the first example, the triangular R is saved as R; in the second, R is discarded. The first line above shows an example of a simple declaration of a matrix: R is declared to be a matrix variable but is not given any explicit value. In this case the variable is initialized as a 1 × 1 matrix whose single element equals zero.

The syntax for svd is

  matrix B = func(A, &C, &D)

The function svd computes all or part of the singular value decomposition of the real m × n matrix A. Let k = min(m, n). The decomposition is A = UΣV', where U is an m × k orthogonal matrix, Σ is a k × k diagonal matrix, and V is a k × n orthogonal matrix. (This is not the only definition of the SVD: some writers define U as m × m, Σ as m × n with k non-zero diagonal elements, and V as n × n.) The diagonal elements of Σ are the singular values of A; they are real and non-negative, and are returned in descending order. The first k columns of U and V are the left and right singular vectors of A.

The svd function returns the singular values, in a vector of length k. The left and/or right singular vectors may be obtained by supplying non-null values for the second and/or third arguments respectively. For example:

  matrix s = svd(A, &U, &V)
  matrix s = svd(A, null, null)
  matrix s = svd(A, null, &V)

In the first case both sets of singular vectors are obtained; in the second case only the singular values are obtained; and in the third, the right singular vectors are obtained but U is not computed. Please note: when the third argument is non-null, it is actually V' that is provided. To reconstitute the original matrix from its SVD, one can do:

  matrix s = svd(A, &U, &V)
  matrix B = (U .* s) * V

Finally, the syntax for mols is

  matrix B = mols(Y, X, &U)

This function returns the OLS estimates obtained by regressing the T × n matrix Y on the T × k matrix X, that is, a k × n matrix holding (X'X)^{-1} X'Y. The Cholesky decomposition is used. The matrix U, if not null, is used to store the residuals.
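A brief sketch of mols in action, using simulated data (any T × n and T × k matrices with matching row counts would do):

  matrix X = mnormal(50, 3)
  matrix Y = mnormal(50, 2)
  matrix U
  # B is 3 x 2: one column of coefficients per column of Y
  matrix B = mols(Y, X, &U)
  print B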
Reading and writing matrices from/to text files

The two functions mread and mwrite can be used for basic matrix input/output. This can be useful to enable gretl to exchange data with other programs. The mread function accepts one string parameter: the name of the (plain text) file from which the matrix is to be read. The file in question may start with any number of comment lines, defined as lines that start with the hash mark, #; such lines are ignored. Beyond that, the content must conform to the following rules:

1. The first non-comment line must contain two integers, separated by a space or a tab, indicating the number of rows and columns, respectively.
2. The columns must be separated by spaces or tab characters.
3. The decimal separator must be the dot "." character.

Should an error occur (such as the file being badly formatted or inaccessible), an empty matrix (see section 17.2) is returned.

The complementary function mwrite produces text files formatted as described above. The column separator is the tab character, so import into spreadsheets should be straightforward. Usage is illustrated in Listing 17.1. Matrices stored via the mwrite command can be easily read by other programs; the following table summarizes the appropriate commands for reading a matrix A from a file called "a.mat" in some widely-used programs. (Matlab users may find the Octave example helpful, since the two programs are mostly compatible with one another.) Note that the Python example requires that the numpy module is loaded.

  Program   Sample code
  GAUSS     tmp = load a.mat;
            A = reshape(tmp[3:rows(tmp)], tmp[1], tmp[2]);
  Octave    fd = fopen("a.mat");
            [r, c] = fscanf(fd, "%d %d", "C");
            A = reshape(fscanf(fd, "%g", r*c), c, r)';
            fclose(fd);
  Ox        decl A = loadmat("a.mat");
  R         A <- as.matrix(read.table("a.mat", skip=1))
  Python    A = numpy.loadtxt('a.mat', skiprows=1)
  Julia     A = readdlm("a.mat", skipstart=1)

Optionally, the mwrite and mread functions can use gzip compression: this is invoked if the name of the matrix file has the suffix ".gz". In this case the elements of the matrix are written in a single column. Note, however, that compression should not be applied when writing matrices for reading by third-party software, unless you are sure that the software can handle compressed data.

  Listing 17.1: Matrix input/output via text files

  nulldata 64
  scalar n = 3
  string f1 = "a.csv"
  string f2 = "b.csv"
  matrix a = mnormal(n,n)
  matrix b = inv(a)
  err = mwrite(a, f1)
  if err != 0
    printf "Failed to write %s\n", f1
  else
    err = mwrite(b, f2)
  endif
  if err != 0
    printf "Failed to write %s\n", f2
  else
    c = mread(f1)
    d = mread(f2)
    a = c*d
    printf "The following matrix should be an identity matrix\n"
    print a
  endif

17.8 Matrix accessors

In addition to the matrix functions discussed above, various "accessor" strings allow you to create copies of internal matrices associated with models previously estimated. These are set out in Table 17.4.

  $coeff    matrix of estimated coefficients
  $compan   companion matrix (after VAR or VECM estimation)
  $jalpha   matrix α (loadings) from Johansen's procedure
  $jbeta    matrix β (cointegration vectors) from Johansen's procedure
  $jvbeta   covariance matrix for the unrestricted elements of β from Johansen's procedure
  $rho      autoregressive coefficients for error process
  $sigma    residual covariance matrix
  $stderr   matrix of estimated standard errors
  $uhat     matrix of residuals
  $vcv      covariance matrix of parameter estimates
  $vma      VMA matrices in stacked form (see section 32.2)
  $yhat     matrix of fitted values

  Table 17.4: Matrix accessors for model data

Many of the accessors in Table 17.4 behave somewhat differently depending on the sort of model that is referenced, as follows:

- Single-equation models: $sigma gets a scalar (the standard error of the regression); $coeff and $stderr get column vectors; $uhat and $yhat get series.
- System estimators: $sigma gets the cross-equation residual covariance matrix; $uhat and $yhat get matrices with one column per equation. The format of $coeff and $stderr depends on the nature of the system: for VARs and VECMs (where the matrix of regressors is the same for all equations) these return matrices with one column per equation, but for other system estimators they return a big column vector.
- VARs and VECMs: $vcv is not available, but (X'X)^{-1} (where X is the common matrix of regressors) is available as $xtxinv, such that for VARs and VECMs without restrictions on α, a $vcv equivalent can be easily and efficiently constructed as $sigma ** $xtxinv.

If the accessors are given without any prefix, they retrieve results from the last model estimated, if any. Alternatively, they may be prefixed with the name of a saved model plus a period, in which case they retrieve results from the specified model. Here are some examples:

  matrix u = $uhat
  matrix b = m1.$coeff
  matrix v2 = m1.$vcv[1:2,1:2]

The first command grabs the residuals from the last model; the second grabs the coefficient vector from model m1; and the third, which uses the mechanism of sub-matrix selection described above, grabs a portion of the covariance matrix from model m1.

If (and only if) the model in question is a VAR or VECM, $compan and $vma return the companion matrix and the VMA matrices in stacked form, respectively (see section 32.2 for details).
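Staying with VARs: the following sketch (using simulated data, so the variable names and values are illustrative only) shows the $sigma ** $xtxinv construction mentioned above at work:

  nulldata 100
  setobs 1 1 --time-series
  series y1 = normal()
  series y2 = normal()
  var 2 y1 y2
  # Kronecker product: the VAR analogue of $vcv
  matrix V = $sigma ** $xtxinv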
After a vector error correction model is estimated via Johansen's procedure, the matrices $jalpha and $jbeta are also available. These have a number of columns equal to the chosen cointegration rank; therefore, the product

  matrix Pi = $jalpha * $jbeta'

returns the reduced-rank estimate of A(1). Since β is automatically identified via the Phillips normalization (see section 33.5), its unrestricted elements do have a proper covariance matrix, which can be retrieved through the $jvbeta accessor.

17.9 Namespace issues

Matrices share a common namespace with data series and scalar variables. In other words, no two objects of any of these types can have the same name. It is an error to attempt to change the type of an existing variable, for example:

  scalar x = 3
  matrix x = ones(2,2)   # wrong!

It is possible, however, to delete or rename an existing variable, then reuse the name for a variable of a different type:

  scalar x = 3
  delete x
  matrix x = ones(2,2)   # OK

17.10 Creating a data series from a matrix

Section 17.1 above describes how to create a matrix from a data series or set of series. You may sometimes wish to go in the opposite direction, that is, to copy values from a matrix into a regular data series. The syntax for this operation is

  series sname = mspec

where sname is the name of the series to create and mspec is the name of the matrix to copy from, possibly followed by a matrix selection expression. Here are two examples:

  series s = x
  series u1 = U[,1]

It is assumed that x and U are pre-existing matrices. In the second example the series u1 is formed from the first column of the matrix U.

For this operation to work, the matrix (or matrix selection) must be a vector with length equal to either the full length of the current dataset, n, or the length of the current sample range, n'. If n' < n then only n' elements are drawn from the matrix; if the matrix or selection comprises n elements, the n' values starting at element t1 are used, where t1 represents the starting observation of the sample range. Any values in the series that are not assigned from the matrix are set to the missing code.

17.11 Matrices and lists

To facilitate the manipulation of named lists of variables (see Chapter 15), it is possible to convert between matrices and lists. In section 17.1 above we mentioned the facility for creating a matrix from a list of variables, as in

  matrix M = {listname}

That formulation, with the name of the list enclosed in braces, builds a matrix whose columns hold the variables referenced in the list. What we are now describing is a different matter: if we say

  matrix M = listname

(without the braces), we get a row vector whose elements are the ID numbers of the variables in the list. This special case of matrix generation cannot be embedded in a compound expression. The syntax must be as shown above, namely simple assignment of a list to a matrix.

To go in the other direction, you can include a matrix on the right-hand side of an expression that defines a list, as in

  list Xl = M

where M is a matrix. The matrix must be suitable for conversion; that is, it must be a row or column vector containing non-negative integer values, none of which exceeds the highest ID number of a series in the current dataset.

Listing 17.2 illustrates the use of this sort of conversion to "normalize" a list, moving the constant variable (ID 0) to first position.

17.12 Deleting a matrix

To delete a matrix, just write

  delete M

where M is the name of the matrix to be deleted.

17.13 Printing a matrix

To print a matrix, the easiest way is to give the name of the matrix in question on a line by itself, which is equivalent to using the print command:
  matrix M = mnormal(100,2)
  M
  print M

You can get finer control on the formatting of output by using the printf command, as illustrated in the interactive session below:

  ? matrix Id = I(2)
  Generated matrix Id
  ? print Id
  Id (2 x 2)
    1  0
    0  1
  ? printf "%10.3f", Id
       1.000     0.000
       0.000     1.000

  Listing 17.2: Manipulating a list

  function void normalize_list (matrix *x)
    # If the matrix (representing a list) contains var 0,
    # but not in first position, move it to first position
    if x[1] != 0
      scalar k = cols(x)
      loop for (i=2; i<=k; i+=1)
        if x[i] == 0
          x[i] = x[1]
          x[1] = 0
          break
        endif
      endloop
    endif
  end function

  open data9-7
  list Xl = 2 3 0 4
  matrix x = Xl
  normalize_list(&x)
  list Xl = x
  list Xl print

For presentation purposes you may wish to give titles to the columns of a matrix. For this you can use the cnameset function: the first argument is a matrix, and the second is either a named list of variables, whose names will be used as headings, or a string that contains as many space-separated substrings as the matrix has columns. For example,

  matrix M = mnormal(3,3)
  cnameset(M, "foo bar baz")
  print M

  M (3 x 3)
        foo        bar        baz
     1.7102    0.76072  0.0089406
    0.99780     1.9003    0.25123
    0.91762    0.39237     1.6114
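Since the second argument to cnameset may also be a named list, series names can serve directly as column headings. A small sketch (the dataset is the one used in Listing 17.3 below; the list name is arbitrary):

  open data4-1
  list L = const sqft
  matrix M = {L}
  # the columns of M take the names of the series in L
  cnameset(M, L)
  print M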
17.14 Example: OLS using matrices

Listing 17.3 shows how matrix methods can be used to replicate gretl's built-in OLS functionality.

  Listing 17.3: OLS via matrix methods

  open data4-1
  matrix X = {const, sqft}
  matrix y = {price}
  matrix b = invpd(X'X) * X'y
  print "estimated coefficient vector"
  b
  matrix u = y - X*b
  scalar SSR = u'u
  scalar s2 = SSR / (rows(X) - rows(b))
  matrix V = s2 * inv(X'X)
  V
  matrix se = sqrt(diag(V))
  print "estimated standard errors"
  se
  # compare with built-in function
  ols price const sqft --vcv

Chapter 18: Complex matrices

18.1 Introduction

Native support for complex matrices was added to gretl in version 2019d. (Prior to that release, gretl offered improvised support for some complex functionality; see section 18.7 for details.) Not all of hansl's matrix functions accept complex input, but we have enabled a sizable subset of these functions, which should suffice for most econometric purposes.

Complex numbers are not used in most areas of econometrics, but there are a few notable exceptions: among these, complex numbers allow for an elegant treatment of univariate spectral analysis of time series, and become indispensable if you consider multivariate spectral analysis—see for example Shumway and Stoffer (2017). A more recent example is the numerical solution of linear models with rational expectations, which are widely used in modern macroeconomics, for which the complex Schur factorization has become the tool of choice (Klein, 2000).

A first point to note is that complex values are treated as a special case of the hansl matrix type; there's no "complex" type as such. Complex scalars fall under the matrix type as 1 × 1 matrices; the hansl scalar type is only for real values (as is the series type). A 1 × 1 complex matrix should do any work you might require of a complex scalar.

Before we proceed to the details of complex matrices in gretl, here's a brief reminder of the relevant concepts and notation. Complex numbers are pairs of the form a + b·i, where a and b are real numbers and i is defined as the square root of -1: a is the real part and b the imaginary part. One can specify a complex number either via a and b or in "polar" form. The latter pertains to the complex plane, which has the real component on the horizontal axis and the imaginary component on the vertical. The polar representation of a complex number is composed of the length r of the ray from the origin to the point in question and the angle θ subtended between the positive real axis and this ray, measured counter-clockwise in radians. In polar form the complex number z = a + b·i can be written as

  z = |z| (cos θ + i sin θ) = |z| e^{iθ}

where |z| = r = sqrt(a² + b²) and θ = tan⁻¹(b/a). The quantity |z| is known as the modulus of z, and θ as its complex argument (or sometimes phase). The notation z̄ is used for the complex conjugate of z: if z = a + b·i, then z̄ = a - b·i.

18.2 Creating a complex matrix

The standard constructor for complex matrices is the complex function. This takes two arguments, giving the real and imaginary parts respectively, and sticks them together, as in

  C = complex(A, B)

Four cases are supported, as follows:

- A and B are both m × n real matrices: C is an m × n complex matrix such that ckj = akj + bkj·i.
- A and B are both scalars: C is a 1 × 1 complex matrix such that c = a + b·i.
- A is an m × n real matrix and B is a scalar: C is an m × n matrix such that ckj = akj + b·i.
- A is a scalar and B is an m × n real matrix: C is an m × n matrix such that ckj = a + bkj·i.

In addition, complex matrices may naturally arise as the result of certain computations.

With both real and complex matrices in circulation, one may wish to determine whether a particular matrix is complex. The function iscomplex can tell you. Passed an identifier, it returns non-zero if it names a complex matrix, 0 if it names a real matrix, or NA otherwise. The non-zero return value is either 1 or 2, with the following interpretation: 1 indicates that the matrix is "nominally complex"—each element is represented as having a real part and an imaginary part, but all imaginary parts are zero; 2 indicates that at least one element has a non-zero imaginary part. The following code snippet illustrates the point:

  matrix z1 = complex(1,0)
  scalar a = iscomplex(z1)
  matrix z2 = complex(1,1)
  scalar b = iscomplex(z2)
  printf "a = %d, b = %d\n", a, b

The code above gives

  a = 1, b = 2

18.3 Indexation

Indexation of complex matrices works as with real matrices, on the understanding that each element of a complex matrix is a complex pair. So for example C[i,j] gets you the complex pair at row i, column j of C, in the form of a 1 × 1 complex matrix.

If you wish to access just the real or imaginary part of a given element, or range of elements, you can use the functions Re or Im, as in

  scalar rij = Re(C[i,j])

which gets you the real part of cij.

In addition, the "dummy selectors" real and imag can be used to assign to just the real or imaginary component of a complex matrix. Here are two examples:

  # replace the real part of C with random normals
  C[real] = mnormal(rows(C), cols(C))
  # set the imaginary part of C to all zeros
  C[imag] = 0

The replacement must be either a real matrix of the same dimensions as the target, or a scalar.

Further, the real and imag selectors may be combined with regular selectors to access specific portions of a complex matrix, for either reading or writing. Examples:

  # retrieve the real part of a submatrix of C
  matrix R = C[1:2,1:2][real]
  # set the imaginary part of C[3,3] to y
  C[3,3][imag] = y
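A small sketch tying together complex(), indexation and the Re/Im functions (values arbitrary):

  matrix A = {1, 2; 3, 4}
  matrix B = {5, 6; 7, 8}
  matrix C = complex(A, B)
  eval Re(C[2,2])   # gives 4
  eval Im(C[1,2])   # gives 6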
18.4 Operators

Most of the operators available for working with real matrices are also available for complex ones; this includes the "dot" operators which work element-wise or by broadcasting vectors. Moreover, mixed operands are accepted, as in D = C + A where C is complex and A real; the result, D, will be complex. In such cases the real operand is treated as a complex matrix with an all-zero imaginary part.

The operators not defined for complex values are:

- those that include the inequality tests ">" or "<", since complex values as such cannot be compared as greater or lesser (though they can be compared as equal or not equal);
- the (real) modulus operator (percent sign), as in x % y, which gives the remainder on division of x by y.

As for real matrices, the transposition operator ' is available in both unary form, as in B = A', and binary form, as in C = A'B (transpose-multiply). But note that for complex A this means the conjugate transpose, A^H. If you need the non-conjugated transpose you can use transp.

You may wish to note: although none of gretl's explicit regression functions (or commands) accept complex input, you can calculate parameter estimates for a least-squares regression of complex Y (T × 1) on complex X (T × k) via B = X \ Y.

18.5 Functions

To give an idea of what works, and what doesn't work, for complex matrices, we'll walk through the hansl function-space using the categories employed in gretl's online "Function reference" (under the Help menu in the GUI program).

Linear algebra

The functions that accept complex arguments are: cholesky, det, ldet, eigen, eigensym (for Hermitian matrices), fft, ffti, inv, ginv, hdprod, mexp, mlog, qrdecomp, rank, svd, tr, and transp. Note, however, that mexp and mlog require that the input matrix be diagonalizable, and cholesky requires a positive definite Hermitian matrix.

In addition there are the complex-only functions ctrans, which gives the conjugate transpose (the transp function gives the straight, non-conjugated, transpose of a complex matrix), and schur for the Schur factorization.

Matrix building

Given what was said in section 18.2 above, several of the functions in this category should be thought of as applying to the real or imaginary part of a complex matrix (for example, ones and mnormal), and are of course usable in that way. However, some of these functions can be applied to complex matrices as such: namely, diag, diagcat, lower, upper, vec, vech and unvech. Please note: when unvech is applied to a suitable real vector it produces a symmetric matrix, but when applied to a complex vector it produces a Hermitian matrix.

The only functions not available for complex matrices are cnameset and rnameset. That is, you cannot name the columns or rows of such matrices (although this restriction could probably be lifted without great difficulty).

Matrix shaping

The functions that accept complex input are: cols, rows, mreverse, mshape, selifc, selifr and trimr. The functions msortby, sort and dsort are excluded, for the reason mentioned in section 18.4 (complex values cannot be ordered as greater or lesser).

Statistical

Supported for complex input: meanc, meanr, sumc, sumr, prodc and prodr. And that's all.

Mathematical

In the matrix context, these are functions that are applied element by element. For complex input the following are supported: log, exp and sqrt, plus all of the trigonometric functions with the exception of atan2.

In addition there are the complex-only functions cmod (complex modulus, also accessible via abs), carg (complex "argument"), conj (complex conjugate), Re (real part) and Im (imaginary part). Note that carg(z) = atan2(y, x) for z = x + y·i. Listing 18.1 illustrates usage of cmod and carg.

Transformations

In this category only two functions can be applied to complex matrices, namely cum and diff.
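As a quick illustration of element-wise exp on complex input, one can verify Euler's identity numerically:

  # exp(i*pi) should equal -1 (up to machine precision)
  matrix z = complex(0, $pi)
  eval exp(z)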
18.6 File input/output

Complex matrices are stored and retrieved correctly in the XML serialization used for gretl session files (*.gretl). The functions mwrite and mread work in two modes: binary mode, if the filename ends with ".bin", and text mode otherwise. Both modes handle complex matrices correctly if both the writing and the reading are to be done by gretl, but for exchange of data with "foreign" programs text mode will not work for complex matrices as a whole. The options are:

- In text mode, use mwrite and mread on the two parts of a complex matrix separately, and reassemble the matrix in the target program.
- Use binary mode (on the whole matrix), if this is supported for the given foreign program.

At present, binary mode transfer of complex matrices is supported for octave, python and julia. Listing 18.2 shows some examples: we export a complex matrix to each of these programs in turn, calculate its inverse in the foreign program, then verify that the result as imported back into gretl is the same as that calculated in gretl.

18.7 Backward (in)compatibility

Prior to version 2019d, gretl did not provide native support for complex matrices. It did, however, offer an improvised representation of such matrices for certain restricted purposes, taking the form of an "expanded" regular gretl matrix with real values and imaginary parts in odd- and even-numbered columns, respectively. The functions fft, eigengen and polroots returned matrices in this special form, and the functions cmult and cdiv operated on such matrices.

As of version 2022b, fft and polroots have been redefined to work with "proper" complex matrices as described above. The other affected functions are deprecated and will be removed or redefined in a subsequent release. If you have any hansl code using the legacy representation, the following brief porting guide may be helpful.

  Listing 18.1: Variant representations of complex numbers

  We pick 8 points on the unit circle in the complex plane, so their modulus
  is constant and equal to 1. The Polar matrix below shows that the complex
  argument is expressed in radians; multiplying by 180/π gives degrees. The
  chk matrix verifies that we can retrieve the original representation of the
  complex values from the polar form in either of the two ways mentioned at
  the start of the chapter: z = |z|(cos θ + i sin θ), or z = |z| e^{iθ}.

  # complex values in a + b*i form
  scalar rp5 = sqrt(0.5)
  matrix A = {1, rp5, 0, -rp5, -1, -rp5, 0, rp5}'
  matrix B = {0, rp5, 1, rp5, 0, -rp5, -1, -rp5}'
  matrix Z = complex(A, B)
  # calculate modulus and argument
  matrix zmod = cmod(Z)
  matrix theta = carg(Z)
  matrix Polar = zmod ~ theta ~ theta * (180/$pi)
  cnameset(Polar, "modulus radians degrees")
  printf "%12.4f\n", Polar
  # reconstitute the original Z matrix in two ways
  matrix Z1 = zmod .* complex(cos(theta), sin(theta))
  matrix Z2 = zmod .* exp(complex(0, theta))
  matrix chk = Z ~ Z1 ~ Z2
  print chk

  Printing of Polar and chk:

       modulus     radians     degrees
        1.0000      0.0000      0.0000
        1.0000      0.7854     45.0000
        1.0000      1.5708     90.0000
        1.0000      2.3562    135.0000
        1.0000      3.1416    180.0000
        1.0000     -2.3562   -135.0000
        1.0000     -1.5708    -90.0000
        1.0000     -0.7854    -45.0000

    1.00000 + 0.00000i   1.00000 + 0.00000i   1.00000 + 0.00000i
    0.70711 + 0.70711i   0.70711 + 0.70711i   0.70711 + 0.70711i
    0.00000 + 1.00000i   0.00000 + 1.00000i   0.00000 + 1.00000i
   -0.70711 + 0.70711i  -0.70711 + 0.70711i  -0.70711 + 0.70711i
   -1.00000 + 0.00000i  -1.00000 + 0.00000i  -1.00000 + 0.00000i
   -0.70711 - 0.70711i  -0.70711 - 0.70711i  -0.70711 - 0.70711i
    0.00000 - 1.00000i   0.00000 - 1.00000i   0.00000 - 1.00000i
    0.70711 - 0.70711i   0.70711 - 0.70711i   0.70711 - 0.70711i

  Listing 18.2: Exporting and importing complex matrices

  set seed 34756
  matrix C = complex(mnormal(3,3), mnormal(3,3))
  D = inv(C)
  mwrite(C, "C.bin", 1)

  foreign language=octave
    C = gretl_loadmat("C.bin");
    gretl_export(inv(C), "octD.bin");
  end foreign

  octD = mread("octD.bin", 1)
  eval D - octD

  foreign language=python
    import numpy as np
    C = gretl_loadmat('C.bin')
    gretl_export(np.linalg.inv(C), 'pyD.bin')
  end foreign

  pyD = mread("pyD.bin", 1)
  eval D - pyD

  foreign language=julia
    C = gretl_loadmat("C.bin")
    gretl_export(inv(C), "jlD.bin")
  end foreign

  jlD = mread("jlD.bin", 1)
  eval D - jlD

Porting old complex code

cmult and cdiv

These functions performed element-wise multiplication and division of complex column vectors in the old two-column form. The statements

  # old-style element-wise operations
  c1 = cmult(a1, b1)
  d1 = cdiv(a1, b1)

can be updated as
  # new-style element-wise operations
  c2 = a2 .* b2
  d2 = a2 ./ b2

where a2 and b2 are new-style complex vectors (or matrices). The following statements

  c3 = a2 * b2
  d3 = a2 / b2

are also valid, but have different effects: the first performs standard (rather than element-wise) multiplication of matrices (complex or real), and the second performs "right division", equivalent to a2 * inv(b2).

Note that while the return value from cmult and cdiv could be either a real vector or a two-column complex vector, the new-style operations yield a nominally complex result if at least one of the operands is complex, even if the result has an all-zero imaginary part.

A piece of code that appears in some contexts (such as calculation of a periodogram) is as follows: given a complex vector v, compute a vector w holding the squared moduli of the elements of v. The old-style code to accomplish this was

  # legacy: v has two columns
  w = sumr(v.^2)

and the new replacement is

  # current: v has a single complex column
  w = abs(v).^2

where abs gives the complex modulus.

eigengen

Most uses of this legacy function simply retrieve the eigenvalues of a general (that is, not symmetric) matrix, and do not exploit the option of retrieving eigenvectors. In that context it is straightforward to substitute a call to the new function eigen. The only point to note is that eigen returns a new-style complex vector; if you need to convert this to the legacy representation you can use the cswitch function, which is documented in the Gretl Command Reference. In brief, the following code gives you the legacy equivalent of a new-style complex vector v:

  # v is a new-style complex vector
  if v[imag] == 0
    oldv = v[real]
  else
    oldv = cswitch(v, 2)
  endif

polroots

This function now returns a new-style complex vector. As with eigengen, you can use cswitch to convert the vector if necessary.

Chapter 19: Calendar dates

19.1 Introduction

Any software that aims to handle dates and times must have a good built-in calendar. Gretl offers several functions to handle date and time information, which are documented in the Gretl Command Reference. To facilitate their effective use, this chapter lists the various possibilities for storing dates and times, and discusses ways of converting between variant representations. Our main focus in this chapter is dates as such (year, month and day), but we add some discussion of time-of-day where relevant. A final section addresses the somewhat arcane issue of handling "historical" dates, on the Julian calendar.

First of all, it may be useful to distinguish two contexts:

- You have a time-series dataset in place, or a panel dataset with a well-defined time dimension.
- You have no such dataset in place, or perhaps no dataset at all.

While you can work with dates in the second case, in the first case you have extra resources.

You probably know that if you open a dataset that is in fact time series, but gretl has not immediately recognized that fact, you can rectify matters by use of the setobs command, or via the menu item Data/Dataset structure in the gretl GUI. You may also know that with a panel dataset you can impose a definite dating and frequency in its time dimension, if appropriate—again via the setobs command, but with the --panel-time option.

In what follows, we state if a relevant function or accessor requires a time-series dataset or well-defined panel-data time; otherwise, you can assume it does not carry such a requirement.

19.2 Date and time representations

In gretl there is more than one way to encode a date such as "May 26th, 1993". Some are more intuitive, some less obvious from a human viewpoint but easier to handle for an algorithm. The basic representations we discuss here are:
1. the three-numbers approach;
2. date as string;
3. the ISO 8601 standard;
4. the epoch day;
5. Unix time (seconds).

We first explain what these representations are, then explain how to convert between them.

The three-numbers approach

Since a date (without regard to intra-day detail) basically consists of three numbers, it can obviously be encoded in precisely that way. For example, the date May 26th, 1993 can be stored as

  scalar y = 1993
  scalar m = 5
  scalar d = 26

Gretl's multiple-element objects can be used to extend this approach, for example by using a 3-element vector for year, month and day, or a 3-column matrix for storing as many dates as desired. If you wish to store dates as series in your dataset, this approach would lead you to use three series, possibly grouping them into a list, as in

  nulldata 60
  setobs 7 2020-01-01
  series y = $obsmajor
  series m = $obsminor
  series d = $obsmicro
  list DATE = y m d

The example above will generate daily dates for January and February 2020. Note that use of the $obs* accessors requires a time-series dataset, and $obsmicro in particular requires daily data; see Section 19.5 for details.

Some CSV files represent dates in this sort of "broken-down" format, with various conventions on the ordering of the three components.

Date as string

To a human being, this may seem the most natural choice. The string "26/6/1953" is pretty much unambiguous. But using such a format for machine processing can be problematic, due to differing conventions regarding the separators between day, month and year, as well as the order in which the three pieces of information are arranged. For example, "2/6/1953" is not unambiguous: it will naturally be read differently by Europeans and Americans. This can be a problem with CSV files found "in the wild", containing arbitrarily formatted dates. Therefore, gretl provides fairly comprehensive functionality for converting dates of this sort into more manageable formats.

The ISO 8601 standard

Among other things, the ISO 8601 standard provides two representations for a daily date: the "basic" representation, which uses an 8-digit integer, and the "extended" representation, which uses a 10-character string.

In the basic version the first four digits represent the year, the middle two the month and the rightmost two the day, so that, for example, 20170219 indicates February 19th, 2017. The extended representation is similar, except that the date is a string in which the items are separated by hyphens, so the same date would be represented as "2017-02-19".

In several contexts ISO 8601 dates are privileged by gretl: the ISO format is taken as the default, and you need to supply an additional function argument, or take extra steps, if the representation is non-standard. Using series and/or matrices to store ISO 8601 basic dates is perfectly straightforward.

Epoch days

In gretl, an "epoch day" is an unsigned 32-bit integer which represents a date as the number of days since January 1, 1 AD (that is, the first day of the Common Era), on the proleptic Gregorian calendar. (The term "proleptic", as applied to a calendar, indicates that it is extrapolated backwards or forwards relative to its period of actual historical use.) For example, 1993-05-26 corresponds to 727709. This representation derives from the astronomers' "Julian day", which is also a count of days since a benchmark, namely January 1, 4713 BC, at which time certain astronomical cycles were aligned.

This is the convention used by the GLib library, on which gretl depends for much of its calendrical calculation. Since an epoch day is an unsigned integer, neither GLib nor gretl supports dates "BC", or prior to the Common Era.
This representation has several advantages. Like ISO 8601 basic, it lends itself naturally to storing dates as series. Compared to ISO 8601 it has the disadvantage of not being readily understandable by humans, but to compensate for that it makes it very easy to determine the length of a range of dates. ISO basic dates can be used for comparison (which of two dates on a given calendar refers to a later day?), but with epoch days one can carry out fully-fledged "dates arithmetic". Epoch days are always consecutive by construction, but 8-digit basic dates are consecutive only within a given month. (In fact, they advance by 101 minus the number of days in the previous month at the start of each month other than January, and by 8870 at the start of each year.) For more on arithmetic with epoch days, see Section 19.4.

Unix seconds

In this representation—the cornerstone of date and time handling on Unix-like systems—time is the number of seconds since midnight at the start of 1970, according to Coordinated Universal Time (UTC). (UTC is, to a first approximation, the time such that the Sun is at its highest point at noon over the "prime meridian", the line of 0° longitude, which as a matter of historical contingency runs through Greenwich, England.) This format is therefore ideal for storing fine-grained information, including time of day as well as date.

This representation is not transparent to humans—for example, the number 123456789 corresponds to the start of Thursday, 29 Nov 1973—but again, it lends itself naturally to calculation. Since Unix seconds are "hard-wired" to UTC, a given value will correspond to different times, and possibly different dates, if evaluated in different time zones; we expand on this point below.

19.3 Converting between representations

To support conversion between different representations, gretl provides several dedicated functions, although in some cases conversion can be carried out by using general-purpose functions. Figure 19.1 displays a summary: solid lines represent dedicated functions, while dashed lines indicate that no special function is needed. Numerical formats are depicted as boxes, and string formats as ovals. For a full description of the functions referenced in the figure, see the Gretl Command Reference.

  [Figure 19.1: Conversions between different date formats]

In the rest of this section we discuss several cases of conversion, with the help of examples.

Strings and three-number dates

As indicated in Figure 19.1, converting between date strings and the three-number representation does not require date-specific functions. The two generic functions that can be used for this purpose are printf and sscanf. Here's how. Suppose you encode a date via the three scalars d = 30, m = 10 and y = 1983. You can use printf to turn it into a date string rather easily, as in

  eus = printf("%d/%d/%d", d, m, y)
  uss = printf("%d/%d/%d", m, d, y)

where the two strings follow the European and US conventions, respectively.

The reverse operation, using the sscanf function, is a little trickier (see the Gretl Command Reference for a full illustration). The string s = "1983/10/30" can be broken down into three scalars as

  scalar d, m, y
  n = sscanf(s, "%d/%d/%d", y, m, d)

Note that in this case %d in the format specification does not mean "day" but rather "decimal integer", which is why there are three instances.
Alternatively, one could have used a 3-element vector, as in

  matrix date = zeros(1,3)
  n = sscanf(s, "%d/%d/%d", date[1], date[2], date[3])

Decomposing a series of "basic" dates

To generate, from a series of dates in ISO 8601 basic format, distinct series holding year, month and day, the function isoconv can be used. This function should be passed the original series followed by "pointers to" the series to be filled out. For example, if we have a series named dates in the prescribed format, we might do

  series y, m, d
  isoconv(dates, &y, &m, &d)

This is mostly just a convenience function: provided the dates input is valid on the (possibly proleptic) Gregorian calendar, it is equivalent to

  series y = floor(dates/10000)
  series m = floor((dates - 10000*y)/100)
  series d = dates - 10000*y - 100*m

However, there is some value added: isoconv checks the validity of the dates input. If the implied year, month and day for any dates observation do not correspond to a valid date, then all the derived series will have value NA at that observation.

The inverse operation is trivial:

  series dates = 10000*y + 100*m + d

The use of series here means that such operations require that a dataset is in place, but although they would most naturally occur in the context of a time-series dataset, they could equally well occur with an undated dataset, since an ISO 8601 basic value is just a numeric value with some restrictions, and such values do not have to appear in chronological order.

String/numeric conversions: dedicated functions

The primary means of converting between string and scalar numeric representations of dates and times is provided by two pairs of functions: strptime/strftime and strpday/strfday. The first of each pair takes string input and outputs a numeric value, and the second performs the inverse operation, as shown in Table 19.1. With the first pair the numeric value is Unix seconds; with the second it's an epoch day. Numeric values are always relative to UTC, and string values are by default (at least) always relative to local time.

  function   input                       output
  strptime   date/time string, format    Unix seconds
  strftime   Unix seconds, format        date/time string
  strpday    date string, format         epoch day
  strfday    epoch day, format           date string

  Table 19.1: String <-> numeric datetime conversions

Before moving on, let's be clear on what we mean by "local time". Generically, this is time according to the local time zone, with or without a "Daylight saving" or "Summer" adjustment depending on the time of year. In a computing context we have to be more specific: the local time zone is whatever is set as such (via the operating system, and possibly adjusted via an environment variable) on the host computer. It will usually be the same as the geographically local zone, but there's nothing to stop a user making a different setting.
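As a small illustration of the first pair, the following round trip (with illustrative values) parses a date/time string and prints it back in a different format; since both calls refer to local time, the wall-clock reading is preserved wherever it is run:

  scalar ts = strptime("2017-02-19 14:30", "%Y-%m-%d %H:%M")
  string s = strftime(ts, "%d %b %Y, %H:%M")
  print s   # "19 Feb 2017, 14:30"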
Dates as string-valued series

It often happens that CSV files contain date information stored as strings. Take, for example, a file containing earthquake data like the following (see https://www.kaggle.com/datasets/usgs/earthquake-database for the dataset of which this is an extract):

  Date        Time      Latitude  Longitude  Magnitude
  01/02/1965  13:44:18    19.246    145.616        6.0
  01/04/1965  11:29:49     1.863    127.352        5.8
  01/05/1965  18:05:58   -20.579   -173.972        6.2
  01/08/1965  18:49:43   -59.076    -23.557        5.8
  01/09/1965  13:32:50    11.938    126.427        5.8
  01/10/1965  13:36:32   -13.405    166.629        6.7
  01/12/1965  13:32:25    27.357     87.867        5.9
  01/15/1965  23:17:42   -13.309    166.212        6.0

Suppose we want to convert the Date column to epoch days. Note that the date format follows the American convention: month/day/year. The simplest way to accomplish the task is shown in Listing 19.1, where we assume that the data file is named earthquakes.csv. Note that the --all-cols option is wanted here, so that gretl treats Date as a string-valued series rather than just a source of time-series information. For good measure, we show how to add an ISO 8601 date series.

  Listing 19.1: Converting a string-valued date series to epoch days

  open earthquakes.csv --all-cols
  series eday = strpday(Date, "%m/%d/%Y")
  series isodates = strfday(eday, "%Y-%m-%d")
  print Date eday isodates -o

  Output:

           Date    eday    isodates
  1  01/02/1965  717338  1965-01-02
  2  01/04/1965  717340  1965-01-04
  3  01/05/1965  717341  1965-01-05
  4  01/08/1965  717344  1965-01-08
  5  01/09/1965  717345  1965-01-09
  6  01/10/1965  717346  1965-01-10
  7  01/12/1965  717348  1965-01-12
  8  01/15/1965  717351  1965-01-15

Alternatively, one might like to convert the Date and Time columns jointly, to Unix seconds. This can be done by sticking the two strings together at each observation and calling strptime with a suitable format, as follows:

  series usecs   # Unix seconds
  loop i=1..$nobs
    usecs[i] = strptime(Date[i] ~ " " ~ Time[i], "%m/%d/%Y %H:%M:%S")
  endloop

Unix seconds and time zones

At 8:46 in the morning of September 11, 2001, an airliner crashed into the North Tower of the World Trade Center in New York. Relative to what time zone is that statement correct? Eastern Daylight Time (EDT), of course. Unless we have special reason to do otherwise, we report the time of an event relative to the time zone in which it occurred—and if we do otherwise, we need to state the metric we're using; for example, one might say that this event occurred at 2001-09-11 12:46 UTC. Now consider the following script:

  date = "2001-09-11 08:46"
  format = "%Y-%m-%d %H:%M"
  usecs = strptime(date, format)
  check = strftime(usecs, format)
  printf "Unix time %d\n", usecs
  printf "original %s\n", date
  printf "recovered %s\n", check

Run this script in any time zone you like, and the last line of output will read: recovered 2001-09-11 08:46. The usecs value will differ by time zone—for example, it'll be 1000212360 under Eastern Daylight Time but 1000194360 under Central European Time—but this difference cancels out in recovering the original time via strftime.

So far, so good. But suppose I write a script in which I store the date as Unix seconds, with my laptop's clock set to EDT:

  usecs = 1000212360
  date = strftime(usecs, "%Y-%m-%d %H:%M")
  print date

Running this script under EDT will again print out 2001-09-11 08:46, but if I take my laptop to Italy in June, set its clock to the local time and re-run the script, I'll get 2001-09-11 14:46. Is that a problem? Well, 14:46 is indeed the time in Italy when it's 08:46 in New York (with both zones in their Summer variants); it's a problem only if you want to preserve the locality of the original time. To do that, you need to give time-zone information to both strptime and strftime. This is illustrated in Listing 19.2.

  Listing 19.2: Date/time invariance with respect to current time zone

  string date = "2001-09-11 08:46 -0400"
  string format = "%Y-%m-%d %H:%M %z"
  usecs = strptime(date, format)
  printf "Unix time %d\n", usecs

In the code above we specify the time zone in date using "-0400", meaning 4 hours behind UTC, which is correct when Daylight Saving time is in force in the Eastern US. And we match this with the %z specifier in format. As a result, regardless of the time zone in which the code is run, the Unix time value will be 1000212360. Then we come to unpacking that value:

  date = strftime(1000212360, "%Y-%m-%d %H:%M %z", -4*3600)
  print date

Here we use the third, optional argument to strftime to supply the offset in seconds of EDT relative to UTC. Having told strptime the time zone, why do we need this? Well, remember that Unix time is just a scalar value, always relative to UTC; it cannot store time-zone information. Anyway, the result is that this code will print
  2001-09-11 08:46 -0400

regardless of where and when it is executed.

Some additional comments are in order. First, spaces matter in parsing the strptime arguments: they must match between the date and format strings. In the example above we inserted spaces before "-0400" and "%z". We could have omitted both spaces, but not just one of them. Second, the C standard does not require that strptime and strftime know anything about time zones; the extensions used in this example are supported by GLib functionality.

19.4 Epoch day arithmetic

Given the way epoch days are defined, they provide a useful tool for checking whether daily data are complete. Suppose we have what purport to be 7-day daily data, with a starting date of 2015-01-01 and an ending date of 2016-12-31. How many observations should there be?

  ed1 = epochday(2015,1,1)
  ed2 = epochday(2016,12,31)
  n = ed2 - ed1 + 1

We find that there should be n = 731 observations; if there are fewer, there's something missing. If the data are supposed to be on a 5-day week (skipping Saturday and Sunday) or a 6-day week (skipping Sunday alone), the calculation is more complicated; in this case we can use the dayspan function, providing as arguments the epoch-day values for the first and last dates and the number of days per week:

  ed1 = epochday(2015,1,1)
  ed2 = epochday(2016,12,30)
  n = dayspan(ed1, ed2, 5)

We discover that there were n = 522 weekdays in this period.

The dayspan function can also be helpful if you wish to construct a suitably sized "empty" daily dataset prior to importing data from a third-party database—for example, stock prices from Yahoo. Say the data to be imported are on a 5-day week, and you want the range to be from 2000-01-03 (the first weekday in 2000) to 2020-12-30 (a Wednesday). Here's how one could initialize a suitable "host" dataset:

  ed1 = epochday(2000,1,3)
  ed2 = epochday(2020,12,30)
  n = dayspan(ed1, ed2, 5)
  nulldata n
  setobs 5 2000-01-03

Another use of arithmetic using epoch days is constructing a sequence of dates of non-standard frequency. Suppose you want a bi-weekly series, including alternate Saturdays in 2023. Here's a solution:

  nulldata 26
  setobs 1 1 --special-time-series
  series eday
  eday[1] = epochday(2023,1,7)   # the first Saturday
  loop i=2..$nobs
    eday[i] = eday[i-1] + 14
  endloop
  series dates = strfday(eday, "%Y-%m-%d")
19.5 Other accessors and functions

Accessors

Gretl offers various accessors for generating dates. One is $now, which returns the current date/time as a 2-element vector: the first element is Unix seconds and the second an epoch day (see Section 19.2). This is always available, regardless of the presence or absence of a dataset.

When a time-series dataset is open, up to four accessors are available to retrieve observation dates as numeric series. First there is $obsdate, which returns ISO 8601 basic dates. If the frequency is annual, quarterly or monthly, these dates represent the first day of the period in question; if the frequency is hourly, this accessor is not available. Then there's a set of up to three accessors: $obsmajor, $obsminor and $obsmicro. The availability and interpretation of these values depends on the character of the dataset, as shown in Table 19.2. For reference, the "constructor" column shows the argument that should be supplied to the setobs command to impose each frequency on a dataset, assuming it starts on January 1, 1990.

  frequency  description  constructor     $obsmajor  $obsminor  $obsmicro
  1          annual       1 1990          year
  4          quarterly    4 1990:1        year       quarter
  12         monthly      12 1990:01      year       month
  5, 6, 7    daily        n 1990-01-01    year       month      day
  52         weekly       52 1990-01-01   year       month      day
  24         hourly       24 726468:01    day        hour

  Table 19.2: Calendrical frequencies and accessors

The hourly frequency is not fully supported by gretl's calendrical apparatus. But an epoch day value can be used to set the starting day for an hourly time series, as exemplified in Table 19.2 (726468 for 1990-01-01). One could then construct a string-valued hourly date/time series in this way:

  series day = strptime(isodate($obsmajor, 1), "%Y-%m-%d")
  series usecs = day + 3600 * ($obsminor - 1)   # Unix seconds
  series tstrs = strftime(usecs, "%Y-%m-%d %H:%M")

When a panel dataset is open and its time dimension is specified (see Section 19.1 and the documentation for the setobs command), $obsdate works as described for time-series datasets. But $obsmajor and $obsminor do not refer to the time dimension: rather, they give the 1-based indices of the individuals and time periods, respectively. And $obsmicro is not available.

Miscellaneous functions

Besides conversion, several other calendrical functions are available:

- monthlen: given month and year, returns the length of the month in days, optionally ignoring weekends;
- weekday: given a date as year, month and day (or ISO 8601 basic), returns a number from 0 (Sunday) to 6 (Saturday) corresponding to the day of the week;
- juldate: given an epoch day, returns the corresponding date on the Julian calendar (see Section 19.6 below);
- dayspan: given two epoch days, calculates their distance, optionally taking weekends into account;
- easterday: given the year, returns the date of Easter on the Gregorian calendar;
- isoweek: given a date as year, month and day, returns the progressive number of the week within that year, as per the ISO 8601 specification.
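Quick illustrations of the first two functions (argument order as in the descriptions above):

  eval weekday(2017,2,18)   # 6: February 18, 2017 was a Saturday
  eval monthlen(2,2020,7)   # 29: February 2020, counting all 7 days per week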
to define a given day This is accomplished by a count of days following some definite temporal benchmark As described in Section 192 gretl uses days since the start of 1 AD which we call epoch days In this section we address the problem of constructing within gretl a calendar which agrees with the actual historical calendar prior to the switch to Gregorian dating Most people will have no use for this but researchers working with archival data may find it helpful it would be tricky and errorprone to enter on the Gregorian calendar data whose dates are given on the Julian at source In order to represent Julian dates Gretl uses two basic tools one is the juldate function which converts a Gregorian epoch day into an ISO8601like integer and the convention that for some functions a negative value where a year is expected acts as a Julian calendar flag So for example the following code fragment edg epochday170011 edj epochday170011 produces edg 620548 and edj 620558 indicating that the two calendars differed by 10 days at the point in time known as January 1 1700 on the proleptic Gregorian calendar Taken together with the isodate and juldate functions which each take an epoch day argument and return an ISO 8601 basic date on respectively the Gregorian and Julian calendars epochday can be used to convert between the two calendars For example what was the date in England still on the Julian calendar on the day known to Italians as June 26 1740 Italy having been on the Gregorian calendar since October 1582 ed epochday1740626 englishdate juldateed printf d englishdate We find that the English date was 17400615 the 15th of June Working in the other direction what Italian date corresponded to the 5th of November 1740 in England ed epochday1740115 italiandate isodateed printf d italiandate Answer 17401116 Guy Fawkes night in 1740 occurred on November 16 from the Italian point of view Well now consider the trickiest case namely a calendar which includes the day on which the Julian to Gregorian switch occurred If we can handle this it should be relatively simple to handle a purely Julian calendar Our illustration will be England in 1752 a similar analysis could be done for Spain in 1582 or Greece in 1923 A solution is presented in Listing 193 The first step is to find the epoch day corresponding to the Julian date 17520101 which turns out to be 639551 Then we can create a series of epoch days from which we get both Julian and Gregorian dates for 355 days starting on epoch day 639551 Note 355 days because this was a short year it was a leap year but 11 days were skipped in September in making the transition to the Gregorian calendar We can then construct a series hcal which switches calendar at the right historical point Notice that although the series hcal contains the correct historical calendar in basic form the observation labels in the first column of the output are still just index numbers It may be prefer able to have historical dates in that role To achieve this we can decompose the hcal series into Chapter 19 Calendar dates 174 Listing 193 Historical calendar for Britain in 1752 Download 1752 was a short year on the British calendar nulldata 355 give a negative year to indicate Julian date ed0 epochday175211 consistent series of epoch day values series ed ed0 index 1 Julian dates as YYYYMMDD series jdate juldateed Gregorian dates as YYYYMMDD series gdate isodateed Historical cutover in September series hcal ed epochday175292 gdate jdate And lets take a look print ed jdate gdate hcal o Partial 
  Partial output:

             ed      jdate      gdate       hcal
    1    639551   17520101   17520112   17520101
    2    639552   17520102   17520113   17520102
  ...
  245    639795   17520901   17520912   17520901
  246    639796   17520902   17520913   17520902
  247    639797   17520903   17520914   17520914
  248    639798   17520904   17520915   17520915
  ...
  355    639905   17521220   17521231   17521231

  Listing 19.4: Continuation of Britain 1752 example

  Additional input:

  series y, m, d
  isoconv(hcal, &y, &m, &d)
  genr markers = "%04d%02d%02d", y, m, d
  print ed jdate gdate hcal -o

  Partial output:

                  ed      jdate      gdate       hcal
  17520101    639551   17520101   17520112   17520101
  17520102    639552   17520102   17520113   17520102
  ...
  17520901    639795   17520901   17520912   17520901
  17520902    639796   17520902   17520913   17520902
  17520914    639797   17520903   17520914   17520914
  17520915    639798   17520904   17520915   17520915
  ...
  17521231    639905   17521220   17521231   17521231

Year numbering

A further complication in dealing with archival data is that the year number has not always been advanced on January 1; for example, in Britain prior to 1752, March 25 was taken as the start of the new year. On gretl's calendar (whether Julian or Gregorian) the year number always advances on January 1, but it's possible to construct observation markers following the old scheme. This is illustrated for the year 1751 (as we would now call it) in Listing 19.5.

Day of week and length of month

Two of the functions described in Section 19.5 that by default operate on the Gregorian calendar can be induced to work on the Julian by the trick mentioned above, namely giving the negative of the year. These are weekday (which takes arguments year, month and day) and monthlen (which takes arguments month, year and days per week). Thus, for example,

  eval weekday(-1700,2,29)

gives 4, indicating that Julian February 29, 1700 was a Thursday. And

  eval monthlen(2,-1900,5)

gives 21, indicating that there were 21 weekdays in Julian February 1900.

  Listing 19.5: Historical calendar for England in 1751

  Input:

  nulldata 365   # a common year
  ed0 = epochday(-1751,1,1)
  ed1 = epochday(-1751,3,25)
  series ed = ed0 + index - 1
  series jdate = juldate(ed)
  series y, m, d
  isoconv(jdate, &y, &m, &d)
  y = ed < ed1 ? y-1 : y
  genr markers = "%04d%02d%02d", y, m, d
  print index -o

  Partial output:

  17500101    1
  17500102    2
  17500103    3
  ...
  17500323   82
  17500324   83
  17510325   84
  17510326   85
  ...
  17511231  365

Chapter 20: Handling mixed-frequency data

20.1 Basics

In some cases one may want to handle data that are observed at different frequencies, a facility known as "MIDAS" (Mixed Data Sampling). A common pairing includes GDP, usually available quarterly, and industrial production, often available monthly. The most common context when this feature is required is specification and estimation of MIDAS models (see Chapter 41), but other cases are possible.

A gretl dataset formally handles only a single data frequency, but we have adopted a straightforward means of representing nested frequencies: a higher-frequency series xH is represented by a set of m series, each holding the value of xH in a sub-period of the "base" (lower-frequency) period, where m is the ratio of the higher frequency to the lower.

This is most easily understood by means of an example. Suppose our base frequency is quarterly and we wish to include a monthly series in the analysis. Then a relevant fragment of the gretl dataset might look as shown in Table 20.1. Here, gdpc96 is a quarterly series while indpro is monthly, so m = 12/4 = 3 and the per-month values of indpro are identified by the suffix _mn, n = 3, 2, 1.
            gdpc96  indpro_m3  indpro_m2  indpro_m1
  1947:1   1934.47    14.3650    14.2811    14.1973
  1947:2   1932.28    14.3091    14.3091    14.2532
  1947:3   1930.31    14.4209    14.3091    14.2253
  1947:4   1960.70    14.8121    14.7562    14.5606
  1948:1   1989.54    14.7563    14.9240    14.8960
  1948:2   2021.85    15.2313    15.0357    14.7842

  Table 20.1: A slice of MIDAS data

To recover the actual monthly time series for indpro, one must read the three relevant series right-to-left by row. At first glance this may seem perverse, but in fact it is the most convenient setup for MIDAS analysis. In such models, the high-frequency variables are represented by lists of lags, and of course in econometrics it is standard to give the most recent lag first (x_{t-1}, x_{t-2}, ...).

One can construct such a dataset manually from raw sources using hansl's matrix-handling methods or the join command (see Section 20.6 for illustrations), but we have added native support for the common cases shown below:

  base frequency   higher frequency
  annual           quarterly or monthly
  quarterly        monthly or daily
  monthly          daily

The examples below mostly pertain to the case of quarterly plus monthly data; Section 20.6 has details on the handling of daily data.

A mixed-frequency dataset can be created in either of two ways: by selective importation of series from a database, or by creating two datasets of different frequencies then merging them.

Importation from a database

Here's a simple example, in which we draw from the fedstl (St Louis Fed) database, which is supplied in the gretl distribution:

  clear
  open fedstl.bin
  data gdpc96
  data indpro --compact=spread
  store gdp_indpro.gdt

Since gdpc96 is a quarterly series, its importation via the data command establishes a quarterly dataset. Then the MIDAS work is done by the option --compact=spread for the second invocation of data. This "spreads" the series indpro—which is monthly at source—into three quarterly series, exactly as shown in Table 20.1.

Merging two datasets

In this case we consider an Excel file provided by Eric Ghysels in his MIDAS Matlab Toolbox (see http://eghysels.web.unc.edu/ for links), namely mydata.xlsx. This contains quarterly real GDP in "Sheet1" and monthly non-farm payroll employment in "Sheet2". A hansl script to build a MIDAS-style file named gdp_payroll_midas.gdt is shown in Listing 20.1.

Note that both series are simply named VALUE in the source file, so we use gretl's rename command to set distinct and meaningful names. The heavy lifting is done here by the line

  dataset compact 4 spread

which tells gretl to compact an entire dataset (in this case, as it happens, just containing one series) to quarterly frequency, using the "spread" method. Once this is done, it is straightforward to append the compacted data to the quarterly GDP dataset.

  Listing 20.1: Building a gretl MIDAS dataset via merger

  # sheet 2 contains monthly employment data
  open MIDASv2.2/mydata.xlsx --sheet=2
  rename VALUE payems
  dataset compact 4 spread
  # limit to the sample range of the GDP data
  smpl 1947:1 2011:2
  setinfo payems_m3 --description="Non-farm payroll employment, month 3 of quarter"
  setinfo payems_m2 --description="Non-farm payroll employment, month 2 of quarter"
  setinfo payems_m1 --description="Non-farm payroll employment, month 1 of quarter"
  store payroll_midas.gdt
  # sheet 1 contains quarterly GDP data
  open MIDASv2.2/mydata.xlsx --sheet=1
  rename VALUE qgdp
  setinfo qgdp --description="Real quarterly US GDP"
  append payroll_midas.gdt
  store gdp_payroll_midas.gdt

We will put an extended version of this dataset, supplied with gretl and named gdp_midas.gdt, to use in subsequent sections.
20.2 The notion of a MIDAS list

In the following two sections we will describe functions that do the right thing rather easily, if you wish to create lists of lags or first differences of high-frequency series. However, we should first be clear about the correct domain for such functions, since they could produce the most diabolical mash-up of your data if applied to the wrong sort of list argument, for instance a regular list containing distinct series, all observed at the base frequency of the dataset.

So let us define a MIDAS list: this is a list of m series holding per-period values of a single high-frequency series, arranged in the order of most recent first, as illustrated above. Given the dataset shown in Table 20.1, an example of a correctly formulated MIDAS list would be

  list INDPRO = indpro_m3 indpro_m2 indpro_m1

Or, since the monthly observations are already in the required order, we could define the list by means of a wildcard:

  list INDPRO = indpro_m*

Having created such a list, one can use the setinfo command to tell gretl that it is a bona fide MIDAS list:

  setinfo INDPRO --midas

This will spare you some warnings that gretl would otherwise emit when you call some of the functions described below. This step should not be necessary, however, if the series in question are the product of a compact operation with the spread parameter.

Inspecting high-frequency data

The layout of high-frequency data shown in Table 20.1 is convenient for running regressions, but not very convenient for inspecting and checking such data. We therefore provide some methods for displaying MIDAS data at their natural frequency. Figure 20.1 shows the gretl main window, with the gdp_midas dataset loaded, along with the menu that pops up if you right-click with the payems series highlighted. The items "Display values" and "Time series plot" show the data on their original monthly calendar, while the "Display components" item shows the three component series on a quarterly calendar, as in Table 20.1.

These methods are also available via the command line. For example, the commands

  list PAYEMS = payems*
  print PAYEMS --byobs --midas
  hfplot PAYEMS --with-lines --output=display

produce a monthly printout of the payroll employment data, followed by a monthly time-series plot. (See section 20.5 for more on hfplot.)

[Figure 20.1: MIDAS data menu]
20.3 High-frequency lag lists

A basic requirement of MIDAS is the creation of lists of high-frequency lags for use on the right-hand side of a regression specification. This is possible, but not very convenient, using gretl's lags function; it is made easier by a dedicated variant of that function, described below.

For illustration we will consider an example presented in Ghysels' Matlab implementation of MIDAS: this uses 9 monthly lags of payroll employment, starting at lag 3, in a model for quarterly GDP. The estimation period for this model starts in 1985Q1. At this observation, the stipulation that we start at lag 3 means that the first (most recent) lag is employment for October 1984,² and the 9-lag window means that we need to include monthly lags back to February 1984. Let the per-month employment series be called x_m3, x_m2 and x_m1, and let quarterly lags be represented by (-1), (-2) and so on. Then the terms we want are (reading left-to-right by row):

  x_m1(-1)
  x_m3(-2)  x_m2(-2)  x_m1(-2)
  x_m3(-3)  x_m2(-3)  x_m1(-3)
  x_m3(-4)  x_m2(-4)

We could construct such a list in gretl using the following standard syntax. (Note that the third argument of 1 to lags below tells gretl that we want the terms ordered by lag rather than by variable; this is required to respect the order of the terms shown above.)

  list X = x_m*
  # create lags for 4 quarters
  list XL = lags(4, X, 1)
  # convert the list to a matrix
  matrix tmp = XL
  # trim off the first two elements and the last
  tmp = tmp[3:11]
  # and convert back to a list
  list XL = tmp

However, the following specialized syntax is more convenient:

  list X = x_m*
  setinfo X --midas
  # create high-frequency lags 3 to 11
  list XL = hflags(3, 11, X)

In the case of hflags the length of the list given as the third argument defines the compaction ratio (m = 3 in this example); we can (in fact, must) specify the lags we want in high-frequency terms; and ordering of the generated series by lag is automatic.

Word to the wise: do not use hflags on anything other than a MIDAS list as defined in section 20.2, unless perhaps you have some special project in mind and really know what you are doing.

² That is what Ghysels means, but see the subsection on "Leads and nowcasting" below for a possible ambiguity in this regard.

Leads and nowcasting

Before leaving the topic of lags, it is worth commenting on the question of leads and so-called nowcasting, that is, prediction of the current value of a lower-frequency variable before its measurement becomes available. In a regular dataset where all series are of the same frequency, lag 1 means the observation from the previous period, lag 0 is equivalent to the current observation, and lag -1 (or lead 1) is the observation for the next period, in the relative future.

When considering high-frequency lags in the MIDAS context, however, there is no uniquely determined high-frequency sub-period which is temporally coincident with a given low-frequency period. The placement of high-frequency lag 0 therefore has to be a matter of convention. Unfortunately, there are two incompatible conventions in currently available MIDAS software, as follows:

- High-frequency lag 0 corresponds to the first sub-period within the current low-frequency period. This is what we find in Eric Ghysels' MIDAS Matlab Toolbox; it is also clearly stated and explained in Armesto et al. (2010).

- High-frequency lag 0 corresponds to the last sub-period in the current low-frequency period. This convention is employed in the midasr package for R.³

Consider, for example, the quarterly/monthly case. In Matlab, high-frequency (HF) lag 0 is the first month of the current quarter, HF lag 1 is the last month of the prior quarter, and so on. In midasr, however, HF lag 0 is the last month of the current quarter, HF lag 1 the middle month of the quarter, and HF lag 3 is the first one to take you back in time relative to the start of the current quarter (namely, to the last month of the prior quarter).

In gretl we have chosen to employ the first of these conventions. So lag 1 points to the most recent sub-period in the previous base-frequency period, lag 0 points to the first sub-period in the current period, and lag -1 to the second sub-period within the current period.

Continuing with the quarterly/monthly case, monthly observations for lags 0 and -1 are likely to become available before a measurement for the quarterly variable is published, and possibly also a monthly value for lag -2. The first truly "future" lead does not occur until lag -3.

The hflags function supports negative lags. Suppose one wanted to use 9 lags of a high-frequency variable (-1, 0, 1, ..., 7) for nowcasting. Given a suitable MIDAS list X, the following would do the job:

  list XLnow = hflags(-1, 7, X)

This means that one could generate a forecast for the current low-frequency period (which is not yet completed, and for which no observation is available) using data from two sub-periods into the low-frequency period, e.g. the first two months of the quarter.

³ See http://cran.r-project.org/web/packages/midasr and, for documentation, https://github.com/mpiktas/midasr-user-guide/raw/master/midasr-user-guide.pdf
20.4 High-frequency first differences

When working with non-stationary data one may wish to take first differences, and in the MIDAS context that probably means high-frequency differences of the high-frequency data. Note that the ordinary gretl functions diff and ldiff will not do what is wanted for series such as indpro, as shown in Table 20.1: these functions will give per-month quarterly differences of the data (month 3 of the current quarter minus month 3 of the previous quarter, and so on).

To get the desired result one could create the differences before compacting the high-frequency data, but this may not be convenient, and it is not compatible with the method of constructing a MIDAS dataset shown in section 20.1. The alternative is to employ the specialized differencing function hfdiff. This takes one required argument, a MIDAS list as defined in section 20.2. A second, optional, argument is a scalar multiplier (with default value 1.0); this permits scaling the output series by a constant. There is also an hfldiff function for creating high-frequency log differences; this has the same syntax as hfdiff.

So, for example, the following creates a list of high-frequency percentage changes (100 times the log difference), then a list of high-frequency lags of the changes:

  list X = indpro_m*
  setinfo X --midas
  list dX = hfldiff(X, 100)
  list dXL = hflags(3, 11, dX)

If you only need the series in the list dXL, however, you can nest these two function calls:

  list dXL = hflags(3, 11, hfldiff(X, 100))

20.5 MIDAS-related plots

In the context of MIDAS analysis one may wish to produce time-series plots which show high- and low-frequency data in correct registration, as in Figures 1 and 2 in Armesto et al. (2010). This can be done using the hfplot command, which has the following syntax:

  hfplot midas-list [; lflist] options

The required argument is a MIDAS list as defined above. Optionally, one or more lower-frequency series (lflist) can be added to the plot, following a semicolon. Supported options are --with-lines, --time-series and --output; these have the same effects as with gretl's gnuplot command.

An example based on Figure 1 in Armesto et al. (2010) is shown in Listing 20.2 and Figure 20.2.

Listing 20.2: Replication of a plot from Armesto et al. (2010)

  open gdp_midas.gdt

  # form and label the dependent variable
  series dy = log(qgdp/qgdp(-1))*400
  setinfo dy --graph-name="GDP"

  # form list of annualized HF differences
  list X = payems*
  list dX = hfldiff(X, 1200)
  setinfo dX --graph-name="Payroll Employment"

  smpl 1980:1 2009:1
  hfplot dX ; dy --with-lines --time-series --output=display

[Figure 20.2: Quarterly GDP and monthly Payroll Employment, annualized percentage changes]

20.6 Alternative MIDAS data methods

Importation via a column vector

Listing 20.3 illustrates how one can construct, via hansl, a MIDAS list from a matrix (column vector) holding data of a higher frequency than the given dataset. In practice one would probably read high-frequency data from file using the mread function, but here we just construct an artificial sequential vector. Note the check in the high_freq_list function: we determine the current sample size, T, and insist that the input matrix is suitably dimensioned, with a single column of length equal to T times the compaction factor (here 3, for monthly to quarterly).
Listing 20.3: Create a MIDAS list from a matrix

  function list high_freq_list (const matrix x, int compfac, string vname)
     list ret = deflist()
     scalar T = $nobs
     if rows(x) != compfac*T || cols(x) != 1
        funcerr "Invalid x matrix"
     endif
     matrix m = mreverse(mshape(x, compfac, T))'
     loop i=1..compfac
        scalar k = compfac + 1 - i
        ret += genseries(sprintf("%s%d", vname, k), m[,i])
     endloop
     setinfo ret --midas
     return ret
  end function

  # construct a little quarterly dataset
  nulldata 12
  setobs 4 1980:1

  # generate "monthly" data, 1 to 36
  matrix x = seq(1, 3*$nobs)'
  print x

  # turn into a MIDAS list
  list H = high_freq_list(x, 3, "test_m")
  print H --byobs

The final command in the script should produce

           test_m3   test_m2   test_m1
  1980:1         3         2         1
  1980:2         6         5         4
  1980:3         9         8         7

This functionality is available in the built-in function hflist, which has the same signature as the hansl prototype above.

Importation via join

The join command provides a general and flexible framework for importing data from external files (see chapter 7). In order to handle multiple-frequency data it supports the "spreading" of a high-frequency series to a MIDAS list in a single operation. This requires use of the --aggr option with parameter spread. There are two acceptable forms of usage, illustrated below. (Note that AWM is a quarterly dataset, while hamilton is monthly.) First case:

  open AWM.gdt
  join hamilton.gdt PC6IT --aggr=spread

and second case:

  open AWM.gdt
  join hamilton.gdt PCI --data=PC6IT --aggr=spread

In the first case MIDAS series PC6IT_m3, PC6IT_m2 and PC6IT_m1 are added to the working dataset. In the second case PCI is used as the base name for the imports, giving PCI_m3, PCI_m2 and PCI_m1 as the names of the per-month series. Note that only one high-frequency series can be imported in a given join invocation with the option --aggr=spread, which already implies the writing of multiple series in the lower-frequency dataset.

An important point to note is that the --aggr=spread mechanism, where we map from one higher-frequency series to a set of lower-frequency ones, relies on finding a known, reliable time-series structure in the "outer" data file. Native gretl time-series data files will have such a structure, and so will well-formed gretl-friendly CSV files, but not arbitrary comma-separated files. So if you have difficulty importing data MIDAS-style from a given CSV file using --aggr=spread, you might want to drop back to a more "agnostic" piecewise approach (agnostic in the sense of assuming less about gretl's ability to detect any time-series structure that might be present). Here is an example:

  open hamilton.gdt
  # create month-of-quarter series for filtering
  series mofq = ($obsminor - 1) % 3 + 1
  # write example CSV file: the first column holds, e.g., "1973M01"
  store test.csv PC6IT mofq

  open AWM.gdt --quiet
  # import monthly components one at a time, using a filter
  join test.csv PCI_m3 --data=PC6IT --tkey=",%YM%m" --filter="mofq==3"
  join test.csv PCI_m2 --data=PC6IT --tkey=",%YM%m" --filter="mofq==2"
  join test.csv PCI_m1 --data=PC6IT --tkey=",%YM%m" --filter="mofq==1"

  list PCI = PCI_m*
  setinfo PCI --midas
  print PCI_m* --byobs

The example is artificial, in that a time-series CSV file of suitable frequency written by gretl itself should work without special treatment. But you may have to add "helper" columns, such as the mofq series above, to a third-party CSV file to enable a piecewise MIDAS join via filtering.
Daily data

Daily data (commonly, financial-market data) are often used in practical applications of the MIDAS methodology. It is therefore important that gretl support use of such data, but there are special issues arising from the fact that the number of days in a month, quarter or year is not a constant. It seems to us that it is necessary to stipulate a fixed, conventional number of days per lower-frequency period, that is, in practice, per month or quarter, since for the moment we are ignoring the week as a basic temporal unit and we are not yet attempting to support the combination of annual and daily data.

But matters are further complicated by the fact that daily data come in (at least) three sorts: 5 days per week (as in financial-market data), 6-day (some commercial data, which skip Sunday) and 7-day. That said, we currently support, via --compact=spread as described in section 20.1, the following conversions:

- Daily to monthly: If the daily data are 5 days per week, we impose 22 days per month. This is the median, and also the mode, of weekdays per month, although some months have as few as 20 weekdays and some have 23. If the daily data are 6-day we impose 26 days per month, and in the 7-day case, 30 days per month.

- Daily to quarterly: In this case the stipulated days per quarter are simply 3 times the days-per-month values specified above.

So, given a daily dataset, you can say

  dataset compact 12 spread

to convert MIDAS-wise to monthly (or substitute 4 for 12 for a quarterly target). And this is supposed to work whether the number of days per week is 5, 6 or 7.

That leaves the question of how we handle cases where the actual number of days in the calendar month or quarter falls short of, or exceeds, the stipulated number. We will talk this through with reference to the conversion of 5-day daily data to monthly; all other cases are essentially the same, mutatis mutandis.⁴ We start at day 1, namely the first relevant daily date within the calendar period (so the first weekday, with 5-day data). From that point on we fill up to 22 slots with relevant daily observations, including (not skipping) NAs due to holidays or whatever. If at the end we have daily observations left over, we ignore them. If we are short, we fill the empty slots with the arithmetic mean of the valid, "used" observations,⁵ and we fill in any missing values in the same way.

This means that lags 1 to 22 of 5-day daily data in a monthly dataset are always observations from days within the prior month (or, in some cases, "padding" that substitutes for such observations); lag 23 takes you back to the most recent day in the month before that.

Clearly, we could get a good deal fancier in our handling of daily data: for example, letting the user determine the number of days per month or quarter, and/or offering more elaborate means of filling in missing and non-existent daily values. It is not clear that this would be worthwhile, but it is open to discussion.

A little daily-to-monthly example is shown in Listing 20.4 and Figure 20.3. The example exercises the hfplot command (see section 20.5).

⁴ Or should be. We are not ready to guarantee that just yet.
⁵ This is the procedure followed in some example programs in the MIDAS Matlab Toolbox.

Listing 20.4: Monthly plus daily data

  # open a daily dataset
  open djclose.gdt

  # spread the data to monthly
  dataset compact 12 spread
  list DJ = djc*

  # import an actual monthly series
  open fedstl.bin
  data indpro

  # high-frequency plot (use --output=daily.pdf for PDF)
  hfplot DJ ; indpro --with-lines --output=display

[Figure 20.3: Monthly industrial production and daily Dow Jones close]
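As a minimal check on the daily-to-monthly rules just described, the following sketch reuses the djclose.gdt dataset from Listing 20.4 and simply counts the per-day series created by the compaction (we assume the generated names match the djc* pattern, as in that listing):

  open djclose.gdt           # 5-day daily data
  dataset compact 12 spread  # impose 22 days per month
  list DJ = djc*             # collect the per-day series
  eval nelem(DJ)             # should print 22
  print DJ --byobs --midas   # inspect on the original daily calendar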
Chapter 21: Cheat sheet

This chapter explains how to perform some common (and some not so common) tasks in gretl's scripting language, hansl. Some, but not all, of the techniques listed here are also available through the graphical interface. Although the graphical interface may be more intuitive and less intimidating at first, we encourage users to take advantage of the power of gretl's scripting language as soon as they feel comfortable with the program.

21.1 Dataset handling

Weird periodicities

Problem: You have data sampled every 3 minutes from 9am onwards, so you'll probably want to specify the "hour" as 20 periods.

Solution:

  setobs 20 9:1 --special-time-series

Comment: Now functions like sdiff (seasonal difference) and estimation methods like seasonal ARIMA will work as expected.

Generating a panel dataset of given dimensions

Problem: You want to generate, via nulldata, a panel dataset, specifying in advance the number of units and the time length of your series via two scalar variables.

Solution:

  scalar n_units = 100
  scalar T = 12
  scalar NT = T * n_units

  nulldata NT --preserve
  setobs T 1:1 --stacked-time-series

Comment: The essential ingredient that we use here is the --preserve option: it protects existing scalars (and matrices, for that matter) from being trashed by nulldata, thus making it possible to use the scalar T in the setobs command.

Help, my data are backwards!

Problem: Gretl expects time series data to be in chronological order (most recent observation last), but you have imported third-party data that are in reverse order (most recent first).

Solution:

  setobs 1 1 --cross-section
  series sortkey = -obs
  dataset sortby sortkey
  setobs 1 1950 --time-series

Comment: The first line is required only if the data currently have a time series interpretation: it removes that interpretation, because (for fairly obvious reasons) the dataset sortby operation is not allowed for time series data. The following two lines reverse the data, using the negative of the built-in index variable obs. The last line is just illustrative: it establishes the data as annual time series, starting in 1950.

If you have a dataset that is mostly the right way round, but a particular variable is wrong, you can reverse that variable as follows:

  x = sortby(-obs, x)

Dropping missing observations selectively

Problem: You have a dataset with many variables, and want to restrict the sample to those observations for which there are no missing observations for the variables x1, x2 and x3.

Solution:

  list X = x1 x2 x3
  smpl --no-missing X

Comment: You can save the file via a store command to preserve a subsampled version of the dataset. Alternative solutions, based on the ok function, such as

  list X = x1 x2 x3
  series sel = ok(X)
  smpl sel --restrict

are perhaps less obvious, but more flexible. Pick your poison.

"By" operations

Problem: You have a discrete variable d and you want to run some commands (for example, estimate a model) by splitting the sample according to the values of d.

Solution:

  matrix vd = values(d)
  m = rows(vd)
  loop i=1..m
      scalar sel = vd[i]
      smpl d==sel --restrict --replace
      ols y const x
  endloop
  smpl --full

Comment: The main ingredient here is a loop. You can have gretl perform as many instructions as you want for each value of d, as long as they are allowed inside a loop. Note, however, that if all you want is descriptive statistics, the summary command does have a --by option.
Adding a time series to a panel

Problem: You have a panel dataset comprising observations of n individuals in each of T periods, and you want to add a variable which is available in straight time-series form. For example, you want to add annual CPI data to a panel in order to deflate nominal income figures.

In gretl a panel is represented in stacked time-series format, so in effect the task is to create a new variable which holds n stacked copies of the original time series. Let's say the panel comprises 500 individuals observed in the years 1990, 1995 and 2000 (n = 500, T = 3), and we have these CPI data in the ASCII file cpi.txt:

  date  cpi
  1990  130.658
  1995  152.383
  2000  172.192

What we need is for the CPI variable in the panel to repeat these three values 500 times.

Solution: Simple! With the panel dataset open in gretl,

  append cpi.txt

Comment: If the length of the time series is the same as the length of the time dimension in the panel (3 in this example), gretl will perform the stacking automatically. Rather than using the append command, you could use the "Append data" item under the File menu in the GUI program. If the length of your time series does not exactly match the T dimension of your panel dataset, append will not work, but you can use the join command, which is able to pick just the observations with matching time periods. On selecting "Append data" in the GUI you are given a choice between plain "append" and "join" modes; if you choose the latter you get a dialog window allowing you to specify the key(s) for the join operation. For native gretl data files you can use built-in series that identify the time periods, such as obsmajor, for your "outer" key to match the dates. In the example above, if the CPI data were in gretl format, obsmajor would give you the year of the observations.
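In script mode the join route might look something like the following sketch, assuming the CPI data have been saved in gretl format as cpi.gdt and that the panel contains a series named year identifying the time periods (both names here are hypothetical):

  # panel dataset already open, containing a "year" series
  join cpi.gdt cpi --ikey=year --okey=obsmajor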
Time averaging of panel datasets

Problem: You have a panel dataset comprising observations of n individuals in each of T periods, and you want to lower the time frequency by averaging. This is commonly done in empirical growth economics, where annual data are turned into 3-, 4- or 5-year averages (see for example Islam, 1995).

Solution: In a panel dataset, gretl functions that deal with time are aware of the panel structure, so they will automatically do the right thing. Therefore, all you have to do is use the movavg function for computing moving averages and then just drop the years you don't need. An example with artificial data follows:

  nulldata 36
  set seed 61218
  setobs 12 1:1 --stacked-time-series

  # generate simulated yearly data
  series year = 2000 + time
  series y = round(normal())
  series x = round(3*uniform())
  list X = y x
  print year X -o

  # now recast as 4-year averages

  # a dummy for endpoints
  series endpoint = (year % 4 == 0)
  # id variable
  series id = $unit
  # compute averages
  loop foreach i X
      series $i = movavg($i, 4)
  endloop
  # drop extra observations
  smpl endpoint --dummy --permanent
  # restore panel structure
  setobs id year --panel-vars
  print id year X -o

Running the above script produces (among other output):

  ? print year X -o

               year            y            x
  1:01         2001            1            1
  1:02         2002            1            1
  1:03         2003            1            0
  1:04         2004            0            1
  1:05         2005            1            2
  1:06         2006            1            2
  1:07         2007            1            0
  1:08         2008            1            1
  1:09         2009            0            3
  1:10         2010            1            1
  1:11         2011            1            1
  1:12         2012            0            1
  ...
  3:09         2009            0            1
  3:10         2010            1            1
  3:11         2011            0            2
  3:12         2012            1            2

  ? print id year X -o

                 id         year            y            x
  1:1             1         2004         0.25         0.75
  1:2             1         2008         0.50         1.25
  1:3             1         2012         0.00         1.50
  ...
  3:3             3         2012         0.50         1.50

Turning observation-marker strings into a series

Problem: Here's one that might turn up in the context of the join command (see chapter 7). The current dataset contains a string-valued series that you would like to use as a key for matching observations, perhaps the two-letter codes for the names of US states. The file from which you wish to add data contains that same information, but not in the form of a string-valued series; rather, it exists in the form of "observation markers". Such markers cannot be used as a key directly, but is there a way to parlay them into a string-valued series? Why, of course there is!

Solution: We will illustrate with the Ramanathan data file data4-10.gdt, which contains private school enrollment data and covariates for the 50 US states plus Washington, DC (n = 51).

  open data4-10.gdt
  markers --to-array="state_codes"
  genr index
  stringify(index, state_codes)
  store joindata.gdt

Comment: The markers command saves the observation markers to an array of strings. The command genr index creates a series that goes 1, 2, 3, ..., and we attach the state codes to this series via stringify. After saving the result we have a data file that contains a series, index, that can be matched with whatever series holds the state code strings in the target dataset.

Suppose the relevant string-valued key series in the target dataset is called state. We might prefer to avoid the need to specify a distinct "outer" key (again, see chapter 7). In that case, in place of

  genr index
  stringify(index, state_codes)

we could do

  genr index
  series state = index
  stringify(state, state_codes)

and the two data files will contain a comparable string-valued state series.

21.2 Creating/modifying variables

Generating a dummy variable for a specific observation

Problem: Generate d_t = 0 for all observations but one, for which d_t = 1.

Solution:

  series d = (t=="1984:2")

Comment: The internal variable t is used to refer to observations in string form, so if you have a cross-section sample you may just use d = (t=="123"). If the dataset has observation labels you can use the corresponding label. For example, if you open the dataset mrw.gdt, supplied with gretl among the examples, a dummy variable for Italy could be generated via

  series DIta = (t=="Italy")

Note that this method does not require scripting at all: in fact, you might as well use the GUI menu item Add / Define new variable for the same purpose, with the same syntax.

Generating a discrete variable out of a set of dummies

Problem: The dummify function (also available as a command) generates a set of mutually exclusive dummies from a discrete variable. The reverse functionality, however, seems to be absent.

Solution:

  series x = lincomb(D, seq(1, nelem(D)))

Comment: Suppose you have a list D of mutually exclusive dummies, that is, a full set of 0/1 variables coding for the value of some characteristic, such that the sum of the values of the elements of D is 1 at each observation. This is, by the way, exactly what the dummify command produces. The reverse job of dummify can be performed neatly by using the lincomb function. The code above multiplies the first dummy variable in the list D by 1, the second one by 2, and so on. Hence, the return value is a series whose value is i if and only if the i-th member of D has value 1.

If you want your coding to start from 0 instead of 1, you'll have to modify the code snippet above into

  series x = lincomb(D, seq(0, nelem(D)-1))
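As a quick sanity check of the round trip, here is a minimal sketch (the artificial dataset and all names are ours; run it in a fresh session, since nulldata replaces the current dataset):

  nulldata 9
  genr index
  series x = 1 + (index - 1) % 3   # values 1, 2, 3 repeating
  list D = dummify(x)              # full set of mutually exclusive dummies
  series x2 = lincomb(D, seq(1, nelem(D)))
  eval sum(abs(x - x2))            # should print 0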
Easter

Problem: I have a 7-day daily dataset. How do I create an Easter dummy?

Solution: We have the easterday function, which returns month and day of Easter given the year. The following is an example script which uses this function and a few "string magic" tricks:

  series Easter = 0
  loop y=2011..2016
      a = easterday(y)
      m = floor(a)
      d = round(100*(a-m))
      ed_str = sprintf("%04d-%02d-%02d", y, m, d)
      Easter[ed_str] = 1
  endloop

Comment: The round function is necessary for the "day" component because otherwise floating-point problems may ensue. Try the year 2015, for example.

Recoding a variable

Problem: You want to perform a 1-to-1 recode on a variable. For example, consider tennis points: you may have a variable x holding values 1 to 3, and you want to recode it to 15, 30, 40.

Solution 1:

  series x = replace(x, 1, 15)
  series x = replace(x, 2, 30)
  series x = replace(x, 3, 40)

Solution 2:

  matrix tennis = {15, 30, 40}
  series x = replace(x, seq(1, 3), tennis)

Comment: There are many equivalent ways to achieve the same effect, but for simple cases such as this the replace function is simple and transparent. If you don't mind using matrices, scripts using replace can also be remarkably compact. Note that replace also performs n-to-1 (surjective) replacements, such as

  series x = replace(z, {2, 3, 5, 11, 22, 33}, 1)

which would turn all entries equal to 2, 3, 5, 11, 22 or 33 into 1, and leave the other ones unchanged.

Generating a "subset of values" dummy

Problem: You have a dataset which contains a fine-grained coding for some qualitative variable, and you want to collapse this to a relatively small set of dummy variables. Examples: you have place of work by US state and you want a small set of regional dummies; or you have detailed occupational codes from a census dataset and you want a manageable number of occupational category dummies.

Let's call the source series src and one of the target dummies D1. And let's say that the values of src to be grouped under D1 are 2, 13, 14 and 25. We will consider three possible solutions: "longhand", "clever" and "proper".

Longhand solution:

  series D1 = (src==2) || (src==13) || (src==14) || (src==25)

Comment: The above works fine if the number of distinct values in the source to be condensed into each dummy variable is fairly small, but it becomes cumbersome if a single dummy must comprise dozens of source values.

Clever solution:

  matrix sel = {2, 13, 14, 25}
  series D1 = maxr({src} .= vec(sel)')

Comment: The subset of values to be grouped together can be written out as a matrix relatively compactly (first line). The magic that turns this into the desired series (second line) relies on the versatility of the "dot" (element-wise) matrix operators. The expression {src} gets a column-vector version of the input series (call this x) and vec(sel)' gets the input matrix as a row vector, in case it is a column vector or a matrix with both dimensions greater than 1 (call this s). If x is n x 1 and s is 1 x m, the .= operator produces an n x m result, each element (i, j) of which equals 1 if x_i = s_j, otherwise 0. The maxr function, along with the .= operator (see chapter 17 for both), then produces the result we want.

Of course, whichever procedure you use, you have to repeat it for each of the dummy series you want to create. But keep reading: the "proper" solution is probably what you want if you plan to create several dummies.

Further comment: Note that the "clever" solution depends on converting what is naturally a vector result into a series. This will fail if there are missing values in src, since by default missing values will be skipped when converting src to x, and so the number of rows in the result will fall short of the number of observations in the dataset. One fix is then to subsample the dataset to exclude missing values before employing this method; another is to adjust the skip_missing setting via the set command (see the Gretl Command Reference).

Proper solution: The best solution, in terms of both computational efficiency and code clarity, is to use a "conversion table" and the replace function to produce a series on which the dummify command can be used. For example, suppose we want to convert from a series called fips, holding FIPS codes¹ for the 50 US states plus the District of Columbia, to a series holding codes for the four standard US regions. We could create a 2 x 51 matrix (call it srmap) with the 51 FIPS codes on the first row and the corresponding region codes on the second, and then do

  series region = replace(fips, srmap[1,], srmap[2,])

¹ FIPS is the Federal Information Processing Standard: it assigns numeric codes from 1 to 56 to the US states and outlying areas.
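By way of a toy illustration of the conversion-table idea, the following sketch groups a handful of made-up source codes into two groups; the codes, the contents of srmap and the assumption that a series src exists are all ours:

  # hypothetical conversion table: source codes on row 1,
  # target group codes on row 2
  matrix srmap = {2, 13, 14, 25, 31; 1, 1, 1, 1, 2}
  series grp = replace(src, srmap[1,], srmap[2,])
  list G = dummify(grp)   # one dummy per group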
Generating an ARMA(1,1)

Problem: Generate y_t = 0.9 y_{t-1} + ε_t + 0.5 ε_{t-1}, with ε_t ~ NIID(0, 1).

Recommended solution:

  alpha = 0.9
  theta = 0.5
  series y = filter(normal(), {1, theta}, alpha)

"Bread and butter" solution:

  alpha = 0.9
  theta = 0.5
  series e = normal()
  series y = 0
  series y = alpha * y(-1) + e + theta * e(-1)

Comment: The filter function is specifically designed for this purpose, so in most cases you'll want to take advantage of its speed and flexibility. That said, in some cases you may want to generate the series in a manner which is more transparent (maybe for teaching purposes). In the second solution, the statement series y = 0 is necessary because the next statement evaluates y recursively, so y[1] must be set. Note that you must use the keyword series here instead of writing genr y = 0 or simply y = 0, to ensure that y is a series and not a scalar.

Recoding a variable by classes

Problem: You want to recode a variable by classes. For example, you have the age of a sample of individuals, x_i, and you need to compute age classes y_i as

  y_i = 1  for x_i < 18
  y_i = 2  for 18 ≤ x_i < 65
  y_i = 3  for x_i ≥ 65

Solution:

  series y = 1 + (x >= 18) + (x >= 65)

Comment: True and false expressions are evaluated as 1 and 0 respectively, so they can be manipulated algebraically like any other number. The same result could also be achieved by using the conditional assignment operator (see below), but in most cases it would probably lead to more convoluted constructs.

Conditional assignment

Problem: Generate y_t via the following rule:

  y_t = x_t  for d_t > a
  y_t = z_t  for d_t ≤ a

Solution:

  series y = (d > a) ? x : z

Comment: There are several alternatives to the one presented above. One is a brute-force solution using loops. Another one, more efficient but still sub-optimal, would be

  series y = (d>a)*x + (d<=a)*z

However, the ternary conditional assignment operator is not only the most efficient way to accomplish what we want; it is also remarkably transparent to read, once one gets used to it. Some readers may find it helpful to note that the conditional assignment operator works exactly the same way as the =IF() function in spreadsheets.

Generating a time index for panel datasets

Problem: Gretl has a $unit accessor, but not the equivalent for time. What should I use?

Solution:

  series x = time

Comment: The special construct genr time and its variants are aware of whether a dataset is a panel.

Sanitizing a list of regressors

Problem: I noticed that built-in commands like ols automatically drop collinear variables and put the constant first. How can I achieve the same result for an estimator I'm writing?

Solution: No worry. The function below does just that:

  function list sanitize(list X)
      list R = X - const
      if nelem(R) < nelem(X)
          R = const R
      endif
      return dropcoll(R)
  end function

so that, for example, the code below

  nulldata 20
  x = normal()
  y = normal()
  z = x + y # collinear
  list A = x y const z
  list B = sanitize(A)

  list A print
  list B print

returns

  ? list A print
  x y const z
  ? list B print
  const x y

Besides: it has been brought to our attention that some mischievous programs out there put the constant last, instead of first, like God intended. We are not amused by this utter disrespect of econometric tradition, but if you want to pursue the way of evil, it is rather simple to adapt the script above to that effect.

Generating the "hat" values after an OLS regression

Problem: I've just run an OLS regression, and now I need the so-called "leverage" values (also known as the "hat" values). I know you can access residuals and fitted values through "dollar" accessors, but nothing like that seems to be available for hat values.

Solution: Hat values can be thought of as the diagonal of the projection matrix P_X, or more explicitly as

  h_i = x_i' (X'X)^{-1} x_i

where X is the matrix of regressors and x_i' is its i-th row. The reader is invited to study the code below, which offers four different solutions to the problem:
  open data4-1.gdt --quiet
  list X = const sqft bedrms baths
  ols price X

  # method 1
  leverage --save --quiet
  series h1 = lever

  # these are necessary for what comes next
  matrix mX = {X}
  matrix iXX = invpd(mX'mX)

  # method 2
  series h2 = diag(qform(mX, iXX))

  # method 3
  series h3 = sumr(mX .* (mX*iXX))

  # method 4
  series h4 = NA
  loop i=1..$nobs
      matrix x = mX[i,]'
      h4[i] = x'iXX*x
  endloop

  # verify
  print h1 h2 h3 h4 --byobs

Comment: Solution 1 is the preferable one: it relies on the built-in leverage command, which computes the requested series quite efficiently, taking care of missing values, possible restrictions to the sample, etc. However, three more are shown, mainly for didactical purposes, to show the user how to manipulate matrices.

Solution 2 first constructs the P_X matrix explicitly, via the qform function, and then takes its diagonal. This is definitely not recommended, despite its compactness, since you generate a much bigger matrix than you actually need and waste a lot of memory and CPU cycles in the process. It does not matter very much in the present case, since the sample size is very small, but with a big dataset this could be a very bad idea.

Solution 3 is more clever, and relies on the fact that, if you define Z = X(X'X)^{-1}, then h_i could also be written as

  h_i = x_i' z_i = \sum_{k} x_{ik} z_{ik}

which is in turn equivalent to the sum of the elements of the i-th row of X ⊙ Z, where ⊙ is the element-by-element product. In this case, your clever usage of matrix algebra would produce a solution computationally much superior to solution 2.

Solution 4 is the most old-fashioned one, and employs an indexed loop. While this wastes practically no memory and employs no more CPU cycles in algebraic operations than strictly necessary, it imposes a much greater burden on the hansl interpreter, since handling a loop is conceptually more complex than a single operation. In practice, you will find that for any realistically-sized problem, solution 4 is much slower than solution 3.

Moving functions for time series

Problem: Gretl provides native functions for moving averages, but I need to compute a different statistic on a sliding data window. Is there a way to do this without using loops?

Solution: One of the nice things about the list data type is that, if you define a list, then several functions that would normally apply "vertically" to elements of a series apply "horizontally" across the list. So, for example, the following piece of code

  open bjg.gdt
  order = 12
  list L = lg lags(order-1, lg)
  smpl +order ;
  series movmin = min(L)
  series movmax = max(L)
  series movmed = median(L)
  smpl full

computes the moving minimum, maximum and median of the lg series. Plotting the four series would produce something similar to Figure 21.1.

[Figure 21.1: Moving functions]
Generating data with a prescribed correlation structure

Problem: I'd like to generate a bunch of normal random variates whose covariance matrix is exactly equal to a given matrix Σ. How can I do this in gretl?

Solution: The Cholesky decomposition is your friend. If you want to generate data with a given population covariance matrix, all you have to do is post-multiply your pseudo-random data by the Cholesky factor (transposed) of the matrix you want. For example:

  set seed 123
  S = {2, 1; 1, 1}
  T = 1000
  X = mnormal(T, rows(S))
  X = X * cholesky(S)'
  eval mcov(X)

should give you

  2.0016   1.0157
  1.0157   1.0306

If, instead, you want your simulated data to have a given sample covariance matrix, you have to apply the same technique twice: once for standardizing the data, and once more for giving them the covariance structure you want. Example:

  S = {2, 1; 1, 1}
  T = 1000
  X = mnormal(T, rows(S))
  X = X * (cholesky(S) * inv(cholesky(mcov(X))))'
  eval mcov(X)

gives you

  2   1
  1   1

as required.

21.3 Neat tricks

Interaction dummies

Problem: You want to estimate the model y_i = x_i β1 + z_i β2 + d_i β3 + (d_i · z_i) β4 + ε_i, where d_i is a dummy variable while x_i and z_i are vectors of explanatory variables.

Solution: As of version 1.9.12, gretl provides the ^ operator to make this operation easy. See section 15.1 for details (especially example script 15.1). But back in my day, we used loops to do that! Here's how:

  list X = x1 x2 x3
  list Z = z1 z2
  list dZ = deflist()

  loop foreach i Z
      series d$i = d * $i
      list dZ += d$i
  endloop

  ols y X Z d dZ

Comment: It's amazing what string substitution can do for you, isn't it?

Realized volatility

Problem: Given data by the minute, you want to compute the realized volatility for the hour as

  RV_t = \frac{1}{60} \sum_{\tau=1}^{60} y^2_{t,\tau}

Imagine your sample starts at time 1:1.

Solution:

  smpl full
  genr time
  series minute = int(time/60) + 1
  series second = time % 60
  setobs minute second --panel-vars
  series rv = psd(y)^2
  setobs 1 1
  smpl second==1 --restrict
  store foo rv

Comment: Here we trick gretl into thinking that our dataset is a panel dataset, where the minutes are the "units" and the seconds are the "time"; this way, we can take advantage of the special function psd (panel standard deviation). Then we simply drop all observations but one per minute and save the resulting data (store foo rv translates as "store, in the gretl datafile foo.gdt, the series rv").

Looping over two paired lists

Problem: Suppose you have two lists with the same number of elements, and you want to apply some command to corresponding elements over a loop.

Solution:

  list L1 = a b c
  list L2 = x y z

  k1 = 1
  loop foreach i L1
      k2 = 1
      loop foreach j L2
          if k1 == k2
              ols $i 0 $j
          endif
          k2++
      endloop
      k1++
  endloop

Comment: The simplest way to achieve the result is to loop over all possible combinations and filter out the unneeded ones via an if condition, as above. That said, in some cases variable names can help. For example, if

  list Lx = x1 x2 x3
  list Ly = y1 y2 y3

then we could just loop over the integers, which is quite intuitive and certainly more elegant:

  loop i=1..3
      ols y$i const x$i
  endloop

Convolution (polynomial multiplication)

Problem: How do I multiply polynomials? There's no dedicated function to do that, and yet it's a fairly basic mathematical task.

Solution: Never fear! We have the conv2d function, which is a tool for a more general problem, but includes polynomial multiplication as a special case.

Suppose you want to multiply two finite-order polynomials P(x) = \sum_{i=0}^m p_i x^i and Q(x) = \sum_{i=0}^n q_i x^i. What you want is the sequence of coefficients of the polynomial

  R(x) = P(x) Q(x) = \sum_{k=0}^{m+n} r_k x^k

where

  r_k = \sum_{i=0}^k p_i q_{k-i}

is the convolution of the p_i and q_i coefficients. The same operation can be performed via the FFT, but in most cases using conv2d is quicker and more natural.

As an example, we'll use the same one we used in Section 30.5: consider the multiplication of two polynomials,

  P(x) = 1 + 0.5x
  Q(x) = 1 + 0.3x - 0.8x^2
  R(x) = P(x) Q(x) = 1 + 0.8x - 0.65x^2 - 0.4x^3

The following code snippet performs all the necessary calculations:

  p = {1; 0.5}
  q = {1; 0.3; -0.8}
  r = conv2d(p, q)
  print r

Running the above produces

  r (4 x 1)

      1
    0.8
  -0.65
   -0.4

which is indeed the desired result. Note that the same computation could also be performed via the filter function, at the price of slightly more elaborate syntax.
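For the record, here is a sketch of the filter route; the zero-padding, which ensures the output has the full length m + n + 1, is our own implementation choice:

  matrix p = {1; 0.5}
  matrix q = {1; 0.3; -0.8}
  # pad q with one zero per extra lag in p, then apply the
  # MA filter with coefficients p (lag 0 included)
  matrix r = filter(q | zeros(rows(p)-1, 1), p)
  print r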
Comparing two lists

Problem: How can I tell if two lists contain the same variables (not necessarily in the same order)?

Solution: In many respects, lists are like sets, so it makes sense to use the so-called "symmetric difference" operator, which is defined as

  A ⊖ B = (A \ B) ∪ (B \ A)

where, in this context, backslash represents the relative complement operator, such that

  A \ B = {x ∈ A : x ∉ B}

In practice, we first check if there are any series in A but not in B, then we perform the reverse check. If the union of the two results is an empty set, then the lists must contain the same variables. The hansl syntax for this would be something like

  scalar NotTheSame = nelem(A-B) + nelem(B-A) > 0

Reordering list elements

Problem: Is there a way to reorder list elements?

Solution: You can use the fact that a list can be cast into a vector of integers and then manipulated via ordinary matrix syntax. So, for example, if you wanted to "flip" a list you may just use the mreverse function:

  open AWM.gdt --quiet
  list X = 3 6 9 12
  matrix tmp = X
  list revX = mreverse(tmp')
  list X print
  list revX print

will produce

  ? list X print
  D1 D872 EEN_DIS GCD
  ? list revX print
  GCD EEN_DIS D872 D1

Plotting an asymmetric confidence interval

Problem: I like the look of the --band option to the gnuplot and plot commands, but it's set up for plotting a symmetric interval and I want to show an asymmetric one.

Solution: Any interval is by construction symmetrical about its mean at each observation. So you just need to perform a little tweak. Say you want to plot a series x along with a band defined by the two series top and bot. Here we go:

  # create series for midpoint and deviation
  series mid = (top + bot)/2
  series dev = top - mid
  gnuplot x --band=mid,dev --time-series --with-lines --output=display

Cross-validation

Problem: I'd like to compute the so-called leave-one-out cross-validation criterion for my regression. Is there a command in gretl?

If you have a sample with n observations, the leave-one-out cross-validation criterion can be mechanically computed by running n regressions in which one observation at a time is omitted and all the other ones are used to forecast its value. The sum of the n squared forecast errors is the statistic we want. Fortunately, there is no need to do so. It is possible to prove that the same statistic can be computed as

  CV = \sum_{i=1}^n [\hat{u}_i / (1 - h_i)]^2

where h_i is the i-th element of the "hat" matrix (see section 21.2) from a regression on the whole sample.

This method is natively provided by gretl as a side benefit of the leverage command, which stores the CV criterion in the $test accessor. The following script shows the equivalence of the two approaches:

  set verbose off
  open data4-1.gdt
  list X = const sqft bedrms baths

  # compute the CV criterion the silly way
  scalar CV = 0
  matrix mX = {X}
  loop i = 1..$nobs
      xi = mX[i,]
      yi = price[i]
      smpl obs != i --restrict
      ols price X --quiet
      smpl full
      scalar fe = yi - xi * $coeff
      CV += fe^2
  endloop
  printf "CV = %g\n", CV

  # the smart way
  ols price X --quiet
  leverage --quiet
  printf "CV = %g\n", $test

Is my matrix result broken?

Problem: Most of the matrix manipulation functions available in gretl flag an error if something goes wrong, but there's no guarantee that every matrix computation will return an entirely finite matrix, containing no infinities or NaNs. So how do I tell if I've got a fully valid matrix?

Solution: Given a matrix m, the call ok(m) returns a matrix with the same dimensions as m, with elements 1 for finite values and 0 for infinities or NaNs. A matrix as a whole is "OK" if it has no elements which fail this test, so here's a suitable check for a broken matrix, using the logical NOT operator:

  sumc(sumr(!ok(m))) > 0

If this gives a non-zero return value, you know that m contains at least one non-finite element.

Part II: Econometric methods

Chapter 22: Robust covariance matrix estimation

22.1 Introduction

Consider (once again) the linear regression model

  y = X\beta + u                                    (22.1)

where y and u are T-vectors, X is a T x k matrix of regressors, and β is a k-vector of parameters. As is well known, the estimator of β given by Ordinary Least Squares (OLS) is
  \hat{\beta} = (X'X)^{-1} X'y                      (22.2)

If the condition E(u|X) = 0 is satisfied, this is an unbiased estimator; under somewhat weaker conditions, the estimator is biased but consistent. It is straightforward to show that when the OLS estimator is unbiased (that is, when E(β̂ − β) = 0), its variance is

  \mathrm{Var}(\hat{\beta}) = E[(\hat{\beta}-\beta)(\hat{\beta}-\beta)'] = (X'X)^{-1} X'\Omega X (X'X)^{-1}    (22.3)

where Ω = E(uu') is the covariance matrix of the error terms.

Under the assumption that the error terms are independently and identically distributed (iid), we can write Ω = σ²I, where σ² is the (common) variance of the errors (and the covariances are zero). In that case (22.3) simplifies to the "classical" formula,

  \mathrm{Var}(\hat{\beta}) = \sigma^2 (X'X)^{-1}    (22.4)

If the iid assumption is not satisfied, two things follow. First, it is possible in principle to construct a more efficient estimator than OLS, for instance some sort of Feasible Generalized Least Squares (FGLS). Second, the simple "classical" formula for the variance of the least squares estimator is no longer correct, and hence the conventional OLS standard errors, which are just the square roots of the diagonal elements of the matrix defined by (22.4), do not provide valid means of statistical inference.

In the recent history of econometrics there are broadly two approaches to the problem of non-iid errors. The "traditional" approach is to use an FGLS estimator. For example, if the departure from the iid condition takes the form of time-series dependence, and if one believes that this could be modeled as a case of first-order autocorrelation, one might employ an AR(1) estimation method such as Cochrane-Orcutt, Hildreth-Lu or Prais-Winsten. If the problem is that the error variance is non-constant across observations, one might estimate the variance as a function of the independent variables and then perform weighted least squares, using as weights the reciprocals of the estimated variances.

While these methods are still in use, an alternative approach has found increasing favor: that is, use OLS but compute standard errors (or, more generally, covariance matrices) that are robust with respect to deviations from the iid assumption. This is typically combined with an emphasis on using large datasets, large enough that the researcher can place some reliance on the (asymptotic) consistency property of OLS. This approach has been enabled by the availability of cheap computing power: the computation of robust standard errors and the handling of very large datasets were daunting tasks at one time, but now they are unproblematic. The other point favoring the newer methodology is that, while FGLS offers an efficiency advantage in principle, it often involves making additional statistical assumptions which may or may not be justified, which may not be easy to test rigorously, and which may threaten the consistency of the estimator (for example, the "common factor restriction" that is implied by traditional FGLS corrections for autocorrelated errors).

James Stock and Mark Watson's Introduction to Econometrics illustrates this approach at the level of undergraduate instruction: many of the datasets they use comprise thousands or tens of thousands of observations, FGLS is downplayed, and robust standard errors are reported as a matter of course. In fact, the discussion of the classical standard errors (labeled "homoskedasticity-only") is confined to an Appendix.

Against this background it may be useful to set out and discuss all the various options offered by gretl in respect of robust covariance matrix estimation.
The first point to notice is that gretl produces "classical" standard errors by default (in all cases apart from GMM estimation). In script mode you can get robust standard errors by appending the --robust flag to estimation commands. In the GUI program the model specification dialog usually contains a "Robust standard errors" check box, along with a "configure" button that is activated when the box is checked. The configure button takes you to a configuration dialog, which can also be reached from the main menu bar: Tools / Preferences / General / HCCME. There you can select from a set of possible robust estimation variants, and can also choose to make robust estimation the default.

The specifics of the available options depend on the nature of the data under consideration (cross-sectional, time series or panel) and also, to some extent, the choice of estimator. (Although we introduced robust standard errors in the context of OLS above, they may be used in conjunction with other estimators too.) The following three sections of this chapter deal with matters that are specific to the three sorts of data just mentioned. Note that additional details regarding covariance matrix estimation in the context of GMM are given in chapter 27.

We close this introduction with a brief statement of what "robust standard errors" can and cannot achieve. They can provide for asymptotically valid statistical inference in models that are basically correctly specified, but in which the errors are not iid. The "asymptotic" part means that they may be of little use in small samples. The "correct specification" part means that they are not a magic bullet: if the error term is correlated with the regressors, so that the parameter estimates themselves are biased and inconsistent, robust standard errors will not save the day.

22.2 Cross-sectional data and the HCCME

With cross-sectional data, the most likely departure from iid errors is heteroskedasticity (non-constant variance).¹ In some cases one may be able to arrive at a judgment regarding the likely form of the heteroskedasticity, and hence to apply a specific correction. The more common case, however, is where the heteroskedasticity is of unknown form. We seek an estimator of the covariance matrix of the parameter estimates that retains its validity, at least asymptotically, in face of unspecified heteroskedasticity. It is not obvious, a priori, that this should be possible, but White (1980) showed that

  \widehat{\mathrm{Var}}_h(\hat{\beta}) = (X'X)^{-1} X'\hat{\Omega} X (X'X)^{-1}    (22.5)

does the trick. (As usual in statistics, we need to say "under certain conditions", but the conditions are not very restrictive.) Ω̂ is, in this context, a diagonal matrix whose non-zero elements may be estimated using squared OLS residuals. White referred to (22.5) as a heteroskedasticity-consistent covariance matrix estimator (HCCME).

Davidson and MacKinnon (2004, chapter 5) offer a useful discussion of several variants on White's HCCME theme. They refer to the original variant of (22.5), in which the diagonal elements of Ω̂ are estimated directly by the squared OLS residuals, û_t², as HC0. (The associated standard errors are often called "White's standard errors".) The various refinements of White's proposal share a common point of departure, namely the idea that the squared OLS residuals are likely to be "too small" on average.

¹ In some specialized contexts spatial autocorrelation may be an issue. Gretl does not have any built-in methods to handle this, and we will not discuss it here.
This point is quite intuitive. The OLS parameter estimates, β̂, satisfy by design the criterion that the sum of squared residuals,

  \sum \hat{u}_t^2 = \sum (y_t - X_t \hat{\beta})^2

is minimized, for given X and y. Suppose that β̂ ≠ β. This is almost certain to be the case: even if OLS is not biased, it would be a miracle if the β̂ calculated from any finite sample were exactly equal to β. But in that case the sum of squares of the true, unobserved errors,

  \sum u_t^2 = \sum (y_t - X_t \beta)^2

is bound to be greater than Σ û_t². The elaborated variants on HC0 take this point on board, as follows:

- HC1: Applies a degrees-of-freedom correction, multiplying the HC0 matrix by T/(T−k).

- HC2: Instead of using û_t² for the diagonal elements of Ω̂, uses û_t²/(1−h_t), where h_t = X_t(X'X)^{-1}X_t', the t-th diagonal element of the projection matrix P_X, which has the property that P_X · y = ŷ. The relevance of h_t is that, if the variance of all the u_t is σ², the expectation of û_t² is σ²(1−h_t); in other words, the ratio û_t²/(1−h_t) has expectation σ². As Davidson and MacKinnon show, 0 ≤ h_t < 1 for all t, so this adjustment cannot reduce the diagonal elements of Ω̂ and, in general, revises them upward.

- HC3: Uses û_t²/(1−h_t)². The additional factor of (1−h_t) in the denominator, relative to HC2, may be justified on the grounds that observations with large variances tend to exert a lot of influence on the OLS estimates, so that the corresponding residuals tend to be underestimated. See Davidson and MacKinnon for a fuller explanation.

- HC3a: Implements the jackknife approach from MacKinnon and White (1985). (HC3 is a close approximation of this.)

The relative merits of these variants have been explored by means of both simulations and theoretical analysis. Unfortunately, there is not a clear consensus on which is "best". Davidson and MacKinnon argue that the original HC0 is likely to perform worse than the others; nonetheless, "White's standard errors" are reported more often than the more sophisticated variants, and therefore, for reasons of comparability, HC0 is the default HCCME in gretl.

If you wish to use HC1, HC2, HC3 or HC3a you can arrange for this in either of two ways. In script mode, you can do, for example,

  set hc_version 2

In the GUI program you can go to the HCCME configuration dialog, as noted above, and choose any of these variants to be the default.
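A minimal illustration of the effect of this setting, using one of the Ramanathan practice datasets supplied with gretl (the choice of dataset and regressors here is ours, purely for illustration):

  open data4-1.gdt
  ols price const sqft --robust   # HC0, the default variant
  set hc_version 2
  ols price const sqft --robust   # same model, HC2 standard errors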
22.3 Time series data and HAC covariance matrices

Heteroskedasticity may be an issue with time series data too, but it is unlikely to be the only, or even the primary, concern.

One form of heteroskedasticity is common in macroeconomic time series, but is fairly easily dealt with. That is, in the case of strongly trending series such as Gross Domestic Product, aggregate consumption, aggregate investment, and so on, higher levels of the variable in question are likely to be associated with higher variability in absolute terms. The obvious "fix", employed in many macroeconometric studies, is to use the logs of such series rather than the raw levels. Provided the proportional variability of such series remains roughly constant over time, the log transformation is effective.

Other forms of heteroskedasticity may resist the log transformation, but may demand a special treatment distinct from the calculation of robust standard errors. We have in mind here autoregressive conditional heteroskedasticity, for example in the behavior of asset prices, where large disturbances to the market may usher in periods of increased volatility. Such phenomena call for specific estimation strategies, such as GARCH (see chapter 31).

Despite the points made above, some residual degree of heteroskedasticity may be present in time series data. The key point is that, in most cases, it is likely to be combined with serial correlation (autocorrelation), hence demanding a special treatment. In White's approach, Ω̂, the estimated covariance matrix of the u_t, remains conveniently diagonal: the variances, E(u_t²), may differ by t, but the covariances, E(u_t u_s) for s ≠ t, are all zero. Autocorrelation in time series data means that at least some of the off-diagonal elements of Ω̂ should be non-zero. This introduces a substantial complication, and requires another piece of terminology: estimates of the covariance matrix that are asymptotically valid in face of both heteroskedasticity and autocorrelation of the error process are termed HAC (heteroskedasticity- and autocorrelation-consistent).

The issue of HAC estimation is treated in more technical terms in chapter 27. Here we try to convey some of the intuition at a more basic level. We begin with a general comment: residual autocorrelation is not so much a property of the data as a symptom of an inadequate model. Data may be persistent through time, and if we fit a model that does not take this aspect into account properly, we end up with a model with autocorrelated disturbances. Conversely, it is often possible to mitigate or even eliminate the problem of autocorrelation by including relevant lagged variables in a time series model, or, in other words, by specifying the dynamics of the model more fully. HAC estimation should not be seen as the first resort in dealing with an autocorrelated error process.

That said, the "obvious" extension of White's HCCME to the case of autocorrelated errors would seem to be this: estimate the off-diagonal elements of Ω̂ (that is, the autocovariances, E(u_t u_s)) using, once again, the appropriate OLS residuals: ω̂_ts = û_t û_s. This is basically right, but demands an important amendment. We seek a consistent estimator, one that converges towards the true Ω as the sample size tends towards infinity. This cannot work if we allow unbounded serial dependence. A larger sample will enable us to estimate more of the true ω_ts elements (that is, for t and s more widely separated in time), but it will not contribute ever-increasing information regarding the maximally separated ω_ts pairs, since the maximal separation itself grows with the sample size. To ensure consistency, we have to confine our attention to processes exhibiting temporally limited dependence. In other words, we cut off the computation of the ω̂_ts values at some maximum value of p = t − s, where p is treated as an increasing function of the sample size, T, although it cannot increase in proportion to T.

The simplest variant of this idea is to truncate the computation at some finite lag order p, where p grows as, say, T^{1/4}. The trouble with this is that the resulting Ω̂ may not be a positive definite matrix: in practical terms, we may end up with negative estimated variances. One solution to this problem is offered by the Newey-West estimator (Newey and West, 1987), which assigns declining weights to the sample autocovariances as the temporal separation increases.

To understand this point it is helpful to look more closely at the covariance matrix given in (22.5), namely

  (X'X)^{-1} (X'\hat{\Omega}X) (X'X)^{-1}

This is known as a "sandwich" estimator. The "bread", which appears on both sides, is (X'X)^{-1}. This k x k matrix is also the key ingredient in the computation of the classical covariance matrix. The "filling" in the sandwich is

  \hat{\Sigma} = \hat{\Gamma}(0) + \sum_{j=1}^{p} w_j \left( \hat{\Gamma}(j) + \hat{\Gamma}(j)' \right)

where w_j is the weight given to lag j > 0 and the k x k matrix Γ̂(j), for j ≥ 0, is given by
given by

  \hat{\Gamma}(j) = \sum_{t=j+1}^{T} \hat{u}_t \hat{u}_{t-j} X_t' X_{t-j}

that is, the sample autocovariance matrix of X_t' \hat{u}_t at lag j, apart from a scaling factor of T.

This leaves two questions. How exactly do we determine the maximum lag length or "bandwidth", p, of the HAC estimator? And how exactly are the weights w_j to be determined? We will return to the (difficult) question of the bandwidth shortly. As regards the weights, gretl offers three variants. The default is the Bartlett kernel, as used by Newey and West. This sets

  w_j = 1 - j/(p+1)   for j <= p
  w_j = 0             for j > p

so the weights decline linearly as j increases. The other two options are the Parzen kernel and the Quadratic Spectral (QS) kernel. For the Parzen kernel,

  w_j = 1 - 6 a_j^2 + 6 a_j^3   for 0 <= a_j <= 0.5
  w_j = 2 (1 - a_j)^3           for 0.5 < a_j <= 1
  w_j = 0                       for a_j > 1

where a_j = j/(p+1), and for the QS kernel,

  w_j = \frac{25}{12 \pi^2 d_j^2} \left( \frac{\sin m_j}{m_j} - \cos m_j \right)

where d_j = j/p and m_j = 6 \pi d_j / 5.

Figure 22.1 shows the weights generated by these kernels, for p = 4 and j = 1 to 9.

As for the bandwidth p, gretl offers two "rules of thumb" that make p an increasing function of the sample size T, namely

  nw1:  p = 0.75 T^{1/3}
  nw2:  p = 4 (T/100)^{2/9}

(with the result truncated to an integer). These are selected via the set command, as in

  set hac_lag nw1

or

  set hac_lag nw2

As shown in Table 22.1, the choice between nw1 and nw2 does not make a great deal of difference.

    T    p (nw1)   p (nw2)
   50       2         3
  100       3         4
  150       3         4
  200       4         4
  300       5         5
  400       5         5

  Table 22.1: HAC bandwidth: two rules of thumb

You also have the option of specifying a fixed numerical value for p, as in

  set hac_lag 6

In addition you can set a distinct bandwidth for use with the Quadratic Spectral kernel (since this need not be an integer). For example,

  set qs_bandwidth 3.5

Prewhitening and data-based bandwidth selection

An alternative approach is to deal with residual autocorrelation by attacking the problem from two sides. The intuition behind the technique known as VAR prewhitening (Andrews and Monahan, 1992) can be illustrated by a simple example. Let x_t be a sequence of first-order autocorrelated random variables,

  x_t = \rho x_{t-1} + u_t

The long-run variance of x_t can be shown to be

  V_{LR}(x_t) = \frac{V_{LR}(u_t)}{(1 - \rho)^2}

In most cases, u_t is likely to be less autocorrelated than x_t, so a smaller bandwidth should suffice. Estimation of V_{LR}(x_t) can therefore proceed in three steps: (1) estimate \rho; (2) obtain a HAC estimate of \hat{u}_t = x_t - \hat{\rho} x_{t-1}; and (3) divide the result by (1 - \hat{\rho})^2.

The application of the above concept to our problem implies estimating a finite-order Vector Autoregression (VAR) on the vector variables \xi_t = X_t' \hat{u}_t. In general, the VAR can be of any order, but in most cases 1 is sufficient; the aim is not to build a watertight model for \xi_t, but just to "mop up" a substantial part of the autocorrelation. Hence, the following VAR is estimated:

  \xi_t = A \xi_{t-1} + \varepsilon_t

Then an estimate of the matrix X' \Omega X can be recovered via

  (I - \hat{A})^{-1} \hat{\Sigma}_\varepsilon (I - \hat{A}')^{-1}

where \hat{\Sigma}_\varepsilon is any HAC estimator, applied to the VAR residuals.

You can ask for prewhitening in gretl using

  set hac_prewhiten on

There is at present no mechanism for specifying an order other than 1 for the initial VAR.

A further refinement is available in this context, namely data-based bandwidth selection. It makes intuitive sense that the HAC bandwidth should not simply be based on the size of the sample, but should somehow take into account the time-series properties of the data (and also the kernel chosen). A nonparametric method for doing this was proposed by Newey and West (1994); a good concise account of the method is given in Hall (2005). This option can be invoked in gretl via

  set hac_lag nw3

This option is the default when prewhitening is selected, but you can override it by giving a specific numerical value for hac_lag.
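Putting these pieces together, a typical script preamble for HAC estimation might look as follows. This is only a sketch: the kernel and bandwidth choices shown are purely illustrative, and y, x1 and x2 stand in for your own series. (With time-series data, the --robust flag to ols yields HAC standard errors by default.)

  # illustrative settings only; y, x1 and x2 are placeholders
  set hac_kernel parzen     # the default is bartlett; qs is also available
  set hac_lag nw2           # or nw1, nw3, or a fixed integer
  set hac_prewhiten on      # Andrews-Monahan prewhitening
  ols y 0 x1 x2 --robust    # HAC standard errors for time series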
Even the Newey-West data-based method does not fully pin down the bandwidth for any particular sample. The first step involves calculating a series of "residual covariances". The length of this series is given as a function of the sample size, but only up to a scalar multiple; for example, it is given as O(T^{2/9}) for the Bartlett kernel. Gretl uses an implied multiple of 1.

Newey-West with missing values

If the estimation sample for a time-series model includes incomplete observations (where the value of the dependent variable, or one or more regressors, is missing) the Newey-West procedure must be either modified or abandoned, since some ingredients of the \hat{\Sigma} matrix defined above will be absent. Two modified methods have been discussed in the literature. Parzen (1963) proposed what he called "Amplitude Modulation" (AM), which involves setting the values of the residual and each of the regressors to zero for the incomplete observations (and then proceeding as usual). Datta and Du (2012) propose the so-called "Equal Spacing" (ES) method: calculate as if the incomplete observations did not exist, and the complete observations therefore form an equally-spaced series. Somewhat surprisingly, it can be shown that both of these methods have appropriate asymptotic properties; see Rho and Vogelsang (2018) for further elaboration.

In gretl you can select a preferred method via one or other of these commands:

  set hac_missvals es    # ES (Datta and Du)
  set hac_missvals am    # AM (Parzen)
  set hac_missvals off

The ES method is the default. The off option means that gretl will refuse to produce HAC standard errors when the sample includes incomplete observations: use this if you have qualms about the modified methods.

VARs: a special case

A well-specified vector autoregression (VAR) will generally include enough lags of the dependent variables to obviate the problem of residual autocorrelation, in which case HAC estimation is redundant (although there may still be a need to correct for heteroskedasticity). For that reason plain HCCME, and not HAC, is the default when the --robust flag is given in the context of the var command. However, if for some reason you need HAC you can force the issue by giving the option --robust-hac.
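As a brief illustration (a sketch only, using the denmark.gdt sample file distributed with gretl), the two variance options for a VAR can be compared as follows:

  open denmark.gdt
  var 2 LRM LRY --robust        # plain HCCME, the default for VARs
  var 2 LRM LRY --robust-hac    # force HAC estimation instead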
Long-run variance

Let us expand a little on the subject of the long-run variance that was mentioned above, and the associated tools offered by gretl for scripting. (You may also want to check out the reference for the lrcovar function, for the multivariate case.) As is well known, the variance of the average of T random variables x_1, x_2, ..., x_T with equal variance \sigma^2 equals \sigma^2/T if the data are uncorrelated. In this case, the sample variance of x_t over the sample size provides a consistent estimator.

If, however, there is serial correlation among the x_t's, the variance of \bar{X} = T^{-1} \sum_{t=1}^{T} x_t must be estimated differently. One of the most widely used statistics for this purpose is a nonparametric kernel estimator with the Bartlett kernel, defined as

  \hat{\omega}^2(k) = T^{-1} \sum_{t=k}^{T-k} \left[ \sum_{i=-k}^{k} w_i (x_t - \bar{X})(x_{t-i} - \bar{X}) \right]

where the integer k is known as the window size and the w_i terms are the so-called Bartlett weights, defined as w_i = 1 - |i|/(k+1). It can be shown that, for k large enough, \hat{\omega}^2(k)/T yields a consistent estimator of the variance of \bar{X}.

gretl implements this estimator by means of the function lrvar. This function takes one required argument, namely the series whose long-run variance is to be estimated, followed by two optional arguments. The first of these can be used to supply a value for k; if it is omitted or negative, the "popular" choice T^{1/3} is used. The second allows specification of an assumed value for the population mean of X, which then replaces \bar{X} in the variance calculation. Usage is illustrated below.

  # automatic window size; use xbar for mean
  lrs2 = lrvar(x)
  # set a window size of 12
  lrs2 = lrvar(x, 12)
  # set window size and impose assumed mean of zero
  lrs2 = lrvar(x, 12, 0)
  # impose mean zero, automatic window size
  lrs2 = lrvar(x, -1, 0)

22.4 Special issues with panel data

Since panel data have both a time-series and a cross-sectional dimension, one might expect that, in general, robust estimation of the covariance matrix would require handling both heteroskedasticity and autocorrelation (the HAC approach). In addition, some special features of panel data require attention:

- The variance of the error term may differ across the cross-sectional units.
- The covariance of the errors across the units may be nonzero in each time period.
- If the "between" variation is not swept out, the errors may exhibit autocorrelation, not in the usual time-series sense but in the sense that the mean value of the error term may differ across units. This is relevant when estimation is by pooled OLS.

Gretl currently offers two robust covariance matrix estimators specifically for panel data. These are available for models estimated via fixed effects, random effects, pooled OLS, and pooled two-stage least squares. The default robust estimator is that suggested by Arellano (2003), which is HAC provided the panel is of the "large n, small T" variety (that is, many units are observed in relatively few periods). The Arellano estimator is

  \hat{\Sigma}_A = (X'X)^{-1} \left( \sum_{i=1}^{n} X_i' \hat{u}_i \hat{u}_i' X_i \right) (X'X)^{-1}

where X is the matrix of regressors (with the group means subtracted, in the case of fixed effects, or quasi-demeaned, in the case of random effects), \hat{u}_i denotes the vector of residuals for unit i, and n is the number of cross-sectional units.² Cameron and Trivedi (2005) make a strong case for using this estimator: they note that the ordinary White HCCME can produce misleadingly small standard errors in the panel context because it fails to take autocorrelation into account.³ In addition, Stock and Watson (2008) show that the White HCCME is inconsistent in the fixed-effects panel context for fixed T > 2.

²This variance estimator is also known as the "clustered (over entities)" estimator.
³See also Cameron and Miller (2015) for a discussion of the Arellano-type estimator in the context of the random effects model.

In cases where autocorrelation is not an issue, the estimator proposed by Beck and Katz (1995) and discussed by Greene (2003, chapter 13) may be appropriate. This estimator, which takes into account contemporaneous correlation across the units and heteroskedasticity by unit, is

  \hat{\Sigma}_{BK} = (X'X)^{-1} \left( \sum_{i=1}^{n} \sum_{j=1}^{n} \hat{\sigma}_{ij} X_i' X_j \right) (X'X)^{-1}

The covariances \hat{\sigma}_{ij} are estimated via

  \hat{\sigma}_{ij} = \frac{\hat{u}_i' \hat{u}_j}{T}

where T is the length of the time series for each unit. Beck and Katz call the associated standard errors "Panel-Corrected Standard Errors" (PCSE). This estimator can be invoked in gretl via the command

  set pcse on

The Arellano default can be re-established via

  set pcse off

Note that regardless of the pcse setting, the robust estimator is not used unless the --robust flag is given, or the "Robust" box is checked in the GUI program.
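For instance (a sketch, using the abdata.gdt sample file supplied with gretl), the two panel estimators might be compared as follows:

  open abdata.gdt
  panel n const w k --fixed-effects --robust   # Arellano (the default)
  set pcse on
  panel n const w k --fixed-effects --robust   # Beck-Katz PCSE
  set pcse off                                 # restore the default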
22.5 The cluster-robust estimator

One further variance estimator is available in gretl, namely the "cluster-robust" estimator. This may be appropriate (for cross-sectional data, mostly) when the observations naturally fall into groups or clusters, and one suspects that the error term may exhibit dependency within the clusters and/or have a variance that differs across clusters. Such clusters may be binary (e.g. employed versus unemployed workers), categorical with several values (e.g. products grouped by manufacturer) or ordinal (e.g. individuals with low, middle or high education levels).

For linear regression models estimated via least squares, the cluster estimator is defined as

  \hat{\Sigma}_C = (X'X)^{-1} \left( \sum_{j=1}^{m} X_j' \hat{u}_j \hat{u}_j' X_j \right) (X'X)^{-1}

where m denotes the number of clusters, and X_j and \hat{u}_j denote, respectively, the matrix of regressors and the vector of residuals that fall within cluster j. As noted above, the Arellano variance estimator for panel data models is a special case of this, where the clustering is by panel unit.

For models estimated by the method of Maximum Likelihood (in which case the standard variance estimator is the inverse of the negative Hessian, H), the cluster estimator is

  \hat{\Sigma}_C = H^{-1} \left( \sum_{j=1}^{m} G_j' G_j \right) H^{-1}

where G_j is the sum of the "score" (that is, the derivative of the log-likelihood with respect to the parameter estimates) across the observations falling within cluster j.

It is common to apply a degrees-of-freedom adjustment to these estimators; otherwise the variance may appear misleadingly small in comparison with other estimators, if the number of clusters is small. In the least squares case the factor is

  \frac{m}{m-1} \cdot \frac{n-1}{n-k}

where n is the total number of observations and k is the number of parameters estimated; in the case of ML estimation the factor is just m/(m-1).

Availability and syntax

The cluster-robust estimator is currently available for models estimated via OLS and TSLS, and also for most ML estimators other than those specialized for time-series data: binary logit and probit, ordered logit and probit, multinomial logit, Tobit, interval regression, biprobit, count models and duration models. Additionally, the same option is available for generic maximum likelihood estimation as provided by the mle command (see chapter 26 for extra details).

In all cases the syntax is the same: you give the option flag --cluster= followed by the name of the series to be used to define the clusters, as in

  ols y 0 x1 x2 --cluster=cvar

The specified clustering variable must (a) be defined (not missing) at all observations used in estimating the model and (b) take on at least two distinct values over the estimation range. The clusters are defined as sets of observations having a common value for the clustering variable. It is generally expected that the number of clusters is substantially less than the total number of observations.

Chapter 23  Panel data

A panel dataset is one in which each of N > 1 units (sometimes called "individuals" or "groups") is observed over time. In a balanced panel there are T > 1 observations on each unit; more generally the number of observations may differ by unit. In the following we index units by i and time by t. To allow for imbalance in a panel we use the notation T_i to refer to the number of observations for unit or individual i.

23.1 Estimation of panel models

Pooled Ordinary Least Squares

The simplest estimator for panel data is pooled OLS. In most cases this is unlikely to be adequate, but it provides a baseline for comparison with more complex estimators.

If you estimate a model on panel data using OLS, an additional test item becomes available. In the GUI model window this is the item "panel specification" under the Tests menu; the script counterpart is the panspec command.

To take advantage of this test, you should specify a model without any dummy variables representing cross-sectional units. The test compares pooled OLS against the principal alternatives, the fixed effects and random effects models. These alternatives are explained in the following section.
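For example (a sketch, again using the abdata.gdt sample file), one might run the baseline pooled regression and then call for the specification test:

  open abdata.gdt
  ols n const w k    # pooled OLS, no unit dummies
  panspec            # compare against fixed and random effects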
The fixed and random effects models

In the graphical interface these options are found under the menu item "Model/Panel/Fixed and random effects". In the command-line interface one uses the panel command, with or without the --random-effects option. (The --fixed-effects option is also allowed, but not strictly necessary, being the default.)

This section explains the nature of these models and comments on their estimation via gretl.

The pooled OLS specification may be written as

  y_{it} = X_{it} \beta + u_{it}    (23.1)

where y_{it} is the observation on the dependent variable for cross-sectional unit i in period t, X_{it} is a 1 x k vector of independent variables observed for unit i in period t, \beta is a k x 1 vector of parameters, and u_{it} is an error or disturbance term specific to unit i in period t.

The fixed and random effects models have in common that they decompose the unitary pooled error term, u_{it}. For the fixed effects model, we write u_{it} = \alpha_i + \varepsilon_{it}, yielding

  y_{it} = X_{it} \beta + \alpha_i + \varepsilon_{it}    (23.2)

That is, we decompose u_{it} into a unit-specific and time-invariant component, \alpha_i, and an observation-specific error, \varepsilon_{it}.¹ The \alpha_i's are then treated as fixed parameters (in effect, unit-specific y-intercepts), which are to be estimated. This can be done by including a dummy variable for each cross-sectional unit (and suppressing the global constant). This is sometimes called the Least Squares Dummy Variables (LSDV) method. Alternatively, one can subtract the group mean from each of the variables and estimate a model without a constant. In the latter case the dependent variable may be written as

  \tilde{y}_{it} = y_{it} - \bar{y}_i

The "group mean" \bar{y}_i is defined as

  \bar{y}_i = \frac{1}{T_i} \sum_{t=1}^{T_i} y_{it}

where T_i is the number of observations for unit i. An exactly analogous formulation applies to the independent variables. Given parameter estimates, \hat{\beta}, obtained using such demeaned data, we can recover estimates of the \alpha_i's using

  \hat{\alpha}_i = \frac{1}{T_i} \sum_{t=1}^{T_i} \left( y_{it} - X_{it} \hat{\beta} \right)

¹It is possible to break a third component out of u_{it}, namely w_t, a shock that is time-specific but common to all the units in a given period. In the interest of simplicity we do not pursue that option here.

These two methods (LSDV, and using demeaned data) are numerically equivalent. gretl takes the approach of demeaning the data. If you have a small number of cross-sectional units, a large number of time-series observations per unit, and a large number of regressors, it is more economical in terms of computer memory to use LSDV. If need be you can easily implement this manually. For example,

  genr unitdum
  ols y x du*

(See Chapter 10 for details on unitdum.)

The \hat{\alpha}_i estimates are not printed as part of the standard model output in gretl (there may be a large number of these, and typically they are not of much inherent interest). However, you can retrieve them after estimation of the fixed effects model if you wish. In the graphical interface, go to the "Save" menu in the model window and select "per-unit constants". In command-line mode, you can do

  series newname = $ahat

where newname is the name you want to give the series.
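To give a concrete (and merely illustrative) instance, using the abdata.gdt sample file:

  open abdata.gdt
  panel n const w k --fixed-effects
  series alpha_i = $ahat       # per-unit intercepts
  summary alpha_i --simple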
For the random effects model we write u_{it} = v_i + \varepsilon_{it}, so the model becomes

  y_{it} = X_{it} \beta + v_i + \varepsilon_{it}    (23.3)

In contrast to the fixed effects model, the v_i's are not treated as fixed parameters, but as random drawings from a given probability distribution.

The celebrated Gauss-Markov theorem, according to which OLS is the best linear unbiased estimator (BLUE), depends on the assumption that the error term is independently and identically distributed (IID). In the panel context, the IID assumption means that E(u_{it}^2), in relation to equation (23.1), equals a constant, \sigma_u^2, for all i and t, while the covariance E(u_{is} u_{it}) equals zero for all s \neq t and the covariance E(u_{jt} u_{it}) equals zero for all j \neq i.

If these assumptions are not met (and they are unlikely to be met in the context of panel data), OLS is not the most efficient estimator. Greater efficiency may be gained using generalized least squares (GLS), taking into account the covariance structure of the error term.

Consider observations on a given unit i at two different times, s and t. From the hypotheses above it can be worked out that Var(u_{is}) = Var(u_{it}) = \sigma_v^2 + \sigma_\varepsilon^2, while the covariance between u_{is} and u_{it} is given by E(u_{is} u_{it}) = \sigma_v^2.

In matrix notation, we may group all the T_i observations for unit i into the vector y_i and write it as

  y_i = X_i \beta + u_i    (23.4)

The vector u_i, which includes all the disturbances for individual i, has a variance-covariance matrix given by

  Var(u_i) = \Sigma_i = \sigma_\varepsilon^2 I + \sigma_v^2 J

where J is a square matrix with all elements equal to 1. It can be shown that the matrix

  K_i = I - \frac{\theta_i}{T_i} J

where

  \theta_i = 1 - \sqrt{ \frac{\sigma_\varepsilon^2}{\sigma_\varepsilon^2 + T_i \sigma_v^2} }

has the property

  K_i \Sigma_i K_i = \sigma_\varepsilon^2 I

It follows that the transformed system

  K_i y_i = K_i X_i \beta + K_i u_i    (23.6)

satisfies the Gauss-Markov conditions, and OLS estimation of (23.6) provides efficient inference. But since

  K_i y_i = y_i - \theta_i \bar{y}_i

GLS estimation is equivalent to OLS using "quasi-demeaned" variables, that is, variables from which we subtract a fraction \theta of their average. Notice that for \sigma_\varepsilon^2 \to 0, \theta \to 1, while for \sigma_v^2 \to 0, \theta \to 0. This means that if all the variance is attributable to the individual effects, then the fixed effects estimator is optimal; if, on the other hand, individual effects are negligible, then pooled OLS turns out, unsurprisingly, to be the optimal estimator.

To implement the GLS approach we need to calculate \theta, which in turn requires estimates of the two variances \sigma_\varepsilon^2 and \sigma_v^2. (These are often referred to as the "within" and "between" variances, respectively, since the former refers to variation within each cross-sectional unit and the latter to variation between the units.) Several means of estimating these magnitudes have been suggested in the literature (see Baltagi, 1995). By default gretl uses the method of Swamy and Arora (1972): \sigma_\varepsilon^2 is estimated by the residual variance from the fixed effects model, and \sigma_v^2 is estimated indirectly with the help of the "between" regression, which uses the group means of all the relevant variables:

  \bar{y}_i = \bar{X}_i \beta + e_i

The residual variance from this regression, s_e^2, can be shown to estimate the sum \sigma_v^2 + \sigma_\varepsilon^2 / T. An estimate of \sigma_v^2 can therefore be obtained by subtracting 1/T times the estimate of \sigma_\varepsilon^2 from s_e^2:

  \hat{\sigma}_v^2 = s_e^2 - \hat{\sigma}_\varepsilon^2 / T    (23.7)

Alternatively, if the --nerlove option is given, gretl uses the method suggested by Nerlove (1971). In this case \sigma_v^2 is estimated as the sample variance of the fixed effects, \hat{\alpha}_i:

  \hat{\sigma}_v^2 = \frac{1}{N-1} \sum_{i=1}^{N} \left( \hat{\alpha}_i - \bar{\alpha} \right)^2

where N is the number of individuals and \bar{\alpha} is the mean of the estimated fixed effects.

Swamy and Arora's equation (23.7) involves T, hence assuming a balanced panel. When the number of time-series observations, T_i, differs across individuals, some sort of adjustment is needed. By default gretl follows Stata by using the harmonic mean of the T_i's in place of T. It may be argued, however, that a more substantial adjustment is called for in the unbalanced case. Baltagi and Chang (1994) recommend a variant of Swamy-Arora which involves T_i-weighted estimation of the "between" regression, on the basis that units with more observations will be more informative about the variance of interest. In gretl one can switch to the Baltagi-Chang variant by giving the --unbalanced option with the panel command. But the gain in efficiency from doing so may well be slim; for a discussion of this point and related matters see Cottrell (2017).

Unbalancedness also affects the Nerlove (1971) estimator, but the econometric literature offers no guidance on the details. Gretl uses the weighted average of the fixed effects as a natural extension of the original method. Again, see Cottrell (2017) for further details.
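In script form, the alternatives just described might be compared as follows (a sketch, using abdata.gdt, which is an unbalanced panel):

  open abdata.gdt
  panel n const w k --random-effects                # Swamy-Arora (default)
  panel n const w k --random-effects --nerlove      # Nerlove variant
  panel n const w k --random-effects --unbalanced   # Baltagi-Chang variant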
Choice of estimator

Which panel method should one use, fixed effects or random effects?

One way of answering this question is in relation to the nature of the data set. If the panel comprises observations on a fixed and relatively small set of units of interest (say, the member states of the European Union), there is a presumption in favor of fixed effects. If it comprises observations on a large number of randomly selected individuals (as in many epidemiological and other longitudinal studies), there is a presumption in favor of random effects.

Besides this general heuristic, however, various statistical issues must be taken into account.

1. Some panel data sets contain variables whose values are specific to the cross-sectional unit but which do not vary over time. If you want to include such variables in the model, the fixed effects option is simply not available. When the fixed effects approach is implemented using dummy variables, the problem is that the time-invariant variables are perfectly collinear with the per-unit dummies. When using the approach of subtracting the group means, the issue is that after de-meaning these variables are nothing but zeros.

2. A somewhat analogous issue arises with the random effects estimator. As mentioned above, the default Swamy-Arora method relies on the group-means regression to obtain a measure of the between variance. Suppose we have observations on n units or individuals and there are k independent variables of interest. If k > n, this regression cannot be run, since we have only n effective observations, and hence Swamy-Arora estimates cannot be obtained. In this case, however, it is possible to use Nerlove's method instead.

If both fixed effects and random effects are feasible for a given specification and dataset, the choice between these estimators may be expressed in terms of the two econometric desiderata, efficiency and consistency. From a purely statistical viewpoint, we could say that there is a tradeoff between robustness and efficiency. In the fixed effects approach, we do not make any hypotheses on the "group effects" (that is, the time-invariant differences in mean between the groups) beyond the fact that they exist, and that they can be tested; see below. As a consequence, once these effects are swept out by taking deviations from the group means, the remaining parameters can be estimated.

On the other hand, the random effects approach attempts to model the group effects as drawings from a probability distribution, instead of removing them. This requires that individual effects are representable as a legitimate part of the disturbance term, that is, zero-mean random variables, uncorrelated with the regressors.

As a consequence, the fixed-effects estimator "always works", but at the cost of not being able to estimate the effect of time-invariant regressors. The richer hypothesis set of the random-effects estimator ensures that parameters for time-invariant regressors can be estimated, and that estimation of the parameters for time-varying regressors is carried out more efficiently. These advantages, though, are tied to the validity of the additional hypotheses. If, for example, there is reason to think that individual effects may be correlated with some of the explanatory variables, then the random-effects estimator would be inconsistent, while fixed-effects estimates would still be valid. The Hausman test is built on this principle (see below): if the fixed- and random-effects estimates agree, to within the usual statistical margin of error, there is no reason to think the additional hypotheses invalid, and as a consequence, no reason not to use the more efficient RE estimator.
Testing panel models

If you estimate a fixed effects or random effects model in the graphical interface, you may notice that the number of items available under the "Tests" menu in the model window is relatively limited. Panel models carry certain complications that make it difficult to implement all of the tests one expects to see for models estimated on straight time-series or cross-sectional data.

Nonetheless, various panel-specific tests are printed along with the parameter estimates as a matter of course, as follows.

When you estimate a model using fixed effects, you automatically get an F-test for the null hypothesis that the cross-sectional units all have a common intercept. That is to say that all the \alpha_i's are equal, in which case the pooled model (23.1), with a column of 1s included in the X matrix, is adequate.

When you estimate using random effects (RE), the Breusch-Pagan and Hausman tests are presented automatically. To save their results in the context of a script, one would copy the bp_test or hausman_test bundles, which are nested inside the $model bundle. Both of these inner bundles contain the elements test, dfn (degrees of freedom) and pvalue.

The Breusch-Pagan test is the counterpart to the F-test mentioned above. The null hypothesis is that the variance of v_i in equation (23.3) equals zero; if this hypothesis is not rejected, then again we conclude that the simple pooled model is adequate. If the panel is unbalanced, the method from Baltagi and Li (1990) is used to perform the Breusch-Pagan test for individual effects.
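A small sketch of this retrieval (using abdata.gdt; the bundle keys are as given above):

  open abdata.gdt
  panel n const w k --random-effects --quiet
  bundle b = $model
  bundle ht = b["hausman_test"]
  printf "Hausman: %g (df = %d, p-value = %g)\n", ht.test, ht.dfn, ht.pvalue
  bundle bp = b["bp_test"]
  printf "Breusch-Pagan: %g (p-value = %g)\n", bp.test, bp.pvalue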
The Hausman test probes the consistency of the GLS estimates. The null hypothesis is that these estimates are consistent, that is, that the requirement of orthogonality of the v_i and the X_i is satisfied. The test is based on a measure, H, of the "distance" between the fixed-effects and random-effects estimates, constructed such that under the null it follows the \chi^2 distribution with degrees of freedom equal to the number of time-varying regressors in the matrix X. If the value of H is "large" this suggests that the random effects estimator is not consistent and the fixed-effects model is preferable.

There are two ways of calculating H, the matrix-difference method and the regression method. The procedure for the matrix-difference method is this:

- Collect the fixed-effects estimates in a vector \tilde{\beta} and the corresponding random-effects estimates in \hat{\beta}, then form the difference vector (\tilde{\beta} - \hat{\beta}).
- Form the covariance matrix of the difference vector as Var(\tilde{\beta} - \hat{\beta}) = Var(\tilde{\beta}) - Var(\hat{\beta}) = \Psi, where Var(\tilde{\beta}) and Var(\hat{\beta}) are estimated by the sample variance matrices of the fixed- and random-effects models respectively.
- Compute

  H = (\tilde{\beta} - \hat{\beta})' \Psi^{-1} (\tilde{\beta} - \hat{\beta})

Given the relative efficiencies of \tilde{\beta} and \hat{\beta}, the matrix \Psi "should be" positive definite, in which case H is positive, but in finite samples this is not guaranteed, and of course a negative \chi^2 value is not admissible.

The regression method avoids this potential problem. The procedure is to estimate, via OLS, an augmented regression in which the dependent variable is quasi-demeaned y and the regressors include both quasi-demeaned X (as in the RE specification) and the de-meaned variants of all the time-varying variables (i.e. the fixed-effects regressors). The Hausman null then implies that the coefficients on the latter subset of regressors should be statistically indistinguishable from zero.

If the RE specification employs the default covariance-matrix estimator (assuming IID errors), H can be obtained as follows:

- Treat the random-effects model as the restricted model, and record its sum of squared residuals as SSR_r.
- Estimate the augmented ("unrestricted") regression, and record its sum of squared residuals as SSR_u.
- Compute

  H = n \frac{SSR_r - SSR_u}{SSR_u}

  where n is the total number of observations used.

Alternatively, if the --robust option is selected for RE estimation, H is calculated as a Wald test based on a robust estimate of the covariance matrix of the augmented regression. Either way, H cannot be negative.

By default gretl computes the Hausman test via the regression method, but it uses the matrix-difference method if you pass the option --matrix-diff to the panel command.

Serial correlation

A simple test for first-order autocorrelation of the error term, namely the Durbin-Watson (DW) statistic, is printed as part of the output for pooled OLS as well as fixed-effects and random-effects estimation. Let us define "serial correlation proper" as serial correlation strictly in the time dimension of a panel dataset. When based on the residuals from fixed-effects estimation, the DW statistic is a test for serial correlation proper.⁴ The DW value shown in the case of random-effects estimation is based on the fixed-effects residuals. When DW is based on pooled OLS residuals, it tests for serial correlation proper only on the assumption of a common intercept. Put differently, in this case it tests a joint null hypothesis: absence of fixed effects plus absence of (first-order) serial correlation proper. In the presence of missing observations the DW statistic is calculated as described in Baltagi and Wu (1999) (their expression for d_1 under equation (16) on page 819).

⁴The generalization of the Durbin-Watson statistic from the straight time-series context to panel data is due to Bhargava et al. (1982).

When it is computed, the DW statistic can be retrieved via the accessor $dw after estimation. In addition, an approximate P-value for the null of no serial correlation (\rho = 0) against the alternative of \rho > 0 may be available via the accessor $dwpval. This is based on the analysis in Bhargava et al. (1982); strictly speaking, it is the marginal significance level of DW considered as a d_L value (the value below which the test rejects, as opposed to d_U, the value above which the test fails to reject). In the panel case, however, d_L and d_U are quite close, particularly when N (the number of individual units) is large. At present gretl does not attempt to compute such P-values when the number of observations differs across individuals.
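As an illustration (a sketch, using abdata.gdt):

  open abdata.gdt
  panel n const w k --fixed-effects --quiet
  printf "DW = %g\n", $dw
  # $dwpval gives an approximate p-value where available
  # (not computed when the panel is unbalanced)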
Robust standard errors

For most estimators, gretl offers the option of computing an estimate of the covariance matrix that is robust with respect to heteroskedasticity and/or autocorrelation (and hence also robust standard errors). In the case of panel data, robust covariance matrix estimators are available for the pooled, fixed effects and random effects models. See section 22.4 for details.

The constant in the fixed effects model

Users are sometimes puzzled by the constant or intercept reported by gretl on estimation of the fixed effects model: how can a constant remain when the group means have been subtracted from the data? The method of calculation of this term is a matter of convention, but the gretl authors decided to follow the convention employed by Stata; this involves adding the global mean back into the variables from which the group means have been removed.⁵ If you prefer to interpret the fixed effects model as "OLS plus unit dummies" throughout, it can be proven that this approach is equivalent to using centered unit dummies instead of plain 0/1 dummies.

⁵See Gould (2013) for an extended explanation.

The method that gretl uses internally is exemplified in Listing 23.1. The coefficients in the second OLS estimation, including the intercept, agree with those in the initial fixed effects model, though the standard errors differ due to a degrees-of-freedom correction in the fixed-effects covariance matrix. (Note that the pmean function returns the group mean of a series.) The third estimation (which produces quite a lot of output) instead uses the stdize function to create the centered dummies. It thereby shows the equivalence of the internally-used method to "OLS plus centered dummies". (Note that in this case the standard errors agree with the initial estimates.)

Listing 23.1: Calculating the intercept in the fixed effects model

  open abdata.gdt
  list X = w k ys # list of explanatory variables

  # built-in method
  panel n const X --fixed-effects

  # recentering "by hand"
  depvar = n - pmean(n) + mean(n) # redefine the dependent variable
  list indepvars = const
  loop foreach i X
      # redefine the explanatory variables
      x_$i = $i - pmean($i) + mean($i)
      indepvars += x_$i
  endloop
  ols depvar indepvars

  # perform estimation using centered dummies
  list C = dummify($unit) # create the unit dummies
  smpl n X --no-missing   # adjust to perform centering correctly
  list D = stdize(C, -1)  # center the unit dummies
  ols n const X D         # perform estimation

R-squared in the fixed effects model

There is no uniquely "correct" way of calculating R^2 in the context of the fixed-effects model. It may be argued that a measure of the squared correlation between the dependent variable and the prediction yielded by the model is a desirable descriptive statistic to have, but which model, and which variant of the dependent variable, are we talking about?

Fixed-effects models can be thought of in two equally defensible ways. From one perspective, they provide a nice, clean way of sweeping out individual effects, by using the fact that in the linear model a sufficient statistic is easy to compute. Alternatively, they provide a clever way to estimate the "important" parameters of a model in which you want to include (for whatever reason) a full set of individual dummies. If you take the second of these perspectives, your dependent variable is unmodified y and your model includes the unit dummies; the appropriate R^2 measure is then the squared correlation between y and the \hat{y} computed using both the measured individual effects and the effects of the explicitly named regressors. This is reported by gretl as the "LSDV R-squared". If you take the first point of view, on the other hand, your dependent variable is really y_{it} - \bar{y}_i and your model just includes the \beta terms, the coefficients of deviations of the x variables from their per-unit means. In this case, the relevant measure of R^2 is the so-called "within" R^2; this variant is printed by gretl for fixed-effects models in place of the adjusted R^2 (it being unclear in this case what exactly the "adjustment" should amount to anyway).

Residuals in the fixed and random effects models

After estimation of most kinds of models in gretl, you can retrieve a series containing the residuals using the $uhat accessor. This is true of the fixed and random effects models, but the exact meaning of gretl's $uhat in these cases requires a little explanation.

Consider first the fixed effects model:

  y_{it} = X_{it} \beta + \alpha_i + \varepsilon_{it}

In this model gretl takes the "fitted value" ($yhat) to be \hat{\alpha}_i + X_{it} \hat{\beta}, and the residual ($uhat) to be y_{it} minus this fitted value. This makes sense because the fixed effects (the \alpha_i terms) are taken as parameters to be estimated. However, it can be argued that the fixed effects are not really "explanatory", and if one defines the residual as the observed y_{it} value minus its "explained" component one might prefer to see just y_{it} - X_{it} \hat{\beta}.
You can get this after fixed-effects estimation as follows:

  series ue_fe = $uhat + $ahat - $coeff[1]

where $ahat gives the unit-specific intercept (as it would be calculated if one included all N unit dummies and omitted a common y-intercept), and $coeff[1] gives the "global" y-intercept.

Now consider the random-effects model:

  y_{it} = X_{it} \beta + v_i + \varepsilon_{it}

In this case gretl considers the error term to be v_i + \varepsilon_{it} (since v_i is conceived as a random drawing) and the $uhat series is an estimate of this, namely

  y_{it} - X_{it} \hat{\beta}

What if you want an estimate of just v_i (or just \varepsilon_{it}) in this case? This poses a "signal-extraction" problem: given the composite residual, how to recover an estimate of its components? The solution is to ascribe to the individual effect, v_i, a suitable fraction of the mean residual per individual, \bar{u}_i = T_i^{-1} \sum_{t=1}^{T_i} \hat{u}_{it}. The "suitable fraction" is the proportion of the variance of \bar{u}_i that is due to v_i, namely

  \frac{\sigma_v^2}{\sigma_v^2 + \sigma_\varepsilon^2 / T_i} = 1 - (1 - \theta_i)^2

After random effects estimation in gretl you can access a series containing the \hat{v}_i's under the name $ahat. This series can be calculated by hand as follows:

  # case 1: balanced panel
  scalar theta = $["theta"]
  series vhat = (1 - (1 - theta)^2) * pmean($uhat)

  # case 2: unbalanced; Ti varies by individual
  scalar s2v = $["s2v"]
  scalar s2e = $["s2e"]
  series frac = s2v / (s2v + s2e / pnobs($uhat))
  series ahat = frac * pmean($uhat)

If an estimate of \varepsilon_{it} is wanted, it can then be obtained by subtraction from $uhat.

23.2 Autoregressive panel models

Special problems arise when a lag of the dependent variable is included among the regressors in a panel model. Consider a dynamic variant of the pooled model (eq. 23.1):

  y_{it} = X_{it} \beta + \rho y_{i,t-1} + u_{it}    (23.9)

First, if the error u_{it} includes a group effect, v_i, then y_{i,t-1} is bound to be correlated with the error, since the value of v_i affects y_i at all t. That means that OLS applied to (23.9) will be inconsistent as well as inefficient. The fixed-effects model sweeps out the group effects and so overcomes this particular problem, but a subtler issue remains, which applies to both fixed and random effects estimation. Consider the de-meaned representation of fixed effects, as applied to the dynamic model,

  \tilde{y}_{it} = \tilde{X}_{it} \beta + \rho \tilde{y}_{i,t-1} + \varepsilon_{it}

where \tilde{y}_{it} = y_{it} - \bar{y}_i and \varepsilon_{it} = u_{it} - \bar{u}_i (or u_{it} - \alpha_i, using the notation of equation 23.2). The trouble is that \tilde{y}_{i,t-1} will be correlated with \varepsilon_{it} via the group mean, \bar{y}_i: the disturbance \varepsilon_{it} influences y_{it} directly, which influences \bar{y}_i, which, by construction, affects the value of \tilde{y}_{it} for all t. The same issue arises in relation to the quasi-demeaning used for random effects. Estimators which ignore this correlation will be consistent only as T \to \infty (in which case the marginal effect of \varepsilon_{it} on the group mean of y tends to vanish).

One strategy for handling this problem, and producing consistent estimates of \beta and \rho, was proposed by Anderson and Hsiao (1981). Instead of de-meaning the data, they suggest taking the first difference of (23.9), an alternative tactic for sweeping out the group effects:

  \Delta y_{it} = \Delta X_{it} \beta + \rho \Delta y_{i,t-1} + \eta_{it}    (23.10)

where \eta_{it} = \Delta u_{it} = \Delta (v_i + \varepsilon_{it}) = \varepsilon_{it} - \varepsilon_{i,t-1}. We're not in the clear yet, given the structure of the error \eta_{it}: the disturbance \varepsilon_{i,t-1} is an influence on both \eta_{it} and \Delta y_{i,t-1} = y_{i,t-1} - y_{i,t-2}. The next step is then to find an instrument for the "contaminated" \Delta y_{i,t-1}. Anderson and Hsiao suggest using either y_{i,t-2} or \Delta y_{i,t-2}, both of which will be uncorrelated with \eta_{it} provided that the underlying errors, \varepsilon_{it}, are not themselves serially correlated.

The Anderson-Hsiao estimator is not provided as a built-in function in gretl, since gretl's sensible handling of lags and differences for panel data makes it a simple application of regression with instrumental variables; see Listing 23.2, which is based on a study of country growth rates by Nerlove (1999).⁷

⁷Also see Clint Cummins' benchmarks page, http://www.stanford.edu/~clint/bench/.
Listing 23.2: The Anderson-Hsiao estimator for a dynamic panel model

  # Penn World Table data as used by Nerlove
  open penngrow.gdt
  # Fixed effects (for comparison)
  panel Y 0 Y(-1) X
  # Random effects (for comparison)
  panel Y 0 Y(-1) X --random-effects
  # take differences of all variables
  diff Y X
  # Anderson-Hsiao, using Y(-2) as instrument
  tsls d_Y d_Y(-1) d_X ; 0 d_X Y(-2)
  # Anderson-Hsiao, using d_Y(-2) as instrument
  tsls d_Y d_Y(-1) d_X ; 0 d_X d_Y(-2)

Although the Anderson-Hsiao estimator is consistent, it is not most efficient: it does not make the fullest use of the available instruments for \Delta y_{i,t-1}, nor does it take into account the differenced structure of the error \eta_{it}. It is improved upon by the methods of Arellano and Bond (1991) and Blundell and Bond (1998). These methods are taken up in the next chapter.

Chapter 24  Dynamic panel models

The command for estimating dynamic panel models in gretl is dpanel. This command supports both the "difference" estimator (Arellano and Bond, 1991) and the "system" estimator (Blundell and Bond, 1998), which has become the method of choice in the applied literature.

24.1 Introduction

Notation

A dynamic linear panel data model can be represented as follows (in notation based on Arellano (2003)):

  y_{it} = \alpha y_{i,t-1} + \beta' x_{it} + \eta_i + v_{it}    (24.1)

where i = 1, 2, ..., N indexes the cross-section units and t indexes time.

The main idea behind the difference estimator is to sweep out the individual effect via differencing. First-differencing eq. (24.1) yields

  \Delta y_{it} = \alpha \Delta y_{i,t-1} + \beta' \Delta x_{it} + \Delta v_{it} = \gamma' W_{it} + \Delta v_{it}    (24.2)

in obvious notation. The error term of (24.2) is, by construction, autocorrelated and also correlated with the lagged dependent variable, so an estimator that takes both issues into account is needed. The endogeneity issue is solved by noting that all values of y_{i,t-k}, with k > 1, can be used as instruments for \Delta y_{i,t-1}: unobserved values of y_{i,t-k} (whether missing or pre-sample) can safely be substituted with 0. In the language of GMM, this amounts to using the relation

  E(\Delta v_{it} \cdot y_{i,t-k}) = 0,   k > 1    (24.3)

as an orthogonality condition.

Autocorrelation is dealt with by noting that if v_{it} is white noise, the covariance matrix of the vector whose typical element is \Delta v_{it} is proportional to a matrix H that has 2 on the main diagonal, -1 on the first subdiagonals and 0 elsewhere. One-step GMM estimation of equation (24.2) amounts to computing

  \hat{\gamma} = \left[ \left( \sum_i W_i' Z_i \right) A_N \left( \sum_i Z_i' W_i \right) \right]^{-1} \left( \sum_i W_i' Z_i \right) A_N \left( \sum_i Z_i' \Delta y_i \right)    (24.4)

where

  \Delta y_i = [ \Delta y_{i3}, ..., \Delta y_{iT} ]'

  W_i = [ \Delta y_{i2}  ...  \Delta y_{i,T-1}
          \Delta x_{i3}  ...  \Delta x_{iT}    ]'

  Z_i = [ y_{i1}  0       0       ...  0                        \Delta x_{i3}
          0       y_{i1}  y_{i2}  ...  0                        \Delta x_{i4}
          ...
          0       0       0       ...  y_{i1} ... y_{i,T-2}     \Delta x_{iT} ]

and

  A_N = \left( \sum_i Z_i' H Z_i \right)^{-1}

Once the 1-step estimator is computed, the sample covariance matrix of the estimated residuals can be used instead of H to obtain 2-step estimates, which are not only consistent but asymptotically efficient. (In principle the process may be iterated, but nobody seems to be interested.) Standard GMM theory applies, except for one thing: Windmeijer (2005) has computed finite-sample corrections to the asymptotic covariance matrix of the parameters, which are nowadays almost universally used.
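In anticipation of the usage section below, here is a minimal sketch of the two variants, using the Arellano-Bond dataset shipped with gretl:

  open abdata.gdt
  dpanel 2 ; n              # one-step difference GMM, robust SEs
  dpanel 2 ; n --two-step   # two-step, with the Windmeijer correction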
The difference estimator is consistent, but has been shown to have poor properties in finite samples when \alpha is near one. People these days prefer the so-called "system" estimator, which complements the differenced data (with lagged levels used as instruments) with data in levels (using lagged differences as instruments). The system estimator relies on an extra orthogonality condition, which has to do with the earliest value of the dependent variable, y_{i1}. The interested reader is referred to Blundell and Bond (1998, pp. 124-125) for details; here it suffices to say that this condition is satisfied in mean-stationary models and brings an improvement in efficiency that may be substantial in many cases.

The set of orthogonality conditions exploited in the system approach is not very much larger than with the difference estimator, since most of the possible orthogonality conditions associated with the equations in levels are redundant, given those already used for the equations in differences.

The key equations of the system estimator can be written as

  \tilde{\gamma} = \left[ \left( \sum_i W_i' Z_i \right) A_N \left( \sum_i Z_i' W_i \right) \right]^{-1} \left( \sum_i W_i' Z_i \right) A_N \left( \sum_i Z_i' \Delta y_i \right)    (24.5)

where now the data matrices stack the equations in differences on top of the equations in levels,

  \Delta y_i = [ \Delta y_{i3}, ..., \Delta y_{iT}, y_{i3}, ..., y_{iT} ]'

  W_i = [ \Delta y_{i2}  ...  \Delta y_{i,T-1}   y_{i2}  ...  y_{i,T-1}
          \Delta x_{i3}  ...  \Delta x_{iT}      x_{i3}  ...  x_{iT}    ]'

and the instrument matrix comprises one row per equation in differences followed by one row per equation in levels,

  Z_i = [ y_{i1}  0       0       ...  0                     0              ...  0                    \Delta x_{i3}
          0       y_{i1}  y_{i2}  ...  0                     0              ...  0                    \Delta x_{i4}
          ...
          0       0       0       ...  y_{i1} ... y_{i,T-2}  0              ...  0                    \Delta x_{iT}
          0       0       0       ...  0                     \Delta y_{i2}  ...  0                    x_{i3}
          ...
          0       0       0       ...  0                     0              ...  \Delta y_{i,T-1}     x_{iT}       ]

with

  A_N = \left( \sum_i Z_i' H Z_i \right)^{-1}

In this case choosing a precise form for the matrix H for the first step is no trivial matter. Its north-west block should be as similar as possible to the covariance matrix of the vector \Delta v_{it}, so the same choice as the difference estimator is appropriate. Ideally, the south-east block should be proportional to the covariance matrix of the vector \iota \eta_i + v_i, that is, \sigma_v^2 I + \sigma_\eta^2 \iota \iota'; but since \sigma_\eta^2 is unknown and any positive definite matrix renders the estimator consistent, people just use I. The off-diagonal blocks should, in principle, contain the covariances between \Delta v_{is} and v_{it}, which would be an identity matrix if v_{it} is white noise. However, since the south-east block is typically given a conventional value anyway, the benefit in making this choice is not obvious. Some packages use I; others use a zero matrix. Asymptotically, it should not matter, but on real datasets the difference between the resulting estimates can be noticeable.

Rank deficiency

Both the difference estimator (24.4) and the system estimator (24.5) depend, for their existence, on the invertibility of A_N. This matrix may turn out to be singular for several reasons. However, this does not mean that the estimator is not computable. In some cases, adjustments are possible such that the estimator does exist, but the user should be aware that in such cases not all software packages use the same strategy, and replication of results may prove difficult or even impossible.

A first reason why A_N may be singular is unavailability of instruments, chiefly because of missing observations. This case is easy to handle. If a particular row of Z_i is zero for all units, the corresponding orthogonality condition (or the corresponding instrument, if you prefer) is automatically dropped; the overidentification rank is then adjusted for testing purposes.

Even if no instruments are zero, however, A_N could be rank deficient. A trivial case occurs if there are collinear instruments, but a less trivial case may arise when T (the total number of time periods available) is not much smaller than N (the number of units), as, for example, in some macro datasets where the units are countries. The total number of potentially usable orthogonality conditions is O(T^2), which may well exceed N in some cases. Since A_N is the sum of N matrices which have, at most, rank 2T - 3, it could well happen that the sum is singular.

In all these cases, dpanel substitutes the pseudo-inverse of A_N (Moore-Penrose) for its regular inverse. Our choice is shared by some software packages, but not all, so replication may be hard.

Covariance matrix and standard errors

By default, the standard errors shown for 1-step estimation are robust, based on the heteroskedasticity-consistent variance estimator
  \widehat{Var}(\hat{\gamma}) = M^{-1} \left( \sum_i W_i' Z_i \right) A_N \hat{V}_N A_N \left( \sum_i Z_i' W_i \right) M^{-1}

where

  M = \left( \sum_i W_i' Z_i \right) A_N \left( \sum_i Z_i' W_i \right)

and

  \hat{V}_N = N^{-1} \sum_i Z_i' \hat{u}_i \hat{u}_i' Z_i

with \hat{u}_i the vector of residuals in the regressions for individual i. In addition, as noted above, the variance estimator for 2-step estimation employs the finite-sample correction of Windmeijer (2005).

When the --asymptotic option is passed to dpanel, however, the 1-step variance estimator is simply \hat{\sigma}_u^2 M^{-1}, which is not heteroskedasticity-consistent, and the Windmeijer correction is not applied for 2-step estimation. Use of this option is not recommended unless you wish to replicate prior results that did not report robust standard errors. In particular, tests based on the asymptotic 2-step variance estimator are known to over-reject quite substantially (standard errors too small).

Treatment of missing values

Textbooks seldom bother with missing values, but in some cases their treatment may be far from obvious. This is especially true if missing values are interspersed between valid observations. For example, consider the plain difference estimator with one lag,

  y_t = \alpha y_{t-1} + \eta + \epsilon_t

where the i index is omitted for clarity. Suppose you have an individual with t = 1, ..., 5, for which y_3 is missing. It may seem that the data for this individual are unusable, because differencing y_t would leave \Delta y_t available only for t = 2 and t = 5 (both \Delta y_3 and \Delta y_4 involve the missing y_3). Estimation seems to be unfeasible, since there are no periods in which \Delta y_t and \Delta y_{t-1} are both observable.

However, we can use a k-difference operator and get

  \Delta_k y_t = \alpha \Delta_k y_{t-1} + \Delta_k \epsilon_t

where \Delta_k = 1 - L^k, and past levels of y_t are valid instruments. In this example, we can choose k = 3 and use y_1 as an instrument, so this unit is in fact usable.

Not all software packages seem to be aware of this possibility, so replicating published results may prove tricky if your dataset contains individuals with "gaps" between valid observations.

24.2 Usage

One feature of dpanel's syntax is that you get default values for several choices you may wish to make, so that in a "standard" situation the command is very concise. The simplest case of the model (24.1) is a plain AR(1) process:

  y_{it} = \alpha y_{i,t-1} + \eta_i + v_{it}    (24.6)

If you give the command

  dpanel 1 ; y

gretl assumes that you want to estimate (24.6) via the difference estimator (24.4), using as many orthogonality conditions as possible. The scalar 1 between dpanel and the semicolon indicates that only one lag of y is included as an explanatory variable; using 2 would give an AR(2) model.

The syntax that gretl uses for the non-seasonal AR and MA lags in an ARMA model is also supported in this context. For example, if you want the first and third lags of y (but not the second) included as explanatory variables, you can say

  dpanel {1 3} ; y

or you can use a pre-defined matrix for this purpose:

  matrix ylags = {1, 3}
  dpanel ylags ; y

To use a single lag of y other than the first you need to employ this mechanism:

  dpanel {3} ; y   # only lag 3 is included
  dpanel 3 ; y     # compare: lags 1, 2 and 3 are used

To use the system estimator instead, you add the --system option, as in

  dpanel 1 ; y --system

The level orthogonality conditions and the corresponding instrument are appended automatically (see eq. 24.5).

Regressors

If additional regressors are to be included, they should be listed after the dependent variable, in the same way as other gretl estimation commands such as ols. For the difference orthogonality relations, dpanel takes care of transforming the regressors in parallel with the dependent variable.
One case of potential ambiguity is when an intercept is specified but the difference-only estimator is selected, as in

  dpanel 1 ; y const

In this case the default dpanel behavior, which agrees with David Roodman's xtabond2 for Stata (Roodman, 2009a), is to drop the constant (since differencing reduces it to nothing but zeros). However, for compatibility with the DPD package for Ox, you can give the option --dpdstyle, in which case the constant is retained (equivalent to including a linear trend in equation 24.1). A similar point applies to the period-specific dummy variables, which can be added in dpanel via the --time-dummies option: in the differences-only case these dummies are entered in differenced form by default, but when the --dpdstyle switch is applied they are entered in levels.

The standard gretl syntax applies if you want to use lagged explanatory variables, so for example the command

  dpanel 1 ; y const x(0 to -1) --system

would result in estimation of the model

  y_{it} = \alpha y_{i,t-1} + \beta_0 + \beta_1 x_{it} + \beta_2 x_{i,t-1} + \eta_i + v_{it}

Instruments

The default rules for instruments are:

- lags of the dependent variable are instrumented using all available orthogonality conditions; and
- additional regressors are considered exogenous, so they are used as their own instruments.

If a different policy is wanted, the instruments should be specified in an additional list, separated from the regressors list by a semicolon. The syntax closely mirrors that of the tsls command, but in this context it is necessary to distinguish between "regular" instruments and what are often called "GMM-style" instruments (that is, instruments that are handled in the same block-diagonal manner as lags of the dependent variable, as described above).

"Regular" instruments are transformed in the same way as regressors, and the contemporaneous value of the transformed variable is used to form an orthogonality condition. Since regressors are treated as exogenous by default, it follows that these two commands estimate the same model:

  dpanel 1 ; y z
  dpanel 1 ; y z ; z

The instrument specification in the second case simply confirms what is implicit in the first: that z is exogenous. Note, though, that if you have some additional variable z2 which you want to add as a regular instrument, it then becomes necessary to include z in the instrument list if it is to be treated as exogenous:

  dpanel 1 ; y z ; z2      # z is now implicitly endogenous
  dpanel 1 ; y z ; z z2    # z is treated as exogenous

The specification of "GMM-style" instruments is handled by the special constructs GMM() and GMMlevel(). The first of these relates to instruments for the equations in differences, and the second to the equations in levels. The syntax for GMM() is

  GMM(name, minlag, maxlag[, collapse])

where name is replaced by the name of a series (or the name of a list of series), and minlag and maxlag are replaced by the minimum and maximum lags to be used as instruments. The same goes for GMMlevel().

One common use of GMM() is to limit the number of lagged levels of the dependent variable used as instruments for the equations in differences. It's well known that, although exploiting all possible orthogonality conditions yields maximal asymptotic efficiency, in finite samples it may be preferable to use a smaller subset; see Roodman (2009b) and Okui (2009). For example, the specification

  dpanel 1 ; y ; GMM(y, 2, 4)

ensures that no lags of y_t earlier than t - 4 will be used as instruments.

A second means of limiting the number of instruments is to "collapse" the sets of block-diagonal instruments shown following equations (24.4) and (24.5). Instead of having a distinct instrument per observation per lag, this is reduced to a distinct instrument per lag, as shown in Figure 24.1.

  GMM:
  [ y_{i1}  0       0       0       0       0       ]        [ y_{i1}  0       0      ]
  [ 0       y_{i1}  y_{i2}  0       0       0       ]  -->   [ y_{i1}  y_{i2}  0      ]
  [ 0       0       0       y_{i1}  y_{i2}  y_{i3}  ]        [ y_{i1}  y_{i2}  y_{i3} ]

  GMMlevel:
  [ \Delta y_{i2}  0              0              ]        [ \Delta y_{i2} ]
  [ 0              \Delta y_{i3}  0              ]  -->   [ \Delta y_{i3} ]
  [ 0              0              \Delta y_{i4}  ]        [ \Delta y_{i4} ]

  Figure 24.1: Collapsing block-diagonal instruments

This treatment of instruments can be selected per case (by appending the collapse flag following the maxlag value in a GMM or GMMlevel specification), or it can be set "globally" by use of the --collapse option to the dpanel command. To our knowledge, Roodman's xtabond2 was the first software to offer this useful facility.

A further use of GMM() is to exploit more fully the potential orthogonality conditions afforded by an exogenous regressor, or a related variable that does not appear as a regressor. For example, in

  dpanel 1 ; y x ; GMM(z, 2, 6)

the variable x is considered an endogenous regressor, and up to 5 lags of z are used as instruments.

Note that in the following script fragment

  dpanel 1 ; y z
  dpanel 1 ; y z ; GMM(z, 0, 0)

the two estimation commands should not be expected to give the same result, as the sets of orthogonality relationships are subtly different. In the latter case, you have T - 2 separate orthogonality relationships pertaining to z_{it}, none of which has any implication for the other ones; in the former case you only have one. In terms of the Z_i matrix, the first form adds a single row to the bottom of the instruments matrix, while the second form adds a diagonal block with T - 2 columns; that is,

  [ z_{i3}  z_{i4}  ...  z_{iT} ]

versus

  [ z_{i3}  0       ...  0      ]
  [ 0       z_{i4}  ...  0      ]
  [ ...                         ]
  [ 0       0       ...  z_{iT} ]

24.3 Replication of DPD results

In this section we show how to replicate the results of some of the pioneering work with dynamic panel-data estimators by Arellano, Bond and Blundell. As the DPD manual (Doornik, Arellano and Bond, 2006) explains, it is difficult to replicate the original published results exactly, for two main reasons: not all of the data used in those studies are publicly available, and some of the choices made in the original software implementation of the estimators have been superseded. Here, therefore, our focus is on replicating the results obtained using the current DPD package and reported in the DPD manual.

The examples are based on the program files abest1.ox, abest3.ox and bbest1.ox. These are included in the DPD package, along with the Arellano-Bond database files abdata.bn7 and abdata.in7. The Arellano-Bond data are also provided with gretl, in the file abdata.gdt. In the following we do not show the output from DPD or gretl; it is somewhat voluminous, and is easily generated by the user. As of this writing, the results from Ox/DPD and gretl are identical in all relevant respects for all of the examples shown.

A complete Ox/DPD program to generate the results of interest takes this general form:

  #include <oxstd.h>
  #import <packages/dpd/dpd>

  main()
  {
      decl dpd = new DPD();
      dpd.Load("abdata.in7");
      dpd.SetYear("YEAR");

      // model-specific code here

      delete dpd;
  }

In the examples below, we take this template for granted and show just the model-specific code.

Example 1

The following Ox/DPD code, drawn from abest1.ox, replicates column (b) of Table 4 in Arellano and Bond (1991), an instance of the differences-only or GMM-DIF estimator. The dependent variable is the log of employment, n; the regressors include two lags of the dependent variable, current and lagged values of the log real-product wage, w, the current value of the log of gross capital, k, and current and lagged values of the log of industry output, ys. In addition, the specification includes a constant and five year dummies; unlike the stochastic regressors, these deterministic terms are not differenced. In this specification the regressors w, k and ys are treated as exogenous and serve as their own instruments. In DPD syntax this requires entering these variables twice, on the X_VAR and I_VAR lines.
The GMM-type (block-diagonal) instruments in this example are the second and subsequent lags of the level of n. Both 1-step and 2-step estimates are computed.

  dpd.SetOptions(FALSE);  // don't use robust standard errors
  dpd.Select(Y_VAR, {"n", 0, 2});
  dpd.Select(X_VAR, {"w", 0, 1, "k", 0, 0, "ys", 0, 1});
  dpd.Select(I_VAR, {"w", 0, 1, "k", 0, 0, "ys", 0, 1});

  dpd.Gmm("n", 2, 99);
  dpd.SetDummies(D_CONSTANT + D_TIME);

  print("Arellano & Bond (1991), Table 4 (b)");
  dpd.SetMethod(M_1STEP);
  dpd.Estimate();
  dpd.SetMethod(M_2STEP);
  dpd.Estimate();

Here is gretl code to do the same job:

  open abdata.gdt
  list X = w w(-1) k ys ys(-1)
  dpanel 2 ; n X const --time-dummies --asy --dpdstyle
  dpanel 2 ; n X const --time-dummies --asy --two-step --dpdstyle

Note that in gretl the switch to suppress robust standard errors is --asymptotic, here abbreviated to --asy.³ The --dpdstyle flag specifies that the constant (and dummies) should not be differenced, in the context of a GMM-DIF model. With gretl's dpanel command it is not necessary to specify the exogenous regressors as their own instruments, since this is the default; similarly, the use of the second and all longer lags of the dependent variable as GMM-type instruments is the default and need not be stated explicitly.

³Option flags in gretl can always be truncated, down to the minimal unique abbreviation.

Example 2

The DPD file abest3.ox contains a variant of the above that differs with regard to the choice of instruments: the variables w and k are now treated as predetermined, and are instrumented GMM-style using the second and third lags of their levels. This approximates column (c) of Table 4 in Arellano and Bond (1991). We have modified the code in abest3.ox slightly to allow the use of robust (Windmeijer-corrected) standard errors, which are the default in both DPD and gretl with 2-step estimation:

  dpd.Select(Y_VAR, {"n", 0, 2});
  dpd.Select(X_VAR, {"w", 0, 1, "k", 0, 0, "ys", 0, 1});
  dpd.Select(I_VAR, {"ys", 0, 1});
  dpd.SetDummies(D_CONSTANT + D_TIME);

  dpd.Gmm("n", 2, 99);
  dpd.Gmm("w", 2, 3);
  dpd.Gmm("k", 2, 3);

  print("Arellano & Bond (1991), Table 4 (c)");
  print("(but using different instruments)");
  dpd.SetMethod(M_2STEP);
  dpd.Estimate();

The gretl code is as follows:

  open abdata.gdt
  list X = w w(-1) k ys ys(-1)
  list Ivars = ys ys(-1)
  dpanel 2 ; n X const ; GMM(w,2,3) GMM(k,2,3) Ivars --time --two-step --dpd

Note that since we are now calling for an instrument set other than the default (following the second semicolon), it is necessary to include the Ivars specification for the variable ys. However, it is not necessary to specify GMM(n,2,99), since this remains the default treatment of the dependent variable.

Example 3

Our third example replicates the DPD output from bbest1.ox: this uses the same dataset as the previous examples, but the model specifications are based on Blundell and Bond (1998), and involve comparison of the GMM-DIF and GMM-SYS ("system") estimators. The basic specification is slightly simplified in that the variable ys is not used and only one lag of the dependent variable appears as a regressor. The Ox/DPD code is:

  dpd.Select(Y_VAR, {"n", 0, 1});
  dpd.Select(X_VAR, {"w", 0, 1, "k", 0, 1});
  dpd.SetDummies(D_CONSTANT + D_TIME);

  print("Blundell & Bond (1998), Table 4: 1976-86 GMM-DIF");
  dpd.Gmm("n", 2, 99);
  dpd.Gmm("w", 2, 99);
  dpd.Gmm("k", 2, 99);
  dpd.SetMethod(M_2STEP);
  dpd.Estimate();

  print("Blundell & Bond (1998), Table 4: 1976-86 GMM-SYS");
  dpd.GmmLevel("n", 1, 1);
  dpd.GmmLevel("w", 1, 1);
  dpd.GmmLevel("k", 1, 1);
  dpd.SetMethod(M_2STEP);
  dpd.Estimate();

Here is the corresponding gretl code:

  open abdata.gdt
  list X = w w(-1) k k(-1)
  list Z = w k

  # Blundell & Bond (1998), Table 4: 1976-86 GMM-DIF
  dpanel 1 ; n X const ; GMM(Z,2,99) --time --two-step --dpd

  # Blundell & Bond (1998), Table 4: 1976-86 GMM-SYS
  dpanel 1 ; n X const ; GMM(Z,2,99) GMMlevel(Z,1,1) --time --two-step --dpd --system

Note the use of the --system option flag to specify GMM-SYS,
Example 2

The DPD file abest3.ox contains a variant of the above that differs with regard to the choice of instruments: the variables w and k are now treated as predetermined, and are instrumented GMM-style using the second and third lags of their levels. This approximates column (c) of Table 4 in Arellano and Bond (1991). We have modified the code in abest3.ox slightly to allow the use of robust (Windmeijer-corrected) standard errors, which are the default in both DPD and gretl with 2-step estimation:

  dpd.Select(Y_VAR, {"n", 0, 2});
  dpd.Select(X_VAR, {"w", 0, 1, "k", 0, 0, "ys", 0, 1});
  dpd.Select(I_VAR, {"ys", 0, 1});
  dpd.SetDummies(D_CONSTANT + D_TIME);

  dpd.Gmm("n", 2, 99);
  dpd.Gmm("w", 2, 3);
  dpd.Gmm("k", 2, 3);

  print("Arellano & Bond (1991), Table 4 (c)");
  print("(but using different instruments)");
  dpd.SetMethod(M_2STEP);
  dpd.Estimate();

The gretl code is as follows:

  open abdata.gdt
  list X = w w(-1) k ys ys(-1)
  list Ivars = ys ys(-1)
  dpanel 2 ; n X const ; GMM(w,2,3) GMM(k,2,3) Ivars --time --two-step --dpd

Note that since we are now calling for an instrument set other than the default (following the second semicolon), it is necessary to include the Ivars specification for the variable ys. However, it is not necessary to specify GMM(n,2,99), since this remains the default treatment of the dependent variable.

Example 3

Our third example replicates the DPD output from bbest1.ox: this uses the same dataset as the previous examples, but the model specifications are based on Blundell and Bond (1998), and involve comparison of the GMM-DIF and GMM-SYS ("system") estimators. The basic specification is slightly simplified in that the variable ys is not used and only one lag of the dependent variable appears as a regressor. The Ox/DPD code is:

  dpd.Select(Y_VAR, {"n", 0, 1});
  dpd.Select(X_VAR, {"w", 0, 1, "k", 0, 1});
  dpd.SetDummies(D_CONSTANT + D_TIME);

  print("Blundell & Bond (1998), Table 4: 1976-86 GMM-DIF");
  dpd.Gmm("n", 2, 99);
  dpd.Gmm("w", 2, 99);
  dpd.Gmm("k", 2, 99);
  dpd.SetMethod(M_2STEP);
  dpd.Estimate();

  print("Blundell & Bond (1998), Table 4: 1976-86 GMM-SYS");
  dpd.GmmLevel("n", 1, 1);
  dpd.GmmLevel("w", 1, 1);
  dpd.GmmLevel("k", 1, 1);
  dpd.SetMethod(M_2STEP);
  dpd.Estimate();

Here is the corresponding gretl code:

  open abdata.gdt
  list X = w w(-1) k k(-1)
  list Z = w k

  # Blundell & Bond (1998), Table 4: 1976-86 GMM-DIF
  dpanel 1 ; n X const ; GMM(Z,2,99) --time --two-step --dpd

  # Blundell & Bond (1998), Table 4: 1976-86 GMM-SYS
  dpanel 1 ; n X const ; GMM(Z,2,99) GMMlevel(Z,1,1) --time --two-step --dpd --system

Note the use of the --system option flag to specify GMM-SYS, including the default treatment of the dependent variable, which corresponds to GMMlevel(n,1,1). In this case we also want to use lagged differences of the regressors w and k as instruments for the levels equations, so we need explicit GMMlevel entries for those variables. If you want something other than the default treatment for the dependent variable as an instrument for the levels equations, you should give an explicit GMMlevel specification for that variable, in which case the --system flag is redundant (but harmless). For the sake of completeness, note that if you specify at least one GMMlevel term, dpanel will then include equations in levels, but it will not automatically add a default GMMlevel specification for the dependent variable unless the --system option is given.

24.4 Cross-country growth example

The previous examples all used the Arellano-Bond dataset; for this example we use the dataset CEL.gdt, which is also included in the gretl distribution. As with the Arellano-Bond data, there are numerous missing values. Details of the provenance of the data can be found by opening the dataset information window in the gretl GUI (Data menu, Dataset info item). This is a subset of the Barro-Lee 138-country panel dataset, an approximation to which is used in Caselli, Esquivel and Lefort (1996) and Bond, Hoeffler and Temple (2001). (We say "an approximation" because we have not been able to replicate exactly the OLS results reported in the papers cited, though it seems from the description of the data in Caselli et al. (1996) that we ought to be able to do so. We note that Bond et al. (2001) used data provided by Professor Caselli, yet did not manage to reproduce the latter's results.) Both of these papers explore the dynamic panel-data approach in relation to the issues of growth and convergence of per capita income across countries.

The dependent variable is growth in real GDP per capita over successive five-year periods; the regressors are the log of the initial (five years prior) value of GDP per capita, the log-ratio of investment to GDP, s, in the prior five years, and the log of annual average population growth, n, over the prior five years, plus 0.05 as stand-in for the rate of technical progress, g, plus the rate of depreciation, \delta (with the last two terms assumed to be constant across both countries and periods). The original model is

  \Delta_5 y_{it} = \beta y_{i,t-5} + \alpha s_{it} + \gamma (n_{it} + g + \delta) + \nu_t + \eta_i + \epsilon_{it}   (24.7)

which allows for a time-specific disturbance \nu_t. The Solow model with Cobb-Douglas production function implies that \gamma = -\alpha, but this assumption is not imposed in estimation. The time-specific disturbance is eliminated by subtracting the period mean from each of the series. Equation (24.7) can be transformed to an AR(1) dynamic panel-data model by adding y_{i,t-5} to both sides, which gives

  y_{it} = (1 + \beta) y_{i,t-5} + \alpha s_{it} + \gamma (n_{it} + g + \delta) + \eta_i + \epsilon_{it}   (24.8)

where all variables are now assumed to be time-demeaned.

In (rough) replication of Bond et al. (2001), we now proceed to estimate the following two models: (a) equation (24.8) via GMM-DIF, using as instruments the second and all longer lags of y_{it}, s_{it} and n_{it} + g + \delta; and (b) equation (24.8) via GMM-SYS, using \Delta y_{i,t-1}, \Delta s_{i,t-1} and \Delta(n_{i,t-1} + g + \delta) as additional instruments in the levels equations. We report robust standard errors throughout. (As a purely notational matter, we now use "t - 1" to refer to values five years prior to t, as in Bond et al. (2001).)

The gretl script to do this job is shown in Listing 24.1. Note that the final transformed versions of the variables (logs, with time-means subtracted) are named ly (y_{it}), linv (s_{it}) and lngd (n_{it} + g + \delta). For comparison, we estimated the same two models using Ox/DPD and xtabond2. (In each case we constructed a comma-separated values dataset containing the data as transformed in the gretl script, using a missing-value code appropriate to the target program.) For reference, the commands used with Stata are reproduced below.
Listing 24.1: GDP growth example

  open CEL.gdt

  ngd = n + 0.05
  ly = log(y)
  linv = log(s)
  lngd = log(ngd)

  # take out time means
  loop i=1..8
      smpl time == i --restrict --replace
      ly -= mean(ly)
      linv -= mean(linv)
      lngd -= mean(lngd)
  endloop

  smpl full
  list X = linv lngd
  # 1-step GMM-DIF
  dpanel 1 ; ly X ; GMM(X,2,99)
  # 2-step GMM-DIF
  dpanel 1 ; ly X ; GMM(X,2,99) --two-step
  # GMM-SYS
  dpanel 1 ; ly X ; GMM(X,2,99) GMMlevel(X,1,1) --two-step --sys

The Stata commands:

  #delimit ;
  insheet using CEL.csv;
  tsset unit time;
  xtabond2 ly L.ly linv lngd, gmm(L.ly, lag(1 99)) gmm(linv, lag(2 99))
    gmm(lngd, lag(2 99)) rob nolev;
  xtabond2 ly L.ly linv lngd, gmm(L.ly, lag(1 99)) gmm(linv, lag(2 99))
    gmm(lngd, lag(2 99)) rob nolev twostep;
  xtabond2 ly L.ly linv lngd, gmm(L.ly, lag(1 99)) gmm(linv, lag(2 99))
    gmm(lngd, lag(2 99)) rob nocons twostep;

For the GMM-DIF model, all three programs find 382 usable observations and 30 instruments, and yield identical parameter estimates and robust standard errors (up to the number of digits printed, or more); see Table 24.1. (The coefficient shown for ly(-1) in the tables is that reported directly by the software; for comparability with the original model, eq. (24.7), it is necessary to subtract 1, which produces the expected negative value indicating conditional convergence in per capita income.)

                     1-step                  2-step
              coeff       std. error   coeff       std. error
  ly(-1)      0.577564    0.1292       0.610056    0.1562
  linv        0.0565469   0.07082      0.100952    0.07772
  lngd       -0.143950    0.2753      -0.310041    0.2980

  Table 24.1: GMM-DIF, Barro-Lee data

Results for GMM-SYS estimation are shown in Table 24.2. In this case we show two sets of gretl results: those labeled "gretl(1)" were obtained using gretl's --dpdstyle option, while those labeled "gretl(2)" did not use that option, the intent being to reproduce the H matrices used by Ox/DPD and xtabond2 respectively (standard errors in parentheses).

            gretl(1)          Ox/DPD            gretl(2)          xtabond2
  ly(-1)    0.9237 (0.0385)   0.9167 (0.0373)   0.9073 (0.0370)   0.9073 (0.0370)
  linv      0.1592 (0.0449)   0.1636 (0.0441)   0.1856 (0.0411)   0.1856 (0.0411)
  lngd     -0.2370 (0.1485)  -0.2178 (0.1433)  -0.2355 (0.1501)  -0.2355 (0.1501)

  Table 24.2: 2-step GMM-SYS, Barro-Lee data

In this case all three programs use 479 observations; gretl and xtabond2 use 41 instruments and produce the same estimates (when using the same H matrix), while Ox/DPD nominally uses 66. It is noteworthy that, with GMM-SYS plus "messy" missing observations, the results depend on the precise array of instruments used, which in turn depends on the details of the implementation of the estimator.

24.5 Auxiliary test statistics

We have concentrated above on the parameter estimates and standard errors. Here we add some discussion of the additional test statistics that typically accompany both GMM-DIF and GMM-SYS estimation: tests of overidentification, for first- and second-order autocorrelation, and for the joint significance of regressors.

Overidentification

If a model estimated with the use of instrumental variables is just-identified, the condition of orthogonality of the residuals and the instruments can be satisfied exactly. But if the specification is overidentified (more instruments than endogenous regressors), this condition can only be approximated, and the degree to which orthogonality "fails" serves as a test for the validity of the instruments (and/or the specification). Since dynamic panel models are almost always overidentified, such a test is of particular importance.

There are two such tests in the econometric literature, devised respectively by Sargan (1958) and Hansen (1982). They share a common principle: a suitably scaled measure of deviation from perfect orthogonality can be shown to be distributed as \chi^2(k), with k the degree of overidentification, under the null hypothesis of valid instruments and correct specification. Both test statistics can be written as

  S = \left( \sum_{i=1}^N \hat{v}_i' Z_i \right) A_N \left( \sum_{i=1}^N Z_i' \hat{v}_i \right)

where the \hat{v}_i are the residuals in first differences for unit i, and for that reason the tests are often rolled together (for example, as "Hansen-Sargan" tests by Davidson and MacKinnon (2004)).

The Sargan vs Hansen difference is buried in A_N. Sargan's original test is the minimized orthogonality score divided by a scalar estimate of the error variance (which is presumed to be homoskedastic), while Hansen's is the minimized criterion from efficient GMM estimation, in which the scalar variance estimate is replaced by a heteroskedasticity- and autocorrelation-consistent (HAC) estimator of the variance matrix of the error term. These variants correspond to 1-step and 2-step estimates of the given specification.

Up till version 2021d, gretl followed Ox/DPD in presenting a single overidentification statistic under the name "Sargan": in effect, a Sargan test proper for the 1-step estimator and a Hansen test for 2-step. Subsequently, however, gretl follows xtabond2 in distinguishing between the tests, presenting both statistics under their original names when 2-step estimation is selected (and therefore the HAC variance estimator is available). This choice responds to an argument made by Roodman (2009b): the Sargan test is questionable owing to its assumption of homoskedasticity, but the Hansen test is seriously weakened by an excessive number of instruments (it may under-reject substantially), so there may be a benefit to taking both tests into consideration.

There are cases where the degrees of freedom for the overidentification test differ between DPD and gretl; this occurs when the A_N matrix is singular (section 24.1). In concept the df equals the number of instruments minus the number of parameters estimated; for the first of these terms gretl uses the rank of A_N, while DPD appears to use the full dimension of this matrix.
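As a quick illustration (a minimal sketch, relying on the $model bundle keys documented in section 24.6 below), both statistics can be inspected after 2-step estimation:

  open abdata.gdt
  list X = w w(-1) k ys ys(-1)
  dpanel 2 ; n X const --time-dummies --two-step
  # both tests are computed when --two-step is given
  printf "Sargan %g (df %d), Hansen %g (df %d)\n", \
    $model.sargan, $model.sargan_df, $model.hansen, $model.hansen_df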
Autocorrelation

Negative first-order autocorrelation of the residuals in differences is expected by construction of the dynamic panel estimator, so a significant value for the AR(1) test does not indicate a problem. If the AR(2) test rejects, however, this indicates violation of the maintained assumptions. Note that valid AR tests cannot be produced when the --asymptotic option is specified in conjunction with one-step GMM-SYS estimation; if you need the tests, either add the --two-step option or drop the --asymptotic flag (which is recommended in any case).

Wald tests on regressors

Wald tests on the regressors, and separately on the time dummy variables (if included), are based on the estimated variance matrix of the parameter estimates, and are generally in agreement across software packages, provided the parameter variance is estimated in the same way. One small exception pertains to the comparison between Ox/DPD and gretl when the difference estimator is used, a constant term is included, and the --dpdstyle option is given with dpanel (so the constant is not automatically omitted). In this case DPD includes the constant in the time-dummies Wald test, but gretl does not.

24.6 Post-estimation: available statistics

After estimation, the $model accessor will return a bundle containing several items that may be of interest; most should be self-explanatory, but here's a partial list:

  Key                  Content
  AR1, AR2             1st- and 2nd-order autocorrelation test statistics
  sargan, sargan_df    Sargan test for overidentifying restrictions and
                       corresponding degrees of freedom
  hansen, hansen_df    Hansen test for overidentifying restrictions and
                       corresponding degrees of freedom
  wald, wald_df        Wald test for overall significance and corresponding
                       degrees of freedom
  GMMinst              The matrix Z of instruments (see equations 24.2 and 24.5)
  wgtmat               The matrix A of GMM weights (see equations 24.2 and 24.5)

Note that hansen and hansen_df are not included when 1-step estimation is selected. Note also that GMMinst and wgtmat (which may be quite large matrices) are not saved in the $model bundle by default; that requires use of the --keep-extra option with the dpanel command. Listing 24.2 illustrates the use of these matrices to replicate, via hansl commands, the calculation of the GMM estimator.
Listing 24.2: Replication of built-in command via hansl commands

  set verbose off
  open abdata.gdt

  # compose list of regressors
  list X = w w(-1) k k(-1)
  list Z = w k
  dpanel 1 ; n X const ; GMM(Z,2,99) --two-step --dpd --keep-extra

  ### redo by hand

  # fetch Z and A from model
  matrix A = $model.wgtmat
  matrix mZt = $model.GMMinst' # note: transposed

  # create data matrices
  series valid = ok($uhat)
  series ddep = diff(n)
  series dldep = ddep(-1)
  list dreg = diff(X)
  smpl valid --dummy

  matrix mreg = {dldep} ~ {dreg} ~ 1
  matrix mdep = {ddep}
  matrix uno = mZt * mreg
  matrix due = qform(uno', A)
  matrix tre = (uno'A) * mZt * mdep
  matrix coef = due \ tre
  print coef

24.7 Memo: dpanel options

  flag              effect
  --asymptotic      Suppresses the use of robust standard errors
  --two-step        Calls for 2-step estimation (the default being 1-step)
  --system          Calls for GMM-SYS, with default treatment of the dependent
                    variable, as in GMMlevel(y,1,1)
  --collapse        Collapse block-diagonal sets of GMM instruments as per
                    Roodman (2009a)
  --time-dummies    Includes period-specific dummy variables
  --dpdstyle        Compute the H matrix as in DPD; also suppresses differencing
                    of automatic time dummies and omission of the intercept in
                    the GMM-DIF case
  --verbose         Prints confirmation of the GMM-style instruments used and,
                    when --two-step is selected, prints the 1-step estimates first
  --vcv             Calls for printing of the covariance matrix
  --quiet           Suppresses the printing of results
  --keep-extra      Save additional matrices in the $model bundle (see above)

The time-dummies option supports the qualifier noprint, as in --time-dummies=noprint. This means that although the dummies are included in the specification, their coefficients, standard errors and so on are not printed.

Chapter 25 Nonlinear least squares

25.1 Introduction and examples

Gretl supports nonlinear least squares (NLS) using a variant of the Levenberg-Marquardt algorithm. The user must supply a specification of the regression function; prior to giving this specification, the parameters to be estimated must be declared and given initial values. Optionally, the user may supply analytical derivatives of the regression function with respect to each of the parameters. If derivatives are not given, the user must instead give a list of the parameters to be estimated (separated by spaces or commas), preceded by the keyword params. The tolerance (criterion for terminating the iterative estimation procedure) can be adjusted using the set command.

The syntax for specifying the function to be estimated consists of the name of the dependent variable, followed by an expression to generate it. This is illustrated in the following two examples, with accompanying derivatives.

  # Consumption function from Greene
  nls C = alpha + beta * Y^gamma
      deriv alpha = 1
      deriv beta = Y^gamma
      deriv gamma = beta * Y^gamma * log(Y)
  end nls

  # Nonlinear function from Russell Davidson
  nls y = alpha + beta * x1 + (1/beta) * x2
      deriv alpha = 1
      deriv beta = x1 - x2/(beta*beta)
  end nls --vcv
Note the command words: nls (which introduces the regression function), deriv (which introduces the specification of a derivative) and end nls (which terminates the specification and calls for estimation). If the --vcv flag is appended to the last line, the covariance matrix of the parameter estimates is printed.

25.2 Initializing the parameters

The parameters of the regression function must be given initial values prior to the nls command. (In the GUI program this may be done via the menu item "Variable, Define new variable".) In some cases, where the nonlinear function is a generalization of (or a restricted form of) a linear model, it may be convenient to run an ols and initialize the parameters from the OLS coefficient estimates. In relation to the first example above, one might do:

  ols C 0 Y
  scalar alpha = $coeff(0)
  scalar beta = $coeff(Y)
  scalar gamma = 1

And in relation to the second example one might do:

  ols y 0 x1 x2
  scalar alpha = $coeff(0)
  scalar beta = $coeff(x1)

25.3 NLS dialog window

It is probably most convenient to compose the commands for NLS estimation in the form of a gretl script, but you can also do so interactively, by selecting the item "Nonlinear Least Squares" under the "Model, Nonlinear models" menu. This opens a dialog box where you can type the function specification (possibly prefaced by statements to set the initial parameter values) and the derivatives, if available. An example of this is shown in Figure 25.1. Note that in this context you do not have to supply the nls and end nls tags.

[Figure 25.1: NLS dialog box]

25.4 Analytical and numerical derivatives

If you are able to figure out the derivatives of the regression function with respect to the parameters, it is advisable to supply those derivatives as shown in the examples above. If that is not possible, gretl will compute approximate numerical derivatives; however, the properties of the NLS algorithm may not be so good in this case (see section 25.8). Numerical derivatives are requested by using the params statement, which should be followed by a list of identifiers containing the parameters to be estimated. In this case the examples above would read as follows:

  # Greene
  nls C = alpha + beta * Y^gamma
      params alpha beta gamma
  end nls

  # Davidson
  nls y = alpha + beta * x1 + (1/beta) * x2
      params alpha beta
  end nls

If analytical derivatives are supplied, they are checked for consistency with the given nonlinear function. If the derivatives are clearly incorrect, estimation is aborted with an error message. If the derivatives are "suspicious", a warning message is issued but estimation proceeds. This warning may sometimes be triggered by incorrect derivatives, but it may also be triggered by a high degree of collinearity among the derivatives. Note that you cannot mix analytical and numerical derivatives: you should supply expressions for all of the derivatives or none.

25.5 Advanced use

The nls block can also contain more sophisticated constructs. First, it can handle intermediate expressions; this makes it possible to construct the conditional mean expression as a multi-step job, thus enhancing the modularity and readability of the code. Second, more complex objects, such as lists and matrices, can be used for this purpose.

For example, suppose that we want to estimate a Probit Binary Response model via NLS. The specification is

  y_i = \Phi(g(x_i)) + u_i, \qquad g(x_i) = b_0 + b_1 x_{1i} + b_2 x_{2i} = b'x_i   (25.1)

(Note: this is not the recommended way to estimate a probit model; the u_i term is heteroskedastic by construction, and ML estimation is much preferable here. Still, NLS is a consistent estimator of the parameter vector b, although its covariance matrix will have to be adjusted to compensate for heteroskedasticity: this is accomplished via the --robust switch.)
Listing 25.1: NLS estimation of a Probit model

  open greene25_1.gdt
  list X = const age income ownrent selfempl

  # initialisation
  ols cardhldr X --quiet
  matrix b = $coeff / $sigma

  # proceed with NLS estimation
  nls cardhldr = cnorm(ndx)
      series ndx = lincomb(X, b)
      params b
  end nls --robust

  # compare with ML probit
  probit cardhldr X --p-values

The example in Listing 25.1 can be enhanced by using analytical derivatives: since

  \partial g(x_i) / \partial b_j = \phi(b'x_i) x_{ij}

one could substitute the params line in the script with the two-liner

  series f = dnorm(ndx)
  deriv b = {f} .* {X}

and have nls use analytically-computed derivatives, which are quicker and usually more reliable.

25.6 Controlling termination

The NLS estimation procedure is an iterative process. Iteration is terminated when the criterion for convergence is met or when the maximum number of iterations is reached, whichever comes first. Let k denote the number of parameters being estimated. The maximum number of iterations is 100*(k+1) when analytical derivatives are given, and 200*(k+1) when numerical derivatives are used.

Let \epsilon denote a small number. The iteration is deemed to have converged if at least one of the following conditions is satisfied:

- Both the actual and predicted relative reductions in the error sum of squares are at most \epsilon.
- The relative error between two consecutive iterates is at most \epsilon.

The default value of \epsilon is the machine precision to the power 3/4 (on a 32-bit Intel Pentium machine, a likely value for this parameter is 1.82e-12), but it can be adjusted using the set command with the parameter nls_toler. For example,

  set nls_toler .0001

will relax the value of \epsilon to 0.0001.

25.7 Details on the code

The underlying engine for NLS estimation is based on the minpack suite of functions, available from netlib.org. Specifically, the following minpack functions are called:

  lmder    Levenberg-Marquardt algorithm with analytical derivatives
  chkder   Check the supplied analytical derivatives
  lmdif    Levenberg-Marquardt algorithm with numerical derivatives
  fdjac2   Compute final approximate Jacobian when using numerical derivatives
  dpmpar   Determine the machine precision

On successful completion of the Levenberg-Marquardt iteration, a Gauss-Newton regression is used to calculate the covariance matrix for the parameter estimates. If the --robust flag is given, a robust variant is computed. The documentation for the set command explains the specific options available in this regard.

Since NLS results are asymptotic, there is room for debate over whether or not a correction for degrees of freedom should be applied when calculating the standard error of the regression (and the standard errors of the parameter estimates). For comparability with OLS, and in light of the reasoning given in Davidson and MacKinnon (1993), the estimates shown in gretl do use a degrees of freedom correction.

25.8 Numerical accuracy

Table 25.1 shows the results of running the gretl NLS procedure on the 27 Statistical Reference Datasets made available by the US National Institute of Standards and Technology (NIST) for testing nonlinear regression software. (For a discussion of gretl's accuracy in the estimation of linear models, see Appendix C.) For each dataset, two sets of starting values for the parameters are given in the test files, so the full test comprises 54 runs. Two full tests were performed, one using all analytical derivatives and one using all numerical approximations. In each case the default tolerance was used. (The data shown in the table were gathered from a pre-release build of gretl version 1.0.9, compiled with gcc 3.3, linked against glibc 2.3.2, and run under Linux on an i686 PC, an IBM ThinkPad A21m.)
Out of the 54 runs, gretl failed to produce a solution in 4 cases when using analytical derivatives, and in 5 cases when using numeric approximation. Of the four failures in analytical-derivatives mode, two were due to non-convergence of the Levenberg-Marquardt algorithm after the maximum number of iterations (on MGH09 and Bennett5, both described by NIST as of "Higher difficulty") and two were due to the generation of range errors (out-of-bounds floating point values) when computing the Jacobian (on BoxBOD and MGH17, described as of "Higher difficulty" and "Average difficulty" respectively). The additional failure in numerical-approximation mode was on MGH10 ("Higher difficulty"; maximum number of iterations reached).

The table gives information on several aspects of the tests: the number of outright failures, the average number of iterations taken to produce a solution, and two sorts of measure of the accuracy of the estimates, for both the parameters and the standard errors of the parameters.

For each of the 54 runs in each mode, if the run produced a solution, the parameter estimates obtained by gretl were compared with the NIST certified values. We define the "minimum correct figures" for a given run as the number of significant figures to which the least accurate gretl estimate agreed with the certified value, for that run. The table shows both the average and the worst case value of this variable across all the runs that produced a solution. The same information is shown for the estimated standard errors. (For the standard errors, I excluded one outlier from the statistics shown in the table, namely Lanczos1. This is an odd case, using generated data with an almost-exact fit: the standard errors are 9 or 10 orders of magnitude smaller than the coefficients. In this instance gretl could reproduce the certified standard errors to only 3 figures with analytical derivatives and 2 figures with numerical derivatives.)

The second measure of accuracy shown is the percentage of cases, taking into account all parameters from all successful runs, in which the gretl estimate agreed with the certified value to at least the 6 significant figures which are printed by default in the gretl regression output.

  Table 25.1: Nonlinear regression, the NIST tests

                                           Analytical    Numerical
                                           derivatives   derivatives
  Failures in 54 tests                     4             5
  Average iterations                       32            127
  Mean of min. correct figures,            8.120         6.980
    parameters
  Worst of min. correct figures,           4             3
    parameters
  Mean of min. correct figures,            8.000         5.673
    standard errors
  Worst of min. correct figures,           5             2
    standard errors
  Percent correct to at least 6 figures,   96.5          91.9
    parameters
  Percent correct to at least 6 figures,   97.7          77.3
    standard errors

Using analytical derivatives, the worst-case values for both parameters and standard errors were improved to 6 correct figures on the test machine when the tolerance was tightened to 1.0e-14. Using numerical derivatives, the same tightening of the tolerance raised the worst values to 5 correct figures for the parameters and 3 figures for the standard errors, at a cost of one additional failure of convergence.
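Such an experiment simply uses the set variable introduced in section 25.6; for example, before the nls block one would do:

  set nls_toler 1.0e-14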
Note the overall superiority of analytical derivatives: on average, solutions to the test problems were obtained with substantially fewer iterations and the results were more accurate (most notably for the estimated standard errors). Note also that the six-digit results printed by gretl are not 100 percent reliable for difficult nonlinear problems (in particular when using numerical derivatives). Having registered this caveat, the percentage of cases where the results were good to six digits or better seems high enough to justify their printing in this form.

Chapter 26 Maximum likelihood estimation

26.1 Generic ML estimation with gretl

Maximum likelihood estimation is a cornerstone of modern inferential procedures. Gretl provides a way to implement this method for a wide range of estimation problems, by use of the mle command. We give here a few examples.

To give a foundation for the examples that follow, we start from a brief reminder on the basics of ML estimation. Given a sample of size T, it is possible to define the density function for the whole sample, namely the joint distribution of all the observations f(Y; \theta), where Y = {y_1, ..., y_T}. (We are supposing here that our data are a realization of continuous random variables. For discrete random variables, everything continues to apply by referring to the probability function instead of the density. In both cases, the distribution may be conditional on some exogenous variables.) Its shape is determined by a k-vector of unknown parameters \theta, which we assume is contained in a set \Theta, and which can be used to evaluate the probability of observing a sample with any given characteristics.

After observing the data, the values Y are given, and this function can be evaluated for any legitimate value of \theta. In this case, we prefer to call it the likelihood function; the need for another name stems from the fact that this function works as a density when we use the y_t's as arguments and \theta as parameters, whereas in this context \theta is taken as the function's argument, and the data Y only have the role of determining its shape.

In standard cases, this function has a unique maximum. The location of the maximum is unaffected if we consider the logarithm of the likelihood (or log-likelihood for short); this function will be denoted as

  \ell(\theta) = \log f(Y; \theta)

The log-likelihood functions that gretl can handle are those where \ell(\theta) can be written as

  \ell(\theta) = \sum_{t=1}^T \ell_t(\theta)

which is true in most cases of interest. The functions \ell_t(\theta) are called the log-likelihood contributions.

Moreover, the location of the maximum is obviously determined by the data Y. This means that the value

  \hat{\theta}(Y) = \mathrm{Argmax}_{\theta \in \Theta}\, \ell(\theta)   (26.1)

is some function of the observed data (a statistic), which has the property, under mild conditions, of being a consistent, asymptotically normal and asymptotically efficient estimator of \theta.

Sometimes it is possible to write down explicitly the function \hat{\theta}(Y); in general, it need not be so. In these circumstances, the maximum can be found by means of numerical techniques. These often rely on the fact that the log-likelihood is a smooth function of \theta, and therefore at the maximum its partial derivatives should all be 0. The gradient vector, or score vector, is a function that enjoys many interesting statistical properties in its own right; it will be denoted here as g(\theta). It is a k-vector with typical element

  g_i(\theta) = \frac{\partial \ell(\theta)}{\partial \theta_i} = \sum_{t=1}^T \frac{\partial \ell_t(\theta)}{\partial \theta_i}

Gradient-based methods can be briefly illustrated as follows:

1. pick a point \theta_0 \in \Theta;
2. evaluate g(\theta_0);
3. if g(\theta_0) is "small", stop; otherwise, compute a direction vector d(g(\theta_0));
4. evaluate \theta_1 = \theta_0 + d(g(\theta_0));
5. substitute \theta_0 with \theta_1;
6. restart from 2.
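Purely by way of illustration (a toy sketch of our own, not gretl's internal code), the scheme above can be mimicked in hansl with the crudest possible direction choice, namely a fixed multiple of the gradient, for a one-parameter Gaussian likelihood whose score is known analytically:

  # steepest ascent for the mean of an iid N(theta, 1) sample
  nulldata 100
  set seed 123
  series x = 2 + normal()
  scalar theta = 0                  # step 1: starting point
  loop 1000
      scalar g = sum(x - theta)     # step 2: the score
      if abs(g) < 1.0e-8            # step 3: stop if "small"
          break
      endif
      theta += 0.005 * g            # steps 4-5: update
  endloop
  printf "theta = %g (sample mean %g)\n", theta, mean(x)

Here the fixed step 0.005 plays the role of d(.); real algorithms such as BFGS choose the direction and step length adaptively.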
Many algorithms of this kind exist; they basically differ from one another in the way they compute the direction vector d(g(\theta_0)), to ensure that \ell(\theta_1) > \ell(\theta_0), so that we eventually end up on the maximum.

The default method gretl uses to maximize the log-likelihood is a gradient-based algorithm known as the BFGS (Broyden, Fletcher, Goldfarb and Shanno) method. This technique is used in most econometric and statistical packages, as it is well-established and remarkably powerful. Clearly, in order to make this technique operational, it must be possible to compute the vector g(\theta) for any value of \theta. In some cases this vector can be written explicitly as a function of Y; if this is not possible or too difficult, the gradient may be evaluated numerically. The alternative Newton-Raphson algorithm is also available. This method is more effective under some circumstances, but is also more fragile; see section 26.10 and chapter 37 for details. (Note that some of the statements made below, for example regarding estimation of the covariance matrix, have to be modified when Newton's method is used.)

The choice of the starting value, \theta_0, is crucial in some contexts and inconsequential in others. In general, however, it is advisable to start the algorithm from "sensible" values whenever possible. If a consistent estimator is available, this is usually a safe and efficient choice: it ensures that in large samples the starting point will likely be close to \hat\theta, and convergence can be achieved in few iterations.

The maximum number of iterations allowed for the BFGS procedure, and the relative tolerance for assessing convergence, can be adjusted using the set command: the relevant variables are bfgs_maxiter (default value 500) and bfgs_toler (default value: the machine precision to the power 3/4).

26.2 Syntax

ML estimation in gretl is supported by the mle command block. This consists of an initial line holding the keyword mle plus an equation for the log-likelihood, one or more statements within the block (details below), and a trailer line to close the block, end mle. Option flags may be appended to the trailer line. Listing 26.1 gives a simple but complete example, which serves to illustrate the equivalence of MLE and OLS in the context of the normal linear model.

Listing 26.1: OLS and MLE

  open data9-7
  list X = const INCOME PRICE
  ols QNC X

  matrix b = $coeff
  scalar s2 = $sigma^2
  scalar l2pi = log(2*$pi)
  scalar n = $nobs

  mle lt = -0.5*l2pi - 0.5*log(s2) - 1/(2*s2) * uhat^2
      series uhat = QNC - lincomb(X, b)
      s2 = sum(uhat^2)/n
      params b
  end mle

Initial line of block. If possible, the given expression should evaluate to a series or vector: the contribution to the log-likelihood per observation. Failing that, it must evaluate to a scalar: the total log-likelihood. The identifier on the left-hand side (lt in Listing 26.1) is up to the user. If the variable in question is defined prior to the mle block, it can be referenced after ML estimation; otherwise it is treated as a temporary variable and is destroyed after estimation.

Lines within the block. These may take three forms:

1. Helper statements that calculate auxiliary quantities (in the example, uhat and s2). Such statements will be evaluated before the log-likelihood, and then re-evaluated on each iteration.
2. Keyword plus parameter, as in params b, which tells mle that the parameter to be adjusted so as to maximize the log-likelihood is the vector b. This sort of statement can also be used to specify analytical derivatives of the log-likelihood with respect to the parameters; see section 26.7 for discussion and examples.
3. Statements employing print or printf to track the progress of the calculation, which can be useful for debugging.

Final line. In the example above, this merely terminates the block, but if one wanted standard errors calculated via a numerical approximation to the Hessian, for instance, one could substitute end mle --hessian. For a full listing of the applicable options, see the mle entry in the Gretl Command Reference.

26.3 Covariance matrix and standard errors

By default, the covariance matrix of the parameter estimates is based on the Outer Product of the Gradient (OPG). That is,

  \widehat{\mathrm{Var}}_{OPG}(\hat\theta) = \left( G'(\hat\theta)\, G(\hat\theta) \right)^{-1}   (26.2)

where G(\hat\theta) is the T x k matrix of contributions to the gradient. Other options are available. If the --hessian flag is given, the covariance matrix is computed from a numerical approximation to the Hessian at convergence. If the --robust option is given, the quasi-ML "sandwich" estimator is used:

  \widehat{\mathrm{Var}}_{QML}(\hat\theta) = H(\hat\theta)^{-1}\, G'(\hat\theta)\, G(\hat\theta)\, H(\hat\theta)^{-1}

where H denotes the numerical approximation to the Hessian.

A refinement here is that if the hac parameter is appended to the robust option, as in end mle --robust=hac, the sandwich estimator is augmented in the manner of Newey and West (1987), to allow for serial correlation in the gradient. (Note that this only makes sense for time-series data.) In that case, the details of the HAC estimator can be controlled via the set command, as described in chapter 22. Cluster-robust estimation is also available: in order to activate it, use the --cluster=clustvar option, where clustvar should be a discrete series. See section 22.5 for more details.

Note, however, that if the log-likelihood function supplied by the user just returns a scalar value, as opposed to a series or vector holding per-observation contributions, then the OPG method is not applicable, and so the covariance matrix must be estimated via a numerical approximation to the Hessian.
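To summarize, all of these choices are made on the trailer line. The following lines are schematic alternatives, not a runnable sequence ("firm" stands in for some discrete series of the user's and is purely hypothetical):

  end mle                  # default: OPG covariance matrix
  end mle --hessian        # numerical Hessian
  end mle --robust         # QML sandwich estimator
  end mle --robust=hac     # sandwich with Newey-West correction
  end mle --cluster=firm   # cluster-robust, clustered by "firm"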
26.4 Gamma estimation

Suppose we have a sample of T independent and identically distributed observations from a Gamma distribution. The density function for each observation x_t is

  f(x_t) = \frac{\alpha^p}{\Gamma(p)}\, x_t^{p-1} \exp(-\alpha x_t)   (26.3)

The log-likelihood for the entire sample can be written as the logarithm of the joint density of all the observations. Since these are independent and identical, the joint density is the product of the individual densities, and hence its log is

  \ell(\alpha, p) = \sum_{t=1}^T \log \left[ \frac{\alpha^p}{\Gamma(p)}\, x_t^{p-1} \exp(-\alpha x_t) \right] = \sum_{t=1}^T \ell_t   (26.4)

where

  \ell_t = p \log(\alpha x_t) - \gamma(p) - \log x_t - \alpha x_t

and \gamma(\cdot) is the log of the gamma function. In order to estimate the parameters \alpha and p via ML, we need to maximize (26.4) with respect to them. The corresponding gretl code snippet is:

  scalar alpha = 1
  scalar p = 1

  mle logl = p*ln(alpha*x) - lngamma(p) - ln(x) - alpha*x
      params alpha p
  end mle

The first two statements

  alpha = 1
  p = 1

are necessary to ensure that the variables alpha and p exist before the computation of logl is attempted. Inside the mle block these variables (which could be either scalars, vectors or a combination of the two; see below for an example) are identified as the parameters that should be adjusted to maximize the likelihood, via the params keyword. Their values will be changed by the execution of the mle command; upon successful completion, they will be replaced by the ML estimates. The starting value is 1 for both; this is arbitrary and does not matter much in this example (more on this later).

The above code can be made more readable, and marginally more efficient, by defining a variable to hold \alpha x_t. This command can be embedded in the mle block as follows:

  mle logl = p*ln(ax) - lngamma(p) - ln(x) - ax
      series ax = alpha*x
      params alpha p
  end mle

The variable ax is not added to the params list, of course, since it is just an auxiliary variable to facilitate the calculations. You can insert as many such auxiliary lines as you require before the params line, with the restriction that they must contain either (a) commands to generate series, scalars or matrices or (b) print commands (which may be used to aid in debugging).
In a simple example like this, the choice of the starting values is almost inconsequential; the algorithm is likely to converge no matter what the starting values are. However, consistent method-of-moments estimators of p and \alpha can be simply recovered from the sample mean m and variance V: since it can be shown that

  E(x_t) = p/\alpha, \qquad V(x_t) = p/\alpha^2

it follows that the estimators

  \bar\alpha = m/V, \qquad \bar{p} = m \cdot \bar\alpha

are consistent, and therefore suitable to be used as a starting point for the algorithm. The gretl script code then becomes:

  scalar m = mean(x)
  scalar alpha = m/var(x)
  scalar p = m*alpha

  mle logl = p*ln(ax) - lngamma(p) - ln(x) - ax
      series ax = alpha*x
      params alpha p
  end mle

Another thing to note is that sometimes parameters are constrained within certain boundaries: in this case, for example, both \alpha and p must be positive numbers. Gretl does not check for this: it is the user's responsibility to ensure that the function is always evaluated at an admissible point in the parameter space during the iterative search for the maximum. An effective technique is to define a variable for checking that the parameters are admissible, and setting the log-likelihood as undefined if the check fails. An example, which uses the conditional assignment operator, follows:

  scalar m = mean(x)
  scalar alpha = m/var(x)
  scalar p = m*alpha

  mle logl = check ? p*ln(ax) - lngamma(p) - ln(x) - ax : NA
      series ax = alpha*x
      scalar check = (alpha > 0) && (p > 0)
      params alpha p
  end mle

26.5 Stochastic frontier cost function

(Note: this section has the sole purpose of illustrating the mle command; for the estimation of stochastic frontier cost or production functions, you may want to use the frontier function package.)

When modeling a cost function, it is sometimes worthwhile to incorporate explicitly into the statistical model the notion that firms may be inefficient, so that the observed cost deviates from the theoretical figure not only because of unobserved heterogeneity between firms, but also because two firms could be operating at a different efficiency level, despite being identical in all other respects. In this case we may write

  C_i = C_i^* + u_i + v_i

where C_i is some variable cost indicator, C_i^* is its "theoretical" value, u_i is a zero-mean disturbance term and v_i is the inefficiency term, which is supposed to be non-negative by its very nature.

A linear specification for C_i^* is often chosen. For example, the Cobb-Douglas cost function arises when C_i^* is a linear function of the logarithms of the input prices and the output quantities.

The stochastic frontier model is a linear model of the form y_i = x_i'\beta + \varepsilon_i in which the error term \varepsilon_i is the sum of u_i and v_i. A common postulate is that u_i ~ N(0, \sigma_u^2) and v_i ~ |N(0, \sigma_v^2)|. If independence between u_i and v_i is also assumed, then it is possible to show that the density function of \varepsilon_i has the form

  f(\varepsilon_i) = \sqrt{\frac{2}{\pi}}\, \Phi\!\left(\frac{\lambda \varepsilon_i}{\sigma}\right) \frac{1}{\sigma}\, \phi\!\left(\frac{\varepsilon_i}{\sigma}\right)   (26.5)

where \Phi(\cdot) and \phi(\cdot) are, respectively, the distribution and density function of the standard normal, \sigma = \sqrt{\sigma_u^2 + \sigma_v^2} and \lambda = \sigma_u/\sigma_v. As a consequence, the log-likelihood for one observation takes the form (apart from an irrelevant constant)

  \ell_t = \log \Phi\!\left(\frac{\lambda \varepsilon_i}{\sigma}\right) - \left[ \log \sigma + \frac{\varepsilon_i^2}{2\sigma^2} \right]

Therefore, a Cobb-Douglas cost function with stochastic frontier is the model described by the following equations:

  \log C_i = \log C_i^* + \varepsilon_i
  \log C_i^* = c + \sum_{j=1}^m \beta_j \log y_{ij} + \sum_{j=1}^n \alpha_j \log p_{ij}
  \varepsilon_i = u_i + v_i
  u_i ~ N(0, \sigma_u^2)
  v_i ~ |N(0, \sigma_v^2)|

In most cases, one wants to ensure that the homogeneity of the cost function with respect to the prices holds by construction. Since this requirement is equivalent to \sum_{j=1}^n \alpha_j = 1, the above equation for \log C_i can be rewritten as

  \log C_i - \log p_{in} = c + \sum_{j=1}^m \beta_j \log y_{ij} + \sum_{j=2}^n \alpha_j (\log p_{ij} - \log p_{in}) + \varepsilon_i   (26.6)

The above equation could be estimated by OLS, but it would suffer from two drawbacks: first, the OLS estimator for the intercept c is inconsistent, because the disturbance term has a non-zero expected value; second, the OLS estimators for the other parameters are consistent, but inefficient in view of the non-normality of \varepsilon_i. Both issues can be addressed by estimating (26.6) by maximum likelihood. Nevertheless, OLS estimation is a quick and convenient way to provide starting values for the MLE algorithm.

Listing 26.2 shows how to implement the model described so far. The banks91 file contains part of the data used in Lucchetti, Papi and Zazzaro (2001).
Listing 26.2: Estimation of stochastic frontier cost function (with scalar parameters)

  open banks91.gdt

  # transformations
  series cost = ln(VC)
  series q1 = ln(Q1)
  series q2 = ln(Q2)
  series p1 = ln(P1)
  series p2 = ln(P2)
  series p3 = ln(P3)

  # Cobb-Douglas cost function with homogeneity restrictions
  # (for initialization)
  series rcost = cost - p1
  series rp2 = p2 - p1
  series rp3 = p3 - p1
  ols rcost const q1 q2 rp2 rp3

  # Cobb-Douglas cost function with homogeneity restrictions
  # and inefficiency
  scalar b0 = $coeff(const)
  scalar b1 = $coeff(q1)
  scalar b2 = $coeff(q2)
  scalar b3 = $coeff(rp2)
  scalar b4 = $coeff(rp3)
  scalar su = 0.1
  scalar sv = 0.1

  mle logl = ln(cnorm(e*lambda/ss)) - (ln(ss) + 0.5*(e/ss)^2)
      scalar ss = sqrt(su^2 + sv^2)
      scalar lambda = su/sv
      series e = rcost - b0*const - b1*q1 - b2*q2 - b3*rp2 - b4*rp3
      params b0 b1 b2 b3 b4 su sv
  end mle

The script in Listing 26.2 is relatively easy to modify to show how one can use vectors (that is, 1-dimensional matrices) for storing the parameters to optimize: Listing 26.3 holds essentially the same script, with the parameters of the cost function stored together in a vector. Of course, this also makes it possible to use variable lists and other refinements which make the code more compact and readable.

Listing 26.3: Estimation of stochastic frontier cost function (with matrix parameters)

  open banks91.gdt

  # transformations
  series cost = ln(VC)
  series q1 = ln(Q1)
  series q2 = ln(Q2)
  series p1 = ln(P1)
  series p2 = ln(P2)
  series p3 = ln(P3)

  # Cobb-Douglas cost function with homogeneity restrictions
  # (for initialization)
  series rcost = cost - p1
  series rp2 = p2 - p1
  series rp3 = p3 - p1
  list X = const q1 q2 rp2 rp3
  ols rcost X

  # Cobb-Douglas cost function with homogeneity restrictions
  # and inefficiency
  matrix b = $coeff
  scalar su = 0.1
  scalar sv = 0.1

  mle logl = ln(cnorm(e*lambda/ss)) - (ln(ss) + 0.5*(e/ss)^2)
      scalar ss = sqrt(su^2 + sv^2)
      scalar lambda = su/sv
      series e = rcost - lincomb(X, b)
      params b su sv
  end mle

26.6 GARCH models

GARCH models are handled by gretl via a native function. However, it is instructive to see how they can be estimated through the mle command. (The gig addon, which handles other variants of conditionally heteroskedastic models, uses mle as its internal engine.) The following equations provide the simplest example of a GARCH(1,1) model:

  y_t = \mu + \varepsilon_t
  \varepsilon_t = u_t \cdot \sigma_t
  u_t ~ N(0, 1)
  h_t = \omega + \alpha \varepsilon_{t-1}^2 + \beta h_{t-1}

Since the variance of y_t depends on past values, writing down the log-likelihood function is not simply a matter of summing the log densities for individual observations. As is common in time-series models, y_t cannot be considered independent of the other observations in our sample; consequently, the density function for the whole sample (the joint density for all observations) is not just the product of the marginal densities.

Maximum likelihood estimation, in these cases, is achieved by considering conditional densities, so what we maximize is a conditional likelihood function. If we define the information set at time t as

  F_t = \{ y_t, y_{t-1}, \ldots \}

then the density of y_t conditional on F_{t-1} is normal:

  y_t | F_{t-1} ~ N(\mu, h_t)

By means of the properties of conditional distributions, the joint density can be factorized as follows:

  f(y_t, y_{t-1}, \ldots) = \left[ \prod_{t=1}^T f(y_t | F_{t-1}) \right] \cdot f(y_0)

If we treat y_0 as fixed, then the term f(y_0) does not depend on the unknown parameters, and therefore the conditional log-likelihood can be written as the sum of the individual contributions as

  \ell(\mu, \omega, \alpha, \beta) = \sum_{t=1}^T \ell_t   (26.7)

where

  \ell_t = \log \left[ \frac{1}{\sqrt{h_t}}\, \phi\!\left( \frac{y_t - \mu}{\sqrt{h_t}} \right) \right] = -\frac{1}{2} \left[ \log h_t + \frac{(y_t - \mu)^2}{h_t} \right]

The following script shows a simple application of this technique. It uses the data file djclose, one of the example datasets supplied with gretl, which contains daily data from the Dow Jones stock index.

  open djclose

  series y = 100*ldiff(djclose)

  scalar mu = 0.0
  scalar omega = 1
  scalar alpha = 0.4
  scalar beta = 0.0

  mle ll = -0.5*(log(h) + (e^2)/h)
      series e = y - mu
      series h = var(y)
      series h = omega + alpha*(e(-1))^2 + beta*h(-1)
      params mu omega alpha beta
  end mle
26.7 Analytical derivatives

Computation of the score vector is essential for the working of the BFGS method. In all the previous examples, no explicit formula for the computation of the score was given, so the algorithm was fed numerically evaluated gradients. Numerical computation of the score for the i-th parameter is performed via a finite approximation of the derivative, namely

  \frac{\partial \ell(\theta_1, \ldots, \theta_n)}{\partial \theta_i} \simeq \frac{\ell(\theta_1, \ldots, \theta_i + h, \ldots, \theta_n) - \ell(\theta_1, \ldots, \theta_i - h, \ldots, \theta_n)}{2h}

where h is a small number.

In many situations, this is rather efficient and accurate. A better approximation to the true derivative may be obtained by forcing mle to use a technique known as Richardson Extrapolation, which gives extremely precise results but is considerably more CPU-intensive. This feature may be turned on by using the set command, as in

  set bfgs_richardson on

However, one might want to avoid the approximation altogether and specify an exact function for the derivatives. As an example, consider the following script:

  nulldata 1000

  series x1 = normal()
  series x2 = normal()
  series x3 = normal()

  series ystar = x1 + x2 + x3 + normal()
  series y = (ystar > 0)

  scalar b0 = 0
  scalar b1 = 0
  scalar b2 = 0
  scalar b3 = 0

  mle logl = y*ln(P) + (1-y)*ln(1-P)
      series ndx = b0 + b1*x1 + b2*x2 + b3*x3
      series P = cnorm(ndx)
      params b0 b1 b2 b3
  end mle --verbose

Here, 1000 data points are artificially generated for an ordinary probit model: y_t is a binary variable, which takes the value 1 if y_t^* = \beta_1 x_{1t} + \beta_2 x_{2t} + \beta_3 x_{3t} + \varepsilon_t > 0 and 0 otherwise. Therefore, y_t = 1 with probability \Phi(\beta_1 x_{1t} + \beta_2 x_{2t} + \beta_3 x_{3t}) = \pi_t. The probability function for one observation can be written as

  P(y_t) = \pi_t^{y_t} (1 - \pi_t)^{1 - y_t}

Since the observations are independent and identically distributed, the log-likelihood is simply the sum of the individual contributions. Hence

  \ell = \sum_{t=1}^T \left[ y_t \log \pi_t + (1 - y_t) \log(1 - \pi_t) \right]

The --verbose switch at the end of the end mle statement produces a detailed account of the iterations done by the BFGS algorithm. (Again, gretl does provide a native probit command, see section 38.1, but a probit model makes for a nice example here.)

In this case, numerical differentiation works rather well; nevertheless, computation of the analytical score is straightforward, since the derivative \partial\ell/\partial\beta_i can be written as

  \frac{\partial \ell}{\partial \beta_i} = \frac{\partial \ell}{\partial \pi_t} \cdot \frac{\partial \pi_t}{\partial \beta_i}

via the chain rule, and it is easy to see that

  \frac{\partial \ell}{\partial \pi_t} = \frac{y_t}{\pi_t} - \frac{1 - y_t}{1 - \pi_t}
  \frac{\partial \pi_t}{\partial \beta_i} = \phi(\beta_1 x_{1t} + \beta_2 x_{2t} + \beta_3 x_{3t}) \cdot x_{it}

The mle block in the above script can therefore be modified as follows:

  mle logl = y*ln(P) + (1-y)*ln(1-P)
      series ndx = b0 + b1*x1 + b2*x2 + b3*x3
      series P = cnorm(ndx)
      series m = dnorm(ndx)*(y/P - (1-y)/(1-P))
      deriv b0 = m
      deriv b1 = m*x1
      deriv b2 = m*x2
      deriv b3 = m*x3
  end mle --verbose

Note that the params statement has been replaced by a series of deriv statements; these have the double function of identifying the parameters over which to optimize and providing an analytical expression for their respective score elements.
26.8 Debugging ML scripts

We have discussed above the main sorts of statements that are permitted within an mle block, namely

- auxiliary commands to generate helper variables;
- deriv statements to specify the gradient with respect to each of the parameters; and
- a params statement to identify the parameters in case analytical derivatives are not given.

For the purpose of debugging ML estimators, one additional sort of statement is allowed: you can print the value of a relevant variable at each step of the iteration. This facility is more restricted than the regular print command: the command word print should be followed by the name of just one variable (a scalar, series or matrix).

In the last example above, a key variable named m was generated, forming the basis for the analytical derivatives. To track the progress of this variable, one could add a print statement within the ML block, as in

  series m = dnorm(ndx)*(y/P - (1-y)/(1-P))
  print m

26.9 Using functions

The mle command allows you to estimate models that gretl does not provide natively: in some cases, it may be a good idea to wrap up the mle block in a user-defined function (see Chapter 14), so as to extend gretl's capabilities in a modular and flexible way.

As an example, we will take a simple case of a model that gretl does not yet provide natively: the zero-inflated Poisson model, or ZIP for short. (The actual ZIP model is in fact a bit more general than the one presented here. The specialized version discussed in this section was chosen for the sake of simplicity; for further details, see Greene (2003).) In this model, we assume that we observe a mixed population: for some individuals, the variable y_t is (conditionally on a vector of exogenous covariates x_t) distributed as a Poisson random variate; for some others, y_t is identically 0. The trouble is, we don't know which category a given individual belongs to.

For instance, suppose we have a sample of women, and the variable y_t represents the number of children that woman t has. There may be a certain proportion, \alpha, of women for whom y_t = 0 with certainty (maybe out of a personal choice, or due to physical impossibility). But there may be other women for whom y_t = 0 just as a matter of chance: they haven't happened to have any children at the time of observation. In formulae:

  P(y_t = k | x_t) = \alpha d_t + (1 - \alpha) \frac{e^{-\mu_t} \mu_t^{y_t}}{y_t!}
  \mu_t = \exp(x_t'\beta)
  d_t = 1 for y_t = 0, 0 for y_t > 0

Writing an mle block for this model is not difficult:

  mle ll = logprob
      series xb = exp(b0 + b1*x)
      series d = (y == 0)
      series poiprob = exp(-xb) * xb^y / gamma(y+1)
      series logprob = (alpha > 0) && (alpha < 1) ? \
        log(alpha*d + (1-alpha)*poiprob) : NA
      params alpha b0 b1
  end mle -v

However, the code above has to be modified each time we change our specification by, say, adding an explanatory variable. Using functions, we can simplify this task considerably and eventually be able to write something easy like:

  list X = const x
  zip(y, X)

Let's see how this can be done. First, we need to define a function called zip() that will take two arguments: a dependent variable y and a list of explanatory variables X. An example of such a function can be seen in Listing 26.4. By inspecting the function code, you can see that the actual estimation does not happen here: rather, the zip() function merely uses the built-in modprint command to print out the results coming from another user-written function, namely zip_estimate().

The function zip_estimate() is not meant to be executed directly; it just contains the number-crunching part of the job, whose results are then picked up by the end function zip(). In turn, zip_estimate() calls other user-written functions to perform other tasks. The whole set of "internal" functions is shown in Listing 26.5.
All the functions shown in Listings 26.4 and 26.5 can be stored in a separate .inp file and executed once, at the beginning of our job, by means of the include command. Assuming the name of this script file is zip_est.inp, the following is an example script which (a) includes the script file, (b) generates a simulated dataset, and (c) performs the estimation of a ZIP model on the artificial data.

Listing 26.4: Zero-inflated Poisson model, user-level function

  /*
    user-level function: estimate the model and print out
    the results
  */
  function void zip(series y, list X)
      matrix coef_stde = zip_estimate(y, X)
      printf "\nZero-inflated Poisson model:\n"
      string parnames = "alpha,"
      string parnames += varname(X)
      modprint coef_stde parnames
  end function

Listing 26.5: Zero-inflated Poisson model, internal functions

  # compute log probabilities for the plain Poisson model
  function series ln_poi_prob(series y, list X, matrix beta)
      series xb = lincomb(X, beta)
      return -exp(xb) + y*xb - lngamma(y+1)
  end function

  # compute log probabilities for the zero-inflated Poisson model
  function series ln_zip_prob(series y, list X, matrix beta, scalar p0)
      # check if the probability is in [0,1]; otherwise, return NA
      if p0 > 1 || p0 < 0
          series ret = NA
      else
          series ret = ln_poi_prob(y, X, beta) + ln(1-p0)
          series ret = (y == 0) ? ln(p0 + exp(ret)) : ret
      endif
      return ret
  end function

  # do the actual estimation (silently)
  function matrix zip_estimate(series y, list X)
      # initialize alpha to a "sensible" value: half the frequency
      # of zeros in the sample
      scalar alpha = mean(y == 0)/2
      # initialize the coeffs (we assume the first explanatory
      # variable is the constant here)
      matrix coef = zeros(nelem(X), 1)
      coef[1] = mean(y) / (1-alpha)
      # do the actual ML estimation
      mle ll = ln_zip_prob(y, X, coef, alpha)
          params alpha coef
      end mle --hessian --quiet
      return $coeff ~ $stderr
  end function

The main script:

  set verbose off
  # include the user-written functions
  include zip_est.inp

  # generate the artificial data
  nulldata 1000
  set seed 732237
  scalar truep = 0.2
  scalar b0 = 0.2
  scalar b1 = 0.5
  series x = normal()
  series y = (uniform() < truep) ? 0 : randgen(p, exp(b0 + b1*x))
  list X = const x

  # estimate the zero-inflated Poisson model
  zip(y, X)

The results are as follows:

  Zero-inflated Poisson model:

             coefficient   std. error   z-stat    p-value
    alpha     0.209738     0.0261746     8.013    1.12e-15
    const     0.167847     0.0449693     3.732    0.0002
    x         0.452390     0.0340836    13.27     3.32e-40

A further step may then be creating a function package, for accessing your new zip() function via gretl's graphical interface. For details on how to do this, see section 14.5.

26.10 Advanced use of mle: functions, analytical derivatives, algorithm choice

All the techniques described in the previous sections may be combined, and mle can be used for solving non-standard estimation problems (provided, of course, that one chooses maximum likelihood as the preferred inference method).

The strategy that, as of this writing, has proven most successful in designing scripts for this purpose is:

- Modularize your code as much as possible.
- Use analytical derivatives whenever possible.
- Choose your optimization method wisely.

In the rest of this section, we will expand on the probit example of section 26.7 to give the reader an idea of what a "heavy-duty" application of mle looks like. Most of the code fragments come from mle-advanced.inp, which is one of the sample scripts supplied with the standard installation of gretl (see under File > Script files > Practice file).

BFGS with and without analytical derivatives

The example in section 26.7 can be made more general by using matrices and user-written functions. Consider the following code fragment:

  list X = const x1 x2 x3
  matrix b = zeros(nelem(X), 1)

  mle logl = y*ln(P) + (1-y)*ln(1-P)
      series ndx = lincomb(X, b)
      series P = cnorm(ndx)
      params b
  end mle

In this context, the fact that the model we are estimating has four explanatory variables is totally incidental: the code is written in such a way that we could change the content of the list X without having to make any other modification. This was made possible by:

1. gathering the parameters to estimate into a single vector b, rather than using separate scalars;
2. using the nelem() function to initialize b, so that its dimension is kept track of automatically;
3. using the lincomb() function to compute the index function.

A parallel enhancement could be achieved in the case of analytically computed derivatives: since b is now a vector, mle expects the argument to the deriv keyword to be a matrix, in which each column is the partial derivative with respect to the corresponding element of b. It is useful to re-write the score for the i-th observation as

  \frac{\partial \ell_i}{\partial \beta} = m_i\, x_i'   (26.8)

where m_i is the "signed Mills' ratio", that is,

  m_i = y_i\, \frac{\phi(x_i'\beta)}{\Phi(x_i'\beta)} - (1 - y_i)\, \frac{\phi(x_i'\beta)}{1 - \Phi(x_i'\beta)}

which was computed in section 26.7 via

  series P = cnorm(ndx)
  series m = dnorm(ndx)*(y/P - (1-y)/(1-P))

Here, we will code it in a somewhat terser way as

  series m = y ? invmills(-ndx) : -invmills(ndx)

and make use of the conditional assignment operator and of the specialized function invmills() for efficiency. Building the score matrix is now easily achieved via

  mle logl = y*ln(P) + (1-y)*ln(1-P)
      series ndx = lincomb(X, b)
      series P = cnorm(ndx)
      series m = y ? invmills(-ndx) : -invmills(ndx)
      matrix mX = {X}
      deriv b = mX .* {m}
  end mle

in which the {} operator was used to turn series and lists into matrices (see chapter 17). However, proceeding in this way for more complex models than probit may imply inserting into the mle block a long series of instructions; the example above merely happens to be short because the score matrix for the probit model is very easy to write in matrix form.

A better solution is writing a user-level function to compute the score, and using that inside the mle block, as in

  function matrix score(matrix b, series y, list X)
      series ndx = lincomb(X, b)
      series m = y ? invmills(-ndx) : -invmills(ndx)
      return {m} .* {X}
  end function

  mle logl = y*ln(P) + (1-y)*ln(1-P)
      series ndx = lincomb(X, b)
      series P = cnorm(ndx)
      deriv b = score(b, y, X)
  end mle

In this way, no matter how complex the computation of the score is, the mle block remains nicely compact.
Newton's method and the analytical Hessian

As mentioned above, gretl offers the user the option of using Newton's method for maximizing the log-likelihood. In terms of the notation used in section 26.1, the direction for updating the initial parameter vector \theta_0 is given by

  d[g(\theta_0)] = -\lambda H(\theta_0)^{-1} g(\theta_0)   (26.9)

where H(\theta) is the Hessian of the total log-likelihood computed at \theta, and 0 < \lambda < 1 is a scalar called the step length.

The above expression makes a few points clear:

1. At each step, it must be possible to compute not only the score g(\theta), but also its derivative H(\theta);
2. the matrix H(\theta) should be nonsingular;
3. it is assumed that, for some positive value of \lambda, \ell(\theta_1) > \ell(\theta_0); in other words, that going in the direction d[g(\theta_0)] leads upwards for some step length.

The strength of Newton's method lies in the fact that, if the log-likelihood is globally concave, then (26.9) enjoys certain optimality properties and the number of iterations required to reach the maximum is often much smaller than it would be with other methods, such as BFGS. However, it may have some disadvantages: for a start, the Hessian H(\theta) may be difficult or very expensive to compute; moreover, the log-likelihood may not be globally concave, so for some values of \theta the matrix H(\theta) is not negative definite, or perhaps even singular. Those cases are handled by gretl's implementation of Newton's algorithm by means of several heuristic techniques. (The gist of it is that, if H is not negative definite, it is substituted by k \cdot \mathrm{dg}(H) + (1-k) H, where k is a suitable scalar; however, if you're interested in the precise details, you'll be much better off looking at the source code: the file you'll want to look at is lib/src/gretl_bfgs.c.) Still, a number of adverse consequences may occur, ranging from longer computation time for optimization to non-convergence of the algorithm.

As a consequence, using Newton's method is advisable only when the computation of the Hessian is not too CPU-intensive, and the nature of the estimator is such that it is known in advance that the log-likelihood is globally concave. The probit model satisfies both requisites, so we will expand the preceding example to illustrate how to use Newton's method in gretl.

A first example may be given simply by issuing the command

  set optimizer newton

before the mle block. (To go back to BFGS, use set optimizer bfgs.) This will instruct gretl to use Newton's method instead of BFGS. If the deriv keyword is used, gretl will differentiate the score function numerically; otherwise, if the score itself has to be computed numerically, gretl will calculate H(\theta) by differentiating the log-likelihood numerically twice. The latter solution, though, is generally to be avoided, as it may be extremely time-consuming and may yield imprecise results.

A much better option is to calculate the Hessian analytically and have gretl use its true value, rather than a numerical approximation. In most cases this is both much faster and numerically stable, but of course it comes at the price of having to differentiate the log-likelihood twice with respect to the parameters and translate the resulting expressions into efficient hansl code.

Luckily, both tasks are relatively easy in the probit case: the matrix of second derivatives of \ell_i may be written as

  \frac{\partial^2 \ell_i}{\partial\beta\, \partial\beta'} = -m_i (m_i + x_i'\beta)\, x_i x_i'

so the total Hessian is

  \sum_{i=1}^n \frac{\partial^2 \ell_i}{\partial\beta\, \partial\beta'} = -X' \begin{bmatrix} w_1 & & \\ & \ddots & \\ & & w_n \end{bmatrix} X   (26.10)

where w_i = m_i (m_i + x_i'\beta). It can be shown that w_i > 0, so the Hessian is guaranteed to be negative definite in all sensible cases, and the conditions are ideal for applying Newton's method.

A hansl translation of equation (26.10) may look like:

  function void Hess(matrix *H, matrix b, series y, list X)
      # computes the negative Hessian for a Probit model
      series ndx = lincomb(X, b)
      series m = y ? invmills(-ndx) : -invmills(ndx)
      series w = m*(m+ndx)
      matrix mX = {X}
      H = (mX .* {w})'mX
  end function

There are two characteristics worth noting of the function above. For a start, it doesn't return anything: the result of the computation is simply stored in the matrix pointed at by the first argument of the function. Second, the result is not the Hessian proper, but rather its negative.

This function becomes usable from within an mle block by the keyword hessian. The syntax is

  mle ...
      ...
      hessian funcname(&mat_addr, ...)
  end mle

In other words, the hessian keyword must be followed by the call to a function whose first argument is a matrix pointer, which is supposed to be filled with the negative of the Hessian at \theta.
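Putting the pieces together, a complete block under Newton's method might be sketched as follows; this is our own assembly of the fragments above, using the score() and Hess() functions just defined:

  set optimizer newton
  matrix H = {}    # to be filled by Hess()
  mle logl = y*ln(P) + (1-y)*ln(1-P)
      series ndx = lincomb(X, b)
      series P = cnorm(ndx)
      deriv b = score(b, y, X)
      hessian Hess(&H, b, y, X)
  end mle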
We said above (section 26.1) that the covariance matrix of the parameter estimates is, by default, estimated using the Outer Product of the Gradient, so long as the log-likelihood function returns the per-observation contributions. However, if you supply a function that computes the Hessian, then by default it is used in estimating the covariance matrix. If you wish to impose the use of OPG instead, append the --opg option to the end of the mle block.

Note that gretl does not perform any numerical check on whether a user-supplied function computes the Hessian correctly. On the one hand, this means that you can trick mle into using alternatives to the Hessian, and thereby implement other optimization methods. For example, if you substitute in equation (26.9) the Hessian H with the negative of the OPG matrix -G'G, as defined in (26.2), you get the so-called BHHH optimization method (see Berndt et al., 1974). Again, the sample file mle-advanced.inp provides an example.
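By way of illustration, here is a hedged sketch of that trick, built on the score() function above (our own construction, not the version in mle-advanced.inp). Since the hessian keyword expects the negative of the matrix that replaces H, and BHHH sets H = -G'G, the function should store G'G:

  function void negOPG(matrix *H, matrix b, series y, list X)
      # G'G, i.e. minus the BHHH substitute for the Hessian
      matrix G = score(b, y, X)
      H = G'G
  end function

  matrix H = {}
  mle logl = y*ln(P) + (1-y)*ln(1-P)
      series ndx = lincomb(X, b)
      series P = cnorm(ndx)
      deriv b = score(b, y, X)
      hessian negOPG(&H, b, y, X)
  end mle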
Note that gretl does not perform any numerical check on whether a user-supplied function computes the Hessian correctly. On the one hand, this means that you can "trick" mle into using alternatives to the Hessian and thereby implement other optimization methods. For example, if you substitute in equation (26.9) the Hessian H with the negative of the OPG matrix −G′G, as defined in (26.2), you get the so-called BHHH optimization method (see Berndt et al., 1974). Again, the sample file mle-advanced.inp provides an example.

On the other hand, you may want to perform a check of your analytically-computed H matrix versus a numerical approximation. If you have a function that computes the score, this is relatively simple to do by using the fdjac function, briefly described in section 37.3, which computes a numerical approximation to a derivative. In practice, you need a function computing g(θ) as a row vector, and then use fdjac to differentiate it numerically with respect to θ. The result can then be compared to your analytically-computed Hessian. The code fragment below shows an example of how this can be done in the probit case:

function matrix totalscore(matrix b, series y, list X)
    # computes the total score
    return sumc(score(b, y, X))
end function

function void check(matrix b, series y, list X)
    # compares the analytical Hessian to its numerical
    # approximation obtained via fdjac
    matrix aH
    Hess(&aH, b, y, X)   # stores the analytical Hessian into aH

    matrix nH = fdjac(b, totalscore(b, y, X))
    nH = 0.5*(nH + nH')  # force symmetry

    printf "Numerical Hessian\n%16.6f\n", nH
    printf "Analytical Hessian (negative)\n%16.6f\n", aH
    printf "Check (should be zero)\n%16.6f\n", nH + aH
end function

26.11 Estimating constrained models

In many cases, you may want to perform ML estimation of a model under some kind of constraint. Mathematically, this amounts to maximizing the log-likelihood ℓ(θ) under some restriction. Assume that the restriction can be represented as g(θ) = 0, where the function g is differentiable. On paper, the most straightforward way to accomplish this task is to set up a Lagrangean

    L(θ, λ) = ℓ(θ) − λ g(θ)

and solve the first-order conditions that arise from differentiating the Lagrangean with respect to θ and λ. If an explicit solution can be found, then all is well; but in many cases the resulting system of equations cannot be solved explicitly, so that numerical optimization is necessary. In such cases the approach above is not particularly useful: a different strategy is much more convenient.

The idea is to find an alternative parametrization, that is, a means of expressing the vector θ as a differentiable function of a smaller set of parameters ψ. In other words, find a function h such that any admissible value of θ can be written as θ = h(ψ), and g(h(ψ)) = 0 for any value of ψ. Then maximization of the log-likelihood is simply a question of operating on ℓ*(ψ) = ℓ(h(ψ)) using an ordinary unconstrained numerical optimization routine. Once the ML estimate ψ̂ is available, it is easy to recover the corresponding constrained vector θ̂ = h(ψ̂).

Computing the covariance matrix involves an extra step, known as the "delta method": the asymptotic covariance matrix of θ̂ can be computed as

    V(θ̂) = J(ψ̂) V(ψ̂) J(ψ̂)′,   (26.11)

where J is the Jacobian matrix, holding the partial derivatives of h(ψ). It is recommended that the Jacobian matrix should be computed analytically whenever possible, but as a fallback strategy numerical differentiation (available via the function fdjac; see section 37.3) is a viable alternative. Note that the matrix produced by this method will be singular by construction.
The example reported in Listing 26.6 is perhaps a little contrived, but useful to elucidate the technique. Suppose we wish to estimate the mean and variance of an i.i.d. sample of Gaussian random variables under the constraint that

    V(xₜ) = σ² = exp[E(xₜ)] = exp(μ).

Of course, the unconstrained ML estimators μ̂ = X̄ and σ̂² = n⁻¹ Σᵢ (xᵢ − X̄)² are not guaranteed to satisfy the constraint; in fact, the probability that they do is 0. The Lagrangean in this case would be

    L(θ) = K − (n/2) log σ² − [1/(2σ²)] Σᵢ (xᵢ − μ)² − λ (e^μ − σ²),

and finding an explicit solution by solving the first-order conditions is not at all easy. Fortunately, numerical optimization becomes straightforward by expressing the constrained parameters as

    θ = [μ, σ²]′ = [ψ, exp(ψ)]′ = h(ψ);

after maximizing the log-likelihood, the covariance matrix for θ̂ can be recovered by computing the Jacobian as

    J(ψ) = [dμ/dψ, dσ²/dψ]′ = [1, exp(ψ)]′

and applying formula (26.11). Running the example script should produce the following output:

unconstrained estimates: mean = 1.00314, variance = 2.8903
check: vhat - exp(muhat) = 0.163481

Model 1: ML, using observations 1-1000
loglik = -0.5*log(2*pi) - 0.5*log(s2) - 0.5*(x-m)^2/s2
Standard errors based on Outer Products matrix

            estimate   std. error     z      p-value
  psi[1]    1.03763    0.0357311    29.04    2.07e-185

Log-likelihood   -1949.972    Akaike criterion   3901.943
Schwarz criterion 3906.851    Hannan-Quinn       3903.808

check: vhat - exp(muhat) = 0

            coefficient   std. error     z      p-value
  mean      1.03763       0.0357311    29.04    2.07e-185
  variance  2.82251       0.100851     27.99    2.35e-172

Listing 26.6: Example of ML estimation of a model under constraints

set verbose off
set seed 7120

function matrix h(matrix psi)
    matrix ret = psi[1] | exp(psi[1])
    return ret
end function

function matrix anJacob(matrix psi)
    # the derivative of h()
    return 1 | exp(psi[1])
end function

nulldata 1000
# generate artificial data from a N(1, e) distribution
series x = 1 + normal() * exp(0.5)

# show that the unconstrained estimates don't satisfy the restriction
scalar muhat = mean(x)
scalar s2hat = sst(x)/$nobs
printf "unconstrained estimates: mean = %g, variance = %g\n", muhat, s2hat
printf "check: vhat - exp(muhat) = %g\n", s2hat - exp(muhat)

# now estimate under the constraint exp(mean) = variance
psi = 1
mle loglik = -0.5*log(2*$pi) - 0.5*log(s2) - 0.5*(x-m)^2/s2
    matrix par = h(psi)
    scalar m = par[1]
    scalar s2 = par[2]
    params psi
end mle

# now map psi to the constrained parametrisation
matrix par = h(psi)
# show that now the constraint holds
printf "check: vhat - exp(muhat) = %g\n", par[2] - exp(par[1])

# take care of the covariance matrix
matrix vpar = qform(anJacob(psi), $vcv)
# alternatively, one could use the numerical Jacobian, as in
# matrix vpar = qform(fdjac(psi, h(psi)), $vcv)

# finally, print out the constrained parameters via modprint
matrix cs = par ~ sqrt(diag(vpar))
modprint cs "mean variance"

The example provided in Listing 26.7 illustrates the usage of catch in an artificially simple context: we use the mle command for estimating the mean and variance of a Gaussian rv (of course, you don't need the mle apparatus for this, but it makes for a nice example). The gist of the example is using the set bfgs_maxiter command to force mle to abort after a very small number of iterations, so that you can have an idea of how to use the catch modifier and the associated $error accessor to handle the situation. You may want to increase the maximum number of BFGS iterations in the example, to check what happens if the algorithm is allowed to converge. Note that upon successful completion of mle a bundle named $model is available, containing several quantities that may be of interest, including the total number of function evaluations.
Listing 26.7: Handling non-convergence via catch

set verbose off
nulldata 200
set seed 8118

# generate simulated data from a N(3,4) variate
series x = normal(3,2)

# set starting values
scalar m = 0
scalar s2 = 1

# set iteration limit to a ridiculously low value
set bfgs_maxiter 10

# perform ML estimation; note the "catch" modifier
catch mle loglik = -0.5 * (log(2*$pi) + log(s2) + e2/s2)
    series e2 = (x - m)^2
    params m s2
end mle --quiet

# grab the error and proceed as needed
err = $error
if err
    printf "Not converged: m = %g, s2 = %g\n", m, s2
else
    printf "Converged after %d iterations\n", $model.grcount
    cs = $coeff ~ sqrt(diag($vcv))
    pn = "m s2"
    modprint cs pn
endif

Chapter 27
GMM estimation

27.1 Introduction and terminology

The Generalized Method of Moments (GMM) is a very powerful and general estimation method, which encompasses practically all the parametric estimation techniques used in econometrics. It was introduced in Hansen (1982) and Hansen and Singleton (1982); an excellent and thorough treatment is given in chapter 17 of Davidson and MacKinnon (1993).

The basic principle on which GMM is built is rather straightforward. Suppose we wish to estimate a scalar parameter θ based on a sample x₁, x₂, …, x_T. Let θ₀ indicate the "true" value of θ. Theoretical considerations (either of statistical or economic nature) may suggest that a relationship like the following holds:

    E[xₜ − g(θ)] = 0  ⇔  θ = θ₀,   (27.1)

with g(·) a continuous and invertible function. That is to say, there exists a function of the data and the parameter with the property that it has expectation zero if and only if it is evaluated at the true parameter value. For example, economic models with rational expectations lead to expressions like (27.1) quite naturally.

If the sampling model for the xₜ's is such that some version of the Law of Large Numbers holds, then

    X̄ = (1/T) Σₜ₌₁ᵀ xₜ →p g(θ₀);

hence, since g(·) is invertible, the statistic θ̂ = g⁻¹(X̄) →p θ₀, so θ̂ is a consistent estimator of θ. A different way to obtain the same outcome is to choose, as an estimator of θ, the value that minimizes the objective function

    F(θ) = [(1/T) Σₜ₌₁ᵀ xₜ − g(θ)]² = [X̄ − g(θ)]²;   (27.2)

the minimum is trivially reached at θ̂ = g⁻¹(X̄), since the expression in square brackets equals 0.

The above reasoning can be generalized as follows: suppose θ is an n-vector and we have m relations like

    E[fᵢ(xₜ, θ)] = 0  for i = 1 … m,   (27.3)

where E[·] is a conditional expectation on a set of p variables zₜ, called the instruments. In the above simple example, m = 1 and f(xₜ, θ) = xₜ − g(θ), and the only instrument used is zₜ = 1. Then, it must also be true that

    E[fᵢ(xₜ, θ) · z_{j,t}] = E[f_{i,j,t}(θ)] = 0  for i = 1 … m and j = 1 … p;   (27.4)

equation (27.4) is known as an orthogonality condition, or moment condition. The GMM estimator is defined as the minimum of the quadratic form

    F(θ, W) = f̄ W f̄′,   (27.5)

where f̄ is a (1 × m·p) vector holding the average of the orthogonality conditions and W is some symmetric positive definite matrix, known as the weights matrix. A necessary condition for the minimum to exist is the order condition n ≤ m·p.

The statistic

    θ̂ = Argmin_θ F(θ, W)   (27.6)

is a consistent estimator of θ whatever the choice of W. However, to achieve maximum asymptotic efficiency W must be proportional to the inverse of the long-run covariance matrix of the orthogonality conditions; if W is not known, a consistent estimator will suffice.

These considerations lead to the following empirical strategy:

1. Choose a positive definite W and compute the one-step GMM estimator θ̂₁. Customary choices for W are I_{m·p} or I_m ⊗ (Z′Z)⁻¹.
2. Use θ̂₁ to estimate V(f_{i,j,t}(θ)) and use its inverse as the weights matrix. The resulting estimator θ̂₂ is called the two-step estimator.
3. Re-estimate V(f_{i,j,t}(θ)) by means of θ̂₂ and obtain θ̂₃; iterate until convergence. Asymptotically, these extra steps are unnecessary, since the two-step estimator is consistent and efficient; however, the iterated estimator often has better small-sample properties and should be independent of the choice of W made at step 1.
In the special case when the number of parameters n is equal to the total number of orthogonality conditions m·p, the GMM estimator θ̂ is the same for any choice of the weights matrix W, so the first step is sufficient; in this case, the objective function is 0 at the minimum.

If, on the contrary, n < m·p, the second step (or successive iterations) is needed to achieve efficiency, and the estimator so obtained can be very different, in finite samples, from the one-step estimator. Moreover, the value of the objective function at the minimum, suitably scaled by the number of observations, yields Hansen's J statistic; this statistic can be interpreted as a test statistic that has a χ² distribution with m·p − n degrees of freedom under the null hypothesis of correct specification. See Davidson and MacKinnon (1993, section 17.6) for details.

In the following sections we will show how these ideas are implemented in gretl through some examples.

27.2 GMM as Method of Moments

This section draws from a kind contribution by Alecos Papadopoulos, whom we thank.

A very simple illustration of GMM can be given by dropping the "G", via an example of the time-honored statistical technique known as "method of moments"; let's see how to estimate the parameters of a gamma distribution, which we also used as an example for ML estimation in section 26.4.

Assume that we have an i.i.d. sample of size T from a gamma distribution. The gamma density can be parameterized in terms of the two parameters p (shape) and θ (scale), both real and positive. (In section 26.4 we used a slightly different, perhaps more common, parametrization, employing θ = 1/α. We are switching to the shape/scale parametrization here for the sake of convenience.) In order to estimate them by the method of moments, we need two moment conditions, so that we have two equations and two unknowns (in the GMM jargon, this amounts to exact identification). The two relations we need are

    E(xᵢ) = p·θ,  V(xᵢ) = p·θ².

These will become our moment conditions; substituting the finite-sample analogues of the theoretical moments we have

    X̄ = p̂·θ̂   (27.7)
    V̂(xᵢ) = p̂·θ̂²   (27.8)

Of course, the two equations above are easy to solve analytically, giving θ̂ = V̂/X̄ and p̂ = X̄/θ̂ (V̂ being the sample variance of xᵢ), but it's instructive to see how the gmm command will solve this system of equations numerically.

We feed gretl the necessary ingredients for GMM estimation in a command block, starting with gmm and ending with end gmm. Three elements are compulsory within a gmm block:

1. one or more orthog statements;
2. one weights statement;
3. one params statement.

The three elements should be given in the stated order.

The orthog statements are used to specify the orthogonality conditions. They must follow the syntax

orthog x ; Z

where x may be a series, matrix or list of series and Z may also be a series, matrix or list. Note the structure of the orthogonality condition: it is assumed that the term to the left of the semicolon represents a quantity that depends on the estimated parameters (and so must be updated in the process of iterative estimation), while the term on the right is a constant function of the data.

The weights statement is used to specify the initial weighting matrix, and its syntax is straightforward. The params statement specifies the parameters with respect to which the GMM criterion should be minimized; it follows the same logic and rules as in the mle and nls commands.

The minimum is found through numerical minimization via BFGS (see chapters 37 and 26). The progress of the optimization procedure can be observed by appending the --verbose switch to the end gmm line.
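Before turning to the gamma example, a minimal self-contained sketch of this anatomy may be helpful. The block below estimates nothing more exotic than a sample mean, with the constant as the only instrument; the data, seed and variable names are placeholders of our own choosing.

nulldata 100
set seed 1234
series x = normal(5, 1)   # data with true mean 5
scalar mu = 0             # parameter, with a starting value
matrix W = I(1)           # initial weights matrix
series e = 0              # declare the orthogonality series

gmm
    series e = x - mu     # updated at each iteration
    orthog e ; const
    weights W
    params mu
end gmm

Since the problem is exactly identified (one condition, one parameter), the choice of W is immaterial and the estimate of mu is simply the sample mean of x.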
Equations (27.7) and (27.8) are not yet in the moment-condition form required by the gmm command. We need to transform them and arrive at something looking like E(e_{j,i} · z_{j,i}) = 0, with j = 1, 2. Therefore, we need two corresponding observable variables e₁ and e₂, with corresponding instruments z₁ and z₂, and tell gretl that Ê(e_j · z_j) = 0 must be satisfied, where we used the Ê(·) notation to indicate sample moments.

If we define the instrument as a series of ones, and set e_{1,i} = xᵢ − p·θ, then we can rewrite the first moment condition as

    Ê[(xᵢ − p·θ) · 1] = 0.

This is in the form required by the gmm command: in the required input statement

orthog e ; z

e will be the variable on the left (defined as a series) and z will be the variable to the right of the semicolon. Since z_{1,i} = 1 for all i, we can use the built-in series const for that.

For the second moment condition we have, analogously,

    Ê{[(xᵢ − X̄)² − p·θ²] · 1} = 0,

so that by setting e_{2,i} = (xᵢ − X̄)² − p·θ² and z₂ = z₁ we can rewrite the second moment condition as Ê(e_{2,i} · 1) = 0.

The weighting matrix, which is required by the gmm command, can be set to any 2 × 2 positive definite matrix, since under exact identification the choice does not matter; its dimension is determined by the number of orthogonality conditions. Therefore, we'll use the identity matrix. Example code follows:

# create an empty data set
nulldata 200

# fix a random seed
set seed 2207092

# generate a gamma random variable with, say, shape p = 3 and scale theta = 2
series x = randgen(G, 3, 2)

# declare and set some initial value for parameters p and theta
scalar p = 1
scalar theta = 1

# create the weight matrix as the identity matrix
matrix W = I(2)

# declare the series to be used in the orthogonality conditions
series e1 = 0
series e2 = 0

gmm
    scalar m = mean(x)
    series e1 = x - p*theta
    series e2 = (x - m)^2 - p*theta^2
    orthog e1 ; const
    orthog e2 ; const
    weights W
    params p theta
end gmm

The corresponding output is

Model 1: 1-step GMM, using observations 1-200

            estimate   std. error     z      p-value
  p         3.09165    0.346565     8.921    4.63e-19
  theta     1.89983    0.224418     8.466    2.55e-17

  GMM criterion: Q = 4.97341e-28 (TQ = 9.94682e-26)

If we want to use the unbiased estimator for the sample variance, we'd have to adjust the second moment condition by substituting

series e2 = (x - m)^2 - p*theta^2

with

scalar adj = $nobs / ($nobs - 1)
series e2 = adj * (x - m)^2 - p*theta^2

with the corresponding slight change in the output:

Model 1: 1-step GMM, using observations 1-200

            estimate   std. error     z      p-value
  p         3.07619    0.344832     8.921    4.63e-19
  theta     1.90937    0.225546     8.466    2.55e-17

  GMM criterion: Q = 2.80713e-28 (TQ = 5.61426e-26)

One can observe tiny improvements in the point estimates, since both moved a tad closer to the true values. This, however, is just a small-sample effect, and not something you should expect in larger samples.

27.3 OLS as GMM

Let us now move to an example that is closer to econometrics proper: the linear model yₜ = xₜβ + uₜ. Although most of us are used to reading it as the sum of a hazily defined "systematic part" plus an equally hazy "disturbance", a more rigorous interpretation of this familiar expression comes from the hypothesis that the conditional mean E(yₜ|xₜ) is linear and the definition of uₜ as yₜ − E(yₜ|xₜ).

From the definition of uₜ, it follows that E(uₜ|xₜ) = 0. The following orthogonality condition is therefore available:

    E[f(β)] = 0,   (27.9)

where f(β) = (yₜ − xₜβ)xₜ. The definitions given in section 27.1 therefore specialize here to: θ is β; the instrument is xₜ; f_{i,j,t}(θ) is (yₜ − xₜβ)xₜ = uₜxₜ, and the orthogonality condition is interpretable as the requirement that the regressors should be uncorrelated with the disturbances; W can be any symmetric positive definite matrix, since the number of parameters equals the number of orthogonality conditions. Let's say we choose I.
The function F(θ, W) is, in this case,

    F(θ, W) = [ (1/T) Σₜ₌₁ᵀ (yₜ − xₜβ̂) xₜ ]²,

and it is easy to see why OLS and GMM coincide here: the GMM objective function has the same minimizer as the objective function of OLS, the residual sum of squares. Note, however, that the two functions are not equal to one another: at the minimum, F(θ, W) = 0, while the minimized sum of squared residuals is zero only in the special case of a perfect linear fit.

The code snippet below uses gretl's gmm command to make the above operational. The series e holds the "residuals" and the series x holds the regressor. If x had been a list (or a matrix), the orthog statement would have generated one orthogonality condition for each element (or column) of x.

# initialize stuff
series e = 0
scalar beta = 0
matrix W = I(1)

# proceed with estimation
gmm
    series e = y - x*beta
    orthog e ; x
    weights W
    params beta
end gmm

27.4 TSLS as GMM

Moving closer to the proper domain of GMM, we now consider two-stage least squares (TSLS) as a case of GMM.

TSLS is employed in the case where one wishes to estimate a linear model of the form yₜ = Xₜβ + uₜ, but where one or more of the variables in the matrix X are potentially endogenous, that is, correlated with the error term u. We proceed by identifying a set of instruments Zₜ which are explanatory for the endogenous variables in X but which are plausibly uncorrelated with u. The classic two-stage procedure is (1) regress the endogenous elements of X on Z; then (2) estimate the equation of interest, with the endogenous elements of X replaced by their fitted values from (1).

An alternative perspective is given by GMM. We define the residual ûₜ as yₜ − Xₜβ̂, as usual. But instead of relying on E(u|X) = 0, as in OLS, we base estimation on the condition E(u|Z) = 0. In this case it is natural to base the initial weighting matrix on the covariance matrix of the instruments. Listing 27.1 presents a model from Stock and Watson's Introduction to Econometrics. The demand for cigarettes is modeled as a linear function of the logs of price and income; income is treated as exogenous while price is taken to be endogenous, and two measures of tax are used as instruments. Since we have two instruments and one endogenous variable, the model is over-identified.

In the GMM context, this happens when you have more orthogonality conditions than parameters to estimate. If so, asymptotic efficiency gains can be expected by iterating the procedure once or more. This is accomplished by specifying, after the end gmm statement, one of two mutually exclusive options: --two-step or --iterate, whose meaning should be obvious. Note that, when the problem is over-identified, the weights matrix will influence the solution you get from the 1- and 2-step procedures.

In cases other than one-step estimation, the specified weights matrix will be overwritten with the final weights on completion of the gmm command. If you wish to execute more than one GMM block with a common starting-point, it is therefore necessary to reinitialize the weights matrix between runs.

Partial output from this script is shown in Listing 27.2. The estimated standard errors from GMM are robust by default; if we supply the --robust option to the tsls command we get identical results. (The data file used in this example is available in the Stock and Watson package for gretl; see http://gretl.sourceforge.net/gretl_data.html.)
Listing 27.1: TSLS via GMM

open cig_ch10.gdt
# real avg price including sales tax
ravgprs = avgprs / cpi
# real avg cig-specific tax
rtax = tax / cpi
# real average total tax
rtaxs = taxs / cpi
# real average sales tax
rtaxso = rtaxs - rtax
# logs of consumption, price, income
lpackpc = log(packpc)
lravgprs = log(ravgprs)
perinc = income / (pop*cpi)
lperinc = log(perinc)
# restrict sample to 1995 observations
smpl year==1995 --restrict
# Equation (10.16) by tsls
list xlist = const lravgprs lperinc
list zlist = const rtaxso rtax lperinc
tsls lpackpc xlist ; zlist --robust

# setup for gmm
matrix Z = { zlist }
matrix W = inv(Z'Z)
series e = 0
scalar b0 = 1
scalar b1 = 1
scalar b2 = 1

gmm e = lpackpc - b0 - b1*lravgprs - b2*lperinc
    orthog e ; Z
    weights W
    params b0 b1 b2
end gmm

Listing 27.2: TSLS via GMM: partial output

Model 1: TSLS estimates using the 48 observations 1-48
Dependent variable: lpackpc
Instruments: rtaxso rtax
Heteroskedasticity-robust standard errors, variant HC0

  VARIABLE    COEFFICIENT   STDERROR   T STAT   P-VALUE
  const        9.89496      0.928758   10.654   0.00001
  lravgprs    -1.27742      0.241684   -5.286   0.00001
  lperinc      0.280405     0.245828    1.141   0.25401

Model 2: 1-step GMM estimates using the 48 observations 1-48
e = lpackpc - b0 - b1*lravgprs - b2*lperinc

  PARAMETER    ESTIMATE     STDERROR   T STAT   P-VALUE
  b0           9.89496      0.928758   10.654   0.00001
  b1          -1.27742      0.241684   -5.286   0.00001
  b2           0.280405     0.245828    1.141   0.25401

  GMM criterion = 0.0110046

27.5 Covariance matrix options

The covariance matrix of the estimated parameters depends on the choice of W through

    Σ̂ = (J′WJ)⁻¹ J′WΩWJ (J′WJ)⁻¹,   (27.10)

where J is a Jacobian term, J_{ij} = ∂f̄ᵢ/∂θⱼ, and Ω is the long-run covariance matrix of the orthogonality conditions.

Gretl computes J by numeric differentiation (there is no provision for specifying a user-supplied analytical expression for J at the moment). As for Ω, a consistent estimate is needed. The simplest choice is the sample covariance matrix of the fₜ's:

    Ω̂₀(θ) = (1/T) Σₜ₌₁ᵀ fₜ(θ) fₜ(θ)′.   (27.11)

This estimator is robust with respect to heteroskedasticity, but not with respect to autocorrelation. A heteroskedasticity- and autocorrelation-consistent (HAC) variant can be obtained using the Bartlett kernel or similar. A univariate version of this is used in the context of the lrvar() function; see equation (22.6). The multivariate version is set out in equation (27.12):

    Ω̂_k(θ) = (1/T) Σₜ₌ₖ^{T−k} [ Σᵢ₌₋ₖᵏ wᵢ fₜ(θ) f_{t−i}(θ)′ ].   (27.12)

Gretl computes the HAC covariance matrix by default when a GMM model is estimated on time-series data. You can control the kernel and the bandwidth (that is, the value of k in 27.12) using the set command; see chapter 22 for further discussion of HAC estimation. You can also ask gretl not to use the HAC version by saying

set force_hc on
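For example, kernel and bandwidth might be selected before the gmm block along the following lines; this is only a sketch, with arbitrary values, and chapter 22 documents the full range of HAC-related settings.

set hac_kernel bartlett   # or parzen, qs
set hac_lag 4             # the bandwidth, i.e. the value of k in (27.12)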
27.6 A real example: the Consumption Based Asset Pricing Model

To illustrate gretl's implementation of GMM, we will replicate the example given in chapter 3 of Hall (2005). The model to estimate is a classic application of GMM, and provides an example of a case when orthogonality conditions do not stem from statistical considerations, but rather from economic theory.

A rational individual who must allocate his income between consumption and investment in a financial asset must in fact choose the consumption path of his whole lifetime, since investment translates into future consumption. It can be shown that an optimal consumption path should satisfy the following condition:

    p U′(cₜ) = δᵏ E[ r_{t+k} U′(c_{t+k}) | Fₜ ],   (27.13)

where p is the asset price, U(·) is the individual's utility function, δ is the individual's subjective discount rate and r_{t+k} is the asset's rate of return between time t and time t + k. Fₜ is the information set at time t; equation (27.13) says that the utility "lost" at time t by purchasing the asset instead of consumption goods must be matched by a corresponding increase in the (discounted) future utility of the consumption financed by the asset's return. Since the future is uncertain, the individual considers his expectation, conditional on what is known at the time when the choice is made.

We have said nothing about the nature of the asset, so equation (27.13) should hold whatever asset we consider; hence, it is possible to build a system of equations like (27.13) for each asset whose price we observe.

If we are willing to believe that the economy as a whole can be represented as a single gigantic and immortal representative individual, and that the function U(x) = (x^α − 1)/α is a faithful representation of the individual's preferences, then, setting k = 1, equation (27.13) implies the following for any asset j:

    E[ δ (r_{j,t+1}/p_{j,t}) (C_{t+1}/C_t)^{α−1} | Fₜ ] = 1,   (27.14)

where Cₜ is aggregate consumption and α and δ are the risk aversion and discount rate of the representative individual. In this case, it is easy to see that the "deep" parameters α and δ can be estimated via GMM by using

    eₜ = δ (r_{j,t+1}/p_{j,t}) (C_{t+1}/C_t)^{α−1} − 1

as the moment condition, while any variable known at time t may serve as an instrument.

In the example code given in Listing 27.3, we replicate selected portions of table 3.7 in Hall (2005). The variable consrat is defined as the ratio of monthly consecutive real per capita consumption (services and nondurables) for the US, and ewr is the return-price ratio of a fictitious asset constructed by averaging all the stocks in the NYSE. The instrument set contains the constant and two lags of each variable.

The command set force_hc on on the second line of the script has the sole purpose of replicating the given example: as mentioned above, it forces gretl to compute the long-run variance of the orthogonality conditions according to equation (27.11) rather than (27.12).

We run gmm four times: one-step estimation for each of two initial weights matrices, then iterative estimation starting from each set of initial weights. Since the number of orthogonality conditions (5) is greater than the number of estimated parameters (2), the choice of initial weights should make a difference, and indeed we see fairly substantial differences between the one-step estimates (Models 1 and 2). On the other hand, iteration reduces these differences almost to the vanishing point (Models 3 and 4).

Part of the output is given in Listing 27.4. It should be noted that the J test leads to a rejection of the hypothesis of correct specification. This is perhaps not surprising given the heroic assumptions required to move from the microeconomic principle in equation (27.13) to the aggregate system that is actually estimated.

Listing 27.3: Estimation of the Consumption Based Asset Pricing Model

open hall.gdt
set force_hc on

scalar alpha = 0.5
scalar delta = 0.5
series e = 0
list inst = const consrat(-1) consrat(-2) ewr(-1) ewr(-2)

matrix V0 = 100000*I(nelem(inst))
matrix Z = { inst }
matrix V1 = $nobs*inv(Z'Z)

gmm e = delta*ewr*consrat^(alpha-1) - 1
    orthog e ; inst
    weights V0
    params alpha delta
end gmm

gmm e = delta*ewr*consrat^(alpha-1) - 1
    orthog e ; inst
    weights V1
    params alpha delta
end gmm

gmm e = delta*ewr*consrat^(alpha-1) - 1
    orthog e ; inst
    weights V0
    params alpha delta
end gmm --iterate

gmm e = delta*ewr*consrat^(alpha-1) - 1
    orthog e ; inst
    weights V1
    params alpha delta
end gmm --iterate
Listing 27.4: Estimation of the Consumption Based Asset Pricing Model: output

Model 1: 1-step GMM estimates using the 465 observations 1959:04-1997:12
e = delta*ewr*consrat^(alpha-1) - 1

  PARAMETER   ESTIMATE    STDERROR     T STAT    P-VALUE
  alpha       3.14475     6.84439       0.459    0.64590
  delta       0.999215    0.0121044    82.549    0.00001

  GMM criterion = 2778.08

Model 2: 1-step GMM estimates using the 465 observations 1959:04-1997:12
e = delta*ewr*consrat^(alpha-1) - 1

  PARAMETER   ESTIMATE    STDERROR     T STAT    P-VALUE
  alpha       0.398194    2.26359       0.176    0.86036
  delta       0.993180    0.00439367   226.048   0.00001

  GMM criterion = 1.4247

Model 3: Iterated GMM estimates using the 465 observations 1959:04-1997:12
e = delta*ewr*consrat^(alpha-1) - 1

  PARAMETER   ESTIMATE    STDERROR     T STAT    P-VALUE
  alpha       0.344325    2.21458       0.155    0.87644
  delta       0.991566    0.00423620   234.070   0.00001

  GMM criterion = 5491.78
  J test: Chi-square(3) = 11.8103 (p-value 0.0081)

Model 4: Iterated GMM estimates using the 465 observations 1959:04-1997:12
e = delta*ewr*consrat^(alpha-1) - 1

  PARAMETER   ESTIMATE    STDERROR     T STAT    P-VALUE
  alpha       0.344315    2.21359       0.156    0.87639
  delta       0.991566    0.00423469   234.153   0.00001

  GMM criterion = 5491.78
  J test: Chi-square(3) = 11.8103 (p-value 0.0081)

27.7 Caveats

A few words of warning are in order: despite its ingenuity, GMM is possibly the most fragile estimation method in econometrics. The number of non-obvious choices one has to make when using GMM is large, and in finite samples each of these can have dramatic consequences for the eventual output. Some of the factors that may affect the results are:

1. Orthogonality conditions can be written in more than one way: for example, if E(xₜ − μ) = 0, then E(xₜ/μ − 1) = 0 holds too. It is possible that a different specification of the moment conditions leads to different results.
2. As with all other numerical optimization algorithms, weird things may happen when the objective function is nearly flat in some directions or has multiple minima. BFGS is usually quite good, but there is no guarantee that it always delivers a sensible solution, if one at all.
3. The 1-step and, to a lesser extent, the 2-step estimators may be sensitive to apparently trivial details, like the rescaling of the instruments. Different choices for the initial weights matrix can also have noticeable consequences.
4. With time-series data, there is no hard rule on the appropriate number of lags to use when computing the long-run covariance matrix (see section 27.5). Our advice is to go by trial and error, since results may be greatly influenced by a poor choice.

One of the consequences of this state of things is that replicating well-known published studies may be extremely difficult. Any non-trivial result is virtually impossible to reproduce unless all details of the estimation procedure are carefully recorded.

Chapter 28
Model selection criteria

28.1 Introduction

In some contexts the econometrician chooses between alternative models based on a formal hypothesis test. For example, one might choose a more general model over a more restricted one if the restriction in question can be formulated as a testable null hypothesis, and the null is rejected on an appropriate test.

In other contexts one sometimes seeks a criterion for model selection that somehow measures the balance between goodness of fit, or likelihood, on the one hand, and parsimony on the other. The balancing is necessary because the addition of extra variables to a model cannot reduce the degree of fit or likelihood, and is very likely to increase it somewhat, even if the additional variables are not truly relevant to the data-generating process.

The best known such criterion, for linear models estimated via least squares, is the adjusted R²,

    R̄² = 1 − [SSR/(n − k)] / [TSS/(n − 1)],

where n is the number of observations in the sample, k denotes the number of parameters estimated, and SSR and TSS denote the sum of squared residuals and the total sum of squares for the dependent variable, respectively. Compared to the ordinary coefficient of determination, or unadjusted R²,

    R² = 1 − SSR/TSS,

the "adjusted" calculation penalizes the inclusion of additional parameters, other things equal.
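As a quick numerical check, the adjusted calculation is easy to reproduce from gretl's accessors after any least-squares estimation; the simulated regression below is just a placeholder for whatever model is at hand.

nulldata 50
set seed 303    # arbitrary
series x = normal()
series y = 1 + x + normal()
ols y const x --quiet
scalar R2adj = 1 - (1 - $rsq) * ($T - 1) / ($T - $ncoeff)
printf "adjusted R-squared, computed manually: %g\n", R2adj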
28.2 Information criteria

A more general criterion in a similar spirit is Akaike's (1974) "Information Criterion" (AIC). The original formulation of this measure is

    AIC = −2ℓ(θ̂) + 2k,   (28.1)

where ℓ(θ̂) represents the maximum log-likelihood as a function of the vector of parameter estimates, θ̂, and k (as above) denotes the number of "independently adjusted parameters within the model". In this formulation, with AIC negatively related to the likelihood and positively related to the number of parameters, the researcher seeks the minimum AIC.

The AIC can be confusing, in that several variants of the calculation are "in circulation". For example, Davidson and MacKinnon (2004) present a simplified version,

    AIC = ℓ(θ̂) − k,

which is just the original multiplied by −1/2; in this case, obviously, one wants to maximize AIC.

In the case of models estimated by least squares, the log-likelihood can be written as

    ℓ(θ̂) = −(n/2)(1 + log 2π − log n) − (n/2) log SSR.   (28.2)

Substituting (28.2) into (28.1) we get

    AIC = n(1 + log 2π − log n) + n log SSR + 2k,

which can also be written as

    AIC = n log(SSR/n) + 2k + n(1 + log 2π).   (28.3)

Some authors simplify the formula for the case of models estimated via least squares. For instance, William Greene writes

    AIC = log(SSR/n) + 2k/n.   (28.4)

This variant can be derived from (28.3) by dividing through by n and subtracting the constant 1 + log 2π. That is, writing AIC_G for the version given by Greene, we have

    AIC_G = (1/n)·AIC − (1 + log 2π).

Finally, Ramanathan gives a further variant:

    AIC_R = (SSR/n) e^{2k/n},

which is the exponential of the one given by Greene.

Gretl began by using the Ramanathan variant, but since version 1.3.1 the program has used the original Akaike formula (28.1), and more specifically (28.3) for models estimated via least squares.

Although the Akaike criterion is designed to favor parsimony, arguably it does not go far enough in that direction. For instance, if we have two nested models with k − 1 and k parameters respectively, and if the null hypothesis that parameter k equals 0 is true, in large samples the AIC will nonetheless tend to select the less parsimonious model about 16 percent of the time (see Davidson and MacKinnon, 2004, chapter 15).

An alternative to the AIC which avoids this problem is the Schwarz (1978) "Bayesian information criterion" (BIC). The BIC can be written (in line with Akaike's formulation of the AIC) as

    BIC = −2ℓ(θ̂) + k log n.

The multiplication of k by log n in the BIC means that the penalty for adding extra parameters grows with the sample size. This ensures that, asymptotically, one will not select a larger model over a correctly specified parsimonious model.

A further alternative to AIC, which again tends to select more parsimonious models than AIC, is the Hannan-Quinn criterion or HQC (Hannan and Quinn, 1979). Written consistently with the formulations above, this is

    HQC = −2ℓ(θ̂) + 2k log log n.

The Hannan-Quinn calculation is based on the law of the iterated logarithm (note that the last term is the log of the log of the sample size). The authors argue that their procedure provides a "strongly consistent estimation procedure for the order of an autoregression", and that, compared to other strongly consistent procedures, this procedure will "underestimate the order to a lesser degree."

Gretl reports the AIC, BIC and HQC (calculated as explained above) for most sorts of models. The key point in interpreting these values is to know whether they are calculated such that smaller values are better, or such that larger values are better. In gretl, smaller values are better: one wants to minimize the chosen criterion.
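The three definitions above can be verified directly against the values gretl reports, via the log-likelihood and sample-size accessors. A minimal sketch on simulated data (seed and regression are arbitrary placeholders):

nulldata 100
set seed 71    # arbitrary
series x = normal()
series y = 1 + 0.5*x + normal()
ols y const x --quiet
scalar k = $ncoeff
printf "AIC %g (gretl: %g)\n", -2*$lnl + 2*k, $aic
printf "BIC %g (gretl: %g)\n", -2*$lnl + k*log($T), $bic
printf "HQC %g (gretl: %g)\n", -2*$lnl + 2*k*log(log($T)), $hqc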
Chapter 29
Degrees of freedom correction

29.1 Introduction

This chapter gives a brief account of the issue of correction for degrees of freedom in the context of econometric modeling, leading up to a discussion of the policies adopted in gretl in this regard. We also explain how to supplement the results produced automatically by gretl if you want to apply such a correction where gretl does not, or vice versa.

The first few sections are quite basic; experts are invited to skip to section 29.5.

29.2 Back to basics

It's well known that, given a sample {xᵢ} of size n from a normally distributed population, the Maximum Likelihood (ML) estimator of the population variance σ² is

    σ̂² = (1/n) Σᵢ₌₁ⁿ (xᵢ − x̄)²,   (29.1)

where x̄ is the sample mean, (1/n) Σᵢ₌₁ⁿ xᵢ. It's also well known that σ̂², while it is a consistent estimator, is biased, and it is commonly replaced by the "sample variance", namely,

    s² = [1/(n − 1)] Σᵢ₌₁ⁿ (xᵢ − x̄)².   (29.2)

The intuition behind the bias in (29.1) is straightforward. First, the quantity we seek to estimate is defined as

    σ² = E[(xᵢ − μ)²],

where μ = E(xᵢ). It is clear that if μ were observable, a perfectly good estimator would be

    σ̃² = (1/n) Σᵢ₌₁ⁿ (xᵢ − μ)².

But this is not a practical option: μ is generally unobservable. We therefore substitute x̄ for the unknown μ. It is easily shown that x̄ is the least-squares estimator of μ, and also (assuming normality) the ML estimator. It is unbiased, but is of course subject to sampling error; in any given sample it is highly unlikely that x̄ = μ. Given that x̄ is the least-squares estimator, the sum of squared deviations of the xᵢ from any value other than x̄ must be greater than the summation in (29.1). But since μ is almost certainly not equal to x̄, the sum of squared deviations of the xᵢ from μ will surely be greater than the sum of squared deviations in (29.1). It follows that the expected value of σ̂² falls short of the population variance.

The proof that s² is indeed the unbiased estimator can be found in any good statistics textbook, where we also learn that the magnitude n − 1 in (29.2) can be brought under a general description as the "degrees of freedom" of the calculation at hand. (Given x̄, the n sample values provide only n − 1 items of information, since the n-th value can always be deduced via the formula for x̄.)

29.3 Application to OLS regression

The argument above carries over into the usual calculation of standard errors in the context of OLS regression as applied to the linear model y = Xβ + u. If the disturbances, u, are assumed to be independently and identically distributed (IID), then the variance of the OLS estimator, β̂, is given by

    Var(β̂) = σ²(X′X)⁻¹,

where σ² is the variance of the error term and X is an n × k matrix of regressors. But how should the unknown σ² be estimated? The ML estimator is

    σ̂² = (1/n) Σᵢ₌₁ⁿ ûᵢ²,   (29.3)

where the ûᵢ² are squared residuals, ûᵢ = yᵢ − Xᵢβ̂. But this estimator is biased, and we typically use the unbiased counterpart

    s² = [1/(n − k)] Σᵢ₌₁ⁿ ûᵢ²,   (29.4)

in which n − k is the number of degrees of freedom given n residuals from a regression where k parameters are estimated.

The standard estimator of the variance of β̂ in the context of OLS is then V̂ = s²(X′X)⁻¹, and the standard errors of the individual parameter estimates, s_β̂ᵢ, being the square roots of the diagonal elements of V̂, inherit a degrees of freedom correction from the estimator s².

Going one step further, consider hypothesis testing in the context of OLS. Since the variance of β̂ is unknown and must itself be estimated, the sampling distribution of the OLS coefficients is not, strictly speaking, normal. But if the disturbances are normally distributed (besides being IID), then even in small samples the parameter estimates will follow a distribution that can be specified exactly, namely the Student t distribution with degrees of freedom equal to the value given above, ν = n − k.
That is, besides using a df correction in computing the standard errors of the OLS coefficients, one uses the same ν in selecting the particular distribution to which the "t-ratio", (β̂ᵢ − β⁰)/s_β̂ᵢ, should be referred in order to determine the marginal significance level, or p-value, for the null hypothesis that βᵢ = β⁰. This is the payoff to df correction: we get test statistics that follow a known distribution in small samples. (In big enough samples the point is moot, since the quantitative distinction between σ̂² and s² vanishes.)

So far, so good. Everyone expects df correction in plain OLS standard errors, just as we expect division by n − 1 in the sample variance. And users of econometric software expect that the p-values reported for OLS coefficients will be based on the t(ν) distribution, although they are not always sufficiently aware that the validity of such statistics in small samples depends on the assumption of normally distributed errors.

29.4 Beyond OLS

The situation is different when we move beyond estimation of the classical linear model via OLS. We may wish to estimate nonlinear models (sometimes by least squares), and many models of interest to econometricians are commonly estimated via maximization of a likelihood function, or via the generalized method of moments (GMM).

In such cases we do not, in general, have exact small-sample results to rely upon; in particular, we cannot assume that coefficient estimates follow the t distribution. Rather, we typically appeal to asymptotic results in statistical theory. We seek consistent estimators which, although they may be biased, nonetheless converge in probability to the corresponding parameter values as the sample size goes to infinity. Under the right conditions, laws of large numbers and central limit theorems entitle us to expect that test statistics will converge to the normal distribution, or the χ² distribution for multivariate tests, given big enough samples.

To "correct" or not?

The question arises: should we, or should we not, apply a df "correction" in reporting variance estimates and standard errors for models that depart from the classical linear specification?

The argument against applying df adjustment is that it lacks a theoretical basis: it does not produce test statistics that follow any known distribution in small samples. In addition, if parameter estimates are obtained via ML, it makes sense to report ML estimates of variances, even if these are biased, since it is the ML quantities that are used in computing the criterion function and in forming likelihood-ratio tests.

On the other hand, pragmatic arguments for doing df adjustment are (a) that it makes for closer comparability between regular OLS estimates and nonlinear ones, and (b) that it provides a "pinch of salt" in relation to small-sample results, that is, it inflates standard errors, confidence intervals and p-values somewhat, even if it lacks rigorous justification.

Note that even for fairly small samples, the difference between the biased and unbiased estimators in equations (29.1) and (29.2) above will be small. For example, if n = 30 then s² = (30/29)σ̂². In econometric modelling proper, however, the difference can be quite substantial. If n = 50 and k = 10, the s² defined in (29.4) will be 50/40 = 1.25 times as large as the σ̂² in (29.3), and standard errors will be about 12 percent larger. (A fairly typical situation in time-series macroeconometrics would be to have between 100 and 200 quarterly observations and to be estimating up to maybe 30 parameters, including lags; in this case df correction would make a difference to standard errors on the order of 10 percent.)
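The n/(n − 1) relation just mentioned is easy to confirm numerically, given that gretl's var() function applies the n − 1 divisor; a minimal sketch on simulated data (seed arbitrary):

nulldata 30
set seed 123    # arbitrary
series x = normal()
scalar n = $nobs
scalar s2_ml = sum((x - mean(x))^2) / n   # ML estimator, as in (29.1)
scalar s2 = var(x)                        # sample variance, as in (29.2)
printf "s2/s2_ml = %g, n/(n-1) = %g\n", s2/s2_ml, n/(n-1)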
One can make a case for inflating the standard errors obtained via nonlinear estimators as a precaution against taking results to be "more precise than they really are". In rejoinder to the last point, one might equally say that savvy econometricians should know to apply a discount factor (albeit an imprecise one) to small-sample estimates outside of the classical normal linear model, or even that they should distrust such results and insist on large samples before making inferences. This line of thinking suggests that test statistics such as z = β̂ᵢ/σ̂_β̂ᵢ should be referred to the distribution to which they conform asymptotically (in this case N(0, 1), for H₀: βᵢ = 0) if and only if the conditions for appealing to asymptotic results can be considered as met. From this point of view, df adjustment may be seen as providing a false sense of security.

29.5 Consistency and awkward cases

Consistency (in the ordinary sense of uniformity of treatment) is a bugbear when dealing with this issue. To give a simple example, suppose an econometrics program follows the policy of applying df correction for OLS estimation, but not for ML estimation. One is, of course, free to estimate the classical normal linear model via ML, in which case β̂ should be numerically identical to that obtained via OLS. But the user of the software will obtain two different sets of standard errors depending on the estimation method. Admittedly, this example is not very troublesome; presumably one would apply ML to the classical linear model only to make a pedagogical point.

Here is a more awkward case. An unrestricted vector autoregression (VAR) is a system of equations, but the ML estimate of this system, given normal errors, is equivalent to equation-by-equation OLS. Should df correction be applied to VARs? Consistency with OLS argues Yes. However, a popular extension of the VAR methodology is the vector error-correction model (VECM). VECMs are closely related to VARs, and one might well be interested in making comparisons across the two, but a VECM is a nonlinear system and the cointegrating vectors that lie at the heart of this model must be estimated via Maximum Likelihood. So perhaps VAR results should not be df adjusted, for comparability with VECMs.

Another "grey area" is the class of Feasible Generalized Least Squares (FGLS) estimators: for example, weighted least squares following the estimation of a skedastic function, or estimators designed to handle first-order autocorrelation, such as Cochrane-Orcutt. These depart from the classical linear model, and the theoretical basis for inference in such models is asymptotic, yet according to econometric tradition standard errors are generally df adjusted.

Yet another awkward case: "robust" (heteroskedasticity- and/or autocorrelation-consistent) standard errors in the context of OLS. Such estimators are justified by asymptotic arguments, and in general we cannot determine their small-sample distributions. That would argue for referring the associated test statistics to the normal distribution. But comparability with classical standard errors pulls in the other direction. Suppose in a particular case a robust estimator produces a standard error that is numerically indistinguishable from the classical one: if the former is referred to the normal distribution and the latter to the t distribution, switching to robust standard errors will give a smaller p-value for the coefficient in question, making it appear "more significant", and arguably this is misleading.
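The size of this discrepancy is easy to gauge with the pvalue() function; for example, with a test statistic of 2.0 and 20 degrees of freedom (both values arbitrary), the t-based two-sided p-value is necessarily the larger of the two:

scalar stat = 2.0
scalar df = 20
printf "two-sided p-value: t(%d) %g, N(0,1) %g\n", df, 2*pvalue(t, df, stat), 2*pvalue(z, stat)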
29.6 What gretl does

First of all, the third column in gretl model output (following "coefficient" and "std. error") is labeled either "t-ratio" or "z". This is your signal: "t-ratio" indicates that the estimated standard error employs a degrees of freedom adjustment and the reported p-value is obtained from the Student t distribution, while "z" indicates that no such adjustment is applied and the p-value comes from the standard normal distribution.

If you see that gretl is applying a df adjustment but you don't want this, the first point to check is whether you can switch to the asymptotic variant by using an option flag or other command. The ols and tsls commands support a --no-df-corr option to suppress degrees of freedom adjustment. In the case of Two-Stage Least Squares it's certainly arguable that df correction should not be performed by default; however, gretl does this largely for comparability with other software (for example, Stata's ivreg command). But you can override the default if you wish. The estimate command for systems of equations also supports the --no-df-corr option when the specified estimation method is OLS or TSLS. (For other estimators supported by gretl's system command, no df adjustment is applied by default.)

By default gretl uses the t distribution for statistics based on robust standard errors under OLS. However, users can specify that p-values be calculated using the standard normal distribution whenever the --robust option is passed to an estimation command, by means of the following set command:

set robust_z on

If these possibilities do not apply, it is fairly straightforward to purge regression results of df correction, as illustrated in the following script fragment. We assume that a model has just been estimated, so that the model-related accessors ($stderr, $coeff and so on) are available.

matrix se = $stderr * sqrt($df/$T)
matrix zscore = $coeff ./ se
matrix pv = 2 * pvalue(z, abs(zscore))
matrix M = $coeff ~ se ~ zscore ~ pv
cnameset(M, "coeff stderr z p-value")
print M

This will print the original coefficient estimates along with asymptotic standard errors, and the associated z-scores and two-sided normal p-values. The converse case is left as an exercise for the reader.

VARs

As mentioned above, Vector Autoregressions constitute a particularly awkward case, with considerations of consistency of treatment pulling in two opposite directions. For that reason gretl has adopted an "agnostic" policy in relation to such systems. We do not offer a $vcv accessor, but instead accessors named $xtxinv (the matrix (X′X)⁻¹ for the system as a whole) and $sigma (an estimate of the cross-equation variance-covariance matrix, Σ). It's then up to the user to build an estimate of the variance matrix of the parameter estimates (call it V), should that be required. Note that $sigma gives the Maximum Likelihood Estimator, without a degrees of freedom adjustment, so if you do

matrix Vml = $sigma ** $xtxinv

where ** represents Kronecker product, you obtain the MLE of the variance matrix of the parameter estimates. But if you want the unbiased estimator you can do

matrix S = $sigma * $T/($T - $ncoeff)
matrix Vu = S ** $xtxinv

to employ a suitably inflated variant of the Σ estimate. (For VARs, and also VECMs, $ncoeff gives the number of coefficients per equation.) The second variant above is such that the vector of standard errors produced by

matrix SE = sqrt(diag(Vu))

agrees with the standard errors printed as part of the per-equation VAR output.
A fuller example of usage of the $xtxinv accessor is given in Listing 29.1: this shows how one can replicate the F-tests for Granger causality that are displayed by default by the var command, with the refinement that, depending on the setting of the USE_F flag, these tests can be done using a small-sample correction (as in gretl's output) or in asymptotic (χ²) form.

Vector Error Correction Models are more complex than VARs in this respect, since we employ Johansen's variance estimator for the β terms. This means, for example, that the $xtxinv accessor treats each estimated error-correction (EC) term as one regressor on its own, such that the sampling uncertainty of the "loading" coefficients is thereby addressed after Kronecker-multiplying with $sigma as before. The internals of the EC terms are of course made up of the integrated ("levels") variables, and the special $jvbeta accessor is responsible for the variance of the cointegration coefficients, where degrees-of-freedom corrections are not available.

But as soon as the loading coefficients attached to the EC terms are restricted, there is no common set of regressors with freely varying coefficients in the VECM system anymore, and therefore in these cases the formulas above are misleading. The $xtxinv accessor can still be retrieved (because it does not involve the coefficients), but in the restricted-α case it should no longer be used as shown above. The notion of system degrees of freedom then also becomes fuzzier, since the number of regressors can vary across equations.

Listing 29.1: Computing statistics to test for Granger causality

open denmark.gdt
list LST = LRM LRY IBO IDE
scalar p = 2      # lags in VAR
scalar USE_F = 1  # small-sample correction?

var p LST --quiet
k = nelem(LST)
matrix theta = vec($coeff)
matrix V = $sigma ** $xtxinv
if USE_F
    scalar df = $T - $ncoeff
    V *= $T/df
endif

matrix GC = zeros(k, k)
cnameset(GC, LST)
rnameset(GC, LST)
matrix idx = seq(1,p) + 1
loop i = 1..k
    loop j = 1..k
        GC[i,j] = qform(theta[idx]', invpd(V[idx,idx]))
        idx += (j==k) ? p+1 : p
    endloop
endloop

if USE_F
    GC /= p
    matrix pvals = pvalue(F, p, df, GC)
else
    matrix pvals = pvalue(X, p, GC)
endif
cnameset(pvals, LST)
rnameset(pvals, LST)
print GC pvals

Chapter 30
Time series filters

In addition to the usual application of lags and differences, gretl provides fractional differencing and various filters commonly used in macroeconomics for trend-cycle decomposition: notably the Hodrick-Prescott filter (Hodrick and Prescott, 1997), the Baxter-King band-pass filter (Baxter and King, 1999) and the Butterworth filter (Butterworth, 1930).

30.1 Fractional differencing

The concept of differencing a time series d times is pretty obvious when d is an integer; it may seem odd when d is fractional. However, this idea has a well-defined mathematical content: consider the function

    f(z) = (1 − z)^{−d},

where z and d are real numbers. By taking a Taylor series expansion around z = 0, we see that

    f(z) = 1 + dz + [d(d + 1)/2] z² + …

or, more compactly,

    f(z) = 1 + Σᵢ₌₁^∞ ψᵢ zⁱ,

with

    ψ_k = [Πᵢ₌₁ᵏ (d + i − 1)] / k! = ψ_{k−1} (d + k − 1)/k.

The same expansion can be used with the lag operator, so that if we defined

    Yₜ = (1 − L)^{0.5} Xₜ,

this could be considered shorthand for

    Yₜ = Xₜ − 0.5 X_{t−1} − 0.125 X_{t−2} − 0.0625 X_{t−3} − …

In gretl this transformation can be accomplished by the syntax

series Y = fracdiff(X, 0.5)

30.2 The Hodrick-Prescott filter

This filter is accessed using the hpfilt() function, which takes as its first argument the name of the variable to be processed. (Further optional arguments are explained below.)

A time series yₜ may be decomposed into a trend (or "growth") component gₜ and a cyclical component cₜ:

    yₜ = gₜ + cₜ,  t = 1, 2, …, T.

The Hodrick-Prescott filter effects such a decomposition by minimizing

    Σₜ₌₁ᵀ (yₜ − gₜ)² + λ Σₜ₌₂^{T−1} [(g_{t+1} − gₜ) − (gₜ − g_{t−1})]².

The first term above is the sum of squared cyclical components cₜ = yₜ − gₜ. The second term is a multiple λ of the sum of squares of the trend component's second differences.
This second term penalizes variations in the growth rate of the trend component: the larger the value of λ, the higher is the penalty, and hence the smoother the trend series.

Note that the hpfilt function in gretl produces the cyclical component, cₜ, of the original series. If you want the smoothed trend you can subtract the cycle from the original:

series ct = hpfilt(yt)
series gt = yt - ct

Hodrick and Prescott (1997) suggest that a value of λ = 1600 is reasonable for quarterly data. The default value in gretl is 100 times the square of the data frequency (which, of course, yields 1600 for quarterly data). The value can be adjusted using an optional second argument, as in

series ct = hpfilt(yt, 1300)

As of version 2018a, the function accepts a third, optional, Boolean argument. If set to non-zero, what is performed is the so-called one-sided version of the filter; see section 36.12 for further details.

30.3 The Baxter and King filter

This filter is accessed using the bkfilt() function, which again takes the name of the variable to be processed as its first argument. The operation of the filter can be controlled via three further optional arguments.

Consider the spectral representation of a time series yₜ:

    yₜ = ∫_{−π}^{π} e^{iω} dZ(ω).

To extract the component of yₜ that lies between the frequencies ω₁ and ω₂, one could apply a band-pass filter:

    cₜ = ∫_{−π}^{π} F*(ω) e^{iω} dZ(ω),

where F*(ω) = 1 for ω₁ < |ω| < ω₂ and 0 elsewhere. This would imply, in the time domain, applying to the series a filter with an infinite number of coefficients, which is undesirable. The Baxter and King band-pass filter applies to yₜ a finite polynomial in the lag operator A(L):

    cₜ = A(L) yₜ,

where A(L) is defined as

    A(L) = Σᵢ₌₋ₖᵏ aᵢ Lⁱ.

The coefficients aᵢ are chosen such that F(ω) = A(e^{iω}) A(e^{−iω}) is the best approximation to F*(ω) for a given k. Clearly, the higher k, the better the approximation, but since 2k observations have to be discarded, a compromise is usually sought. Moreover, the filter has other appealing theoretical properties, among which the property that A(1) = 0, so a series with a single unit root is made stationary by application of the filter.

In practice, the filter is normally used with monthly or quarterly data to extract the "business cycle" component, namely the component between 6 and 36 quarters. Usual choices for k are 8 or 12 (maybe higher for monthly series). The default values for the frequency bounds are 8 and 32, and the default value for the approximation order, k, is 8. You can adjust these values using the full form of bkfilt, which is

bkfilt(seriesname, f1, f2, k)

where f1 and f2 represent the lower and upper frequency bounds, respectively.
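For instance, under the "business cycle" settings just mentioned, one might extract the component of a quarterly series y (a placeholder name) between 6 and 36 quarters, with approximation order 12, as follows:

series cycle = bkfilt(y, 6, 36, 12)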
30.4 The Butterworth filter

The Butterworth filter (Butterworth, 1930) is an approximation to an "ideal" square-wave filter. The ideal filter divides the spectrum of a time series into a pass-band (frequencies less than some chosen ω* for a low-pass filter, or frequencies greater than ω* for high-pass) and a stop-band; the gain is 1 for the pass-band and 0 for the stop-band. The ideal filter is unattainable in practice, since it would require an infinite number of coefficients, but the Butterworth filter offers a remarkably good approximation. This filter is derived and persuasively advocated by Pollock (2000).

For data y, the filtered sequence x is given by

    x = y − λ Σ Q (M + λ Q′ Σ Q)⁻¹ Q′ y,   (30.1)

where Σ = 2I − (L + L′) and M = 2I + (L + L′), of conformable orders. Here I denotes the identity matrix, L the finite-sample matrix version of the lag operator (with ones on the first subdiagonal and zeros elsewhere), and Q is defined such that pre-multiplication of a T-vector of data by Q′, of order (T − 2) × T, produces the second differences of the data. The matrix product Q′ΣQ is a Toeplitz matrix.

The behavior of the Butterworth filter is governed by two parameters: the frequency cutoff ω* and an integer order, n, which determines the number of coefficients used. The λ that appears in (30.1) is tan(ω*/2)^{−2n}. Higher values of n produce a better approximation to the ideal filter in principle (i.e., a sharper cut between the pass-band and the stop-band), but there is a downside: with a greater number of coefficients, numerical instability may be an issue, and the influence of the initial values in the sample may be exaggerated.

In gretl the Butterworth filter is implemented by the bwfilt() function, which takes three arguments: the series to filter, the order n and the frequency cutoff, ω*, expressed in degrees. The cutoff value must be greater than 0 and less than 180. This function operates as a low-pass filter; for the high-pass variant, subtract the filtered series from the original, as in

series bwcycle = y - bwfilt(y, 8, 67)

(The code for this filter is based on D. S. G. Pollock's programs IDEOLOG and DETREND. The Pascal source code for the former is available from http://www.le.ac.uk/users/dsgp1 and the C sources for the latter were kindly made available to us by the author.)

Pollock recommends that the parameters of the Butterworth filter be tuned to the data: one should examine the periodogram of the series in question (possibly after removal of a polynomial trend) in search of a "dead spot" of low power between the frequencies one wishes to exclude and the frequencies one wishes to retain. If ω* is placed in such a dead spot, then the job of separation can be done with a relatively small n, hence avoiding numerical problems. By way of illustration, consider the periodogram for quarterly observations on new car sales in the US (the variable QNC from the Ramanathan data file data9-7), 1975:1 to 1990:4 (the upper panel in Figure 30.1).

[Figure 30.1: The Butterworth filter applied. Upper panel: periodogram of QNC (frequency axis in degrees). Lower panel: QNC, original data and smoothed version, 1976-1990, together with the gain of the chosen filter.]

A seasonal pattern is clearly visible in the periodogram, centered at an angle of 90 degrees, or 4 periods. If we set ω* = 68 (or thereabouts) we should be able to excise the seasonality quite cleanly using n = 8. The result is shown in the lower panel of the Figure, along with the frequency response or gain plot for the chosen filter. Note the smooth and reasonably steep drop-off in gain centered on the nominal cutoff of 68 degrees, approximately 3π/8.

The apparatus that supports this sort of analysis in the gretl GUI can be found under the Variable menu in the main window: the items Periodogram and Filter. In the periodogram dialog box you have the option of expressing the frequency axis in degrees, which is helpful when selecting a Butterworth filter; and in the Butterworth filter dialog you have the option of plotting the frequency response as well as the smoothed series and/or the residual or cycle.

30.5 The discrete Fourier transform

The Fourier transform is not itself a time-series filter, but by providing the bridge between the time and the frequency domains it is a fundamental building block of many filter internals, and deserves some detailed comments.

The discrete Fourier transform can be best thought of as a linear, invertible transform of a complex vector. Hence, if x is an n-dimensional vector whose k-th element is x_k = a_k + i·b_k, then the output of the discrete Fourier transform is a vector f = F(x) whose k-th element is

    f_k = Σ_{j=0}^{n−1} e^{−i ω(j,k)} x_j,

where ω(j,k) = 2π·jk/n.
30.5 The discrete Fourier transform

The Fourier transform is not itself a time-series filter, but by providing the bridge between the time and the frequency domain it is a fundamental building block of many filter internals, and deserves some detailed comments.

The discrete Fourier transform can be best thought of as a linear, invertible transform of a complex vector. Hence, if x is an n-dimensional vector whose k-th element is x_k = a_k + i b_k, then the output of the discrete Fourier transform is a vector f = F(x) whose k-th element is

    f_k = Σ_{j=0}^{n−1} e^{−i ω_{jk}} x_j

where ω_{jk} = 2π jk / n. Since the transformation is invertible, the vector x can be recovered from f via the so-called inverse transform

    x_k = (1/n) Σ_{j=0}^{n−1} e^{i ω_{jk}} f_j

The Fourier transform is used in many diverse situations on account of this key property: the convolution of two vectors can be performed efficiently by multiplying the elements of their Fourier transforms and inverting the result. If

    z_k = Σ_{j=1}^{n} x_j y_{k−j}

then F(z) = F(x) ⊙ F(y); that is, F(z)_k = F(x)_k F(y)_k.

For computing the Fourier transform, gretl uses the external library fftw3 (see Frigo and Johnson, 2005). This guarantees extreme speed and accuracy. In fact, the CPU time needed to perform the transform is O(n log n) for any n. This is why the array of numerical techniques employed in fftw3 is commonly known as the Fast Fourier Transform.

Gretl provides two matrix functions for performing the Fourier transform and its inverse: fft2 and ffti.³ For example:

    matrix x1 = {1; 2; 3}
    # perform the transform
    matrix f = fft2(x1)
    # perform the inverse transform
    matrix x2 = ffti(f)

yields

    x1 = (1, 2, 3)′
    f  = (6, −1.5 + 0.866i, −1.5 − 0.866i)′
    x2 = (1, 2, 3)′

Should it be necessary to compute the Fourier transform on several vectors with the same number of elements, it is numerically more efficient to group them into a matrix rather than invoking fft2 for each vector separately.

As an example, consider the multiplication of two polynomials:

    a(x) = 1 + 0.5x
    b(x) = 1 + 0.3x − 0.8x²
    c(x) = a(x) · b(x) = 1 + 0.8x − 0.65x² − 0.4x³

The coefficients of the polynomial c(x) are the convolution of the coefficients of a(x) and b(x); the following gretl code fragment illustrates how to compute the coefficients of c(x):

    # define the two polynomials
    a = { 1, 0.5, 0, 0 }
    b = { 1, 0.3, -0.8, 0 }
    # perform the transforms
    fa = fft2(a')
    fb = fft2(b')
    # multiply the two transforms element by element
    fc = fa .* fb
    # compute the coefficients of c via the inverse transform
    c = ffti(fc)

Maximum efficiency would have been achieved by grouping a and b into a matrix. The computational advantage is so little in this case that the exercise is a bit silly, but the following alternative may be preferable for a large number of rows/columns:

    # define the two polynomials
    a = { 1, 0.5, 0, 0 }
    b = { 1, 0.3, -0.8, 0 }
    # perform the transforms jointly
    f = fft2(a' ~ b')
    # complex-multiply the two transforms
    fc = f[,1] .* f[,2]
    # compute the coefficients of c via the inverse transform
    c = ffti(fc)

Traditionally, the Fourier transform in econometrics has been mostly used in time-series analysis, the periodogram being the best-known example. Listing 30.1 shows how to compute the periodogram of a time series via the fft2 function.

Listing 30.1: Periodogram via the Fourier transform

    set verbose off
    nulldata 50
    # generate an AR(1) process
    series e = normal()
    series x = 0
    x = 0.9*x(-1) + e
    # compute the periodogram
    F = fft2({x})   # note that the series is turned into a matrix on the fly
    S = abs(F).^2
    S = S[2:$nobs/2+1] / (2*$pi*$nobs)
    sfreq = seq(1, $nobs/2)'
    omega = sfreq .* (2*$pi/$nobs)
    period = $nobs ./ sfreq
    omega = omega ~ sfreq ~ period ~ S
    # compare the built-in command
    pergm x
    print omega

³ The same functionality is available via the legacy function fft, which predates gretl's native support of complex matrices. It is more limited than fft2, as the input is understood to be real. It returns the real and imaginary parts of the result in separate columns. The fft function is kept for backward compatibility, but for new scripts it is recommended to use the newer function fft2 instead. The inverse function ffti supports both representations.
Chapter 31. Univariate time series models

31.1 Introduction

Time series models are discussed in this chapter and the next two. Here we concentrate on ARIMA models, unit root tests and GARCH. The following chapter deals with VARs, and chapter 33 with cointegration and error correction.

31.2 ARIMA models

Representation and syntax

The arma command performs estimation of AutoRegressive Integrated Moving Average (ARIMA) models. These are models that can be written in the form

    φ(L) y_t = θ(L) ε_t    (31.1)

where φ(L) and θ(L) are polynomials in the lag operator, L, defined such that L^n x_t = x_{t−n}, and ε_t is a white noise process. The exact content of y_t, of the AR polynomial φ(), and of the MA polynomial θ(), will be explained in the following.

Mean terms

The process y_t as written in equation (31.1) has, without further qualifications, mean zero. If the model is to be applied to real data, it is necessary to include some term to handle the possibility that y_t has non-zero mean. There are two possible ways to represent processes with non-zero mean: one is to define µ_t as the unconditional mean of y_t, namely the central value of its marginal distribution. Therefore, the series ỹ_t = y_t − µ_t has mean 0, and the model (31.1) applies to ỹ_t. In practice, assuming that µ_t is a linear function of some observable variables x_t, the model becomes

    φ(L)(y_t − x_t β) = θ(L) ε_t    (31.2)

This is sometimes known as a "regression model with ARMA errors"; its structure may be more apparent if we represent it using two equations:

    y_t = x_t β + u_t
    φ(L) u_t = θ(L) ε_t

The model just presented is also sometimes known as "ARMAX" (ARMA + eXogenous variables). It seems to us, however, that this label is more appropriately applied to a different model: another way to include a mean term in (31.1) is to base the representation on the conditional mean of y_t, that is the central value of the distribution of y_t given its own past. Assuming, again, that this can be represented as a linear combination of some observable variables z_t, the model would expand to

    φ(L) y_t = z_t γ + θ(L) ε_t    (31.3)

The formulation (31.3) has the advantage that γ can be immediately interpreted as the vector of marginal effects of the z_t variables on the conditional mean of y_t. And by adding lags of z_t to this specification one can estimate Transfer Function models (which generalize ARMA by adding the effects of exogenous variables distributed across time).

Gretl provides a way to estimate both forms. Models written as in (31.2) are estimated by maximum likelihood; models written as in (31.3) are estimated by conditional maximum likelihood. (For more on these options, see the section on "Estimation" below.)

In the special case when x_t = z_t = 1 (that is, the models include a constant but no exogenous variables), the two specifications discussed above reduce to

    φ(L)(y_t − µ) = θ(L) ε_t    (31.4)

and

    φ(L) y_t = α + θ(L) ε_t    (31.5)

respectively. These formulations are essentially equivalent, but if they represent one and the same process, µ and α are, fairly obviously, not numerically identical; rather

    α = (1 − φ₁ − … − φ_p) µ

The gretl syntax for estimating (31.4) is simply

    arma p q ; y

The AR and MA lag orders, p and q, can be given either as numbers or as pre-defined scalars. The parameter µ can be dropped if necessary by appending the option --nc ("no constant") to the command. If estimation of (31.5) is needed, the switch --conditional must be appended to the command, as in

    arma p q ; y --conditional

Generalizing this principle to the estimation of (31.2) or (31.3), you get that

    arma p q ; y const x1 x2

would estimate the following model:

    y_t − x_t β = φ₁ (y_{t−1} − x_{t−1} β) + … + φ_p (y_{t−p} − x_{t−p} β) + ε_t + θ₁ ε_{t−1} + … + θ_q ε_{t−q}

where, in this instance, x_t β = β₀ + β₁ x_{1,t} + β₂ x_{2,t}. Appending the --conditional switch, as in

    arma p q ; y const x1 x2 --conditional

would estimate the following model:

    y_t = x_t γ + φ₁ y_{t−1} + … + φ_p y_{t−p} + ε_t + θ₁ ε_{t−1} + … + θ_q ε_{t−q}
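To make the distinction concrete, here is a hypothetical comparison of the two treatments of the mean, assuming a dataset with series y, x1 and x2:

    arma 1 1 ; y const x1 x2                # eq. (31.2): regression with ARMA errors, exact ML
    arma 1 1 ; y const x1 x2 --conditional  # eq. (31.3): ARMAX form, conditional ML

The φ and θ estimates should be broadly similar across the two runs, but the coefficients on the constant and on x1 and x2 answer different questions, as explained above.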
Ideally, the issue broached above could be made moot by writing a more general specification that nests the alternatives; that is,

    φ(L)(y_t − x_t β) = z_t γ + θ(L) ε_t    (31.6)

We would like to generalize the arma command so that the user could specify, for any estimation method, whether certain exogenous variables should be treated as x_t's or z_t's, but we're not at that point yet (and neither are most other software packages).

Seasonal models

A more flexible lag structure is desirable when analyzing time series that display strong seasonal patterns. Model (31.1) can be expanded to

    φ(L) Φ(L^s) y_t = θ(L) Θ(L^s) ε_t    (31.7)

For such cases, a fuller form of the syntax is available, namely

    arma p q ; P Q ; y

where p and q represent the non-seasonal AR and MA orders, and P and Q the seasonal orders. For example,

    arma 1 1 ; 1 1 ; y

would be used to estimate the following model:

    (1 − φL)(1 − ΦL^s)(y_t − µ) = (1 + θL)(1 + ΘL^s) ε_t

If y_t is a quarterly series (and therefore s = 4), the above equation can be written more explicitly as

    y_t = µ + φ(y_{t−1} − µ) + Φ(y_{t−4} − µ) − φΦ(y_{t−5} − µ) + ε_t + θε_{t−1} + Θε_{t−4} + θΘε_{t−5}

Such a model is known as a "multiplicative seasonal ARMA model".

Gaps in the lag structure

The standard way to specify an ARMA model in gretl is via the AR and MA orders, p and q respectively. In this case all lags from 1 to the given order are included. In some cases one may wish to include only certain specific AR and/or MA lags. This can be done in either of two ways:

• One can construct a matrix containing the desired lags (positive integer values) and supply the name of this matrix in place of p or q.
• One can give a comma-separated list of lags, enclosed in braces, in place of p or q.

The following code illustrates these options:

    matrix pvec = {1, 4}
    arma pvec 1 ; y
    arma {1,4} 1 ; y

Both forms above specify an ARMA model in which AR lags 1 and 4 are used, but not 2 and 3. This facility is available only for the non-seasonal component of the ARMA specification.

Differencing and ARIMA

The above discussion presupposes that the time series y_t has already been subjected to all the transformations deemed necessary for ensuring stationarity (see also section 31.3). Differencing is the most common of these transformations, and gretl provides a mechanism to include this step into the arma command: the syntax

    arma p d q ; y

would estimate an ARMA(p, q) model on Δ^d y_t. It is functionally equivalent to

    series tmp = y
    loop i=1..d
        tmp = diff(tmp)
    endloop
    arma p q ; tmp

except with regard to forecasting after estimation (see below).

When the series y_t is differenced before performing the analysis, the model is known as ARIMA ("I" for Integrated); for this reason, gretl provides the arima command as an alias for arma.

Seasonal differencing is handled similarly, with the syntax

    arma p d q ; P D Q ; y

where D is the order for seasonal differencing. Thus, the command

    arma 1 0 0 ; 1 1 1 ; y

would produce the same parameter estimates as

    series dsy = sdiff(y)
    arma 1 0 ; 1 1 ; dsy

where we use the sdiff function to create a seasonal difference (e.g. for quarterly data, y_t − y_{t−4}).
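For instance, a hypothetical "airline"-style specification for a monthly series y, combining first and seasonal differencing with one non-seasonal and one seasonal MA term, would be

    arima 0 1 1 ; 0 1 1 ; y

in which d = D = 1, and the forecasts produced after estimation refer to the level of y, as discussed under "Forecasting" below.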
In specifying an ARIMA model with exogenous regressors, we face a choice which relates back to the discussion of the variant models (31.2) and (31.3) above. If we choose model (31.2), the "regression model with ARMA errors", how should this be extended to the case of ARIMA? The issue is whether or not the differencing that is applied to the dependent variable should also be applied to the regressors. Consider the simplest case, ARIMA with non-seasonal differencing of order 1. We may estimate either

    φ(L)(1 − L)(y_t − X_t β) = θ(L) ε_t    (31.8)

or

    φ(L) [(1 − L) y_t − X_t β] = θ(L) ε_t    (31.9)

The first of these formulations can be described as a regression model with ARIMA errors, while the second preserves the levels of the X variables. As of gretl version 1.8.6, the default model is (31.8), in which differencing is applied to both y_t and X_t. However, when using the default estimation method (native exact ML, see below), the option --y-diff-only may be given, in which case gretl estimates (31.9).¹

Estimation

The default estimation method for ARMA models is exact maximum likelihood estimation (under the assumption that the error term is normally distributed), using a variety of techniques; the main algorithm for evaluating the log-likelihood is AS 197, by Mélard (1984). Maximization is performed via BFGS and the score is approximated numerically. This method produces results that are directly comparable with many other software packages. The constant, and any exogenous variables, are treated as in equation (31.2). The covariance matrix for the parameters is computed using a numerical approximation to the Hessian at convergence.

The alternative method, invoked with the --conditional switch, is conditional maximum likelihood (CML), also known as "conditional sum of squares" (see Hamilton, 1994, p. 132). This method was exemplified in the script 13.3, and only a brief description will be given here. Given a sample of size T, the CML method minimizes the sum of squared one-step-ahead prediction errors generated by the model for the observations t₀, …, T. The starting point t₀ depends on the orders of the AR polynomials in the model. The numerical maximization method used is BHHH, and the covariance matrix is computed using a Gauss-Newton regression.

The CML method is nearly equivalent to maximum likelihood under the hypothesis of normality; the difference is that the first (t₀ − 1) observations are considered fixed and only enter the likelihood function as conditioning variables. As a consequence, the two methods are asymptotically equivalent under standard conditions, except for the fact, discussed above, that our CML implementation treats the constant and exogenous variables as per equation (31.3).

The two methods can be compared as in the following example:

    open data10-1
    arma 1 1 ; r
    arma 1 1 ; r --conditional

which produces the estimates shown in Table 31.1. As you can see, the estimates of φ and θ are quite similar. The reported constants differ widely, as expected; see the discussion following equations (31.4) and (31.5). However, dividing the CML constant by 1 − φ we get 7.38, which is not far from the ML estimate of 6.93.

Table 31.1: ML and CML estimates (standard errors in parentheses)

    Parameter   ML                      CML
    µ           6.93042  (0.923882)     1.07322  (0.488661)
    φ           0.855360 (0.0511842)    0.852772 (0.0450252)
    θ           0.588056 (0.0986096)    0.591838 (0.0456662)

Convergence and initialization

The numerical methods used to maximize the likelihood for ARMA models are not guaranteed to converge. Whether or not convergence is achieved, and whether or not the true maximum of the likelihood function is attained, may depend on the starting values for the parameters. Gretl employs one of the following two initialization mechanisms, depending on the specification of the model and the estimation method chosen:

1. Estimate a pure AR model by Least Squares (non-linear least squares if the model requires it, otherwise OLS). Set the AR parameter values based on this regression and set the MA parameters to a small positive value (0.0001).
2. The Hannan-Rissanen method: first, estimate an autoregressive model by OLS and save the residuals. Then, in a second OLS pass, add appropriate lags of the first-round residuals to the model, to obtain estimates of the MA parameters.

¹ Prior to gretl 1.8.6, the default model was (31.9). We changed this for the sake of consistency with other software.
To see the details of the ARMA estimation procedure, add the --verbose option to the command. This prints a notice of the initialization method used, as well as the parameter values and log-likelihood at each iteration.

Besides the built-in initialization mechanisms, the user has the option of specifying a set of starting values manually. This is done via the set command: the first argument should be the keyword initvals and the second should be the name of a pre-specified matrix containing starting values. For example:

    matrix start = {0, 0.85, 0.34}
    set initvals start
    arma 1 1 ; y

The specified matrix should have just as many parameters as the model: in the example above there are three parameters, since the model implicitly includes a constant. The constant, if present, is always given first; otherwise, the order in which the parameters are expected is the same as the order of specification in the arma or arima command. In the example the constant is set to zero, φ₁ to 0.85 and θ₁ to 0.34.

You can get gretl to revert to automatic initialization via the command

    set initvals auto

Two variants of the BFGS algorithm are available in gretl. In general we recommend the default variant, which is based on an implementation by Nash (1990), but for some problems the alternative, limited-memory version (L-BFGS-B, see Byrd et al., 1995) may increase the chances of convergence on the ML solution. This can be selected via the --lbfgs option to the arma command.

Estimation via X-12-ARIMA

As an alternative to estimating ARMA models using "native" code, gretl offers the option of using the external program X-12-ARIMA. This is the seasonal adjustment software produced and maintained by the US Census Bureau; it is used for all official seasonal adjustments at the Bureau. (The current version, X-13, can also be used, working as a drop-in replacement.)

Gretl includes a module which interfaces with X-12-ARIMA: it translates arma commands using the syntax outlined above into a form recognized by X-12-ARIMA, executes the program, and retrieves the results for viewing and further analysis within gretl. To use this facility you have to install X-12-ARIMA separately. Packages for both MS Windows and GNU/Linux are available from the gretl website, http://gretl.sourceforge.net.

To invoke X-12-ARIMA as the estimation engine, append the flag --x-12-arima, as in

    arma p q ; y --x-12-arima

As with native estimation, the default is to use exact ML, but there is the option of using conditional ML with the --conditional flag. However, please note that when X-12-ARIMA is used in conditional ML mode, the comments above regarding the variant treatments of the mean of the process y_t do not apply. That is, when you use X-12-ARIMA the model that is estimated is (31.2), regardless of whether estimation is by exact ML or conditional ML. In addition, the treatment of exogenous regressors in the context of ARIMA differencing is always that shown in equation (31.8).
Forecasting

ARMA models are often used for forecasting purposes. The autoregressive component, in particular, offers the possibility of forecasting a process "out of sample" over a substantial time horizon.

Gretl supports forecasting on the basis of ARMA models using the method set out by Box and Jenkins (1976).² The Box and Jenkins algorithm produces a set of integrated AR coefficients which take into account any differencing of the dependent variable (seasonal and/or non-seasonal) in the ARIMA context, thus making it possible to generate a forecast for the level of the original variable. By contrast, if you first difference a series manually and then apply ARMA to the differenced series, forecasts will be for the differenced series, not the level. This point is illustrated in Listing 31.1. The parameter estimates are identical for the two models. The forecasts differ but are mutually consistent: the variable fcdiff emulates the ARMA forecast (static, one step ahead within the sample range, and dynamic out of sample).

² See in particular their "Program 4" on p. 505ff.

Listing 31.1: ARIMA forecasting

    open greene18_2.gdt
    # log of quarterly US nominal GNP, 1950:1 to 1983:4
    series y = log(Y)
    # and its first difference
    series dy = diff(y)
    # reserve 2 years for out-of-sample forecast
    smpl ; 1981:4
    # Estimate using ARIMA
    arima 1 1 1 ; y
    # forecast over full period
    smpl --full
    fcast fc1
    # Return to subsample and run ARMA on the first difference of y
    smpl ; 1981:4
    arma 1 1 ; dy
    smpl --full
    fcast fc2
    series fcdiff = (t<=1982:1) ? (fc1 - y(-1)) : (fc1 - fc1(-1))
    # compare the forecasts over the later period
    smpl 1981:1 1983:4
    print y fc1 fc2 fcdiff --byobs

The output from the last command is:

               y        fc1       fc2    fcdiff
    1981:1  7.964086  7.940930  0.02668  0.02668
    1981:2  7.978654  7.997576  0.03349  0.03349
    1981:3  8.009463  7.997503  0.01885  0.01885
    1981:4  8.015625  8.033695  0.02423  0.02423
    1982:1  8.014997  8.029698  0.01407  0.01407
    1982:2  8.026562  8.046037  0.01634  0.01634
    1982:3  8.032717  8.063636  0.01760  0.01760
    1982:4  8.042249  8.081935  0.01830  0.01830
    1983:1  8.062685  8.100623  0.01869  0.01869
    1983:2  8.091627  8.119528  0.01891  0.01891
    1983:3  8.115700  8.138554  0.01903  0.01903
    1983:4  8.140811  8.157646  0.01909  0.01909
yₜ is already stationary and no differencing is required One peculiar aspect of this test is that its limit distribution is nonstandard under the null hypothesis moreover the shape of the distribution and consequently the critical values for the test depends on the form of the μₜ term A full analysis of the various cases is inappropriate here Hamilton 1994 contains an excellent discussion but any recent time series textbook covers this topic Suffice it to say that gretl allows the user to choose the specification for μₜ among four different alternatives μₜ command option 0 nc μ₀ c μ₀ μ₁t ct μ₀ μ₁t μ₁t² ctt These option flags are not mutually exclusive when they are used together the statistic will be reported separately for each selected case By default gretl uses the combination c ct For each case approximate pvalues are calculated by means of the algorithm developed in MacKinnon 1996 The gretl command used to perform the test is adf for example adf 4 x1 Chapter 31 Univariate time series models 301 Listing 312 ARMA lag selection Download open sunspotsgdt ARMA lag selection with maxima of 4 for p and q arma 4 4 sunspots lagselect determine the best row per BIC column 4 bestrow iminctest4 extract this row spec testbestrow12 extract p and q as scalars scalar p spec1 scalar q spec2 and estimate the best specification arma p q sunspots Part of the lagselection table Estimated using AS 197 exact ML Dependent variable sunspots T 322 Criteria for ARMAp q specifications p q AIC BIC HQC lnL 0 0 35752367 35827858 35782505 17856183 0 1 32837333 32950569 32882540 16388666 0 2 31236726 31387708 31297002 15578363 0 3 30718351 30907078 30793697 15309175 0 4 30470500 30696973 30560916 15175250 1 0 32203385 32316621 32248593 16071692 1 1 31084048 31235030 31144325 15502024 1 2 30603363 30792090 30678709 15251681 1 3 30512713 30739187 30603129 15196357 1 4 30451230 30715449 30556715 15155615 3 0 30086022 30274750 30161368 14993011 3 1 30105262 30331735 30195677 14992631 3 2 29763054 30027273 29868539 14811527 3 3 29696493 29998457 29817046 14768246 3 4 29705017 30044727 29840640 14762509 4 0 30105497 30331970 30195912 14992748 4 1 30123267 30387485 30228751 14991633 4 2 29695073 29997037 29815626 14767536 4 3 29712552 30052262 29848175 14766276 4 4 29711378 30088833 29862070 14755689 Chapter 31 Univariate time series models 302 would compute the test statistic as the tstatistic for 𝜙 in equation 3110 with p 4 in the two cases μₜ μ₀ and μₜ μ₀ μ₁t The number of lags p in equation 3110 should be chosen as to ensure that 3110 is a parametrization flexible enough to represent adequately the shortrun persistence of Δyₜ Setting p too low results in size distortions in the test whereas setting p too high leads to low power As a convenience to the user the parameter p can be automatically determined Setting p to a negative number triggers a sequential procedure that starts with p lags and decrements p until the tstatistic for the parameter γₚ exceeds 1645 in absolute value The ADFGLS test Elliott Rothenberg and Stock 1996 proposed a variant of the ADF test which involves an alternative method of handling the parameters pertaining to the deterministic term μₜ these are estimated first via Generalized Least Squares and in a second stage an ADF regression is performed using the GLS residuals This variant offers greater power than the regular ADF test for the cases μₜ μ₀ and μₜ μ₀ μ₁t The ADFGLS test is available in gretl via the gls option to the adf command When this option is selected the nc and ctt options become 
The ADF-GLS test

Elliott, Rothenberg and Stock (1996) proposed a variant of the ADF test which involves an alternative method of handling the parameters pertaining to the deterministic term µ_t: these are estimated first via Generalized Least Squares, and in a second stage an ADF regression is performed using the GLS residuals. This variant offers greater power than the regular ADF test for the cases µ_t = µ₀ and µ_t = µ₀ + µ₁t.

The ADF-GLS test is available in gretl via the --gls option to the adf command. When this option is selected, the --nc and --ctt options become unavailable, and only one case can be selected at a time; by default the constant-only model is used, but a trend can be added using the --ct flag. When a trend is present in this test, MacKinnon-type p-values are not available; instead we show critical values from Table 1 in Elliott et al. (1996).

The KPSS test

The KPSS test (Kwiatkowski, Phillips, Schmidt and Shin, 1992) is a unit root test in which the null hypothesis is opposite to that in the ADF test: under the null, the series in question is stationary; the alternative is that the series is I(1).

The basic intuition behind this test statistic is very simple: if y_t can be written as y_t = µ + u_t, where u_t is some zero-mean stationary process, then not only does the sample average of the y_t's provide a consistent estimator of µ, but the long-run variance of u_t is a well-defined, finite number. Neither of these properties hold under the alternative.

The test itself is based on the following statistic:

    η = Σ_{t=1}^{T} S_t² / (T² σ̄²)    (31.11)

where S_t = Σ_{s=1}^{t} e_s and σ̄² is an estimate of the long-run variance of e_t = y_t − ȳ. Under the null, this statistic has a well-defined (non-standard) asymptotic distribution, which is free of nuisance parameters and has been tabulated by simulation. Under the alternative, the statistic diverges.

As a consequence, it is possible to construct a one-sided test based on η, where H₀ is rejected if η is bigger than the appropriate critical value; gretl provides the 90, 95 and 99 percent quantiles. The critical values are computed via the method presented by Sephton (1995), which offers greater accuracy than the values tabulated in Kwiatkowski et al. (1992).

Usage example:

    kpss m y

where m is an integer representing the bandwidth or window size used in the formula for estimating the long-run variance:

    σ̄² = Σ_{i=−m}^{m} (1 − |i|/(m+1)) γ̂_i

The γ̂_i terms denote the empirical autocovariances of e_t, from order −m through m. For this estimator to be consistent, m must be large enough to accommodate the short-run persistence of e_t, but not too large compared to the sample size T. If the supplied m is non-positive, a default value is computed, namely the integer part of 4(T/100)^{1/4}.

The above concept can be generalized to the case where y_t is thought to be stationary around a deterministic trend. In this case, formula (31.11) remains unchanged, but the series e_t is defined as the residuals from an OLS regression of y_t on a constant and a linear trend. This second form of the test is obtained by appending the --trend option to the kpss command:

    kpss n y --trend

Note that in this case the asymptotic distribution of the test is different, and the critical values reported by gretl differ accordingly.
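A minimal sketch, assuming a time series y is loaded: run the level- and trend-stationarity variants with the automatic bandwidth, by passing a non-positive window size.

    kpss 0 y           # null: stationary around a constant
    kpss 0 y --trend   # null: stationary around a linear trend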
Panel unit root tests

The most commonly used unit root tests for panel data involve a generalization of the ADF procedure, in which the joint null hypothesis is that a given time series is non-stationary for all individuals in the panel.

In this context the ADF regression (31.10) can be rewritten as

    Δy_{it} = µ_{it} + φ_i y_{i,t−1} + Σ_{j=1}^{p_i} γ_{ij} Δy_{i,t−j} + ε_{it}    (31.12)

The model (31.12) allows for maximal heterogeneity across the individuals in the panel: the parameters of the deterministic term, the autoregressive coefficient φ, and the lag order p are all specific to the individual, indexed by i.

One possible modification of this model is to impose the assumption that φ_i = φ for all i; that is, the individual time series share a common autoregressive root (although they may differ in respect of other statistical properties).

The choice of whether or not to impose this assumption has an important bearing on the hypotheses under test. Under model (31.12), the joint null is φ_i = 0 for all i, meaning that all the individual time series are non-stationary, and the alternative (simply the negation of the null) is that at least one individual time series is stationary. When a common φ is assumed, the null is that φ = 0 and the alternative is that φ < 0. The null still says that all the individual series are non-stationary, but the alternative now says that they are all stationary. The choice of model should take this point into account, as well as the gain in power from forming a pooled estimate of φ and, of course, the plausibility of assuming a common AR(1) coefficient.³

In gretl, the formulation (31.12) is used automatically when the adf command is used on panel data. The joint test statistic is formed using the method of Im, Pesaran and Shin (2003). In this context the behavior of adf differs from regular time-series data: only one case of the deterministic term is handled per invocation of the command; the default is that µ_{it} includes just a constant, but the --nc and --ct flags can be used to suppress the constant or to include a trend, respectively; and the quadratic trend option --ctt is not available.

The alternative that imposes a common value of φ is implemented via the levinlin command. The test statistic is computed as per Levin, Lin and Chu (2002). As with the adf command, the first argument is the lag order and the second is the name of the series to test; and the default case for the deterministic component is a constant only. The options --nc and --ct have the same effect as with adf. One refinement is that the lag order may be given in either of two forms: if a scalar is given, this is taken to represent a common value of p for all individuals, but you may instead provide a vector holding a set of p_i values, hence allowing the order of autocorrelation of the series to differ by individual. So, for example, given

    levinlin 2 y
    levinlin {2,2,3,3,4,4} y

the first command runs a joint ADF test with a common lag order of 2, while the second (which assumes a panel with six individuals) allows for differing short-run dynamics. The first argument to levinlin can be given as a set of comma-separated integers enclosed in braces, as shown above, or as the name of an appropriately dimensioned pre-defined matrix (see chapter 17).

³ If the assumption of a common φ seems excessively restrictive, bear in mind that we routinely assume common slope coefficients when estimating panel models, even if this is unlikely to be literally true.
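In matrix form, the second invocation above might equivalently be written as follows (the series name y and the six-unit panel are hypothetical):

    matrix plags = {2, 2, 3, 3, 4, 4}
    levinlin plags y --ct   # common root, constant plus trend, unit-specific lags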
Besides variants of the ADF test, the KPSS test also can be used with panel data, via the kpss command. In this case the test (of the null hypothesis that the given time series is stationary for all individuals) is implemented using the method of Choi (2001). This is an application of meta-analysis, the statistical technique whereby an overall or composite p-value for the test of a given null hypothesis can be computed from the p-values of a set of separate tests. Unfortunately, in the case of the KPSS test we are limited by the unavailability of precise p-values, although if an individual test statistic falls between the 10 percent and 1 percent critical values we are able to interpolate with a fair degree of confidence. This gives rise to four cases:

1. All the individual KPSS test statistics fall between the 10 percent and 1 percent critical values: the Choi method gives us a plausible composite p-value.
2. Some of the KPSS test statistics exceed the 1 percent value and none fall short of the 10 percent value: we can give an upper bound for the composite p-value by setting the unknown p-values to 0.01.
3. Some of the KPSS test statistics fall short of the 10 percent critical value but none exceed the 1 percent value: we can give a lower bound to the composite p-value by setting the unknown p-values to 0.10.
4. None of the above conditions are satisfied: the Choi method fails to produce any result for the composite KPSS test.

31.4 Cointegration test

The generally recommended test for cointegration is the Johansen test, which is discussed in detail in chapter 33. In this context we just offer a few remarks on the cointegration test of Engle and Granger (1987), because it builds on the univariate ADF test discussed above (section 31.3).

For the Engle-Granger test, the procedure is:

1. Test each series for a unit root using an ADF test.
2. Run a cointegrating regression via OLS. For this we select one of the potentially cointegrated variables as dependent, and include the other potentially cointegrated variables as regressors.
3. Perform an ADF test on the residuals from the cointegrating regression.

The idea is that cointegration is supported if (a) the null of non-stationarity is not rejected for each of the series individually, in step 1, while (b) the null is rejected for the residuals at step 3. That is, each of the individual series is I(1) but some linear combination of the series is I(0).

This test is implemented in gretl by the coint command, which requires an integer lag order (for the ADF tests) followed by a list of variables to be tested, the first of which will be taken as dependent in the cointegrating regression. Please see the online help for coint, or the Gretl Command Reference, for further details.
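A hypothetical invocation for three candidate series y1, y2 and y3, with 4 lags in the ADF steps and y1 as dependent in the cointegrating regression:

    coint 4 y1 y2 y3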
31.5 ARCH and GARCH

Heteroskedasticity means a non-constant variance of the error term in a regression model. Autoregressive Conditional Heteroskedasticity (ARCH) is a phenomenon specific to time series models, whereby the variance of the error displays autoregressive behavior; for instance, the time series exhibits successive periods where the error variance is relatively large, and successive periods where it is relatively small. This sort of behavior is reckoned to be common in asset markets: an unsettling piece of news can lead to a period of increased volatility in the market.

An ARCH error process of order q can be represented as

    u_t = σ_t ε_t
    σ_t² = E(u_t² | Ω_{t−1}) = α₀ + Σ_{i=1}^{q} α_i u²_{t−i}

where the ε_t's are independently and identically distributed (iid) with mean zero and variance 1, and where σ_t is taken to be the positive square root of σ_t². Ω_{t−1} denotes the information set as of time t−1, and σ_t² is the conditional variance: that is, the variance conditional on information dated t−1 and earlier.

It is important to notice the difference between ARCH and an ordinary autoregressive error process. The simplest (first-order) case of the latter can be written as

    u_t = ρ u_{t−1} + ε_t,    −1 < ρ < 1

where the ε_t's are independently and identically distributed with mean zero and variance σ². With an AR(1) error, if ρ is positive then a positive value of u_t will tend to be followed by a positive u_{t+1}. With an ARCH error process, a disturbance u_t of large absolute value will tend to be followed by further large absolute values, but with no presumption that the successive values will be of the same sign. ARCH in asset prices is a "stylized fact" and is consistent with market efficiency; on the other hand, autoregressive behavior of asset prices would violate market efficiency.

One can test for ARCH of order q in the following way:

1. Estimate the model of interest via OLS and save the squared residuals, û_t².
2. Perform an auxiliary regression in which the current squared residual is regressed on a constant and q lags of itself.
3. Find the TR² value (sample size times unadjusted R²) for the auxiliary regression.
4. Refer the TR² value to the χ² distribution with q degrees of freedom, and if the p-value is "small enough" reject the null hypothesis of homoskedasticity in favor of the alternative of ARCH(q).

This test is implemented in gretl via the modtest command with the --arch option, which must follow estimation of a time-series model by OLS (either a single-equation model or a VAR). For example:

    ols y 0 x
    modtest 4 --arch

This example specifies an ARCH order of q = 4; if the order argument is omitted, q is set equal to the periodicity of the data. In the graphical interface, the ARCH test is accessible from the Tests menu in the model window (again, for single-equation OLS or VARs).

GARCH

The simple ARCH(q) process is useful for introducing the general concept of conditional heteroskedasticity in time series, but it has been found to be insufficient in empirical work. The dynamics of the error variance permitted by ARCH(q) are not rich enough to represent the patterns found in financial data. The generalized ARCH or GARCH model is now more widely used.

The representation of the variance of a process in the GARCH model is somewhat (but not exactly) analogous to the ARMA representation of the level of a time series. The variance at time t is allowed to depend on both past values of the variance and past values of the realized squared disturbance, as shown in the following system of equations:

    y_t = X_t β + u_t    (31.13)
    u_t = σ_t ε_t    (31.14)
    σ_t² = α₀ + Σ_{i=1}^{q} α_i u²_{t−i} + Σ_{j=1}^{p} δ_j σ²_{t−j}    (31.15)

As above, ε_t is an iid sequence with unit variance. X_t is a matrix of regressors (or in the simplest case, just a vector of 1s allowing for a non-zero mean of y_t). Note that if p = 0, GARCH collapses to ARCH(q): the generalization is embodied in the δ_j terms that multiply previous values of the error variance.

In principle the underlying innovation, ε_t, could follow any suitable probability distribution, and besides the obvious candidate of the normal or Gaussian distribution, the Student's t distribution has been used in this context. Currently gretl only handles the case where ε_t is assumed to be Gaussian. However, when the --robust option to the garch command is given, the estimator gretl uses for the covariance matrix can be considered Quasi-Maximum Likelihood even with non-normal disturbances. See below for more on the options regarding the GARCH covariance matrix.

Example:

    garch p q ; y const x

where p ≥ 0 and q > 0 denote the respective lag orders as shown in equation (31.15). These values can be supplied in numerical form or as the names of pre-defined scalar variables.

GARCH estimation

Estimation of the parameters of a GARCH model is by no means a straightforward task. (Consider equation (31.15): the conditional variance at any point in time, σ_t², depends on the conditional variance in earlier periods, but σ_t² is not observed, and must be inferred by some sort of Maximum Likelihood procedure.) By default gretl uses native code that employs the BFGS maximizer; you also have the option (activated by the --fcp command-line switch) of using the method proposed by Fiorentini et al. (1996),⁴ which was adopted as a benchmark in the study of GARCH results by McCullough and Renfro (1998). It employs analytical first and second derivatives of the log-likelihood, and uses a mixed-gradient algorithm, exploiting the information matrix in the early iterations and then switching to the Hessian in the neighborhood of the maximum likelihood. (This progress can be observed if you append the --verbose option to gretl's garch command.)

⁴ The algorithm is based on Fortran code deposited in the archive of the Journal of Applied Econometrics by the authors, and is used by kind permission of Professor Fiorentini.
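Tying the pieces together, a hypothetical session with a series y and regressor x might run

    garch 1 1 ; y const x --robust          # QML covariance matrix
    garch 1 1 ; y const x --fcp --verbose   # Fiorentini et al. algorithm, with iteration output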
Several options are available for computing the covariance matrix of the parameter estimates in connection with the garch command. At a first level, one can choose between a "standard" and a "robust" estimator. By default, the Hessian is used, unless the --robust option is given, in which case the QML estimator is used. A finer choice is available via the set command, as shown in Table 31.2.

Table 31.2: Options for the GARCH covariance matrix

    command                  effect
    set garch_vcv hessian    Use the Hessian
    set garch_vcv im         Use the Information Matrix
    set garch_vcv op         Use the Outer Product of the Gradient
    set garch_vcv qml        QML estimator
    set garch_vcv bw         Bollerslev-Wooldridge sandwich estimator

It is not uncommon, when one estimates a GARCH model for an arbitrary time series, to find that the iterative calculation of the estimates fails to converge. For the GARCH model to make sense, there are strong restrictions on the admissible parameter values, and it is not always the case that there exists a set of values inside the admissible parameter space for which the likelihood is maximized.

The restrictions in question can be explained by reference to the simplest (and much the most common) instance of the GARCH model, where p = q = 1. In the GARCH(1,1) model the conditional variance is

    σ_t² = α₀ + α₁ u²_{t−1} + δ₁ σ²_{t−1}    (31.16)

Taking the unconditional expectation of (31.16) we get

    σ² = α₀ + α₁ σ² + δ₁ σ²

so that

    σ² = α₀ / (1 − α₁ − δ₁)

For this unconditional variance to exist, we require that α₁ + δ₁ < 1, and for it to be positive we require that α₀ > 0.

A common reason for non-convergence of GARCH estimates (that is, a common reason for the non-existence of α_i and δ_i values that satisfy the above requirements and at the same time maximize the likelihood of the data) is misspecification of the model. It is important to realize that GARCH, in itself, allows only for time-varying volatility in the data. If the mean of the series in question is not constant, or if the error process is not only heteroskedastic but also autoregressive, it is necessary to take this into account when formulating an appropriate model. For example, it may be necessary to take the first difference of the variable in question and/or to add suitable regressors, X_t, as in (31.13).

Chapter 32. Vector Autoregressions

Gretl provides a standard set of procedures for dealing with the multivariate time-series models known as VARs (Vector AutoRegressions). More general models, such as VARMAs, non-linear models or multivariate GARCH models, are not provided as of now, although it is entirely possible to estimate them by writing custom procedures in the gretl scripting language. In this chapter, we will briefly review gretl's VAR toolbox.

32.1 Notation

A VAR is a structure whose aim is to model the time persistence of a vector of n time series, y_t, via a multivariate autoregression, as in

    y_t = A₁ y_{t−1} + A₂ y_{t−2} + … + A_p y_{t−p} + B x_t + ε_t    (32.1)

The number of lags p is called the order of the VAR. The vector x_t, if present, contains a set of exogenous variables, often including a constant, possibly with a time trend and seasonal dummies. The vector ε_t is typically assumed to be a vector white noise, with covariance matrix Σ.

Equation (32.1) can be written more compactly as

    A(L) y_t = B x_t + ε_t    (32.2)

where A(L) is a matrix polynomial in the lag operator, or as

    | y_t       |       | y_{t−1} |   | B |       | ε_t |
    | y_{t−1}   | = A · | y_{t−2} | + | 0 | x_t + |  0  |    (32.3)
    |   ⋮       |       |   ⋮     |   | ⋮ |       |  ⋮  |
    | y_{t−p+1} |       | y_{t−p} |   | 0 |       |  0  |

The matrix A is known as the "companion matrix" and equals

        | A₁  A₂  …  A_p |
    A = |  I   0  …   0  |
        |  0   I  …   0  |
        |  ⋮    ⋱     ⋮  |

Equation (32.3) is known as the "companion form" of the VAR.
Another representation of interest is the so-called "VMA representation", which is written in terms of an infinite series of matrices Θ_i, defined as

    Θ_i = ∂y_t / ∂ε_{t−i}    (32.4)

The Θ_i matrices may be derived by recursive substitution in equation (32.1): for example, assuming for simplicity that B = 0 and p = 1, equation (32.1) would become

    y_t = A y_{t−1} + ε_t

which could be rewritten as

    y_t = A^{n+1} y_{t−n−1} + ε_t + A ε_{t−1} + A² ε_{t−2} + … + A^n ε_{t−n}

In this case Θ_i = A^i. In general, it is possible to compute Θ_i as the n × n north-west block of the i-th power of the companion matrix A (so Θ₀ is always an identity matrix).

The VAR is said to be stable if all the eigenvalues of the companion matrix A are smaller than 1 in absolute value, or, equivalently, if the matrix polynomial A(L) in equation (32.2) is such that |A(z)| = 0 implies |z| > 1. If this is the case, lim_{n→∞} Θ_n = 0 and the vector y_t is stationary; as a consequence, the equation

    y_t − E(y_t) = Σ_{i=0}^{∞} Θ_i ε_{t−i}    (32.5)

is a legitimate Wold representation.

If the VAR is not stable, then the inferential procedures that are called for become somewhat more specialized, except for some simple cases. In particular, if the number of eigenvalues of A with modulus 1 is between 1 and n−1, the canonical tool to deal with these models is the cointegrated VAR model, discussed in chapter 33.

32.2 Estimation

The gretl command for estimating a VAR is var which, in the command line interface, is invoked in the following manner:

    [modelname <-] var p Ylist [; Xlist]

where p is a scalar (the VAR order) and Ylist is a list of variables specifying the content of y_t. The optional Xlist argument can be used to specify a set of exogenous variables. If this argument is omitted, the vector x_t is taken to contain a constant (only); if present, it must be separated from Ylist by a semicolon. Note, however, that a few common choices can be obtained in a simpler way: the options --trend and --seasonals call for inclusion of a linear trend and a set of seasonal dummies, respectively. In addition, the --nc option ("no constant") can be used to suppress the standard inclusion of a constant. The "<-" construct can be used to store the model under a name (see section 3.2), if so desired. To estimate a VAR using the graphical interface, choose "Time Series, Vector Autoregression", under the Model menu.

The parameters in eq. (32.1) are typically free from restrictions, which implies that multivariate OLS provides a consistent and asymptotically efficient estimator of all the parameters.¹ Given the simplicity of OLS, this is what every software package, including gretl, uses; example script 32.1 exemplifies the fact that the var command gives you exactly the output you would have from a battery of OLS regressions. The advantage of using the dedicated command is that, after estimation is done, it makes it much easier to access certain quantities and manage certain tasks. For example, the $coeff accessor returns the estimated coefficients as a matrix with n columns, and $sigma returns an estimate of the matrix Σ, the covariance matrix of ε_t.

Moreover, for each variable in the system an F test is automatically performed, in which the null hypothesis is that no lags of variable j are significant in the equation for variable i. This is commonly known as a "Granger causality" test.

In addition, two accessors become available for the companion matrix ($compan) and the VMA representation ($vma). The latter deserves a detailed description: since the VMA representation (32.5) is of infinite order, gretl defines a horizon up to which the Θ_i matrices are computed automatically.

¹ In fact, under normality of ε_t, OLS is indeed the conditional ML estimator. You may want to use other methods if you need to estimate a VAR in which some parameters are constrained.
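As a small illustration of these accessors, the following sketch (using the denmark dataset that appears in the listings below) checks the stability condition stated above. It assumes that eigengen, applied to the non-symmetric companion matrix, returns the eigenvalues as a complex vector, as in recent gretl versions, so that abs() yields their moduli.

    open denmark
    list Y = 1 2 3 4
    var 2 Y
    matrix A = $compan               # companion matrix
    eval maxc(abs(eigengen(A)))      # largest eigenvalue modulus: < 1 for a stable VAR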
Listing 32.1: Estimation of a VAR via OLS

Input:

    open sw_ch14.gdt
    series infl = 400*sdiff(log(PUNEW))
    scalar p = 2
    list X = LHUR infl
    list Xlag = lags(p, X)
    loop foreach i X
        ols $i const Xlag
    endloop
    var p X

Output (selected portions):

    Model 1: OLS, using observations 1960:3-1999:4 (T = 158)
    Dependent variable: LHUR

                 coefficient   std. error   t-ratio   p-value
      const       0.113673     0.0875210     1.299    0.1960
      LHUR_1      1.54297      0.0680518    22.67     8.78e-51
      LHUR_2     -0.583104     0.0645879    -9.028    7.00e-16
      infl_1      0.0219040    0.00874581    2.505    0.0133
      infl_2     -0.0148408    0.00920536   -1.612    0.1090

    Mean dependent var   6.019198    S.D. dependent var   1.502549
    Sum squared resid    8.654176    S.E. of regression   0.237830

    VAR system, lag order 2
    OLS estimates, observations 1960:3-1999:4 (T = 158)
    Log-likelihood = -322.73663
    Determinant of covariance matrix = 0.020382769
    AIC = 4.2119
    BIC = 4.4057
    HQC = 4.2906
    Portmanteau test: LB(39) = 226.984, df = 148 [0.0000]

    Equation 1: LHUR

                 coefficient   std. error   t-ratio   p-value
      const       0.113673     0.0875210     1.299    0.1960
      LHUR_1      1.54297      0.0680518    22.67     8.78e-51
      LHUR_2     -0.583104     0.0645879    -9.028    7.00e-16
      infl_1      0.0219040    0.00874581    2.505    0.0133
      infl_2     -0.0148408    0.00920536   -1.612    0.1090

    Mean dependent var   6.019198    S.D. dependent var   1.502549
    Sum squared resid    8.654176    S.E. of regression   0.237830

Table 32.1: VMA horizon as a function of the dataset periodicity

    Periodicity        horizon
    Quarterly          20 (5 years)
    Monthly            24 (2 years)
    Daily              3 weeks
    All other cases    10

By default, the horizon is a function of the periodicity of the data (see Table 32.1), but it can be set by the user to any desired value via the set command with the horizon parameter, as in

    set horizon 30

Calling the horizon h, the $vma accessor returns an (h+1) × n² matrix, in which the (i+1)-th row is the vectorized form of Θ_i.

VAR lag-order selection

In order to help the user choose the most appropriate VAR order, gretl provides a special variant of the var command:

    var p Ylist [; Xlist] --lagselect

When the --lagselect option is given, estimation is performed for all lags up to p and a table is printed: it displays, for each order, a Likelihood Ratio test for the order p versus p−1, plus an array of information criteria (see chapter 28). For each information criterion in the table, a star indicates what appears to be the "best" choice. The same output can be obtained through the graphical interface, via the "Time Series, VAR lag selection" entry under the Model menu.

Warning: in finite samples, the choice of the maximum lag, p, may affect the outcome of the procedure. This is not a bug, but rather an unavoidable side effect of the way these comparisons should be made. If your sample contains T observations and you invoke the lag selection procedure with maximum order p, gretl examines all VARs of order ranging from 1 to p, estimated on a uniform sample of T − p observations. In other words, the comparison procedure does not use all the available data when estimating VARs of order less than p, so as to ensure that all the models in the comparison are estimated on the same data range. Choosing a different value of p may therefore alter the results, although this is unlikely to happen if your sample size is reasonably large.

An example of this unpleasant phenomenon is given in example script 32.2. As can be seen, according to the Hannan-Quinn criterion, order 2 seems preferable to order 1 if the maximum tested order is 4, but the situation is reversed if the maximum tested order is 6.
32.3 Structural VARs

Gretl's built-in var command does not support the general class of models known as "Structural VARs", though it does support the Cholesky decomposition-based approach, the classic and most popular structural VAR variant. If you wish to go beyond that, there is a gretl addon named SVAR, which will likely meet your needs.

SVAR is supplied as part of the gretl package; you can find its documentation (which is quite detailed) as follows: under the Tools menu in the gretl main window, go to "Function packages / On local machine". Or use the "f(x)" button on the toolbar at the foot of the main window. In the function packages window, either scroll down or use the search box to find SVAR. Then right-click and select "Info". This opens a window which gives basic information on the package, including a link to SVAR.pdf, the full documentation.

The remainder of this section will thus only deal with the Cholesky-based recursive shock identification used by the native var command.

Listing 32.2: VAR lag selection via Information Criteria

Input:

    open denmark
    list Y = 1 2 3 4
    var 4 Y --lagselect
    var 6 Y --lagselect

Output (selected portions):

    VAR system, maximum lag order 4

    The asterisks below indicate the best (that is, minimized) values
    of the respective information criteria, AIC = Akaike criterion,
    BIC = Schwarz Bayesian criterion and HQC = Hannan-Quinn criterion.

    lags     loglik     p(LR)       AIC          BIC          HQC
      1    609.15315            -23.104045   -22.346466*  -22.814552
      2    631.70153   0.00013  -23.360844*  -21.997203   -22.839757*
      3    642.38574   0.16478  -23.152382   -21.182677   -22.399699
      4    653.22564   0.15383  -22.950025   -20.374257   -21.965748

    VAR system, maximum lag order 6

    The asterisks below indicate the best (that is, minimized) values
    of the respective information criteria, AIC = Akaike criterion,
    BIC = Schwarz Bayesian criterion and HQC = Hannan-Quinn criterion.

    lags     loglik     p(LR)       AIC          BIC          HQC
      1    594.38410            -23.444249   -22.672078*  -23.151288*
      2    615.43480   0.00038  -23.650400*  -22.260491   -23.123070
      3    624.97613   0.26440  -23.386781   -21.379135   -22.625083
      4    636.03766   0.13926  -23.185210   -20.559827   -22.189144
      5    658.36014   0.00016  -23.443271   -20.200150   -22.212836
      6    669.88472   0.11243  -23.260601   -19.399743   -21.795797

IRF and FEVD

Assume that the disturbance in equation (32.1) can be thought of as a linear function of a vector of "structural" shocks u_t, which are assumed to have unit variance and to be mutually uncorrelated, so V(u_t) = I. If ε_t = K u_t, it follows that Σ = V(ε_t) = K K′.

The main object of interest in this setting is the sequence of matrices

    C_k = ∂y_t / ∂u_{t−k} = Θ_k K    (32.6)

known as the structural VMA representation. From the C_k matrices defined in equation (32.6), two quantities of interest may be derived: the Impulse Response Function (IRF) and the Forecast Error Variance Decomposition (FEVD).

The IRF of variable i to shock j is simply the sequence of the elements in row i and column j of the C_k matrices. In symbols:

    I_{i,j,k} = ∂y_{i,t} / ∂u_{j,t−k}

As a rule, Impulse Response Functions are plotted as a function of k, and are interpreted as the effect that a shock has on an observable variable through time. Of course, what we observe are the estimated IRFs, so it is natural to endow them with confidence intervals: following common practice, gretl computes the confidence intervals by using the bootstrap;² details are given later in this section.

² It is possible, in principle, to compute analytical confidence intervals via an asymptotic approximation, but this is not a very popular choice: asymptotic formulae are known to often give a very poor approximation of the finite-sample properties.

Another quantity of interest that may be computed from the structural VMA representation is the Forecast Error Variance Decomposition (FEVD). The forecast error variance after h steps is given by

    Ω_h = Σ_{k=0}^{h} C_k C_k′

hence the variance for variable i is

    ω_i² = [Ω_h]_{ii} = Σ_{k=0}^{h} [diag(C_k C_k′)]_i = Σ_{k=0}^{h} Σ_{l=1}^{n} (c^{(k)}_{i,l})²

where c^{(k)}_{i,l} is, trivially, the (i, l) element of C_k. As a consequence, the share of uncertainty on variable i that can be attributed to the j-th shock after h periods equals

    VD_{i,j,h} = Σ_{k=0}^{h} (c^{(k)}_{i,j})² / Σ_{k=0}^{h} Σ_{l=1}^{n} (c^{(k)}_{i,l})²

This makes it possible to quantify which shocks are most important to determine a certain variable in the short and/or in the long run.
Triangularization

The formula (32.6) takes K as known, while of course it has to be estimated. The estimation problem has been the subject of an enormous body of literature, which we will not even attempt to summarize here; see, for example, Lütkepohl (2005), chapter 9.

Suffice it to say that the most popular choice dates back to Sims (1980), and consists in assuming that K is lower triangular, so its estimate is simply the Cholesky decomposition of the estimate of Σ. The main consequence of this choice is that the ordering of variables within the vector y_t becomes meaningful: since K is also the matrix of Impulse Response Functions at lag 0, the triangularity assumption means that the first variable in the ordering responds instantaneously only to shock number 1, the second one only to shocks 1 and 2, and so forth. For this reason, each variable is thought to "own" one shock: variable 1 owns shock number 1, and so on.

In this sort of exercise, therefore, the ordering of the y variables is important. To put it differently, if variable foo comes before variable bar in the Y list, it follows that the shock owned by foo affects bar instantaneously, but not vice versa.

Impulse Response Functions and the FEVD can be printed out via the command line interface by using the --impulse-responses and --variance-decomp options, respectively. If you need to store them into matrices, you could compute the structural VMA and proceed from there. For example, the following code snippet shows you how to manually compute a matrix containing the IRFs:

    open denmark
    list Y = 1 2 3 4
    scalar n = nelem(Y)
    var 2 Y --quiet --impulse-responses
    matrix K = cholesky($sigma)
    matrix V = $vma
    matrix IRF = V * (K ** I(n))
    print IRF

in which the equality

    vec(C_k) = vec(Θ_k K) = (K′ ⊗ I) vec(Θ_k)

was used.

A more convenient way of obtaining the desired quantities is to use the irf and fevd functions, which can be used in scripts after a VAR (or VECM, see the next chapter) has been estimated. In these functions you must specify the number of the responding ("target") variable and the number of the analyzed shock, to get the corresponding results as a column vector. The choice of how many periods should be calculated (and thus how long the result vector will be) is determined by previously invoking set horizon x, where x is a non-negative integer and the first response concerns the impact effect. As always, it is recommended to consult the function reference under the help menu, where in the case of the irf function it is also explained that the implicit shock size is such that the impact response in the same equation is one standard deviation of the corresponding error term.
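For instance, a hypothetical follow-up to the snippet above, retrieving the response of variable 1 to shock 2:

    set horizon 20
    matrix point = irf(1, 2)         # point estimates, column vector
    matrix bands = irf(1, 2, 0.90)   # adds lower and upper 90 percent bounds

The three-column form returned when a confidence level is supplied is described in the next subsection.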
IRF bootstrap

The IRFs obtained above are estimates and, as such, they are uncertain. Mostly due to the fact that they are non-linear functions of the VAR parameters, the standard way of assessing this estimation uncertainty, and to derive confidence intervals or bands, is to use a bootstrap approach. Again, more advanced options are available with the SVAR addon, but the irf function used after the built-in var or vecm command also provides the option to run a bootstrap, based on resampling from the residuals. The number of bootstrap iterations can be adjusted through set boot_iters x, where x must be larger than 499. The desired nominal confidence level must be specified after the target and shock numbers, as the third argument; in that case the return vector becomes a three-column matrix, where the lower and upper bounds of the confidence intervals are given in the extra two columns.

Menu-driven usage

Almost all the functionality related to the described (recursively identified) structural VARs is also available under the menus in the model window that appears after a VAR is estimated in the GUI.³

In the Plots menu there are a number of menu entries relating to the impulse responses, as well as one entry for the forecast error variance decomposition. Selecting any of these will bring up a little specification window where the ordering for the Cholesky decomposition must be chosen and, in the case of IRFs, the intended bootstrap coverage can be set. In the Analysis menu there are also entries for IRF and FEVD, which may sometimes be a little confusing. The point is that here the numbers of the point estimates will be printed out in a tabular format, instead of being plotted.

³ Note that you cannot directly invoke the SVAR addon from the model window of an estimated VAR; that menu entry is only present in gretl's main window, under the Model menu and "multivariate time series" submenu.

32.4 Residual-based diagnostic tests

Three diagnostic tests based on residuals are available after estimating a VAR: for normality, autocorrelation and ARCH (Autoregressive Conditional Heteroskedasticity). These are implemented by the modtest command, using the options --normality, --autocorr and --arch, respectively.

The multivariate normality test is that of Doornik and Hansen (1994); it is based on the skewness and kurtosis of the VAR residuals. The autocorrelation and ARCH tests are also, by default, multivariate; they are described in detail by Lütkepohl (2005) (see sections 4.4.4 and 16.5.1). Both tests are of the LM type, although the autocorrelation test statistic is referred to a Rao F distribution (Rao, 1973). These tests may involve estimation of a large number of parameters, depending on the lag horizon chosen, and can fail for lack of degrees of freedom in small samples. As a fallback, the --univariate option can be used to specify that the tests be run per-equation rather than in multivariate mode.

Listing 32.3 illustrates the VAR autocorrelation tests, replicating an example given by Lütkepohl (2005, p. 174). Note the difference in the interpretation of the order argument to modtest with the --autocorr option (this also applies to the ARCH test): in the multivariate version, order is taken as the maximum lag order, and tests are run from lag 1 up to the maximum; but in the univariate version a single test is run for each equation, using just the specified lag order. The example also exposes what exactly is returned by the $test and $pvalue accessors in the two variants.

Listing 32.3: VAR autocorrelation test from Lütkepohl

Input:

    open wgmacro.gdt --quiet
    list Y = investment income consumption
    list dlnY = ldiff(Y)
    smpl 1960:4 1978:4
    var 2 dlnY
    modtest 4 --autocorr
    eval $test ~ $pvalue
    modtest 4 --autocorr --univariate
    eval $test ~ $pvalue

Output from the tests:

    modtest 4 --autocorr

    Test for autocorrelation of order up to 4

             Rao F    Approx dist.   p-value
    lag 1    0.615    F(9, 148)      0.7827
    lag 2    0.754    F(18, 164)     0.7507
    lag 3    1.143    F(27, 161)     0.2982
    lag 4    1.254    F(36, 154)     0.1743

    eval $test ~ $pvalue

    0.61524   0.78269
    0.75397   0.75067
    1.1429    0.29820
    1.2544    0.17431

    modtest 4 --autocorr --univariate

    Test for autocorrelation of order 4

    Equation 1:
    Ljung-Box Q' = 6.11506 with p-value = P(Chi-square(4) > 6.11506) = 0.191

    Equation 2:
    Ljung-Box Q' = 1.67136 with p-value = P(Chi-square(4) > 1.67136) = 0.796

    Equation 3:
    Ljung-Box Q' = 1.59931 with p-value = P(Chi-square(4) > 1.59931) = 0.809

    eval $test ~ $pvalue

    6.1151    0.19072
    1.6714    0.79591
    1.5993    0.80892
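The other two diagnostics follow the same pattern; a minimal hypothetical continuation of the script above would be

    modtest --normality             # Doornik-Hansen multivariate normality test
    modtest 4 --arch                # multivariate ARCH, lags 1 through 4
    modtest 4 --arch --univariate   # per-equation ARCH at lag 4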
  Equation 2:
  Ljung-Box Q' = 1.67136 with p-value = P(Chi-square(4) > 1.67136) = 0.796
  Equation 3:
  Ljung-Box Q' = 1.59931 with p-value = P(Chi-square(4) > 1.59931) = 0.809

  eval $test ~ $pvalue

  6.1151   0.19072
  1.6714   0.79591
  1.5993   0.80892

Chapter 33
Cointegration and Vector Error Correction Models

33.1 Introduction

The twin concepts of cointegration and error correction have drawn a good deal of attention in macroeconometrics over recent years. The attraction of the Vector Error Correction Model (VECM) is that it allows the researcher to embed a representation of economic equilibrium relationships within a relatively rich time-series specification. This approach overcomes the old dichotomy between (a) structural models that faithfully represented macroeconomic theory but failed to fit the data, and (b) time-series models that were accurately tailored to the data but difficult, if not impossible, to interpret in economic terms.

The basic idea of cointegration relates closely to the concept of unit roots (see section 31.3). Suppose we have a set of macroeconomic variables of interest, and we find we cannot reject the hypothesis that some of these variables, considered individually, are non-stationary. Specifically, suppose we judge that a subset of the variables are individually integrated of order 1, or I(1). That is, while they are non-stationary in their levels, their first differences are stationary.

Given the statistical problems associated with the analysis of non-stationary data (for example, the threat of spurious regression), the traditional approach in this case was to take first differences of all the variables before proceeding with the analysis. But this can result in the loss of important information. It may be that while the variables in question are I(1) when taken individually, there exists a linear combination of the variables that is stationary without differencing, or I(0). (There could be more than one such linear combination.) That is, while the ensemble of variables may be free to wander over time, nonetheless the variables are tied together in certain ways. And it may be possible to interpret these ties, or cointegrating vectors, as representing equilibrium conditions.

For example, suppose we find some or all of the following variables are I(1): money stock, M, the price level, P, the nominal interest rate, R, and output, Y. According to standard theories of the demand for money, we would nonetheless expect there to be an equilibrium relationship between real balances, interest rate and output; for example

  m − p = γ0 + γ1 y + γ2 r,   γ1 > 0, γ2 < 0

where lower-case variable names denote logs. In equilibrium, then,

  m − p − γ1 y − γ2 r = γ0

Realistically, we should not expect this condition to be satisfied each period. We need to allow for the possibility of short-run disequilibrium. But if the system moves back towards equilibrium following a disturbance, it follows that the vector x = (m, p, y, r)′ is bound by a cointegrating vector β = (β1, β2, β3, β4)′, such that β′x is stationary (with a mean of γ0). Furthermore, if equilibrium is correctly characterized by the simple model above, we have β2 = −β1, β3 < 0 and β4 > 0. These things are testable within the context of cointegration analysis.

There are typically three steps in this sort of analysis:

1. Test to determine the number of cointegrating vectors (the cointegrating rank of the system).

2. Estimate a VECM with the appropriate rank, but subject to no further restrictions.

3. Probe the interpretation of the cointegrating vectors as equilibrium conditions by means of restrictions on the elements of these vectors.
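The key idea behind step 1 can be previewed with simulated data: two individually I(1) series built around a common random walk should each fail an ADF test in levels, while a suitable linear combination should pass. The following sketch is ours, not part of the replication material; the series names and coefficients are made up for illustration.

  nulldata 200
  setobs 1 1 --time-series
  series w = cum(normal())       # common stochastic trend, I(1)
  series y1 = w + normal()
  series y2 = 0.5*w + normal()
  adf 4 y1 --c                   # typically cannot reject a unit root
  adf 4 y2 --c
  series z = y1 - 2*y2           # candidate cointegrating combination
  adf 4 z --c                    # typically rejects: z is I(0)

Here z = y1 − 2·y2 eliminates the common trend w by construction, which is exactly the situation the Johansen machinery below is designed to detect in real data.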
The following sections expand on each of these points, giving further econometric details and explaining how to implement the analysis using gretl.

33.2 Vector Error Correction Models as representation of a cointegrated system

Consider a VAR of order p with a deterministic part given by μ_t (typically, a polynomial in time). One can write the n-variate process y_t as

  y_t = μ_t + A_1 y_{t−1} + A_2 y_{t−2} + ... + A_p y_{t−p} + ε_t    (33.1)

But since y_{t−i} = y_{t−1} − (Δy_{t−1} + Δy_{t−2} + ... + Δy_{t−i+1}), we can rewrite the above as

  Δy_t = μ_t + Π y_{t−1} + Σ_{i=1}^{p−1} Γ_i Δy_{t−i} + ε_t    (33.2)

where Π = Σ_{i=1}^{p} A_i − I and Γ_i = −Σ_{j=i+1}^{p} A_j. This is the VECM representation of (33.1).

The interpretation of (33.2) depends crucially on r, the rank of the matrix Π. If r = 0, the processes are all I(1) and not cointegrated. If r = n, then Π is invertible and the processes are all I(0). Cointegration occurs in between, when 0 < r < n and Π can be written as αβ′. In this case, y_t is I(1), but the combination z_t = β′y_t is I(0). If, for example, r = 1 and the first element of β was −1, then one could write z_t = −y_{1,t} + β2 y_{2,t} + ... + βn y_{n,t}, which is equivalent to saying that

  y_{1,t} = β2 y_{2,t} + ... + βn y_{n,t} − z_t

is a long-run equilibrium relationship: the deviations z_t may not be 0, but they are stationary. In this case, (33.2) can be written as

  Δy_t = μ_t + αβ′ y_{t−1} + Σ_{i=1}^{p−1} Γ_i Δy_{t−i} + ε_t    (33.3)

If β were known, then z_t would be observable and all the remaining parameters could be estimated via OLS. In practice, the procedure estimates β first and then the rest.

The rank of Π is investigated by computing the eigenvalues of a closely related matrix, whose rank is the same as that of Π; this matrix, however, is by construction symmetric and positive semidefinite. As a consequence, all its eigenvalues are real and non-negative, and tests on the rank of Π can therefore be carried out by testing how many eigenvalues are 0.

If all the eigenvalues are significantly different from 0, then all the processes are stationary. If, on the contrary, there is at least one zero eigenvalue, then the y_t process is integrated, although some linear combination β′y_t might be stationary. At the other extreme, if no eigenvalues are significantly different from 0, then not only is the process y_t non-stationary, but the same holds for any linear combination β′y_t; in other words, no cointegration occurs.

Estimation typically proceeds in two stages: first, a sequence of tests is run to determine r, the cointegration rank. Then, for a given rank, the parameters in equation (33.3) are estimated. The two commands that gretl offers for estimating these systems are johansen and vecm, respectively. The syntax for johansen is

then we should not place any restriction on the intercept. Otherwise, the question arises of whether it makes sense to specify a cointegration relationship which includes a non-zero intercept. One example where this is appropriate is the relationship between two interest rates: generally these are not trended, but the VAR might still have an intercept because the difference between the two (the interest rate spread) might be stationary around a non-zero mean (for example, because of a risk or liquidity premium).

The previous example can be generalized in three directions:

1. If a VAR of order greater than 1 is considered, the algebra gets more convoluted but the conclusions are identical.

2. If the VAR includes more than two endogenous variables, the cointegration rank r can be greater than 1. In this case, α is a matrix with r columns, and the case with restricted constant entails the restriction that μ0 should be some linear combination of the columns of α.

3. If a linear trend is included in the model, the deterministic part of the VAR becomes μ0 + μ1 t. The reasoning is practically the same as above, except that the focus now centers on μ1 rather than μ0. The counterpart to the restricted constant case discussed above is a "restricted trend" case, such that the cointegration relationships include a trend but the first differences of the variables in question do not. In the case of an unrestricted trend, the trend appears in both the cointegration relationships and the first differences, which corresponds to the presence of a quadratic trend in the variables themselves (in levels).
In order to accommodate the five cases, gretl provides the following options to the johansen and vecm commands:

  μ_t                        option flag   description
  0                          --nc          no constant
  μ0, α⊥′μ0 = 0              --rc          restricted constant
  μ0                         --uc          unrestricted constant
  μ0 + μ1 t, α⊥′μ1 = 0       --crt         constant + restricted trend
  μ0 + μ1 t                  --ct          constant + unrestricted trend

Note that for this command the above options are mutually exclusive. In addition, you have the option of using the --seasonals option, for augmenting μ_t with centered seasonal dummies. In each case, p-values are computed via the approximations devised by Doornik (1998).

33.4 The Johansen cointegration tests

The two Johansen tests for cointegration are used to establish the rank of β, or in other words the number of cointegrating vectors. These are the "λ-max" test, for hypotheses on individual eigenvalues, and the "trace" test, for joint hypotheses. Suppose that the eigenvalues λ_i are sorted from largest to smallest. The null hypothesis for the λ-max test on the i-th eigenvalue is that λ_i = 0. The corresponding trace test, instead, considers the hypothesis λ_j = 0 for all j ≥ i.

The gretl command johansen performs these two tests. The corresponding menu entry in the GUI is "Model, Time Series, Cointegration Test, Johansen".

As in the ADF test, the asymptotic distribution of the tests varies with the deterministic component μ_t one includes in the VAR (see section 33.3 above). The following code uses the denmark data file, supplied with gretl, to replicate Johansen's example found in his 1995 book:

  open denmark
  johansen 2 LRM LRY IBO IDE --rc --seasonals
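Once a rank has been settled on, the estimated cointegration vectors and adjustment (loading) coefficients can be retrieved via the $jbeta and $jalpha accessors. A minimal sketch of our own, continuing with the same dataset and assuming, for the sake of argument, that rank 1 has been chosen:

  open denmark
  vecm 2 1 LRM LRY IBO IDE --rc --seasonals
  matrix beta = $jbeta     # cointegration vector(s), one column per relation
  matrix alpha = $jalpha   # loading ("adjustment") coefficients
  print beta alpha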
of r equilibrium relations as

  y_{1,t} = b_{1,r+1} y_{r+1,t} + ... + b_{1,n} y_{n,t}
  y_{2,t} = b_{2,r+1} y_{r+1,t} + ... + b_{2,n} y_{n,t}
  ...
  y_{r,t} = b_{r,r+1} y_{r+1,t} + ... + b_{r,n} y_{n,t}

where the first r variables are expressed as functions of the remaining n − r.

Although the triangular representation ensures that the statistical problem of estimating β is solved, the resulting equilibrium relationships may be difficult to interpret. In this case, the user may want to achieve identification by specifying manually the system of r² constraints that gretl will use to produce an estimate of β.

As an example, consider the money demand system presented in section 9.6 of Verbeek (2004). The variables used are m (the log of real money stock M1), infl (inflation), cpr (the commercial paper rate), y (log of real GDP) and tbr (the Treasury bill rate).² Estimation of β can be performed via the commands

  open money.gdt
  smpl 1954:1 1994:4
  vecm 6 2 m infl cpr y tbr --rc

and the relevant portion of the output reads:

  Maximum likelihood estimates, observations 1954:1-1994:4 (T = 164)
  Cointegration rank = 2
  Case 2: Restricted constant

  beta (cointegrating vectors, standard errors in parentheses)

  m        1.0000        0.0000
          (0.0000)      (0.0000)
  infl     0.0000        1.0000
          (0.0000)      (0.0000)
  cpr      0.56108       24.367
          (0.10638)     (4.2113)
  y        0.40446       0.91166
          (0.10277)     (4.0683)
  tbr      0.54293       24.786
          (0.10962)     (4.3394)
  const    3.7483        16.751
          (0.78082)     (3.0909)

²This data set is available in the verbeek data package; see http://gretl.sourceforge.net/gretl_data.html.

Interpretation of the coefficients of the cointegration matrix β would be easier if a meaning could be attached to each of its columns. This is possible by hypothesizing the existence of two long-run relationships: a money demand equation

  m = c1 + β1 infl + β2 y + β3 tbr

and a risk premium equation

  cpr = c2 + β4 infl + β5 y + β6 tbr

which imply that the cointegration matrix can be normalized as

  β = |  1     0  |
      | −β1   −β4 |
      |  0     1  |
      | −β2   −β5 |
      | −β3   −β6 |
      | −c1   −c2 |

This renormalization can be accomplished by means of the restrict command, to be given after the vecm command, or in the graphical interface by selecting the "Test, Linear Restrictions" menu entry. The syntax for entering the restrictions should be fairly obvious:³

  restrict
    b[1,1] = 1
    b[1,3] = 0
    b[2,1] = 0
    b[2,3] = 1
  end restrict

which produces

  Cointegrating vectors (standard errors in parentheses)

  m        1.0000        0.0000
          (0.0000)      (0.0000)
  infl     0.023026      0.041039
          (0.0054666)   (0.027790)
  cpr      0.0000        1.0000
          (0.0000)      (0.0000)
  y        0.42545       0.037414
          (0.033718)    (0.17140)
  tbr      0.027790      1.0172
          (0.0045445)   (0.023102)
  const    3.3625        0.68744
          (0.25318)     (1.2870)

³Note that in this context we are bending the usual matrix indexation convention, using the leading index to refer to the column of β (the particular cointegrating vector). This is standard practice in the literature, and defensible insofar as it is the columns of β (the cointegrating relations, or equilibrium errors) that are of primary interest.

33.6 Overidentifying restrictions

One purpose of imposing restrictions on a VECM system is simply to achieve identification. If these restrictions are simply normalizations, they are not testable and should have no effect on the maximized likelihood. In addition, however, one may wish to formulate constraints on β and/or α that derive from the economic theory underlying the equilibrium relationships; substantive restrictions of this sort are then testable via a likelihood-ratio statistic.

Gretl is capable of testing general linear restrictions of the form

  Rb vec(β) = q    (33.5)

and/or

  Ra vec(α) = 0    (33.6)

Note that the β restriction may be non-homogeneous (q ≠ 0), but the α restriction must be homogeneous. Nonlinear restrictions are not supported, and neither are restrictions that cross between β and α. When r > 1, such restrictions may be in common across all the columns of β (or α), or may be specific to certain columns of these matrices. For useful discussions of this point see Boswijk (1995) and Boswijk and Doornik (2004, section 4.4).

The restrictions (33.5) and (33.6) may be written in explicit form as

  vec(β) = Hφ + h0    (33.7)

and

  vec(α) = Gψ    (33.8)

respectively, where φ and ψ are the free parameter vectors associated with β and α respectively. We may refer to the free parameters collectively as θ (the column vector formed by concatenating φ and ψ). Gretl uses this representation internally when testing the restrictions.

If the list of restrictions that is passed to the restrict command contains more constraints than necessary to achieve identification, then an LR test is performed. In addition, the restrict command can be given the --full switch, in which case full estimates for the restricted system are printed (including the Γ_i terms), and the system thus restricted becomes the "current model" for the purposes of further tests. Thus you are able to carry out cumulative tests, as in Chapter 7 of Johansen (1995).

Syntax

The full syntax for specifying the restriction is an extension of that exemplified in the previous section. Inside a restrict ... end restrict block, valid statements are of the form

  parameter linear combination = scalar

where a parameter linear combination involves a weighted sum of individual elements of β or α (but not both in the same combination); the scalar on the right-hand side must be 0 for combinations involving α, but can be any real number for combinations involving β.
Below we give a few examples of valid restrictions:

  b[1,1] = 1.618
  b[1,4] + 2*b[2,5] = 0
  a[1,3] = 0
  a[1,1] - a[1,2] = 0

Special syntax is used when a certain constraint should be applied to all columns of β: in this case, one index is given for each b term and the square brackets are dropped. Hence the following syntax

  restrict
    b1 + b2 = 0
  end restrict

corresponds to

  β = |  β11   β21 |
      | −β11  −β21 |
      |  β13   β23 |
      |  β14   β24 |

The same convention is used for α: when only one index is given for an a term, the restriction is presumed to apply to all r columns of α; or, in other words, the variable associated with the given row of α is weakly exogenous. For instance, the formulation

  restrict
    a3 = 0
    a4 = 0
  end restrict

specifies that variables 3 and 4 do not respond to the deviation from equilibrium in the previous period.⁴

A variant on the single-index syntax for common restrictions on α and β is available: you can replace the index number with the name of the corresponding variable, in square brackets. For example, instead of "a3 = 0" one could write "a[cpr] = 0", if the third variable in the system is named cpr.

Finally, a shortcut (or anyway an alternative) is available for setting up complex restrictions, but currently only in relation to β: you can specify Rb and q, as in Rb vec(β) = q, by giving the names of previously defined matrices. For example:

  matrix I4 = I(4)
  matrix vR = I4 ** (I4 ~ zeros(4,1))
  matrix vq = mshape(I4, 16, 1)
  restrict
    R = vR
    q = vq
  end restrict

which manually imposes Phillips normalization on the β estimates for a system with cointegrating rank 4.

There are two points to note in relation to this option. First, vec(β) is taken to include the coefficients on all terms within the cointegration space, including the restricted constant or trend, if any, as well as any restricted exogenous variables. Second, it is acceptable to give an R matrix with a number of columns equal to the number of rows of β; this variant is taken to specify a restriction that is in common across all the columns of β.

An example

Brand and Cassola (2004) propose a money demand system for the Euro area, in which they postulate three long-run equilibrium relationships:

  money demand                          m = βl l + βy y
  Fisher equation                       π = φ l
  Expectation theory of interest rates  l = s

where m is real money demand, l and s are long- and short-term interest rates, y is output and π is inflation.⁵ The names for these variables in the gretl data file are m_p, rl, rs, y and infl, respectively.

The cointegration rank assumed by the authors is 3, and there are 5 variables, giving 15 elements in the β matrix. 3 × 3 = 9 restrictions are required for identification, and a just-identified system would have 15 − 9 = 6 free parameters. However, the postulated long-run relationships feature only three free parameters, so the over-identification rank is 3.

⁴Note that when two indices are given in a restriction on α, the indexation is consistent with that for β restrictions: the leading index denotes the cointegrating vector and the trailing index the equation number.

⁵A traditional formulation of the Fisher equation would reverse the roles of the variables in the second equation, but this detail is immaterial in the present context; moreover, the expectation theory of interest rates implies that the third equilibrium relationship should include a constant for the liquidity premium. However, since in this example the system is estimated with the constant term unrestricted, the liquidity premium gets absorbed into the system intercept and disappears from z_t.
Listing 33.1: Estimation of a money demand system with constraints on β

Input:

  open brand_cassola.gdt

  # perform a few transformations
  m_p = m_p/100
  y = y/100
  infl = infl/4
  rs = rs/4
  rl = rl/4

  # replicate table 4, page 824
  vecm 2 3 m_p infl rl rs y -q
  ll0 = $lnl

  restrict --full
    b[1,1] = 1
    b[1,2] = 0
    b[1,4] = 0
    b[2,1] = 0
    b[2,2] = 1
    b[2,4] = 0
    b[2,5] = 0
    b[3,1] = 0
    b[3,2] = 0
    b[3,3] = 1
    b[3,4] = -1
    b[3,5] = 0
  end restrict
  ll1 = $rlnl

Partial output:

  Unrestricted loglikelihood (lu) = 116.60268
  Restricted loglikelihood (lr) = 115.86451
  2 * (lu - lr) = 1.47635
  P(Chi-Square(3) > 1.47635) = 0.68774

  beta (cointegrating vectors, standard errors in parentheses)

  m_p      1.0000        0.0000        0.0000
          (0.0000)      (0.0000)      (0.0000)
  infl     0.0000        1.0000        0.0000
          (0.0000)      (0.0000)      (0.0000)
  rl       1.6108        0.67100       1.0000
          (0.62752)     (0.049482)    (0.0000)
  rs       0.0000        0.0000       −1.0000
          (0.0000)      (0.0000)      (0.0000)
  y        1.3304        0.0000        0.0000
          (0.030533)    (0.0000)      (0.0000)

Listing 33.1 replicates Table 4 on page 824 of the Brand and Cassola article.⁶ Note that we use the $lnl accessor after the vecm command to store the unrestricted log-likelihood, and the $rlnl accessor after restrict for its restricted counterpart.

The example continues in Listing 33.2, where we perform further testing to check whether (a) the income elasticity in the money demand equation is 1 (βy = 1) and (b) the Fisher relation is homogeneous (φ = 1). Since the --full switch was given to the initial restrict command, additional restrictions can be applied without having to repeat the previous ones. (The second script contains a few printf commands, which are not strictly necessary, to format the output nicely.) It turns out that both of the additional hypotheses are rejected by the data, with p-values of 0.002 and 0.004.

Listing 33.2: Further testing of money demand system

Input:

  restrict
    b[1,5] = -1
  end restrict
  ll_uie = $rlnl

  restrict
    b[2,3] = -1
  end restrict
  ll_hfh = $rlnl

  # replicate table 5, page 824
  printf "Testing zero restrictions in cointegration space:\n"
  printf "  LR-test, rank = 3: chi^2(3) = %6.4f [%6.4f]\n", 2*(ll0-ll1), pvalue(X, 3, 2*(ll0-ll1))
  printf "Unit income elasticity: LR-test, rank = 3:\n"
  printf "  chi^2(4) = %g [%6.4f]\n", 2*(ll0-ll_uie), pvalue(X, 4, 2*(ll0-ll_uie))
  printf "Homogeneity in the Fisher hypothesis:\n"
  printf "  LR-test, rank = 3: chi^2(4) = %6.3f [%6.4f]\n", 2*(ll0-ll_hfh), pvalue(X, 4, 2*(ll0-ll_hfh))

Output:

  Testing zero restrictions in cointegration space:
    LR-test, rank = 3: chi^2(3) = 1.4763 [0.6877]
  Unit income elasticity: LR-test, rank = 3:
    chi^2(4) = 17.2071 [0.0018]
  Homogeneity in the Fisher hypothesis:
    LR-test, rank = 3: chi^2(4) = 15.547 [0.0037]

Another type of test that is commonly performed is the "weak exogeneity" test. In this context, a variable is said to be weakly exogenous if all coefficients on the corresponding row in the α matrix are zero. If this is the case, that variable does not adjust to deviations from any of the long-run equilibria, and can be considered an autonomous driving force of the whole system.

The code in Listing 33.3 performs this test for each variable in turn, thus replicating the first column of Table 6 on page 825 of Brand and Cassola (2004). The results show that weak exogeneity might perhaps be accepted for the long-term interest rate and real GDP (p-values 0.07 and 0.08, respectively).

⁶Modulo what appear to be a few typos in the article.

Listing 33.3: Testing for weak exogeneity

Input:

  restrict
    a1 = 0
  end restrict
  ts_m = 2*(ll0 - $rlnl)

  restrict
    a2 = 0
  end restrict
  ts_p = 2*(ll0 - $rlnl)

  restrict
    a3 = 0
  end restrict
  ts_l = 2*(ll0 - $rlnl)

  restrict
    a4 = 0
  end restrict
  ts_s = 2*(ll0 - $rlnl)

  restrict
    a5 = 0
  end restrict
  ts_y = 2*(ll0 - $rlnl)

  loop foreach i m p l s y
    printf "Delta $i\t%6.3f [%6.4f]\n", ts_$i, pvalue(X, 6, ts_$i)
  endloop

Output:

  variable     LR test    p-value
  Delta m      18.111    [0.0060]
  Delta p      21.067    [0.0018]
  Delta l      11.819    [0.0661]
  Delta s      16.000    [0.0138]
  Delta y      11.335    [0.0786]
optimizer may end up at a local maximum (or, in the case of the switching algorithm, at a saddle point).

The solution, or lack thereof, may be sensitive to the initial value selected for θ. By default, gretl selects a starting point using a deterministic method based on Boswijk (1995), but two further options are available: the initialization may be adjusted using simulated annealing, or the user may supply an explicit initial value for θ.

The default initialization method is:

1. Calculate the unrestricted ML β̂ using the Johansen procedure.

2. If the restriction on β is non-homogeneous, use the method proposed by Boswijk:

     φ0 = −[(I_r ⊗ β̂⊥)′ H]⁺ (I_r ⊗ β̂⊥)′ h0    (33.9)

   where β̂⊥′β̂ = 0 and A⁺ denotes the Moore-Penrose inverse of A. Otherwise,

     φ0 = (H′H)⁻¹ H′ vec(β̂)    (33.10)

3. vec(β0) = Hφ0 + h0.

4. Calculate the unrestricted ML α̂ conditional on β0, as per Johansen:

     α̂ = S01 β0 (β0′ S11 β0)⁻¹    (33.11)

5. If α is restricted by vec(α) = Gψ, then ψ0 = (G′G)⁻¹ G′ vec(α̂) and vec(α0) = Gψ0.

Alternative initialization methods

As mentioned above, gretl offers the option of adjusting the initialization using simulated annealing. This is invoked by adding the --jitter option to the restrict command. The basic idea is this: we start at a certain point in the parameter space and, for each of n iterations (currently n = 4096), we randomly select a new point within a certain radius of the previous one, and determine the likelihood at the new point. If the likelihood is higher, we jump to the new point; otherwise, we jump with probability P (and remain at the previous point with probability 1 − P). As the iterations proceed, the system gradually "cools"; that is, the radius of the random perturbation is reduced, as is the probability of making a jump when the likelihood fails to increase.

In the course of this procedure, many points in the parameter space are evaluated, starting with the point arrived at by the deterministic method, which we'll call θ0. One of these points will be "best" in the sense of yielding the highest likelihood: call it θ*. This point may or may not have a greater likelihood than θ0. And the procedure has an end point, θn, which may or may not be "best". The rule followed by gretl in selecting an initial value for θ based on simulated annealing is this: use θ* if θ* ≠ θ0, otherwise use θn. That is, if we get an improvement in the likelihood via annealing, we make full use of it; on the other hand, if we fail to get an improvement, we nonetheless allow the annealing to randomize the starting point. Experiments indicate that the latter effect can be helpful.

Besides annealing, a further alternative is manual initialization. This is done by passing a predefined vector to the set command, with parameter initvals, as in

  set initvals myvec

The details depend on whether the switching algorithm or LBFGS is used. For the switching algorithm, there are two options for specifying the initial values. The more user-friendly one (for most people, we suppose) is to specify a matrix that contains vec(β) followed by vec(α). For example:

  open denmark.gdt
  vecm 2 1 LRM LRY IBO IDE --rc --seasonals

  matrix BA = {1, -1, 6, -6, -6, -0.2, 0.1, 0.02, 0.03}
  set initvals BA

  restrict
    b1 = 1
    b1 + b2 = 0
    b3 + b4 = 0
  end restrict

In this example (from Johansen, 1995) the cointegration rank is 1 and there are 4 variables. However, the model includes a restricted constant (the --rc flag), so that β has 5 elements. The α matrix has 4 elements, one per equation. So the matrix BA may be read as

  (β1, β2, β3, β4, β5, α1, α2, α3, α4)
The other option, which is compulsory when using LBFGS, is to specify the initial values in terms of the free parameters, φ and ψ. Getting this right is somewhat less obvious. As mentioned above, the implicit-form restriction R vec(β) = q has explicit form vec(β) = Hφ + h0, where H = R⊥, the right nullspace of R. The vector φ is shorter, by the number of restrictions, than vec(β). The savvy user will then see what needs to be done. The other point to take into account is that if α is unrestricted, the effective length of ψ is 0, since it is then optimal to compute α using Johansen's formula, conditional on β (equation 33.11 above). The example above could be rewritten as:

  open denmark.gdt
  vecm 2 1 LRM LRY IBO IDE --rc --seasonals

  matrix phi = {8, 6}
  set initvals phi

  restrict --lbfgs
    b1 = 1
    b1 + b2 = 0
    b3 + b4 = 0
  end restrict

In this more economical formulation, the initializer specifies only the two free parameters in φ (5 elements in β minus 3 restrictions). There is no call to give values for ψ, since α is unrestricted.

Scale removal

Consider a simpler version of the restriction discussed in the previous section, namely

  restrict
    b1 = 1
    b1 + b2 = 0
  end restrict

This restriction comprises a substantive, testable requirement (that β1 and β2 sum to zero) and a normalization or scaling, β1 = 1. The question arises: might it be easier and more reliable to maximize the likelihood without imposing β1 = 1?¹⁰ If so, we could record this normalization, remove it for the purpose of maximizing the likelihood, then reimpose it by scaling the result.

Unfortunately it is not possible to say in advance whether "scale removal" of this sort will give better results for any particular estimation problem. However, this does seem to be the case more often than not. Gretl therefore performs scale removal where feasible, unless you (a) explicitly forbid this, by giving the --no-scaling option flag to the restrict command, or (b) provide a specific vector of initial values, or (c) select the LBFGS algorithm for maximization.

Scale removal is deemed infeasible if there are any cross-column restrictions on β, or any non-homogeneous restrictions involving more than one element of β. In addition, experimentation has suggested to us that scale removal is inadvisable if the system is just identified with the normalizations included, so we do not do it in that case. By "just identified" we mean that the system would not be identified if any of the restrictions were removed. On that criterion the above example is not just identified, since the removal of the second restriction would not affect identification; and gretl would in fact perform scale removal in this case, unless the user specified otherwise.

¹⁰As a numerical matter, that is. In principle this should make no difference.

Chapter 34
Multivariate models

By a multivariate model we mean one that includes more than one dependent variable. Certain specific types of multivariate model for time-series data are discussed elsewhere: chapter 32 deals with VARs and chapter 33 with VECMs. Here we discuss two general sorts of multivariate model implemented in gretl via the system command: SUR systems (Seemingly Unrelated Regressions), in which all the regressors are taken to be exogenous and interest centers on the covariance of the error term across equations; and simultaneous systems, in which some regressors are assumed to be endogenous.

In this chapter we give an account of the syntax and use of the system command and its companions, restrict and estimate; we also explain the options and accessors available in connection with multivariate models.

34.1 The system command
The specification of a multivariate system takes the form of a block of statements, starting with system and ending with end system. Once a system is specified, it can be estimated via various methods, using the estimate command, with or without restrictions, which may be imposed via the restrict command.

Starting a system block

The first line of a system block may be augmented in either (or both) of two ways.

1. An estimation method is specified for the system. This is done by following system with an expression of the form method=estimator, where estimator must be one of ols (Ordinary Least Squares), tsls (Two-Stage Least Squares), sur (Seemingly Unrelated Regressions), 3sls (Three-Stage Least Squares), liml (Limited Information Maximum Likelihood) or fiml (Full Information Maximum Likelihood). Two examples:

  system method=sur
  system method=fiml

OLS, TSLS and LIML are, of course, single-equation methods rather than true system estimators; they are included to facilitate comparisons.

2. The system is assigned a name. This is done by giving the name first, followed by a back-arrow, "<-", followed by system. If the name contains spaces it must be enclosed in double quotes. Here are two examples:

  sys1 <- system
  "System 1" <- system

Note, however, that this naming method is not available within a user-defined function, only in the main body of a gretl script.

If the initial system line is augmented in the first way, the effect is that the system is estimated as soon as its definition is completed, using the specified method. The effect of the second option is that the system can then be referenced by the assigned name for the purposes of the restrict and estimate commands; in the gretl GUI an additional effect is that an icon for the system is added to the Session view. These two possibilities can be combined, as in

  mysys <- system method=3sls

In this example the system is estimated immediately via Three-Stage Least Squares, and is also available for subsequent use under the name mysys.

If the system is not named via the back-arrow mechanism, it is still available for subsequent use via restrict and estimate; in this case you should use the generic name $system to refer to the last-defined multivariate system.
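To make the naming mechanism concrete, here is a minimal sketch of ours, anticipating the equation statements described in the next subsection (the series y1, x1, y2 and x2 are made-up placeholders): the system is defined under a name, then estimated twice by different methods.

  sys1 <- system
    equation y1 const x1
    equation y2 const x2
  end system
  estimate sys1 method=ols
  estimate sys1 method=sur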
The body of a system block

The most basic element in the body of a system block is the equation statement, which is used to specify each equation within the system. This takes the same form as the regression specification for single-equation estimators, namely a list of series with the dependent variable given first, followed by the regressors, with the series given either by name or by ID number (order in the dataset). A system block must contain at least two equation statements, and for systems without endogenous regressors these statements are all that is required. So, for example, a minimal SUR specification might look like this:

  system method=sur
    equation y1 const x1
    equation y2 const x2
  end system

For simultaneous systems it is necessary to determine which regressors are endogenous and which exogenous. By default, all regressors are treated as exogenous, except that any variable that appears as the dependent variable in one equation is automatically treated as endogenous if it appears as a regressor elsewhere. However, an explicit list of endogenous regressors may be supplied following the equations lines: this takes the form of the keyword endog followed by the names or ID numbers of the relevant regressors.

When estimation is via TSLS or 3SLS, it is possible to specify a particular set of instruments for each equation. This is done by giving the equation lists in the format used with the tsls command: first the dependent variable, then the regressors, then a semicolon followed by the instruments, as in

  system method=3sls
    equation y1 const x11 x12 ; const x11 z1
    equation y2 const x21 x22 ; const x21 z2
  end system

An alternative way of specifying instruments is to insert an extra line starting with instr, followed by the list of variables acting as instruments. This is especially useful for specifying the system with the equations keyword (see the following subsection). As in tsls, any regressors that are not also listed as instruments are treated as endogenous; so in the example above x11 and x21 are treated as exogenous, while x12 and x22 are endogenous, and instrumented by z1 and z2 respectively.

One more sort of statement is allowed in a system block: the keyword identity, followed by an equation that defines an accounting relationship (rather than a stochastic one) between variables. For example,

  identity Y = C + I + G + X

There can be more than one identity in a system block. But note that these statements are specific to estimation via FIML; they are ignored for other estimators.

34.2 Equation systems within functions

It is also possible to define a multivariate system in a programmatic way. This is useful if the precise specification of the system depends on some input parameters that are not known in advance, but are given when the script is actually run.

The relevant syntax is given by the equations keyword (note the plural), which replaces the block of equation lines in the standard form. This keyword must be followed by two arguments. The first is a named list containing all series on the left-hand side of the system; this determines the number of equations in the system. The nature of the second argument depends on whether or not the list of regressors is in common for all equations (as in SUR):

• Common regressors: a second named list.
• Differing regressors: an array of lists, one per equation.

The first case is straightforward; the second requires a little more explanation. Suppose we have a two-equation system, with regressors given by the lists xlist1 and xlist2. We can then define a suitable array as follows:

  lists Xlists = defarray(xlist1, xlist2)

See section 11.8 for alternative ways of building an array. Specifying a system generically in this way therefore just involves building the necessary list arguments, as shown in the following example:

  open denmark
  list LHS = LRM LRY
  list RHS1 = const LRM(-1) IBO(-1) IDE(-1)
  list RHS2 = const LRY(-1) IBO(-1)
  lists RHS = defarray(RHS1, RHS2)

  system method=ols
    equations LHS RHS
  end system

As mentioned above, the option of assigning a specific name to a system is not available within functions, but the generic identifier $system can be used to similar effect. The following example illustrates how one can define a system, estimate it via two methods, apply a restriction, then re-estimate it subject to the restriction.

  function void anonsys (series x, series y)
    system
      equation x const
      equation y const
    end system
    estimate $system method=ols
    estimate $system method=sur
    restrict $system
      b[1,1] - b[2,1] = 0
    end restrict
    estimate $system method=ols
  end function
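Calling such a function is then just a matter of supplying two series; a hypothetical usage, reusing the denmark data opened above:

  open denmark
  anonsys(LRM, LRY)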
The above account has to be qualified for the case where a system is set up for estimation via TSLS or 3SLS, using a specific list of instruments per equation, as described in section 34.1. In that case it is possible to include more endogenous regressors than explicit equations (although, of course, there must be sufficient instruments to achieve identification). In such systems, endogenous regressors that have no associated explicit equation are treated "as if" exogenous when constructing the structural-form matrices. This means that forecasts are conditional on the observed values of the "extra" endogenous regressors, rather than solely on the values of the exogenous and predetermined variables.

Gretl does not provide a native command for generating simulated data from a multi-equation system, but this is relatively easily accomplished by means of scripting. Listing 34.1 gives an example based on a 3-variable system.¹ All equations contain lagged endogenous variables, but the equation for consumption at time t also contains income at time t as an explanatory variable. This makes the system simultaneous, so we use FIML as the estimation method.

Once the system is estimated, we store its results to a bundle named sys, so as to make it easier to retrieve certain quantities used in the remainder of the script. First, we compute the reduced-form matrices by using the Gamma, A and B bundle elements. Of course, simulation needs values for the exogenous variables, which are easy to create in a system such as this, where all the exogenous variables are deterministic. The simulation horizon is set, for this example, at 12 periods.

Subsequently, structural-form disturbances are drawn randomly from a multivariate normal distribution with mean 0 and variance equal to the estimated covariance matrix Σ̂, available as the sigma element of the sys bundle. These are then mapped to reduced-form innovations via the relationship v_t = Γ⁻¹ ε_t. Finally, all these ingredients are combined to produce the simulated values with the varsimul function. Note that initial values for the VAR recursion are taken from the latest available data.

Running the script should produce the following set of simulated values:

  Sim (14 x 3)

  13.887   12.874   14.508
  13.889   12.877   14.515
  13.893   12.880   14.520
  13.895   12.885   14.518
  13.895   12.894   14.517
  13.900   12.902   14.520
  13.907   12.908   14.525
  13.917   12.910   14.534
  13.920   12.911   14.539
  13.919   12.906   14.547
  13.934   12.910   14.567
  13.935   12.908   14.575
  13.942   12.913   14.581
  13.944   12.916   14.583

¹Note: the system of equations that is being estimated here is not meant to stand for a realistic model of the European economy. It is just set up in such a way as to provide a simple example.

Listing 34.1: Simulation from a simultaneous equation system

  set verbose off
  set seed 131020

  # load the data and generate the variables
  open AWM18.gdt --quiet
  Con = log(PCR)
  Inv = log(GCR)
  Inc = log(YER)
  list EXO = const time

  # estimate the system via FIML
  system method=fiml
    equation Con EXO Con(-1) Inc(0 to -1)
    equation Inv EXO Inv(-1) Inc(-1)
    equation Inc EXO Inc(-1 to -2) Inv(-1)
  end system
  bundle sys = $system # save the estimated system to a bundle

  # compute the reduced form VAR representation
  matrix iG = inv(sys.Gamma)
  matrix rfA = iG * sys.A
  matrix rfB = iG * sys.B

  # produce the simulation
  scalar horizon = 12

  # retrieve a few magnitudes from the estimated system
  scalar g = sys.neqns       # number of equations
  scalar p = cols(sys.A) / g # maximum lag

  # future values of the exogenous variables
  matrix SimExo = ones(horizon, 1) ~ seq($nobs + 1, $nobs + horizon)'
  matrix X = SimExo * rfB'

  # simulated disturbances
  E = mnormal(horizon, g) * cholesky(sys.sigma)' # structural form
  V = E * iG'                                    # reduced form

  # initial values
  list ENDO = Con Inv Inc
  matrix init = {ENDO}[$nobs-p+1:$nobs,]

  # perform simulation
  Sim = varsimul(rfA, X + V, init)
  print Sim

Chapter 35
Forecasting

35.1 Introduction

In some econometric contexts forecasting is the prime objective: one wants estimates of the future values of certain variables, to reduce the uncertainty attaching to current decision making. In other contexts, where real-time forecasting is not the focus, prediction may nonetheless be an important moment in the analysis. For example, out-of-sample prediction can provide a useful check on the validity of an econometric model. In other cases we are interested in questions of "what if": for example, how might macroeconomic outcomes have differed over a certain period if a different policy had been pursued? In the latter cases "prediction" need not be a matter of actually projecting into the future, but in any case it involves generating fitted values from a given model. The term "postdiction" might be more accurate, but it is not commonly used; we tend to talk of "prediction" even when there is no true forecast in view.

This chapter offers an overview of the methods available within gretl for forecasting or prediction (whether forward in time or not) and explicates some of the finer points of the relevant commands.

35.2 Saving and inspecting fitted values

In the simplest case, the "predictions" of interest are just the (within-sample) fitted values from an econometric model. For the single-equation linear model, y_t = X_t β + u_t, these are ŷ_t = X_t β̂. In command-line mode, the ŷ series can be retrieved, after estimating a model, using the accessor $yhat, as in

  series yh = $yhat

If the model in question takes the form of a system of equations, $yhat returns a matrix, each column of which contains the fitted values for a particular dependent variable. To extract the fitted series for, e.g., the dependent variable in the second equation, do

  matrix Yh = $yhat
  series yh2 = Yh[,2]

Having obtained a series of fitted values, you can use the fcstats function to produce a vector of statistics that characterize the accuracy of the predictions (see section 35.4 below).
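For example, a minimal sketch (assuming a series y and a regressor list xlist have already been defined):

  ols y 0 xlist
  series yh = $yhat
  matrix st = fcstats(y, yh) # ME, RMSE, MAE and further accuracy measures
  print st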
The gretl GUI offers several ways of accessing and examining within-sample predictions. In the model display window, the Save menu contains an item for saving fitted values, the Graphs menu allows plotting of fitted versus actual values, and the Analysis menu offers a display of actual, fitted and residual values.

35.3 The fcast command

The fcast command (and its equivalent GUI invocation; see below) generates predictions based on the last estimated model. Several questions arise here: How to control the range over which predictions are generated? How to control the forecasting method, where a choice is available? How to control the printing and/or saving of the results? Basic answers can be found in the Gretl Command Reference; we add some more details here.

The forecast range

The range defaults to the currently defined sample range. If this remains unchanged following estimation of the model in question, the forecast will be "within sample" and (with some qualifications noted below) it will essentially duplicate the information available via the retrieval of fitted values (see section 35.2 above).

A common situation is that a model is estimated over a given sample, and then forecasts are wanted for a subsequent, out-of-sample range. The simplest way to accomplish this is via the --out-of-sample option to fcast. For example, assuming we have a quarterly time-series dataset containing observations from 1980:1 to 2008:4, four of which are to be reserved for forecasting:

  # reserve the last 4 observations
  smpl 1980:1 2007:4
  ols y 0 xlist
  fcast --out-of-sample

This will generate a forecast from 2008:1 to 2008:4.

There are two other ways of adjusting the forecast range, offering finer control:

• Use the smpl command to adjust the sample range prior to invoking fcast.

• Use the optional startobs and endobs arguments to fcast, which should come right after the command word. These values set the forecast range independently of the sample range.
What if one wants to generate a true forecast that goes beyond the available data? In that case, one can use the dataset command with the addobs parameter, to add extra observations before forecasting. For example:

  # use the entire dataset, which ends in 2008:4
  ols y 0 xlist
  dataset addobs 4
  fcast 2009:1 2009:4

But this will work as stated only if the set of regressors in xlist does not contain any stochastic regressors other than lags of y. The dataset addobs command attempts to detect and extrapolate certain common deterministic variables (e.g., time trend, periodic dummy variables). In addition, lagged values of the dependent variable can be supported via a dynamic forecast (see below for discussion of the static/dynamic distinction). But "future" values of any other included regressors must be supplied before such a forecast is possible. Note that specific values in a series can be set directly by date, for example:

  x1[2009:1] = 120.5

Or, if the assumption of "no change" in the regressors is warranted, one can do something like this:

  loop t=2009:1..2009:4
    loop foreach i xlist
      $i[t] = $i[2008:4]
    endloop
  endloop

In single-equation OLS models, a recursive forecast option is also available, expanding the estimation sample one by one and recalculating the forecasts again and again, for the constantly updated information set. In this case, a number must be given for how many periods ahead should be forecast for each of the estimation samples. Note that only this k-steps-ahead forecast will be printed (or accessible in $fcast), not the interim values from step 1 through k − 1 (if k > 1). If those interim values are also needed, then several fcast --recursive rounds would have to be done, with different steps-ahead numbers.

Static and dynamic forecasts

The distinction between static and dynamic forecasts applies only to dynamic models, i.e. those that feature one or more lags of the dependent variable. The simplest case is the AR(1) model,

  y_t = α0 + α1 y_{t−1} + ε_t    (35.1)

In some cases the presence of a lagged dependent variable is implicit in the dynamics of the error term; for example,

  y_t = β + u_t
  u_t = ρ u_{t−1} + ε_t

which implies that

  y_t = (1 − ρ)β + ρ y_{t−1} + ε_t

Suppose we want to forecast y for period s using a dynamic model, say (35.1) for example. If we have data on y available for period s − 1, we could form a fitted value in the usual way: ŷ_s = α̂0 + α̂1 y_{s−1}. But suppose that data are available only up to s − 2. In that case we can apply the chain rule of forecasting:

  ŷ_{s−1} = α̂0 + α̂1 y_{s−2}
  ŷ_s = α̂0 + α̂1 ŷ_{s−1}

This is what is called a dynamic forecast. A static forecast, on the other hand, is simply a fitted value, even if it happens to be computed out-of-sample.
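The two variants can be requested explicitly via option flags; a minimal sketch of ours, in which the quarterly dates and the series y are placeholders:

  # assume quarterly data through 2008:4; reserve two years
  smpl 1980:1 2006:4
  ols y const y(-1)
  fcast --out-of-sample --dynamic # applies the chain rule
  fcast --out-of-sample --static  # one-step-ahead fitted values, using actual lags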
Printing, displaying and saving forecasts

When working from the GUI, the way to perform and access forecasts is to first estimate a model (with some inherently dynamic features) and then, in the model window, navigate to the Forecasts entry in the Analysis menu. If some out-of-sample observations are already available (see above), a dialog window is presented where the discussed forecasting options can be chosen by pointing and clicking. Executing the forecasts then automatically yields two result windows: one with a time-series plot of the forecasts, along with their confidence bands if those were chosen, and another one with tabular output. The produced plot can be saved to the current session or exported, like any other plot in gretl, by right-clicking. Notice that in the textual result window there is a button at the top which offers to save the point forecasts and their standard errors as new series to the active dataset.

In a command-line context, the fcast command automatically prints out the tables with the produced forecasts, their standard errors and the associated confidence intervals, unless you wish to suppress this verbose output with the options --stats-only or --quiet. The former option restricts output to the forecast evaluation statistics, as explained in the next section; the latter option silences output altogether. Another accepted syntax variant is to supply the name of a new series for the point forecasts after the fcast command, as for example in

  fcast Yfc --out-of-sample

At the same time, this also suppresses the printout.

Accessing and saving the produced forecast time series, along with the estimated standard errors, also works through the $fcast and $fcse accessors, available after fcast execution. These return vectors as gretl matrix objects, not series, so if you want to add the results to the dataset in this way, you would have to set the active sample to the forecast range first. You can, of course, first access and store the matrices, and then later, after resetting the sample, assign them to series. Note that the estimated standard errors do not incorporate parameter uncertainty in the case of dynamic models.

But if you want to create forecast plots within a script, the relevant option already has to be appended to the fcast command. As explained in the command reference, specify --plot=filename.

  For 95% confidence intervals, t(140, 0.025) = 1.977

          LHUR      prediction  std. error       95% interval

  1999:1  4.300000  4.335004    0.222784    3.894549 - 4.775460
  1999:2  4.300000  4.312724    0.401960    3.518028 - 5.107421
  1999:3  4.233333  4.272764    0.539582    3.205982 - 5.339547
  1999:4  4.100000  4.223213    0.642001    2.953943 - 5.492482

  Forecast evaluation statistics

  Mean Error                      -0.052593
  Root Mean Squared Error          0.067311
  Mean Absolute Error              0.052593
  Mean Percentage Error           -1.2616
  Mean Absolute Percentage Error   1.2616
  Theil's U2                       0.87334
  Bias proportion, UM              0.61049
  Regression proportion, UR        0.29203
  Disturbance proportion, UD       0.097478

          INFL      prediction  std. error       95% interval

  1999:1  1.651245  1.812250    0.431335    0.959479 - 2.665022
  1999:2  2.048545  2.088185    0.777834    0.550366 - 3.626004
  1999:3  2.298952  2.266445    1.075855    0.139423 - 4.393467
  1999:4  2.604836  2.610037    1.409676   -0.176969 - 5.397043

  Forecast evaluation statistics

  Mean Error                      -0.043335
  Root Mean Squared Error          0.084525
  Mean Absolute Error              0.059588
  Mean Percentage Error           -2.6178
  Mean Absolute Percentage Error   3.3248
  Theil's U2                       0.095932
  Bias proportion, UM              0.26285
  Regression proportion, UR        0.45311
  Disturbance proportion, UD       0.28404

One of the main differences is that specifying a variable name after the fcast command does not mean that something is saved under that name; instead, it serves to pick one of the N variables of the VAR for printing out the forecasts. That leaves only the $fcast and $fcse accessors to obtain and save the produced forecasts; in this system case, the returned matrix objects will have as many columns as equations.

In the GUI, the relevant menu entry is again Forecasts, in the Analysis menu in the window of the estimated VAR model. Here the user must pick the variable of interest, after which a dialog window with the relevant options is presented. As in the single-equation context, a plot window and a textual output window are created. Again, forecast series can be added to the dataset through the button, and the plot can be saved or exported.
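In script mode, the same system forecasts can be captured via the accessors; a minimal sketch (assuming a bivariate VAR on the two series shown above, with four out-of-sample observations available in the dataset):

  var 4 LHUR INFL
  fcast 1999:1 1999:4 LHUR # prints the forecasts for LHUR only
  matrix F = $fcast        # point forecasts, one column per equation
  matrix SE = $fcse        # corresponding standard errors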
Special VAR cases: exogenous variables, cointegration

It may be worth noting that when a VAR is specified with additional non-deterministic exogenous regressors, a similar issue arises as with single equations: the forecast is conditional, and requires some assumptions about the development of those regressors out of sample. As before, these values can easily be filled in after the dataset has been extended with the observations for the forecasting sample; but naturally, only the user, not gretl, can and must decide what those values should be. This includes hand-crafted deterministic variables like shift dummies; on the other hand, standard deterministic terms like trends and seasonals will be extrapolated by gretl automatically.

Using a cointegrated VAR model, with gretl's vecm command, does not change the way a forecast is obtained afterwards. The VECM can be internally represented as a VAR in levels that automatically contains the reduced-rank restrictions of cointegration, and this VAR form is then used to calculate the forecasts. Providing forecast standard errors and the associated confidence bands is also straightforward, since only the innovation uncertainty is captured in those. This ease of use also carries over to the situation where a VECM with additional exogenous terms is used for forecasting, provided that future values of the exogenous variables are specified, of course.

35.6 Forecasting from simultaneous systems

To be interesting for a forecasting application, a simultaneous-equation system must be dynamic, including some lags of endogenous variables as regressors. (Otherwise we would be conducting a scenario analysis, purely conditional on assumed exogenous developments.) For the following discussion we therefore presuppose that we are dealing with such a dynamic system. The difference between such a model, set up with gretl's system block, and a VAR system then concerns mainly two aspects.

First, a VAR model is already given as a so-called reduced form and, as such, is ready to be used for forward simulation (aka forecasting). In contrast, a simultaneous system can come in a structural form, with some contemporaneous endogenous variables as regressors in the equations; the future values of those regressors are unknown, however. Second, a plain VAR is estimated by OLS, whereas a simultaneous system can be estimated with different methods, for reasons of efficiency.

Neither of these differences presents any deep challenge for forecasting, however:

• As explained at the end of the previous chapter on multivariate models (see the subsection titled "Structural and reduced forms"), it is easy to obtain the reduced form of any such simultaneous equation system. This reduced form is used by gretl to simulate the system forward in time, just as with a VAR model. The slight complication for computing the forecast variances is merely that the estimated error term ε_t from the structural form must be mapped to the reduced-form innovations v_t, using the inverse of the estimated structural relations matrix Γ. This is automatically taken into account.

• The estimation method through which the coefficient values of the system are determined does not matter for forecasting. The prediction algorithm can simply take these point estimates as given, use them for calculating the associated reduced form, and use that representation to iterate the model forward over the desired forecasting horizon. It should nonetheless be obvious that different estimators entail different forecast values.

As a consequence of these considerations, the way to handle forecasts from simultaneous systems in gretl is exactly as discussed before in the context of VARs (possibly with exogenous regressors). This applies to the command-line interface as well as the GUI.

Chapter 36
State Space Modeling

36.1 Introduction
This chapter describes the handling of linear state space models in gretl 2022b and higher.¹ Here is a brief, high-level overview of gretl's Kalman apparatus. To obtain a Kalman structure, in the form of a bundle, you use the ksetup function. Having obtained such a bundle, you can then adjust its contents, as described in detail below. You then do things with your state space model via the functions kfilter (forecasting), ksmooth (state smoothing) and/or kdsmooth (disturbance smoothing).

36.2 Notation

In this document, our basic representation of a state space model is given by the following pair of equations:

  y_t = Z_t α_t + ε_t    (36.1)
  α_{t+1} = T_t α_t + η_t    (36.2)

where (36.1) is the observation or measurement equation and (36.2) is the state transition equation. The state vector, α_t, is (r × 1) and the vector of observables, y_t, is (n × 1). The (n × 1) vector ε_t and the (r × 1) vector η_t are assumed to be vector Gaussian white noise:

  E(ε_t ε_s′) = Σ_t for t = s, otherwise 0
  E(η_t η_s′) = Ω_t for t = s, otherwise 0

The number of time-series observations is denoted by N. In the case where Z_t = Z, T_t = T, Σ_t = Σ and Ω_t = Ω for all t, the model is said to be time-invariant. We assume time-invariance in much of what follows, but discuss the time-varying case, along with other extensions of the basic model, in section 36.9.

36.3 Defining the model as a bundle

The ksetup function is used to initialize a state space model, by specifying only its indispensable elements: the observables and their link to the unobserved state vector, plus the law of motion for the latter and the covariance matrix of its innovations. Therefore, the function takes a minimum of four arguments. The corresponding bundle keys are as follows:

  Symbol   Dimensions   Reserved key
  y        N × n        obsy
  Z        n × r        obsymat
  T        r × r        statemat
  Ω        r × r        statevar

Please note: the matrix Z in the observation equation must be given in transposed form. This is required to preserve compatibility with gretl versions prior to 2022a. Correspondingly, if you retrieve this matrix using its key, obsymat, it's the transpose you actually obtain.

The names of these input matrices don't matter; in fact, they may be anonymous matrices constructed on the fly. But if and when you wish to copy them out of the bundle, you must use the specified keys, as in

  matrix Z = SSmod.obsymat
  matrix T = SSmod.statemat

Although all the arguments are in principle matrices, as a convenience you may give obsy as a series or list of series, and the other arguments can be given as scalars if in context they are 1 × 1.

If applicable, you may specify any of the following optional input matrices:²

  Symbol   Dimensions   Key        If omitted
  Σ        n × n        obsvar     no disturbance term in observation equation
  α0       r × 1        inistate   α0 is a zero vector
  P0       r × r        inivar     P0 is set automatically

These matrices are not passed to ksetup; rather, you add them to the bundle returned by ksetup, under their reserved keys, as you usually add elements to a bundle, for example:

  SSmod.obsvar = Veps

Naturally, the arguments you pass to ksetup must have mutually compatible dimensions, otherwise an error is returned. Once setup is complete, the dimensions of the model (r, n and N) become available as scalar members of the bundle, under their own names.

¹The user interface was substantially different prior to version 2017a. For example, be aware that Lucchetti (2011) is based on the old syntax. If anyone needs documentation for the original interface, it can be found at http://gretl.sourceforge.net/papers/kalman_old.pdf. Additional functionality relating to exact diffuse initialization of the Kalman filter was added in version 2022b.

²Additional optional matrices are described in section 36.9 below.
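By way of illustration, here is a minimal local-level model of our own devising: a scalar observable equal to a random-walk state plus noise, so that all four required arguments are 1 × 1 and can be given as scalars (the series y and the variance values are assumptions for the sketch):

  # local level model: y(t) = a(t) + e(t),  a(t+1) = a(t) + h(t)
  bundle SSmod = ksetup(y, 1, 1, 0.5) # obsy, Z', T, Omega
  SSmod.obsvar = 1                    # variance of the observation noise
  SSmod.diffuse = 2                   # exact diffuse initialization (see below)
  scalar err = kfilter(&SSmod)        # run the forward pass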
In case inivar is not specified, the matrix P_{1|0} will be automatically initialized by gretl only if all the eigenvalues of T lie inside the unit circle and the model is stationary. In this case, the variance for the marginal distribution of α_t is well defined, and the initializer is computed using

  vec(P_{1|0}) = [I − T ⊗ T]⁻¹ vec(Ω)

If the above condition is not satisfied, you will have to make a choice of which technique to use for diffuse initialization. In Section 36.8 we provide a fuller discussion of the various options, but here's what is probably the bottom line for many users. In earlier versions of gretl, a rather crude solution was adopted: initializing P_{1|0} to a numerically large matrix. This was accomplished by setting a value of 1 on the bundle, under the reserved key diffuse. From gretl version 2022b on, if you have scripts where you set diffuse=1 on your Kalman bundle, you can now try diffuse=2 instead. This invokes the new "exact initial" method for state space models with a diffuse initializer. Don't expect identical results from the new code; but to the extent results differ, the new ones should be somewhat more accurate. (If results differ wildly, you've probably found a bug; please report it.) You may also find that the new code is faster: it should be less likely to get hung up on numerical problems that delay or prevent convergence of ML estimation.

in which case P_t is not even defined, or simply out of lack of information. In that case there are two possible approaches. The traditional one, used by gretl up to version 2022a, is to ascribe a very large variance to the initial P_t, as in

  P0 = κ I_r

where κ is, say, 10⁷. You can impose this sort of diffuse prior by setting

  SSmod.diffuse = 1

In some cases this strategy may lead to numerical problems. It may then be helpful to specify a diffuse initializer via inivar, using a somewhat smaller value of κ, as in

  SSmod.inivar = 1.0e5 * I(stdim)

where stdim is the dimension of the state.

While the κI approach works fairly well in many cases, it is nowadays generally deprecated in favor of one or other "exact initial" method. Such methods depend on derivation of the properties of the Kalman filter and smoother in the limit, as the aforementioned "very large variance" tends to infinity. In libgretl we have implemented two such methods: the "univariate approach to multivariate observable" advocated by Durbin and Koopman (2012), and the "augmented Kalman" method set out by de Jong (1991) and de Jong and Chu-Chun-Lin (2003).³ We'll refer to them via the labels univariate and dejong, respectively.

Exact diffuse methods

The univariate approach handles a vector observable by unpacking it, and substituting scalar calculations for matrix ones so far as possible. Durbin and Koopman claim it is faster than the alternatives. It is also able to deal in a straightforward way with "incomplete" observations (where some, but not all, elements of y_t are missing at time t): it can utilize any non-missing elements while ignoring the missing ones. However, it runs into complications if (a) the variance matrix of the observation disturbances is not diagonal, and/or (b) the disturbances are correlated between the state and observation equations. Case (a) can be handled at the cost of some extra preliminary computation (transforming y and Z to induce a diagonal variance matrix), and this is automatically carried out by gretl if needed. Handling case (b) is more bothersome, requiring augmentation of the state; at present this is not supported in gretl.

³The first of these is used in the KFAS package for R (Helske, 2017), and the second by the sspace command in Stata. See https://www.stata.com/manuals/tssspace.pdf.
The dejong approach has no problem with the variance cases (a) and (b) mentioned above. However, it's not clear how incomplete observations can be handled, and at present observations with any missing elements are ignored.

In short, there are cases where univariate may work best, and other cases that are not handled by univariate but where dejong works fine. Hence our decision to implement both methods. Table 36.1 sets out the various cases that arise via combination of code ("legacy" indicates the Kalman code as of gretl 2022a) and diffuse status, i.e. whether the model is diffuse and, if so, how it is handled.

                 non-diffuse   κ-diffuse   exact diffuse
    code         diffuse=0     diffuse=1   diffuse=2
    legacy       1             2
    univariate   4             5           6
    dejong       7             8           9

    Table 36.1: Cross-tabulation of codepath and diffuse status. Numbers
    in cells are used for reference in the text; "legacy" indicates gretl
    2022a or earlier.

Note that although the primary virtue of univariate and dejong is their handling of the exact diffuse case, these methods can handle the non-diffuse case and the traditional κ-diffuse case. The case used depends on various points, the primary one being the diffuse integer member of the state space bundle, which defaults to 0 but can be set to 1 or 2:

- diffuse=0: case 1 is the default, for backward compatibility, but case 4 or 7 can be selected by adding univariate=1 or dejong=1 to the bundle.
- diffuse=1: case 2 is the default, but case 5 or 8 can be selected as above.
- diffuse=2: the default is 6, but can be switched to 9 via dejong=1.

For cases in the same column (namely 1/4/7, 2/5/8 and 6/9), results from kfilter, ksmooth and kdsmooth should in principle be the same across the codepaths, but in practice there are bound to be slight differences due to the different algorithms employed. And note that slight differences at that level may be somewhat amplified by iterated filtering, as in ML estimation.

36.9 Extensions and refinements

Regressors in the observation equation

The observation equation (36.1) can be augmented to allow for the effect of a k-vector of observable exogenous variables, x_t, in addition to that of the unobserved state, as in

    y_t = B_t x_t + Z_t α_t + ε_t

This specification can be added to a bundle previously obtained via ksetup by use of the keys obsx (for x) and obsxmat (for B). In that case obsx must be an N x k matrix and B must be n x k. But please note: as with the case of Z described above, backward compatibility dictates that obsxmat be given in transposed form.

An exception to this dimensionality rule is granted for convenience: if the observation equation includes a constant but no additional exogenous variables, you can give B as n x 1 without having to specify obsx. More generally, if the column dimension of B is 1 greater than k, it is assumed that the first element of B is associated with an implicit column of ones.

Intercept in the state equation

In some applications it may be useful to have an intercept in the state transition equation, thus generalizing equation (36.2) to

    α_{t+1} = µ_t + T_t α_t + η_t

The term µ is never strictly necessary; the system (36.1) and (36.2) can absorb it as an extra, non-time-varying element in the state vector. However, this comes at the cost of expanding all the matrices that touch the state (α, T, η, Ω, Z), making the model relatively awkward to formulate and forecasts more expensive to compute. We therefore adopt the convention above on practical grounds. The (r x 1) vector µ can be added to a bundle under the key stconst. Despite its name, this matrix can be specified as time-varying, as explained in the next section.
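As a concrete illustration, the following fragment is a sketch (with made-up coefficient values, a univariate y and a scalar state) of how these keys might be attached; recall that obsxmat, like obsymat, is stored in transposed form.

    # hypothetical univariate example: two exogenous regressors plus
    # an intercept in the state equation
    list Xlist = x1 x2                # series assumed to exist in the dataset
    bundle kb = ksetup(y, 1, 0.9, 0.5)
    kb.obsx = {Xlist}                 # N x 2 matrix of regressors
    kb.obsxmat = {0.5; -0.2}          # B in transposed (k x n) form; values invented
    kb.stconst = {0.1}                # mu: intercept in the state equation (r = 1)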
Time-varying matrices

Any or all of the matrices obsymat, obsxmat, obsvar, statemat, statevar and stconst may be time-varying. In that case you must supply the name of a function to be called to update the matrix or matrices in question; you add this to the bundle as a string, under the key timevar_call.[4] For example, if just obsymat (Z_t) should be updated by a function named TVZ, you would write

    SSmod.timevar_call = "TVZ"

The function that plays this role will be called at each time-step of the filtering or simulation operation, prior to performing any calculations. It should have a single bundle-pointer parameter, by means of which it will be passed a pointer to the Kalman bundle to which the call is attached. Its return value, if any, will not be used, so generally it returns nothing (is of type void). However, you can use gretl's funcerr keyword to raise an error if things seem to be going wrong; see chapter 14 for details.

Besides the bundle members noted above, a time variation function has access to the current (1-based) time step under the reserved key t, and the n-vector containing the forecast error from the previous time step, v_{t-1}, under the key uhat (when t = 1 the latter will be a zero vector). If any additional information is needed for performing the update, it can be placed in the bundle under a user-specified key. So, for example, a simple updater for a 1 x 1 Z matrix might look like this:

    function void TVZ (bundle *b)
        b.obsymat = b.Zvals[b.t]
    end function

where b.Zvals is a bundled N-vector. An updater that operates on both Z (n x r) and T (r x r) might be

    function void update2 (bundle *b)
        b.obsymat = mshape(b.Zvals[b.t,], b.r, b.n)
        b.statemat = unvech(b.Tvals[b.t,]')
    end function

where in this case we assume that b.Zvals is N x rn, with row t holding the transposed vec of Z_t, and b.Tvals is N x r(r+1)/2, with row t holding the vech of T_t.

Simpler variants (e.g. just one element of the relevant matrix is changed) and more complex variants, say involving some sort of conditionality, are also possible in this framework.

It is worth noting that this setup lends itself to a much wider scope than time-varying system matrices. In fact, this syntax allows for the possibility of executing user-defined operations at each step. The function that goes under timevar_call can read all the elements of the model bundle, and can modify several of them: the system matrices (which can therefore be made time-varying) as well as the user-defined elements. An extended example of use of the time-variation facility is presented in section 36.12.

[4] The choice of the name for the function itself is, of course, totally up to the user.
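Putting the pieces together, attaching the apparatus might look like the sketch below; the per-period values in Zvals are invented for illustration, and TVZ is the updater defined above.

    # hypothetical: time-varying 1 x 1 Z, using TVZ as defined above
    bundle kb = ksetup(y, 1, 0.9, 0.5)
    matrix kb.Zvals = muniform(kb.N, 1)   # made-up per-period values for Z_t
    kb.timevar_call = "TVZ"
    err = kfilter(&kb)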
Cross-correlated disturbances

The formulation given in equations (36.1) and (36.2) assumes mutual independence of the disturbances in the state and observation equations, ε_t and η_t. This assumption holds good in many practical applications, but in some cases one may wish to allow for cross-correlation.

More generally, we note three common representations of the variance of the disturbances in (36.1) and (36.2):

1. The basic representation: ε_t and η_t are assumed to be mutually uncorrelated, and we write their respective (possibly time-varying) variance matrices as V(ε_t) (n x n) and V(η_t) (r x r).

2. The de Jong representation: write ε_t = G_t ν_t and η_t = H_t ν_t, where G_t is n x p, H_t is r x p, and p is the length of the underlying disturbance vector ν_t. This formulation allows for correlation of the disturbances across the equations, if H_t G_t' is non-zero.

3. The Durbin-Koopman representation: as in the first case, assume that the disturbances are uncorrelated across the equations, but write η_t = R_t ξ_t and V(η_t) = R_t Q_t R_t', where R_t is a "selection" matrix and Q_t = V(ξ_t). Let m ≤ r denote the dimension of ξ_t; then Q_t is m x m and R_t is r x m. This allows for the possibility that there are fewer disturbances to the state than elements of the state vector.

With the de Jong representation, in place of (36.1)-(36.2) we may write

    y_t = Z_t α_t + G_t ν_t
    α_{t+1} = T_t α_t + H_t ν_t

In that case we may re-express the variance matrices from section 36.2 above as

    Σ_t = G_t G_t'
    Ω_t = H_t H_t'

with the addition of

    Cov(η_t, ε_t) = H_t G_t'

You can select the de Jong or Durbin-Koopman representation by supplying extra arguments to the ksetup function. For the de Jong version, in place of giving Ω you should give the two matrices identified above as H and G, as in

    bundle SSxmod = ksetup(y, Z, T, H, G)

and in case you wish to retrieve or update information on the variance of the disturbances, note that in the cross-correlated case the bundle keys statevar and obsvar are taken to designate the factors H and G, respectively.

To select the Durbin-Koopman representation, a sixth (boolean) argument must be used. If that has a non-zero value, statevar is taken to be Q and the fifth argument is taken to be R. Note that in this case obsvar should be added separately, as in the basic case. The following statements illustrate the three cases:

    # basic
    bundle kb1 = ksetup(y, Z, T, Veta)
    kb1.obsvar = Veps # if wanted

    # de Jong
    bundle kb2 = ksetup(y, Z, T, H, G)

    # Durbin-Koopman
    bundle kb3 = ksetup(y, Z, T, Q, R, 1)
    kb3.obsvar = Veps # if wanted

36.10 The ksimul function

This simulation function has as its required arguments a pointer to a Kalman bundle and a matrix containing artificial disturbances, and it returns a matrix of simulation results. An optional trailing Boolean argument is supported, the purpose of which is explained below.

If the disturbances are not cross-correlated, the matrix argument must be either N x r, if there is no disturbance in the observation equation, or N x (r + n) if the Σ (obsvar) matrix is specified. Row t holds either η_t' or (η_t', ε_t'). Note that if Ω (statevar) is not simply an identity matrix you will have to scale the artificial state disturbances appropriately; the same goes for Σ and the observation disturbances.

The initial state α_1 defaults to the value given under the key inistate or, if that in turn is not present, to a zero vector. Alternatively, the starting point can be made stochastic. To do this you can emulate the procedure followed by SsfPack, namely setting α_1 = a + A v_0, where a is a non-stochastic r-vector, v_0 is an r-vector of standard normal random numbers, and A is a matrix such that AA' = P_0. Let's say we have a state-space bundle b on which we have already set suitable values of inistate (corresponding to a above) and inivar (P_0). To perform a simulation with a stochastic starting point we can set α_1 thus:

    matrix A = psdroot(b.inivar)
    b.simstart = b.inistate + A * mnormal(b.r, 1)
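For instance, simulating from the basic (non-cross-correlated) bundle kb1 above might look like the following sketch, assuming r = n = 1; the scaling of the shocks by the relevant standard deviations is the user's responsibility, as just noted.

    # sketch: N x (r+n) artificial disturbances for a model with r = n = 1
    matrix U = mnormal(kb1.N, 2)
    U[,1] *= sqrt(kb1.statevar)   # scale state shocks by sd(eta)
    U[,2] *= sqrt(kb1.obsvar)     # scale observation shocks by sd(eps)
    matrix sim = ksimul(&kb1, U)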
36.11 Numerical optimization

If the object of using a state space model is to produce maximum likelihood estimates of some parameters of interest, note that the log-likelihood surface may be quite awkward (far from globally concave), posing a challenge for numerical methods such as BFGS, the default maximizer under gretl's mle command. Symptoms may include failure of convergence (typically due to an excessive computed gradient, even as the maximizer cannot find an improvement in the objective function) or an excessive number of iterations. In such cases it is worth considering the following points:

- In some cases, scaling the observables may help: if the order of magnitude of y_t is too small or too large, floating-point precision may become an issue for estimating variances.
- If you can obtain plausible initial values for the parameters, things are likely to go better than starting with arbitrary values.
- The limited-memory version of BFGS (L-BFGS) may work better than the standard version in some cases. To engage this, issue the command "set lbfgs on" prior to ML estimation.
- It may be helpful to employ a more accurate, but computationally more expensive, method for computing the gradient, namely Richardson extrapolation. Here the command is "set bfgs_richardson on".

36.12 Example scripts

This section presents a selection of short sample scripts to illustrate the most important points covered in this chapter.

ARMA estimation

Functions illustrated in this example: ksetup, kfilter. As is well known, the Kalman filter provides a very efficient way to compute the likelihood of ARMA models; as an example, take an ARMA(1,1) model

    y_t = φ y_{t-1} + ε_t + θ ε_{t-1}

Listing 36.1: ARMA estimation

    function void arma11_via_kalman (series y)
        /* parameter initialization */
        scalar phi = 0
        scalar theta = 0
        scalar sigma = 1

        /* state-space model setup */
        matrix Z = {1; theta}
        matrix T = {phi, 0; 1, 0}
        matrix Q = {sigma^2, 0; 0, 0}
        bundle kb = ksetup(y, Z, T, Q)

        /* maximum likelihood estimation */
        mle logl = ERR ? NA : kb.llt
            kb.obsymat[2] = theta
            kb.statemat[1,1] = phi
            kb.statevar[1,1] = sigma^2
            ERR = kfilter(&kb)
            params phi theta sigma
        end mle --hessian
    end function

    # main
    open arma.gdt        # open the "arma" example dataset
    arma11_via_kalman(y) # estimate an ARMA(1,1) model
    arma 1 1 ; y --nc    # check via native command

Listing 36.2: HP filter

    function series hp_via_kalman (series y, scalar lambda[0], bool oneside[0])
        if lambda == 0
            lambda = 100 * $pd^2
        endif

        # State transition matrix
        matrix T = {2, -1; 1, 0}
        # Observation matrix
        matrix Z = {1, 0}
        # Covariance matrix in the state equation
        matrix Q = {1/sqrt(lambda), 0; 0, 0}

        matrix my = {y}
        string desc = ""
        if oneside
            matrix my = my | {0}
            desc = "1-sided"
        endif

        ssm = ksetup(my, Z, T, Q)
        ssm.obsvar = sqrt(lambda)
        ssm.inistate = {2*y[1] - y[2]; 3*y[1] - 2*y[2]}
        ssm.diffuse = 1
        err = oneside ? kfilter(&ssm) : ksmooth(&ssm)

        if err
            series ret = NA
        else
            mu = oneside ? ssm.state[2:,2] : ssm.state[,1]
            series ret = y - mu
        endif

        string d = sprintf("%s HP-filtered %s (lambda = %g)", desc, argname(y), lambda)
        setinfo ret --description="@d"
        return ret
    end function

    # example
    clear
    open fedstl.bin
    data houst
    y = log(houst)

    # one-sided: built-in, then hansl
    n1c = hpfilt(y, 1600, 1)
    series h1c = hp_via_kalman(y, 1600, 1)
    ols n1c const h1c --simple-print

    # two-sided: built-in, then hansl
    n2c = hpfilt(y, 1600)
    series h2c = hp_via_kalman(y, 1600)
    ols n2c const h2c --simple-print

Estimates for µ_t can be obtained by running a forward filter for the one-sided version, plus a smoothing pass for the two-sided one. Code implementing the filter is shown in Listing 36.2, along with an example using the "housing starts" series from the St Louis Fed database. The example also compares the result of the function to that from gretl's native hpfilt function. Note that in the case of the one-sided filter a little trick is required in order to get the desired result: the state matrix stored by the kfilter function is the estimate of α̂_{t|t-1}, whereas what we require is in fact α̂_{t|t}. To work around this, we add an extra observation to the end of the series and retrieve the one-step-ahead estimate of the lagged state.

Local level model

Functions illustrated in this example: ksetup, kfilter, ksmooth. Suppose we have a series

    y_t = µ_t + ε_t

where µ_t is a random walk with normal increments of variance σ²₁ and ε_t is normal white noise with variance σ²₂, independent of µ_t. This is known as the "local level" model, and it can be cast in state-space form as equations (36.1)-(36.2) with T = 1, η_t ~ N(0, σ²₁), Z = 1 and ε_t ~ N(0, σ²₂).[5] The translation to hansl is

    bundle llmod = ksetup(y, 1, 1, s1)
    llmod.obsvar = s2
    llmod.diffuse = 1
The two unknown parameters σ²₁ and σ²₂ can be estimated via maximum likelihood. Listing 36.3 provides an example of simulation and estimation of such a model. Since simulating the local level model is trivial using ordinary gretl commands, we don't use ksimul in this context.[6]

[5] Note that the local level model, plus other common Structural Time Series models, are implemented in the StrucTiSM function package.

[6] Warning: as the script stands, there is an off-by-one misalignment between the state vector and the observable series. For convenience, the script is written as if equation (36.2) were modified into the equivalent formulation α_t = T α_{t-1} + η_t. We kept the script as simple as possible, so that the reader can focus on the interesting aspects.

Listing 36.3: Local level model

    nulldata 200
    set seed 101010
    setobs 1 1 --special-time-series

    # set the true variance parameters
    true_s1 = 0.5
    true_s2 = 0.25
    # and simulate some data
    v = normal() * sqrt(true_s1)
    w = normal() * sqrt(true_s2)
    mu = 2 + cum(v)
    y = mu + w

    # starting values for variance estimates
    s1 = 1
    s2 = 1

    # state-space model setup
    bundle kb = ksetup(y, 1, 1, s1)
    kb.obsvar = s2
    kb.diffuse = 1

    # ML estimation of variances
    mle ll = ERR ? NA : kb.llt
        ERR = kfilter(&kb)
        params kb.statevar kb.obsvar
    end mle

    # compute the smoothed state
    ksmooth(&kb)
    series muhat = kb.state

Time-varying models

To illustrate state space models with time-varying system matrices, we will use "time-varying OLS". Suppose the DGP for an observable time series y_t is given by

    y_t = β_0 + β_{1,t} x_t + ε_t        (36.4)

where the slope coefficient β_{1,t} evolves through time according to

    β_{1,t+1} = β_{1,t} + η_t            (36.5)

It is easy to see that the pair of equations above defines a state space model, with equation (36.4) as the measurement equation and (36.5) as the state transition equation. The unobservable state is β_{1,t}, T = 1 and Ω = σ²_η. As for the measurement equation, Σ = σ²_ε, while the matrix multiplying β_{1,t}, and hence playing the role of Z_t, is the time-varying x_t.

Once the system is framed as a state-space model, estimation of the three unknown parameters β_0, σ²_ε and σ²_η can proceed by maximum likelihood, in a manner similar to Listings 36.1 and 36.3. The sequence of slope coefficients β_{1,t} can then be estimated by running the smoother, which also yields a consistent estimate of the dispersion of the estimated state.

Listing 36.4 presents an example in which data from the AWM database are used to estimate a Phillips Curve with time-varying slope:

    INFQ_t = β_0 + β_{1,t} URX_t + ε_t

where INFQ is a measure of quarterly inflation and URX a measure of unemployment.

Listing 36.4: Phillips curve on Euro data with time-varying slope

    function void at_each_step (bundle *b)
        b.obsymat = transp(b.mX[b.t,])
    end function

    open AWM.gdt --quiet
    smpl 1974:1 1994:1

    /* parameter initialization */
    scalar b0 = mean(INFQ)
    scalar s_obs = 0.1
    scalar s_state = 0.1

    /* bundle setup */
    bundle B = ksetup(INFQ, 1, 1, 1)
    matrix B.mX = {URX}
    matrix B.depvar = {INFQ}
    B.timevar_call = "at_each_step"
    B.diffuse = 1

    /* ML estimation of intercept and the two variances */
    mle LL = err ? NA : B.llt
        B.obsy = B.depvar - b0
        B.obsvar = s_obs^2
        B.statevar = s_state^2
        err = kfilter(&B)
        params b0 s_obs s_state
    end mle

    /* display the smoothed time-varying slope */
    ksmooth(&B)
    series tvar_b1hat = B.state[,1]
    series tvar_b1se = sqrt(B.stvar[,1])
    gnuplot tvar_b1hat --time-series --with-lines --output=display \
      --band=tvar_b1hat,tvar_b1se,1.96 --band-style=fill

Figure 36.1: Phillips Curve on Euro data: time-varying slope and 95 percent confidence interval

At the end of the script, the evolution of the slope coefficient over time is plotted, along with a 95 percent confidence band; see Figure 36.1.
Disturbance smoothing

Functions illustrated in this example: ksetup, kdsmooth. In section 36.7 we noted that the kdsmooth function can produce two different measures of the dispersion of the smoothed disturbances, depending on the value of the optional trailing Boolean parameter. Here we show what these two measures are good for, using the famous Nile flow data that have been much analysed in the state-space literature. We focus on the state equation, that is, the random-walk component of the observed series.

Our script is shown in Listing 36.5. This is an instance of the Local Level model, and the ML variance estimates are obtained as in Listing 36.3. In the first call to kdsmooth we omit the optional switch, and therefore compute E(η̂_t η̂_t') for each t. This quantity is suitable for constructing the "auxiliary residuals" shown in the top panel of Figure 36.2 (for similar plots see Koopman et al., 1999; Pelagatti, 2011). This plot suggests the presence of a structural break shortly prior to 1900, as various authors have observed.

In the second kdsmooth call we ask gretl to compute instead

    E[(η̂_t - η_t)(η̂_t - η_t)' | y_1, ..., y_T]

that is, the MSE of η̂_t considered as an estimator of η_t. And in the lower panel of the Figure we plot η̂_t along with a 90 percent confidence band (roughly, 1.64 times the RMSE). This reveals that, given the sampling variance of η̂_t, we're not really sure that any of the η_t values were truly different from zero. The resolution of the seeming conflict here is commonly reckoned to be that there was in fact a change in mean around 1900 but, besides that event, there's little evidence for a non-zero σ²_η. Or, in other words, the standard local level model is not really applicable to the data.

Listing 36.5: Working with smoothed disturbances: Nile data

    open nile.gdt

    # ML variance estimates
    scalar s2_eta = 1468.49
    scalar s2_eps = 15099.7
    bundle LLM = ksetup(nile, 1, 1, s2_eta)
    LLM.obsvar = s2_eps
    LLM.diffuse = 1

    kdsmooth(&LLM)
    series eta_aux = LLM.smdist[,1] ./ LLM.smdisterr[,1]
    series zero = 0
    plot eta_aux
        options time-series with-lines band=zero,const,2
        literal unset ylabel
        literal set title "Auxiliary residual, state equation"
    end plot --output=display

    kdsmooth(&LLM, 1)
    series etahat = LLM.smdist[,1]
    series sd_eta = LLM.smdisterr[,1]
    plot etahat
        options time-series with-lines band=etahat,sd_eta,1.64485
        literal unset ylabel
        literal set title "State disturbance with 90% confidence band"
    end plot --output=display

Figure 36.2: Nile data: (a) auxiliary standardized residuals, state equation; (b) estimated state disturbance with 90 percent confidence band

36.13 Graphical interface

By this point, the reader will have gathered that setting up a state space model can be quite a complex undertaking, and the only general way to accomplish it is by writing a script. However, some cases are simple enough to lend themselves to a standardized treatment, and so can be handled via a relatively streamlined graphical interface. As of version 2022a, gretl provides just this: a GUI for estimating a subset of state space models that, while limited, may still be useful for pedagogical purposes, sparing the user from the intricacies of scripting. In this section we describe the GUI and the class of models it supports.
The GUI can be used for performing ML estimation of models of the kind

    y_t = Z α_t + ε_t            (36.6)
    α_t = T α_{t-1} + R η_t      (36.7)

where y_t is a vector of observables, and V(ε_t) is a diagonal matrix (or possibly 0, in which case the last term of equation (36.6) is dropped). As for the covariance matrix of the shocks to equation (36.7), it is assumed that η_t is an IID sequence of normal random variates with diagonal covariance matrix Σ_η. Therefore, the covariance matrix denoted by Ω in the previous sections of this chapter (whose corresponding key in the Kalman bundle is statevar) is assumed to be

    Ω = R Σ_η R'

Note that R can have fewer columns than r, thereby making Ω singular. In the graphical interface, this is called the "state variance factor". The system matrices Z, T and R are assumed to be time-invariant and known, so estimation only concerns the variances of ε_t and η_t. Clearly, this is a limited subset of the range of models that gretl can handle, but it may be of some value to users.

Figure 36.3: GUI hook for state space models

ML estimation is carried out internally using the mle command with the limited-memory version of the BFGS optimizer, and the user is given the option of tracking the optimization process via a verbosity option. For reasons of numerical performance, it is convenient to have the choice of representing variances as transformations of the BFGS parameters θ in one of the three following ways:

- Absolute value: maximization is performed on the variances, σ² = |θ|
- Square: maximization is performed on the standard deviations, σ² = θ²
- Exponential: maximization is performed on the log standard deviations, σ² = exp(2θ)

Normally, this choice should make no difference for well-behaved data, although numerical problems may occur sometimes. In these cases, it may be helpful to rescale the data by multiplying y_t by some scalar such as 100 or 0.0001, so as to make the order of magnitude of the parameters less prone to finite-precision issues. In any case, the function reports the estimates of the standard errors, whatever the parametrization type. Once the parameters are estimated, the user has the choice of performing smoothing of the states.

The GUI is shown in figure 36.3. The "observables" box is used for specifying a list of series (or a single series) for y_t. The next two boxes handle the Z and T matrices, respectively. These can be pre-existing matrices, or may be created on the fly. The same applies for the next box, dedicated to the R matrix. However, the R matrix can be omitted, in which case it is implicitly assumed that R = I.

The remaining GUI elements should hopefully be self-explanatory. The function returns a bundle which includes a sub-bundle called kmod with all the state-space internals, a matrix called state holding the estimated states, and matrices coeff and vcv holding, respectively, the coefficients and standard errors obtained via ML estimation.

Example: Random walk plus noise

The model here is

    y_t = α_t + ε_t
    α_t = α_{t-1} + η_t

so that Z = T = R = 1. The following script simulates the DGP above, with V(ε_t) = 1 and V(η_t) = 1/16, and sets up the two matrices Z and T, ready to be entered into the second and third boxes of the GUI helper, respectively (obviously, the first box should contain the string y). Note that the first box expects as argument a named list, thereby allowing for multivariate models.

    clear
    set verbose off
    set seed 280921
    nulldata 256
    setobs 1 1 --special

    # example 1: random walk plus noise
    series m = cum(normal() * 0.25)
    series y = m + normal()
    Z = {1}
    T = {1}

Figure 36.4: Estimated state

      stdev_1    0.0173633    0.00448117    3.875    0.0001

    State transition equation

                 coefficient   std. error      z      p-value
      stdev_1    0.0269790     0.00409457    6.589    4.43e-11
      stdev_2    0.00648082    0.00202576    3.199    0.0014

    Log-likelihood   121.33
Note that the output window will contain a few icons on the top bar. By clicking on the second one from the left, it is possible to save to the gretl workspace one or more elements from the returned bundle. For example, the kmod key corresponds to the estimated Kalman bundle. Saving it under the name kb and running the code below will produce the plot shown in Figure 36.5.

    series trend = kb.state[,1]
    series seas = kb.state[,2]
    scatters y trend seas

Figure 36.5: Estimated trend and seasonal component

Chapter 37: Numerical methods

Several functions are available to aid in the construction of special-purpose estimators: their purpose is to find numerically approximate solutions to problems that in principle could be solved analytically, but in practice cannot be, for one reason or another. In this chapter we illustrate the tools that gretl offers for optimization of functions, differentiation and integration.

37.1 Derivative-based optimization methods

In some cases the function we want to optimize is differentiable and has a maximum in the interior of the search space. In these cases you will want to use algorithms that exploit this feature, such as BFGS or Newton-Raphson. If this is not the case, you may want to use derivative-free methods, which are illustrated in section 37.2.

BFGS

The BFGSmax function has two required arguments: a vector holding the initial values of a set of parameters, and a call to a function that calculates the (scalar) criterion to be maximized, given the current parameter values and any other relevant data. If the object is in fact minimization, this function should return the negative of the criterion. On successful completion, BFGSmax returns the maximized value of the criterion, and the vector given via the first argument holds the parameter values which produce the maximum. It is assumed here that the objective function is a user-defined function (see Chapter 14) with the following general setup:

    function scalar ObjFunc (const matrix theta, matrix *X)
        scalar val
        ... # do some computation
        return val
    end function

The first argument contains the parameter vector (which should not be modified within the function) and the second may be used to hold extra values that are necessary to compute the objective function, but are not the variables of the optimization problem. Here the pointer form is chosen for the argument, but depending on the problem it could also be passed as a plain argument, with or without the const modifier. For example, if the objective function were a log-likelihood, the first argument would contain the parameters and the second one the data. Or, for more economic-theory inclined readers, if the objective function were the utility of a consumer, the first argument might contain the quantities of goods and the second one their prices and disposable income.

The operation of BFGS can be adjusted using the set variables bfgs_maxiter and bfgs_toler (see Chapter 26). In addition, you can provoke verbose output from the maximizer by setting max_verbose to on, again via the set command (alternatively, you could have set it to full and get even richer output).
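For instance (the values here are purely illustrative):

    set bfgs_maxiter 500     # cap the number of iterations
    set bfgs_toler 1.0e-10   # convergence tolerance
    set max_verbose on       # per-iteration progress output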
The Rosenbrock function is often used as a test problem for optimization algorithms. It is also known as "Rosenbrock's Valley" or "Rosenbrock's Banana Function", on account of the fact that its contour lines are banana-shaped. It is defined by

    f(x, y) = (1 - x)² + 100(y - x²)²

The function has a global minimum at (x, y) = (1, 1), where f(x, y) = 0. Listing 37.1 shows a gretl script that discovers the minimum using BFGSmax, giving a verbose account of progress. Note that, in this particular case, the function to be maximized only depends on the parameters, so the second parameter is omitted from the definition of the function Rosenbrock.

Listing 37.1: Finding the minimum of the Rosenbrock function

    function scalar Rosenbrock (const matrix param)
        scalar x = param[1]
        scalar y = param[2]
        return -((1-x)^2 + 100 * (y - x^2)^2)
    end function

    matrix theta = {0, 0}
    set max_verbose on
    M = BFGSmax(&theta, Rosenbrock(theta))
    print theta

Supplying analytical derivatives for BFGS

An optional third argument to the BFGSmax function enables the user to supply analytical derivatives of the criterion function with respect to the parameters, without which a numerical approximation to the gradient is computed. This argument is similar to the second one in that it specifies a function call. In this case the function that is called must have the following signature.

Its first argument should be a pre-defined matrix, correctly dimensioned to hold the gradient: that is, if the parameter vector contains k elements, the gradient matrix must also be a k-vector. This matrix argument must be given in "pointer" form, so that its content can be modified by the function. Note that, unlike the parameter vector, where the choice of initial values can be important, the initial values given to the gradient are immaterial and do not affect the results.

In addition, the gradient function must have as one of its arguments the parameter vector. This may be given in pointer form (which enhances efficiency) but that is not required. Additional arguments may be specified if necessary. Given the current parameter values, the function call must fill out the gradient vector appropriately. It is not required that the gradient function returns any value directly; if it does, that value is ignored.

Listing 37.2 illustrates, showing how the Rosenbrock script can be modified to use analytical derivatives. Note that, since this is a minimization problem, the values written into g[1] and g[2] in the function Rosen_grad are in fact the derivatives of the negative of the Rosenbrock function.

Listing 37.2: Rosenbrock function with analytical gradient

    function scalar Rosenbrock (const matrix param)
        scalar x = param[1]
        scalar y = param[2]
        return -((1-x)^2 + 100 * (y - x^2)^2)
    end function

    function void Rosen_grad (matrix *g, const matrix param)
        scalar x = param[1]
        scalar y = param[2]
        g[1] = 2*(1-x) + 2*x*200*(y - x^2)
        g[2] = -200*(y - x^2)
    end function

    matrix theta = {0, 0}
    matrix grad = {0, 0}
    set max_verbose 1
    M = BFGSmax(&theta, Rosenbrock(theta), Rosen_grad(&grad, theta))
    print theta
    print grad

Limited-memory variant and constrained optimization

As an alternative to standard BFGS, gretl offers the limited-memory variant, L-BFGS-B. This is described by Byrd et al. (1995) and Zhu et al. (1997). Gretl uses version 3.0 of this code, which features improvements described by Morales and Nocedal (2011). Some problems that defeat standard BFGS may be amenable to solution by L-BFGS-B. To see if this is the case, gretl code that uses BFGS can be pushed into using the alternative algorithm via the set command, as follows:

    set lbfgs on

The primary case for using L-BFGS-B, however, is constrained optimization: this algorithm supports constraints on the parameters in the form of minima and/or maxima. In gretl this is implemented by the function BFGScmax ("c" for constrained). The syntax is basically similar to that of BFGSmax, except that the first argument must be followed by the specification of a "bounds" matrix. This matrix should have three columns, and as many rows as there are constrained elements of
the parameter vector. Each row should hold the (1-based) index of the constrained parameter, followed by lower and upper bounds. The values -$huge and $huge should be used to indicate that the parameter is unconstrained downward or upward, respectively. For example, the following code constructs a matrix to specify that the second element of the parameter vector must be non-negative, and the fourth must lie between 0 and 1:

    matrix bounds = {2, 0, $huge; 4, 0, 1}

Newton-Raphson

BFGS, discussed above, is an excellent all-purpose maximizer, and about as robust as possible given the limitations of digital computer arithmetic. The Newton-Raphson maximizer is not as robust, but may converge much faster than BFGS for problems where the maximand is reasonably well behaved, in particular where it is anything like quadratic (see below). The case for using Newton-Raphson is enhanced if it is possible to supply a function to calculate the Hessian analytically.

The gretl function NRmax, which implements the Newton-Raphson method, has a maximum of four arguments. The first two (required) arguments are exactly as for BFGS: an initial parameter vector, and a function call which returns the maximand given the parameters. The optional third argument is again as in BFGS: a function call that calculates the gradient. Specific to NRmax is an optional fourth argument, namely a function call to calculate the (negative) Hessian. The first argument of this function must be a pre-defined matrix of the right dimension to hold the Hessian (that is, a k x k matrix, where k is the length of the parameter vector), given in pointer form. The second argument should be the parameter vector (optionally in pointer form). Other data may be passed as additional arguments as needed. Similarly to the case with the gradient, if the fourth argument to NRmax is omitted, then a numerical approximation to the Hessian is constructed.

What is ultimately required in Newton-Raphson is the negative inverse of the Hessian. Note that if you give the optional fourth argument, your function should compute the negative Hessian but should not invert it; NRmax takes care of inversion, with special handling for the case where the matrix is not negative definite, which can happen far from the maximum.

Listing 37.3 extends the Rosenbrock example, using NRmax with a function Rosen_hess to compute the Hessian. The functions Rosenbrock and Rosen_grad are just the same as in Listing 37.2, and are omitted for brevity.

Listing 37.3: Rosenbrock function via Newton-Raphson

    function void Rosen_hess (matrix *H, const matrix param)
        scalar x = param[1]
        scalar y = param[2]
        H[1,1] = 2 - 400*y + 1200*x^2
        H[1,2] = -400*x
        H[2,1] = -400*x
        H[2,2] = 200
    end function

    matrix theta = {0, 0}
    matrix grad = {0, 0}
    matrix H = zeros(2, 2)
    set max_verbose 1
    M = NRmax(&theta, Rosenbrock(theta), Rosen_grad(&grad, theta), \
        Rosen_hess(&H, theta))
    print theta
    print grad

The idea behind Newton-Raphson is to exploit a quadratic approximation to the maximand, under the assumption that it is concave. If this is true, the method is very effective. However, if the algorithm happens to evaluate the function at a point where the Hessian is not negative definite, things may go wrong. Listing 37.4 exemplifies this by using a normal density, which is concave in the interval (-1, 1) and convex elsewhere. If the algorithm is started from within the interval, everything goes well and NR is (slightly) more effective than BFGS. If, however, the Hessian is positive at the starting point, BFGS converges with only little more difficulty, while Newton-Raphson fails.

Listing 37.4: Maximization of a Gaussian density

    function scalar ND (matrix x)
        scalar z = x[1]
        return exp(-0.5*z*z)
    end function

    set max_verbose 1

    x = {0.75}
    A = BFGSmax(&x, ND(x))

    x = {0.75}
    A = NRmax(&x, ND(x))

    x = {1.5}
    A = BFGSmax(&x, ND(x))

    x = {1.5}
    A = NRmax(&x, ND(x))
37.2 Derivative-free optimization methods

Golden section search method

Suppose you have a function f(x) of a scalar argument that is known to have a unique maximum. The golden section method is rather effective at finding it quickly, without making use of derivatives; see Press et al. (2007, section 10.2) for a thorough description. The gretl function implementing this method is called GSSmax.

The idea is, roughly, to take an interval [x₀, x₁], also known as the "bracket", that should contain the maximizing value. Once y₀ = f(x₀) and y₁ = f(x₁) are computed, the algorithm sets a new point x₂ that replaces the end of the previous interval for which the function takes the worse value. So, for example, if y₀ < y₁, then x₀ is replaced and the interval becomes [x₁, x₂]. The width of the interval shrinks progressively, so after a few iterations you should end up close to the maximum.

As an illustration, consider the function

    f(x) = 50 x^{3/2} e^{-x}

which is maximized at x = 3/2. The following script sets as the initial interval the range [0, 10]:

    function scalar g (scalar x)
        return 50 * x^1.5 * exp(-x)
    end function

    set max_verbose on
    m = {5, 0, 10}
    y = GSSmax(&m, g(m[1]))
    printf "f(%g) = %g\n", m[1], y

The output is

    1: bracket={3.81966,6.18034}, values={8.18747,1.59001}
    2: bracket={2.36068,3.81966}, values={17.1118,8.18747}
    3: bracket={1.45898,2.36068}, values={20.4841,17.1118}
    4: bracket={0.901699,1.45898}, values={17.3764,20.4841}
    ...
    20: bracket={1.50017,1.50042}, values={20.4958,20.4958}
    21: bracket={1.50001,1.50017}, values={20.4958,20.4958}
    f(1.49996) = 20.4958

As you can see from the output, the bracket shrinks progressively; the center of the interval when the algorithm stops is x = 1.49996. Figure 37.1 gives a pictorial representation of the process, where the blue line is the function to maximize and the red segments are the successive choices for the bracket.

Figure 37.1: Golden section search method: example

Simulated Annealing

Simulated annealing, as implemented by the gretl function simann, is not a full-blown maximization method in its own right, but can be a useful auxiliary tool in problems where convergence depends sensitively on the initial values of the parameters. The idea is that you supply initial values and the simulated annealing mechanism tries to improve on them via controlled randomization.

The simann function takes up to three arguments. The first two (required) are the same as for BFGSmax and NRmax: an initial parameter vector, and a function that computes the maximand. The optional third argument is a positive integer giving the maximum number of iterations, n, which defaults to 1024.

Starting from the specified point in the parameter space, for each of n iterations we select at random a new point within a certain radius of the previous one, and determine the value of the criterion at the new point. If the criterion is higher we jump to the new point; otherwise, we jump with probability P (and remain at the previous point with probability 1 - P). As the iterations proceed, the system gradually "cools": that is, the radius of the random perturbation is reduced, as is the probability of making a jump when the criterion fails to increase.

In the course of this procedure, n + 1 points in the parameter space are evaluated: call them θᵢ, i = 0, ..., n, where θ₀ is the initial value given by the user. Let θ* denote the "best" point among θ₁, ..., θₙ (highest criterion value). The value written into the parameter vector on completion is then θ* if θ* is better than θ₀, otherwise θₙ.
In other words, failing an actual improvement in the criterion, simann randomizes the starting point, which may be helpful in tricky optimization problems.

Listing 37.5 shows simann at work as a helper for BFGSmax in finding the maximum of a bimodal function. Unaided, BFGSmax requires 60 function evaluations and 55 evaluations of the gradient, while after simulated annealing the maximum is found with 7 function evaluations and 6 evaluations of the gradient.[1]

[1] Your mileage may vary: these figures are somewhat compiler- and machine-dependent.

Listing 37.5: BFGS with initialization via Simulated Annealing

    function scalar bimodal (matrix x, matrix A)
        scalar ret = exp(-qform((x-1), A))
        ret += 2 * exp(-qform((x-4), A))
        return ret
    end function

    set seed 12334
    set max_verbose on

    scalar k = 2
    matrix A = 0.1 * I(k)
    matrix x0 = {3, -5}

    x = x0
    u = BFGSmax(&x, bimodal(x, A))
    print x

    x = x0
    u = simann(&x, bimodal(x, A), 1000)
    print x
    u = BFGSmax(&x, bimodal(x, A))
    print x

Nelder-Mead

The Nelder-Mead derivative-free simplex maximizer (also known as the "amoeba" algorithm) is implemented by the function NMmax. The argument list of this function is essentially the same as for simann: the required arguments are an initial parameter vector and a function-call to compute the maximand, while an optional third argument can be used to set the maximum number of function evaluations (default value: 2000).

This method is unlikely to produce as close an approximation to the "true" optimum as derivative-based methods such as BFGS and Newton-Raphson, but it is more robust than the latter. It may succeed in some cases where derivative-based methods fail, and it may be useful, like simann, for improving the starting point for an optimization problem so that a derivative-based method can then take over successfully.

NMmax includes an internal "convergence" check, namely verification that the best value achieved for the objective function at termination of the algorithm is at least a local optimum, but by default it doesn't flag an error if this condition is not satisfied. This permits a mode of usage where you set a fairly tight budget of function evaluations (for example, 200) and just take any improvement in the objective function that is available, without worrying about whether an optimum has truly been reached. However, if you want the convergence check to be enforced, you can flag this by setting a negative value for the maximum function evaluations argument; in that case the absolute value of the argument is taken, and an error is provoked on non-convergence.

If the task for this function is actually minimization, you can either have the function-call return the negative of the actual criterion or, if you prefer, call NMmax under the alias NMmin. Here is an example of use: minimization of the Powell quartic function, which is problematic for BFGS. The true minimum is zero, obtained for x a 4-vector of zeros.

    function scalar powell (const matrix x)
        fx1 = x[1] + 10 * x[2]
        fx2 = x[3] - x[4]
        fx3 = x[2] - 2 * x[3]
        fx4 = x[1] - x[4]
        return fx1^2 + 5.0 * fx2^2 + fx3^4 + 10.0 * fx4^4
    end function

    matrix x = {3, -1, 0, 1}
    printf "Initial f(X) = %g\n", powell(x)
    fmin = NMmin(&x, powell(x))
    printf "Estimate of optimal X = %14f\n", x
    printf "f(X) = %g\n", fmin

Listing 37.6: Delta Method

    function matrix MPC (matrix *param, matrix *Y)
        beta = param[2]
        gamma = param[3]
        y = Y[1]
        return beta*gamma*y^(gamma-1)
    end function

    # William Greene, Econometric Analysis 5e, Chapter 9
    set echo off
    set messages off
    open greene5_1.gdt

    # Use OLS to initialize the parameters
    ols realcons 0 realdpi --quiet
    a = $coeff(0)
    b = $coeff(realdpi)
    g = 1.0

    # Run NLS with analytical derivatives
    nls realcons = a + b * realdpi^g
        deriv a = 1
        deriv b = realdpi^g
        deriv g = b * realdpi^g * log(realdpi)
    end nls

    matrix Y = {realdpi[2000:4]}
    matrix theta = $coeff
    matrix V = $vcv
    mpc = MPC(&theta, &Y)
    matrix Jac = fdjac(theta, MPC(&theta, &Y))
    Sigma = qform(Jac, V)
    printf "mpc = %g (stderr = %g)\n", mpc, sqrt(Sigma)

    scalar teststat = (mpc-1)/sqrt(Sigma)
    printf "Test for MPC = 1: %g (p-value = %g)\n", teststat, \
      pvalue(n, abs(teststat))

    Model 1: Logit estimates using the 32 observations 1-32
    Dependent variable: GRADE

      VARIABLE    COEFFICIENT    STDERROR     T STAT    SLOPE (at mean)
      const       -13.0213       4.93132      -2.641
      GPA           2.82611      1.26294       2.238    0.533859
      TUCE          0.0951577    0.141554      0.672    0.0179755
      PSI           2.37869      1.06456       2.234    0.449339

      Mean of GRADE = 0.344
      Number of cases 'correctly predicted' = 26 (81.2%)
      f(beta'x) at mean of independent vars = 0.189
      McFadden's pseudo-R-squared = 0.374038
      Log-likelihood = -12.8896
      Likelihood ratio test: Chi-square(3) = 15.4042 (p-value 0.001502)
      Akaike information criterion (AIC) = 33.7793
      Schwarz Bayesian criterion (BIC) = 39.6422
      Hannan-Quinn criterion (HQC) = 35.7227

               Predicted
                  0    1
      Actual 0   18    3
             1    3    8

    Model 2: Probit estimates using the 32 observations 1-32
    Dependent variable: GRADE

      VARIABLE    COEFFICIENT    STDERROR     T STAT    SLOPE (at mean)
      const        -7.45232      2.54247      -2.931
      GPA           1.62581      0.693883      2.343    0.533347
      TUCE          0.0517288    0.0838903     0.617    0.0169697
      PSI           1.42633      0.595038      2.397    0.467908

      Mean of GRADE = 0.344
      Number of cases 'correctly predicted' = 26 (81.2%)
      f(beta'x) at mean of independent vars = 0.328
      McFadden's pseudo-R-squared = 0.377478
      Log-likelihood = -12.8188
      Likelihood ratio test: Chi-square(3) = 15.5459 (p-value 0.001405)
      Akaike information criterion (AIC) = 33.6376
      Schwarz Bayesian criterion (BIC) = 39.5006
      Hannan-Quinn criterion (HQC) = 35.581

               Predicted
                  0    1
      Actual 0   18    3
             1    3    8

    Test for normality of residual
    Null hypothesis: error is normally distributed
    Test statistic: Chi-square(2) = 3.61059
    with p-value = 0.164426

    Table 38.1: Example logit and probit output

Odds ratios

A noteworthy feature of the binary logit model is that the regression coefficients have an interpretation as log odds ratios, where the odds ratio is P(y = 1)/P(y = 0). In the logit example above, the coefficient on TUCE has a value of 0.095. The corresponding odds ratio is then e^0.095 ≈ 1.10, meaning that the estimated effect of a unit increase in TUCE is to move the odds ratio by 10 percent in favor of GRADE = 1.

When a binary logit model is estimated via the gretl GUI, the Analysis menu in the model output window includes an "Odds ratios" item. This opens a window showing the odds ratios along with standard errors (obtained via the delta method) plus a 95 percent confidence interval, as illustrated below.

    95% confidence intervals
    z(0.025) = 1.9600

             odds ratio    std. error       low          high
    GPA       16.8797      21.3181        1.42019      200.624
    TUCE       1.09983      0.155686      0.833365       1.45150
    PSI       10.7907      11.4874        1.33934       86.9380

Note, however, that the confidence intervals shown are not calculated using the delta-method standard errors; rather, the bounds are obtained by exponentiating the bounds of regular confidence intervals for the coefficients. This makes sense on the assumption that the coefficients themselves are more likely to be normally distributed than their exponentials.

Odds ratio information can also be retrieved following binary logit estimation via scripting. In this case it takes the form of a matrix, provided by the $oddsratios accessor (or as $model.oddsratios).
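A minimal sketch of the scripting route, assuming the Spector-Mazzeo data used above are available as a gretl data file (the filename here is an assumption):

    open greene19_1.gdt            # assumed dataset with GRADE, GPA, TUCE, PSI
    logit GRADE const GPA TUCE PSI
    matrix OR = $oddsratios        # odds ratios plus dispersion information
    print OR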
The perfect prediction problem

One curious characteristic of logit and probit models is that, quite paradoxically, estimation is not feasible if a model fits the data perfectly; this is called the perfect prediction problem. The reason why this problem arises is easy to see by considering equation (38.6): if for some vector β and scalar k it's the case that z_i < k whenever y_i = 0 and z_i > k whenever y_i = 1, the same thing is true for any multiple of β. Hence, the log-likelihood L(β) can be made arbitrarily close to 0 simply by choosing enormous values for β. As a consequence, the log-likelihood has no maximum, despite being bounded.

Gretl has a mechanism for preventing the algorithm from iterating endlessly in search of a non-existent maximum. One sub-case of interest is when the perfect prediction problem arises because of a single binary explanatory variable. In this case, the offending variable is dropped from the model and estimation proceeds with the reduced specification. Nevertheless, it may happen that no single "perfect classifier" exists among the regressors, in which case estimation is simply impossible and the algorithm stops with an error. This behavior is triggered during the iteration process if

    max_{i: y_i = 0} z_i < min_{i: y_i = 1} z_i

If this happens, unless your model is trivially mis-specified (like predicting if a country is an oil exporter on the basis of oil revenues), it is normally a small-sample problem: you probably just don't have enough data to estimate your model. You may want to drop some of your explanatory variables. This problem is well analyzed in Stokes (2004); the results therein are replicated in the example script murder_rates.inp.

38.2 Ordered response models

These models constitute a simple variation on ordinary logit/probit models, and are usually applied when the dependent variable is a discrete and ordered measurement: not simply binary, but on an ordinal rather than an interval scale. For example, this sort of model may be applied when the dependent variable is a qualitative assessment such as "Good", "Average" and "Bad".

In the general case, consider an ordered response variable, y, that can take on any of the J+1 values 0, 1, 2, ..., J. We suppose, as before, that underlying the observed response is a latent variable,

    y* = Xβ + ε = z + ε

Now define "cut points", α₁ < α₂ < ... < α_J, such that

    y = 0 if y* ≤ α₁
    y = 1 if α₁ < y* ≤ α₂
    ...
    y = J if y* > α_J

For example, if the response takes on three values there will be two such cut points, α₁ and α₂.

The probability that individual i exhibits response j, conditional on the characteristics x_i, is then given by

    P(y_i = j | x_i) = F(α₁ - z_i)                       for j = 0
    P(y_i = j | x_i) = F(α_{j+1} - z_i) - F(α_j - z_i)   for 0 < j < J
    P(y_i = j | x_i) = 1 - F(α_J - z_i)                  for j = J
                                                          (38.8)

The unknown parameters α_j are estimated jointly with the βs via maximum likelihood. The α̂_j estimates are reported by gretl as cut1, cut2 and so on. For the probit variant, a conditional moment test for normality constructed in the spirit of Chesher and Irish (1987) is also included.

Note that the α_j parameters can be shifted arbitrarily by adding a constant to z_i, so the model is under-identified if there is some linear combination of the explanatory variables which is constant. The most obvious case in which this occurs is when the model contains a constant term; for this reason, gretl automatically drops the intercept if present. However, it may happen that the user inadvertently specifies a list of regressors that may be combined in such a way as to produce a constant (for example, by using a full set of dummy variables for a discrete factor). If this happens, gretl will also drop any offending regressors.

In order to apply these models in gretl, the dependent variable must either take on only non-negative integer values, or be explicitly marked as discrete. (In case the variable has non-integer values, it will be recoded internally.) Note that gretl does not provide a separate command for ordered models: the logit and probit commands automatically estimate the ordered version if the dependent variable is acceptable, but not binary.
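So, schematically (the series and regressor names here are invented for illustration):

    # 'rating' takes values 0, 1, 2 on an ordinal scale
    discrete rating              # mark the series as discrete, if need be
    probit rating const x1 x2   # ordered probit; the constant is dropped
                                 # automatically, cut points shown as cut1, cut2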
Listing 38.3 reproduces the results presented in section 15.10 of Wooldridge (2002a). The question of interest in this analysis is what difference it makes, to the allocation of assets in pension funds, whether individual plan participants have a choice in the matter. The response variable is an ordinal measure of the weight of stocks in the pension portfolio. Having reported the results of estimation of the ordered model, Wooldridge illustrates the effect of the choice variable by reference to an "average" participant. The example script shows how one can compute this effect in gretl.

After estimating ordered models, the $uhat accessor yields generalized residuals, as in binary models; additionally, the $yhat accessor function returns ẑ_i, so it is possible to compute an unbiased estimator of the latent variable y*_i simply by adding the two together.

Listing 38.3: Ordered probit model

    /*
      Replicate the results in Wooldridge, Econometric Analysis of Cross
      Section and Panel Data, section 15.10, using pension-plan data from
      Papke (AER, 1998).

      The dependent variable, pctstck (percent stocks), codes the asset
      allocation responses of "mostly bonds", "mixed" and "mostly stocks"
      as {0, 50, 100}.

      The independent variable of interest is "choice", a dummy indicating
      whether individuals are able to choose their own asset allocations.
    */

    open pension.gdt

    # demographic characteristics of participant
    list DEMOG = age educ female black married
    # dummies coding for income level
    list INCOME = finc25 finc35 finc50 finc75 finc100 finc101

    # Papke's OLS approach
    ols pctstck const choice DEMOG INCOME wealth89 prftshr
    # save the OLS choice coefficient
    choice_ols = $coeff(choice)

    # estimate ordered probit
    probit pctstck choice DEMOG INCOME wealth89 prftshr

    k = $ncoeff
    matrix b = $coeff[1:k-2]
    a1 = $coeff[k-1]
    a2 = $coeff[k]

    /*
      Wooldridge illustrates the 'choice' effect in the ordered probit
      by reference to a single, non-black male aged 60 with 13.5 years
      of education, income in the range $50K - $75K and wealth of $200K,
      participating in a plan with profit sharing.
    */
    matrix X = {60, 13.5, 0, 0, 0, 0, 0, 0, 1, 0, 0, 200, 1}

    # with 'choice' = 0
    scalar Xb = (0 ~ X) * b
    P0 = cdf(N, a1 - Xb)
    P50 = cdf(N, a2 - Xb) - P0
    P100 = 1 - cdf(N, a2 - Xb)
    E0 = 50 * P50 + 100 * P100

    # with 'choice' = 1
    Xb = (1 ~ X) * b
    P0 = cdf(N, a1 - Xb)
    P50 = cdf(N, a2 - Xb) - P0
    P100 = 1 - cdf(N, a2 - Xb)
    E1 = 50 * P50 + 100 * P100

    printf "with choice, E(y) = %.2f, without E(y) = %.2f\n", E1, E0
    printf "Estimated choice effect via ML = %.2f (OLS = %.2f)\n", \
      E1 - E0, choice_ols

    duration durat 0 X ; cens

where durat measures durations, 0 represents the constant (which is required for such models), X is a named list of regressors, and cens is the censoring dummy. By default the Weibull distribution is used; you can substitute any of the other three distributions discussed here by appending one of the option flags --exponential, --loglogistic or --lognormal.
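For instance, under the names used above, the four variants might be run as follows (a sketch):

    duration durat 0 X ; cens                  # Weibull (the default)
    duration durat 0 X ; cens --exponential
    duration durat 0 X ; cens --loglogistic
    duration durat 0 X ; cens --lognormal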
Interpreting the coefficients in a duration model requires some care, and we will work through an illustrative case. The example comes from section 20.3 of Wooldridge (2002a) and it concerns criminal recidivism.[7] The data (filename recid.gdt) pertain to a sample of 1,445 convicts released from prison between July 1, 1977 and June 30, 1978. The dependent variable is the time in months until they are again arrested. The information was gathered retrospectively by examining records in April 1984; the maximum possible length of observation is 81 months. Right-censoring is important: when the data were compiled, about 62 percent had not been re-arrested. The dataset contains several covariates, which are described in the data file; we will focus below on interpretation of the married variable, a dummy which equals 1 if the respondent was married when imprisoned.

Listing 38.7 shows the gretl commands for Weibull and lognormal models, along with most of the output. Consider first the Weibull scale factor, σ. The estimate is 1.241 with a standard error of 0.048. (We don't print a z score and p-value for this term, since H₀: σ = 0 is not of interest.) Recall that σ corresponds to 1/α; we can be confident that α is less than 1, so recidivism displays negative duration dependence. This makes sense: it is plausible that if a past offender manages to stay out of trouble for an extended period, his risk of engaging in crime again diminishes. The exponential model would therefore not be appropriate in this case.

On a priori grounds, however, we may doubt the monotonic decline in hazard that is implied by the Weibull specification. Even if a person is liable to return to crime, it seems relatively unlikely that he would do so straight out of prison. In the data, we find that only 2.6 percent of those followed were re-arrested within 3 months. The lognormal specification, which allows the hazard to rise and then fall, may be more appropriate. Using the duration command again with the same covariates, but the --lognormal flag, we get a log-likelihood of -1597, as against -1633 for the Weibull, confirming that the lognormal gives a better fit.

Let us now focus on the married coefficient, which is positive in both specifications, but larger and more sharply estimated in the lognormal variant. The first thing is to get the interpretation of the sign right. Recall that Xβ enters negatively into the intermediate variable w (equation 38.20). The Weibull hazard is λ(w_i) = e^{w_i}, so being married reduces the hazard of re-offending or, in other words, lengthens the expected duration out of prison. The same qualitative interpretation applies for the lognormal.

To get a better sense of the married effect, it is useful to show its impact on the hazard across time. We can do this by plotting the hazard for two values of the index function Xβ: in each case the values of all the covariates other than married are set to their means (or some chosen values), while married is set first to 0, then to 1. Listing 38.8 provides a script that does this, and the resulting plots are shown in Figure 38.1. Note that, when computing the hazards, we need to multiply by the Jacobian of the transformation from t_i to w_i = (log t_i - x_iβ)/σ, namely 1/t. Note also that the estimate of σ is available via the accessor $sigma, but it is also present as the last element in the coefficient vector obtained via $coeff.

A further difference between the Weibull and lognormal specifications is illustrated in the plots. The Weibull is an instance of a proportional hazard model. This means that, for any sets of values of the covariates, x_i and x_j, the ratio of the associated hazards is invariant with respect to duration. In this example, the Weibull hazard for unmarried individuals is always 1.1637 times that for married. In the lognormal variant, on the other hand, this ratio gradually declines, from 1.6703 at one month to 1.1766 at 100 months.

[7] Germán Rodríguez of Princeton University has a page discussing this example and displaying estimates from Stata at https://data.princeton.edu/pop509/recid1.html.
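That time-invariant Weibull ratio can be verified directly from the estimates, since the hazard-metric coefficient is the AFT coefficient divided by σ. A sketch, using only the accessors documented above:

    duration durat 0 X married ; cens                     # Weibull, as in Listing 38.7
    scalar hr = exp($coeff(married) / $sigma)
    printf "unmarried/married hazard ratio = %.4f\n", hr  # approx. 1.1637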
Listing 38.7: Models for recidivism data

Input:

    open recid.gdt
    list X = workprg priors tserved felon alcohol drugs black married educ age
    duration durat 0 X ; cens
    duration durat 0 X ; cens --lognormal

Partial output:

    Model 1: Duration (Weibull), using observations 1-1445
    Dependent variable: durat

                coefficient    std. error       z        p-value
      const      4.22167       0.341311       12.37      3.85e-35
      workprg   -0.112785      0.112535       -1.002     0.3162
      priors    -0.110176      0.0170675      -6.455     1.08e-10
      tserved   -0.0168297     0.00213029     -7.900     2.78e-15
      felon      0.371623      0.131995        2.815     0.0049
      alcohol   -0.555132      0.132243       -4.198     2.69e-05
      drugs     -0.349265      0.121880       -2.866     0.0042
      black     -0.563016      0.110817       -5.081     3.76e-07
      married    0.188104      0.135752        1.386     0.1659
      educ       0.0289111     0.0241153       1.199     0.2306
      age        0.00462188    0.000664820     6.952     3.60e-12
      sigma      1.24090       0.0482896

      Chi-square(10) = 165.4772 (p-value 2.39e-30)
      Log-likelihood = -1633.032
      Akaike criterion = 3290.065

    Model 2: Duration (lognormal), using observations 1-1445
    Dependent variable: durat

                coefficient    std. error       z        p-value
      const      4.09939       0.347535       11.80      4.11e-32
      workprg   -0.0625693     0.120037       -0.5213    0.6022
      priors    -0.137253      0.0214587      -6.396     1.59e-10
      tserved   -0.0193306     0.00297792     -6.491     8.51e-11
      felon      0.443995      0.145087        3.060     0.0022
      alcohol   -0.634909      0.144217       -4.402     1.07e-05
      drugs     -0.298159      0.132736       -2.246     0.0247
      black     -0.542719      0.117443       -4.621     3.82e-06
      married    0.340682      0.139843        2.436     0.0148
      educ       0.0229194     0.0253974       0.9024    0.3668
      age        0.00391028    0.000606205     6.450     1.12e-10
      sigma      1.81047       0.0623022

      Chi-square(10) = 166.7361 (p-value 1.31e-30)
      Log-likelihood = -1597.059
      Akaike criterion = 3218.118

Listing 38.8: Create plots showing conditional hazards

    open recid.gdt -q
    # leave 'married' separate for analysis
    list X = workprg priors tserved felon alcohol drugs black educ age

    # Weibull variant
    duration durat 0 X married ; cens
    # coefficients on all Xs apart from married
    matrix beta_w = $coeff[1:$ncoeff-2]
    # married coefficient
    scalar mc_w = $coeff[$ncoeff-1]
    scalar s_w = $sigma

    # Lognormal variant
    duration durat 0 X married ; cens --lognormal
    matrix beta_n = $coeff[1:$ncoeff-2]
    scalar mc_n = $coeff[$ncoeff-1]
    scalar s_n = $sigma

    list allX = 0 X
    # evaluate X*beta at means of all variables except marriage
    scalar Xb_w = meanc({allX}) * beta_w
    scalar Xb_n = meanc({allX}) * beta_n

    # construct two plot matrices
    matrix mat_w = zeros(100, 3)
    matrix mat_n = zeros(100, 3)

    loop t=1..100
        # first column, duration
        mat_w[t, 1] = t
        mat_n[t, 1] = t
        wi_w = (log(t) - Xb_w) / s_w
        wi_n = (log(t) - Xb_n) / s_n
        # second col: hazard with married = 0
        mat_w[t, 2] = (1/t) * exp(wi_w)
        mat_n[t, 2] = (1/t) * pdf(z, wi_n) / cdf(z, -wi_n)
        wi_w = (log(t) - (Xb_w + mc_w)) / s_w
        wi_n = (log(t) - (Xb_n + mc_n)) / s_n
        # third col: hazard with married = 1
        mat_w[t, 3] = (1/t) * exp(wi_w)
        mat_n[t, 3] = (1/t) * pdf(z, wi_n) / cdf(z, -wi_n)
    endloop

    cnameset(mat_w, "months unmarried married")
    cnameset(mat_n, "months unmarried married")

    gnuplot 2 3 1 --with-lines --supp --matrix=mat_w --output=weibull.plt
    gnuplot 2 3 1 --with-lines --supp --matrix=mat_n --output=lognorm.plt

Figure 38.1: Recidivism hazard estimates for married and unmarried ex-convicts (Weibull and lognormal panels)

Alternative representations of the Weibull model

One point to watch out for with the Weibull duration model is that the estimates may be represented in different ways. The representation given by gretl is sometimes called the accelerated failure-time (AFT) metric. An alternative that one sometimes sees is the log relative-hazard metric; in fact, this is the metric used in Wooldridge's presentation of the recidivism example. To get from AFT estimates to log relative-hazard form it is necessary to multiply the coefficients by -σ⁻¹. For example, the married coefficient in the Weibull specification as shown here is 0.188104 and σ̂ is 1.24090, so the alternative value is -0.152, which is what Wooldridge shows (2002a, Table 20.1).
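In script terms the conversion is a one-liner. A sketch, assuming a Weibull model has just been estimated and recalling that the last element of $coeff is σ:

    matrix b_aft = $coeff[1:$ncoeff-1]   # the slopes; drop the trailing sigma
    matrix b_hazard = -b_aft / $sigma    # log relative-hazard metric
    print b_hazard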
Fitted values and residuals

By default, gretl computes fitted values (accessible via $yhat) as the conditional mean of duration. The formulae are shown below, where Γ(·) denotes the gamma function, and the exponential variant is just Weibull with σ = 1:

  Weibull         exp(Xβ) Γ(1 + σ)
  Log-logistic    exp(Xβ) πσ / sin(πσ)
  Log-normal      exp(Xβ + σ²/2)

The expression given for the log-logistic mean, however, is valid only for σ < 1; otherwise the expectation is undefined, a point that is not noted in all software.8

Alternatively, if the --medians option is given, gretl's duration command will produce conditional medians as the content of $yhat. For the Weibull the median is exp(Xβ)(log 2)^σ; for the log-logistic and log-normal it is just exp(Xβ).

The values we give for the accessor $uhat are generalized (Cox-Snell) residuals, computed as the integrated hazard function, which equals the negative log of the survivor function:

  ε̂_i = Λ(t_i; x_i, θ̂) = -log S(t_i; x_i, θ̂)

Under the null of correct specification of the model, these generalized residuals should follow the unit exponential distribution, which has mean and variance both equal to 1, and density exp(-ε). See chapter 18 of Cameron and Trivedi (2005) for further discussion.

8 The predict adjunct to the streg command in Stata 10, for example, gaily produces large negative values for the log-logistic mean in duration models with σ > 1.

Chapter 39 Quantile regression

39.1 Introduction

In Ordinary Least Squares (OLS) regression, the fitted values ŷ_i = X_i β̂ represent the conditional mean of the dependent variable (conditional, that is, on the regression function and the values of the independent variables). In median regression, by contrast and as the name implies, fitted values represent the conditional median of the dependent variable. It turns out that the principle of estimation for median regression is easily stated (though not so easily computed), namely: choose β̂ so as to minimize the sum of absolute residuals. Hence the method is known as Least Absolute Deviations or LAD. While the OLS problem has a straightforward analytical solution, LAD is a linear programming problem.

Quantile regression is a generalization of median regression: the regression function predicts the conditional τ-quantile of the dependent variable, for example the first quartile (τ = 0.25) or the ninth decile (τ = 0.90).

If the classical conditions for the validity of OLS are satisfied, that is, if the error term is independently and identically distributed conditional on X, then quantile regression is redundant: all the conditional quantiles of the dependent variable will march in lockstep with the conditional mean. Conversely, if quantile regression reveals that the conditional quantiles behave in a manner quite distinct from the conditional mean, this suggests that OLS estimation is problematic.

Gretl has offered quantile regression functionality since version 1.7.5, in addition to basic LAD regression, which has been available since early in gretl's history via the lad command.1

39.2 Basic syntax

The basic invocation of quantile regression is

  quantreg tau reglist

where reglist is a standard gretl regression list (dependent variable followed by regressors, including the constant if an intercept is wanted), and tau is the desired conditional quantile, in the range 0.01 to 0.99, given either as a numerical value or the name of a pre-defined scalar variable (but see below for a further option).

Estimation is via the Frisch-Newton interior point solver (Portnoy and Koenker, 1997), which is substantially faster than the traditional Barrodale-Roberts (1974) simplex approach for large problems.

1 We gratefully acknowledge our borrowing from the quantreg package for GNU R (version 4.17). The core of the package is composed of Fortran code written by Roger Koenker; this is accompanied by various driver and auxiliary functions written in the R language by Koenker and Martin Mächler. The latter functions have been re-worked in C for gretl. We have added some guards against potential numerical problems in small samples.
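For example, using Engel's food-expenditure data (analyzed further below), median and ninth-decile regressions could be run as follows; this is just a minimal sketch of the syntax:

  open engel.gdt
  quantreg 0.5 foodexp 0 income   # median (LAD) regression
  quantreg 0.9 foodexp 0 income   # ninth decile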
By default, standard errors are computed according to the asymptotic formula given by Koenker and Bassett (1978). Alternatively, if the --robust option is given, we use the sandwich estimator developed in Koenker and Zhao (1994).2

39.3 Confidence intervals

An option --intervals is available. When this is given, we print confidence intervals for the parameter estimates instead of standard errors. These intervals are computed using the rank inversion method and in general they are asymmetrical about the point estimates; that is, they are not simply "plus or minus so many standard errors". The specifics of the calculation are inflected by the --robust option: without this, the intervals are computed on the assumption of IID errors (Koenker, 1994); with it, they use the heteroskedasticity-robust estimator developed by Koenker and Machado (1999).

By default, 90 percent intervals are produced. You can change this by appending a confidence value (expressed as a decimal fraction) to the intervals option, as in

  quantreg tau reglist --intervals=.95

When the confidence intervals option is selected, the parameter estimates are calculated using the Barrodale-Roberts method. This is simply because the Frisch-Newton code does not currently support the calculation of confidence intervals.

Two further details. First, the mechanisms for generating confidence intervals for quantile estimates require that the model has at least two regressors (including the constant). If the --intervals option is given for a model containing only one regressor, an error is flagged. Second, when a model is estimated in this mode you can retrieve the confidence intervals using the accessor $coeff_ci. This produces a k × 2 matrix, where k is the number of regressors. The lower bounds are in the first column, the upper bounds in the second. See also section 39.5 below.

39.4 Multiple quantiles

As a further option, you can give tau as a matrix, either the name of a predefined matrix or in numerical form, as in {.05, .25, .5, .75, .95}. The given model is estimated for all the τ values and the results are printed in a special form, as shown below (in this case the --intervals option was also given).

  Model 1: Quantile estimates, using the 235 observations 1-235
  Dependent variable: foodexp
  With 90 percent confidence intervals

  VARIABLE   TAU    COEFFICIENT    LOWER      UPPER

  const      0.05    124.880      98.3021    130.517
             0.25    95.4835      73.7861    120.098
             0.50    81.4822      53.2592    114.012
             0.75    62.3966      32.7449    107.314
             0.95    64.1040      46.2649    83.5790
  income     0.05    0.343361     0.343327   0.389750
             0.25    0.474103     0.420330   0.494329
             0.50    0.560181     0.487022   0.601989
             0.75    0.644014     0.580155   0.690413
             0.95    0.709069     0.673900   0.734441

2 These correspond to the iid and nid options in R's quantreg package, respectively.
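When a single τ is used with --intervals, bounds of this sort can also be captured for further use via the $coeff_ci accessor described in section 39.3. A minimal sketch:

  quantreg 0.25 foodexp 0 income --intervals
  matrix ci = $coeff_ci   # k x 2: lower bounds, upper bounds
  print ci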
with the OLS coefficient An example is shown in Figure 391 This sort of graph provides a simple means of judging whether quantile regression is redundant OLS is fine or informative In the example shownbased on data on household income and food expenditure gathered by Ernst Engel 18211896it seems clear that simple OLS regression is potentially misleading The crossing of the OLS estimate by the quantile estimates is very marked However it is not always clear what implications should be drawn from this sort of conflict With the Engel data there are two issues to consider First Engels famous law claims an income elasticity of food consumption that is less than one and talk of elasticities suggests a logarithmic formulation of the model Second there are two apparently anomalous observations in the data set household 105 has the thirdhighest income but unexpectedly low expenditure on food as judged from a simple scatter plot while household 138 which also has unexpectedly low food consumption has much the highest income almost twice that of the next highest With n 235 it seems reasonable to consider dropping these observations If we do so and adopt a loglog formulation we get the plot shown in Figure 392 The quantile estimates still cross the OLS estimate but the evidence against OLS is much less compelling the 90 percent confidence bands of the respective estimates overlap at all the quantiles considered A script to produce the results discussed above is presented in listing 391 395 Large datasets As noted above when you give the intervals option with the quantreg command which calls for estimation of confidence intervals via rank inversion gretl switches from the default Frisch Chapter 39 Quantile regression 410 076 078 08 082 084 086 088 09 092 094 096 0 02 04 06 08 1 tau Coefficient on logincome Quantile estimates with 90 band OLS estimate with 90 band Figure 392 Loglog regression 2 observations dropped from full Engel data set Listing 391 Food expenditure and income Engel data Download this data file is supplied with gretl open engelgdt specify some quantiles matrix tau 05 25 5 75 95 use levels of variables QM1 quantreg tau foodexp 0 income intervals use loglog specification with two outliers removed logs foodexp income smpl obs105 obs138 restrict QM2 quantreg tau lfoodexp 0 lincome intervals The script saves the two models as icons Doubleclicking on a models icon opens a window to display the results and the Graph menu in this window gives access to a tausequence plot Chapter 39 Quantile regression 411 Newton algorithm to the BarrodaleRoberts simplex method This is OK for moderately large datasets up to say a few thousand observations but on very large problems the simplex algorithm may become seriously bogged down For example Koenker and Hallock 2001 present an analysis of the determinants of birth weights using 198377 observations and with 15 regressors Generating confidence intervals via BarrodaleRoberts for a single value of τ took about half an hour on a Lenovo Thinkpad T60p with 183GHz Intel Core 2 processor If you want confidence intervals in such cases you are advised not to use the intervals option but to compute them using the method of plus or minus so many standard errors One Frisch Newton run took about 8 seconds on the same machine showing the superiority of the interior point method The script below illustrates quantreg 10 y 0 xlist scalar crit qnorm95 matrix ci coeff crit stderr ci cicoeff crit stderr print ci The matrix ci will contain the lower and upper bounds of the 
symmetrical 90 percent confidence intervals To avoid a situation where gretl becomes unresponsive for a very long time we have set the maxi mum number of iterations for the BorrodaleRoberts algorithm to the somewhat arbitrary value of 1000 We will experiment further with this but for the meantime if you really want to use this method on a large dataset and dont mind waiting for the results you can increase the limit using the set command with parameter rqmaxiter as in set rqmaxiter 5000 Chapter 40 Nonparametric methods 415 Listing 402 NadarayaWatson example Download Nonparametric regression example husbands age on wifes age open mroz87gdt initial value for the bandwidth scalar h nobs02 three increasingly smooth estimates series m0 nadarwatHA WA h series m1 nadarwatHA WA h 5 series m2 nadarwatHA WA h 10 produce the graph dataset sortby WA gnuplot HA m0 m1 m2 WA outputdisplay withlinesm0m1m2 30 35 40 45 50 55 60 30 35 40 45 50 55 60 HA WA m0 m1 m2 Figure 402 NadarayaWatson example for several choices of the bandwidth parameter Chapter 40 Nonparametric methods 416 If you need a point estimate of mX for some value of X which is not present among the valid observations of your dependent variable you may want to add some fake observations to your dataset in which y is missing and x contains the values you want mx evaluated at For example the following script evaluates mx at regular intervals between 20 and 20 nulldata 120 set seed 120496 first part of the sample actual data smpl 1 100 x normal y x2 sinx normal second part of the sample fake x data smpl 101 120 x obs110 5 compute the NadarayaWatson estimate with bandwidth equal to 04 note that 10002 0398 smpl full m nadarwaty x 04 show mx for the fake x values only smpl 101 120 print x m o and running it produces x m 101 18 1165934 102 16 0730221 103 14 0314705 104 12 0026057 105 10 0131999 106 08 0215445 107 06 0269257 108 04 0304451 109 02 0306448 110 00 0238766 111 02 0038837 112 04 0354660 113 06 0908178 114 08 1485178 115 10 2000003 116 12 2460100 117 14 2905176 118 16 3380874 119 18 3927682 120 20 4538364 Chapter 41 MIDAS models 418 Parameterization code string Normalized exponential Almon 1 nealmon Normalized beta zero last lag 2 beta0 Normalized beta nonzero last lag 3 betan Almon polynomial 4 almonp Oneparameter beta 5 beta1 Table 411 MIDAS parameterizations In the case of the nonnormalized Almon polynomial the γ coefficient in 412 is identically 10 and is omitted The beta1 case is the the same as the twoparameter beta0 except that θ1 is constrained to equal 1 leaving θ2 as the only free parameter Ghysels and Qian 2016 make a case for use of this particularly parsimonious version2 An additional function is provided for convenience it is named mlincomb and it combines mweights with the lincomb function which takes a list of series argument followed by a vector of coeffi cients and produces a series result namely a linear combination of the elements of the list If we have a suitable list X available we can do for example series foo mlincombX theta beta0 This is equivalent to series foo lincombX mweightsnelemX theta beta0 but saves a little typing and some CPU cycles 412 Estimating MIDAS models Gretl offers a dedicated command midasreg for estimation of MIDAS models Theres a corre sponding item MIDAS under the Time series section of the Model menu in the gretl GUI We begin by discussing that then move on to possibilities for defining your own estimator The syntax of midasreg looks like this midasreg depvar xlist midasterms options The 
41.2 Estimating MIDAS models

Gretl offers a dedicated command, midasreg, for estimation of MIDAS models. (There's a corresponding item, MIDAS, under the Time series section of the Model menu in the gretl GUI.) We begin by discussing that, then move on to possibilities for defining your own estimator.

The syntax of midasreg looks like this:

  midasreg depvar xlist ; midas-terms [ options ]

The depvar slot takes the name (or series ID number) of the dependent variable, and xlist is the list of regressors that are observed at the same frequency as the dependent variable; this list may contain lags of the dependent variable. The midas-terms slot accepts one or more specification(s) for high-frequency terms. Each of these specifications must conform to one or other of the following patterns:

  1. mds(mlist, minlag, maxlag, type, theta)
  2. mdsl(llist, type, theta)

In case 1, mlist must be a MIDAS list, as defined in section 20.2, which contains a full set of per-period series but no lags. Lags will be generated automatically, governed by the minlag and maxlag (integer) arguments, which may be given as numerical values or the names of predefined scalar variables. The integer (or string) type argument represents the type of parameterization; in addition to the values 1 to 4 defined in Table 41.1, a value of 0 (or the string umidas) indicates unrestricted MIDAS.

In case 2, llist is assumed to be a list that already contains the required set of high-frequency lags, as may be obtained via the hflags function described in section 20.3, hence minlag and maxlag are not wanted.

The final theta argument is optional in most cases (implying an automatic initialization of the hyperparameters). If this argument is given, it must take one of the following forms:

  1. The name of a matrix (vector) holding initial values for the hyperparameters, or a simple expression which defines a matrix using scalars, such as {1, 5}.
  2. The keyword null, indicating that an automatic initialization should be used (as happens when this argument is omitted).
  3. An integer value (in numerical form), indicating how many hyperparameters should be used (which again calls for automatic initialization).

The third of these forms is required if you want automatic initialization in the Almon polynomial case, since we need to know how many terms you wish to include. (In the normalized exponential Almon case we default to the usual two hyperparameters if theta is omitted or given as null.)

The midasreg syntax allows the user to specify multiple high-frequency predictors, if wanted: these can have different lag specifications, different parameterizations and/or different frequencies.

The options accepted by midasreg include --quiet (suppress printed output), --verbose (show detail of iterations, if applicable) and --robust (use a HAC estimator of the Newey-West type in computing standard errors). Two additional specialized options are described below.

Examples of usage

Suppose we have a dependent variable named dy and a MIDAS list named dX, and we wish to run a MIDAS regression using one lag of the dependent variable and high-frequency lags 1 to 10 of the series in dX. The following will produce U-MIDAS estimates:

  midasreg dy const dy(-1) ; mds(dX, 1, 10, 0)

The next lines will produce estimates for the normalized exponential Almon parameterization with two coefficients, both initialized to zero:

  midasreg dy const dy(-1) ; mds(dX, 1, 10, "nealmon", {0,0})

In the examples above, the required lags will be added to the dataset automatically, then deleted after use. If you are estimating several models using a single set of MIDAS lags it is more efficient to create the lags once and use the mdsl specifier. For example, the following estimates three variant parameterizations (exponential Almon, beta with zero last lag, and beta with non-zero last lag) on the same data:

  list dXL = hflags(1, 10, dX)
  midasreg dy 0 dy(-1) ; mdsl(dXL, "nealmon", {0,0})
  midasreg dy 0 dy(-1) ; mdsl(dXL, "beta0", {1,5})
  midasreg dy 0 dy(-1) ; mdsl(dXL, "betan", {1,1,0})
Any additional MIDAS terms should be separated by spaces, as in

  midasreg dy const dy(-1) ; mds(dX,1,9,1,theta1) mds(Z,1,6,3,theta2)

Replication exercise

We give a substantive illustration of midasreg in Listing 41.1. This replicates the first practical example discussed by Ghysels in the user's guide titled MIDAS Matlab Toolbox.3

3 See Ghysels (2015). This document announces itself as Version 2.0 of the guide, and is dated November 1, 2015. The example we're looking at appears on pages 24-26; the associated Matlab code can be found in the program app_ADL_MIDAS_1.m.

Listing 41.1: Script to replicate results given by Ghysels

  set verbose off
  open gdp_midas.gdt --quiet

  # form the dependent variable
  series dy = 100 * ldiff(qgdp)
  # form list of high-frequency lagged log differences
  list X = payems*
  list dXL = hflags(3, 11, hfldiff(X, 100))
  # initialize matrix to collect forecasts
  matrix FC = {}

  # estimation sample
  smpl 1985:1 2009:1

  print "unrestricted MIDAS (umidas)"
  midasreg dy 0 dy(-1) ; mdsl(dXL, 0)
  fcast --out-of-sample --static --quiet
  FC ~= {$fcast}

  print "normalized beta with zero last lag (beta0)"
  midasreg dy 0 dy(-1) ; mdsl(dXL, 2, {1,5})
  fcast --out-of-sample --static --quiet
  FC ~= {$fcast}

  print "normalized beta, non-zero last lag (betan)"
  midasreg dy 0 dy(-1) ; mdsl(dXL, 3, {1,1,0})
  fcast --out-of-sample --static --quiet
  FC ~= {$fcast}

  print "normalized exponential Almon (nealmon)"
  midasreg dy 0 dy(-1) ; mdsl(dXL, 1, {0,0})
  fcast --out-of-sample --static --quiet
  FC ~= {$fcast}

  print "Almon polynomial (almonp)"
  midasreg dy 0 dy(-1) ; mdsl(dXL, 4, 4)
  fcast --out-of-sample --static --quiet
  FC ~= {$fcast}

  smpl 2009:2 2011:2
  matrix my = {dy}
  print "Forecast RMSEs:"
  printf "  umidas  %.4f\n", fcstats(my, FC[,1])[2]
  printf "  beta0   %.4f\n", fcstats(my, FC[,2])[2]
  printf "  betan   %.4f\n", fcstats(my, FC[,3])[2]
  printf "  nealmon %.4f\n", fcstats(my, FC[,4])[2]
  printf "  almonp  %.4f\n", fcstats(my, FC[,5])[2]

Listing 41.2: Replication of Ghysels' results, partial output

  normalized beta, non-zero last lag (betan)

  Model 3: MIDAS (NLS), using observations 1985:1-2009:1 (T = 97)
  Using L-BFGS-B with conditional OLS
  Dependent variable: dy

               estimate    std. error   t-ratio   p-value
    const      0.748578    0.146404      5.113    1.74e-06
    dy_1       0.248055    0.118903      2.086    0.0398

    MIDAS list dXL, high-frequency lags 3 to 11

    HF_slope   1.72167     0.582076      2.958    0.0039
    Beta1      0.998501    0.0269479    37.05     1.10e-56
    Beta2      2.95148     2.93404       1.006    0.3171
    Beta3      0.0743143   0.0271273     2.739    0.0074

  Sum squared resid    28.78262   S.E. of regression   0.562399
  R-squared            0.356376   Adjusted R-squared   0.321012
  Log-likelihood     -78.71248    Akaike criterion   169.4250
  Schwarz criterion  184.8732     Hannan-Quinn       175.6715

  Almon polynomial (almonp)

  Model 5: MIDAS (NLS), using observations 1985:1-2009:1 (T = 97)
  Using Levenberg-Marquardt algorithm
  Dependent variable: dy

               estimate    std. error   t-ratio   p-value
    const      0.741403    0.146433      5.063    2.14e-06
    dy_1       0.255099    0.119139      2.141    0.0349

    MIDAS list dXL, high-frequency lags 3 to 11

    Almon0     1.06035     1.53491       0.6908   0.4914
    Almon1     0.193615    1.30812       0.1480   0.8827
    Almon2     0.140466    0.299446      0.4691   0.6401
    Almon3     0.0116034   0.0198686     0.5840   0.5607

  Sum squared resid    28.66623   S.E. of regression   0.561261
  R-squared            0.358979   Adjusted R-squared   0.323758
  Log-likelihood     -78.51596    Akaike criterion   169.0319
  Schwarz criterion  184.4802     Hannan-Quinn       175.2784

  Forecast RMSEs:
    umidas  0.5424
    beta0   0.5650
    betan   0.5210
    nealmon 0.5642
    almonp  0.5329

Two of the underlying optimization methods deserve comment:

L-BFGS-B with conditional OLS. L-BFGS is a "limited memory" version of the BFGS optimizer, and the trailing "-B" means that it supports bounds on the parameters, which is useful for reasons given below.

Golden Section search with conditional OLS. This is a line search method, used only when there is just a single hyperparameter to estimate.

Levenberg-Marquardt is the default NLS method, but if the MIDAS specifications include any of the beta variants, or normalized exponential Almon, we switch to L-BFGS-B, unless the user gives the --levenberg option.
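In scripts, the choice can be overridden per estimation. The lines below are a sketch (reusing dy and the lag list dXL from Listing 41.1) which would re-estimate the betan model while forcing Levenberg-Marquardt, or while requesting the structural-break test described below:

  midasreg dy 0 dy(-1) ; mdsl(dXL, 3, {1,1,0}) --levenberg
  midasreg dy 0 dy(-1) ; mdsl(dXL, 3, {1,1,0}) --breaktest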
The ability to set bounds on the hyperparameters via L-BFGS-B is helpful, first because the beta parameters (other than the third one, if applicable) must be non-negative, but also because one is liable to run into numerical problems (in calculating the weights and/or gradient) if their values become too extreme. For example, we have found it useful to place bounds of -2 and +2 on the exponential Almon parameters.

Here's what we mean by "conditional OLS" in the context of L-BFGS-B and line search: the search algorithm itself is only responsible for optimizing the MIDAS hyperparameters, and when the algorithm calls for calculation of the sum of squared residuals given a certain hyperparameter vector, we optimize the remaining parameters (coefficients on base-frequency regressors, slopes with respect to MIDAS terms) via OLS.

Testing for a structural break

The --breaktest option can be used to carry out the Quandt Likelihood Ratio (QLR) test for a structural break at the stage of running the final Gauss-Newton regression (to check for convergence and calculate the covariance matrix of the parameter estimates). This can be a useful aid to diagnosis, since non-homogeneity of the data over the estimation period can lead to numerical problems in nonlinear estimation, besides compromising the forecasting capacity of the resulting equation. For example, when this option is given with the command to estimate the betan model shown in Listing 41.2, the following result is appended to the standard output:

  QLR test for structural break -
    Null hypothesis: no structural break
    Test statistic: chi-square(6) = 35.1745 at observation 2005:2
    with asymptotic p-value = 0.000127727

Despite the strong evidence for a structural break in this case, the nonlinear estimator appears to converge successfully. But one might wonder if a shorter estimation period could provide better out-of-sample forecasts.

Defining your own MIDAS estimator

As explained above, the midasreg command is in effect a "wrapper" for various underlying methods. Some users may wish to undo the wrapping. This would be required if you wish to introduce any nonlinearity other than that associated with the stock MIDAS parameterizations, or to define your own MIDAS parameterization. Anyone with ambitions in this direction will presumably be quite familiar with the commands and functions available in hansl, gretl's scripting language, so we will not say much here beyond presenting a couple of examples.

First, we show how the nls command can be used, along with the MIDAS-related functions described in section 41.1, to estimate a model with the exponential Almon specification. The example begins by preparing the data:

  open gdp_midas.gdt --quiet
  series dy = 100 * ldiff(qgdp)
  series dy1 = dy(-1)
  list X = payems*
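In outline, such an estimator can then be set up along the following lines. This is a sketch only, not the guide's own listing: the parameter names are hypothetical, mlincomb recomputes the weighted lag combination at each iteration, and the hyperparameter vector enters estimation via the params list:

  # sketch: exponential Almon MIDAS via nls
  list dXL = hflags(3, 11, hfldiff(X, 100))
  matrix theta = {0, 0}   # nealmon hyperparameters
  scalar b0 = mean(dy)    # intercept
  scalar b1 = 0           # coefficient on lagged dy
  scalar hfs = 0          # high-frequency slope
  nls dy = b0 + b1*dy1 + hfs*mdx
    series mdx = mlincomb(dXL, theta, "nealmon")
    params b0 b1 hfs theta
  end nls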
Listing 41.3 shows a second, more elaborate case: manual estimation of the one-parameter beta specification.

Listing 41.3: Manual MIDAS: one-parameter beta specification

  set verbose off

  function scalar beta1_SSR (scalar th2, const series y,
                             const series x, list L)
    matrix theta = {1, th2}
    series mdx = mlincomb(L, theta, 2)
    # run OLS conditional on theta
    ols y 0 x mdx --quiet
    return $ess
  end function

  function matrix midas_GNR (const matrix theta, const series y,
                             const series x, list L, int type)
    # Gauss-Newton regression
    series mdx = mlincomb(L, theta, type)
    ols y 0 x mdx --quiet
    matrix b = $coeff
    matrix u = {$uhat}
    matrix mgrad = mgradient(nelem(L), theta, type)
    matrix M = {const} ~ {x} ~ {mdx} ~ (b[3] * {L} * mgrad)
    matrix V
    set svd on # in case of strong collinearity
    mols(u, M, null, &V)
    return (b | theta) ~ sqrt(diag(V))
  end function

  # main

  open gdp_midas.gdt --quiet
  series dy = 100 * ldiff(qgdp)
  series dy1 = dy(-1)
  list dX = ld_payem*
  list dXL = hflags(3, 11, dX)

  # estimation sample
  smpl 1985:1 2009:1

  matrix b = {0, 1.01, 100}
  # use Golden Section minimizer
  SSR = GSSmin(&b, beta1_SSR(b[1], dy, dy1, dXL), 1.0e-6)
  printf "SSR (GSS) = %.15g\n", SSR
  matrix theta = {1, b[1]}' # column vector needed
  matrix bse = midas_GNR(theta, dy, dy1, dXL, 2)
  bse[4,2] = NA # mask std error of clamped coefficient
  modprint bse "const dy(-1) HF_slope Beta1 Beta2"

Chapter 42 Gretl and ODBC

Gretl provides a method for retrieving data from databases which support the Open Database Connectivity (ODBC) standard. Most users won't be interested in this, but there may be some for whom this feature matters a lot: typically, those who work in an environment where huge data collections are accessible via a Data Base Management System (DBMS).

In the following section we explain what is needed for ODBC support in gretl. We provide some background information on how ODBC works in section 42.2, and explain the details of getting gretl to retrieve data from a database in section 42.3. Section 42.4 provides some examples of usage, and section 42.5 gives some details on the management of ODBC connections.

42.1 ODBC support

The piece of software that bridges between gretl and the ODBC system is a dynamically loaded "plugin". This is included in the gretl packages for MS Windows and Mac OS X. On other unix-type platforms (notably Linux) you may have to build gretl from source to get ODBC support. This is because the plugin depends on having unixODBC installed, which we cannot assume to be the case on typical Linux systems. To enable the ODBC plugin when building gretl, you must pass the option --with-odbc to gretl's configure script. In addition, if unixODBC is installed in a non-standard location, you will have to specify its installation prefix using --with-ODBC-prefix, as in (for example)

  ./configure --with-odbc --with-ODBC-prefix=/opt/ODBC

42.2 ODBC base concepts

ODBC is short for Open DataBase Connectivity, a group of software methods that enable a client to interact with a database server. The most common operation is when the client fetches some data from the server; ODBC acts as an intermediate layer between client and server, so the client "talks" to ODBC rather than accessing the server directly (see Figure 42.1).

Figure 42.1: Retrieving data via ODBC (the client sends a query through ODBC and receives data in return)

For the above mechanism to work, it is necessary that the relevant ODBC software is installed and working on the client machine (contact your DB administrator for details). At this point, the database (or databases) that the server provides will be accessible to the client as a data source with a specific identifier (a Data Source Name, or DSN); in most cases, a username and a password are required to connect to the data source.

Once the connection is established, the user sends a query to ODBC, which contacts the database manager, collects the results and sends them back to the user. The query is almost invariably formulated in a special language used for the purpose, namely SQL.1 We will not provide here an SQL tutorial: there are many such tutorials on the Net; besides, each database manager tends to support its own SQL dialect, so the precise form of an SQL query may vary slightly if the DBMS on the other end is Oracle, MySQL, PostgreSQL or something else.

1 See http://en.wikipedia.org/wiki/SQL.

Suffice it to say that the main statement for retrieving data is the SELECT statement. Within a DBMS, data are organized in tables, which are roughly equivalent to spreadsheets. The SELECT statement returns a subset of a table, which is itself a table. For example, imagine that the database holds a table called "NatAccounts", containing the data shown in Table 42.1.

Table 42.1: The "NatAccounts" table

  year  qtr   gdp      consump     tradebal
  1970   1    584763   344746.9    -5891.01
  1970   2    597746   350176.9    -7068.71
  1970   3    604270   355249.7    -8379.27
  1970   4    609706   361794.7    -7917.61
  1971   1    609597   362490.0    -6274.30
  1971   2    617002   368313.6    -6658.76
  1971   3    625536   372605.0    -4795.89
  1971   4    630047   377033.9    -6498.13
The SQL statement

  SELECT qtr, tradebal, gdp FROM NatAccounts WHERE year=1970

produces the subset of the original data shown in Table 42.2.

Table 42.2: Result of a SELECT statement

  qtr   tradebal   gdp
   1    -5891.01   584763
   2    -7068.71   597746
   3    -8379.27   604270
   4    -7917.61   609706

Gretl provides a mechanism for forwarding your query to the DBMS via ODBC and including the results in your currently open dataset.

42.3 Syntax

At present we do not offer a graphical interface for ODBC import; this must be done via the command line interface. The two commands used for fetching data via an ODBC connection are open and data.

The open command is used for connecting to a DBMS: its syntax is

  open dsn=database [user=username] [password=password] --odbc

The user and password items are optional; the effect of this command is to initiate an ODBC connection. It is assumed that the machine gretl runs on has a working ODBC client installed.

In order to actually retrieve the data, the data command is used. Its syntax is

  data series [obs-format=format-string] query=query-string --odbc

where:

series is a list of names of gretl series to contain the incoming data, separated by spaces. Note that these series need not exist prior to the ODBC import.

format-string is an optional parameter, used to handle cases when a "rectangular" organisation of the database cannot be assumed (more on this later).

query-string is a string containing the SQL statement used to extract the data.

There should be no spaces around the equals signs in the obs-format and query fields in the data command.

The query-string can, in principle, contain any valid SQL statement which results in a table. This string may be specified directly within the command, as in

  data x query="SELECT foo FROM bar" --odbc

which will store into the gretl variable x the content of the column foo from the table bar. However, since in a real-life situation the string containing the SQL statement may be rather long, it may be best to store it in a string variable. For example:

  string SqlQry = "SELECT foo1, foo2 FROM bar"
  data x y query=SqlQry --odbc

The observation format specifier

If the optional parameter obs-format is absent, as in the above example, the SQL query should return k columns of data, where k is the number of series names listed in the data command. It may be necessary to include a smpl command before the data command to set up the right "window" for the incoming data. In addition, if one cannot assume that the data will be delivered in the correct order (typically, chronological order), the SQL query should contain an appropriate ORDER BY clause.
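For instance, to pull two of the NatAccounts columns from Table 42.1 into a quarterly dataset while guarding against misordered rows, one might do something like the following. This is just a sketch, reusing the fictitious table above, and it assumes a suitable quarterly dataset is in place and an ODBC connection has already been opened:

  smpl 1970:1 1971:4
  string Qry = "SELECT gdp, consump FROM NatAccounts ORDER BY year, qtr"
  data gdp consump query=Qry --odbc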
The optional format string is used for those cases when there is no certainty that the data from the query will arrive in the same order as the gretl dataset. This may happen when missing values are interspersed within a column, or with data that do not have a natural ordering, e.g. cross-sectional data. In this case, the SQL statement should return a table with m + k columns, where the first m columns are used to identify the observation (or row) in the gretl dataset into which the actual data values in the final k columns should be placed. The obs-format string is used to translate the first m fields into a string which matches the string gretl uses to identify observations in the currently open dataset. Up to three columns can be used for this purpose (m ≤ 3).

Note that the strings gretl uses to identify observations can be seen by printing any variable "by observation", as in

  print index --byobs

(The series named index is automatically added to a dataset created via the nulldata command.)

The format specifiers available for use with obs-format are as follows:

  %d   print an integer value
  %s   print a string value
  %g   print a floating-point value

In addition, the format can include literal characters to be passed through, such as slashes or colons, to make the resulting string compatible with gretl's observation identifiers.

For example, consider the following fictitious case: we have a 5-days-per-week dataset, to which we want to add the stock index for the Verdurian market;2 it so happens that in Verduria, Saturdays are working days but Wednesdays are not. We want a column which does not contain data on Saturdays, because we wouldn't know where to put them, but at the same time we want to place missing values on all the Wednesdays.

In this case, the following syntax could be used:

  string QRY = "SELECT year, month, day, VerdSE FROM AlmeaIndexes"
  data y obs-format="%d-%d-%d" query=QRY --odbc

The column VerdSE holds the data to be fetched, which will go into the gretl series y. The first three columns are used to construct a string which identifies the day. Daily dates take the form YYYY-MM-DD in gretl. If a row from the DBMS produces the observation string 2008-04-01, this will match OK (it's a Tuesday), but 2008-04-05 will not match, since it is a Saturday; the corresponding row will therefore be discarded. On the other hand, since no string 2008-04-23 will be found in the data coming from the DBMS (it's a Wednesday), that entry is left blank in our series y.

2 See http://www.almeopedia.com/index.php/Verduria.

42.4 Examples

Table 42.3: Example AWM database: structure

  Table Consump              Table DATA
  Field     Type             Field     Type
  time      decimal(7,2)     year      decimal(4,0)
  income    decimal(16,6)    qtr       decimal(1,0)
  consump   decimal(16,6)    varname   varchar(16)
                             xval      decimal(20,10)

Table 42.4: Example AWM database: data

  Table Consump
  1970.00   424278.975500   344746.944000
  1970.25   433218.709400   350176.890400
  1970.50   440954.219100   355249.672300
  1970.75   446278.664700   361794.719900
  1971.00   447752.681800   362489.970500
  1971.25   453553.860100   368313.558500
  1971.50   460115.133100   372605.015300

  Table DATA
  1970   1   CAN      517.9085000000
  1970   2   CAN      662.5996000000
  1970   3   CAN     1130.4155000000
  1970   4   CAN      467.2508000000
  1970   1   COMPR     18.4000000000
  1970   2   COMPR     18.6341000000
  1970   3   COMPR     18.3000000000
  1970   4   COMPR     18.2663000000
  1970   1   D1         1.0000000000
  1970   2   D1         0.0000000000

In the following examples, we will assume that access is available to a database known to ODBC with the data source name "AWM", with username "Otto" and password "Bingo". The database "AWM" contains quarterly data in two tables (see Tables 42.3 and 42.4).

The table Consump is the classic "rectangular" dataset; that is, its internal organization is the same as in a spreadsheet or econometrics package: each row is a data point and each column is a variable. The structure of the DATA table is different: each record is one figure, stored in the column xval, and the other fields keep track of which variable it belongs to, for which date.

Listing 42.1: Simple query from a rectangular table

  nulldata 160
  setobs 4 1970:1 --time-series
  open dsn=AWM user=Otto password=Bingo --odbc

  string Qry = "SELECT consump, income FROM Consump"
  data cons inc query=Qry --odbc

Listing 42.1 shows a query for two series: first we set up an empty quarterly dataset. Then we connect to the database using the open statement. Once the connection is established, we retrieve two columns from the Consump table. No observation string is required because the data already have a suitable structure; we need only import the relevant columns.
Listing 42.2: Simple query from a non-rectangular table

  string S = "select year, qtr, xval from DATA where varname='WLN' ORDER BY year, qtr"
  data wln obs-format="%d:%d" query=S --odbc

In Listing 42.2, by contrast, we make use of the observation string, since we are drawing from the DATA table, which is not rectangular. The SQL statement stored in the string S produces a table with three columns. The ORDER BY clause ensures that the rows will be in chronological order, although this is not strictly necessary in this case.

42.5 Connectivity details

It may be helpful to supply some details on gretl's management of ODBC connections. First, when the open command is invoked with the --odbc option, gretl checks to see if a connection to the specified DSN (Data Source Name) can be established via the ODBC function SQLConnect. If not, an error is flagged; if so, the connection is dropped (SQLDisconnect), but the DSN details are stored. The stored DSN then remains the implicit source for subsequent invocation of the data command, with the --odbc option, until a countermanding open command is issued.

Each time an ODBC-related data command is issued, gretl attempts to re-establish a connection to the given DSN; the connection is dropped once the data transfer is complete.

Listing 42.3: Handling of missing values for a non-rectangular table

  string foo = "select year, qtr, xval from DATA where varname='STN' AND qtr>1"
  data bar obs-format="%d:%d" query=foo --odbc
  print bar --byobs

Listing 42.3 shows what happens if the rows in the outcome from the SELECT statement do not match the observations in the currently open gretl dataset. The query includes a condition which filters out all the data from the first quarter. The query result (invisible to the user) would be something like

  year  qtr   xval
  1970   2    7.8705000000
  1970   3    7.5600000000
  1970   4    7.1892000000
  1971   2    5.8679000000
  1971   3    6.2442000000
  1971   4    5.9811000000
  1972   2    4.6883000000
  1972   3    4.6302000000

Internally, gretl fills the variable bar with the corresponding value if it finds a match; otherwise, NA is used. Printing out the variable bar thus produces

  Obs       bar
  1970:1
  1970:2   7.8705
  1970:3   7.5600
  1970:4   7.1892
  1971:1
  1971:2   5.8679
  1971:3   6.2442
  1971:4   5.9811
  1972:1
  1972:2   4.6883
  1972:3   4.6302
Figure 43.1: LaTeX menu in model window

Table 43.1: Example of LaTeX tabular output

  Model 1: OLS estimates using the 51 observations 1-51
  Dependent variable: ENROLL

  Variable   Coefficient    Std. Error    t-statistic   p-value
  const       0.241105      0.0660225       3.6519      0.0007
  CATHOL      0.223530      0.0459701       4.8625      0.0000
  PUPIL      -0.00338200    0.00271962     -1.2436      0.2198
  WHITE      -0.152643      0.0407064      -3.7499      0.0005

  Mean of dependent variable          0.0955686
  S.D. of dependent variable          0.0522150
  Sum of squared residuals            0.0709594
  Standard error of residuals (σ̂)     0.0388558
  Unadjusted R²                       0.479466
  Adjusted R²                         0.446241
  F(3, 47)                            14.4306

The distinction between the "Copy" and "Save" options (for both tabular and equation) is twofold. First, Copy puts the TeX source on the clipboard, while with Save you are prompted for the name of a file into which the source should be saved. Second, with Copy the material is copied as a "fragment", while with Save it is written as a complete file. The point is that a well-formed TeX source file must have a header that defines the documentclass (article, report, book or whatever) and tags that say \begin{document} and \end{document}. This material is included when you do Save, but not when you do Copy, since in the latter case the expectation is that you will paste the data into an existing TeX source file that already has the relevant apparatus in place.

The items under "Equation options" should be self-explanatory: when printing the model in equation form, do you want standard errors or t-ratios displayed in parentheses under the parameter estimates? The default is to show standard errors; if you want t-ratios, select that item.

Other windows

Several other sorts of output windows also have TeX preview, copy and save enabled. In the case of windows having a graphical toolbar, look for the TeX button. Figure 43.2 shows this icon (second from the right on the toolbar), along with the dialog that appears when you press the button.

Figure 43.2: TeX icon and dialog

One aspect of gretl's TeX support that is likely to be particularly useful for publication purposes is the ability to produce a typeset version of the "model table" (see section 3.4). An example of this is shown in Table 43.2.

Table 43.2: Example of model table output

  OLS estimates
  Dependent variable: ENROLL

               Model 1      Model 2      Model 3
  const        0.2907       0.2411       0.08557
              (0.07853)    (0.06602)    (0.05794)
  CATHOL       0.2216       0.2235       0.2065
              (0.04584)    (0.04597)    (0.05160)
  PUPIL       -0.003035    -0.003382    -0.001697
              (0.002727)   (0.002720)   (0.003025)
  WHITE       -0.1482      -0.1526
              (0.04074)    (0.04071)
  ADMEXP                                -0.1551
                                        (0.1342)
  n            51           51           51
  R²           0.4502       0.4462       0.2956
  ℓ            96.09        95.36        88.69

  Standard errors in parentheses
  * indicates significance at the 10 percent level
  ** indicates significance at the 5 percent level

43.3 Fine-tuning typeset output

There are three aspects to this: adjusting the appearance of the output produced by gretl in LaTeX preview mode; adjusting the formatting of gretl's tabular output for models when using the tabprint command; and incorporating gretl's output into your own TeX files.

Previewing in the GUI

As regards preview mode, you can control the appearance of gretl's output using a file named gretlpre.tex, which should be placed in your gretl user directory (see the Gretl Command Reference). If such a file is found, its contents will be used as the "preamble" to the TeX source. The default value of the preamble is as follows:

  \documentclass[11pt]{article}
  \usepackage[utf8]{inputenc}
  \usepackage{amsmath}
  \usepackage{dcolumn,longtable}
  \begin{document}
  \thispagestyle{empty}

Note that the amsmath and dcolumn packages are required. (For some sorts of output the longtable package is also needed.) Beyond that, you can, for instance, change the type size or the font by altering the documentclass declaration or including an alternative font package.

In addition, if you wish to typeset gretl output in more than one language, you can set up per-language preamble files. A "localized" preamble file is identified by a name of the form gretlpre_xx.tex, where xx is replaced by the first two letters of the current setting of the LANG environment variable. For example, if you are running the program in Polish, using LANG=pl_PL, then gretl will do the following when writing the preamble for a TeX source file:

  1. Look for a file named gretlpre_pl.tex in the gretl user directory. If this is not found, then
  2. look for a file named gretlpre.tex in the gretl user directory. If this is not found, then
  3. use the default preamble.

Conversely, suppose you usually run gretl in a language other than English, and have a suitable gretlpre.tex file in place for your native language. If on some occasions you want to produce TeX output in English, then you could create an additional file gretlpre_en.tex: this file will be used for the preamble when gretl is run with a language setting of, say, en_US.
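For instance, a gretlpre.tex along the following lines (just a sketch; the font package is an arbitrary choice) would switch the preview output to 12-point, Times-like type while retaining the required packages:

  \documentclass[12pt]{article}
  \usepackage[utf8]{inputenc}
  \usepackage{mathptmx}
  \usepackage{amsmath}
  \usepackage{dcolumn,longtable}
  \begin{document}
  \thispagestyle{empty}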
Command-line options

After estimating a model via a script (or interactively, via the gretl console or using the command-line program gretlcli), you can use the commands tabprint or eqnprint to print the model to file in tabular format or equation format respectively. These options are explained in the Gretl Command Reference.

If you wish to alter the appearance of gretl's tabular output for models in the context of the tabprint command, you can specify a custom row format using the --format flag. The format string must be enclosed in double quotes and must be tied to the flag with an equals sign. The pattern for the format string is as follows. There are four fields, representing the coefficient, standard error, t-ratio and p-value respectively. These fields should be separated by vertical bars; they may contain a printf-type specification for the formatting of the numeric value in question, or may be left blank to suppress the printing of that column (subject to the constraint that you can't leave all the columns blank). Here are a few examples:

  --format="%.4f|%.4f|%.4f|%.4f"
  --format="%.4f|%.4f|%.3f|"
  --format="%.5f|%.4f||%.4f"
  --format="%.8g|%.8g||%.4f"

The first of these specifications prints the values in all columns using 4 decimal places. The second suppresses the p-value and prints the t-ratio to 3 places. The third omits the t-ratio. The last one again omits the t, and prints both coefficient and standard error to 8 significant figures.

Once you set a custom format in this way, it is remembered and used for the duration of the gretl session. To revert to the default formatting you can use the special variant --format=default.
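A typical scripted sequence might therefore look like this. This is a sketch, reusing the specification of Table 43.1; it assumes a model has just been estimated and that the --output flag, naming the file to write, behaves as in current gretl:

  ols ENROLL 0 CATHOL PUPIL WHITE --quiet
  tabprint --format="%.4f|%.4f|%.3f|" --output=model1.tex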
Further editing

Once you have pasted gretl's TeX output into your own document, or saved it to file and opened it in an editor, you can of course modify the material in any way you wish. In some cases, machine-generated TeX is hard to understand, but gretl's output is intended to be human-readable and editable. In addition, it does not use any non-standard style packages. Besides the standard LaTeX document classes, the only files needed are, as noted above, the amsmath, dcolumn and longtable packages. These should be included in any reasonably full TeX implementation.

43.4 Installing and learning TeX

This is not the place for a detailed exposition of these matters, but here are a few pointers.

So far as we know, every GNU/Linux distribution has a package or set of packages for TeX, and in fact these are likely to be installed by default. Check the documentation for your distribution. For MS Windows, several packaged versions of TeX are available: one of the most popular is MiKTeX, at http://www.miktex.org/. For Mac OS X, a nice implementation is iTeXMac, at http://itexmac.sourceforge.net/. An essential starting point for online TeX resources is the Comprehensive TeX Archive Network (CTAN), at http://www.ctan.org/.

As for learning TeX, many useful resources are available both online and in print. Among online guides, Tony Roberts' "LaTeX: from quick and dirty to style and finesse" is very helpful, at http://www.sci.usq.edu.au/staff/robertsa/LaTeX/latexintro.html. An excellent source for advanced material is The LaTeX Companion (Goossens et al., 2004).

Chapter 44 Gretl and R

44.1 Introduction

R is, by far, the largest free statistical project.1 Like gretl, it is a GNU project and the two have a lot in common; however, gretl's approach focuses on ease of use much more than R, which instead aims to encompass the widest possible range of statistical procedures.

As is natural in the free software ecosystem, we don't view ourselves as "competitors" to R,2 but rather as projects sharing a common goal, who should support each other whenever possible. For this reason, gretl provides a way to interact with R and thus enable users to pool the capabilities of the two packages.

In this chapter, we will explain how to exploit R's power from within gretl. We assume that the reader has a working installation of R available, and a basic grasp of R's syntax.3

Despite several valiant attempts, no graphical shell has gained wide acceptance in the R community: by and large, the standard method of working with R is by writing scripts, or by typing commands at the R prompt, much in the same way as one would write gretl scripts or work with the gretl console. In this chapter, the focus will be on the methods available to execute R commands without leaving gretl.

1 R's homepage is at http://www.r-project.org/.
2 OK, who are we kidding? But it's friendly competition!
3 The main reference for R documentation is http://cran.r-project.org/manuals.html. In addition, R tutorials abound on the Net; as always, Google is your friend.

44.2 Starting an interactive R session

The easiest way to use R from gretl is in interactive mode. Once you have your data loaded in gretl, you can select the menu item "Tools, Start GNU R" and an interactive R session will be started, with your dataset automatically pre-loaded.

A simple example: OLS on cross-section data

For this example we use Ramanathan's dataset data4-1, one of the sample files supplied with gretl. We first run, in gretl, an OLS regression of price on sqft, bedrms and baths. The basic results are shown in Table 44.1.

Table 44.1: OLS house price regression via gretl

  Variable   Coefficient   Std. Error   t-statistic   p-value
  const      129.062       88.3033        1.4616      0.1746
  sqft         0.154800     0.0319404     4.8465      0.0007
  bedrms     -21.587       27.0293       -0.7987      0.4430
  baths      -12.192       43.2500       -0.2819      0.7838

We will now replicate the above results using R. Select the menu item "Tools, Start GNU R". A window similar to the one shown in Figure 44.1 should appear.

Figure 44.1: R window

The actual look of the R window may be somewhat different from what you see in Figure 44.1 (especially for Windows users), but this is immaterial. The important point is that you have a window where you can type commands to R. If the above procedure doesn't work and no R window opens, it means that gretl was unable to launch R. You should ensure that R is installed and working on your system, and that gretl knows where it is. The relevant settings can be found by selecting the "Tools, Preferences, General" menu entry, under the "Programs" tab.

Assuming R was launched successfully, you will see notification that the data from gretl are available. In the background, gretl has arranged for two R commands to be executed: one to load the gretl dataset in the form of a "data frame" (one of several forms in which R can store data), and one to "attach" the data, so that the variable names defined in the gretl workspace are available as valid identifiers within R.

In order to replicate gretl's OLS estimation, go into the R window and type at the prompt

  model <- lm(price ~ sqft + bedrms + baths)
  summary(model)

You should see something similar to Figure 44.2. Surprise: the estimates coincide! To get out, just close the R window, or type q() at the R prompt.

Figure 44.2: OLS regression on house prices via R

Time series data

We now turn to an example which uses time series data: we will compare gretl's and R's estimates of Box and Jenkins' immortal "airline" model. The data are contained in the bjg sample dataset. The following gretl code

  open bjg
  arima 0 1 1 ; 0 1 1 ; lg --nc

produces the estimates shown in Table 44.2.
Table 44.2: Airline model from Box and Jenkins (1976), selected portion of gretl's estimates

  Variable   Coefficient   Std. Error   t-statistic   p-value
  θ1         -0.401824     0.0896421    -4.4825       0.0000
  Θ1         -0.556936     0.0731044    -7.6184       0.0000

  Variance of innovations          0.00134810
  Log-likelihood                 244.696
  Akaike information criterion  -483.39

If we now open an R session as described in the previous subsection, the data-passing mechanism is slightly different. Since our data were defined in gretl as time series, we use an R time-series object (ts for short) for the transfer. In this way we can retain in R useful information such as the periodicity of the data and the sample limits. The downside is that the names of individual series, as defined in gretl, are not valid identifiers. In order to extract the variable lg, one needs to use the syntax lg <- gretldata[, "lg"].

ARIMA estimation can be carried out by issuing the following two R commands:

  lg <- gretldata[, "lg"]
  arima(lg, c(0,1,1), seasonal=c(0,1,1))

which yield

  Coefficients:
            ma1     sma1
        -0.4018  -0.5569
  s.e.   0.0896   0.0731

  sigma^2 estimated as 0.001348:  log likelihood = 244.7,  aic = -483.4

Happily, the estimates again coincide.

44.3 Running an R script

Opening an R window and keying in commands is a convenient method when the job is small. In some cases, however, it would be preferable to have R execute a script prepared in advance. One way to do this is via the source() command in R. Alternatively, gretl offers the facility to edit an R script and run it, having the current dataset pre-loaded automatically. This feature can be accessed via the "File, Script Files" menu entry. By selecting "User file", one can load a pre-existing R script; if you want to create a new script instead, select the "New script, R script" menu entry.

Figure 44.3: Editing window for R scripts

In either case, you are presented with a window very similar to the editor window used for ordinary gretl scripts, as in Figure 44.3.

There are two main differences. First, you get syntax highlighting for R's syntax instead of gretl's. Second, clicking on the Execute button (the "gears" icon) launches an instance of R in which your commands are executed. Before R is actually run, you are asked if you want to run R interactively or not (see Figure 44.4).

Figure 44.4: Editing window for R scripts

An interactive run opens an R instance similar to the one seen in the previous section: your data will be pre-loaded (if the "pre-load data" box is checked) and your commands will be executed. Once this is done, you will find yourself at the R prompt, where you can enter more commands.

A non-interactive run, on the other hand, will execute your script, collect the output from R and present it to you in an output window; R will be run in the background. If, for example, the script in Figure 44.3 is run non-interactively, a window similar to Figure 44.5 will appear.

Figure 44.5: Output from a non-interactive R run

44.4 Sending data back and forth

As regards the passing of data between the two programs, so far we have only considered passing series from gretl to R. In order to achieve a satisfactory degree of interoperability, more is needed. In the following subsections we see how matrices can be exchanged, and how data can be passed from R back to gretl.

Passing matrices from gretl to R

For passing matrices from gretl to R, you can use the mwrite matrix function described in section 17.7. For example, the following gretl code fragment generates the matrix

      [ 3   7  11 ]
  A = [ 4   8  12 ]
      [ 5   9  13 ]
      [ 6  10  14 ]

and stores it into the file mymatfile.mat in the user's "dotdir" (see section 15.2). Note that writing to this special directory, which is sure to exist and be writable by the user, is mandated by the non-zero value for the third, optional argument to mwrite.

  matrix A = mshape(seq(3,14), 4, 3)
  err = mwrite(A, "mymatfile.mat", 1)
The recommended R code to import such a matrix is

  A <- gretl.loadmat("mymatfile.mat")

The function gretl.loadmat, which is predefined when R is called from gretl, retrieves the matrix from dotdir. (The "mat" extension for gretl matrix files is not compulsory; you can name these files as you wish.)

It's also possible to take more control over the details of the transfer if you wish. You have the built-in string variable $dotdir in gretl, while in R you have the same variable under the name gretl.dotdir. To use a location other than dotdir, you may (a) omit the third argument to mwrite and supply a full path to the matrix file, and (b) use a more "generic" approach to reading the file in R. Here's an example:

Gretl side:

  mwrite(A, "/path/to/mymatfile.mat")

R side:

  A <- as.matrix(read.table("/path/to/mymatfile.mat", skip=1))

Passing data from R to gretl

For passing data in the opposite direction, gretl defines a special function that can be used in the R environment. An R object will be written as a temporary file in dotdir, from where it can be easily retrieved from within gretl.

The name of this function is gretl.export(); it takes one required argument, the object to be exported. At present, the objects that can be exported with this method are matrices, data frames and time-series objects. The function creates a text file, by default with the same name as the exported object (plus an appropriate suffix), in gretl's temporary directory. Data frames and time-series objects are stored as CSV files, and can be retrieved by using gretl's append command. Matrices are stored in a special text format that is understood by gretl (see section 17.7); the file suffix is in this case .mat, and to read the matrix in gretl you must use the mread() function.

This function also has an optional second argument, namely a string which specifies a basename for the export file, in case you want to use a name other than that attached to the object within R. As in the default case, an appropriate suffix (.csv or .mat) will be added to the basename.

As an example, we take the airline data and use them to estimate a structural time series model à la Harvey (1989).4 The model we will use is the Basic Structural Model (BSM), in which a time series is decomposed into three terms:

  y_t = µ_t + γ_t + ε_t

where µ_t is a trend component, γ_t is a seasonal component and ε_t is a noise term. In turn, the following is assumed to hold:

  ∆µ_t = β_{t-1} + η_t
  ∆β_t = ζ_t
  ∆_s γ_t = ω_t

where ∆_s is the seasonal differencing operator, 1 - L^s, and η_t, ζ_t and ω_t are mutually uncorrelated white noise processes. The object of the analysis is to estimate the variances of the noise components (which may be zero), and to recover estimates of the latent processes µ_t (the "level"), β_t (the "slope") and γ_t.

We will use R's StructTS command, and import the results back into gretl. Once the bjg dataset is loaded in gretl, we pass the data to R and execute the following script:

  # extract the log series
  y <- gretldata[, "lg"]
  # estimate the model
  strmod <- StructTS(y)
  # save the fitted components (smoothed)
  compon <- as.ts(tsSmooth(strmod))
  # save the estimated variances
  vars <- as.matrix(strmod$coef)

  # export into gretl's temp dir
  gretl.export(compon)
  gretl.export(vars)

Running this script via gretl produces minimal output:

  current data loaded as ts object "gretldata"
  wrote /home/cottrell/.gretl/compon.csv
  wrote /home/cottrell/.gretl/vars.mat

However, we are now able to pull the results back into gretl by executing the following commands, either from the console or by creating a small script:

  string fname = sprintf("%s/compon.csv", $dotdir)
  append @fname
  vars = mread("vars.mat", 1)
The first command reads the estimated time-series components from a CSV file, which is the format that the passing mechanism employs for series. The matrix vars is read from the file vars.mat.

After the above commands have been executed, three new series will have appeared in the gretl workspace, namely the estimates of the three components; by plotting them together with the original data, you should get a graph similar to Figure 44.6.

Figure 44.6: Estimated components from BSM (four panels: the log series lg, plus the estimated level, slope and seasonal components, 1949-1961)

The estimates of the variances can be seen by printing the vars matrix, as in

  ? print vars
  vars (4 x 1)

    0.00077185
        0.0000
     0.0013969
        0.0000

That is,

  σ̂²_η = 0.00077185,  σ̂²_ζ = 0,  σ̂²_ω = 0.0013969,  σ̂²_ε = 0

Notice that, since σ̂²_ζ = 0, the estimate for β_t is constant and the level component is simply a random walk with a drift.

4 The function package StucTiSM is available to handle this class of models natively in gretl.

44.5 Interacting with R from the command line

Up to this point we have spoken only of interaction with R via the GUI program. In order to do the same from the command line interface, gretl provides the foreign command. This enables you to embed non-native commands within a gretl script.

A "foreign" block takes the form

  foreign language=R [--send-data[=list]] [--quiet]
      ... R commands ...
  end foreign

and achieves the same effect as submitting the enclosed R commands via the GUI in the non-interactive mode (see section 44.3 above). The --send-data option arranges for auto-loading of the data present in the gretl session, or a subset thereof specified via a named list. The --quiet option prevents the output from R from being echoed in the gretl output.

Using this method, replicating the example in the previous subsection is rather easy: basically, all it takes is encapsulating the content of the R script in a foreign ... end foreign block; see Listing 44.1.

Listing 44.1: Estimation of the Basic Structural Model, simple

  open bjg.gdt

  foreign language=R --send-data
      y <- gretldata[, "lg"]
      strmod <- StructTS(y)
      compon <- as.ts(tsSmooth(strmod))
      vars <- as.matrix(strmod$coef)
      gretl.export(compon)
      gretl.export(vars)
  end foreign

  append @dotdir/compon.csv
  rename level lg_level
  rename slope lg_slope
  rename sea lg_seas

  vars = mread("vars.mat", 1)

The above syntax, despite being already quite useful by itself, shows its full power when it is used in conjunction with user-written functions. Listing 44.2 shows how to define a gretl function that calls R internally.

Listing 44.2: Estimation of the Basic Structural Model, via a function

  function list RStructTS(series myseries)
      smpl ok(myseries) --restrict
      sx = argname(myseries)

      foreign language=R --send-data --quiet
          @sx <- gretldata[, "myseries"]
          strmod <- StructTS(@sx)
          compon <- as.ts(tsSmooth(strmod))
          gretl.export(compon)
      end foreign

      append @dotdir/compon.csv
      rename level @sx_level
      rename slope @sx_slope
      rename sea @sx_seas

      list ret = @sx_level @sx_slope @sx_seas
      return ret
  end function

  # main

  open bjg.gdt
  list X = RStructTS(lg)
Support for the R shared library is built into the gretl packages for MS Windows and OS X, but the advantage is realized only if the library is in fact available at run time. If you are building gretl yourself on Linux and wish to make use of the R library, you should ensure (a) that R has been built with the shared library enabled (specify --enable-R-shlib when configuring your build of R), and (b) that the pkg-config program is able to detect your R installation. We do not link to the R library at build time, rather we open it dynamically on demand. The gretl GUI has an item under the Tools/Preferences menu which enables you to select the path to the library, if it is not detected automatically.

If you have the R shared library installed but want to force gretl to call the R executable instead, you can do

  set R_lib off

44.7 Further use of the R library

Besides improving performance, as noted above, use of the R shared library makes possible a further refinement. That is, you can define functions in R, within a foreign block, then call those functions later in your script, much as if they were gretl functions. This is illustrated below.

  set R_functions on
  foreign language=R
      plusone <- function(q) {
          z = q + 1
          invisible(z)
      }
  end foreign
  scalar b = R.plusone(2)

The R function plusone is obviously trivial in itself, but the example shows a couple of points. First, for this mechanism to work you need to enable R_functions via the set command. Second, to avoid collision with the gretl function namespace, calls to functions defined in this way must be prefixed with "R.", as in R.plusone. But please note, this mechanism will not work if you have defined a gretl bundle named R: in that case identifiers beginning with "R." will be understood as referring to members of the bundle in question.

Built-in R functions may also be called in this way, once R_functions is set on. For example, one can invoke R's choose function, which computes binomial coefficients:

  set R_functions on
  scalar b = R.choose(10,4)

The use of R functions from within gretl is limited by the need for an unambiguous and lossless mapping between R and gretl data-types (both for arguments passed by gretl and for return values generated by R). So far, the following possibilities are supported (see chapter 11 for details on the definition of types on the gretl side):

- The most basic types (real scalars, real matrices and single strings) can be pushed in either direction, no problem. Since gretl 2023b, row and column names will be preserved when transferring matrices.
- A series in gretl can be pushed to R as a vector. If the gretl series is string-valued (see chapter 16), R will receive the string values.
- Gretl's arrays of strings can be pushed to R as vectors of strings, and vice versa (see the sketch following this list).
- Gretl's bundles can be pushed to R as "lists", with tags naming the elements, and R's lists can be retrieved as gretl bundles provided that their elements have a corresponding gretl type and are identified by tags. But this is subject to the restriction that a gretl bundle passed to R cannot contain instances of the gretl list type, or arrays of anything other than strings.
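The string-array mapping can be seen in a minimal sketch such as the following; the R function passthru is just an illustrative identity of our own, not part of gretl's API:

  set R_functions on
  foreign language=R
      # illustrative identity function: returns its argument unchanged
      passthru <- function(x) x
  end foreign
  strings S = defarray("alpha", "beta", "gamma")
  # S goes to R as a character vector and the return
  # value comes back as a gretl array of strings
  eval R.passthru(S)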
Chapter 45: Gretl and Ox

45.1 Introduction

Ox, written by Jurgen A. Doornik (see Doornik, 2007), is described by its author as "an object-oriented statistical system. At its core is a powerful matrix language, which is complemented by a comprehensive statistical library. Among the special features of Ox are its speed and well-designed syntax... Ox comes in two versions: Ox Professional and Ox Console. Ox is available for Windows, Linux, Mac (OS X), and several Unix platforms." (www.doornik.com)

Ox is proprietary, closed-source software. The command-line version of the program is, however, available free of charge for academic users. Quoting again from Doornik's website: "The Console (command line) versions may be used freely for academic research and teaching purposes only... The Ox syntax is public, and, of course, you may do with your own Ox code whatever you wish." If you wish to use Ox in conjunction with gretl please refer to doornik.com for further details on licensing.

As the reader will no doubt have noticed, most other software that we discuss in this Guide is open-source and freely available for all users. We make an exception for Ox on the grounds that it is indeed fast and well designed, and that its statistical library, along with various add-on packages that are also available, has exceptional coverage of cutting-edge techniques in econometrics. The gretl authors have used Ox for benchmarking some of gretl's more advanced features, such as dynamic panel models and state space models. [Footnote 1: For a review of Ox, see Cribari-Neto and Zarkos (2003); for a somewhat dated comparison of Ox with other matrix-oriented packages such as GAUSS, see Steinhaus (1999).]

45.2 Ox support in gretl

The support offered for Ox in gretl is similar to that offered for R, as discussed in chapter 44. To enable support for Ox, go to the Tools/Preferences/General menu item and look under the Programs tab. Find the entry for the path to the oxl executable, that is, the program that runs Ox files (on MS Windows it is called oxl.exe). Adjust the path if it's not already right for your system and you should be ready to go.

With support enabled, you can open and edit Ox programs in the gretl GUI. Clicking the "execute" icon in the editor window will send your code to Ox for execution. Figures 45.1 and 45.2 show an Ox program and part of its output. [Figure 45.1: Ox editing window. Figure 45.2: Output from Ox.]

In addition you can embed Ox code within a gretl script, using a foreign block, as described in connection with R. A trivial example, which simply prints the gretl data matrix within Ox, is shown below:

  open data4-1
  matrix m = {dataset}
  mwrite(m, "gretl.mat", 1)

  foreign language=Ox
      #include <oxstd.h>
      main()
      {
          decl gmat = gretl_loadmat("gretl.mat");
          print(gmat);
      }
  end foreign

The above example illustrates how a matrix can be passed from gretl to Ox. We use the mwrite function to write a matrix into the user's "dotdir" (see section 15.2), then in Ox we use the function gretl_loadmat to retrieve the matrix.

How does gretl_loadmat come to be defined? When gretl writes out the Ox program corresponding to your foreign block, it does two things in addition. First, it writes a small utility file named gretl_io.ox into your dotdir. This contains a definition for gretl_loadmat and also for the function gretl_export (see below). Second, gretl interpolates into your Ox code a line which includes this utility file (it is inserted right after the inclusion of oxstd.h, which is needed in all Ox programs). Note that gretl_loadmat expects to find the named file in the user's dotdir.

45.3 Illustration: replication of DPD model

Listing 45.1 shows a more ambitious case. This script replicates one of the dynamic panel data models in Arellano and Bond (1991), first using gretl and then using Ox; we then check the relative differences between the parameter estimates produced by the two programs, which turn out to be reassuringly small.

Unlike the previous example, in this case we pass the dataset from gretl to Ox as a CSV file, in order to preserve the variable names.
Note the use of the internal variable csv_na to get the right representation of missing values for use with Ox, and also note that the --send-data option for the foreign command is not available in connection with Ox.

We get the parameter estimates back from Ox using gretl_export on the Ox side and mread on the gretl side. The gretl_export function takes two arguments, a matrix and a file name. The file is written into the user's dotdir, from where it can be picked up using mread. The final portion of the output from Listing 45.1 is shown below:

  ? matrix oxparm = mread("oxparm.mat", 1)
  Generated matrix oxparm
  ? eval abs((parm - oxparm) ./ oxparm)
    1.4578e-13
    3.5642e-13
    5.0672e-15
    1.6091e-13
    8.9808e-15
    2.0450e-14
    1.0218e-13
    2.1048e-13
    9.5898e-15
    1.8658e-14
    2.1852e-14
    2.9451e-13
    1.9398e-13

Listing 45.1: Estimation of dynamic panel data model via gretl and Ox

  open abdata.gdt

  # 1-step GMM estimation
  dpanel 2 ; n w w(-1) k ys ys(-1) 0 --time-dummies --dpdstyle
  matrix parm = $coeff

  # Write CSV file for Ox
  set csv_na .NaN
  store @dotdir/abdata.csv

  # Replicate using the Ox DPD package
  foreign language=Ox
      #include <oxstd.h>
      #import <packages/dpd/dpd>
      main()
      {
          decl dpd = new DPD();
          dpd.Load("@dotdir/abdata.csv");
          dpd.SetYear("YEAR");
          dpd.Select(Y_VAR, {"n", 0, 2});
          dpd.Select(X_VAR, {"w", 0, 1, "k", 0, 0, "ys", 0, 1});
          dpd.Select(I_VAR, {"w", 0, 1, "k", 0, 0, "ys", 0, 1});
          dpd.Gmm("n", 2, 99);          // GMM-type instrument
          dpd.SetDummies(D_CONSTANT + D_TIME);
          dpd.SetTest(2, 2);            // Sargan, AR 1-2 tests
          dpd.Estimate();               // 1-step estimation
          decl parm = dpd.GetPar();
          gretl_export(parm, "oxparm.mat");
          delete dpd;
      }
  end foreign

  # Compare the results
  matrix oxparm = mread("oxparm.mat", 1)
  eval abs((parm - oxparm) ./ oxparm)

Chapter 46: Gretl and Octave

46.1 Introduction

GNU Octave, written by John W. Eaton and others, is described as "a high-level language, primarily intended for numerical computations." The program is oriented towards "solving linear and nonlinear problems numerically" and "performing other numerical experiments using a language that is mostly compatible with Matlab" (www.gnu.org/software/octave). Octave is available in source-code form (naturally, for GNU software) and also in the form of binary packages for MS Windows and Mac OS X. Numerous contributed packages that extend Octave's functionality in various ways can be found at octave.sf.net.

46.2 Octave support in gretl

The support offered for Octave in gretl is similar to that offered for R (chapter 44). For example, you can open and edit Octave scripts in the gretl GUI. Clicking the "execute" icon in the editor window will send your code to Octave for execution. Figures 46.1 and 46.2 show an Octave script and its output; in this example we use the function logistic_regression to replicate some results from Greene (2000). [Figure 46.1: Octave editing window. Figure 46.2: Output from Octave.]

In addition you can embed Octave code within a gretl script, using a foreign block, as described in connection with R. A trivial example is shown below: it simply loads and prints the gretl data matrix within Octave, then takes it back to gretl and checks for any difference (there should be none). Note that in Octave, appending ";" to a line suppresses verbose output; leaving off the semicolon results in printing of the object that is produced, if any.

  open data4-1
  matrix m = {dataset}
  mwrite(m, "gretl.mat", 1)

  foreign language=Octave
      gmat = gretl_loadmat("gretl.mat");
      gretl_export(gmat, "octave.mat");
  end foreign

  matrix chk = mread("octave.mat", 1)
  eval maxr(maxc(abs(m - chk)))

The functions gretl_loadmat and gretl_export, which are predefined when you run Octave from within gretl, have the following signatures:

  function A = gretl_loadmat(fname, autodot=1)
  function gretl_export(X, fname, autodot=1)

By default, traffic in matrices goes via the user's "dotdir" (see section 15.2) on the Octave side; that is, the name of this directory is prepended to filename for both reading and writing. This is complementary to use of the export and import parameters with gretl's mwrite and mread functions, respectively. However, if you wish to take control over the reading and writing locations you can supply a zero value for autodot (or give an absolute path) when calling gretl_loadmat and gretl_export: in that case the filename argument is used as is.
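By way of illustration, here is a minimal sketch of the absolute-path route; the /tmp locations are arbitrary placeholders of our own, not gretl conventions:

  matrix m = mnormal(3,3)
  mwrite(m, "/tmp/m.mat")    # full path given: dotdir is not used
  foreign language=Octave
      # autodot = 0: treat the filenames exactly as given
      m = gretl_loadmat("/tmp/m.mat", 0);
      gretl_export(2 * m, "/tmp/m2.mat", 0);
  end foreign
  matrix m2 = mread("/tmp/m2.mat")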
46.3 Illustration: spectral methods

We now present a more ambitious example which exploits Octave's handling of the frequency domain (and also its ability to use code written for MATLAB), namely estimation of the spectral coherence of two time series. For this illustration we require two extra Octave packages from octave.sf.net, namely those supporting spectral functions (specfun) and signal processing (signal). After downloading the packages you can install them from within Octave as follows (using version numbers as of March 2010):

  pkg install specfun-1.0.8.tar.gz
  pkg install signal-1.0.10.tar.gz

In addition we need some specialized MATLAB files made available by Mario Forni of the University of Modena, at http://morgana.unimore.it/forni_mario/matlab.htm. The files needed are coheren2.m, coheren.m, coher.m, cospec.m, crosscov.m, crosspec.m, crosspe.m and spec.m. These are in a form appropriate for MS Windows. On Linux you could run the following shell script to get the files and remove the Windows end-of-file character (which prevents the files from running under Octave):

  SITE=http://morgana.unimore.it/forni_mario/MYPROG/
  # download files and delete trailing Ctrl-Z
  for f in coheren2.m coheren.m coher.m cospec.m \
    crosscov.m crosspec.m crosspe.m spec.m ; do
      wget $SITE$f
      cat $f | tr -d \\032 > tmp.m
      mv tmp.m $f
  done

The Forni files should be placed in some appropriate directory, and you should tell Octave where to find them by adding that directory to Octave's loadpath. On Linux this can be done via an entry in one's ~/.octaverc file. For example:

  addpath("~/stats/octave/forni");

Alternatively, an addpath directive can be written into the Octave script that calls on these files.

With everything set up on the Octave side we now write a gretl script (see Listing 46.1) which opens a time-series dataset, constructs and writes a matrix containing two series, and defines a foreign block containing the Octave statements needed to produce the spectral coherence matrix. This matrix is exported via gretl_export and picked up using mread. Finally, we produce a graph from the matrix in gretl. In the script this is sent to the screen; Figure 46.3 shows the same graph in PDF format. [Figure 46.3: Spectral coherence estimated via Octave.]

Listing 46.1: Estimation of spectral coherence via Octave

  open data9-7
  matrix xy = {PRIME, UNEMP}
  mwrite(xy, "xy.mat", 1)

  foreign language=Octave
      pkg load signal
      # uncomment and modify the following if necessary
      # addpath("~/stats/octave/forni");
      xy = gretl_loadmat("xy.mat");
      x = xy(:,1);
      y = xy(:,2);
      # note: the last parameter is the Bartlett window size
      h = coher(x, y, 8);
      gretl_export(h, "h.mat");
  end foreign

  h = mread("h.mat", 1)
  cnameset(h, "coherence")
  gnuplot 1 --time-series --with-lines --matrix=h --output=display
Chapter 47: Gretl and Stata

Stata (www.stata.com) is closed-source, proprietary (and expensive) software and as such is not a natural companion to gretl. Nonetheless, given Stata's popularity it is desirable to have a convenient way of comparing results across the two programs, and to that end we provide some support for Stata code under the foreign command.

To enable support for Stata, go to the Tools/Preferences/General menu item and look under the Programs tab. Find the entry for the path to the Stata executable. Adjust the path if it's not already right for your system and you should be ready to go.

The following example illustrates what's available. You can send the current gretl dataset to Stata using the --send-data flag. And having defined a matrix within Stata you can export it for use with gretl via the gretl_export command: this takes two arguments, the name of the matrix to export and the filename to use; the file is written to the user's "dotdir", from where it can be retrieved using the mread function. [Footnote 1: We do not currently offer the complementary functionality of gretl_loadmat, which enables reading of matrices written by gretl's mwrite function in Ox and Octave. This is not at all easy to implement in Stata code.] To suppress printed output from Stata you can add the --quiet flag to the foreign block.

Listing 47.1: Comparison of clustered standard errors with Stata

  function matrix stata_reorder (matrix se)
      # stata puts the intercept last, but gretl puts it first
      scalar n = rows(se)
      return se[n] | se[1:n-1]
  end function

  open data4-1
  ols 1 0 2 3 --cluster=bedrms
  matrix se = $stderr

  foreign language=stata --send-data
      regress price sqft bedrms, vce(cluster bedrms)
      matrix vcv = e(V)
      gretl_export vcv "vcv.mat"
  end foreign

  matrix stata_vcv = mread("vcv.mat", 1)
  stata_se = stata_reorder(sqrt(diag(stata_vcv)))
  matrix check = se - stata_se
  print check

In addition you can edit "pure" Stata scripts in the gretl GUI and send them for execution, as with native gretl scripts.

Note that Stata coerces all variable names to lower-case on data input, so even if series names in gretl are upper-case, or of mixed case, it's necessary to use all lower-case in Stata. Also note that when opening a data file within Stata via the use command it will be necessary to provide the full path to the file.
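As a minimal sketch of the case-folding point (the series name GDP and the dataset containing it are hypothetical):

  foreign language=stata --send-data
      * the gretl series "GDP" arrives in Stata as "gdp"
      summarize gdp
  end foreign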
Chapter 48: Gretl and Python

48.1 Introduction

According to www.python.org, Python is "an easy to learn, powerful programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python's elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms." Indeed, Python is widely used in a great variety of contexts. Numerous add-on modules are available; the ones likely to be of greatest interest to econometricians include NumPy ("the fundamental package for scientific computing with Python", see www.numpy.org), SciPy (which builds on NumPy, see www.scipy.org) and Statsmodels (http://statsmodels.sourceforge.net/).

48.2 Python support in gretl

The support offered for Python in gretl is similar to that offered for Octave (chapter 46). You can open and edit Python scripts in the gretl GUI. Clicking the "execute" icon in the editor window will send your code to Python for execution. In addition you can embed Python code within a gretl script using a foreign block, as described in connection with R.

When you launch Python from within gretl, one variable and two convenience functions are pre-defined, as follows:

  gretl_dotdir
  gretl_loadmat(filename, autodot=1)
  gretl_export(M, filename, autodot=1)

The variable gretl_dotdir holds the path to the user's "dot directory". The first function loads a matrix of the given filename as written by gretl's mwrite function, and the second writes matrix M, under the given filename, in the format wanted by gretl.

By default, the traffic in matrices goes via the dot directory on the Python side; that is, the name of this directory is prepended to filename for both reading and writing. This is complementary to use of the export and import parameters with gretl's mwrite and mread functions, respectively. However, if you wish to take control over the reading and writing locations you can supply a zero value for autodot (or give an absolute path) when calling gretl_loadmat and gretl_export: in that case the filename argument is used as is.

Note that gretl_loadmat and gretl_export depend on NumPy; they make use of the functions loadtxt and savetxt respectively. Nonetheless, the presence of NumPy is not an absolute requirement if you don't need to use these two functions.

48.3 Illustration: linear regression with multicollinearity

Listing 48.1 compares the numerical accuracy of gretl's ols command with that of the function linalg.lstsq in NumPy, using the notorious Longley test data which exhibit extreme multicollinearity. Unlike some econometrics packages, NumPy does a good job on these data. The script computes and prints the log-relative error in estimation of the regression coefficients, using the NIST-certified values as a benchmark; [Footnote 1: See http://www.itl.nist.gov/div898/strd/lls/data/Longley.shtml] the error values correspond to the number of correct digits (with a maximum of 15). The results will likely differ somewhat by computer architecture and compiler.

Listing 48.1: Comparing regression results with Python

  set verbose off
  function matrix logrel_err (const matrix est, const matrix true)
      return -log10(abs(est - true) ./ abs(true))
  end function

  open longley.gdt --quiet
  list LX = prdefl..year
  ols employ 0 LX --quiet
  matrix b_gretl = $coeff
  mwrite({employ, const, LX}, "alldata.mat", 1)

  foreign language=python
      import numpy as np
      X = gretl_loadmat('alldata.mat', 1)
      # NumPy's OLS
      b = np.linalg.lstsq(X[:,1:], X[:,0])[0]
      gretl_export(np.transpose(np.matrix(b)), 'pyb.mat', 1)
  end foreign

  # NIST's certified coefficient values
  matrix b_nist = {-3482258.63459582; 15.0618722713733;
    -0.358191792925910E-01; -2.02022980381683;
    -1.03322686717359; -0.511041056535807E-01;
    1829.15146461355}

  matrix b_numpy = mread("pyb.mat", 1)
  matrix E = logrel_err(b_gretl, b_nist) ~ logrel_err(b_numpy, b_nist)
  cnameset(E, "gretl python")

  printf "Log-relative errors, Longley coefficients:\n\n%12.5g\n", E
  printf "Column means:\n%12.5g\n", meanc(E)

The output is:

  Log-relative errors, Longley coefficients:

         gretl      python
        12.844      12.850
        11.528      11.414
        12.393      12.401
        13.135      13.121
        13.738      13.318
        12.587      12.363
        12.848      12.852

  Column means:
        12.725      12.617
Chapter 49: Gretl and Julia

49.1 Introduction

According to julialang.org, Julia is "a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments. It provides a sophisticated compiler, distributed parallel execution, numerical accuracy, and an extensive mathematical function library." Julia is well known for being very fast; however, you should be aware that by default starting Julia takes some time due to Just-in-Time compilation of the input. This fixed cost is well worth bearing if you are asking Julia to perform a big computation, but small jobs are likely to run faster if you use the Julia-specific --no-compile option with the foreign command. [Footnote 1: Caveat: it seems that this option is not supported by all builds of Julia.]

49.2 Julia support in gretl

The support offered for Julia in gretl is similar to that offered for Octave (chapter 46). You can open and edit Julia scripts in the gretl GUI. Clicking the "execute" icon in the editor window will send your code to Julia for execution. In addition you can embed Julia code within a gretl script using a foreign block, as described in connection with R.

When you launch Julia from within gretl, one variable and two convenience functions are predefined, as follows:

  gretl_dotdir
  gretl_loadmat(filename, autodot=true)
  gretl_export(M, filename, autodot=true)

The variable gretl_dotdir holds the path to the user's "dot directory". The first function loads a matrix of the given filename as written by gretl's mwrite function, and the second writes matrix M, under the given filename, in the format wanted by gretl.

By default, the traffic in matrices goes via the dot directory on the Julia side; that is, the name of this directory is prepended to filename for both reading and writing. This is complementary to use of the export and import parameters with gretl's mwrite and mread functions, respectively. However, if you wish to take control over the reading and writing locations you can supply a zero value for autodot (or give an absolute path) when calling gretl_loadmat and gretl_export: in that case the filename argument is used as is.

49.3 Illustration

Listing 49.1 shows a minimal example of how to interact with Julia from a gretl script. Since this is a very small job, JIT compilation is not worthwhile: in our testing the script runs almost 4 times faster if the Julia block is opened with "foreign language=julia --no-compile". This has the effect of passing the option --compile=no to the Julia executable.

Listing 49.1: Simple Julia I/O example

  set verbose off
  matrix A = mnormal(4,4)  # generate a random matrix
  mwrite(A, "A", 1)        # and save it to a file

  foreign language=julia   # call Julia
      println("Hi from Julia!")        # output a string
      A = gretl_loadmat("A")           # grab the matrix from gretl
      gretl_export(inv(A), "iA.mat")   # and save its inverse
  end foreign

  # go back to gretl
  matrix iA = mread("iA.mat", 1)  # read the inverse from Julia
  matrix check = A * iA           # compute the product
  print check                     # print out the check (should be I)

Output (a good approximation to the identity matrix):

  Hi from Julia!
  check (4 x 4)

      1.0000   6.9389e-18   1.6653e-16   1.6653e-16
      0.0000       1.0000       0.0000       0.0000
  4.4409e-16   8.3267e-17       1.0000   6.6613e-16
  4.4409e-16   1.3878e-17   1.1102e-16       1.0000

Chapter 50: Troubleshooting gretl

50.1 Bug reports

Bug reports are welcome (well, if not exactly welcome, then useful and appreciated). Hopefully, you are unlikely to find bugs in the actual calculations done by gretl, although this statement does not constitute any sort of warranty. You may, however, come across bugs or oddities in the behavior of the graphical interface. Please remember that the usefulness of bug reports is greatly enhanced if you can be as specific as possible: what exactly went wrong, under what conditions, and on what operating system? If you saw an error message, what precisely did it say?

One way of making a bug report more useful is to run the program in such a way that you can see and copy any additional information that gets printed to the stderr output stream. On Linux and Mac OS X that's just a matter of launching gretl from the command prompt in a terminal window. On MS Windows it's a bit more complicated, since stderr is by default invisible. However, you can quite easily set up a special gretl shortcut that does the job. On the Windows desktop, right-click and select "New shortcut". In the dialog box that appears, browse to find gretl.exe and append the --debug flag, as shown in Figure 50.1. Note that there are two dashes before "debug". [Figure 50.1: Creating a debugging shortcut.]

When you start gretl in this mode, a "console window" appears as well as the gretl window, and stderr output goes to the console. To copy this output, click at the top left of the console window for a menu (see Figure 50.2): first do "Select all", then "Copy". You can paste the results into Notepad or similar. [Figure 50.2: The program with console window.]
50.2 Auxiliary programs

As mentioned above, gretl calls some other programs to accomplish certain tasks (gnuplot for graphing, LaTeX for high-quality typesetting of regression output, GNU R). If something goes wrong with such external links, it is not always easy for gretl to produce an informative error message. If such a link fails when accessed from the gretl graphical interface, you may be able to get more information by starting gretl from the command prompt rather than via a desktop menu entry or icon. On the X window system, start gretl from the shell prompt in an xterm; on MS Windows, start the program gretl.exe from a console window or "DOS box" using the -g or --debug option flag. Additional error messages may be displayed on the terminal window.

Also please note that for most external calls, gretl assumes that the programs in question are available in your "path", that is, that they can be invoked simply via the name of the program, without supplying the program's full location. [Footnote 1: The exception to this rule is the invocation of gnuplot under MS Windows, where a full path to the program is given.] Thus if a given program fails, try the experiment of typing the program name at the command prompt, as shown below.

                    Graphing      Typesetting   GNU R
  X window system   gnuplot       pdflatex      R
  MS Windows        wgnuplot.exe  pdflatex      RGui.exe

If the program fails to start from the prompt, it's not a gretl issue but rather that the program's home directory is not in your path, or the program is not installed (properly). For details on modifying your path, please see the documentation or online help for your operating system or shell.

Chapter 51: The command line interface

The gretl package includes the command-line program gretlcli. On Linux it can be run from a terminal window (xterm, rxvt, or similar), or at the text console. Under MS Windows it can be run in a console window (sometimes inaccurately called a "DOS box"). gretlcli has its own help file, which may be accessed by typing "help" at the prompt. It can be run in batch mode, sending output directly to a file (see also the Gretl Command Reference).

If gretlcli is linked to the readline library (this is automatically the case in the MS Windows version; also see Appendix B), the command line is recallable and editable, and offers command completion. You can use the Up and Down arrow keys to cycle through previously typed commands. On a given command line, you can use the arrow keys to move around, in conjunction with Emacs editing keystrokes. [Footnote 1: Actually, the key bindings shown below are only the defaults; they can be customized. See the readline manual.] The most common of these are:

  Keystroke   Effect
  Ctrl-a      go to start of line
  Ctrl-e      go to end of line
  Ctrl-d      delete character to right

where "Ctrl-a" means press the "a" key while the "Ctrl" key is also depressed. Thus, if you want to change something at the beginning of a command, you don't have to backspace over the whole line, erasing as you go: just hop to the start and add or delete characters. If you type the first letters of a command name then press the Tab key, readline will attempt to complete the command name for you. If there's a unique completion it will be put in place automatically. If there's more than one completion, pressing Tab a second time brings up a list.

Probably the most useful mode for heavy-duty work with gretlcli is batch (non-interactive) mode, in which the program reads and processes a script and sends the output to file. For example

  gretlcli -b scriptfile > outputfile

Note that scriptfile is treated as a program argument; only the output file requires redirection. Don't forget the -b (batch) switch, otherwise the program will wait for user input after executing the script (and if output is redirected, the program will appear to "hang").
Appendix A: Data file details

A.1 Basic native format

In gretl's basic native data format, for which we use the suffix gdt, a dataset is stored in XML (extensible mark-up language). Data files correspond to the simple DTD (document type definition) given in gretldata.dtd, which is supplied with the gretl distribution and is installed in the system data directory (e.g. /usr/share/gretl/data on Linux). Such files may be plain text (uncompressed) or gzipped. They contain the actual data values plus additional information such as the names and descriptions of variables, the frequency of the data, and so on.

In a gdt file the actual data values are written to 17 significant figures for generated data such as logs or pseudo-random numbers, or to a maximum of 15 figures for primary data. The C printf format "%.17g" is used (or "%.15g"), so that trailing zeros are not printed.

Most users will probably not have need to read or write such files other than via gretl itself, but if you want to manipulate them using other software tools you should examine the DTD and also take a look at a few of the supplied "practice" data files: data4-1.gdt gives a simple example; data4-10.gdt is an example where observation labels are included.

A.2 Binary data file format

A native binary format is also available for dataset storage. This format, with suffix gdtb, offers much faster writing and reading for very large datasets. For small to moderately sized datasets (say, up to a few megabytes) there is little advantage in the binary format and we recommend use of plain gdt. Note that gdtb files are saved in the endianness of the machine on which they're written and are not portable across platforms of differing endianness, but since almost all machines on which gretl is likely to be run are little-endian this is unlikely to be a serious limitation. The implementation of gdtb format can be found in purebin.c in the "plugin" subdirectory of the gretl source tree.

Prior to version 2021b of gretl, gdtb files had a different structure, namely a PKZIP file containing an XML component for the metadata plus a binary component for the actual data values. It turned out that this hybrid format did not scale well for datasets with a great deal of metadata. For backward compatibility gretl can still read such old-style files, but it doesn't write them any more.

A.3 Native database format

A gretl database has two primary parts: a plain text "index" file (with filename suffix idx) containing information on the included series, and a binary file (suffix bin) containing the actual data. Two examples of the format for an entry in the idx file are shown below:

  G0M910  Composite index of 11 leading indicators (1987=100)
  M  1948.01 - 1995.11  n = 575

  currbal  Balance of Payments: Balance on Current Account; SA
  Q  1960.1 - 1999.4  n = 160

The first field is the series name. The second is a description of the series (maximum 128 characters). On the second line the first field is a frequency code: M for monthly, Q for quarterly, A for annual, B for business-daily (daily with five days per week), D for 7-day daily, S for 6-day daily and U for undated. No other frequencies are accepted at present. Then comes the starting date (with two digits following the point for monthly data, one for quarterly data, none for annual), a space, a hyphen, another space, the ending date, the string "n = " and the integer number of observations. In the case of daily data, the starting and ending dates should be given in the ISO 8601 form, YYYY-MM-DD. This format must be respected exactly.
Optionally, the first line of the index file may contain a short comment (up to 64 characters) on the source and nature of the data, following a hash mark. For example:

  # Federal Reserve Board (interest rates)

The corresponding binary database file holds the data values, represented as "floats", that is, single-precision floating-point numbers taking four bytes apiece. The values are packed "by variable", so that the first n numbers are the observations of variable 1, the next m the observations on variable 2, and so on.

A third file may accompany the idx and bin files, namely a codebook containing a description of the data. If present, this must be plain text with filename suffix cb, or PDF with suffix pdf.

The components of a gretl database are generally combined in a single file, with zlib compression and gz suffix, for distribution. A small program named gretlzip can be used to create or unpack such files; see the utils/dbzip subdirectory of the gretl source tree.

Appendix B: Building gretl

Here we give instructions detailed enough to allow a user with only a basic knowledge of a Unix-type system to build gretl. These steps were tested on a fresh installation of Debian Etch. For other Linux distributions (especially Debian-based ones, like Ubuntu and its derivatives) little should change. Other Unix-like operating systems such as Mac OS X and BSD would probably require more substantial adjustments.

In this guided example we will build gretl complete with documentation. This introduces a few more requirements, but gives you the ability to modify the documentation files as well, like the help files or the manuals.

B.1 Installing the prerequisites

We assume that the basic GNU utilities are already installed on the system, together with these other programs:

- some TeX/LaTeX system (texlive will do beautifully)
- Gnuplot
- ImageMagick

We also assume that the user has administrative privileges and knows how to install packages. The examples below are carried out using the apt-get shell command, but they can be performed with menu-based utilities like aptitude, dselect or the GUI-based program synaptic. Users of Linux distributions which employ rpm packages (e.g. Red Hat/Fedora, Mandriva, SuSE) may want to refer to the "dependencies" page on the gretl website.

The first step is installing the C compiler and related basic utilities, if these are not already in place. On a Debian (or derivative) system, these are contained in a bunch of packages that can be installed via the command

  apt-get install gcc autoconf automake1.9 libtool flex bison gcc-doc \
    libc6-dev libc-dev gfortran gettext pkg-config

Then it is necessary to install the "development" (dev) packages for the libraries that gretl uses:

  Library     command
  GLIB        apt-get install libglib2.0-dev
  GTK 3.0     apt-get install libgtk3.0-dev
  PNG         apt-get install libpng12-dev
  XSLT        apt-get install libxslt1-dev
  LAPACK      apt-get install liblapack-dev
  FFTW        apt-get install libfftw3-dev
  READLINE    apt-get install libreadline-dev
  ZLIB        apt-get install zlib1g-dev
  XML         apt-get install libxml2-dev
  GMP         apt-get install libgmp-dev
  CURL        apt-get install libcurl4-gnutls-dev
  MPFR        apt-get install libmpfr-dev

(It is possible to substitute GTK 2.0 for GTK 3.0.) The dev packages for these libraries are necessary to compile gretl; you'll also need the plain (non-dev) library packages to run gretl, but most of these should already be part of a standard installation. In order to enable other optional features, like audio support, you may need to install more libraries.

The above steps can be much simplified on Linux systems that provide deb-based package managers, such as Debian and its derivatives (Ubuntu, Knoppix and other distributions).
The command

  apt-get build-dep gretl

will download and install all the necessary packages for building the version of gretl that is currently present in your APT sources. Technically, this does not guarantee that all the software necessary to build the git version is included, because the version of gretl on your repository may be quite old and build requirements may have changed in the meantime. However, the chances of a mismatch are rather remote for a reasonably up-to-date system, so in most cases the above command should take care of everything correctly.

B.2 Getting the source: release or git

At this point it is possible to build from the source. You have two options here: obtain the latest released source package, or retrieve the current git version of gretl (git being the version control software currently in use for gretl). The usual caveat applies to the git version, namely that it may not build correctly and may contain "experimental" code; on the other hand, git often contains bug-fixes relative to the released version. If you want to help with testing and to contribute bug reports, we recommend using the git version of gretl.

To work with the released source:

1. Download the gretl source package from gretl.sourceforge.net.
2. Unzip and untar the package. On a system with the GNU utilities available, the command would be tar xvfJ gretl-N.tar.xz (replace N with the specific version number of the file you downloaded at step 1).
3. Change directory to the gretl source directory created at step 2 (e.g. gretl-2020a).
4. Proceed to the next section, "Configure and make".

To work with git, you'll first need to install the git client program, if it's not already on your system. Relevant resources you may wish to consult include the main git website at git-scm.com and the gretl-specific instructions ("gretl git basics").

When grabbing the git sources for the first time, you should first decide where you want to store the code. For example, you might create a directory called git under your home directory. Open a terminal window, cd into this directory, and type the following command:

  git clone git://git.code.sf.net/p/gretl/git gretl-git

At this point git should create a subdirectory named gretl-git and fill it with the current sources. When you want to update the source, this is very simple: just move into the gretl-git directory and type

  git pull

Assuming you're now in the gretl-git directory, you can proceed in the same manner as with the released source package.

B.3 Configure the source

The next command you need is ./configure; this is a complex script that detects which tools you have on your system and sets things up. The configure command accepts many options; you may want to run

  ./configure --help

first to see what options are available. One option you may wish to tweak is --prefix. By default the installation goes under /usr/local, but you can change this. For example

  ./configure --prefix=/usr

will put everything under the /usr tree. Note that the recommended location to build gretl is not in the source directory. The way to achieve that is quite simple: the invocation of the configure script has to take into account the relative path to the source tree. So, if your build directory is inside (underneath) the source tree, it is

  ../configure

while, if it is in a parallel tree, it would be something like

  ../gretl-git/configure

If you have a multi-core machine, you may want to activate support for OpenMP, which permits the parallelization of matrix multiplication and some other tasks. This requires adding the configure flag --enable-openmp.
By default the gretl GUI is built using version 3.0 of the GTK library, if available, otherwise version 2.0. If you have both versions installed and prefer to use GTK 2.0, use the flag --enable-gtk2.

In order to have the documentation built, we need to pass the relevant option to configure, as in

  --enable-build-doc

But please note that this option will work only if you are using the git source.

In order to build the documentation, there is the possibility that you will have to install some extra software on top of the packages mentioned in the previous section. For example, you may need some extra LaTeX packages to compile the manuals. Two of the required packages, which not every standard LaTeX installation includes, are typically pifont.sty and appendix.sty. You could install the corresponding packages from your distribution, or you could simply download them from CTAN and install them by hand.

This, for example, is what you would do if you want to install under /usr, with OpenMP support, and also build the documentation:

  ./configure --prefix=/usr \
    --enable-openmp \
    --enable-build-doc

You will see a number of checks being run, and if everything goes according to plan, you should see a summary similar to that displayed in Listing B.1.

Listing B.1: Sample output from ./configure

  Configuration:
    Installation path:                      /usr
    Use readline library:                   yes
    Use gnuplot for graphs:                 yes
    Use LaTeX for typesetting output:       yes
    Use libgsf for zip/unzip:               no
    sse2 support for RNG:                   yes
    OpenMP support:                         yes
    MPI support:                            no
    AVX support for arithmetic:             no
    Build with GTK version:                 2.0
    Build gretl documentation:              yes
    Use Lucida fonts:                       no
    Build message catalogs:                 yes
    X-12-ARIMA support:                     yes
    TRAMO/SEATS support:                    yes
    libR support:                           yes
    ODBC support:                           no
    Experimental audio support:             no
    Use xdg-utils in installation:          if DESTDIR not set

    LAPACK libraries:
      -llapack -lblas -lgfortran

  Now type 'make' to build gretl.
  You can also do 'make pdfdocs' to build the PDF documentation.

If you're using git, it's a good idea to re-run the configure script after doing an update. This is not always necessary, but sometimes it is, and it never does any harm. For this purpose, you may want to write a little shell script that calls configure with any options you want to use.

B.4 Build and install

We are now ready to undertake the compilation proper: this is done by running the make command, which takes care of compiling all the necessary source files in the correct order. All you need to do is type

  make
given by the environment variable GRETLMPBITS default value 256 The normal equations of Least Squares are by default solved via Cholesky decomposition which is highly accurate provided the matrix of crossproducts of the regressors XX is not very ill conditioned If this problem is detected gretl automatically switches to use QR decomposition The program has been tested rather thoroughly on the statistical reference datasets provided by NIST the US National Institute of Standards and Technology and a full account of the results may be found on the gretl website follow the link Numerical accuracy To date two published reviews have discussed gretls accuracy Giovanni Baiocchi and Walter Dis taso 2003 and Talha Yalta and Yasemin Yalta 2007 We are grateful to these authors for their careful examination of the program Their comments have prompted several modifications includ ing the use of Stephen Moshiers cephes code for computing pvalues and other quantities relating to probability distributions see netliborg changes to the formatting of regression output to en sure that the program displays a consistent number of significant digits and attention to compiler issues in producing the MS Windows version of gretl which at one time was slighly less accurate than the Linux version Gretl now includes a plugin that runs the NIST linear regression test suite You can find this under the Tools menu in the main window When you run this test the introductory text explains the expected result If you run this test and see anything other than the expected result please send a bug report to cottrellwfuedu All regression statistics are printed to 6 significant figures in the current version of gretl except when the multipleprecision plugin is used in which case results are given to 12 figures If you want to examine a particular value more closely first save it for example using the genr command then print it using printf to as many digits as you like see the Gretl Command Reference 477 Appendix D Related free software Gretls capabilities are substantial and are expanding Nonetheless you may find there are some things you cant do in gretl or you may wish to compare results with other programs If you are looking for complementary functionality in the realm of free opensource software we recommend the following programs The selfdescription of each program is taken from its website GNU R rprojectorg R is a system for statistical computation and graphics It consists of a language plus a runtime environment with graphics a debugger access to certain system functions and the ability to run programs stored in script files It compiles and runs on a wide variety of UNIX platforms Windows and MacOS Comment There are numerous addon packages for R covering most areas of statistical work GNU Octave wwwoctaveorg GNU Octave is a highlevel language primarily intended for numerical computations It provides a convenient command line interface for solving linear and nonlinear problems numerically and for performing other numerical experiments using a language that is mostly compatible with Matlab It may also be used as a batchoriented language Julia julialangorg Julia is a highlevel highperformance dynamic programming language for technical computing with syntax that is familiar to users of other technical computing environments It provides a sophisticated compiler distributed parallel execution numerical accuracy and an extensive mathematical function library JMulTi wwwjmultide JMulTi was originally designed as a tool for certain 
econometric pro cedures in time series analysis that are especially difficult to use and that are not available in other packages like Impulse Response Analysis with bootstrapped confidence intervals for VARVEC modelling Now many other features have been integrated as well to make it possi ble to convey a comprehensive analysis Comment JMulTi is a java GUI program you need a java runtime environment to make use of it As mentioned above gretl offers the facility of exporting data in the formats of both Octave and R In the case of Octave the gretl data set is saved as a single matrix X You can pull the X matrix apart if you wish once the data are loaded in Octave see the Octave manual for details As for R the exported data file preserves any time series structure that is apparent to gretl The series are saved as individual structures The data should be brought into R using the source command In addition gretl has a convenience function for moving data quickly into R Under gretls Tools menu you will find the entry Start GNU R This writes out an R version of the current gretl data set in the users gretl directory and sources it into a new R session The particular way R is invoked depends on the internal gretl variable Rcommand whose value may be set under the Tools Preferences menu The default command is RGuiexe under MS Windows Under X it is xterm e R Please note that at most three spaceseparated elements in this command string will be processed any extra elements are ignored 478 Appendix E Listing of URLs Below is a listing of the full URLs of websites mentioned in the text Estima RATS httpwwwestimacom FFTW3 httpwwwfftworg Gnome desktop homepage httpwwwgnomeorg GNU Multiple Precision GMP library httpgmpliborg CURL library httpcurlhaxxselibcurl GNU Octave homepage httpwwwoctaveorg GNU R homepage httpwwwrprojectorg GNU R manual httpcranrprojectorgdocmanualsRintropdf Gnuplot homepage httpwwwgnuplotinfo Gretl data page httpgretlsourceforgenetgretldatahtml Gretl homepage httpgretlsourceforgenet GTK homepage httpwwwgtkorg GTK port for win32 httpswikignomeorgProjectsGTKWin32 InfoZip homepage httpwwwinfoziporgpubinfozipzlib JMulTi homepage httpwwwjmultide JRSoftware httpwwwjrsoftwareorg Julia homepage httpjulialangorg Mingw gcc for win32 homepage httpwwwmingworg Minpack httpwwwnetliborgminpack Penn World Table httppwteconupennedu Readline homepage httpcnswwwcnscwrueduchetreadlinerltophtml Readline manual httpcnswwwcnscwrueduchetreadlinereadlinehtml Xmlsoft homepage httpxmlsoftorg 479 Bibliography Akaike H 1974 A new look at the statistical model identification IEEE Transactions on Auto matic Control AC19 716723 Anderson T W and C Hsiao 1981 Estimation of dynamic models with error components Jour nal of the American Statistical Association 76 598606 Andrews D W K and J C Monahan 1992 An improved heteroskedasticity and autocorrelation consistent covariance matrix estimator Econometrica 60 953966 Arellano M 2003 Panel Data Econometrics Oxford Oxford University Press Arellano M and S Bond 1991 Some tests of specification for panel data Monte carlo evidence and an application to employment equations The Review of Economic Studies 58 277297 Armesto M T K Engemann and M Owyang 2010 Forecasting with mixed frequencies Fed eral Reserve Bank of St Louis Review 926 521536 URL httpresearchstlouisfedorg publicationsreview1011Armestopdf Baiocchi G and W Distaso 2003 GRETL Econometric software for the GNU generation Journal of Applied Econometrics 18 105110 Baltagi B H 1995 Econometric Analysis of Panel Data New 
York Wiley Baltagi B H and YJ Chang 1994 Incomplete panels A comparative study of alternative esti mators for the unbalanced oneway error component regression model Journal of Econometrics 62 6789 Baltagi B H and Q Li 1990 A lagrange multiplier test for the error components model with incomplete panels Econometric Reviews 9 103107 Baltagi B H and P X Wu 1999 Unequally spaced panel data regressions with AR1 distur bances Econometric Theory 15 814823 Barrodale I and F D K Roberts 1974 Solution of an overdetermined system of equations in the ℓl norm Communications of the ACM 17 319320 Baxter M and R G King 1999 Measuring business cycles Approximate bandpass filters for economic time series The Review of Economics and Statistics 814 575593 Beck N and J N Katz 1995 What to do and not to do with timeseries crosssection data The American Political Science Review 89 634647 Bera A K C M Jarque and L F Lee 1984 Testing the normality assumption in limited depen dent variable models International Economic Review 25 563578 Berndt E B Hall R Hall and J Hausman 1974 Estimation and inference in nonlinear structural models Annals of Economic and Social Measurement 34 653665 Bhargava A L Franzini and W Narendranathan 1982 Serial correlation and the fixed effects model Review of Economic Studies 49 533549 Blundell R and S Bond 1998 Initial conditions and moment restrictions in dynamic panel data models Journal of Econometrics 87 115143 480 Bibliography 481 Bond S A Hoeffler and J Temple 2001 GMM estimation of empirical growth models Economics Papers from Economics Group Nuffield College University of Oxford No 2001W21 Boswijk H P 1995 Identifiability of cointegrated systems Tinbergen Institute Discussion Paper 9578 URL httpwwwaseuvanlppbin258fulltextpdf Boswijk H P and J A Doornik 2004 Identifying estimating and testing restricted cointegrated systems An overview Statistica Neerlandica 584 440465 Bournay J and G Laroque 1979 Réflexions sur la méthode délaboration des comptes trimestriels Annales de linséé 36 330 URL httpwwwjstorcomstable20075332 Box G E P and G Jenkins 1976 Time Series Analysis Forecasting and Control San Franciso HoldenDay Brand C and N Cassola 2004 A money demand system for euro area M3 Applied Economics 368 817838 Butterworth S 1930 On the theory of filter amplifiers Experimental Wireless The Wireless Engineer 7 536541 Byrd R H P Lu J Nocedal and C Zhu 1995 A limited memory algorithm for bound constrained optimization SIAM Journal on Scientific Computing 165 11901208 Cameron A C and D L Miller 2015 A practitioners guide to clusterrobust inference Journal of Human Resources 502 317373 Cameron A C and P K Trivedi 1986 Econometric models based on count data comparisons and applications of some estimators and tests Journal of Applied Econometrics 1 2954 1998 Regression Analysis of Count Data Cambridge Cambridge University Press 2005 Microeconometrics Methods and Applications Cambridge Cambridge University Press 2013 Regression Analysis of Count Data Cambridge University Press Caselli F G Esquivel and F Lefort 1996 Reopening the convergence debate A new look at crosscountry growth empirics Journal of Economic Growth 13 363389 Chesher A and M Irish 1987 Residual analysis in the grouped and censored normal linear model Journal of Econometrics 34 3361 Choi I 2001 Unit root tests for panel data Journal of International Money and Finance 202 249272 Cholette P A 1984 Adjusting subannual series to yearly benchmarks Survey Methodology 101 3549 URL httpswww150statcangccan1pub12001x1984001article 
14348engpdf Chow G C and Al Lin 1971 Best linear unbiased interpolation distribution and extrapolation of time series by related series The Review of Economics and Statistics 534 372375 URL httpswwwjstororgstable1928739 Cleveland W S 1979 Robust locally weighted regression and smoothing scatterplots Journal of the American Statistical Association 74368 829836 Cottrell A 2017 Random effects estimators for unbalanced panel data a Monte Carlo analysis gretl working papers number 4 URL httpsideasrepecorgpancwgretl4html Cottrell A and R Lucchetti 2016 Gretl Function Package Guide gretl documentation URL http sourceforgenetprojectsgretlfilesmanual Bibliography 482 CribariNeto F and S G Zarkos 2003 Econometric and statistical computing using Ox Compu tational Economics 21 277295 Datta D D and W Du 2012 Nonparametric HAC estimation for time series data with missing observations Board of Governors of the Federal Reserve System International Finance Discus sion Papers Number 1060 URL httpswwwfederalreservegovpubsifdp20121060 ifdp1060pdf Davidson R and E Flachaire 2001 The wild bootstrap tamed at last GREQAM Document de Travail 99A32 URL httprussellvchariteunivmrsfrGMMbootwild5europdf Davidson R and J G MacKinnon 1993 Estimation and Inference in Econometrics New York Oxford University Press 2004 Econometric Theory and Methods New York Oxford University Press Denton F T 1971 Adjustment of monthly or quarterly series to annual totals An approach based on quadratic minimization Journal of the American Statistical Association 66333 99102 URL httpwwwjstorcomstable2284856 Di Fonzo T 2003 Benchmarking di serie storiche economiche Nota tecnica ed estensioni Work ing paper Università degli Studi di Padova URL httppaduaresearchcabunipdit7302 1WP200310pdf Di Fonzo T and M Marini 2012 On the extrapolation with the Denton proportional benchmark ing method IMF Working Paper WP12169 URL httpswwwimforgexternalpubsft wp2012wp12169pdf Doornik J A 1995 Testing general restrictions on the cointegrating space Discussion Paper Nuffield College URL httpwwwdoornikcomresearchcoigenpdf 1998 Approximations to the asymptotic distribution of cointegration tests Journal of Economic Surveys 12 573593 Reprinted with corrections in McAleer and Oxley 1999 2007 ObjectOriented Matrix Programming Using Ox London Timberlake Consultants Press third edn URL httpwwwdoornikcom Doornik J A M Arellano and S Bond 2006 Panel Data estimation using DPD for Ox Doornik J A and H Hansen 1994 An omnibus test for univariate and multivariate normality Working paper Nuffield College Oxford Durbin J and S J Koopman 2012 Time Series Analysis by State Space Methods Oxford Oxford University Press second edn Elliott G T J Rothenberg and J H Stock 1996 Efficient tests for an autoregressive unit root Econometrica 64 813836 Engle R F and C W J Granger 1987 Cointegration and error correction Representation esti mation and testing Econometrica 55 251276 Fernández R B 1981 A methodological note on the estimation of time series The Review of Economics and Statistics 633 471476 URL httpswwwjstororgstable1924371 Fiorentini G G Calzolari and L Panattoni 1996 Analytic derivatives and the computation of GARCH estimates Journal of Applied Econometrics 11 399417 Frigo M and S G Johnson 2005 The design and implementation of FFTW3 Proceedings of the IEEE 93 2 216231 Ghysels E 2015 MIDAS Matlab Toolbox University of North Carolina Chapel Hill URL http wwwuncedueghyselspapersMIDASUsersguideV10pdf Bibliography 483 Ghysels E and H Qian 2016 Estimating MIDAS regressions via OLS 
with polynomial parameter profiling University of North Carolina Chapel Hill and MathWorks URL httpdxdoiorg 102139ssrn2837798 Ghysels E P SantaClara and R Valkanov 2004 The MIDAS touch Mixed data sampling re gression models Série Scientifique CIRANO Montréal URL httpwwwciranoqccafiles publications2004s20pdf Golub G H and C F Van Loan 1996 Matrix Computations Baltimore and London The John Hopkins University Press third edn Goossens M F Mittelbach and A Samarin 2004 The LATEX Companion Boston AddisonWesley second edn Gould W 2013 Interpreting the intercept in the fixedeffects model URL httpwwwstata comsupportfaqsstatisticsinterceptinfixedeffectsmodel Gourieroux C A Monfort E Renault and A Trognon 1987 Generalized residuals Journal of Econometrics 34 532 Greene W H 2000 Econometric Analysis Upper Saddle River NJ PrenticeHall fourth edn 2003 Econometric Analysis Upper Saddle River NJ PrenticeHall fifth edn Hall A D 2005 Generalized Method of Moments Oxford Oxford University Press Hamilton J D 1994 Time Series Analysis Princeton NJ Princeton University Press Hannan E J and B G Quinn 1979 The determination of the order of an autoregression Journal of the Royal Statistical Society B 41 190195 Hansen L P 1982 Large sample properties of generalized method of moments estimation Econometrica 504 10291054 Hansen L P and K J Singleton 1982 Generalized instrumental variables estimation of nonlinear rational expectations models Econometrica 50 12691286 Harvey A C 1989 Forecasting Structural Time Series Models and the Kalman Filter Cambridge Cambridge University Press Harvey A C and A Jaeger 1993 Detrending stylized facts and the business cycle Journal of Applied Econometrics 83 231247 Hausman J A 1978 Specification tests in econometrics Econometrica 46 12511271 Heckman J 1979 Sample selection bias as a specification error Econometrica 47 153161 Helske J 2017 KFAS Exponential family state space models in R Journal of Statistical Software 7810 139 URL httpsdoiorg1018637jssv078i10 Hodrick R and E C Prescott 1997 Postwar US business cycles An empirical investigation Journal of Money Credit and Banking 29 116 Im K S M H Pesaran and Y Shin 2003 Testing for unit roots in heterogeneous panels Journal of Econometrics 115 5374 Islam N 1995 Growth Empirics A Panel Data Approach The Quarterly Journal of Economics 1104 11271170 Johansen S 1995 LikelihoodBased Inference in Cointegrated Vector Autoregressive Models Ox ford Oxford University Press de Jong P 1991 The diffuse Kalman filter The Annals of Statistics 19 10731083 Bibliography 484 de Jong P and S ChuChunLin 2003 Smoothing with an unknown initial condition Journal of Time Series Analysis 242 141148 Kalbfleisch J D and R L Prentice 2002 The Statistical Analysis of Failure Time Data New York Wiley second edn Keane M P and K I Wolpin 1997 The career decisions of young men Journal of Political Economy 105 473522 King R G and S T Rebelo 1993 Low frequency filtering and real business cycles Journal of Economic dynamics and Control 1712 207231 Klein P 2000 Using the generalized Schur form to solve a multivariate linear rational expecta tions model Journal of Economic Dynamics and Control 2410 14051423 Koenker R 1994 Confidence intervals for regression quantiles In P Mandl and M Huskova eds Asymptotic Statistics pp 349359 New York SpringerVerlag Koenker R and G Bassett 1978 Regression quantiles Econometrica 46 3350 Koenker R and K Hallock 2001 Quantile regression Journal of Economic Perspectives 154 143156 Koenker R and J Machado 1999 Goodness of fit and related 
inference processes for quantile regression Journal of the American Statistical Association 94 12961310 Koenker R and Q Zhao 1994 Lestimation for linear heteroscedastic models Journal of Non parametric Statistics 3 223235 Koopman S J 1993 Disturbance smoother for state space models Biometrika 80 117126 Koopman S J N Shephard and J A Doornik 1999 Statistical algorithms for models in state space using SsfPack 22 Econometrics Journal 2 107160 Kwiatkowski D P C B Phillips P Schmidt and Y Shin 1992 Testing the null of stationarity against the alternative of a unit root How sure are we that economic time series have a unit root Journal of Econometrics 54 159178 Levin A CF Lin and J Chu 2002 Unit root tests in panel data asymptotic and finitesample properties Journal of Econometrics 108 124 Lucchetti R 2011 State space methods in gretl Journal of Statistical Software 4111 122 Lucchetti R L Papi and A Zazzaro 2001 Banks inefficiency and economic growth A micro macro approach Scottish Journal of Political Economy 48 400424 Lütkepohl H 2005 New Intoduction to Multiple Time Series Analysis Berlin Springer MacKinnon J G 1996 Numerical distribution functions for unit root and cointegration tests Journal of Applied Econometrics 11 601618 MacKinnon J G and H White 1985 Some heteroskedasticityconsistent covariance matrix esti mators with improved finite sample properties Journal of Econometrics 29 305325 Magnus J R and H Neudecker 1988 Matrix Differential Calculus with Applications in Statistics and Econometrics John Wiley Sons McAleer M and L Oxley 1999 Practical Issues in Cointegration Analysis Oxford Blackwell McCullagh P and J A Nelder 1983 Generalized linear models London and New York Chapman and Hall Bibliography 485 McCullough B D and C G Renfro 1998 Benchmarks and software standards A case study of GARCH procedures Journal of Economic and Social Measurement 25 5971 Melard G 1984 Algorithm AS 197 A Fast Algorithm for the Exact Maximum Likelihood of AutoregressiveMoving Average Models Journal of the Royal Statistical Society Series C Applied Statistics 331 104114 Morales J L and J Nocedal 2011 Remark on Algorithm 778 LBFGSB Fortran routines for largescale bound constrained optimization ACM Transactions on Mathematical Software 381 14 Mroz T 1987 The sensitivity of an empirical model of married womens hours of work to eco nomic and statistical assumptions Econometrica 5 765799 Nadaraya E A 1964 On estimating regression Theory of Probability and its Applications 9 141142 Nash J C 1990 Compact Numerical Methods for Computers Linear Algebra and Function Min imisation Bristol Adam Hilger second edn Nerlove M 1971 Further evidence on the estimation of dynamic economic relations from a time series of cross sections Econometrica 39 359382 1999 Properties of alternative estimators of dynamic panel models An empirical anal ysis of crosscountry data for the study of economic growth In C Hsiao K Lahiri LF Lee and M H Pesaran eds Analysis of Panels and Limited Dependent Variable Models Cambridge Cambridge University Press Newey W K and K D West 1987 A simple positive semidefinite heteroskedasticity and auto correlation consistent covariance matrix Econometrica 55 703708 1994 Automatic lag selection in covariance matrix estimation Review of Economic Stud ies 61 631653 Okui R 2009 The optimal choice of moments in dynamic panel data models Journal of Econo metrics 1511 116 Parzen E 1963 On spectral analysis with missing observations and amplitude modulation Sankhya The Indian Journal of Statistics Series A 254 
383392 Pelagatti M 2011 State space methods in OxSsfPack Journal of Statistical Software 413 125 Pollock D S G 2000 Trend estimation and detrending via rational squarewave filters Journal of Econometrics 992 317334 Portnoy S and R Koenker 1997 The Gaussian hare and the Laplacian tortoise computability of squarederror versus absoluteerror estimators Statistical Science 124 279300 Press W S Teukolsky W Vetterling and B Flannery 2007 Numerical Recipes The Art of Scientific Computing Cambridge University Press 3 edn Ramanathan R 2002 Introductory Econometrics with Applications Fort Worth Harcourt fifth edn Rao C R 1973 Linear Statistical Inference and its Applications New York Wiley second edn Rho SH and T J Vogelsang 2018 Heteroskedasticity autocorrelation robust inference in time series regressions with missing data Econometric Theory 353 601629 URL httpsdoi org101017S0266466618000117 Roodman D 2009a How to do xtabond2 An introduction to difference and system GMM in Stata The Stata Journal 9 86136 URL httpsdoiorg1011771536867X0900900106 Bibliography 486 2009b A note on the theme of too many instruments Oxford Bulletin of Economics and Statistics 71 135158 URL httpsdoiorg101111j14680084200800542x Sargan J D 1958 The estimation of economic relationships using instrumental variables Econo metrica 263 393415 URL httpsdoiorg1023071907619 Schwarz G 1978 Estimating the dimension of a model Annals of Statistics 6 461464 Sephton P S 1995 Response surface estimates of the KPSS stationarity test Economics Letters 47 255261 Shumway R H and D S Stoffer 2017 Time series analysis and its applications with R examples Springer 4th edn Sims C A 1980 Macroeconomics and reality Econometrica 48 148 Steinhaus S 1999 Comparison of mathematical programs for data analysis edition 3 Univer sity of Frankfurt URL httpwwwinformatikunifrankfurtdeststncrunch Stock J H and M W Watson 1999 Forecasting inflation Journal of Monetary Economics 442 293335 2003 Introduction to Econometrics Boston AddisonWesley 2008 Heteroskedasticityrobust standard errors for fixed effects panel data regression Econometrica 761 155174 Stokes H H 2004 On the advantage of using two or more econometric software systems to solve the same problem Journal of Economic and Social Measurement 29 307320 Swamy P A V B and S S Arora 1972 The exact finite sample properties of the estimators of coefficients in the error components regression models Econometrica 40 261275 Theil H 1961 Economic Forecasting and Policy Amsterdam NorthHolland 1966 Applied Economic Forecasting Amsterdam NorthHolland Verbeek M 2004 A Guide to Modern Econometrics New York Wiley second edn Watson G S 1964 Smooth regression analysis Shankya Series A 26 359372 White H 1980 A heteroskedasticityconsistent covariance matrix astimator and a direct test for heteroskedasticity Econometrica 48 817838 Windmeijer F 2005 A finite sample correction for the variance of linear efficient twostep GMM estimators Journal of Econometrics 126 2551 Wooldridge J M 2002a Econometric Analysis of Cross Section and Panel Data Cambridge MA MIT Press 2002b Introductory Econometrics A Modern Approach Mason OH SouthWestern sec ond edn Yalta A T and A Y Yalta 2007 GRETL 160 and its numerical accuracy Journal of Applied Econometrics 22 849854 Zhu C R H Byrd and J Nocedal 1997 Algorithm 778 LBFGSB Fortran routines for largescale boundconstrained optimization ACM Transactions on Mathematical Software 234 550560
Chapter 1 Introduction

1.1 Features at a glance
Gretl is an econometrics package, including a shared library, a command-line client program and a graphical user interface.

User-friendly: Gretl offers an intuitive user interface; it is very easy to get up and running with econometric analysis. Thanks to its association with the econometrics textbooks by Ramu Ramanathan, Jeffrey Wooldridge, and James Stock and Mark Watson, the package offers many practice data files and command scripts. These are well annotated and accessible. Two other useful resources for gretl users are the available documentation and the gretl-users mailing list.

Flexible: You can choose your preferred point on the spectrum from interactive point-and-click to complex scripting, and can easily combine these approaches.

Cross-platform: Gretl's "home" platform is Linux but it is also available for MS Windows and Mac OS X, and should work on any unix-like system that has the appropriate basic libraries (see Appendix B).

Open source: The full source code for gretl is available to anyone who wants to critique it, patch it, or extend it. See Appendix B.

Sophisticated: Gretl offers a full range of least-squares based estimators, both for single equations and for systems, including vector autoregressions and vector error correction models. Several specific maximum likelihood estimators (e.g. probit, ARIMA, GARCH) are also provided natively; more advanced estimation methods can be implemented by the user via generic maximum likelihood or nonlinear GMM.

Extensible: Users can enhance gretl by writing their own functions and procedures in gretl's scripting language, which includes a wide range of matrix functions.

Accurate: Gretl has been thoroughly tested on several benchmarks, among which the NIST reference datasets. See Appendix C.

Internet ready: Gretl can fetch materials such as databases, collections of textbook datafiles and add-on packages over the internet.

International: Gretl will produce its output in English, French, Italian, Spanish, Polish, Portuguese, German, Basque, Turkish, Russian, Albanian or Greek, depending on your computer's native language setting.

1.2 Acknowledgements

The gretl code base originally derived from the program ESL ("Econometrics Software Library"), written by Professor Ramu Ramanathan of the University of California, San Diego. We are much in debt to Professor Ramanathan for making this code available under the GNU General Public Licence and for helping to steer gretl's early development.

We are also grateful to the authors of several econometrics textbooks for permission to package for gretl various datasets associated with their texts. This list currently includes William Greene, author of Econometric Analysis; Jeffrey Wooldridge (Introductory Econometrics: A Modern Approach); James Stock and Mark Watson (Introduction to Econometrics); Damodar Gujarati (Basic Econometrics); Russell Davidson and James MacKinnon (Econometric Theory and Methods); and Marno Verbeek (A Guide to Modern Econometrics).

GARCH estimation in gretl is based on code deposited in the archive of the Journal of Applied Econometrics by Professors Fiorentini, Calzolari and Panattoni, and the code to generate p-values for Dickey-Fuller tests is due to James MacKinnon. In each case we are grateful to the authors for permission to use their work.

With regard to the internationalization of gretl, thanks go to Ignacio Díaz-Emparanza (Spanish), Michel Robitaille and Florent Bresson (French), Cristian Rigamonti (Italian), Tadeusz Kufel and Pawel Kufel (Polish), Markus Hahn and Sven Schreiber (German), Hélio Guilherme and Henrique Andrade (Portuguese),
Susan Orbe (Basque), Talha Yalta (Turkish) and Alexander Gedranovich (Russian).

Gretl has benefitted greatly from the work of numerous developers of free, open-source software; for specifics please see Appendix B. Our thanks are due to Richard Stallman of the Free Software Foundation, for his support of free software in general and for agreeing to adopt gretl as a GNU program in particular.

Many users of gretl have submitted useful suggestions and bug reports. In this connection particular thanks are due to Ignacio Díaz-Emparanza, Tadeusz Kufel, Pawel Kufel, Alan Isaac, Cri Rigamonti, Sven Schreiber, Talha Yalta, Andreas Rosenblad and Dirk Eddelbuettel, who maintains the gretl package for Debian GNU/Linux.

1.3 Installing the programs

Linux

On the Linux[1] platform you have the choice of compiling the gretl code yourself or making use of a pre-built package. Building gretl from the source is necessary if you want to access the development version or customize gretl to your needs, but this takes quite a few skills; most users will want to go for a pre-built package.

Some Linux distributions feature gretl as part of their standard offering: Debian, Ubuntu and Fedora, for example. If this is the case, all you need to do is install gretl through your package manager of choice. In addition the gretl webpage at http://gretl.sourceforge.net offers a generic package in rpm format for modern Linux systems.

If you prefer to compile your own (or are using a unix system for which pre-built packages are not available), instructions on building gretl can be found in Appendix B.

MS Windows

The MS Windows version comes as a self-extracting executable. Installation is just a matter of downloading gretl_install.exe and running this program. You will be prompted for a location to install the package.

Mac OS X

The Mac version comes as a gzipped disk image. Installation is a matter of downloading the image file, opening it in the Finder, and dragging Gretl.app to the Applications folder. However, when installing for the first time two prerequisite packages must be put in place first; details are given on the gretl website.

[1] In this manual we use "Linux" as shorthand to refer to the GNU/Linux operating system. What is said herein about Linux mostly applies to other unix-type systems too, though some local modifications may be needed.

Part I: Running the program

Chapter 2 Getting started

2.1 Let's run a regression

This introduction is mostly angled towards the graphical client program; please see Chapter 51 below and the Gretl Command Reference for details on the command-line program, gretlcli.

You can supply the name of a data file to open as an argument to gretl, but for the moment let's not do that: just fire up the program.[1] You should see a main window (which will hold information on the data set, but which is at first blank) and various menus, some of them disabled at first.

What can you do at this point? You can browse the supplied data files (or databases), open a data file, create a new data file, read the help items, or open a command script. For now let's browse the supplied data files. Under the File menu choose "Open data, Sample file". A second notebook-type window will open, presenting the sets of data files supplied with the package (see Figure 2.1). Select the first tab, "Ramanathan". The numbering of the files in this section corresponds to the chapter organization of Ramanathan (2002), which contains discussion of the analysis of these data. The data will be useful for practice purposes even without the text.

Figure 2.1: Practice data files window

If you select a row in this window and click on "Info", this opens
a window showing information on the data set in question: for example, on the sources and definitions of the variables. If you find a file that is of interest, you may open it by clicking on "Open", or just double-clicking on the file name. For the moment let's open data3-6.

In gretl windows containing lists, double-clicking on a line launches a default action for the associated list entry: e.g. displaying the values of a data series, opening a file.

[1] For convenience we refer to the graphical client program simply as gretl in this manual. Note, however, that the specific name of the program differs according to the computer platform. On Linux it is called gretl_x11 while on MS Windows it is gretl.exe. On Linux systems a wrapper script named gretl is also installed; see also the Gretl Command Reference.

This file contains data pertaining to a classic econometric "chestnut", the consumption function. The data window should now display the name of the current data file, the overall data range and sample range, and the names of the variables along with brief descriptive tags (see Figure 2.2).

Figure 2.2: Main window with a practice data file open

OK, what can we do now? Hopefully the various menu options should be fairly self-explanatory. For now we'll dip into the Model menu; a brief tour of all the main window menus is given in Section 2.3 below. Gretl's Model menu offers numerous econometric estimation routines. The simplest and most standard is Ordinary Least Squares (OLS). Selecting OLS pops up a dialog box calling for a model specification (see Figure 2.3).

Figure 2.3: Model specification dialog

To select the dependent variable, highlight the variable you want in the list on the left and click the arrow that points to the Dependent variable slot. If you check the "Set as default" box this variable will be pre-selected as dependent when you next open the model dialog box. Shortcut: double-clicking on a variable on the left selects it as dependent and also sets it as the default. To select independent variables, highlight them on the left and click the green arrow (or right-click the highlighted variable); to remove variables from the selected list, use the red arrow. To select several variables in the list box, drag the mouse over them; to select several non-contiguous variables, hold down the Ctrl key and click on the variables you want. To run a regression with consumption as the dependent variable and income as independent, click Ct into the Dependent slot and add Yt to the Independent variables list.
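The same regression can equally be run non-interactively. A minimal script sketch, assuming the data3-6 sample file and its series Ct and Yt as above (the commands are documented in the Gretl Command Reference):

    open data3-6     # the consumption-function practice data
    ols Ct 0 Yt      # OLS with Ct dependent; "0" stands for the constant

The estimation output described next is the same either way.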
2.2 Estimation output

Once you've specified a model, a window displaying the regression output will appear. The output is reasonably comprehensive and in a standard format (Figure 2.4).

Figure 2.4: Model output window

The output window contains menus that allow you to inspect or graph the residuals and fitted values, and to run various diagnostic tests on the model.

For most models there is also an option to print the regression output in LaTeX format. See Chapter 43 for details.

To import gretl output into a word processor, you may copy and paste from an output window, using its Edit menu (or Copy button, in some contexts), to the target program. Many (not all) gretl windows offer the option of copying in RTF (Microsoft's "Rich Text Format") or as LaTeX. If you are pasting into a word processor, RTF may be a good option because the tabular formatting of the output is preserved.[2] Alternatively, you can save the output to a plain text file, then import the file into the target program. When you finish a gretl session you are given the option of saving all the output from the session to a single file.

Note that on the gnome desktop and under MS Windows, the File menu includes a command to send the output directly to a printer.

When pasting or importing plain text gretl output into a word processor, select a monospaced or "typewriter" style font (e.g. Courier) to preserve the output's tabular formatting. Select a small font (10-point Courier should do) to prevent the output lines from being broken in the wrong place.

[2] Note that when you copy as RTF under MS Windows, Windows will only allow you to paste the material into applications that "understand" RTF. Thus you will be able to paste into MS Word, but not into notepad. Note also that there appears to be a bug in some versions of Windows, whereby the paste will not work properly unless the "target" application (e.g. MS Word) is already running prior to copying the material in question.

2.3 The main window menus

Reading left to right along the main window's menu bar, we find the File, Tools, Data, View, Add, Sample, Variable, Model and Help menus.

File menu

Open data: Open a native gretl data file or import from other formats. See Chapter 4.
Append data: Add data to the current working data set, from a gretl data file, a comma-separated values file or a spreadsheet file.
Save data: Save the currently open native gretl data file.
Save data as: Write out the current data set in native format, with the option of using gzip data compression. See Chapter 4.
Export data: Write out the current data set in Comma Separated Values (CSV) format, or the formats of GNU R or GNU Octave. See Chapter 4 and also Appendix D.
Send to: Send the current data set as an email attachment.
New data set: Allows you to create a blank data set, ready for typing in values or for importing series from a database. See below for more on databases.
Clear data set: Clear the current data set out of memory. Generally you don't have to do this (since opening a new data file automatically clears the old one) but sometimes it's useful.
Working directory: Change the current working directory (or "workdir") and specify related options. For an explanation of the role of the workdir click the Help button in the dialog window which is presented, or refer to the documentation of the set command with the workdir option in the command reference.
Script files: A "script" is a file containing a sequence of gretl commands. This item contains entries that let you open a script you have created previously ("User file"), open a sample script, or open an editor window in which you can create a new script.
Session files: A "session" file contains a snapshot of a previous gretl session, including the data set used and any models or graphs that you saved. Under this item you can open a saved session or save the current session.
Databases: Allows you to browse various large databases, either on your own computer or, if you are connected to the internet, on the gretl database server. See Section 4.2 for details.
Function packages: Manage user-contributed function packages that extend gretl's capabilities. To learn more about such packages, written in gretl's built-in matrix and scripting language "hansl", please refer to the "Packages" entry in the Help menu.
Resource from addon: Access example scripts and datafiles that are shipped as part of gretl's official "addons". (Addons are function packages that are more tightly integrated with the gretl program than standard user-contributed packages.)
Exit: Quit the program. You'll be prompted to save any unsaved work.

Tools menu

Statistical tables: Look up critical values for commonly
used distributions (normal or Gaussian, t, chi-square, F and Durbin-Watson).
P-value finder: Look up p-values from the Gaussian, t, chi-square, F, gamma, binomial or Poisson distributions. See also the pvalue command in the Gretl Command Reference.
Distribution graphs: Produce graphs of various probability distributions. In the resulting graph window, the pop-up menu includes an item "Add another curve", which enables you to superimpose a further plot (for example, you can draw the t distribution with various different degrees of freedom).
Test statistic calculator: Calculate test statistics and p-values for a range of common hypothesis tests (population mean, variance and proportion; difference of means, variances and proportions).
Nonparametric tests: Calculate test statistics for various nonparametric tests (Sign test, Wilcoxon rank sum test, Wilcoxon signed rank test, Runs test).
Seed for random numbers: Set the seed for the random number generator (by default this is set based on the system time when the program is started).
Command log: Open a window containing a record of the commands executed so far.
Gretl console: Open a "console" window into which you can type commands as you would using the command-line program, gretlcli (as opposed to using point-and-click).
Start Gnu R: Start R (if it is installed on your system), and load a copy of the data set currently open in gretl. See Appendix D.
Sort variables: Rearrange the listing of variables in the main window, either by ID number or alphabetically by name.
Function packages: Handles function packages (see Section 14.5), which allow you to access functions written by other users and share the ones written by you.
NIST test suite: Check the numerical accuracy of gretl against the reference results for linear regression made available by the (US) National Institute of Standards and Technology.
Preferences: Set the paths to various files gretl needs to access. Choose the font in which gretl displays text output. Activate or suppress gretl's "messaging" about the availability of program updates, and so on. See the Gretl Command Reference for further details.

Data menu

Select all: Several menu items act upon those variables that are currently selected in the main window. This item lets you select all the variables.
Display values: Pops up a window with a simple (not editable) printout of the values of the selected variable or variables.
Edit values: Opens a spreadsheet window where you can edit the values of the selected variables.
Add observations: Gives a dialog box in which you can choose a number of observations to add at the end of the current dataset; for use with forecasting.
Remove extra observations: Active only if extra observations have been added automatically in the process of forecasting; deletes these extra observations.
Read info / Edit info: "Read info" just displays the summary information for the current data file; "Edit info" allows you to make changes to it (if you have permission to do so).
Print description: Opens a window containing a full account of the current dataset, including the summary information and any specific information on each of the variables.
Add case markers: Prompts for the name of a text file containing "case markers" (short strings identifying the individual observations) and adds this information to the data set. See Chapter 4.
Remove case markers: Active only if the dataset has case markers identifying the observations; removes these case markers.
Dataset structure: Invokes a series of dialog boxes which allow you to change the structural interpretation of
the current dataset. For example, if data were read in as a cross section you can get the program to interpret them as time series or as a panel. See also section 4.4.
Compact data: For time-series data of higher than annual frequency, gives you the option of compacting the data to a lower frequency, using one of four compaction methods (average, sum, start of period or end of period).
Expand data: For time-series data, gives you the option of expanding the data to a higher frequency.
Transpose data: Turn each observation into a variable and vice versa (or, in other words, each row of the data matrix becomes a column in the modified data matrix); can be useful with imported data that have been read in "sideways".

View menu

Icon view: Opens a window showing the content of the current session as a set of icons; see section 3.4.
Graph specified vars: Gives a choice between a time series plot, a regular X-Y scatter plot, an X-Y plot using impulses (vertical bars), an X-Y plot "with factor separation" (i.e. with the points colored differently depending on the value of a given dummy variable), boxplots, and a 3-D graph. Serves up a dialog box where you specify the variables to graph. See Chapter 6 for details.
Multiple graphs: Allows you to compose a set of up to six small graphs, either pairwise scatterplots or time-series graphs. These are displayed together in a single window.
Summary statistics: Shows a full set of descriptive statistics for the variables selected in the main window.
Correlation matrix: Shows the pairwise correlation coefficients for the selected variables.
Cross Tabulation: Shows a cross-tabulation of the selected variables. This works only if at least two variables in the data set have been marked as discrete (see Chapter 12).
Principal components: Produces a Principal Components Analysis for the selected variables.
Mahalanobis distances: Computes the Mahalanobis distance of each observation from the centroid of the selected set of variables.
Cross-correlogram: Computes and graphs the cross-correlogram for two selected variables.

Add menu: Offers various standard transformations of variables (logs, lags, squares, etc.) that you may wish to add to the data set. Also gives the option of adding random variables, and (for time-series data) adding seasonal dummy variables (e.g. quarterly dummy variables for quarterly data).

Sample menu

Set range: Select a different starting and/or ending point for the current sample, within the range of data available.
Restore full range: Self-explanatory.
Define, based on dummy: Given a dummy (indicator) variable with values 0 or 1, this drops from the current sample all observations for which the dummy variable has value 0.
Restrict, based on criterion: Similar to the item above, except that you don't need a pre-defined variable: you supply a Boolean expression (e.g. sqft > 1400) and the sample is restricted to observations satisfying that condition. See the entry for genr in the Gretl Command Reference for details on the Boolean operators that can be used.
Random subsample: Draw a random sample from the full dataset.
Drop all obs with missing values: Drop from the current sample all observations for which at least one variable has a missing value (see Section 4.6).
Count missing values: Give a report on observations where data values are missing. May be useful in examining a panel data set, where it's quite common to encounter missing values.
Set missing value code: Set a numerical value that will be interpreted as "missing" or "not available". This is intended for use with imported data, when gretl has not recognized the missing-value code used.
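Note that the sample operations in this menu have direct script counterparts in the smpl command. A minimal sketch (the quarterly date range and the series sqft are purely illustrative):

    smpl 1960:1 1985:4            # set the sample range directly
    smpl sqft > 1400 --restrict   # restrict, based on criterion
    smpl full                     # restore the full range

Subsampling is discussed in detail in Chapter 5.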
Variable menu: Most items under here operate on a single variable at a time. The "active" variable is set by highlighting it (clicking on its row) in the main data window. Most options will be self-explanatory. Note that you can rename a variable and can edit its descriptive label under "Edit attributes". You can also "Define a new variable" via a formula (e.g. involving some function of one or more existing variables). For the syntax of such formulae, look at the online help for "Generate variable syntax" or see the genr command in the Gretl Command Reference. One simple example:

    foo = x1 * x2

will create a new variable foo as the product of the existing variables x1 and x2. In these formulae, variables must be referenced by name, not number.

Model menu: For details on the various estimators offered under this menu please consult the Gretl Command Reference. Also see Chapter 25 regarding the estimation of nonlinear models.

Help menu: Please use this as needed. It gives details on the syntax required in various dialog entries.

2.4 Keyboard shortcuts

When working in the main gretl window, some common operations may be performed using the keyboard, as shown in the table below.

Return: Opens a window displaying the values of the currently selected variables; it is the same as selecting "Data, Display Values".
Delete: Pressing this key has the effect of deleting the selected variables. A confirmation is required, to prevent accidental deletions.
e: Has the same effect as selecting "Edit attributes" from the "Variable" menu.
F2: Same as "e". Included for compatibility with other programs.
g: Has the same effect as selecting "Define new variable" from the "Variable" menu (which maps onto the genr command).
h: Opens a help window for gretl commands.
F1: Same as "h". Included for compatibility with other programs.
r: Refreshes the variable list in the main window.
t: Graphs the selected variable; a line graph is used for time-series datasets, whereas a distribution plot is used for cross-sectional data.

2.5 The gretl toolbar

At the bottom left of the main window sits the toolbar.
a new session. To preserve it, save the script under a different name. Script files will be found most easily, using the GUI file selector, if you name them with the extension ".inp".

To open a script you have written independently, use the "File, Script files" menu item; to create a script from scratch use the "File, Script files, New script" item or the "new script" toolbar button. In either case a script window will open (see Figure 3.1).

Figure 3.1: Script window, editing a command file

The toolbar at the top of the script window offers the following functions (left to right):

1. Save the file
2. Save the file under a specified name
3. Print the file (this option is not available on all platforms)
4. Execute the commands in the file
5. Copy selected text
6. Paste the selected text
7. Find and replace text
8. Undo the last Paste or Replace action
9. Help (if you place the cursor in a command word and press the question mark you will get help on that command)
10. Close the window

When you execute the script, by clicking on the Execute icon or by pressing Ctrl-r, all output is directed to a single window, where it can be edited, saved or copied to the clipboard. To learn more about the possibilities of scripting, take a look at the gretl Help item "Command reference", or start up the command-line program gretlcli and consult its help, or consult the Gretl Command Reference.

If you run the script when part of it is highlighted, gretl will only run that portion. Moreover, if you want to run just the current line, you can do so by pressing Ctrl-Enter.[1]

[1] This feature is not unique to gretl; other econometric packages offer the same facility. However, experience shows that while this can be remarkably useful, it can also lead to writing "dinosaur" scripts that are never meant to be executed all at once but rather used as a chaotic repository to cherry-pick snippets from. Since gretl allows you to have several script windows open at the same time, you may want to keep your scripts tidy and reasonably small.

Clicking the right mouse button in the script editor window produces a pop-up menu. This gives you the option of executing either the line on which the cursor is located, or the selected region of the script if there's a selection in place. If the script is editable, this menu also gives the option of adding or removing comment markers from the start of the line or lines.

The gretl package includes over 70 example scripts. Many of these relate to Ramanathan (2002), but they may also be used as a free-standing introduction to scripting in gretl and to various points of econometric theory. You can explore the example files under "File, Script files, Example scripts". There you will find a listing of the files along with a brief description of the points they illustrate and the data they employ. Open any file and run it to see the output. Note that long commands in a script can be broken over two or more lines, using backslash as a continuation character.

You can, if you wish, use the GUI controls and the scripting approach in tandem, exploiting each method where it offers greater convenience. Here are two suggestions.

Open a data file in the GUI. Explore the data: generate graphs, run regressions, perform tests. Then open the Command log, edit out any redundant commands, and save it under a specific name. Run the script to generate a single file containing a concise record of your work.

Start by establishing a new script file. Type in any commands that may be required to set up transformations of the data (see the genr command in the Gretl Command Reference). Typically this sort of thing can be accomplished more efficiently via commands assembled with forethought rather than point-and-click. Then save and run the script: the GUI data window will be updated accordingly. Now you can carry out further exploration of the data via the GUI. To revisit the data at a later point, open and rerun the "preparatory" script first.
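A preparatory script of this kind might look like the following minimal sketch. It assumes the data3-6 sample file from Chapter 2; logs and lags are regular gretl commands, and the generated names l_Ct, l_Yt, Ct_1 and Yt_1 follow gretl's usual naming conventions:

    open data3-6
    logs Ct Yt        # add the natural logs l_Ct and l_Yt
    lags 1 ; Ct Yt    # add the one-period lags Ct_1 and Yt_1

After running this, the new series appear in the main window and can be explored via the GUI as usual.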
Scripts and data files

One common way of doing econometric research with gretl is as follows: compose a script; execute the script; inspect the output; modify the script; run it again, with the last three steps repeated as many times as necessary. In this context, note that when you open a data file this clears out most of gretl's internal state. It's therefore probably a good idea to have your script start with an open command: the data file will be re-opened each time, and you can be confident you're getting "fresh" results.

One further point should be noted. When you go to open a new data file via the graphical interface, you are always prompted: opening a new data file will lose any unsaved work, do you really want to do this? When you execute a script that opens a data file, however, you are not prompted. The assumption is that in this case you're not going to lose any work, because the work is embodied in the script itself (and it would be annoying to be prompted at each iteration of the work cycle described above). This means you should be careful if you've done work using the graphical interface and then decide to run a script: the current data file will be replaced without any questions asked, and it's your responsibility to save any changes to your data first.

3.2 Saving script objects

When you estimate a model using point-and-click, the model results are displayed in a separate window, offering menus which let you perform tests, draw graphs, save data from the model, and so on. Ordinarily, when you estimate a model using a script you just get a non-interactive printout of the results. You can, however, arrange for models estimated in a script to be "captured", so that you can examine them interactively when the script is finished. Here is an example of the syntax for achieving this effect:

    Model1 <- ols Ct 0 Yt

That is, you type a name for the model to be saved under, then a back-pointing "assignment arrow", then the model command. The assignment arrow is composed of the less-than sign followed by a dash; it must be separated by spaces from both the preceding name and the following command. The name for a saved object may include spaces, but in that case it must be wrapped in double quotes:

    "Model 1" <- ols Ct 0 Yt

Models saved in this way will appear as icons in the gretl icon view window (see Section 3.4) after the script is executed. In addition, you can arrange to have a named model displayed (in its own window) automatically as follows:

    Model1.show

Again, if the name contains spaces it must be quoted:

    "Model 1".show

The same facility can be used for graphs. For example, the following will create a plot of Ct against Yt, save it under the name "CrossPlot" (it will appear under this name in the icon view window), and have it displayed:

    CrossPlot <- gnuplot Ct Yt
    CrossPlot.show

You can also save the output from selected commands as named pieces of text (again, these will appear in the session icon window, from where you can open them later). For example, this command sends the output from an augmented Dickey-Fuller test to a text object named ADF1 and displays it in a window:

    ADF1 <- adf 2 x1
    ADF1.show

Objects saved in this way (whether models, graphs or pieces of text output) can be destroyed using the command .free appended to
the name of the object, as in ADF1.free.

3.3 The gretl console

A further option is available for your computing convenience. Under gretl's Tools menu you will find the item "Gretl console"; there is also an "open gretl console" button on the toolbar in the main window. This opens up a window in which you can type commands and execute them one by one (by pressing the Enter key) interactively. This is essentially the same as gretlcli's mode of operation, except that the GUI is updated based on commands executed from the console, enabling you to work back and forth as you wish.

In the console you have "command history"; that is, you can use the up and down arrow keys to navigate the list of commands you have entered to date. You can retrieve, edit and then re-enter a previous command.

In console mode you can create, display and free objects (models, graphs or text) as described above for script mode.

3.4 The Session concept

Gretl offers the idea of a "session" as a way of keeping track of your work and revisiting it later. The basic idea is to provide an iconic space containing various objects pertaining to your current working session (see Figure 3.2). You can add objects (represented by icons) to this space as you go along. If you save the session, these added objects should be available again if you re-open the session later.

Figure 3.2: Icon view: one model and one graph have been added to the default icons

If you start gretl and open a data set, then select "Icon view" from the View menu, you should see the basic default set of icons: these give you quick access to information on the data set (if any), correlation matrix ("Correlations") and descriptive summary statistics ("Summary"). All of these are activated by double-clicking the relevant icon. The "Data set" icon is a little more complex: double-clicking opens up the data in the built-in spreadsheet, but you can also right-click on the icon for a menu of other actions.

To add a model to the Icon view, first estimate it using the Model menu. Then pull down the File menu in the model window and select "Save to session as icon" or "Save as icon and close". Simply hitting the S key over the model window is a shortcut to the latter action.

To add a graph, first create it (under the View menu, "Graph specified vars", or via one of gretl's other graph-generating commands). Click on the graph window to bring up the graph menu, and select "Save to session as icon".

Once a model or graph is added, its icon will appear in the Icon view window. Double-clicking on the icon redisplays the object, while right-clicking brings up a menu which lets you display or delete the object. This pop-up menu also gives you the option of editing graphs.

The model table

In econometric research it is common to estimate several models with a common dependent variable, the models differing in respect of which independent variables are included, or perhaps in respect of the estimator used. In this situation it is convenient to present the regression results in the form of a table, where each column contains the results (coefficient estimates and standard errors) for a given model, and each row contains the estimates for a given variable across the models. Note that some estimation methods are not compatible with the straightforward model table format; therefore gretl will not let those models be added to the model table. These methods include nonlinear least squares (nls), generic maximum-likelihood estimators (mle), generic GMM (gmm), dynamic panel models (dpanel), interval regressions (intreg), bivariate probit models (biprobit), ARIMA models (arima or arma), and GARCH models (garch and arch).
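In script mode a comparable table is built with the modeltab command. A minimal sketch, reusing the data3-6 series from Chapter 2 (the second, lagged specification is purely illustrative):

    open data3-6
    ols Ct 0 Yt
    modeltab add            # add the model just estimated to the table
    ols Ct 0 Yt Yt(-1)
    modeltab add
    modeltab show           # print the models side by side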
In the Icon view window gretl provides a means of constructing such a table, and copying it in plain text, LaTeX or Rich Text Format. The procedure is outlined below. (As just noted, the model table can also be built non-interactively, in script mode; see the entry for modeltab in the Gretl Command Reference.)

1. Estimate a model which you wish to include in the table, and in the model display window, under the File menu, select "Save to session as icon" or "Save as icon and close".
2. Repeat step 1 for the other models to be included in the table (up to a total of six models).
3. When you are done estimating the models, open the icon view of your gretl session, by selecting "Icon view" under the View menu in the main gretl window, or by clicking the "session icon view" icon on the gretl toolbar.
4. In the Icon view, there is an icon labeled "Model table". Decide which model you wish to appear in the left-most column of the model table and add it to the table, either by dragging its icon onto the Model table icon, or by right-clicking on the model icon and selecting "Add to model table" from the pop-up menu.
5. Repeat step 4 for the other models you wish to include in the table. The second model selected will appear in the second column from the left, and so on.
6. When you are finished composing the model table, display it by double-clicking on its icon. Under the Edit menu in the window which appears, you have the option of copying the table to the clipboard in various formats.
7. If the ordering of the models in the table is not what you wanted, right-click on the model table icon and select "Clear table". Then go back to step 4 above and try again.

A simple instance of gretl's model table is shown in Figure 3.3.

Figure 3.3: Example of model table

The graph page

The "graph page" icon in the session window offers a means of putting together several graphs for printing on a single page. This facility will work only if you have the LaTeX typesetting system installed, and are able to generate and view either PDF or PostScript output. The output format is controlled by your choice of program for compiling TeX files, which can be found under the "Programs" tab in the Preferences dialog box (under the Tools menu in the main window). Usually this should be pdflatex for PDF output, or latex for PostScript. In the latter case you must have a working set-up for handling PostScript, which will usually include dvips, ghostscript and a viewer such as gv, ggv or kghostview.

In the Icon view window, you can drag up to eight graphs onto the graph page icon. When you double-click on the icon (or right-click and select "Display"), a page containing the selected graphs (in PDF or EPS format) will be composed and opened in your viewer. From there you should be able to print the page.

To clear the graph page, right-click on its icon and select "Clear".

As with the model table, it is also possible to manipulate the graph page via commands in script or console mode; see the entry for the graphpg command in the Gretl Command Reference.

Saving and re-opening sessions

If you create models or graphs that you think you may wish to re-examine later, then before quitting gretl select "Session files, Save session" from the File menu and give a name under which to save the session. To re-open the session later, either:

- Start gretl, then re-open the session file by going to "File, Session files, Open session"; or
- From the command line, type gretl -r sessionfile, where sessionfile is the name under which the session was saved; or
- Drag the icon representing a session file onto gretl.
Chapter 4
Data files

4.1 Data file formats

Gretl has its own native format for data files. Most users will probably not want to read or write such files outside of gretl itself, but occasionally this may be useful, and details on the file formats are given in Appendix A. The program can also import data from a variety of other formats. In the GUI program this can be done via the "File > Open Data > User file" menu (note the drop-down list of acceptable file types). In script mode, simply use the open command. The supported import formats are as follows.

- Plain text files (comma-separated, or "CSV", being the most common type). For details on what gretl expects of such files, see Section 4.3.
- Spreadsheets: MS Excel, Gnumeric and Open Document (ODS). The requirements for such files are given in Section 4.3.
- Stata data files (.dta).
- SPSS data files (.sav).
- SAS "xport" files (.xpt).
- Eviews workfiles (.wf1).[1]
- JMulTi data files.

When you import data from a plain text format, gretl opens a "diagnostic" window, reporting on its progress in reading the data. If you encounter a problem with ill-formatted data, the messages in this window should give you a handle on fixing the problem.

Note that gretl has a facility for writing out data in the native formats of GNU R, Octave, JMulTi and PcGive (see Appendix D). In the GUI client this option is found under the "File > Export data" menu; in the command-line client use the store command with the appropriate option flag.

[1] See http://users.wfu.edu/cottrell/eviews_format/

4.2 Databases

For working with large amounts of data, gretl is supplied with a database-handling routine. A database, as opposed to a data file, is not read directly into the program's workspace. A database can contain series of mixed frequencies and sample ranges. You open the database and select series to import into the working dataset. You can then save those series in a native format data file if you wish. Databases can be accessed via the menu item "File > Databases".

For details on the format of gretl databases, see Appendix A.

Online access to databases

Several gretl databases are available from Wake Forest University. Your computer must be connected to the internet for this option to work. Please see the description of the "data" command under the Help menu. Visit the gretl data page for details and updates on available data.

Foreign database formats

Thanks to Thomas Doan of Estima, who made available the specification of the database format used by RATS 4 (Regression Analysis of Time Series), gretl can handle such databases, or at least a subset of same, namely time-series databases containing monthly and quarterly series.

Gretl can also import data from PcGive databases. These take the form of a pair of files, one containing the actual data (with suffix .bn7) and one containing supplementary information (.in7).

In addition, gretl offers ODBC connectivity. Be warned: this feature is meant for somewhat advanced users; there is currently no graphical interface. Interested readers will find more info in chapter 42.
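In a script, database access works through the open and data commands. A minimal sketch, assuming the fedstl.bin database shipped with gretl and assuming that it contains a series named unrate:

    open fedstl.bin           # opens a database, not a regular dataset
    data unrate               # import the selected series into the workspace
    store unrate.gdt unrate   # optionally, save it as a native data file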
4.3 Creating a dataset from scratch

There are several ways of doing this:

1. Find, or create using a text editor, a plain text data file and open it via "Import".
2. Use your favorite spreadsheet to establish the data file, save it in comma-separated format if necessary (this may not be necessary if the spreadsheet format is MS Excel, Gnumeric or Open Document), then use one of the "Import" options.
3. Use gretl's built-in spreadsheet.
4. Select data series from a suitable database.
5. Use your favorite text editor or other software tools to create a data file in gretl format independently.

Here are a few comments and details on these methods.

Common points on imported data

Options (1) and (2) involve using gretl's "import" mechanism. For the program to read such data successfully, certain general conditions must be satisfied:

- The first row must contain valid variable names. A valid variable name is of 31 characters maximum; starts with a letter; and contains nothing but letters, numbers and the underscore character. Longer variable names will be truncated to 31 characters. Qualifications to the above: First, in the case of a plain text import, if the file contains no row with variable names the program will automatically add names, v1, v2 and so on. Second, by "the first row" is meant the first relevant row. In the case of plain text imports, blank rows and rows beginning with a hash mark are ignored. In the case of Excel, Gnumeric and ODS imports, you are presented with a dialog box where you can select an offset into the spreadsheet, so that gretl will ignore a specified number of rows and/or columns.

- Data values: these should constitute a rectangular block, with one variable per column and one observation per row. The number of variables (data columns) must match the number of variable names given. See also section 4.6. Numeric data are expected, but in the case of importing from plain text the program offers limited handling of character (string) data: if a given column contains character data only, consecutive numeric codes are substituted for the strings, and once the import is complete a table is printed showing the correspondence between the strings and the codes.

- Dates (or observation labels): Optionally, the first column may contain strings such as dates, or labels for cross-sectional observations. Such strings have a maximum of 15 characters (as with variable names, longer strings will be truncated). A column of this sort should be headed with the string "obs" or "date", or the first row entry may be left blank.

For dates to be recognized as such, the date strings should adhere to one or other of a set of specific formats, as follows. For annual data: 4-digit years. For quarterly data: a 4-digit year, followed by a separator (either a period, a colon, or the letter Q), followed by a 1-digit quarter. Examples: 1997.1, 2002:3, 1947Q1. For monthly data: a 4-digit year, followed by a period or a colon, followed by a two-digit month. Examples: 1997.01, 2002:10.

Plain text ("CSV") files can use comma, space, tab or semicolon as the column separator. When you open such a file via the GUI you are given the option of specifying the separator, though in most cases it should be detected automatically.
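To make these requirements concrete, here is a small hypothetical CSV file which gretl should read as quarterly time series (the variable names and values are invented):

    obs,income,consump
    1990:1,1203.5,987.1
    1990:2,1219.8,1001.4
    1990:3,1247.2,1020.9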
If you use a spreadsheet to prepare your data you are able to carry out various transformations of the "raw" data with ease (adding things up, taking percentages or whatever); note, however, that you can also do this sort of thing easily (perhaps more easily) within gretl, by using the tools under the "Add" menu.

Appending imported data

You may wish to establish a dataset piece by piece, by incremental importation of data from other sources. This is supported via the "File > Append data" menu items: gretl will check the new data for conformability with the existing dataset and, if everything seems OK, will merge the data. You can add new variables in this way, provided the data frequency matches that of the existing dataset. Or you can append new observations for data series that are already present; in this case the variable names must match up correctly. Note that by default (that is, if you choose "Open data" rather than "Append data"), opening a new data file closes the current one.

Using the built-in spreadsheet

Under the "File > New data set" menu you can choose the sort of dataset you want to establish (e.g. quarterly time series, cross-sectional). You will then be prompted for starting and ending dates (or observation numbers) and the name of the first variable to add to the dataset. After supplying this information you will be faced with a simple spreadsheet into which you can type data values. In the spreadsheet window, clicking the right mouse button will invoke a pop-up menu which enables you to add a new variable (column), to add an observation (append a row at the foot of the sheet), or to insert an observation at the selected point (move the data down and insert a blank row).

Once you have entered data into the spreadsheet you import these into gretl's workspace using the spreadsheet's "Apply changes" button.

Please note that gretl's spreadsheet is quite basic and has no support for functions or formulas. Data transformations are done via the "Add" or "Variable" menus in the main window.

Selecting from a database

Another alternative is to establish your dataset by selecting variables from a database. Begin with the "File > Databases" menu item. This has four forks: "Gretl native", "RATS 4", "PcGive" and "On database server". You should be able to find the file fedstl.bin in the file selector that opens if you choose the "Gretl native" option, since this file, which contains a large collection of US macroeconomic time series, is supplied with the distribution.

You won't find anything under "RATS 4" unless you have purchased RATS data.[2] If you do possess RATS data you should go into the "Tools > Preferences > General" dialog, select the Databases tab, and fill in the correct path to your RATS files.

If your computer is connected to the internet you should find several databases (at Wake Forest University) under "On database server". You can browse these remotely; you also have the option of installing them onto your own computer. The initial remote databases window has an item showing, for each file, whether it is already installed locally (and if so, whether the local version is up to date with the version at Wake Forest).

Assuming you have managed to open a database you can import selected series into gretl's workspace by using the "Series > Import" menu item in the database window, or via the pop-up menu that appears if you click the right mouse button, or by dragging the series into the program's main window.

[2] See www.estima.com

Creating a gretl data file independently

It is possible to create a data file in one or other of gretl's own formats using a text editor or software tools such as awk, sed or perl. This may be a good choice if you have large amounts of data already in machine readable form. You will, of course, need to study these data formats (XML-based or "traditional") as described in Appendix A.

4.4 Structuring a dataset

Once your data are read by gretl, it may be necessary to supply some information on the nature of the data. We distinguish between three kinds of datasets:

1. Cross section
2. Time series
3. Panel data

The primary tool for doing this is the "Data > Dataset structure" menu entry in the graphical interface, or the setobs command for scripts and the command-line interface.

Cross-sectional data

By a cross section we mean observations on a set of units (which may be firms, countries, individuals, or whatever) at a common point in time. This is the default interpretation for a data file: if there is insufficient information to interpret data as time-series or panel data, they are automatically interpreted as a cross section. In the unlikely event that cross-sectional data are wrongly interpreted as time series, you can correct this by selecting the "Data > Dataset structure" menu item. Click the "cross-sectional" radio button in the dialog box that appears, then click "Forward". Click "OK" to confirm your selection.
Time series data

When you import data from a spreadsheet or plain text file, gretl will make fairly strenuous efforts to glean time-series information from the first column of the data, if it looks at all plausible that such information may be present. If time-series structure is present but not recognized, again you can use the "Data > Dataset structure" menu item. Select "Time series" and click "Forward"; select the appropriate data frequency and click "Forward" again; then select or enter the starting observation and click "Forward" once more. Finally, click "OK" to confirm the time-series interpretation if it is correct (or click "Back" to make adjustments if need be).

Besides the basic business of getting a data set interpreted as time series, further issues may arise relating to the frequency of time-series data. In a gretl time-series data set, all the series must have the same frequency. Suppose you wish to make a combined dataset using series that, in their original state, are not all of the same frequency. For example, some series are monthly and some are quarterly.

Your first step is to formulate a strategy: do you want to end up with a quarterly or a monthly data set? A basic point to note here is that "compacting" data from a higher frequency (e.g. monthly) to a lower frequency (e.g. quarterly) is usually unproblematic. You lose information in doing so, but in general it is perfectly legitimate to take, say, the average of three monthly observations to create a quarterly observation. On the other hand, "expanding" data from a lower to a higher frequency is not, in general, a valid operation.

In most cases, then, the best strategy is to start by creating a data set of the lower frequency, and then to compact the higher-frequency data to match. When you import higher-frequency data from a database into the current data set, you are given a choice of compaction method (average, sum, start of period, or end of period). In most instances "average" is likely to be appropriate.

You can also import lower-frequency data into a high-frequency data set, but this is generally not recommended. What gretl does in this case is simply replicate the values of the lower-frequency series as many times as required. For example, suppose we have a quarterly series with the value 35.5 in 1990:1, the first quarter of 1990. On expansion to monthly, the value 35.5 will be assigned to the observations for January, February and March of 1990. The expanded variable is therefore useless for fine-grained time-series analysis, outside of the special case where you know that the variable in question does in fact remain constant over the sub-periods.

When the current data frequency is appropriate, gretl offers both "Compact data" and "Expand data" options under the "Data" menu. These options operate on the whole data set, compacting or expanding all series. They should be considered "expert" options and should be used with caution.
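In a script, counterparts of these options are provided by the dataset command. A minimal sketch, assuming a monthly dataset is already open (the file name here is hypothetical):

    open mymonthly.gdt
    dataset compact 4       # compact to quarterly, averaging by default
    dataset compact 1 sum   # then to annual, summing within each year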
Panel data

Panel data are inherently three-dimensional, the dimensions being variable, cross-sectional unit, and time-period. For example, a particular number in a panel data set might be identified as the observation on capital stock for General Motors in 1980. (A note on terminology: we use the terms "cross-sectional unit", "unit" and "group" interchangeably below to refer to the entities that compose the cross-sectional dimension of the panel. These might, for instance, be firms, countries or persons.)

For representation in a textual computer file (and also for gretl's internal calculations) the three dimensions must somehow be flattened into two. This "flattening" involves taking layers of the data that would naturally stack in a third dimension, and stacking them in the vertical dimension.

gretl always expects data to be arranged "by observation", that is, such that each row represents an observation (and each variable occupies one and only one column). In this context the flattening of a panel data set can be done in either of two ways:

- Stacked time series: the successive vertical blocks each comprise a time series for a given unit.
- Stacked cross sections: the successive vertical blocks each comprise a cross-section for a given period.

You may input data in whichever arrangement is more convenient. Internally, however, gretl always stores panel data in the form of stacked time series.

4.5 Panel data specifics

When you import panel data into gretl from a spreadsheet or comma-separated format, the panel nature of the data will not be recognized automatically (most likely the data will be treated as "undated"). A panel interpretation can be imposed on the data using the graphical interface or via the setobs command.

In the graphical interface, use the menu item "Data > Dataset structure". In the first dialog box that appears, select "Panel". In the next dialog you have a three-way choice. The first two options, "Stacked time series" and "Stacked cross sections", are applicable if the data set is already organized in one of these two ways. If you select either of these options, the next step is to specify the number of cross-sectional units in the data set. The third option, "Use index variables", is applicable if the data set contains two variables that index the units and the time periods respectively; the next step is then to select those variables. For example, a data file might contain a country code variable and a variable representing the year of the observation. In that case gretl can reconstruct the panel structure of the data regardless of how the observation rows are organized.

The setobs command has options that parallel those in the graphical interface. If suitable index variables are available you can do, for example,

    setobs unitvar timevar --panel-vars

where unitvar is a variable that indexes the units and timevar is a variable indexing the periods. Alternatively you can use the form

    setobs freq 1:1 structure

where freq is replaced by the "block size" of the data (that is, the number of periods in the case of stacked time series, or the number of units in the case of stacked cross-sections) and structure is either --stacked-time-series or --stacked-cross-section. Two examples are given below: the first is suitable for a panel in the form of stacked time series with observations from 20 periods; the second for stacked cross sections with 5 units.

    setobs 20 1:1 --stacked-time-series
    setobs 5 1:1 --stacked-cross-section
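Returning to the index-variables form shown above, a concrete sketch might look as follows (the series names country and year are hypothetical):

    # country indexes the units, year indexes the periods
    setobs country year --panel-vars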
Panel data arranged by variable

Publicly available panel data sometimes come arranged "by variable". Suppose we have data on two variables, x1 and x2, for each of 50 states in each of 5 years (giving a total of 250 observations per variable). One textual representation of such a data set would start with a block for x1, with 50 rows corresponding to the states and 5 columns corresponding to the years. This would be followed, vertically, by a block with the same structure for variable x2. A fragment of such a data file is shown below, with quinquennial observations 1965-1985. Imagine the table continued for 48 more states, followed by another 50 rows for variable x2.

    x1
          1965   1970   1975   1980   1985
    AR  100.0  110.5  118.7  131.2  160.4
    AZ  100.0  104.3  113.8  120.9  140.6

If a datafile with this sort of structure is read into gretl,[3] the program will interpret the columns as distinct variables, so the data will not be usable "as is". But there is a mechanism for correcting the situation, namely the stack function.

[3] Note that you will have to modify such a datafile slightly before it can be read at all. The line containing the variable name (in this example, x1) will have to be removed, and so will the initial row containing the years, otherwise they will be taken as numerical data.

Consider the first data column in the fragment above: the first 50 rows of this column constitute a cross-section for the variable x1 in the year 1965. If we could create a new series by stacking the first 50 entries in the second column underneath the first 50 entries in the first, we would be on the way to making a data set "by observation" (in the first of the two forms mentioned above, stacked cross-sections). That is, we'd have a column comprising a cross-section for x1 in 1965, followed by a cross-section for the same variable in 1970.

The following gretl script illustrates how we can accomplish the stacking, for both x1 and x2. We assume that the original data file is called panel.txt, and that in this file the columns are headed with "variable names" v1, v2, ..., v5. (The columns are not really variables, but in the first instance we "pretend" that they are.)

    open panel.txt
    series x1 = stack(v1..v5, 50)
    series x2 = stack(v1..v5, 50, 50)
    setobs 50 1:1 --stacked-cross-section
    store panel.gdt x1 x2

The second and third lines illustrate the syntax of the stack function, which has this signature:

    series stack(list L, scalar length, scalar offset)

- L: a list of series on which to operate.
- length: an integer giving the number of observations to take from each series.
- offset: an integer giving the offset from the top of the dataset at which to start taking values (optional, defaults to 0).

The syntax in the example above constructs a list of the 5 contiguous series to be stacked. More generally, you can define a named list of series and pass that as the first argument to stack (see chapter 15). In this example we're supposing that the full data set contains 100 rows, and that in the stacking of variable x1 we wish to read only the first 50 rows from each column, so we give 50 as the second argument.

On line 3 we do the stacking for variable x2. Again, we want a length of 50 for the components of the stacked series, but this time we want to start reading from the 50th row of the original data, and so we add a third offset argument of 50. Line 4 then imposes a panel interpretation on the data. Finally, we save the stacked data to file, with the panel interpretation.

The illustrative script above is appropriate when the number of variables to be processed is small. When there are many variables in the dataset it will be more convenient to use a loop to accomplish the stacking, as shown in the following script. The setup is presumed to be the same as in the previous case (50 units, 5 periods), but with 20 variables rather than 2.

    open panel.txt
    list L = v1..v5   # predefine a list of series
    scalar length = 50
    loop i=1..20
        scalar offset = (i - 1) * length
        series x$i = stack(L, length, offset)
    endloop
    setobs 50 1:1 --stacked-cross-section
    store panel.gdt x1..x20
Side-by-side time series

There's a second sort of data that you may wish to convert to gretl's panel format, namely side-by-side time series for a number of cross-sectional units. For example, a data file might contain separate GDP series of common length T for each of N countries. To turn these into a single stacked time series, the stack function can again be used. An example follows, where we suppose the original data source is a comma-separated file named GDP.csv, containing GDP data for countries from Austria (GDP_AT) to Zimbabwe (GDP_ZW) in consecutive columns.

    open GDP.csv
    scalar T = $nobs   # the number of periods
    list L = GDP_AT..GDP_ZW
    series GDP = stack(L, T)
    setobs T 1:1 --stacked-time-series
    store panel.gdt GDP

The resulting data file, panel.gdt, will contain a single series of length NT, where N is the number of countries and T is the length of the original dataset. One could insert revised variants of lines 3 and 4 of the script if the original file contained additional side-by-side per-country series for investment, consumption or whatever.

Panel data marker strings

It can be helpful with panel data to have the observations identified by mnemonic markers. A special function in the genr command is available for this purpose.

In the example under the heading "Panel data arranged by variable" above, suppose all the states are identified by two-letter codes in the left-most column of the original datafile. When the stack function is invoked as shown, these codes will be stacked along with the data values. If the first row is marked AR, for Arkansas, then the marker AR will end up being shown on each row containing an observation for Arkansas. That's all very well, but these markers don't tell us anything about the date of the observation. To rectify this we could do:

    genr time
    series year = 1960 + (5 * time)
    genr markers = "%s:%d", marker, year

The first line generates a 1-based index representing the period of each observation, and the second line uses the time variable to generate a variable representing the year of the observation. The third line contains this special feature: if (and only if) the name of the new "variable" to generate is markers, the portion of the command following the equals sign is taken as a C-style format string (which must be wrapped in double quotes), followed by a comma-separated list of arguments. The arguments will be printed according to the given format to create a new set of observation markers. Valid arguments are either the names of variables in the dataset, or the string marker which denotes the pre-existing observation marker. The format specifiers which are likely to be useful in this context are %s for a string and %d for an integer. Strings can be truncated: for example %.3s will use just the first three characters of the string. To chop initial characters off an existing observation marker when constructing a new one, you can use the syntax marker + n, where n is a positive integer, in which case the first n characters will be skipped.

After the commands above are processed, then, the observation markers will look like, for example, AR:1965, where the two-letter state code and the year of the observation are spliced together with a colon.

Panel dummy variables

In a panel study you may wish to construct dummy variables of one or both of the following sorts: (a) dummies as unique identifiers for the units or groups, and (b) dummies as unique identifiers for the time periods. The former may be used to allow the intercept of the regression to differ across the units, the latter to allow the intercept to differ across periods.

Two special functions are available to create such dummies. These are found under the "Add" menu in the GUI, or under the genr command in script mode or gretlcli.
1. "unit dummies" (script command genr unitdum). This command creates a set of dummy variables identifying the cross-sectional units. The variable du_1 will have value 1 in each row corresponding to a unit 1 observation, 0 otherwise; du_2 will have value 1 in each row corresponding to a unit 2 observation, 0 otherwise; and so on.

2. "time dummies" (script command genr timedum). This command creates a set of dummy variables identifying the periods. The variable dt_1 will have value 1 in each row corresponding to a period 1 observation, 0 otherwise; dt_2 will have value 1 in each row corresponding to a period 2 observation, 0 otherwise; and so on.

If a panel data set has the YEAR of the observation entered as one of the variables, you can create a periodic dummy to pick out a particular year, e.g.

    genr dum = (YEAR == 1960)

You can also create periodic dummy variables using the modulus operator, %. For instance, to create a dummy with value 1 for the first observation and every thirtieth observation thereafter, 0 otherwise, do

    genr index
    series dum = ((index - 1) % 30) == 0

Lags, differences, trends

If the time periods are evenly spaced you may want to use lagged values of variables in a panel regression (but see also chapter 24); you may also wish to construct first differences of variables of interest.

Once a dataset is identified as a panel, gretl will handle the generation of such variables correctly. For example, the command

    genr x1_1 = x1(-1)

will create a variable that contains the first lag of x1 where available, and the missing value code where the lag is not available (e.g. at the start of the time series for each group). When you run a regression using such variables, the program will automatically skip the missing observations.

When a panel data set has a fairly substantial time dimension, you may wish to include a trend in the analysis. The command genr time creates a variable named time which runs from 1 to T for each unit, where T is the length of the time-series dimension of the panel. If you want to create an index that runs consecutively from 1 to m x T, where m is the number of units in the panel, use genr index.

Basic statistics by unit

gretl contains functions which can be used to generate basic descriptive statistics for a given variable, on a per-unit basis; these are pnobs (number of valid cases), pmin and pmax (minimum and maximum), and pmean and psd (mean and standard deviation).

As a brief illustration, suppose we have a panel data set comprising 8 time-series observations on each of N units or groups. Then the command

    series pmx = pmean(x)

creates a series of this form: the first 8 values (corresponding to unit 1) contain the mean of x for unit 1, the next 8 values contain the mean for unit 2, and so on. The psd function works in a similar manner. The sample standard deviation for group i is computed as

    si = sqrt( sum((x - xbar_i)^2) / (Ti - 1) )

where Ti denotes the number of valid observations on x for the given unit, xbar_i denotes the group mean, and the summation is across valid observations for the group. If Ti < 2, however, the standard deviation is recorded as 0.

One particular use of psd may be worth noting. If you want to form a sub-sample of a panel that contains only those units for which the variable x is time-varying, you can either use

    smpl pmin(x) < pmax(x) --restrict

or

    smpl psd(x) > 0 --restrict
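As a brief sketch tying these functions together, using the Arellano-Bond dataset shipped with gretl (abdata.gdt, introduced in chapter 5 below), on the assumption that it contains a capital-stock series named k:

    open abdata.gdt
    series mk = pmean(k)         # per-firm mean of k, repeated within each firm
    series sk = psd(k)           # per-firm standard deviation of k
    smpl psd(k) > 0 --restrict   # keep only firms for which k is time-varying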
4.6 Missing data values

Representation and handling

Missing values are represented internally as NaN ("not a number"), as defined in the IEEE 754 floating-point standard. In a native-format data file they should be represented as NA. When importing CSV data gretl accepts several common representations of missing values, including -999, the string NA (in upper or lower case), a single dot, or simply a blank cell. Blank cells should, of course, be properly delimited, e.g. 120.6,,5.38, in which the middle value is presumed missing.

As for handling of missing values in the course of statistical analysis, gretl does the following:

- In calculating descriptive statistics (mean, standard deviation, etc.) under the summary command, missing values are simply skipped and the sample size adjusted appropriately.

- In running regressions gretl first adjusts the beginning and end of the sample range, truncating the sample if need be. (Missing values at the beginning of the sample are common in time series work due to the inclusion of lags, first differences and so on; missing values at the end of the range are not uncommon due to differential updating of series, and possibly the inclusion of leads.)

If gretl detects any missing values "inside" the (possibly truncated) sample range for a regression, the result depends on the character of the dataset and the estimator chosen. In many cases, the program will automatically skip the missing observations when calculating the regression results. In this situation a message is printed stating how many observations were dropped. On the other hand, the skipping of missing observations is not supported for all procedures: exceptions include all autoregressive estimators, system estimators such as SUR, and nonlinear least squares. In the case of panel data, the skipping of missing observations is supported only if their omission leaves a balanced panel. If missing observations are found in cases where they are not supported, gretl gives an error message and refuses to produce estimates.

Manipulating missing values

Some special functions are available for the handling of missing values. The Boolean function missing takes the name of a variable as its single argument; it returns a series with value 1 for each observation at which the given variable has a missing value, and value 0 otherwise (that is, if the given variable has a valid value at that observation). The function ok is complementary to missing; it is just a shorthand for !missing (where ! is the Boolean NOT operator). For example, one can count the missing values for variable x using

    scalar nmiss_x = sum(missing(x))

The function zeromiss, which again takes a single series as its argument, returns a series where all zero values are set to the missing code. This should be used with caution (one does not want to confuse missing values and zeros), but it can be useful in some contexts. For example, one can determine the first valid observation for a variable x using

    genr time
    scalar x0 = min(zeromiss(time * ok(x)))

The function misszero does the opposite of zeromiss, that is, it converts all missing values to zero.

If missing values get involved in calculations, they propagate according to the IEEE rules: notably, if one of the operands to an arithmetical operation is a NaN, the result will also be NaN.
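A small sketch of these points, where x and y stand for arbitrary series in the current dataset:

    series z = x + y         # an NA in x or y yields NA in z
    scalar nz = sum(ok(z))   # count the valid observations on z
    series z0 = misszero(z)  # NAs replaced by zero; use with caution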
4.7 Maximum size of data sets

Basically, the size of data sets (both the number of variables and the number of observations per variable) is limited only by the characteristics of your computer. Gretl allocates memory dynamically, and will ask the operating system for as much memory as your data require. Obviously, then, you are ultimately limited by the size of RAM.

Aside from the multiple-precision OLS option, gretl uses double-precision floating-point numbers throughout. The size of such numbers in bytes depends on the computer platform, but is typically eight. To give a rough notion of magnitudes, suppose we have a data set with 10,000 observations on 500 variables. That's 5 million floating-point numbers, or 40 million bytes. If we define the megabyte (MB) as 1024 x 1024 bytes, as is standard in talking about RAM, that's slightly over 38 MB. The program needs additional memory for workspace, but even so, handling a data set of this size should be quite feasible on a current PC, which at the time of writing is likely to have at least 256 MB of RAM.

If RAM is not an issue, there is one further limitation on data size (though it's very unlikely to be a binding constraint). That is, variables and observations are indexed by signed integers, and on a typical PC these will be 32-bit values, capable of representing a maximum positive value of 2^31 - 1 = 2,147,483,647.

The limits mentioned above apply to gretl's "native" functionality. There are tighter limits with regard to two third-party programs that are available as add-ons to gretl for certain sorts of time-series analysis, including seasonal adjustment, namely TRAMO/SEATS and X-12-ARIMA. These programs employ a fixed-size memory allocation, and can't handle series of more than 600 observations.

4.8 Data file collections

If you're using gretl in a teaching context you may be interested in adding a collection of data files and/or scripts that relate specifically to your course, in such a way that students can browse and access them easily.

There are three ways to access such collections of files:

- For data files: select the menu item "File > Open data > Sample file", or click on the folder icon on the gretl toolbar.
- For script files: select the menu item "File > Script files > Example scripts".

When a user selects one of the items:

- The data or script files included in the gretl distribution are automatically shown (this includes files relating to Ramanathan's Introductory Econometrics and Greene's Econometric Analysis).
- The program looks for certain known collections of data files available as optional extras, for instance the datafiles from various econometrics textbooks (Davidson and MacKinnon, Gujarati, Stock and Watson, Verbeek, Wooldridge) and the Penn World Table (PWT 5.6). See the data page at the gretl website for details and updates on available collections. If the additional files are found, they are added to the selection windows.
- The program then searches for valid file collections (not necessarily known in advance) in these places: the "system" data directory, the system "script" directory, the user directory, and all first-level subdirectories of these. For reference, typical values for these directories are shown in Table 4.1. (Note that PERSONAL is a placeholder that is expanded by Windows, corresponding to "My Documents" on English-language systems.)

                        Linux                       MS Windows
    system data dir     /usr/share/gretl/data       c:\Program Files\gretl\data
    system script dir   /usr/share/gretl/scripts    c:\Program Files\gretl\scripts
    user dir            $HOME/gretl                 PERSONAL\gretl

    Table 4.1: Typical locations for file collections

Any valid collections will be added to the selection windows. So what constitutes a valid file collection? This comprises either a set of data files in gretl XML format (with the .gdt suffix) or a set of script files containing gretl commands (with .inp suffix), in each case accompanied by a "master file" or catalog. The gretl distribution contains several example catalog files, for instance the file descriptions in the misc sub-directory of the gretl data directory and ps_descriptions in the misc sub-directory of the scripts directory.

If you are adding your own collection, data catalogs should be named descriptions and script catalogs should be named ps_descriptions.
In each case the catalog should be placed (along with the associated data or script files) in its own specific sub-directory (e.g. /usr/share/gretl/data/mydata or c:\userdata\gretl\data\mydata).

The catalog files are plain text; if they contain non-ASCII characters they must be encoded as UTF-8. The syntax of such files is straightforward. Here, for example, are the first few lines of gretl's "misc" data catalog:

    # Gretl: various illustrative datafiles
    "arma", "artificial data for ARMA script example"
    "ects_nls", "Nonlinear least squares example"
    "hamilton", "Prices and exchange rate, US and Italy"

The first line, which must start with a hash mark, contains a short name (here "Gretl") which will appear as the label for this collection's tab in the data browser window, followed by a colon, followed by an optional short description of the collection.

Subsequent lines contain two elements, separated by a comma and wrapped in double quotation marks. The first is a datafile name (leave off the .gdt suffix here) and the second is a short description of the content of that datafile. There should be one such line for each datafile in the collection.

A script catalog file looks very similar, except that there are three fields in the file lines: a filename (without its .inp suffix), a brief description of the econometric point illustrated in the script, and a brief indication of the nature of the data used. Again, here are the first few lines of the supplied "misc" script catalog:

    # Gretl: various sample scripts
    "arma", "ARMA modeling", "artificial data"
    "ects_nls", "Nonlinear least squares (Davidson)", "artificial data"
    "leverage", "Influential observations", "artificial data"
    "longley", "Multicollinearity", "US employment"

If you want to make your own data collection available to users, these are the steps:

1. Assemble the data, in whatever format is convenient.

2. Convert the data to gretl format and save as .gdt files. It is probably easiest to convert the data by importing them into the program from plain text, CSV, or a spreadsheet format (MS Excel or Gnumeric), then saving them. You may wish to add descriptions of the individual variables (the "Variable > Edit attributes" menu item) and add information on the source of the data (the "Data > Edit info" menu item).

3. Write a descriptions file for the collection using a text editor.

4. Put the datafiles plus the descriptions file in a subdirectory of the gretl data directory (or user directory).

5. If the collection is to be distributed to other people, package the data files and catalog in some suitable manner, e.g. as a zipfile.

If you assemble such a collection, and the data are not proprietary, we would encourage you to submit the collection for packaging as a gretl optional extra.

4.9 Assembling data from multiple sources

In many contexts researchers need to bring together data from multiple source files, and in some cases these sources are not organized such that the data can simply be "stuck together" by appending rows or columns to a base dataset. In gretl, the join command can be used for this purpose; this command is discussed in detail in chapter 7.
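To give a flavor of join ahead of chapter 7, here is a minimal hypothetical sketch: base.gdt and extra.csv are invented files which share a key column named id, and x is a series to be imported from the latter.

    open base.gdt
    join extra.csv x --ikey=id --okey=id   # import x, matching rows by id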
Chapter 5
Subsampling a dataset

5.1 Introduction

Some subtle issues can arise here; this chapter attempts to explain the issues.

A sub-sample may be defined in relation to a full dataset in two different ways: we will refer to these as "setting" the sample and "restricting" the sample; these methods are discussed in sections 5.2 and 5.3 respectively. In addition, section 5.4 discusses some special issues relating to panel data, and section 5.5 covers resampling with replacement, which is useful in the context of bootstrapping test statistics.

The following discussion focuses on the command-line approach. But you can also invoke the methods outlined here via the items under the "Sample" menu in the GUI program.

5.2 Setting the sample

By "setting" the sample we mean defining a sub-sample simply by means of adjusting the starting and/or ending point of the current sample range. This is likely to be most relevant for time-series data. For example, one has quarterly data from 1960:1 to 2003:4, and one wants to run a regression using only data from the 1970s. A suitable command is then

    smpl 1970:1 1979:4

Or one wishes to set aside a block of observations at the end of the data period for out-of-sample forecasting. In that case one might do

    smpl ; 2000:4

where the semicolon is shorthand for "leave the starting observation unchanged". (The semicolon may also be used in place of the second parameter, to mean that the ending observation should be unchanged.) By "unchanged" here, we mean unchanged relative to the last smpl setting, or relative to the full dataset if no sub-sample has been defined up to this point. For example, after

    smpl 1970:1 2003:4
    smpl ; 2000:4

the sample range will be 1970:1 to 2000:4.

An incremental or relative form of setting the sample range is also supported. In this case a relative offset should be given, in the form of a signed integer (or a semicolon to indicate no change), for both the starting and ending point. For example

    smpl +1 ;

will advance the starting observation by one while preserving the ending observation, and

    smpl +2 -1

will both advance the starting observation by two and retard the ending observation by one.

An important feature of "setting" the sample as described above is that it necessarily results in the selection of a subset of observations that are contiguous in the full dataset. The structure of the dataset is therefore unaffected (for example, if it is a quarterly time series before setting the sample, it remains a quarterly time series afterwards).

5.3 Restricting the sample

By "restricting" the sample we mean selecting observations on the basis of some Boolean (logical) criterion, or by means of a random number generator. This is likely to be most relevant for cross-sectional or panel data.

Suppose we have data on a cross-section of individuals, recording their gender, income and other characteristics. We wish to select for analysis only the women. If we have a male dummy variable with value 1 for men and 0 for women we could do

    smpl male == 0 --restrict

to this effect. Or suppose we want to restrict the sample to respondents with incomes over 50000. Then we could use

    smpl income > 50000 --restrict

A question arises: if we issue the two commands above in sequence, what do we end up with in our sub-sample, all cases with income over 50000, or just women with income over 50000? By default, the answer is the latter: women with income over 50000. The second restriction augments the first; or in other words, the final restriction is the logical product of the new restriction and any restriction that is already in place. If you want a new restriction to replace any existing restrictions you can first recreate the full dataset using

    smpl full

Alternatively, you can add the replace option to the smpl command:

    smpl income > 50000 --restrict --replace

This option has the effect of automatically re-establishing the full dataset before applying the new restriction.

Unlike a simple "setting" of the sample, "restricting" the sample may result in selection of non-contiguous observations from the full data set. It may therefore change the structure of the data set.
This can be seen in the case of panel data. Say we have a panel of five firms (indexed by the variable firm) observed in each of several years (identified by the variable year). Then the restriction

    smpl year == 1995 --restrict

produces a dataset that is not a panel, but a cross-section for the year 1995. Similarly,

    smpl firm == 3 --restrict

produces a time-series dataset for firm number 3.

For these reasons (possible non-contiguity in the observations, possible change in the structure of the data set), gretl acts differently when you "restrict" the sample as opposed to simply "setting" it. In the case of setting, the program merely records the starting and ending observations and uses these as parameters to the various commands calling for the estimation of models, the computation of statistics, and so on. In the case of restriction, the program makes a reduced copy of the dataset and by default treats this reduced copy as a simple, undated cross-section (but see the further discussion of panel data in section 5.4).

If you wish to re-impose a time-series interpretation of the reduced dataset you can do so using the setobs command, or the GUI menu item "Data > Dataset structure".

The fact that "restricting" the sample results in the creation of a reduced copy of the original dataset may raise an issue when the dataset is very large. With such a dataset in memory, the creation of a copy may lead to a situation where the computer runs low on memory for calculating regression results. You can work around this as follows:

1. Open the full data set, and impose the sample restriction.
2. Save a copy of the reduced data set to disk.
3. Close the full dataset and open the reduced one.
4. Proceed with your analysis.

Random sub-sampling

Besides restricting the sample on some deterministic criterion, it may sometimes be useful (when working with very large datasets, or perhaps to study the properties of an estimator) to draw a random sub-sample from the full dataset. This can be done using, for example,

    smpl 100 --random

to select 100 cases. If you want the sample to be reproducible, you should set the seed for the random number generator first, using the set command. This sort of sampling falls under the "restriction" category: a reduced copy of the dataset is made.
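For example, a reproducible random draw of 100 cases might look like this (the seed value is arbitrary; any fixed seed makes the draw repeatable):

    set seed 20230901
    smpl 100 --random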
setobs country cstr panelgroups gretl creates a stringvalued series named country with group names taken from the variable cstr Then to include only Italy and Spain you could do smpl countryItaly countrySpain restrict or to exclude one country smpl countryIreland restrict Subsampling a panel in the time dimension can be done via restrict For example the ArellanoBond dataset contains a variable named YEAR that records the year of the observations and if one wanted to omit the first two years of data one could do smpl YEAR 1978 restrict If a dataset does not already incude a suitable variable for this purpose one can use the command genr time to create a simple 1based time index Another way to subsample in the time dimension of a panel starts with a specification of time via the setobs command as in setobs 1 1976 paneltime This tells gretl that paneltime is annual frequency 1 starting in 1976 In fact this is already done for abdatagdt Then to restrict the sample range to 19791982 you can do smpl 1979 1982 time Note that if you apply a sample restriction that just selects certain units firms countries or what ever or selects certain contiguous timeperiodssuch that n 1 T 1 and the timeseries length is still the same across all included unitsyour subsample will still be interpreted by gretl as a panel Unbalancing restrictions In some cases one wants to subsample according to a criterion that cuts across the grain of a panel dataset For instance suppose you have a micro dataset with thousands of individuals observed over several years and you want to restrict the sample to observations on employed women If we simply extracted from the total nT rows of the dataset those that pertain to women who were employed at time t t 1 T we would likely end up with a dataset that doesnt count as a panel in gretl because the specific timeseries length Ti would differ across individuals In some contexts it might be OK that gretl doesnt take your subsample to be a panel but if you want to apply panelspecific methods this is a problem You can solve it by giving the preservepanel option with smpl For example supposing your dataset contained dummy variables gender with the value 1 coding for women and employed you could do smpl gender1 employed1 restrict preservepanel What exactly does this do Well lets say the years of your data are 2000 2005 and 2010 and that some women were employed in all of those years giving a maximum Ti value of 3 But in dividual 526 is a woman who was employed only in the year 2000 Ti 1 The effect of the preservepanel option is then to insert padding rows of NAs for the years 2005 and 2010 for individual 526 and similarly for all individuals with 0 Ti 3 Your subsample then qualifies as a panel Chapter 5 Subsampling a dataset 35 55 Resampling and bootstrapping Given an original data series x the command series xr resamplex creates a new series each of whose elements is drawn at random from the elements of x If the original series has 100 observations each element of x is selected with probability 1100 at each drawing Thus the effect is to shuffle the elements of x with the twist that each element of x may appear more than once or not at all in xr The primary use of this function is in the construction of bootstrap confidence intervals or pvalues Here is a simple example Suppose we estimate a simple regression of y on x via OLS and find that the slope coefficient has a reported tratio of t0 with ν degrees of freedom A twotailed pvalue for the null hypothesis that the slope parameter equals 
5.5 Resampling and bootstrapping

Given an original data series x, the command

    series xr = resample(x)

creates a new series each of whose elements is drawn at random from the elements of x. If the original series has 100 observations, each element of x is selected with probability 1/100 at each drawing. Thus the effect is to "shuffle" the elements of x, with the twist that each element of x may appear more than once, or not at all, in xr.

The primary use of this function is in the construction of bootstrap confidence intervals or p-values. Here is a simple example. Suppose we estimate a simple regression of y on x via OLS, and find that the slope coefficient has a reported t-ratio of t0, with ν degrees of freedom. A two-tailed p-value for the null hypothesis that the slope parameter equals zero can then be found using the t(ν) distribution. Depending on the context, however, we may doubt whether the ratio of coefficient to standard error truly follows the t(ν) distribution. In that case we could derive a bootstrap p-value as shown in Listing 5.1.

Under the null hypothesis that the slope with respect to x is zero, y is simply equal to its mean plus an error term. We simulate y by resampling the residuals from the initial OLS and re-estimate the model. We repeat this procedure a large number of times, and record the number of cases where the absolute value of the t-ratio is greater than t0: the proportion of such cases is our bootstrap p-value. For a good discussion of simulation-based tests and bootstrapping, see Davidson and MacKinnon (2004, chapter 4); Davidson and Flachaire (2001) is also instructive.

Listing 5.1: Calculation of bootstrap p-value

    nulldata 50
    set seed 54321
    series x = normal()
    series y = 10 + x + 2*normal()
    ols y 0 x
    # the reported t-stat
    t0 = abs($coeff[2] / $stderr[2])
    # save the residuals
    series u = $uhat
    scalar ybar = mean(y)
    # number of replications for bootstrap
    scalar B = 1000
    scalar tcount = 0
    series ysim
    loop B
        # generate simulated y by resampling
        ysim = ybar + resample(u)
        ols ysim 0 x --quiet
        scalar tsim = abs($coeff[2] / $stderr[2])
        tcount += (tsim > t0)
    endloop
    printf "proportion of cases with |t| > %.3f: %g\n", t0, tcount / B
Chapter 6
Graphs and plots

6.1 Gnuplot graphs

A separate program, gnuplot, is called to generate graphs. Gnuplot is a very full-featured graphing program with myriad options. It is available from www.gnuplot.info (but note that a suitable copy of gnuplot is bundled with the packaged versions of gretl for MS Windows and Mac OS X). Gretl gives you direct access (via a graphical interface) to a subset of gnuplot's options and it tries to choose sensible values for you; it also allows you to take complete control over graph details if you wish.

With a graph displayed, you can click on the graph window for a pop-up menu with the following options:

- Save as PNG: Save the graph in Portable Network Graphics format (the same format that you see on screen).
- Save as postscript: Save in encapsulated postscript (EPS) format.
- Save as Windows metafile: Save in Enhanced Metafile (EMF) format.
- Save to session as icon: The graph will appear in iconic form when you select "Icon view" from the View menu.
- Zoom: Lets you select an area within the graph for closer inspection (not available for all graphs).
- Print: (Current GTK or MS Windows only) lets you print the graph directly.
- Copy to clipboard: (MS Windows only) lets you paste the graph into Windows applications such as MS Word.
- Edit: Opens a controller for the plot which lets you adjust many aspects of its appearance.
- Close: Closes the graph window.

Displaying data labels

For simple X-Y scatter plots, some further options are available if the dataset includes "case markers" (that is, labels identifying each observation).[1] With a scatter plot displayed, when you move the mouse pointer over a data point its label is shown on the graph. By default these labels are transient: they do not appear in the printed or copied version of the graph. They can be removed by selecting "Clear data labels" from the graph pop-up menu. If you want the labels to be affixed permanently (so they will show up when the graph is printed or copied), select the option "Freeze data labels" from the pop-up menu; "Clear data labels" cancels this operation. The other label-related option, "All data labels", requests that case markers be shown for all observations. At present the display of case markers is disabled for graphs containing more than 250 data points.

[1] For an example of such a dataset, see the Ramanathan file data4-10: this contains data on private school enrollment for the 50 states of the USA plus Washington, DC; the case markers are the two-letter codes for the states.

GUI plot editor

Selecting the Edit option in the graph pop-up menu opens an editing dialog box, shown in Figure 6.1. Notice that there are several tabs, allowing you to adjust many aspects of a graph's appearance: font, title, axis scaling, line colors and types, and so on. You can also add lines or descriptive labels to a graph (under the Lines and Labels tabs). The "Apply" button applies your changes without closing the editor; "OK" applies the changes and closes the dialog.

[Figure 6.1: gretl's gnuplot controller]

Publication-quality graphics: advanced options

The GUI plot editor has two limitations. First, it cannot represent all the myriad options that gnuplot offers. Users who are sufficiently familiar with gnuplot to know what they're missing in the plot editor presumably don't need much help from gretl, so long as they can get hold of the gnuplot command file that gretl has put together. Second, even if the plot editor meets your needs in terms of fine-tuning the graph you see on screen, a few details may need further work in order to get optimal results for publication.

Either way, the first step in advanced tweaking of a graph is to get access to the graph command file.

- In the graph display window, right-click and choose "Save to session as icon".
- If it's not already open, open the icon view window, either via the menu item "View > Icon view" or by clicking the "session icon view" button on the main-window toolbar.
- Right-click on the icon representing the newly added graph and select "Edit plot commands" from the pop-up menu.

You get a window displaying the plot file (Figure 6.2).

[Figure 6.2: Plot commands editor]

Here are the basic things you can do in this window. Obviously, you can edit the file you just opened. You can also send it for processing by gnuplot, by clicking the "Execute" (cogwheel) icon in the toolbar. Or you can use the "Save as" button to save a copy for editing and processing as you wish.

Unless you're a gnuplot expert, most likely you'll only need to edit a couple of lines at the top of the file, specifying a driver (plus options) and an output file. We offer here a brief summary of some points that may be useful.

First, gnuplot's output mode is set via the command set term, followed by the name of a supported driver ("terminal" in gnuplot parlance) plus various possible options. (The top line in the plot commands window shows the set term line that gretl used to make a PNG file, commented out.) The graphic formats that are most suitable for publication are PDF and EPS. These are supported by the gnuplot term types pdf, pdfcairo and postscript (with the eps option). The pdfcairo driver has the virtue that it behaves in a very similar manner to the PNG one, the output of which you see on screen. This is provided by the version of gnuplot that is included in the gretl packages for MS Windows and Mac OS X; if you're on Linux it may or may not be supported. If pdfcairo is not available, the pdf terminal may be available; the postscript terminal is almost certainly available.

Besides selecting a term type, if you want to get gnuplot to write the actual output file you need to append a set output line giving a filename. Here are a few examples of the first two lines you might type in the window editing your plot commands. We'll make these more realistic shortly.

    set term pdfcairo
    set output 'mygraph.pdf'

    set term pdf
    set output 'mygraph.pdf'

    set term postscript eps
    set output 'mygraph.eps'
There are a couple of things worth remarking here. First, you may want to adjust the size of the graph, and second you may want to change the font. The default sizes produced by the above drivers are 5 inches by 3 inches for pdfcairo and pdf, and 5 inches by 3.5 inches for postscript eps. In each case you can change this by giving a size specification, which takes the form XX,YY (examples below).

You may ask, why bother changing the size in the gnuplot command file? After all, PDF and EPS are both vector formats, so the graphs can be scaled at will. True, but a uniform scaling will also affect the font size, which may end up looking wrong. You can get optimal results by experimenting with the font and size options to gnuplot's set term command. Here are some examples (comments follow below).

    # pdfcairo, regular size, slightly amended
    set term pdfcairo font "Sans,6" size 5in,3.5in
    # or small size
    set term pdfcairo font "Sans,5" size 3in,2in
    # pdf, regular size, slightly amended
    set term pdf font "Helvetica,8" size 5in,3.5in
    # or small
    set term pdf font "Helvetica,6" size 3in,2in
    # postscript, regular
    set term post eps solid font "Helvetica,16"
    # or small
    set term post eps solid font "Helvetica,12" size 3in,2in

On the first line we set a sans serif font for pdfcairo at a suitable size for a 5 x 3.5 inch plot (which you may find looks better than the rather "letterboxy" default of 5 x 3). And on the second we illustrate what you might do to get a smaller 3 x 2 inch plot. You can specify the plot size in centimeters if you prefer, as in

    set term pdfcairo font "Sans,6" size 6cm,4cm

We then repeat the exercise for the pdf terminal. Notice that here we're specifying one of the 35 standard PostScript fonts, namely Helvetica. Unlike pdfcairo, the plain pdf driver is unlikely to be able to find fonts other than these.

In the third pair of lines we illustrate options for the postscript driver (which, as you see, can be abbreviated as post). Note that here we have added the option solid. Unlike most other drivers, this one uses dashed lines unless you specify the solid option. Also note that we've (apparently) specified a much larger font in this case. That's because the eps option in effect tells the postscript driver to work at half-size (among other things), so we need to double the font size.

Table 6.1 summarizes the basics for the three drivers we have mentioned.

    Terminal   default size (inches)   suggested font
    pdfcairo   5 x 3                   Sans,6
    pdf        5 x 3                   Helvetica,8
    post eps   5 x 3.5                 Helvetica,16

    Table 6.1: Drivers for publication-quality graphics

To find out more about gnuplot, visit www.gnuplot.info. This site has documentation for the current version of the program in various formats.

Additional tips

To be written. Line widths, enhanced text. Show a "before and after" example.

6.2 Plotting graphs from scripts

When working with scripts, you may want to have a graph shown onto your display or saved into a file. In fact, if in your usual workflow you find yourself creating similar graphs over and over again, you might want to consider the option of writing a script which automates this process for you. gretl gives you two main tools for doing this: one is a command called gnuplot, whose main use is to create standard plots quickly; the other one is the plot command block, which has a more elaborate syntax but offers you more control on output. A minimal sketch of a plot block is shown below.
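As a foretaste of the plot block (its full syntax is given in the Gretl Command Reference), here is a minimal sketch; y1 is a hypothetical series, and the literal line passes a raw gnuplot command through unchanged:

    plot y1
        options time-series with-lines
        literal set title "A plot block example"
    end plot --output=display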
6.2 Plotting graphs from scripts

When working with scripts, you may want to have a graph shown on your display or saved into a file. In fact, if in your usual workflow you find yourself creating similar graphs over and over again, you might want to consider the option of writing a script which automates this process for you. Gretl gives you two main tools for doing this: one is a command called gnuplot, whose main use is to create standard plots quickly; the other one is the plot command block, which has a more elaborate syntax but offers you more control on output.

The gnuplot command

The gnuplot command is described at length in the Gretl Command Reference and the online help system. Here we just summarize its main features: basically, it consists of the gnuplot keyword, followed by a list of items telling the command what you want plotted, and a list of options telling it how you want it plotted. For example, the line

  gnuplot y1 y2 x

will give you a basic XY plot of the two series y1 and y2 on the vertical axis versus the series x on the horizontal axis. In general, the arguments to the gnuplot command are a list of series, the last of which goes on the x-axis, while all the other ones go onto the y-axis. By default, the gnuplot command gives you a scatterplot. If you just have one variable on the y-axis, then gretl will also draw the OLS interpolation, if the fit is good enough. (The technical condition for this is that the two-tailed p-value for the slope coefficient should be under 0.10.)

Several aspects of the behavior described above can be modified. You do this by appending options to the command. Most options can be broadly grouped in three categories:

1. Plot styles: we support points (the default choice), lines, lines and points together, and impulses (vertical lines).

2. Algorithm for the fitted line: here you can choose between linear, quadratic and cubic interpolation, but also more exotic choices such as semi-log, inverse or loess (non-parametric). Of course, you can also turn this feature off.

3. Input and output: you can choose whether you want your graph on your computer screen (and possibly use the built-in graphical widget to further customize it -- see above, page 37), or rather save it to a file. We support several graphical formats, among which PNG and PDF, to make it easy to incorporate your plots into text documents.

The following script uses the AWM dataset to exemplify some traditional plots in macroeconomics:

  open AWM.gdt --quiet

  # consumption and income, different styles
  gnuplot PCR YER
  gnuplot PCR YER --output=display
  gnuplot PCR YER --output=display --time-series
  gnuplot PCR YER --output=display --time-series --with-lines

  # Phillips curve, different fitted lines
  gnuplot INFQ URX --output=display
  gnuplot INFQ URX --fit=none --output=display
  gnuplot INFQ URX --fit=inverse --output=display
  gnuplot INFQ URX --fit=loess --output=display

These examples use variables from the area-wide model dataset by the European Central Bank (ECB), which is shipped with gretl in the AWM.gdt file: PCR is aggregate private real consumption and YER is real GDP. The first command line above thus plots consumption against income, as a kind of Keynesian consumption function. More precisely, it produces a simple scatterplot with an automatic linear fitted line. If this is executed in the gretl console, the plot will be directly shown in a new window; but if this line is contained in a script, then instead a file with the plot commands will be saved for later execution. The second example line changes this behavior for a script command and forces the plot to be shown directly.

The third line instead asks for a plot of the two variables as two separate curves against time (on the x-axis). Each observation point is drawn separately with a certain symbol, determined by gnuplot defaults. If you add the option --with-lines, the points will be connected with a continuous line and the symbols omitted.

The second set of example lines above demonstrates how the fitted line in the scatterplot can be controlled from gretl's side. The option --fit=none overrides gnuplot's default to draw a line if it deems the fit to be good enough. The effect of --fit=inverse is to consider the variable on the y-axis as a function of 1/X instead of X, and draw the corresponding hyperbolic branch. For the workings of a loess fit (locally-weighted polynomial regression), please refer to the documentation of the loess function. For more detail, consult the Gretl Command Reference.
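If the aim is a file rather than on-screen display, the --output option also accepts a filename, with the format taken from the extension. A minimal sketch reusing the Phillips-curve example above (the filenames are arbitrary):

  # save the loess-fitted Phillips curve to PDF and PNG
  gnuplot INFQ URX --fit=loess --output=phillips.pdf
  gnuplot INFQ URX --fit=loess --output=phillips.png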
The plot command block

The plot environment is a way to pass information to gnuplot in a more structured way, so that customization of basic plots becomes easier. It has the following characteristics:

The block starts with the plot keyword, followed by a required parameter: the name of a list, a single series or a matrix. This parameter specifies the data to be plotted. The starting line may be prefixed with the "savename <-" apparatus to save a plot as an icon in the GUI program. The block ends with end plot.

Inside the block you have zero or more lines of these types, identified by an initial keyword:

  option    specify a single option (details below)
  options   specify multiple options on a single line; if more than one option
            is given on a line, the options should be separated by spaces
  literal   a command to be passed to gnuplot literally
  printf    a printf statement whose result will be passed to gnuplot
            literally; this allows the use of string variables without having
            to resort to @-style string substitution

The options available are basically those of the current gnuplot command, but with a few differences. For one thing, you don't need the leading double-dash in an "option" (or "options") line. Besides that:

You can't use the option --matrix=whatever with plot: that possibility is handled by providing the name of a matrix on the initial plot line.

The --input=filename option is not supported; use gnuplot for the case where you're supplying the entire plot specification yourself.

The several options pertaining to the presence and type of a fitted line are replaced in plot by a single option fit, which requires a parameter. Supported values for the parameter are: none, linear, quadratic, cubic, inverse, semilog and loess. Example:

  option fit=quadratic

As with gnuplot, the default is to show a linear fit in an XY scatter if it's significant at the 10 percent level.

Here's a simple example: the plot specification from the "bandplot" package, which shows how to achieve the same result via the gnuplot command and a plot block, respectively -- the latter occupies a few more lines but is clearer:

  gnuplot 1 2 3 4 --with-lines --matrix=plotmat --fit=none --output=display \
    { set linetype 3 lc rgb "#0000ff"; set title "@title"; set nokey; set xlabel "@xname"; }

versus

  plot plotmat
      options with-lines fit=none
      literal set linetype 3 lc rgb "#0000ff"
      literal set nokey
      printf "set title \"%s\"", title
      printf "set xlabel \"%s\"", xname
  end plot --output=display

Note that --output=display is appended to end plot. Also note that if you give a matrix to plot, it's assumed you want to plot all the columns. In addition, if you give a single series and the dataset is time series, it's assumed you want a time-series plot.
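As a minimal sketch of the latter convention, here is a single time series with one customization (PCR from the AWM dataset used above; the title is our own choice):

  # time-series plot is implied, since PCR is a series in a
  # time-series dataset
  plot PCR
      option with-lines
      literal set title "Private consumption, euro area"
  end plot --output=display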
Example: plotting a histogram together with a density

Listing 6.1 contains a slightly more elaborate example: here we load the Mroz example dataset and calculate the log of the individual's wage. Then we match the histogram of a discretized version of the same variable, obtained via the aggregate function, against the theoretical density if the data were Gaussian. There are a few points to note:

The data for the plot are passed through a matrix, in which we set column names via the cnameset function; those names are then automatically used by the plot environment.

In this example we make extensive use of the "literal" construct for refining the plot by passing instructions to gnuplot. The power of gnuplot is impossible to overstate: we encourage you to visit the demo version of gnuplot's website (http://gnuplot.sourceforge.net) and revel in amazement.

In the plot environment you can use all the quantities you have in your script. This is the way we calibrate the histogram width (try setting the scalar k in the script to different values). Note that the printf command has a special meaning inside a plot environment.

The script displays the plot on your screen. If you want to save it to a file instead, replace --output=display at the end with --output=filename.

It's OK to insert comments in the plot environment; actually, it's a rather good idea to comment as much as possible (as always!).

The output from the script is shown in Figure 6.3.

Listing 6.1: Plotting the log wage from the Mroz example dataset

  set verbose off
  open mroz87.gdt --quiet
  series lWW = log(WW)
  scalar m = mean(lWW)
  scalar s = sd(lWW)

  # prepare matrix with data for plot

  # number of valid observations
  scalar n = nobs(lWW)
  # discretize log wage
  scalar k = 4
  series disc_lWW = round(lWW*k)/k
  # get frequencies
  matrix f = aggregate(null, disc_lWW)
  # add density
  matrix phi = dnorm((f[,1] - m)/s) / (s*k)
  # put columns together and add labels
  matrix plotmat = f[,2]/n ~ phi ~ f[,1]
  strings cnames = defarray("frequency", "density", "log wage")
  cnameset(plotmat, cnames)

  # create plot
  plot plotmat
      # move legend
      literal set key outside rmargin
      # set line style
      literal set linetype 2 dashtype 2 linewidth 2
      # set histogram color
      literal set linetype 1 lc rgb "#777777"
      # set histogram style
      literal set style fill solid 0.25 border
      # set histogram width
      printf "set boxwidth %4.2f", 0.5/k
      options with-lines=2 with-boxes=1
  end plot --output=display

[Figure 6.3: Output from Listing 6.1 -- histogram of the log wage with a Gaussian density overlaid]

Listing 6.2: Plotting t densities for varying degrees of freedom

  set verbose off

  function string tplot(scalar m)
      return sprintf("stud(x,%d) title \"t%d\"", m, m)
  end function

  matrix dfs = {2, 4, 16}

  plot
      literal set xrange [-4.5:4.5]
      literal set yrange [0:0.45]
      literal Binv(p,q) = exp(lgamma(p+q) - lgamma(p) - lgamma(q))
      literal stud(x,m) = Binv(0.5*m,0.5)/sqrt(m) * (1.0 + (x*x)/m)**(-0.5*(m+1.0))
      printf "plot %s, %s, %s", tplot(dfs[1]), tplot(dfs[2]), tplot(dfs[3])
  end plot --output=display

Example: plotting Student's t densities

The power of the printf statement in a plot block becomes apparent when used jointly with user-defined functions, as exemplified in Listing 6.2, in which we create a plot showing the density functions of Student's t distribution for three different settings of the degrees of freedom parameter. (Note that plotting a t density is very easy to do from the GUI: just go to the Tools > Distribution graphs menu.)

First we define a user function called tplot, which returns a string with the ingredients to pass to the gnuplot "plot" statement, as a function of a scalar parameter (the degrees of freedom, in our case). Next, this function is used within the plot block to plot the appropriate density. Note that most of the statements to mathematically define the function to plot are outsourced to gnuplot via the literal command.
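For reference, the density coded in those literal lines is the standard Student's t density (our transcription, writing B(p,q) for the beta function; gnuplot's Binv corresponds to 1/B):

  f(x) = \frac{1}{\sqrt{m}\, B(m/2, 1/2)} \left(1 + \frac{x^2}{m}\right)^{-(m+1)/2}

where

  \frac{1}{B(p,q)} = \frac{\Gamma(p+q)}{\Gamma(p)\,\Gamma(q)}

is computed in a numerically safe way as exp(lgamma(p+q) - lgamma(p) - lgamma(q)).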
The output from the script is shown in Figure 6.4.

[Figure 6.4: Output from Listing 6.2 -- t densities for 2, 4 and 16 degrees of freedom]

6.3 Boxplots

These plots (after Tukey and Chambers) display the distribution of a variable. Their shape depends on a few quantities, defined as follows:

  xmin   sample minimum
  Q1     first quartile
  m      median
  xbar   mean
  Q3     third quartile
  xmax   sample maximum
  R      Q3 - Q1, the interquartile range

The central box encloses the middle 50 percent of the data, i.e. goes from Q1 to Q3; therefore, its height equals R. A line is drawn across the box at the median m, and a "+" sign identifies the mean xbar.

The length of the whiskers depends on the presence of outliers. The top whisker extends from the top of the box up to a maximum of 1.5 times the interquartile range, but can be shorter if the sample maximum is lower than that value; that is, it reaches min(xmax, Q3 + 1.5 R). Observations larger than Q3 + 1.5 R, if any, are considered outliers and represented individually via dots. (To give you an intuitive idea: if a variable is normally distributed, the chances of picking an outlier by this definition are slightly below 0.7 percent.) The bottom whisker obeys the same logic, with obvious adjustments. Figure 6.5 provides an example of all this, by using the variable FAMINC from the sample dataset mroz87.

[Figure 6.5: Sample boxplot, with xmin, Q1, m, the mean, Q3, xmax and the outliers marked]

In the case of boxplots with confidence intervals, dotted lines show the limits of an approximate 90 percent confidence interval for the median. This is obtained by the bootstrap method, which can take a while if the data series is very long. For details on constructing boxplots, see the entry for boxplot in the Gretl Command Reference, or use the Help button that appears when you select one of the boxplot items under the menu item "View, Graph specified vars" in the main gretl window.

Factorized boxplots

A nice feature which is quite useful for data visualization is the conditional, or factorized boxplot. This type of plot allows you to examine the distribution of a variable conditional on the value of some discrete factor.

As an example, we'll use one of the datasets supplied with gretl, that is rac3d, which contains an example taken from Cameron and Trivedi (2013) on the health conditions of 5190 people. The script below compares the unconditional (marginal) distribution of the number of illnesses in the past 2 weeks with the distribution of the same variable, conditional on age classes:

  open rac3d.gdt
  # unconditional boxplot
  boxplot ILLNESS --output=display
  # create a discrete variable for age class:
  # 0 = below 20, 1 = between 20 and 39, etc.
  series ageclass = floor(AGE/0.2)
  # conditional boxplot
  boxplot ILLNESS ageclass --factorized --output=display

After running the code above, you should see two graphs similar to Figure 6.6. By comparing the marginal plot to the factorized one, the effect of age on the mean number of illnesses is quite evident: by joining the green crosses you get what is technically known as the conditional mean function, or regression function if you prefer.

[Figure 6.6: Conditional and unconditional distribution of illnesses by ageclass]
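Before leaving boxplots: the --output mechanism works here as for gnuplot, and, if we read the boxplot entry correctly, the bootstrap interval for the median is requested with a --notches flag -- treat that flag as an assumption to be checked against the Command Reference. A sketch:

  open mroz87.gdt --quiet
  # boxplot with median confidence interval (assumed flag)
  boxplot FAMINC --notches --output=display
  # factorized variant written straight to a PNG file
  boxplot FAMINC KL6 --factorized --output=faminc_by_kids.png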
Chapter 7: Joining data sources

7.1 Introduction

Gretl provides two commands for adding data from file to an existing dataset in the program's workspace, namely append and join. The append command, which has been available for a long time, is relatively simple and is described in the Gretl Command Reference. Here we focus on the join command, which is much more flexible and sophisticated. This chapter gives an overview of the functionality of join along with a detailed account of its syntax and options. We provide several toy examples and discuss one real-world case at length.

First, a note on terminology: in the following we use the terms "left-hand" and "inner" to refer to the dataset that is already in memory, and the terms "right-hand" and "outer" to refer to the dataset in the file from which additional data are to be drawn.

Two main features of join are worth emphasizing at the outset:

"Key" variables can be used to match specific observations (rows) in the inner and outer datasets, and this match need not be 1 to 1.

A row filter may be applied to screen out unwanted observations in the outer dataset.

As will be explained below, these features support rather complex concatenation and manipulation of data from different sources.

A further aspect of join should be noted -- one that makes this command particularly useful when dealing with very large data files. That is, when gretl executes a join operation it does not, in general, read into memory the entire content of the right-hand side dataset: only those columns that are actually needed for the operation are read in full. This makes join faster and less demanding of computer memory than the methods available in most other software. On the other hand, gretl's asymmetrical treatment of the "inner" and "outer" datasets in join may require some getting used to, for users of other packages.

7.2 Basic syntax

The minimal invocation of join is

  join filename varname

where filename is the name of a data file and varname is the name of a series to be imported. Only two sorts of data file are supported at present: delimited text files (where the delimiter may be comma, space, tab or semicolon) and "native" gretl data files (gdt or gdtb). A series named varname may already be present in the left-hand dataset, but that is not required. The series to be imported may be numerical or string-valued. For most of the discussion below we assume that just a single series is imported by each join command, but see section 7.7 for an account of multiple imports.

The effect of the minimal version of join is this: gretl looks for a data column labeled varname in the specified file. If such a column is found and the number of observations on the right matches the number of observations in the current sample range on the left, then the values from the right are copied into the relevant range of observations on the left. If varname does not already exist on the left, any observations outside of the current sample are set to NA; if it exists already, then observations outside of the current sample are left unchanged.
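A concrete sketch of the minimal form (the file and series names here are hypothetical):

  open mydata.gdt
  # copy the column named x from extra.csv into the open dataset
  join extra.csv x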
The case where you want to rename a series on import is handled by the --data option. This option has one required argument: the name by which the series is known on the right. At this point we need to explain something about right-hand variable names (column headings).

Right-hand names

We accept on input arbitrary column heading strings, but if these strings do not qualify as valid gretl identifiers they are automatically converted, and in the context of join you must use the converted names. A gretl identifier must start with a letter, contain nothing but (ASCII) letters, digits and the underscore character, and must not exceed 31 characters. The rules used in name conversion are:

1. Skip any leading non-letters.

2. Until the 31-character limit is reached or the input is exhausted: transcribe "legal" characters; skip "illegal" characters apart from spaces; and replace one or more consecutive spaces with an underscore, unless the last character transcribed is an underscore, in which case space is skipped.

In the unlikely event that this policy yields an empty string, we replace the original with "coln", where n is replaced by the 1-based index of the column in question among those used in the join operation.

If you are in doubt regarding the converted name of a given column, the function fixname() can be used as a check: it takes the original string as an argument and returns the converted name. Examples:

  ? eval fixname("valid_identifier")
  valid_identifier
  ? eval fixname("12. Some name")
  Some_name

Returning to the use of the --data option: suppose we have a column headed "12. Some name" on the right and wish to import it as x. After figuring how the right-hand name converts, we can do

  join foo.csv x --data=Some_name

No right-hand names?

Some data files have no column headings; they jump straight into the data (and you need to determine from accompanying documentation what the columns represent). Since gretl expects column headings, you have to take steps to get the importation right. It is generally a good idea to insert a suitable header row into the data file. However, if for some reason that's not practical, you should give the --no-header option, in which case gretl will name the columns on the right as col1, col2, and so on. If you do not do either of these things you will likely lose the first row of data, since gretl will attempt to make variable names out of it, as described above.

7.3 Filtering

Rows from the outer dataset can be filtered using the --filter option. The required parameter for this option is a Boolean condition, that is, an expression which evaluates to non-zero (true: include the row) or zero (false: skip the row) for each of the outer rows. The filter expression may include any of the following terms: up to three right-hand series (under their converted names, as explained above); scalar or string variables defined "on the left"; any of the operators and functions available in gretl (including user-defined functions); and numeric or string constants.

Here are a few simple examples of potentially valid filter options (assuming that the specified right-hand side columns are found):

  # 1. relationship between two right-hand variables
  --filter="x15<=x17"
  # 2. comparison of right-hand variable with constant
  --filter="nkids>2"
  # 3. comparison of string-valued right-hand variable with string constant
  --filter="SEX==\"F\""
  # 4. filter on valid values of a right-hand variable
  --filter=!missing(income)
  # 5. compound condition
  --filter="x < 100 && (x > 0 || y > 0)"

Note that if you are comparing against a string constant (as in example 3 above) it is necessary to put the string in "escaped" double-quotes (each double-quote preceded by a backslash) so the interpreter knows that F is not supposed to be the name of a variable. It is safest to enclose the whole filter expression in double quotes; however, this is not strictly required unless the expression contains spaces or the equals sign.

In general, an error is flagged if a missing value is encountered in a series referenced in a filter expression. This is because the condition then becomes indeterminate: taking example 2 above, if the nkids value is NA on any given row we are not in a position to evaluate the condition nkids>2. However, you can use the missing() function -- or ok(), which is a shorthand for !missing() -- if you need a filter that keys off the missing or non-missing status of a variable.
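Combining this with the minimal syntax of section 7.2, a complete filtered import might look like this (foo.csv and its column names are again hypothetical):

  # import income only for rows pertaining to female respondents
  join foo.csv income --filter="SEX==\"F\""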
7.4 Matching with keys

Things get interesting when we come to key-matching. The purpose of this facility is perhaps best introduced by example. Suppose that (as with many survey and census-based datasets) we have a dataset that is composed of two or more related files, each having a different unit of observation; for example, we have a "persons" data file and a "households" data file. Table 7.1 shows a simple, artificial case. The file people.csv contains a unique identifier for the individuals, pid. The households file, hholds.csv, contains the unique household identifier hid, which is also present in the persons file.

As a first example of join with keys, let's add the household-level variable xh to the persons dataset:

  open people.csv --quiet
  join hholds.csv xh --ikey=hid
  print --byobs

The basic key option is named --ikey; this indicates "inner key", that is, the key variable found in the left-hand (or inner) dataset. By default it is assumed that the right-hand dataset contains a column of the same name, though as we'll see below that assumption can be overridden. The join command above says: find a series named xh in the right-hand dataset and add it to the left-hand one, using the values of hid to match rows. Looking at the data in Table 7.1 we can see how this should work: persons 1 and 2 are both members of household 1, so they should both get values of 1 for xh; persons 3 and 4 are members of household 2, so that xh = 4; and so on. Note that the order in which the key values occur on the right-hand side does not matter. The gretl output from the print command is shown in the lower panel of Table 7.1.

  people.csv              hholds.csv
  pid,hid,gender,age,xp   hid,country,xh
  1,1,M,50,1              1,US,1
  2,1,F,40,2              6,IT,12
  3,2,M,30,3              3,UK,6
  4,2,F,25,2              4,IT,8
  5,3,M,40,3              2,US,4
  6,4,F,35,4              5,IT,10
  7,4,M,70,3
  8,4,F,60,3
  9,5,F,20,4
  10,6,M,40,4

  pid   hid   xh
  1     1     1
  2     1     1
  3     2     4
  4     2     4
  5     3     6
  6     4     8
  7     4     8
  8     4     8
  9     5     10
  10    6     12

  Table 7.1: Two linked CSV data files, and the effect of a join

Note that key variables are treated conceptually as integers: if a specified key contains fractional values, these are truncated.

Two extensions of the basic key mechanism are available.

If the outer dataset contains a relevant key variable, but it goes under a different name from the inner key, you can use the --okey option to specify the outer key. (As with other right-hand names, this does not have to be a valid gretl identifier.) So, for example, if hholds.csv contained the hid information but under the name HHOLD, the join command above could be modified as

  join hholds.csv xh --ikey=hid --okey=HHOLD

If a single key is not sufficient to generate the matches you want, you can specify a double key in the form of two series names separated by a comma; in this case the importation of data is restricted to those rows on which both keys match. The syntax here is, for example,

  join foo.csv x --ikey=key1,key2

Again, the --okey option may be used if the corresponding right-hand columns are named differently. The same number of keys must be given on the left and the right, but when a double key is used and only one of the key names differs on the right, the name that is in common may be omitted (although the comma separator must be retained). For example, the second of the following lines is acceptable shorthand for the first:

  join foo.csv x --ikey=key1,Lkey2 --okey=key1,Rkey2
  join foo.csv x --ikey=key1,Lkey2 --okey=,Rkey2
The number of key-matches

The example shown in Table 7.1 is an instance of a 1 to 1 match: applying the matching criterion produces exactly one value of the variable xh corresponding to each row of the inner dataset. Three other possibilities arise:

Some rows on the left have multiple matches on the right (1 to n matching).

Some rows on the right have multiple matches on the left (n to 1 matching).

Some rows in the inner dataset have no match on the right.

The first case is addressed in detail in the next section; here we discuss the others.

The n to 1 case is straightforward: if a particular key value (or combination of key values) occurs at each of n > 1 observations on the left, but at a single observation on the right, then the right-hand value is entered at each of the matching slots on the left.

The handling of the case where there's no match on the right depends on whether the join operation is adding a new series to the inner dataset or modifying an existing one. If it's a new series, then unmatched rows automatically get NA for the imported data. However, if join is pulling in values for a series already present on the left, only matched rows will be updated. In other words, we do not overwrite an existing value on the left with NA when there's no match on the right.

These defaults may not produce the desired results in every case, but gretl provides the means to modify the effect if need be. We will illustrate with two scenarios.

First, consider adding a new series recording "number of hours worked" when the inner dataset contains individuals and the outer file contains data on jobs. If an individual does not appear in the jobs file, we may want to take her hours worked as implicitly zero rather than NA. In this case gretl's misszero() function can be used to turn NA into 0 in the imported series.

Second, consider updating a series via join, when the outer file is presumed to contain all available updated values, such that "no match" should be taken as an implicit NA. In that case we want the (presumably out-of-date) values on any unmatched rows to be overwritten with NA. Let the series in question be called x (both on the left and the right) and let the common key be called pid. The solution is then

  join update.csv tmpvar --data=x --ikey=pid
  x = tmpvar

As a new variable, tmpvar will get NA for all unmatched rows; we then transcribe its values into x. In a more complicated case one might use the smpl command to limit the sample range before assigning tmpvar to x, or use the conditional assignment operator.

One further point: given some missing values in an imported series, you may want to know whether (a) the NAs were explicitly represented in the outer data file or (b) they arose due to "no match". You can find this out by using a method described in the following section, namely the count variant of the aggregation option: this will give you a series with 0 values for all and only unmatched rows.
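A hedged sketch of that check, reusing the x/pid/update.csv names from the example above:

  # count matches per inner row: 0 means the row had no match at all
  join update.csv nmatch --ikey=pid --aggr=count
  # flag NAs in x that are due to non-matching rather than to the file
  series na_from_nomatch = missing(x) && nmatch == 0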
7.5 Aggregation

In the case of 1 to n matching of rows (n > 1), the user must specify an "aggregation method", that is, a method for mapping from n rows down to one. This is handled by the --aggr option, which requires a single argument from the following list:

  Code       Value returned
  count      count of matches
  avg        mean of matching values
  sum        sum of matching values
  min        minimum of matching values
  max        maximum of matching values
  seq:i      the ith matching value (e.g. seq:2)
  min(aux)   minimum of matching values of auxiliary variable
  max(aux)   maximum of matching values of auxiliary variable

Note that the count aggregation method is special, in that there is no need for a "data series" on the right: the imported series is simply a function of the specified key(s). All the other methods require that "actual data" are found on the right. Also note that when count is used, the value returned when no match is found is (as one might expect) zero rather than NA.

The basic use of the seq method is shown above: following the colon you give a positive integer representing the (1-based) position of the observation in the sequence of matched rows. Alternatively, a negative integer can be used to count down from the last match: seq:-1 selects the last match, seq:-2 the second-last match, and so on. If the specified sequence number is out of bounds for a given observation, this method returns NA.

Referring again to the data in Table 7.1, suppose we want to import data from the persons file into a dataset established at household level. Here's an example where we use the individual age data from people.csv to add the average and minimum age of household members:

  open hholds.csv --quiet
  join people.csv avgage --ikey=hid --data=age --aggr=avg
  join people.csv minage --ikey=hid --data=age --aggr=min

Here's a further example, where we add to the household data the sum of the personal data xp, with the twist that we apply filters to get the sum specifically for household members under the age of 40, and for women:

  open hholds.csv --quiet
  join people.csv young_xp --ikey=hid --filter="age<40" --data=xp --aggr=sum
  join people.csv female_xp --ikey=hid --filter="gender==\"F\"" --data=xp --aggr=sum

The possibility of using an auxiliary variable with the min and max modes of aggregation gives extra flexibility. For example, suppose we want for each household the income of its oldest member:

  open hholds.csv --quiet
  join people.csv oldest_xp --ikey=hid --data=xp --aggr=max(age)

7.6 String-valued key variables

The examples above use numerical variables (household and individual ID numbers) in the matching process. It is also possible to use string-valued variables, in which case a match means that the string values of the key variables compare equal (with case-sensitivity). When using double keys, you can mix numerical and string keys, but naturally you cannot mix a string variable on the left (via --ikey) with a numerical one on the right (via --okey), or vice versa.

Here's a simple example. Suppose that alongside hholds.csv we have a file countries.csv with the following content:

  country,GDP
  UK,100
  US,500
  IT,150
  FR,180

The variable country, which is also found in hholds.csv, is string-valued. We can pull the GDP of the country in which the household resides into our households dataset with

  open hholds.csv -q
  join countries.csv GDP --ikey=country

which gives

       hid  country  GDP
  1     1      1     500
  2     6      2     150
  3     3      3     100
  4     4      2     150
  5     2      1     500
  6     5      2     150
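And since double keys may mix types (as noted above), something like the following is legitimate, provided the types agree between left and right for each key component -- the file and variable names here are hypothetical:

  # numeric year plus string-valued country as a double key
  join macro.csv infl --ikey=year,country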
7.7 Importing multiple series

The examples given so far have been limited in one respect: while several columns in the outer data file may be referenced (as keys, or in filtering or aggregation), only one column has actually provided data -- and correspondingly only one series in the inner dataset has been created or modified -- per invocation of join. However, join can handle the importation of several series at once. This section gives an account of the required syntax, along with certain restrictions that apply to the multiple-import case.

There are two ways to specify more than one series for importation:

1. The varname field in the command can take the form of a space-separated list of names rather than a single name.

2. Alternatively, you can give the name of an array of strings in place of varname: the elements of this array should be the names of the series to import.

Here are the limitations:

1. The --data option, which permits the renaming of a series on import, is not available. When importing multiple series you are obliged to accept their "outer" names, fixed up as described in section 7.2.

2. While the other join options are available, they necessarily apply uniformly to all the series imported via a given command. This means that if you want to import several series but using different keys, filters or aggregation methods, you must use a sequence of commands.

Here are a couple of examples of multiple imports:

  # open base datafile containing keys
  open PUMSdata.gdt
  # join using a list of import names
  join ss13pnc.csv SCHL WAGP WKHP --ikey=SERIALNO,SPORDER
  # using a strings array: may be worthwhile if the array
  # will be used for more than one purpose
  strings S = defarray("SCHL", "WAGP", "WKHP")
  join ss13pnc.csv S --ikey=SERIALNO,SPORDER

7.8 A real-world case

For a real use-case for join with cross-sectional data, we turn to the Bank of Italy's Survey on Household Income and Wealth (SHIW). (Details of the survey can be found at http://www.bancaditalia.it/statistiche/indcamp/bilfait/dismicro; the ASCII (CSV) data files for the 2010 survey are available there as a zip archive, ind10_ascii.zip.) In ASCII form, the 2010 survey results comprise 47 MB of data in 29 files. In this exercise we will draw on five of the SHIW files to construct a replica of the dataset used in Thomas Mroz's famous paper (Mroz, 1987) on women's labor force participation, which contains data on married women between the age of 30 and 60, along with certain characteristics of their households and husbands.

Our general strategy is as follows: we create a "core" dataset by opening the file carcom10.csv, which contains basic data on the individuals. After dropping unwanted individuals (all but married women), we use the resulting dataset as a base for pulling in further data via the join command.

The complete script to do the job is given in the Appendix to this chapter; here we walk through the script with comments interspersed. We assume that all the relevant files from the Bank of Italy survey are contained in a subdirectory called SHIW.

Starting with carcom10.csv, we use the --cols option to the open command to import specific series, namely NQUEST (household ID number), NORD (sequence number for individuals within each household), SEX (male = 1, female = 2), PARENT (status in household: 1 = head of household, 2 = spouse of head, etc.), STACIV (marital status: married = 1), STUDIO (educational level, coded from 1 to 8), ETA (age in years) and ACOM4C (size of town):

  open SHIW/carcom10.csv --cols=1,2,3,4,9,10,29,41

We then restrict the sample to married women from 30 to 60 years of age, and additionally restrict the sample of women to those who are either heads of households or spouses of the head:

  smpl SEX==2 && ETA>=30 && ETA<=60 && STACIV==1 --restrict
  smpl PARENT<3 --restrict

For compatibility with the Mroz dataset as presented in the gretl data file mroz87.gdt, we rename the age and education variables as WA and WE respectively, we compute the CIT dummy, and finally we store the reduced base dataset in gretl format:

  rename ETA WA
  rename STUDIO WE
  series CIT = (ACOM4C > 2)
  store mrozrep.gdt

The next step will be to get data on working hours from the jobs file allb1.csv. There's a complication here: we need the total hours worked over the course of the year, for both the women and their husbands. This is not available as such, but the variables ORETOT and MESILAV give, respectively, average hours worked per week and the number of months worked in 2010, each on a per-job basis. If each person held at most one job over the year we could compute his or her annual hours as

  HRS = ORETOT * 52 * MESILAV/12

However, some people had more than one job, and in this case what we want is the sum of annual hours across their jobs. We could use join with the seq aggregation method to construct this sum, but it is probably more straightforward to read the allb1 data, compute the HRS values per job as shown above, and save the results to a temporary CSV file:

  open SHIW/allb1.csv --cols=1,2,8,11 --quiet
  series HRS = misszero(ORETOT) * 52 * misszero(MESILAV)/12
  store HRS.csv NQUEST NORD HRS

Now we can reopen the base dataset and join the hours variable from HRS.csv. Note that we need a double key here: the women are uniquely identified by the combination of NQUEST and NORD. We don't need an --okey specification, since these keys go under the same names in the right-hand file. We define labor force participation, LFP, based on hours:

  open mrozrep.gdt
  join HRS.csv WHRS --ikey=NQUEST,NORD --data=HRS --aggr=sum
  WHRS = misszero(WHRS)
  LFP = WHRS > 0
For reference, here's how we could have used seq to avoid writing a temporary file:

  join SHIW/allb1.csv njobs --ikey=NQUEST,NORD --data=ORETOT --aggr=count
  series WHRS = 0
  loop i=1..max(njobs)
      join SHIW/allb1.csv htmp --ikey=NQUEST,NORD --data=ORETOT --aggr=seq:$i
      join SHIW/allb1.csv mtmp --ikey=NQUEST,NORD --data=MESILAV --aggr=seq:$i
      WHRS += misszero(htmp) * 52 * misszero(mtmp)/12
  endloop

To generate the work experience variable, AX, we use the file lavoro.csv: this contains a variable named ETALAV which records the age at which the person first started work.

  join SHIW/lavoro.csv ETALAV --ikey=NQUEST,NORD
  series AX = misszero(WA - ETALAV)

We compute the woman's hourly wage, WW, as the ratio of total employment income to annual working hours. This requires drawing the series YL (payroll income) and YM (net self-employment income) from the persons file rper10.csv:

  join SHIW/rper10.csv YL YM --ikey=NQUEST,NORD --aggr=sum
  series WW = LFP ? (YL + YM)/WHRS : 0

The family's net disposable income is available as Y in the file rfam10.csv; we import this as FAMINC:

  join SHIW/rfam10.csv FAMINC --ikey=NQUEST --data=Y

Data on number of children are now obtained by applying the count method. For the Mroz replication we want the number of children under the age of 6, and also the number aged 6 to 18:

  join SHIW/carcom10.csv KIDS --ikey=NQUEST --aggr=count --filter="ETA<=18"
  join SHIW/carcom10.csv KL6 --ikey=NQUEST --aggr=count --filter="ETA<6"
  series K618 = KIDS - KL6

We want to add data on the women's husbands, but how do we find them? To do this we create an additional inner key, which we'll call HID (husband ID), by subsampling in turn on the observations falling into each of two classes: (a) those where the woman is recorded as head of household and (b) those where the husband has that status. In each case we want the individual ID (NORD) of the household member whose status is complementary to that of the woman in question. So for case (a) we subsample using PARENT==1 (head of household) and filter the join using PARENT==2 (spouse of head); in case (b) we do the converse. We thus construct HID piecewise:

  # for women who are household heads
  smpl PARENT==1 --restrict --replace
  join SHIW/carcom10.csv HID --ikey=NQUEST --data=NORD --filter="PARENT==2"
  # for women who are not household heads
  smpl PARENT==2 --restrict --replace
  join SHIW/carcom10.csv HID --ikey=NQUEST --data=NORD --filter="PARENT==1"
  smpl full

Now we can use our new inner key to retrieve the husbands' data, matching HID on the left with NORD on the right within each household:

  join SHIW/carcom10.csv HA --ikey=NQUEST,HID --okey=NQUEST,NORD --data=ETA
  join SHIW/carcom10.csv HE --ikey=NQUEST,HID --okey=NQUEST,NORD --data=STUDIO
formats dates are given as 4digit year 2digit month and 2digit day in that order In extended format a dash is inserted between the fieldsas in 20131021 or more generally YYYYMMDDwhile in basic format the fields are run together YYYYMMDD Extended format is more easily parsed by human readers while basic format is more suitable for computer processing since one can apply ordinary arithmetic to compare dates as equal earlier or later The standard also recognizes YYYYMM as representing year and month eg 201011 for November 20102 as well as a plain fourdigit number for year alone One problem for economists is that the quarter is not a period covered by ISO 8601 This could be presented by YYYYQ with only one digit following the dash but in gretl output we in fact use a colon as in 20132 for the second quarter of 2013 For printed output of months gretl also uses 2The form YYYYMM is not recognized for year and month Chapter 7 Joining data sources 58 a colon as in 201306 A difficulty with following ISO here is that in a statistical context a string such as 198010 may look more like a subtraction than a date Anyway at present we are more interested in the parsing of dates on input rather than in what gretl prints And in that context note that excess precision is acceptable a month may be represented by its first day eg 20050501 for May 2005 and a quarter may be represented by its first month and day 20050701 for the third quarter of 2005 Some additional points regarding dates will be taken up as they become relevant in practical cases of joining data 710 Timeseries data Suppose our lefthand dataset is recognized by gretl as time series with a supported frequency annual quarterly monthly weekly daily or hourly This will be the case if the original data were read from a file that contained suitable time or date information or if a timeseries interpretation has been imposed using either the setobs command or its GUI equivalent Thenapart perhaps from some very special casesjoining additional data is bound to involve matching observations by timeperiod In this case contrary to the crosssectional case the inner dataset has a natural ordering of which gretl is aware hence no inner key is required If in addition the file from data which are to be joined is in native gretl format and contains time series information keys are not needed at all Three cases can arise the frequency of the outer dataset may be the same lower or higher than that of the inner dataset In the first two cases join should work without any special apparatus lowerfrequency values will be repeated for each highfrequency period In the third case however an aggregation method must be specified gretl needs to know how to map higherfrequency data into the existing dataset by averaging summing or whatever If the outer data file is not in native gretl format we need a means of identifying the period of each observation on the right an outer key which well call a time key The join command provides a simple but limited default for extracting period information from the outer data file plus an option that can be used if the default is not applicable as follows The default assumptions are 1 the time key appears in the first column 2 the heading of this column is either left blank or is one of obs date year period observation or observationdate on a caseinsensitive comparison and 3 the time format conforms to ISO 8601 where applicable extended daily date format YYYYMMDD monthly format YYYYMM or annual format YYYY If dates do not appear in the first 
ISO 8601 recognizes two formats for daily dates, "extended" and "basic". In both formats, dates are given as 4-digit year, 2-digit month and 2-digit day, in that order. In extended format a dash is inserted between the fields -- as in 2013-10-21 or, more generally, YYYY-MM-DD -- while in basic format the fields are run together (YYYYMMDD). Extended format is more easily parsed by human readers, while basic format is more suitable for computer processing, since one can apply ordinary arithmetic to compare dates as equal, earlier or later. The standard also recognizes YYYY-MM as representing year and month, e.g. 2010-11 for November 2010, as well as a plain four-digit number for year alone. (The form YYYYMM is not recognized for year and month.)

One problem for economists is that the quarter is not a period covered by ISO 8601. This could be represented by YYYY-Q (with only one digit following the dash), but in gretl output we in fact use a colon, as in 2013:2 for the second quarter of 2013. For printed output of months gretl also uses a colon, as in 2013:06. A difficulty with following ISO here is that, in a statistical context, a string such as 1980-10 may look more like a subtraction than a date. Anyway, at present we are more interested in the parsing of dates on input than in what gretl prints. And in that context, note that "excess precision" is acceptable: a month may be represented by its first day (e.g. 2005-05-01 for May 2005), and a quarter may be represented by its first month and day (2005-07-01 for the third quarter of 2005).

Some additional points regarding dates will be taken up as they become relevant in practical cases of joining data.

7.10 Time-series data

Suppose our left-hand dataset is recognized by gretl as time series with a supported frequency (annual, quarterly, monthly, weekly, daily or hourly). This will be the case if the original data were read from a file that contained suitable time or date information, or if a time-series interpretation has been imposed using either the setobs command or its GUI equivalent. Then -- apart, perhaps, from some very special cases -- joining additional data is bound to involve matching observations by time-period. In this case, contrary to the cross-sectional case, the inner dataset has a natural ordering of which gretl is aware; hence no "inner key" is required. If, in addition, the file from which data are to be joined is in native gretl format and contains time-series information, keys are not needed at all.

Three cases can arise: the frequency of the outer dataset may be the same, lower or higher than that of the inner dataset. In the first two cases join should work without any special apparatus: lower-frequency values will be repeated for each high-frequency period. In the third case, however, an aggregation method must be specified: gretl needs to know how to map higher-frequency data into the existing dataset (by averaging, summing, or whatever).

If the outer data file is not in native gretl format, we need a means of identifying the period of each observation on the right -- an "outer key", which we'll call a "time key". The join command provides a simple (but limited) default for extracting period information from the outer data file, plus an option that can be used if the default is not applicable, as follows. The default assumptions are: (1) the time key appears in the first column; (2) the heading of this column is either left blank or is one of "obs", "date", "year", "period", "observation" or "observation_date" (on a case-insensitive comparison); and (3) the time format conforms to ISO 8601 where applicable (extended daily date format YYYY-MM-DD, monthly format YYYY-MM, or annual format YYYY).

If dates do not appear in the first column of the outer file, or if the column heading or format is not as just described, the --tkey option can be used to indicate which column should be used and/or what format should be assumed.

Setting the time-key column and/or format

The --tkey option requires a parameter holding the name of the column in which the time key is located and/or a string specifying the format in which dates/times are written in the time-key column. This parameter should be enclosed in double-quotes. If both elements are present, they should be separated by a comma; if only a format is given, it should be preceded by a comma. Some examples:

  --tkey="Period,%m/%d/%Y"
  --tkey="Period"
  --tkey="obsperiod"
  --tkey=",%Ym%m"

The first of these applies if Period is not the first column on the right, and dates are given in the US format of month, day, year, separated by slashes. The second implies that, although Period is not the first column, the date format is ISO 8601. The third again implies that the date format is OK; here the name is required even if obsperiod is the first column, since this heading is not one recognized by gretl's heuristic. The last example implies that dates are in the first column (with one of the recognized headings), but are given in the non-standard format year, "m", month.

The date format string should be composed using the codes employed by the POSIX function strptime. Table 7.2 contains a list of the most relevant codes. (The %q code for quarter is not present in strptime; it is added for use with join, since quarterly data are common in macroeconomics.)

  Code  Meaning
  %%    The % character
  %b    The month name according to the current locale, either abbreviated
        or in full
  %C    The century number (0-99)
  %d    The day of month (1-31)
  %D    Equivalent to %m/%d/%y. (This is the American style date, very
        confusing to non-Americans, especially since %d/%m/%y is widely used
        in Europe. The ISO 8601 standard format is %Y-%m-%d.)
  %H    The hour (0-23)
  %j    The day number in the year (1-366)
  %m    The month number (1-12)
  %n    Arbitrary whitespace
  %q    The quarter (1-4)
  %w    The weekday number (0-6), with Sunday = 0
  %y    The year within century (0-99). When a century is not otherwise
        specified, values in the range 69-99 refer to years in the twentieth
        century (1969-1999); values in the range 00-68 refer to years in the
        twenty-first century (2000-2068).
  %Y    The year, including century (for example, 1991)

  Table 7.2: Date format codes

Example: daily stock prices

We show below the first few lines of a file named IBM.csv, containing stock-price data for IBM corporation:

  Date,Open,High,Low,Close,Volume,Adj Close
  2013-08-02,195.50,195.50,193.22,195.16,3861000,195.16
  2013-08-01,196.65,197.17,195.41,195.81,2856900,195.81
  2013-07-31,194.49,196.91,194.49,195.04,3810000,195.04

Note that the data are in reverse time-series order -- that won't matter to join; the data can appear in any order. Also note that the first column is headed Date and holds daily dates as ISO 8601 extended. That means we can pull the data into gretl very easily. In the following fragment we create a suitably dimensioned empty daily dataset, then rely on the default behavior of join with time-series data to import the closing stock price:

  nulldata 500
  setobs 5 2012-01-01
  join IBM.csv Close

To make explicit what we're doing, we could accomplish exactly the same using the --tkey option:

  join IBM.csv Close --tkey="Date,%Y-%m-%d"
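As noted at the start of this section, if the outer file were of higher frequency than the working dataset, an aggregation method would be needed. A hedged sketch, reusing IBM.csv against a monthly dataset (we assume --aggr=avg is accepted in this context; check the join entry in the Command Reference):

  nulldata 24
  setobs 12 2012:01
  # monthly averages of the daily closing prices
  join IBM.csv Close --aggr=avg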
Example: OECD quarterly data

Table 7.3 shows an excerpt from a CSV file provided by the OECD statistical site (stats.oecd.org) in response to a request for GDP at constant prices for several countries. (Retrieved 2013-08-05. The OECD files in fact contain two leading columns with very long labels; these are irrelevant to the present example and can be omitted without altering the sample script.)

  Frequency,Period,Country,Value,Flags
  "Quarterly","Q1-1960","France",463876.148126845,E
  "Quarterly","Q1-1960","Germany",768802.119278467,E
  "Quarterly","Q1-1960","Italy",414629.791450547,E
  "Quarterly","Q1-1960","United Kingdom",578437.090291889,E
  "Quarterly","Q2-1960","France",465618.977328614,E
  "Quarterly","Q2-1960","Germany",782484.138122549,E
  "Quarterly","Q2-1960","Italy",420714.910290157,E
  "Quarterly","Q2-1960","United Kingdom",572853.474696578,E
  "Quarterly","Q3-1960","France",469104.41925852,E
  "Quarterly","Q3-1960","Germany",809532.161494483,E
  "Quarterly","Q3-1960","Italy",426893.675840156,E
  "Quarterly","Q3-1960","United Kingdom",581252.066618986,E
  "Quarterly","Q4-1960","France",474664.327992619,E
  "Quarterly","Q4-1960","Germany",817806.132384948,E
  "Quarterly","Q4-1960","Italy",427221.338414114,E

  Table 7.3: Example of CSV file as provided by the OECD statistical website

This is an instance of data in what we call "atomic" format, that is, a format in which each line of the outer file contains a single data-point, and extracting data mainly requires filtering the appropriate lines. The outer time key is under the Period heading, and has the format Q<quarter>-<year>. Assuming that the file in Table 7.3 has the name oecd.csv, the following script reconstructs the time series of Gross Domestic Product for several countries:

  nulldata 220
  setobs 4 1960:1
  join oecd.csv FRA --tkey="Period,Q%q-%Y" --data=Value --filter="Country==\"France\""
  join oecd.csv GER --tkey="Period,Q%q-%Y" --data=Value --filter="Country==\"Germany\""
  join oecd.csv ITA --tkey="Period,Q%q-%Y" --data=Value --filter="Country==\"Italy\""
  join oecd.csv UK --tkey="Period,Q%q-%Y" --data=Value --filter="Country==\"United Kingdom\""

Note the use of the format codes %q for the quarter and %Y for the 4-digit year. A touch of elegance could have been added by storing the invariant options to join using the setopt command, as in

  setopt join persist --tkey="Period,Q%q-%Y" --data=Value
  join oecd.csv FRA --filter="Country==\"France\""
  join oecd.csv GER --filter="Country==\"Germany\""
  join oecd.csv ITA --filter="Country==\"Italy\""
  join oecd.csv UK --filter="Country==\"United Kingdom\""
  setopt join clear

If one were importing a large number of such series, it might be worth rewriting the sequence of joins as a loop, as in

  strings countries = defarray("France", "Germany", "Italy", "United Kingdom")
  strings vnames = defarray("FRA", "GER", "ITA", "UK")
  setopt join persist --tkey="Period,Q%q-%Y" --data=Value
  loop foreach i countries
      string vname = vnames[i]
      join oecd.csv @vname --filter="Country==\"$i\""
  endloop
  setopt join clear
7.11 Special handling of time columns

When dealing with straight time-series data, the tkey mechanism described above should suffice in almost all cases. In some contexts, however, time enters the picture in a more complex way; examples include panel data (see section 7.12) and so-called realtime data (see chapter 8). To handle such cases, join provides the --tconvert option. This can be used to select certain columns in the right-hand data file for special treatment: strings representing dates in these columns will be converted to numerical values, namely 8-digit numbers on the pattern YYYYMMDD (ISO basic daily format). Once dates are in this form, it is easy to use them in key-matching or filtering.

By default it is assumed that the strings in the selected columns are in ISO extended format, YYYY-MM-DD. If that is not the case, you can supply a time-format string using the --tconvfmt option. The format string should be written using the codes shown in Table 7.2. Here are some examples:

  # select one column for treatment
  --tconvert=start_date
  # select two columns for treatment
  --tconvert="start_date,end_date"
  # specify US-style daily date format
  --tconvfmt="%m/%d/%Y"
  # specify quarterly date-strings (as in 2004q1)
  --tconvfmt="%Yq%q"

Some points to note:

If a specified column is not selected for a substantive role in the join operation (as data to be imported, as a key, or as an auxiliary variable for use in aggregation), the column in question is not read, and so no conversion is carried out.

If a specified column contains numerical rather than string values, no conversion is carried out.

If a string value in a selected column fails parsing using the relevant time format (user-specified or default), the converted value is NA.

On successful conversion, the output is always in daily-date form as stated above. If you specify a monthly or quarterly time format, the converted date is the first day of the month or quarter.

7.12 Panel data

In section 7.10 we gave an example of reading quarterly GDP data for several countries from an OECD file. In that context we imported each country's data as a distinct time-series variable. Now suppose we want the GDP data in panel format instead (stacked time series). How can we do this with join?

As a reminder, here's what the OECD data look like:

  Frequency,Period,Country,Value,Flags
  "Quarterly","Q1-1960","France",463876.148126845,E
  "Quarterly","Q1-1960","Germany",768802.119278467,E
  "Quarterly","Q1-1960","Italy",414629.791450547,E
  "Quarterly","Q1-1960","United Kingdom",578437.090291889,E
  "Quarterly","Q2-1960","France",465618.977328614,E

and so on. If we have four countries and quarterly observations running from 1960:1 to 2013:2 (T = 214 quarters), we might set up our panel workspace like this:

  scalar N = 4
  scalar T = 214
  scalar NT = N*T
  nulldata NT --preserve
  setobs T 1:1 --stacked-time-series

The relevant outer keys are obvious: Country for the country and Period for the time period. Our task is now to construct matching keys in the inner dataset. This can be done via two panel-specific options to the setobs command. Let's work on the time dimension first:

  setobs 4 1960:1 --panel-time
  series quarter = $obsdate

This variant of setobs allows us to tell gretl that time in our panel is quarterly, starting in the first quarter of 1960. Having set that, the accessor $obsdate will give us a series of 8-digit dates representing the first day of each quarter -- 19600101, 19600401, 19600701, and so on, repeating for each country. As we explained in section 7.11, we can use the --tconvert option on the outer series Period to get exactly matching values (in this case using a format of Q%q-%Y for parsing the Period values).

Now for the country names:

  string cstrs = sprintf("France Germany Italy \"United Kingdom\"")
  setobs country cstrs --panel-groups

Here we write into the string cstrs the names of the countries, using escaped double-quotes to handle the space in "United Kingdom", then pass this string to setobs with the --panel-groups option, preceded by the identifier country. This asks gretl to construct a string-valued series named country, in which each name will repeat T times.

We're now ready to join. Assuming the OECD file is named oecd.csv, we do

  join oecd.csv GDP --data=Value --ikey=country,quarter --okey=Country,Period \
    --tconvert=Period --tconvfmt="Q%q-%Y"
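A quick, hedged way to eyeball the result of a join like this is to print the first few rows of the panel:

  # show the first two years of the first country
  smpl 1 8
  print country quarter GDP --byobs
  smpl full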
Other input formats

The OECD file discussed above is in the most convenient format for join, with one data-point per line. But sometimes we may want to make a panel from a data file structured like this:

  Real GDP
  Period,France,Germany,Italy,United Kingdom
  Q1-1960,463863,768757,414630,578437
  Q2-1960,465605,782438,420715,572853
  Q3-1960,469091,809484,426894,581252
  Q4-1960,474651,817758,427221,584779
  Q1-1961,482285,826031,442528,594684

Call this file side_by_side.csv. Assuming the same initial setup as above, we can "panelize" the data by setting the sample to each country's time series in turn and importing the relevant column. The only point to watch here is that the string "United Kingdom", being a column heading, will become United_Kingdom on importing (see section 7.2), so we'll need a slightly different set of country strings:

  strings cstrs = defarray("France", "Germany", "Italy", "United_Kingdom")
  setobs country cstrs --panel-groups
  loop foreach i cstrs
      smpl country=="$i" --restrict --replace
      join side_by_side.csv GDP --data="$i" --ikey=quarter --okey=Period \
        --tconvert=Period --tconvfmt="Q%q-%Y"
  endloop
  smpl full

If our working dataset and the outer data file are dimensioned such that there are just as many time-series observations on the right as there are time slots on the left -- and the observations on the right are contiguous, in chronological order, and start on the same date as the working dataset -- we could dispense with the key apparatus and just use the first line of the join command shown above. However, in general it is safer to use keys to ensure that the data end up in correct registration.

7.13 Memo: join options

Basic syntax: join filename varname(s) [options]

  flag          effect (relevant section)
  --data        Give the name of the data column on the right, in case it
                differs from varname; single import only (7.2)
  --filter      Specify a condition for filtering data rows (7.3)
  --ikey        Specify up to two keys for matching data rows (7.4)
  --okey        Specify "outer" key name(s) in case they differ from the
                inner ones (7.4)
  --aggr        Select an aggregation method for 1 to n joins (7.5)
  --tkey        Specify right-hand time key (7.10)
  --tconvert    Select outer date columns for conversion to numeric form (7.11)
  --tconvfmt    Specify a format for use with --tconvert (7.11)
  --no-header   Treat the first row on the right as data (7.2)
  --verbose     Report on progress in reading the outer data

Appendix: the full Mroz data script

  # start with everybody; get gender, age and a few other variables
  # directly while we're at it
  open SHIW/carcom10.csv --cols=1,2,3,4,9,10,29,41

  # subsample on married women between the ages of 30 and 60
  smpl SEX==2 && ETA>=30 && ETA<=60 && STACIV==1 --restrict

  # for simplicity, restrict to heads of households and their spouses
  smpl PARENT<3 --restrict

  # rename the age and education variables for compatibility,
  # compute the CIT dummy, and finally save the reduced base dataset
  rename ETA WA
  rename STUDIO WE
  series CIT = (ACOM4C > 2)
  store mrozrep.gdt

  # make a temp file holding annual hours worked per job
  open SHIW/allb1.csv --cols=1,2,8,11 --quiet
  series HRS = misszero(ORETOT) * 52 * misszero(MESILAV)/12
  store HRS.csv NQUEST NORD HRS

  # reopen the base dataset and begin drawing assorted data in
  open mrozrep.gdt

  # women's annual hours, summed across jobs
  join HRS.csv WHRS --ikey=NQUEST,NORD --data=HRS --aggr=sum
  WHRS = misszero(WHRS)

  # labor force participation
  LFP = WHRS > 0

  # work experience: ETALAV = age when started first job
  join SHIW/lavoro.csv ETALAV --ikey=NQUEST,NORD
  series AX = misszero(WA - ETALAV)

  # women's hourly wages
  join SHIW/rper10.csv YL YM --ikey=NQUEST,NORD --aggr=sum
  series WW = LFP ? (YL + YM)/WHRS : 0

  # family income (Y = net disposable income)
  join SHIW/rfam10.csv FAMINC --ikey=NQUEST --data=Y

  # get data on children using the count method
  join SHIW/carcom10.csv KIDS --ikey=NQUEST --aggr=count --filter="ETA<=18"
  join SHIW/carcom10.csv KL6 --ikey=NQUEST --aggr=count --filter="ETA<6"
  series K618 = KIDS - KL6

  # data on husbands: we first construct an auxiliary inner key for
  # husbands, using the little trick of subsampling the inner dataset
  #
  # for women who are household heads
  smpl PARENT==1 --restrict --replace
  join SHIW/carcom10.csv HID --ikey=NQUEST --data=NORD --filter="PARENT==2"
  # for women who are not household heads
  smpl PARENT==2 --restrict --replace
  join SHIW/carcom10.csv HID --ikey=NQUEST --data=NORD --filter="PARENT==1"
  smpl full

  # add husbands' data via the newly-added secondary inner key
  join SHIW/carcom10.csv HA --ikey=NQUEST,HID --okey=NQUEST,NORD --data=ETA
  join SHIW/carcom10.csv HE --ikey=NQUEST,HID --okey=NQUEST,NORD --data=STUDIO
  join HRS.csv HHRS --ikey=NQUEST,HID --okey=NQUEST,NORD --data=HRS --aggr=sum
  HHRS = misszero(HHRS)

  # final cleanup begins

  # recode educational attainment as years of education
  matrix eduyrs = {0, 5, 8, 11, 13, 16, 18, 21}
  series WE = replace(WE, seq(1,8), eduyrs)
  series HE = replace(HE, seq(1,8), eduyrs)

  # cut some cruft
  delete SEX STACIV KIDS YL YM PARENT HID ETALAV

  # add some labels for the series
  setinfo LFP -d "1 if woman worked in 2010"
  setinfo WHRS -d "Wife's hours of work in 2010"
  setinfo KL6 -d "Number of children less than 6 years old in household"
  setinfo K618 -d "Number of children between ages 6 and 18 in household"
  setinfo WA -d "Wife's age"
  setinfo WE -d "Wife's educational attainment, in years"
  setinfo WW -d "Wife's average hourly earnings, in 2010 euros"
  setinfo HHRS -d "Husband's hours worked in 2010"
  setinfo HA -d "Husband's age"
  setinfo HE -d "Husband's educational attainment, in years"
  setinfo FAMINC -d "Family income, in 2010 euros"
  setinfo AX -d "Actual years of wife's previous labor market experience"
  setinfo CIT -d "1 if live in large city"

  # save the final product
  store mrozrep.gdt

Chapter 8: Realtime data

8.1 Introduction

As of gretl version 1.9.13, the join command (see chapter 7) has been enhanced to deal with so-called realtime datasets in a straightforward manner. Such datasets contain information on when the observations in a time series were actually published by the relevant statistical agency, and how they have been revised over time. Probably the most popular sources of such data are the "Alfred" online database at the St. Louis Fed (http://alfred.stlouisfed.org) and the OECD's StatExtracts site (http://stats.oecd.org). The examples in this chapter deal with files downloaded from these sources, but should be easy to adapt to files with a slightly different format.

As already stated, join requires a column-oriented plain text file, where the columns may be separated by commas, tabs, spaces or semicolons. Alfred and the OECD provide the option to download realtime data in this format (tab-delimited files from Alfred, comma-delimited from the OECD). If you have a realtime dataset in a spreadsheet file, you must export it to a delimited text file before using it with join.

Representing revision histories is more complex than just storing a standard time series, because for each observation period you have, in general, more than one published value over time, along with the information on when each of these values was valid or current. Sometimes this is represented in spreadsheets with two time axes, one for the observation period and another one for the publication date or "vintage". The filled cells then form an upper triangle (or a "guillotine blade" shape, if the publication dates do not reach back far enough to complete the triangle). This format can be useful for giving a human reader an overview of realtime data, but it is not optimal for automatic processing; for that purpose "atomic" format is best.

8.2 Atomic format for realtime data

What we are calling atomic format is exactly the format used by Alfred if you choose the option "Observations by Real-Time Period", and by the OECD if you select all editions of a series for download as plain text (CSV). (If you choose to download in Excel format from OECD, you get a file in the triangular or guillotine format mentioned above.) A file in this format contains one actual data-point per line, together with associated metadata. This is illustrated in Table 8.1, where we show the first three lines from an Alfred file and an OECD file (slightly modified: in the Alfred file we have used commas rather than tabs as the column delimiter; in the OECD example we have shortened the name in the Variable column).

  Alfred: monthly US industrial production

  observation_date,INDPRO,realtime_start_date,realtime_end_date
  1960-01-01,112.0000,1960-02-16,1960-03-15
  1960-01-01,111.0000,1960-03-16,1961-10-15

  OECD: monthly UK industrial production

  Country,Variable,Frequency,Time,Edition,Value,Flags
  "United Kingdom","INDPRO","Monthly","Jan-1990","February 1999",100,
  "United Kingdom","INDPRO","Monthly","Feb-1990","February 1999",99.3,

  Table 8.1: Variant atomic formats for realtime data

Consider the first data line in the Alfred file: in the observation_date column we find 1960-01-01, indicating that the data-point on this line, namely 112.0, is an observation or measurement (in this case, of the US index of industrial production) that refers to the period
Chapter 8
Real-time data

8.1 Introduction

As of gretl version 1.9.13 the join command (see chapter 7) has been enhanced to deal with so-called real-time datasets in a straightforward manner. Such datasets contain information on when the observations in a time series were actually published by the relevant statistical agency and how they have been revised over time. Probably the most popular sources of such data are the "Alfred" online database at the St. Louis Fed (http://alfred.stlouisfed.org/) and the OECD's StatExtracts site, http://stats.oecd.org/. The examples in this chapter deal with files downloaded from these sources, but should be easy to adapt to files with a slightly different format.

As already stated, join requires a column-oriented plain text file, where the columns may be separated by commas, tabs, spaces or semicolons. Alfred and the OECD provide the option to download real-time data in this format (tab-delimited files from Alfred, comma-delimited from the OECD). If you have a real-time dataset in a spreadsheet file you must export it to a delimited text file before using it with join.

Representing revision histories is more complex than just storing a standard time series, because for each observation period you have, in general, more than one published value over time, along with the information on when each of these values was valid or current. Sometimes this is represented in spreadsheets with two time axes, one for the observation period and another one for the publication date or "vintage". The filled cells then form an upper triangle (or a "guillotine blade" shape, if the publication dates do not reach back far enough to complete the triangle). This format can be useful for giving a human reader an overview of real-time data, but it is not optimal for automatic processing; for that purpose "atomic" format is best.

8.2 Atomic format for real-time data

What we are calling atomic format is exactly the format used by Alfred if you choose the option "Observations by Real-Time Period", and by the OECD if you select all editions of a series for download as plain text (CSV).[1] A file in this format contains one actual data-point per line, together with associated metadata. This is illustrated in Table 8.1, where we show the first data lines from an Alfred file and an OECD file (slightly modified).[2]

Alfred: monthly US industrial production
observation_date,INDPRO,realtime_start_date,realtime_end_date
1960-01-01,112.0000,1960-02-16,1960-03-15
1960-01-01,111.0000,1960-03-16,1961-10-15

OECD: monthly UK industrial production
Country,Variable,Frequency,Time,Edition,Value,Flags
"United Kingdom","INDPRO","Monthly","Jan-1990","February 1999",100,
"United Kingdom","INDPRO","Monthly","Feb-1990","February 1999",99.3,

Table 8.1: Variant atomic formats for real-time data

Consider the first data line in the Alfred file: in the observation_date column we find 1960-01-01, indicating that the data-point on this line, namely 112.0, is an observation or measurement (in this case, of the US index of industrial production) that refers to the period starting on January 1st 1960. The realtime_start_date value of 1960-02-16 tells us that this value was published on February 16th 1960, and the realtime_end_date value says that this vintage remained current through March 15th 1960. On the next day (as we can see from the following line) this data-point was revised slightly downward, to 111.0.

Daily dates in Alfred files are given in ISO extended format, YYYY-MM-DD, but below we describe how to deal with differently formatted dates. Note that daily dates are appropriate for the last two columns, which jointly record the interval over which a given data vintage was current. Daily dates might, however, be considered overly precise for the first column, since the data period may well be the year, quarter or month (as it is in fact here). However, following Alfred's practice it is acceptable to specify a daily date, indicating the first day of the period, even for non-daily data.[3]

[1] If you choose to download in Excel format from the OECD you get a file in the triangular or "guillotine" format mentioned above.
[2] In the Alfred file we have used commas rather than tabs as the column delimiter; in the OECD example we have shortened the name in the Variable column.
[3] Notice that this implies that in the Alfred example it is not clear, without further information, whether the observation period is the first quarter of 1960, the month of January 1960, or the day January 1st 1960. However, we assume that this information is always available in context.
Compare the first data line of the OECD example. There's a greater amount of leading metadata, which is left implicit in the Alfred file. Here Time is the equivalent of Alfred's observation_date, and Edition the equivalent of Alfred's realtime_start_date. So we read that in February 1999, a value of 100 was current for the UK index of industrial production for January 1990, and from the next line we see that in the same vintage month a value of 99.3 was current for industrial production in February 1990.

Besides the different names and ordering of the columns, there are a few more substantive differences between Alfred and OECD files, most of which are irrelevant for join but some of which are (possibly) relevant.

The first (irrelevant) difference is the ordering of the lines. It appears (though we're not sure how consistent this is) that in Alfred files the lines are sorted by observation date first and then by publication date, so that all revisions of a given observation are grouped together, while OECD files are sorted first by revision date (Edition) and then by observation date (Time). If we want the next revision of UK industrial production for January 1990 in the OECD file we have to scan down several lines until we find

"United Kingdom","INDPRO","Monthly","Jan-1990","March 1999",100,

This difference is basically irrelevant because join can handle the case where the lines appear in random order, although some operations can be coded more conveniently if we're able to assume chronological ordering (either on the Alfred or the OECD pattern, it doesn't matter).

The second (also irrelevant) difference is that the OECD seems to include periodic "Edition" lines even when there is no change from the previous value (as illustrated above, where the UK industrial production index for January 1990 is reported as 100 as of March 1999, the same value that we saw to be current in February 1999), while Alfred reports a new value only when it differs from what was previously current.

A third difference lies in the dating of the revisions or editions. As we have seen, Alfred gives a specific daily date while (in the UK industrial production file, at any rate) the OECD just dates each edition to a month. This is not necessarily relevant for join, but it does raise the question of whether the OECD might date revisions to a finer granularity in some of their files, in which case one would have to be on the lookout for a different date format.

The final difference is that Alfred supplies an "end date" for each data vintage while the OECD supplies only a starting date. But there is less to this difference than meets the eye: according to the Alfred webmaster, by design a new vintage must start immediately following (the day after) the lapse of the old vintage, so the end date conveys no independent information.[4]

[4] Email received from Travis May of the Federal Reserve Bank of St. Louis, 2013-10-17. This closes off the possibility that a given vintage could lapse or expire some time before the next vintage becomes available, hence giving rise to a "hole" in an Alfred real-time file.

8.3 More on time-related options

Before we get properly started it is worth saying a little more about the --tkey and --tconvert options to join (first introduced in section 7.11), as they apply in the case of real-time data.

When you're working with regular time series data, --tkey is likely to be useful while --tconvert is unlikely to be applicable (see section 7.10). On the other hand, when you're working with panel data --tkey is definitely not applicable but --tconvert may well be helpful (section 7.12). When working with real-time data, however, depending on the task in hand, both options may be useful. You will likely need --tkey; you may well wish to select at least one column for --tconvert treatment; and in fact you may want to name a given column in both contexts, that is, include the --tkey variable among the --tconvert columns.

Why might this make sense? Well, think of the --tconvert option as a "preprocessing" directive: it asks gretl to convert date strings to numerical values (8-digit ISO basic dates) "at source", as they are read from the outer datafile. The --tkey option, on the other hand, singles out a column as the one to use for matching rows with the inner dataset. So you would want to name a column in both roles if (a) it should be used for matching periods and also (b) it is desirable to have the values from this column in numerical form, most likely for use in filtering.

As we have seen, you can supply specific formats in connection with both --tkey and --tconvert (in the latter case via the companion option --tconvfmt) to handle the case where the date strings on the right are not ISO-friendly at source. This raises the question of how the format specifications work if a given column is named under both options. Here are the rules that gretl applies:

1. If a format is given with the --tkey option it always applies to the --tkey column alone; and for that column, it overrides any format given via the --tconvfmt option.

2. If a format is given via --tconvfmt it is assumed to apply to all the --tconvert columns, unless this assumption is overridden by rule 1.
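To make these rules concrete, here is a hypothetical sketch (the file name and column names are invented): the Date column serves as time key with its own format (rule 1), while a second date column, pubdate, is converted for filtering using the shared --tconvfmt format (rule 2).

join outer.csv X --tkey="Date,%d/%m/%Y" \
  --tconvert=Date,pubdate --tconvfmt="%d/%m/%Y" \
  --filter=pubdate<20200101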
8.4 Getting a certain data vintage

The most common application of real-time data is to "travel back in time" and retrieve the data that were current as of a certain date in the past. This would enable you to replicate a forecast or other statistical result that could have been produced at that date.

For example, suppose we are interested in a variable of monthly frequency named INDPRO, real-time data on which is stored in an Alfred file named INDPRO.txt, and we want to check the status quo as of June 15th 2011. If we don't already have a suitable dataset into which to import the INDPRO data, our first steps will be to create an appropriately dimensioned empty dataset using the nulldata command, and then specify its time-series character via setobs, as in

nulldata 132
setobs 12 2004:01

For convenience we can put the name of our real-time file into a string variable. On Windows this might look like

string fname = "C:/Users/yourname/Downloads/INDPRO.txt"

We can then import the data vintage 2011-06-15 using join, arbitrarily choosing the self-explanatory identifier ip_asof_20110615:

join @fname ip_asof_20110615 --tkey=observation_date --data=INDPRO \
  --tconvert=realtime_start_date \
  --filter=realtime_start_date<=20110615 --aggr=max(realtime_start_date)

Here some detailed explanations of the various options are warranted:

- The --tkey option specifies the column which should be treated as holding the observation period identifiers to be matched against the periods in the current gretl dataset.[5] The more general form of this option is --tkey="colname,format" (note the double quotes here), so if the dates do not come in standard format, we can tell gretl how to parse them by using the appropriate conversion specifiers as shown in Table 7.2. For example, here we could have written --tkey="observation_date,%Y-%m-%d".

- Next, --data=INDPRO tells gretl that we want to retrieve the entries stored in the column named INDPRO.

- As explained in section 7.11, the --tconvert option selects certain columns in the right-hand data file for conversion from date strings to 8-digit numbers on the pattern YYYYMMDD. We'll need this for the next step, filtering, since the transformation to numerical values makes it possible to perform basic arithmetic on dates. Note that since date strings in Alfred files conform to gretl's default assumption it is not necessary to use the --tconvfmt option here.

- The --filter option specification, in combination with the subsequent --aggr aggregation treatment, is the central piece of our data retrieval: notice how we use the date constant 20110615 in ISO basic form to do numerical comparisons, and how we perform the numerical max operation on the converted column realtime_start_date. It would also have been possible to predefine a scalar variable, as in vintage = 20110615, and then use vintage in the join command instead of the hard-coded date. Here we tell join that we only want to extract those publications that (1) already appeared before (and including) June 15th 2011, and (2) were not yet obsoleted by a newer release.[6]

As a result, your dataset will now contain a time series named ip_asof_20110615 with the values that a researcher would have had available on June 15th 2011. Of course, all values for the observations after June 2011 will be missing (and probably a few before that, too), because they only have become available later on.

[5] Strictly speaking, using --tkey is unnecessary in this example because we could just have relied on the default, which is to use the first column in the source file for the periods. However, being explicit is often a good idea.
[6] By implementing the second condition through the max aggregation on the realtime_start_date column alone (without using the realtime_end_date column) we make use of the fact that Alfred files cannot have "holes", as explained before.
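As the text notes, the cutoff date can be held in a scalar rather than hard-coded. A minimal variant of the command above, under the same assumptions about the file behind @fname:

scalar vintage = 20110615
join @fname ip_asof --tkey=observation_date --data=INDPRO \
  --tconvert=realtime_start_date \
  --filter=realtime_start_date<=vintage --aggr=max(realtime_start_date)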
8.5 Getting the n-th release for each observation period

For some purposes it may be useful to retrieve the n-th published value of each observation, where n is a fixed positive integer, irrespective of when each of these n-th releases was published. Suppose we are interested in the third release; then the relevant join command becomes

join @fname ip_3rdpub --tkey=observation_date --data=INDPRO --aggr=seq:3

Since we do not need the realtime_start_date information for this retrieval, we have dropped the --tconvert option here. Note that this formulation assumes that the source file is ordered chronologically; otherwise using the option --aggr=seq:3, which retrieves the third value from each sequence of matches, could have yielded a result different from the one intended. However, this assumption holds for Alfred files and is probably rather safe in general.

The values of the variable imported as ip_3rdpub in this way were published at different dates, so the variable is effectively a mix of different vintages. Depending on the type of variable, this may also imply drastic jumps in the values; for example, index numbers are regularly re-based to different base periods. This problem also carries over to inflation-adjusted economic variables, where the base period of the price index changes over time. Mixing vintages in general also means mixing different scales in the output, with which you would have to deal appropriately.[7]

[7] Some user-contributed functions may be available that address this issue, but it is beyond our scope here. Another, even more complicated issue in the real-time context is that of benchmark revisions applied by statistical agencies, where the underlying definition or composition of a variable changes on some date, which goes beyond a mere rescaling. However, this type of structural change is not, in principle, a feature of real-time data alone, but applies to any time-series data.
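For instance, setting n = 1 retrieves the first-release series, which can then be set against a later release to gauge the size of revisions. A small sketch along the lines of the command above:

# first and third releases of each observation
join @fname ip_1stpub --tkey=observation_date --data=INDPRO --aggr=seq:1
join @fname ip_3rdpub --tkey=observation_date --data=INDPRO --aggr=seq:3
# inspect the revision between the two releases
series rev13 = ip_3rdpub - ip_1stpub
summary rev13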
8.6 Getting the values at a fixed lag after the observation period

New data releases may take place on any day of the month, and as we have seen, the specific day of each release is recorded in real-time files from Alfred. However, if you are working with, say, monthly or quarterly data you may sometimes want to adjust the granularity of your real-time axis to a monthly or quarterly frequency. For example, in order to analyse the data revision process for monthly industrial production you might be interested in the extent of revisions between the data available two and three months after each observation period.

This is a relatively complicated task and there is more than one way of accomplishing it. Either you have to make several passes through the outer dataset or you need a sophisticated filter, written as a hansl function. Either way you will want to make use of some of gretl's built-in calendrical functions.

We'll assume that a suitably dimensioned workspace has been set up as described above. Given that, the key ingredients of the join are (1) a filtering function, which we'll call relOK (for "release is OK"), and (2) the join command which calls it. Here's the function:

function series relOK (series obsdate, series reldate, int p)
  series y_obs, m_obs, y_rel, m_rel
  # get year and month from observation date
  isoconv(obsdate, &y_obs, &m_obs)
  # get year and month from release date
  isoconv(reldate, &y_rel, &m_rel)
  # find the delta in months
  series dm = (12*y_rel + m_rel) - (12*y_obs + m_obs)
  # and implement the filter
  return dm <= p
end function

And here's the command:

scalar lag = 3 # choose your fixed lag here
join @fname ip_plus3 --data=INDPRO --tkey=observation_date \
  --tconvert=observation_date,realtime_start_date \
  --filter=relOK(observation_date,realtime_start_date,lag) \
  --aggr=max(realtime_start_date)

Note that we use --tconvert to convert both the observation date and the real-time start date (or release date) to 8-digit numerical values. Both of these series are passed to the filter, which uses the built-in function isoconv to extract year and month. We can then calculate dm, the "delta months" since the observation date, for each release. The filter condition is that this delta should be no greater than the specified lag, p.[8]

This filter condition may be satisfied by more than one release, but only the latest of those will actually be the vintage that was current at the end of the n-th month after the observation period, so we add the option --aggr=max(realtime_start_date). If instead you want to target the release at the beginning of the n-th month you would have to use a slightly more complicated filter function.

An illustration: Figure 8.1 shows four time series for the monthly index of US industrial production from October 2005 to June 2009: the value as of first publication plus the values current 3, 6 and 12 months out from the observation date.[9] From visual inspection it would seem that over much of this period the Federal Reserve was fairly consistently overestimating industrial production at first release and shortly thereafter, relative to the figure they arrived at with a lag of a year.

The script that produced this Figure is shown in full in Listing 8.1. Note that in this script we are using a somewhat more efficient version of the relOK function shown above, where we pass the required series arguments in "pointer" form to avoid having to copy them (see chapter 14).

[Figure 8.1: Successive revisions to US industrial production, 2006-2009. Plotted series: "First publication", "Plus 3 months", "Plus 6 months", "Plus 12 months".]

[8] The filter is written on the assumption that the lag is expressed in months; on that understanding it could be used with annual or quarterly data as well as monthly. The idea could be generalized to cover weekly or daily data without much difficulty.
[9] Why not a longer series? Because if we try to extend it in either direction we immediately run into the index re-basing problem mentioned in section 8.5, with big staggered leaps downward in all the series.

Listing 8.1: Retrieving successive real-time lags of US industrial production

function series relOK (series *obsdate, series *reldate, int p)
  series y_obs, m_obs, d_obs, y_rel, m_rel, d_rel
  isoconv(obsdate, &y_obs, &m_obs, &d_obs)
  isoconv(reldate, &y_rel, &m_rel, &d_rel)
  series dm = (12*y_rel + m_rel) - (12*y_obs + m_obs)
  return dm < p || (dm == p && d_rel <= d_obs)
end function

nulldata 45
setobs 12 2005:10
string fname = "INDPRO.txt"

# initial published values
join @fname firstpub --data=INDPRO --tkey=observation_date \
  --tconvert=realtime_start_date --aggr=min(realtime_start_date)

# plus 3 months
join @fname plus3 --data=INDPRO --tkey=observation_date \
  --tconvert=observation_date,realtime_start_date \
  --filter=relOK(&observation_date,&realtime_start_date,3) \
  --aggr=max(realtime_start_date)

# plus 6 months
join @fname plus6 --data=INDPRO --tkey=observation_date \
  --tconvert=observation_date,realtime_start_date \
  --filter=relOK(&observation_date,&realtime_start_date,6) \
  --aggr=max(realtime_start_date)

# plus 12 months
join @fname plus12 --data=INDPRO --tkey=observation_date \
  --tconvert=observation_date,realtime_start_date \
  --filter=relOK(&observation_date,&realtime_start_date,12) \
  --aggr=max(realtime_start_date)

setinfo firstpub --graph-name="First publication"
setinfo plus3 --graph-name="Plus 3 months"
setinfo plus6 --graph-name="Plus 6 months"
setinfo plus12 --graph-name="Plus 12 months"

# set output="realtime.pdf" # for PDF
gnuplot firstpub plus3 plus6 plus12 --time-series --with-lines \
  --output=display { set key left bottom; }
8.7 Getting the revision history for an observation

For our final example we show how to retrieve the revision history for a given observation (again using Alfred data on US industrial production). In this exercise we are switching the time axis: the observation period is a fixed point and time is vintage time.

A suitable script is shown in Listing 8.2. We first select an observation to track (January 1970). We start the clock in the following month, when a data-point for this period was first published, and let it run to the end of the vintage history (in this file, March 2013). Our outer time key is the real-time start date and we filter on the observation date; we name the imported INDPRO values as ip_jan70. Since it sometimes happens that more than one revision occurs in a given month, we need to select an aggregation method: here we choose to take the last revision in the given month.

Recall from section 8.2 that Alfred records a new revision only when the data-point in question actually changes. This means that our imported series will contain missing values for all months when no real revision took place. However, we can apply a simple autoregressive rule to fill in the blanks: each missing value equals the prior non-missing value.

Figure 8.2 displays the revision history. Over this sample period the periodic re-basing of the index overshadows amendments due to accrual of new information.

Listing 8.2: Retrieving a revision history

# choose the observation to track here (YYYYMMDD)
scalar target = 19700101

nulldata 518 --preserve
setobs 12 1970:02

join INDPRO.txt ip_jan70 --data=INDPRO --tkey=realtime_start_date \
  --tconvert=observation_date \
  --filter=observation_date==target --aggr=seq:-1

ip_jan70 = ok(ip_jan70) ? ip_jan70 : ip_jan70(-1)

gnuplot ip_jan70 --time-series --with-lines --output=display

[Figure 8.2: Vintages of the index of US industrial production for January 1970, plotted over 1970-2013 (series ip_jan70).]
Chapter 9
Temporal disaggregation

9.1 Introduction

This chapter describes and explains the facility for temporal disaggregation in gretl.[1] This is implemented by the tdisagg function, which supports three variants of the method of Chow and Lin (1971); the method of Fernández (1981); and two variants of the method of Denton (1971) as modified by Cholette (1984). Given the analytical similarities between them, the three Chow-Lin variants and the Fernández method will be grouped in the discussion below as "Chow-Lin methods".

The balance of this section provides a gentle introduction to the idea of temporal disaggregation; experts may wish to skip to the next section.

Basically, temporal disaggregation is the business of taking time-series data observed at some given frequency (say, annually) and producing a counterpart series at a higher frequency (say, quarterly). The term "disaggregation" indicates the inverse operation of aggregation, and to understand temporal disaggregation it's helpful first to understand temporal aggregation. In aggregating a high frequency series to a lower frequency there are three basic methods, the appropriate method depending on the nature of the data. Here are some illustrative examples; a hansl sketch of the three methods follows the list.

- GDP: say we have quarterly GDP data and wish to produce an annual series. This is a flow variable and the annual flow will be the sum of the quarterly values (unless the quarterly values are annualized, in which case we would aggregate by taking their mean).

- Industrial Production: this takes the form of an index, reporting the level of production over some period relative to that in a base period in which the index is by construction 100. To aggregate from (for example) monthly to quarterly frequency we should take the average of the monthly values. (The sum would give a nonsense result.) The same goes for price indices, and also for ratios of stocks to flows or vice versa (inventory to sales, debt to GDP, capacity utilization).

- Money stock: this is typically reported as an end-of-period value, so in aggregating from monthly to quarterly we'd take the value from the final month of each quarter. In case a stock variable is reported as a start-of-period value, the aggregated version would be that of the first month of the quarter.

A central idea in temporal disaggregation is that the high frequency series must respect both the given low frequency data and the aggregation method. So, for example, whatever numbers we come up with for quarterly GDP, given an annual series as starting point, our numbers must sum to the annual total. If money stock is measured at the end of the period, then whatever numbers we come up with for monthly money stock, given quarterly data, the figure for the last month of the quarter must match that for the quarter as a whole. This is why temporal disaggregation is sometimes called "benchmarking": the given low frequency data constitute a benchmark which the constructed high frequency data must match, in a well defined sense that depends on the nature of the data.

Colloquially, we might describe temporal disaggregation as "interpolation", but strictly speaking interpolation applies only to stock variables. We have a known end-of-quarter value (say), which is also the value at the end of the last month of the quarter, and we're trying to figure out what the value might have been at the end of months 1 and 2. We're filling in the blanks, or interpolating. In the GDP case, however, the procedure is distribution rather than interpolation. We have a given annual total and we're trying to figure out how it should be distributed over the quarters. We're also doing distribution for variables taking the form of indices or ratios, except in this case we're seeking plausible values whose mean equals the given low-frequency value.

While matching the low frequency benchmark is an important constraint, it obviously does not tie down the high frequency values. That is a job for either regression-based methods such as Chow-Lin or non-regression methods such as Denton. Details are provided in section 9.7.

[1] We are grateful to Tommaso Di Fonzo, Professor of Statistical Science at the University of Padua, for detailed and precise comments on earlier drafts. Any remaining errors are, of course, our responsibility.
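The three aggregation methods just described can be tried out with gretl's dataset compact command, which converts a whole dataset to a lower frequency using a stated method. A minimal sketch, assuming a monthly dataset is open (each invocation applies one method to the whole dataset, so in practice you would run them separately):

# a flow variable: quarterly values are sums of the months
dataset compact 4 sum
# an index: take the average instead (the default method)
# dataset compact 4 avg
# an end-of-period stock: take the last month of each quarter
# dataset compact 4 last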
9.2 Notation and design

Some notation first: the two main ingredients in temporal disaggregation are (1) a T x g matrix Y, holding the series to be disaggregated, and (2) a matrix X with k columns and (sT + m) rows, to aid in the disaggregation. The idea is that Y contains time series data sampled at some frequency f, while each column of X contains time series data at a higher frequency, sf. So for each observation Y_t we have s corresponding rows in X. The object is to produce a transformation of Y to frequency sf, with the help of X (whose columns are typically called "related series" or "indicators" in the temporal disaggregation literature), via either distribution or interpolation depending on the nature of the data.

For most of this document we will assume that g = 1, or in other words that we are performing temporal disaggregation on a single low-frequency series, but tdisagg supports "batch processing" of several series, and we return to this point in section 9.9. If the m in (sT + m) is greater than zero, that implies that there are some "extra" high-frequency observations available for extrapolation; see section 9.4 for details.

We need to say something more about what goes into X. Under the Denton methods this must be a single series, generally known as the "preliminary series".[2] For the Chow-Lin methods, X can hold a combination of deterministic terms (e.g. constant, trend) and stochastic series. Naturally, suitable candidates for the role of preliminary series or indicator will be variables that are correlated with Y (and in particular might be expected to share short-run dynamics with Y). However, it is possible to carry out disaggregation using deterministic terms only, in the simplest case with X containing nothing but a constant. Experts in the field tend to frown on this, with reason: in the absence of any genuine high-frequency information, disaggregation just amounts to a mechanical smoothing. But some people may have a use for such smoothing, and it's permitted by tdisagg.

We should draw attention to a design decision in tdisagg: we have separated the specification of indicators in X from certain standard deterministic terms that might be wanted, namely a constant, linear trend or quadratic trend. If you want a disaggregation without stochastic indicators you can omit, or set to null, the argument corresponding to X. In that case a constant (only) will be employed automatically, but for the Chow-Lin methods one can adjust the deterministic terms used via an option named det, described below. In other words, the content of X becomes implicit. See section 9.6 for more detail.

Here's an important point to note when X is given explicitly: although this matrix may contain extra observations at the end, we assume that Y and X are correctly aligned at the start. Take for example the annual to quarterly case: if the first observation in annual Y is for 1980, then the first observation in quarterly X must be for the first quarter of 1980. Ensuring this is the user's responsibility. We will have some more to say about this in the following section.

[2] There's nothing to stop a user from constructing such a series using several primary series as input, by taking the first principal component or some other means, but that possibility is beyond our scope here.
9.3 Overview of data handling

The tdisagg function has three basic arguments, representing Y, X and s respectively, plus several options (see below). The first two arguments can be given either in matrix form as such, or as "dataset objects", that is, a series for Y and a series or list of series for X. Or, as mentioned above, X can be omitted (left implicit). This gives rise to five cases; which is most convenient will depend on the user's workflow.

1. Both Y and X are matrices. In this case the size and periodicity of the currently open dataset (if any) are irrelevant. If Y has T rows, X must of course have at least sT rows; if that condition is not satisfied an "Invalid argument" error will be flagged.

2. Y is a series or list and X a matrix. In this case we assume that the periodicity of the currently open dataset is the lower one, and T will be taken as equal to $nobs, the number of observations in the current sample range. Again, X must have at least sT rows.

3. Y is a matrix and X a series or list. We then assume that the periodicity of the currently open dataset is the higher one, so that $nobs defines (sT + m). And Y is supposed to be at the lower frequency, so its number of rows gives T. We should then be able to find m as $nobs minus sT; if m < 0 an error is flagged.

4. Both Y and X are dataset objects. We have two sub-cases here, the first of which is illustrated by the sketch following this list.

   (a) If X is a series, or an ordinary list of series, the periodicity of the currently open dataset is taken to be the higher one. The series or list containing Y should hold the appropriate entries every s elements. For example, if s = 4, Y_1 will be taken from the first observation, Y_2 from the fifth, Y_3 from the ninth, and so on. In practical terms, series of this sort are likely to be composed by repeating each element of a low-frequency variable s times.

   (b) Alternatively, X could be a MIDAS list. The concept of a MIDAS list is fully explained in chapter 20 but, for example, in a quarterly dataset a MIDAS list would be a list of three series, for the third, second and first month (note the ordering). In this case the current periodicity is taken to be the lower one, and X will contain one column, corresponding to the high-frequency representation of the MIDAS list.

5. X is omitted. If Y is given as a matrix it is taken to have T rows. Otherwise the interpretation is determined heuristically: if the Y series is recognized by gretl as composed of repeated low-frequency observations, or if a series result is requested, it is taken as having length sT; otherwise its length is taken to be T.

In the previous section we flagged the importance of correct alignment of X and Y at the start of the data; we're now in a position to say a little more about this. If either X or Y are given in matrix form, alignment is truly the user's responsibility. But if they are dataset objects gretl can be more helpful. We automatically advance the start of the sample range to exclude any leading missing values, and retard the end of the sample ranges for X and Y to exclude trailing missing values (allowing for the possibility that X may extend beyond Y). In addition, we further advance the sample start if this is required to ensure that the X data begin in the first high-frequency sub-period (e.g. the first quarter of a year or the first month of a quarter). But please note: when gretl automatically excludes leading or trailing missing values, intra-sample missing values will still provoke an error.

9.4 Extrapolation

As mentioned above, if X holds covariate data which extend beyond the range of the original series to be disaggregated, then extrapolation is supported. But this is inherently risky, and becomes riskier the longer the horizon over which it is attempted. In tdisagg, extrapolation is by default limited to one low-frequency period (s high-frequency periods) beyond the end of the original data. The user can adjust this behavior via the extmax member of the opts bundle passed to tdisagg, described in the next section.
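Case 4(a) calls for a low-frequency series "spread" over the high-frequency dataset by repetition. One way such a series commonly arises is by importing a lower-frequency variable from a database (as in Listing 9.1 below); purely by way of illustration, it can also be built by hand, as in this minimal sketch with invented values:

nulldata 12
setobs 4 1980:1
matrix Y0 = {100, 104, 110}'       # three annual values
series ylow = vec(ones(4,1) * Y0') # each value repeated 4 times
# ylow now has the form expected for case 4(a) with s = 4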
9.5 Function signature

The signature of tdisagg is:

matrix tdisagg(Y0, [X], int s, [bundle opts], [bundle results])

where square brackets indicate optional arguments. Note that while the return value is a matrix, if Y0 contains a single column or series it can be assigned to a series, as in series ys = tdisagg(Y0, ...), provided it's of the right length to match the current dataset, or the current sample range. Details on the arguments follow.

- Y0: Y, as a matrix, series or list.

- X (optional): X, as a matrix, series or list. This should not contain standard deterministic terms, since they are handled separately (see det under opts below). If this matrix is omitted, then disaggregation will be performed using deterministic terms only.

- s (int): the temporal expansion factor, for example 3 for quarterly to monthly, 4 for annual to quarterly or 12 for annual to monthly. We do not support cases such as monthly to weekly or monthly to daily, where s is not a fixed integer value common to all observations; otherwise, "anything goes".

- opts (bundle, optional): a bundle holding additional options. The recognized keys are, in alphabetical order:

  - aggtype (string): specifies the type of temporal aggregation appropriate to the series in question. The value must be one of sum (each low-frequency value is a sum of s high-frequency values, the default); avg (each low-frequency value is the average of s high-frequency values); or last or first, indicating respectively that each low-frequency value is the last or the first of s high-frequency values.

  - det (int): relevant only when one of the Chow-Lin methods is selected. This is a numeric code for the deterministic terms to be included in the regressions: 0 means none; 1, constant only; 2, constant and linear trend; 3, constant and quadratic trend. The default is 1.

  - extmax (int): the maximum number of high-frequency periods over which extrapolation should be carried out, conditional on the availability of covariate data. A zero value means no extrapolation; a value of -1 means "as many periods as possible"; and a positive value limits extrapolation to the specified number of periods. See section 9.4 for a statement of the default value.

  - method (string): selects the method of disaggregation; see the listing below. Note that the Chow-Lin methods employ an autoregression coefficient, rho, which captures the persistence of the target series at the higher frequency and is used in GLS estimation of the parameters linking X to Y.

    - chowlin (the default) is modeled on the original method proposed by Chow and Lin. It uses a value of rho computed as the transformation of a maximum-likelihood estimate of the low-frequency autocorrelation coefficient.
    - chowlin-mle is equivalent to the method called chow-lin-maxlog in the tempdisagg package for R: rho is estimated by iterated GLS using the log-likelihood as criterion, as recommended by Bournay and Laroque (1979). The BFGS algorithm is used internally.
    - chowlin-ssr is equivalent to the method called chow-lin-minrss-quilis in tempdisagg: rho is estimated by iterated GLS using the sum of squared GLS residuals as criterion. L-BFGS is used internally.
    - fernandez is basically "Chow-Lin with rho = 1". It is suitable if the target series has a unit root and is not cointegrated with the indicator series.
    - denton-pfd is the proportional first differences variant of Denton, as modified by Cholette. See Di Fonzo and Marini (2012) for details.
    - denton-afd is the additive first differences variant of Denton, again as modified by Cholette. In contrast to the Chow-Lin methods, neither Denton procedure involves regression.
  - plot (int): if a non-zero value is given, a simple plot is displayed by way of a "sanity check" on the final series. See section 9.8 for details.

  - rho (scalar): relevant only when one of the Chow-Lin methods is selected. If the method is chowlin, then rho is treated as a fixed value for the autoregression coefficient, thus enabling the user to bypass the default estimation procedure altogether. If the method is chowlin-mle or chowlin-ssr, on the other hand, the supplied rho value is used to initialize the numerical optimization algorithm.

  - verbose (int): controls the verbosity of Chow-Lin or Fernández output. If 0 (the default), nothing is printed unless an error occurs; if 1, summary output from the relevant regression is shown; if 2, in addition, output from the optimizer for the iterated GLS procedure is shown, if applicable.

- results (bundle, optional): if present, this argument must be a previously defined bundle. Upon successful completion of any of the methods other than Denton, it contains details of the disaggregation under the following keys: method (the method employed); rho (the value of rho used); lnl (log-likelihood, maximized by the chowlin-mle method); SSR (sum of squared residuals, minimized by the chowlin-ssr method); coeff (the GLS or OLS coefficients); and stderr (standard errors for the coefficients).

If rho is set to zero, either by specification of the user or because the estimate turned out to be non-positive, then estimation of the coefficients is via OLS. In that case the lnl and SSR values are calculated using the OLS residuals, which will be on a different scale from the weighted residuals in GLS.
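By way of illustration, here is a minimal call exercising both the opts and results bundles; the series names ylow and xhi are placeholders for whatever low-frequency target and high-frequency indicator you have in hand:

bundle opts = defbundle("aggtype", "avg", "method", "chowlin-mle", "verbose", 1)
bundle res = null
series yhi = tdisagg(ylow, xhi, 4, opts, res)
printf "method: %s, rho = %g\n", res.method, res.rho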
9.6 Handling of deterministic terms

It may be helpful to set out clearly, in one place, how deterministic terms are handled by tdisagg.

- If X is given explicitly: no deterministic term is added when a Denton method is used (since a single preliminary series is wanted), but a constant is added when one of the Chow-Lin methods is selected. The latter default can be overridden (i.e. the constant removed, or a trend added) by means of the det entry in the options bundle.

- If X is omitted: by default a constant is used for all methods. Again, for Chow-Lin this can be overridden by specifying a det value. If for some reason you wanted Denton with just a trend, you would have to supply X containing a trend.

9.7 Some technical details

In this section we provide some technical details on the methods used by tdisagg. We will refer to the version of Y converted to the high frequency sf as the "final series".

As regards the Cholette-modified Denton methods: for the proportional first difference variant we calculate the final series using the solution described by Di Fonzo and Marini (2012), specifically equation (4) on page 5, and for the additive variant we draw on Di Fonzo (2003), pages 3 and 5 in particular. Note that these procedures require the construction and inversion of a matrix of order (s+1)T. If both s and T are large it can therefore take some time, and be quite demanding of RAM.

As regards Chow-Lin, let rho_0 indicate the rho value passed via the options bundle, if applicable. We then take these steps:

1. If rho_0 > 0, set rho = rho_0 and go to step 6 if the method is chowlin, or step 7 otherwise. But if rho_0 < 0, set rho_0 = 0.

2. Estimate via OLS a regression of Y on CX,[3] where C is the appropriate aggregation matrix. Let b_OLS equal the coefficients from this regression. If rho_0 = 0 and the method is chowlin, go to step 8.

3. Calculate the low frequency first order autocorrelation of the OLS residuals, r_L. If r_L >= 1.0e-6, go to step 4. Otherwise, if the method is chowlin, set rho = 0 and go to step 8; else set rho = 0.5 and go to step 7.

4. Refine the positive estimate of r_L via Maximum Likelihood estimation of the AR(1) specification, as described in Davidson and MacKinnon (2004).

5. If r_L < 0.999, set rho to the high-frequency counterpart of r_L using the approach given in Chow and Lin (1971). Otherwise set rho = 0.999. If the method is chowlin, go to step 6; otherwise go to step 7.

6. Perform GLS with the given value of rho, store the coefficients as b_GLS and go to step 9.

7. Perform iterated GLS starting from the prior value of rho, adjusting rho with the goal of either maximizing the log-likelihood (method chowlin-mle) or minimizing the sum of squared GLS residuals (chowlin-ssr); set b_GLS to the final coefficient estimates, and go to step 9.

8. Calculate the final series as X b_OLS + C'(CC')^(-1) u_OLS, where u_OLS indicates the OLS residuals, and stop.

9. Calculate the final series as X b_GLS + VC'(CVC')^(-1) u_GLS, where u_GLS indicates the GLS residuals and V is the estimated high-frequency covariance matrix.

A few notes on our Chow-Lin algorithm follow.

- One might question the value of performing steps 2 to 5 when the method is one that calls for GLS iteration (chowlin-mle or chowlin-ssr), but our testing indicates that it can be helpful to have a reasonably good estimate of rho in hand before embarking on these iterations.

- Conversely, one might wonder why we bother with GLS iterations if we find r_L < 1.0e-6. But this allows for the possibility (most likely associated with small sample size) that iteration will lead to rho > 0, even when the estimate based on the initial OLS residuals is zero or negative.

- Note that in all cases we are discarding an estimate of rho < 0 (truncating to 0), which we take to be standard in this field. In our iterated GLS we achieve this by having the optimizer pick values x in (-infinity, +infinity), which are translated to (0, 1) via the logistic CDF, rho = 1/(1 + exp(-x)). To be precise, that's the case with chowlin-mle. But we find that the chowlin-ssr method is liable to overestimate rho and proceed to values arbitrarily close to 1, resulting in numerical problems. We therefore bound this method to x in (-20, 6.9), corresponding to rho values between near-zero and approximately 0.999.[4]

[3] Strictly speaking, CX uses only the first sT rows of X if m > 0.
[4] It may be worth noting that the tempdisagg package for R limits both methods to a maximum rho of 0.999. We find, however, that the ML method can look after itself, and does not require a fixed upper bound short of 1.0.

As for the Fernández method, this is quite straightforward. The place of the high-frequency covariance matrix V in Chow-Lin is taken by (D'D)^(-1), where D is the approximate first-differencing matrix, with 1 on the diagonal and -1 on the first sub-diagonal. For efficient computation, however, we store neither D nor D'D as such, and do not perform any explicit inversion. The special structure of (D'D)^(-1) makes it possible to produce the effect of pre-multiplication by this matrix with O(T^2) floating-point operations. Estimation of rho is not an issue, since it equals 1 by assumption.

9.8 The plot option

The semantics of this option may be enriched in future, but for now it's a simple boolean switch. The effect is to produce a time series plot of the final series, along with the original low-frequency series shown in "step" form. (If aggregation is by sum, the final series is multiplied by s for comparability with the original.) If the disaggregation has been successful, these two series should track closely together, with the final series showing plausible short-run dynamics. An example is shown in Figure 9.1.

[Figure 9.1: Example output from the plot option, showing annual GNP ("original data", red, in step form) and the quarterly "final series" (blue, method chowlin), 1955-1965, using quarterly industrial production as indicator.]

If there are many observations, the two lines may appear virtually coincident. In that case one can see what's going on in more detail by exploiting the "Zoom" functionality of the plot, which is accessed via the right-click menu in the plot window.

9.9 Multiple low-frequency series

We now return to a point mentioned in section 9.2, namely that Y may be given as a T x g matrix with g > 1, or a list of g series. This means that a single call to tdisagg can be used to process several input series ("batch processing"), in which case the return value is a matrix with (sT + m) rows and g columns.

There are some restrictions. First, and most obviously, a single call to tdisagg implies a single selection of "indicators" or related series (X) and a single selection of options (aggregation type of the data, deterministic terms, disaggregation method and so on). So this possibility will be relevant only if you have several series that "want" the same treatment. In addition, if g > 1 the plot and verbose options are ignored, and the results bundle is not filled; if you need those features you should supply a single series or vector in Y.

The advantage of batch processing lies in the spreading of fixed computational cost, leading to shorter execution time. However, the relative importance of the fixed cost differs substantially according to the disaggregation method. For the Chow-Lin methods the fixed cost is relatively small, and so little speed-up can be expected; but for the Denton methods it dominates, and in our testing you can process g > 1 series in little more time than it takes to process a single series. As they say, "Your mileage may vary", but if you have a large number of series to be disaggregated via one of the Denton methods, you may well find it much faster to use the batch facility of tdisagg.
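A minimal sketch of the batch facility, assuming a low-frequency list of targets and a single high-frequency indicator series (all names invented) are already defined in an appropriate dataset:

# disaggregate several annual series in one call
list LOWS = y1 y2 y3
matrix HI = tdisagg(LOWS, xquarterly, 4, defbundle("method", "denton-pfd"))
# HI has one column per input series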
9.10 Examples

Listing 9.1 shows an example of usage, with its output. The data are drawn from the St. Louis Fed; we disaggregate quarterly GDP to monthly with the help of industrial production and payroll employment, using the default Chow-Lin method. Several other example scripts are available from http://gretl.sourceforge.net/tdisagg/.

Listing 9.1: Example of tdisagg usage

Input:

# Traditional Chow-Lin: y is a series with repetition and X
# is a list of series. This corresponds to case 4(a) as described
# in section 9.3 of the documentation above.

# ensure that no data are in place
clear
# open gretl's St Louis Fed database
open fedstl.bin
# import two monthly series
data indpro payems
# import quarterly GDP (values are repeated)
data gdpc1
# restrict sample to complete data
smpl --no-missing
# disaggregate GDP from quarterly to monthly, using industrial
# production and payroll employment as indicators
scalar s = 3
list X = indpro payems
series gdpm = tdisagg(gdpc1, X, s, _(verbose=1, aggtype="sum"))

Output:

Aggregation type sum
GLS estimates (chow-lin) T = 294
Dependent variable: gdpc1

          coefficient    std. error    t-ratio    p-value
  const    3123.94       2633.72        1.186     0.2365
  indpro    109.158        17.5785      6.210     1.83e-09
  payems      0.0242860     0.00171935 14.13      7.39e-35

  rho = 0.999, SSR = 515439, lnl = -1604.98

Generated series gdpm (ID 4)
Chapter 10
Special functions in genr

10.1 Introduction

The genr command provides a flexible means of defining new variables. At the same time, the somewhat paradoxical situation is that the genr keyword is almost never visible in gretl scripts. For example, it is not really recommended to write a line such as genr b = 2.5, because there are the following alternatives:

- scalar b = 2.5, which also invokes the genr apparatus in gretl but provides explicit type information about the variable b, which is usually preferable. (Gretl's language hansl is statically typed, so b cannot switch from scalar to string or matrix, for example.)

- b = 2.5, leaving it to gretl to infer the admissible or most "natural" type for the new object, which would again be a scalar in this case.

- matrix b = {2.5}. This formulation is required if b is going to be expanded with additional rows or columns later on. Otherwise, gretl's static typing would not allow b to be promoted from scalar to matrix, so it must be a matrix right from the start, even if it is of dimension 1 x 1 initially. This definition could also be written as matrix b = 2.5, but the more explicit form is recommended.

In addition to scalar or matrix, other type keywords that can be used to substitute the generic genr term are those enumerated in the following chapter 11. In the case of an array, the concrete specification should be used, so one of matrices, strings, lists, bundles.[1]

Therefore, there's only a handful of special cases where it is really necessary to use the genr keyword:

- genr time: creates a time trend variable (1, 2, 3, ...) under the name time. Note that within an appropriately defined panel dataset this variable honors the panel structure and is a true time index. In a cross-sectional dataset the command will still work and produces the same result as genr index (below), but of course no temporal meaning exists.

- genr index: creates an observation variable named index, running from 1 to the sample size.

- genr unitdum: in the context of panel data, creates a set of dummies for the panel groups or "units". These are named du_1, du_2, and so forth. Actually, this particular genr usage is not strictly necessary, because a list of group dummies can also be obtained as:

series gr = $unit
list groupdums = dummify(gr, NA)

The NA argument to the dummify function has the effect of not skipping any unit as the reference group, thus producing the full set of dummies.

- genr timedum: again for panel data, creates a set of dummies for the time periods, named dt_1, dt_2, and so forth. And again, a list-producing variant without genr exists, using the special accessor $obsminor, which indexes time in the panel context and can be used as a substitute for time from above:

series tindex = $obsminor
list timedums = dummify(tindex, NA)

- genr markers: see section 4.5 for an explanation and example of this panel-related feature.

Finally, there also exists genr dummy, which produces a set of seasonal dummies. However, it is recommended to use the seasonals() function instead, which can also return centered dummies. The rest of this chapter discusses other special function aspects.

[1] A recently added advanced datatype is an array of arrays, with the associated type specifier arrays.
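For instance, in a quarterly dataset seasonals() can replace genr dummy. A small sketch, assuming the two-argument form of the function (a baseline season to skip, 0 for none, plus a centering flag):

# full set of centered quarterly dummies
list qdums = seasonals(0, 1)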
10.2 Cumulative densities and p-values

The two functions cdf and pvalue provide complementary means of examining values from 17 probability distributions (as of July 2021), among which the most important ones: standard normal, Student's t, chi-square, F, gamma and binomial. The syntax of these functions is set out in the Gretl Command Reference; here we expand on some subtleties.

The cumulative density function or CDF for a random variable is the integral of the variable's density from its lower limit (typically either minus infinity or 0) to any specified value x. The p-value (at least the one-tailed, right-hand p-value as returned by the pvalue function) is the complementary probability: the integral from x to the upper limit of the distribution, typically plus infinity.

In principle, therefore, there is no need for two distinct functions: given a CDF value p0 you could easily find the corresponding p-value as 1 - p0 (or vice versa). In practice, with finite-precision computer arithmetic, the two functions are not redundant. This requires a little explanation. In gretl, as in most statistical programs, floating point numbers are represented as "doubles": double-precision values that typically have a storage size of eight bytes or 64 bits. Since there are only so many bits available, only so many floating-point numbers can be represented: doubles do not model the real line. Typically doubles can represent numbers over the range (roughly) +/- 1.7977e308, but only to about 15 digits of precision.

Suppose you're interested in the left tail of the chi-square distribution with 50 degrees of freedom: you'd like to know the CDF value for x = 0.9. Take a look at the following interactive session:

? scalar p1 = cdf(X, 50, 0.9)
Generated scalar p1 = 8.94977e-35
? scalar p2 = pvalue(X, 50, 0.9)
Generated scalar p2 = 1
? scalar test = 1 - p2
Generated scalar test = 0

The cdf function has produced an accurate value, but the pvalue function gives an answer of 1, from which it is not possible to retrieve the answer to the CDF question. This may seem surprising at first, but consider: if the value of p1 above is correct, then the correct value for p2 is 1 - 8.94977e-35. But there's no way that value can be represented as a double: that would require over 30 digits of precision.

Of course this is an extreme example. If the x in question is not too far off into one or other tail of the distribution, the cdf and pvalue functions will in fact produce complementary answers, as shown below:

? scalar p1 = cdf(X, 50, 30)
Generated scalar p1 = 0.0111648
? scalar p2 = pvalue(X, 50, 30)
Generated scalar p2 = 0.988835
? scalar test = 1 - p2
Generated scalar test = 0.0111648

But the moral is that if you want to examine extreme values you should be careful in selecting the function you need, in the knowledge that values very close to zero can be represented as doubles while values very close to 1 cannot.
10.3 Retrieving internal variables: dollar accessors

A very useful feature is to retrieve in a script various values calculated by gretl in the course of estimating models or testing hypotheses. Since they all start with a literal $ character, they are called "dollar accessors". The variables that can be retrieved in this way are listed in the Gretl Command Reference, or in the built-in function help under the Help menu. The dollar accessors can be used like other gretl objects in script assignments or statements. Some of those accessors are actually independent of any estimation or test and describe, for example, the context of the running gretl program.

But here we say a bit more about the special variables $test and $pvalue. These variables hold, respectively, the value of the last test statistic calculated using an explicit testing command and the p-value for that test statistic. If no such test has been performed at the time when these variables are referenced, they will produce the missing value code.

Some explicit testing commands that work in this way are as follows (among others): add (joint test for the significance of variables added to a model); adf (Augmented Dickey-Fuller test, see below); arch (test for ARCH); chow (Chow test for a structural break); coeffsum (test for the sum of specified coefficients); coint (Engle-Granger cointegration test); cusum (the Harvey-Collier t-statistic); difftest (test for a difference of two groups); kpss (KPSS stationarity test, no p-value available); modtest (see below); meantest (test for difference of means); omit (joint test for the significance of variables omitted from a model); reset (Ramsey's RESET); restrict (general linear restriction); runs (runs test for randomness); and vartest (test for difference of variances). In most cases both a $test and a $pvalue are stored; the exception is the KPSS test, for which a p-value is not currently available.

The modtest command (which must follow an estimation command) offers several diagnostic tests; the particular test performed depends on the option flag provided. Please see the Gretl Command Reference and, for example, chapters 32 and 31 of this Guide for details.

An important point to notice about this mechanism is that the internal variables $test and $pvalue are overwritten each time one of the tests listed above is performed. If you want to reference these values, you must do so at the correct point in the sequence of gretl commands.
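A small sketch of the mechanism, using the omit command after an OLS regression; the series names here are placeholders for whatever model you have in hand:

ols y 0 x1 x2 x3
omit x3
# grab the results before any further test overwrites them
scalar W = $test
scalar p = $pvalue
printf "omit test: %g [p-value %.4f]\n", W, p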
Chapter 11
Gretl data types

11.1 Introduction

Gretl offers the following data types:

scalar — holds a single numerical value
series — holds n numerical values, where n is the number of observations in the current dataset
matrix — holds a rectangular array of numerical values, of any (two) dimensions
list   — holds the ID numbers of a set of series
string — holds an array of characters
bundle — holds zero or more objects of various types
array  — holds zero or more objects of a given type

The "numerical values" mentioned above are all double-precision floating point numbers. In this chapter we give a rundown of the basic characteristics of each of these types, and also explain their "life cycle" (creation, modification and destruction). The list and matrix types, whose uses are relatively complex, are discussed at greater length in chapters 15 and 17 respectively.

11.2 Series

We begin with the series type, which is the oldest and in a sense the most basic type in gretl. When you open a data file in the gretl GUI, what you see in the main window are the ID numbers, names (and descriptions, if available) of the series read from the file. All the series existing at any point in a gretl session are of the same length, although some may have missing values. The variables that can be added via the items under the Add menu in the main window (logs, squares and so on) are also series.

For a gretl session to contain any series, a common series length must be established. This is usually achieved by opening a data file, or importing a series from a database, in which case the length is set by the first import. But one can also use the nulldata command, which takes as its single argument the desired length, a positive integer.

Each series has these basic attributes: an ID number, a name, and of course n numerical values. A series may also have a description (which is shown in the main window and is also accessible via the labels command), a "display name" for use in graphs, a record of the compaction method used in reducing the variable's frequency (for time-series data only), and flags marking the variable as discrete and/or as a numeric encoding of a qualitative characteristic. These attributes can be edited in the GUI by choosing Edit Attributes (either under the Variable menu or via right-click), or by means of the setinfo command.

In the context of most commands, you are able to reference series by name or by ID number as you wish. The main exception is the definition or modification of variables via a formula; here you must use names, since ID numbers would get confused with numerical constants.

Note that series ID numbers are always consecutive, and the ID number for a given series will change if you delete a lower-numbered series. In some contexts, where gretl is liable to get confused by such changes, deletion of low-numbered series is disallowed.

Discrete series

It is possible to mark variables of the series type as discrete. The meaning and uses of this facility are explained in chapter 12.

String-valued series

It is generally expected that series in gretl will be properly numeric (on a ratio or at least an ordinal scale), or the sort of numerical indicator variables (0/1 "dummies") that are traditional in econometrics. However, "string-valued" series are also supported; see chapter 16 for details.

11.3 Scalars

The scalar type is relatively simple: just a convenient named holder for a single numerical value. Scalars have none of the additional attributes pertaining to series, do not have ID numbers, and must be referenced by name. A common use of scalar variables is to record information made available by gretl commands for further processing, as in scalar s2 = $sigma^2 to record the square of the standard error of the regression following an estimation command such as ols.

You can define and work with scalars in gretl without having any dataset in place. In the gretl GUI, scalar variables can be inspected and their values edited via the "Icon view" (see the View menu in the main window).

11.4 Matrices

Matrices in gretl work much as in other mathematical software (e.g. MATLAB, Octave). Like scalars they have no ID numbers and must be referenced by name, and they can be used without any dataset in place. Matrix indexing is 1-based: the top-left element of matrix A is A[1,1]. Matrices are discussed at length in chapter 17; advanced users of gretl will want to study this chapter in detail.

Matrices have two optional attributes beyond their numerical content: they may have column and/or row names attached; these are displayed when the matrix is printed. See the cnameset and rnameset functions for details.

In the gretl GUI, matrices can be inspected, analysed and edited via the Icon view item under the View menu in the main window: each currently defined matrix is represented by an icon.

11.5 Lists

As with matrices, lists merit an explication of their own (see chapter 15). Briefly, named lists can (and should!) be used to make command scripts less verbose and repetitious, and more easily modifiable. Since lists are in fact lists of series ID numbers, they can be used only when a dataset is in place.

In the gretl GUI, named lists can be inspected and edited under the Data menu in the main window, via the item "Define or edit list".

11.6 Strings

String variables may be used for labeling, or for constructing commands. They are discussed in chapter 15. They must be referenced by name; they can be defined in the absence of a dataset. Such variables can be created and modified via the command-line in the gretl console or via script; there is no means of editing them via the gretl GUI.
To add an object to a bundle you assign to a compound left-hand value: the name of the bundle followed by the key. Two forms of syntax are acceptable in this context. The recommended syntax (for most uses) is bundlename.key; that is, the name of the bundle followed by a dot, then the key. Both the bundle name and the key must be valid gretl identifiers (as a reminder: 31 characters maximum, starting with a letter and composed of just letters, numbers or underscore). For example, the statement

  foo.matrix1 = m

adds an object called m (presumably a matrix) to bundle foo under the key matrix1. If you wish to make it explicit that m is supposed to be a matrix you can use the form

  matrix foo.matrix1 = m

Alternatively, a bundle key may be given as a string enclosed in square brackets, as in

  foo["matrix1"] = m

This syntax offers greater flexibility, in that the key string does not have to be a valid identifier (for example, it can include spaces). In addition, when using the square bracket syntax it is possible to use a string variable to define or access the key in question. For example:

  string s = "matrix 1"
  foo[s] = m # matrix is added under key "matrix 1"

To get an item out of a bundle, again use the name of the bundle followed by the key, as in

  matrix bm = foo.matrix1
  # or using the alternative notation
  matrix bm = foo["matrix1"]
  # or using a string variable
  matrix bm = foo[s]

Note that the key identifying an object within a given bundle is necessarily unique. If you reuse an existing key in a new assignment, the effect is to replace the object which was previously stored under the given key. It is not required that the type of the replacement object is the same as that of the original.

Also note that when you add an object to a bundle, what in fact happens is that the bundle acquires a copy of the object. The external object retains its own identity and is unaffected if the bundled object is replaced by another. Consider the following script fragment:

  bundle foo
  matrix m = I(3)
  foo.mykey = m
  scalar x = 20
  foo.mykey = x

After the above commands are completed, bundle foo does not contain a matrix under mykey, but the original matrix m is still "in good health".

To delete an object from a bundle, use the delete command with the bundle/key combination, as in

  delete foo.mykey

This destroys the object associated with mykey and removes the key from the hash table.

To determine whether a bundle contains an object associated with a given key, use the inbundle() function. This takes two arguments: the name of the bundle and the key string. The value returned by this function is an integer which codes for the type of the object (0 for no match, 1 for scalar, 2 for series, 3 for matrix, 4 for string, 5 for bundle and 6 for array). The function typestr() may be used to get the string corresponding to this code. For example:

  scalar type = inbundle(foo, "x")
  if type == 0
      print "x: no such object"
  else
      printf "x is of type %s\n", typestr(type)
  endif

Besides adding, accessing, replacing and deleting individual items, the other operations that are supported for bundles are union, printing and deletion. As regards union: if bundles b1 and b2 are defined, you can say

  bundle b3 = b1 + b2

to create a new bundle that is the union of the two others. The algorithm is: create a new bundle that is a copy of b1, then add any items from b2 whose keys are not already present in the new bundle. This means that bundle union is not commutative if the bundles have one or more key strings in common.
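A minimal sketch of the non-commutativity point (the names are arbitrary): the shared key s takes its value from the left-hand operand.

  bundle b1 = _(x = 1, s = "from b1")
  bundle b2 = _(y = 2, s = "from b2")
  bundle u12 = b1 + b2
  bundle u21 = b2 + b1
  eval u12.s # shows "from b1"
  eval u21.s # shows "from b2"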
If b is a bundle and you say print b, you get a listing of the bundle's keys along with the types of the corresponding objects, as in

  ? print b
  bundle b:
    x (scalar)
    mat (matrix)
    inside (bundle)

Note that in the example above the bundle b nests a bundle named inside. If you want to see what's inside nested bundles with a single command, you can append the --tree option to the print command.

Series and lists as bundle members

It is possible to add both series and lists to a bundle, as in

  open data4-10
  list X = const CATHOL INCOME
  bundle b
  series b.y = ENROLL
  list b.X = X
  eval b.y
  eval b.X

However, it is important to bear in mind the following limitations:

- A series as such is inherently a member of a dataset, and a bundle can survive the replacement or destruction of the dataset from which a series was added. It may then be impossible (or, even if possible, meaningless) to extract a bundled series as a series. However, it's always possible to retrieve the values of the series in the form of a matrix (column vector).

- In gretl commands that call for series arguments, you cannot give a bundled series without first extracting it. In the little example above, the series ENROLL was added to bundle b under the key y, but b.y is not itself a series (member of a dataset); it's just an anonymous array of values. It therefore cannot be given as, say, the dependent variable in a call to gretl's ols command.

- A gretl list is just an array of ID numbers of series in a given dataset (a "macro", if you like). So, as with series, there's no guarantee that a bundled list can be extracted as a list, though it can always be extracted as a row vector.

The points made above are illustrated in Listing 11.1. In Case 1 we open a little dataset with just 14 cross-sectional observations and put a series into a bundle. We then open a time-series dataset with 64 observations, while preserving the bundle, and extract the bundled series. This instance is legal, since the stored series does not overflow the length of the new dataset (it gets written into the first 14 observations), but it's probably not meaningful. It's up to the user to decide if such operations make sense.

In Case 2 a similar sequence of statements leads to an error (trapped by catch), because this time the stored series will not fit into the new dataset. We can nonetheless grab the data as a vector.

In Case 3 we put a list of three series into a bundle. This does not put any actual data values into the bundle, just the ID numbers of the specified series, which happen to be 4, 5 and 6. We then switch to a dataset that contains just 4 series, so the list cannot be extracted as such (IDs 5 and 6 are out of bounds). Once again, however, we can retrieve the ID numbers in matrix form if we want.

In some cases putting a gretl list as such into a bundle may be appropriate, but in others you are better off adding the content of the list in matrix form, as in

  open data4-10
  list X = const CATHOL INCOME
  bundle b
  matrix b.X = {X}

In this case we're adding a matrix with three columns, and as many rows as there are in the dataset; we have the actual data, not just a reference to the data that might "go bad". See chapter 17 for more on this.
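To illustrate why the matrix form is robust, here is a minimal sketch along the lines of the listing that follows: the data values survive a change of dataset, since the bundle holds an actual matrix rather than series references.

  open data4-10
  list X = const CATHOL INCOME
  bundle b
  matrix b.X = {X}
  open data4-1 --preserve # switch datasets, keeping the bundle
  matrix m = b.X
  printf "recovered a %d x %d data matrix\n", rows(m), cols(m)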
Listing 11.1: Series and lists in bundles

  # Case 1: store and retrieve series: OK
  open data4-1
  bundle b
  series b.x = sqft
  open data9-7 --preserve
  series x = b.x
  print x --byobs

  # Case 2: store and retrieve series: gives an error, but
  # the data can be retrieved in matrix form
  open data9-7
  bundle b
  series b.x = QNC
  open data4-1 --preserve
  catch series x = b.x # wrong, won't fit!
  if $error
      matrix mx = b.x
      print mx
  else
      print x
  endif

  # Case 3: store and retrieve list: gives an error, but the ID
  # numbers in the list can be retrieved as a row vector
  open data9-7
  list L = PRIME UNEMP STOCK
  bundle b
  list b.L = L
  open data4-1 --preserve
  catch list L = b.L
  if $error
      matrix mL = b.L
      print mL # prints "4 5 6"
  endif

What are bundles good for?

Bundles are unlikely to be of interest in the context of standalone gretl scripts, but they can be very useful in the context of complex function packages, where a good deal of information has to be passed around between the component functions (see Cottrell and Lucchetti, 2016). Instead of using a lengthy list of individual arguments, function A can bundle up the required data and pass it to functions B and C, where relevant information can be extracted via a mnemonic key.

In this context bundles should be passed in pointer form (see chapter 14), as illustrated in the following trivial example, where a bundle is created at one level then filled out by a separate function:

  # modification of bundle pointer by user function
  function void fill_out_bundle (bundle *b)
      b.mat = I(3)
      b.str = "foo"
      b.x = 32
  end function

  bundle my_bundle
  fill_out_bundle(&my_bundle)

The bundle type can also be used to advantage as the return value from a packaged function, in cases where a package writer wants to give the user the option of accessing various results. In the gretl GUI, function packages that return a bundle are treated specially: the output window that displays the printed results acquires a menu showing the bundled items (their names and types), from which the user can save items of interest. For example, a function package that estimates a model might return a bundle containing a vector of parameter estimates, a residual series and a covariance matrix for the parameter estimates, among other possibilities.

As a refinement to support the use of bundles as a function return type, the setnote() function can be used to add a brief explanatory note to a bundled item; such notes will then be shown in the GUI menu. This function takes three arguments: the name of a bundle, a key string, and the note. For example,

  setnote(b, "vcv", "covariance matrix")

After this, the object under the key vcv in bundle b will be shown as "covariance matrix" in a GUI menu.

11.8 Arrays

The gretl array type is intended for scripting use. Arrays have no GUI representation and they're unlikely to acquire one. (It is, however, possible to save arrays "invisibly" in the context of a GUI session, by virtue of the fact that they can be packed into bundles, and bundles can be saved as part of a session.)

A gretl array is, as you might expect, a container which can hold zero or more objects of a certain type, indexed by consecutive integers starting at 1. It is one-dimensional. This type is implemented by a quite generic back-end. The types of object that can be put into arrays are strings, matrices, lists, bundles and arrays. (Nesting of arrays was not possible prior to version 2019d of gretl.)

Of gretl's "primary" types, then, neither scalars nor series are supported by the array mechanism. There would be little point in supporting arrays of scalars as such, since the matrix type already plays that role, and more flexibly. As for series, they have a special status as elements of a dataset (which is in a sense an array of series already), and in addition we have the list type, which already functions as a sort of array for subsets of the series in a dataset.
Creating an array

An array can be brought into existence in any of three ways: bare declaration, or using one of the functions array() or defarray(). In each case one of the specific type-words (strings, matrices, lists, bundles or arrays) must be used. Here are some examples:

  # declare an empty array of strings
  strings S
  # make an empty array of matrices
  matrices M = array(0)
  # make an array with space for four bundles
  bundles B = array(4)
  # make an array with three specified strings
  strings P = defarray("foo", "bar", "baz")

The bare declaration form and the function form with array(0) have the same effect of creating an empty array, but the second can be used in contexts where bare declaration is not allowed, and it can also be used to destroy the content of an existing array and reduce it to size zero. The array() function expects a non-negative integer argument and can be used to create an array of pre-given size; in this case the elements are initialized appropriately, as empty strings, empty matrices, empty lists, empty bundles or empty arrays. The defarray() function takes a variable number of arguments (one or more), each of which may be the name of a variable of the appropriate type or an expression which evaluates to an object of the appropriate type.

Setting and getting elements

There are two ways to set the value of an array element: you can set a particular element using the array index, or you can append an element using the += operator:

  # first case
  strings S = array(3)
  S[2] = "string"
  # the second alternative
  matrices M = array(0)
  M += mnormal(T, k)

In the first method the index must (of course) be within bounds; that is, greater than zero and no greater than the current length of the array. When the second method is used, it automatically extends the length of the array by 1.

To get hold of an element, the array index must be used:

  # for S an array of strings
  string s = S[5]
  # for M an array of matrices
  printf "%12.5g\n", M[1]

Removing elements

There's a counterpart to the += operator mentioned above: -= can be used to remove one or more elements (specified by content) from an array of strings. Note that -= works on all matching elements, so after the following statements

  strings S = defarray("a", "a", "b", "a")
  S -= "a"

S becomes a one-element array, holding only the original third element. More generally, a negative index can be used to remove a specified element from an array of any type, as in

  strings S = defarray("a", "a", "b", "a")
  S = S[-1]

where only the first element is removed. See chapter 17 for more on the semantics of negative indices.

Operations on whole arrays

Three operators are applicable to whole arrays, but only one to arrays of arbitrary type, the other two being restricted to arrays of strings. The generally available operation is appending. You can do, for example, for M1 and M2 both arrays of matrices:

  matrices BigM = M1 + M2

or, if you wish to augment M1:

  M1 += M2

In each case the result is an array of matrices whose length is the sum of the lengths of M1 and M2; and similarly for the other supported types.

The operators specific to strings are union, via ||, and intersection, via &&. Given the following code, for S1 and S2 both arrays of strings:

  strings Su = S1 || S2
  strings Si = S1 && S2

the array Su will contain all the strings in S1, plus any in S2 that are not in S1, while Si will contain all and only the strings that appear in both S1 and S2.
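The following minimal sketch (array names arbitrary) pulls the three whole-array operators together:

  strings S1 = defarray("a", "b", "c")
  strings S2 = defarray("b", "d")
  strings Sall = S1 + S2 # append: a, b, c, b, d
  strings Su = S1 || S2  # union: a, b, c, d
  strings Si = S1 && S2  # intersection: b
  printf "append %d, union %d, intersection %d\n", nelem(Sall), nelem(Su), nelem(Si)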
Arrays as function arguments

One can write hansl functions that take as arguments any of the array types, and it is possible to pass arrays as function arguments in "pointerized" form. In addition, hansl functions may return any of the array types. Here is a trivial example for strings:

  function void print_strings (strings S)
      loop i=1..nelem(S)
          printf "element %d: %s\n", i, S[i]
      endloop
  end function

  function strings mkstrs (int n)
      strings S = array(n)
      loop i=1..n
          S[i] = sprintf("member %d", i)
      endloop
      return S
  end function

  strings Foo = mkstrs(5)
  print Foo
  print_strings(Foo)

A couple of points are worth noting here. First, the nelem() function works to give the number of elements in any of the "container" types (lists, arrays, bundles, matrices). Second, if you do print Foo for Foo an array, you'll see something like:

  ? print Foo
  Array of strings, length 5

Nesting arrays

While gretl's array structure is in itself one-dimensional, you can add extra dimensions by nesting. For example, the code below creates an array holding n arrays of m bundles:

  arrays BB = array(n)
  loop i=1..n
      bundles BB[i] = array(m)
  endloop

The syntax for setting or accessing any of the n x m bundles, or their members, is then on the following pattern:

  BB[i][j].m = I(3)
  eval BB[i][j]
  eval BB[i][j].m # or: eval BB[i][j]["m"]

where the respective array subscripts are each put into square brackets. The elements of an array of arrays must obviously all be arrays, but it's not required that they have a common content-type. For example, the following code creates an array holding an array of matrices plus an array of strings:

  arrays AA = array(2)
  matrices AA[1] = array(3)
  strings AA[2] = array(3)

Arrays and bundles

As mentioned, the bundle type is supported by the array mechanism. In addition, arrays of whatever type can be put into bundles:

  matrices M = array(8)
  # set values of M[i] here...
  bundle b
  b.M = M

The mutual "packability" of bundles and arrays means that it's possible to go quite far down the rabbit-hole; users are advised not to get carried away.

11.9 The life cycle of gretl objects

Creation

The most basic way to create a new variable of any type is by declaration, where one states the type followed by the name of the variable to create, as in

  scalar x
  series y
  matrix A

and so forth. In that case the object in question is given a default initialization, as follows: a new scalar has value NA (missing); a new series is filled with NAs; a new matrix is empty (zero rows and columns); a new string is empty; a new list has no members; new bundles and new arrays are empty.

Declaration can be supplemented by a definite initialization, as in

  scalar x = pi
  series y = log(x)
  matrix A = zeros(10, 4)

The type of a new variable can be left implicit, as in

  x = y/100
  z = 3.5

Here the type of x will be determined automatically, depending on the context. If y is a scalar, a series or a matrix, x will inherit y's type; otherwise an error will be generated, since division is applicable to these types only. The new variable z will "naturally" be of scalar type.

In general, however, we recommend that you state the type of a new variable explicitly. This makes the intent clearer to a reader of the script and also guards against errors that might otherwise be difficult to understand (i.e. a certain variable turns out to be of the wrong type for some subsequent calculation, but you don't notice at first because you didn't say what type you wanted). Exceptions to this rule might reasonably be granted for clear and simple cases where there's little possibility of confusion.
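The error case mentioned above can be seen directly; a minimal sketch (catch and the $error accessor are covered in the Gretl Command Reference):

  string s = "abc"
  catch x = s/100 # division is not defined for strings
  if $error
      printf "implicit assignment failed: %s\n", errmsg($error)
  endif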
Modification

Typically, the values of variables of all types are modified by assignment, using the = operator, with the name of the variable on the left and a suitable value or formula on the right:

  z = normal()
  x = 100 * (log(y) - log(y(-1)))
  M = qform(a, X)

By a "suitable" value we mean one that is conformable for the type in question. A gretl variable acquires its type when it is first created and this cannot be changed via assignment; for example, if you have a matrix A and later want a string A, you will have to delete the matrix first.

One point to watch out for in gretl scripting is type conflicts having to do with the names of series brought in from a data file. For example, in setting up a command loop (see chapter 13) it is very common to call the loop index i. Now a loop index is a scalar (typically incremented each time round the loop). If you open a data file that happens to contain a series named i, you will get a type error ("Types not conformable for operation") when you try to use i as a loop index.

Although the type of an existing variable cannot be changed on the fly, gretl nonetheless tries to be as understanding as possible. For example, if x is an existing series and you say

  x = 100

gretl will give the series a constant value of 100 rather than complaining that you are trying to assign a scalar to a series. This issue is particularly relevant for the matrix type; see chapter 17 for details.

Besides using the regular assignment operator, you also have the option of using an "inflected" equals sign, as in the C programming language. This is shorthand for the case where the new value of the variable is a function of the old value. For example,

  x += 100 # in longhand: x = x + 100
  x *= 100 # in longhand: x = x * 100

For scalar variables you can use a more condensed shorthand for simple increment or decrement by 1, namely trailing ++ or -- respectively:

  x = 100
  x--  # x now equals 99
  x++  # x now equals 100

In the case of objects holding more than one value (series, matrices and bundles), you can modify particular values within the object, using an expression within square brackets to identify the elements to access. We have discussed this above for the bundle type, and chapter 17 goes into details for matrices. As for series, there are two ways to specify particular values for modification: you can use a simple 1-based index, or, if the dataset is a time series or panel (or if it has marker strings that identify the observations), you can use an appropriate observation string. Such strings are displayed by gretl when you print data with the --byobs flag. Examples:

  x[13] = 100       # simple index: the 13th observation
  x[1995:4] = 100   # date: quarterly time series
  x[2003:08] = 100  # date: monthly time series
  x["AZ"] = 100     # the observation with marker string "AZ"
  x[3:15] = 100     # panel: the 15th observation for the 3rd unit

Note that with quarterly or monthly time series there is no ambiguity between a simple index number and a date, since dates always contain a colon. With annual time-series data, however, such ambiguity exists, and it is resolved by the rule that a number in brackets is always read as a simple index: x[1905] means the nineteen-hundred-and-fifth observation, not the observation for the year 1905. You can specify a year by quotation, as in x["1905"].
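For example, here is a minimal sketch on an artificial quarterly dataset (see the nulldata and setobs commands):

  nulldata 8
  setobs 4 1995:1
  series x = 0
  x[1995:4] = 100 # assignment by date
  x[3] = 50       # assignment by simple index
  print x --byobs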
Destruction

Objects of the types discussed above, with the important exception of named lists, are all destroyed using the delete command: delete objectname.

Lists are an exception for this reason: in the context of gretl commands, a named list expands to the ID numbers of the member series, so if you say

  delete L

for L a list, the effect is to delete all the series in L; the list itself is not destroyed, but ends up empty. To delete the list itself (without deleting the member series) you must invert the command and use the list keyword:

  list L delete

Note that the delete command cannot be used within a loop construct (see chapter 13).

Chapter 12
Discrete variables

When a variable can take only a finite (typically small) number of values, it is said to be discrete. In gretl, variables of the series type (only) can be marked as discrete. (When we speak of "variables" below this should be understood as referring to series.) Some gretl commands act in a slightly different way when applied to discrete variables; moreover, gretl provides a few commands that only apply to discrete variables. Specifically, the dummify and xtab commands (see below) are available only for discrete variables, while the freq (frequency distribution) command produces different output for discrete variables.

12.1 Declaring variables as discrete

Gretl uses a simple heuristic to judge whether a given variable should be treated as discrete, but you also have the option of explicitly marking a variable as discrete, in which case the heuristic check is bypassed.

The heuristic is as follows. First, are all the values of the variable "reasonably round", where this is taken to mean that they are all integer multiples of 0.25? If this criterion is met, we then ask whether the variable takes on a "fairly small" set of distinct values, where "fairly small" is defined as less than or equal to 8. If both conditions are satisfied, the variable is automatically considered discrete.

To mark a variable as discrete you have two options.

1. From the graphical interface, select "Variable, Edit Attributes" from the menu. A dialog box will appear and, if the variable seems suitable, you will see a tick box labeled "Treat this variable as discrete". This dialog box can also be invoked via the context menu (right-click on a variable) or by pressing the F2 key.

2. From the command-line interface, via the discrete command. The command takes one or more arguments, which can be either variables or lists of variables. For example:

  list xlist = x1 x2 x3
  discrete z1 xlist z2

This syntax makes it possible to declare many variables as discrete at once, which cannot presently be done via the graphical interface. The switch --reverse reverses the declaration of a variable as discrete, or in other words marks it as continuous. For example:

  discrete foo
  # now foo is discrete
  discrete foo --reverse
  # now foo is continuous

The command-line variant is more powerful, in that you can mark a variable as discrete even if it does not seem to be suitable for this treatment. Note that marking a variable as discrete does not affect its content. It is the user's responsibility to make sure that marking a variable as discrete is a sensible thing to do. Note that if you want to recode a continuous variable into classes, you can use gretl's arithmetical functionality, as in the following example:

  nulldata 100
  # generate a series with mean 2 and variance 1
  series x = normal() + 2
  # split into 4 classes
  series z = (x>0) + (x>2) + (x>4)
  # now declare z as discrete
  discrete z

Once a variable is marked as discrete, this setting is remembered when you save the data file.

12.2 Commands for discrete variables

The dummify command

The dummify command takes as argument a series x and creates dummy variables for each distinct value present in x, which must have already been declared as discrete. Example:

  open greene22_2
  discrete Z5 # mark Z5 as discrete
  dummify Z5

The effect of the above command is to generate 5 new dummy variables, labeled DZ5_1 through DZ5_5, which correspond to the different values in Z5. Hence the variable DZ5_4 is 1 if Z5 equals 4 and 0 otherwise. This functionality is
also available through the graphical interface by selecting the menu item Add Dummies for selected discrete variables The dummify command can also be used with the following syntax list dlist dummifyx This not only creates the dummy variables but also a named list see section 151 that can be used afterwards The following example computes summary statistics for the variable Y for each value of Z5 open greene222 discrete Z5 mark Z5 as discrete list foo dummifyZ5 loop foreach i foo smpl i restrict replace summary Y endloop smpl full Since dummify generates a list it can be used directly in commands that call for a list as input such as ols For example open greene222 discrete Z5 mark Z5 as discrete ols Y 0 dummifyZ5 The freq command The freq command displays absolute and relative frequencies for a given variable The way fre quencies are counted depends on whether the variable is continuous or discrete This command is also available via the graphical interface by selecting the Variable Frequency distribution menu entry Chapter 12 Discrete variables 100 For discrete variables frequencies are counted for each distinct value that the variable takes For continuous variables values are grouped into bins and then the frequencies are counted for each bin The number of bins by default is computed as a function of the number of valid observations in the currently selected sample via the rule shown in Table 121 However when the command is invoked through the menu item Variable Frequency Plot this default can be overridden by the user Observations Bins 8 n 16 5 16 n 50 7 50 n 850 n n 850 29 Table 121 Number of bins for various sample sizes For example the following code open greene191 freq TUCE discrete TUCE mark TUCE as discrete freq TUCE yields Read datafile usrlocalsharegretldatagreenegreene191gdt periodicity 1 maxobs 32 observations range 132 Listing 5 variables 0 const 1 GPA 2 TUCE 3 PSI 4 GRADE freq TUCE Frequency distribution for TUCE obs 132 number of bins 7 mean 219375 sd 390151 interval midpt frequency rel cum 13417 12000 1 312 312 13417 16250 14833 1 312 625 16250 19083 17667 6 1875 2500 19083 21917 20500 6 1875 4375 21917 24750 23333 9 2812 7188 24750 27583 26167 7 2188 9375 27583 29000 2 625 10000 Test for null hypothesis of normal distribution Chisquare2 1872 with pvalue 039211 discrete TUCE mark TUCE as discrete freq TUCE Frequency distribution for TUCE obs 132 frequency rel cum 12 1 312 312 14 1 312 625 17 3 938 1562 Chapter 12 Discrete variables 101 19 3 938 2500 20 2 625 3125 21 4 1250 4375 22 2 625 5000 23 4 1250 6250 24 3 938 7188 25 4 1250 8438 26 2 625 9062 27 1 312 9375 28 1 312 9688 29 1 312 10000 Test for null hypothesis of normal distribution Chisquare2 1872 with pvalue 039211 As can be seen from the sample output a DoornikHansen test for normality is computed auto matically This test is suppressed for discrete variables where the number of distinct values is less than 10 This command accepts two options quiet to avoid generation of the histogram when invoked from the command line and gamma for replacing the normality test with Lockes nonparametric test whose null hypothesis is that the data follow a Gamma distribution If the distinct values of a discrete variable need to be saved the values matrix construct can be used see chapter 17 The xtab command The xtab command cab be invoked in either of the following ways First xtab ylist xlist where ylist and xlist are lists of discrete variables This produces crosstabulations twoway frequencies of each of the variables in ylist by row 
against each of the variables in xlist by column Or second xtab xlist In the second case a full set of crosstabulations is generated that is each variable in xlist is tabu lated against each other variable in the list In the graphical interface this command is represented by the Cross Tabulation item under the View menu which is active if at least two variables are selected Here is an example of use open greene222 discrete Z mark Z1Z8 as discrete xtab Z1 Z4 Z5 Z6 which produces Crosstabulation of Z1 rows against Z5 columns 1 2 3 4 5 TOT 0 20 91 75 93 36 315 1 28 73 54 97 34 286 TOTAL 48 164 129 190 70 601 Chapter 12 Discrete variables 102 Pearson chisquare test 548233 4 df pvalue 0241287 Crosstabulation of Z1 rows against Z6 columns 9 12 14 16 17 18 20 TOT 0 4 36 106 70 52 45 2 315 1 3 8 48 45 37 67 78 286 TOTAL 7 44 154 115 89 112 80 601 Pearson chisquare test 123177 6 df pvalue 350375e24 Crosstabulation of Z4 rows against Z5 columns 1 2 3 4 5 TOT 0 17 60 35 45 14 171 1 31 104 94 145 56 430 TOTAL 48 164 129 190 70 601 Pearson chisquare test 111615 4 df pvalue 00248074 Crosstabulation of Z4 rows against Z6 columns 9 12 14 16 17 18 20 TOT 0 1 8 39 47 30 32 14 171 1 6 36 115 68 59 80 66 430 TOTAL 7 44 154 115 89 112 80 601 Pearson chisquare test 183426 6 df pvalue 00054306 Pearsons χ2 test for independence is automatically displayed provided that all cells have expected frequencies under independence greater than 107 However a common rule of thumb states that this statistic is valid only if the expected frequency is 5 or greater for at least 80 percent of the cells If this condition is not met a warning is printed Additionally the row or column options can be given in this case the output displays row or column percentages respectively If you want to cut and paste the output of xtab to some other program eg a spreadsheet you may want to use the zeros option this option causes cells with zero frequency to display the number 0 instead of being empty Chapter 13 Loop constructs 131 Introduction The command loop opens a special mode in which gretl accepts a block of commands to be re peated zero or more times This feature may be useful for among other things Monte Carlo simulations bootstrapping of test statistics and iterative estimation procedures The general form of a loop is loop controlexpression progressive verbose loop body endloop Five forms of controlexpression are available as explained in section 132 Not all gretl commands are available within loops the commands that are not presently accepted in this context are shown in Table 131 Table 131 Commands not usable in loops function include nulldata quit run setmiss By default the genr command operates quietly in the context of a loop without printing informa tion on the variable generated To force the printing of feedback you may specify the verbose option to loop The progressive option to loop modifies the behavior of the commands print and store and certain estimation commands in a manner that may be useful with Monte Carlo analyses see Section 134 The following sections explain the various forms of the loop control expression and provide some examples of use of loops If you are carrying out a substantial Monte Carlo analysis with many thousands of repetitions memory capacity and processing time may be an issue To minimize the use of computer resources run your script using the commandline program gretlcli with output redirected to a file 132 Loop control variants Count loop The simplest form of loop control is a direct specification of 
the number of times the loop should be repeated. We refer to this as a "count loop". The number of repetitions may be a numerical constant, as in loop 1000, or may be read from a scalar variable, as in loop replics.

In the case where the loop count is given by a variable, say replics, in concept replics is an integer; if the value is not integral, it is converted to an integer by truncation. Note that replics is evaluated only once, when the loop is initially compiled.

While loop

A second sort of control expression takes the form of the keyword while followed by a Boolean expression. For example,

  loop while essdiff > .00001

Execution of the commands within the loop will continue so long as (a) the specified condition evaluates as true and (b) the number of iterations does not exceed the value of the internal variable loop_maxiter. By default this equals 100000, but you can specify a different value (or remove the limit) via the set command; see the Gretl Command Reference.
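As a concrete illustration, this minimal sketch uses a while loop for a Newton iteration; the variable names are arbitrary:

  # compute the square root of 2 by Newton's method
  scalar x = 1
  scalar diff = 1
  loop while diff > 1.0e-12
      scalar xnew = 0.5 * (x + 2/x)
      diff = abs(xnew - x)
      x = xnew
  endloop
  printf "sqrt(2) = %.12f\n", x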
Index loop

A third form of loop control uses an index variable, for example i (it is common programming practice to use simple one-character names for such variables). In this case you specify starting and ending values for the index, as in

  loop i=1..20

The index variable may be a pre-existing scalar; if this is not the case, the variable is created automatically and is destroyed on exit from the loop. The index may be used within the loop body in either of two ways: you can access the integer value of i, or you can use its string representation, $i.

The starting and ending values for the index can be given in numerical form, by reference to predefined scalar variables, or as expressions that evaluate to scalars. In the latter two cases the variables are evaluated once, at the start of the loop. In addition, with time series data you can give the starting and ending values in the form of dates, as in

  loop i=1950:1..1999:4

(for quarterly data).

This form of loop control is intended to be quick and easy, and as such it is subject to certain limitations. In particular, standard behavior is to increment the index variable by one at each iteration. So, for example, if you have

  loop i=m..n

where m and n are scalar variables with values m > n at the time of execution, the index will not be decremented; rather, the loop will simply be bypassed.

One modification of this behavior is supported, via the option flag --decr (or -d for short). This causes the index to be decremented by one at each iteration. For example,

  loop i=m..n --decr

In this case the loop will be bypassed if m < n. If you need more flexible control, see the "for" form below.

The index loop is particularly useful in conjunction with the values() matrix function, when some operation must be carried out for each value of some discrete variable (see chapter 12). Consider the following example:

  open greene22_2
  discrete Z8
  matrix v8 = values(Z8)
  loop i=1..rows(v8)
      scalar xi = v8[i]
      smpl Z8==xi --restrict --replace
      printf "mean(Y | Z8=%g) = %8.5f, sd(Y | Z8=%g) = %g\n", xi, mean(Y), xi, sd(Y)
  endloop

In this case we evaluate the conditional mean and standard deviation of the variable Y for each value of Z8.

Foreach loop

The fourth form of loop control also uses an index variable, in this case to index a specified set of strings. The loop is executed once for each string in the list. This can be useful for performing repetitive operations on a list of variables. Here is an example of the syntax:

  loop foreach i peach pear plum
      print "$i"
  endloop

This loop will execute three times, printing out "peach", "pear" and "plum" on the respective iterations. The numerical value of the index starts at 1 and is incremented by 1 at each iteration.

If you wish to loop across a list of variables that are contiguous in the dataset, you can give the names of the first and last variables in the list, separated by "..", rather than having to type all the names. For example, say we have 50 variables AK, AL, ..., WY, containing income levels for the states of the US. To run a regression of income on time for each of the states we could do:

  genr time
  loop foreach i AL..WY
      ols $i const time
  endloop

This loop variant can also be used for looping across the elements in a named list (see chapter 15). For example:

  list ylist = y1 y2 y3
  loop foreach i ylist
      ols $i const x1 x2
  endloop

Note that if you use this idiom inside a function (see chapter 14), looping across a list that has been supplied to the function as an argument, it is necessary to use the syntax listname.$i to reference the list-member variables. In the context of the example above, this would mean replacing the third line with

  ols ylist.$i const x1 x2

Two other cases are supported: the target of foreach can be a named array of strings or a bundle (see chapter 11). In the array case, $i gets (naturally) the string at position i in the array, from 1 to the number of elements; in the bundle case it gets the key-strings of all bundle members (in no particular order). For a bundle b, the command "print b" gives a fairly terse account of the bundle's membership; for a full account you can do:

  loop foreach i b
      print "$i"
      eval b["$i"]
  endloop

For loop

The final form of loop control emulates the for statement in the C programming language. The syntax is loop for, followed by three component expressions, separated by semicolons and surrounded by parentheses. The three components are as follows:
while continue has the effect of skipping any subsequent statements within the loop on the current iteration execution will proceed to the next iteration if the condition for continuation is still satisfied 134 Progressive mode If the progressive option is given for a command loop special behavior is invoked for certain commands namely print store and simple estimation commands By simple here we mean commands which a estimate a single equation as opposed to a system of equations and b do so by means of a single command statement as opposed to a block of statements as with nls and mle The paradigm is ols other possibilities include tsls wls logit and so on The special behavior is as follows Estimators The results from each individual iteration of the estimator are not printed Instead after the loop is completed you get a printout of a the mean value of each estimated coefficient across all the repetitions b the standard deviation of those coefficient estimates c the mean value of the estimated standard error for each coefficient and d the standard deviation of the estimated standard errors Note that this is useful only if there is some random input at each step print When this command is used to print the value of a variable its value is not printed each time round the loop Rather when the loop is terminated you get a printout of the mean and standard deviation of the variable across the repetitions of the loop This mode is intended for use Chapter 13 Loop constructs 107 with variables that have a scalar value at each iteration for example the sum of squared residuals from a regression Series cannot be printed in this way and neither can matrices store This command writes out the values of the specified scalars from each time round the loop to a specified file Thus it keeps a complete record of their values across the iterations For example coefficient estimates could be saved in this way so as to permit subsequent examination of their frequency distribution Only one such store can be used in a given loop 135 Loop examples Monte Carlo example A simple example of a Monte Carlo loop in progressive mode is shown in Listing 131 Listing 131 Simple Monte Carlo loop Download nulldata 50 set seed 547 series x 100 uniform open a progressive loop to be repeated 100 times loop 100 progressive series u 10 normal construct the dependent variable series y 10x u run OLS regression ols y const x grab the coefficient estimates and Rsquared scalar a coeffconst scalar b coeffx scalar r2 rsq arrange for printing of stats on these print a b r2 and save the coefficients to file store coeffsgdt a b endloop This loop will print out summary statistics for the a and b estimates and R2 across the 100 rep etitions After running the loop coeffsgdt which contains the individual coefficient estimates from all the runs can be opened in gretl to examine the frequency distribution of the estimates in detail The nulldata command is useful for Monte Carlo work Instead of opening a real data set nulldata 50 for instance creates an artificial dataset containing just a constant and an index variable with 50 observations Constructed variables can then be added See the set command for information on generating repeatable pseudorandom series Iterated least squares Listing 132 uses a while loop to replicate the estimation of a nonlinear consumption function of the form C α βY γ ϵ as presented in Greene 2000 Example 113 This script is included in the gretl distribution under the name greene113inp you can find it in gretl under the 
menu item File Script files Example scripts Greene Chapter 13 Loop constructs 108 The option printfinal for the ols command arranges matters so that the regression results will not be printed each time round the loop but the results from the regression on the last iteration will be printed when the loop terminates Listing 132 Nonlinear consumption function Download open greene113gdt run initial OLS ols C 0 Y scalar essbak ess scalar essdiff 1 scalar beta coeffY scalar gamma 1 iterate OLS till the error sum of squares converges loop while essdiff 00001 form the linearized variables series C0 C gamma beta Ygamma logY series x1 Ygamma series x2 beta Ygamma logY run OLS ols C0 0 x1 x2 printfinal nodfcorr vcv beta coeff2 gamma coeff3 ess ess essdiff absess essbakessbak essbak ess endloop print parameter estimates using their proper names printf alpha g coeff1 printf beta g beta printf gamma g gamma Listing 133 shows how a loop can be used to estimate an ARMA model exploiting the outer product of the gradient OPG regression discussed by Davidson and MacKinnon 1993 Further examples of loop usage that may be of interest can be found in chapter 21 Chapter 13 Loop constructs 109 Listing 133 ARMA 1 1 Download Estimation of an ARMA11 model manually using a loop open armagdt scalar c 0 scalar a 01 scalar m 01 series e 00 series dec e series dea e series dem e scalar crit 1 loop while crit 10e9 onestep forecast errors e y c ay1 me1 loglikelihood scalar loglik 05 sume2 print loglik partials of e with respect to c a and m dec 1 m dec1 dea y1 m dea1 dem e1 m dem1 partials of l with respect to c a and m series scc dec e series sca dea e series scm dem e OPG regression ols const scc sca scm printfinal nodfcorr vcv Update the parameters c coeff1 a coeff2 m coeff3 show progress printf constant 8g gradient 6g c coeff1 printf ar1 coefficient 8g gradient 6g a coeff2 printf ma1 coefficient 8g gradient 6g m coeff3 crit T ess print crit endloop scalar sec stderr1 scalar sea stderr2 scalar sem stderr3 printf printf constant 8g se 6g t 4f c sec csec printf ar1 coefficient 8g se 6g t 4f a sea asea printf ma1 coefficient 8g se 6g t 4f m sem msem Chapter 14 Userdefined functions 141 Defining a function Gretl offers a mechanism for defining functions which may be called via the command line in the context of a script or if packaged appropriately see section 145 via the programs graphical interface The syntax for defining a function looks like this function type funcname parameters function body end function The opening line of a function definition contains these elements in strict order 1 The keyword function 2 type which states the type of value returned by the function if any This must be one of void if the function does not return anything scalar series matrix list string bundle or one of gretls array types matrices bundles strings see section 118 3 funcname the unique identifier for the function Function names have a maximum length of 31 characters they must start with a letter and can contain only letters numerals and the underscore character You will get an error if you try to define a function having the same name as an existing gretl command 4 The functions parameters in the form of a commaseparated list enclosed in parentheses This may be run into the function name or separated by white space as shown In case the function takes no arguments unusual but acceptable this should be indicated by placing the keyword void between the parameterlist parentheses Function parameters can be of any of the types shown below1 
  Type      Description
  bool      scalar variable acting as a Boolean switch
  int       scalar variable acting as an integer
  scalar    scalar variable
  series    data series
  list      named list of series
  matrix    matrix or vector
  string    string variable or string literal
  bundle    all-purpose container (see section 11.7)
  matrices  array of matrices (see section 11.8)
  bundles   array of bundles
  strings   array of strings

(An additional parameter type is available for GUI use, namely obs; this is equivalent to int except for the way it is represented in the graphical interface for calling a function.)

Each element in the listing of parameters must include two terms: a type specifier, and the name by which the parameter shall be known within the function. An example follows:

  function scalar myfunc (series y, list xvars, bool verbose)

Each of the type-specifiers, with the exception of list and string, may be modified by prepending an asterisk to the associated parameter name, as in

  function scalar myfunc (series *y, scalar *b)

The meaning of this modification is explained below (see section 14.4); it is related to the use of pointer arguments in the C programming language.

Function parameters: optional refinements

Besides the required elements mentioned above, the specification of a function parameter may include some additional fields, as follows:

- The const modifier.
- For scalar or int parameters: minimum, maximum and/or default values; or for bool parameters, just a default value.
- For optional arguments other than scalar, int and bool, the special default value null.
- For all parameters, a descriptive string.
- For int parameters with minimum and maximum values specified, a set of strings to associate with the allowed numerical values (value labels).

The first three of these options may be useful in many contexts; the last two may be helpful if a function is to be packaged for use in the gretl GUI (but probably not otherwise). We now expand on each of the options.

The const modifier: must be given as a prefix to the basic parameter specification, as in

  const matrix M

This constitutes a promise that the corresponding argument will not be modified within the function; gretl will flag an error if the function attempts to modify the argument.

Minimum, maximum and default values (for scalar or int types): These values should directly follow the name of the parameter, enclosed in square brackets and with the individual elements separated by colons. For example, suppose we have an integer parameter order for which we wish to specify a minimum of 1, a maximum of 12, and a default of 4. We can write

  int order[1:12:4]

If you wish to omit any of the three specifiers, leave the corresponding field empty. For example, [1::4] would specify a minimum of 1 and a default of 4, while leaving the maximum unlimited. However, as a special case, it is acceptable to give just one value (with no colons), in which case the value is interpreted as a default. So for example

  int k[0]

designates a default value of 0 for the parameter k, with no minimum or maximum specified. If you wished to specify a minimum of zero with no maximum or default you would have to write

  int k[0::]

For a parameter of type bool (whose values are just zero or non-zero), you can specify a default of 1 (true) or 0 (false), as in

  bool verbose[0]

Descriptive string: This will show up as an aid to the user if the function is packaged (see section 14.5 below) and called via gretl's graphical interface. The string should be enclosed in double quotes and separated from the preceding elements of the parameter specification with a space, as in

  series y "dependent variable"

Value labels: These may be used only with int parameters for which minimum and maximum values have been specified (so that there is a fixed number of admissible values), and the number of labels must match the number of values. They will show up in the graphical interface in the form of a drop-down list, making the function writer's intent clearer when an integer argument represents a categorical selection. A set of value labels must be enclosed in braces, and the individual labels must be enclosed in double quotes and separated by commas or spaces. For example:

  int case[1:3:1] {"Fixed effects", "Between model", "Random effects"}

If two or more of the trailing optional fields are given in a parameter specification, they must be given in the order shown above: min/max/default, description, value labels. Note that there is no facility for "escaping" characters within descriptive strings or value labels; these may contain spaces but they cannot contain the double-quote character.

Here is an example of a well-formed function specification using all the elements mentioned above:

  function matrix myfunc (series y "dependent variable",
                          list X "regressors",
                          int p[0::1] "lag order",
                          int c[1:2:1] "criterion" {"AIC", "BIC"},
                          bool quiet[0])

One advantage of specifying default values for parameters, where applicable, is that in script or command-line mode, users may omit trailing arguments that have defaults. For example, myfunc above could be invoked with just two arguments, corresponding to y and X; implicitly p = 1, c = 1 and quiet is false.
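For instance, given the illustrative myfunc signature above (hypothetical, and assuming a suitable series y and list X are in scope), all of the following calls would be valid:

  matrix M1 = myfunc(y, X)          # p = 1, c = 1 (AIC), quiet = 0
  matrix M2 = myfunc(y, X, 4)       # p = 4, other defaults retained
  matrix M3 = myfunc(y, X, 4, 2, 1) # all arguments explicit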
Functions taking no parameters

You may define a function that has no parameters (these are called "routines" in some programming languages). In this case, use the keyword void in place of the listing of parameters:

  function matrix myfunc2 (void)

The function body

The function body is composed of gretl commands, or calls to user-defined functions (that is, function calls may be nested). A function may call itself (that is, functions may be recursive). While the function body may contain function calls, it may not contain function definitions. That is, you cannot define a function inside another function. For further details, see section 14.4.

14.2 Calling a function

A user function is called by typing its name followed by zero or more arguments enclosed in parentheses. If there are two or more arguments, they must be separated by commas.

There are automatic checks in place to ensure that the number of arguments given in a function call matches the number of parameters, and that the types of the given arguments match the types specified in the definition of the function. An error is flagged if either of these conditions is violated. One qualification: allowance is made for omitting arguments at the end of the list, provided that default values are specified in the function definition. To be precise, the check is that the number of arguments is at least equal to the number of required parameters, and is no greater than the total number of parameters.

In general, an argument to a function may be given either as the name of a pre-existing variable or as an expression which evaluates to a variable of the appropriate type.

The following trivial example illustrates a function call that correctly matches the corresponding function definition:

  # function definition
  function scalar ols_ess (series y, list xvars)
      ols y 0 xvars --quiet
      printf "ESS = %g\n", $ess
      return $ess
  end function

  # main script
  open data4-1
  list xlist = 2 3 4
  # function call (the return value is ignored here)
  ols_ess(price, xlist)

The function call gives two arguments: the first is a
data series specified by name and the second is a named list of regressors Note that while the function offers the Error Sum of Squares as a return value it is ignored by the caller in this instance As a side note here if you want a function to calculate some value having to do with a regression but are not interested in the full results of the regression you may wish to use the quiet flag with the estimation command as shown above A second example shows how to write a function call that assigns a return value to a variable in the caller function definition function series getuhat series y list xvars ols y 0 xvars quiet return uhat end function main script open data41 list xlist 2 3 4 function call series resid getuhatprice xlist 143 Deleting a function If you have defined a function and subsequently wish to clear it out of memory you can do so using the keywords delete or clear as in function myfunc delete function getuhat clear Chapter 14 Userdefined functions 114 Note however that if myfunc is already a defined function providing a new definition automatically overwrites the previous one so it should rarely be necessary to delete functions explicitly 144 Function programming details Variables versus pointers Most arguments to functions can be passed in two ways as they are or via pointers the exception is the list type which cannot be passed as a pointer First consider the following rather artificial example function series triple1 series x return 3x end function function void triple2 series x x 3 end function nulldata 10 series y normal series y3 triple1y print y3 triple2y print y These two functions produce essentially the same resultthe two print statements in the caller will show the same valuesbut in quite different ways The first explicitly returns a modified version of its input which must be a plain series after the call to triple1 y is unaltered it would have been altered only if the return value were assigned back to y rather than y3 The second function modifies its input given as a pointer to a series in place without actually returning anything Its worth noting that triple2 as it stands would not be considered idiomatic as a gretl function although its formally OK The point here is just to illustrate the distinction between passing an argument in the default way and in pointer form Why make this distinction There are two main reasons for doing so modularity and performance By modularity we mean the insulation of a function from the rest of the script which calls it One of the many benefits of this approach is that your functions are easily reusable in other contexts To achieve modularity variables created within a function are local to that function and are destroyed when the function exits unless they are made available as return values and these values are picked up or assigned by the caller In addition functions do not have access to variables in outer scope that is variables that exist in the script from which the function is called except insofar as these are explicitly passed to the function as arguments By default when a variable is passed to a function as an argument what the function actually gets is a copy of the outer variable which means that the value of the outer variable is not modified by anything that goes on inside the function This means that you can pass arguments to a function without worrying about possible side effects at the same time the function writer can use argument variables as workspace without fear of disruptive effects at the level of the 
caller The use of pointers however allows a function and its caller to cooperate such that an outer variable can be modified by the function In effect this allows a function to return more than one value although only one variable can be returned directlysee below To indicate that a particular object is to be passed as a pointer the parameter in question is marked with a prefix of in the function definition and the corresponding argument is marked with the complementary prefix in the caller For example Chapter 14 Userdefined functions 115 function series getuhatandessseries y list xvars scalar ess ols y 0 xvars quiet ess ess series uh uhat return uh end function open data41 list xlist 2 3 4 scalar SSR series resid getuhatandessprice xlist SSR In the above we may say that the function is given the address of the scalar variable SSR and it assigns a value to that variable under the local name ess For anyone used to programming in C note that it is not necessary or even possible to dereference the variable in question within the function using the operator Unadorned use of the name of the variable is sufficient to access the variable in outer scope An address parameter of this sort can be used as a means of offering optional information to the caller That is the corresponding argument is not strictly needed but will be used if present In that case the parameter should be given a default value of null and the the function should test to see if the caller supplied a corresponding argument or not using the builtin function exists For example here is the simple function shown above modified to make the filling out of the ess value optional function series getuhatandessseries y list xvars scalar essnull ols y 0 xvars quiet if existsess ess ess endif return uhat end function If the caller does not care to get the ess value it can use null in place of a real argument series resid getuhatandessprice xlist null Alternatively trailing function arguments that have default values may be omitted so the following would also be a valid call series resid getuhatandessprice xlist One limitation on the use of pointertype arguments should be noted you cannot supply a given variable as a pointer argument more than once in any given function call For example suppose we have a function that takes two matrixpointer arguments function scalar pointfunc matrix a matrix b And suppose we have two matrices x and y at the caller level The call pointfuncx y is OK but the call pointfuncx x will not work will generate an error Thats because the situation inside the function would become too confusing with what is really the same object existing under two names Chapter 14 Userdefined functions 116 Const parameters Pointertype arguments may also be useful for optimizing performance Even if a variable is not modified inside the function it may be a good idea to pass it as a pointer if it occupies a lot of memory Otherwise the time gretl spends transcribing the value of the variable to the local copy may be nonnegligible compared to the time the function spends doing the job it was written for Listing 141 takes this to the extreme We define two functions which return the number of rows of a matrix a pretty fast operation The first gets a matrix as argument while the second gets a pointer to a matrix The functions are evaluated 500 times on a matrix with 2000 rows and 2000 columns on a typical system floatingpoint numbers take 8 bytes of memory so the total size of the matrix is roughly 32 megabytes Running the code in example 
Running the code in Listing 14.1 will produce output similar to the following (the actual numbers will, of course, depend on the machine you're using):

Elapsed time:
    without pointers (copy): 2.47197 seconds
    with pointers (no copy): 0.00378627 seconds

Listing 14.1: Performance comparison: values versus pointer

function scalar rowcount1 (matrix X)
    return rows(X)
end function

function scalar rowcount2 (const matrix *X)
    return rows(X)
end function

set verbose off
X = zeros(2000, 2000)
scalar r
set stopwatch
loop 500
    r = rowcount1(X)
endloop
e1 = $stopwatch
set stopwatch
loop 500
    r = rowcount2(&X)
endloop
e2 = $stopwatch
printf "Elapsed time:\n\twithout pointers (copy): %g seconds\n\twith pointers (no copy): %g seconds\n", e1, e2

If a pointer argument is used for this sort of purpose, and the object to which the pointer points is not modified (that is, it is treated as read-only by the function), one can signal this to the user by adding the const qualifier, as shown for function rowcount2 in Listing 14.1. When a pointer argument is qualified in this way, any attempt to modify the object within the function will generate an error.

However, combining the const flag with the pointer mechanism is technically redundant, for the following reason: if you mark a matrix argument as const then gretl will in fact pass it in pointer mode internally (since it can't be modified within the function, there's no downside to simply making it available to the function rather than copying it). So in the example above we could revise the signature of the second function as

function scalar rowcount2a (const matrix X)

and call it with

r = rowcount2a(X)

for the same speed-up relative to rowcount1.

From the caller's point of view the second option (using the const modifier without pointer notation) is preferable, as it allows the caller to pass an object created "on the fly". Suppose the caller has two matrices, A and B, in scope, and wishes to pass their vertical concatenation as an argument. The following call would work fine:

r = rowcount2a(A | B)

To use rowcount2, on the other hand, the caller would have to create a named variable first, since you cannot give the "address" of an anonymous object such as A | B:

matrix AB = A | B
r = rowcount2(&AB)

This requires an extra line of code and leaves AB occupying memory after the call.

We have illustrated using a matrix parameter, but the const modifier may be used with the same effect (namely, the argument is passed directly, without being copied, but is protected against modification within the function) for all the types that support the pointer apparatus.

List arguments

The use of a named list as an argument to a function gives a means of supplying a function with a set of variables whose number is unknown when the function is written: for example, sets of regressors or instruments. Within the function, the list can be passed on to commands such as ols.

A list argument can also be "unpacked" using a foreach loop construct, but this requires some care. For example, suppose you have a list X and want to calculate the standard deviation of each variable in the list. You can do:

loop foreach i X
    scalar sd_$i = sd(X.$i)
endloop

Please note: a special piece of syntax is needed in this context. If we wanted to perform the above task on a list in a regular script (not inside a function), we could do

loop foreach i X
    scalar sd_$i = sd($i)
endloop

where $i gets the name of the variable at position i in the list, and sd_$i "gets" its standard deviation. But inside a function, working on a list supplied as an argument, if we want to reference an individual variable in the list we must use the syntax listname.varname. Hence in the example above we write sd(X.$i).
This is necessary to avoid possible collisions between the namespace of the function and the namespace of the caller script. For example, suppose we have a function that takes a list argument, and that defines a local variable called y. Now suppose that this function is passed a list containing a variable named y. If the two namespaces were not separated, either we'd get an error or the external variable y would be silently overwritten by the local one. It is important, therefore, that list-argument variables should not be visible by name within functions. To "get hold of" such variables you need to use the form of identification just mentioned: the name of the list, followed by a dot, followed by the name of the variable.

Constancy of list arguments

When a named list of variables is passed to a function, the function is actually provided with a copy of the list. The function may modify this copy (for instance, adding or removing members), but the original list at the level of the caller is not modified.

Optional list arguments

If a list argument to a function is optional, this should be indicated by appending a default value of null, as in

function scalar myfunc (scalar y, list X[null])

In that case, if the caller gives null as the list argument (or simply omits the last argument), the named list X inside the function will be empty. This possibility can be detected using the nelem() function, which returns 0 for an empty list.

String arguments

String arguments can be used, for example, to provide flexibility in the naming of variables created within a function. In the following example the function mavg returns a list containing two moving averages constructed from an input series, with the names of the newly created variables governed by the string argument.

function list mavg (series y, string vname)
    list retlist = deflist()
    string newname = sprintf("%s_2", vname)
    retlist += genseries(newname, (y+y(-1))/2)
    newname = sprintf("%s_4", vname)
    retlist += genseries(newname, (y+y(-1)+y(-2)+y(-3))/4)
    return retlist
end function

open data9-9
list malist = mavg(nocars, "nocars")
print malist --byobs

The last line of the script will print two variables named nocars_2 and nocars_4. For details on the handling of named strings, see chapter 15.

If a string argument is considered optional, it may be given a null default value, as in

function scalar foo (series y, string vname[null])

Retrieving the names of arguments

The variables given as arguments to a function are known inside the function by the names of the corresponding parameters. For example, within the function whose signature is

function void somefun (series y)

we have the series known as y. It may be useful, however, to be able to determine the names of the variables provided as arguments. This can be done using the function argname, which takes the name of a function parameter as its single argument and returns a string. Here is a simple illustration:

function void namefun (series y)
    printf "the series given as 'y' was named %s\n", argname(y)
end function

open data9-7
namefun(QNC)

This produces the output

the series given as 'y' was named QNC

Please note that this will not always work: the arguments given to functions may be anonymous variables, created on the fly, as in somefun(log(QNC)) or somefun(CPI/100). In that case the argname function returns an empty string. Function writers who wish to make use of this facility should check the return from argname using the strlen() function: if this returns 0, no name was found.
Return values

Functions can return nothing (just printing a result, perhaps), or they can return a single variable. The return value, if any, is specified via a statement within the function body beginning with the keyword return, followed by either the name of a variable (which must be of the type announced on the first line of the function definition) or an expression which produces a value of the correct type.

Having a function return a list or bundle is a way of permitting the "return" of more than one variable. For example, you can define several series inside a function and package them as a list; in this case they are not destroyed when the function exits. Here is a simple example, which also illustrates the possibility of setting the descriptive labels for variables generated in a function.

function list makecubes (list xlist)
    list cubes = deflist()
    loop foreach i xlist
        series $i_3 = xlist.$i^3
        setinfo $i_3 -d "cube of $i"
        list cubes += $i_3
    endloop
    return cubes
end function

open data4-1
list xlist = price sqft
list cubelist = makecubes(xlist)
print xlist cubelist --byobs
labels

A return statement causes the function to return (exit) at the point where it appears within the body of the function. A function may also exit when (a) the end of the function code is reached (in the case of a function with no return value), (b) a gretl error occurs, or (c) a funcerr statement is reached.

The funcerr keyword, which may be followed by a string enclosed in double quotes (or the name of a string variable, or nothing), causes a function to exit with an error flagged. If a string is provided (either literally or via a variable), this is printed on exit; otherwise a generic error message is printed. This mechanism enables the author of a function to pre-empt an ordinary execution error and/or offer a more specific and helpful error message. For example,

if nelem(xlist) == 0
    funcerr "xlist must not be empty"
endif

A function may contain more than one return statement, as in

function scalar multi (bool s)
    if s
        return 1000
    else
        return 10
    endif
end function

However, it is recommended programming practice to have a single return point from a function, unless this is very inconvenient. The simple example above would be better written as

function scalar multi (bool s)
    return s ? 1000 : 10
end function

Overloading

You may have noticed that several built-in functions in gretl are "overloaded": that is, a given argument slot may accept more than one type of argument, and the return value may depend on the type of the argument in question. For instance, the argument x for the pdf() function may be a scalar, series or matrix, and the return type will match that choice on the caller's part. Since gretl 2021b this possibility also exists for user-defined functions. The meta-type numeric can be used in place of a specific type, to accept a scalar, series or matrix argument, and similarly the return-type of a function can be marked as numeric.

As a function writer you can choose to be more restrictive than the default (which allows scalar, series or matrix for any numeric argument). For instance, if you write a function in which two arguments, x and y, are specified as numeric, you might decide to disallow the case where x is a matrix and y a series, or vice versa, as too complicated. You can use the typeof() function to determine what types of arguments were supplied, and the funcerr command or errorif() function to reject an unsupported combination.

If your function is going to return a certain specific type (say, matrix) regardless of the type of the input, then the return value should be labeled accordingly: use numeric for the return only if it's truly unknown in advance.
Listing 14.2 offers an (admittedly artificial) example: its numeric inputs can be scalars, series or column vectors, but they must be of a single type. Naturally, if your overloaded function is intended for public use, you should state clearly in its documentation what is supported and what is not.

Listing 14.2: Example of overloaded function

function numeric x_plus_b_y (numeric x, scalar b, numeric y)
    errorif(typeof(x) != typeof(y), "x and y must be of the same type")
    if typeof(x) <= 2 # scalar or series
        return x + b*y
    elif rows(x) == rows(y) && cols(x) == 1 && cols(y) == 1
        return x + b*y
    else
        funcerr "x and y should be column vectors"
    endif
end function

# call 1: x and y are scalars
eval x_plus_b_y(10, 3, 2)

# call 2: x and y are vectors
matrix x = mnormal(10, 1)
matrix y = mnormal(10, 1)
eval x_plus_b_y(x, 2, y)

open data4-1
# call 3: x and y are series
series bb = x_plus_b_y(bedrms, 0.5, baths)
print bb --byobs

Error checking

When gretl first reads and "compiles" a function definition, there is minimal error-checking: the only checks are that the function name is acceptable and, so far as the body is concerned, that you are not trying to define a function inside a function (see Section 14.1). Otherwise, if the function body contains invalid commands, this will become apparent only when the function is called and its commands are executed.

Debugging

The usual mechanism whereby gretl echoes commands and reports on the creation of new variables is by default suppressed when a function is being executed. If you want more verbose output from a particular function, you can use either or both of the following commands within the function:

set echo on
set messages on

Alternatively, you can achieve this effect for all functions via the command

set debug 1

Usually, when you set the value of a state variable using the set command, the effect applies only to the current level of function execution. For instance, if you do set messages on within function f1, which in turn calls function f2, then messages will be printed for f1 but not f2. The debug variable, however, acts globally: all functions become verbose regardless of their level. Further, you can do

set debug 2

In addition to command echo and the printing of messages, this is equivalent to setting max_verbose (which produces verbose output from the BFGS maximizer) at all levels of function execution.

14.5 Function packages

At various points above we have alluded to function packages, and the use of these via the gretl GUI. This topic is covered in depth by the Gretl Function Package Guide. If you're running gretl, you can find this under the Help menu. Alternatively, you may download it from

https://sourceforge.net/projects/gretl/files/manual/

Chapter 15
Named lists and strings

15.1 Named lists

Many gretl commands take one or more lists of series as arguments. To make this easier to handle in the context of command scripts, and in particular within user-defined functions, gretl offers the possibility of named lists.

Creating and modifying named lists

A named list is created using the keyword list, followed by the name of the list, an equals sign, and an expression that forms a list. The most basic sort of expression that works in this context is a space-separated list of variables, given either by name or by ID number. For example,

list xlist = 1 2 3 4
list reglist = income price

Note that the variables in question must be of the series type.

Two abbreviations are available in defining lists:

- You can use the wildcard character, *, to create a list of variables by name. For example, dum* can be used to indicate all variables whose names begin with dum.
- You can use two dots to indicate a range of variables. For example, income..price indicates the set of variables whose ID numbers are greater than or equal to that of income and less than or equal to that of price.
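A small sketch of these abbreviations in use (our own illustration; the names dum1, dum2, income and price are hypothetical, assumed to exist in the open dataset):

list dums = dum*            # all series whose names begin with "dum"
list block = income..price  # all series with IDs between those of income and price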
In addition, there are two special forms:

- If you use the keyword null on the right-hand side, you get an empty list.
- If you use the keyword dataset on the right, you get a list containing all the series in the current dataset (except the pre-defined const).

The name of the list must start with a letter, and must be composed entirely of letters, numbers or the underscore character. The maximum length of the name is 31 characters; list names cannot contain spaces.

Once a named list has been created, it will be "remembered" for the duration of the gretl session (unless you delete it), and can be used in the context of any gretl command where a list of variables is expected. One simple example is the specification of a list of regressors:

list xlist = x1 x2 x3 x4
ols y 0 xlist

To get rid of a list, you use the following syntax:

list xlist delete

Be careful: delete xlist will delete the series contained in the list, so it implies data loss (which may not be what you want). On the other hand, list xlist delete will simply "undefine" the xlist identifier; the series themselves will not be affected.

Similarly, to print the names of the members of a list you have to invert the usual print command, as in

list xlist print

If you just say print xlist, the list will be expanded and the values of all the member series will be printed.

Lists can be modified in various ways. To redefine an existing list altogether, use the same syntax as for creating a list. For example,

list xlist = 1 2 3
xlist = 4 5 6

After the second assignment, xlist contains just variables 4, 5 and 6.

To append or prepend variables to an existing list, we can make use of the fact that a named list stands in for a "longhand" list. For example, we can do

list xlist = xlist 5 6 7
xlist = 9 10 xlist 11 12

Another option for appending a term (or a list) to an existing list is to use +=, as in

xlist += cpi

To drop a variable from a list, use -=:

xlist -= cpi

In most contexts where lists are used in gretl, it is expected that they do not contain any duplicated elements. If you form a new list by simple concatenation, as in list L3 = L1 L2 (where L1 and L2 are existing lists), it's possible that the result may contain duplicates. To guard against this you can form a new list as the union of two existing ones:

list L3 = L1 || L2

The result is a list that contains all the members of L1, plus any members of L2 that are not already in L1.

In the same vein, you can construct a new list as the intersection of two existing ones:

list L3 = L1 && L2

Here L3 contains all the elements that are present in both L1 and L2.

You can also subtract one list from another:

list L3 = L1 - L2

The result contains all the elements of L1 that are not present in L2.

Indexing into a defined list is also possible, as if it were a vector:

list L2 = L1[1:4]

This leaves L2 with the first four members of L1. Notice that the ordering of list members is path-dependent.
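To make these set operations concrete, here is a minimal sketch of our own (assuming series x1 through x4 exist in the dataset):

list L1 = x1 x2 x3
list L2 = x2 x3 x4
list U = L1 || L2   # union: x1 x2 x3 x4
list I = L1 && L2   # intersection: x2 x3
list D = L1 - L2    # difference: x1
list U print
list I print
list D print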
Lists and matrices

There are two ways one can think of lists and matrices being interchangeable: either you think of a list as a collection of references to series, or you may consider the rectangle of data given by the series that the list contains.

In the former case, a list may be translated into, or created from, a one-dimensional matrix, that is, a vector. Therefore, the matrix in question must be interpretable as a vector containing ID numbers of data series. It may be either a row or a column vector, and each of its elements must have an integer part that is no greater than the number of variables in the data set. For example:

matrix m = {1, 2, 3, 4}
list L = m

The above is OK, provided the data set contains at least 4 variables. Conversely, the command

matrix m = L

will create a row vector with the ID numbers of the series referenced by L.

The latter case occurs when the matrix is assumed to contain valid data. To create a matrix from the list, simply assign to a matrix the list name surrounded by curly brackets, as in

matrix m = {L}

Note the difference with the above: without the curly brackets, matrix m would have been just a vector. Also note that any row corresponding to one or more missing entries will be dropped, unless the skip_missing set variable is set to off.

For the reverse operation, gretl provides the mat2list function, which takes a matrix (say, X) as argument and creates new series, as well as a list containing them. The row dimension of X must equal either the length of the current dataset or the number of observations in the current sample range. The naming of the series in the returned list proceeds as follows. First, if the optional prefix argument is supplied, the series created from column i of X is named by appending i to the given string. Otherwise, if X has column names set, these names are used. Finally, if neither of the above conditions is satisfied, the names are column1, column2 and so on. For example,

matrix X = mnormal($nobs, 8)
list L = mat2list(X, "xnorm")

will add to the dataset eight full-length series named xnorm1, xnorm2 and so on.

Querying a list

You can determine the number of variables or elements in a list using the function nelem().

list xlist = 1 2 3
nl = nelem(xlist)

The (scalar) variable nl will be assigned a value of 3, since xlist contains 3 members.

You can determine whether a given series is a member of a specified list using the function inlist(), as in

scalar k = inlist(L, y)

where L is a list and y a series. The series may be specified by name or ID number. The return value is the (1-based) position of the series in the list, or zero if the series is not present in the list.

Generating lists of transformed variables

Given a named list of series, you are able to generate lists of transformations of these series using the functions log, lags, diff, ldiff, sdiff or dummify. For example,

list xlist = x1 x2 x3
list lxlist = log(xlist)
list difflist = diff(xlist)

When generating a list of lags in this way, you specify the maximum lag order inside the parentheses, before the list name and separated by a comma. For example,

list xlist = x1 x2 x3
list laglist = lags(2, xlist)

or

scalar order = 4
list laglist = lags(order, xlist)

These commands will populate laglist with the specified number of lags of the variables in xlist. You can give the name of a single series in place of a list as the second argument to lags: this is equivalent to giving a list with just one member.

The dummify function creates a set of dummy variables coding for all but one of the distinct values taken on by the original variable, which should be discrete. (The smallest value is taken as the omitted category.) Like lags, this function returns a list even if the input is a single series.

Another useful operation you can perform with lists is creating interaction variables. Suppose you have a discrete variable x_i, taking values from 1 to n, and a variable z_i, which could be continuous or discrete. In many cases, you want to "split" z_i into a set of n variables via the rule

z_ij = z_i when x_i = j;  z_ij = 0 otherwise.

In practice, you create dummies for the x_i variable first, and then you multiply them all by z_i; these are commonly called the interactions between x_i and z_i.
In gretl you can do

list H = D ^ Z

where D is a list of discrete series (or a single discrete series) and Z is a list (or a single series)¹; all the interactions will be created and listed together under the name H. An example is provided in Listing 15.1.

¹ Warning: this construct does not work if neither D nor Z are of the list type.

Listing 15.1: Usage of interaction lists

Input:

open mroz87.gdt --quiet
# the coding below makes it so that
#   KIDS = 0 -> no kids
#   KIDS = 1 -> young kids only
#   KIDS = 2 -> young or older kids
series KIDS = (KL6 > 0) + ((KL6 > 0) || (K618 > 0))
list D = CIT KIDS    # interaction discrete variables
list X = WE WA       # variables to split
list INTER = D ^ X
smpl 1 6
print D X INTER -o

Output (selected portions):

           CIT         KIDS           WE           WA     WE_CIT_0
1            0            2           12           32           12
2            1            1           12           30            0
3            0            2           12           35           12
4            0            1           12           34           12
5            1            2           14           31            0
6            1            0           12           54            0

      WE_CIT_1     WA_CIT_0     WA_CIT_1    WE_KIDS_0    WE_KIDS_1
1            0           32            0            0            0
2           12            0           30            0           12
3            0           35            0            0            0
4            0           34            0            0           12
5           14            0           31            0            0
6           12            0           54           12            0

     WE_KIDS_2    WA_KIDS_0    WA_KIDS_1    WA_KIDS_2
1           12            0            0           32
2            0            0           30            0
3           12            0            0           35
4            0            0           34            0
5           14            0            0           31
6            0           54            0            0

Generating series from lists

There are various ways of retrieving or generating individual series from a named list. The most basic method is indexing into the list. For example,

series x3 = Xlist[3]

will retrieve the third element of the list Xlist under the name x3, or will generate an error if Xlist has fewer than three members.

In addition, gretl offers several functions that apply to a list and return a series. In most cases, these functions also apply to single series and behave as natural extensions when applied to lists, but this is not always the case.

For recognizing and handling missing values, gretl offers several functions (see the Gretl Command Reference for details). In this context, it is worth remarking that the ok() function can be used with a list argument. For example,

list xlist = x1 x2 x3
series xok = ok(xlist)

After these commands, the series xok will have value 1 for observations where none of x1, x2 or x3 has a missing value, and value 0 for any observations where this condition is not met.

The functions max, min, mean, sd, sum and var behave horizontally rather than vertically when their argument is a list. For instance, the following commands

list Xlist = x1 x2 x3
series m = mean(Xlist)

produce a series m whose i-th element is the average of x1_i, x2_i and x3_i; missing values, if any, are implicitly discarded.

Table 15.1: GDP per capita and population in 3 European countries (source: Eurostat)

        YpcFR   YpcGE   YpcIT       NFR       NGE       NIT
1997    114.9   124.6   119.3  59830635  82034771  56890372
1998    115.3   122.7   120.0  60046709  82047195  56906744
1999    115.0   122.4   117.8  60348255  82100243  56916317
2000    115.6   118.8   117.2  60750876  82211508  56942108
2001    116.0   116.9   118.1  61181560  82349925  56977217
2002    116.3   115.5   112.2  61615562  82488495  57157406
2003    112.1   116.9   111.0  62041798  82534176  57604658
2004    110.3   116.6   106.9  62444707  82516260  58175310
2005    112.4   115.1   105.1  62818185  82469422  58607043
2006    111.9   114.2   103.3  63195457  82376451  58941499

In addition, gretl provides three functions for weighted operations: wmean, wsd and wvar. Consider as an illustration Table 15.1: the first three columns are GDP per capita for France, Germany and Italy; columns 4 to 6 contain the population for each country. If we want to compute an aggregate indicator of per capita GDP, all we have to do is

list Ypc = YpcFR YpcGE YpcIT
list N = NFR NGE NIT
series y = wmean(Ypc, N)

so, for example,

y_1997 = (114.9 x 59830635 + 124.6 x 82034771 + 119.3 x 56890372) / (59830635 + 82034771 + 56890372) = 120.163

See the Gretl Command Reference for more details.

15.2 Named strings
For some purposes it may be useful to save a string (that is, a sequence of characters) as a named variable that can be reused. Some examples of the definition of a string variable are shown below.

string s1 = "some stuff I want to save"
string s2 = getenv("HOME")
string s3 = s1 + 11

The first field after the type-name string is the name under which the string should be saved, then comes an equals sign, then comes a specification of the string to be saved. This may take any of the following forms:

- a string literal (enclosed in double quotes); or
- the name of an existing string variable; or
- a function that returns a string (see below); or
- any of the above followed by + and an integer offset.

The role of the integer offset is to use a substring of the preceding element, starting at the given character offset. An empty string is returned if the offset is greater than the length of the string in question.

To add to the end of an existing string you can use the operator ~=, as in

string s1 = "some stuff I want to "
string s1 ~= "save"

or you can use the ~ operator to join two or more strings, as in

string s1 = "sweet"
string s2 = "Home, " ~ s1 ~ " home."

Note that when you define a string variable using a string literal, no characters are treated as "special" (other than the double quotes that delimit the string). Specifically, the backslash is not used as an escape character. So, for example,

string s = "\"

is a valid assignment, producing a string that contains a single backslash character. If you wish to use backslash-escapes to denote newlines, tabs, embedded double-quotes and so on, use the sprintf function instead (see the printf command for an account of the escape characters). This function can also be used to produce a string variable whose definition involves the values of other variables, as in

scalar x = 8
foo = sprintf("var%d", x) # produces "var8"

String variables and string substitution

String variables can be used in two ways in scripting: the name of the variable can be typed "as is", or it may be preceded by the "at" sign, @. In the first variant the named string is treated as a variable in its own right, while the second calls for "string substitution". The context determines which of these variants is appropriate.

In the following contexts the names of string variables should be given in plain form (without the "at" sign):

- When such a variable appears among the arguments to the printf command or sprintf function.
- When such a variable is given as the argument to a function.
- On the right-hand side of a string assignment.

Here is an illustration of the use of a named string argument with printf:

? string vstr = "variance"
Generated string vstr
? printf "vstr: %12s\n", vstr
vstr:     variance

String substitution can be used in contexts where a string variable is not acceptable as such. If gretl encounters the symbol @ followed directly by the name of a string variable, this notation is treated as a "macro": the value of the variable is substituted literally into the command line before the regular parsing of the command is carried out.

One common use of string substitution is when you want to construct and use the name of a series programmatically. For example, suppose you want to create 10 random normal series named norm1 to norm10. This can be accomplished as follows:

string sname
loop i=1..10
    sname = sprintf("norm%d", i)
    series @sname = normal()
endloop

Note that plain sname could not be used in the second line within the loop: the effect would be to attempt to overwrite the string variable named sname with a series of the same name. What we want is for the current value of sname to be dumped directly into the command that defines a series, and the @ notation achieves that.
Another typical use of string substitution is when you want the options used with a particular command to vary depending on some condition. For example,

function void use_optstr (series y, list xlist, int verbose)
    string optstr = verbose ? "" : "--simple-print"
    ols y xlist @optstr
end function

open data4-1
list X = const sqft
use_optstr(price, X, 1)
use_optstr(price, X, 0)

When printing the value of a string variable using the print command, the plain variable name should generally be used, as in

string s = "Just testing"
print s

The following variant is equivalent, though clumsy:

string s = "Just testing"
print "@s"

But note that this next variant does something quite different:

string s = "Just testing"
print @s

After string substitution, the print command reads

print Just testing

which attempts to print the values of two variables, Just and testing.

Built-in strings

Apart from any strings that the user may define, some string variables are defined by gretl itself. These may be useful for people writing functions that include shell commands. The built-in strings are as shown in Table 15.2.

Table 15.2: Built-in string variables

gretldir    the gretl installation directory
workdir     user's current gretl working directory
dotdir      the directory gretl uses for temporary files
gnuplot     path to, or name of, the gnuplot executable
tramo       path to, or name of, the tramo executable
x12a        path to, or name of, the x-12-arima executable
tramodir    tramo data directory
x12adir     x-12-arima data directory

To access these as ordinary string variables, prepend a dollar sign (as in $dotdir); to use them in string-substitution mode, prepend the at-sign (@dotdir).

Reading strings from the environment

It is possible to read into gretl's named strings values that are defined in the external environment. To do this you use the function getenv, which takes the name of an environment variable as its argument. For example:

? string user = getenv("USER")
Generated string user
? string home = getenv("HOME")
Generated string home
? printf "%s's home directory is %s\n", user, home
cottrell's home directory is /home/cottrell

To check whether you got a non-empty value from a given call to getenv, you can use the function strlen, which retrieves the length of the string, as in

? string temp = getenv("TEMP")
Generated string temp
? scalar x = strlen(temp)
Generated scalar x = 0

Capturing strings via the shell

If shell commands are enabled in gretl, you can capture the output from such commands using the syntax

string stringname = $(shellcommand)

That is, you enclose a shell command in parentheses, preceded by a dollar sign.

Reading from a file into a string

You can read the content of a file into a string variable using the syntax

string stringname = readfile(filename)

The filename field may be given as a string variable. For example:

? fname = sprintf("%s/QNC.rts", $x12adir)
Generated string fname
? string foo = readfile(fname)
Generated string foo

More string functions

Gretl offers several functions for creating or manipulating strings. You can find these listed and explained in the Function Reference under the category Strings.

Chapter 16
String-valued series

16.1 Introduction

By a "string-valued series" we mean a series whose primary values are strings (though internally such series comprise an integer coding plus a "dictionary" mapping from the integer values to strings). This chapter explains how to create such series and describes the operations that are supported for them.

16.2 Creating a string-valued series

This can be done in three ways:
- by reading such a series from a suitable source file;
- by taking a suitable numerical series within gretl and adding string values using the stringify() function; and
- by direct assignment to a series from an array of strings.

In each case, string values will be preserved when such a series is saved in a gretl-native data file.

Reading string-valued series

The primary "suitable source" for string-valued series is a delimited text data file (but see section 16.5 below). Here's a little example. The following is the content of a file named gc.csv:

city,year
"Bilbao",2009
"Torun",2011
"Oklahoma City",2013
"Berlin",2015
"Athens",2017
"Naples",2019

A script to read this file, and its output, are shown in Listing 16.1, from which we can see a few things.

- By default the print command shows us the string values of the series city, and it handles non-ASCII characters provided they're in UTF-8 (but it doesn't handle longer strings very elegantly).
- The --numeric option to print exposes the integer codes for a string-valued series.
- The syntax seriesname[obs] yields a string when a series is string-valued.

Listing 16.1: Working with a string-valued series

Input:

open gc.csv --quiet
print --byobs
print city --byobs --numeric
printf "The third gretl conference took place in %s.\n", city[3]

Output of print --byobs:

            city         year
1         Bilbao         2009
2          Torun         2011
3     Oklahoma C         2013
4         Berlin         2015
5         Athens         2017
6         Naples         2019

Output of print city --byobs --numeric:

            city
1              1
2              2
3              3
4              4
5              5
6              6

The third gretl conference took place in Oklahoma City.

If you want to access the numeric code for a particular string-valued observation, you can get it by "casting" the series in question to a vector, by wrapping the identifier in curly brackets. So, for example,

printf "The code for %s is %d.\n", city[3], {city}[3]

gives

The code for Oklahoma City is 3.

The numeric codes for string-valued series are always assigned thus: reading the data file row by row, the first string value is assigned 1, the next distinct string value is assigned 2, and so on.

Assigning string values to a numeric series

This is done via the stringify() function, which takes two arguments: the name of a series and an array of strings. For this to work, two conditions must be met:

1. The series must have only integer values, and the smallest value must be 1 or greater.
2. The array of strings must have at least n distinct members, where n is the largest value found in the series.

The logic of these conditions is that we're looking to create a mapping, as described above, from a 1-based sequence of integers to a set of strings. However, we're allowing for the possibility that the series in question is an incomplete sample from an associated population. Suppose we have a series that goes 2, 3, 5, 9, 10. This is taken to be a sample from a population that has at least 10 discrete values, 1, 2, ..., 10, and so requires at least 10 value-strings.

Here's (a simplified version of) an example that one of the authors has had cause to use: deriving US-style "letter grades" from a series containing percentage scores for students. Call the percentage series x, and say we want to create a series with values A for x > 90, B for 80 < x <= 90, and so on down to F for x <= 60. Then we can do:

series grade = 1 # F, the least value
grade += x > 60  # D
grade += x > 70  # C
grade += x > 80  # B
grade += x > 90  # A
stringify(grade, strsplit("F D C B A"))

The way the grade series is constructed is not the most compact, but it's nice and explicit, and easy to amend if one wants to adjust the threshold values. Note the use of strsplit() to create an on-the-fly array of strings from a string literal; this is convenient when the array contains a moderate number of elements with no embedded spaces.
An alternative way to get the same result is to define the array of strings via the defarray() function, as in

stringify(grade, defarray("F", "D", "C", "B", "A"))

The inverse operation of stringify() is performed by the strvals() function: this retrieves the array of distinct string values from a series (or returns an empty array if the series is not string-valued).

Assigning from an array of strings

Given an array of strings whose length matches the full length of the current dataset, you can assign directly to a series result, provided these conditions are satisfied: the dataset is not sub-sampled, and if the assignment is to a pre-existing series, it is not already string-valued.

Here's a trivial example:

nulldata 6
strings S = defarray("a", "b", "c", "b", "a", "d")
series sx = S
print sx --byobs

Here's a second example, where we create a string-valued series using the observation markers from the current dataset, after grabbing them as an array via the markers command:

open data4-10
markers --to-array=S
series state = S
print state --byobs

And here's a third example, where we construct the array of strings by reading from a text file:

nulldata 8
series sv = strsplit(readfile("ABCD.txt"))
print sv --byobs

This will work fine if the content of ABCD.txt is something like

A B C D D C B A

(containing 8 space-separated values, with or without line breaks). If the strings in question contain embedded spaces, you would have to make use of the optional second argument to strsplit.

16.3 Permitted operations

One question that arises with string-valued series is, what exactly are you allowed to do with them? The optimal policy may be debatable, but here we set out the current state of things.

Setting values per observation

You can set particular values in a string-valued series, either by string or by numeric code. For example, suppose (in relation to the example in section 16.2) that for some reason student number 31, with a percentage score of 88, nonetheless merits an A grade. We could do

grade[31] = "A"

or, if we're confident about the mapping,

grade[31] = 5

Or, to raise the student's grade by one letter:

grade[31] += 1

What you're not allowed to do here is make a numerical adjustment that would put the value "out of bounds" in relation to the set of string values. For example, if we tried grade[31] = 6 we'd get an error.

On the other hand, you can implicitly extend the set of string values. This wouldn't make sense for the letter grades example, but it might for, say, city names. Returning to the example in section 16.2, suppose we try

dataset addobs 1
year[7] = 2023
city[7] = "Gdansk"

This will work: we're implicitly adding another member to the string table for city, and the associated numeric code will be the next available integer.¹

¹ So please be careful: one may inadvertently add a new string value by mistyping a string that's already present.

Logical product of two string-valued series

The ^ operator can be used to produce what we might call the logical product of two string-valued series, as in

series sv3 = sv1 ^ sv2

The result is another string-valued series, with value s_i.s_j at observations where sv1 has value s_i and sv2 has value s_j. For example, if at a given observation sv1 has value "A" and sv2 has value "X", then sv3 will have value "A.X". The set of strings attached to the resulting series will include all such string combinations, even if they are not all represented in the given sample.
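As a minimal sketch of this operator (our own illustration, with invented data; the combined values shown in the comment assume the dot-separated form described above):

nulldata 4
series sv1 = defarray("A", "B", "A", "B")
series sv2 = defarray("X", "X", "Y", "Y")
series sv3 = sv1 ^ sv2
print sv1 sv2 sv3 --byobs   # sv3 takes values A.X, B.X, A.Y, B.Y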
Assignment to a string-valued series

In an assignment statement where the left-hand side (LHS) term is an existing string-valued series, two general conditions must be met. First, the right-hand side (RHS) term must be a series, either numeric or string-valued; and second, the assignment operator must be plain "=": inflected operators such as "+=" and "*=" are not supported. (A small sketch illustrating the restricted-sample case appears at the end of this section.)

When the RHS series is numeric, all its values must be either integers between 1 and the number of strings attached to the LHS series, or NA. This is required to preserve the integrity of the LHS.

When the RHS series is itself string-valued, there are two cases to consider: there's no sample restriction in place, or there is such a restriction. In the unrestricted case the LHS series is in effect destroyed and replaced by a clone of the RHS. Otherwise, string values on the RHS are written into the LHS only within the current sample range. If an RHS string is already present on the left, its numerical code is adjusted, if necessary, to match the LHS string table; if it is not present on the left, it is appended to the LHS string table.

Missing values

We support one exception to the general rule, "never break the mapping between strings and numeric codes for string-valued series": you can mark particular observations as "missing". This is done in the usual way, e.g.,

grade[31] = NA

Note, however, that on importing a string series from a delimited text file any non-blank strings (including "NA") will be interpreted as valid values; any missing values in such a file should therefore be represented by blank cells.
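The following sketch (our own, using invented data) illustrates the restricted-sample case of string-valued assignment described above:

nulldata 4
series a = defarray("x", "x", "y", "y")
series b = defarray("y", "z", "z", "y")
smpl 2 3
a = b            # writes "z", "z" into observations 2 and 3 of a
smpl full
print a --byobs  # a is now x, z, z, y: "z" has been appended to a's string table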
few words on builtin functions that can be applied to stringvalued series The five functions substr strsub regsub tolower and toupper all perform transformations on Chapter 16 Stringvalued series 138 stringsrespectively extraction of a substring replacement of a substring replacement via regular expression conversion to all lowercase and to all uppercase see the Gretl Command Reference for details These functions work on single strings arrays of strings and also stringvalued series Note that when applied to a stringvalued series these functions may reduce the number of distinct strings attached to the series For example some string values that are originally distinct may collapse into identity when converted to all lowercase This possibility is handled by adjustment of the integer codes as needed A special case is presented by the builtin strvsort function this does not return a modified stringvalued series but rather modifies such a series in place It puts the string values into alpha betical order and recalculates the integer codes so as to preserve the original association between observation number and string If for example the first observation had a string value of X coded as 1 it will still have value X but its code will reflect the position of X in the alphabet ized ordering This can be particularly useful if a dataset comprises several series having the same string values but occurring in various orders The effect of running strvsort on such series will be to impose a common numerical encoding Userdefined hansl functions can also deal with stringvalued series If you supply such a series as an argument to a hansl function its string values will be accessible within the function One can test whether a given series arg is stringvalued as follows if nelemstrvalsarg 0 yes else no endif Its also possible since gretl version 2023c to put something like the code that generated the grade series in section 162 into a function and return the stringified series as in the following where we assume that x contains percentage scores function series lettergrade series x series grade define grade based on x and stringify it as shown above return grade end function An alternative means of achieving the same effectand the only means available prior to gretl 2023cis to to define grade as a series at the level of the caller and pass it in pointer form to lettergrade as in function void lettergrade series x series grade define grade based on x and stringify it end function caller series grade lettergradex grade As youll see from the account above we dont offer any very fancy facilities for stringvalued series Well read them from suitable sources and well create them natively via stringifyand well try to ensure that they retain their integritybut we dont for example take the specification of a stringvalued series as a regressor as an implicit request to include the dummification of its distinct values Chapter 16 Stringvalued series 139 165 Other import formats In section 162 we illustrated the reading of stringvalued series with reference to a delimited text data file Gretl can also handle several other sources of stringvalued data including the spread sheet formats xls xlsx gnumeric and ods and to a degree the formats of Stata SAS and SPSS Stata files Stata supports two relevant sorts of variables 1 those that are of string type and 2 variables of one or other numeric type that have value labels defined Neither of these is exactly equivalent to what we call a stringvalued series in gretl Stata 
variables of string type have no numeric representation their values are literally strings and thats all Statas numeric variables with value labels do not have to be integervalued and their least value does not have to be 1 however you cant define a label for a value that is not an integer Thus in Stata you can have a series that comprises both integer and noninteger values but only the integer values can be labeled2 This means that on import to gretl we can readily handle variables of string type from Statas dta files We give them a 1based numeric encoding this is arbitrary but does not conflict with any information in the dta file On the other hand in general were not able to handle Statas numeric variables with value labels currently we report the value labels to the user but do not attempt to store them in the gretl dataset We could check such variables and import them as stringvalued series if they satisfy the criteria stated in section 162 but we dont at present SAS and SPSS files Gretl is able to read and preserve string values associated with variables from SAS export xpt files and also from SPSS sav files Such variables seem to be on the same pattern as Stata variables of string type 2Verified in Stata 12 Chapter 17 Matrix manipulation Together with the other two basic types of data series and scalars gretl offers a quite compre hensive array of matrix methods This chapter illustrates the peculiarities of matrix syntax and discusses briefly some of the more advanced matrix functions For a full listing of matrix functions and a comprehensive account of their syntax please refer to the Gretl Command Reference In this chapter were concerned with real matrices most of the points made here also apply to complex matrices but see the following chapter for additional specifics on the complex case 171 Creating matrices Matrices can be created using any of these methods 1 By direct specification of the scalar values that compose the matrixeither in numerical form or by reference to preexisting scalar variables or using computed values 2 By providing a list of data series 3 By providing a named list of series 4 Via a suitable expression that references existing matrices andor scalars or via some special functions To specify a matrix directly in terms of scalars the syntax is for example matrix A 1 2 3 4 5 6 The matrix is defined by rows the elements on each row are separated by commas and the rows are separated by semicolons The whole expression must be wrapped in braces Spaces within the braces are not significant The above expression defines a 2 3 matrix Each element should be a numerical value the name of a scalar variable or an expression that evaluates to a scalar Directly after the closing brace you can append a single quote to obtain the transpose To specify a matrix in terms of data series the syntax is for example matrix A x1 x2 x3 where the names of the variables are separated by commas Besides names of existing variables you can use expressions that evaluate to a series For example given a series x you could do matrix A x x2 Each variable occupies a column and there can only be one variable per column You cannot use the semicolon as a row separator in this case if you want the series arranged in rows append the transpose symbol The range of data values included in the matrix depends on the current setting of the sample range Instead of giving an explicit list of variables you may instead provide the name of a saved list see Chapter 15 as in 140 Chapter 17 Matrix manipulation 141 
list xlist = x1 x2 x3
matrix A = {xlist}

When you provide a named list, the data series are by default placed in columns, as is natural in an econometric context; if you want them in rows, append the transpose symbol.

As a special case of constructing a matrix from a list of variables, you can say

matrix A = {dataset}

This builds a matrix using all the series in the current dataset, apart from the constant (variable 0). When this dummy list is used, it must be the sole element in the matrix definition. You can, however, create a matrix that includes the constant along with all other variables, using horizontal concatenation (see below), as in

matrix A = {const} ~ {dataset}

By default, when you build a matrix from series that include missing values, the data rows that contain NAs are skipped. But you can modify this behavior via the command set skip_missing off. In that case NAs are converted to NaN ("Not a Number"). In the IEEE floating-point standard, arithmetic operations involving NaN always produce NaN. Alternatively, you can take greater control over the observations (data rows) that are included in the matrix, using the "set" variable matrix_mask, as in

set matrix_mask msk

where msk is the name of a series. Subsequent commands that form matrices from series or lists will include only observations for which msk has non-zero (and non-missing) values. You can remove this mask via the command set matrix_mask null.

Names of matrices must satisfy the same requirements as names of gretl variables in general: the name can be no longer than 31 characters, must start with a letter, and must be composed of nothing but letters, numbers and the underscore character.

17.2 Empty matrices

The syntax

matrix A = {}

creates an empty matrix: a matrix with zero rows and zero columns. The main purpose of the concept of an empty matrix is to enable the user to define a starting point for subsequent concatenation operations. For instance, if X is an already defined matrix of any size, the commands

matrix A = {}
matrix B = A ~ X

result in a matrix B identical to X.

From an algebraic point of view, one can make sense of the idea of an empty matrix in terms of vector spaces: if a matrix is an ordered set of vectors, then A = {} is the empty set. As a consequence, operations involving addition and multiplication don't have any clear meaning (arguably, they have none at all), but operations involving the cardinality of this set (that is, the dimension of the space spanned by A) are meaningful.

Legal operations on empty matrices are listed in Table 17.1. (All other matrix operations generate an error when an empty matrix is given as an argument.) In line with the above interpretation, some matrix functions return an empty matrix under certain conditions: the functions diag, vec, vech, unvech when the argument is an empty matrix; the functions I, ones, zeros, mnormal, muniform when one or more of the arguments is 0; and the function nullspace when its argument has full column rank.

Table 17.1: Valid functions on an empty matrix, A

Function         Return value
A', transp(A)    A
rows(A)          0
cols(A)          0
rank(A)          0
det(A)           NA
ldet(A)          NA
tr(A)            NA
onenorm(A)       NA
infnorm(A)       NA
rcond(A)         NA

17.3 Selecting sub-matrices

You can select sub-matrices of a given matrix using the syntax

A[rows, cols]

where rows can take any of these forms:

1. empty: selects all rows
2. a single integer: selects the single specified row
3. two integers separated by a colon: selects a range of rows
4. the name of a matrix: selects the specified rows
With regard to option 2, the integer value can be given numerically, as the name of an existing scalar variable, or as an expression that evaluates to a scalar. With option 4, the index matrix given in the rows field must be either p x 1 or 1 x p, and should contain integer values in the range 1 to n, where n is the number of rows in the matrix from which the selection is to be made.

The cols specification works in the same way, mutatis mutandis. Here are some examples.

matrix B = A[1,]
matrix B = A[2:3,3:5]
matrix B = A[2,2]
matrix idx = {1, 2, 6}
matrix B = A[idx,]

The first example selects row 1 from matrix A; the second selects a 2 x 3 sub-matrix; the third selects a scalar; and the fourth selects rows 1, 2 and 6 from matrix A.

If the matrix in question is n x 1 or 1 x m, it is OK to give just one index specifier and omit the comma. For example, A[2] selects the second element of A if A is a vector. Otherwise the comma is mandatory.

In addition, there are some predefined index specifications, represented by the keywords diag, lower, upper, real, imag and end. With the exception of end, these keywords imply specific row and column selections, and therefore cannot be combined with any additional comma-separated term.

- The diag specification selects the principal diagonal of a matrix.
- lower and upper select, respectively, the elements of a matrix below and those above the principal diagonal.
- real and imag are specific to complex matrices, and are described in chapter 18.
- end selects the last element in a given row or column. It can be employed in arithmetical expressions, so for example end-1 accesses the second-last element in a row or column.

You can use sub-matrix selections on either the right-hand side of a matrix-generating formula, or the left. Here is an example of use of a selection on the right, to extract a 2 x 2 sub-matrix B from a 3 x 3 matrix A, then the lower triangle of A:

matrix A = {1, 2, 3; 4, 5, 6; 7, 8, 9}
matrix B = A[1:2,2:3]
matrix C = A[lower]

And here are examples of selection on the left. The second line below writes a 2 x 2 identity matrix into the bottom right corner of the 3 x 3 matrix A. The fourth line replaces the diagonal of A with 1s.

matrix A = {1, 2, 3; 4, 5, 6; 7, 8, 9}
matrix A[2:3,2:3] = I(2)
matrix d = {1, 1, 1}
matrix A[diag] = d

When the lower and upper selections are used on the right, they yield a vector holding the elements in their scope. The ordering of the elements is column-major in both cases, as illustrated below for the 4 x 4 case, where the numbers indicate the position of each element in the vector returned by lower (below the diagonal) and upper (above the diagonal) respectively:

d 1 2 4
1 d 3 5
2 4 d 6
3 5 6 d

This means that lower and upper do not produce the same result for symmetric matrices bigger than 3 x 3, which may seem unfortunate, but it gives the user a degree of flexibility in respect of the ordering of the elements. Suppose you have a non-symmetric matrix M and you'd like to extract the infra-diagonal elements in row-major order: M'[upper] will do the job.

When lower and upper are used on the left, the replacement must be either (a) a vector of length equal to the number of elements in the selection, or (b) a scalar value. In case (a) the elements of the target matrix are filled in column-major order; in case (b) they are all set using the scalar.

One possible use of these tools is taking, say, a lower triangular matrix and rendering it symmetric by copying the elements from beneath the diagonal to above. The way to get this right (assuming you have a lower triangular matrix L) is

L[upper] = L'[upper] # note: not L[upper] = L[lower]
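A quick check of this idiom (a sketch of our own):

matrix L = {1, 0, 0; 2, 5, 0; 3, 6, 9}
L[upper] = L'[upper]
print L   # now symmetric: rows are (1,2,3), (2,5,6), (3,6,9)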
17.4 Deleting rows or columns

A variant of sub-matrix notation is available for convenience in dropping specified rows and/or columns from a matrix, namely, giving negative values for the indices. Here is a simple example:

matrix A = {1, 2, 3; 4, 5, 6; 7, 8, 9}
matrix B = A[-2,-3]

which creates B as a 2 x 2 matrix which drops row 2 and column 3 from A. Negative indices can also be given in the form of an index vector:

matrix rdrop = {1, 3, 5}
matrix B = A[-rdrop,]

In this case B is formed by dropping rows 1, 3 and 5 from A (which must have at least 5 rows), but retaining the column dimension of A.

There are two limitations on the use of negative indices. First, the from:to range syntax described in the previous section is not available, but you can use the seq() function to achieve an equivalent effect, as in

matrix A = muniform(1, 10)
matrix B = A[,-seq(3,7)]

where B drops columns 3 to 7 from A. Second, use of negative indices is valid only on the right-hand side of a matrix calculation; there is no negative-index equivalent of assignment to a sub-matrix, as in

A[1:3,] = ones(3, cols(A))

17.5 Matrix operators

The following binary operators are available for matrices:

+    addition
-    subtraction
*    ordinary matrix multiplication
'    pre-multiplication by transpose
\    matrix "left division" (see below)
/    matrix "right division" (see below)
~    column-wise concatenation
|    row-wise concatenation
**   Kronecker product
==   test for equality
!=   test for inequality

In addition, the following operators ("dot" operators) apply on an element-by-element basis: .+ .- .* ./ .^ .= .> .< .>= .<= .!=

Here are explanations of the less obvious cases.

For matrix addition and subtraction, in general the two matrices have to be of the same dimensions, but an exception to this rule is granted if one of the operands is a 1 x 1 matrix or scalar. The scalar is implicitly promoted to the status of a matrix of the correct dimensions, all of whose elements are equal to the given scalar value. For example, if A is an m x n matrix and k a scalar, then the commands

matrix C = A + k
matrix D = A - k

both produce m x n matrices, with elements c_ij = a_ij + k and d_ij = a_ij - k respectively.

By "pre-multiplication by transpose" we mean, for example, that

matrix C = X'Y

produces the product of X-transpose and Y. In effect, the expression X'Y is shorthand for X'*Y (which is also valid syntax). In the special case X = Y, however, the two are not exactly equivalent. The former expression uses a specialized algorithm with two advantages: it is more efficient computationally, and it ensures that the result is free of machine precision artifacts that may render it numerically non-symmetric. This, however, is unlikely to be an issue unless your X matrix is rather large (at least several hundreds of rows/columns).

In matrix "left division", the statement

matrix X = A \ B

is interpreted as a request to find the matrix X that solves AX = B, so A and B must have the same number of rows. If A is a square matrix, this is in principle equivalent to A^(-1) B, which fails if A is singular; the numerical method employed here is the LU decomposition. If A is a T x k matrix with T > k, then X is the least-squares solution, X = (A'A)^(-1) A'B, which fails if A'A is singular; the numerical method is the QR decomposition. Otherwise, the operation fails.

For matrix "right division", as in X = A / B, X is the matrix that solves XB = A, so A and B must have the same number of columns. If B is non-singular, this is in principle equivalent to AB^(-1); otherwise X is the least-squares solution.
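As a small numerical sketch of left division (our own example):

matrix A = {2, 1; 1, 3}
matrix b = {5; 10}
matrix x = A \ b   # solves A*x = b via the LU decomposition
print x            # x = {1; 3}
eval A*x - b       # zero vector, up to machine precision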
In "dot" operations a binary operation is applied element by element; the result of this operation is obvious if the matrices are of the same size. However, there are several other cases where such operators may be applied. For example, if we write

    matrix C = A .* B

then the result C depends on the dimensions of A and B. Let A be an m x n matrix and let B be p x q; the result is as follows:

    Case                                                     Result
    Dimensions match (m = p and n = q)                       c_ij = a_ij * b_ij
    A is a column vector; rows match (m = p; n = 1)          c_ij = a_i * b_ij
    B is a column vector; rows match (m = p; q = 1)          c_ij = a_ij * b_i
    A is a row vector; columns match (m = 1; n = q)          c_ij = a_j * b_ij
    B is a row vector; columns match (n = q; p = 1)          c_ij = a_ij * b_j
    A is a column vector, B is a row vector (n = 1; p = 1)   c_ij = a_i * b_j
    A is a row vector, B is a column vector (m = 1; q = 1)   c_ij = a_j * b_i
    A is a scalar (m = 1 and n = 1)                          c_ij = a * b_ij
    B is a scalar (p = 1 and q = 1)                          c_ij = a_ij * b

If none of the above conditions are satisfied the result is undefined, and an error is flagged.

Note that this convention makes it unnecessary, in most cases, to use diagonal matrices to perform transformations by means of ordinary matrix multiplication: if Y = XV, where V is diagonal, it is computationally much more convenient to obtain Y via the instruction

    matrix Y = X .* v

where v is a row vector containing the diagonal of V.

In column-wise concatenation of an m x n matrix A and an m x p matrix B, the result is an m x (n + p) matrix. That is,

    matrix C = A ~ B

produces C = [ A  B ]. Row-wise concatenation of an m x n matrix A and a p x n matrix B produces an (m + p) x n matrix, that is,

    matrix C = A | B

produces C with A stacked on top of B.

17.6 Matrix-scalar operators

For matrix A and scalar k, the operators shown in Table 17.2 are available. (Addition and subtraction were discussed in section 17.5, but we include them in the table for completeness.) In addition, for square A and scalar x, B = A^x produces a matrix B which is A raised to the power x, but only if either of two conditions is satisfied. First, if x is a non-negative integer, then Golub and Van Loan's "Binary Powering" Algorithm 11.2.2 is used (see Golub and Van Loan, 1996) and A can then be a generic square matrix. Second, if A is positive semidefinite, the power is computed via its eigendecomposition and x can be a real number, subject to the constraint that x can be negative only if A is invertible.

    Expression          Effect
    matrix B = A * k    b_ij = k * a_ij
    matrix B = A / k    b_ij = a_ij / k
    matrix B = k / A    b_ij = k / a_ij
    matrix B = A + k    b_ij = a_ij + k
    matrix B = A - k    b_ij = a_ij - k
    matrix B = k - A    b_ij = k - a_ij
    matrix B = A % k    b_ij = a_ij modulo k

    Table 17.2: Matrix-scalar operators

17.7 Matrix functions

Most of the functions available for scalars and series also apply to matrices, on an element-by-element basis. This is the case for log, exp, sqrt, sin and many others. For example, if a matrix A is already defined, then

    matrix B = sqrt(A)

generates a matrix such that b_ij = sqrt(a_ij). All such functions require a single matrix as argument, or an expression which evaluates to a single matrix. (Note that to find the "matrix square root" you need the cholesky function; see below. And since the exp function computes the exponential element by element, it does not return the matrix exponential unless the matrix is diagonal; to get the matrix exponential, use mexp.)

In this section we review some aspects of functions that apply specifically to matrices. A full account of each function is available in the Gretl Command Reference.
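To make the element-wise/matrix distinction concrete, here is a small sketch of our own, contrasting sqrt with the Cholesky "matrix square root" (any positive definite matrix will do):

    # element-wise square root versus the Cholesky factor
    matrix A = {4, 1; 1, 2}
    matrix B = sqrt(A)       # b_ij = sqrt(a_ij), element by element
    matrix L = cholesky(A)   # lower-triangular L such that L*L' = A
    eval B .* B - A          # zero: sqrt works element-wise
    eval L * L' - A          # zero: L is a "matrix square root"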
    Matrix manipulation: bin2dec, cnameset, cols, dec2bin, diag, diagcat, halton, I, lower, mlag, mnormal, mrandgen, mreverse, mshape, msortby, msplitby, muniform, ones, rnameset, rows, selifc, selifr, seq, trimr, unvech, upper, vec, vech, zeros

    Matrix algebra: cholesky, cnumber, commute, conv2d, det, eigen, eigengen, eigensym, eigsolve, fft, ffti, ginv, hdprod, infnorm, inv, invpd, ldet, Lsolve, mexp, mlog, nullspace, onenorm, psdroot, qform, qrdecomp, rank, rcond, svd, toepsolv, tr, transp

    Statistics/transformations: aggregate, bkw, corr, cov, ecdf, fcstats, ghk, gini, imaxc, imaxr, iminc, iminr, kpsscrit, maxc, maxr, mcorr, mcov, mcovg, meanc, meanr, minc, minr, mols, mpols, mrls, mxtab, normtest, npcorr, princomp, prodc, prodr, quadtable, quantile, ranking, resample, sdc, sphericorr, sst, sumc, sumr, uniq, values

    Numerical methods: BFGSmax, BFGScmax, fdjac, fzero, GSSmax, NMmax, NRmax, numhess, simann

    Table 17.3: Matrix functions by category

Matrix reshaping

In addition to the methods discussed in sections 17.1 and 17.3, a matrix can also be created by rearranging the elements of a pre-existing matrix. This is accomplished via the mshape function. It takes three arguments: the input matrix, A, and the rows and columns of the target matrix, r and c respectively. Elements are read from A and written to the target in column-major order. If A contains fewer elements than n = r x c, they are repeated cyclically; if A has more elements, only the first n are used. For example:

    matrix a = mnormal(2,3)
    a
    matrix b = mshape(a, 3, 1)
    b
    matrix b = mshape(a, 5, 2)
    b

produces output of the following form:

    a (2 x 3)

      1.2323    0.99714   0.39078
      0.54363   0.43928   0.48467

    Generated matrix b
    b (3 x 1)

      1.2323
      0.54363
      0.99714

    Replaced matrix b
    b (5 x 2)

      1.2323    0.48467
      0.54363   1.2323
      0.99714   0.54363
      0.43928   0.99714
      0.39078   0.43928

Multiple returns and the "null" keyword

Some functions take one or more matrices as arguments and compute one or more matrices; these are:

    eigensym   Eigen-analysis of symmetric matrix
    eigen      Eigen-analysis of general matrix
    mols       Matrix OLS
    qrdecomp   QR decomposition
    svd        Singular value decomposition (SVD)

The general rule is: the "main" result of the function is always returned as the result proper. Auxiliary returns, if needed, are retrieved using pre-existing matrices, which are passed to the function as "pointers" (see section 14.4). If such values are not needed, the pointer may be substituted with the keyword null.

The syntax for qrdecomp and eigensym is of the form

    matrix B = func(A, &C)

The first argument, A, represents the input data, that is, the matrix whose decomposition or analysis is required. The second argument must be either the name of an existing matrix preceded by & (to indicate the "address" of the matrix in question), in which case an auxiliary result is written to that matrix, or the keyword null, in which case the auxiliary result is not produced. In case a non-null second argument is given, the specified matrix will be overwritten with the auxiliary result. (It is not required that the existing matrix be of the right dimensions to receive the result.)

The function eigensym computes the eigenvalues, and optionally the right eigenvectors, of a symmetric n x n matrix. The eigenvalues are returned directly in a column vector of length n; if the eigenvectors are required, they are returned in an n x n matrix. For example:

    matrix V
    matrix E = eigensym(M, &V)
    matrix E = eigensym(M, null)

In the first case E holds the eigenvalues of M and V holds the eigenvectors. In the second, E holds the eigenvalues but the eigenvectors are not computed.

The function eigen computes the eigenvalues, and optionally the right and/or left eigenvectors, of a general n x n matrix. (The legacy function eigengen used to be the way to do this, prior to gretl 2019d.) Following the input matrix argument there are two slots for matrix addresses, the first to retrieve the right eigenvectors and the second for the left. Calls to this function should therefore conform to one of the following patterns:

    # get the eigenvalues only
    matrix E = eigen(M)

    # get the right eigenvectors as well
    matrix V
    matrix E = eigen(M, &V)

    # get both sets of eigenvectors
    matrix V
    matrix W
    matrix E = eigen(M, &V, &W)

    # get the left eigenvectors but not the right
    matrix W
    matrix E = eigen(M, null, &W)

The eigenvalues are returned directly in a complex n-vector. If the eigenvectors are wanted, they are returned in an n x n complex matrix.
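As a quick illustration of the pointer mechanism, one can verify the usual eigen-identity M V = V diag(E) on a small symmetric matrix of our own devising:

    # verify M*V = V*diag(E) for a symmetric matrix
    matrix M = {2, 1; 1, 2}
    matrix V
    matrix E = eigensym(M, &V)
    eval M*V - V .* E'   # scales column j of V by E[j]; should be zero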
The qrdecomp function computes the QR decomposition of an m x n matrix A: A = QR, where Q is an m x n orthogonal matrix and R is an n x n upper triangular matrix. The matrix Q is returned directly, while R can be retrieved via the second argument. Here are two examples:

    matrix R
    matrix Q = qrdecomp(M, &R)
    matrix Q = qrdecomp(M, null)

In the first example the triangular R is saved as R; in the second, R is discarded. The first line above shows an example of a simple declaration of a matrix: R is declared to be a matrix variable but is not given any explicit value. In this case the variable is initialized as a 1 x 1 matrix whose single element equals zero.

The syntax for svd is

    matrix B = func(A, &C, &D)

The function svd computes all or part of the singular value decomposition of the real m x n matrix A. Let k = min(m, n). The decomposition is A = UΣV', where U is an m x k orthogonal matrix, Σ is a k x k diagonal matrix, and V' is a k x n orthogonal matrix. (This is not the only definition of the SVD: some writers define U as m x m, Σ as m x n with k non-zero diagonal elements, and V as n x n.) The diagonal elements of Σ are the singular values of A; they are real and non-negative, and are returned in descending order. The first k columns of U and V are the left and right singular vectors of A.

The svd function returns the singular values, in a vector of length k. The left and/or right singular vectors may be obtained by supplying non-null values for the second and/or third arguments respectively. For example:

    matrix s = svd(A, &U, &V)
    matrix s = svd(A, null, null)
    matrix s = svd(A, null, &V)

In the first case both sets of singular vectors are obtained; in the second case only the singular values are obtained; and in the third, the right singular vectors are obtained but U is not computed. Please note: when the third argument is non-null, it is actually V' that is provided. To reconstitute the original matrix from its SVD, one can do:

    matrix s = svd(A, &U, &V)
    matrix B = (U .* s) * V

Finally, the syntax for mols is

    matrix B = mols(Y, X, &U)

This function returns the OLS estimates obtained by regressing the T x n matrix Y on the T x k matrix X, that is, a k x n matrix holding (X'X)^{-1}X'Y. The Cholesky decomposition is used. The matrix U, if not null, is used to store the residuals.
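For instance, the residual-retrieval mechanism works like this; the data here are arbitrary simulated matrices of our own:

    # mols with residual retrieval
    matrix X = mnormal(50, 3)
    matrix y = X * {1; 2; 3} + mnormal(50, 1)
    matrix U
    matrix b = mols(y, X, &U)
    eval maxc(abs(U - (y - X*b)))   # should be (numerically) zero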
Reading and writing matrices from/to text files

The two functions mread and mwrite can be used for basic matrix input/output. This can be useful to enable gretl to exchange data with other programs. The mread function accepts one string parameter: the name of the (plain text) file from which the matrix is to be read. The file in question may start with any number of comment lines, defined as lines that start with the hash mark; such lines are ignored. Beyond that, the content must conform to the following rules:

1. The first non-comment line must contain two integers, separated by a space or a tab, indicating the number of rows and columns, respectively.
2. The columns must be separated by spaces or tab characters.
3. The decimal separator must be the dot "." character.

Should an error occur (such as the file being badly formatted or inaccessible), an empty matrix (see section 17.2) is returned.

The complementary function mwrite produces text files formatted as described above. The column separator is the tab character, so import into spreadsheets should be straightforward. Usage is illustrated in Listing 17.1. Matrices stored via the mwrite command can be easily read by other programs; the following table summarizes the appropriate commands for reading a matrix A from a file called a.mat in some widely-used programs. (Matlab users may find the Octave example helpful, since the two programs are mostly compatible with one another. Note that the Python example requires that the numpy module is loaded.)

    Program   Sample code
    GAUSS     tmp[] = load a.mat;
              A = reshape(tmp[3:rows(tmp)], tmp[1], tmp[2]);
    Octave    fd = fopen("a.mat");
              [r, c] = fscanf(fd, "%d %d", "C");
              A = reshape(fscanf(fd, "%g", r*c), c, r)';
              fclose(fd);
    Ox        decl A = loadmat("a.mat");
    R         A <- as.matrix(read.table("a.mat", skip=1))
    Python    A = numpy.loadtxt('a.mat', skiprows=1)
    Julia     A = readdlm("a.mat", skipstart=1)

Optionally, the mwrite and mread functions can use gzip compression: this is invoked if the name of the matrix file has the suffix ".gz". In this case the elements of the matrix are written in a single column. Note, however, that compression should not be applied when writing matrices for reading by third-party software, unless you are sure that the software can handle compressed data.

Listing 17.1: Matrix input/output via text files

    nulldata 64
    scalar n = 3
    string f1 = "a.csv"
    string f2 = "b.csv"
    matrix a = mnormal(n, n)
    matrix b = inv(a)
    err = mwrite(a, f1)
    if err != 0
        fprintf "Failed to write %s\n", f1
    else
        err = mwrite(b, f2)
    endif
    if err != 0
        fprintf "Failed to write %s\n", f2
    else
        c = mread(f1)
        d = mread(f2)
        a = c*d
        printf "The following matrix should be an identity matrix\n"
        print a
    endif

17.8 Matrix accessors

In addition to the matrix functions discussed above, various "accessors" allow you to create copies of internal matrices associated with models previously estimated. These are set out in Table 17.4.

    $coeff    matrix of estimated coefficients
    $compan   companion matrix (after VAR or VECM estimation)
    $jalpha   matrix α (loadings) from Johansen's procedure
    $jbeta    matrix β (cointegration vectors) from Johansen's procedure
    $jvbeta   covariance matrix for the unrestricted elements of β from Johansen's procedure
    $rho      autoregressive coefficients for error process
    $sigma    residual covariance matrix
    $stderr   matrix of estimated standard errors
    $uhat     matrix of residuals
    $vcv      covariance matrix of parameter estimates
    $vma      VMA matrices in stacked form (see section 32.2)
    $yhat     matrix of fitted values

    Table 17.4: Matrix accessors for model data

Many of the accessors in Table 17.4 behave somewhat differently depending on the sort of model that is referenced, as follows:

- Single-equation models: $sigma gets a scalar (the standard error of the regression); $coeff and $stderr get column vectors; $uhat and $yhat get series.
- System estimators: $sigma gets the cross-equation residual covariance matrix; $uhat and $yhat get matrices with one column per equation. The format of $coeff and $stderr depends on the nature of the system: for VARs and VECMs (where the matrix of regressors is the same for all equations) these return matrices with one column per equation, but for other system estimators they return a big column vector.
- VARs and VECMs: $vcv is not available, but (X'X)^{-1} (where X is the common matrix of regressors) is available as $xtxinv, such that for VARs and VECMs without restrictions on α, a $vcv equivalent can be easily and efficiently constructed as $sigma ** $xtxinv.

If the accessors are given without any prefix, they retrieve results from the last model estimated, if any. Alternatively, they may be prefixed with the name of a saved model plus a period, in which case they retrieve results from the specified model. Here are some examples:

    matrix u = $uhat
    matrix b = m1.$coeff
    matrix v2 = m1.$vcv[1:2,1:2]

The first command grabs the residuals from the last model; the second grabs the coefficient vector from model m1; and the third, which uses the mechanism of sub-matrix selection described above, grabs a portion of the covariance matrix from model m1.

If the model in question is a VAR or VECM (only), $compan and $vma return the companion matrix and the VMA matrices in stacked form, respectively (see section 32.2 for details).
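As a small usage sketch, accessors combine naturally with matrix arithmetic; here we recompute t-ratios by hand after a simple regression (using the data4-1 file that appears elsewhere in this chapter):

    open data4-1
    ols price const sqft --quiet
    matrix t = $coeff ./ $stderr   # element-wise ratio gives t-statistics
    print t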
After a vector error correction model is estimated via Johansen's procedure, the matrices $jalpha and $jbeta are also available. These have a number of columns equal to the chosen cointegration rank; therefore the product

    matrix Pi = $jalpha * $jbeta'

returns the reduced-rank estimate of Π. Since β is automatically identified via the Phillips normalization (see section 33.5), its unrestricted elements do have a proper covariance matrix, which can be retrieved through the $jvbeta accessor.

17.9 Namespace issues

Matrices share a common namespace with data series and scalar variables. In other words, no two objects of any of these types can have the same name. It is an error to attempt to change the type of an existing variable, for example:

    scalar x = 3
    matrix x = ones(2,2)   # wrong!

It is possible, however, to delete or rename an existing variable, then reuse the name for a variable of a different type:

    scalar x = 3
    delete x
    matrix x = ones(2,2)   # OK

17.10 Creating a data series from a matrix

Section 17.1 above describes how to create a matrix from a data series or set of series. You may sometimes wish to go in the opposite direction, that is, to copy values from a matrix into a regular data series. The syntax for this operation is

    series sname = mspec

where sname is the name of the series to create and mspec is the name of the matrix to copy from, possibly followed by a matrix selection expression. Here are two examples:

    series s = x
    series u1 = U[,1]

It is assumed that x and U are pre-existing matrices. In the second example the series u1 is formed from the first column of the matrix U.

For this operation to work, the matrix (or matrix selection) must be a vector with length equal to either the full length of the current dataset, n, or the length of the current sample range, n'. If n' < n then only n' elements are drawn from the matrix; if the matrix or selection comprises n elements, the n' values starting at element t1 are used, where t1 represents the starting observation of the sample range. Any values in the series that are not assigned from the matrix are set to the missing code.
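The following fragment, an invented example, illustrates both the full-length and sample-range cases:

    # filling a series from a matrix under a restricted sample
    nulldata 10
    matrix U = mnormal(10, 2)
    series u1 = U[,1]      # a 10-element vector maps one-to-one
    smpl 4 8               # restrict the sample to observations 4..8
    series u2 = U[1:5,1]   # 5 values fill the current sample range;
                           # outside it, u2 is set to NA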
17.11 Matrices and lists

To facilitate the manipulation of named lists of variables (see chapter 15), it is possible to convert between matrices and lists. In section 17.1 above we mentioned the facility for creating a matrix from a list of variables, as in

    matrix M = {listname}

That formulation, with the name of the list enclosed in braces, builds a matrix whose columns hold the variables referenced in the list. What we are now describing is a different matter: if we say

    matrix M = listname

(without the braces), we get a row vector whose elements are the ID numbers of the variables in the list. This special case of matrix generation cannot be embedded in a compound expression: the syntax must be as shown above, namely simple assignment of a list to a matrix.

To go in the other direction, you can include a matrix on the right-hand side of an expression that defines a list, as in

    list Xl = M

where M is a matrix. The matrix must be suitable for conversion, that is, it must be a row or column vector containing non-negative integer values, none of which exceeds the highest ID number of a series in the current dataset. Listing 17.2 illustrates the use of this sort of conversion to "normalize" a list, moving the constant (variable 0) to first position.

17.12 Deleting a matrix

To delete a matrix, just write

    delete M

where M is the name of the matrix to be deleted.

17.13 Printing a matrix

To print a matrix, the easiest way is to give the name of the matrix in question on a line by itself, which is equivalent to using the print command:

    matrix M = mnormal(100,2)
    M
    print M

You can get finer control on the formatting of output by using the printf command, as illustrated in the interactive session below:

    ? matrix Id = I(2)
    Generated matrix Id
    ? print Id
    Id (2 x 2)

    1   0
    0   1

    ? printf "%10.3f", Id
         1.000      0.000
         0.000      1.000

Listing 17.2: Manipulating a list

    function void normalize_list (matrix *x)
        # If the matrix (representing a list) contains var 0,
        # but not in first position, move it to first position
        if x[1] != 0
            scalar k = cols(x)
            loop for (i=2; i<=k; i++)
                if x[i] == 0
                    x[i] = x[1]
                    x[1] = 0
                    break
                endif
            endloop
        endif
    end function

    open data9-7
    list Xl = 2 3 0 4
    matrix x = Xl
    normalize_list(&x)
    list Xl = x
    list Xl print

For presentation purposes you may wish to give titles to the columns of a matrix. For this you can use the cnameset function: the first argument is a matrix, and the second is either a named list of variables, whose names will be used as headings, or a string that contains as many space-separated substrings as the matrix has columns. For example:

    matrix M = mnormal(3,3)
    cnameset(M, "foo bar baz")
    print M

    M (3 x 3)

           foo          bar          baz
        1.7102      0.76072     0.089406
       0.99780       1.9003      0.25123
       0.91762      0.39237       1.6114

17.14 Example: OLS using matrices

Listing 17.3 shows how matrix methods can be used to replicate gretl's built-in OLS functionality.

Listing 17.3: OLS via matrix methods

    open data4-1
    matrix X = {const, sqft}
    matrix y = {price}
    matrix b = invpd(X'X) * X'y
    print "estimated coefficient vector"
    b
    matrix u = y - X*b
    scalar SSR = u'u
    scalar s2 = SSR / (rows(X) - rows(b))
    matrix V = s2 * inv(X'X)
    V
    matrix se = sqrt(diag(V))
    print "estimated standard errors"
    se
    # compare with built-in function
    ols price const sqft --vcv

Chapter 18 Complex matrices

18.1 Introduction

Native support for complex matrices was added to gretl in version 2019d. (Prior to that release, gretl offered improvised support for some complex functionality; see section 18.7 for details.) Not all of hansl's matrix functions accept complex input, but we have enabled a sizable subset of these functions, which should suffice for most econometric purposes.

Complex numbers are not used in most areas of econometrics, but there are a few notable exceptions: among these, complex numbers allow for an elegant treatment of univariate spectral analysis of time series, and become indispensable if you consider multivariate spectral analysis; see for example Shumway and Stoffer (2017). A more recent example is the numerical solution of linear models with rational expectations, which are widely used in modern macroeconomics, for which the complex Schur factorization has become the tool of choice (Klein, 2000).

A first point to note is that complex values are treated as a special case of the hansl matrix type: there's no "complex" type as such. Complex scalars fall under the matrix type, as 1 x 1 matrices; the hansl "scalar" type is only for real values, as is the "series" type. A 1 x 1 complex matrix should do any work you might require of a complex scalar.

Before we proceed to the details of complex matrices in gretl, here's a brief reminder of the relevant concepts and notation. Complex numbers are pairs of the form a + bi, where a and b are real numbers and i is defined as the square root of -1: a is the real part and b the imaginary part. One can specify a complex number either via a and b or in "polar" form. The latter pertains to the complex plane, which has the real component on the horizontal axis and the imaginary component on the vertical. The polar representation of a complex number is composed of the length r of the ray from the origin to the point in question, and the angle θ subtended between the positive real axis and this ray, measured counterclockwise in radians.
In polar form, the complex number z = a + bi can be written as

    z = |z| (cos θ + i sin θ) = |z| e^{iθ}

where |z| = r = sqrt(a² + b²) and θ = tan⁻¹(b/a). The quantity |z| is known as the modulus of z, and θ as its complex argument (or sometimes phase). The notation z̄ is used for the complex conjugate of z: if z = a + bi, then z̄ = a - bi.

18.2 Creating a complex matrix

The standard constructor for complex matrices is the complex() function. This takes two arguments, giving the real and imaginary parts respectively, and sticks them together, as in

    C = complex(A, B)

Four cases are supported, as follows:

- A and B are both m x n real matrices: C is an m x n complex matrix such that c_kj = a_kj + b_kj i.
- A and B are both scalars: C is a 1 x 1 complex matrix such that c = a + bi.
- A is an m x n real matrix and B is a scalar: C is an m x n matrix such that c_kj = a_kj + bi.
- A is a scalar and B is an m x n real matrix: C is an m x n matrix such that c_kj = a + b_kj i.

In addition, complex matrices may naturally arise as the result of certain computations.

With both real and complex matrices in circulation, one may wish to determine whether a particular matrix is complex. The function iscomplex() can tell you. Passed an identifier, it returns non-zero if it names a complex matrix, 0 if it names a real matrix, or NA otherwise. The non-zero return value is either 1 or 2, with the following interpretation: 1 indicates that the matrix is "nominally complex" (each element is represented as having a real part and an imaginary part, but all imaginary parts are zero); 2 indicates that at least one element has a non-zero imaginary part. The following code snippet illustrates the point:

    matrix z1 = complex(1, 0)
    scalar a = iscomplex(z1)
    matrix z2 = complex(1, 1)
    scalar b = iscomplex(z2)
    printf "a = %d, b = %d\n", a, b

The code above gives

    a = 1, b = 2

18.3 Indexation

Indexation of complex matrices works as with real matrices, on the understanding that each element of a complex matrix is a complex pair. So, for example, C[i,j] gets you the complex pair at row i, column j of C, in the form of a 1 x 1 complex matrix.

If you wish to access just the real or imaginary part of a given element, or range of elements, you can use the functions Re() or Im(), as in

    scalar rij = Re(C[i,j])

which gets you the real part of c_ij.

In addition, the "dummy selectors" real and imag can be used to assign to just the real or imaginary component of a complex matrix. Here are two examples:

    # replace the real part of C with random normals
    C[real] = mnormal(rows(C), cols(C))

    # set the imaginary part of C to all zeros
    C[imag] = 0

The replacement must be either a real matrix of the same dimensions as the target, or a scalar. Further, the real and imag selectors may be combined with regular selectors to access specific portions of a complex matrix, for either reading or writing. Examples:

    # retrieve the real part of a submatrix of C
    matrix R = C[1:2,1:2][real]

    # set the imaginary part of C[3,3] to y
    C[3,3][imag] = y
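Putting these pieces together, here is a small sketch of our own, reading and then overwriting the two components:

    # round trip with the real/imag selectors
    matrix C = complex(mnormal(2,2), mnormal(2,2))
    matrix A = C[real]    # real part, as a real matrix
    matrix B = C[imag]    # imaginary part, likewise
    eval iscomplex(C)     # 2: some imaginary parts are non-zero
    C[imag] = 0           # wipe the imaginary component
    eval iscomplex(C)     # 1: now only "nominally complex"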
18.4 Operators

Most of the operators available for working with real matrices are also available for complex ones; this includes the "dot" operators, which work element-wise or by "broadcasting" vectors. Moreover, mixed operands are accepted, as in D = C + A where C is complex and A real; the result, D, will be complex. In such cases the real operand is treated as a complex matrix with an all-zero imaginary part.

The operators not defined for complex values are:

- Those that include the inequality tests ">" or "<", since complex values as such cannot be compared as greater or lesser (though they can be compared as equal or not equal).
- The (real) modulus operator (percent sign), as in x % y, which gives the remainder on division of x by y.

As for real matrices, the transposition operator "'" is available in both unary form, as in B = A', and binary form, as in C = A'B (transpose-multiply). But note that for complex A this means the conjugate transpose, A^H. If you need the non-conjugated transpose you can use transp().

You may wish to note: although none of gretl's explicit regression functions (or commands) accept complex input, you can calculate parameter estimates for a least-squares regression of complex Y (T x 1) on complex X (T x k) via B = X \ Y.

18.5 Functions

To give an idea of what works, and what doesn't work, for complex matrices, we'll walk through the hansl function-space using the categories employed in gretl's online "Function reference" (under the Help menu in the GUI program).

Linear algebra. The functions that accept complex arguments are: cholesky, det, ldet, eigen, eigensym (for Hermitian matrices), fft, ffti, inv, ginv, hdprod, mexp, mlog, qrdecomp, rank, svd, tr, and transp. Note, however, that mexp and mlog require that the input matrix be diagonalizable, and cholesky requires a positive definite Hermitian matrix. In addition there are the complex-only functions ctrans, which gives the conjugate transpose (transp gives the straight, non-conjugated, transpose of a complex matrix), and schur, for the Schur factorization.

Matrix building. Given what was said in section 18.2 above, several of the functions in this category should be thought of as applying to the real or imaginary part of a complex matrix (for example, ones and mnormal), and are of course usable in that way. However, some of these functions can be applied to complex matrices as such, namely diag, diagcat, lower, upper, vec, vech and unvech. Please note: when unvech is applied to a suitable real vector it produces a symmetric matrix, but when applied to a complex vector it produces a Hermitian matrix. The only functions not available for complex matrices are cnameset and rnameset: that is, you cannot name the columns or rows of such matrices (although this restriction could probably be lifted without great difficulty).

Matrix shaping. The functions that accept complex input are: cols, rows, mreverse, mshape, selifc, selifr and trimr. The functions msortby, sort and dsort are excluded, for the reason mentioned in section 18.4.

Statistical. Supported for complex input: meanc, meanr, sumc, sumr, prodc and prodr. And that's all.

Mathematical. In the matrix context, these are functions that are applied element by element. For complex input the following are supported: log, exp and sqrt, plus all of the trigonometric functions with the exception of atan2. In addition there are the complex-only functions cmod (complex modulus, also accessible via abs), carg (complex "argument"), conj (complex conjugate), Re (real part) and Im (imaginary part). Note that carg(z) = atan2(y, x) for z = x + yi. Listing 18.1 illustrates usage of cmod and carg.

Transformations. In this category only two functions can be applied to complex matrices, namely cum and diff.
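The conjugate/non-conjugate distinction is easy to check directly; a minimal sketch, on a small example matrix of our own:

    # conjugate versus straight transposition
    matrix Z = complex({1, 2; 3, 4}, {1, -1; 0, 2})
    eval Z' - ctrans(Z)         # zero: ' is the conjugate transpose
    eval transp(Z) - conj(Z')   # zero: transp is the non-conjugated version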
18.6 File input/output

Complex matrices are stored and retrieved correctly in the XML serialization used for gretl session files (*.gretl).

The functions mwrite and mread work in two modes: binary mode, if the filename ends with ".bin", and text mode otherwise. Both modes handle complex matrices correctly if both the writing and the reading are to be done by gretl, but for exchange of data with "foreign" programs text mode will not work for complex matrices as a whole. The options are:

- In text mode, use mwrite and mread on the two parts of a complex matrix separately, and reassemble the matrix in the target program.
- Use binary mode (on the whole matrix), if this is supported for the given foreign program.

At present, binary-mode transfer of complex matrices is supported for octave, python and julia. Listing 18.2 shows some examples: we export a complex matrix to each of these programs in turn, calculate its inverse in the foreign program, then verify that the result as imported back into gretl is the same as that calculated in gretl.

18.7 Backward (in)compatibility

Prior to version 2019d gretl did not provide native support for complex matrices. It did, however, offer an improvised representation of such matrices for certain restricted purposes, taking the form of an expanded regular gretl matrix with real values and imaginary parts in odd- and even-numbered columns, respectively. The functions fft, eigengen and polroots returned matrices in this special form, and the functions cmult and cdiv operated on such matrices.

As of version 2022b, fft and polroots have been redefined to work with proper complex matrices, as described above. The other affected functions are deprecated and will be removed or redefined in a subsequent release. If you have any hansl code using the legacy representation, the following brief "porting" guide may be helpful.

Listing 18.1: Variant representations of complex numbers

We picked 8 points on the unit circle in the complex plane, so their modulus is constant and equal to 1. The Polar matrix below shows that the complex argument is expressed in radians; multiplying by 180/π gives degrees. The chk matrix verifies that we can retrieve the original representation of the complex values from the polar form in either of the two ways mentioned at the start of the chapter: z = |z|(cos θ + i sin θ), or z = |z| e^{iθ}.

    # complex values in a + bi form
    scalar rp5 = sqrt(0.5)
    matrix A = {1, rp5, 0, -rp5, -1, -rp5, 0, rp5}'
    matrix B = {0, rp5, 1, rp5, 0, -rp5, -1, -rp5}'
    matrix Z = complex(A, B)

    # calculate modulus and argument
    matrix zmod = cmod(Z)
    matrix theta = carg(Z)
    matrix Polar = zmod ~ theta ~ theta * (180/$pi)
    cnameset(Polar, "modulus radians degrees")
    printf "%12.4f", Polar

    # reconstitute the original Z matrix in two ways
    matrix Z1 = zmod .* complex(cos(theta), sin(theta))
    matrix Z2 = zmod .* exp(complex(0, theta))
    matrix chk = Z ~ Z1 ~ Z2
    print chk

Printing of Polar and chk:

         modulus      radians      degrees
          1.0000       0.0000       0.0000
          1.0000       0.7854      45.0000
          1.0000       1.5708      90.0000
          1.0000       2.3562     135.0000
          1.0000       3.1416     180.0000
          1.0000      -2.3562    -135.0000
          1.0000      -1.5708     -90.0000
          1.0000      -0.7854     -45.0000

     1.00000 + 0.00000i   1.00000 + 0.00000i   1.00000 + 0.00000i
     0.70711 + 0.70711i   0.70711 + 0.70711i   0.70711 + 0.70711i
     0.00000 + 1.00000i   0.00000 + 1.00000i   0.00000 + 1.00000i
    -0.70711 + 0.70711i  -0.70711 + 0.70711i  -0.70711 + 0.70711i
    -1.00000 + 0.00000i  -1.00000 + 0.00000i  -1.00000 + 0.00000i
    -0.70711 - 0.70711i  -0.70711 - 0.70711i  -0.70711 - 0.70711i
     0.00000 - 1.00000i   0.00000 - 1.00000i   0.00000 - 1.00000i
     0.70711 - 0.70711i   0.70711 - 0.70711i   0.70711 - 0.70711i

Listing 18.2: Exporting and importing complex matrices

    set seed 34756
    matrix C = complex(mnormal(3,3), mnormal(3,3))
    D = inv(C)
    mwrite(C, "C.bin", 1)

    foreign language=octave
        C = gretl_loadmat('C.bin');
        gretl_export(inv(C), 'oct_D.bin');
    end foreign

    octD = mread("oct_D.bin", 1)
    eval D - octD

    foreign language=python
        import numpy as np
        C = gretl_loadmat('C.bin')
        gretl_export(np.linalg.inv(C), 'py_D.bin')
    end foreign

    pyD = mread("py_D.bin", 1)
    eval D - pyD

    foreign language=julia
        C = gretl_loadmat("C.bin")
        gretl_export(inv(C), "jl_D.bin")
    end foreign

    jlD = mread("jl_D.bin", 1)
    eval D - jlD
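Incidentally, the binary files used in Listing 18.2 can also be round-tripped entirely within gretl; here is a minimal sketch (the file name is ours):

    # binary-mode round trip for a complex matrix, within gretl
    matrix C = complex(mnormal(2,2), mnormal(2,2))
    mwrite(C, "C.bin", 1)        # trailing 1: write to the user's "dot" directory
    eval C - mread("C.bin", 1)   # should be all zeros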
Porting old complex code

cmult and cdiv. These functions performed element-wise multiplication and division of complex column vectors in the old two-column form. The statements

    # old element-wise operations
    c1 = cmult(a1, b1)
    d1 = cdiv(a1, b1)

can be updated as

    # new element-wise operations
    c2 = a2 .* b2
    d2 = a2 ./ b2

where a2 and b2 are new-style complex vectors (or matrices). The following statements

    c3 = a2 * b2
    d3 = a2 / b2

are also valid, but have different effects: the first performs standard (rather than element-wise) multiplication of matrices, complex or real, and the second performs "right division", equivalent to a2 * inv(b2).

Note that while the return value from cmult and cdiv could be either a real vector or a two-column complex vector, the new-style operations yield a nominally complex result if at least one of the operands is complex, even if the result has an all-zero imaginary part.

A piece of code that appears in some contexts (such as calculation of a periodogram) is as follows: given a complex vector v, compute a vector w holding the squared moduli of the elements of v. The old-style code to accomplish this was

    # legacy: v has two columns
    w = sumr(v.^2)

and the new replacement is

    # current: v has a single complex column
    w = abs(v).^2

where abs gives the complex modulus.

eigengen. Most uses of this legacy function simply retrieve the eigenvalues of a "general" (that is, not symmetric) matrix, and do not exploit the option of retrieving eigenvectors. In that context it is straightforward to substitute a call to the new function eigen. The only point to note is that eigen returns a new-style complex vector; if you have need to convert this to the legacy representation you can use the cswitch function, which is documented in the Gretl Command Reference. In brief, the following code gives you the legacy equivalent of a new-style complex vector v:

    # v = newvec
    if v[imag] == 0
        oldv = v[real]
    else
        oldv = cswitch(v, 2)
    endif

polroots. This function now returns a new-style complex vector. As with eigengen, you can use cswitch to convert the vector if necessary.

Chapter 19 Calendar dates

19.1 Introduction

Any software that aims to handle dates and times must have a good built-in calendar. Gretl offers several functions to handle date and time information, which are documented in the Gretl Command Reference. To facilitate their effective use, this chapter lists the various possibilities for storing dates and times, and discusses ways of converting between variant representations. Our main focus in this chapter is dates as such (year, month and day), but we add some discussion of time-of-day where relevant. A final section addresses the somewhat arcane issue of handling historical dates on the Julian calendar.

First of all, it may be useful to distinguish two contexts:

- You have a time-series dataset in place (or a panel dataset with a well-defined time dimension).
- You have no such dataset in place, or perhaps no dataset at all.

While you can work with dates in the second case, in the first case you have extra resources. You probably know that if you open a dataset that is in fact time series, but gretl has not immediately recognized that fact, you can rectify matters by use of the setobs command, or via the menu item Data/Dataset structure in the gretl GUI. You may also know that with a panel dataset you can impose a definite dating and frequency in its time dimension, if appropriate; again via the setobs command, but with the --panel-time option.

In what follows, we state if a relevant function or accessor requires a time-series dataset, or well-defined panel-data time; otherwise you can assume it does not carry such a requirement.

19.2 Date and time representations

In gretl there is more than one way to encode a date such as "May 26th, 1993". Some are more intuitive, some less obvious from a human viewpoint but easier to handle for an algorithm. The basic representations we discuss here are:
1. the "three-numbers" approach
2. date as string
3. the ISO 8601 standard
4. the epoch day
5. Unix time (seconds)

We first explain what these representations are, then explain how to convert between them.

The three-numbers approach

Since a date (without regard to intra-day detail) basically consists of three numbers, it can obviously be encoded in precisely that way. For example, the date May 26th, 1993 can be stored as

    scalar y = 1993
    scalar m = 5
    scalar d = 26

Gretl's multiple-element objects can be used to extend this approach, for example by using a 3-element vector for year, month and day, or a 3-column matrix for storing as many dates as desired. If you wish to store dates as series in your dataset, this approach would lead you to use three series, possibly grouping them into a list, as in

    nulldata 60
    setobs 7 2020-01-01
    series y = $obsmajor
    series m = $obsminor
    series d = $obsmicro
    list DATE = y m d

The example above will generate daily dates for January and February 2020. Note that use of the $obs* accessors requires a time-series dataset, and $obsmicro in particular requires daily data; see Section 19.5 for details. Some CSV files represent dates in this sort of "broken-down" format, with various conventions on the ordering of the three components.

Date as string

To a human being, this may seem the most natural choice. The string "26/6/1953" is pretty much unambiguous. But using such a format for machine processing can be problematic, due to differing conventions regarding the separators between day, month and year, as well as the order in which the three pieces of information are arranged. For example, "2/6/1953" is not unambiguous: it will naturally be read differently by Europeans and Americans. This can be a problem with CSV files found "in the wild", containing arbitrarily formatted dates. Therefore, gretl provides fairly comprehensive functionality for converting dates of this sort into more manageable formats.

The ISO 8601 standard

Among other things, the ISO 8601 standard provides two representations for a daily date: the "basic" representation, which uses an 8-digit integer, and the "extended" representation, which uses a 10-character string. In the basic version the first four digits represent the year, the middle two the month and the rightmost two the day, so that, for example, 20170219 indicates February 19th, 2017. The extended representation is similar, except that the date is a string in which the items are separated by hyphens, so the same date would be represented as "2017-02-19".

In several contexts ISO 8601 dates are privileged by gretl: the ISO format is taken as the default, and you need to supply an additional function argument, or take extra steps, if the representation is non-standard. Using series and/or matrices to store ISO 8601 basic dates is perfectly straightforward.

Epoch days

In gretl, an "epoch day" is an unsigned 32-bit integer which represents a date as the number of days since January 1, 1 AD (that is, the first day of the Common Era), on the proleptic Gregorian calendar. (The term "proleptic", as applied to a calendar, indicates that it is extrapolated backwards or forwards relative to its period of actual historical use.) For example, 1993-05-26 corresponds to epoch day 727709. (This representation derives from the "Julian day" of the astronomers, which is also a count of days since a benchmark, namely January 1, 4713 BC, at which time certain astronomical cycles were aligned.) This is the convention used by the GLib library, on which gretl depends for much of its calendrical calculation.
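A one-liner suffices to check the correspondence just described (epochday and isodate are presented more fully in Section 19.3):

    # epoch day for 1993-05-26, and back to ISO 8601 basic
    scalar ed = epochday(1993, 5, 26)
    printf "epoch day %d -> ISO basic %d\n", ed, isodate(ed)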
Since an epoch day is an unsigned integer, neither GLib nor gretl supports dates "BC", or prior to the Common Era.

This representation has several advantages. Like ISO 8601 basic, it lends itself naturally to storing dates as series. Compared to ISO 8601 it has the disadvantage of not being readily understandable by humans, but to compensate for that it makes it very easy to determine the length of a range of dates. ISO basic dates can be used for comparison (which of two dates, on a given calendar, refers to a later day?), but with epoch days one can carry out fully-fledged date arithmetic. Epoch days are always consecutive by construction, but 8-digit basic dates are consecutive only within a given month. (In fact, they advance by 101 minus the number of days in the previous month at the start of each month other than January, and by 8870 at the start of each year.) For more on arithmetic with epoch days, see Section 19.4.

Unix seconds

In this representation, the cornerstone of date and time handling on Unix-like systems, time is the number of seconds since midnight at the start of 1970, according to Coordinated Universal Time (UTC). (UTC is, to a first approximation, the time such that the Sun is at its highest point at noon over the "prime meridian", the line of 0 degrees longitude, which as a matter of historical contingency runs through Greenwich, England.) This format is therefore ideal for storing fine-grained information, including time of day as well as date.

This representation is not transparent to humans (for example, the number 123456789 falls on Thursday, 29 Nov 1973), but again, it lends itself naturally to calculation. Since Unix seconds are hard-wired to UTC, a given value will correspond to different times, and possibly different dates, if evaluated in different time zones; we expand on this point below.

19.3 Converting between representations

To support conversion between different representations, gretl provides several dedicated functions, although in some cases conversion can be carried out by using general-purpose functions. Figure 19.1 displays a summary: solid lines represent dedicated functions, while dashed lines indicate that no special function is needed; numerical formats are depicted as boxes and string formats as ovals.

Figure 19.1: Conversions between different date formats. (The diagram links the epoch day, ISO basic integer, ISO extended string, three-numbers, generic date string and Unix seconds representations, via the functions isodate, epochday, isoconv, strptime/strftime and strpday/strfday, plus the generic printf, sscanf, genr, substr and atof.)

For a full description of the functions referenced in the figure, see the Gretl Command Reference. In the rest of this section we discuss several cases of conversion, with the help of examples.

Strings and three-number dates

As indicated in Figure 19.1, converting between date strings and the three-number representation does not require date-specific functions. The two generic functions that can be used for this purpose are printf and sscanf. Here's how. Suppose you encode a date via the three scalars d = 30, m = 10 and y = 1983. You can use printf to turn it into a date string rather easily, as in

    eu_s = printf("%d/%d/%d", d, m, y)
    us_s = printf("%d/%d/%d", m, d, y)

where the two strings follow the European and US conventions, respectively.
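If zero-padded, ISO-style output is wanted instead, the format string can be adjusted; a small variant of our own on the example above:

    # ISO 8601 extended, with zero padding
    iso_s = printf("%04d-%02d-%02d", y, m, d)   # gives "1983-10-30"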
The reverse operation, using the sscanf function, is a little trickier (see the Gretl Command Reference for a full illustration). The string s = "1983/10/30" can be broken down into three scalars as

    scalar d m y
    n = sscanf(s, "%d/%d/%d", y, m, d)

Note that in this case "d" in the format specification does not mean "day", but rather "decimal integer" (which is why there are three instances). Alternatively, one could have used a 3-element vector, as in

    matrix date = zeros(1,3)
    n = sscanf(s, "%d/%d/%d", date[1], date[2], date[3])

Decomposing a series of "basic" dates

To generate, from a series of dates in ISO 8601 basic format, distinct series holding year, month and day, the function isoconv can be used. This function should be passed the original series followed by "pointers" to the series to be filled out. For example, if we have a series named dates in the prescribed format, we might do

    series y, m, d
    isoconv(dates, &y, &m, &d)

This is mostly just a convenience function: provided the dates input is valid on the (possibly proleptic) Gregorian calendar, it is equivalent to

    series y = floor(dates/10000)
    series m = floor((dates - 10000*y)/100)
    series d = dates - 10000*y - 100*m

However, there is some value added: isoconv checks the validity of the dates input. If the implied year, month and day for any dates observation do not correspond to a valid date, then all the derived series will have value NA at that observation.

The inverse operation is trivial:

    series dates = 10000*y + 100*m + d

The use of series here means that such operations require that a dataset is in place; but although they would most naturally occur in the context of a time-series dataset, they could equally well occur with an undated dataset, since an ISO 8601 basic value is just a numeric value with some restrictions, and such values do not have to appear in chronological order.

String/numeric conversions: dedicated functions

The primary means of converting between string and scalar numeric representations of dates and times is provided by two pairs of functions: strptime/strftime and strpday/strfday. The first of each pair takes string input and outputs a numeric value, and the second performs the inverse operation, as shown in Table 19.1. With the first pair the numeric value is Unix seconds; with the second it's an epoch day. Numeric values are always relative to UTC, and string values are (by default, at least) always relative to local time.

    function   input                      output
    strptime   date/time string, format   Unix seconds
    strftime   Unix seconds, format       date/time string
    strpday    date string, format        epoch day
    strfday    epoch day, format          date string

    Table 19.1: String/numeric date-time conversions

Before moving on, let's be clear on what we mean by "local time". Generically, this is time according to the local time zone, with or without a "Daylight saving" or "Summer" adjustment, depending on the time of year. In a computing context we have to be more specific: the local time zone is whatever is set as such (via the operating system, and possibly adjusted via an environment variable) on the host computer. It will usually be the same as the geographically local zone, but there's nothing to stop a user making a different setting.
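For example, a minimal round trip through Unix seconds, run in local time, looks like this:

    # strptime/strftime round trip, relative to local time
    scalar t = strptime("2017-02-19", "%Y-%m-%d")
    eval strftime(t, "%Y-%m-%d")   # recovers "2017-02-19"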
Dates as string-valued series

It often happens that CSV files contain date information stored as strings. Take for example a file containing earthquake data, like the following (see https://www.kaggle.com/datasets/usgs/earthquake-database for the dataset of which this is an extract):

    Date        Time      Latitude   Longitude   Magnitude
    01/02/1965  13:44:18    19.246     145.616         6.0
    01/04/1965  11:29:49     1.863     127.352         5.8
    01/05/1965  18:05:58    20.579     173.972         6.2
    01/08/1965  18:49:43    59.076      23.557         5.8
    01/09/1965  13:32:50    11.938     126.427         5.8
    01/10/1965  13:36:32    13.405     166.629         6.7
    01/12/1965  13:32:25    27.357      87.867         5.9
    01/15/1965  23:17:42    13.309     166.212         6.0

Suppose we want to convert the Date column to epoch days. Note that the date format follows the American convention: month/day/year. The simplest way to accomplish the task is shown in Listing 19.1, where we assume that the data file is named earthquakes.csv. Note that the --all-cols option is wanted here, so that gretl treats Date as a string-valued series rather than just a source of time-series information. For good measure, we show how to add an ISO 8601 date series.

Listing 19.1: Converting a string-valued date series to epoch day

    open earthquakes.csv --all-cols
    series eday = strpday(Date, "%m/%d/%Y")
    series isodates = strfday(eday, "%Y-%m-%d")
    print Date eday isodates -o

Output:

             Date     eday    isodates
    1  01/02/1965  717338  1965-01-02
    2  01/04/1965  717340  1965-01-04
    3  01/05/1965  717341  1965-01-05
    4  01/08/1965  717344  1965-01-08
    5  01/09/1965  717345  1965-01-09
    6  01/10/1965  717346  1965-01-10
    7  01/12/1965  717348  1965-01-12
    8  01/15/1965  717351  1965-01-15

Alternatively, one might like to convert the Date and Time columns jointly to Unix seconds. This can be done by sticking the two strings together at each observation, and calling strptime with a suitable format, as follows:

    series usecs   # Unix seconds
    loop i=1..$nobs
        usecs[i] = strptime(Date[i] ~ " " ~ Time[i], "%m/%d/%Y %H:%M:%S")
    endloop

Unix seconds and time zones

At 8:46 in the morning of September 11, 2001, an airliner crashed into the North Tower of the World Trade Center in New York. Relative to what time zone is that statement correct? Eastern Daylight Time (EDT), of course. Unless we have special reason to do otherwise, we report the time of an event relative to the time zone in which it occurred; and if we do otherwise, we need to state the metric we're using. For example, one might say that this event occurred at 2001-09-11 12:46 UTC.

Now consider the following script:

    date = "2001-09-11 08:46"
    format = "%Y-%m-%d %H:%M"
    usecs = strptime(date, format)
    check = strftime(usecs, format)
    printf "Unix time %d\n", usecs
    printf "original:  %s\n", date
    printf "recovered: %s\n", check

Run this script in any time zone you like, and the last line of output will read

    recovered: 2001-09-11 08:46

The usecs value will differ by time zone (for example, it'll be 1000212360 under Eastern Daylight Time but 1000194360 under Central European Time), but this difference cancels out in recovering the original time via strftime.

So far, so good. But suppose I write a script in which I store the date as Unix seconds, with my laptop's clock set to EDT:

    usecs = 1000212360
    date = strftime(usecs, "%Y-%m-%d %H:%M")
    print date

Running this script under EDT will again print out 2001-09-11 08:46, but if I take my laptop to Italy in June, set its clock to the local time and re-run the script, I'll get 2001-09-11 14:46. Is that a problem? Well, 14:46 is indeed the time in Italy when it's 08:46 in New York (with both zones in their Summer variants); it's a problem only if you want to preserve the locality of the original time. To do that, you need to give time-zone information to both strptime and strftime. This is illustrated in Listing 19.2.

Listing 19.2: Date-time invariance with respect to current time zone

    string date = "2001-09-11 08:46 -0400"
    string format = "%Y-%m-%d %H:%M %z"
    usecs = strptime(date, format)
    printf "Unix time %d\n", usecs

In the code above, we specify the time zone in date using "-0400", meaning 4 hours behind UTC, which is correct when Daylight Saving time is in force in the Eastern US. And we match this with the "%z" specifier in format. As a result, regardless of the time zone in which the code is run, the Unix time value will be 1000212360. Then we come to unpacking that value:

    date = strftime(1000212360, "%Y-%m-%d %H:%M %z", -4*3600)
    print date

Here we use the third, optional, argument to strftime to supply the offset in seconds of EDT relative to UTC. Having told strptime the time zone, why do we need this? Well, remember that Unix time is just a scalar value, always relative to UTC; it cannot store time-zone information. Anyway, the result is that this code will print 2001-09-11 08:46 -0400, regardless of where and when it is executed.
Some additional comments are in order. First, spaces matter in parsing the strptime arguments: they must match between the date and format strings. In the example above we inserted spaces before "-0400" and "%z"; we could have omitted both spaces, but not just one of them. Second, the C standard does not require that strptime and strftime know anything about time zones; the extensions used in this example are supported by GLib functionality.

19.4 Epoch day arithmetic

Given the way epoch days are defined, they provide a useful tool for checking whether daily data are complete. Suppose we have what purport to be 7-day daily data, with a starting date of 2015-01-01 and an ending date of 2016-12-31. How many observations should there be?

    ed1 = epochday(2015,1,1)
    ed2 = epochday(2016,12,31)
    n = ed2 - ed1 + 1

We find that there should be n = 731 observations; if there are fewer, there's something missing. If the data are supposed to be on a 5-day week (skipping Saturday and Sunday) or a 6-day week (skipping Sunday alone), the calculation is more complicated; in this case we can use the dayspan function, providing as arguments the epoch-day values for the first and last dates and the number of days per week:

    ed1 = epochday(2015,1,1)
    ed2 = epochday(2016,12,30)
    n = dayspan(ed1, ed2, 5)

We discover that there were n = 522 weekdays in this period.

The dayspan function can also be helpful if you wish to construct a suitably sized "empty" daily dataset prior to importing data from a third-party database, for example stock prices from Yahoo. Say the data to be imported are on a 5-day week and you want the range to be from 2000-01-03 (the first weekday in 2000) to 2020-12-30 (a Wednesday). Here's how one could initialize a suitable host dataset:

    ed1 = epochday(2000,1,3)
    ed2 = epochday(2020,12,30)
    n = dayspan(ed1, ed2, 5)
    nulldata n
    setobs 5 2000-01-03

Another use of arithmetic using epoch days is constructing a sequence of dates of non-standard frequency. Suppose you want a bi-weekly series including alternate Saturdays in 2023. Here's a solution:

    nulldata 26
    setobs 1 1 --special-time-series
    series eday
    eday[1] = epochday(2023,1,7)   # the first Saturday
    loop i=2..$nobs
        eday[i] = eday[i-1] + 14
    endloop
    series dates = strfday(eday, "%Y-%m-%d")

19.5 Other accessors and functions

Accessors

Gretl offers various accessors for generating dates. One is $now, which returns the current date-time as a 2-element vector: the first element is Unix seconds and the second an epoch day (see Section 19.2). This is always available, regardless of the presence or absence of a dataset.

When a time-series dataset is open, up to four accessors are available to retrieve observation dates as numeric series. First, there is $obsdate, which returns ISO 8601 basic dates. If the frequency is annual, quarterly or monthly, these dates represent the first day of the period in question; if the frequency is hourly, this accessor is not available. Then there's a set of up to three accessors: $obsmajor, $obsminor and $obsmicro. The availability and interpretation of these values depends on the character of the dataset, as shown in Table 19.2. For reference, the "constructor" column shows the argument that should be supplied to the setobs command to impose each frequency on a dataset, assuming it starts on January 1, 1990.

The hourly frequency is not fully supported by gretl's calendrical apparatus, but an epoch-day value can be used to set the starting day for an hourly time series, as exemplified in Table 19.2 (726468 for 1990-01-01). One could then construct a string-valued hourly date-time series in this way:
    series day = strptime(isodate($obsmajor, 1), "%Y-%m-%d")
    series usecs = day + 3600 * ($obsminor - 1)   # Unix seconds
    series tstrs = strftime(usecs, "%Y-%m-%d %H:%M")

When a panel dataset is open and its time dimension is specified (see Section 19.1 and the documentation for the setobs command), $obsdate works as described for time-series datasets. But $obsmajor and $obsminor do not refer to the time dimension: rather, they give the 1-based indices of the individuals and time periods, respectively. And $obsmicro is not available.

    frequency   description   constructor     $obsmajor   $obsminor   $obsmicro
    1           annual        1 1990          year
    4           quarterly     4 1990:1        year        quarter
    12          monthly       12 1990:01      year        month
    5, 6, 7     daily         n 1990-01-01    year        month       day
    52          weekly        52 1990-01-01   year        month       day
    24          hourly        24 726468:01    day         hour

    Table 19.2: Calendrical frequencies and accessors

Miscellaneous functions

Besides conversion, several other calendrical functions are available:

- monthlen: given month and year, returns the length of the month in days, optionally ignoring weekends.
- weekday: given a date as year, month and day (or ISO 8601 basic), returns a number from 0 (Sunday) to 6 (Saturday) corresponding to the day of the week.
- juldate: given an epoch day, returns the corresponding date on the Julian calendar (see Section 19.6 below).
- dayspan: given two epoch days, calculates their distance, optionally taking weekends into account.
- easterday: given the year, returns the date of Easter on the Gregorian calendar.
- isoweek: given a date as year, month and day, returns the progressive number of the week within that year, as per the ISO 8601 specification.
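A few one-liners of our own show some of these functions at work:

    # quick examples of the miscellaneous calendar functions
    eval weekday(2017, 2, 19)    # 0: February 19th, 2017 was a Sunday
    eval monthlen(2, 2017, 7)    # 28: days in February 2017, 7-day weeks
    eval dayspan(epochday(2017,1,1), epochday(2017,12,31), 7)   # 365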
19.6 Working with pre-Gregorian dates

Working with dates is fairly straightforward in the current era, with the Gregorian calendar used universally for the dating of socio-economic observations. It is not so straightforward, however, when dealing with historical data recorded prior to the adoption of the Gregorian calendar in place of the Julian, an event which first occurred in the principal Catholic countries in 1582, but which took place at different dates in different countries, over a span of several centuries.

Gretl, like most data-oriented software, uses the Gregorian calendar by default for all dates, thereby ensuring that dates are all consecutive (the latter being a requirement of the ISO 8601 standard for dates and times).

As readers probably know, the Julian calendar adds a leap day (February 29) on each year that is divisible by 4 with no remainder. But this over-compensates for the fact that a 365-day year is too short to keep the calendar synchronized with the seasons. The Gregorian calendar introduced a more complex rule which maintains better synchronization, namely: each year divisible by 4 with no remainder is a leap year, unless it's a centurial year (e.g. 1900), in which case it's a leap year only if it is divisible by 400 with no remainder. So the years 1600 and 2000 were leap years on both calendars, but 1700, 1800 and 1900 were leap years only on the Julian calendar. While the average length of a Julian year is 365.25 days, the Gregorian average is 365.2425 days.

The fact that the Julian calendar inserts leap days more frequently means that the Julian date progressively (although very slowly) falls behind the Gregorian date. For example, February 18, 2017 (Gregorian) is February 5, 2017 on the Julian calendar. On adoption of the Gregorian calendar it was therefore necessary to skip several days. In England, where the transition occurred in 1752, Wednesday, September 2 was directly followed by Thursday, September 14.

In comparing calendars one wants to refer to a given day in terms that are not specific to either calendar. But how to define a "given day"? This is accomplished by a count of days following some definite temporal benchmark. As described in Section 19.2, gretl uses days since the start of 1 AD, which we call "epoch days".

In this section we address the problem of constructing, within gretl, a calendar which agrees with the actual historical calendar prior to the switch to Gregorian dating. Most people will have no use for this, but researchers working with archival data may find it helpful: it would be tricky and error-prone to enter on the Gregorian calendar data whose dates are given on the Julian at source.

In order to represent Julian dates, gretl uses two basic tools: one is the juldate function (which converts a Gregorian epoch day into an ISO-8601-like integer), and the other is the convention that, for some functions, a negative value where a year is expected acts as a "Julian calendar" flag. So, for example, the following code fragment

    edg = epochday(1700,1,1)
    edj = epochday(-1700,1,1)

produces edg = 620548 and edj = 620558, indicating that the two calendars differed by 10 days at the point in time known as January 1, 1700, on the proleptic Gregorian calendar.

Taken together with the isodate and juldate functions, which each take an epoch day argument and return an ISO 8601 basic date on, respectively, the Gregorian and Julian calendars, epochday can be used to convert between the two calendars. For example, what was the date in England (still on the Julian calendar) on the day known to Italians as June 26, 1740 (Italy having been on the Gregorian calendar since October 1582)?

    ed = epochday(1740,6,26)
    english_date = juldate(ed)
    printf "%d\n", english_date

We find that the English date was 17400615, the 15th of June. Working in the other direction, what Italian date corresponded to the 5th of November, 1740, in England?

    ed = epochday(-1740,11,5)
    italian_date = isodate(ed)
    printf "%d\n", italian_date

Answer: 17401116; Guy Fawkes night in 1740 occurred on November 16 from the Italian point of view.

We'll now consider the trickiest case, namely a calendar which includes the day on which the Julian-to-Gregorian switch occurred. If we can handle this, it should be relatively simple to handle a purely Julian calendar. Our illustration will be England in 1752 (a similar analysis could be done for Spain in 1582 or Greece in 1923).

A solution is presented in Listing 19.3. The first step is to find the epoch day corresponding to the Julian date 1752-01-01 (which turns out to be 639551). Then we can create a series of epoch days, from which we get both Julian and Gregorian dates for 355 days starting on epoch day 639551. Note: 355 days, because this was a short year; it was a leap year, but 11 days were skipped in September in making the transition to the Gregorian calendar. We can then construct a series, hcal, which switches calendar at the right historical point.

Listing 19.3: Historical calendar for Britain in 1752

    # 1752 was a short year on the British calendar
    nulldata 355
    # give a negative year to indicate Julian date
    ed0 = epochday(-1752,1,1)
    # consistent series of epoch day values
    series ed = ed0 + index - 1
    # Julian dates as YYYYMMDD
    series jdate = juldate(ed)
    # Gregorian dates as YYYYMMDD
    series gdate = isodate(ed)
    # Historical: cut-over in September
    series hcal = ed > epochday(-1752,9,2) ? gdate : jdate
    # And let's take a look
    print ed jdate gdate hcal -o
             ed     jdate     gdate      hcal
      1  639551  17520101  17520112  17520101
      2  639552  17520102  17520113  17520102
    ...
    245  639795  17520901  17520912  17520901
    246  639796  17520902  17520913  17520902
    247  639797  17520903  17520914  17520914
    248  639798  17520904  17520915  17520915
    ...
    355  639905  17521220  17521231  17521231

It may be preferable to have historical dates in that role. To achieve this, we can decompose the hcal series into year, month and day, then use the special genr markers apparatus (see chapter 4). Suitable code, along with partial output, is shown in Listing 19.4.

Listing 19.4: Continuation of Britain 1752 example

Additional input:

    series y, m, d
    isoconv(hcal, &y, &m, &d)
    genr markers = "%04d-%02d-%02d", y, m, d
    print ed jdate gdate hcal -o

Partial output:

                    ed     jdate     gdate      hcal
    1752-01-01  639551  17520101  17520112  17520101
    1752-01-02  639552  17520102  17520113  17520102
    ...
    1752-09-01  639795  17520901  17520912  17520901
    1752-09-02  639796  17520902  17520913  17520902
    1752-09-14  639797  17520903  17520914  17520914
    1752-09-15  639798  17520904  17520915  17520915
    ...
    1752-12-31  639905  17521220  17521231  17521231

Year numbering

A further complication in dealing with archival data is that the year number has not always been advanced on January 1: for example, in Britain prior to 1752, March 25 was taken as the start of the new year. On gretl's calendar (whether Julian or Gregorian) the year number always advances on January 1, but it's possible to construct observation markers following the old scheme. This is illustrated for the year 1751 (as we would now call it) in Listing 19.5.

Day of week and length of month

Two of the functions described in Section 19.5 that by default operate on the Gregorian calendar can be induced to work on the Julian by the trick mentioned above, namely giving the negative of the year. These are weekday (which takes arguments year, month and day) and monthlen (which takes arguments month, year and days per week). Thus, for example,

    eval weekday(-1700, 2, 29)

gives 4, indicating that Julian February 29, 1700 was a Thursday. And

    eval monthlen(2, -1900, 5)

gives 21, indicating that there were 21 weekdays in Julian February 1900.

Listing 19.5: Historical calendar for England in 1751

Input:

    nulldata 365  # a common year
    ed0 = epochday(-1751, 1, 1)
    ed1 = epochday(-1751, 3, 25)
    series ed = ed0 + index - 1
    series jdate = juldate(ed)
    series y, m, d
    isoconv(jdate, &y, &m, &d)
    y = ed < ed1 ? y-1 : y
    genr markers = "%04d-%02d-%02d", y, m, d
    print index -o

Partial output:

    1750-01-01    1
    1750-01-02    2
    1750-01-03    3
    ...
    1750-03-23   82
    1750-03-24   83
    1751-03-25   84
    1751-03-26   85
    ...
    1751-12-31  365

Chapter 20 Handling mixed-frequency data

20.1 Basics

In some cases one may want to handle data that are observed at different frequencies, a facility known as "MIDAS" (Mixed Data Sampling). A common pairing includes GDP, usually available quarterly, and industrial production, often available monthly. The most common context when this feature is required is specification and estimation of MIDAS models (see chapter 41), but other cases are possible.

A gretl dataset formally handles only a single data frequency, but we have adopted a straightforward means of representing nested frequencies: a higher-frequency series xH is represented by a set of m series, each holding the value of xH in a sub-period of the base (lower-frequency) period, where m is the ratio of the higher frequency to the lower.

This is most easily understood by means of an example. Suppose our base frequency is quarterly and we wish to include a monthly series in the analysis. Then a relevant fragment of the gretl dataset might look as shown in Table 20.1. Here, gdpc96 is a quarterly series while indpro is monthly, so m = 12/4 = 3 and the per-month values of indpro are identified by the suffix _mn, n = 3, 2, 1.
              gdpc96   indpro_m3   indpro_m2   indpro_m1
    1947:1   1934.47     14.3650     14.2811     14.1973
    1947:2   1932.28     14.3091     14.3091     14.2532
    1947:3   1930.31     14.4209     14.3091     14.2253
    1947:4   1960.70     14.8121     14.7562     14.5606
    1948:1   1989.54     14.7563     14.9240     14.8960
    1948:2   2021.85     15.2313     15.0357     14.7842

    Table 20.1: A slice of MIDAS data

To recover the actual monthly time series for indpro one must read the three relevant series right-to-left by row. At first glance this may seem perverse, but in fact it is the most convenient setup for MIDAS analysis. In such models, the high-frequency variables are represented by lists of lags, and of course in econometrics it is standard to give the most recent lag first (x_{t-1}, x_{t-2}, ...).

One can construct such a dataset manually from raw sources using hansl's matrix-handling methods or the join command (see Section 20.6 for illustrations), but we have added native support for the common cases shown below:

    base frequency    higher frequency
    annual            quarterly or monthly
    quarterly         monthly or daily
    monthly           daily

The examples below mostly pertain to the case of quarterly plus monthly data; Section 20.6 has details on the handling of daily data.

A mixed-frequency dataset can be created in either of two ways: by selective importation of series from a database, or by creating two datasets of different frequencies then merging them.

Importation from a database

Here's a simple example, in which we draw from the fedstl (St Louis Fed) database which is supplied in the gretl distribution:

    clear
    open fedstl.bin
    data gdpc96
    data indpro --compact=spread
    store gdp_indpro.gdt

Since gdpc96 is a quarterly series, its importation via the data command establishes a quarterly dataset. Then the MIDAS work is done by the option --compact=spread for the second invocation of data. This "spreads" the series indpro (which is monthly at source) into three quarterly series, exactly as shown in Table 20.1.

Merging two datasets

In this case we consider an Excel file provided by Eric Ghysels in his MIDAS Matlab Toolbox,[1] namely mydata.xlsx. This contains quarterly real GDP in Sheet1 and monthly non-farm payroll employment in Sheet2. A hansl script to build a MIDAS-style file named gdp_payroll_midas.gdt is shown in Listing 20.1.

Listing 20.1: Building a gretl MIDAS dataset via merger

    # sheet 2 contains monthly employment data
    open MIDASv2.2/mydata.xlsx --sheet=2
    rename VALUE payems
    dataset compact 4 spread
    # limit to the sample range of the GDP data
    smpl 1947:1 2011:2
    setinfo payems_m3 --description="Non-farm payroll employment, month 3 of quarter"
    setinfo payems_m2 --description="Non-farm payroll employment, month 2 of quarter"
    setinfo payems_m1 --description="Non-farm payroll employment, month 1 of quarter"
    store payroll_midas.gdt
    # sheet 1 contains quarterly GDP data
    open MIDASv2.2/mydata.xlsx --sheet=1
    rename VALUE qgdp
    setinfo qgdp --description="Real quarterly US GDP"
    append payroll_midas.gdt
    store gdp_payroll_midas.gdt

[1] See http://eghysels.web.unc.edu/ for links.

Note that both series are simply named VALUE in the source file, so we use gretl's rename command to set distinct and meaningful names. The heavy lifting is done here by the line

    dataset compact 4 spread

which tells gretl to compact an entire dataset (in this case, as it happens, just containing one series) to quarterly frequency, using the spread method. Once this is done, it is straightforward to append the compacted data to the quarterly GDP dataset.

We will put an extended version of this dataset (supplied with gretl, and named gdp_midas.gdt) to use in subsequent sections.
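As a concrete check on this layout, the following minimal sketch (assuming the gdp_indpro.gdt file created above, with the component names shown in Table 20.1) rebuilds the monthly indpro values in chronological order using hansl's matrix tools:

    open gdp_indpro.gdt --quiet
    # one row per quarter; columns hold month 3, month 2, month 1
    matrix M = {indpro_m3, indpro_m2, indpro_m1}
    # reorder the columns to months 1, 2, 3, then stack the rows in time order
    matrix x = vec(M[,{3,2,1}]')
    print x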
20.2 The notion of a MIDAS list

In the following two sections we'll describe functions that rather easily do the right thing, if you wish to create lists of lags or first differences of high-frequency series. However, we should first be clear about the correct domain for such functions, since they could produce the most diabolical mash-up of your data if applied to the wrong sort of list argument (for instance, a regular list containing distinct series, all observed at the base frequency of the dataset).

So let us define a MIDAS list: this is a list of m series holding per-period values of a single high-frequency series, arranged in the order of most recent first, as illustrated above. Given the dataset shown in Table 20.1, an example of a correctly formulated MIDAS list would be

    list INDPRO = indpro_m3 indpro_m2 indpro_m1

Or, since the monthly observations are already in the required order, we could define the list by means of a wildcard:

    list INDPRO = indpro_m*

Having created such a list, one can use the setinfo command to tell gretl that it's a bona fide MIDAS list:

    setinfo INDPRO --midas

This will spare you some warnings that gretl would otherwise emit when you call some of the functions described below. This step should not be necessary, however, if the series in question are the product of a compact operation with the spread parameter.

Inspecting high-frequency data

The layout of high-frequency data shown in Table 20.1 is convenient for running regressions, but not very convenient for inspecting and checking such data. We therefore provide some methods for displaying MIDAS data at their natural frequency. Figure 20.1 shows the gretl main window with the gdp_midas dataset loaded, along with the menu that pops up if you right-click with the payems series highlighted. The items "Display values" and "Time series plot" show the data on their original monthly calendar, while the "Display components" item shows the three component series on a quarterly calendar, as in Table 20.1.

These methods are also available via the command line. For example, the commands

    list PAYEMS = payems*
    print PAYEMS --byobs --midas
    hfplot PAYEMS --with-lines --output=display

produce a monthly printout of the payroll employment data, followed by a monthly time-series plot. (See section 20.5 for more on hfplot.)

Figure 20.1: MIDAS data menu

20.3 High-frequency lag lists

A basic requirement of MIDAS is the creation of lists of high-frequency lags for use on the right-hand side of a regression specification. This is possible, but not very convenient, using gretl's lags function; it is made easier by a dedicated variant of that function, described below.

For illustration we'll consider an example presented in Ghysels' Matlab implementation of MIDAS: this uses 9 monthly lags of payroll employment, starting at lag 3, in a model for quarterly GDP. The estimation period for this model starts in 1985Q1. At this observation, the stipulation that we start at lag 3 means that the first (most recent) lag is employment for October 1984,[2] and the 9-lag window means that we need to include monthly lags back to February 1984. Let the per-month employment series be called x_m3, x_m2 and x_m1, and let quarterly lags be represented by (-1), (-2) and so on. Then the terms we want are (reading left-to-right by row):

    x_m1(-1)
    x_m3(-2)   x_m2(-2)   x_m1(-2)
    x_m3(-3)   x_m2(-3)   x_m1(-3)
    x_m3(-4)   x_m2(-4)

We could construct such a list in gretl using the following standard syntax. (Note that the third argument of 1 to lags below tells gretl that we want the terms ordered by lag rather than by variable; this is required to respect the order of the terms shown above.)

    list X = x_m*
    # create lags for 4 quarters, ordered by lag
    list XL = lags(4, X, 1)
    # convert the list to a matrix
    matrix tmp = XL
    # trim off the first two elements and the last
    tmp = tmp[3:11]
    # and convert back to a list
    list XL = tmp

[2] That is what Ghysels means, but see the subsection on "Leads and nowcasting" below for a possible ambiguity in this regard.

However, the following specialized syntax is more convenient:

    list X = x_m*
    setinfo X --midas
    # create high-frequency lags 3 to 11
    list XL = hflags(3, 11, X)

In the case of hflags, the length of the list given as the third argument defines the compaction ratio (m = 3 in this example); we can (in fact, must) specify the lags we want in high-frequency terms; and ordering of the generated series by lag is automatic.

Word to the wise: do not use hflags on anything other than a MIDAS list as defined in section 20.2, unless perhaps you have some special project in mind and really know what you are doing.

Leads and nowcasting

Before leaving the topic of lags, it is worth commenting on the question of leads and so-called "nowcasting", that is, prediction of the current value of a lower-frequency variable before its measurement becomes available. In a regular dataset where all series are of the same frequency, lag 1 means the observation from the previous period, lag 0 is equivalent to the current observation, and lag -1 (or lead 1) is the observation for the next period, into the relative future.

When considering high-frequency lags in the MIDAS context, however, there is no uniquely determined high-frequency sub-period which is temporally coincident with a given low-frequency period. The placement of high-frequency lag 0 therefore has to be a matter of convention. Unfortunately, there are two incompatible conventions in currently available MIDAS software, as follows:

High-frequency lag 0 corresponds to the first sub-period within the current low-frequency period. This is what we find in Eric Ghysels' MIDAS Matlab Toolbox; it's also clearly stated and explained in Armesto et al. (2010).

High-frequency lag 0 corresponds to the last sub-period in the current low-frequency period. This convention is employed in the midasr package for R.[3]

Consider, for example, the quarterly/monthly case. In Matlab, high-frequency (HF) lag 0 is the first month of the current quarter, HF lag 1 is the last month of the prior quarter, and so on. In midasr, however, HF lag 0 is the last month of the current quarter, HF lag 1 the middle month of the quarter, and HF lag 3 is the first one to take you back in time relative to the start of the current quarter, namely to the last month of the prior quarter.

In gretl we have chosen to employ the first of these conventions. So lag 1 points to the most recent sub-period in the previous base-frequency period, lag 0 points to the first sub-period in the current period, and lag -1 to the second sub-period within the current period.

Continuing with the quarterly/monthly case, monthly observations for lags 0 and -1 are likely to become available before a measurement for the quarterly variable is published, possibly also a monthly value for lag -2. The first "truly future" lead does not occur until lag -3.

The hflags function supports negative lags. Suppose one wanted to use 9 lags of a high-frequency variable (-1, 0, 1, ..., 7) for nowcasting. Given a suitable MIDAS list, X, the following would do the job:

    list XLnow = hflags(-1, 7, X)

[3] See http://cran.r-project.org/web/packages/midasr/ and, for documentation, https://github.com/mpiktas/midasr-user-guide/raw/master/midasr-user-guide.pdf

This means that one could generate a forecast for the current low-frequency period (which is not yet completed, and for which no observation is available) using data from two sub-periods into the low-frequency period, e.g. the first two months of the quarter.
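To make gretl's placement of lag 0 concrete, here is a minimal sketch (using the gdp_midas.gdt file mentioned above) which builds the three sub-period terms relevant for a quarter in progress, namely lags -1 through 1, and prints the membership of the resulting list; the series names are whatever hflags generates:

    open gdp_midas.gdt --quiet
    list X = payems*
    setinfo X --midas
    # lag 0 = first month of the current quarter (the convention described above);
    # lag -1 = second month; lag 1 = last month of the previous quarter
    list XLnow = hflags(-1, 1, X)
    list XLnow print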
20.4 High-frequency first differences

When working with non-stationary data one may wish to take first differences, and in the MIDAS context that probably means high-frequency differences of the high-frequency data. Note that the ordinary gretl functions diff and ldiff will not do what is wanted for series such as indpro, as shown in Table 20.1: these functions will give per-month quarterly differences of the data (month 3 of the current quarter minus month 3 of the previous quarter, and so on).

To get the desired result one could create the differences before compacting the high-frequency data, but this may not be convenient, and it's not compatible with the method of constructing a MIDAS dataset shown in section 20.1. The alternative is to employ the specialized differencing function hfdiff. This takes one required argument, a MIDAS list as defined in section 20.2. A second, optional argument is a scalar multiplier (with default value 1.0); this permits scaling the output series by a constant. There's also an hfldiff function for creating high-frequency log differences; this has the same syntax as hfdiff.

So, for example, the following creates a list of high-frequency percentage changes (100 times the log difference), then a list of high-frequency lags of the changes:

    list X = indpro*
    setinfo X --midas
    list dX = hfldiff(X, 100)
    list dXL = hflags(3, 11, dX)

If you only need the series in the list dXL, however, you can nest these two function calls:

    list dXL = hflags(3, 11, hfldiff(X, 100))

20.5 MIDAS-related plots

In the context of MIDAS analysis one may wish to produce time-series plots which show high- and low-frequency data in correct registration, as in Figures 1 and 2 in Armesto et al. (2010). This can be done using the hfplot command, which has the following syntax:

    hfplot midas-list [; lflist] options

The required argument is a MIDAS list, as defined above. Optionally, one or more lower-frequency series (lflist) can be added to the plot, following a semicolon. Supported options are --with-lines, --time-series and --output. These have the same effects as with gretl's gnuplot command. An example based on Figure 1 in Armesto et al. (2010) is shown in Listing 20.2 and Figure 20.2.

20.6 Alternative MIDAS data methods

Importation via a column vector

Listing 20.3 illustrates how one can construct, via hansl, a MIDAS list from a matrix (column vector) holding data of a higher frequency than the given dataset. In practice one would probably read high-frequency data from file using the mread function, but here we just construct an artificial sequential vector. Note the check in the high_freq_list function: we determine the current sample size, T, and insist that the input matrix is suitably dimensioned, with a single column of length equal to T times the compaction factor (here 3, for monthly to quarterly).

Listing 20.2: Replication of a plot from Armesto et al. (2010)

    open gdp_midas.gdt
    # form and label the dependent variable
    series dy = log(qgdp/qgdp(-1)) * 400
    setinfo dy --graph-name="GDP"
    # form list of annualized HF differences
    list X = payems*
    list dX = hfldiff(X, 1200)
    setinfo dX --graph-name="Payroll Employment"
    smpl 1980:1 2009:1
    hfplot dX ; dy --with-lines --time-series --output=display

Figure 20.2: Quarterly GDP and monthly Payroll Employment, annualized percentage changes
Listing 20.3: Create a MIDAS list from a matrix

    function list high_freq_list (const matrix x, int compfac, string vname)
      list ret = deflist()
      scalar T = $nobs
      if rows(x) != compfac*T || cols(x) != 1
        funcerr "Invalid x matrix"
      endif
      matrix m = mreverse(mshape(x, compfac, T))'
      loop i=1..compfac
        scalar k = compfac + 1 - i
        ret += genseries(sprintf("%s%d", vname, k), m[,i])
      endloop
      setinfo ret --midas
      return ret
    end function

    # construct a little quarterly dataset
    nulldata 12
    setobs 4 1980:1
    # generate "monthly" data, 1 to 36
    matrix x = seq(1, 3*$nobs)'
    print x
    # turn into a midas list
    list H = high_freq_list(x, 3, "test_m")
    print H --byobs

The final command in the script should produce

             test_m3   test_m2   test_m1
    1980:1         3         2         1
    1980:2         6         5         4
    1980:3         9         8         7
    ...

This functionality is available in the built-in function hflist, which has the same signature as the hansl prototype above.

Importation via join

The join command provides a general and flexible framework for importing data from external files (see chapter 7). In order to handle multiple-frequency data it supports the "spreading" of a high-frequency series to a MIDAS list in a single operation. This requires use of the --aggr option with parameter spread. There are two acceptable forms of usage, illustrated below. (Note that AWM is a quarterly dataset while hamilton is monthly.) First case:

    open AWM.gdt
    join hamilton.gdt PC6IT --aggr=spread

and second case:

    open AWM.gdt
    join hamilton.gdt PCI --data=PC6IT --aggr=spread

In the first case the MIDAS series PC6IT_m3, PC6IT_m2 and PC6IT_m1 are added to the working dataset. In the second case PCI is used as the base name for the imports, giving PCI_m3, PCI_m2 and PCI_m1 as the names of the per-month series. Note that only one high-frequency series can be imported in a given join invocation with the option --aggr=spread, which already implies the writing of multiple series in the lower-frequency dataset.

An important point to note is that the --aggr=spread mechanism (where we map from one higher-frequency series to a set of lower-frequency ones) relies on finding a known, reliable time-series structure in the "outer" data file. Native gretl time-series data files will have such a structure, and also well-formed gretl-friendly CSV files, but not arbitrary comma-separated files. So if you have difficulty importing data MIDAS-style from a given CSV file using --aggr=spread, you might want to drop back to a more agnostic, piecewise approach (agnostic in the sense of assuming less about gretl's ability to detect any time-series structure that might be present). Here's an example:

    open hamilton.gdt
    # create month-of-quarter series for filtering
    series mofq = ((obsminor - 1) % 3) + 1
    # write example CSV file: the first column holds, e.g., "1973M01"
    store test.csv PC6IT mofq
    open AWM.gdt -q
    # import monthly components one at a time, using a filter
    join test.csv PCI_m3 --data=PC6IT --tkey=",%YM%m" --filter="mofq==3"
    join test.csv PCI_m2 --data=PC6IT --tkey=",%YM%m" --filter="mofq==2"
    join test.csv PCI_m1 --data=PC6IT --tkey=",%YM%m" --filter="mofq==1"
    list PCI = PCI_m*
    setinfo PCI --midas
    print PCI_m* --byobs

The example is artificial in that a time-series CSV file of suitable frequency written by gretl itself should work without special treatment. But you may have to add "helper" columns such as the mofq series above to a third-party CSV file to enable a piecewise MIDAS join via filtering.

Daily data

Daily data (commonly, financial-market data) are often used in practical applications of the MIDAS methodology. It's therefore important that gretl support use of such data, but there are special issues arising from the fact that the number of days in a month, quarter or year is not a constant. It seems to us that it's necessary to stipulate a fixed, conventional number of days per lower-frequency period (that is, in practice, per month or quarter, since for the moment we're
ignoring the week as a basic temporal unit, and we're not yet attempting to support the combination of annual and daily data). But matters are further complicated by the fact that daily data come in (at least) three sorts: 5 days per week (as in financial-market data), 6-day (some commercial data, which skip Sunday) and 7-day.

That said, we currently support, via --compact=spread as described in section 20.1, the following conversions:

Daily to monthly: If the daily data are 5 days per week, we impose 22 days per month. This is the median, and also the mode, of weekdays per month, although some months have as few as 20 weekdays and some have 23. If the daily data are 6-day we impose 26 days per month, and in the 7-day case, 30 days per month.

Daily to quarterly: In this case the stipulated days per quarter are simply 3 times the days-per-month values specified above.

So, given a daily dataset, you can say

    dataset compact 12 spread

to convert MIDAS-wise to monthly (or substitute 4 for 12 for a quarterly target). And this is supposed to work whether the number of days per week is 5, 6 or 7.

That leaves the question of how we handle cases where the actual number of days in the calendar month or quarter falls short of, or exceeds, the stipulated number. We'll talk this through with reference to the conversion of 5-day daily data to monthly; all other cases are essentially the same, mutatis mutandis.[4]

We start at day 1, namely the first relevant daily date within the calendar period (so the first weekday, with 5-day data). From that point on we fill up to 22 slots with relevant daily observations, including (not skipping) NAs due to holidays or whatever. If at the end we have daily observations left over, we ignore them. If we're short, we fill the empty slots with the arithmetic mean of the valid, used observations;[5] and we fill in any missing values in the same way.

This means that lags 1 to 22 of 5-day daily data in a monthly dataset are always observations from days within the prior month (or, in some cases, padding that substitutes for such observations); lag 23 takes you back to the most recent day in the month before that.

Clearly, we could get a good deal fancier in our handling of daily data: for example, letting the user determine the number of days per month or quarter, and/or offering more elaborate means of filling in missing and non-existent daily values. It's not clear that this would be worthwhile, but it's open to discussion.

A little daily-to-monthly example is shown in Listing 20.4 and Figure 20.3. The example exercises the hfplot command (see section 20.5).

[4] Or should be. We're not ready to guarantee that just yet.
[5] This is the procedure followed in some example programs in the MIDAS Matlab Toolbox.

Listing 20.4: Monthly plus daily data

    # open a daily dataset
    open djclose.gdt
    # spread the data to monthly
    dataset compact 12 spread
    list DJ = djc*
    # import an actual monthly series
    open fedstl.bin
    data indpro
    # high-frequency plot: use --output=daily.pdf for PDF
    hfplot DJ ; indpro --with-lines --output=display

Figure 20.3: Monthly industrial production and daily Dow Jones close

Chapter 21 Cheat sheet

This chapter explains how to perform some common (and some not so common) tasks in gretl's scripting language, hansl. Some, but not all, of the techniques listed here are also available through the graphical interface. Although the graphical interface may be more intuitive and less intimidating at first, we encourage users to take advantage of
the power of gretl's scripting language as soon as they feel comfortable with the program.

21.1 Dataset handling

"Weird" periodicities

Problem: You have data sampled each 3 minutes from 9am onwards; you'll probably want to specify the hour as 20 periods.

Solution:

    setobs 20 9:1 --special-time-series

Comment: Now functions like sdiff (seasonal difference) or estimation methods like seasonal ARIMA will work as expected.

Generating a panel dataset of given dimensions

Problem: You want to generate, via nulldata, a panel dataset and specify in advance the number of units and the time length of your series, via two scalar variables.

Solution:

    scalar n_units = 100
    scalar T = 12
    scalar NT = T * n_units
    nulldata NT --preserve
    setobs T 1:1 --stacked-time-series

Comment: The essential ingredient that we use here is the --preserve option: it protects existing scalars (and matrices, for that matter) from being trashed by nulldata, thus making it possible to use the scalar T in the setobs command.

Help, my data are backwards!

Problem: Gretl expects time series data to be in chronological order (most recent observation last), but you have imported third-party data that are in reverse order (most recent first).

Solution:

    setobs 1 1 --cross-section
    series sortkey = -obs
    dataset sortby sortkey
    setobs 1 1950 --time-series

Comment: The first line is required only if the data currently have a time series interpretation: it removes that interpretation, because (for fairly obvious reasons) the dataset sortby operation is not allowed for time series data. The following two lines reverse the data, using the negative of the built-in index variable obs. The last line is just illustrative: it establishes the data as annual time series, starting in 1950.

If you have a dataset that is mostly the right way round, but a particular variable is wrong, you can reverse that variable as follows:

    x = sortby(-obs, x)

Dropping missing observations selectively

Problem: You have a dataset with many variables and want to restrict the sample to those observations for which there are no missing observations for the variables x1, x2 and x3.

Solution:

    list X = x1 x2 x3
    smpl --no-missing X

Comment: You can save the file via a store command to preserve a sub-sampled version of the dataset. Alternative solutions based on the ok function, such as

    list X = x1 x2 x3
    series sel = ok(X)
    smpl sel --restrict

are perhaps less obvious, but more flexible. Pick your poison.

"By" operations

Problem: You have a discrete variable d and you want to run some commands (for example, estimate a model) by splitting the sample according to the values of d.

Solution:

    matrix vd = values(d)
    m = rows(vd)
    loop i=1..m
      scalar sel = vd[i]
      smpl d==sel --restrict --replace
      ols y const x
    endloop
    smpl full

Comment: The main ingredient here is a loop. You can have gretl perform as many instructions as you want for each value of d, as long as they are allowed inside a loop. Note, however, that if all you want is descriptive statistics, the summary command does have a --by option.

Adding a time series to a panel

Problem: You have a panel dataset (comprising observations of n individuals in each of T periods) and you want to add a variable which is available in straight time-series form. For example, you want to add annual CPI data to a panel in order to deflate nominal income figures.

In gretl a panel is represented in stacked time-series format, so in effect the task is to create a new variable which holds n stacked copies of the original time series. Let's say the panel comprises 500 individuals observed in the years 1990, 1995 and 2000 (n = 500, T = 3), and we have these CPI data in the ASCII file cpi.txt:
    date   cpi
    1990   130.658
    1995   152.383
    2000   172.192

What we need is for the CPI variable in the panel to repeat these three values 500 times.

Solution: Simple! With the panel dataset open in gretl,

    append cpi.txt

Comment: If the length of the time series is the same as the length of the time dimension in the panel (3 in this example), gretl will perform the stacking automatically. Rather than using the append command you could use the "Append data" item under the File menu in the GUI program. If the length of your time series does not exactly match the T dimension of your panel dataset, append will not work, but you can use the join command, which is able to pick just the observations with matching time periods. On selecting "Append data" in the GUI you are given a choice between plain "append" and "join" modes, and if you choose the latter you get a dialog window allowing you to specify the key(s) for the join operation. For native gretl data files you can use built-in series that identify the time periods, such as $obsmajor, for your "outer" key to match the dates. In the example above, if the CPI data were in gretl format, $obsmajor would give you the year of the observations.

Time averaging of panel datasets

Problem: You have a panel dataset (comprising observations of n individuals in each of T periods) and you want to lower the time frequency by averaging. This is commonly done in empirical growth economics, where annual data are turned into 3-, 4- or 5-year averages (see for example Islam, 1995).

Solution: In a panel dataset, gretl functions that deal with time are aware of the panel structure, so they will automatically do the right thing. Therefore, all you have to do is use the movavg function for computing moving averages and then just drop the years you don't need. An example with artificial data follows:

    nulldata 36
    set seed 61218
    setobs 12 1:1 --stacked-time-series

    # generate simulated yearly data
    series year = 2000 + time
    series y = round(normal())
    series x = round(3*uniform())
    list X = y x
    print year X -o

    # now recast as 4-year averages
    # a dummy for endpoints
    series endpoint = (year % 4 == 0)
    # id variable
    series id = $unit
    # compute averages
    loop foreach i X
      series $i = movavg($i, 4)
    endloop
    # drop extra observations
    smpl endpoint --dummy --permanent
    # restore panel structure
    setobs id year --panel-vars
    print id year X -o

Running the above script produces (among other output):

    ? print year X -o

             year    y    x
    1:01     2001    1    1
    1:02     2002    1    1
    1:03     2003    1    0
    1:04     2004    0    1
    1:05     2005    1    2
    1:06     2006    1    2
    1:07     2007    1    0
    1:08     2008    1    1
    1:09     2009    0    3
    1:10     2010    1    1
    1:11     2011    1    1
    1:12     2012    0    1
    ...
    3:09     2009    0    1
    3:10     2010    1    1
    3:11     2011    0    2
    3:12     2012    1    2

    ? print id year X -o

           id   year      y      x
    1:1     1   2004   0.25   0.75
    1:2     1   2008   0.50   1.25
    1:3     1   2012   0.00   1.50
    ...
    3:3     3   2012   0.50   1.50

Turning observation-marker strings into a series

Problem: Here's one that might turn up in the context of the join command (see chapter 7). The current dataset contains a string-valued series that you'd like to use as a key for matching observations: perhaps the two-letter codes for the names of US states. The file from which you wish to add data contains that same information, but not in the form of a string-valued series; rather, it exists in the form of observation markers. Such markers cannot be used as a key directly, but is there a way to parlay them into a string-valued series? Why, of course there is!

Solution: We'll illustrate with the Ramanathan data file data4-10.gdt, which contains private school enrollment data and covariates for the 50 US states plus Washington, DC (n = 51).

    open data4-10.gdt
    markers --to-array="state_codes"
    genr index
    stringify(index, state_codes)
    store joindata.gdt
Comment: The markers command saves the observation markers to an array of strings. The command genr index creates a series that goes 1, 2, 3, ..., and we attach the state codes to this series via stringify. After saving the result we have a datafile that contains a series, index, that can be matched with whatever series holds the state code strings in the target dataset.

Suppose the relevant string-valued key series in the target dataset is called state. We might prefer to avoid the need to specify a distinct "outer" key (again, see chapter 7). In that case, in place of

    genr index
    stringify(index, state_codes)

we could do

    genr index
    series state = index
    stringify(state, state_codes)

and the two datafiles will contain a comparable string-valued state series.

21.2 Creating/modifying variables

Generating a dummy variable for a specific observation

Problem: Generate d_t = 0 for all observations but one, for which d_t = 1.

Solution:

    series d = (t == "1984:2")

Comment: The internal variable t is used to refer to observations in string form, so if you have a cross-section sample you may just use d = (t == 123). If the dataset has observation labels you can use the corresponding label. For example, if you open the dataset mrw.gdt, supplied with gretl among the examples, a dummy variable for Italy could be generated via

    series DIta = (t == "Italy")

Note that this method does not require scripting at all. In fact, you might as well use the GUI menu item "Add / Define new variable" for the same purpose, with the same syntax.

Generating a discrete variable out of a set of dummies

Problem: The dummify function (also available as a command) generates a set of mutually exclusive dummies from a discrete variable. The reverse functionality, however, seems to be absent.

Solution:

    series x = lincomb(D, seq(1, nelem(D)))

Comment: Suppose you have a list D of mutually exclusive dummies, that is, a full set of 0/1 variables coding for the value of some characteristic, such that the sum of the values of the elements of D is 1 at each observation. This is, by the way, exactly what the dummify command produces. The reverse job of dummify can be performed neatly by using the lincomb function.

The code above multiplies the first dummy variable in the list D by 1, the second one by 2, and so on. Hence, the return value is a series whose value is i if and only if the i-th member of D has value 1.

If you want your coding to start from 0 instead of 1, you'll have to modify the code snippet above into

    series x = lincomb(D, seq(0, nelem(D)-1))
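As a quick round-trip check, here is a minimal sketch with an artificial three-valued series (the names are arbitrary). It assumes the one-argument form of the dummify function, which encodes every distinct value of its argument:

    nulldata 12
    series z = 1 + ((obs - 1) % 3)   # discrete variable taking values 1, 2, 3
    setinfo z --discrete
    list D = dummify(z)              # mutually exclusive 0/1 dummies
    series x = lincomb(D, seq(1, nelem(D)))
    print z x --byobs                # x reproduces z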
Easter

Problem: I have a 7-day daily dataset. How do I create an Easter dummy?

Solution: We have the easterday function, which returns month and day of Easter given the year. The following is an example script which uses this function and a few "string magic" tricks:

    series Easter = 0
    loop y=2011..2016
      a = easterday(y)
      m = floor(a)
      d = round(100*(a-m))
      ed_str = sprintf("%04d-%02d-%02d", y, m, d)
      Easter["@ed_str"] = 1
    endloop

Comment: The round function is necessary for the "day" component because otherwise floating-point problems may ensue. Try the year 2015, for example.

Recoding a variable

Problem: You want to perform a 1-to-1 recode on a variable. For example, consider tennis points: you may have a variable x holding values 1 to 3 and you want to recode it to 15, 30, 40.

Solution 1:

    series x = replace(x, 1, 15)
    series x = replace(x, 2, 30)
    series x = replace(x, 3, 40)

Solution 2:

    matrix tennis = {15, 30, 40}
    series x = replace(x, seq(1,3), tennis)

Comment: There are many equivalent ways to achieve the same effect, but for simple cases such as this, the replace function is simple and transparent. If you don't mind using matrices, scripts using replace can also be remarkably compact. Note that replace also performs n-to-1 (surjective) replacements, such as

    series x = replace(z, {2, 3, 5, 11, 22, 33}, 1)

which would turn all entries equal to 2, 3, 5, 11, 22 or 33 to 1 and leave the other ones unchanged.

Generating a "subset of values" dummy

Problem: You have a dataset which contains a fine-grained coding for some qualitative variable and you want to collapse this to a relatively small set of dummy variables. Examples: you have place of work by US state and you want a small set of regional dummies; or you have detailed occupational codes from a census dataset and you want a manageable number of occupational category dummies.

Let's call the source series src and one of the target dummies D1. And let's say that the values of src to be grouped under D1 are 2, 13, 14 and 25. We'll consider three possible solutions: "longhand", "clever", and "proper".

"Longhand" solution:

    series D1 = src==2 || src==13 || src==14 || src==25

Comment: The above works fine if the number of distinct values in the source to be condensed into each dummy variable is fairly small, but it becomes cumbersome if a single dummy must comprise dozens of source values.

"Clever" solution:

    matrix sel = {2, 13, 14, 25}
    series D1 = maxr({src} .= vec(sel)') > 0

Comment: The subset of values to be grouped together can be written out as a matrix relatively compactly (first line). The magic that turns this into the desired series (second line) relies on the versatility of the "dot" (element-wise) matrix operators. The expression {src} gets a column-vector version of the input series (call this x) and vec(sel)' gets the input matrix as a row vector, in case it's a column vector or a matrix with both dimensions greater than 1 (call this s). If x is n x 1 and s is 1 x m, the .= operator produces an n x m result, each element (i, j) of which equals 1 if x_i = s_j, otherwise 0. The maxr function along with the > operator (see chapter 17 for both) then produces the result we want.

Of course, whichever procedure you use, you have to repeat it for each of the dummy series you want to create; but keep reading: the "proper" solution is probably what you want if you plan to create several dummies.

Further comment: Note that the "clever" solution depends on converting what is naturally a vector result into a series. This will fail if there are missing values in src, since by default missing values will be skipped when converting src to x, and so the number of rows in the result will fall short of the number of observations in the dataset. One fix is then to subsample the dataset to exclude missing values before employing this method; another is to adjust the skip_missing setting via the set command (see the Gretl Command Reference).

"Proper" solution: The best solution, in terms of both computational efficiency and code clarity, would be using a "conversion table" and the replace function, to produce a series on which the dummify command can be used. For example, suppose we want to convert from a series called fips holding FIPS codes[1] for the 50 US states plus the District of Columbia, to a series holding codes for the four standard US regions. We could create a 2 x 51 matrix (call it srmap) with the 51 FIPS codes on the first row and the corresponding region codes on the second, and then do

    series region = replace(fips, srmap[1,], srmap[2,])

[1] FIPS is the Federal Information Processing Standard: it assigns numeric codes from 1 to 56 to the US states and outlying areas.
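Here is a minimal, self-contained sketch of this conversion-table idea. The codes are made up for illustration (they are not real FIPS values): source codes 2 and 13 map to group 1, while 14 and 25 map to group 2.

    nulldata 6
    series src = {2; 13; 14; 25; 2; 13}
    # row 1: source codes; row 2: the corresponding group codes
    matrix map = {2, 13, 14, 25; 1, 1, 2, 2}
    series grp = replace(src, map[1,], map[2,])
    print src grp --byobs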
Generating an ARMA(1,1)

Problem: Generate y_t = 0.9 y_{t-1} + e_t + 0.5 e_{t-1}, with e_t ~ NIID(0, 1).

"Recommended" solution:

    alpha = 0.9
    theta = 0.5
    series y = filter(normal(), {1, theta}, alpha)

"Bread and butter" solution:

    alpha = 0.9
    theta = 0.5
    series e = normal()
    series y = 0
    series y = alpha * y(-1) + e + theta * e(-1)

Comment: The filter function is specifically designed for this purpose so, in most cases, you'll want to take advantage of its speed and flexibility. That said, in some cases you may want to generate the series in a manner which is more transparent (maybe for teaching purposes).

In the second solution, the statement series y = 0 is necessary because the next statement evaluates y recursively, so y[1] must be set. Note that you must use the keyword series here instead of writing genr y = 0 or simply y = 0, to ensure that y is a series and not a scalar.

Recoding a variable by classes

Problem: You want to recode a variable by classes. For example, you have the age of a sample of individuals (x_i) and you need to compute age classes (y_i) as

    y_i = 1   for x_i < 18
    y_i = 2   for 18 <= x_i < 65
    y_i = 3   for x_i >= 65

Solution:

    series y = 1 + (x >= 18) + (x >= 65)

Comment: True and false expressions are evaluated as 1 and 0 respectively, so they can be manipulated algebraically as any other number. The same result could also be achieved by using the conditional assignment operator (see below), but in most cases it would probably lead to more convoluted constructs.

Conditional assignment

Problem: Generate y_t via the following rule:

    y_t = x_t   for d_t > a
    y_t = z_t   for d_t <= a

Solution:

    series y = (d > a) ? x : z

Comment: There are several alternatives to the one presented above. One is a brute force solution using loops. Another one, more efficient but still sub-optimal, would be

    series y = (d>a)*x + (d<=a)*z

However, the ternary conditional assignment operator is not only the most efficient way to accomplish what we want, it is also remarkably transparent to read, when one gets used to it. Some readers may find it helpful to note that the conditional assignment operator works exactly the same way as the =IF() function in spreadsheets.

Generating a time index for panel datasets

Problem: Gretl has a $unit accessor, but not the equivalent for time. What should I use?

Solution:

    series x = time

Comment: The special construct genr time and its variants are aware of whether a dataset is a panel.

Sanitizing a list of regressors

Problem: I noticed that built-in commands like ols automatically drop collinear variables and put the constant first. How can I achieve the same result for an estimator I'm writing?

Solution: No worry. The function below does just that:

    function list sanitize (list X)
      list R = X - const
      if nelem(R) < nelem(X)
        R = const R
      endif
      return dropcoll(R)
    end function

so, for example, the code below

    nulldata 20
    x = normal()
    y = normal()
    z = x + y  # collinear
    list A = x y const z
    list B = sanitize(A)
    list A print
    list B print

returns

    ? list A print
    x y const z
    ? list B print
    const x y

Besides, it has been brought to our attention that some mischievous programs out there put the constant last, instead of first, like God intended. We are not amused by this utter disrespect of econometric tradition, but if you want to pursue the way of evil, it is rather simple to adapt the script above to that effect.

Generating the "hat" values after an OLS regression

Problem: I've just run an OLS regression, and now I need the so-called "leverage" values (also known as the "hat" values). I know you can access residuals and fitted values through "dollar" accessors, but nothing like that seems to be available for hat values.

Solution: Hat values can be thought of as the diagonal of the projection matrix P_X, or more explicitly as

    h_i = x_i' (X'X)^-1 x_i

where X is the matrix of regressors and x_i' is its i-th row.

The reader is invited to study the code below, which offers four different solutions to the problem:
    open data4-1.gdt --quiet
    list X = const sqft bedrms baths
    ols price X

    # method 1
    leverage --save --quiet
    series h1 = lever

    # these are necessary for what comes next
    matrix mX = {X}
    matrix iXX = invpd(mX'mX)

    # method 2
    series h2 = diag(qform(mX, iXX))

    # method 3
    series h3 = sumr(mX .* (mX * iXX))

    # method 4
    series h4 = NA
    loop i=1..$nobs
      matrix x = mX[i,]'
      h4[i] = x'iXX*x
    endloop

    # verify
    print h1 h2 h3 h4 --byobs

Comment: Solution 1 is the preferable one: it relies on the built-in leverage command, which computes the requested series quite efficiently, taking care of missing values, possible restrictions to the sample, etc.

However, three more are shown for didactical purposes, mainly to show the user how to manipulate matrices. Solution 2 first constructs the P_X matrix explicitly, via the qform function, and then takes its diagonal; this is definitely not recommended (despite its compactness), since you generate a much bigger matrix than you actually need and waste a lot of memory and CPU cycles in the process. It doesn't matter very much in the present case, since the sample size is very small, but with a big dataset this could be a very bad idea.

Solution 3 is more clever, and relies on the fact that, if you define Z = X(X'X)^-1, then h_i could also be written as

    h_i = x_i' z_i = sum_k x_ik z_ik

which is, in turn, equivalent to the sum of the elements of the i-th row of X .* Z, where .* is the element-by-element product. In this case, your clever usage of matrix algebra would produce a solution computationally much superior to solution 2.

Solution 4 is the most old-fashioned one, and employs an indexed loop. While this wastes practically no memory and employs no more CPU cycles in algebraic operations than strictly necessary, it imposes a much greater burden on the hansl interpreter, since handling a loop is conceptually more complex than a single operation. In practice, you'll find that for any realistically-sized problem, solution 4 is much slower than solution 3.

Moving functions for time series

Problem: Gretl provides native functions for moving averages, but I need to compute a different statistic on a sliding data window. Is there a way to do this without using loops?

Solution: One of the nice things of the list data type is that, if you define a list, then several functions that would normally apply "vertically" to elements of a series apply "horizontally" across the list. So, for example, the following piece of code

    open bjg.gdt
    order = 12
    list L = lg lags(order-1, lg)
    smpl +order ;
    series movmin = min(L)
    series movmax = max(L)
    series movmed = median(L)
    smpl full

computes the moving minimum, maximum and median of the lg series. Plotting the four series would produce something similar to Figure 21.1.

Figure 21.1: Moving functions

Generating data with a prescribed correlation structure

Problem: I'd like to generate a bunch of normal random variates whose covariance matrix is exactly equal to a given matrix Sigma. How can I do this in gretl?

Solution: The Cholesky decomposition is your friend. If you want to generate data with a given population covariance matrix, then all you have to do is post-multiply your pseudo-random data by the Cholesky factor (transposed) of the matrix you want. For example:

    set seed 123
    S = {2, 1; 1, 1}
    T = 1000
    X = mnormal(T, rows(S))
    X = X * cholesky(S)'
    eval mcov(X)

should give you

    2.0016   1.0157
    1.0157   1.0306

If, instead, you want your simulated data to have a given sample covariance matrix, you have to apply the same technique twice: once for standardizing the data, once more for giving it the covariance structure you want. Example:

    S = {2, 1; 1, 1}
    T = 1000
    X = mnormal(T, rows(S))
    X = X * (cholesky(S) / cholesky(mcov(X)))'
    eval mcov(X)

gives you

    2   1
    1   1

as required.
21.3 Neat tricks

Interaction dummies

Problem: You want to estimate the model y_i = x_i b1 + z_i b2 + d_i b3 + (d_i z_i) b4 + e_i, where d_i is a dummy variable while x_i and z_i are vectors of explanatory variables.

Solution: As of version 1.9.12, gretl provides a dedicated operator to make this operation easy; see section 15.1 for details (especially example script 15.1). But back in my day, we used loops to do that! Here's how:

    list X = x1 x2 x3
    list Z = z1 z2
    list dZ = deflist()
    loop foreach i Z
      series d$i = d * $i
      list dZ = dZ d$i
    endloop
    ols y X Z d dZ

Comment: It's amazing what string substitution can do for you, isn't it?

Realized volatility

Problem: Given data by the minute, you want to compute the realized volatility for the hour as

    RV_t = (1/60) * sum_{tau=1}^{60} y^2_{t,tau}

Imagine your sample starts at time 1:1.

Solution:

    smpl full
    genr time
    series minute = int(time/60) + 1
    series second = time % 60
    setobs minute second --panel-vars
    series rv = psd(y)^2
    setobs 1 1
    smpl second == 1 --restrict
    store foo rv

Comment: Here we trick gretl into thinking that our dataset is a panel dataset, where the minutes are the "units" and the seconds are the "time"; this way, we can take advantage of the special function psd (panel standard deviation). Then we simply drop all observations but one per minute and save the resulting data (store foo rv translates as "store in the gretl datafile foo.gdt the series rv").

Looping over two paired lists

Problem: Suppose you have two lists with the same number of elements, and you want to apply some command to corresponding elements over a loop.

Solution:

    list L1 = a b c
    list L2 = x y z
    k1 = 1
    loop foreach i L1
      k2 = 1
      loop foreach j L2
        if k1 == k2
          ols $i 0 $j
        endif
        k2++
      endloop
      k1++
    endloop

Comment: The simplest way to achieve the result is to loop over all possible combinations and filter out the unneeded ones via an if condition, as above. That said, in some cases variable names can help. For example, if

    list Lx = x1 x2 x3
    list Ly = y1 y2 y3

then we could just loop over the integers, which is quite intuitive and certainly more elegant:

    loop i=1..3
      ols y$i const x$i
    endloop

Convolution / polynomial multiplication

Problem: How do I multiply polynomials? There's no dedicated function to do that, and yet it's a fairly basic mathematical task.

Solution: Never fear! We have the conv2d function, which is a tool for a more general problem, but includes polynomial multiplication as a special case.

Suppose you want to multiply two finite-order polynomials P(x) = sum_{i=0}^m p_i x^i and Q(x) = sum_{i=0}^n q_i x^i. What you want is the sequence of coefficients of the polynomial

    R(x) = P(x) * Q(x) = sum_{k=0}^{m+n} r_k x^k

where

    r_k = sum_{i=0}^k p_i q_{k-i}

is the convolution of the p_i and q_i coefficients. The same operation can be performed via the FFT, but in most cases using conv2d is quicker and more natural.

As an example, we'll use the same one we used in Section 30.5: consider the multiplication of two polynomials:

    P(x) = 1 + 0.5x
    Q(x) = 1 + 0.3x - 0.8x^2
    R(x) = P(x) * Q(x) = 1 + 0.8x - 0.65x^2 - 0.4x^3

Comparing two lists

Problem: How can I tell if two lists contain the same variables (not necessarily in the same order)?

Solution: In many respects, lists are like sets, so it makes sense to use the so-called "symmetric difference" operator, which is defined as

    A delta B = (A\B) union (B\A)

where in this context backslash represents the relative complement operator, such that A\B = {x in A : x not in B}.
The following code snippet performs all the necessary calculations for the polynomial example:

    p = {1; 0.5}
    q = {1; 0.3; -0.8}
    r = conv2d(p, q)
    print r

Running the above produces

    r (4 x 1)

        1
      0.8
    -0.65
     -0.4

which is indeed the desired result. Note that the same computation could also be performed via the filter function, at the price of slightly more elaborate syntax.

As for comparing two lists: in practice, we first check if there are any series in A but not in B; then we perform the reverse check. If the union of the two results is an empty set, then the lists must contain the same variables. The hansl syntax for this would be something like

    scalar NotTheSame = nelem((A-B) || (B-A)) > 0

Reordering list elements

Problem: Is there a way to reorder list elements?

Solution: You can use the fact that a list can be cast into a vector of integers and then manipulated via ordinary matrix syntax. So, for example, if you wanted to "flip" a list you may just use the mreverse function. For example:

    open AWM.gdt --quiet
    list X = 3 6 9 12
    matrix tmp = X
    list revX = mreverse(tmp')
    list X print
    list revX print

will produce

    ? list X print
    D1 D872 EEN_DIS GCD
    ? list revX print
    GCD EEN_DIS D872 D1

Plotting an asymmetric confidence interval

Problem: I like the look of the --band option to the gnuplot and plot commands, but it's set up for plotting a symmetric interval and I want to show an asymmetric one.

Solution: Any interval is by construction symmetrical about its mean at each observation. So you just need to perform a little tweak. Say you want to plot a series x along with a band defined by the two series top and bot. Here we go:

    # create series for midpoint and deviation
    series mid = (top + bot)/2
    series dev = top - mid
    gnuplot x --band=mid,dev --time-series --with-lines --output=display

Cross-validation

Problem: I'd like to compute the so-called leave-one-out cross-validation criterion for my regression. Is there a command in gretl?

If you have a sample with n observations, the leave-one-out cross-validation criterion can be mechanically computed by running n regressions in which one observation at a time is omitted and all the other ones are used to forecast its value. The sum of the n squared forecast errors is the statistic we want. Fortunately, there is no need to do so. It is possible to prove that the same statistic can be computed as

    CV = sum_{i=1}^n [u_i / (1 - h_i)]^2

where h_i is the i-th element of the "hat" matrix (see section 21.2) from a regression on the whole sample.

This method is natively provided by gretl as a side benefit to the leverage command, which stores the CV criterion into the $test accessor. The following script shows the equivalence of the two approaches:

    set verbose off
    open data4-1.gdt
    list X = const sqft bedrms baths

    # compute the CV criterion the silly way
    scalar CV = 0
    matrix mX = {X}
    loop i = 1..$nobs
      xi = mX[i,]
      yi = price[i]
      smpl obs != i --restrict
      ols price X --quiet
      smpl full
      scalar fe = yi - xi * $coeff
      CV += fe^2
    endloop
    printf "CV = %g\n", CV

    # the smart way
    ols price X --quiet
    leverage --quiet
    printf "CV = %g\n", $test

Is my matrix result broken?

Problem: Most of the matrix manipulation functions available in gretl flag an error if something goes wrong, but there's no guarantee that every matrix computation will return an entirely finite matrix, containing no infinities or NaNs. So how do I tell if I've got a fully valid matrix?

Solution: Given a matrix m, the call ok(m) returns a matrix with the same dimensions as m, with elements 1 for finite values and 0 for infinities or NaNs. A matrix as a whole is "OK" if it has no elements which fail this test, so here's a suitable check for a broken matrix, using the logical NOT operator:

    sumc(sumr(!ok(m))) > 0

If this gives a non-zero return value, you know that m contains at least one non-finite element.

Part II Econometric methods

Chapter 22 Robust covariance matrix estimation

22.1 Introduction

Consider (once again) the linear regression model

    y = X b + u                                                (22.1)

where y and u are T-vectors, X is a T x k matrix of regressors, and b is a k-vector of parameters. As is well known, the estimator of b given by Ordinary Least Squares (OLS) is
    b-hat = (X'X)^-1 X'y                                       (22.2)

If the condition E(u|X) = 0 is satisfied, this is an unbiased estimator; under somewhat weaker conditions the estimator is biased but consistent. It is straightforward to show that when the OLS estimator is unbiased (that is, when E(b-hat - b) = 0), its variance is

    Var(b-hat) = E[(b-hat - b)(b-hat - b)'] = (X'X)^-1 X' Omega X (X'X)^-1    (22.3)

where Omega = E(uu') is the covariance matrix of the error terms.

Under the assumption that the error terms are independently and identically distributed (iid), we can write Omega = s^2 I, where s^2 is the (common) variance of the errors (and the covariances are zero). In that case (22.3) simplifies to the "classical" formula,

    Var(b-hat) = s^2 (X'X)^-1                                  (22.4)

If the iid assumption is not satisfied, two things follow. First, it is possible in principle to construct a more efficient estimator than OLS, for instance some sort of Feasible Generalized Least Squares (FGLS). Second, the simple "classical" formula for the variance of the least squares estimator is no longer correct, and hence the conventional OLS standard errors (which are just the square roots of the diagonal elements of the matrix defined by (22.4)) do not provide valid means of statistical inference.

In the recent history of econometrics there are broadly two approaches to the problem of non-iid errors. The "traditional" approach is to use an FGLS estimator. For example, if the departure from the iid condition takes the form of time-series dependence, and if one believes that this could be modeled as a case of first-order autocorrelation, one might employ an AR(1) estimation method such as Cochrane-Orcutt, Hildreth-Lu, or Prais-Winsten. If the problem is that the error variance is non-constant across observations, one might estimate the variance as a function of the independent variables and then perform weighted least squares, using as weights the reciprocals of the estimated variances.

While these methods are still in use, an alternative approach has found increasing favor: that is, use OLS but compute standard errors (or more generally, covariance matrices) that are robust with respect to deviations from the iid assumption. This is typically combined with an emphasis on using large datasets, large enough that the researcher can place some reliance on the (asymptotic) consistency property of OLS. This approach has been enabled by the availability of cheap computing power. The computation of robust standard errors and the handling of very large datasets were daunting tasks at one time, but now they are unproblematic. The other point favoring the newer methodology is that while FGLS offers an efficiency advantage in principle, it often involves making additional statistical assumptions, which may or may not be justified, which may not be easy to test rigorously, and which may threaten the consistency of the estimator (for example, the "common factor restriction" that is implied by traditional FGLS corrections for autocorrelated errors).

James Stock and Mark Watson's Introduction to Econometrics illustrates this approach at the level of undergraduate instruction: many of the datasets they use comprise thousands or tens of thousands of observations, FGLS is downplayed, and robust standard errors are reported as a matter of course. In fact, the discussion of the classical standard errors (labeled "homoskedasticity-only") is confined to an Appendix.

Against this background it may be useful to set out and discuss all the various options offered by gretl in respect of robust covariance matrix estimation. The first point to notice is that gretl
produces "classical" standard errors by default (in all cases apart from GMM estimation). In script mode you can get robust standard errors by appending the --robust flag to estimation commands. In the GUI program the model specification dialog usually contains a "Robust standard errors" check box, along with a "configure" button that is activated when the box is checked. The configure button takes you to a configuration dialog, which can also be reached from the main menu bar: Tools > Preferences > General > HCCME. There you can select from a set of possible robust estimation variants, and can also choose to make robust estimation the default.

The specifics of the available options depend on the nature of the data under consideration (cross-sectional, time series or panel) and also, to some extent, the choice of estimator. (Although we introduced robust standard errors in the context of OLS above, they may be used in conjunction with other estimators too.) The following three sections of this chapter deal with matters that are specific to the three sorts of data just mentioned. Note that additional details regarding covariance matrix estimation in the context of GMM are given in chapter 27.

We close this introduction with a brief statement of what "robust standard errors" can and cannot achieve. They can provide for asymptotically valid statistical inference in models that are basically correctly specified, but in which the errors are not iid. The "asymptotic" part means that they may be of little use in small samples. The "correct specification" part means that they are not a magic bullet: if the error term is correlated with the regressors, so that the parameter estimates themselves are biased and inconsistent, robust standard errors will not save the day.

22.2 Cross-sectional data and the HCCME

With cross-sectional data, the most likely departure from iid errors is heteroskedasticity (non-constant variance).[1] In some cases one may be able to arrive at a judgment regarding the likely form of the heteroskedasticity, and hence to apply a specific correction. The more common case, however, is where the heteroskedasticity is of unknown form. We seek an estimator of the covariance matrix of the parameter estimates that retains its validity, at least asymptotically, in face of unspecified heteroskedasticity. It is not obvious, a priori, that this should be possible, but White (1980) showed that

    Var_h(b-hat) = (X'X)^-1 X' Omega-hat X (X'X)^-1            (22.5)

does the trick. (As usual in statistics, we need to say "under certain conditions", but the conditions are not very restrictive.) Omega-hat is, in this context, a diagonal matrix whose non-zero elements may be estimated using squared OLS residuals. White referred to (22.5) as a heteroskedasticity-consistent covariance matrix estimator (HCCME).

Davidson and MacKinnon (2004, chapter 5) offer a useful discussion of several variants on White's HCCME theme. They refer to the original variant of (22.5), in which the diagonal elements of Omega-hat are estimated directly by the squared OLS residuals, u-hat_t^2, as HC0. The associated standard errors are often called "White's standard errors".

[1] In some specialized contexts spatial autocorrelation may be an issue. Gretl does not have any built-in methods to handle this and we will not discuss it here.
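To make the HC0 recipe concrete, here is a minimal hansl sketch, using the data4-1.gdt sample file shipped with gretl and an arbitrary specification, which computes the sandwich (22.5) by hand and compares the result with the standard errors produced by the --robust option under the default HC0 setting. This is purely illustrative; in practice one would simply use --robust.

    open data4-1.gdt --quiet
    list X = const sqft
    # built-in: White's (HC0) standard errors
    set hc_version 0
    ols price X --robust --quiet
    matrix se_builtin = $stderr
    # by hand: (X'X)^-1 X' diag(uhat^2) X (X'X)^-1
    matrix mX = {X}
    matrix u2 = {$uhat}.^2
    matrix iXX = invpd(mX'mX)
    matrix V = iXX * (mX' * (mX .* u2)) * iXX
    matrix se_manual = sqrt(diag(V))
    print se_builtin se_manual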
Davidson and MacKinnon (2004, chapter 5) offer a useful discussion of several variants on White's HCCME theme. They refer to the original variant of (22.5), in which the diagonal elements of Ω̂ are estimated directly by the squared OLS residuals, û²_t, as HC0. (The associated standard errors are often called "White's standard errors".) The various refinements of White's proposal share a common point of departure, namely the idea that the squared OLS residuals are likely to be "too small" on average. This point is quite intuitive. The OLS parameter estimates, β̂, satisfy by design the criterion that the sum of squared residuals,

    Σ û²_t = Σ (y_t − X_t β̂)²

is minimized for given X and y. Suppose that β̂ ≠ β. This is almost certain to be the case: even if OLS is not biased, it would be a miracle if the β̂ calculated from any finite sample were exactly equal to β. But in that case the sum of squares of the true, unobserved errors, Σ u²_t = Σ (y_t − X_t β)², is bound to be greater than Σ û²_t. The elaborated variants on HC0 take this point on board, as follows:

• HC1: Applies a degrees-of-freedom correction, multiplying the HC0 matrix by T/(T − k).

• HC2: Instead of using û²_t for the diagonal elements of Ω̂, uses û²_t/(1 − h_t), where h_t = X_t(X′X)⁻¹X′_t, the tth diagonal element of the projection matrix, P_X, which has the property that P_X·y = ŷ. The relevance of h_t is that if the variance of all the u_t is σ², the expectation of û²_t is σ²(1 − h_t), or in other words, the ratio û²_t/(1 − h_t) has expectation σ². As Davidson and MacKinnon show, 0 ≤ h_t < 1 for all t, so this adjustment cannot reduce the diagonal elements of Ω̂ and in general revises them upward.

• HC3: Uses û²_t/(1 − h_t)². The additional factor of (1 − h_t) in the denominator, relative to HC2, may be justified on the grounds that observations with large variances tend to exert a lot of influence on the OLS estimates, so that the corresponding residuals tend to be underestimated. See Davidson and MacKinnon for a fuller explanation.

• HC3a: Implements the jackknife approach from MacKinnon and White (1985). (HC3 is a close approximation of this.)

The relative merits of these variants have been explored by means of both simulations and theoretical analysis. Unfortunately there is not a clear consensus on which is "best". Davidson and MacKinnon argue that the original HC0 is likely to perform worse than the others; nonetheless, "White's standard errors" are reported more often than the more sophisticated variants and therefore, for reasons of comparability, HC0 is the default HCCME in gretl.

If you wish to use HC1, HC2, HC3 or HC3a you can arrange for this in either of two ways. In script mode, you can do, for example,

    set hc_version 2

In the GUI program you can go to the HCCME configuration dialog, as noted above, and choose any of these variants to be the default.
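Since the hc_version setting is "sticky", it is easy to loop over the variants when you want to see how much practical difference they make. A small sketch, using the same hypothetical series names as above:

    # illustrative sketch: compare the numbered HCCME variants
    loop i = 0..3
        set hc_version $i
        ols y const x1 x2 --robust --quiet
        printf "HC%d standard errors:\n", i
        print $stderr
    endloop

In well-behaved samples the four sets of standard errors are usually close; large discrepancies are themselves a warning sign of influential observations.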
22.3 Time series data and HAC covariance matrices

Heteroskedasticity may be an issue with time series data too, but it is unlikely to be the only, or even the primary, concern.

One form of heteroskedasticity is common in macroeconomic time series, but is fairly easily dealt with. That is, in the case of strongly trending series such as Gross Domestic Product, aggregate consumption, aggregate investment, and so on, higher levels of the variable in question are likely to be associated with higher variability in absolute terms. The obvious "fix", employed in many macroeconometric studies, is to use the logs of such series rather than the raw levels. Provided the proportional variability of such series remains roughly constant over time, the log transformation is effective.

Other forms of heteroskedasticity may resist the log transformation, but may demand a special treatment distinct from the calculation of robust standard errors. We have in mind here autoregressive conditional heteroskedasticity, for example in the behavior of asset prices, where large disturbances to the market may usher in periods of increased volatility. Such phenomena call for specific estimation strategies, such as GARCH (see chapter 31).

Despite the points made above, some residual degree of heteroskedasticity may be present in time series data; the key point is that in most cases it is likely to be combined with serial correlation (autocorrelation), hence demanding a special treatment. In White's approach, Ω̂, the estimated covariance matrix of the u_t, remains conveniently diagonal: the variances, E(u²_t), may differ by t but the covariances, E(u_t u_s) for s ≠ t, are all zero. Autocorrelation in time series data means that at least some of the off-diagonal elements of Ω̂ should be non-zero. This introduces a substantial complication, and requires another piece of terminology: estimates of the covariance matrix that are asymptotically valid in face of both heteroskedasticity and autocorrelation of the error process are termed HAC (heteroskedasticity- and autocorrelation-consistent).

The issue of HAC estimation is treated in more technical terms in chapter 27. Here we try to convey some of the intuition at a more basic level. We begin with a general comment: residual autocorrelation is not so much a property of the data as a symptom of an inadequate model. Data may be persistent through time, and if we fit a model that does not take this aspect into account properly we end up with a model with autocorrelated disturbances. Conversely, it is often possible to mitigate or even eliminate the problem of autocorrelation by including relevant lagged variables in a time series model, or in other words, by specifying the dynamics of the model more fully. HAC estimation should not be seen as the first resort in dealing with an autocorrelated error process.

That said, the "obvious" extension of White's HCCME to the case of autocorrelated errors would seem to be this: estimate the off-diagonal elements of Ω̂ (that is, the autocovariances, E(u_t u_s)) using, once again, the appropriate OLS residuals: ω̂_ts = û_t û_s. This is basically right, but demands an important amendment. We seek a consistent estimator, one that converges towards the true Ω as the sample size tends towards infinity. This can't work if we allow unbounded serial dependence. A larger sample will enable us to estimate more of the true ω_ts elements (that is, for t and s more widely separated in time) but will not contribute ever-increasing information regarding the maximally separated ω_ts pairs, since the maximal separation itself grows with the sample size. To ensure consistency, we have to confine our attention to processes exhibiting temporally limited dependence. In other words, we cut off the computation of the ω̂_ts values at some maximum value of p = t − s, where p is treated as an increasing function of the sample size, T, although it cannot increase in proportion to T.

The simplest variant of this idea is to truncate the computation at some finite lag order p, where p grows as, say, T^(1/4). The trouble with this is that the resulting Ω̂ may not be a positive definite matrix: in practical terms, we may end up with negative estimated variances. One solution to this problem is offered by the Newey–West estimator (Newey and West, 1987), which assigns declining weights to the sample autocovariances as the temporal separation increases.

To understand this point it is helpful to look more closely at the covariance matrix given in (22.5), namely

    (X′X)⁻¹ (X′Ω̂X) (X′X)⁻¹

This is known as a "sandwich" estimator. The bread, which appears on both sides, is (X′X)⁻¹. This k × k matrix is also the key ingredient in the computation of the classical covariance matrix. The filling in the sandwich is

    Σ̂ = Γ̂(0) + Σ_{j=1}^{p} w_j ( Γ̂(j) + Γ̂′(j) )

where w_j is the weight given to lag j > 0 and the k × k matrix Γ̂(j), for j ≥ 0, is
given by

    Γ̂(j) = Σ_{t=j+1}^{T} û_t û_{t−j} X′_t X_{t−j}

that is, the sample autocovariance matrix of X′_t û_t at lag j (apart from a scaling factor T).

This leaves two questions. How exactly do we determine the maximum lag length, or "bandwidth", p, of the HAC estimator? And how exactly are the weights w_j to be determined? We will return to the difficult question of the bandwidth shortly. As regards the weights, gretl offers three variants. The default is the Bartlett kernel, as used by Newey and West. This sets

    w_j = 1 − j/(p + 1)   for j ≤ p
    w_j = 0               for j > p

so the weights decline linearly as j increases. The other two options are the Parzen kernel and the Quadratic Spectral (QS) kernel. For the Parzen kernel,

    w_j = 1 − 6a²_j + 6a³_j   for 0 ≤ a_j ≤ 0.5
    w_j = 2(1 − a_j)³         for 0.5 < a_j ≤ 1
    w_j = 0                   for a_j > 1

where a_j = j/(p + 1), and for the QS kernel,

    w_j = [25 / (12π²d²_j)] · ( sin(m_j)/m_j − cos(m_j) )

where d_j = j/p and m_j = 6πd_j/5.

Figure 22.1 shows the weights generated by these kernels, for p = 4 and j = 1 to 9.

The bandwidth rule is also chosen via the set command; two rules of thumb based on the sample size, nw1 and nw2, are available, as in

    set hac_lag nw2

As shown in Table 22.1, the choice between nw1 and nw2 does not make a great deal of difference.

    T      p (nw1)   p (nw2)
    50        2         3
    100       3         4
    150       3         4
    200       4         4
    300       5         5
    400       5         5

    Table 22.1: HAC bandwidth: two rules of thumb

You also have the option of specifying a fixed numerical value for p, as in

    set hac_lag 6

In addition you can set a distinct bandwidth for use with the Quadratic Spectral kernel (since this need not be an integer). For example,

    set qs_bandwidth 3.5

Prewhitening and data-based bandwidth selection

An alternative approach is to deal with residual autocorrelation by attacking the problem from two sides. The intuition behind the technique known as VAR prewhitening (Andrews and Monahan, 1992) can be illustrated by a simple example. Let x_t be a sequence of first-order autocorrelated random variables,

    x_t = ρx_{t−1} + u_t

The long-run variance of x_t can be shown to be

    V_LR(x_t) = V_LR(u_t) / (1 − ρ)²

In most cases, u_t is likely to be less autocorrelated than x_t, so a smaller bandwidth should suffice. Estimation of V_LR(x_t) can therefore proceed in three steps: (1) estimate ρ; (2) obtain a HAC estimate of û_t = x_t − ρ̂x_{t−1}; and (3) divide the result by (1 − ρ̂)².

The application of the above concept to our problem implies estimating a finite-order Vector Autoregression (VAR) on the vector variables ξ_t = X′_t û_t. In general, the VAR can be of any order, but in most cases 1 is sufficient; the aim is not to build a watertight model for ξ_t, but just to "mop up" a substantial part of the autocorrelation. Hence, the following VAR is estimated:

    ξ_t = Aξ_{t−1} + ε_t

Then an estimate of the matrix X′ΩX can be recovered via

    (I − Â)⁻¹ Σ̂_ε (I − Â′)⁻¹

where Σ̂_ε is any HAC estimator, applied to the VAR residuals. You can ask for prewhitening in gretl using

    set hac_prewhiten on

There is at present no mechanism for specifying an order other than 1 for the initial VAR.

A further refinement is available in this context, namely data-based bandwidth selection. It makes intuitive sense that the HAC bandwidth should not simply be based on the size of the sample, but should somehow take into account the time-series properties of the data (and also the kernel chosen). A nonparametric method for doing this was proposed by Newey and West (1994); a good concise account of the method is given in Hall (2005). This option can be invoked in gretl via

    set hac_lag nw3

This option is the default when prewhitening is selected, but you can override it by giving a specific numerical value for hac_lag.
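The HAC settings discussed above are all "state" variables, so a script typically sets them before issuing the estimation command of interest. Here is a minimal sketch; the dataset and the series names y and x are placeholders for your own time-series data:

    # illustrative sketch: HAC standard errors for a time-series model
    set hac_kernel bartlett   # the default; parzen and qs also available
    set hac_lag nw2           # rule-of-thumb bandwidth
    ols y const x --robust
    # alternatively: prewhiten, with data-based bandwidth selection
    set hac_prewhiten on      # implies the nw3 bandwidth method
    ols y const x --robust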
Even the Newey–West data-based method does not fully pin down the bandwidth for any particular sample. The first step involves calculating a series of residual covariances. The length of this series is given as a function of the sample size, but only up to a scalar multiple; for example, it is given as O(T^(2/9)) for the Bartlett kernel. Gretl uses an implied multiple of 1.

Newey–West with missing values

If the estimation sample for a time-series model includes incomplete observations, where the value of the dependent variable or one or more regressors is missing, the Newey–West procedure must be either modified or abandoned, since some ingredients of the Σ̂ matrix defined above will be absent. Two modified methods have been discussed in the literature. Parzen (1963) proposed what he called "Amplitude Modulation" (AM), which involves setting the values of the residual and each of the regressors to zero for the incomplete observations, and then proceeding as usual. Datta and Du (2012) propose the so-called "Equal Spacing" (ES) method: calculate as if the incomplete observations did not exist, and the complete observations therefore form an equally-spaced series. Somewhat surprisingly, it can be shown that both of these methods have appropriate asymptotic properties; see Rho and Vogelsang (2018) for further elaboration. In gretl you can select a preferred method via one or other of these commands:

    set hac_missvals es    # ES (Datta and Du)
    set hac_missvals am    # AM (Parzen)
    set hac_missvals off

The ES method is the default. The "off" option means that gretl will refuse to produce HAC standard errors when the sample includes incomplete observations; use this if you have qualms about the modified methods.

VARs: a special case

A well-specified vector autoregression (VAR) will generally include enough lags of the dependent variables to obviate the problem of residual autocorrelation, in which case HAC estimation is redundant, although there may still be a need to correct for heteroskedasticity. For that reason, plain HCCME, and not HAC, is the default when the --robust flag is given in the context of the var command. However, if for some reason you need HAC you can force the issue by giving the option --robust-hac.

Long-run variance

Let us expand a little on the subject of the long-run variance that was mentioned above, and the associated tools offered by gretl for scripting. (You may also want to check out the reference for the lrcovar function, for the multivariate case.) As is well known, the variance of the average of T random variables x_1, x_2, …, x_T with equal variance σ² equals σ²/T if the data are uncorrelated. In this case, the sample variance of x_t over the sample size provides a consistent estimator. If, however, there is serial correlation among the x_t's, the variance of X̄ = T⁻¹ Σ_{t=1}^{T} x_t must be estimated differently. One of the most widely used statistics for this purpose is a nonparametric kernel estimator with the Bartlett kernel, defined as

    ω̂²(k) = T⁻¹ Σ_{t=k}^{T−k} [ Σ_{i=−k}^{k} w_i (x_t − X̄)(x_{t−i} − X̄) ]

where the integer k is known as the window size and the w_i terms are the so-called Bartlett weights, defined as w_i = 1 − |i|/(k + 1). It can be shown that, for k large enough, ω̂²(k)/T yields a consistent estimator of the variance of X̄.

Gretl implements this estimator by means of the function lrvar. This function takes one required argument, namely the series whose long-run variance is to be estimated, followed by two optional arguments. The first of these can be used to supply a value for k; if it is omitted or negative, the popular choice T^(1/3) is used. The second allows specification of an assumed value for the population mean of X, which then replaces X̄ in the variance calculation. Usage is illustrated below.

    # automatic window size; use xbar for mean
    lrs2 = lrvar(x)
    # set a window size of 12
    lrs2 = lrvar(x, 12)
    # set window size and impose assumed mean of zero
    lrs2 = lrvar(x, 12, 0)
    # impose mean zero, automatic window size
    lrs2 = lrvar(x, -1, 0)
22.4 Special issues with panel data

Since panel data have both a time-series and a cross-sectional dimension, one might expect that, in general, robust estimation of the covariance matrix would require handling both heteroskedasticity and autocorrelation (the HAC approach). In addition, some special features of panel data require attention:

• The variance of the error term may differ across the cross-sectional units.

• The covariance of the errors across the units may be non-zero in each time period.

• If the "between" variation is not swept out, the errors may exhibit autocorrelation, not in the usual time-series sense but in the sense that the mean value of the error term may differ across units. This is relevant when estimation is by pooled OLS.

Gretl currently offers two robust covariance matrix estimators specifically for panel data. These are available for models estimated via fixed effects, random effects, pooled OLS, and pooled two-stage least squares. The default robust estimator is that suggested by Arellano (2003), which is HAC provided the panel is of the "large n, small T" variety (that is, many units are observed in relatively few periods). The Arellano estimator is

    Σ̂_A = (X′X)⁻¹ ( Σ_{i=1}^{n} X′_i û_i û′_i X_i ) (X′X)⁻¹

where X is the matrix of regressors (with the group means subtracted, in the case of fixed effects, or quasi-demeaned, in the case of random effects), û_i denotes the vector of residuals for unit i, and n is the number of cross-sectional units.² Cameron and Trivedi (2005) make a strong case for using this estimator; they note that the ordinary White HCCME can produce misleadingly small standard errors in the panel context because it fails to take autocorrelation into account.³ In addition, Stock and Watson (2008) show that the White HCCME is inconsistent in the fixed-effects panel context for fixed T > 2.

² This variance estimator is also known as the "clustered (over entities)" estimator.
³ See also Cameron and Miller (2015) for a discussion of the Arellano-type estimator in the context of the random effects model.

In cases where autocorrelation is not an issue, the estimator proposed by Beck and Katz (1995) and discussed by Greene (2003, chapter 13) may be appropriate. This estimator, which takes into account contemporaneous correlation across the units and heteroskedasticity by unit, is

    Σ̂_BK = (X′X)⁻¹ ( Σ_{i=1}^{n} Σ_{j=1}^{n} σ̂_ij X′_i X_j ) (X′X)⁻¹

The covariances σ̂_ij are estimated via

    σ̂_ij = û′_i û_j / T

where T is the length of the time series for each unit. Beck and Katz call the associated standard errors "Panel-Corrected Standard Errors" (PCSE). This estimator can be invoked in gretl via the command

    set pcse on

The Arellano default can be re-established via

    set pcse off

Note that regardless of the pcse setting, the robust estimator is not used unless the --robust flag is given (or the "Robust" box is checked in the GUI program).
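To make the choice concrete, here is a brief sketch using the Arellano–Bond dataset supplied with gretl (abdata.gdt, also used in chapter 24); the specification itself is purely illustrative:

    open abdata.gdt
    # Arellano HAC standard errors (the default robust variant)
    panel n const w k ys --fixed-effects --robust
    # Beck-Katz panel-corrected standard errors
    set pcse on
    panel n const w k ys --fixed-effects --robust
    set pcse off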
the matrix of regressors and the vector of residuals that fall within cluster j As noted above the Arellano variance estimator for panel data models is a special case of this where the clustering is by panel unit For models estimated by the method of Maximum Likelihood in which case the standard variance estimator is the inverse of the negative Hessian H the cluster estimator is ΣᴄH¹j1ᵐ Gⱼ Gⱼ H¹ where Gⱼ is the sum of the score that is the derivative of the loglikelihood with respect to the parameter estimates across the observations falling within cluster j ³See also Cameron and Miller 2015 for a discussion of the Arellanotype estimator in the context of the random effects model Chapter 22 Robust covariance matrix estimation 214 It is common to apply a degrees of freedom adjustment to these estimators otherwise the variance may appear misleadingly small in comparison with other estimators if the number of clusters is small In the least squares case the factor is mm 1 n 1n k where n is the total number of observations and k is the number of parameters estimated in the case of ML estimation the factor is just mm 1 Availability and syntax The clusterrobust estimator is currently available for models estimated via OLS and TSLS and also for most ML estimators other than those specialized for timeseries data binary logit and pro bit ordered logit and probit multinomial logit Tobit interval regression biprobit count models and duration models Additionally the same option is available for generic maximum likelihood estimation as provided by the mle command see chapter 26 for extra details In all cases the syntax is the same you give the option flag cluster followed by the name of the series to be used to define the clusters as in ols y 0 x1 x2 clustercvar The specified clustering variable must a be defined not missing at all observations used in esti mating the model and b take on at least two distinct values over the estimation range The clusters are defined as sets of observations having a common value for the clustering variable It is generally expected that the number of clusters is substantially less than the total number of observations Chapter 23 Panel data A panel dataset is one in which each of N 1 units sometimes called individuals or groups is observed over time In a balanced panel there are T 1 observations on each unit more generally the number of observations may differ by unit In the following we index units by i and time by t To allow for imbalance in a panel we use the notation Ti to refer to the number of observations for unit or individual i 231 Estimation of panel models Pooled Ordinary Least Squares The simplest estimator for panel data is pooled OLS In most cases this is unlikely to be adequate but it provides a baseline for comparison with more complex estimators If you estimate a model on panel data using OLS an additional test item becomes available In the GUI model window this is the item panel specification under the Tests menu the script counterpart is the panspec command To take advantage of this test you should specify a model without any dummy variables represent ing crosssectional units The test compares pooled OLS against the principal alternatives the fixed effects and random effects models These alternatives are explained in the following section The fixed and random effects models In the graphical interface these options are found under the menu item ModelPanelFixed and random effects In the commandline interface one uses the panel command with or without the 
Chapter 23 Panel data

A panel dataset is one in which each of N > 1 units (sometimes called "individuals" or "groups") is observed over time. In a balanced panel there are T > 1 observations on each unit; more generally, the number of observations may differ by unit. In the following we index units by i and time by t. To allow for imbalance in a panel we use the notation T_i to refer to the number of observations for unit or individual i.

23.1 Estimation of panel models

Pooled Ordinary Least Squares

The simplest estimator for panel data is pooled OLS. In most cases this is unlikely to be adequate, but it provides a baseline for comparison with more complex estimators.

If you estimate a model on panel data using OLS, an additional test item becomes available. In the GUI model window this is the item "panel specification" under the Tests menu; the script counterpart is the panspec command. To take advantage of this test, you should specify a model without any dummy variables representing cross-sectional units. The test compares pooled OLS against the principal alternatives, the fixed effects and random effects models. These alternatives are explained in the following section.

The fixed and random effects models

In the graphical interface these options are found under the menu item "Model/Panel/Fixed and random effects". In the command-line interface one uses the panel command, with or without the --random-effects option. (The --fixed-effects option is also allowed, but not strictly necessary, being the default.)

This section explains the nature of these models and comments on their estimation via gretl.

The pooled OLS specification may be written as

    y_it = X_it β + u_it        (23.1)

where y_it is the observation on the dependent variable for cross-sectional unit i in period t, X_it is a 1 × k vector of independent variables observed for unit i in period t, β is a k × 1 vector of parameters, and u_it is an error or disturbance term specific to unit i in period t.

The fixed and random effects models have in common that they decompose the unitary pooled error term, u_it. For the fixed effects model we write u_it = α_i + ε_it, yielding

    y_it = X_it β + α_i + ε_it        (23.2)

That is, we decompose u_it into a unit-specific and time-invariant component, α_i, and an observation-specific error, ε_it.¹ The α_i's are then treated as fixed parameters (in effect, unit-specific y-intercepts), which are to be estimated. This can be done by including a dummy variable for each cross-sectional unit (and suppressing the global constant). This is sometimes called the Least Squares Dummy Variables (LSDV) method. Alternatively, one can subtract the group mean from each of the variables and estimate a model without a constant. In the latter case the dependent variable may be written as

    ỹ_it = y_it − ȳ_i

The "group mean" ȳ_i is defined as

    ȳ_i = (1/T_i) Σ_{t=1}^{T_i} y_it

where T_i is the number of observations for unit i. An exactly analogous formulation applies to the independent variables. Given parameter estimates, β̂, obtained using such demeaned data we can recover estimates of the α_i's using

    α̂_i = (1/T_i) Σ_{t=1}^{T_i} ( y_it − X_it β̂ )

¹ It is possible to break a third component out of u_it, namely w_t, a shock that is time-specific but common to all the units in a given period. In the interest of simplicity we do not pursue that option here.

These two methods (LSDV, and using demeaned data) are numerically equivalent. Gretl takes the approach of demeaning the data. If you have a small number of cross-sectional units, a large number of time-series observations per unit, and a large number of regressors, it is more economical in terms of computer memory to use LSDV. If need be you can easily implement this manually, for example

    genr unitdum
    ols y x du_*

(See chapter 10 for details on unitdum.)

The α̂_i estimates are not printed as part of the standard model output in gretl (there may be a large number of these, and typically they are not of much inherent interest). However, you can retrieve them after estimation of the fixed effects model if you wish. In the graphical interface, go to the Save menu in the model window and select "per-unit constants". In command-line mode, you can do

    series newname = $ahat

where newname is the name you want to give the series.

For the random effects model we write u_it = v_i + ε_it, so the model becomes

    y_it = X_it β + v_i + ε_it        (23.3)

In contrast to the fixed effects model, the v_i's are not treated as fixed parameters, but as random drawings from a given probability distribution.

The celebrated Gauss–Markov theorem, according to which OLS is the best linear unbiased estimator (BLUE), depends on the assumption that the error term is independently and identically distributed (IID). In the panel context, the IID assumption means that E(u²_it), in relation to equation (23.1), equals a constant, σ²_u, for all i and t, while the covariance E(u_is u_it) equals zero for all s ≠ t and the covariance E(u_jt u_it) equals zero for all j ≠ i.

If these assumptions are not met, and they are unlikely to be met in the context of panel data, OLS is not the most efficient estimator. Greater efficiency may be gained using generalized least squares (GLS), taking into account the covariance structure of the error term.
Consider observations on a given unit i at two different times s and t. From the hypotheses above, it can be worked out that Var(u_is) = Var(u_it) = σ²_v + σ²_ε, while the covariance between u_is and u_it is given by E(u_is u_it) = σ²_v.

In matrix notation, we may group all the T_i observations for unit i into the vector y_i and write it as

    y_i = X_i β + u_i        (23.4)

The vector u_i, which includes all the disturbances for individual i, has a variance–covariance matrix given by

    Var(u_i) = Σ_i = σ²_ε I + σ²_v J        (23.5)

where J is a square matrix with all elements equal to 1. It can be shown that the matrix

    K_i = I − (θ_i / T_i) J

where

    θ_i = 1 − sqrt( σ²_ε / (σ²_ε + T_i σ²_v) )

has the property

    K_i Σ_i K′_i = σ²_ε I

It follows that the transformed system

    K_i y_i = K_i X_i β + K_i u_i        (23.6)

satisfies the Gauss–Markov conditions, and OLS estimation of (23.6) provides efficient inference. But since

    K_i y_i = y_i − θ_i ȳ_i

GLS estimation is equivalent to OLS using "quasi-demeaned" variables, that is, variables from which we subtract a fraction θ of their average. Notice that for σ²_ε → 0, θ → 1, while for σ²_v → 0, θ → 0. This means that if all the variance is attributable to the individual effects, then the fixed effects estimator is optimal; if, on the other hand, individual effects are negligible, then pooled OLS turns out, unsurprisingly, to be the optimal estimator.

To implement the GLS approach we need to calculate θ, which in turn requires estimates of the two variances σ²_ε and σ²_v. (These are often referred to as the "within" and "between" variances respectively, since the former refers to variation within each cross-sectional unit and the latter to variation between the units.) Several means of estimating these magnitudes have been suggested in the literature (see Baltagi, 1995). By default gretl uses the method of Swamy and Arora (1972): σ²_ε is estimated by the residual variance from the fixed effects model, and σ²_v is estimated indirectly with the help of the "between" regression, which uses the group means of all the relevant variables: that is,

    ȳ_i = X̄_i β + e_i

The residual variance from this regression, s²_e, can be shown to estimate the sum σ²_v + σ²_ε/T. An estimate of σ²_v can therefore be obtained by subtracting 1/T times the estimate of σ²_ε from s²_e:

    σ̂²_v = s²_e − σ̂²_ε / T        (23.7)

Alternatively, if the --nerlove option is given, gretl uses the method suggested by Nerlove (1971). In this case, σ²_v is estimated as the sample variance of the fixed effects, α̂_i:

    σ̂²_v = [1/(N − 1)] Σ_{i=1}^{N} ( α̂_i − ᾱ̂ )²

where N is the number of individuals and ᾱ̂ is the mean of the estimated fixed effects.

Swamy and Arora's equation (23.7) involves T, hence assuming a balanced panel. When the number of time series observations, T_i, differs across individuals, some sort of adjustment is needed. By default, gretl follows Stata by using the harmonic mean of the T_i's in place of T. It may be argued, however, that a more substantial adjustment is called for in the unbalanced case. Baltagi and Chang (1994) recommend a variant of Swamy–Arora which involves T_i-weighted estimation of the between regression, on the basis that units with more observations will be more informative about the variance of interest. In gretl one can switch to the Baltagi–Chang variant by giving the --unbalanced option with the panel command. But the gain in efficiency from doing so may well be slim; for a discussion of this point and related matters, see Cottrell (2017).

Unbalancedness also affects the Nerlove (1971) estimator, but the econometric literature offers no guidance on the details. Gretl uses the weighted average of the fixed effects as a natural extension of the original method. Again, see Cottrell (2017) for further details.
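In script terms, the estimators and variants just described are all reached via the panel command. A minimal sketch, using the abdata.gdt file supplied with gretl (the specification is illustrative only):

    open abdata.gdt
    panel n const w k ys                     # fixed effects (the default)
    panel n const w k ys --random-effects    # Swamy-Arora
    panel n const w k ys --random-effects --nerlove
    # Baltagi-Chang variant for unbalanced panels
    panel n const w k ys --random-effects --unbalanced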
Choice of estimator

Which panel method should one use, fixed effects or random effects?

One way of answering this question is in relation to the nature of the data set. If the panel comprises observations on a fixed and relatively small set of units of interest (say, the member states of the European Union), there is a presumption in favor of fixed effects. If it comprises observations on a large number of randomly selected individuals (as in many epidemiological and other longitudinal studies), there is a presumption in favor of random effects.

Besides this general heuristic, however, various statistical issues must be taken into account.

1. Some panel data sets contain variables whose values are specific to the cross-sectional unit but which do not vary over time. If you want to include such variables in the model, the fixed effects option is simply not available. When the fixed effects approach is implemented using dummy variables, the problem is that the time-invariant variables are perfectly collinear with the per-unit dummies. When using the approach of subtracting the group means, the issue is that after demeaning these variables are nothing but zeros.

2. A somewhat analogous issue arises with the random effects estimator. As mentioned above, the default Swamy–Arora method relies on the group-means regression to obtain a measure of the between variance. Suppose we have observations on n units or individuals and there are k independent variables of interest. If k > n, this regression cannot be run, since we have only n effective observations, and hence Swamy–Arora estimates cannot be obtained. In this case, however, it is possible to use Nerlove's method instead.

If both fixed effects and random effects are feasible for a given specification and dataset, the choice between these estimators may be expressed in terms of the two econometric desiderata, efficiency and consistency. From a purely statistical viewpoint, we could say that there is a tradeoff between robustness and efficiency. In the fixed effects approach, we do not make any hypotheses on the "group effects" (that is, the time-invariant differences in mean between the groups) beyond the fact that they exist; and that hypothesis can be tested, see below. As a consequence, once these effects are swept out by taking deviations from the group means, the remaining parameters can be estimated.

On the other hand, the random effects approach attempts to model the group effects as drawings from a probability distribution, instead of removing them. This requires that individual effects are representable as a legitimate part of the disturbance term, that is, zero-mean random variables, uncorrelated with the regressors.

As a consequence, the fixed-effects estimator "always works", but at the cost of not being able to estimate the effect of time-invariant regressors. The richer hypothesis set of the random-effects estimator ensures that parameters for time-invariant regressors can be estimated, and that estimation of the parameters for time-varying regressors is carried out more efficiently. These advantages, though, are tied to the validity of the additional hypotheses. If, for example, there is reason to think that individual effects may be correlated with some of the explanatory variables, then the random-effects estimator would be inconsistent, while fixed-effects estimates would still be valid. It is precisely on this principle that the Hausman test is built (see below): if the fixed- and random-effects estimates agree, to within the usual statistical margin of error, there is no reason to think the additional hypotheses invalid, and as a consequence, no reason not to use the more efficient RE estimator.
Testing panel models

If you estimate a fixed effects or random effects model in the graphical interface, you may notice that the number of items available under the Tests menu in the model window is relatively limited. Panel models carry certain complications that make it difficult to implement all of the tests one expects to see for models estimated on straight time-series or cross-sectional data.

Nonetheless, various panel-specific tests are printed along with the parameter estimates as a matter of course, as follows.

• When you estimate a model using fixed effects, you automatically get an F-test for the null hypothesis that the cross-sectional units all have a common intercept. That is to say that all the α_i's are equal, in which case the pooled model (23.1), with a column of 1s included in the X matrix, is adequate.

• When you estimate using random effects (RE), the Breusch–Pagan and Hausman tests are presented automatically. To save their results in the context of a script, one would copy the $model.bp_test or $model.hausman_test bundles, which are nested inside the $model bundle. Both of these inner bundles contain the elements test, dfn (degrees of freedom) and pvalue.

The Breusch–Pagan test is the counterpart to the F-test mentioned above. The null hypothesis is that the variance of v_i in equation (23.3) equals zero; if this hypothesis is not rejected, then again we conclude that the simple pooled model is adequate. If the panel is unbalanced, the method from Baltagi and Li (1990) is used to perform the Breusch–Pagan test for individual effects.

The Hausman test probes the consistency of the GLS estimates. The null hypothesis is that these estimates are consistent, that is, that the requirement of orthogonality of the v_i and the X_i is satisfied. The test is based on a measure, H, of the "distance" between the fixed-effects and random-effects estimates, constructed such that under the null it follows the χ² distribution with degrees of freedom equal to the number of time-varying regressors in the matrix X. If the value of H is "large", this suggests that the random effects estimator is not consistent and the fixed-effects model is preferable.

There are two ways of calculating H, the matrix-difference method and the regression method. The procedure for the matrix-difference method is this:

• Collect the fixed-effects estimates in a vector β̃ and the corresponding random-effects estimates in β̂, then form the difference vector (β̃ − β̂).

• Form the covariance matrix of the difference vector as Var(β̃ − β̂) = Var(β̃) − Var(β̂) = Ψ, where Var(β̃) and Var(β̂) are estimated by the sample variance matrices of the fixed- and random-effects models respectively.

• Compute H = (β̃ − β̂)′ Ψ⁻¹ (β̃ − β̂).

Given the relative efficiencies of β̃ and β̂, the matrix Ψ "should be" positive definite, in which case H is positive, but in finite samples this is not guaranteed, and of course a negative χ² value is not admissible.

The regression method avoids this potential problem. The procedure is to estimate, via OLS, an augmented regression in which the dependent variable is quasi-demeaned y and the regressors include both quasi-demeaned X (as in the RE specification) and the demeaned variants of all the time-varying variables (i.e. the fixed-effects regressors). The Hausman null then implies that the coefficients on the latter subset of regressors should be statistically indistinguishable from zero.

If the RE specification employs the default covariance-matrix estimator (assuming IID errors), H can be obtained as follows:

• Treat the random-effects model as the restricted model, and record its sum of squared residuals as SSR_r.

• Estimate the augmented ("unrestricted") regression and record its sum of squared residuals as SSR_u.

• Compute H = n (SSR_r − SSR_u) / SSR_u, where n is the total number of observations used.

Alternatively, if the --robust option is selected for RE estimation, H is calculated as a Wald test based on a robust estimate of the covariance matrix of the augmented regression. Either way, H cannot be negative.

By default, gretl computes the Hausman test via the regression method, but it uses the matrix-difference method if you pass the option --matrix-diff to the panel command.
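The following fragment sketches how one might save these test results in a script, assuming the inner-bundle names given above (bp_test and hausman_test) and using the abdata.gdt file for concreteness:

    open abdata.gdt
    panel n const w k ys --random-effects
    bundle bp = $model.bp_test         # Breusch-Pagan
    bundle hm = $model.hausman_test    # Hausman
    printf "Breusch-Pagan: %g [p = %.4f]\n", bp.test, bp.pvalue
    printf "Hausman: chi-square(%d) = %g [p = %.4f]\n", hm.dfn, hm.test, hm.pvalue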
Serial correlation

A simple test for first-order autocorrelation of the error term, namely the Durbin–Watson (DW) statistic, is printed as part of the output for pooled OLS as well as fixed-effects and random-effects estimation. Let us define "serial correlation proper" as serial correlation strictly in the time dimension of a panel dataset. When based on the residuals from fixed-effects estimation, the DW statistic is a test for serial correlation proper.⁴ The DW value shown in the case of random-effects estimation is based on the fixed-effects residuals. When DW is based on pooled OLS residuals, it tests for serial correlation proper only on the assumption of a common intercept. Put differently, in this case it tests a joint null hypothesis: absence of fixed effects plus absence of (first-order) serial correlation proper. In the presence of missing observations, the DW statistic is calculated as described in Baltagi and Wu (1999) (their expression for d₁ under equation (16) on page 819).

When it is computed, the DW statistic can be retrieved via the accessor $dw after estimation. In addition, an approximate P-value for the null of no serial correlation (ρ = 0) against the alternative of ρ > 0 may be available via the accessor $dwpval. This is based on the analysis in Bhargava et al. (1982); strictly speaking, it is the marginal significance level of DW considered as a d_L value (the value below which the test rejects, as opposed to d_U, the value above which the test fails to reject). In the panel case, however, d_L and d_U are quite close, particularly when N (the number of individual units) is large. At present, gretl does not attempt to compute such P-values when the number of observations differs across individuals.

⁴ The generalization of the Durbin–Watson statistic from the straight time-series context to panel data is due to Bhargava et al. (1982).
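A short sketch of retrieving these statistics after fixed-effects estimation, again using the abdata.gdt file for concreteness:

    open abdata.gdt
    panel n const w k ys    # fixed effects
    scalar d = $dw          # Durbin-Watson statistic
    # $dwpval gives the approximate P-value where computable
    # (balanced panels only)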
Robust standard errors

For most estimators, gretl offers the option of computing an estimate of the covariance matrix that is robust with respect to heteroskedasticity and/or autocorrelation (and hence also robust standard errors). In the case of panel data, robust covariance matrix estimators are available for the pooled, fixed effects and random effects models. See section 22.4 for details.

The constant in the fixed effects model

Users are sometimes puzzled by the constant or intercept reported by gretl on estimation of the fixed effects model: how can a constant remain when the group means have been subtracted from the data? The method of calculation of this term is a matter of convention, but the gretl authors decided to follow the convention employed by Stata; this involves adding the global mean back into the variables from which the group means have been removed.⁵ If you prefer to interpret the fixed effects model as "OLS plus unit dummies" throughout, it can be proven that this approach is equivalent to using centered unit dummies instead of plain 0/1 dummies.

⁵ See Gould (2013) for an extended explanation.

The method that gretl uses internally is exemplified in Listing 23.1. The coefficients in the second OLS estimation, including the intercept, agree with those in the initial fixed effects model, though the standard errors differ due to a degrees-of-freedom correction in the fixed-effects covariance matrix. (Note that the pmean function returns the group mean of a series.) The third estimator, which produces quite a lot of output, instead uses the stdize function to create the centered dummies. It thereby shows the equivalence of the internally-used method to "OLS plus centered dummies". (Note that in this case the standard errors agree with the initial estimates.)

Listing 23.1: Calculating the intercept in the fixed effects model

    open abdata.gdt
    list X = w k ys   # list of explanatory variables

    # built-in method
    panel n const X --fixed-effects

    # recentering "by hand"
    depvar = n - pmean(n) + mean(n)   # redefine the dependent variable
    list indepvars = const
    loop foreach i X
        # redefine the explanatory variables
        x_$i = $i - pmean($i) + mean($i)
        indepvars += x_$i
    endloop
    ols depvar indepvars   # perform estimation

    # using centered dummies
    list C = dummify(unit)   # create the unit dummies
    smpl n X --no-missing    # adjust sample to perform centering correctly
    list D = stdize(C, -1)   # center the unit dummies
    ols n const X D          # perform estimation

R-squared in the fixed effects model

There is no uniquely "correct" way of calculating R² in the context of the fixed-effects model. It may be argued that a measure of the squared correlation between the dependent variable and the prediction yielded by the model is a desirable descriptive statistic to have, but which model, and which variant of the dependent variable, are we talking about?

Fixed-effects models can be thought of in two equally defensible ways. From one perspective, they provide a nice, clean way of sweeping out individual effects, by using the fact that in the linear model a sufficient statistic is easy to compute. Alternatively, they provide a clever way to estimate the "important" parameters of a model in which you want to include (for whatever reason) a full set of individual dummies. If you take the second of these perspectives, your dependent variable is unmodified y, and your model includes the unit dummies; the appropriate R² measure is then the squared correlation between y and the ŷ computed using both the measured individual effects and the effects of the explicitly named regressors. This is reported by gretl as the "LSDV R-squared". If you take the first point of view, on the other hand, your dependent variable is really y_it − ȳ_i, and your model just includes the β terms, the coefficients of deviations of the x variables from their per-unit means. In this case, the relevant measure of R² is the so-called "within" R². This variant is printed by gretl for the fixed-effects model in place of the adjusted R² (it being unclear in this case what exactly the "adjustment" should amount to anyway).

Residuals in the fixed and random effect models

After estimation of most kinds of models in gretl, you can retrieve a series containing the residuals using the $uhat accessor. This is true of the fixed and random effects models, but the exact meaning of gretl's $uhat in these cases requires a little explanation.

Consider first the fixed effects model:

    y_it = X_it β + α_i + ε_it

In this model, gretl takes the "fitted value" ($yhat) to be α̂_i + X_it β̂, and the residual ($uhat) to be y_it minus this fitted value. This makes sense because the fixed effects (the α_i terms) are taken as parameters to be estimated. However, it can be argued that the fixed effects are not really "explanatory", and if one defines the residual as the observed y_it value minus its explained component one might prefer to see just y_it − X_it β̂.
You can get this after fixed-effects estimation as follows:

    series ue_fe = $uhat + $ahat - $coeff[1]

where $ahat gives the unit-specific intercept (as it would be calculated if one included all N unit dummies and omitted a common y-intercept), and $coeff[1] gives the "global" y-intercept.

Now consider the random-effects model:

    y_it = X_it β + v_i + ε_it

In this case, gretl considers the error term to be v_i + ε_it (since v_i is conceived as a random drawing), and the $uhat series is an estimate of this, namely

    y_it − X_it β̂

What if you want an estimate of just v_i (or just ε_it) in this case? This poses a signal-extraction problem: given the composite residual, how to recover an estimate of its components? The solution is to ascribe to the individual effect, v_i, a suitable fraction of the mean residual per individual, ū_i = (1/T_i) Σ_{t=1}^{T_i} û_it. The "suitable fraction" is the proportion of the variance of ū_i that is due to v_i, namely

    σ²_v / ( σ²_v + σ²_ε / T_i ) = 1 − (1 − θ_i)²

After random effects estimation in gretl, you can access a series containing the v̂_i's under the name $ahat. This series can be calculated by hand as follows:

    # case 1: balanced panel
    scalar theta = $["theta"]
    series vhat = (1 - (1 - theta)^2) * pmean($uhat)

    # case 2: unbalanced; Ti varies by individual
    scalar s2v = $["s2v"]
    scalar s2e = $["s2e"]
    series frac = s2v / (s2v + s2e / pnobs($uhat))
    series ahat = frac * pmean($uhat)

If an estimate of ε_it is wanted, it can then be obtained by subtraction from $uhat.

23.2 Autoregressive panel models

Special problems arise when a lag of the dependent variable is included among the regressors in a panel model. Consider a dynamic variant of the pooled model (eq. 23.1):

    y_it = X_it β + ρ y_{i,t−1} + u_it        (23.9)

First, if the error u_it includes a group effect, v_i, then y_{i,t−1} is bound to be correlated with the error, since the value of v_i affects y_i at all t. That means that OLS applied to (23.9) will be inconsistent, as well as inefficient. The fixed-effects model sweeps out the group effects and so overcomes this particular problem, but a subtler issue remains, which applies to both fixed and random effects estimation. Consider the demeaned representation of fixed effects, as applied to the dynamic model:

    ỹ_it = X̃_it β + ρ ỹ_{i,t−1} + ε̃_it

where ỹ_it = y_it − ȳ_i and ε̃_it = u_it − ū_i (or u_it − α_i, using the notation of equation 23.2). The trouble is that ỹ_{i,t−1} will be correlated with ε̃_it via the group mean, ȳ_i: the disturbance ε_it influences y_it directly, which influences ȳ_i, which, by construction, affects the value of ỹ_it for all t. The same issue arises in relation to the quasi-demeaning used for random effects. Estimators which ignore this correlation will be consistent only as T → ∞ (in which case the marginal effect of ε_it on the group mean of y tends to vanish).

One strategy for handling this problem, and producing consistent estimates of β and ρ, was proposed by Anderson and Hsiao (1981). Instead of demeaning the data, they suggest taking the first difference of (23.9), an alternative tactic for sweeping out the group effects:

    Δy_it = ΔX_it β + ρ Δy_{i,t−1} + η_it        (23.10)

where η_it = Δu_it = Δ(v_i + ε_it) = ε_it − ε_{i,t−1}. We're not in the clear yet: given the structure of the error η_it, the disturbance ε_{i,t−1} is an influence on both η_it and Δy_{i,t−1} = y_{i,t−1} − y_{i,t−2}. The next step is then to find an instrument for the "contaminated" Δy_{i,t−1}. Anderson and Hsiao suggest using either y_{i,t−2} or Δy_{i,t−2}, both of which will be uncorrelated with η_it, provided that the underlying errors, ε_it, are not themselves serially correlated.

The Anderson–Hsiao estimator is not provided as a built-in function in gretl, since gretl's sensible handling of lags and differences for panel data makes it a simple application of regression with instrumental variables; see Listing 23.2, which is based on a study of country growth rates by Nerlove (1999).⁷

⁷ Also see Clint Cummins' benchmarks page, http://www.stanford.edu/~clint/bench
Although the Anderson–Hsiao estimator is consistent, it is not most efficient: it does not make the fullest use of the available instruments for Δy_{i,t−1}, nor does it take into account the differenced structure of the error η_it. It is improved upon by the methods of Arellano and Bond (1991) and Blundell and Bond (1998). These methods are taken up in the next chapter.

Listing 23.2: The Anderson–Hsiao estimator for a dynamic panel model

    # Penn World Table data as used by Nerlove
    open penngrow.gdt
    # Fixed effects (for comparison)
    panel Y 0 Y(-1) X
    # Random effects (for comparison)
    panel Y 0 Y(-1) X --random-effects
    # take differences of all variables
    diff Y X
    # Anderson-Hsiao, using Y(-2) as instrument
    tsls d_Y d_Y(-1) d_X ; 0 d_X Y(-2)
    # Anderson-Hsiao, using d_Y(-2) as instrument
    tsls d_Y d_Y(-1) d_X ; 0 d_X d_Y(-2)

Chapter 24 Dynamic panel models

The command for estimating dynamic panel models in gretl is dpanel. This command supports both the "difference" estimator (Arellano and Bond, 1991) and the "system" estimator (Blundell and Bond, 1998), which has become the method of choice in the applied literature.

24.1 Introduction

Notation

A dynamic linear panel data model can be represented as follows (in notation based on Arellano (2003)):

    y_it = α y_{i,t−1} + β′x_it + η_i + v_it        (24.1)

where i = 1, 2, …, N indexes the cross-section units and t indexes time.

The main idea behind the difference estimator is to sweep out the individual effect via differencing. First-differencing eq. (24.1) yields

    Δy_it = α Δy_{i,t−1} + β′Δx_it + Δv_it = γ′W_it + Δv_it        (24.2)

in obvious notation. The error term of (24.2) is, by construction, autocorrelated and also correlated with the lagged dependent variable, so an estimator that takes both issues into account is needed. The endogeneity issue is solved by noting that all values of y_{i,t−k}, with k > 1, can be used as instruments for Δy_{i,t−1}: unobserved values of y_{i,t−k} (whether missing or pre-sample) can safely be substituted with 0. In the language of GMM, this amounts to using the relation

    E( Δv_it · y_{i,t−k} ) = 0,   k > 1        (24.3)

as an orthogonality condition.

Autocorrelation is dealt with by noting that if v_it is white noise, the covariance matrix of the vector whose typical element is Δv_it is proportional to a matrix H that has 2 on the main diagonal, −1 on the first subdiagonals and 0 elsewhere. One-step GMM estimation of equation (24.2) amounts to computing

    γ̂ = [ (Σ_i W′_i Z_i) A_N (Σ_i Z′_i W_i) ]⁻¹ (Σ_i W′_i Z_i) A_N (Σ_i Z′_i Δy_i)        (24.4)

where

    Δy_i = ( Δy_{i3}, …, Δy_{iT} )′

W_i is the matrix of regressors for unit i, with typical row (Δy_{i,t−1}, Δx′_it) for t = 3, …, T, and Z_i is the block-diagonal instrument matrix

    Z_i = [ y_{i1}   0       0      ...  0                     Δx′_{i3} ]
          [ 0       y_{i1}   y_{i2} ...  0                     Δx′_{i4} ]
          [ ...                                                ...      ]
          [ 0       0       0      ...  y_{i1} ... y_{i,T−2}  Δx′_{iT} ]

and

    A_N = ( Σ_i Z′_i H Z_i )⁻¹

Once the 1-step estimator is computed, the sample covariance matrix of the estimated residuals can be used instead of H to obtain 2-step estimates, which are not only consistent but asymptotically efficient. (In principle the process may be iterated, but nobody seems to be interested.) Standard GMM theory applies, except for one point: Windmeijer (2005) has computed finite-sample corrections to the asymptotic covariance matrix of the parameters, which are nowadays almost universally used.

The difference estimator is consistent, but has been shown to have poor properties in finite samples when α is near one. People these days prefer the so-called "system" estimator, which complements the differenced data (with lagged levels used as instruments) with data in levels (using lagged differences as instruments). The system estimator relies on an extra orthogonality condition, which has to do with the earliest value of the dependent variable, y_{i1}.
The interested reader is referred to Blundell and Bond (1998, pp. 124–125) for details; here it suffices to say that this condition is satisfied in mean-stationary models and brings an improvement in efficiency that may be substantial in many cases.

The set of orthogonality conditions exploited in the system approach is not very much larger than with the difference estimator, since most of the possible orthogonality conditions associated with the equations in levels are redundant, given those already used for the equations in differences.

The key equations of the system estimator can be written as

    γ̃ = [ (Σ_i W̃′_i Z̃_i) A_N (Σ_i Z̃′_i W̃_i) ]⁻¹ (Σ_i W̃′_i Z̃_i) A_N (Σ_i Z̃′_i Δỹ_i)        (24.5)

where Δỹ_i now stacks the differenced and level observations of the dependent variable,

    Δỹ_i = ( Δy_{i3}, …, Δy_{iT}, y_{i3}, …, y_{iT} )′

W̃_i stacks the corresponding regressor rows, (Δy_{i,t−1}, Δx′_it) for the equations in differences and (y_{i,t−1}, x′_it) for the equations in levels, and Z̃_i augments the block-diagonal instrument matrix of (24.4) with a further block for the equations in levels, in which the lagged differences Δy_{i2}, …, Δy_{i,T−1} appear one per row (together with the levels x_{i3}, …, x_{iT}):

    Z̃_i = [ Z_i^d   0                                   ]
          [ 0       diag( Δy_{i2}, …, Δy_{i,T−1} ), x′_it ]

with Z_i^d the difference-equations block shown above, and

    A_N = ( Σ_i Z̃′_i H Z̃_i )⁻¹

In this case, choosing a precise form for the matrix H for the first step is no trivial matter. Its north-west block should be as similar as possible to the covariance matrix of the vector Δv_it, so the same choice as in the difference estimator is appropriate. Ideally, the south-east block should be proportional to the covariance matrix of the vector ιη_i + v_i, that is, σ²_v I + σ²_η ιι′; but since σ²_η is unknown and any positive definite matrix renders the estimator consistent, people just use I. The off-diagonal blocks should, in principle, contain the covariances between Δv_is and v_it, which would be an identity matrix if v_it is white noise. However, since the south-east block is typically given a conventional value anyway, the benefit in making this choice is not obvious. Some packages use I; others use a zero matrix. Asymptotically, it should not matter, but on real datasets the difference between the resulting estimates can be noticeable.

Rank deficiency

Both the difference estimator (24.4) and the system estimator (24.5) depend, for their existence, on the invertibility of A_N. This matrix may turn out to be singular for several reasons. However, this does not mean that the estimator is not computable. In some cases, adjustments are possible such that the estimator does exist, but the user should be aware that in such cases not all software packages use the same strategy, and replication of results may prove difficult or even impossible.

A first reason why A_N may be singular is unavailability of instruments, chiefly because of missing observations. This case is easy to handle: if a particular row of Z_i is zero for all units, the corresponding orthogonality condition (or the corresponding instrument, if you prefer) is automatically dropped, and the overidentification rank is then adjusted for testing purposes.

Even if no instruments are zero, however, A_N could be rank deficient. A trivial case occurs if there are collinear instruments, but a less trivial case may arise when T (the total number of time periods available) is not much smaller than N (the number of units), as, for example, in some macro datasets where the units are countries. The total number of potentially usable orthogonality conditions is O(T²), which may well exceed N in some cases. Since A_N is the sum of N matrices which have, at most, rank 2T − 3, it could well happen that the sum is singular.

In all these cases, dpanel substitutes the pseudo-inverse of A_N (Moore–Penrose) for its regular inverse. Our choice is shared by some software packages, but not all, so replication may be hard.
Covariance matrix and standard errors

By default, the standard errors shown for 1-step estimation are robust, based on the heteroskedasticity-consistent variance estimator

    Var(γ̂) = M⁻¹ (Σ_i W′_i Z_i) A_N V̂_N A_N (Σ_i Z′_i W_i) M⁻¹

where M = (Σ_i W′_i Z_i) A_N (Σ_i Z′_i W_i) and V̂_N = N⁻¹ Σ_i Z′_i û_i û′_i Z_i, with û_i the vector of residuals in the regressions for individual i. In addition, as noted above, the variance estimator for 2-step estimation employs the finite-sample correction of Windmeijer (2005).

When the --asymptotic option is passed to dpanel, however, the 1-step variance estimator is simply σ̂²_u M⁻¹, which is not heteroskedasticity-consistent, and the Windmeijer correction is not applied for 2-step estimation. Use of this option is not recommended unless you wish to replicate prior results that did not report robust standard errors. In particular, tests based on the asymptotic 2-step variance estimator are known to over-reject quite substantially (standard errors too small).

Treatment of missing values

Textbooks seldom bother with missing values, but in some cases their treatment may be far from obvious. This is especially true if missing values are interspersed between valid observations. For example, consider the plain (difference) estimator with one lag, so

    y_t = α y_{t−1} + η + ε_t

where the i index is omitted for clarity. Suppose you have an individual with t = 1, …, 5, for which y_3 is missing. It may seem that the data for this individual are unusable, because differencing y_t would produce something like

    t      1   2   3   4   5
    y_t    *   *   o   *   *
    Δy_t   o   *   o   o   *

where * = nonmissing and o = missing. Estimation seems to be unfeasible, since there are no periods in which Δy_t and Δy_{t−1} are both observable. However, we can use a k-difference operator and get

    Δ_k y_t = α Δ_k y_{t−1} + Δ_k ε_t

where Δ_k = 1 − L^k and past levels of y_t are valid instruments. In this example, we can choose k = 3 and use y_1 as an instrument, so this unit is in fact usable.

Not all software packages seem to be aware of this possibility, so replicating published results may prove tricky if your dataset contains individuals with "gaps" between valid observations.

24.2 Usage

One feature of dpanel's syntax is that you get default values for several choices you may wish to make, so that in a "standard" situation the command is very concise. The simplest case of the model (24.1) is a plain AR(1) process:

    y_it = α y_{i,t−1} + η_i + v_it        (24.6)

If you give the command

    dpanel 1 ; y

gretl assumes that you want to estimate (24.6) via the difference estimator (24.4), using as many orthogonality conditions as possible. The scalar 1 between dpanel and the semicolon indicates that only one lag of y is included as an explanatory variable; using 2 would give an AR(2) model.

The syntax that gretl uses for the non-seasonal AR and MA lags in an ARMA model is also supported in this context. For example, if you want the first and third lags of y (but not the second) included as explanatory variables, you can say

    dpanel {1 3} ; y

or you can use a pre-defined matrix for this purpose:

    matrix ylags = {1, 3}
    dpanel ylags ; y

To use a single lag of y other than the first, you need to employ this mechanism:

    dpanel {3} ; y    # only lag 3 is included
    dpanel 3 ; y      # compare: lags 1, 2 and 3 are used

To use the system estimator instead, you add the --system option, as in

    dpanel 1 ; y --system

The level orthogonality conditions and the corresponding instrument are appended automatically (see eq. 24.5).
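Putting these defaults together, a typical session might look like the following sketch, again using the abdata.gdt file (the specifications are purely illustrative):

    open abdata.gdt
    dpanel 1 ; n                    # AR(1), difference estimator
    dpanel 1 ; n --system           # AR(1), system estimator
    dpanel 1 ; n w k --two-step     # extra regressors, 2-step GMM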
Regressors

If additional regressors are to be included, they should be listed after the dependent variable, in the same way as in other gretl estimation commands such as ols. For the difference orthogonality relations, dpanel takes care of transforming the regressors in parallel with the dependent variable. One case of potential ambiguity is when an intercept is specified but the difference-only estimator is selected, as in

    dpanel 1 ; y const

In this case, the default dpanel behavior, which agrees with David Roodman's xtabond2 for Stata (Roodman, 2009a), is to drop the constant (since differencing reduces it to nothing but zeros). However, for compatibility with the DPD package for Ox, you can give the option --dpdstyle, in which case the constant is retained (equivalent to including a linear trend in equation 24.1). A similar point applies to the period-specific dummy variables, which can be added in dpanel via the --time-dummies option: in the differences-only case these dummies are entered in differenced form by default, but when the --dpdstyle switch is applied they are entered in levels.

The standard gretl syntax applies if you want to use lagged explanatory variables, so for example the command

    dpanel 1 ; y const x(0 to -1) --system

would result in estimation of the model

    y_it = α y_{i,t−1} + β_0 + β_1 x_it + β_2 x_{i,t−1} + η_i + v_it

Instruments

The default rules for instruments are:

• lags of the dependent variable are instrumented using all available orthogonality conditions; and

• additional regressors are considered exogenous, so they are used as their own instruments.

If a different policy is wanted, the instruments should be specified in an additional list, separated from the regressors list by a semicolon. The syntax closely mirrors that of the tsls command, but in this context it is necessary to distinguish between "regular" instruments and what are often called "GMM-style" instruments (that is, instruments that are handled in the same block-diagonal manner as lags of the dependent variable, as described above).

"Regular" instruments are transformed in the same way as regressors, and the contemporaneous value of the transformed variable is used to form an orthogonality condition. Since regressors are treated as exogenous by default, it follows that these two commands estimate the same model:

    dpanel 1 ; y z
    dpanel 1 ; y z ; z

The instrument specification in the second case simply confirms what is implicit in the first: that z is exogenous. Note, though, that if you have some additional variable z2 which you want to add as a regular instrument, it then becomes necessary to include z in the instrument list if it is to be treated as exogenous:

    dpanel 1 ; y z ; z2      # z is now implicitly endogenous
    dpanel 1 ; y z ; z z2    # z is treated as exogenous

The specification of "GMM-style" instruments is handled by the special constructs GMM() and GMMlevel(). The first of these relates to instruments for the equations in differences, and the second to the equations in levels. The syntax for GMM() is

    GMM(name, minlag, maxlag[,collapse])

where name is replaced by the name of a series (or the name of a list of series), and minlag and maxlag are replaced by the minimum and maximum lags to be used as instruments. The same goes for GMMlevel().

One common use of GMM() is to limit the number of lagged levels of the dependent variable used as instruments for the equations in differences. It's well known that, although exploiting all possible orthogonality conditions yields maximal asymptotic efficiency, in finite samples it may be preferable to use a smaller subset; see Roodman (2009b) and Okui (2009). For example, the specification

    dpanel 1 ; y ; GMM(y, 2, 4)

ensures that no lags of y_t earlier than t − 4 will be used as instruments.

A second means of limiting the number of instruments is to "collapse" the sets of block-diagonal instruments shown following equations (24.4) and (24.5). Instead of having a distinct instrument per observation per lag, this is reduced to a distinct instrument per lag, as shown in Figure 24.1.
A further use of GMM() is to exploit more fully the potential orthogonality conditions afforded by an exogenous regressor (or a related variable that does not appear as a regressor). For example, in

    dpanel 1 ; y x ; GMM(z,2,6)

the variable x is considered an endogenous regressor, and up to 5 lags of z are used as instruments.

Note that in the following script fragment

    dpanel 1 ; y z
    dpanel 1 ; y z ; GMM(z,0,0)

the two estimation commands should not be expected to give the same result, as the sets of orthogonality relationships are subtly different. In the latter case, you have T−2 separate orthogonality relationships pertaining to z_it, none of which has any implication for the other ones; in the former case you only have one. In terms of the Z_i matrix, the first form adds a single row to the bottom of the instruments matrix, while the second form adds a diagonal block with T−2 columns; that is,

$$\begin{bmatrix} z_{i3} & z_{i4} & \cdots & z_{iT} \end{bmatrix} \quad \text{versus} \quad \begin{bmatrix} z_{i3} & 0 & \cdots & 0 \\ 0 & z_{i4} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & z_{iT} \end{bmatrix}$$
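To see the difference in practice, one can run the two variants on the Arellano-Bond data supplied with gretl and compare the instrument counts reported in the output (a sketch, with the real wage w playing the role of z):

    open abdata.gdt
    dpanel 1 ; n w ; w            # one orthogonality condition for w
    dpanel 1 ; n w ; GMM(w,0,0)   # a block-diagonal set: T-2 conditions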
24.3 Replication of DPD results

In this section we show how to replicate the results of some of the pioneering work with dynamic panel-data estimators by Arellano, Bond and Blundell. As the DPD manual (Doornik, Arellano and Bond, 2006) explains, it is difficult to replicate the original published results exactly, for two main reasons: not all of the data used in those studies are publicly available, and some of the choices made in the original software implementation of the estimators have been superseded. Here, therefore, our focus is on replicating the results obtained using the current DPD package and reported in the DPD manual.

The examples are based on the program files abest1.ox, abest3.ox and bbest1.ox. These are included in the DPD package, along with the Arellano-Bond database files abdata.bn7 and abdata.in7. The Arellano-Bond data are also provided with gretl, in the file abdata.gdt. In the following we do not show the output from DPD or gretl; it is somewhat voluminous and is easily generated by the user. As of this writing, the results from Ox/DPD and gretl are identical in all relevant respects for all of the examples shown.

A complete Ox/DPD program to generate the results of interest takes this general form:

    #include <oxstd.h>
    #import <packages/dpd/dpd>

    main()
    {
        decl dpd = new DPD();
        dpd.Load("abdata.in7");
        dpd.SetYear("YEAR");

        // model-specific code here

        delete dpd;
    }

In the examples below we take this template for granted and show just the model-specific code.

Example 1

The following Ox/DPD code, drawn from abest1.ox, replicates column (b) of Table 4 in Arellano and Bond (1991), an instance of the 'differences-only' or GMM-DIF estimator. The dependent variable is the log of employment, n; the regressors include two lags of the dependent variable, current and lagged values of the log real-product wage, w, the current value of the log of gross capital, k, and current and lagged values of the log of industry output, ys. In addition the specification includes a constant and five year dummies; unlike the stochastic regressors, these deterministic terms are not differenced. In this specification the regressors w, k and ys are treated as exogenous and serve as their own instruments. In DPD syntax this requires entering these variables twice, on the X_VAR and I_VAR lines. The GMM-type (block-diagonal) instruments in this example are the second and subsequent lags of the level of n. Both 1-step and 2-step estimates are computed.

    dpd.SetOptions(FALSE); // don't use robust standard errors
    dpd.Select(Y_VAR, {"n", 0, 2});
    dpd.Select(X_VAR, {"w", 0, 1, "k", 0, 0, "ys", 0, 1});
    dpd.Select(I_VAR, {"w", 0, 1, "k", 0, 0, "ys", 0, 1});

    dpd.Gmm("n", 2, 99);
    dpd.SetDummies(D_CONSTANT + D_TIME);

    print("Arellano & Bond (1991), Table 4 (b)");
    dpd.SetMethod(M_1STEP);
    dpd.Estimate();
    dpd.SetMethod(M_2STEP);
    dpd.Estimate();

Here is gretl code to do the same job:

    open abdata.gdt
    list X = w w(-1) k ys ys(-1)
    dpanel 2 ; n X const --time-dummies --asy --dpd-style
    dpanel 2 ; n X const --time-dummies --asy --two-step --dpd-style

Note that in gretl the switch to suppress robust standard errors is --asymptotic, here abbreviated to --asy.[3] The --dpd-style flag specifies that the constant (and dummies) should not be differenced, in the context of a GMM-DIF model. With gretl's dpanel command it is not necessary to specify the exogenous regressors as their own instruments, since this is the default; similarly, the use of the second and all longer lags of the dependent variable as GMM-type instruments is the default and need not be stated explicitly.

[3] Option flags in gretl can always be truncated, down to the minimal unique abbreviation.

Example 2

The DPD file abest3.ox contains a variant of the above that differs with regard to the choice of instruments: the variables w and k are now treated as predetermined, and are instrumented GMM-style using the second and third lags of their levels. This approximates column (c) of Table 4 in Arellano and Bond (1991). We have modified the code in abest3.ox slightly to allow the use of robust (Windmeijer-corrected) standard errors, which are the default in both DPD and gretl with 2-step estimation:

    dpd.Select(Y_VAR, {"n", 0, 2});
    dpd.Select(X_VAR, {"w", 0, 1, "k", 0, 0, "ys", 0, 1});
    dpd.Select(I_VAR, {"ys", 0, 1});
    dpd.SetDummies(D_CONSTANT + D_TIME);

    dpd.Gmm("n", 2, 99);
    dpd.Gmm("w", 2, 3);
    dpd.Gmm("k", 2, 3);

    print("Arellano & Bond (1991), Table 4 (c)");
    print("(but using different instruments)");
    dpd.SetMethod(M_2STEP);
    dpd.Estimate();

The gretl code is as follows:

    open abdata.gdt
    list X = w w(-1) k ys ys(-1)
    list Ivars = ys ys(-1)
    dpanel 2 ; n X const ; GMM(w,2,3) GMM(k,2,3) Ivars --time --two-step --dpd

Note that since we are now calling for an instrument set other than the default (following the second semicolon), it is necessary to include the Ivars specification for the variable ys. However, it is not necessary to specify GMM(n,2,99) since this remains the default treatment of the dependent variable.

Example 3

Our third example replicates the DPD output from bbest1.ox: this uses the same dataset as the previous examples, but the model specifications are based on Blundell and Bond (1998), and involve comparison of the GMM-DIF and GMM-SYS ('system') estimators. The basic specification is slightly simplified in that the variable ys is not used and only one lag of the dependent variable appears as a regressor. The Ox/DPD code is:

    dpd.Select(Y_VAR, {"n", 0, 1});
    dpd.Select(X_VAR, {"w", 0, 1, "k", 0, 1});
    dpd.SetDummies(D_CONSTANT + D_TIME);

    print("Blundell & Bond (1998), Table 4: 1976-86 GMM-DIF");
    dpd.Gmm("n", 2, 99);
    dpd.Gmm("w", 2, 99);
    dpd.Gmm("k", 2, 99);
    dpd.SetMethod(M_2STEP);
    dpd.Estimate();

    print("Blundell & Bond (1998), Table 4: 1976-86 GMM-SYS");
    dpd.GmmLevel("n", 1, 1);
    dpd.GmmLevel("w", 1, 1);
    dpd.GmmLevel("k", 1, 1);
    dpd.SetMethod(M_2STEP);
    dpd.Estimate();

Here is the corresponding gretl code:

    open abdata.gdt
    list X = w w(-1) k k(-1)
    list Z = w k

    # Blundell & Bond (1998), Table 4: 1976-86 GMM-DIF
    dpanel 1 ; n X const ; GMM(Z,2,99) --time --two-step --dpd

    # Blundell & Bond (1998), Table 4: 1976-86 GMM-SYS
    dpanel 1 ; n X const ; GMM(Z,2,99) GMMlevel(Z,1,1) --time --two-step --dpd --system
Note the use of the --system option flag to specify GMM-SYS, which includes the default treatment of the dependent variable, corresponding to GMMlevel(n,1,1). In this case we also want to use lagged differences of the regressors w and k as instruments for the levels equations, so we need explicit GMMlevel entries for those variables. If you want something other than the default treatment for the dependent variable as an instrument for the levels equations, you should give an explicit GMMlevel specification for that variable, and in that case the --system flag is redundant (but harmless).

For the sake of completeness, note that if you specify at least one GMMlevel term, dpanel will then include equations in levels, but it will not automatically add a default GMMlevel specification for the dependent variable unless the --system option is given.

24.4 Cross-country growth example

The previous examples all used the Arellano-Bond dataset; for this example we use the dataset CEL.gdt, which is also included in the gretl distribution. As with the Arellano-Bond data, there are numerous missing values. Details of the provenance of the data can be found by opening the dataset information window in the gretl GUI (Data menu, Dataset info item). This is a subset of the Barro-Lee 138-country panel dataset, an approximation to which is used in Caselli, Esquivel and Lefort (1996) and Bond, Hoeffler and Temple (2001).[4] Both of these papers explore the dynamic panel-data approach in relation to the issues of growth and convergence of per capita income across countries.

The dependent variable is growth in real GDP per capita over successive five-year periods; the regressors are the log of the initial (five years prior) value of GDP per capita, the log-ratio of investment to GDP, s, in the prior five years, and the log of annual average population growth, n, over the prior five years, plus 0.05 as a stand-in for the rate of technical progress, g, plus the rate of depreciation, δ (with the last two terms assumed to be constant across both countries and periods). The original model is

$$\Delta_5 y_{it} = \beta y_{i,t-5} + \alpha s_{it} + \gamma (n_{it} + g + \delta) + \nu_t + \eta_i + \epsilon_{it} \qquad (24.7)$$

which allows for a time-specific disturbance ν_t. The Solow model with Cobb-Douglas production function implies that γ = −α, but this assumption is not imposed in estimation. The time-specific disturbance is eliminated by subtracting the period mean from each of the series.

Equation (24.7) can be transformed to an AR(1) dynamic panel-data model by adding y_{i,t−5} to both sides, which gives

$$y_{it} = (1 + \beta) y_{i,t-5} + \alpha s_{it} + \gamma (n_{it} + g + \delta) + \eta_i + \epsilon_{it} \qquad (24.8)$$

where all variables are now assumed to be time-demeaned.

In (rough) replication of Bond et al. (2001), we now proceed to estimate the following two models: (a) equation (24.8) via GMM-DIF, using as instruments the second and all longer lags of y_{it}, s_{it} and n_{it} + g + δ; and (b) equation (24.8) via GMM-SYS, using Δy_{i,t−1}, Δs_{i,t−1} and Δ(n_{i,t−1} + g + δ) as additional instruments in the levels equations. We report robust standard errors throughout. (As a purely notational matter, we now use 't − 1' to refer to values five years prior to t, as in Bond et al., 2001.)

The gretl script to do this job is shown in Listing 24.1. Note that the final transformed versions of the variables (logs, with time-means subtracted) are named ly (= y_{it}), linv (= s_{it}) and lngd (= n_{it} + g + δ).

For comparison, we estimated the same two models using Ox/DPD and xtabond2. In each case we constructed a comma-separated values dataset containing the data as transformed in the gretl script, using a missing-value code appropriate to the target program.

[4] We say 'an approximation' because we have not been able to replicate exactly the OLS results reported in the papers cited, though it seems from the description of the data in Caselli et al. (1996) that we ought to be able to do so. We note that Bond et al. (2001) used data provided by Professor Caselli, yet did not manage to reproduce the latter's results.
Listing 24.1: GDP growth example

    open CEL.gdt

    ngd = n + 0.05
    ly = log(y)
    linv = log(s)
    lngd = log(ngd)

    # take out time means
    loop i=1..8
        smpl time == i --restrict --replace
        ly -= mean(ly)
        linv -= mean(linv)
        lngd -= mean(lngd)
    endloop

    smpl --full
    list X = linv lngd

    # 1-step GMM-DIF
    dpanel 1 ; ly X ; GMM(X,2,99)

    # 2-step GMM-DIF
    dpanel 1 ; ly X ; GMM(X,2,99) --two-step

    # GMM-SYS
    dpanel 1 ; ly X ; GMM(X,2,99) GMMlevel(X,1,1) --two-step --sys

For reference, the commands used with Stata are reproduced below:

    #delimit ;
    insheet using CEL.csv;
    tsset unit time;

    xtabond2 ly L.ly linv lngd, gmm(L.ly, lag(1 99)) gmm(linv, lag(2 99))
      gmm(lngd, lag(2 99)) rob nolev;
    xtabond2 ly L.ly linv lngd, gmm(L.ly, lag(1 99)) gmm(linv, lag(2 99))
      gmm(lngd, lag(2 99)) rob nolev twostep;
    xtabond2 ly L.ly linv lngd, gmm(L.ly, lag(1 99)) gmm(linv, lag(2 99))
      gmm(lngd, lag(2 99)) rob nocons twostep;

For the GMM-DIF model, all three programs find 382 usable observations and 30 instruments, and yield identical parameter estimates and robust standard errors (up to the number of digits printed, or more); see Table 24.1.[5]

Table 24.1: GMM-DIF, Barro-Lee data

                     1-step                     2-step
              coeff        std. error    coeff        std. error
    ly(-1)     0.577564     0.1292        0.610056     0.1562
    linv       0.0565469    0.07082       0.100952     0.07772
    lngd      -0.143950     0.2753       -0.310041     0.2980

Results for GMM-SYS estimation are shown in Table 24.2. In this case we show two sets of gretl results: those labeled 'gretl(1)' were obtained using gretl's --dpd-style option, while those labeled 'gretl(2)' did not use that option, the intent being to reproduce the H matrices used by Ox/DPD and xtabond2, respectively.

Table 24.2: 2-step GMM-SYS, Barro-Lee data (standard errors in parentheses)

              gretl(1)          Ox/DPD            gretl(2)          xtabond2
    ly(-1)     0.9237 (0.0385)   0.9167 (0.0373)   0.9073 (0.0370)   0.9073 (0.0370)
    linv       0.1592 (0.0449)   0.1636 (0.0441)   0.1856 (0.0411)   0.1856 (0.0411)
    lngd      -0.2370 (0.1485)  -0.2178 (0.1433)  -0.2355 (0.1501)  -0.2355 (0.1501)

In this case all three programs use 479 observations; gretl and xtabond2 use 41 instruments and produce the same estimates (when using the same H matrix), while Ox/DPD nominally uses 66. It is noteworthy that with GMM-SYS plus 'messy' missing observations the results depend on the precise array of instruments used, which in turn depends on the details of the implementation of the estimator.

[5] The coefficient shown for ly(-1) in the tables is that reported directly by the software; for comparability with the original model (eq. 24.7) it is necessary to subtract 1, which produces the expected negative value, indicating conditional convergence in per capita income.
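As footnote 5 indicates, the coefficient on ly(-1) is reported as 1 + β. A quick way to recover the implied convergence coefficient after any of the dpanel calls in Listing 24.1 is sketched below (this assumes, as is the case here, that the first element of $coeff corresponds to ly(-1)):

    # after the 2-step GMM-DIF command in Listing 24.1
    scalar beta_hat = $coeff[1] - 1
    printf "implied beta = %g\n", beta_hat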
24.5 Auxiliary test statistics

We have concentrated above on parameter estimates and standard errors. Here we add some discussion of the additional test statistics that typically accompany both GMM-DIF and GMM-SYS estimation: tests of overidentification, for first- and second-order autocorrelation, and for the joint significance of regressors.

Overidentification

If a model estimated with the use of instrumental variables is just-identified, the condition of orthogonality of the residuals and the instruments can be satisfied exactly. But if the specification is overidentified (more instruments than endogenous regressors) this condition can only be approximated, and the degree to which orthogonality 'fails' serves as a test for the validity of the instruments (and/or the specification). Since dynamic panel models are almost always overidentified, such a test is of particular importance.

There are two such tests in the econometric literature, devised respectively by Sargan (1958) and Hansen (1982). They share a common principle: a suitably scaled measure of deviation from perfect orthogonality can be shown to be distributed as χ²(k), with k the degree of overidentification, under the null hypothesis of valid instruments and correct specification. Both test statistics can be written as

$$S = \left( \sum_{i=1}^N \hat{v}_i^{*\prime} Z_i \right) A_N \left( \sum_{i=1}^N Z_i' \hat{v}_i^{*} \right)$$

where the $\hat{v}_i^{*}$ are the residuals in first differences for unit i, and for that reason they are often rolled together (for example, as 'Hansen-Sargan' tests by Davidson and MacKinnon, 2004).

The Sargan vs Hansen difference is buried in A_N: Sargan's original test is the minimized orthogonality score divided by a scalar estimate of the error variance (which is presumed to be homoskedastic), while Hansen's is the minimized criterion from efficient GMM estimation, in which the scalar variance estimate is replaced by a heteroskedasticity- and autocorrelation-consistent (HAC) estimator of the variance matrix of the error term. These variants correspond to 1-step and 2-step estimates of the given specification.

Up till version 2021d, gretl followed Ox/DPD in presenting a single overidentification statistic under the name 'Sargan': in effect, a Sargan test proper for the 1-step estimator and a Hansen test for 2-step. Subsequently, however, gretl follows xtabond2 in distinguishing between the tests, and presenting both statistics under their original names, when 2-step estimation is selected and therefore the HAC variance estimator is available. This choice responds to an argument made by Roodman (2009b): the Sargan test is questionable owing to its assumption of homoskedasticity, but the Hansen test is seriously weakened by an excessive number of instruments (it may under-reject substantially), so there may be a benefit to taking both tests into consideration.

There are cases where the degrees of freedom for the overidentification test differ between DPD and gretl; this occurs when the A_N matrix is singular (see section 24.1). In concept the df equals the number of instruments minus the number of parameters estimated; for the first of these terms gretl uses the rank of A_N, while DPD appears to use the full dimension of this matrix.

Autocorrelation

Negative first-order autocorrelation of the residuals in differences is expected by construction of the dynamic panel estimator, so a significant value for the AR(1) test does not indicate a problem. If the AR(2) test rejects, however, this indicates violation of the maintained assumptions. Note that valid AR tests cannot be produced when the --asymptotic option is specified in conjunction with one-step GMM-SYS estimation; if you need the tests, either add the --two-step option or drop the --asymptotic flag (which is recommended in any case).

Wald tests on regressors

Wald tests on the regressors (and separately, on the time dummy variables, if included) are based on the estimated variance matrix of the parameter estimates, and are generally in agreement across software packages, provided the parameter variance is estimated in the same way. One small exception pertains to comparison between Ox/DPD and gretl when the difference estimator is used, a constant term is included, and the --dpd-style option is given with dpanel (so the constant is not automatically omitted). In this case DPD includes the constant in the 'time-dummies' Wald test, but gretl does not.

24.6 Post-estimation available statistics

After estimation, the $model accessor will return a bundle containing several items that may be of interest: most should be self-explanatory, but here's a partial list.

    Key                  Content
    --------------------------------------------------------------------
    AR1, AR2             1st and 2nd order autocorrelation test statistics
    sargan, sargan_df    Sargan test for overidentifying restrictions and
                         corresponding degrees of freedom
    hansen, hansen_df    Hansen test for overidentifying restrictions and
                         corresponding degrees of freedom
    wald, wald_df        Wald test for overall significance and
                         corresponding degrees of freedom
    GMMinst              The matrix Z of instruments (see equations 24.2
                         and 24.5)
    wgtmat               The matrix A of GMM weights (see equations 24.2
                         and 24.5)
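As a sketch of how these items can be retrieved (using the Arellano-Bond data once more, with 2-step estimation so that the Hansen keys are populated):

    open abdata.gdt
    dpanel 1 ; n w k const --two-step
    bundle bm = $model
    print bm           # display the bundle's members
    eval bm.hansen     # Hansen overidentification statistic
    eval bm.hansen_df  # and its degrees of freedom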
Note that hansen and hansen_df are not included when 1-step estimation is selected. Note also that GMMinst and wgtmat, which may be quite large matrices, are not saved in the model bundle by default: that requires use of the --keep-extra option with the dpanel command. Listing 24.2 illustrates use of these matrices to replicate, via hansl commands, the calculation of the GMM estimator.

Listing 24.2: Replication of built-in command via hansl commands

    set verbose off
    open abdata.gdt

    # compose list of regressors
    list X = w w(-1) k k(-1)
    list Z = w k

    dpanel 1 ; n X const ; GMM(Z,2,99) --two-step --dpd --keep-extra

    ### redo by hand ###

    # fetch Z and A from model
    matrix A = $model.wgtmat
    matrix mZt = $model.GMMinst # note: transposed

    # create data matrices
    series valid = ok($uhat)
    series ddep = diff(n)
    series dldep = ddep(-1)
    list dreg = diff(X)

    smpl valid --dummy

    matrix mreg = {dldep} ~ {dreg} ~ 1
    matrix mdep = {ddep}

    matrix uno = mZt * mreg
    matrix due = qform(uno', A)
    matrix tre = (uno'A) * mZt * mdep

    matrix coef = due \ tre
    print coef

24.7 Memo: dpanel options

    flag              effect
    ----------------------------------------------------------------------
    --asymptotic      Suppresses the use of robust standard errors
    --two-step        Calls for 2-step estimation (the default being 1-step)
    --system          Calls for GMM-SYS, with default treatment of the
                      dependent variable, as in GMMlevel(y,1,1)
    --collapse        Collapse block-diagonal sets of GMM instruments, as
                      per Roodman (2009a)
    --time-dummies    Includes period-specific dummy variables
    --dpd-style       Compute the H matrix as in DPD; also suppresses
                      differencing of automatic time dummies and omission
                      of intercept in the GMM-DIF case
    --verbose         Prints confirmation of the GMM-style instruments
                      used; and when --two-step is selected, prints the
                      1-step estimates first
    --vcv             Calls for printing of the covariance matrix
    --quiet           Suppresses the printing of results
    --keep-extra      Save additional matrices in model bundle (see above)

The --time-dummies option supports the qualifier 'noprint', as in --time-dummies=noprint. This means that although the dummies are included in the specification, their coefficients, standard errors and so on are not printed.
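For example (a minimal sketch using the Arellano-Bond data once more):

    open abdata.gdt
    # time dummies included in the model, but their coefficients not printed
    dpanel 1 ; n w k const --time-dummies=noprint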
Chapter 25: Nonlinear least squares

25.1 Introduction and examples

Gretl supports nonlinear least squares (NLS) using a variant of the Levenberg-Marquardt algorithm. The user must supply a specification of the regression function; prior to giving this specification the parameters to be estimated must be declared and given initial values. Optionally, the user may supply analytical derivatives of the regression function with respect to each of the parameters. If derivatives are not given, the user must instead give a list of the parameters to be estimated (separated by spaces or commas), preceded by the keyword params. The tolerance (criterion for terminating the iterative estimation procedure) can be adjusted using the set command.

The syntax for specifying the function to be estimated consists of the name of the dependent variable, followed by an expression to generate it. This is illustrated in the following two examples, with accompanying derivatives.

    # Consumption function from Greene
    nls C = alpha + beta * Y^gamma
        deriv alpha = 1
        deriv beta = Y^gamma
        deriv gamma = beta * Y^gamma * log(Y)
    end nls

    # Nonlinear function from Russell Davidson
    nls y = alpha + beta * x1 + (1/beta) * x2
        deriv alpha = 1
        deriv beta = x1 - x2/(beta*beta)
    end nls --vcv

Note the command words nls (which introduces the regression function), deriv (which introduces the specification of a derivative) and end nls (which terminates the specification and calls for estimation). If the --vcv flag is appended to the last line the covariance matrix of the parameter estimates is printed.

25.2 Initializing the parameters

The parameters of the regression function must be given initial values prior to the nls command. In the GUI program this may be done via the menu item Variable, Define new variable.

In some cases, where the nonlinear function is a generalization of (or a restricted form of) a linear model, it may be convenient to run an ols and initialize the parameters from the OLS coefficient estimates. In relation to the first example above, one might do:

    ols C 0 Y
    alpha = $coeff(0)
    beta = $coeff(Y)
    gamma = 1

And in relation to the second example one might do:

    ols y 0 x1 x2
    alpha = $coeff(0)
    beta = $coeff(x1)

25.3 NLS dialog window

It is probably most convenient to compose the commands for NLS estimation in the form of a gretl script, but you can also do so interactively, by selecting the item Nonlinear Least Squares under the Model, Nonlinear models menu. This opens a dialog box where you can type the function specification (possibly prefaced by statements to set the initial parameter values) and the derivatives, if available. An example of this is shown in Figure 25.1. Note that in this context you do not have to supply the nls and end nls tags.

[Figure 25.1: NLS dialog box]

25.4 Analytical and numerical derivatives

If you are able to figure out the derivatives of the regression function with respect to the parameters, it is advisable to supply those derivatives as shown in the examples above. If that is not possible, gretl will compute approximate numerical derivatives. However, the properties of the NLS algorithm may not be so good in this case (see section 25.8).

This is done by using the params statement, which should be followed by a list of identifiers containing the parameters to be estimated. In this case, the examples above would read as follows:

    # Greene
    nls C = alpha + beta * Y^gamma
        params alpha beta gamma
    end nls

    # Davidson
    nls y = alpha + beta * x1 + (1/beta) * x2
        params alpha beta
    end nls

If analytical derivatives are supplied, they are checked for consistency with the given nonlinear function. If the derivatives are clearly incorrect, estimation is aborted with an error message. If the derivatives are 'suspicious' a warning message is issued but estimation proceeds. This warning may sometimes be triggered by incorrect derivatives, but it may also be triggered by a high degree of collinearity among the derivatives. Note that you cannot mix analytical and numerical derivatives: you should supply expressions for all of the derivatives, or none.

25.5 Advanced use

The nls block can also contain more sophisticated constructs. First, it can handle intermediate expressions; this makes it possible to construct the conditional mean expression as a multi-step job, thus enhancing modularity and readability of the code. Second, more complex objects, such as lists and matrices, can be used for this purpose.

For example, suppose that we want to estimate a Probit Binary Response model via NLS. The specification is

$$y_i = \Phi(g(x_i)) + u_i, \qquad g(x_i) = b_0 + b_1 x_{1,i} + b_2 x_{2,i} = b'x_i \qquad (25.1)$$

(Note: this is not the recommended way to estimate a probit model: the u_i term is heteroskedastic by construction and ML estimation is much preferable here. Still, NLS is a consistent estimator of the parameter vector b, although its covariance matrix will have to be adjusted to compensate for heteroskedasticity; this is accomplished via the --robust switch.)
Listing 25.1: NLS estimation of a Probit model

    open greene25_1.gdt
    list X = const age income ownrent selfempl

    # initialisation
    ols cardhldr X --quiet
    matrix b = $coeff / $sigma

    # proceed with NLS estimation
    nls cardhldr = cnorm(ndx)
        series ndx = lincomb(X, b)
        params b
    end nls --robust

    # compare with ML probit
    probit cardhldr X --p-values

The example in Listing 25.1 can be enhanced by using analytical derivatives: since

$$\frac{\partial g(x_i)}{\partial b_j} = \varphi(b'x_i)\, x_{ij}$$

one could substitute the params line in the script with the two-liner

    series f = dnorm(ndx)
    deriv b = {f} .* {X}

and have nls use analytically-computed derivatives, which are quicker and usually more reliable.
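Putting the substitution in context, the estimation block of Listing 25.1 would then read as follows (a sketch, with b and X initialized as in the listing):

    nls cardhldr = cnorm(ndx)
        series ndx = lincomb(X, b)
        series f = dnorm(ndx)
        deriv b = {f} .* {X}
    end nls --robust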
25.6 Controlling termination

The NLS estimation procedure is an iterative process. Iteration is terminated when the criterion for convergence is met or when the maximum number of iterations is reached, whichever comes first. Let k denote the number of parameters being estimated. The maximum number of iterations is 100 × (k + 1) when analytical derivatives are given, and 200 × (k + 1) when numerical derivatives are used.

Let ϵ denote a small number. The iteration is deemed to have converged if at least one of the following conditions is satisfied:

- Both the actual and predicted relative reductions in the error sum of squares are at most ϵ.
- The relative error between two consecutive iterates is at most ϵ.

The default value of ϵ is the machine precision to the power 3/4,[1] but it can be adjusted using the set command with the parameter nls_toler. For example

    set nls_toler .0001

will relax the value of ϵ to 0.0001.

25.7 Details on the code

The underlying engine for NLS estimation is based on the minpack suite of functions, available from netlib.org. Specifically, the following minpack functions are called:

    lmder    Levenberg-Marquardt algorithm with analytical derivatives
    chkder   Check the supplied analytical derivatives
    lmdif    Levenberg-Marquardt algorithm with numerical derivatives
    fdjac2   Compute final approximate Jacobian when using numerical
             derivatives
    dpmpar   Determine the machine precision

On successful completion of the Levenberg-Marquardt iteration, a Gauss-Newton regression is used to calculate the covariance matrix for the parameter estimates. If the --robust flag is given, a robust variant is computed. The documentation for the set command explains the specific options available in this regard.

Since NLS results are asymptotic, there is room for debate over whether or not a correction for degrees of freedom should be applied when calculating the standard error of the regression (and the standard errors of the parameter estimates). For comparability with OLS, and in light of the reasoning given in Davidson and MacKinnon (1993), the estimates shown in gretl do use a degrees of freedom correction.

25.8 Numerical accuracy

Table 25.1 shows the results of running the gretl NLS procedure on the 27 Statistical Reference Datasets made available by the US National Institute of Standards and Technology (NIST) for testing nonlinear regression software.[2] For each dataset, two sets of starting values for the parameters are given in the test files, so the full test comprises 54 runs. Two full tests were performed, one using all analytical derivatives and one using all numerical approximations. In each case the default tolerance was used.[3]

Out of the 54 runs, gretl failed to produce a solution in 4 cases when using analytical derivatives, and in 5 cases when using numeric approximation. Of the four failures in analytical derivatives mode, two were due to non-convergence of the Levenberg-Marquardt algorithm after the maximum number of iterations (on MGH09 and Bennett5, both described by NIST as of 'Higher difficulty') and two were due to generation of range errors (out-of-bounds floating point values) when computing the Jacobian (on BoxBOD and MGH17, described as of 'Higher difficulty' and 'Average difficulty' respectively). The additional failure in numerical approximation mode was on MGH10 ('Higher difficulty', maximum number of iterations reached).

The table gives information on several aspects of the tests: the number of outright failures, the average number of iterations taken to produce a solution, and two sorts of measure of the accuracy of the estimates, for both the parameters and the standard errors of the parameters.

For each of the 54 runs in each mode, if the run produced a solution the parameter estimates obtained by gretl were compared with the NIST certified values. We define the 'minimum correct figures' for a given run as the number of significant figures to which the least accurate gretl estimate agreed with the certified value, for that run. The table shows both the average and the worst case value of this variable, across all the runs that produced a solution. The same information is shown for the estimated standard errors.[4]

The second measure of accuracy shown is the percentage of cases, taking into account all parameters from all successful runs, in which the gretl estimate agreed with the certified value to at least the 6 significant figures which are printed by default in the gretl regression output.

Table 25.1: Nonlinear regression: the NIST tests

                                                   Analytical    Numerical
                                                   derivatives   derivatives
    Failures in 54 tests                                4             5
    Average iterations                                 32           127
    Mean of min. correct figures (parameters)           8.120         6.980
    Worst of min. correct figures (parameters)          4             3
    Mean of min. correct figures (standard errors)      8.000         5.673
    Worst of min. correct figures (standard errors)     5             2
    Percent correct to at least 6 figures (parameters) 96.5          91.9
    Percent correct to at least 6 figures (std errors) 97.7          77.3

Using analytical derivatives, the worst case values for both parameters and standard errors were improved to 6 correct figures on the test machine when the tolerance was tightened to 1.0e-14. Using numerical derivatives, the same tightening of the tolerance raised the worst values to 5 correct figures for the parameters and 3 figures for standard errors, at a cost of one additional failure of convergence.

[1] On a 32-bit Intel Pentium machine a likely value for this parameter is 1.82 × 10⁻¹².
[2] For a discussion of gretl's accuracy in the estimation of linear models, see Appendix C.
[3] The data shown in the table were gathered from a pre-release build of gretl version 1.0.9, compiled with gcc 3.3, linked against glibc 2.3.2, and run under Linux on an i686 PC (IBM ThinkPad A21m).
[4] For the standard errors, I excluded one outlier from the statistics shown in the table, namely Lanczos1. This is an odd case, using generated data with an almost-exact fit: the standard errors are 9 or 10 orders of magnitude smaller than the coefficients. In this instance gretl could reproduce the certified standard errors to only 3 figures (analytical derivatives) and 2 figures (numerical derivatives).
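The tighter tolerance used in that experiment is set in the same way as shown in section 25.6:

    # demand a stricter convergence criterion before running nls
    set nls_toler 1.0e-14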
Note the overall superiority of analytical derivatives: on average, solutions to the test problems were obtained with substantially fewer iterations and the results were more accurate (most notably for the estimated standard errors). Note also that the six-digit results printed by gretl are not 100 percent reliable for difficult nonlinear problems (in particular when using numerical derivatives). Having registered this caveat, the percentage of cases where the results were good to six digits or better seems high enough to justify their printing in this form.

Chapter 26: Maximum likelihood estimation

26.1 Generic ML estimation with gretl

Maximum likelihood estimation is a cornerstone of modern inferential procedures. Gretl provides a way to implement this method for a wide range of estimation problems, by use of the mle command. We give here a few examples.

To give a foundation for the examples that follow, we start from a brief reminder on the basics of ML estimation. Given a sample of size T, it is possible to define the density function[1] for the whole sample, namely the joint distribution of all the observations f(Y; θ), where Y = {y₁, ..., y_T}. Its shape is determined by a k-vector of unknown parameters θ, which we assume is contained in a set Θ, and which can be used to evaluate the probability of observing a sample with any given characteristics.

After observing the data, the values Y are given, and this function can be evaluated for any legitimate value of θ. In this case, we prefer to call it the likelihood function; the need for another name stems from the fact that this function works as a density when we use the y_t's as arguments and θ as parameters, whereas in this context θ is taken as the function's argument, and the data Y only have the role of determining its shape.

In standard cases, this function has a unique maximum. The location of the maximum is unaffected if we consider the logarithm of the likelihood (or log-likelihood for short); this function will be denoted as

$$\ell(\theta) = \log f(Y; \theta)$$

The log-likelihood functions that gretl can handle are those where ℓ(θ) can be written as

$$\ell(\theta) = \sum_{t=1}^{T} \ell_t(\theta)$$

which is true in most cases of interest. The functions ℓ_t(θ) are called the log-likelihood contributions.

Moreover, the location of the maximum is obviously determined by the data Y. This means that the value

$$\hat{\theta}(Y) = \underset{\theta \in \Theta}{\mathrm{Argmax}}\ \ell(\theta) \qquad (26.1)$$

is some function of the observed data (a statistic), which has the property, under mild conditions, of being a consistent, asymptotically normal and asymptotically efficient estimator of θ.

Sometimes it is possible to write down the function θ̂(Y) explicitly; in general, it need not be so. In these circumstances, the maximum can be found by means of numerical techniques. These often rely on the fact that the log-likelihood is a smooth function of θ, and therefore on the maximum its partial derivatives should all be 0. The gradient vector, or score vector, is a function that enjoys many interesting statistical properties in its own right; it will be denoted here as g(θ). It is a k-vector with typical element

$$g_i(\theta) = \frac{\partial \ell(\theta)}{\partial \theta_i} = \sum_{t=1}^{T} \frac{\partial \ell_t(\theta)}{\partial \theta_i}$$

Gradient-based methods can be briefly illustrated as follows:

1. pick a point θ₀ ∈ Θ;
2. evaluate g(θ₀);
3. if g(θ₀) is 'small', stop. Otherwise, compute a direction vector d(g(θ₀));
4. evaluate θ₁ = θ₀ + d(g(θ₀));
5. substitute θ₀ with θ₁;
6. restart from 2.

Many algorithms of this kind exist; they basically differ from one another in the way they compute the direction vector d(g(θ₀)), to ensure that ℓ(θ₁) > ℓ(θ₀), so that we eventually end up on the maximum.

[1] We are supposing here that our data are a realization of continuous random variables. For discrete random variables, everything continues to apply by referring to the probability function instead of the density. In both cases, the distribution may be conditional on some exogenous variables.
The default method gretl uses to maximize the log-likelihood is a gradient-based algorithm known as the BFGS (Broyden, Fletcher, Goldfarb and Shanno) method. This technique is used in most econometric and statistical packages, as it is well-established and remarkably powerful. Clearly, in order to make this technique operational, it must be possible to compute the vector g(θ) for any value of θ. In some cases this vector can be written explicitly as a function of Y; if this is not possible or too difficult, the gradient may be evaluated numerically. The alternative Newton-Raphson algorithm is also available; this method is more effective under some circumstances, but is also more fragile; see section 26.10 and chapter 37 for details.[2]

The choice of the starting value, θ₀, is crucial in some contexts and inconsequential in others. In general, however, it is advisable to start the algorithm from 'sensible' values whenever possible. If a consistent estimator is available, this is usually a safe and efficient choice: this ensures that in large samples the starting point will likely be close to θ̂ and convergence can be achieved in few iterations.

The maximum number of iterations allowed for the BFGS procedure, and the relative tolerance for assessing convergence, can be adjusted using the set command: the relevant variables are bfgs_maxiter (default value 500) and bfgs_toler (default value: the machine precision to the power 3/4).

[2] Note that some of the statements made below (for example, regarding estimation of the covariance matrix) have to be modified when Newton's method is used.
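For example, one might allow more iterations and demand a tighter convergence criterion before an mle block (the particular values here are arbitrary):

    set bfgs_maxiter 1000
    set bfgs_toler 1.0e-12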
26.2 Syntax

ML estimation in gretl is supported by the mle command block. This consists of an initial line holding the keyword mle plus an equation for the log-likelihood; one or more statements within the block (details below); and a trailer line to close the block, end mle. Option flags may be appended to the trailer line. Listing 26.1 gives a simple but complete example, which serves to illustrate the equivalence of MLE and OLS in the context of the normal linear model.

Listing 26.1: OLS and MLE

    open data9-7
    list X = const INCOME PRICE
    ols QNC X
    matrix b = $coeff
    scalar s2 = $sigma^2
    scalar l2pi = log(2*$pi)
    scalar n = $nobs

    mle lt = -0.5*l2pi - 0.5*log(s2) - 1/(2*s2) * uhat^2
        series uhat = QNC - lincomb(X, b)
        s2 = sum(uhat^2)/n
        params b
    end mle

Initial line of block: if possible, the given expression should evaluate to a series or vector, a contribution to the log-likelihood per observation. Failing that, it must evaluate to a scalar, the total log-likelihood. The identifier on the left-hand side (lt in Listing 26.1) is up to the user. If the variable in question is defined prior to the mle block it can be referenced after ML estimation; otherwise it is treated as a temporary variable and is destroyed after estimation.

Lines within the block: these may take three forms.

1. Helper statements that calculate auxiliary quantities (in the example, uhat and s2). Such statements will be evaluated before the log-likelihood, and then re-evaluated on each iteration.
2. Keyword plus parameter, as in params b, which tells mle that the parameter to be adjusted so as to maximize the log-likelihood is the vector b. This sort of statement can also be used to specify analytical derivatives of the log-likelihood with respect to the parameters; see section 26.7 for discussion and examples.
3. Statements employing print or printf, to track the progress of calculation, which can be useful for debugging.

Final line: in the example above this merely terminates the block, but if one wanted standard errors calculated via a numerical approximation to the Hessian, for instance, one could substitute 'end mle --hessian'. For a full listing of applicable options, see the mle entry in the Gretl Command Reference.

26.3 Covariance matrix and standard errors

By default the covariance matrix of the parameter estimates is based on the Outer Product of the Gradient (OPG). That is,

$$\widehat{\mathrm{Var}}_{\mathrm{OPG}}(\hat\theta) = \left( G'(\hat\theta)\, G(\hat\theta) \right)^{-1} \qquad (26.2)$$

where G(θ̂) is the T × k matrix of contributions to the gradient. Other options are available. If the --hessian flag is given, the covariance matrix is computed from a numerical approximation to the Hessian at convergence. If the --robust option is given, the quasi-ML 'sandwich' estimator is used:

$$\widehat{\mathrm{Var}}_{\mathrm{QML}}(\hat\theta) = H(\hat\theta)^{-1}\, G'(\hat\theta)\, G(\hat\theta)\, H(\hat\theta)^{-1}$$

where H denotes the numerical approximation to the Hessian. A refinement here is that if the 'hac' parameter is appended to the --robust option, as in

    end mle --robust=hac

the sandwich estimator is augmented in the manner of Newey and West (1987), to allow for serial correlation in the gradient. (Note that this only makes sense for time-series data.) In that case the details of the HAC estimator can be controlled via the set command, as described in chapter 22.

Cluster-robust estimation is also available: in order to activate it, use the --cluster=clustvar option, where clustvar should be a discrete series. See section 22.5 for more details.

Note, however, that if the log-likelihood function supplied by the user just returns a scalar value, as opposed to a series or vector holding per-observation contributions, then the OPG method is not applicable, and so the covariance matrix must be estimated via a numerical approximation to the Hessian.

26.4 Gamma estimation

Suppose we have a sample of T independent and identically distributed observations from a Gamma distribution. The density function for each observation x_t is

$$f(x_t) = \frac{\alpha^p}{\Gamma(p)}\, x_t^{p-1} \exp(-\alpha x_t) \qquad (26.3)$$

The log-likelihood for the entire sample can be written as the logarithm of the joint density of all the observations. Since these are independent and identical, the joint density is the product of the individual densities, and hence its log is

$$\ell(\alpha, p) = \sum_{t=1}^{T} \log \left[ \frac{\alpha^p}{\Gamma(p)}\, x_t^{p-1} \exp(-\alpha x_t) \right] = \sum_{t=1}^{T} \ell_t \qquad (26.4)$$

where

$$\ell_t = p \log(\alpha x_t) - \gamma(p) - \log x_t - \alpha x_t$$

and γ(·) is the log of the gamma function. In order to estimate the parameters α and p via ML, we need to maximize (26.4) with respect to them. The corresponding gretl code snippet is

    scalar alpha = 1
    scalar p = 1

    mle logl = p*ln(alpha*x) - lngamma(p) - ln(x) - alpha*x
        params alpha p
    end mle

The first two statements

    alpha = 1
    p = 1

are necessary to ensure that the variables alpha and p exist before the computation of logl is attempted. Inside the mle block these variables (which could be either scalars, vectors or a combination of the two; see below for an example) are identified as the parameters that should be adjusted to maximize the likelihood, via the params keyword. Their values will be changed by the execution of the mle command; upon successful completion, they will be replaced by the ML estimates. The starting value is 1 for both; this is arbitrary and does not matter much in this example (more on this later).

The above code can be made more readable, and marginally more efficient, by defining a variable to hold α·x_t. This command can be embedded in the mle block as follows:

    mle logl = p*ln(ax) - lngamma(p) - ln(x) - ax
        series ax = alpha*x
        params alpha p
    end mle

The variable ax is not added to the params list, of course, since it is just an auxiliary variable to facilitate the calculations. You can insert as many such auxiliary lines as you require before the params line, with the restriction that they must contain either (a) commands to generate series, scalars or matrices or (b) print commands (which may be used to aid in debugging).
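To try the snippet out, one can simulate Gamma data with known parameters. Here is a self-contained sketch; the shape and rate values are arbitrary, and note that gretl's Gamma generator randgen(G, ...) takes shape and scale arguments, where the scale is 1/α in the notation above:

    nulldata 500
    set seed 101
    # simulate with shape p = 2 and rate alpha = 4 (scale = 0.25)
    series x = randgen(G, 2, 0.25)

    scalar alpha = 1
    scalar p = 1

    mle logl = p*ln(alpha*x) - lngamma(p) - ln(x) - alpha*x
        params alpha p
    end mle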
In a simple example like this, the choice of the starting values is almost inconsequential; the algorithm is likely to converge no matter what the starting values are. However, consistent method-of-moments estimators of p and α can be simply recovered from the sample mean m and variance V: since it can be shown that

$$E(x_t) = p/\alpha, \qquad V(x_t) = p/\alpha^2$$

it follows that the following estimators

$$\bar\alpha = m/V, \qquad \bar p = m \cdot \bar\alpha$$

are consistent, and therefore suitable to be used as a starting point for the algorithm. The gretl script code then becomes

    scalar m = mean(x)
    scalar alpha = m/var(x)
    scalar p = m*alpha

    mle logl = p*ln(ax) - lngamma(p) - ln(x) - ax
        series ax = alpha*x
        params alpha p
    end mle

Another thing to note is that sometimes parameters are constrained within certain boundaries: in this case, for example, both α and p must be positive numbers. Gretl does not check for this: it is the user's responsibility to ensure that the function is always evaluated at an admissible point in the parameter space during the iterative search for the maximum. An effective technique is to define a variable for checking that the parameters are admissible and setting the log-likelihood as undefined if the check fails. An example, which uses the conditional assignment operator, follows:

    scalar m = mean(x)
    scalar alpha = m/var(x)
    scalar p = m*alpha

    mle logl = check ? p*ln(ax) - lngamma(p) - ln(x) - ax : NA
        series ax = alpha*x
        scalar check = (alpha > 0) && (p > 0)
        params alpha p
    end mle

26.5 Stochastic frontier cost function

(Note: this section has the sole purpose of illustrating the mle command. For the estimation of stochastic frontier cost or production functions you may want to use the frontier function package.)

When modeling a cost function, it is sometimes worthwhile to incorporate explicitly into the statistical model the notion that firms may be inefficient, so that the observed cost deviates from the theoretical figure not only because of unobserved heterogeneity between firms, but also because two firms could be operating at a different efficiency level, despite being identical in all other respects. In this case we may write

$$C_i = C_i^* + u_i + v_i$$

where C_i is some variable cost indicator, C*_i is its 'theoretical' value, u_i is a zero-mean disturbance term and v_i is the inefficiency term, which is supposed to be non-negative by its very nature.

A linear specification for C*_i is often chosen. For example, the Cobb-Douglas cost function arises when C*_i is a linear function of the logarithms of the input prices and the output quantities.

The stochastic frontier model is a linear model of the form y_i = x_i'β + ε_i in which the error term ε_i is the sum of u_i and v_i. A common postulate is that u_i ~ N(0, σ_u²) and v_i ~ |N(0, σ_v²)|. If independence between u_i and v_i is also assumed, then it is possible to show that the density function of ε_i has the form

$$f(\varepsilon_i) = \sqrt{\frac{2}{\pi}}\, \Phi\!\left(\frac{\lambda \varepsilon_i}{\sigma}\right) \frac{1}{\sigma}\, \varphi\!\left(\frac{\varepsilon_i}{\sigma}\right) \qquad (26.5)$$

where Φ(·) and φ(·) are, respectively, the distribution and density function of the standard normal, σ = (σ_u² + σ_v²)^{1/2} and λ = σ_u/σ_v. As a consequence, the log-likelihood for one observation takes the form (apart from an irrelevant constant)

$$\ell_t = \log \Phi\!\left(\frac{\lambda \varepsilon_i}{\sigma}\right) - \left[ \log\sigma + \frac{\varepsilon_i^2}{2\sigma^2} \right]$$

Therefore, a Cobb-Douglas cost function with stochastic frontier is the model described by the following equations:

$$\log C_i = \log C_i^* + \varepsilon_i$$
$$\log C_i^* = c + \sum_{j=1}^{m} \beta_j \log y_{ij} + \sum_{j=1}^{n} \alpha_j \log p_{ij}$$
$$\varepsilon_i = u_i + v_i, \qquad u_i \sim N(0, \sigma_u^2), \qquad v_i \sim |N(0, \sigma_v^2)|$$

In most cases, one wants to ensure that the homogeneity of the cost function with respect to the prices holds by construction. Since this requirement is equivalent to $\sum_{j=1}^{n} \alpha_j = 1$, the above equation for C*_i can be rewritten as

$$\log C_i - \log p_{in} = c + \sum_{j=1}^{m} \beta_j \log y_{ij} + \sum_{j=2}^{n} \alpha_j (\log p_{ij} - \log p_{in}) + \varepsilon_i \qquad (26.6)$$
This equation could be estimated by OLS, but it would suffer from two drawbacks: first, the OLS estimator for the intercept c is inconsistent, because the disturbance term has a non-zero expected value; second, the OLS estimators for the other parameters are consistent, but inefficient in view of the non-normality of ε_i. Both issues can be addressed by estimating (26.6) by maximum likelihood. Nevertheless, OLS estimation is a quick and convenient way to provide starting values for the MLE algorithm.

Listing 26.2 shows how to implement the model described so far. The banks91 file contains part of the data used in Lucchetti, Papi and Zazzaro (2001).

Listing 26.2: Estimation of stochastic frontier cost function (with scalar parameters)

    open banks91.gdt

    # transformations
    series cost = ln(VC)
    series q1 = ln(Q1)
    series q2 = ln(Q2)
    series p1 = ln(P1)
    series p2 = ln(P2)
    series p3 = ln(P3)

    # Cobb-Douglas cost function with homogeneity restrictions
    # (for initialization)
    series rcost = cost - p1
    series rp2 = p2 - p1
    series rp3 = p3 - p1

    ols rcost const q1 q2 rp2 rp3

    # Cobb-Douglas cost function with homogeneity restrictions
    # and inefficiency

    scalar b0 = $coeff(const)
    scalar b1 = $coeff(q1)
    scalar b2 = $coeff(q2)
    scalar b3 = $coeff(rp2)
    scalar b4 = $coeff(rp3)
    scalar su = 0.1
    scalar sv = 0.1

    mle logl = ln(cnorm(e*lambda/ss)) - (ln(ss) + 0.5*(e/ss)^2)
        scalar ss = sqrt(su^2 + sv^2)
        scalar lambda = su/sv
        series e = rcost - b0*const - b1*q1 - b2*q2 - b3*rp2 - b4*rp3
        params b0 b1 b2 b3 b4 su sv
    end mle

The script in Listing 26.2 is relatively easy to modify to show how one can use vectors (that is, 1-dimensional matrices) for storing the parameters to optimize: Listing 26.3 holds essentially the same script, in which the parameters of the cost function are stored together in a vector. Of course, this also makes it possible to use variable lists and other refinements which make the code more compact and readable.

Listing 26.3: Estimation of stochastic frontier cost function (with matrix parameters)

    open banks91.gdt

    # transformations
    series cost = ln(VC)
    series q1 = ln(Q1)
    series q2 = ln(Q2)
    series p1 = ln(P1)
    series p2 = ln(P2)
    series p3 = ln(P3)

    # Cobb-Douglas cost function with homogeneity restrictions
    # (for initialization)
    series rcost = cost - p1
    series rp2 = p2 - p1
    series rp3 = p3 - p1

    list X = const q1 q2 rp2 rp3
    ols rcost X

    # Cobb-Douglas cost function with homogeneity restrictions
    # and inefficiency

    matrix b = $coeff
    scalar su = 0.1
    scalar sv = 0.1

    mle logl = ln(cnorm(e*lambda/ss)) - (ln(ss) + 0.5*(e/ss)^2)
        scalar ss = sqrt(su^2 + sv^2)
        scalar lambda = su/sv
        series e = rcost - lincomb(X, b)
        params b su sv
    end mle

26.6 GARCH models

GARCH models are handled by gretl via a native command. However, it is instructive to see how they can be estimated through the mle command.[3]

The following equations provide the simplest example of a GARCH(1,1) model:

$$y_t = \mu + \epsilon_t$$
$$\epsilon_t = u_t \cdot \sigma_t, \qquad u_t \sim N(0, 1)$$
$$h_t = \omega + \alpha \epsilon_{t-1}^2 + \beta h_{t-1}$$

Since the variance of y_t depends on past values, writing down the log-likelihood function is not simply a matter of summing the log densities for individual observations. As is common in time-series models, y_t cannot be considered independent of the other observations in our sample, and consequently the density function for the whole sample (the joint density for all observations) is not just the product of the marginal densities.

[3] The gig addon, which handles other variants of conditionally heteroskedastic models, uses mle as its internal engine.
Maximum likelihood estimation, in these cases, is achieved by considering conditional densities, so what we maximize is a conditional likelihood function. If we define the information set at time t as

$$F_t = \{ y_t, y_{t-1}, \ldots \}$$

then the density of y_t conditional on F_{t−1} is normal:

$$y_t \mid F_{t-1} \sim N(\mu, h_t)$$

By means of the properties of conditional distributions, the joint density can be factorized as follows:

$$f(y_t, y_{t-1}, \ldots) = \left[ \prod_{t=1}^{T} f(y_t \mid F_{t-1}) \right] \cdot f(y_0)$$

If we treat y₀ as fixed, then the term f(y₀) does not depend on the unknown parameters, and therefore the conditional log-likelihood can be written as the sum of the individual contributions, as

$$\ell(\mu, \omega, \alpha, \beta) = \sum_{t=1}^{T} \ell_t \qquad (26.7)$$

where

$$\ell_t = \log \left[ \frac{1}{\sqrt{h_t}}\, \varphi\!\left( \frac{y_t - \mu}{\sqrt{h_t}} \right) \right] = -\frac{1}{2} \left[ \log h_t + \frac{(y_t - \mu)^2}{h_t} \right]$$

The following script shows a simple application of this technique, which uses the data file djclose; it is one of the example datasets supplied with gretl and contains daily data from the Dow Jones stock index.

    open djclose

    series y = 100*ldiff(djclose)

    scalar mu = 0.0
    scalar omega = 1
    scalar alpha = 0.4
    scalar beta = 0.0

    mle ll = -0.5*(log(h) + (e^2)/h)
        series e = y - mu
        series h = var(y)
        series h = omega + alpha*(e(-1))^2 + beta*h(-1)
        params mu omega alpha beta
    end mle
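For comparison, a sketch of the corresponding estimate via gretl's native command (reusing the series y constructed above; the mle-based results should be close but need not match exactly, given differences in initialization and sample handling):

    # native GARCH(1,1) estimation on the same data
    garch 1 1 ; y const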
26.7 Analytical derivatives

Computation of the score vector is essential for the working of the BFGS method. In all the previous examples, no explicit formula for the computation of the score was given, so the algorithm was fed numerically evaluated gradients. Numerical computation of the score for the i-th parameter is performed via a finite approximation of the derivative, namely

$$\frac{\partial \ell(\theta_1, \ldots, \theta_n)}{\partial \theta_i} \simeq \frac{\ell(\theta_1, \ldots, \theta_i + h, \ldots, \theta_n) - \ell(\theta_1, \ldots, \theta_i - h, \ldots, \theta_n)}{2h}$$

where h is a small number.

In many situations this is rather efficient and accurate. A better approximation to the true derivative may be obtained by forcing mle to use a technique known as Richardson Extrapolation, which gives extremely precise results but is considerably more CPU-intensive. This feature may be turned on by using the set command, as in

    set bfgs_richardson on

However, one might want to avoid the approximation and specify an exact function for the derivatives. As an example, consider the following script:

    nulldata 1000

    series x1 = normal()
    series x2 = normal()
    series x3 = normal()

    series ystar = x1 + x2 + x3 + normal()
    series y = (ystar > 0)

    scalar b0 = 0
    scalar b1 = 0
    scalar b2 = 0
    scalar b3 = 0

    mle logl = y*ln(P) + (1-y)*ln(1-P)
        series ndx = b0 + b1*x1 + b2*x2 + b3*x3
        series P = cnorm(ndx)
        params b0 b1 b2 b3
    end mle --verbose

Here, 1000 data points are artificially generated for an ordinary probit model:[4] y_t is a binary variable, which takes the value 1 if y*_t = β₁x₁t + β₂x₂t + β₃x₃t + ε_t > 0 and 0 otherwise. Therefore, y_t = 1 with probability Φ(β₁x₁t + β₂x₂t + β₃x₃t) = π_t. The probability function for one observation can be written as

$$P(y_t) = \pi_t^{y_t} (1 - \pi_t)^{1 - y_t}$$

Since the observations are independent and identically distributed, the log-likelihood is simply the sum of the individual contributions. Hence

$$\ell = \sum_{t=1}^{T} y_t \log \pi_t + (1 - y_t) \log(1 - \pi_t)$$

The --verbose switch at the end of the end mle statement produces a detailed account of the iterations done by the BFGS algorithm.

In this case, numerical differentiation works rather well; nevertheless, computation of the analytical score is straightforward, since the derivative ∂ℓ/∂βᵢ can be written as

$$\frac{\partial \ell}{\partial \beta_i} = \frac{\partial \ell}{\partial \pi_t} \cdot \frac{\partial \pi_t}{\partial \beta_i}$$

via the chain rule, and it is easy to see that

$$\frac{\partial \ell}{\partial \pi_t} = \frac{y_t}{\pi_t} - \frac{1 - y_t}{1 - \pi_t}, \qquad \frac{\partial \pi_t}{\partial \beta_i} = \varphi(\beta_1 x_{1t} + \beta_2 x_{2t} + \beta_3 x_{3t}) \cdot x_{it}$$

The mle block in the above script can therefore be modified as follows:

    mle logl = y*ln(P) + (1-y)*ln(1-P)
        series ndx = b0 + b1*x1 + b2*x2 + b3*x3
        series P = cnorm(ndx)
        series m = dnorm(ndx)*(y/P - (1-y)/(1-P))
        deriv b0 = m
        deriv b1 = m*x1
        deriv b2 = m*x2
        deriv b3 = m*x3
    end mle --verbose

Note that the params statement has been replaced by a series of deriv statements; these have the double function of identifying the parameters over which to optimize and providing an analytical expression for their respective score elements.

[4] Again, gretl does provide a native probit command (see section 38.1), but a probit model makes for a nice example here.

26.8 Debugging ML scripts

We have discussed above the main sorts of statements that are permitted within an mle block, namely:

- auxiliary commands to generate helper variables;
- deriv statements to specify the gradient with respect to each of the parameters; and
- a params statement to identify the parameters in case analytical derivatives are not given.

For the purpose of debugging ML estimators one additional sort of statement is allowed: you can print the value of a relevant variable at each step of the iteration. This facility is more restricted than the regular print command. The command word print should be followed by the name of just one variable (a scalar, series or matrix).

In the last example above a key variable named m was generated, forming the basis for the analytical derivatives. To track the progress of this variable one could add a print statement within the ML block, as in

    series m = dnorm(ndx)*(y/P - (1-y)/(1-P))
    print m

26.9 Using functions

The mle command allows you to estimate models that gretl does not provide natively: in some cases, it may be a good idea to wrap up the mle block in a user-defined function (see Chapter 14), so as to extend gretl's capabilities in a modular and flexible way.

As an example, we will take a simple case of a model that gretl does not yet provide natively: the zero-inflated Poisson model, or ZIP for short.[5] In this model, we assume that we observe a mixed population: for some individuals, the variable y_t is (conditionally on a vector of exogenous covariates x_t) distributed as a Poisson random variate; for some others, y_t is identically 0. The trouble is, we don't know which category a given individual belongs to.

For instance, suppose we have a sample of women, and the variable y_t represents the number of children that woman t has. There may be a certain proportion, α, of women for whom y_t = 0 with certainty (maybe out of a personal choice, or due to physical impossibility). But there may be other women for whom y_t = 0 just as a matter of chance; they haven't happened to have any children at the time of observation. In formulae:

$$P(y_t = k \mid x_t) = \alpha d_t + (1 - \alpha) \frac{e^{-\mu_t} \mu_t^{y_t}}{y_t!}$$
$$\mu_t = \exp(x_t'\beta)$$
$$d_t = \begin{cases} 1 & \text{for } y_t = 0 \\ 0 & \text{for } y_t > 0 \end{cases}$$

Writing an mle block for this model is not difficult:

    mle ll = logprob
        series xb = exp(b0 + b1*x)
        series d = (y == 0)
        series poiprob = exp(-xb) * xb^y / gamma(y+1)
        series logprob = (alpha > 0) && (alpha < 1) ? log(alpha*d + (1-alpha)*poiprob) : NA
        params alpha b0 b1
    end mle --verbose

However, the code above has to be modified each time we change our specification by, say, adding an explanatory variable. Using functions, we can simplify this task considerably and eventually be able to write something easy like

    list X = const x
    zip(y, X)

Let's see how this can be done. First we need to define a function called zip that will take two arguments: a dependent variable y and a list of explanatory variables X. An example of such a function can be seen in Listing 26.4. By inspecting the function code, you can see that the actual estimation does not happen here: rather, the zip function merely uses the built-in modprint command to print out the results coming from another user-written function, namely zip_estimate.

The function zip_estimate is not meant to be executed directly; it just contains the number-crunching part of the job, whose results are then picked up by the user-level function zip. In turn, zip_estimate calls other user-written functions to perform other tasks. The whole set of 'internal' functions is shown in Listing 26.5.

[5] The actual ZIP model is in fact a bit more general than the one presented here. The specialized version discussed in this section was chosen for the sake of simplicity. For further details see Greene (2003).
All the functions shown in Listings 26.4 and 26.5 can be stored in a separate .inp file and executed once, at the beginning of our job, by means of the include command. Assuming the name of this script file is zip_est.inp, the following is an example script which (a) includes the script file, (b) generates a simulated dataset and (c) performs the estimation of a ZIP model on the artificial data.

Listing 26.4: Zero-inflated Poisson Model: user-level function

    /*
       user-level function: estimate the model and print out
       the results
    */
    function void zip(series y, list X)
        matrix coef_stde = zip_estimate(y, X)
        printf "\nZero-inflated Poisson model:\n"
        string parnames = "alpha,"
        string parnames += varname(X)
        modprint coef_stde parnames
    end function

Listing 26.5: Zero-inflated Poisson Model: internal functions

    /* compute log probabilities for the plain Poisson model */
    function series ln_poi_prob(series y, list X, matrix beta)
        series xb = lincomb(X, beta)
        return -exp(xb) + y*xb - lngamma(y+1)
    end function

    /* compute log probabilities for the zero-inflated Poisson model */
    function series ln_zip_prob(series y, list X, matrix beta, scalar p0)
        # check if the probability is in [0,1]; otherwise, return NA
        if p0 > 1 || p0 < 0
            series ret = NA
        else
            series ret = ln_poi_prob(y, X, beta) + ln(1-p0)
            series ret = (y == 0) ? ln(p0 + exp(ret)) : ret
        endif
        return ret
    end function

    /* do the actual estimation (silently) */
    function matrix zip_estimate(series y, list X)
        # initialize alpha to a "sensible" value: half the frequency
        # of zeros in the sample
        scalar alpha = mean(y == 0)/2
        # initialize the coeffs (we assume the first explanatory
        # variable is the constant here)
        matrix coef = zeros(nelem(X), 1)
        coef[1] = mean(y) / (1-alpha)
        # do the actual ML estimation
        mle ll = ln_zip_prob(y, X, coef, alpha)
            params alpha coef
        end mle --hessian --quiet
        return $coeff ~ $stderr
    end function

The example script:

    set verbose off
    # include the user-written functions
    include zip_est.inp

    # generate the artificial data
    nulldata 1000
    set seed 732237
    scalar truep = 0.2
    scalar b0 = 0.2
    scalar b1 = 0.5
    series x = normal()
    series y = (uniform() < truep) ? 0 : randgen(p, exp(b0 + b1*x))
    list X = const x

    # estimate the zero-inflated Poisson model
    zip(y, X)

The results are as follows:

    Zero-inflated Poisson model:

                 coefficient   std. error     z       p-value
      -------------------------------------------------------
      alpha       0.209738     0.0261746     8.013    1.12e-15
      const       0.167847     0.0449693     3.732    0.0002
      x           0.452390     0.0340836    13.27     3.32e-40

A further step may then be creating a function package for accessing your new zip function via gretl's graphical interface. For details on how to do this, see section 14.5.
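One payoff of this design is that changing the specification requires no edits to the functions themselves. For instance (a sketch continuing the script above, with a hypothetical extra regressor x2):

    series x2 = normal()
    list X2 = const x x2
    zip(y, X2)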
26.10 Advanced use of mle: functions, analytical derivatives, algorithm choice

All the techniques described in the previous sections may be combined, and mle can be used for solving non-standard estimation problems (provided, of course, that one chooses maximum likelihood as the preferred inference method). The strategy that, as of this writing, has proven most successful in designing scripts for this purpose is:

- Modularize your code as much as possible.
- Use analytical derivatives whenever possible.
- Choose your optimization method wisely.

In the rest of this section, we will expand on the probit example of section 26.7 to give the reader an idea of what a "heavy-duty" application of mle looks like. Most of the code fragments come from mle-advanced.inp, which is one of the sample scripts supplied with the standard installation of gretl (see under File > Script files > Practice file).

BFGS with and without analytical derivatives

The example in section 26.7 can be made more general by using matrices and user-written functions. Consider the following code fragment:

    list X = const x1 x2 x3
    matrix b = zeros(nelem(X), 1)

    mle logl = y*ln(P) + (1-y)*ln(1-P)
        series ndx = lincomb(X, b)
        series P = cnorm(ndx)
        params b
    end mle

In this context, the fact that the model we are estimating has four explanatory variables is totally incidental: the code is written in such a way that we could change the content of the list X without having to make any other modification. This was made possible by (1) gathering the parameters to estimate into a single vector b rather than using separate scalars; (2) using the nelem() function to initialize b, so that its dimension is kept track of automatically; and (3) using the lincomb() function to compute the index function.

A parallel enhancement could be achieved in the case of analytically computed derivatives: since b is now a vector, mle expects the argument to the deriv keyword to be a matrix, in which each column is the partial derivative with respect to the corresponding element of b. It is useful to rewrite the score for the i-th observation as

    ℓ′_i(β) = m_i x_i′    (26.8)

where m_i is the "signed Mills ratio", that is

    m_i = y_i φ(x_i′β)/Φ(x_i′β) − (1 − y_i) φ(x_i′β)/[1 − Φ(x_i′β)],

which was computed in section 26.7 via

    series P = cnorm(ndx)
    series m = dnorm(ndx)*(y/P - (1-y)/(1-P))

Here we will code it in a somewhat terser way as

    series m = y ? invmills(-ndx) : -invmills(ndx)

and make use of the conditional assignment operator and of the specialized function invmills() for efficiency. Building the score matrix is now easily achieved via

    mle logl = y*ln(P) + (1-y)*ln(1-P)
        series ndx = lincomb(X, b)
        series P = cnorm(ndx)
        series m = y ? invmills(-ndx) : -invmills(ndx)
        matrix mX = {X}
        deriv b = mX .* {m}
    end mle

in which the {} operator was used to turn series and lists into matrices (see chapter 17). However, proceeding in this way for models more complex than probit may imply inserting into the mle block a long series of instructions; the example above merely happens to be short because the score matrix for the probit model is very easy to write in matrix form.

A better solution is writing a user-level function to compute the score, and using that inside the mle block, as in

    function matrix score(matrix b, series y, list X)
        series ndx = lincomb(X, b)
        series m = y ? invmills(-ndx) : -invmills(ndx)
        return {m} .* {X}
    end function

    mle logl = y*ln(P) + (1-y)*ln(1-P)
        series ndx = lincomb(X, b)
        series P = cnorm(ndx)
        deriv b = score(b, y, X)
    end mle

In this way, no matter how complex the computation of the score is, the mle block remains nicely compact.
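To see the score-function approach in action, here is a self-contained sketch that simulates probit data and runs the block above; the DGP, seed and variable names are arbitrary choices for illustration, not part of the sample script.

    nulldata 500
    set seed 101
    series x1 = normal()
    series x2 = normal()
    # hypothetical DGP for a binary dependent variable
    series y = (0.5 + x1 - x2 + normal() > 0)
    list X = const x1 x2
    matrix b = zeros(nelem(X), 1)

    function matrix score(matrix b, series y, list X)
        series ndx = lincomb(X, b)
        series m = y ? invmills(-ndx) : -invmills(ndx)
        return {m} .* {X}
    end function

    mle logl = y*ln(P) + (1-y)*ln(1-P)
        series ndx = lincomb(X, b)
        series P = cnorm(ndx)
        deriv b = score(b, y, X)
    end mle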
Newton's method and the analytical Hessian

As mentioned above, gretl offers the user the option of using Newton's method for maximizing the log-likelihood. In terms of the notation used in section 26.1, the direction for updating the initial parameter vector θ₀ is given by

    d = −λ H(θ₀)⁻¹ g(θ₀),    (26.9)

where H(θ) is the Hessian of the total log-likelihood computed at θ and 0 < λ < 1 is a scalar called the "step length".

The above expression makes a few points clear:

1. At each step, it must be possible to compute not only the score g(θ), but also its derivative H(θ);
2. the matrix H(θ) should be nonsingular;
3. it is assumed that for some positive value of λ, ℓ(θ₁) > ℓ(θ₀); in other words, that going in the direction d leads "upwards" for some step length.

The strength of Newton's method lies in the fact that, if the log-likelihood is globally concave, then (26.9) enjoys certain optimality properties and the number of iterations required to reach the maximum is often much smaller than it would be with other methods, such as BFGS. However, it may have some disadvantages: for a start, the Hessian H(θ) may be difficult or very expensive to compute; moreover, the log-likelihood may not be globally concave, so for some values of θ the matrix H(θ) is not negative definite, or perhaps even singular. Those cases are handled by gretl's implementation of Newton's algorithm by means of several heuristic techniques (the gist of it is that, if H is not negative definite, it is substituted by k·dg(H) + (1−k)·H, where k is a suitable scalar; if you're interested in the precise details, you'll be much better off looking at the source code, in the file lib/src/gretl_bfgs.c), but a number of adverse consequences may occur, ranging from longer computation time to non-convergence of the algorithm.

As a consequence, using Newton's method is advisable only when the computation of the Hessian is not too CPU-intensive and the nature of the estimator is such that it is known in advance that the log-likelihood is globally concave. The probit model satisfies both requisites, so we will expand the preceding example to illustrate how to use Newton's method in gretl.

A first example may be given simply by issuing the command

    set optimizer newton

before the mle block. (To go back to BFGS, use set optimizer bfgs.) This will instruct gretl to use Newton's method instead of BFGS. If the deriv keyword is used, gretl will differentiate the score function numerically; otherwise, if the score has to be computed itself numerically, gretl will calculate H(θ) by differentiating the log-likelihood numerically twice. The latter solution, though, is generally to be avoided, as it may be extremely time-consuming and may yield imprecise results.

A much better option is to calculate the Hessian analytically and have gretl use its true value rather than a numerical approximation. In most cases this is both much faster and numerically stable, but of course it comes at the price of having to differentiate the log-likelihood twice with respect to the parameters and translate the resulting expressions into efficient hansl code. Luckily, both tasks are relatively easy in the probit case: the matrix of second derivatives of ℓ_i may be written as

    ∂²ℓ_i/∂β∂β′ = −m_i (m_i + x_i′β) x_i x_i′,

so the total Hessian is

    Σ_{i=1}^n ∂²ℓ_i/∂β∂β′ = −X′ diag(w₁, …, w_n) X,    (26.10)

where w_i = m_i (m_i + x_i′β). It can be shown that w_i > 0, so the Hessian is guaranteed to be negative definite in all sensible cases, and the conditions are ideal for applying Newton's method.

A hansl translation of equation (26.10) may look like

    function void Hess(matrix *H, matrix b, series y, list X)
        /* computes the negative Hessian for a probit model */
        series ndx = lincomb(X, b)
        series m = y ? invmills(-ndx) : -invmills(ndx)
        series w = m*(m+ndx)
        matrix mX = {X}
        H = mX'({w} .* mX)
    end function

There are two characteristics worth noting of the function above. For a start, it doesn't return anything: the result of the computation is simply stored in the matrix pointed at by the first argument of the function. Second, the result is not the Hessian proper, but rather its negative. This function becomes usable from within an mle block by the keyword hessian. The syntax is

    mle ...
        ...
        hessian funcname(&mat_addr, ...)
    end mle

In other words, the hessian keyword must be followed by a call to a function whose first argument is a matrix pointer, which is supposed to be filled with the negative of the Hessian at θ.

We said above (section 26.1) that the covariance matrix of the parameter estimates is, by default, estimated using the Outer Product of the Gradient, so long as the log-likelihood function returns the per-observation contributions. However, if you supply a function that computes the Hessian, then by default it is used in estimating the covariance matrix. If you wish to impose use of OPG instead, append the --opg option to the end of the mle block.
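Putting the pieces together, a sketch along the lines of mle-advanced.inp might look as follows; it assumes the score() and Hess() functions defined above are in scope, along with y, X and a zero-initialized b, and the exact details (notably the prior declaration of the matrix passed by address) are an assumption of this sketch rather than a definitive recipe.

    matrix H = {}           # to be filled by Hess() at each iteration
    set optimizer newton    # use Newton's method rather than BFGS
    mle logl = y*ln(P) + (1-y)*ln(1-P)
        series ndx = lincomb(X, b)
        series P = cnorm(ndx)
        deriv b = score(b, y, X)
        hessian Hess(&H, b, y, X)
    end mle
    set optimizer bfgs      # restore the default optimizer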
Note that gretl does not perform any numerical check on whether a user-supplied function computes the Hessian correctly. On the one hand, this means that you can "trick" mle into using alternatives to the Hessian, and thereby implement other optimization methods. For example, if you substitute in equation (26.9) the Hessian H with the negative of the OPG matrix −G′G, as defined in (26.2), you get the so-called BHHH optimization method (see Berndt et al., 1974). Again, the sample file mle-advanced.inp provides an example.

On the other hand, you may want to perform a check of your analytically computed H matrix versus a numerical approximation. If you have a function that computes the score, this is relatively simple to do by using the fdjac() function, briefly described in section 37.3, which computes a numerical approximation to a derivative. In practice, you need a function computing g(θ) as a row vector, and then use fdjac() to differentiate it numerically with respect to θ. The result can then be compared to your analytically computed Hessian. The code fragment below shows an example of how this can be done in the probit case:

    function matrix totalscore(matrix b, series y, list X)
        /* computes the total score */
        return sumc(score(b, y, X))
    end function

    function void check(matrix b, series y, list X)
        /* compares the analytical Hessian to its numerical
           approximation obtained via fdjac */
        matrix aH
        Hess(&aH, b, y, X)   # stores the analytical Hessian into aH
        matrix nH = fdjac(b, totalscore(b, y, X))
        nH = 0.5*(nH + nH')  # force symmetry
        printf "Numerical Hessian\n%16.6f\n", nH
        printf "Analytical Hessian (negative)\n%16.6f\n", aH
        printf "Check (should be zero)\n%16.6f\n", nH + aH
    end function

26.11 Estimating constrained models

In many cases, you may want to perform ML estimation of a model under some kind of constraint. Mathematically, this amounts to maximizing the log-likelihood ℓ(θ) under the restriction g(θ) = 0, where g(·) is a differentiable function. On paper, the most straightforward way to accomplish this task is to set up the Lagrangean

    L(θ, λ) = ℓ(θ) − λ′g(θ)

and solve the first-order conditions that arise from differentiating the Lagrangean with respect to θ and λ. If an explicit solution can be found, all is well; but in many cases the resulting system of equations cannot be solved explicitly, so that numerical optimisation is necessary. In such cases the approach above is not particularly useful; a different strategy is much more convenient.

The idea is to find an alternative parametrization: a means of expressing the vector θ as a differentiable function of a smaller set of parameters ψ. In other words, find a function h(·) such that any admissible value of θ can be written as θ = h(ψ), with g(h(ψ)) = 0 for any value of ψ. Then maximization of the log-likelihood is simply a question of operating on ℓ*(ψ) = ℓ(h(ψ)) using an ordinary unconstrained numerical optimization routine. Once the ML estimate ψ̂ is available, it is easy to recover the corresponding constrained vector θ̂ = h(ψ̂).

Computing the covariance matrix involves an extra step, known as the "delta method": the asymptotic covariance matrix of θ̂ can be computed as

    V(θ̂) = J(ψ̂) V(ψ̂) J(ψ̂)′    (26.11)

where J is the Jacobian matrix holding the partial derivatives of h(ψ). It is recommended that the Jacobian matrix be computed analytically whenever possible, but as a fallback strategy, numerical differentiation (available via the function fdjac(); see section 37.3) is a viable alternative. Note that the matrix produced by this method will be singular by construction, since J has fewer columns than rows.
The example reported in Listing 26.6 is perhaps a little contrived, but useful to elucidate the technique. Suppose we wish to estimate the mean and variance of an i.i.d. sample of Gaussian random variables, under the constraint that V(x_t) = σ² = exp[E(x_t)] = e^μ. Of course, the unconstrained ML estimators μ̂ = X̄ and σ̂² = n⁻¹ Σ_i (x_i − X̄)² are not guaranteed to satisfy the constraint; in fact, the probability that they do is 0. The Lagrangean in this case would be

    L(θ) = K − (n/2) log σ² − (1/2σ²) Σ_i (x_i − μ)² − λ(e^μ − σ²),

and finding an explicit solution by solving the first-order conditions is not at all easy. Fortunately, numerical optimization becomes straightforward by expressing the constrained parameters as

    θ = [μ, σ²]′ = [ψ, exp(ψ)]′ = h(ψ);

after maximizing the log-likelihood, the covariance matrix for θ̂ can be recovered by computing the Jacobian as

    J(ψ) = [dμ/dψ, dσ²/dψ]′ = [1, exp(ψ)]′

and applying formula (26.11). Running the example script should produce the following output:

    unconstrained estimates: mean = 1.00314, variance = 2.8903
    check: vhat - exp(muhat) = 0.163481

    Model 1: ML, using observations 1-1000
    loglik = -0.5*log(2*pi) - 0.5*log(s2) - 0.5*(x-m)^2/s2
    Standard errors based on Outer Products matrix

                 estimate   std. error      z      p-value
      -----------------------------------------------------
      psi[1]     1.03763    0.0357311     29.04    2.07e-185

    Log-likelihood    -1949.972    Akaike criterion   3901.943
    Schwarz criterion  3906.851    Hannan-Quinn       3903.808

    check: vhat - exp(muhat) = 0

                 coefficient   std. error      z      p-value
      --------------------------------------------------------
      mean        1.03763      0.0357311     29.04    2.07e-185
      variance    2.82251      0.100851      27.99    2.35e-172

Listing 26.6: Example of ML estimation of a model under constraints

    set verbose off
    set seed 7120

    function matrix h(matrix psi)
        matrix ret = psi[1] | exp(psi[1])
        return ret
    end function

    function matrix anJacob(matrix psi)
        # the derivative of h()
        return 1 | exp(psi[1])
    end function

    nulldata 1000
    # generate artificial data from a N(1, e) distribution
    series x = 1 + normal() * exp(0.5)

    # show that the unconstrained estimates don't satisfy the restriction
    scalar muhat = mean(x)
    scalar s2hat = sst(x)/$nobs
    printf "unconstrained estimates: mean = %g, variance = %g\n", muhat, s2hat
    printf "check: vhat - exp(muhat) = %g\n", s2hat - exp(muhat)

    # now estimate under the constraint exp(mean) = variance
    matrix psi = {1}
    mle loglik = -0.5*log(2*$pi) - 0.5*log(s2) - 0.5*(x-m)^2/s2
        matrix par = h(psi)
        scalar m = par[1]
        scalar s2 = par[2]
        params psi
    end mle

    # now map psi to the constrained parametrisation
    matrix par = h(psi)
    # show that now the constraint holds
    printf "check: vhat - exp(muhat) = %g\n", par[2] - exp(par[1])
    # take care of the covariance matrix
    matrix vpar = qform(anJacob(psi), $vcv)
    # alternatively, one could use the numerical Jacobian, as in
    # matrix vpar = qform(fdjac(psi, h(psi)), $vcv)
    # finally, print out the constrained parameters via modprint
    matrix cs = par ~ sqrt(diag(vpar))
    modprint cs "mean,variance"

The example provided in Listing 26.7 illustrates the usage of catch in an artificially simple context: we use the mle command for estimating the mean and variance of a Gaussian rv (of course, you don't need the mle apparatus for this, but it makes for a nice example). The gist of the example is using the set bfgs_maxiter command to force mle to abort after a very small number of iterations, so that you can get an idea of how to use the catch modifier and the associated $error accessor to handle the situation. You may want to increase the maximum number of BFGS iterations in the example to check what happens if the algorithm is allowed to converge. Note that, upon successful completion of mle, a bundle named $model is available, containing several quantities that may be of interest, including the total number of function evaluations.
Listing 26.7: Handling non-convergence via catch

    set verbose off
    nulldata 200
    set seed 8118

    # generate simulated data from a N(3,4) variate
    series x = normal(3, 2)
    # set starting values
    scalar m = 0
    scalar s2 = 1
    # set iteration limit to a ridiculously low value
    set bfgs_maxiter 10
    # perform ML estimation; note the "catch" modifier
    catch mle loglik = -0.5 * (log(2*$pi) + log(s2) + e2/s2)
        series e2 = (x - m)^2
        params m s2
    end mle --quiet
    # grab the error and proceed as needed
    err = $error
    if err
        printf "Not converged! (m = %g, s2 = %g)\n", m, s2
    else
        printf "Converged after %d iterations\n", $model.grcount
        cs = $coeff ~ sqrt(diag($vcv))
        pn = "m,s2"
        modprint cs pn
    endif

Chapter 27  GMM estimation

27.1 Introduction and terminology

The Generalized Method of Moments (GMM) is a very powerful and general estimation method, which encompasses practically all the parametric estimation techniques used in econometrics. It was introduced in Hansen (1982) and Hansen and Singleton (1982); an excellent and thorough treatment is given in chapter 17 of Davidson and MacKinnon (1993).

The basic principle on which GMM is built is rather straightforward. Suppose we wish to estimate a scalar parameter θ based on a sample x₁, x₂, …, x_T. Let θ₀ indicate the "true" value of θ. Theoretical considerations (either of statistical or economic nature) may suggest that a relationship like the following holds:

    E[x_t − g(θ)] = 0  ⟺  θ = θ₀,    (27.1)

with g(·) a continuous and invertible function. That is to say, there exists a function of the data and the parameter, with the property that it has expectation zero if and only if it is evaluated at the true parameter value. For example, economic models with rational expectations lead to expressions like (27.1) quite naturally.

If the sampling model for the x_t's is such that some version of the Law of Large Numbers holds, then

    X̄ = (1/T) Σ_{t=1}^T x_t  →p  g(θ₀);

hence, since g(·) is invertible, the statistic θ̂ = g⁻¹(X̄) →p θ₀, so θ̂ is a consistent estimator of θ. A different way to obtain the same outcome is to choose, as an estimator of θ, the value that minimizes the objective function

    F(θ) = [ (1/T) Σ_{t=1}^T x_t − g(θ) ]² = [ X̄ − g(θ) ]²;    (27.2)

the minimum is trivially reached at θ̂ = g⁻¹(X̄), since the expression in square brackets equals 0.

The above reasoning can be generalized as follows: suppose θ is an n-vector and we have m relations like

    E[f_i(x_t, θ)] = 0  for i = 1 … m,    (27.3)

where E(·) is a conditional expectation on a set of p variables z_t, called the instruments. In the above simple example, m = 1 and f(x_t, θ) = x_t − g(θ), and the only instrument used is z_t = 1. Then it must also be true that

    E[ f_i(x_t, θ) · z_{jt} ] = E[ f_{ijt}(θ) ] = 0  for i = 1 … m and j = 1 … p;    (27.4)

equation (27.4) is known as an orthogonality condition, or moment condition. The GMM estimator is defined as the minimum of the quadratic form

    F(θ, W) = f̄ W f̄′,    (27.5)

where f̄ is a (1 × m·p) vector holding the average of the orthogonality conditions and W is some symmetric, positive definite matrix, known as the weights matrix. A necessary condition for the minimum to exist is the order condition n ≤ m·p.

The statistic

    θ̂ = Argmin_θ F(θ, W)    (27.6)

is a consistent estimator of θ whatever the choice of W. However, to achieve maximum asymptotic efficiency, W must be proportional to the inverse of the long-run covariance matrix of the orthogonality conditions; if W is not known, a consistent estimator will suffice.

These considerations lead to the following empirical strategy:

1. Choose a positive definite W and compute the one-step GMM estimator θ̂₁. Customary choices for W are I_{m·p} or I_m ⊗ (Z′Z)⁻¹.
2. Use θ̂₁ to estimate V(f_{ijt}(θ)) and use its inverse as the weights matrix. The resulting estimator θ̂₂ is called the two-step estimator.
3. Re-estimate V(f_{ijt}(θ)) by means of θ̂₂ and obtain θ̂₃; iterate until convergence. Asymptotically, these extra steps are unnecessary, since the two-step estimator is consistent and efficient; however, the iterated estimator often has better small-sample properties and should be independent of the choice of W made at step 1.
In the special case when the number of parameters n is equal to the total number of orthogonality conditions m·p, the GMM estimator θ̂ is the same for any choice of the weights matrix W, so the first step is sufficient; in this case, the objective function is 0 at the minimum. If, on the contrary, n < m·p, the second step (or successive iterations) is needed to achieve efficiency, and the estimator so obtained can be very different, in finite samples, from the one-step estimator. Moreover, the value of the objective function at the minimum, suitably scaled by the number of observations, yields Hansen's J statistic; this statistic can be interpreted as a test statistic that has a χ² distribution with m·p − n degrees of freedom under the null hypothesis of correct specification. See Davidson and MacKinnon (1993, section 17.6) for details.

In the following sections we will show how these ideas are implemented in gretl, through some examples.

27.2 GMM as Method of Moments

This section draws from a kind contribution by Alecos Papadopoulos, whom we thank.

A very simple illustration of GMM can be given by dropping the "G", via an example of the time-honored statistical technique known as the method of moments; let's see how to estimate the parameters of a gamma distribution, which we also used as an example for ML estimation in section 26.4.

Assume that we have an i.i.d. sample of size T from a gamma distribution. The gamma density can be parameterized in terms of the two parameters p (shape) and θ (scale), both real and positive. (In section 26.4 we used a slightly different, perhaps more common, parametrization, employing θ = 1/α. We are switching to the shape/scale parametrization here for the sake of convenience.) In order to estimate them by the method of moments, we need two moment conditions, so that we have two equations and two unknowns (in the GMM jargon, this amounts to exact identification). The two relations we need are

    E(x_i) = p·θ,    V(x_i) = p·θ².

These will become our moment conditions; substituting the finite-sample analogues of the theoretical moments, we have

    X̄ = p̂·θ̂    (27.7)
    V̂(x_i) = p̂·θ̂²    (27.8)

Of course, the two equations above are easy to solve analytically, giving θ̂ = V̂/X̄ and p̂ = X̄/θ̂ (V̂ being the sample variance of the x_i), but it's instructive to see how the gmm command will solve this system of equations numerically.

We feed gretl the necessary ingredients for GMM estimation in a command block, starting with gmm and ending with end gmm. Three elements are compulsory within a gmm block:

1. one or more orthog statements;
2. one weights statement;
3. one params statement.

The three elements should be given in the stated order.

The orthog statements are used to specify the orthogonality conditions. They must follow the syntax

    orthog x ; Z

where x may be a series, matrix or list of series, and Z may also be a series, matrix or list. Note the structure of the orthogonality condition: it is assumed that the term to the left of the semicolon represents a quantity that depends on the estimated parameters (and so must be updated in the process of iterative estimation), while the term on the right is a constant function of the data.

The weights statement is used to specify the initial weighting matrix, and its syntax is straightforward. The params statement specifies the parameters with respect to which the GMM criterion should be minimized; it follows the same logic and rules as in the mle and nls commands.

The minimum is found through numerical minimization via BFGS (see chapters 37 and 26). The progress of the optimization procedure can be observed by appending the --verbose switch to the end gmm line.
Equations (27.7) and (27.8) are not yet in the moment-condition form required by the gmm command. We need to transform them and arrive at something looking like E(e_{j,i} · z_{j,i}) = 0, with j = 1, 2. Therefore, we need two corresponding observable variables e₁ and e₂ with corresponding instruments z₁ and z₂, and to tell gretl that Ê(e_j · z_j) = 0 must be satisfied, where the Ê(·) notation indicates sample moments.

If we define the instrument as a series of ones, and set e_{1,i} = x_i − p·θ, then we can rewrite the first moment condition as

    Ê[ (x_i − p·θ) · 1 ] = 0.

This is in the form required by the gmm command: in the input statement orthog e ; z, e will be the variable on the left (defined as a series) and z will be the variable to the right of the semicolon. Since z_{1,i} = 1 for all i, we can use the built-in series const for that.

For the second moment condition we have, analogously,

    Ê[ ((x_i − X̄)² − p·θ²) · 1 ] = 0,

so that by setting e_{2,i} = (x_i − X̄)² − p·θ² and z₂ = z₁ we can rewrite the second moment condition as Ê[e_{2,i} · 1] = 0.

The weighting matrix, which is required by the gmm command, can be set to any 2×2 positive definite matrix, since under exact identification the choice does not matter; its dimension is determined by the number of orthogonality conditions. Therefore, we'll use the identity matrix. Example code follows:

    # create an empty data set
    nulldata 200
    # fix a random seed
    set seed 2207092
    # generate a gamma random variable with, say, shape p = 3 and scale theta = 2
    series x = randgen(G, 3, 2)
    # declare and set some initial values for parameters p and theta
    scalar p = 1
    scalar theta = 1
    # create the weight matrix as the identity matrix
    matrix W = I(2)
    # declare the series to be used in the orthogonality conditions
    series e1 = 0
    series e2 = 0

    gmm
        scalar m = mean(x)
        series e1 = x - p*theta
        series e2 = (x - m)^2 - p*theta^2
        orthog e1 ; const
        orthog e2 ; const
        weights W
        params p theta
    end gmm

The corresponding output is:

    Model 1: 1-step GMM, using observations 1-200

                 estimate   std. error      z      p-value
      -----------------------------------------------------
      p          3.09165    0.346565       8.921   4.63e-19
      theta      1.89983    0.224418       8.466   2.55e-17

    GMM criterion: Q = 4.97341e-28 (TQ = 9.94682e-26)

If we want to use the unbiased estimator for the sample variance, we'd have to adjust the second moment condition by substituting

    series e2 = (x - m)^2 - p*theta^2

with

    scalar adj = $nobs / ($nobs - 1)
    series e2 = adj * (x - m)^2 - p*theta^2

with the corresponding slight change in the output:

    Model 1: 1-step GMM, using observations 1-200

                 estimate   std. error      z      p-value
      -----------------------------------------------------
      p          3.07619    0.344832       8.921   4.63e-19
      theta      1.90937    0.225546       8.466   2.55e-17

    GMM criterion: Q = 2.80713e-28 (TQ = 5.61426e-26)

One can observe tiny improvements in the point estimates, since both moved a tad closer to the true values. This, however, is just a small-sample effect, and not something you should expect in larger samples.
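As a check, the analytic method-of-moments solution mentioned above can be computed in a couple of lines and compared with the gmm output; this sketch assumes the example script has just been run, so that the series x is in place.

    scalar xbar = mean(x)
    scalar vhat = sst(x)/$nobs      # biased sample variance, as in the first run
    scalar theta_mm = vhat/xbar     # theta-hat = V-hat / X-bar
    scalar p_mm = xbar/theta_mm     # p-hat = X-bar / theta-hat
    printf "analytic MM solution: p = %g, theta = %g\n", p_mm, theta_mm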
27.3 OLS as GMM

Let us now move to an example that is closer to econometrics proper: the linear model y_t = x_t β + u_t. Although most of us are used to reading it as the sum of a hazily defined "systematic part" plus an equally hazy "disturbance", a more rigorous interpretation of this familiar expression comes from the hypothesis that the conditional mean E(y_t|x_t) is linear, together with the definition of u_t as y_t − E(y_t|x_t).

From the definition of u_t it follows that E(u_t|x_t) = 0. The following orthogonality condition is therefore available:

    E[f(β)] = 0,    (27.9)

where f(β) = (y_t − x_t β) x_t. The definitions given in section 27.1 therefore specialize here as follows: θ is β; the instrument is x_t; f_{ijt}(θ) is (y_t − x_t β) x_t = u_t x_t, so the orthogonality condition is interpretable as the requirement that the regressors should be uncorrelated with the disturbances; and W can be any symmetric positive definite matrix, since the number of parameters equals the number of orthogonality conditions. Let's say we choose I. The function F(θ, W) is, in this case,

    F(θ, W) = [ (1/T) Σ_{t=1}^T (û_t x_t) ]²

and it is easy to see why OLS and GMM coincide here: the GMM objective function has the same minimizer as the objective function of OLS, the residual sum of squares. Note, however, that the two functions are not equal to one another: at the minimum F(θ, W) = 0, while the minimized sum of squared residuals is zero only in the special case of a perfect linear fit.

The code snippet below uses gretl's gmm command to make the above operational. The series e holds the "residuals" and the series x holds the regressor. If x had been a list (or a matrix), the orthog statement would have generated one orthogonality condition for each element (or column) of x.

    # initialize stuff
    series e = 0
    scalar beta = 0
    matrix W = I(1)

    # proceed with estimation
    gmm
        series e = y - x*beta
        orthog e ; x
        weights W
        params beta
    end gmm

27.4 TSLS as GMM

Moving closer to the proper domain of GMM, we now consider two-stage least squares (TSLS) as a case of GMM. TSLS is employed when one wishes to estimate a linear model of the form y_t = X_t β + u_t, but where one or more of the variables in the matrix X are potentially endogenous, that is, correlated with the error term u. We proceed by identifying a set of instruments Z_t which are explanatory for the endogenous variables in X but which are plausibly uncorrelated with u. The classic two-stage procedure is (1) regress the endogenous elements of X on Z; then (2) estimate the equation of interest, with the endogenous elements of X replaced by their fitted values from (1).

An alternative perspective is given by GMM. We define the residual û_t as y_t − X_t β̂, as usual. But instead of relying on E(u|X) = 0, as in OLS, we base estimation on the condition E(u|Z) = 0. In this case it is natural to base the initial weighting matrix on the covariance matrix of the instruments. Listing 27.1 presents a model from Stock and Watson's Introduction to Econometrics: the demand for cigarettes is modeled as a linear function of the logs of price and income, with income treated as exogenous, price taken to be endogenous, and two measures of tax used as instruments. Since we have two instruments and one endogenous variable, the model is over-identified.

In the GMM context, this happens when you have more orthogonality conditions than parameters to estimate. If so, asymptotic efficiency gains can be expected by iterating the procedure once or more. This is accomplished by specifying, after the end gmm statement, one of two mutually exclusive options: --two-step or --iterate, whose meaning should be obvious. Note that, when the problem is over-identified, the weights matrix will influence the solution you get from the 1- and 2-step procedures.

In cases other than one-step estimation, the specified weights matrix will be overwritten with the final weights on completion of the gmm command. If you wish to execute more than one GMM block with a common starting point, it is therefore necessary to reinitialize the weights matrix between runs; the sketch below illustrates.

Partial output from this script is shown in Listing 27.2. The estimated standard errors from GMM are robust by default; if we supply the --robust option to the tsls command, we get identical results. (The data file used in this example is available in the Stock and Watson package for gretl; see http://gretl.sourceforge.net/gretl_data.html.)
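The following sketch illustrates the re-initialization point, building on the setup of Listing 27.1 (so Z, e and the scalar parameters are assumed to be defined already); without the second assignment to W, the --iterate run would start from the final weights left behind by the --two-step run.

    matrix W = inv(Z'Z)     # initial weights for the first run
    gmm e = lpackpc - b0 - b1*lravgprs - b2*lperinc
        orthog e ; Z
        weights W
        params b0 b1 b2
    end gmm --two-step

    matrix W = inv(Z'Z)     # reset: W was overwritten above
    gmm e = lpackpc - b0 - b1*lravgprs - b2*lperinc
        orthog e ; Z
        weights W
        params b0 b1 b2
    end gmm --iterate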
Listing 27.1: TSLS via GMM

    open cig_ch10.gdt
    # real avg price including sales tax
    ravgprs = avgprs / cpi
    # real avg cig-specific tax
    rtax = tax / cpi
    # real average total tax
    rtaxs = taxs / cpi
    # real average sales tax
    rtaxso = rtaxs - rtax
    # logs of consumption, price, income
    lpackpc = log(packpc)
    lravgprs = log(ravgprs)
    perinc = income / (pop*cpi)
    lperinc = log(perinc)
    # restrict sample to 1995 observations
    smpl year==1995 --restrict
    # Equation (10.16) by tsls
    list xlist = const lravgprs lperinc
    list zlist = const rtaxso rtax lperinc
    tsls lpackpc xlist ; zlist --robust

    # setup for gmm
    matrix Z = { zlist }
    matrix W = inv(Z'Z)
    series e = 0
    scalar b0 = 1
    scalar b1 = 1
    scalar b2 = 1

    gmm e = lpackpc - b0 - b1*lravgprs - b2*lperinc
        orthog e ; Z
        weights W
        params b0 b1 b2
    end gmm

Listing 27.2: TSLS via GMM, partial output

    Model 1: TSLS estimates using the 48 observations 1-48
    Dependent variable: lpackpc
    Instruments: rtaxso rtax
    Heteroskedasticity-robust standard errors, variant HC0

      VARIABLE     COEFFICIENT   STDERROR    T STAT    P-VALUE
      const          9.89496     0.928758    10.654    <0.00001
      lravgprs      -1.27742     0.241684    -5.286    <0.00001
      lperinc        0.280405    0.245828     1.141     0.25401

    Model 2: 1-step GMM estimates using the 48 observations 1-48
    e = lpackpc - b0 - b1*lravgprs - b2*lperinc

      PARAMETER      ESTIMATE    STDERROR    T STAT    P-VALUE
      b0             9.89496     0.928758    10.654    <0.00001
      b1            -1.27742     0.241684    -5.286    <0.00001
      b2             0.280405    0.245828     1.141     0.25401

      GMM criterion = 0.0110046

27.5 Covariance matrix options

The covariance matrix of the estimated parameters depends on the choice of W through

    Σ̂ = (J′WJ)⁻¹ J′WΩWJ (J′WJ)⁻¹,    (27.10)

where J is a Jacobian term, J_{ij} = ∂f̄_i/∂θ_j, and Ω is the long-run covariance matrix of the orthogonality conditions. Gretl computes J by numeric differentiation; there is no provision for specifying a user-supplied analytical expression for J at the moment. As for Ω, a consistent estimate is needed. The simplest choice is the sample covariance matrix of the f_t's:

    Ω̂₀(θ) = (1/T) Σ_{t=1}^T f_t(θ)′ f_t(θ).    (27.11)

This estimator is robust with respect to heteroskedasticity, but not with respect to autocorrelation. A heteroskedasticity- and autocorrelation-consistent (HAC) variant can be obtained using the Bartlett kernel or similar. A univariate version of this is used in the context of the lrvar() function; see equation (22.6). The multivariate version is set out in equation (27.12):

    Ω̂_k(θ) = (1/T) Σ_{t=k}^{T−k} Σ_{i=−k}^{k} w_i f_t(θ)′ f_{t−i}(θ).    (27.12)

Gretl computes the HAC covariance matrix by default when a GMM model is estimated on time-series data. You can control the kernel and the bandwidth (that is, the value of k in (27.12)) using the set command; see chapter 22 for further discussion of HAC estimation. You can also ask gretl not to use the HAC version by saying

    set force_hc on

27.6 A real example: the Consumption Based Asset Pricing Model

To illustrate gretl's implementation of GMM, we will replicate the example given in chapter 3 of Hall (2005). The model to estimate is a classic application of GMM, and provides an example of a case when orthogonality conditions do not stem from statistical considerations, but rather from economic theory.

A rational individual who must allocate his income between consumption and investment in a financial asset must in fact choose the consumption path of his whole lifetime, since investment translates into future consumption. It can be shown that an optimal consumption path should satisfy the following condition:

    p U′(c_t) = δ^k E[ r_{t+k} U′(c_{t+k}) | F_t ],    (27.13)

where p is the asset price, U(·) is the individual's utility function, δ is the individual's subjective discount rate and r_{t+k} is the asset's rate of return between time t and time t+k. F_t is the information set at time t; equation (27.13) says that the utility "lost" at time t by purchasing the asset instead of consumption goods must be matched by a corresponding increase in the (discounted) future utility of the consumption financed by the asset's return.
Since the future is uncertain, the individual considers his expectation, conditional on what is known at the time when the choice is made.

We have said nothing about the nature of the asset, so equation (27.13) should hold whatever asset we consider; hence, it is possible to build a system of equations like (27.13) for each asset whose price we observe.

If we are willing to believe that the economy as a whole can be represented as a single gigantic and immortal representative individual, and that the function U(x) = (x^α − 1)/α is a faithful representation of the individual's preferences, then, setting k = 1, equation (27.13) implies the following for any asset j:

    E[ δ (r_{j,t+1}/p_{j,t}) (C_{t+1}/C_t)^{α−1} | F_t ] = 1,    (27.14)

where C_t is aggregate consumption and α and δ are the risk aversion and discount rate of the representative individual. In this case, it is easy to see that the "deep" parameters α and δ can be estimated via GMM by using

    e_t = δ (r_{j,t+1}/p_{j,t}) (C_{t+1}/C_t)^{α−1} − 1

as the moment condition, while any variable known at time t may serve as an instrument.

In the example code given in Listing 27.3, we replicate selected portions of table 3.7 in Hall (2005). The variable consrat is defined as the ratio of monthly consecutive real per capita consumption (services and nondurables) for the US, and ewr is the return-price ratio of a fictitious asset constructed by averaging all the stocks in the NYSE. The instrument set contains the constant and two lags of each variable.

The command set force_hc on on the second line of the script has the sole purpose of replicating the given example: as mentioned above, it forces gretl to compute the long-run variance of the orthogonality conditions according to equation (27.11) rather than (27.12).

We run gmm four times: one-step estimation for each of two initial weights matrices, then iterative estimation starting from each set of initial weights. Since the number of orthogonality conditions (5) is greater than the number of estimated parameters (2), the choice of initial weights should make a difference, and indeed we see fairly substantial differences between the one-step estimates (Models 1 and 2). On the other hand, iteration reduces these differences almost to the vanishing point (Models 3 and 4).

Part of the output is given in Listing 27.4. It should be noted that the J test leads to a rejection of the hypothesis of correct specification. This is perhaps not surprising, given the heroic assumptions required to move from the microeconomic principle in equation (27.13) to the aggregate system that is actually estimated.

Listing 27.3: Estimation of the Consumption Based Asset Pricing Model

    open hall.gdt
    set force_hc on

    scalar alpha = 0.5
    scalar delta = 0.5
    series e = 0

    list inst = const consrat(-1) consrat(-2) ewr(-1) ewr(-2)

    matrix V0 = 100000*I(nelem(inst))
    matrix Z = { inst }
    matrix V1 = $nobs*inv(Z'Z)

    gmm e = delta*ewr*consrat^(alpha-1) - 1
        orthog e ; inst
        weights V0
        params alpha delta
    end gmm

    gmm e = delta*ewr*consrat^(alpha-1) - 1
        orthog e ; inst
        weights V1
        params alpha delta
    end gmm

    gmm e = delta*ewr*consrat^(alpha-1) - 1
        orthog e ; inst
        weights V0
        params alpha delta
    end gmm --iterate

    gmm e = delta*ewr*consrat^(alpha-1) - 1
        orthog e ; inst
        weights V1
        params alpha delta
    end gmm --iterate
Listing 27.4: Estimation of the Consumption Based Asset Pricing Model, output

    Model 1: 1-step GMM estimates using the 465 observations 1959:04-1997:12
    e = d*ewr*consrat^(alpha-1) - 1

      PARAMETER    ESTIMATE     STDERROR      T STAT    P-VALUE
      alpha        3.14475      6.84439        0.459    0.64590
      d            0.999215     0.0121044     82.549    <0.00001

      GMM criterion = 2778.08

    Model 2: 1-step GMM estimates using the 465 observations 1959:04-1997:12
    e = d*ewr*consrat^(alpha-1) - 1

      PARAMETER    ESTIMATE     STDERROR      T STAT    P-VALUE
      alpha        0.398194     2.26359        0.176    0.86036
      d            0.993180     0.00439367   226.048    <0.00001

      GMM criterion = 142.47

    Model 3: Iterated GMM estimates using the 465 observations 1959:04-1997:12
    e = d*ewr*consrat^(alpha-1) - 1

      PARAMETER    ESTIMATE     STDERROR      T STAT    P-VALUE
      alpha        0.344325     2.21458        0.155    0.87644
      d            0.991566     0.00423620   234.070    <0.00001

      GMM criterion = 5491.78
      J test: Chi-square(3) = 11.8103 (p-value 0.0081)

    Model 4: Iterated GMM estimates using the 465 observations 1959:04-1997:12
    e = d*ewr*consrat^(alpha-1) - 1

      PARAMETER    ESTIMATE     STDERROR      T STAT    P-VALUE
      alpha        0.344315     2.21359        0.156    0.87639
      d            0.991566     0.00423469   234.153    <0.00001

      GMM criterion = 5491.78
      J test: Chi-square(3) = 11.8103 (p-value 0.0081)

27.7 Caveats

A few words of warning are in order: despite its ingenuity, GMM is possibly the most fragile estimation method in econometrics. The number of non-obvious choices one has to make when using GMM is large, and in finite samples each of these can have dramatic consequences for the eventual output. Some of the factors that may affect the results are:

1. Orthogonality conditions can be written in more than one way: for example, if E(x_t − μ) = 0, then E(x_t/μ − 1) = 0 holds too. It is possible that a different specification of the moment conditions leads to different results.
2. As with all other numerical optimization algorithms, weird things may happen when the objective function is nearly flat in some directions or has multiple minima. BFGS is usually quite good, but there is no guarantee that it always delivers a sensible solution, if one at all.
3. The 1-step and, to a lesser extent, the 2-step estimators may be sensitive to apparently trivial details, like the rescaling of the instruments. Different choices for the initial weights matrix can also have noticeable consequences.
4. With time-series data, there is no hard rule on the appropriate number of lags to use when computing the long-run covariance matrix (see section 27.5). Our advice is to go by trial and error, since results may be greatly influenced by a poor choice.

One of the consequences of this state of things is that replicating well-known published studies may be extremely difficult. Any non-trivial result is virtually impossible to reproduce unless all details of the estimation procedure are carefully recorded.

Chapter 28  Model selection criteria

28.1 Introduction

In some contexts, the econometrician chooses between alternative models based on a formal hypothesis test. For example, one might choose a more general model over a more restricted one if the restriction in question can be formulated as a testable null hypothesis, and the null is rejected on an appropriate test.

In other contexts, one sometimes seeks a criterion for model selection that somehow measures the balance between goodness of fit (or likelihood) on the one hand and parsimony on the other. The balancing is necessary because the addition of extra variables to a model cannot reduce the degree of fit or likelihood, and is very likely to increase it somewhat, even if the additional variables are not truly relevant to the data-generating process.

The best known such criterion, for linear models estimated via least squares, is the adjusted R²,

    R̄² = 1 − [SSR/(n − k)] / [TSS/(n − 1)],

where n is the number of observations in the sample, k denotes the number of parameters estimated, and SSR and TSS denote the sum of squared residuals and the total sum of squares for the dependent variable, respectively. Compared to the ordinary coefficient of determination, or unadjusted R²,

    R² = 1 − SSR/TSS,

the adjusted calculation penalizes the inclusion of additional parameters, other things equal.
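The two R² variants are easy to reproduce by hand. The sketch below uses one of the practice datasets shipped with gretl; any regression would do, and the $ess, $T and $ncoeff accessors supply SSR, n and k respectively.

    open data4-10
    ols ENROLL const CATHOL PUPIL WHITE --quiet
    scalar SSR = $ess
    scalar TSS = sst(ENROLL)
    scalar R2 = 1 - SSR/TSS
    scalar R2bar = 1 - (SSR/($T - $ncoeff)) / (TSS/($T - 1))
    printf "R-squared = %g, adjusted R-squared = %g\n", R2, R2bar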
28.2 Information criteria

A more general criterion in a similar spirit is Akaike's (1974) Information Criterion (AIC). The original formulation of this measure is

    AIC = −2ℓ(θ̂) + 2k,    (28.1)

where ℓ(θ̂) represents the maximum log-likelihood as a function of the vector of parameter estimates, θ̂, and k (as above) denotes the number of "independently adjusted parameters within the model". In this formulation, with AIC negatively related to the likelihood and positively related to the number of parameters, the researcher seeks the minimum AIC.

The AIC can be confusing, in that several variants of the calculation are in circulation. For example, Davidson and MacKinnon (2004) present a simplified version,

    AIC = ℓ(θ̂) − k,

which is just −1/2 times the original: in this case, obviously, one wants to maximize AIC.

In the case of models estimated by least squares, the log-likelihood can be written as

    ℓ(θ̂) = −(n/2)(1 + log 2π − log n) − (n/2) log SSR.    (28.2)

Substituting (28.2) into (28.1) we get

    AIC = n(1 + log 2π − log n) + n log SSR + 2k,

which can also be written as

    AIC = n log(SSR/n) + 2k + n(1 + log 2π).    (28.3)

Some authors simplify the formula for the case of models estimated via least squares. For instance, William Greene writes

    AIC_G = log(SSR/n) + 2k/n.    (28.4)

This variant can be derived from (28.3) by dividing through by n and subtracting the constant 1 + log 2π. That is, writing AIC_G for the version given by Greene, we have

    AIC_G = (1/n) AIC − (1 + log 2π).

Finally, Ramanathan gives a further variant:

    AIC_R = (SSR/n) e^{2k/n},

which is the exponential of the one given by Greene.

Gretl began by using the Ramanathan variant, but since version 1.3.1 the program has used the original Akaike formula (28.1), and more specifically (28.3) for models estimated via least squares.

Although the Akaike criterion is designed to favor parsimony, arguably it does not go far enough in that direction. For instance, if we have two nested models with k − 1 and k parameters respectively, and if the null hypothesis that parameter k equals 0 is true, in large samples the AIC will nonetheless tend to select the less parsimonious model about 16 percent of the time (see Davidson and MacKinnon, 2004, chapter 15).

An alternative to the AIC which avoids this problem is the Schwarz (1978) "Bayesian information criterion" (BIC). The BIC can be written, in line with Akaike's formulation of the AIC, as

    BIC = −2ℓ(θ̂) + k log n.

The multiplication of k by log n in the BIC means that the penalty for adding extra parameters grows with the sample size. This ensures that, asymptotically, one will not select a larger model over a correctly specified parsimonious model.

A further alternative to AIC, which again tends to select more parsimonious models than AIC, is the Hannan-Quinn criterion, or HQC (Hannan and Quinn, 1979). Written consistently with the formulations above, this is

    HQC = −2ℓ(θ̂) + 2k log log n.

The Hannan-Quinn calculation is based on the law of the iterated logarithm (note that the last term is the log of the log of the sample size). The authors argue that their procedure provides a "strongly consistent estimation procedure for the order of an autoregression", and that, compared to other strongly consistent procedures, this procedure will "underestimate the order to a lesser degree".

Gretl reports the AIC, BIC and HQC, calculated as explained above, for most sorts of models. The key point in interpreting these values is to know whether they are calculated such that smaller values are better, or such that larger values are better. In gretl, smaller values are better: one wants to minimize the chosen criterion.
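Given the formulas above, the criteria that gretl reports can be recomputed from the maximized log-likelihood. A quick sketch (reusing the regression from the previous sketch) compares the hand-rolled values with the $aic, $bic and $hqc accessors.

    open data4-10
    ols ENROLL const CATHOL PUPIL WHITE --quiet
    scalar ll = $lnl
    scalar k = $ncoeff
    scalar n = $T
    printf "AIC: %g (gretl %g)\n", -2*ll + 2*k, $aic
    printf "BIC: %g (gretl %g)\n", -2*ll + k*log(n), $bic
    printf "HQC: %g (gretl %g)\n", -2*ll + 2*k*log(log(n)), $hqc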
Chapter 29  Degrees of freedom correction

29.1 Introduction

This chapter gives a brief account of the issue of correction for degrees of freedom in the context of econometric modeling, leading up to a discussion of the policies adopted in gretl in this regard. We also explain how to supplement the results produced automatically by gretl if you want to apply such a correction where gretl does not, or vice versa. The first few sections are quite basic; experts are invited to skip to section 29.5.

29.2 Back to basics

It's well known that, given a sample {x_i} of size n from a normally distributed population, the Maximum Likelihood (ML) estimator of the population variance σ² is

    σ̂² = (1/n) Σ_{i=1}^n (x_i − x̄)²,    (29.1)

where x̄ is the sample mean, (1/n) Σ_{i=1}^n x_i. It's also well known that σ̂², while it is a consistent estimator, is biased, and it is commonly replaced by the "sample variance", namely

    s² = [1/(n−1)] Σ_{i=1}^n (x_i − x̄)².    (29.2)

The intuition behind the bias in (29.1) is straightforward. First, the quantity we seek to estimate is defined as

    σ² = E[(x_i − μ)²],

where μ = E(x). It is clear that, if μ were observable, a perfectly good estimator would be

    σ̃² = (1/n) Σ_{i=1}^n (x_i − μ)².

But this is not a practical option: μ is generally unobservable. We therefore substitute x̄ for the unknown μ. It is easily shown that x̄ is the least-squares estimator of μ, and also (assuming normality) the ML estimator. It is unbiased, but is of course subject to sampling error: in any given sample it is highly unlikely that x̄ = μ. Given that x̄ is the least-squares estimator, the sum of squared deviations of the x_i from any value other than x̄ must be greater than the summation in (29.1). But since μ is almost certainly not equal to x̄, the sum of squared deviations of the x_i from μ will surely be greater than the sum of squared deviations in (29.1). It follows that the expected value of σ̂² falls short of the population variance.

The proof that s² is indeed the unbiased estimator can be found in any good statistics textbook, where we also learn that the magnitude n − 1 in (29.2) can be brought under a general description as the "degrees of freedom" of the calculation at hand: given x̄, the n sample values provide only n − 1 items of information, since the n-th value can always be deduced via the formula for x̄.

29.3 Application to OLS regression

The argument above carries over into the usual calculation of standard errors in the context of OLS regression as applied to the linear model y = Xβ + u. If the disturbances u are assumed to be independently and identically distributed (IID), then the variance of the OLS estimator β̂ is given by

    Var(β̂) = σ² (X′X)⁻¹,

where σ² is the variance of the error term and X is an n × k matrix of regressors. But how should the unknown σ² be estimated? The ML estimator is

    σ̂² = (1/n) Σ_{i=1}^n û_i²,    (29.3)

where the û_i² are squared residuals, û_i = y_i − X_i β̂. But this estimator is biased, and we typically use the unbiased counterpart

    s² = [1/(n−k)] Σ_{i=1}^n û_i²,    (29.4)

in which n − k is the number of degrees of freedom given n residuals from a regression where k parameters are estimated.

The standard estimator of the variance of β̂ in the context of OLS is then V̂ = s²(X′X)⁻¹, and the standard errors of the individual parameter estimates, σ̂_{β̂_i}, being the square roots of the diagonal elements of V̂, inherit a degrees of freedom correction from the estimator s².
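The chain from s² to the printed standard errors can be verified directly. In the sketch below (dataset and regressors again arbitrary), the standard errors rebuilt from s²(X′X)⁻¹ should match the $stderr accessor.

    open data4-10
    list xlist = const CATHOL PUPIL WHITE
    ols ENROLL xlist --quiet
    matrix X = {xlist}
    scalar s2 = $ess / $df          # $df = n - k
    matrix V = s2 * inv(X'X)
    matrix chk = sqrt(diag(V)) ~ $stderr
    print chk                       # the two columns should coincide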
Going one step further, consider hypothesis testing in the context of OLS. Since the variance of β̂ is unknown and must itself be estimated, the sampling distribution of the OLS coefficients is not, strictly speaking, normal. But if the disturbances are normally distributed (besides being IID), then, even in small samples, the parameter estimates will follow a distribution that can be specified exactly, namely the Student t distribution with degrees of freedom equal to the value given above, ν = n − k. That is, besides using a df correction in computing the standard errors of the OLS coefficients, one uses the same ν in selecting the particular distribution to which the "t-ratio", (β̂_i − β⁰)/σ̂_{β̂_i}, should be referred in order to determine the marginal significance level or p-value for the null hypothesis that β_i = β⁰. This is the payoff to df correction: we get test statistics that follow a known distribution in small samples. In big enough samples the point is moot, since the quantitative distinction between σ̂² and s² vanishes.

So far, so good. Everyone expects df correction in plain OLS standard errors, just as we expect division by n − 1 in the sample variance. And users of econometric software expect that the p-values reported for OLS coefficients will be based on the t(ν) distribution, although they are not always sufficiently aware that the validity of such statistics in small samples depends on the assumption of normally distributed errors.

29.4 Beyond OLS

The situation is different when we move beyond estimation of the classical linear model via OLS. We may wish to estimate nonlinear models (sometimes by least squares), and many models of interest to econometricians are commonly estimated via maximization of a likelihood function, or via the generalized method of moments (GMM). In such cases, we do not in general have exact small-sample results to rely upon; in particular, we cannot assume that coefficient estimates follow the t distribution. Rather, we typically appeal to asymptotic results in statistical theory. We seek consistent estimators which, although they may be biased, nonetheless converge in probability to the corresponding parameter values as the sample size goes to infinity. Under the right conditions, laws of large numbers and central limit theorems entitle us to expect that test statistics will converge to the normal distribution, or the χ² distribution for multivariate tests, given big enough samples.

To "correct" or not?

The question arises: should we or should we not apply a df "correction" in reporting variance estimates and standard errors for models that depart from the classical linear specification?

The argument against applying df adjustment is that it lacks a theoretical basis: it does not produce test statistics that follow any known distribution in small samples. In addition, if parameter estimates are obtained via ML, it makes sense to report ML estimates of variances, even if these are biased, since it is the ML quantities that are used in computing the criterion function and in forming likelihood-ratio tests. On the other hand, pragmatic arguments for doing df adjustment are (a) that it makes for closer comparability between regular OLS estimates and nonlinear ones, and (b) that it provides a "pinch of salt" in relation to small-sample results; that is, it inflates standard errors, confidence intervals and p-values somewhat, even if it lacks rigorous justification.

Note that, even for fairly small samples, the difference between the biased and unbiased estimators in equations (29.1) and (29.2) above will be small. For example, if n = 30 then s² = (30/29)σ̂². In econometric modelling proper, however, the difference can be quite substantial. If n = 50 and k = 10, the s² defined in (29.4) will be 50/40 = 1.25 times as large as the σ̂² in (29.3), and standard errors will be about 12 percent larger. (A fairly typical situation in time-series macroeconometrics would be to have between 100 and 200 quarterly observations, and to be estimating up to maybe 30 parameters including lags; in this case, df correction would make a difference to standard errors on the order of 10 percent.) One can make a case for inflating the standard errors obtained via nonlinear estimators, as a precaution against taking results to be more precise than they really are.
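The n = 50, k = 10 case is worth doing in numbers, since the inflation reaches the standard errors via a square root:

    scalar vratio = 50/40       # s^2 relative to the ML variance: 1.25
    printf "variance ratio %g, std error ratio %g\n", vratio, sqrt(vratio)
    # sqrt(1.25) is about 1.118, i.e. standard errors about 12 percent larger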
In rejoinder to the last point, one might equally say that savvy econometricians should know to apply a discount factor (albeit an imprecise one) to small-sample estimates outside of the classical normal linear model, or even that they should distrust such results and insist on large samples before making inferences. This line of thinking suggests that test statistics such as z = β̂_i/σ̂_{β̂_i} should be referred to the distribution to which they conform asymptotically (in this case N(0,1), for H₀: β_i = 0) if and only if the conditions for appealing to asymptotic results can be considered as met. From this point of view, df adjustment may be seen as providing "a false sense of security".

29.5 Consistency and awkward cases

Consistency (in the ordinary sense of uniformity of treatment) is a bugbear when dealing with this issue. To give a simple example, suppose an econometrics program follows the policy of applying df correction for OLS estimation, but not for ML estimation. One is, of course, free to estimate the classical normal linear model via ML, in which case β̂ should be numerically identical to that obtained via OLS. But the user of the software will obtain two different sets of standard errors depending on the estimation method. Admittedly, this example is not very troublesome: presumably one would apply ML to the classical linear model only to make a pedagogical point.

Here is a more awkward case. An unrestricted vector autoregression (VAR) is a system of equations, but the ML estimate of this system, given normal errors, is equivalent to equation-by-equation OLS. Should df correction be applied to VARs? Consistency with OLS argues yes. However, a popular extension of the VAR methodology is the vector error-correction model (VECM). VECMs are closely related to VARs, and one might well be interested in making comparisons across the two, but a VECM is a nonlinear system and the cointegrating vectors that lie at the heart of this model must be estimated via Maximum Likelihood. So perhaps VAR results should not be df adjusted, for comparability with VECMs.

Another grey area is the class of Feasible Generalized Least Squares (FGLS) estimators: for example, weighted least squares following the estimation of a skedastic function, or estimators designed to handle first-order autocorrelation, such as Cochrane-Orcutt. These depart from the classical linear model, and the theoretical basis for inference in such models is asymptotic, yet according to econometric tradition standard errors are generally df adjusted.

Yet another awkward case: "robust" (heteroskedasticity- and/or autocorrelation-consistent) standard errors in the context of OLS. Such estimators are justified by asymptotic arguments, and in general we cannot determine their small-sample distributions. That would argue for referring the associated test statistics to the normal distribution. But comparability with classical standard errors pulls in the other direction. Suppose in a particular case a robust estimator produces a standard error that is numerically indistinguishable from the classical one: if the former is referred to the normal distribution and the latter to the t distribution, switching to robust standard errors will give a smaller p-value for the coefficient in question, making it appear "more significant", and arguably this is misleading.
29.6 What gretl does

First of all, the third column in gretl model output (following "coefficient" and "std. error") is labeled either "t-ratio" or "z". This is your signal: "t-ratio" indicates that the estimated standard error employs a degrees of freedom adjustment and the reported p-value is obtained from the Student t distribution, while "z" indicates that no such adjustment is applied and the p-value comes from the standard normal distribution.

If you see that gretl is applying a df adjustment but you don't want this, the first point to check is whether you can switch to the asymptotic variant by using an option flag or other command. The ols and tsls commands support a --no-df-corr option to suppress degrees of freedom adjustment. In the case of Two-Stage Least Squares, it's certainly arguable that df correction should not be performed by default; however, gretl does this largely for comparability with other software, for example Stata's ivreg command. But you can override the default if you wish. The estimate command for systems of equations also supports the --no-df-corr option when the specified estimation method is OLS or TSLS. For other estimators supported by gretl's system command, no df adjustment is applied by default.

By default gretl uses the t distribution for statistics based on robust standard errors under OLS. However, users can specify that p-values be calculated using the standard normal distribution whenever the --robust option is passed to an estimation command, by means of the following set command:

    set robust_z on

If these possibilities do not apply, it is fairly straightforward to purge regression results of df correction, as illustrated in the following script fragment. We assume that a model has just been estimated, so that the model-related accessors ($stderr, $coeff and so on) are available:

    matrix se = $stderr * sqrt($df/$T)
    matrix zscore = $coeff ./ se
    matrix pv = 2 * pvalue(z, abs(zscore))
    matrix M = $coeff ~ se ~ zscore ~ pv
    cnameset(M, "coeff stderr z p-value")
    print M

This will print the original coefficient estimates along with asymptotic standard errors, and the associated z-scores and two-sided normal p-values. The converse case is left as an exercise for the reader.

VARs

As mentioned above, Vector Autoregressions constitute a particularly awkward case, with considerations of consistency of treatment pulling in two opposite directions. For that reason gretl has adopted an "agnostic" policy in relation to such systems. We do not offer a $vcv accessor, but instead accessors named $xtxinv (the matrix (X′X)⁻¹ for the system as a whole) and $sigma (an estimate of the cross-equation variance-covariance matrix, Σ). It's then up to the user to build an estimate of the variance matrix of the parameter estimates (call it V), should that be required. Note that $sigma gives the Maximum Likelihood Estimator, without a degrees of freedom adjustment, so if you do

    matrix Vml = $sigma ** $xtxinv

(where ** represents Kronecker product) you obtain the MLE of the variance matrix of the parameter estimates. But if you want the unbiased estimator, you can do

    matrix S = $sigma * $T/($T - $ncoeff)
    matrix Vu = S ** $xtxinv

to employ a suitably inflated variant of the Σ estimate. (For VARs, and also VECMs, $ncoeff gives the number of coefficients per equation.) The second variant above is such that the vector of standard errors produced by

    matrix SE = sqrt(diag(Vu))

agrees with the standard errors printed as part of the per-equation VAR output.

A fuller example of usage of the $xtxinv accessor is given in Listing 29.1: this shows how one can replicate the F-tests for Granger causality that are displayed by default by the var command, with the refinement that, depending on the setting of the USE_F flag, these tests can be done using a small-sample correction (as in gretl's output) or in asymptotic χ² form.
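Pulling the fragments above together, a minimal runnable sketch (using the same dataset as Listing 29.1) is:

    open denmark.gdt
    var 2 LRM LRY IBO IDE --quiet
    matrix Vml = $sigma ** $xtxinv          # ML variance matrix
    matrix S = $sigma * $T/($T - $ncoeff)   # df-adjusted Sigma
    matrix Vu = S ** $xtxinv                # unbiased variant
    matrix SE = sqrt(diag(Vu))              # matches the printed VAR output
    print SE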
Vector Error Correction Models are more complex than VARs in this respect, since we employ Johansen's variance estimator for the β terms. This means, for example, that the $xtxinv accessor treats each estimated error-correction (EC) term as one regressor on its own, such that the sampling uncertainty of the loading coefficients is thereby addressed, after Kronecker-multiplying with $sigma as before. The internals of the EC terms are of course made up of the integrated levels variables, and the special $jvbeta accessor is responsible for the variance of the cointegration coefficients, where degrees-of-freedom corrections are not available.

But as soon as the loading coefficients attached to the EC terms are restricted, there is no common set of regressors with freely varying coefficients in the VECM system anymore, and therefore in these cases the formulas above are misleading. The $xtxinv accessor can still be retrieved, because it does not involve the coefficients, but in the restricted-α case it should no longer be used as shown above. The notion of system degrees of freedom then also becomes fuzzier, since the number of regressors can vary across equations.

Listing 29.1: Computing statistics to test for Granger causality

    open denmark.gdt
    list LST = LRM LRY IBO IDE
    scalar p = 2      # lags in VAR
    scalar USE_F = 1  # small sample correction?

    var p LST --quiet
    k = nelem(LST)
    matrix theta = vec($coeff)
    matrix V = $sigma ** $xtxinv
    if USE_F
        scalar df = $T - $ncoeff
        V = V * $T/df
    endif

    matrix GC = zeros(k, k)
    cnameset(GC, LST)
    rnameset(GC, LST)
    matrix idx = seq(1, p) + 1
    loop i = 1..k
        loop j = 1..k
            GC[i,j] = qform(theta[idx]', invpd(V[idx,idx]))
            idx += (j==k) ? p+1 : p
        endloop
    endloop

    if USE_F
        GC = GC / p
        matrix pvals = pvalue(F, p, df, GC)
    else
        matrix pvals = pvalue(X, p, GC)
    endif
    cnameset(pvals, LST)
    rnameset(pvals, LST)
    print GC pvals

Chapter 30  Time series filters

In addition to the usual application of lags and differences, gretl provides fractional differencing and various filters commonly used in macroeconomics for trend-cycle decomposition: notably the Hodrick-Prescott filter (Hodrick and Prescott, 1997), the Baxter-King band-pass filter (Baxter and King, 1999) and the Butterworth filter (Butterworth, 1930).

30.1 Fractional differencing

The concept of differencing a time series d times is pretty obvious when d is an integer; it may seem odd when d is fractional. However, this idea has a well-defined mathematical content: consider the function

    f(z) = (1 − z)^d,

where z and d are real numbers. By taking a Taylor series expansion around z = 0, we see that

    f(z) = 1 − dz + [d(d−1)/2] z² − ⋯

or, more compactly,

    f(z) = 1 + Σ_{i=1}^∞ ψ_i z^i,

with

    ψ_k = (−1)^k [Π_{i=1}^k (d − i + 1)] / k! = −ψ_{k−1} (d − k + 1)/k.

The same expansion can be used with the lag operator, so that if we defined

    Y_t = (1 − L)^{0.5} X_t,

this could be considered shorthand for

    Y_t = X_t − 0.5 X_{t−1} − 0.125 X_{t−2} − 0.0625 X_{t−3} − ⋯

In gretl this transformation can be accomplished by the syntax

    Y = fracdiff(X, 0.5)
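The coefficients in the expansion above are easy to check numerically: a truncated version of the sum should track the output of fracdiff closely. A small sketch, with arbitrary artificial data:

    nulldata 200
    set seed 1
    series X = normal()
    series Y = fracdiff(X, 0.5)
    # four-term truncation of the expansion
    series Y4 = X - 0.5*X(-1) - 0.125*X(-2) - 0.0625*X(-3)
    printf "correlation of fracdiff and truncation: %g\n", corr(Y, Y4)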
30.2 The Hodrick-Prescott filter

This filter is accessed using the hpfilter() function, which takes as its first argument the name of the variable to be processed. (Further optional arguments are explained below.)

A time series y_t may be decomposed into a trend (or growth) component g_t and a cyclical component c_t:

    y_t = g_t + c_t,  t = 1, 2, ..., T

The Hodrick-Prescott filter effects such a decomposition by minimizing the following:

    Σ_{t=1}^T (y_t - g_t)² + λ Σ_{t=2}^{T-1} [(g_{t+1} - g_t) - (g_t - g_{t-1})]²

The first term above is the sum of squared cyclical components c_t = y_t - g_t. The second term is a multiple λ of the sum of squares of the trend component's second differences. This second term penalizes variations in the growth rate of the trend component: the larger the value of λ, the higher is the penalty and hence the smoother the trend series.

Note that the hpfilter function in gretl produces the cyclical component, c_t, of the original series. If you want the smoothed trend you can subtract the cycle from the original:

    ct = hpfilter(yt)
    gt = yt - ct

Hodrick and Prescott (1997) suggest that a value of λ = 1600 is reasonable for quarterly data. The default value in gretl is 100 times the square of the data frequency (which, of course, yields 1600 for quarterly data). The value can be adjusted using an optional second argument to hpfilter(), as in

    ct = hpfilter(yt, 1300)

As of version 2018a, the hpfilter function accepts a third, optional Boolean argument. If set to non-zero, what is performed is the so-called one-sided version of the filter. See Section 36.12 for further details.

30.3 The Baxter and King filter

This filter is accessed using the bkfilt() function, which again takes the name of the variable to be processed as its first argument. The operation of the filter can be controlled via three further optional arguments.

Consider the spectral representation of a time series y_t:

    y_t = ∫_{-π}^{π} e^{iωt} dZ(ω)

To extract the component of y_t that lies between the frequencies ω_1 and ω_2, one could apply a band-pass filter:

    c_t = ∫_{-π}^{π} F*(ω) e^{iωt} dZ(ω)

where F*(ω) = 1 for ω_1 < |ω| < ω_2 and 0 elsewhere. This would imply, in the time domain, applying to the series a filter with an infinite number of coefficients, which is undesirable. The Baxter and King band-pass filter applies to y_t a finite polynomial in the lag operator, A(L):

    c_t = A(L) y_t

where A(L) is defined as

    A(L) = Σ_{i=-k}^{k} a_i L^i

The coefficients a_i are chosen such that F(ω) = A(e^{iω}) A(e^{-iω}) is the best approximation to F*(ω) for a given k. Clearly, the higher k, the better the approximation, but since 2k observations have to be discarded a compromise is usually sought. Moreover, the filter has other appealing theoretical properties, among which the property that A(1) = 0, so a series with a single unit root is made stationary by application of the filter.

In practice, the filter is normally used with monthly or quarterly data to extract the "business cycle" component, namely the component between 6 and 36 quarters. Usual choices for k are 8 or 12 (maybe higher for monthly series). The default values for the frequency bounds are 8 and 32, and the default value for the approximation order, k, is 8. You can adjust these values using the full form of the function, which is

    bkfilt(seriesname, f1, f2, k)

where f1 and f2 represent the lower and upper frequency bounds respectively.
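The following lines sketch typical usage of the two filters just described; a quarterly series y is assumed, and the parameter values merely make the defaults explicit.

    series hp_cycle = hpfilter(y)           # cycle; lambda defaults to 1600 here
    series hp_trend = y - hp_cycle          # implied smooth trend
    series hp_1side = hpfilter(y, 1600, 1)  # one-sided variant
    series bk_cycle = bkfilt(y, 8, 32, 8)   # Baxter-King, default bounds and k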
30.4 The Butterworth filter

The Butterworth filter (Butterworth, 1930) is an approximation to an "ideal" square-wave filter. The ideal filter divides the spectrum of a time series into a pass-band (frequencies less than some chosen ω* for a low-pass filter, or frequencies greater than ω* for high-pass) and a stop-band; the gain is 1 for the pass-band and 0 for the stop-band. The ideal filter is unattainable in practice, since it would require an infinite number of coefficients, but the Butterworth filter offers a remarkably good approximation. This filter is derived and persuasively advocated by Pollock (2000).

For data y, the filtered sequence x is given by

    x = y - λΣQ(M + λQ'ΣQ)^{-1} Q'y                    (30.1)

where Σ and M are, respectively, matrices of the forms 2I - (L + L') and 2I + (L + L'), of the orders required for conformability; I denotes the identity matrix, L the finite-sample matrix version of the lag operator (ones on the first subdiagonal, zeros elsewhere), and Q is defined such that pre-multiplication of a T-vector of data by Q', of order (T-2) × T, produces the second differences of the data. The matrix product Q'ΣQ is a Toeplitz matrix.

The behavior of the Butterworth filter is governed by two parameters: the frequency cutoff ω* and an integer order, n, which determines the number of coefficients used. The λ that appears in (30.1) is tan(ω*/2)^{-2n}. Higher values of n produce a better approximation to the ideal filter in principle (i.e. a sharper cut between the pass-band and the stop-band), but there is a downside: with a greater number of coefficients, numerical instability may be an issue, and the influence of the initial values in the sample may be exaggerated.

In gretl the Butterworth filter is implemented by the bwfilt() function,¹ which takes three arguments: the series to filter, the order n and the frequency cutoff, ω*, expressed in degrees. The cutoff value must be greater than 0 and less than 180. This function operates as a low-pass filter; for the high-pass variant, subtract the filtered series from the original, as in

    series bwcycle = y - bwfilt(y, 8, 67)

Pollock recommends that the parameters of the Butterworth filter be tuned to the data: one should examine the periodogram of the series in question (possibly after removal of a polynomial trend) in search of a "dead spot" of low power between the frequencies one wishes to exclude and the frequencies one wishes to retain. If ω* is placed in such a dead spot, then the job of separation can be done with a relatively small n, hence avoiding numerical problems. By way of illustration, consider the periodogram for quarterly observations on new car sales in the US,² 1975:1 to 1990:4 (the upper panel in Figure 30.1).

[Figure 30.1: The Butterworth filter applied. Upper panel: periodogram of QNC, with the frequency axis in degrees and the implied periods shown. Lower panels: the original and smoothed QNC series, 1976-1990, and the gain of the chosen filter over frequencies 0 to π.]

¹ The code for this filter is based on D. S. G. Pollock's programs IDEOLOG and DETREND. The Pascal source code for the former is available from http://www.le.ac.uk/users/dsgp1 and the C sources for the latter were kindly made available to us by the author.
² This is the variable QNC from the Ramanathan data file data9-7.

A seasonal pattern is clearly visible in the periodogram, centered at an angle of 90°, or 4 periods. If we set ω* = 68° (or thereabouts) we should be able to excise the seasonality quite cleanly using n = 8. The result is shown in the lower panel of the Figure, along with the frequency response or gain plot for the chosen filter. Note the smooth and reasonably steep drop-off in gain centered on the nominal cutoff of 68° ≈ 3π/8.

The apparatus that supports this sort of analysis in the gretl GUI can be found under the Variable menu in the main window: the items Periodogram and Filter. In the periodogram dialog box you have the option of expressing the frequency axis in degrees, which is helpful when selecting a Butterworth filter; and in the Butterworth filter dialog you have the option of plotting the frequency response as well as the smoothed series and/or the residual or cycle.

30.5 The discrete Fourier transform

The Fourier transform is not itself a time-series filter, but by providing the bridge between the time and the frequency domains it is a fundamental building block of many filters' internals, and deserves some detailed comments.

The discrete Fourier transform can be best thought of as a linear, invertible transform of a complex vector. Hence, if x is an n-dimensional vector whose k-th element is x_k = a_k + i b_k, then the output of the discrete Fourier transform is a vector f = F(x) whose k-th element is

    f_k = Σ_{j=0}^{n-1} e^{-i ω(j,k)} x_j

where ω(j,k) = 2π jk/n. Since the transformation is invertible, the vector x can be recovered from f via the so-called inverse transform
    x_k = (1/n) Σ_{j=0}^{n-1} e^{i ω(j,k)} f_j

The Fourier transform is used in many diverse situations on account of this key property: the convolution of two vectors can be performed efficiently by multiplying the elements of their Fourier transforms and inverting the result. If

    z_k = Σ_{j=1}^{n} x_j y_{k-j}

then F(z) = F(x) ⊙ F(y); that is, F(z)_k = F(x)_k F(y)_k.

For computing the Fourier transform, gretl uses the external library fftw3 (see Frigo and Johnson, 2005). This guarantees extreme speed and accuracy. In fact, the CPU time needed to perform the transform is O(n log n) for any n. This is why the array of numerical techniques employed in fftw3 is commonly known as the Fast Fourier Transform.

Gretl provides two matrix functions for performing the Fourier transform and its inverse: fft2 and ffti.³ For example:

    matrix x1 = {1; 2; 3}
    # perform the transform
    matrix f = fft2(x1)
    # perform the inverse transform
    matrix x2 = ffti(f)

yields

    x1 = (1, 2, 3)'
    f  = (6, -1.5 + 0.866i, -1.5 - 0.866i)'
    x2 = (1, 2, 3)'

³ The same functionality is available via the legacy function fft, which predates gretl's native support of complex matrices. It is more limited than fft2, as the input is understood to be real; it returns the real and imaginary parts of the result in separate columns. The fft function is kept for backward compatibility, but for new scripts it is recommended to use the newer function fft2 instead. The inverse function ffti supports both representations.

Should it be necessary to compute the Fourier transform on several vectors with the same number of elements, it is numerically more efficient to group them into a matrix rather than invoking fft2 for each vector separately. As an example, consider the multiplication of two polynomials:

    a(x) = 1 + 0.5x
    b(x) = 1 + 0.3x + 0.8x²
    c(x) = a(x) · b(x) = 1 + 0.8x + 0.95x² + 0.4x³

The coefficients of the polynomial c(x) are the convolution of the coefficients of a(x) and b(x); the following gretl code fragment illustrates how to compute the coefficients of c(x):

    # define the two polynomials
    a = {1, 0.5, 0, 0}'
    b = {1, 0.3, 0.8, 0}'
    # perform the transforms
    fa = fft2(a)
    fb = fft2(b)
    # multiply the two transforms element by element
    fc = fa .* fb
    # compute the coefficients of c via the inverse transform
    c = ffti(fc)

Maximum efficiency would have been achieved by grouping a and b into a matrix. The computational advantage is so little in this case that the exercise is a bit silly, but the following alternative may be preferable for a large number of rows/columns:

    # define the two polynomials
    a = {1, 0.5, 0, 0}'
    b = {1, 0.3, 0.8, 0}'
    # perform the transforms jointly
    f = fft2(a ~ b)
    # complex-multiply the two transforms
    fc = f[,1] .* f[,2]
    # compute the coefficients of c via the inverse transform
    c = ffti(fc)

Traditionally, the Fourier transform in econometrics has been mostly used in time-series analysis, the periodogram being the best known example. Listing 30.1 shows how to compute the periodogram of a time series via the fft2 function.

Listing 30.1: Periodogram via the Fourier transform

    set verbose off
    nulldata 50
    setobs 1 1 --special-time-series  # establish a time dimension so lags work
    # generate an AR(1) process
    series e = normal()
    series x = 0
    x = 0.9*x(-1) + e
    # compute the periodogram
    F = fft2({x})   # note that the series is turned into a matrix on the fly
    S = abs(F).^2
    S = S[2:$nobs/2+1] ./ (2*$pi*$nobs)
    sfreq = seq(1, $nobs/2)'
    omega = sfreq .* (2*$pi/$nobs)
    period = $nobs ./ sfreq
    omega = omega ~ sfreq ~ period ~ S
    # compare the built-in command
    pergm x
    print omega
Chapter 31: Univariate time series models

31.1 Introduction

Time series models are discussed in this chapter and the next two. Here we concentrate on ARIMA models, unit root tests and GARCH. The following chapter deals with VARs, and chapter 33 with cointegration and error correction.

31.2 ARIMA models

Representation and syntax

The arma command performs estimation of AutoRegressive Integrated Moving Average (ARIMA) models. These are models that can be written in the form

    φ(L) y_t = θ(L) ε_t                                (31.1)

where φ(L) and θ(L) are polynomials in the lag operator, L, defined such that L^n x_t = x_{t-n}, and ε_t is a white noise process. The exact content of y_t, of the AR polynomial φ(), and of the MA polynomial θ(), will be explained in the following.

Mean terms

The process y_t as written in equation (31.1) has, without further qualifications, mean zero. If the model is to be applied to real data, it is necessary to include some term to handle the possibility that y_t has non-zero mean. There are two possible ways to represent processes with non-zero mean: one is to define μ_t as the unconditional mean of y_t, namely the central value of its marginal distribution. Therefore, the series y_t - μ_t has mean 0, and the model (31.1) applies to that centered series. In practice, assuming that μ_t is a linear function of some observable variables x_t, the model becomes

    φ(L)(y_t - x_t β) = θ(L) ε_t                       (31.2)

This is sometimes known as a "regression model with ARMA errors"; its structure may be more apparent if we represent it using two equations:

    y_t = x_t β + u_t
    φ(L) u_t = θ(L) ε_t

The model just presented is also sometimes known as "ARMAX" (ARMA + eXogenous variables). It seems to us, however, that this label is more appropriately applied to a different model: another way to include a mean term in (31.1) is to base the representation on the conditional mean of y_t, that is, the central value of the distribution of y_t given its own past. Assuming, again, that this can be represented as a linear combination of some observable variables z_t, the model would expand to

    φ(L) y_t = z_t γ + θ(L) ε_t                        (31.3)

The formulation (31.3) has the advantage that γ can be immediately interpreted as the vector of marginal effects of the z_t variables on the conditional mean of y_t. And by adding lags of z_t to this specification one can estimate Transfer Function models (which generalize ARMA by adding the effects of exogenous variables distributed across time).

Gretl provides a way to estimate both forms. Models written as in (31.2) are estimated by maximum likelihood; models written as in (31.3) are estimated by conditional maximum likelihood. (For more on these options, see the section on "Estimation" below.)

In the special case when x_t = z_t = 1 (that is, the models include a constant but no exogenous variables) the two specifications discussed above reduce to

    φ(L)(y_t - μ) = θ(L) ε_t                           (31.4)

and

    φ(L) y_t = α + θ(L) ε_t                            (31.5)

respectively. These formulations are essentially equivalent, but if they represent one and the same process μ and α are, fairly obviously, not numerically identical; rather

    α = (1 - φ_1 - ... - φ_p) μ

The gretl syntax for estimating (31.4) is simply

    arma p q ; y

The AR and MA lag orders, p and q, can be given either as numbers or as pre-defined scalars. The parameter μ can be dropped if necessary by appending the option --nc ("no constant") to the command. If estimation of (31.5) is needed, the switch --conditional must be appended to the command, as in

    arma p q ; y --conditional

Generalizing this principle to the estimation of (31.2) or (31.3), you get that

    arma p q ; y const x1 x2

would estimate the following model:

    y_t - x_t β = φ_1 (y_{t-1} - x_{t-1} β) + ... + φ_p (y_{t-p} - x_{t-p} β)
                  + ε_t + θ_1 ε_{t-1} + ... + θ_q ε_{t-q}

where in this instance x_t β = β_0 + x_{t,1} β_1 + x_{t,2} β_2. Appending the --conditional switch, as in

    arma p q ; y const x1 x2 --conditional

would estimate the following model:

    y_t = x_t γ + φ_1 y_{t-1} + ... + φ_p y_{t-p} + ε_t + θ_1 ε_{t-1} + ... + θ_q ε_{t-q}
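The relation α = (1 - φ_1 - ... - φ_p)μ noted above can be checked numerically. A minimal sketch follows, assuming an open time-series dataset containing some series y; since the two estimators differ, the equality will hold only approximately.

    arma 1 1 ; y                  # exact ML: the reported constant is mu
    scalar mu = $coeff[1]
    scalar phi = $coeff[2]
    arma 1 1 ; y --conditional    # conditional ML: the reported constant is alpha
    scalar alpha = $coeff[1]
    printf "(1 - phi)*mu = %g, alpha = %g\n", (1 - phi)*mu, alpha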
Ideally, the issue broached above could be made moot by writing a more general specification that nests the alternatives, that is

    φ(L)(y_t - x_t β) = z_t γ + θ(L) ε_t               (31.6)

We would like to generalize the arma command so that the user could specify, for any estimation method, whether certain exogenous variables should be treated as x_t's or z_t's, but we're not yet at that point (and neither are most other software packages).

Seasonal models

A more flexible lag structure is desirable when analyzing time series that display strong seasonal patterns. Model (31.1) can be expanded to

    φ(L) Φ(L^s) y_t = θ(L) Θ(L^s) ε_t                  (31.7)

For such cases, a fuller form of the syntax is available, namely

    arma p q ; P Q ; y

where p and q represent the non-seasonal AR and MA orders, and P and Q the seasonal orders. For example,

    arma 1 1 ; 1 1 ; y

would be used to estimate the following model:

    (1 - φL)(1 - ΦL^s)(y_t - μ) = (1 + θL)(1 + ΘL^s) ε_t

If y_t is a quarterly series (and therefore s = 4), the above equation can be written more explicitly as

    y_t - μ = φ(y_{t-1} - μ) + Φ(y_{t-4} - μ) - (φ·Φ)(y_{t-5} - μ)
              + ε_t + θ ε_{t-1} + Θ ε_{t-4} + (θ·Θ) ε_{t-5}

Such a model is known as a "multiplicative seasonal ARMA model".

Gaps in the lag structure

The standard way to specify an ARMA model in gretl is via the AR and MA orders, p and q respectively. In this case all lags from 1 to the given order are included. In some cases one may wish to include only certain specific AR and/or MA lags. This can be done in either of two ways:

- One can construct a matrix containing the desired lags (positive integer values) and supply the name of this matrix in place of p or q.
- One can give a comma-separated list of lags, enclosed in braces, in place of p or q.

The following code illustrates these options:

    matrix pvec = {1, 4}
    arma pvec 1 ; y
    arma {1,4} 1 ; y

Both forms above specify an ARMA model in which AR lags 1 and 4 are used (but not 2 and 3). This facility is available only for the non-seasonal component of the ARMA specification.

Differencing and ARIMA

The above discussion presupposes that the time series y_t has already been subjected to all the transformations deemed necessary for ensuring stationarity (see also section 31.3). Differencing is the most common of these transformations, and gretl provides a mechanism to include this step into the arma command: the syntax

    arma p d q ; y

would estimate an ARMA(p, q) model on the d-th difference of y_t. It is functionally equivalent to

    series tmp = y
    loop i=1..d
        tmp = diff(tmp)
    endloop
    arma p q ; tmp

except with regard to forecasting after estimation (see below).

When the series y_t is differenced before performing the analysis the model is known as ARIMA ("I" for Integrated); for this reason, gretl provides the arima command as an alias for arma.

Seasonal differencing is handled similarly, with the syntax

    arma p d q ; P D Q ; y

where D is the order for seasonal differencing. Thus, the command

    arma 1 0 0 ; 1 1 1 ; y

would produce the same parameter estimates as

    series dsy = sdiff(y)
    arma 1 0 ; 1 1 ; dsy

where we use the sdiff function to create a seasonal difference (e.g. for quarterly data, y_t - y_{t-4}).

In specifying an ARIMA model with exogenous regressors, we face a choice which relates back to the discussion of the variant models (31.2) and (31.3) above. If we choose model (31.2), the "regression model with ARMA errors", how should this be extended to the case of ARIMA? The issue is whether or not the differencing that is applied to the dependent variable should also be applied to the regressors. Consider the simplest case, ARIMA with non-seasonal differencing of order 1. We may estimate either

    φ(L)(1 - L)(y_t - X_t β) = θ(L) ε_t                (31.8)

or

    φ(L)[(1 - L) y_t - X_t β] = θ(L) ε_t               (31.9)
The first of these formulations can be described as a "regression model with ARIMA errors", while the second preserves the levels of the X variables. As of gretl version 1.8.6, the default model is (31.8), in which differencing is applied to both y_t and X_t. However, when using the default estimation method (native exact ML, see below), the option --y-diff-only may be given, in which case gretl estimates (31.9).¹

¹ Prior to gretl 1.8.6, the default model was (31.9). We changed this for the sake of consistency with other software.

Estimation

The default estimation method for ARMA models is exact maximum likelihood estimation (under the assumption that the error term is normally distributed), using a variety of techniques; the main algorithm for evaluating the log-likelihood is AS 197 by Melard (1984). Maximization is performed via BFGS and the score is approximated numerically. This method produces results that are directly comparable with many other software packages. The constant, and any exogenous variables, are treated as in equation (31.2). The covariance matrix for the parameters is computed using a numerical approximation to the Hessian at convergence.

The alternative method, invoked with the --conditional switch, is conditional maximum likelihood (CML), also known as "conditional sum of squares" (see Hamilton, 1994, p. 132). This method was exemplified in script 13.3, and only a brief description will be given here. Given a sample of size T, the CML method minimizes the sum of squared one-step-ahead prediction errors generated by the model for the observations t_0, ..., T. The starting point t_0 depends on the orders of the AR polynomials in the model. The numerical maximization method used is BHHH, and the covariance matrix is computed using a Gauss-Newton regression.

The CML method is nearly equivalent to maximum likelihood under the hypothesis of normality; the difference is that the first t_0 - 1 observations are considered fixed and only enter the likelihood function as conditioning variables. As a consequence, the two methods are asymptotically equivalent under standard conditions, except for the fact, discussed above, that our CML implementation treats the constant and exogenous variables as per equation (31.3).

The two methods can be compared as in the following example:

    open data10-1
    arma 1 1 ; r
    arma 1 1 ; r --conditional

which produces the estimates shown in Table 31.1. As you can see, the estimates of φ and θ are quite similar. The reported constants differ widely, as expected; see the discussion following equations (31.4) and (31.5). However, dividing the CML constant by 1 - φ we get 7.38, which is not far from the ML estimate of 6.93.

Table 31.1: ML and CML estimates (standard errors in parentheses)

    Parameter   ML                      CML
    μ           6.93042  (0.923882)     1.07322  (0.488661)
    φ           0.855360 (0.0511842)    0.852772 (0.0450252)
    θ           0.588056 (0.0986096)    0.591838 (0.0456662)

Convergence and initialization

The numerical methods used to maximize the likelihood for ARMA models are not guaranteed to converge. Whether or not convergence is achieved, and whether or not the true maximum of the likelihood function is attained, may depend on the starting values for the parameters. Gretl employs one of the following two initialization mechanisms, depending on the specification of the model and the estimation method chosen:

1. Estimate a pure AR model by Least Squares (non-linear least squares if the model requires it, otherwise OLS). Set the AR parameter values based on this regression and set the MA parameters to a small positive value (0.0001).

2. The Hannan-Rissanen method: first, estimate an autoregressive model by OLS and save the residuals. Then, in a second OLS pass, add appropriate lags of the first-round residuals to the model, to obtain estimates of the MA parameters.
To see the details of the ARMA estimation procedure, add the --verbose option to the command. This prints a notice of the initialization method used, as well as the parameter values and log-likelihood at each iteration.

Besides the built-in initialization mechanisms, the user has the option of specifying a set of starting values manually. This is done via the set command: the first argument should be the keyword initvals, and the second should be the name of a pre-specified matrix containing starting values. For example:

    matrix start = {0, 0.85, 0.34}
    set initvals start
    arma 1 1 ; y

The specified matrix should have just as many parameters as the model: in the example above there are three parameters, since the model implicitly includes a constant. The constant, if present, is always given first; otherwise the order in which the parameters are expected is the same as the order of specification in the arma or arima command. In the example the constant is set to zero, φ_1 to 0.85, and θ_1 to 0.34.

You can get gretl to revert to automatic initialization via the command

    set initvals auto

Two variants of the BFGS algorithm are available in gretl. In general we recommend the default variant, which is based on an implementation by Nash (1990), but for some problems the alternative, limited-memory version (L-BFGS-B, see Byrd et al., 1995) may increase the chances of convergence on the ML solution. This can be selected via the --lbfgs option to the arma command.

Estimation via X-12-ARIMA

As an alternative to estimating ARMA models using "native" code, gretl offers the option of using the external program X-12-ARIMA. This is the seasonal adjustment software produced and maintained by the U.S. Census Bureau; it is used for all official seasonal adjustments at the Bureau. The current version, X-13, can also be used, working as a drop-in replacement.

Gretl includes a module which interfaces with X-12-ARIMA: it translates arma commands using the syntax outlined above into a form recognized by X-12-ARIMA, executes the program and retrieves the results for viewing and further analysis within gretl. To use this facility you have to install X-12-ARIMA separately. Packages for both MS Windows and GNU/Linux are available from the gretl website, http://gretl.sourceforge.net/.

To invoke X-12-ARIMA as the estimation engine, append the flag --x-12-arima, as in

    arma p q ; y --x-12-arima

As with native estimation, the default is to use exact ML, but there is the option of using conditional ML with the --conditional flag. However, please note that when X-12-ARIMA is used in conditional ML mode, the comments above regarding the variant treatments of the mean of the process y_t do not apply. That is, when you use X-12-ARIMA the model that is estimated is (31.2), regardless of whether estimation is by exact ML or conditional ML. In addition, the treatment of exogenous regressors in the context of ARIMA differencing is always that shown in equation (31.8).

Forecasting

ARMA models are often used for forecasting purposes. The autoregressive component, in particular, offers the possibility of forecasting a process "out of sample" over a substantial time horizon.

Gretl supports forecasting on the basis of ARMA models using the method set out by Box and Jenkins (1976).² The Box and Jenkins algorithm produces a set of "integrated" AR coefficients which take into account any differencing of the dependent variable (seasonal and/or non-seasonal) in the ARIMA context, thus making it possible to generate a forecast for the level of the original variable.

² See in particular their "Program 4" on p. 505ff.
By contrast, if you first difference a series manually and then apply ARMA to the differenced series, forecasts will be for the differenced series, not the level. This point is illustrated in Listing 31.1. The parameter estimates are identical for the two models. The forecasts differ, but are mutually consistent: the variable fcdiff emulates the ARMA forecast (static, one step ahead, within the sample range, and dynamic out of sample).

Listing 31.1: ARIMA forecasting

    open greene18_2.gdt
    # log of quarterly US nominal GNP, 1950:1 to 1983:4
    series y = log(Y)
    # and its first difference
    series dy = diff(y)
    # reserve 2 years for out-of-sample forecast
    smpl ; 1981:4
    # Estimate using ARIMA
    arima 1 1 1 ; y
    # forecast over full period
    smpl --full
    fcast fc1
    # Return to subsample and run ARMA on the first difference of y
    smpl ; 1981:4
    arma 1 1 ; dy
    smpl --full
    fcast fc2
    series fcdiff = (t<=1982:1)? (fc1 - y(-1)) : (fc1 - fc1(-1))
    # compare the forecasts over the later period
    smpl 1981:1 1983:4
    print y fc1 fc2 fcdiff --byobs

The output from the last command is:

              y         fc1        fc2      fcdiff
    1981:1  7.964086   7.940930   0.02668   0.02668
    1981:2  7.978654   7.997576   0.03349   0.03349
    1981:3  8.009463   7.997503   0.01885   0.01885
    1981:4  8.015625   8.033695   0.02423   0.02423
    1982:1  8.014997   8.029698   0.01407   0.01407
    1982:2  8.026562   8.046037   0.01634   0.01634
    1982:3  8.032717   8.063636   0.01760   0.01760
    1982:4  8.042249   8.081935   0.01830   0.01830
    1983:1  8.062685   8.100623   0.01869   0.01869
    1983:2  8.091627   8.119528   0.01891   0.01891
    1983:3  8.115700   8.138554   0.01903   0.01903
    1983:4  8.140811   8.157646   0.01909   0.01909

Lag selection

A variant of the arma and arima commands is available as an aid to specification. If you give the --lagselect option, the lag orders p and q (as well as P and Q, if applicable) are taken as maxima, and the usual output is replaced by a table showing information criteria and log-likelihood for a range of specifications, from zero lags to the maxima.
yₜ is already stationary and no differencing is required One peculiar aspect of this test is that its limit distribution is nonstandard under the null hypothesis moreover the shape of the distribution and consequently the critical values for the test depends on the form of the μₜ term A full analysis of the various cases is inappropriate here Hamilton 1994 contains an excellent discussion but any recent time series textbook covers this topic Suffice it to say that gretl allows the user to choose the specification for μₜ among four different alternatives μₜ command option 0 nc μ₀ c μ₀ μ₁t ct μ₀ μ₁t μ₁t² ctt These option flags are not mutually exclusive when they are used together the statistic will be reported separately for each selected case By default gretl uses the combination c ct For each case approximate pvalues are calculated by means of the algorithm developed in MacKinnon 1996 The gretl command used to perform the test is adf for example adf 4 x1 Chapter 31 Univariate time series models 301 Listing 312 ARMA lag selection Download open sunspotsgdt ARMA lag selection with maxima of 4 for p and q arma 4 4 sunspots lagselect determine the best row per BIC column 4 bestrow iminctest4 extract this row spec testbestrow12 extract p and q as scalars scalar p spec1 scalar q spec2 and estimate the best specification arma p q sunspots Part of the lagselection table Estimated using AS 197 exact ML Dependent variable sunspots T 322 Criteria for ARMAp q specifications p q AIC BIC HQC lnL 0 0 35752367 35827858 35782505 17856183 0 1 32837333 32950569 32882540 16388666 0 2 31236726 31387708 31297002 15578363 0 3 30718351 30907078 30793697 15309175 0 4 30470500 30696973 30560916 15175250 1 0 32203385 32316621 32248593 16071692 1 1 31084048 31235030 31144325 15502024 1 2 30603363 30792090 30678709 15251681 1 3 30512713 30739187 30603129 15196357 1 4 30451230 30715449 30556715 15155615 3 0 30086022 30274750 30161368 14993011 3 1 30105262 30331735 30195677 14992631 3 2 29763054 30027273 29868539 14811527 3 3 29696493 29998457 29817046 14768246 3 4 29705017 30044727 29840640 14762509 4 0 30105497 30331970 30195912 14992748 4 1 30123267 30387485 30228751 14991633 4 2 29695073 29997037 29815626 14767536 4 3 29712552 30052262 29848175 14766276 4 4 29711378 30088833 29862070 14755689 Chapter 31 Univariate time series models 302 would compute the test statistic as the tstatistic for 𝜙 in equation 3110 with p 4 in the two cases μₜ μ₀ and μₜ μ₀ μ₁t The number of lags p in equation 3110 should be chosen as to ensure that 3110 is a parametrization flexible enough to represent adequately the shortrun persistence of Δyₜ Setting p too low results in size distortions in the test whereas setting p too high leads to low power As a convenience to the user the parameter p can be automatically determined Setting p to a negative number triggers a sequential procedure that starts with p lags and decrements p until the tstatistic for the parameter γₚ exceeds 1645 in absolute value The ADFGLS test Elliott Rothenberg and Stock 1996 proposed a variant of the ADF test which involves an alternative method of handling the parameters pertaining to the deterministic term μₜ these are estimated first via Generalized Least Squares and in a second stage an ADF regression is performed using the GLS residuals This variant offers greater power than the regular ADF test for the cases μₜ μ₀ and μₜ μ₀ μ₁t The ADFGLS test is available in gretl via the gls option to the adf command When this option is selected the nc and ctt options become 
The ADF-GLS test

Elliott, Rothenberg and Stock (1996) proposed a variant of the ADF test which involves an alternative method of handling the parameters pertaining to the deterministic term μ_t: these are estimated first via Generalized Least Squares, and in a second stage an ADF regression is performed using the GLS residuals. This variant offers greater power than the regular ADF test for the cases μ_t = μ_0 and μ_t = μ_0 + μ_1 t.

The ADF-GLS test is available in gretl via the --gls option to the adf command. When this option is selected the --nc and --ctt options become unavailable, and only one case can be selected at a time: by default the constant-only model is used, but a trend can be added using the --ct flag. When a trend is present in this test, MacKinnon-type p-values are not available; instead we show critical values from Table 1 in Elliott et al. (1996).

The KPSS test

The KPSS test (Kwiatkowski, Phillips, Schmidt and Shin, 1992) is a unit root test in which the null hypothesis is opposite to that in the ADF test: under the null, the series in question is stationary; the alternative is that the series is I(1).

The basic intuition behind this test statistic is very simple: if y_t can be written as y_t = μ + u_t, where u_t is some zero-mean stationary process, then not only does the sample average of the y_t's provide a consistent estimator of μ, but the long-run variance of u_t is a well-defined, finite number. Neither of these properties hold under the alternative.

The test itself is based on the following statistic:

    η = Σ_{t=1}^{T} S_t² / (T² σ̄²)                     (31.11)

where S_t = Σ_{s=1}^{t} e_s and σ̄² is an estimate of the long-run variance of e_t = y_t - ȳ. Under the null, this statistic has a well-defined (non-standard) asymptotic distribution, which is free of nuisance parameters and has been tabulated by simulation. Under the alternative, the statistic diverges.

As a consequence, it is possible to construct a one-sided test based on η, where H_0 is rejected if η is bigger than the appropriate critical value; gretl provides the 90, 95 and 99 percent quantiles. The critical values are computed via the method presented by Sephton (1995), which offers greater accuracy than the values tabulated in Kwiatkowski et al. (1992).

Usage example:

    kpss m y

where m is an integer representing the bandwidth or window size used in the formula for estimating the long-run variance:

    σ̄² = Σ_{i=-m}^{m} (1 - |i|/(m+1)) γ̂_i

The γ̂_i terms denote the empirical autocovariances of e_t, from order -m through m. For this estimator to be consistent, m must be large enough to accommodate the short-run persistence of e_t, but not too large compared to the sample size T. If the supplied m is non-positive, a default value is computed, namely the integer part of 4(T/100)^{1/4}.

The above concept can be generalized to the case where y_t is thought to be stationary around a deterministic trend. In this case, formula (31.11) remains unchanged, but the series e_t is defined as the residuals from an OLS regression of y_t on a constant and a linear trend. This second form of the test is obtained by appending the --trend option to the kpss command:

    kpss n y --trend

Note that in this case the asymptotic distribution of the test is different, and the critical values reported by gretl differ accordingly.
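A brief illustration (again assuming the denmark sample data; the bandwidth of 4 is an arbitrary choice, not a recommendation):

    open denmark.gdt
    kpss 4 LRM           # null: stationarity around a constant, m = 4
    kpss 0 LRM --trend   # null: trend stationarity, automatic bandwidth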
Panel unit root tests

The most commonly used unit root tests for panel data involve a generalization of the ADF procedure, in which the joint null hypothesis is that a given time series is non-stationary for all individuals in the panel.

In this context the ADF regression (31.10) can be rewritten as

    Δy_{it} = μ_{it} + φ_i y_{i,t-1} + Σ_{j=1}^{p_i} γ_{ij} Δy_{i,t-j} + ε_{it}       (31.12)

The model (31.12) allows for maximal heterogeneity across the individuals in the panel: the parameters of the deterministic term, the autoregressive coefficient φ, and the lag order p are all specific to the individual, indexed by i.

One possible modification of this model is to impose the assumption that φ_i = φ for all i; that is, the individual time series share a common autoregressive root (although they may differ in respect of other statistical properties). The choice of whether or not to impose this assumption has an important bearing on the hypotheses under test. Under model (31.12) the joint null is φ_i = 0 for all i, meaning that all the individual time series are non-stationary, and the alternative (simply the negation of the null) is that at least one individual time series is stationary. When a common φ is assumed, the null is that φ = 0 and the alternative is that φ < 0. The null still says that all the individual series are non-stationary, but the alternative now says that they are all stationary. The choice of model should take this point into account, as well as the gain in power from forming a pooled estimate of φ and, of course, the plausibility of assuming a common AR(1) coefficient.³

³ If the assumption of a common φ seems excessively restrictive, bear in mind that we routinely assume common slope coefficients when estimating panel models, even if this is unlikely to be literally true.

In gretl, the formulation (31.12) is used automatically when the adf command is used on panel data. The joint test statistic is formed using the method of Im, Pesaran and Shin (2003). In this context the behavior of adf differs from regular time-series data: only one case of the deterministic term is handled per invocation of the command; the default is that μ_{it} includes just a constant, but the --nc and --ct flags can be used to suppress the constant or to include a trend, respectively; and the quadratic trend option --ctt is not available.

The alternative that imposes a common value of φ is implemented via the levinlin command. The test statistic is computed as per Levin, Lin and Chu (2002). As with the adf command, the first argument is the lag order and the second is the name of the series to test; and the default case for the deterministic component is a constant only. The options --nc and --ct have the same effect as with adf. One refinement is that the lag order may be given in either of two forms: if a scalar is given, this is taken to represent a common value of p for all individuals, but you may instead provide a vector holding a set of p_i values, hence allowing the order of autocorrelation of the series to differ by individual. So, for example, given

    levinlin 2 y
    levinlin {2,2,3,3,4,4} y

the first command runs a joint ADF test with a common lag order of 2, while the second (which assumes a panel with six individuals) allows for differing short-run dynamics. The first argument to levinlin can be given as a set of comma-separated integers enclosed in braces, as shown above, or as the name of an appropriately dimensioned pre-defined matrix (see chapter 17).

Besides variants of the ADF test, the KPSS test also can be used with panel data, via the kpss command. In this case the test of the null hypothesis that the given time series is stationary for all individuals is implemented using the method of Choi (2001). This is an application of meta-analysis, the statistical technique whereby an overall or composite p-value for the test of a given null hypothesis can be computed from the p-values of a set of separate tests. Unfortunately, in the case of the KPSS test we are limited by the unavailability of precise p-values, although if an individual test statistic falls between the 10 percent and 1 percent critical values we are able to interpolate with a fair degree of confidence. This gives rise to four cases:

1. All the individual KPSS test statistics fall between the 10 percent and 1 percent critical values: the Choi method gives us a plausible composite p-value.

2. Some of the KPSS test statistics exceed the 1 percent value and none fall short of the 10 percent value: we can give an upper bound for the composite p-value by setting the unknown p-values to 0.01.
3. Some of the KPSS test statistics fall short of the 10 percent critical value, but none exceed the 1 percent value: we can give a lower bound to the composite p-value by setting the unknown p-values to 0.10.

4. None of the above conditions are satisfied: the Choi method fails to produce any result for the composite KPSS test.

31.4 Cointegration test

The generally recommended test for cointegration is the Johansen test, which is discussed in detail in chapter 33. In this context we just offer a few remarks on the cointegration test of Engle and Granger (1987), because it builds on the univariate ADF test discussed above (section 31.3).

For the Engle-Granger test, the procedure is:

1. Test each series for a unit root using an ADF test.

2. Run a cointegrating regression via OLS. For this we select one of the potentially cointegrated variables as dependent, and include the other potentially cointegrated variables as regressors.

3. Perform an ADF test on the residuals from the cointegrating regression.

The idea is that cointegration is supported if (a) the null of non-stationarity is not rejected for each of the series individually, in step 1, while (b) the null is rejected for the residuals at step 3. That is, each of the individual series is I(1) but some linear combination of the series is I(0).

This test is implemented in gretl by the coint command, which requires an integer lag order (for the ADF tests) followed by a list of variables to be tested, the first of which will be taken as dependent in the cointegrating regression. Please see the online help for coint, or the Gretl Command Reference, for further details.
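A minimal example of the command (assuming the denmark sample data; the lag order of 2 is arbitrary):

    open denmark.gdt
    # Engle-Granger: ADF tests on each series, then on the residuals from
    # a cointegrating regression with LRM as the dependent variable
    coint 2 LRM LRY IBO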
31.5 ARCH and GARCH

Heteroskedasticity means a non-constant variance of the error term in a regression model. Autoregressive Conditional Heteroskedasticity (ARCH) is a phenomenon specific to time series models, whereby the variance of the error displays autoregressive behavior; for instance, the time series exhibits successive periods where the error variance is relatively large, and successive periods where it is relatively small. This sort of behavior is reckoned to be common in asset markets: an unsettling piece of news can lead to a period of increased volatility in the market.

An ARCH error process of order q can be represented as

    u_t = σ_t ε_t
    σ_t² = E(u_t² | Ω_{t-1}) = α_0 + Σ_{i=1}^{q} α_i u_{t-i}²

where the ε_t's are independently and identically distributed (iid) with mean zero and variance 1, and where σ_t is taken to be the positive square root of σ_t². Ω_{t-1} denotes the information set as of time t-1, and σ_t² is the conditional variance: that is, the variance conditional on information dated t-1 and earlier.

It is important to notice the difference between ARCH and an ordinary autoregressive error process. The simplest (first-order) case of the latter can be written as

    u_t = ρ u_{t-1} + ε_t,  -1 < ρ < 1

where the ε_t's are independently and identically distributed with mean zero and variance σ². With an AR(1) error, if ρ is positive then a positive value of u_t will tend to be followed by a positive u_{t+1}. With an ARCH error process, a disturbance u_t of large absolute value will tend to be followed by further large absolute values, but with no presumption that the successive values will be of the same sign. ARCH in asset prices is a "stylized fact" and is consistent with market efficiency; on the other hand, autoregressive behavior of asset prices would violate market efficiency.

One can test for ARCH of order q in the following way:

1. Estimate the model of interest via OLS and save the squared residuals, û_t².

2. Perform an auxiliary regression in which the current squared residual is regressed on a constant and q lags of itself.

3. Find the TR² value (sample size times unadjusted R²) for the auxiliary regression.

4. Refer the TR² value to the χ² distribution with q degrees of freedom, and if the p-value is "small enough" reject the null hypothesis of homoskedasticity in favor of the alternative of ARCH(q).

This test is implemented in gretl via the modtest command with the --arch option, which must follow estimation of a time-series model by OLS (either a single-equation model or a VAR). For example:

    ols y 0 x
    modtest 4 --arch

This example specifies an ARCH order of q = 4; if the order argument is omitted, q is set equal to the periodicity of the data. In the graphical interface, the ARCH test is accessible from the Tests menu in the model window (again, for single-equation OLS or VARs).

GARCH

The simple ARCH(q) process is useful for introducing the general concept of conditional heteroskedasticity in time series, but it has been found to be insufficient in empirical work. The dynamics of the error variance permitted by ARCH(q) are not rich enough to represent the patterns found in financial data. The generalized ARCH or GARCH model is now more widely used.

The representation of the variance of a process in the GARCH model is somewhat (but not exactly) analogous to the ARMA representation of the level of a time series. The variance at time t is allowed to depend on both past values of the variance and past values of the realized squared disturbance, as shown in the following system of equations:

    y_t = X_t β + u_t                                               (31.13)
    u_t = σ_t ε_t                                                   (31.14)
    σ_t² = α_0 + Σ_{i=1}^{q} α_i u_{t-i}² + Σ_{j=1}^{p} δ_j σ_{t-j}²     (31.15)

As above, ε_t is an iid sequence with unit variance. X_t is a matrix of regressors (or, in the simplest case, just a vector of 1s allowing for a non-zero mean of y_t). Note that if p = 0, GARCH collapses to ARCH(q): the generalization is embodied in the δ_j terms that multiply previous values of the error variance.

In principle the underlying innovation, ε_t, could follow any suitable probability distribution, and besides the obvious candidate of the normal or Gaussian distribution, the Student's t distribution has been used in this context. Currently gretl only handles the case where ε_t is assumed to be Gaussian. However, when the --robust option to the garch command is given, the estimator gretl uses for the covariance matrix can be considered Quasi-Maximum Likelihood even with non-normal disturbances. See below for more on the options regarding the GARCH covariance matrix.

Example:

    garch p q ; y const x

where p ≥ 0 and q > 0 denote the respective lag orders as shown in equation (31.15). These values can be supplied in numerical form or as the names of pre-defined scalar variables.

GARCH estimation

Estimation of the parameters of a GARCH model is by no means a straightforward task. Consider equation (31.15): the conditional variance at any point in time, σ_t², depends on the conditional variance in earlier periods, but σ_t² is not observed, and must be inferred by some sort of Maximum Likelihood procedure. By default gretl uses native code that employs the BFGS maximizer; you also have the option (activated by the --fcp command-line switch) of using the method proposed by Fiorentini et al. (1996),⁴ which was adopted as a benchmark in the study of GARCH results by McCullough and Renfro (1998). It employs analytical first and second derivatives of the log-likelihood, and uses a mixed-gradient algorithm, exploiting the information matrix in the early iterations and then switching to the Hessian in the neighborhood of the maximum likelihood. (This progress can be observed if you append the --verbose option to gretl's garch command.)

⁴ The algorithm is based on Fortran code deposited in the archive of the Journal of Applied Econometrics by the authors, and is used by kind permission of Professor Fiorentini.
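A short sketch of typical usage follows; it assumes that the djclose daily stock-price file is among the installed sample datasets and that the $h accessor (conditional variance) is available after estimation.

    open djclose.gdt                  # assumed sample file
    series y = 100 * ldiff(djclose)   # daily percentage returns
    garch 1 1 ; y const --robust      # QML-type covariance matrix
    series h = $h                     # save the estimated conditional variance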
Several options are available for computing the covariance matrix of the parameter estimates in connection with the garch command. At a first level, one can choose between a "standard" and a "robust" estimator. By default, the Hessian is used, unless the --robust option is given, in which case the QML estimator is used. A finer choice is available via the set command, as shown in Table 31.2.

Table 31.2: Options for the GARCH covariance matrix

    command                   effect
    set garch_vcv hessian     Use the Hessian
    set garch_vcv im          Use the Information Matrix
    set garch_vcv op          Use the Outer Product of the Gradient
    set garch_vcv qml         QML estimator
    set garch_vcv bw          Bollerslev-Wooldridge sandwich estimator

It is not uncommon, when one estimates a GARCH model for an arbitrary time series, to find that the iterative calculation of the estimates fails to converge. For the GARCH model to make sense, there are strong restrictions on the admissible parameter values, and it is not always the case that there exists a set of values inside the admissible parameter space for which the likelihood is maximized.

The restrictions in question can be explained by reference to the simplest (and much the most common) instance of the GARCH model, where p = q = 1. In the GARCH(1,1) model, the conditional variance is

    σ_t² = α_0 + α_1 u_{t-1}² + δ_1 σ_{t-1}²            (31.16)

Taking the unconditional expectation of (31.16) we get

    σ² = α_0 + α_1 σ² + δ_1 σ²

so that

    σ² = α_0 / (1 - α_1 - δ_1)

For this unconditional variance to exist, we require that α_1 + δ_1 < 1, and for it to be positive we require that α_0 > 0.

A common reason for non-convergence of GARCH estimates (that is, a common reason for the non-existence of α_i and δ_i values that satisfy the above requirements and at the same time maximize the likelihood of the data) is misspecification of the model. It is important to realize that GARCH, in itself, allows only for time-varying volatility in the data. If the mean of the series in question is not constant, or if the error process is not only heteroskedastic but also autoregressive, it is necessary to take this into account when formulating an appropriate model. For example, it may be necessary to take the first difference of the variable in question and/or to add suitable regressors, X_t, as in (31.13).

Chapter 32: Vector Autoregressions

Gretl provides a standard set of procedures for dealing with the multivariate time-series models known as VARs (Vector AutoRegressions). More general models, such as VARMAs, non-linear models or multivariate GARCH models, are not provided as of now, although it is entirely possible to estimate them by writing custom procedures in the gretl scripting language. In this chapter, we will briefly review gretl's VAR toolbox.

32.1 Notation

A VAR is a structure whose aim is to model the time persistence of a vector of n time series, y_t, via a multivariate autoregression, as in

    y_t = A_1 y_{t-1} + A_2 y_{t-2} + ... + A_p y_{t-p} + B x_t + ε_t       (32.1)

The number of lags p is called the order of the VAR. The vector x_t, if present, contains a set of exogenous variables, often including a constant, possibly with a time trend and seasonal dummies. The vector ε_t is typically assumed to be a vector white noise, with covariance matrix Σ.

Equation (32.1) can be written more compactly as

    A(L) y_t = B x_t + ε_t                              (32.2)

where A(L) is a matrix polynomial in the lag operator, or as

    [ y_t       ]     [ y_{t-1} ]   [ B ]       [ ε_t ]
    [ y_{t-1}   ] = A [ y_{t-2} ] + [ 0 ] x_t + [ 0   ]         (32.3)
    [ ...       ]     [ ...     ]   [...]       [ ... ]
    [ y_{t-p+1} ]     [ y_{t-p} ]   [ 0 ]       [ 0   ]

The matrix A is known as the "companion matrix" and equals

        [ A_1  A_2  ...  A_p ]
    A = [ I    0    ...  0   ]
        [ 0    I    ...  0   ]
        [ ...  ...  ...  ... ]

Equation (32.3) is known as the "companion form" of the VAR.
Another representation of interest is the so-called "VMA representation", which is written in terms of an infinite series of matrices Θ_i defined as

    Θ_i = ∂y_t / ∂ε_{t-i}                               (32.4)

The Θ_i matrices may be derived by recursive substitution in equation (32.1): for example, assuming for simplicity that B = 0 and p = 1, equation (32.1) would become

    y_t = A y_{t-1} + ε_t

which could be rewritten as

    y_t = A^{n+1} y_{t-n-1} + ε_t + A ε_{t-1} + A² ε_{t-2} + ... + A^n ε_{t-n}

In this case Θ_i = A^i. In general, it is possible to compute Θ_i as the n × n north-west block of the i-th power of the companion matrix A (so Θ_0 is always an identity matrix).

The VAR is said to be stable if all the eigenvalues of the companion matrix A are smaller than 1 in absolute value, or, equivalently, if the matrix polynomial A(L) in equation (32.2) is such that |A(z)| = 0 implies |z| > 1. If this is the case, lim_{n→∞} Θ_n = 0 and the vector y_t is stationary; as a consequence, the equation

    y_t - E(y_t) = Σ_{i=0}^{∞} Θ_i ε_{t-i}              (32.5)

is a legitimate Wold representation.

If the VAR is not stable, then the inferential procedures that are called for become somewhat more specialized, except for some simple cases. In particular, if the number of eigenvalues of A with modulus 1 is between 1 and n-1, the canonical tool to deal with these models is the cointegrated VAR model, discussed in chapter 33.

32.2 Estimation

The gretl command for estimating a VAR is var which, in the command line interface, is invoked in the following manner:

    [modelname <-] var p Ylist [; Xlist]

where p is a scalar (the VAR order) and Ylist is a list of variables specifying the content of y_t. The optional Xlist argument can be used to specify a set of exogenous variables. If this argument is omitted, the vector x_t is taken to contain a constant only; if present, it must be separated from Ylist by a semicolon. Note, however, that a few common choices can be obtained in a simpler way: the options --trend and --seasonals call for inclusion of a linear trend and a set of seasonal dummies, respectively. In addition, the --nc option ("no constant") can be used to suppress the standard inclusion of a constant. The "modelname <-" construct can be used to store the model under a name (see section 3.2), if so desired. To estimate a VAR using the graphical interface, choose "Time Series, Vector Autoregression" under the Model menu.

The parameters in eq. (32.1) are typically free from restrictions, which implies that multivariate OLS provides a consistent and asymptotically efficient estimator of all the parameters.¹ Given the simplicity of OLS, this is what every software package, including gretl, uses: example script 32.1 exemplifies the fact that the var command gives you exactly the output you would have from a battery of OLS regressions. The advantage of using the dedicated command is that, after estimation is done, it makes it much easier to access certain quantities and manage certain tasks. For example, the $coeff accessor returns the estimated coefficients as a matrix with n columns, and $sigma returns an estimate of the matrix Σ, the covariance matrix of ε_t.

¹ In fact, under normality of ε_t, OLS is indeed the conditional ML estimator. You may want to use other methods if you need to estimate a VAR in which some parameters are constrained.

Moreover, for each variable in the system, an F test is automatically performed in which the null hypothesis is that no lags of variable j are significant in the equation for variable i. This is commonly known as a "Granger causality" test.

In addition, two accessors become available for the companion matrix ($compan) and the VMA representation ($vma). The latter deserves a detailed description: since the VMA representation (32.5) is of infinite order, gretl defines a horizon up to which the Θ_i matrices are computed automatically. By default, this is a function of the periodicity of the data (see Table 32.1), but it can be set by the user to any desired value via the set command with the horizon parameter, as in

    set horizon 30

Table 32.1: VMA horizon as a function of the dataset periodicity

    Periodicity        horizon
    Quarterly          20 (5 years)
    Monthly            24 (2 years)
    Daily              3 weeks
    All other cases    10

Calling the horizon h, the $vma accessor returns an (h+1) × n² matrix, in which the (i+1)-th row is the vectorized form of Θ_i.
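The following sketch shows one possible use of the $compan accessor, checking the stability condition stated earlier. It assumes a reasonably recent gretl in which eigengen() returns complex eigenvalues, so that abs() yields their moduli; the denmark data and lag order are arbitrary.

    open denmark.gdt
    list Y = 1 2 3 4
    var 2 Y --quiet
    matrix A = $compan                       # (n*p) x (n*p) companion matrix
    scalar maxmod = maxc(abs(eigengen(A)))   # largest eigenvalue modulus
    printf "largest modulus = %g (stable if < 1)\n", maxmod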
Listing 32.1: Estimation of a VAR via OLS

Input:

    open sw_ch14.gdt
    series infl = 400*sdiff(log(PUNEW))
    scalar p = 2
    list X = LHUR infl
    list Xlag = lags(p, X)

    loop foreach i X
        ols $i const Xlag
    endloop

    var p X

Output (selected portions):

    Model 1: OLS, using observations 1960:3-1999:4 (T = 158)
    Dependent variable: LHUR

                 coefficient    std. error    t-ratio    p-value
      const       0.113673      0.0875210      1.299     0.1960
      LHUR_1      1.54297       0.0680518     22.67      8.78e-51
      LHUR_2     -0.583104      0.0645879     -9.028     7.00e-16
      infl_1      0.0219040     0.00874581     2.505     0.0133
      infl_2     -0.0148408     0.00920536    -1.612     0.1090

    Mean dependent var   6.019198    S.D. dependent var   1.502549
    Sum squared resid    8.654176    S.E. of regression   0.237830

    VAR system, lag order 2
    OLS estimates, observations 1960:3-1999:4 (T = 158)
    Log-likelihood = -322.73663
    Determinant of covariance matrix = 0.020382769
    AIC = 4.2119
    BIC = 4.4057
    HQC = 4.2906
    Portmanteau test: LB(39) = 226.984, df = 148 [0.0000]

    Equation 1: LHUR

                 coefficient    std. error    t-ratio    p-value
      const       0.113673      0.0875210      1.299     0.1960
      LHUR_1      1.54297       0.0680518     22.67      8.78e-51
      LHUR_2     -0.583104      0.0645879     -9.028     7.00e-16
      infl_1      0.0219040     0.00874581     2.505     0.0133
      infl_2     -0.0148408     0.00920536    -1.612     0.1090

    Mean dependent var   6.019198    S.D. dependent var   1.502549
    Sum squared resid    8.654176    S.E. of regression   0.237830

VAR lag-order selection

In order to help the user choose the most appropriate VAR order, gretl provides a special variant of the var command:

    var p Ylist [; Xlist] --lagselect

When the --lagselect option is given, estimation is performed for all lags up to p and a table is printed: it displays, for each order, a Likelihood Ratio test for the order p versus p-1, plus an array of information criteria (see chapter 28). For each information criterion in the table, a star indicates what appears to be the "best" choice. The same output can be obtained through the graphical interface via the "Time Series, VAR lag selection" entry under the Model menu.

Warning: in finite samples, the choice of the maximum lag, p, may affect the outcome of the procedure. This is not a bug, but rather an unavoidable side effect of the way these comparisons should be made. If your sample contains T observations and you invoke the lag selection procedure with maximum order p, gretl examines all VARs of order ranging from 1 to p, estimated on a uniform sample of T - p observations. In other words, the comparison procedure does not use all the available data when estimating VARs of order less than p, so as to ensure that all the models in the comparison are estimated on the same data range. Choosing a different value of p may therefore alter the results, although this is unlikely to happen if your sample size is reasonably large.

An example of this unpleasant phenomenon is given in example script 32.2. As can be seen, according to the Hannan-Quinn criterion, order 2 seems preferable to order 1 if the maximum tested order is 4, but the situation is reversed if the maximum tested order is 6.
32.3 Structural VARs

Gretl's built-in var command does not support the general class of models known as "Structural VARs", though it does support the Cholesky-decomposition-based approach, the classic and most popular structural VAR variant. If you wish to go beyond that, there is a gretl addon named SVAR which will likely meet your needs.

SVAR is supplied as part of the gretl package; you can find its documentation (which is quite detailed) as follows: under the Tools menu in the gretl main window, go to "Function packages / On local machine". Or use the "fx" button on the toolbar at the foot of the main window. In the function packages window, either scroll down or use the search box to find SVAR. Then right-click and select "Info"; this opens a window which gives basic information on the package, including a link to SVAR.pdf, the full documentation.

The remainder of this section will thus only deal with the Cholesky-based recursive shock identification used by the native var command.

Listing 32.2: VAR lag selection via Information Criteria

Input:

    open denmark
    list Y = 1 2 3 4
    var 4 Y --lagselect
    var 6 Y --lagselect

Output (selected portions):

    VAR system, maximum lag order 4

    The asterisks below indicate the best (that is, minimized) values
    of the respective information criteria, AIC = Akaike criterion,
    BIC = Schwarz Bayesian criterion and HQC = Hannan-Quinn criterion.

    lags      loglik      p(LR)       AIC           BIC           HQC
      1     609.15315              -23.104045    -22.346466*   -22.814552
      2     631.70153    0.00013   -23.360844*   -21.997203    -22.839757*
      3     642.38574    0.16478   -23.152382    -21.182677    -22.399699
      4     653.22564    0.15383   -22.950025    -20.374257    -21.965748

    VAR system, maximum lag order 6

    The asterisks below indicate the best (that is, minimized) values
    of the respective information criteria, AIC = Akaike criterion,
    BIC = Schwarz Bayesian criterion and HQC = Hannan-Quinn criterion.

    lags      loglik      p(LR)       AIC           BIC           HQC
      1     594.38410              -23.444249    -22.672078*   -23.151288*
      2     615.43480    0.00038   -23.650400*   -22.260491    -23.123070
      3     624.97613    0.26440   -23.386781    -21.379135    -22.625083
      4     636.03766    0.13926   -23.185210    -20.559827    -22.189144
      5     658.36014    0.00016   -23.443271    -20.200150    -22.212836
      6     669.88472    0.11243   -23.260601    -19.399743    -21.795797

IRF and FEVD

Assume that the disturbance in equation (32.1) can be thought of as a linear function of a vector of structural shocks u_t, which are assumed to have unit variance and to be mutually uncorrelated, so V(u_t) = I. If ε_t = K u_t, it follows that Σ = V(ε_t) = K K'.

The main object of interest in this setting is the sequence of matrices

    C_k = ∂y_t / ∂u_{t-k} = Θ_k K                       (32.6)

known as the structural VMA representation. From the C_k matrices defined in equation (32.6), two quantities of interest may be derived: the Impulse Response Function (IRF) and the Forecast Error Variance Decomposition (FEVD).

The IRF of variable i to shock j is simply the sequence of the elements in row i and column j of the C_k matrices. In symbols:

    I_{i,j,k} = ∂y_{i,t} / ∂u_{j,t-k}

As a rule, Impulse Response Functions are plotted as a function of k, and are interpreted as the effect that a shock has on an observable variable through time. Of course, what we observe are the estimated IRFs, so it is natural to endow them with confidence intervals: following common practice, gretl computes the confidence intervals by using the bootstrap;² details are given later in this section.

² It is possible, in principle, to compute analytical confidence intervals via an asymptotic approximation, but this is not a very popular choice: asymptotic formulae are known to often give a very poor approximation of the finite-sample properties.

Another quantity of interest that may be computed from the structural VMA representation is the Forecast Error Variance Decomposition (FEVD). The forecast error variance after h steps is given by

    Ω_h = Σ_{k=0}^{h} C_k C_k'

hence the variance for variable i is

    ω_i² = [Ω_h]_{ii} = Σ_{k=0}^{h} [diag(C_k C_k')]_i = Σ_{k=0}^{h} Σ_{l=1}^{n} (c^{(k)}_{i,l})²

where c^{(k)}_{i,l} is, trivially, the (i, l) element of C_k. As a consequence, the share of uncertainty on variable i that can be attributed to the j-th shock after h periods equals

    VD_{i,j}(h) = Σ_{k=0}^{h} (c^{(k)}_{i,j})² / Σ_{k=0}^{h} Σ_{l=1}^{n} (c^{(k)}_{i,l})²
Triangularization

The formula (32.6) takes K as known, while of course it has to be estimated. The estimation problem has been the subject of an enormous body of literature, which we will not even attempt to summarize here (see, for example, Lütkepohl 2005, chapter 9). Suffice it to say that the most popular choice dates back to Sims (1980), and consists in assuming that K is lower triangular, so its estimate is simply the Cholesky decomposition of the estimate of Σ. The main consequence of this choice is that the ordering of variables within the vector y_t becomes meaningful: since K is also the matrix of Impulse Response Functions at lag 0, the triangularity assumption means that the first variable in the ordering responds instantaneously only to shock number 1, the second one only to shocks 1 and 2, and so forth. For this reason, each variable is thought to "own" one shock: variable 1 owns shock number 1, and so on.

In this sort of exercise, therefore, the ordering of the y variables is important. To put it differently, if variable foo comes before variable bar in the Y list, it follows that the shock owned by foo affects bar instantaneously, but not vice versa.

Impulse Response Functions and the FEVD can be printed out via the command line interface by using the --impulse-responses and --variance-decomp options, respectively. If you need to store them into matrices, you could compute the structural VMA and proceed from there. For example, the following code snippet shows you how to manually compute a matrix containing the IRFs:

    open denmark
    list Y = 1 2 3 4
    scalar n = nelem(Y)
    var 2 Y --quiet --impulse-responses
    matrix K = cholesky($sigma)
    matrix V = $vma
    matrix IRF = V * (K ** I(n))
    print IRF

in which the equality

    vec(C_k) = vec(Θ_k K) = (K' ⊗ I) vec(Θ_k)

was used.

A more convenient way of obtaining the desired quantities is to use the irf and fevd functions, which can be used in scripts after a VAR or VECM (see the next chapter) has been estimated. In these functions you must specify the number of the responding (target) variable and the number of the analyzed shock to get the corresponding results as a column vector. The choice of how many periods should be calculated, and thus how long the result vector will be, is determined by previously invoking "set horizon x", where x is a non-negative integer and the first response concerns the impact effect. As always, it is recommended to consult the function reference under the help menu, where in the case of the irf function it is also explained that the implicit shock size is such that the impact response in the same equation is one standard deviation of the corresponding error term.
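As a minimal illustration (the dataset and indices are our own choices), the following lines retrieve the point IRF of variable 2 to shock 1, and the corresponding FEVD shares, as column vectors:

    open denmark
    list Y = 1 2 3 4
    set horizon 20
    var 2 Y --quiet
    matrix resp = irf(2, 1)    # response of variable 2 to shock 1
    matrix dec  = fevd(2, 1)   # share of shock 1 in variable 2's forecast error variance
    print resp dec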
IRF bootstrap

The IRFs obtained above are estimates, and as such they are uncertain. Mostly because they are nonlinear functions of the VAR parameters, the standard way of assessing this estimation uncertainty, and of deriving confidence intervals or bands, is to use a bootstrap approach. Again, more advanced options are available with the SVAR addon, but the irf function used after the built-in var or vecm command also provides the option to run a bootstrap based on resampling from the residuals. The number of bootstrap iterations can be adjusted through "set boot_iters x", where x must be larger than 499. The desired nominal confidence level must be specified, after the target and shock numbers, as the third argument; in that case the return vector becomes a three-column matrix, where the lower and upper bounds of the confidence intervals are given in the extra two columns.
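For example (a sketch along the lines of the previous fragment; the iteration count is arbitrary):

    set boot_iters 1999
    var 2 Y --quiet
    matrix ci = irf(2, 1, 0.90)   # point estimates plus 90% bootstrap bounds
    print ci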
Menu-driven usage

Almost all the functionality related to the described recursively identified structural VARs is also available under the menus in the model window that appears after a VAR is estimated in the GUI.[3]

In the Plots menu there are a number of menu entries relating to the impulse responses, as well as one entry for the forecast error variance decomposition. Selecting any of these will bring up a little specification window where the ordering for the Cholesky decomposition must be chosen and, in the case of IRFs, the intended bootstrap coverage can be set.

In the Analysis menu there are also entries for IRF and FEVD, which may sometimes be a little confusing. The point is that here the numbers of the point estimates will be printed out in a tabular format, instead of being plotted.

[3] Note that you cannot directly invoke the SVAR addon from the model window of an estimated VAR; that menu entry is only present in gretl's main window, under the Model menu and "multivariate time series" submenu.

32.4 Residual-based diagnostic tests

Three diagnostic tests based on residuals are available after estimating a VAR: for normality, autocorrelation and ARCH (Autoregressive Conditional Heteroskedasticity). These are implemented by the modtest command, using the options --normality, --autocorr and --arch, respectively.

The multivariate normality test is that of Doornik and Hansen (1994); it is based on the skewness and kurtosis of the VAR residuals. The autocorrelation and ARCH tests are also, by default, multivariate; they are described in detail by Lütkepohl (2005) (see sections 4.4.4 and 16.5.1). Both tests are of the LM type, although the autocorrelation test statistic is referred to a Rao F distribution (Rao, 1973). These tests may involve estimation of a large number of parameters, depending on the lag horizon chosen, and can fail for lack of degrees of freedom in small samples. As a fallback, the --univariate option can be used to specify that the tests be run per equation rather than in multivariate mode.

Listing 32.3 illustrates the VAR autocorrelation tests, replicating an example given by Lütkepohl (2005, p. 174). Note the difference in the interpretation of the order argument to modtest with the --autocorr option (this also applies to the ARCH test): in the multivariate version, order is taken as the maximum lag order and tests are run from lag 1 up to the maximum; but in the univariate version a single test is run for each equation, using just the specified lag order. The example also exposes what exactly is returned by the $test and $pvalue accessors in the two variants.

Listing 32.3: VAR autocorrelation test from Lütkepohl

Input:

    open wgmacro.gdt --quiet
    list Y = investment income consumption
    list dlnY = ldiff(Y)
    smpl 1960:4 1978:4
    var 2 dlnY
    modtest 4 --autocorr
    eval $test ~ $pvalue
    modtest 4 --autocorr --univariate
    eval $test ~ $pvalue

Output (from the tests):

    ? modtest 4 --autocorr

    Test for autocorrelation of order up to 4

              Rao F    Approx dist.   p-value
    lag 1     0.615    F(9, 148)      0.7827
    lag 2     0.754    F(18, 164)     0.7507
    lag 3     1.143    F(27, 161)     0.2982
    lag 4     1.254    F(36, 154)     0.1743

    ? eval $test ~ $pvalue
    0.61524   0.78269
    0.75397   0.75067
    1.1429    0.29820
    1.2544    0.17431

    ? modtest 4 --autocorr --univariate

    Test for autocorrelation of order 4

    Equation 1:
    Ljung-Box Q' = 6.11506 with p-value = P(Chi-square(4) > 6.11506) = 0.191

    Equation 2:
    Ljung-Box Q' = 1.67136 with p-value = P(Chi-square(4) > 1.67136) = 0.796

    Equation 3:
    Ljung-Box Q' = 1.59931 with p-value = P(Chi-square(4) > 1.59931) = 0.809

    ? eval $test ~ $pvalue
    6.1151    0.19072
    1.6714    0.79591
    1.5993    0.80892

Chapter 33 Cointegration and Vector Error Correction Models

33.1 Introduction

The twin concepts of cointegration and error correction have drawn a good deal of attention in macroeconometrics over recent years. The attraction of the Vector Error Correction Model (VECM) is that it allows the researcher to embed a representation of economic equilibrium relationships within a relatively rich time-series specification. This approach overcomes the old dichotomy between (a) structural models that faithfully represented macroeconomic theory but failed to fit the data, and (b) time-series models that were accurately tailored to the data but difficult, if not impossible, to interpret in economic terms.

The basic idea of cointegration relates closely to the concept of unit roots (see section 31.3). Suppose we have a set of macroeconomic variables of interest, and we find we cannot reject the hypothesis that some of these variables, considered individually, are non-stationary. Specifically, suppose we judge that a subset of the variables are individually integrated of order 1, or I(1). That is, while they are non-stationary in their levels, their first differences are stationary.

Given the statistical problems associated with the analysis of non-stationary data (for example, the threat of spurious regression), the traditional approach in this case was to take first differences of all the variables before proceeding with the analysis. But this can result in the loss of important information. It may be that while the variables in question are I(1) when taken individually, there exists a linear combination of the variables that is stationary without differencing, or I(0). (There could be more than one such linear combination.) That is, while the ensemble of variables may be "free to wander" over time, nonetheless the variables are "tied together" in certain ways. And it may be possible to interpret these ties, or cointegrating vectors, as representing equilibrium conditions.

For example, suppose we find some or all of the following variables are I(1): money stock, M, the price level, P, the nominal interest rate, R, and output, Y. According to standard theories of the demand for money, we would nonetheless expect there to be an equilibrium relationship between real balances, interest rate and output; for example

    m - p = γ0 + γ1 y + γ2 r,    γ1 > 0, γ2 < 0

where lower-case variable names denote logs. In equilibrium, then,

    m - p - γ1 y - γ2 r = γ0

Realistically, we should not expect this condition to be satisfied each period. We need to allow for the possibility of short-run disequilibrium. But if the system moves back towards equilibrium following a disturbance, it follows that the vector x_t = (m_t, p_t, y_t, r_t)' is bound by a cointegrating vector β = (β1, β2, β3, β4)', such that β'x_t is stationary (with a mean of γ0). Furthermore, if equilibrium is correctly characterized by the simple model above, we have β2 = -β1, β3 < 0 and β4 > 0. These things are testable within the context of cointegration analysis.

There are typically three steps in this sort of analysis:

1. Test to determine the number of cointegrating vectors, the cointegrating rank of the system.

2. Estimate a VECM with the appropriate rank, but subject to no further restrictions.

3. Probe the interpretation of the cointegrating vectors as equilibrium conditions by means of restrictions on the elements of these vectors.
The following sections expand on each of these points, giving further econometric details and explaining how to implement the analysis using gretl.

33.2 Vector Error Correction Models as representation of a cointegrated system

Consider a VAR of order p with a deterministic part given by μ_t (typically, a polynomial in time). One can write the n-variate process y_t as

    y_t = μ_t + A_1 y_{t-1} + A_2 y_{t-2} + ... + A_p y_{t-p} + ε_t    (33.1)

But since y_{t-i} = y_{t-1} - (Δy_{t-1} + Δy_{t-2} + ... + Δy_{t-i+1}), we can rewrite the above as

    Δy_t = μ_t + Π y_{t-1} + Σ_{i=1}^{p-1} Γ_i Δy_{t-i} + ε_t    (33.2)

where Π = Σ_{i=1}^{p} A_i - I and Γ_i = -Σ_{j=i+1}^{p} A_j. This is the VECM representation of (33.1).

The interpretation of (33.2) depends crucially on r, the rank of the matrix Π. If r = 0, the processes are all I(1) and not cointegrated; if r = n, then Π is invertible and the processes are all I(0). Cointegration occurs in between, when 0 < r < n and Π can be written as αβ'. In this case, y_t is I(1), but the combination z_t = β'y_t is I(0). If, for example, r = 1 and the first element of β was -1, then one could write z_t = -y_{1,t} + β_2 y_{2,t} + ... + β_n y_{n,t}, which is equivalent to saying that

    y_{1,t} = β_2 y_{2,t} + ... + β_n y_{n,t} - z_t

is a long-run equilibrium relationship: the deviations z_t may not be 0, but they are stationary. In this case, (33.2) can be written as

    Δy_t = μ_t + α β' y_{t-1} + Σ_{i=1}^{p-1} Γ_i Δy_{t-i} + ε_t    (33.3)

If β were known, then z_t would be observable and all the remaining parameters could be estimated via OLS. In practice, the procedure estimates β first, and then the rest.

The rank of Π is investigated by computing the eigenvalues of a closely related matrix, whose rank is the same as Π; this matrix, however, is by construction symmetric and positive semidefinite. As a consequence, all its eigenvalues are real and non-negative, and tests on the rank of Π can therefore be carried out by testing how many eigenvalues are 0.

If all the eigenvalues are significantly different from 0, then all the processes are stationary. If, on the contrary, there is at least one zero eigenvalue, then the y_t process is integrated, although some linear combination β'y_t might be stationary. At the other extreme, if no eigenvalues are significantly different from 0, then not only is the process y_t non-stationary, but the same holds for any linear combination β'y_t; in other words, no cointegration occurs.

Estimation typically proceeds in two stages: first, a sequence of tests is run to determine r, the cointegration rank. Then, for a given rank, the parameters in equation (33.3) are estimated. The two commands that gretl offers for estimating these systems are johansen and vecm, respectively. The syntax for johansen is

    [...]

...then we should not place any restriction on the intercept. Otherwise, the question arises of whether it makes sense to specify a cointegration relationship which includes a non-zero intercept. One example where this is appropriate is the relationship between two interest rates: generally these are not trended, but the VAR might still have an intercept because the difference between the two (the "interest rate spread") might be stationary around a non-zero mean, for example because of a risk or liquidity premium.

The previous example can be generalized in three directions:

1. If a VAR of order greater than 1 is considered, the algebra gets more convoluted but the conclusions are identical.

2. If the VAR includes more than two endogenous variables, the cointegration rank r can be greater than 1. In this case, α is a matrix with r columns, and the case with restricted constant entails the restriction that μ0 should be some linear combination of the columns of α.

3. If a linear trend is included in the model, the deterministic part of the VAR becomes μ0 + μ1 t. The reasoning is practically the same as above, except that the focus now centers on μ1 rather than μ0. The counterpart to the "restricted constant" case discussed above is a "restricted trend" case, such that the cointegration relationships include a trend, but the first differences of the variables in question do not. In the case of an unrestricted trend, the trend appears in both the cointegration relationships and the first differences, which corresponds to the presence of a quadratic trend in the variables themselves (in levels).
In order to accommodate the five cases, gretl provides the following options to the johansen and vecm commands:

    μ_t                           option flag   description
    0                             --nc          no constant
    μ0, α⊥'μ0 = 0                 --rc          restricted constant
    μ0                            --uc          unrestricted constant
    μ0 + μ1 t, α⊥'μ1 = 0          --crt         constant + restricted trend
    μ0 + μ1 t                     --ct          constant + unrestricted trend

Note that for this command the above options are mutually exclusive. In addition, you have the option of using the --seasonals option, for augmenting μ_t with centered seasonal dummies. In each case, p-values are computed via the approximations devised by Doornik (1998).
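By way of illustration (a sketch of our own, using the flags from the table above on the denmark data):

    open denmark
    list Y = LRM LRY IBO IDE
    johansen 2 Y --rc                # restricted constant
    johansen 2 Y --uc                # unrestricted constant
    johansen 2 Y --ct --seasonals    # unrestricted trend, plus seasonal dummies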
33.4 The Johansen cointegration tests

The two Johansen tests for cointegration are used to establish the rank of β, or in other words the number of cointegrating vectors. These are the "λ-max" test, for hypotheses on individual eigenvalues, and the "trace" test, for joint hypotheses. Suppose that the eigenvalues λ_i are sorted from largest to smallest. The null hypothesis for the λ-max test on the i-th eigenvalue is that λ_i = 0. The corresponding trace test, instead, considers the hypothesis λ_j = 0 for all j ≥ i.

The gretl command johansen performs these two tests. The corresponding menu entry in the GUI is "Model, Time Series, Cointegration Test, Johansen".

As in the ADF test, the asymptotic distribution of the tests varies with the deterministic component μ_t one includes in the VAR (see section 33.3 above). The following code uses the denmark data file, supplied with gretl, to replicate Johansen's example found in his 1995 book:

    open denmark
    johansen 2 LRM LRY IBO IDE --rc --seasonals

[...]

...of r equilibrium relations as

    y_{1,t} = b_{1,r+1} y_{r+1,t} + ... + b_{1,n} y_{n,t}
    y_{2,t} = b_{2,r+1} y_{r+1,t} + ... + b_{2,n} y_{n,t}
    ...
    y_{r,t} = b_{r,r+1} y_{r+1,t} + ... + b_{r,n} y_{n,t}

where the first r variables are expressed as functions of the remaining n - r.

Although the triangular representation ensures that the statistical problem of estimating β is solved, the resulting equilibrium relationships may be difficult to interpret. In this case, the user may want to achieve identification by specifying manually the system of r^2 constraints that gretl will use to produce an estimate of β.

As an example, consider the money demand system presented in section 9.6 of Verbeek (2004). The variables used are m (the log of real money stock M1), infl (inflation), cpr (the commercial paper rate), y (log of real GDP) and tbr (the Treasury bill rate).[2] Estimation of β can be performed via the commands

    open money.gdt
    smpl 1954:1 1994:4
    vecm 6 2 m infl cpr y tbr --rc

and the relevant portion of the output reads

    Maximum likelihood estimates, observations 1954:1-1994:4 (T = 164)
    Cointegration rank = 2
    Case 2: Restricted constant

    beta (cointegrating vectors, standard errors in parentheses)

    m        1.0000        0.0000
            (0.0000)      (0.0000)
    infl     0.0000        1.0000
            (0.0000)      (0.0000)
    cpr      0.56108       24.367
            (0.10638)     (4.2113)
    y        0.40446       0.91166
            (0.10277)     (4.0683)
    tbr      0.54293       24.786
            (0.10962)     (4.3394)
    const    3.7483        16.751
            (0.78082)     (3.0909)

[2] This data set is available in the verbeek data package; see http://gretl.sourceforge.net/gretl_data.html.

Interpretation of the coefficients of the cointegration matrix β would be easier if a meaning could be attached to each of its columns. This is possible by hypothesizing the existence of two long-run relationships: a money demand equation,

    m = c_1 + β_1 infl + β_2 y + β_3 tbr

and a risk premium equation,

    cpr = c_2 + β_4 infl + β_5 y + β_6 tbr

which imply that the cointegration matrix can be normalized as

    β = [ -1     0
          β_1   β_4
           0    -1
          β_2   β_5
          β_3   β_6
          c_1   c_2 ]

(the rows corresponding to m, infl, cpr, y, tbr and the restricted constant, in that order). This renormalization can be accomplished by means of the restrict command, to be given after the vecm command, or in the graphical interface by selecting the "Test, Linear Restrictions" menu entry. The syntax for entering the restrictions should be fairly obvious:[3]

    restrict
        b[1,1] = 1
        b[1,3] = 0
        b[2,1] = 0
        b[2,3] = 1
    end restrict

which produces

    Cointegrating vectors (standard errors in parentheses)

    m        1.0000        0.0000
            (0.0000)      (0.0000)
    infl     0.023026      0.041039
            (0.0054666)   (0.027790)
    cpr      0.0000        1.0000
            (0.0000)      (0.0000)
    y        0.42545       0.037414
            (0.033718)    (0.17140)
    tbr      0.027790      1.0172
            (0.0045445)   (0.023102)
    const    3.3625        0.68744
            (0.25318)     (1.2870)

[3] Note that in this context we are bending the usual matrix indexation convention, using the leading index to refer to the column of β (the particular cointegrating vector). This is standard practice in the literature, and defensible insofar as it is the columns of β (the cointegrating relations, or equilibrium errors) that are of primary interest.
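If you need the estimates in matrix form for further processing, they can be retrieved via accessors; a minimal sketch, assuming the $jbeta and $jalpha accessors reflect the most recently estimated VECM:

    matrix beta  = $jbeta    # 6 x 2 here: the five variables plus the restricted constant
    matrix alpha = $jalpha   # 5 x 2 matrix of adjustment coefficients
    print beta alpha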
33.6 Overidentifying restrictions

One purpose of imposing restrictions on a VECM system is simply to achieve identification. If these restrictions are simply normalizations, they are not testable and should have no effect on the maximized likelihood. In addition, however, one may wish to formulate constraints on β and/or α that derive from the economic theory underlying the equilibrium relationships; substantive restrictions of this sort are then testable via a likelihood-ratio statistic.

Gretl is capable of testing general linear restrictions of the form

    R_b vec(β) = q    (33.5)

and/or

    R_a vec(α) = 0    (33.6)

Note that the β restriction may be non-homogeneous (q ≠ 0), but the α restriction must be homogeneous. Nonlinear restrictions are not supported, and neither are restrictions that cross between β and α. When r > 1, such restrictions may be in common across all the columns of β (or α), or may be specific to certain columns of these matrices. For useful discussions of this point, see Boswijk (1995) and Boswijk and Doornik (2004), section 4.4.

The restrictions (33.5) and (33.6) may be written in explicit form as

    vec(β) = H φ + h_0    (33.7)

and

    vec(α) = G ψ    (33.8)

respectively, where φ and ψ are the free parameter vectors associated with β and α, respectively. We may refer to the free parameters collectively as θ (the column vector formed by concatenating φ and ψ). Gretl uses this representation internally when testing the restrictions.

If the list of restrictions that is passed to the restrict command contains more constraints than necessary to achieve identification, then an LR test is performed. In addition, the restrict command can be given the --full switch, in which case full estimates for the restricted system are printed (including the Γ_i terms), and the system thus restricted becomes the "current model" for the purposes of further tests. Thus you are able to carry out cumulative tests, as in Chapter 7 of Johansen (1995).

Syntax

The full syntax for specifying the restriction is an extension of that exemplified in the previous section. Inside a restrict ... end restrict block, valid statements are of the form

    parameter linear combination = scalar

where a parameter linear combination involves a weighted sum of individual elements of β or α (but not both in the same combination); the scalar on the right-hand side must be 0 for combinations involving α, but can be any real number for combinations involving β. Below we give a few examples of valid restrictions:

    b[1,1] = 1.618
    b[1,4] + 2*b[2,5] = 0
    a[1,3] = 0
    a[1,1] - a[1,2] = 0

Special syntax is used when a certain constraint should be applied to all columns of β: in this case, one index is given for each b term, and the square brackets are dropped. Hence, the following syntax

    restrict
        b1 + b2 = 0
    end restrict

corresponds to

    β = [ β11    β21
         -β11   -β21
          β13    β23
          β14    β24 ]

The same convention is used for α: when only one index is given for an a term, the restriction is presumed to apply to all r columns of α; or, in other words, the variable associated with the given row of α is weakly exogenous. For instance, the formulation

    restrict
        a3 = 0
        a4 = 0
    end restrict

specifies that variables 3 and 4 do not respond to the deviation from equilibrium in the previous period.[4]

A variant on the single-index syntax for common restrictions on α and β is available: you can replace the index number with the name of the corresponding variable, in square brackets. For example, instead of a3 = 0 one could write a[cpr] = 0, if the third variable in the system is named cpr.

Finally, a shortcut (or anyway an alternative) is available for setting up complex restrictions, but currently only in relation to β: you can specify R_b and q, as in R_b vec(β) = q, by giving the names of previously defined matrices. For example,

    matrix I4 = I(4)
    matrix vR = I4 ** (I4 ~ zeros(4,1))
    matrix vq = mshape(I4, 16, 1)
    restrict
        R = vR
        q = vq
    end restrict

which manually imposes Phillips normalization on the β estimates for a system with cointegrating rank 4.

There are two points to note in relation to this option. First, vec(β) is taken to include the coefficients on all terms within the cointegration space, including the restricted constant or trend, if any, as well as any restricted exogenous variables. Second, it is acceptable to give an R matrix with a number of columns equal to the number of rows of β; this variant is taken to specify a restriction that is in common across all the columns of β.

An example

Brand and Cassola (2004) propose a money demand system for the Euro area, in which they postulate three long-run equilibrium relationships:

    money demand:                          m = β_l l + β_y y
    Fisher equation:                       π = φ l
    Expectation theory of interest rates:  l = s

where m is real money demand, l and s are long- and short-term interest rates, y is output and π is inflation.[5] The names for these variables in the gretl data file are m_p, rl, rs, y and infl, respectively.

The cointegration rank assumed by the authors is 3, and there are 5 variables, giving 15 elements in the β matrix. 3 x 3 = 9 restrictions are required for identification, and a just-identified system would have 15 - 9 = 6 free parameters. However, the postulated long-run relationships feature only three free parameters, so the over-identification rank is 3.

[4] Note that when two indices are given in a restriction on α, the indexation is consistent with that for β restrictions: the leading index denotes the cointegrating vector and the trailing index the equation number.

[5] A traditional formulation of the Fisher equation would reverse the roles of the variables in the second equation, but this detail is immaterial in the present context. Moreover, the expectation theory of interest rates implies that the third equilibrium relationship should include a constant for the liquidity premium. However, since in this example the system is estimated with the constant term unrestricted, the liquidity premium gets absorbed into the system intercept and disappears from z_t.
Listing 33.1: Estimation of a money demand system with constraints on β

Input:

    open brand_cassola.gdt

    # perform a few transformations
    m_p = m_p*100
    y = y*100
    infl = infl/4
    rs = rs/4
    rl = rl/4

    # replicate table 4, page 824
    vecm 2 3 m_p infl rl rs y --quiet
    ll0 = $lnl

    restrict --full
        b[1,1] = 1
        b[1,2] = 0
        b[1,4] = 0
        b[2,1] = 0
        b[2,2] = 1
        b[2,4] = 0
        b[2,5] = 0
        b[3,1] = 0
        b[3,2] = 0
        b[3,3] = 1
        b[3,4] = -1
        b[3,5] = 0
    end restrict
    ll1 = $rlnl

Partial output:

    Unrestricted loglikelihood (lu) = 1166.0268
    Restricted loglikelihood (lr) = 1165.2886
    2 * (lu - lr) = 1.47635
    P(Chi-Square(3) > 1.47635) = 0.68774

    beta (cointegrating vectors, standard errors in parentheses)

    m_p      1.0000        0.0000        0.0000
            (0.0000)      (0.0000)      (0.0000)
    infl     0.0000        1.0000        0.0000
            (0.0000)      (0.0000)      (0.0000)
    rl       1.6108       -0.67100       1.0000
            (0.062752)    (0.049482)    (0.0000)
    rs       0.0000        0.0000       -1.0000
            (0.0000)      (0.0000)      (0.0000)
    y       -1.3304        0.0000        0.0000
            (0.030533)    (0.0000)      (0.0000)

Listing 33.1 replicates Table 4 on page 824 of the Brand and Cassola article.[6] Note that we use the $lnl accessor after the vecm command to store the unrestricted log-likelihood, and the $rlnl accessor after restrict for its restricted counterpart.

[6] Modulo what appear to be a few typos in the article.

The example continues in script 33.2, where we perform further testing to check whether (a) the income elasticity in the money demand equation is 1 (β_y = 1) and (b) the Fisher relation is homogeneous (φ = 1). Since the --full switch was given to the initial restrict command, additional restrictions can be applied without having to repeat the previous ones. (The second script contains a few printf commands, which are not strictly necessary, to format the output nicely.) It turns out that both of the additional hypotheses are rejected by the data, with p-values of 0.002 and 0.004.

Listing 33.2: Further testing of money demand system

Input:

    restrict
        b[1,5] = -1
    end restrict
    ll_uie = $rlnl

    restrict
        b[2,3] = -1
    end restrict
    ll_hfh = $rlnl

    # replicate table 5, page 824
    printf "Testing zero restrictions in cointegration space:\n"
    printf "  LR-test, rank = 3: chi^2(3) = %6.4f [%6.4f]\n", 2*(ll0-ll1), \
      pvalue(X, 3, 2*(ll0-ll1))
    printf "Unit income elasticity: LR-test, rank = 3:\n"
    printf "  chi^2(4) = %g [%6.4f]\n", 2*(ll0-ll_uie), pvalue(X, 4, 2*(ll0-ll_uie))
    printf "Homogeneity in the Fisher hypothesis:\n"
    printf "  LR-test, rank = 3: chi^2(4) = %6.3f [%6.4f]\n", 2*(ll0-ll_hfh), \
      pvalue(X, 4, 2*(ll0-ll_hfh))

Output:

    Testing zero restrictions in cointegration space:
      LR-test, rank = 3: chi^2(3) = 1.4763 [0.6877]
    Unit income elasticity: LR-test, rank = 3:
      chi^2(4) = 17.2071 [0.0018]
    Homogeneity in the Fisher hypothesis:
      LR-test, rank = 3: chi^2(4) = 15.547 [0.0037]

Another type of test that is commonly performed is the "weak exogeneity" test. In this context, a variable is said to be weakly exogenous if all coefficients on the corresponding row in the α matrix are zero. If this is the case, that variable does not adjust to deviations from any of the long-run equilibria and can be considered an autonomous driving force of the whole system.

The code in Listing 33.3 performs this test for each variable in turn, thus replicating the first column of Table 6 on page 825 of Brand and Cassola (2004). The results show that weak exogeneity might perhaps be accepted for the long-term interest rate and real GDP (p-values 0.07 and 0.08, respectively).

Listing 33.3: Testing for weak exogeneity

Input:

    restrict
        a1 = 0
    end restrict
    ts_m = 2*(ll0 - $rlnl)

    restrict
        a2 = 0
    end restrict
    ts_p = 2*(ll0 - $rlnl)

    restrict
        a3 = 0
    end restrict
    ts_l = 2*(ll0 - $rlnl)

    restrict
        a4 = 0
    end restrict
    ts_s = 2*(ll0 - $rlnl)

    restrict
        a5 = 0
    end restrict
    ts_y = 2*(ll0 - $rlnl)

    loop foreach i m p l s y
        printf "Delta $i\t%6.3f [%6.4f]\n", ts_$i, pvalue(X, 6, ts_$i)
    endloop

Output:

    variable     LR-test   [p-value]
    Delta m      18.111    [0.0060]
    Delta p      21.067    [0.0018]
    Delta l      11.819    [0.0661]
    Delta s      16.000    [0.0138]
    Delta y      11.335    [0.0786]
[...]

...optimizer may end up at a local maximum (or, in the case of the switching algorithm, at a saddle point). The solution (or lack thereof) may be sensitive to the initial value selected for θ. By default, gretl selects a starting point using a deterministic method based on Boswijk (1995), but two further options are available: the initialization may be adjusted using simulated annealing, or the user may supply an explicit initial value for θ.

The default initialization method is:

1. Calculate the unrestricted ML estimate β̂ using the Johansen procedure.

2. If the restriction on β is non-homogeneous, use the method proposed by Boswijk:

       φ_0 = [(I_r ⊗ β̂_⊥)' H]^+ (I_r ⊗ β̂_⊥)' h_0    (33.9)

   where β̂_⊥' β̂ = 0 and A^+ denotes the Moore-Penrose inverse of A. Otherwise,

       φ_0 = (H'H)^{-1} H' vec(β̂)    (33.10)

3. Set vec(β_0) = H φ_0 + h_0.

4. Calculate the unrestricted ML estimate α̂, conditional on β_0, as per Johansen:

       α̂ = S_{01} β_0 (β_0' S_{11} β_0)^{-1}    (33.11)

5. If α is restricted by vec(α) = G ψ, then ψ_0 = (G'G)^{-1} G' vec(α̂) and vec(α_0) = G ψ_0.

Alternative initialization methods

As mentioned above, gretl offers the option of adjusting the initialization using simulated annealing. This is invoked by adding the --jitter option to the restrict command. The basic idea is this: we start at a certain point in the parameter space, and for each of n iterations (currently n = 4096) we randomly select a new point within a certain radius of the previous one, and determine the likelihood at the new point. If the likelihood is higher, we jump to the new point; otherwise, we jump with probability P (and remain at the previous point with probability 1 - P). As the iterations proceed, the system gradually "cools": that is, the radius of the random perturbation is reduced, as is the probability of making a jump when the likelihood fails to increase.

In the course of this procedure many points in the parameter space are evaluated, starting with the point arrived at by the deterministic method, which we'll call θ_0. One of these points will be "best", in the sense of yielding the highest likelihood: call it θ*. This point may or may not have a greater likelihood than θ_0. And the procedure has an end point, θ_n, which may or may not be "best".

The rule followed by gretl in selecting an initial value for θ based on simulated annealing is this: use θ* if θ* ≠ θ_0, otherwise use θ_n. That is, if we get an improvement in the likelihood via annealing, we make full use of this; on the other hand, if we fail to get an improvement, we nonetheless allow the annealing to randomize the starting point. Experiments indicate that the latter effect can be helpful.

Besides annealing, a further alternative is manual initialization. This is done by passing a predefined vector to the set command, with parameter initvals, as in

    set initvals myvec

The details depend on whether the switching algorithm or LBFGS is used. For the switching algorithm, there are two options for specifying the initial values. The more user-friendly one (for most people, we suppose) is to specify a matrix that contains vec(β) followed by vec(α). For example:

    open denmark.gdt
    vecm 2 1 LRM LRY IBO IDE --rc --seasonals

    matrix BA = {1, -1, 6, -6, -6, 0.2, 0.1, 0.02, 0.03}
    set initvals BA

    restrict
        b1 = 1
        b1 + b2 = 0
        b3 + b4 = 0
    end restrict

In this example, from Johansen (1995), the cointegration rank is 1 and there are 4 variables. However, the model includes a restricted constant (the --rc flag), so that β has 5 elements. The α matrix has 4 elements, one per equation. So the matrix BA may be read as

    (β1, β2, β3, β4, β5, α1, α2, α3, α4)
The other option, which is compulsory when using LBFGS, is to specify the initial values in terms of the free parameters, φ and ψ. Getting this right is somewhat less obvious. As mentioned above, the implicit-form restriction R vec(β) = q has explicit form vec(β) = H φ + h_0, where H = R_⊥, the right nullspace of R. The vector φ is shorter, by the number of restrictions, than vec(β). The savvy user will then see what needs to be done. The other point to take into account is that if α is unrestricted, the effective length of ψ is 0, since it is then optimal to compute α using Johansen's formula, conditional on β (equation 33.11 above). The example above could be rewritten as:

    open denmark.gdt
    vecm 2 1 LRM LRY IBO IDE --rc --seasonals

    matrix phi = {8, 6}
    set initvals phi

    restrict --lbfgs
        b1 = 1
        b1 + b2 = 0
        b3 + b4 = 0
    end restrict

In this more economical formulation, the initializer specifies only the two free parameters in φ (5 elements in β minus 3 restrictions). There is no call to give values for ψ, since α is unrestricted.

Scale removal

Consider a simpler version of the restriction discussed in the previous section, namely

    restrict
        b1 = 1
        b1 + b2 = 0
    end restrict

This restriction comprises a substantive, testable requirement (that β1 and β2 sum to zero) and a normalization or scaling, β1 = 1. The question arises: might it be easier and more reliable to maximize the likelihood without imposing β1 = 1?[10] If so, we could record this normalization, remove it for the purpose of maximizing the likelihood, then reimpose it by scaling the result.

Unfortunately, it is not possible to say in advance whether "scale removal" of this sort will give better results for any particular estimation problem. However, this does seem to be the case more often than not. Gretl therefore performs scale removal where feasible, unless you explicitly forbid this by giving the --no-scaling option flag to the restrict command, or provide a specific vector of initial values, or select the LBFGS algorithm for maximization.

Scale removal is deemed infeasible if there are any cross-column restrictions on β, or any non-homogeneous restrictions involving more than one element of β. In addition, experimentation has suggested to us that scale removal is inadvisable if the system is just identified with the normalizations included, so we do not do it in that case. By "just identified" we mean that the system would not be identified if any of the restrictions were removed. On that criterion the above example is not just identified, since the removal of the second restriction would not affect identification; gretl would in fact perform scale removal in this case, unless the user specified otherwise.

[10] As a numerical matter, that is. In principle, this should make no difference.

Chapter 34 Multivariate models

By a multivariate model we mean one that includes more than one dependent variable. Certain specific types of multivariate model for time-series data are discussed elsewhere: chapter 32 deals with VARs and chapter 33 with VECMs. Here we discuss two general sorts of multivariate model implemented in gretl via the system command: SUR systems (Seemingly Unrelated Regressions), in which all the regressors are taken to be exogenous and interest centers on the covariance of the error term across equations; and simultaneous systems, in which some regressors are assumed to be endogenous.

In this chapter we give an account of the syntax and use of the system command and its companions, restrict and estimate; we also explain the options and accessors available in connection with multivariate models.
34.1 The system command

The specification of a multivariate system takes the form of a block of statements, starting with system and ending with end system. Once a system is specified, it can be estimated via various methods, using the estimate command, with or without restrictions, which may be imposed via the restrict command.

Starting a system block

The first line of a system block may be augmented in either (or both) of two ways.

1. An estimation method is specified for the system. This is done by following system with an expression of the form method=estimator, where estimator must be one of ols (Ordinary Least Squares), tsls (Two-Stage Least Squares), sur (Seemingly Unrelated Regressions), 3sls (Three-Stage Least Squares), liml (Limited Information Maximum Likelihood) or fiml (Full Information Maximum Likelihood). Two examples:

       system method=sur
       system method=fiml

   OLS, TSLS and LIML are, of course, single-equation methods rather than true system estimators; they are included to facilitate comparisons.

2. The system is assigned a name. This is done by giving the name first, followed by a back-arrow, "<-", followed by system. If the name contains spaces, it must be enclosed in double quotes. Here are two examples:

       sys1 <- system
       "System 1" <- system

   Note, however, that this naming method is not available within a user-defined function, only in the main body of a gretl script.

If the initial system line is augmented in the first way, the effect is that the system is estimated as soon as its definition is completed, using the specified method. The effect of the second option is that the system can then be referenced by the assigned name for the purposes of the restrict and estimate commands; in the gretl GUI an additional effect is that an icon for the system is added to the "Session view".

These two possibilities can be combined, as in

    mysys <- system method=3sls

In this example, the system is estimated immediately via Three-Stage Least Squares, and is also available for subsequent use under the name mysys.

If the system is not named via the back-arrow mechanism, it is still available for subsequent use via restrict and estimate; in this case you should use the generic name $system to refer to the last-defined multivariate system.

The body of a system block

The most basic element in the body of a system block is the equation statement, which is used to specify each equation within the system. This takes the same form as the regression specification for single-equation estimators, namely a list of series with the dependent variable given first, followed by the regressors, with the series given either by name or by ID number (order in the dataset).

A system block must contain at least two equation statements, and for systems without endogenous regressors, these statements are all that is required. So, for example, a minimal SUR specification might look like this:

    system method=sur
        equation y1 const x1
        equation y2 const x2
    end system

For simultaneous systems it is necessary to determine which regressors are endogenous and which exogenous. By default, all regressors are treated as exogenous, except that any variable that appears as the dependent variable in one equation is automatically treated as endogenous if it appears as a regressor elsewhere. However, an explicit list of endogenous regressors may be supplied following the equation lines: this takes the form of the keyword endog, followed by the names or ID numbers of the relevant regressors.
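To fix ideas, here is a small sketch of our own (the specification is arbitrary and purely illustrative) that names a simultaneous system, supplies an explicit endog list, and then estimates it by two different methods:

    open denmark
    mysys <- system
        equation LRM const LRY IBO
        equation LRY const LRM IDE
        endog LRM LRY
    end system
    estimate mysys method=3sls
    estimate mysys method=fiml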
When estimation is via TSLS or 3SLS, it is possible to specify a particular set of instruments for each equation. This is done by giving the equation lists in the format used with the tsls command: first the dependent variable, then the regressors, then a semicolon followed by the instruments, as in

    system method=3sls
        equation y1 const x11 x12 ; const x11 z1
        equation y2 const x21 x22 ; const x21 z2
    end system

An alternative way of specifying instruments is to insert an extra line starting with instr, followed by the list of variables acting as instruments. This is especially useful for specifying the system with the equations keyword (see the following subsection). As in tsls, any regressors that are not also listed as instruments are treated as endogenous; so in the example above, x11 and x21 are treated as exogenous, while x12 and x22 are endogenous and instrumented by z1 and z2, respectively.

One more sort of statement is allowed in a system block: that is, the keyword identity, followed by an equation that defines an accounting relationship, rather than a stochastic one, between variables. For example,

    identity Y = C + I + G + X

There can be more than one identity in a system block. But note that these statements are specific to estimation via FIML; they are ignored for other estimators.

34.2 Equation systems within functions

It is also possible to define a multivariate system in a programmatic way. This is useful if the precise specification of the system depends on some input parameters that are not known in advance, but are given when the script is actually run.

The relevant syntax is given by the equations keyword (note the plural), which replaces the block of equation lines in the standard form. This keyword must be followed by two arguments. The first is a named list containing all series on the left-hand side of the system, which determines the number of equations in the system. The nature of the second argument depends on whether or not the list of regressors is in common for all equations (as in SUR):

- Common regressors: a second named list.
- Differing regressors: an array of lists, one per equation.

The first case is straightforward; the second requires a little more explanation. Suppose we have a two-equation system, with regressors given by the lists xlist1 and xlist2. We can then define a suitable array as follows:

    lists Xlists = defarray(xlist1, xlist2)

See section 11.8 for alternative ways of building an array. Specifying a system generically in this way therefore just involves building the necessary list arguments, as shown in the following example:

    open denmark
    list LHS = LRM LRY
    list RHS1 = const LRM(-1) IBO(-1) IDE(-1)
    list RHS2 = const LRY(-1) IBO(-1)
    lists RHS = defarray(RHS1, RHS2)
    system method=ols
        equations LHS RHS
    end system

As mentioned above, the option of assigning a specific name to a system is not available within functions, but the generic identifier $system can be used to similar effect. The following example illustrates how one can define a system, estimate it via two methods, apply a restriction, then re-estimate it subject to the restriction:

    function void anonsys (series x, series y)
        system
            equation x const
            equation y const
        end system
        estimate $system method=ols
        estimate $system method=sur
        restrict $system
            b[1,1] - b[2,1] = 0
        end restrict
        estimate $system method=ols
    end function
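Building on this, the equations mechanism extends naturally to function arguments, since list arrays can be passed as parameters of type lists. A minimal sketch of our own:

    function void sur_by_lists (list LHS, lists RHS)
        system method=sur
            equations LHS RHS
        end system
    end function

    open denmark
    list LHS = LRM LRY
    list RHS1 = const LRM(-1) IBO(-1)
    list RHS2 = const LRY(-1) IBO(-1)
    lists RHS = defarray(RHS1, RHS2)
    sur_by_lists(LHS, RHS)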
The above account has to be qualified for the case where a system is set up for estimation via TSLS or 3SLS using a specific list of instruments per equation, as described in section 34.1. In that case it is possible to include more endogenous regressors than explicit equations (although, of course, there must be sufficient instruments to achieve identification). In such systems, endogenous regressors that have no associated explicit equation are treated "as if" exogenous when constructing the structural-form matrices. This means that forecasts are conditional on the observed values of the "extra" endogenous regressors, rather than solely on the values of the exogenous and predetermined variables.

As for simulation, gretl does not provide a native command for generating simulated data from a multi-equation system, but this is relatively easily accomplished by means of scripting. Listing 34.1 gives an example based on a 3-variable system.[1] All equations contain lagged endogenous variables, but the equation for consumption at time t also contains income at time t as an explanatory variable. This makes the system simultaneous, so we use FIML as the estimation method.

Once the system is estimated, we store its results in a bundle named sys, so as to make it easier to retrieve certain quantities used in the remainder of the script. First, we compute the reduced-form matrices by using the Gamma, A and B bundle elements. Of course, simulation needs values for the exogenous variables, which are easy to create in a system such as this, where all the exogenous variables are deterministic. The simulation horizon is set, for this example, at 12 periods.

Subsequently, structural-form disturbances are drawn randomly from a multivariate normal distribution with mean 0 and variance equal to the estimated covariance matrix Σ̂, available as the sigma element of the sys bundle. These are then mapped to reduced-form innovations via the relationship v_t = Γ^{-1} ε_t. Finally, all these ingredients are combined to produce the simulated values with the varsimul function. Note that initial values for the VAR recursion are taken from the latest available data.

Running the script should produce the following set of simulated values:

    Sim (14 x 3)

      13.887   12.874   14.508
      13.889   12.877   14.515
      13.893   12.880   14.520
      13.895   12.885   14.518
      13.895   12.894   14.517
      13.900   12.902   14.520
      13.907   12.908   14.525
      13.917   12.910   14.534
      13.920   12.911   14.539
      13.919   12.906   14.547
      13.934   12.910   14.567
      13.935   12.908   14.575
      13.942   12.913   14.581
      13.944   12.916   14.583

[1] Note: the system of equations that is being estimated here is not meant to stand for a realistic model of the European economy. It is just set up in such a way as to provide a simple example.

Listing 34.1: Simulation from a simultaneous equation system

    set verbose off
    set seed 131020

    # load the data and generate the variables
    open AWM18.gdt --quiet
    series Con = log(PCR)
    series Inv = log(GCR)
    series Inc = log(YER)
    list EXO = const time

    # estimate the system via FIML
    system method=fiml
        equation Con EXO Con(-1) Inc(0 to -1)
        equation Inv EXO Inv(-1) Inc(-1)
        equation Inc EXO Inc(-1 to -2) Inv(-1)
    end system
    bundle sys = $system   # save the estimated system to a bundle

    # compute the reduced form VAR representation
    matrix iG = inv(sys.Gamma)
    matrix rfA = iG * sys.A
    matrix rfB = iG * sys.B

    # produce the simulation
    scalar horizon = 12

    # retrieve a few magnitudes from the estimated system
    scalar g = sys.neqns         # number of equations
    scalar p = cols(sys.A) / g   # maximum lag

    # future values of the exogenous variables
    matrix SimExo = ones(horizon, 1) ~ seq($nobs+1, $nobs+horizon)'
    matrix X = SimExo * rfB'

    # simulated disturbances: structural form first,
    # then mapped to reduced form
    matrix E = mnormal(horizon, g) * cholesky(sys.sigma)'
    matrix V = E * iG'

    # initial values
    list ENDO = Con Inv Inc
    matrix init = {ENDO}[$nobs-p+1:,]

    # perform simulation
    Sim = varsimul(rfA, X + V, init)
    print Sim
Chapter 35 Forecasting

35.1 Introduction

In some econometric contexts forecasting is the prime objective: one wants estimates of the future values of certain variables to reduce the uncertainty attaching to current decision making. In other contexts, where real-time forecasting is not the focus, prediction may nonetheless be an important moment in the analysis. For example, out-of-sample prediction can provide a useful check on the validity of an econometric model. In other cases we are interested in questions of "what if": for example, how might macroeconomic outcomes have differed over a certain period if a different policy had been pursued? In the latter cases, "prediction" need not be a matter of actually projecting into the future, but in any case it involves generating fitted values from a given model. The term "postdiction" might be more accurate, but it is not commonly used; we tend to talk of prediction even when there is no true forecast in view.

This chapter offers an overview of the methods available within gretl for forecasting or prediction (whether forward in time or not), and explicates some of the finer points of the relevant commands.

35.2 Saving and inspecting fitted values

In the simplest case, the "predictions" of interest are just the (within-sample) fitted values from an econometric model. For the single-equation linear model y_t = X_t β + u_t, these are ŷ_t = X_t β̂.

In command-line mode, the ŷ series can be retrieved, after estimating a model, using the accessor $yhat, as in

    series yh = $yhat

If the model in question takes the form of a system of equations, $yhat returns a matrix, each column of which contains the fitted values for a particular dependent variable. To extract the fitted series for, e.g., the dependent variable in the second equation, do

    matrix Yh = $yhat
    series yh2 = Yh[,2]

Having obtained a series of fitted values, you can use the fcstats function to produce a vector of statistics that characterize the accuracy of the predictions (see section 35.4 below).
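For instance (a minimal sketch; the variable names are arbitrary):

    ols y const x1 x2
    series yh = $yhat
    matrix st = fcstats(y, yh)   # ME, RMSE, MAE and further accuracy measures
    print st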
The gretl GUI offers several ways of accessing and examining within-sample predictions. In the model display window, the Save menu contains an item for saving fitted values, the Graphs menu allows plotting of fitted versus actual values, and the Analysis menu offers a display of actual, fitted and residual values.

35.3 The fcast command

The fcast command, and its equivalent GUI invocation (see below), generates predictions based on the last estimated model. Several questions arise here: How to control the range over which predictions are generated? How to control the forecasting method (where a choice is available)? How to control the printing and/or saving of the results? Basic answers can be found in the Gretl Command Reference; we add some more details here.

The forecast range

The range defaults to the currently defined sample range. If this remains unchanged following estimation of the model in question, the forecast will be "within sample", and (with some qualifications noted below) it will essentially duplicate the information available via the retrieval of fitted values (see section 35.2 above).

A common situation is that a model is estimated over a given sample, and then forecasts are wanted for a subsequent out-of-sample range. The simplest way to accomplish this is via the --out-of-sample option to fcast. For example, assuming we have a quarterly time-series dataset containing observations from 1980:1 to 2008:4, four of which are to be reserved for forecasting:

    # reserve the last 4 observations
    smpl 1980:1 2007:4
    ols y 0 xlist
    fcast --out-of-sample

This will generate a forecast from 2008:1 to 2008:4.

There are two other ways of adjusting the forecast range, offering finer control:

- Use the smpl command to adjust the sample range prior to invoking fcast.
- Use the optional startobs and endobs arguments to fcast (which should come right after the command word). These values set the forecast range independently of the sample range.

What if one wants to generate a true forecast that goes beyond the available data? In that case, one can use the dataset command with the addobs parameter to add extra observations before forecasting. For example:

    # use the entire dataset, which ends in 2008:4
    ols y 0 xlist
    dataset addobs 4
    fcast 2009:1 2009:4

But this will work as stated only if the set of regressors in xlist does not contain any stochastic regressors other than lags of y. The dataset addobs command attempts to detect and extrapolate certain common deterministic variables (e.g., time trend, periodic dummy variables). In addition, lagged values of the dependent variable can be supported via a dynamic forecast (see below for discussion of the static/dynamic distinction). But "future" values of any other included regressors must be supplied before such a forecast is possible. Note that specific values in a series can be set directly by date, for example:

    x1[2009:1] = 120.5

Or, if the assumption of no change in the regressors is warranted, one can do something like this:

    loop t=2009:1..2009:4
        loop foreach i xlist
            $i[t] = $i[2008:4]
        endloop
    endloop

In single-equation OLS models, a --recursive forecast option is also available, expanding the estimation sample one observation at a time and recalculating the forecasts again and again for the constantly updated information set. In this case a number must be given specifying how many periods ahead should be forecast for each of the estimation samples. Note that only this k-steps-ahead forecast will be printed (or be accessible in $fcast), not the interim values from step 1 through k - 1 (if k > 1). If those interim values are also needed, then several fcast --recursive rounds would have to be done with different steps-ahead numbers.
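By way of illustration, the following sketch (the dates and variables are made up, and we assume the steps-ahead count is given right after the range arguments) requests 2-step-ahead recursive forecasts:

    smpl 1980:1 2007:4
    ols y const y(-1) x
    fcast 1990:1 2007:4 2 --recursive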
Static and dynamic forecasts

The distinction between static and dynamic forecasts applies only to dynamic models, i.e., those that feature one or more lags of the dependent variable. The simplest case is the AR(1) model,

    y_t = α_0 + α_1 y_{t-1} + ϵ_t    (35.1)

In some cases the presence of a lagged dependent variable is implicit in the dynamics of the error term, for example

    y_t = β + u_t
    u_t = ρ u_{t-1} + ϵ_t

which implies that

    y_t = (1 - ρ) β + ρ y_{t-1} + ϵ_t

Suppose we want to forecast y for period s using a dynamic model, say (35.1) for example. If we have data on y available for period s - 1, we could form a fitted value in the usual way: ŷ_s = α̂_0 + α̂_1 y_{s-1}. But suppose that data are available only up to s - 2. In that case we can apply the chain rule of forecasting:

    ŷ_{s-1} = α̂_0 + α̂_1 y_{s-2}
    ŷ_s = α̂_0 + α̂_1 ŷ_{s-1}

This is what is called a dynamic forecast. A static forecast, on the other hand, is simply a fitted value, even if it happens to be computed out-of-sample.

Printing, displaying and saving forecasts

When working from the GUI, the way to perform and access forecasts is to first estimate a model (with some inherently dynamic features), and then, in the model window, navigate to the Forecasts entry in the Analysis menu. If some out-of-sample observations are already available (see above), a dialog window is presented where the discussed forecasting options can be chosen by pointing and clicking. Executing the forecasts then automatically yields two result windows: one with a time-series plot of the forecasts along with their confidence bands (if those were chosen), and another with tabular output. The produced plot can be saved to the current session or exported, like any other plot in gretl, by right-clicking. Notice that in the textual result window there is a button at the top which offers to save the point forecasts and their standard errors as new series to the active dataset.

In a command-line context, the fcast command automatically prints out the tables with the produced forecasts, their standard errors and associated confidence intervals, unless you wish to suppress this verbose output with the options --stats-only or --quiet. The former option restricts output to the forecast evaluation statistics, as explained in the next section; the latter option silences output altogether. Another accepted syntax variant is to supply the name of a new series for the point forecasts after the fcast command, as for example in

    fcast Yfc --out-of-sample

At the same time, this also suppresses the printout.

Accessing and saving the produced forecast time series, along with the estimated standard errors, also works through the $fcast and $fcse accessors, available after fcast execution. These return vectors as gretl matrix objects, not series, so if you want to add the results to the dataset in this way you would have to set the active sample to the forecast range first. You can, of course, first access and store the matrices and then later, after resetting the sample, assign them to series. Note that the estimated standard errors do not incorporate parameter uncertainty in the case of dynamic models.
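In concrete terms, something along these lines (a sketch; the date ranges are those of the earlier example):

    smpl 1980:1 2007:4
    ols y 0 xlist
    fcast --out-of-sample --quiet
    matrix fc = $fcast
    matrix se = $fcse
    # move to the forecast range before assigning to series
    smpl 2008:1 2008:4
    series y_fc = fc
    series y_se = se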
If you want to create forecast plots within a script, the relevant option has to be appended to the fcast command: as explained in the command reference, specify --plot=filename.

[...]

    For 95% confidence intervals, t(140, 0.025) = 1.977

         LHUR    prediction   std. error        95% interval

    1999:1  4.300000   4.335004   0.222784    3.894549 - 4.775460
    1999:2  4.300000   4.312724   0.401960    3.518028 - 5.107421
    1999:3  4.233333   4.272764   0.539582    3.205982 - 5.339547
    1999:4  4.100000   4.223213   0.642001    2.953943 - 5.492482

    Forecast evaluation statistics

    Mean Error                      -0.052593
    Root Mean Squared Error          0.067311
    Mean Absolute Error              0.052593
    Mean Percentage Error           -1.2616
    Mean Absolute Percentage Error   1.2616
    Theil's U2                       0.87334
    Bias proportion, UM              0.61049
    Regression proportion, UR        0.29203
    Disturbance proportion, UD       0.097478

         INFL    prediction   std. error        95% interval

    1999:1  1.651245   1.812250   0.431335    0.959479 - 2.665022
    1999:2  2.048545   2.088185   0.777834    0.550366 - 3.626004
    1999:3  2.298952   2.266445   1.075855    0.139423 - 4.393467
    1999:4  2.604836   2.610037   1.409676   -0.176969 - 5.397043

    Forecast evaluation statistics

    Mean Error                      -0.043335
    Root Mean Squared Error          0.084525
    Mean Absolute Error              0.059588
    Mean Percentage Error           -2.6178
    Mean Absolute Percentage Error   3.3248
    Theil's U2                       0.095932
    Bias proportion, UM              0.26285
    Regression proportion, UR        0.45311
    Disturbance proportion, UD       0.28404

One of the main differences from the single-equation case is that specifying a variable name after the fcast command does not mean that something will be saved under that name; here it serves to pick one of the N variables of the VAR for printing out the forecasts. That leaves only the $fcast and $fcse accessors to obtain and save the produced forecasts; in this system case the returned matrix objects will have as many columns as equations.

In the GUI, the relevant menu entry is again "Forecasts" in the Analysis menu, in the window of the estimated VAR model. Here the user must pick the variable of interest, after which a dialog window with the relevant options is presented. As in the single-equation context, a plot window and a textual output window are created. Again, forecast series can be added to the dataset through the button at the top, and the plot can be saved or exported.

Special VAR cases: exogenous variables, cointegration

It may be worth noting that when a VAR is specified with additional non-deterministic exogenous regressors, a similar issue as with single equations arises: the forecast is conditional, and requires some assumptions about the development of those regressors out of sample. As before, these values can be easily filled in after the dataset has been extended with the observations for the forecasting sample; but naturally, only the user, not gretl, can and must decide what those values should be. This includes hand-crafted deterministic variables like shift dummies; on the other hand, standard deterministic terms like trends and seasonals will be extrapolated by gretl automatically.

Using a cointegrated VAR model, with gretl's vecm command, does not change the way a forecast is obtained afterwards. The VECM can be internally represented as a VAR in levels that automatically contains the reduced-rank restrictions of cointegration, and this VAR form is then used to calculate the forecasts. Providing forecast standard errors and the associated confidence bands is also straightforward, since only the innovation uncertainty is captured in those. This ease of use also carries over to the situation where a VECM with additional exogenous terms is used for forecasting, provided that future values of the exogenous variables are specified, of course.
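For example, a minimal sketch of an out-of-sample VECM forecast on the denmark data (the holdback split is our own choice):

    open denmark
    smpl ; 1986:4                 # reserve the final observations
    vecm 2 1 LRM LRY IBO IDE --rc --seasonals
    fcast --out-of-sample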
Chapter 36 State Space Modeling

36.1 Introduction

This chapter describes the handling of linear state space models in gretl 2022b and higher.¹ Here is a brief high-level overview of gretl's Kalman apparatus:

- To obtain a Kalman structure, in the form of a bundle, you use the ksetup function.
- Having obtained such a bundle, you can then adjust its contents, as described in detail below.
- You then do things with your state space model via the functions kfilter (forecasting), ksmooth (state smoothing) and/or kdsmooth (disturbance smoothing).

¹ The user interface was substantially different prior to version 2017a. For example, be aware that Lucchetti (2011) is based on the old syntax. If anyone needs documentation for the original interface, it can be found at http://gretl.sourceforge.net/papers/kalman_old.pdf. Additional functionality relating to exact diffuse initialization of the Kalman filter was added in version 2022b.

36.2 Notation

In this document our basic representation of a state space model is given by the following pair of equations:

    y_t = Z_t α_t + ε_t          (36.1)
    α_{t+1} = T_t α_t + η_t      (36.2)

where (36.1) is the observation or measurement equation and (36.2) is the state transition equation. The state vector α_t is (r × 1) and the vector of observables y_t is (n × 1); the (n × 1) vector ε_t and the (r × 1) vector η_t are assumed to be vector Gaussian white noise:

    E(ε_t ε_s') = Σ_t for t = s, otherwise 0
    E(η_t η_s') = Ω_t for t = s, otherwise 0

The number of time-series observations is denoted by N. In the case where Z_t = Z, T_t = T, Σ_t = Σ and Ω_t = Ω for all t, the model is said to be time-invariant. We assume time-invariance in much of what follows, but discuss the time-varying case, along with other extensions of the basic model, in section 36.9.

36.3 Defining the model as a bundle

The ksetup function is used to initialize a state space model by specifying only its indispensable elements: the observables and their link to the unobserved state vector, plus the law of motion for the latter and the covariance matrix of its innovations. Therefore the function takes a minimum of four arguments. The corresponding bundle keys are as follows:

    Symbol   Dimensions   Reserved key
    y        N × n        obsy
    Z        n × r        obsymat
    T        r × r        statemat
    Ω        r × r        statevar

Please note that the matrix Z in the observation equation must be given in transposed form. This is required to preserve compatibility with gretl versions prior to 2022a. Correspondingly, if you retrieve this matrix using its key obsymat, it is the transpose you actually obtain.

The names of these input matrices don't matter; in fact they may be anonymous matrices constructed on the fly. But if and when you wish to copy them out of the bundle, you must use the specified keys, as in

    matrix Z = SSmod.obsymat
    matrix T = SSmod.statemat

Although all the arguments are in principle matrices, as a convenience you may give obsy as a series or list of series, and the other arguments can be given as scalars if in context they are 1 × 1.

If applicable, you may specify any of the following optional input matrices:²

    Symbol   Dimensions   Key        If omitted
    Σ        n × n        obsvar     no disturbance term in the observation equation
    α_0      r × 1        inistate   α_0 is a zero vector
    P_0      r × r        inivar     P_0 is set automatically

These matrices are not passed to ksetup; rather, you add them to the bundle returned by ksetup under their reserved keys, just as you usually add elements to a bundle, for example

    SSmod.obsvar = Veps

Naturally, the arguments you pass to ksetup must have mutually compatible dimensions, otherwise an error is returned. Once setup is complete, the dimensions of the model (r, n and N) become available as scalar members of the bundle, under their own names.

² Additional optional matrices are described in section 36.9 below.

In case inivar is not specified, the matrix P_{1|0} will be automatically initialized by gretl only if all the eigenvalues of T lie inside the unit circle, so that the model is stationary. In this case the variance for the marginal distribution of α_t is well defined and the initializer is computed as

    vec(P_{1|0}) = [I − T ⊗ T]⁻¹ vec(Ω)
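For intuition, this stationary initializer is easy to reproduce by hand; here is a minimal sketch, in which the example matrices T (with eigenvalues inside the unit circle) and Ω are arbitrary illustrative values:

    matrix T = {0.7, 0.2; 0.1, 0.5}   # example transition matrix
    matrix Omega = {0.3, 0; 0, 0.1}   # example state innovation variance
    scalar r = rows(T)
    # vec(P) = inverse(I - T kron T) * vec(Omega); ** is gretl's Kronecker product
    matrix P = mshape(inv(I(r^2) - T ** T) * vec(Omega), r, r)
    print P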
If the above condition is not satisfied, you will have to make a choice about which technique to use for diffuse initialization. In section 36.8 we provide a fuller discussion of the various options, but here is probably the bottom line for many users. In earlier versions of gretl a rather crude solution was adopted: initializing P_{1|0} to a numerically large matrix. This was accomplished by setting a value of 1 on the bundle under the reserved key diffuse. From gretl version 2022b on, if you have scripts where you set diffuse=1 on your Kalman bundle, you can now try diffuse=2 instead. This invokes the new "exact initial" method for state space models with a diffuse initializer. Don't expect identical results from the new code, but to the extent that results differ, the new ones should be somewhat more accurate. If results differ wildly, you've probably found a bug; please report it. You may also find that the new code is faster; it should be less likely to get hung up on numerical problems that delay or prevent convergence of ML estimation.

... in which case P_t is not even defined, or simply out of lack of information. In that case there are two possible approaches. The traditional one, used by gretl up to version 2022a, is to ascribe a very large variance to the initial P_t, as in P_0 = κ I_r, where κ is, say, 10⁷. You can impose this diffuse prior by setting

    SSmod.diffuse = 1

In some cases this strategy may lead to numerical problems. It may then be helpful to specify a diffuse initializer via inivar, using a somewhat smaller value of κ, as in

    SSmod.inivar = 1.0e5 * I(stdim)

where stdim is the dimension of the state.

While the κ·I approach works fairly well in many cases, it is nowadays generally deprecated in favor of one or other "exact initial" method. Such methods depend on derivation of the properties of the Kalman filter and smoother in the limit, as the aforementioned very large variance tends to infinity. In libgretl we have implemented two such methods: the "univariate approach to multivariate observables" advocated by Durbin and Koopman (2012), and the "augmented Kalman" method set out by de Jong (1991) and de Jong and Chu-Chun-Lin (2003).³ We'll refer to them via the labels univariate and dejong, respectively.

³ The first of these is used in the KFAS package for R (Helske, 2017) and the second by the sspace command in Stata; see https://www.stata.com/manuals/tssspace.pdf.

Exact diffuse methods

The univariate approach handles a vector observable by unpacking it and substituting scalar calculations for matrix ones, so far as possible. Durbin and Koopman claim it is faster than the alternatives. It is also able to deal in a straightforward way with incomplete observations: where some but not all elements of y_t are missing at time t, it can utilize any non-missing elements while ignoring the missing ones. However, it runs into complications if (a) the variance matrix of the observation disturbances is not diagonal and/or (b) the disturbances are correlated between the state and observation equations. Case (a) can be handled at the cost of some extra preliminary computation (transforming y and Z to induce a diagonal variance matrix), and this is automatically carried out by gretl if needed. Handling case (b) is more bothersome, requiring augmentation of the state; at present this is not supported in gretl.
The dejong approach has no problem with the variance cases (a) and (b) mentioned above. However, it is not clear how incomplete observations can be handled, and at present observations with any missing elements are ignored.

In short, there are cases where univariate may work best, and other cases that are not handled by univariate but where dejong works fine. Hence our decision to implement both methods. Table 36.1 sets out the various cases that arise via combination of "code" (where "legacy" indicates the Kalman code as of gretl 2022a) and "diffuse status" (i.e. whether the model is diffuse and, if so, how it is handled).

Note that although the primary virtue of univariate and dejong is their handling of the exact diffuse case, these methods can also handle the non-diffuse case and the traditional κ-diffuse case. The case used depends on various points, the primary one being the diffuse integer member of the state space bundle, which defaults to 0 but can be set to 1 or 2:

- diffuse=0: case 1 is the default (for backward compatibility), but case 4 or 7 can be selected by adding univariate=1 or dejong=1 to the bundle.
- diffuse=1: case 2 is the default, but case 5 or 8 can be selected as above.
- diffuse=2: the default is case 6, but it can be switched to 9 via dejong=1.

                  non-diffuse   κ-diffuse   exact diffuse
    code          diffuse=0     diffuse=1   diffuse=2
    legacy        1             2           -
    univariate    4             5           6
    dejong        7             8           9

    Table 36.1: Cross-tabulation of code-path and diffuse status. Numbers in cells are used for reference in the text; "legacy" indicates gretl 2022a or earlier.

For cases in the same column (namely 1/4/7, 2/5/8 and 6/9), results from kfilter, ksmooth and kdsmooth should in principle be the same across the code-paths, but in practice there are bound to be slight differences, due to the different algorithms employed. And note that slight differences at that level may be somewhat amplified by iterated filtering, as in ML estimation.

36.9 Extensions and refinements

Regressors in the observation equation

The observation equation (36.1) can be augmented to allow for the effect of a k-vector of observable exogenous variables x_t, in addition to that of the unobserved state, as in

    y_t = B_t x_t + Z_t α_t + ε_t

This specification can be added to a bundle previously obtained via ksetup by use of the keys obsx for x and obsxmat for B. In that case obsx must be an N × k matrix and B must be n × k. But please note: as with the case of Z described above, backward compatibility dictates that obsxmat be given in transposed form.

An exception to this dimensionality rule is granted for convenience. If the observation equation includes a constant, but no additional exogenous variables, you can give B as n × 1 without having to specify obsx. More generally, if the column dimension of B is 1 greater than k, it is assumed that the first element of B is associated with an implicit column of ones.

Intercept in the state equation

In some applications it may be useful to have an intercept in the state transition equation, thus generalizing equation (36.2) to

    α_{t+1} = μ_t + T_t α_t + η_t

The term μ is never strictly necessary; the system (36.1) and (36.2) can absorb it as an extra, non-time-varying element in the state vector. However, this comes at the cost of expanding all the matrices that touch the state (α, T, η, Ω, Z), making the model relatively awkward to formulate and forecasts more expensive to compute. We therefore adopt the convention above on practical grounds. The (r × 1) vector μ can be added to a bundle under the key stconst. Despite its name, this matrix can be specified as time-varying, as explained in the next section.
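A hedged sketch of both extensions, assuming a previously created bundle SSmod with n = 1 and r = 2; the series names and the numerical values are purely illustrative:

    matrix X = {x1, x2}           # N x 2 matrix built from two exogenous series
    SSmod.obsx = X
    SSmod.obsxmat = {0.5; -0.2}   # B, supplied in transposed form (see above)
    SSmod.stconst = {0.1; 0}      # r x 1 intercept in the state equation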
Time-varying matrices

Any or all of the matrices obsymat, obsxmat, obsvar, statemat, statevar and stconst may be time-varying. In that case you must supply the name of a function to be called to update the matrix or matrices in question; you add this to the bundle as a string, under the key timevar_call.⁴ For example, if just obsymat (Z_t) should be updated by a function named TVZ, you would write

    SSmod.timevar_call = "TVZ"

⁴ The choice of the name for the function itself is of course totally up to the user.

The function that plays this role will be called at each time-step of the filtering or simulation operation, prior to performing any calculations. It should have a single bundle-pointer parameter, by means of which it will be passed a pointer to the Kalman bundle to which the call is attached. Its return value, if any, will not be used (so generally its return type is void). However, you can use gretl's funcerr keyword to raise an error if things seem to be going wrong; see chapter 14 for details.

Besides the bundle members noted above, a time-variation function has access to the current (1-based) time step, under the reserved key t, and to the n-vector containing the forecast error from the previous time step, v_{t−1}, under the key uhat (when t = 1 the latter will be a zero vector). If any additional information is needed for performing the update, it can be placed in the bundle under a user-specified key. So, for example, a simple updater for a 1 × 1 Z matrix might look like this:

    function void TVZ (bundle *b)
        b.obsymat = b.Zvals[b.t]
    end function

where b.Zvals is a bundled N-vector. An updater that operates on both Z (n × r) and T (r × r) might be

    function void update2 (bundle *b)
        b.obsymat = mshape(b.Zvals[b.t,], b.r, b.n)
        b.statemat = unvech(b.Tvals[b.t,]')
    end function

where in this case we assume that b.Zvals is N × rn, with row t holding the transposed vec of Z_t, and b.Tvals is N × r(r+1)/2, with row t holding the vech of T_t.

Simpler variants (e.g. just one element of the relevant matrix is changed) and more complex variants (say, involving some sort of conditionality) are also possible in this framework.

It is worth noting that this setup lends itself to a much wider scope than time-varying system matrices. In fact, this syntax allows for the possibility of executing user-defined operations at each step: the function that goes under timevar_call can read all the elements of the model bundle and can modify several of them, the system matrices (which can therefore be made time-varying) as well as the user-defined elements. An extended example of use of the time-variation facility is presented in section 36.12.

Cross-correlated disturbances

The formulation given in equations (36.1) and (36.2) assumes mutual independence of the disturbances in the state and observation equations, ε_t and η_t. This assumption holds good in many practical applications, but in some cases one may wish to allow for cross-correlation. More generally, we note three common representations of the variance of the disturbances in (36.1) and (36.2):

1. The basic representation: ε_t and η_t are assumed to be mutually uncorrelated, and we write their respective (possibly time-varying) variance matrices as V(ε_t) (n × n) and V(η_t) (r × r).

2. The de Jong representation: write ε_t = G_t ν_t and η_t = H_t ν_t, where G_t is n × p, H_t is r × p, and p is the length of the underlying disturbance vector ν_t. This formulation allows for correlation of the disturbances across the equations, if H_t G_t' is non-zero.
3. The Durbin-Koopman representation: as in the first case, assume that the disturbances are uncorrelated across the equations, but write

    η_t = R_t ξ_t,   V(η_t) = R_t Q_t R_t'

where R_t is a "selection" matrix and Q_t = V(ξ_t). Let m ≤ r denote the dimension of ξ_t: then Q_t is m × m and R_t is r × m. This allows for the possibility that there are fewer disturbances to the state than elements of the state vector.

With the de Jong representation, in place of (36.1)-(36.2) we may write

    y_t = Z_t α_t + G_t ν_t
    α_{t+1} = T_t α_t + H_t ν_t

In that case we may re-express the variance matrices from section 36.2 above as

    Σ_t = G_t G_t',   Ω_t = H_t H_t'

with the addition of Cov(η_t, ε_t) = H_t G_t'.

You can select the de Jong or Durbin-Koopman representation by supplying extra arguments to the ksetup function. For the de Jong version, in place of giving Ω you should give the two matrices identified above as H and G, as in

    bundle SSxmod = ksetup(y, Z, T, H, G)

and in case you wish to retrieve or update information on the variance of the disturbances, note that in the cross-correlated case the bundle keys statevar and obsvar are taken to designate the factors H and G, respectively.

To select the Durbin-Koopman representation, a sixth (boolean) argument must be used. If that has a non-zero value, statevar is taken to be Q and the fifth argument is taken to be R. Note that in this case obsvar should be added separately, as in the basic case. The following statements illustrate the three cases:

    # basic
    bundle kb1 = ksetup(y, Z, T, Veta)
    kb1.obsvar = Veps  # if wanted

    # de Jong
    bundle kb2 = ksetup(y, Z, T, H, G)

    # Durbin-Koopman
    bundle kb3 = ksetup(y, Z, T, Q, R, 1)
    kb3.obsvar = Veps  # if wanted

36.10 The ksimul function

This simulation function has as its required arguments a pointer to a Kalman bundle and a matrix containing artificial disturbances, and it returns a matrix of simulation results. An optional trailing boolean argument is supported, the purpose of which is explained below.

If the disturbances are not cross-correlated, the matrix argument must be either N × r (if there is no disturbance in the observation equation) or N × (r + n) (if the Σ = obsvar matrix is specified). Row t holds either η_t' or (η_t', ε_t'). Note that if Ω (statevar) is not simply an identity matrix, you will have to scale the artificial state disturbances appropriately; the same goes for Σ and the observation disturbances.

α_1 defaults to the value given under the key inistate or, if that in turn is not present, to a zero vector. Alternatively, the starting point can be made stochastic. To do this you can emulate the procedure followed by SsfPack, namely setting α_1 = a + A v_0, where a is a non-stochastic r-vector, v_0 is an r-vector of standard normal random numbers, and A is a matrix such that AA' = P_0. Let's say we have a state-space bundle b on which we have already set suitable values of inistate (corresponding to a above) and inivar (P_0). To perform a simulation with a stochastic starting point we can set α_1 thus:

    matrix A = psdroot(b.inivar)
    b.simstart = b.inistate + A * mnormal(b.r, 1)
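Putting the pieces together, here is a hedged sketch of a simulation call on the basic bundle kb1 from above; using psdroot to scale the artificial disturbances is one reasonable choice, not the only one:

    # eta draws, scaled so that V(eta) = Omega; likewise for epsilon
    matrix U = mnormal(kb1.N, kb1.r) * psdroot(kb1.statevar)'
    matrix E = mnormal(kb1.N, kb1.n) * psdroot(kb1.obsvar)'
    matrix sim = ksimul(&kb1, U ~ E)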
36.11 Numerical optimization

If the object of using a state space model is to produce maximum likelihood estimates of some parameters of interest, note that the log-likelihood surface may be quite awkward (far from globally concave), posing a challenge for numerical methods such as BFGS, the default maximizer under gretl's mle command. Symptoms may include failure of convergence (typically due to an excessive computed gradient, even as the maximizer cannot find an improvement in the objective function) or an excessive number of iterations. In such cases it is worth considering the following points:

- In some cases, scaling the observables may help: if the order of magnitude of y_t is too small or too large, floating-point precision may become an issue for estimating variances.
- If you can obtain plausible initial values for the parameters, things are likely to go better than starting with arbitrary values.
- The limited-memory version of BFGS (L-BFGS) may work better than the standard version in some cases. To engage this, issue the command "set lbfgs on" prior to ML estimation.
- It may be helpful to employ a more accurate, but computationally more expensive, method for computing the gradient, namely Richardson extrapolation. Here the command is "set bfgs_richardson on".

36.12 Example scripts

This section presents a selection of short sample scripts, to illustrate the most important points covered in this chapter.

ARMA estimation

Functions illustrated in this example: ksetup, kfilter.

As is well known, the Kalman filter provides a very efficient way to compute the likelihood of ARMA models; as an example, take an ARMA(1,1) model,

    y_t = φ y_{t−1} + ε_t + θ ε_{t−1}

Listing 36.1: ARMA estimation

    function void arma11_via_kalman (series y)
        # parameter initialization
        scalar phi = 0
        scalar theta = 0
        scalar sigma = 1
        # state-space model setup
        matrix Z = {1; theta}
        matrix T = {phi, 0; 1, 0}
        matrix Q = {sigma^2, 0; 0, 0}
        bundle kb = ksetup(y, Z, T, Q)
        # maximum likelihood estimation
        mle logl = ERR ? NA : kb.llt
            kb.obsymat[2] = theta
            kb.statemat[1,1] = phi
            kb.statevar[1,1] = sigma^2
            ERR = kfilter(&kb)
            params phi theta sigma
        end mle --hessian
    end function

    # main
    open arma.gdt          # open the "arma" example dataset
    arma11_via_kalman(y)   # estimate an ARMA(1,1) model
    arma 1 1 ; y --nc      # check via native command

Estimates for μ_t can be obtained by running a forward filter for the one-sided version, plus a smoothing pass for the two-sided one. Code implementing the filter is shown in Listing 36.2, along with an example using the housing starts series from the St. Louis Fed database. The example also compares the result of the function to that from gretl's native hpfilt function. Note that in the case of the one-sided filter a little trick is required in order to get the desired result: the state matrix stored by the kfilter function is the estimate of α̂_{t|t−1}, whereas what we require is in fact α̂_{t|t}. To work around this, we add an extra observation to the end of the series and retrieve the one-step-ahead estimate of the lagged state.

Listing 36.2: HP filter

    function series hp_via_kalman (series y, scalar lambda[0], bool oneside[0])
        if lambda == 0
            lambda = 100 * $pd^2
        endif
        # state transition matrix
        matrix T = {2, -1; 1, 0}
        # observation matrix
        matrix Z = {1, 0}
        # covariance matrix in the state equation
        matrix Q = {1/sqrt(lambda), 0; 0, 0}
        matrix my = {y}
        string desc = ""
        if oneside
            matrix my = my | {0}
            desc = "1-sided"
        endif
        ssm = ksetup(my, Z, T, Q)
        ssm.obsvar = sqrt(lambda)
        ssm.inistate = {2*y[1] - y[2]; 3*y[1] - 2*y[2]}
        ssm.diffuse = 1
        err = oneside ? kfilter(&ssm) : ksmooth(&ssm)
        if err
            series ret = NA
        else
            mu = oneside ? ssm.state[2:,2] : ssm.state[,1]
            series ret = y - mu
        endif
        string d = sprintf("%s HP-filtered %s (lambda = %g)", desc, argname(y), lambda)
        setinfo ret --description="@d"
        return ret
    end function

    # example
    clear
    open fedstl.bin
    data houst
    y = log(houst)
    # one-sided: built-in, then hansl
    n1c = hpfilt(y, 1600, 1)
    series h1c = hp_via_kalman(y, 1600, 1)
    ols n1c const h1c --simple-print
    # two-sided: built-in, then hansl
    n2c = hpfilt(y, 1600)
    series h2c = hp_via_kalman(y, 1600)
    ols n2c const h2c --simple-print

Local level model

Functions illustrated in this example: ksetup, kfilter, ksmooth.

Suppose we have a series y_t = μ_t + ε_t, where μ_t is a random walk with normal increments of variance σ²_1 and ε_t is normal white noise with variance σ²_2, independent of μ_t. This is known as the "local level" model, and it can be cast in state-space form as equations (36.1)-(36.2) with T = 1, η_t ~ N(0, σ²_1), Z = 1 and ε_t ~ N(0, σ²_2).⁵ The translation to hansl is

    bundle llmod = ksetup(y, 1, 1, s1)
    llmod.obsvar = s2
    llmod.diffuse = 1

⁵ Note that the local level model, plus other common Structural Time Series models, are implemented in the StrucTiSM function package.
The two unknown parameters σ²_1 and σ²_2 can be estimated via maximum likelihood. Listing 36.3 provides an example of simulation and estimation of such a model. Since simulating the local level model is trivial using ordinary gretl commands, we don't use ksimul in this context.⁶

⁶ Warning: as the script stands, there is an off-by-one misalignment between the state vector and the observable series. For convenience the script is written as if equation (36.2) were modified into the equivalent formulation α_t = T α_{t−1} + η_t. We kept the script as simple as possible, so that the reader can focus on the interesting aspects.

Listing 36.3: Local level model

    nulldata 200
    set seed 101010
    setobs 1 1 --special-time-series

    # set the true variance parameters
    true_s1 = 0.5
    true_s2 = 0.25
    # and simulate some data
    v = normal() * sqrt(true_s1)
    w = normal() * sqrt(true_s2)
    mu = 2 + cum(v)
    y = mu + w

    # starting values for variance estimates
    s1 = 1
    s2 = 1

    # state-space model setup
    bundle kb = ksetup(y, 1, 1, s1)
    kb.obsvar = s2
    kb.diffuse = 1

    # ML estimation of variances
    mle ll = ERR ? NA : kb.llt
        ERR = kfilter(&kb)
        params kb.statevar kb.obsvar
    end mle

    # compute the smoothed state
    ksmooth(&kb)
    series muhat = kb.state

Time-varying models

To illustrate state space models with time-varying system matrices, we will use "time-varying OLS". Suppose the DGP for an observable time series y_t is given by

    y_t = β_0 + β_{1,t} x_t + ε_t        (36.4)

where the slope coefficient β_{1,t} evolves through time according to

    β_{1,t+1} = β_{1,t} + η_t            (36.5)

It is easy to see that the pair of equations above defines a state space model, with equation (36.4) as the measurement equation and (36.5) as the state transition equation. The unobservable state is β_{1,t}, T = 1 and Ω = σ²_η. As for the measurement equation, Σ = σ²_ε, while the matrix multiplying β_{1,t}, and hence playing the role of Z_t, is the time-varying x_t.

Once the system is framed as a state-space model, estimation of the three unknown parameters β_0, σ²_ε and σ²_η can proceed by maximum likelihood, in a manner similar to examples 36.1 and 36.3. The sequence of slope coefficients β_{1,t} can then be estimated by running the smoother, which also yields a consistent estimate of the dispersion of the estimated state.

Listing 36.4 presents an example in which data from the AWM database are used to estimate a Phillips curve with time-varying slope,

    INFQ_t = β_0 + β_{1,t} URX_t + ε_t

where INFQ is a measure of quarterly inflation and URX a measure of unemployment. At the end of the script, the evolution of the slope coefficient over time is plotted, along with a 95% confidence band; see Figure 36.1.

Listing 36.4: Phillips curve on Euro data with time-varying slope

    function void at_each_step (bundle *b)
        b.obsymat = transp(b.mX[b.t,])
    end function

    open AWM.gdt --quiet
    smpl 1974:1 1994:1

    # parameter initialization
    scalar b0 = mean(INFQ)
    scalar s_obs = 0.1
    scalar s_state = 0.1

    # bundle setup
    bundle B = ksetup(INFQ, 1, 1, 1)
    matrix B.mX = {URX}
    matrix B.depvar = {INFQ}
    B.timevar_call = "at_each_step"
    B.diffuse = 1

    # ML estimation of intercept and the two variances
    mle LL = err ? NA : B.llt
        B.obsy = B.depvar - b0
        B.obsvar = s_obs^2
        B.statevar = s_state^2
        err = kfilter(&B)
        params b0 s_obs s_state
    end mle

    # display the smoothed time-varying slope
    ksmooth(&B)
    series tvar_b1hat = B.state[,1]
    series tvar_b1se = sqrt(B.stvar[,1])
    gnuplot tvar_b1hat --time-series --with-lines --output=display \
      --band=tvar_b1hat,tvar_b1se,1.96 --band-style=fill

[Figure 36.1: Phillips curve on Euro data, time-varying slope and 95% confidence interval]
Disturbance smoothing

Functions illustrated in this example: ksetup, kdsmooth.

In section 36.7 we noted that the kdsmooth function can produce two different measures of the dispersion of the smoothed disturbances, depending on the value of the optional trailing boolean parameter. Here we show what these two measures are good for, using the famous Nile flow data, which have been much analysed in the state-space literature. We focus on the state equation, that is, the random-walk component of the observed series. Our script is shown in Listing 36.5. This is an instance of the local level model, and the ML variance estimates are obtained as in Listing 36.3.

Listing 36.5: Working with smoothed disturbances, Nile data

    open nile.gdt

    # ML variance estimates
    scalar s2_eta = 1468.49
    scalar s2_eps = 15099.7

    bundle LLM = ksetup(nile, 1, 1, s2_eta)
    LLM.obsvar = s2_eps
    LLM.diffuse = 1

    kdsmooth(&LLM)
    series eta_aux = LLM.smdist[,1] ./ LLM.smdisterr[,1]
    series zero = 0
    plot eta_aux
        options time-series with-lines band=zero,const,2
        literal unset ylabel
        literal set title "Auxiliary residual, state equation"
    end plot --output=display

    kdsmooth(&LLM, 1)
    series etahat = LLM.smdist[,1]
    series sd_eta = LLM.smdisterr[,1]
    plot etahat
        options time-series with-lines band=etahat,sd_eta,1.64485
        literal unset ylabel
        literal set title "State disturbance with 90% confidence band"
    end plot --output=display

In the first call to kdsmooth we omit the optional switch, and therefore compute E(η̂_t η̂_t') for each t. This quantity is suitable for constructing the "auxiliary residuals" shown in the top panel of Figure 36.2 (for similar plots, see Koopman et al., 1999; Pelagatti, 2011). This plot suggests the presence of a structural break shortly prior to 1900, as various authors have observed.

In the second kdsmooth call we ask gretl to compute instead E[(η̂_t − η_t)(η̂_t − η_t)' | y_1, ..., y_T], the MSE of η̂_t considered as an estimator of η_t. And in the lower panel of the figure we plot η̂_t along with a 90% confidence band (roughly, 1.64 times the RMSE). This reveals that, given the sampling variance of η̂_t, we're not really sure that any of the η_t values were truly different from zero. The resolution of the seeming conflict here is commonly reckoned to be that there was in fact a change in mean around 1900, but besides that event there's little evidence for a non-zero σ²_η. Or, in other words, the standard local level model is not really applicable to these data.

[Figure 36.2: Nile data: (a) auxiliary standardized residuals, state equation; (b) estimated state disturbance with 90% confidence band]

36.13 Graphical interface

By this point the reader will have gathered that setting up a state space model can be quite a complex undertaking, and the only fully general way to accomplish it is by writing a script. However, some cases are simple enough to lend themselves to a standardized treatment, and so can be handled via a relatively streamlined graphical interface. As of version 2022a, gretl provides just this: a GUI for estimating a subset of state space models which, while limited, may still be useful for pedagogical purposes, sparing the user from the intricacies of scripting. In this section we describe the GUI and the class of models it supports.

The GUI can be used for performing ML estimation of models of the kind

    y_t = Z α_t + ε_t            (36.6)
    α_t = T α_{t−1} + R η_t      (36.7)

where y_t is a vector of observables and V(ε_t) is a diagonal matrix, or possibly 0, in which case the last term of equation (36.6) is dropped.
As for the covariance matrix of the shocks to equation (36.7), it is assumed that η_t is an IID sequence of normal random variates with diagonal covariance matrix Σ_η. Therefore the covariance matrix denoted by Ω in the previous sections of this chapter (whose corresponding key in the Kalman bundle is statevar) is assumed to be

    Ω = R Σ_η R'

Note that R can have fewer columns than r, thereby making Ω singular. In the graphical interface, R is called the "state variance factor". The system matrices Z, T and R are assumed to be time-invariant and known, so estimation only concerns the variances of ε_t and η_t. Clearly this is a limited subset of the range of models that gretl can handle, but it may be of some value to users.

[Figure 36.3: GUI hook for state space models]

ML estimation is carried out internally using the mle command, with the limited-memory version of the BFGS optimizer, and the user is given the option of tracking the optimization process via a verbosity option. For reasons of numerical performance, it is convenient to have the choice of representing variances as transformations of the BFGS parameters, in one of the three following ways:

- Absolute value: maximization is performed on the variances, σ² = |θ|
- Square: maximization is performed on the standard deviations, σ² = θ²
- Exponential: maximization is performed on the log standard deviations, σ² = exp(2θ)

Normally this choice should make no difference for well-behaved data, although numerical problems may occur sometimes. In these cases it may be helpful to rescale the data, by multiplying y_t by some scalar such as 100 or 0.0001, so as to make the order of magnitude of the parameters less prone to finite-precision issues. In any case, the function reports the estimates of the standard errors, whatever the parametrization type. Once the parameters are estimated, the user has the choice of performing smoothing of the states.

The GUI is shown in Figure 36.3. The "observables" box is used for specifying a list of series (or a single series) for y_t. The next two boxes handle the Z and T matrices, respectively. These can be pre-existing matrices, or may be created on the fly. The same applies to the next box, dedicated to the R matrix. However, the R matrix can be omitted, in which case it is implicitly assumed that R = I. The remaining GUI elements should hopefully be self-explanatory. The function returns a bundle which includes a sub-bundle called kmod with all the state-space internals, a matrix called state holding the estimated states, and matrices coeff and vcv holding, respectively, the coefficients and standard errors obtained via ML estimation.

Example: random walk plus noise

The model here is

    y_t = α_t + ε_t
    α_t = α_{t−1} + η_t

so that Z = T = R = 1. The following script simulates the DGP above, with V(ε_t) = 1 and V(η_t) = 1/16, and sets up the two matrices Z and T, ready to be entered into the second and third boxes of the GUI helper, respectively. Obviously, the first box should contain the string y; note that the first box expects as its argument a named list, thereby allowing for multivariate models.

    clear
    set verbose off
    set seed 280921
    nulldata 256
    setobs 1 1 --special

    # example 1: random walk plus noise
    series m = cum(normal() * 0.25)
    series y = m + normal()
    Z = {1}
    T = {1}

[Figure 36.4: Estimated state]

    stdev[1]    0.0173633     0.00448117    3.875    0.0001

    State transition equation
                coefficient   std. error      z      p-value
    stdev[1]    0.0269790     0.00409457    6.589    4.43e-11
    stdev[2]    0.00648082    0.00202576    3.199    0.0014

    Log-likelihood  1213.3

Note that the output window will contain a few icons on the top bar.
By clicking on the second one from the left it is possible to save to the gretl workspace one or more elements from the returned bundle. For example, the kmod key corresponds to the estimated Kalman bundle. Saving it under the name kb and running the code below will produce the plot shown in Figure 36.5.

    series trend = kb.state[,1]
    series seas = kb.state[,2]
    scatters y trend seas

[Figure 36.5: Estimated trend and seasonal component]

Chapter 37 Numerical methods

Several functions are available to aid in the construction of special-purpose estimators: their purpose is to find numerically approximate solutions to problems that in principle could be solved analytically, but in practice cannot be, for one reason or another. In this chapter we illustrate the tools that gretl offers for optimization of functions, differentiation and integration.

37.1 Derivative-based optimization methods

In some cases the function we want to optimize is differentiable and has a maximum in the interior of the search space. In these cases you will want to use algorithms that exploit this feature, such as BFGS or Newton-Raphson. If this is not the case, you may want to use derivative-free methods, which are illustrated in section 37.2.

BFGS

The BFGSmax function has two required arguments: a vector holding the initial values of a set of parameters, and a call to a function that calculates the (scalar) criterion to be maximized, given the current parameter values and any other relevant data. If the object is in fact minimization, this function should return the negative of the criterion. On successful completion, BFGSmax returns the maximized value of the criterion, and the vector given via the first argument holds the parameter values which produce the maximum. It is assumed here that the objective function is a user-defined function (see chapter 14) with the following general setup:

    function scalar ObjFunc (const matrix *theta, matrix *X)
        scalar val = 0
        # do some computation
        return val
    end function

The first argument contains the parameter vector, which should not be modified within the function, and the second may be used to hold extra values that are necessary to compute the objective function, but are not the variables of the optimization problem. Here the pointer form is chosen for the argument, but depending on the problem it could also be passed as a plain argument, with or without the const modifier. For example, if the objective function were a log-likelihood, the first argument would contain the parameters and the second one the data. Or, for more economic-theory inclined readers, if the objective function were the utility of a consumer, the first argument might contain the quantities of goods and the second one their prices and disposable income.

The operation of BFGS can be adjusted using the set variables bfgs_maxiter and bfgs_toler (see chapter 26). In addition, you can provoke verbose output from the maximizer by setting max_verbose to on, again via the set command; alternatively, you could set it to full and get even richer output.

The Rosenbrock function is often used as a test problem for optimization algorithms. It is also known as "Rosenbrock's valley" or "Rosenbrock's banana function", on account of the fact that its contour lines are banana-shaped. It is defined by

    f(x, y) = (1 − x)² + 100 (y − x²)²
The function has a global minimum at (x, y) = (1, 1), where f(x, y) = 0. Listing 37.1 shows a gretl script that discovers the minimum using BFGSmax, giving a verbose account of progress. Note that in this particular case the function to be maximized only depends on the parameters, so the second argument is omitted from the definition of the function Rosenbrock.

Listing 37.1: Finding the minimum of the Rosenbrock function

    function scalar Rosenbrock (const matrix param)
        scalar x = param[1]
        scalar y = param[2]
        return -((1-x)^2 + 100 * (y - x^2)^2)
    end function

    matrix theta = {0, 0}
    set max_verbose on
    M = BFGSmax(&theta, Rosenbrock(theta))
    print theta

Supplying analytical derivatives for BFGS

An optional third argument to the BFGSmax function enables the user to supply analytical derivatives of the criterion function with respect to the parameters (without which, a numerical approximation to the gradient is computed). This argument is similar to the second one in that it specifies a function call. In this case, the function that is called must have the following signature.

Its first argument should be a pre-defined matrix, correctly dimensioned to hold the gradient: that is, if the parameter vector contains k elements, the gradient matrix must also be a k-vector. This matrix argument must be given in "pointer" form, so that its content can be modified by the function. Note that, unlike the parameter vector, where the choice of initial values can be important, the initial values given to the gradient are immaterial and do not affect the results.

In addition, the gradient function must have the parameter vector as one of its arguments. This may be given in pointer form (which enhances efficiency), but that is not required. Additional arguments may be specified if necessary. Given the current parameter values, the function call must fill out the gradient vector appropriately. It is not required that the gradient function return any value directly; if it does, that value is ignored.

Listing 37.2 illustrates, showing how the Rosenbrock script can be modified to use analytical derivatives. Note that, since this is a minimization problem, the values written into g[1] and g[2] in the function Rosen_grad are in fact the derivatives of the negative of the Rosenbrock function.

Listing 37.2: Rosenbrock function with analytical gradient

    function scalar Rosenbrock (const matrix param)
        scalar x = param[1]
        scalar y = param[2]
        return -((1-x)^2 + 100 * (y - x^2)^2)
    end function

    function void Rosen_grad (matrix *g, const matrix param)
        scalar x = param[1]
        scalar y = param[2]
        g[1] = 2*(1-x) + 2*x*(200*(y - x^2))
        g[2] = -200*(y - x^2)
    end function

    matrix theta = {0, 0}
    matrix grad = {0, 0}
    set max_verbose 1
    M = BFGSmax(&theta, Rosenbrock(theta), Rosen_grad(&grad, theta))
    print theta
    print grad

Limited-memory variant and constrained optimization

As an alternative to standard BFGS, gretl offers the limited-memory variant, L-BFGS-B. This is described by Byrd et al. (1995) and Zhu et al. (1997). Gretl uses version 3.0 of this code, which features improvements described by Morales and Nocedal (2011). Some problems that defeat standard BFGS may be amenable to solution by L-BFGS-B. To see if this is the case, gretl code that uses BFGS can be pushed into using the alternative algorithm via the set command, as follows:

    set lbfgs on

The primary case for using L-BFGS-B, however, is constrained optimization: this algorithm supports constraints on the parameters in the form of minima and/or maxima. In gretl this is implemented by the function BFGScmax ("c" for constrained). The syntax is basically similar to that of BFGSmax, except that the first argument must be followed by the specification of a "bounds" matrix. This matrix should have three columns, and as many rows as there are constrained elements of the parameter vector. Each row should hold the (1-based) index of the constrained parameter, followed by lower and upper bounds. The values -$huge and $huge should be used to indicate that the parameter is unconstrained downward or upward, respectively. For example, the following code constructs a matrix to specify that the second element of the parameter vector must be non-negative, and the fourth must lie between 0 and 1:

    matrix bounds = {2, 0, $huge; 4, 0, 1}
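As a hedged sketch of a complete call, reusing the Rosenbrock function defined above and constraining the second parameter to [0, 1] (the bounds matrix is passed as the second argument, per the description just given):

    matrix theta = {0, 0}
    matrix bounds = {2, 0, 1}
    M = BFGScmax(&theta, bounds, Rosenbrock(theta))
    print theta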
Newton-Raphson

BFGS, discussed above, is an excellent all-purpose maximizer, and about as robust as possible given the limitations of digital computer arithmetic. The Newton-Raphson maximizer is not as robust, but may converge much faster than BFGS for problems where the maximand is reasonably well behaved, in particular where it is anything like quadratic (see below). The case for using Newton-Raphson is enhanced if it is possible to supply a function to calculate the Hessian analytically.

The gretl function NRmax, which implements the Newton-Raphson method, has a maximum of four arguments. The first two (required) arguments are exactly as for BFGS: an initial parameter vector, and a function call which returns the maximand given the parameters. The optional third argument is again as in BFGS: a function call that calculates the gradient. Specific to NRmax is an optional fourth argument, namely a function call to calculate the (negative) Hessian. The first argument of this function must be a pre-defined matrix of the right dimension to hold the Hessian, that is, a k × k matrix, where k is the length of the parameter vector, given in pointer form. The second argument should be the parameter vector (optionally in pointer form). Other data may be passed as additional arguments as needed. Similarly to the case with the gradient, if the fourth argument to NRmax is omitted, a numerical approximation to the Hessian is constructed.

What is ultimately required in Newton-Raphson is the negative inverse of the Hessian. Note that if you give the optional fourth argument, your function should compute the negative Hessian, but should not invert it; NRmax takes care of inversion, with special handling for the case where the matrix is not negative definite, which can happen far from the maximum.

Script 37.3 extends the Rosenbrock example, using NRmax with a function Rosen_hess to compute the Hessian. The functions Rosenbrock and Rosen_grad are just the same as in Listing 37.2 and are omitted for brevity.

Listing 37.3: Rosenbrock function via Newton-Raphson

    function void Rosen_hess (matrix *H, const matrix param)
        scalar x = param[1]
        scalar y = param[2]
        H[1,1] = 2 - 400*y + 1200*x^2
        H[1,2] = -400*x
        H[2,1] = -400*x
        H[2,2] = 200
    end function

    matrix theta = {0, 0}
    matrix grad = {0, 0}
    matrix H = zeros(2, 2)
    set max_verbose 1
    M = NRmax(&theta, Rosenbrock(theta), Rosen_grad(&grad, theta), Rosen_hess(&H, theta))
    print theta
    print grad

The idea behind Newton-Raphson is to exploit a quadratic approximation to the maximand, under the assumption that it is concave. If this is true, the method is very effective. However, if the algorithm happens to evaluate the function at a point where the Hessian is not negative definite, things may go wrong. Script 37.4 exemplifies this by using a normal density, which is concave in the interval (−1, 1) and convex elsewhere. If the algorithm is started from within the interval, everything goes well and NR is (slightly) more effective than BFGS. If, however, the Hessian is positive at the starting point, BFGS converges with only a little more difficulty, while Newton-Raphson fails.

Listing 37.4: Maximization of a Gaussian density

    function scalar ND (matrix x)
        scalar z = x[1]
        return exp(-0.5*z*z)
    end function

    set max_verbose 1

    x = {0.75}
    A = BFGSmax(&x, ND(x))
    x = {0.75}
    A = NRmax(&x, ND(x))

    x = {1.5}
    A = BFGSmax(&x, ND(x))
    x = {1.5}
    A = NRmax(&x, ND(x))
37.2 Derivative-free optimization methods

Golden section search method

Suppose you have a function f(x) of a scalar argument that is known to have a unique maximum. The golden section method is rather effective at finding it quickly, without making use of derivatives; see Press et al. (2007), section 10.2, for a thorough description. The gretl function implementing this method is called GSSmax.

The idea is, roughly, to take an interval [x_0, x_1], also known as the "bracket", that should contain the maximizing value. Once y_0 = f(x_0) and y_1 = f(x_1) are computed, the algorithm sets a new point x_2 that replaces the end of the previous interval for which the function takes the worse value. So, for example, if y_0 < y_1, then x_0 is replaced and the interval becomes [x_1, x_2]. The width of the interval shrinks progressively, so after a few iterations you should end up close to the maximum.

As an illustration, consider the function

    f(x) = 50 x^(3/2) e^(−x)

which is maximized at x = 1.5. The following script sets as the initial interval the range [0, 10]:

    function scalar g (scalar x)
        return 50 * x^1.5 * exp(-x)
    end function

    set max_verbose on
    m = {5, 0, 10}
    y = GSSmax(&m, g(m[1]))
    printf "f(%g) = %g\n", m[1], y

The output is:

    1: bracket={3.81966,6.18034}, values={8.18747,1.59001}
    2: bracket={2.36068,3.81966}, values={17.1118,8.18747}
    3: bracket={1.45898,2.36068}, values={20.4841,17.1118}
    4: bracket={0.901699,1.45898}, values={17.3764,20.4841}
    ...
    20: bracket={1.50017,1.50042}, values={20.4958,20.4958}
    21: bracket={1.50001,1.50017}, values={20.4958,20.4958}
    f(1.49996) = 20.4958

As you can see from the output, the bracket shrinks progressively; the center of the interval when the algorithm stops is x = 1.49996. Figure 37.1 gives a pictorial representation of the process, where the blue line is the function to maximize and the red segments are the successive choices for the bracket.

[Figure 37.1: Golden section search method example]

Simulated annealing

Simulated annealing, as implemented by the gretl function simann, is not a full-blown maximization method in its own right, but can be a useful auxiliary tool in problems where convergence depends sensitively on the initial values of the parameters. The idea is that you supply initial values and the simulated annealing mechanism tries to improve on them via controlled randomization.

The simann function takes up to three arguments. The first two (required) are the same as for BFGSmax and NRmax: an initial parameter vector and a function that computes the maximand. The optional third argument is a positive integer giving the maximum number of iterations, n, which defaults to 1024.

Starting from the specified point in the parameter space, for each of n iterations we select at random a new point within a certain radius of the previous one, and determine the value of the criterion at the new point. If the criterion is higher we jump to the new point; otherwise, we jump with probability P (and remain at the previous point with probability 1 − P). As the iterations proceed, the system gradually "cools": that is, the radius of the random perturbation is reduced, as is the probability of making a jump when the criterion fails to increase.

In the course of this procedure, n + 1 points in the parameter space are evaluated: call them θ_i, i = 0, ..., n, where θ_0 is the initial value given by the user. Let θ* denote the "best" point among θ_1, ..., θ_n (highest criterion value). The value written into the parameter vector on completion is then θ* if θ* is better than θ_0, otherwise θ_n. In other words, failing an actual improvement in the criterion, simann randomizes the starting point, which may be helpful in tricky optimization problems.
Listing 37.5 shows simann at work as a helper for BFGSmax in finding the maximum of a bimodal function. Unaided, BFGSmax requires 60 function evaluations and 55 evaluations of the gradient, while after simulated annealing the maximum is found with 7 function evaluations and 6 evaluations of the gradient.¹

¹ Your mileage may vary: these figures are somewhat compiler- and machine-dependent.

Listing 37.5: BFGS with initialization via simulated annealing

    function scalar bimodal (matrix x, matrix A)
        scalar ret = exp(-qform(x-1, A))
        ret += 2 * exp(-qform(x+4, A))
        return ret
    end function

    set seed 12334
    set max_verbose on

    scalar k = 2
    matrix A = 0.1 * I(k)
    matrix x0 = {3; -5}

    x = x0
    u = BFGSmax(&x, bimodal(x, A))
    print x

    x = x0
    u = simann(&x, bimodal(x, A), 1000)
    print x
    u = BFGSmax(&x, bimodal(x, A))
    print x

Nelder-Mead

The Nelder-Mead derivative-free "simplex" maximizer, also known as the "amoeba" algorithm, is implemented by the function NMmax. The argument list of this function is essentially the same as for simann: the required arguments are an initial parameter vector and a function call to compute the maximand, while an optional third argument can be used to set the maximum number of function evaluations (default value, 2000).

This method is unlikely to produce as close an approximation to the true optimum as derivative-based methods such as BFGS and Newton-Raphson, but it is more robust than the latter. It may succeed in some cases where derivative-based methods fail, and it may be useful, like simann, for improving the starting point for an optimization problem so that a derivative-based method can then take over successfully.
theta coeff matrix V vcv mpc MPCtheta Y matrix Jac fdjactheta MPCtheta Y Sigma qformJac V printf mpc g stderr g mpc sqrtSigma scalar teststat mpc1sqrtSigma printf Test for MPC 1 g pvalue g teststat pvaluenabsteststat Chapter 38 Discrete and censored dependent variables 387 Model 1 Logit estimates using the 32 observations 132 Dependent variable GRADE VARIABLE COEFFICIENT STDERROR T STAT SLOPE at mean const 130213 493132 2641 GPA 282611 126294 2238 0533859 TUCE 00951577 0141554 0672 00179755 PSI 237869 106456 2234 0449339 Mean of GRADE 0344 Number of cases correctly predicted 26 812 fbetax at mean of independent vars 0189 McFaddens pseudoRsquared 0374038 Loglikelihood 128896 Likelihood ratio test Chisquare3 154042 pvalue 0001502 Akaike information criterion AIC 337793 Schwarz Bayesian criterion BIC 396422 HannanQuinn criterion HQC 357227 Predicted 0 1 Actual 0 18 3 1 3 8 Model 2 Probit estimates using the 32 observations 132 Dependent variable GRADE VARIABLE COEFFICIENT STDERROR T STAT SLOPE at mean const 745232 254247 2931 GPA 162581 0693883 2343 0533347 TUCE 00517288 00838903 0617 00169697 PSI 142633 0595038 2397 0467908 Mean of GRADE 0344 Number of cases correctly predicted 26 812 fbetax at mean of independent vars 0328 McFaddens pseudoRsquared 0377478 Loglikelihood 128188 Likelihood ratio test Chisquare3 155459 pvalue 0001405 Akaike information criterion AIC 336376 Schwarz Bayesian criterion BIC 395006 HannanQuinn criterion HQC 35581 Predicted 0 1 Actual 0 18 3 1 3 8 Test for normality of residual Null hypothesis error is normally distributed Test statistic Chisquare2 361059 with pvalue 0164426 Table 381 Example logit and probit output Chapter 38 Discrete and censored dependent variables 388 Odds ratios A noteworthy feature of the binary logit model is that the regression coefficients have an inter pretation as log odds ratios where the odds ratio is 0 Py 1Py 0 In the logit example above the coefficient on TUCE has a value of 0095 The corresponding odds ratio is then e0095 110 meaning that the estimated effect of a unit increase in TUCE is to move the odds ratio by 10 percent in favor of GRADE 1 When a binary logit model is estimated via the gretl GUI the Analysis menu in the model output window incudes an Odds ratios item This opens a window showing the odds ratios along with standard errors obtained via the delta method plus a 95 percent confidence interval as illustrated below 95 confidence intervals z0025 19600 odds ratio std error low high GPA 168797 213181 142019 200624 TUCE 109983 0155686 0833365 145150 PSI 107907 114874 133934 869380 Note however that confidence intervals shown are not calculated using the deltamethod standard errors rather the bounds are obtained by exponentiating the bounds of regular confidence inter vals for the coefficients This makes sense on the assumption that the coefficients themselves are more likely to be normally distributed than their exponentials Odds ratio information can also be retrieved following binary logit estimation via scripting In this case it takes the form of a matrix provided by the oddsratios accessor or as modeloddsratios The perfect prediction problem One curious characteristic of logit and probit models is that quite paradoxically estimation is not feasible if a model fits the data perfectly this is called the perfect prediction problem The reason why this problem arises is easy to see by considering equation 386 if for some vector β and scalar k its the case that zi k whenever yi 0 and zi k whenever yi 1 the same thing is true 
for any multiple of β Hence Lβ can be made arbitrarily close to 0 simply by choosing enormous values for β As a consequence the loglikelihood has no maximum despite being bounded Gretl has a mechanism for preventing the algorithm from iterating endlessly in search of a non existent maximum One subcase of interest is when the perfect prediction problem arises because of a single binary explanatory variable In this case the offending variable is dropped from the model and estimation proceeds with the reduced specification Nevertheless it may happen that no single perfect classifier exists among the regressors in which case estimation is simply impos sible and the algorithm stops with an error This behavior is triggered during the iteration process if max zi iyi0 min zi iyi1 If this happens unless your model is trivially misspecified like predicting if a country is an oil exporter on the basis of oil revenues it is normally a smallsample problem you probably just dont have enough data to estimate your model You may want to drop some of your explanatory variables This problem is well analyzed in Stokes 2004 the results therein are replicated in the example script murderratesinp Chapter 38 Discrete and censored dependent variables 389 382 Ordered response models These models constitute a simple variation on ordinary logitprobit models and are usually applied when the dependent variable is a discrete and ordered measurementnot simply binary but on an ordinal rather than an interval scale For example this sort of model may be applied when the dependent variable is a qualitative assessment such as Good Average and Bad In the general case consider an ordered response variable y that can take on any of the J1 values 0 1 2 J We suppose as before that underlying the observed response is a latent variable y Xβ ε z ε Now define cut points α1 α2 αJ such that y 0 if y α1 y 1 if α1 y α2 y J if y αJ For example if the response takes on three values there will be two such cut points α1 and α2 The probability that individual i exhibits response j conditional on the characteristics xi is then given by Pyi j xi Py α1 xi Fα1 zi for j 0 Pαj y αj1 xi Fαj1 zi Fαj zi for 0 j J Py αJ xi 1 FαJ zi for j J 388 The unknown parameters αj are estimated jointly with the βs via maximum likelihood The ˆαj estimates are reported by gretl as cut1 cut2 and so on For the probit variant a conditional moment test for normality constructed in the spirit of Chesher and Irish 1987 is also included Note that the αj parameters can be shifted arbitrarily by adding a constant to zi so the model is underidentified if there is some linear combination of the explanatory variables which is constant The most obvious case in which this occurs is when the model contains a constant term for this reason gretl drops automatically the intercept if present However it may happen that the user in adventently specifies a list of regressors that may be combined in such a way to produce a constant for example by using a full set of dummy variables for a discrete factor If this happens gretl will also drop any offending regressors In order to apply these models in gretl the dependent variable must either take on only non negative integer values or be explicitly marked as discrete In case the variable has noninteger values it will be recoded internally Note that gretl does not provide a separate command for ordered models the logit and probit commands automatically estimate the ordered version if the dependent variable is acceptable but not binary Listing 
383 reproduces the results presented in section 1510 of Wooldridge 2002a The question of interest in this analysis is what difference it makes to the allocation of assets in pension funds whether individual plan participants have a choice in the matter The response variable is an ordinal measure of the weight of stocks in the pension portfolio Having reported the results of estimation of the ordered model Wooldridge illustrates the effect of the choice variable by reference to an average participant The example script shows how one can compute this effect in gretl After estimating ordered models the uhat accessor yields generalized residuals as in binary mod els additionally the yhat accessor function returns ˆzi so it is possible to compute an unbiased estimator of the latent variable y i simply by adding the two together Chapter 38 Discrete and censored dependent variables 390 Listing 383 Ordered probit model Download Replicate the results in Wooldridge Econometric Analysis of Cross Section and Panel Data section 1510 using pensionplan data from Papke AER 1998 The dependent variable pctstck percent stocks codes the asset allocation responses of mostly bonds mixed and mostly stocks as 0 50 100 The independent variable of interest is choice a dummy indicating whether individuals are able to choose their own asset allocations open pensiongdt demographic characteristics of participant list DEMOG age educ female black married dummies coding for income level list INCOME finc25 finc35 finc50 finc75 finc100 finc101 Papkes OLS approach ols pctstck const choice DEMOG INCOME wealth89 prftshr save the OLS choice coefficient choiceols coeffchoice estimate ordered probit probit pctstck choice DEMOG INCOME wealth89 prftshr k ncoeff matrix b coeff1k2 a1 coeffk1 a2 coeffk Wooldridge illustrates the choice effect in the ordered probit by reference to a single nonblack male aged 60 with 135 years of education income in the range 50K 75K and wealth of 200K participating in a plan with profit sharing matrix X 60 135 0 0 0 0 0 0 1 0 0 200 1 with choice 0 scalar Xb 0 X b P0 cdfN a1 Xb P50 cdfN a2 Xb P0 P100 1 cdfN a2 Xb E0 50 P50 100 P100 with choice 1 Xb 1 X b P0 cdfN a1 Xb P50 cdfN a2 Xb P0 P100 1 cdfN a2 Xb E1 50 P50 100 P100 printf With choice Ey 2f without Ey 2f E1 E0 printf Estimated choice effect via ML 2f OLS 2f E1 E0 choiceols Chapter 38 Discrete and censored dependent variables 402 where durat measures durations 0 represents the constant which is required for such models X is a named list of regressors and cens is the censoring dummy By default the Weibull distribution is used you can substitute any of the other three distribu tions discussed here by appending one of the option flags exponential loglogistic or lognormal Interpreting the coefficients in a duration model requires some care and we will work through an illustrative case The example comes from section 203 of Wooldridge 2002a and it concerns criminal recidivism7 The data filename recidgdt pertain to a sample of 1445 convicts released from prison between July 1 1977 and June 30 1978 The dependent variable is the time in months until they are again arrested The information was gathered retrospectively by examining records in April 1984 the maximum possible length of observation is 81 months Rightcensoring is impor tant when the date were compiled about 62 percent had not been rearrested The dataset contains several covariates which are described in the data file we will focus below on interpretation of the married variable a dummy which 
Listing 38.7 shows the gretl commands for Weibull and log-normal models, along with most of the output. Consider first the Weibull scale factor, σ. The estimate is 1.241 with a standard error of 0.048. (We don't print a z score and p-value for this term, since H₀: σ = 0 is not of interest.) Recall that σ corresponds to 1/α; we can be confident that α is less than 1, so recidivism displays negative duration dependence. This makes sense: it is plausible that if a past offender manages to stay out of trouble for an extended period, his risk of engaging in crime again diminishes. (The exponential model would therefore not be appropriate in this case.)

On a priori grounds, however, we may doubt the monotonic decline in hazard that is implied by the Weibull specification. Even if a person is liable to return to crime, it seems relatively unlikely that he would do so straight out of prison. In the data, we find that only 2.6 percent of those followed were re-arrested within 3 months. The log-normal specification, which allows the hazard to rise and then fall, may be more appropriate. Using the duration command again with the same covariates, but the --lognormal flag, we get a log-likelihood of −1597 as against −1633 for the Weibull, confirming that the log-normal gives a better fit.

Let us now focus on the married coefficient, which is positive in both specifications, but larger and more sharply estimated in the log-normal variant. The first thing is to get the interpretation of the sign right. Recall that Xβ enters negatively into the intermediate variable w (equation 38.20). The Weibull hazard is λ(w_i) = e^{w_i}, so being married reduces the hazard of re-offending, or in other words lengthens the expected duration out of prison. The same qualitative interpretation applies for the log-normal.

To get a better sense of the married effect, it is useful to show its impact on the hazard across time. We can do this by plotting the hazard for two values of the index function Xβ: in each case the values of all the covariates other than married are set to their means (or some chosen values) while married is set first to 0 then to 1. Listing 38.8 provides a script that does this, and the resulting plots are shown in Figure 38.1. Note that, when computing the hazards, we need to multiply by the Jacobian of the transformation from t_i to w_i = (log t_i − x_iβ)/σ, namely 1/t. Note also that the estimate of σ is available via the accessor $sigma, but it is also present as the last element in the coefficient vector obtained via $coeff.

A further difference between the Weibull and log-normal specifications is illustrated in the plots. The Weibull is an instance of a proportional hazard model. This means that for any sets of values of the covariates, x_i and x_j, the ratio of the associated hazards is invariant with respect to duration. In this example, the Weibull hazard for unmarried individuals is always 1.1637 times that for married. In the log-normal variant, on the other hand, this ratio gradually declines, from 1.6703 at one month to 1.1766 at 100 months.
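As a quick check of the proportional-hazard property, the 1.1637 figure can be reproduced from the quantities saved in Listing 38.8 below (mc_w is the married coefficient and s_w the estimate of σ); a minimal sketch:

    # Weibull: changing 'married' from 0 to 1 scales the hazard by
    # exp(-mc_w/sigma), so the unmarried/married ratio is
    scalar h_ratio = exp(mc_w / s_w)
    printf "unmarried/married hazard ratio: %.4f\n", h_ratio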
Listing 38.7: Models for recidivism data (Download)

Input:

    open recid.gdt
    list X = workprg priors tserved felon alcohol drugs black married educ age
    duration durat 0 X ; cens
    duration durat 0 X ; cens --lognormal

Partial output:

    Model 1: Duration (Weibull), using observations 1-1445
    Dependent variable: durat

                 coefficient   std. error      z       p-value
      const       4.22167      0.341311      12.37     3.85e-35
      workprg    -0.112785     0.112535      -1.002    0.3162
      priors     -0.110176     0.0170675     -6.455    1.08e-10
      tserved    -0.0168297    0.00213029    -7.900    2.78e-15
      felon       0.371623     0.131995       2.815    0.0049
      alcohol    -0.555132     0.132243      -4.198    2.69e-05
      drugs      -0.349265     0.121880      -2.866    0.0042
      black      -0.563016     0.110817      -5.081    3.76e-07
      married     0.188104     0.135752       1.386    0.1659
      educ        0.0289111    0.0241153      1.199    0.2306
      age         0.00462188   0.000664820    6.952    3.60e-12
      sigma       1.24090      0.0482896

    Chi-square(10) = 165.4772, p-value 2.39e-30
    Log-likelihood = -1633.032, Akaike criterion = 3290.065

    Model 2: Duration (log-normal), using observations 1-1445
    Dependent variable: durat

                 coefficient   std. error      z       p-value
      const       4.09939      0.347535      11.80     4.11e-32
      workprg    -0.0625693    0.120037      -0.5213   0.6022
      priors     -0.137253     0.0214587     -6.396    1.59e-10
      tserved    -0.0193306    0.00297792    -6.491    8.51e-11
      felon       0.443995     0.145087       3.060    0.0022
      alcohol    -0.634909     0.144217      -4.402    1.07e-05
      drugs      -0.298159     0.132736      -2.246    0.0247
      black      -0.542719     0.117443      -4.621    3.82e-06
      married     0.340682     0.139843       2.436    0.0148
      educ        0.0229194    0.0253974      0.9024   0.3668
      age         0.00391028   0.000606205    6.450    1.12e-10
      sigma       1.81047      0.0623022

    Chi-square(10) = 166.7361, p-value 1.31e-30
    Log-likelihood = -1597.059, Akaike criterion = 3218.118

Listing 38.8: Create plots showing conditional hazards (Download)

    open recid.gdt --quiet
    # leave 'married' separate for analysis
    list X = workprg priors tserved felon alcohol drugs black educ age

    # Weibull variant
    duration durat 0 X married ; cens
    # coefficients on all Xs apart from married
    matrix beta_w = $coeff[1:$ncoeff-2]
    # married coefficient
    scalar mc_w = $coeff[$ncoeff-1]
    scalar s_w = $sigma

    # Log-normal variant
    duration durat 0 X married ; cens --lognormal
    matrix beta_n = $coeff[1:$ncoeff-2]
    scalar mc_n = $coeff[$ncoeff-1]
    scalar s_n = $sigma

    list allX = 0 X
    # evaluate X\beta at means of all variables except marriage
    scalar Xb_w = meanc({allX}) * beta_w
    scalar Xb_n = meanc({allX}) * beta_n

    # construct two plot matrices
    matrix mat_w = zeros(100, 3)
    matrix mat_n = zeros(100, 3)

    loop t=1..100
      # first column, duration
      mat_w[t,1] = t
      mat_n[t,1] = t
      wi_w = (log(t) - Xb_w) / s_w
      wi_n = (log(t) - Xb_n) / s_n
      # second col: hazard with married = 0
      mat_w[t,2] = (1/t) * exp(wi_w)
      mat_n[t,2] = (1/t) * pdf(z, wi_n) / cdf(z, -wi_n)
      wi_w = (log(t) - (Xb_w + mc_w)) / s_w
      wi_n = (log(t) - (Xb_n + mc_n)) / s_n
      # third col: hazard with married = 1
      mat_w[t,3] = (1/t) * exp(wi_w)
      mat_n[t,3] = (1/t) * pdf(z, wi_n) / cdf(z, -wi_n)
    endloop

    cnameset(mat_w, "months unmarried married")
    cnameset(mat_n, "months unmarried married")

    gnuplot 2 3 1 --with-lines --suppress-fitted --matrix=mat_w --output=weibull.plt
    gnuplot 2 3 1 --with-lines --suppress-fitted --matrix=mat_n --output=lognorm.plt

[Figure 38.1: Recidivism hazard estimates for married and unmarried ex-convicts — Weibull and log-normal panels, hazard plotted against months.]

Alternative representations of the Weibull model

One point to watch out for with the Weibull duration model is that the estimates may be represented in different ways. The representation given by gretl is sometimes called the "accelerated failure-time" (AFT) metric. An alternative that one sometimes sees is the log relative-hazard metric; in fact this is the metric used in Wooldridge's presentation of the recidivism example. To get from AFT estimates to log relative-hazard form it is necessary to multiply the coefficients by −σ⁻¹. For example, the married coefficient in the Weibull specification as shown here is 0.188104 and σ̂ is 1.24090, so the alternative value is −0.152, which is what Wooldridge shows (2002a, Table 20.1).
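This conversion is easily scripted; a minimal sketch, assuming the Weibull model of Listing 38.7 has just been estimated:

    # drop sigma (the last element of $coeff), then rescale:
    # AFT coefficients times -1/sigma give the log relative-hazard metric
    matrix lrh = -$coeff[1:$ncoeff-1] / $sigma
    print lrh   # the 'married' entry should be about -0.152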
Fitted values and residuals

By default, gretl computes fitted values (accessible via $yhat) as the conditional mean of duration. The formulae are shown below, where Γ denotes the gamma function, and the exponential variant is just Weibull with σ = 1:

    Weibull               Log-logistic            Log-normal
    exp(Xβ) Γ(1 + σ)      exp(Xβ) πσ/sin(πσ)      exp(Xβ + σ²/2)

The expression given for the log-logistic mean, however, is valid only for σ < 1; otherwise the expectation is undefined, a point that is not noted in all software.[8]

Alternatively, if the --medians option is given, gretl's duration command will produce conditional medians as the content of $yhat. For the Weibull the median is exp(Xβ)(log 2)^σ; for the log-logistic and log-normal it is just exp(Xβ).

The values we give for the accessor $uhat are generalized (Cox–Snell) residuals, computed as the integrated hazard function, which equals the negative log of the survivor function:

    ε̂_i = Λ(t_i; x_i, θ̂) = −log S(t_i; x_i, θ̂)

Under the null of correct specification of the model, these generalized residuals should follow the unit exponential distribution, which has mean and variance both equal to 1 and density e^{−ε}. See chapter 18 of Cameron and Trivedi (2005) for further discussion.

[8] The predict adjunct to the streg command in Stata 10, for example, gaily produces large negative values for the log-logistic mean in duration models with σ > 1.

Chapter 39
Quantile regression

39.1 Introduction

In Ordinary Least Squares (OLS) regression, the fitted values, ŷ_i = X_i β̂, represent the conditional mean of the dependent variable—conditional, that is, on the regression function and the values of the independent variables. In median regression, by contrast and as the name implies, fitted values represent the conditional median of the dependent variable. It turns out that the principle of estimation for median regression is easily stated (though not so easily computed), namely, choose β̂ so as to minimize the sum of absolute residuals. Hence the method is known as Least Absolute Deviations, or LAD. While the OLS problem has a straightforward analytical solution, LAD is a linear programming problem.

Quantile regression is a generalization of median regression: the regression function predicts the conditional τ-quantile of the dependent variable—for example, the first quartile (τ = 0.25) or the ninth decile (τ = 0.90).

If the classical conditions for the validity of OLS are satisfied—that is, if the error term is independently and identically distributed, conditional on X—then quantile regression is redundant: all the conditional quantiles of the dependent variable will march in lockstep with the conditional mean. Conversely, if quantile regression reveals that the conditional quantiles behave in a manner quite distinct from the conditional mean, this suggests that OLS estimation is problematic.

Gretl has offered quantile regression functionality since version 1.7.5 (in addition to basic LAD regression, which has been available since early in gretl's history via the lad command).[1]

39.2 Basic syntax

The basic invocation of quantile regression is

    quantreg tau reglist

where reglist is a standard gretl regression list (dependent variable followed by regressors, including the constant if an intercept is wanted); and tau is the desired conditional quantile, in the range 0.01 to 0.99, given either as a numerical value or the name of a pre-defined scalar variable (but see below for a further option).

Estimation is via the Frisch–Newton interior point solver (Portnoy and Koenker, 1997), which is substantially faster than the "traditional" Barrodale–Roberts (1974) simplex approach for large problems.

[1] We gratefully acknowledge our borrowing from the quantreg package for GNU R (version 4.17). The core of the package is composed of Fortran code written by Roger Koenker; this is accompanied by various driver and auxiliary functions written in the R language by Koenker and Martin Mächler. The latter functions have been re-worked in C for gretl. We have added some guards against potential numerical problems in small samples.
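For example, minimal calls (with hypothetical series y and x; the --robust option is discussed just below) might be:

    # median regression (LAD)
    quantreg 0.5 y const x
    # first-quartile regression, robust standard errors
    quantreg 0.25 y const x --robust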
By default, standard errors are computed according to the asymptotic formula given by Koenker and Bassett (1978). Alternatively, if the --robust option is given, we use the sandwich estimator developed in Koenker and Zhao (1994).[2]

[2] These correspond to the iid and nid options in R's quantreg package, respectively.

39.3 Confidence intervals

An option --intervals is available. When this is given, we print confidence intervals for the parameter estimates instead of standard errors. These intervals are computed using the rank inversion method and, in general, they are asymmetrical about the point estimates—that is, they are not simply "plus or minus so many standard errors". The specifics of the calculation are inflected by the --robust option: without this, the intervals are computed on the assumption of IID errors (Koenker, 1994); with it, they use the heteroskedasticity-robust estimator developed by Koenker and Machado (1999).

By default, 90 percent intervals are produced. You can change this by appending a confidence value (expressed as a decimal fraction) to the --intervals option, as in

    quantreg tau reglist --intervals=0.95

When the confidence intervals option is selected, the parameter estimates are calculated using the Barrodale–Roberts method. This is simply because the Frisch–Newton code does not currently support the calculation of confidence intervals.

Two further details. First, the mechanisms for generating confidence intervals for quantile estimates require that the model has at least two regressors (including the constant). If the --intervals option is given for a model containing only one regressor, an error is flagged. Second, when a model is estimated in this mode, you can retrieve the confidence intervals using the accessor $coeff_ci. This produces a k × 2 matrix, where k is the number of regressors. The lower bounds are in the first column, the upper bounds in the second. See also section 39.5 below.

39.4 Multiple quantiles

As a further option, you can give tau as a matrix—either the name of a predefined matrix or in numerical form, as in {.05, .25, .5, .75, .95}. The given model is estimated for all the τ values and the results are printed in a special form, as shown below (in this case the --intervals option was also given).

    Model 1: Quantile estimates using the 235 observations 1-235
    Dependent variable: foodexp
    With 90 percent confidence intervals

    VARIABLE    TAU    COEFFICIENT       LOWER       UPPER

    const      0.05       124.880     98.3021     130.517
               0.25       95.4835     73.7861     120.098
               0.50       81.4822     53.2592     114.012
               0.75       62.3966     32.7449     107.314
               0.95       64.1040     46.2649     83.5790

    income     0.05      0.343361    0.343327    0.389750
               0.25      0.474103    0.420330    0.494329
               0.50      0.560181    0.487022    0.601989
               0.75      0.644014    0.580155    0.690413
               0.95      0.709069    0.673900    0.734441

[Figure 39.1: Regression of food expenditure on income, Engel's data — quantile estimates of the income coefficient with 90% band, plotted against τ, together with the OLS estimate and its 90% band.]

The gretl GUI has an entry for Quantile Regression under Model, Robust estimation, and you can select multiple quantiles there too. In that context, just give space-separated numerical values (as per the predefined options, shown in a drop-down list).

When you estimate a model in this way, most of the standard menu items in the model window are disabled, but one extra item is available—graphs showing the τ sequence for a given coefficient, in comparison with the OLS coefficient.
An example is shown in Figure 39.1. This sort of graph provides a simple means of judging whether quantile regression is redundant (OLS is fine) or informative.

In the example shown—based on data on household income and food expenditure gathered by Ernst Engel (1821–1896)—it seems clear that simple OLS regression is potentially misleading. The crossing of the OLS estimate by the quantile estimates is very marked.

However, it is not always clear what implications should be drawn from this sort of conflict. With the Engel data, there are two issues to consider. First, Engel's famous "law" claims an income elasticity of food consumption that is less than one, and talk of elasticities suggests a logarithmic formulation of the model. Second, there are two apparently anomalous observations in the data set: household 105 has the third-highest income but unexpectedly low expenditure on food (as judged from a simple scatter plot), while household 138 (which also has unexpectedly low food consumption) has much the highest income, almost twice that of the next highest.

With n = 235 it seems reasonable to consider dropping these observations. If we do so, and adopt a log-log formulation, we get the plot shown in Figure 39.2. The quantile estimates still cross the OLS estimate, but the evidence against OLS is much less compelling: the 90 percent confidence bands of the respective estimates overlap at all the quantiles considered.

[Figure 39.2: Log-log regression; 2 observations dropped from full Engel data set.]

A script to produce the results discussed above is presented in Listing 39.1.

Listing 39.1: Food expenditure and income, Engel data (Download)

    # this data file is supplied with gretl
    open engel.gdt
    # specify some quantiles
    matrix tau = {.05, .25, .5, .75, .95}
    # use levels of variables
    QM1 <- quantreg tau foodexp 0 income --intervals
    # use log-log specification, with two outliers removed
    logs foodexp income
    smpl obs!=105 && obs!=138 --restrict
    QM2 <- quantreg tau l_foodexp 0 l_income --intervals

The script saves the two models "as icons". Double-clicking on a model's icon opens a window to display the results, and the Graph menu in this window gives access to a tau-sequence plot.

39.5 Large datasets

As noted above, when you give the --intervals option with the quantreg command, which calls for estimation of confidence intervals via rank inversion, gretl switches from the default Frisch–Newton algorithm to the Barrodale–Roberts simplex method.

This is OK for moderately large datasets (up to, say, a few thousand observations) but on very large problems the simplex algorithm may become seriously bogged down. For example, Koenker and Hallock (2001) present an analysis of the determinants of birth weights, using 198,377 observations and with 15 regressors. Generating confidence intervals via Barrodale–Roberts for a single value of τ took about half an hour on a Lenovo Thinkpad T60p with 1.83GHz Intel Core 2 processor.

If you want confidence intervals in such cases, you are advised not to use the --intervals option, but to compute them using the method of "plus or minus so many standard errors". (One Frisch–Newton run took about 8 seconds on the same machine, showing the superiority of the interior point method.) The script below illustrates:

    quantreg .10 y 0 xlist
    scalar crit = qnorm(.95)
    matrix ci = $coeff - crit * $stderr
    ci = ci ~ ($coeff + crit * $stderr)
    print ci

The matrix ci will contain the lower and upper bounds of the (symmetrical) 90 percent confidence intervals.
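By contrast, when the sample is of moderate size, the rank-inversion bounds can be obtained directly, as noted in section 39.3; a minimal sketch using the Engel variables:

    quantreg 0.5 foodexp 0 income --intervals=0.90
    matrix ci = $coeff_ci   # k x 2: lower bounds ~ upper bounds
    print ci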
To avoid a situation where gretl becomes unresponsive for a very long time, we have set the maximum number of iterations for the Barrodale–Roberts algorithm to the somewhat arbitrary value of 1000. We will experiment further with this, but for the meantime, if you really want to use this method on a large dataset, and don't mind waiting for the results, you can increase the limit using the set command with parameter rq_maxiter, as in

    set rq_maxiter 5000

Chapter 40 — Nonparametric methods

Listing 40.2: Nadaraya–Watson example (Download)

    # Nonparametric regression example: husband's age on wife's age
    open mroz87.gdt

    # initial value for the bandwidth
    scalar h = $nobs^(-0.2)

    # three increasingly smooth estimates
    series m0 = nadarwat(HA, WA, h)
    series m1 = nadarwat(HA, WA, h * 5)
    series m2 = nadarwat(HA, WA, h * 10)

    # produce the graph
    dataset sortby WA
    gnuplot HA m0 m1 m2 WA --output=display --with-lines=m0,m1,m2

[Figure 40.2: Nadaraya–Watson example for several choices of the bandwidth parameter — HA plotted against WA, with the estimates m0, m1 and m2.]

If you need a point estimate of m(X) for some value of X which is not present among the valid observations of your dependent variable, you may want to add some "fake" observations to your dataset in which y is missing and x contains the values you want m(x) evaluated at. For example, the following script evaluates m(x) at regular intervals between −2.0 and 2.0:

    nulldata 120
    set seed 120496

    # first part of the sample: actual data
    smpl 1 100
    x = normal()
    y = x^2 + sin(x) + normal()

    # second part of the sample: fake x data
    smpl 101 120
    x = (obs - 110) / 5

    # compute the Nadaraya-Watson estimate with bandwidth
    # equal to 0.4 (note that 100^(-0.2) = 0.398)
    smpl full
    m = nadarwat(y, x, 0.4)

    # show m(x) for the fake x values only
    smpl 101 120
    print x m -o

and running it produces

               x            m

    101     -1.8     1.165934
    102     -1.6     0.730221
    103     -1.4     0.314705
    104     -1.2    -0.026057
    105     -1.0    -0.131999
    106     -0.8    -0.215445
    107     -0.6    -0.269257
    108     -0.4    -0.304451
    109     -0.2    -0.306448
    110      0.0    -0.238766
    111      0.2    -0.038837
    112      0.4     0.354660
    113      0.6     0.908178
    114      0.8     1.485178
    115      1.0     2.000003
    116      1.2     2.460100
    117      1.4     2.905176
    118      1.6     3.380874
    119      1.8     3.927682
    120      2.0     4.538364

Chapter 41 — MIDAS models

    Parameterization                      code   string
    Normalized exponential Almon            1    nealmon
    Normalized beta, zero last lag          2    beta0
    Normalized beta, non-zero last lag      3    betan
    Almon polynomial                        4    almonp
    One-parameter beta                      5    beta1

    Table 41.1: MIDAS parameterizations

In the case of the non-normalized Almon polynomial the γ coefficient in (41.2) is identically 1.0 and is omitted. The beta1 case is the same as the two-parameter beta0, except that θ₁ is constrained to equal 1, leaving θ₂ as the only free parameter. Ghysels and Qian (2016) make a case for use of this particularly parsimonious version.[2]

[2] Note, however, that at present beta1 cannot be mixed with other parameterizations in a single model.

An additional function is provided for convenience: it is named mlincomb and it combines mweights with the lincomb function, which takes a list (of series) argument followed by a vector of coefficients and produces a series result, namely a linear combination of the elements of the list. If we have a suitable list X available, we can do, for example,

    series foo = mlincomb(X, theta, "beta0")

This is equivalent to

    series foo = lincomb(X, mweights(nelem(X), theta, "beta0"))

but saves a little typing, and some CPU cycles.
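The weights themselves can be inspected via mweights, which may help in getting a feel for a given parameterization; a small sketch (the θ values are merely illustrative):

    # normalized exponential Almon weights for 9 lags,
    # with theta = (0.2, -0.1): a 9-vector summing to 1
    matrix w = mweights(9, {0.2, -0.1}, "nealmon")
    print w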
41.2 Estimating MIDAS models

Gretl offers a dedicated command, midasreg, for estimation of MIDAS models. (There's a corresponding item, MIDAS, under the Time series section of the Model menu in the gretl GUI.) We begin by discussing that, then move on to possibilities for defining your own estimator.

The syntax of midasreg looks like this:

    midasreg depvar xlist ; midas-terms [options]

The depvar slot takes the name (or series ID number) of the dependent variable, and xlist is the list of regressors that are observed at the same frequency as the dependent variable; this list may contain lags of the dependent variable. The midas-terms slot accepts one or more specification(s) for high-frequency terms. Each of these specifications must conform to one or other of the following patterns:

    1.  mds(mlist, minlag, maxlag, type[, theta])
    2.  mdsl(llist, type[, theta])

In case 1, mlist must be a MIDAS list, as defined in section 20.2, which contains a full set of per-period series but no lags. Lags will be generated automatically, governed by the minlag and maxlag (integer) arguments, which may be given as numerical values or the names of predefined scalar variables. The (integer or string) type argument represents the type of parameterization; in addition to the values defined in Table 41.1, a value of 0 (or the string umidas) indicates unrestricted MIDAS.

In case 2, llist is assumed to be a list that already contains the required set of high-frequency lags—as may be obtained via the hflags function described in section 20.3—hence minlag and maxlag are not wanted.

The final theta argument is optional in most cases (implying an automatic initialization of the hyperparameters). If this argument is given, it must take one of the following forms:

    1. The name of a matrix (vector) holding initial values for the hyperparameters, or a simple expression which defines a matrix using scalars, such as {1, 5}.
    2. The keyword null, indicating that an automatic initialization should be used (as happens when this argument is omitted).
    3. An integer value (in numerical form), indicating how many hyperparameters should be used (which again calls for automatic initialization).

The third of these forms is required if you want automatic initialization in the Almon polynomial case, since we need to know how many terms you wish to include. (In the normalized exponential Almon case we default to the usual two hyperparameters if theta is omitted or given as null.)

The midasreg syntax allows the user to specify multiple high-frequency predictors, if wanted: these can have different lag specifications, different parameterizations and/or different frequencies.

The options accepted by midasreg include --quiet (suppress printed output), --verbose (show detail of iterations, if applicable) and --robust (use a HAC estimator of the Newey–West type in computing standard errors). Two additional specialized options are described below.

Examples of usage

Suppose we have a dependent variable named dy and a MIDAS list named dX, and we wish to run a MIDAS regression using one lag of the dependent variable and high-frequency lags 1 to 10 of the series in dX. The following will produce U-MIDAS estimates:

    midasreg dy const dy(-1) ; mds(dX, 1, 10, 0)

The next line will produce estimates for the normalized exponential Almon parameterization with two coefficients, both initialized to zero:

    midasreg dy const dy(-1) ; mds(dX, 1, 10, "nealmon", {0,0})

In the examples above, the required lags will be added to the dataset automatically, then deleted after use. If you are estimating several models using a single set of MIDAS lags, it is more efficient to create the lags once and use the mdsl specifier. For example, the following estimates three variant parameterizations (exponential Almon, beta with zero last lag, and beta with non-zero last lag) on the same data:
    list dXL = hflags(1, 10, dX)
    midasreg dy 0 dy(-1) ; mdsl(dXL, "nealmon", {0,0})
    midasreg dy 0 dy(-1) ; mdsl(dXL, "beta0", {1,5})
    midasreg dy 0 dy(-1) ; mdsl(dXL, "betan", {1,1,0})

Any additional MIDAS terms should be separated by spaces, as in

    midasreg dy const dy(-1) ; mds(dX,1,9,1,theta1) mds(Z,1,6,3,theta2)

Replication exercise

We give a substantive illustration of midasreg in Listing 41.1. This replicates the first practical example discussed by Ghysels in the user's guide titled MIDAS Matlab Toolbox.[3] The dependent variable is the quarterly log-difference of real GDP, and the high-frequency regressor is derived from monthly payroll employment (see Listing 41.1 for the details).

[3] See Ghysels (2015). This document announces itself as Version 2.0 of the guide, and is dated November 1, 2015. The example we're looking at appears on pages 24–26; the associated Matlab code can be found in the program appADLMIDAS1.m.

Listing 41.1: Script to replicate results given by Ghysels (Download)

    set verbose off
    open gdp_midas.gdt --quiet

    # form the dependent variable
    series dy = 100 * ldiff(qgdp)
    # form list of high-frequency lagged log differences
    list X = payems*
    list dXL = hflags(3, 11, hf_ldiff(X, 100))
    # initialize matrix to collect forecasts
    matrix FC = {}

    # estimation sample
    smpl 1985:1 2009:1

    print "=== unrestricted MIDAS (umidas) ==="
    midasreg dy 0 dy(-1) ; mdsl(dXL, 0)
    fcast --out-of-sample --static --quiet
    FC ~= $fcast

    print "=== normalized beta with zero last lag (beta0) ==="
    midasreg dy 0 dy(-1) ; mdsl(dXL, 2, {1,5})
    fcast --out-of-sample --static --quiet
    FC ~= $fcast

    print "=== normalized beta, non-zero last lag (betan) ==="
    midasreg dy 0 dy(-1) ; mdsl(dXL, 3, {1,1,0})
    fcast --out-of-sample --static --quiet
    FC ~= $fcast

    print "=== normalized exponential Almon (nealmon) ==="
    midasreg dy 0 dy(-1) ; mdsl(dXL, 1, {0,0})
    fcast --out-of-sample --static --quiet
    FC ~= $fcast

    print "=== Almon polynomial (almonp) ==="
    midasreg dy 0 dy(-1) ; mdsl(dXL, 4, 4)
    fcast --out-of-sample --static --quiet
    FC ~= $fcast

    smpl 2009:2 2011:2
    matrix my = {dy}
    print "Forecast RMSEs:"
    printf "  umidas  %.4f\n", fcstats(my, FC[,1])[2]
    printf "  beta0   %.4f\n", fcstats(my, FC[,2])[2]
    printf "  betan   %.4f\n", fcstats(my, FC[,3])[2]
    printf "  nealmon %.4f\n", fcstats(my, FC[,4])[2]
    printf "  almonp  %.4f\n", fcstats(my, FC[,5])[2]

Listing 41.2: Replication of Ghysels' results, partial output

    normalized beta, non-zero last lag (betan)

    Model 3: MIDAS (NLS), using observations 1985:1-2009:1 (T = 97)
    Using L-BFGS-B with conditional OLS
    Dependent variable: dy

                 estimate    std. error   t-ratio   p-value
      const      0.748578    0.146404      5.113    1.74e-06
      dy_1       0.248055    0.118903      2.086    0.0398

      MIDAS list dXL, high-frequency lags 3 to 11:

      HF_slope   1.72167     0.582076      2.958    0.0039
      Beta1      0.998501    0.0269479    37.05     1.10e-56
      Beta2      2.95148     2.93404       1.006    0.3171
      Beta3      0.0743143   0.0271273     2.739    0.0074

    Sum squared resid    28.78262   S.E. of regression   0.562399
    R-squared            0.356376   Adjusted R-squared   0.321012
    Log-likelihood      -78.71248   Akaike criterion    169.4250
    Schwarz criterion   184.8732    Hannan-Quinn        175.6715

    Almon polynomial (almonp)

    Model 5: MIDAS (NLS), using observations 1985:1-2009:1 (T = 97)
    Using Levenberg-Marquardt algorithm
    Dependent variable: dy

                 estimate    std. error   t-ratio   p-value
      const      0.741403    0.146433      5.063    2.14e-06
      dy_1       0.255099    0.119139      2.141    0.0349

      MIDAS list dXL, high-frequency lags 3 to 11:

      Almon0     1.06035     1.53491       0.6908   0.4914
      Almon1     0.193615    1.30812       0.1480   0.8827
      Almon2     0.140466    0.299446      0.4691   0.6401
      Almon3     0.0116034   0.0198686     0.5840   0.5607

    Sum squared resid    28.66623   S.E. of regression   0.561261
    R-squared            0.358979   Adjusted R-squared   0.323758
    Log-likelihood      -78.51596   Akaike criterion    169.0319
    Schwarz criterion   184.4802    Hannan-Quinn        175.2784

    Forecast RMSEs:
      umidas  0.5424
      beta0   0.5650
      betan   0.5210
      nealmon 0.5642
      almonp  0.5329

The headers in Listing 41.2 indicate the estimation method used in each case:

L-BFGS-B with conditional OLS. L-BFGS is a "limited memory" version of the BFGS optimizer, and the trailing "-B" means that it supports bounds on the parameters, which is useful for reasons given below.

Golden Section search with conditional OLS. This is a line search method, used only when there is just a single hyperparameter to estimate.
Levenberg–Marquardt is the default NLS method, but if the MIDAS specifications include any of the beta variants or the normalized exponential Almon, we switch to L-BFGS-B, unless the user gives the --levenberg option.

The ability to set bounds on the hyperparameters via L-BFGS-B is helpful, first, because the beta parameters (other than the third one, if applicable) must be non-negative, but also because one is liable to run into numerical problems (in calculating the weights and/or gradient) if their values become too extreme. For example, we have found it useful to place bounds of −2 and +2 on the exponential Almon parameters.

Here's what we mean by "conditional OLS" in the context of L-BFGS-B and line search: the search algorithm itself is only responsible for optimizing the MIDAS hyperparameters, and when the algorithm calls for calculation of the sum of squared residuals given a certain hyperparameter vector, we optimize the remaining parameters (coefficients on base-frequency regressors, slopes with respect to MIDAS terms) via OLS.

Testing for a structural break

The --breaktest option can be used to carry out the Quandt Likelihood Ratio (QLR) test for a structural break at the stage of running the final Gauss–Newton regression (to check for convergence and calculate the covariance matrix of the parameter estimates). This can be a useful aid to diagnosis, since non-homogeneity of the data over the estimation period can lead to numerical problems in nonlinear estimation, besides compromising the forecasting capacity of the resulting equation. For example, when this option is given with the command to estimate the betan model shown in Listing 41.2, the following result is appended to the standard output:

    QLR test for structural break -
      Null hypothesis: no structural break
      Test statistic: chi-square(6) = 35.1745 at observation 2005:2
      with asymptotic p-value = 0.000127727

Despite the strong evidence for a structural break in this case, the nonlinear estimator appears to converge successfully. But one might wonder if a shorter estimation period could provide better out-of-sample forecasts.

Defining your own MIDAS estimator

As explained above, the midasreg command is in effect a "wrapper" for various underlying methods. Some users may wish to undo the wrapping. This would be required if you wish to introduce any nonlinearity other than that associated with the stock MIDAS parameterizations, or to define your own MIDAS parameterization. Anyone with ambitions in this direction will presumably be quite familiar with the commands and functions available in hansl, gretl's scripting language, so we will not say much here beyond presenting a couple of examples.

First, we show how the nls command can be used, along with the MIDAS-related functions described in section 41.1, to estimate a model with the exponential Almon specification:

    open gdp_midas.gdt --quiet
    series dy = 100 * ldiff(qgdp)
    series dy1 = dy(-1)
    list X = payems*
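The guide's example continues beyond this excerpt; purely as a hedged sketch (the parameter names b0, b1 and hfb are ours, and the initializations are illustrative), the nls estimation might be completed along these lines:

    list dXL = hflags(3, 11, hf_ldiff(X, 100))
    smpl 1985:1 2009:1
    # starting values: intercept, AR term, HF slope, Almon hyperparameters
    scalar b0 = 0
    scalar b1 = 0
    scalar hfb = 0
    matrix theta = {0, 0}
    nls dy = b0 + b1*dy1 + hfb * mlincomb(dXL, theta, "nealmon")
      params b0 b1 hfb theta
    end nls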
Listing 41.3: Manual MIDAS: one-parameter beta specification (Download)

    set verbose off

    function scalar beta1_SSR (scalar th2, const series y,
                               const series x, list L)
        matrix theta = {1, th2}
        series mdx = mlincomb(L, theta, 2)
        # run OLS conditional on theta
        ols y 0 x mdx --quiet
        return $ess
    end function

    function matrix midas_GNR (const matrix theta, const series y,
                               const series x, list L, int type)
        # Gauss-Newton regression
        series mdx = mlincomb(L, theta, type)
        ols y 0 x mdx --quiet
        matrix b = $coeff
        matrix u = {$uhat}
        matrix mgrad = mgradient(nelem(L), theta, type)
        matrix M = {const, x, mdx} ~ (b[3] * {L} * mgrad)
        matrix V
        set svd on # in case of strong collinearity
        mols(u, M, null, &V)
        return (b | theta) ~ sqrt(diag(V))
    end function

    # main

    open gdp_midas.gdt --quiet
    series dy = 100 * ldiff(qgdp)
    series dy1 = dy(-1)
    list dX = ld_payem*
    list dXL = hflags(3, 11, dX)

    # estimation sample
    smpl 1985:1 2009:1

    matrix b = {0, 1.01, 100}
    # use Golden Section minimizer
    SSR = GSSmin(&b, beta1_SSR(b[1], dy, dy1, dXL), 1.0e-6)
    printf "SSR (GSS) = %.15g\n", SSR

    matrix theta = {1; b[1]} # column vector needed
    matrix bse = midas_GNR(theta, dy, dy1, dXL, 2)
    bse[4,2] = NA # mask std error of clamped coefficient
    modprint bse "const dy(-1) HF_slope Beta1 Beta2"

Chapter 42
Gretl and ODBC

Gretl provides a method for retrieving data from databases which support the Open Database Connectivity (ODBC) standard. Most users won't be interested in this, but there may be some for whom this feature matters a lot—typically, those who work in an environment where huge data collections are accessible via a Data Base Management System (DBMS).

In the following section we explain what is needed for ODBC support in gretl. We provide some background information on how ODBC works in section 42.2, and explain the details of getting gretl to retrieve data from a database in section 42.3. Section 42.4 provides some examples of usage, and section 42.5 gives some details on the management of ODBC connections.

42.1 ODBC support

The piece of software that bridges between gretl and the ODBC system is a dynamically loaded "plugin". This is included in the gretl packages for MS Windows and Mac OS X. On other unix-type platforms (notably Linux) you may have to build gretl from source to get ODBC support. This is because the plugin depends on having unixODBC installed, which we cannot assume to be the case on typical Linux systems. To enable the ODBC plugin when building gretl, you must pass the option --with-odbc to gretl's configure script. In addition, if unixODBC is installed in a non-standard location, you will have to specify its installation prefix using --with-ODBC-prefix, as in (for example)

    ./configure --with-odbc --with-ODBC-prefix=/opt/ODBC

42.2 ODBC base concepts

ODBC is short for Open DataBase Connectivity, a group of software methods that enable a client to interact with a database server. The most common operation is when the client fetches some data from the server. ODBC acts as an intermediate layer between client and server, so the client "talks" to ODBC rather than accessing the server directly (see Figure 42.1).

[Figure 42.1: Retrieving data via ODBC]

For the above mechanism to work, it is necessary that the relevant ODBC software is installed and working on the client machine (contact your DB administrator for details). At this point, the database (or databases) that the server provides will be accessible to the client as a data source with a specific identifier (a Data Source Name, or DSN); in most cases, a username and a password are required to connect to the data source.

Once the connection is established, the user sends a query to ODBC, which contacts the database manager, collects the results and sends them back to the user. The query is almost invariably formulated in a special language used for the purpose, namely SQL.[1] We will not provide here an SQL tutorial: there are many such tutorials on the Net; besides, each database manager tends to support its own SQL dialect, so the precise form of an SQL query may vary slightly if the DBMS on the other end is Oracle, MySQL, PostgreSQL or something else.

[1] See http://en.wikipedia.org/wiki/SQL.

Suffice it to say that the main statement for retrieving data is the SELECT statement. Within a DBMS, data are organized in tables, which are roughly equivalent to spreadsheets. The SELECT statement returns a subset of a table, which is itself a table. For example, imagine that the database holds a table called "NatAccounts", containing the data shown in Table 42.1.
Table 42.1: The NatAccounts table

    year  qtr    gdp     consump    tradebal
    1970   1   584763   344746.9   -5891.01
    1970   2   597746   350176.9   -7068.71
    1970   3   604270   355249.7   -8379.27
    1970   4   609706   361794.7   -7917.61
    1971   1   609597   362490     -6274.3
    1971   2   617002   368313.6   -6658.76
    1971   3   625536   372605     -4795.89
    1971   4   630047   377033.9   -6498.13

The SQL statement

    SELECT qtr, tradebal, gdp FROM NatAccounts WHERE year=1970

produces the subset of the original data shown in Table 42.2.

Table 42.2: Result of a SELECT statement

    qtr   tradebal     gdp
     1    -5891.01   584763
     2    -7068.71   597746
     3    -8379.27   604270
     4    -7917.61   609706

Gretl provides a mechanism for forwarding your query to the DBMS via ODBC and including the results in your currently open dataset.

42.3 Syntax

At present we do not offer a graphical interface for ODBC import; this must be done via the command-line interface. The two commands used for fetching data via an ODBC connection are open and data.

The open command is used for connecting to a DBMS: its syntax is

    open dsn=database [user=username] [password=password] --odbc

The user and password items are optional; the effect of this command is to initiate an ODBC connection. It is assumed that the machine gretl runs on has a working ODBC client installed.

In order to actually retrieve the data, the data command is used. Its syntax is

    data series [obs-format=format-string] query=query-string --odbc

where:

series is a list of names of gretl series to contain the incoming data, separated by spaces. Note that these series need not exist prior to the ODBC import.

format-string is an optional parameter, used to handle cases when a "rectangular" organisation of the database cannot be assumed (more on this later).

query-string is a string containing the SQL statement used to extract the data.

There should be no spaces around the equals signs in the obs-format and query fields in the data command.

The query-string can, in principle, contain any valid SQL statement which results in a table. This string may be specified directly within the command, as in

    data x query="SELECT foo FROM bar" --odbc

which will store into the gretl variable x the content of the column foo from the table bar. However, since in a real-life situation the string containing the SQL statement may be rather long, it may be best to store it in a string variable. For example:

    string SqlQry = "SELECT foo1, foo2 FROM bar"
    data x y query=SqlQry --odbc

The observation format specifier

If the optional parameter obs-format is absent, as in the above example, the SQL query should return k columns of data, where k is the number of series names listed in the data command. It may be necessary to include a smpl command before the data command to set up the right "window" for the incoming data. In addition, if one cannot assume that the data will be delivered in the correct order (typically, chronological order), the SQL query should contain an appropriate ORDER BY clause.

The optional format string is used for those cases when there is no certainty that the data from the query will arrive in the same order as the gretl dataset. This may happen when missing values are interspersed within a column, or with data that do not have a natural ordering, e.g. cross-sectional data. In this case, the SQL statement should return a table with m + k columns, where the first m columns are used to identify the observation or row in the gretl dataset into which the actual data values in the final k columns should be placed. The obs-format string is used to translate the first m fields into a string which matches the string gretl uses to identify observations in the currently open dataset. Up to three columns can be used for this purpose (m ≤ 3).
Note that the strings gretl uses to identify observations can be seen by printing any variable "by observation", as in

    print index --byobs

(The series named index is automatically added to a dataset created via the nulldata command.)

The format specifiers available for use with obs-format are as follows:

    %d   print an integer value
    %s   print a string value
    %g   print a floating-point value

In addition, the format can include literal characters to be passed through, such as slashes or colons, to make the resulting string compatible with gretl's observation identifiers.

For example, consider the following fictitious case: we have a 5-days-per-week dataset, to which we want to add the stock index for the Verdurian market;[2] it so happens that in Verduria, Saturdays are working days but Wednesdays are not. We want a column which does not contain data on Saturdays, because we wouldn't know where to put them, but at the same time we want to place missing values on all the Wednesdays.

In this case, the following syntax could be used:

    string QRY="SELECT year, month, day, VerdSE FROM AlmeaIndexes"
    data y obs-format="%d-%02d-%02d" query=QRY --odbc

The column VerdSE holds the data to be fetched, which will go into the gretl series y. The first three columns are used to construct a string which identifies the day. Daily dates take the form YYYY-MM-DD in gretl. If a row from the DBMS produces the observation string 2008-04-01, this will match OK (it's a Tuesday), but 2008-04-05 will not match, since it is a Saturday; the corresponding row will therefore be discarded. On the other hand, since no string 2008-04-23 will be found in the data coming from the DBMS (it's a Wednesday), that entry is left blank in our series y.

[2] See http://www.almeopedia.com/index.php/Verduria.

42.4 Examples

Table 42.3: Example AWM database — structure

    Table Consump             Table DATA
    Field    Type             Field     Type
    time     decimal(7,2)     year      decimal(4,0)
    income   decimal(16,6)    qtr       decimal(1,0)
    consump  decimal(16,6)    varname   varchar(16)
                              xval      decimal(20,10)

Table 42.4: Example AWM database — data

    Table Consump                            Table DATA
    1970.00  424278.975500  344746.944000    1970  1  CAN    517.9085000000
    1970.25  433218.709400  350176.890400    1970  2  CAN    662.5996000000
    1970.50  440954.219100  355249.672300    1970  3  CAN   1130.4155000000
    1970.75  446278.664700  361794.719900    1970  4  CAN    467.2508000000
    1971.00  447752.681800  362489.970500    1970  1  COMPR   18.4000000000
    1971.25  453553.860100  368313.558500    1970  2  COMPR   18.6341000000
    1971.50  460115.133100  372605.015300    1970  3  COMPR   18.3000000000
                                             1970  4  COMPR   18.2663000000
                                             1970  1  D1       1.0000000000
                                             1970  2  D1       0.0000000000

In the following examples, we will assume that access is available to a database known to ODBC with the data source name "AWM", with username "Otto" and password "Bingo". The database "AWM" contains quarterly data in two tables (see Tables 42.3 and 42.4).

The table Consump is the classic "rectangular" dataset; that is, its internal organization is the same as in a spreadsheet or econometrics package: each row is a data point and each column is a variable. The structure of the DATA table is different: each record is one figure, stored in the column xval, and the other fields keep track of which variable it belongs to, for which date.

Listing 42.1: Simple query from a rectangular table

    nulldata 160
    setobs 4 1970:1 --time-series
    open dsn=AWM user=Otto password=Bingo --odbc

    string Qry = "SELECT consump, income FROM Consump"
    data cons inc query=Qry --odbc

Listing 42.1 shows a query for two series: first we set up an empty quarterly dataset. Then we connect to the database using the open statement. Once the connection is established, we retrieve two columns from the Consump table.
No observation string is required because the data already have a suitable structure: we need only import the relevant columns.

Listing 42.2: Simple query from a non-rectangular table

    string S = "select year, qtr, xval from DATA \
      where varname='WLN' ORDER BY year, qtr"
    data wln obs-format="%d:%d" query=S --odbc

In Listing 42.2, by contrast, we make use of the observation string, since we are drawing from the DATA table, which is not rectangular. The SQL statement stored in the string S produces a table with three columns. The ORDER BY clause ensures that the rows will be in chronological order, although this is not strictly necessary in this case.

Listing 42.3: Handling of missing values for a non-rectangular table

    string foo = "select year, qtr, xval from DATA \
      where varname='STN' AND qtr>1"
    data bar obs-format="%d:%d" query=foo --odbc
    print bar --byobs

Listing 42.3 shows what happens if the rows in the outcome from the SELECT statement do not match the observations in the currently open gretl dataset. The query includes a condition which filters out all the data from the first quarter. The query result (invisible to the user) would be something like

    year   qtr          xval
    1970     2     7.8705000000
    1970     3     7.5600000000
    1970     4     7.1892000000
    1971     2     5.8679000000
    1971     3     6.2442000000
    1971     4     5.9811000000
    1972     2     4.6883000000
    1972     3     4.6302000000

Internally, gretl fills the variable bar with the corresponding value if it finds a match; otherwise, NA is used. Printing out the variable bar thus produces

         Obs         bar

      1970:1
      1970:2      7.8705
      1970:3      7.5600
      1970:4      7.1892
      1971:1
      1971:2      5.8679
      1971:3      6.2442
      1971:4      5.9811
      1972:1
      1972:2      4.6883
      1972:3      4.6302

42.5 Connectivity details

It may be helpful to supply some details on gretl's management of ODBC connections. First, when the open command is invoked with the --odbc option, gretl checks to see if a connection to the specified DSN (Data Source Name) can be established via the ODBC function SQLConnect. If not, an error is flagged; if so, the connection is dropped (SQLDisconnect), but the DSN details are stored. The stored DSN then remains the implicit source for subsequent invocation of the data command, with the --odbc option, until a countermanding open command is issued. Each time an ODBC-related data command is issued, gretl attempts to re-establish a connection to the given DSN; the connection is dropped once the data transfer is complete.

Chapter 43 — Gretl and TeX

[Figure 43.1: LaTeX menu in model window]

Table 43.1: Example of LaTeX tabular output

    Model 1: OLS estimates using the 51 observations 1-51
    Dependent variable: ENROLL

    Variable   Coefficient    Std. Error    t-statistic   p-value
    const       0.241105      0.0660225       3.6519      0.0007
    CATHOL      0.223530      0.0459701       4.8625      0.0000
    PUPIL      -0.00338200    0.00271962     -1.2436      0.2198
    WHITE      -0.152643      0.0407064      -3.7499      0.0005

    Mean of dependent variable = 0.0955686
    S.D. of dependent variable = 0.0522150
    Sum of squared residuals = 0.0709594
    Standard error of residuals (σ̂) = 0.0388558
    Unadjusted R² = 0.479466
    Adjusted R² = 0.446241
    F(3, 47) = 14.4306

[Example of LaTeX equation output, with standard errors in parentheses.]

The distinction between the "Copy" and "Save" options (for both tabular and equation) is twofold. First, Copy puts the TeX source on the clipboard, while with Save you are prompted for the name of a file into which the source should be saved. Second, with Copy the material is copied as a "fragment", while with Save it is written as a complete file. The point is that a well-formed TeX source file must have a header that defines the documentclass (article, report, book or whatever) and tags that say \begin{document} and \end{document}. This material is included when you do Save, but not when you do Copy, since in the latter case
the expectation is that you will paste the data into an existing TeX source file that already has the relevant apparatus in place.

The items under "Equation options" should be self-explanatory: when printing the model in equation form, do you want standard errors or t-ratios displayed in parentheses under the parameter estimates? The default is to show standard errors; if you want t-ratios, select that item.

Other windows

Several other sorts of output windows also have TeX preview, copy and save enabled. In the case of windows having a graphical toolbar, look for the TeX button. Figure 43.2 shows this icon (second from the right on the toolbar) along with the dialog that appears when you press the button.

[Figure 43.2: TeX icon and dialog]

One aspect of gretl's TeX support that is likely to be particularly useful for publication purposes is the ability to produce a typeset version of the "model table" (see section 3.4). An example of this is shown in Table 43.2.

Table 43.2: Example of model table output

    OLS estimates
    Dependent variable: ENROLL

                Model 1      Model 2      Model 3
    const       0.2907       0.2411       0.08557
               (0.07853)    (0.06602)    (0.05794)
    CATHOL      0.2216       0.2235       0.2065
               (0.04584)    (0.04597)    (0.05160)
    PUPIL      -0.003035    -0.003382    -0.001697
               (0.002727)   (0.002720)   (0.003025)
    WHITE      -0.1482      -0.1526
               (0.04074)    (0.04071)
    ADMEXP                               -0.1551
                                         (0.1342)
    n              51           51           51
    R²          0.4502       0.4462       0.2956
    ℓ            96.09        95.36        88.69

    Standard errors in parentheses
    * indicates significance at the 10 percent level
    ** indicates significance at the 5 percent level

43.3 Fine-tuning typeset output

There are three aspects to this: adjusting the appearance of the output produced by gretl in LaTeX preview mode; adjusting the formatting of gretl's tabular output for models when using the tabprint command; and incorporating gretl's output into your own TeX files.

Previewing in the GUI

As regards preview mode, you can control the appearance of gretl's output using a file named gretlpre.tex, which should be placed in your gretl user directory (see the Gretl Command Reference). If such a file is found, its contents will be used as the "preamble" to the TeX source. The default value of the preamble is as follows:

    \documentclass[11pt]{article}
    \usepackage[utf8]{inputenc}
model via a scriptor interactively via the gretl console or using the command line program gretlcliyou can use the commands tabprint or eqnprint to print the model to file in tabular format or equation format respectively These options are explained in the Gretl Command Reference If you wish alter the appearance of gretls tabular output for models in the context of the tabprint command you can specify a custom row format using the format flag The format string must be enclosed in double quotes and must be tied to the flag with an equals sign The pattern for the format string is as follows There are four fields representing the coefficient standard error t ratio and pvalue respectively These fields should be separated by vertical bars they may contain a printftype specification for the formatting of the numeric value in question or may be left blank to suppress the printing of that column subject to the constraint that you cant leave all the columns blank Here are a few examples format4f4f4f4f format4f4f3f format5f4f4f format8g8g4f The first of these specifications prints the values in all columns using 4 decimal places The second suppresses the pvalue and prints the tratio to 3 places The third omits the tratio The last one again omits the t and prints both coefficient and standard error to 8 significant figures Once you set a custom format in this way it is remembered and used for the duration of the gretl session To revert to the default formatting you can use the special variant formatdefault Further editing Once you have pasted gretls TEX output into your own document or saved it to file and opened it in an editor you can of course modify the material in any wish you wish In some cases machine generated TEX is hard to understand but gretls output is intended to be humanreadable and Chapter 43 Gretl and TEX 440 editable In addition it does not use any nonstandard style packages Besides the standard LATEX document classes the only files needed are as noted above the amsmath dcolumn and longtable packages These should be included in any reasonably full TEX implementation 434 Installing and learning TEX This is not the place for a detailed exposition of these matters but here are a few pointers So far as we know every GNULinux distribution has a package or set of packages for TEX and in fact these are likely to be installed by default Check the documentation for your distribution For MS Windows several packaged versions of TEX are available one of the most popular is MiKTEX at httpwwwmiktexorg For Mac OS X a nice implementation is iTEXMac at httpitexmac sourceforgenet An essential starting point for online TEX resources is the Comprehensive TEX Archive Network CTAN at httpwwwctanorg As for learning TEX many useful resources are available both online and in print Among online guides Tony Roberts LATEX from quick and dirty to style and finesse is very helpful at httpwwwsciusqeduaustaffrobertsaLaTeXlatexintrohtml An excellent source for advanced material is The LATEX Companion Goossens et al 2004 Chapter 44 Gretl and R 441 Introduction R is by far the largest free statistical project1 Like gretl it is a GNU project and the two have a lot in common however gretls approach focuses on ease of use much more than R which instead aims to encompass the widest possible range of statistical procedures As is natural in the free software ecosystem we dont view ourselves as competitors to R2 but rather as projects sharing a common goal who should support each other whenever possible For this reason gretl 
provides a way to interact with R and thus enable users to pool the capabilities of the two packages In this chapter we will explain how to exploit Rs power from within gretl We assume that the reader has a working installation of R available and a basic grasp of Rs syntax3 Despite several valiant attempts no graphical shell has gained wide acceptance in the R community by and large the standard method of working with R is by writing scripts or by typing commands at the R prompt much in the same way as one would write gretl scripts or work with the gretl console In this chapter the focus will be on the methods available to execute R commands without leaving gretl 442 Starting an interactive R session The easiest way to use R from gretl is in interactive mode Once you have your data loaded in gretl you can select the menu item Tools Start GNU R and an interactive R session will be started with your dataset automatically preloaded A simple example OLS on crosssection data For this example we use Ramanathans dataset data41 one of the sample files supplied with gretl We first run in gretl an OLS regression of price on sqft bedrms and baths The basic results are shown in Table 441 Table 441 OLS house price regression via gretl Variable Coefficient Std Error tstatistic pvalue const 129062 883033 14616 01746 sqft 0154800 00319404 48465 00007 bedrms 21587 270293 07987 04430 baths 12192 432500 02819 07838 1Rs homepage is at httpwwwrprojectorg 2OK who are we kidding But its friendly competition 3The main reference for R documentation is httpcranrprojectorgmanualshtml In addition R tutorials abound on the Net as always Google is your friend 441 Chapter 44 Gretl and R 442 We will now replicate the above results using R Select the menu item Tools Start GNU R A window similar to the one shown in figure 441 should appear Figure 441 R window The actual look of the R window may be somewhat different from what you see in Figure 441 especially for Windows users but this is immaterial The important point is that you have a window where you can type commands to R If the above procedure doesnt work and no R window opens it means that gretl was unable to launch R You should ensure that R is installed and working on your system and that gretl knows where it is The relevant settings can be found by selecting the Tools Preferences General menu entry under the Programs tab Assuming R was launched successfully you will see notification that the data from gretl are avail able In the background gretl has arranged for two R commands to be executed one to load the gretl dataset in the form of a data frame one of several forms in which R can store data and one to attach the data so that the variable names defined in the gretl workspace are available as valid identifiers within R In order to replicate gretls OLS estimation go into the R window and type at the prompt model lmprice sqft bedrms baths summarymodel You should see something similar to Figure 442 Surprisethe estimates coincide To get out just close the R window or type q at the R prompt Time series data We now turn to an example which uses time series data we will compare gretls and Rs estimates of Box and Jenkins immortal airline model The data are contained in the bjg sample dataset The following gretl code open bjg arima 0 1 1 0 1 1 lg nc produces the estimates shown in Table 442 Chapter 44 Gretl and R 443 Figure 442 OLS regression on house prices via R Table 442 Airline model from Box and Jenkins 1976 selected portion of gretls estimates Variable Coefficient 
Std Error tstatistic pvalue θ1 0401824 00896421 44825 00000 Θ1 0556936 00731044 76184 00000 Variance of innovations 000134810 Loglikelihood 244696 Akaike information criterion 48339 Chapter 44 Gretl and R 444 If we now open an R session as described in the previous subsection the datapassing mechanism is slightly different Since our data were defined in gretl as time series we use an R timeseries object ts for short for the transfer In this way we can retain in R useful information such as the periodicity of the data and the sample limits The downside is that the names of individual series as defined in gretl are not valid identifiers In order to extract the variable lg one needs to use the syntax lg gretldata lg ARIMA estimation can be carried out by issuing the following two R commands lg gretldata lg arimalg c011 seasonalc011 which yield Coefficients ma1 sma1 04018 05569 se 00896 00731 sigma2 estimated as 0001348 log likelihood 2447 aic 4834 Happily the estimates again coincide 443 Running an R script Opening an R window and keying in commands is a convenient method when the job is small In some cases however it would be preferable to have R execute a script prepared in advance One way to do this is via the source command in R Alternatively gretl offers the facility to edit an R script and run it having the current dataset preloaded automatically This feature can be accessed via the File Script Files menu entry By selecting User file one can load a preexisting R script if you want to create a new script instead select the New script R script menu entry Figure 443 Editing window for R scripts In either case you are presented with a window very similar to the editor window used for ordinary gretl scripts as in Figure 443 There are two main differences First you get syntax highlighting for Rs syntax instead of gretls Second clicking on the Execute button the gears icon launches an instance of R in which your commands are executed Before R is actually run you are asked if you want to run R interactively or not see Figure 444 An interactive run opens an R instance similar to the one seen in the previous section your data will be preloaded if the preload data box is checked and your commands will be executed Once this is done you will find yourself at the R prompt where you can enter more commands Chapter 44 Gretl and R 445 Figure 444 Editing window for R scripts A noninteractive run on the other hand will execute your script collect the output from R and present it to you in an output window R will be run in the background If for example the script in Figure 443 is run noninteractively a window similar to Figure 445 will appear Figure 445 Output from a noninteractive R run 444 Sending data back and forth As regards the passing of data between the two programs so far we have only considered passing series from gretl to R In order to achieve a satisfactory degree of interoperability more is needed In the following subsections we see how matrices can be exchanged and how data can be passed from R back to gretl Chapter 44 Gretl and R 446 Passing matrices from gretl to R For passing matrices from gretl to R you can use the mwrite matrix function described in section 177 For example the following gretl code fragment generates the matrix A 3 7 11 4 8 12 5 9 13 6 10 14 and stores it into the file mymatfilemat in the users dotdirsee section 152 Note that writing to this special directory which is sure to exist and be writable by the user is mandated by the nonzero value for the third optional 
44.4 Sending data back and forth

As regards the passing of data between the two programs, so far we have only considered passing series from gretl to R. In order to achieve a satisfactory degree of interoperability, more is needed. In the following subsections we see how matrices can be exchanged, and how data can be passed from R back to gretl.

Passing matrices from gretl to R

For passing matrices from gretl to R, you can use the mwrite matrix function described in section 17.7. For example, the following gretl code fragment generates the matrix

        | 3   7  11 |
  A  =  | 4   8  12 |
        | 5   9  13 |
        | 6  10  14 |

and stores it into the file mymatfile.mat in the user's "dotdir" (see section 15.2). Note that writing to this special directory, which is sure to exist and be writable by the user, is mandated by the non-zero value for the third, optional argument to mwrite.

  matrix A = mshape(seq(3,14), 4, 3)
  err = mwrite(A, "mymatfile.mat", 1)

The recommended R code to import such a matrix is

  A <- gretl.loadmat("mymatfile.mat")

The function gretl.loadmat, which is predefined when R is called from gretl, retrieves the matrix from dotdir. (The "mat" extension for gretl matrix files is not compulsory; you can name these files as you wish.)

It's also possible to take more control over the details of the transfer if you wish. You have the built-in string variable $dotdir in gretl, while in R you have the same variable under the name gretl.dotdir. To use a location other than dotdir, you may (a) omit the third argument to mwrite and supply a full path to the matrix file, and (b) use a more generic approach to reading the file in R. Here's an example:

Gretl side:

  mwrite(A, "/path/to/mymatfile.mat")

R side:

  A <- as.matrix(read.table("/path/to/mymatfile.mat", skip=1))

Passing data from R to gretl

For passing data in the opposite direction, gretl defines a special function that can be used in the R environment. An R object will be written as a temporary file in dotdir, from where it can be easily retrieved from within gretl. The name of this function is gretl.export; it takes one required argument, the object to be exported. At present, the objects that can be exported with this method are matrices, data frames and time-series objects. The function creates a text file, by default with the same name as the exported object (plus an appropriate suffix), in gretl's temporary directory. Data frames and time-series objects are stored as CSV files, and can be retrieved by using gretl's append command. Matrices are stored in a special text format that is understood by gretl (see section 17.7); the file suffix is in this case "mat", and to read the matrix in gretl you must use the mread function.

This function also has an optional second argument, namely a string which specifies a basename for the export file, in case you want to use a name other than that attached to the object within R. As in the default case, an appropriate suffix ("csv" or "mat") will be added to the basename.
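Putting the two directions together, here is a minimal round-trip sketch; the matrix, the file names and the computed product are ours, chosen for illustration, and we again borrow the foreign block from section 44.5:

  matrix A = mshape(seq(3,14), 4, 3)
  mwrite(A, "A.mat", 1)

  foreign language=R --quiet
      A <- gretl.loadmat("A.mat")
      AAt <- A %*% t(A)
      gretl.export(AAt)  # writes AAt.mat in dotdir
  end foreign

  matrix B = mread("AAt.mat", 1)
  print B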
As an example, we take the airline data and use them to estimate a structural time series model a la Harvey (1989).[4] The model we will use is the Basic Structural Model (BSM), in which a time series is decomposed into three terms:

  yt = μt + γt + εt

where μt is a trend component, γt is a seasonal component and εt is a noise term. In turn, the following is assumed to hold:

  μt = μt-1 + βt-1 + ηt
  βt = βt-1 + ζt
  Δs γt = ωt

where Δs is the seasonal differencing operator (1 - L^s) and ηt, ζt and ωt are mutually uncorrelated white noise processes. The object of the analysis is to estimate the variances of the noise components (which may be zero) and to recover estimates of the latent processes μt (the "level"), βt (the "slope") and γt. We will use R's StructTS command and import the results back into gretl.

[4] The function package StucTiSM is available to handle this class of models natively in gretl.

Once the bjg dataset is loaded in gretl, we pass the data to R and execute the following script:

  # extract the log series
  y <- gretldata[, "lg"]
  # estimate the model
  strmod <- StructTS(y)
  # save the fitted components (smoothed)
  compon <- as.ts(tsSmooth(strmod))
  # save the estimated variances
  vars <- as.matrix(strmod$coef)

  # export into gretl's temp dir
  gretl.export(compon)
  gretl.export(vars)

Running this script via gretl produces minimal output:

  current data loaded as ts object "gretldata"
  wrote /home/cottrell/.gretl/compon.csv
  wrote /home/cottrell/.gretl/vars.mat

However, we are now able to pull the results back into gretl by executing the following commands, either from the console or by creating a small script:

  string fname = sprintf("%scompon.csv", $dotdir)
  append @fname
  vars = mread("vars.mat", 1)

The first command reads the estimated time-series components from a CSV file, which is the format that the passing mechanism employs for series. The matrix vars is read from the file vars.mat.

After the above commands have been executed, three new series will have appeared in the gretl workspace, namely the estimates of the three components; by plotting them together with the original data you should get a graph similar to Figure 44.6.

Figure 44.6: Estimated components from BSM (four panels: lg, level, slope and sea)

The estimates of the variances can be seen by printing the vars matrix, as in

  ? print vars
  vars (4 x 1)

    0.00077185
        0.0000
     0.0013969
        0.0000

That is,

  σ̂²η = 0.00077185,  σ̂²ζ = 0,  σ̂²ω = 0.0013969,  σ̂²ε = 0

Notice that, since σ̂²ζ = 0, the estimate for βt is constant and the level component is simply a random walk with a drift.

44.5 Interacting with R from the command line

Up to this point we have spoken only of interaction with R via the GUI program. In order to do the same from the command line interface, gretl provides the foreign command. This enables you to embed non-native commands within a gretl script. A "foreign" block takes the form

  foreign language=R [--send-data[=list]] [--quiet]
      ... R commands ...
  end foreign

and achieves the same effect as submitting the enclosed R commands via the GUI in the non-interactive mode (see section 44.3 above). The --send-data option arranges for auto-loading of the data present in the gretl session, or a subset thereof specified via a named list. The --quiet option prevents the output from R from being echoed in the gretl output.
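The named-list form of --send-data is not illustrated in the listings that follow; a minimal sketch (the list name and printed check are ours) would be:

  open data4-1
  list L = price sqft
  foreign language=R --send-data=L
      print(colnames(gretldata))
  end foreign

which should show that only the two listed series were transferred.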
Using this method, replicating the example in the previous subsection is rather easy: basically, all it takes is encapsulating the content of the R script in a foreign ... end foreign block; see Listing 44.1.

Listing 44.1: Estimation of the Basic Structural Model, simple version

  open bjg.gdt

  foreign language=R --send-data
      y <- gretldata[, "lg"]
      strmod <- StructTS(y)
      compon <- as.ts(tsSmooth(strmod))
      vars <- as.matrix(strmod$coef)
      gretl.export(compon)
      gretl.export(vars)
  end foreign

  append @dotdir/compon.csv
  rename level lg_level
  rename slope lg_slope
  rename sea lg_seas

  vars = mread("vars.mat", 1)

The above syntax, despite being already quite useful by itself, shows its full power when it is used in conjunction with user-written functions. Listing 44.2 shows how to define a gretl function that calls R internally.

Listing 44.2: Estimation of the Basic Structural Model via a function

  function list RStructTS(series myseries)
      smpl ok(myseries) --restrict
      sx = argname(myseries)

      foreign language=R --send-data --quiet
          @sx <- gretldata[, "myseries"]
          strmod <- StructTS(@sx)
          compon <- as.ts(tsSmooth(strmod))
          gretl.export(compon)
      end foreign

      append @dotdir/compon.csv
      rename level @sx_level
      rename slope @sx_slope
      rename sea @sx_seas

      list ret = @sx_level @sx_slope @sx_seas
      return ret
  end function

  # main
  open bjg.gdt
  list X = RStructTS(lg)

44.6 Performance issues with R

R is a large and complex program, which takes an appreciable time to initialize itself. In interactive use this is not a significant problem, but if you have a gretl script that calls R repeatedly the cumulated start-up costs can become bothersome. To get around this, gretl calls the R shared library by preference; in this case the start-up cost is borne only once, on the first invocation of R code from within gretl.

Support for the R shared library is built into the gretl packages for MS Windows and OS X, but the advantage is realized only if the library is in fact available at run time. If you are building gretl yourself on Linux and wish to make use of the R library, you should ensure (a) that R has been built with the shared library enabled (specify --enable-R-shlib when configuring your build of R), and (b) that the pkg-config program is able to detect your R installation. We do not link to the R library at build time, rather we open it dynamically on demand. The gretl GUI has an item under the "Tools, Preferences" menu which enables you to select the path to the library, if it is not detected automatically.

If you have the R shared library installed but want to force gretl to call the R executable instead, you can do

  set R_lib off

44.7 Further use of the R library

Besides improving performance, as noted above, use of the R shared library makes possible a further refinement. That is, you can define functions in R, within a foreign block, then call those functions later in your script, much as if they were gretl functions. This is illustrated below:

  set R_functions on
  foreign language=R
      plusone <- function(q) {
          z <- q + 1
          invisible(z)
      }
  end foreign

  scalar b = R.plusone(2)

The R function plusone is obviously trivial in itself, but the example shows a couple of points. First, for this mechanism to work you need to enable R_functions via the set command. Second, to avoid collision with the gretl function namespace, calls to functions defined in this way must be prefixed with "R.", as in R.plusone. But please note: this mechanism will not work if you have defined a gretl bundle named R; in that case identifiers beginning with "R." will be understood as referring to members of the bundle in question.

Built-in R functions may also be called in this way, once R_functions is set on. For example, one can invoke R's choose function, which computes binomial coefficients:

  set R_functions on
  scalar b = R.choose(10,4)

The use of R functions from within gretl is limited by the need for an unambiguous and lossless mapping between R and gretl data types, both for arguments passed by gretl and for return values generated by R. So far, the following possibilities are supported (see chapter 11 for details on the definition of types on the gretl side):

- The most basic types (real scalars, real matrices and single strings) can be pushed in either direction, no problem. Since gretl 2023b, row and column names will be preserved when transferring matrices.
- A series in gretl can be pushed to R as a vector. If the gretl series is string-valued (see chapter 16), R will receive the string values.
- Gretl's arrays of strings can be pushed to R as vectors of strings, and vice versa.
- Gretl's bundles can be pushed to R as "lists", with tags naming the elements; and R's lists can be retrieved as gretl bundles, provided that their elements have a corresponding gretl type and are identified by tags. But this is subject to the restriction that a gretl bundle passed to R cannot contain instances of the gretl "list" type, or arrays of anything other than strings.
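As a sketch of the bundle-to-list mapping just described (the function name and bundle contents are ours; the R shared library and R_functions are assumed to be enabled):

  set R_functions on
  foreign language=R
      getnames <- function(b) names(b)
  end foreign

  bundle b = defbundle("alpha", 1, "beta", "hello")
  strings S = R.getnames(b)
  print S

Here the bundle arrives in R as a tagged list, and the character vector returned by names() comes back to gretl as an array of strings.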
command-line version of the program is, however, available free of charge for academic users. Quoting again from Doornik's website: "The Console (command line) versions may be used freely for academic research and teaching purposes only. ... The Ox syntax is public, and, of course, you may do with your own Ox code whatever you wish." If you wish to use Ox in conjunction with gretl, please refer to doornik.com for further details on licensing.

As the reader will no doubt have noticed, most other software that we discuss in this Guide is open source and freely available for all users. We make an exception for Ox on the grounds that it is indeed fast and well designed, and that its statistical library (along with various add-on packages that are also available) has exceptional coverage of cutting-edge techniques in econometrics. The gretl authors have used Ox for benchmarking some of gretl's more advanced features, such as dynamic panel models and state space models.[1]

[1] For a review of Ox, see Cribari-Neto and Zarkos (2003); for a somewhat dated comparison of Ox with other matrix-oriented packages such as GAUSS, see Steinhaus (1999).

45.2 Ox support in gretl

The support offered for Ox in gretl is similar to that offered for R, as discussed in chapter 44. To enable support for Ox, go to the "Tools, Preferences, General" menu item and look under the "Programs" tab. Find the entry for the path to the oxl executable, that is, the program that runs Ox files (on MS Windows it is called oxl.exe). Adjust the path if it's not already right for your system and you should be ready to go.

With support enabled, you can open and edit Ox programs in the gretl GUI. Clicking the "execute" icon in the editor window will send your code to Ox for execution. Figures 45.1 and 45.2 show an Ox program and part of its output.

Figure 45.1: Ox editing window
Figure 45.2: Output from Ox

In addition, you can embed Ox code within a gretl script using a foreign block, as described in connection with R. A trivial example, which simply prints the gretl data matrix within Ox, is shown below:

  open data4-1
  matrix m = {dataset}
  mwrite(m, "gretl.mat", 1)

  foreign language=Ox
      #include <oxstd.h>
      main()
      {
          decl gmat = gretl_loadmat("gretl.mat");
          print(gmat);
      }
  end foreign

The above example illustrates how a matrix can be passed from gretl to Ox. We use the mwrite function to write a matrix into the user's "dotdir" (see section 15.2), then in Ox we use the function gretl_loadmat to retrieve the matrix.

How does gretl_loadmat come to be defined? When gretl writes out the Ox program corresponding to your foreign block, it does two things in addition. First, it writes a small utility file named gretl_io.ox into your dotdir. This contains a definition for gretl_loadmat, and also for the function gretl_export (see below). Second, gretl interpolates into your Ox code a line which includes this utility file (it is inserted right after the inclusion of oxstd.h, which is needed in all Ox programs). Note that gretl_loadmat expects to find the named file in the user's dotdir.
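In the opposite direction, gretl_export takes a matrix and a file name, and writes the file into the dotdir, from where gretl's mread can pick it up. A minimal sketch (the matrix literal and file name are ours):

  foreign language=Ox
      #include <oxstd.h>
      main()
      {
          decl x = <1, 2; 3, 4>;     // 2x2 Ox matrix literal
          gretl_export(x, "x.mat");  // written to the dotdir
      }
  end foreign

  matrix x = mread("x.mat", 1)
  print x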
missing values for use with Ox; and also note that the --send-data option for the foreign command is not available in connection with Ox.

We get the parameter estimates back from Ox using gretl_export on the Ox side and mread on the gretl side. The gretl_export function takes two arguments, a matrix and a file name. The file is written into the user's dotdir, from where it can be picked up using mread. The final portion of the output from Listing 45.1 is shown below:

  ? matrix oxparm = mread("oxparm.mat", 1)
  Generated matrix oxparm
  ? eval abs((parm - oxparm) ./ oxparm)
    1.4578e-13
    3.5642e-13
    5.0672e-15
    1.6091e-13
    8.9808e-15
    2.0450e-14
    1.0218e-13
    2.1048e-13
    9.5898e-15
    1.8658e-14
    2.1852e-14
    2.9451e-13
    1.9398e-13

Listing 45.1: Estimation of dynamic panel data model via gretl and Ox

  open abdata.gdt

  # 1-step GMM estimation
  dpanel 2 ; n w w(-1) k ys ys(-1) 0 --time-dummies --dpdstyle
  matrix parm = $coeff

  # Write CSV file for Ox
  set csv_na .NaN
  store @dotdir/abdata.csv

  # Replicate using the Ox DPD package
  foreign language=Ox
      #include <oxstd.h>
      #import <packages/dpd/dpd>

      main()
      {
          decl dpd = new DPD();
          dpd.Load("@dotdir/abdata.csv");
          dpd.SetYear("YEAR");

          dpd.Select(Y_VAR, {"n", 0, 2});
          dpd.Select(X_VAR, {"w", 0, 1, "k", 0, 0, "ys", 0, 1});
          dpd.Select(I_VAR, {"w", 0, 1, "k", 0, 0, "ys", 0, 1});

          dpd.Gmm("n", 2, 99);            // GMM-type instrument
          dpd.SetDummies(D_CONSTANT + D_TIME);
          dpd.SetTest(2, 2);              // Sargan, AR 1-2 tests
          dpd.Estimate();                 // 1-step estimation
          decl parm = dpd.GetPar();
          gretl_export(parm, "oxparm.mat");
          delete dpd;
      }
  end foreign

  # Compare the results
  matrix oxparm = mread("oxparm.mat", 1)
  eval abs((parm - oxparm) ./ oxparm)

Chapter 46: Gretl and Octave

46.1 Introduction

GNU Octave, written by John W. Eaton and others, is described as "a high-level language, primarily intended for numerical computations." The program is oriented towards "solving linear and nonlinear problems numerically" and "performing other numerical experiments using a language that is mostly compatible with Matlab" (www.gnu.org/software/octave). Octave is available in source-code form (naturally, for GNU software) and also in the form of binary packages for MS Windows and Mac OS X. Numerous contributed packages that extend Octave's functionality in various ways can be found at octave.sf.net.

46.2 Octave support in gretl

The support offered for Octave in gretl is similar to that offered for R (chapter 44). For example, you can open and edit Octave scripts in the gretl GUI. Clicking the "execute" icon in the editor window will send your code to Octave for execution. Figures 46.1 and 46.2 show an Octave script and its output; in this example we use the function logistic_regression to replicate some results from Greene (2000).

Figure 46.1: Octave editing window
Figure 46.2: Output from Octave

In addition, you can embed Octave code within a gretl script using a foreign block, as described in connection with R. A trivial example is shown below: it simply loads and prints the gretl data matrix within Octave, then takes it back to gretl and checks for any difference (there should be none). Note that in Octave, appending a semicolon to a line suppresses verbose output; leaving off the semicolon results in printing of the object that is produced, if any.

  open data4-1
  matrix m = {dataset}
  mwrite(m, "gretl.mat", 1)

  foreign language=Octave
      gmat = gretl_loadmat("gretl.mat")
      gretl_export(gmat, "octave.mat")
  end foreign

  matrix chk = mread("octave.mat", 1)
  eval maxr(maxc(abs(m - chk)))

The functions gretl_loadmat and gretl_export, which are predefined when you run Octave from within gretl, have the following signatures:

  function A = gretl_loadmat(fname, autodot=1)
  function gretl_export(X, fname, autodot=1)

By default, traffic in matrices goes via the user's "dotdir" (see section 15.2); on the Octave side, that is, the name of this directory is prepended to the filename for both reading and writing. This is complementary to use of the export and import parameters with gretl's mwrite and mread functions, respectively. However, if you wish to take control over the reading and writing locations, you can supply a zero value for autodot (or give an absolute path) when calling gretl_loadmat and gretl_export; in that case the filename argument is used as is.
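For instance, a sketch of the autodot=0 variant (the /tmp location is ours, and is Unix-specific):

  matrix m = mnormal(2,2)
  mwrite(m, "/tmp/m.mat")   # full path, so no dotdir on the gretl side

  foreign language=Octave
      m = gretl_loadmat("/tmp/m.mat", 0);   % autodot=0: use the path as given
      gretl_export(m', "/tmp/mt.mat", 0);
  end foreign

  matrix mt = mread("/tmp/mt.mat")
  print mt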
46.3 Illustration: spectral methods

We now present a more ambitious example which exploits Octave's handling of the frequency domain (and also its ability to use code written for MATLAB), namely estimation of the spectral coherence of two time series. For this illustration we require two extra Octave packages from octave.sf.net, namely those supporting spectral functions (specfun) and signal processing (signal). After downloading the packages you can install them from within Octave as follows (using version numbers as of March 2010):

  pkg install specfun-1.0.8.tar.gz
  pkg install signal-1.0.10.tar.gz

In addition, we need some specialized MATLAB files made available by Mario Forni of the University of Modena, at http://morgana.unimore.it/forni_mario/matlab.htm. The files needed are coheren2.m, coheren.m, coher.m, cospec.m, crosscov.m, crosspec.m, crosspe.m and spec.m. These are in a form appropriate for MS Windows. On Linux you could run the following shell script to get the files and remove the Windows end-of-file character (which prevents the files from running under Octave):

  SITE=http://morgana.unimore.it/forni_mario/MYPROG
  # download files and delete trailing Ctrl-Z
  for f in coheren2.m coheren.m coher.m cospec.m \
           crosscov.m crosspec.m crosspe.m spec.m ; do
      wget $SITE/$f
      cat $f | tr -d \\032 > tmp.m
      mv tmp.m $f
  done

The Forni files should be placed in some appropriate directory, and you should tell Octave where to find them by adding that directory to Octave's loadpath. On Linux this can be done via an entry in one's ~/.octaverc file. For example:

  addpath("~/stats/octave/forni")

Alternatively, an addpath directive can be written into the Octave script that calls on these files.

With everything set up on the Octave side, we now write a gretl script (see Listing 46.1) which opens a time-series dataset, constructs and writes a matrix containing two series, and defines a foreign block containing the Octave statements needed to produce the spectral coherence matrix. This matrix is exported via gretl_export and picked up using mread. Finally, we produce a graph from the matrix in gretl. In the script this is sent to the screen; Figure 46.3 shows the same graph in PDF format.

Listing 46.1: Estimation of spectral coherence via Octave

  open data9-7
  matrix xy = {PRIME, UNEMP}
  mwrite(xy, "xy.mat", 1)

  foreign language=Octave
      pkg load signal
      # uncomment and modify the following if necessary
      # addpath("~/stats/octave/forni")
      xy = gretl_loadmat("xy.mat");
      x = xy(:,1);
      y = xy(:,2);
      # note: the last parameter is the Bartlett window size
      h = coher(x, y, 8);
      gretl_export(h, "h.mat");
  end foreign

  h = mread("h.mat", 1)
  cnameset(h, "coherence")
  gnuplot 1 --time-series --with-lines --matrix=h --output=display

Figure 46.3: Spectral coherence estimated via Octave
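If you would rather save the graph than display it on screen, the display keyword in the final command can be replaced by a file name, whose suffix selects the output format; a one-line sketch (the file name is ours):

  gnuplot 1 --time-series --with-lines --matrix=h --output=coherence.pdf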
Chapter 47: Gretl and Stata

Stata (www.stata.com) is closed-source, proprietary (and expensive) software and, as such, is not a natural companion to gretl. Nonetheless, given Stata's popularity it is desirable to have a convenient way of comparing results across the two programs, and to that end we provide some support for Stata code under the foreign command.

To enable support for Stata, go to the "Tools, Preferences, General" menu item and look under the "Programs" tab. Find the entry for the path to the Stata executable. Adjust the path if it's not already right for your system and you should be ready to go.

The following example illustrates what's available. You can send the current gretl dataset to Stata using the --send-data flag. And having defined a matrix within Stata, you can export it for use with gretl via the gretl_export command: this takes two arguments, the name of the matrix to export and the filename to use; the file is written to the user's "dotdir", from where it can be retrieved using the mread function.[1] To suppress printed output from Stata you can add the --quiet flag to the foreign block.

[1] We do not currently offer the complementary functionality of gretl_loadmat, which enables reading of matrices written by gretl's mwrite function, in Ox and Octave. This is not at all easy to implement in Stata code.

Listing 47.1: Comparison of clustered standard errors with Stata

  function matrix stata_reorder (matrix se)
      # stata puts the intercept last, but gretl puts it first
      scalar n = rows(se)
      return se[n] | se[1:n-1]
  end function

  open data4-1
  ols 1 0 2 3 --cluster=bedrms
  matrix se = $stderr

  foreign language=stata --send-data
      regress price sqft bedrms, vce(cluster bedrms)
      matrix vcv = e(V)
      gretl_export vcv "vcv.mat"
  end foreign

  matrix stata_vcv = mread("vcv.mat", 1)
  stata_se = stata_reorder(sqrt(diag(stata_vcv)))
  matrix check = se - stata_se
  print check

In addition, you can edit "pure" Stata scripts in the gretl GUI and send them for execution, as with native gretl scripts.

Note that Stata coerces all variable names to lower case on data input, so even if series names in gretl are upper case, or of mixed case, it's necessary to use all lower case in Stata. Also note that when opening a data file within Stata via the use command, it will be necessary to provide the full path to the file.
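Beyond Listing 47.1, the same machinery can be used to pull any Stata matrix back into gretl. A minimal sketch (the statistics chosen and the file name are ours; r(mean) and r(sd) are standard returns of Stata's summarize command):

  open data4-1
  foreign language=stata --send-data --quiet
      quietly summarize price
      matrix pstats = (r(mean), r(sd))
      gretl_export pstats "pstats.mat"
  end foreign
  matrix pstats = mread("pstats.mat", 1)
  print pstats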
Chapter 48: Gretl and Python

48.1 Introduction

According to www.python.org, Python is "an easy to learn, powerful programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python's elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms." Indeed, Python is widely used in a great variety of contexts. Numerous add-on modules are available; the ones likely to be of greatest interest to econometricians include NumPy ("the fundamental package for scientific computing with Python", see www.numpy.org), SciPy (which builds on NumPy, see www.scipy.org) and Statsmodels (http://statsmodels.sourceforge.net).

48.2 Python support in gretl

The support offered for Python in gretl is similar to that offered for Octave (chapter 46). You can open and edit Python scripts in the gretl GUI. Clicking the "execute" icon in the editor window will send your code to Python for execution. In addition, you can embed Python code within a gretl script using a foreign block, as described in connection with R.

When you launch Python from within gretl, one variable and two convenience functions are pre-defined, as follows:

  gretl_dotdir
  gretl_loadmat(filename, autodot=1)
  gretl_export(M, filename, autodot=1)

The variable gretl_dotdir holds the path to the user's "dot directory". The first function loads a matrix of the given filename, as written by gretl's mwrite function, and the second writes matrix M, under the given filename, in the format wanted by gretl.

By default, the traffic in matrices goes via the dot directory on the Python side; that is, the name of this directory is prepended to the filename for both reading and writing. This is complementary to use of the export and import parameters with gretl's mwrite and mread functions, respectively. However, if you wish to take control over the reading and writing locations, you can supply a zero value for autodot (or give an absolute path) when calling gretl_loadmat and gretl_export; in that case the filename argument is used as is.

Note that gretl_loadmat and gretl_export depend on NumPy; they make use of the functions loadtxt and savetxt, respectively. Nonetheless, the presence of NumPy is not an absolute requirement if you don't need to use these two functions.

48.3 Illustration: linear regression with multicollinearity

Listing 48.1 compares the numerical accuracy of gretl's ols command with that of the function linalg.lstsq in NumPy, using the notorious Longley test data, which exhibit extreme multicollinearity. Unlike some econometrics packages, NumPy does a good job on these data. The script computes and prints the log-relative error in estimation of the regression coefficients, using the NIST-certified values as a benchmark;[1] the error values correspond to the number of correct digits (with a maximum of 15). The results will likely differ somewhat by computer architecture and compiler.

[1] See http://www.itl.nist.gov/div898/strd/lls/data/Longley.shtml.

Listing 48.1: Comparing regression results with Python

  set verbose off

  function matrix logrel_err (const matrix est, const matrix true)
      return -log10(abs(est - true) ./ abs(true))
  end function

  open longley.gdt --quiet
  list LX = prdefl..year
  ols employ 0 LX --quiet
  matrix b_gretl = $coeff
  mwrite({employ, const, LX}, "alldata.mat", 1)

  foreign language=python
      import numpy as np
      X = gretl_loadmat("alldata.mat", 1)
      # NumPy's OLS
      b = np.linalg.lstsq(X[:,1:], X[:,0])[0]
      gretl_export(np.transpose(np.matrix(b)), "py_b.mat", 1)
  end foreign

  # NIST's certified coefficient values
  matrix b_nist = {-3482258.63459582; 15.0618722713733;
    -0.358191792925910E-01; -2.02022980381683;
    -1.03322686717359; -0.511041056535807E-01;
    1829.15146461355}

  matrix b_numpy = mread("py_b.mat", 1)
  matrix E = logrel_err(b_gretl, b_nist) ~ logrel_err(b_numpy, b_nist)
  cnameset(E, "gretl python")

  printf "Log-relative errors, Longley coefficients:\n\n%12.5g\n", E
  printf "Column means:\n%12.5g\n", meanc(E)

The output from this script reads:

  Log-relative errors, Longley coefficients:

         gretl      python
        12.844      12.850
        11.528      11.414
        12.393      12.401
        13.135      13.121
        13.738      13.318
        12.587      12.363
        12.848      12.852

  Column means:
        12.725      12.617
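The same export/import machinery also works for plain matrix jobs; a minimal round-trip sketch (the matrix, the product computed and the file names are ours):

  matrix A = mnormal(3,3)
  mwrite(A, "A.mat", 1)

  foreign language=python
      import numpy as np
      A = gretl_loadmat("A.mat")
      gretl_export(A.T @ A, "AtA.mat")
  end foreign

  matrix AtA = mread("AtA.mat", 1)
  print AtA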
Chapter 49: Gretl and Julia

49.1 Introduction

According to julialang.org, Julia is "a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments. It provides a sophisticated compiler, distributed parallel execution, numerical accuracy, and an extensive mathematical function library." Julia is well known for being very fast; however, you should be aware that by default starting Julia takes some time, due to just-in-time (JIT) compilation of the input. This fixed cost is well worth bearing if you are asking Julia to perform a big computation, but small jobs are likely to run faster if you use the Julia-specific --no-compile option with the foreign command.[1]

[1] Caveat: it seems that this option is not supported by all builds of Julia.

49.2 Julia support in gretl

The support offered for Julia in gretl is similar to that offered for Octave (chapter 46). You can open and edit Julia scripts in the gretl GUI. Clicking the "execute" icon in the editor window will send your code to Julia for execution. In addition, you can embed Julia code within a gretl script using a foreign block, as described in connection with R.

When you launch Julia from within gretl, one variable and two convenience functions are pre-defined, as follows:

  gretl_dotdir
  gretl_loadmat(filename, autodot=true)
  gretl_export(M, filename, autodot=true)

The variable gretl_dotdir holds the path to the user's "dot directory". The first function loads a matrix of the given filename, as written by gretl's mwrite function, and the second writes matrix M, under the given filename, in the format wanted by gretl.

By default, the traffic in matrices goes via the dot directory on the Julia side; that is, the name of this directory is prepended to the filename for both reading and writing. This is complementary to use of the export and import parameters with gretl's mwrite and mread functions, respectively. However, if you wish to take control over the reading and writing locations, you can supply a zero value for autodot (or give an absolute path) when calling gretl_loadmat and gretl_export; in that case the filename argument is used as is.

49.3 Illustration

Listing 49.1 shows a minimal example of how to interact with Julia from a gretl script. Since this is a very small job, JIT compilation is not worthwhile: in our testing, the script runs almost 4 times faster if the Julia block is opened with "foreign language=julia --no-compile". This has the effect of passing the option --compile=no to the Julia executable.

Listing 49.1: Simple Julia I/O example

  set verbose off
  matrix A = mnormal(4,4)        # generate a random matrix
  mwrite(A, "A", 1)              # and save it to a file

  foreign language=julia         # call Julia
      println("Hi from Julia!")         # output a string
      A = gretl_loadmat("A")            # grab the matrix from gretl
      gretl_export(inv(A), "iA.mat")    # and save its inverse
  end foreign

  # go back to gretl
  matrix iA = mread("iA.mat", 1)   # read the inverse from Julia
  matrix check = A * iA            # compute the product
  print check                      # print out the check (should be I)

Output (a good approximation to the identity matrix):

  Hi from Julia!

  check (4 x 4)

    1.0000       6.9389e-18   1.6653e-16   1.6653e-16
    0.0000       1.0000       0.0000       0.0000
    4.4409e-16   8.3267e-17   1.0000       6.6613e-16
    4.4409e-16   1.3878e-17   1.1102e-16   1.0000
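To try the faster variant mentioned above, only the opening line of the foreign block needs to change; a sketch (subject to the caveat that not all Julia builds support the option):

  foreign language=julia --no-compile
      println("Hi from Julia!")
      A = gretl_loadmat("A")
      gretl_export(inv(A), "iA.mat")
  end foreign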
Chapter 50: Troubleshooting gretl

50.1 Bug reports

Bug reports are welcome; well, if not exactly welcome, then useful and appreciated. Hopefully, you are unlikely to find bugs in the actual calculations done by gretl (although this statement does not constitute any sort of warranty). You may, however, come across bugs or oddities in the behavior of the graphical interface. Please remember that the usefulness of bug reports is greatly enhanced if you can be as specific as possible: what exactly went wrong, under what conditions, and on what operating system? If you saw an error message, what precisely did it say?

One way of making a bug report more useful is to run the program in such a way that you can see (and copy) any additional information that gets printed to the stderr output stream. On Linux and Mac OS X that's just a matter of launching gretl from the command prompt in a terminal window. On MS Windows it's a bit more complicated, since stderr is by default invisible. However, you can quite easily set up a special gretl shortcut that does the job. On the Windows desktop, right-click and select "New, shortcut". In the dialog box that appears, browse to find gretl.exe and append the --debug flag, as shown in Figure 50.1. Note that there are two dashes before "debug".

Figure 50.1: Creating a debugging shortcut

When you start gretl in this mode, a "console window" appears as well as the gretl window, and stderr output goes to the console. To copy this output, click at the top left of the console window for a menu (see Figure 50.2): first do "Select all", then "Copy". You can paste the results into Notepad or similar.

Figure 50.2: The program with console window

50.2 Auxiliary programs

As mentioned above, gretl calls some other programs to accomplish certain tasks (gnuplot for graphing, LaTeX for high-quality typesetting of regression output, GNU R). If something goes wrong with such external links, it is not always easy for gretl to produce an informative error message. If such a link fails when accessed from the gretl graphical interface, you may be able to get more information by starting gretl from the command prompt rather than via a desktop menu entry or icon. On the X window system, start gretl from the shell prompt in an xterm; on MS Windows, start the program gretl.exe from a console window (or "DOS box") using the -g or --debug option flag. Additional error messages may be displayed on the terminal window.

Also please note that, for most external calls, gretl assumes that the programs in question are available in your "path", that is, that they can be invoked simply via the name of the program, without supplying the program's full location.[1] Thus, if a given program fails, try the experiment of typing the program name at the command prompt, as shown below.

                   Graphing      Typesetting   GNU R
  X window system  gnuplot       pdflatex      R
  MS Windows       wgnuplot.exe  pdflatex      RGui.exe

[1] The exception to this rule is the invocation of gnuplot under MS Windows, where a full path to the program is given.

If the program fails to start from the prompt, it's not a gretl issue but rather that the program's home directory is not in your path, or the program is not installed (properly). For details on modifying your path, please see the documentation or online help for your operating system or shell.

Chapter 51: The command line interface

The gretl package includes the command-line program gretlcli. On Linux it can be run from a terminal window (xterm, rxvt, or similar) or at the text console. Under MS Windows it can be run in a console window (sometimes inaccurately called a "DOS box"). gretlcli has its own help file, which may be accessed by typing "help" at the prompt. It can be run in batch mode, sending output directly to a file (see also the Gretl Command Reference).

If gretlcli is linked to the readline library (this is automatically the case in the MS Windows version; also see Appendix B), the command line is recallable and editable, and offers command completion. You can use the Up and Down arrow keys to cycle through previously typed commands. On a given command line, you can use the arrow keys to move around, in conjunction with Emacs editing keystrokes.[1] The most common of these are:

  Keystroke   Effect
  Ctrl-a      go to start of line
  Ctrl-e      go to end of line
  Ctrl-d      delete character to right

where "Ctrl-a" means press the "a" key while the "Ctrl" key is also depressed. Thus, if you want to change something at the beginning of a command, you don't have to backspace over the whole line, erasing as you go: just hop to the start and add or delete characters. If you type the first letters of a command name, then press the Tab key, readline will attempt to complete the command name for you. If there's a unique completion it will be put in place automatically. If there's more than one completion, pressing Tab a second time brings up a list.

[1] Actually, the key bindings shown above are only the defaults; they can be customized. See the readline manual.

Probably the most useful mode for heavy-duty work with gretlcli is batch (non-interactive) mode, in which the program reads and processes a script and sends the output to file. For example:

  gretlcli -b scriptfile > outputfile

Note that scriptfile is treated as a program argument; only the output file requires redirection. Don't forget the -b (batch) switch, otherwise the program will wait for user input after executing the script (and if output is redirected, the program will appear to "hang").
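For instance, supposing a file named regress.inp holding a two-line script (the script contents and file names here are ours, for illustration), the batch run would look like this:

  # contents of regress.inp
  open data4-1
  ols price const sqft

  # at the shell prompt
  gretlcli -b regress.inp > regress.out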
Appendix A: Data file details

A.1 Basic native format

In gretl's basic native data format, for which we use the suffix gdt, a dataset is stored in XML (extensible markup language). Data files correspond to the simple DTD (document type definition) given in gretldata.dtd, which is supplied with the gretl distribution and is installed in the system data directory (e.g. /usr/share/gretl/data on Linux). Such files may be plain text (uncompressed) or gzipped. They contain the actual data values plus additional information, such as the names and descriptions of variables, the frequency of the data, and so on.

In a gdt file the actual data values are written to 17 significant figures for generated data (such as logs or pseudo-random numbers), or to a maximum of 15 figures for primary data. The C printf format %g is used (with 17 or 15 significant digits), so that trailing zeros are not printed.

Most users will probably not have need to read or write such files other than via gretl itself, but if you want to manipulate them using other software tools you should examine the DTD, and also take a look at a few of the supplied practice data files: data4-1.gdt gives a simple example; data4-10.gdt is an example where observation labels are included.

A.2 Binary data file format

A native binary format is also available for dataset storage. This format (with suffix gdtb) offers much faster writing and reading for very large datasets. For small to moderately sized datasets (say, up to a few megabytes) there is little advantage in the binary format, and we recommend use of plain gdt. Note that gdtb files are saved in the endianness of the machine on which they're written, and are not portable across platforms of differing endianness; but since almost all machines on which gretl is likely to be run are little-endian, this is unlikely to be a serious limitation.

The implementation of gdtb format can be found in purebin.c, in the plugin subdirectory of the gretl source tree. Prior to version 2021b of gretl, gdtb files had a different structure, namely a PKZIP file containing an XML component for the metadata plus a binary component for the actual data values. It turned out that this hybrid format did not scale well for datasets with a great deal of metadata. For backward compatibility gretl can still read such old-style files, but it doesn't write them any more.

A.3 Native database format

A gretl database has two primary parts: a plain-text "index" file (with filename suffix idx) containing information on the included series, and a binary file (suffix bin) containing the actual data. Two examples of the format for an entry in the idx file are shown below:

  G0M910  Composite index of 11 leading indicators (1987=100)
  M  1948.01 - 1995.11  n = 575

  currbal  Balance of Payments: Balance on Current Account; SA
  Q  1960.1 - 1999.4  n = 160

The first field is the series name. The second is a description of the series (maximum 128 characters). On the second line, the first field is a frequency code: M for monthly, Q for quarterly, A for annual, B for business-daily (daily with five days per week), D for 7-day daily, S for 6-day daily, U for undated. No other frequencies are accepted at present. Then comes the starting date (with two digits following the point for monthly data, one for quarterly data, none for annual), a space, a hyphen, another space, the ending date, the string "n = " and the integer number of observations. In the case of daily data, the starting and ending dates should be given in the ISO 8601 form, YYYY-MM-DD. This format must be respected exactly.
Optionally the first line of the index file may contain a short comment up to 64 characters on the source and nature of the data following a hash mark For example Federal Reserve Board interest rates The corresponding binary database file holds the data values represented as floats that is single precision floatingpoint numbers taking four bytes apiece The values are packed by variable so that the first n numbers are the observations of variable 1 the next m the observations on variable 2 and so on A third file may accompany the idx and bin files namely a codebook containing a description of the data If present this must be plain text with filename suffix cb or PDF with suffix pdf The components of a gretl database are generally combined in a single file with zlib compression and gz suffix for distribution A small program named gretlzip can be used to create or unpack such files See the utilsdbzip subdirectory of the gretl source tree Appendix B Building gretl Here we give instructions detailed enough to allow a user with only a basic knowledge of a Unix type system to build gretl These steps were tested on a fresh installation of Debian Etch For other Linux distributions especially Debianbased ones like Ubuntu and its derivatives little should change Other Unixlike operating systems such as Mac OS X and BSD would probably require more substantial adjustments In this guided example we will build gretl complete with documentation This introduces a few more requirements but gives you the ability to modify the documentation files as well like the help files or the manuals B1 Installing the prerequisites We assume that the basic GNU utilities are already installed on the system together with these other programs some TEXLATEXsystem texlive will do beautifully Gnuplot ImageMagick We also assume that the user has administrative privileges and knows how to install packages The examples below are carried out using the aptget shell command but they can be performed with menubased utilities like aptitude dselect or the GUIbased program synaptic Users of Linux distributions which employ rpm packages eg Red HatFedora Mandriva SuSE may want to refer to the dependencies page on the gretl website The first step is installing the C compiler and related basic utilities if these are not already in place On a Debian or derivative system these are contained in a bunch of packages that can be installed via the command aptget install gcc autoconf automake19 libtool flex bison gccdoc libc6dev libcdev gfortran gettext pkgconfig Then it is necessary to install the development dev packages for the libraries that gretl uses 472 Appendix B Building gretl 473 Library command GLIB aptget install libglib20dev GTK 30 aptget install libgtk30dev PNG aptget install libpng12dev XSLT aptget install libxslt1dev LAPACK aptget install liblapackdev FFTW aptget install libfftw3dev READLINE aptget install libreadlinedev ZLIB aptget install zlib1gdev XML aptget install libxml2dev GMP aptget install libgmpdev CURL aptget install libcurl4gnutlsdev MPFR aptget install libmpfrdev It is possible to substitute GTK 20 for GTK 30 The dev packages for these libraries are necessary to compile gretlyoull also need the plain nondev library packages to run gretl but most of these should already be part of a standard installation In order to enable other optional features like audio support you may need to install more libraries The above steps can be much simplified on Linux systems that provide debbased package managers such as Debian and its 
derivatives Ubuntu Knoppix and other distributions The command aptget builddep gretl will download and install all the necessary packages for building the version of gretl that is currently present in your APT sources Technically this does not guarantee that all the software necessary to build the git version is included because the version of gretl on your repository may be quite old and build requirements may have changed in the meantime However the chances of a mismatch are rather remote for a reasonably uptodate system so in most cases the above command should take care of everything correctly B2 Getting the source release or git At this point it is possible to build from the source You have two options here obtain the latest released source package or retrieve the current git version of gretl git the version control software currently in use for gretl The usual caveat applies to the git version namely that it may not build correctly and may contain experimental code on the other hand git often contains bugfixes relative to the released version If you want to help with testing and to contribute bug reports we recommend using git gretl To work with the released source 1 Download the gretl source package from gretlsourceforgenet 2 Unzip and untar the package On a system with the GNU utilities available the command would be tar xvfJ gretlNtarxz replace N with the specific version number of the file you downloaded at step 1 3 Change directory to the gretl source directory created at step 2 eg gretl2020a 4 Proceed to the next section Configure and make To work with git youll first need to install the git client program if its not already on your sys tem Relevant resources you may wish to consult include the main git website at gitscmcom and instructions specific to gretl gretl git basics Appendix B Building gretl 474 When grabbing the git sources for the first time you should first decide where you want to store the code For example you might create a directory called git under your home directory Open a terminal window cd into this directory and type the following commands git clone gitgitcodesfnetpgretlgit gretlgit At this point git should create a subdirectory named gretlgit and fill it with the current sources When you want to update the source this is very simple just move into the gretlgit directory and type git pull Assuming youre now in the gretlgit directory you can proceed in the same manner as with the released source package B3 Configure the source The next command you need is configure this is a complex script that detects which tools you have on your system and sets things up The configure command accepts many options you may want to run configure help first to see what options are available One option you way wish to tweak is prefix By default the installation goes under usrlocal but you can change this For example configure prefixusr will put everything under the usr tree Note that the recommended location to build gretl is not in the source directory The way to achieve that is quite simple the invocation of the configure script has to take into account the relative path to the source tree So if your build directory is inside underneath the source tree it is configure while if it is in a parallel tree it would be something like gretlgitconfigure If you have a multicore machine you may want to activate support for OpenMP which permits the parallelization of matrix multiplication and some other tasks This requires adding the configure flag enableopenmp By default the gretl GUI 
is built using version 30 of the GTK library if available otherwise version 20 If you have both versions installed and prefer to use GTK 20 use the flag enablegtk2 In order to have the documentation built we need to pass the relevant option to configure as in enablebuilddoc Appendix B Building gretl 475 But please note that this option will work only if you are using the git source In order to build the documentation there is the possibility that you will have to install some extra software on top of the packages mentioned in the previous section For example you may need some extra LATEX packages to compile the manuals Two of the required packages that not every standard LATEX installation include are typically pifontsty and appendixsty You could install the corresponding packages from your distribution or you could simply download them from CTAN and install them by hand This for example if you want to install under usr with OpenMP support and also build the documentation you would do configure prefixusr enableopenmp enablebuilddoc You will see a number of checks being run and if everything goes according to plan you should see a summary similar to that displayed in Listing B1 Listing B1 Sample output from configure Configuration Installation path usr Use readline library yes Use gnuplot for graphs yes Use LaTeX for typesetting output yes Use libgsf for zipunzip no sse2 support for RNG yes OpenMP support yes MPI support no AVX support for arithmetic no Build with GTK version 20 Build gretl documentation yes Use Lucida fonts no Build message catalogs yes X12ARIMA support yes TRAMOSEATS support yes libR support yes ODBC support no Experimental audio support no Use xdgutils in installation if DESTDIR not set LAPACK libraries llapack lblas lgfortran Now type make to build gretl You can also do make pdfdocs to build the PDF documentation If youre using git its a good idea to rerun the configure script after doing an update This is not always necessary but sometimes it is and it never does any harm For this purpose you may want to write a little shell script that calls configure with any options you want to use B4 Build and install We are now ready to undertake the compilation proper this is done by running the make command which takes care of compiling all the necessary source files in the correct order All you need to do Appendix B Building gretl 476 is type make This step will likely take several minutes to complete a lot of output will be produced on screen Once this is done you can install your freshly baked copy of gretl on your system via make install On most systems the make install command requires you to have administrative privileges Hence either you log in as root before launching make install or you may want to use the sudo utility as in sudo make install Now try if everything works go back to your home directory and run gretl cd gretl If all is well you ought to see gretl start at which point just exit the program in the usual way On the other hand there is the possibility that gretl doesnt start and instead you see a message like usrlocalbingretlx11 error while loading shared libraries libgretl10so0 cannot open shared object file No such file or directory In this case just run sudo ldconfig The problem should be fixed once and for all Appendix C Numerical accuracy Gretl uses doubleprecision arithmetic throughoutexcept for the multipleprecision plugin in voked by the menu item Model Other linear models High precision OLS which represents floating point values using a number of bits 
given by the environment variable GRETLMPBITS default value 256 The normal equations of Least Squares are by default solved via Cholesky decomposition which is highly accurate provided the matrix of crossproducts of the regressors XX is not very ill conditioned If this problem is detected gretl automatically switches to use QR decomposition The program has been tested rather thoroughly on the statistical reference datasets provided by NIST the US National Institute of Standards and Technology and a full account of the results may be found on the gretl website follow the link Numerical accuracy To date two published reviews have discussed gretls accuracy Giovanni Baiocchi and Walter Dis taso 2003 and Talha Yalta and Yasemin Yalta 2007 We are grateful to these authors for their careful examination of the program Their comments have prompted several modifications includ ing the use of Stephen Moshiers cephes code for computing pvalues and other quantities relating to probability distributions see netliborg changes to the formatting of regression output to en sure that the program displays a consistent number of significant digits and attention to compiler issues in producing the MS Windows version of gretl which at one time was slighly less accurate than the Linux version Gretl now includes a plugin that runs the NIST linear regression test suite You can find this under the Tools menu in the main window When you run this test the introductory text explains the expected result If you run this test and see anything other than the expected result please send a bug report to cottrellwfuedu All regression statistics are printed to 6 significant figures in the current version of gretl except when the multipleprecision plugin is used in which case results are given to 12 figures If you want to examine a particular value more closely first save it for example using the genr command then print it using printf to as many digits as you like see the Gretl Command Reference 477 Appendix D Related free software Gretls capabilities are substantial and are expanding Nonetheless you may find there are some things you cant do in gretl or you may wish to compare results with other programs If you are looking for complementary functionality in the realm of free opensource software we recommend the following programs The selfdescription of each program is taken from its website GNU R rprojectorg R is a system for statistical computation and graphics It consists of a language plus a runtime environment with graphics a debugger access to certain system functions and the ability to run programs stored in script files It compiles and runs on a wide variety of UNIX platforms Windows and MacOS Comment There are numerous addon packages for R covering most areas of statistical work GNU Octave wwwoctaveorg GNU Octave is a highlevel language primarily intended for numerical computations It provides a convenient command line interface for solving linear and nonlinear problems numerically and for performing other numerical experiments using a language that is mostly compatible with Matlab It may also be used as a batchoriented language Julia julialangorg Julia is a highlevel highperformance dynamic programming language for technical computing with syntax that is familiar to users of other technical computing environments It provides a sophisticated compiler distributed parallel execution numerical accuracy and an extensive mathematical function library JMulTi wwwjmultide JMulTi was originally designed as a tool for certain 
econometric pro cedures in time series analysis that are especially difficult to use and that are not available in other packages like Impulse Response Analysis with bootstrapped confidence intervals for VARVEC modelling Now many other features have been integrated as well to make it possi ble to convey a comprehensive analysis Comment JMulTi is a java GUI program you need a java runtime environment to make use of it As mentioned above gretl offers the facility of exporting data in the formats of both Octave and R In the case of Octave the gretl data set is saved as a single matrix X You can pull the X matrix apart if you wish once the data are loaded in Octave see the Octave manual for details As for R the exported data file preserves any time series structure that is apparent to gretl The series are saved as individual structures The data should be brought into R using the source command In addition gretl has a convenience function for moving data quickly into R Under gretls Tools menu you will find the entry Start GNU R This writes out an R version of the current gretl data set in the users gretl directory and sources it into a new R session The particular way R is invoked depends on the internal gretl variable Rcommand whose value may be set under the Tools Preferences menu The default command is RGuiexe under MS Windows Under X it is xterm e R Please note that at most three spaceseparated elements in this command string will be processed any extra elements are ignored 478 Appendix E Listing of URLs Below is a listing of the full URLs of websites mentioned in the text Estima RATS httpwwwestimacom FFTW3 httpwwwfftworg Gnome desktop homepage httpwwwgnomeorg GNU Multiple Precision GMP library httpgmpliborg CURL library httpcurlhaxxselibcurl GNU Octave homepage httpwwwoctaveorg GNU R homepage httpwwwrprojectorg GNU R manual httpcranrprojectorgdocmanualsRintropdf Gnuplot homepage httpwwwgnuplotinfo Gretl data page httpgretlsourceforgenetgretldatahtml Gretl homepage httpgretlsourceforgenet GTK homepage httpwwwgtkorg GTK port for win32 httpswikignomeorgProjectsGTKWin32 InfoZip homepage httpwwwinfoziporgpubinfozipzlib JMulTi homepage httpwwwjmultide JRSoftware httpwwwjrsoftwareorg Julia homepage httpjulialangorg Mingw gcc for win32 homepage httpwwwmingworg Minpack httpwwwnetliborgminpack Penn World Table httppwteconupennedu Readline homepage httpcnswwwcnscwrueduchetreadlinerltophtml Readline manual httpcnswwwcnscwrueduchetreadlinereadlinehtml Xmlsoft homepage httpxmlsoftorg 479 Bibliography Akaike H 1974 A new look at the statistical model identification IEEE Transactions on Auto matic Control AC19 716723 Anderson T W and C Hsiao 1981 Estimation of dynamic models with error components Jour nal of the American Statistical Association 76 598606 Andrews D W K and J C Monahan 1992 An improved heteroskedasticity and autocorrelation consistent covariance matrix estimator Econometrica 60 953966 Arellano M 2003 Panel Data Econometrics Oxford Oxford University Press Arellano M and S Bond 1991 Some tests of specification for panel data Monte carlo evidence and an application to employment equations The Review of Economic Studies 58 277297 Armesto M T K Engemann and M Owyang 2010 Forecasting with mixed frequencies Fed eral Reserve Bank of St Louis Review 926 521536 URL httpresearchstlouisfedorg publicationsreview1011Armestopdf Baiocchi G and W Distaso 2003 GRETL Econometric software for the GNU generation Journal of Applied Econometrics 18 105110 Baltagi B H 1995 Econometric Analysis of Panel Data New 
York Wiley Baltagi B H and YJ Chang 1994 Incomplete panels A comparative study of alternative esti mators for the unbalanced oneway error component regression model Journal of Econometrics 62 6789 Baltagi B H and Q Li 1990 A lagrange multiplier test for the error components model with incomplete panels Econometric Reviews 9 103107 Baltagi B H and P X Wu 1999 Unequally spaced panel data regressions with AR1 distur bances Econometric Theory 15 814823 Barrodale I and F D K Roberts 1974 Solution of an overdetermined system of equations in the ℓl norm Communications of the ACM 17 319320 Baxter M and R G King 1999 Measuring business cycles Approximate bandpass filters for economic time series The Review of Economics and Statistics 814 575593 Beck N and J N Katz 1995 What to do and not to do with timeseries crosssection data The American Political Science Review 89 634647 Bera A K C M Jarque and L F Lee 1984 Testing the normality assumption in limited depen dent variable models International Economic Review 25 563578 Berndt E B Hall R Hall and J Hausman 1974 Estimation and inference in nonlinear structural models Annals of Economic and Social Measurement 34 653665 Bhargava A L Franzini and W Narendranathan 1982 Serial correlation and the fixed effects model Review of Economic Studies 49 533549 Blundell R and S Bond 1998 Initial conditions and moment restrictions in dynamic panel data models Journal of Econometrics 87 115143 480 Bibliography 481 Bond S A Hoeffler and J Temple 2001 GMM estimation of empirical growth models Economics Papers from Economics Group Nuffield College University of Oxford No 2001W21 Boswijk H P 1995 Identifiability of cointegrated systems Tinbergen Institute Discussion Paper 9578 URL httpwwwaseuvanlppbin258fulltextpdf Boswijk H P and J A Doornik 2004 Identifying estimating and testing restricted cointegrated systems An overview Statistica Neerlandica 584 440465 Bournay J and G Laroque 1979 Réflexions sur la méthode délaboration des comptes trimestriels Annales de linséé 36 330 URL httpwwwjstorcomstable20075332 Box G E P and G Jenkins 1976 Time Series Analysis Forecasting and Control San Franciso HoldenDay Brand C and N Cassola 2004 A money demand system for euro area M3 Applied Economics 368 817838 Butterworth S 1930 On the theory of filter amplifiers Experimental Wireless The Wireless Engineer 7 536541 Byrd R H P Lu J Nocedal and C Zhu 1995 A limited memory algorithm for bound constrained optimization SIAM Journal on Scientific Computing 165 11901208 Cameron A C and D L Miller 2015 A practitioners guide to clusterrobust inference Journal of Human Resources 502 317373 Cameron A C and P K Trivedi 1986 Econometric models based on count data comparisons and applications of some estimators and tests Journal of Applied Econometrics 1 2954 1998 Regression Analysis of Count Data Cambridge Cambridge University Press 2005 Microeconometrics Methods and Applications Cambridge Cambridge University Press 2013 Regression Analysis of Count Data Cambridge University Press Caselli F G Esquivel and F Lefort 1996 Reopening the convergence debate A new look at crosscountry growth empirics Journal of Economic Growth 13 363389 Chesher A and M Irish 1987 Residual analysis in the grouped and censored normal linear model Journal of Econometrics 34 3361 Choi I 2001 Unit root tests for panel data Journal of International Money and Finance 202 249272 Cholette P A 1984 Adjusting subannual series to yearly benchmarks Survey Methodology 101 3549 URL httpswww150statcangccan1pub12001x1984001article 