Applied Econometrics Using Stata
Index
Chapter 1. First steps in Stata
Chapter 2. Least Squares
Chapter 3. Instrumental Variables
Chapter 4. Panel Data
    4.1 Static
    4.2 Dynamic
    4.3 Nonlinear Models
Chapter 5. Binary Logit/Probit
Chapter 6. Hazard Models
Chapter 7. Count-Data Models
Chapter 8. Selection Models
Chapter 9. Partially Continuous Variables
Chapter 10. Logit/Probit Models
    10.1 Multinomial
    10.2 Ordered
Chapter 11. Quantile Regression
Chapter 12. Robust Inference
    12.1 Clustered Standard Errors
    12.2 Bootstrap
    12.3 Two-Stage Models
Chapter 13. Matching
Chapter 1
First steps in Stata
Introduction
The Stata screen is divided into four parts. In "review" you can see the last commands that have been executed. In "variables" you can see all the variables in the current database. In "results" you can see the commands' output. Finally, in the "command" window you can enter the commands.
Data input and output
Stata has its own data format with default extension ".dta". Reading and saving a Stata file are straightforward. If the filename is "sales.dta" (and it is located in Stata's working directory), the commands are:
. use sales
. save sales
Notice that if you don't specify an extension, Stata will assume that it is ".dta". If the
file is not stored in the current directory (e.g. in the folder "c:\user\data"), then the
complete path must be specified:
. use c:\user\data\sales
Nevertheless, the easiest way to work is to keep all the files for a particular project in one directory, and then change Stata's "default" directory to that folder.
For instance, if you are working in the folder "c:\user\data":
. cd c:\user\data
. use sales
Insheet (importing from MS Excel)
There are two simple ways to transform an Excel database (or similar formats) into a
Stata database. For instance, please create the following table in MS Excel:
name          account   money
John Doe      1001      55
Tom Twain     1002      182
Tim Besley    1003      -10
Louis Lane    1004      23
Save it as a text file (tab delimited or comma delimited) by selecting "File" and choosing "Save As", under the name "bank.txt" in the Stata folder. Notice that by saving as "txt" (text only) you will lose all the information on formats, formulas, etc. Then import the data into Stata using the command "insheet":
. insheet using bank.txt
To check whether the variables were loaded successfully, open the browser:
. browse
Alternatively, in order to import the data you can highlight the cells under
consideration in MS Excel, and then select "Edit" and choose "Copy". With an empty
dataset (i.e. first use the command "clear"), enter in Stata:
. edit
And a spreadsheet will appear (similar to the one shown by the command "browse", but now you can modify the database). Then right-click on the editor and press "Paste".
The command "save" let us save the data in Stata format (".dta"). The option
"replace" replaces the old version if the file already exists:
. save prueba.dta, replace
The command "clear" clears the current database:
. clear
The option "clear" can be used simultaneously with the command "use" to open a new
database:
. use sales, clear
By default Stata reserves 1 MB of memory for loading the database. However, some databases demand more than 1 MB, and then (before opening the file) we need to indicate how much space is needed, using the command "set mem". For instance, if we want to allocate 5 megabytes of memory:
. set mem 5m
Preserve and Restore
As you may have noticed, there is no "Undo and Redo" in Stata. But you can use "preserve" to provisionally save a database. Then, if you want to "undo", you can execute "restore" and go back to the previous state.
. preserve
. drop _all
Oops! You dropped all the observations. However, you can go back to the preserved data:
. restore
Log-File
The log-files are useful to keep record of everything that appears in the "results"
window. A log-file records both the history of commands and the history of outputs.
. log using test, replace
The option "replace" replace the existing file. When the session is finished, you must
close the log-file:
. log close
You can open a log-file using Notepad or using the menu option "File > Log > View".
Do-File
The do-files are extremely useful in Stata. A do-file is an unformatted (ASCII) text file that contains a sequence of Stata commands. Stata interprets them exactly as if they were entered in the command window. You can thus save code lines and time whenever you want to repeat an entire piece of code with a minor modification.
Let's begin with a classic of programming: making Stata say Hello. The corresponding command is:
. display "Hello"
Then, if you want to create a do-file with the above code, simply open the do-file editor (you can even use Notepad) and enter:
. display "Hello"
Remember always to add an "Enter" at the end! Then save the file as "hello.do" in the Stata folder. If you want to execute the do-file, you must use the command "do":
. do hello
You will understand the advantages of using do-files as soon as you begin working with them for the Problem Sets. Sometimes you want other people to understand or use your piece of code, or maybe you know that you may need to access it in the future. For that reason it is very useful to include comments in the do-file describing what you are doing in each group of lines. To insert an entire line of comments you must use an asterisk at the beginning of the command line:
. * This is a comment, write whatever you want
. * And if you need a further line, just begin it with an asterisk
The "/* text */" comment delimiter has the advantage that it may be used in the
middle of a line. What appears inside /* */ is ignored (it cannot be used in the
command window, as it will only work in a do-file). The "//" comment indicator may
be used at the beginning or at the end of a line:
. describe */ text /* var1-var10
. describe // test
Exercise 1.1: After finishing this week's notes open the do-file editor and create a
do-file called "week1" to reproduce all the commands.
Help and search
As in every programming environment, the command "help" is especially important, as it gives detailed information about the command under consideration. For instance, write the following and press Enter:
. help summarize
A window with information on the command "summarize" will appear. If you want to
do something, but you do not know which the right command is, you should use
"search". For instance, if you want to find the mean of a variable, you may enter the
following:
. search mean
Then just choose the appropriate option.
Commands
As indicated by the help command, there is a generic command structure for the majority of Stata commands.
[by varlist:] command [varlist] [=exp] [if exp] [in range]
[weight] [using filename] [, options]
For any given command, some of these components may not be available. In the
help-file you may find links to information on each of the components:
[by varlist:] instructs Stata to repeat the command for each combination of values in
the list of variables varlist. For instance, "by location" would repeat the command for
the set of observations with each value for the variable "location".
[command] is the name of the command and can be abbreviated. For instance, the command "summarize" can be abbreviated as "sum" and the command "regress" can be abbreviated as "reg".
[varlist] is the list of variables to which the command applies. There are some shortcuts. For example, instead of writing "var1 var2 ... var9" you can write "var*" or "var1-var9". Alternatively, if you are interested in listing the variables from "var1john" through "var9john" you can simply use "var?john" (as in the old DOS).
[=exp] is an expression.
[if exp] restricts the command to the subset of observations that satisfies the logical expression "exp". For instance, "if height>170" restricts the command to those observations with "height" greater than 170.
[in range] restricts the command to those observations whose indices lie in a
particular range. The range of indices is specified using the syntax f/l (for first to
last) where f and/or l may be replaced by numerical values if required, so that 5/12
means fifth to twelfth and f/10 means first to tenth. Negative numbers are used
to count from the end, for example: list var in -10/l lists the last 10 observations.
[weight] allows the weighting of observations.
[using filename] specifies the filename to be used.
[options] are specific to the commands.
Version
Some commands change a lot from version to version. If you want to execute code written for a previous version (e.g. version 7.0), you can do so by using the command "version" at the beginning of the code:
. version 7.0
Data Management
Variables
In Stata there are two types of variables: string and numeric. In turn, each variable can be stored in a number of storage types (byte, int, long, float, and double for numeric variables, and str1 to str80 for string variables).
If you have a string variable and you need to generate numeric codes, you may find the command "encode" useful. For instance, consider a variable "name" that takes the value "Nick" for the observations belonging to Nick, "John" for the observations belonging to John, and so forth. Then you may find the following useful:
. encode name, gen(code)
A numeric code-variable will be generated (e.g. it takes the value 1 for the
observations belonging to Nick, the value 2 for the observations belonging to John,
and so forth).
Missing values in numeric variables are represented by dots. Some databases use other special characters for missing values, or maybe particular numbers (9, 66, 99, etc.). Missing value codes may be converted to missing values using the command "mvdecode". For instance, if the variable "gender" takes the value 9 when the value is missing, then enter:
. mvdecode gender, mv(9)
This will replace with dots all values of the variable "gender" equal to 9.
Let's work
On the website of the course (??) you will find some databases (grouped by weeks) that will be used throughout these notes. I will not mention again that you first have to download the database (and save it in the Stata folder or whatever folder you are working in).
Now we will use the database "russia.dta". It is a compilation of health, economic
and welfare variables from the Russian Longitudinal Monitoring Survey (RLMS;
information at the official website: www.epc.unc.edu/projects/rlms).
. use russia.dta, clear
Describe, list, browse and edit
In order to see basic information on the variables loaded in memory, use:
. describe
You can see the names, storage types, display formats, labels (e.g. "For how many
years" for the variable yr) and value labels (e.g. "Male" if gender==1 and "Female" if
gender==0). If you want to see information for only some variables, simply list them:
. describe totexpr gender
You can also list some observations. For instance, the following code lists the gender and total expenditure for the first 10 observations:
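. list gender totexpr in 1/10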
Extended Generate
The command "egen" (extended generate) is useful when you need to create a
variable that is the mean, meidan, standard deviations, etc. of an existing variable.
For instance, I can create a variable that takes the mean life satisfaction over the
entire sample:
. egen mean_satlif = sum(satlif)
Or I can create a variable that takes the mean life satisfaction within each geographical site:
. egen site_mean_satlif = mean(satlif), by(site)
The command "egen" has other useful functions. We can use "group" to classify the population according to the different combinations of some variables. For instance, I would like to identify individuals according to whether they smoke ("smokes") and whether they are obese ("obese"). As both categories are binary, "group" will generate four possible categories (people who smoke and are obese, people who don't smoke and are obese, people who smoke and are not obese, and people who don't smoke and are not obese):
. egen so=group(smokes obese)
Let's see the results:
. browse smokes obese so
Label values
We can attach labels to the different values of "so":
. label define solabel 1 "Smokes:NO Obese:NO" 2 "Smokes:NO Obese:YES" 3 "Smokes:YES Obese:NO" 4 "Smokes:YES Obese:YES"
. label values so solabel
You may see the results using the command "tabulate":
. tab so
If you want to modify the value labels, you need to drop the old label and create a new one:
. label drop solabel
. label define solabel 1 "Skinny and clean" 2 "Clean but fatty" 3 "Skinny smoker" 4 "Fatty smoker"
. label values so solabel
Let's see the results again:
. tab so
By and bysort
The option "by" indicates that the command must be run for many groups of
variables. Some commands use by as "by xx: command", and other command use it as
11
in "command, by xx". Before running "by" the data must be sorted by the variable
after the "by":
. sort geo
. by geo: count if gender==1
The command "count if gender==1" counts the number of observations for men. Then,
adding "by geo" make Stata count the number of men inside each geographical area
(for geo==1, geo==2 and geo==3). If you find uncomfortable to sort the data first, you
can use "bysort" directly:
. bysort geo: count if gender==1
Append
Using "append" you can add observations to database using another database (i.e. you
can "append" one to another). In the database "week1_2" there are 100 observations
on individuals from the Russian Household Survey, and in the database "week1_3"
there are 100 additional observations. You can browse both databases:
. use week1_2, clear
. browse
. use week1_3, clear
. browse
We want to "paste" one database below the other. You must open the first database:
. use week1_2, clear
And then using "append" you add the observations of the second database:
. append using week1_3
Finally, you can see the results:
. browse
Collapse
The command "collapse" generates a "smaller" database contaning the means, sums,
standard deviations, etc. of the original dataset.
. use russia.dta, clear
We can take the means of life satisfaction ("satlif") and economic satisfaction
("satecc") within geographical sites ("site"):
. collapse (mean) satlif satecc, by(site round)
Reshape
Consider the following database in "wide" format: one row per person, with the variable "money" recorded in one column per year.

name          money1990   money1991   money1992
John Doe      10          12          15
Tom Twain     .           .           14
Tim Besley    25          20          18
Louis Lane    14          14          11

Use the following code to transform the database into a "long" shape:
. reshape long money, i(name) j(year)
. browse
name          year   money
John Doe      1990   10
John Doe      1991   12
John Doe      1992   15
Louis Lane    1990   14
Louis Lane    1991   14
Louis Lane    1992   11
Tim Besley    1990   25
Tim Besley    1991   20
Tim Besley    1992   18
Tom Twain     1990   .
Tom Twain     1991   .
Tom Twain     1992   14
Exercise 1.2: Generate a do-file to carry out the following: using the dataset "russia.dta", generate an id-variable for the observations (individuals); divide the dataset into two separate databases, each one with a different set of variables; then merge those datasets back together.
Descriptive Analysis
We will now see some commands to describe the database, which will involve creating complex tables and figures. There may be many objectives in doing so. For instance, you can show the motivation of a study, a difficult and important task. In the Problem Sets we will ask you to reproduce a lot of tables and figures from published papers. It is also important to show descriptive statistics to give the reader a quantitative idea of the parameters estimated by the econometric model. Using descriptive statistics you can also check the internal consistency of the data (e.g. can an individual be 453 years old? Can an individual consume a negative number of cars?). Additionally, you can provide evidence in favor of the external validity of the model (e.g. are the individuals surveyed representative of the entire population?).
Summarize and tabstat
The command "summarize" shows basic descriptive statistics:
. summarize totexpr
The option "detail" adds different percentiles to the table:
. summarize totexpr, detail
The option "pweight" weights the observations by using the inverse of the probability
of entering the sample:
. summarize totexpr [pweight=inwgt], detail
The command "tabstat" shows specific statistics:
. tabstat hswrk, stats(mean range)
Some arguments for "stats()" are: mean, count, sum, max, min, range (max-min), sd,
var, etc. The option "by(varname)" makes "tabstat" build a table with descriptive
statistics for each value of "varname":
. tabstat totexpr, stats(mean range) by(gender)
The command "ci" makes an confidence interval for the mean of a variable at a given
statistical level of confidence:
. ci totexpr, level(95)
Scatterplot
You graph a "scatterplot" entering "graph twoway scatter" followed by the y-variable
and the x-variable, respectively:
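For instance (the choice of variables here is only illustrative):
. graph twoway scatter totexpr tincm_r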
Histogram
Use the command "hist":
. hist height, title(Heights) by(gender)
Nonetheless, histograms are very sensitive to the parameters used in their construction. Thus, you are strongly encouraged to (carefully) provide them yourself: "bin(#)" (number of bins), "width(#)" (width of bins) and "start(#)" (lower limit of the first bin). The option "discrete" tells Stata that the variable under consideration is categorical (it is very important to do so):
. hist belief, title(Trust in God) discrete
If you didn't, your graph would look like this:
. hist belief, title(Trust in God)
The option "normal" adds a normal density to the graph, and the option "freq" shows
the frequencies instead of the percentages:
. hist height, width(2) start(140) norm freq
Boxplot
. graph box height, by(site)
Pie chart
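A minimal example (the grouping variable chosen here is only illustrative):
. graph pie, over(site)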
Symmetry Plots
. symplot height
QQ Plots
. qnorm height
sum smokes
.}
Ado-files
We can make Stata execute a list of commands using do-files. For instance, we made Stata say hello by running the do-file "hello.do". Alternatively, you can create a program to run a piece of code. For instance, let's create a program called "hello" (which could be saved as "hello.ado") to make Stata say hello:
. program define hello
1. display "Hello"
2. end
Nothing happens yet, because defining a program does not execute it; we first have to call the new program:
. hello
Some Macros
The "macros" stores many useful values. For instance, the macro "_n" stores the
number of the current observation. For instance, the following line of code lists the
first nine observations:
. list totexpr if _n<10
On the other hand, the macro "_N" stores the total number of observations. For
instance, the following line of code shows the last observation within each
geographical site:
. bysort site: list totexpr if _n==_N
You can create a variable based on the macros. For instance:
. clear
. set obs 100
. generate index = _n
. browse
Generate a variable "x" with random numbers distributed uniformly between zero and
one:
. gen x=uniform()
Then, you can generate lagged values:
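For instance (the variable name "x_lag" is ours), a one-period lag of "x" can be built with the _n subscript; the first observation will be missing:
. gen x_lag = x[_n-1]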
Chapter 2
Least Squares
Random numbers are generated with the function "uniform()". Generate two variables with random draws and list them; the two columns will differ:
. clear
. set obs 10
. gen n=uniform()
. gen n1= uniform()
. list
And then repeat the process setting the same seed before each draw:
. clear
. set obs 10
. set seed 11223344
. gen n=uniform()
. set seed 11223344
. gen n1= uniform()
. list
A "fictional" example
We will create values for some variables, using known "true" values of the linear parameters involved. Then we will try to retrieve those parameters using OLS, which will let us experiment with some basic properties.
Let's generate i.i.d. data on wages, education and intelligence, plus two explanatory variables uncorrelated with education and intelligence but correlated with wages ("a" and "b"), and finally a variable ("c") totally uncorrelated with all the former variables.
. clear
. set obs 100
The variable "intelligence" will be the IQ of the individuals. IQs have approximately a normal distribution centered at 100 with a standard deviation of 20:
. gen intelligence=int(invnormal(uniform())*20+100)
Notice that we have truncated the decimal part of the numbers. Since more intelligent people are expected to study more (see the original model of Spence on the signaling purpose of education), the years of education will be equal to the intelligence (over 10) plus a normally distributed noise with mean 0 and standard deviation 2. Again, we will keep only the integer part of the numbers:
. gen education=int(intelligence/10+invnormal(uniform())*2)
I will stop repeating "enter browse to see the results"; feel free to do so whenever you want. Variable "a" ("b") will be normally distributed with mean 10 (5) and standard deviation 2 (1). Variable "c" will be normally distributed with mean 15 and standard deviation 3.
. gen a=int(invnormal(uniform())*2+10)
. gen b=int(invnormal(uniform())*1+5)
. gen c=int(invnormal(uniform())*3+15)
Finally, the unobserved error term "u" will be normally distributed with mean 7 and
standard deviation 1:
. gen u=int(invnormal(uniform())*1+7)
Wages will be the result of "intelligence" multiplied by 3, plus variables "a" and "b"
multiplied by 1 and 2 respectively, plus the unobserved error term "u":
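. gen wage=3*intelligence+a+2*b+u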
Why should OLS retrieve the "true" parameters? Recall the algebra of the OLS estimator:

\hat{\beta} = (X'X)^{-1}X'Y = (X'X)^{-1}X'(X\beta + u) = \beta + (X'X)^{-1}X'u

Taking expectations:

E[\hat{\beta}] = \beta + E[(X'X)^{-1}X'u] = \beta + E[(X'X)^{-1}X'E(u \mid X)]

which equals \beta because E(u \mid X) = 0. A similar result holds for consistency (i.e. for large samples).
Let's see that the exclusion of "a" and "b" does not violate the exogeneity condition.
Since "intelligence" is not correlated with "a" and "b", its coefficient should remain
consistent and unbiased:
. reg wage intelligence, robust
Nonetheless, including "a" and "b" should decrease the standard deviation of the
coefficient on "intelligence":
. reg wage intelligence a b, robust
Conversely, due to their independence, including "a" and "b" but excluding "intelligence" should not affect the consistency of the coefficients on the former:
. reg wage a b, robust
Finally, let's see the effect of including an "irrelevant" variable ("c") in the "right"
equation:
. reg wage intelligence a b c, robust
Compared to the "right" equation", the loss of one degree-of-freedom is irrelevant in
this case:
. reg wage intelligence a b, robust
Taking advantage of do-files and macros
We can create a do-file containing the entire previous exercise:
. clear
. set obs 100
. gen intelligence=int(invnormal(uniform())*20+100)
. gen education=int(intelligence/10+invnormal(uniform())*2)
. gen a=int(invnormal(uniform())*2+10)
. gen b=int(invnormal(uniform())*1+5)
. gen c=int(invnormal(uniform())*3+15)
. gen u=int(invnormal(uniform())*1+7)
. gen wage=3*intelligence+a+2*b+u
. reg wage intelligence a b, robust
You can also set a seed in order to keep your results reproducible.
Exercise 2.1: Repeat the do-file including minor modifications to see the following:
a. That an increase in the sample size implies a decrease in the standard errors.
b. What happens if you increase the variance of "u".
c. In the real world "u" is by definition something we cannot measure or observe. We estimate the coefficients using the fact that "u" is orthogonal to the included regressors. If we estimated "u" as the residual of the regression, we would find it exactly orthogonal to the included regressors by construction. But in this fictional world you know the true "u", so you can compute the error term in each equation and then test the orthogonality condition.
d. Include a measurement error in the "observed" intelligence (e.g. a normally
distributed noise). Then observe that the estimated coefficient is downward biased.
A "real" example
It is time to use the command "regress" with real data. We will use the database
"russia.dta" we used last week. It is a compilation of health, economic and welfare
variables from the Russian Longitudinal Monitoring Survey for 2600+ individuals in
2000. At the end of the first Problem Set you may find data definitions.
. use russia, clear
Suppose that you want to explain the Health Self-Evaluation indexes ("evalhl"; the
larger the healthier) using the following variables:
. reg evalhl monage obese smokes
Notice that Stata automatically includes the constant. If you wanted to exclude it, you would have to add the option "nocons":
. reg evalhl monage obese smokes, nocons
As with any other command in Stata, "regress" can be applied to a subset of the observations. Suppose you want to run two separate regressions, one for males and one for females:
. reg evalhl monage obese smokes if gender==1
. reg evalhl monage obese smokes if gender==0
There are many options for "regress" (enter "help regress"). Possibly the most important is the option "robust", which uses a heteroskedasticity-robust estimate of the variance-covariance matrix:
. reg evalhl monage obese smokes if gender==0 & round==9, robust
Exercise 2.2: Why is "robust" not the default option? Since homoskedasticity is just a particular case, does that seem logical to you? Experiment with some regressions: using the homoskedastic standard errors, are you over-rejecting or under-rejecting the null hypotheses?
Besides "robust", there are other options regarding the estimation of standard errors:
"cluster()" (adjusts standard errors for intra-group correlation) and "bootstrap"
(Bootstrap estimation of standard errors).
Tests
After each estimation Stata automatically provides t-tests (for linear regressions) or z-tests (for nonlinear models) of the null hypothesis that each coefficient is zero. Notwithstanding, other hypotheses on the coefficients can be tested.
For instance, using the command "test" (after the regression was run) you can test
whether the effect of being obese equals -0.05:
. reg evalhl monage obese smokes gender, robust
. test obese=-0.05
We can also test hypotheses that involve more than one coefficient. For instance, we can test whether the coefficients on "smokes" and "obese" are both zero, or whether the sum of the coefficients on "obese" and "gender" equals 1:
. test smokes obese
. test obese + gender == 1
The command "imtest" performs the White's test, and the command "hettest"
performs the Breusch-Pagan's test. In both tests the null hypothesis is whether the
variance of residuals is homogeneous.
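A minimal sketch of how they can be run after a (non-robust) regression; the specification below is only for illustration:
. reg evalhl monage obese smokes gender
. hettest
. imtest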
Nevertheless, you should be very careful. These tests are pretty sensitive to the
assumptions of the model (for instance, they suppose normality for the error term).
We can also obtain the least squares estimates "by hand" using Stata's matrix commands. The OLS estimator solves the sample moment condition stating that the residuals are orthogonal to the regressors:

\frac{1}{N}\sum_i \hat{u}_i x_i = 0, \qquad \text{where } y_i = x_i'\beta + u_i \text{ and } \hat{u}_i = y_i - x_i'\hat{\beta}

Substituting the residuals:

\frac{1}{N}\sum_i (y_i - x_i'\hat{\beta})x_i = 0 \;\Rightarrow\; \sum_i (x_i y_i - x_i x_i'\hat{\beta}) = 0

And finally you get:

\hat{\beta} = \Big(\sum_i x_i x_i'\Big)^{-1}\sum_i x_i y_i = (X'X)^{-1}X'Y
Since Stata does not allow more than 800 rows or columns in a matrix, it would be impossible to work directly with X or Y (as they have 2600+ rows). But there is a rather simple trick: the command "mat accum". It executes the intermediate step X'X, which is a "small" matrix. First we have to eliminate the observations with missing values in any of the variables that will be included in the model (why?):
. drop if evalhl==. | monage==. | obese==. | smokes==.
Then we run the regression in the "traditional" way:
. reg evalhl monage obese smokes
Let's begin with calculating X'X and storing it in the matrix "XpX":
. mat accum XpX = monage obese smokes
You can see the result entering:
. mat list XpX
Calculate X'Y and store it in the matrix "XpY":
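One possible way (a sketch; matrix names other than "XpX" are ours) is "mat vecaccum", which computes Y'X, followed by a transpose:
. mat vecaccum YpX = evalhl monage obese smokes
. mat XpY = YpX'
. mat list XpY
Finally, the OLS coefficients can be recovered as (X'X)^(-1)X'Y and compared with the "regress" output:
. mat beta = invsym(XpX)*XpY
. mat list beta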
Regressions' output
You will often need to present regressions in tables similar to those used in most economics papers (e.g. one column per specification, and standard errors in parentheses). Use the command "outreg". As it is not a built-in Stata command, you need to install it first:
. search outreg
Select "sg97.3" and then press "click here to install".
We run a regression:
. reg evalhl monage obese smokes satlif totexpr
And then we save the output in the file "regresion.out":
. outreg using regresion, replace
The option "replace" indicates to Stata that if the file "regresion.out" exists, then it
must be replaced. Go to the Stata folder and open "regresion.out" (using MS Excel or
MS Word).
By default "outreg" shown in parentheses the t-values and puts an asterisk if the
coefficient is significative at the 5%, and two asterisks if it is significative at the 1%.
We can ask for the standard errors in parenthesis, and one asterisk if the coefficient
is significative at the 10%, two asterisks if it is significative at the 5%, and three
asterisks if it is significative at the 1% (options "se" and "3aster" respectively):
. outreg using regresion, se 3aster replace
The command can also show several regressions in the same table (as columns). We must add regression outputs using the option "append" instead of "replace". Let's run three regressions: one with only "monage" and "obese" as explanatory variables; another with only "smokes", "satlif" and "totexpr"; and finally a regression with all of them.
. reg evalhl monage obese
. outreg using regresion, se 3aster replace
. reg evalhl smokes satlif totexpr
. outreg using regresion, se 3aster append
. reg evalhl monage obese smokes satlif totexpr
. outreg using regresion, se 3aster append
Open the file "regresion.out" using MS Excel and see the results. If you do not want to show the coefficients on a set of dummy variables, you will find the command "areg" very useful.
Chapter 3
Instrumental Variables
There we can see that taller people and people with larger waists and hips are
probably more obese.
Hausman Test
Notice that you cannot test the instruments' exogeneity (because it is an assumption!). But given that the instruments are valid, you can test the instrumented variables' exogeneity.
You must save the IV estimates:
. ivreg evalhl alclmo cmedin belief operat (obese =height hipsiz waistc) smokes
totexpr monage if gender==0, robust
. est store iv
You must also save the OLS estimates:
. reg evalhl alclmo cmedin belief operat obese smokes totexpr monage if gender==0,
robust
. est store ols
Then use the command "hausman" indicating first the consistent estimates and then
the efficient estimates:
. hausman iv ols
The null hypothesis is that there is no systematic difference between the estimates.
Exercise 3.1: If you reject the null, which regression should you run? Why? What if
you do not reject the null hypothesis? If I already knew for sure that the instruments
are valid, why should I care for this?
There is an alternative way to run the Hausman test. You must run the first stage and then save the residuals:
. reg obese alclmo cmedin belief operat smokes totexpr monage height hipsiz waistc
if gender==0, robust
. predict res, residual
Then run the original least squares regression, but including the residuals:
. reg evalhl alclmo cmedin belief operat smokes totexpr monage obese res if
gender==0, robust
First notice that the coefficients (though not the standard errors) are the same as those obtained using "ivreg":
. ivreg evalhl alclmo cmedin belief operat (obese =height hipsiz waistc) smokes
totexpr monage if gender==0, robust
Finally, the Hausman test consists in testing whether the coefficient on the residuals ("res") is zero: if the null is rejected, then (given that the instruments are valid) "obese" is endogenous.
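To test the overidentifying restrictions we need the residuals from the IV regression ("resid1" below); a sketch of how they can be generated, re-running the "ivreg" from above:
. quietly ivreg evalhl alclmo cmedin belief operat (obese = height hipsiz waistc) smokes totexpr monage if gender==0, robust
. predict resid1, residuals
Then regress those residuals on all the exogenous variables and instruments, and test the instruments jointly: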
. reg resid1 alclmo cmedin belief operat height hipsiz waistc smokes totexpr monage,
robust
. test height==hipsiz==waistc==0
The overidentifying restrictions test statistic is J = nF. Under the null hypothesis that all instruments are exogenous, J is distributed \chi^2_{m-k}, where m - k is the number of instruments minus the number of endogenous variables:
. quietly: ereturn list
. quietly: return list
. display chi2(2,e(N)*r(F))
Weak Instruments
If the correlation between the instruments and the endogenous variables is poor, then the asymptotic distribution of the IV estimator is anything but normal. A widely used "rule of thumb" to detect a potential weak-instruments problem was proposed by Stock and Watson ("Introduction to Econometrics", Chapter 10): in the first-stage regression, if the F-statistic for testing whether the coefficients on all the instruments are jointly zero is less than 10, then you should worry about weak instruments.
Let's estimate the first stage and perform the test:
. reg obese alclmo cmedin belief operat smokes totexpr monage height hipsiz waistc
if gender==0, robust
. test height==hipsiz==waistc==0
Ivreg2
You can download the command "ivreg2" (along with the associated commands "overid", "ivendog" and "ivhettest"), which provides extensions to Stata's official "ivreg". It supports the same command syntax as official "ivreg" and (almost) all of its options. Among the improvements you may find some very useful features, such as the Kleibergen-Paap and Cragg-Donald tests for weak instruments.
The post-estimation command "overid" computes versions of Sargan's (1958) and
Basmann's (1960) tests of overidentifying restrictions for an overidentified equation.
You may find details in the help file. The Durbin-Wu-Hausman test for endogeneity
("ivendog") is numerically equivalent to the standard Hausman test obtained using
"hausman" with the "sigmamore" option.
Chapter 4
Panel Data
Hausman test
There is a test that is useful for choosing between the fixed effects estimates, which are consistent even if the individual effects are correlated with the included variables, and the random effects estimates, which are consistent and efficient if the individual effects are uncorrelated with the included variables, but inconsistent otherwise.
First estimate your model using fixed effects and save the estimates:
. xtreg cartheft instp inst1p month5-month12, fe i(blockid) robust
. est store fixed
Then run the random effects estimation:
. xtreg cartheft instp inst1p month5-month12, re i(blockid) robust
Finally, enter:
. hausman fixed
The null hypothesis is that the differences in the coefficients are not systematic. As we cannot reject that hypothesis, we can use the random effects estimates to gain efficiency.
Fixed effects estimates "by hand"
There are many ways to obtain the fixed effects estimates. For instance, we could have included a set of dummy variables identifying each unit, which is known as the LSDV (Least Squares Dummy Variables) estimator. You can use the command "areg":
. areg cartheft instp inst1p month5-month12, absorb(blockid) robust
The option "absorb" generates a dummy variable for each value of "blockid" (i.e. a dummy for each block), but it does not show their estimates (as that would involve estimating and showing 876 coefficients that are not interesting by themselves). Notice that the estimates are the same as those obtained using "xtreg".
Exercise 4.2: A less-known way to obtain fixed effects estimates is running a
regression for each unit, and then averaging all the estimates obtained. Carry out
such estimation.
Within Transformation
Consider the following model:

y_{it} = x_{it}'\beta + \alpha_i + \varepsilon_{it}

It cannot be consistently estimated by OLS when \alpha_i is correlated with x_{it}. The within transformation takes out the individual means in order to obtain:

(y_{it} - \bar{y}_i) = (x_{it} - \bar{x}_i)'\beta + (\varepsilon_{it} - \bar{\varepsilon}_i)

where the over-bar denotes the mean for a given i, e.g. \bar{x}_i = \frac{1}{T}\sum_{t=1}^{T} x_{it}. This transformation eliminates \alpha_i and thus makes OLS consistent. Let's replicate it. First generate the deviations from the means using the command "egen". For instance, for the dependent variable you should enter:
. bysort blockid: egen mean_cartheft= mean(cartheft)
. gen dm_cartheft= cartheft - mean_cartheft
And for each independent variable X you should enter:
. bysort blockid: egen mean_X= mean(X)
. gen dm_X= X - mean_X
First we should eliminate observations with missing values in any of the variables
included in the model. However, there are no missing values in this database. In
order to save code lines and time, we will use the "foreach" syntax:
. foreach var of varlist cartheft instp inst1p month5-month12 {
. bysort blockid: egen mean_`var'=mean(`var')
. gen dm_`var'= `var' - mean_`var'
.}
Then we run the regression:
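A sketch of the demeaned regression (the wildcard "dm_month*" picks up the demeaned month dummies); the point estimates should match the fixed effects ones, although the standard errors differ slightly due to the degrees of freedom:
. reg dm_cartheft dm_instp dm_inst1p dm_month*, robust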
Dynamic panel data
Consider now a dynamic model:

y_{i,t} = \gamma y_{i,t-1} + \alpha_i + \varepsilon_{i,t}

with |\gamma| < 1. We have observations on individuals i = 1, \ldots, N for periods t = 1, \ldots, T.
The within estimator of \gamma is:

\hat{\gamma}_{FE} = \frac{\sum_{i=1}^{N}\sum_{t=1}^{T} (y_{i,t} - \bar{y}_i)(y_{i,t-1} - \bar{y}_{i,-1})}{\sum_{i=1}^{N}\sum_{t=1}^{T} (y_{i,t-1} - \bar{y}_{i,-1})^2}

with \bar{y}_i = \frac{1}{T}\sum_{t=1}^{T} y_{i,t} and \bar{y}_{i,-1} = \frac{1}{T}\sum_{t=1}^{T} y_{i,t-1}, so that:

\hat{\gamma}_{FE} = \gamma + \frac{(1/NT)\sum_{i=1}^{N}\sum_{t=1}^{T} (\varepsilon_{i,t} - \bar{\varepsilon}_i)(y_{i,t-1} - \bar{y}_{i,-1})}{(1/NT)\sum_{i=1}^{N}\sum_{t=1}^{T} (y_{i,t-1} - \bar{y}_{i,-1})^2}
This estimator is biased and inconsistent for N \to \infty and fixed T. It can be shown that (Hsiao, 2003, Section 4.2):
\text{plim}_{N \to \infty}\; \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T} (\varepsilon_{i,t} - \bar{\varepsilon}_i)(y_{i,t-1} - \bar{y}_{i,-1}) = -\frac{\sigma_{\varepsilon}^2}{T^2}\cdot\frac{(T-1) - T\gamma + \gamma^T}{(1-\gamma)^2} \neq 0
Thus, for fixed T we have an inconsistent estimator. One way to solve this
inconsistency problem was proposed by Anderson and Hsiao (1981). Take first
differences:
y_{i,t} - y_{i,t-1} = \gamma(y_{i,t-1} - y_{i,t-2}) + (\varepsilon_{i,t} - \varepsilon_{i,t-1})

If we estimated the above model by OLS we would obtain inconsistent estimates, since \Delta y_{i,t-1} is correlated with \Delta\varepsilon_{i,t} (both contain \varepsilon_{i,t-1}). Anderson and Hsiao suggested using y_{i,t-2}, or alternatively (y_{i,t-2} - y_{i,t-3}), as an instrument for \Delta y_{i,t-1}. Furthermore, Arellano and Bond (1991) suggested that the list of instruments can be extended by exploiting additional moment conditions, and then using a GMM framework to obtain the estimates.
Exercise 4.5: What do you think about the orthogonality condition between y_{i,t-2} and (\varepsilon_{i,t} - \varepsilon_{i,t-1})? (Hint: using the same argument, one could argue for using y_{i,t-1} as an instrument for y_{i,t} in non-dynamic models.)
Xtabond
The Arellano-Bond estimator is obtained with the command "xtabond".1 Nevertheless,
we need first to declare the dataset to be panel data using "tsset":
. tsset panelvar timevar
This declaration is also needed for other commands, such as "stcox" and "streg" for
duration models. Once you have done such declaration, you can refer to lagged
values using "L.variable" and first difference using "D.variable". You can also utilize
the commands "xtsum" and "xttab", similar to "summarize" and "tabulate" but
designed specifically for panel data.
Then you can implement the Arellano and Bond estimator ("xtabond"), which uses
moment conditions in which lags of the dependent variable and first differences of
the exogenous variables are instruments for the first-differenced equation. The
syntax is:
. xtabond depvar indepvar, lags(#)
1 This command was updated in May 2004; you should enter "update all" to install the latest version.
The option lags(#) indicates the number of lags of the dependent variable to be
included in the model (the default being 1). The option "robust" specifies that the
Huber/White/sandwich estimator of variance must be used in place of the traditional
calculation. The option "twostep" specifies that the two-step estimator has to be
calculated.
For instance, consider a dynamic version of Di Tella et al. (2004):
. tsset blockid month
. xtabond cartheft instp month5-month12, lags(1)
The moment conditions of these GMM estimators are valid only if there is no serial correlation in the idiosyncratic errors. Because the first difference of white noise is necessarily autocorrelated, we need to focus on second- and higher-order autocorrelation. You can see the error-autocorrelation tests at the end of the table generated by "xtabond". In this case we reject the possibility of second-order autocorrelation.
Notice that Stata also calculates the Sargan test of over-identifying restrictions.
We can see that the coefficient on "instp" is no longer statistically different from zero. There are two probable reasons. The first has to do with the exogeneity assumption. In the difference-in-differences estimator we were pretty convinced about the internal validity of the model. However, the Arellano-Bond estimator involves a large set of moment conditions, and we do not have serious evidence to think that they are valid. Then, the Arellano-Bond estimate for "instp" may be inconsistent.
On the other hand, there is a simple loss of power related to taking first differences instead of using the within transformation (because we are employing less information). Indeed, if we had used first differences in the original model we would have found a coefficient on "instp" not statistically different from zero:
. reg D.cartheft D.instp D.month5 D.month6 D.month7 D.month8 D.month9
D.month10 D.month11 D.month12, robust
Monte Carlo Experiment
We will implement a Monte Carlo experiment to evaluate the seriousness of the bias in "short" dynamic panels. Consider the Data Generating Process of the following growth model:

GDP_{i,t} = \gamma\, GDP_{i,t-1} + \beta_1 inst_{i,t} + \beta_2 entr_{i,t} + f_i + \varepsilon_{i,t}

where the f_i's are the country fixed effects and \varepsilon_{i,t} is the error term. The information is collected for countries i = 1, \ldots, N during periods t = 1, \ldots, T.
Suppose N=50 and T=5. Generate variables "year" and "country":
. clear
. set obs 250
. gen year=mod(_n-1,5)+1
. gen country=ceil(_n/5)
See the results using "browse". Let's generate data on "inst" and "entr":
. gen inst = 0.2 + year*0.1 + uniform()/2
. gen entr = 0.4 + year*0.08 + uniform()/3
Generate the fixed effects as correlated with both "inst" and "entr":
. gen fe_aux=(inst + entr + uniform())/3 if year==1
. bysort country: egen fe = mean(fe_aux)
. drop fe_aux
Finally, generate data on "GDP":
. gen GDP = 0.2*inst+0.3*entr+fe+uniform()
. bysort country: replace GDP = GDP+ 0.3*GDP[_n-1] if _n>1
. drop fe
See the results for the first 6 countries:
. graph twoway line GDP inst entr year if country<7, by(country)
Before continuing, declare the panel structure with "tsset":
. tsset country year
We know that the "true" values of the parameters are \gamma = 0.3, \beta_1 = 0.2 and \beta_2 = 0.3. Run a regression including both fixed effects and the lagged dependent variable; as shown above, it is inconsistent for fixed T:
. xtreg GDP L.GDP inst entr, fe i(country) robust
Although considerably less biased than the previous estimates, they are still
inconsistent. Finally, run the Arellano-Bond estimator:
. xtabond GDP inst entr, robust
The bias is even smaller. We can repeat the whole experiment using different values for N and T. In particular, the do-file below keeps N fixed at 50 and varies T = 3, 5, 7, ..., 25, showing the estimates of the coefficient on the lagged dependent variable and their standard errors. Run the following do-file:
. * Generate data for T=25 and N=50
. clear
. set mat 800
. set obs 1250 // 25*50
. gen year=mod(_n-1,25)+1
. gen country=ceil(_n/25)
. gen inst = 0.2 + year*0.05 + uniform()/2
. gen entr = 0.35 + year*0.08 + uniform()/3
. gen fe_aux=(inst + entr + uniform())/3 if year==1
. bysort country: egen fe = mean(fe_aux)
. drop fe_aux
. gen GDP = 0.2*inst+0.3*entr+fe+uniform()
. bysort country: replace GDP = GDP+ 0.5*GDP[_n-1] if _n>1
. drop fe
. tsset country year
. * Repeat the experiment for different T's
. forvalues T=3(2)25 {
. preserve
. quietly: keep if year<=`T'
. display "T="`T'
. quietly: xtreg GDP L.GDP inst entr, fe i(country) robust
. quietly: ereturn list
. matrix b = e(b)
. matrix se = e(V)
. display "XTREG: gamma=" b[1,1] " (" se[1,1] ")"
. quietly: xtabond GDP inst entr, robust
. quietly: ereturn list
. matrix b = e(b)
. matrix se = e(V)
. display "XTABOND: gamma=" b[1,1] " (" se[1,1] ")"
. restore
.}
Run the do-file a couple of times. You will notice that for T=3 "xtabond" may yield estimates farther from the true value than the original "xtreg". For T>5 the coefficient yielded by "xtabond" is always closer to the true value (0.5) than that yielded by "xtreg". In these simulations the bias becomes relatively insignificant for T>21.
Exercise 4.6: Repeat the experiment changing the GDP error term (e.g. "uniform()*1.5" instead of "uniform()"). Recalling the formula for the inconsistency (Hsiao, 2003), comment on the results.
Exercise 4.7: Repeat the last experiment including second-order autocorrelation in the error term. Comment on the results.
Xtabond2
If you are using Stata 7 to 9.1, you can install "xtabond2". You might not find the command using "search xtabond2"; in that case, to download it type the following command: "ssc install xtabond2, all replace". If that does not work, then download the ado- and help-files from "http://ideas.repec.org/c/boc/bocode/s435901.html", and copy them into "StataFolder\ado\base\?\" (where "?" must be replaced by the first letter of each file). This procedure is particularly useful on computers with restricted access to the Web.
The command "xtabond2" can fit two closely related dynamic panel data models. The
first is the Arellano-Bond (1991) estimator, as in "xtabond", but using a two-step
finite-sample correction. The second is an augmented version outlined in Arellano
and Bover (1995) and fully developed in Blundell and Bond (1998).
A problem with the original Arellano-Bond estimator is that lagged levels are often poor instruments for first differences, especially for variables that are close to a random walk. Arellano and Bover (1995) described how, if the original equations in levels were added to the system, additional moment conditions could be brought to bear to increase efficiency.
The fixed effects (conditional) logit avoids estimating the individual effects \alpha_i by conditioning on the number of ones observed for each individual. With T = 2, condition on y_{i1} + y_{i2} = 1:

P\{y_i = (0,1) \mid y_{i1}+y_{i2}=1, \alpha_i, \beta\} = \frac{P\{y_i = (0,1) \mid \alpha_i, \beta\}}{P\{y_i = (0,1) \mid \alpha_i, \beta\} + P\{y_i = (1,0) \mid \alpha_i, \beta\}}

Use that:

P\{y_i = (0,1) \mid \alpha_i, \beta\} = P\{y_{i1} = 0 \mid \alpha_i, \beta\}\, P\{y_{i2} = 1 \mid \alpha_i, \beta\}

And:

P\{y_{i2} = 1 \mid \alpha_i, \beta\} = \frac{\exp\{\alpha_i + x_{i2}'\beta\}}{1 + \exp\{\alpha_i + x_{i2}'\beta\}}, \qquad P\{y_{i1} = 0 \mid \alpha_i, \beta\} = 1 - \frac{\exp\{\alpha_i + x_{i1}'\beta\}}{1 + \exp\{\alpha_i + x_{i1}'\beta\}}

Then, after some algebra steps detailed at the end of this chapter, it follows that the conditional probability is given by:

P\{y_i = (0,1) \mid y_{i1}+y_{i2}=1, \alpha_i, \beta\} = \frac{\exp\{(x_{i2}-x_{i1})'\beta\}}{1 + \exp\{(x_{i2}-x_{i1})'\beta\}}

which does not depend on \alpha_i. This looks exactly like a simple logit regression, where x_i^* = x_{i2} - x_{i1} is in fact the first difference of x_i. In an analogous way you may obtain P\{y_i = (1,0) \mid y_{i1}+y_{i2}=1, \alpha_i, \beta\}.
In summary, the estimator consists in the following: keep only the observations with (y_{i1}, y_{i2}) equal to (0,1) or (1,0). Then generate a dependent variable taking the value 1 for positive changes (0,1) and the value 0 for negative changes (1,0). Finally, regress (by an ordinary logit) the transformed dependent variable on the first differences of the regressors (x_i^* = x_{i2} - x_{i1}). You can obtain a similar result for T > 2.
An example
For this exercise we will use a small panel (T=3) extracted from a Russian database.
. use russia1, clear
The command "xtlogit" fits random-effects, conditional fixed-effects, and population-averaged logit models.
Consider the random effects model with a latent variable:

y_{it}^* = x_{it}'\beta + u_{it}

where u_{it} has mean zero and unit variance, and can be further decomposed as u_{it} = \varepsilon_{it} + \alpha_i. Additionally:

y_{it} = 1 \text{ if } y_{it}^* > 0, \qquad y_{it} = 0 \text{ if } y_{it}^* \leq 0

so that P\{y_{it} = 1 \mid x_{it}\} = \int_{-x_{it}'\beta}^{\infty} f(u)\, du.
In the T-period case, integrating over the appropriate intervals would involve T integrals (which in practice is done numerically). When T \geq 4, maximum likelihood estimation becomes infeasible. This curse of dimensionality may be avoided by using simulation-based estimators (Verbeek, 2004, Chapter 10.7.3).
If the u_{it} could be assumed independent over time (conditional on \alpha_i), the likelihood contribution of individual i would be:

\int \Big[\prod_t f(y_{it} \mid x_{it}, \alpha_i, \beta)\Big] f(\alpha_i)\, d\alpha_i
Consider now a censored (tobit) model with random effects:

y_{it}^* = x_{it}'\beta + \alpha_i + \varepsilon_{it}

where \alpha_i and \varepsilon_{it} are i.i.d. normally distributed, independent of (x_{i1}, \ldots, x_{iT}), with zero means and variances \sigma_{\alpha}^2 and \sigma_{\varepsilon}^2. As in the original model:

y_{it} = y_{it}^* \text{ if } y_{it}^* > 0, \qquad y_{it} = 0 \text{ if } y_{it}^* \leq 0

As in the random effects probit model, the likelihood contribution of individual i will be the following:

\int \Big[\prod_t f(y_{it} \mid x_{it}, \alpha_i, \beta)\Big] f(\alpha_i)\, d\alpha_i

We can obtain the desired model simply by replacing f(\cdot) by the normal density function and integrating over \alpha_i numerically.
An example
Use the dataset "hinc.dta", which is a quasi-generated sample of individuals based on
a British household panel:
. clear
. set mem 50m
. use hinc
See the data definitions:
. describe
Estimate the model using a pooled Tobit:
. tobit hincome educ exper expersq married widowed divorcied, ll(0)
And now estimate the model using a Tobit with random effects:
. xttobit hincome educ exper expersq married widowed divorcied, i(pid) ll(0)
You can use almost all the features available for the pooled Tobit model. You will be
required to do so in the Problem Set.
Algebra steps for the conditional logit probability referred to above:

P\{y_i = (0,1) \mid y_{i1}+y_{i2}=1, \alpha_i, \beta\}
= \frac{\dfrac{\exp\{\alpha_i + x_{i2}'\beta\}}{1+\exp\{\alpha_i + x_{i2}'\beta\}}\cdot\dfrac{1}{1+\exp\{\alpha_i + x_{i1}'\beta\}}}{\dfrac{\exp\{\alpha_i + x_{i2}'\beta\}}{1+\exp\{\alpha_i + x_{i2}'\beta\}}\cdot\dfrac{1}{1+\exp\{\alpha_i + x_{i1}'\beta\}} + \dfrac{\exp\{\alpha_i + x_{i1}'\beta\}}{1+\exp\{\alpha_i + x_{i1}'\beta\}}\cdot\dfrac{1}{1+\exp\{\alpha_i + x_{i2}'\beta\}}}
= \frac{\exp\{\alpha_i + x_{i2}'\beta\}}{\exp\{\alpha_i + x_{i2}'\beta\} + \exp\{\alpha_i + x_{i1}'\beta\}}
= \frac{\exp\{x_{i2}'\beta - x_{i1}'\beta\}}{\exp\{x_{i2}'\beta - x_{i1}'\beta\} + 1}
= \frac{\exp\{(x_{i2}-x_{i1})'\beta\}}{1 + \exp\{(x_{i2}-x_{i1})'\beta\}}
Chapter 5
Binary Logit/Probit
Binary regression
When you try to estimate a model with a binary dependent variable by OLS, you may encounter some of the following problems: i. the marginal effects are constant (linear in the parameters); ii. there is heteroskedasticity (though it would be enough to use robust standard errors); iii. the model might predict values below zero and above one (though if you used a set of dummies as regressors that problem would be solved).
In summary, the main concern is that in some economic models (i.e. from the a priori analysis) you may expect to find nonlinearities. For instance, consider a model for the probability of passing an exam. For the best-prepared students (those with high probabilities of passing), an additional hour of study will have virtually no effect on their probability of passing. The same is valid for the students that have not paid any attention since the first day of class: the impact of an additional hour of study on their probability of passing is almost zero. However, consider a student right in the middle. As she is on the borderline, she might pass the exam just because of what she studied in the last hour: the marginal effect of an additional hour of study on her probability of passing will be considerable. We will therefore introduce a model that captures exactly that kind of nonlinearity. Consider a vector x of explanatory variables and a vector \beta of parameters. The probability (p) that an event y happens is:
p = F(x'\beta)

where F(\cdot) has the following properties:

F(-\infty) = 0, \quad F(+\infty) = 1, \quad f(x) = dF(x)/dx > 0

For instance, in the probit model F(\cdot) is the cumulative distribution function of a standard normal:

F(x'\beta) = \int_{-\infty}^{x'\beta} \frac{1}{\sqrt{2\pi}}\, e^{-s^2/2}\, ds

while in the logit model it is the logistic function:

F(x'\beta) = \frac{e^{x'\beta}}{1 + e^{x'\beta}}
Notice that the marginal effects of the explanatory variables are not linear:

\frac{\partial p}{\partial x_k} = \beta_k f(x_i'\beta)

The sign of the derivative is the same as the sign of the coefficient \beta_k:

\mathrm{sgn}(\partial p / \partial x_k) = \mathrm{sgn}(\beta_k)

However, the value of the coefficient lacks a direct quantitative interpretation. The parameter \beta_k is multiplied by f(x_i'\beta), which is maximal when x_i'\beta = 0 and decreases (in absolute value) as x_i'\beta goes towards \pm\infty (indeed, the difference between the logit and probit models is the weight of their tails). We will discuss thoroughly how to present marginal effects.
The Maximum-Likelihood estimator
We have an i.i.d. sample of binary events and explanatory variables (y_i, x_i), for i = 1, \ldots, n. The random variable y_i follows a Bernoulli distribution with p_i = P(y_i = 1). The likelihood function is then:

L(\beta) = \prod_{y_i=1} p_i \prod_{y_i=0} (1-p_i) = \prod_{i=1}^{n} p_i^{y_i}(1-p_i)^{1-y_i}

and the log-likelihood:

l(\beta) = \sum_{i=1}^{n} \big[ y_i \ln(p_i) + (1-y_i)\ln(1-p_i) \big]

The first order conditions are:

\sum_{i=1}^{n} \frac{\big(y_i - F(x_i'\beta)\big) f(x_i'\beta)\, x_{ki}}{F(x_i'\beta)\big(1-F(x_i'\beta)\big)} = 0, \qquad k = 1, \ldots, K
To estimate the model by maximum likelihood "by hand" we can use the command "ml". First define a program that evaluates the log-likelihood (named "probit_ml", as it is called below):
. program define probit_ml
1. version 1.0
2. args lnf XB
3. quietly replace `lnf' = $ML_y1*ln(norm(`XB')) + (1-$ML_y1)*ln(1-norm(`XB'))
4. end
Where "args lnf XB" indicates the value of the likelihood function ("lnf") and xi
("XB"). The expression "$ML_y1" is the convention for yi . The program you write is
written in the style required by the method you choose. The methods are "lf", "d0",
"d1", and "d2". See "help mlmethod" for further information. However, some global
macros are used by all evaluators. For instance, "$ML_y1" (for the first dependent
variable), and "$ML_samp" contains 1 if observation is to be used, and 0 otherwise.
Then, the expression "`lnf' = $ML_y1*ln(norm(`XB'))+(1-$ML_y1)* ln(1-norm(`XB'))" is
exactly the formulae of the log-likelihood function:
l ( ) = yi ln( F ( xi )) + (1 yi ) ln(1 F ( xi ))
Notice that we used the normal accumulated distribution, and then we are
estimating a probit model. Now we can use the command "ml":
. ml model lf probit_ml (smokes = gender monage highsc belief obese alclmo hattac
cmedin totexpr tincm_r work0 marsta1 marsta2)
Once you have defined the maximum likelihood problem, you can verify that the log-likelihood evaluator you have written seems to work (strongly recommended if you are facing a problem not covered by any standard command):
. ml check
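Then, to obtain the estimates, maximize the likelihood (the standard next step when using "ml"):
. ml maximize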
The option "technique()" chooses the maximization algorithm: "technique(bhhh)" specifies the Berndt-Hall-Hall-Hausman (BHHH) algorithm, and "technique(dfp)" specifies the Davidon-Fletcher-Powell (DFP) algorithm.
(The coefficient on "gender" captures the association between smoking and being male, and it is statistically different from zero at the 1% level of confidence.)
The binary logit and probit models are particular cases of more general models. For instance, we have the multinomial logit ("mlogit"), nested logit ("nlogit"), mixed logit ("mixlogit"), ordered logit ("ologit"), and so forth. In the last week we will explain those models. The literature is known as "discrete choice", and it is particularly useful for modeling consumer behavior. At the end of this section we will develop a logit model with fixed effects ("xtlogit").
Marginal Effects
As stated previously, the marginal effects depend on the value of x. You may choose the mean values of every element of x and evaluate the marginal effects there:

\left.\frac{\partial p}{\partial x_k}\right|_{x=\bar{x}} = \beta_k f(\bar{x}'\beta)

These are the "marginal effects at the mean". You can compute them using the command "dprobit" for the probit model:
. dprobit smokes gender monage highsc belief obese alclmo hattac cmedin totexpr
tincm_r work0 marsta1 marsta2, robust
At the mean values, the marginal effect of changing the value of obese from 0 to 1
on the probability of smoking is -14 percentage points. You can evaluate the marginal
effects at a point different than the mean values:
. matrix input x_values = (1,240,1,3,0,1,0,1,10000,10000,1,1,0)
. dprobit smokes gender monage highsc belief obese alclmo hattac cmedin totexpr
tincm_r work0 marsta1 marsta2, robust at(x_values)
You can also use the command "mfx":
. probit smokes gender monage highsc belief obese alclmo hattac cmedin totexpr
tincm_r work0 marsta1 marsta2, robust
. mfx
For the logit model you may only use the command "mfx":
. logit smokes gender monage highsc belief obese alclmo hattac cmedin totexpr
tincm_r work0 marsta1 marsta2, robust
. mfx
Alternatively, you can compute the marginal effect at every observation and then average them over the sample:

\frac{1}{n}\sum_{i=1}^{n}\left.\frac{\partial p}{\partial x_k}\right|_{x=x_i} = \frac{1}{n}\sum_{i=1}^{n}\beta_k f(x_i'\beta)
You can use the command "margeff", which you have to install in advance ("search
margeff"):
. probit smokes gender monage highsc belief obese alclmo hattac cmedin totexpr
tincm_r work0 marsta1 marsta2, robust
. margeff
It works with both logit and probit models. Moreover, you can calculate these marginal effects "by hand". Estimate the model, predict each x_i'\hat{\beta} and store them in the variable "xb":
. probit smokes gender monage highsc belief obese alclmo hattac cmedin totexpr
tincm_r work0 marsta1 marsta2, robust
. predict xb, xb
We need to retrieve the coefficient on the variable under consideration (for instance, on "monage", the second variable in the list):
. ereturn list
. matrix coef=e(b)
And we can calculate the marginal effect at every observation (to be stored in
variable "me"):
. gen me=coef[1,2]*normden(xb)
Now we can calculate the mean of the marginal effects, or even their median:
. tabstat me, stats(mean median)
Alternatively, we can estimate nonparametrically the density distribution of the
marginal effects over the sample (using the command "kdensity", for obtaining a
Kernel estimator):
. kdensity me
There is no "right" or "wrong" way to evaluate the marginal effects. The important
thing is to give the right econometric interpretation for the strategy chosen.
Goodness of Fit
There is a simple way to imitate the R2 from OLS, called the pseudo-R2:

LR = 1 - \frac{\ln L}{\ln L_0}

where \ln L is the maximum value of the log-likelihood function under the full specification, and \ln L_0 is the maximum value of the log-likelihood function in a model with only a constant. Then LR measures the increase in explanatory power from considering a model beyond a simple constant.
As you can see in Menard (2000), there are some "desirable" properties for a goodness-of-fit index that the R2 satisfies. There is no equally good index for the logit/probit model, but you have a relatively wide set of options. Once again, there is no "right" or "wrong" way to measure the goodness of fit. However, you must think carefully about the econometric meaning of your choice.
Another strategy is to generate a "percentage of right predictions". As you probably
noticed, the predictions are in the (0,1) open interval, and then you cannot compare
them with the actual outcomes (either zero or one). But you can use cutoff points (c)
to transform the (0,1) predictions into either 0 or 1:

ŷi ≡ 1[p̂i > c]

yiᶜ ≡ 1[ŷi = yi]

H = (1/n) Σi yiᶜ
Where H is the "percentage of right predictions". The index is pretty sensitive to the
choice of c. Furthermore, a trivial model has H 1 2 . For instance, consider a
model explaining the decision to commit suicide. Since less than 0.1% of the people
commit suicide, a trivial model predicting "no person commit suicide" would obtain a
H = 0.999 .
We will create a table of "Type I - Type II" errors. Suppose c = 0.5. Let's generate ŷi and then yiᶜ:
. logit smokes gender monage highsc belief obese alclmo hattac cmedin totexpr
tincm_r work0 marsta1 marsta2, robust
. predict smokes_p
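One way to complete the table (a sketch; the variable names "smokes_hat" and "right" are merely illustrative):

. gen smokes_hat = (smokes_p > 0.5) if smokes_p < .
. gen right = (smokes_hat == smokes)
. tab smokes smokes_hat
. sum right

The two-way tabulation displays the Type I and Type II errors, and the mean of "right" is the percentage of right predictions H for c = 0.5.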
Chapter 6
Hazard Models
Let T be the time spent in the initial state (a duration), with cumulative distribution function F(t) = P(T ≤ t). The survivor function is the probability of surviving past t, and is defined as S(t) = 1 − F(t) = P(T > t). Now consider the probability of leaving the initial state within the interval [t, t + h), conditional on having survived up to t:

P(t ≤ T < t + h | T ≥ t)

Dividing by h we can obtain the average likelihood of leaving the initial state per period of time over the interval [t, t + h). Making h tend to zero we get the hazard function (defined as the instantaneous rate of leaving the initial state):
λ(t) = lim(h→0) P(t ≤ T < t + h | T ≥ t) / h

The hazard and survival functions provide alternative but equivalent characterizations of the distribution of T. Notice that:

P(t ≤ T < t + h | T ≥ t) = P(t ≤ T < t + h) / P(T ≥ t) = [F(t + h) − F(t)] / [1 − F(t)]

and that:

lim(h→0⁺) [F(t + h) − F(t)] / h = F′(t) = f(t)
Finally:
λ(t) = f(t) / [1 − F(t)] = f(t) / S(t)
For instance, consider the simplest case: the hazard rate is constant, λ(t) = λ. It implies that T follows the exponential distribution: F(t) = 1 − exp(−λt).
Let xi be a vector of explanatory variables. A widespread group of models are the so-called proportional hazard models. The idea behind those models is that the hazard function can be written as the product of a baseline hazard function that does not depend on xi, and an individual-specific non-negative function that describes the effect of xi:

λ(t, xi) = λ₀(t) exp(xiβ)

Where λ₀ is the baseline hazard function (i.e. that of a hypothetical individual with xi = 0), and exp(xiβ) is the proportional term that stretches or shrinks the baseline function along the y axis (because the adjustment is the same for every t).
Take logarithms and then take the derivative with respect to xik:

∂ ln λ(t, xi) / ∂xik = βk

The coefficient βk is the (proportional) effect of a change in xk on the hazard function. When xk is increased, the hazard function is stretched if βk > 0 (i.e. the instantaneous likelihood of leaving the initial state increases for every t), and it is shrunk if βk < 0.
An example
Open the database lung.dta and see its content:
. use lung, clear
. describe
. browse
First of all, use "stset" to declare the data to be survival-time data. You have to indicate the time variable ("time") and the failure variable ("dead"):
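A minimal sketch of the declaration and of a first look at the resulting survival data:

. stset time, failure(dead)
. stdescribe
. stsum

Once the data are stset, the st commands (e.g. "stcox" for the semiparametric proportional hazard model, or "streg" for parametric models) can be applied directly.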
Chapter 7
Count-Data Models
Count-data Models
The central characteristics of a count dependent variable are: i. It must be a non-negative integer (i.e. zero is included); ii. It does not have an obvious maximum or upper limit; iii. Most of the values must be low, and in particular there must be lots of zeros.
Consider the following example, illustrated by the same database that we will use throughout the notes: the number of visits to the doctor in the last two weeks ("count.dta").
. use count, clear
. tab doctorco
. hist doctorco, discrete
Visits to the doctor in the last two weeks (doctorco) - Actual frequency distribution

Count     Frequency     Percent
0         4,141         79.79
1         782           15.07
2         174           3.35
3         30            0.58
4         24            0.46
5         9             0.17
6         12            0.23
7         12            0.23
8         5             0.10
9         1             0.02
Total     5,190         100.00
Notice that the "count nature" is clear: i. The variable only takes integer values,
including the zero; ii. There is no obvious upper limit; iii. The 95% of the
observations take values either 0 or 1, and 80% of the sample is composed by zeros.
The Poisson Regression Model
We need to estimate an appropriate model for E(y | x), where x is a vector of explanatory variables. The model will satisfy:

E(y | x) = exp(xβ)

Which guarantees E(y | x) > 0 and provides a rather simple interpretation for the marginal effects:

∂ ln E(y | x) / ∂xk = βk
Then, the coefficients are semi-elasticities (notice that the coefficients on binary
variables will be read as proportional changes).
Remember that a random variable Y has a Poisson distribution if:

f(y) = P(Y = y) = e^(−λ) λ^y / y!,   y = 0, 1, 2, ...

E(Y) = V(Y) = λ
The property that the expectation equals the variance is called equidispersion. The
Poisson regression model corresponds to:
f(y | x) = e^(−exp(xβ)) [exp(xβ)]^y / y!,   y = 0, 1, 2, ...

The likelihood and log-likelihood functions are:

L(β) = Πi e^(−λi) λi^(yi) / yi!,   with λi = exp(xiβ)

l(β) = Σi [ yi xiβ − exp(xiβ) − ln yi! ]

And the first-order conditions are:

Σi [ yi − exp(xiβ) ] xi = 0
Like in the logit model, there is no explicit solution and you must find the estimates
numerically. Under the right specification the estimator is consistent, asymptotically
normal and efficient.
Furthermore, the Poisson estimator is a quasi-MLE: if E(y | x) = exp(xβ) holds, then the estimator remains consistent even if the true distribution of y is not Poisson; in that case robust standard errors should be used for inference.
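The estimation itself is straightforward. A minimal sketch for the doctor-visits example (the regressors sex, age and income are only illustrative; check "describe" for the variables actually available in count.dta):

. use count, clear
. poisson doctorco sex age income, robust
. mfx

The "robust" option produces standard errors that remain valid under the quasi-MLE interpretation.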
Chapter 8
Selection Models
The internal validity of the econometric models is commonly threatened by selection
problems. Consider the following example:
y = xβ + u
Call s the selection variable, which takes the value 1 if y is observed, and 0 otherwise. Imagine a super-sample (yi, xi, si) of size N, and suppose that we only observe the sub-sample (yi, xi) for those with s = 1. Consider as an example a wage equation for women: we only observe wages for those women who work.
If we had an i.i.d. sample (yi, xi), consistency would depend on E(u | x) = 0. The problem is that we now have a sample conditional on s = 1. Take expectations:

E(y | x, s = 1) = xβ + E(u | x, s = 1)

Then OLS using the sub-sample will be inconsistent unless:

E(u | x, s = 1) = 0
Notice that not every selection mechanism makes OLS inconsistent. If u is independent of s, then OLS is consistent. Additionally, if the selection depends only on x then OLS is also consistent.
A selectivity model
Consider the following system of equations:
y1i = x1i β1 + u1i   (regression equation)

y2i = 1[x2i β2 + u2i > 0]   (selection equation)

Where y1i is observed only if y2i = 1. This allows the non-observables from both equations to be related: taking expectations conditional on selection,

E(y1i | xi, y2i = 1) = x1i β1 + γ λ(x2i δ2)

Where λ(·) is the inverse Mills ratio and γ differs from zero whenever u1i and u2i are correlated. Assuming that u2i ~ N(0, σ2²), and using the symmetry of the normal distribution:

P(y2i = 1) = P(u2i > −x2i β2) = P(u2i/σ2 < x2i β2/σ2) = Φ(x2i δ2)

Then P(y2i = 1) corresponds to a probit model with coefficient δ2 = β2/σ2. If x2i and y2i are observed for the complete sample, then δ2 can be estimated consistently using a probit model (notice that although we can identify δ2, we cannot identify β2 and σ2 separately).
Finally, β1 and γ can be estimated consistently using the following two-step procedure:
First stage: obtain an estimate δ̂2 of δ2 using the probit model P(y2i = 1) = Φ(x2i δ2) for the complete sample. Then estimate zi using ẑi = λ(x2i δ̂2).
Second stage: regress y1i on x1i and ẑi utilizing the censored sample (i.e. the observations with y2i = 1), which should yield consistent estimates of β1 and γ.
It can be shown that the second stage is heteroskedastic by construction. You can derive robust standard errors, though the adjustment is not as simple as in the sandwich estimator (it would be if zi were observable). Nevertheless, Stata computes it automatically.
An example: wage equation for women
The objective of this example is estimating a simple wage equation for women. The
database is the same presented for the Tobit example. Enter "describe" to see the
data definitions:
. use women, clear
. describe
The variable "hwage" represents the hourly salary income (in US dollars), which must
be set to missing if the person does not work (since the "heckman" command
identifies with missing values those observations censored):
. replace hwage=. if hwage==0
Then we can use the command "heckman". Immediately after it you must specify
the regression equation, and in the option "select()" you must specify the
explanatory variables for the selection equation:
. heckman hwage age agesq exp expsq married head spouse headw spousew pric seci
secc supi supc school, select(age agesq exp expsq married head spouse headw
spousew pric seci secc supi supc school) twostep
With the option "twostep" it fits the regression model using the Heckman's two-step
consistent estimator. Additionally, you can use different specifications for the
regression and selection equations:
. heckman hwage age agesq exp expsq pric seci secc supi supc, select(age agesq exp
expsq married head spouse headw spousew pric seci secc supi supc school) twostep
Selection Test
You can compare the estimate with the (seemingly inconsistent) OLS regression:
. reg hwage age agesq exp expsq pric seci secc supi supc, robust
The differences are indeed minuscule. This is not surprising, since the inverse Mills ratio term is statistically not different from zero. Testing H0: γ = 0 provides a simple test for selection bias. Under H0 the regression model with the selected sample is homoskedastic, and then you can perform the test without correcting for heteroskedasticity.
. heckman hwage age agesq exp expsq pric seci secc supi supc, select(age agesq exp
expsq married head spouse headw spousew pric seci secc supi supc school) twostep
. test lambda
Two stages "by hand"
. gen s=(hwage!=.)
. probit s age agesq exp expsq married head spouse headw spousew pric seci secc
supi supc school
. predict xb, xb
. gen lambda = normden(xb)/normal(xb)
. regress hwage age agesq exp expsq pric seci secc supi supc lambda if hwage>0,
robust
And compare the estimates to those obtained using "heckman":
. heckman hwage age agesq exp expsq pric seci secc supi supc, select(age agesq exp
expsq married head spouse headw spousew pric seci secc supi supc school) twostep
Notice that when you obtain the two-stage estimates "by hand" you cannot use the reported standard errors directly (they ignore the fact that ẑi was estimated in the first stage).
Maximum-likelihood estimation
Under the (more restrictive) assumption that (u1i , u2i ) follows a bivariate normal
distribution it is possible to construct a maximum-likelihood estimator. If you do not
specify the "twostep" option, then Stata fits such an estimator:
. heckman hwage age agesq exp expsq married head spouse headw spousew pric seci
secc supi supc school, select(age agesq exp expsq married head spouse headw
spousew pric seci secc supi supc school)
This strategy is usually discarded. See for example Nawata and Nagase (????). In the
Problem Set you will be asked to compare both estimations.
Drawbacks
The classical problem with the Heckman model is the high correlation between x1i and ẑi. Since the function λ(·) is monotonically increasing, if its argument has little variation then it may resemble a linear function. If x1i is similar to x2i, then the correlation between x1i and ẑi may be high. Indeed, the identification of γ strictly requires λ(·) not to be linear. If x1i and x2i have a lot of variables in common, then the second stage will be subject to a problem of high multicollinearity. The final result is insignificant coefficients on γ and β1.
We can obtain a measure of how bad this problem could be. First estimate the
Heckman model using the option "mills(newvar)", which stores the estimated λ(·) in "newvar":
. heckman hwage age agesq exp expsq pric seci secc supi supc, select(age agesq exp
expsq married head spouse headw spousew pric seci secc supi supc school) twostep
mills(mills)
And then regress the estimated λ(·) on x1i:
. reg mills age agesq exp expsq pric seci secc supi supc
The higher the resulting R², the more severe the multicollinearity problem. In this particular example high multicollinearity does not seem to be an issue.
Marginal effects
The marginal effects for the expected value of the dependent variable conditional on
being observed, E(y | y observed), are:
. mfx compute, predict(ycond)
The marginal effects for the probability of the dependent variable being observed,
Pr(y observed), are:
. mfx compute, predict(psel)
Chapter 9
Partially continuous variables
Some random variables are discrete: each value in the support occurs with positive probability (e.g. Bernoulli, Poisson). Some random variables are continuous: each point in the support occurs with probability zero, but intervals occur with positive probability (e.g. normal, chi-squared). But some variables are partially continuous: they are continuous over most of their support, but some points occur with positive probability.
Consider expenditure in tobacco as an example: with positive probability an individual does not smoke; but given that he/she smokes, the tobacco expenditure is a continuous variable (i.e. the probability that he/she will spend exactly $121.02 is null).
This discontinuity usually arises from either censoring or corner solutions. Consider for example the following simple model:
y* = xβ + u,   u | x ~ N(0, σ²)

y = max(0, y*)

Conditional on being uncensored:

E(y | x, y > 0) = xβ + E(u | x, y > 0)
               = xβ + E(u | x, u > −xβ)
               = xβ + σ φ(xβ/σ) / [1 − Φ(−xβ/σ)]
               = xβ + σ φ(xβ/σ) / Φ(xβ/σ)
               = xβ + σ λ(xβ/σ)
Where λ(z) = φ(z)/Φ(z) is known as the inverse Mills ratio. If we regressed yi on xi using only the uncensored observations, then σλ(xiβ/σ) would end up in the error term. Since the latter is correlated with xi, OLS would be inconsistent. The unconditional expectation is:

E(y | x) = E(y | x, y > 0) P(y > 0 | x) + E(y | x, y ≤ 0) P(y ≤ 0 | x)
        = E(y | x, y > 0) P(y > 0 | x)
        = [xβ + σλ(xβ/σ)] P(u > −xβ)
        = [xβ + σλ(xβ/σ)] Φ(xβ/σ)
Maximum-likelihood estimation
The model with censored data with normal errors is known as Tobit Model. Under the
normality assumption it is relatively easy to obtain a consistent maximum-likelihood
estimator. Let's begin by dividing the sample ( yi , xi ) in pairs with yi = 0 and pairs
with yi > 0 . The first happen with probability P( yi = 0 | xi ) , while the second follow
the density distribution f ( yi | xi , yi > 0) . Define wi 1[ yi > 0] . Then:
L(β, σ) = Π(i: yi = 0) P(yi = 0) × Π(i: yi > 0) f(yi | xi, yi > 0) = Πi P(yi = 0)^(1 − wi) f(yi | xi, yi > 0)^(wi)

l(β, σ) = Σi { (1 − wi) ln[1 − Φ(xiβ/σ)] + wi ln[(1/σ) φ((yi − xiβ)/σ)] }
Under standard conditions the maximization problem has a unique solution, which
can be easily retrieved using maximization algorithms.
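As a reference for Exercise 6.2 below, a possible sketch of a maximum-likelihood evaluator for this model written for the "ml" command (the program name "mytobit" is merely illustrative, and the covariate list anticipates the wage example below):

. capture program drop mytobit
. program define mytobit
.     args lnf xb lnsigma
.     * uncensored observations: ln[(1/sigma) phi((y - xb)/sigma)]
.     quietly replace `lnf' = -`lnsigma' + ln(normden(($ML_y1-`xb')/exp(`lnsigma'))) if $ML_y1 > 0
.     * censored observations: ln[1 - Phi(xb/sigma)]
.     quietly replace `lnf' = ln(1 - normal(`xb'/exp(`lnsigma'))) if $ML_y1 == 0
. end
. ml model lf mytobit (hwage = age agesq exp expsq married head spouse children hhmembers headw spousew pric seci secc supi supc school) (lnsigma:)
. ml maximize

The parameter σ is estimated through its logarithm to keep it positive; its estimate can be recovered as exp(lnsigma).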
Interpreting the results
The maximum-likelihood coefficients have the following interpretation:
∂E(y* | x) / ∂x = β
They are the partial derivatives on the latent variable (e.g. consumption). But
sometimes the interest must be focused on different marginal effects. Recall that:
E(y | x) = E(y | x, y > 0) P(y > 0 | x) = [xβ + σλ(xβ/σ)] Φ(xβ/σ)
Using the chain rule:

∂E(y | x)/∂xk = [∂E(y | x, y > 0)/∂xk] P(y > 0 | x) + E(y | x, y > 0) [∂P(y > 0 | x)/∂xk]
The above is the Donald-Moffit decomposition. It is straightforward to show that:
∂E(y | x)/∂xk = βk Φ(xβ/σ)
This is a rather simple re-scaling of the original estimate. However, as in the logit model, the marginal effects depend on x. In our tobacco example, this would be the marginal effect on expected expenditures: part of the effect comes from the families that start spending because of the change in x, and the other part comes from the increase in expenditures for those families that were already spending.
We can also obtain the following:
∂E(y | x, y > 0)/∂xk = βk {1 − λ(xβ/σ)[xβ/σ + λ(xβ/σ)]}
It is also a re-scaling. This would be the effect on expenditures only for those
families who were already expending something.
An example: wage determination for women
We want to estimate a rather simple model on the determination of income for
women. We have data ("women.dta") on 2556 women between 14 and 64 years old
living in Buenos Aires in 1998. Enter "describe" to see the data definitions:
. use women, clear
. describe
Then, regressing wages on some individual characteristics we can estimate the effect
of those variables on the wages' determination. The variable "hwage" stores the
hourly wage in US dollars, and it takes the value zero if the woman does not work. As
potential explanatory variables and controls we have variables on education, age,
number of kids, marital status, etc.
Regarding our theoretical model, y* would be the "potential" wage (e.g. if compelled to work, a woman might be paid less than her reservation wage), and y would be the "actual" wage (e.g. if a woman is offered less than her reservation wage, then she will not work and she will earn nothing).
Let's estimate the following wage determination equation (inconsistently) by OLS
using the censored sample and the truncated sample:
. reg hwage age agesq exp expsq married head spouse children hhmembers headw
spousew pric seci secc supi supc school, robust
. reg hwage age agesq exp expsq married head spouse children hhmembers headw
spousew pric seci secc supi supc school if hwage>0, robust
Now we will retrieve the Tobit coefficients:
. tobit hwage age agesq exp expsq married head spouse children hhmembers headw
spousew pric seci secc supi supc school, ll(0)
With the option "ll(0)" we are indicated that the sample is left-censored at zero. We
know that = E ( y | x) x is the marginal effect of x on the (conditional expected
value of) "potential" wages. For instance, the impact of being married on the "latent"
wage is -8.6. Following the previously introduced formulas, we can also estimate the
marginal effect of x on the expected "actual" wage (at the mean x ), and the
marginal effect of x on the expected "actual" wage conditional on being uncensored
(at the mean x ).
We will take advantage of the command "dtobit", which you must install first. For
instance:
. version 6.0
. tobit hwage age agesq exp expsq married head spouse children hhmembers headw
spousew pric seci secc supi supc school, ll(0)
. dtobit
. version 9.1
Notice that we need to specify "version 6.0" before running the Tobit model. The
reason is that in version 6.0 Stata stores the estimate for sigma, which is necessary
to carry on "dtobit". If you do not want to calculate the marginal effects at the mean
x, you may specify the values for x using the option "at()".
For instance, the effect of being married on the "actual" hourly salary income is -3, while the effect on the "actual" hourly salary income for working women is only -2.49. The differences with respect to the latent variable are considerable.
Notice that the latter two estimates are respectively the "consistent" versions of the OLS regressions performed at the beginning of the example. Comparing them you can get an idea of how biased those early estimates were (in this example, considerably).
Exercise 6.1: Reproduce the latter two marginal effects "by hand" (i.e. as if the
command "dtobit" did not exist).
Exercise 6.2: Obtain the maximum-likelihood estimates "by hand" (i.e. using the
command "ml").
Exercise 6.3: Choose a continuous explanatory variable. Then graph a scatterplot of
that variable and hourly wages, and add the regression lines for each one of the
estimated coefficients showed above (including the inconsistent OLS estimates).
For the marginal effects, you can use the command mfx as well. The marginal
effects for the probability of being uncensored are obtained in the following way:
mfx compute, predict(p(a,b)), where a is the lower limit for left censoring and
b is the upper limit for right censoring. In our example:
. mfx compute, predict(p(0,.))
The marginal effects for the expected value of the dependent variable conditional on
being uncensored and the marginal effects for the unconditional expected value of
the dependent variable are obtained respectively:
. mfx compute, predict(e(0,.))
. mfx compute, predict(ys(0,.))
LR Test
Let L(θ) be the log-likelihood function, let θ̂ be the unrestricted estimator, and let θ̃ be the estimator with the Q nonredundant constraints imposed (e.g. with fewer explanatory variables). Then, under regularity conditions, the likelihood-ratio (LR) statistic LR = 2[L(θ̂) − L(θ̃)] is distributed asymptotically as χ²(Q) under H0.
For instance, the following model reaches a log-likelihood of about -4513.77:
. tobit hwage age agesq exp expsq married head spouse headw spousew pric seci secc
supi supc school, ll(0)
If the variables "hhmembers" and "children" are added, then the log-likelihood
becomes about -4496.83:
. tobit hwage age agesq exp expsq married head spouse children hhmembers headw
spousew pric seci secc supi supc school, ll(0)
The likelihood-ratio statistic is about 2(−4496.83 − (−4513.77)) = 33.88. As it follows a χ²(2) distribution under H0, the null hypothesis that the coefficients on "hhmembers" and "children" are both zero is rejected at any standard significance level.
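You can verify the statistic and its p-value by hand (a quick check using the log-likelihoods reported above):

. scalar LR = 2*(-4496.83 - (-4513.77))
. display LR
. display chi2tail(2, LR)

Stata computes the same test automatically with "lrtest":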
. tobit hwage age agesq exp expsq married head spouse children hhmembers headw
spousew pric seci secc supi supc school, ll(0)
. est store A
. tobit hwage age agesq exp expsq married head spouse headw spousew pric seci secc
supi supc school, ll(0)
. lrtest A
Chapter 10
Logit/Probit Models
Let y be a response taking on the values {0, 1, ..., J}, and assume the observations (xi, yi) are i.i.d.
We are interested in how ceteris paribus changes in the elements of x affect the response probabilities P(y = j | x), j = 0, 1, ..., J. Since the probabilities must sum to unity, P(y = 0 | x) will be determined once we know the probabilities for j = 1, ..., J.
We can think of the multinomial model as a series of binary models. That is, evaluate the probability of alternative j against alternative i for every i ≠ j. For instance, consider the binary model P(y = j | y ∈ {i, j}, x):
Pj / (Pi + Pj) = P(y = j | x) / [P(y = i | x) + P(y = j | x)] = F(xβj)

We obtain:

Pj = F(xβj)(Pi + Pj)

Pj / Pi = F(xβj) (Pi + Pj)/Pi = F(xβj) / [Pi/(Pi + Pj)] = F(xβj) / [1 − F(xβj)] ≡ G(xβj)
Notice that:

Σ(j ≠ i) Pj/Pi = (1 − Pi)/Pi = 1/Pi − 1

1/Pi = 1 + Σ(j ≠ i) Pj/Pi = 1 + Σ(j ≠ i) G(xβj)

Pi = 1 / [1 + Σ(j ≠ i) G(xβj)]

Pj = G(xβj) Pi = G(xβj) / [1 + Σ(k ≠ i) G(xβk)]
To find an explicit form for Pj we only have to substitute G(·) by exp(·), and then we obtain the multinomial logit model:

P(y = j | x) = exp(xβj) / [1 + Σ(h=1..J) exp(xβh)],   j = 1, ..., J

As the response probabilities must sum to 1, we must set the probability of the reference response (j = 0, for which β0 = 0) to:

P(y = 0 | x) = 1 / [1 + Σ(h=1..J) exp(xβh)]
The likelihood function is:

L(β) = Πi Πj P(yi = j | xi)^(1[yi = j])
McFadden (1974) has shown that the log-likelihood function is globally concave, which makes the maximization problem straightforward.
The partial effects for this model are complicated. For a continuous xk we can express:

∂P(y = j | x)/∂xk = P(y = j | x) [ βjk − Σ(h=1..J) βhk exp(xβh) / (1 + Σ(h=1..J) exp(xβh)) ]

Where βjk is the k-th element of βj. Notice that even the direction of the effect is not entirely revealed by βjk. You may find other ways to interpret the coefficients.
Conditional logit model
McFadden (1974) showed that a model closely related to the multinomial logit model
can be obtained from an underlying utility comparison. It is called the Conditional
logit model. Those models have similar response probabilities, but they differ in
some key regards. In the MNL model, the conditioning variables do not change across alternatives: for each i, xi contains variables specific to the individual but not to the alternatives. On the other hand, the conditional logit model cannot have variables varying over i but not over j. However, using an appropriate transformation you can obtain the MNL model with the conditional logit technique, as the latter actually contains the MNL model as a special case.
An example: a simple model of occupational choice
We will use the database "status.dta". Enter "describe" to see the data
definitions. It has been extracted from Wooldridge "Econometric Analysis of Cross
Section and Panel Data", example 15.4 (page 498). It is a subset from Keane and
Wolpin (1997) that contains employment and schooling history for a sample of men
for 1987. The three possible outcomes are enrolled in school ("status=0"), not in
school and not working ("status=1"), and working ("status=2"). As explanatory
variables we have education, past work experience, and a black binary indicator.
Open the database:
. use status, clear
. describe
Now we can enter the command "mlogit":
. mlogit status educ black exper expersq, b(0)
With "b(0)" we indicate that the base category is "enrolled in school" ("status=0").
Marginal Effects
We can calculate the marginal effects "by hand" using the formula derived above, or
we can simply take advantage of the command "mfx":
. mfx compute, predict(outcome(0))
. mfx compute, predict(outcome(1))
. mfx compute, predict(outcome(2))
Where "outcome()" denotes the response under consideration. For example, an
additional year of education (at the mean x) changes the probability of going to school
by +0.01, the probability of staying home by -0.05, and the probability of working by
+0.03. This is completely logical: in general people invest in education in order to get
further education (e.g. going to college in order to get a Ph.D. in the future) or they
invest in education in order to enter the job market. Thus, investing in education
reduces the probability of staying home.
We can also obtain predicted probabilities to provide some useful comparisons. For
instance, consider two non-black men, each with 3 years of experience (and then
"expersq=9"). Calculate the three predicted probabilities for a man with 12 years of
education:
. mfx, predict(p outcome(0)) at(12 0 3 9)
. mfx, predict(p outcome(1)) at(12 0 3 9)
. mfx, predict(p outcome(2)) at(12 0 3 9)
And calculate the predicted probabilities for a man with 16 years of education:
. mfx, predict(p outcome(0)) at(16 0 3 9)
. mfx, predict(p outcome(1)) at(16 0 3 9)
. mfx, predict(p outcome(2)) at(16 0 3 9)
You can see that the 4 years of additional schooling changed the probability of going
to school by +0.06, the probability of staying home by -0.1, and the probability of
working by +0.04.
Tests for the multinomial logit model
You may install the command "mlogtest", which computes a variety of tests for
multinomial logit models.
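A sketch of how it might be used after the model above (the options shown, "hausman" and "combine", are those of the user-written SPost version of the command):

. search mlogtest
. mlogit status educ black exper expersq, b(0)
. mlogtest, hausman
. mlogtest, combine

The "hausman" option performs Hausman-type tests of the independence of irrelevant alternatives (IIA) assumption, while "combine" runs Wald tests of whether pairs of outcome categories can be combined.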
Suppose the alternatives can be divided into S groups G1, ..., GS. Thus the first hierarchy corresponds to which of the S groups y falls into, and the second corresponds to the actual alternative within each group. We can propose separate models for those probabilities:

P(y ∈ Gs | x)   and   P(y = j | y ∈ Gs, x)

The response probability P(y = j | x), which is ultimately of interest, is obtained by multiplying both equations. The problem can be solved either in a two-step fashion or using full maximum likelihood.
For instance, consider the model of commuting to work. First you might want to
decide whether to travel by car, to travel "naturally", or by public transportation.
Once you decided to travel by car, you have to decide whether to travel alone or
carpooling. Once you decided to travel "naturally", you have to decide whether to
travel on foot or by bicycle. Once you decided to travel by public transportation, you
have to decide whether to travel by bus or by train.
Consider for instance the ordered probit model. Let y be an ordered response taking on the values {0, 1, ..., J} for some known integer J, and define the latent variable as:

y* = xβ + e,   e | x ~ N(0, 1)

Where x does not contain an intercept. Let α1 < α2 < ... < αJ be unknown cut points (a.k.a. threshold parameters), and define:

y = 0   if y* ≤ α1
y = 1   if α1 < y* ≤ α2
...
y = J   if y* > αJ

Then:

P(y = 0 | x) = P(y* ≤ α1 | x) = P(xβ + e ≤ α1 | x) = Φ(α1 − xβ)
P(y = 1 | x) = P(α1 < y* ≤ α2 | x) = Φ(α2 − xβ) − Φ(α1 − xβ)
...
P(y = J − 1 | x) = P(αJ−1 < y* ≤ αJ | x) = Φ(αJ − xβ) − Φ(αJ−1 − xβ)
P(y = J | x) = P(y* > αJ | x) = 1 − Φ(αJ − xβ)
The parameters α and β can be estimated by maximum likelihood. The magnitude of an ordered probit coefficient does not have a simple interpretation, but you can retrieve qualitative information directly from its sign and statistical significance. We can also compute marginal effects with respect to xk.
As always, replacing the normal cumulative distribution by the logistic yields the
ordered logit model.
Example: life satisfaction in Russia
Recall the Russian database used in the first two weeks ("russia.dta"). We will
estimate a model explaining life satisfaction using health and economic variables. It
will also include dummies for geographical areas (area*). The variable "satlif"
measures life satisfaction, and takes values from 1 ("not at all satisfied") to 5 ("fully
satisfied").
First open the database, generate the geographical dummies, and transform the
household expenditures to thousand rubles:
. use russia, clear
. gen area1=(geo==1)
. gen area2=(geo==2)
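A minimal sketch of how the example might continue (the list of health and economic regressors is only illustrative, and it assumes that "geo" takes three values so that two area dummies suffice):

. replace totexpr = totexpr/1000
. oprobit satlif totexpr obese smokes area1 area2, robust
. mfx compute, predict(outcome(5))

The last line computes the marginal effects on the probability of reporting the highest satisfaction level (satlif = 5) at the mean of the regressors.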
Chapter 11
Quantile Regression
An Introduction
Notice that just as we can define the sample mean as the solution to the problem of minimizing a sum of squared residuals, we can define the median as the solution to the problem of minimizing a sum of absolute residuals (Koenker and Hallock, 2001). Since the symmetry of the absolute value yields the median, minimizing a sum of asymmetrically weighted absolute residuals yields the other quantiles. Then solving:

min_ξ Σi ρτ(yi − ξ)

Yields the τ-th sample quantile as its solution, where ρτ(u) = u[τ − 1(u < 0)] is the so-called check function. Replacing the scalar ξ by a parametric function ξ(xi, β) gives the quantile regression estimator, which solves:

min_β Σi ρτ(yi − ξ(xi, β))
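The corresponding Stata commands can be illustrated with the women.dta database used in the previous chapters (a sketch; the covariate list is only illustrative):

. use women, clear
. qreg hwage age agesq exp expsq pric seci secc supi supc, quantile(.50)
. sqreg hwage age agesq exp expsq pric seci secc supi supc, quantiles(.25 .50 .75) reps(100)

"qreg" fits a single quantile regression (here the median), while "sqreg" estimates several quantiles simultaneously and obtains bootstrapped standard errors, which allows testing whether coefficients differ across quantiles.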
Chapter 12
Robust inference
y_ij = x_ij β + e_ij

Where i denotes units and j denotes groups (e.g. individuals within households, students within schools). Notice that panel data is a particular case. In OLS the usual assumption is that the error term (e_ij) is independently and identically distributed. However, this is noticeably violated in many applications. Since the consistency of β̂ depends only on E(e_ij | x_ij) = 0, the problem shows up in the estimates of its standard errors rather than in β̂ itself.
One of the common violations to the i.i.d. assumption is that errors within groups are
correlated in some unknown way, while errors between groups are not. This is known
as clustered errors, and OLS estimates of the variance can be corrected following an
Eicker-Huber-White-robust treatment of errors (i.e. making as few assumptions as
possible). We will keep the assumption of no correlation across groups, but we will
allow the within-group correlation to be anything at all.
First recall the OLS variance estimator:
V_OLS = s² (X′X)⁻¹,   where s² = Σi ei² / (N − k)

The Eicker-Huber-White robust estimator is:

V_R = (X′X)⁻¹ [ Σi (ei xi)(ei xi)′ ] (X′X)⁻¹

And the cluster-robust estimator is:

V_C = (X′X)⁻¹ [ Σ(j=1..N_C) u_j u_j′ ] (X′X)⁻¹,   where u_j = Σ(i in cluster j) ei xi

Where N_C is the total number of clusters. For simplicity, I omitted the multipliers (which are close to 1) from the formulas for the robust and clustered standard errors.
Over-rejection
When you compare the first and the second estimators, generally the second yields greater standard errors. Then, if you have heteroskedasticity and you do not use the robust standard errors, you are over-rejecting the null hypothesis that the coefficients are zero.
But when you compare the robust (unclustered) and the clustered variance estimators, there is not a general result. If the within-cluster correlation is negative,
then within the same cluster there will be big negative eij along with big positive eij ,
and small negative eij with small positive eij . This would imply that the cluster sums
of eij xij have less variability than the individual eij xij , since within each cluster the
eijs will be cancelling each other. Then the variance of the clustered estimator
will be less than the robust (unclustered) estimator. In this case, using cluster-robust
standard errors will not only make your inference robust to within-cluster
correlation, but it will improve your statistical significance.
You can repeat the reasoning for the opposite case (i.e. with positive within-cluster correlation), and for the case where there are no clustered errors. The following is an applied example.
An example
Open the database:
. use education, clear
This is a database on test scores that was generated including within-school
correlation in the error term. The estimate of the coefficient on treatment is
consistent. Now we can estimate the unclustered and clustered standard errors:
. xtreg test treatment female educ income, fe i(school) robust
. xtreg test treatment female educ income, fe i(school) cluster(school)
In this example not accounting for within-cluster correlation would have led to wrong
inference at the 10% for the treatment coefficient.
Should I always use clustered standard errors?
If the assumptions are satisfied, and the error term is clustered, you will get
consistent standard error estimates if you use the cluster-robust estimator. On the
other hand, if the assumptions are satisfied and errors are not clustered, you will get
roughly the same estimates as if you had not specified cluster.
Why not always specify cluster? Well, the cluster-robust standard error estimator
converges to the true standard error as the number of clusters M approaches infinity,
not the number of observations N. Kezdi (2004) shows that 50 clusters (with roughly
equal cluster sizes) is often close enough to infinity for accurate inference.
Moreover, as long as the number of clusters is large, even in the absence of
clustering there is little cost of using clustered-robust estimates.
However, with a small number of clusters or very unbalanced cluster sizes, inference using the cluster-robust estimator may be very incorrect. With finite M, the cluster-robust estimator produces estimates of standard errors that may be substantially biased downward (i.e. leading to over-rejection). See Wooldridge (2003) and Cameron et al. (2006) for further discussions and suggestions on the matter.
Right specification
The within-cluster correlations may disappear with a correctly specified model, and
so one should always be alert to that possibility. Consider as an example a model
where the dependent variable is cell phone expenditure. If you only included
explanatory variables at the individual level (pooled OLS), then you would find
serious within-cluster correlation: since many of the calls are between household
members, the errors of the individuals within the household would certainly be
correlated to each other. However, a great deal of such within-cluster correlation
would disappear if you used a fixed-effects model. Furthermore, by adding the right
predictors the correlation of residuals could almost disappear, and certainly this
would be a better model.
For instance, you can see whether in the above model the difference between the
unclustered and clustered standard errors is magnified or not after controlling for
fixed effects:
. reg test treatment female educ income, robust
. reg test treatment female educ income, cluster(school)
. xtreg test treatment female educ income, fe i(school) robust
. xtreg test treatment female educ income, fe i(school) cluster(school)
Nested multilevel clustering
Beyond the basic one-dimensional case, one may consider a multiple-level clustering.
For instance, the error term may be clustered by city and by household, or it may be
clustered by household and by year. In the first example the levels of clustering are
called nested. To estimate cluster-robust standard errors in the presence of nested
multi-level clustering, one can use the svy suite of commands. However, specifying
clustering solely at the higher level and clustering at the higher and lower level is
unlikely to yield significantly different results.
Nonlinear models
In Wooldridge (2006) there is an entire section on nonlinear models with clustered standard errors, with examples and Stata commands in the Appendix.
w_b = (β̂1,b − β̂1) / s_β̂1,b ,   b = 1, ..., B

w = (β̂1 − 0) / s_β̂1

Obtaining s_β̂1,b from the CRVE standard errors, which control for both error heteroskedasticity across clusters and quite general correlation and heteroskedasticity within clusters (obtained using jointly the robust and cluster() options).
Finally, we test H0: β1 = 0 against Ha: β1 ≠ 0. We reject H0 at significance level α if w < w[α/2] or w > w[1 − α/2], where w[q] denotes the q-th empirical quantile of the B bootstrap statistics w_1, ..., w_B.
Obtain w and β̂1:
. use education, clear
. xtreg test treatment female educ income, fe i(school) cluster(school)
. quietly: ereturn list
. matrix V=e(V)
. matrix b=e(b)
. scalar se=sqrt(V[1,1])
. scalar b1=b[1,1]
. scalar wfirst=b1/se
Then obtain wb (say) 100 times:
. forvalues j=1(1)100 {
. quietly: preserve
. quietly: bsample , cluster(school)
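One way to complete the body of the loop (a sketch; it assumes that a 100 x 1 storage matrix was created beforehand with ". matrix W = J(100,1,.)"):

. quietly: xtreg test treatment female educ income, fe i(school) cluster(school)
. quietly: matrix Vb = e(V)
. quietly: matrix bb = e(b)
. quietly: matrix W[`j',1] = (bb[1,1]-b1)/sqrt(Vb[1,1])
. quietly: restore
. }

Each replication draws a bootstrap sample by cluster, re-estimates the model, and stores the studentized statistic w_b; the empirical quantiles of the stored statistics are then compared with w (stored above in "wfirst").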
First stage:

E[y1 | x1, θ1]

Second stage:

E[y2 | x2, θ2, E(y1 | x1, θ1)]

There are two standard approaches to such estimation. The first is a full information maximum likelihood (FIML) model, which consists in specifying the joint distribution of y1 and y2 and maximizing the corresponding likelihood. The second is a two-step procedure: the first-stage model is estimated, its fitted values are plugged into the second-stage model, and the second-stage standard errors are then corrected for the fact that the first stage was itself estimated.
The second-stage log-likelihood has the form:

L2 = Σi ln f(y2i | x2i, θ2, x1i θ̂1)

Define Ĉ = Σi (∂L2i/∂θ2)(∂L2i/∂θ1)′ and R̂ = Σi (∂L2i/∂θ2)(∂L1i/∂θ1)′ (both q × p matrices, where q and p are the numbers of second- and first-stage parameters). The Murphy-Topel estimator of the variance of the second-stage coefficients is:

V2MT = V2 + V2 (Ĉ V1 Ĉ′ − R̂ V1 Ĉ′ − Ĉ V1 R̂′) V2

Matrices V1 and V2 can be estimated in many ways. For instance, in the following example we will use the robust versions.
An example
We will consider an example from Hardin (2002), using data from Greene (1992). The
interesting dependent variable is the number of major derogatory reports recorded in
the credit history for the sample of applicants to a credit card. There are 1319
observations in the sample for this variable (10% of the original dataset), 1060 of
which are zeros. A Poisson regression model will then be a reasonable choice for the
second stage.
In a first stage we will estimate a model of the credit card application outcome,
using a logit model. Then we will include the predicted values of the latter model as
explanatory variable in the Poisson regression. Notice that the full information approach would instead require specifying the joint distribution of both outcomes.
Variable Name     Description
Age
Income
Expend
Ownrent
Selfemp           =1 if self employed
First Stage (dep var: z)

variable    coef      std error    p-value
age         -0.073    0.035        0.035
income      0.219     0.232        0.345
ownrent     0.189     0.628        0.763
selfemp     -1.944    1.088        0.074
cons        2.724     1.060        0.010

Second Stage (dep var: y)

variable    coef      std error    p-value
age         0.073     0.048        0.125
income      0.045     0.178        0.800
expend      -0.007    0.003        0.022
zhat        4.632     3.968        0.243
cons        -6.320    3.718        0.089
This output shows the "naive" (robust) standard errors (i.e. standard errors which assume that there is no error in the generation of ẑi in the first stage).
Within Stata it is not too difficult to obtain the Murphy-Topel variance estimates for a two-stage model. We need to gather all the information for calculating the matrix V2MT. First we have to obtain V1 and V2 (V1r and V2r in the following code). We will also obtain ẑi, ŷi and the coefficient on ẑi in the second stage (zhat, yhat and zz in the code), as we will need them later:
. logit z age income ownrent selfemp, robust
. matrix V1r = .99 * e(V)
. predict zhat
. poisson y age income expend zhat, robust
. matrix V2r = .99 * e(V)
. predict yhat
. scalar zz = _b[zhat]
You may notice that we have multiplied the robust variance-covariance matrices by 0.99. This is because Stata applies a small-sample adjustment n/(n − 1) to the robust variance estimates, and then we can undo that adjustment by multiplying accordingly.
Then we have to calculate R̂ and Ĉ. First we have to do a little bit of algebra. Consider the logistic model where z is the outcome (whether the application is successful), and the second-stage Poisson model, whose log-likelihood is:

L2 = Σi [ yi wi θ2 − exp(wi θ2) − ln Γ(yi + 1) ]

Where wi contains the second-stage regressors (including ẑi), and γ̂ (zz in the code) denotes the estimated coefficient on ẑi. Differentiating we obtain:

∂L1/∂θ1:  Σi xi′(zi − ẑi)xi = X′ diag(zi − ẑi) X

∂L2/∂θ1:  Σi xi′(yi − ŷi) ẑi (1 − ẑi) γ̂ xi = X′ diag((yi − ŷi) ẑi (1 − ẑi) γ̂) X

∂L2/∂θ2:  Σi wi′(yi − ŷi)wi = W′ diag(yi − ŷi) W
Where θ̂2 is the estimate obtained from the two-step procedure. The "matrix accum" command calculates X′X. Let's obtain Ĉ and R̂:
. gen byte cons = 1
. matrix accum C = age income ownrent selfemp cons age income expend zhat cons
[iw=(y-yhat)*(y-yhat)*zhat*(1-zhat)*zz], nocons
. matrix accum R = age income ownrent selfemp cons age income expend zhat cons
[iw=(y-yhat)*(z-zhat)], nocons
And keep only the desired partition:
. matrix list C
. matrix C = C[6..10,1..5]
. matrix list R
. matrix R = R[6..10,1..5]
Finally, we can calculate the Murphy-Topel estimate:
. matrix M = V2r + (V2r * (C*V1r*C' - R*V1r*C' - C*V1r*R') * V2r)
. matrix list M
To show the Murphy-Topel estimates, Hardin (2002) suggested a rather useful way. In
case it already exists, drop the program "doit":
. capture program drop doit
Then define it:
. program define doit, eclass
. matrix b = e(b) // stores the coefficients in b
. ereturn post b M // saves the old coefficients along with the new standard errors
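The remaining lines of the program and its execution would then be (a sketch):

. ereturn display // displays the coefficient table with the Murphy-Topel standard errors
. end
. doit

Running "doit" reproduces the second-stage coefficients together with the corrected standard errors, as in the table below.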
Second Stage (dep var: y) - Murphy-Topel standard errors

variable    coef      std error    p-value
age         0.073     0.314        0.816
income      0.045     1.483        0.976
expend      -0.007    0.019        0.723
zhat        4.632     30.591       0.880
cons        -6.320    24.739       0.798
Hole (2006) shows how the calculation of the Murphy-Topel variance estimator for
two-step models can be simplified in Stata by using the scores option of predict.
Chapter 13
Matching
References
[1] Verbeek, Marno (2000), A Guide to Modern Econometrics - John Wiley & Sons
Eds.
[2] Jenkins, S. (1995), "Easy Estimation Methods for Discrete-time Duration Models"
- Oxford Bulletin of Economics and Statistics, Vol. 57 (1), pp. 129-138.
[3] Lancaster, Tony (1990), The Econometric Analysis of Transition Data - Econometric Society Monographs.
[4] Koenker, Roger and Hallock, Kevin F. (2001), Quantile Regression - Journal of Economic Perspectives, Vol. 15 (4), pp. 143-156.
[5] Cameron, Colin A.; Gelbach, Jonah and Miller, Douglas L. (2006), Bootstrap-Based Improvements for Inference with Clustered Errors - NBER Technical Working Paper No. 344.
[6] Efron, Bradley (1981), Nonparametric Estimates of Standard Error: The Jackknife, the Bootstrap and Other Methods - Biometrika, Vol. 68 (3), pp. 589-599.
[7] Guan, Weihua (2003), From the Help Desk: Bootstrapped Standard Errors - The Stata Journal, Vol. 3, pp. 71-80.
[8] Kezdi, Gabor (2004), Robust Standard Error Estimation in Fixed-Effects Panel
Models - Hungarian Statistical Review Special, Vol. 9, pp. 96-116.
[9] Nichols, Austin and Schaffer, Mark (2007), Clustered Errors in Stata - Mimeo.
[10] Wooldridge, Jeffrey M. (2006), Cluster-Sample Methods in Applied Econometrics: An Extended Analysis - Mimeo.
[11] Wooldridge, Jeffrey M. (2003), Cluster-Sample Methods in Applied Econometrics - American Economic Review, Vol. 93 (2), pp. 133-138.
[12] Hamilton, Lawrence C. (2006), Statistics with Stata - Brooks/Cole (http://www.stata.com/bookstore/sws.html).