Overview ECN402
1. If you add more variables and this makes the existing coefficients change, the reason
might be that adding variables reduces the omitted variable bias. If adding a variable
makes the existing coefficients change, this means that the added variable is correlated
with the existing x and with y.
2. If the causal effect doesn’t hold, we probably violate one of the Gauss-Markov
assumptions.
3. Including a variable that doesn't change (doesn't vary) violates the Gauss-Markov
assumption that there has to be sample variation in x.
4. The annual growth rate implied by quarterly data = (1 + _b[time])^4 − 1, where _b[time] is the estimated coefficient on the time variable.
5. By using the option vce(cluster clustvar) in Stata, we obtain cluster-robust standard errors that allow for correlation of the error terms within clusters (see the sketch below).
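A minimal sketch (not from the notes) of the same idea in Python with statsmodels; the data frame df and the columns y, x and school_id are hypothetical placeholders.

```python
# Cluster-robust standard errors, analogous to Stata's vce(cluster school_id).
import statsmodels.formula.api as smf

def cluster_robust_fit(df):
    # Plain OLS coefficients, but standard errors clustered on school_id
    return smf.ols("y ~ x", data=df).fit(
        cov_type="cluster", cov_kwds={"groups": df["school_id"]}
    )
```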
Biased = when the expected value (the average over repeated samples) of the estimator doesn't equal the true population parameter.
Unbiased: When the estimated beta is unbiased, it means that the expected value of the
estimator is equal to the true population parameter, E(\hat\beta_1) = \beta_1.
Restricted model = a model in which we have assumed (under H0) that several of the
parameters are equal to zero. The restricted model always has fewer parameters than the
unrestricted model.
Robust standard errors = the standard errors, and hence the variance, are robust to
heteroskedasticity. A technique to obtain unbiased standard errors of the OLS coefficients
under heteroskedasticity.
Cluster data = when there are subsamples within the data that are related to each other. For
example, test scores in a school might be correlated within classrooms because students in the
same classroom share the same teacher.
Cluster robust standard errors = to be used in panel data. When error terms uit are
correlated within clusters but independent across clusters, then regular standard errors, which
assume independence between all observations, will be incorrect. Cluster-robust standard
errors are designed to allow for correlation between observations within cluster.
Unit root = a feature of a stochastic process, such as a random walk, that may cause problems
with statistical inference in time series. If there are d unit roots, the process has to be
differenced d times in order to become stationary.
When there is a correlation between u and xj we say that xj is endogenous. If this is not the
case, xj is exogenous.
Validation can be helpful to avoid overfitting (capturing noise by adding too many
variables/terms) and to assess predictive performance of the model.
Panel data
Fixed effect estimator: The fixed effect estimator allows for a correlation between ai and x.
FE is a stepwise form of OLS. In the first step you subtract each unit's time average from every
variable in the regression. This demeaned regression can then be estimated by OLS, and the
estimator will be unbiased, since ai is constant over time and therefore drops out of the
regression. When using the fixed effect estimator, explanatory variables that don't vary over
time cannot be identified. The FE estimator also has a lot of parameters to estimate, resulting
in a high consumption of degrees of freedom.
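A minimal sketch (assumed example, not from the notes) of the within (FE) transformation: demean y and x within each unit and run pooled OLS on the demeaned data. The column names id, y and x are hypothetical.

```python
import pandas as pd
import statsmodels.api as sm

def fixed_effects_within(df: pd.DataFrame):
    # Subtract each unit's time average; a_i is removed by the demeaning,
    # so no constant is needed in the demeaned regression.
    demeaned = df[["y", "x"]] - df.groupby("id")[["y", "x"]].transform("mean")
    return sm.OLS(demeaned["y"], demeaned["x"]).fit()
```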
We define the residuals as the difference between actual and fitted values of yi.
\hat u_i = y_i - \hat y_i = y_i - \hat\beta_0 - \hat\beta_1 x_i
The OLS estimator chooses \hat\beta_0 and \hat\beta_1 to minimize the sum of squared residuals:
\sum_{i=1}^{n} \hat u_i^2 = \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2
FOC:
\hat\beta_0: \; -2 \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0
\hat\beta_1: \; -2 \sum_{i=1}^{n} x_i (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0
This yields:
\hat\beta_0 = \bar y - \hat\beta_1 \bar x
Goodness of fit, R2
R2 tells us how well the explanatory variable explains the dependent variable,
that is, how well changes in x explain changes in y. R2 measures the fraction of the
total variation in y that is explained by the regression and is always a value between 0 and 1.
A higher value indicates a better fit.
You should be careful when comparing R2 across models and samples. High R2 does not mean
that there is a causal relationship.
R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}
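A minimal numpy sketch (my own illustration) tying together the OLS formulas and the R^2 definition above; x and y are assumed to be one-dimensional arrays of equal length.

```python
import numpy as np

def ols_simple(x, y):
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()          # beta0_hat = ybar - beta1_hat * xbar
    resid = y - b0 - b1 * x
    ssr = np.sum(resid ** 2)               # sum of squared residuals
    sst = np.sum((y - y.mean()) ** 2)      # total variation in y
    r2 = 1 - ssr / sst                     # R^2 = 1 - SSR/SST
    return b0, b1, r2
```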
A good estimator
An estimator is good if it is
Unbiased: E(\hat\beta_0) = \beta_0 and E(\hat\beta_1) = \beta_1, meaning that the average of the estimated values
equals the true population parameters.
Efficient: An estimator is efficient relative to another estimator if it has a smaller
variance.
Under the following 5 assumptions, the OLS estimator is unbiased and efficient. These are called
the Gauss-Markov assumptions.
Gauss-Markov Assumptions
1. Linear in parameters
The dependent variable must be a linear function of the parameters and the error term. The model
has to be linear in the parameters only, so it is possible to use non-linear functions of x and y.
The definitions of the independent and dependent variables (level-log, log-level, etc.) don't affect
the mechanics of calculating the estimates, but they do affect the size and hence the interpretation
of the estimated coefficients.
2. Random sampling
The data has to be drawn from a random sample of the population.
Multiple regression: None of the explanatory variables are constant, and there are no exact
linear relationships among the explanatory variables. If there exists such an exact linear
combination of the other explanatory variables, the model suffers from perfect collinearity and
cannot be estimated by OLS. This assumption allows the explanatory variables to be correlated;
they just cannot be perfectly correlated. If two of the variables are highly correlated, this may
lead to large variances for the OLS slope estimators.
5. Homoskedasticity
The error term u has the same variance given any value of the explanatory variable:
Var(u \mid x) = \sigma^2
This implies that Var(y \mid x) = \sigma^2. If this does not hold and Var(u \mid x) depends on x, the error term
exhibits heteroskedasticity.
Homoskedasticity may be tested by looking at a plot of residuals and fitted values from the
regression.
Implications
Under assumptions 1-4, the OLS estimator is unbiased, and \hat\beta_1 represents the causal effect
of x on y in the population.
Under assumptions 1-5 the OLS estimator is also efficient.
The variances of the OLS estimators are:
Var(\hat\beta_1) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar x)^2}
Var(\hat\beta_0) = \frac{\sigma^2 \, n^{-1} \sum_{i=1}^{n} x_i^2}{\sum_{i=1}^{n} (x_i - \bar x)^2}
These formulas rely on the unknown error variance, \sigma^2. We can use the estimated residuals to
estimate \sigma^2:
\hat\sigma^2 = \frac{1}{n-2} \sum_{i=1}^{n} \hat u_i^2
We divide by n-2 instead of n because we have used 2 degrees of freedom when estimating \hat\beta_0
and \hat\beta_1. The smaller the estimated variance, the more precise (efficient) the OLS estimator is.
Interpretations
Estimator: formula for guessing a population parameter from sample data
Estimate: the actual value taken by an estimator in a specific sample
“An increase in x by 1 unit/% is associated with an increase/decrease in y by …. Units/%,
everything else equal”.
Level-level:
y=β 0 + β 1 x +u
∆ y =β 1 ∆ x
Log-level:
ln ( y )= β0 + β 1 x+u
% ∆ y ≈ (100∗β1 )∆ x
Level-log:
y=β 0 + β 1 ln ( x ) +u
\Delta y \approx \left( \frac{\beta_1}{100} \right) \% \Delta x
Log-log:
ln ( y )= β0 + β 1 ln ( x )+u
\% \Delta y \approx \beta_1 \, \% \Delta x
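As a made-up worked illustration of these interpretations: if a log-level wage regression \ln(wage) = \beta_0 + \beta_1 educ + u is estimated with \hat\beta_1 = 0.08, one more year of education is associated with approximately a 100 \cdot 0.08 = 8 % higher wage, everything else equal. In the log-log case the coefficient is read directly as an elasticity: a 1 % increase in x is associated with roughly a \beta_1 % change in y.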
Binary variables
A binary variable takes either the value 0 or 1.
Then E(y \mid x = 0) = \beta_0 and E(y \mid x = 1) = \beta_0 + \beta_1.
This regression allows the mean value of y to vary with the value of x.
\beta_1 = E(y \mid x = 1) - E(y \mid x = 0)
In the multiple regression model, the OLS estimators minimize \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \dots - \hat\beta_k x_{ik})^2.
The first order conditions give k+1 linear equations with k+1 unknowns. Solving this system
of equations gives formulas for the OLS estimators \hat\beta_0, \hat\beta_1, \dots, \hat\beta_k.
\hat\beta_j measures how, on average, y changes in our sample when xj increases by one unit while
holding all other independent variables fixed. It shows the partial effect, the ceteris
paribus effect, of xj on y for j = 1, …, k. The ceteris paribus interpretation results from the
linearity of the regression.
A good estimator
1. Linear in parameters
The model relating y to the xj can be written as follows:
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u
The population model must be linear in the parameters only. We can use non-linear functions
of xi.
2. Random sampling
The data is drawn from a random sample of the population.
3. No perfect collinearity
None of the variables are constant, and there are no exact linear relationships among the
independent variables. This assumption is violated if there exists (a,b) such that x j=a+b x l.
This will for instance be the case if one variable is a constant multiple of another or if one variable is the
sum of others; when variables are shares or exclusive categories, you cannot include all
shares/categories. STATA will drop one variable if you try to include two variables that are
perfectly collinear.
4. Zero conditional mean
The error term u has an expected value of zero given any values of the explanatory variables: E(u \mid x_1, \dots, x_k) = 0.
When this assumption is violated we say that x is endogenous, and we have an endogeneity
problem and a problem of identification.
5. Homoskedasticity
The error term u has the same variance given any value of the explanatory variables.
Var(u \mid x_1, \dots, x_k) = \sigma^2
Implications
Under assumptions 1-4 the OLS estimators are unbiased, so that E(\hat\beta_j) = \beta_j for j = 0, 1, \dots, k, and we
can interpret the regression causally. This means that the average of the estimated values
equals the true population parameters.
If assumptions 1-5 are satisfied, the OLS estimators, \hat\beta_j, are the best linear unbiased estimators
(BLUEs) of \beta_j. The OLS estimators then have the smallest variance (are most efficient) among
linear unbiased estimators.
!Omitted variables
It might happen that we omit a variable that actually belongs in the population model. The
model we should estimate would be as follows:
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u
But, by omitting a relevant variable, e.g. x2, out of ignorance, we end up estimating the following
model (the tilde emphasizes that this is an underspecified model):
\tilde y = \tilde\beta_0 + \tilde\beta_1 x_1
There is a simple relationship between \tilde\beta_1 and \hat\beta_1:
\tilde\beta_1 = \hat\beta_1 + \hat\beta_2 \tilde\delta_1
where \tilde\delta_1 is the slope coefficient from regressing x2 on x1. The difference between \tilde\beta_1 and \hat\beta_1 is the omitted
variable bias.
Omitted variable bias: \hat\beta_2 \tilde\delta_1
E(\tilde\beta_1) = \beta_1 if, and only if, either:
\hat\beta_2 = 0 (the partial effect of x2 is zero), or
\tilde\delta_1 = 0 (x1 and x2 are uncorrelated).
The direction of the bias depends on the sign of \beta_2 and the sign of corr(x1, x2).
The omitted variable bias occurs when we omit a variable that affects the outcome and is
correlated with the independent variable of interest.
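A hedged simulation sketch (my own example, not from the notes) illustrating omitted variable bias: x2 affects y and is correlated with x1, so leaving x2 out biases the estimate of \beta_1 upwards here, since \beta_2 > 0 and corr(x1, x2) > 0.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)                 # delta1 is roughly 0.8
y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)

full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
short = sm.OLS(y, sm.add_constant(x1)).fit()
print(full.params[1])    # close to the true 2.0
print(short.params[1])   # close to 2.0 + 1.5 * 0.8 = 3.2 (omitted variable bias)
```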
Precision/variance of OLS
If we have k regressors, then:
Var(\hat\beta_j) = \frac{\sigma^2}{SST_j (1 - R_j^2)}, \quad j = 1, 2, \dots, k
R_j^2 is the R^2 from regressing x_j on all the other explanatory variables. Since \sigma^2
is unknown, we need an unbiased estimator for it. Under assumptions 1-5, the following
estimator is an unbiased estimator of \sigma^2:
\hat\sigma^2 = \frac{\sum_{i=1}^{n} \hat u_i^2}{n-k-1} = \frac{SSR}{n-k-1}
n-k-1 is the degrees of freedom, which is [the number of observations – number of estimated
parameters].
Adding a new variable may affect the variance of the estimated beta in the following ways:
Increase the numerator (top) by increasing \hat\sigma^2, since k increases by 1 when we
add a new variable (fewer degrees of freedom).
Decrease the numerator (top) by decreasing \hat\sigma^2, since SSR decreases given that
we have one less variable that is unobserved.
Increase R_j^2 and thus decrease the denominator (bottom) if the added explanatory
variable is correlated with the existing explanatory variable (x).
!Multicollinearity
Multicollinearity = high (but not perfect) correlation between two or more independent
variables.
Multicollinearity is not a violation of any assumptions, and the OLS estimators are still
BLUEs.
Multicollinearity causes a large variance of the estimators and therefore large standard errors,
which makes statistical inference more difficult. To solve the problem of multicollinearity
you can collect more data, redefine the research question, or try to drop a variable even though
this might lead to omitted variable bias (a tradeoff between precision and bias).
On the contrary, when there are variables that are not highly correlated with our independent
variable (x), we may want to include them in the regression even when they do not cause
omitted variable biases. It may reduce the variance in the error term, hence increase precision.
STATA interpretation: In a regression, omitting a highly correlated variable that appears to
have zero effect (coefficient not statistically significant) on the dependent variable (y) will not
severely bias the estimate, and we may therefore want to exclude it. If two variables are
highly correlated, including both will diminish the precision of the OLS estimates. Excluding
one of them, given that they are both significant, will on the other hand bias the estimates due to
the omission (omitted variable bias). In this case, you would still like to include both variables
and accept the reduced precision.
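A minimal sketch (hypothetical data frame df with columns x1, x2, x3) of how multicollinearity is often diagnosed: the variance inflation factor VIF_j = 1/(1 - R_j^2), which is exactly the inflation term in the variance formula above.

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(df, cols=("x1", "x2", "x3")):
    X = sm.add_constant(df[list(cols)])
    # Skip the constant (column 0) and report one VIF per regressor
    return {col: variance_inflation_factor(X.values, i + 1)
            for i, col in enumerate(cols)}
```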
6. Normality assumption
We now assume that the unobserved error is normally distributed in the population.
u \sim Normal(0, \sigma^2)
We say that the error term is independently and identically distributed. The form and the
variance of the distribution does not depend on any of the explanatory variables.
y \mid x \sim Normal(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k, \; \sigma^2)
The relationship between the normal (N) and standard normal (Z) distributions gives:
z = \frac{\hat\beta_j - \beta_j}{sd(\hat\beta_j)} \sim Normal(0, 1)
Since the population parameter is unknown, we hypothesize its value (H0) and use this value
in the expression for z. \sigma is unknown in the expression for sd(\hat\beta_j) and is replaced with the
unbiased estimator \hat\sigma = \sqrt{SSR/(n-k-1)}. This gives the test statistic
t = \frac{\hat\beta_j - \beta_j}{se(\hat\beta_j)} \sim t_{n-k-1}
Under this assumption we can hypothesize about the unknown population value of β j and use
statistical inference to test such hypothesis.
Implications
Under A1-A6 we have the classical linear model (CLM) assumptions. Under the CLM
assumptions, the OLS estimators have a stronger efficiency property than they would under
the Gauss-Markov assumptions alone.
Hypothesis testing
Possible conclusions:
Reject H0
Fail to reject H0
o We should never “accept” the H0.
1. Specify H0 and H1
2. Determine the significance level \alpha. The significance level determines the "rejection area": the
region of the test statistic's distribution under H0 with probability mass exactly equal to \alpha.
3. Rejection rule: reject H0 when the test statistic is larger than some critical value c, which is
determined by the chosen alternative hypothesis (one-sided or two-sided test) and the significance
level. This means the probability of rejecting a "true" H0 is small.
If H0 involves several restrictions, the test statistic is F-distributed. The F-statistic should be
used when H0 involves the hypothesis that more than one population parameter takes a
certain value at the same time: H0: \beta_3 = 0, \beta_4 = 0, \beta_5 = 0.
The F statistic for testing exclusion of a single variable (q=1) is equal to the square of the
corresponding t statistic: t_{n-k-1}^2 has an F_{1, n-k-1} distribution, so the two approaches give the exact
same outcome provided that the alternative in the t-test is two-sided. The t-statistics are easier to
obtain and can be used to test against one-sided alternatives, so when testing a single
parameter we normally use the t-statistic.
The t-statistic:
t = \frac{\hat\beta_j - \beta_{j, H_0}}{se(\hat\beta_j)} \sim t_{n-k-1}
The null hypothesis above involves 3 exclusion restrictions, q = 3, stating that the 3 variables
have no effect on y. The alternative hypothesis is that H0 is not true, i.e. that at
least one of the variables has an effect on y.
The model without restrictions on the population parameters is called the unrestricted
model. The restricted model always has fewer parameters than the unrestricted model.
Adding variables to a model reduces the part of the total variation in y that the model cannot
explain: SSR_ur < SSR_r. The F-test is based on the idea of comparing SSR_ur and
SSR_r: if SSR_r is "sufficiently higher" than SSR_ur, then we reject H0 and the restricted
model.
F = \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n-k-1)} \sim F_{q, n-k-1}
q = df_r - df_{ur}
df_{ur} = n-k-1
This test-statistic will always be positive, if you get a negative F you have done something
wrong.
When F is large, we reject H0. If we reject H0 we say that x_{k-q+1}, …, x_k are jointly significant
at the chosen significance level. This test does not tell us which of the variables
have a partial effect on y. If the null is not rejected, then the variables are jointly
insignificant. Tables of the F-distribution tell us when a calculated F-statistic is above
the critical value for the chosen significance level.
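A small sketch of the F-test formula above, computed from a restricted and an unrestricted fit; y, X_restricted and X_unrestricted are assumed arrays, and the unrestricted model contains q extra regressors.

```python
import statsmodels.api as sm
from scipy import stats

def f_test_exclusion(y, X_restricted, X_unrestricted, q):
    ssr_r = sm.OLS(y, X_restricted).fit().ssr
    res_ur = sm.OLS(y, X_unrestricted).fit()
    ssr_ur = res_ur.ssr
    # F = [(SSR_r - SSR_ur)/q] / [SSR_ur/(n - k - 1)]
    F = ((ssr_r - ssr_ur) / q) / (ssr_ur / res_ur.df_resid)
    p_value = stats.f.sf(F, q, res_ur.df_resid)
    return F, p_value
```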
P-values:
P-value is the largest significance level at which we would fail to reject H0. Given that H0 is
true, what is the probability that we observe a value of t that is as large as the value of our test
statistic? A low p-value is evidence against H0. How we compute p-values depends on H1
(one- or two-tailed alternative hypothesis).
One-sided: p\text{-value} = P(T > t)
Two-sided: p\text{-value} = P(|T| > |t|) = 2 P(T > |t|)
https://www.statology.org/t-score-p-value-calculator/
https://www.statology.org/how-to-calculate-a-p-value-from-a-t-test-by-hand/
"The p-value is the probability of getting a coefficient of the observed magnitude if we assume that
the null hypothesis, stating that the value is equal to 0 (hence no relationship between y and x),
is true. Since we here find the p-value to be so small, we choose to reject the null hypothesis,
because the probability of observing a coefficient of -0.047 under H0 is very small."
Confidence intervals
The opposite of the rejection area under a 2-sided t-test. The critical value of the t-test for a
given significance level gives the probability mass that we want to allocate to the tails of the
distribution, where the tails represent "unlikely" values of the population
parameter under H0. A confidence interval represents a range of "likely" values of the
population parameter.
\hat\beta_j \pm t_c \, se(\hat\beta_j)
where t_c is the critical value of the t-distribution that corresponds to a 2-tailed test with
significance level \alpha. The t-distribution has degrees of freedom equal to df = n-k-1, where k
is the number of explanatory variables included in the model.
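A small sketch (assumed inputs) of computing the two-sided p-value and the confidence interval from the formulas above, using the t distribution with n-k-1 degrees of freedom.

```python
from scipy import stats

def t_inference(beta_hat, se, df, beta_h0=0.0, alpha=0.05):
    t_stat = (beta_hat - beta_h0) / se
    p_two_sided = 2 * stats.t.sf(abs(t_stat), df)     # p = 2 P(T > |t|)
    t_crit = stats.t.ppf(1 - alpha / 2, df)           # critical value t_c
    ci = (beta_hat - t_crit * se, beta_hat + t_crit * se)
    return t_stat, p_two_sided, ci
```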
Asymptotic properties of OLS estimators
Asymptotic properties of OLS estimators are properties of the estimator when n goes towards
infinity (when n is large enough). When n tends to infinity OLS will be consistent, meaning
that there is a very high probability that the estimate is close to the true population value.
Consistency is a minimum requirement for sensible estimators. Under assumptions A1-A4 the
OLS estimator \hat\beta_j is consistent for all j. Under assumptions A1-A5 the OLS estimators are
asymptotically normally distributed. When n is large, we can therefore conduct hypothesis testing
as usual even when the normality assumption does not hold. When n is large we can also replace
the zero conditional mean assumption by Cov(x_j, u) = 0, which is a weaker assumption.
Qualitative information
Qualitative information and dummies
Qualitative information can have 2 (binary) or more groups. We use a separate dummy
variable, D, for each of the categories, except one. The category left out of the regression is called
the base/reference/benchmark group. We interpret our coefficients against this excluded base
group. You cannot include a dummy variable for all categories (the dummy variable trap).
We can also include a dummy for when two variables are interacted with each
other.
Heteroskedasticity
The homoskedasticity assumption explained previously says that the error term u has the same
variance given any values of the explanatory variables.
Var(u \mid x_1, \dots, x_k) = \sigma^2
The most general form of heteroskedasticity can be written as:
Var(u_i \mid x_i) = \sigma_i^2
A special case of heteroskedasticity is when the variance is proportional to a known function of x:
Var(u_i \mid x_i) = \sigma^2 h(x_i)
Homoskedasticity was not used when showing that the OLS estimators are unbiased (to be
unbiased, A1-A4 have to be fulfilled). Hence, heteroskedasticity does not result in biased
estimators, and heteroskedasticity does not threaten the causal interpretation of the point
estimates. In the case of heteroskedasticity, the interpretation of R2 is also unaffected.
Homoskedasticity was used for calculating the variance of the OLS estimators. With
heteroskedasticity, the variance-formulas that have been derived are invalid. We do not know
how precise our estimators are, and we do not know how to test hypotheses statistically. The
Gauss-Markov theorem that OLS is BLUE no longer holds.
The Breusch-Pagan test tests only for a linear relationship between the squared residuals and
the independent variables. Given that the sample is relatively large, the standard errors should
not change much if we decide to correct for heteroskedasticity when the homoskedasticity
assumption was in fact correct.
Under heteroskedasticity, the variance of the OLS slope estimator is:
Var(\hat\beta_1) = \frac{\sum_{i=1}^{n} (x_i - \bar x)^2 \sigma_i^2}{SST_x^2}
When \sigma_i^2 = \sigma^2 for all observations i, the variance formula reduces to the familiar \sigma^2 / SST_x.
When \sigma_i^2 varies between observations, the usual calculations of standard errors are biased. A
valid estimator of Var(\hat\beta_1) that is robust to heteroskedasticity of any form is:
\frac{\sum_{i=1}^{n} (x_i - \bar x)^2 \hat u_i^2}{SST_x^2}
Robust standard errors are a technique to obtain unbiased standard errors of the OLS
coefficients under heteroskedasticity. When estimating this in STATA the estimated
coefficients will be the same, and only the standard errors will differ.
The reason for not using robust standard errors in all cases (regardless of heteroskedasticity or
not) is that when the sample size is "small" and the homoskedasticity assumption holds, t-
statistics using "robust" standard errors are not always close to the appropriate t-distribution
and could throw off inference. Given that the sample is relatively large, the standard errors
should not change much if we decide to correct for heteroskedasticity when the
homoskedasticity assumption was in fact correct.
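A hedged sketch of heteroskedasticity-robust ("White", HC1) standard errors in Python; the data frame df with columns y, x1, x2 is hypothetical. The coefficients are the same as for plain OLS, only the standard errors change.

```python
import statsmodels.formula.api as smf

def robust_fit(df):
    ols_fit = smf.ols("y ~ x1 + x2", data=df).fit()
    robust = smf.ols("y ~ x1 + x2", data=df).fit(cov_type="HC1")
    return ols_fit.bse, robust.bse   # compare the two sets of standard errors
```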
A problem is that se(\hat\theta_0) is not part of the regular output from a regression, as
se(\hat\theta_0) \neq se(\hat\beta_0) + se(\hat\beta_1) c_1 + se(\hat\beta_2) c_2
We can transform the variables so the regression output gives the right standard errors:
y = \theta_0 + \beta_1 (x_1 - c_1) + \beta_2 (x_2 - c_2) + u
When the dependent variable is ln(y), the prediction of y is \hat y = \exp(\hat\sigma^2/2) \exp(\widehat{\ln(y)}),
where \hat\sigma^2 is the unbiased estimator of \sigma^2. With ln(y), this is the correct prediction rule.
An unbiased estimator of \sigma^2 is the mean squared error, \hat\sigma^2 = \frac{SSR}{n-k-1}.
The following regression procedure produces the sample correlation between y and the
predicted y from the log(y) model:
1. Estimate the log(y) model and obtain the fitted values \widehat{\log(y)}_i.
2. Generate a variable \hat m_i = e^{\widehat{\log(y)}_i}.
3. Regress y on \hat m_i without an intercept. The fitted values for y from this regression
represent the predicted y from the log(y) model. These fitted values we denote \tilde y.
4. Find the sample correlation between \tilde y and y in the sample, Corr(\tilde y, y).
5. Square this and compare with the R2 of the model with y in levels.
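A sketch of the five-step procedure above (variable names are my own); X is assumed to already contain a constant. It returns the squared correlation that can be compared with R2 from the model with y in levels.

```python
import numpy as np
import statsmodels.api as sm

def log_model_r2(y, X):
    logy_fit = sm.OLS(np.log(y), X).fit()            # step 1: estimate log(y) model
    m_hat = np.exp(logy_fit.fittedvalues)            # step 2: m_hat = exp(log(y)_hat)
    y_tilde = sm.OLS(y, m_hat).fit().fittedvalues    # step 3: regress y on m_hat, no intercept
    corr = np.corrcoef(y_tilde, y)[0, 1]             # step 4: Corr(y_tilde, y)
    return corr ** 2                                 # step 5: compare with levels R^2
```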
Corrected or adjusted R2
R2 will always increase as we add more independent variables. This is because SSR will
always decrease. To compare goodness of fit for nested models of this type we have to use the
corrected R2. Corrected R2 penalizes for adding additional independent variables.
\bar R^2 = 1 - \frac{SSR/(n-k-1)}{SST/(n-1)}
Outliers
If the model has outliers, this may result in none of the explanatory variables being
statistically significant. One solution is to drop outliers by dropping the top 5 % and bottom 5 %
of the distribution (problem: ad hoc). We can also winsorize the variable.
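A small sketch of the two outlier strategies mentioned above: dropping the top and bottom 5 %, or winsorizing (capping values at the 5th and 95th percentiles). The series s is a hypothetical pandas Series.

```python
import pandas as pd

def drop_tails(s: pd.Series, lower=0.05, upper=0.95) -> pd.Series:
    lo, hi = s.quantile(lower), s.quantile(upper)
    return s[(s >= lo) & (s <= hi)]          # drop observations in the tails

def winsorize(s: pd.Series, lower=0.05, upper=0.95) -> pd.Series:
    # Cap extreme values instead of dropping them
    return s.clip(lower=s.quantile(lower), upper=s.quantile(upper))
```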
Specification issues
When there is a correlation between u and xj we say that xj is endogenous. If this is not the
case, xj is exogenous.
Nested models
To test between nested models, we can do a RESET test. H0 is that (1), the initial model, is the
"correct" model that satisfies the zero conditional mean assumption. H1 is that a model which
includes additional non-linearities of the x's fits better. There are many potential nonlinearities
that are possible. The implication of H0 is that no nonlinear combination of the x variables should
matter. The RESET test is based on adding polynomials of the fitted values to the original
regression. The square of the fitted values picks up several forms of nonlinearity
between the explanatory variables. Based on this, when doing RESET we estimate:
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \delta_1 \hat y^2 + \delta_2 \hat y^3 + u
The reformulated null hypothesis is then H0: \delta_1 = \delta_2 = 0. Then we conduct the F-test for
testing 2 exclusion restrictions.
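A manual RESET sketch (hypothetical df with columns y, x1, x2): add powers of the fitted values and F-test their joint significance, as in the equation above.

```python
import statsmodels.formula.api as smf

def reset_test(df):
    base = smf.ols("y ~ x1 + x2", data=df).fit()
    df = df.assign(yhat2=base.fittedvalues ** 2, yhat3=base.fittedvalues ** 3)
    aug = smf.ols("y ~ x1 + x2 + yhat2 + yhat3", data=df).fit()
    # H0: delta1 = delta2 = 0 (no neglected nonlinearity)
    return aug.f_test("yhat2 = 0, yhat3 = 0")
```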
Rejecting one model with RESET does not give a clear indication as to what should be the
correct model. Sometimes when RESET is used to discriminate between 2 different functional
forms, the test may fail to reject both of them. The conclusion might change with different
numbers of non-linear terms. We have no clear indication of what is best to do if we conclude
that both models we have tested are misspecified.
Non-nested models
To choose between non-nested models, we can use the Davidson-MacKinnon test. This can be
used to test 2 non-nested models with the same dependent variables against each other.
A drawback with the Davidson-MacKinnon test is that it is possible that both models are
rejected. Then we have no clear indication of what to do. It is also possible that none of
the models is rejected, and then we need to think about other criteria such as economic
intuition, R2, etc. If one of the models is rejected, this does not mean that the other model is
correct. Rejection could be due to a variety of functional form problems. The test offers
insight into a specific setting, testing between two models, and says nothing about all possible
other models. It is always important to think about model specification as well and not rely
solely on misspecification tests to guide model choice.
Panel data
Panel data is when we measure the same units in at least two periods, so we get two
dimensions. We get one cross sectional dimension N and a time series dimension T. Panel
data is also called longitudinal data. The advantages of panel data are as follows:
Increase the sample size
Able to reduce multicollinearity problems because of
o Variation between cross-sections
o Variation over time
Able to control for unobserved effects better than in cross-sections or time-series.
Able to build dynamic models
We will in general consider that N is large and that Ti is relatively small. This is typically the
case in panels of individuals, households or firms. Panel data is more than a sequence of
cross-sections over time. In a sequence of cross-sections we don’t necessarily observe the
same individuals over time. In panel data we follow individuals for several periods of time. A
balanced panel is when Ti = T for every i. An unbalanced panel is when Ti can be different
for each i. Unbalanced panels can be such that individuals have different starting periods.
Another example:
RD_{it} = \beta_0 + \beta_1 RD_{it-1} + v_{it}, \quad \text{with } v_{it} = (a_i + u_{it})
The term \beta_1 RD_{it-1} captures the fact that previous investment in R&D reduces the cost
(marginal or fixed) of investment today. Thus, we have constructed a dynamic model where
the dependent variable and some of the explanatory variables are from different periods. If
Cov(RD_{it-1}, a_i) > 0, OLS will provide an upward biased estimate of \beta_1. We can control for
unobservables that are correlated with the regressors, as long as they are constant over time
for each individual.
ai, the individual unobserved specific effect may mean different things.
If subscript i denotes an individual:
Ability
Motivation
Family background
If subscript i denotes a firm:
Efficiency
If subscript i denotes a country
political system
citizens’ attitude
If subscript i denotes a market
local demand
willingness to pay
If Cor(x_{it}, a_i) \neq 0, then OLS is biased and inconsistent: the distribution of the estimated
coefficient does not become more and more tight around the true parameter as the sample size grows.
The correlation between x_{it} and a_i violates the zero conditional mean assumption, resulting in a biased
estimate. Because of this we need to get rid of a_i, or at least remove it from the error term.
Ignoring the existence of a_i when estimating the model can be thought of as an omitted
variable bias problem.
Positive bias means that the estimated coefficient is too large. Negative bias means that the
estimated coefficient is too small.
Estimation methods for removing ai
The occurrence of ai in the error term vit causes the bias when estimating the panel data model
with OLS. There are several ways to remove ai and still find an estimate of the original slope
coefficient 1.
y_{i2} = \beta_0 + \beta_1 x_{i2} + a_i + u_{i2}
y_{i1} = \beta_0 + \beta_1 x_{i1} + a_i + u_{i1}
\frac{1}{T}(y_{i1} + y_{i2}) = \frac{1}{T}\left[ (\beta_0 + \beta_1 x_{i1} + a_i + u_{i1}) + (\beta_0 + \beta_1 x_{i2} + a_i + u_{i2}) \right]
\bar y_i = \frac{1}{T}(\beta_0 + \beta_0) + \frac{\beta_1}{T}(x_{i1} + x_{i2}) + \frac{1}{T}(a_i + a_i) + \frac{1}{T}(u_{i1} + u_{i2})
\bar y_i = \beta_0 + \beta_1 \bar x_i + a_i + \bar u_i
We are only using the variation within each group, not between groups: the within-group
estimator.
y_{i2} - \frac{y_{i1} + y_{i2}}{2} = \beta_1 \left( x_{i2} - \frac{x_{i1} + x_{i2}}{2} \right) + \left( u_{i2} - \frac{u_{i1} + u_{i2}}{2} \right)
\frac{2 y_{i2} - y_{i1} - y_{i2}}{2} = \beta_1 \frac{2 x_{i2} - x_{i1} - x_{i2}}{2} + \frac{2 u_{i2} - u_{i1} - u_{i2}}{2}
\frac{y_{i2} - y_{i1}}{2} = \beta_1 \frac{x_{i2} - x_{i1}}{2} + \frac{u_{i2} - u_{i1}}{2}
y_{i2} - y_{i1} = \beta_1 (x_{i2} - x_{i1}) + (u_{i2} - u_{i1})
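A first-difference (FD) sketch for a balanced panel (hypothetical df with columns id, t, y, x): difference within each unit and run OLS on the differences, which removes a_i as in the last equation above.

```python
import pandas as pd
import statsmodels.api as sm

def first_difference_fit(df: pd.DataFrame):
    d = df.sort_values(["id", "t"]).groupby("id")[["y", "x"]].diff().dropna()
    # FD regression without a constant; a constant here would pick up a time trend
    return sm.OLS(d["y"], d["x"]).fit()
```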
FE or FD?
LSDV approach and within group estimators (FE) are always identical.
FD, LSDV and FE give the same estimates when T=2. To achieve this, FE
estimation must include a dummy variable for the second time period in order to be identical to the FD
estimates that include an intercept. With T=2, FD has the advantage of being straightforward
to implement, and it is easy to compute heteroskedasticity-robust statistics after FD
estimation. The error variance estimates are also numerically identical.
When T≥ 3, FE and FD are not the same. Both are unbiased under their assumptions and
consistent. For large N and small T, the choice between FE and FD hinges on the relative
efficiency of the estimators, and this is determined by the serial correlation in the
idiosyncratic errors u_{it}. When the u_{it} are serially uncorrelated, FE is more efficient than FD. So
whether to use FD or FE depends on the potential serial correlation in the error term, and on
measurement error. The best thing may be to use both estimation approaches and compare
the results.
LSDV approach and within group estimators (FE) are always identical. In LSDV there are a
lot of coefficients to be estimated, consuming degrees of freedom. When using WG/FE we
are only using the within group variation in the data, while it sometimes would be nice to take
advantage of some of the between-group variation. The FE estimator is neither able to give
coefficient estimates of variables that are not varying over time. Might therefore be useful to
use another estimator that allows for this and neither gives many degrees of freedom.
Fixed effect estimators: (WG, FD, LSDV) allow for correlation between the explanatory
variables x and ai. But if there is no variation over time in one of the explanatory variables the
effect of this non-time varying explanatory variable cannot be identified. The fixed effect
estimators are also “consuming” degrees of freedom, having a lot of parameters that need to
be estimated.
Random effect estimator: (RE) gives the possibility to estimate the effect of non-time-
varying explanatory variables and still take account of ai. Neither does it “consume”
parameters as the fixed effect estimators do (RE is more efficient than the FE-estimator). But
the RE model cannot be used if there is correlation between x and the ai.
The choice: To choose between fixed effect estimators and random effect estimators, we can
base this decision on a) theory or b) testing (Hausman test).
The Hausman test is as follows:
m = \hat q_1' \left[ VC(\hat q_1) \right]^{-1} \hat q_1 \sim \chi^2_{df = k}
\hat q_1 = \hat\beta_{FE} - \hat\beta_{RE}
VC(\cdot) = \text{variance-covariance matrix}
H0: \hat\beta_{FE} \cong \hat\beta_{RE}. RE is unbiased only if x_{it} is uncorrelated with a_i, while FE is unbiased even if x_{it}
is correlated with a_i. H0 holds when the two coefficient vectors \hat\beta_{FE} and \hat\beta_{RE} are similar. If x_{it}
is uncorrelated with a_i, we then prefer to use RE since RE is more efficient than FE.
If we only have one slope coefficient, \beta, that we want to compare between RE and FE, we get
the following setup:
m = (\hat\beta_{FE} - \hat\beta_{RE})' \left[ VC(\hat q_1) \right]^{-1} (\hat\beta_{FE} - \hat\beta_{RE}) = \frac{(\hat\beta_{FE} - \hat\beta_{RE})^2}{VC(\hat q_1)}
which has the same structure as a squared t-statistic:
t^2 = \frac{(\hat\beta - \beta_{H_0})^2}{st.dev(\hat\beta)^2} = \frac{(\hat\beta - \beta_{H_0})^2}{var(\hat\beta)}
t = \frac{\hat\beta - \beta_{H_0}}{st.error(\hat\beta)}
The numerator (top) of m is what we “want to test” and the denominator (bottom) is a kind of
weighting.
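A numpy sketch of the Hausman statistic above, given the coefficient vectors and variance-covariance matrices from FE and RE fits (all inputs are assumed to be available already); a common choice, under H0, is VC(q) = V_FE − V_RE since RE is the efficient estimator.

```python
import numpy as np
from scipy import stats

def hausman(b_fe, b_re, V_fe, V_re):
    q = b_fe - b_re
    Vq = V_fe - V_re                        # VC(q_hat) under H0 (an assumption here)
    m = float(q @ np.linalg.inv(Vq) @ q)    # m ~ chi2 with df = len(q) under H0
    p_value = stats.chi2.sf(m, df=len(q))
    return m, p_value
```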
A comment about FE vs. RE: When ai affect the outcome yit linearly, we have seen that we
can eliminate ai from the specification through some linear transformation (WG/FD/LSDV).
However, if ai affects the outcome yit nonlinearly, it isn’t easy to find a transformation to
eliminate ai. Such a nonlinear model is for instance the binary choice model where the
observed outcome yit takes the value of either 1 or 0. Then the only alternative might be a RE-
specification (it doesn’t help to include one dummy for each individual as we did for LSDV).
y_{it} = \beta_0 + \beta_1 x_{it} + (a_i + u_{it})
y_{it-1} = \beta_0 + \beta_1 x_{it-1} + (a_i + u_{it-1})
v_{it} = \rho v_{it-1} + \varepsilon_{it}
(a_i + u_{it}) = \rho (a_i + u_{it-1}) + \varepsilon_{it}
We see that the existence of ai, both on the LHS and RHS of the last equation, induces
correlation between the residuals in period t and t-1 for individual i, i.e. between vit and vit-1.
We know that autocorrelation, correlation over time in the residuals, will affect the efficiency
(the st.errors of our coefficient estimates). Thus, we can no longer be sure about the
correctness of the st.errors. RE is taking into account this dependency between the residuals
for individual i, and thus we prefer RE over pooled OLS for panel data sets.
The difference-in-difference estimator (DiD)
To be used when we have cross-sections that are sampled before and after a treatment.
This is one of the most commonly used models to analyze the effect of policy changes if one has one
group that is potentially affected by a treatment and a control group that is not affected by the
policy change. The treatment group experiences an exogenous change (treatment), while the
control group doesn't. At the same time, we control for an underlying trend not caused by the
policy change.
It is important to be sure that the effect of the policy intervention is really caused by the
intervention itself, and not by something else. Whether individuals are allocated into
treatment or control group should be random (random assignment). The conditional
independence assumption, or unconfoundedness, says that conditional upon the covariates
xi, selection into treatment is unrelated to the potential gain from the treatment.
Using DiD we rely on the common trend assumption: that the trend lines for the treatment and
control groups are parallel before the policy change/treatment. But there might be a difference
in the trends already before the policy change, and then we cannot assume that differences in
y_{it} are caused by the treatment. [This assumption is therefore often violated. It can be violated
if people have the possibility to change from the control to the treatment group (they want the
treatment), or if the groups have different long-run trends. One way to support this assumption is to
provide trends for the different control/treatment groups before the treatment takes place, to
show that they have a common trend, to conduct placebo tests, etc. — just come up with
something fitting if you are asked.]
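A hedged DiD sketch (hypothetical df with columns y, treated, post, each dummy equal to 0/1): the coefficient on the interaction treated:post is the difference-in-differences estimate.

```python
import statsmodels.formula.api as smf

def did_fit(df):
    res = smf.ols("y ~ treated + post + treated:post", data=df).fit()
    return res.params["treated:post"], res   # DiD estimate and the full results
```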
Time dummies
We can construct time dummies that take the value 1 if the observations are from a certain
time period, zero otherwise.
There are several ways to solve this problem, but one way is to use instrumental variable
(IV) estimation instead of OLS with the endogenous variable. Using IV estimation means finding an
instrument z that has a direct effect on x, but no direct effect on y, and is uncorrelated with u.
There are two conditions that z has to satisfy: it has to be exogenous (cannot be tested) and
relevant (can be tested with an F-test). If the two conditions are met, we can compute the IV
estimate with the two-stage least squares (2SLS) method.
Endogeneity problem
When the zero conditional mean assumption holds, the OLS results can be interpreted in a
causal way. If we on the other hand think that x is correlated with the error term u, the
variable is endogenous, and we have an endogeneity problem.
Definitions:
Weak instrument: When the correlation between x and z is small
IV-estimation: 2SLS
If the two conditions for z are satisfied (exogeneity and relevance), we can obtain the IV
estimator by conducting the following steps:
1. Regress x on z. This is called the first stage.
x=π 0 + π 1 z + v
The predicted ^x is the part of x that can be explained by z. Because z is uncorrelated
with the error term u in the structural model, so is ^x .
2. The second stage is to replace x with \hat x in the original equation, and estimate this by
OLS:
y = \beta_0 + \beta_1 \hat x + u
The coefficient in front of \hat x is the IV estimate of \beta_1.
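A manual 2SLS sketch of the two steps above (y, x, z are assumed 1-D arrays). Note that the second-stage standard errors from this naive approach are not the correct 2SLS standard errors; dedicated IV routines adjust for the generated regressor.

```python
import statsmodels.api as sm

def two_stage_least_squares(y, x, z):
    first = sm.OLS(x, sm.add_constant(z)).fit()        # first stage: x on z
    x_hat = first.fittedvalues                         # part of x explained by z
    second = sm.OLS(y, sm.add_constant(x_hat)).fit()   # second stage: y on x_hat
    return second.params[1]                            # IV estimate of beta1
```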
OLS and IV estimates can give very different results. One possible explanation is that either
the OLS or the IV estimates are biased. IV estimation may for example have problems with
small samples or weak instruments, making the estimate consistent but not unbiased. If z is in
fact not exogenous, the IV estimator is not consistent, and we are not solving the endogeneity
problem but may make it even worse.
F-tests are used when we compare the variances or proportions of two populations.
The F-test for equality of variances is as follows, where s^2 is the sample variance:
F = \frac{s_1^2}{s_2^2} \sim F_{n_1 - 1, \, n_2 - 1}
If Var(\hat\beta_{1, OLS}) < Var(\hat\beta_{1, IV}), then OLS is more efficient than the IV estimator.
As a rule of thumb, the standard error of the IV estimator is about 1/|\rho_{xz}| times larger than the
standard error for OLS, where \rho_{xz} is the sample correlation between x_i and z_i. This we can think of as the cost
of doing IV when we could be doing OLS. When x is exogenous, we therefore prefer OLS to
IV since it gives a more efficient estimate. If \rho_{xz} is small, the IV standard error becomes large,
and a small \rho_{xz} also means that z is a weak instrument. If z = x, then \rho_{xz}^2 = 1, and we get the
OLS variance.
That the variance is asymptotic means that, as N goes towards infinity, the variance is reduced
and the estimated variance moves towards its true value.
To test if the relevance condition hold, we can look at the first-stage F statistic for exclusion
of the instrument variables. If F>10, then the instrument is relevant in explaining x. If F<10,
the relationship between z and x is considered weak, and we have a weak instrument. This
implies large standard errors, and the IV estimator can be biased, especially if the sample size
is very small.
H0 = x is exogenous
Regress x on z (OLS)
Save the residuals from the first stage regression, call them ^v .
Estimate y=β 0 + β 1 x + δ v^ +u
H0 that x is exogenous is equivalent to H0: \delta = 0.
If we reject H0, there is an endogeneity problem, and we use IV.
The test requires a valid instrument. You should not rely solely on the result of this
endogeneity test in determining if we need to use IV estimation or not.
!!Overidentification
When we have more instruments than endogenous variables, we have overidentification.
When we have more instruments than endogenous variables, we can test whether some of the
instruments are uncorrelated with the error term u in the population. Assume we have two
instruments, z1 and z2 for one endogenous variable x:
We could get two different IV estimates, one using z1 as an instrument for x, and one
using z2 as an instrument for x.
If both instruments are exogenous, they should give similar IV estimates. If the two
IV-estimates are very different, one of the instruments or both, are endogenous and
should not be used as instruments.
Testing whether overidentifying restrictions are exogenous, means comparing different IV
estimates based on using different instruments. With more than two instruments, comparing
many different IV estimates is cumbersome. The intuition for the test is that if all instruments
are exogenous, the residuals from the 2SLS estimation should be uncorrelated with q linear
functions of the instruments, where q is the number of overidentifying restrictions.
Time series
An advantage of time-series is that we can build dynamic models that rely on observations in
previous periods. It also enables us to discuss short-term and long-term effects.
Static model
y t =α + δ 0 z t +u t
A model is static if the change in the explanatory variable gives an immediate change in the
dependent variable.
Multipliers
When the explained variable in a model is dependent on lagged variables x, a permanent
increase in x will lead to the increase being spread over several periods. The long-run (LR)
multiplier shows the aggregate effect of an increase.
\text{the long-run multiplier} = \sum_j \delta_j
The short-run (impact) multiplier is the coefficient on the contemporaneous variable, \delta_0, since this gives the
effect that you see immediately.
The LR multiplier in a finite distributed lag model with a lagged dependent variable:
y_t = \alpha + \delta_0 z_t + \delta_1 z_{t-1} + \delta_2 z_{t-2} + \gamma y_{t-1} + u_t
Assuming we have reached a steady state, where y_t = y_{t-1} = y and z_t = z_{t-1} = z_{t-2} = z.
Then
y(1-\gamma) = \alpha + (\delta_0 + \delta_1 + \delta_2) z + u_t
y = \frac{\alpha}{1-\gamma} + \frac{\delta_0 + \delta_1 + \delta_2}{1-\gamma} z + \frac{1}{1-\gamma} u_t
where \frac{\delta_0 + \delta_1 + \delta_2}{1-\gamma} is the long-run multiplier.
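As a made-up numerical illustration: with \hat\delta_0 = 0.5, \hat\delta_1 = 0.3, \hat\delta_2 = 0.2 and \hat\gamma = 0.5, the short-run effect of a one-unit increase in z is 0.5, while the long-run multiplier is (0.5 + 0.3 + 0.2)/(1 - 0.5) = 2, so y eventually rises by 2 units.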
Adding the normality assumption as well, we get the classical linear model assumptions:
u_t \text{ independent of } X, \text{ and } u_t \sim N(0, \sigma^2)
(The errors u_t are independent of X and are independently and identically distributed as
Normal(0, \sigma^2).)
Under the 6 CLM Assumptions, the OLS estimators are normally distributed, conditional on
X. Further, under the null hypothesis, each t statistic has a t distribution, and each F statistic
has an F distribution. The construction of confidence intervals is also valid. Then everything
we have learned about estimation and inference for cross-sectional regressions will apply
directly to time-series regressions.
Dummy variables:
C_t = \alpha_0 + \delta_0 Y_t + \delta_1 Y_{t-1} + \eta D_t + u_t
D_t = 1 \text{ if } t \geq 2015, \text{ and } 0 \text{ otherwise}
This can for instance be used to check whether consumption is higher (lower) in the years from
2015 onwards than in the years prior to 2015.
\beta_1 = \frac{\Delta \log y_t}{\Delta t} \approx \frac{(y_t - y_{t-1})/y_{t-1}}{\Delta t} = \frac{\text{relative change (growth)}}{\text{time unit}}
Unobserved, trending factors that affect yt might also be correlated with the explanatory
variables. Ignoring this will lead to spurious regression. Including a time trend in the
regression model eliminates the problem.
By taking away the trend (detrend), we are only using variation relative to the trend lines. The
R2 of the detrended series reflects how well the explanatory variable(s) explains the dependent
variable net of the effect of the time trend.
We can de-seasonalize time series as well. When analyzing time series, we should control for
trends and seasonality. If there are underlying common trends that are not taken into account,
the results are biased and can even give the opposite coefficient sign relative to the true effect.
Moreover, the R2 could be exaggeratedly large even though the true R2 might be minimal.
Autocorrelation
Autocorrelation might affect the efficiency of the estimates but should in theory not lead to
biased coefficient estimates. However, for small samples, autocorrelation might have
implications even for the sign of the slope coefficients. With autocorrelation it can be
shown that E(\hat\beta_1) = \beta_1, so the estimator is unbiased, but the usual standard errors of this estimator are no longer
valid. This implies that the least squares estimates are unbiased but inefficient in the
presence of autocorrelation.
One assumption when working with time series and cross-sectional data is that u_i and u_j are
independent for all i \neq j (the error terms don't correlate):
Cov(u_i, u_j) = E\{(u_i - E(u_i))(u_j - E(u_j))\} = E(u_i u_j) = 0
If this assumption is violated, we have correlation in the error terms across periods: autocorrelation.
!!!First-order autoregressive error model (AR(1)-model)
An AR(1)-model is a simple model with a lag in the error term. This model also captures the
dynamic characteristic in time-series data.
y_t = \beta_0 + \beta_1 x_t + u_t, \quad t = 1, 2, \dots, T
u_t = \rho u_{t-1} + e_t
e_t \sim N(0, \sigma_e^2)
E(e_t e_s) = 0 \text{ for all } t \neq s
Quasi-differencing (subtracting \rho times the lagged equation):
y_t - \rho y_{t-1} = \beta_0 - \rho\beta_0 + \beta_1 x_t - \rho\beta_1 x_{t-1} + (u_t - \rho u_{t-1})
y_t - \rho y_{t-1} = \beta_0 (1-\rho) + \beta_1 (x_t - \rho x_{t-1}) + (u_t - \rho u_{t-1})
y_t = \beta_0 (1-\rho) + \beta_1 (x_t - \rho x_{t-1}) + \rho y_{t-1} + (u_t - \rho u_{t-1})
where (u_t - \rho u_{t-1}) = e_t.
A model with a lagged dependent variable (y_{t-1}), a contemporaneous variable (x_t), a lagged
variable (x_{t-1}) and a well-behaved residual e_t (since we assumed E(e_t e_s) = 0 for all t \neq s), such
as this, is an ARDL(p,q) model.
What exactly do I achieve by doing this, what does it mean for autocorrelation, and what do
I do?
Detecting autocorrelation
Durbin Watson Test
d = \frac{\sum_{t=2}^{T} (\hat u_t - \hat u_{t-1})^2}{\sum_{t=2}^{T} \hat u_t^2}
= \frac{\sum \hat u_t^2 + \sum \hat u_{t-1}^2 - 2 \sum \hat u_t \hat u_{t-1}}{\sum \hat u_t^2}
= \frac{\sum \hat u_t^2}{\sum \hat u_t^2} + \frac{\sum \hat u_{t-1}^2}{\sum \hat u_t^2} - 2 \frac{\sum \hat u_t \hat u_{t-1}}{\sum \hat u_t^2}
\approx 1 + 1 - 2 \frac{\sum (\hat\rho \hat u_{t-1} + e_t) \hat u_{t-1}}{\sum \hat u_t^2} = 1 + 1 - 2 \hat\rho \frac{\sum \hat u_{t-1} \hat u_{t-1}}{\sum \hat u_t^2}
\approx 1 + 1 - 2\hat\rho = 2(1 - \hat\rho)
\hat\rho is the sample first-order autocorrelation coefficient of the residuals. Since -1 < \hat\rho < 1, d must satisfy
0 < d < 4 (\hat\rho = 1 gives d = 0, \hat\rho = -1 gives d = 4). The DW test is often inconclusive, meaning that
there are values of d where one cannot say whether autocorrelation is a problem or not. The
DW test also only works for AR(1) processes.
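A sketch of the Durbin-Watson statistic from estimated residuals (resid is an assumed 1-D array of OLS residuals ordered in time), computed both by hand and with statsmodels.

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

def dw(resid):
    d_manual = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
    return d_manual, durbin_watson(resid)   # both roughly 2(1 - rho_hat)
```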
The zero conditional mean (strict exogeneity) assumption implies that the K explanatory variables in period t are
independent of the error term in the same period, but also that the error term is independent of
all explanatory variables from all other periods.
E(u_t \mid X) = 0, \quad t = 1, \dots, T
E(u_t \mid x_{1t}, \dots, x_{Kt}) = 0, \quad t = 1, \dots, T
E(u_t \mid x_{1\tau}, \dots, x_{K\tau}) = 0, \quad \tau \neq t
Anything that causes the unobservables at time t, u_t, to be correlated with any of the explanatory
variables in any time period causes the zero conditional mean assumption to fail. Then
the regressors are not strictly exogenous. It is quite normal in time series that one or more of
the regressors are not strictly exogenous; in the following, a test is presented that does not
require strictly exogenous regressors.
1. Run OLS
y t =β 0+ β1 x 1t +…+ β k x kt +ut
2. Find
u^ t
3. Run OLS
u^ t =δ + ρ u^ t−1 +θ1 x1 t + …+θk x kt + v t
4. t-test for
H0: \rho = 0 (no autocorrelation)
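A sketch of the four-step test above (hypothetical df with columns y, x1, x2, ordered in time): regress the current residual on its lag and the regressors, and t-test \rho = 0.

```python
import statsmodels.formula.api as smf

def ar1_test(df):
    df = df.assign(uhat=smf.ols("y ~ x1 + x2", data=df).fit().resid)    # steps 1-2
    df = df.assign(uhat_lag=df["uhat"].shift(1))
    aux = smf.ols("uhat ~ uhat_lag + x1 + x2", data=df.dropna()).fit()  # step 3
    return aux.tvalues["uhat_lag"], aux.pvalues["uhat_lag"]             # step 4
```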
!Correlogram
This we will come back to in highly persistent time series.
Consequences of autocorrelation
With autocorrelation it can be shown that E(\hat\beta_1) = \beta_1, so the estimator is unbiased, but the usual standard errors of this
estimator are no longer valid. This implies that the least squares estimates are
unbiased but inefficient in the presence of autocorrelation.
We have now estimated a static model, but where we have allowed the shocks u_t to be
serially correlated and follow an AR(1) process. We have estimated y_t = \beta_0 + \beta_1 x_t + u_t where
the shocks u_t are AR(1), u_t = \rho u_{t-1} + e_t, and e_t is white noise, e_t \sim N(0, \sigma_e^2). We have thus found
estimates of \beta_1 and \rho.
y_t = \beta_0 + \beta_1 x_t + u_t, \quad t = 1, 2, \dots, T
u_t = \rho u_{t-1} + e_t
Which may be transformed as follows:
y_t = \beta_0 + \beta_1 x_t + u_t
\rho y_{t-1} = \rho\beta_0 + \rho\beta_1 x_{t-1} + \rho u_{t-1}
Subtracting:
y_t = \beta_0 (1-\rho) + \beta_1 x_t - \rho\beta_1 x_{t-1} + \rho y_{t-1} + (u_t - \rho u_{t-1})
The coefficient of x_{t-1} is a product of two other coefficients. Thus there is a common factor
restriction, since we have here only two deep parameters, \beta_1 and \rho.
STATA: In Stata you will get the coefficients \beta_1 and \rho, making you able to calculate the coefficient
on x_{t-1}, \pi_2 = -\rho\beta_1.
What does one do, what does one look at, and what does it mean to impose restrictions on the
coefficients?
Stationarity
In order to predict a time series, it has to have some attributes that are constant over time.
Thus, time series need to be stationary. A time series is stationary if
E(y_t) = \mu \quad (\text{constant mean})
Var(y_t) = \sigma^2 \quad (\text{constant variance})
Cov(y_t, y_{t-s}) = \gamma_s \quad (\text{covariance depends on } s, \text{ not } t)
When non-stationary time series are used in a regression model the results may indicate a
significant relationship even when there is none, which may lead to a spurious regression.
Thus, if we suspect one or all of the series used in a regression to be non-stationary we should
be careful since we can obtain spurious regression results.
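A simulation sketch (my own illustration) of the spurious regression problem described above: two independent random walks regressed on each other typically produce a "highly significant" slope and a sizeable R^2 even though there is no true relationship.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
T = 500
y = np.cumsum(rng.normal(size=T))   # random walk (one unit root)
x = np.cumsum(rng.normal(size=T))   # independent random walk
res = sm.OLS(y, sm.add_constant(x)).fit()
print(res.tvalues[1], res.rsquared)  # often |t| >> 2 despite no true relation
```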