Mistakes to Avoid and Reporting OLS

1- Make sure that you "clean up" the data

In SPSS, you should convert all "don't know," "not sure," and similar responses into missing values. The easiest way to do that is to go to the "Variable View" and click on the "Values" cell of each variable to identify the codes that should be treated as missing.


Then click on "Missing" for each variable and type in the numbers that should be coded as missing.

An example of the impact of "not" cleaning up the data.
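A minimal sketch of that impact, using made-up data: the 1-5 scale, the codes 8 ("don't know") and 9 ("not sure"), and the responses are all assumptions for illustration, not values from any real survey.

```python
from statistics import mean

# Hypothetical 1-5 agreement scale; 8 = "don't know", 9 = "not sure" (assumed codes).
MISSING_CODES = {8, 9}
responses = [1, 2, 2, 3, 9, 4, 8, 5, 9, 3]

# Leaving the codes in treats 8 and 9 as if they were very strong agreement.
raw_mean = mean(responses)

# Cleaning: drop the codes, which is what declaring them "missing" does in SPSS.
cleaned = [r for r in responses if r not in MISSING_CODES]
clean_mean = mean(cleaned)

print(f"mean with missing codes left in: {raw_mean:.2f}")   # 4.60
print(f"mean after recoding to missing:  {clean_mean:.2f}")  # 2.86
```

With three of the ten answers miscoded, the average moves from below the scale midpoint to near its top, and any regression using this variable would be distorted in the same way.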



2- What is OLS regression?

The estimated regression equation is:

$Y = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \hat{\beta}_3 D + e$

where the $\hat{\beta}$s are the OLS estimates of the true $\beta$s and $e$ is the residual. OLS minimizes the sum of the squared residuals:

OLS minimizes $\sum e_i^2$

The residual, $e$, is the difference between the actual Y and the predicted Y and has a zero mean. In other words, OLS calculates the slope coefficients so that the squared difference between the predicted Y and the actual Y is as small as possible. (The residuals are squared so that negative and positive errors count equally and do not cancel out.)

The OLS estimates of the $\beta$s:

  • are unbiased - the $\hat{\beta}$s are centered around the true population values of the $\beta$s
  • have minimum variance - the distributions of the estimates around the true $\beta$s are as tight as possible
  • are consistent - as the sample size (n) approaches infinity, the estimated $\hat{\beta}$s converge on the true $\beta$s
  • are normally distributed - statistical tests based on the normal distribution can be applied to these estimates.

Statistical computing packages such as SPSS routinely print out the estimated $\hat{\beta}$s when estimating a regression equation (e.g., ols1.txt).
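To make the minimization concrete, here is a rough sketch of what a package does behind the scenes, written with the Python standard library only. The data, the helper names (solve, ols, ssr) and the use of the normal equations are assumptions for illustration; real packages use more careful numerical methods.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gauss-Jordan elimination."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def ols(X, y):
    """Return the coefficients that solve the normal equations (X'X) b = X'y."""
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)


# Hypothetical data: two continuous regressors and a 0/1 dummy D.
X1 = [1, 2, 3, 4, 5, 6, 7, 8]
X2 = [2, 1, 4, 3, 6, 5, 8, 7]
D = [0, 1, 0, 1, 1, 0, 0, 1]
Y = [5.1, 9.8, 6.2, 12.1, 12.9, 9.7, 11.2, 17.0]

X = [[1.0, x1, x2, d] for x1, x2, d in zip(X1, X2, D)]  # leading 1.0 = constant
b_hat = ols(X, Y)


def ssr(coefs):
    """Sum of squared residuals for a given set of coefficients."""
    return sum((y - sum(c * x for c, x in zip(coefs, row))) ** 2
               for row, y in zip(X, Y))


residuals = [y - sum(c * x for c, x in zip(b_hat, row)) for row, y in zip(X, Y)]

# Nudging any coefficient away from the OLS value makes the fit worse.
perturbed = b_hat[:]
perturbed[1] += 0.1

print("estimates:", [round(b, 3) for b in b_hat])
print("sum of residuals:", round(sum(residuals), 6))
print("SSR at OLS estimates:", round(ssr(b_hat), 4))
print("SSR after nudging one slope:", round(ssr(perturbed), 4))
```

The residuals sum to zero (up to rounding) because the model includes a constant, and the nudged coefficients always give a larger sum of squared residuals.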

3- What to do with OLS Regression?

Multiple regression has three major uses:

1. A description or model of reality

Instead of an abstract model

EXPEND = f(INCOME, AGE)

where EXPEND (vacation expenditures) increases with INCOME (income in thousands) and decreases with AGE (the age of the tourist), we get a more descriptive picture of reality, such as:

EXPEND = 100 + 30 INCOME - 10 AGE

where we now know that for every unit ($1,000) that INCOME increases, EXPEND increases by $30, holding AGE constant, and for every year that AGE increases, EXPEND decreases by $10, holding INCOME constant.

2. The testing of hypotheses about theory

Given test statistics on the numbers above, we can determine if these are "statistically significant." Statistical significance indicates the confidence we can place in the quantitative regression results. For example, it is important to know whether an estimate this large would turn up by chance 5% or 50% of the time if the true effect of INCOME on EXPEND were zero.

3. Predictions about the future

Suppose we want to predict what will happen to EXPEND if INCOME increases by 10%. If average income is $30,000 (INCOME = 30), a 10% increase is $3,000, so simply plug a change in INCOME of 3 into the model:

change in EXPEND = 30 (change in INCOME = 3) = 90

and predict that EXPEND will increase by $90 if INCOME increases by $3,000, holding the age of the tourist constant.
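The same arithmetic can be sketched in a few lines of Python. The coefficients come from the example equation EXPEND = 100 + 30 INCOME - 10 AGE; the function name and the tourist's age of 40 are assumptions chosen only for illustration.

```python
def predict_expend(income_thousands, age):
    """Predicted vacation expenditure from EXPEND = 100 + 30 INCOME - 10 AGE."""
    return 100 + 30 * income_thousands - 10 * age


AGE = 40  # hypothetical tourist age, held constant

base = predict_expend(30, AGE)    # income of $30,000
higher = predict_expend(33, AGE)  # income 10% higher ($33,000)

print("predicted EXPEND at INCOME = 30:", base)    # 600
print("predicted EXPEND at INCOME = 33:", higher)  # 690
print("predicted change:", higher - base)          # 90
```

Because the model is linear, the predicted change of $90 is the same whatever age is held constant.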

These uses are the same as for simple regression, but the results are less likely to be misleading because of confounding effects.

4- How to evaluate the OLS model?

Evaluating the overall performance of the model

We hope that our regression model will explain the variation in the dependent variable fairly accurately. If it does, we say that "the model fits the data well." Evaluating the overall fit also helps us to compare models that differ in the data set used or in the composition and number of independent variables.

There are three primary statistics for evaluating overall fit:

1. R2

The coefficient of determination, R2, is the ratio of the explained sum of squares (ESS) to the total sum of squares (TSS):

$R^2 = ESS/TSS = \sum (\hat{Y}_i - \bar{Y})^2 / \sum (Y_i - \bar{Y})^2$

where ESS is the summation of the squared values of the difference between the predicted Ys ($\hat{Y}$) and the mean of Y ($\bar{Y}$, a naive estimate of Y) and TSS is the summation of the squared values of the difference between the actual Ys and the mean of Y.

The R2 ranges from 0 to 1 and can be interpreted as the proportion (or, multiplied by 100, the percentage) of the variation in the dependent variable that is explained by the independent variables.

2. Adjusted R2

Adding a variable to a multiple regression equation virtually guarantees that the R2 will increase (even if the variable is not very meaningful). The adjusted R2 statistic is the same as the R2 except that it takes into account the number of independent variables (k). The adjusted R2 will increase, decrease or stay the same when a variable is added to an equation depending on whether the improvement in fit (ESS) outweighs the loss of the degree of freedom (n-k-1):

adjusted R2 = 1 - (1 - R2) [(n - 1)/(n - k - 1)]

The adjusted R2 is most useful when comparing regression models with different numbers of independent variables.

3. F-stat

The F statistic is the ratio of the explained (ESS) to the unexplained portion of the total sum of squares (the residual sum of squares, RSS = $\sum e_i^2$, where $e$ is the residual), adjusted for the number of independent variables (k) and the degrees of freedom (n-k-1):

F = [ESS/k] / [RSS/(n - k - 1)]

The F statistic allows the researcher to determine whether the model as a whole is statistically significant, that is, whether the slope coefficients are jointly different from zero.
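The three fit statistics can be computed by hand for a tiny example. The sketch below assumes a single independent variable (k = 1) so that the OLS line has a simple closed form; the five data points are made up for illustration.

```python
# Hypothetical data with one independent variable (k = 1).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n, k = len(y), 1
x_bar, y_bar = sum(x) / n, sum(y) / n

# Closed-form OLS estimates for a one-variable regression.
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar
y_hat = [intercept + slope * xi for xi in x]

ess = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained sum of squares
rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # residual sum of squares
tss = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares

r2 = ess / tss
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
f_stat = (ess / k) / (rss / (n - k - 1))

print(f"R2 = {r2:.3f}")               # 0.600
print(f"adjusted R2 = {adj_r2:.3f}")  # 0.467
print(f"F = {f_stat:.2f}")            # 4.50
```

Note that ESS + RSS = TSS, and that the adjusted R2 is noticeably lower than the R2 because the sample is so small.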

Statistical computing packages such as SPSS routinely print out these statistics (e.g., ols2.txt).

What is a 'good' overall fit? It depends. Cross-sectional data will often produce R2s that seem quite low; R2=.07 might be good for some types of data while for others it might be very, very bad. The adjusted R2, F-stat, and hypothesis tests of independent variables are all important indicators of model fit.

5- Conservative Criteria for Hypothesis Testing

Because most data consists of samples from the population, we worry whether our $\hat{\beta}$s actually matter when explaining variation in the dependent variable.

The null hypothesis states that X is not associated with Y, therefore the $\beta$ is equal to zero; the alternative hypothesis states that X is associated with Y, therefore the $\beta$ is not equal to zero.

The t-statistic is equal to the $\hat{\beta}$ divided by the standard error of $\hat{\beta}$ (s.e., a measure of the dispersion of the $\hat{\beta}$):

$t = \hat{\beta} / \text{s.e.}(\hat{\beta})$

A (very) rough guide to testing hypotheses might be: "t-statistics above 2 in absolute value are good." Always check your t-tables and significance levels as well, because the exact critical value depends on the degrees of freedom.

Statistical computing packages such as SPSS routinely print out the standard errors, t-stats, and significance levels (p-values: the probability of seeing a t-stat at least this large if the true $\beta$ were zero) when estimating a regression equation (e.g., ols3.txt).
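As an illustration of why the rule of thumb is only rough, the sketch below computes a t-statistic by hand for a one-variable regression on five made-up observations. The data are hypothetical; the critical value 3.182 is the standard two-sided 5% value from a t-table with 3 degrees of freedom.

```python
import math

# Hypothetical data with one independent variable.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(y)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
intercept = y_bar - slope * x_bar

# Residual variance uses n - 2 degrees of freedom (n - k - 1 with k = 1).
rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
se_slope = math.sqrt((rss / (n - 2)) / sxx)

t_stat = slope / se_slope
T_CRITICAL_5PCT_DF3 = 3.182  # from a standard t-table, two-sided, 3 d.f.

print(f"slope = {slope:.2f}, s.e. = {se_slope:.4f}, t = {t_stat:.2f}")
print("significant at 5%?", abs(t_stat) > T_CRITICAL_5PCT_DF3)
```

Here the t-statistic is above 2 (about 2.12) but the slope is still not significant at the 5% level, because with only 3 degrees of freedom the critical value is well above 2.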

6- Some problems and solutions

Specification Bias

How do you choose which variables to include in your model?

Problem | Detection | Consequences | Correction
Omitted variable | On the basis of theory, significant unexpected signs or poor model fit | The estimated coefficients are biased and inconsistent | Include the left out variable or a proxy
Irrelevant variable | Theory, t-test, effect on the other coefficients and adjusted R2 if the irrelevant variable is dropped | Lower adjusted R2, higher s.e.s, and lower t-stats | Delete the variable from the model if it is not required by theory
Functional form | Reconsider the underlying relationship between Y and the Xs | Biased and inconsistent coefficients, poor overall fit | Transform the variable or the equation to a different functional form

Specification searches are sometimes called "data mining" (see specbias.txt for an example).
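The bias from an omitted variable can be seen in a small constructed example. In the sketch below Y is built, without noise, as 1 + 2 X1 + 3 X2, and X2 is then left out of the regression; the data and the helper name slope_of are assumptions for illustration only.

```python
def slope_of(x, y):
    """Slope from a one-variable OLS regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical regressors: X2 tends to rise with X1, so they are correlated.
X1 = [1, 2, 3, 4, 5, 6]
X2 = [1, 1, 2, 2, 3, 3]

TRUE_B1, TRUE_B2 = 2, 3
Y = [1 + TRUE_B1 * x1 + TRUE_B2 * x2 for x1, x2 in zip(X1, X2)]

short_slope = slope_of(X1, Y)  # regression that omits X2
delta = slope_of(X1, X2)       # how strongly X2 moves with X1

print(f"true effect of X1:         {TRUE_B1}")
print(f"estimate with X2 omitted:  {short_slope:.3f}")
print(f"true effect + bias term:   {TRUE_B1 + TRUE_B2 * delta:.3f}")
```

The coefficient on X1 picks up part of the effect of the omitted X2: the estimate equals the true effect plus the effect of X2 times the slope of X2 on X1, so it is biased whenever the two variables are correlated.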

7- Violation of Assumptions

There are several assumptions which must be met for the OLS estimates to be unbiased and have minimum variance. Two of the most easily violated assumptions are:

  • No explanatory variable is a perfect linear function of other explanatory variables, or no perfect multicollinearity.
  • The error term has a constant variance. A nonconstant variance is called heteroskedasticity.

Violation | Detection | Consequences | Correction
Multicollinearity (see multicol.txt for an example) | Check to see if the adjusted R2 is high while the t-stats are low; check to see if the correlation coefficients are high | The estimated coefficients are not biased but the t-stats will fall | Drop one of the problematic variables, combine problematic variables as interactions
Heteroskedasticity (see hetero.txt for the Park test) | Plot the residuals against the Xs and look for spread or contraction; use some standard tests | The estimated coefficients are not biased but the t-stats will be misleading | Redefine the variables (e.g., in % terms) or weight the data
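One of the standard tests for heteroskedasticity, the Park test, can be sketched as follows: regress the log of the squared OLS residuals on the log of X and look at the slope and its t-stat. The data below are made up so that the spread of Y grows with X, and the helper name simple_ols is an assumption for illustration.

```python
import math


def simple_ols(x, y):
    """Return (intercept, slope) from a one-variable OLS regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical data: the scatter around the line widens as x grows.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1.5, 3.2, 7.8, 6.0, 11.5, 8.4, 17.5, 13.6, 21.6, 15.0]

# Step 1: ordinary regression and its residuals.
a, b = simple_ols(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Step 2: Park regression of ln(e^2) on ln(x).
log_x = [math.log(xi) for xi in x]
log_e2 = [math.log(e ** 2) for e in residuals]
park_a, park_slope = simple_ols(log_x, log_e2)

# Step 3: t-stat on the Park slope (n - 2 degrees of freedom).
n = len(x)
lx_bar = sum(log_x) / n
park_rss = sum((v - (park_a + park_slope * lx)) ** 2
               for lx, v in zip(log_x, log_e2))
park_se = math.sqrt((park_rss / (n - 2)) / sum((lx - lx_bar) ** 2 for lx in log_x))
park_t = park_slope / park_se

print(f"Park slope = {park_slope:.3f}, t = {park_t:.2f}")
```

A Park slope that is clearly different from zero suggests that the residual variance changes with X; as with any t-test, compare the t-stat with the appropriate critical value.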

Other important assumptions which may be violated with certain types of data are:

  • All explanatory variables are uncorrelated with the error term. Correlation, which arises for example with simultaneous equations, leads to endogeneity bias.
  • The error term from one observation is independent of the error terms from other observations. Dependence leads to autocorrelation: a problem with time-series data (e.g., this year's error term depends on last year's error term).

Violations of these assumptions are less likely to occur with many types of data so we'll leave their discussion to the extensions section.

8- Extensions of the Model

  • Interaction effects
Interaction terms are products of independent variables. For example, if you think that men earn more per year of work experience than women do, then include an interaction term [multiply the male dummy (D) by the experience variable] along with the experience variable and male dummy.
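A small sketch of the experience example, using noise-free made-up earnings so that the answer is known in advance (women: 20 + 1.0 per year of experience; men: 22 + 1.5 per year). The data, the units and the helper names solve and ols are assumptions for illustration.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gauss-Jordan elimination."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def ols(X, y):
    """Return the coefficients that solve the normal equations (X'X) b = X'y."""
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)


# Hypothetical workers: (years of experience, male dummy).
workers = [(1, 0), (3, 0), (5, 0), (7, 0), (9, 0),
           (2, 1), (4, 1), (6, 1), (8, 1), (10, 1)]
earnings = [22 + 1.5 * e if male else 20 + 1.0 * e for e, male in workers]

# Columns: constant, experience, male dummy, interaction (experience x male).
X = [[1.0, e, male, e * male] for e, male in workers]
b_const, b_exper, b_male, b_interact = ols(X, earnings)

print(f"return to experience, women: {b_exper:.2f}")
print(f"return to experience, men:   {b_exper + b_interact:.2f}")
```

The coefficient on the interaction term is the extra return to a year of experience for men, so the slope for men is the experience coefficient plus the interaction coefficient.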
  • Logistic regression
There are many important research topics for which the dependent variable is qualitative. Researchers often want to predict whether something will happen or not, such as referendum votes, business failure, disease, or anything else that can be expressed as Event/Nonevent or Yes/No. Logistic regression is a type of regression analysis where the dependent variable is dichotomous and coded 0, 1.
  • Two-stage least squares
This is one solution to endogeneity bias, which arises when an independent variable is correlated with the error term (e.g., in a model of the number of years of education chosen, the number of children might be chosen at the same time). Two-stage least squares would first predict the number of children (based on other independent variables) and then use the prediction as the independent variable in the education model.
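The two stages can be sketched with one-variable regressions. Everything here is hypothetical: the variable z stands in for the "other independent variables" used to predict the number of children, and the eight observations are made up for illustration.

```python
def simple_ols(x, y):
    """Return (intercept, slope) from a one-variable OLS regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical data (assumed for illustration only).
z = [0, 1, 0, 1, 1, 0, 1, 0]                   # predictor of children
children = [1, 3, 2, 3, 4, 1, 2, 2]            # endogenous variable
education = [16, 12, 14, 13, 11, 16, 14, 13]   # years of education

# Stage 1: predict the number of children from z.
a1, d1 = simple_ols(z, children)
children_hat = [a1 + d1 * zi for zi in z]

# Stage 2: use the prediction in place of the actual number of children.
a2, slope_2sls = simple_ols(children_hat, education)

# For comparison: plain OLS using the actual number of children.
_, slope_ols = simple_ols(children, education)

print(f"OLS slope:  {slope_ols:.3f}")
print(f"2SLS slope: {slope_2sls:.3f}")  # -1.500
```

This sketch only reproduces the point estimate; the standard errors from a hand-run second stage are not correct, which is one reason to use a package's built-in two-stage least squares routine.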
  • Time series analysis, and forecasting
If you have yearly, quarterly, or monthly data then the ordering of the observations matters (with cross-section data it doesn't matter if Sally comes before or after Jane). For example, regression models of monthly car sales might include monthly and lagged monthly advertisements. Some standard time-series models should be used to account for the correlation of the lagged advertisements.
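As a small sketch of the data preparation this implies, the lines below build the lagged advertising variable for the car sales example. The monthly figures are made up for illustration.

```python
# Hypothetical monthly series, in time order.
sales = [120, 135, 150, 160, 155, 170]
ads = [10, 12, 15, 14, 13, 16]

# Each row pairs this month's sales with this month's and last month's ads.
# The first month is dropped because it has no previous month to lag from.
rows = [(sales[t], ads[t], ads[t - 1]) for t in range(1, len(sales))]

for sales_t, ads_t, ads_lag in rows:
    print(f"sales={sales_t}  ads={ads_t}  ads_lag1={ads_lag}")
```

Shuffling the observations before building the lag would pair each month with the wrong "previous" month, which is why the ordering matters here and not with cross-section data.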

How to report OLS findings in political science articles and reports?