Mistakes to Avoid and Reporting OLS

1 Make sure that you "clean up" the data

In SPSS, you should convert all the "don't know," "not sure," and similar responses into missing values. The easiest way to do that is to go to the "variable view" and click on the "Values" of each variable to identify the ones that should be coded as missing.
Then click on "Missing" for each variable and type in the numbers that should be coded as missing. An example of the impact of "not" cleaning up the data.
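The same clean-up step can be sketched outside SPSS. In this minimal Python example, the sentinel codes 8 and 9 standing for "don't know"/"not sure" are hypothetical, as are the data:

```python
import math

# Hypothetical sentinel codes meaning "don't know" / "not sure" in this survey.
MISSING_CODES = {8, 9}

def clean(values):
    """Recode sentinel responses as NaN so they drop out of the analysis."""
    return [float("nan") if v in MISSING_CODES else float(v) for v in values]

income = [23, 41, 9, 35, 8, 52]        # raw responses; the 9 and 8 are codes
cleaned = clean(income)
valid = [v for v in cleaned if not math.isnan(v)]

print(sum(income) / len(income))       # "dirty" mean, dragged down by the codes
print(sum(valid) / len(valid))         # mean over real responses only
```

The two means differ noticeably, which is exactly the distortion that uncleaned sentinel codes introduce into every downstream estimate.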
2 What is OLS regression?

The estimated regression equation is:

Y = ß_{0} + ß_{1}X_{1} + ß_{2}X_{2} + ß_{3}D + ê

where the ßs are the OLS estimates of the Bs. OLS minimizes the sum of the squared residuals:

minimize SUM ê^{2}

The residual, ê, is the difference between the actual Y and the predicted Y, and has a zero mean. In other words, OLS calculates the slope coefficients so that the sum of squared differences between the predicted Ys and the actual Ys is as small as possible. (The residuals are squared so that negative and positive errors do not cancel each other out.) The OLS estimates of the ßs:
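A numerical sketch of this minimization, using simulated data and numpy's least-squares solver (all variable names and values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated data: two continuous regressors, one dummy, and noise.
X1 = rng.normal(10, 2, n)
X2 = rng.normal(5, 1, n)
D = (rng.random(n) < 0.5).astype(float)
Y = 2.0 + 1.5 * X1 - 0.8 * X2 + 3.0 * D + rng.normal(0, 1, n)

# Design matrix with a leading column of ones for the intercept ß0.
X = np.column_stack([np.ones(n), X1, X2, D])

# beta minimizes SUM(e_hat^2), the sum of squared residuals.
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
e_hat = Y - X @ beta

print(beta)            # estimates of ß0, ß1, ß2, ß3
print(e_hat.mean())    # essentially zero: residuals have zero mean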
Statistical computing packages such as SPSS routinely print out the estimated ßs when estimating a regression equation (see ols1.txt).

3 What to do with OLS Regression?

Multiple regression has three major uses.

1. A description or model of reality. Instead of an abstract model EXPEND = f(INCOME, AGE), where EXPEND (vacation expenditures) increases with INCOME (income in thousands) and decreases with AGE (the age of the tourist), we get a more descriptive picture of reality, such as:

EXPEND = 100 + 30 × INCOME - 10 × AGE

where we now know that for every unit that INCOME increases, EXPEND increases by $30, and for every unit that AGE increases, EXPEND decreases by $10.

2. The testing of hypotheses about theory. Given test statistics on the numbers above, we can determine whether they are "statistically significant." Statistical significance indicates the confidence we can place in the quantitative regression results. For example, it is important to know whether there is a 5% or a 50% chance that the true effect of INCOME on EXPEND is zero.

3. Predictions about the future. Suppose we want to predict what will happen to EXPEND if INCOME increases by 10%. If average income is $30 (in thousands), a 10% increase is 3 units; plugging the change into the model

EXPEND = 100 + 30 × INCOME - 10 × AGE

we predict that EXPEND will increase by 30 × 3 = $90 if INCOME increases by $3,000, holding the age of the tourist constant.

These uses do not differ from simple regression, but the results are less likely to be misleading because confounding effects are controlled for.

4 How to evaluate the OLS model?

Evaluating the overall performance of the model

We hope that our regression models will explain the variation in the dependent variable fairly accurately. If a model does, we say that "the model fits the data well." Evaluating the overall fit of the model also helps us compare models that differ in data set, composition and number of independent variables, etc. There are three primary statistics for evaluating overall fit.

1. R^{2}. The coefficient of determination, R^{2}, is the ratio of the explained sum of squares (ESS) to the total sum of squares (TSS):

R^{2} = ESS/TSS = SUM([Y - ê] - µ_{Y})^{2} / SUM(Y - µ_{Y})^{2}

where ESS is the sum of squared differences between the predicted Ys (Y - ê) and the mean of Y (µ_{Y}, a naive estimate of Y), and TSS is the sum of squared differences between the actual Ys and the mean of Y. The R^{2} ranges from 0 to 1 and can be interpreted as the percentage of the variance in the dependent variable that is explained by the independent variables.

2. Adjusted R^{2}. Adding a variable to a multiple regression equation virtually guarantees that the R^{2} will increase (even if the variable is not very meaningful). The adjusted R^{2} statistic is the same as the R^{2} except that it takes into account the number of independent variables (k). The adjusted R^{2} will increase, decrease, or stay the same when a variable is added to an equation, depending on whether the improvement in fit (ESS) outweighs the loss of degrees of freedom (n - k - 1):

adjusted R^{2} = 1 - (1 - R^{2}) × [(n - 1)/(n - k - 1)]

The adjusted R^{2} is most useful when comparing regression models with different numbers of independent variables.

3. F-stat. The F statistic is the ratio of the explained to the unexplained portion of the total sum of squares (RSS = SUM ê^{2}), adjusted for the number of independent variables (k) and the degrees of freedom (n - k - 1):

F = [ESS/k] / [RSS/(n - k - 1)]

The F statistic allows the researcher to determine whether the model as a whole is statistically significantly different from zero. Statistical computing packages such as SPSS routinely print out these statistics (see ols2.txt).

What is a "good" overall fit? It depends. Cross-sectional data will often produce R^{2}s that seem quite low; R^{2} = .07 might be good for some types of data, while for others it might be very, very bad.
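The three fit statistics can be computed directly from the sums of squares; here is a minimal Python sketch with simulated data (the variable names and values are illustrative, not from any real data set):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 100, 2                         # n observations, k independent variables

# Simulated data with two regressors and an intercept.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
e_hat = Y - X @ beta

TSS = np.sum((Y - Y.mean()) ** 2)     # total sum of squares
RSS = np.sum(e_hat ** 2)              # residual (unexplained) sum of squares
ESS = TSS - RSS                       # explained sum of squares

R2 = ESS / TSS
adj_R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)
F = (ESS / k) / (RSS / (n - k - 1))

print(R2, adj_R2, F)
```

Note that the adjusted R^{2} is always at or below the plain R^{2}: the (n - 1)/(n - k - 1) factor is the penalty for spending degrees of freedom on extra regressors.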
The adjusted R^{2}, F-stat, and hypothesis tests of independent variables are all important determinants of model fit.

5 Conservative Criteria for Hypothesis Testing

Because most data sets are samples from a population, we worry whether our ßs actually matter when explaining variation in the dependent variable. The null hypothesis states that X is not associated with Y, and therefore that ß equals zero; the alternative hypothesis states that X is associated with Y, and therefore that ß does not equal zero. The t-statistic is equal to ß divided by the standard error of ß (s.e., a measure of the dispersion of ß):

t = ß/s.e.

A (very) rough guide to testing hypotheses might be: "t-statistics above 2 are good." Also check your t-tables and significance (confidence) levels. Statistical computing packages such as SPSS routinely print out the standard errors, t-stats, and confidence levels (the probability that ß is not zero) when estimating a regression equation (see ols3.txt).

6 Some problems and solutions

Specification Bias

How do you choose which variables to include in your model?
Specification searches are sometimes called "data mining" (see specbias.txt for an example).

7 Violation of Assumptions

There are several assumptions that must be met for the OLS estimates to be unbiased and have minimum variance. Two of the most easily violated assumptions are:
Other important assumptions which may be violated with certain types of data are:
Violations of these assumptions are less likely to occur with many types of data, so we'll leave their discussion to the extensions section.

8 Extensions of the Model
Interaction terms

Interaction terms are combinations of independent variables. For example, if you think that men earn more per year of work experience than women do, then include an interaction term [multiply the male dummy (D) by the experience variable] along with the experience variable and the male dummy.

Logistic regression

There are many important research topics for which the dependent variable is qualitative. Researchers often want to predict whether something will happen or not, such as referendum votes, business failure, or disease: anything that can be expressed as Event/Non-event or Yes/No. Logistic regression is a type of regression analysis in which the dependent variable is dichotomous and coded 0, 1.
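The interaction-term idea above can be sketched numerically. In this hypothetical earnings example, everyone gains $2 per year of experience and men gain an extra $1 per year; all data are simulated:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300

exper = rng.uniform(0, 20, n)                 # years of work experience
male = (rng.random(n) < 0.5).astype(float)    # dummy D: 1 = male

# Simulated earnings: $2 per year of experience for everyone,
# plus an extra $1 per year for men (the interaction effect).
wage = 10 + 2.0 * exper + 1.0 * male * exper + rng.normal(0, 2, n)

# Design matrix: intercept, experience, male dummy, and D × experience.
X = np.column_stack([np.ones(n), exper, male, male * exper])
beta, *_ = np.linalg.lstsq(X, wage, rcond=None)

print(beta)   # beta[3] estimates the extra return to experience for men
```

The interaction column is literally the elementwise product of the dummy and the experience variable, which is why both constituent variables should also stay in the equation.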
Two-stage least squares

Two-stage least squares is one solution to endogeneity bias, which arises when an independent variable is correlated with the error term (e.g., in a model of the number of years of education chosen, the number of children might be chosen at the same time). Two-stage least squares would first predict the number of children (based on other independent variables) and then use the prediction as the independent variable in the education model.
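A minimal sketch of the two-stage procedure, assuming a hypothetical instrument z that shifts the number of children but does not affect education directly (all data and coefficients are simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000

def ols(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# u is an unobserved trait that drives BOTH choices, creating endogeneity.
u = rng.normal(size=n)
z = rng.normal(size=n)                          # hypothetical instrument
children = 2 + 0.8 * z + u + rng.normal(size=n)
educ = 12 - 1.0 * children + 2.0 * u + rng.normal(size=n)   # true effect: -1

# Naive OLS is biased: `children` is correlated with the error term (via u).
naive = ols(np.column_stack([np.ones(n), children]), educ)

# Stage 1: predict the number of children from the instrument.
g = ols(np.column_stack([np.ones(n), z]), children)
children_hat = g[0] + g[1] * z

# Stage 2: regress education on the *predicted* number of children.
tsls = ols(np.column_stack([np.ones(n), children_hat]), educ)

print(naive[1], tsls[1])   # naive slope is badly biased; 2SLS is near -1
```

The fitted values from stage 1 carry only the variation induced by the instrument, which is uncorrelated with the error term; that is what purges the bias in stage 2.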
Time series

If you have yearly, quarterly, or monthly data, then the ordering of the observations matters (with cross-section data it doesn't matter whether Sally comes before or after Jane). For example, regression models of monthly car sales might include monthly and lagged monthly advertisements. Standard time-series models should be used to account for the correlation of the lagged advertisements.

How should OLS findings be reported in political science papers, articles, and reports?
