Simple and Multiple Regressions 

Regression procedures are like correlation in that both are concerned with relationships among variables. Correlation analyses serve as a building block for regression procedures. There are two purposes of regression procedures: prediction and explanation.


Review Correlation (from the work of Lisa Howley)

Reflects the degree to which relative positions on X match up with relative positions on Y.  Ranges from -1 to +1, regardless of the underlying scales.
Perfect positive linear relationship = +1
Perfect negative linear relationship = -1
The sign of r has nothing to do with its strength, only its direction. In other words, positive and negative signs do not denote strength!
 
How strong is a good correlation?  It depends on the variables in question. However, a good rule of thumb is below:
 
.90 to 1.0 Very strong relationship
.70 to .89 Strong relationship
.50 to .69 Moderate relationship
.26 to .49  Weak relationship
.0 to .25 Little or no relationship

 

#1 Correlation does NOT imply causation!!!
Possible Explanations:
1. X causes Y
2. Y causes X
3. A third factor, or multiple extraneous factors, cause both X and Y
 
#2 Statistical significance does not imply strength of association.  Generally speaking, the larger the sample size, the greater the likelihood the results will be significant.
 
#3 To the extent the relationship departs from linearity, r will underestimate the relationship. Do not confuse absence of linear association, or weak r, with the absence of association! 
 
#4  Outliers will also affect the strength of r.  The direction will depend on where the outlier is located in relation to the data. Outliers should be investigated.
 
#5  Restriction in range will result in a lower r.  For example, if you were to compare scores on an achievement test with a measure of creativity in a classroom of gifted students, the r would be lower than it would be if a more diverse student body was studied.  Generally speaking, the more variability in scores, the stronger the correlation.
 
#6 The correlation coefficient is not a proportion! 
 
Another very useful measure related to the correlation is the coefficient of determination (COD or r2).  This statistic is used to interpret the meaningfulness of the correlation coefficient.  It represents the proportion of variance shared by the two variables. 
 
Conceptually, the COD asks “What percentage of the total information in one variable can the other variable explain or predict?”  The coefficient of non-determination asks, “What percentage of the total information can the other variable not explain or predict?” 
In other words, if r is .50, then 25% of the information is explained by the other variable.  If a correlation is perfect (r = 1.00), then all of the information in one variable can be explained by the second. 
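A quick numerical sketch of these ideas in Python (the paired scores are made up purely to illustrate the arithmetic; any statistics package reports the same quantities):

    import numpy as np

    # Hypothetical paired scores on two variables, X and Y.
    x = np.array([2, 4, 5, 7, 8, 10], dtype=float)
    y = np.array([50, 54, 55, 60, 59, 66], dtype=float)

    # Pearson correlation coefficient.
    r = np.corrcoef(x, y)[0, 1]

    # Coefficient of determination (shared variance) and non-determination.
    r_squared = r ** 2
    print(f"r = {r:.2f}")                               # direction and strength
    print(f"r2 = {r_squared:.0%} explained")            # coefficient of determination
    print(f"1 - r2 = {1 - r_squared:.0%} unexplained")  # coefficient of non-determination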

 


Regression (from the work of Lisa Howley)

 

Some major differences between correlation and regression include:

1. The purpose of correlation research is to investigate the relationship between variables. The purpose of regression is to predict or explain an outcome from one or more other variables.

2. In correlation research, the variables are interchangeable: there are no independent or dependent variables. Regression makes this distinction between IV (predictor) variables and DV (outcome) variables.

3. Regression research involves multiple inferential statistical tests, whereas correlation involves a single test (H0: rho = 0).
In order to develop a regression equation, you first need scores on both variables from a group of similar individuals. The information from this group’s data will allow us to make predictions for other individuals.

Terminology:

Independent or ‘Predictor’ variable.  Refers to the variable used as a predictor.  Denoted by the letter X.
Dependent or ‘Outcome’ variable.  Refers to the predicted variable and is denoted by the letter Y.  Y ‘depends’ on X. 
 
Assume that researchers are interested in predicting a patient’s level of depression (Y) from the patient’s loneliness score (X).
 
The regression equation is used to determine the line of ‘best fit’ through a scatterplot of the two variables involved. 
This hypothetical line is called the regression line or line of best fit.  It provides our best guess for predicting Y from X. 
The position of the line is determined by the slope (angle & direction) and the intercept (where the line intersects the Y axis).
 
The distance between each individual data point and the regression line is the amount of error in our prediction.  The amount of error is a direct reflection of the r between X and Y.  Without a correlation between two variables there can be no meaningful prediction from one to the other.
 
If the correlation was perfect, we would have 0 error in our prediction.

When regression procedures are used for prediction, the researcher is trying to produce an equation that will predict a score (aka, the dependent, criterion, or outcome variable) based on other variables (independent, predictor, or explanatory variables). An example of when regression procedures are used for prediction is when a university wants to predict the success of its students on the academic curriculum (college GPA) based on previous information (e.g., high school GPA, class rank, SAT, etc.). In this example, college GPA would be the criterion variable and the previous information would be the predictor variables.

Regression procedures can also be used to explain why individuals differ on some particular variable. For example, a researcher may want to know why elementary students vary in math achievement. The researcher may hypothesize that the reasons for the differences could be parental value of education, attendance, IQ, etc. In this case, math achievement would be the outcome variable and the variables that the researcher hypothesized would be the explanatory variables.

Our goal in regression is to build an equation where the dependent variable is equal to the weighted combination of the independent variables.

Example from the Schools

Prediction Formulas for End-of-Course Performance (March 24, 2000)

Predicted School Algebra I Mean = 60.4 + [0.88 x (Math - 176.10)]

Predicted School Algebra II Mean = 59.3 + [0.43 x (Reading - 164.7)] + [0.89 x (Alg - 60.0)]

What is the correlation between the predictors and the criterion? What is R2?
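The formulas can be applied directly once a school’s predictor means are known. A minimal sketch in Python (the input means below are hypothetical, chosen only to show the substitution):

    def predicted_algebra1_mean(math_mean: float) -> float:
        """Algebra I prediction formula from the example above."""
        return 60.4 + 0.88 * (math_mean - 176.10)

    def predicted_algebra2_mean(reading_mean: float, alg_mean: float) -> float:
        """Algebra II prediction formula from the example above."""
        return 59.3 + 0.43 * (reading_mean - 164.7) + 0.89 * (alg_mean - 60.0)

    # Hypothetical school-level means.
    print(predicted_algebra1_mean(180.0))         # 60.4 + 0.88 * 3.90 = 63.832
    print(predicted_algebra2_mean(166.0, 62.0))   # 59.3 + 0.559 + 1.78 = 61.639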

 


Simple Regression or Bivariate Regression

Simple regression involves just a single independent variable and a single dependent variable. An example of simple regression is using scores on a short multiple-choice test to predict scores on a longer standardized test.

Student    Multiple-choice Test    Standardized Test
1          9                       155
2          7                       152
3          5                       150
4          6                       151
5          8                       151
6          3                       144
7          5                       149
8          2                       146
9          10                      155
10         6                       150

 [Download Data]

 

 

Next, drag the criterion variable (dependent variable) to the y-axis and the predictor to the x-axis.

Then go to the Fit tab and make sure you select the regression method.

 

 SPSS Output

 

SPSS has constructed an equation and produced a line (the regression line, or line of best fit) that is as close as possible to all of the dots.  The line was built based on the statistical criterion of least squares: the line is drawn so that the squared distances between the dots and the line are minimized.  Recall that when we calculated correlation coefficients, we would also produce a scatter diagram to visually examine the relationship and possible outliers.

Within the figure you can see an equation. This is called the regression equation. If you think back to high school geometry, you may remember that a straight line takes the form of

Y' = a + bX

where Y' is the predicted score on the dependent variable, a is a constant (or intercept), b is the regression coefficient (or slope), and X is the known score on the independent variable. For our example, the regression equation is

Y' = 142.40 + 1.30X

Now we have an equation to make a prediction. Note that we needed both the independent and dependent variables to create the equation, but once the equation is formed, all we need is an individual’s score on the shorter multiple-choice test. Let’s say that Bob scored 5 on the multiple-choice test and we wanted to predict his score on the longer standardized exam. We would simply substitute Bob’s score (5) for X in the equation above. Once you do the simple math, Bob’s predicted score on the standardized test would be 148.9.
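If you want to verify the SPSS result by hand, here is a minimal sketch using scipy.stats.linregress on the ten scores from the table above; it recovers the same intercept, slope, and predicted score for Bob:

    import numpy as np
    from scipy import stats

    # Short multiple-choice test (X) and standardized test (Y) scores
    # for the 10 students in the table above.
    x = np.array([9, 7, 5, 6, 8, 3, 5, 2, 10, 6], dtype=float)
    y = np.array([155, 152, 150, 151, 151, 144, 149, 146, 155, 150], dtype=float)

    # Least-squares fit: minimizes the squared vertical distances
    # between the data points and the line.
    fit = stats.linregress(x, y)
    print(f"Y' = {fit.intercept:.2f} + {fit.slope:.2f}X")   # Y' = 142.40 + 1.30X
    print(f"r2 = {fit.rvalue ** 2:.2f}")                    # 0.88

    # Predict Bob's standardized score from his multiple-choice score of 5.
    bob = fit.intercept + fit.slope * 5
    print(f"Predicted score for Bob: {bob:.1f}")            # 148.9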

Typically in our literature you will not see a scatter diagram of the regression line reported, simply the regression equation.

Two Types of Regression Equations

There are two types of regression equations: unstandardized and standardized. The regression equation we just completed was an unstandardized regression equation. The standardized regression equation takes the form of

zY' = β(zX)

The standardized regression equation uses the z-scores for both the dependent and independent variables. There is no constant (or intercept) in this equation, and the β (called the beta weight) is substituted for the b (the regression coefficient). When we are using the equation for prediction, we are more interested in the unstandardized regression equation. If we wanted to use the equation to explain, we would use the standardized regression equation. When we get to multiple regression, you will see how you can examine the values of β and determine which of the independent variables contribute the most to the dependent variable, because the β weights are measured on the same scale.
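A quick way to see where the beta weight comes from is to convert both variables to z-scores and refit. A sketch with the same ten scores (in simple regression the beta weight equals r, which is why the intercept drops out):

    import numpy as np
    from scipy import stats

    x = np.array([9, 7, 5, 6, 8, 3, 5, 2, 10, 6], dtype=float)
    y = np.array([155, 152, 150, 151, 151, 144, 149, 146, 155, 150], dtype=float)

    # Standardize both variables; the refit line then passes through the
    # origin, and its slope is the beta weight.
    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)
    fit = stats.linregress(zx, zy)
    print(f"zY' = {fit.slope:.2f} zX")   # beta = 0.94, equal to r here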

Coefficient of Determination (r2)

When summarizing the results of the regression procedure, researchers should report the coefficient of determination (r2). The coefficient of determination provides an indication of the quality of the relationship between the dependent and independent variables. We can always produce a regression equation for any set of data, but that does not mean that the equation is very useful or meaningful. Another way of interpreting r2 is as the proportion (or percentage) of variance in the dependent variable that is explained by the independent variable.  In most research articles, researchers report the r2 as a percentage.

Testing for Significance in the Simple Regression

Because we typically generate the regression equation using a sample, we can test the estimated parameters in the regression equation (a, b, and r2) for statistical significance.
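The sketch below computes the t test for the regression coefficient and the equivalent overall F test from the same ten scores; these match the values SPSS reports (and the write-up below):

    import numpy as np
    from scipy import stats

    x = np.array([9, 7, 5, 6, 8, 3, 5, 2, 10, 6], dtype=float)
    y = np.array([155, 152, 150, 151, 151, 144, 149, 146, 155, 150], dtype=float)

    fit = stats.linregress(x, y)
    n = len(x)

    # t test for the regression coefficient (H0: b = 0).
    t = fit.slope / fit.stderr
    print(f"t({n - 2}) = {t:.2f}, p = {fit.pvalue:.4f}")   # t(8) = 7.77, p < .01

    # Equivalent overall F test; with one predictor, F = t squared.
    r2 = fit.rvalue ** 2
    F = (r2 / 1) / ((1 - r2) / (n - 2))
    print(f"F(1, {n - 2}) = {F:.2f}")                      # 60.42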

 

 

 

 

Write Up

 Results

A simple regression procedure was used to predict students’ standardized test scores from the students’ short multiple-choice test scores. A total of 10 subjects participated in the study. The simple regression analysis revealed that the short multiple-choice test predicted the standardized test scores, r2 = .88 (adjusted r2 = .87), F(1, 8) = 60.42, p < .01. The unstandardized and standardized regression equations are reported in Table 1. The regression coefficient was statistically significant (p < .01).

 Table 1

Unstandardized and Standardized Regression Equations for the Prediction of Standardized Test Scores from Short Multiple-choice Test

 

 

                Unstandardized Coefficients     Standardized Coefficients
                B          SE                   β              t
(Constant)      142.40     1.09                                130.47**
MC_TEST         1.30       0.17                 0.94           7.77**

**p < .01

 

Activity

[Data for predicting GPA from SAT]



Multiple Regression Procedure

Multiple regression procedures are among the most popular statistical procedures used in social science research. The difference between the multiple regression procedure and simple regression is that multiple regression has more than one independent variable. The linear regression equation takes the following form

Y' = a + b1X1 + b2X2 + ... + bnXn

where n is the number of independent variables.

There are several different kinds of multiple regression: simultaneous, stepwise, and hierarchical. The difference between the methods is how the independent variables are entered into the equation.

In simultaneous (aka, standard) multiple regression, all the independent variables are entered into the equation at the same time.
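A minimal sketch of simultaneous entry (the two predictors and the outcome are made-up data): all IVs are placed in the design matrix at once and the weights are estimated together by least squares.

    import numpy as np

    # Hypothetical data: two predictors (X1, X2) and one outcome (Y).
    X1 = np.array([3, 5, 7, 4, 8, 6, 2, 9], dtype=float)
    X2 = np.array([1, 4, 6, 2, 7, 5, 1, 8], dtype=float)
    Y = np.array([10, 14, 19, 11, 22, 17, 8, 24], dtype=float)

    # Simultaneous (standard) entry: all IVs enter at once, plus a
    # column of ones for the intercept a.
    design = np.column_stack([np.ones_like(X1), X1, X2])
    coeffs, *_ = np.linalg.lstsq(design, Y, rcond=None)
    a, b1, b2 = coeffs
    print(f"Y' = {a:.2f} + {b1:.2f}X1 + {b2:.2f}X2")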

For stepwise multiple regression, the computer determines the order in which the independent variables become part of the equation. You can think of it as selecting a baseball team. You first select the best player. Your next selection may not be the next best player overall, but a player who helps round out your team. For example, if you select a pitcher first, and for your next selection a pitcher is again the best available player, you might instead select the best player at a position you need. The first independent variable entering the regression equation is the one with the largest correlation with the dependent variable. The next independent variable to enter the equation is the one with the highest shared variance with the dependent variable after the variance accounted for by the first independent variable has been removed.

In hierarchical multiple regression, the researcher determines the order in which the independent variables are entered into the equation. The order for entering the variables should be based on theory.
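A sketch of the hierarchical idea with the same made-up variables: fit the theory-driven first block, then add the next block and examine the change in r2.

    import numpy as np

    def r_squared(design: np.ndarray, y: np.ndarray) -> float:
        """r2 for a least-squares fit of y on the given design matrix."""
        coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ coeffs
        return 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

    # Hypothetical data, as in the previous sketch.
    X1 = np.array([3, 5, 7, 4, 8, 6, 2, 9], dtype=float)
    X2 = np.array([1, 4, 6, 2, 7, 5, 1, 8], dtype=float)
    Y = np.array([10, 14, 19, 11, 22, 17, 8, 24], dtype=float)
    ones = np.ones_like(X1)

    # Step 1: enter X1 alone (theory says it comes first).
    r2_step1 = r_squared(np.column_stack([ones, X1]), Y)
    # Step 2: add X2 and see how much r2 improves.
    r2_step2 = r_squared(np.column_stack([ones, X1, X2]), Y)
    print(f"r2 change when X2 is added: {r2_step2 - r2_step1:.3f}")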

It is possible to use nominal level data as an independent variable. For example, gender (female and male) can be used as an independent variable. Nominal level independent variables are referred to as dummy variables and can be coded as 0 or 1. For example, males could be coded 0 and females coded 1.
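Dummy coding is a one-line operation; the sketch below codes a hypothetical gender variable as 0/1 so it can enter the design matrix like any other IV.

    import numpy as np

    # Hypothetical gender labels, dummy-coded: male = 0, female = 1.
    gender = np.array(["male", "female", "female", "male", "female"])
    dummy = (gender == "female").astype(float)
    print(dummy)   # [0. 1. 1. 0. 1.] -- usable as an IV like any other column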

 

Collinearity

A problem in estimating the b and β weights can arise when the independent variables are highly correlated. Perfect collinearity means that one IV is totally predicted by the other IVs. There are several diagnostic procedures for examining collinearity. The variance inflation factor (VIF) indicates the linear dependence of one IV on all the other IVs; large values of VIF (> 10) indicate potential problems with collinearity. The tolerance index is equal to 1/VIF; values close to zero can indicate collinearity problems.
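The VIF can be computed directly from its definition: regress each IV on the remaining IVs and take 1/(1 - r2). A sketch with a made-up predictor matrix in which the second column is nearly a copy of the first, so both should show large VIFs:

    import numpy as np

    def vif(X: np.ndarray, j: int) -> float:
        """VIF for column j of predictor matrix X: regress X_j on the
        remaining IVs and compute 1 / (1 - r2)."""
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(len(y)), others])
        coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ coeffs
        r2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
        return 1 / (1 - r2)

    # Made-up predictors: column 2 is column 1 plus a little noise.
    rng = np.random.default_rng(0)
    x1 = rng.normal(size=50)
    X = np.column_stack([x1, x1 + rng.normal(scale=0.05, size=50),
                         rng.normal(size=50)])
    for j in range(X.shape[1]):
        v = vif(X, j)
        print(f"IV {j + 1}: VIF = {v:.1f}, tolerance = {1 / v:.3f}")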

 If collinearity is a problem, you have several choices:

  1. Proceed with your analysis and caution the reader that the regression coefficients are not well estimated.
  2. Eliminate some of the IVs, especially ones with large VIF, from the analysis.
  3. Run a factor analysis on the IVs to find combinations of IVs that could be entered into the model.
  4. Use ridge regression.

Activity

When all three variables are entered into the equation, the parameter estimates are no longer significant. [data]