Simple and Multiple Regressions
Regression procedures are like correlation because they are concerned with relationships among variables. Correlation analyses serve as a building block for regression procedures. Regression procedures have two purposes: prediction and explanation.
Review Correlation (from the work of Lisa Howley)
1. The purpose of correlation research is to investigate the relationship between variables. The purpose of regression is to predict or explain an outcome from other variables.
2. In correlation research, the variables are interchangeable: there are no independent or dependent variables. Regression distinguishes between IVs (predictor variables) and DVs (outcome variables).
3. Regression research involves multiple inferential statistical tests, whereas correlation involves a single test (H0: rho = 0).
To develop a regression equation, you first need scores on both variables from a group of similar individuals. The information from this group’s data will allow us to make predictions for other individuals.
Terminology:

When regression procedures are used for prediction, the researcher is trying to produce an equation that will predict a score (aka the dependent variable, criterion variable, or outcome variable) based on other variables (independent variables, predictor variables, or explanatory variables). For example, a university may want to predict the success of its students in the academic curriculum (college GPA) based on previous information (e.g., high school GPA, class rank, SAT). In this example, college GPA would be the criterion variable and the previous information would be the predictor variables.
Regression procedures can also be used to explain why individuals differ on some particular variable. For example, a researcher may want to know why elementary students vary in math achievement. The researcher may hypothesize that the reasons for the differences could be parental value of education, attendance, IQ, etc. In this case, math achievement would be the outcome variable and the variables that the researcher hypothesized would be the explanatory variables.
Our goal in regression is to build an equation where the dependent variable is equal to the weighted combination of the independent variables.
Example from the Schools
Prediction Formulas for End-of-Course Performance (March 24, 2000)
Predicted School Algebra I Mean = 60.4 + [0.88 x (Math-176.10)]

Predicted School Algebra II Mean = 59.3 + [0.43 x (Reading - 164.7)] + [0.89 x (Alg-60.0)]
What is the correlation between the predictors and the criterion? What is R²?
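The two prediction formulas above can be written as small functions to see how they behave. This is only a sketch: the function and argument names are made up here, and it assumes that Math, Reading, and Alg are school mean scores on the earlier tests, which the formulas do not state explicitly. The input 180.0 is a hypothetical value chosen for illustration.

```python
# Prediction formulas copied from the text; names are illustrative only.
def predicted_algebra1_mean(math_mean):
    """Predicted School Algebra I Mean from the (assumed) school Math mean."""
    return 60.4 + 0.88 * (math_mean - 176.10)


def predicted_algebra2_mean(reading_mean, alg_mean):
    """Predicted School Algebra II Mean from (assumed) Reading and Alg means."""
    return 59.3 + 0.43 * (reading_mean - 164.7) + 0.89 * (alg_mean - 60.0)


# A school sitting exactly at the centering values gets the constant back.
print(round(predicted_algebra1_mean(176.10), 2))        # 60.4
print(round(predicted_algebra2_mean(164.7, 60.0), 2))   # 59.3

# Hypothetical school with a Math mean of 180.0 (made-up input).
print(round(predicted_algebra1_mean(180.0), 2))         # 63.83
```

Because each predictor is centered (for example, Math - 176.10), the constant in each formula is the predicted mean for a school that scores exactly at the centering values.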
Simple regression involves just a single dependent variable and a single independent variable. An example of a simple regression would be examining whether a short multiple-choice test could predict scores on a longer standardized test.
| Student | Multiple-choice Test | Standardized Test |
|---|---|---|
| 1 | 9 | 155 |
| 2 | 7 | 152 |
| 3 | 5 | 150 |
| 4 | 6 | 151 |
| 5 | 8 | 151 |
| 6 | 3 | 144 |
| 7 | 5 | 149 |
| 8 | 2 | 146 |
| 9 | 10 | 155 |
| 10 | 6 | 150 |
Next you want to drag the criterion variable (dependent variable) to the y-axis and the predictor to the x-axis.
Next go to the Fit tab and make sure you select the regression method.
SPSS Output

SPSS has constructed an equation and produced a line (the regression line, or line of best fit) that is as close as possible to all of the dots. The line was built using the statistical concept of least squares: it was drawn so that the sum of the squared vertical distances between the dots and the line is minimized. Recall that when we calculated correlation coefficients, we also produced a scatter diagram to visually examine the relationship and possible outliers.
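For readers without SPSS, the least-squares line can be sketched by hand in Python. This is a minimal illustration using the ten students' scores from the table above. The function name is made up here, and the closed-form slope and intercept formulas are the standard ones for simple regression, not something specific to SPSS.

```python
from statistics import mean

# Scores for the ten students in the table above.
mc_test = [9, 7, 5, 6, 8, 3, 5, 2, 10, 6]
standardized = [155, 152, 150, 151, 151, 144, 149, 146, 155, 150]


def least_squares_line(x, y):
    """Return (a, b) minimizing the sum of squared vertical distances."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-products and sum of squares around the means.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx            # slope (regression coefficient)
    a = y_bar - b * x_bar    # intercept (constant)
    return a, b


a, b = least_squares_line(mc_test, standardized)
print(f"Y' = {a:.2f} + {b:.2f}X")   # Y' = 142.40 + 1.30X
```

The printed constant and slope agree with the B column reported in Table 1.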
Within the figure you can see an equation. This is called the regression equation. If you think back to high school geometry, you may remember that a straight line takes the form
Y' = a + bX
where Y' is the predicted score on the dependent variable, a is a constant (or intercept), b is the regression coefficient (or slope), and X is the known score on the independent variable. For our example, the regression equation is
Y' = 142.40 + 1.30X
Now we have an equation to make a prediction. Note that we needed both the independent and dependent variables to create the equation, but once the equation is formed, all we need is an individual’s score on the shorter multiple-choice test. Let’s say that Bob scored 5 on the multiple-choice test and we wanted to predict his score on the longer standardized exam. We would simply substitute Bob’s score (5) for X in the equation above. Once you do the simple math, Bob’s predicted score on the standardized test would be 148.9.
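Written out step by step, using the constant and regression coefficient reported in Table 1 (a = 142.40, b = 1.30), the arithmetic for Bob looks like this:

```latex
\begin{aligned}
Y' &= a + bX \\
   &= 142.40 + 1.30(5) \\
   &= 142.40 + 6.50 \\
   &= 148.90 \approx 148.9
\end{aligned}
```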
Typically in our literature you will not see a scatter diagram of the regression line reported, simply the regression equation.
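There are two types of regression equations: unstandardized and standardized. The regression equation we just completed was an unstandardized regression equation. The standardized regression equation takes the form
z_Y' = β z_X
where z_Y' is the predicted z-score on the dependent variable and z_X is the z-score on the independent variable.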
The standardized regression equation uses the z-scores for both the dependent and independent variables. There is no constant (or intercept) in this equation and the β (called the beta weight) is substituted for the b (called the regression coefficient). When we are using the equations for prediction, we are more interested in the unstandardized regression equation. If we wanted to use the equation to explain, we would use the standardized regression equation. When we go to multiple regression, you will see how you can examine the values of β and determine which of the independent variables contribute the most to the dependent variable because they are measured on the same scale.
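The link between b and β can be checked numerically. The sketch below uses the ten students' scores from the earlier table and assumes the usual conversion β = b × (SD of X / SD of Y). It then confirms the result by regressing z-scores on z-scores. The helper names are invented for illustration.

```python
from statistics import mean, stdev

mc_test = [9, 7, 5, 6, 8, 3, 5, 2, 10, 6]
standardized = [155, 152, 150, 151, 151, 144, 149, 146, 155, 150]


def slope(x, y):
    """Least-squares slope of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


def z_scores(values):
    """Convert raw scores to z-scores using the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


b = slope(mc_test, standardized)

# Route 1: rescale the unstandardized coefficient.
beta = b * stdev(mc_test) / stdev(standardized)

# Route 2: fit the line to z-scores directly.
beta_from_z = slope(z_scores(mc_test), z_scores(standardized))

print(round(beta, 2), round(beta_from_z, 2))   # 0.94 0.94
```

Both routes give the β reported in Table 1. In simple regression, β is also equal to the correlation coefficient between the two variables.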
When summarizing the results of the regression procedure, researchers should report the coefficient of determination (r²). The coefficient of determination provides an indication of the quality of the relationship between the dependent and independent variables. We can always produce a regression equation for any set of data, but that does not mean that the equation is useful or meaningful. Another way of interpreting r² is as the proportion (or percentage) of variance in the dependent variable that is explained by the independent variable. In most research articles, researchers report r² as a percentage.
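As a rough illustration of "proportion of variance explained", the coefficient of determination can be computed from the residuals of the fitted line. The sketch below uses the ten students' scores from the earlier table. The adjusted value uses the common formula 1 - (1 - r²)(n - 1)/(n - k - 1), which is assumed here to be the adjustment SPSS reports.

```python
from statistics import mean

mc_test = [9, 7, 5, 6, 8, 3, 5, 2, 10, 6]
standardized = [155, 152, 150, 151, 151, 144, 149, 146, 155, 150]

# Fit the least-squares line.
x_bar, y_bar = mean(mc_test), mean(standardized)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(mc_test, standardized))
sxx = sum((x - x_bar) ** 2 for x in mc_test)
b = sxy / sxx
a = y_bar - b * x_bar

# Total variation in Y versus variation left over after prediction.
ss_total = sum((y - y_bar) ** 2 for y in standardized)
ss_residual = sum((y - (a + b * x)) ** 2
                  for x, y in zip(mc_test, standardized))

r_squared = 1 - ss_residual / ss_total

n, k = len(mc_test), 1   # k = number of independent variables
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

print(round(r_squared, 2), round(adjusted_r_squared, 2))   # 0.88 0.87
```

Here about 88% of the variance in the standardized test scores is explained by the multiple-choice test scores.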
Because we typically generate the regression equation using a sample, we can test for the statistical significance of the estimated parameters in the regression equation (a, b, and r²).
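The test statistics behind these significance tests can be sketched with the standard simple-regression formulas, again using the ten students' scores from the earlier table. The sketch stops at the t and F values. Turning them into p-values requires the t and F distributions (from tables or a statistics package), which the Python standard library does not provide.

```python
from math import sqrt
from statistics import mean

mc_test = [9, 7, 5, 6, 8, 3, 5, 2, 10, 6]
standardized = [155, 152, 150, 151, 151, 144, 149, 146, 155, 150]

n = len(mc_test)
x_bar, y_bar = mean(mc_test), mean(standardized)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(mc_test, standardized))
sxx = sum((x - x_bar) ** 2 for x in mc_test)
syy = sum((y - y_bar) ** 2 for y in standardized)
b = sxy / sxx
a = y_bar - b * x_bar

# Residual (error) variance with n - 2 degrees of freedom.
ss_error = sum((y - (a + b * x)) ** 2 for x, y in zip(mc_test, standardized))
ms_error = ss_error / (n - 2)

# Standard errors and t statistics for the constant and the slope.
se_b = sqrt(ms_error / sxx)
se_a = sqrt(ms_error * (1 / n + x_bar ** 2 / sxx))
t_b = b / se_b
t_a = a / se_a

# F test for the overall equation: regression mean square (1 df) over MSE.
f_stat = (syy - ss_error) / ms_error

print(f"(Constant): SE = {se_a:.2f}, t = {t_a:.2f}")   # SE = 1.09, t = 130.47
print(f"MC_TEST:    SE = {se_b:.2f}, t = {t_b:.2f}")   # SE = 0.17, t = 7.77
print(f"F(1, {n - 2}) = {f_stat:.2f}")                  # F(1, 8) = 60.42
```

With a single predictor, F is simply the square of the slope's t statistic, so the two tests lead to the same conclusion.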



Results
A simple regression procedure was used to predict students’ standardized test scores from the students’ short multiple-choice test scores. A total of 10 subjects participated in the study. The simple regression analysis revealed that the short multiple-choice test predicted the standardized test scores, r² = .88 (adjusted r² = .87), F(1, 8) = 60.42, p < .01. The unstandardized and standardized regression equations are reported in Table 1. The regression coefficient was statistically significant (p < .01).
Table 1
Unstandardized and Standardized Regression Equations for the Prediction of Standardized Test Scores from Short Multiple-choice Test
| | Unstandardized Coefficients: B | Unstandardized Coefficients: SE | Standardized Coefficients: β | t |
|---|---|---|---|---|
| (Constant) | 142.40 | 1.09 | | 130.47** |
| MC_TEST | 1.30 | 0.17 | 0.94 | 7.77** |
**p<.01
Activity
[Data for predicting GPA from SAT]
Multiple Regression Procedure
Multiple regression procedures are among the most popular statistical procedures used in social science research. The difference between multiple regression and simple regression is that multiple regression has more than one independent variable. The linear regression equation takes the following form:
Y' = a + b1X1 + b2X2 + ... + bnXn
where n is the number of independent variables.
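To make the multiple regression equation concrete, here is a small sketch that estimates the constant and the regression coefficients by solving the least-squares normal equations. The function name is invented, and the data are synthetic: the outcome is generated exactly from Y = 1 + 2·X1 + 3·X2, so the fit should recover those weights. It is not data from any study.

```python
def fit_multiple_regression(rows, y):
    """Return [a, b1, ..., bn] by solving the normal equations (X'X)w = X'y."""
    X = [[1.0] + list(r) for r in rows]   # leading 1 carries the constant a
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(xr[i] * xr[j] for xr in X) for j in range(p)]
         + [sum(xr[i] * yi for xr, yi in zip(X, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [v - factor * c for v, c in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Synthetic example: two independent variables, outcome built from known weights.
predictors = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
outcome = [1 + 2 * x1 + 3 * x2 for x1, x2 in predictors]

a, b1, b2 = fit_multiple_regression(predictors, outcome)
print(round(a, 6), round(b1, 6), round(b2, 6))   # approximately 1, 2, 3
```

In practice a statistics package does this estimation. The point of the sketch is only that the predicted score is the constant plus a weighted sum of the independent variables.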
There are several different kinds of multiple regression: simultaneous, stepwise, and hierarchical. The difference between the methods is how you enter the independent variables into the equation.
In simultaneous (aka, standard) multiple regression, all the independent variables are considered at the same time.
For stepwise multiple regression, the computer determines the order in which the independent variables become part of the equation. You can think of it as selecting a baseball team. You first select the best player. Your next selection may not be the next best player overall, but a player who helps round out your team. For example, if you select a pitcher first and the next best available player is also a pitcher, you might instead select the best player at a position that you still need. The first independent variable entering the regression equation is the one with the largest correlation with the dependent variable. The next independent variable to enter is the one that shares the most variance with the dependent variable after the variance accounted for by the first independent variable has been removed.
In hierarchical multiple regression, the researcher determines the order in which the independent variables are entered into the equation. The order for entering the variables should be based on theory.
It is possible to use nominal level data as an independent variable. For example, gender (female and male) can be used as an independent variable. Nominal level independent variables coded this way are referred to as dummy variables and can be coded as 0 or 1. For example, males could be coded 0 and females could be coded 1.
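A short sketch of dummy coding, using made-up scores for six hypothetical students (not data from the text). With a single 0/1 dummy variable as the predictor, the constant equals the mean of the group coded 0 and the regression coefficient equals the difference between the two group means.

```python
from statistics import mean

# Hypothetical data for illustration only.
gender = ["male", "female", "male", "female", "male", "female"]
score = [10, 15, 12, 17, 14, 19]

# Dummy coding as described above: males = 0, females = 1.
dummy = [0 if g == "male" else 1 for g in gender]

# Simple least-squares regression of score on the dummy variable.
x_bar, y_bar = mean(dummy), mean(score)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(dummy, score))
sxx = sum((x - x_bar) ** 2 for x in dummy)
b = sxy / sxx
a = y_bar - b * x_bar

print(dummy)   # [0, 1, 0, 1, 0, 1]
print(a, b)    # 12.0 5.0
```

Here 12.0 is the mean score of the males (coded 0) and 5.0 is how much higher the female mean (17) is than the male mean.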
A problem in estimating the b and β weights can arise when the independent variables are highly correlated. Collinearity means that an IV is largely (in the extreme case, totally) predicted by the other IVs. There are several diagnostic procedures for examining collinearity. The variance inflation factor (VIF) indicates the linear dependence of one IV on all the other IVs. Large values of VIF (>10) indicate potential problems with collinearity. The tolerance index is equal to 1/VIF; values close to zero could indicate collinearity problems.
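As a minimal sketch of these diagnostics, consider the special case of only two IVs, where the VIF reduces to 1 / (1 - r²) with r the correlation between the two IVs. With more IVs, each IV must be regressed on all the others to get its own R². The data and function names below are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF when there are exactly two IVs: 1 / (1 - r^2)."""
    return 1 / (1 - pearson_r(x1, x2) ** 2)


# Hypothetical scores on two independent variables.
iv1 = [1, 2, 3, 4, 5]
iv2 = [2, 1, 4, 3, 5]

vif = vif_two_predictors(iv1, iv2)
tolerance = 1 / vif
print(round(vif, 2), round(tolerance, 2))   # 2.78 0.36
```

These two IVs correlate at .80, yet the VIF is well under the >10 rule of thumb mentioned above.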
If collinearity is a problem, you have several choices:
Activity
When all three variables are entered into the equation, the parameter estimates are no longer significant. [data]