Logistic Regression
[Introduction of Logistic Regression~10 minutes]
[Logistic Regression SPSS~4 minutes]
Purpose
Logistic regression is a method of modeling a binary response variable (0 or 1). For example, we may want to investigate how dropping out of school (0) or staying in school (1) can be predicted by a school intervention.
Logistic regression allows one to predict a discrete outcome (e.g., disease/no disease), such as group membership, from a set of variables. It is similar to multiple regression except that the outcome variable is discrete. Logistic regression emphasizes the probability of a particular outcome for each case.
The difference between multiple regression and logistic regression is that the linear portion of the equation (called the logit) is not the end in itself, but is used to find the odds of being in one of the categories of the DV given a particular combination of scores on the IVs.
Logistic regression has many analogies to multiple (ordinary least squares--OLS) regression: logit coefficients correspond to b coefficients in the regression equation, standardized logit coefficients correspond to beta weights, and a pseudo R² statistic is available to summarize the strength of the relationship. Unlike OLS regression, however, logistic regression does not assume a linear relationship between the independent variables and the dependent variable, does not require normally distributed variables, does not assume homoscedasticity, and in general has less stringent requirements. It does, however, require that observations be independent and that the independent variables be linearly related to the logit of the dependent variable.
The success of the logistic regression can be assessed by looking at the classification table, which shows correct and incorrect classifications of the dichotomous, ordinal, or polytomous dependent variable. Goodness-of-fit tests such as the model chi-square are also available as indicators of model appropriateness, as is the Wald statistic for testing the significance of individual independent variables.
Logistic (binary) regression is used to fit a model to binary response (Y) data, such as whether a subject dies (event) or lives (non-event). These events are often described as success vs. failure. For each possible set of values of the independent (X) variables, there is a probability p that a success occurs. [There are also multinomial logistic regression models for responses with more than two categories, but we will not cover those here.]
[Data for Coronary Heart Disease (CHD)]
The Math
The logit function is used to transform an "S"-shaped curve into an approximately straight line and to change the range of the proportion from (0, 1) to (-∞, +∞).
OR  | ln(OR)
----|----------
0.1 | -2.30259
0.4 | -0.91629
0.7 | -0.35667
1   |  0
1.3 |  0.262364
1.6 |  0.470004
1.9 |  0.641854
[Link to Spreadsheet Showing the Calculations]
[Link to Simple Logistic Regression]
logit(p) = a + b_1x_1 + b_2x_2 + ... + b_ix_i
where p is the probability of dropout and x_1, x_2, ..., x_i are the explanatory variables.
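A quick sketch of the logit transform and its inverse in plain Python (the function names here are illustrative, not part of any particular package), verifying one of the values in the ln table above:

```python
import math

def logit(p):
    """Log-odds: maps a proportion in (0, 1) onto the whole real line."""
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Inverse transform: maps any real number back to a proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(logit(0.5))                # 0.0 -- even odds sit at the middle of the line
print(round(math.log(0.1), 5))   # -2.30259, matching the table above
```

Note that logit(p) and inv_logit(x) undo each other, which is what lets the model work on the unbounded log-odds scale and still report probabilities.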
Advantages & Limitations of Logistic Regression Analysis
Practical Issues
Interpretation of Weights (B)--B is the increase in the log-odds of the outcome for a one-unit increase in X.
You can see there is a little more work involved after you get the equation. The linear equation that we created in SPSS gives the logit, or log of the odds. That is, the linear equation equals the natural log of the probability of being in one group divided by the probability of being in the other group.
A procedure called maximum likelihood (ML) estimation is used to estimate the coefficients.
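There is no closed-form solution, so ML estimation is carried out iteratively. A minimal sketch of the idea, using gradient ascent on the log-likelihood with made-up toy data (the data, step size, and iteration count are assumptions for illustration; real software uses faster algorithms such as Newton-Raphson):

```python
import math

# Toy data (hypothetical): x = dose of intervention, y = 1 if student stays in school.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

def sigmoid(z):
    """Inverse logit: converts a log-odds value into a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Gradient ascent on the log-likelihood of logit(p) = a + b*x.
a, b = 0.0, 0.0   # intercept and slope, started at zero
lr = 0.1          # step size
for _ in range(5000):
    # Gradient of the log-likelihood: sum of (observed - predicted) residuals.
    grad_a = sum(y - sigmoid(a + b * x) for x, y in zip(xs, ys))
    grad_b = sum((y - sigmoid(a + b * x)) * x for x, y in zip(xs, ys))
    a += lr * grad_a / len(xs)
    b += lr * grad_b / len(xs)

print(a, b)  # b comes out positive: staying in school rises with x
```

The update direction, observed minus predicted, is the same quantity that Newton-type algorithms in SPSS-style software drive to zero at the MLE.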
Logistic regression also produces Odds Ratios (OR) associated with each predictor value. The odds of an event is defined as the probability of the outcome event occurring divided by the probability of the event not occurring. The odds ratio for a predictor tells the relative amount by which the odds of the outcome increase (OR greater than 1.0) or decrease (OR less than 1.0) when the value of the predictor value is increased by 1.0 units.
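Because the model is linear on the log-odds scale, the odds ratio for a predictor is simply exp(B). A small sketch with a hypothetical intercept and coefficient (both made up for illustration):

```python
import math

a = -1.0      # hypothetical intercept
B = 0.6931    # hypothetical coefficient, roughly ln(2)

def prob(logit_value):
    """Convert a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-logit_value))

def odds(p):
    """Odds: probability of the event divided by probability of no event."""
    return p / (1.0 - p)

# Odds at x = 1 versus x = 0: the ratio is exp(B), independent of the intercept.
ratio = odds(prob(a + B * 1)) / odds(prob(a + B * 0))
print(round(ratio, 2))  # 2.0 -- the odds roughly double per unit increase in x
```

This is why a coefficient near 0.69 is often described as "doubling the odds": exp(0.693) ≈ 2.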
Assessment of the Fit of the Model
After estimating the coefficients, there are several steps involved in assessing the appropriateness, adequacy and usefulness of the model. First, the importance of each of the explanatory variables is assessed by carrying out statistical tests of the significance of the coefficients. The overall goodness of fit of the model is then tested. Additionally, the ability of the model to discriminate between the two groups defined by the response variable is evaluated. Finally, if possible, the model is validated by checking the goodness of fit and discrimination on a different set of data from that which was used to develop the model.
1. Wald χ² statistic (reliability is questionable)
2. Likelihood Ratio Test
The likelihood ratio test for a particular parameter compares the likelihood of obtaining the data when the parameter is zero (L0) with the likelihood (L1) of obtaining the data evaluated at the MLE of the parameter. The test statistic is calculated as follows:
-2 × ln(likelihood ratio) = -2 × ln(L0/L1) = -2 × (ln L0 - ln L1)
It is compared with a χ² distribution with 1 degree of freedom.
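A sketch of the calculation with made-up log-likelihood values (not from a real fit). For one parameter the reference distribution is χ² with 1 df, whose upper-tail probability can be written as erfc(√(G/2)):

```python
import math

# Hypothetical log-likelihoods (illustrative numbers only):
lnL0 = -120.5   # model with the parameter fixed at zero
lnL1 = -115.2   # model with the parameter at its MLE

G = -2 * (lnL0 - lnL1)   # likelihood ratio statistic, here 10.6

# p-value: P(chi-square with 1 df > G) = erfc(sqrt(G / 2))
p_value = math.erfc(math.sqrt(G / 2.0))
print(round(G, 1), p_value)   # G = 10.6, p ≈ 0.001 -- the parameter matters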
3. Goodness of fit
4. R² for logistic regression
5. Discrimination (classification accuracy)
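A minimal sketch of how a classification table is built from the model's predicted probabilities, using the conventional 0.5 cutoff and hypothetical probabilities and outcomes:

```python
# Hypothetical predicted probabilities and observed outcomes (illustrative only).
probs  = [0.1, 0.3, 0.6, 0.8, 0.4, 0.9, 0.2, 0.7]
actual = [0,   0,   1,   1,   1,   1,   0,   0]

cutoff = 0.5
predicted = [1 if p >= cutoff else 0 for p in probs]

# Cells of the 2x2 classification table.
tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
tn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 0)
fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)

accuracy = (tp + tn) / len(actual)
print(tp, tn, fp, fn, accuracy)  # 3 3 1 1 0.75 -- 75% correctly classified
```

The diagonal cells (tp and tn) are the correct classifications; overall accuracy is their share of all cases, which is what the SPSS classification table summarizes.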
From Huck Text