Statistical notes for clinical researchers: logistic regression
Logistic regression is a regression model in which the dependent variable is categorical and the independent variables can be categorical or continuous. This article covers the case of a binary dependent variable, such as the occurrence of an event coded as 1 = ‘event’ and 0 = ‘no event’. Common outcomes are pass/fail, win/lose, disease/no disease, etc. The logistic regression model estimates the probability that an event occurs versus the probability that the event does not occur.
An example: score and pass data
Let's say that an institution performed an assessment procedure to determine pass or fail for the participants, considering exam scores, interview results, and reputation among colleagues. Table 1 shows data with 2 variables, exam score and pass state (1 = pass, 0 = fail). There is a trend that persons with lower scores are more likely to fail, while persons with higher scores tend to pass. When we plot the data as in Figure 1A, persons with value 1 (pass) have scores shifted to the right side, while persons with value 0 (fail) have scores shifted to the left side. Persons with the same score may not have the same outcome (e.g., cases of score = 799) because the assessment procedure comprises other factors. At least we can postulate that the probability of pass is higher if the score is higher.
What is the best-fit line for these data? A usual straight regression line ranging from minus infinity to infinity does not make sense for this case, because a probability must stay between zero and one. Instead of ordinary linear regression, logistic regression can fit the probability more adequately. In Figure 1B, the probability estimated by logistic regression is presented. The estimated probability by the logistic regression model (red dot and line) seems reasonable because it reflects the observed reality that the probability of pass approaches zero at very low scores and approaches one at very high scores.
Review of probability, odds, and odds ratio
In the previous sections on risk, odds, and odds ratio, these quantities were defined by the following formulas:
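The formulas are not reproduced in this text. As a reminder, the standard definitions can be written as below. The symbols n_event and n_total and the group subscripts 1 and 0 are notation assumed here for illustration and may differ from the notation of the earlier sections.

```latex
\begin{align*}
\text{risk (probability)} &: \quad p = \frac{n_{\text{event}}}{n_{\text{total}}} \\
\text{odds} &: \quad \frac{p}{1 - p} \\
\text{odds ratio} &: \quad \frac{p_1 / (1 - p_1)}{p_0 / (1 - p_0)}
\end{align*}
```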
Let's consider an example of flipping fair coins vs. loaded coins.
The odds ratio is important in interpreting logistic regression because it represents how much the odds change with a 1 unit increase in a predictor variable while all other variables are kept constant.
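The coin comparison can be sketched in a few lines of Python. The probability of heads for the loaded coin is not given in this text, so the value 0.7 below is a hypothetical choice made only for illustration, and the function name odds is likewise assumed.

```python
def odds(p):
    """Odds of an event that occurs with probability p (0 <= p < 1)."""
    return p / (1.0 - p)


p_fair = 0.5    # fair coin: heads and tails equally likely
p_loaded = 0.7  # HYPOTHETICAL loaded coin; value assumed for illustration

odds_fair = odds(p_fair)
odds_loaded = odds(p_loaded)

# Odds ratio of heads: loaded coin relative to fair coin.
odds_ratio = odds_loaded / odds_fair

print(f"odds (fair)   = {odds_fair:.3f}")
print(f"odds (loaded) = {odds_loaded:.3f}")
print(f"odds ratio    = {odds_ratio:.3f}")
```

With a fair coin the odds are exactly one, so in this sketch the odds ratio equals the odds of the loaded coin.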
Logistic regression
1. Logit link function
Logistic regression uses the logit link function to relate the unknown probability of the outcome (p) to a linear combination of predictor variables. The probability itself, which ranges from zero to one, cannot be matched directly with a linear combination of predictor variables, which ranges from minus infinity to infinity [1].
The logit link function reconciles this incongruity by transforming the dependent variable, p, from the range of zero to one onto the range of minus infinity to infinity. As seen in Table 2, the resulting logit (p) values range from negative to positive values.
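The change of range can be checked numerically. The sketch below is not a reproduction of Table 2; the probability values are chosen here for illustration, and it simply computes the odds and logit (p) for each of them.

```python
import math


def logit(p):
    """Log odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


# Illustrative probabilities (assumed here, not the rows of Table 2).
for p in (0.01, 0.10, 0.50, 0.90, 0.99):
    p_odds = p / (1.0 - p)
    print(f"p = {p:.2f}   odds = {p_odds:7.3f}   logit(p) = {logit(p):7.3f}")
```

Probabilities below 0.5 give negative logits, probabilities above 0.5 give positive logits, and values of p that are symmetric about 0.5 give logits of equal size and opposite sign.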
2. Property of logit and inverse logit
As shown in Figure 2A, the logit function has an s-shaped curve. Logit (p) is undefined at p = 0 and p = 1. When p approaches zero, the value of logit (p) goes toward minus infinity, and when p gets larger and approaches one, it goes toward infinity. Note that logit (p) has a value of zero at p = 0.5.
Figure 2B shows the inverse logit graph. The inverse logit returns the probability of the event, ranging from zero to one. Figure 1B and Figure 2B show a similar shape because both represent estimated probability. The derived inverse logit formula is as follows:
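The formula is not reproduced in this text. It can be recovered by solving the logit for p; the derivation below writes z for the value of logit (p), a symbol assumed here for illustration.

```latex
\begin{align*}
z &= \operatorname{logit}(p) = \ln\frac{p}{1 - p} \\
e^{z} &= \frac{p}{1 - p} \\
p &= \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}}
\end{align*}
```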
3. Estimation of logistic regression equation
Simple logistic regression equates logit (p) with a linear function of the predictor variable, as below.
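The equation is not reproduced in this text. In standard notation, with x assumed here as the symbol for the single predictor and β0 and β1 as the intercept and slope, the model can be written as in the first line below. The second line is a short worked step showing why exp (β1) can be read as the odds ratio for a 1 unit increase in x.

```latex
\begin{align*}
\operatorname{logit}(p) = \ln\frac{p}{1 - p} &= \beta_0 + \beta_1 x \\
\frac{\text{odds}(x + 1)}{\text{odds}(x)}
  = \frac{e^{\beta_0 + \beta_1 (x + 1)}}{e^{\beta_0 + \beta_1 x}} &= e^{\beta_1}
\end{align*}
```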
Using fictitious data based on the example above, logistic regression was performed and the output is provided (pages 6–7). Each of the observations (n = 15) was multiplied by 100 to artificially provide enough power to obtain significant estimates. The dependent variable was the binary variable pass, and score was the predictor variable. The SPSS (IBM Corp., Armonk, NY, USA) output (e) below gives the coefficients as follows.
The estimated logistic equation is:
Here exp (β1) represents the odds ratio, which is the multiplicative change in odds with a 1 unit increase in the predictor variable. The odds ratio is exp (β1) = e^0.093115 = 1.097588. Therefore, as the score increases by 1 point, the odds of pass were estimated to increase by 9.8%. The 95% confidence interval of the odds ratio was [1.086, 1.109], which does not include one. An odds ratio of one means that a 1 unit increase in the predictor variable does not make any difference in the odds. Therefore, statistical significance is confirmed by checking that the 95% confidence interval of the odds ratio does not include one.
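The arithmetic in this paragraph can be reproduced with a minimal sketch. The coefficient and the confidence limits are the values quoted in the text; the variable names are chosen here for illustration.

```python
import math

beta1 = 0.093115                # slope for score, as quoted in the text
ci_low, ci_high = 1.086, 1.109  # 95% CI of the odds ratio, as quoted

# Odds ratio for a 1 point increase in score.
odds_ratio = math.exp(beta1)

# Percentage change in odds for a 1 point increase in score.
pct_change = (odds_ratio - 1.0) * 100.0

# The estimate is significant at the 5% level if the CI excludes one.
excludes_one = not (ci_low <= 1.0 <= ci_high)

print(f"odds ratio     = {odds_ratio:.6f}")
print(f"change in odds = {pct_change:.1f}%")
print(f"95% CI excludes 1: {excludes_one}")
```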
1) Estimated probability
After some algebra, the inverse logit gives the estimated probability as a function of the predictor variable as follows:
To get the probability of pass at score 781, we can use the estimated probability function. If the score increases by one point to 782, the estimated probability can be calculated in the same way, as shown in Table 3. According to the results, the estimated probability of pass at score 781 is 0.30 or 30%. The odds ratio between the two scores is 1.098, which is the same value as exp (β1) from the SPSS output and represents a 9.8% increase in odds for a 1 point increase of the score.
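The calculation can be sketched as below. The fitted intercept is not reproduced in this text, so the sketch backs out a stand-in intercept from the reported probability of 0.30 at score 781. That intercept is an assumption derived from a rounded value, not the SPSS estimate, and the function names are chosen here for illustration.

```python
import math


def logit(p):
    """Log odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def estimated_probability(score, b0, b1):
    """Inverse logit of the linear predictor b0 + b1 * score."""
    z = b0 + b1 * score
    return 1.0 / (1.0 + math.exp(-z))


b1 = 0.093115  # slope for score, as quoted in the text

# ASSUMPTION: stand-in intercept implied by the reported (rounded)
# probability 0.30 at score 781; not the value from the SPSS output.
b0 = logit(0.30) - b1 * 781

p_781 = estimated_probability(781, b0, b1)
p_782 = estimated_probability(782, b0, b1)

odds_781 = p_781 / (1.0 - p_781)
odds_782 = p_782 / (1.0 - p_782)
odds_ratio = odds_782 / odds_781

print(f"p(781) = {p_781:.4f}, odds = {odds_781:.4f}")
print(f"p(782) = {p_782:.4f}, odds = {odds_782:.4f}")
print(f"odds ratio = {odds_ratio:.6f}, exp(b1) = {math.exp(b1):.6f}")
```

Whatever intercept is used, the odds ratio between two scores that differ by 1 point equals exp (β1), because the intercept cancels in the ratio of the odds.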
Estimated probabilities for other score values are shown in the SPSS output (f) below under ‘PRE_1’. Using these, we can calculate the odds and the odds ratio between 2 specific scores. For example, suppose my present score is 781 and I'd like to know how much the odds increase if I raise my score by 11 points to 792. The odds ratio can then be obtained easily. The calculation shows an increase of 179% in odds when I raise my score by 11 points (Table 4).
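The same odds ratio can also be obtained directly from the coefficient, since an increase of k points multiplies the odds by exp (k × β1). The sketch below uses this shortcut rather than the estimated probabilities of Table 4, so its result can differ slightly from the tabulated value because of rounding.

```python
import math

beta1 = 0.093115         # slope for score, as quoted in the text
score_now, score_new = 781, 792
k = score_new - score_now  # 11 point increase

# Odds ratio for a k point increase: exp(k * beta1) = exp(beta1) ** k.
odds_ratio_k = math.exp(k * beta1)
pct_increase = (odds_ratio_k - 1.0) * 100.0

print(f"odds ratio for +{k} points = {odds_ratio_k:.3f}")
print(f"increase in odds = {pct_increase:.1f}%")
```

This gives an odds ratio of about 2.79, in line with the increase of roughly 179% reported above.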
Appendices
Appendix 1
Procedure of logistic regression using IBM SPSS.
The procedure of logistic regression using IBM SPSS Statistics for Windows Version 23.0 (IBM Corp.) is as follows.
*In these fictitious data, the ‘freq’ variable was used to multiply the number of observations to get sufficient power.
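For readers without SPSS, the effect of the ‘freq’ multiplier can be illustrated with a minimal pure-Python sketch. This is not the SPSS procedure: it fits a simple logistic regression by Newton-Raphson maximum likelihood, a technique substituted here for illustration. The scores and outcomes below are hypothetical and are not the data of Table 1, and the function names are assumed.

```python
import math


def sigmoid(z):
    """Numerically stable inverse logit."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_simple_logistic(x, y, freq, n_iter=50, tol=1e-10):
    """Newton-Raphson fit of logit(p) = b0 + b1 * x with frequency weights.

    Returns (b0, b1, se_b1). The predictor is centered internally to keep
    the iterations well conditioned.
    """
    mean_x = sum(f * xi for f, xi in zip(freq, x)) / sum(freq)
    xc = [xi - mean_x for xi in x]
    a, b1 = 0.0, 0.0  # a is the intercept on the centered scale
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, f in zip(xc, y, freq):
            p = sigmoid(a + b1 * xi)
            w = f * p * (1.0 - p)
            g0 += f * (yi - p)       # score (gradient) for intercept
            g1 += f * (yi - p) * xi  # score (gradient) for slope
            h00 += w                 # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        step_a = (h11 * g0 - h01 * g1) / det
        step_b = (h00 * g1 - h01 * g0) / det
        a += step_a
        b1 += step_b
        if abs(step_a) + abs(step_b) < tol:
            break
    # Standard error of the slope from the inverse information matrix,
    # evaluated at the last iterate before the final (negligible) step.
    se_b1 = math.sqrt(h00 / det)
    return a - b1 * mean_x, b1, se_b1


# HYPOTHETICAL scores and outcomes (1 = pass, 0 = fail), for illustration.
scores = [770, 775, 780, 785, 790, 795, 799, 799, 805, 810]
passed = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

fit_1 = fit_simple_logistic(scores, passed, [1] * len(scores))
fit_100 = fit_simple_logistic(scores, passed, [100] * len(scores))

for label, (b0, b1, se) in (("freq = 1", fit_1), ("freq = 100", fit_100)):
    lo, hi = math.exp(b1 - 1.96 * se), math.exp(b1 + 1.96 * se)
    print(f"{label:>10}: b1 = {b1:.4f}, SE = {se:.4f}, "
          f"OR = {math.exp(b1):.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Multiplying every observation by 100 leaves the coefficients unchanged but divides the standard errors by 10, which narrows the confidence interval of the odds ratio. This is why the multiplier yields significant estimates artificially.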