Logistic regression

Problem

You want to perform a logistic regression.

Solution

Logistic regression is typically used when there is one dichotomous outcome variable (such as winning or losing) and a continuous predictor variable that is related to the probability or odds of the outcome. It can also be used with categorical predictors, and with multiple predictors.
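
Concretely, the model relates the predictors to the log-odds (logit) of the outcome: logit(p) = log(p / (1 - p)) = b0 + b1*x. As a quick illustration (the probability value below is arbitrary), base R's qlogis() and plogis() compute the logit and its inverse:

    p <- 0.75
    qlogis(p)           # log-odds: log(0.75 / 0.25) = 1.0986
    plogis(qlogis(p))   # the inverse logit recovers the probability: 0.75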

Suppose we start with part of the built-in mtcars dataset. In the examples below, we’ll use vs as the outcome variable, mpg as a continuous predictor, and am as a categorical (dichotomous) predictor.

    data(mtcars)
    dat <- subset(mtcars, select=c(mpg, am, vs))
    dat
    #>                      mpg am vs
    #> Mazda RX4           21.0  1  0
    #> Mazda RX4 Wag       21.0  1  0
    #> Datsun 710          22.8  1  1
    #> Hornet 4 Drive      21.4  0  1
    #> Hornet Sportabout   18.7  0  0
    #> Valiant             18.1  0  1
    #> Duster 360          14.3  0  0
    #> Merc 240D           24.4  0  1
    #> Merc 230            22.8  0  1
    #> Merc 280            19.2  0  1
    #> Merc 280C           17.8  0  1
    #> Merc 450SE          16.4  0  0
    #> Merc 450SL          17.3  0  0
    #> Merc 450SLC         15.2  0  0
    #> Cadillac Fleetwood  10.4  0  0
    #> Lincoln Continental 10.4  0  0
    #> Chrysler Imperial   14.7  0  0
    #> Fiat 128            32.4  1  1
    #> Honda Civic         30.4  1  1
    #> Toyota Corolla      33.9  1  1
    #> Toyota Corona       21.5  0  1
    #> Dodge Challenger    15.5  0  0
    #> AMC Javelin         15.2  0  0
    #> Camaro Z28          13.3  0  0
    #> Pontiac Firebird    19.2  0  0
    #> Fiat X1-9           27.3  1  1
    #> Porsche 914-2       26.0  1  0
    #> Lotus Europa        30.4  1  1
    #> Ford Pantera L      15.8  1  0
    #> Ferrari Dino        19.7  1  0
    #> Maserati Bora       15.0  1  0
    #> Volvo 142E          21.4  1  1
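
Note that vs and am are stored as 0/1 numeric codes. glm() with family=binomial accepts a 0/1 numeric outcome directly; it also accepts a factor, with the first level treated as failure. A sketch of the factor version, which gives the same coefficients (dat_f is an illustrative name):

    dat_f <- transform(dat, vs=factor(vs), am=factor(am))
    glm(vs ~ mpg, data=dat_f, family=binomial)   # same fit as with numeric vs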

Continuous predictor, dichotomous outcome

If the data set has one dichotomous and one continuous variable, and the continuous variable is a predictor of the probability of the dichotomous variable, then a logistic regression might be appropriate.

In this example, mpg is the continuous predictor variable, and vs is the dichotomous outcome variable.

    # Do the logistic regression - both of these have the same effect.
    # ("logit" is the default link when family is binomial.)
    logr_vm <- glm(vs ~ mpg, data=dat, family=binomial)
    logr_vm <- glm(vs ~ mpg, data=dat, family=binomial(link="logit"))

To view the model and information about it:

    # Print information about the model
    logr_vm
    #> 
    #> Call:  glm(formula = vs ~ mpg, family = binomial(link = "logit"), data = dat)
    #> 
    #> Coefficients:
    #> (Intercept)          mpg  
    #>     -8.8331       0.4304  
    #> 
    #> Degrees of Freedom: 31 Total (i.e. Null);  30 Residual
    #> Null Deviance:     43.86 
    #> Residual Deviance: 25.53   AIC: 29.53

    # More information about the model
    summary(logr_vm)
    #> 
    #> Call:
    #> glm(formula = vs ~ mpg, family = binomial(link = "logit"), data = dat)
    #> 
    #> Deviance Residuals: 
    #>     Min       1Q   Median       3Q      Max  
    #> -2.2127  -0.5121  -0.2276   0.6402   1.6980  
    #> 
    #> Coefficients:
    #>             Estimate Std. Error z value Pr(>|z|)   
    #> (Intercept)  -8.8331     3.1623  -2.793  0.00522 **
    #> mpg           0.4304     0.1584   2.717  0.00659 **
    #> ---
    #> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    #> 
    #> (Dispersion parameter for binomial family taken to be 1)
    #> 
    #>     Null deviance: 43.860  on 31  degrees of freedom
    #> Residual deviance: 25.533  on 30  degrees of freedom
    #> AIC: 29.533
    #> 
    #> Number of Fisher Scoring iterations: 6
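
The coefficients are on the log-odds scale: each additional mile per gallon is estimated to raise the log-odds of vs=1 by about 0.43. To interpret them on more familiar scales, a sketch (the mpg values below are arbitrary):

    exp(coef(logr_vm))   # odds ratios: exp(0.4304), about 1.54 per mpg

    # Predicted probability of vs=1 at a few illustrative mpg values
    predict(logr_vm, data.frame(mpg=c(15, 20, 25)), type="response")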

Plotting

The data and logistic regression model can be plotted with ggplot2 or base graphics:

    # With ggplot2
    library(ggplot2)
    ggplot(dat, aes(x=mpg, y=vs)) + geom_point() +
      stat_smooth(method="glm", method.args=list(family="binomial"), se=FALSE)

    # With base graphics
    par(mar = c(4, 4, 1, 1)) # Reduce some of the margins so that the plot fits better
    plot(dat$mpg, dat$vs)
    curve(predict(logr_vm, data.frame(mpg=x), type="response"), add=TRUE)

[Plots: vs against mpg with the fitted logistic curve, in ggplot2 and in base graphics]

Dichotomous predictor, dichotomous outcome

This proceeds in much the same way as above. In this example, am is the dichotomous predictor variable, and vs is the dichotomous outcome variable.

    # Do the logistic regression
    logr_va <- glm(vs ~ am, data=dat, family=binomial)

    # Print information about the model
    logr_va
    #> 
    #> Call:  glm(formula = vs ~ am, family = binomial, data = dat)
    #> 
    #> Coefficients:
    #> (Intercept)           am  
    #>     -0.5390       0.6931  
    #> 
    #> Degrees of Freedom: 31 Total (i.e. Null);  30 Residual
    #> Null Deviance:     43.86 
    #> Residual Deviance: 42.95   AIC: 46.95

    # More information about the model
    summary(logr_va)
    #> 
    #> Call:
    #> glm(formula = vs ~ am, family = binomial, data = dat)
    #> 
    #> Deviance Residuals: 
    #>     Min      1Q  Median      3Q     Max  
    #> -1.2435 -0.9587 -0.9587  1.1127  1.4132  
    #> 
    #> Coefficients:
    #>             Estimate Std. Error z value Pr(>|z|)
    #> (Intercept)  -0.5390     0.4756  -1.133    0.257
    #> am            0.6931     0.7319   0.947    0.344
    #> 
    #> (Dispersion parameter for binomial family taken to be 1)
    #> 
    #>     Null deviance: 43.860  on 31  degrees of freedom
    #> Residual deviance: 42.953  on 30  degrees of freedom
    #> AIC: 46.953
    #> 
    #> Number of Fisher Scoring iterations: 4
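
With a single dichotomous predictor, the coefficients can be checked directly against the contingency table: the intercept is the log-odds of vs=1 in the am=0 group, and the am coefficient is the log of the odds ratio between the two groups. A quick check:

    tab <- table(dat$am, dat$vs)   # rows: am, columns: vs
    log(tab["0","1"] / tab["0","0"])   # log-odds for am=0: matches the intercept, -0.5390
    log((tab["1","1"] / tab["1","0"]) /
        (tab["0","1"] / tab["0","0"]))  # log odds ratio: matches the am coefficient, 0.6931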

Plotting

The data and logistic regression model can be plotted with ggplot2 or base graphics, although the plots are probably less informative than those with a continuous variable. Because there are only 4 locations for the points to go, it will help to jitter the points so they do not all get overplotted.

    # With ggplot2: jitter the points slightly in both directions
    library(ggplot2)
    ggplot(dat, aes(x=am, y=vs)) +
      geom_point(shape=1, position=position_jitter(width=.05, height=.05)) +
      stat_smooth(method="glm", method.args=list(family="binomial"), se=FALSE)

    # With base graphics: jitter() adds a little random noise to the 0/1 values
    par(mar = c(4, 4, 1, 1)) # Reduce some of the margins so that the plot fits better
    plot(jitter(dat$am, .2), jitter(dat$vs, .2))
    curve(predict(logr_va, data.frame(am=x), type="response"), add=TRUE)

[Plots: jittered vs against am with the fitted logistic curve, in ggplot2 and in base graphics]

Continuous and dichotomous predictors, dichotomous outcome

This is similar to the previous examples. In this example, mpg is the continuous predictor, am is the dichotomous predictor variable, and vs is the dichotomous outcome variable.

    # Do the logistic regression
    logr_vma <- glm(vs ~ mpg + am, data=dat, family=binomial)

    logr_vma
    #> 
    #> Call:  glm(formula = vs ~ mpg + am, family = binomial, data = dat)
    #> 
    #> Coefficients:
    #> (Intercept)          mpg           am  
    #>    -12.7051       0.6809      -3.0073  
    #> 
    #> Degrees of Freedom: 31 Total (i.e. Null);  29 Residual
    #> Null Deviance:     43.86 
    #> Residual Deviance: 20.65   AIC: 26.65

    summary(logr_vma)
    #> 
    #> Call:
    #> glm(formula = vs ~ mpg + am, family = binomial, data = dat)
    #> 
    #> Deviance Residuals: 
    #>      Min        1Q    Median        3Q       Max  
    #> -2.05888  -0.44544  -0.08765   0.33335   1.68405  
    #> 
    #> Coefficients:
    #>             Estimate Std. Error z value Pr(>|z|)   
    #> (Intercept) -12.7051     4.6252  -2.747  0.00602 **
    #> mpg           0.6809     0.2524   2.698  0.00697 **
    #> am           -3.0073     1.5995  -1.880  0.06009 . 
    #> ---
    #> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    #> 
    #> (Dispersion parameter for binomial family taken to be 1)
    #> 
    #>     Null deviance: 43.860  on 31  degrees of freedom
    #> Residual deviance: 20.646  on 29  degrees of freedom
    #> AIC: 26.646
    #> 
    #> Number of Fisher Scoring iterations: 6
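
With two predictors, predict() returns the fitted probability of vs=1 for any combination of mpg and am. A sketch (the mpg values below are arbitrary):

    # Fitted probabilities over a small grid of predictor values
    newdat <- expand.grid(mpg=c(15, 20, 25), am=c(0, 1))
    newdat$prob <- predict(logr_vma, newdat, type="response")
    newdat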

Multiple predictors with interactions

It is possible to test for interactions when there are multiple predictors. The interactions can be specified individually, as with a + b + c + a:b + a:c + b:c + a:b:c, or they can be expanded automatically, with a * b * c. It is also possible to specify only a subset of the possible interactions, such as a + b + c + a:c.

This case proceeds as above, but with a slight change: instead of the formula being vs ~ mpg + am, it is vs ~ mpg * am, which is equivalent to vs ~ mpg + am + mpg:am.

    # Do the logistic regression - both of these have the same effect.
    logr_vmai <- glm(vs ~ mpg * am, data=dat, family=binomial)
    logr_vmai <- glm(vs ~ mpg + am + mpg:am, data=dat, family=binomial)

    logr_vmai
    #> 
    #> Call:  glm(formula = vs ~ mpg + am + mpg:am, family = binomial, data = dat)
    #> 
    #> Coefficients:
    #> (Intercept)          mpg           am       mpg:am  
    #>    -20.4784       1.1084      10.1055      -0.6637  
    #> 
    #> Degrees of Freedom: 31 Total (i.e. Null);  28 Residual
    #> Null Deviance:     43.86 
    #> Residual Deviance: 19.12   AIC: 27.12

    summary(logr_vmai)
    #> 
    #> Call:
    #> glm(formula = vs ~ mpg + am + mpg:am, family = binomial, data = dat)
    #> 
    #> Deviance Residuals: 
    #>      Min        1Q    Median        3Q       Max  
    #> -1.70566  -0.31124  -0.04817   0.28038   1.55603  
    #> 
    #> Coefficients:
    #>             Estimate Std. Error z value Pr(>|z|)  
    #> (Intercept) -20.4784    10.5525  -1.941   0.0523 .
    #> mpg           1.1084     0.5770   1.921   0.0547 .
    #> am           10.1055    11.9104   0.848   0.3962  
    #> mpg:am       -0.6637     0.6242  -1.063   0.2877  
    #> ---
    #> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    #> 
    #> (Dispersion parameter for binomial family taken to be 1)
    #> 
    #>     Null deviance: 43.860  on 31  degrees of freedom
    #> Residual deviance: 19.125  on 28  degrees of freedom
    #> AIC: 27.125
    #> 
    #> Number of Fisher Scoring iterations: 7

TODO: Add comparison between interaction and non-interaction models.
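
As a starting point, the models fit above can be compared with a likelihood-ratio test via anova(); a minimal sketch:

    # The deviance drops from 20.646 (no interaction) to 19.125 (interaction),
    # a difference of 1.521 on 1 df (p is about 0.22), so the interaction term
    # does not significantly improve the fit here.
    anova(logr_vma, logr_vmai, test="Chisq")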