Coefficient plots in ggplot

Drawing coefficient plots in R and ggplot.

Author
Published

December 15, 2022

Recently a colleague asked how they could quickly draw a coefficient plot in R. In Stata, this is relatively simple - just run your regression and use the command coefplot afterwards. This produces a graphic that is perfectly acceptable for exploratory data analysis, but leaves something to be desired if you want to use it in a publication.

This post shows you how to draw coefficient plots in R and ggplot, and is extensible for use with regressions beyond the basic lm command.

Data

For this example we will use data from the lovely Palmer Penguins package from Allison Horst. The penguins dataset is a great toy dataset for exploration and visualization, based on genuine data collected by Dr. Kristen Gorman at the Palmer Station in Antarctica.

Basic regression

We begin with a basic regression where our dependent variable is penguin body weight in grams, and independent variables are the dimensions of the penguins’ bills, flipper length, as well as species and sex.

library(tidyverse)
theme_set(theme_light())

# read in data
df <- palmerpenguins::penguins

# basic regression
basic_reg <- df %>% 
    lm(body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + species + sex, data = .)

basic_reg

Call:
lm(formula = body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + 
    species + sex, data = .)

Coefficients:
      (Intercept)     bill_length_mm      bill_depth_mm  flipper_length_mm  
         -1460.99              18.20              67.22              15.95  
 speciesChinstrap      speciesGentoo            sexmale  
          -251.48            1014.63             389.89  

This produces a rather messy output. It includes both the regression specification and the coefficients.

Tidy with broom package

We can use the broom package to return a tibble, a neat data object that is easy to work with.

library(broom)

tidy_reg <- df %>% 
    lm(body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + species + sex, data = .) %>% 
    tidy()

tidy_reg
# A tibble: 7 × 5
  term              estimate std.error statistic  p.value
  <chr>                <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)        -1461.     571.       -2.56 1.10e- 2
2 bill_length_mm        18.2      7.11      2.56 1.09e- 2
3 bill_depth_mm         67.2     19.7       3.40 7.45e- 4
4 flipper_length_mm     16.0      2.91      5.48 8.44e- 8
5 speciesChinstrap    -251.      81.1      -3.10 2.09e- 3
6 speciesGentoo       1015.     130.        7.83 6.85e-14
7 sexmale              390.      47.8       8.15 7.97e-15

Great! This output is much easier to deal with.

Coefficient plot

Let’s try and make a coefficient plot.

tidy_reg %>%
  filter(term != "(Intercept)") %>%
  # reorder the coefficients so that the largest is at the top of the plot
  mutate(term = fct_reorder(term, estimate)) %>%
  ggplot(aes(estimate, term)) +
  geom_point() +
  # add in a dotted line at zero
  geom_vline(xintercept = 0, lty = 2) +
  labs(
    x = "Estimate of effect of variable on body mass (in grams)",
    y = NULL,
    title = "Coefficient plot"
  )

We can see that relative to Adelie penguins (the base category), Gentoo penguins weigh more and Chinstrap penguins weigh less.

Further, male penguins weigh more than females.

Error bars

To get the error bars, we specify that we want a confidence interval when we use the tidy command from the broom package, like so: tidy(conf.int = TRUE)

tidy_reg_conf_int <- df %>% 
    lm(body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + species + sex, data = .) %>% 
    tidy(conf.int = TRUE)

tidy_reg_conf_int
# A tibble: 7 × 7
  term              estimate std.error statistic  p.value conf.low conf.high
  <chr>                <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)        -1461.     571.       -2.56 1.10e- 2 -2585.      -337. 
2 bill_length_mm        18.2      7.11      2.56 1.09e- 2     4.22      32.2
3 bill_depth_mm         67.2     19.7       3.40 7.45e- 4    28.4      106. 
4 flipper_length_mm     16.0      2.91      5.48 8.44e- 8    10.2       21.7
5 speciesChinstrap    -251.      81.1      -3.10 2.09e- 3  -411.       -92.0
6 speciesGentoo       1015.     130.        7.83 6.85e-14   760.      1270. 
7 sexmale              390.      47.8       8.15 7.97e-15   296.       484. 

Now in our tibble we get columns called conf.low and conf.high.

To plot these, we use an additional geometry in our ggplot, called geom_errorbarh. Here the h at the end specifies we want it in the horizontal direction. We map the conf.low and conf.high variables to xmin and xmax respectively.

tidy_reg_conf_int %>%
  filter(term != "(Intercept)") %>%
  # reorder the coefficients so that the largest is at the top of the plot
  mutate(term = fct_reorder(term, estimate)) %>%
  ggplot(aes(estimate, term)) +
  geom_point() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high)) +
  # add in a dotted line at zero
  geom_vline(xintercept = 0, lty = 2) +
  labs(
    x = "Estimate of effect of variable on body mass (in grams)",
    y = NULL,
    title = "Coefficient plot with error bars"
  )

Because none of the error bars cross the dotted line at zero, we conclude that the point estimates of the effects are significantly different from zero at the five percent level.

Categories of predictor

Finally, let’s facet the variables by type.

We use the case_when command from dplyr to call any variable containing mm a numeric variable, and the others categorical. We colour our bars and points by variable type, and facet_wrap to make a small multiple chart. This makes it easy to differentiate between the different types of variables.

tidy_reg_conf_int %>%
  filter(term != "(Intercept)") %>%
  # create a type variable with case_when
  mutate(type = case_when(
    str_detect(term, "mm") ~ "Numeric variable",
    TRUE ~ "Categorical variable"
  )) %>%
  # reorder the coefficients so that the largest is at the top of the plot
  mutate(term = fct_reorder(term, estimate)) %>%
  ggplot(aes(estimate, term, colour = type)) +
  geom_point() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high)) +
  # add in a dotted line at zero
  geom_vline(xintercept = 0, lty = 2) +
  # remove the legend as the facet show that information
  theme(legend.position = "none") +
  # facet by type
  facet_wrap(~type, scales = "free_y") +
  labs(
    x = "Estimate of effect of variable on body mass (in grams)",
    y = NULL,
    title = "Coefficient plot with error bars",
    subtitle = "By variable type"
  )

In this way we can see that the scaling of the variables can have an impact on how we perceive the results. Because the bill and flipper dimensions are measured in mm, and the body mass in grams, we are seeing the effect of an additional mm of bill depth, for example, on body mass in grams, which appears quite small.

Conclusion

Great - I hope that this was useful! It’s possible to easily customise your coefficient plots in ggplot, adding labels and colours to help your reader understand your regression results in a glance.