Coefficient plots in ggplot
Drawing coefficient plots in R and ggplot
.
Recently a colleague asked how they could quickly draw a coefficient plot in R. In Stata, this is relatively simple - just run your regression and use the command coefplot
afterwards. This produces a graphic that is perfectly acceptable for exploratory data analysis, but leaves something to be desired if you want to use it in a publication.
This post shows you how to draw coefficient plots in R and ggplot, and is extensible for use with regressions beyond the basic lm
command.
Data
For this example we will use data from the lovely Palmer Penguins package from Allison Horst. The penguins dataset is a great toy dataset for exploration and visualization, based on genuine data collected by Dr. Kristen Gorman at the Palmer Station in Antarctica.
Basic regression
We begin with a basic regression where our dependent variable is penguin body weight in grams, and independent variables are the dimensions of the penguins’ bills, flipper length, as well as species and sex.
library(tidyverse)
theme_set(theme_light())
# read in data
<- palmerpenguins::penguins
df
# basic regression
<- df %>%
basic_reg lm(body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + species + sex, data = .)
basic_reg
Call:
lm(formula = body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm +
species + sex, data = .)
Coefficients:
(Intercept) bill_length_mm bill_depth_mm flipper_length_mm
-1460.99 18.20 67.22 15.95
speciesChinstrap speciesGentoo sexmale
-251.48 1014.63 389.89
This produces a rather messy output. It includes both the regression specification and the coefficients.
Tidy with broom package
We can use the broom package to return a tibble
, a neat data object that is easy to work with.
library(broom)
<- df %>%
tidy_reg lm(body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + species + sex, data = .) %>%
tidy()
tidy_reg
# A tibble: 7 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -1461. 571. -2.56 1.10e- 2
2 bill_length_mm 18.2 7.11 2.56 1.09e- 2
3 bill_depth_mm 67.2 19.7 3.40 7.45e- 4
4 flipper_length_mm 16.0 2.91 5.48 8.44e- 8
5 speciesChinstrap -251. 81.1 -3.10 2.09e- 3
6 speciesGentoo 1015. 130. 7.83 6.85e-14
7 sexmale 390. 47.8 8.15 7.97e-15
Great! This output is much easier to deal with.
Coefficient plot
Let’s try and make a coefficient plot.
%>%
tidy_reg filter(term != "(Intercept)") %>%
# reorder the coefficients so that the largest is at the top of the plot
mutate(term = fct_reorder(term, estimate)) %>%
ggplot(aes(estimate, term)) +
geom_point() +
# add in a dotted line at zero
geom_vline(xintercept = 0, lty = 2) +
labs(
x = "Estimate of effect of variable on body mass (in grams)",
y = NULL,
title = "Coefficient plot"
)
We can see that relative to Adelie penguins (the base category), Gentoo penguins weigh more and Chinstrap penguins weigh less.
Further, male penguins weigh more than females.
Error bars
To get the error bars, we specify that we want a confidence interval when we use the tidy
command from the broom package, like so: tidy(conf.int = TRUE)
<- df %>%
tidy_reg_conf_int lm(body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + species + sex, data = .) %>%
tidy(conf.int = TRUE)
tidy_reg_conf_int
# A tibble: 7 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -1461. 571. -2.56 1.10e- 2 -2585. -337.
2 bill_length_mm 18.2 7.11 2.56 1.09e- 2 4.22 32.2
3 bill_depth_mm 67.2 19.7 3.40 7.45e- 4 28.4 106.
4 flipper_length_mm 16.0 2.91 5.48 8.44e- 8 10.2 21.7
5 speciesChinstrap -251. 81.1 -3.10 2.09e- 3 -411. -92.0
6 speciesGentoo 1015. 130. 7.83 6.85e-14 760. 1270.
7 sexmale 390. 47.8 8.15 7.97e-15 296. 484.
Now in our tibble
we get columns called conf.low
and conf.high
.
To plot these, we use an additional geometry in our ggplot, called geom_errorbarh
. Here the h
at the end specifies we want it in the horizontal direction. We map the conf.low
and conf.high
variables to xmin
and xmax
respectively.
%>%
tidy_reg_conf_int filter(term != "(Intercept)") %>%
# reorder the coefficients so that the largest is at the top of the plot
mutate(term = fct_reorder(term, estimate)) %>%
ggplot(aes(estimate, term)) +
geom_point() +
geom_errorbarh(aes(xmin = conf.low, xmax = conf.high)) +
# add in a dotted line at zero
geom_vline(xintercept = 0, lty = 2) +
labs(
x = "Estimate of effect of variable on body mass (in grams)",
y = NULL,
title = "Coefficient plot with error bars"
)
Because none of the error bars cross the dotted line at zero, we conclude that the point estimates of the effects are significantly different from zero at the five percent level.
Categories of predictor
Finally, let’s facet the variables by type.
We use the case_when
command from dplyr to call any variable containing mm
a numeric variable, and the others categorical. We colour our bars and points by variable type, and facet_wrap
to make a small multiple chart. This makes it easy to differentiate between the different types of variables.
%>%
tidy_reg_conf_int filter(term != "(Intercept)") %>%
# create a type variable with case_when
mutate(type = case_when(
str_detect(term, "mm") ~ "Numeric variable",
TRUE ~ "Categorical variable"
%>%
)) # reorder the coefficients so that the largest is at the top of the plot
mutate(term = fct_reorder(term, estimate)) %>%
ggplot(aes(estimate, term, colour = type)) +
geom_point() +
geom_errorbarh(aes(xmin = conf.low, xmax = conf.high)) +
# add in a dotted line at zero
geom_vline(xintercept = 0, lty = 2) +
# remove the legend as the facet show that information
theme(legend.position = "none") +
# facet by type
facet_wrap(~type, scales = "free_y") +
labs(
x = "Estimate of effect of variable on body mass (in grams)",
y = NULL,
title = "Coefficient plot with error bars",
subtitle = "By variable type"
)
In this way we can see that the scaling of the variables can have an impact on how we perceive the results. Because the bill and flipper dimensions are measured in mm, and the body mass in grams, we are seeing the effect of an additional mm of bill depth, for example, on body mass in grams, which appears quite small.
Conclusion
Great - I hope that this was useful! It’s possible to easily customise your coefficient plots in ggplot, adding labels and colours to help your reader understand your regression results in a glance.