Recent developments in R tidyverse

Dec 1, 2020

tidyverse

The tidyverse is an ecosystem of packages that share an underlying design philosophy, grammar, and data structures.

The tidyverse v1.3.0 loads 8 packages via library(tidyverse): ggplot2, dplyr, tidyr, purrr, tibble, stringr, forcats, readr.

It also installs, but does not automatically load, some other packages (e.g. lubridate).
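A minimal sketch of the installed-but-not-loaded distinction, assuming tidyverse v1.3.0 as described above. The date is only an illustrative value.

```r
library(tidyverse)

# lubridate is part of the tidyverse bundle ...
lubridate_bundled <- "lubridate" %in% tidyverse_packages()

# ... but under tidyverse 1.3.0 it must be attached explicitly before use
library(lubridate)

# parse a date and move it forward by one month
next_month <- ymd("2020-12-01") + months(1)
next_month
```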

Presentation motivation

Me using tidyverse packages in 2015:

Typically do stuff like

dat %>%
  filter(age >= 50) %>%
  group_by(var1, var2) %>%
  summarize(y_mean = mean(y),
            y_median = median(y))

and google “r ggplot2 rotate x axis labels”

Me using tidyverse packages in 2020:

Typically do stuff like

dat %>%
  filter(age >= 50) %>%
  group_by(var1, var2) %>%
  summarize(y_mean = mean(y),
            y_median = median(y))

and google “r ggplot2 rotate x axis labels”

Presentation content credits

(Post link here.) From this point on, roughly 80-90% of the slide content comes from this post, with small alterations on my side.

Outline

Palmer Penguins dataset

Selecting columns in data
Reordering columns in data
Controlling mutated column location
Transforming from wide to long
Transforming from long to wide
Running group statistics across multiple columns
Control how output columns are named when summarising across multiple columns
Running models across subsets of data
Nesting data
Graphing across subsets

Palmer Penguins dataset

“The goal of palmerpenguins is to provide a great dataset for data exploration & visualization, as an alternative to iris.”

The data set contains size measurements for three penguin species (Adelie, Chinstrap, Gentoo) observed on three islands in the Palmer Archipelago, Antarctica (Biscoe, Dream, Torgersen).

library(tidyverse)
library(palmerpenguins)
str(penguins)

## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

1. Selecting columns in data

To select columns using dplyr::select() or tidyr::pivot_longer() based on common conditions, use helper functions.

To select variables contained in a character vector: all_of(), any_of()
To select all (i.e. remaining) columns or the last column: everything(), last_col()
To select variables by matching patterns in their names: starts_with(), ends_with(), contains(), matches(), num_range()
To apply a custom function and select those for which the function returns TRUE: where()

penguins %>% 
  dplyr::select(!contains("_"), starts_with("bill")) %>% head(n = 3)

## # A tibble: 3 x 6
##   species island    sex     year bill_length_mm bill_depth_mm
##   <fct>   <fct>     <fct>  <int>          <dbl>         <dbl>
## 1 Adelie  Torgersen male    2007           39.1          18.7
## 2 Adelie  Torgersen female  2007           39.5          17.4
## 3 Adelie  Torgersen female  2007           40.3          18

my_select_func <- function(var_name){
  return(is.factor(var_name))
}

penguins %>% 
  dplyr::select(where(my_select_func))

## # A tibble: 344 x 3
##    species island    sex   
##    <fct>   <fct>     <fct> 
##  1 Adelie  Torgersen male  
##  2 Adelie  Torgersen female
##  3 Adelie  Torgersen female
##  4 Adelie  Torgersen <NA>  
##  5 Adelie  Torgersen female
##  6 Adelie  Torgersen male  
##  7 Adelie  Torgersen female
##  8 Adelie  Torgersen male  
##  9 Adelie  Torgersen <NA>  
## 10 Adelie  Torgersen <NA>  
## # … with 334 more rows
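The helpers from the list above that the two examples do not cover can be sketched the same way. This is an illustrative sketch only; the vector wanted and its deliberately missing name "not_a_column" are made up for the example.

```r
library(dplyr)
library(palmerpenguins)

# a character vector of column names; "not_a_column" is deliberately missing
wanted <- c("species", "body_mass_g", "not_a_column")

# any_of() silently skips names that do not exist in the data
sel_any <- penguins %>% select(any_of(wanted))

# all_of() is strict: every name must exist, otherwise it throws an error
sel_all <- penguins %>% select(all_of(c("species", "body_mass_g")))

# matches() takes a regular expression: columns ending in "_mm" or "_g"
sel_regex <- penguins %>% select(matches("_(mm|g)$"))

# where() also accepts an existing predicate function directly
sel_numeric <- penguins %>% select(where(is.numeric))

names(sel_regex)
```

Passing is.numeric straight to where() avoids writing a wrapper such as my_select_func when a suitable predicate already exists.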

2. Reordering columns in data

To reorder specific columns or sets of columns, use dplyr::relocate() with the .before or .after argument.

penguins %>% 
  dplyr::relocate(contains("_"), .after = year) %>% head(n = 3)

## # A tibble: 3 x 8
##   species island sex    year bill_length_mm bill_depth_mm flipper_length_…
##   <fct>   <fct>  <fct> <int>          <dbl>         <dbl>            <int>
## 1 Adelie  Torge… male   2007           39.1          18.7              181
## 2 Adelie  Torge… fema…  2007           39.5          17.4              186
## 3 Adelie  Torge… fema…  2007           40.3          18                195
## # … with 1 more variable: body_mass_g <int>

penguins %>% 
  dplyr::relocate(species, .after = last_col()) %>% head(n = 3)

## # A tibble: 3 x 8
##   island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex    year
##   <fct>           <dbl>         <dbl>            <int>       <int> <fct> <int>
## 1 Torge…           39.1          18.7              181        3750 male   2007
## 2 Torge…           39.5          17.4              186        3800 fema…  2007
## 3 Torge…           40.3          18                195        3250 fema…  2007
## # … with 1 more variable: species <fct>
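Two further variations, sketched on the original penguins data (without the penguinid column added later in these slides): relocate() without .before or .after, and relocate() combined with where().

```r
library(dplyr)
library(palmerpenguins)

# with neither .before nor .after, the selected columns move to the front
front <- penguins %>% relocate(year, sex)

# selection helpers work here too: put all integer columns before species
ints_first <- penguins %>% relocate(where(is.integer), .before = species)

names(front)
names(ints_first)
```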

3. Controlling mutated column location

To control the location of a newly added column, use the .before or .after argument of dplyr::mutate() (they work as in dplyr::relocate() above).

penguins <- penguins %>% 
  dplyr::mutate(penguinid = row_number(), .before = everything()) 

penguins

## # A tibble: 344 x 9
##    penguinid species island bill_length_mm bill_depth_mm flipper_length_…
##        <int> <fct>   <fct>           <dbl>         <dbl>            <int>
##  1         1 Adelie  Torge…           39.1          18.7              181
##  2         2 Adelie  Torge…           39.5          17.4              186
##  3         3 Adelie  Torge…           40.3          18                195
##  4         4 Adelie  Torge…           NA            NA                 NA
##  5         5 Adelie  Torge…           36.7          19.3              193
##  6         6 Adelie  Torge…           39.3          20.6              190
##  7         7 Adelie  Torge…           38.9          17.8              181
##  8         8 Adelie  Torge…           39.2          19.6              195
##  9         9 Adelie  Torge…           34.1          18.1              193
## 10        10 Adelie  Torge…           42            20.2              190
## # … with 334 more rows, and 3 more variables: body_mass_g <int>, sex <fct>,
## #   year <int>
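The .after argument works the same way. In the sketch below, body_mass_kg is a hypothetical derived column introduced only for illustration; it is placed right next to the column it was computed from.

```r
library(dplyr)
library(palmerpenguins)

# body_mass_kg is a made-up derived column, used only for this example
penguins_kg <- penguins %>%
  mutate(body_mass_kg = body_mass_g / 1000, .after = body_mass_g)

names(penguins_kg)
```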

4. Transforming from wide to long

To transform a data set from wide(r) to long(er) form, use tidyr::pivot_longer(), the updated replacement for the older tidyr::gather().

penguins %>% 
  tidyr::pivot_longer(cols = contains("_"),  # pivot these columns
                      names_to = "variable_name", # name of column containing "old columns" names
                      values_to = "variable_value")  # name of column containing "old columns" values

## # A tibble: 1,376 x 7
##    penguinid species island    sex     year variable_name     variable_value
##        <int> <fct>   <fct>     <fct>  <int> <chr>                      <dbl>
##  1         1 Adelie  Torgersen male    2007 bill_length_mm              39.1
##  2         1 Adelie  Torgersen male    2007 bill_depth_mm               18.7
##  3         1 Adelie  Torgersen male    2007 flipper_length_mm          181  
##  4         1 Adelie  Torgersen male    2007 body_mass_g               3750  
##  5         2 Adelie  Torgersen female  2007 bill_length_mm              39.5
##  6         2 Adelie  Torgersen female  2007 bill_depth_mm               17.4
##  7         2 Adelie  Torgersen female  2007 flipper_length_mm          186  
##  8         2 Adelie  Torgersen female  2007 body_mass_g               3800  
##  9         3 Adelie  Torgersen female  2007 bill_length_mm              40.3
## 10         3 Adelie  Torgersen female  2007 bill_depth_mm               18  
## # … with 1,366 more rows

# as previous example, but simultaneously split the names of columns 
# which we pivot into longer format by "_" separator 
penguins_longer <- penguins %>% 
  tidyr::pivot_longer(cols = contains("_"), # pivot these columns
                      names_sep = "_", 
                      names_to = c("part", "measure", "unit"), # name of column(s) containing "old columns" names
                      values_to = "measure_value" ) # name of column containing "old columns" values

penguins_longer

## # A tibble: 1,376 x 9
##    penguinid species island    sex     year part    measure unit  measure_value
##        <int> <fct>   <fct>     <fct>  <int> <chr>   <chr>   <chr>         <dbl>
##  1         1 Adelie  Torgersen male    2007 bill    length  mm             39.1
##  2         1 Adelie  Torgersen male    2007 bill    depth   mm             18.7
##  3         1 Adelie  Torgersen male    2007 flipper length  mm            181  
##  4         1 Adelie  Torgersen male    2007 body    mass    g            3750  
##  5         2 Adelie  Torgersen female  2007 bill    length  mm             39.5
##  6         2 Adelie  Torgersen female  2007 bill    depth   mm             17.4
##  7         2 Adelie  Torgersen female  2007 flipper length  mm            186  
##  8         2 Adelie  Torgersen female  2007 body    mass    g            3800  
##  9         3 Adelie  Torgersen female  2007 bill    length  mm             40.3
## 10         3 Adelie  Torgersen female  2007 bill    depth   mm             18  
## # … with 1,366 more rows
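Because a few penguins have missing measurements, the long form contains NA rows. A small sketch of dropping them during the pivot with the values_drop_na argument; the column names measurement and value_mm are arbitrary choices for this example.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# pivot only the millimetre measurements and drop rows whose value is missing
penguins_mm_long <- penguins %>%
  pivot_longer(cols = ends_with("_mm"),
               names_to = "measurement",
               values_to = "value_mm",
               values_drop_na = TRUE)

# number of non-missing millimetre measurements in the original wide data
n_observed <- sum(!is.na(select(penguins, ends_with("_mm"))))

nrow(penguins_mm_long) == n_observed
```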

5. Transforming from long to wide

To transform a data set from long(er) to wide(r) form, use tidyr::pivot_wider(), the updated replacement for the older tidyr::spread().

# revert from long form from previous example 
penguins_wider <- penguins_longer %>% 
  tidyr::pivot_wider(names_from = c("part", "measure", "unit"), # pivot these columns
                     values_from = "measure_value", # take the values from here
                     names_sep = "_") # combine col names using an underscore

penguins_wider

## # A tibble: 344 x 9
##    penguinid species island sex    year bill_length_mm bill_depth_mm
##        <int> <fct>   <fct>  <fct> <int>          <dbl>         <dbl>
##  1         1 Adelie  Torge… male   2007           39.1          18.7
##  2         2 Adelie  Torge… fema…  2007           39.5          17.4
##  3         3 Adelie  Torge… fema…  2007           40.3          18  
##  4         4 Adelie  Torge… <NA>   2007           NA            NA  
##  5         5 Adelie  Torge… fema…  2007           36.7          19.3
##  6         6 Adelie  Torge… male   2007           39.3          20.6
##  7         7 Adelie  Torge… fema…  2007           38.9          17.8
##  8         8 Adelie  Torge… male   2007           39.2          19.6
##  9         9 Adelie  Torge… <NA>   2007           34.1          18.1
## 10        10 Adelie  Torge… <NA>   2007           42            20.2
## # … with 334 more rows, and 2 more variables: flipper_length_mm <dbl>,
## #   body_mass_g <dbl>
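pivot_wider() is also handy for turning a long table of counts into a cross-table. A sketch, assuming tidyr 1.1.0 or later, where values_fill accepts a single value:

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

island_counts <- penguins %>%
  count(species, island) %>%        # long form: one row per observed species/island pair
  pivot_wider(names_from = island,  # one column per island
              values_from = n,
              values_fill = 0)      # pairs that never occur get 0 instead of NA

island_counts
```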

6. Running group statistics across multiple columns

To apply multiple summary statistics to multiple columns at once, use dplyr::across() inside dplyr::summarise().

# calculate mean and sd for each variable ending in mm, separately for each of the three species
penguin_stats <- penguins %>% 
  dplyr::group_by(species) %>% 
  dplyr::summarise(across(.cols = ends_with("mm"), 
                          .fns = list(~mean(.x, na.rm = TRUE), 
                                      ~sd(.x, na.rm = TRUE))))

penguin_stats

## # A tibble: 3 x 7
##   species bill_length_mm_1 bill_length_mm_2 bill_depth_mm_1 bill_depth_mm_2
##   <fct>              <dbl>            <dbl>           <dbl>           <dbl>
## 1 Adelie              38.8             2.66            18.3           1.22 
## 2 Chinst…             48.8             3.34            18.4           1.14 
## 3 Gentoo              47.5             3.08            15.0           0.981
## # … with 2 more variables: flipper_length_mm_1 <dbl>, flipper_length_mm_2 <dbl>
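across() is not limited to lists of functions or to summarise(). A brief sketch of two other common uses: one function inside mutate(), and one function over all numeric columns.

```r
library(dplyr)
library(palmerpenguins)

# divide every millimetre column by 10 (i.e. convert to cm);
# note that the column names are left unchanged and still end in "_mm"
penguins_cm <- penguins %>%
  mutate(across(ends_with("mm"), ~ .x / 10))

# summarise every numeric column with a single function;
# with one function, the output keeps the original column names
numeric_means <- penguins %>%
  group_by(species) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

numeric_means
```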

7. Control how output columns are named when summarising across multiple columns

To replace the default column names that across() gives to the summary variables, use the .names argument.

penguins_stats <- penguins %>% 
  dplyr::group_by(species) %>% 
  dplyr::summarise(across(.cols = ends_with("mm"), 
                          .fns = list(mean = ~mean(.x, na.rm = TRUE), 
                                      sd = ~sd(.x, na.rm = TRUE)),
                          .names = "{gsub('_|_mm', '', .col)}_{.fn}"))
penguins_stats

## # A tibble: 3 x 7
##   species billlength_mean billlength_sd billdepth_mean billdepth_sd
##   <fct>             <dbl>         <dbl>          <dbl>        <dbl>
## 1 Adelie             38.8          2.66           18.3        1.22 
## 2 Chinst…            48.8          3.34           18.4        1.14 
## 3 Gentoo             47.5          3.08           15.0        0.981
## # … with 2 more variables: flipperlength_mean <dbl>, flipperlength_sd <dbl>
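A simpler sketch of the same idea, without gsub(). The .names argument is a glue specification in which dplyr documents the placeholders {.col} (the input column name) and {.fn} (the name of the function in the list). The naming pattern below is only an illustration.

```r
library(dplyr)
library(palmerpenguins)

mass_stats <- penguins %>%
  group_by(species) %>%
  summarise(across(.cols = body_mass_g,
                   .fns = list(mean = ~ mean(.x, na.rm = TRUE),
                               sd = ~ sd(.x, na.rm = TRUE)),
                   .names = "{.fn}_of_{.col}"))  # function name first, then column name

names(mass_stats)
```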

8. Running models across subsets of data

Use dplyr::summarise() with list() to store richer outcomes in a list column, for example summary vectors, data frames, or other objects such as models or graphs.

penguin_models <- penguins %>% 
  dplyr::group_by(species) %>% 
  dplyr::summarise(model = list(lm(body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm)))  
penguin_models

## # A tibble: 3 x 2
##   species   model 
##   <fct>     <list>
## 1 Adelie    <lm>  
## 2 Chinstrap <lm>  
## 3 Gentoo    <lm>

# pull the fitted model for the first species (Adelie) out of the list column
model_tmp <- penguin_models$model[[1]]
# summary(model_tmp)

library(broom)

penguin_models <- penguins %>% 
  dplyr::group_by(species) %>% 
  dplyr::summarise(broom::glance(lm(body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm))) 

penguin_models

## # A tibble: 3 x 13
##   species r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC
##   <fct>       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl>
## 1 Adelie      0.508         0.498  325.      50.6 1.55e-22     3 -1086. 2181.
## 2 Chinst…     0.504         0.481  277.      21.7 8.48e-10     3  -477.  964.
## 3 Gentoo      0.625         0.615  313.      66.0 3.39e-25     3  -879. 1768.
## # … with 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
## #   nobs <int>
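When only one number per model is needed, the list column can also be processed with purrr. A sketch under the same model formula as above; the object names species_models and species_r2 are made up for this example.

```r
library(dplyr)
library(purrr)
library(palmerpenguins)

# one fitted model per species, stored in a list column
species_models <- penguins %>%
  group_by(species) %>%
  summarise(model = list(lm(body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm)))

# pull a single number out of each stored model;
# map_dbl() returns a plain numeric vector, one element per model
species_r2 <- species_models %>%
  mutate(r_squared = map_dbl(model, ~ summary(.x)$r.squared))

species_r2
```

Keeping the models in the list column means further quantities can be extracted later without refitting.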

9. Nesting data

To partition data into subsets so that we can apply a common function or operation across all subsets of the data, use dplyr::nest_by().

penguins %>% 
  nest_by(species)  %>%
  mutate(data_model = list(lm(body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm, data = data)))

## # A tibble: 3 x 3
## # Rowwise:  species
##   species                 data data_model
##   <fct>     <list<tbl_df[,8]>> <list>    
## 1 Adelie             [152 × 8] <lm>      
## 2 Chinstrap           [68 × 8] <lm>      
## 3 Gentoo             [124 × 8] <lm>
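The result of nest_by() is a rowwise data frame, so later verbs operate one row (one subset) at a time. A small sketch; n_penguins and mean_mass_g are illustrative column names.

```r
library(dplyr)
library(palmerpenguins)

species_summary <- penguins %>%
  nest_by(species) %>%
  # inside a rowwise mutate(), `data` is the single nested tibble for that row
  mutate(n_penguins = nrow(data),
         mean_mass_g = mean(data$body_mass_g, na.rm = TRUE)) %>%
  ungroup()  # drop the rowwise grouping once the per-row work is done

species_summary
```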

10. Graphing across subsets

To generate plots across data subsets and store them for further usage, use dplyr::nest_by() combined with plotting.

# generic function for generating a simple scatter plot in ggplot2
scatter_fn <- function(df, col1, col2, title) {
  df %>% 
    ggplot2::ggplot(aes(x = {{col1}}, y = {{col2}})) +
    ggplot2::geom_point() +
    ggplot2::geom_smooth() +
    ggplot2::labs(title = title)
}

# run function across species and store plots in a list column
penguin_scatters <- penguins %>% 
  dplyr::nest_by(species) %>% 
  dplyr::mutate(plot = list(scatter_fn(data, bill_length_mm, bill_depth_mm, species)))

penguin_scatters

## # A tibble: 3 x 3
## # Rowwise:  species
##   species                 data plot  
##   <fct>     <list<tbl_df[,8]>> <list>
## 1 Adelie             [152 × 8] <gg>  
## 2 Chinstrap           [68 × 8] <gg>  
## 3 Gentoo             [124 × 8] <gg>

p_all <- scatter_fn(penguins, bill_length_mm, bill_depth_mm, "All Species") 
# get species scatters from the plot list column of penguin_scatters
p_1 <- penguin_scatters$plot[[1]]  # Adelie
p_2 <- penguin_scatters$plot[[2]]  # Chinstrap
p_3 <- penguin_scatters$plot[[3]]  # Gentoo

# display nicely using patchwork in R Markdown
library(patchwork)
p_all /
(p_1 | p_2 | p_3)
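As an alternative to creating one variable per plot, patchwork::wrap_plots() accepts a list of ggplot objects directly. A self-contained sketch that rebuilds a simplified version of the per-species plots (points only, no smoother); species_plots is a made-up name.

```r
library(dplyr)
library(ggplot2)
library(patchwork)
library(palmerpenguins)

# one scatter plot per species, stored in a list column
species_plots <- penguins %>%
  nest_by(species) %>%
  mutate(plot = list(
    ggplot(data, aes(x = bill_length_mm, y = bill_depth_mm)) +
      geom_point() +
      labs(title = as.character(species))
  ))

# wrap_plots() takes the whole list, so no per-plot variables are needed
combined <- wrap_plots(species_plots$plot, nrow = 1)
combined
```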

Thank you!

Credit:

Slides in this presentation are very heavily based on “Ten Up-To-Date Ways to do Common Data Tasks in R” post by Keith McNulty.

Resources:

Tidy data paper by Wickham, Hadley (2014). Journal of Statistical Software.
Welcome to the Tidyverse paper by Wickham, Hadley et al. (2019). Journal of Open Source Software.
Tidyverse blog, with updates published as blog posts