LGEO2185 – Introduction to dplyr and the tidyverse

Author

Kristof Van Oost, Antoine Stevens & Valentin Charlier

1. Introduction

In this session, we will explore the tidyverse, a suite of R packages designed to make common data science tasks more intuitive and consistent. The tidyverse philosophy emphasizes human-readable code, a consistent syntax, and functions that work together, especially thanks to the pipe operator (introduced in Section 1.3). This approach can often streamline your data analysis workflow.

1.1 What is the tidyverse?

The tidyverse is a collection of R packages that share a common design philosophy.
Core packages include:
- dplyr for data manipulation,
- tidyr for reshaping or “tidying” data,
- readr for efficient data import,
- purrr for functional programming,
- ggplot2 for data visualization,
- tibble for modern data frames,
- stringr for string manipulation,
- lubridate for working with dates.
All these packages are designed to work together, offering consistent function names, arguments, and data structures.

1.2 Why use dplyr?

dplyr provides a “grammar of data manipulation” that helps you:

- Write code that reflects your thinking.
- Focus on what you want to do (e.g., select columns, filter rows), not on the intricacies of R’s base syntax.
- Create clear workflows with the pipe operator |>, which makes it easy to chain multiple transformations.

The main verbs of dplyr (select(), filter(), mutate(), arrange(), group_by(), and summarise()) cover the most common data manipulation steps. Because these functions take a data frame or tibble as their first argument and return one, you can seamlessly pass their results to other tidyverse packages like ggplot2 for plotting or tidyr for reshaping.

1.3 The rationale behind the pipe

The pipe was popularized in R by the magrittr package as %>%, and since R 4.1.0 a native pipe operator |> (sometimes called the “forward pipe”) is built into the language. It is now a key part of the tidyverse workflow: you can chain multiple transformations without repeatedly referencing your dataset.

- It takes the value on the left and passes it as the first argument to the function on the right.
- This often improves readability by letting you “read” the code from left to right, as if you’re telling R a story: “Take my dataset, then do X, then do Y, then do Z, and so on…”
- It reduces repetitive references to the same dataset in your code.

Together, the pipe and dplyr’s verbs make it easy to write compact, readable code that directly reflects the steps of your analysis.
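As a minimal sketch of the "take this, then do that" idea, the example below writes the same computation twice, once as nested calls and once with the pipe. It uses only base R; the heights vector is made-up illustration data, not part of the course material.

```r
# Hypothetical toy data, for illustration only
heights <- c(172, 167, 96, 202, 150)

# Nested calls: must be read from the inside out
nested <- round(mean(head(sort(heights), 3)), 1)

# Piped version: reads left to right, one step per line
piped <- heights |>
    sort() |>   # order from smallest to largest
    head(3) |>  # keep the three smallest values
    mean() |>   # average them
    round(1)    # round to one decimal

# Both approaches give the same number
identical(nested, piped)
```

Each step of the piped version receives the result of the previous step as its first argument, which is exactly what the nested version does implicitly.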

Note

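Since R 4.1.0, the built-in pipe operator |> can replace the older %>% from the magrittr package in most everyday code. The two are not strictly identical (for example, they use different placeholder syntax), but for the simple chains used in this session they behave the same way.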
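One practical difference is worth a short sketch. When the piped value should not go into the first argument, the native pipe uses the _ placeholder, which must be passed to a named argument and assumes R 4.2.0 or later; magrittr's %>% uses a dot instead. The example uses the built-in mtcars dataset and base R only.

```r
# The left-hand side becomes the first argument of the call on the right.
# Note that the right-hand side must be a call: `x |> sqrt` is a syntax error.
roots <- c(4, 9, 16) |> sqrt()

# lm() expects the data in its `data` argument, not the first one,
# so we point the piped value there with the `_` placeholder (R >= 4.2.0)
fit_pipe <- mtcars |> lm(mpg ~ wt, data = _)

# Equivalent call without the pipe
fit_plain <- lm(mpg ~ wt, data = mtcars)

all.equal(coef(fit_pipe), coef(fit_plain))
```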

2. Working with Tibbles

# If you haven't installed the tidyverse yet, run:
# install.packages("tidyverse")

library(tidyverse)

Tibbles are a modern re-imagining of data frames. They behave like data frames but are optimized for typical data science tasks. Let’s load the starwars dataset that comes with dplyr.

# Load the starwars dataset
starwars <- dplyr::starwars

# Check its class
class(starwars)

[1] "tbl_df"     "tbl"        "data.frame"

# starwars is a tibble, so printing it looks a bit different than a regular data frame
starwars

# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

print(starwars, n = 15) # You can choose how many rows to display

# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
11 Anakin …    188    84 blond      fair       blue            41.9 male  mascu…
12 Wilhuff…    180    NA auburn, g… fair       blue            64   male  mascu…
13 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
14 Han Solo    180    80 brown      fair       brown           29   male  mascu…
15 Greedo      173    74 <NA>       green      black           44   male  mascu…
# ℹ 72 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Notice that tibbles only show a few rows and columns to keep the console tidy. They also don’t automatically convert character data to factors, which can be helpful for data manipulation.
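Printing is not the only difference. The sketch below, built on a small made-up data frame, illustrates two other behaviours that tend to matter in practice: subsetting a single column with [ and partial matching of column names with $.

```r
library(tibble)

# Hypothetical toy data, for illustration only
df <- data.frame(id = 1:3, label = c("a", "b", "c"))
tbl <- as_tibble(df)

# Selecting one column with [ : a data frame drops to a plain vector,
# while a tibble always stays a tibble
class(df[, "label"])
class(tbl[, "label"])

# To extract the vector from a tibble, use [[ or $
tbl[["label"]]

# Partial name matching: a data frame silently matches `lab` to `label`,
# a tibble returns NULL (with a warning, suppressed here)
df$lab
suppressWarnings(tbl$lab)
```

The stricter tibble behaviour makes results more predictable inside functions and pipelines, at the cost of being slightly more verbose when you really want a vector.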

3. Selecting columns

A common task is to select only the columns you need from a dataset. Let’s compare base R to dplyr.

3.1 Base R approaches

# Select columns by name in base R
selection_base1 <- starwars[, c("name", "homeworld")]

# Select columns by index in base R (name is column 1, homeworld is column 10)
selection_base2 <- starwars[, c(1, 10)]

Both approaches work, but we often have to keep track of which index corresponds to which variable name. This can get cumbersome for large datasets.

3.2 dplyr approach

# Using dplyr::select()
selection_dplyr <- starwars |>
    select(name, homeworld)

# Alternatively, you could write:
selection_dplyr2 <- select(starwars, name, homeworld)

The |> (pipe) operator takes the output of the left-hand side and feeds it into the first argument of the function on the right-hand side. It can help make code more readable, especially when you chain multiple operations.

3.3 More advanced selection

# Select all columns EXCEPT name
starwars |> select(-name)

# A tibble: 87 × 13
   height  mass hair_color    skin_color  eye_color birth_year sex    gender   
    <int> <dbl> <chr>         <chr>       <chr>          <dbl> <chr>  <chr>    
 1    172    77 blond         fair        blue            19   male   masculine
 2    167    75 <NA>          gold        yellow         112   none   masculine
 3     96    32 <NA>          white, blue red             33   none   masculine
 4    202   136 none          white       yellow          41.9 male   masculine
 5    150    49 brown         light       brown           19   female feminine 
 6    178   120 brown, grey   light       blue            52   male   masculine
 7    165    75 brown         light       blue            47   female feminine 
 8     97    32 <NA>          white, red  red             NA   none   masculine
 9    183    84 black         light       brown           24   male   masculine
10    182    77 auburn, white fair        blue-gray       57   male   masculine
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

# Select columns containing an underscore
starwars |> select(contains("_"))

# A tibble: 87 × 4
   hair_color    skin_color  eye_color birth_year
   <chr>         <chr>       <chr>          <dbl>
 1 blond         fair        blue            19  
 2 <NA>          gold        yellow         112  
 3 <NA>          white, blue red             33  
 4 none          white       yellow          41.9
 5 brown         light       brown           19  
 6 brown, grey   light       blue            52  
 7 brown         light       blue            47  
 8 <NA>          white, red  red             NA  
 9 black         light       brown           24  
10 auburn, white fair        blue-gray       57  
# ℹ 77 more rows

# Select columns starting with the letter "s"
starwars |> select(starts_with("s"))

# A tibble: 87 × 4
   skin_color  sex    species starships
   <chr>       <chr>  <chr>   <list>   
 1 fair        male   Human   <chr [2]>
 2 gold        none   Droid   <chr [0]>
 3 white, blue none   Droid   <chr [0]>
 4 white       male   Human   <chr [1]>
 5 light       female Human   <chr [0]>
 6 light       male   Human   <chr [0]>
 7 light       female Human   <chr [0]>
 8 white, red  none   Droid   <chr [0]>
 9 light       male   Human   <chr [1]>
10 fair        male   Human   <chr [5]>
# ℹ 77 more rows

These specialized selection helpers (contains, starts_with) are part of tidyselect, which comes bundled with dplyr.
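A few more helpers work the same way. The sketch below shows ends_with(), where() with a predicate function, and a name range; it assumes the starwars dataset as shipped with a recent dplyr (14 columns, as printed above).

```r
library(dplyr)

# Columns whose names end with "color", plus name
color_cols <- starwars |> select(name, ends_with("color"))

# All numeric columns, selected by type rather than by name
numeric_cols <- starwars |> select(where(is.numeric))

# A contiguous range of columns, from name to mass
range_cols <- starwars |> select(name:mass)

names(color_cols)
names(numeric_cols)
names(range_cols)
```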

4. Filtering rows

Another frequent operation is to filter rows based on certain conditions (e.g., we only want characters of a certain species or homeworld).

4.1 Base R approaches

# We might check which rows match a condition like so:
starwars$species == "Human"

 [1]  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
[13] FALSE  TRUE FALSE FALSE  TRUE    NA FALSE  TRUE  TRUE FALSE FALSE  TRUE
[25]  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE
[37] FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE    NA    NA
[61]  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
[73]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE    NA FALSE  TRUE  TRUE
[85]  TRUE FALSE  TRUE

# Then use it to subset the data frame.
# Caution: rows where species is NA come back as all-NA rows with this approach.
selection_base <- starwars[starwars$species == "Human", ]

# If you tried this without specifying the dataset columns explicitly, it wouldn't work:
# starwars[species == "Human", ] # This fails in base R unless you attach the data

4.2 dplyr approach

# Using dplyr to filter rows:
selection_dplyr <- starwars |>
    filter(species == "Human")

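Filtering on multiple conditions in base R usually involves combining them with & (and) or | (or), often wrapped in which() so that rows where a condition evaluates to NA are dropped. In dplyr, conditions separated by commas are combined with “and”, and filter() drops rows where the condition is NA automatically.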

# Base R multiple conditions:
selection_base_multiple <- starwars[which(starwars$species == "Human" &
    starwars$homeworld == "Tatooine"), ]

# dplyr multiple conditions:
selection_dplyr_multiple <- starwars |>
    filter(species == "Human", homeworld == "Tatooine")

Notice how the tidyverse version is more concise and readable.
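The handling of missing values is easy to overlook, so here is a small sketch on made-up data (the toy data frame and its values are purely illustrative). It compares plain logical subsetting, which() and filter(), and also shows how to express an "or" condition in dplyr.

```r
library(dplyr)

# Hypothetical toy data with some missing values
toy <- data.frame(
    name = c("A", "B", "C", "D"),
    species = c("Human", NA, "Droid", "Human"),
    homeworld = c("Tatooine", "Naboo", "Tatooine", NA)
)

# Plain logical subsetting keeps an all-NA row where species is NA
base_rows <- toy[toy$species == "Human", ]

# which() keeps only the positions that are TRUE
which_rows <- toy[which(toy$species == "Human"), ]

# filter() also keeps only the rows where the condition is TRUE
dplyr_rows <- toy |> filter(species == "Human")

# Commas mean "and"; use | for "or"
and_rows <- toy |> filter(species == "Human", homeworld == "Tatooine")
or_rows <- toy |> filter(species == "Droid" | homeworld == "Naboo")

nrow(base_rows)
nrow(which_rows)
nrow(dplyr_rows)
```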

5. Combining Filter and Select

Often, you need to both select certain rows and only keep certain columns. Let’s see how a traditional approach compares with a dplyr approach.

5.1 Base R approach:

# which() drops rows where species is NA, matching what filter() does
selection_traditional <- starwars[
    which(starwars$species == "Human"),
    c("name", "height", "birth_year")
]

5.2 dplyr approach

selection_dplyr3 <- starwars |>
    filter(species == "Human") |>
    select(name, height, birth_year)

Because of the pipe, it’s easy to see the logical flow: first filter for humans, then select the columns we need.
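The same pattern extends to the other verbs mentioned in the introduction. As a sketch, the chain below adds mutate() to create a column and arrange() to sort the rows; the column name height_m is our own choice for this illustration.

```r
library(dplyr)

tallest_humans <- starwars |>
    filter(species == "Human", !is.na(height)) |> # humans with a known height
    mutate(height_m = height / 100) |>            # new column: height in metres
    arrange(desc(height_m)) |>                    # tallest first
    select(name, height_m)                        # keep only what we need

head(tallest_humans, 3)
```

Reordering the steps changes the result only where a later step depends on an earlier one: here select() must come after mutate(), otherwise height_m would not exist yet.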

6. Grouping and summarizing

A key feature of dplyr is the ability to work with grouped data. You can summarize variables by categories (e.g., species).

6.1 Base R approach:

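aggregate(): Groups by ‘species’ and applies the mean function to the height and mass columns.
na.rm = TRUE ignores missing values. Note, however, that the formula interface of aggregate() first drops every row with a missing height or mass (its default is na.action = na.omit), so na.rm = TRUE has no effect here and the result can differ from the dplyr version in Section 6.2.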

mean_stats_by_species_base <- aggregate(
    cbind(height, mass) ~ species,
    data = starwars,
    FUN = function(x) mean(x, na.rm = TRUE)
)

mean_stats_by_species_base

          species height    mass
1          Aleena  79.00   15.00
2        Besalisk 198.00  102.00
3          Cerean 198.00   82.00
4        Clawdite 168.00   55.00
5           Droid 140.00   69.75
6             Dug 112.00   40.00
7            Ewok  88.00   20.00
8       Geonosian 183.00   80.00
9          Gungan 210.00   74.00
10          Human 180.25   81.31
11           Hutt 175.00 1358.00
12        Kaleesh 216.00  159.00
13       Kaminoan 229.00   88.00
14        Kel Dor 188.00   80.00
15       Mirialan 168.00   53.10
16   Mon Calamari 180.00   83.00
17       Nautolan 196.00   87.00
18      Neimodian 191.00   90.00
19         Pau'an 206.00   80.00
20         Rodian 173.00   74.00
21        Skakoan 193.00   48.00
22      Sullustan 160.00   68.00
23     Tholothian 184.00   50.00
24        Togruta 178.00   57.00
25          Toong 163.00   65.00
26     Trandoshan 190.00  113.00
27        Twi'lek 178.00   55.00
28     Vulptereen  94.00   45.00
29        Wookiee 231.00  124.00
30 Yoda's species  66.00   17.00
31         Zabrak 175.00   80.00

6.2 dplyr approach

# Summarise mean height and mass by species, ignoring NA values:
mean_stats_by_species <- starwars |>
    group_by(species) |>
    summarise_at(vars(height, mass), list(~ mean(.x, na.rm = TRUE)))

mean_stats_by_species

# A tibble: 38 × 3
   species   height  mass
   <chr>      <dbl> <dbl>
 1 Aleena       79   15  
 2 Besalisk    198  102  
 3 Cerean      198   82  
 4 Chagrian    196  NaN  
 5 Clawdite    168   55  
 6 Droid       131.  69.8
 7 Dug         112   40  
 8 Ewok         88   20  
 9 Geonosian   183   80  
10 Gungan      209.  74  
# ℹ 28 more rows

Here’s what’s going on:

- group_by(species): Tells R to treat each species as a group.
- summarise_at(vars(height, mass), ...): Apply a function (mean) to the specified columns.
- na.rm = TRUE: Ignores missing values (NA).

Alternatively, you could write:

mean_stats_by_species_alternative <- starwars |>
    group_by(species) |>
    summarise(across(c(height, mass), ~ mean(.x, na.rm = TRUE)))

Both methods give the same result. across() is the newer approach; the scoped variants such as summarise_at() are superseded, so prefer across() in new code.
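The base R and dplyr tables above do not agree (31 versus 38 species, and a mean Droid height of 140 versus about 131). The sketch below reproduces the cause on a small made-up data frame: with a formula, aggregate() removes incomplete rows before averaging unless na.action = na.pass is supplied. The toy data and the helper mean_na() are our own illustration.

```r
library(dplyr)

# Hypothetical toy data: each group has one incomplete row
toy <- data.frame(
    group = c("a", "a", "b", "b"),
    height = c(1, NA, 3, 5),
    mass = c(10, 20, NA, 40)
)

mean_na <- function(x) mean(x, na.rm = TRUE)

# Default: rows 2 and 3 are dropped entirely before averaging
agg_default <- aggregate(cbind(height, mass) ~ group, data = toy, FUN = mean_na)

# na.pass keeps all rows, so each column is averaged over its own non-missing values
agg_pass <- aggregate(cbind(height, mass) ~ group,
    data = toy, FUN = mean_na, na.action = na.pass
)

# dplyr averages each column over its own non-missing values
dplyr_means <- toy |>
    group_by(group) |>
    summarise(across(c(height, mass), mean_na))

agg_default
agg_pass
dplyr_means
```

A further difference on the real data is that aggregate() omits rows whose grouping variable is missing, whereas group_by() keeps NA as a group of its own.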

7. A Quick glimpse at other tidyverse packages

The tidyverse is more than just dplyr. Here are a few packages you might find useful:

- tidyr: Tools for reshaping data (e.g., pivot_longer() and pivot_wider() to move between “wide” and “long” formats).
- readr: Fast reading/writing of flat files (read_csv(), write_csv(), etc.).
- stringr: String manipulation functions (e.g., str_detect(), str_replace()).
- purrr: Functional programming tools for iteration (map(), map_dbl()).
- lubridate: Simplifies working with dates and times.

7.1 Example with lubridate

# Load lubridate if not already loaded
library(lubridate)

# Some date strings in various formats
my_dates <- c("2025-02-05", "2024-12-31", "2023/01/01")

# Parse them into actual Date objects using ymd() (year-month-day)
parsed_dates <- ymd(my_dates)
parsed_dates

[1] "2025-02-05" "2024-12-31" "2023-01-01"

# Extract the components
year(parsed_dates)

[1] 2025 2024 2023

month(parsed_dates, label = TRUE)

[1] Feb Dec Jan
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec

day(parsed_dates)

[1]  5 31  1

With lubridate, you can quickly parse dates from messy data, extract specific components (e.g., year, month, day), and perform date arithmetic.
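Date arithmetic is mentioned above but not shown, so here is a brief sketch. It uses the same dates as the example plus one extra date of our own, and relies on the lubridate functions days(), months(), dmy() and the %m+% operator.

```r
library(lubridate)

start <- ymd("2024-12-31")
end <- ymd("2025-02-05")

# Number of days between two dates
n_days <- as.numeric(end - start)

# Add a fixed number of days
later <- end + days(30)

# Adding a month to 31 January has no exact answer:
# plain `+` gives NA, while %m+% rolls back to the last valid day
naive_month <- ymd("2025-01-31") + months(1)
safe_month <- ymd("2025-01-31") %m+% months(1)

# Other parsers handle other orderings, e.g. day-month-year
same_day <- dmy("31/12/2024") == start

n_days
later
safe_month
```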

7.2 Example with tidyr

While dplyr focuses on row/column operations, tidyr focuses on reshaping data between “wide” and “long” formats. This is especially useful when preparing data for different types of analyses or visualizations:

- Some analyses or plots require a “long” format, where each observation is a single row.
- Other tools or methods may require a “wide” format, where repeated measurements or metrics are in separate columns.
- A “tidy” structure plays well with other tidyverse packages (like ggplot2 or dplyr).

Let’s take a smaller slice of the starwars dataset to demonstrate how you might pivot it between wide and long formats. We’ll use only a few columns (name, homeworld, height, and mass) to keep things manageable.

7.2.1 Subset the starwars data

# Create a smaller starwars dataset
simple_starwars <- starwars |>
    select(name, homeworld, height, mass)

# Take a look
head(simple_starwars)

# A tibble: 6 × 4
  name           homeworld height  mass
  <chr>          <chr>      <int> <dbl>
1 Luke Skywalker Tatooine     172    77
2 C-3PO          Tatooine     167    75
3 R2-D2          Naboo         96    32
4 Darth Vader    Tatooine     202   136
5 Leia Organa    Alderaan     150    49
6 Owen Lars      Tatooine     178   120

7.2.2 Base R approach (using reshape)

Base R has a function called reshape() in the stats package to move between wide and long formats. It can be somewhat cumbersome compared to tidyverse functions.

- reshape() needs an id variable to uniquely identify rows; if you do not supply one through idvar, it adds an id column for you.
- You specify which columns are “varying” over time or measurement, and which ones are “fixed.”

# 1) Create a smaller starwars dataset
simple_starwars <- starwars[, c("name", "homeworld", "height", "mass")]

# 2) "Wide" to "Long" using base R reshape()
# No idvar is given, so reshape() adds an `id` column to keep track of rows
base_long <- reshape(
    data      = as.data.frame(simple_starwars),
    timevar   = "measure", # New index column: 1 = height, 2 = mass
    varying   = list(c("height", "mass")), # Columns to reshape
    v.names   = "value", # New column name for the measured values
    direction = "long"
)

head(base_long)

              name homeworld measure value id
1.1 Luke Skywalker  Tatooine       1   172  1
2.1          C-3PO  Tatooine       1   167  2
3.1          R2-D2     Naboo       1    96  3
4.1    Darth Vader  Tatooine       1   202  4
5.1    Leia Organa  Alderaan       1   150  5
6.1      Owen Lars  Tatooine       1   178  6

To go back from long to wide in base R, you would again call reshape() and specify direction = "wide".

# "Long" back to "Wide" using base R
base_wide <- reshape(
    data      = base_long,
    timevar   = "measure",
    idvar     = c("id", "name", "homeworld"),
    direction = "wide"
)

head(base_wide)

              name homeworld id value.1 value.2
1.1 Luke Skywalker  Tatooine  1     172      77
2.1          C-3PO  Tatooine  2     167      75
3.1          R2-D2     Naboo  3      96      32
4.1    Darth Vader  Tatooine  4     202     136
5.1    Leia Organa  Alderaan  5     150      49
6.1      Owen Lars  Tatooine  6     178     120

7.2.3 tidyr approach

With tidyr, you can use pivot_longer() and pivot_wider() directly. This often reads more intuitively.

# 1) pivot_longer(): wide -> long
tidy_long <- simple_starwars |>
    pivot_longer(
        cols       = c(height, mass), # which columns to stack
        names_to   = "measure", # new column with measure name
        values_to  = "value" # new column with measure value
    )

tidy_long

# A tibble: 174 × 4
   name           homeworld measure value
   <chr>          <chr>     <chr>   <dbl>
 1 Luke Skywalker Tatooine  height    172
 2 Luke Skywalker Tatooine  mass       77
 3 C-3PO          Tatooine  height    167
 4 C-3PO          Tatooine  mass       75
 5 R2-D2          Naboo     height     96
 6 R2-D2          Naboo     mass       32
 7 Darth Vader    Tatooine  height    202
 8 Darth Vader    Tatooine  mass      136
 9 Leia Organa    Alderaan  height    150
10 Leia Organa    Alderaan  mass       49
# ℹ 164 more rows

# 2) pivot_wider(): long -> wide
tidy_wide <- tidy_long |>
    pivot_wider(
        names_from  = measure, # which column to use for new column names
        values_from = value # which column to use for new values
    )

tidy_wide

# A tibble: 87 × 4
   name               homeworld height  mass
   <chr>              <chr>      <dbl> <dbl>
 1 Luke Skywalker     Tatooine     172    77
 2 C-3PO              Tatooine     167    75
 3 R2-D2              Naboo         96    32
 4 Darth Vader        Tatooine     202   136
 5 Leia Organa        Alderaan     150    49
 6 Owen Lars          Tatooine     178   120
 7 Beru Whitesun Lars Tatooine     165    75
 8 R5-D4              Tatooine      97    32
 9 Biggs Darklighter  Tatooine     183    84
10 Obi-Wan Kenobi     Stewjon      182    77
# ℹ 77 more rows

Notice how you don’t need to create or manage an ID column: pivot_wider() uses the remaining columns (here name and homeworld) to identify each row. The code is shorter and more descriptive of your intent: “pivot these columns, naming them this, putting their values here.”

We’re back in the “wide” format, but this code is more explicit about what’s happening:
- names_from = measure: The column containing old column names (“height” and “mass”) is used to create new wide-format column headers.
- values_from = value: The column containing the numeric measurements is used to populate each row.

7.3 Additional tidyr functions: `separate()` and `unite()`

Apart from pivoting, tidyr also offers functions for splitting or combining columns:

- separate() splits one column into multiple columns.
- unite() does the opposite: it combines multiple columns into one.

For instance, if you had a column like "Skywalker, Luke" but you want separate columns for last name and first name:

names_df <- tibble(full_name = c("Skywalker, Luke", "Solo, Han", "Organa, Leia"))

# separate()
separated_df <- names_df |>
    separate(full_name, into = c("last_name", "first_name"), sep = ", ")

separated_df

# A tibble: 3 × 2
  last_name first_name
  <chr>     <chr>     
1 Skywalker Luke      
2 Solo      Han       
3 Organa    Leia

To recombine:

united_df <- separated_df |>
    unite(col = "full_name_again", last_name, first_name, sep = ", ")

united_df

# A tibble: 3 × 1
  full_name_again
  <chr>          
1 Skywalker, Luke
2 Solo, Han      
3 Organa, Leia

There are many other useful functions in tidyr for reshaping and cleaning data. Have a look at the tidyr cheat sheet for more!
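In recent tidyr versions (1.3.0 and later, which is an assumption about your installation), separate() is marked as superseded and separate_wider_delim() is the suggested replacement for splitting on a delimiter. A minimal sketch with the same names as above:

```r
library(tidyr)
library(tibble)

names_df <- tibble(full_name = c("Skywalker, Luke", "Solo, Han", "Organa, Leia"))

# Split on ", " into two named columns (requires tidyr >= 1.3.0)
separated_wider <- names_df |>
    separate_wider_delim(
        full_name,
        delim = ", ",
        names = c("last_name", "first_name")
    )

separated_wider
```

separate() still works, so the earlier example remains valid; the newer function is simply stricter about rows that do not split into the expected number of pieces.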

7.4 Links to `ggplot2`

In ggplot2, data must often be in a long format to properly map variables to aesthetic attributes like x, y, color, or fill.

starwars_weights <- starwars |>
    filter(!is.na(mass) & mass < 1000) |>
    mutate(
        weight_earth = mass * 9.81, # Weight on Earth in newtons (g = 9.81 m/s^2)
        weight_mars = mass * 3.71 # Weight on Mars in newtons (g = 3.71 m/s^2)
    )

# Reshape data to long format
long_weights <- starwars_weights |>
    select(name, weight_earth, weight_mars) |>
    pivot_longer(
        cols = c(weight_earth, weight_mars),
        names_to = "planet",
        values_to = "weight"
    ) |>
    mutate(planet = gsub("weight_", "", planet)) # Clean up planet names

# Plot the weights on Earth and Mars
ggplot(long_weights, aes(x = name, y = weight, fill = planet)) +
    geom_bar(stat = "identity", position = "dodge") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    labs(
        title = "Character Weights on Earth vs. Mars",
        x = "Character",
        y = "Weight (N)", # Weight in Newtons
        fill = "Planet"
    )

8. Functional programming with purrr

Another key component of the tidyverse is purrr, which provides a cohesive set of functional programming tools that make iteration in R more consistent and expressive. It excels at replacing repetitive for loops or *apply functions, and is particularly useful when you need:

- Repeated operations on each element of a list or vector.
- Consistent output types (numeric, character, data frame, etc.).
- Parallel iteration over multiple objects (map2(), pmap()).
- Robust error handling or partial success from iterative processes (safely(), possibly()).

If your data fits neatly in rows/columns, dplyr verbs might be enough. But for more complex, list-based workflows—especially when columns contain nested lists or when you’re doing more elaborate programming—purrr is a powerful toolkit.

8.1 purrr vs. traditional R loops or *apply family

- Loops: Flexible but can be verbose. You usually create an empty object, iterate, and assign results.
- lapply()/sapply(): Shorter syntax than loops, but return type can vary (sapply can sometimes return a vector, sometimes a list if it can’t simplify cleanly).
- purrr::map_*(): You explicitly choose the return type (map_dbl, map_chr, etc.), making your code more predictable. The syntax often reads more cleanly, especially for anonymous inline functions (~ expression).
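To make the point about return types concrete, the sketch below feeds the same function to sapply(), vapply() and map_dbl() with different inputs. The helper square() is defined here only for the illustration.

```r
library(purrr)

square <- function(x) x^2

# With ordinary input, sapply() simplifies to a numeric vector
sapply_vec <- sapply(c(1, 2, 3), square)

# With empty input, sapply() returns an empty list instead
sapply_empty <- sapply(numeric(0), square)

# If each call returns two values, sapply() silently builds a matrix
sapply_mat <- sapply(c(1, 2, 3), function(x) c(x, x^2))

# vapply() and map_dbl() always return a double vector (or fail loudly)
vapply_empty <- vapply(numeric(0), square, numeric(1))
map_empty <- map_dbl(numeric(0), square)

class(sapply_empty)
class(sapply_mat)
class(map_empty)
```

Base R's vapply() gives the same guarantee as map_dbl(); the purrr version mainly offers a more uniform naming scheme across output types.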

8.2 The `map_` family of functions

At the core of purrr is the map() family of functions. They allow you to apply a function to each element of a list or vector and return the results in various formats:

- map() returns a list.
- map_lgl() returns a logical vector.
- map_int() returns an integer vector.
- map_dbl() returns a double (numeric) vector.
- map_chr() returns a character vector.
- map_dfr() / map_dfc() (and the older map_df()) return data frames; since purrr 1.0.0 these are superseded in favour of map() followed by list_rbind() or list_cbind().

Using the right function for the type of output you expect can help avoid unexpected data structures and reduce errors.
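As a quick sketch of the variants listed above, the example applies each one to the same made-up list. list_rbind() assumes purrr 1.0.0 or later.

```r
library(purrr)

# Hypothetical toy input
values <- list(1.5, 2, 10)

as_list <- map(values, ~ .x * 2)             # list of length 3
as_dbl <- map_dbl(values, ~ .x * 2)          # double vector
as_int <- map_int(values, ~ length(.x))      # integer vector
as_lgl <- map_lgl(values, ~ .x > 1.8)        # logical vector
as_chr <- map_chr(values, ~ paste0("v", .x)) # character vector

# One small data frame per element, then stacked row-wise
rows <- map(1:3, ~ data.frame(id = .x, square = .x^2)) |>
    list_rbind()

as_dbl
as_lgl
as_chr
rows
```

If the function returns something that does not fit the requested type (for example a character value inside map_dbl()), purrr stops with an error instead of quietly changing the output type.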

8.3 Example 1: Iterating over a vector

Let’s calculate the square of each number in a vector.

8.3.1 Base R approach

# Traditional for loop
numbers <- c(1, 2, 3, 4, 5)
squared <- c()

for (i in numbers) {
    squared <- c(squared, i^2)
}

squared

# Using sapply
squared_sapply <- sapply(numbers, function(x) x^2)
squared_sapply

8.3.2 purrr approach

# Using purrr::map_dbl to return a numeric vector
squared_map <- map_dbl(numbers, ~ .x^2)
squared_map

8.4 Example 2: Applying functions to columns of a Data Frame

Suppose we want to calculate the mean of selected numeric columns (height and mass) in the starwars dataset.

8.4.1 Base R approach

# Traditional for loop
numeric_columns <- c("height", "mass")
means <- c()

for (col in numeric_columns) {
    means <- c(means, mean(starwars[[col]], na.rm = TRUE))
}

names(means) <- numeric_columns
means

# Using sapply
means_sapply <- sapply(starwars[c("height", "mass")], function(x) mean(x, na.rm = TRUE))
means_sapply

8.4.2 purrr approach

# Using purrr::map_dbl
means_map <- map_dbl(starwars[c("height", "mass")], ~ mean(.x, na.rm = TRUE))
means_map

8.5 Mapping over multiple inputs with `map2()` and `pmap()`

If your function needs two arguments that vary (for example, a data vector and a threshold vector), you can use map2(). For three or more varying arguments, pmap() is your friend. These functions let you iterate in parallel over multiple inputs. map2() takes two lists (.x and .y) in parallel. For each index, it applies the function to the corresponding elements of both lists.

Suppose we want to calculate the Body Mass Index (BMI) for each character in the starwars dataset.

# Since operations on data frames / tibbles are vectorized, this could be achieved with a normal mutate call
starwars_bmi <- starwars |>
    mutate(
        bmi = mass / ((height / 100)^2)
    )

# But purrr helps in functional programming
# We define a BMI function
BMI <- function(mass, height) {
    mass / ((height / 100)^2)
}
# and apply this
starwars_bmi <- starwars |>
    mutate(
        bmi = map2_dbl(mass, height, ~ BMI(.x, .y))
    )
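pmap() is mentioned above but not demonstrated, so here is a minimal sketch with three varying arguments. The box dimensions and the box_volume() helper are made up for the illustration.

```r
library(purrr)

# Hypothetical toy inputs: dimensions of two boxes
box_length <- c(1, 2)
box_width <- c(3, 4)
box_height <- c(5, 6)

box_volume <- function(l, w, h) l * w * h

# Unnamed list: elements are matched to the arguments by position
volumes <- pmap_dbl(list(box_length, box_width, box_height), box_volume)

# Named list or data frame: elements are matched to the arguments by name
boxes <- data.frame(h = box_height, l = box_length, w = box_width)
volumes_named <- pmap_dbl(boxes, box_volume)

volumes
```

As with the BMI example, a vectorized expression (box_length * box_width * box_height) would be simpler here; pmap() becomes useful when the function itself is not vectorized.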

8.6 Error handling

purrr provides helper functions like safely(), possibly(), and quietly() that make it easy to handle errors or unexpected outputs within a mapping sequence. Base R typically requires more manual checks or tryCatch() logic for robust error handling in loops or *apply. In the example below, each element of the bmi_safe column is a list with two components, $result and $error, which you can inspect. This can be very handy for debugging or skipping problematic elements. Note that a missing height does not raise an error: the NA simply propagates to the result, so bmi_error stays "None" for Darth Vader.

# Define a safe version of the BMI function
# otherwise = NA_real_ supplies a numeric fallback if the function errors
safe_bmi <- safely(function(mass, height) {
    mass / ((height / 100)^2)
}, otherwise = NA_real_)

# Keep rows with a known mass (a missing height is introduced on purpose for demonstration)
starwars_filtered <- starwars |>
    mutate(height = ifelse(name == "Darth Vader", NA, height)) |> # Introduce a NA for testing
    filter(!is.na(mass))

# Use map2 and the safe BMI function
starwars_bmi <- starwars_filtered |>
    mutate(
        bmi_safe = map2(mass, height, safe_bmi), # Apply the safe BMI function
        bmi = map_dbl(bmi_safe, ~ .x$result), # Extract the result
        bmi_error = map_chr(bmi_safe, ~ ifelse(is.null(.x$error), "None", .x$error$message)) # Extract the error
    )

# View the results with errors and calculated BMI
starwars_bmi |>
    select(name, height, mass, bmi, bmi_error)

# A tibble: 59 × 5
   name               height  mass   bmi bmi_error
   <chr>               <int> <dbl> <dbl> <chr>    
 1 Luke Skywalker        172    77  26.0 None     
 2 C-3PO                 167    75  26.9 None     
 3 R2-D2                  96    32  34.7 None     
 4 Darth Vader            NA   136  NA   None     
 5 Leia Organa           150    49  21.8 None     
 6 Owen Lars             178   120  37.9 None     
 7 Beru Whitesun Lars    165    75  27.5 None     
 8 R5-D4                  97    32  34.0 None     
 9 Biggs Darklighter     183    84  25.1 None     
10 Obi-Wan Kenobi        182    77  23.2 None     
# ℹ 49 more rows
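Since no call actually fails in the BMI example, the sketch below uses a made-up input list where one element does raise an error, to show what safely() and possibly() return in that case. The input values are purely illustrative.

```r
library(purrr)

# Hypothetical input: the second element is not numeric, so log10() will fail
inputs <- list(10, "a", 100)

# safely() captures both the result and the error for every element
safe_log10 <- safely(log10, otherwise = NA_real_)
results_safe <- map(inputs, safe_log10)

log_values <- map_dbl(results_safe, "result")         # fallback value where it failed
had_error <- map_lgl(results_safe, ~ !is.null(.x$error))

# possibly() is shorter when you only need the fallback value
log_values_possibly <- map_dbl(inputs, possibly(log10, otherwise = NA_real_))

log_values
had_error
```

safely() is the better choice when you want to keep the error message for later inspection; possibly() is enough when a default value is all you need.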

Tip

map() can be used to perform within-group manipulations. Let’s say we want to compute the Root Mean Squared Error (RMSE) for a regression model that predicts a character’s Body Mass Index (BMI) based on their evilness, grouped by their homeworld.

- We use group_by() and nest() (from tidyr) to split the dataset into manageable subsets (one per homeworld). Each group’s data is stored in a compact format, making it easy to apply custom operations to each group.
- The map() function is used to apply a linear regression model (lm(bmi ~ evilness)) to each group.
- The output for each group includes the model and the RMSE, which measures how well the model fits the data within the group. Homeworlds with only one or two characters are fitted exactly, so their RMSE is zero up to rounding error.
- After applying the regression model to each group, the results (e.g., RMSE) are stored in a tidy format using unnest() for easy interpretation and visualization.

# Create a fake column for evilness
set.seed(42)
starwars <- starwars |>
    filter(!is.na(height) & !is.na(mass) & !is.na(homeworld)) |> # Filter valid data
    mutate(
        evilness = runif(n(), 1, 10), # Random evilness between 1 and 10
        bmi = mass / ((height / 100)^2) # Calculate BMI
    )

# Group by homeworld and nest the data
# Use group_by(homeworld) to group characters by their homeworld.
# Use nest() to store each group’s data as a nested data frame within a single column.
grouped_data <- starwars |>
    group_by(homeworld) |>
    nest()

# Define a function to compute regression and calculate RMSE
compute_rmse <- function(df) {
    # Perform regression of BMI ~ evilness
    model <- lm(bmi ~ evilness, data = df)

    # Calculate RMSE
    predictions <- predict(model, df)
    rmse <- sqrt(mean((df$bmi - predictions)^2))

    # Return a tidy output with model summary and RMSE
    tibble(
        model = list(model),
        rmse = rmse
    )
}

# Apply the function to each group using map
# Use unnest() to bring the results (e.g., RMSE) back into a tidy data frame.
results <- grouped_data |>
    mutate(model_results = map(data, compute_rmse)) |>
    unnest(model_results)

# View the results
results |>
    select(homeworld, rmse)

# A tibble: 39 × 2
# Groups:   homeworld [39]
   homeworld      rmse
   <chr>         <dbl>
 1 Tatooine   4.42e+ 0
 2 Naboo      6.33e+ 0
 3 Alderaan   3.55e-15
 4 Stewjon    0       
 5 Kashyyyk   5.62e-15
 6 Corellia   8.44e-14
 7 Rodia      0       
 8 Nal Hutta  0       
 9 Bestine IV 0       
10 Kamino     2.51e-15
# ℹ 29 more rows

9. Conclusion

This wraps up our introduction to dplyr and the tidyverse!

Feel free to explore more at:
- dplyr.tidyverse.org
- tidyverse.org

Happy coding!

sessionInfo()

R version 4.5.2 (2025-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Tahoe 26.3

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Brussels
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.5 forcats_1.0.1   stringr_1.6.0   dplyr_1.2.0    
 [5] purrr_1.2.1     readr_2.2.0     tidyr_1.3.2     tibble_3.3.1   
 [9] ggplot2_4.0.2   tidyverse_2.0.0 knitr_1.51     

loaded via a namespace (and not attached):
 [1] gtable_0.3.6       jsonlite_2.0.0     compiler_4.5.2     tidyselect_1.2.1  
 [5] scales_1.4.0       yaml_2.3.12        fastmap_1.2.0      R6_2.6.1          
 [9] labeling_0.4.3     generics_0.1.4     htmlwidgets_1.6.4  pillar_1.11.1     
[13] RColorBrewer_1.1-3 tzdb_0.5.0         rlang_1.1.7        utf8_1.2.6        
[17] stringi_1.8.7      xfun_0.56          S7_0.2.1           otel_0.2.0        
[21] timechange_0.4.0   cli_3.6.5          withr_3.0.2        magrittr_2.0.4    
[25] digest_0.6.39      grid_4.5.2         hms_1.1.4          lifecycle_1.0.5   
[29] vctrs_0.7.1        evaluate_1.0.5     glue_1.8.0         farver_2.1.2      
[33] pacman_0.5.1       rmarkdown_2.30     tools_4.5.2        pkgconfig_2.0.3   
[37] htmltools_0.5.9
