Data Wrangling

Quote from the book “R for Data Science”, the author said

R is an old language, and some things that were useful 10 or 20 years ago now get in your way. It’s difficult to change base R without breaking existing code, so most innovation occurs in packages.

What is tidyverse?

base R’s functions are often slow and the implementations are often not consistent
writing code in tidyverse style usually is more elegant
the operations would be easily chained together using piping (more below)

So what is tidyverse?

It is a collection of R packages which are designed to be used together.
- ggplot2, for data visualisation
- dplyr, for data manipulation
- tidyr, for data tidying
- readr, for data import
- purrr, for functional programming
- tibble, for tibbles, a modern re-imagining of data frames
- stringr, for strings
- forcats, for factors

Pipe operator

We are going to detour a bit to talk about pipe operator. Pipe operator is a powerful way to make your code much cleaner and readible by

structuring sequences of data operations left-to-right (as opposed to from the inside and outside),
avoiding nested function calls,
minimizing the need for local variables and function definitions, and
making it easy to add steps anywhere in the sequence of operations.

The operators pipe their left-hand side values forward into expressions that appear on the right-hand side, i.e. one can replace f(x) with x %>% f(), where %>% is the (main) pipe-operator. When coupling several function calls with the pipe-operator, the benefit will become more apparent. Consider this pseudo example:

raw_data <- read.csv("/path/to/data/file.csv")
sub_data <- subset(raw_data, variable_a > x)
new_data <- transform(sub_data, variable_c = variable_a / variable_b)
the_data <- head(new_data, 100)

However, the local variables are really not necessary, so one liner could be

the_data <- head(
  transform(
    subset(
      read.csv("/path/to/data/file.csv"),
      variable_a > x
    ),
    variable_c = variable_a / variable_b
  ),
  100
)

But that is too hard to read and reason. If we use pipe operator,

the_data <- read.csv("/path/to/data/file.csv") %>%
  subset(variable_a > x) %>%
  transform(variable_c = variable_a / variable_b) %>%
  head(100)

Hint: In RStudio, you could use Ctrl + Shift + M (or Cmd + Shift + M) to insert the pipe operator.

Basic usage

x %>% f or x %>% f() is equivalent to f(x)
x %>% f(y) is equivalent to f(x, y)
x %>% f %>% g %>% h is equivalent to h(g(f(x)))

sum(c(1, 2))

## [1] 3

c(1, 2) %>% sum()

## [1] 3

choose(5, 3)

## [1] 10

5 %>% choose(3)

## [1] 10

The argument placeholder

x %>% f(y, .) is equivalent to f(y, x)
x %>% f(y, z = .) is equivalent to f(y, z = x)

3 %>% choose(5, .)

## [1] 10

3 %>% choose(5, k = .)

## [1] 10

Re-using the placeholder

It is straightforward to use the placeholder several times in a right-hand side expression. However, when the placeholder only appears in a nested expressions magrittr will still apply the first-argument rule. The reason is that in most cases this results more clean code.

x %>% f(y = nrow(.), z = ncol(.)) is equivalent to f(x, y = nrow(x), z = ncol(x))

a <- 100
5 %>% choose(. - 2)

## [1] 10

choose(5, 5 - 2)

## [1] 10

The behavior can be overruled by enclosing the right-hand side in braces:

4 %>% {choose(5, . - 1)}

## [1] 10

choose(5, 4 - 1)

## [1] 10

x %>% {f(y = nrow(.), z = ncol(.))} is equivalent to f(y = nrow(x), z = ncol(x))

list(n = 5, k = 3) %>% {
  choose(.$n, .$k)
}

## [1] 10

`dplyr` basics

It offers five basic verbs
- select: picks variables based on their names
- filter: picks cases based on their values
- mutate: adds new variables that are functions of existing variables
- arrange: changes the ordering of the rows
- summarize or summarise: reduces multiple values down to a single summary
These all combine naturally with group_by which allows you to perform any operation “by group”.

Obtain some data

First of all, we need some data to work with. If the data is stored in a csv,

flights <- read_csv("flights.csv")

We are using the tidyverse function read_csv to import the flights.csv instead of the obsolete base function read.csv. - read_csv imports data as tibble which has better and more consistent handling of variables. - read_csv is often faster than read.csv - read_csv handles unicode characters better

There are also read_tsv and read_delim for reading tab-seperated or delimited files.

The datasets are actually obtained from the R package nycflights13

# Airline on-time data for all flights departing NYC in 2013.
library(nycflights13)
flights

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

`select`: picks variables based on their names.

To select arrival and departure times,

# old way to do it
# flights[, c("arr_time", "dep_time")]
flights %>% select(arr_time, dep_time)

## # A tibble: 336,776 x 2
##    arr_time dep_time
##       <int>    <int>
##  1      830      517
##  2      850      533
##  3      923      542
##  4     1004      544
##  5      812      554
##  6      740      554
##  7      913      555
##  8      709      557
##  9      838      557
## 10      753      558
## # … with 336,766 more rows

I don’t see why it’s useful

dplyr provides a lot of helper functions,

# colon `:` specifies all the variables between the columns of `dep_time` and `arr_time`
flights %>% select(dep_time:arr_time)

## # A tibble: 336,776 x 4
##    dep_time sched_dep_time dep_delay arr_time
##       <int>          <int>     <dbl>    <int>
##  1      517            515         2      830
##  2      533            529         4      850
##  3      542            540         2      923
##  4      544            545        -1     1004
##  5      554            600        -6      812
##  6      554            558        -4      740
##  7      555            600        -5      913
##  8      557            600        -3      709
##  9      557            600        -3      838
## 10      558            600        -2      753
## # … with 336,766 more rows

# all the columns start with arr_
flights %>% select(starts_with("arr_"))

## # A tibble: 336,776 x 2
##    arr_time arr_delay
##       <int>     <dbl>
##  1      830        11
##  2      850        20
##  3      923        33
##  4     1004       -18
##  5      812       -25
##  6      740        12
##  7      913        19
##  8      709       -14
##  9      838        -8
## 10      753         8
## # … with 336,766 more rows

# all the columns end with _time
flights %>% select(ends_with("_time"))

## # A tibble: 336,776 x 5
##    dep_time sched_dep_time arr_time sched_arr_time air_time
##       <int>          <int>    <int>          <int>    <dbl>
##  1      517            515      830            819      227
##  2      533            529      850            830      227
##  3      542            540      923            850      160
##  4      544            545     1004           1022      183
##  5      554            600      812            837      116
##  6      554            558      740            728      150
##  7      555            600      913            854      158
##  8      557            600      709            723       53
##  9      557            600      838            846      140
## 10      558            600      753            745      138
## # … with 336,766 more rows

# all the columns contain dep
flights %>% select(contains("dep"))

## # A tibble: 336,776 x 3
##    dep_time sched_dep_time dep_delay
##       <int>          <int>     <dbl>
##  1      517            515         2
##  2      533            529         4
##  3      542            540         2
##  4      544            545        -1
##  5      554            600        -6
##  6      554            558        -4
##  7      555            600        -5
##  8      557            600        -3
##  9      557            600        -3
## 10      558            600        -2
## # … with 336,766 more rows

# remove particular columns
flights %>% select(-year)

## # A tibble: 336,776 x 18
##    month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1     1     1      517            515         2      830            819
##  2     1     1      533            529         4      850            830
##  3     1     1      542            540         2      923            850
##  4     1     1      544            545        -1     1004           1022
##  5     1     1      554            600        -6      812            837
##  6     1     1      554            558        -4      740            728
##  7     1     1      555            600        -5      913            854
##  8     1     1      557            600        -3      709            723
##  9     1     1      557            600        -3      838            846
## 10     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

# all the columns do not contain dep
flights %>% select(-contains("dep"))

## # A tibble: 336,776 x 16
##     year month   day arr_time sched_arr_time arr_delay carrier flight tailnum
##    <int> <int> <int>    <int>          <int>     <dbl> <chr>    <int> <chr>  
##  1  2013     1     1      830            819        11 UA        1545 N14228 
##  2  2013     1     1      850            830        20 UA        1714 N24211 
##  3  2013     1     1      923            850        33 AA        1141 N619AA 
##  4  2013     1     1     1004           1022       -18 B6         725 N804JB 
##  5  2013     1     1      812            837       -25 DL         461 N668DN 
##  6  2013     1     1      740            728        12 UA        1696 N39463 
##  7  2013     1     1      913            854        19 B6         507 N516JB 
##  8  2013     1     1      709            723       -14 EV        5708 N829AS 
##  9  2013     1     1      838            846        -8 B6          79 N593JB 
## 10  2013     1     1      753            745         8 AA         301 N3ALAA 
## # … with 336,766 more rows, and 7 more variables: origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

# using regular expression (later of the course)
flights %>% select(matches("^(arr|dep)_"))

## # A tibble: 336,776 x 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # … with 336,766 more rows

Related verbs: - if you just need a single variable, you could use pull. - you could use rename to rename columns

# flights$arr_time
flights %>% pull(arr_time)

##   [1]  830  850  923 1004  812  740  913  709  838  753  849  853  924  923  941
##  [16]  702  854  851  837  844  812  821  858  837  858  807  945  925 1039  833
##  [31] 1017  920  933  909  840 1018 1137 1016  824  721  824  740 1028  930  739
##  [46]  922  837  931  815  910 1023  936  932  936 1021 1037 1002  854  949 1007
##  [61]  948  959  944 1027 1008 1008  907  959 1123 1058  852 1151 1023  911  850
##  [76] 1017 1013 1111 1020 1052  959 1041 1049  857 1041 1011  854 1047  918 1104
##  [91] 1038 1107 1043 1059 1135 1119  939 1041 1025  955 1056 1039 1103 1053 1057
## [106] 1022  949  900  903 1132 1103 1015 1118 1006 1043 1043 1048 1100 1006 1047
## [121] 1026 1040 1103 1047 1005 1254  940 1249 1153  932 1014 1019 1151 1027 1058
## [136] 1136 1145 1027 1150 1152 1117 1018 1052 1021 1006 1134 1210 1027 1311 1053
## [151] 1138 1001 1155 1032 1102 1215 1046 1147 1143 1226 1222 1140 1516 1107 1124
## [166] 1157 1102 1140 1223 1211 1048 1045 1207 1309 1134 1020 1004 1228 1331 1241
## [181] 1219 1346 1244 1058 1313 1206 1052 1039 1152 1237 1026 1404 1221 1233 1231
## [196] 1028 1220 1218 1237 1121
##  [ reached getOption("max.print") -- omitted 336576 entries ]

flights %>% rename(y = year)

## # A tibble: 336,776 x 19
##        y month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

# of course, we could select everything
flights %>% select(everything())

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

# move air_time to the front
flights %>% select(air_time, everything())

## # A tibble: 336,776 x 19
##    air_time  year month   day dep_time sched_dep_time dep_delay arr_time
##       <dbl> <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1      227  2013     1     1      517            515         2      830
##  2      227  2013     1     1      533            529         4      850
##  3      160  2013     1     1      542            540         2      923
##  4      183  2013     1     1      544            545        -1     1004
##  5      116  2013     1     1      554            600        -6      812
##  6      150  2013     1     1      554            558        -4      740
##  7      158  2013     1     1      555            600        -5      913
##  8       53  2013     1     1      557            600        -3      709
##  9      140  2013     1     1      557            600        -3      838
## 10      138  2013     1     1      558            600        -2      753
## # … with 336,766 more rows, and 11 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
## #   dest <chr>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

# a new function `relocate
flights %>% relocate(air_time, .after = day)

## # A tibble: 336,776 x 19
##     year month   day air_time dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <dbl>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      227      517            515         2      830
##  2  2013     1     1      227      533            529         4      850
##  3  2013     1     1      160      542            540         2      923
##  4  2013     1     1      183      544            545        -1     1004
##  5  2013     1     1      116      554            600        -6      812
##  6  2013     1     1      150      554            558        -4      740
##  7  2013     1     1      158      555            600        -5      913
##  8  2013     1     1       53      557            600        -3      709
##  9  2013     1     1      140      557            600        -3      838
## 10  2013     1     1      138      558            600        -2      753
## # … with 336,766 more rows, and 11 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
## #   dest <chr>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

`filter`: picks cases based on their values

# flights[flights$origin == "JFK", ]
# subset(flights, origin == "JFK")
flights %>% filter(origin == "JFK")

## # A tibble: 111,279 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      542            540         2      923            850
##  2  2013     1     1      544            545        -1     1004           1022
##  3  2013     1     1      557            600        -3      838            846
##  4  2013     1     1      558            600        -2      849            851
##  5  2013     1     1      558            600        -2      853            856
##  6  2013     1     1      558            600        -2      924            917
##  7  2013     1     1      559            559         0      702            706
##  8  2013     1     1      606            610        -4      837            845
##  9  2013     1     1      611            600        11      945            931
## 10  2013     1     1      613            610         3      925            921
## # … with 111,269 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

flights %>% filter(distance > 1000)

## # A tibble: 147,105 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      555            600        -5      913            854
##  6  2013     1     1      558            600        -2      849            851
##  7  2013     1     1      558            600        -2      853            856
##  8  2013     1     1      558            600        -2      924            917
##  9  2013     1     1      558            600        -2      923            937
## 10  2013     1     1      559            600        -1      941            910
## # … with 147,095 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

# note that we are using a single `&` instead of `&&` as in base R
flights %>% filter(origin == "JFK" & distance > 1000)

## # A tibble: 62,071 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      542            540         2      923            850
##  2  2013     1     1      544            545        -1     1004           1022
##  3  2013     1     1      558            600        -2      849            851
##  4  2013     1     1      558            600        -2      853            856
##  5  2013     1     1      558            600        -2      924            917
##  6  2013     1     1      611            600        11      945            931
##  7  2013     1     1      613            610         3      925            921
##  8  2013     1     1      615            615         0     1039           1100
##  9  2013     1     1      627            630        -3     1018           1018
## 10  2013     1     1      628            630        -2     1137           1140
## # … with 62,061 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

flights %>% filter(origin == "JFK", distance > 1000)

## # A tibble: 62,071 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      542            540         2      923            850
##  2  2013     1     1      544            545        -1     1004           1022
##  3  2013     1     1      558            600        -2      849            851
##  4  2013     1     1      558            600        -2      853            856
##  5  2013     1     1      558            600        -2      924            917
##  6  2013     1     1      611            600        11      945            931
##  7  2013     1     1      613            610         3      925            921
##  8  2013     1     1      615            615         0     1039           1100
##  9  2013     1     1      627            630        -3     1018           1018
## 10  2013     1     1      628            630        -2     1137           1140
## # … with 62,061 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

flights %>% filter(distance < 500 | distance > 1000)

## # A tibble: 227,322 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      555            600        -5      913            854
##  6  2013     1     1      557            600        -3      709            723
##  7  2013     1     1      558            600        -2      849            851
##  8  2013     1     1      558            600        -2      853            856
##  9  2013     1     1      558            600        -2      924            917
## 10  2013     1     1      558            600        -2      923            937
## # … with 227,312 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

flights %>% filter(!between(distance, 500, 1000))

## # A tibble: 227,322 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      555            600        -5      913            854
##  6  2013     1     1      557            600        -3      709            723
##  7  2013     1     1      558            600        -2      849            851
##  8  2013     1     1      558            600        -2      853            856
##  9  2013     1     1      558            600        -2      924            917
## 10  2013     1     1      558            600        -2      923            937
## # … with 227,312 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

# only keep the complete cases
flights %>% filter(complete.cases(.)) # filter(flights, complete.cases(flights))

## # A tibble: 327,346 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 327,336 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

flights %>% drop_na() # same, but more efficient

## # A tibble: 327,346 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 327,336 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Chaining and piping

Very often, we will need to use multiple dplyr verbs, for example

# the pipe operator %>% increases readability
flights %>%
  select(origin, air_time) %>%
  filter(origin == "JFK") %>%
  filter(air_time < 500) %>%
  rename(airtime = air_time)

## # A tibble: 108,737 x 2
##    origin airtime
##    <chr>    <dbl>
##  1 JFK        160
##  2 JFK        183
##  3 JFK        140
##  4 JFK        149
##  5 JFK        158
##  6 JFK        345
##  7 JFK         44
##  8 JFK        128
##  9 JFK        366
## 10 JFK        175
## # … with 108,727 more rows

# slightly more efficient
flights %>%
  select(origin, airtime = air_time) %>%
  filter(origin == "JFK", airtime < 500)

## # A tibble: 108,737 x 2
##    origin airtime
##    <chr>    <dbl>
##  1 JFK        160
##  2 JFK        183
##  3 JFK        140
##  4 JFK        149
##  5 JFK        158
##  6 JFK        345
##  7 JFK         44
##  8 JFK        128
##  9 JFK        366
## 10 JFK        175
## # … with 108,727 more rows

# a few more examples
flights %>%
  select(origin, air_time) %>%
  filter(origin == "JFK", air_time < mean(air_time, na.rm = TRUE))

## # A tibble: 55,521 x 2
##    origin air_time
##    <chr>     <dbl>
##  1 JFK         140
##  2 JFK         149
##  3 JFK          44
##  4 JFK         128
##  5 JFK          41
##  6 JFK          63
##  7 JFK         142
##  8 JFK         147
##  9 JFK          64
## 10 JFK          54
## # … with 55,511 more rows

# maybe we want to calculate mean_air_time first
mean_air_time <- flights %>%
  pull(air_time) %>%
  mean(na.rm = TRUE)

mean_air_time

## [1] 150.6865

flights %>%
  select(origin, air_time) %>%
  filter(origin == "JFK", air_time > mean_air_time)

## # A tibble: 53,558 x 2
##    origin air_time
##    <chr>     <dbl>
##  1 JFK         160
##  2 JFK         183
##  3 JFK         158
##  4 JFK         345
##  5 JFK         366
##  6 JFK         175
##  7 JFK         182
##  8 JFK         330
##  9 JFK         192
## 10 JFK         323
## # … with 53,548 more rows

# what if there is a name collision?
air_time <- flights %>%
  pull(air_time) %>%
  mean(na.rm = TRUE)

flights %>%
  select(origin, air_time) %>%
  filter(origin == "JFK", air_time > !!air_time)

## # A tibble: 53,558 x 2
##    origin air_time
##    <chr>     <dbl>
##  1 JFK         160
##  2 JFK         183
##  3 JFK         158
##  4 JFK         345
##  5 JFK         366
##  6 JFK         175
##  7 JFK         182
##  8 JFK         330
##  9 JFK         192
## 10 JFK         323
## # … with 53,548 more rows

Remarks: use slice, head, tail if you want particular rows

flights %>% head(4)

## # A tibble: 4 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

flights %>% slice(100:105)

## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      752            759        -7      955            959
## 2  2013     1     1      753            755        -2     1056           1110
## 3  2013     1     1      754            759        -5     1039           1041
## 4  2013     1     1      754            755        -1     1103           1030
## 5  2013     1     1      758            800        -2     1053           1054
## 6  2013     1     1      759            800        -1     1057           1127
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

flights %>% tail(5)

## # A tibble: 5 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     9    30       NA           1455        NA       NA           1634
## 2  2013     9    30       NA           2200        NA       NA           2312
## 3  2013     9    30       NA           1210        NA       NA           1330
## 4  2013     9    30       NA           1159        NA       NA           1344
## 5  2013     9    30       NA            840        NA       NA           1020
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

flights %>% slice((n() - 4):n())

## # A tibble: 5 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     9    30       NA           1455        NA       NA           1634
## 2  2013     9    30       NA           2200        NA       NA           2312
## 3  2013     9    30       NA           1210        NA       NA           1330
## 4  2013     9    30       NA           1159        NA       NA           1344
## 5  2013     9    30       NA            840        NA       NA           1020
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

`mutate`: adds new variables that are functions of existing variables

flights %>% mutate(
  gain = dep_delay - arr_delay,
  speed = distance / air_time * 60
)

## # A tibble: 336,776 x 21
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 13 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## #   gain <dbl>, speed <dbl>

# we could refer to the columns just created
flights %>% mutate(
  gain = dep_delay - arr_delay,
  gain_per_hour = gain / (air_time / 60)
)

## # A tibble: 336,776 x 21
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 13 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## #   gain <dbl>, gain_per_hour <dbl>

# remove a variable
flights %>% mutate(
  gain = dep_delay - arr_delay,
  gain_per_hour = gain / (air_time / 60),
  air_time = NULL,
)

## # A tibble: 336,776 x 20
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 12 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, gain <dbl>,
## #   gain_per_hour <dbl>

# `transmute` only keeps the new variables
flights %>% transmute(
  gain = dep_delay - arr_delay,
  gain_per_hour = gain / (air_time / 60)
)

## # A tibble: 336,776 x 2
##     gain gain_per_hour
##    <dbl>         <dbl>
##  1    -9         -2.38
##  2   -16         -4.23
##  3   -31        -11.6 
##  4    17          5.57
##  5    19          9.83
##  6   -16         -6.4 
##  7   -24         -9.11
##  8    11         12.5 
##  9     5          2.14
## 10   -10         -4.35
## # … with 336,766 more rows

Six variations on ranking functions

row_number: equivalent to rank(ties.method = "first")
min_rank: equivalent to rank(ties.method = "min")
dense_rank: like min_rank(), but with no gaps between ranks
percent_rank: a number between 0 and 1 computed by rescaling min_rank to [0, 1]
cume_dist: a cumulative distribution function. Proportion of all values less than or equal to the current rank.
ntile: a rough rank, which breaks the input vector into n buckets

some_data <- tibble(
  x = c(3, 4, 1, 3, 1)
)
some_data %>%
  mutate(
    row_number(),
    row_number(x),
    min_rank(x),
    dense_rank(x),
    percent_rank(x),
    cume_dist(x),
    ntile(x, 3)
  )

## # A tibble: 5 x 8
##       x `row_number()` `row_number(x)` `min_rank(x)` `dense_rank(x)`
##   <dbl>          <int>           <int>         <int>           <int>
## 1     3              1               3             3               2
## 2     4              2               5             5               3
## 3     1              3               1             1               1
## 4     3              4               4             3               2
## 5     1              5               2             1               1
## # … with 3 more variables: `percent_rank(x)` <dbl>, `cume_dist(x)` <dbl>,
## #   `ntile(x, 3)` <int>

`lead` and `lag`

some_data2 <- tibble(
  time = 1:5,
  value = c(3, 4, 1, 3, 1)
)

some_data2 %>% mutate(
  lag(value),
  lead(value)
)

## # A tibble: 5 x 4
##    time value `lag(value)` `lead(value)`
##   <int> <dbl>        <dbl>         <dbl>
## 1     1     3           NA             4
## 2     2     4            3             1
## 3     3     1            4             3
## 4     4     3            1             1
## 5     5     1            3            NA

some_data2 %>% mutate(
  diff1 = value - lag(value), 
  diff2 = lead(value) - value
)

## # A tibble: 5 x 4
##    time value diff1 diff2
##   <int> <dbl> <dbl> <dbl>
## 1     1     3    NA     1
## 2     2     4     1    -3
## 3     3     1    -3     2
## 4     4     3     2    -2
## 5     5     1    -2    NA

Remark: the package zoo has functions to calculate rolling means (moving averages)

Conditional mutation

flights %>% transmute(
  arr_delay,
  status = if_else(arr_delay > 0, "delayed", "on time")
)

## # A tibble: 336,776 x 2
##    arr_delay status 
##        <dbl> <chr>  
##  1        11 delayed
##  2        20 delayed
##  3        33 delayed
##  4       -18 on time
##  5       -25 on time
##  6        12 delayed
##  7        19 delayed
##  8       -14 on time
##  9        -8 on time
## 10         8 delayed
## # … with 336,766 more rows

(
  flight_distances <- flights %>%
    transmute(
      distance,
      distance_type = case_when(
        distance < 500 ~ "short",
        distance < 1000 ~ "mid",
        TRUE ~ "long"
      )
    )
)

## # A tibble: 336,776 x 2
##    distance distance_type
##       <dbl> <chr>        
##  1     1400 long         
##  2     1416 long         
##  3     1089 long         
##  4     1576 long         
##  5      762 mid          
##  6      719 mid          
##  7     1065 long         
##  8      229 short        
##  9      944 mid          
## 10      733 mid          
## # … with 336,766 more rows

`recode` values

flight_distances %>% mutate(
  distance_type = recode(distance_type,
    "long" = "long-distance",
    "mid" = "mid-distance",
    "short" = "short-distance"
))

## # A tibble: 336,776 x 2
##    distance distance_type 
##       <dbl> <chr>         
##  1     1400 long-distance 
##  2     1416 long-distance 
##  3     1089 long-distance 
##  4     1576 long-distance 
##  5      762 mid-distance  
##  6      719 mid-distance  
##  7     1065 long-distance 
##  8      229 short-distance
##  9      944 mid-distance  
## 10      733 mid-distance  
## # … with 336,766 more rows

some_data %>% mutate(
  y = recode(x,
    `1` = 10,
    `3` = 30
  )
)

## # A tibble: 5 x 2
##       x     y
##   <dbl> <dbl>
## 1     3    30
## 2     4     4
## 3     1    10
## 4     3    30
## 5     1    10

`arrange`: changes the ordering of the rows

flights %>% arrange(year, month, day)

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

flights %>% arrange(desc(dep_delay))

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     9      641            900      1301     1242           1530
##  2  2013     6    15     1432           1935      1137     1607           2120
##  3  2013     1    10     1121           1635      1126     1239           1810
##  4  2013     9    20     1139           1845      1014     1457           2210
##  5  2013     7    22      845           1600      1005     1044           1815
##  6  2013     4    10     1100           1900       960     1342           2211
##  7  2013     3    17     2321            810       911      135           1020
##  8  2013     6    27      959           1900       899     1236           2226
##  9  2013     7    22     2257            759       898      121           1026
## 10  2013    12     5      756           1700       896     1058           2020
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

`summarize` and `group_by` operations

flights %>% summarize(
  air_time = mean(air_time, na.rm = TRUE)
)

## # A tibble: 1 x 1
##   air_time
##      <dbl>
## 1     151.

# mean(flights$air_time, na.rm = TRUE)

flights %>%
  group_by(tailnum) %>%
  summarize(n = n())

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 4,044 x 2
##    tailnum     n
##    <chr>   <int>
##  1 D942DN      4
##  2 N0EGMQ    371
##  3 N10156    153
##  4 N102UW     48
##  5 N103US     46
##  6 N104UW     47
##  7 N10575    289
##  8 N105UW     45
##  9 N107US     41
## 10 N108UW     60
## # … with 4,034 more rows

flights %>%
  group_by(tailnum) %>%
  tally() # shorthand

## # A tibble: 4,044 x 2
##    tailnum     n
##    <chr>   <int>
##  1 D942DN      4
##  2 N0EGMQ    371
##  3 N10156    153
##  4 N102UW     48
##  5 N103US     46
##  6 N104UW     47
##  7 N10575    289
##  8 N105UW     45
##  9 N107US     41
## 10 N108UW     60
## # … with 4,034 more rows

flights %>% count(tailnum) # another shorthand

## # A tibble: 4,044 x 2
##    tailnum     n
##    <chr>   <int>
##  1 D942DN      4
##  2 N0EGMQ    371
##  3 N10156    153
##  4 N102UW     48
##  5 N103US     46
##  6 N104UW     47
##  7 N10575    289
##  8 N105UW     45
##  9 N107US     41
## 10 N108UW     60
## # … with 4,034 more rows

flights %>%
  group_by(tailnum) %>%
  summarize(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  )

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 4,044 x 4
##    tailnum count  dist  delay
##    <chr>   <int> <dbl>  <dbl>
##  1 D942DN      4  854. 31.5  
##  2 N0EGMQ    371  676.  9.98 
##  3 N10156    153  758. 12.7  
##  4 N102UW     48  536.  2.94 
##  5 N103US     46  535. -6.93 
##  6 N104UW     47  535.  1.80 
##  7 N10575    289  520. 20.7  
##  8 N105UW     45  525. -0.267
##  9 N107US     41  529. -5.73 
## 10 N108UW     60  534. -1.25 
## # … with 4,034 more rows

flights %>%
  group_by(dest) %>%
  summarize(
    planes = n_distinct(tailnum),
    flights = n()
  )

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 105 x 3
##    dest  planes flights
##    <chr>  <int>   <int>
##  1 ABQ      108     254
##  2 ACK       58     265
##  3 ALB      172     439
##  4 ANC        6       8
##  5 ATL     1180   17215
##  6 AUS      993    2439
##  7 AVL      159     275
##  8 BDL      186     443
##  9 BGR       46     375
## 10 BHM       45     297
## # … with 95 more rows

# group by multiple variables
(per_day <- flights %>%
  group_by(year, month, day) %>%
  summarize(flights = n()))

## `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)

## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day flights
##    <int> <int> <int>   <int>
##  1  2013     1     1     842
##  2  2013     1     2     943
##  3  2013     1     3     914
##  4  2013     1     4     915
##  5  2013     1     5     720
##  6  2013     1     6     832
##  7  2013     1     7     933
##  8  2013     1     8     899
##  9  2013     1     9     902
## 10  2013     1    10     932
## # … with 355 more rows

(per_month <- per_day %>%
  summarize(flights = sum(flights)))

## `summarise()` regrouping output by 'year' (override with `.groups` argument)

## # A tibble: 12 x 3
## # Groups:   year [1]
##     year month flights
##    <int> <int>   <int>
##  1  2013     1   27004
##  2  2013     2   24951
##  3  2013     3   28834
##  4  2013     4   28330
##  5  2013     5   28796
##  6  2013     6   28243
##  7  2013     7   29425
##  8  2013     8   29327
##  9  2013     9   27574
## 10  2013    10   28889
## 11  2013    11   27268
## 12  2013    12   28135

(per_year <- per_month %>%
  summarize(flights = sum(flights)))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 1 x 2
##    year flights
##   <int>   <int>
## 1  2013  336776

`across` columns

Suppose, we want to compute the means for the all numeric columns of flights. Naively, we could do

flights %>%
  group_by(origin) %>%
  summarize(
    dep_time = mean(dep_time, na.rm = TRUE),
    sched_dep_time = mean(sched_dep_time, na.rm = TRUE),
    dep_delay = mean(dep_delay, na.rm = TRUE),
    arr_time = mean(arr_time, na.rm = TRUE),
    sched_arr_time = mean(sched_arr_time, na.rm = TRUE),
    arr_delay = mean(arr_delay, na.rm = TRUE),
    flight = mean(flight, na.rm = TRUE),
    air_time = mean(air_time, na.rm = TRUE),
    distance = mean(distance, na.rm = TRUE)
  )

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 3 x 10
##   origin dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
##   <chr>     <dbl>          <dbl>     <dbl>    <dbl>          <dbl>     <dbl>
## 1 EWR       1337.          1322.      15.1    1492.          1528.      9.11
## 2 JFK       1399.          1402.      12.1    1520.          1565.      5.55
## 3 LGA       1310.          1308.      10.3    1494.          1516.      5.78
## # … with 3 more variables: flight <dbl>, air_time <dbl>, distance <dbl>

With the new across function, it is as easy as:

flights %>%
  group_by(origin) %>%
  summarize(across(dep_time:distance, mean, na.rm = TRUE))

## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA

## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA

## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA

## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA

## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA

## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA

## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA

## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA

## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 3 x 13
##   origin dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
##   <chr>     <dbl>          <dbl>     <dbl>    <dbl>          <dbl>     <dbl>
## 1 EWR       1337.          1322.      15.1    1492.          1528.      9.11
## 2 JFK       1399.          1402.      12.1    1520.          1565.      5.55
## 3 LGA       1310.          1308.      10.3    1494.          1516.      5.78
## # … with 6 more variables: carrier <dbl>, flight <dbl>, tailnum <dbl>,
## #   dest <dbl>, air_time <dbl>, distance <dbl>

We have accidentally included some non numeric columns.

flights %>%
  group_by(origin) %>%
  summarize(
    across(dep_time:distance & is.numeric, mean, na.rm = TRUE)
  )

## Warning: Predicate functions must be wrapped in `where()`.
## 
##   # Bad
##   data %>% select(is.numeric)
## 
##   # Good
##   data %>% select(where(is.numeric))
## 
## ℹ Please update your code.
## This message is displayed once per session.

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 3 x 10
##   origin dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
##   <chr>     <dbl>          <dbl>     <dbl>    <dbl>          <dbl>     <dbl>
## 1 EWR       1337.          1322.      15.1    1492.          1528.      9.11
## 2 JFK       1399.          1402.      12.1    1520.          1565.      5.55
## 3 LGA       1310.          1308.      10.3    1494.          1516.      5.78
## # … with 3 more variables: flight <dbl>, air_time <dbl>, distance <dbl>

flights %>%
  group_by(origin) %>%
  summarize(
    across(-c(year, month, day) & is.numeric, mean, na.rm = TRUE)
  )

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 3 x 12
##   origin dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
##   <chr>     <dbl>          <dbl>     <dbl>    <dbl>          <dbl>     <dbl>
## 1 EWR       1337.          1322.      15.1    1492.          1528.      9.11
## 2 JFK       1399.          1402.      12.1    1520.          1565.      5.55
## 3 LGA       1310.          1308.      10.3    1494.          1516.      5.78
## # … with 5 more variables: flight <dbl>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>

We could also use predicate functions in across and use different functions to summerize different variables.

flights %>%
  group_by(origin) %>%
  summarize(
    across(ends_with("time"), mean, na.rm = TRUE),
    across(ends_with("delay"), mean, na.rm = TRUE, trim = 0.2),
  )

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 3 x 8
##   origin dep_time sched_dep_time arr_time sched_arr_time air_time dep_delay
##   <chr>     <dbl>          <dbl>    <dbl>          <dbl>    <dbl>     <dbl>
## 1 EWR       1337.          1322.    1492.          1528.     153.     2.10 
## 2 JFK       1399.          1402.    1520.          1565.     178.     0.298
## 3 LGA       1310.          1308.    1494.          1516.     118.    -1.51 
## # … with 1 more variable: arr_delay <dbl>

Other useful functions

flights %>% glimpse()

## Rows: 336,776
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558,…
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600,…
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -…
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", …
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, …
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA"…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD"…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, …
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733,…
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, …
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, …
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 …

flights %>% 
  slice_sample(n = 1000) %>% 
  ggplot(aes(arr_delay, dep_delay)) + geom_point(na.rm = TRUE)

## rows with largest values of air_time with the original order preserved
flights %>% 
  slice_max(air_time, n = 3)

## # A tibble: 4 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     3    17     1337           1335         2     1937           1836
## 2  2013     2     6      853            900        -7     1542           1540
## 3  2013     3    15     1001           1000         1     1551           1530
## 4  2013     3    17     1006           1000         6     1607           1530
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

# two disadvantages: order changes and does not handle ties
flights %>% 
  arrange(desc(air_time)) %>% 
  head(3)

## # A tibble: 3 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     3    17     1337           1335         2     1937           1836
## 2  2013     2     6      853            900        -7     1542           1540
## 3  2013     3    15     1001           1000         1     1551           1530
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Similary, there is slice_min.

Row bind or column bind data frames

df1 <- tibble(a = "hello", b = 2, c = 3)
df2 <- tibble(a = "world", b = 4)

df1 %>% bind_rows(df2)

## # A tibble: 2 x 3
##   a         b     c
##   <chr> <dbl> <dbl>
## 1 hello     2     3
## 2 world     4    NA

bind_rows(df1, df2)

## # A tibble: 2 x 3
##   a         b     c
##   <chr> <dbl> <dbl>
## 1 hello     2     3
## 2 world     4    NA

The bind_rows function from dplyr is better than rbind in several ways.

bind_rows handles missing values
bind_rows doesn’t automatically convert types
bind_rows supports splice operator

df1 <- tibble(a = "hello", b = 2, c = 3)
df2 <- tibble(a = "world", b = 4)
df1 %>% rbind(df2)

## Error in rbind(deparse.level, ...): numbers of columns of arguments do not match

df1 <- tibble(a = "hello", b = 2)
df2 <- tibble(a = 2, b = 4)
df1 %>% rbind(df2)

## # A tibble: 2 x 2
##   a         b
##   <chr> <dbl>
## 1 hello     2
## 2 2         4

df1 %>% bind_rows(df2)

## Error: Can't combine `..1$a` <character> and `..2$a` <double>.

df1 <- tibble(a = "hello", b = 2, c = 3)
df2 <- tibble(a = "world", b = 4)
dfs <- list(df1, df2)
bind_rows(!!!dfs)

## # A tibble: 2 x 3
##   a         b     c
##   <chr> <dbl> <dbl>
## 1 hello     2     3
## 2 world     4    NA

Likewise, bind_cols is safer than cbind.

df1 <- tibble(x = 1:5)
df2 <- tibble(y = 1:10)

df1 %>% cbind(df2)

df1 %>% bind_cols(df2)

## Error: Can't recycle `..1` (size 5) to match `..2` (size 10).

Joining (Merging) tables

There are 6 types of joins.

inner_join(x, y) only includes observations that match in both x and y
left_join(x, y) includes all observations in x, regardless of whether they match or not.
right_join(x, y) equivalent to left_join(y, x)
full_join(x, y) (outer join) includes all observations from x and y
semi_join(x, y) only keep observations from x if there is a match in y
anti_join(x, y) remove observations from x if there is a match in y

(df1 <- tibble(id = c(1, 2), v = 2:1))

## # A tibble: 2 x 2
##      id     v
##   <dbl> <int>
## 1     1     2
## 2     2     1

(df2 <- tibble(id = c(1, 3), a = 10, b = "a"))

## # A tibble: 2 x 3
##      id     a b    
##   <dbl> <dbl> <chr>
## 1     1    10 a    
## 2     3    10 a

df1 %>% inner_join(df2, by = "id")

## # A tibble: 1 x 4
##      id     v     a b    
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10 a

df1 %>% left_join(df2, by = "id")

## # A tibble: 2 x 4
##      id     v     a b    
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10 a    
## 2     2     1    NA <NA>

df1 %>% right_join(df2, by = "id")

## # A tibble: 2 x 4
##      id     v     a b    
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10 a    
## 2     3    NA    10 a

# outer join
df1 %>% full_join(df2, by = "id")

## # A tibble: 3 x 4
##      id     v     a b    
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10 a    
## 2     2     1    NA <NA> 
## 3     3    NA    10 a

df1 %>% semi_join(df2, by = "id")

## # A tibble: 1 x 2
##      id     v
##   <dbl> <int>
## 1     1     2

df1 %>% 
  anti_join(df2, by = "id")

## # A tibble: 1 x 2
##      id     v
##   <dbl> <int>
## 1     2     1

Join by the same variable of different names

df1 <- tibble(id = c(1, 2), v = 2:1)
df2 <- tibble(name = c(1, 3), a = 10, b = "a")

df1 %>% inner_join(df2, by = c("id" = "name"))

## # A tibble: 1 x 4
##      id     v     a b    
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10 a

Updating rows

In some situation, we would like to use the values from one table to update the values of another table.

(df1 <- tibble(
  id = 1:3, 
  b = letters[c(1:2, NA)], 
  c = 0.5 + 0:2))

## # A tibble: 3 x 3
##      id b         c
##   <int> <chr> <dbl>
## 1     1 a       0.5
## 2     2 b       1.5
## 3     3 <NA>    2.5

(df2 <- tibble(id = 3, b = "z", c = 3.5))

## # A tibble: 1 x 3
##      id b         c
##   <dbl> <chr> <dbl>
## 1     3 z       3.5

Outer join doesn’t give the intended result

df1 %>% 
  full_join(df2, by = "id")

## # A tibble: 3 x 5
##      id b.x     c.x b.y     c.y
##   <dbl> <chr> <dbl> <chr> <dbl>
## 1     1 a       0.5 <NA>   NA  
## 2     2 b       1.5 <NA>   NA  
## 3     3 <NA>    2.5 z       3.5

Use rows_update to update existing rows

df1 %>% 
  rows_update(df2, by = "id")

## # A tibble: 3 x 3
##      id b         c
##   <int> <chr> <dbl>
## 1     1 a       0.5
## 2     2 b       1.5
## 3     3 z       3.5

rows_patch works like rows_update() but only overwrites NA values.

df1 %>% 
  rows_patch(df2)

## Matching, by = "id"

## # A tibble: 3 x 3
##      id b         c
##   <int> <chr> <dbl>
## 1     1 a       0.5
## 2     2 b       1.5
## 3     3 z       2.5

df3 <- tibble(id = 4, b = "z", c = 3.5)
# error when key doesn't exist
df1 %>% 
  rows_update(df3, by = "id")

## Error: Attempting to update missing rows.

Inserting rows

df3 <- tibble(id = 4, b = "z")
df1 %>% 
  rows_insert(df3, by = "id")

## # A tibble: 4 x 3
##      id b         c
##   <dbl> <chr> <dbl>
## 1     1 a       0.5
## 2     2 b       1.5
## 3     3 <NA>    2.5
## 4     4 z      NA

# error because `id = 3` exists
df1 %>% 
  rows_insert(df2, by = "id")

## Error: Attempting to insert duplicate rows.

rows_upsert inserts or updates depending on whether or not the key value already exists

df1 %>% 
  rows_upsert(tibble(id = 3:4, b = c("z", "y"), c = c(3.5, 4.5)), by = "id")

## # A tibble: 4 x 3
##      id b         c
##   <int> <chr> <dbl>
## 1     1 a       0.5
## 2     2 b       1.5
## 3     3 z       3.5
## 4     4 y       4.5

Crossing and nesting

Crossing allows you to examine all the combinations between variables.

For example, we know that not all routes between origin and dest exist.

all_existing_routes <- flights %>% expand(nesting(origin, dest))

all_possible_routes <- flights %>% expand(crossing(origin, dest))
# or simply  
all_possible_routes <- flights %>% expand(origin, dest)

# these routes do not exist
all_possible_routes %>% 
  anti_join(all_existing_routes, by = c("origin", "dest"))

## # A tibble: 91 x 2
##    origin dest 
##    <chr>  <chr>
##  1 EWR    ABQ  
##  2 EWR    ACK  
##  3 EWR    BGR  
##  4 EWR    BHM  
##  5 EWR    BUR  
##  6 EWR    CAK  
##  7 EWR    CHO  
##  8 EWR    CRW  
##  9 EWR    EYW  
## 10 EWR    ILM  
## # … with 81 more rows

Completion and missing values

expand only keeps the variables specified. If you want to join the original data, you could use complete.

# for demonstration
(dat <- flights %>% 
  group_by(origin, dest) %>% 
  summarize(air_time = mean(air_time, na.rm = TRUE)) %>% 
  ungroup())

## `summarise()` regrouping output by 'origin' (override with `.groups` argument)

## # A tibble: 224 x 3
##    origin dest  air_time
##    <chr>  <chr>    <dbl>
##  1 EWR    ALB       31.8
##  2 EWR    ANC      413. 
##  3 EWR    ATL      112. 
##  4 EWR    AUS      211. 
##  5 EWR    AVL       89.8
##  6 EWR    BDL       25.5
##  7 EWR    BNA      115. 
##  8 EWR    BOS       40.3
##  9 EWR    BQN      196. 
## 10 EWR    BTV       46.3
## # … with 214 more rows

dat %>% 
  expand(origin, dest) %>% 
  left_join(dat)

## Joining, by = c("origin", "dest")

## # A tibble: 315 x 3
##    origin dest  air_time
##    <chr>  <chr>    <dbl>
##  1 EWR    ABQ       NA  
##  2 EWR    ACK       NA  
##  3 EWR    ALB       31.8
##  4 EWR    ANC      413. 
##  5 EWR    ATL      112. 
##  6 EWR    AUS      211. 
##  7 EWR    AVL       89.8
##  8 EWR    BDL       25.5
##  9 EWR    BGR       NA  
## 10 EWR    BHM       NA  
## # … with 305 more rows

dat %>%
  complete(origin, dest, fill = list("air_time" = Inf))

## # A tibble: 315 x 3
##    origin dest  air_time
##    <chr>  <chr>    <dbl>
##  1 EWR    ABQ      Inf  
##  2 EWR    ACK      Inf  
##  3 EWR    ALB       31.8
##  4 EWR    ANC      413. 
##  5 EWR    ATL      112. 
##  6 EWR    AUS      211. 
##  7 EWR    AVL       89.8
##  8 EWR    BDL       25.5
##  9 EWR    BGR      Inf  
## 10 EWR    BHM      Inf  
## # … with 305 more rows

df <- tribble(
  ~x, ~y,
   1,  2,
  NA,  3,
   3, NA
)
df

## # A tibble: 3 x 2
##       x     y
##   <dbl> <dbl>
## 1     1     2
## 2    NA     3
## 3     3    NA

# remove all rows with missing values
df %>% drop_na()

## # A tibble: 1 x 2
##       x     y
##   <dbl> <dbl>
## 1     1     2

# fill values by copying downward
df %>% fill(x, y)

## # A tibble: 3 x 2
##       x     y
##   <dbl> <dbl>
## 1     1     2
## 2     1     3
## 3     3     3

df %>% 
  replace_na(list(x = Inf, y = -Inf))

## # A tibble: 3 x 2
##       x     y
##   <dbl> <dbl>
## 1     1     2
## 2   Inf     3
## 3     3  -Inf

Pivoting Data

Using the datasets from R for Data Science to show that the same data could be organized in different ways.

In general, there are two formats,

wide format
- each subject has its own row
- a variable may take multiple columns
long format
- each variable has its own column
- a subject may take multiple rows

Pivot longer

relig_income

## # A tibble: 18 x 11
##    religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
##    <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
##  1 Agnostic      27        34        60        81        76       137        122
##  2 Atheist       12        27        37        52        35        70         73
##  3 Buddhist      27        21        30        34        33        58         62
##  4 Catholic     418       617       732       670       638      1116        949
##  5 Don’t k…      15        14        15        11        10        35         21
##  6 Evangel…     575       869      1064       982       881      1486        949
##  7 Hindu          1         9         7         9        11        34         47
##  8 Histori…     228       244       236       238       197       223        131
##  9 Jehovah…      20        27        24        24        21        30         15
## 10 Jewish        19        19        25        25        30        95         69
## 11 Mainlin…     289       495       619       655       651      1107        939
## 12 Mormon        29        40        48        51        56       112         85
## 13 Muslim         6         7         9        10         9        23         16
## 14 Orthodox      13        17        23        32        32        47         38
## 15 Other C…       9         7        11        13        13        14         18
## 16 Other F…      20        33        40        46        49        63         46
## 17 Other W…       5         2         3         4         2         7          3
## 18 Unaffil…     217       299       374       365       341       528        407
## # … with 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>, `Don't
## #   know/refused` <dbl>

# make sure you have tidyr 1.0
relig_income %>%
  pivot_longer(-religion, names_to = "income", values_to = "count")

## # A tibble: 180 x 3
##    religion income             count
##    <chr>    <chr>              <dbl>
##  1 Agnostic <$10k                 27
##  2 Agnostic $10-20k               34
##  3 Agnostic $20-30k               60
##  4 Agnostic $30-40k               81
##  5 Agnostic $40-50k               76
##  6 Agnostic $50-75k              137
##  7 Agnostic $75-100k             122
##  8 Agnostic $100-150k            109
##  9 Agnostic >150k                 84
## 10 Agnostic Don't know/refused    96
## # … with 170 more rows

billboard

## # A tibble: 317 x 79
##    artist track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
##    <chr>  <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 2 Pac  Baby… 2000-02-26      87    82    72    77    87    94    99    NA
##  2 2Ge+h… The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
##  3 3 Doo… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
##  4 3 Doo… Loser 2000-10-21      76    76    72    69    67    65    55    59
##  5 504 B… Wobb… 2000-04-15      57    34    25    17    17    31    36    49
##  6 98^0   Give… 2000-08-19      51    39    34    26    26    19     2     2
##  7 A*Tee… Danc… 2000-07-08      97    97    96    95   100    NA    NA    NA
##  8 Aaliy… I Do… 2000-01-29      84    62    51    41    38    35    35    38
##  9 Aaliy… Try … 2000-03-18      59    53    38    28    21    18    16    14
## 10 Adams… Open… 2000-08-26      76    76    74    69    68    67    61    58
## # … with 307 more rows, and 68 more variables: wk9 <dbl>, wk10 <dbl>,
## #   wk11 <dbl>, wk12 <dbl>, wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>,
## #   wk17 <dbl>, wk18 <dbl>, wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>,
## #   wk23 <dbl>, wk24 <dbl>, wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>,
## #   wk29 <dbl>, wk30 <dbl>, wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>,
## #   wk35 <dbl>, wk36 <dbl>, wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>,
## #   wk41 <dbl>, wk42 <dbl>, wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>,
## #   wk47 <dbl>, wk48 <dbl>, wk49 <dbl>, wk50 <dbl>, wk51 <dbl>, wk52 <dbl>,
## #   wk53 <dbl>, wk54 <dbl>, wk55 <dbl>, wk56 <dbl>, wk57 <dbl>, wk58 <dbl>,
## #   wk59 <dbl>, wk60 <dbl>, wk61 <dbl>, wk62 <dbl>, wk63 <dbl>, wk64 <dbl>,
## #   wk65 <dbl>, wk66 <lgl>, wk67 <lgl>, wk68 <lgl>, wk69 <lgl>, wk70 <lgl>,
## #   wk71 <lgl>, wk72 <lgl>, wk73 <lgl>, wk74 <lgl>, wk75 <lgl>, wk76 <lgl>

billboard %>%
  pivot_longer(
    starts_with("wk"),
    names_to = "week",
    names_prefix = "wk",
    values_to = "rank",
    values_drop_na = TRUE
  )

## # A tibble: 5,307 x 5
##    artist  track                   date.entered week   rank
##    <chr>   <chr>                   <date>       <chr> <dbl>
##  1 2 Pac   Baby Don't Cry (Keep... 2000-02-26   1        87
##  2 2 Pac   Baby Don't Cry (Keep... 2000-02-26   2        82
##  3 2 Pac   Baby Don't Cry (Keep... 2000-02-26   3        72
##  4 2 Pac   Baby Don't Cry (Keep... 2000-02-26   4        77
##  5 2 Pac   Baby Don't Cry (Keep... 2000-02-26   5        87
##  6 2 Pac   Baby Don't Cry (Keep... 2000-02-26   6        94
##  7 2 Pac   Baby Don't Cry (Keep... 2000-02-26   7        99
##  8 2Ge+her The Hardest Part Of ... 2000-09-02   1        91
##  9 2Ge+her The Hardest Part Of ... 2000-09-02   2        87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02   3        92
## # … with 5,297 more rows

Pivot wider

fish_encounters

## # A tibble: 114 x 3
##    fish  station  seen
##    <fct> <fct>   <int>
##  1 4842  Release     1
##  2 4842  I80_1       1
##  3 4842  Lisbon      1
##  4 4842  Rstr        1
##  5 4842  Base_TD     1
##  6 4842  BCE         1
##  7 4842  BCW         1
##  8 4842  BCE2        1
##  9 4842  BCW2        1
## 10 4842  MAE         1
## # … with 104 more rows

fish_encounters %>% pivot_wider(
  names_from = station,
  values_from = seen,
  values_fill = list(seen = 0)
)

## # A tibble: 19 x 12
##    fish  Release I80_1 Lisbon  Rstr Base_TD   BCE   BCW  BCE2  BCW2   MAE   MAW
##    <fct>   <int> <int>  <int> <int>   <int> <int> <int> <int> <int> <int> <int>
##  1 4842        1     1      1     1       1     1     1     1     1     1     1
##  2 4843        1     1      1     1       1     1     1     1     1     1     1
##  3 4844        1     1      1     1       1     1     1     1     1     1     1
##  4 4845        1     1      1     1       1     0     0     0     0     0     0
##  5 4847        1     1      1     0       0     0     0     0     0     0     0
##  6 4848        1     1      1     1       0     0     0     0     0     0     0
##  7 4849        1     1      0     0       0     0     0     0     0     0     0
##  8 4850        1     1      0     1       1     1     1     0     0     0     0
##  9 4851        1     1      0     0       0     0     0     0     0     0     0
## 10 4854        1     1      0     0       0     0     0     0     0     0     0
## 11 4855        1     1      1     1       1     0     0     0     0     0     0
## 12 4857        1     1      1     1       1     1     1     1     1     0     0
## 13 4858        1     1      1     1       1     1     1     1     1     1     1
## 14 4859        1     1      1     1       1     0     0     0     0     0     0
## 15 4861        1     1      1     1       1     1     1     1     1     1     1
## 16 4862        1     1      1     1       1     1     1     1     1     0     0
## 17 4863        1     1      0     0       0     0     0     0     0     0     0
## 18 4864        1     1      0     0       0     0     0     0     0     0     0
## 19 4865        1     1      1     0       0     0     0     0     0     0     0

us_rent_income

## # A tibble: 104 x 5
##    GEOID NAME       variable estimate   moe
##    <chr> <chr>      <chr>       <dbl> <dbl>
##  1 01    Alabama    income      24476   136
##  2 01    Alabama    rent          747     3
##  3 02    Alaska     income      32940   508
##  4 02    Alaska     rent         1200    13
##  5 04    Arizona    income      27517   148
##  6 04    Arizona    rent          972     4
##  7 05    Arkansas   income      23789   165
##  8 05    Arkansas   rent          709     5
##  9 06    California income      29454   109
## 10 06    California rent         1358     3
## # … with 94 more rows

us_rent_income %>%
  pivot_wider(names_from = variable, values_from = c(estimate, moe))

## # A tibble: 52 x 6
##    GEOID NAME                 estimate_income estimate_rent moe_income moe_rent
##    <chr> <chr>                          <dbl>         <dbl>      <dbl>    <dbl>
##  1 01    Alabama                        24476           747        136        3
##  2 02    Alaska                         32940          1200        508       13
##  3 04    Arizona                        27517           972        148        4
##  4 05    Arkansas                       23789           709        165        5
##  5 06    California                     29454          1358        109        3
##  6 08    Colorado                       32401          1125        109        5
##  7 09    Connecticut                    35326          1123        195        5
##  8 10    Delaware                       31560          1076        247       10
##  9 11    District of Columbia           43198          1424        681       17
## 10 12    Florida                        25952          1077         70        3
## # … with 42 more rows

References

R for Data Science http://r4ds.had.co.nz/tidy-data.html
Documentation of dplyr https://dplyr.tidyverse.org/
Documentation of tidyr https://tidyr.tidyverse.org/