Quote from the book “R for Data Science”, the author said

R is an old language, and some things that were useful 10 or 20 years ago now get in your way. It’s difficult to change base R without breaking existing code, so most innovation occurs in packages.

What is tidyverse?

So what is tidyverse?

Pipe operator

We are going to detour a bit to talk about pipe operator. Pipe operator is a powerful way to make your code much cleaner and readible by

The operators pipe their left-hand side values forward into expressions that appear on the right-hand side, i.e. one can replace f(x) with x %>% f(), where %>% is the (main) pipe-operator. When coupling several function calls with the pipe-operator, the benefit will become more apparent. Consider this pseudo example:

raw_data <- read.csv("/path/to/data/file.csv")
sub_data <- subset(raw_data, variable_a > x)
new_data <- transform(sub_data, variable_c = variable_a / variable_b)
the_data <- head(new_data, 100)

However, the local variables are really not necessary, so one liner could be

the_data <- head(
  transform(
    subset(
      read.csv("/path/to/data/file.csv"),
      variable_a > x
    ),
    variable_c = variable_a / variable_b
  ),
  100
)

But that is too hard to read and reason. If we use pipe operator,

the_data <- read.csv("/path/to/data/file.csv") %>%
  subset(variable_a > x) %>%
  transform(variable_c = variable_a / variable_b) %>%
  head(100)

Hint: In RStudio, you could use Ctrl + Shift + M (or Cmd + Shift + M) to insert the pipe operator.

Basic usage

  • x %>% f or x %>% f() is equivalent to f(x)
  • x %>% f(y) is equivalent to f(x, y)
  • x %>% f %>% g %>% h is equivalent to h(g(f(x)))
sum(c(1, 2))
## [1] 3
c(1, 2) %>% sum()
## [1] 3
choose(5, 3)
## [1] 10
5 %>% choose(3)
## [1] 10

The argument placeholder

  • x %>% f(y, .) is equivalent to f(y, x)
  • x %>% f(y, z = .) is equivalent to f(y, z = x)
3 %>% choose(5, .)
## [1] 10
3 %>% choose(5, k = .)
## [1] 10

Re-using the placeholder

It is straightforward to use the placeholder several times in a right-hand side expression. However, when the placeholder only appears in a nested expressions magrittr will still apply the first-argument rule. The reason is that in most cases this results more clean code.

x %>% f(y = nrow(.), z = ncol(.)) is equivalent to f(x, y = nrow(x), z = ncol(x))

a <- 100
5 %>% choose(. - 2)
## [1] 10
choose(5, 5 - 2)
## [1] 10

The behavior can be overruled by enclosing the right-hand side in braces:

4 %>% {choose(5, . - 1)}
## [1] 10
choose(5, 4 - 1)
## [1] 10

x %>% {f(y = nrow(.), z = ncol(.))} is equivalent to f(y = nrow(x), z = ncol(x))

list(n = 5, k = 3) %>% {
  choose(.$n, .$k)
}
## [1] 10

dplyr basics

Obtain some data

First of all, we need some data to work with. If the data is stored in a csv,

flights <- read_csv("flights.csv")

We are using the tidyverse function read_csv to import the flights.csv instead of the obsolete base function read.csv. - read_csv imports data as tibble which has better and more consistent handling of variables. - read_csv is often faster than read.csv - read_csv handles unicode characters better

There are also read_tsv and read_delim for reading tab-seperated or delimited files.

The datasets are actually obtained from the R package nycflights13

# Airline on-time data for all flights departing NYC in 2013.
library(nycflights13)
flights
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

select: picks variables based on their names.

To select arrival and departure times,

# old way to do it
# flights[, c("arr_time", "dep_time")]
flights %>% select(arr_time, dep_time)
## # A tibble: 336,776 x 2
##    arr_time dep_time
##       <int>    <int>
##  1      830      517
##  2      850      533
##  3      923      542
##  4     1004      544
##  5      812      554
##  6      740      554
##  7      913      555
##  8      709      557
##  9      838      557
## 10      753      558
## # … with 336,766 more rows

I don’t see why it’s useful

dplyr provides a lot of helper functions,

# colon `:` specifies all the variables between the columns of `dep_time` and `arr_time`
flights %>% select(dep_time:arr_time)
## # A tibble: 336,776 x 4
##    dep_time sched_dep_time dep_delay arr_time
##       <int>          <int>     <dbl>    <int>
##  1      517            515         2      830
##  2      533            529         4      850
##  3      542            540         2      923
##  4      544            545        -1     1004
##  5      554            600        -6      812
##  6      554            558        -4      740
##  7      555            600        -5      913
##  8      557            600        -3      709
##  9      557            600        -3      838
## 10      558            600        -2      753
## # … with 336,766 more rows
# all the columns start with arr_
flights %>% select(starts_with("arr_"))
## # A tibble: 336,776 x 2
##    arr_time arr_delay
##       <int>     <dbl>
##  1      830        11
##  2      850        20
##  3      923        33
##  4     1004       -18
##  5      812       -25
##  6      740        12
##  7      913        19
##  8      709       -14
##  9      838        -8
## 10      753         8
## # … with 336,766 more rows
# all the columns end with _time
flights %>% select(ends_with("_time"))
## # A tibble: 336,776 x 5
##    dep_time sched_dep_time arr_time sched_arr_time air_time
##       <int>          <int>    <int>          <int>    <dbl>
##  1      517            515      830            819      227
##  2      533            529      850            830      227
##  3      542            540      923            850      160
##  4      544            545     1004           1022      183
##  5      554            600      812            837      116
##  6      554            558      740            728      150
##  7      555            600      913            854      158
##  8      557            600      709            723       53
##  9      557            600      838            846      140
## 10      558            600      753            745      138
## # … with 336,766 more rows
# all the columns contain dep
flights %>% select(contains("dep"))
## # A tibble: 336,776 x 3
##    dep_time sched_dep_time dep_delay
##       <int>          <int>     <dbl>
##  1      517            515         2
##  2      533            529         4
##  3      542            540         2
##  4      544            545        -1
##  5      554            600        -6
##  6      554            558        -4
##  7      555            600        -5
##  8      557            600        -3
##  9      557            600        -3
## 10      558            600        -2
## # … with 336,766 more rows
# remove particular columns
flights %>% select(-year)
## # A tibble: 336,776 x 18
##    month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1     1     1      517            515         2      830            819
##  2     1     1      533            529         4      850            830
##  3     1     1      542            540         2      923            850
##  4     1     1      544            545        -1     1004           1022
##  5     1     1      554            600        -6      812            837
##  6     1     1      554            558        -4      740            728
##  7     1     1      555            600        -5      913            854
##  8     1     1      557            600        -3      709            723
##  9     1     1      557            600        -3      838            846
## 10     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# all the columns do not contain dep
flights %>% select(-contains("dep"))
## # A tibble: 336,776 x 16
##     year month   day arr_time sched_arr_time arr_delay carrier flight tailnum
##    <int> <int> <int>    <int>          <int>     <dbl> <chr>    <int> <chr>  
##  1  2013     1     1      830            819        11 UA        1545 N14228 
##  2  2013     1     1      850            830        20 UA        1714 N24211 
##  3  2013     1     1      923            850        33 AA        1141 N619AA 
##  4  2013     1     1     1004           1022       -18 B6         725 N804JB 
##  5  2013     1     1      812            837       -25 DL         461 N668DN 
##  6  2013     1     1      740            728        12 UA        1696 N39463 
##  7  2013     1     1      913            854        19 B6         507 N516JB 
##  8  2013     1     1      709            723       -14 EV        5708 N829AS 
##  9  2013     1     1      838            846        -8 B6          79 N593JB 
## 10  2013     1     1      753            745         8 AA         301 N3ALAA 
## # … with 336,766 more rows, and 7 more variables: origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# using regular expression (later of the course)
flights %>% select(matches("^(arr|dep)_"))
## # A tibble: 336,776 x 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # … with 336,766 more rows

Related verbs: - if you just need a single variable, you could use pull. - you could use rename to rename columns

# flights$arr_time
flights %>% pull(arr_time)
##   [1]  830  850  923 1004  812  740  913  709  838  753  849  853  924  923  941
##  [16]  702  854  851  837  844  812  821  858  837  858  807  945  925 1039  833
##  [31] 1017  920  933  909  840 1018 1137 1016  824  721  824  740 1028  930  739
##  [46]  922  837  931  815  910 1023  936  932  936 1021 1037 1002  854  949 1007
##  [61]  948  959  944 1027 1008 1008  907  959 1123 1058  852 1151 1023  911  850
##  [76] 1017 1013 1111 1020 1052  959 1041 1049  857 1041 1011  854 1047  918 1104
##  [91] 1038 1107 1043 1059 1135 1119  939 1041 1025  955 1056 1039 1103 1053 1057
## [106] 1022  949  900  903 1132 1103 1015 1118 1006 1043 1043 1048 1100 1006 1047
## [121] 1026 1040 1103 1047 1005 1254  940 1249 1153  932 1014 1019 1151 1027 1058
## [136] 1136 1145 1027 1150 1152 1117 1018 1052 1021 1006 1134 1210 1027 1311 1053
## [151] 1138 1001 1155 1032 1102 1215 1046 1147 1143 1226 1222 1140 1516 1107 1124
## [166] 1157 1102 1140 1223 1211 1048 1045 1207 1309 1134 1020 1004 1228 1331 1241
## [181] 1219 1346 1244 1058 1313 1206 1052 1039 1152 1237 1026 1404 1221 1233 1231
## [196] 1028 1220 1218 1237 1121
##  [ reached getOption("max.print") -- omitted 336576 entries ]
flights %>% rename(y = year)
## # A tibble: 336,776 x 19
##        y month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# of course, we could select everything
flights %>% select(everything())
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# move air_time to the front
flights %>% select(air_time, everything())
## # A tibble: 336,776 x 19
##    air_time  year month   day dep_time sched_dep_time dep_delay arr_time
##       <dbl> <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1      227  2013     1     1      517            515         2      830
##  2      227  2013     1     1      533            529         4      850
##  3      160  2013     1     1      542            540         2      923
##  4      183  2013     1     1      544            545        -1     1004
##  5      116  2013     1     1      554            600        -6      812
##  6      150  2013     1     1      554            558        -4      740
##  7      158  2013     1     1      555            600        -5      913
##  8       53  2013     1     1      557            600        -3      709
##  9      140  2013     1     1      557            600        -3      838
## 10      138  2013     1     1      558            600        -2      753
## # … with 336,766 more rows, and 11 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
## #   dest <chr>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# a new function `relocate
flights %>% relocate(air_time, .after = day)
## # A tibble: 336,776 x 19
##     year month   day air_time dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <dbl>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      227      517            515         2      830
##  2  2013     1     1      227      533            529         4      850
##  3  2013     1     1      160      542            540         2      923
##  4  2013     1     1      183      544            545        -1     1004
##  5  2013     1     1      116      554            600        -6      812
##  6  2013     1     1      150      554            558        -4      740
##  7  2013     1     1      158      555            600        -5      913
##  8  2013     1     1       53      557            600        -3      709
##  9  2013     1     1      140      557            600        -3      838
## 10  2013     1     1      138      558            600        -2      753
## # … with 336,766 more rows, and 11 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
## #   dest <chr>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

filter: picks cases based on their values

# flights[flights$origin == "JFK", ]
# subset(flights, origin == "JFK")
flights %>% filter(origin == "JFK")
## # A tibble: 111,279 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      542            540         2      923            850
##  2  2013     1     1      544            545        -1     1004           1022
##  3  2013     1     1      557            600        -3      838            846
##  4  2013     1     1      558            600        -2      849            851
##  5  2013     1     1      558            600        -2      853            856
##  6  2013     1     1      558            600        -2      924            917
##  7  2013     1     1      559            559         0      702            706
##  8  2013     1     1      606            610        -4      837            845
##  9  2013     1     1      611            600        11      945            931
## 10  2013     1     1      613            610         3      925            921
## # … with 111,269 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% filter(distance > 1000)
## # A tibble: 147,105 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      555            600        -5      913            854
##  6  2013     1     1      558            600        -2      849            851
##  7  2013     1     1      558            600        -2      853            856
##  8  2013     1     1      558            600        -2      924            917
##  9  2013     1     1      558            600        -2      923            937
## 10  2013     1     1      559            600        -1      941            910
## # … with 147,095 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# note that we are using a single `&` instead of `&&` as in base R
flights %>% filter(origin == "JFK" & distance > 1000)
## # A tibble: 62,071 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      542            540         2      923            850
##  2  2013     1     1      544            545        -1     1004           1022
##  3  2013     1     1      558            600        -2      849            851
##  4  2013     1     1      558            600        -2      853            856
##  5  2013     1     1      558            600        -2      924            917
##  6  2013     1     1      611            600        11      945            931
##  7  2013     1     1      613            610         3      925            921
##  8  2013     1     1      615            615         0     1039           1100
##  9  2013     1     1      627            630        -3     1018           1018
## 10  2013     1     1      628            630        -2     1137           1140
## # … with 62,061 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% filter(origin == "JFK", distance > 1000)
## # A tibble: 62,071 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      542            540         2      923            850
##  2  2013     1     1      544            545        -1     1004           1022
##  3  2013     1     1      558            600        -2      849            851
##  4  2013     1     1      558            600        -2      853            856
##  5  2013     1     1      558            600        -2      924            917
##  6  2013     1     1      611            600        11      945            931
##  7  2013     1     1      613            610         3      925            921
##  8  2013     1     1      615            615         0     1039           1100
##  9  2013     1     1      627            630        -3     1018           1018
## 10  2013     1     1      628            630        -2     1137           1140
## # … with 62,061 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% filter(distance < 500 | distance > 1000)
## # A tibble: 227,322 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      555            600        -5      913            854
##  6  2013     1     1      557            600        -3      709            723
##  7  2013     1     1      558            600        -2      849            851
##  8  2013     1     1      558            600        -2      853            856
##  9  2013     1     1      558            600        -2      924            917
## 10  2013     1     1      558            600        -2      923            937
## # … with 227,312 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% filter(!between(distance, 500, 1000))
## # A tibble: 227,322 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      555            600        -5      913            854
##  6  2013     1     1      557            600        -3      709            723
##  7  2013     1     1      558            600        -2      849            851
##  8  2013     1     1      558            600        -2      853            856
##  9  2013     1     1      558            600        -2      924            917
## 10  2013     1     1      558            600        -2      923            937
## # … with 227,312 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# only keep the complete cases
flights %>% filter(complete.cases(.)) # filter(flights, complete.cases(flights))
## # A tibble: 327,346 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 327,336 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% drop_na() # same, but more efficient
## # A tibble: 327,346 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 327,336 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Chaining and piping

Very often, we will need to use multiple dplyr verbs, for example

# the pipe operator %>% increases readability
flights %>%
  select(origin, air_time) %>%
  filter(origin == "JFK") %>%
  filter(air_time < 500) %>%
  rename(airtime = air_time)
## # A tibble: 108,737 x 2
##    origin airtime
##    <chr>    <dbl>
##  1 JFK        160
##  2 JFK        183
##  3 JFK        140
##  4 JFK        149
##  5 JFK        158
##  6 JFK        345
##  7 JFK         44
##  8 JFK        128
##  9 JFK        366
## 10 JFK        175
## # … with 108,727 more rows
# slightly more efficient
flights %>%
  select(origin, airtime = air_time) %>%
  filter(origin == "JFK", airtime < 500)
## # A tibble: 108,737 x 2
##    origin airtime
##    <chr>    <dbl>
##  1 JFK        160
##  2 JFK        183
##  3 JFK        140
##  4 JFK        149
##  5 JFK        158
##  6 JFK        345
##  7 JFK         44
##  8 JFK        128
##  9 JFK        366
## 10 JFK        175
## # … with 108,727 more rows
# a few more examples
flights %>%
  select(origin, air_time) %>%
  filter(origin == "JFK", air_time < mean(air_time, na.rm = TRUE))
## # A tibble: 55,521 x 2
##    origin air_time
##    <chr>     <dbl>
##  1 JFK         140
##  2 JFK         149
##  3 JFK          44
##  4 JFK         128
##  5 JFK          41
##  6 JFK          63
##  7 JFK         142
##  8 JFK         147
##  9 JFK          64
## 10 JFK          54
## # … with 55,511 more rows
# maybe we want to calculate mean_air_time first
mean_air_time <- flights %>%
  pull(air_time) %>%
  mean(na.rm = TRUE)

mean_air_time
## [1] 150.6865
flights %>%
  select(origin, air_time) %>%
  filter(origin == "JFK", air_time > mean_air_time)
## # A tibble: 53,558 x 2
##    origin air_time
##    <chr>     <dbl>
##  1 JFK         160
##  2 JFK         183
##  3 JFK         158
##  4 JFK         345
##  5 JFK         366
##  6 JFK         175
##  7 JFK         182
##  8 JFK         330
##  9 JFK         192
## 10 JFK         323
## # … with 53,548 more rows
# what if there is a name collision?
air_time <- flights %>%
  pull(air_time) %>%
  mean(na.rm = TRUE)

flights %>%
  select(origin, air_time) %>%
  filter(origin == "JFK", air_time > !!air_time)
## # A tibble: 53,558 x 2
##    origin air_time
##    <chr>     <dbl>
##  1 JFK         160
##  2 JFK         183
##  3 JFK         158
##  4 JFK         345
##  5 JFK         366
##  6 JFK         175
##  7 JFK         182
##  8 JFK         330
##  9 JFK         192
## 10 JFK         323
## # … with 53,548 more rows

Remarks: use slice, head, tail if you want particular rows

flights %>% head(4)
## # A tibble: 4 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% slice(100:105)
## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      752            759        -7      955            959
## 2  2013     1     1      753            755        -2     1056           1110
## 3  2013     1     1      754            759        -5     1039           1041
## 4  2013     1     1      754            755        -1     1103           1030
## 5  2013     1     1      758            800        -2     1053           1054
## 6  2013     1     1      759            800        -1     1057           1127
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% tail(5)
## # A tibble: 5 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     9    30       NA           1455        NA       NA           1634
## 2  2013     9    30       NA           2200        NA       NA           2312
## 3  2013     9    30       NA           1210        NA       NA           1330
## 4  2013     9    30       NA           1159        NA       NA           1344
## 5  2013     9    30       NA            840        NA       NA           1020
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% slice((n() - 4):n())
## # A tibble: 5 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     9    30       NA           1455        NA       NA           1634
## 2  2013     9    30       NA           2200        NA       NA           2312
## 3  2013     9    30       NA           1210        NA       NA           1330
## 4  2013     9    30       NA           1159        NA       NA           1344
## 5  2013     9    30       NA            840        NA       NA           1020
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

mutate: adds new variables that are functions of existing variables

flights %>% mutate(
  gain = dep_delay - arr_delay,
  speed = distance / air_time * 60
)
## # A tibble: 336,776 x 21
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 13 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## #   gain <dbl>, speed <dbl>
# we could refer to the columns just created
flights %>% mutate(
  gain = dep_delay - arr_delay,
  gain_per_hour = gain / (air_time / 60)
)
## # A tibble: 336,776 x 21
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 13 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## #   gain <dbl>, gain_per_hour <dbl>
# remove a variable
flights %>% mutate(
  gain = dep_delay - arr_delay,
  gain_per_hour = gain / (air_time / 60),
  air_time = NULL,
)
## # A tibble: 336,776 x 20
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 12 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, gain <dbl>,
## #   gain_per_hour <dbl>
# `transmute` only keeps the new variables
flights %>% transmute(
  gain = dep_delay - arr_delay,
  gain_per_hour = gain / (air_time / 60)
)
## # A tibble: 336,776 x 2
##     gain gain_per_hour
##    <dbl>         <dbl>
##  1    -9         -2.38
##  2   -16         -4.23
##  3   -31        -11.6 
##  4    17          5.57
##  5    19          9.83
##  6   -16         -6.4 
##  7   -24         -9.11
##  8    11         12.5 
##  9     5          2.14
## 10   -10         -4.35
## # … with 336,766 more rows

Six variations on ranking functions

  • row_number: equivalent to rank(ties.method = "first")
  • min_rank: equivalent to rank(ties.method = "min")
  • dense_rank: like min_rank(), but with no gaps between ranks
  • percent_rank: a number between 0 and 1 computed by rescaling min_rank to [0, 1]
  • cume_dist: a cumulative distribution function. Proportion of all values less than or equal to the current rank.
  • ntile: a rough rank, which breaks the input vector into n buckets
some_data <- tibble(
  x = c(3, 4, 1, 3, 1)
)
some_data %>%
  mutate(
    row_number(),
    row_number(x),
    min_rank(x),
    dense_rank(x),
    percent_rank(x),
    cume_dist(x),
    ntile(x, 3)
  )
## # A tibble: 5 x 8
##       x `row_number()` `row_number(x)` `min_rank(x)` `dense_rank(x)`
##   <dbl>          <int>           <int>         <int>           <int>
## 1     3              1               3             3               2
## 2     4              2               5             5               3
## 3     1              3               1             1               1
## 4     3              4               4             3               2
## 5     1              5               2             1               1
## # … with 3 more variables: `percent_rank(x)` <dbl>, `cume_dist(x)` <dbl>,
## #   `ntile(x, 3)` <int>

lead and lag

some_data2 <- tibble(
  time = 1:5,
  value = c(3, 4, 1, 3, 1)
)

some_data2 %>% mutate(
  lag(value),
  lead(value)
)
## # A tibble: 5 x 4
##    time value `lag(value)` `lead(value)`
##   <int> <dbl>        <dbl>         <dbl>
## 1     1     3           NA             4
## 2     2     4            3             1
## 3     3     1            4             3
## 4     4     3            1             1
## 5     5     1            3            NA
some_data2 %>% mutate(
  diff1 = value - lag(value), 
  diff2 = lead(value) - value
)
## # A tibble: 5 x 4
##    time value diff1 diff2
##   <int> <dbl> <dbl> <dbl>
## 1     1     3    NA     1
## 2     2     4     1    -3
## 3     3     1    -3     2
## 4     4     3     2    -2
## 5     5     1    -2    NA

Remark: the package zoo has functions to calculate rolling means (moving averages)

Conditional mutation

flights %>% transmute(
  arr_delay,
  status = if_else(arr_delay > 0, "delayed", "on time")
)
## # A tibble: 336,776 x 2
##    arr_delay status 
##        <dbl> <chr>  
##  1        11 delayed
##  2        20 delayed
##  3        33 delayed
##  4       -18 on time
##  5       -25 on time
##  6        12 delayed
##  7        19 delayed
##  8       -14 on time
##  9        -8 on time
## 10         8 delayed
## # … with 336,766 more rows
(
  flight_distances <- flights %>%
    transmute(
      distance,
      distance_type = case_when(
        distance < 500 ~ "short",
        distance < 1000 ~ "mid",
        TRUE ~ "long"
      )
    )
)
## # A tibble: 336,776 x 2
##    distance distance_type
##       <dbl> <chr>        
##  1     1400 long         
##  2     1416 long         
##  3     1089 long         
##  4     1576 long         
##  5      762 mid          
##  6      719 mid          
##  7     1065 long         
##  8      229 short        
##  9      944 mid          
## 10      733 mid          
## # … with 336,766 more rows

recode values

flight_distances %>% mutate(
  distance_type = recode(distance_type,
    "long" = "long-distance",
    "mid" = "mid-distance",
    "short" = "short-distance"
))
## # A tibble: 336,776 x 2
##    distance distance_type 
##       <dbl> <chr>         
##  1     1400 long-distance 
##  2     1416 long-distance 
##  3     1089 long-distance 
##  4     1576 long-distance 
##  5      762 mid-distance  
##  6      719 mid-distance  
##  7     1065 long-distance 
##  8      229 short-distance
##  9      944 mid-distance  
## 10      733 mid-distance  
## # … with 336,766 more rows
some_data %>% mutate(
  y = recode(x,
    `1` = 10,
    `3` = 30
  )
)
## # A tibble: 5 x 2
##       x     y
##   <dbl> <dbl>
## 1     3    30
## 2     4     4
## 3     1    10
## 4     3    30
## 5     1    10

arrange: changes the ordering of the rows

flights %>% arrange(year, month, day)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% arrange(desc(dep_delay))
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     9      641            900      1301     1242           1530
##  2  2013     6    15     1432           1935      1137     1607           2120
##  3  2013     1    10     1121           1635      1126     1239           1810
##  4  2013     9    20     1139           1845      1014     1457           2210
##  5  2013     7    22      845           1600      1005     1044           1815
##  6  2013     4    10     1100           1900       960     1342           2211
##  7  2013     3    17     2321            810       911      135           1020
##  8  2013     6    27      959           1900       899     1236           2226
##  9  2013     7    22     2257            759       898      121           1026
## 10  2013    12     5      756           1700       896     1058           2020
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

summarize and group_by operations

flights %>% summarize(
  air_time = mean(air_time, na.rm = TRUE)
)
## # A tibble: 1 x 1
##   air_time
##      <dbl>
## 1     151.
# mean(flights$air_time, na.rm = TRUE)
flights %>%
  group_by(tailnum) %>%
  summarize(n = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4,044 x 2
##    tailnum     n
##    <chr>   <int>
##  1 D942DN      4
##  2 N0EGMQ    371
##  3 N10156    153
##  4 N102UW     48
##  5 N103US     46
##  6 N104UW     47
##  7 N10575    289
##  8 N105UW     45
##  9 N107US     41
## 10 N108UW     60
## # … with 4,034 more rows
flights %>%
  group_by(tailnum) %>%
  tally() # shorthand
## # A tibble: 4,044 x 2
##    tailnum     n
##    <chr>   <int>
##  1 D942DN      4
##  2 N0EGMQ    371
##  3 N10156    153
##  4 N102UW     48
##  5 N103US     46
##  6 N104UW     47
##  7 N10575    289
##  8 N105UW     45
##  9 N107US     41
## 10 N108UW     60
## # … with 4,034 more rows
flights %>% count(tailnum) # another shorthand
## # A tibble: 4,044 x 2
##    tailnum     n
##    <chr>   <int>
##  1 D942DN      4
##  2 N0EGMQ    371
##  3 N10156    153
##  4 N102UW     48
##  5 N103US     46
##  6 N104UW     47
##  7 N10575    289
##  8 N105UW     45
##  9 N107US     41
## 10 N108UW     60
## # … with 4,034 more rows
flights %>%
  group_by(tailnum) %>%
  summarize(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  )
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4,044 x 4
##    tailnum count  dist  delay
##    <chr>   <int> <dbl>  <dbl>
##  1 D942DN      4  854. 31.5  
##  2 N0EGMQ    371  676.  9.98 
##  3 N10156    153  758. 12.7  
##  4 N102UW     48  536.  2.94 
##  5 N103US     46  535. -6.93 
##  6 N104UW     47  535.  1.80 
##  7 N10575    289  520. 20.7  
##  8 N105UW     45  525. -0.267
##  9 N107US     41  529. -5.73 
## 10 N108UW     60  534. -1.25 
## # … with 4,034 more rows
flights %>%
  group_by(dest) %>%
  summarize(
    planes = n_distinct(tailnum),
    flights = n()
  )
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 105 x 3
##    dest  planes flights
##    <chr>  <int>   <int>
##  1 ABQ      108     254
##  2 ACK       58     265
##  3 ALB      172     439
##  4 ANC        6       8
##  5 ATL     1180   17215
##  6 AUS      993    2439
##  7 AVL      159     275
##  8 BDL      186     443
##  9 BGR       46     375
## 10 BHM       45     297
## # … with 95 more rows
# group by multiple variables
(per_day <- flights %>%
  group_by(year, month, day) %>%
  summarize(flights = n()))
## `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day flights
##    <int> <int> <int>   <int>
##  1  2013     1     1     842
##  2  2013     1     2     943
##  3  2013     1     3     914
##  4  2013     1     4     915
##  5  2013     1     5     720
##  6  2013     1     6     832
##  7  2013     1     7     933
##  8  2013     1     8     899
##  9  2013     1     9     902
## 10  2013     1    10     932
## # … with 355 more rows
(per_month <- per_day %>%
  summarize(flights = sum(flights)))
## `summarise()` regrouping output by 'year' (override with `.groups` argument)
## # A tibble: 12 x 3
## # Groups:   year [1]
##     year month flights
##    <int> <int>   <int>
##  1  2013     1   27004
##  2  2013     2   24951
##  3  2013     3   28834
##  4  2013     4   28330
##  5  2013     5   28796
##  6  2013     6   28243
##  7  2013     7   29425
##  8  2013     8   29327
##  9  2013     9   27574
## 10  2013    10   28889
## 11  2013    11   27268
## 12  2013    12   28135
(per_year <- per_month %>%
  summarize(flights = sum(flights)))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 1 x 2
##    year flights
##   <int>   <int>
## 1  2013  336776

across columns

Suppose, we want to compute the means for the all numeric columns of flights. Naively, we could do

flights %>%
  group_by(origin) %>%
  summarize(
    dep_time = mean(dep_time, na.rm = TRUE),
    sched_dep_time = mean(sched_dep_time, na.rm = TRUE),
    dep_delay = mean(dep_delay, na.rm = TRUE),
    arr_time = mean(arr_time, na.rm = TRUE),
    sched_arr_time = mean(sched_arr_time, na.rm = TRUE),
    arr_delay = mean(arr_delay, na.rm = TRUE),
    flight = mean(flight, na.rm = TRUE),
    air_time = mean(air_time, na.rm = TRUE),
    distance = mean(distance, na.rm = TRUE)
  )
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 10
##   origin dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
##   <chr>     <dbl>          <dbl>     <dbl>    <dbl>          <dbl>     <dbl>
## 1 EWR       1337.          1322.      15.1    1492.          1528.      9.11
## 2 JFK       1399.          1402.      12.1    1520.          1565.      5.55
## 3 LGA       1310.          1308.      10.3    1494.          1516.      5.78
## # … with 3 more variables: flight <dbl>, air_time <dbl>, distance <dbl>

With the new across function, it is as easy as:

flights %>%
  group_by(origin) %>%
  summarize(across(dep_time:distance, mean, na.rm = TRUE))
## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA

## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA

## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA

## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA

## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA

## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA

## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA

## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA

## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 13
##   origin dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
##   <chr>     <dbl>          <dbl>     <dbl>    <dbl>          <dbl>     <dbl>
## 1 EWR       1337.          1322.      15.1    1492.          1528.      9.11
## 2 JFK       1399.          1402.      12.1    1520.          1565.      5.55
## 3 LGA       1310.          1308.      10.3    1494.          1516.      5.78
## # … with 6 more variables: carrier <dbl>, flight <dbl>, tailnum <dbl>,
## #   dest <dbl>, air_time <dbl>, distance <dbl>

We have accidentally included some non numeric columns.

flights %>%
  group_by(origin) %>%
  summarize(
    across(dep_time:distance & is.numeric, mean, na.rm = TRUE)
  )
## Warning: Predicate functions must be wrapped in `where()`.
## 
##   # Bad
##   data %>% select(is.numeric)
## 
##   # Good
##   data %>% select(where(is.numeric))
## 
## ℹ Please update your code.
## This message is displayed once per session.
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 10
##   origin dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
##   <chr>     <dbl>          <dbl>     <dbl>    <dbl>          <dbl>     <dbl>
## 1 EWR       1337.          1322.      15.1    1492.          1528.      9.11
## 2 JFK       1399.          1402.      12.1    1520.          1565.      5.55
## 3 LGA       1310.          1308.      10.3    1494.          1516.      5.78
## # … with 3 more variables: flight <dbl>, air_time <dbl>, distance <dbl>
flights %>%
  group_by(origin) %>%
  summarize(
    across(-c(year, month, day) & is.numeric, mean, na.rm = TRUE)
  )
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 12
##   origin dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
##   <chr>     <dbl>          <dbl>     <dbl>    <dbl>          <dbl>     <dbl>
## 1 EWR       1337.          1322.      15.1    1492.          1528.      9.11
## 2 JFK       1399.          1402.      12.1    1520.          1565.      5.55
## 3 LGA       1310.          1308.      10.3    1494.          1516.      5.78
## # … with 5 more variables: flight <dbl>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>

We could also use predicate functions in across and use different functions to summerize different variables.

flights %>%
  group_by(origin) %>%
  summarize(
    across(ends_with("time"), mean, na.rm = TRUE),
    across(ends_with("delay"), mean, na.rm = TRUE, trim = 0.2),
  )
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 8
##   origin dep_time sched_dep_time arr_time sched_arr_time air_time dep_delay
##   <chr>     <dbl>          <dbl>    <dbl>          <dbl>    <dbl>     <dbl>
## 1 EWR       1337.          1322.    1492.          1528.     153.     2.10 
## 2 JFK       1399.          1402.    1520.          1565.     178.     0.298
## 3 LGA       1310.          1308.    1494.          1516.     118.    -1.51 
## # … with 1 more variable: arr_delay <dbl>

Other useful functions

flights %>% glimpse()
## Rows: 336,776
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558,…
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600,…
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -…
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", …
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, …
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA"…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD"…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, …
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733,…
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, …
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, …
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 …
flights %>% 
  slice_sample(n = 1000) %>% 
  ggplot(aes(arr_delay, dep_delay)) + geom_point(na.rm = TRUE)

## rows with largest values of air_time with the original order preserved
flights %>% 
  slice_max(air_time, n = 3)
## # A tibble: 4 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     3    17     1337           1335         2     1937           1836
## 2  2013     2     6      853            900        -7     1542           1540
## 3  2013     3    15     1001           1000         1     1551           1530
## 4  2013     3    17     1006           1000         6     1607           1530
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
# two disadvantages: order changes and does not handle ties
flights %>% 
  arrange(desc(air_time)) %>% 
  head(3)
## # A tibble: 3 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     3    17     1337           1335         2     1937           1836
## 2  2013     2     6      853            900        -7     1542           1540
## 3  2013     3    15     1001           1000         1     1551           1530
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Similary, there is slice_min.

Row bind or column bind data frames

df1 <- tibble(a = "hello", b = 2, c = 3)
df2 <- tibble(a = "world", b = 4)
df1 %>% bind_rows(df2)
## # A tibble: 2 x 3
##   a         b     c
##   <chr> <dbl> <dbl>
## 1 hello     2     3
## 2 world     4    NA
bind_rows(df1, df2)
## # A tibble: 2 x 3
##   a         b     c
##   <chr> <dbl> <dbl>
## 1 hello     2     3
## 2 world     4    NA

The bind_rows function from dplyr is better than rbind in several ways.

df1 <- tibble(a = "hello", b = 2, c = 3)
df2 <- tibble(a = "world", b = 4)
df1 %>% rbind(df2)
## Error in rbind(deparse.level, ...): numbers of columns of arguments do not match
df1 <- tibble(a = "hello", b = 2)
df2 <- tibble(a = 2, b = 4)
df1 %>% rbind(df2)
## # A tibble: 2 x 2
##   a         b
##   <chr> <dbl>
## 1 hello     2
## 2 2         4
df1 %>% bind_rows(df2)
## Error: Can't combine `..1$a` <character> and `..2$a` <double>.
df1 <- tibble(a = "hello", b = 2, c = 3)
df2 <- tibble(a = "world", b = 4)
dfs <- list(df1, df2)
bind_rows(!!!dfs)
## # A tibble: 2 x 3
##   a         b     c
##   <chr> <dbl> <dbl>
## 1 hello     2     3
## 2 world     4    NA

Likewise, bind_cols is safer than cbind.

df1 <- tibble(x = 1:5)
df2 <- tibble(y = 1:10)
df1 %>% cbind(df2)
##    x  y
## 1  1  1
## 2  2  2
## 3  3  3
## 4  4  4
## 5  5  5
## 6  1  6
## 7  2  7
## 8  3  8
## 9  4  9
## 10 5 10
df1 %>% bind_cols(df2)
## Error: Can't recycle `..1` (size 5) to match `..2` (size 10).

Joining (Merging) tables

There are 6 types of joins.

(df1 <- tibble(id = c(1, 2), v = 2:1))
## # A tibble: 2 x 2
##      id     v
##   <dbl> <int>
## 1     1     2
## 2     2     1
(df2 <- tibble(id = c(1, 3), a = 10, b = "a"))
## # A tibble: 2 x 3
##      id     a b    
##   <dbl> <dbl> <chr>
## 1     1    10 a    
## 2     3    10 a
df1 %>% inner_join(df2, by = "id")
## # A tibble: 1 x 4
##      id     v     a b    
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10 a
df1 %>% left_join(df2, by = "id")
## # A tibble: 2 x 4
##      id     v     a b    
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10 a    
## 2     2     1    NA <NA>
df1 %>% right_join(df2, by = "id")
## # A tibble: 2 x 4
##      id     v     a b    
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10 a    
## 2     3    NA    10 a
# outer join
df1 %>% full_join(df2, by = "id")
## # A tibble: 3 x 4
##      id     v     a b    
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10 a    
## 2     2     1    NA <NA> 
## 3     3    NA    10 a
df1 %>% semi_join(df2, by = "id")
## # A tibble: 1 x 2
##      id     v
##   <dbl> <int>
## 1     1     2
df1 %>% 
  anti_join(df2, by = "id")
## # A tibble: 1 x 2
##      id     v
##   <dbl> <int>
## 1     2     1

Join by the same variable of different names

df1 <- tibble(id = c(1, 2), v = 2:1)
df2 <- tibble(name = c(1, 3), a = 10, b = "a")
df1 %>% inner_join(df2, by = c("id" = "name"))
## # A tibble: 1 x 4
##      id     v     a b    
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10 a

Updating rows

In some situation, we would like to use the values from one table to update the values of another table.

(df1 <- tibble(
  id = 1:3, 
  b = letters[c(1:2, NA)], 
  c = 0.5 + 0:2))
## # A tibble: 3 x 3
##      id b         c
##   <int> <chr> <dbl>
## 1     1 a       0.5
## 2     2 b       1.5
## 3     3 <NA>    2.5
(df2 <- tibble(id = 3, b = "z", c = 3.5))
## # A tibble: 1 x 3
##      id b         c
##   <dbl> <chr> <dbl>
## 1     3 z       3.5

Outer join doesn’t give the intended result

df1 %>% 
  full_join(df2, by = "id")
## # A tibble: 3 x 5
##      id b.x     c.x b.y     c.y
##   <dbl> <chr> <dbl> <chr> <dbl>
## 1     1 a       0.5 <NA>   NA  
## 2     2 b       1.5 <NA>   NA  
## 3     3 <NA>    2.5 z       3.5

Use rows_update to update existing rows

df1 %>% 
  rows_update(df2, by = "id")
## # A tibble: 3 x 3
##      id b         c
##   <int> <chr> <dbl>
## 1     1 a       0.5
## 2     2 b       1.5
## 3     3 z       3.5

rows_patch works like rows_update() but only overwrites NA values.

df1 %>% 
  rows_patch(df2)
## Matching, by = "id"
## # A tibble: 3 x 3
##      id b         c
##   <int> <chr> <dbl>
## 1     1 a       0.5
## 2     2 b       1.5
## 3     3 z       2.5
df3 <- tibble(id = 4, b = "z", c = 3.5)
# error when key doesn't exist
df1 %>% 
  rows_update(df3, by = "id")
## Error: Attempting to update missing rows.

Inserting rows

df3 <- tibble(id = 4, b = "z")
df1 %>% 
  rows_insert(df3, by = "id")
## # A tibble: 4 x 3
##      id b         c
##   <dbl> <chr> <dbl>
## 1     1 a       0.5
## 2     2 b       1.5
## 3     3 <NA>    2.5
## 4     4 z      NA
# error because `id = 3` exists
df1 %>% 
  rows_insert(df2, by = "id")
## Error: Attempting to insert duplicate rows.

rows_upsert inserts or updates depending on whether or not the key value already exists

df1 %>% 
  rows_upsert(tibble(id = 3:4, b = c("z", "y"), c = c(3.5, 4.5)), by = "id")
## # A tibble: 4 x 3
##      id b         c
##   <int> <chr> <dbl>
## 1     1 a       0.5
## 2     2 b       1.5
## 3     3 z       3.5
## 4     4 y       4.5

Crossing and nesting

Crossing allows you to examine all the combinations between variables.

For example, we know that not all routes between origin and dest exist.

all_existing_routes <- flights %>% expand(nesting(origin, dest))
all_possible_routes <- flights %>% expand(crossing(origin, dest))
# or simply  
all_possible_routes <- flights %>% expand(origin, dest)
# these routes do not exist
all_possible_routes %>% 
  anti_join(all_existing_routes, by = c("origin", "dest"))
## # A tibble: 91 x 2
##    origin dest 
##    <chr>  <chr>
##  1 EWR    ABQ  
##  2 EWR    ACK  
##  3 EWR    BGR  
##  4 EWR    BHM  
##  5 EWR    BUR  
##  6 EWR    CAK  
##  7 EWR    CHO  
##  8 EWR    CRW  
##  9 EWR    EYW  
## 10 EWR    ILM  
## # … with 81 more rows

Completion and missing values

expand only keeps the variables specified. If you want to join the original data, you could use complete.

# for demonstration
(dat <- flights %>% 
  group_by(origin, dest) %>% 
  summarize(air_time = mean(air_time, na.rm = TRUE)) %>% 
  ungroup())
## `summarise()` regrouping output by 'origin' (override with `.groups` argument)
## # A tibble: 224 x 3
##    origin dest  air_time
##    <chr>  <chr>    <dbl>
##  1 EWR    ALB       31.8
##  2 EWR    ANC      413. 
##  3 EWR    ATL      112. 
##  4 EWR    AUS      211. 
##  5 EWR    AVL       89.8
##  6 EWR    BDL       25.5
##  7 EWR    BNA      115. 
##  8 EWR    BOS       40.3
##  9 EWR    BQN      196. 
## 10 EWR    BTV       46.3
## # … with 214 more rows
dat %>% 
  expand(origin, dest) %>% 
  left_join(dat)
## Joining, by = c("origin", "dest")
## # A tibble: 315 x 3
##    origin dest  air_time
##    <chr>  <chr>    <dbl>
##  1 EWR    ABQ       NA  
##  2 EWR    ACK       NA  
##  3 EWR    ALB       31.8
##  4 EWR    ANC      413. 
##  5 EWR    ATL      112. 
##  6 EWR    AUS      211. 
##  7 EWR    AVL       89.8
##  8 EWR    BDL       25.5
##  9 EWR    BGR       NA  
## 10 EWR    BHM       NA  
## # … with 305 more rows
dat %>%
  complete(origin, dest, fill = list("air_time" = Inf))
## # A tibble: 315 x 3
##    origin dest  air_time
##    <chr>  <chr>    <dbl>
##  1 EWR    ABQ      Inf  
##  2 EWR    ACK      Inf  
##  3 EWR    ALB       31.8
##  4 EWR    ANC      413. 
##  5 EWR    ATL      112. 
##  6 EWR    AUS      211. 
##  7 EWR    AVL       89.8
##  8 EWR    BDL       25.5
##  9 EWR    BGR      Inf  
## 10 EWR    BHM      Inf  
## # … with 305 more rows
df <- tribble(
  ~x, ~y,
   1,  2,
  NA,  3,
   3, NA
)
df
## # A tibble: 3 x 2
##       x     y
##   <dbl> <dbl>
## 1     1     2
## 2    NA     3
## 3     3    NA
# remove all rows with missing values
df %>% drop_na()
## # A tibble: 1 x 2
##       x     y
##   <dbl> <dbl>
## 1     1     2
# fill values by copying downward
df %>% fill(x, y)
## # A tibble: 3 x 2
##       x     y
##   <dbl> <dbl>
## 1     1     2
## 2     1     3
## 3     3     3
df %>% 
  replace_na(list(x = Inf, y = -Inf))
## # A tibble: 3 x 2
##       x     y
##   <dbl> <dbl>
## 1     1     2
## 2   Inf     3
## 3     3  -Inf

Pivoting Data

Using the datasets from R for Data Science to show that the same data could be organized in different ways.

In general, there are two formats,

Pivot longer

relig_income
## # A tibble: 18 x 11
##    religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
##    <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
##  1 Agnostic      27        34        60        81        76       137        122
##  2 Atheist       12        27        37        52        35        70         73
##  3 Buddhist      27        21        30        34        33        58         62
##  4 Catholic     418       617       732       670       638      1116        949
##  5 Don’t k…      15        14        15        11        10        35         21
##  6 Evangel…     575       869      1064       982       881      1486        949
##  7 Hindu          1         9         7         9        11        34         47
##  8 Histori…     228       244       236       238       197       223        131
##  9 Jehovah…      20        27        24        24        21        30         15
## 10 Jewish        19        19        25        25        30        95         69
## 11 Mainlin…     289       495       619       655       651      1107        939
## 12 Mormon        29        40        48        51        56       112         85
## 13 Muslim         6         7         9        10         9        23         16
## 14 Orthodox      13        17        23        32        32        47         38
## 15 Other C…       9         7        11        13        13        14         18
## 16 Other F…      20        33        40        46        49        63         46
## 17 Other W…       5         2         3         4         2         7          3
## 18 Unaffil…     217       299       374       365       341       528        407
## # … with 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>, `Don't
## #   know/refused` <dbl>
# make sure you have tidyr 1.0
relig_income %>%
  pivot_longer(-religion, names_to = "income", values_to = "count")
## # A tibble: 180 x 3
##    religion income             count
##    <chr>    <chr>              <dbl>
##  1 Agnostic <$10k                 27
##  2 Agnostic $10-20k               34
##  3 Agnostic $20-30k               60
##  4 Agnostic $30-40k               81
##  5 Agnostic $40-50k               76
##  6 Agnostic $50-75k              137
##  7 Agnostic $75-100k             122
##  8 Agnostic $100-150k            109
##  9 Agnostic >150k                 84
## 10 Agnostic Don't know/refused    96
## # … with 170 more rows
billboard
## # A tibble: 317 x 79
##    artist track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
##    <chr>  <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 2 Pac  Baby… 2000-02-26      87    82    72    77    87    94    99    NA
##  2 2Ge+h… The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
##  3 3 Doo… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
##  4 3 Doo… Loser 2000-10-21      76    76    72    69    67    65    55    59
##  5 504 B… Wobb… 2000-04-15      57    34    25    17    17    31    36    49
##  6 98^0   Give… 2000-08-19      51    39    34    26    26    19     2     2
##  7 A*Tee… Danc… 2000-07-08      97    97    96    95   100    NA    NA    NA
##  8 Aaliy… I Do… 2000-01-29      84    62    51    41    38    35    35    38
##  9 Aaliy… Try … 2000-03-18      59    53    38    28    21    18    16    14
## 10 Adams… Open… 2000-08-26      76    76    74    69    68    67    61    58
## # … with 307 more rows, and 68 more variables: wk9 <dbl>, wk10 <dbl>,
## #   wk11 <dbl>, wk12 <dbl>, wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>,
## #   wk17 <dbl>, wk18 <dbl>, wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>,
## #   wk23 <dbl>, wk24 <dbl>, wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>,
## #   wk29 <dbl>, wk30 <dbl>, wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>,
## #   wk35 <dbl>, wk36 <dbl>, wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>,
## #   wk41 <dbl>, wk42 <dbl>, wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>,
## #   wk47 <dbl>, wk48 <dbl>, wk49 <dbl>, wk50 <dbl>, wk51 <dbl>, wk52 <dbl>,
## #   wk53 <dbl>, wk54 <dbl>, wk55 <dbl>, wk56 <dbl>, wk57 <dbl>, wk58 <dbl>,
## #   wk59 <dbl>, wk60 <dbl>, wk61 <dbl>, wk62 <dbl>, wk63 <dbl>, wk64 <dbl>,
## #   wk65 <dbl>, wk66 <lgl>, wk67 <lgl>, wk68 <lgl>, wk69 <lgl>, wk70 <lgl>,
## #   wk71 <lgl>, wk72 <lgl>, wk73 <lgl>, wk74 <lgl>, wk75 <lgl>, wk76 <lgl>
billboard %>%
  pivot_longer(
    starts_with("wk"),
    names_to = "week",
    names_prefix = "wk",
    values_to = "rank",
    values_drop_na = TRUE
  )
## # A tibble: 5,307 x 5
##    artist  track                   date.entered week   rank
##    <chr>   <chr>                   <date>       <chr> <dbl>
##  1 2 Pac   Baby Don't Cry (Keep... 2000-02-26   1        87
##  2 2 Pac   Baby Don't Cry (Keep... 2000-02-26   2        82
##  3 2 Pac   Baby Don't Cry (Keep... 2000-02-26   3        72
##  4 2 Pac   Baby Don't Cry (Keep... 2000-02-26   4        77
##  5 2 Pac   Baby Don't Cry (Keep... 2000-02-26   5        87
##  6 2 Pac   Baby Don't Cry (Keep... 2000-02-26   6        94
##  7 2 Pac   Baby Don't Cry (Keep... 2000-02-26   7        99
##  8 2Ge+her The Hardest Part Of ... 2000-09-02   1        91
##  9 2Ge+her The Hardest Part Of ... 2000-09-02   2        87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02   3        92
## # … with 5,297 more rows

Pivot wider

fish_encounters
## # A tibble: 114 x 3
##    fish  station  seen
##    <fct> <fct>   <int>
##  1 4842  Release     1
##  2 4842  I80_1       1
##  3 4842  Lisbon      1
##  4 4842  Rstr        1
##  5 4842  Base_TD     1
##  6 4842  BCE         1
##  7 4842  BCW         1
##  8 4842  BCE2        1
##  9 4842  BCW2        1
## 10 4842  MAE         1
## # … with 104 more rows
fish_encounters %>% pivot_wider(
  names_from = station,
  values_from = seen,
  values_fill = list(seen = 0)
)
## # A tibble: 19 x 12
##    fish  Release I80_1 Lisbon  Rstr Base_TD   BCE   BCW  BCE2  BCW2   MAE   MAW
##    <fct>   <int> <int>  <int> <int>   <int> <int> <int> <int> <int> <int> <int>
##  1 4842        1     1      1     1       1     1     1     1     1     1     1
##  2 4843        1     1      1     1       1     1     1     1     1     1     1
##  3 4844        1     1      1     1       1     1     1     1     1     1     1
##  4 4845        1     1      1     1       1     0     0     0     0     0     0
##  5 4847        1     1      1     0       0     0     0     0     0     0     0
##  6 4848        1     1      1     1       0     0     0     0     0     0     0
##  7 4849        1     1      0     0       0     0     0     0     0     0     0
##  8 4850        1     1      0     1       1     1     1     0     0     0     0
##  9 4851        1     1      0     0       0     0     0     0     0     0     0
## 10 4854        1     1      0     0       0     0     0     0     0     0     0
## 11 4855        1     1      1     1       1     0     0     0     0     0     0
## 12 4857        1     1      1     1       1     1     1     1     1     0     0
## 13 4858        1     1      1     1       1     1     1     1     1     1     1
## 14 4859        1     1      1     1       1     0     0     0     0     0     0
## 15 4861        1     1      1     1       1     1     1     1     1     1     1
## 16 4862        1     1      1     1       1     1     1     1     1     0     0
## 17 4863        1     1      0     0       0     0     0     0     0     0     0
## 18 4864        1     1      0     0       0     0     0     0     0     0     0
## 19 4865        1     1      1     0       0     0     0     0     0     0     0
us_rent_income
## # A tibble: 104 x 5
##    GEOID NAME       variable estimate   moe
##    <chr> <chr>      <chr>       <dbl> <dbl>
##  1 01    Alabama    income      24476   136
##  2 01    Alabama    rent          747     3
##  3 02    Alaska     income      32940   508
##  4 02    Alaska     rent         1200    13
##  5 04    Arizona    income      27517   148
##  6 04    Arizona    rent          972     4
##  7 05    Arkansas   income      23789   165
##  8 05    Arkansas   rent          709     5
##  9 06    California income      29454   109
## 10 06    California rent         1358     3
## # … with 94 more rows
us_rent_income %>%
  pivot_wider(names_from = variable, values_from = c(estimate, moe))
## # A tibble: 52 x 6
##    GEOID NAME                 estimate_income estimate_rent moe_income moe_rent
##    <chr> <chr>                          <dbl>         <dbl>      <dbl>    <dbl>
##  1 01    Alabama                        24476           747        136        3
##  2 02    Alaska                         32940          1200        508       13
##  3 04    Arizona                        27517           972        148        4
##  4 05    Arkansas                       23789           709        165        5
##  5 06    California                     29454          1358        109        3
##  6 08    Colorado                       32401          1125        109        5
##  7 09    Connecticut                    35326          1123        195        5
##  8 10    Delaware                       31560          1076        247       10
##  9 11    District of Columbia           43198          1424        681       17
## 10 12    Florida                        25952          1077         70        3
## # … with 42 more rows

References