dplyr basicsselect: picks variables based on their names.
filter: picks cases based on their valuesmutate: adds new variables that are functions of existing variables
arrange: changes the ordering of the rowssummarize and group_by operations
Quote from the book “R for Data Science”, the author said
R is an old language, and some things that were useful 10 or 20 years ago now get in your way. It’s difficult to change base R without breaking existing code, so most innovation occurs in packages.
tidyverse style usually is more elegantSo what is tidyverse?
ggplot2, for data visualisationdplyr, for data manipulationtidyr, for data tidyingreadr, for data importpurrr, for functional programmingtibble, for tibbles, a modern re-imagining of data framesstringr, for stringsforcats, for factorsWe are going to detour a bit to talk about pipe operator. Pipe operator is a powerful way to make your code much cleaner and readible by
The operators pipe their left-hand side values forward into expressions that appear on the right-hand side, i.e. one can replace f(x) with x %>% f(), where %>% is the (main) pipe-operator. When coupling several function calls with the pipe-operator, the benefit will become more apparent. Consider this pseudo example:
raw_data <- read.csv("/path/to/data/file.csv")
sub_data <- subset(raw_data, variable_a > x)
new_data <- transform(sub_data, variable_c = variable_a / variable_b)
the_data <- head(new_data, 100)
However, the local variables are really not necessary, so one liner could be
the_data <- head(
transform(
subset(
read.csv("/path/to/data/file.csv"),
variable_a > x
),
variable_c = variable_a / variable_b
),
100
)
But that is too hard to read and reason. If we use pipe operator,
the_data <- read.csv("/path/to/data/file.csv") %>%
subset(variable_a > x) %>%
transform(variable_c = variable_a / variable_b) %>%
head(100)
Hint: In RStudio, you could use Ctrl + Shift + M (or Cmd + Shift + M) to insert the pipe operator.
x %>% f or x %>% f() is equivalent to f(x)x %>% f(y) is equivalent to f(x, y)x %>% f %>% g %>% h is equivalent to h(g(f(x)))sum(c(1, 2))
## [1] 3
c(1, 2) %>% sum()
## [1] 3
choose(5, 3)
## [1] 10
5 %>% choose(3)
## [1] 10
x %>% f(y, .) is equivalent to f(y, x)x %>% f(y, z = .) is equivalent to f(y, z = x)3 %>% choose(5, .)
## [1] 10
3 %>% choose(5, k = .)
## [1] 10
It is straightforward to use the placeholder several times in a right-hand side expression. However, when the placeholder only appears in a nested expressions magrittr will still apply the first-argument rule. The reason is that in most cases this results more clean code.
x %>% f(y = nrow(.), z = ncol(.)) is equivalent to f(x, y = nrow(x), z = ncol(x))
a <- 100
5 %>% choose(. - 2)
## [1] 10
choose(5, 5 - 2)
## [1] 10
The behavior can be overruled by enclosing the right-hand side in braces:
4 %>% {choose(5, . - 1)}
## [1] 10
choose(5, 4 - 1)
## [1] 10
x %>% {f(y = nrow(.), z = ncol(.))} is equivalent to f(y = nrow(x), z = ncol(x))
list(n = 5, k = 3) %>% {
choose(.$n, .$k)
}
## [1] 10
dplyr basicsselect: picks variables based on their namesfilter: picks cases based on their valuesmutate: adds new variables that are functions of existing variablesarrange: changes the ordering of the rowssummarize or summarise: reduces multiple values down to a single summarygroup_by which allows you to perform any operation “by group”.First of all, we need some data to work with. If the data is stored in a csv,
flights <- read_csv("flights.csv")
We are using the tidyverse function read_csv to import the flights.csv instead of the obsolete base function read.csv. - read_csv imports data as tibble which has better and more consistent handling of variables. - read_csv is often faster than read.csv - read_csv handles unicode characters better
There are also read_tsv and read_delim for reading tab-seperated or delimited files.
The datasets are actually obtained from the R package nycflights13
# Airline on-time data for all flights departing NYC in 2013.
library(nycflights13)
flights
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
select: picks variables based on their names.To select arrival and departure times,
# old way to do it
# flights[, c("arr_time", "dep_time")]
flights %>% select(arr_time, dep_time)
## # A tibble: 336,776 x 2
## arr_time dep_time
## <int> <int>
## 1 830 517
## 2 850 533
## 3 923 542
## 4 1004 544
## 5 812 554
## 6 740 554
## 7 913 555
## 8 709 557
## 9 838 557
## 10 753 558
## # … with 336,766 more rows
dplyr provides a lot of helper functions,
# colon `:` specifies all the variables between the columns of `dep_time` and `arr_time`
flights %>% select(dep_time:arr_time)
## # A tibble: 336,776 x 4
## dep_time sched_dep_time dep_delay arr_time
## <int> <int> <dbl> <int>
## 1 517 515 2 830
## 2 533 529 4 850
## 3 542 540 2 923
## 4 544 545 -1 1004
## 5 554 600 -6 812
## 6 554 558 -4 740
## 7 555 600 -5 913
## 8 557 600 -3 709
## 9 557 600 -3 838
## 10 558 600 -2 753
## # … with 336,766 more rows
# all the columns start with arr_
flights %>% select(starts_with("arr_"))
## # A tibble: 336,776 x 2
## arr_time arr_delay
## <int> <dbl>
## 1 830 11
## 2 850 20
## 3 923 33
## 4 1004 -18
## 5 812 -25
## 6 740 12
## 7 913 19
## 8 709 -14
## 9 838 -8
## 10 753 8
## # … with 336,766 more rows
# all the columns end with _time
flights %>% select(ends_with("_time"))
## # A tibble: 336,776 x 5
## dep_time sched_dep_time arr_time sched_arr_time air_time
## <int> <int> <int> <int> <dbl>
## 1 517 515 830 819 227
## 2 533 529 850 830 227
## 3 542 540 923 850 160
## 4 544 545 1004 1022 183
## 5 554 600 812 837 116
## 6 554 558 740 728 150
## 7 555 600 913 854 158
## 8 557 600 709 723 53
## 9 557 600 838 846 140
## 10 558 600 753 745 138
## # … with 336,766 more rows
# all the columns contain dep
flights %>% select(contains("dep"))
## # A tibble: 336,776 x 3
## dep_time sched_dep_time dep_delay
## <int> <int> <dbl>
## 1 517 515 2
## 2 533 529 4
## 3 542 540 2
## 4 544 545 -1
## 5 554 600 -6
## 6 554 558 -4
## 7 555 600 -5
## 8 557 600 -3
## 9 557 600 -3
## 10 558 600 -2
## # … with 336,766 more rows
# remove particular columns
flights %>% select(-year)
## # A tibble: 336,776 x 18
## month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <dbl> <int> <int>
## 1 1 1 517 515 2 830 819
## 2 1 1 533 529 4 850 830
## 3 1 1 542 540 2 923 850
## 4 1 1 544 545 -1 1004 1022
## 5 1 1 554 600 -6 812 837
## 6 1 1 554 558 -4 740 728
## 7 1 1 555 600 -5 913 854
## 8 1 1 557 600 -3 709 723
## 9 1 1 557 600 -3 838 846
## 10 1 1 558 600 -2 753 745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# all the columns do not contain dep
flights %>% select(-contains("dep"))
## # A tibble: 336,776 x 16
## year month day arr_time sched_arr_time arr_delay carrier flight tailnum
## <int> <int> <int> <int> <int> <dbl> <chr> <int> <chr>
## 1 2013 1 1 830 819 11 UA 1545 N14228
## 2 2013 1 1 850 830 20 UA 1714 N24211
## 3 2013 1 1 923 850 33 AA 1141 N619AA
## 4 2013 1 1 1004 1022 -18 B6 725 N804JB
## 5 2013 1 1 812 837 -25 DL 461 N668DN
## 6 2013 1 1 740 728 12 UA 1696 N39463
## 7 2013 1 1 913 854 19 B6 507 N516JB
## 8 2013 1 1 709 723 -14 EV 5708 N829AS
## 9 2013 1 1 838 846 -8 B6 79 N593JB
## 10 2013 1 1 753 745 8 AA 301 N3ALAA
## # … with 336,766 more rows, and 7 more variables: origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# using regular expression (later of the course)
flights %>% select(matches("^(arr|dep)_"))
## # A tibble: 336,776 x 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # … with 336,766 more rows
Related verbs: - if you just need a single variable, you could use pull. - you could use rename to rename columns
# flights$arr_time
flights %>% pull(arr_time)
## [1] 830 850 923 1004 812 740 913 709 838 753 849 853 924 923 941
## [16] 702 854 851 837 844 812 821 858 837 858 807 945 925 1039 833
## [31] 1017 920 933 909 840 1018 1137 1016 824 721 824 740 1028 930 739
## [46] 922 837 931 815 910 1023 936 932 936 1021 1037 1002 854 949 1007
## [61] 948 959 944 1027 1008 1008 907 959 1123 1058 852 1151 1023 911 850
## [76] 1017 1013 1111 1020 1052 959 1041 1049 857 1041 1011 854 1047 918 1104
## [91] 1038 1107 1043 1059 1135 1119 939 1041 1025 955 1056 1039 1103 1053 1057
## [106] 1022 949 900 903 1132 1103 1015 1118 1006 1043 1043 1048 1100 1006 1047
## [121] 1026 1040 1103 1047 1005 1254 940 1249 1153 932 1014 1019 1151 1027 1058
## [136] 1136 1145 1027 1150 1152 1117 1018 1052 1021 1006 1134 1210 1027 1311 1053
## [151] 1138 1001 1155 1032 1102 1215 1046 1147 1143 1226 1222 1140 1516 1107 1124
## [166] 1157 1102 1140 1223 1211 1048 1045 1207 1309 1134 1020 1004 1228 1331 1241
## [181] 1219 1346 1244 1058 1313 1206 1052 1039 1152 1237 1026 1404 1221 1233 1231
## [196] 1028 1220 1218 1237 1121
## [ reached getOption("max.print") -- omitted 336576 entries ]
flights %>% rename(y = year)
## # A tibble: 336,776 x 19
## y month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# of course, we could select everything
flights %>% select(everything())
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# move air_time to the front
flights %>% select(air_time, everything())
## # A tibble: 336,776 x 19
## air_time year month day dep_time sched_dep_time dep_delay arr_time
## <dbl> <int> <int> <int> <int> <int> <dbl> <int>
## 1 227 2013 1 1 517 515 2 830
## 2 227 2013 1 1 533 529 4 850
## 3 160 2013 1 1 542 540 2 923
## 4 183 2013 1 1 544 545 -1 1004
## 5 116 2013 1 1 554 600 -6 812
## 6 150 2013 1 1 554 558 -4 740
## 7 158 2013 1 1 555 600 -5 913
## 8 53 2013 1 1 557 600 -3 709
## 9 140 2013 1 1 557 600 -3 838
## 10 138 2013 1 1 558 600 -2 753
## # … with 336,766 more rows, and 11 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
## # dest <chr>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# a new function `relocate
flights %>% relocate(air_time, .after = day)
## # A tibble: 336,776 x 19
## year month day air_time dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <dbl> <int> <int> <dbl> <int>
## 1 2013 1 1 227 517 515 2 830
## 2 2013 1 1 227 533 529 4 850
## 3 2013 1 1 160 542 540 2 923
## 4 2013 1 1 183 544 545 -1 1004
## 5 2013 1 1 116 554 600 -6 812
## 6 2013 1 1 150 554 558 -4 740
## 7 2013 1 1 158 555 600 -5 913
## 8 2013 1 1 53 557 600 -3 709
## 9 2013 1 1 140 557 600 -3 838
## 10 2013 1 1 138 558 600 -2 753
## # … with 336,766 more rows, and 11 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
## # dest <chr>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
filter: picks cases based on their values# flights[flights$origin == "JFK", ]
# subset(flights, origin == "JFK")
flights %>% filter(origin == "JFK")
## # A tibble: 111,279 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 542 540 2 923 850
## 2 2013 1 1 544 545 -1 1004 1022
## 3 2013 1 1 557 600 -3 838 846
## 4 2013 1 1 558 600 -2 849 851
## 5 2013 1 1 558 600 -2 853 856
## 6 2013 1 1 558 600 -2 924 917
## 7 2013 1 1 559 559 0 702 706
## 8 2013 1 1 606 610 -4 837 845
## 9 2013 1 1 611 600 11 945 931
## 10 2013 1 1 613 610 3 925 921
## # … with 111,269 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% filter(distance > 1000)
## # A tibble: 147,105 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 555 600 -5 913 854
## 6 2013 1 1 558 600 -2 849 851
## 7 2013 1 1 558 600 -2 853 856
## 8 2013 1 1 558 600 -2 924 917
## 9 2013 1 1 558 600 -2 923 937
## 10 2013 1 1 559 600 -1 941 910
## # … with 147,095 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# note that we are using a single `&` instead of `&&` as in base R
flights %>% filter(origin == "JFK" & distance > 1000)
## # A tibble: 62,071 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 542 540 2 923 850
## 2 2013 1 1 544 545 -1 1004 1022
## 3 2013 1 1 558 600 -2 849 851
## 4 2013 1 1 558 600 -2 853 856
## 5 2013 1 1 558 600 -2 924 917
## 6 2013 1 1 611 600 11 945 931
## 7 2013 1 1 613 610 3 925 921
## 8 2013 1 1 615 615 0 1039 1100
## 9 2013 1 1 627 630 -3 1018 1018
## 10 2013 1 1 628 630 -2 1137 1140
## # … with 62,061 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% filter(origin == "JFK", distance > 1000)
## # A tibble: 62,071 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 542 540 2 923 850
## 2 2013 1 1 544 545 -1 1004 1022
## 3 2013 1 1 558 600 -2 849 851
## 4 2013 1 1 558 600 -2 853 856
## 5 2013 1 1 558 600 -2 924 917
## 6 2013 1 1 611 600 11 945 931
## 7 2013 1 1 613 610 3 925 921
## 8 2013 1 1 615 615 0 1039 1100
## 9 2013 1 1 627 630 -3 1018 1018
## 10 2013 1 1 628 630 -2 1137 1140
## # … with 62,061 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% filter(distance < 500 | distance > 1000)
## # A tibble: 227,322 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 555 600 -5 913 854
## 6 2013 1 1 557 600 -3 709 723
## 7 2013 1 1 558 600 -2 849 851
## 8 2013 1 1 558 600 -2 853 856
## 9 2013 1 1 558 600 -2 924 917
## 10 2013 1 1 558 600 -2 923 937
## # … with 227,312 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% filter(!between(distance, 500, 1000))
## # A tibble: 227,322 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 555 600 -5 913 854
## 6 2013 1 1 557 600 -3 709 723
## 7 2013 1 1 558 600 -2 849 851
## 8 2013 1 1 558 600 -2 853 856
## 9 2013 1 1 558 600 -2 924 917
## 10 2013 1 1 558 600 -2 923 937
## # … with 227,312 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# only keep the complete cases
flights %>% filter(complete.cases(.)) # filter(flights, complete.cases(flights))
## # A tibble: 327,346 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 327,336 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% drop_na() # same, but more efficient
## # A tibble: 327,346 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 327,336 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Very often, we will need to use multiple dplyr verbs, for example
# the pipe operator %>% increases readability
flights %>%
select(origin, air_time) %>%
filter(origin == "JFK") %>%
filter(air_time < 500) %>%
rename(airtime = air_time)
## # A tibble: 108,737 x 2
## origin airtime
## <chr> <dbl>
## 1 JFK 160
## 2 JFK 183
## 3 JFK 140
## 4 JFK 149
## 5 JFK 158
## 6 JFK 345
## 7 JFK 44
## 8 JFK 128
## 9 JFK 366
## 10 JFK 175
## # … with 108,727 more rows
# slightly more efficient
flights %>%
select(origin, airtime = air_time) %>%
filter(origin == "JFK", airtime < 500)
## # A tibble: 108,737 x 2
## origin airtime
## <chr> <dbl>
## 1 JFK 160
## 2 JFK 183
## 3 JFK 140
## 4 JFK 149
## 5 JFK 158
## 6 JFK 345
## 7 JFK 44
## 8 JFK 128
## 9 JFK 366
## 10 JFK 175
## # … with 108,727 more rows
# a few more examples
flights %>%
select(origin, air_time) %>%
filter(origin == "JFK", air_time < mean(air_time, na.rm = TRUE))
## # A tibble: 55,521 x 2
## origin air_time
## <chr> <dbl>
## 1 JFK 140
## 2 JFK 149
## 3 JFK 44
## 4 JFK 128
## 5 JFK 41
## 6 JFK 63
## 7 JFK 142
## 8 JFK 147
## 9 JFK 64
## 10 JFK 54
## # … with 55,511 more rows
# maybe we want to calculate mean_air_time first
mean_air_time <- flights %>%
pull(air_time) %>%
mean(na.rm = TRUE)
mean_air_time
## [1] 150.6865
flights %>%
select(origin, air_time) %>%
filter(origin == "JFK", air_time > mean_air_time)
## # A tibble: 53,558 x 2
## origin air_time
## <chr> <dbl>
## 1 JFK 160
## 2 JFK 183
## 3 JFK 158
## 4 JFK 345
## 5 JFK 366
## 6 JFK 175
## 7 JFK 182
## 8 JFK 330
## 9 JFK 192
## 10 JFK 323
## # … with 53,548 more rows
# what if there is a name collision?
air_time <- flights %>%
pull(air_time) %>%
mean(na.rm = TRUE)
flights %>%
select(origin, air_time) %>%
filter(origin == "JFK", air_time > !!air_time)
## # A tibble: 53,558 x 2
## origin air_time
## <chr> <dbl>
## 1 JFK 160
## 2 JFK 183
## 3 JFK 158
## 4 JFK 345
## 5 JFK 366
## 6 JFK 175
## 7 JFK 182
## 8 JFK 330
## 9 JFK 192
## 10 JFK 323
## # … with 53,548 more rows
Remarks: use slice, head, tail if you want particular rows
flights %>% head(4)
## # A tibble: 4 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% slice(100:105)
## # A tibble: 6 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 752 759 -7 955 959
## 2 2013 1 1 753 755 -2 1056 1110
## 3 2013 1 1 754 759 -5 1039 1041
## 4 2013 1 1 754 755 -1 1103 1030
## 5 2013 1 1 758 800 -2 1053 1054
## 6 2013 1 1 759 800 -1 1057 1127
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% tail(5)
## # A tibble: 5 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 9 30 NA 1455 NA NA 1634
## 2 2013 9 30 NA 2200 NA NA 2312
## 3 2013 9 30 NA 1210 NA NA 1330
## 4 2013 9 30 NA 1159 NA NA 1344
## 5 2013 9 30 NA 840 NA NA 1020
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% slice((n() - 4):n())
## # A tibble: 5 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 9 30 NA 1455 NA NA 1634
## 2 2013 9 30 NA 2200 NA NA 2312
## 3 2013 9 30 NA 1210 NA NA 1330
## 4 2013 9 30 NA 1159 NA NA 1344
## 5 2013 9 30 NA 840 NA NA 1020
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
mutate: adds new variables that are functions of existing variablesflights %>% mutate(
gain = dep_delay - arr_delay,
speed = distance / air_time * 60
)
## # A tibble: 336,776 x 21
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 336,766 more rows, and 13 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## # gain <dbl>, speed <dbl>
# we could refer to the columns just created
flights %>% mutate(
gain = dep_delay - arr_delay,
gain_per_hour = gain / (air_time / 60)
)
## # A tibble: 336,776 x 21
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 336,766 more rows, and 13 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## # gain <dbl>, gain_per_hour <dbl>
# remove a variable
flights %>% mutate(
gain = dep_delay - arr_delay,
gain_per_hour = gain / (air_time / 60),
air_time = NULL,
)
## # A tibble: 336,776 x 20
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 336,766 more rows, and 12 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, gain <dbl>,
## # gain_per_hour <dbl>
# `transmute` only keeps the new variables
flights %>% transmute(
gain = dep_delay - arr_delay,
gain_per_hour = gain / (air_time / 60)
)
## # A tibble: 336,776 x 2
## gain gain_per_hour
## <dbl> <dbl>
## 1 -9 -2.38
## 2 -16 -4.23
## 3 -31 -11.6
## 4 17 5.57
## 5 19 9.83
## 6 -16 -6.4
## 7 -24 -9.11
## 8 11 12.5
## 9 5 2.14
## 10 -10 -4.35
## # … with 336,766 more rows
row_number: equivalent to rank(ties.method = "first")min_rank: equivalent to rank(ties.method = "min")dense_rank: like min_rank(), but with no gaps between rankspercent_rank: a number between 0 and 1 computed by rescaling min_rank to [0, 1]cume_dist: a cumulative distribution function. Proportion of all values less than or equal to the current rank.ntile: a rough rank, which breaks the input vector into n bucketssome_data <- tibble(
x = c(3, 4, 1, 3, 1)
)
some_data %>%
mutate(
row_number(),
row_number(x),
min_rank(x),
dense_rank(x),
percent_rank(x),
cume_dist(x),
ntile(x, 3)
)
## # A tibble: 5 x 8
## x `row_number()` `row_number(x)` `min_rank(x)` `dense_rank(x)`
## <dbl> <int> <int> <int> <int>
## 1 3 1 3 3 2
## 2 4 2 5 5 3
## 3 1 3 1 1 1
## 4 3 4 4 3 2
## 5 1 5 2 1 1
## # … with 3 more variables: `percent_rank(x)` <dbl>, `cume_dist(x)` <dbl>,
## # `ntile(x, 3)` <int>
lead and lagsome_data2 <- tibble(
time = 1:5,
value = c(3, 4, 1, 3, 1)
)
some_data2 %>% mutate(
lag(value),
lead(value)
)
## # A tibble: 5 x 4
## time value `lag(value)` `lead(value)`
## <int> <dbl> <dbl> <dbl>
## 1 1 3 NA 4
## 2 2 4 3 1
## 3 3 1 4 3
## 4 4 3 1 1
## 5 5 1 3 NA
some_data2 %>% mutate(
diff1 = value - lag(value),
diff2 = lead(value) - value
)
## # A tibble: 5 x 4
## time value diff1 diff2
## <int> <dbl> <dbl> <dbl>
## 1 1 3 NA 1
## 2 2 4 1 -3
## 3 3 1 -3 2
## 4 4 3 2 -2
## 5 5 1 -2 NA
Remark: the package zoo has functions to calculate rolling means (moving averages)
flights %>% transmute(
arr_delay,
status = if_else(arr_delay > 0, "delayed", "on time")
)
## # A tibble: 336,776 x 2
## arr_delay status
## <dbl> <chr>
## 1 11 delayed
## 2 20 delayed
## 3 33 delayed
## 4 -18 on time
## 5 -25 on time
## 6 12 delayed
## 7 19 delayed
## 8 -14 on time
## 9 -8 on time
## 10 8 delayed
## # … with 336,766 more rows
(
flight_distances <- flights %>%
transmute(
distance,
distance_type = case_when(
distance < 500 ~ "short",
distance < 1000 ~ "mid",
TRUE ~ "long"
)
)
)
## # A tibble: 336,776 x 2
## distance distance_type
## <dbl> <chr>
## 1 1400 long
## 2 1416 long
## 3 1089 long
## 4 1576 long
## 5 762 mid
## 6 719 mid
## 7 1065 long
## 8 229 short
## 9 944 mid
## 10 733 mid
## # … with 336,766 more rows
recode valuesflight_distances %>% mutate(
distance_type = recode(distance_type,
"long" = "long-distance",
"mid" = "mid-distance",
"short" = "short-distance"
))
## # A tibble: 336,776 x 2
## distance distance_type
## <dbl> <chr>
## 1 1400 long-distance
## 2 1416 long-distance
## 3 1089 long-distance
## 4 1576 long-distance
## 5 762 mid-distance
## 6 719 mid-distance
## 7 1065 long-distance
## 8 229 short-distance
## 9 944 mid-distance
## 10 733 mid-distance
## # … with 336,766 more rows
some_data %>% mutate(
y = recode(x,
`1` = 10,
`3` = 30
)
)
## # A tibble: 5 x 2
## x y
## <dbl> <dbl>
## 1 3 30
## 2 4 4
## 3 1 10
## 4 3 30
## 5 1 10
arrange: changes the ordering of the rowsflights %>% arrange(year, month, day)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% arrange(desc(dep_delay))
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 9 641 900 1301 1242 1530
## 2 2013 6 15 1432 1935 1137 1607 2120
## 3 2013 1 10 1121 1635 1126 1239 1810
## 4 2013 9 20 1139 1845 1014 1457 2210
## 5 2013 7 22 845 1600 1005 1044 1815
## 6 2013 4 10 1100 1900 960 1342 2211
## 7 2013 3 17 2321 810 911 135 1020
## 8 2013 6 27 959 1900 899 1236 2226
## 9 2013 7 22 2257 759 898 121 1026
## 10 2013 12 5 756 1700 896 1058 2020
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
summarize and group_by operationsflights %>% summarize(
air_time = mean(air_time, na.rm = TRUE)
)
## # A tibble: 1 x 1
## air_time
## <dbl>
## 1 151.
# mean(flights$air_time, na.rm = TRUE)
flights %>%
group_by(tailnum) %>%
summarize(n = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4,044 x 2
## tailnum n
## <chr> <int>
## 1 D942DN 4
## 2 N0EGMQ 371
## 3 N10156 153
## 4 N102UW 48
## 5 N103US 46
## 6 N104UW 47
## 7 N10575 289
## 8 N105UW 45
## 9 N107US 41
## 10 N108UW 60
## # … with 4,034 more rows
flights %>%
group_by(tailnum) %>%
tally() # shorthand
## # A tibble: 4,044 x 2
## tailnum n
## <chr> <int>
## 1 D942DN 4
## 2 N0EGMQ 371
## 3 N10156 153
## 4 N102UW 48
## 5 N103US 46
## 6 N104UW 47
## 7 N10575 289
## 8 N105UW 45
## 9 N107US 41
## 10 N108UW 60
## # … with 4,034 more rows
flights %>% count(tailnum) # another shorthand
## # A tibble: 4,044 x 2
## tailnum n
## <chr> <int>
## 1 D942DN 4
## 2 N0EGMQ 371
## 3 N10156 153
## 4 N102UW 48
## 5 N103US 46
## 6 N104UW 47
## 7 N10575 289
## 8 N105UW 45
## 9 N107US 41
## 10 N108UW 60
## # … with 4,034 more rows
flights %>%
group_by(tailnum) %>%
summarize(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4,044 x 4
## tailnum count dist delay
## <chr> <int> <dbl> <dbl>
## 1 D942DN 4 854. 31.5
## 2 N0EGMQ 371 676. 9.98
## 3 N10156 153 758. 12.7
## 4 N102UW 48 536. 2.94
## 5 N103US 46 535. -6.93
## 6 N104UW 47 535. 1.80
## 7 N10575 289 520. 20.7
## 8 N105UW 45 525. -0.267
## 9 N107US 41 529. -5.73
## 10 N108UW 60 534. -1.25
## # … with 4,034 more rows
flights %>%
group_by(dest) %>%
summarize(
planes = n_distinct(tailnum),
flights = n()
)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 105 x 3
## dest planes flights
## <chr> <int> <int>
## 1 ABQ 108 254
## 2 ACK 58 265
## 3 ALB 172 439
## 4 ANC 6 8
## 5 ATL 1180 17215
## 6 AUS 993 2439
## 7 AVL 159 275
## 8 BDL 186 443
## 9 BGR 46 375
## 10 BHM 45 297
## # … with 95 more rows
# group by multiple variables
(per_day <- flights %>%
group_by(year, month, day) %>%
summarize(flights = n()))
## `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
## # A tibble: 365 x 4
## # Groups: year, month [12]
## year month day flights
## <int> <int> <int> <int>
## 1 2013 1 1 842
## 2 2013 1 2 943
## 3 2013 1 3 914
## 4 2013 1 4 915
## 5 2013 1 5 720
## 6 2013 1 6 832
## 7 2013 1 7 933
## 8 2013 1 8 899
## 9 2013 1 9 902
## 10 2013 1 10 932
## # … with 355 more rows
(per_month <- per_day %>%
summarize(flights = sum(flights)))
## `summarise()` regrouping output by 'year' (override with `.groups` argument)
## # A tibble: 12 x 3
## # Groups: year [1]
## year month flights
## <int> <int> <int>
## 1 2013 1 27004
## 2 2013 2 24951
## 3 2013 3 28834
## 4 2013 4 28330
## 5 2013 5 28796
## 6 2013 6 28243
## 7 2013 7 29425
## 8 2013 8 29327
## 9 2013 9 27574
## 10 2013 10 28889
## 11 2013 11 27268
## 12 2013 12 28135
(per_year <- per_month %>%
summarize(flights = sum(flights)))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 1 x 2
## year flights
## <int> <int>
## 1 2013 336776
across columnsSuppose, we want to compute the means for the all numeric columns of flights. Naively, we could do
flights %>%
group_by(origin) %>%
summarize(
dep_time = mean(dep_time, na.rm = TRUE),
sched_dep_time = mean(sched_dep_time, na.rm = TRUE),
dep_delay = mean(dep_delay, na.rm = TRUE),
arr_time = mean(arr_time, na.rm = TRUE),
sched_arr_time = mean(sched_arr_time, na.rm = TRUE),
arr_delay = mean(arr_delay, na.rm = TRUE),
flight = mean(flight, na.rm = TRUE),
air_time = mean(air_time, na.rm = TRUE),
distance = mean(distance, na.rm = TRUE)
)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 10
## origin dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 1337. 1322. 15.1 1492. 1528. 9.11
## 2 JFK 1399. 1402. 12.1 1520. 1565. 5.55
## 3 LGA 1310. 1308. 10.3 1494. 1516. 5.78
## # … with 3 more variables: flight <dbl>, air_time <dbl>, distance <dbl>
With the new across function, it is as easy as:
flights %>%
group_by(origin) %>%
summarize(across(dep_time:distance, mean, na.rm = TRUE))
## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA
## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA
## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA
## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA
## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA
## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA
## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA
## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA
## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 13
## origin dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 1337. 1322. 15.1 1492. 1528. 9.11
## 2 JFK 1399. 1402. 12.1 1520. 1565. 5.55
## 3 LGA 1310. 1308. 10.3 1494. 1516. 5.78
## # … with 6 more variables: carrier <dbl>, flight <dbl>, tailnum <dbl>,
## # dest <dbl>, air_time <dbl>, distance <dbl>
We have accidentally included some non numeric columns.
flights %>%
group_by(origin) %>%
summarize(
across(dep_time:distance & is.numeric, mean, na.rm = TRUE)
)
## Warning: Predicate functions must be wrapped in `where()`.
##
## # Bad
## data %>% select(is.numeric)
##
## # Good
## data %>% select(where(is.numeric))
##
## ℹ Please update your code.
## This message is displayed once per session.
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 10
## origin dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 1337. 1322. 15.1 1492. 1528. 9.11
## 2 JFK 1399. 1402. 12.1 1520. 1565. 5.55
## 3 LGA 1310. 1308. 10.3 1494. 1516. 5.78
## # … with 3 more variables: flight <dbl>, air_time <dbl>, distance <dbl>
flights %>%
group_by(origin) %>%
summarize(
across(-c(year, month, day) & is.numeric, mean, na.rm = TRUE)
)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 12
## origin dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 1337. 1322. 15.1 1492. 1528. 9.11
## 2 JFK 1399. 1402. 12.1 1520. 1565. 5.55
## 3 LGA 1310. 1308. 10.3 1494. 1516. 5.78
## # … with 5 more variables: flight <dbl>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>
We could also use predicate functions in across and use different functions to summerize different variables.
flights %>%
group_by(origin) %>%
summarize(
across(ends_with("time"), mean, na.rm = TRUE),
across(ends_with("delay"), mean, na.rm = TRUE, trim = 0.2),
)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 8
## origin dep_time sched_dep_time arr_time sched_arr_time air_time dep_delay
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 1337. 1322. 1492. 1528. 153. 2.10
## 2 JFK 1399. 1402. 1520. 1565. 178. 0.298
## 3 LGA 1310. 1308. 1494. 1516. 118. -1.51
## # … with 1 more variable: arr_delay <dbl>
flights %>% glimpse()
## Rows: 336,776
## Columns: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558,…
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600,…
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -…
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851…
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -…
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", …
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, …
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39…
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA"…
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD"…
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, …
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733,…
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, …
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, …
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 …
flights %>%
slice_sample(n = 1000) %>%
ggplot(aes(arr_delay, dep_delay)) + geom_point(na.rm = TRUE)
## rows with largest values of air_time with the original order preserved
flights %>%
slice_max(air_time, n = 3)
## # A tibble: 4 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 3 17 1337 1335 2 1937 1836
## 2 2013 2 6 853 900 -7 1542 1540
## 3 2013 3 15 1001 1000 1 1551 1530
## 4 2013 3 17 1006 1000 6 1607 1530
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
# two disadvantages: order changes and does not handle ties
flights %>%
arrange(desc(air_time)) %>%
head(3)
## # A tibble: 3 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 3 17 1337 1335 2 1937 1836
## 2 2013 2 6 853 900 -7 1542 1540
## 3 2013 3 15 1001 1000 1 1551 1530
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Similary, there is slice_min.
df1 <- tibble(a = "hello", b = 2, c = 3)
df2 <- tibble(a = "world", b = 4)
df1 %>% bind_rows(df2)
## # A tibble: 2 x 3
## a b c
## <chr> <dbl> <dbl>
## 1 hello 2 3
## 2 world 4 NA
bind_rows(df1, df2)
## # A tibble: 2 x 3
## a b c
## <chr> <dbl> <dbl>
## 1 hello 2 3
## 2 world 4 NA
The bind_rows function from dplyr is better than rbind in several ways.
bind_rows handles missing valuesbind_rows doesn’t automatically convert typesbind_rows supports splice operatordf1 <- tibble(a = "hello", b = 2, c = 3)
df2 <- tibble(a = "world", b = 4)
df1 %>% rbind(df2)
## Error in rbind(deparse.level, ...): numbers of columns of arguments do not match
df1 <- tibble(a = "hello", b = 2)
df2 <- tibble(a = 2, b = 4)
df1 %>% rbind(df2)
## # A tibble: 2 x 2
## a b
## <chr> <dbl>
## 1 hello 2
## 2 2 4
df1 %>% bind_rows(df2)
## Error: Can't combine `..1$a` <character> and `..2$a` <double>.
df1 <- tibble(a = "hello", b = 2, c = 3)
df2 <- tibble(a = "world", b = 4)
dfs <- list(df1, df2)
bind_rows(!!!dfs)
## # A tibble: 2 x 3
## a b c
## <chr> <dbl> <dbl>
## 1 hello 2 3
## 2 world 4 NA
Likewise, bind_cols is safer than cbind.
df1 <- tibble(x = 1:5)
df2 <- tibble(y = 1:10)
df1 %>% cbind(df2)
## x y
## 1 1 1
## 2 2 2
## 3 3 3
## 4 4 4
## 5 5 5
## 6 1 6
## 7 2 7
## 8 3 8
## 9 4 9
## 10 5 10
df1 %>% bind_cols(df2)
## Error: Can't recycle `..1` (size 5) to match `..2` (size 10).
There are 6 types of joins.
inner_join(x, y) only includes observations that match in both x and yleft_join(x, y) includes all observations in x, regardless of whether they match or not.right_join(x, y) equivalent to left_join(y, x)full_join(x, y) (outer join) includes all observations from x and ysemi_join(x, y) only keep observations from x if there is a match in yanti_join(x, y) remove observations from x if there is a match in y(df1 <- tibble(id = c(1, 2), v = 2:1))
## # A tibble: 2 x 2
## id v
## <dbl> <int>
## 1 1 2
## 2 2 1
(df2 <- tibble(id = c(1, 3), a = 10, b = "a"))
## # A tibble: 2 x 3
## id a b
## <dbl> <dbl> <chr>
## 1 1 10 a
## 2 3 10 a
df1 %>% inner_join(df2, by = "id")
## # A tibble: 1 x 4
## id v a b
## <dbl> <int> <dbl> <chr>
## 1 1 2 10 a
df1 %>% left_join(df2, by = "id")
## # A tibble: 2 x 4
## id v a b
## <dbl> <int> <dbl> <chr>
## 1 1 2 10 a
## 2 2 1 NA <NA>
df1 %>% right_join(df2, by = "id")
## # A tibble: 2 x 4
## id v a b
## <dbl> <int> <dbl> <chr>
## 1 1 2 10 a
## 2 3 NA 10 a
# outer join
df1 %>% full_join(df2, by = "id")
## # A tibble: 3 x 4
## id v a b
## <dbl> <int> <dbl> <chr>
## 1 1 2 10 a
## 2 2 1 NA <NA>
## 3 3 NA 10 a
df1 %>% semi_join(df2, by = "id")
## # A tibble: 1 x 2
## id v
## <dbl> <int>
## 1 1 2
df1 %>%
anti_join(df2, by = "id")
## # A tibble: 1 x 2
## id v
## <dbl> <int>
## 1 2 1
df1 <- tibble(id = c(1, 2), v = 2:1)
df2 <- tibble(name = c(1, 3), a = 10, b = "a")
df1 %>% inner_join(df2, by = c("id" = "name"))
## # A tibble: 1 x 4
## id v a b
## <dbl> <int> <dbl> <chr>
## 1 1 2 10 a
In some situation, we would like to use the values from one table to update the values of another table.
(df1 <- tibble(
id = 1:3,
b = letters[c(1:2, NA)],
c = 0.5 + 0:2))
## # A tibble: 3 x 3
## id b c
## <int> <chr> <dbl>
## 1 1 a 0.5
## 2 2 b 1.5
## 3 3 <NA> 2.5
(df2 <- tibble(id = 3, b = "z", c = 3.5))
## # A tibble: 1 x 3
## id b c
## <dbl> <chr> <dbl>
## 1 3 z 3.5
Outer join doesn’t give the intended result
df1 %>%
full_join(df2, by = "id")
## # A tibble: 3 x 5
## id b.x c.x b.y c.y
## <dbl> <chr> <dbl> <chr> <dbl>
## 1 1 a 0.5 <NA> NA
## 2 2 b 1.5 <NA> NA
## 3 3 <NA> 2.5 z 3.5
Use rows_update to update existing rows
df1 %>%
rows_update(df2, by = "id")
## # A tibble: 3 x 3
## id b c
## <int> <chr> <dbl>
## 1 1 a 0.5
## 2 2 b 1.5
## 3 3 z 3.5
rows_patch works like rows_update() but only overwrites NA values.
df1 %>%
rows_patch(df2)
## Matching, by = "id"
## # A tibble: 3 x 3
## id b c
## <int> <chr> <dbl>
## 1 1 a 0.5
## 2 2 b 1.5
## 3 3 z 2.5
df3 <- tibble(id = 4, b = "z", c = 3.5)
# error when key doesn't exist
df1 %>%
rows_update(df3, by = "id")
## Error: Attempting to update missing rows.
Inserting rows
df3 <- tibble(id = 4, b = "z")
df1 %>%
rows_insert(df3, by = "id")
## # A tibble: 4 x 3
## id b c
## <dbl> <chr> <dbl>
## 1 1 a 0.5
## 2 2 b 1.5
## 3 3 <NA> 2.5
## 4 4 z NA
# error because `id = 3` exists
df1 %>%
rows_insert(df2, by = "id")
## Error: Attempting to insert duplicate rows.
rows_upsert inserts or updates depending on whether or not the key value already exists
df1 %>%
rows_upsert(tibble(id = 3:4, b = c("z", "y"), c = c(3.5, 4.5)), by = "id")
## # A tibble: 4 x 3
## id b c
## <int> <chr> <dbl>
## 1 1 a 0.5
## 2 2 b 1.5
## 3 3 z 3.5
## 4 4 y 4.5
Crossing allows you to examine all the combinations between variables.
For example, we know that not all routes between origin and dest exist.
all_existing_routes <- flights %>% expand(nesting(origin, dest))
all_possible_routes <- flights %>% expand(crossing(origin, dest))
# or simply
all_possible_routes <- flights %>% expand(origin, dest)
# these routes do not exist
all_possible_routes %>%
anti_join(all_existing_routes, by = c("origin", "dest"))
## # A tibble: 91 x 2
## origin dest
## <chr> <chr>
## 1 EWR ABQ
## 2 EWR ACK
## 3 EWR BGR
## 4 EWR BHM
## 5 EWR BUR
## 6 EWR CAK
## 7 EWR CHO
## 8 EWR CRW
## 9 EWR EYW
## 10 EWR ILM
## # … with 81 more rows
expand only keeps the variables specified. If you want to join the original data, you could use complete.
# for demonstration
(dat <- flights %>%
group_by(origin, dest) %>%
summarize(air_time = mean(air_time, na.rm = TRUE)) %>%
ungroup())
## `summarise()` regrouping output by 'origin' (override with `.groups` argument)
## # A tibble: 224 x 3
## origin dest air_time
## <chr> <chr> <dbl>
## 1 EWR ALB 31.8
## 2 EWR ANC 413.
## 3 EWR ATL 112.
## 4 EWR AUS 211.
## 5 EWR AVL 89.8
## 6 EWR BDL 25.5
## 7 EWR BNA 115.
## 8 EWR BOS 40.3
## 9 EWR BQN 196.
## 10 EWR BTV 46.3
## # … with 214 more rows
dat %>%
expand(origin, dest) %>%
left_join(dat)
## Joining, by = c("origin", "dest")
## # A tibble: 315 x 3
## origin dest air_time
## <chr> <chr> <dbl>
## 1 EWR ABQ NA
## 2 EWR ACK NA
## 3 EWR ALB 31.8
## 4 EWR ANC 413.
## 5 EWR ATL 112.
## 6 EWR AUS 211.
## 7 EWR AVL 89.8
## 8 EWR BDL 25.5
## 9 EWR BGR NA
## 10 EWR BHM NA
## # … with 305 more rows
dat %>%
complete(origin, dest, fill = list("air_time" = Inf))
## # A tibble: 315 x 3
## origin dest air_time
## <chr> <chr> <dbl>
## 1 EWR ABQ Inf
## 2 EWR ACK Inf
## 3 EWR ALB 31.8
## 4 EWR ANC 413.
## 5 EWR ATL 112.
## 6 EWR AUS 211.
## 7 EWR AVL 89.8
## 8 EWR BDL 25.5
## 9 EWR BGR Inf
## 10 EWR BHM Inf
## # … with 305 more rows
df <- tribble(
~x, ~y,
1, 2,
NA, 3,
3, NA
)
df
## # A tibble: 3 x 2
## x y
## <dbl> <dbl>
## 1 1 2
## 2 NA 3
## 3 3 NA
# remove all rows with missing values
df %>% drop_na()
## # A tibble: 1 x 2
## x y
## <dbl> <dbl>
## 1 1 2
# fill values by copying downward
df %>% fill(x, y)
## # A tibble: 3 x 2
## x y
## <dbl> <dbl>
## 1 1 2
## 2 1 3
## 3 3 3
df %>%
replace_na(list(x = Inf, y = -Inf))
## # A tibble: 3 x 2
## x y
## <dbl> <dbl>
## 1 1 2
## 2 Inf 3
## 3 3 -Inf
Using the datasets from R for Data Science to show that the same data could be organized in different ways.
In general, there are two formats,
relig_income
## # A tibble: 18 x 11
## religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Agnostic 27 34 60 81 76 137 122
## 2 Atheist 12 27 37 52 35 70 73
## 3 Buddhist 27 21 30 34 33 58 62
## 4 Catholic 418 617 732 670 638 1116 949
## 5 Don’t k… 15 14 15 11 10 35 21
## 6 Evangel… 575 869 1064 982 881 1486 949
## 7 Hindu 1 9 7 9 11 34 47
## 8 Histori… 228 244 236 238 197 223 131
## 9 Jehovah… 20 27 24 24 21 30 15
## 10 Jewish 19 19 25 25 30 95 69
## 11 Mainlin… 289 495 619 655 651 1107 939
## 12 Mormon 29 40 48 51 56 112 85
## 13 Muslim 6 7 9 10 9 23 16
## 14 Orthodox 13 17 23 32 32 47 38
## 15 Other C… 9 7 11 13 13 14 18
## 16 Other F… 20 33 40 46 49 63 46
## 17 Other W… 5 2 3 4 2 7 3
## 18 Unaffil… 217 299 374 365 341 528 407
## # … with 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>, `Don't
## # know/refused` <dbl>
# make sure you have tidyr 1.0
relig_income %>%
pivot_longer(-religion, names_to = "income", values_to = "count")
## # A tibble: 180 x 3
## religion income count
## <chr> <chr> <dbl>
## 1 Agnostic <$10k 27
## 2 Agnostic $10-20k 34
## 3 Agnostic $20-30k 60
## 4 Agnostic $30-40k 81
## 5 Agnostic $40-50k 76
## 6 Agnostic $50-75k 137
## 7 Agnostic $75-100k 122
## 8 Agnostic $100-150k 109
## 9 Agnostic >150k 84
## 10 Agnostic Don't know/refused 96
## # … with 170 more rows
billboard
## # A tibble: 317 x 79
## artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
## 2 2Ge+h… The … 2000-09-02 91 87 92 NA NA NA NA NA
## 3 3 Doo… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
## 4 3 Doo… Loser 2000-10-21 76 76 72 69 67 65 55 59
## 5 504 B… Wobb… 2000-04-15 57 34 25 17 17 31 36 49
## 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
## 7 A*Tee… Danc… 2000-07-08 97 97 96 95 100 NA NA NA
## 8 Aaliy… I Do… 2000-01-29 84 62 51 41 38 35 35 38
## 9 Aaliy… Try … 2000-03-18 59 53 38 28 21 18 16 14
## 10 Adams… Open… 2000-08-26 76 76 74 69 68 67 61 58
## # … with 307 more rows, and 68 more variables: wk9 <dbl>, wk10 <dbl>,
## # wk11 <dbl>, wk12 <dbl>, wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>,
## # wk17 <dbl>, wk18 <dbl>, wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>,
## # wk23 <dbl>, wk24 <dbl>, wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>,
## # wk29 <dbl>, wk30 <dbl>, wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>,
## # wk35 <dbl>, wk36 <dbl>, wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>,
## # wk41 <dbl>, wk42 <dbl>, wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>,
## # wk47 <dbl>, wk48 <dbl>, wk49 <dbl>, wk50 <dbl>, wk51 <dbl>, wk52 <dbl>,
## # wk53 <dbl>, wk54 <dbl>, wk55 <dbl>, wk56 <dbl>, wk57 <dbl>, wk58 <dbl>,
## # wk59 <dbl>, wk60 <dbl>, wk61 <dbl>, wk62 <dbl>, wk63 <dbl>, wk64 <dbl>,
## # wk65 <dbl>, wk66 <lgl>, wk67 <lgl>, wk68 <lgl>, wk69 <lgl>, wk70 <lgl>,
## # wk71 <lgl>, wk72 <lgl>, wk73 <lgl>, wk74 <lgl>, wk75 <lgl>, wk76 <lgl>
billboard %>%
pivot_longer(
starts_with("wk"),
names_to = "week",
names_prefix = "wk",
values_to = "rank",
values_drop_na = TRUE
)
## # A tibble: 5,307 x 5
## artist track date.entered week rank
## <chr> <chr> <date> <chr> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 2 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 3 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 4 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 5 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 6 94
## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 7 99
## 8 2Ge+her The Hardest Part Of ... 2000-09-02 1 91
## 9 2Ge+her The Hardest Part Of ... 2000-09-02 2 87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02 3 92
## # … with 5,297 more rows
fish_encounters
## # A tibble: 114 x 3
## fish station seen
## <fct> <fct> <int>
## 1 4842 Release 1
## 2 4842 I80_1 1
## 3 4842 Lisbon 1
## 4 4842 Rstr 1
## 5 4842 Base_TD 1
## 6 4842 BCE 1
## 7 4842 BCW 1
## 8 4842 BCE2 1
## 9 4842 BCW2 1
## 10 4842 MAE 1
## # … with 104 more rows
fish_encounters %>% pivot_wider(
names_from = station,
values_from = seen,
values_fill = list(seen = 0)
)
## # A tibble: 19 x 12
## fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE MAW
## <fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 4842 1 1 1 1 1 1 1 1 1 1 1
## 2 4843 1 1 1 1 1 1 1 1 1 1 1
## 3 4844 1 1 1 1 1 1 1 1 1 1 1
## 4 4845 1 1 1 1 1 0 0 0 0 0 0
## 5 4847 1 1 1 0 0 0 0 0 0 0 0
## 6 4848 1 1 1 1 0 0 0 0 0 0 0
## 7 4849 1 1 0 0 0 0 0 0 0 0 0
## 8 4850 1 1 0 1 1 1 1 0 0 0 0
## 9 4851 1 1 0 0 0 0 0 0 0 0 0
## 10 4854 1 1 0 0 0 0 0 0 0 0 0
## 11 4855 1 1 1 1 1 0 0 0 0 0 0
## 12 4857 1 1 1 1 1 1 1 1 1 0 0
## 13 4858 1 1 1 1 1 1 1 1 1 1 1
## 14 4859 1 1 1 1 1 0 0 0 0 0 0
## 15 4861 1 1 1 1 1 1 1 1 1 1 1
## 16 4862 1 1 1 1 1 1 1 1 1 0 0
## 17 4863 1 1 0 0 0 0 0 0 0 0 0
## 18 4864 1 1 0 0 0 0 0 0 0 0 0
## 19 4865 1 1 1 0 0 0 0 0 0 0 0
us_rent_income
## # A tibble: 104 x 5
## GEOID NAME variable estimate moe
## <chr> <chr> <chr> <dbl> <dbl>
## 1 01 Alabama income 24476 136
## 2 01 Alabama rent 747 3
## 3 02 Alaska income 32940 508
## 4 02 Alaska rent 1200 13
## 5 04 Arizona income 27517 148
## 6 04 Arizona rent 972 4
## 7 05 Arkansas income 23789 165
## 8 05 Arkansas rent 709 5
## 9 06 California income 29454 109
## 10 06 California rent 1358 3
## # … with 94 more rows
us_rent_income %>%
pivot_wider(names_from = variable, values_from = c(estimate, moe))
## # A tibble: 52 x 6
## GEOID NAME estimate_income estimate_rent moe_income moe_rent
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 01 Alabama 24476 747 136 3
## 2 02 Alaska 32940 1200 508 13
## 3 04 Arizona 27517 972 148 4
## 4 05 Arkansas 23789 709 165 5
## 5 06 California 29454 1358 109 3
## 6 08 Colorado 32401 1125 109 5
## 7 09 Connecticut 35326 1123 195 5
## 8 10 Delaware 31560 1076 247 10
## 9 11 District of Columbia 43198 1424 681 17
## 10 12 Florida 25952 1077 70 3
## # … with 42 more rows