dplyr
basicsselect
: picks variables based on their names.
filter
: picks cases based on their valuesmutate
: adds new variables that are functions of existing variables
arrange
: changes the ordering of the rowssummarize
and group_by
operations
Quote from the book “R for Data Science”, the author said
R is an old language, and some things that were useful 10 or 20 years ago now get in your way. It’s difficult to change base R without breaking existing code, so most innovation occurs in packages.
tidyverse
style usually is more elegantSo what is tidyverse?
ggplot2
, for data visualisationdplyr
, for data manipulationtidyr
, for data tidyingreadr
, for data importpurrr
, for functional programmingtibble
, for tibbles, a modern re-imagining of data framesstringr
, for stringsforcats
, for factorsWe are going to detour a bit to talk about pipe operator. Pipe operator is a powerful way to make your code much cleaner and readible by
The operators pipe their left-hand side values forward into expressions that appear on the right-hand side, i.e. one can replace f(x)
with x %>% f()
, where %>%
is the (main) pipe-operator. When coupling several function calls with the pipe-operator, the benefit will become more apparent. Consider this pseudo example:
raw_data <- read.csv("/path/to/data/file.csv")
sub_data <- subset(raw_data, variable_a > x)
new_data <- transform(sub_data, variable_c = variable_a / variable_b)
the_data <- head(new_data, 100)
However, the local variables are really not necessary, so one liner could be
the_data <- head(
transform(
subset(
read.csv("/path/to/data/file.csv"),
variable_a > x
),
variable_c = variable_a / variable_b
),
100
)
But that is too hard to read and reason. If we use pipe operator,
the_data <- read.csv("/path/to/data/file.csv") %>%
subset(variable_a > x) %>%
transform(variable_c = variable_a / variable_b) %>%
head(100)
Hint: In RStudio, you could use Ctrl + Shift + M
(or Cmd + Shift + M
) to insert the pipe operator.
x %>% f
or x %>% f()
is equivalent to f(x)
x %>% f(y)
is equivalent to f(x, y)
x %>% f %>% g %>% h
is equivalent to h(g(f(x)))
sum(c(1, 2))
## [1] 3
c(1, 2) %>% sum()
## [1] 3
choose(5, 3)
## [1] 10
5 %>% choose(3)
## [1] 10
x %>% f(y, .)
is equivalent to f(y, x)
x %>% f(y, z = .)
is equivalent to f(y, z = x)
3 %>% choose(5, .)
## [1] 10
3 %>% choose(5, k = .)
## [1] 10
It is straightforward to use the placeholder several times in a right-hand side expression. However, when the placeholder only appears in a nested expressions magrittr will still apply the first-argument rule. The reason is that in most cases this results more clean code.
x %>% f(y = nrow(.), z = ncol(.))
is equivalent to f(x, y = nrow(x), z = ncol(x))
a <- 100
5 %>% choose(. - 2)
## [1] 10
choose(5, 5 - 2)
## [1] 10
The behavior can be overruled by enclosing the right-hand side in braces:
4 %>% {choose(5, . - 1)}
## [1] 10
choose(5, 4 - 1)
## [1] 10
x %>% {f(y = nrow(.), z = ncol(.))}
is equivalent to f(y = nrow(x), z = ncol(x))
list(n = 5, k = 3) %>% {
choose(.$n, .$k)
}
## [1] 10
dplyr
basicsselect
: picks variables based on their namesfilter
: picks cases based on their valuesmutate
: adds new variables that are functions of existing variablesarrange
: changes the ordering of the rowssummarize
or summarise
: reduces multiple values down to a single summarygroup_by
which allows you to perform any operation “by group”.First of all, we need some data to work with. If the data is stored in a csv
,
flights <- read_csv("flights.csv")
We are using the tidyverse
function read_csv
to import the flights.csv
instead of the obsolete base function read.csv
. - read_csv
imports data as tibble
which has better and more consistent handling of variables. - read_csv
is often faster than read.csv
- read_csv
handles unicode characters better
There are also read_tsv
and read_delim
for reading tab-seperated or delimited files.
The datasets are actually obtained from the R package nycflights13
# Airline on-time data for all flights departing NYC in 2013.
library(nycflights13)
flights
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
select
: picks variables based on their names.To select arrival and departure times,
# old way to do it
# flights[, c("arr_time", "dep_time")]
flights %>% select(arr_time, dep_time)
## # A tibble: 336,776 x 2
## arr_time dep_time
## <int> <int>
## 1 830 517
## 2 850 533
## 3 923 542
## 4 1004 544
## 5 812 554
## 6 740 554
## 7 913 555
## 8 709 557
## 9 838 557
## 10 753 558
## # … with 336,766 more rows
dplyr
provides a lot of helper functions,
# colon `:` specifies all the variables between the columns of `dep_time` and `arr_time`
flights %>% select(dep_time:arr_time)
## # A tibble: 336,776 x 4
## dep_time sched_dep_time dep_delay arr_time
## <int> <int> <dbl> <int>
## 1 517 515 2 830
## 2 533 529 4 850
## 3 542 540 2 923
## 4 544 545 -1 1004
## 5 554 600 -6 812
## 6 554 558 -4 740
## 7 555 600 -5 913
## 8 557 600 -3 709
## 9 557 600 -3 838
## 10 558 600 -2 753
## # … with 336,766 more rows
# all the columns start with arr_
flights %>% select(starts_with("arr_"))
## # A tibble: 336,776 x 2
## arr_time arr_delay
## <int> <dbl>
## 1 830 11
## 2 850 20
## 3 923 33
## 4 1004 -18
## 5 812 -25
## 6 740 12
## 7 913 19
## 8 709 -14
## 9 838 -8
## 10 753 8
## # … with 336,766 more rows
# all the columns end with _time
flights %>% select(ends_with("_time"))
## # A tibble: 336,776 x 5
## dep_time sched_dep_time arr_time sched_arr_time air_time
## <int> <int> <int> <int> <dbl>
## 1 517 515 830 819 227
## 2 533 529 850 830 227
## 3 542 540 923 850 160
## 4 544 545 1004 1022 183
## 5 554 600 812 837 116
## 6 554 558 740 728 150
## 7 555 600 913 854 158
## 8 557 600 709 723 53
## 9 557 600 838 846 140
## 10 558 600 753 745 138
## # … with 336,766 more rows
# all the columns contain dep
flights %>% select(contains("dep"))
## # A tibble: 336,776 x 3
## dep_time sched_dep_time dep_delay
## <int> <int> <dbl>
## 1 517 515 2
## 2 533 529 4
## 3 542 540 2
## 4 544 545 -1
## 5 554 600 -6
## 6 554 558 -4
## 7 555 600 -5
## 8 557 600 -3
## 9 557 600 -3
## 10 558 600 -2
## # … with 336,766 more rows
# remove particular columns
flights %>% select(-year)
## # A tibble: 336,776 x 18
## month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <dbl> <int> <int>
## 1 1 1 517 515 2 830 819
## 2 1 1 533 529 4 850 830
## 3 1 1 542 540 2 923 850
## 4 1 1 544 545 -1 1004 1022
## 5 1 1 554 600 -6 812 837
## 6 1 1 554 558 -4 740 728
## 7 1 1 555 600 -5 913 854
## 8 1 1 557 600 -3 709 723
## 9 1 1 557 600 -3 838 846
## 10 1 1 558 600 -2 753 745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# all the columns do not contain dep
flights %>% select(-contains("dep"))
## # A tibble: 336,776 x 16
## year month day arr_time sched_arr_time arr_delay carrier flight tailnum
## <int> <int> <int> <int> <int> <dbl> <chr> <int> <chr>
## 1 2013 1 1 830 819 11 UA 1545 N14228
## 2 2013 1 1 850 830 20 UA 1714 N24211
## 3 2013 1 1 923 850 33 AA 1141 N619AA
## 4 2013 1 1 1004 1022 -18 B6 725 N804JB
## 5 2013 1 1 812 837 -25 DL 461 N668DN
## 6 2013 1 1 740 728 12 UA 1696 N39463
## 7 2013 1 1 913 854 19 B6 507 N516JB
## 8 2013 1 1 709 723 -14 EV 5708 N829AS
## 9 2013 1 1 838 846 -8 B6 79 N593JB
## 10 2013 1 1 753 745 8 AA 301 N3ALAA
## # … with 336,766 more rows, and 7 more variables: origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# using regular expression (later of the course)
flights %>% select(matches("^(arr|dep)_"))
## # A tibble: 336,776 x 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # … with 336,766 more rows
Related verbs: - if you just need a single variable, you could use pull
. - you could use rename
to rename columns
# flights$arr_time
flights %>% pull(arr_time)
## [1] 830 850 923 1004 812 740 913 709 838 753 849 853 924 923 941
## [16] 702 854 851 837 844 812 821 858 837 858 807 945 925 1039 833
## [31] 1017 920 933 909 840 1018 1137 1016 824 721 824 740 1028 930 739
## [46] 922 837 931 815 910 1023 936 932 936 1021 1037 1002 854 949 1007
## [61] 948 959 944 1027 1008 1008 907 959 1123 1058 852 1151 1023 911 850
## [76] 1017 1013 1111 1020 1052 959 1041 1049 857 1041 1011 854 1047 918 1104
## [91] 1038 1107 1043 1059 1135 1119 939 1041 1025 955 1056 1039 1103 1053 1057
## [106] 1022 949 900 903 1132 1103 1015 1118 1006 1043 1043 1048 1100 1006 1047
## [121] 1026 1040 1103 1047 1005 1254 940 1249 1153 932 1014 1019 1151 1027 1058
## [136] 1136 1145 1027 1150 1152 1117 1018 1052 1021 1006 1134 1210 1027 1311 1053
## [151] 1138 1001 1155 1032 1102 1215 1046 1147 1143 1226 1222 1140 1516 1107 1124
## [166] 1157 1102 1140 1223 1211 1048 1045 1207 1309 1134 1020 1004 1228 1331 1241
## [181] 1219 1346 1244 1058 1313 1206 1052 1039 1152 1237 1026 1404 1221 1233 1231
## [196] 1028 1220 1218 1237 1121
## [ reached getOption("max.print") -- omitted 336576 entries ]
flights %>% rename(y = year)
## # A tibble: 336,776 x 19
## y month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# of course, we could select everything
flights %>% select(everything())
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# move air_time to the front
flights %>% select(air_time, everything())
## # A tibble: 336,776 x 19
## air_time year month day dep_time sched_dep_time dep_delay arr_time
## <dbl> <int> <int> <int> <int> <int> <dbl> <int>
## 1 227 2013 1 1 517 515 2 830
## 2 227 2013 1 1 533 529 4 850
## 3 160 2013 1 1 542 540 2 923
## 4 183 2013 1 1 544 545 -1 1004
## 5 116 2013 1 1 554 600 -6 812
## 6 150 2013 1 1 554 558 -4 740
## 7 158 2013 1 1 555 600 -5 913
## 8 53 2013 1 1 557 600 -3 709
## 9 140 2013 1 1 557 600 -3 838
## 10 138 2013 1 1 558 600 -2 753
## # … with 336,766 more rows, and 11 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
## # dest <chr>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# a new function `relocate
flights %>% relocate(air_time, .after = day)
## # A tibble: 336,776 x 19
## year month day air_time dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <dbl> <int> <int> <dbl> <int>
## 1 2013 1 1 227 517 515 2 830
## 2 2013 1 1 227 533 529 4 850
## 3 2013 1 1 160 542 540 2 923
## 4 2013 1 1 183 544 545 -1 1004
## 5 2013 1 1 116 554 600 -6 812
## 6 2013 1 1 150 554 558 -4 740
## 7 2013 1 1 158 555 600 -5 913
## 8 2013 1 1 53 557 600 -3 709
## 9 2013 1 1 140 557 600 -3 838
## 10 2013 1 1 138 558 600 -2 753
## # … with 336,766 more rows, and 11 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
## # dest <chr>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
filter
: picks cases based on their values# flights[flights$origin == "JFK", ]
# subset(flights, origin == "JFK")
flights %>% filter(origin == "JFK")
## # A tibble: 111,279 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 542 540 2 923 850
## 2 2013 1 1 544 545 -1 1004 1022
## 3 2013 1 1 557 600 -3 838 846
## 4 2013 1 1 558 600 -2 849 851
## 5 2013 1 1 558 600 -2 853 856
## 6 2013 1 1 558 600 -2 924 917
## 7 2013 1 1 559 559 0 702 706
## 8 2013 1 1 606 610 -4 837 845
## 9 2013 1 1 611 600 11 945 931
## 10 2013 1 1 613 610 3 925 921
## # … with 111,269 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% filter(distance > 1000)
## # A tibble: 147,105 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 555 600 -5 913 854
## 6 2013 1 1 558 600 -2 849 851
## 7 2013 1 1 558 600 -2 853 856
## 8 2013 1 1 558 600 -2 924 917
## 9 2013 1 1 558 600 -2 923 937
## 10 2013 1 1 559 600 -1 941 910
## # … with 147,095 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# note that we are using a single `&` instead of `&&` as in base R
flights %>% filter(origin == "JFK" & distance > 1000)
## # A tibble: 62,071 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 542 540 2 923 850
## 2 2013 1 1 544 545 -1 1004 1022
## 3 2013 1 1 558 600 -2 849 851
## 4 2013 1 1 558 600 -2 853 856
## 5 2013 1 1 558 600 -2 924 917
## 6 2013 1 1 611 600 11 945 931
## 7 2013 1 1 613 610 3 925 921
## 8 2013 1 1 615 615 0 1039 1100
## 9 2013 1 1 627 630 -3 1018 1018
## 10 2013 1 1 628 630 -2 1137 1140
## # … with 62,061 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% filter(origin == "JFK", distance > 1000)
## # A tibble: 62,071 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 542 540 2 923 850
## 2 2013 1 1 544 545 -1 1004 1022
## 3 2013 1 1 558 600 -2 849 851
## 4 2013 1 1 558 600 -2 853 856
## 5 2013 1 1 558 600 -2 924 917
## 6 2013 1 1 611 600 11 945 931
## 7 2013 1 1 613 610 3 925 921
## 8 2013 1 1 615 615 0 1039 1100
## 9 2013 1 1 627 630 -3 1018 1018
## 10 2013 1 1 628 630 -2 1137 1140
## # … with 62,061 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% filter(distance < 500 | distance > 1000)
## # A tibble: 227,322 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 555 600 -5 913 854
## 6 2013 1 1 557 600 -3 709 723
## 7 2013 1 1 558 600 -2 849 851
## 8 2013 1 1 558 600 -2 853 856
## 9 2013 1 1 558 600 -2 924 917
## 10 2013 1 1 558 600 -2 923 937
## # … with 227,312 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% filter(!between(distance, 500, 1000))
## # A tibble: 227,322 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 555 600 -5 913 854
## 6 2013 1 1 557 600 -3 709 723
## 7 2013 1 1 558 600 -2 849 851
## 8 2013 1 1 558 600 -2 853 856
## 9 2013 1 1 558 600 -2 924 917
## 10 2013 1 1 558 600 -2 923 937
## # … with 227,312 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# only keep the complete cases
flights %>% filter(complete.cases(.)) # filter(flights, complete.cases(flights))
## # A tibble: 327,346 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 327,336 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% drop_na() # same, but more efficient
## # A tibble: 327,346 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 327,336 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Very often, we will need to use multiple dplyr
verbs, for example
# the pipe operator %>% increases readability
flights %>%
select(origin, air_time) %>%
filter(origin == "JFK") %>%
filter(air_time < 500) %>%
rename(airtime = air_time)
## # A tibble: 108,737 x 2
## origin airtime
## <chr> <dbl>
## 1 JFK 160
## 2 JFK 183
## 3 JFK 140
## 4 JFK 149
## 5 JFK 158
## 6 JFK 345
## 7 JFK 44
## 8 JFK 128
## 9 JFK 366
## 10 JFK 175
## # … with 108,727 more rows
# slightly more efficient
flights %>%
select(origin, airtime = air_time) %>%
filter(origin == "JFK", airtime < 500)
## # A tibble: 108,737 x 2
## origin airtime
## <chr> <dbl>
## 1 JFK 160
## 2 JFK 183
## 3 JFK 140
## 4 JFK 149
## 5 JFK 158
## 6 JFK 345
## 7 JFK 44
## 8 JFK 128
## 9 JFK 366
## 10 JFK 175
## # … with 108,727 more rows
# a few more examples
flights %>%
select(origin, air_time) %>%
filter(origin == "JFK", air_time < mean(air_time, na.rm = TRUE))
## # A tibble: 55,521 x 2
## origin air_time
## <chr> <dbl>
## 1 JFK 140
## 2 JFK 149
## 3 JFK 44
## 4 JFK 128
## 5 JFK 41
## 6 JFK 63
## 7 JFK 142
## 8 JFK 147
## 9 JFK 64
## 10 JFK 54
## # … with 55,511 more rows
# maybe we want to calculate mean_air_time first
mean_air_time <- flights %>%
pull(air_time) %>%
mean(na.rm = TRUE)
mean_air_time
## [1] 150.6865
flights %>%
select(origin, air_time) %>%
filter(origin == "JFK", air_time > mean_air_time)
## # A tibble: 53,558 x 2
## origin air_time
## <chr> <dbl>
## 1 JFK 160
## 2 JFK 183
## 3 JFK 158
## 4 JFK 345
## 5 JFK 366
## 6 JFK 175
## 7 JFK 182
## 8 JFK 330
## 9 JFK 192
## 10 JFK 323
## # … with 53,548 more rows
# what if there is a name collision?
air_time <- flights %>%
pull(air_time) %>%
mean(na.rm = TRUE)
flights %>%
select(origin, air_time) %>%
filter(origin == "JFK", air_time > !!air_time)
## # A tibble: 53,558 x 2
## origin air_time
## <chr> <dbl>
## 1 JFK 160
## 2 JFK 183
## 3 JFK 158
## 4 JFK 345
## 5 JFK 366
## 6 JFK 175
## 7 JFK 182
## 8 JFK 330
## 9 JFK 192
## 10 JFK 323
## # … with 53,548 more rows
Remarks: use slice
, head
, tail
if you want particular rows
flights %>% head(4)
## # A tibble: 4 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% slice(100:105)
## # A tibble: 6 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 752 759 -7 955 959
## 2 2013 1 1 753 755 -2 1056 1110
## 3 2013 1 1 754 759 -5 1039 1041
## 4 2013 1 1 754 755 -1 1103 1030
## 5 2013 1 1 758 800 -2 1053 1054
## 6 2013 1 1 759 800 -1 1057 1127
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% tail(5)
## # A tibble: 5 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 9 30 NA 1455 NA NA 1634
## 2 2013 9 30 NA 2200 NA NA 2312
## 3 2013 9 30 NA 1210 NA NA 1330
## 4 2013 9 30 NA 1159 NA NA 1344
## 5 2013 9 30 NA 840 NA NA 1020
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% slice((n() - 4):n())
## # A tibble: 5 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 9 30 NA 1455 NA NA 1634
## 2 2013 9 30 NA 2200 NA NA 2312
## 3 2013 9 30 NA 1210 NA NA 1330
## 4 2013 9 30 NA 1159 NA NA 1344
## 5 2013 9 30 NA 840 NA NA 1020
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
mutate
: adds new variables that are functions of existing variablesflights %>% mutate(
gain = dep_delay - arr_delay,
speed = distance / air_time * 60
)
## # A tibble: 336,776 x 21
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 336,766 more rows, and 13 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## # gain <dbl>, speed <dbl>
# we could refer to the columns just created
flights %>% mutate(
gain = dep_delay - arr_delay,
gain_per_hour = gain / (air_time / 60)
)
## # A tibble: 336,776 x 21
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 336,766 more rows, and 13 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## # gain <dbl>, gain_per_hour <dbl>
# remove a variable
flights %>% mutate(
gain = dep_delay - arr_delay,
gain_per_hour = gain / (air_time / 60),
air_time = NULL,
)
## # A tibble: 336,776 x 20
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 336,766 more rows, and 12 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, gain <dbl>,
## # gain_per_hour <dbl>
# `transmute` only keeps the new variables
flights %>% transmute(
gain = dep_delay - arr_delay,
gain_per_hour = gain / (air_time / 60)
)
## # A tibble: 336,776 x 2
## gain gain_per_hour
## <dbl> <dbl>
## 1 -9 -2.38
## 2 -16 -4.23
## 3 -31 -11.6
## 4 17 5.57
## 5 19 9.83
## 6 -16 -6.4
## 7 -24 -9.11
## 8 11 12.5
## 9 5 2.14
## 10 -10 -4.35
## # … with 336,766 more rows
row_number
: equivalent to rank(ties.method = "first")
min_rank
: equivalent to rank(ties.method = "min")
dense_rank
: like min_rank()
, but with no gaps between rankspercent_rank
: a number between 0 and 1 computed by rescaling min_rank to [0, 1]cume_dist
: a cumulative distribution function. Proportion of all values less than or equal to the current rank.ntile
: a rough rank, which breaks the input vector into n bucketssome_data <- tibble(
x = c(3, 4, 1, 3, 1)
)
some_data %>%
mutate(
row_number(),
row_number(x),
min_rank(x),
dense_rank(x),
percent_rank(x),
cume_dist(x),
ntile(x, 3)
)
## # A tibble: 5 x 8
## x `row_number()` `row_number(x)` `min_rank(x)` `dense_rank(x)`
## <dbl> <int> <int> <int> <int>
## 1 3 1 3 3 2
## 2 4 2 5 5 3
## 3 1 3 1 1 1
## 4 3 4 4 3 2
## 5 1 5 2 1 1
## # … with 3 more variables: `percent_rank(x)` <dbl>, `cume_dist(x)` <dbl>,
## # `ntile(x, 3)` <int>
lead
and lag
some_data2 <- tibble(
time = 1:5,
value = c(3, 4, 1, 3, 1)
)
some_data2 %>% mutate(
lag(value),
lead(value)
)
## # A tibble: 5 x 4
## time value `lag(value)` `lead(value)`
## <int> <dbl> <dbl> <dbl>
## 1 1 3 NA 4
## 2 2 4 3 1
## 3 3 1 4 3
## 4 4 3 1 1
## 5 5 1 3 NA
some_data2 %>% mutate(
diff1 = value - lag(value),
diff2 = lead(value) - value
)
## # A tibble: 5 x 4
## time value diff1 diff2
## <int> <dbl> <dbl> <dbl>
## 1 1 3 NA 1
## 2 2 4 1 -3
## 3 3 1 -3 2
## 4 4 3 2 -2
## 5 5 1 -2 NA
Remark: the package zoo
has functions to calculate rolling means (moving averages)
flights %>% transmute(
arr_delay,
status = if_else(arr_delay > 0, "delayed", "on time")
)
## # A tibble: 336,776 x 2
## arr_delay status
## <dbl> <chr>
## 1 11 delayed
## 2 20 delayed
## 3 33 delayed
## 4 -18 on time
## 5 -25 on time
## 6 12 delayed
## 7 19 delayed
## 8 -14 on time
## 9 -8 on time
## 10 8 delayed
## # … with 336,766 more rows
(
flight_distances <- flights %>%
transmute(
distance,
distance_type = case_when(
distance < 500 ~ "short",
distance < 1000 ~ "mid",
TRUE ~ "long"
)
)
)
## # A tibble: 336,776 x 2
## distance distance_type
## <dbl> <chr>
## 1 1400 long
## 2 1416 long
## 3 1089 long
## 4 1576 long
## 5 762 mid
## 6 719 mid
## 7 1065 long
## 8 229 short
## 9 944 mid
## 10 733 mid
## # … with 336,766 more rows
recode
valuesflight_distances %>% mutate(
distance_type = recode(distance_type,
"long" = "long-distance",
"mid" = "mid-distance",
"short" = "short-distance"
))
## # A tibble: 336,776 x 2
## distance distance_type
## <dbl> <chr>
## 1 1400 long-distance
## 2 1416 long-distance
## 3 1089 long-distance
## 4 1576 long-distance
## 5 762 mid-distance
## 6 719 mid-distance
## 7 1065 long-distance
## 8 229 short-distance
## 9 944 mid-distance
## 10 733 mid-distance
## # … with 336,766 more rows
some_data %>% mutate(
y = recode(x,
`1` = 10,
`3` = 30
)
)
## # A tibble: 5 x 2
## x y
## <dbl> <dbl>
## 1 3 30
## 2 4 4
## 3 1 10
## 4 3 30
## 5 1 10
arrange
: changes the ordering of the rowsflights %>% arrange(year, month, day)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% arrange(desc(dep_delay))
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 9 641 900 1301 1242 1530
## 2 2013 6 15 1432 1935 1137 1607 2120
## 3 2013 1 10 1121 1635 1126 1239 1810
## 4 2013 9 20 1139 1845 1014 1457 2210
## 5 2013 7 22 845 1600 1005 1044 1815
## 6 2013 4 10 1100 1900 960 1342 2211
## 7 2013 3 17 2321 810 911 135 1020
## 8 2013 6 27 959 1900 899 1236 2226
## 9 2013 7 22 2257 759 898 121 1026
## 10 2013 12 5 756 1700 896 1058 2020
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
summarize
and group_by
operationsflights %>% summarize(
air_time = mean(air_time, na.rm = TRUE)
)
## # A tibble: 1 x 1
## air_time
## <dbl>
## 1 151.
# mean(flights$air_time, na.rm = TRUE)
flights %>%
group_by(tailnum) %>%
summarize(n = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4,044 x 2
## tailnum n
## <chr> <int>
## 1 D942DN 4
## 2 N0EGMQ 371
## 3 N10156 153
## 4 N102UW 48
## 5 N103US 46
## 6 N104UW 47
## 7 N10575 289
## 8 N105UW 45
## 9 N107US 41
## 10 N108UW 60
## # … with 4,034 more rows
flights %>%
group_by(tailnum) %>%
tally() # shorthand
## # A tibble: 4,044 x 2
## tailnum n
## <chr> <int>
## 1 D942DN 4
## 2 N0EGMQ 371
## 3 N10156 153
## 4 N102UW 48
## 5 N103US 46
## 6 N104UW 47
## 7 N10575 289
## 8 N105UW 45
## 9 N107US 41
## 10 N108UW 60
## # … with 4,034 more rows
flights %>% count(tailnum) # another shorthand
## # A tibble: 4,044 x 2
## tailnum n
## <chr> <int>
## 1 D942DN 4
## 2 N0EGMQ 371
## 3 N10156 153
## 4 N102UW 48
## 5 N103US 46
## 6 N104UW 47
## 7 N10575 289
## 8 N105UW 45
## 9 N107US 41
## 10 N108UW 60
## # … with 4,034 more rows
flights %>%
group_by(tailnum) %>%
summarize(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4,044 x 4
## tailnum count dist delay
## <chr> <int> <dbl> <dbl>
## 1 D942DN 4 854. 31.5
## 2 N0EGMQ 371 676. 9.98
## 3 N10156 153 758. 12.7
## 4 N102UW 48 536. 2.94
## 5 N103US 46 535. -6.93
## 6 N104UW 47 535. 1.80
## 7 N10575 289 520. 20.7
## 8 N105UW 45 525. -0.267
## 9 N107US 41 529. -5.73
## 10 N108UW 60 534. -1.25
## # … with 4,034 more rows
flights %>%
group_by(dest) %>%
summarize(
planes = n_distinct(tailnum),
flights = n()
)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 105 x 3
## dest planes flights
## <chr> <int> <int>
## 1 ABQ 108 254
## 2 ACK 58 265
## 3 ALB 172 439
## 4 ANC 6 8
## 5 ATL 1180 17215
## 6 AUS 993 2439
## 7 AVL 159 275
## 8 BDL 186 443
## 9 BGR 46 375
## 10 BHM 45 297
## # … with 95 more rows
# group by multiple variables
(per_day <- flights %>%
group_by(year, month, day) %>%
summarize(flights = n()))
## `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
## # A tibble: 365 x 4
## # Groups: year, month [12]
## year month day flights
## <int> <int> <int> <int>
## 1 2013 1 1 842
## 2 2013 1 2 943
## 3 2013 1 3 914
## 4 2013 1 4 915
## 5 2013 1 5 720
## 6 2013 1 6 832
## 7 2013 1 7 933
## 8 2013 1 8 899
## 9 2013 1 9 902
## 10 2013 1 10 932
## # … with 355 more rows
(per_month <- per_day %>%
summarize(flights = sum(flights)))
## `summarise()` regrouping output by 'year' (override with `.groups` argument)
## # A tibble: 12 x 3
## # Groups: year [1]
## year month flights
## <int> <int> <int>
## 1 2013 1 27004
## 2 2013 2 24951
## 3 2013 3 28834
## 4 2013 4 28330
## 5 2013 5 28796
## 6 2013 6 28243
## 7 2013 7 29425
## 8 2013 8 29327
## 9 2013 9 27574
## 10 2013 10 28889
## 11 2013 11 27268
## 12 2013 12 28135
(per_year <- per_month %>%
summarize(flights = sum(flights)))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 1 x 2
## year flights
## <int> <int>
## 1 2013 336776
across
columnsSuppose, we want to compute the means for the all numeric columns of flights
. Naively, we could do
flights %>%
group_by(origin) %>%
summarize(
dep_time = mean(dep_time, na.rm = TRUE),
sched_dep_time = mean(sched_dep_time, na.rm = TRUE),
dep_delay = mean(dep_delay, na.rm = TRUE),
arr_time = mean(arr_time, na.rm = TRUE),
sched_arr_time = mean(sched_arr_time, na.rm = TRUE),
arr_delay = mean(arr_delay, na.rm = TRUE),
flight = mean(flight, na.rm = TRUE),
air_time = mean(air_time, na.rm = TRUE),
distance = mean(distance, na.rm = TRUE)
)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 10
## origin dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 1337. 1322. 15.1 1492. 1528. 9.11
## 2 JFK 1399. 1402. 12.1 1520. 1565. 5.55
## 3 LGA 1310. 1308. 10.3 1494. 1516. 5.78
## # … with 3 more variables: flight <dbl>, air_time <dbl>, distance <dbl>
With the new across
function, it is as easy as:
flights %>%
group_by(origin) %>%
summarize(across(dep_time:distance, mean, na.rm = TRUE))
## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA
## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA
## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA
## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA
## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA
## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA
## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA
## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA
## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 13
## origin dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 1337. 1322. 15.1 1492. 1528. 9.11
## 2 JFK 1399. 1402. 12.1 1520. 1565. 5.55
## 3 LGA 1310. 1308. 10.3 1494. 1516. 5.78
## # … with 6 more variables: carrier <dbl>, flight <dbl>, tailnum <dbl>,
## # dest <dbl>, air_time <dbl>, distance <dbl>
We have accidentally included some non numeric columns.
flights %>%
group_by(origin) %>%
summarize(
across(dep_time:distance & is.numeric, mean, na.rm = TRUE)
)
## Warning: Predicate functions must be wrapped in `where()`.
##
## # Bad
## data %>% select(is.numeric)
##
## # Good
## data %>% select(where(is.numeric))
##
## ℹ Please update your code.
## This message is displayed once per session.
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 10
## origin dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 1337. 1322. 15.1 1492. 1528. 9.11
## 2 JFK 1399. 1402. 12.1 1520. 1565. 5.55
## 3 LGA 1310. 1308. 10.3 1494. 1516. 5.78
## # … with 3 more variables: flight <dbl>, air_time <dbl>, distance <dbl>
flights %>%
group_by(origin) %>%
summarize(
across(-c(year, month, day) & is.numeric, mean, na.rm = TRUE)
)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 12
## origin dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 1337. 1322. 15.1 1492. 1528. 9.11
## 2 JFK 1399. 1402. 12.1 1520. 1565. 5.55
## 3 LGA 1310. 1308. 10.3 1494. 1516. 5.78
## # … with 5 more variables: flight <dbl>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>
We could also use predicate functions in across
and use different functions to summerize different variables.
flights %>%
group_by(origin) %>%
summarize(
across(ends_with("time"), mean, na.rm = TRUE),
across(ends_with("delay"), mean, na.rm = TRUE, trim = 0.2),
)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 8
## origin dep_time sched_dep_time arr_time sched_arr_time air_time dep_delay
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 1337. 1322. 1492. 1528. 153. 2.10
## 2 JFK 1399. 1402. 1520. 1565. 178. 0.298
## 3 LGA 1310. 1308. 1494. 1516. 118. -1.51
## # … with 1 more variable: arr_delay <dbl>
flights %>% glimpse()
## Rows: 336,776
## Columns: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558,…
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600,…
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -…
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851…
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -…
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", …
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, …
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39…
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA"…
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD"…
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, …
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733,…
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, …
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, …
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 …
flights %>%
slice_sample(n = 1000) %>%
ggplot(aes(arr_delay, dep_delay)) + geom_point(na.rm = TRUE)
## rows with largest values of air_time with the original order preserved
flights %>%
slice_max(air_time, n = 3)
## # A tibble: 4 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 3 17 1337 1335 2 1937 1836
## 2 2013 2 6 853 900 -7 1542 1540
## 3 2013 3 15 1001 1000 1 1551 1530
## 4 2013 3 17 1006 1000 6 1607 1530
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
# two disadvantages: order changes and does not handle ties
flights %>%
arrange(desc(air_time)) %>%
head(3)
## # A tibble: 3 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 3 17 1337 1335 2 1937 1836
## 2 2013 2 6 853 900 -7 1542 1540
## 3 2013 3 15 1001 1000 1 1551 1530
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Similary, there is slice_min
.
df1 <- tibble(a = "hello", b = 2, c = 3)
df2 <- tibble(a = "world", b = 4)
df1 %>% bind_rows(df2)
## # A tibble: 2 x 3
## a b c
## <chr> <dbl> <dbl>
## 1 hello 2 3
## 2 world 4 NA
bind_rows(df1, df2)
## # A tibble: 2 x 3
## a b c
## <chr> <dbl> <dbl>
## 1 hello 2 3
## 2 world 4 NA
The bind_rows
function from dplyr
is better than rbind
in several ways.
bind_rows
handles missing valuesbind_rows
doesn’t automatically convert typesbind_rows
supports splice operatordf1 <- tibble(a = "hello", b = 2, c = 3)
df2 <- tibble(a = "world", b = 4)
df1 %>% rbind(df2)
## Error in rbind(deparse.level, ...): numbers of columns of arguments do not match
df1 <- tibble(a = "hello", b = 2)
df2 <- tibble(a = 2, b = 4)
df1 %>% rbind(df2)
## # A tibble: 2 x 2
## a b
## <chr> <dbl>
## 1 hello 2
## 2 2 4
df1 %>% bind_rows(df2)
## Error: Can't combine `..1$a` <character> and `..2$a` <double>.
df1 <- tibble(a = "hello", b = 2, c = 3)
df2 <- tibble(a = "world", b = 4)
dfs <- list(df1, df2)
bind_rows(!!!dfs)
## # A tibble: 2 x 3
## a b c
## <chr> <dbl> <dbl>
## 1 hello 2 3
## 2 world 4 NA
Likewise, bind_cols
is safer than cbind
.
df1 <- tibble(x = 1:5)
df2 <- tibble(y = 1:10)
df1 %>% cbind(df2)
## x y
## 1 1 1
## 2 2 2
## 3 3 3
## 4 4 4
## 5 5 5
## 6 1 6
## 7 2 7
## 8 3 8
## 9 4 9
## 10 5 10
df1 %>% bind_cols(df2)
## Error: Can't recycle `..1` (size 5) to match `..2` (size 10).
There are 6 types of joins.
inner_join(x, y)
only includes observations that match in both x and yleft_join(x, y)
includes all observations in x, regardless of whether they match or not.right_join(x, y)
equivalent to left_join(y, x)full_join(x, y)
(outer join) includes all observations from x and ysemi_join(x, y)
only keep observations from x if there is a match in yanti_join(x, y)
remove observations from x if there is a match in y(df1 <- tibble(id = c(1, 2), v = 2:1))
## # A tibble: 2 x 2
## id v
## <dbl> <int>
## 1 1 2
## 2 2 1
(df2 <- tibble(id = c(1, 3), a = 10, b = "a"))
## # A tibble: 2 x 3
## id a b
## <dbl> <dbl> <chr>
## 1 1 10 a
## 2 3 10 a
df1 %>% inner_join(df2, by = "id")
## # A tibble: 1 x 4
## id v a b
## <dbl> <int> <dbl> <chr>
## 1 1 2 10 a
df1 %>% left_join(df2, by = "id")
## # A tibble: 2 x 4
## id v a b
## <dbl> <int> <dbl> <chr>
## 1 1 2 10 a
## 2 2 1 NA <NA>
df1 %>% right_join(df2, by = "id")
## # A tibble: 2 x 4
## id v a b
## <dbl> <int> <dbl> <chr>
## 1 1 2 10 a
## 2 3 NA 10 a
# outer join
df1 %>% full_join(df2, by = "id")
## # A tibble: 3 x 4
## id v a b
## <dbl> <int> <dbl> <chr>
## 1 1 2 10 a
## 2 2 1 NA <NA>
## 3 3 NA 10 a
df1 %>% semi_join(df2, by = "id")
## # A tibble: 1 x 2
## id v
## <dbl> <int>
## 1 1 2
df1 %>%
anti_join(df2, by = "id")
## # A tibble: 1 x 2
## id v
## <dbl> <int>
## 1 2 1
df1 <- tibble(id = c(1, 2), v = 2:1)
df2 <- tibble(name = c(1, 3), a = 10, b = "a")
df1 %>% inner_join(df2, by = c("id" = "name"))
## # A tibble: 1 x 4
## id v a b
## <dbl> <int> <dbl> <chr>
## 1 1 2 10 a
In some situation, we would like to use the values from one table to update the values of another table.
(df1 <- tibble(
id = 1:3,
b = letters[c(1:2, NA)],
c = 0.5 + 0:2))
## # A tibble: 3 x 3
## id b c
## <int> <chr> <dbl>
## 1 1 a 0.5
## 2 2 b 1.5
## 3 3 <NA> 2.5
(df2 <- tibble(id = 3, b = "z", c = 3.5))
## # A tibble: 1 x 3
## id b c
## <dbl> <chr> <dbl>
## 1 3 z 3.5
Outer join doesn’t give the intended result
df1 %>%
full_join(df2, by = "id")
## # A tibble: 3 x 5
## id b.x c.x b.y c.y
## <dbl> <chr> <dbl> <chr> <dbl>
## 1 1 a 0.5 <NA> NA
## 2 2 b 1.5 <NA> NA
## 3 3 <NA> 2.5 z 3.5
Use rows_update
to update existing rows
df1 %>%
rows_update(df2, by = "id")
## # A tibble: 3 x 3
## id b c
## <int> <chr> <dbl>
## 1 1 a 0.5
## 2 2 b 1.5
## 3 3 z 3.5
rows_patch
works like rows_update() but only overwrites NA values.
df1 %>%
rows_patch(df2)
## Matching, by = "id"
## # A tibble: 3 x 3
## id b c
## <int> <chr> <dbl>
## 1 1 a 0.5
## 2 2 b 1.5
## 3 3 z 2.5
df3 <- tibble(id = 4, b = "z", c = 3.5)
# error when key doesn't exist
df1 %>%
rows_update(df3, by = "id")
## Error: Attempting to update missing rows.
Inserting rows
df3 <- tibble(id = 4, b = "z")
df1 %>%
rows_insert(df3, by = "id")
## # A tibble: 4 x 3
## id b c
## <dbl> <chr> <dbl>
## 1 1 a 0.5
## 2 2 b 1.5
## 3 3 <NA> 2.5
## 4 4 z NA
# error because `id = 3` exists
df1 %>%
rows_insert(df2, by = "id")
## Error: Attempting to insert duplicate rows.
rows_upsert
inserts or updates depending on whether or not the key value already exists
df1 %>%
rows_upsert(tibble(id = 3:4, b = c("z", "y"), c = c(3.5, 4.5)), by = "id")
## # A tibble: 4 x 3
## id b c
## <int> <chr> <dbl>
## 1 1 a 0.5
## 2 2 b 1.5
## 3 3 z 3.5
## 4 4 y 4.5
Crossing allows you to examine all the combinations between variables.
For example, we know that not all routes between origin
and dest
exist.
all_existing_routes <- flights %>% expand(nesting(origin, dest))
all_possible_routes <- flights %>% expand(crossing(origin, dest))
# or simply
all_possible_routes <- flights %>% expand(origin, dest)
# these routes do not exist
all_possible_routes %>%
anti_join(all_existing_routes, by = c("origin", "dest"))
## # A tibble: 91 x 2
## origin dest
## <chr> <chr>
## 1 EWR ABQ
## 2 EWR ACK
## 3 EWR BGR
## 4 EWR BHM
## 5 EWR BUR
## 6 EWR CAK
## 7 EWR CHO
## 8 EWR CRW
## 9 EWR EYW
## 10 EWR ILM
## # … with 81 more rows
expand
only keeps the variables specified. If you want to join the original data, you could use complete
.
# for demonstration
(dat <- flights %>%
group_by(origin, dest) %>%
summarize(air_time = mean(air_time, na.rm = TRUE)) %>%
ungroup())
## `summarise()` regrouping output by 'origin' (override with `.groups` argument)
## # A tibble: 224 x 3
## origin dest air_time
## <chr> <chr> <dbl>
## 1 EWR ALB 31.8
## 2 EWR ANC 413.
## 3 EWR ATL 112.
## 4 EWR AUS 211.
## 5 EWR AVL 89.8
## 6 EWR BDL 25.5
## 7 EWR BNA 115.
## 8 EWR BOS 40.3
## 9 EWR BQN 196.
## 10 EWR BTV 46.3
## # … with 214 more rows
dat %>%
expand(origin, dest) %>%
left_join(dat)
## Joining, by = c("origin", "dest")
## # A tibble: 315 x 3
## origin dest air_time
## <chr> <chr> <dbl>
## 1 EWR ABQ NA
## 2 EWR ACK NA
## 3 EWR ALB 31.8
## 4 EWR ANC 413.
## 5 EWR ATL 112.
## 6 EWR AUS 211.
## 7 EWR AVL 89.8
## 8 EWR BDL 25.5
## 9 EWR BGR NA
## 10 EWR BHM NA
## # … with 305 more rows
dat %>%
complete(origin, dest, fill = list("air_time" = Inf))
## # A tibble: 315 x 3
## origin dest air_time
## <chr> <chr> <dbl>
## 1 EWR ABQ Inf
## 2 EWR ACK Inf
## 3 EWR ALB 31.8
## 4 EWR ANC 413.
## 5 EWR ATL 112.
## 6 EWR AUS 211.
## 7 EWR AVL 89.8
## 8 EWR BDL 25.5
## 9 EWR BGR Inf
## 10 EWR BHM Inf
## # … with 305 more rows
df <- tribble(
~x, ~y,
1, 2,
NA, 3,
3, NA
)
df
## # A tibble: 3 x 2
## x y
## <dbl> <dbl>
## 1 1 2
## 2 NA 3
## 3 3 NA
# remove all rows with missing values
df %>% drop_na()
## # A tibble: 1 x 2
## x y
## <dbl> <dbl>
## 1 1 2
# fill values by copying downward
df %>% fill(x, y)
## # A tibble: 3 x 2
## x y
## <dbl> <dbl>
## 1 1 2
## 2 1 3
## 3 3 3
df %>%
replace_na(list(x = Inf, y = -Inf))
## # A tibble: 3 x 2
## x y
## <dbl> <dbl>
## 1 1 2
## 2 Inf 3
## 3 3 -Inf
Using the datasets from R for Data Science to show that the same data could be organized in different ways.
In general, there are two formats,
relig_income
## # A tibble: 18 x 11
## religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Agnostic 27 34 60 81 76 137 122
## 2 Atheist 12 27 37 52 35 70 73
## 3 Buddhist 27 21 30 34 33 58 62
## 4 Catholic 418 617 732 670 638 1116 949
## 5 Don’t k… 15 14 15 11 10 35 21
## 6 Evangel… 575 869 1064 982 881 1486 949
## 7 Hindu 1 9 7 9 11 34 47
## 8 Histori… 228 244 236 238 197 223 131
## 9 Jehovah… 20 27 24 24 21 30 15
## 10 Jewish 19 19 25 25 30 95 69
## 11 Mainlin… 289 495 619 655 651 1107 939
## 12 Mormon 29 40 48 51 56 112 85
## 13 Muslim 6 7 9 10 9 23 16
## 14 Orthodox 13 17 23 32 32 47 38
## 15 Other C… 9 7 11 13 13 14 18
## 16 Other F… 20 33 40 46 49 63 46
## 17 Other W… 5 2 3 4 2 7 3
## 18 Unaffil… 217 299 374 365 341 528 407
## # … with 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>, `Don't
## # know/refused` <dbl>
# make sure you have tidyr 1.0
relig_income %>%
pivot_longer(-religion, names_to = "income", values_to = "count")
## # A tibble: 180 x 3
## religion income count
## <chr> <chr> <dbl>
## 1 Agnostic <$10k 27
## 2 Agnostic $10-20k 34
## 3 Agnostic $20-30k 60
## 4 Agnostic $30-40k 81
## 5 Agnostic $40-50k 76
## 6 Agnostic $50-75k 137
## 7 Agnostic $75-100k 122
## 8 Agnostic $100-150k 109
## 9 Agnostic >150k 84
## 10 Agnostic Don't know/refused 96
## # … with 170 more rows
billboard
## # A tibble: 317 x 79
## artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
## 2 2Ge+h… The … 2000-09-02 91 87 92 NA NA NA NA NA
## 3 3 Doo… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
## 4 3 Doo… Loser 2000-10-21 76 76 72 69 67 65 55 59
## 5 504 B… Wobb… 2000-04-15 57 34 25 17 17 31 36 49
## 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
## 7 A*Tee… Danc… 2000-07-08 97 97 96 95 100 NA NA NA
## 8 Aaliy… I Do… 2000-01-29 84 62 51 41 38 35 35 38
## 9 Aaliy… Try … 2000-03-18 59 53 38 28 21 18 16 14
## 10 Adams… Open… 2000-08-26 76 76 74 69 68 67 61 58
## # … with 307 more rows, and 68 more variables: wk9 <dbl>, wk10 <dbl>,
## # wk11 <dbl>, wk12 <dbl>, wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>,
## # wk17 <dbl>, wk18 <dbl>, wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>,
## # wk23 <dbl>, wk24 <dbl>, wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>,
## # wk29 <dbl>, wk30 <dbl>, wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>,
## # wk35 <dbl>, wk36 <dbl>, wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>,
## # wk41 <dbl>, wk42 <dbl>, wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>,
## # wk47 <dbl>, wk48 <dbl>, wk49 <dbl>, wk50 <dbl>, wk51 <dbl>, wk52 <dbl>,
## # wk53 <dbl>, wk54 <dbl>, wk55 <dbl>, wk56 <dbl>, wk57 <dbl>, wk58 <dbl>,
## # wk59 <dbl>, wk60 <dbl>, wk61 <dbl>, wk62 <dbl>, wk63 <dbl>, wk64 <dbl>,
## # wk65 <dbl>, wk66 <lgl>, wk67 <lgl>, wk68 <lgl>, wk69 <lgl>, wk70 <lgl>,
## # wk71 <lgl>, wk72 <lgl>, wk73 <lgl>, wk74 <lgl>, wk75 <lgl>, wk76 <lgl>
billboard %>%
pivot_longer(
starts_with("wk"),
names_to = "week",
names_prefix = "wk",
values_to = "rank",
values_drop_na = TRUE
)
## # A tibble: 5,307 x 5
## artist track date.entered week rank
## <chr> <chr> <date> <chr> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 2 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 3 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 4 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 5 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 6 94
## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 7 99
## 8 2Ge+her The Hardest Part Of ... 2000-09-02 1 91
## 9 2Ge+her The Hardest Part Of ... 2000-09-02 2 87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02 3 92
## # … with 5,297 more rows
fish_encounters
## # A tibble: 114 x 3
## fish station seen
## <fct> <fct> <int>
## 1 4842 Release 1
## 2 4842 I80_1 1
## 3 4842 Lisbon 1
## 4 4842 Rstr 1
## 5 4842 Base_TD 1
## 6 4842 BCE 1
## 7 4842 BCW 1
## 8 4842 BCE2 1
## 9 4842 BCW2 1
## 10 4842 MAE 1
## # … with 104 more rows
fish_encounters %>% pivot_wider(
names_from = station,
values_from = seen,
values_fill = list(seen = 0)
)
## # A tibble: 19 x 12
## fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE MAW
## <fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 4842 1 1 1 1 1 1 1 1 1 1 1
## 2 4843 1 1 1 1 1 1 1 1 1 1 1
## 3 4844 1 1 1 1 1 1 1 1 1 1 1
## 4 4845 1 1 1 1 1 0 0 0 0 0 0
## 5 4847 1 1 1 0 0 0 0 0 0 0 0
## 6 4848 1 1 1 1 0 0 0 0 0 0 0
## 7 4849 1 1 0 0 0 0 0 0 0 0 0
## 8 4850 1 1 0 1 1 1 1 0 0 0 0
## 9 4851 1 1 0 0 0 0 0 0 0 0 0
## 10 4854 1 1 0 0 0 0 0 0 0 0 0
## 11 4855 1 1 1 1 1 0 0 0 0 0 0
## 12 4857 1 1 1 1 1 1 1 1 1 0 0
## 13 4858 1 1 1 1 1 1 1 1 1 1 1
## 14 4859 1 1 1 1 1 0 0 0 0 0 0
## 15 4861 1 1 1 1 1 1 1 1 1 1 1
## 16 4862 1 1 1 1 1 1 1 1 1 0 0
## 17 4863 1 1 0 0 0 0 0 0 0 0 0
## 18 4864 1 1 0 0 0 0 0 0 0 0 0
## 19 4865 1 1 1 0 0 0 0 0 0 0 0
us_rent_income
## # A tibble: 104 x 5
## GEOID NAME variable estimate moe
## <chr> <chr> <chr> <dbl> <dbl>
## 1 01 Alabama income 24476 136
## 2 01 Alabama rent 747 3
## 3 02 Alaska income 32940 508
## 4 02 Alaska rent 1200 13
## 5 04 Arizona income 27517 148
## 6 04 Arizona rent 972 4
## 7 05 Arkansas income 23789 165
## 8 05 Arkansas rent 709 5
## 9 06 California income 29454 109
## 10 06 California rent 1358 3
## # … with 94 more rows
us_rent_income %>%
pivot_wider(names_from = variable, values_from = c(estimate, moe))
## # A tibble: 52 x 6
## GEOID NAME estimate_income estimate_rent moe_income moe_rent
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 01 Alabama 24476 747 136 3
## 2 02 Alaska 32940 1200 508 13
## 3 04 Arizona 27517 972 148 4
## 4 05 Arkansas 23789 709 165 5
## 5 06 California 29454 1358 109 3
## 6 08 Colorado 32401 1125 109 5
## 7 09 Connecticut 35326 1123 195 5
## 8 10 Delaware 31560 1076 247 10
## 9 11 District of Columbia 43198 1424 681 17
## 10 12 Florida 25952 1077 70 3
## # … with 42 more rows