
# A tibble: 6 x 6
year month carrier dep_delay arr_delay distance
<int> <int> <chr> <dbl> <dbl> <dbl>
1 2013 1 UA 2 11 1400
2 2013 1 UA 4 20 1416
3 2013 1 AA 2 33 1089
4 2013 1 B6 -1 -18 1576
5 2013 1 DL -6 -25 762
6 2013 1 UA -4 12 719
dep_delay by year or carrier; average dep_delay by carrierdep_delay by carrier or distancedplyr is a powerful tool for data processing
The contents on dplyr are mostly based on Chapter 3 of the book “R for data science” by Wickham and Grolemun.
We will cover how to use it to:
Operate on rows: filter{dplyr}. Here “filter{dplyr}” means that filter is a command provided by dplyr
Operate on columns: select{dplyr} and mutate{dplyr}
Obtain various summary statistics.
We need the following R packages:
dplyr: the package for processing data frame like objectstidyverse: the package that also installs dplyr and tidyr and provides support for tibblenycflights13: the data set for practiceThe package dplyr is a powerful tool for working with data frame like objects. To install it, run the following in Rstudio console:
install.packages("tidyverse")
This command will also install dplyr and the needed packages to deal with a type of data frame called “tibble”.
Alternatively, the package can be installed via:
install.packages("dplyr")
tibble is a modern reimagining of the “data frame” object in R. tibble behaves similarly as data.frame, except for subsetting.
Information on tibble can be found in Chapter 7 of the book “R for data science” by Wickham and Grolemun; see in particular section “Tibbles Versus data.frame” there.
We will just use but not focus on tibble.
nycflights13The pacakge nycflights13 contains the data set, and can be installed via
install.packages("nycflights13")
nycflights13 contains 5 tibbles, we will work on the flights tibble.
> library(nycflights13)
> head(flights)
# A tibble: 6 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
# ... with 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
filterfilter (filter{dplyr}) selects rows for which some specified conditions are true. Its syntax is:
filter(data, ...)
where “data” is the name of a data frame or tibble and “…” are conditions.
filterExample: select rows from flights, for which year is 2013, month is either 11 or 12, and tailnum is N14228
> library(nycflights13)
> library(dplyr)
> filter(flights, year == 2013, month %in% c(11, 12),
+ tailnum == 'N14228')
# A tibble: 3 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 12 22 937 900 37 1247
2 2013 12 26 922 909 13 1044
3 2013 12 28 1903 1847 16 2126
# ... with 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
filterExample: select rows from flights, for which arr_deplay is \(\le 120\) and dep_delay is \(\le 120\)
> library(dplyr)
> filter(flights, !(arr_delay > 120 | dep_delay > 120))
# A tibble: 316,050 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
7 2013 1 1 555 600 -5 913
8 2013 1 1 557 600 -3 709
9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# ... with 316,040 more rows, and 12 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
arrangearrange (arrange{dplyr}) orders the rows of a data frame according to some criteria. Its syntax is:
arrange(data, ...)
where “data” is the name of a data frame or tibble and “…” are conditions.
arrangeExample: order the rows of flights by year number, month number and day number:
> #unique(flights$year)
> library(dplyr)
> arrange(flights, year, month, day)
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
7 2013 1 1 555 600 -5 913
8 2013 1 1 557 600 -3 709
9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# ... with 336,766 more rows, and 12 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
selectselect (select{dplyr}) keeps column whose names are specified. Its syntax is:
select(data, ...)
where “data” is the name of a data frame or tibble and “…” are columns names.
selectExample: keep columns of flights with column names year up till day
> library(dplyr)
> # Select all columns between year and day (inclusive)
> select(flights, year:day)
# A tibble: 336,776 x 3
year month day
<int> <int> <int>
1 2013 1 1
2 2013 1 1
3 2013 1 1
4 2013 1 1
5 2013 1 1
6 2013 1 1
7 2013 1 1
8 2013 1 1
9 2013 1 1
10 2013 1 1
# ... with 336,766 more rows
selectExample: select from flights all columns except those from year to day (inclusive):
> library(dplyr)
> select(flights, -(year:day))
# A tibble: 336,776 x 16
dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <dbl> <int> <int>
1 517 515 2 830 819
2 533 529 4 850 830
3 542 540 2 923 850
4 544 545 -1 1004 1022
5 554 600 -6 812 837
6 554 558 -4 740 728
7 555 600 -5 913 854
8 557 600 -3 709 723
9 557 600 -3 838 846
10 558 600 -2 753 745
# ... with 336,766 more rows, and 11 more variables:
# arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
mutatemutate (mutate{dplyr}) adds new columns at the end of a dataset, and these new columns are often functions of existing columns. Its syntax is:
mutate(data, ...)
where data is a data frame or tibble and “…” contains the name of the new column and how it is obtained from some existing columns.
mutateExample: create a smaller data frame flights_sml from flights by selecting some of its columns
> library(dplyr)
> # pick a few columns; `ends_with(string)` selects columns with ending `string`
> flights_sml <- select(flights, year:day,
+ ends_with("delay"), distance,air_time)
> head(flights_sml)
# A tibble: 6 x 7
year month day dep_delay arr_delay distance air_time
<int> <int> <int> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 2 11 1400 227
2 2013 1 1 4 20 1416 227
3 2013 1 1 2 33 1089 160
4 2013 1 1 -1 -18 1576 183
5 2013 1 1 -6 -25 762 116
6 2013 1 1 -4 12 719 150
mutateExample: add a new column gain as the difference between arr_delay and dep_delay, and a new column speed as distance divided by air_time (in minutes) and then multiplied by \(60\)
> library(dplyr)
> newA = mutate(flights_sml, gain = arr_delay - dep_delay,
+ speed = distance / air_time * 60)
> # display the last 3 columns of the data frame newA
> newA[1:3,(ncol(newA)-3):ncol(newA)]
# A tibble: 3 x 4
distance air_time gain speed
<dbl> <dbl> <dbl> <dbl>
1 1400 227 9 370.
2 1416 227 16 374.
3 1089 160 31 408.
The pipe operator %>% (%>%{dplyr}) forwards a value or the result of an expression, into the next function call/expression. Its syntax is:
value %>% operation
where value is a value and operation is the operation to be applied to value.
More information on pipe can be found in Chapter 14 of the book “R for data science” by Wickham and Grolemun.
Example: create data frame flights_sml from flights by selecting some of its columns
> library(dplyr)
> # pick a few columns to create a smaller data set
> flights_sml = flights %>% select(year:day)
> head(flights_sml)
# A tibble: 6 x 3
year month day
<int> <int> <int>
1 2013 1 1
2 2013 1 1
3 2013 1 1
4 2013 1 1
5 2013 1 1
6 2013 1 1
Example: create data frame flights_sml from flights by selecting some of its rows
> library(dplyr)
> # pick a few columns to create a smaller data set
> flights_sml = flights %>% filter(year==2013)
> head(flights_sml)
# A tibble: 6 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
# ... with 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
summarise (summarise {dplyr}) is typically used on grouped data created by group_by(). The output will have one row for each group. It syntax is:
summarise(data, ...)
where data is a data frame and “…” contains summary functions such as mean and sd
Example: obtain mean arr_delay by month and carrier for 2 months and 3 carriers
> library(nycflights13)
> library(dplyr)
> # select 2 months and 3 carriers from `flights`
> tmp = flights %>% select(month,arr_delay,carrier) %>%
+ filter(month %in% 1:2, carrier %in% c("UA","AA","US"))
> # remove rows that have any NA
> tmp = na.omit(tmp)
> # obtain mean by each combination of `month` value and `carrier` value
> sm = tmp %>% group_by(month,carrier) %>%
+ summarise(mean_arr_delay = mean(arr_delay)) %>% as.data.frame()
> head(sm)
month carrier mean_arr_delay
1 1 AA 0.9823789
2 1 UA 3.1755991
3 1 US 1.4311454
4 2 AA 1.0937891
5 2 UA 0.1946115
6 2 US 0.2726648
Caution: do not load R package plyr after dplyr; otherwise summarise will be masked and does not work together with groub_by.
Example:
> library(dplyr)
> library(plyr)
> sm = tmp %>% group_by(month,carrier) %>%
+ summarise(mean_arr_delay = mean(arr_delay)) %>% as.data.frame()
> sm
mean_arr_delay
1 1.380687
To be safe, we can also explicitly tell R which package a command belongs to when using it, such as dplyr::group_by and dplyr::summarise. Note the use of ::.
Example:
> library(dplyr)
> sm = tmp %>% dplyr::group_by(month,carrier) %>%
+ dplyr::summarise(mean_arr_delay = mean(arr_delay)) %>% as.data.frame()
> sm
month carrier mean_arr_delay
1 1 AA 0.9823789
2 1 UA 3.1755991
3 1 US 1.4311454
4 2 AA 1.0937891
5 2 UA 0.1946115
6 2 US 0.2726648
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] plyr_1.8.4 bindrcpp_0.2.2 dplyr_0.7.8
[4] nycflights13_1.0.0 knitr_1.21
loaded via a namespace (and not attached):
[1] Rcpp_1.0.0 rstudioapi_0.8 bindr_0.1.1
[4] magrittr_1.5 tidyselect_0.2.5 R6_2.3.0
[7] rlang_0.3.0.1 fansi_0.4.0 stringr_1.3.1
[10] tools_3.5.0 revealjs_0.9 xfun_0.4
[13] utf8_1.1.4 cli_1.0.1 htmltools_0.3.6
[16] yaml_2.2.0 digest_0.6.18 assertthat_0.2.0
[19] tibble_1.4.2 crayon_1.3.4 purrr_0.2.5
[22] glue_1.3.0 evaluate_0.12 rmarkdown_1.11
[25] stringi_1.2.4 compiler_3.5.0 pillar_1.3.1
[28] pkgconfig_2.0.2