Stat 437 Lecture Notes 1a

Xiongzhi Chen

Washington State University

Basics of dplyr

Motivation

# A tibble: 6 x 6
   year month carrier dep_delay arr_delay distance
  <int> <int> <chr>       <dbl>     <dbl>    <dbl>
1  2013     1 UA              2        11     1400
2  2013     1 UA              4        20     1416
3  2013     1 AA              2        33     1089
4  2013     1 B6             -1       -18     1576
5  2013     1 DL             -6       -25      762
6  2013     1 UA             -4        12      719
  • dep_delay by year or carrier; average dep_delay by carrier
  • order dep_delay by carrier or distance
  • visualization

dplyr is a powerful tool for data processing

Overview

The contents on dplyr are mostly based on Chapter 3 of the book “R for data science” by Wickham and Grolemun.

We will cover how to use it to:

  1. Operate on rows: filter{dplyr}. Here “filter{dplyr}” means that filter is a command provided by dplyr

  2. Operate on columns: select{dplyr} and mutate{dplyr}

  3. Obtain various summary statistics.

Overview

We need the following R packages:

  • dplyr: the package for processing data frame like objects
  • tidyverse: the package that also installs dplyr and tidyr and provides support for tibble
  • nycflights13: the data set for practice

dplyr

The package dplyr is a powerful tool for working with data frame like objects. To install it, run the following in Rstudio console:

install.packages("tidyverse")

This command will also install dplyr and the needed packages to deal with a type of data frame called “tibble”.

Alternatively, the package can be installed via:

install.packages("dplyr")

tibble

tibble is a modern reimagining of the “data frame” object in R. tibble behaves similarly as data.frame, except for subsetting.

Information on tibble can be found in Chapter 7 of the book “R for data science” by Wickham and Grolemun; see in particular section “Tibbles Versus data.frame” there.

We will just use but not focus on tibble.

NYC flight data: nycflights13

The pacakge nycflights13 contains the data set, and can be installed via

install.packages("nycflights13")

nycflights13 contains 5 tibbles, we will work on the flights tibble.

NYC flight data: flights

> library(nycflights13)
> head(flights)
# A tibble: 6 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>
1  2013     1     1      517            515         2      830
2  2013     1     1      533            529         4      850
3  2013     1     1      542            540         2      923
4  2013     1     1      544            545        -1     1004
5  2013     1     1      554            600        -6      812
6  2013     1     1      554            558        -4      740
# ... with 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Row operation: filter

filter (filter{dplyr}) selects rows for which some specified conditions are true. Its syntax is:

filter(data, ...)

where “data” is the name of a data frame or tibble and “…” are conditions.

Row operation: filter

Example: select rows from flights, for which year is 2013, month is either 11 or 12, and tailnum is N14228

> library(nycflights13)
> library(dplyr)
> filter(flights, year == 2013, month %in% c(11, 12), 
+        tailnum == 'N14228')
# A tibble: 3 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>
1  2013    12    22      937            900        37     1247
2  2013    12    26      922            909        13     1044
3  2013    12    28     1903           1847        16     2126
# ... with 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Row operation: filter

Example: select rows from flights, for which arr_deplay is \(\le 120\) and dep_delay is \(\le 120\)

> library(dplyr)
> filter(flights, !(arr_delay > 120 | dep_delay > 120)) 
# A tibble: 316,050 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      542            540         2      923
 4  2013     1     1      544            545        -1     1004
 5  2013     1     1      554            600        -6      812
 6  2013     1     1      554            558        -4      740
 7  2013     1     1      555            600        -5      913
 8  2013     1     1      557            600        -3      709
 9  2013     1     1      557            600        -3      838
10  2013     1     1      558            600        -2      753
# ... with 316,040 more rows, and 12 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>

Row operation: arrange

arrange (arrange{dplyr}) orders the rows of a data frame according to some criteria. Its syntax is:

arrange(data, ...)

where “data” is the name of a data frame or tibble and “…” are conditions.

Row operation: arrange

Example: order the rows of flights by year number, month number and day number:

> #unique(flights$year)
> library(dplyr)
> arrange(flights, year, month, day)
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      542            540         2      923
 4  2013     1     1      544            545        -1     1004
 5  2013     1     1      554            600        -6      812
 6  2013     1     1      554            558        -4      740
 7  2013     1     1      555            600        -5      913
 8  2013     1     1      557            600        -3      709
 9  2013     1     1      557            600        -3      838
10  2013     1     1      558            600        -2      753
# ... with 336,766 more rows, and 12 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>

Column operation: select

select (select{dplyr}) keeps column whose names are specified. Its syntax is:

select(data, ...)

where “data” is the name of a data frame or tibble and “…” are columns names.

Column operation: select

Example: keep columns of flights with column names year up till day

> library(dplyr)
> # Select all columns between year and day (inclusive)
> select(flights, year:day)
# A tibble: 336,776 x 3
    year month   day
   <int> <int> <int>
 1  2013     1     1
 2  2013     1     1
 3  2013     1     1
 4  2013     1     1
 5  2013     1     1
 6  2013     1     1
 7  2013     1     1
 8  2013     1     1
 9  2013     1     1
10  2013     1     1
# ... with 336,766 more rows

Column operation: select

Example: select from flights all columns except those from year to day (inclusive):

> library(dplyr)
> select(flights, -(year:day))
# A tibble: 336,776 x 16
   dep_time sched_dep_time dep_delay arr_time sched_arr_time
      <int>          <int>     <dbl>    <int>          <int>
 1      517            515         2      830            819
 2      533            529         4      850            830
 3      542            540         2      923            850
 4      544            545        -1     1004           1022
 5      554            600        -6      812            837
 6      554            558        -4      740            728
 7      555            600        -5      913            854
 8      557            600        -3      709            723
 9      557            600        -3      838            846
10      558            600        -2      753            745
# ... with 336,766 more rows, and 11 more variables:
#   arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Column operation: mutate

mutate (mutate{dplyr}) adds new columns at the end of a dataset, and these new columns are often functions of existing columns. Its syntax is:

mutate(data, ...)

where data is a data frame or tibble and “…” contains the name of the new column and how it is obtained from some existing columns.

Column operation: mutate

Example: create a smaller data frame flights_sml from flights by selecting some of its columns

> library(dplyr)
> # pick a few columns; `ends_with(string)` selects columns with ending `string`
> flights_sml <- select(flights, year:day,
+ ends_with("delay"), distance,air_time)
> head(flights_sml)
# A tibble: 6 x 7
   year month   day dep_delay arr_delay distance air_time
  <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>
1  2013     1     1         2        11     1400      227
2  2013     1     1         4        20     1416      227
3  2013     1     1         2        33     1089      160
4  2013     1     1        -1       -18     1576      183
5  2013     1     1        -6       -25      762      116
6  2013     1     1        -4        12      719      150

Column operation: mutate

Example: add a new column gain as the difference between arr_delay and dep_delay, and a new column speed as distance divided by air_time (in minutes) and then multiplied by \(60\)

> library(dplyr)
> newA = mutate(flights_sml, gain = arr_delay - dep_delay,
+ speed = distance / air_time * 60)
> # display the last 3 columns of the data frame newA
> newA[1:3,(ncol(newA)-3):ncol(newA)]
# A tibble: 3 x 4
  distance air_time  gain speed
     <dbl>    <dbl> <dbl> <dbl>
1     1400      227     9  370.
2     1416      227    16  374.
3     1089      160    31  408.

Obtain summaries using dplyr

Pipe (%>%) Operator

The pipe operator %>% (%>%{dplyr}) forwards a value or the result of an expression, into the next function call/expression. Its syntax is:

value %>% operation

where value is a value and operation is the operation to be applied to value.

More information on pipe can be found in Chapter 14 of the book “R for data science” by Wickham and Grolemun.

Pipe (%>%) Operator

Example: create data frame flights_sml from flights by selecting some of its columns

> library(dplyr)
> # pick a few columns to create a smaller data set
> flights_sml = flights %>% select(year:day)
> head(flights_sml)
# A tibble: 6 x 3
   year month   day
  <int> <int> <int>
1  2013     1     1
2  2013     1     1
3  2013     1     1
4  2013     1     1
5  2013     1     1
6  2013     1     1

Pipe (%>%) Operator

Example: create data frame flights_sml from flights by selecting some of its rows

> library(dplyr)
> # pick a few columns to create a smaller data set
> flights_sml = flights %>% filter(year==2013)
> head(flights_sml)
# A tibble: 6 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>
1  2013     1     1      517            515         2      830
2  2013     1     1      533            529         4      850
3  2013     1     1      542            540         2      923
4  2013     1     1      544            545        -1     1004
5  2013     1     1      554            600        -6      812
6  2013     1     1      554            558        -4      740
# ... with 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

The “summarise” command

summarise (summarise {dplyr}) is typically used on grouped data created by group_by(). The output will have one row for each group. It syntax is:

summarise(data, ...)

where data is a data frame and “…” contains summary functions such as mean and sd

The “summarise” command

Example: obtain mean arr_delay by month and carrier for 2 months and 3 carriers

> library(nycflights13)
> library(dplyr)
> # select 2 months and 3 carriers from `flights`
> tmp = flights %>% select(month,arr_delay,carrier) %>%
+   filter(month %in% 1:2, carrier %in% c("UA","AA","US"))
> # remove rows that have any NA
> tmp = na.omit(tmp)
> # obtain mean by each combination of `month` value and `carrier` value
> sm =  tmp %>% group_by(month,carrier) %>% 
+   summarise(mean_arr_delay = mean(arr_delay)) %>% as.data.frame()
> head(sm) 
  month carrier mean_arr_delay
1     1      AA      0.9823789
2     1      UA      3.1755991
3     1      US      1.4311454
4     2      AA      1.0937891
5     2      UA      0.1946115
6     2      US      0.2726648

The “summarise” command

Caution: do not load R package plyr after dplyr; otherwise summarise will be masked and does not work together with groub_by.

Example:

> library(dplyr)
> library(plyr)
> sm =  tmp %>% group_by(month,carrier) %>% 
+   summarise(mean_arr_delay = mean(arr_delay)) %>% as.data.frame()
> sm
  mean_arr_delay
1       1.380687

The “summarise” command

To be safe, we can also explicitly tell R which package a command belongs to when using it, such as dplyr::group_by and dplyr::summarise. Note the use of ::.

Example:

> library(dplyr)
> sm =  tmp %>% dplyr::group_by(month,carrier) %>% 
+   dplyr::summarise(mean_arr_delay = mean(arr_delay)) %>% as.data.frame()
> sm
  month carrier mean_arr_delay
1     1      AA      0.9823789
2     1      UA      3.1755991
3     1      US      1.4311454
4     2      AA      1.0937891
5     2      UA      0.1946115
6     2      US      0.2726648

License and session Information

License

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] plyr_1.8.4         bindrcpp_0.2.2     dplyr_0.7.8       
[4] nycflights13_1.0.0 knitr_1.21        

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0       rstudioapi_0.8   bindr_0.1.1     
 [4] magrittr_1.5     tidyselect_0.2.5 R6_2.3.0        
 [7] rlang_0.3.0.1    fansi_0.4.0      stringr_1.3.1   
[10] tools_3.5.0      revealjs_0.9     xfun_0.4        
[13] utf8_1.1.4       cli_1.0.1        htmltools_0.3.6 
[16] yaml_2.2.0       digest_0.6.18    assertthat_0.2.0
[19] tibble_1.4.2     crayon_1.3.4     purrr_0.2.5     
[22] glue_1.3.0       evaluate_0.12    rmarkdown_1.11  
[25] stringi_1.2.4    compiler_3.5.0   pillar_1.3.1    
[28] pkgconfig_2.0.2