Exploratory Data Analysis II

Before starting, please install the ggplot2, tidyverse, ggalt, and mlbench packages.

install.packages(c("ggplot2","tidyverse","mlbench","ggalt"))

`ggplot2` Package

ggplot2, developed by H.Wickham, is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics.

For more information about this package, visit http://ggplot2.org/

There are also many videos, books, and web pages about this package.

ggplot2 (often just called ggplot) implements the “Grammar of Graphics”, the principle that a plot can be split into the following basic parts:

Plot = data + Aesthetics + Geometry

Data refers to the information you want to visualize.
Aesthetics are the specific variables used in the drawing, i.e. the x and y variables. They also tell R how the data are displayed in a plot, e.g. the color, size, and shape of points.
Geometry refers to the type of graphic (bar chart, histogram, box plot, line plot, density plot, dot plot, etc.). For the full list of geometric functions, visit https://ggplot2.tidyverse.org/reference/

Here, you can see some functions from the list.

geom_point() = Scatter Plot
geom_bar() = Bar Plot
geom_line() = Line Plot
geom_histogram() = Histogram
geom_boxplot() = Box Plot
geom_density() = Density Plot

For example, with a data frame named data that has columns x and y:

library(ggplot2)
ggplot(data,aes(x=x,y=y))+geom_point()
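
As a concrete, self-contained illustration of the template above, the sketch below uses the built-in mtcars data set (an assumption for illustration; it is not part of these notes) so that it runs without any file. The three parts of the grammar are written out in the comments.

```r
library(ggplot2)

# data: the built-in mtcars data frame (assumed here for illustration)
# aesthetics: weight on the x axis, miles per gallon on the y axis
# geometry: points, i.e. a scatter plot
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point()

p  # printing the object draws the plot
```

Swapping geom_point() for another geom_ function changes the type of plot while the data and aesthetics stay the same.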

In addition to those functions, we use the following arguments or functions for our plots.

**Functions**

geom_text() = Add text labels or numbers to your plot

coord_flip() = Swap the x and y axes

theme() = Adjust the non-data appearance of your plot, e.g. the size of axis titles

facet_wrap() & facet_grid() = Draw a separate panel for each subset of your data

scale_color_manual() = Set the colors of your plot manually

labs() = Set the title, axis labels, etc.

**Arguments**

col = Color your plot by a third variable (in the aesthetics part)

group = Split your data into groups by a third variable (in the aesthetics part)

color = Set the outline color of a box, bar, or bin (in the geom part)

fill = Set the fill color of a box, bar, or bin (in the geom part)
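
The difference between the aesthetics part and the geom part can be sketched as follows. The example assumes the built-in mtcars data set, which is not used elsewhere in these notes: inside aes() a color is mapped to a variable, while inside the geom it is set to one fixed value.

```r
library(ggplot2)

# Mapped: col inside aes() colors the points by a third variable (cyl)
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
  geom_point()

# Set: color and fill inside the geom apply one fixed value to every bar
p_set <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar(color = "black", fill = "steelblue")
```

Only the mapped version produces a legend, because only there does the color carry information from the data.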

Important Note

ggplot2 package works with data.frame and tibble objects.

Why is ggplot2 better?

Excellent themes can be applied with a single command.
Its default colors are more attractive than those of base graphics.
It is easy to visualize data with multiple variables.
It provides a platform for creating simple graphs that convey a lot of information.

(Ozdemir, O, Lab Notes, 2019)

Application

In this class, the diamonds data set used last week will be used again.

First, read the data set.

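diamonds = read.table("diamonds.txt", header=TRUE, sep=",", stringsAsFactors=TRUE) # stringsAsFactors=TRUE keeps cut, color and clarity as factors in R >= 4.0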
head(diamonds)

##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

str(diamonds)

## 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Factor w/ 5 levels "Fair","Good",..: 3 4 2 4 2 5 5 5 1 5 ...
##  $ color  : Factor w/ 7 levels "D","E","F","G",..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Factor w/ 8 levels "I1","IF","SI1",..: 4 3 5 6 4 8 7 3 6 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

What is the frequency distribution of CUT?

t = table(diamonds$cut)
class(t)

## [1] "table"

df = data.frame(t)
df

##        Var1  Freq
## 1      Fair  1610
## 2      Good  4906
## 3     Ideal 21551
## 4   Premium 13791
## 5 Very Good 12082

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.6.3

ggplot(df,aes(x=Var1,y=Freq))+geom_bar(stat="identity")

# stat="identity" is required here because the bar heights are already given by Freq

ggplot(df,aes(x=Var1,y=Freq,fill=Var1))+geom_bar(stat="identity")+
  labs(title="Bar Plot of CUT",y="Freq",x="Level")

ggplot(df,aes(x=Var1,y=Freq,fill=Var1))+geom_bar(stat="identity")+
  labs(title="Bar Plot of CUT",y="Freq",x="Level")+geom_text(aes(label=Freq))

# geom_text() requires the label aesthetic

ggplot(df,aes(x=Var1,y=Freq,fill=Var1))+geom_bar(stat="identity")+
  labs(title="Bar Plot of CUT",y="Freq",x="Level")+geom_text(aes(label=Freq),vjust=-0.25,fontface="bold")

# vjust=-0.25 places the labels just above the bars; fontface="bold" makes them bold

Ordering x axis

Use the reorder() function.

ggplot(df,aes(x=reorder(Var1,Freq),y=Freq,fill=Var1))+geom_bar(stat="identity")+
  labs(title="Bar Plot of CUT",y="Freq",x="Level")+geom_text(aes(label=Freq),vjust=-0.25,fontface="bold")
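
What reorder() does can be checked without drawing anything. The sketch below rebuilds the frequency table printed above by hand and shows that the factor levels, which determine the order of the bars, are sorted by Freq.

```r
# Frequency table copied from the output above
df <- data.frame(
  Var1 = factor(c("Fair", "Good", "Ideal", "Premium", "Very Good")),
  Freq = c(1610, 4906, 21551, 13791, 12082)
)

levels(df$Var1)                    # alphabetical order (the default)
levels(reorder(df$Var1, df$Freq))  # levels sorted by increasing Freq

# For bars in decreasing order, reorder by the negative frequency
levels(reorder(df$Var1, -df$Freq))
```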

What is the distribution of carat?

class(diamonds$carat)

## [1] "numeric"

ggplot(diamonds,aes(x=carat))+geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(diamonds,aes(x=carat))+geom_histogram(fill="red")+labs(title="Histogram of Carat",y="Count",x="Carat")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
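
The message above asks for an explicit bin width. A minimal sketch, assuming the diamonds data set that ships with ggplot2 (assumed here to match diamonds.txt) and an arbitrary bin width of 0.1 carat:

```r
library(ggplot2)

# binwidth = 0.1 is an illustrative choice, not a recommendation
p <- ggplot(ggplot2::diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1, fill = "red", color = "white") +
  labs(title = "Histogram of Carat", y = "Count", x = "Carat")
p
```

Trying a few different values of binwidth is worthwhile, since the apparent shape of the distribution can change with the bin width.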

ggplot(diamonds,aes(x=carat))+geom_density()+labs(title="Density Plot of Carat",y="Density",x="Carat")

It is observed that carat has a multimodal distribution.

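ggplot(diamonds,aes(x=carat))+geom_histogram(fill="red",aes(y=after_stat(density)))+labs(title="Histogram and Density Plot of Carat",y="Density",x="Carat")+geom_density(col="yellow")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# aes(y=after_stat(density)) is required: it puts the histogram on the density scale so that it matches geom_density()
# after_stat() needs ggplot2 >= 3.3.0; older versions use stat(density)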

g1<-ggplot(diamonds,aes(x=carat))+geom_boxplot()+labs(title="g1")
g2<-ggplot(diamonds,aes(x=factor(1),y=carat))+geom_boxplot()+labs(title="g2") # factor(1) is a dummy x value, so a single vertical box is drawn
library(gridExtra)
grid.arrange(g1,g2,ncol=2)

ggplot(diamonds,aes(x=factor(1),y=carat))+geom_boxplot(fill="darkred")+labs(title="Box Plot of Carat")

What is the association between carat and price?

ggplot(diamonds,aes(x=carat,y=price))+geom_point()+labs(title = "The relationship between Carat and Price")

ggplot(diamonds,aes(x=carat,y=price))+geom_point(col="darkred")+labs(title = "The relationship between Carat and Price")

Adding trend line

ggplot(diamonds,aes(x=carat,y=price))+geom_point(col="darkred")+labs(title = "The relationship between Carat and Price")+geom_smooth()

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Price and carat have a strong positive relationship until carat reaches about 3. After that, the relationship weakens.

More and More

What is the association between carat and price by cut type?

Facet in ggplot

Generating a plot for each level of a factor.

ggplot(diamonds,aes(x=carat,y=price))+geom_point()+facet_wrap(.~cut)

# facet_wrap() takes a formula; the dot is optional, so facet_wrap(~cut) gives the same result
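
facet_grid(), listed earlier among the functions, lays the panels out in a grid defined by two factors. A sketch, assuming the copy of diamonds bundled with ggplot2 (assumed here to match diamonds.txt); the transparency value is an illustrative choice:

```r
library(ggplot2)

# Rows are defined by cut, columns by color: 5 x 7 = 35 panels
p <- ggplot(ggplot2::diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.2) +
  facet_grid(cut ~ color)
p
```

facet_wrap() is usually the better choice for one factor, facet_grid() for the combination of two.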

What is the association between carat and price by depth and cut?

Bubble Plot: A bubble plot is a scatter plot with a third numeric variable mapped to circle size. It also enables us to include one categorical variable as a fourth one.

To explain the concept, I would like to use a sample of the data.

set.seed(123)
s = diamonds[sample(1:53940,50),]

ggplot(s,aes(x=carat,y=price,size=depth,col=cut))+geom_point()+labs(title="Association between Carat and Price by Cut and Depth")

The plot suggests that as depth and cut quality increase, we can expect higher carat and price.

How does depth distribute by clarity?

ggplot(diamonds,aes(x=clarity,y=depth))+geom_point()+labs(title="Relationship between Clarity and Depth")

Such a plot suffers from overplotting, which makes interpretation harder. It is a common problem in data sets with a large number of observations (this data set has 53940 observations).

Several solutions have been suggested for this problem; one of them is jittering.

Jitter Plot: Random noise is added to the location of each point to reduce overplotting.

ggplot(diamonds,aes(x=clarity,y=depth))+geom_point(position="jitter")+labs(title="Relationship between Clarity and Depth")

# position="jitter" adds the random noise to each point
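
The amount of noise can also be controlled explicitly with position_jitter(). The sketch below assumes the copy of diamonds bundled with ggplot2; the width, the seed, and the transparency (alpha) are illustrative choices. Setting height = 0 keeps the depth values exact and spreads the points only horizontally.

```r
library(ggplot2)

# seed makes the random jitter reproducible; alpha makes dense regions visible
p <- ggplot(ggplot2::diamonds, aes(x = clarity, y = depth)) +
  geom_point(position = position_jitter(width = 0.3, height = 0, seed = 123),
             alpha = 0.1) +
  labs(title = "Relationship between Clarity and Depth")
p
```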

To see the effect of jittering, let us use the sample data set. Using a sample of the data is itself one of the solutions to overplotting.

ggplot(s,aes(x=clarity,y=depth))+geom_point()+labs(title="Relationship between Clarity and Depth for Sample Data Set")

ggplot(s,aes(x=clarity,y=depth))+geom_point(position="jitter")+labs(title="Relationship between Clarity and Depth for Sample Data Set")

Both plots show that it is hard to observe a relationship between the variables of interest.

More and More

How are diamond prices distributed across cut types?

Violin Plot: Violin plots are similar to box plots, except that they also show the kernel probability density of the data at different values.

ggplot(diamonds,aes(x=cut,y=price))+geom_violin()

ggplot(diamonds,aes(x=cut,y=price,fill=cut))+geom_violin()+geom_boxplot(width=0.15)+labs(title="Distribution of Price by Cut")

# width sets the width of the box plots

Price has a right-skewed distribution for every cut except Good. Also, ideal-cut diamonds are cheaper than the others on average.

Extra

statsExpressions package

This package helps you print the output of a statistical test on the related plot. It can be applied to many types of plots.

Other examples can be found in the package documentation.

library(ggplot2)
library(ggforce)

## Warning: package 'ggforce' was built under R version 3.6.3

library(statsExpressions)

## Warning: package 'statsExpressions' was built under R version 3.6.3

# plot with subtitle
ggplot(diamonds,aes(x=cut,y=price)) +
  geom_violin() +geom_boxplot(width=0.15)+
  labs(
    title = "Fisher's one-way ANOVA",
    subtitle = oneway_anova(diamonds, cut, price, var.equal = TRUE)$expression[[1]]
  )

Consider the first research question again and answer it with a different visual.

What is the frequency distribution of CUT?

Lollipop Chart: A lollipop plot is basically a bar plot in which each bar is replaced by a line and a dot. It shows the relationship between a numeric and a categorical variable. A lollipop is built using geom_point() for the circle and geom_segment() for the stem.

t = table(diamonds$cut)
df = data.frame(t)
df

##        Var1  Freq
## 1      Fair  1610
## 2      Good  4906
## 3     Ideal 21551
## 4   Premium 13791
## 5 Very Good 12082

ggplot(df,aes(x=Var1,y=Freq)) +geom_point() + geom_segment(aes(x=Var1, xend=Var1, y=0, yend=Freq))

ggplot(df,aes(x=Var1,y=Freq)) +geom_point(size=5, color="red", fill="yellow", alpha=0.7, shape=21, stroke=2) +geom_segment(aes(x=Var1, xend=Var1, y=0, yend=Freq))+labs(title="Lollipop Plot of Cut",x="Cut Types",y="Frequency")
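
coord_flip() can turn this into a horizontal lollipop chart, which is often easier to read when the category labels are long. A sketch that rebuilds df from the frequency table printed above and also sorts the categories with reorder():

```r
library(ggplot2)

# Frequency table copied from the output above
df <- data.frame(
  Var1 = factor(c("Fair", "Good", "Ideal", "Premium", "Very Good")),
  Freq = c(1610, 4906, 21551, 13791, 12082)
)

p <- ggplot(df, aes(x = reorder(Var1, Freq), y = Freq)) +
  geom_segment(aes(xend = reorder(Var1, Freq), y = 0, yend = Freq)) +
  geom_point(size = 5, color = "red") +
  coord_flip() +  # swap the axes so the stems run horizontally
  labs(title = "Lollipop Plot of Cut", x = "Cut Types", y = "Frequency")
p
```

The stems are drawn before the points so that the dots sit on top of the lines.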

What is the change in the prices within each cut type?

Dumbbell Plot: A dumbbell plot, a.k.a. dumbbell chart, is great for displaying changes between two points in time, two conditions, or differences between two groups.

Before drawing this plot, the data set must be put into a suitable form.

Data Manipulation

min_price = aggregate(price~cut,data=diamonds,min)
max_price = aggregate(price~cut,data=diamonds,max)
dumbell_data=cbind(min_price,max_price)
dumbell_data

##         cut price       cut price
## 1      Fair   337      Fair 18574
## 2      Good   327      Good 18788
## 3     Ideal   326     Ideal 18806
## 4   Premium   326   Premium 18823
## 5 Very Good   336 Very Good 18818

This is not enough: the cut column appears twice and the two price columns share the same name.

dumbell_data = dumbell_data[,-3]
colnames(dumbell_data) = c("cut","min","max")
dumbell_data

##         cut min   max
## 1      Fair 337 18574
## 2      Good 327 18788
## 3     Ideal 326 18806
## 4   Premium 326 18823
## 5 Very Good 336 18818
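
The cbind() step followed by dropping a column can be replaced by merge(), which joins the two summaries on cut and therefore never duplicates it. The sketch below uses a small made-up data frame (toy is hypothetical, not the diamonds data) so that the result is easy to verify by eye.

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(
  cut   = c("Fair", "Fair", "Good", "Good", "Ideal"),
  price = c(337, 500, 327, 900, 326)
)

min_price <- aggregate(price ~ cut, data = toy, FUN = min)
max_price <- aggregate(price ~ cut, data = toy, FUN = max)

# Join on cut; the suffixes keep the two price columns apart
dumbbell_data <- merge(min_price, max_price, by = "cut",
                       suffixes = c("_min", "_max"))
colnames(dumbbell_data) <- c("cut", "min", "max")
dumbbell_data
```

Because merge() matches rows by the value of cut, it stays correct even if the two summaries were sorted differently, which cbind() does not guarantee.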

library(ggalt)

## Warning: package 'ggalt' was built under R version 3.6.3

## Registered S3 methods overwritten by 'ggalt':
##   method                  from   
##   grid.draw.absoluteGrob  ggplot2
##   grobHeight.absoluteGrob ggplot2
##   grobWidth.absoluteGrob  ggplot2
##   grobX.absoluteGrob      ggplot2
##   grobY.absoluteGrob      ggplot2

ggplot(dumbell_data, aes(y=cut, x=min, xend=max)) + 
  geom_dumbbell(size=3, color="black", 
                colour_x = "gold1", colour_xend = "darkred",
                dot_guide=TRUE, dot_guide_size=0.1)

No obvious difference in either the minimum or the maximum price is observed across cut levels.

Appearance and Extra Plots

Appearance

The appearance of ggplot objects can be improved using the themes in the ggplot2 package and other theme packages such as ggthemes and bbplot.

ggplot(df,aes(x=Var1,y=Freq,fill=Var1))+geom_bar(stat="identity",width=0.5)+
  labs(title="Bar Plot of Cut",y="Freq",x="Level")+scale_fill_manual(values=c("darkred","gold1","maroon","steelblue","darkblue"))+geom_text(aes(label=Freq),vjust=-0.25,fontface="bold")+theme_bw()

library(bbplot)
ggplot(df,aes(x=Var1,y=Freq,fill=Var1))+geom_bar(stat="identity",width=0.5)+
  labs(title="Bar Plot of Cut",y="Freq",x="Level")+scale_fill_manual(values=c("darkred","gold1","maroon","steelblue","darkblue"))+geom_text(aes(label=Freq),vjust=-0.25,fontface="bold")+bbc_style()

You can also use the theme() function for customization.

library(bbplot)
ggplot(df,aes(x=Var1,y=Freq,fill=Var1))+geom_bar(stat="identity",width=0.5)+ylim(0,25000)+
  labs(title="Bar Plot of Cut",y="Freq",x="Level")+scale_fill_manual(values=c("darkred","gold1","maroon","steelblue","darkblue"))+geom_text(aes(label=Freq),vjust=-0.25,fontface="bold")+bbc_style()+
  theme(plot.title = element_text(size=18),axis.text = element_text(size=12,face="bold"),legend.text = element_text(size=10))

Map in R

You can look at my tutorials if you are interested in drawing a map with ggplot2 and leaflet in R.

Drawing Turkey map with ggplot

Drawing Izmir map with leaflet

`dplyr` Package

Data sets contain a large amount of information, and it is usually not easy to extract the facts from the data. They are rarely in the desired form. That is why some data manipulation is usually needed, and sometimes it is a must.

The tidyverse and dplyr, one of its core packages, are quite useful for data manipulation and provide many useful functions.

dplyr	Description	SQL
select()	Select columns (variables)	SELECT
filter()	Filter (subset) rows	WHERE
group_by()	Group the data	GROUP BY
summarise()	Summarise (or aggregate) data	-
left_join(), inner_join(), etc.	Join data frames (tables)	JOIN
mutate()	Create new variables	COLUMN ALIAS
arrange()	Sort the data	ORDER BY

Application

To get some insight into the dplyr functions, we will use the flights data set from the nycflights13 package, which contains several useful data sets.

library(tidyverse) #calls dplyr automatically
library(nycflights13)
data(flights)
# looking into sample data
head(flights)

## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # ... with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

class(flights)

## [1] "tbl_df"     "tbl"        "data.frame"

A short note: a tibble is a modern reimagining of the data.frame.

Display flights in October 2013

filter(flights,month==10)

## # A tibble: 28,889 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    10     1      447            500       -13      614            648
##  2  2013    10     1      522            517         5      735            757
##  3  2013    10     1      536            545        -9      809            855
##  4  2013    10     1      539            545        -6      801            827
##  5  2013    10     1      539            545        -6      917            933
##  6  2013    10     1      544            550        -6      912            932
##  7  2013    10     1      549            600       -11      653            716
##  8  2013    10     1      550            600       -10      648            700
##  9  2013    10     1      550            600       -10      649            659
## 10  2013    10     1      551            600        -9      727            730
## # ... with 28,879 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Display flights on 11 November 2013

filter(flights,month==11 & day==11)

## # A tibble: 983 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    11    11      453            500        -7      631            651
##  2  2013    11    11      520            515         5      759            808
##  3  2013    11    11      545            545         0      852            855
##  4  2013    11    11      551            600        -9      851            854
##  5  2013    11    11      552            600        -8      809            810
##  6  2013    11    11      553            600        -7      749            756
##  7  2013    11    11      553            601        -8      754            811
##  8  2013    11    11      553            545         8      838            835
##  9  2013    11    11      554            600        -6      720            736
## 10  2013    11    11      555            600        -5      709            719
## # ... with 973 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Sort your data by arrival time (arr_time)

arrange(flights, (arr_time))

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     2     2130           2130         0        1             18
##  2  2013     1    11     2157           2000       117        1           2208
##  3  2013     1    11     2253           2249         4        1           2357
##  4  2013     1    14     2122           2130        -8        1              2
##  5  2013     1    14     2246           2250        -4        1              7
##  6  2013     1    15     2304           2245        19        1           2357
##  7  2013     1    16     2018           2025        -7        1           2329
##  8  2013     1    16     2303           2245        18        1           2357
##  9  2013     1    19     2107           2110        -3        1           2355
## 10  2013     1    22     2246           2249        -3        1           2357
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

# for descending order, use arrange(flights, desc(arr_time))
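
The effect of desc() is easiest to see on a few rows. A minimal sketch on hypothetical toy data (the flight labels and times are made up for illustration):

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- tibble(flight = c("A", "B", "C"), arr_time = c(830, 1004, 740))

arrange(toy, arr_time)        # ascending: C, A, B
arrange(toy, desc(arr_time))  # descending: B, A, C
```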

Sort your data by arrival time and carrier

arrange(flights, (arr_time), carrier)

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12    15     2155           2100        55        1           2318
##  2  2013     5    10     2051           1955        56        1           2253
##  3  2013     6    25     2112           1830       162        1           2101
##  4  2013     7    19     2059           2030        29        1           2245
##  5  2013    12    22     2052           1930        82        1           2235
##  6  2013     2    11     2119           1955        84        1           2310
##  7  2013     6    25     2056           1755       181        1           2120
##  8  2013     7     6     2149           2150        -1        1            100
##  9  2013     1     2     2130           2130         0        1             18
## 10  2013     1    11     2253           2249         4        1           2357
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Display departure time and destination

select(flights,dep_time,dest)

## # A tibble: 336,776 x 2
##    dep_time dest 
##       <int> <chr>
##  1      517 IAH  
##  2      533 IAH  
##  3      542 MIA  
##  4      544 BQN  
##  5      554 ATL  
##  6      554 ORD  
##  7      555 FLL  
##  8      557 IAD  
##  9      557 MCO  
## 10      558 ORD  
## # ... with 336,766 more rows

Create a new column called speed which is ratio between distance and air time

mutate(flights, speed = distance/air_time)

## # A tibble: 336,776 x 20
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 336,766 more rows, and 12 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## #   speed <dbl>

Calculate the average and median of arrival time and the frequency distribution of destination airports

summarise(flights,avg_arr_time=mean(arr_time,na.rm=T),med_arr_time=median(arr_time,na.rm=T))

## # A tibble: 1 x 2
##   avg_arr_time med_arr_time
##          <dbl>        <int>
## 1        1502.         1535

summarise(flights,dest,freq_dest=n()) # the data are not grouped, so n() returns the total number of rows for every dest

## # A tibble: 336,776 x 2
##    dest  freq_dest
##    <chr>     <int>
##  1 IAH      336776
##  2 IAH      336776
##  3 MIA      336776
##  4 BQN      336776
##  5 ATL      336776
##  6 ORD      336776
##  7 FLL      336776
##  8 IAD      336776
##  9 MCO      336776
## 10 ORD      336776
## # ... with 336,766 more rows
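
In the output above every row shows the total number of flights, because n() counts the rows of the current group and the data were not grouped. To obtain one frequency per destination, group by dest first, or use count(), which does both steps at once. A minimal sketch on hypothetical toy data:

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- tibble(dest = c("IAH", "IAH", "MIA", "ORD", "ORD", "ORD"))

# Group first, then count the rows in each group
freq1 <- summarise(group_by(toy, dest), freq_dest = n())

# count() is a shortcut for the same operation
freq2 <- count(toy, dest, name = "freq_dest")
freq2
```

Applied to the flights data, the same idea would be count(flights, dest, name = "freq_dest").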

Calculate the average and median of arrival time and departure time

summarise_at(flights,vars(arr_time,dep_time), list(~mean(.,na.rm=T),~median(.,na.rm=T)))

## # A tibble: 1 x 4
##   arr_time_mean dep_time_mean arr_time_median dep_time_median
##           <dbl>         <dbl>           <int>           <int>
## 1         1502.         1349.            1535            1401
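
In dplyr 1.0.0 and later, summarise_at() is superseded by across(). A sketch of the equivalent call on hypothetical toy data; the result columns are named column_function, and they come out grouped by variable rather than by function.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- tibble(arr_time = c(830, 850, NA, 1004),
              dep_time = c(517, 533, 542, NA))

res <- summarise(
  toy,
  across(c(arr_time, dep_time),
         list(mean = ~ mean(.x, na.rm = TRUE),
              median = ~ median(.x, na.rm = TRUE)))
)
res
```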

The summarise() function is usually used together with the group_by() function to get more insight.

Calculate the average arrival time for each month

summarise(group_by(flights,month),avg_arr_time_month=mean(arr_time,na.rm=T))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 12 x 2
##    month avg_arr_time_month
##    <int>              <dbl>
##  1     1              1523.
##  2     2              1522.
##  3     3              1510.
##  4     4              1501.
##  5     5              1503.
##  6     6              1468.
##  7     7              1456.
##  8     8              1495.
##  9     9              1504.
## 10    10              1520.
## 11    11              1523.
## 12    12              1505.

Pipe (%>%) Operator

This operator comes from the magrittr package and is used throughout the tidyverse, e.g. in dplyr. It forwards a value, or the result of an expression, into the next function call as its first argument.

Calculate the average arrival time for each month

flights%>%group_by(month)%>%summarise(avg_arr_time_month=mean(arr_time,na.rm=T))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 12 x 2
##    month avg_arr_time_month
##    <int>              <dbl>
##  1     1              1523.
##  2     2              1522.
##  3     3              1510.
##  4     4              1501.
##  5     5              1503.
##  6     6              1468.
##  7     7              1456.
##  8     8              1495.
##  9     9              1504.
## 10    10              1520.
## 11    11              1523.
## 12    12              1505.
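
The pipe simply rewrites a nested call: x %>% f(y) is the same as f(x, y). The sketch below checks this on hypothetical toy data by comparing the nested form used earlier with the piped form.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- tibble(month = c(1, 1, 2, 2), arr_time = c(830, 850, NA, 1004))

nested <- summarise(group_by(toy, month),
                    avg_arr_time_month = mean(arr_time, na.rm = TRUE))

piped <- toy %>%
  group_by(month) %>%
  summarise(avg_arr_time_month = mean(arr_time, na.rm = TRUE))

identical(nested, piped)  # both forms give the same tibble
```

The piped form reads from left to right in the order the steps are carried out, which is why it is usually preferred for longer chains.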
