Exploratory Data Analysis II
Before starting, please install the ggplot2, tidyverse, ggalt, and mlbench packages.
install.packages(c("ggplot2","tidyverse","mlbench","ggalt"))
ggplot2 Package
ggplot2, developed by H.Wickham, is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics.
For more information about this package, you can visit http://ggplot2.org/
There are also many videos, books, and web pages about it.
The name ggplot2 (often shortened to ggplot) refers to the “Grammar of Graphics”, the principle that a plot can be split into the following basic parts:
Plot = data + Aesthetics + Geometry
data refers to the information you want to visualize.
Aesthetics specifies the variables used in the drawing, i.e. the x and y variables. It also tells R how the data are displayed in the plot, e.g. the color, size, and shape of points.
Geometry refers to the type of graphic (bar chart, histogram, box plot, line plot, density plot, dot plot, etc.). For the full list of geometric functions, please visit https://ggplot2.tidyverse.org/reference/
Some functions from that list are shown here.
geom_point() = Scatter Plot
geom_bar() = Bar Plot
geom_line() = Line Plot
geom_histogram() = Histogram
geom_boxplot() = Box Plot
geom_density() = Density Plot
For example:
library(ggplot2)
ggplot(data,aes(x=x,y=y))+geom_point()
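The "data + aesthetics + geometry" decomposition can be tried out on a tiny data frame. The sketch below uses a made-up data frame (`toy`, with hypothetical columns `height` and `weight`) that is not part of the lab data; it only illustrates how layers are added with `+`.

```r
library(ggplot2)

# Hypothetical toy data, made up for illustration only
toy <- data.frame(
  height = c(150, 160, 170, 180, 190),
  weight = c(52, 58, 69, 77, 88)
)

# data + aesthetics: nothing is drawn yet, only the axes are set up
base <- ggplot(toy, aes(x = height, y = weight))

# + geometry: each geom adds one layer on top of the same data and aesthetics
p <- base + geom_point() + geom_line()
p
```

Printing `base` alone shows an empty panel, which makes it easy to see what each geometry contributes.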
In addition to those functions, we use the following arguments or functions for our plots.
**Functions**
geom_text() = Add label or number on your plot
coord_flip() = Rotate your plot
theme() = Arrange the theme of your plot, e.g size of axis names etc.
facet_wrap() & facet_grid() = Draw a separate panel for each subset of your data
scale_color_manual() = Change the color of your plot manually.
labs() = Set title, axis name etc.
**Arguments**
col = Change color of your plot by third variable (in aesthetics part)
group = Split your data into groups by a third variable (in aesthetics part)
color = Change frame of your box / bar or bin (in geom part)
fill = Fill your box / bar or bin (in geom part)
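The difference between using these arguments in the aesthetics part and in the geom part is easiest to see side by side. The sketch below uses a made-up data frame (`toy`) and `geom_col()`, which is ggplot2's shortcut for `geom_bar(stat="identity")`; the object names are hypothetical.

```r
library(ggplot2)

# Hypothetical toy data, made up for illustration only
toy <- data.frame(
  group = c("a", "b", "c"),
  count = c(3, 7, 5)
)

# fill inside aes(): the fill colour is mapped to a variable, so each bar
# gets its own colour and a legend is drawn
p_mapped <- ggplot(toy, aes(x = group, y = count, fill = group)) +
  geom_col()

# fill and color inside the geom: fixed values applied to every bar
# (fill = inside of the bar, color = its frame), no legend
p_fixed <- ggplot(toy, aes(x = group, y = count)) +
  geom_col(fill = "steelblue", color = "black")
```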
Important Note
The ggplot2 package works with data.frame and tibble objects.
Why is ggplot2 better?
Excellent themes can be created with a single command.
Its colors are nicer than those of the usual graphics.
It is easy to visualize data with multiple variables.
It provides a platform for creating simple graphs that convey a plethora of information.
(Ozdemir, O, Lab Notes, 2019)
Application
In this class, the diamonds data set, used last week, will be used again.
First, read the data set.
diamonds = read.table("diamonds.txt", header=T,sep=",")
head(diamonds)
## carat cut color clarity depth table price x y z
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
str(diamonds)
## 'data.frame': 53940 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Factor w/ 5 levels "Fair","Good",..: 3 4 2 4 2 5 5 5 1 5 ...
## $ color : Factor w/ 7 levels "D","E","F","G",..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Factor w/ 8 levels "I1","IF","SI1",..: 4 3 5 6 4 8 7 3 6 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
What is the frequency distribution of CUT?
t = table(diamonds$cut)
class(t)
## [1] "table"
df = data.frame(t)
df
## Var1 Freq
## 1 Fair 1610
## 2 Good 4906
## 3 Ideal 21551
## 4 Premium 13791
## 5 Very Good 12082
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.3
ggplot(df,aes(x=Var1,y=Freq))+geom_bar(stat="identity")
# stat="identity" is required here: the bar heights come from the Freq column instead of being counted
ggplot(df,aes(x=Var1,y=Freq,fill=Var1))+geom_bar(stat="identity")+
labs(title="Bar Plot of CUT",y="Freq",x="Level")
ggplot(df,aes(x=Var1,y=Freq,fill=Var1))+geom_bar(stat="identity")+
labs(title="Bar Plot of CUT",y="Freq",x="Level")+geom_text(aes(label=Freq))
# the label aesthetic is required by geom_text()
ggplot(df,aes(x=Var1,y=Freq,fill=Var1))+geom_bar(stat="identity")+
labs(title="Bar Plot of CUT",y="Freq",x="Level")+geom_text(aes(label=Freq),vjust=-0.25,fontface="bold")
# the label aesthetic is required by geom_text(); vjust=-0.25 moves the text just above the bar
Ordering x axis
Use the reorder() function.
ggplot(df,aes(x=reorder(Var1,Freq),y=Freq,fill=Var1))+geom_bar(stat="identity")+
labs(title="Bar Plot of CUT",y="Freq",x="Level")+geom_text(aes(label=Freq),vjust=-0.25,fontface="bold")
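What reorder() does to the x axis can be checked without drawing anything, because it only changes the order of the factor levels. The sketch below rebuilds the frequency table printed earlier by hand (the names `asc` and `desc_levels` are hypothetical) and uses base R only.

```r
# Frequencies copied from the table printed above
df <- data.frame(
  Var1 = factor(c("Fair", "Good", "Ideal", "Premium", "Very Good")),
  Freq = c(1610, 4906, 21551, 13791, 12082)
)

# default factor levels are alphabetical
levels(df$Var1)

# reorder() sorts the levels by the second argument, smallest first
asc <- levels(reorder(df$Var1, df$Freq))

# negate the values to put the largest bar first
desc_levels <- levels(reorder(df$Var1, -df$Freq))
```

Using `x=reorder(Var1,-Freq)` in aes() would therefore draw the bars from the most to the least frequent cut.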
What is the distribution of carat?
class(diamonds$carat)
## [1] "numeric"
ggplot(diamonds,aes(x=carat))+geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(diamonds,aes(x=carat))+geom_histogram(fill="red")+labs(title="Histogram of Carat",y="Count",x="Carat")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
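The message above asks for an explicit bin choice. A minimal sketch of the two usual options, `binwidth` and `bins`, follows; it uses simulated values in place of carat (the data frame `toy` is made up and is not the diamonds data), and the value 0.1 is only an illustrative choice.

```r
library(ggplot2)

# Simulated values standing in for carat; they are not the diamonds data
set.seed(1)
toy <- data.frame(carat = round(rgamma(1000, shape = 2, scale = 0.4), 2))

# binwidth is given in the units of the x variable (here 0.1 per bin)
p_width <- ggplot(toy, aes(x = carat)) +
  geom_histogram(binwidth = 0.1, fill = "red", color = "white")

# alternatively, fix the number of bins
p_bins <- ggplot(toy, aes(x = carat)) +
  geom_histogram(bins = 50, fill = "red")
```

Either argument silences the message; trying a few values is usually needed before the shape of the distribution is clear.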
ggplot(diamonds,aes(x=carat))+geom_density()+labs(title="Density Plot of Carat",y="Prob",x="Carat")
The density plot shows that carat has a multimodal distribution.
ggplot(diamonds,aes(x=carat))+geom_histogram(fill="red",aes(y=stat(density)))+labs(title="Histogram and Density Plot of Carat",y="Density",x="Carat")+geom_density(col="yellow")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
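# aes(y=stat(density)) is required so that the histogram is drawn on the density scale and matches geom_density(); newer ggplot2 versions spell this after_stat(density)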
g1<-ggplot(diamonds,aes(x=carat))+geom_boxplot()+labs(title="g1")
g2<-ggplot(diamonds,aes(x=factor(1),y=carat))+geom_boxplot()+labs(title="g2") # factor(1) supplies a single dummy group for the x axis
library(gridExtra)
grid.arrange(g1,g2,ncol=2)
ggplot(diamonds,aes(x=factor(1),y=carat))+geom_boxplot(fill="darkred")+labs(title="Box Plot of Carat")
What is the association between carat and price?
ggplot(diamonds,aes(x=carat,y=price))+geom_point()+labs(title = "The relationship between Carat and Price")
ggplot(diamonds,aes(x=carat,y=price))+geom_point(col="darkred")+labs(title = "The relationship between Carat and Price")
Adding trend line
ggplot(diamonds,aes(x=carat,y=price))+geom_point(col="darkred")+labs(title = "The relationship between Carat and Price")+geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Price and carat have a strong positive relationship up to about 3 carats. Beyond that, the relationship weakens.
More and More
What is the association between carat and price by cut type?
Facet in ggplot
Generating a plot for each level of a factor.
ggplot(diamonds,aes(x=carat,y=price))+geom_point()+facet_wrap(.~cut)
# the formula names the faceting variable; ~cut works as well as .~cut
What is the association between carat and price by depth and cut?
Bubble Plot: A bubble plot is a scatter plot with a third numeric variable mapped to circle size. It also enables us to include a categorical variable as a fourth one, here through color.
To explain the concept, I would like to use a sample of the data.
set.seed(123)
s = diamonds[sample(1:53940,50),]
ggplot(s,aes(x=carat,y=price,size=depth,col=cut))+geom_point()+labs(title="Association between Carat and Price by Cut and Depth")
The plot suggests that as depth and cut quality increase, we can expect higher carat and price.
How does depth distribute by clarity?
ggplot(diamonds,aes(x=clarity,y=depth))+geom_point()+labs(title="Relationship between Clarity and Depth")
Such a plot suffers from overplotting, which makes interpretation harder. It is a common problem in data sets with a large number of observations (this data set has 53940).
There are several suggested solutions for this problem, and one of them is jittering.
Jitter Plot: Random noise is added to the location of each point to reduce overplotting.
ggplot(diamonds,aes(x=clarity,y=depth))+geom_point(position="jitter")+labs(title="Relationship between Clarity and Depth")
# position="jitter" is what adds the random noise
To see the effect of jittering more clearly, let us use the sample data set. Using a sample of the data is itself one of the solutions to overplotting.
ggplot(s,aes(x=clarity,y=depth))+geom_point()+labs(title="Relationship between Clarity and Depth for Sample Data Set")
ggplot(s,aes(x=clarity,y=depth))+geom_point(position="jitter")+labs(title="Relationship between Clarity and Depth for Sample Data Set")
Both jitter plots show that it is hard to observe a relationship between the variables of interest.
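The amount of noise added by jittering can also be controlled. The sketch below is one possible setup, using position_jitter() with made-up data (`toy`, simulated and not the diamonds data); the width, alpha, and seed values are arbitrary illustrative choices.

```r
library(ggplot2)

# Hypothetical toy data: a categorical x and a numeric y with many ties
set.seed(42)
toy <- data.frame(
  clarity = sample(c("SI1", "VS1", "IF"), 300, replace = TRUE),
  depth   = round(rnorm(300, mean = 61.7, sd = 1.4), 1)
)

# width jitters only along the categorical axis; height = 0 keeps the
# depth values exact; seed makes the random noise reproducible
p <- ggplot(toy, aes(x = clarity, y = depth)) +
  geom_point(position = position_jitter(width = 0.2, height = 0, seed = 1),
             alpha = 0.4)
```

Setting alpha makes the points semi-transparent, so dense regions appear darker, which is another common remedy for overplotting.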
More and More
How do diamond prices distribute over cut type?
Violin Plot: Violin plots are similar to box plots, except that they also show the kernel probability density of the data at different values.
ggplot(diamonds,aes(x=cut,y=price))+geom_violin()
ggplot(diamonds,aes(x=cut,y=price,fill=cut))+geom_violin()+geom_boxplot(width=0.15)+labs(title="Distribution of Price by Cut")
# width sets the width of the box plot
It is seen that price has a right-skewed distribution for every cut except Good. Also, the price of Ideal cut diamonds is lower than the others on average.
Extra
statsExpressions package
This package helps you print the output of a statistical test on the related plot. It is applicable to many types of plots.
For other examples, see the package documentation.
library(ggplot2)
library(ggforce)
## Warning: package 'ggforce' was built under R version 3.6.3
library(statsExpressions)
## Warning: package 'statsExpressions' was built under R version 3.6.3
# plot with subtitle
ggplot(diamonds,aes(x=cut,y=price)) +
geom_violin() +geom_boxplot(width=0.15)+
labs(
title = "Fisher's one-way ANOVA",
subtitle = oneway_anova(diamonds, cut, price, var.equal = TRUE)$expression[[1]]
)
Consider the first research question and answer it with a different visual.
What is the frequency distribution of CUT?
Lollipop Chart: A lollipop plot is basically a bar plot in which each bar is replaced by a line and a dot. It shows the relationship between a numeric and a categorical variable. A lollipop is built using geom_point() for the circle and geom_segment() for the stem.
t = table(diamonds$cut)
df = data.frame(t)
df
## Var1 Freq
## 1 Fair 1610
## 2 Good 4906
## 3 Ideal 21551
## 4 Premium 13791
## 5 Very Good 12082
ggplot(df,aes(x=Var1,y=Freq)) +geom_point() + geom_segment(aes(x=Var1, xend=Var1, y=0, yend=Freq))
ggplot(df,aes(x=Var1,y=Freq)) +geom_point(size=5, color="red", fill="yellow", alpha=0.7, shape=21, stroke=2) +geom_segment(aes(x=Var1, xend=Var1, y=0, yend=Freq))+labs(title="Lollipop Plot of Cut",x="Cut Types",y="Frequency")
What is the change in the prices within each cut type?
Dumbbell Plot: A dumbbell plot, a.k.a. dumbbell chart, is great for displaying changes between two points in time, two conditions, or differences between two groups.
Before drawing this plot, your data set should be prepared for it.
Data Manipulation
min_price = aggregate(price~cut,data=diamonds,min)
max_price = aggregate(price~cut,data=diamonds,max)
dumbell_data =cbind(min_price,max_price)
dumbell_data
## cut price cut price
## 1 Fair 337 Fair 18574
## 2 Good 327 Good 18788
## 3 Ideal 326 Ideal 18806
## 4 Premium 326 Premium 18823
## 5 Very Good 336 Very Good 18818
This is not enough: the cut column appears twice and both price columns have the same name.
dumbell_data = dumbell_data[,-3]
colnames(dumbell_data) = c("cut","min","max")
dumbell_data
## cut min max
## 1 Fair 337 18574
## 2 Good 327 18788
## 3 Ideal 326 18806
## 4 Premium 326 18823
## 5 Very Good 336 18818
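The same table can be built in one step with merge(), which matches rows on the cut column instead of binding columns side by side, so no column has to be dropped or renamed afterwards. This is only a sketch on a made-up data frame (`toy`); the suffixes are an arbitrary naming choice.

```r
# Hypothetical toy data, made up for illustration only
toy <- data.frame(
  cut   = c("Fair", "Fair", "Good", "Good", "Good"),
  price = c(400, 900, 350, 700, 1200)
)

min_price <- aggregate(price ~ cut, data = toy, FUN = min)
max_price <- aggregate(price ~ cut, data = toy, FUN = max)

# merge() joins on the shared key and renames the clashing price columns
dumbbell_toy <- merge(min_price, max_price, by = "cut",
                      suffixes = c("_min", "_max"))
dumbbell_toy
```

Unlike cbind(), merge() does not rely on both tables having their rows in the same order.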
library(ggalt)
## Warning: package 'ggalt' was built under R version 3.6.3
## Registered S3 methods overwritten by 'ggalt':
## method from
## grid.draw.absoluteGrob ggplot2
## grobHeight.absoluteGrob ggplot2
## grobWidth.absoluteGrob ggplot2
## grobX.absoluteGrob ggplot2
## grobY.absoluteGrob ggplot2
ggplot(dumbell_data, aes(y=cut, x=min, xend=max)) +
geom_dumbbell(size=3, color="black",
colour_x = "gold1", colour_xend = "darkred",
dot_guide=TRUE, dot_guide_size=0.1)
No obvious difference in either the minimum or the maximum prices across cut levels is observed.
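If ggalt is not available, a similar chart can be assembled from plain ggplot2 layers. The sketch below is an approximation and not the ggalt implementation: it draws the bar with geom_segment() and the two ends with geom_point(), using the minimum and maximum prices printed in the table above (the name `dumbbell_df` is hypothetical).

```r
library(ggplot2)

# Minimum and maximum prices per cut, copied from the table above
dumbbell_df <- data.frame(
  cut = c("Fair", "Good", "Ideal", "Premium", "Very Good"),
  min = c(337, 327, 326, 326, 336),
  max = c(18574, 18788, 18806, 18823, 18818)
)

p <- ggplot(dumbbell_df, aes(y = cut)) +
  # the bar of the dumbbell
  geom_segment(aes(x = min, xend = max, yend = cut), color = "black") +
  # the two ends
  geom_point(aes(x = min), color = "gold1", size = 3) +
  geom_point(aes(x = max), color = "darkred", size = 3) +
  labs(x = "Price", y = "Cut")
p
```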
Appearance and Extra Plots
Appearance
The appearance of ggplot objects can be improved using the themes in the ggplot2 package and other theme packages such as ggthemes and bbplot.
ggplot(df,aes(x=Var1,y=Freq,fill=Var1))+geom_bar(stat="identity",width=0.5)+
labs(title="Bar Plot of Cut",y="Freq",x="Level")+scale_fill_manual(values=c("darkred","gold1","maroon","steelblue","darkblue"))+geom_text(aes(label=Freq),vjust=-0.25,fontface="bold")+theme_bw()
library(bbplot)
ggplot(df,aes(x=Var1,y=Freq,fill=Var1))+geom_bar(stat="identity",width=0.5)+
labs(title="Bar Plot of Cut",y="Freq",x="Level")+scale_fill_manual(values=c("darkred","gold1","maroon","steelblue","darkblue"))+geom_text(aes(label=Freq),vjust=-0.25,fontface="bold")+bbc_style()
You can also use the theme() function for customization.
library(bbplot)
ggplot(df,aes(x=Var1,y=Freq,fill=Var1))+geom_bar(stat="identity",width=0.5)+ylim(0,25000)+
labs(title="Bar Plot of Cut",y="Freq",x="Level")+scale_fill_manual(values=c("darkred","gold1","maroon","steelblue","darkblue"))+geom_text(aes(label=Freq),vjust=-0.25,fontface="bold")+bbc_style()+
theme(plot.title = element_text(size=18),axis.text = element_text(size=12,face="bold"),legend.text = element_text(size=10))
Map in R
You can look at my tutorials if you are interested in drawing a map with ggplot2 and leaflet in R.
dplyr Package
Data sets contain large amounts of information, and it is usually not easy to extract the facts from the data. They are rarely in the desired form. That is why some data manipulation is needed; sometimes it is a must.
The tidyverse package and the dplyr package, which is part of the tidyverse, are quite useful for data manipulation and provide useful functions.
dplyr | Description | SQL |
---|---|---|
select() | Selecting columns (variables) | SELECT |
filter() | Filter (subset) rows. | WHERE |
group_by() | Group the data | GROUP BY |
summarise() | Summarise (or aggregate) data | - |
left_join() and related joins | Joining data frames (tables) | JOIN |
mutate() | Creating New Variables | COLUMN ALIAS |
arrange() | Sort the data | ORDER BY |
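Joining is the one operation in the table that is not demonstrated later in these notes, so here is a minimal sketch with left_join(). Both tables are made up for illustration (`flights_toy` and `airlines_toy` are hypothetical names, not objects from nycflights13).

```r
library(dplyr)

# Hypothetical toy tables, made up for illustration only
flights_toy <- data.frame(
  carrier = c("AA", "UA", "AA", "DL"),
  delay   = c(5, 12, -3, 0)
)
airlines_toy <- data.frame(
  carrier = c("AA", "DL", "UA"),
  name    = c("American", "Delta", "United")
)

# left_join() keeps every row of the left table and adds the matching
# columns of the right table, matched on the carrier column
joined <- left_join(flights_toy, airlines_toy, by = "carrier")
joined
```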
Application
To gain insight into the dplyr functions, we will use the flights data set from the nycflights13 package, which contains several useful data sets.
library(tidyverse) #calls dplyr automatically
library(nycflights13)
data(flights)
# looking into sample data
head(flights)
## # A tibble: 6 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ... with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
class(flights)
## [1] "tbl_df" "tbl" "data.frame"
A short note: a tibble is a modern reimagining of the data.frame.
Display flights in October 2013
filter(flights,month==10)
## # A tibble: 28,889 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 10 1 447 500 -13 614 648
## 2 2013 10 1 522 517 5 735 757
## 3 2013 10 1 536 545 -9 809 855
## 4 2013 10 1 539 545 -6 801 827
## 5 2013 10 1 539 545 -6 917 933
## 6 2013 10 1 544 550 -6 912 932
## 7 2013 10 1 549 600 -11 653 716
## 8 2013 10 1 550 600 -10 648 700
## 9 2013 10 1 550 600 -10 649 659
## 10 2013 10 1 551 600 -9 727 730
## # ... with 28,879 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Display flights on 11 November 2013
filter(flights,month==11 & day==11)
## # A tibble: 983 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 11 11 453 500 -7 631 651
## 2 2013 11 11 520 515 5 759 808
## 3 2013 11 11 545 545 0 852 855
## 4 2013 11 11 551 600 -9 851 854
## 5 2013 11 11 552 600 -8 809 810
## 6 2013 11 11 553 600 -7 749 756
## 7 2013 11 11 553 601 -8 754 811
## 8 2013 11 11 553 545 8 838 835
## 9 2013 11 11 554 600 -6 720 736
## 10 2013 11 11 555 600 -5 709 719
## # ... with 973 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Sort your data by arrival time (arr_time)
arrange(flights, (arr_time))
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 2 2130 2130 0 1 18
## 2 2013 1 11 2157 2000 117 1 2208
## 3 2013 1 11 2253 2249 4 1 2357
## 4 2013 1 14 2122 2130 -8 1 2
## 5 2013 1 14 2246 2250 -4 1 7
## 6 2013 1 15 2304 2245 19 1 2357
## 7 2013 1 16 2018 2025 -7 1 2329
## 8 2013 1 16 2303 2245 18 1 2357
## 9 2013 1 19 2107 2110 -3 1 2355
## 10 2013 1 22 2246 2249 -3 1 2357
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# for descending order, use desc(arr_time)
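A small sketch of descending order with desc(), combined with a second sort key to break ties. The data frame `toy` is made up for illustration and is not the flights data.

```r
library(dplyr)

# Hypothetical toy data, made up for illustration only
toy <- data.frame(
  carrier  = c("UA", "AA", "DL", "AA"),
  arr_time = c(830, 1004, 830, 740)
)

# latest arrival first; ties in arr_time are broken by carrier (A to Z)
sorted <- arrange(toy, desc(arr_time), carrier)
sorted
```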
Sort your data by arrival time and carrier
arrange(flights, (arr_time), carrier)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 12 15 2155 2100 55 1 2318
## 2 2013 5 10 2051 1955 56 1 2253
## 3 2013 6 25 2112 1830 162 1 2101
## 4 2013 7 19 2059 2030 29 1 2245
## 5 2013 12 22 2052 1930 82 1 2235
## 6 2013 2 11 2119 1955 84 1 2310
## 7 2013 6 25 2056 1755 181 1 2120
## 8 2013 7 6 2149 2150 -1 1 100
## 9 2013 1 2 2130 2130 0 1 18
## 10 2013 1 11 2253 2249 4 1 2357
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Display departure time and destination
select(flights,dep_time,dest)
## # A tibble: 336,776 x 2
## dep_time dest
## <int> <chr>
## 1 517 IAH
## 2 533 IAH
## 3 542 MIA
## 4 544 BQN
## 5 554 ATL
## 6 554 ORD
## 7 555 FLL
## 8 557 IAD
## 9 557 MCO
## 10 558 ORD
## # ... with 336,766 more rows
Create a new column called speed, which is the ratio of distance to air time
mutate(flights, speed = distance/air_time)
## # A tibble: 336,776 x 20
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ... with 336,766 more rows, and 12 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## # speed <dbl>
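In nycflights13, distance is recorded in miles and air_time in minutes, so the ratio above is in miles per minute. The sketch below shows one way to also get miles per hour, on a small made-up data frame (`toy`); the column name `speed_mph` is a hypothetical choice.

```r
library(dplyr)

# Hypothetical toy data with the same units as flights
toy <- data.frame(
  distance = c(1400, 1089, 762),   # miles
  air_time = c(227, 160, 116)      # minutes
)

toy <- mutate(
  toy,
  speed     = distance / air_time,  # miles per minute
  speed_mph = speed * 60            # miles per hour, built from the column just created
)
toy
```

mutate() evaluates its arguments in order, which is why `speed_mph` can refer to `speed` inside the same call.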
Calculate the average and median of arrival time and the frequency distribution of destination airports
summarise(flights,avg_arr_time=mean(arr_time,na.rm=T),med_arr_time=median(arr_time,na.rm=T))
## # A tibble: 1 x 2
## avg_arr_time med_arr_time
## <dbl> <int>
## 1 1502. 1535
summarise(flights,dest,freq_dest=n())
## # A tibble: 336,776 x 2
## dest freq_dest
## <chr> <int>
## 1 IAH 336776
## 2 IAH 336776
## 3 MIA 336776
## 4 BQN 336776
## 5 ATL 336776
## 6 ORD 336776
## 7 FLL 336776
## 8 IAD 336776
## 9 MCO 336776
## 10 ORD 336776
## # ... with 336,766 more rows
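Note that the summarise(flights,dest,freq_dest=n()) call above is not grouped, so n() counts all 336,776 rows and that total is repeated on every row; it is not a frequency per destination. A minimal sketch of two ways to get one count per destination follows, using a made-up data frame (`toy`) so that the result is easy to check by eye.

```r
library(dplyr)

# Hypothetical toy data, made up for illustration only
toy <- data.frame(dest = c("IAH", "IAH", "MIA", "ATL", "MIA", "IAH"))

# Option 1: group first, then count the rows in each group
freq_1 <- toy %>%
  group_by(dest) %>%
  summarise(freq_dest = n())

# Option 2: count() is a shortcut for the same operation
freq_2 <- count(toy, dest, name = "freq_dest")
freq_2
```

The same two patterns should apply to flights by replacing `toy` with `flights`.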
Calculate the average and median of arrival time and departure time
summarise_at(flights,vars(arr_time,dep_time), list(~mean(.,na.rm=T),~median(.,na.rm=T)))
## # A tibble: 1 x 4
## arr_time_mean dep_time_mean arr_time_median dep_time_median
## <dbl> <dbl> <int> <int>
## 1 1502. 1349. 1535 1401
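In dplyr 1.0.0 and later, the same kind of summary can be written with across() inside summarise(). The sketch below assumes such a dplyr version and uses a made-up data frame (`toy`); the output columns are named from the column name and the list names.

```r
library(dplyr)

# Hypothetical toy data with some missing values
toy <- data.frame(
  arr_time = c(830, 850, NA, 1004),
  dep_time = c(517, 533, 542, NA)
)

# apply every function in the list to every selected column
res <- summarise(
  toy,
  across(
    c(arr_time, dep_time),
    list(mean   = ~ mean(.x, na.rm = TRUE),
         median = ~ median(.x, na.rm = TRUE))
  )
)
res
```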
The summarise() function is usually used with the group_by() function to get more insight.
Calculate the average arrival time for each month
summarise(group_by(flights,month),avg_arr_time_month=mean(arr_time,na.rm=T))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 12 x 2
## month avg_arr_time_month
## <int> <dbl>
## 1 1 1523.
## 2 2 1522.
## 3 3 1510.
## 4 4 1501.
## 5 5 1503.
## 6 6 1468.
## 7 7 1456.
## 8 8 1495.
## 9 9 1504.
## 10 10 1520.
## 11 11 1523.
## 12 12 1505.
Piping (%>%) Operator
The %>% operator comes from the magrittr package and is made available by tidyverse packages such as dplyr. It forwards a value, or the result of an expression, into the next function call as its first argument. (ggplot2 layers are still combined with +.)
Calculate the average arrival time for each month
flights%>%group_by(month)%>%summarise(avg_arr_time_month=mean(arr_time,na.rm=T))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 12 x 2
## month avg_arr_time_month
## <int> <dbl>
## 1 1 1523.
## 2 2 1522.
## 3 3 1510.
## 4 4 1501.
## 5 5 1503.
## 6 6 1468.
## 7 7 1456.
## 8 8 1495.
## 9 9 1504.
## 10 10 1520.
## 11 11 1523.
## 12 12 1505.
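The piped and the nested forms are two spellings of the same computation. The sketch below checks this on a small made-up data frame (`toy`); the object names `nested` and `piped` are hypothetical.

```r
library(dplyr)

# Hypothetical toy data, made up for illustration only
toy <- data.frame(
  month    = c(1, 1, 2, 2, 2),
  arr_time = c(830, 850, NA, 1004, 812)
)

# nested call: read from the inside out
nested <- summarise(group_by(toy, month), avg = mean(arr_time, na.rm = TRUE))

# piped call: read from left to right, one step per verb
piped <- toy %>%
  group_by(month) %>%
  summarise(avg = mean(arr_time, na.rm = TRUE))

# compare the two results
all.equal(nested, piped)
```

The pipe does not change what is computed; it only makes longer chains of verbs easier to read.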