Exploratory Data Analysis I
Exploratory Data Analysis
What is Data? and Data Types
Data is the plural of datum, which is an abstraction.
A datum is a single quantity or quality of a real-world entity such as a person, an object, or an event.
A data set consists of the data related to a collection of entities which are described in terms of a set of attributes.
Both data and data set can be categorized into several groups.
Data types are an important concept in statistics: you need to understand them to apply statistical measurements correctly to your data and to draw correct conclusions about it.
Understanding data types is a prerequisite for exploratory data analysis, which is one of the most important parts of a data analysis project.
In general, we can divide data by attribute type into:
Quantitative data deals with numbers and things you can measure objectively: dimensions such as height, width, and length.
Qualitative data deals with characteristics and descriptors that can't be easily measured, but can be observed subjectively: smells, tastes.
These two data attributes have subgroups.
Quantitative Attribute
Discrete Data: Data are discrete if their values are distinct and separate, i.e., the data can take on only certain values. This type of data cannot be measured, but it can be counted. For example, the number of defective items in a box or the number of children in a household.
Continuous Data: Continuous Data represents measurements and therefore their values cannot be counted but they can be measured. For example, temperature, height, or weight.
Interval Data: Interval values represent ordered units that have the same difference. Therefore we speak of interval data when we have a variable that contains numeric values that are ordered and where we know the exact differences between the values. It can be either discrete or continuous.
The limitation of interval data is that zero has no real meaning: it does not indicate the absence of the quantity. That is why ratios of interval values are not meaningful and some descriptive and inferential statistics (for example, the coefficient of variation) cannot be applied. For example, temperature in degrees Celsius.
Ratio Data: Ratio values are also ordered units that have the same difference. Ratio values are the same as interval values, with the difference that they do have an absolute zero. In other words, zero has a real meaning. For example, age and distance.
Qualitative Attribute
Nominal Data: Nominal values represent discrete units and are used to label variables that have no quantitative value. Just think of them as labels. Note that nominal data has no order. For example, Gender: Male, Female.
Binary Data:
Ordinal Data: Ordinal values represent discrete and ordered units. As you would guess from its name, the order matters. For example,
1- Totally disagree.
2- Disagree.
3- Neither agree nor disagree.
4- Agree.
5- Totally Agree.
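The difference between nominal and ordinal data can be sketched in R with factors. This is a minimal illustration; the vectors `gender` and `answers` are made-up examples, not part of the lab's data.

```r
# Hypothetical survey answers; names and values are illustrative only.
gender <- factor(c("Male", "Female", "Female", "Male"))      # nominal: labels without order

answer_levels <- c("Totally disagree", "Disagree",
                   "Neither agree nor disagree", "Agree", "Totally agree")
answers <- factor(c("Agree", "Disagree", "Totally agree", "Agree"),
                  levels = answer_levels, ordered = TRUE)     # ordinal: labels with order

is.ordered(gender)    # FALSE
is.ordered(answers)   # TRUE
answers >= "Agree"    # comparisons are only defined for ordered factors
table(answers)        # counts are listed in level order, not alphabetically
```

Declaring the order explicitly matters because R otherwise sorts factor levels alphabetically, which rarely matches the intended ranking.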
We can also categorize data set types into several groups. For example,
Cross-Sectional: It is a collection of observations (behaviour) for multiple subjects (entities) at a single point in time.
Time Series: It is a collection of observations (behaviour) for a single subject (entity) at different time intervals (generally equally spaced).
Panel Data: It is often called cross-sectional time-series data because it is a combination of the two types above, i.e., a collection of observations for multiple subjects at different time points.
Circular Data:
Kahoot Time
Exploratory Data Analysis (EDA)
EDA is the process of exploring your data set through techniques such as visualization and transformation.
It is an iterative cycle:
Generate questions about your data, a step called developing research questions (RQs).
Search for answers by visualizing, transforming, and modelling your data.
Use what you learn to refine your questions and/or generate new questions.
The main purpose of EDA is to understand your data and to find patterns that guide your upcoming analysis. In this process, a researcher should feel free to investigate every idea. In other words, EDA does not have a fixed frame or set of rules. Some of your ideas will lead nowhere and some will survive, which is what makes it an iterative process.
The easiest way of doing EDA is to use questions as a guide for your investigation. Although EDA has no fixed rules, some guidance on generating questions helps, because good questions tend to emerge from generating a large quantity of them. A good question also narrows a broad topic of interest down to a specific area of study.
Two types of questions are useful while discovering your data:
What type of variation occurs within a variable?
What type of covariation occurs between variables?
However, you should consider more details to develop good research questions.
Like data and data types, RQs can be classified into categories based on the type of research to be done: quantitative, qualitative, or mixed.
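As a rough numerical counterpart to these two questions, variation can be summarized with the variance or standard deviation of one variable, and covariation with the covariance or correlation of two. The sketch below uses a small made-up data set (the vectors `carat` and `price` are hypothetical values, not the lab's data).

```r
# Hypothetical toy values, for illustration only
carat <- c(0.2, 0.3, 0.5, 0.7, 1.0, 1.5)
price <- c(350, 500, 1200, 2400, 5000, 11000)

# Variation within a single variable
var(carat)      # variance
sd(carat)       # standard deviation
range(carat)    # minimum and maximum

# Covariation between two variables
cov(carat, price)   # covariance (depends on the units of both variables)
cor(carat, price)   # Pearson correlation (unit free, between -1 and 1)
```

Numbers like these only summarize; plots of the kind used later in the lab are still needed to see the shape of the variation.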
Quantitative Research & Questions
A quantitative research question specifies the population to be studied, the dependent and independent variables, and the research design to be used. Such questions cannot be answered with yes or no, so they do not begin with words such as Is, Are, Do, or Does.
It can be further categorized into three types: descriptive, comparative, and relationship.
Descriptive research questions aim to measure the responses of a study's population to one or more variables or describe variables that the research will measure. These questions typically begin with what.
Comparative research questions aim to discover the differences between two or more groups for an outcome variable. These questions can be causal, as well. For instance, the researcher may compare a group where a certain variable is involved and another group where that variable is not present.
Relationship research questions seek to explore and define trends and interactions between two or more variables. These questions often include both dependent and independent variables and use words such as association or trends.
Qualitative Research & Questions
These questions generally aim to discover, explain, or explore. They also have subgroups. Here are some of them.
Descriptive research questions attempt to describe a phenomenon.
Explanatory research questions seek to expound on a phenomenon or examine reasons for and associations between what exists.
Ideological research questions are used in research that aims to advance specific ideologies of a position.
You can investigate some examples for quantitative and qualitative research questions.
In order to learn about RQs more, please read the blog
In data mining, EDA can be cross-classified in two ways:
Numerical and Graphical
Univariate and Multivariate
Univariate means that you are investigating one variable, whereas multivariate means that you are dealing with two or more variables. In practice, multivariate EDA usually considers two variables at a time.
Before applying multivariate EDA, perform univariate EDA.
Application
Please install the following packages:
summarytools
DescTools
table1
lattice
I will use the diamonds dataset. Before reading a dataset in R, please make sure that the data set of interest is in your current working directory.
To find out the current working directory:
getwd()
To set your working directory:
setwd("path/to/folder") # placeholder path: replace it with the folder that contains diamonds.txt
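# stringsAsFactors = TRUE reads text columns as factors (the default before R 4.0), matching the str() output below
diamonds <- read.table("diamonds.txt", header = TRUE, sep = ",", stringsAsFactors = TRUE)
head(diamonds)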
## carat cut color clarity depth table price x y z
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Column descriptions
price: price in US dollars
carat: weight of the diamond
cut: quality of the cut
color: diamond colour, from D (best) to J (worst)
clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
x: length in mm
y: width in mm
z: depth in mm
depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y)
table: width of top of diamond relative to widest point
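The depth formula in the list above can be checked by hand on the first diamond shown in the head() output. This is a small arithmetic sketch; multiplying by 100 to express the result as a percentage is my assumption about how the recorded values are scaled.

```r
# Dimensions (in mm) of the first diamond in the head() output
x <- 3.95
y <- 3.98
z <- 2.43

# total depth percentage = z / mean(x, y) = 2 * z / (x + y), times 100 (assumed scaling)
depth_pct <- 100 * 2 * z / (x + y)
depth_pct   # close to the recorded depth of 61.5
```

The small gap between the computed and recorded values is plausibly due to rounding of x, y, and z in the data.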
We use the dim() command to extract the dimensions of the dataset.
dim(diamonds)
## [1] 53940 10
After this, we should check the class of the variables.
str(diamonds)
## 'data.frame': 53940 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Factor w/ 5 levels "Fair","Good",..: 3 4 2 4 2 5 5 5 1 5 ...
## $ color : Factor w/ 7 levels "D","E","F","G",..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Factor w/ 8 levels "I1","IF","SI1",..: 4 3 5 6 4 8 7 3 6 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
The output shows the class of the variables. At this stage, we need to be careful because if any variable has a wrong class, this will affect the whole analysis negatively.
class(diamonds$price) # class() shows the class of a variable
## [1] "integer"
is.factor(diamonds$price)
## [1] FALSE
is.integer(diamonds$price)
## [1] TRUE
a <- as.factor(diamonds$price)
is.factor(a)
## [1] TRUE
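A related pitfall is worth knowing when you repair a wrong class: converting a factor straight back to numeric returns the internal level codes, not the original values. The sketch below uses the first few prices from the head() output as a stand-in vector.

```r
price <- c(326L, 326L, 327L, 334L)   # first prices from the head() output
a <- as.factor(price)

as.numeric(a)                 # level codes: 1 1 2 3
as.numeric(as.character(a))   # original values: 326 326 327 334
```

Going through as.character() first is the usual way to recover the numbers.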
Summary Statistics
Summary statistics give a quick and simple description of the data, including the mean, median, mode, minimum, maximum, range, and standard deviation.
The easiest way to obtain summary statistics for the variables in a dataset is base R's summary() function, although its output is fairly plain.
You get the mean, quartiles, and min/max for numeric variables and a frequency table for categorical ones, but that is all.
summary(diamonds)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Ideal :21551 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Very Good:12082 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
Interpretation
The average carat is about 0.8. The minimum carat is 0.2, while the maximum is 5.01. Half of the diamonds have a carat below 0.7 and half above it. 25% of the diamonds have a carat below 0.4, and 25% have a carat above 1.04. Lastly, the variable might have outlier observations, since there is a considerable difference between the third quartile and the maximum value.
Also, out of 53940 diamonds, 1610 have a fair cut, 4906 a good cut, 21551 an ideal cut, 13791 a premium cut, and 12082 a very good cut.
To spot more descriptive statistics, you should consider some additional packages. For example,
summarytools::descr(diamonds)
## Registered S3 method overwritten by 'pryr':
## method from
## print.bytes Rcpp
## Non-numerical variable(s) ignored: cut, color, clarity
## Descriptive Statistics
## diamonds
## N: 53940
##
## carat depth price table x y z
## ----------------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
## Mean 0.80 61.75 3932.80 57.46 5.73 5.73 3.54
## Std.Dev 0.47 1.43 3989.44 2.23 1.12 1.14 0.71
## Min 0.20 43.00 326.00 43.00 0.00 0.00 0.00
## Q1 0.40 61.00 950.00 56.00 4.71 4.72 2.91
## Median 0.70 61.80 2401.00 57.00 5.70 5.71 3.53
## Q3 1.04 62.50 5324.50 59.00 6.54 6.54 4.04
## Max 5.01 79.00 18823.00 95.00 10.74 58.90 31.80
## MAD 0.47 1.04 2475.94 1.48 1.38 1.36 0.85
## IQR 0.64 1.50 4374.25 3.00 1.83 1.82 1.13
## CV 0.59 0.02 1.01 0.04 0.20 0.20 0.20
## Skewness 1.12 -0.08 1.62 0.80 0.38 2.43 1.52
## SE.Skewness 0.01 0.01 0.01 0.01 0.01 0.01 0.01
## Kurtosis 1.26 5.74 2.18 2.80 -0.62 91.20 47.08
## N.Valid 53940.00 53940.00 53940.00 53940.00 53940.00 53940.00 53940.00
## Pct.Valid 100.00 100.00 100.00 100.00 100.00 100.00 100.00
# DescTools::Desc(diamonds)
#produces details summary with plots etc.
table1::table1(~ depth, data=diamonds)
| | Overall (N=53940) |
|---|---|
| depth | |
| Mean (SD) | 61.7 (1.43) |
| Median [Min, Max] | 61.8 [43.0, 79.0] |
For more alternatives, please visit
summary() has no group-by option, but you can use aggregate() to get some rudimentary statistics by group:
Research Question 1
What is the average depth of premium-cut diamonds?
Question: Which type of EDA is this, and why?
aggregate(cbind(depth) ~ cut, data = diamonds, mean)
## cut depth
## 1 Fair 64.04168
## 2 Good 62.36588
## 3 Ideal 61.70940
## 4 Premium 61.26467
## 5 Very Good 61.81828
The average depth for premium cut diamonds is 61.26.
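If more than one statistic per group is needed, aggregate() also accepts a function that returns several values, and tapply() is a lighter alternative for a single statistic. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical) so that it runs without the lab's file.

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(
  cut   = c("Fair", "Fair", "Ideal", "Ideal", "Ideal"),
  depth = c(64.0, 65.0, 61.5, 61.9, 62.2)
)

# Several statistics per group in one call
aggregate(depth ~ cut, data = toy,
          FUN = function(v) c(mean = mean(v), sd = sd(v), n = length(v)))

# A single statistic per group, returned as a named vector
group_means <- tapply(toy$depth, toy$cut, mean)
group_means
```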
Univariate EDA
Categorical Variables
Frequency Table: Frequency refers to the number of times an event or a value occurs. A frequency table is a table that lists items and shows the number of times the items occur. It is generally applied to categorical variables.
table(diamonds$cut) #creates frequency table
##
## Fair Good Ideal Premium Very Good
## 1610 4906 21551 13791 12082
Interpretation is above.
prop.table(table(diamonds$cut)) #shows proportions
##
## Fair Good Ideal Premium Very Good
## 0.02984798 0.09095291 0.39953652 0.25567297 0.22398962
Bar Plot: A bar chart represents data as rectangular bars whose lengths are proportional to the values they represent. It is used to visualize a categorical variable, i.e., to display its distribution.
To create bar plot in R, you need to create the frequency table of the categorical variable at first.
Research Question 2
What is the frequency distribution of CUT?
cut_counts = table(diamonds$cut)
barplot(cut_counts)
# barplot() returns the x positions of the bar midpoints; keep them to place the labels
bar_mid = barplot(cut_counts,col=c("red","yellow","black","blue","orange"),main="Bar Plot of Cut")
text(bar_mid,cut_counts,labels=cut_counts,col="white",pos=1)
# col argument fills the bars
# main argument adds a title
# ylim argument sets the limits of the y-axis
# names.arg argument changes the bar names
# text() writes the frequencies on the plot
# cut_counts is the name of the frequency table
Note: Bar plots are sometimes also used to illustrate numerical variables.
Numerical Variables
For numerical EDA, summary statistics such as the mean, median, and standard deviation are considered. See above for how to compute them in R.
Histogram: A histogram represents the frequencies of the values of a variable bucketed into ranges. A histogram is similar to a bar chart, but it groups the values into continuous ranges. The height of each bar represents the number of values that fall in that range.
Research Question 3
What is the distribution of carat?
hist(diamonds$carat) #hist function is used to draw histogram
hist(diamonds$carat,col="red",main="Histogram of Carat",xlab="Carat")
#xlab changes the x axis name.
Same Histogram with a Different Number of Bins
Bin: Each bar (interval) in the histogram is called a bin.
hist(diamonds$carat,col="red",main="Histogram of Carat",xlab="Carat",breaks = 20)
#xlab changes the x axis name.
#breaks suggests the number of bins in the histogram; R may adjust it to get round break points
The histogram shows that the carat of diamonds has a right-skewed distribution.
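The visual impression of right skew can also be checked numerically. The sketch below computes a simple moment-based skewness coefficient on a made-up vector; the helper `skewness()` is defined here for illustration, and packages may apply slightly different bias corrections.

```r
# Moment-based sample skewness: third central moment / (second central moment)^1.5
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

# Hypothetical right-skewed values (a few large values stretch the right tail)
right_skewed <- c(0.2, 0.3, 0.3, 0.4, 0.5, 0.7, 1.0, 2.5)

skewness(right_skewed)                       # positive for a right-skewed variable
mean(right_skewed) > median(right_skewed)    # the mean is pulled toward the long tail
skewness(1:5)                                # zero for symmetric data
```

For carat, the summarytools output earlier in the lab reports a skewness of 1.12, which agrees with the histogram.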
Density Plot: A density plot represents the distribution of a numeric variable. It uses a kernel density estimate to show the probability density function of the variable. It is a smoothed version of the histogram and is used for the same purpose.
d = density(diamonds$carat)
print(d)
##
## Call:
## density.default(x = diamonds$carat)
##
## Data: diamonds$carat (53940 obs.); Bandwidth 'bw' = 0.04827
##
## x y
## Min. :0.0552 Min. :0.0000000
## 1st Qu.:1.3301 1st Qu.:0.0001509
## Median :2.6050 Median :0.0037733
## Mean :2.6050 Mean :0.1959014
## 3rd Qu.:3.8799 3rd Qu.:0.1916445
## Max. :5.1548 Max. :1.7672776
#Bandwidth is the parameter that controls the degree of smoothness.
plot(d,col="red",main="Density Plot of Carat",type="l")
Interpretation
It has a multimodal distribution.
Histogram with density plot
If you would like to draw a histogram with a density plot on top, use the prob=T argument so that the histogram is drawn on the density scale.
hist(diamonds$carat,col="red",main="Histogram of Carat",xlab="Carat",breaks = 20,prob=T)
lines(density(diamonds$carat),col="blue")
Box Plot: It is based on Tukey's five-number summary: minimum, first quartile, median, third quartile, and maximum. The box plot can be used for two main purposes:
To investigate the distribution of a univariate numerical variable and check for the existence of outliers.
To see the relationship between a numerical variable and a categorical variable, i.e., to compare a numerical variable across the levels of a categorical variable.
Therefore, it is suitable for both univariate and multivariate EDA.
Anatomy of Box Plot
Research Question 3
boxplot(diamonds$carat)
boxplot(diamonds$carat,main="Box Plot of Carat",xlab="Carat",col="red")
Interpretation
The variable has a right-skewed distribution and many outliers. The median of the data is between 0 and 1.
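The points drawn as outliers in a box plot follow the usual 1.5 x IQR rule, which can be reproduced by hand. The sketch below uses a made-up vector `v`; note that boxplot() works with Tukey's hinges, which can differ slightly from the quartiles returned by quantile().

```r
# Hypothetical values with one clearly extreme observation
v <- c(0.2, 0.3, 0.3, 0.4, 0.5, 0.7, 1.0, 5.0)

q <- quantile(v, c(0.25, 0.75))              # first and third quartiles
upper_fence <- unname(q[2]) + 1.5 * IQR(v)   # values above this are flagged
lower_fence <- unname(q[1]) - 1.5 * IQR(v)   # values below this are flagged

v[v > upper_fence | v < lower_fence]         # outliers by the quantile-based rule
boxplot.stats(v)$out                         # outliers as boxplot() identifies them
```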
Box Plot on the top of the Histogram
# Draw the boxplot and the histogram
layout(mat = matrix(c(1,2),2,1, byrow=TRUE), heights = c(1,8))
par(mar=c(0, 3.1, 1.1, 2.1))
boxplot(diamonds$carat,main="Box Plot and Histogram of Carat ",col="red",horizontal=TRUE,frame=F)
par(mar=c(1, 3.1, 1.1, 2.1))
hist(diamonds$carat,col="red",xlab="Carat",main="")
#xlab changes the x axis name.
Mid-Exercise 1
What is the variation of Price?
Write R code to draw a histogram and a density plot. Explain what you get.
Multivariate EDA
In order to do Multivariate EDA, we need to take more than one variable into the analysis.
Categorical Variables
Research Question 4
How is the color of diamonds distributed across cut types?
If you are analyzing two categorical variables, you can still create a frequency table, called a contingency table.
Contingency Table: In statistics, a contingency table, also known as a cross tabulation or crosstab, is a type of table in a matrix format that displays the multivariate frequency distribution of the variables. They are heavily used in survey research, business intelligence, engineering and scientific research. (Wikipedia)
You can create a contingency table by using table() in R.
table(diamonds$color,diamonds$cut)
##
## Fair Good Ideal Premium Very Good
## D 163 662 2834 1603 1513
## E 224 933 3903 2337 2400
## F 312 909 3826 2331 2164
## G 314 871 4884 2924 2299
## H 303 702 3115 2360 1824
## I 175 522 2093 1428 1204
## J 119 307 896 808 678
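Raw counts in a contingency table are hard to compare when the groups differ in size, so row or column proportions are often more informative. The sketch below uses two short made-up vectors (`toy_color` and `toy_cut` are hypothetical) to show prop.table() with a margin and addmargins().

```r
# Hypothetical toy data, for illustration only
toy_color <- c("D", "D", "E", "E", "E", "J")
toy_cut   <- c("Fair", "Ideal", "Ideal", "Ideal", "Fair", "Fair")

tab <- table(toy_color, toy_cut)

row_prop <- prop.table(tab, margin = 1)   # proportions within each color (rows sum to 1)
col_prop <- prop.table(tab, margin = 2)   # proportions within each cut (columns sum to 1)
row_prop
col_prop

addmargins(tab)                           # counts with row and column totals
```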
Grouped Bar Plot: It is an efficient tool to display the association between two categorical variables.
counts = table(diamonds$color,diamonds$cut)
barplot(counts, main="Color Distribution by Cut Type",
xlab="Cut Type", col=c("grey","darkred","orange","black","maroon","gold1","darkgreen"),
legend = rownames(counts),beside=TRUE)
Interpretation
The frequency of color J increases as the quality of the diamond cut increases (from Fair to Ideal). The plot also shows that G is the most frequent color overall.
What else?
Stacked Bar Plot: A stacked barplot is very similar to the grouped barplot above. The subgroups are just displayed on top of each other, not beside each other. The stacked bar chart is the default option of the barplot() function in base R, so you do not need to use the beside argument.
counts = table(diamonds$color,diamonds$cut)
barplot(counts, main="Color Distribution by Cut Type",
xlab="Cut Type", col=colors()[c(22,32,42,52,62,72,82)],
legend = rownames(counts))
The interpretation is the same as for the previous plot.
Continuous Variables
Research Question 5
What is the association between carat and price?
Scatter Plot: Scatter plot is a graphical way used to display the relationship between two numeric variables by using dots. Each dot represents an observation. Their position on the X (horizontal) and Y (vertical) axis represents the values of the 2 variables.
plot(diamonds$carat,diamonds$price)
#logic-> plot(independent variable, dependent variable)
plot(diamonds$carat,diamonds$price,main="The Association btw Carat and Price",xlab="Carat",ylab="Price",col="darkred",pch=12,cex=1)
#cex=size of dots
#pch=point type
The plot shows that the price increases as the carat increases (a positive relationship).
Scatter Plot with a Fitted Regression Line
plot(diamonds$carat,diamonds$price,main="The Association btw Carat and Price",xlab="Carat",ylab="Price",col=rgb(0.4,0.4,0.8,0.6),pch=16,cex=1,ylim=c(-3500,17000))
model <- lm(diamonds$price~diamonds$carat)
# I can get the features of this model :
#summary(model)
#model$coefficients
#summary(model)$adj.r.squared
# For each value of x, I can get the value of y estimated by the model, and add it to the current plot !
myPredict <- predict( model )
ix <- sort(diamonds$carat,index.return=T)$ix
lines(diamonds$carat[ix], myPredict[ix], col=2, lwd=0.5 )
# I add the features of the model to the plot
coeff <- round(model$coefficients , 2)
text(3, -2000 , paste("Model : ",coeff[1] , " + " , coeff[2] , "*x" , "\n\n" , "Adjusted R-squared = ",round(summary(model)$adj.r.squared,2)))
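The same idea can be sketched more compactly with the formula interface and abline(), which draws the fitted straight line directly from the model object. The data frame `toy` below is made up for illustration so that the block runs without the lab's file.

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(
  carat = c(0.2, 0.3, 0.5, 0.7, 1.0, 1.5),
  price = c(350, 500, 1200, 2400, 5000, 11000)
)

fit <- lm(price ~ carat, data = toy)      # simple linear regression

plot(price ~ carat, data = toy, pch = 16, main = "Price vs Carat (toy data)")
abline(fit, col = 2)                      # add the fitted line to the current plot

coef(fit)                                 # intercept and slope
summary(fit)$adj.r.squared                # adjusted R-squared of the fit
```

Passing data = ... instead of diamonds$... keeps the coefficient names short (carat rather than diamonds$carat) and makes predict() usable on new data.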
You can customize your scatter plot using the following arguments in the plot function.
cex = shape size
lwd = line width
col = control colors
lty = line type
pch = marker shape
type = link between dots
Categorical and Numerical Variables
Research Question 6
How are diamond prices distributed across cut types?
Box Plot: It is probably the most commonly used chart type for comparing the distributions of several groups.
boxplot(diamonds$price~diamonds$cut,main="The Boxplot of Diamond Prices by Cut",col=colors()[c(25,35,45,55,65)],xlab="Cut Type",ylab="Price",ylim=c(0,25000))
legend("topleft", legend = levels(diamonds$cut) ,
col=colors()[c(25,35,45,55,65)] , bty = "n", pch=20 , pt.cex = 1, cex = 1, horiz = T, inset = c(0.01, 0.01))
#inset sets the distance of the legend from the plot margins
#horiz = T lays the legend entries out horizontally
Interpretation
In short, the median prices of the cut types are close to each other. All of them have outlier observations. According to the plot, all of them have right-skewed distributions, but a more detailed visual technique (such as a histogram or density plot) should be used to confirm this.
Mid-Exercise 2
How are diamond prices distributed across color types?
Research Question 7
How are diamond prices distributed across cut type and depth?
Here we need a transformation, because we aim to analyze a numerical variable (price) by two categorical variables, and depth is numerical. We therefore recode depth into a two-level variable (below or above its mean).
diamonds$depth_factor = ifelse(diamonds$depth<mean(diamonds$depth),"Below Average","Above Average")
# I make the boxplot, asking to use the 2 factors : depth factor and cut:
par(mar=c(3,4,3,1))
myplot <- boxplot(price ~ depth_factor*cut , data=diamonds,
boxwex=0.4 , ylab="Price",
main="The Box Plot of Price by Cut and Depth" ,
col=c("slateblue1" , "tomato") ,
xaxt="n",ylim=c(0,30000))
# Keep only the cut name from labels such as "Above Average.Fair", one per pair of boxes
my_names <- sapply(strsplit(myplot$names , '\\.') , function(x) x[[2]] )
my_names <- my_names[seq(1 , length(my_names) , 2)]
# Put each cut label at the centre of its pair of boxes
axis(1, at =seq(1.5 , 10 , 2),labels = my_names ,
tick=FALSE , cex=0.3)
# Add the grey vertical lines
for(i in seq(0.5 ,10 , 2)){
abline(v=i,lty=1, col="grey")
}
# Add a legend
legend("topright", legend = c("Above", "Below"),
col=c("slateblue1" , "tomato"),
pch = 15, bty = "n", pt.cex = 3, cex = 1.2, horiz = F, inset = c(0.1, 0.1))
Interpretation
For each cut type, the price distribution does not differ visibly between diamonds with below-average and above-average depth.
As you know, R has a large number of packages for visualization. We will cover some of them in this lab.
Lattice Plots
Lattice is a powerful and elegant high-level data visualization system, with an emphasis on multivariate data. It generates a plot split into panels, one for each level of a categorical variable.
What it has?
Research Question 8
What is the association between price and carats by cut type?
Scatter Plot in Lattice: The lattice library offers the xyplot() function. It builds a scatterplot for each level of a factor automatically.
library(lattice)
## Warning: package 'lattice' was built under R version 3.6.3
xyplot(price ~ carat | cut , data=diamonds , pch=20 , cex=0.5 , col=rgb(0.2,0.4,0.8,0.5) )
Interpretation: Carat and price have a positive relationship within each cut type, as expected.
You can also color the points of a scatter plot by the levels of a factor variable using the groups argument of xyplot().
xyplot(price ~ carat , data=diamonds,groups = cut,auto.key = TRUE)
Research Question 9
How is the price of diamonds distributed by color?
There are several ways to answer this question; see the possible solutions below.
Histogram
histogram(~ price | color, data = diamonds, breaks = 20)
Price has a multimodal distribution for colors H, I, and J, and a right-skewed distribution for the remaining colors.
Density Plot
densityplot(~ price | color, data = diamonds)
densityplot(~price ,groups=color, data = diamonds,auto.key = TRUE)
Box and Violin Plot
bwplot(~ price | color, data = diamonds)
Violin Plot: Violin plots are similar to box plots, except that they also show the kernel probability density of the data at different values.
bwplot(~ price | color, data = diamonds,panel = panel.violin)
Recall Research Question 6
bwplot(price ~ cut, data = diamonds)
Recall Research Question 4
Heat Map
df <- table(diamonds$color,diamonds$cut)
levelplot(df)
References