Exploratory Data Analysis

What is Data? and Data Types

  • Data is the plural of datum, which is an abstraction.

  • A datum is a single quantity or quality of a real-world entity such as a person, object, or event.

  • A data set consists of the data related to a collection of entities which are described in terms of a set of attributes.

Both data and data sets can be categorized into several groups.

Data types are an important concept in statistics. You need to understand them to apply statistical measures correctly and, in turn, to draw valid conclusions from your data.

Understanding data types is a prerequisite for exploratory data analysis, which is one of the most important parts of a data analysis project.

In general, we can divide data by attribute into two groups:

  • Quantitative data deals with numbers and things you can measure objectively: dimensions such as height, width, and length.

  • Qualitative data deals with characteristics and descriptors that can’t be easily measured but can be observed subjectively, such as smells and tastes.

These two data attributes have subgroups.

Quantitative Attribute

  • Discrete Data: Data are discrete if their values are distinct and separate, that is, if they can take only certain values. This type of data cannot be measured, but it can be counted. For example, the number of defective items in a box or the number of children in a household.

  • Continuous Data: Continuous data represent measurements; their values cannot be counted, but they can be measured. For example, temperature, height, or weight.

  • Interval Data: Interval values represent ordered units with equal differences between them. We therefore speak of interval data when a variable contains numeric values that are ordered and the exact differences between the values are known. Interval data can be either discrete or continuous.

The problem with interval data is that zero has no real meaning: it does not indicate the absence of the quantity. That’s why many descriptive and inferential statistics cannot be applied. For example, temperature in degrees Celsius.

  • Ratio Data: Ratio values are also ordered units with equal differences between them. They are the same as interval values, except that they have an absolute zero; in other words, zero has a real meaning. For example, age and distance.

Qualitative Attribute

  • Nominal Data: Nominal values represent discrete units and are used to label variables that have no quantitative value. Just think of them as labels. Note that nominal data have no order. For example, gender: male, female.

  • Binary Data:

  • Ordinal Data: Ordinal values represent discrete and ordered units. As the name suggests, the order matters. For example,

  • 1- Totally disagree.

  • 2- Disagree.

  • 3- Neither agree nor disagree.

  • 4- Agree.

  • 5- Totally Agree.
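As a rough sketch of how these data types map onto R classes, the snippet below uses a few made-up values. The vectors children, height_cm, gender, and answer are hypothetical and are not part of any data set used in this lab. Counts are stored as integers, measurements as numerics, nominal data as factors, and ordinal data as ordered factors.

```r
# Hypothetical toy values, made up for illustration only
children  <- c(0L, 2L, 1L, 3L)                      # discrete (counted): integer
height_cm <- c(172.4, 165.0, 180.2, 158.9)          # continuous (measured): numeric
gender    <- factor(c("Male", "Female", "Female", "Male"))  # nominal: unordered factor

# Ordinal: an ordered factor with the levels listed from lowest to highest
likert_levels <- c("Totally disagree", "Disagree",
                   "Neither agree nor disagree", "Agree", "Totally agree")
answer <- factor(c("Agree", "Disagree", "Totally agree", "Agree"),
                 levels = likert_levels, ordered = TRUE)

class(children)      # "integer"
is.ordered(gender)   # FALSE
is.ordered(answer)   # TRUE
answer < "Agree"     # order comparisons are meaningful for ordered factors
```

Comparison operators such as < are only meaningful for ordered factors; applied to an unordered factor they return NA with a warning.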

We can also categorize data sets into several groups. For example,

  • Cross-Sectional: A collection of observations (behaviour) for multiple subjects (entities) at a single point in time.

  • Time Series: A collection of observations (behaviour) for a single subject (entity) at different time points, generally equally spaced.

  • Panel Data: Often called cross-sectional time-series data, since it combines the two types above, i.e., a collection of observations for multiple subjects at different time points.

  • Circular Data:
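The relation between the first three data set types can be sketched with a small data frame. The panel object below and all of its values are invented for illustration; a cross-section and a time series are then simply subsets of the panel.

```r
# Hypothetical panel: 3 subjects observed in 3 years (values are made up)
panel <- data.frame(
  subject = rep(c("A", "B", "C"), each = 3),
  year    = rep(2019:2021, times = 3),
  value   = c(10, 12, 13, 7, 8, 8, 15, 14, 16)
)

# Cross-sectional: many subjects at a single point in time
cross_section <- subset(panel, year == 2020)

# Time series: a single subject at several points in time
time_series <- subset(panel, subject == "A")

cross_section
time_series
```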


Exploratory Data Analysis (EDA)

EDA is the process of exploring your data set through techniques such as visualization and transformation.

It is an iterative cycle:

  • Generate questions about your data, that is, develop research questions (RQs).

  • Search for answers by visualizing, transforming, and modelling your data.

  • Use what you learn to refine your questions and/or generate new questions.

The main purpose of EDA is to understand your data and to identify patterns that will guide your subsequent analysis. In this process, a researcher should feel free to investigate every idea; in other words, EDA has no fixed framework or rules. Some of your ideas will lead nowhere and others will pan out, which is what makes EDA an iterative process.

The easiest way to do EDA is to use questions as a guide for your investigation. Although EDA has no fixed rules, there are some suggestions for generating questions, because a well-posed question can yield a large amount of information. It also narrows a broad topic of interest down to a specific area of study.

Two types of questions are useful when exploring your data:

  • What type of variation occurs within a variable?

  • What type of covariation occurs between variables?

However, you should consider more details to develop good research questions.

Like data and data types, RQs can be classified into categories based on the type of research to be done: quantitative, qualitative, and mixed.

Quantitative Research & Questions

A quantitative research question specifies the population to be studied, the dependent and independent variables, and the research design to be used. Such questions cannot be answered with yes or no, so they do not begin with words such as Is, Are, Do, or Does.

It can be further categorized into three types: descriptive, comparative, and relationship.

  • Descriptive research questions aim to measure the responses of a study’s population to one or more variables or describe variables that the research will measure. These questions typically begin with what.

  • Comparative research questions aim to discover the differences between two or more groups for an outcome variable. These questions can be causal, as well. For instance, the researcher may compare a group where a certain variable is involved and another group where that variable is not present.

  • Relationship research questions seek to explore and define trends and interactions between two or more variables. These questions often include both dependent and independent variables and use words such as association or trends.

Qualitative Research & Questions

These questions generally aim to discover, explain, or explore. They also have subgroups. Here are some of them.

  • Descriptive research questions attempt to describe a phenomenon.

  • Explanatory research questions seek to expound on a phenomenon or examine reasons for and associations between what exists.

  • Ideological research questions are used in research that aims to advance specific ideologies of a position.

You can investigate some examples for quantitative and qualitative research questions.

Example for RQs

To learn more about RQs, please read the blog.

In data mining, EDA can be cross-classified in two ways.

  • Numerical and Graphical

  • Univariate and Multivariate

Univariate means that you are investigating one variable. Multivariate, on the other hand, means that you are dealing with two or more variables. Usually, two variables are considered in multivariate EDA.

Before applying multivariate EDA, perform univariate EDA.

Application

Please install

  • summarytools

  • DescTools

  • table1

  • lattice
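One possible way to check which of these packages are still missing is sketched below. It only reports the missing ones; the install.packages() call is left commented out because it assumes a working CRAN connection. The names pkgs, installed, and missing_pkgs are chosen here for illustration.

```r
# Packages used in this lab
pkgs <- c("summarytools", "DescTools", "table1", "lattice")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
} else {
  message("All packages are available.")
}
```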

I will use the diamonds data set. Before reading a data set into R, please make sure that the file is in your current working directory.

  • To find out the current working directory: getwd()

  • To set your working directory: setwd()

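# stringsAsFactors = TRUE keeps cut, color and clarity as factors (since R 4.0 the default is FALSE)
diamonds <- read.table("diamonds.txt", header = TRUE, sep = ",", stringsAsFactors = TRUE)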
head(diamonds)
##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

Column descriptions

  • price: price in US dollars

  • carat: weight of the diamond

  • cut: quality of the cut

  • color: diamond colour, from D (best) to J (worst)

  • clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

  • x: length in mm

  • y: width in mm

  • z: depth in mm

  • depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y)

  • table: width of top of diamond relative to widest point

We use the dim() function to get the dimensions of the data set.

dim(diamonds)
## [1] 53940    10

After this, we should check the class of the variables.

str(diamonds)
## 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Factor w/ 5 levels "Fair","Good",..: 3 4 2 4 2 5 5 5 1 5 ...
##  $ color  : Factor w/ 7 levels "D","E","F","G",..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Factor w/ 8 levels "I1","IF","SI1",..: 4 3 5 6 4 8 7 3 6 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

The output shows the class of each variable. At this stage, we need to be careful: if any variable has the wrong class, the whole analysis will be affected negatively.

class(diamonds$price) # class() shows the class of a variable
## [1] "integer"
is.factor(diamonds$price)
## [1] FALSE
is.integer(diamonds$price)
## [1] TRUE
a<-as.factor(diamonds$price)
is.factor(a)
## [1] TRUE
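A common pitfall when fixing a wrong class is converting a factor that holds numbers back to numeric. The sketch below uses a hypothetical vector, price_f, that was stored as a factor by mistake; calling as.numeric() directly returns the internal level codes, so the factor has to be converted to character first.

```r
# Hypothetical column that ended up as a factor although it holds numbers
price_f <- factor(c("326", "334", "18823"))

# Wrong: returns the internal level codes, not the prices
as.numeric(price_f)

# Right: convert to character first, then to numeric
price_num <- as.numeric(as.character(price_f))
price_num
```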

Summary Statistics

Summary statistics give a quick and simple description of the data, including the mean, median, mode, minimum, maximum, range, and standard deviation.

The easiest way to obtain summary statistics for the variables in a data set is R’s base summary() function, although its output is fairly plain.

You get the mean, quartiles, and min/max for numeric variables and a frequency table for categorical ones, but that’s all.

summary(diamonds)
##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Ideal    :21551   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Very Good:12082   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
## 

Interpretation

The average carat is about 0.8. The minimum carat is 0.2, while the maximum is 5.01. Half of the diamonds have a carat value below 0.7 and half above it. 25% of the diamonds are below 0.4 carat and 25% are above 1.04 carat. Lastly, the variable might have outlier observations, since there is a considerable difference between the third quartile and the maximum value.

Also, out of 53940 diamonds, 1610 have a fair cut, 4906 a good cut, 21551 an ideal cut, 13791 a premium cut, and 12082 a very good cut.
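The informal comparison of the third quartile with the maximum can be made a little more concrete with the common 1.5 * IQR rule of thumb, which is also the default rule box plots use to flag points. The sketch below plugs in the carat values from the summary output; the variable names are chosen only for illustration.

```r
# Carat quartiles and maximum, copied from the summary() output above
q1 <- 0.40
q3 <- 1.04
max_carat <- 5.01

iqr <- q3 - q1                  # interquartile range
upper_fence <- q3 + 1.5 * iqr   # values above this are candidate outliers

upper_fence
max_carat > upper_fence         # TRUE means the maximum lies beyond the fence
```

The rule only flags candidates; whether such values are errors or genuine large diamonds still has to be judged from context.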

To obtain more descriptive statistics, you should consider some additional packages. For example,

summarytools::descr(diamonds)
## Non-numerical variable(s) ignored: cut, color, clarity
## Descriptive Statistics  
## diamonds  
## N: 53940  
## 
##                        carat      depth      price      table          x          y          z
## ----------------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
##              Mean       0.80      61.75    3932.80      57.46       5.73       5.73       3.54
##           Std.Dev       0.47       1.43    3989.44       2.23       1.12       1.14       0.71
##               Min       0.20      43.00     326.00      43.00       0.00       0.00       0.00
##                Q1       0.40      61.00     950.00      56.00       4.71       4.72       2.91
##            Median       0.70      61.80    2401.00      57.00       5.70       5.71       3.53
##                Q3       1.04      62.50    5324.50      59.00       6.54       6.54       4.04
##               Max       5.01      79.00   18823.00      95.00      10.74      58.90      31.80
##               MAD       0.47       1.04    2475.94       1.48       1.38       1.36       0.85
##               IQR       0.64       1.50    4374.25       3.00       1.83       1.82       1.13
##                CV       0.59       0.02       1.01       0.04       0.20       0.20       0.20
##          Skewness       1.12      -0.08       1.62       0.80       0.38       2.43       1.52
##       SE.Skewness       0.01       0.01       0.01       0.01       0.01       0.01       0.01
##          Kurtosis       1.26       5.74       2.18       2.80      -0.62      91.20      47.08
##           N.Valid   53940.00   53940.00   53940.00   53940.00   53940.00   53940.00   53940.00
##         Pct.Valid     100.00     100.00     100.00     100.00     100.00     100.00     100.00
# DescTools::Desc(diamonds)
# produces a detailed summary with plots, etc.
table1::table1(~ depth, data=diamonds)
                     Overall (N=53940)
depth
  Mean (SD)          61.7 (1.43)
  Median [Min, Max]  61.8 [43.0, 79.0]

For more alternatives, please visit

summary() has no group-by function, but you can use aggregate() to get some rudimentary statistics by group:

Research Question 1

What is the average depth of premium-cut diamonds?

Question: What type of EDA is this, and why?

aggregate(cbind(depth) ~ cut, data = diamonds, mean)
##         cut    depth
## 1      Fair 64.04168
## 2      Good 62.36588
## 3     Ideal 61.70940
## 4   Premium 61.26467
## 5 Very Good 61.81828

The average depth of premium-cut diamonds is 61.26.
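As an alternative sketch, tapply() gives the same kind of group means as a named vector, which makes it easy to pick out a single group. The toy data frame below is made up for illustration and only mimics the cut and depth columns.

```r
# Hypothetical toy data mimicking two columns of the diamonds data
toy <- data.frame(
  cut   = c("Fair", "Fair", "Ideal", "Ideal", "Premium", "Premium"),
  depth = c(64.0, 65.0, 61.5, 62.0, 61.0, 61.4)
)

# Group means as a data frame (same approach as above)
aggregate(depth ~ cut, data = toy, FUN = mean)

# Group means as a named vector; individual groups can be selected by name
by_cut <- tapply(toy$depth, toy$cut, mean)
by_cut["Premium"]
```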

Univariate EDA

Categorical Variables

Frequency Table: Frequency refers to the number of times an event or a value occurs. A frequency table lists items and shows the number of times each item occurs. It is generally applied to categorical variables.

table(diamonds$cut) #creates frequency table 
## 
##      Fair      Good     Ideal   Premium Very Good 
##      1610      4906     21551     13791     12082

The interpretation is given above.

prop.table(table(diamonds$cut)) #shows proportions
## 
##       Fair       Good      Ideal    Premium  Very Good 
## 0.02984798 0.09095291 0.39953652 0.25567297 0.22398962
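Note that table() lists the cut levels alphabetically, so Ideal appears before Premium and Very Good. If a quality ordering is preferred, the levels can be set explicitly. The sketch below assumes the usual quality ordering of diamond cuts (Fair, Good, Very Good, Premium, Ideal) and uses a small made-up vector rather than the full data set.

```r
# Hypothetical toy vector of cut values
cut_raw <- c("Ideal", "Premium", "Ideal", "Good",
             "Very Good", "Premium", "Fair", "Ideal")

# Assumed quality order, from worst to best
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
cut_ord <- factor(cut_raw, levels = cut_levels, ordered = TRUE)

freq <- table(cut_ord)                     # counts now follow the quality order
pct  <- round(100 * prop.table(freq), 1)   # percentages instead of proportions

freq
pct
```

With the levels ordered this way, bar plots and tables built from the factor follow the same order.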

Bar Plot: A bar chart represents data as rectangular bars whose lengths are proportional to the values of the variable. It is used to visualize a categorical variable, i.e., to display the distribution of a categorical variable.

To create a bar plot in R, you first need to create the frequency table of the categorical variable.

Research Question 2

What is the frequency distribution of cut?

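cut_counts <- table(diamonds$cut)  # frequency table of cut; avoids masking base c()
barplot(cut_counts)

# barplot() returns the x positions of the bar midpoints; keep them to place the labels
bar_mid <- barplot(cut_counts,col=c("red","yellow","black","blue","orange"),main="Bar Plot of Cut")
text(bar_mid,cut_counts,labels=cut_counts,col="white",pos=1)

# col argument fills the bars
# main argument adds a title
# ylim argument sets the limits of the y-axis
# names.arg argument changes the bar names
# text() shows the frequencies on the plot
# cut_counts is the name of the frequency table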

Note: Bar plots are sometimes also used to illustrate numerical variables.

Numerical Variables

For numerical EDA, summary statistics such as the mean, median, and standard deviation are considered. How to obtain them in R is shown above.

Histogram: A histogram represents the frequencies of the values of a variable bucketed into ranges. It is similar to a bar chart, but it groups the values into continuous ranges. The height of each bar represents the number of values in that range.

Research Question 3

What is the distribution of carat?

hist(diamonds$carat) #hist function is used to draw histogram

hist(diamonds$carat,col="red",main="Histogram of Carat",xlab="Carat")

# xlab changes the x axis name.

The same histogram with a different number of bins

Bin: Each bar in the histogram is called a bin.

hist(diamonds$carat,col="red",main="Histogram of Carat",xlab="Carat",breaks = 20)

# xlab changes the x axis name.
# breaks suggests the number of bins; R may adjust it to get round break points

The carat of the diamonds has a right-skewed distribution.
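To see what a histogram actually computes, hist() can be called with plot = FALSE, which returns the break points and counts without drawing anything. The sketch below uses simulated right-skewed values as a stand-in for carat; the data are generated, not taken from the diamonds file.

```r
set.seed(1)
x <- rexp(1000)   # simulated right-skewed values, a stand-in for carat

# Compute the histogram without plotting it
h <- hist(x, breaks = 20, plot = FALSE)

length(h$breaks) - 1   # number of bins actually used (may differ from 20)
sum(h$counts)          # every observation falls into exactly one bin

# For a right-skewed variable the mean usually exceeds the median
mean(x) > median(x)
```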

Density Plot: A density plot is a representation of the distribution of a numeric variable. It uses a kernel density estimate to show the probability density function of the variable. It is a smoothed version of the histogram and is used for the same purpose.

d = density(diamonds$carat)
print(d)
## 
## Call:
##  density.default(x = diamonds$carat)
## 
## Data: diamonds$carat (53940 obs.);   Bandwidth 'bw' = 0.04827
## 
##        x                y            
##  Min.   :0.0552   Min.   :0.0000000  
##  1st Qu.:1.3301   1st Qu.:0.0001509  
##  Median :2.6050   Median :0.0037733  
##  Mean   :2.6050   Mean   :0.1959014  
##  3rd Qu.:3.8799   3rd Qu.:0.1916445  
##  Max.   :5.1548   Max.   :1.7672776
# The bandwidth is a parameter that controls the degree of smoothness.
plot(d,col="red",main="Density Plot of Carat",type="l")

Interpretation

It has a multimodal distribution.
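Whether several modes are visible depends on the bandwidth. As a minimal sketch on simulated data (the bimodal sample below is generated, not the diamonds data), the adjust argument of density() multiplies the default bandwidth; larger values give a smoother curve that can hide separate peaks.

```r
set.seed(42)
# Simulated bimodal sample with peaks near 0.3 and 1.0
x <- c(rnorm(500, mean = 0.3, sd = 0.05), rnorm(500, mean = 1.0, sd = 0.05))

d_default <- density(x)              # default bandwidth
d_wide    <- density(x, adjust = 5)  # five times the default bandwidth

d_default$bw
d_wide$bw

# To compare the two curves visually:
# plot(d_default, main = "Effect of bandwidth"); lines(d_wide, col = "blue")
```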

Histogram with density plot

If you would like to overlay a density plot on a histogram, use the prob=TRUE argument so that the y-axis shows densities instead of counts.

hist(diamonds$carat,col="red",main="Histogram of Carat",xlab="Carat",breaks = 20,prob=TRUE)
lines(density(diamonds$carat),col="blue")

Box Plot: It is based on Tukey’s five-number summary: minimum, first quartile, median, third quartile, and maximum. The box plot can be used for two main purposes:

  • To investigate the distribution of a univariate numerical variable and check for outliers.

  • To see the relationship between a numerical variable and a categorical variable, that is, to compare a numerical variable across the levels of a categorical variable.

Therefore, it is suitable for both univariate and multivariate EDA.

Anatomy of Box Plot


Research Question 3

boxplot(diamonds$carat)

boxplot(diamonds$carat,main="Box Plot of Carat",xlab="Carat",col="red")

Interpretation

The variable of interest has a right-skewed distribution and many outliers. The median of the data is between 0 and 1.
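The numbers behind a box plot can also be inspected directly. As a small sketch with made-up values (the vector x is hypothetical), fivenum() returns Tukey's five-number summary and boxplot.stats() lists the points that a box plot would draw as outliers.

```r
# Hypothetical toy values with one clearly large observation
x <- c(0.2, 0.3, 0.3, 0.4, 0.5, 0.7, 0.7, 0.9, 1.0, 1.1, 3.5)

fivenum(x)              # minimum, lower hinge, median, upper hinge, maximum

stats <- boxplot.stats(x)
stats$stats             # whisker ends, hinges and median used by the plot
stats$out               # points beyond the whiskers (drawn as outliers)
```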

Box Plot on the top of the Histogram

# Draw the box plot on top of the histogram
layout(mat = matrix(c(1,2),2,1, byrow=TRUE),  heights = c(1,8))
par(mar=c(0, 3.1, 1.1, 2.1))
boxplot(diamonds$carat,main="Box Plot and Histogram of Carat",col="red",horizontal=TRUE,frame=FALSE)
par(mar=c(4.1, 3.1, 1.1, 2.1))  # bottom margin large enough for the x axis label
hist(diamonds$carat,col="red",xlab="Carat",main="")
layout(1)  # reset to a single-panel layout for later plots

# xlab changes the x axis name.

Mid-Exercise 1

What is the variation in price?

Write R code to draw a histogram and a density plot. Explain what you get.

Multivariate EDA

In order to do Multivariate EDA, we need to take more than one variable into the analysis.

Categorical Variables

Research Question 4

How is diamond color distributed across cut types?

When analyzing two categorical variables, we can still create a frequency table, called a contingency table.

Contingency Table: In statistics, a contingency table, also known as a cross tabulation or crosstab, is a type of table in a matrix format that displays the multivariate frequency distribution of the variables. They are heavily used in survey research, business intelligence, engineering and scientific research. (Source: Wikipedia)

You can create a contingency table by using table() in R.

table(diamonds$color,diamonds$cut)
##    
##     Fair Good Ideal Premium Very Good
##   D  163  662  2834    1603      1513
##   E  224  933  3903    2337      2400
##   F  312  909  3826    2331      2164
##   G  314  871  4884    2924      2299
##   H  303  702  3115    2360      1824
##   I  175  522  2093    1428      1204
##   J  119  307   896     808       678
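Raw counts depend on how many diamonds each cut type contains, so it often helps to look at proportions as well. The sketch below uses two short made-up vectors in place of the real columns; addmargins() appends row and column totals, and prop.table() with margin = 2 gives the share of each color within a cut type.

```r
# Hypothetical toy vectors standing in for diamonds$color and diamonds$cut
color <- c("D", "D", "E", "E", "E", "G", "G", "G", "G", "J")
cut   <- c("Ideal", "Good", "Ideal", "Ideal", "Premium",
           "Ideal", "Premium", "Premium", "Good", "Fair")

tab <- table(color, cut)

addmargins(tab)                              # counts with row and column totals

col_share <- prop.table(tab, margin = 2)     # each column sums to 1
round(col_share, 2)
```

Using margin = 1 instead gives the share of each cut type within a color.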

Grouped Bar Plot: It is an efficient tool to display the association between two categorical variables.

counts = table(diamonds$color,diamonds$cut)
barplot(counts, main="Color Distribution by Cut Type",
  xlab="Cut Type", col=c("grey","darkred","orange","black","maroon","gold1","darkgreen"),
  legend = rownames(counts),beside=TRUE)

Interpretation

It can be said that the frequency of color J increases as the quality of the diamond cut increases. Also, the plot shows that G is the most common color.

What else?

Stacked Bar Plot: A stacked barplot is very similar to the grouped barplot above. The subgroups are just displayed on top of each other, not beside. The stacked barchart is the default option of the barplot() function in base R, so you do not need to use the beside argument.

counts = table(diamonds$color,diamonds$cut)
barplot(counts, main="Color Distribution by Cut Type",
  xlab="Cut Type", col=colors()[c(22,32,42,52,62,72,82)],
  legend = rownames(counts))

The interpretation is the same as for the previous plot.

Continuous Variables

Research Question 5

What is the association between carat and price?

Scatter Plot: Scatter plot is a graphical way used to display the relationship between two numeric variables by using dots. Each dot represents an observation. Their position on the X (horizontal) and Y (vertical) axis represents the values of the 2 variables.

plot(diamonds$carat,diamonds$price)

#logic-> plot(independent variable, dependent variable)
plot(diamonds$carat,diamonds$price,main="The Association btw Carat and Price",xlab="Carat",ylab="Price",col="darkred",pch=12,cex=1)

#cex=size of dots
#pch=point type

As carat increases, price increases as well (a positive relationship).
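The visual impression of a positive relationship can be backed up with a correlation coefficient. The sketch below uses simulated values (carat and price here are generated, not read from the diamonds file): Pearson's coefficient measures linear association, while Spearman's rank correlation only assumes a monotone relationship.

```r
set.seed(123)
# Simulated data: price rises with carat, plus random noise
carat <- runif(200, min = 0.2, max = 2)
price <- 4000 * carat + rnorm(200, sd = 500)

r_pearson  <- cor(carat, price)                       # linear association
r_spearman <- cor(carat, price, method = "spearman")  # monotone association

r_pearson
r_spearman
```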

Scatter Plot with Fitted Regression Line

plot(diamonds$carat,diamonds$price,main="The Association btw Carat and Price",xlab="Carat",ylab="Price",col=rgb(0.4,0.4,0.8,0.6),pch=16,cex=1,ylim=c(-3500,17000))
model <- lm(diamonds$price~diamonds$carat)
 
# I can get the features of this model :
#summary(model)
#model$coefficients
#summary(model)$adj.r.squared
 
# For each value of x, I can get the value of y estimated by the model, and add it to the current plot !
myPredict <- predict( model ) 
ix <- sort(diamonds$carat,index.return=T)$ix
lines(diamonds$carat[ix], myPredict[ix], col=2, lwd=0.5 )  

# I add the features of the model to the plot
coeff <- round(model$coefficients , 2)
text(3, -2000 , paste("Model : ",coeff[1] , " + " , coeff[2] , "*x"  , "\n\n" , "Adjusted R-squared = ",round(summary(model)$adj.r.squared,2)))
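A slightly different way to get the fitted line, sketched here on simulated data, is to pass a data frame to lm() through the data argument and to predict on an evenly spaced grid. The grid is already sorted, so no index sorting is needed before drawing the line. The toy data frame and its coefficients are invented for illustration.

```r
set.seed(1)
# Simulated data with a known linear relationship plus noise
toy <- data.frame(carat = runif(100, min = 0.2, max = 2))
toy$price <- -2000 + 7500 * toy$carat + rnorm(100, sd = 300)

fit <- lm(price ~ carat, data = toy)

# Predict on a sorted grid of carat values
grid <- data.frame(carat = seq(min(toy$carat), max(toy$carat), length.out = 50))
grid$price_hat <- predict(fit, newdata = grid)

round(coef(fit), 2)
round(summary(fit)$adj.r.squared, 2)

# To draw it:
# plot(price ~ carat, data = toy); lines(grid$carat, grid$price_hat, col = 2)
```

For a straight line, abline(fit) would also work; the grid approach carries over unchanged to curved fits.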

You can customize your scatter plot using the following arguments in the plot function.

  • cex = shape size

  • lwd = line width

  • col = control colors

  • lty = line type

  • pch = marker shape

  • type = link between dots

Categorical and Numerical Variables

Research Question 6

How are diamond prices distributed across cut types?

Box Plot: It is probably the most commonly used chart type for comparing the distributions of several groups.

boxplot(diamonds$price~diamonds$cut,main="The Boxplot of Diamond Prices by Cut",col=colors()[c(25,35,45,55,65)],xlab="Cut Type",ylab="Price",ylim=c(0,25000))

legend("topleft", legend = levels(diamonds$cut) , 
    col=colors()[c(25,35,45,55,65)] , bty = "n", pch=20 , pt.cex = 1, cex = 1, horiz = T, inset = c(0.01, 0.01))

#inset set the position of legend
#horiz = add legend horizontally

Interpretation

In short, the median prices of the cut types are close to each other. All cut types have outlier observations. According to the plot, all of them have right-skewed distributions, but additional visual techniques should be considered to confirm this.

Mid-Exercise 2

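How are diamond prices distributed across colors?

Research Question 7

How are diamond prices distributed across cut type and depth?

Here, we need a transformation because we aim to analyze a numerical variable (price) by two categorical variables, and depth is numerical. We therefore convert depth into a two-level categorical variable (below or above the average depth) and use it together with cut.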

diamonds$depth_factor = ifelse(diamonds$depth<mean(diamonds$depth),"Below Average","Above Average")
# I make the boxplot, asking to use the 2 factors : depth factor and cut:
par(mar=c(3,4,3,1))
myplot <-
boxplot(price ~ depth_factor*cut , data=diamonds,
        boxwex=0.4 , ylab="Price",
        main="The Box Plot of Price by Cut and Depth" , 
        col=c("slateblue1" , "tomato") ,  
        xaxt="n",ylim=c(0,30000))

my_names <- sapply(strsplit(myplot$names , '\\.') , function(x) x[[2]] )
my_names <- my_names[seq(1 , length(my_names) , 2)]

# Boxes sit at x = 1, 2, ..., 10, so each pair is centered at 1.5, 3.5, ...
axis(1, at =seq(1.5 , 10 , 2),labels = my_names , 
     tick=FALSE , cex=0.3)

# Add the grey vertical lines
for(i in seq(0.5 ,10 , 2)){ 
  abline(v=i,lty=1, col="grey")
  }
 

# Add a legend
legend("topright", legend = c("Above", "Below"), 
       col=c("slateblue1" , "tomato"),
       pch = 15, bty = "n", pt.cex = 3, cex = 1.2,  horiz = F, inset = c(0.1, 0.1))

Interpretation

Within each cut type, the price of diamonds shows no visible difference between the two depth groups.
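Turning a numerical variable into a categorical one can be done in more than one way. As a sketch on a few made-up depth values, the snippet below repeats the ifelse() split around the mean and, as an alternative, uses cut() with the median as the break point. The labels "Lower half" and "Upper half" are chosen here for illustration.

```r
# Hypothetical toy depth values
depth <- c(59.8, 61.5, 61.9, 62.4, 63.3)

# Split around the mean, as in the code above
depth_factor <- ifelse(depth < mean(depth), "Below Average", "Above Average")
table(depth_factor)

# Alternative: cut() with explicit break points (here minimum, median, maximum)
depth_half <- cut(depth,
                  breaks = quantile(depth, probs = c(0, 0.5, 1)),
                  include.lowest = TRUE,
                  labels = c("Lower half", "Upper half"))
table(depth_half)
```

cut() returns a factor directly and extends naturally to more than two groups by adding break points.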

As you know, R has a large number of packages for visualization. We will cover some of them in this lab.

Lattice Plots

Lattice is a powerful and elegant high-level data visualization system, with an emphasis on multivariate data. It generates plots that are split by the levels of a categorical variable.

What does it offer?

Research Question 8

What is the association between price and carats by cut type?

Scatter Plot in Lattice: The lattice library offers the xyplot() function, which automatically builds a scatter plot for each level of a factor.

library(lattice)
xyplot(price ~ carat | cut , data=diamonds , pch=20 , cex=0.5 , col=rgb(0.2,0.4,0.8,0.5) )

Interpretation: Carat and price have a positive relationship within each cut type, as expected.

You can also make your scatter plot colorful by the level of one factor variable using xyplot().

xyplot(price ~ carat , data=diamonds,groups = cut,auto.key = TRUE)
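One practical detail of lattice is that its functions return a plot object instead of drawing immediately. At the console the object is printed, and therefore drawn, automatically, but inside functions, loops, or sourced scripts it has to be printed explicitly. The sketch below uses simulated data; the toy data frame is made up for illustration.

```r
library(lattice)

set.seed(7)
# Simulated data with three cut types
toy <- data.frame(
  carat = runif(60, min = 0.2, max = 2),
  cut   = rep(c("Fair", "Good", "Ideal"), each = 20)
)
toy$price <- 4000 * toy$carat + rnorm(60, sd = 300)

# xyplot() returns an object of class "trellis"; nothing is drawn yet
p <- xyplot(price ~ carat | cut, data = toy, layout = c(3, 1))
class(p)

# print(p)  # draws the plot; required inside functions, loops and scripts
```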

Research Question 9

How is the price of diamonds distributed by color?

There are several ways to answer this question; see the possible solutions below.

Histogram

histogram(~ price | color, data = diamonds, breaks = 20)

We can say that price has a multimodal distribution for colors H, I, and J and a right-skewed distribution for the rest of the colors.

Density Plot

densityplot(~ price | color, data = diamonds)

densityplot(~price ,groups=color, data = diamonds,auto.key = TRUE)

Box and Violin Plot

bwplot(~ price | color, data = diamonds)

Violin Plot: Violin plots are similar to box plots, except that they also show the kernel probability density of the data at different values.

bwplot(~ price | color, data = diamonds,panel = panel.violin)

Recall Research Question 6

bwplot(price ~  cut, data = diamonds)

Recall Research Question 4

Heat Map

df<-table(diamonds$color,diamonds$cut)
levelplot(df)

References