ECON 413
Introduction to Data Science
Erol Taymaz
Department of Economics
Middle East Technical University
Topics
- What is “Data Science”?
- What is ECON 413?
- What is R?
What is “Data Science”?
Data science, also known as data-driven science, is an
interdisciplinary field about scientific methods, processes and systems
to extract knowledge or insights from data in
various forms, either structured or unstructured …
Data science is a “concept to unify statistics, data analysis and
their related methods” in order to “understand and analyze actual
phenomena” with data. It employs techniques and theories drawn from many
fields within the broad areas of mathematics, statistics, information
science, and computer science, in particular from the subdomains of
machine learning, classification, cluster analysis, data mining,
databases, and visualization. (Wikipedia/Data Science)
What is “data”?
- Everything we can get information from
- Everything that can be stored in a computer memory!
What is “data”?
- Penn World Tables [tabular data, relational data]
- 5-Year Development Plan
- www.sahibinden.com
- Map of Turkey
Why is data science important?
An explosion of data and data sources
Everybody and everything generates data all the time
- Walking around with a mobile phone
- Using your credit card
- Browsing the Internet
- Buying a certain product
Easy access to the data (open data sources, Wikipedia!)
Ability to extract value from the data - not just our own data,
but all of the available data
Ability to use the tools necessary to collect, analyze and
present the data
Why is data science important?
The 50 Best Jobs in America in 2022
- Enterprise Architect
- Full Stack Engineer
- Data Scientist
- DevOps Engineer
- Strategy Manager
- Machine Learning Engineer
- Data Engineer
- Software Engineer
- Java Developer
- Product Manager
Source: Glassdoor/50 Best Jobs in America
Data Science Process
What is ECON 413?
- An introduction to data science
- An introduction to the main tools and ideas in the data scientist’s
toolbox.
- How to collect, retrieve, scrape, check, clean, analyze, understand,
visualize, and present data
- An introduction to R programming
Textbooks
- Venables, W. N., Smith, D. M. and the R Core Team (2015), An Introduction to R, R Core Team.
- Wickham, Hadley, and Grolemund, Garrett (2017), R for Data Science, O’Reilly.
- Heiss, Florian (2020), Using R for Introductory Econometrics, 2nd edition.
- Wickham, Hadley (2014), Advanced R, Chapman & Hall/CRC.
- Grolemund, Garrett (2014), Hands-On Programming with R, O’Reilly.
- Peng, Roger D. (2015), R Programming for Data Science, Leanpub.
- Hanck, C., Arnold, M., Gerber, A. and Schmelzer, M. (2020), Introduction to Econometrics with R.
- Wilke, Claus O. (2021), Fundamentals of Data Visualization.
- Healy, Kieran (2018), Data Visualization: A Practical Introduction.
Topics
Part 1. Basics
Introduction
Data types and data objects
Algorithms, loops, functions
Basic functions
Data manipulation with data.table
Data visualization and ggplot2
Factors, lists, functionals
Part 2. Applications
Reproducible and interactive research (Rmarkdown, Quarto and
Shiny packages)
Web scraping and text analysis
Regression analysis
Maps
Animations and simulations
R best practice
Review
Lectures
- Friday (09:40-12:30), Computer Lab
Please review the presentation slides and try examples before the
lecture.
Grading
The course consists of lectures, quizzes, homework, and projects.
Course grades will be based on 6 quizzes (10 pts each), 1 project (40
pts), and forum participation (as a bonus, up to 10 pts).
There will be 7 quizzes in total, and you can take any 6 of them.
There will be no make-up quizzes.
The project teams will consist of 3 students. Projects will be
presented online on January 23, 2024, and must be submitted by midnight
the same day.
DataCamp
“This class is supported by DataCamp, the most intuitive learning
platform for data science and analytics. Learn any time, anywhere and
become an expert in R, Python, SQL, and more. DataCamp’s learn-by-doing
methodology combines short expert videos and hands-on-the-keyboard
exercises to help learners retain knowledge. DataCamp offers 350+
courses by expert instructors on topics such as importing data, data
visualization, and machine learning. They’re constantly expanding their
curriculum to keep up with the latest technology trends and to provide
the best learning experience for all skill levels. Join over 6 million
learners around the world and close your skills gap.”
I will register your name at DataCamp. If you do not
want to use it, please inform me by e-mail.
What is R?
- A programming language (interpreter)
- A statistics package
- An environment for statistical computing and
graphics
- Developed by a community
- Extended with ‘packages’ that contain data, code, and
documentation
What is R?
- R is a flavor of the S computer language
- S was developed by John Chambers at Bell Labs in the late 1970s
- “[W]e wanted users to be able to begin in an interactive
environment, where they did not consciously think of themselves as
programming. Then as their needs became clearer and their sophistication
increased, they should be able to slide gradually into programming, when
the language and system aspects would become more important.” (John
Chambers)
- 1991: R is created by Ross Ihaka and Robert Gentleman
- 1993: R is made public
- 1995: R becomes Open Source (GNU General Public License)
- 1997: R Core Group is formed
- 2000: Version 1.0.0 ships
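The progression Chambers describes, from interactive use to programming, can be illustrated with a small sketch. The vector `x` and the function name `growth_rate` are made up for this illustration and are not part of the course material.

```r
# Interactive use: type an expression at the console and read the answer
x <- c(2.5, 3.1, 4.7, 5.0)   # hypothetical values
mean(x)

# The same kind of work, written as a reusable function
growth_rate <- function(v) {
  # period-to-period growth: (v[t] - v[t-1]) / v[t-1]
  diff(v) / head(v, -1)
}
growth_rate(x)
```

The first two lines are what a new user would type at the console; the function is the step toward programming once the same calculation is needed repeatedly.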
Why R?
Why not Stata, or SPSS, or …?
- A flexible programming language
- Open Source and free (philosophy, or practical reasons)
  - It is free to study the code
  - It is free to redistribute it
  - It is free to modify it
  - It is free to redistribute the modified version
  - It is free of charge, too
- Platform independent (Linux, Mac, Windows, desktops, servers)
- Easy to share with others
Why R?
- Suitable for almost all scientific disciplines
- Extensive, diverse and growing community
- Ever increasing number of packages
- Easy interactions with other programs (web, presentation, databases,
big data, etc.)
- Part of the open source toolchain of research (from data
analysis to reporting, for the web or a thesis)
  - Operating system: Linux (Ubuntu)
  - Office (LibreOffice)
  - Vector graphics (Inkscape)
  - Image processing (GIMP)
  - Multimedia player (VLC)
  - Video editing (Kdenlive)
Why R?
- Advantages
  - Free of charge, easy to install
  - Strong community support
  - Up-to-date
  - Can complement / can be complemented by other programs
  - Requires user knowledge - user thinks about what s/he is doing
- Drawbacks
  - Steep learning curve
  - Not user friendly
  - All objects are stored in the computer memory
  - Slower than compiled languages
  - Easy to make mistakes, difficult to find the sources of
    mistakes
Why R?
- Data Science and Big Data
“The R statistical programming language has
shown consistent growth, as has pandas, a popular library for data
science in Python. The closed source MATLAB language was growing for
most of the lifetime of the site, but has more recently leveled off and
may be shrinking. TensorFlow,
Google’s open-source machine learning framework, was introduced only in
late 2015, but it’s been growing at an extraordinary pace.”
Source: David Robinson, Introducing Stack Overflow Trends, May 9, 2017
Why R?
- Number of R packages available on its main distribution site (CRAN,
the Comprehensive R Archive Network)
1-year forecasts for $/TL
Assume that we need to forecast the $/TL exchange rate and report the
results every week
- Download the data from the CBRT web site
- Prepare the data file
- Analyze the data
- Prepare the chart
- Make a presentation file
- Write the report
- Reproduce it
Data science process - the usual one
Assume that we need to forecast the $/TL exchange rate and report the
results every week
- Download the data from the CBRT web site [web]
- Prepare the data file [Excel]
- Analyze the data [EViews, Stata, etc.]
- Prepare the chart [Excel]
- Make a presentation [PowerPoint]
- Write the report [Word]
- Reproduce it [do it again]
Data science process with R
Only 7 lines of code
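library(CBRT)      # CBRT web service client (registration required)
library(forecast)  # auto.arima(), forecast(), autoplot()
# Download the USD series from the CBRT web service
myData <- getDataSeries("TP.DK.USD.A.EF", start = "2015-01-01", freq = 3)
# Convert to a weekly time series starting in the first week of 2015
usd <- ts(myData$TP.DK.USD.A.EF, frequency = 52, start = c(2015, 1))
musd <- auto.arima(usd)           # select and fit an ARIMA model
fusd <- forecast(musd, h = 52)    # forecast 52 weeks ahead
# theme_bw() lives in ggplot2, which library(forecast) does not attach
autoplot(fusd) + ggplot2::theme_bw()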
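The same fit-forecast-plot steps can be tried offline before access to the CBRT web service is set up. The sketch below is not the course code: it assumes a simulated series in place of the real exchange rate, and it uses base R's `arima()` with a fixed ARIMA(0,1,1) order instead of `auto.arima()`. The name `usd_sim` and all numeric values are made up for illustration.

```r
set.seed(413)  # make the simulated data reproducible

# Simulated weekly series standing in for the $/TL rate (NOT real data):
# a random walk with a small upward drift, five years of 52 weeks
usd_sim <- ts(3 + cumsum(rnorm(260, mean = 0.01, sd = 0.05)),
              frequency = 52, start = c(2015, 1))

# Fixed ARIMA(0,1,1) with base R; auto.arima() would choose the order itself
fit <- arima(usd_sim, order = c(0, 1, 1))

# Forecast 52 weeks ahead; predict() returns point forecasts and standard errors
fc <- predict(fit, n.ahead = 52)

# Plot the series, the forecast, and an approximate 95% band
ts.plot(usd_sim, fc$pred,
        fc$pred + 1.96 * fc$se, fc$pred - 1.96 * fc$se,
        gpars = list(lty = c(1, 1, 2, 2), ylab = "simulated $/TL"))
```

Because the order is fixed by hand here, the result is only a stand-in for the automatic model selection used in the course example.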
To do this week
- Install R
- Install RStudio
- Register at CBRT’s Electronic Data
Delivery System in order to get access to the CBRT web service. Note
that registration is free and open to the public.
- Install the following packages:
- CBRT, data.table, ggplot2, ggthemes, pwt10, rmarkdown, quarto, WDI,
DT, forecast, foreign, GGally, haven, leaflet, leaflet.extras, mice,
networkD3, plm, plotly, r2d3, rvest, sf, shiny, shinyWidgets, stargazer,
tesseract, tidyverse, microbenchmark, ggmap, gganimate, tmaptools,
stringi, stringr, tm, rgdal, fixest, fst, collapse
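One possible way to install the list above in a single step is sketched below. The helper name `missing_packages` is hypothetical; the sketch assumes a CRAN mirror is already configured and skips packages that are already installed.

```r
# Package list copied from the to-do list above
course_packages <- c(
  "CBRT", "data.table", "ggplot2", "ggthemes", "pwt10", "rmarkdown",
  "quarto", "WDI", "DT", "forecast", "foreign", "GGally", "haven",
  "leaflet", "leaflet.extras", "mice", "networkD3", "plm", "plotly",
  "r2d3", "rvest", "sf", "shiny", "shinyWidgets", "stargazer",
  "tesseract", "tidyverse", "microbenchmark", "ggmap", "gganimate",
  "tmaptools", "stringi", "stringr", "tm", "rgdal", "fixest", "fst",
  "collapse"
)

# Return the packages in `pkgs` that are not yet installed (hypothetical helper)
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(course_packages)

# Install only in an interactive session, and only if something is missing
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

If a package is not available from the configured repository for your version of R, `install.packages()` reports it with a warning and continues with the rest; such a package has to be obtained from another source.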
File organization
- Create a directory for ECON 413
- Create a new project for that directory
- Organize files in directories
  - /econ413
  - /econ413/R files
  - /econ413/Data
  - /econ413/Raw data
  - /econ413/img
- Use consistent file names
  - 01 Project web data collect.R
  - 02 Project WDI data collect.R
  - 10 Project descriptives.R
  - 20 Project regressions.R
- R code (R script) will be saved in /R files
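The directory layout above can also be created from R itself. This is a minimal sketch: the function name `create_course_dirs` is hypothetical, and the root location is left to the user.

```r
# Create the course sub-directories under `root` (hypothetical helper)
create_course_dirs <- function(root) {
  subdirs <- c("R files", "Data", "Raw data", "img")
  paths <- file.path(root, subdirs)
  for (p in paths) {
    # recursive = TRUE also creates `root` if it does not exist yet;
    # showWarnings = FALSE keeps the call quiet when a directory already exists
    dir.create(p, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(paths)
}

# Example call (the location "~/econ413" is an assumption; pick your own):
# create_course_dirs("~/econ413")
```

Running the function a second time is harmless, since existing directories are left untouched.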
Using RStudio
Make errors!
Note that you will make errors frequently when you start using R. Do
not get frustrated when you get error messages. Making errors is an
essential part of the learning process, so try to fix them by
yourself first.
If you get any error message anytime while using R, check the code
first. Most of these errors will be due to missing parentheses and
commas.
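To see what such errors look like, the sketch below feeds three short snippets to R's parser and returns the message for each. The helper name `syntax_error` is made up for this illustration; the exact wording of the messages depends on the R version and language settings.

```r
# Return the parser's message instead of stopping the script (hypothetical helper)
syntax_error <- function(code) {
  tryCatch({
    parse(text = code)        # only checks the syntax, does not run the code
    "no syntax error"
  }, error = function(e) conditionMessage(e))
}

syntax_error("mean(c(1, 2, 3)")   # closing parenthesis missing
syntax_error("c(1 2, 3)")         # comma missing between 1 and 2
syntax_error("mean(c(1, 2, 3))")  # correct code
```

The messages usually name the position where the parser gave up, which is a good place to start looking for the missing character.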
If you cannot solve the problem in a reasonable time, submit a
question at the Forum page of ODTUClass. When you
submit your question, please add the error message and provide
sufficient info to reproduce the error.
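A forum question is easiest to answer when it contains a small script that others can run. The sketch below shows one possible shape for such a script; the data frame `df` and the failing call are invented for illustration.

```r
# 1. A small data set that anyone can recreate
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# 2. The call that fails, wrapped in try() so the script keeps running
result <- try(sum(df$y), silent = TRUE)

# 3. The exact error message, to paste into the question
cat(conditionMessage(attr(result, "condition")), "\n")

# 4. R version and loaded packages
print(sessionInfo())
```

In an actual question the `try()` wrapper can be dropped; it is used here only so that the remaining lines still run after the error.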
Try to solve the problems/errors posted on the Forum page, and share
your solutions with others. This is one of the best methods to learn R
programming.
Search for the error on Google (just copy the error message into the
Google search bar), and try to find an answer (among the search results,
check Stack Overflow first).