Class 1 Objectives

  • Understand the data science workflow in R
  • Gain confidence in using R
  • Create a PDF document and a web page to communicate your own analysis
    • Including text, statistics, tables and charts
    • Using data from an external source (a CSV file)

Why R?

  • Designed for data science
  • A lingua franca of social science
  • Easy to make professional outputs (tables, charts, maps)
  • (Nearly) all statistical methods are available
  • Almost any question you have has already been answered online

Organizing your Analysis:

1 - Use RStudio for everything. R is the language; RStudio is the interface. If you do half your data cleaning in Excel, there is no record of it and mistakes cannot be traced or fixed.

2 - Our work is a 'Recipe Book': R Markdown files as a step-by-step guide to the data inputs (the ingredients), our analysis (the cooking instructions) AND the outputs (the picture of the perfect meal).

3 - Our work is reproducible: Anybody with R can open our work, press 'Knit' and produce the same outputs. They can also understand what the code does.

4 - Organize your work in Projects in R: For each major analysis, it's best to choose 'File' -> 'New Project' -> 'New Directory' in RStudio. Save all your data inputs and outputs in this folder (RStudio uses it as the working directory automatically).

5 - Data frames (tables) are the main building block of our analysis: We focus on manipulating and visualizing tables of data, as these are the best way of organizing our data.

6 - Tidy data: We will use a set of compatible 'packages' called the 'tidyverse' to make our analysis transparent and avoid common problems.

Structuring our Document

  • Mix TWO types of content
  • Text
    • Type normal text directly
    • Simple formatting (Cheat Sheet)
    • Equations: In LaTeX format, e.g. $ a^2 + b^2 = c^2 $

\[ a^2 + b^2 = c^2 \]

  • Code Chunks
    • 'Insert' -> 'R' creates a code chunk
    • Contains data processing OR outputs (tables, charts etc.)
    • Use a separate chunk for each output

Creating your Document Output

  • Once we have all the text and chunk outputs ready, Knit!
    • Knit to PDF
    • Knit to HTML
  • Set the title, date and author in the header in markdown

  • Documents or Presentations

Workflow

  1. Dataframe - Import data from a file - read_csv
  2. Process data
    • Cleaning - filter, select, mutate
    • Combine multiple datasets - left_join
    • Create measures/statistics - mutate, summarize
  3. Run a regression (if required) - zelig
  4. Create outputs
    • Table - kable, stargazer
    • Graph - ggplot
    • Map - leaflet, mapview
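The steps above can be sketched end to end on a tiny table. This is a minimal illustration, not the course's own script: the five rows are the example values shown on the later slides, the table is built inline instead of being read from a CSV so that the chunk runs on its own, and the name `delay_by_carrier` is an assumption made for the example.

```r
library(tibble)
library(dplyr)
library(knitr)

# Step 1: a dataframe (built inline here; normally read_csv("data.csv"))
flights <- tribble(
  ~carrier, ~origin, ~air_time, ~distance, ~dep_delay,
  "UA", "EWR", 227, 1400,  2,
  "UA", "LGA", 227, 1416,  4,
  "AA", "JFK", 160, 1089,  2,
  "B6", "JFK", 183, 1576, -1,
  "DL", "LGA", 116,  762, -6
)

# Step 2: process - drop missing delays, then one statistic per carrier
delay_by_carrier <- flights %>%
  filter(!is.na(dep_delay)) %>%
  group_by(carrier) %>%
  summarize(avg_delay = mean(dep_delay))

# Step 4: output - a formatted table (step 3, the regression, is optional)
kable(delay_by_carrier, caption = "Average departure delay by carrier")
```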

Load a Dataframe

  • Specific packages let us access APIs for online data, e.g. CEPESP-R
  • Or from local files using read_csv
library(tidyverse)
data <- read_csv("data.csv")
  • To open SPSS or Stata files:
library(foreign)
# read.spss returns a list unless to.data.frame=TRUE
data <- read.spss("data.sav", to.data.frame=TRUE)
data <- read.dta("data.dta")
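To try read_csv without an existing file, the sketch below first writes a small CSV to a temporary location and then reads it back. The file contents and the name `csv_path` are assumptions for illustration; in a real project the path would point to a file in the project folder.

```r
library(readr)

# Write a tiny CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "carrier,origin,air_time,distance,dep_delay",
  "UA,EWR,227,1400,2",
  "UA,LGA,227,1416,4",
  "AA,JFK,160,1089,2"
), csv_path)

# read_csv guesses column types by default; col_types makes them explicit
data <- read_csv(csv_path, col_types = cols(
  carrier   = col_character(),
  origin    = col_character(),
  air_time  = col_double(),
  distance  = col_double(),
  dep_delay = col_double()
))

data
```

Stating the column types is optional, but it makes the import fail loudly if the file changes shape.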

Dataframes

  • Variable Names
  • Observations
  • Values and variable types
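The three parts of a dataframe listed above can each be inspected directly. A small sketch, assuming the five example rows used on the later slides:

```r
library(tibble)
library(dplyr)

flights <- tribble(
  ~carrier, ~origin, ~air_time, ~distance, ~dep_delay,
  "UA", "EWR", 227, 1400,  2,
  "UA", "LGA", 227, 1416,  4,
  "AA", "JFK", 160, 1089,  2,
  "B6", "JFK", 183, 1576, -1,
  "DL", "LGA", 116,  762, -6
)

names(flights)                        # variable names (the columns)
nrow(flights)                         # number of observations (the rows)
col_types <- sapply(flights, class)   # variable type of each column
col_types
glimpse(flights)                      # names, types and first values together
```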

Workflow Example

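# Table: average departure delay per carrier
flights %>%
  group_by(carrier) %>%
  summarize(dep_delay=mean(dep_delay,na.rm=TRUE)) %>%
  kable()

# Chart: the same summary, one point per carrier
flights %>%
  group_by(carrier) %>%
  summarize(dep_delay=mean(dep_delay,na.rm=TRUE)) %>%
  ggplot() + geom_point(aes(x=carrier,y=dep_delay))

# Regression table: the same summary passed to a model
flights %>%
  group_by(carrier) %>%
  summarize(dep_delay=mean(dep_delay,na.rm=TRUE)) %>%
  zelig(dep_delay ~ carrier,data=.,model="ls") %>%
  stargazer(digits=3)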

Data Processing: Example Actions/Verbs

Setting the Scope of our Data:

  1. select - Pick specific columns
  2. slice - Pick rows by position
  3. filter - Pick rows based on a condition
  4. rename - Change name of column

Creating Measures/Statistics:

  1. mutate - Change/Create new column based on existing data
  2. summarize - Calculate single-value summary statistics
  3. group_by - Group into sub-tables for mutate/summarize
  4. count - Count number of rows
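A short sketch of the grouping verbs, which are not shown individually later on. It assumes the five example rows from the slides; the result names (`by_origin`, `carrier_counts`, `relative_delay`) are made up for the illustration.

```r
library(tibble)
library(dplyr)

flights <- tribble(
  ~carrier, ~origin, ~air_time, ~distance, ~dep_delay,
  "UA", "EWR", 227, 1400,  2,
  "UA", "LGA", 227, 1416,  4,
  "AA", "JFK", 160, 1089,  2,
  "B6", "JFK", 183, 1576, -1,
  "DL", "LGA", 116,  762, -6
)

# group_by + summarize: one row of statistics per origin airport
by_origin <- flights %>%
  group_by(origin) %>%
  summarize(avg_delay = mean(dep_delay, na.rm = TRUE), n_flights = n())

# count: number of rows per carrier
carrier_counts <- flights %>% count(carrier)

# group_by + mutate: keeps every row, compares it to its group's mean
relative_delay <- flights %>%
  group_by(origin) %>%
  mutate(delay_vs_origin = dep_delay - mean(dep_delay, na.rm = TRUE)) %>%
  ungroup()
```

Note the difference: summarize returns one row per group, while mutate after group_by keeps all the original rows.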

Restructuring our Data:

  1. arrange - Order table by values of a column
  2. gather - Reshape from Wide to Long Table
  3. spread - Reshape from Long to Wide Table
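The restructuring verbs can be sketched on the same five example rows. The `flight_id` column is an assumption added here so that each flight can be identified again after reshaping. gather and spread still work in tidyr, which now also offers pivot_longer and pivot_wider as their newer equivalents.

```r
library(tibble)
library(dplyr)
library(tidyr)

flights <- tribble(
  ~carrier, ~origin, ~air_time, ~distance, ~dep_delay,
  "UA", "EWR", 227, 1400,  2,
  "UA", "LGA", 227, 1416,  4,
  "AA", "JFK", 160, 1089,  2,
  "B6", "JFK", 183, 1576, -1,
  "DL", "LGA", 116,  762, -6
)

# arrange: longest flights first
sorted <- flights %>% arrange(desc(distance))

# gather: wide to long, one row per flight and measure
long <- flights %>%
  mutate(flight_id = row_number()) %>%
  select(flight_id, air_time, distance) %>%
  gather(key = "measure", value = "value", air_time, distance)

# spread: long back to wide, one column per measure
wide <- long %>% spread(key = measure, value = value)
```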

Data Processing: Example Actions on our Dataframe

flights %>% 
  select(air_time) 
## # A tibble: 5 x 1
##   air_time
##      <dbl>
## 1      227
## 2      227
## 3      160
## 4      183
## 5      116

Data Processing: Example Actions on our Dataframe

flights %>% 
  slice(1:2)
## # A tibble: 2 x 5
##   carrier origin air_time distance dep_delay
##   <chr>   <chr>     <dbl>    <dbl>     <dbl>
## 1 UA      EWR         227     1400      2.00
## 2 UA      LGA         227     1416      4.00

Data Processing: Example Actions on our Dataframe

flights %>% 
  filter(origin=="JFK")
## # A tibble: 2 x 5
##   carrier origin air_time distance dep_delay
##   <chr>   <chr>     <dbl>    <dbl>     <dbl>
## 1 AA      JFK         160     1089      2.00
## 2 B6      JFK         183     1576     -1.00

Data Processing: Example Actions on our Dataframe

flights %>% 
  pull(distance)
## [1] 1400 1416 1089 1576  762

Data Processing: Example Actions on our Dataframe

flights %>% 
  mutate(air_time=round(air_time/60,3))
## # A tibble: 5 x 5
##   carrier origin air_time distance dep_delay
##   <chr>   <chr>     <dbl>    <dbl>     <dbl>
## 1 UA      EWR        3.78     1400      2.00
## 2 UA      LGA        3.78     1416      4.00
## 3 AA      JFK        2.67     1089      2.00
## 4 B6      JFK        3.05     1576     -1.00
## 5 DL      LGA        1.93      762     -6.00

Data Processing: Example Actions on our Dataframe

flights %>% 
  mutate(speed=round(distance/air_time,3))
## # A tibble: 5 x 6
##   carrier origin air_time distance dep_delay speed
##   <chr>   <chr>     <dbl>    <dbl>     <dbl> <dbl>
## 1 UA      EWR         227     1400      2.00  6.17
## 2 UA      LGA         227     1416      4.00  6.24
## 3 AA      JFK         160     1089      2.00  6.81
## 4 B6      JFK         183     1576     -1.00  8.61
## 5 DL      LGA         116      762     -6.00  6.57

Data Processing: Example Actions on our Dataframe

flights %>% 
  summarize(avg_distance=mean(distance,na.rm=TRUE))
## # A tibble: 1 x 1
##   avg_distance
##          <dbl>
## 1         1249

Data Processing Example

Piping to find the average speed of United (UA) flights.

In steps: Take the data, filter the data to carrier UA,
calculate the speed of each flight,
and then find the average.

flights %>% filter(carrier=="UA") %>% 
  mutate(speed=distance/(air_time/60)) %>% 
  summarize(avg_speed=mean(speed,na.rm=TRUE)) %>%
  round(1) %>% as.numeric()
## [1] 420.9
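To see what the pipe does, the sketch below writes the same calculation twice: once piped, once as nested function calls that must be read from the inside out. It assumes only the five example rows from the earlier slides, so the result differs from the 420.9 above, which was presumably computed from a fuller flights table.

```r
library(tibble)
library(dplyr)

flights <- tribble(
  ~carrier, ~origin, ~air_time, ~distance, ~dep_delay,
  "UA", "EWR", 227, 1400,  2,
  "UA", "LGA", 227, 1416,  4,
  "AA", "JFK", 160, 1089,  2,
  "B6", "JFK", 183, 1576, -1,
  "DL", "LGA", 116,  762, -6
)

# Piped: reads top to bottom, in the order the steps happen
piped <- flights %>%
  filter(carrier == "UA") %>%
  mutate(speed = distance / (air_time / 60)) %>%
  summarize(avg_speed = mean(speed, na.rm = TRUE))

# Nested: the same three steps, innermost call runs first
nested <- summarize(
  mutate(filter(flights, carrier == "UA"),
         speed = distance / (air_time / 60)),
  avg_speed = mean(speed, na.rm = TRUE)
)

all.equal(piped$avg_speed, nested$avg_speed)
```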

Data Processing Example + In-line

These actions can be 'piped' together:
We want to find the average speed of United (UA) flights.

In steps: Take the data, filter the data to carrier UA,
calculate the speed of each flight,
and then find the average.

avg_speed <- flights %>% filter(carrier=="UA") %>% 
  mutate(speed=distance/(air_time/60)) %>% 
  summarize(avg_speed=mean(speed,na.rm=TRUE)) %>%
  round(1)

In the R Markdown text:
The average speed of United flights is `r avg_speed` miles per hour.

In the knitted document:
The average speed of United flights is 420.9 miles per hour.
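The result of summarize is still a one-row table. For in-line use it can help to extract a plain number first, for example with pull. A minimal sketch, assuming the five example rows from the earlier slides (so the value is not the 420.9 shown above); `avg_speed_tbl` is a name made up for the illustration.

```r
library(tibble)
library(dplyr)

flights <- tribble(
  ~carrier, ~origin, ~air_time, ~distance, ~dep_delay,
  "UA", "EWR", 227, 1400,  2,
  "UA", "LGA", 227, 1416,  4,
  "AA", "JFK", 160, 1089,  2,
  "B6", "JFK", 183, 1576, -1,
  "DL", "LGA", 116,  762, -6
)

# summarize returns a 1 x 1 table
avg_speed_tbl <- flights %>%
  filter(carrier == "UA") %>%
  mutate(speed = distance / (air_time / 60)) %>%
  summarize(avg_speed = mean(speed, na.rm = TRUE))

# pull extracts the column as a plain numeric vector of length one
avg_speed <- avg_speed_tbl %>%
  pull(avg_speed) %>%
  round(1)

is.data.frame(avg_speed_tbl)
is.numeric(avg_speed)
```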

Data Processing Example + In-line

Can we change the order of data processing?

In steps: Take the data, calculate the speed of each flight,
filter the data to carrier UA,
and then find the average.

flights %>% 
  mutate(speed=distance/(air_time/60)) %>% 
  filter(carrier=="UA") %>% 
  summarize(avg_speed=mean(speed,na.rm=TRUE)) %>%
  round(1)

420.9

Data Processing Example + In-line

Can we change the order of data processing?

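In steps: Take the data, calculate the speed of each flight,
find the average,
and then filter the data to carrier UA.

flights %>% 
  mutate(speed=distance/(air_time/60)) %>% 
  summarize(avg_speed=mean(speed,na.rm=TRUE)) %>% 
  filter(carrier=="UA") %>%
  round(1)

394.3

Here the order matters. summarize collapses the table to a single avg_speed value, so no carrier column remains for filter to use (run exactly as written, the filter step stops with an error). The 394.3 shown appears to be the average over all carriers rather than United's 420.9.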
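The effect of step order can be checked on a small table. The sketch below assumes the five example rows from the earlier slides, so its numbers differ from the 420.9 and 394.3 above. It also shows one way to summarize first and filter afterwards: group by carrier so that the carrier column survives the summary.

```r
library(tibble)
library(dplyr)

flights <- tribble(
  ~carrier, ~origin, ~air_time, ~distance, ~dep_delay,
  "UA", "EWR", 227, 1400,  2,
  "UA", "LGA", 227, 1416,  4,
  "AA", "JFK", 160, 1089,  2,
  "B6", "JFK", 183, 1576, -1,
  "DL", "LGA", 116,  762, -6
)

with_speed <- flights %>% mutate(speed = distance / (air_time / 60))

# Filter, then summarize: average over UA flights only
ua_avg <- with_speed %>%
  filter(carrier == "UA") %>%
  summarize(avg_speed = mean(speed, na.rm = TRUE))

# Summarize without filtering: average over every carrier
all_avg <- with_speed %>%
  summarize(avg_speed = mean(speed, na.rm = TRUE))
names(all_avg)   # only "avg_speed": nothing left to filter on

# Summarize per carrier, then filter: carrier is kept as the grouping column
by_carrier <- with_speed %>%
  group_by(carrier) %>%
  summarize(avg_speed = mean(speed, na.rm = TRUE)) %>%
  filter(carrier == "UA")
```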

Table Outputs

flights %>% slice(1:5) %>% 
  select(carrier,origin,air_time,distance,dep_delay) %>%
  kable()
carrier  origin  air_time  distance  dep_delay
UA       EWR          227      1400          2
UA       LGA          227      1416          4
AA       JFK          160      1089          2
B6       JFK          183      1576         -1
DL       LGA          116       762         -6

Table Outputs

flights %>% slice(1:5) %>% 
  select(carrier,origin,air_time,distance,dep_delay) %>%
  kable(caption="Example Table", align="lcccc")
Example Table
carrier  origin  air_time  distance  dep_delay
UA        EWR       227      1400        2
UA        LGA       227      1416        4
AA        JFK       160      1089        2
B6        JFK       183      1576       -1
DL        LGA       116       762       -6

Chart Outputs

flights %>%
  filter(carrier=="UA") %>%
  ggplot() + geom_point(aes(x=dep_time,y=dep_delay))

Chart Outputs

flights %>%
  filter(carrier=="UA") %>%
  ggplot() + geom_point(aes(x=dep_time,y=dep_delay)) +
  geom_smooth(aes(x=dep_time,y=dep_delay))

Chart Outputs

flights %>%
  filter(carrier=="UA") %>%
  ggplot() + geom_point(aes(x=dep_time,y=dep_delay)) +
  geom_smooth(aes(x=dep_time,y=dep_delay)) +
  ggtitle("Example Chart") +
  xlab("Departure Time") +
  ylab("Departure Delay")

Chart Outputs

flights %>%
  ggplot() + geom_bar(aes(x=dep_delay))

Chart Outputs

flights %>%
  ggplot() + geom_bar(aes(x=dep_delay)) +
  xlim(-30,100)

Chart Outputs

flights %>%
  group_by(origin) %>%
  summarize(avg_delay=mean(dep_delay,na.rm=TRUE)) %>%
  ggplot() + geom_col(aes(x=origin, y=avg_delay))

Basic Tools in RStudio

  • Data analysis within code chunks:
    • Assigning to saved objects: new_object <- old_object
    • Inspecting objects interactively: Type their name and press 'Ctrl-Enter'
    • Processing objects: data_frame %>% action_on_dataframe
    • Comments: #Comments go here and won't be processed by R
    • The actions (functions) we can use depend on the packages we have loaded:
      • install.packages("New_package") ONCE, then
      • library("New_package") at the start of each document
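The tools in this list fit together in a single chunk. A minimal sketch, with made-up object names (`numbers`, `doubled`) and base R's data.frame standing in for real data:

```r
# Install once per machine (commented out so knitting does not reinstall it)
# install.packages("dplyr")

# Load at the start of each document
library(dplyr)

# Assign to a saved object
numbers <- data.frame(x = 1:5)

# Process the object with a pipe and save the result under a new name
doubled <- numbers %>% mutate(y = x * 2)

# Inspect the object by typing its name
doubled
```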

Basic Tools in RStudio

  • Basic Maths in code chunks:
answer <- 2 + 2
answer
## [1] 4
inputs <- seq(0,1,0.2)

answer <- inputs*10
answer
## [1]  0  2  4  6  8 10