Analysis, Visualization and Mapping in R

Class 1 Objectives

Understand the data science workflow in R
Gain confidence in using R
Create a PDF document and a web page to communicate your own analysis
- Including text, statistics, a table and a chart
- Using data from an external source (a CSV file)

Why R?

Designed for data science
A lingua franca of social science
Easy to make professional outputs (tables, charts, maps)
(Nearly) ALL statistical methods are available
Almost ANY question you have has already been answered online

Organizing your Analysis:

1 - Use RStudio for everything. R is the language; RStudio is the interface. If you do half your data cleaning in Excel, there will be no record of it and we won't be able to fix mistakes.

2 - Our work is a 'Recipe Book': R Markdown files are a step-by-step guide to the data inputs (the ingredients), our analysis (the cooking instructions) AND the outputs (the picture of the perfect meal).

3 - Our work is reproducible: Anybody with R can open our work, press 'Knit' and produce the same outputs. They can also understand what the code does.

Organizing your Analysis:

4 - Organize your work in Projects in R: For each major analysis, it's best to choose 'File' -> 'New Project' -> 'New Directory' in RStudio. Save all your data inputs and outputs in this folder (RStudio uses it as the working directory automatically).

5 - Data frames (tables) are the main building block of our analysis: We focus on manipulating and visualizing tables of data, as these are the best way of organizing our data.

6 - Tidy data: We will use a set of compatible 'packages' called the 'tidyverse' to make our analysis transparent and avoid common problems.

Structuring our Document

Mix TWO types of content
Text
- Type normal text directly
- Simple formatting (Cheat Sheet)
- Equations: In LaTeX format, e.g. $a^2 + b^2 = c^2$

\[ a^2 + b^2 = c^2 \]

Code Chunks
- 'Insert' -> 'R' creates a code chunk
- Contains data processing OR outputs (tables, charts etc.)
- Use a separate chunk for each output

Creating your Document Output

Once we have all the text and chunk outputs ready, Knit!
- Knit to PDF
- Knit to HTML
Set the title, date and author in the YAML header at the top of the R Markdown file
Documents or Presentations
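
The header and the Knit step can also be driven from code. The sketch below writes a minimal R Markdown file with a title, author and date in its header, then renders it with rmarkdown::render(), the function the Knit button calls. The title, author and date values, the file contents and the object names are placeholders, and rendering assumes the rmarkdown package and pandoc are installed.

```r
# Build a tiny R Markdown document as a character vector (contents are placeholders).
fence <- strrep("`", 3)  # three backticks open and close a code chunk
rmd_lines <- c(
  "---",
  'title: "My Analysis"',
  'author: "Your Name"',
  'date: "2024-01-01"',
  "output: html_document",
  "---",
  "",
  "Some normal text.",
  "",
  paste0(fence, "{r}"),
  "2 + 2",
  fence
)

# Save it to a temporary file so nothing in the project folder is touched.
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_path)

# Render only if rmarkdown and pandoc are available on this machine.
html_path <- NULL
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  html_path <- rmarkdown::render(rmd_path, output_format = "html_document", quiet = TRUE)
}
```

Swapping in output_format = "pdf_document" should give a PDF instead, which additionally needs a LaTeX installation.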

Workflow

Dataframe - Import data from a file - read_csv
Process data
- Cleaning - filter, select, mutate
- Combine multiple datasets - left_join
- Create measures/statistics - mutate, summarize
Run a regression (if required) - zelig
Create outputs
- Table - kable, stargazer
- Graph - ggplot
- Map - leaflet, mapview
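
The workflow steps above can be sketched end to end on a tiny example. The CSV is written to a temporary file so the code is self-contained; its five rows are the example flights used later in these slides. The object names (flights_small, carrier_names, delay_by_carrier) and the lookup table are assumptions made for illustration.

```r
library(readr)
library(dplyr)
library(knitr)

# Write a small CSV to a temporary file so there is something to import.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "carrier,origin,air_time,distance,dep_delay",
  "UA,EWR,227,1400,2",
  "UA,LGA,227,1416,4",
  "AA,JFK,160,1089,2",
  "B6,JFK,183,1576,-1",
  "DL,LGA,116,762,-6"
), csv_path)

# A hypothetical second dataset, used to demonstrate left_join.
carrier_names <- tibble(
  carrier = c("UA", "AA", "B6", "DL"),
  name    = c("United", "American", "JetBlue", "Delta")
)

# 1. Import data from a file
flights_small <- read_csv(csv_path, col_types = cols())

# 2. Process: clean, combine, then create a statistic per carrier
delay_by_carrier <- flights_small %>%
  filter(!is.na(dep_delay)) %>%
  left_join(carrier_names, by = "carrier") %>%
  group_by(name) %>%
  summarize(avg_delay = mean(dep_delay))

# 3. Create an output: a formatted table
kable(delay_by_carrier, caption = "Average departure delay by carrier")
```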

Load a Dataframe

Specific packages let us access APIs for online data, eg. CEPESP-R
Or from local files using read_csv

library(tidyverse)
data <- read_csv("data.csv")

To open SPSS or Stata files:

library(foreign)
data <- read.spss("data.sav", to.data.frame=TRUE)
data <- read.dta("data.dta")
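
An alternative, not used in these slides, is the haven package, which belongs to the tidyverse family and returns tibbles directly. The sketch below is a round trip under that assumption: it writes a small made-up data frame to Stata and SPSS files in a temporary folder and reads them back. The object and file names are hypothetical.

```r
library(haven)

# A tiny data frame to write out and read back (values are illustrative).
example <- data.frame(
  carrier  = c("UA", "AA"),
  distance = c(1400, 1089)
)

# Write Stata (.dta) and SPSS (.sav) files to temporary locations.
dta_path <- tempfile(fileext = ".dta")
sav_path <- tempfile(fileext = ".sav")
write_dta(example, dta_path)
write_sav(example, sav_path)

# Read them back in; both calls return a tibble.
stata_data <- read_dta(dta_path)
spss_data  <- read_sav(sav_path)
```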

Dataframes

Variable Names
Observations
Values and variable types
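
A quick way to see these three parts of a data frame, sketched on the five example flights used later in these slides. The name flights_small is an assumption for illustration.

```r
library(dplyr)

# Five example flights (the rows shown in the later slides).
flights_small <- tibble(
  carrier   = c("UA", "UA", "AA", "B6", "DL"),
  origin    = c("EWR", "LGA", "JFK", "JFK", "LGA"),
  air_time  = c(227, 227, 160, 183, 116),
  distance  = c(1400, 1416, 1089, 1576, 762),
  dep_delay = c(2, 4, 2, -1, -6)
)

variable_names <- names(flights_small)          # the column (variable) names
n_observations <- nrow(flights_small)           # one row per observation
variable_types <- sapply(flights_small, class)  # the type of each column

# glimpse() prints names, types and the first values in one compact view.
glimpse(flights_small)
```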

Workflow Example

flights %>%
  group_by(carrier) %>%
  summarize(dep_delay=mean(dep_delay,na.rm=TRUE)) %>%
  kable()

flights %>%
  group_by(carrier) %>%
  summarize(dep_delay=mean(dep_delay,na.rm=TRUE)) %>%
  ggplot() + geom_point(aes(x=carrier,y=dep_delay))

# The regression uses flight-level data, so there is no group_by/summarize step
flights %>%
  zelig(dep_delay ~ carrier,data=.,model="ls") %>%
  stargazer(digits=3)

Data Processing: Example Actions/Verbs

Setting the Scope of our Data:

select - Pick specific columns
slice - Pick rows by position
filter - Pick rows based on a condition
rename - Change name of column

Creating Measures/Statistics:

mutate - Change/Create new column based on existing data
summarize - Calculate single-value summary statistics
group_by - Group into sub-tables for mutate/summarize
count - Count number of rows
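
A small sketch of how group_by changes what summarize and count return, using the five example flights from these slides. The object names are assumptions for illustration.

```r
library(dplyr)

# Five example flights (the rows shown in the later slides).
flights_small <- tibble(
  carrier   = c("UA", "UA", "AA", "B6", "DL"),
  origin    = c("EWR", "LGA", "JFK", "JFK", "LGA"),
  air_time  = c(227, 227, 160, 183, 116),
  distance  = c(1400, 1416, 1089, 1576, 762),
  dep_delay = c(2, 4, 2, -1, -6)
)

# Without group_by, summarize returns one row for the whole table.
overall <- flights_small %>%
  summarize(avg_delay = mean(dep_delay, na.rm = TRUE))

# With group_by, summarize returns one row per origin airport.
by_origin <- flights_small %>%
  group_by(origin) %>%
  summarize(avg_delay = mean(dep_delay, na.rm = TRUE), n_flights = n())

# count is a shortcut for group_by + summarize(n = n()).
origin_counts <- flights_small %>%
  count(origin)
```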

Restructuring our Data:

arrange - Order table by values of a column
gather - Reshape from Wide to Long Table
spread - Reshape from Long to Wide Table
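
The restructuring verbs can be sketched on the same five example flights. This assumes the tidyr package is loaded (it is part of the tidyverse); the object names are hypothetical. Newer tidyr versions also offer pivot_longer and pivot_wider for the same reshaping.

```r
library(dplyr)
library(tidyr)

# Five example flights (the rows shown in the later slides).
flights_small <- tibble(
  carrier   = c("UA", "UA", "AA", "B6", "DL"),
  origin    = c("EWR", "LGA", "JFK", "JFK", "LGA"),
  air_time  = c(227, 227, 160, 183, 116),
  distance  = c(1400, 1416, 1089, 1576, 762),
  dep_delay = c(2, 4, 2, -1, -6)
)

# arrange: order rows by distance, longest flight first.
sorted <- flights_small %>%
  arrange(desc(distance))

# gather: wide to long, one row per flight per measure.
long <- flights_small %>%
  select(carrier, origin, air_time, distance) %>%
  gather(key = "measure", value = "value", air_time, distance)

# spread: long back to wide, one column per measure.
wide <- long %>%
  spread(key = measure, value = value)
```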

Data Processing: Example Actions on our Dataframe

flights %>%
  select(air_time)

## # A tibble: 5 x 1
##   air_time
##      <dbl>
## 1      227
## 2      227
## 3      160
## 4      183
## 5      116

Data Processing: Example Actions on our Dataframe

flights %>% 
  slice(1:2)

## # A tibble: 2 x 5
##   carrier origin air_time distance dep_delay
##   <chr>   <chr>     <dbl>    <dbl>     <dbl>
## 1 UA      EWR         227     1400      2.00
## 2 UA      LGA         227     1416      4.00

Data Processing: Example Actions on our Dataframe

flights %>% 
  filter(origin=="JFK")

## # A tibble: 2 x 5
##   carrier origin air_time distance dep_delay
##   <chr>   <chr>     <dbl>    <dbl>     <dbl>
## 1 AA      JFK         160     1089      2.00
## 2 B6      JFK         183     1576     -1.00

Data Processing: Example Actions on our Dataframe

flights %>% 
  pull(distance)

## [1] 1400 1416 1089 1576  762

Data Processing: Example Actions on our Dataframe

flights %>% 
  mutate(air_time=round(air_time/60,3))

## # A tibble: 5 x 5
##   carrier origin air_time distance dep_delay
##   <chr>   <chr>     <dbl>    <dbl>     <dbl>
## 1 UA      EWR        3.78     1400      2.00
## 2 UA      LGA        3.78     1416      4.00
## 3 AA      JFK        2.67     1089      2.00
## 4 B6      JFK        3.05     1576     -1.00
## 5 DL      LGA        1.93      762     -6.00

Data Processing: Example Actions on our Dataframe

flights %>% 
  mutate(speed=round(distance/air_time,3))

## # A tibble: 5 x 6
##   carrier origin air_time distance dep_delay speed
##   <chr>   <chr>     <dbl>    <dbl>     <dbl> <dbl>
## 1 UA      EWR         227     1400      2.00  6.17
## 2 UA      LGA         227     1416      4.00  6.24
## 3 AA      JFK         160     1089      2.00  6.81
## 4 B6      JFK         183     1576     -1.00  8.61
## 5 DL      LGA         116      762     -6.00  6.57

Data Processing: Example Actions on our Dataframe

flights %>% 
  summarize(avg_distance=mean(distance,na.rm=TRUE))

## # A tibble: 1 x 1
##   avg_distance
##          <dbl>
## 1         1249

Data Processing Example

Piping to find the average speed of United (UA) flights.

In steps: Take the data, filter the data to carrier UA,
calculate the speed of each flight,
and then find the average.

flights %>% filter(carrier=="UA") %>% 
  mutate(speed=distance/(air_time/60)) %>% 
  summarize(avg_speed=mean(speed,na.rm=TRUE)) %>%
  round(1) %>% as.numeric()

## [1] 420.9
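
To see what the pipe is doing, the same steps can be written without it, saving each intermediate table. This is a sketch on the five example flights only, so the average differs from the full-data figure; the object names are assumptions for illustration.

```r
library(dplyr)

# Five example flights (the rows shown in the earlier slides).
flights_small <- tibble(
  carrier   = c("UA", "UA", "AA", "B6", "DL"),
  origin    = c("EWR", "LGA", "JFK", "JFK", "LGA"),
  air_time  = c(227, 227, 160, 183, 116),
  distance  = c(1400, 1416, 1089, 1576, 762),
  dep_delay = c(2, 4, 2, -1, -6)
)

# With pipes: each step receives the result of the previous one.
piped <- flights_small %>%
  filter(carrier == "UA") %>%
  mutate(speed = distance / (air_time / 60)) %>%
  summarize(avg_speed = mean(speed, na.rm = TRUE))

# Without pipes: the same steps, one saved object at a time.
ua_flights <- filter(flights_small, carrier == "UA")
ua_speeds  <- mutate(ua_flights, speed = distance / (air_time / 60))
stepwise   <- summarize(ua_speeds, avg_speed = mean(speed, na.rm = TRUE))
```

Both versions give the same one-row table; the piped version avoids naming the intermediate objects.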

Data Processing Example + In-line

These actions can be 'piped' together:
We want to find the average speed of United (UA) flights.

In steps: Take the data, filter the data to carrier UA,
calculate the speed of each flight,
and then find the average.

avg_speed <- flights %>% filter(carrier=="UA") %>% 
  mutate(speed=distance/(air_time/60)) %>% 
  summarize(avg_speed=mean(speed,na.rm=TRUE)) %>%
  round(1)

The average speed of United Flights is `r avg_speed` miles per hour.

The average speed of United Flights is 420.9 miles per hour.
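
One detail worth knowing: after summarize the result is still a one-row table, not a single number. A sketch, assuming we add pull() to extract the value before using it in-line; paste0() stands in here for the in-line `r` code, and the numbers come from the five example flights rather than the full data.

```r
library(dplyr)

# Five example flights (the rows shown in the earlier slides).
flights_small <- tibble(
  carrier   = c("UA", "UA", "AA", "B6", "DL"),
  origin    = c("EWR", "LGA", "JFK", "JFK", "LGA"),
  air_time  = c(227, 227, 160, 183, 116),
  distance  = c(1400, 1416, 1089, 1576, 762),
  dep_delay = c(2, 4, 2, -1, -6)
)

# The summarized result is a 1 x 1 table.
avg_speed_table <- flights_small %>%
  filter(carrier == "UA") %>%
  mutate(speed = distance / (air_time / 60)) %>%
  summarize(avg_speed = mean(speed, na.rm = TRUE)) %>%
  round(1)

# pull() extracts the column as a plain number.
avg_speed <- avg_speed_table %>%
  pull(avg_speed)

# Stand-in for the in-line sentence in the document text.
sentence <- paste0("The average speed of United flights is ", avg_speed, " miles per hour.")
```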

Data Processing Example + In-line

Can we change the order of data processing?

In steps: Take the data, calculate the speed of each flight,
filter the data to carrier UA,
and then find the average.

flights %>%
  mutate(speed=distance/(air_time/60)) %>%
  filter(carrier=="UA") %>%
  summarize(avg_speed=mean(speed,na.rm=TRUE)) %>%
  round(1)

420.9

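Data Processing Example + In-line

Can we change the order of data processing?

In steps: Take the data, calculate the speed of each flight,
find the average,
and then filter the data to carrier UA.

flights %>%
  mutate(speed=distance/(air_time/60)) %>%
  summarize(avg_speed=mean(speed,na.rm=TRUE)) %>%
  filter(carrier=="UA") %>%
  round(1)

Not in this order: summarize collapses the table to a single row with no carrier column, so the filter step has nothing to work with and R returns an error. The figure below is the average with no carrier filter applied, so it is not the United average.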

394.3
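
The effect of reordering can be checked on the five example flights. This is a sketch with hypothetical object names; tryCatch() is used only so the failing pipeline does not stop the rest of the code.

```r
library(dplyr)

# Five example flights (the rows shown in the earlier slides).
flights_small <- tibble(
  carrier   = c("UA", "UA", "AA", "B6", "DL"),
  origin    = c("EWR", "LGA", "JFK", "JFK", "LGA"),
  air_time  = c(227, 227, 160, 183, 116),
  distance  = c(1400, 1416, 1089, 1576, 762),
  dep_delay = c(2, 4, 2, -1, -6)
)

# Order A: filter, then mutate, then summarize.
filter_first <- flights_small %>%
  filter(carrier == "UA") %>%
  mutate(speed = distance / (air_time / 60)) %>%
  summarize(avg_speed = mean(speed, na.rm = TRUE)) %>%
  pull(avg_speed)

# Order B: mutate, then filter, then summarize. Same answer as A.
mutate_first <- flights_small %>%
  mutate(speed = distance / (air_time / 60)) %>%
  filter(carrier == "UA") %>%
  summarize(avg_speed = mean(speed, na.rm = TRUE)) %>%
  pull(avg_speed)

# Order C: summarize before filter. The carrier column is gone, so filter fails.
summarize_first <- tryCatch(
  flights_small %>%
    mutate(speed = distance / (air_time / 60)) %>%
    summarize(avg_speed = mean(speed, na.rm = TRUE)) %>%
    filter(carrier == "UA"),
  error = function(e) "error: no carrier column left to filter on"
)

# Dropping the filter altogether gives the average across all carriers.
all_carriers <- flights_small %>%
  mutate(speed = distance / (air_time / 60)) %>%
  summarize(avg_speed = mean(speed, na.rm = TRUE)) %>%
  pull(avg_speed)
```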

Table Outputs

flights %>% slice(1:5) %>% 
  select(carrier,origin,air_time,distance,dep_delay) %>%
  kable()

carrier	origin	air_time	distance	dep_delay
UA	EWR	227	1400	2
UA	LGA	227	1416	4
AA	JFK	160	1089	2
B6	JFK	183	1576	-1
DL	LGA	116	762	-6

Table Outputs

flights %>% slice(1:5) %>% 
  select(carrier,origin,air_time,distance,dep_delay) %>%
  kable(caption="Example Table", align="lcccc")

Example Table
carrier	origin	air_time	distance	dep_delay
UA	EWR	227	1400	2
UA	LGA	227	1416	4
AA	JFK	160	1089	2
B6	JFK	183	1576	-1
DL	LGA	116	762	-6
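
kable has a few more options that are often useful for a finished table. A sketch, assuming we want rounded values and readable column headings; the caption, headings and object names are placeholders.

```r
library(dplyr)
library(knitr)

# Five example flights (the rows shown in the table above).
flights_small <- tibble(
  carrier   = c("UA", "UA", "AA", "B6", "DL"),
  origin    = c("EWR", "LGA", "JFK", "JFK", "LGA"),
  air_time  = c(227, 227, 160, 183, 116),
  distance  = c(1400, 1416, 1089, 1576, 762),
  dep_delay = c(2, 4, 2, -1, -6)
)

speed_table <- flights_small %>%
  mutate(speed = distance / (air_time / 60)) %>%
  select(carrier, origin, speed) %>%
  kable(
    digits    = 1,                                        # round numeric columns
    col.names = c("Carrier", "Origin", "Speed (mph)"),    # headings shown to the reader
    caption   = "Speed of each flight",
    align     = "llr"                                     # left, left, right
  )

speed_table
```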

Chart Outputs

flights %>%
  filter(carrier=="UA") %>%
  ggplot() + geom_point(aes(x=dep_time,y=dep_delay))

Chart Outputs

flights %>%
  filter(carrier=="UA") %>%
  ggplot() + geom_point(aes(x=dep_time,y=dep_delay)) +
  geom_smooth(aes(x=dep_time,y=dep_delay))

Chart Outputs

flights %>%
  filter(carrier=="UA") %>%
  ggplot() + geom_point(aes(x=dep_time,y=dep_delay)) +
  geom_smooth(aes(x=dep_time,y=dep_delay)) +
  ggtitle("Example Chart") +
  xlab("Departure Time") +
  ylab("Departure Delay")

Chart Outputs

flights %>%
  ggplot() + geom_bar(aes(x=dep_delay))

Chart Outputs

flights %>%
  ggplot() + geom_bar(aes(x=dep_delay)) +
  xlim(-30,100)

Chart Outputs

flights %>%
  group_by(origin) %>%
  summarize(avg_delay=mean(dep_delay,na.rm=TRUE)) %>%
  ggplot() + geom_col(aes(x=origin, y=avg_delay))
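
A chart can also be saved as an object and given labels in a single labs() call. A sketch on the five example flights; the title, axis labels and object names are placeholders.

```r
library(dplyr)
library(ggplot2)

# Five example flights (the rows shown in the earlier slides).
flights_small <- tibble(
  carrier   = c("UA", "UA", "AA", "B6", "DL"),
  origin    = c("EWR", "LGA", "JFK", "JFK", "LGA"),
  air_time  = c(227, 227, 160, 183, 116),
  distance  = c(1400, 1416, 1089, 1576, 762),
  dep_delay = c(2, 4, 2, -1, -6)
)

# Process first: one row per origin airport.
delay_by_origin <- flights_small %>%
  group_by(origin) %>%
  summarize(avg_delay = mean(dep_delay, na.rm = TRUE))

# Then plot: geom_col draws one bar per row, with labels set in labs().
avg_delay_plot <- delay_by_origin %>%
  ggplot() +
  geom_col(aes(x = origin, y = avg_delay)) +
  labs(
    title = "Average Departure Delay by Origin",
    x     = "Origin Airport",
    y     = "Average Delay (minutes)"
  )
```

Typing avg_delay_plot on its own line in a chunk displays the chart.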

Basic Tools in RStudio

Data analysis within code chunks:
- Assigning to saved objects: new_object <- old_object
- Inspecting objects interactively: Type their name and press 'Ctrl-Enter'
- Processing objects: data_frame %>% action_on_dataframe
- Comments: #Comments go here and won't be processed by R
- The actions (functions) we can use depend on the packages we have loaded:
  - install.packages("New_package") ONCE, then
  - library("New_package") at the start of each document

Basic Tools in RStudio

Basic Maths in code chunks:

answer <- 2 + 2
answer

## [1] 4

inputs <- seq(0,1,0.2)

answer <- inputs*10
answer

## [1]  0  2  4  6  8 10