- Understand the data science workflow in R
- Gain confidence in using R
- Create a PDF document and a web page to communicate your own analysis
- Including text, statistics, tables and charts
- Using data from an external source (a CSV file)
1 - Use RStudio for everything. R is the language; RStudio is the interface. If you do half your data cleaning in Excel, there will be no record of it and we won't be able to fix mistakes.
2 - Our work is a 'Recipe Book': R Markdown files as a step-by-step guide to the data inputs (the ingredients), our analysis (the cooking instructions) AND the outputs (the picture of the perfect meal).
3 - Our work is reproducible: Anybody with R can open our work, press 'Knit' and produce the same outputs. They can also understand what the code does.
4 - Organize your work in RStudio Projects: For each major analysis, choose 'File' -> 'New Project' -> 'New Directory' in RStudio. Save all your data inputs and outputs in this folder (RStudio uses it as the working directory automatically).
5 - Data frames (tables) are the main building block of our analysis: We focus on manipulating and visualizing tables of data, as these are the best way of organizing our data.
6 - Tidy data: We will use a set of compatible 'packages' called the 'tidyverse' to make our analysis transparent and avoid common problems.
$ a^2 + b^2 = c^2 $
\[ a^2 + b^2 = c^2 \]
Set the title, date and author in the header in markdown
Documents or Presentations
read_csv
filter, select, mutate
left_join
mutate, summarize
zelig
kable, stargazer
ggplot
leaflet, mapview
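The list above names left_join, but none of the later examples use it. The following is a minimal sketch, assuming a small hypothetical lookup table called `airlines` (its `name` column and contents are invented for illustration) and the five example flights that appear in the tables later in this document:

```r
library(dplyr)

# Five example flights (values copied from the tables shown later in this document)
flights <- tibble(
  carrier   = c("UA", "UA", "AA", "B6", "DL"),
  origin    = c("EWR", "LGA", "JFK", "JFK", "LGA"),
  air_time  = c(227, 227, 160, 183, 116),
  distance  = c(1400, 1416, 1089, 1576, 762),
  dep_delay = c(2, 4, 2, -1, -6)
)

# Hypothetical lookup table: one row per carrier code (illustration only)
airlines <- tibble(
  carrier = c("UA", "AA", "B6"),
  name    = c("United", "American", "JetBlue")
)

# left_join keeps every row of flights and adds the matching columns from airlines;
# carriers missing from the lookup (DL here) get NA in the new column
flights_named <- flights %>%
  left_join(airlines, by = "carrier")

flights_named
```

The result still has five rows, one per flight, which is usually what we want when adding descriptive columns to a main table.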
read_csv
data <- read_csv("data.csv")
library(foreign)
data <- read.spss("data.sav", to.data.frame=TRUE) # SPSS file; without to.data.frame=TRUE a list is returned
data <- read.dta("data.dta") # Stata file
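To try read_csv without an external file, the sketch below first writes a two-row CSV to a temporary location and then reads it back. The file contents are illustrative, and stating `col_types` explicitly is an optional choice made here so the column types do not depend on guessing:

```r
library(readr)

# Write a tiny CSV to a temporary file so the example needs no external data
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "carrier,origin,air_time,distance,dep_delay",
  "UA,EWR,227,1400,2",
  "B6,JFK,183,1576,-1"
), csv_path)

# read_csv guesses column types by default; col_types makes them explicit and repeatable
data <- read_csv(csv_path, col_types = cols(
  carrier   = col_character(),
  origin    = col_character(),
  air_time  = col_double(),
  distance  = col_double(),
  dep_delay = col_double()
))

data
```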
flights %>%
group_by(carrier) %>%
summarize(dep_delay=mean(dep_delay)) %>%
kable()
flights %>%
group_by(carrier) %>%
# keep air_time in the summary so the chart below can use it
summarize(air_time=mean(air_time), dep_delay=mean(dep_delay)) %>%
ggplot() + geom_point(aes(x=air_time,y=dep_delay))
flights %>%
group_by(carrier) %>%
summarize(dep_delay=mean(dep_delay)) %>%
zelig(dep_delay ~ carrier,data=.,model="ls") %>%
stargazer(digits=3)
Setting the Scope of our Data:
select - Pick specific columns
slice - Pick rows by position
filter - Pick rows based on a condition
rename - Change name of column
Creating Measures/Statistics:
mutate - Change/Create new column based on existing data
summarize - Calculate single-value summary statistics
group_by - Group into sub-tables for mutate/summarize
count - Count number of rows
Restructuring our Data:
arrange - Order table by values of a column
gather - Reshape from Wide to Long Table
spread - Reshape from Long to Wide Table
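Several verbs in this list (rename, count, arrange, gather, spread) have no worked example further on. The sketch below tries each of them on the five example flights used in the rest of this document. The `flight_id` column is a helper added here only so that rows can be matched up again when reshaping. Note that in newer versions of tidyr, gather and spread are superseded by pivot_longer and pivot_wider, although they still work.

```r
library(dplyr)
library(tidyr)

# Five example flights (values copied from the tables shown in this document)
flights <- tibble(
  carrier   = c("UA", "UA", "AA", "B6", "DL"),
  origin    = c("EWR", "LGA", "JFK", "JFK", "LGA"),
  air_time  = c(227, 227, 160, 183, 116),
  distance  = c(1400, 1416, 1089, 1576, 762),
  dep_delay = c(2, 4, 2, -1, -6)
)

# rename: new_name = old_name
renamed <- flights %>% rename(airline = carrier)

# count: number of rows for each origin airport
per_origin <- flights %>% count(origin)

# arrange: order rows by a column (desc() gives descending order)
by_delay <- flights %>% arrange(desc(dep_delay))

# gather: wide to long, one row per flight and measure
long <- flights %>%
  mutate(flight_id = row_number()) %>%
  gather(key = "measure", value = "value", air_time, distance, dep_delay)

# spread: long back to wide, one column per measure
wide <- long %>% spread(key = measure, value = value)
```

Here `long` has 15 rows (5 flights times 3 measures) and `wide` is back to 5 rows.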
flights %>% select(air_time)
## # A tibble: 5 x 1
##   air_time
##      <dbl>
## 1      227
## 2      227
## 3      160
## 4      183
## 5      116
flights %>% slice(1:2)
## # A tibble: 2 x 5
##   carrier origin air_time distance dep_delay
##   <chr>   <chr>     <dbl>    <dbl>     <dbl>
## 1 UA      EWR         227     1400      2.00
## 2 UA      LGA         227     1416      4.00
flights %>% filter(origin=="JFK")
## # A tibble: 2 x 5
##   carrier origin air_time distance dep_delay
##   <chr>   <chr>     <dbl>    <dbl>     <dbl>
## 1 AA      JFK         160     1089      2.00
## 2 B6      JFK         183     1576     -1.00
flights %>% pull(distance)
## [1] 1400 1416 1089 1576 762
flights %>% mutate(air_time=round(air_time/60,3))
## # A tibble: 5 x 5
##   carrier origin air_time distance dep_delay
##   <chr>   <chr>     <dbl>    <dbl>     <dbl>
## 1 UA      EWR        3.78     1400      2.00
## 2 UA      LGA        3.78     1416      4.00
## 3 AA      JFK        2.67     1089      2.00
## 4 B6      JFK        3.05     1576     -1.00
## 5 DL      LGA        1.93      762     -6.00
flights %>% mutate(speed=round(distance/air_time,3))
## # A tibble: 5 x 6
##   carrier origin air_time distance dep_delay speed
##   <chr>   <chr>     <dbl>    <dbl>     <dbl> <dbl>
## 1 UA      EWR         227     1400      2.00  6.17
## 2 UA      LGA         227     1416      4.00  6.24
## 3 AA      JFK         160     1089      2.00  6.81
## 4 B6      JFK         183     1576     -1.00  8.61
## 5 DL      LGA         116      762     -6.00  6.57
flights %>% summarize(avg_distance=mean(distance,na.rm=TRUE))
## # A tibble: 1 x 1
##   avg_distance
##          <dbl>
## 1         1249
Piping to find the average speed of United (UA) flights.
In steps: take the data, filter the data to carrier UA, calculate the speed of each flight, and then find the average.
flights %>%
filter(carrier=="UA") %>%
mutate(speed=distance/(air_time/60)) %>%
summarize(avg_speed=mean(speed,na.rm=TRUE)) %>%
round(1) %>%
as.numeric()
## [1] 420.9
These actions can be 'piped' together. We want to find the average speed of United (UA) flights.
avg_speed <- flights %>%
filter(carrier=="UA") %>%
mutate(speed=distance/(air_time/60)) %>%
summarize(avg_speed=mean(speed,na.rm=TRUE)) %>%
round(1)
The average speed of United Flights is `r avg_speed` miles per hour.
The average speed of United Flights is 420.9 miles per hour.
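To see what the pipe is doing, the sketch below writes the same calculation twice: once piped and once as nested function calls. It runs on the five example flights shown in this document rather than on the full dataset, so the value it produces (372.2) differs from the 420.9 quoted above. The use of `pull()` to extract a plain number is a choice made for this sketch.

```r
library(dplyr)

# Five example flights (values copied from the tables shown in this document)
flights <- tibble(
  carrier   = c("UA", "UA", "AA", "B6", "DL"),
  origin    = c("EWR", "LGA", "JFK", "JFK", "LGA"),
  air_time  = c(227, 227, 160, 183, 116),
  distance  = c(1400, 1416, 1089, 1576, 762),
  dep_delay = c(2, 4, 2, -1, -6)
)

# Piped: read top to bottom, one step per line
piped <- flights %>%
  filter(carrier == "UA") %>%
  mutate(speed = distance / (air_time / 60)) %>%
  summarize(avg_speed = mean(speed, na.rm = TRUE)) %>%
  pull(avg_speed) %>%
  round(1)

# Nested: the same steps, read from the inside out
nested <- round(
  pull(
    summarize(
      mutate(filter(flights, carrier == "UA"), speed = distance / (air_time / 60)),
      avg_speed = mean(speed, na.rm = TRUE)
    ),
    avg_speed
  ),
  1
)

identical(piped, nested)
```

Both versions give the same number; the piped version simply lists the steps in the order we would describe them.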
Can we change the order of data processing?
In steps: take the data, calculate the speed of each flight, filter the data to carrier UA, and then find the average.
flights %>%
mutate(speed=distance/(air_time/60)) %>%
filter(carrier=="UA") %>%
summarize(avg_speed=mean(speed,na.rm=TRUE)) %>%
round(1)
420.9
Can we change the order of data processing?
In steps: take the data, calculate the speed of each flight, find the average, and then filter the data to carrier UA.
flights %>%
mutate(speed=distance/(air_time/60)) %>%
summarize(avg_speed=mean(speed,na.rm=TRUE)) %>%
filter(carrier=="UA") %>%
round(1)
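394.3
This order does not answer the question. summarize collapses the table to a single row that contains only avg_speed, so the carrier column is gone before the filter runs. Run exactly as written, filter stops with an error because carrier no longer exists; 394.3 is presumably the average over all carriers, not just United.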
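If we do want to summarize before filtering, one way to make that order work is to group by carrier first, so the summary keeps one row and one carrier value per airline. This is a sketch on the five example flights shown in this document, so the number it gives (372.2) is not comparable to the full-data results above.

```r
library(dplyr)

# Five example flights (values copied from the tables shown in this document)
flights <- tibble(
  carrier   = c("UA", "UA", "AA", "B6", "DL"),
  origin    = c("EWR", "LGA", "JFK", "JFK", "LGA"),
  air_time  = c(227, 227, 160, 183, 116),
  distance  = c(1400, 1416, 1089, 1576, 762),
  dep_delay = c(2, 4, 2, -1, -6)
)

# Summarize within each carrier: the result has one row per carrier
# and still contains the carrier column
by_carrier <- flights %>%
  mutate(speed = distance / (air_time / 60)) %>%
  group_by(carrier) %>%
  summarize(avg_speed = mean(speed, na.rm = TRUE))

# Now the filter has a carrier column to work with
ua_speed <- by_carrier %>%
  filter(carrier == "UA") %>%
  pull(avg_speed) %>%
  round(1)

ua_speed
```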
flights %>% slice(1:5) %>% select(carrier,origin,air_time,distance,dep_delay) %>% kable()
carrier | origin | air_time | distance | dep_delay |
---|---|---|---|---|
UA | EWR | 227 | 1400 | 2 |
UA | LGA | 227 | 1416 | 4 |
AA | JFK | 160 | 1089 | 2 |
B6 | JFK | 183 | 1576 | -1 |
DL | LGA | 116 | 762 | -6 |
flights %>% slice(1:5) %>% select(carrier,origin,air_time,distance,dep_delay) %>% kable(caption="Example Table", align="lcccc")
carrier | origin | air_time | distance | dep_delay |
---|---|---|---|---|
UA | EWR | 227 | 1400 | 2 |
UA | LGA | 227 | 1416 | 4 |
AA | JFK | 160 | 1089 | 2 |
B6 | JFK | 183 | 1576 | -1 |
DL | LGA | 116 | 762 | -6 |
flights %>% filter(carrier=="UA") %>% ggplot() + geom_point(aes(x=dep_time,y=dep_delay))
flights %>% filter(carrier=="UA") %>% ggplot() + geom_point(aes(x=dep_time,y=dep_delay)) + geom_smooth(aes(x=dep_time,y=dep_delay))
flights %>% filter(carrier=="UA") %>% ggplot() + geom_point(aes(x=dep_time,y=dep_delay)) + geom_smooth(aes(x=dep_time,y=dep_delay)) + ggtitle("Example Chart") + xlab("Departure Time") + ylab("Departure Delay")
flights %>% ggplot() + geom_bar(aes(x=dep_delay))
flights %>% ggplot() + geom_bar(aes(x=dep_delay)) + xlim(-30,100)
flights %>% group_by(origin) %>% summarize(avg_delay=mean(dep_delay,na.rm=TRUE)) %>% ggplot() + geom_col(aes(x=origin, y=avg_delay))
new_object <- old_object
data_frame %>% action_on_dataframe
#Comments go here and won't be processed by R
install.packages("New_package") ONCE, then library("New_package") at the start of each document
answer <- 2 + 2
answer
## [1] 4
inputs <- seq(0,1,0.2)
answer <- inputs*10
answer
## [1] 0 2 4 6 8 10