- Understand the data science workflow in R
- Gain confidence in using R
- Create a PDF document and a web page to communicate your own analysis
- Including text, statistics, tables and charts
- Using data from an external source (a CSV file)
1 - Use RStudio for everything. R is the language; RStudio is the interface. If you do half your data cleaning in Excel, there will be no record of it and we won't be able to fix mistakes.
2 - Our work is a 'Recipe Book': R Markdown files as a step-by-step guide to the data inputs (the ingredients), our analysis (the cooking instructions) AND the outputs (the picture of the perfect meal).
3 - Our work is reproducible: Anybody with R can open our work, press 'Knit' and produce the same outputs. They can also understand what the code does.
4 - Organize your work in Projects in R: For each major analysis, it's best to choose 'File' -> 'New Project' -> 'New Directory' from RStudio. Save all your data inputs and outputs in this folder (which RStudio will do automatically).
5 - Data frames (tables) are the main building block of our analysis: We focus on manipulating and visualizing tables of data, as these are the best way of organizing our data.
6 - Tidy data: We will use a set of compatible 'packages' called the 'tidyverse' to make our analysis transparent and avoid common problems.
$ a^2 + b^2 = c^2 $
\[ a^2 + b^2 = c^2 \]
Set the title, date and author in the header in markdown
Documents or Presentations
- read_csv
- filter, select, mutate
- left_join
- mutate, summarize
- zelig
- kable, stargazer
- ggplot
- leaflet, mapview
read_csv
data <- read_csv("data.csv")
library(foreign)
# read.spss returns a list by default; to.data.frame=TRUE gives a data frame
data <- read.spss("data.sav", to.data.frame=TRUE)
data <- read.dta("data.dta")
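As a minimal sketch of the CSV import step, the example below writes a small CSV to a temporary file and reads it back with read_csv. The object name flights_sample, the temporary file and the col_types string are assumptions made for illustration; the five rows are the sample rows shown in the tables later in this document.

```r
library(readr)

# Write a tiny CSV to a temporary file so the example does not depend on
# any file already existing on disk.
csv_path <- tempfile(fileext = ".csv")
write_lines(
  c("carrier,origin,air_time,distance,dep_delay",
    "UA,EWR,227,1400,2",
    "UA,LGA,227,1416,4",
    "AA,JFK,160,1089,2",
    "B6,JFK,183,1576,-1",
    "DL,LGA,116,762,-6"),
  csv_path
)

# col_types = "ccddd": two character columns followed by three numeric ones.
flights_sample <- read_csv(csv_path, col_types = "ccddd")

flights_sample
```

Stating the column types explicitly is optional; read_csv guesses them when col_types is left out.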
flights %>%
  group_by(carrier) %>%
  summarize(dep_delay=mean(dep_delay,na.rm=TRUE)) %>%
  kable()
flights %>%
  group_by(carrier) %>%
  summarize(dep_delay=mean(dep_delay,na.rm=TRUE)) %>%
  ggplot() + geom_point(aes(x=carrier,y=dep_delay))
flights %>%
  group_by(carrier) %>%
  summarize(dep_delay=mean(dep_delay,na.rm=TRUE)) %>%
  zelig(dep_delay ~ carrier,data=.,model="ls") %>%
  stargazer(digits=3)
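The workflow list includes left_join, which is not shown in the pipelines above. The sketch below illustrates it on a five-row sample; flights_sample and airline_names are hypothetical objects built only for this example, and the lookup table deliberately omits one carrier to show what happens to unmatched rows.

```r
library(dplyr)

# Five sample rows, matching the small table used elsewhere in this document.
flights_sample <- tibble(
  carrier   = c("UA", "UA", "AA", "B6", "DL"),
  origin    = c("EWR", "LGA", "JFK", "JFK", "LGA"),
  air_time  = c(227, 227, 160, 183, 116),
  distance  = c(1400, 1416, 1089, 1576, 762),
  dep_delay = c(2, 4, 2, -1, -6)
)

# Hypothetical lookup table: carrier code to airline name (DL left out on purpose).
airline_names <- tibble(
  carrier = c("UA", "AA", "B6"),
  name    = c("United", "American", "JetBlue")
)

# left_join keeps every row of the left table and adds matching columns
# from the right table; rows without a match get NA.
joined <- flights_sample %>%
  left_join(airline_names, by = "carrier")

joined
```

Every flight is kept, and the DL row has a missing name because the lookup table has no entry for it.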
Setting the Scope of our Data:
- select - Pick specific columns
- slice - Pick rows by position
- filter - Pick rows based on a condition
- rename - Change name of column
Creating Measures/Statistics:
- mutate - Change/Create new column based on existing data
- summarize - Calculate single-value summary statistics
- group_by - Group into sub-tables for mutate/summarize
- count - Count number of rows
Restructuring our Data:
- arrange - Order table by values of a column
- gather - Reshape from Wide to Long Table
- spread - Reshape from Long to Wide Table
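Three of the verbs in this list (rename, count and arrange) do not get their own example further on, so here is a brief sketch of each. The table flights_sample is a hypothetical five-row sample built for illustration, and the new column name departure_delay is likewise an assumption.

```r
library(dplyr)

# Five sample rows, matching the small table used elsewhere in this document.
flights_sample <- tibble(
  carrier   = c("UA", "UA", "AA", "B6", "DL"),
  origin    = c("EWR", "LGA", "JFK", "JFK", "LGA"),
  air_time  = c(227, 227, 160, 183, 116),
  distance  = c(1400, 1416, 1089, 1576, 762),
  dep_delay = c(2, 4, 2, -1, -6)
)

# rename: new_name = old_name
renamed <- flights_sample %>% rename(departure_delay = dep_delay)

# count: number of rows for each value of origin (result column is called n)
by_origin <- flights_sample %>% count(origin)

# arrange: order rows by distance, shortest first
sorted <- flights_sample %>% arrange(distance)

renamed
by_origin
sorted
```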
flights %>% select(air_time)
## # A tibble: 5 x 1
##   air_time
##      <dbl>
## 1      227
## 2      227
## 3      160
## 4      183
## 5      116
flights %>% slice(1:2)
## # A tibble: 2 x 5
##   carrier origin air_time distance dep_delay
##   <chr>   <chr>     <dbl>    <dbl>     <dbl>
## 1 UA      EWR         227     1400      2.00
## 2 UA      LGA         227     1416      4.00
flights %>% filter(origin=="JFK")
## # A tibble: 2 x 5
##   carrier origin air_time distance dep_delay
##   <chr>   <chr>     <dbl>    <dbl>     <dbl>
## 1 AA      JFK         160     1089      2.00
## 2 B6      JFK         183     1576     -1.00
flights %>% pull(distance)
## [1] 1400 1416 1089 1576 762
flights %>% mutate(air_time=round(air_time/60,3))
## # A tibble: 5 x 5
##   carrier origin air_time distance dep_delay
##   <chr>   <chr>     <dbl>    <dbl>     <dbl>
## 1 UA      EWR        3.78     1400      2.00
## 2 UA      LGA        3.78     1416      4.00
## 3 AA      JFK        2.67     1089      2.00
## 4 B6      JFK        3.05     1576     -1.00
## 5 DL      LGA        1.93      762     -6.00
flights %>% mutate(speed=round(distance/air_time,3))
## # A tibble: 5 x 6
##   carrier origin air_time distance dep_delay speed
##   <chr>   <chr>     <dbl>    <dbl>     <dbl> <dbl>
## 1 UA      EWR         227     1400      2.00  6.17
## 2 UA      LGA         227     1416      4.00  6.24
## 3 AA      JFK         160     1089      2.00  6.81
## 4 B6      JFK         183     1576     -1.00  8.61
## 5 DL      LGA         116      762     -6.00  6.57
flights %>% summarize(avg_distance=mean(distance,na.rm=TRUE))
## # A tibble: 1 x 1
##   avg_distance
##          <dbl>
## 1         1249
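The reshaping verbs gather and spread appear in the verb list but have no example of their own. The sketch below shows a round trip from wide to long and back. The table wide and the column names flight_id, measure and value are hypothetical, chosen only for illustration. Newer versions of tidyr also offer pivot_longer and pivot_wider for the same job.

```r
library(dplyr)
library(tidyr)

# A small wide table: one row per flight, one column per measure.
wide <- tibble(
  flight_id = 1:3,
  air_time  = c(227, 160, 116),
  distance  = c(1400, 1089, 762)
)

# gather: wide to long. The column names go into 'measure',
# the cell contents go into 'value'.
long <- wide %>%
  gather(key = "measure", value = "value", air_time, distance)

# spread: long to wide. Each distinct value of 'measure' becomes a column again.
wide_again <- long %>%
  spread(key = measure, value = value)

long
wide_again
```

The long table has one row per flight and measure (six rows here), and spreading it again recovers the original three rows.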
Piping to find the average speed of United (UA) flights.
In steps: take the data, filter the data to carrier UA, calculate the speed of each flight, and then find the average.
flights %>%
  filter(carrier=="UA") %>%
  mutate(speed=distance/(air_time/60)) %>%
  summarize(avg_speed=mean(speed,na.rm=TRUE)) %>%
  round(1) %>%
  as.numeric()
## [1] 420.9
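To make clear what the pipe does, the sketch below writes the same steps twice: once with %>% and once as ordinary nested function calls. It runs on a hypothetical five-row sample (flights_sample), so its result differs from the 420.9 obtained on the full data.

```r
library(dplyr)

# Five sample rows, matching the small table used elsewhere in this document.
flights_sample <- tibble(
  carrier   = c("UA", "UA", "AA", "B6", "DL"),
  origin    = c("EWR", "LGA", "JFK", "JFK", "LGA"),
  air_time  = c(227, 227, 160, 183, 116),
  distance  = c(1400, 1416, 1089, 1576, 762),
  dep_delay = c(2, 4, 2, -1, -6)
)

# Piped version: read top to bottom, one step per line.
piped <- flights_sample %>%
  filter(carrier == "UA") %>%
  mutate(speed = distance / (air_time / 60)) %>%
  summarize(avg_speed = mean(speed, na.rm = TRUE))

# Nested version: the same steps, read from the inside out.
nested <- summarize(
  mutate(
    filter(flights_sample, carrier == "UA"),
    speed = distance / (air_time / 60)
  ),
  avg_speed = mean(speed, na.rm = TRUE)
)

# Both give the same average speed.
all.equal(piped$avg_speed, nested$avg_speed)
```

The pipe passes the result of each step as the first argument of the next, which is why the two versions agree.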
These actions can be 'piped' together:
avg_speed <- flights %>%
  filter(carrier=="UA") %>%
  mutate(speed=distance/(air_time/60)) %>%
  summarize(avg_speed=mean(speed,na.rm=TRUE)) %>%
  round(1)
The average speed of United Flights is `r avg_speed` miles per hour.
The average speed of United Flights is 420.9 miles per hour.
Can we change the order of data processing?
In steps: take the data, calculate the speed of each flight, filter the data to carrier UA, and then find the average.
flights %>%
  mutate(speed=distance/(air_time/60)) %>%
  filter(carrier=="UA") %>%
  summarize(avg_speed=mean(speed,na.rm=TRUE)) %>%
  round(1)
420.9
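A small check of why swapping filter and mutate leaves the answer unchanged: mutate works row by row, so it does not matter whether the non-UA rows are removed before or after the speed is calculated. The sketch uses a hypothetical five-row sample (flights_sample), so the number differs from the 420.9 of the full data.

```r
library(dplyr)

# Five sample rows, matching the small table used elsewhere in this document.
flights_sample <- tibble(
  carrier   = c("UA", "UA", "AA", "B6", "DL"),
  origin    = c("EWR", "LGA", "JFK", "JFK", "LGA"),
  air_time  = c(227, 227, 160, 183, 116),
  distance  = c(1400, 1416, 1089, 1576, 762),
  dep_delay = c(2, 4, 2, -1, -6)
)

# Order 1: filter first, then calculate speed.
filter_first <- flights_sample %>%
  filter(carrier == "UA") %>%
  mutate(speed = distance / (air_time / 60)) %>%
  summarize(avg_speed = mean(speed, na.rm = TRUE))

# Order 2: calculate speed for every row, then filter.
mutate_first <- flights_sample %>%
  mutate(speed = distance / (air_time / 60)) %>%
  filter(carrier == "UA") %>%
  summarize(avg_speed = mean(speed, na.rm = TRUE))

# The two orders agree.
all.equal(filter_first$avg_speed, mutate_first$avg_speed)
```

Filtering first does less work on a large table, because the speed is only calculated for the rows that are kept.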
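Can we change the order of data processing?
In steps: take the data, calculate the speed of each flight, find the average, and then filter the data to carrier UA.
flights %>%
  mutate(speed=distance/(air_time/60)) %>%
  summarize(avg_speed=mean(speed,na.rm=TRUE)) %>%
  filter(carrier=="UA") %>%
  round(1)
This order does not work: summarize collapses the table to a single row containing only avg_speed, so the carrier column no longer exists and the filter step fails with an error. Without the filter step, the pipeline returns the average over all carriers, not the United average of 420.9:
394.3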
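One way to summarize first and still pick out a single carrier afterwards is to group by carrier before summarizing, so that the carrier column survives. This is a sketch on a hypothetical five-row sample (flights_sample), not on the full data, so the numbers differ from those above.

```r
library(dplyr)

# Five sample rows, matching the small table used elsewhere in this document.
flights_sample <- tibble(
  carrier   = c("UA", "UA", "AA", "B6", "DL"),
  origin    = c("EWR", "LGA", "JFK", "JFK", "LGA"),
  air_time  = c(227, 227, 160, 183, 116),
  distance  = c(1400, 1416, 1089, 1576, 762),
  dep_delay = c(2, 4, 2, -1, -6)
)

# group_by keeps one summary row per carrier, so 'carrier' is still
# available for the filter step that follows.
speed_by_carrier <- flights_sample %>%
  mutate(speed = distance / (air_time / 60)) %>%
  group_by(carrier) %>%
  summarize(avg_speed = mean(speed, na.rm = TRUE))

ua_speed <- speed_by_carrier %>%
  filter(carrier == "UA")

speed_by_carrier
ua_speed
```

The grouped summary has one row per carrier, and filtering it afterwards gives the same United average as filtering before summarizing.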
flights %>% slice(1:5) %>% select(carrier,origin,air_time,distance,dep_delay) %>% kable()
| carrier | origin | air_time | distance | dep_delay |
|---|---|---|---|---|
| UA | EWR | 227 | 1400 | 2 |
| UA | LGA | 227 | 1416 | 4 |
| AA | JFK | 160 | 1089 | 2 |
| B6 | JFK | 183 | 1576 | -1 |
| DL | LGA | 116 | 762 | -6 |
flights %>% slice(1:5) %>% select(carrier,origin,air_time,distance,dep_delay) %>% kable(caption="Example Table", align="lcccc")
| carrier | origin | air_time | distance | dep_delay |
|---|---|---|---|---|
| UA | EWR | 227 | 1400 | 2 |
| UA | LGA | 227 | 1416 | 4 |
| AA | JFK | 160 | 1089 | 2 |
| B6 | JFK | 183 | 1576 | -1 |
| DL | LGA | 116 | 762 | -6 |
flights %>% filter(carrier=="UA") %>% ggplot() + geom_point(aes(x=dep_time,y=dep_delay))
flights %>% filter(carrier=="UA") %>% ggplot() + geom_point(aes(x=dep_time,y=dep_delay)) + geom_smooth(aes(x=dep_time,y=dep_delay))
flights %>%
  filter(carrier=="UA") %>%
  ggplot() +
  geom_point(aes(x=dep_time,y=dep_delay)) +
  geom_smooth(aes(x=dep_time,y=dep_delay)) +
  ggtitle("Example Chart") +
  xlab("Departure Time") +
  ylab("Departure Delay")
flights %>% ggplot() + geom_bar(aes(x=dep_delay))
flights %>% ggplot() + geom_bar(aes(x=dep_delay)) + xlim(-30,100)
flights %>% group_by(origin) %>% summarize(avg_delay=mean(dep_delay,na.rm=TRUE)) %>% ggplot() + geom_col(aes(x=origin, y=avg_delay))
- new_object <- old_object
- data_frame %>% action_on_dataframe
- #Comments go here and won't be processed by R
- install.packages("New_package") ONCE, then library("New_package") at the start of each document
answer <- 2 + 2
answer
## [1] 4
inputs <- seq(0,1,0.2)
answer <- inputs*10
answer
## [1] 0 2 4 6 8 10