Photo by ProfDEH on Wikimedia Commons
In this assignment we continue our examination of traffic accidents in New York State.
Accept the assignment from GitHub Classroom, then go to the course GitHub organization and locate your homework repo, which should be named hw03-pedestrian-YOUR_GITHUB_USERNAME. Grab the URL of the repo, and clone it in RStudio. First, open the R Markdown document hw03.Rmd and Knit it. Make sure it compiles without errors. The output will be a Markdown .md file with the same name.
Before we introduce the data, let’s warm up with some simple exercises.
Run lint_assignment() and verify that you are lint free.
Again, we use the tidyverse, vroom, readxl and janitor packages. These packages are already installed for you. You can load them by running the following in your Console:
library(tidyverse)
library(vroom)
library(readxl)
library(janitor)
We can load the data with the following:
crashes = vroom("https://urmc-bst.github.io/bst430-fall2021-site/hw_lab_instruction/hw02-accidents/data/ny_collisions_2018_2019.csv.gz")
You can find out more about the dataset in the NY open data portal: https://data.ny.gov/Transportation/Motor-Vehicle-Crashes-Case-Information-Three-Year-/e8ky-4vqe . There’s a detailed data dictionary here.
Convert the names in crashes to snake_case using janitor::clean_names(). Filter the crashes to only include fatal accidents. You should have 1763 observations.
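As a rough illustration of the two steps above, here is a minimal sketch on a toy table. The column names and the "Fatal Accident" label are assumptions made up for the example; check the data dictionary for the actual column and values in crashes.

```r
library(dplyr)
library(janitor)
library(tibble)

# Toy stand-in for the raw data. Column names and values are assumptions
# for illustration only, not taken from the real file.
toy_raw <- tibble(
  `Crash Descriptor` = c("Fatal Accident", "Injury Accident", "Fatal Accident"),
  `County Name` = c("MONROE", "ERIE", "KINGS")
)

# clean_names() rewrites the column names in snake_case
toy_clean <- clean_names(toy_raw)

# Keep only the rows carrying the (assumed) fatal label
toy_fatal <- toy_clean %>%
  filter(crash_descriptor == "Fatal Accident")
```

After running this, names(toy_clean) shows the snake_case names and nrow(toy_fatal) gives the number of rows kept, which is a quick way to compare against the expected observation count on the real data.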
Consider the event_descriptor column. First, convert it to lower case. Then, using str_detect, define the set of events that involve collisions with bicyclists or pedestrians. Mutate crashes to add a new variable called is_pedbike that identifies these.
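The lower-casing and pattern-matching steps might look like the sketch below. The event strings are invented examples and the pattern "pedestrian|bicyclist" is an assumption; inspect the distinct values of event_descriptor in the real data before settling on a pattern.

```r
library(dplyr)
library(stringr)
library(tibble)

# Invented event descriptions, for illustration only
toy_events <- tibble(
  event_descriptor = c(
    "Pedestrian, Collision With",
    "Other Motor Vehicle, Collision With",
    "Bicyclist, Collision With",
    "Tree, Collision With Fixed Object"
  )
)

toy_events <- toy_events %>%
  mutate(
    # Lower-case first so the pattern does not depend on capitalization
    event_descriptor = str_to_lower(event_descriptor),
    # TRUE when the description mentions a pedestrian or a bicyclist
    is_pedbike = str_detect(event_descriptor, "pedestrian|bicyclist")
  )
```

Because mutate() evaluates its arguments in order, is_pedbike is computed on the already lower-cased column.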
i) Convert values stored in county_name to Title Case (the purpose of this will become clear subsequently, I swear!)
ii) Count the number of fatal crashes per county, per is_pedbike
iii) Consider the top 20 counties (county_name) with the most fatal crashes in the data set. Make a bar chart showing each county and the number of a) bicycle and pedestrian events and b) other events, filling the bars appropriately to show these two categories. You will be graded on having an appropriate sort order for the county, and appropriate axis labels.
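One possible shape for steps i) to iii), sketched on a six-row toy table. The data, the name n_fatal, and the choice of the top 2 counties (rather than 20) are assumptions for the example; sorting by total fatal crashes is one reasonable ordering, not the only acceptable one.

```r
library(dplyr)
library(stringr)
library(forcats)
library(ggplot2)
library(tibble)

# Toy fatal-crash rows, invented for illustration
toy_fatal <- tibble(
  county_name = c("MONROE", "MONROE", "MONROE", "ERIE", "ERIE", "NEW YORK"),
  is_pedbike = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)
)

# i) Title Case, ii) counts per county and is_pedbike
toy_counts <- toy_fatal %>%
  mutate(county_name = str_to_title(county_name)) %>%
  count(county_name, is_pedbike, name = "n_fatal")

# iii) counties with the most fatal crashes overall (top 2 here, not 20)
top_counties <- toy_counts %>%
  group_by(county_name) %>%
  summarise(total = sum(n_fatal)) %>%
  slice_max(total, n = 2, with_ties = FALSE)

# Stacked bars, with counties ordered by their total count
toy_plot <- toy_counts %>%
  semi_join(top_counties, by = "county_name") %>%
  mutate(county_name = fct_reorder(county_name, n_fatal, .fun = sum)) %>%
  ggplot(aes(x = n_fatal, y = county_name, fill = is_pedbike)) +
  geom_col() +
  labs(
    x = "Fatal crashes",
    y = "County",
    fill = "Pedestrian or bicyclist involved"
  )
```

Putting the county on the y axis keeps long county names readable; printing toy_plot draws the chart.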
Download the county population data for New York from the previous census. Put the file into a sensible place in your RStudio project. Load it using read_csv, clean up the column names using janitor::clean_names, filter it down to relevant rows, and select relevant columns from it.
Hint: you will want to either remove the “County” part of the ctyname in the census data, using functions found in stringr, or mutate a new column in your crashes counts table that appends (glues) “County” onto the county_name variable.
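Both options in the hint can be sketched as follows. The tables and population figures are made up, and the population column name is hypothetical; only ctyname and county_name come from the text above.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical census rows; the populations are made-up numbers
toy_census <- tibble(
  ctyname = c("Monroe County", "Erie County"),
  population = c(700000, 900000)
)

# Hypothetical crash counts
toy_counts <- tibble(
  county_name = c("Monroe", "Erie"),
  n_fatal = c(3, 2)
)

# Option 1: strip the trailing " County" from the census names
census_short <- toy_census %>%
  mutate(county_name = str_remove(ctyname, " County$"))

# Option 2: glue " County" onto the names in the crash counts
counts_long <- toy_counts %>%
  mutate(ctyname = str_c(county_name, " County"))
```

Either way, the goal is a column whose values match exactly in both tables so that a join can use it as the key.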
Join the population data to your table of crashes from Ex 3, and repeat your plot from Ex 3, now normalizing the number of events per county by the population. (Fatalities per 100,000 population gives nice units here.) Your top 20 counties ought to be different here. Discuss what you find.
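The normalization itself is a one-line calculation once the tables are joined. A minimal sketch, assuming toy tables with hypothetical column names n_fatal and population and made-up values:

```r
library(dplyr)
library(tibble)

# Made-up counts and populations, for illustration only
toy_counts <- tibble(
  county_name = c("Monroe", "Erie"),
  n_fatal = c(3, 2)
)
toy_pop <- tibble(
  county_name = c("Monroe", "Erie"),
  population = c(750000, 1000000)
)

toy_rates <- toy_counts %>%
  left_join(toy_pop, by = "county_name") %>%
  # Fatal crashes per 100,000 residents
  mutate(fatal_per_100k = n_fatal / population * 1e5)
```

With these toy numbers the county with more residents ends up with the lower rate, which is why the ranking can change after normalizing.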
Download the vehicle miles traveled (VMT) per capita data available from the US Department of Transportation. You can read more about it here. Put the file into a sensible place in your RStudio project and load the Urbanized Area sheet into R using readxl. Clean up the column names.
Hint: your life will be made easier if you construct a “crosswalk” mapping the identifiers between the VMT dataset urbanized_area and county_name from the crashes data, either as a .csv file that you read in with read_csv or using the tibble or tribble function directly in your markdown. Then join the files using the crosswalk. Here’s an example of the first seven rows of such a crosswalk:
vehicle_miles_traveled_per_capita_raw_value using filter and str_detect. Using the list below, identify the counties corresponding to these urbanized_areas, and join this to the table. (This will not be a one-to-one join.) Then join the fixed-up VMT table to the fatal crash counts.
Metro area | County |
---|---|
New York-Newark, NY-NJ-CT | Queens |
New York-Newark, NY-NJ-CT | Kings |
New York-Newark, NY-NJ-CT | New York |
New York-Newark, NY-NJ-CT | Bronx |
New York-Newark, NY-NJ-CT | Richmond |
Rochester, NY | Monroe |
Buffalo, NY | Erie |
Albany-Schenectady, NY | Albany |
Binghamton, NY-PA | Broome |
Elmira, NY | Chemung |
Glens Falls, NY | Warren |
Ithaca, NY | Tompkins |
Kingston, NY | Ulster |
Poughkeepsie-Newburgh, NY-NJ | Dutchess |
Saratoga Springs, NY | Saratoga |
Syracuse, NY | Onondaga |
Utica, NY | Oneida |
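A crosswalk like the table above can be typed directly with tribble and used as the bridge in a join. The sketch below uses only three of the rows listed, and the VMT values and the column name vmt_per_capita are made-up placeholders, not figures from the Department of Transportation file.

```r
library(dplyr)
library(tibble)

# Three rows copied from the metro area / county list above
crosswalk <- tribble(
  ~urbanized_area,              ~county_name,
  "New York-Newark, NY-NJ-CT",  "Queens",
  "New York-Newark, NY-NJ-CT",  "Kings",
  "Rochester, NY",              "Monroe"
)

# Placeholder VMT table; the values are invented
toy_vmt <- tibble(
  urbanized_area = c("New York-Newark, NY-NJ-CT", "Rochester, NY"),
  vmt_per_capita = c(10, 20)
)

# Each county picks up the VMT value of its urbanized area,
# so one urbanized area can appear on several county rows
vmt_by_county <- crosswalk %>%
  left_join(toy_vmt, by = "urbanized_area")
```

The result has one row per county, keyed by county_name, which is the form needed for joining to the fatal crash counts.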
Derive the fatalities per 100,000 vehicle miles traveled, per county, using the crash count tables from Ex 3 and 5. Repeat your plot from Ex 5 (though now you will only have 17 counties). Discuss your findings.
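One way to read this calculation, offered as an assumption rather than the required method: since the VMT figure is per capita, multiply it by the county population to estimate total miles traveled, then divide the fatal crash count by that total. The column names and numbers below are invented.

```r
library(dplyr)
library(tibble)

# Invented single-county example
toy_joined <- tibble(
  county_name = "Monroe",
  n_fatal = 3,
  population = 600000,
  vmt_per_capita = 5000
)

toy_vmt_rate <- toy_joined %>%
  mutate(
    # Assumed estimate of total miles: per-capita VMT times population
    total_vmt = vmt_per_capita * population,
    # Fatal crashes per 100,000 vehicle miles traveled
    fatal_per_100k_vmt = n_fatal / total_vmt * 1e5
  )
```

Whether the VMT figure is daily or annual changes the scale of the result, so it is worth checking the units in the source before interpreting the rate.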
Which estimate, if any, would be most informative about the hazard rate of being a pedestrian/cyclist in NY state? What other factors would be helpful in refining your estimate of the hazard?
🧹 🧶 ✅ ⬆️ Lint, Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards, and review the .md document and the lintr report on GitHub to make sure you’re happy with the final state of your work.