In this lab we revisit the Denny’s and La Quinta Inn and Suites data we visualized in the previous lab.

Learning goals

Working with spatial data
Writing and using a custom function
Practice formatting code according to a style guide.

Getting started

Go to the course GitHub organization and locate your homework repo, clone it in RStudio and open the R Markdown document. Knit the document to make sure it compiles without errors.

Warm up

Before we introduce the data, let’s warm up with some simple exercises. Update the YAML of your R Markdown file with your information, knit, commit, and push your changes. Make sure to commit with a meaningful commit message. Then, go to your repo on GitHub and confirm that your changes are visible in your Rmd and md files. If anything is missing, commit and push again.

Packages

We’ll use the tidyverse package for much of the data wrangling and visualization and the data lives in the dsbox package. We’ll be using lintr to check the code formatting, but the mechanics of this being handled by scripts the lintr directory. These packages are already installed for you. You can load them by running the following in your Console:

library(tidyverse) 
library(dsbox)

⊕Feel free to poke around in the lintr directory if you are curious. If you want to experiment with different arguments to the lintr you may do so locally, but you will want to keep the revert button handy on the git pane so you can easily undo your changes. Don’t push any changes to files here or you might unleash skynet, or at least a bitcoin mining farm.

Linting your assignment

Part of you grade will be based on having a “lint”-free markdown at the end. (Or at least as close as possible.) You can check the lint two ways:

1 🧹 Run lint_assignment() from the Console and then look in the “Markers” pane, which will open as a tab next to the “Console”.
2 After you push, click on the ▶️ Actions tab, find your most recent commit at the top, and click on it. You should see a ✅ and then, if you don’t have a clean markdown, you should see Annotations / Warnings below. If you click around in the lint-project job, and find the Lint root directory step, you can find the offending line numbers. But it will probably be easier to just run lint_assignment() from rstudio cloud.

Note that the lintr doesn’t parse markdown prose (yet), but it’s still a good idea to keep your lines to under 80 characters, if for no other reason than it makes git merge conflicts slightly less likely.

Data

Remember that the datasets we’ll use are called dennys and laquinta from the dsbox package. Since the datasets are distributed with the package, we don’t need to load them separately; they become available to us when we load the package. You can find out more about the datasets by inspecting their documentation, which you can access by running ?dennys and ?laquinta in the Console or using the Help menu in RStudio to search for dennys or laquinta. You can also find this information here and here.

Exercises

i.🧹 Run lint_assignment(). What messages do you get in the “Markers” panel? Report them here. ii. Fix the formatting of the code in the lintr-warm-up chunk, and run lint_assignment() again. Verify that you now get no messages. iii. Push and verify that you have no annotations on the lint-project Action tab on github.com.
Filter the Denny’s data frame for Alaska (AK) and save the result as dn_ak. How many Denny’s locations are there in Alaska?

dn_ak = dennys %>%
  filter(state == "AK")
nrow(dn_ak)

## [1] 3

Filter the La Quinta data frame for Alaska (AK) and save the result as lq_ak. How many La Quinta locations are there in Alaska?

lq_ak = laquinta %>%
  filter(state == "AK")
nrow(lq_ak)

## [1] 2

Next we’ll calculate the distance between all Denny’s and all La Quinta locations in Alaska. Let’s take this step by step:

Step 1: There are 3 Denny’s and 2 La Quinta locations in Alaska. (If you answered differently above, you might want to recheck your answers.)

Step 2: Let’s focus on the first Denny’s location. We’ll need to calculate two distances for it: (1) distance between Denny’s 1 and La Quinta 1 and (2) distance between Denny’s 1 and La Quinta (2).

Step 3: Now let’s consider all Denny’s locations.

How many pairings are there between all Denny’s and all La Quinta locations in Alaska, i.e. how many distances do we need to calculate between the locations of these establishments in Alaska?

In order to calculate these distances we need to first restructure our data to pair the Denny’s and La Quinta locations. To do so, we will join the two data frames. We have six join options in R. Each of these join functions take at least three arguments: x, y, and by.

x and y are data frames to join
by is the variable(s) to join by

Four of these join functions combine variables from the two data frames:

⊕These are called mutating joins.

inner_join(): return all rows from x where there are matching values in y, and all columns from x and y.
left_join(): return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns.
right_join(): return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns.
full_join(): return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.

And the other two join functions only keep cases from the left-hand data frame, and are called filtering joins. We’ll learn about these another time but you can find out more about the join functions in the help files for any one of them, e.g. ?full_join.

In practice we mostly use mutating joins. In this case we want to keep all rows and columns from both dn_ak and lq_ak data frames. So we will use a full_join.

Full join of Denny’s and La Quinta locations in AK

Let’s join the data on Denny’s and La Quinta locations in Alaska, and take a look at what it looks like:

dn_lq_ak = full_join(dn_ak, lq_ak, by = "state")
dn_lq_ak

## # A tibble: 6 × 11
##   address.x   city.x state zip.x longitude.x latitude.x address.y  city.y  zip.y
##   <chr>       <chr>  <chr> <chr>       <dbl>      <dbl> <chr>      <chr>   <chr>
## 1 2900 Denali Ancho… AK    99503       -150.       61.2 3501 Minn… "\nAnc… 99503
## 2 2900 Denali Ancho… AK    99503       -150.       61.2 4920 Dale… "\nFai… 99709
## 3 3850 Debar… Ancho… AK    99508       -150.       61.2 3501 Minn… "\nAnc… 99503
## 4 3850 Debar… Ancho… AK    99508       -150.       61.2 4920 Dale… "\nFai… 99709
## 5 1929 Airpo… Fairb… AK    99701       -148.       64.8 3501 Minn… "\nAnc… 99503
## 6 1929 Airpo… Fairb… AK    99701       -148.       64.8 4920 Dale… "\nFai… 99709
## # … with 2 more variables: longitude.y <dbl>, latitude.y <dbl>

How many observations are in the joined dn_lq_ak data frame? What are the names of the variables in this data frame.

.x in the variable names means the variable comes from the x data frame (the first argument in the full_join call, i.e. dn_ak), and .y means the variable comes from the y data frame. These variables are renamed to include .x and .y because the two data frames have the same variables and it’s not possible to have two variables in a data frame with the exact same name.

Now that we have the data in the format we wanted, all that is left is to calculate the distances between the pairs.

🧹 🧶 ✅ ⬆️ Lint, Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.

One way of calculating the distance between any two points on the earth is to use the Haversine distance formula. This formula takes into account the fact that the earth is not flat, but instead spherical.

This function is not available in R, but we have it saved in a file called haversine.R that we can load and then use. (We do this for you by sourceing the file in the provided code-chunk.)

haversine = function(long1, lat1, long2, lat2, round = 3) {
  # convert to radians
  long1 = long1 * pi / 180
  lat1  = lat1  * pi / 180
  long2 = long2 * pi / 180
  lat2  = lat2  * pi / 180
  
  earth_r = 6371 # Earth mean radius in km
  
  a = sin((lat2 - lat1) / 2) ^ 2 + 
    cos(lat1) * cos(lat2) * sin((long2 - long1) / 2) ^ 2
  d = earth_r * 2 * asin(sqrt(a))
  
  return(round(d, round)) # distance in km
}

This function takes five arguments:

Longitude and latitude of the first location
Longitude and latitude of the second location
A parameter by which to round the responses

Still considering the state of Alaska, calculate the distances between all pairs of Denny’s and La Quinta locations and save this variable as distance. Save this variable in dn_lq_ak data frame so that you can use it later.
Calculate the minimum distance between a Denny’s and La Quinta for each Denny’s location. To do so we group by Denny’s locations and calculate a new variable that stores the information for the minimum distance.

dn_lq_ak_mindist = dn_lq_ak %>%
  group_by(address.x) %>%
  summarise(closest = min(distance))

Describe the distribution of the nearest La Quinta location for each Denny’s in Alaska. Include an appropriate visualization.

🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.

Repeat the same analysis for New York: (i) filter Denny’s and La Quinta Data Frames for NY, (ii) join these data frames to get a complete list of all possible pairings, (iii) calculate the distances between all possible pairings of Denny’s and La Quinta in NY, (iv) find the minimum distance between each Denny’s and La Quinta location, (v) visualize and describe the distribution of these shortest distances using appropriate summary statistics.
Repeat the same analysis for a state of your choosing, different than the ones we covered so far. Note that there’s a way you could execute steps 8 and 9 simultaneously and generate a faceted visualization. If you understand how to do this, feel free to do so.
Among the states you examined, where is Mitch Hedberg’s joke most likely to hold true? Explain your reasoning.

🧹 🧶 ✅ ⬆️ Lint ,Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards and review the md document on GitHub to make sure you’re happy with the final state of your work.

Rubric

37 points total. * 11 questions @ 2 points for correct and complete answers * 5 points github history * 10 points for <= 2 lint, and -0.5 point for every 1 lint thereafter

Lab 04 - La Quinta is Spanish for next to Denny’s, Pt. 2

Wrangling spatial data