In this assignment we explore the 2016 General Social Survey data set. The GSS gathers data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes. Hundreds of trends have been tracked since 1972.

Exercises

Scientific research

In this section we’re going to build a model to predict whether someone agrees or doesn’t agree with the following statement:

Even if it brings no immediate benefits, scientific research that advances the frontiers of knowledge is necessary and should be supported by the federal government.

The responses to the question on the GSS about this statement are in the advfront variable.

Note: It’s important that you don’t recode the NAs, just the remaining levels.

Re-level the advfront variable such that it has two levels: "Strongly agree" and "Agree" combined into a new level called "Agree", and the remaining levels (except NAs) combined into "Not agree". Then, re-order the levels in the following order: "Agree" and "Not agree". Finally, count() how many times each new level appears in the advfront variable.
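
As a rough illustration of collapsing levels while leaving NAs alone, here is a minimal sketch on a made-up character vector. The values in toy_resp are hypothetical and are not taken from the GSS data, and case_when() is only one of several ways to do this.

```r
library(dplyr)

# Hypothetical toy responses, not the actual advfront values
toy_resp <- c("Strongly agree", "Agree", "Disagree", "Dont know", NA, "Agree")

toy_relevelled <- case_when(
  toy_resp %in% c("Strongly agree", "Agree") ~ "Agree",
  is.na(toy_resp) ~ NA_character_,   # keep missing values missing
  TRUE ~ "Not agree"                 # everything else is pooled
)

# Fix the level order explicitly: "Agree" first, then "Not agree"
toy_relevelled <- factor(toy_relevelled, levels = c("Agree", "Not agree"))

# Count each level, showing NA as its own row
count(tibble(toy_relevelled), toy_relevelled)
```

The explicit is.na() branch is what keeps the NAs from being swept into "Not agree".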

Note: You can do this in various ways. One option is to use the str_detect() function to detect the existence of words like liberal or conservative. Note that these sometimes show up with lowercase first letters and sometimes with uppercase first letters. To detect either in the str_detect() function, you can use "[Ll]iberal" and "[Cc]onservative". But feel free to solve the problem however you like; this is just one option!
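
The str_detect() idea from the note can be sketched on a small made-up vector. The labels in toy_views are assumptions chosen for illustration and may not match the actual polviews levels.

```r
library(dplyr)
library(stringr)

# Hypothetical labels with mixed capitalisation, plus one NA
toy_views <- c("Extremely liberal", "Liberal", "Slightly liberal", "Moderate",
               "Slightly conservative", "Conservative", "Extremely conservative", NA)

toy_lumped <- case_when(
  str_detect(toy_views, "[Ll]iberal") ~ "Liberal",
  str_detect(toy_views, "[Cc]onservative") ~ "Conservative",
  TRUE ~ toy_views   # anything else (including NA) is left unchanged
)

# Put the levels in the requested order
toy_lumped <- factor(toy_lumped, levels = c("Conservative", "Moderate", "Liberal"))

count(tibble(toy_lumped), toy_lumped)
```

Because str_detect() returns NA for a missing input, the NA falls through to the last branch and stays NA.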

Combine the levels of the polviews variable such that levels that have the word "liberal" in them are lumped into a level called "Liberal" and those that have the word "conservative" in them are lumped into a level called "Conservative". Then, re-order the levels in the following order: "Conservative", "Moderate", and "Liberal". Finally, count() how many times each new level appears in the polviews variable.
Create a new data frame called gss16_advfront that includes the variables advfront, educ, polviews, and wrkstat. Then, use the drop_na() function to remove rows that contain NAs from this new data frame.
Split the data into training (75%) and testing (25%) data sets. Make sure to set a seed before you do the initial_split(). Call the training data gss16_train and the testing data gss16_test. Sample code is provided below. Use these specific names to make it easier to follow the rest of the instructions.

set.seed(___)
gss16_split = initial_split(gss16_advfront)
gss16_train = training(gss16_split)
gss16_test  = testing(gss16_split)

Create a recipe with the following steps for predicting advfront from polviews, wrkstat, and educ. Name this recipe gss16_rec_1. (We’ll create one more recipe later, that’s why we’re naming this recipe _1.) Sample code is provided below.
- step_other() to pool values that occur less than 10% of the time (threshold = 0.10) in the wrkstat variable into "Other".
- step_dummy() to create dummy variables for all_nominal() variables that are predictors, i.e. all_predictors()

gss16_rec_1 = recipe(___ ~ ___, data = ___) %>%
  step_other(wrkstat, threshold = ___, other = "Other") %>%
  step_dummy(all_nominal(), -all_outcomes())
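
To see what these two steps do, it can help to prep() and bake() a recipe on a small data frame. The sketch below uses a simulated data frame (toy, with made-up columns out, job, and yrs), not the GSS data.

```r
library(tidymodels)

set.seed(1)
# Simulated data: two common job levels and two rare ones (5% each)
toy <- data.frame(
  out = factor(sample(c("yes", "no"), 100, replace = TRUE)),
  job = factor(c(rep("full", 60), rep("part", 30), rep("temp", 5), rep("intern", 5))),
  yrs = rnorm(100, mean = 13, sd = 3)
)

toy_rec <- recipe(out ~ job + yrs, data = toy) %>%
  step_other(job, threshold = 0.10, other = "Other") %>%  # pool levels under 10%
  step_dummy(all_nominal(), -all_outcomes())              # dummies for nominal predictors

# Estimate the steps on toy and return the processed training data
toy_baked <- bake(prep(toy_rec), new_data = NULL)
names(toy_baked)
```

In the baked column names, the rare levels should no longer appear on their own; they are represented by a single dummy for "Other".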

Specify a logistic regression model using "glm" as the engine. Name this specification gss16_spec. Sample code is provided below.

gss16_spec = ___() %>%
  set_engine("___")

Build a workflow that uses the recipe you defined (gss16_rec_1) and the model you specified (gss16_spec). Name this workflow gss16_wflow_1. Sample code is provided below.

gss16_wflow_1 = workflow() %>%
  add_model(___) %>%
  add_recipe(___)

Perform 5-fold cross validation. Specifically,
- split the training data into 5 folds (don’t forget to set a seed first!),
- apply the workflow you defined earlier to the folds with fit_resamples(),
- collect_metrics() and comment on the consistency of metrics across folds (you can get the area under the ROC curve and the accuracy for each fold by setting summarize = FALSE in collect_metrics()), and
- report the average area under the ROC curve and the accuracy across all cross validation folds with collect_metrics().

set.seed(___)
gss16_folds = vfold_cv(___, v = ___)

gss16_fit_rs_1 = gss16_wflow_1 %>%
  fit_resamples(___)

collect_metrics(___, summarize = FALSE)
collect_metrics(___)
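
For reference, the same cross-validation pattern can be run end to end on simulated data. Everything prefixed with toy_ below is hypothetical, and the workflow uses add_formula() instead of a recipe to keep the sketch short.

```r
library(tidymodels)

set.seed(123)
# Simulated binary outcome that depends on one numeric predictor
toy_data <- tibble(
  x = rnorm(200),
  out = factor(ifelse(x + rnorm(200) > 0, "yes", "no"), levels = c("yes", "no"))
)

toy_wflow <- workflow() %>%
  add_model(logistic_reg() %>% set_engine("glm")) %>%
  add_formula(out ~ x)

set.seed(456)
toy_folds <- vfold_cv(toy_data, v = 5)

# Fit the workflow once per fold and evaluate on the held-out part
toy_fit_rs <- fit_resamples(toy_wflow, resamples = toy_folds)

toy_per_fold <- collect_metrics(toy_fit_rs, summarize = FALSE)  # one row per fold and metric
toy_summary  <- collect_metrics(toy_fit_rs)                     # averaged over folds
toy_per_fold
toy_summary
```

Comparing the per-fold rows with the averaged rows is what lets you judge how consistent the metrics are across folds.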

Now, try a different, simpler model: predict advfront from only polviews and educ. Specifically,
- update the recipe to reflect this simpler model specification (and name it gss16_rec_2),
- redefine the workflow with the new recipe (and name this new workflow gss16_wflow_2),
- perform cross validation, and
- report the average area under the ROC curve and the accuracy across all cross validation folds with collect_metrics().
Comment on which model performs better (one including wrkstat, model 1, or the one excluding wrkstat, model 2) on the training data based on area under the ROC curve.
Apply both models to the testing data, plot the ROC curves for the predictions from both models, and calculate the areas under the ROC curves. Does your answer to the previous exercise hold for the testing data as well? Explain your reasoning. Note: If you haven’t yet done so, you’ll need to first train your workflows on the training data with the following, and then use these fit objects to calculate predictions for the test data.

gss16_fit_1 = gss16_wflow_1 %>%
  fit(gss16_train)

gss16_fit_2 = gss16_wflow_2 %>%
  fit(gss16_train)

Extra credit

(Extra credit) Refit your model and recipe from Ex5-Ex8 using glmnet. Select the penalty and mixture tuning parameters by cross-validation over the training data.
(Extra credit) Generate a plot comparing the test-data ROC from your model in Ex 12 with the results from Ex 11.
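
As a sketch of how a fitted workflow turns into a test-set ROC curve, here is the full path on simulated data. The toy_ objects are hypothetical; the column .pred_yes exists only because the toy outcome has a level called "yes", so the name will differ for other outcomes.

```r
library(tidymodels)

set.seed(123)
toy_data <- tibble(
  x = rnorm(300),
  out = factor(ifelse(x + rnorm(300) > 0, "yes", "no"), levels = c("yes", "no"))
)

toy_split <- initial_split(toy_data)   # default 75% / 25% split
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

toy_fit <- workflow() %>%
  add_model(logistic_reg() %>% set_engine("glm")) %>%
  add_formula(out ~ x) %>%
  fit(toy_train)

# Class probabilities for the test rows, bound to the true outcome
toy_pred <- predict(toy_fit, toy_test, type = "prob") %>%
  bind_cols(toy_test)

# yardstick treats the first factor level ("yes") as the event by default
toy_auc <- roc_auc(toy_pred, truth = out, .pred_yes)
toy_auc

roc_curve(toy_pred, truth = out, .pred_yes) %>%
  autoplot()
```

To compare two models in one plot, one option is to compute roc_curve() for each, add a column naming the model, bind the rows, and draw the curves with ggplot2.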

🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.

Part ii: (many linear models)

Using gapminder, fit a linear model over all countries of log(pop) on (year - 1990), so that the intercept reflects the log-population in 1990. Interpret the slope and intercept in terms of the unlogged variables.

library(gapminder)
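
To build intuition for back-transforming a log-linear fit, here is a minimal sketch on simulated data with a known 2% yearly growth rate and a known 1990 level of one million. The vectors sim_year and sim_pop are made up and are not gapminder values.

```r
set.seed(1)
sim_year <- seq(1952, 2007, by = 5)

# Population grows 2% per year around a 1990 level of 1e6, with small noise
sim_pop <- 1e6 * 1.02^(sim_year - 1990) * exp(rnorm(length(sim_year), sd = 0.01))

sim_fit <- lm(log(sim_pop) ~ I(sim_year - 1990))

# exp(intercept): fitted population in 1990
level_1990 <- exp(coef(sim_fit)[[1]])

# exp(slope) - 1: fitted proportional change in population per year
growth_rate <- exp(coef(sim_fit)[[2]]) - 1

level_1990
growth_rate
```

Because the model is linear in log(pop), a one-year step multiplies the fitted population by exp(slope), which is why the slope reads as a growth rate rather than an absolute change.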

Now, completing the code below, fit this model per country. In a table, report the top 3 and bottom 3 countries in terms of the population growth rates, and their 95% confidence intervals.

fit_pop = function(){
  ##Fill in this function
}

many_fits = gapminder %>% 
  group_by(country) %>%
  summarize(fit = fit_pop(across()))  %>%
  ungroup()

filter(many_fits, country == 'India')$fit[[1]]
filter(many_fits, country == 'Italy')$fit[[1]]

many_fits %>% 
  rowwise() %>%
  mutate(tidy_out = list(tidy(___))) %>%
  unnest(___)
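
The general fit-one-model-per-group pattern can be sketched on the built-in mtcars data. This is a stand-in with a hypothetical helper fit_one(), and it uses nest() and map() rather than the summarize() and rowwise() template above.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# Hypothetical helper: fit one linear model to one group's rows
fit_one <- function(df) {
  lm(mpg ~ wt, data = df)
}

per_group <- mtcars %>%
  nest(data = -cyl) %>%                                # one row per cyl, data in a list-column
  mutate(
    fit = map(data, fit_one),                          # list-column of lm objects
    tidy_out = map(fit, tidy, conf.int = TRUE)         # coefficients with 95% intervals
  ) %>%
  select(cyl, tidy_out) %>%
  unnest(tidy_out)

# Slopes only, sorted from smallest to largest
per_group %>%
  filter(term == "wt") %>%
  arrange(estimate)
```

Sorting the slope rows and taking the first and last few with slice_head() and slice_tail() gives a top and bottom table that carries conf.low and conf.high along.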

🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards, and review the md document on GitHub to make sure you’re happy with the final state of your work.

HW 05 - Modeling the GSS
