class: ur-title, center, middle, title-slide # BST430 Lecture 17 ## Prediction and Logistic Regression ### Andrew McDavid ### U of Rochester ### 2021-11-14 (updated: 2021-11-15) --- class: middle # Prediction --- ## Goal: Building a spam filter - Data: Set of emails and we know if each email is spam/not and other features - Use logistic regression to predict the probability that an incoming email is spam - Optimize our model by picking the model with the best predictive performance -- - Building a model to predict the probability that an email is spam is only half of the battle! We also need a decision rule about which emails get flagged as spam (e.g. what probability should we use as out cutoff?) -- - A simple approach: choose a single threshold probability and any email that exceeds that probability is flagged as spam --- class: middle # Logistic regression --- ## Generalized Linear Models (GLMs) - Logistic regression is a *generalized linear model(GLM)* used to model a binary categorical outcome using numerical and categorical predictors - A GLM is an extension of the least squares linear model `\(E(Y_i|\mathbf{x}_i) = \mathbf{x}_i \beta\)`, `\(\text{Var}(Y|\mathbf{x}_i) = \sigma^2\)` to `$$g(E(Y_i|x_i)) = \eta_i = \mathbf{x}_i \beta$$` , `\(Y|X \sim\)` Exponential-Family, and `\(g\)` is known as a **link** function. - Whereby the glm allows non-linear effects in `\(\beta\)`, and since the conditional distribution of `\(Y|X\)` can include all sorts of interesting families, it also permits non-constant residual variance. --- ### The Bernoulli GLM with logit link - To finish specifying the Logistic model we just need to define a reasonable link function that connects `\(\eta_i\)` to `\(p_i\)`: logit function -- - **Logit function:** For `\(0\le p \le 1\)` `$$logit(p) = \log\left(\frac{p}{1-p}\right)$$` --- ## Logit function, visualised <img src="l17-prediction-logistic_files/figure-html/unnamed-chunk-2-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Properties of the logit - The logit function takes a value between 0 and 1 and maps it to a value between `\(-\infty\)` and `\(\infty\)` -- - Inverse logit (aka expit) function: `$$g^{-1}(x) = \frac{\exp(x)}{1+\exp(x)} = \frac{1}{1+\exp(-x)}$$` -- - The expit function takes a value between `\(-\infty\)` and `\(\infty\)` and maps it to a value between 0 and 1 -- - This formulation is also useful for interpreting the model, since the logit can be interpreted as the log odds of a success -- more on this later --- ## The logistic regression model - Based on the three GLM criteria we have - `\(y_i|x_i \sim \text{Bern}(p_i)\)` - `\(\text{logit}(p_i) = \eta_i\)` - `\(\eta_i = \beta_0+\beta_1 x_{1,i} + \cdots + \beta_n x_{n,i}\)` -- - From which we get `$$p_i = \frac{\exp(\beta_0+\beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}{1+\exp(\beta_0+\beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}$$` --- ## Modeling spam In R we fit a GLM in the same way as a linear model except we - use `"glm"` instead of `"lm"` as the engine - define `family = "binomial"` for the link function to be used in the model -- - When using `tidymodels`, specify the model with `logistic_reg()` --- ## Prediction - The mechanics of prediction is **easy**: - Plug in values of predictors to the model equation - Calculate the predicted value of the response variable, `\(\hat{y}\)` -- - Getting it right is **hard**! - There is no guarantee the model estimates you have are correct - Or that your model will perform as well with new data as it did with your sample data --- ## Underfitting and overfitting <img src="l17-prediction-logistic_files/figure-html/unnamed-chunk-3-1.png" width="70%" style="display: block; margin: auto;" /> --- ## Spending our data - Several steps to create a useful model: parameter estimation, model selection, performance assessment, etc. - Doing all of this on the entire data we have available can lead to **overfitting** - Allocate specific subsets of data for different tasks, as opposed to allocating the largest possible amount to the model parameter estimation only (which is what we've done so far) --- class: middle # Splitting data --- ## Splitting data - **Training set:** - Sandbox for model building - Spend most of your time using the training set to develop the model - Majority of the data (usually 80%) - **Testing set:** - Held in reserve to determine efficacy of one or two chosen models - Critical to look at it once, otherwise it becomes part of the modeling process - Remainder of the data (usually 20%) --- ## Performing the split ```r # Fix random numbers by setting the seed # Enables analysis to be reproducible when random numbers are used set.seed(20211115) # Put 80% of the data into the training set email_split = initial_split(email, prop = 0.80) # Create data frames for the two sets: train_data = training(email_split) test_data = testing(email_split) ``` .font70[`training` / `testing` aren't doing anything very fancy -- just using `sample_n` and then taking its complement, but this approach generalizes to more complicated ways to split our data.] --- class: code70 ## Peek at the split .small[ .pull-left[ ```r glimpse(train_data) ``` ``` ## Rows: 3,136 ## Columns: 21 ## $ spam <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,… ## $ to_multiple <fct> 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,… ## $ from <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,… ## $ cc <int> 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… ## $ sent_email <fct> 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… ## $ time <dttm> 2012-01-23 19:23:58, 2012-03-28 09:13:47, 2012-02-13 10:05:15, 2012-02-06 14… ## $ image <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… ## $ attach <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,… ## $ dollar <dbl> 0, 0, 0, 0, 2, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 46, 0, 1, 4, 0, 0, 0, 6, 0, 0… ## $ winner <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, yes, no, no, … ## $ inherit <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… ## $ viagra <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… ## $ password <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0,… ## $ num_char <dbl> 24.832, 4.998, 3.568, 26.110, 7.870, 5.127, 2.710, 5.005, 7.911, 11.933, 4.66… ## $ line_breaks <int> 624, 114, 81, 652, 253, 120, 74, 118, 190, 165, 115, 256, 349, 640, 101, 349,… ## $ format <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1,… ## $ re_subj <fct> 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,… ## $ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,… ## $ urgent_subj <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… ## $ exclaim_mess <dbl> 2, 0, 1, 1, 8, 0, 0, 2, 5, 0, 8, 0, 8, 1, 1, 0, 8, 0, 0, 0, 1, 0, 0, 1, 5, 1,… ## $ number <fct> small, small, small, small, big, small, small, small, small, small, small, sm… ``` ] .pull-right[ ```r glimpse(test_data) ``` ``` ## Rows: 785 ## Columns: 21 ## $ spam <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… ## $ to_multiple <fct> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0,… ## $ from <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,… ## $ cc <int> 1, 2, 1, 7, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 35, 0, 1… ## $ sent_email <fct> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1,… ## $ time <dttm> 2012-01-01 14:38:32, 2012-01-01 18:32:53, 2012-01-01 22:16:39, 2012-01-02 19… ## $ image <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… ## $ attach <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… ## $ dollar <dbl> 0, 2, 0, 0, 0, 2, 0, 9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… ## $ winner <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, n… ## $ inherit <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… ## $ viagra <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… ## $ password <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… ## $ num_char <dbl> 15.075, 19.693, 3.959, 0.334, 1.482, 4.103, 4.512, 4.761, 4.040, 1.401, 1.389… ## $ line_breaks <int> 354, 330, 81, 9, 24, 173, 107, 151, 76, 41, 37, 21, 145, 309, 183, 114, 135, … ## $ format <fct> 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1,… ## $ re_subj <fct> 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1,… ## $ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,… ## $ urgent_subj <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… ## $ exclaim_mess <dbl> 10, 4, 1, 0, 0, 0, 0, 1, 4, 0, 8, 2, 1, 3, 0, 1, 4, 7, 2, 9, 0, 2, 0, 0, 0, 1… ## $ number <fct> small, big, small, small, none, small, small, big, small, small, small, none,… ``` ] ] --- class: middle # Modeling workflow --- ## Fit a model to the training dataset ```r email_fit = logistic_reg() %>% set_engine("glm") %>% fit(spam ~ ., data = train_data, family = "binomial") ``` ``` ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ``` --- ## Categorical predictors <img src="l17-prediction-logistic_files/figure-html/unnamed-chunk-8-1.png" width="75%" style="display: block; margin: auto;" /> --- ## Fit a model to the training dataset ```r email_fit = logistic_reg() %>% set_engine("glm") %>% * fit(spam ~ . - from - sent_email - viagra, data = train_data, family = "binomial") ``` ``` ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ``` .code50[ ```r email_fit ``` ``` ## parsnip model object ## ## Fit time: 65ms ## ## Call: stats::glm(formula = spam ~ . - from - sent_email - viagra, family = stats::binomial, ## data = data) ## ## Coefficients: ## (Intercept) to_multiple1 cc time image attach dollar ## -1.043e+02 -2.802e+00 2.799e-02 7.810e-08 -3.578e+00 6.103e-01 -7.948e-02 ## winneryes inherit password num_char line_breaks format1 re_subj1 ## 2.111e+00 4.099e-01 -7.562e-01 6.814e-02 -5.259e-03 -9.421e-01 -3.474e+00 ## exclaim_subj urgent_subj1 exclaim_mess numbersmall numberbig ## 2.886e-01 2.734e+00 7.918e-03 -7.879e-01 2.957e-02 ## ## Degrees of Freedom: 3135 Total (i.e. Null); 3117 Residual ## Null Deviance: 1933 ## Residual Deviance: 1413 AIC: 1451 ``` ] ??? Dropping `from`, `sent_email` quells the warnings about predicted probabilities being 0/1. Actually some of these variables might have predictive power, but we would need a more sophisticated approach than logistic regression to use them --- ## Predict outcome on the testing dataset ```r predict(email_fit, test_data) ``` ``` ## # A tibble: 785 x 1 ## .pred_class ## <fct> ## 1 0 ## 2 0 ## 3 0 ## 4 0 ## 5 0 ## 6 0 ## 7 0 ## 8 0 ## 9 0 ## 10 0 ## # … with 775 more rows ``` --- ## Predict probabilities on the testing dataset ```r email_pred = predict(email_fit, test_data, type = "prob") %>% bind_cols(test_data %>% select(spam, time)) email_pred ``` ``` ## # A tibble: 785 x 4 ## .pred_0 .pred_1 spam time ## <dbl> <dbl> <fct> <dttm> ## 1 0.999 0.00127 0 2012-01-01 14:38:32 ## 2 0.996 0.00373 0 2012-01-01 18:32:53 ## 3 0.994 0.00596 0 2012-01-01 22:16:39 ## 4 0.992 0.00801 0 2012-01-02 19:07:51 ## 5 0.845 0.155 0 2012-01-02 16:55:03 ## 6 0.962 0.0376 0 2012-01-03 14:51:16 ## 7 0.996 0.00414 0 2012-01-03 17:34:13 ## 8 0.997 0.00308 0 2012-01-03 17:40:07 ## 9 0.927 0.0728 0 2012-01-03 19:06:48 ## 10 0.998 0.00243 0 2012-01-03 19:36:42 ## # … with 775 more rows ``` --- ## A closer look at predictions ```r email_pred %>% arrange(desc(.pred_1)) %>% print(n = 10) ``` ``` ## # A tibble: 785 x 4 ## .pred_0 .pred_1 spam time ## <dbl> <dbl> <fct> <dttm> ## 1 0.0970 0.903 1 2012-03-27 01:17:01 ## 2 0.165 0.835 1 2012-03-01 00:40:27 *## 3 0.191 0.809 1 2012-01-21 16:55:55 ## 4 0.206 0.794 0 2012-01-02 22:30:31 ## 5 0.243 0.757 1 2012-03-31 05:20:24 ## 6 0.258 0.742 1 2012-03-17 12:20:57 *## 7 0.288 0.712 0 2012-02-03 08:25:39 ## 8 0.293 0.707 1 2012-02-24 19:43:44 ## 9 0.308 0.692 0 2012-01-31 10:07:48 ## 10 0.346 0.654 0 2012-02-04 05:39:37 ## # … with 775 more rows ``` --- ## Evaluate the performance **Receiver operating characteristic (ROC) curve**<sup>+</sup> which plot true positive rate vs. false positive rate (1 - specificity) .pull-left[ ```r email_pred %>% roc_curve( truth = spam, .pred_1, event_level = "second" ) %>% autoplot() ``` ] .pull-right[ <img src="l17-prediction-logistic_files/figure-html/unnamed-chunk-14-1.png" width="100%" style="display: block; margin: auto;" /> ] .footnote[ .small[ <sup>+</sup>Originally developed for operators of military radar receivers, hence the name. ] ] --- ## Evaluate the performance Find the area under the curve: .pull-left[ ```r email_pred %>% roc_auc( truth = spam, .pred_1, event_level = "second" ) ``` ``` ## # A tibble: 1 x 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 roc_auc binary 0.840 ``` ] .pull-right[ <img src="l17-prediction-logistic_files/figure-html/unnamed-chunk-16-1.png" width="100%" style="display: block; margin: auto;" /> ] --- class: middle # Feature engineering --- ## Feature engineering - There are all sorts of ways to build predictive models of binary variables: random forests, support vector machines, neural networks, `\(k\)` nearest neighbors, gradient boosting, ... - In their own way, each learns the mapping `\(\hat f: \mathbf{x} \mapsto Y\)` -- - But the variables `\(\mathbf{x}\)` that go into the model and how they are represented are just as critical to success of the model -- - **Feature engineering** allows us to get creative with our predictors in an effort to make them more useful for our model (to increase its predictive performance) - How this engineering is done is part of the learned function `\(\hat f\)`, and needs to be accounted for when we evaluate our models. --- ## A simple approach: `mutate()` ```r library(lubridate) train_data %>% mutate( date = date(time), dow = wday(time), month = month(time) ) %>% select(time, date, dow, month) %>% sample_n(size = 5) # shuffle to show a variety ``` ``` ## # A tibble: 5 x 4 ## time date dow month ## <dttm> <date> <dbl> <dbl> ## 1 2012-01-31 17:06:58 2012-01-31 3 1 ## 2 2012-02-02 14:38:31 2012-02-02 5 2 ## 3 2012-02-14 06:17:18 2012-02-14 3 2 ## 4 2012-03-05 12:55:12 2012-03-05 2 3 ## 5 2012-02-23 13:12:30 2012-02-23 5 2 ``` ??? time would be used a "seconds" since the epoch as a continuous variable, however this would not allow any flexibility about weekdays vs weekends, time of year, etc. however, we would need to repeat this mutate on train / test independently, at a minimum this would not keep things D.R.Y. It's actually even worse than that, in that some mutates (collapsing factors, scaling variables) have implicitly learned parameters -- if we don't account for this we can end up with data leaking between training and test or with garbage predictions in test --- ## Modeling workflow, revisited - Create a **recipe** for feature engineering steps to be applied to the training data -- - Fit the model to the training data after these steps have been applied -- - Using the model estimates from the training data, predict outcomes for the test data -- - Evaluate the performance of the model on the test data --- class: middle # Building recipes ??? Fancy mutate statements that can be applied easily to both training / test Actually, is building up and saving a set of functions (without evaluating them) --- ## Initiate a recipe ```r email_rec = recipe( spam ~ ., # formula data = train_data # data to use for cataloguing names and types of variables ) summary(email_rec) ``` .code50[ ``` ## # A tibble: 21 x 4 ## variable type role source ## <chr> <chr> <chr> <chr> ## 1 to_multiple nominal predictor original ## 2 from nominal predictor original ## 3 cc numeric predictor original ## 4 sent_email nominal predictor original ## 5 time date predictor original ## 6 image numeric predictor original ## 7 attach numeric predictor original ## 8 dollar numeric predictor original ## 9 winner nominal predictor original ## 10 inherit numeric predictor original ## 11 viagra numeric predictor original ## 12 password numeric predictor original ## 13 num_char numeric predictor original ## 14 line_breaks numeric predictor original ## 15 format nominal predictor original ## 16 re_subj nominal predictor original ## 17 exclaim_subj numeric predictor original ## 18 urgent_subj nominal predictor original ## 19 exclaim_mess numeric predictor original ## 20 number nominal predictor original ## 21 spam nominal outcome original ``` ] --- ## Remove certain variables ```r email_rec = email_rec %>% step_rm(from, sent_email) ``` .code70[ ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 20 ## ## Operations: ## ## Delete terms from, sent_email ``` ] --- ## Feature engineer date ```r email_rec = email_rec %>% step_date(time, features = c("dow", "month")) %>% step_rm(time) ``` .code70[ ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 20 ## ## Operations: ## ## Delete terms from, sent_email ## Date features from time ## Delete terms time ``` ] --- ## Discretize numeric variables ```r email_rec = email_rec %>% step_cut(cc, attach, dollar, breaks = c(0, 1)) %>% step_cut(inherit, password, breaks = c(0, 1, 5, 10, 20)) ``` .code70[ ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 20 ## ## Operations: ## ## Delete terms from, sent_email ## Date features from time ## Delete terms time ## Cut numeric for cc, attach, dollar ## Cut numeric for inherit, password ``` ] --- ## Create dummy variables ```r email_rec = email_rec %>% step_dummy(all_nominal(), -all_outcomes()) ``` .code70[ ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 20 ## ## Operations: ## ## Delete terms from, sent_email ## Date features from time ## Delete terms time ## Cut numeric for cc, attach, dollar ## Cut numeric for inherit, password ## Dummy variables from all_nominal(), -all_outcomes() ``` ] --- ## Remove zero variance variables Variables that contain only a single value ```r email_rec = email_rec %>% step_zv(all_predictors()) ``` .code70[ ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 20 ## ## Operations: ## ## Delete terms from, sent_email ## Date features from time ## Delete terms time ## Cut numeric for cc, attach, dollar ## Cut numeric for inherit, password ## Dummy variables from all_nominal(), -all_outcomes() ## Zero variance filter on all_predictors() ``` ] --- ## All in one place ```r email_rec = recipe(spam ~ ., data = email) %>% step_rm(from, sent_email) %>% step_date(time, features = c("dow", "month")) %>% step_rm(time) %>% step_cut(cc, attach, dollar, breaks = c(0, 1)) %>% step_cut(inherit, password, breaks = c(0, 1, 5, 10, 20)) %>% step_dummy(all_nominal(), -all_outcomes()) %>% step_zv(all_predictors()) ``` --- class: middle # Building workflows --- ## Define model ```r email_mod = logistic_reg() %>% set_engine("glm") email_mod ``` ``` ## Logistic Regression Model Specification (classification) ## ## Computational engine: glm ``` --- ## Define workflow **Workflows** bring together models and recipes so that they can be easily applied to both the training and test data. ```r email_wflow = workflow() %>% add_model(email_mod) %>% add_recipe(email_rec) ``` .code60[ ``` ## ══ Workflow ════════════════════════════════════════════════════════════════════════════════════════ ## Preprocessor: Recipe ## Model: logistic_reg() ## ## ── Preprocessor ──────────────────────────────────────────────────────────────────────────────────── ## 7 Recipe Steps ## ## • step_rm() ## • step_date() ## • step_rm() ## • step_cut() ## • step_cut() ## • step_dummy() ## • step_zv() ## ## ── Model ─────────────────────────────────────────────────────────────────────────────────────────── ## Logistic Regression Model Specification (classification) ## ## Computational engine: glm ``` ] --- ## Fit model to training data ```r email_fit = email_wflow %>% fit(data = train_data) ``` ``` ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred ``` --- .code50[ ```r tidy(email_fit) ``` ``` ## # A tibble: 30 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -1.18 0.273 -4.31 1.64e- 5 ## 2 image -2.00 0.976 -2.05 4.08e- 2 ## 3 num_char 0.0611 0.0240 2.54 1.11e- 2 ## 4 line_breaks -0.00517 0.00135 -3.82 1.35e- 4 ## 5 exclaim_subj 0.0672 0.258 0.261 7.94e- 1 ## 6 exclaim_mess 0.00649 0.00182 3.56 3.69e- 4 ## 7 to_multiple_X1 -3.00 0.383 -7.84 4.57e-15 ## 8 cc_X.1.68. -0.327 0.470 -0.695 4.87e- 1 ## 9 attach_X.1.21. 2.40 0.403 5.96 2.57e- 9 ## 10 dollar_X.1.64. -0.0448 0.218 -0.206 8.37e- 1 ## # … with 20 more rows ``` ] --- ## Make predictions for test data ```r email_pred = predict(email_fit, test_data, type = "prob") %>% bind_cols(test_data) email_pred ``` ``` ## # A tibble: 785 x 23 ## .pred_0 .pred_1 spam to_multiple from cc sent_email ## <dbl> <dbl> <fct> <fct> <fct> <int> <fct> ## 1 0.999 0.000809 0 0 1 1 1 ## 2 0.998 0.00177 0 0 1 2 0 ## 3 0.996 0.00409 0 0 1 1 0 ## 4 0.996 0.00429 0 0 1 7 0 ## 5 0.873 0.127 0 0 1 0 0 ## 6 0.957 0.0434 0 0 1 0 0 ## 7 0.997 0.00340 0 1 1 1 0 ... ``` --- ## Evaluate the performance .pull-left[ ```r email_pred %>% roc_curve( truth = spam, .pred_1, event_level = "second" ) %>% autoplot() ``` ] .pull-right[ <img src="l17-prediction-logistic_files/figure-html/unnamed-chunk-36-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Evaluate the performance .pull-left[ ```r email_pred %>% roc_auc( truth = spam, .pred_1, event_level = "second" ) ``` ``` ## # A tibble: 1 x 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 roc_auc binary 0.847 ``` ] .pull-right[ <img src="l17-prediction-logistic_files/figure-html/unnamed-chunk-38-1.png" width="100%" style="display: block; margin: auto;" /> ] --- class: middle # Making decisions --- ## Cutoff probability: 0.5 .panelset[ .panel[.panel-name[Output] Suppose we decide to label an email as spam if the model predicts the probability of spam to be **more than 0.5**. | | Email is not spam| Email is spam| |:-----------------------|-----------------:|-------------:| |Email labelled not spam | 700| 66| |Email labelled spam | 8| 11| ] .panel[.panel-name[Code] ```r cutoff_prob = 0.5 email_pred %>% mutate( spam = if_else(spam == 1, "Email is spam", "Email is not spam"), spam_pred = if_else(.pred_1 > cutoff_prob, "Email labelled spam", "Email labelled not spam") ) %>% count(spam_pred, spam) %>% pivot_wider(names_from = spam, values_from = n) %>% knitr::kable(col.names = c("", "Email is not spam", "Email is spam")) ``` ] ] --- ## Cutoff probability: 0.25 .panelset[ .panel[.panel-name[Output] Suppose we decide to label an email as spam if the model predicts the probability of spam to be **more than 0.25**. | | Email is not spam| Email is spam| |:-----------------------|-----------------:|-------------:| |Email labelled not spam | 644| 38| |Email labelled spam | 64| 39| ] .panel[.panel-name[Code] ```r cutoff_prob = 0.25 email_pred %>% mutate( spam = if_else(spam == 1, "Email is spam", "Email is not spam"), spam_pred = if_else(.pred_1 > cutoff_prob, "Email labelled spam", "Email labelled not spam") ) %>% count(spam_pred, spam) %>% pivot_wider(names_from = spam, values_from = n) %>% knitr::kable(col.names = c("", "Email is not spam", "Email is spam")) ``` ] ] --- ## Cutoff probability: 0.75 .panelset[ .panel[.panel-name[Output] Suppose we decide to label an email as spam if the model predicts the probability of spam to be **more than 0.75**. | | Email is not spam| Email is spam| |:-----------------------|-----------------:|-------------:| |Email labelled not spam | 705| 70| |Email labelled spam | 3| 7| ] .panel[.panel-name[Code] ```r cutoff_prob = 0.75 email_pred %>% mutate( spam = if_else(spam == 1, "Email is spam", "Email is not spam"), spam_pred = if_else(.pred_1 > cutoff_prob, "Email labelled spam", "Email labelled not spam") ) %>% count(spam_pred, spam) %>% pivot_wider(names_from = spam, values_from = n) %>% knitr::kable(col.names = c("", "Email is not spam", "Email is spam")) ``` ] ] --- ## Evaluating performance on training data - The training set does not have the capacity to be a good arbiter of performance. -- - It is not an independent piece of information; predicting the training set can only reflect what the model already knows. -- - Suppose you give a class a test, then give them the answers, then provide the same test. The student scores on the second test do not accurately reflect what they know about the subject; these scores would probably be higher than their results on the first test. .footnote[ .small[ Source: [tidymodels.org](https://www.tidymodels.org/start/resampling/) ] ] --- class: middle # Cross validation --- ## Cross validation More specifically, **v-fold cross validation**: - Shuffle your data v partitions - Use 1 partition for validation, and the remaining v-1 partitions for training - Repeat v times .footnote[ .small[ You might also heard of this referred to as k-fold cross validation. ] ] --- ## Cross validation <img src="l17/img/cross-validation.png" width="100%" style="display: block; margin: auto;" /> --- ## Split data into folds .pull-left[ ```r set.seed(345) folds = vfold_cv(train_data, v = 5) folds ``` ``` ## # 5-fold cross-validation ## # A tibble: 5 x 2 ## splits id ## <list> <chr> ## 1 <split [2508/628]> Fold1 ## 2 <split [2509/627]> Fold2 ## 3 <split [2509/627]> Fold3 ## 4 <split [2509/627]> Fold4 ## 5 <split [2509/627]> Fold5 ``` ] .pull-right[ <img src="l17/img/cross-validation.png" width="100%" style="display: block; margin: auto 0 auto auto;" /> ] --- class: code60 ## Fit resamples .pull-left[ ```r set.seed(456) email_fit_rs = email_wflow %>% fit_resamples(folds) email_fit_rs ``` ``` ## # Resampling results ## # 5-fold cross-validation ## # A tibble: 5 x 4 ## splits id .metrics .notes ## <list> <chr> <list> <list> ## 1 <split [2508/628]> Fold1 <tibble [2 × 4]> <tibble [2 × 1]> ## 2 <split [2509/627]> Fold2 <tibble [2 × 4]> <tibble [1 × 1]> ## 3 <split [2509/627]> Fold3 <tibble [2 × 4]> <tibble [1 × 1]> ## 4 <split [2509/627]> Fold4 <tibble [2 × 4]> <tibble [2 × 1]> ## 5 <split [2509/627]> Fold5 <tibble [2 × 4]> <tibble [1 × 1]> ``` ] .pull-right[ <img src="l17/img/cross-validation-animated.gif" width="100%" style="display: block; margin: auto 0 auto auto;" /> ] --- ## Collect CV metrics ```r collect_metrics(email_fit_rs) ``` ``` ## # A tibble: 2 x 6 ## .metric .estimator mean n std_err .config ## <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 accuracy binary 0.908 5 0.00395 Preprocessor1_Model1 ## 2 roc_auc binary 0.837 5 0.00384 Preprocessor1_Model1 ``` --- ## Deeper look into CV metrics .panelset[ .panel[.panel-name[Raw] ```r collect_metrics(email_fit_rs, summarize = FALSE) %>% print(n = 10) ``` ``` ## # A tibble: 10 x 5 ## id .metric .estimator .estimate .config ## <chr> <chr> <chr> <dbl> <chr> ## 1 Fold1 accuracy binary 0.915 Preprocessor1_Model1 ## 2 Fold1 roc_auc binary 0.834 Preprocessor1_Model1 ## 3 Fold2 accuracy binary 0.904 Preprocessor1_Model1 ## 4 Fold2 roc_auc binary 0.847 Preprocessor1_Model1 ## 5 Fold3 accuracy binary 0.915 Preprocessor1_Model1 ## 6 Fold3 roc_auc binary 0.844 Preprocessor1_Model1 ## 7 Fold4 accuracy binary 0.895 Preprocessor1_Model1 ## 8 Fold4 roc_auc binary 0.828 Preprocessor1_Model1 ## 9 Fold5 accuracy binary 0.910 Preprocessor1_Model1 ## 10 Fold5 roc_auc binary 0.831 Preprocessor1_Model1 ``` ] .panel[.panel-name[Pretty] |Fold | accuracy| roc_auc| |:-----|--------:|-------:| |Fold1 | 0.915| 0.834| |Fold2 | 0.904| 0.847| |Fold3 | 0.915| 0.844| |Fold4 | 0.895| 0.828| |Fold5 | 0.910| 0.831| ] ] --- ### Training data optimism In the typical case, we use cross-validation to tune and select between models, then use the train/test split to evaluate the final chosen model - In this case, the training data does not severely overestimate the performance in testing or cross-validation. - In general, the gap between training and test is a function of the training *optimism*. - `\(\nearrow\)` model flexibility, `\(\nearrow\)` optimism - `\(\searrow\)` training set size, `\(\nearrow\)` optimism. --- ### Example of training optimism .pull-left[ ```r set.seed(123) train_data_small = train_data %>% sample_n(size = 100) overfit = fit(email_wflow, data = train_data_small) overpredict = predict(overfit, train_data_small, type = "prob") %>% bind_cols(train_data_small) ``` ] .pull-right[ <img src="l17-prediction-logistic_files/figure-html/unnamed-chunk-51-1.png" width="90%" style="display: block; margin: auto;" /> ``` ## # A tibble: 1 x 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 roc_auc binary 1 ``` ] --- # Final announcements * You will have access to the lectures as long as github exists and provides hosting for websites. * You will have access to your completed homework assignments on github as long as you have access to your github account. * You will have indefinite access to rstudio cloud and your projects there, but your account will revert back to the "free" tier (5 hours/month) on/around 19 Dec 2021. So you might need to export at the project to do much with it after this point. * I am not distributing the homework keys, but you are welcome to ask about homework during office hours or with a classmate. I would rather that you not electronically redistribute the entirety of a graded assignment with a classmate<sup>1</sup> * I will accept corrections on labs / homework until +7 days after HW05 is returned. .footnote[[1] Reminder: sharing graded assignments within a class year isn't technically prohibited, but sharing across class years is prohibited and is a violation of the academic honesty policy per department policy.] --- class: middle ## A final request Since I am not sure how long it will take for you to receive an formal invitation to evaluate this course, please fill out [this short, anonymous survey on the course](https://docs.google.com/forms/d/e/1FAIpQLSeCWQ2FHl8aabut8-Gcj4Ea2Xke8qAM0Yo0ecmHkcv-X-8aqg/viewform?usp=sf_link) --- # Acknowledgments Data science in a box [U4D7](https://rstudio-education.github.io/datascience-box/course-materials/slides/u4-d07-prediction-overfitting/u4-d07-prediction-overfitting.html#1) [U4D8](https://rstudio-education.github.io/datascience-box/course-materials/slides/u4-d08-feature-engineering/u4-d08-feature-engineering.html#1) [U4D9](https://rstudio-education.github.io/datascience-box/course-materials/slides/u4-d09-cross-validation/u4-d09-cross-validation.html#1)