BST430 Lecture 17

# BST430 Lecture 17
## Prediction and Logistic Regression
### Andrew McDavid
### U of Rochester
### 2021-11-14 (updated: 2021-11-15)

---

# Prediction

---

## Goal: Building a spam filter

- Data: Set of emails and we know if each email is spam/not and other features 
- Use logistic regression to predict the probability that an incoming email is spam
- Optimize our model by picking the model with the best predictive performance

--
- Building a model to predict the probability that an email is spam is only half of the battle! We also need a decision rule about which emails get flagged as spam (e.g. what probability should we use as out cutoff?)

--
- A simple approach: choose a single threshold probability and any email that exceeds that probability is flagged as spam

---
class: middle

# Logistic regression

---

## Generalized Linear Models (GLMs)

- Logistic regression is a *generalized linear model(GLM)* used to model a binary categorical outcome using numerical and categorical predictors

- A GLM is an extension of the least squares linear model `$E(Y_i|\mathbf{x}_i) = \mathbf{x}_i \beta$`, `$\text{Var}(Y|\mathbf{x}_i) = \sigma^2$` to
`$$g(E(Y_i|x_i)) = \eta_i = \mathbf{x}_i \beta$$`
, `$Y|X \sim$` Exponential-Family, and `$g$` is known as a **link** function.

- Whereby the glm allows non-linear effects in `$\beta$`, and since the conditional distribution of `$Y|X$` can include all sorts of interesting families, it also permits non-constant residual variance.

---

### The Bernoulli GLM with logit link

- To finish specifying the Logistic model we just need to define a reasonable link function that connects `$\eta_i$` to `$p_i$`: logit function

--
- **Logit function:** For `$0\le p \le 1$`

`$$logit(p) = \log\left(\frac{p}{1-p}\right)$$`

---

## Logit function, visualised

---

## Properties of the logit

- The logit function takes a value between 0 and 1 and maps it to a value between `$-\infty$` and `$\infty$`

--
- Inverse logit (aka expit) function:
`$$g^{-1}(x) = \frac{\exp(x)}{1+\exp(x)} = \frac{1}{1+\exp(-x)}$$`

--
- The expit function takes a value between `$-\infty$` and `$\infty$` and maps it to a value between 0 and 1

--
- This formulation is also useful for interpreting the model, since the logit can be interpreted as the log odds of a success -- more on this later

---

## The logistic regression model

- Based on the three GLM criteria we have
  - `$y_i|x_i \sim \text{Bern}(p_i)$`
  - `$\text{logit}(p_i) = \eta_i$`
  - `$\eta_i = \beta_0+\beta_1 x_{1,i} + \cdots + \beta_n x_{n,i}$`

--
- From which we get

`$$p_i = \frac{\exp(\beta_0+\beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}{1+\exp(\beta_0+\beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}$$`
---

## Modeling spam

In R we fit a GLM in the same way as a linear model except we

- use `"glm"` instead of `"lm"` as the engine

- define `family = "binomial"` for the link function to be used in the model

- When using `tidymodels`, specify the model with `logistic_reg()`

---

## Prediction

- The mechanics of prediction is **easy**:
  - Plug in values of predictors to the model equation
  - Calculate the predicted value of the response variable, `$\hat{y}$`

--
- Getting it right is **hard**!
  - There is no guarantee the model estimates you have are correct
  - Or that your model will perform as well with new data as it did with your sample data

---

## Underfitting and overfitting

---

## Spending our data

- Several steps to create a useful model: parameter estimation, model selection, performance assessment, etc.

- Doing all of this on the entire data we have available can lead to **overfitting**

- Allocate specific subsets of data for different tasks, as opposed to allocating the largest possible amount to the model parameter estimation only (which is what we've done so far)

---

# Splitting data

---

## Splitting data

- **Training set:**
  - Sandbox for model building 
  - Spend most of your time using the training set to develop the model
  - Majority of the data (usually 80%)
  
- **Testing set:**
  - Held in reserve to determine efficacy of one or two chosen models
  - Critical to look at it once, otherwise it becomes part of the modeling process
  - Remainder of the data (usually 20%)
  
---

## Performing the split

```r
# Fix random numbers by setting the seed 
# Enables analysis to be reproducible when random numbers are used 
set.seed(20211115)

# Put 80% of the data into the training set 
email_split = initial_split(email, prop = 0.80)

# Create data frames for the two sets:
train_data = training(email_split)
test_data  = testing(email_split)
```

.font70[`training` / `testing` aren't doing anything very fancy -- just using `sample_n` and then taking its complement, but this approach generalizes to more complicated ways to split our data.]

---
class: code70

## Peek at the split

```r
glimpse(train_data)
```

```
## Rows: 3,136
## Columns: 21
## $ spam <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ to_multiple <fct> 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,…
## $ from <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ cc <int> 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ sent_email <fct> 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ time <dttm> 2012-01-23 19:23:58, 2012-03-28 09:13:47, 2012-02-13 10:05:15, 2012-02-06 14…
## $ image <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ attach <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ dollar <dbl> 0, 0, 0, 0, 2, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 46, 0, 1, 4, 0, 0, 0, 6, 0, 0…
## $ winner <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, yes, no, no, …
## $ inherit <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ viagra <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ password <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0,…
## $ num_char <dbl> 24.832, 4.998, 3.568, 26.110, 7.870, 5.127, 2.710, 5.005, 7.911, 11.933, 4.66…
## $ line_breaks <int> 624, 114, 81, 652, 253, 120, 74, 118, 190, 165, 115, 256, 349, 640, 101, 349,…
## $ format <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1,…
## $ re_subj <fct> 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ urgent_subj <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ exclaim_mess <dbl> 2, 0, 1, 1, 8, 0, 0, 2, 5, 0, 8, 0, 8, 1, 1, 0, 8, 0, 0, 0, 1, 0, 0, 1, 5, 1,…
## $ number <fct> small, small, small, small, big, small, small, small, small, small, small, sm…
```
]
.pull-right[

```r
glimpse(test_data)
```

```
## Rows: 785
## Columns: 21
## $ spam <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ to_multiple <fct> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0,…
## $ from <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ cc <int> 1, 2, 1, 7, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 35, 0, 1…
## $ sent_email <fct> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1,…
## $ time <dttm> 2012-01-01 14:38:32, 2012-01-01 18:32:53, 2012-01-01 22:16:39, 2012-01-02 19…
## $ image <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ attach <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ dollar <dbl> 0, 2, 0, 0, 0, 2, 0, 9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ winner <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, n…
## $ inherit <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ viagra <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ password <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ num_char <dbl> 15.075, 19.693, 3.959, 0.334, 1.482, 4.103, 4.512, 4.761, 4.040, 1.401, 1.389…
## $ line_breaks <int> 354, 330, 81, 9, 24, 173, 107, 151, 76, 41, 37, 21, 145, 309, 183, 114, 135, …
## $ format <fct> 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1,…
## $ re_subj <fct> 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1,…
## $ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ urgent_subj <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ exclaim_mess <dbl> 10, 4, 1, 0, 0, 0, 0, 1, 4, 0, 8, 2, 1, 3, 0, 1, 4, 7, 2, 9, 0, 2, 0, 0, 0, 1…
## $ number <fct> small, big, small, small, none, small, small, big, small, small, small, none,…
```
]
]

---

# Modeling workflow

---

## Fit a model to the training dataset

```r
email_fit = logistic_reg() %>%
  set_engine("glm") %>%
  fit(spam ~ ., data = train_data, family = "binomial")
```

```
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
```

---

## Categorical predictors

---

## Fit a model to the training dataset

```r
email_fit = logistic_reg() %>%
  set_engine("glm") %>%
* fit(spam ~ . - from - sent_email - viagra, data = train_data, family = "binomial")
```

```
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
```

```r
email_fit
```

```
## parsnip model object
## 
## Fit time:  65ms 
## 
## Call:  stats::glm(formula = spam ~ . - from - sent_email - viagra, family = stats::binomial, 
##     data = data)
## 
## Coefficients:
##  (Intercept)  to_multiple1            cc          time         image        attach        dollar  
##   -1.043e+02    -2.802e+00     2.799e-02     7.810e-08    -3.578e+00     6.103e-01    -7.948e-02  
##    winneryes       inherit      password      num_char   line_breaks       format1      re_subj1  
##    2.111e+00     4.099e-01    -7.562e-01     6.814e-02    -5.259e-03    -9.421e-01    -3.474e+00  
## exclaim_subj  urgent_subj1  exclaim_mess   numbersmall     numberbig  
##    2.886e-01     2.734e+00     7.918e-03    -7.879e-01     2.957e-02  
## 
## Degrees of Freedom: 3135 Total (i.e. Null);  3117 Residual
## Null Deviance:	    1933 
## Residual Deviance: 1413 	AIC: 1451
```
]

???

Dropping `from`, `sent_email` quells the warnings about predicted probabilities being 0/1. Actually some of these variables might have predictive power, but we would need a more sophisticated approach than logistic regression to use them

---

## Predict outcome on the testing dataset

```r
predict(email_fit, test_data)
```

```
## # A tibble: 785 x 1
## .pred_class
## <fct> 
## 1 0 
## 2 0 
## 3 0 
## 4 0 
## 5 0 
## 6 0 
## 7 0 
## 8 0 
## 9 0 
## 10 0 
## # … with 775 more rows
```

---

## Predict probabilities on the testing dataset

```r
email_pred = predict(email_fit, test_data, type = "prob") %>%
  bind_cols(test_data %>% select(spam, time))

email_pred
```

```
## # A tibble: 785 x 4
## .pred_0 .pred_1 spam time 
## <dbl> <dbl> <fct> <dttm> 
## 1 0.999 0.00127 0 2012-01-01 14:38:32
## 2 0.996 0.00373 0 2012-01-01 18:32:53
## 3 0.994 0.00596 0 2012-01-01 22:16:39
## 4 0.992 0.00801 0 2012-01-02 19:07:51
## 5 0.845 0.155 0 2012-01-02 16:55:03
## 6 0.962 0.0376 0 2012-01-03 14:51:16
## 7 0.996 0.00414 0 2012-01-03 17:34:13
## 8 0.997 0.00308 0 2012-01-03 17:40:07
## 9 0.927 0.0728 0 2012-01-03 19:06:48
## 10 0.998 0.00243 0 2012-01-03 19:36:42
## # … with 775 more rows
```

---

## A closer look at predictions

```r
email_pred %>%
  arrange(desc(.pred_1)) %>%
  print(n = 10)
```

```
## # A tibble: 785 x 4
## .pred_0 .pred_1 spam time 
## <dbl> <dbl> <fct> <dttm> 
## 1 0.0970 0.903 1 2012-03-27 01:17:01
## 2 0.165 0.835 1 2012-03-01 00:40:27
*## 3 0.191 0.809 1 2012-01-21 16:55:55
## 4 0.206 0.794 0 2012-01-02 22:30:31
## 5 0.243 0.757 1 2012-03-31 05:20:24
## 6 0.258 0.742 1 2012-03-17 12:20:57
*## 7 0.288 0.712 0 2012-02-03 08:25:39
## 8 0.293 0.707 1 2012-02-24 19:43:44
## 9 0.308 0.692 0 2012-01-31 10:07:48
## 10 0.346 0.654 0 2012-02-04 05:39:37
## # … with 775 more rows
```

---

## Evaluate the performance

**Receiver operating characteristic (ROC) curve**+ which plot true positive rate vs. false positive rate (1 - specificity)

```r
email_pred %>%
 roc_curve(
 truth = spam,
 .pred_1,
 event_level = "second"
 ) %>%
 autoplot()
```
]
.pull-right[
<img src="l17-prediction-logistic_files/figure-html/unnamed-chunk-14-1.png" width="100%" style="display: block; margin: auto;" />
]

.footnote[
.small[
+Originally developed for operators of military radar receivers, hence the name.
]
]

---

## Evaluate the performance

Find the area under the curve:

```r
email_pred %>%
  roc_auc(
    truth = spam,
    .pred_1,
    event_level = "second"
  )
```

```
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.840
```
]
.pull-right[
<img src="l17-prediction-logistic_files/figure-html/unnamed-chunk-16-1.png" width="100%" style="display: block; margin: auto;" />
]

---

# Feature engineering

---

## Feature engineering

- There are all sorts of ways to build predictive models of binary variables: random forests, support vector machines, neural networks, `$k$` nearest neighbors, gradient boosting, ...

- In their own way, each learns the mapping `$\hat f: \mathbf{x} \mapsto Y$`

--
- But the variables `$\mathbf{x}$` that go into the model and how they are represented are just as critical to success of the model

--
- **Feature engineering** allows us to get creative with our predictors in an effort to make them more useful for our model (to increase its predictive performance)

-  How this engineering is done is part of the learned function `$\hat f$`, and needs to be accounted for when we evaluate our models.

---

## A simple approach: `mutate()`

```r
library(lubridate)
train_data %>%
  mutate(
    date = date(time),
    dow  = wday(time),
    month = month(time)
    ) %>%
  select(time, date, dow, month) %>%
  sample_n(size = 5) # shuffle to show a variety
```

```
## # A tibble: 5 x 4
## time date dow month
## <dttm> <date> <dbl> <dbl>
## 1 2012-01-31 17:06:58 2012-01-31 3 1
## 2 2012-02-02 14:38:31 2012-02-02 5 2
## 3 2012-02-14 06:17:18 2012-02-14 3 2
## 4 2012-03-05 12:55:12 2012-03-05 2 3
## 5 2012-02-23 13:12:30 2012-02-23 5 2
```

???

time would be used a "seconds" since the epoch as a continuous variable, however this would not allow any flexibility about weekdays vs weekends, time of year, etc.

however, we would need to repeat this mutate on train / test independently, at a minimum this would not keep things D.R.Y.

It's actually even worse than that, in that some mutates (collapsing factors, scaling variables) have implicitly learned parameters -- if we don't account for this we can end up with data leaking between training and test or with garbage predictions in test

---

## Modeling workflow, revisited

- Create a **recipe** for feature engineering steps to be applied to the training data

--
- Fit the model to the training data after these steps have been applied

--
- Using the model estimates from the training data, predict outcomes for the test data

--
- Evaluate the performance of the model on the test data

---

# Building recipes

???

Fancy mutate statements that can be applied easily to both training / test
Actually, is building up and saving a set of functions (without evaluating them)

---

## Initiate a recipe

```r
email_rec = recipe(
  spam ~ .,          # formula
  data = train_data  # data to use for cataloguing names and types of variables
  )

summary(email_rec)
```

```
## # A tibble: 21 x 4
## variable type role source 
## <chr> <chr> <chr> <chr> 
## 1 to_multiple nominal predictor original
## 2 from nominal predictor original
## 3 cc numeric predictor original
## 4 sent_email nominal predictor original
## 5 time date predictor original
## 6 image numeric predictor original
## 7 attach numeric predictor original
## 8 dollar numeric predictor original
## 9 winner nominal predictor original
## 10 inherit numeric predictor original
## 11 viagra numeric predictor original
## 12 password numeric predictor original
## 13 num_char numeric predictor original
## 14 line_breaks numeric predictor original
## 15 format nominal predictor original
## 16 re_subj nominal predictor original
## 17 exclaim_subj numeric predictor original
## 18 urgent_subj nominal predictor original
## 19 exclaim_mess numeric predictor original
## 20 number nominal predictor original
## 21 spam nominal outcome original
```
]

---

## Remove certain variables

```r
email_rec = email_rec %>%
  step_rm(from, sent_email)
```

```
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         20
## 
## Operations:
## 
## Delete terms from, sent_email
```
]

---

## Feature engineer date

```r
email_rec = email_rec %>%
  step_date(time, features = c("dow", "month")) %>%
  step_rm(time)
```

---

## Discretize numeric variables

```r
email_rec = email_rec %>%
  step_cut(cc, attach, dollar, breaks = c(0, 1)) %>%
  step_cut(inherit, password, breaks = c(0, 1, 5, 10, 20))
```

```
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         20
## 
## Operations:
## 
## Delete terms from, sent_email
## Date features from time
## Delete terms time
## Cut numeric for cc, attach, dollar
## Cut numeric for inherit, password
```
]

---

## Create dummy variables

```r
email_rec = email_rec %>%
  step_dummy(all_nominal(), -all_outcomes())
```

---

## Remove zero variance variables

Variables that contain only a single value

```r
email_rec = email_rec %>%
  step_zv(all_predictors())
```

---

## All in one place

```r
email_rec = recipe(spam ~ ., data = email) %>%
  step_rm(from, sent_email) %>%
  step_date(time, features = c("dow", "month")) %>%               
  step_rm(time) %>%
  step_cut(cc, attach, dollar, breaks = c(0, 1)) %>%
  step_cut(inherit, password, breaks = c(0, 1, 5, 10, 20)) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_predictors())
```

---

# Building workflows

---

## Define model

```r
email_mod = logistic_reg() %>% 
  set_engine("glm")

email_mod
```

```
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm
```

---

## Define workflow

**Workflows** bring together models and recipes so that they can be easily applied to both the training and test data.

```r
email_wflow = workflow() %>% 
  add_model(email_mod) %>% 
  add_recipe(email_rec)
```

```
## ══ Workflow ════════════════════════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────────────────────────
## 7 Recipe Steps
## 
## • step_rm()
## • step_date()
## • step_rm()
## • step_cut()
## • step_cut()
## • step_dummy()
## • step_zv()
## 
## ── Model ───────────────────────────────────────────────────────────────────────────────────────────
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm
```
]

---

## Fit model to training data

```r
email_fit = email_wflow %>% 
  fit(data = train_data)
```

```
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
```

---

```r
tidy(email_fit)
```

```
## # A tibble: 30 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -1.18 0.273 -4.31 1.64e- 5
## 2 image -2.00 0.976 -2.05 4.08e- 2
## 3 num_char 0.0611 0.0240 2.54 1.11e- 2
## 4 line_breaks -0.00517 0.00135 -3.82 1.35e- 4
## 5 exclaim_subj 0.0672 0.258 0.261 7.94e- 1
## 6 exclaim_mess 0.00649 0.00182 3.56 3.69e- 4
## 7 to_multiple_X1 -3.00 0.383 -7.84 4.57e-15
## 8 cc_X.1.68. -0.327 0.470 -0.695 4.87e- 1
## 9 attach_X.1.21. 2.40 0.403 5.96 2.57e- 9
## 10 dollar_X.1.64. -0.0448 0.218 -0.206 8.37e- 1
## # … with 20 more rows
```
]

---

## Make predictions for test data

```r
email_pred = predict(email_fit, test_data, type = "prob") %>% 
  bind_cols(test_data)

email_pred
```

```
## # A tibble: 785 x 23
## .pred_0 .pred_1 spam to_multiple from cc sent_email
## <dbl> <dbl> <fct> <fct> <fct> <int> <fct> 
## 1 0.999 0.000809 0 0 1 1 1 
## 2 0.998 0.00177 0 0 1 2 0 
## 3 0.996 0.00409 0 0 1 1 0 
## 4 0.996 0.00429 0 0 1 7 0 
## 5 0.873 0.127 0 0 1 0 0 
## 6 0.957 0.0434 0 0 1 0 0 
## 7 0.997 0.00340 0 1 1 1 0 
...
```

---

## Evaluate the performance

```r
email_pred %>%
 roc_curve(
 truth = spam,
 .pred_1,
 event_level = "second"
 ) %>%
 autoplot()
```
]
.pull-right[
<img src="l17-prediction-logistic_files/figure-html/unnamed-chunk-36-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Evaluate the performance

```r
email_pred %>%
  roc_auc(
    truth = spam,
    .pred_1,
    event_level = "second"
  )
```

```
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.847
```
]
.pull-right[
<img src="l17-prediction-logistic_files/figure-html/unnamed-chunk-38-1.png" width="100%" style="display: block; margin: auto;" />
]

---

# Making decisions

---

## Cutoff probability: 0.5

Suppose we decide to label an email as spam if the model predicts the probability of spam to be **more than 0.5**.

|                        | Email is not spam| Email is spam|
|:-----------------------|-----------------:|-------------:|
|Email labelled not spam |               700|            66|
|Email labelled spam     |                 8|            11|
]
.panel[.panel-name[Code]

```r
cutoff_prob = 0.5
email_pred %>%
  mutate(
    spam      = if_else(spam == 1, "Email is spam", "Email is not spam"),
    spam_pred = if_else(.pred_1 > cutoff_prob, "Email labelled spam", "Email labelled not spam")
    ) %>%
  count(spam_pred, spam) %>%
  pivot_wider(names_from = spam, values_from = n) %>%
  knitr::kable(col.names = c("", "Email is not spam", "Email is spam"))
```
]
]

---

## Cutoff probability: 0.25

Suppose we decide to label an email as spam if the model predicts the probability of spam to be **more than 0.25**.

|                        | Email is not spam| Email is spam|
|:-----------------------|-----------------:|-------------:|
|Email labelled not spam |               644|            38|
|Email labelled spam     |                64|            39|
]
.panel[.panel-name[Code]

```r
cutoff_prob = 0.25
email_pred %>%
  mutate(
    spam      = if_else(spam == 1, "Email is spam", "Email is not spam"),
    spam_pred = if_else(.pred_1 > cutoff_prob, "Email labelled spam", "Email labelled not spam")
    ) %>%
  count(spam_pred, spam) %>%
  pivot_wider(names_from = spam, values_from = n) %>%
  knitr::kable(col.names = c("", "Email is not spam", "Email is spam"))
```
]
]

---

## Cutoff probability: 0.75

Suppose we decide to label an email as spam if the model predicts the probability of spam to be **more than 0.75**.

|                        | Email is not spam| Email is spam|
|:-----------------------|-----------------:|-------------:|
|Email labelled not spam |               705|            70|
|Email labelled spam     |                 3|             7|
]
.panel[.panel-name[Code]

```r
cutoff_prob = 0.75
email_pred %>%
  mutate(
    spam      = if_else(spam == 1, "Email is spam", "Email is not spam"),
    spam_pred = if_else(.pred_1 > cutoff_prob, "Email labelled spam", "Email labelled not spam")
    ) %>%
  count(spam_pred, spam) %>%
  pivot_wider(names_from = spam, values_from = n) %>%
  knitr::kable(col.names = c("", "Email is not spam", "Email is spam"))
```
]
]

---

## Evaluating performance on training data

-  The training set does not have the capacity to be a good arbiter of performance.

--
- It is not an independent piece of information; predicting the training set can only reflect what the model already knows.

--
- Suppose you give a class a test, then give them the answers, then provide the same test. The student scores on the second test do not accurately reflect what they know about the subject; these scores would probably be higher than their results on the first test.

---

# Cross validation

---

## Cross validation

More specifically, **v-fold cross validation**:

- Shuffle your data v partitions
- Use 1 partition for validation, and the remaining v-1 partitions for training
- Repeat v times

---

## Cross validation

---

## Split data into folds

```r
set.seed(345)

folds = vfold_cv(train_data, v = 5)
folds
```

```
## # 5-fold cross-validation 
## # A tibble: 5 x 2
## splits id 
## <list> <chr>
## 1 <split [2508/628]> Fold1
## 2 <split [2509/627]> Fold2
## 3 <split [2509/627]> Fold3
## 4 <split [2509/627]> Fold4
## 5 <split [2509/627]> Fold5
```
]
.pull-right[
<img src="l17/img/cross-validation.png" width="100%" style="display: block; margin: auto 0 auto auto;" />
]

---
class: code60

## Fit resamples

```r
set.seed(456)

email_fit_rs = email_wflow %>%
  fit_resamples(folds)

email_fit_rs
```

```
## # Resampling results
## # 5-fold cross-validation 
## # A tibble: 5 x 4
## splits id .metrics .notes 
## <list> <chr> <list> <list> 
## 1 <split [2508/628]> Fold1 <tibble [2 × 4]> <tibble [2 × 1]>
## 2 <split [2509/627]> Fold2 <tibble [2 × 4]> <tibble [1 × 1]>
## 3 <split [2509/627]> Fold3 <tibble [2 × 4]> <tibble [1 × 1]>
## 4 <split [2509/627]> Fold4 <tibble [2 × 4]> <tibble [2 × 1]>
## 5 <split [2509/627]> Fold5 <tibble [2 × 4]> <tibble [1 × 1]>
```
]
.pull-right[
<img src="l17/img/cross-validation-animated.gif" width="100%" style="display: block; margin: auto 0 auto auto;" />
]

---

## Collect CV metrics

```r
collect_metrics(email_fit_rs)
```

```
## # A tibble: 2 x 6
## .metric .estimator mean n std_err .config 
## <chr> <chr> <dbl> <int> <dbl> <chr> 
## 1 accuracy binary 0.908 5 0.00395 Preprocessor1_Model1
## 2 roc_auc binary 0.837 5 0.00384 Preprocessor1_Model1
```

---

## Deeper look into CV metrics

```r
collect_metrics(email_fit_rs, summarize = FALSE) %>%
  print(n = 10)
```

```
## # A tibble: 10 x 5
## id .metric .estimator .estimate .config 
## <chr> <chr> <chr> <dbl> <chr> 
## 1 Fold1 accuracy binary 0.915 Preprocessor1_Model1
## 2 Fold1 roc_auc binary 0.834 Preprocessor1_Model1
## 3 Fold2 accuracy binary 0.904 Preprocessor1_Model1
## 4 Fold2 roc_auc binary 0.847 Preprocessor1_Model1
## 5 Fold3 accuracy binary 0.915 Preprocessor1_Model1
## 6 Fold3 roc_auc binary 0.844 Preprocessor1_Model1
## 7 Fold4 accuracy binary 0.895 Preprocessor1_Model1
## 8 Fold4 roc_auc binary 0.828 Preprocessor1_Model1
## 9 Fold5 accuracy binary 0.910 Preprocessor1_Model1
## 10 Fold5 roc_auc binary 0.831 Preprocessor1_Model1
```
]
.panel[.panel-name[Pretty]

|Fold  | accuracy| roc_auc|
|:-----|--------:|-------:|
|Fold1 |    0.915|   0.834|
|Fold2 |    0.904|   0.847|
|Fold3 |    0.915|   0.844|
|Fold4 |    0.895|   0.828|
|Fold5 |    0.910|   0.831|
]
]

---

### Training data optimism

In the typical case, we use cross-validation to tune and select between models, then use the train/test split to evaluate the final chosen model

-  In this case, the training data does not severely overestimate the performance in testing or cross-validation.
-  In general, the gap between training and test is a function of the training *optimism*.
-  `$\nearrow$` model flexibility, `$\nearrow$` optimism
-  `$\searrow$` training set size, `$\nearrow$` optimism.

---

### Example of training optimism

```r
set.seed(123)
train_data_small = train_data %>% sample_n(size = 100)
overfit = fit(email_wflow, data = train_data_small)
overpredict = predict(overfit, train_data_small, type = "prob") %>% 
  bind_cols(train_data_small)
```
]

.pull-right[
<img src="l17-prediction-logistic_files/figure-html/unnamed-chunk-51-1.png" width="90%" style="display: block; margin: auto;" />

```
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 1
```
]

---

# Final announcements

* You will have access to the lectures as long as github exists and provides hosting for websites. 
* You will have access to your completed homework assignments on github as long as you have access to your github account.
* You will have indefinite access to rstudio cloud and your projects there, but your account will revert back to the "free" tier (5 hours/month) on/around 19 Dec 2021. So you might need to export at the project to do much with it after this point.
* I am not distributing the homework keys, but you are welcome to ask about homework during office hours or with a classmate. I would rather that you not electronically redistribute the entirety of a graded assignment with a classmate1 
* I will accept corrections on labs / homework until +7 days after HW05 is returned.

.footnote[[1] Reminder: sharing graded assignments within a class year isn't technically prohibited, but sharing across class years is prohibited and is a violation of the academic honesty policy per department policy.]

---
class: middle

## A final request

Since I am not sure how long it will take for you to receive an formal invitation to evaluate this course, please fill out [this short, anonymous survey on the course](https://docs.google.com/forms/d/e/1FAIpQLSeCWQ2FHl8aabut8-Gcj4Ea2Xke8qAM0Yo0ecmHkcv-X-8aqg/viewform?usp=sf_link)

---

# Acknowledgments

Data science in a box [U4D7](https://rstudio-education.github.io/datascience-box/course-materials/slides/u4-d07-prediction-overfitting/u4-d07-prediction-overfitting.html#1) [U4D8](https://rstudio-education.github.io/datascience-box/course-materials/slides/u4-d08-feature-engineering/u4-d08-feature-engineering.html#1) [U4D9](https://rstudio-education.github.io/datascience-box/course-materials/slides/u4-d09-cross-validation/u4-d09-cross-validation.html#1)