BST430 Lecture 09

# BST430 Lecture 09
## Data from files
### Andrew McDavid
### U of Rochester
### 2021-09-24 (updated: 2021-09-28)

---

# Reading rectangular data into R

---

.pull-left[
<img src="l09/img/readr.png" width="80%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="l09/img/readxl.png" width="80%" style="display: block; margin: auto;" />
]

---

- `read_csv()` - comma delimited files
- `read_csv2()` - semicolon separated files (common in countries where , is used as the decimal place)
- `read_tsv()` - tab delimited files
- `read_delim()` - reads in files with any delimiter
- `read_fwf()` - fixed width files
- ...
]

--
.pull-right[
## readxl

- `read_excel()` - read xls or xlsx files
- ...
]

---

## Reading data

```r
nobel = read_csv(file = "l09/data/nobel.csv")
nobel
```

```
## # A tibble: 935 × 26
##      id firstname      surname    year category affiliation city 
##   <dbl> <chr>          <chr>     <dbl> <chr>    <chr>       <chr>
## 1     1 Wilhelm Conrad Röntgen    1901 Physics  Munich Uni… Muni…
## 2     2 Hendrik A.     Lorentz    1902 Physics  Leiden Uni… Leid…
## 3     3 Pieter         Zeeman     1902 Physics  Amsterdam … Amst…
## 4     4 Henri          Becquerel  1903 Physics  École Poly… Paris
## # … with 931 more rows, and 19 more variables: country <chr>,
## #   born_date <date>, died_date <date>, gender <chr>,
## #   born_city <chr>, born_country <chr>, …
```

---

## Writing data

```r
df = tribble(
  ~x, ~y,
  1,  "a",
  2,  "b",
  3,  "c"
)

write_csv(df, file = "l09/data/df.csv")
```
]

--
.pull-right[
- Read it back in to inspect

```r
read_csv("l09/data/df.csv")
```

```
## # A tibble: 3 × 2
##       x y    
##   <dbl> <chr>
## 1     1 a    
## 2     2 b    
## 3     3 c
```
]

---

- [RStudio Cloud > l09 - Application Exercise - Nobels and Sales](https://rstudio.cloud/spaces/162296/project/2939203) > open `nobels-csv.Rmd` and knit.
- Read in the `nobels.csv` file from the `data-raw/` folder.
- Split into two (STEM and non-STEM): 
  - Create a new data frame, `nobel_stem`, that filters for the STEM fields 
(Physics, Medicine, Chemistry, and Economics).
  - Create another data frame, `nobel_nonstem`, that filters for the remaining 
fields.  
- Write out the two data frames to `nobel-stem.csv` and `nobel-nonstem.csv`, 
respectively, to `data/`.

**Hint:** Use the `%in%` operator when `filter()`ing.
]

---

# Variable names

---
class: code70
## Data with bad names

```r
edibnb_badnames = read_csv("l09/data/edibnb-badnames.csv")
names(edibnb_badnames)
```

```
##  [1] "ID"                   "Price"               
##  [3] "neighbourhood"        "accommodates"        
##  [5] "Number of bathrooms"  "Number of Bedrooms"  
##  [7] "n beds"               "Review Scores Rating"
##  [9] "Number of reviews"    "listing_url"
```

... spaces need special treatment<sup>1</sup> in variable names

```r
ggplot(edibnb_badnames, aes(x = Number of bathrooms, y = Price)) +
  geom_point()
```

```
## Error: <text>:1:40: unexpected symbol
## 1: ggplot(edibnb_badnames, aes(x = Number of
##                                            ^
```

## Option 1 - Define column names

```r
edibnb_col_names =
  read_csv(
    "l09/data/edibnb-badnames.csv",
*   col_names = c("id","price", "neighbourhood",
      "accommodates", "bathroom", "bedroom",
      "bed", "review_scores_rating",
      "n_reviews", "url"))
names(edibnb_col_names)
```

```
##  [1] "id"                   "price"               
##  [3] "neighbourhood"        "accommodates"        
##  [5] "bathroom"             "bedroom"             
##  [7] "bed"                  "review_scores_rating"
##  [9] "n_reviews"            "url"
```

---

## Option 2 - Format text to snake_case

```r
edibnb_clean_names = read_csv("l09/data/edibnb-badnames.csv") %>%
  janitor::clean_names()

names(edibnb_clean_names)
```

```
##  [1] "id"                   "price"               
##  [3] "neighbourhood"        "accommodates"        
##  [5] "number_of_bathrooms"  "number_of_bedrooms"  
##  [7] "n_beds"               "review_scores_rating"
##  [9] "number_of_reviews"    "listing_url"
```

---

# Variable types

---

.pull-left[
<img src="l09/img/df-na.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[

```r
read_csv("l09/data/df-na.csv")
```

```
## # A tibble: 9 × 3
##   x     y              z     
##   <chr> <chr>          <chr> 
## 1 1     a              hi    
## 2 <NA>  b              hello 
## 3 3     Not applicable 9999  
## 4 4     d              ola   
## 5 5     e              hola  
## 6 .     f              whatup
## 7 7     g              wassup
## 8 8     h              sup   
## 9 9     i              <NA>
```
]

---

## Option 1. Explicit NAs

```r
read_csv("l09/data/df-na.csv", 
         na = c("", "NA", ".", "9999", "Not applicable"))
```

.pull-left[
<img src="l09/img/df-na.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[

```
## # A tibble: 9 × 3
##       x y     z     
##   <dbl> <chr> <chr> 
## 1     1 a     hi    
## 2    NA b     hello 
## 3     3 <NA>  <NA>  
## 4     4 d     ola   
## 5     5 e     hola  
## 6    NA f     whatup
## 7     7 g     wassup
## 8     8 h     sup   
## 9     9 i     <NA>
```
]

---

## Option 2. Specify column types

```r
read_csv("l09/data/df-na.csv", col_types = list(col_double(), 
                                            col_character(), 
                                            col_character()))
```

```
## Warning: One or more parsing issues, see `problems()` for details
```

```
## # A tibble: 9 × 3
##       x y              z     
##   <dbl> <chr>          <chr> 
## 1     1 a              hi    
## 2    NA b              hello 
## 3     3 Not applicable 9999  
## 4     4 d              ola   
## 5     5 e              hola  
## 6    NA f              whatup
## 7     7 g              wassup
## 8     8 h              sup   
## 9     9 i              <NA>
```

---

## Column types

.small[
**type function**  | **data type**
------------------ | -------------
`col_character()`  | character
`col_date()`       | date
`col_datetime()`   | POSIXct (date-time)
`col_double()`     | double (numeric)
`col_factor()`     | factor
`col_guess()`      | let readr guess (default)
`col_integer()`    | integer
`col_logical()`    | logical
`col_number()`     | numbers mixed with non-number characters
`col_numeric()`    | double or integer
`col_skip()`       | do not read
`col_time()`       | time
]

---
class: code70
.question[
Wondering where you remember these from?
]

```r
data = read_csv("l09/data/df-na.csv")
```

```
## Rows: 9 Columns: 3
```

```
## ── Column specification ─────────────────────────────────────────
## Delimiter: ","
## chr (3): x, y, z
```

```
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

```r
spec(data)
```

```
## cols(
##   x = col_character(),
##   y = col_character(),
##   z = col_character()
## )
```

---

class:middle

# Case study: Favorite foods

---

## Favorite foods

```r
*fav_food = read_excel("l09/data/favourite-food.xlsx")

fav_food
```

```
## # A tibble: 5 × 6
##   `Student ID` `Full Name`  favourite.food  mealPlan  AGE   SES  
##          <dbl> <chr>        <chr>           <chr>     <chr> <chr>
## 1            1 Sunil Huffm… Strawberry yog… Lunch on… 4     High 
## 2            2 Barclay Lynn French fries    Lunch on… 5     Midd…
## 3            3 Jayendra Ly… N/A             Breakfas… 7     Low  
## 4            4 Leon Rossini Anchovies       Lunch on… 99999 Midd…
## 5            5 Chidiegwu D… Pizza           Breakfas… five  High
```

---

## Variable names

```r
fav_food = read_excel("l09/data/favourite-food.xlsx") %>%
* janitor::clean_names()

fav_food 
```

```
## # A tibble: 5 × 6
##   student_id full_name   favourite_food   meal_plan   age   ses  
##        <dbl> <chr>       <chr>            <chr>       <chr> <chr>
## 1          1 Sunil Huff… Strawberry yogh… Lunch only  4     High 
## 2          2 Barclay Ly… French fries     Lunch only  5     Midd…
## 3          3 Jayendra L… N/A              Breakfast … 7     Low  
## 4          4 Leon Rossi… Anchovies        Lunch only  99999 Midd…
## 5          5 Chidiegwu … Pizza            Breakfast … five  High
```

---

## Handling NAs

```r
fav_food = read_excel("l09/data/favourite-food.xlsx",
*                      na = c("N/A", "99999")) %>%
  janitor::clean_names()

fav_food 
```

```
## # A tibble: 5 × 6
##   student_id full_name   favourite_food   meal_plan   age   ses  
##        <dbl> <chr>       <chr>            <chr>       <chr> <chr>
## 1          1 Sunil Huff… Strawberry yogh… Lunch only  4     High 
## 2          2 Barclay Ly… French fries     Lunch only  5     Midd…
## 3          3 Jayendra L… <NA>             Breakfast … 7     Low  
## 4          4 Leon Rossi… Anchovies        Lunch only  <NA>  Midd…
## 5          5 Chidiegwu … Pizza            Breakfast … five  High
```

---

## Make `age` numeric

```r
fav_food = fav_food %>%
* mutate(
*   age = if_else(age == "five", "5", age),
*   age = as.numeric(age)
*   )

glimpse(fav_food) 
```

```
## Rows: 5
## Columns: 6
## $ student_id     <dbl> 1, 2, 3, 4, 5
## $ full_name      <chr> "Sunil Huffmann", "Barclay Lynn", "Jayen…
## $ favourite_food <chr> "Strawberry yoghurt", "French fries", NA…
## $ meal_plan      <chr> "Lunch only", "Lunch only", "Breakfast a…
## $ age            <dbl> 4, 5, 7, NA, 5
## $ ses            <chr> "High", "Middle", "Low", "Middle", "High"
```
]
.pull-right-narrow[
<img src="l09/img/fav-food/fav-food-age.png" width="60%" style="display: block; margin: auto;" />
]

---

## Socio-economic status

```r
fav_food %>%
  count(ses)
```

```
## # A tibble: 3 × 2
##   ses        n
##   <chr>  <int>
## 1 High       2
## 2 Low        1
## 3 Middle     2
```
]
.pull-right-narrow[
<img src="l09/img/fav-food/fav-food-ses.png" width="60%" style="display: block; margin: auto;" />
]

---

## Make `ses` factor

```r
fav_food = fav_food %>%
* mutate(ses = fct_relevel(ses, "Low", "Middle", "High"))

fav_food %>%
  count(ses)
```

```
## # A tibble: 3 × 2
##   ses        n
##   <fct>  <int>
## 1 Low        1
## 2 Middle     2
## 3 High       2
```
]

---

## Putting it altogether

```r
fav_food = read_excel("l09/data/favourite-food.xlsx", na = c("N/A", "99999")) %>%
  janitor::clean_names() %>%
  mutate(
    age = if_else(age == "five", "5", age), 
    age = as.numeric(age),
    ses = fct_relevel(ses, "Low", "Middle", "High")
  )

fav_food
```

```
## # A tibble: 5 × 6
##   student_id full_name   favourite_food   meal_plan     age ses  
##        <dbl> <chr>       <chr>            <chr>       <dbl> <fct>
## 1          1 Sunil Huff… Strawberry yogh… Lunch only      4 High 
## 2          2 Barclay Ly… French fries     Lunch only      5 Midd…
## 3          3 Jayendra L… <NA>             Breakfast …     7 Low  
## 4          4 Leon Rossi… Anchovies        Lunch only     NA Midd…
## 5          5 Chidiegwu … Pizza            Breakfast …     5 High
```

---

## Out and back in

```r
write_csv(fav_food, file = "l09/data/fav-food-clean.csv")

fav_food_clean = read_csv("l09/data/fav-food-clean.csv")
```

---

```r
fav_food_clean %>%
  count(ses)
```

```
## # A tibble: 3 × 2
##   ses        n
##   <chr>  <int>
## 1 High       2
## 2 Low        1
## 3 Middle     2
```

---

## `readRDS()` and `saveRDS()`

- CSVs can be unreliable for saving interim results if there is specific variable type information you want to hold on to.
- An alternative is RDS files, you can read and write them with `readRDS()` and `saveRDS()`, respectively.

```r
readRDS(path)
saveRDS(x, path)
```

---

## Out and back in, take 2

```r
saveRDS(fav_food, file = "l09/data/fav-food-clean.rds")

fav_food_clean = readRDS("l09/data/fav-food-clean.rds")

fav_food_clean %>%
  count(ses)
```

```
## # A tibble: 3 × 2
##   ses        n
##   <fct>  <int>
## 1 Low        1
## 2 Middle     2
## 3 High       2
```

---

# Other types of data

---

## Other types of data

- **googlesheets4:** Google Sheets
- **haven**: SPSS, Stata, and SAS files
- **DBI**, along with a database specific backend (e.g. RMySQL, RSQLite, RPostgreSQL etc): allows you to run SQL queries against a database and return a data frame
- **jsonline**: JSON
- **xml2**: xml
- **rvest**: web scraping
- **httr**: web APIs
- **sparklyr**: data loaded into spark

---

.your-turn[
### .hand[Your turn!]
.midi[
- [RStudio Cloud > l09 - Application Exercise - Nobels and Sales](https://rstudio.cloud/spaces/162296/project/2939203) > `sales-excel.Rmd`. 
- Load the `sales.xlsx` file from the `data-raw/` folder, using appropriate arguments for the `read_excel()` function such that it looks like the output on the left.
- **Stretch goal:** Manipulate the sales data such that it looks like the output on the right.
]
]

```
## # A tibble: 9 × 2
##   id      n    
##   <chr>   <chr>
## 1 Brand 1 n    
## 2 1234    8    
## 3 8721    2    
## 4 1822    3    
## # … with 5 more rows
```
]
.pull-right[

```
## # A tibble: 7 × 3
##   brand      id     n
##   <chr>   <dbl> <dbl>
## 1 Brand 1  1234     8
## 2 Brand 1  8721     2
## 3 Brand 1  1822     3
## 4 Brand 2  3333     1
## 5 Brand 2  2156     3
## 6 Brand 2  3987     6
## 7 Brand 2  3216     5
```
]

---

## File organization and rstudio cloud

-  Picking a sensible and consistent way to organize raw and processed data will save many headaches!
-  .alert[Generally data should live in its own folder in the analysis project]
  -  But: Sometimes the loading and tidying of the data might be its own project, per se,
  -  Sometimes data will need to be reused across many different projects,
  -  And sometimes data is too big or too sensitive to put on github.
-  In Rstudio cloud, it is possible to upload, download files and move files  using the "files" pane, and move
-  We'll touch project organization again, when we remove the training wheels and make and manage our own desktop Rstudio projects

---

## Acknowledgments

[Data science in a box](https://rstudio-education.github.io/datascience-box/course-materials/slides/u2-d12-data-import/u2-d12-data-import.html#4)

## Other reading

[R4DS chapter 11](https://r4ds.had.co.nz/data-import.html#data-import)