class: ur-title, center, middle, title-slide # BST430 Lecture 09 ## Data from files ### Andrew McDavid ### U of Rochester ### 2021-09-24 (updated: 2021-09-28) --- class: middle # Reading rectangular data into R --- class: middle .pull-left[ <img src="l09/img/readr.png" width="80%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="l09/img/readxl.png" width="80%" style="display: block; margin: auto;" /> ] --- .pull-left[ ## readr - `read_csv()` - comma delimited files - `read_csv2()` - semicolon separated files (common in countries where , is used as the decimal place) - `read_tsv()` - tab delimited files - `read_delim()` - reads in files with any delimiter - `read_fwf()` - fixed width files - ... ] -- .pull-right[ ## readxl - `read_excel()` - read xls or xlsx files - ... ] --- ## Reading data ```r nobel = read_csv(file = "l09/data/nobel.csv") nobel ``` ``` ## # A tibble: 935 × 26 ## id firstname surname year category affiliation city ## <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> ## 1 1 Wilhelm Conrad Röntgen 1901 Physics Munich Uni… Muni… ## 2 2 Hendrik A. Lorentz 1902 Physics Leiden Uni… Leid… ## 3 3 Pieter Zeeman 1902 Physics Amsterdam … Amst… ## 4 4 Henri Becquerel 1903 Physics École Poly… Paris ## # … with 931 more rows, and 19 more variables: country <chr>, ## # born_date <date>, died_date <date>, gender <chr>, ## # born_city <chr>, born_country <chr>, … ``` --- ## Writing data .pull-left[ - Write a file ```r df = tribble( ~x, ~y, 1, "a", 2, "b", 3, "c" ) write_csv(df, file = "l09/data/df.csv") ``` ] -- .pull-right[ - Read it back in to inspect ```r read_csv("l09/data/df.csv") ``` ``` ## # A tibble: 3 × 2 ## x y ## <dbl> <chr> ## 1 1 a ## 2 2 b ## 3 3 c ``` ] --- .your-turn[ ### .hand[Your turn!] - [RStudio Cloud > l09 - Application Exercise - Nobels and Sales](https://rstudio.cloud/spaces/162296/project/2939203) > open `nobels-csv.Rmd` and knit. - Read in the `nobels.csv` file from the `data-raw/` folder. - Split into two (STEM and non-STEM): - Create a new data frame, `nobel_stem`, that filters for the STEM fields (Physics, Medicine, Chemistry, and Economics). - Create another data frame, `nobel_nonstem`, that filters for the remaining fields. - Write out the two data frames to `nobel-stem.csv` and `nobel-nonstem.csv`, respectively, to `data/`. **Hint:** Use the `%in%` operator when `filter()`ing. ] --- class: middle # Variable names --- class: code70 ## Data with bad names ```r edibnb_badnames = read_csv("l09/data/edibnb-badnames.csv") names(edibnb_badnames) ``` ``` ## [1] "ID" "Price" ## [3] "neighbourhood" "accommodates" ## [5] "Number of bathrooms" "Number of Bedrooms" ## [7] "n beds" "Review Scores Rating" ## [9] "Number of reviews" "listing_url" ``` -- ... spaces need special treatment<sup>1</sup> in variable names ```r ggplot(edibnb_badnames, aes(x = Number of bathrooms, y = Price)) + geom_point() ``` ``` ## Error: <text>:1:40: unexpected symbol ## 1: ggplot(edibnb_badnames, aes(x = Number of ## ^ ``` .footnote[[1] Must escape the object's name with backticks,e.g, \`.] --- class: code70 ## Option 1 - Define column names ```r edibnb_col_names = read_csv( "l09/data/edibnb-badnames.csv", * col_names = c("id","price", "neighbourhood", "accommodates", "bathroom", "bedroom", "bed", "review_scores_rating", "n_reviews", "url")) names(edibnb_col_names) ``` ``` ## [1] "id" "price" ## [3] "neighbourhood" "accommodates" ## [5] "bathroom" "bedroom" ## [7] "bed" "review_scores_rating" ## [9] "n_reviews" "url" ``` --- ## Option 2 - Format text to snake_case ```r edibnb_clean_names = read_csv("l09/data/edibnb-badnames.csv") %>% janitor::clean_names() names(edibnb_clean_names) ``` ``` ## [1] "id" "price" ## [3] "neighbourhood" "accommodates" ## [5] "number_of_bathrooms" "number_of_bedrooms" ## [7] "n_beds" "review_scores_rating" ## [9] "number_of_reviews" "listing_url" ``` --- class: middle # Variable types --- .question[ Which type is `x`? Why? ] .pull-left[ <img src="l09/img/df-na.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ ```r read_csv("l09/data/df-na.csv") ``` ``` ## # A tibble: 9 × 3 ## x y z ## <chr> <chr> <chr> ## 1 1 a hi ## 2 <NA> b hello ## 3 3 Not applicable 9999 ## 4 4 d ola ## 5 5 e hola ## 6 . f whatup ## 7 7 g wassup ## 8 8 h sup ## 9 9 i <NA> ``` ] --- ## Option 1. Explicit NAs ```r read_csv("l09/data/df-na.csv", na = c("", "NA", ".", "9999", "Not applicable")) ``` .pull-left[ <img src="l09/img/df-na.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ ``` ## # A tibble: 9 × 3 ## x y z ## <dbl> <chr> <chr> ## 1 1 a hi ## 2 NA b hello ## 3 3 <NA> <NA> ## 4 4 d ola ## 5 5 e hola ## 6 NA f whatup ## 7 7 g wassup ## 8 8 h sup ## 9 9 i <NA> ``` ] --- ## Option 2. Specify column types ```r read_csv("l09/data/df-na.csv", col_types = list(col_double(), col_character(), col_character())) ``` ``` ## Warning: One or more parsing issues, see `problems()` for details ``` ``` ## # A tibble: 9 × 3 ## x y z ## <dbl> <chr> <chr> ## 1 1 a hi ## 2 NA b hello ## 3 3 Not applicable 9999 ## 4 4 d ola ## 5 5 e hola ## 6 NA f whatup ## 7 7 g wassup ## 8 8 h sup ## 9 9 i <NA> ``` --- ## Column types .small[ **type function** | **data type** ------------------ | ------------- `col_character()` | character `col_date()` | date `col_datetime()` | POSIXct (date-time) `col_double()` | double (numeric) `col_factor()` | factor `col_guess()` | let readr guess (default) `col_integer()` | integer `col_logical()` | logical `col_number()` | numbers mixed with non-number characters `col_numeric()` | double or integer `col_skip()` | do not read `col_time()` | time ] --- class: code70 .question[ Wondering where you remember these from? ] ```r data = read_csv("l09/data/df-na.csv") ``` ``` ## Rows: 9 Columns: 3 ``` ``` ## ── Column specification ───────────────────────────────────────── ## Delimiter: "," ## chr (3): x, y, z ``` ``` ## ## ℹ Use `spec()` to retrieve the full column specification for this data. ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. ``` ```r spec(data) ``` ``` ## cols( ## x = col_character(), ## y = col_character(), ## z = col_character() ## ) ``` --- class:middle # Case study: Favorite foods --- ## Favorite foods <img src="l09/img/fav-food/fav-food.png" width="60%" style="display: block; margin: auto;" /> -- ```r *fav_food = read_excel("l09/data/favourite-food.xlsx") fav_food ``` ``` ## # A tibble: 5 × 6 ## `Student ID` `Full Name` favourite.food mealPlan AGE SES ## <dbl> <chr> <chr> <chr> <chr> <chr> ## 1 1 Sunil Huffm… Strawberry yog… Lunch on… 4 High ## 2 2 Barclay Lynn French fries Lunch on… 5 Midd… ## 3 3 Jayendra Ly… N/A Breakfas… 7 Low ## 4 4 Leon Rossini Anchovies Lunch on… 99999 Midd… ## 5 5 Chidiegwu D… Pizza Breakfas… five High ``` --- ## Variable names <img src="l09/img/fav-food/fav-food-names.png" width="60%" style="display: block; margin: auto;" /> -- ```r fav_food = read_excel("l09/data/favourite-food.xlsx") %>% * janitor::clean_names() fav_food ``` ``` ## # A tibble: 5 × 6 ## student_id full_name favourite_food meal_plan age ses ## <dbl> <chr> <chr> <chr> <chr> <chr> ## 1 1 Sunil Huff… Strawberry yogh… Lunch only 4 High ## 2 2 Barclay Ly… French fries Lunch only 5 Midd… ## 3 3 Jayendra L… N/A Breakfast … 7 Low ## 4 4 Leon Rossi… Anchovies Lunch only 99999 Midd… ## 5 5 Chidiegwu … Pizza Breakfast … five High ``` --- ## Handling NAs <img src="l09/img/fav-food/fav-food-nas.png" width="60%" style="display: block; margin: auto;" /> -- ```r fav_food = read_excel("l09/data/favourite-food.xlsx", * na = c("N/A", "99999")) %>% janitor::clean_names() fav_food ``` ``` ## # A tibble: 5 × 6 ## student_id full_name favourite_food meal_plan age ses ## <dbl> <chr> <chr> <chr> <chr> <chr> ## 1 1 Sunil Huff… Strawberry yogh… Lunch only 4 High ## 2 2 Barclay Ly… French fries Lunch only 5 Midd… ## 3 3 Jayendra L… <NA> Breakfast … 7 Low ## 4 4 Leon Rossi… Anchovies Lunch only <NA> Midd… ## 5 5 Chidiegwu … Pizza Breakfast … five High ``` --- ## Make `age` numeric .pull-left-wide[ ```r fav_food = fav_food %>% * mutate( * age = if_else(age == "five", "5", age), * age = as.numeric(age) * ) glimpse(fav_food) ``` ``` ## Rows: 5 ## Columns: 6 ## $ student_id <dbl> 1, 2, 3, 4, 5 ## $ full_name <chr> "Sunil Huffmann", "Barclay Lynn", "Jayen… ## $ favourite_food <chr> "Strawberry yoghurt", "French fries", NA… ## $ meal_plan <chr> "Lunch only", "Lunch only", "Breakfast a… ## $ age <dbl> 4, 5, 7, NA, 5 ## $ ses <chr> "High", "Middle", "Low", "Middle", "High" ``` ] .pull-right-narrow[ <img src="l09/img/fav-food/fav-food-age.png" width="60%" style="display: block; margin: auto;" /> ] --- ## Socio-economic status .question[ What order are the levels of `ses` listed in? ] .pull-left-wide[ ```r fav_food %>% count(ses) ``` ``` ## # A tibble: 3 × 2 ## ses n ## <chr> <int> ## 1 High 2 ## 2 Low 1 ## 3 Middle 2 ``` ] .pull-right-narrow[ <img src="l09/img/fav-food/fav-food-ses.png" width="60%" style="display: block; margin: auto;" /> ] --- ## Make `ses` factor .pull-left-wide[ ```r fav_food = fav_food %>% * mutate(ses = fct_relevel(ses, "Low", "Middle", "High")) fav_food %>% count(ses) ``` ``` ## # A tibble: 3 × 2 ## ses n ## <fct> <int> ## 1 Low 1 ## 2 Middle 2 ## 3 High 2 ``` ] --- ## Putting it altogether ```r fav_food = read_excel("l09/data/favourite-food.xlsx", na = c("N/A", "99999")) %>% janitor::clean_names() %>% mutate( age = if_else(age == "five", "5", age), age = as.numeric(age), ses = fct_relevel(ses, "Low", "Middle", "High") ) fav_food ``` ``` ## # A tibble: 5 × 6 ## student_id full_name favourite_food meal_plan age ses ## <dbl> <chr> <chr> <chr> <dbl> <fct> ## 1 1 Sunil Huff… Strawberry yogh… Lunch only 4 High ## 2 2 Barclay Ly… French fries Lunch only 5 Midd… ## 3 3 Jayendra L… <NA> Breakfast … 7 Low ## 4 4 Leon Rossi… Anchovies Lunch only NA Midd… ## 5 5 Chidiegwu … Pizza Breakfast … 5 High ``` --- ## Out and back in ```r write_csv(fav_food, file = "l09/data/fav-food-clean.csv") fav_food_clean = read_csv("l09/data/fav-food-clean.csv") ``` --- .question[ What happened to `ses` again? ] ```r fav_food_clean %>% count(ses) ``` ``` ## # A tibble: 3 × 2 ## ses n ## <chr> <int> ## 1 High 2 ## 2 Low 1 ## 3 Middle 2 ``` --- ## `readRDS()` and `saveRDS()` - CSVs can be unreliable for saving interim results if there is specific variable type information you want to hold on to. - An alternative is RDS files, you can read and write them with `readRDS()` and `saveRDS()`, respectively. ```r readRDS(path) saveRDS(x, path) ``` --- ## Out and back in, take 2 ```r saveRDS(fav_food, file = "l09/data/fav-food-clean.rds") fav_food_clean = readRDS("l09/data/fav-food-clean.rds") fav_food_clean %>% count(ses) ``` ``` ## # A tibble: 3 × 2 ## ses n ## <fct> <int> ## 1 Low 1 ## 2 Middle 2 ## 3 High 2 ``` --- class: middle # Other types of data --- ## Other types of data - **googlesheets4:** Google Sheets - **haven**: SPSS, Stata, and SAS files - **DBI**, along with a database specific backend (e.g. RMySQL, RSQLite, RPostgreSQL etc): allows you to run SQL queries against a database and return a data frame - **jsonline**: JSON - **xml2**: xml - **rvest**: web scraping - **httr**: web APIs - **sparklyr**: data loaded into spark --- .your-turn[ ### .hand[Your turn!] .midi[ - [RStudio Cloud > l09 - Application Exercise - Nobels and Sales](https://rstudio.cloud/spaces/162296/project/2939203) > `sales-excel.Rmd`. - Load the `sales.xlsx` file from the `data-raw/` folder, using appropriate arguments for the `read_excel()` function such that it looks like the output on the left. - **Stretch goal:** Manipulate the sales data such that it looks like the output on the right. ] ] .pull-left[ ``` ## # A tibble: 9 × 2 ## id n ## <chr> <chr> ## 1 Brand 1 n ## 2 1234 8 ## 3 8721 2 ## 4 1822 3 ## # … with 5 more rows ``` ] .pull-right[ ``` ## # A tibble: 7 × 3 ## brand id n ## <chr> <dbl> <dbl> ## 1 Brand 1 1234 8 ## 2 Brand 1 8721 2 ## 3 Brand 1 1822 3 ## 4 Brand 2 3333 1 ## 5 Brand 2 2156 3 ## 6 Brand 2 3987 6 ## 7 Brand 2 3216 5 ``` ] --- ## File organization and rstudio cloud - Picking a sensible and consistent way to organize raw and processed data will save many headaches! - .alert[Generally data should live in its own folder in the analysis project] - But: Sometimes the loading and tidying of the data might be its own project, per se, - Sometimes data will need to be reused across many different projects, - And sometimes data is too big or too sensitive to put on github. - In Rstudio cloud, it is possible to upload, download files and move files using the "files" pane, and move - We'll touch project organization again, when we remove the training wheels and make and manage our own desktop Rstudio projects --- ## Acknowledgments [Data science in a box](https://rstudio-education.github.io/datascience-box/course-materials/slides/u2-d12-data-import/u2-d12-data-import.html#4) ## Other reading [R4DS chapter 11](https://r4ds.had.co.nz/data-import.html#data-import)