Flu Analysis - Data Wrangling

Author

Aidan Troha

We begin by loading the tidyverse packages with library().

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.1     ✔ purrr   0.3.4
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.2.1     ✔ stringr 1.4.1
✔ readr   2.1.2     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

We use the here() function from the here package to build the file path relative to the project root, then import the .Rda file with readRDS().

fludat_raw <- here::here("fluanalysis","data","raw_data","SympAct_Any_Pos.Rda")
flu_raw <- readRDS(fludat_raw)

We first use anyNA() to check whether the original data set contains any missing values. We then use dplyr’s select() with !c() to drop the variables we do not want: every column whose name contains "Score", "Total", "FluA", "FluB", "Dxname", or "Activity", plus Unique.Visit. Finally, we apply drop_na() to remove every row that still has a missing value, and we confirm with a second call to anyNA() that the cleaned data set contains no missing data.

anyNA(flu_raw)
[1] TRUE
flu_clean <- flu_raw %>%
                select(!c(contains(c("Score","Total","FluA","FluB",
                                     "Dxname","Activity")),"Unique.Visit")) %>%
                drop_na()
anyNA(flu_clean)
[1] FALSE
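
To make the effect of each step easier to see, here is a minimal sketch on a small made-up data frame. The object `toy` and its values are hypothetical and only mimic the style of the real column names; the calls themselves (select(), contains(), drop_na()) are the same ones used above. Note that contains() ignores case by default, so it also matches lower-case variants of each pattern.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data; NOT the real flu data set
toy <- data.frame(
  Fatigue      = c("Yes", "No", NA, "Yes"),
  BodyTemp     = c(98.6, 99.1, 100.2, NA),
  CoughScore   = c(1, 2, 3, 4),
  Unique.Visit = c("a", "b", "c", "d")
)

# Count missing values per column before cleaning
na_per_column <- colSums(is.na(toy))
print(na_per_column)

toy_clean <- toy %>%
  select(!c(contains("Score"), "Unique.Visit")) %>%  # drop unwanted columns
  drop_na()                                          # drop rows with any NA

print(names(toy_clean))  # only Fatigue and BodyTemp remain
print(nrow(toy_clean))   # the two rows containing an NA are gone
print(anyNA(toy_clean))
```

Inspecting the per-column counts with colSums(is.na()) before calling drop_na() shows which variables are responsible for the rows that get removed.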

Module 11 - Pre-Processing

Several variables are redundant. To build a useful model, we need to remove these redundancies. We can do this with select() from dplyr, as we did previously.

flu_clean <- flu_clean %>%
             select(!c(WeaknessYN,CoughYN,CoughYN2,MyalgiaYN))
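
One way to confirm such a redundancy before dropping a column is to cross-tabulate the two versions of a symptom. The sketch below assumes, as the names suggest but as is not verified here, that a variable like WeaknessYN is simply a Yes/No collapse of the graded Weakness variable. The data frame `toy` is hypothetical.

```r
# Hypothetical toy data illustrating a graded variable and its Yes/No counterpart
toy <- data.frame(
  Weakness   = factor(c("None", "Mild", "Moderate", "Severe", "None", "Mild"),
                      levels = c("None", "Mild", "Moderate", "Severe")),
  WeaknessYN = factor(c("No", "Yes", "Yes", "Yes", "No", "Yes"),
                      levels = c("No", "Yes"))
)

# Cross-tabulation: if the Yes/No column is redundant, each severity level
# falls into exactly one of the two columns
xtab <- table(toy$Weakness, toy$WeaknessYN)
print(xtab)

# Rebuild the Yes/No variable from the graded one and compare
derived <- ifelse(toy$Weakness == "None", "No", "Yes")
is_redundant <- all(derived == as.character(toy$WeaknessYN))
print(is_redundant)
```

If the comparison holds on the real data, the Yes/No column carries no information beyond the graded one and can be dropped safely.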

Next, we check whether any Yes/No variable is extremely unbalanced, which we define here as having fewer than 50 responses for either Yes or No.

summary(flu_clean)
 SwollenLymphNodes ChestCongestion ChillsSweats NasalCongestion Sneeze   
 No :418           No :323         No :130      No :167         No :339  
 Yes:312           Yes:407         Yes:600      Yes:563         Yes:391  
                                                                         
                                                                         
                                                                         
                                                                         
 Fatigue   SubjectiveFever Headache      Weakness    CoughIntensity
 No : 64   No :230         No :115   None    : 49   None    : 47   
 Yes:666   Yes:500         Yes:615   Mild    :223   Mild    :154   
                                     Moderate:338   Moderate:357   
                                     Severe  :120   Severe  :172   
                                                                   
                                                                   
     Myalgia    RunnyNose AbPain    ChestPain Diarrhea  EyePn     Insomnia 
 None    : 79   No :211   No :639   No :497   No :631   No :617   No :315  
 Mild    :213   Yes:519   Yes: 91   Yes:233   Yes: 99   Yes:113   Yes:415  
 Moderate:325                                                              
 Severe  :113                                                              
                                                                           
                                                                           
 ItchyEye  Nausea    EarPn     Hearing   Pharyngitis Breathless ToothPn  
 No :551   No :475   No :568   No :700   No :119     No :436    No :565  
 Yes:179   Yes:255   Yes:162   Yes: 30   Yes:611     Yes:294    Yes:165  
                                                                         
                                                                         
                                                                         
                                                                         
 Vision    Vomit     Wheeze       BodyTemp     
 No :711   No :652   No :510   Min.   : 97.20  
 Yes: 19   Yes: 78   Yes:220   1st Qu.: 98.20  
                               Median : 98.50  
                               Mean   : 98.94  
                               3rd Qu.: 99.30  
                               Max.   :103.10  
flu_clean <- flu_clean %>%
             select(!c(Vision,Hearing))

According to the summary above, Vision (19 Yes responses) and Hearing (30 Yes responses) meet this criterion, so we remove both variables.
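
Reading the counts off the summary works for a data set of this size, but the same check can also be done programmatically. The following is a minimal sketch in base R on hypothetical toy data; `toy`, `threshold`, and the other object names are illustrative and not part of the original analysis.

```r
# Hypothetical toy data: two Yes/No factors and one numeric column
toy <- data.frame(
  Vision   = factor(c(rep("No", 181), rep("Yes", 19)),  levels = c("No", "Yes")),
  Headache = factor(c(rep("No", 80),  rep("Yes", 120)), levels = c("No", "Yes")),
  BodyTemp = seq(97, 103, length.out = 200)
)

threshold <- 50  # minimum number of responses required in each level

# Identify the two-level factors, then find the count of the rarer level in each
is_binary  <- vapply(toy, function(x) is.factor(x) && nlevels(x) == 2, logical(1))
min_counts <- vapply(toy[is_binary], function(x) min(table(x)), integer(1))

# Flag the variables whose rarer level falls below the threshold
unbalanced <- names(min_counts)[min_counts < threshold]
print(unbalanced)

# Drop the flagged variables
toy_balanced <- toy[, !(names(toy) %in% unbalanced), drop = FALSE]
print(names(toy_balanced))
```

Restricting the check to two-level factors leaves graded variables such as Weakness and numeric variables such as BodyTemp untouched.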

Finally, we again use the here() function to state clearly where the processed data should be saved, and we use saveRDS() to write the data set as an RDS file.

fludat_clean <- here::here("fluanalysis","data","processed_data","flu_processed")
saveRDS(flu_clean,file=fludat_clean)
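
A quick way to confirm that the file was written correctly is to read it back and compare it with the object in memory. The sketch below uses a hypothetical stand-in data frame and a temporary file instead of the project path; the `.rds` extension is a common convention assumed here and is not the file name used above.

```r
# Hypothetical stand-in for flu_clean
toy <- data.frame(
  Fatigue  = factor(c("Yes", "No", "Yes"), levels = c("No", "Yes")),
  BodyTemp = c(98.6, 99.1, 100.2)
)

# Write to a temporary location rather than the project directory
tmp_path <- file.path(tempdir(), "flu_processed.rds")
saveRDS(toy, file = tmp_path)

# Read the file back and check that nothing changed in the round trip
toy_back <- readRDS(tmp_path)
round_trip_ok <- isTRUE(all.equal(toy, toy_back))
print(round_trip_ok)
print(levels(toy_back$Fatigue))  # factor levels survive the round trip
```

Unlike a CSV export, an RDS file preserves column types such as factors and their level order, so later modeling scripts do not need to re-declare them.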