We use anyNA() to check whether the original data set contains any missing values. Inside dplyr’s select(), wrapping a set of variables in !c() negates the selection, so we can list the variables we do not want and keep everything else; select() can equally be given the variables we do want in the new data set. Finally, we apply drop_na() to remove any rows with missing data. At the end, we confirm that all missing data were excluded by running anyNA() on the new data set.
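The steps above can be sketched on a small stand-in data set. This is only an illustration: raw_demo and its column names (including Unused1 and Unused2) are made up for the example and are not the variables from the original data.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the raw data; the column names are hypothetical.
raw_demo <- data.frame(
  Fatigue  = factor(c("Yes", "No", NA, "Yes")),
  BodyTemp = c(98.6, NA, 99.1, 100.2),
  Unused1  = c(1, 2, 3, 4),
  Unused2  = c("a", "b", "c", "d")
)

# Check for missing values before cleaning.
anyNA(raw_demo)  # TRUE

clean_demo <- raw_demo %>%
  select(!c(Unused1, Unused2)) %>%  # keep everything except the listed columns
  drop_na()                         # drop rows with an NA in any remaining column

# Confirm that no missing values remain.
anyNA(clean_demo)  # FALSE
```

Note that drop_na() comes from tidyr rather than dplyr, so both packages need to be loaded.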
A number of variables are redundant. To build a useful model, we need to remove these redundancies. We can do this with select() from dplyr, as we did previously.
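One way to see what redundancy can look like is a Yes/No flag that is fully determined by a severity score. The sketch below assumes such a pair; the WeaknessYN column is hypothetical and only serves to show how a cross-tabulation reveals the overlap before the column is dropped.

```r
library(dplyr)

# Hypothetical columns: a severity score plus a Yes/No flag derived from it.
demo <- data.frame(
  Weakness   = factor(c("None", "Mild", "Severe", "None", "Moderate"),
                      levels = c("None", "Mild", "Moderate", "Severe")),
  WeaknessYN = factor(c("No", "Yes", "Yes", "No", "Yes")),
  Nausea     = factor(c("No", "No", "Yes", "Yes", "No"))
)

# Each severity level maps to exactly one Yes/No value,
# so the flag carries no extra information.
table(demo$Weakness, demo$WeaknessYN)

# Drop the redundant flag and keep the more detailed severity score.
demo_reduced <- demo %>% select(!WeaknessYN)
names(demo_reduced)
```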
Next, we check whether any Yes/No variable is extremely unbalanced, meaning that fewer than 50 observations fall in either the Yes or the No category.
summary(flu_clean)
SwollenLymphNodes  ChestCongestion  ChillsSweats  NasalCongestion  Sneeze
No :418            No :323          No :130       No :167          No :339
Yes:312            Yes:407          Yes:600       Yes:563          Yes:391

Fatigue   SubjectiveFever  Headache  Weakness       CoughIntensity
No : 64   No :230          No :115   None    : 49   None    : 47
Yes:666   Yes:500          Yes:615   Mild    :223   Mild    :154
                                     Moderate:338   Moderate:357
                                     Severe  :120   Severe  :172

Myalgia        RunnyNose  AbPain   ChestPain  Diarrhea  EyePn    Insomnia
None    : 79   No :211    No :639  No :497    No :631   No :617  No :315
Mild    :213   Yes:519    Yes: 91  Yes:233    Yes: 99   Yes:113  Yes:415
Moderate:325
Severe  :113

ItchyEye  Nausea   EarPn    Hearing  Pharyngitis  Breathless  ToothPn
No :551   No :475  No :568  No :700  No :119      No :436     No :565
Yes:179   Yes:255  Yes:162  Yes: 30  Yes:611      Yes:294     Yes:165

Vision   Vomit    Wheeze   BodyTemp
No :711  No :652  No :510  Min.   : 97.20
Yes: 19  Yes: 78  Yes:220  1st Qu.: 98.20
                           Median : 98.50
                           Mean   : 98.94
                           3rd Qu.: 99.30
                           Max.   :103.10
According to the table above, the Vision (19 Yes) and Hearing (30 Yes) variables meet this criterion, so we will remove those variables.
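Reading the counts off the summary works, but the same check can also be done programmatically. The following is a minimal sketch on a made-up data set: flu_demo, the helper is_sparse_binary(), and the min_n argument are all assumptions introduced for illustration, not part of the original analysis.

```r
library(dplyr)

# Toy stand-in with one unbalanced Yes/No variable (Vision).
flu_demo <- data.frame(
  Nausea   = factor(rep(c("No", "Yes"), times = c(120, 80))),
  Vision   = factor(rep(c("No", "Yes"), times = c(190, 10))),
  Weakness = factor(rep(c("None", "Mild", "Moderate", "Severe"),
                        times = c(20, 60, 80, 40))),
  BodyTemp = seq(97, 103, length.out = 200)
)

# Hypothetical helper: TRUE for a two-level factor in which
# either level has fewer than min_n observations.
is_sparse_binary <- function(x, min_n = 50) {
  is.factor(x) && nlevels(x) == 2 && any(table(x) < min_n)
}

sparse_vars <- names(flu_demo)[vapply(flu_demo, is_sparse_binary, logical(1))]
sparse_vars  # "Vision"

# Remove the flagged variables.
flu_demo_reduced <- flu_demo %>% select(!all_of(sparse_vars))
```

The helper deliberately skips factors with more than two levels, so severity scores such as Weakness are not flagged even when one of their levels is rare.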
Again, we use the here() function to state clearly where the RDS file should be saved, and saveRDS() to write the data set to that file in RDS format.
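A minimal sketch of this save step is shown below. The folder names, the file name, and the flu_demo object are assumptions made for the example, since the actual project path is not given here.

```r
library(here)

# Small stand-in for the cleaned data set.
flu_demo <- data.frame(
  Nausea   = factor(c("No", "Yes", "No")),
  BodyTemp = c(98.2, 99.3, 98.5)
)

# Hypothetical project-relative location; adjust to the real project layout.
out_path <- here("data", "processed", "flu_clean.rds")

# Create the target folder if it does not exist yet.
dir.create(dirname(out_path), recursive = TRUE, showWarnings = FALSE)

# Save the data set as an RDS file, then read it back to verify the round trip.
saveRDS(flu_demo, file = out_path)
flu_back <- readRDS(out_path)
identical(flu_demo, flu_back)
```

Because here() builds the path from the project root, the same code runs regardless of the current working directory, and the RDS format preserves column types such as factors.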