Data Analysis Exercise - Module 4

Author

Aidan Troha &

These data, from the National Center for Health Statistics (NCHS), are provisional death counts for the United States, obtained from the CDC data portal, data.cdc.gov. They break down COVID-19-related deaths by educational attainment, age, sex, and race or Hispanic origin. The counts cover deaths from January 1st, 2020 through January 30th, 2021, and the data set was last updated on February 3rd, 2021.

library(readr)
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.1     ✔ dplyr   1.1.0
✔ tibble  3.1.8     ✔ stringr 1.4.1
✔ tidyr   1.2.1     ✔ forcats 0.5.2
✔ purrr   0.3.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
# Imports the raw data set. The original data set is a CSV file.
raw_data <- read_csv("data/AH_Provisional_COVID-19_Deaths_by_Educational_Attainment__Race__Sex__and_Age.csv")
Rows: 224 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): Data as of, Start Date, End Date, Education Level, Race or Hispanic...
dbl (2): COVID-19 Deaths, Total Deaths

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
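The message above suggests specifying the column types explicitly. A minimal sketch of what that might look like follows, assuming the three date columns are always written as month/day/year. It inlines two rows of this data set so that it runs without the CSV file; the names `csv_text` and `typed` are illustrative and not part of the original analysis.

```r
library(readr)

# Two rows of the data set, inlined with I() so the sketch is self-contained.
csv_text <- I('Data as of,Start Date,End Date,Education Level,Race or Hispanic Origin,Sex,Age Group,COVID-19 Deaths,Total Deaths
02/03/2021,01/01/2020,01/30/2021,Associate degree or some college,Hispanic,Female,0-17 years,0,2
02/02/2021,01/01/2020,01/30/2021,Associate degree or some college,Hispanic,Female,18-49 years,423,3117
')

# Assumption: dates use the mm/dd/YYYY format; everything not listed is text.
typed <- read_csv(
  csv_text,
  col_types = cols(
    `Data as of`      = col_date(format = "%m/%d/%Y"),
    `Start Date`      = col_date(format = "%m/%d/%Y"),
    `End Date`        = col_date(format = "%m/%d/%Y"),
    `COVID-19 Deaths` = col_double(),
    `Total Deaths`    = col_double(),
    .default          = col_character()
  )
)
```

With `col_types` supplied, `read_csv()` no longer prints the column specification message, and the date columns arrive as `Date` rather than character.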
# Shows the classes of the variables.
glimpse(raw_data)
Rows: 224
Columns: 9
$ `Data as of`              <chr> "02/03/2021", "02/02/2021", "02/02/2021", "0…
$ `Start Date`              <chr> "01/01/2020", "01/01/2020", "01/01/2020", "0…
$ `End Date`                <chr> "01/30/2021", "01/30/2021", "01/30/2021", "0…
$ `Education Level`         <chr> "Associate degree or some college", "Associa…
$ `Race or Hispanic Origin` <chr> "Hispanic", "Hispanic", "Hispanic", "Hispani…
$ Sex                       <chr> "Female", "Female", "Female", "Female", "Mal…
$ `Age Group`               <chr> "0-17 years", "18-49 years", "50-64 years", …
$ `COVID-19 Deaths`         <dbl> 0, 423, 857, 1793, 0, 737, 1592, 2655, 0, 82…
$ `Total Deaths`            <dbl> 2, 3117, 4153, 10225, 1, 5676, 6183, 11544, …
# Creates a new data set with the variables we would like to keep. In an effort to be 
# more user friendly, the variable names have been converted to all lowercase with no 
# spaces. Also, some variables have been converted to factor classes.
new_data <- raw_data %>%
  # Creates renamed copies of the variables and converts the categorical ones to factors.
  mutate(
    education_level = as.factor(`Education Level`),
    race_origin = as.factor(`Race or Hispanic Origin`),
    sex = as.factor(Sex),
    age_group = as.factor(`Age Group`),
    covid_deaths = `COVID-19 Deaths`,
    total_deaths = `Total Deaths`
  ) %>%
  # Keeps only the renamed, properly formatted variables in the new data set.
  select(education_level, race_origin, sex, age_group, covid_deaths, total_deaths)
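The renaming and factor conversion above could also be written as a single `select()` that renames while it selects, followed by one `across()` call. This is only an alternative sketch, not the approach used in this exercise; `raw_toy` and `new_toy` are hypothetical names, and the two rows are a small stand-in for `raw_data`.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for raw_data with the same column names.
raw_toy <- tibble(
  `Data as of`              = c("02/03/2021", "02/02/2021"),
  `Education Level`         = rep("Associate degree or some college", 2),
  `Race or Hispanic Origin` = rep("Hispanic", 2),
  Sex                       = rep("Female", 2),
  `Age Group`               = c("0-17 years", "18-49 years"),
  `COVID-19 Deaths`         = c(0, 423),
  `Total Deaths`            = c(2, 3117)
)

new_toy <- raw_toy %>%
  # select() keeps only the listed columns and renames them in one step.
  select(
    education_level = `Education Level`,
    race_origin     = `Race or Hispanic Origin`,
    sex             = Sex,
    age_group       = `Age Group`,
    covid_deaths    = `COVID-19 Deaths`,
    total_deaths    = `Total Deaths`
  ) %>%
  # Converts every remaining character column to a factor.
  mutate(across(where(is.character), as.factor))
```

The trade-off is that `across(where(is.character), ...)` converts every character column that survives the `select()`, so it only suits cases where all of them are meant to be categorical.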
# Shows a summary of the variables included in the dataset.
glimpse(new_data)
Rows: 224
Columns: 6
$ education_level <fct> Associate degree or some college, Associate degree or …
$ race_origin     <fct> Hispanic, Hispanic, Hispanic, Hispanic, Hispanic, Hisp…
$ sex             <fct> Female, Female, Female, Female, Male, Male, Male, Male…
$ age_group       <fct> 0-17 years, 18-49 years, 50-64 years, 65 years and ove…
$ covid_deaths    <dbl> 0, 423, 857, 1793, 0, 737, 1592, 2655, 0, 82, 176, 362…
$ total_deaths    <dbl> 2, 3117, 4153, 10225, 1, 5676, 6183, 11544, 0, 591, 79…
summary(new_data)
                         education_level
 Associate degree or some college:56    
 Bachelor’s degree or more       :56    
 High school graduate/GED or less:56    
 Unknown                         :56    
                                        
                                        
                                        
                                                 race_origin     sex     
 Hispanic                                              :32   Female:112  
 Non-Hispanic American Indian or Alaska Native         :32   Male  :112  
 Non-Hispanic Asian                                    :32               
 Non-Hispanic Black                                    :32               
 Non-Hispanic Native Hawaiian or Other Pacific Islander:32               
 Non-Hispanic White                                    :32               
 Other/Unknown                                         :32               
             age_group   covid_deaths       total_deaths     
 0-17 years       :56   Min.   :    0.00   Min.   :     0.0  
 18-49 years      :56   1st Qu.:    3.75   1st Qu.:   112.0  
 50-64 years      :56   Median :   81.00   Median :   817.5  
 65 years and over:56   Mean   : 1880.20   Mean   : 15665.8  
                        3rd Qu.:  627.00   3rd Qu.:  4997.5  
                        Max.   :76871.00   Max.   :670295.0
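Because the raw counts in the summary span several orders of magnitude, a ratio is often easier to compare across groups. As a hedged sketch of one possible next step, the code below computes the share of deaths attributed to COVID-19 within each age group. It uses only the first eight rows shown in the `glimpse()` output (Hispanic, associate degree or some college), so the resulting shares are illustrative and do not describe the full data set; `toy` and `covid_share` are hypothetical names.

```r
library(dplyr)
library(tibble)

# First eight rows of new_data, typed in by hand from the glimpse() output.
toy <- tibble(
  sex          = factor(rep(c("Female", "Male"), each = 4)),
  age_group    = factor(rep(c("0-17 years", "18-49 years",
                              "50-64 years", "65 years and over"), times = 2)),
  covid_deaths = c(0, 423, 857, 1793, 0, 737, 1592, 2655),
  total_deaths = c(2, 3117, 4153, 10225, 1, 5676, 6183, 11544)
)

covid_share <- toy %>%
  group_by(age_group) %>%
  # Sums the counts over sex within each age group before taking the ratio.
  summarise(
    covid_deaths = sum(covid_deaths),
    total_deaths = sum(total_deaths),
    .groups = "drop"
  ) %>%
  mutate(covid_share = covid_deaths / total_deaths)

print(covid_share)
```

Summing the counts before dividing weights each group by its number of deaths, which is usually preferable to averaging per-row ratios when some rows have very small totals.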