Flu Analysis - Model Analysis & Evaluation

library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom        1.0.3     ✔ recipes      1.0.5
✔ dials        1.1.0     ✔ rsample      1.1.1
✔ dplyr        1.1.0     ✔ tibble       3.1.8
✔ ggplot2      3.4.1     ✔ tidyr        1.2.1
✔ infer        1.0.4     ✔ tune         1.0.1
✔ modeldata    1.1.0     ✔ workflows    1.1.3
✔ parsnip      1.0.4     ✔ workflowsets 1.0.0
✔ purrr        0.3.4     ✔ yardstick    1.1.0
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()
✖ recipes::step()  masks stats::step()
• Use tidymodels_prefer() to resolve common conflicts.
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ readr   2.1.2     ✔ forcats 0.5.2
✔ stringr 1.4.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ readr::col_factor() masks scales::col_factor()
✖ purrr::discard()    masks scales::discard()
✖ dplyr::filter()     masks stats::filter()
✖ stringr::fixed()    masks recipes::fixed()
✖ dplyr::lag()        masks stats::lag()
✖ readr::spec()       masks yardstick::spec()

We will be using the cleaned data from the flu analysis, so we first load it from the processed_data folder.

dat <- readRDS(here::here("fluanalysis", "data", "processed_data", "flu_processed"))

We’ll then need to split the cleaned data into two parts: a test data set, held out to assess the efficacy of the generated model, and a training data set, used to fit the model.

To make this reproducible, we set a seed with set.seed() before any randomization. We then use initial_split() from the rsample package to generate a splitting rule for the training and test data sets.

# Reserve 70% of the data for training and hold out the remaining 30% for testing
set.seed(4444444)
data_split <- initial_split(dat, prop = 7/10)
training_data <- training(data_split)
test_data <- testing(data_split)
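
One optional refinement, not applied to the results below: initial_split() also accepts a strata argument, which keeps the proportions of the outcome similar in the training and test sets. A minimal sketch:

# Stratified alternative to the split above: preserve the Yes/No
# balance of Nausea in both sets
data_split_strat <- initial_split(dat, prop = 7/10, strata = Nausea)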

We intend to use the tidymodels workflow to generate our logistic regression model. Within this workflow, we use recipe() to specify the model formula and workflow() to bundle the preprocessing and the model specification together.

# Define the model formula: Nausea as the outcome, all other variables as predictors
flu_logit_rec <- recipe(Nausea ~ ., data = training_data)
# Specify the logistic regression model
logit_mod <- logistic_reg() %>%
             set_engine("glm")
# Initialize the workflow
flu_wflow1 <- 
             workflow() %>%
             add_model(logit_mod) %>%
             add_recipe(flu_logit_rec)
flu_wflow1
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────────────────
Logistic Regression Model Specification (classification)

Computational engine: glm 

Now that we have generated the workflow, we can fit the model to the training and test data sets separately.

training_fit <- flu_wflow1 %>%
                fit(data = training_data)

test_fit <- flu_wflow1 %>%
            fit(data = test_data)

We now want to examine the model’s predictions. To do this, we use augment(), which appends the predicted classes and class probabilities to each data set.

training_aug <- augment(training_fit, training_data)
test_aug <- augment(test_fit, test_data)
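
For a two-class factor outcome such as Nausea, the augmented data gain a hard class prediction (.pred_class) and one probability column per level (.pred_No and .pred_Yes). A quick way to inspect them:

# Peek at the prediction columns alongside the observed outcome
training_aug %>%
      select(Nausea, .pred_class, .pred_No, .pred_Yes) %>%
      head()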

If we want to assess how well the model makes predictions, we can evaluate this with an ROC curve. roc_curve() and autoplot() will prepare the plot for us to evaluate the model on the training_data and the test_data separately.

training_aug %>%
      roc_curve(truth = Nausea, .pred_No) %>%
      autoplot()
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
  always returns an ungrouped data frame and adjust accordingly.
ℹ The deprecated feature was likely used in the yardstick package.
  Please report the issue at <https://github.com/tidymodels/yardstick/issues>.

roc_auc() estimates the area under the ROC curve. An area close to 1 indicates good predictive performance, while an area near 0.5 means the model predicts no better than chance.

training_aug %>%
      roc_auc(truth = Nausea, .pred_No)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.775
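
If we want several metrics at once, yardstick's metric_set() bundles probability and class metrics into a single callable. A brief sketch:

# Combine a probability metric (roc_auc) with a class metric (accuracy);
# probability columns are passed positionally, hard predictions via estimate
class_metrics <- metric_set(roc_auc, accuracy)
training_aug %>%
      class_metrics(truth = Nausea, .pred_No, estimate = .pred_class)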

We repeat the same steps above for the test_data.

test_aug %>%
      roc_curve(truth = Nausea, .pred_No) %>%
      autoplot()

test_aug %>%
      roc_auc(truth = Nausea, .pred_No)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.877

Overall, the model appears to predict the data fairly well: the area under the curve exceeds 0.7 for both the training data (0.775) and the test data (0.877).
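
A complementary check on the hard class predictions is a confusion matrix; a brief sketch on the test set:

# Cross-tabulate observed vs. predicted Nausea
test_aug %>%
      conf_mat(truth = Nausea, estimate = .pred_class)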

Part 2: Let’s see how a more restrictive model performs.

Now, let’s use only one predictor instead of all of them.

flu_logit_rec2 <- recipe(Nausea ~ RunnyNose, data = training_data)

flu_wflow2 <- 
             workflow() %>%
             add_model(logit_mod) %>%
             add_recipe(flu_logit_rec2)

training_fit2 <- flu_wflow2 %>%
                fit(data = training_data)

test_fit2 <- flu_wflow2 %>%
            fit(data = test_data)

training_aug2 <- augment(training_fit2, training_data)
test_aug2 <- augment(test_fit2, test_data)

The ROC curve of the training data set:

training_aug2 %>%
      roc_curve(truth = Nausea, .pred_No) %>%
      autoplot()

training_aug2 %>%
      roc_auc(truth = Nausea, .pred_No)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.505

With an AUC of 0.505, barely above chance, the model fits the training data poorly.

The ROC curve of the test data set:

test_aug2 %>%
      roc_curve(truth = Nausea, .pred_No) %>%
      autoplot()

test_aug2 %>%
      roc_auc(truth = Nausea, .pred_No)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.505

The test-set AUC is likewise 0.505, so the model is not a good fit of the data there either.
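
This is expected: a single binary predictor can produce only two distinct predicted probabilities, so the ROC curve has essentially one useful operating point. A quick sketch with the augmented test data confirms this:

# Each level of RunnyNose maps to a single predicted probability
test_aug2 %>%
      count(RunnyNose, .pred_No)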

This section added by Weifan Wu

Linear model for continuous outcome BodyTemp

Creating workflow and fitting model using all predictors

# Create the recipe and dummy-code all categorical variables
set.seed(123)
temp_rec <- recipe(BodyTemp ~ ., data = training_data) %>%
  step_dummy(all_nominal())
# Specify the linear regression model
lm_mod <- linear_reg() %>%
  set_engine("lm")
# Create the workflow
temp_workflow <- workflow() %>%
  add_model(lm_mod) %>%
  add_recipe(temp_rec)
temp_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step

• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm 
temp_fit <- temp_workflow %>%
  fit(data = training_data)
# Check the parameter estimates, arranged by p-value
temp_fit %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  arrange(p.value)
# A tibble: 32 × 5
   term                estimate std.error statistic  p.value
   <chr>                  <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)           98.1       0.355    277.   0       
 2 SubjectiveFever_Yes    0.428     0.119      3.59 0.000362
 3 Sneeze_Yes            -0.327     0.115     -2.85 0.00458 
 4 Pharyngitis_Yes        0.333     0.141      2.37 0.0183  
 5 Headache_Yes          -0.342     0.148     -2.31 0.0213  
 6 Fatigue_Yes            0.426     0.188      2.27 0.0239  
 7 EarPn_Yes              0.271     0.132      2.05 0.0411  
 8 NasalCongestion_Yes   -0.231     0.132     -1.75 0.0800  
 9 ChillsSweats_Yes       0.260     0.154      1.69 0.0914  
10 Weakness_Severe        0.438     0.262      1.67 0.0948  
# … with 22 more rows
# ℹ Use `print(n = ...)` to see more rows
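
As a side note, broom's tidy() method for lm fits can also return 95% confidence intervals, which are often easier to interpret than raw p-values. A sketch:

# Same coefficient summary, with confidence intervals
temp_fit %>%
  extract_fit_parsnip() %>%
  tidy(conf.int = TRUE) %>%
  arrange(p.value)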

Use the trained workflow to predict both training and testing data

# Predicting training data and getting model metrics
predict(temp_fit, training_data)
# A tibble: 510 × 1
   .pred
   <dbl>
 1 100. 
 2  99.2
 3  98.5
 4  99.8
 5  99.7
 6 100. 
 7  99.2
 8  99.3
 9  98.6
10  98.8
# … with 500 more rows
# ℹ Use `print(n = ...)` to see more rows
temp_aug_train <- augment(temp_fit, training_data)
temp_aug_train %>%
  metrics(truth = BodyTemp, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       1.07 
2 rsq     standard       0.151
3 mae     standard       0.796
# Predicting testing data and getting model metrics
predict(temp_fit, test_data)
# A tibble: 220 × 1
   .pred
   <dbl>
 1  99.1
 2  99.2
 3  99.1
 4  99.1
 5  99.0
 6  98.4
 7  98.8
 8  99.6
 9  99.3
10  99.5
# … with 210 more rows
# ℹ Use `print(n = ...)` to see more rows
temp_aug_test <- augment(temp_fit, test_data)
temp_aug_test %>%
  metrics(truth = BodyTemp, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      1.27  
2 rsq     standard      0.0316
3 mae     standard      0.957 
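
Putting the training and test metrics side by side makes the gap easier to see; a sketch using the augmented data from above:

# Bind the two metric tables into one, labeled by data set
bind_rows(
  temp_aug_train %>% metrics(truth = BodyTemp, estimate = .pred) %>% mutate(data = "train"),
  temp_aug_test %>% metrics(truth = BodyTemp, estimate = .pred) %>% mutate(data = "test")
)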

Creating workflow and fitting model using the main predictor (RunnyNose)

set.seed(234)
temp_rec2 <- recipe(BodyTemp ~ RunnyNose, data = training_data)
# Reuse the linear regression model specification
lm_mod <- linear_reg() %>%
  set_engine("lm")
# Create the workflow
temp_workflow2 <- workflow() %>%
  add_model(lm_mod) %>%
  add_recipe(temp_rec2)
temp_workflow2
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm 
temp_fit2 <- temp_workflow2 %>%
  fit(data = training_data)
# Check the parameter estimates, arranged by p-value
temp_fit2 %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  arrange(p.value)
# A tibble: 2 × 5
  term         estimate std.error statistic p.value
  <chr>           <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)    99.1      0.0948   1046.    0     
2 RunnyNoseYes   -0.278    0.113      -2.47  0.0140

Use the trained workflow to predict both training and testing data

# Predicting training data and getting model metrics
predict(temp_fit2, training_data)
# A tibble: 510 × 1
   .pred
   <dbl>
 1  98.9
 2  98.9
 3  98.9
 4  99.1
 5  98.9
 6  98.9
 7  98.9
 8  99.1
 9  98.9
10  98.9
# … with 500 more rows
# ℹ Use `print(n = ...)` to see more rows
temp_aug_train2 <- augment(temp_fit2, training_data)
temp_aug_train2 %>%
  metrics(truth = BodyTemp, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      1.16  
2 rsq     standard      0.0118
3 mae     standard      0.834 
# Predicting testing data and getting model metrics
predict(temp_fit2, test_data)
# A tibble: 220 × 1
   .pred
   <dbl>
 1  99.1
 2  99.1
 3  98.9
 4  99.1
 5  98.9
 6  98.9
 7  98.9
 8  98.9
 9  98.9
10  99.1
# … with 210 more rows
# ℹ Use `print(n = ...)` to see more rows
temp_aug_test2 <- augment(temp_fit2, test_data)
temp_aug_test2 %>%
  metrics(truth = BodyTemp, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      1.25  
2 rsq     standard      0.0134
3 mae     standard      0.949 

Overall, the model built on all predictors fits the training data better (RMSE 1.07 vs. 1.16) but performs slightly worse on the test data (RMSE 1.27 vs. 1.25), so the additional predictors buy little out-of-sample improvement over using RunnyNose alone.
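
To read this comparison off a single table, we can stack the test-set metrics of both models (a sketch using the augmented data from above):

# Test-set RMSE: all-predictor model vs. RunnyNose-only model
bind_rows(
  temp_aug_test %>% metrics(truth = BodyTemp, estimate = .pred) %>% mutate(model = "all predictors"),
  temp_aug_test2 %>% metrics(truth = BodyTemp, estimate = .pred) %>% mutate(model = "RunnyNose only")
) %>%
  filter(.metric == "rmse")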
