Tidy Tuesday Exercise

##Packages Used

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.1     ✔ purrr   0.3.4
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.2.1     ✔ stringr 1.4.1
✔ readr   2.1.2     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

library(exact2x2)

Loading required package: exactci
Loading required package: ssanv
Loading required package: testthat

Attaching package: 'testthat'

The following object is masked from 'package:dplyr':

    matches

The following object is masked from 'package:purrr':

    is_null

The following objects are masked from 'package:readr':

    edition_get, local_edition

The following object is masked from 'package:tidyr':

    matches

##Importing the Data

ttdata <- data.frame(tidytuesdayR::tt_load('2023-02-14')$age_gaps)

--- Compiling #TidyTuesday Information for 2023-02-14 ----

--- There is 1 file available ---

--- Starting Download ---


    Downloading file 1 of 1: `age_gaps.csv`

--- Download complete ---

##Viewing the Data

# Gives a summary of the data
glimpse(ttdata)

Rows: 1,155
Columns: 13
$ movie_name         <chr> "Harold and Maude", "Venus", "The Quiet American", …
$ release_year       <dbl> 1971, 2006, 2002, 1998, 2010, 1992, 2009, 1999, 199…
$ director           <chr> "Hal Ashby", "Roger Michell", "Phillip Noyce", "Joe…
$ age_difference     <dbl> 52, 50, 49, 45, 43, 42, 40, 39, 38, 38, 36, 36, 35,…
$ couple_number      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ actor_1_name       <chr> "Ruth Gordon", "Peter O'Toole", "Michael Caine", "D…
$ actor_2_name       <chr> "Bud Cort", "Jodie Whittaker", "Do Thi Hai Yen", "T…
$ character_1_gender <chr> "woman", "man", "man", "man", "man", "man", "man", …
$ character_2_gender <chr> "man", "woman", "woman", "woman", "man", "woman", "…
$ actor_1_birthdate  <date> 1896-10-30, 1932-08-02, 1933-03-14, 1930-09-17, 19…
$ actor_2_birthdate  <date> 1948-03-29, 1982-06-03, 1982-10-01, 1975-11-08, 19…
$ actor_1_age        <dbl> 75, 74, 69, 68, 81, 59, 62, 69, 57, 77, 59, 56, 65,…
$ actor_2_age        <dbl> 23, 24, 20, 23, 38, 17, 22, 30, 19, 39, 23, 20, 30,…

# Checks to ensure all characters are defined at "man" and "woman"
unique(ttdata$character_1_gender)

[1] "woman" "man"

unique(ttdata$character_2_gender)

[1] "man"   "woman"

# Checks for any missing values
naniar::gg_miss_var(ttdata)

Warning: The `guide` argument in `scale_*()` cannot be `FALSE`. This was deprecated in
ggplot2 3.3.4.
ℹ Please use "none" instead.
ℹ The deprecated feature was likely used in the naniar package.
  Please report the issue at <]8;;https://github.com/njtierney/naniar/issueshttps://github.com/njtierney/naniar/issues]8;;>.

According to the gg_miss_var() function in the naniar package, we see that there are no missing values in any of the variables.
We can forgo usual data repair.

##Objectives
According to the variables, we may be interested in seeing the proportion of on-screen romances that occur between heterosexual and homosexual couples presented on the big screen. We may also wonder if these relationships are typical of this data set. We also may notice that a wide age gap between these on-screen relationships.

###Questions:

####1. What proportion of movies depict same-sex relationships? Is this surprising?

####2. What happens when er compare these proportions after and before the year 2000? Is there a difference?

####3. What are the odds of having an excessively high age difference given relationship type?

###Acquiring Needed Data

dt <- ttdata %>%
# Chooses only variables of interest
                  select(character_1_gender,character_2_gender,age_difference,release_year) %>%
# Creates a new variables based on whether the genders of the two characters are the same
                  mutate(relate_type = ifelse(character_1_gender == character_2_gender,"same","different"),
# Changes the parameter to a bivariate outcome: `before 2000` and `2000 and after`
                         release_year = ifelse(release_year >= 2000,
                                               "2000 or after",
                                               "before 2000"),
# Changes the parameter to a bivariate outcome: `0-14` and `15+`
                         age_difference = ifelse(age_difference < 15,
                                                 "0-14",
                                                 "15+"))

Here, we select for only the data that suits our needs. We only want to see the characters’ genders and their age differences. In order for our analyses to work, we must convert the age_difference variable into a binary outcome, so we condition the age differences of the actors into 2 categories: 0-14 and 15+. This way, we can construct a 2-way table to represent the distribution of age differences based on on-screen relationship pairs.

###1.
###Representing the Data

# Converts the data into a 2-way table representing each pair
part1 <- with(dt,table(character_1_gender,character_2_gender))
# Makes the table more readable
names(dimnames(part1)) <- c("Character 1","Character 2")
# Shows row and column totals
part1 %>% addmargins()

           Character 2
Character 1  man woman  Sum
      man     12   929  941
      woman  203    11  214
      Sum    215   940 1155

The table above represents the the number of on-screen romances. We can see that only 12 films represent gay relationships and that 11 films represent lesbian relationships, meaning there are a total of 23 films with some kind of homosexual representation in these 1155 films. It may be helpful to examine their relative proportions:

# Shows cell proportions
part1_props <- prop.table(part1)
part1_props

           Character 2
Character 1        man      woman
      man   0.01038961 0.80432900
      woman 0.17575758 0.00952381

Here, we see that less than 2% of these 1155 films had homosexual representation of some sort. Using a binomial distribution, we can determine if this is an uncommon occurrence:

part1 <- part1 %>% addmargins()
# Simple proportion test using 95% Confidence Interval
DescTools::BinomCI(part1[1,1]+part1[2,2],part1[3,3])

            est     lwr.ci     upr.ci
[1,] 0.01991342 0.01330551 0.02970422

Based on the calculations above, we can say that with 95% confidence, the true proportion of films with gay representation lies inside the interval 1.33% and 2.97%. This is still a very small percentage, so we might conclude that gay representation is still very low.

But maybe more and more films have begun to prominently display gay relationships since 2000. How can we show this?

###2.
###Representing the Data

# Converts the data into a 2-way table representing on-screen relationships
# stratified by release year and relationship type
part2 <- with(dt,table(relate_type,release_year))
# Makes the table more readable
names(dimnames(part2)) <- c("Relationship Type","Release Year")
# Shows row and column totals
part2 %>% addmargins()

                 Release Year
Relationship Type 2000 or after before 2000  Sum
        different           743         389 1132
        same                 22           1   23
        Sum                 765         390 1155

We can choose to display the data as a table, but it can be difficult to quickly interpret the potential relationships between variables. Often, data in 2-way tables are displayed as mosaic plots for this reason.

# Constructs a mosaic plot of the data
mosaicplot(data=dt, release_year ~ relate_type,
           main = "Distribution of On-Screen Relationships \n by Release Date and Type",
           xlab = "Release Year",
           ylab = "Relationship Type")

Now, we can easily see that heterosexual relationships are still much more common than same sex relationships, but compared to before 2000, there appear to be more same sex relationships in 2000 and after.

Now let’s see if this increased number of same sex couples represented on screen is due to some factor other than the increased number of films produced after 2000. We can test this by using a X^2 test for independence.

# Performs the continuity corrected X^2 test for independence
chisq.test(part2)


    Pearson's Chi-squared test with Yates' continuity correction

data:  part2
X-squared = 7.7886, df = 1, p-value = 0.005258

fisher.exact(part2, midp=TRUE)


    Central Fisher's Exact Test (mid-p version)

data:  part2
p-value = 0.0009636
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.00415314 0.47003856
sample estimates:
odds ratio 
0.08691912

Here, we use two tests for independence: the standard X^2 procedure and the Fisher’s exact test. The Fisher’s exact test is a special form of the X^2 test used when one or more of the cell counts is less than 5. Based on the results of the X^2 test displayed above, we conclude that same sex representation is not independent from release year.

# Shows the expected values used in the X^2 test
chisq.test(part2)$expected

                 Release Year
Relationship Type 2000 or after before 2000
        different     749.76623  382.233766
        same           15.23377    7.766234

# Shows the residuals from the X^2 test
chisq.test(part2)$res

                 Release Year
Relationship Type 2000 or after before 2000
        different    -0.2471064   0.3460848
        same          1.7335779  -2.4279613

According to the residuals presented in the table above, same sex relationships showed a higher probability of occurrence after 2000 than before 2000, indicating an increase in the number of these relationships since the year 2000.

###3.
###Representing the data

# Converts the data into a 2-way table representing on-screen relationships
# stratified by age difference and relationship type
part3 <- with(dt,table(relate_type,age_difference))
# Makes the table more readable
names(dimnames(part3)) <- c("Relationship Type","Age Difference")
# Shows row and column totals
part3 %>% addmargins()

                 Age Difference
Relationship Type 0-14  15+  Sum
        different  832  300 1132
        same        14    9   23
        Sum        846  309 1155

Above, we can see the 2-way table for age difference vs. relationship type, and below is the mosaic plot.

# Constructs a mosaic plot of the data
mosaicplot(data=dt, age_difference~ relate_type,
           main = "Distribution of On-Screen Relationships \n by Age Difference and Type",
           xlab = "Age Difference",
           ylab = "Relationship Type")

Using an odds ratio, we may be able to identify a relationship between age difference and relationship type.

epitools::oddsratio(part3)

$data
                 Age Difference
Relationship Type 0-14 15+ Total
        different  832 300  1132
        same        14   9    23
        Total      846 309  1155

$measure
                 odds ratio with 95% C.I.
Relationship Type estimate     lower    upper
        different 1.000000        NA       NA
        same      1.792952 0.7319929 4.166928

$p.value
                 two-sided
Relationship Type midp.exact fisher.exact chi.square
        different         NA           NA         NA
        same       0.1934065    0.2314329  0.1755846

$correction
[1] FALSE

attr(,"method")
[1] "median-unbiased estimate & mid-p exact CI"

According to the analyses above, the data fail to establish an association between relationship type and actors’ age difference.