Loading required package: exactci
Loading required package: ssanv
Loading required package: testthat
Attaching package: 'testthat'
The following object is masked from 'package:dplyr':
matches
The following object is masked from 'package:purrr':
is_null
The following objects are masked from 'package:readr':
edition_get, local_edition
The following object is masked from 'package:tidyr':
matches
# Checks to ensure all characters are defined at "man" and "woman"unique(ttdata$character_1_gender)
[1] "woman" "man"
unique(ttdata$character_2_gender)
[1] "man" "woman"
# Checks for any missing valuesnaniar::gg_miss_var(ttdata)
Warning: The `guide` argument in `scale_*()` cannot be `FALSE`. This was deprecated in
ggplot2 3.3.4.
ℹ Please use "none" instead.
ℹ The deprecated feature was likely used in the naniar package.
Please report the issue at <]8;;https://github.com/njtierney/naniar/issueshttps://github.com/njtierney/naniar/issues]8;;>.
According to the gg_miss_var() function in the naniar package, we see that there are no missing values in any of the variables.
We can forgo usual data repair.
##Objectives
According to the variables, we may be interested in seeing the proportion of on-screen romances that occur between heterosexual and homosexual couples presented on the big screen. We may also wonder if these relationships are typical of this data set. We also may notice that a wide age gap between these on-screen relationships.
###Questions:
####1. What proportion of movies depict same-sex relationships? Is this surprising?
####2. What happens when er compare these proportions after and before the year 2000? Is there a difference?
####3. What are the odds of having an excessively high age difference given relationship type?
###Acquiring Needed Data
dt <- ttdata %>%# Chooses only variables of interestselect(character_1_gender,character_2_gender,age_difference,release_year) %>%# Creates a new variables based on whether the genders of the two characters are the samemutate(relate_type =ifelse(character_1_gender == character_2_gender,"same","different"),# Changes the parameter to a bivariate outcome: `before 2000` and `2000 and after`release_year =ifelse(release_year >=2000,"2000 or after","before 2000"),# Changes the parameter to a bivariate outcome: `0-14` and `15+`age_difference =ifelse(age_difference <15,"0-14","15+"))
Here, we select for only the data that suits our needs. We only want to see the characters’ genders and their age differences. In order for our analyses to work, we must convert the age_difference variable into a binary outcome, so we condition the age differences of the actors into 2 categories: 0-14 and 15+. This way, we can construct a 2-way table to represent the distribution of age differences based on on-screen relationship pairs.
###1.
###Representing the Data
# Converts the data into a 2-way table representing each pairpart1 <-with(dt,table(character_1_gender,character_2_gender))# Makes the table more readablenames(dimnames(part1)) <-c("Character 1","Character 2")# Shows row and column totalspart1 %>%addmargins()
Character 2
Character 1 man woman Sum
man 12 929 941
woman 203 11 214
Sum 215 940 1155
The table above represents the the number of on-screen romances. We can see that only 12 films represent gay relationships and that 11 films represent lesbian relationships, meaning there are a total of 23 films with some kind of homosexual representation in these 1155 films. It may be helpful to examine their relative proportions:
Character 2
Character 1 man woman
man 0.01038961 0.80432900
woman 0.17575758 0.00952381
Here, we see that less than 2% of these 1155 films had homosexual representation of some sort. Using a binomial distribution, we can determine if this is an uncommon occurrence:
part1 <- part1 %>%addmargins()# Simple proportion test using 95% Confidence IntervalDescTools::BinomCI(part1[1,1]+part1[2,2],part1[3,3])
est lwr.ci upr.ci
[1,] 0.01991342 0.01330551 0.02970422
Based on the calculations above, we can say that with 95% confidence, the true proportion of films with gay representation lies inside the interval 1.33% and 2.97%. This is still a very small percentage, so we might conclude that gay representation is still very low.
But maybe more and more films have begun to prominently display gay relationships since 2000. How can we show this?
###2.
###Representing the Data
# Converts the data into a 2-way table representing on-screen relationships# stratified by release year and relationship typepart2 <-with(dt,table(relate_type,release_year))# Makes the table more readablenames(dimnames(part2)) <-c("Relationship Type","Release Year")# Shows row and column totalspart2 %>%addmargins()
Release Year
Relationship Type 2000 or after before 2000 Sum
different 743 389 1132
same 22 1 23
Sum 765 390 1155
We can choose to display the data as a table, but it can be difficult to quickly interpret the potential relationships between variables. Often, data in 2-way tables are displayed as mosaic plots for this reason.
# Constructs a mosaic plot of the datamosaicplot(data=dt, release_year ~ relate_type,main ="Distribution of On-Screen Relationships \n by Release Date and Type",xlab ="Release Year",ylab ="Relationship Type")
Now, we can easily see that heterosexual relationships are still much more common than same sex relationships, but compared to before 2000, there appear to be more same sex relationships in 2000 and after.
Now let’s see if this increased number of same sex couples represented on screen is due to some factor other than the increased number of films produced after 2000. We can test this by using a X^2 test for independence.
# Performs the continuity corrected X^2 test for independencechisq.test(part2)
Pearson's Chi-squared test with Yates' continuity correction
data: part2
X-squared = 7.7886, df = 1, p-value = 0.005258
fisher.exact(part2, midp=TRUE)
Central Fisher's Exact Test (mid-p version)
data: part2
p-value = 0.0009636
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.00415314 0.47003856
sample estimates:
odds ratio
0.08691912
Here, we use two tests for independence: the standard X^2 procedure and the Fisher’s exact test. The Fisher’s exact test is a special form of the X^2 test used when one or more of the cell counts is less than 5. Based on the results of the X^2 test displayed above, we conclude that same sex representation is not independent from release year.
# Shows the expected values used in the X^2 testchisq.test(part2)$expected
Release Year
Relationship Type 2000 or after before 2000
different 749.76623 382.233766
same 15.23377 7.766234
# Shows the residuals from the X^2 testchisq.test(part2)$res
Release Year
Relationship Type 2000 or after before 2000
different -0.2471064 0.3460848
same 1.7335779 -2.4279613
According to the residuals presented in the table above, same sex relationships showed a higher probability of occurrence after 2000 than before 2000, indicating an increase in the number of these relationships since the year 2000.
###3.
###Representing the data
# Converts the data into a 2-way table representing on-screen relationships# stratified by age difference and relationship typepart3 <-with(dt,table(relate_type,age_difference))# Makes the table more readablenames(dimnames(part3)) <-c("Relationship Type","Age Difference")# Shows row and column totalspart3 %>%addmargins()
Age Difference
Relationship Type 0-14 15+ Sum
different 832 300 1132
same 14 9 23
Sum 846 309 1155
Above, we can see the 2-way table for age difference vs. relationship type, and below is the mosaic plot.
# Constructs a mosaic plot of the datamosaicplot(data=dt, age_difference~ relate_type,main ="Distribution of On-Screen Relationships \n by Age Difference and Type",xlab ="Age Difference",ylab ="Relationship Type")
Using an odds ratio, we may be able to identify a relationship between age difference and relationship type.
epitools::oddsratio(part3)
$data
Age Difference
Relationship Type 0-14 15+ Total
different 832 300 1132
same 14 9 23
Total 846 309 1155
$measure
odds ratio with 95% C.I.
Relationship Type estimate lower upper
different 1.000000 NA NA
same 1.792952 0.7319929 4.166928
$p.value
two-sided
Relationship Type midp.exact fisher.exact chi.square
different NA NA NA
same 0.1934065 0.2314329 0.1755846
$correction
[1] FALSE
attr(,"method")
[1] "median-unbiased estimate & mid-p exact CI"
According to the analyses above, the data fail to establish an association between relationship type and actors’ age difference.