15 Data Analysis
15.1 Introduction
Data analysis plays a central role in communication and media research. With an ever-increasing volume of data generated from diverse sources, researchers need robust techniques to sift through this raw information and discern meaningful patterns or trends. This section outlines the objectives of data analysis, categorizes the major types of statistical analyses commonly used, and discusses the pivotal role of statistical software such as RStudio in facilitating the analytical process.
Objectives and Importance of Data Analysis
The fundamental aim of data analysis is to distill large amounts of information into actionable insights. In the context of communication and media research, these objectives can be further refined as follows:
Summarization: To present data in a digestible format, offering a clear snapshot of its main features.
Exploration: To identify relationships or trends within the data, providing a basis for further investigation.
Inference: To make educated guesses or predictions about a broader population based on sample data.
Validation: To confirm or negate existing theories or hypotheses through empirical evidence.
Decision-Making: To provide actionable recommendations and insights that may influence policy, strategy, or further academic research.
Ignoring proper data analysis can lead to misleading conclusions or partial understandings, affecting the quality and reliability of the research. Thus, the importance of meticulous data analysis cannot be overstated.
Types of Analyses: Descriptive vs. Inferential
Inferential analysis is distinct from descriptive analysis in its ambition to extend findings from a sample to a broader population. It uses probability theory to estimate and make predictions about population parameters, whereas descriptive analysis is confined to the dataset in hand.
Descriptive Analysis
In descriptive statistics, the aim is to summarize the main aspects of the data in hand, often through tables, graphs, or numerical measures such as mean, median, and standard deviation. Descriptive analysis provides a compact representation of the data, but it does not allow researchers to make conclusions beyond the data at hand (Tukey, 1977).
Inferential Analysis
In contrast, inferential statistics go a step further by enabling researchers to draw conclusions about a population based on a sample. Inferential methods like t-tests, ANOVA, and regression allow one to assess hypotheses and derive estimates that are generalizable to a broader context (Cohen, 1988).
Role of Statistical Software (e.g., RStudio)
In the current digital age, statistical software has become an indispensable tool for data analysis. RStudio is one such environment that offers a wide array of statistical and graphical techniques. It is especially favored for its:
- User-Friendly Interface: RStudio provides a clean and efficient interface for executing R code, thereby easing the process of data analysis.
- Flexibility and Adaptability: It supports various data formats and can be integrated with other software and programming languages.
- Extensive Libraries: With a rich ecosystem of packages like ggplot2 for data visualization, dplyr for data manipulation, and caret for machine learning, RStudio offers comprehensive analytical capabilities.
- Reproducibility: The code-based nature of RStudio ensures that analyses can be easily documented and reproduced, adhering to the tenets of reliable scientific research (Peng, 2011).
By mastering RStudio or similar statistical software, researchers are better equipped to conduct complex analyses that can contribute to robust and insightful findings.
15.2 Descriptive Analysis
Descriptive statistics form the bedrock of data exploration and initial data analysis. Descriptive analysis plays a pivotal role in data analysis by concisely summarizing the key characteristics of a dataset. It involves calculating various statistics to present a snapshot of the data, enabling researchers to understand its basic structure and form. These statistics facilitate the comprehensive summarization, condensation, and general understanding of the structural attributes of expansive datasets (De Veaux, Velleman, & Bock, 2018). Employed as a precursor to more advanced statistical procedures, descriptive statistics offer a straightforward way to describe the main aspects of a data set, from the typical values to the variability within the set. They provide researchers with tools to quickly identify patterns, trends, and potential outliers without making generalized predictions about larger populations (Boslaugh, 2012). Furthermore, descriptive statistics are essential in exploratory data analysis, where their role is to aid in the detection of any unusual observations that may warrant further investigation (Tukey, 1977).
Moreover, descriptive statistics have applications that span across various domains—from social sciences to economics, from healthcare to engineering. The utility lies in their ability to translate large amounts of data into easily understandable formats, such as graphs, tables, and numerical measures, thereby transforming raw data into insightful information. In research, they often serve as the initial step in the process of data analytics, giving researchers a snapshot of what the data looks like before delving into more complex analytical techniques like inferential statistics or machine learning algorithms (Hair et al., 2014).
If a researcher’s interest lies in examining how variables change together without intending to make predictive inferences, they should utilize descriptive correlational analysis. This type of analysis explores the relationship between variables using correlation coefficients, without extending to prediction.
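As a quick sketch of descriptive correlation in R, the cor() function returns the correlation coefficient directly; the two variables and their values below are invented for illustration, not taken from the chapter's datasets:

```r
# Hypothetical example: how do two variables move together?
hours_online  <- c(2, 4, 1, 5, 3, 6, 2, 7)   # hours spent online per day
news_exposure <- c(3, 5, 2, 6, 4, 8, 3, 9)   # news items encountered per day

# Pearson correlation coefficient, bounded between -1 and 1;
# it describes the association without predicting anything
cor(hours_online, news_exposure)
```

A value near 1 or -1 indicates a strong linear association; the analysis stops at description and makes no inference beyond these observations.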
Measures of Central Tendency
To capture the central tendency or the “average” experience within a set of data, calculating the mean is most appropriate. The mean provides a single value summarizing the central point of a dataset’s distribution.
Load data
# Load the packages
library(tidyverse)
library(data.table)
options(scipen = 999)
# Import the datasets
spotify_songs <- fread("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv")
movies <- fread("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-03-09/movies.csv")
Mean
The mean is perhaps the most widely recognized measure of central tendency, representing the arithmetic average of a dataset. In descriptive analysis, the mean serves as a fundamental measure, providing an average value that represents the central tendency of a dataset. This average is calculated by summing all observations and dividing by the number of observations. The mean is sensitive to outliers, which can disproportionately influence the calculated average, potentially resulting in a misleading representation of central location (McClave, Benson, & Sincich, 2011). Despite this limitation, the mean is highly useful in various statistical methods, including regression analysis and hypothesis testing, because of its mathematical properties (Field, Miles, & Field, 2012).
Importantly, the mean can be categorized into different types: arithmetic mean, geometric mean, and harmonic mean, each with specific applications depending on the nature of the data and the intended analysis (Triola, 2018). For instance, the geometric mean is often used when dealing with data that exhibit exponential growth or decline, such as in financial or biological contexts (Cox & Snell, 1981).
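All three variants are one-liners in base R; since R has no built-in geometric or harmonic mean function, the standard formulas are written out below (the data values are invented for illustration):

```r
x <- c(2, 8, 4, 16)                    # illustrative positive values

arithmetic <- mean(x)                  # sum(x) / length(x)
geometric  <- exp(mean(log(x)))        # nth root of the product of the values
harmonic   <- 1 / mean(1 / x)          # reciprocal of the mean of reciprocals

# For positive data, harmonic <= geometric <= arithmetic always holds
c(arithmetic = arithmetic, geometric = geometric, harmonic = harmonic)
```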
Descriptive statistics are most commonly paired with visualizations to provide clarity. For example, a scatterplot is an invaluable tool in descriptive analysis when the objective is to illustrate the relationship or correlation between two variables. It visually represents the data points for each observed pair, facilitating the detection of patterns or relationships.
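As one illustration with the chapter's Spotify data, a ggplot2 scatterplot can place danceability against track popularity; this particular pairing is our choice for the sketch, not one the chapter prescribes:

```r
library(tidyverse)
library(data.table)

spotify_songs <- fread("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv")

# Each point is one track; patterns (or their absence) are visible at a glance
ggplot(spotify_songs, aes(x = danceability, y = track_popularity)) +
  geom_point(alpha = 0.2) +
  labs(x = "Danceability", y = "Track popularity")
```

The alpha transparency keeps dense regions readable when many thousands of points overlap.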
Example using Spotify Songs Dataset: To find the mean popularity of songs.
The R code provided demonstrates the use of the dplyr package and base R functions to calculate the mean popularity of tracks in the spotify_songs dataset. Let's break down the code and its output:
- dplyr summarise function:
mean_popularity <- spotify_songs %>%
  summarise(mean_popularity = mean(track_popularity, na.rm = TRUE))
This snippet uses the dplyr package's summarise function to calculate the mean of the track_popularity variable in the spotify_songs dataframe. The mean function is used with the na.rm = TRUE argument, which means that it will ignore NA (missing) values in the calculation. The result is stored in a new dataframe, mean_popularity.
- Output Explanation:
mean_popularity
## mean_popularity
## 1 42.47708
This output indicates that the mean popularity score of the tracks in the dataset is approximately 42.47708. The <dbl> notation suggests that the mean popularity score is a double-precision floating-point number, which is a common way of representing decimal numbers in R.
In summary, both methods are used to calculate the average popularity score of tracks in the spotify_songs dataset. The output shows the mean value as approximately 42.47708, reflecting the average popularity of the tracks in the dataset. The use of dplyr and base R functions provides a means to cross-validate the result for accuracy.
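The base R counterpart mentioned here is not shown in the text; it would presumably be a single call along these lines (a reconstruction, not the original snippet):

```r
library(data.table)

spotify_songs <- fread("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv")

# Base R cross-check of the dplyr result (~42.47708 per the output above)
mean(spotify_songs$track_popularity, na.rm = TRUE)
```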
Median
The median serves as another measure of central tendency and is less sensitive to outliers compared to the mean (Lind et al., 2012). It is defined as the middle value in a dataset that has been arranged in ascending order. If the dataset contains an even number of observations, the median is calculated as the average of the two middle numbers. Medians are particularly useful for data that are skewed or contain outliers, as they provide a more “resistant” measure of the data’s central location (Hoaglin, Mosteller, & Tukey, 2000).
In addition to its robustness against outliers, the median is often used in nonparametric statistical tests like the Mann-Whitney U test and the Kruskal-Wallis test. These tests do not assume that the data follow a specific distribution, making the median an invaluable asset in such scenarios (Siegel & Castellan, 1988).
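The odd-versus-even rule is easy to verify directly in R (toy values, invented for the example):

```r
x_odd  <- c(9, 1, 5, 3, 7)   # sorted: 1 3 5 7 9 -> single middle value
x_even <- c(1, 3, 5, 7)      # sorted: 1 3 5 7   -> average of 3 and 5

median(x_odd)    # 5
median(x_even)   # 4
```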
Example using Movies Dataset: To find the median budget of movies.
The provided R code calculates the median budget of movies in the movies dataset, with two different approaches, and the results are displayed. Let's analyze the code and its outputs:
- Using dplyr's summarise function:
This snippet uses the dplyr package's summarise function to compute the median of the budget variable in the movies dataframe. Before calculating the median, each budget value is divided by 1,000,000 (budget/1000000), effectively converting the budget values from (presumably) dollars to millions of dollars. The na.rm = TRUE argument in the median function indicates that any NA (missing) values should be ignored in the calculation. The result is stored in a new dataframe called median_budget.
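The snippet this explanation walks through does not survive in the text; based on the description, it was presumably along these lines:

```r
library(tidyverse)
library(data.table)

movies <- fread("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-03-09/movies.csv")

# Median budget, converted to millions of dollars, ignoring missing values
median_budget <- movies %>%
  summarise(median_budget = median(budget/1000000, na.rm = TRUE))
```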
- Output Explanation:
median_budget
## median_budget
## 1 28
This indicates that the median budget of the movies, in millions of dollars, is 28. The <dbl> notation signifies that the median budget is a double-precision floating-point number.
In conclusion, both methods are used to calculate the median budget of movies in the dataset, and both approaches confirm that the median budget is 28 million dollars. The use of both dplyr and base R functions serves as a cross-verification to ensure the accuracy of the result.
Mode
The mode refers to the value or values that appear most frequently in a dataset (Gravetter & Wallnau, 2016). A dataset can be unimodal, having one mode; bimodal, having two modes; or multimodal, having multiple modes. While the mode is less commonly used than the mean and median for numerical data, it is the primary measure of central tendency for categorical or nominal data (Agresti, 2002).
Despite its less frequent application in numerical contexts, the mode can still be useful for identifying the most common values in a dataset and for understanding the general distribution of the data (Bland & Altman, 1996). For example, in market research, knowing the mode of a dataset on consumer preferences can provide valuable insight into what most consumers are likely to choose.
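Note that base R's mode() reports an object's storage type rather than the statistical mode, so analysts either use a package such as DescTools (as the next example does) or define a small helper; the function below is a common idiom, not code from the chapter:

```r
# Minimal mode helper: returns the most frequent value(s), ignoring NAs
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  counts <- table(x)
  names(counts)[counts == max(counts)]
}

stat_mode(c("pop", "edm", "edm", "rock", NA))   # "edm"
```

Returning all names tied for the maximum count means the helper also handles bimodal and multimodal data.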
Example using Spotify Songs Dataset: To find the mode of the playlist_genre variable.
The provided R code calculates the mode of the playlist_genre variable in the spotify_songs dataset using the Mode function from the DescTools package. The mode is the value that appears most frequently in a dataset. Let's break down the code and its output:
- Using the DescTools package's Mode function:
library(DescTools)
##
## Attaching package: 'DescTools'
## The following object is masked from 'package:data.table':
##
##     %like%
mode_genre <- Mode(spotify_songs$playlist_genre)
This snippet uses the Mode function from the DescTools package to find the most frequently occurring genre in the playlist_genre column of the spotify_songs dataset. The result is stored in the variable mode_genre.
- Output Explanation:
mode_genre
## [1] "edm"
## attr(,"freq")
## [1] 6043
This output indicates that the most common genre (mode) in the playlist_genre column is "edm". The attr(,"freq") attribute shows the frequency of this mode, which is 6043. This means that "edm" appears 6043 times in the playlist_genre column, more than any other genre.
In summary, the code calculates and displays the mode of the playlist_genre variable in the spotify_songs dataset, indicating that the most common genre is "edm", which appears 6043 times.
Measures of Dispersion
Range
The range is the simplest measure of dispersion, calculated by subtracting the smallest value from the largest value in the dataset (McClave, Benson, & Sincich, 2011). While straightforward to compute, the range is highly sensitive to outliers and does not account for how the rest of the values in the dataset are distributed (Triola, 2018).
The range offers a quick, albeit crude, estimate of the dataset’s variability. It is often used in conjunction with other measures of dispersion for a more comprehensive understanding of data spread. Despite its limitations, the range can be helpful in initial exploratory analyses to quickly identify the scope of the data and to detect possible outliers or data entry errors (Tukey, 1977).
Example using Movies Dataset: To find the range of movie budgets.
The R code provided calculates the range of the budget column in the movies dataset using the dplyr package. The range is a measure of dispersion that represents the difference between the maximum and minimum values in a dataset. Here's a breakdown of the code and its output:
- Code Explanation:
budget_range <- movies %>%
  summarise(Range = max(budget/1000000, na.rm = TRUE) -
                    min(budget/1000000, na.rm = TRUE))
- movies %>%: This part indicates that the code is using the movies dataframe and piping (%>%) it into subsequent operations.
- summarise(Range = ...): The summarise function from the dplyr package is used to compute a summary statistic. Here, it's creating a new variable named Range.
- max(budget/1000000, na.rm = TRUE) - min(budget/1000000, na.rm = TRUE): This calculates the range of the movie budgets. Each budget value is first divided by 1,000,000 (presumably converting the budget from dollars to millions of dollars). The max function finds the maximum value and min finds the minimum value, with na.rm = TRUE indicating that any NA (missing) values should be ignored. The range is the difference between these two values.
- Output Explanation:
budget_range
## Range
## 1 424.993
- The output shows that the calculated range of the movie budgets, in millions of dollars, is approximately 424.993. This means that the largest budget in the dataset exceeds the smallest budget by about 424.993 million dollars.
- The <dbl> notation indicates that the calculated range is a double-precision floating-point number, a standard numeric type in R for representing decimal values.
In summary, the code calculates the range of movie budgets in the movies dataset and finds that the budgets span approximately 424.993 million dollars, from the smallest to the largest. This provides a sense of how varied the movie budgets are in the dataset.
Standard Deviation
The standard deviation is a more sophisticated measure of dispersion that indicates how much individual data points deviate from the mean (Lind et al., 2012). Standard deviation is a measure in descriptive analysis that quantifies the variation or dispersion of a set of data values. It reflects how much individual data points differ from the mean, indicating the dataset’s spread. Calculated as the square root of the variance, the standard deviation provides an intuitive sense of the data’s spread since it is in the same unit as the original data points. It plays a crucial role in various statistical analyses, including hypothesis testing and confidence interval estimation, and is fundamental in fields ranging from finance to natural sciences (Levine, Stephan, Krehbiel, & Berenson, 2008).
The standard deviation can be classified into two types: population standard deviation and sample standard deviation. The former is used when the data represent an entire population, while the latter is used for sample data and is calculated with a slight adjustment to account for sample bias (Kenney & Keeping, 1962).
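This distinction surfaces directly in R: sd() returns the sample standard deviation (dividing by n - 1), while the population version has no base function and must be written out, as in this sketch on invented data:

```r
x <- c(4, 8, 6, 5, 3)                               # illustrative data
n <- length(x)

sample_sd     <- sd(x)                              # divides by n - 1
population_sd <- sqrt(sum((x - mean(x))^2) / n)     # divides by n

# The population figure is always the smaller of the two
c(sample = sample_sd, population = population_sd)
```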
Example using Spotify Songs Dataset: To find the standard deviation of danceability.
The provided R code calculates the standard deviation of the danceability variable in the spotify_songs dataset using the dplyr package. Let's break down the code and its output:
- Code Explanation:
- spotify_songs %>%: This part uses the spotify_songs dataframe and pipes it into the subsequent operation using %>%.
- summarise(std_danceability = ...): The summarise function from dplyr is used to compute a summary statistic. Here, it's creating a new variable named std_danceability.
- sd(danceability, na.rm = TRUE): This calculates the standard deviation of the danceability variable. The sd function computes the standard deviation, and na.rm = TRUE indicates that any NA (missing) values should be ignored in the calculation.
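The snippet being explained is missing from the text; from the description, it was presumably:

```r
library(tidyverse)
library(data.table)

spotify_songs <- fread("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv")

# Standard deviation of danceability, ignoring missing values
std_danceability <- spotify_songs %>%
  summarise(std_danceability = sd(danceability, na.rm = TRUE))
```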
- Output Explanation:
std_danceability
## std_danceability
## 1 0.1450853
- The output shows that the calculated standard deviation of the danceability scores in the spotify_songs dataset is approximately 0.1450853.
- The <dbl> notation indicates that the result is a double-precision floating-point number, which is typical for numeric calculations in R.
The standard deviation is a measure of the amount of variation or dispersion in a set of values. A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the values are spread out over a wider range.
In this case, a standard deviation of approximately 0.1450853 for danceability suggests that the danceability scores in the spotify_songs dataset vary moderately around the mean. This gives an idea of the variability in danceability among the songs in the dataset.
Variance
Variance is closely related to the standard deviation, essentially being its square. It quantifies how much individual data points in a dataset differ from the mean (Gravetter & Wallnau, 2016). Unlike the standard deviation, the variance is not in the same unit as the data, which can make it less intuitive to interpret. However, variance has essential mathematical properties that make it useful in statistical modeling and hypothesis testing (Moore, McCabe, & Craig, 2009).
In statistical theory, the concept of variance is pivotal for various analytical techniques, such as Analysis of Variance (ANOVA) and Principal Component Analysis (PCA). Variance allows for the decomposition of data into explained and unexplained components, serving as a key element in understanding data variability in greater depth (Johnson & Wichern, 2007).
Example using Movies Dataset: To find the variance in IMDB ratings.
The provided R code calculates the variance of the imdb_rating variable in the movies dataset using the dplyr package. Let's examine the code and its output:
- Code Explanation:
- movies %>%: This line uses the movies dataframe and pipes it into the following operation with %>%.
- summarise(var_imdb_rating = ...): The summarise function from dplyr is employed to compute a summary statistic, in this case creating a new variable called var_imdb_rating.
- var(imdb_rating, na.rm = TRUE): This computes the variance of the imdb_rating variable. The var function calculates the variance, and na.rm = TRUE indicates that any NA (missing) values should be excluded from the calculation.
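The snippet being explained does not appear in the text; based on the description, it was presumably:

```r
library(tidyverse)
library(data.table)

movies <- fread("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-03-09/movies.csv")

# Variance of IMDb ratings, excluding missing values
var_imdb_rating <- movies %>%
  summarise(var_imdb_rating = var(imdb_rating, na.rm = TRUE))
```

As a consistency check, this value is the square of the standard deviation skim() reports for the same variable later in the chapter (0.9627823^2 is approximately 0.92695).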
- Output Explanation:
var_imdb_rating
## var_imdb_rating
## 1 0.9269498
- The output indicates that the variance of the IMDb ratings in the movies dataset is approximately 0.9269498.
- The <dbl> notation signifies that the result is a double-precision floating-point number, which is a standard numeric format in R.
Variance is a statistical measure that describes the spread of numbers in a data set. More specifically, it measures how far each number in the set is from the mean and thus from every other number in the set. In this context, a variance of approximately 0.9269498 in IMDb ratings suggests the degree to which these ratings vary from their average value in the dataset.
This measure of variance can be particularly useful for understanding the consistency of movie ratings; a lower variance would indicate that the ratings are generally close to the mean, suggesting agreement among raters, whereas a higher variance would imply more diverse opinions on movie ratings.
General Summary
There are also a couple of methods for obtaining multiple basic descriptive statistics with a single command. The most common of these is the summary() function. There is also a package called skimr.
summary()
The R code snippet provided uses the summary() function to generate descriptive statistics for the imdb_rating variable in the movies dataset. The summary() function in R provides the five-number summary of the given data plus the mean, along with the count of NA (missing) values. Let's break down the output:
summary(movies$imdb_rating)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.10 6.20 6.80 6.76 7.40 9.30 202

- Min. (Minimum): The smallest value in the imdb_rating data. Here, the minimum IMDb rating is 2.10.
- 1st Qu. (First Quartile): Also known as the lower quartile, it is the median of the lower half of the dataset. This value is 6.20, meaning 25% of the ratings are below this value.
- Median: The middle value when the data is sorted in ascending order. The median IMDb rating is 6.80, indicating that half of the movies have a rating below 6.80 and the other half have a rating above 6.80.
- Mean: The average of the imdb_rating values, calculated as the sum of all ratings divided by the number of non-missing ratings. The mean rating is 6.76.
- 3rd Qu. (Third Quartile): Also known as the upper quartile, it is the median of the upper half of the dataset. Here, 75% of the movies have a rating below 7.40.
- Max. (Maximum): The largest value in the imdb_rating data. The highest IMDb rating in the dataset is 9.30.
- NA's: The number of missing values in the imdb_rating data. There are 202 missing values.
This summary provides a comprehensive view of the distribution of IMDb ratings in the movies dataset, including the central tendency (mean, median), spread (minimum, first quartile, third quartile, maximum), and the count of missing values. It helps in understanding the overall rating landscape of the movies in the dataset.
skimr
The R code snippet provided uses the skim() function from the skimr package to generate a summary of the imdb_rating variable from the movies dataset. The skimr package provides a more detailed summary than the base R summary() function, which is particularly useful for initial exploratory data analysis.
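The call that produces the summary below is not shown in the text; it was presumably:

```r
library(skimr)
library(data.table)

movies <- fread("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-03-09/movies.csv")

# One-line overview of a single variable: completeness, moments, percentiles
skim(movies$imdb_rating)
```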
Name: movies$imdb_rating
Number of rows: 1794
Number of columns: 1
Column type frequency: numeric (1)
Group variables: None

Variable type: numeric

skim_variable  n_missing  complete_rate  mean  sd    p0   p25  p50  p75  p100  hist
data           202        0.89           6.76  0.96  2.1  6.2  6.8  7.4  9.3   ▁▁▅▇▂
Let’s break down the output:

- Data Summary Section:
  - Name: Identifies the data being summarized, here movies$imdb_rating.
  - Number of rows: Indicates the total number of entries in the dataset, which is 1794 for imdb_rating.
  - Number of columns: The number of variables or columns in the data being skimmed. Since skim() is applied to a single column, this is 1.
  - Column type frequency: Shows the types of data present in the columns. Here, there is 1 numeric column.
- Detailed Statistics Section:
  - skim_variable: A character representation of the variable being summarized.
  - n_missing: The number of missing (NA) values in the dataset. Here, there are 202 missing ratings.
  - complete_rate: Proportion of non-missing values, calculated as (Total Number of rows - n_missing) / Total Number of rows. For imdb_rating, it's approximately 0.8874025.
  - mean: The average of the imdb_rating values, which is 6.760113.
  - sd (standard deviation): Measures the amount of variation or dispersion in imdb_rating. Here, it is 0.9627823.
  - p0, p25, p50, p75, p100: These represent the percentiles of the data:
    - p0: The minimum value (0th percentile), which is 2.1.
    - p25: The 25th percentile, meaning 25% of the data fall below this value, which is 6.2.
    - p50: The median or 50th percentile, which is 6.8.
    - p75: The 75th percentile, meaning 75% of the data fall below this value, which is 7.4.
    - p100: The maximum value (100th percentile), which is 9.3.
  - hist: A text-based histogram providing a visual representation of the distribution of imdb_rating. The characters (▁▁▅▇▂) represent different frequency bins.
In summary, the skim() function output provides a detailed statistical summary of the imdb_rating variable, including measures of central tendency, dispersion, and data completeness, along with a visual histogram for quick assessment of the data distribution. This information is crucial for understanding the characteristics of the IMDb ratings in the movies dataset, especially when preparing for more detailed data analysis.
15.3 Inferential Analysis
Inferential analysis is a cornerstone of statistical research, empowering researchers to draw conclusions and make predictions about a larger population based on the analysis of a representative sample. This process involves statistical models and tests that go beyond the descriptive statistics of the immediate dataset. Unlike descriptive statistics, which aim to summarize data, inferential statistics allow for hypothesis testing, predictions, and inferences about the data (Field, Miles, & Field, 2012). The utility of inferential statistics lies in its ability to generalize findings beyond the immediate data to broader contexts. This is particularly valuable in research areas where it's impractical to collect data from an entire population (Frankfort-Nachmias, Leon-Guerrero, & Davis, 2020). When a researcher uses sample data to infer characteristics about a larger population, they engage in inferential statistical analysis. This process allows for the generalization of results from the sample to the population, within certain confidence levels.
The application of inferential statistics often involves the use of various tests and models to determine statistical significance, which in turn helps researchers make meaningful inferences. Such analyses are commonly used in disciplines like psychology, economics, and medicine, to name a few. They provide a quantitative basis for conclusions and decisions, which is fundamental for scientific research (Rosner, 2015). Given the capacity to test theories and hypotheses, inferential statistics remain an indispensable tool in the scientific community.
Comparison of Means
T-test
The t-test is a statistical method used to determine whether there is a significant difference between the means of two groups. It is commonly used to compare two samples to determine whether they could have originated from the same population (Rosner, 2015). The t-test operates under certain assumptions, such as the data being normally distributed and the samples being independent of each other; violation of these assumptions may lead to misleading results.
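These assumption choices surface directly in R: t.test() defaults to the Welch variant, which drops the equal-variance assumption, while var.equal = TRUE requests the classic pooled test. A sketch on simulated data (values invented for illustration):

```r
set.seed(1)
group_a <- rnorm(30, mean = 50, sd = 10)   # simulated scores for group A
group_b <- rnorm(30, mean = 55, sd = 10)   # simulated scores for group B

welch  <- t.test(group_a, group_b)                     # Welch (default)
pooled <- t.test(group_a, group_b, var.equal = TRUE)   # pooled-variance t-test

c(welch_p = welch$p.value, pooled_p = pooled$p.value)
```

With similar group variances the two versions give nearly identical p-values; they diverge as the variances (or sample sizes) become unequal.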
Example with the movies dataset:
The provided R code performs a Welch Two Sample t-test to compare the mean budgets of action and drama movies in the movies dataset. The Welch t-test is used to test the hypothesis that two populations (in this case, action and drama movies) have equal means. This test is appropriate when the two samples have possibly unequal variances.
# Calculate the mean budget for action and drama movies
action_movies <- movies %>% filter(genre == 'Action')
drama_movies <- movies %>% filter(genre == 'Drama')
# Perform t-test
t.test(action_movies$budget, drama_movies$budget)
##
## Welch Two Sample t-test
##
## data: action_movies$budget and drama_movies$budget
## t = -1.5346, df = 1.2327, p-value = 0.3325
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -76461080 52430636
## sample estimates:
## mean of x mean of y
## 7570000 19585222
Let’s analyze the output:

- Test Description:
  - Welch Two Sample t-test: Indicates the type of t-test conducted. The Welch test does not assume equal variances across the two samples.
- Data Description:
  - data: Specifies the datasets being compared, namely the budget of action_movies and drama_movies.

- Test Statistics:
  - t = -1.5346: The calculated t-statistic value. The sign of the t-statistic indicates the direction of the difference between the means (negative here suggests that the mean budget of action movies might be less than that of drama movies).
  - df = 1.2327: Degrees of freedom for the test. This value is calculated based on the sample sizes and variances of the two groups and is a key component in determining the critical value for the test.
  - p-value = 0.3325: The probability of observing a test statistic as extreme as, or more extreme than, the observed value under the null hypothesis. A higher p-value (typically > 0.05) suggests that the observed data is consistent with the null hypothesis, which in this test is that there is no difference in the means of the two groups.

- Hypothesis Testing:
  - alternative hypothesis: States the hypothesis being tested. Here, it tests whether the true difference in means is not equal to 0, i.e., whether the average budgets of action and drama movies are significantly different.
  - 95 percent confidence interval: This interval estimates the range of the true difference in means between the two groups. It ranges from approximately -76,461,080 to 52,430,636. Since this interval includes 0, it suggests that the difference in means might not be statistically significant.

- Sample Estimates:
  - mean of x (action movies): The mean budget of action movies, approximately 7,570,000.
  - mean of y (drama movies): The mean budget of drama movies, approximately 19,585,222.
In summary, the Welch t-test's output indicates that there is not a statistically significant difference in the mean budgets of action and drama movies in the dataset, as evidenced by a p-value greater than 0.05 and a confidence interval that includes 0. The sample estimates provide the average budgets for each movie genre, which can be useful for descriptive purposes.
Independent Sample T-test
An independent sample t-test is used when comparing the means of two independent groups to assess whether their means are statistically different (Field et al., 2012). The groups should be separate, meaning the performance or attributes of one group should not influence the other. For instance, this type of t-test might be used to compare the average test scores of two different classrooms. It is essential to note that both groups should be normally distributed and, ideally, have the same variance for the classic t-test to be applicable.
Example with Survivor summary.csv and viewers.csv:
The provided R code performs a Welch Two Sample t-test to compare the average viewership (viewers_mean) of TV seasons that took place in Fiji with those that took place in other locations. This test is conducted using data from the summary dataset.
# Load packages and data
library(data.table)
library(dplyr)
summary <- fread("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-06-01/summary.csv")
# Compare average viewers for seasons in different locations
fiji_seasons <- summary %>% filter(country == 'Fiji')
other_seasons <- summary %>% filter(country != 'Fiji')
# Perform t-test
t.test(fiji_seasons$viewers_mean, other_seasons$viewers_mean)
##
## Welch Two Sample t-test
##
## data: fiji_seasons$viewers_mean and other_seasons$viewers_mean
## t = -4.5307, df = 27.938, p-value = 0.0001004
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -7.667140 -2.892491
## sample estimates:
## mean of x mean of y
## 10.69857 15.97839
Let’s analyze the output of this t-test:

Test Description:
 Welch Two Sample t-test: Indicates the type of t-test conducted, which is the Welch t-test. This test is used when comparing the means of two groups that may have unequal variances.

Data Description:
 data: Compares the viewers_mean of fiji_seasons and other_seasons. These represent the average viewership for TV seasons based on their filming locations (Fiji vs. other countries).

Test Statistics:
 t = -4.5307: The calculated t-statistic. The negative value indicates that the mean of the first group (Fiji seasons) is less than the mean of the second group (other seasons).
 df = 27.938: Degrees of freedom for the test, a value calculated based on the sample sizes and variances of the two groups.
 p-value = 0.0001004: The probability of observing a test statistic as extreme as, or more extreme than, the one observed, assuming the null hypothesis is true. A p-value this low (much less than 0.05) suggests that the observed difference in means is statistically significant.

Hypothesis Testing:
 alternative hypothesis: The hypothesis being tested is that the true difference in means is not equal to 0. In other words, it’s assessing whether the average viewership for seasons in Fiji is significantly different from those in other locations.
 95 percent confidence interval: The interval ranges from approximately -7.667140 to -2.892491. Since this interval does not include 0 and is entirely negative, it suggests a significant difference in means, with the Fiji seasons having lower average viewership.

Sample Estimates:
 mean of x (Fiji seasons): The mean viewership for Fiji seasons, approximately 10.69857.
 mean of y (Other seasons): The mean viewership for seasons in other locations, approximately 15.97839.
In summary, the Welch t-test’s output indicates a statistically significant difference in the average viewership of TV seasons filmed in Fiji compared to those filmed in other locations. The negative t-value and confidence interval suggest that the seasons filmed in Fiji, on average, have lower viewership than those filmed elsewhere. The low p-value reinforces this finding, suggesting that the difference in viewership is not just a result of random chance. Confidence intervals provide a range that is likely to contain the population parameter at a specified level of confidence; this range offers a margin of error around the sample estimate, giving a probabilistic assessment of where the true value lies.
Paired Sample T-test
In contrast, a paired sample T-test is designed to compare means from the same group at different times or under different conditions (Vasishth & Broe, 2011). For example, it could be used to compare student test scores before and after a training program. Here, the assumption is that the differences between pairs follow a normal distribution. Paired T-tests are particularly useful in “before and after” scenarios, where each subject serves as their own control, thereby increasing the test’s sensitivity.
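As a minimal sketch of this idea (using hypothetical before/after training scores, not data from this chapter), note that a paired t-test is mathematically equivalent to a one-sample t-test on the pairwise differences:

```r
# Hypothetical before/after training scores for 25 students
set.seed(1)
before <- rnorm(25, mean = 70, sd = 8)
after  <- before + rnorm(25, mean = 3, sd = 4)  # built-in average gain of about 3 points

paired_res <- t.test(after, before, paired = TRUE)

# Equivalent formulation: a one-sample t-test of the differences against mu = 0
diff_res <- t.test(after - before, mu = 0)

# Both formulations give identical t-statistics and p-values
all.equal(unname(paired_res$statistic), unname(diff_res$statistic))
```

This equivalence is why the paired test is more sensitive: it removes the between-subject variability and tests only the within-subject change.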
Example with Survivor’s summary.csv:
The R code provided performs a paired t-test to compare viewership at the premiere and finale of TV seasons using the summary dataset. A paired t-test is appropriate when comparing two sets of related observations: here, the viewership of the same TV seasons at two different time points (premiere and finale).
# Perform paired t-test to compare viewership at premiere and finale
paired_t_test_result <- t.test(summary$viewers_premier, summary$viewers_finale, paired = TRUE)
# Output the result
paired_t_test_result
##
## Paired t-test
##
## data: summary$viewers_premier and summary$viewers_finale
## t = -0.76096, df = 39, p-value = 0.4513
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -2.764596 1.253096
## sample estimates:
## mean difference
## -0.75575
Let’s break down the output:

Test Description:
 Paired t-test: Indicates that a paired t-test is conducted, which is suitable for comparing two related samples or repeated measurements on the same subjects.

Data Description:
 data: The test compares viewers_premier and viewers_finale from the summary dataset.

Test Statistics:
 t = -0.76096: The calculated t-statistic. The negative sign suggests that mean viewership at the premiere is slightly lower than at the finale in this sample, but the direction alone does not indicate statistical significance.
 df = 39: Degrees of freedom for the test, equal to the number of pairs minus one.
 p-value = 0.4513: The probability of observing a test statistic as extreme as, or more extreme than, the one observed under the null hypothesis (no difference in means). A p-value greater than 0.05 (the common threshold for significance) suggests that the difference in mean viewership is not statistically significant.

Hypothesis Testing:
 alternative hypothesis: The hypothesis being tested is that the true mean difference in viewership between the premiere and finale is not equal to 0. In other words, it assesses whether there is a significant difference in viewership between these two time points.
 95 percent confidence interval: Ranges from approximately -2.764596 to 1.253096. Since this interval includes 0, it suggests that the difference in viewership between the premiere and finale is not statistically significant.

Sample Estimates:
 mean difference: The mean difference in viewership between the premiere and finale, calculated as the mean of the per-season differences. Here, it is -0.75575. However, the confidence interval and p-value indicate that this difference is not statistically significant.
In summary, the paired t-test output indicates that there is no statistically significant difference in viewership between the premiere and finale of the TV seasons in the dataset. The p-value is above the common threshold for significance (0.05), and the confidence interval includes 0, both suggesting that any observed difference in mean viewership could be due to random chance rather than a systematic difference.
Analysis of Variance (ANOVA)
ANOVA is a more generalized form of the T-test and is used when there are more than two groups to compare (Kutner, Nachtsheim, & Neter, 2004). The underlying principle of ANOVA is to partition the variance within the data into “between-group” and “within-group” variance, to identify any significant differences in means.
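The partitioning idea can be made concrete with a small synthetic example (the groups and scores below are illustrative assumptions): the between-group and within-group sums of squares computed by hand reproduce the Sum Sq column that aov() reports.

```r
# Three groups of 10 synthetic scores with different true means
set.seed(7)
scores <- c(rnorm(10, 50, 5), rnorm(10, 55, 5), rnorm(10, 60, 5))
group  <- factor(rep(c("A", "B", "C"), each = 10))

fit <- aov(scores ~ group)

# Between-group SS: squared deviations of group means from the grand mean,
# weighted by group size (10 per group here)
grand_mean  <- mean(scores)
group_means <- tapply(scores, group, mean)
ss_between  <- sum(10 * (group_means - grand_mean)^2)

# Within-group SS: squared deviations of each score from its own group mean
ss_within <- sum((scores - ave(scores, group, FUN = mean))^2)

# These match the Sum Sq column of summary(fit)
summary(fit)
```

The F statistic is then simply the between-group mean square divided by the within-group mean square, which is exactly the ratio reported in the ANOVA tables below.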
One-way ANOVA
One-way ANOVA focuses on a single independent variable with more than two levels or groups (Tabachnick & Fidell, 2013). It allows researchers to test if there are statistically significant differences between the means of three or more independent groups. It is widely used in various fields, including psychology, business, and healthcare, for testing the impact of different conditions or treatments.
Example with Survivor castaways.csv:
The provided R code performs a one-way analysis of variance (ANOVA) to test whether there are statistically significant differences in the total votes received by castaways, grouped by their personality types, using data from the castaways dataset.
castaways <- fread("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-06-01/castaways.csv")
# Perform one-way ANOVA for total_votes_received among different personality types
anova_result <- aov(total_votes_received ~ personality_type, data = castaways)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## personality_type 15 227 15.14 1.075 0.376
## Residuals 725 10209 14.08
## 3 observations deleted due to missingness
Let’s analyze the output of the ANOVA:

ANOVA Summary:

Df (Degrees of Freedom):
 personality_type: 15 — This represents the degrees of freedom for the personality types group. It’s calculated as the number of levels in the group minus one (assuming there are 16 personality types).
 Residuals: 725 — The degrees of freedom for the residuals, which is the number of observations minus the number of groups (here, total number of castaways minus 16).

Sum Sq (Sum of Squares):
 personality_type: 227 — The total variation attributed to the differences in personality type.
 Residuals: 10209 — The total variation that is not attributed to personality types (i.e., withingroup variation).
 Mean Sq (Mean Squares):
 personality_type: 15.14 — This is the variance between the groups (Sum Sq of personality type divided by its Df).
 Residuals: 14.08 — This is the variance within the groups (Sum Sq of residuals divided by its Df).
 F value: 1.075 — The F-statistic, calculated as the Mean Sq of personality type divided by the Mean Sq of residuals. It is a measure of how much the group means differ from the overall mean, relative to the variance within the groups.
 Pr(>F): 0.376 — The p-value associated with the F-statistic. It indicates the probability of observing an F-statistic as large as, or larger than, what was observed, under the assumption that the null hypothesis (no difference in means across groups) is true.

Interpreting the Results:
 The p-value is 0.376, which is greater than the common alpha level of 0.05. This suggests that there is no statistically significant difference in the total votes received among different personality types at the chosen level of significance. In other words, any observed differences in total votes among personality types could likely be due to chance.
 The relatively high p-value indicates that the null hypothesis (that there are no differences in the mean total votes received among the different personality types) cannot be rejected.

Additional Note:
 The output mentions “3 observations deleted due to missingness.” This indicates that the analysis excluded three cases where data were missing, which is a standard procedure in ANOVA to ensure the accuracy of the test results.
In summary, the one-way ANOVA conducted suggests that personality type does not have a statistically significant impact on the total votes received by castaways in the dataset. This is inferred from the high p-value and the ANOVA’s failure to reject the null hypothesis.
Two-way ANOVA
Two-way ANOVA, however, involves two independent variables, offering a more intricate comparison and understanding of the interaction effects (Winer, Brown, & Michels, 1991). It helps to analyze how two factors impact a dependent variable, and it can also show how the two independent variables interact with each other. This form of ANOVA is highly valuable in experimental design where multiple variables may influence the outcome.
Example with the movies dataset:
The provided R code performs a two-way analysis of variance (ANOVA) on the movies dataset to test for statistically significant differences in movie budgets across genres and years, and for an interaction between these two factors.
# Perform two-way ANOVA for budget by genre and year
anova_result <- aov(budget ~ genre * year, data = movies)
summary(anova_result)
## Df Sum Sq Mean Sq F value
## genre 270 1743797675202755840 6458509908158355 5.358
## year 1 159516834289368928 159516834289368928 132.346
## genre:year 156 386490379486245504 2477502432604138 2.056
## Residuals 1164 1402970663621501696 1205301257406788
## Pr(>F)
## genre < 0.0000000000000002 ***
## year < 0.0000000000000002 ***
## genre:year 0.0000000000258 ***
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 202 observations deleted due to missingness
Let’s analyze the output:

ANOVA Summary:

Df (Degrees of Freedom): Represents the number of levels in each factor minus one.
 genre: 270 — Degrees of freedom for the genre factor.
 year: 1 — Degrees of freedom for the year factor.
 genre:year: 156 — Degrees of freedom for the interaction between genre and year.
 Residuals: 1164 — Degrees of freedom for the residuals (total number of observations minus the sum of the degrees of freedom for each factor and the interaction).

Sum Sq (Sum of Squares):
 Indicates the total variation attributed to each factor and their interaction.

Mean Sq (Mean Squares):
 The variance due to each factor and their interaction (Sum Sq divided by Df).

F value:
 The F-statistic for each factor, calculated as the Mean Sq of the factor divided by the Mean Sq of the residuals. A large value indicates that the factor explains substantial variation relative to the residual (unexplained) variation.

Pr(>F) (p-value):
 Indicates the probability of observing an F-statistic as large as, or larger than, what was observed, under the null hypothesis (no effect).
 genre, year, genre:year: All have very low p-values, indicated by “***”, suggesting that each factor and their interaction significantly affect movie budgets.

Interpreting the Results:
 Genre: The very low p-value suggests a statistically significant difference in movie budgets across different genres.
 Year: The very low p-value indicates a significant difference in movie budgets across different years.
 Genre-Year Interaction: The low p-value for the interaction term suggests that the effect of genre on movie budgets varies by year, meaning different genres might have different budget trends over time.
 Residuals: Represent unexplained variance after accounting for the main effects and interaction.

Significance Codes:
 The “***” next to the p-values denotes a very high level of statistical significance.

Additional Note:
 “202 observations deleted due to missingness” indicates that the analysis excluded cases with missing data, which is common in ANOVA to maintain accuracy.
In summary, the two-way ANOVA results suggest that both genre and year, and the interaction between them, have statistically significant effects on movie budgets in the dataset. This implies that budget variations are not only dependent on the genre or the year independently but also on how these two factors interact with each other.
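To see what an interaction means in practice, it can help to simulate one on a small synthetic dataset (the genres and eras below are illustrative assumptions, not the movies data): the genre:era row of the ANOVA table tests whether the genre effect differs across eras, and a means plot makes the pattern visible.

```r
# Synthetic design: 2 genres x 2 eras, 15 replicates per cell
set.seed(3)
d <- expand.grid(genre = c("action", "drama"),
                 era   = c("early", "late"),
                 rep   = 1:15)
# Build in an interaction: action budgets rise in the late era, drama stays flat
d$budget <- 20 + 10 * (d$genre == "action" & d$era == "late") + rnorm(nrow(d), sd = 3)

fit <- aov(budget ~ genre * era, data = d)
summary(fit)  # the genre:era row tests the interaction term

# Non-parallel lines in a means plot are the visual signature of an interaction
with(d, interaction.plot(era, genre, budget))
```

When the interaction term is significant, the main effects should be interpreted with caution, since the effect of one factor depends on the level of the other.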
Regression Analysis
Simple Linear Regression
Simple linear regression aims to model the relationship between a single independent variable and a dependent variable by fitting a linear equation to observed data (Montgomery, Peck, & Vining, 2012). The primary objective is to find the best-fitting straight line that accurately predicts the output values within a range. Simple linear regression works best when the variables have a linear relationship and the data are homoscedastic, meaning the variance of errors is constant across levels of the independent variable.
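The “best-fitting line” has a closed form: the slope equals the covariance of x and y divided by the variance of x, and the intercept follows from the two means. A short sketch on synthetic data (the variables here are illustrative assumptions) confirms that lm() returns exactly these values.

```r
# Synthetic predictor and response with a mild linear trend
set.seed(9)
x <- 1:40                                # e.g., an episode index
y <- 12 + 0.3 * x + rnorm(40, sd = 2)    # response with noise

# Least-squares estimates computed by hand
slope     <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# coef(fit) returns c(intercept, slope); the hand computation matches
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(intercept, slope))
```

This is why the regression line always passes through the point of means (mean of x, mean of y).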
Example with Survivor viewers.csv:
The provided R code performs a linear regression analysis using the lm() function to model the relationship between the number of viewers (dependent variable) and episode numbers (independent variable) in a TV series dataset. The summary() function is then used to provide a detailed summary of the linear model’s results.
viewers <- fread("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-06-01/viewers.csv")
# Model viewers based on episode numbers
lm_result <- lm(viewers ~ episode, data = viewers)
summary(lm_result)
##
## Call:
## lm(formula = viewers ~ episode, data = viewers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.189 -4.344 -2.014 4.224 37.931
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.24438 0.54547 24.281 <0.0000000000000002 ***
## episode 0.03960 0.06065 0.653 0.514
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.283 on 572 degrees of freedom
## (22 observations deleted due to missingness)
## Multiple R-squared: 0.0007448, Adjusted R-squared: -0.001002
## F-statistic: 0.4263 on 1 and 572 DF, p-value: 0.5141
Let’s break down the output:

Model Call:

lm(formula = viewers ~ episode, data = viewers): This indicates the linear model was fitted to predict viewers based on episode numbers.

Residuals:
 The residuals represent the differences between the observed values and the values predicted by the model.
 Min, 1Q (First Quartile), Median, 3Q (Third Quartile), Max: These statistics provide a summary of the distribution of residuals. The relatively large range suggests that there may be considerable variance in how well the model predictions match the actual data.

Coefficients:
 (Intercept): The estimated average number of viewers when the episode number is zero. The intercept is significant (p < 0.0000000000000002).
 episode: The estimated change in the number of viewers for each additional episode. The coefficient is 0.03960, but it is not statistically significant (p = 0.514), suggesting that the number of episodes does not have a significant linear relationship with the number of viewers.
 Std. Error: Measures the variability or uncertainty in the coefficient estimates.
 t value: The test statistic for the hypothesis that each coefficient is different from zero.
 Pr(>|t|): The p-value for the test statistic. A low p-value (< 0.05) would indicate that the coefficient is significantly different from zero.

Residual Standard Error:
 6.283 on 572 degrees of freedom: This is a measure of the typical size of the residuals. The degrees of freedom are the number of observations minus the number of parameters being estimated.

R-squared Values:
 Multiple R-squared: 0.0007448: This indicates how much of the variability in the dependent variable (viewers) can be explained by the independent variable (episode). A value close to 0 suggests that the model explains very little of the variability.
 Adjusted R-squared: -0.001002: Adjusts the R-squared value for the number of predictors in the model. It can be slightly negative when the model has essentially no explanatory power, as here.

F-statistic:
 0.4263 on 1 and 572 DF, p-value: 0.5141: This tests whether the model as a whole is statistically significant. The high p-value suggests that it is not, indicating that episode number does not significantly predict the number of viewers.

Significance Codes:
 The “***” next to the intercept’s p-value indicates a high level of statistical significance.

Observations with Missing Data:
 (22 observations deleted due to missingness): Indicates that 22 observations were excluded from the analysis due to missing data.
In summary, the linear regression model suggests that episode number is not a significant predictor of the number of viewers, based on the dataset used. The model’s low R-squared value and the non-significant p-value for the episode coefficient support this conclusion.
Multiple Linear Regression
Multiple linear regression extends the concept of simple linear regression to include two or more independent variables (Hair et al., 2014). This approach allows for a more nuanced understanding of the relationships among variables. It provides the tools needed to predict a dependent variable based on the values of multiple independent variables. Multiple linear regression assumes that the relationship between the dependent variable and the independent variables is linear, and it also assumes that the residuals are normally distributed and have constant variance.
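These residual assumptions can be checked after fitting. The sketch below (using synthetic predictors chosen purely for illustration) fits a two-predictor model and then inspects the residuals for normality and constant variance.

```r
# Synthetic data for a two-predictor regression
set.seed(5)
n  <- 60
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)

# Normality of residuals: Shapiro-Wilk (p > 0.05 is consistent with normality)
shapiro.test(residuals(fit))

# Constant variance: residuals vs. fitted values; a funnel shape
# would suggest heteroscedasticity
plot(fitted(fit), residuals(fit))
```

Calling plot(fit) on a fitted lm object produces a fuller set of diagnostic plots, including a Q-Q plot of the residuals.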
Example with Survivor summary.csv:
The R code provided performs a multiple linear regression analysis, modeling the average viewership (viewers_mean) as a function of country, timeslot, and season. The summary() function provides a detailed summary of the model’s results.
# Model average viewers based on multiple factors
lm_result <- lm(viewers_mean ~ country + timeslot + season, data = summary)
summary(lm_result)
##
## Call:
## lm(formula = viewers_mean ~ country + timeslot + season, data = summary)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1969 -0.3632 0.0000 0.2378 2.1969
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.76874 1.15957 26.535 0.000000000000000697 ***
## countryBrazil -8.54130 1.62530 -5.255 0.000053615198056057 ***
## countryCambodia -10.11504 1.98496 -5.096 0.000075482615565828 ***
## countryChina -8.32319 1.93885 -4.293 0.000438 ***
## countryFiji -9.15756 1.89552 -4.831 0.000134 ***
## countryGabon -8.72445 2.03256 -4.292 0.000438 ***
## countryGuatemala -7.14067 1.78167 -4.008 0.000825 ***
## countryIslands -8.72193 1.85477 -4.702 0.000177 ***
## countryKenya -8.62563 1.62564 -5.306 0.000048107174859556 ***
## countryMalaysia -7.57832 2.74555 -2.760 0.012888 *
## countryNicaragua -12.11584 1.75705 -6.896 0.000001898678859247 ***
## countryPalau -7.21193 1.66768 -4.325 0.000408 ***
## countryPanama -6.73275 1.44326 -4.665 0.000193 ***
## countryPhilippines -12.20939 1.80140 -6.778 0.000002385364697607 ***
## countryPolynesia -8.06126 1.63176 -4.940 0.000106 ***
## countrySamoa -8.85353 2.00306 -4.420 0.000331 ***
## countryThailand -7.13689 1.64191 -4.347 0.000389 ***
## countryVanuatu -6.76941 1.72095 -3.934 0.000974 ***
## timeslotWednesday 8:00 pm -5.59395 2.14699 -2.605 0.017892 *
## season -0.48437 0.08152 -5.942 0.000012701313768962 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.148 on 18 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.9758, Adjusted R-squared: 0.9502
## F-statistic: 38.18 on 19 and 18 DF, p-value: 0.00000000008219
Let’s break down the output:

Model Call:

lm(formula = viewers_mean ~ country + timeslot + season, data = summary): Shows the regression formula used, predicting viewers_mean based on country, timeslot, and season.

Residuals:
 The residuals represent the differences between the observed and predicted values. The summary (Min, 1st Quartile, Median, 3rd Quartile, Max) shows the distribution of these residuals.

Coefficients:

Estimate: The regression coefficients for the intercept and each predictor. These values represent the expected change in viewers_mean for a one-unit change in the predictor, holding all other predictors constant.
 Std. Error: The standard error of each coefficient, indicating the precision of the coefficient estimates.
 t value: The test statistic for the hypothesis that each coefficient is different from zero.
 Pr(>|t|): The p-value for the test statistic. A low p-value (< 0.05) indicates that the coefficient is significantly different from zero.
 The coefficients for the different countries and the timeslotWednesday 8:00 pm level are statistically significant, as indicated by their p-values and significance codes. The season variable is also significant, suggesting its impact on viewership.

Residual Standard Error:
 1.148 on 18 degrees of freedom: This is a measure of the typical size of the residuals. Degrees of freedom are calculated as the total number of observations minus the number of estimated parameters.

R-squared Values:
 Multiple R-squared: 0.9758: Indicates the proportion of variance in the dependent variable (viewers_mean) that is predictable from the independent variables. A value of 0.9758 suggests a high level of predictability.
 Adjusted R-squared: 0.9502: Adjusts the R-squared value for the number of predictors in the model. This is closer to the true predictive power of the model.

F-statistic:
 38.18 on 19 and 18 DF, p-value: 0.00000000008219: This tests the overall significance of the model. The very low p-value suggests the model as a whole is statistically significant.

Significance Codes:
 Indicate the level of significance for the coefficients. “***” denotes a very high level of statistical significance.

Observations with Missing Data:
 (2 observations deleted due to missingness): Indicates that 2 observations were excluded from the analysis due to missing data.
In summary, the multiple linear regression model suggests that both the country and the season significantly predict the average viewership of the TV series, with the timeslot (specifically Wednesday 8:00 pm) also playing a significant role. The model explains a very high proportion of the variance in average viewership (as indicated by the R-squared values), and the overall model is statistically significant.
References
 Agresti, A. (2002). Categorical Data Analysis. John Wiley & Sons.
 Bland, J. M., & Altman, D. G. (1996). Statistics notes: Transforming data. BMJ, 312(7033), 770.
 Boslaugh, S. (2012). Statistics in a Nutshell: A Desktop Quick Reference. O’Reilly Media.
 Cox, D. R., & Snell, E. J. (1981). Applied statistics: principles and examples. Chapman and Hall.
 De Veaux, R. D., Velleman, P. F., & Bock, D. E. (2018). Stats: Data and models (4th ed.). Pearson.
 Field, A. P., Miles, J., & Field, Z. (2012). Discovering statistics using R. Sage Publications.
 Frankfort-Nachmias, C., Leon-Guerrero, A., & Davis, G. (2020). Social statistics for a diverse society. Sage Publications.
 Gravetter, F. J., & Wallnau, L. B. (2016). Essentials of statistics for the behavioral sciences (8th ed.). Wadsworth Cengage Learning.
 Hair, J. F., Black, W. C., Babin, B. J., & Anderson, R. E. (2014). Multivariate Data Analysis. Pearson Education Limited.
 Hoaglin, D. C., Mosteller, F., & Tukey, J. W. (2000). Understanding Robust and Exploratory Data Analysis. John Wiley & Sons.
 Johnson, R. A., & Wichern, D. W. (2007). Applied Multivariate Statistical Analysis. Pearson Prentice Hall.
 Kenney, J. F., & Keeping, E. S. (1962). Mathematics of Statistics, Pt. 2, 2nd ed. Van Nostrand.
 Kutner, M. H., Nachtsheim, C. J., & Neter, J. (2004). Applied linear regression models. McGraw-Hill Irwin.
 Levine, D. M., Stephan, D. F., Krehbiel, T. C., & Berenson, M. L. (2008). Statistics for Managers Using Microsoft Excel. Pearson Prentice Hall.
 Lind, D. A., Marchal, W. G., & Wathen, S. A. (2012). Statistical techniques in business and economics (15th ed.). McGraw-Hill Irwin.
 McClave, J. T., Benson, P. G., & Sincich, T. (2011). Statistics for business and economics (11th ed.). Prentice Hall.
 Montgomery, D. C., Peck, E. A., & Vining, G. G. (2012). Introduction to linear regression analysis. John Wiley & Sons.
 Moore, D. S., McCabe, G. P., & Craig, B. A. (2009). Introduction to the Practice of Statistics. W.H. Freeman and Co.
 Rosner, B. (2015). Fundamentals of biostatistics. Cengage Learning.
 Siegel, S., & Castellan, N. J. (1988). Nonparametric statistics for the behavioral sciences. McGraw-Hill.
 Tabachnick, B. G., & Fidell, L. S. (2013). Using multivariate statistics. Pearson.
 Triola, M. F. (2018). Elementary Statistics. Pearson Education.
 Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.
 Vasishth, S., & Broe, M. (2011). The foundations of statistics: A simulationbased approach. Springer Science & Business Media.
 Winer, B. J., Brown, D. R., & Michels, K. M. (1991). Statistical principles in experimental design. McGraw-Hill.