15 Data Analysis
15.1 Introduction
Data analysis plays a central role in communication and media research. With an ever-increasing volume of data generated from diverse sources, researchers need robust techniques to sift through this raw information and discern meaningful patterns or trends. This section outlines the objectives of data analysis, categorizes the major types of statistical analyses commonly used, and discusses the pivotal role of statistical software such as RStudio in facilitating the analytical process.
Objectives and Importance of Data Analysis
The fundamental aim of data analysis is to distill large amounts of information into actionable insights. In the context of communication and media research, these objectives can be further refined as follows:
Summarization: To present data in a digestible format, offering a clear snapshot of its main features.
Exploration: To identify relationships or trends within the data, providing a basis for further investigation.
Inference: To make educated guesses or predictions about a broader population based on sample data.
Validation: To confirm or negate existing theories or hypotheses through empirical evidence.
Decision-Making: To provide actionable recommendations and insights that may influence policy, strategy, or further academic research.
Ignoring proper data analysis can lead to misleading conclusions or partial understandings, affecting the quality and reliability of the research. Thus, the importance of meticulous data analysis cannot be overstated.
Types of Analyses: Descriptive vs. Inferential
Inferential analysis is distinct from descriptive analysis in its ambition to extend findings from a sample to a broader population. It uses probability theory to estimate and make predictions about population parameters, whereas descriptive analysis is confined to the dataset in hand.
Descriptive Analysis
In descriptive statistics, the aim is to summarize the main aspects of the data in hand, often through tables, graphs, or numerical measures such as mean, median, and standard deviation. Descriptive analysis provides a compact representation of the data, but it does not allow researchers to make conclusions beyond the data at hand (Tukey, 1977).
Inferential Analysis
In contrast, inferential statistics go a step further by enabling researchers to draw conclusions about a population based on a sample. Inferential methods like t-tests, ANOVA, and regression allow one to assess hypotheses and derive estimates that are generalizable to a broader context (Cohen, 1988).
Role of Statistical Software (e.g., RStudio)
In the current digital age, statistical software has become an indispensable tool for data analysis. RStudio is one such environment that offers a wide array of statistical and graphical techniques. It is especially favored for its:
- User-Friendly Interface: RStudio provides a clean and efficient interface for executing R code, thereby easing the process of data analysis.
- Flexibility and Adaptability: It supports various data formats and can be integrated with other software and programming languages.
- Extensive Libraries: With a rich ecosystem of packages like ggplot2 for data visualization, dplyr for data manipulation, and caret for machine learning, RStudio offers comprehensive analytical capabilities.
- Reproducibility: The code-based nature of RStudio ensures that analyses can be easily documented and reproduced, adhering to the tenets of reliable scientific research (Peng, 2011).
By mastering RStudio or similar statistical software, researchers are better equipped to conduct complex analyses that can contribute to robust and insightful findings.
15.2 Descriptive Analysis
Descriptive statistics form the bedrock of data exploration and initial data analysis. Descriptive analysis plays a pivotal role in data analysis by concisely summarizing the key characteristics of a dataset. It involves calculating various statistics to present a snapshot of the data, enabling researchers to understand its basic structure and form. These statistics facilitate the comprehensive summarization, condensation, and general understanding of the structural attributes of expansive datasets (De Veaux, Velleman, & Bock, 2018). Employed as a precursor to more advanced statistical procedures, descriptive statistics offer a straightforward way to describe the main aspects of a data set, from the typical values to the variability within the set. They provide researchers with tools to quickly identify patterns, trends, and potential outliers without making generalized predictions about larger populations (Boslaugh, 2012). Furthermore, descriptive statistics are essential in exploratory data analysis, where their role is to aid in the detection of any unusual observations that may warrant further investigation (Tukey, 1977).
Moreover, descriptive statistics have applications that span across various domains—from social sciences to economics, from healthcare to engineering. The utility lies in their ability to translate large amounts of data into easily understandable formats, such as graphs, tables, and numerical measures, thereby transforming raw data into insightful information. In research, they often serve as the initial step in the process of data analytics, giving researchers a snapshot of what the data looks like before delving into more complex analytical techniques like inferential statistics or machine learning algorithms (Hair et al., 2014).
If a researcher’s interest lies in examining how variables change together without intending to make predictive inferences, they should utilize descriptive correlational analysis. This type of analysis explores the relationship between variables using correlation coefficients, without extending to prediction.
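As a quick sketch of descriptive correlation in R, the cor() function returns the correlation coefficient directly; the two variables and their values below are invented for illustration, not taken from the chapter's datasets:

```r
# Hypothetical example: how do two variables move together?
hours_online  <- c(2, 4, 1, 5, 3, 6, 2, 7)   # hours spent online per day
news_exposure <- c(3, 5, 2, 6, 4, 8, 3, 9)   # news items encountered per day

# Pearson correlation coefficient, bounded between -1 and 1;
# it describes the association without predicting anything
cor(hours_online, news_exposure)
```

A value near 1 or -1 indicates a strong linear association; the analysis stops at description and makes no inference beyond these observations.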
Measures of Central Tendency
To capture the central tendency or the “average” experience within a set of data, calculating the mean is most appropriate. The mean provides a single value summarizing the central point of a dataset’s distribution.
Load data
# Load the packages
library(tidyverse)
library(data.table)
options(scipen = 999)
# Import the datasets
spotify_songs <- fread("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv")
movies <- fread("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-03-09/movies.csv")
Mean
The mean is perhaps the most widely recognized measure of central tendency, representing the arithmetic average of a dataset. In descriptive analysis, the mean serves as a fundamental measure, providing an average value that represents the central tendency of a dataset. This average is calculated by summing all observations and dividing by the number of observations. The mean is sensitive to outliers, which can disproportionately influence the calculated average, potentially resulting in a misleading representation of central location (McClave, Benson, & Sincich, 2011). Despite this limitation, the mean is highly useful in various statistical methods, including regression analysis and hypothesis testing, because of its mathematical properties (Field, Miles, & Field, 2012).
Importantly, the mean can be categorized into different types: arithmetic mean, geometric mean, and harmonic mean, each with specific applications depending on the nature of the data and the intended analysis (Triola, 2018). For instance, the geometric mean is often used when dealing with data that exhibit exponential growth or decline, such as in financial or biological contexts (Cox & Snell, 1981).
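All three variants are one-liners in base R; since R has no built-in geometric or harmonic mean function, the standard formulas are written out below (the data values are invented for illustration):

```r
x <- c(2, 8, 4, 16)                    # illustrative positive values

arithmetic <- mean(x)                  # sum(x) / length(x)
geometric  <- exp(mean(log(x)))        # nth root of the product of the values
harmonic   <- 1 / mean(1 / x)          # reciprocal of the mean of reciprocals

# For positive data, harmonic <= geometric <= arithmetic always holds
c(arithmetic = arithmetic, geometric = geometric, harmonic = harmonic)
```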
Descriptive statistics are most commonly paired with visualizations to provide clarity. For example, a scatterplot is an invaluable tool in descriptive analysis when the objective is to illustrate the relationship or correlation between two variables. It visually represents the data points for each observed pair, facilitating the detection of patterns or relationships.
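As one illustration with the chapter's Spotify data, a ggplot2 scatterplot can place danceability against track popularity; this particular pairing is our choice for the sketch, not one the chapter prescribes:

```r
library(tidyverse)
library(data.table)

spotify_songs <- fread("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv")

# Each point is one track; patterns (or their absence) are visible at a glance
ggplot(spotify_songs, aes(x = danceability, y = track_popularity)) +
  geom_point(alpha = 0.2) +
  labs(x = "Danceability", y = "Track popularity")
```

The alpha transparency keeps dense regions readable when many thousands of points overlap.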
Example using Spotify Songs Dataset: To find the mean popularity of songs.
The R code provided demonstrates the use of the dplyr package and base R functions to calculate the mean popularity of tracks in the spotify_songs dataset. Let's break down the code and its output:
- dplyr summarise function:
mean_popularity <- spotify_songs %>%
  summarise(mean_popularity = mean(track_popularity, na.rm = TRUE))
This snippet uses the dplyr package's summarise function to calculate the mean of the track_popularity variable in the spotify_songs dataframe. The mean function is used with the na.rm = TRUE argument, which means that it will ignore NA (missing) values in the calculation. The result is stored in a new dataframe, mean_popularity.
- Output Explanation:
mean_popularity
## mean_popularity
## 1 42.47708
This output indicates that the mean popularity score of the tracks in the dataset is approximately 42.47708. The <dbl> notation suggests that the mean popularity score is a double-precision floating-point number, which is a common way of representing decimal numbers in R.
In summary, both methods are used to calculate the average popularity score of tracks in the spotify_songs dataset. The output shows the mean value as approximately 42.47708, reflecting the average popularity of the tracks in the dataset. The use of dplyr and base R functions provides a means to cross-validate the result for accuracy.
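The base R counterpart mentioned here is not shown in the text; it would presumably be a single call along these lines (a reconstruction, not the original snippet):

```r
library(data.table)

spotify_songs <- fread("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv")

# Base R cross-check of the dplyr result (~42.47708 per the output above)
mean(spotify_songs$track_popularity, na.rm = TRUE)
```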
Median
The median serves as another measure of central tendency and is less sensitive to outliers compared to the mean (Lind et al., 2012). It is defined as the middle value in a dataset that has been arranged in ascending order. If the dataset contains an even number of observations, the median is calculated as the average of the two middle numbers. Medians are particularly useful for data that are skewed or contain outliers, as they provide a more “resistant” measure of the data’s central location (Hoaglin, Mosteller, & Tukey, 2000).
In addition to its robustness against outliers, the median is often used in nonparametric statistical tests like the Mann-Whitney U test and the Kruskal-Wallis test. These tests do not assume that the data follow a specific distribution, making the median an invaluable asset in such scenarios (Siegel & Castellan, 1988).
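The odd-versus-even rule is easy to verify directly in R (toy values, invented for the example):

```r
x_odd  <- c(9, 1, 5, 3, 7)   # sorted: 1 3 5 7 9 -> single middle value
x_even <- c(1, 3, 5, 7)      # sorted: 1 3 5 7   -> average of 3 and 5

median(x_odd)    # 5
median(x_even)   # 4
```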
Example using Movies Dataset: To find the median budget of movies.
The provided R code calculates the median budget of movies in the movies dataset, with two different approaches, and the results are displayed. Let's analyze the code and its outputs:
- Using dplyr's summarise function:
This snippet uses the dplyr package's summarise function to compute the median of the budget variable in the movies dataframe. Before calculating the median, each budget value is divided by 1,000,000 (budget/1000000), effectively converting the budget values from (presumably) dollars to millions of dollars. The na.rm = TRUE argument in the median function indicates that any NA (missing) values should be ignored in the calculation. The result is stored in a new dataframe called median_budget.
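The snippet this explanation walks through does not survive in the text; based on the description, it was presumably along these lines:

```r
library(tidyverse)
library(data.table)

movies <- fread("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-03-09/movies.csv")

# Median budget, converted to millions of dollars, ignoring missing values
median_budget <- movies %>%
  summarise(median_budget = median(budget/1000000, na.rm = TRUE))
```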
- Output Explanation:
median_budget
## median_budget
## 1 28
This indicates that the median budget of the movies, in millions of dollars, is 28. The <dbl> notation signifies that the median budget is a double-precision floating-point number.
In conclusion, both methods are used to calculate the median budget of movies in the dataset, and both approaches confirm that the median budget is 28 million dollars. The use of both dplyr and base R functions serves as a cross-verification to ensure the accuracy of the result.
Mode
The mode refers to the value or values that appear most frequently in a dataset (Gravetter & Wallnau, 2016). A dataset can be unimodal, having one mode; bimodal, having two modes; or multimodal, having multiple modes. While the mode is less commonly used than the mean and median for numerical data, it is the primary measure of central tendency for categorical or nominal data (Agresti, 2002).
Despite its less frequent application in numerical contexts, the mode can still be useful for identifying the most common values in a dataset and for understanding the general distribution of the data (Bland & Altman, 1996). For example, in market research, knowing the mode of a dataset on consumer preferences can provide valuable insight into what most consumers are likely to choose.
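Note that base R's mode() reports an object's storage type rather than the statistical mode, so analysts either use a package such as DescTools (as the next example does) or define a small helper; the function below is a common idiom, not code from the chapter:

```r
# Minimal mode helper: returns the most frequent value(s), ignoring NAs
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  counts <- table(x)
  names(counts)[counts == max(counts)]
}

stat_mode(c("pop", "edm", "edm", "rock", NA))   # "edm"
```

Returning all names tied for the maximum count means the helper also handles bimodal and multimodal data.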
Example using Spotify Songs Dataset: To find the mode of the playlist_genre variable.
The provided R code calculates the mode of the playlist_genre variable in the spotify_songs dataset using the Mode function from the DescTools package. The mode is the value that appears most frequently in a dataset. Let's break down the code and its output:
- Using the DescTools package's Mode function:
library(DescTools)
##
## Attaching package: 'DescTools'
## The following object is masked from 'package:data.table':
##
##     %like%
mode_genre <- Mode(spotify_songs$playlist_genre)
This snippet uses the Mode function from the DescTools package to find the most frequently occurring genre in the playlist_genre column of the spotify_songs dataset. The result is stored in the variable mode_genre.
- Output Explanation:
mode_genre
## [1] "edm"
## attr(,"freq")
## [1] 6043
This output indicates that the most common genre (mode) in the playlist_genre column is "edm". The attr(,"freq") attribute shows the frequency of this mode, which is 6043. This means that "edm" appears 6043 times in the playlist_genre column, more than any other genre.
In summary, the code calculates and displays the mode of the playlist_genre variable in the spotify_songs dataset, indicating that the most common genre is "edm", which appears 6043 times.
Measures of Dispersion
Range
The range is the simplest measure of dispersion, calculated by subtracting the smallest value from the largest value in the dataset (McClave, Benson, & Sincich, 2011). While straightforward to compute, the range is highly sensitive to outliers and does not account for how the rest of the values in the dataset are distributed (Triola, 2018).
The range offers a quick, albeit crude, estimate of the dataset’s variability. It is often used in conjunction with other measures of dispersion for a more comprehensive understanding of data spread. Despite its limitations, the range can be helpful in initial exploratory analyses to quickly identify the scope of the data and to detect possible outliers or data entry errors (Tukey, 1977).
Example using Movies Dataset: To find the range of movie budgets.
The R code provided calculates the range of the budget column in the movies dataset using the dplyr package. The range is a measure of dispersion that represents the difference between the maximum and minimum values in a dataset. Here's a breakdown of the code and its output:
- Code Explanation:
budget_range <- movies %>%
  summarise(Range = max(budget/1000000, na.rm = TRUE) -
                    min(budget/1000000, na.rm = TRUE))
- movies %>%: This part indicates that the code is using the movies dataframe and piping (%>%) it into subsequent operations.
- summarise(Range = ...): The summarise function from the dplyr package is used to compute a summary statistic. Here, it's creating a new variable named Range.
- max(budget/1000000, na.rm = TRUE) - min(budget/1000000, na.rm = TRUE): This calculates the range of the movie budgets. Each budget value is first divided by 1,000,000 (presumably converting the budget from dollars to millions of dollars). The max function finds the maximum value and min finds the minimum value, with na.rm = TRUE indicating that any NA (missing) values should be ignored. The range is the difference between these two values.
- Output Explanation:
budget_range
## Range
## 1 424.993
- The output shows that the calculated range of the movie budgets, in millions of dollars, is approximately 424.993. This means that the largest budget in the dataset exceeds the smallest budget by about 424.993 million dollars.
- The <dbl> notation indicates that the calculated range is a double-precision floating-point number, a standard numeric type in R for representing decimal values.
In summary, the code calculates the range of movie budgets in the movies dataset and finds that the budgets span approximately 424.993 million dollars, from the smallest to the largest. This provides a sense of how varied the movie budgets are in the dataset.
Standard Deviation
The standard deviation is a more sophisticated measure of dispersion that indicates how much individual data points deviate from the mean (Lind et al., 2012). Standard deviation is a measure in descriptive analysis that quantifies the variation or dispersion of a set of data values. It reflects how much individual data points differ from the mean, indicating the dataset’s spread. Calculated as the square root of the variance, the standard deviation provides an intuitive sense of the data’s spread since it is in the same unit as the original data points. It plays a crucial role in various statistical analyses, including hypothesis testing and confidence interval estimation, and is fundamental in fields ranging from finance to natural sciences (Levine, Stephan, Krehbiel, & Berenson, 2008).
The standard deviation can be classified into two types: population standard deviation and sample standard deviation. The former is used when the data represent an entire population, while the latter is used for sample data and is calculated with a slight adjustment to account for sample bias (Kenney & Keeping, 1962).
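This distinction surfaces directly in R: sd() returns the sample standard deviation (dividing by n - 1), while the population version has no base function and must be written out, as in this sketch on invented data:

```r
x <- c(4, 8, 6, 5, 3)                               # illustrative data
n <- length(x)

sample_sd     <- sd(x)                              # divides by n - 1
population_sd <- sqrt(sum((x - mean(x))^2) / n)     # divides by n

# The population figure is always the smaller of the two
c(sample = sample_sd, population = population_sd)
```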
Example using Spotify Songs Dataset: To find the standard deviation of danceability.
The provided R code calculates the standard deviation of the danceability variable in the spotify_songs dataset using the dplyr package. Let's break down the code and its output:
- Code Explanation:
- spotify_songs %>%: This part uses the spotify_songs dataframe and pipes it into the subsequent operation using %>%.
- summarise(std_danceability = ...): The summarise function from dplyr is used to compute a summary statistic. Here, it's creating a new variable named std_danceability.
- sd(danceability, na.rm = TRUE): This calculates the standard deviation of the danceability variable. The sd function computes the standard deviation, and na.rm = TRUE indicates that any NA (missing) values should be ignored in the calculation.
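The snippet being explained is missing from the text; from the description, it was presumably:

```r
library(tidyverse)
library(data.table)

spotify_songs <- fread("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv")

# Standard deviation of danceability, ignoring missing values
std_danceability <- spotify_songs %>%
  summarise(std_danceability = sd(danceability, na.rm = TRUE))
```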
- Output Explanation:
std_danceability
## std_danceability
## 1 0.1450853
- The output shows that the calculated standard deviation of the danceability scores in the spotify_songs dataset is approximately 0.1450853.
- The <dbl> notation indicates that the result is a double-precision floating-point number, which is typical for numeric calculations in R.
The standard deviation is a measure of the amount of variation or dispersion in a set of values. A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the values are spread out over a wider range.
In this case, a standard deviation of approximately 0.1450853 for danceability suggests that the danceability scores in the spotify_songs dataset vary moderately around the mean. This gives an idea of the variability in danceability among the songs in the dataset.
Variance
Variance is closely related to the standard deviation, essentially being its square. It quantifies how much individual data points in a dataset differ from the mean (Gravetter & Wallnau, 2016). Unlike the standard deviation, the variance is not in the same unit as the data, which can make it less intuitive to interpret. However, variance has essential mathematical properties that make it useful in statistical modeling and hypothesis testing (Moore, McCabe, & Craig, 2009).
In statistical theory, the concept of variance is pivotal for various analytical techniques, such as Analysis of Variance (ANOVA) and Principal Component Analysis (PCA). Variance allows for the decomposition of data into explained and unexplained components, serving as a key element in understanding data variability in greater depth (Johnson & Wichern, 2007).
Example using Movies Dataset: To find the variance in IMDB ratings.
The provided R code calculates the variance of the imdb_rating variable in the movies dataset using the dplyr package. Let's examine the code and its output:
- Code Explanation:
- movies %>%: This line uses the movies dataframe and pipes it into the following operation with %>%.
- summarise(var_imdb_rating = ...): The summarise function from dplyr is employed to compute a summary statistic, in this case creating a new variable called var_imdb_rating.
- var(imdb_rating, na.rm = TRUE): This computes the variance of the imdb_rating variable. The var function calculates the variance, and na.rm = TRUE indicates that any NA (missing) values should be excluded from the calculation.
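The snippet being explained does not appear in the text; based on the description, it was presumably:

```r
library(tidyverse)
library(data.table)

movies <- fread("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-03-09/movies.csv")

# Variance of IMDb ratings, excluding missing values
var_imdb_rating <- movies %>%
  summarise(var_imdb_rating = var(imdb_rating, na.rm = TRUE))
```

As a consistency check, this value is the square of the standard deviation skim() reports for the same variable later in the chapter (0.9627823^2 is approximately 0.92695).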
- Output Explanation:
var_imdb_rating
## var_imdb_rating
## 1 0.9269498
- The output indicates that the variance of the IMDb ratings in the movies dataset is approximately 0.9269498.
- The <dbl> notation signifies that the result is a double-precision floating-point number, which is a standard numeric format in R.
Variance is a statistical measure that describes the spread of numbers in a data set. More specifically, it measures how far each number in the set is from the mean and thus from every other number in the set. In this context, a variance of approximately 0.9269498 in IMDb ratings suggests the degree to which these ratings vary from their average value in the dataset.
This measure of variance can be particularly useful for understanding the consistency of movie ratings; a lower variance would indicate that the ratings are generally close to the mean, suggesting agreement among raters, whereas a higher variance would imply more diverse opinions on movie ratings.
General Summary
There are also a couple of methods for obtaining multiple basic descriptive statistics with a single command. The most common of these is the summary() function. There is also a package called skimr.
summary()
The R code snippet provided uses the summary() function to generate descriptive statistics for the imdb_rating variable in the movies dataset. The summary() function in R provides the five-number summary of the given data plus the mean, along with the count of NA (missing) values. Let's break down the output:
summary(movies$imdb_rating)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.10 6.20 6.80 6.76 7.40 9.30 202

- Min. (Minimum): The smallest value in the imdb_rating data. Here, the minimum IMDb rating is 2.10.
- 1st Qu. (First Quartile): Also known as the lower quartile, it is the median of the lower half of the dataset. This value is 6.20, meaning 25% of the ratings are below this value.
- Median: The middle value when the data is sorted in ascending order. The median IMDb rating is 6.80, indicating that half of the movies have a rating below 6.80 and the other half have a rating above 6.80.
- Mean: The average of the imdb_rating values, calculated as the sum of all ratings divided by the number of non-missing ratings. The mean rating is 6.76.
- 3rd Qu. (Third Quartile): Also known as the upper quartile, it is the median of the upper half of the dataset. Here, 75% of the movies have a rating below 7.40.
- Max. (Maximum): The largest value in the imdb_rating data. The highest IMDb rating in the dataset is 9.30.
- NA's: The number of missing values in the imdb_rating data. There are 202 missing values.
This summary provides a comprehensive view of the distribution of IMDb ratings in the movies dataset, including the central tendency (mean, median), spread (minimum, first quartile, third quartile, maximum), and the count of missing values. It helps in understanding the overall rating landscape of the movies in the dataset.
skimr
The R code snippet provided uses the skim() function from the skimr package to generate a summary of the imdb_rating variable from the movies dataset. The skimr package provides a more detailed summary than the base R summary() function, which is particularly useful for initial exploratory data analysis.
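The call that produces the summary below is not shown in the text; it was presumably:

```r
library(skimr)
library(data.table)

movies <- fread("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-03-09/movies.csv")

# One-line overview of a single variable: completeness, moments, percentiles
skim(movies$imdb_rating)
```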
Name: movies$imdb_rating
Number of rows: 1794
Number of columns: 1
Column type frequency: numeric (1)
Group variables: None

Variable type: numeric

skim_variable  n_missing  complete_rate  mean  sd    p0   p25  p50  p75  p100  hist
data           202        0.89           6.76  0.96  2.1  6.2  6.8  7.4  9.3   ▁▁▅▇▂
Let’s break down the output:

- Data Summary Section:
  - Name: Identifies the data being summarized, here movies$imdb_rating.
  - Number of rows: Indicates the total number of entries in the dataset, which is 1794 for imdb_rating.
  - Number of columns: The number of variables or columns in the data being skimmed. Since skim() is applied to a single column, this is 1.
  - Column type frequency: Shows the types of data present in the columns. Here, there is 1 numeric column.
- Detailed Statistics Section:
  - skim_variable: A character representation of the variable being summarized.
  - n_missing: The number of missing (NA) values in the dataset. Here, there are 202 missing ratings.
  - complete_rate: Proportion of non-missing values, calculated as (Total Number of rows - n_missing) / Total Number of rows. For imdb_rating, it's approximately 0.8874025.
  - mean: The average of the imdb_rating values, which is 6.760113.
  - sd (standard deviation): Measures the amount of variation or dispersion in imdb_rating. Here, it is 0.9627823.
  - p0, p25, p50, p75, p100: These represent the percentiles of the data:
    - p0: The minimum value (0th percentile), which is 2.1.
    - p25: The 25th percentile, meaning 25% of the data fall below this value, which is 6.2.
    - p50: The median or 50th percentile, which is 6.8.
    - p75: The 75th percentile, meaning 75% of the data fall below this value, which is 7.4.
    - p100: The maximum value (100th percentile), which is 9.3.
  - hist: A text-based histogram providing a visual representation of the distribution of imdb_rating. The characters (▁▁▅▇▂) represent different frequency bins.
In summary, the skim() function output provides a detailed statistical summary of the imdb_rating variable, including measures of central tendency, dispersion, and data completeness, along with a visual histogram for quick assessment of the data distribution. This information is crucial for understanding the characteristics of the IMDb ratings in the movies dataset, especially when preparing for more detailed data analysis.
15.3 Inferential Analysis
Inferential analysis is a cornerstone of statistical research, empowering researchers to draw conclusions and make predictions about a larger population based on the analysis of a representative sample. This process involves statistical models and tests that go beyond the descriptive statistics of the immediate dataset. Unlike descriptive statistics, which aim to summarize data, inferential statistics allow for hypothesis testing, predictions, and inferences about the data (Field, Miles, & Field, 2012). The utility of inferential statistics lies in its ability to generalize findings beyond the immediate data to broader contexts. This is particularly valuable in research areas where it's impractical to collect data from an entire population (Frankfort-Nachmias, Leon-Guerrero, & Davis, 2020). When a researcher uses sample data to infer characteristics about a larger population, they engage in inferential statistical analysis. This process allows for the generalization of results from the sample to the population, within certain confidence levels.
The application of inferential statistics often involves the use of various tests and models to determine statistical significance, which in turn helps researchers make meaningful inferences. Such analyses are commonly used in disciplines like psychology, economics, and medicine, to name a few. They provide a quantitative basis for conclusions and decisions, which is fundamental for scientific research (Rosner, 2015). Given the capacity to test theories and hypotheses, inferential statistics remain an indispensable tool in the scientific community.
Comparison of Means
T-test
The t-test is a statistical method used to determine whether there is a significant difference between the means of two groups. It is commonly used to compare two samples to determine whether they could have originated from the same population (Rosner, 2015). The t-test operates under certain assumptions, such as the data being normally distributed and the samples being independent of each other; violation of these assumptions may lead to misleading results.
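These assumption choices surface directly in R: t.test() defaults to the Welch variant, which drops the equal-variance assumption, while var.equal = TRUE requests the classic pooled test. A sketch on simulated data (values invented for illustration):

```r
set.seed(1)
group_a <- rnorm(30, mean = 50, sd = 10)   # simulated scores for group A
group_b <- rnorm(30, mean = 55, sd = 10)   # simulated scores for group B

welch  <- t.test(group_a, group_b)                     # Welch (default)
pooled <- t.test(group_a, group_b, var.equal = TRUE)   # pooled-variance t-test

c(welch_p = welch$p.value, pooled_p = pooled$p.value)
```

With similar group variances the two versions give nearly identical p-values; they diverge as the variances (or sample sizes) become unequal.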
Example with the movies dataset:
The provided R code performs a Welch Two Sample t-test to compare the mean budgets of action and drama movies in the movies dataset. The Welch t-test is used to test the hypothesis that two populations (in this case, action and drama movies) have equal means. This test is appropriate when the two samples have possibly unequal variances.
# Calculate the mean budget for action and drama movies
action_movies <- movies %>% filter(genre == 'Action')
drama_movies <- movies %>% filter(genre == 'Drama')
# Perform t-test
t.test(action_movies$budget, drama_movies$budget)
##
## Welch Two Sample t-test
##
## data: action_movies$budget and drama_movies$budget
## t = -1.5346, df = 1.2327, p-value = 0.3325
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -76461080 52430636
## sample estimates:
## mean of x mean of y
## 7570000 19585222
Let’s analyze the output:

- Test Description:
  - Welch Two Sample t-test: Indicates the type of t-test conducted. The Welch test does not assume equal variances across the two samples.
- Data Description:
  - data: Specifies the datasets being compared, namely the budget of action_movies and drama_movies.

- Test Statistics:
  - t = -1.5346: The calculated t-statistic value. The sign of the t-statistic indicates the direction of the difference between the means (negative here suggests that the mean budget of action movies might be less than that of drama movies).
  - df = 1.2327: Degrees of freedom for the test. This value is calculated based on the sample sizes and variances of the two groups and is a key component in determining the critical value for the test.
  - p-value = 0.3325: The probability of observing a test statistic as extreme as, or more extreme than, the observed value under the null hypothesis. A higher p-value (typically > 0.05) suggests that the observed data is consistent with the null hypothesis, which in this test is that there is no difference in the means of the two groups.

- Hypothesis Testing:
  - alternative hypothesis: States the hypothesis being tested. Here, it tests whether the true difference in means is not equal to 0, i.e., whether the average budgets of action and drama movies are significantly different.
  - 95 percent confidence interval: This interval estimates the range of the true difference in means between the two groups. It ranges from approximately -76,461,080 to 52,430,636. Since this interval includes 0, it suggests that the difference in means might not be statistically significant.

- Sample Estimates:
  - mean of x (action movies): The mean budget of action movies, approximately 7,570,000.
  - mean of y (drama movies): The mean budget of drama movies, approximately 19,585,222.
In summary, the Welch t-test's output indicates that there is not a statistically significant difference in the mean budgets of action and drama movies in the dataset, as evidenced by a p-value greater than 0.05 and a confidence interval that includes 0. The sample estimates provide the average budgets for each movie genre, which can be useful for descriptive purposes.
Independent Sample T-test
An independent sample t-test is used when comparing the means of two independent groups to assess whether their means are statistically different (Field et al., 2012). The groups should be separate, meaning the performance or attributes of one group should not influence the other. For instance, this type of t-test might be used to compare the average test scores of two different classrooms. It is essential to note that both groups should be normally distributed and, ideally, have the same variance for the classic t-test to be applicable.
Example with Survivor summary.csv and viewers.csv:
The provided R code performs a Welch Two Sample t-test to compare the average viewership (viewers_mean) of TV seasons that took place in Fiji with those that took place in other locations. This test is conducted using data from the summary dataset.
# Load packages and data
library(data.table)
library(dplyr)
summary <- fread("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-06-01/summary.csv")
# Compare average viewers for seasons in different locations
fiji_seasons <- summary %>% filter(country == 'Fiji')
other_seasons <- summary %>% filter(country != 'Fiji')
# Perform t-test
t.test(fiji_seasons$viewers_mean, other_seasons$viewers_mean)
##
## Welch Two Sample t-test
##
## data: fiji_seasons$viewers_mean and other_seasons$viewers_mean
## t = -4.5307, df = 27.938, p-value = 0.0001004
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -7.667140 -2.892491
## sample estimates:
## mean of x mean of y
## 10.69857 15.97839
Let’s analyze the output of this t-test:

Test Description:
 Welch Two Sample t-test: Indicates the type of t-test conducted, which is the Welch t-test. This test is used when comparing the means of two groups that may have unequal variances.

Data Description:
 data: Compares the viewers_mean of fiji_seasons and other_seasons. These represent the average viewership for TV seasons based on their filming locations (Fiji vs. other countries).

Test Statistics:
 t = -4.5307: The calculated t-statistic. The negative value indicates that the mean of the first group (Fiji seasons) is less than the mean of the second group (other seasons).
 df = 27.938: Degrees of freedom for the test, a value calculated based on the sample sizes and variances of the two groups.
 p-value = 0.0001004: The probability of observing a test statistic as extreme as, or more extreme than, the one observed, assuming the null hypothesis is true. A p-value this low (much less than 0.05) suggests that the observed difference in means is statistically significant.

Hypothesis Testing:
 alternative hypothesis: The hypothesis being tested is that the true difference in means is not equal to 0. In other words, it’s assessing whether the average viewership for seasons in Fiji is significantly different from those in other locations.
 95 percent confidence interval: The interval ranges from approximately -7.667140 to -2.892491. Since this interval does not include 0 and is entirely negative, it suggests a significant difference in means, with the Fiji seasons having lower average viewership.

Sample Estimates:
 mean of x (Fiji seasons): The mean viewership for Fiji seasons, approximately 10.69857.
 mean of y (Other seasons): The mean viewership for seasons in other locations, approximately 15.97839.
In summary, the Welch t-test’s output indicates a statistically significant difference in the average viewership of TV seasons filmed in Fiji compared to those filmed in other locations. The negative t-value and confidence interval suggest that the seasons filmed in Fiji, on average, have lower viewership than those filmed elsewhere. The low p-value reinforces this finding, suggesting that the difference in viewership is not just a result of random chance. Confidence intervals provide a range that is likely to contain the population parameter at a specified level of confidence; this range offers a margin of error around the sample estimate, giving a probabilistic assessment of where the true value lies.
Paired Sample T-test
In contrast, a paired sample T-test is designed to compare means from the same group at different times or under different conditions (Vasishth & Broe, 2011). For example, it could be used to compare student test scores before and after a training program. Here, the assumption is that the differences between pairs follow a normal distribution. Paired T-tests are particularly useful in “before and after” scenarios, where each subject serves as their own control, thereby increasing the test’s sensitivity.
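As a minimal sketch of this idea (using hypothetical before/after training scores, not data from this chapter), note that a paired t-test is mathematically equivalent to a one-sample t-test on the pairwise differences:

```r
# Hypothetical before/after training scores for 25 students
set.seed(1)
before <- rnorm(25, mean = 70, sd = 8)
after  <- before + rnorm(25, mean = 3, sd = 4)  # built-in average gain of about 3 points

paired_res <- t.test(after, before, paired = TRUE)

# Equivalent formulation: a one-sample t-test of the differences against mu = 0
diff_res <- t.test(after - before, mu = 0)

# Both formulations give identical t-statistics and p-values
all.equal(unname(paired_res$statistic), unname(diff_res$statistic))
```

This equivalence is why the paired test is more sensitive: it removes the between-subject variability and tests only the within-subject change.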
Example with Survivor’s summary.csv:
The R code provided performs a paired t-test to compare viewership at the premiere and finale of TV seasons using the summary dataset. A paired t-test is appropriate when comparing two sets of related observations: here, the viewership of the same TV seasons at two different time points (premiere and finale).
# Perform paired t-test to compare viewership at premiere and finale
paired_t_test_result <- t.test(summary$viewers_premier, summary$viewers_finale, paired = TRUE)
# Output the result
paired_t_test_result
##
## Paired t-test
##
## data: summary$viewers_premier and summary$viewers_finale
## t = -0.76096, df = 39, p-value = 0.4513
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -2.764596 1.253096
## sample estimates:
## mean difference
## -0.75575
Let’s break down the output:

Test Description:
 Paired t-test: Indicates that a paired t-test is conducted, which is suitable for comparing two related samples or repeated measurements on the same subjects.

Data Description:
 data: The test compares viewers_premier and viewers_finale from the summary dataset.

Test Statistics:
 t = -0.76096: The calculated t-statistic. The negative sign suggests that mean viewership at the premiere is slightly lower than at the finale in this sample, but the direction alone does not indicate statistical significance.
 df = 39: Degrees of freedom for the test, equal to the number of pairs minus one.
 p-value = 0.4513: The probability of observing a test statistic as extreme as, or more extreme than, the one observed under the null hypothesis (no difference in means). A p-value greater than 0.05 (the common threshold for significance) suggests that the difference in mean viewership is not statistically significant.

Hypothesis Testing:
 alternative hypothesis: The hypothesis being tested is that the true mean difference in viewership between the premiere and finale is not equal to 0. In other words, it assesses whether there is a significant difference in viewership between these two time points.
 95 percent confidence interval: Ranges from approximately -2.764596 to 1.253096. Since this interval includes 0, it suggests that the difference in viewership between the premiere and finale is not statistically significant.

Sample Estimates:
 mean difference: The mean difference in viewership between the premiere and finale, calculated as the mean of the per-season differences. Here, it is -0.75575. However, the confidence interval and p-value indicate that this difference is not statistically significant.
In summary, the paired t-test output indicates that there is no statistically significant difference in viewership between the premiere and finale of the TV seasons in the dataset. The p-value is above the common threshold for significance (0.05), and the confidence interval includes 0, both suggesting that any observed difference in mean viewership could be due to random chance rather than a systematic difference.
Analysis of Variance (ANOVA)
ANOVA is a more generalized form of the T-test and is used when there are more than two groups to compare (Kutner, Nachtsheim, & Neter, 2004). The underlying principle of ANOVA is to partition the variance within the data into “between-group” and “within-group” variance, to identify any significant differences in means.
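The partitioning idea can be made concrete with a small synthetic example (the groups and scores below are illustrative assumptions): the between-group and within-group sums of squares computed by hand reproduce the Sum Sq column that aov() reports.

```r
# Three groups of 10 synthetic scores with different true means
set.seed(7)
scores <- c(rnorm(10, 50, 5), rnorm(10, 55, 5), rnorm(10, 60, 5))
group  <- factor(rep(c("A", "B", "C"), each = 10))

fit <- aov(scores ~ group)

# Between-group SS: squared deviations of group means from the grand mean,
# weighted by group size (10 per group here)
grand_mean  <- mean(scores)
group_means <- tapply(scores, group, mean)
ss_between  <- sum(10 * (group_means - grand_mean)^2)

# Within-group SS: squared deviations of each score from its own group mean
ss_within <- sum((scores - ave(scores, group, FUN = mean))^2)

# These match the Sum Sq column of summary(fit)
summary(fit)
```

The F statistic is then simply the between-group mean square divided by the within-group mean square, which is exactly the ratio reported in the ANOVA tables below.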
One-way ANOVA
One-way ANOVA focuses on a single independent variable with more than two levels or groups (Tabachnick & Fidell, 2013). It allows researchers to test if there are statistically significant differences between the means of three or more independent groups. It is widely used in various fields, including psychology, business, and healthcare, for testing the impact of different conditions or treatments.
Example with Survivor castaways.csv:
The provided R code performs a one-way analysis of variance (ANOVA) to test whether there are statistically significant differences in the total votes received by castaways, grouped by their personality types, using data from the castaways dataset.
castaways <- fread("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-06-01/castaways.csv")
# Perform one-way ANOVA for total_votes_received among different personality types
anova_result <- aov(total_votes_received ~ personality_type, data = castaways)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## personality_type 15 227 15.14 1.075 0.376
## Residuals 725 10209 14.08
## 3 observations deleted due to missingness
Let’s analyze the output of the ANOVA:

ANOVA Summary:

Df (Degrees of Freedom):
 personality_type: 15 — This represents the degrees of freedom for the personality types group. It’s calculated as the number of levels in the group minus one (assuming there are 16 personality types).
 Residuals: 725 — The degrees of freedom for the residuals, which is the number of observations minus the number of groups (here, total number of castaways minus 16).

Sum Sq (Sum of Squares):
 personality_type: 227 — The total variation attributed to the differences in personality type.
 Residuals: 10209 — The total variation that is not attributed to personality types (i.e., withingroup variation).
 Mean Sq (Mean Squares):
 personality_type: 15.14 — This is the variance between the groups (Sum Sq of personality type divided by its Df).
 Residuals: 14.08 — This is the variance within the groups (Sum Sq of residuals divided by its Df).
 F value: 1.075 — The F-statistic, calculated as the Mean Sq of personality type divided by the Mean Sq of residuals. It is a measure of how much the group means differ from the overall mean, relative to the variance within the groups.
 Pr(>F): 0.376 — The p-value associated with the F-statistic. It indicates the probability of observing an F-statistic as large as, or larger than, what was observed, under the assumption that the null hypothesis (no difference in means across groups) is true.

Interpreting the Results:
 The p-value is 0.376, which is greater than the common alpha level of 0.05. This suggests that there is no statistically significant difference in the total votes received among different personality types at the chosen level of significance. In other words, any observed differences in total votes among personality types could likely be due to chance.
 The relatively high p-value indicates that the null hypothesis (that there are no differences in the mean total votes received among the different personality types) cannot be rejected.

Additional Note:
 The output mentions “3 observations deleted due to missingness.” This indicates that the analysis excluded three cases where data were missing, which is a standard procedure in ANOVA to ensure the accuracy of the test results.
In summary, the one-way ANOVA conducted suggests that personality type does not have a statistically significant impact on the total votes received by castaways in the dataset. This is inferred from the high p-value and the ANOVA’s failure to reject the null hypothesis.
Two-way ANOVA
Two-way ANOVA, however, involves two independent variables, offering a more intricate comparison and understanding of the interaction effects (Winer, Brown, & Michels, 1991). It helps to analyze how two factors impact a dependent variable, and it can also show how the two independent variables interact with each other. This form of ANOVA is highly valuable in experimental design where multiple variables may influence the outcome.
Example with the movies dataset:
The provided R code performs a two-way analysis of variance (ANOVA) on the movies dataset to test for statistically significant differences in movie budgets across genres and years, and for an interaction between these two factors.
# Perform two-way ANOVA for budget by genre and year
anova_result <- aov(budget ~ genre * year, data = movies)
summary(anova_result)
## Df Sum Sq Mean Sq F value
## genre 270 1743797675202755840 6458509908158355 5.358
## year 1 159516834289368928 159516834289368928 132.346
## genre:year 156 386490379486245504 2477502432604138 2.056
## Residuals 1164 1402970663621501696 1205301257406788
## Pr(>F)
## genre < 0.0000000000000002 ***
## year < 0.0000000000000002 ***
## genre:year 0.0000000000258 ***
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 202 observations deleted due to missingness
Let’s analyze the output:

ANOVA Summary:

Df (Degrees of Freedom): Represents the number of levels in each factor minus one.
 genre: 270 — Degrees of freedom for the genre factor.
 year: 1 — Degrees of freedom for the year factor.
 genre:year: 156 — Degrees of freedom for the interaction between genre and year.
 Residuals: 1164 — Degrees of freedom for the residuals (total number of observations minus the sum of the degrees of freedom for each factor and the interaction).

Sum Sq (Sum of Squares):
 Indicates the total variation attributed to each factor and their interaction.

Mean Sq (Mean Squares):
 The variance due to each factor and their interaction (Sum Sq divided by Df).

F value:
 The F-statistic for each factor, calculated as the Mean Sq of the factor divided by the Mean Sq of the residuals. A large value indicates that the factor explains substantial variation relative to the residual (unexplained) variation.

Pr(>F) (p-value):
 Indicates the probability of observing an F-statistic as large as, or larger than, what was observed, under the null hypothesis (no effect).
 genre, year, genre:year: All have very low p-values, indicated by “***”, suggesting that each factor and their interaction significantly affect movie budgets.

Interpreting the Results:
 Genre: The very low p-value suggests a statistically significant difference in movie budgets across different genres.
 Year: The very low p-value indicates a significant difference in movie budgets across different years.
 Genre-Year Interaction: The low p-value for the interaction term suggests that the effect of genre on movie budgets varies by year, meaning different genres might have different budget trends over time.
 Residuals: Represent unexplained variance after accounting for the main effects and interaction.

Significance Codes:
 The “***” next to the p-values denotes a very high level of statistical significance.

Additional Note:
 “202 observations deleted due to missingness” indicates that the analysis excluded cases with missing data, which is common in ANOVA to maintain accuracy.
In summary, the two-way ANOVA results suggest that both genre and year, and the interaction between them, have statistically significant effects on movie budgets in the dataset. This implies that budget variations are not only dependent on the genre or the year independently but also on how these two factors interact with each other.
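To see what an interaction means in practice, it can help to simulate one on a small synthetic dataset (the genres and eras below are illustrative assumptions, not the movies data): the genre:era row of the ANOVA table tests whether the genre effect differs across eras, and a means plot makes the pattern visible.

```r
# Synthetic design: 2 genres x 2 eras, 15 replicates per cell
set.seed(3)
d <- expand.grid(genre = c("action", "drama"),
                 era   = c("early", "late"),
                 rep   = 1:15)
# Build in an interaction: action budgets rise in the late era, drama stays flat
d$budget <- 20 + 10 * (d$genre == "action" & d$era == "late") + rnorm(nrow(d), sd = 3)

fit <- aov(budget ~ genre * era, data = d)
summary(fit)  # the genre:era row tests the interaction term

# Non-parallel lines in a means plot are the visual signature of an interaction
with(d, interaction.plot(era, genre, budget))
```

When the interaction term is significant, the main effects should be interpreted with caution, since the effect of one factor depends on the level of the other.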
Regression Analysis
Simple Linear Regression
Simple linear regression aims to model the relationship between a single independent variable and a dependent variable by fitting a linear equation to observed data (Montgomery, Peck, & Vining, 2012). The primary objective is to find the best-fitting straight line that accurately predicts the output values within a range. Simple linear regression works best when the variables have a linear relationship and the data are homoscedastic, meaning the variance of errors is constant across levels of the independent variable.
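The “best-fitting line” has a closed form: the slope equals the covariance of x and y divided by the variance of x, and the intercept follows from the two means. A short sketch on synthetic data (the variables here are illustrative assumptions) confirms that lm() returns exactly these values.

```r
# Synthetic predictor and response with a mild linear trend
set.seed(9)
x <- 1:40                                # e.g., an episode index
y <- 12 + 0.3 * x + rnorm(40, sd = 2)    # response with noise

# Least-squares estimates computed by hand
slope     <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# coef(fit) returns c(intercept, slope); the hand computation matches
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(intercept, slope))
```

This is why the regression line always passes through the point of means (mean of x, mean of y).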
Example with Survivor viewers.csv:
The provided R code performs a linear regression analysis using the lm() function to model the relationship between the number of viewers (dependent variable) and episode numbers (independent variable) in a TV series dataset. The summary() function is then used to provide a detailed summary of the linear model’s results.
viewers <- fread("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-06-01/viewers.csv")
# Model viewers based on episode numbers
lm_result <- lm(viewers ~ episode, data = viewers)
summary(lm_result)
##
## Call:
## lm(formula = viewers ~ episode, data = viewers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.189 -4.344 -2.014 4.224 37.931
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.24438 0.54547 24.281 <0.0000000000000002 ***
## episode 0.03960 0.06065 0.653 0.514
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.283 on 572 degrees of freedom
## (22 observations deleted due to missingness)
## Multiple R-squared: 0.0007448, Adjusted R-squared: -0.001002
## F-statistic: 0.4263 on 1 and 572 DF, p-value: 0.5141
Let’s break down the output:

Model Call:

lm(formula = viewers ~ episode, data = viewers): This indicates the linear model was fitted to predict viewers based on episode numbers.

Residuals:
 The residuals represent the differences between the observed values and the values predicted by the model.
 Min, 1Q (First Quartile), Median, 3Q (Third Quartile), Max: These statistics provide a summary of the distribution of residuals. The relatively large range suggests that there may be considerable variance in how well the model predictions match the actual data.

Coefficients:
 (Intercept): The estimated average number of viewers when the episode number is zero. The intercept is significant (p < 0.0000000000000002).
 episode: The estimated change in the number of viewers for each additional episode. The coefficient is 0.03960, but it is not statistically significant (p = 0.514), suggesting that the number of episodes does not have a significant linear relationship with the number of viewers.
 Std. Error: Measures the variability or uncertainty in the coefficient estimates.
 t value: The test statistic for the hypothesis that each coefficient is different from zero.
 Pr(>|t|): The p-value for the test statistic. A low p-value (< 0.05) would indicate that the coefficient is significantly different from zero.

Residual Standard Error:
 6.283 on 572 degrees of freedom: This is a measure of the typical size of the residuals. The degrees of freedom are the number of observations minus the number of parameters being estimated.

R-squared Values:
 Multiple R-squared: 0.0007448: This indicates how much of the variability in the dependent variable (viewers) can be explained by the independent variable (episode). A value close to 0 suggests that the model explains very little of the variability.
 Adjusted R-squared: -0.001002: Adjusts the R-squared value for the number of predictors in the model. It can be slightly negative when the model has essentially no explanatory power, as here.

F-statistic:
 0.4263 on 1 and 572 DF, p-value: 0.5141: This tests whether the model as a whole is statistically significant. The high p-value suggests that it is not, indicating that episode number does not significantly predict the number of viewers.

Significance Codes:
 The “***” next to the intercept’s p-value indicates a high level of statistical significance.

Observations with Missing Data:
 (22 observations deleted due to missingness): Indicates that 22 observations were excluded from the analysis due to missing data.
In summary, the linear regression model suggests that episode number is not a significant predictor of the number of viewers, based on the dataset used. The model’s low R-squared value and the non-significant p-value for the episode coefficient support this conclusion.
Multiple Linear Regression
Multiple linear regression extends the concept of simple linear regression to include two or more independent variables (Hair et al., 2014). This approach allows for a more nuanced understanding of the relationships among variables. It provides the tools needed to predict a dependent variable based on the values of multiple independent variables. Multiple linear regression assumes that the relationship between the dependent variable and the independent variables is linear, and it also assumes that the residuals are normally distributed and have constant variance.
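These residual assumptions can be checked after fitting. The sketch below (using synthetic predictors chosen purely for illustration) fits a two-predictor model and then inspects the residuals for normality and constant variance.

```r
# Synthetic data for a two-predictor regression
set.seed(5)
n  <- 60
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)

# Normality of residuals: Shapiro-Wilk (p > 0.05 is consistent with normality)
shapiro.test(residuals(fit))

# Constant variance: residuals vs. fitted values; a funnel shape
# would suggest heteroscedasticity
plot(fitted(fit), residuals(fit))
```

Calling plot(fit) on a fitted lm object produces a fuller set of diagnostic plots, including a Q-Q plot of the residuals.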
Example with Survivor summary.csv:
The R code provided performs a multiple linear regression analysis, modeling the average viewership (viewers_mean) as a function of country, timeslot, and season. The summary() function provides a detailed summary of the model’s results.
# Model average viewers based on multiple factors
lm_result <- lm(viewers_mean ~ country + timeslot + season, data = summary)
summary(lm_result)
##
## Call:
## lm(formula = viewers_mean ~ country + timeslot + season, data = summary)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1969 -0.3632 0.0000 0.2378 2.1969
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.76874 1.15957 26.535 0.000000000000000697 ***
## countryBrazil -8.54130 1.62530 -5.255 0.000053615198056057 ***
## countryCambodia -10.11504 1.98496 -5.096 0.000075482615565828 ***
## countryChina -8.32319 1.93885 -4.293 0.000438 ***
## countryFiji -9.15756 1.89552 -4.831 0.000134 ***
## countryGabon -8.72445 2.03256 -4.292 0.000438 ***
## countryGuatemala -7.14067 1.78167 -4.008 0.000825 ***
## countryIslands -8.72193 1.85477 -4.702 0.000177 ***
## countryKenya -8.62563 1.62564 -5.306 0.000048107174859556 ***
## countryMalaysia -7.57832 2.74555 -2.760 0.012888 *
## countryNicaragua -12.11584 1.75705 -6.896 0.000001898678859247 ***
## countryPalau -7.21193 1.66768 -4.325 0.000408 ***
## countryPanama -6.73275 1.44326 -4.665 0.000193 ***
## countryPhilippines -12.20939 1.80140 -6.778 0.000002385364697607 ***
## countryPolynesia -8.06126 1.63176 -4.940 0.000106 ***
## countrySamoa -8.85353 2.00306 -4.420 0.000331 ***
## countryThailand -7.13689 1.64191 -4.347 0.000389 ***
## countryVanuatu -6.76941 1.72095 -3.934 0.000974 ***
## timeslotWednesday 8:00 pm -5.59395 2.14699 -2.605 0.017892 *
## season -0.48437 0.08152 -5.942 0.000012701313768962 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.148 on 18 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.9758, Adjusted R-squared: 0.9502
## F-statistic: 38.18 on 19 and 18 DF, p-value: 0.00000000008219
Let’s break down the output:

Model Call:

lm(formula = viewers_mean ~ country + timeslot + season, data = summary): Shows the regression formula used, predicting viewers_mean based on country, timeslot, and season.

Residuals:
 The residuals represent the differences between the observed and predicted values. The summary (Min, 1st Quartile, Median, 3rd Quartile, Max) shows the distribution of these residuals.

Coefficients:

Estimate: The regression coefficients for the intercept and each predictor. These values represent the expected change in viewers_mean for a one-unit change in the predictor, holding all other predictors constant.
 Std. Error: The standard error of each coefficient, indicating the precision of the coefficient estimates.
 t value: The test statistic for the hypothesis that each coefficient is different from zero.
 Pr(>|t|): The p-value for the test statistic. A low p-value (< 0.05) indicates that the coefficient is significantly different from zero.
 The coefficients for the different countries and the timeslotWednesday 8:00 pm level are statistically significant, as indicated by their p-values and significance codes. The season variable is also significant, suggesting its impact on viewership.

Residual Standard Error:
 1.148 on 18 degrees of freedom: This is a measure of the typical size of the residuals. Degrees of freedom are calculated as the total number of observations minus the number of estimated parameters.

R-squared Values:
 Multiple R-squared: 0.9758: Indicates the proportion of variance in the dependent variable (viewers_mean) that is predictable from the independent variables. A value of 0.9758 suggests a high level of predictability.
 Adjusted R-squared: 0.9502: Adjusts the R-squared value for the number of predictors in the model. This is closer to the true predictive power of the model.

F-statistic:
 38.18 on 19 and 18 DF, p-value: 0.00000000008219: This tests the overall significance of the model. The very low p-value suggests the model as a whole is statistically significant.

Significance Codes:
 Indicate the level of significance for the coefficients. “***” denotes a very high level of statistical significance.

Observations with Missing Data:
 (2 observations deleted due to missingness): Indicates that 2 observations were excluded from the analysis due to missing data.
In summary, the multiple linear regression model suggests that both the country and the season significantly predict the average viewership of the TV series, with the timeslot (specifically Wednesday 8:00 pm) also playing a significant role. The model explains a very high proportion of the variance in average viewership (as indicated by the R-squared values), and the overall model is statistically significant.
References
 Agresti, A. (2002). Categorical Data Analysis. John Wiley & Sons.
 Bland, J. M., & Altman, D. G. (1996). Statistics notes: Transforming data. BMJ, 312(7033), 770.
 Boslaugh, S. (2012). Statistics in a Nutshell: A Desktop Quick Reference. O’Reilly Media.
 Cox, D. R., & Snell, E. J. (1981). Applied statistics: principles and examples. Chapman and Hall.
 De Veaux, R. D., Velleman, P. F., & Bock, D. E. (2018). Stats: Data and models (4th ed.). Pearson.
 Field, A. P., Miles, J., & Field, Z. (2012). Discovering statistics using R. Sage Publications.
 Frankfort-Nachmias, C., Leon-Guerrero, A., & Davis, G. (2020). Social statistics for a diverse society. Sage Publications.
 Gravetter, F. J., & Wallnau, L. B. (2016). Essentials of statistics for the behavioral sciences (8th ed.). Wadsworth Cengage Learning.
 Hair, J. F., Black, W. C., Babin, B. J., & Anderson, R. E. (2014). Multivariate Data Analysis. Pearson Education Limited.
 Hoaglin, D. C., Mosteller, F., & Tukey, J. W. (2000). Understanding Robust and Exploratory Data Analysis. John Wiley & Sons.
 Johnson, R. A., & Wichern, D. W. (2007). Applied Multivariate Statistical Analysis. Pearson Prentice Hall.
 Kenney, J. F., & Keeping, E. S. (1962). Mathematics of Statistics, Pt. 2, 2nd ed. Van Nostrand.
 Kutner, M. H., Nachtsheim, C. J., & Neter, J. (2004). Applied linear regression models. McGraw-Hill Irwin.
 Levine, D. M., Stephan, D. F., Krehbiel, T. C., & Berenson, M. L. (2008). Statistics for Managers Using Microsoft Excel. Pearson Prentice Hall.
 Lind, D. A., Marchal, W. G., & Wathen, S. A. (2012). Statistical techniques in business and economics (15th ed.). McGraw-Hill Irwin.
 McClave, J. T., Benson, P. G., & Sincich, T. (2011). Statistics for business and economics (11th ed.). Prentice Hall.
 Montgomery, D. C., Peck, E. A., & Vining, G. G. (2012). Introduction to linear regression analysis. John Wiley & Sons.
 Moore, D. S., McCabe, G. P., & Craig, B. A. (2009). Introduction to the Practice of Statistics. W.H. Freeman and Co.
 Rosner, B. (2015). Fundamentals of biostatistics. Cengage Learning.
 Siegel, S., & Castellan, N. J. (1988). Nonparametric statistics for the behavioral sciences. McGraw-Hill.
 Tabachnick, B. G., & Fidell, L. S. (2013). Using multivariate statistics. Pearson.
 Triola, M. F. (2018). Elementary Statistics. Pearson Education.
 Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.
 Vasishth, S., & Broe, M. (2011). The foundations of statistics: A simulationbased approach. Springer Science & Business Media.
 Winer, B. J., Brown, D. R., & Michels, K. M. (1991). Statistical principles in experimental design. McGraw-Hill.