Introduction

In this lab you will learn how to do a two-sample t test and visualize the differences between samples with violin plots in R.

Two-sample t test

The two-sample t test is used to compare the means of two independent samples. Unlike the paired t test, the data for a two-sample t test cannot be matched in any way: the groups must be independent. Samples are independent if each one is drawn without reference to the other and has no connection with the other. Independent samples are obtained either by sampling from two distinct populations (for example, one sample from Athens, GA and the other from Athens, Greece) or by randomizing a single sample into different groups. In either case the sample data are independent of each other.

Why do we use two-sample t tests? The goal is to estimate the difference in the means (\(\mu_{1}-\mu_{2}\)) and to see if the difference is statistically significant or due to chance.
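For reference, the comparison can be written as a test statistic. The following is a sketch of the unequal-variance (Welch) form, which does not assume the two populations share a variance. The symbols \(\bar{x}_{i}\), \(s_{i}\), and \(n_{i}\) (the sample mean, sample standard deviation, and sample size of group \(i\)) are notation introduced here for illustration, and the formula is stated for the null hypothesis \(\mu_{1}=\mu_{2}\).

\[t=\frac{\bar{x}_{1}-\bar{x}_{2}}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}}\]

A value of \(t\) far from zero means the observed difference in sample means is large relative to its estimated standard error.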

Study Description

Farmers who raise beef cattle get paid by the pound. In the past, antibiotics and hormones were used to increase weight gain, but a growing number of consumers now want antibiotic- and hormone-free beef products. As a result, farms now need to figure out which feed will lead to the best weight gain in calves without the use of antibiotics and hormones.

To answer this question, an experiment was set up: 200 calves were randomized to receive 1 of 4 feeds for 3 weeks. The feeds are distinguished by the amount of protein (low and high) and the source of protein (rice and corn). At the end of the 3-week period, each calf's weight was compared to its weight at the start of the study, and its total weight gain in pounds was recorded.

Exercise 1: Data and Assumption Check

There are three assumptions we need to check before using the two-sample t test.

  1. The samples are independent
  2. The data is normally distributed
  3. The population variances are equal
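The second and third assumptions can also be checked numerically. The sketch below is only an illustration: it uses simulated weight gains in a made-up data frame called demo (the group labels, means, standard deviations, and sample sizes are all assumptions, not the Calf data), together with base R's shapiro.test() for normality and var.test() for equality of variances. Independence cannot be checked this way; it comes from how the study was designed.

```r
# Simulated stand-in data; this is NOT the Calf data set from the lab
set.seed(12)
demo <- data.frame(
  Feed = factor(rep(c("A", "B"), each = 50)),
  Weight_Gain = c(rnorm(50, mean = 40, sd = 5), rnorm(50, mean = 43, sd = 5))
)

# Normality: Shapiro-Wilk test within each group
# (a small p-value suggests that group is not normally distributed)
normality <- tapply(demo$Weight_Gain, demo$Feed,
                    function(x) shapiro.test(x)$p.value)
print(normality)

# Equal variances: F test comparing the two group variances
# (a small p-value suggests the population variances differ)
variance_check <- var.test(Weight_Gain ~ Feed, data = demo)
print(variance_check$p.value)
```

These tests are best read alongside a plot of each group, since with large samples even small departures from normality can produce small p-values.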

Data Dictionary

Feed: This variable indicates the type of feed the calf received (“Corn_LowP”, “Corn_HighP”, “Rice_LowP”, “Rice_HighP”).

Weight_Gain: The total weight the calf gained in pounds during the study period.

Instructions: Review the data and variable distributions. Answer the quiz questions.

Plot of Weight Gain for Selected Feed

Data

Summary of Weight Gain for Selected Feed


Quiz: Questions 1-2

Exercise 2: Corn Feed (High Protein vs Low Protein)

Now that you have a feel for the data and have checked the assumptions for the two-sample t test, we can start testing different hypotheses. We will use the same t.test() function we used in labs 10 and 11. To conduct a two-sample t test using this function, follow this general form: t.test(outcome ~ group, data = dataset, alternative = type, var.equal = FALSE). First, specify the “outcome” and “group” variables separated by a “~”. The “outcome ~ group” expression tells R you want to know the difference in the outcome between the two groups. Next, tell the function the name of the data set that contains the variables. The “type” is the same as before: “two.sided”, “less”, or “greater”. Finally, setting var.equal = FALSE lets R know you do not want to assume the variances are equal between the samples, so R runs Welch's version of the test (FALSE has to be in all capital letters or it will not work). If you did want to assume equal variances, you would replace FALSE with TRUE.

It is important to note that the “group” variable must have exactly 2 levels. Right now our grouping variable (Feed) has 4. So before we can do the test, we have to select a subset of our data containing only the observations for the two groups we want to compare.
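The general form and the subsetting step can be tried on a data set that ships with R. The sketch below uses the built-in iris data purely as a stand-in (it is not the lab's Calf data): Species plays the role of a grouping variable with more than two levels, and Sepal.Length plays the role of the outcome.

```r
# iris is built into R; Species has 3 levels, so keep only two of them.
# The "|" symbol means OR, so a row is kept if either condition is true.
two_species <- iris[iris$Species == "setosa" | iris$Species == "versicolor", ]

# General form: outcome ~ group, then the data, the alternative, and var.equal
result <- t.test(Sepal.Length ~ Species, data = two_species,
                 alternative = "two.sided", var.equal = FALSE)
print(result)
```

The subset still remembers the third species as an unused level, but t.test() only counts the groups that actually appear in the data, so the test runs on the two remaining groups.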

Perform a two sample t-test of the null hypothesis that there is not a difference in the mean weight gain between high and low protein corn feed against the two-sided alternative at the 0.05 significance level.

\[\alpha=0.05\]

\[H_{0}: \mu_{Corn\_HighP}=\mu_{Corn\_LowP}\]

\[H_{A}: \mu_{Corn\_HighP}\neq\mu_{Corn\_LowP} \]

Instructions: Complete the code below to test the hypothesis and click the run code button. Use the output to answer the quiz questions. If you are having a hard time, check out the examples from STHDA on two-sample t tests in R.

# We use the same t.test() function from labs 10 and 11
# Complete the code to answer the question below 

# Notice at the "data=" , part of the function
# this code "Calf[Calf$Feed=="Corn_HighP" | Calf$Feed=="Corn_LowP", ]" tells R which observations we want
# The code tells R: In the Calf data set select rows where the Feed variable is either "Corn_HighP" or "Corn_LowP".
# This symbol "|" means OR

t.test( , data=Calf[Calf$Feed=="Corn_HighP" | Calf$Feed=="Corn_LowP", ], alternative = "two.sided", var.equal=FALSE)
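When reading the output, it can help to know that t.test() returns an object whose pieces can be pulled out by name. The sketch below again uses the built-in iris data as a stand-in for the Calf data, and the names fit, alpha, and decision are chosen here for illustration only.

```r
# Stand-in example on the built-in iris data (not the Calf data)
two_species <- iris[iris$Species %in% c("setosa", "versicolor"), ]
fit <- t.test(Sepal.Length ~ Species, data = two_species,
              alternative = "two.sided", var.equal = FALSE)

alpha <- 0.05            # significance level

print(fit$statistic)     # t statistic
print(fit$parameter)     # degrees of freedom
print(fit$p.value)       # p-value
print(fit$conf.int)      # 95% confidence interval for the difference in means
print(fit$estimate)      # the two sample means

# Decision rule: reject H0 when the p-value is below the significance level
decision <- if (fit$p.value < alpha) "reject H0" else "fail to reject H0"
print(decision)
```

The confidence interval is for the first group's mean minus the second group's mean, so an interval that excludes 0 agrees with rejecting the null hypothesis of equal means.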

Quiz: Questions 3-4

Exercise 3: Rice Feed (High Protein vs Low Protein)

Perform a two sample t-test of the null hypothesis that there is not a difference in the mean weight gain between high and low protein rice feed against the two-sided alternative at the 0.05 significance level.

\[\alpha=0.05\]

\[H_{0}: \mu_{Rice\_HighP}=\mu_{Rice\_LowP}\]

\[H_{A}: \mu_{Rice\_HighP}\neq\mu_{Rice\_LowP} \]

Instructions: Complete the code below to test the hypothesis and click the run code button. Use the output to answer the quiz questions.

# We use the same t.test() function from labs 10 and 11
# Complete the code to answer the questions below 


t.test(  , data=Calf[Calf$Feed=="Rice_HighP" | Calf$Feed=="Rice_LowP", ], alternative = "two.sided", var.equal=FALSE)

Quiz: Questions 5-6

Exercise 4: High Protein Feed (Corn vs Rice)

Perform a two sample t-test of the null hypothesis that there is not a difference in the mean weight gain between high protein rice and corn feed against the two-sided alternative at the 0.05 significance level.

\[\alpha=0.05\]

\[H_{0}: \mu_{Rice\_HighP}=\mu_{Corn\_HighP}\]

\[H_{A}: \mu_{Rice\_HighP}\neq\mu_{Corn\_HighP} \]

Instructions: Complete the code below to test the hypothesis and click the run code button. Use the output to answer the quiz questions.

# We use the same t.test() function from labs 10 and 11
# Complete the code to answer the questions below 


t.test(  , data=Calf[Calf$Feed=="Rice_HighP" | Calf$Feed=="Corn_HighP", ], alternative = "two.sided", var.equal=FALSE)

Quiz: Questions 7-8

Visualizing Differences with Violin Plots

In the past we have used either a histogram or a box plot to visualize the distribution of continuous variables. Box plots are great: they show a lot of information about the data and allow for quick comparisons between groups of different sizes.

Unfortunately, in some cases box plots can be very misleading because they only show the summary statistics. Data with the same summary statistics can have very different distributions, which is demonstrated in the figure below.

Box Plots Vs data
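The same idea can be reproduced with a few lines of R. The sketch below is an illustration on simulated data (the sample sizes, the target mean of 50, and the target standard deviation of 10 are arbitrary choices, and rescale() is a helper written just for this example): it builds one bell-shaped sample and one two-humped sample, forces both to share the same mean and standard deviation, and then shows that the shapes still differ.

```r
set.seed(1)

# Two simulated samples with very different shapes
unimodal <- rnorm(200)
bimodal  <- c(rnorm(100, mean = -2, sd = 0.5), rnorm(100, mean = 2, sd = 0.5))

# Helper (written for this example): shift and scale a sample so that it
# has exactly mean m and standard deviation s
rescale <- function(x, m = 50, s = 10) (x - mean(x)) / sd(x) * s + m
unimodal <- rescale(unimodal)
bimodal  <- rescale(bimodal)

# The summary statistics agree
print(c(mean = mean(unimodal), sd = sd(unimodal)))
print(c(mean = mean(bimodal),  sd = sd(bimodal)))

# But the shapes do not: count the observations within 5 units of the mean
near_mean_unimodal <- sum(abs(unimodal - 50) < 5)
near_mean_bimodal  <- sum(abs(bimodal - 50) < 5)
print(c(unimodal = near_mean_unimodal, bimodal = near_mean_bimodal))
```

The bell-shaped sample has many observations near the shared mean, while the two-humped sample has very few, even though the mean and standard deviation are identical.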

A violin plot is like a box plot in that the same summary statistics can be displayed, but it also shows the distribution of the data that generated those summary statistics. In the figure below you can see that the box plots do not change but the violin plots do.

Box Plots Vs Violin Plot

Both of these GIFs are from a very interesting paper by Justin Matejka and George Fitzmaurice.

Exercise 4: Numeric Summaries and Data

Instructions: Select some data to plot. Check out all the different data shapes and the corresponding numeric summaries. Notice that the mean and standard deviation are the same to the second decimal place even though the data sets have very different shapes. This exercise is for understanding only.

Scatter Plot of Selected Data

Summary of X and Y Values


Exercise 5: Using Violin Plots

Now that you know more about the value of violin plots let’s use them to compare the different feeds.

Instructions: The code below is complete. All you have to do is click run code and use the plot to answer the quiz questions.

library(ggpubr)  # provides ggviolin()
ggviolin(Calf, x="Feed", y="Weight_Gain", fill = "Feed", palette = "npg", add=c("boxplot"), add.params = list(fill = "white"))

Quiz: Questions 9-10

Summary

In this lab, you completed 5 exercises and answered 10 quiz questions.

The lab covered 2 topics:

  1. Two-sample t test in R
  2. Violin Plots in R

Great work, you are done with the lab! Don’t forget to record your answers and take the eLC quiz to get credit.

Never forget the rejection region again. From: [dyfx](https://dyfx.wordpress.com/2011/11/18/confidence-intervals-and-hypothesis-testing/)

Two Sample t Test

Computer Lab 12