Statistics for Bioinformatics
Practicals - Sampling Distributions
Introduction
Prerequisite
Tutorial
## Load the configuration script
source('http://pedagogix-tagc.univ-mrs.fr/courses/statistics_bioinformatics/R-files/config.R')
## Generate a set of random numbers
p <- 100 ## Number of columns in the table
r <- 10000 ## Number of repetitions
x <- matrix(rnorm(r*p), nrow=r, ncol=p)
dim(x) ## Dimensions of the data table
x[1:10, 1:15] ## Display a subset of the matrix (10 rows, 15 columns in the top left of the matrix)
## Check the general distribution of the 10000 samples (altogether)
## Draw an histogram
hist(x, breaks=100)
## Draw an histogram and store the distribution in a variable called x.distrib
x.distrib <- hist(x, breaks=100)
names(x.distrib)
x.distrib$breaks ## class boundaries
## Plot a frequency polygon
plot(x.distrib$mids, x.distrib$density, type="l",col="blue",panel.first=grid(col="black"),ylim=c(0,2), lwd=2,xlab="x", ylab="density")
## Plot the theoretical distribution as a dotted line
lines(x.distrib$mids, dnorm(x.distrib$mids),type="l", col="red",lty="dashed",lwd=2)
## Calculate the means for r=10000 samples of size n=3
my.colors <- c("darkgreen", "darkblue", "darkviolet", "violet", "purple", "brown")
i <- 0
for (n in c(3,10,100)) {
i <- i+1
## Select the n first columns of the matrix
sample <- x[,1:n] ## Select n columns of the matrix
print(sample[1:10, ]) ## display the first 10 rows of the sample
## Calculate the mean for each row of the sub-matrix (for each one of the r=10000 samples)
## You can call help(apply) to understand how the function apply() works
sample.means <- apply(sample, 1, mean) ## The second argument (1) indicates that we apply the operation (mean) on each row
## print(sample.means)
## Calculate the distribution of sample means
sample.mean.distrib <- hist(sample.means, breaks=100, plot=F)
## print(sample.mean.distrib)
## Plot the sample mean distribution over the sample distribution
lines(sample.mean.distrib$mids, sample.mean.distrib$density, type="l",col=my.colors[i],lwd=2)
## Plot the theoretical distribution as a dotted line
lines(x.distrib$mids, dnorm(x.distrib$mids, sd=1/sqrt(n)),type="l", col="red",lty="dashed",lwd=2)
}
Exercises
- Adapt the script above to show the sampling distribution of the
mean of n numbers drawn from a uniform distribution. Start
with small values of n and progressively increase the
sample size by a factor of 2 (1, 2, 4, 8, ..., 1024) .