Simulation Experiments




Sample Means

How do the shape and spread of the distribution of a sample mean depend on the sample size? This classical question was eventually answered by Laplace and others by way of the Central Limit Theorem, which we will study in more detail later. In essence, the CLT states that sample means computed from independent samples from a population have, at least approximately, a normal distribution, even though the distribution of the sampled population may not look at all normal. To study the distribution of sample means, we can look at sample means computed from repeated samples of some population. To do this, we need to create many samples of the size we are interested in studying. Let's try it with samples from a very skewed distribution:

        x <- rgamma(5,1) 
        mean(x)
This gamma distribution (shape parameter 1, i.e. the exponential distribution) is very skewed; to see its shape, try
       hist(rgamma(500,1))
To study the behavior of sample means from samples of size 5, we could repeat the first command over and over, each time recording the sample mean. This would take a long time, so we would like to automate the process. The simplest way to do this efficiently is to generate a matrix of random numbers and think of the rows as samples. To generate 500 samples of 5 numbers each and store them in a matrix, use the command:
       X <- matrix(rgamma(2500,1),nrow=500)
The 2500 in the rgamma function is just the total number of values we want to generate (500*5). "nrow=500" tells the matrix function how many rows we want.
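A quick sanity check (our own addition, not part of the original text): dim() reports the shape of the matrix, confirming the samples-as-rows layout.

```r
# 2500 gamma(1) values arranged as 500 samples of size 5
X <- matrix(rgamma(2500, 1), nrow = 500)
dim(X)   # 500 rows (one per sample) by 5 columns (observations per sample)
```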

Now, we can compute the mean of all rows at once with the apply command:

       m5 <- apply(X,1,mean)
"m5" will now hold 500 averages of samples of size 5. The second argument of apply can be either 1 or 2: 1 tells apply to use the rows, 2 tells apply to use the columns. We set the matrix up with rows corresponding to samples, so we want to compute all the row means. Compare the distribution of the sample means to the distribution of the original data. To make the original distribution clearer, let's plot a larger sample:
      par(mfrow=c(2,2))
      qqnorm(rgamma(100,1))
      qqnorm(m5)
How do the distributions differ? Try it again with samples of size 10:
       X <- matrix(rgamma(5000,1),nrow=500)
       m10 <- apply(X,1,mean)
See if you can set up the simulation for samples of size 20 and 50. Make histograms of the resulting datasets, and look at the distribution. How does it change as the sample size increases?
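One way to organize the setup (the helper function name sim.means is our own, not part of the original): wrap the matrix-plus-apply pattern in a function, so only the sample size changes between runs.

```r
# nrep sample means, each computed from a gamma(shape = 1) sample of size n
sim.means <- function(n, nrep = 500) {
  X <- matrix(rgamma(n * nrep, 1), nrow = nrep)  # one sample per row
  apply(X, 1, mean)                              # row means
}

m20 <- sim.means(20)
m50 <- sim.means(50)
par(mfrow = c(2, 1))
hist(m20)
hist(m50)
```

As the sample size grows, the histograms should tighten around 1 (the mean of the gamma(1) distribution) and look increasingly normal.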

Mean Squared Error

For simple examples where we can compute the expected value and variance easily, we can compare estimators analytically. For example, to compare the standard estimator of a proportion p (X/n) from binomial data to the biased "shrinkage estimator" (X+1)/(n+2), we can explicitly compute their mean and variance for a range of values of p. We need a specific sample size, so let us use n=10. We will compute the mean squared error over the range of possible values of the parameter p.
           n = 10
             # create a dataset containing the values of p
           p = seq(0,1,.01) 
             # compute the variance of the standard estimator X/n
             # for each value of p (X/n is unbiased, so MSE=Var)
           V1 = p*(1-p)/n
           
             # compute var((X+1)/(n+2)) for each p
           V2 = (n/(n+2)^2)*p*(1-p)
             # compute E((X+1)/(n+2)) for each p
           E2 = (n*p+1)/(n+2)
             # Since this estimator is biased, the MSE is
             # the sum of the variance and the squared bias.
           MSE2 = V2 + (E2-p)^2

           plot(p,V1,type="l",col="green")
           lines(p,MSE2,col="red")
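The crossover between the two curves can also be located directly from the formulas above (this check is our own addition; the bias of (X+1)/(n+2) is E2 - p = (1-2p)/(n+2)). The shrinkage estimator has smaller MSE on a band of p values around 1/2, where its bias is small.

```r
n <- 10
p <- seq(0, 1, .01)
MSE1 <- p*(1-p)/n                                   # MSE of X/n (unbiased, so MSE = Var)
MSE2 <- (n/(n+2)^2)*p*(1-p) + ((1-2*p)/(n+2))^2     # Var + squared bias
range(p[MSE2 < MSE1])    # shrinkage wins on roughly .14 <= p <= .86
```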

When we can't do the computations analytically, we can still evaluate the MSEs of estimators by simulation experiments. For example, to compare the MSEs of two estimators of the variance, we first generate a matrix of data corresponding to samples of the desired size from some population. Suppose we want samples of size 3 from a normal population with mean zero and standard deviation equal to 1.

          X = matrix(rnorm(3*1000,0,1),ncol=3)
Each row is a sample of size 3. Now, compute the usual estimator of variance for each row:
          V1 = apply(X,1,var)
The usual estimator divides the sum of squared deviations from the mean by n-1, or 2 in this case. To compute the estimator which divides by n instead, simply multiply those values by (2/3):
          V2 = 2*V1/3
Since the data are generated from a distribution with variance one, we know the true value of the variance, namely 1. Thus we can compute the MSE easily:
          n = length(V1)
          M1 = mean(V1)
          MSE1 = sum( (V1-1)^2)/n
          M2 = mean(V2)
          MSE2 = sum( (V2-1)^2)/n
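For normal data, these simulated MSEs can be checked against exact values (our own addition, using the fact that Var(S^2) = 2*sigma^4/(n-1) for a normal sample):

```r
n <- 3                          # the sample size used above
mse.unbiased <- 2/(n-1)         # MSE of the divide-by-(n-1) estimator = its variance
mse.divn <- (2*n - 1)/n^2       # Var + squared bias for the divide-by-n estimator
c(mse.unbiased, mse.divn)       # 1 vs 5/9: the biased estimator has smaller MSE here
```

The simulated values MSE1 and MSE2 should be close to these, up to simulation error.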
You might make histograms of the distributions of the two estimators (V1 and V2) and see if you can explain the results. Try
          par(mfrow=c(2,1))
          hist(V1,breaks=seq(0,10,.5))
          hist(V2,breaks=seq(0,10,.5))

Exercise 1. Repeat the computations for the binomial(10,p) case, adding a third estimator, Phat2 = (X+2)/(n+4). Add its MSE curve to the plot of the MSEs of Phat = X/n and Phat1 = (X+1)/(n+2).
Exercise 2. Repeat the computations for the binomial(n,p) case, again adding the third estimator Phat2 = (X+2)/(n+4) and plotting its MSE curve along with those of Phat = X/n and Phat1 = (X+1)/(n+2). This time use n=100.
Exercise 3. Use the simulation method to compare the bias, variance, and mean squared error of three estimators of the standard deviation for samples of size 3 from a normal distribution. The three estimators are:
     s:  the R sd() function, which divides by n-1;
     s1: the square root (sqrt()) of the estimator of the variance that divides by n;
     s2: the median absolute deviation from the mean, divided by .6745.





Albyn Jones
Feb 2005