Tuesday, October 22, 2013

The line between Bayesian and classical statistics is very thin


A while ago, we wrote a paper on correcting re-transformation bias in regression analysis using Bayesian regression [Stow et al., 2006], where we used a Bayesian regression model implemented in WinBUGS and obtained the predictive posterior distribution of the response at given predictor values. The re-transformation bias was avoided by performing the re-transformation on MCMC samples from the posterior predictive distribution before calculating the predictive means in the original scale. This approach obviously works. But the added burden of using WinBUGS is a real obstacle for many. In Gelman and Hill [2007], the same Bayesian posterior sampling was done by using the sampling distributions of regression model parameters. For a simple linear regression problem of y = β0 + β1x + ε, where ε ~ N(02), the classical statistics provides sampling distributions of σ2 and regression coefficients β0 and β1. The sampling distribution of σ2 is a scaled inverse chi square distribution:

where n is the sample size, p is the number of coefficients, andis the estimated residual standard deviation. Conditional on σ2, the joint distribution of (β 01) is a multivariate normal:

where V β = (XT X)-1 is the unscaled covariance matrix and (01) are the estimated model coefficients. These results can be used to derive a Monte Carlo simulation algorithm for generating samples of model coefficients form their joint posterior distribution.
  1. Extract necessary information from the fitted linear model object lm.obj:

summ <- summary (lm.obj)
coef <- summ$coef[,1:2]
sigma.hat <- summ$sigma
beta.hat <- coef[,1]
V.beta <- summ$cov.unscaled
n <- summ$df[1]+summ$df[2]
p <- summ$df[1]
  1. Draw n.sim posterior samples:       
    out<-matrix(0, nrow=n.sim, ncol=p+1)
    for (i in 1:n.sim){
      chi2 <- rchisq(1, n-p)
      sigma <- sigma.hat * sqrt((n-p)/chi2)
      beta <- mvrnorm(1, beta.hat, V.beta*sigma^2)
      out[i,] <- c(beta, sigma)
    }
  1. Put the above two components into a function (e.g., lm.sims(lm.obj, n.sim) we can draw random samples from the joint posterior distribution of model coefficients, which can be used for drawing posterior predictive samples:
       coef.post <- lm.sims(lm.obj=my.lm, n.sim=1000)
       y.tilde <- rnorm(n.sim, coef.post[,1]+coef.post[,2]*x.tilde, 
                                  coef.post[,3])

where x.tilde is the predictor value at which the predictive distribution y.tilde is evaluated.
In other words, we simplified the Bayesian operation into two lines of R code in step 3. In Section 9.2.2.1 of Qian [2010], I compared the simple Monte Carlo simulation approach to the Bayesian regression (using MCMC).  

Drawing random samples of σ in step 2 is definitely a Bayesian step, as the classical sampling distribution is defined for .  In classical statistics, is a random variable, and in Bayesian statistics, it is fixed.  In any case, summarizes information about σ in the data.  The connection between a Bayesian approach and the corresponding classical approach is the shared likelihood function. If we use “non-informative” flat priors, the Bayesian posterior distribution is proportional to the likelihood function. Consequently, if we can find quick solutions from the classical repertoire of models, we should use them.  The line that separates a Bayesian from a classical approach is the definition of probability. The interpretation of probability can be subjective (e.g., the chance of a candidate wins the election) or objective (e.g., the chance of a head in a coin toss). In most scientific endeavors, we are more likely to use the former interpretation (e.g., the likelihood that a model is true). 

Whether we call ourselves Bayesian or not should not be based on whether we use MCMC (WinBUGS, OpenBUGS, JAGS, Stan), rather based on how we interpret the results. For most standard statistical modeling problems, we should use standard classical model results. MCMC gives us a tool for complicated computations. The use of MCMC is not equivalent to being Bayesian.

References

    A. Gelman and J. Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, New York, 2007.
    S.S. Qian. Environmental and Ecological Statistics with R. Chapman and Hall/CRC Press, 2010.

    C.A. Stow, K.H. Reckhow, and S.S. Qian. A Bayesian approach to retransformation bias in transformed regression. Ecology, 87 (6): 1472–1477, 2006.

Friday, April 5, 2013

Environmental and Ecological Statistics with R Webpage

The webpage for Environmental and Ecological Statistics with R was removed by Duke.  I was not told about the move (nor was I notified when they disabled my Duke email). I am in the process of moving the page to Toledo.  In the meanwhile, I have uploaded the data folder and two script files here.

Tuesday, October 16, 2012

To Binomial or To Poisson? That is the Question



Modeling pathogenic protozoa such as Cryptosporidium oocysts in water is a problem of modeling count data. An important aspect of the problem is to understand the method detection probability. The currently used method for detecting and quantifying Cryptosporidium oocysts in a water sample has an average recovery rate of about 40%. That is, the probability of detecting an oocyst in a water sample is about 0.4 when the oocyst is present. The method used for quantifying this recovery rate is to have labs analyze water samples spiked with known number of oocysts. In the recent US EPA study, 50 certified labs were given several spiked samples and their recoveries were calculated as the number of oocysts identified divided by the number of oocysts spiked. The number of oocysts spiked is typically about 100. The counting process has a standard deviation of 2 to 3 oocysts. Given the relatively accurate total counts, we generally use a binomial model:
where the index ij represents the ith observation from lab j. A simple multilevel model is then to model the detection probability as lab specific:
and imposing a common prior on the logits of lab means:
This model assumes that detection probability varies by lab and lab-specific detection probabilities are exchangeable.
There are two problems with the model. The first problem is a numerical one. The likelihood of a binomial is of the form  py(1 p)n-y. The likelihood function can be numerically challenging to evaluate under certain situations. This is, obviously, a problem that can be addressed in programming. The other problem is that the total (Nij) is not observed exactly. When using binomial, this uncertainty is ignored. Because ignored uncertainty will not vanish, it will be reflected in the inflated uncertainty level in the estimated model parameters. In this case, the only parameter we have is the binomial mean p.
Section 1.2.5 of Agresti [2002] discusses the relationship between multinomial and Poisson. Simplifying the discussion to a binomial situation, we have the following.
1.
We use Y 1 to denote the vector of detected number of oocysts and Y 2 the number of missed oocysts. The vector of total number of oocysts is Y 1 + Y 2.
2.
We model Y 1 and Y 2 as two independent Poisson random variates with means λ1 and λ2, respectively.
3.
If Y 1 and Y 2 are Poisson random variables, the sum of Y 1 and Y 2 is also a Poisson random variable with mean λ1 + λ2.
4.
In the binomial model the sum of Y 1 and Y 2 is given, the joint distribution of Y 1 and Y 2 must then be conditional on the sum:
which is shown [Agresti2002] to be a binomial distribution characterized by π1 = λ∕ (λ1 + λ2) and π2 = λ∕ (λ1 + λ2) = 1 π1.
In other words, the binomial distribution is the conditional distribution of two independent Poisson variates. Or, we can model the binomial count data as two independent Poisson random variables and derive the binomial parameter.
Since we know the total (N), we can parameterize into the Poisson mean. For example, we can parameterize the Poisson mean of the detected number of oocysts as a product of the spiked total (N) and detection probability (p1):
Likewise, the model for λ2 is Np2is used as an offset in both models. Given that p1 + p2 = 1, the estimated binomial mean is
Only one Poisson model is needed. Modeling using Poisson with the total counts as the offset is the same as modeling using binomial.
By using a Poisson model, we can add a multiplicative error term (e.g., LN(02)) to account for the uncertainty in the number of spiked oocysts, and/or other sources of over-dispersion.

References

   A. Agresti. Categorical Data Analysis. Wiley, 2002.

Wednesday, October 3, 2012

Stan versus JAGS

Stan was born on August 30, 2012.  It was built to speed up multilevel generalized linear models.  Instead of improving the sampler used in BUGS, Stan used a better sampler -- the Hamiltonian Monte Carlo.  The HMC converges faster.  I tried Stan in the last two weeks and did a small comparison using a change point model.  The model was discussed in Environmental and Ecological Statistics with R (section 6.1.2).  In the book, lilacs first bloom dates from 4 sites were used to fit 4 hockey stick models, one for each site.  I pooled data from the 4 sites and formed a simple multilevel model, where the 4 model parameters were assumed exchangeable.  For example:

[; \beta_{0j} \sim N(\mu_0, \sigma_0) ;]

Implementing the model in BUGS (or JAGS) is fairly straight forward:


## multilevel hockey stick model
## 
model{
for (i in 1:n){
   y[i] ~ dnorm(y.hat[i], tau.y[stid[i]])
   y.hat[i] <- beta0[stid[i]]+(beta1[stid[i]]-delta[stid[i]]*step(x[i]-phi[stid[i]]))*(x[i]-phi[stid[i]])
}
for (j in 1:n.st){
   beta0[j] ~ dnorm(mu0, tau[1])
            beta1[j] ~ dnorm(mu1, tau[2])
   delta[j] ~ dnorm(mu2, tau[3])
       phi[j] ~ dnorm(mu3,tau[4])T(L[j],U[j])
   tau.y[j] <- pow(sigma.y[j], -2)
     sigma.y[j] ~ dunif(0, 10)
        } 
mu0 ~ dnorm(0, 0.0001)
mu1 ~ dnorm(0, 0.0001)
mu2 ~ dnorm(0, 0.0001)
mu3 ~ dnorm(0, 0.0001)

for (k in 1:4){
   tau[k] <- pow(sigma[k], -2)    
   sigma[k] ~ dunif(0, 100)
}
}

But the model converges very slowly. After a warm-up of 100,000 iterations, the next 100,000 iterations (thin=150) are still jumpy. 

The Stan code looks like this:

 /*  multilevel hockey stick model without group predictors */ 
data{
  int n; 
  int nst;
  real y[n]; 
  int stid[n]; // station ID
  real x[n]; // standardized year
  real minYr;
  real maxYr;
}
parameters{
  real beta0[nst]; // intercept at change point
  real beta1[nst];
  real delta[nst];
  real phi[nst];
  real mu0;
  real mu1;
  real mu2;
  real mu3;
  real sigmay;
  real sigma[4];
}
model{
    mu0 ~ normal(0, 100);
    mu1 ~ normal(0, 100);
        mu2 ~ normal(0, 100);
        mu3 ~ normal(0, 100);
        beta0 ~ normal(mu0, sigma[1]);// vector
        beta1 ~ normal(mu1, sigma[2]);// vector
        phi ~ normal(mu2,sigma[3]);   // vector
        delta ~ normal(mu3,sigma[4]); // vector
for (i in 1:n){
   y[i]~normal(beta0[stid[i]]+(beta1[stid[i]] - delta[stid[i]] * int_step(x[i]-phi[stid[i]]))*(x[i]-phi[stid[i]]), sigmay);
}
}

Two big differences between Stan and BUGS/JAGS: one is that we need to declare data and parameters, and the second is that the order of the code matters.  The first part is easy to adjust as we need to supply data and I always provide initials for all parameters.  The second is somewhat hard for me as I always write BUGS code (no ordering is needed) in the order of "likelihood -- mean function -- prior" and I must reverse the order in Stan.  But this is a small price to pay for a big reward in terms of speed of convergence and the possibility of doing something more.  

The Stan model output is somewhat different from the hockey stick model output.  For example, the estimated change point is 1975(3.5), 1983(4.9), 1974(8), and 1976(6.7) in Table 6.1 of the book, and the multilevel model estimates are 1976, 1976, 1974, and 1978 (with a more or less consistent sd of 6.6).  The multilevel estimated slopes of the first line segment ([;\beta_1;]) are much closer to 0 than the same from the hockey stick model.  The R package rstan is also very easy to use:

bugs.in <- function(infile=USlilac, n.chains=3){
  y <- infile$FirstBloom
  temp <- !is.na(y) ## Stan cannot take NA in y
  infile <- infile[temp,]
  y <- y[temp]
  xx <- infile$Year
  x <- (xx-mean(xx))/sd(xx)

  stid <- as.numeric(ordered(infile$STID))
  n <- length(y)
  n.st <- max(stid)
  L <- tapply(x, stid, min)
  U <- tapply(x, stid, max)
  inits <- list()
  bugs.dat <- list(n=n, nst=n.st, y=y, x=x, stid=stid,              
                   minYr=min(L), maxYr=max(U))
  for (i in 1:n.chains)
    inits[[i]] <- list(beta0=rnorm(n.st,130,2),
                       beta1=rnorm(n.st, 0, 0.1),
                       delta=runif(1, 0, 3), 
                       phi=runif(n.st, max(L), min(U)),
                       sigmay=runif(1, 0, 3), 
                       sigma=runif(2, 0, 1),
                       mu0 = rnorm(1), mu1 = rnorm(1))
  parameters <- c("beta0","beta1","delta", "phi", "mu0", 
                  "mu1","mu2","mu3","sigma", "sigmay")
  return(list(para=parameters, data=bugs.dat, inits=inits,
              meanYR=mean(xx), sdYR=sd(xx), n.chains=n.chains,
              model="
 /*  multilevel hockey stick model */ 
data{
  int n; 
  int nst;
  real y[n]; 
  int stid[n]; // station ID
  real x[n]; // standardized year
  real minYr;
  real maxYr;
}
parameters{
  real beta0[nst]; // intercept at change point
  real beta1[nst];
  real delta[nst];
  real phi[nst];
  real mu0;
  real mu1;
  real mu2;
  real mu3;
  real sigmay;
  real sigma[4];
}
model{
    mu0 ~ normal(0, 100);
    mu1 ~ normal(0, 100);
        mu2 ~ normal(0, 100);
        mu3 ~ normal(0, 100);
        beta0 ~ normal(mu0, sigma[1]);
        beta1 ~ normal(mu1, sigma[2]);
        phi ~ normal(mu2,sigma[3]);
        delta ~ normal(mu3,sigma[4]);
for (i in 1:n){
   y[i]~normal(beta0[stid[i]]+
(beta1[stid[i]] - delta[stid[i]] * int_step(x[i]-phi[stid[i]]))*(x[i]-phi[stid[i]]), sigmay);
}
}
"
  ))
}

keep4 <- USlilac$STID==354147 | USlilac$STID==456974 | 
         USlilac$STID==426357 | USlilac$STID==456624

input.to.bugs <- bugs.in(infile=USlilac[keep4,])
n.chains <- input.to.bugs$n.chains
n.iter <- 100000
n.keep <- 2000
thin <- max(c(1,floor((n.iter)*n.chains/n.keep)))

fit <- stan(model_code=input.to.bugs$model,
            data=input.to.bugs$data, init=input.to.bugs$inits,
            pars=input.to.bugs$para,
            iter=n.iter, thin=thin, chains=n.chains)
print(fit)
quartz()
plot(fit)

My experience with Stan is a pleasant one, especially the level of help provided by the Stan team.  The last question I submitted to them was at 10:30 pm and I got an answer back around 11:00 pm the same day! 


Wednesday, September 5, 2012

Parallel Computing

Someone said a while ago that MCMC is an embarrassingly parallel (computing) problem. I wanted to try some R implementations of parallel computing for BUGS a few months ago when I was running some models with large data sets.  I tried the function jags.parallel() in the R2jags package. It works for some models but fails for somewhat more complicated ones. This week, I learned a new package dclone and tried a toy problem.  I have used this setup on several models and all worked out well.


packages(snow)
packages(rjags)
packages(dclone)

bugs.in <- function(chains=3){
  x <- 1:100
  y <- 3 + 5 * x + rnorm(100,0, 0.25)
  bugs.ini <- list()
  bugs.data <- list(n=100,y=y,x=x)
  for (i in 1:chains)
    bugs.ini [[i]] <- list(
                           beta=rnorm(2), sigma=runif(1))
  bugs.para <- c("beta", "sigma")
  return(list(data=bugs.data, inits=bugs.ini,
              para=bugs.para, n.chains=chains, 
              model=function() {
                for (i in 1:n){
                  y[i] ~ dnorm(mu[i], tau)
                  mu[i] <- beta[1] + beta[2]* x[i]
                }
                for (j in 1:2){
                  beta[j] ~ dnorm(0,0.0001)
                }
                tau <- pow(sigma, -2)
                sigma ~ dunif(0,10)
              }))
}

input.to.bugs <- bugs.in()
n.chains <- input.to.bugs$n.chains
n.adapt <- 10000
n.update <- 10000
n.iter <- 10000
n.keep <- 1000
thin <- max(c(1,floor((n.iter)*n.chains/n.keep)))
date()
cl <- makeCluster(3, type = "SOCK")
m2 <- jags.parfit(cl, data = input.to.bugs$data, 
                  params = input.to.bugs$para, 
                  model = input.to.bugs$model,
                  inits = input.to.bugs$inits,
                  n.adapt = n.adapt, n.update = n.update,
                  n.iter = n.iter, thin = thin, n.chains = n.chains)
date()
stopCluster(cl)

Tuesday, June 12, 2012

Variance Components in Multilevel Models

(User script TeX the World should be used to view mathematical expressions in this post.)

The term "variance components" can have different meanings. In a one-way ANOVA problem, we partition the total variance (or sum of squares) into between and within group variance (or sum of squares). The term variance components is often used either for the between or within group variances (sum of squares divided by the respective degrees of freedom) or the sum of squares.  Using a model-based (Bayesian) expression, ANOVA is a hierarchical model with
at the data level, and
at the group level, where [;\sigma_y^2;] is the within group variance and [;\sigma_g^2;] is the between group variance. When data are from a balanced design, [;Var(y) = \sigma_y^2+\sigma_g^2;].

In mixed effects model literature, the term variance components is better explained using a Bayesian notation. Suppose that we have a linear mixed effects model. At the data level:
[;y_{ij} \sim N(\alpha_{j[i]}+\beta_{j[i]} x_{ij}, \sigma_y^2),;]
and at the group level
[;\left (\begin{array}{c}\alpha_j\\ \beta_j\end{array}\right ) \sim N\left [ \left (\begin{array}{c}
\mu_{\alpha}\\ \mu_{\beta}\end{array}\right ),\Sigma \right ];]
where [;\Sigma=\left (\begin{array}{cc} \sigma_{\alpha}^2 & \rho\sigma_{\alpha}\sigma_{\beta}\\
\rho\sigma_{\alpha}\sigma_{\beta} & \sigma_{\beta}^2\end{array}\right );]
In classical literature (and software), variance components are often the estimated [;\sigma^2;]s: [;\sigma_y^2;], [;\sigma_{\alpha}^2;], and [;\sigma_{\beta}^2;].

In the two Ecology papers on multilevel model (Qian and Shen, 2007; Qian et al., 2010), I used the ANOVA notion of variance components, i.e., fraction of total variance in the response due to factor g. The above simple linear multilevel model is fitted using MCMC and the varying model coefficients are divided into overall mean (fixed effect) and group effects (random effects):
[;y_{ij} = (\mu_{\alpha}+\delta_{\alpha_{j[i]}})+(\mu_{\beta}+\delta_{\beta_{j[i]}}) x_{ij} +\varepsilon,;]
The total variance in the response variable [;y_{ij};] is partitioned into the variances due to group [;\left ( Var(\delta_{\alpha_{\alpha_j}})\right );], the predictor [;\left ( Var(\mu_{\beta}\times x_{ij})\right );], predictor-group interaction [;\left ( Var(\delta_{\beta_{\beta_j}}\times x_{ij})\right );], and residuals [;\left (Var(\varepsilon)\right );]. This is what calculated in Gelman and Hill (2007) to produce the ANOVA table-like variance components figures. If comparing variance components to the results from using the R function lmer (from package lme4), [;Var(\delta_{\alpha_{\alpha_j}});] should be the same (or very close) to [;\sigma_{\alpha}^2;] and [;Var(\delta_{\beta_{\beta_j}});] should be very close to [;\sigma_{\beta};].

Saturday, May 26, 2012

Threshold, Change Point, and Other Practical Issues

(User script TeX the World should be used to properly view mathematical expressions in this post.)

Discussions with several colleagues at the SFS 2012 Conference reminded me that there are several conceptual and technical issues related the to threshold. Clear definitions of many concepts are needed.
First, we often confuse threshold and change point. A change point is a mathematical concept. It is the point along the x-axis where the response curve show a discontinuity. A threshold is an ecological or management concept. It is the point along a gradient where the response value reaches a critical level. Figure 1 shows the difference of the two concept.

 Figure 1. 

In Figure 1, the y-axis is a hypothetical response variable of interest. If human health concern dictates that Y should be kept above Ym1, we must set the standard for X at Xm1. If ecological concern only require that Y be above Ym2, we can set the standard for X at Xm2. The mathematical change point Xcp is of no concern, unless it coincides with Xm1 or Xm2. Figure 1 suggests that we should focus our attention on finding the underlying model describing the dependency of Y on X. This relationship can be a hockey stick model or otherwise.
Second, when estimating a change point, we must explicitly state the underlying model. Our Ecological Indicators paper (To threshold or not to threshold? That's the question) discussed this issue in detail. The model-specific nature of a change point problem is not a common knowledge in our field.
Third, a change point model makes strong assumptions about the behavior of the data. With many sources of uncertainty, it is rarely possible to discern an abrupt change model (a change point model) from its  continuous counterpart. Figure 2 shows two examples.

Figure 2.

In Figure 2, I show a step function and a hockey stick model. In both cases, I contrast the two change point (abrupt change) models (the dashed lines) to their continuous counterparts (the shaded lines). If the continuous model is the appropriate model, we are interested in estimating [;\phi-\gamma;]. But if the change point model is used instead, we will likely be end up with [;\phi;]. Unfortunately, statistical tests are not currently available for distinguishing these two types of models (Chiu, et al, 2006). The term "bent-cable" model in Grace Chiu's paper is the one shown in the right panel in Figure 2. The hockey stick model in my Environmental and Ecological statistics with R is essentially a bent-cable model (two linear line segments linked by a quadratic line segment), but I fixed [;\gamma;] to be a known small value (1% of the data range). The bent-cable model estimates both [;\gamma;] and [;\phi;].  I think that we should consider  abandoning the use of the hockey-stick model in favor of the bent-cable model (and the step function model in favor of the doubly-bent cable model, left panel of Figure 2). Figure 2 is more of an ecological concern than a management one. If the system is moving from one steady state to a different one (step function versus the doubly-bent model), [;\phi-\gamma;] is the point where such change starts, while [;\phi;] is the middle of the transition period. From a management perspective, setting an environmental standard at [;\phi-\gamma;] makes more sense.


G. Chiu, R. Lockhart, and R. Routledge. Bent-cable regression theory and applications. Journal of the American Statistical Association, 101(474):542– 553, 2006.


Log or not log

LOGorNOTLOG.html Log or not log, that is the question May 19, 2018 In 2014 I taught a special topics class on statistical i...