Tuesday, October 16, 2012

To Binomial or To Poisson? That is the Question



Modeling pathogenic protozoa such as Cryptosporidium oocysts in water is a problem of modeling count data. An important aspect of the problem is understanding the method detection probability. The currently used method for detecting and quantifying Cryptosporidium oocysts in a water sample has an average recovery rate of about 40%. That is, the probability of detecting an oocyst in a water sample is about 0.4 when the oocyst is present. The recovery rate is quantified by having labs analyze water samples spiked with a known number of oocysts. In a recent US EPA study, 50 certified labs were each given several spiked samples, and their recoveries were calculated as the number of oocysts identified divided by the number of oocysts spiked. The number of oocysts spiked is typically about 100, and the counting process has a standard deviation of 2 to 3 oocysts. Given the relatively accurate total counts, we generally use a binomial model:

[; y_{ij} \sim Bin(p_{ij}, N_{ij}) ;]

where the index ij represents the ith observation from lab j. A simple multilevel model is then to make the detection probability lab specific:

[; y_{ij} \sim Bin(p_j, N_{ij}) ;]

and to impose a common prior on the logits of the lab means:

[; \mathrm{logit}(p_j) \sim N(\mu, \sigma^2) ;]

This model assumes that the detection probability varies by lab and that the lab-specific detection probabilities are exchangeable.
There are two problems with the model. The first problem is a numerical one. The binomial likelihood is of the form [; p^y(1-p)^{n-y} ;], which can be numerically challenging to evaluate in some situations. This is, obviously, a problem that can be addressed in programming. The other problem is that the total (Nij) is not observed exactly; the binomial model ignores this uncertainty. Because ignored uncertainty does not vanish, it is reflected in an inflated uncertainty level in the estimated model parameters. In this case, the only parameter we have is the binomial mean p.
Section 1.2.5 of Agresti [2002] discusses the relationship between the multinomial and Poisson distributions. Simplifying the discussion to the binomial situation, we have the following.

1. Let [; Y_1 ;] denote the vector of detected numbers of oocysts and [; Y_2 ;] the numbers of missed oocysts. The vector of total numbers of oocysts is [; Y_1 + Y_2 ;].

2. We model [; Y_1 ;] and [; Y_2 ;] as two independent Poisson random variates with means [; \lambda_1 ;] and [; \lambda_2 ;], respectively.

3. If [; Y_1 ;] and [; Y_2 ;] are Poisson random variables, their sum is also a Poisson random variable, with mean [; \lambda_1 + \lambda_2 ;].

4. In the binomial model the sum of [; Y_1 ;] and [; Y_2 ;] is given; the joint distribution of [; Y_1 ;] and [; Y_2 ;] must then be conditional on the sum:

[; \Pr(Y_1 = y_1, Y_2 = y_2 \mid Y_1 + Y_2 = n) = \frac{\Pr(Y_1 = y_1)\Pr(Y_2 = y_2)}{\Pr(Y_1 + Y_2 = n)} ;]

which is shown in Agresti [2002] to be a binomial distribution characterized by [; \pi_1 = \lambda_1/(\lambda_1 + \lambda_2) ;] and [; \pi_2 = \lambda_2/(\lambda_1 + \lambda_2) = 1 - \pi_1 ;].
In other words, the binomial distribution is the conditional distribution of two independent Poisson variates given their sum. We can therefore model the binomial count data as two independent Poisson random variables and derive the binomial parameter.
Since we know the total (N), we can build it into the Poisson mean. For example, we can parameterize the Poisson mean of the detected number of oocysts as the product of the spiked total (N) and the detection probability (p1):

[; \lambda_1 = N p_1, \quad \text{or} \quad \log(\lambda_1) = \log(N) + \log(p_1) ;]

Likewise, the model for [; \lambda_2 ;] is [; \lambda_2 = N p_2 ;]; [; \log(N) ;] is used as an offset in both models. Given that [; p_1 + p_2 = 1 ;], the estimated binomial mean is

[; \hat{\pi}_1 = \lambda_1/(\lambda_1 + \lambda_2) = p_1 ;]
That is, only one Poisson model is needed: modeling the detected counts as Poisson with the log of the total count as an offset is the same as modeling them as binomial.
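This equivalence is easy to verify with simulated data. The following is a minimal sketch using R's glm() (the sample size, spiked totals, and recovery rate are made-up values):

set.seed(42)
n.obs <- 50
N <- rep(100, n.obs)          # spiked totals (treated as known here)
y <- rbinom(n.obs, N, 0.4)    # detected counts at a 40% recovery rate

## binomial model: the intercept is logit(p)
m.bin <- glm(cbind(y, N - y) ~ 1, family = binomial)
## Poisson model with log(N) as the offset: the intercept is log(p)
m.pois <- glm(y ~ 1, family = poisson, offset = log(N))

c(binomial = plogis(coef(m.bin)), poisson = exp(coef(m.pois)))
## both point estimates equal sum(y)/sum(N)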
By using a Poisson model, we can add a multiplicative error term (e.g., [; LN(0, \sigma^2) ;]) to the mean to account for the uncertainty in the number of spiked oocysts and/or other sources of over-dispersion.
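In BUGS/JAGS, the multilevel Poisson version of the detection model, with the multiplicative error added, could look roughly like the sketch below. This is my own sketch: the data names (n, N, lab, n.lab) and the prior bounds are assumptions, not from the EPA study.

## Poisson version of the multilevel detection model, with a
## multiplicative lognormal error on the mean (a sketch)
model{
  for (i in 1:n){
    y[i] ~ dpois(lambda[i])
    log(lambda[i]) <- log(N[i]) + log(p[lab[i]]) + eps[i]
    eps[i] ~ dnorm(0, tau.eps)   ## multiplicative LN(0, sigma.eps^2) error
  }
  for (j in 1:n.lab){
    logit(p[j]) <- theta[j]      ## lab-specific detection probability
    theta[j] ~ dnorm(mu, tau)    ## common prior on the logits
  }
  mu ~ dnorm(0, 0.0001)
  tau <- pow(sigma, -2)
  sigma ~ dunif(0, 10)
  tau.eps <- pow(sigma.eps, -2)
  sigma.eps ~ dunif(0, 10)       ## setting sigma.eps to 0 recovers the binomial model
}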

References

   A. Agresti. Categorical Data Analysis. Wiley, 2002.

Wednesday, October 3, 2012

Stan versus JAGS

Stan was born on August 30, 2012.  It was built to speed up multilevel generalized linear models.  Instead of improving the sampler used in BUGS, Stan uses a better sampler -- Hamiltonian Monte Carlo (HMC) -- which converges faster.  I tried Stan in the last two weeks and did a small comparison using a change point model.  The model was discussed in Environmental and Ecological Statistics with R (section 6.1.2).  In the book, lilac first-bloom dates from 4 sites were used to fit 4 hockey stick models, one for each site.  I pooled data from the 4 sites and formed a simple multilevel model, where the 4 model parameters were assumed exchangeable.  For example:

[; \beta_{0j} \sim N(\mu_0, \sigma_0) ;]

Implementing the model in BUGS (or JAGS) is fairly straightforward:


## multilevel hockey stick model
model{
  for (i in 1:n){
    y[i] ~ dnorm(y.hat[i], tau.y[stid[i]])
    y.hat[i] <- beta0[stid[i]] +
                (beta1[stid[i]] - delta[stid[i]]*step(x[i]-phi[stid[i]])) *
                (x[i]-phi[stid[i]])
  }
  for (j in 1:n.st){
    beta0[j] ~ dnorm(mu0, tau[1])
    beta1[j] ~ dnorm(mu1, tau[2])
    delta[j] ~ dnorm(mu2, tau[3])
    phi[j] ~ dnorm(mu3, tau[4])T(L[j], U[j])
    tau.y[j] <- pow(sigma.y[j], -2)
    sigma.y[j] ~ dunif(0, 10)
  }
  mu0 ~ dnorm(0, 0.0001)
  mu1 ~ dnorm(0, 0.0001)
  mu2 ~ dnorm(0, 0.0001)
  mu3 ~ dnorm(0, 0.0001)
  for (k in 1:4){
    tau[k] <- pow(sigma[k], -2)
    sigma[k] ~ dunif(0, 100)
  }
}

But the model converges very slowly. After a warm-up of 100,000 iterations, the next 100,000 iterations (thin=150) are still jumpy. 

The Stan code looks like this:

/* multilevel hockey stick model without group predictors */
data{
  int n;
  int nst;
  real y[n];
  int stid[n];   // station ID
  real x[n];     // standardized year
  real minYr;
  real maxYr;
}
parameters{
  real beta0[nst]; // intercept at change point
  real beta1[nst];
  real delta[nst];
  real phi[nst];
  real mu0;
  real mu1;
  real mu2;
  real mu3;
  real sigmay;
  real sigma[4];
}
model{
  mu0 ~ normal(0, 100);
  mu1 ~ normal(0, 100);
  mu2 ~ normal(0, 100);
  mu3 ~ normal(0, 100);
  beta0 ~ normal(mu0, sigma[1]); // vectorized
  beta1 ~ normal(mu1, sigma[2]); // vectorized
  phi ~ normal(mu2, sigma[3]);   // vectorized
  delta ~ normal(mu3, sigma[4]); // vectorized
  for (i in 1:n){
    y[i] ~ normal(beta0[stid[i]] +
                  (beta1[stid[i]] - delta[stid[i]]*int_step(x[i]-phi[stid[i]])) *
                  (x[i]-phi[stid[i]]), sigmay);
  }
}

There are two big differences between Stan and BUGS/JAGS: one is that we need to declare data and parameters, and the other is that the order of the code matters.  The first part is easy to adjust because we always need to supply data, and I always provide initial values for all parameters.  The second is somewhat harder for me because I always write BUGS code (where ordering does not matter) in the order of "likelihood -- mean function -- prior," and I must reverse the order in Stan.  But this is a small price to pay for a big reward in terms of speed of convergence and the possibility of doing something more.

The Stan model output is somewhat different from the single-site hockey stick model output.  For example, the estimated change points are 1975 (3.5), 1983 (4.9), 1974 (8), and 1976 (6.7) in Table 6.1 of the book (standard deviations in parentheses), while the multilevel model estimates are 1976, 1976, 1974, and 1978 (with a more or less consistent sd of 6.6).  The multilevel estimated slopes of the first line segment ([;\beta_1;]) are much closer to 0 than those from the single-site hockey stick models.  The R package rstan is also very easy to use:

bugs.in <- function(infile=USlilac, n.chains=3){
  y <- infile$FirstBloom
  temp <- !is.na(y) ## Stan cannot take NA in y
  infile <- infile[temp,]
  y <- y[temp]
  xx <- infile$Year
  x <- (xx-mean(xx))/sd(xx)

  stid <- as.numeric(ordered(infile$STID))
  n <- length(y)
  n.st <- max(stid)
  L <- tapply(x, stid, min)
  U <- tapply(x, stid, max)
  inits <- list()
  bugs.dat <- list(n=n, nst=n.st, y=y, x=x, stid=stid,              
                   minYr=min(L), maxYr=max(U))
  for (i in 1:n.chains)
    inits[[i]] <- list(beta0=rnorm(n.st, 130, 2),
                       beta1=rnorm(n.st, 0, 0.1),
                       delta=runif(n.st, 0, 3),   # one value per station to match delta[nst]
                       phi=runif(n.st, max(L), min(U)),
                       sigmay=runif(1, 0, 3),
                       sigma=runif(4, 0, 1),      # four values to match sigma[4]
                       mu0 = rnorm(1), mu1 = rnorm(1),
                       mu2 = rnorm(1), mu3 = rnorm(1))
  parameters <- c("beta0","beta1","delta", "phi", "mu0", 
                  "mu1","mu2","mu3","sigma", "sigmay")
  return(list(para=parameters, data=bugs.dat, inits=inits,
              meanYR=mean(xx), sdYR=sd(xx), n.chains=n.chains,
              model="
/* multilevel hockey stick model */
data{
  int n;
  int nst;
  real y[n];
  int stid[n];   // station ID
  real x[n];     // standardized year
  real minYr;
  real maxYr;
}
parameters{
  real beta0[nst]; // intercept at change point
  real beta1[nst];
  real delta[nst];
  real phi[nst];
  real mu0;
  real mu1;
  real mu2;
  real mu3;
  real sigmay;
  real sigma[4];
}
model{
  mu0 ~ normal(0, 100);
  mu1 ~ normal(0, 100);
  mu2 ~ normal(0, 100);
  mu3 ~ normal(0, 100);
  beta0 ~ normal(mu0, sigma[1]);
  beta1 ~ normal(mu1, sigma[2]);
  phi ~ normal(mu2, sigma[3]);
  delta ~ normal(mu3, sigma[4]);
  for (i in 1:n){
    y[i] ~ normal(beta0[stid[i]] +
                  (beta1[stid[i]] - delta[stid[i]]*int_step(x[i]-phi[stid[i]])) *
                  (x[i]-phi[stid[i]]), sigmay);
  }
}
"
  ))
}

keep4 <- USlilac$STID==354147 | USlilac$STID==456974 | 
         USlilac$STID==426357 | USlilac$STID==456624

input.to.bugs <- bugs.in(infile=USlilac[keep4,])
n.chains <- input.to.bugs$n.chains
n.iter <- 100000
n.keep <- 2000
thin <- max(c(1,floor((n.iter)*n.chains/n.keep)))

fit <- stan(model_code=input.to.bugs$model,
            data=input.to.bugs$data, init=input.to.bugs$inits,
            pars=input.to.bugs$para,
            iter=n.iter, thin=thin, chains=n.chains)
print(fit)
quartz()
plot(fit)
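One convenience worth noting: the posterior draws can be pulled out of the fitted object with rstan's extract() and back-transformed to the year scale using the meanYR and sdYR values returned by bugs.in() (a sketch following the code above):

stan.out <- extract(fit)
## phi is on the standardized-year scale; back-transform each draw
phi.yr <- input.to.bugs$meanYR + input.to.bugs$sdYR * stan.out$phi
apply(phi.yr, 2, mean)  ## posterior mean change point (in years) for each site
apply(phi.yr, 2, sd)    ## and its posterior sd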

My experience with Stan is a pleasant one, especially the level of help provided by the Stan team.  The last question I submitted to them was at 10:30 pm and I got an answer back around 11:00 pm the same day! 


Wednesday, September 5, 2012

Parallel Computing

Someone said a while ago that MCMC is an embarrassingly parallel (computing) problem. I wanted to try some R implementations of parallel computing for BUGS a few months ago when I was running models with large data sets.  I tried the function jags.parallel() in the R2jags package; it works for some models but fails for somewhat more complicated ones. This week, I learned a new package, dclone, and tried a toy problem.  I have since used this setup on several models and all worked out well.


library(snow)
library(rjags)
library(dclone)

bugs.in <- function(chains=3){
  x <- 1:100
  y <- 3 + 5 * x + rnorm(100,0, 0.25)
  bugs.ini <- list()
  bugs.data <- list(n=100,y=y,x=x)
  for (i in 1:chains)
    bugs.ini[[i]] <- list(beta=rnorm(2), sigma=runif(1))
  bugs.para <- c("beta", "sigma")
  return(list(data=bugs.data, inits=bugs.ini,
              para=bugs.para, n.chains=chains, 
              model=function() {
                for (i in 1:n){
                  y[i] ~ dnorm(mu[i], tau)
                  mu[i] <- beta[1] + beta[2]* x[i]
                }
                for (j in 1:2){
                  beta[j] ~ dnorm(0,0.0001)
                }
                tau <- pow(sigma, -2)
                sigma ~ dunif(0,10)
              }))
}

input.to.bugs <- bugs.in()
n.chains <- input.to.bugs$n.chains
n.adapt <- 10000
n.update <- 10000
n.iter <- 10000
n.keep <- 1000
thin <- max(c(1,floor((n.iter)*n.chains/n.keep)))
date()
cl <- makeCluster(3, type = "SOCK")
m2 <- jags.parfit(cl, data = input.to.bugs$data, 
                  params = input.to.bugs$para, 
                  model = input.to.bugs$model,
                  inits = input.to.bugs$inits,
                  n.adapt = n.adapt, n.update = n.update,
                  n.iter = n.iter, thin = thin, n.chains = n.chains)
date()
stopCluster(cl)
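jags.parfit() returns a coda mcmc.list, so the usual coda tools apply directly once the cluster is stopped; for example (a minimal sketch):

library(coda)
summary(m2)        ## posterior summaries pooled across the 3 chains
gelman.diag(m2)    ## convergence check (potential scale reduction factors)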

Tuesday, June 12, 2012

Variance Components in Multilevel Models

(User script TeX the World should be used to view mathematical expressions in this post.)

The term "variance components" can have different meanings. In a one-way ANOVA problem, we partition the total variance (or sum of squares) into between and within group variance (or sum of squares). The term variance components is often used either for the between or within group variances (sum of squares divided by the respective degrees of freedom) or the sum of squares.  Using a model-based (Bayesian) expression, ANOVA is a hierarchical model with
at the data level, and
at the group level, where [;\sigma_y^2;] is the within group variance and [;\sigma_g^2;] is the between group variance. When data are from a balanced design, [;Var(y) = \sigma_y^2+\sigma_g^2;].

In the mixed effects model literature, the term variance components is better explained using Bayesian notation. Suppose that we have a linear mixed effects model. At the data level:
[;y_{ij} \sim N(\alpha_{j[i]}+\beta_{j[i]} x_{ij}, \sigma_y^2),;]
and at the group level
[;\left (\begin{array}{c}\alpha_j\\ \beta_j\end{array}\right ) \sim N\left [ \left (\begin{array}{c}
\mu_{\alpha}\\ \mu_{\beta}\end{array}\right ),\Sigma \right ];]
where [;\Sigma=\left (\begin{array}{cc} \sigma_{\alpha}^2 & \rho\sigma_{\alpha}\sigma_{\beta}\\
\rho\sigma_{\alpha}\sigma_{\beta} & \sigma_{\beta}^2\end{array}\right );]
In classical literature (and software), variance components are often the estimated [;\sigma^2;]s: [;\sigma_y^2;], [;\sigma_{\alpha}^2;], and [;\sigma_{\beta}^2;].

In the two Ecology papers on multilevel models (Qian and Shen, 2007; Qian et al., 2010), I used the ANOVA notion of variance components, i.e., the fraction of the total variance in the response due to each factor. The above simple linear multilevel model is fitted using MCMC and the varying model coefficients are divided into overall means (fixed effects) and group effects (random effects):
[;y_{ij} = (\mu_{\alpha}+\delta_{\alpha_{j[i]}})+(\mu_{\beta}+\delta_{\beta_{j[i]}}) x_{ij} +\varepsilon_{ij},;]
The total variance in the response variable [;y_{ij};] is partitioned into the variances due to group [;\left ( Var(\delta_{\alpha_j})\right );], the predictor [;\left ( Var(\mu_{\beta} x_{ij})\right );], the predictor-group interaction [;\left ( Var(\delta_{\beta_j} x_{ij})\right );], and the residuals [;\left (Var(\varepsilon)\right );]. This is what is calculated in Gelman and Hill (2007) to produce the ANOVA-table-like variance components figures. When comparing these variance components to the results from the R function lmer (from package lme4), [;Var(\delta_{\alpha_j});] should be the same as (or very close to) [;\sigma_{\alpha}^2;] and [;Var(\delta_{\beta_j});] should be very close to [;\sigma_{\beta}^2;].
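The correspondence can be checked numerically. Below is a small sketch (simulated data with made-up values) comparing lmer's variance components with the finite-population variances of the fitted group effects:

library(lme4)
set.seed(1)
ng <- 20; n <- 400
g <- sample(ng, n, replace = TRUE)
alpha <- rnorm(ng, 10, 1)     ## true sigma_alpha = 1
beta <- rnorm(ng, 2, 0.5)     ## true sigma_beta = 0.5
x <- rnorm(n)
y <- alpha[g] + beta[g] * x + rnorm(n, 0, 0.3)
gr <- factor(g)
M <- lmer(y ~ x + (1 + x | gr))
VarCorr(M)                    ## sigma_alpha, sigma_beta, rho, and sigma_y
apply(ranef(M)$gr, 2, var)    ## ~ Var(delta_alpha) and Var(delta_beta)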

Saturday, May 26, 2012

Threshold, Change Point, and Other Practical Issues

(User script TeX the World should be used to properly view mathematical expressions in this post.)

Discussions with several colleagues at the SFS 2012 Conference reminded me that there are several conceptual and technical issues related to the notion of a threshold. Clear definitions of many concepts are needed.
First, we often confuse a threshold with a change point. A change point is a mathematical concept: it is the point along the x-axis where the response curve shows a discontinuity. A threshold is an ecological or management concept: it is the point along a gradient where the response value reaches a critical level. Figure 1 shows the difference between the two concepts.

 Figure 1. 

In Figure 1, the y-axis is a hypothetical response variable of interest. If a human health concern dictates that Y should be kept above Ym1, we must set the standard for X at Xm1. If an ecological concern only requires that Y be above Ym2, we can set the standard for X at Xm2. The mathematical change point Xcp is of no concern unless it coincides with Xm1 or Xm2. Figure 1 suggests that we should focus our attention on finding the underlying model describing the dependency of Y on X. This relationship can be a hockey stick model or otherwise.
Second, when estimating a change point, we must explicitly state the underlying model. Our Ecological Indicators paper (To threshold or not to threshold? That's the question) discussed this issue in detail. The model-specific nature of a change point problem is not common knowledge in our field.
Third, a change point model makes strong assumptions about the behavior of the data. With many sources of uncertainty, it is rarely possible to distinguish an abrupt change model (a change point model) from its continuous counterpart. Figure 2 shows two examples.

Figure 2.

In Figure 2, I show a step function and a hockey stick model. In both cases, I contrast the two change point (abrupt change) models (the dashed lines) with their continuous counterparts (the shaded lines). If the continuous model is the appropriate model, we are interested in estimating [;\phi-\gamma;]. But if the change point model is used instead, we will likely end up with [;\phi;]. Unfortunately, statistical tests are not currently available for distinguishing these two types of models (Chiu et al., 2006). The "bent-cable" model in Grace Chiu's paper is the one shown in the right panel of Figure 2. The hockey stick model in my Environmental and Ecological Statistics with R is essentially a bent-cable model (two linear line segments linked by a quadratic segment), but with [;\gamma;] fixed at a known small value (1% of the data range); the bent-cable model estimates both [;\gamma;] and [;\phi;].  I think we should consider abandoning the hockey stick model in favor of the bent-cable model (and the step function model in favor of the doubly-bent cable model, left panel of Figure 2). Figure 2 is more of an ecological concern than a management one. If the system is moving from one steady state to another (step function versus the doubly-bent model), [;\phi-\gamma;] is the point where the change starts, while [;\phi;] is the middle of the transition period. From a management perspective, setting an environmental standard at [;\phi-\gamma;] makes more sense.
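For concreteness, here is a minimal R sketch of the bent-cable mean function, following the parameterization in Chiu et al. (2006): linear below [;\phi-\gamma;], quadratic on the transition, linear above [;\phi+\gamma;]. The parameter values in the example are arbitrary.

bent.cable <- function(x, beta0, beta1, beta2, phi, gamma){
  ## quadratic "bend" basis: 0 below phi-gamma, linear above phi+gamma
  q <- ifelse(x < phi - gamma, 0,
       ifelse(x > phi + gamma, x - phi,
              (x - phi + gamma)^2 / (4 * gamma)))
  beta0 + beta1 * x + beta2 * q
}
## a hockey stick is (approximately) the special case with a very small gamma:
curve(bent.cable(x, 0, 0, -1, 5, 0.1), 0, 10)
curve(bent.cable(x, 0, 0, -1, 5, 2), 0, 10, add = TRUE, lty = 2)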


G. Chiu, R. Lockhart, and R. Routledge. Bent-cable regression theory and applications. Journal of the American Statistical Association, 101(474):542-553, 2006.


Wednesday, May 23, 2012

TITANic Journey

After a year's work, wait, and review, we have finally submitted a critique of TITAN to the Journal of the Society of Freshwater Science. Based on USGS internal review, we shortened the manuscript and removed some material, mostly statistics. In this post, I report the statistics not presented in the manuscript.

First, a quick background summary. 

In 2010, Ryan King and Matt Baker published a paper in the journal Methods in Ecology and Evolution (MEE), and later one in the Journal of the North American Benthological Society (JNABS), proposing a "statistical" method for detecting and quantifying ecological thresholds at both the species level and the community level. The method is named TITAN. The paper was later listed as one of MEE's most popular articles. After the publication, the authors promoted the paper in the media (NPR Maryland) and to many state and federal agencies.  During 2010, I was working with Tom Cuffney on a paper discussing the problems associated with threshold modeling (later published with the title "To threshold or not to threshold? That's the question"). The premise of our paper is that we must understand the underlying assumptions of a model before applying it. We also discussed the difference between a change point (a statistical concept) and a management threshold (a regulatory concept). When we read the TITAN papers, we knew there was a problem. In this post, I introduce TITAN briefly, followed by a discussion of statistical issues related to the method. These issues include (1) the use of a permutation test, (2) bootstrapping, (3) repeated testing, (4) characteristics of the permutation test used in TITAN, and (5) consequences of the misuse of the permutation test.  I want to use this post to show the difficulty encountered in the peer-review process, where statistical issues like these are not easy to unravel in a typical review.

1. Introduction

My discussion of TITAN is based on my understanding of data analysis and statistical inference. The emphasis of data analysis using statistical tools is on uncovering the underlying causal relation that produced the observed data. Statistical analysis and modeling usually follow the hypothetical-deductive reasoning process first formalized by Fisher [1922]. Under Fisher's logic, the first step of a statistical problem is to pose a hypothesis based on the answer to the question "of which distribution are the data a random sample?" As statistics is a science of randomness and a probability distribution is the quantitative measure of randomness, statistical inference is the process of learning about the parameters defining the probability distribution. Specifically, we are interested in (1) quantifying the parameter of interest when random samples from the population are available (the problem of parameter estimation) and (2) quantifying the uncertainty of the estimate. The maximum likelihood estimator is the most commonly used method for parameter estimation, while the uncertainty of the estimate is most commonly defined through the sampling distribution of the estimator. Recently, Baker and King [2010] proposed a program known as the threshold indicator taxa analysis (TITAN) for calculating ecological community "thresholds." The program is promoted as a "statistical" method capable of detecting a community-level threshold. This post discusses the limitations of TITAN and statistical errors in the program that may lead to misleading results.


2. TITAN

TITAN can be summarized as a collection of computer programs for selecting a split point along a disturbance gradient to divide the collection of samples into two groups. The split point is selected based on an indicator value of a taxon's group association. The indicator value was proposed to describe a taxon's association with a number of existing clusters [Dufrêne and Legendre, 1997]. When there are only two clusters, the indicator value (or IndVal) is the product of the taxon's relative abundance and its frequency of occurrence:

[; \mathrm{IndVal}_i = 100 \times A_i \times B_i ;]                                        (1)

where i = 1, 2 is the cluster index, [;A_i;] is the relative abundance calculated as the ratio of the mean total abundance in cluster i ([;a_i;]) over the sum of the cluster means ([; A_i = a_i / \sum_{i=1}^{2} a_i ;]), and [;B_i;] is the frequency of occurrence (the fraction of non-zero observations) in cluster i. While IndVal was derived for describing the association of a given taxon with an existing cluster, TITAN uses IndVal to define clusters along a disturbance gradient. Observations along the gradient are successively divided into two groups by moving a dividing line along the gradient; at each potential split point an IndVal is calculated for each group. The division that results in the largest difference between the two IndVals is defined as the split point for the taxon. (Baker and King [2010] define the IndVal for each potential split point as the larger of the two IndVals; the split point based on the maximum IndVal under this definition is the same as the split point with the largest difference between the two IndVals under the definition of Dufrêne and Legendre [1997].) This process searches for the maximum of the indicator value differences between the two groups. Because the calculation of IndVal requires two clusters (or groups in this case), TITAN starts the search at some distance from the low end of the gradient to allow a pre-determined number of data points in the "left" group and ends at a distance from the upper end of the gradient to allow the same minimum number of data points in the "right" group. This calculation is equivalent to a truncated variable transformation (truncating the data at both ends of the disturbance gradient and transforming the total abundance data into IndVal), and the maximum of the transformed data is defined as the split point along the gradient.
Once the split point is selected, a permutation test is employed to test its "statistical significance." The permutation test randomly divides the data into two groups, each with the same number of observations as in the division that produces the maximum IndVal. The statistical significance is represented either by the p-value (the fraction of permutation IndVals exceeding the IndVal associated with the selected split point) or by the z-score (the IndVal standardized with respect to the mean and standard deviation of the permutation IndVals).
To translate taxon-specific split points into a community "threshold," TITAN performs the permutation test of IndVal for all possible split points along the gradient and records the z-scores for all taxa. At each potential split point along the gradient, the z-scores of all taxa are summed. The community-level threshold is defined as the split point associated with the maximum sum of z-scores. There are two sums of z-scores, one for "decreasing" taxa (those with a higher relative abundance on the low side of the selected split point) and the other for "increasing" taxa.
Uncertainty associated with the selected split point is estimated using bootstrap resampling.
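To make the algorithm concrete, the following R sketch is my own reconstruction of the two-group IndVal and the split point search from the description above. It is not TITAN's source code; the minimum group size and variable names are illustrative.

indval2 <- function(abun, grp){
  a <- tapply(abun, grp, mean)       ## mean abundance in each group
  A <- a / sum(a)                    ## relative abundance
  B <- tapply(abun > 0, grp, mean)   ## frequency of occurrence
  100 * A * B                        ## IndVal for each group, as in equation (1)
}
split.search <- function(abun, grad, min.n = 5){
  ord <- order(grad)
  abun <- abun[ord]; grad <- grad[ord]
  n <- length(grad)
  cand <- min.n:(n - min.n)          ## candidate splits on the truncated gradient
  iv <- sapply(cand, function(k)
    max(indval2(abun, rep(1:2, c(k, n - k)))))  ## larger of the two IndVals
  grad[cand[which.max(iv)]]          ## selected split point
}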

3. Statistical Issues

3.1 TITAN is not a statistical model

The main problem with TITAN is that the program is inherently not a statistical method. In a typical statistical problem, variables are grouped into response variable(s) and predictor variable(s). A probability distribution assumption is first introduced to describe the variation of the response variable(s), followed by a model describing the link between the mean of the response variable distribution and the predictor variable(s) (the mean function). This interpretation of statistical inference was first introduced by Fisher [1922] and fully explained by Lenhard [2006]. Without the distributional assumption, statistical inference and modeling are impossible.
In selecting taxon-specific split points, TITAN uses an abundance measure of the taxon; no probability distribution assumption is made. A misleading claim is made that TITAN is a nonparametric statistical method. The term "nonparametric" has two specific meanings. First, when used to describe a collection of tests based on ordered statistics, the term is a synonym for non-normal: the distribution is not specified directly but is replaced by the distribution of test statistics derived from rank-transformed data. Second, when used to describe a class of statistical models based on smoothing techniques, the term refers to a mean function that is not characterized by a known formula with a fixed number of parameters. In both cases a distributional assumption is still needed to specify the model. The response variable distribution defines the likelihood function, which is the basis of all statistical models. Statistical inference is about the estimation of model parameters and the quantification of uncertainty in the estimated parameter values. Parameter estimation is based on the maximum likelihood estimator in classical statistics and on the posterior parameter distribution in Bayesian statistics; both require the probability distribution assumption to specify the likelihood function. Uncertainty analysis in classical statistics is based on the sampling distribution of the estimator, which is directly related to the probability distribution of the response variable. Uncertainty analysis in Bayesian statistics is based on the posterior distribution of the model parameters, which is largely determined by the likelihood function.
Without a probability distribution assumption for the response variable(s), TITAN is just a collection of computer programs for data processing and transformation. The split point for a taxon is selected, not estimated. Whether TITAN is a statistical model or not may not itself be a problem, but its lack of a definite distributional assumption results in ambiguity in the meaning of the selected split point.

3.2 Statistical change point problems

TITAN's authors repeatedly claim that TITAN is a change point model. The term "change point" in statistics means a jump discontinuity in an otherwise smooth curve. The aim of a change point analysis is to detect and locate the discontinuity. For a parametric model, a change point can be defined as the predictor variable value at which the parameters defining the response variable distribution change (see the selected change point literature summarized in Khodadadi and Asgharian [2008]). There are also nonparametric regression methods [Antoniadis and Gijbels, 2002, Dempfle and Stute, 2002, Gijbels et al., 1999] for change point problems. Change point estimation methods are model-specific: different response variable distributions and/or different mean functions (e.g., linear versus nonlinear) lead to different likelihood functions, and thereby to different computational methods for the MLE. TITAN defines neither a response variable nor a probability distribution. Consequently, the meaning of the selected split point depends on the underlying model, both in terms of the response variable distribution and the mean function. As the response variable is likely taxon abundance, and a log transformation is recommended, a normality assumption is the least controversial. Based on the definition of IndVal, when there are two clusters the difference in taxon abundance is mostly between clusters, while within-cluster differences are minimal. Accordingly, when the two clusters are defined by a split point along the gradient, the statistical model is the step function:

[; y_j = \mu_1 + \epsilon_1 \;\; \mathrm{if}\;\; x_j \le \phi, \qquad y_j = \mu_2 + \epsilon_2 \;\; \mathrm{if}\;\; x_j > \phi ;]                                        (2)

where [;y_j;] is the jth observed (log-transformed) taxon abundance value, [;x_j;] is the corresponding disturbance gradient value, [;\phi;] is the change point, [;\mu_1, \mu_2;] are the two cluster means, and [;\epsilon_1, \epsilon_2;] are the residual terms ([;\epsilon_i \sim N(0, \sigma_i^2), i = 1, 2;]). Indeed, the selected split point almost always coincides with the MLE of the change point under the step function model when [;\sigma_i^2;] is small. Figure 1 shows some simulations of the step function and the selected split points.

Figure 1. Three simulations of the step function, each with a different level of uncertainty. The left column shows the three simulated data sets, the middle column is the calculated IndVal along the gradient, and the right column is the distribution of the selected split point based on 2000 bootstrap samples. When there is no observation error (top row), the IndVal peaks at the change point. As the level of observation error increases (middle and bottom rows), the IndVal response pattern becomes increasingly irregular and the bootstrap distribution of the selected split point more variable.
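For comparison, the MLE of the change point under the step function model (2) can be computed by profiling the residual sum of squares over the candidate splits (a sketch assuming equal residual variances in the two segments):

step.mle <- function(y, x){
  ord <- order(x); y <- y[ord]; x <- x[ord]
  n <- length(y)
  rss <- sapply(1:(n - 1), function(k)
    sum((y[1:k] - mean(y[1:k]))^2) +
    sum((y[(k + 1):n] - mean(y[(k + 1):n]))^2))
  x[which.min(rss)]   ## the split minimizing the residual sum of squares
}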

When the underlying model is not a step function, the interpretation of the selected split point becomes difficult. We illustrate the difficulties using two simulated examples. First, assume that taxon abundance is a linear function of the gradient (no statistical change point) and that the abundance values are all positive and observed without error at points evenly spaced along the gradient. Under this situation, IndVal is inversely related to the selected split point; that is, the maximum IndVal is at one or the other end of the truncated gradient (Figure 2) when the underlying response pattern is linear. Second, assume that taxon abundance follows a hockey stick or a broken stick model (Figure 3). If the slopes of the two line segments are of the same sign, the maximum IndVal will also occur at one or the other end of the truncated gradient. In fact, it can be shown that the maximum IndVal is always at one or the other end of the truncated gradient if the underlying pattern is monotonic. In other words, a selected split point at one of the two ends of the truncated gradient can be an indication of a monotonic abundance response pattern; further investigation is needed to understand whether the underlying pattern has a change point (e.g., a hockey stick model) or not (e.g., a linear model). However, a permutation test will result in a p-value of 0 (see section 4 for additional comments), thereby confirming the selected split point as a "threshold." This problem is exacerbated when using bootstrapping for estimating the confidence interval of the selected split point. A bootstrap sample contains only a subsample of the potential split points, so each bootstrap sample will result in a selected split point that is slightly larger (or smaller) than the observed split point. (See Section 3.3 for more discussion of the use of the bootstrap.) When plotted, bootstrap split points often show a slightly skewed distribution near one end of the gradient. Both the permutation test p-value of 0 and a concentrated split point distribution from bootstrapping give a false impression of a distinct split point, but they are artifacts of the program (Figures 2 and 3).
More simulations are presented in the submitted manuscript, which concluded that TITAN will likely produce erroneous split points for all response types except the step function model. Even when the underlying model can be approximated by a step function, TITAN will likely fail when the uncertainty in the observed abundance is large.
Figure 2. When taxon abundance is a linear function of the gradient, IndVal is a function of the inverse of the selected split point. The left column shows three data sets with the same underlying linear model but different levels of uncertainty. The middle column shows the calculated IndVal along the truncated gradient. The right column shows the bootstrap estimated split point distribution. The peak IndVal should be at one end of the truncated gradient, but the bootstrap distribution peak is always closer to the middle than the true peak.
Figure 3. Three linear change point models are used as the true models of taxon abundance response. The top row shows a hockey stick model with a flat first segment and a decreasing second segment. The middle row shows a hockey stick model with the two line segments having slopes of opposite signs. The bottom row shows a broken stick model. In all cases, the peak IndVal is at one end of the truncated gradient (middle column). The bootstrap distribution of the selected split point is always shifted slightly towards the middle (right column).

3.3 TITAN abuses statistical tests

Even though the selected split point is not a statistical estimate of a parameter, it is nonetheless a random variable derived from the raw abundance data. Statistical characteristics of such a variable can often be described using nonparametric methods, including nonparametric tests of central tendency (mean, median) and bootstrapping for confidence intervals.
Baker and King [2010] used a permutation test to address the question of whether the maximum difference in the two IndVals based on the selected split point can be attributed to randomness in the data. Based on the definition of IndVal, the null hypothesis of the permutation test is that the distributions of taxon abundance in the two groups are identical. Although the probability distribution of the abundance data is not defined, the null hypothesis implies that the two groups share the same mean, the same median, and all other statistics. The permutation test repeatedly assigns data points into two groups at random, with the same sample sizes as the two groups divided by the selected split point, and the IndVal is calculated each time. These permutation IndVals form a pseudo-random sample of the IndVal variable under the null hypothesis of no change in the abundance data along the gradient. If the sample size is large enough and the null hypothesis is true, these pseudo-random samples can be used to approximate the probability distribution of IndVal under the null hypothesis. When comparing the IndVal associated with the selected split point to this null distribution, the tail area (the fraction of permutation IndVals exceeding the selected split point's IndVal) is the p-value as defined by Fisher [Lenhard, 2006] -- evidence against the null hypothesis. Baker and King [2010] used a default of 250 permutations and suggested (in the online supporting materials) that more than 250 permutations are unnecessary. This statement is without statistical support. In fact, any such general statement about how large a sample size is large enough is misleading, because the answer is largely affected by the underlying probability distribution of the data. Qian [2010] (pages 53-54) illustrates the problem using a simulation of the central limit theorem. When the data sample size is small, the number of possible permutations is also small, and using a number larger than the number of possible permutations is obviously unnecessary. When the sample size is large, the number of possible permutations is also large, and 250 may be too small to characterize the null distribution, especially when the underlying abundance data distribution is highly skewed. When the number of permutations is too small, the test is associated with a high level of uncertainty. For example, we reconstructed the six taxa response data shown in Figure 2 of Baker and King [2010] and performed the permutation test using the default 250 permutations for the selected split point of Taxon D. When the permutation test is repeated, we obtain a very different p-value each time: the range of the permutation estimated p-values is from 0.012 to 0.120 in 2,500 permutation tests (Figure 4). The variation in the p-value is still substantial when the number of permutations is increased to 500 (p-values range from 0.028 to 0.102). The small number of permutations results in high variance in all statistics related to the test, including the z-score.

Figure 4. Permutation estimated p-values can be highly uncertain when the number of permutations used in a test is small. The two histograms show the distributions of permutation estimated p-values based on two simulations of 2,500 permutation tests each; the left panel used 250 permutations and the right panel used 500 permutations.
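The instability is easy to reproduce with any permutation test. The sketch below uses a simple difference in group means in place of IndVal, with simulated data rather than the Baker and King taxa:

set.seed(123)
grp <- rep(1:2, each = 30)
x <- rnorm(60, mean = c(0, 0.8)[grp])    ## a modest group difference
obs <- diff(tapply(x, grp, mean))        ## observed test statistic
perm.p <- function(n.perm)
  mean(replicate(n.perm, diff(tapply(x, sample(grp), mean))) >= obs)
range(replicate(500, perm.p(250)))       ## the estimated p-value varies from run to run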

The permutation test is also used in Baker and King [2010] to calculate a z-score for every potential split point. The intention is to normalize across taxa with different abundances. By calculating the z-scores for all potential splits, TITAN selects the split point with the largest z-score along the truncated gradient. This practice, however, introduces a systematic bias into the selected split point. In a permutation test, the estimated IndVal mean and standard deviation are affected by the degree of balance in the group sample sizes. When the difference in sample sizes of the two groups is large, the estimated variance and mean tend to be large too, but more so the variance than the mean. In TITAN, the unevenness is related to the location of the split point: when the split point is at one or the other end of the truncated gradient, the permutation estimated variance is the largest because the difference in sample sizes is largest; when the split point is in the middle (the center of the data), the permutation variance is the smallest. As the z-score is the difference between the observed IndVal and the permutation mean divided by the permutation standard deviation, the z-scores for potential split points near the two ends of the gradient tend to be smaller than they should be if the null hypothesis of no clustering is true. Figure 5 shows four simulated taxon abundance data series along a gradient, along with the calculated IndVal, the permutation z-scores, the permutation estimated mean IndVal, and the permutation estimated IndVal standard deviation. Given the high variability of the permutation estimated statistics due to the small number of permutations, it is not unusual to see very different selected split points when running TITAN several times on the same data set!

Figure 5. Split points selected using permutation estimated z-scores are systematically biased. From left to right, the first column shows four simulated taxon abundance data sets, the second column shows the respective IndVals, the third column shows the permutation test-based z-scores, the fourth column shows the permutation estimated means, and the fifth column shows the permutation estimated standard deviations. The permutation estimated z-score is the difference between the IndVal and the permutation mean divided by the permutation standard deviation.

TITAN's authors identify synchronicity of thresholds as a major finding of their work with TITAN. It is likely that this apparent synchronicity is an artifact of the TITAN analysis, for two reasons. First, the selected split point of a species derived using z-scores has been shown above to be biased towards the center of the data cloud along the gradient.  Second, in many applications of TITAN the environmental gradient is a gradient of human disturbance (e.g., % developed land or % impervious surface in a watershed) along which the distribution of sites is skewed toward the low end (e.g., there are few highly urbanized sites relative to the population of possible sampling sites).  This biases the peak of the z-score distribution towards the low end of the gradient.  Figure 6 shows an example of a linear response model with sites unevenly distributed along the gradient, with more data points near the low end than the high end (median x-axis value = 0.11).  When the gradient is truncated to allow a minimum of 5 data points in a group (the suggested default for a TITAN analysis), the upper bound of the gradient moves from 1.0 to close to 0.6. When the noise level is 0, we see a clear monotonically increasing trend in IndVal along the gradient, with the highest IndVal at the high end of the truncated gradient (0.6). The monotonic pattern of IndVal is apparent in all cases, but the maximum z-score is always at a point closer to the center of the data cloud (median = 0.11).  As the noise level increases, the peak z-score moves even closer to the data center along the x-axis, resulting in what appears to be a high degree of synchronicity in split points among species.

Figure 6.  Simulations of linear models are used to understand the effects of the z-score bias coupled with a distribution of sampling sites skewed toward the low end of the gradient. Columns are arranged as in Figure 5. Only the linear model was used. Each row represents a different noise level in the data, represented by the coefficient of variation (CV).

TITAN's authors have used synchronicity in species responses to advocate for detecting thresholds based on the responses of individual taxa rather than aggregate community metrics. They suggest that community metrics are relatively insensitive to the synchronous threshold declines they have observed for individual taxa based on TITAN analysis, which they suggest provides a method to determine the magnitude, direction, and uncertainty of responses. Unfortunately, TITAN is a poor method for determining the magnitude, direction, and uncertainty of thresholds, and much of the synchronicity in responses is an artifact of the IndVal method that underlies TITAN. Their criticism of analyses based on community metrics is unwarranted given the inability of TITAN to characterize common response forms and its propensity to extract erroneous thresholds. The reality is that threshold analysis requires describing how taxon abundances change across the gradient, a task for which TITAN and IndVal are ill-suited. The characterization of individual taxon responses and the detection of thresholds should be undertaken by comparing multiple alternative models to determine whether the data fit a response model that is indicative of a threshold or whether the response cannot be differentiated from a model without a threshold (e.g., a linear model).
Using a permutation test repeatedly is also an invitation to type I error. When the null hypothesis of no clustering is true, a permutation test will reject the null hypothesis 5% of the time; in other words, 5 out of 100 potential split points will be statistically significant even when the taxon does not respond to the disturbance. By selecting the split point based on the largest z-score, TITAN runs a real risk of raising false confidence in the selected split point. A recent study suggests that the permutation estimated null distribution can be false when the null hypothesis is itself false, often resulting in increased type II error [Aickin, 2010]. In any case, the use of a permutation test here is unjustified.
TITAN also uses bootstrap resampling to calculate the confidence interval of the selected split point. The bootstrap is a commonly used resampling method for estimating standard deviations and confidence intervals of statistics. It is a Monte Carlo simulation procedure aimed at approximating the sampling distribution of the quantity of interest: it substitutes random samples from the target population with random samples of the same size (drawn with replacement) from the existing data. Asymptotically, i.e., as the sample size of the data increases, bootstrap samples are close to random samples from the population. As a result, the empirical distribution of a quantity calculated from these bootstrap samples can be seen as an approximation of its true sampling distribution. The bootstrap method is, however, not appropriate for a split point problem [Bühlmann and Yu, 2002, Banerjee and McKeague, 2007]. Statistical theory [Bühlmann and Yu, 2002] indicates that the bootstrap estimated standard deviation of the split point is always smaller than the true standard deviation, leading to a confidence interval that is too narrow.  For a Monte Carlo simulation to approximate a random variable, each Monte Carlo sample should be close to a random sample from the target population. When using the bootstrap, the population is approximated by the observed data, and the empirical distribution of the random variable of interest asymptotically approaches the true sampling distribution. This holds for many quantities, but the split point is different. When a sample with k unique gradient values is taken, the number of potential split points is fixed at k - 1, and these potential split points are fixed. Potential split points in a bootstrap sample are always a subsample of the k - 1 potential split points in the original data. In other words, the bootstrapping process repeatedly selects the split point from the same pool of k - 1 potential split points. Consequently, the bootstrap estimated standard deviation is much smaller than it should be (and the estimated confidence interval is much narrower than it should be).  Again, Baker and King recommended a default bootstrap sample size of 500 and suggested that 100 may be enough, without providing any evidence. Whether a number is large enough is determined by many factors, especially the (unknown) underlying population distribution of interest. It is irresponsible to suggest a default number, especially one as small as 100.
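The theoretical result can be illustrated by simulation. The sketch below reuses the step.mle() function from the earlier sketch, with a simulated step function (all values made up); per the theory cited above, the bootstrap sd tends to come out smaller than the sampling sd:

set.seed(7)
sim1 <- function(n = 100){
  x <- runif(n)
  list(x = x, y = ifelse(x < 0.5, 0, 1) + rnorm(n, 0, 0.5))
}
## "true" sampling sd of the split point, from repeated sampling:
true.sd <- sd(replicate(500, {d <- sim1(); step.mle(d$y, d$x)}))
## bootstrap sd from a single sample:
d <- sim1()
boot.sd <- sd(replicate(500, {
  i <- sample(length(d$x), replace = TRUE)
  step.mle(d$y[i], d$x[i])
}))
c(true = true.sd, boot = boot.sd)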
I must take some blame for the use of bootstrapping in a change point problem. I used the bootstrap in a 2003 Ecological Modelling paper on a change point problem. In that paper, I used the classification and regression tree (CART) model as an alternative to the Bayesian change point model.  Because the Bayesian change point model yields the full posterior distribution of the change point, I used the bootstrap to produce a comparable measure of uncertainty for the CART split point.  Ian McKeague pointed out the problem with bootstrapping to me in 2004 when we met at a meeting on statistical issues related to setting a phosphorus standard for the Everglades. It took me quite a while to fully understand the problem. (My other mistake was to call the model nonparametric because CART is a "nonparametric" model. But when we use only the first split on one predictor, the model is a step function with 4 [or 5] parameters.)

3.4 More on z-score

The last step of TITAN is to calculate the community threshold based on the sum of z-scores calculated for all taxa at all potential split points. A z-score for a particular taxon is related to the permutation test as a measure of evidence against the null hypothesis. In this case, the null hypothesis is specific to that taxon (e.g., no clustering in the abundance of the taxon along the disturbance gradient). Different z-scores are for different tests with different nulls. The sum of z-scores is only meaningful for tests sharing the same null (e.g., replicated experiments). In that case, we may take the common null hypothesis of all these tests to be that the split point does not exist for any taxon. Under the common null hypothesis, the z-scores are random samples from the same standard normal distribution and their sum is a random variable with mean 0 and variance N (the number of tests, or taxa). That is, we use the sum as a composite test statistic for the common null hypothesis. (The sum of squared z-scores, which has a χ2 distribution with N degrees of freedom, is a more frequently used test statistic.) In any case, z-scores are to be combined only when they come from tests sharing the same null hypothesis, so that the combined z-score (sum or sum of squares) provides a measure of evidence against the shared null. In TITAN, the z-scores are calculated for individual taxa (hence different null hypotheses) and should not be combined. Combining z-scores based on different nulls is meaningless.
The problem TITAN intends to solve is not a hypothesis testing problem; consequently, the value of a z-score is limited. The species sensitivity distribution concept in ecotoxicology is more suitable. In an ecosystem risk assessment problem, the disturbance gradient is measured by the concentration of a toxic substance and the species-specific "threshold" is set to be, for example, the LC50 (the concentration that results in 50% mortality of the species). At the ecosystem level, the LC50s of many species are plotted along the concentration gradient on the x-axis, and the y-axis shows the fraction of species affected (the % of species with LC50 below a given concentration). Here, we can replace the LC50 with the selected split points to form the cumulative taxon split point distribution. Mathematically, this distribution is the hyper-distribution of species-specific split points under a Bayesian hierarchical modeling framework, representing the distribution of taxon-specific split points.
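Such a cumulative taxon split point distribution takes only a few lines to draw; in the sketch below, splits stands in for a vector of taxon-specific split points (the values here are made up):

splits <- sort(runif(40, 0.1, 0.9))  ## hypothetical taxon-specific split points
plot(splits, seq_along(splits) / length(splits), type = "s",
     xlab = "disturbance gradient", ylab = "fraction of taxa affected")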

4. Programming Issues

TITAN is implemented in R. The authors' R programs have a few flaws that may lead to unintended consequences.
  • Treatment of ties in the observed disturbance gradient. Because the disturbance gradient is the only presumed cause of taxa clustering, observations with the same gradient value should be grouped together. The natural programming strategy is to first find the list of unique values along the gradient and then attach each observed abundance value to these unique x-values. In TITAN, instead of grouping observed abundance values by gradient value, each observation is treated as independent and sorted along the gradient. When there are multiple abundance values for one gradient value, these abundance values are randomly assigned to one or the other group. Consequently, TITAN results will vary from run to run on the same data set. This problem matters because TITAN uses bootstrapping, and the same observation in the original data is likely represented more than once in a bootstrap sample. The problem is difficult to detect because TITAN uses the sum of z-scores to pick the "community" level split point and the z-scores are highly variable because of the small number of permutations.
  • When running TITAN, we often see a permutation test p-value of 0.004. This is because the p-value is calculated as (1 + the number of permutation IndVals exceeding the observed IndVal) divided by the number of permutations. With the number of permutations set at 250, a p-value of 0.004 indicates that no permutation IndVal exceeded the observed IndVal. As discussed in Section 3.2, a p-value of 0 may be an indication that the underlying response pattern is monotonic. By adding 1 to the numerator, the calculation masks these zero p-values, and the results are highly misleading.

5. Discussion

TITAN is intended to uncover discontinuous jumps in taxa abundance data along a disturbance gradient. Instead of formulating specific models of abundance, TITAN's authors used the clustering indicator IndVal. The resulting program is ambiguous in terms of what kind of threshold is detected. The misuse of the permutation test results in a systematic bias in the selected split point towards the center of the data cloud along the gradient. TITAN is written to process large data sets with hundreds of taxa from many sites; as a result, the behavior of the program is not clear to a reviewer. It took us several months to recode the program and set up various test cases to understand what the program is doing. The peer-review process is not set up for such a time-consuming task. The authors are ultimately responsible for checking their work: simulations like those presented in this post should be conducted before submission. Journals, when considering methods manuscripts, should develop a process of verification (on top of the usual peer-review process) to avoid less obvious errors.


References

    M. Aickin. Invalid permutation tests. International Journal of Mathematics and Mathematical Sciences, Article ID 769780, 10 pages, 2010. doi: 10.1155/2010/769780.
    A. Antoniadis and I. Gijbels. Detecting abrupt changes by wavelet methods. Journal of Nonparametric Statistics, 14(1-2):7-29, 2002.
    M.E. Baker and R.S. King. A new method for detecting and interpreting biodiversity and ecological community thresholds. Methods in Ecology and Evolution, 1(1):25-37, 2010.
    M. Banerjee and I.W. McKeague. Confidence sets for split points in decision trees. The Annals of Statistics, 35(2):543-574, 2007.
    P. Bühlmann and B. Yu. Analyzing bagging. The Annals of Statistics, 30(4):927-961, 2002.
    A. Dempfle and W. Stute. Nonparametric estimation of a discontinuity in regression. Statistica Neerlandica, 56(2):233-242, 2002. doi: 10.1111/1467-9574.00196.
    M. Dufrêne and P. Legendre. Species assemblages and indicator species: the need for a flexible asymmetrical approach. Ecological Monographs, 67(3):345-366, 1997.
    R.A. Fisher. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, Series A, 222:309-368, 1922.
    I. Gijbels, P. Hall, and A. Kneip. On the estimation of jump points in smooth curves. Annals of the Institute of Statistical Mathematics, 51:231-251, 1999. doi: 10.1023/A:1003802007064.
    A. Khodadadi and M. Asgharian. Change-point problem and regression: an annotated bibliography. Technical report, COBRA Preprint Series, Article 44, 2008. URL http://biostats.bepress.com/cobra/ps/art44.
    J. Lenhard. Models and statistical inference: The controversy between Fisher and Neyman-Pearson. The British Journal for the Philosophy of Science, 57(1):69-91, 2006.
    S.S. Qian. Environmental and Ecological Statistics with R. Chapman and Hall/CRC Press, 2010.
