Song's Blog: TITANic Journey

After a year's work, wait, and review, we have finally submitted a critic of TITAN to the Journal of the Society of Freshwater Science. Based on USGS internal review, we shortened the manuscript and removed some materials, mostly statistics. In this post, I report the statistics not presented in the manuscript.

First, a quick background summary.

In 2010, Ryan Kind and Matt Baker published a paper in the journal Methods in Ecology and Evolution (MEE) and later in the Journal of the North American Benthological Society (JNABS) proposing a "statistical" method for detecting and quantifying ecological threshold at both species level and community level. The method is code named TITAN. The paper was later listed as one of MEE's most popular article. After the publication, the authors promoted the paper in the media (NPR Maryland) and to many state and federal agencies. During 2010, I was working with Tom Cuffney on a paper discussing the problems associated with threshold modeling (later published with a title "To threshold or not to threshold? That's the question"). The premise of our paper is that we must understand the underlying assumption of a model before applying it. Also, we discussed the difference between a change point (a statistical concept) and a management threshold (a regulatory concept). When we read TITAN, we know that there lies a problem. In this post, I introduce TITAN briefly, followed by a discussion on statistical issues related to the method. These issues include (1) the use of a permutation test, (2) bootstrapping, (3) repeated testing, (4) characteristics of the permutation test used in TITAN, and (5) consequences of the misuse of the permutation test. I want to use this post to show the difficulty encountered in the peer-review process where these statistical issues are not easy to unravel in a typical review process.

1. Introduction

My discussion of TITAN is based on my understanding of data analysis and statistical inference. The emphasis of data analysis using statistical tools is to uncover the underlying causal relation that resulted in the observed data. Statistical analysis and modeling usually follows the hypothetical deductive reasoning process first formalized by Fisher [1922]. Under Fisher’s logic, the first step of a statistical problem is to pose hypothesis, based on the answer to the question “of which distribution are the data a random sample.” As statistics is a science of randomness and probability distribution is the quantitative measure of randomness, a statistical inference is the process of learning about the parameters defining the probability distribution. Specifically, we are interested in (1) how to quantify the parameter of interested when random samples from the population are available (the problem of parameter estimation) and (2) quantifying the uncertainty of the estimate. The maximum likelihood estimator is the most commonly used statistical method for parameter estimation, while the uncertainty of the estimate is most commonly defined through the sampling distribution of the estimator. Recently, Baker and King [2010] proposed a program known as the threshold indicator taxa analysis (TITAN) for calculating ecological community “thresholds.” The program is promoted as a “statistical” method capable of detecting a community level threshold. This paper discusses the limitations of TITAN and statistical errors in the program that may lead to misleading results.

2. TITAN

TITAN can be summarized as a collection of computer programs for selecting a split point along a disturbance gradient to divide the collection of samples into two groups. The split point is selected based on an indicator value of a taxon’s group association. The indicator value was proposed to describe a taxon’s association with a number of existing clusters [Dufrêne and Legendre, 1997]. When there are only two clusters, the indicator value (or IndVal) is the product of the taxon’s relative abundance and the frequency of occurrence:

(1)

where i = 1, 2 is the cluster index, A_i is the relative abundance calculated as the ratio of mean total abundance in cluster i (a_i) over the sum of the cluster means (A_i = a_i∕∑ _i=1²(a_i)), and B_i is the frequency of occurrence (fraction of non-zero observations) in cluster i. While IndVal is derived for describing the association of a given taxon to an existing cluster, TITAN uses IndVal to define clusters along a disturbance gradient. Observations along the gradient is successively divided into two groups by moving a dividing line along the gradient, at each potential split point an IndVal is calculated for each group. The division that resulted in the largest difference between the two IndVal’s is defined as the split point for the taxon. (Baker and King [2010] defines the IndVal for each potential split point as the larger of the two IndVals, the split point based on the maximum IndVal under this definition is the same as the split point with the largest difference between the two IndVals based on the definition of Dufrêne and Legendre [1997].) This process searches a maxima of the indicator value differences between the two groups. Because the calculation of IndVal requires two clusters (or groups in this case), TITAN starts the search at some distance from the low end of the gradient to allow a pre-determined number of data points to be included in the “left group” and ends also at a distance from the upper end of the gradient to allow the same minimum number of data points in the “righr group.” This calculation is equivalent to a truncated variable transformation (truncating the data at both ends of the disturbance gradient and transform the total abundance data into IndVal) and the maxima of the transformed data is defined as the split point along the gradient.

Once the split point is selected, a permutation test is employed to test its “statistical significance.” The permutation test randomly divide the data into two groups each with the same number of observations as the division that produces the maximum IndVal. The statistical significance is represented by either the p-value (fraction of permutation IndVal’s exceeding the IndVal associated with the selected split point) or the z-score (the standardized IndVal with respect to the mean and standard deviation of permutation Indval’s).

To translate taxon-specific split points into a community “threshold,” TITAN performs the permutation test of IndVal for all possible split points along the gradient and record the z-scores for all taxa. At each potential split point along the gradient, z-scores of all taxa are summed. The community-level threshold is defined to be split point associated with the maximum sum of z-scores. There are two sum of z-scores, one for “decreasing” taxa (those with a higher relative abundance at the low side of the selected split point) and the other for “increasing” taxa.

Uncertainty associated with the selected split point is estimated using a bootstrapping resampling method.

3. Statistical Issues

3.1 TITAN is not a statistical model

The main problem of TITAN is that the program is inherently not a statistical method. In a typical statistical problem, variables are grouped into response variable(s) and predictor variable(s). A probability distribution assumption is first introduced to describe the variation of the response variable(s), followed by a model describing the link between the mean variable of the response variable distribution and predictor variable(s) (the mean function). This interpretation of statistical inference was first introduced by Fisher [1922], and fully explained by Lenhard [2006]. Without the distributional assumption, statistical inference and modeling is impossible.

In selecting taxon-specific split point, TITAN uses an abundance measure of the taxon. No probability distribution assumption was made. A misleading statement was made to claim that TITAN is a nonparametric statistical method. The term “nonparametric” has two specific meanings. First, when used in describing a collection of tests based on ordered statistics, the term nonparametric is a synonym for non-normal. The distribution is undefined but replaced by the distribution of the test statistics derived from the rank-transformed data. Second, when used in describing a class of statistical models based on smoothing technique, the term nonparametric refers to the mean function to be not characterized by a known formula with a number of parameters. The distributional assumption is still needed to specify the model. The response variable distribution defines the likelihood function, which is the basis of all statistical models. Statistical inference is about the estimation of the model parameters and the quantification of uncertainty of the estimated parameter values. Parameter estimation is based on the maximum likelihood estimator in classical statistics and the posterior parameter distribution in the Bayesian statistics, both methods require the probability distribution assumption to specify the likelihood function. Uncertainty analysis in classical statistics is based on the sampling distribution of the estimator which is directly related to the probability distribution of the response variable. Uncertainty analysis in a Bayesian statistics is based on the posterior distribution of model parameters and the posterior distribution is largely based on the likelihood function.

Without a probability assumption of the response variable(s), TITAN is just a collection of computer programs for data processing and transformation. The split point for a taxon is selected, not estimated. Whether it is a statistical model or not may not be a problem, but TITAN’s lack of a definite distributional assumption results in ambiguity in the meaning of the selected split point.

3.2 Statistical change point problems

TITAN’s authors repeated claim that TITAN is a change point model. The term “change point” in statistics means a jump discontinuity in an otherwise smooth curve. The aim of a change point analysis is to detect and locate the discontinuity. For a parametric model, a change point can be defined as the predictor variable value at which parameters defining the response variable distribution change (see the selected change point literature summarized in Khodadadi and Asgharian [2008]). There are also nonparametric regression methods [Antoniadis and Gijbels, 2002, Dempfle and Stute, 2002, Gijbels et al., 1999] for a change point problem. Change point estimation methods are model-specific. Different response variable distributions and/or different mean function (e.g., linear versus nonlinear) will lead to different likelihood function, thereby different computation methods for MLE. TITAN does not define a response variable, nor a probability distribution. Consequently, the meaning of the selected split point is dependent on the underlying model, both in terms of response variable distribution and the mean function. As the response variable is likely taxa abundance, and a log transformation is recommended, a normality assumption is the least controversial. Based on the definition of IndVal, when there are two clusters the difference in taxa abundance is mostly between cluster while within cluster differences are minimal. Accordingly, when the two clusters are defined by a split point along the gradient, the statistical model is the step-function:

(2)

where y_j is the jth observed (log-transformed) taxa abundance value, x_j is the corresponding disturbance gradient value, μ₁,μ₂ are the two cluster means, and ϵ₁,ϵ₂ are the residual terms (ϵ_i ~ N(0,σ_i²), i = 1, 2). Indeed, the selected split point is almost always coincide with the MLE of the change point under the step function model given that σ_i² is small. Figure 1 shows some simulations of step function and selected split points.

Figure 1. Three simulations of the step function, each with a different level of uncertainty. The left column shows the three simulated data sets, the middle column is the calculated IndVal along the gradient, and the right column is the distribution of selected split point based on 2000 bootstrap samples. When there are no observation error (top row), the IndVal peaks at the change point. As the level of observation error increases (middle and bottom rows), the IndVal response pattern is increasingly irregular and the bootstrap distribution of the selected split point more variable.

When the underlying model is not a step function, the interpretation of the selected split point becomes difficult. We illustrate the difficulties using two simulated examples. First, assuming the change in taxa abundance is a linear function of the gradient (no statistical change point) and the abundance values are all positive and are observed without error at points evenly spaced along a gradient. Under this situation, IndVal is inversely related to the selected split point. That is, the maximum IndVal is at one or the other end of the truncated gradient (Figure 2) when the underlying response pattern is linear. Second, assuming the change in taxa abundance is a hockey stick or a broken stick model (Figure 3). If the slopes of the two line segments are of the same sign, the maximum of IndVal will also occur at one or the other end of the truncated gradient. In fact, it can be shown that the maximum of IndVal is always at one or the other end of the truncated gradient if the underlying pattern is monotonic. In other words, the selected split point being one of the two ends of the truncated gradient can be an indication of a monotonic abundance response pattern, further investigation is needed to understand whether the underlying pattern has a change point (e.g., hockey stick model) or not (e.g., linear model). However, a permutation test will result in a p-value of 0 (see section 4 for additional comments), thereby confirming the selected split point as a “threshold.” This problem is exacerbated when using bootstrapping for estimating the confidence interval of the selected split point. A bootstrap sample contains only a subsample of the potential split points, thus each bootstrap sample will result in a selected split point that is slightly larger (or smaller) than the observed split point. (See Section 3.3 for more discussion about the use of bootstrap.) When plotted, bootstrap split points often show a slightly skewed distribution near one end of the gradient. Both the permutation test p-value of 0 and a concentrated split point distribution from bootstrapping give a false impression of a distinct split point, but they are artifacts of the program (Figures 2 and 3).

More simulations are presented in the submitted manuscript, which concluded that TITAN will likely result in erroneous split point for all response types except the step function model. Even when the underlying model can be approximated by the step function, TITAN will likely fail when the uncertainty in the observed abundance is large.

Figure 2. When taxa abundance is a linear function of the gradient, IndVal is a function of the inverse of the selected split point. The left column shows three data sets with the same underlying linear model but different levels of uncertainty. The middle column shows the calculated IndVal along the truncated gradient. The right column shows bootstrap estimated split point distribution. The peak IndVal should be at one end of the truncated gradient, but the bootstrap distribution peak is always closer to the middle than the true peak.

Figure 3. Three linear change point models are used as the true models of taxa abundance response. The top row shows a hockey stick model with a flat first segment and a decreasing second segment. The middle row shows a hockey stick model with the two line segment having slopes of opposite signs. The bottom row shows a broken stick model. In all cases, the peak IndVal is at one end of the truncated gradient (middle column). The bootstrap distribution of the selected split point is always shifted slightly towards the middle (right column).

3.3 TITAN abuses statistical tests

Even though the selected split point is not a statistical estimate of a parameter, the split point is, nonetheless, a random variable derived from raw abundance data. Statistical characteristics of the variable can often be best described by using many nonparametric methods, including nonparametric tests for testing the central tendency (mean, median) and bootstrapping for confidence interval.

Baker and King [2010] used a permutation test to address the question whether the maximum difference in the two IndVals based on the selected split point can be attributed to randomness in the data. Based on the definition of IndVal, the null hypothesis of the permutation test is that the distributions of taxa abundance in the two groups are identical. Although the probability distribution of the abundance data is not defined, the null hypothesis implies that the two groups share the same mean, the same median, and any other statistics. The permutation test repeatedly assigns data points into two groups at random with the same sample sizes as the two groups divided by the selected split point. The IndVal is calculated each time. These random permutation IndVals form a pseudo-random sample of the IndVal variable under the null hypothesis of no changes in the abundance data along the gradient. If the sample size is large enough and the null hypothesis is true, these pseudo-random samples can be used to approximate the probability distribution of IndVal under the null hypothesis. When comparing the IndVal associated with the selected split point to the null hypothesis distribution, the tail-area (or the fraction of permutation IndVals exceeding the selected split point IndVal) is the p-value as defined by Fisher [Lenhard, 2006] – evidence against the null hypothesis. Baker and King [2010] used a default number of permutation of 250 and suggested (in the online supporting materials) that a larger than 250 permutations is unnecessary. This statement is without any statistical support. In fact, any such general statement with regard to the question of how large a sample size is large enough is misleading because the answer is largely affected by the underlying probability distribution of the data. Qian [2010] (pages 53-54) illustrates the problem using a simulation of the central limit theorem. When the data sample size is small, the number of possible permutations is also small. Using a number larger than the possible number of permutation is obviously unnecessary. When the sample size is large, the possible number of permutation is also large, 250 may be too small to characterize the null hypothesis distribution, especially when the underlying abundance data distribution is highly skewed. When the number of permutation is too small, the test is associated with a high level of uncertainty. For example, we reconstructed the six taxa response data shown in Figure 2 of Baker and King [2010] and performed the permutation test using 250 as the default number of permutations for the selected split point of Taxon D. When the permutation test is repeated, we receive a very different p-value each time. In fact, the range of the permutation estimated p-values is from 0.012 to 0.120 in 2,500 permutation tests (Figure 4). The variation in p-value is still substantial when the number of permutation is increased to 500 (p-value ranges from 0.028 to 0.102). The small number of permutations resulted in a high variance in all statistics related to the test, including the z-score.

Figure 4. Permutation estimated p-values can be highly uncertain when the number of permutations used in a test is small. The two histograms show the distributions of permutation estimated p-values based on two simulation of 2,500 permutation tests, the right panel used 250 permutations, and the right panel used 500 permutations.

The permutation test is also used in Baker and King [2010] to calculate the z-score for all potential split points. The intention is to normalize taxa with different abundances. By calculating the z-scores for all potential splits, TITAN selects the split point based on the largest z scores along the truncated gradient. The practice, however, introduces a systematic bias into the selected split point. In a permutation test, the estimated IndVal mean and standard deviation is affected by the degree of balance in sample sizes. When the difference in sample sizes of the two groups is large, the estimated variance and mean tend to be large, too. But more so in variance than in the mean. In TITAN, the unevenness is related to the location of the split point. When the split point is at one or the other end of the truncated gradient, the permutation estimated variance is the largest due to the largest difference in sample sizes. When the split point is at the middle (center of the data), the permutation variance is the smallest. As the z-score is the difference between the observed IndVal and the permutation mean divided by the permutation standard deviation, the z-scores for those potential split points near both ends of the gradient tend to be smaller than they should be if the null hypothesis of no clustering is true. Figure 5 shows four simulated taxa abundance data series along a gradient, along with the calculated IndVal, permutation z-scores, permutation estimated mean IndVal and permutation estimated IndVal standard deviation. Given the high variability in the permutation estimated statistics because of the small number of permutations, it is not unusual to see very different selected split points when running TITAN several times using the same data set!

Figure 5 – Split points selected using permutation estimated z-score are systematically biased. From left to right, the first column shows four simulated taxa abundance data sets, the second column shows the respective IndVals, the third column shows the permutation test-based z-scores, the fourth column shows the permutation estimated means, and the fifth column shows the permutation estimated standard deviation. The permutation estimated z-score is the difference between IndVal and the permutation mean divided by the permutation standard deviation.

TITAN's authors identify synchronicity of thresholds as a major finding of their work with TITAN. It is likely that this apparent synchronicity is an artifact of the TITAN analysis for two reasons. First, the selected split point of a species derived using z-scores has been shown to be biased towards the center of the data cloud along the gradient. Second, in many applications of TITAN the environmental gradient is a gradient of human disturbance (e.g., % developed land or % impervious surface in a watershed) in which the distribution of sites along the gradient are skewed toward the low end of the gradient (e.g., there are few highly urbanized sites relative to the population of possible sampling sites). This led to the peak of the z-score distribution to be biased towards the low end of the gradient. Figure 6 shows an example of a linear response model with sites unevenly distributed along the gradient, with more data points closer to the low end of the gradient than to the high end (median x-axis value = 0.11). When the gradient is truncated to allow a minimum of 5 data points in a group (suggested default for TITAN analysis), the upper bound of the gradient is moved from 1.0 to close to 0.6. When the noise level is 0, we see a clear monotonic increasing trend in IndVal along the gradient with the highest IndVal at the high end of the truncated gradient (0.6). The monotonic pattern of IndVal is apparent in all cases, but the maximum z-score is always at a point closer to the center of the data cloud (median = 0.11). As the noise level increases, the peak z-score moves even closer to the data center along the x-axis resulting in what appears to be a high degree of synchronicity in split points among species.

Figure 6. Simulations of linear models are used to understand the effects of the z-score bias coupled with a distribution of sampling sites skewed toward the low end of the gradient. Columns are arranged as in Figure 11. Only the linear model was used. Each row represents a different level noise in the data represented by the coefficient of variation (CV).

TITAN's authors have used synchronicity in species responses to advocate for detecting thresholds based on the responses of individual taxa rather than aggregate community metrics. They suggest that community metrics are relatively insensitive to the synchronous threshold declines that they have observed for individual taxa based on TITAN analysis, which they suggest provides a method to determine the magnitude, direction, and uncertainty in responses. Unfortunately, TITAN is a poor method for determining the magnitude, direction and uncertainty of thresholds and much of the synchronicity in responses is an artifact of the IndVal method that underlies TITAN. Their criticism of analyses based on community metrics is unwarranted given the inability of TITAN to characterize common response forms and the propensity for it to extract erroneous thresholds. The reality is that threshold analysis requires describing how taxa abundances change across the gradient, a task for which TITAN and IndVal are ill suited. The characterization of individual taxa responses and the detection of thresholds should be undertaken by comparing multiple alternative models to determine whether the data fit a response model that is indicative of a threshold or whether the response cannot be differentiated from a model without a threshold (e.g., linear model).

Using permutation test repeatedly is also an invitation to type I error. When the null hypothesis of no clustering is true, a permutation test will reject the null hypothesis 5% of the time. In other words, 5 out 100 potential split points will be statistically significant even the taxon does not respond to the disturbance. By selecting the split point based on the largest z-value, TITAN’s risk of raising false confidence in the selected split point is real. A recent study suggests that permutation estimated null hypothesis distribution can be false when the null hypothesis is itself false, often resulting in increase type II error [Aickin, 2010]. In any case, the use of a permutation test in this case is unjustified.

TITAN also used bootstrap resampling method for calculating the confidence interval of the selected split point. The bootstrapping method is a commonly used resampling methods for estimating standard deviation and confidence intervals of statistics. The bootstrapping approach is a Monte Carlo simulation procedure aimed at obtaining an approximation of the sampling distribution of the parameter of interest. It substitutes random samples from the target population with random samples of the same size (with replacement) from the existing data. Asymptotically, i.e., as sample size of the data increases, bootstrap samples are close to random samples from the population. As a result, empirical distribution of variable calculated from these bootstrap samples can be seen as approximate to the true sampling distribution of the variable of interest. The bootstrap method is, however, not appropriate for a split point problem [Bühlmann and Yu, 2002, Banerjee and McKeague, 2007]. Statistical theories [Bühlmann and Yu, 2002] indicate that bootstrap estimated standard deviation of the split point is always smaller than the true standard deviation, leading to a narrower confidence interval. For a Monte Carlo simulation to approximate a random variable, each Monte Carlo sample should be close to be a random sample from the target population. When using bootstrapping, the population is approximated by the observed data. But the empirical distribution of the random variable of interest is asymptotically approaching the true sampling distribution. This result is true for many random variables of the population. But the split point is different. When a sample (with k unique gradient values) is taken, the number of potential split points is set (k - 1) and these potential split points are fixed. Potential split points in a bootstrap sample are always a subsample of the k - 1 potential split points in the original data. In other words, the bootstrapping process repeatedly selects the split point from the same pool of k - 1 potential split points. Consequently, the bootstrap estimated standard deviation is much smaller than it should be (and the estimated confidence interval is much narrower than it should be). Again, Baker and King recommended a default bootstrap sample size of 500 and suggested 100 may be enough without providing any evidence. Whether a number is large enough is determined by many factors, especially the (unknown) underlying population distribution of interest. It is irresponsible to suggest a default number, especially when it is as small as 100.
I must take some blame for the use of bootstrapping in a change point problem. I used bootstrap in a 2003 Ecological Modelling paper on a change point problem. In that paper, I used the classification and regression tree (CART) model as an alternative to the Bayesian change point model. Because the Bayesian change point model yields the distribution of the estimated change point, I used bootstrap for the split point confidence interval. Ian McKeague pointed out the problem of using bootstrapping to me in 2004 when we met at a meeting on statistical issues related to setting a phosphorus standard for the Everglades. It took me quite a while to finally understand the problem. (My other mistake was to call the model a nonparametric one because CART is a "nonparametric" model. But when we use the first split on one predictor, the model is a step function with 4 [or 5] parameters.)

3.4 More on z-score

The last step of TITAN is to calculate the community threshold based on the sum of z-scores calculated for all taxa for all potential split points. A z-score for a particular taxon is related to the permutation test as a measure of evidence against the null hypothesis. In this case, the null hypothesis is specific to the particular taxon (e.g., no clustering in abundance of the taxon along the disturbance gradient). Different z-scores are for different tests with different nulls. The sum of z-scores is only meaningful for tests sharing the same null (e.g., replicated experiments). In this case, we may assume that the common null hypothesis of all these tests is that the split point does not exist for all taxa. Under the common null hypothesis, z-scores are random samples from the same standard normal distribution and the sum of these z-scores is a random variable with mean 0 and variance N (the number of tests or taxa). That is, we use the sum as a composite test statistics for the common null hypothesis. But the sum of squares of z-score (which has a χ² distribution with degrees of freedom of N) is a more frequently used test statistic. In any case, z-scores are to be combined only when these z-scores are from tests sharing the same null hypothesis, such that, the combined z-score (sum or sum of squares) provides a measure of evidence against the shared null hypothesis. In TITAN, z-scores, calculated for individual taxa (hence different null hypotheses), should not be combined. Combining z-scores based on different nulls is meaningless.

Problems TITAN intends to solve is not a hypothesis testing problem. Consequently, the value of a z-score is limited. The species sensitivity distribution concept in ecotoxicology is more suitable. In an ecosystem risk assessment problem, the disturbance gradient is measured by the concentration of a toxic substance and the species-specific “threshold” is set to be, for example, LC50 (a concentration that may result in 50% fatality of the species). At an ecosystem level, LC50s of many species are plotted along the concentration gradient on the x-axis, and the y-axis shows fractions of species affected (or % species with LC50 ≤ a given concentration). Here, we can replace LC50 with the selected split points to form the cumulative taxa split point distribution. Mathematically, this distribution is the hyper-distribution of species-specific split points under a Bayesian hierarchical modeling framework, representing the distribution of taxon-specific split points.

4 Programming Issues

TITNA is implemented in R. The authors’ R programs have a few flaws that may lead to unintended consequences.

Treatment of ties in the observed disturbance gradient. Because the disturbance gradient is the only cause for taxa clustering, observations with the same disturbance gradient value should be grouped together. The programming strategy is to first find a list of unique values along the gradient and then attach each observed abundance data value to these unique x-values. In TITAN, instead of grouping observed abundance data values by the disturbance gradient data values, each observation is treated as independent and sorted along the gradient. When there are multiple abundance values for a gradient value, these abundance values are randomly assigned to one or the other group. Consequently, TITAN result will vary from run to run using the same data set. This problem is important because TITAN uses bootstrapping and the same observation in the original data is likely represented more than once in a bootstrap sample. This problem is difficult to detect because TITAN uses the sum of z-scores to pick the “community” level split point and the z-scores are highly variable because of the small number of permutations.
When running TITAN, we often see a permutation test p-value of 0.004. This is because the p-value was calculated as 1+ number of permutations with IndVal’s exceeding the observed IndVal divided by the numer of permutations. With the number of permutation is set to be 250, a p-value of 0.004 indicates no permutation IndVals exceeds the observed IndVal. As discussed in Section 4.2, a p-value of 0 may be an indication that the underlying pattern of response is monotonic. By adding 1 to the p-value calculation, the results are highly misleading.

5 Discussion

TITAN is intended for uncovering discontinuous jumps in taxa abundance data along a disturbance gradient. Instead of formulating specific models about abundance, TITAN’s authors used the clustering indicator IndVal. The resulting program is ambiguous in terms what kind of threshold was detected. The misuse of the permutation test resulted in a systematic bias in the selected split point towards the center of the data cloud along the gradient. TITAN is written to process large data sets with hundreds of taxa from many sites. As a result, the behavior of the program is not clear to a reviewer. It took us several months to recode the program and set up various test cases to understand what the program is doing. The peer-review process is not set up for such time-consuming task. The authors are ultimately responsible for checking their work. Simulations presented in this post should be conducted prior to submission. Journals, when considering methods manuscripts, should develop a process of verification (on top of the usual peer-review process) to avoid less obvious errors.

References

M. Aickin. Invalid permutation tests. International Journal of Mathematics and Mathematical Sciences, Article ID 769780: 10 pages, 2010. doi: 10.1155/2010/769780.

A. Antoniadis and I. Gijbels. Detecting abrupt changes by wavelet methods. Journal of Nonparametric Statistics, 14 (1-2): 7–29, 2002.

M.E. Baker and R.S. King. A new method for detecting and interpreting biodiversity and ecological community thresholds. Methods in Ecology and Evolution, 1 (1): 25–37, 2010.

M. Banerjee and I. W. McKeague. Confidence sets for split points in decision trees. The Annals of Statistics, 35 (2): 543–574, 2007.

P. Bühlmann and B. Yu. Analyzing bagging. The Annals of Statistics, 30 (4): 927–961, 2002.

A. Dempfle and W. Stute. Nonparametric estimation of a discontinuity in regression. Statistica Neerlandica, 56 (2): 233–242, 2002. ISSN 1467-9574. doi: 10.1111/1467-9574.00196. URL http://dx.doi.org/10.1111/1467-9574.00196.

M. Dufrêne and P. Legendre. Species assemblages and indicator species: the need for a flexible asymmetrical approach. Ecological Monographs, 67 (3), 1997.

R.A. Fisher. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, Series A, 222: 309–368, 1922.

I. Gijbels, P. Hall, and A. Kneip. On the estimation of jump points in smooth curves. Annals of the Institute of Statistical Mathematics, 51: 231–251, 1999. ISSN 0020-3157. URL http://dx.doi.org/10.1023/A:1003802007064. 10.1023/A:1003802007064.

Ahmad Khodadadi and Masoud Asgharian. Change-point problem and regression: An annotated bibliography. Technical report, COBRA Preprint Series. Article 44, 2008. URL http://biostats.bepress.com/cobra/ps/art44.

J. Lenhard. Models and statistical inference: The controversy between Fisher and Neyman-Pearson. The British Journal for the Philosophy of Science, 57 (1): 69–91, 2006.

S.S. Qian. Environmental and Ecological Statistics with R. Chapman and Hall/CRC Press, 2010.

Song's Blog

Wednesday, May 23, 2012

TITANic Journey