First, a quick background summary.
In 2010, Ryan Kind and Matt Baker published a paper in the journal Methods in Ecology and Evolution (MEE) and later in the Journal of the North American Benthological Society (JNABS) proposing a "statistical" method for detecting and quantifying ecological threshold at both species level and community level. The method is code named TITAN. The paper was later listed as one of MEE's most popular article. After the publication, the authors promoted the paper in the media (NPR Maryland) and to many state and federal agencies. During 2010, I was working with Tom Cuffney on a paper discussing the problems associated with threshold modeling (later published with a title "To threshold or not to threshold? That's the question"). The premise of our paper is that we must understand the underlying assumption of a model before applying it. Also, we discussed the difference between a change point (a statistical concept) and a management threshold (a regulatory concept). When we read TITAN, we know that there lies a problem. In this post, I introduce TITAN briefly, followed by a discussion on statistical issues related to the method. These issues include (1) the use of a permutation test, (2) bootstrapping, (3) repeated testing, (4) characteristics of the permutation test used in TITAN, and (5) consequences of the misuse of the permutation test. I want to use this post to show the difficulty encountered in the peer-review process where these statistical issues are not easy to unravel in a typical review process.
1. Introduction
My discussion of TITAN is based on my understanding of data analysis and statistical inference. The emphasis of data analysis using statistical tools is to uncover the underlying causal relation
that resulted in the observed data. Statistical analysis and modeling usually follows the
hypothetical deductive reasoning process first formalized by Fisher [1922]. Under Fisher’s logic,
the first step of a statistical problem is to pose hypothesis, based on the answer to the question
“of which distribution are the data a random sample.” As statistics is a science of randomness
and probability distribution is the quantitative measure of randomness, a statistical inference is
the process of learning about the parameters defining the probability distribution. Specifically,
we are interested in (1) how to quantify the parameter of interested when random samples from
the population are available (the problem of parameter estimation) and (2) quantifying the
uncertainty of the estimate. The maximum likelihood estimator is the most commonly used
statistical method for parameter estimation, while the uncertainty of the estimate is most
commonly defined through the sampling distribution of the estimator. Recently, Baker and
King [2010] proposed a program known as the threshold indicator taxa analysis (TITAN) for
calculating ecological community “thresholds.” The program is promoted as a “statistical”
method capable of detecting a community level threshold. This paper discusses the
limitations of TITAN and statistical errors in the program that may lead to misleading
results.
(1)
2. TITAN
TITAN can be summarized as a collection of computer programs for selecting a split point along a disturbance gradient to divide the collection of samples into two groups. The split point is selected based on an indicator value of a taxon’s group association. The indicator value was proposed to describe a taxon’s association with a number of existing clusters [Dufrêne and Legendre, 1997]. When there are only two clusters, the indicator value (or IndVal) is the product of the taxon’s relative abundance and the frequency of occurrence:(1)
where i = 1, 2 is the cluster index, Ai is the relative abundance calculated as the ratio of mean
total abundance in cluster i (ai) over the sum of the cluster means (Ai = ai∕∑
i=12(a
i)), and Bi
is the frequency of occurrence (fraction of non-zero observations) in cluster i. While IndVal is
derived for describing the association of a given taxon to an existing cluster, TITAN uses IndVal
to define clusters along a disturbance gradient. Observations along the gradient is successively
divided into two groups by moving a dividing line along the gradient, at each potential split
point an IndVal is calculated for each group. The division that resulted in the largest difference
between the two IndVal’s is defined as the split point for the taxon. (Baker and King [2010]
defines the IndVal for each potential split point as the larger of the two IndVals, the split
point based on the maximum IndVal under this definition is the same as the split
point with the largest difference between the two IndVals based on the definition
of Dufrêne and Legendre [1997].) This process searches a maxima of the indicator
value differences between the two groups. Because the calculation of IndVal requires
two clusters (or groups in this case), TITAN starts the search at some distance from
the low end of the gradient to allow a pre-determined number of data points to be
included in the “left group” and ends also at a distance from the upper end of the
gradient to allow the same minimum number of data points in the “righr group.” This
calculation is equivalent to a truncated variable transformation (truncating the data at
both ends of the disturbance gradient and transform the total abundance data into
IndVal) and the maxima of the transformed data is defined as the split point along the
gradient.
Once the split point is selected, a permutation test is employed to test its “statistical
significance.” The permutation test randomly divide the data into two groups each with the same
number of observations as the division that produces the maximum IndVal. The statistical
significance is represented by either the p-value (fraction of permutation IndVal’s
exceeding the IndVal associated with the selected split point) or the z-score (the
standardized IndVal with respect to the mean and standard deviation of permutation
Indval’s).
To translate taxon-specific split points into a community “threshold,” TITAN performs the
permutation test of IndVal for all possible split points along the gradient and record the z-scores
for all taxa. At each potential split point along the gradient, z-scores of all taxa are summed. The
community-level threshold is defined to be split point associated with the maximum sum of
z-scores. There are two sum of z-scores, one for “decreasing” taxa (those with a higher relative
abundance at the low side of the selected split point) and the other for “increasing”
taxa.
Uncertainty associated with the selected split point is estimated using a bootstrapping
resampling method.
3. Statistical Issues
3.1 TITAN is not a statistical model
The main problem of TITAN is that the program is inherently not a statistical method. In a
typical statistical problem, variables are grouped into response variable(s) and predictor
variable(s). A probability distribution assumption is first introduced to describe the variation of
the response variable(s), followed by a model describing the link between the mean variable of
the response variable distribution and predictor variable(s) (the mean function). This
interpretation of statistical inference was first introduced by Fisher [1922], and fully explained by
Lenhard [2006]. Without the distributional assumption, statistical inference and modeling is
impossible.
In selecting taxon-specific split point, TITAN uses an abundance measure of the taxon. No
probability distribution assumption was made. A misleading statement was made to claim that
TITAN is a nonparametric statistical method. The term “nonparametric” has two specific
meanings. First, when used in describing a collection of tests based on ordered statistics, the
term nonparametric is a synonym for non-normal. The distribution is undefined but replaced by
the distribution of the test statistics derived from the rank-transformed data. Second, when used
in describing a class of statistical models based on smoothing technique, the term nonparametric
refers to the mean function to be not characterized by a known formula with a number
of parameters. The distributional assumption is still needed to specify the model.
The response variable distribution defines the likelihood function, which is the basis
of all statistical models. Statistical inference is about the estimation of the model
parameters and the quantification of uncertainty of the estimated parameter values.
Parameter estimation is based on the maximum likelihood estimator in classical statistics
and the posterior parameter distribution in the Bayesian statistics, both methods
require the probability distribution assumption to specify the likelihood function.
Uncertainty analysis in classical statistics is based on the sampling distribution of the
estimator which is directly related to the probability distribution of the response variable.
Uncertainty analysis in a Bayesian statistics is based on the posterior distribution of
model parameters and the posterior distribution is largely based on the likelihood
function.
Without a probability assumption of the response variable(s), TITAN is just a collection of
computer programs for data processing and transformation. The split point for a taxon is
selected, not estimated. Whether it is a statistical model or not may not be a problem, but
TITAN’s lack of a definite distributional assumption results in ambiguity in the meaning of the
selected split point.
where yj is the jth observed (log-transformed) taxa abundance value, xj is the corresponding
disturbance gradient value, μ1,μ2 are the two cluster means, and ϵ1,ϵ2 are the residual terms
(ϵi ~ N(0,σi2), i = 1, 2). Indeed, the selected split point is almost always coincide with the MLE
of the change point under the step function model given that σi2 is small. Figure 1 shows
some simulations of step function and selected split points.
3.2 Statistical change point problems
TITAN’s authors repeated claim that TITAN is a change point model. The term “change point”
in statistics means a jump discontinuity in an otherwise smooth curve. The aim of a change point
analysis is to detect and locate the discontinuity. For a parametric model, a change point can be
defined as the predictor variable value at which parameters defining the response variable
distribution change (see the selected change point literature summarized in Khodadadi and
Asgharian [2008]). There are also nonparametric regression methods [Antoniadis and
Gijbels, 2002, Dempfle and Stute, 2002, Gijbels et al., 1999] for a change point problem.
Change point estimation methods are model-specific. Different response variable distributions
and/or different mean function (e.g., linear versus nonlinear) will lead to different likelihood
function, thereby different computation methods for MLE. TITAN does not define a response
variable, nor a probability distribution. Consequently, the meaning of the selected split point is
dependent on the underlying model, both in terms of response variable distribution and the mean
function. As the response variable is likely taxa abundance, and a log transformation is
recommended, a normality assumption is the least controversial. Based on the definition
of IndVal, when there are two clusters the difference in taxa abundance is mostly
between cluster while within cluster differences are minimal. Accordingly, when the two
clusters are defined by a split point along the gradient, the statistical model is the
step-function:
(2) |
Figure 1. Three
simulations
of
the
step
function,
each
with
a
different
level
of
uncertainty.
The
left
column
shows
the
three
simulated
data
sets,
the
middle
column
is
the
calculated
IndVal
along
the
gradient,
and
the
right
column
is
the
distribution
of
selected
split
point
based
on
2000
bootstrap
samples.
When
there
are
no
observation
error
(top
row),
the
IndVal
peaks
at
the
change
point.
As
the
level
of
observation
error
increases
(middle
and
bottom
rows),
the
IndVal
response
pattern
is
increasingly
irregular
and
the
bootstrap
distribution
of
the
selected
split
point
more
variable.
When the underlying model is not a step function, the interpretation of the selected split
point becomes difficult. We illustrate the difficulties using two simulated examples. First,
assuming the change in taxa abundance is a linear function of the gradient (no statistical change
point) and the abundance values are all positive and are observed without error at points evenly
spaced along a gradient. Under this situation, IndVal is inversely related to the selected split
point. That is, the maximum IndVal is at one or the other end of the truncated gradient (Figure
2) when the underlying response pattern is linear. Second, assuming the change in
taxa abundance is a hockey stick or a broken stick model (Figure 3). If the slopes of
the two line segments are of the same sign, the maximum of IndVal will also occur
at one or the other end of the truncated gradient. In fact, it can be shown that the
maximum of IndVal is always at one or the other end of the truncated gradient if the
underlying pattern is monotonic. In other words, the selected split point being
one of the two ends of the truncated gradient can be an indication of a monotonic
abundance response pattern, further investigation is needed to understand whether the
underlying pattern has a change point (e.g., hockey stick model) or not (e.g., linear model).
However, a permutation test will result in a p-value of 0 (see section 4 for additional
comments), thereby confirming the selected split point as a “threshold.” This problem is
exacerbated when using bootstrapping for estimating the confidence interval of the
selected split point. A bootstrap sample contains only a subsample of the potential
split points, thus each bootstrap sample will result in a selected split point that is
slightly larger (or smaller) than the observed split point. (See Section 3.3 for more
discussion about the use of bootstrap.) When plotted, bootstrap split points often show a
slightly skewed distribution near one end of the gradient. Both the permutation test
p-value of 0 and a concentrated split point distribution from bootstrapping give a false
impression of a distinct split point, but they are artifacts of the program (Figures 2 and
3).
More simulations are presented in the submitted manuscript, which concluded that TITAN will likely result in
erroneous split point for all response types except the step function model. Even when the
underlying model can be approximated by the step function, TITAN will likely fail when the
uncertainty in the observed abundance is large.
Figure 2. When
taxa
abundance
is
a
linear
function
of
the
gradient,
IndVal
is
a
function
of
the
inverse
of
the
selected
split
point.
The
left
column
shows
three
data
sets
with
the
same
underlying
linear
model
but
different
levels
of
uncertainty.
The
middle
column
shows
the
calculated
IndVal
along
the
truncated
gradient.
The
right
column
shows
bootstrap
estimated
split
point
distribution.
The
peak
IndVal
should
be
at
one
end
of
the
truncated
gradient,
but
the
bootstrap
distribution
peak
is
always
closer
to
the
middle
than
the
true
peak.
Figure
3.
Three
linear
change
point
models
are
used
as
the
true
models
of
taxa
abundance
response.
The
top
row
shows
a
hockey
stick
model
with
a
flat
first
segment
and
a
decreasing
second
segment.
The
middle
row
shows
a
hockey
stick
model
with
the
two
line
segment
having
slopes
of
opposite
signs.
The
bottom
row
shows
a
broken
stick
model.
In
all
cases,
the
peak
IndVal
is
at
one
end
of
the
truncated
gradient
(middle
column).
The
bootstrap
distribution
of
the
selected
split
point
is
always
shifted
slightly
towards
the
middle
(right
column).
3.3 TITAN abuses statistical tests
Even though the selected split point is not a statistical estimate of a parameter, the split point is,
nonetheless, a random variable derived from raw abundance data. Statistical characteristics of
the variable can often be best described by using many nonparametric methods, including
nonparametric tests for testing the central tendency (mean, median) and bootstrapping for
confidence interval.
Baker and King [2010] used a permutation test to address the question whether
the maximum difference in the two IndVals based on the selected split point can be
attributed to randomness in the data. Based on the definition of IndVal, the null
hypothesis of the permutation test is that the distributions of taxa abundance in the two
groups are identical. Although the probability distribution of the abundance data is
not defined, the null hypothesis implies that the two groups share the same mean,
the same median, and any other statistics. The permutation test repeatedly assigns
data points into two groups at random with the same sample sizes as the two groups
divided by the selected split point. The IndVal is calculated each time. These random
permutation IndVals form a pseudo-random sample of the IndVal variable under the null
hypothesis of no changes in the abundance data along the gradient. If the sample
size is large enough and the null hypothesis is true, these pseudo-random samples
can be used to approximate the probability distribution of IndVal under the null
hypothesis. When comparing the IndVal associated with the selected split point to the null
hypothesis distribution, the tail-area (or the fraction of permutation IndVals exceeding the
selected split point IndVal) is the p-value as defined by Fisher [Lenhard, 2006] –
evidence against the null hypothesis. Baker and King [2010] used a default number of
permutation of 250 and suggested (in the online supporting materials) that a larger than 250
permutations is unnecessary. This statement is without any statistical support. In fact, any
such general statement with regard to the question of how large a sample size is large
enough is misleading because the answer is largely affected by the underlying probability
distribution of the data. Qian [2010] (pages 53-54) illustrates the problem using a
simulation of the central limit theorem. When the data sample size is small, the number of
possible permutations is also small. Using a number larger than the possible number of
permutation is obviously unnecessary. When the sample size is large, the possible number of
permutation is also large, 250 may be too small to characterize the null hypothesis
distribution, especially when the underlying abundance data distribution is highly skewed.
When the number of permutation is too small, the test is associated with a high level
of uncertainty. For example, we reconstructed the six taxa response data shown in
Figure 2 of Baker and King [2010] and performed the permutation test using 250 as
the default number of permutations for the selected split point of Taxon D. When
the permutation test is repeated, we receive a very different p-value each time. In
fact, the range of the permutation estimated p-values is from 0.012 to 0.120 in 2,500
permutation tests (Figure 4). The variation in p-value is still substantial when the number of
permutation is increased to 500 (p-value ranges from 0.028 to 0.102). The small number of
permutations resulted in a high variance in all statistics related to the test, including the
z-score.
Figure
4.
Permutation
estimated
p-values
can
be
highly
uncertain
when
the
number
of
permutations
used
in
a
test
is
small.
The
two
histograms
show
the
distributions
of
permutation
estimated
p-values
based
on
two
simulation
of
2,500
permutation
tests,
the
right
panel
used
250
permutations,
and
the
right
panel
used
500
permutations.
The permutation test is also used in Baker and King [2010] to calculate the z-score for all
potential split points. The intention is to normalize taxa with different abundances. By
calculating the z-scores for all potential splits, TITAN selects the split point based on the largest
z scores along the truncated gradient. The practice, however, introduces a systematic bias into
the selected split point. In a permutation test, the estimated IndVal mean and standard
deviation is affected by the degree of balance in sample sizes. When the difference in sample sizes
of the two groups is large, the estimated variance and mean tend to be large, too.
But more so in variance than in the mean. In TITAN, the unevenness is related to
the location of the split point. When the split point is at one or the other end of the
truncated gradient, the permutation estimated variance is the largest due to the largest
difference in sample sizes. When the split point is at the middle (center of the data), the
permutation variance is the smallest. As the z-score is the difference between the observed
IndVal and the permutation mean divided by the permutation standard deviation, the
z-scores for those potential split points near both ends of the gradient tend to be smaller
than they should be if the null hypothesis of no clustering is true. Figure 5 shows four
simulated taxa abundance data series along a gradient, along with the calculated IndVal,
permutation z-scores, permutation estimated mean IndVal and permutation estimated
IndVal standard deviation. Given the high variability in the permutation estimated
statistics because of the small number of permutations, it is not unusual to see very
different selected split points when running TITAN several times using the same data
set!
Figure
5
–
Split
points
selected
using
permutation
estimated
z-score
are
systematically
biased.
From
left
to
right,
the
first
column
shows
four
simulated
taxa
abundance
data
sets,
the
second
column
shows
the
respective
IndVals,
the
third
column
shows
the
permutation
test-based
z-scores,
the
fourth
column
shows
the
permutation
estimated
means,
and
the
fifth
column
shows
the
permutation
estimated
standard
deviation.
The
permutation
estimated
z-score
is
the
difference
between
IndVal
and
the
permutation
mean
divided
by
the
permutation
standard
deviation.
TITAN's authors identify synchronicity of thresholds as a major
finding of their work with TITAN. It is likely that this apparent synchronicity
is an artifact of the TITAN analysis for two reasons. First, the selected split
point of a species derived using z-scores has been shown to be biased towards
the center of the data cloud along the gradient. Second, in many applications of TITAN the environmental
gradient is a gradient of human disturbance (e.g., % developed land or %
impervious surface in a watershed) in which the distribution of sites along the
gradient are skewed toward the low end of the gradient (e.g., there are few
highly urbanized sites relative to the population of possible sampling sites). This led to the peak of the z-score
distribution to be biased towards the low end of the gradient. Figure 6 shows an example of a linear
response model with sites unevenly distributed along the gradient, with more
data points closer to the low end of the gradient than to the high end (median
x-axis value = 0.11). When the gradient
is truncated to allow a minimum of 5 data points in a group (suggested default
for TITAN analysis), the upper bound of the gradient is moved from 1.0 to close
to 0.6. When the noise level is 0, we see a clear monotonic increasing trend in
IndVal along the gradient with the
highest IndVal at the high end of the
truncated gradient (0.6). The monotonic pattern of IndVal is apparent in all cases, but the maximum z-score is always at
a point closer to the center of the data cloud (median = 0.11). As the noise level increases, the peak z-score
moves even closer to the data center along the x-axis resulting in what appears
to be a high degree of synchronicity in split points among species.
Figure 6. Simulations of linear models are used to
understand the effects of the z-score bias coupled with a distribution of
sampling sites skewed toward the low end of the gradient. Columns are arranged
as in Figure 11. Only the linear model was used. Each row represents a
different level noise in the data represented by the coefficient of variation (CV).
TITAN's authors have used synchronicity in species responses to advocate for
detecting thresholds based on the responses of individual taxa rather than
aggregate community metrics. They suggest that community metrics are relatively
insensitive to the synchronous threshold declines that they have observed for
individual taxa based on TITAN analysis, which they suggest provides a method
to determine the magnitude, direction, and uncertainty in responses.
Unfortunately, TITAN is a poor method for determining the magnitude, direction
and uncertainty of thresholds and much of the synchronicity in responses is an
artifact of the IndVal method that
underlies TITAN. Their criticism of analyses based on community metrics is
unwarranted given the inability of TITAN to characterize common response forms
and the propensity for it to extract erroneous thresholds. The reality is that
threshold analysis requires describing how taxa abundances change across the
gradient, a task for which TITAN and
IndVal are ill suited. The characterization of individual taxa responses
and the detection of thresholds should be undertaken by comparing multiple
alternative models to determine whether the data fit a
response model that is indicative of a threshold or whether the response cannot
be differentiated from a model without a threshold (e.g., linear model).
Using permutation test repeatedly is also an invitation to type I error. When the null
hypothesis of no clustering is true, a permutation test will reject the null hypothesis 5% of the
time. In other words, 5 out 100 potential split points will be statistically significant even
the taxon does not respond to the disturbance. By selecting the split point based
on the largest z-value, TITAN’s risk of raising false confidence in the selected split
point is real. A recent study suggests that permutation estimated null hypothesis
distribution can be false when the null hypothesis is itself false, often resulting in increase
type II error [Aickin, 2010]. In any case, the use of a permutation test in this case is
unjustified.
TITAN also used bootstrap resampling method for calculating the confidence interval of the
selected split point. The bootstrapping method is a commonly used resampling methods for
estimating standard deviation and confidence intervals of statistics. The bootstrapping approach
is a Monte Carlo simulation procedure aimed at obtaining an approximation of the sampling
distribution of the parameter of interest. It substitutes random samples from the target
population with random samples of the same size (with replacement) from the existing
data. Asymptotically, i.e., as sample size of the data increases, bootstrap samples
are close to random samples from the population. As a result, empirical distribution
of variable calculated from these bootstrap samples can be seen as approximate to
the true sampling distribution of the variable of interest. The bootstrap method is,
however, not appropriate for a split point problem [Bühlmann and Yu, 2002, Banerjee
and McKeague, 2007]. Statistical theories [Bühlmann and Yu, 2002] indicate that
bootstrap estimated standard deviation of the split point is always smaller than the
true standard deviation, leading to a narrower confidence interval. For a Monte Carlo
simulation to approximate a random variable, each Monte Carlo sample should be
close to be a random sample from the target population. When using bootstrapping,
the population is approximated by the observed data. But the empirical distribution
of the random variable of interest is asymptotically approaching the true sampling
distribution. This result is true for many random variables of the population. But the
split point is different. When a sample (with k unique gradient values) is taken, the
number of potential split points is set (k - 1) and these potential split points are fixed.
Potential split points in a bootstrap sample are always a subsample of the k - 1 potential
split points in the original data. In other words, the bootstrapping process repeatedly
selects the split point from the same pool of k - 1 potential split points. Consequently,
the bootstrap estimated standard deviation is much smaller than it should be (and
the estimated confidence interval is much narrower than it should be). Again, Baker
and King recommended a default bootstrap sample size of 500 and suggested 100
may be enough without providing any evidence. Whether a number is large enough is
determined by many factors, especially the (unknown) underlying population distribution of
interest. It is irresponsible to suggest a default number, especially when it is as small as
100.
I must take some blame for the use of bootstrapping in a change point problem. I used bootstrap in a 2003 Ecological Modelling paper on a change point problem. In that paper, I used the classification and regression tree (CART) model as an alternative to the Bayesian change point model. Because the Bayesian change point model yields the distribution of the estimated change point, I used bootstrap for the split point confidence interval. Ian McKeague pointed out the problem of using bootstrapping to me in 2004 when we met at a meeting on statistical issues related to setting a phosphorus standard for the Everglades. It took me quite a while to finally understand the problem. (My other mistake was to call the model a nonparametric one because CART is a "nonparametric" model. But when we use the first split on one predictor, the model is a step function with 4 [or 5] parameters.)
I must take some blame for the use of bootstrapping in a change point problem. I used bootstrap in a 2003 Ecological Modelling paper on a change point problem. In that paper, I used the classification and regression tree (CART) model as an alternative to the Bayesian change point model. Because the Bayesian change point model yields the distribution of the estimated change point, I used bootstrap for the split point confidence interval. Ian McKeague pointed out the problem of using bootstrapping to me in 2004 when we met at a meeting on statistical issues related to setting a phosphorus standard for the Everglades. It took me quite a while to finally understand the problem. (My other mistake was to call the model a nonparametric one because CART is a "nonparametric" model. But when we use the first split on one predictor, the model is a step function with 4 [or 5] parameters.)
3.4 More on z-score
The last step of TITAN is to calculate the community threshold based on the sum of z-scores
calculated for all taxa for all potential split points. A z-score for a particular taxon is related to
the permutation test as a measure of evidence against the null hypothesis. In this case, the null
hypothesis is specific to the particular taxon (e.g., no clustering in abundance of the taxon along
the disturbance gradient). Different z-scores are for different tests with different nulls.
The sum of z-scores is only meaningful for tests sharing the same null (e.g., replicated
experiments). In this case, we may assume that the common null hypothesis of all these tests is that the split point does not exist for all taxa. Under the common null hypothesis, z-scores are random samples from the same
standard normal distribution and the sum of these z-scores is a random variable with
mean 0 and variance N (the number of tests or taxa). That is, we
use the sum as a composite test statistics for the common null hypothesis. But the
sum of squares of z-score (which has a χ2 distribution with degrees of freedom of N)
is a more frequently used test statistic. In any case, z-scores are to be combined only when these
z-scores are from tests sharing the same null hypothesis, such that, the combined
z-score (sum or sum of squares) provides a measure of evidence against the shared null
hypothesis. In TITAN, z-scores, calculated for individual taxa (hence different null
hypotheses), should not be combined. Combining z-scores based on different nulls is
meaningless.
Problems TITAN intends to solve is not a hypothesis testing problem. Consequently, the
value of a z-score is limited. The species sensitivity distribution concept in ecotoxicology is more
suitable. In an ecosystem risk assessment problem, the disturbance gradient is measured by the
concentration of a toxic substance and the species-specific “threshold” is set to be,
for example, LC50 (a concentration that may result in 50% fatality of the species).
At an ecosystem level, LC50s of many species are plotted along the concentration
gradient on the x-axis, and the y-axis shows fractions of species affected (or % species
with LC50 ≤ a given concentration). Here, we can replace LC50 with the selected
split points to form the cumulative taxa split point distribution. Mathematically, this
distribution is the hyper-distribution of species-specific split points under a Bayesian
hierarchical modeling framework, representing the distribution of taxon-specific split
points.
4 Programming Issues
TITNA is implemented in R. The authors’ R programs have a few flaws that may lead to
unintended consequences.
- Treatment of ties in the observed disturbance gradient. Because the disturbance gradient is the only cause for taxa clustering, observations with the same disturbance gradient value should be grouped together. The programming strategy is to first find a list of unique values along the gradient and then attach each observed abundance data value to these unique x-values. In TITAN, instead of grouping observed abundance data values by the disturbance gradient data values, each observation is treated as independent and sorted along the gradient. When there are multiple abundance values for a gradient value, these abundance values are randomly assigned to one or the other group. Consequently, TITAN result will vary from run to run using the same data set. This problem is important because TITAN uses bootstrapping and the same observation in the original data is likely represented more than once in a bootstrap sample. This problem is difficult to detect because TITAN uses the sum of z-scores to pick the “community” level split point and the z-scores are highly variable because of the small number of permutations.
- When running TITAN, we often see a permutation test p-value of 0.004. This is because the p-value was calculated as 1+ number of permutations with IndVal’s exceeding the observed IndVal divided by the numer of permutations. With the number of permutation is set to be 250, a p-value of 0.004 indicates no permutation IndVals exceeds the observed IndVal. As discussed in Section 4.2, a p-value of 0 may be an indication that the underlying pattern of response is monotonic. By adding 1 to the p-value calculation, the results are highly misleading.
5 Discussion
TITAN is intended for uncovering discontinuous jumps in taxa abundance data along a disturbance gradient. Instead of formulating specific models about abundance, TITAN’s authors used the clustering indicator IndVal. The resulting program is ambiguous in terms what kind of threshold was detected. The misuse of the permutation test resulted in a systematic bias in the selected split point towards the center of the data cloud along the gradient. TITAN is written to process large data sets with hundreds of taxa from many sites. As a result, the behavior of the program is not clear to a reviewer. It took us several months to recode the program and set up various test cases to understand what the program is doing. The peer-review process is not set up for such time-consuming task. The authors are ultimately responsible for checking their work. Simulations presented in this post should be conducted prior to submission. Journals, when considering methods manuscripts, should develop a process of verification (on top of the usual peer-review process) to avoid less obvious errors.References
M. Aickin.
Invalid
permutation
tests.
International
Journal
of
Mathematics
and
Mathematical
Sciences,
Article
ID
769780:
10
pages,
2010.
doi:
10.1155/2010/769780.
A. Antoniadis
and
I. Gijbels.
Detecting
abrupt
changes
by
wavelet
methods.
Journal
of
Nonparametric
Statistics,
14
(1-2):
7–29,
2002.
M.E.
Baker
and
R.S.
King.
A
new
method
for
detecting
and
interpreting
biodiversity
and
ecological
community
thresholds.
Methods
in
Ecology
and
Evolution,
1
(1):
25–37,
2010.
M. Banerjee
and
I. W.
McKeague.
Confidence
sets
for
split
points
in
decision
trees.
The
Annals
of
Statistics,
35
(2):
543–574,
2007.
P. Bühlmann
and
B. Yu.
Analyzing
bagging.
The
Annals
of
Statistics,
30
(4):
927–961,
2002.
A. Dempfle
and
W. Stute.
Nonparametric
estimation
of
a
discontinuity
in
regression.
Statistica
Neerlandica,
56
(2):
233–242,
2002.
ISSN
1467-9574.
doi:
10.1111/1467-9574.00196.
URL
http://dx.doi.org/10.1111/1467-9574.00196.
M. Dufrêne
and
P. Legendre.
Species
assemblages
and
indicator
species:
the
need
for
a
flexible
asymmetrical
approach.
Ecological
Monographs,
67
(3),
1997.
R.A.
Fisher.
On
the
mathematical
foundations
of
theoretical
statistics.
Philosophical
Transactions
of
the
Royal
Society
of
London,
Series
A,
222:
309–368,
1922.
I. Gijbels,
P. Hall,
and
A. Kneip.
On
the
estimation
of
jump
points
in
smooth
curves.
Annals
of
the
Institute
of
Statistical
Mathematics,
51:
231–251,
1999.
ISSN
0020-3157.
URL
http://dx.doi.org/10.1023/A:1003802007064.
10.1023/A:1003802007064.
Ahmad
Khodadadi
and
Masoud
Asgharian.
Change-point
problem
and
regression:
An
annotated
bibliography.
Technical
report,
COBRA
Preprint
Series.
Article
44,
2008.
URL
http://biostats.bepress.com/cobra/ps/art44.
J. Lenhard.
Models
and
statistical
inference:
The
controversy
between
Fisher
and
Neyman-Pearson.
The
British
Journal
for
the
Philosophy
of
Science,
57
(1):
69–91,
2006.
S.S.
Qian.
Environmental
and
Ecological
Statistics
with
R.
Chapman
and
Hall/CRC
Press,
2010.
No comments:
Post a Comment