Saturday, May 19, 2018

Log or not log, that is the question

May 19, 2018

In 2014 I taught a special topics class on statistical issues associated with measuring the concentration of the cyanobacterial toxin microcystin (MC) in Toledo's drinking water. That class led to a paper documenting the high level of uncertainty in measured MC concentrations. One idea from that paper was to develop a better curve-fitting method to reduce the measurement uncertainty. The Ohio EPA and other regulatory agencies expressed no interest in my proposals. In the following two summers, I directed two REU students to learn the measurement process. We designed an experiment using two ELISA kits to measure samples with known MC concentrations, obtained by diluting the standard solutions. Using the results, we compared the standard curve fit on the MC concentration scale to the curve fit on the log-concentration scale. Both curves fit the data well. However, the curve fit on the log-concentration scale leads to far smaller predictive uncertainty. Based on this simple result, my students wrote a paper. After the paper was published, I learned that current ELISA test kits come with three quality-control samples with known MC concentrations. We are now searching for raw data from various sources to replicate our study. This replication is important because our study used a single test kit, and many argued that kit-to-kit variation is often more important, although we believe that such variation is largely due to the small sample size used for fitting the curve.
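The effect of the scale choice can be illustrated with a small simulation. This is a minimal sketch with made-up numbers, not the analysis in our paper: real ELISA standard curves are usually four- or five-parameter logistic, but a simple linear calibration is enough to show how fitting on the concentration scale versus the log-concentration scale changes the quality of the fit when the response is roughly linear in log-concentration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical standard concentrations (ppb), made by serial dilution
conc = np.array([0.1, 0.2, 0.5, 1.0, 2.0, 5.0])

# Simulated optical-density-like response, roughly linear in log-concentration
# (a stand-in for a real ELISA standard curve); all numbers are invented
true_resp = 0.8 + 0.5 * np.log(conc)
resp = true_resp + rng.normal(0, 0.05, size=conc.size)

def fit_rmse(x, y):
    """Least-squares line y ~ a + b*x; return the residual RMSE."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return np.sqrt(np.mean(resid ** 2))

rmse_raw = fit_rmse(conc, resp)          # curve fit on the concentration scale
rmse_log = fit_rmse(np.log(conc), resp)  # curve fit on the log-concentration scale

print(rmse_raw, rmse_log)  # the log-scale fit leaves much smaller residuals
```

Smaller residual error around the standard curve translates directly into narrower inverse-prediction (calibration) intervals for an unknown sample, which is the predictive uncertainty at issue.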

Revisit the Difficulty in Teaching Statistics

June 2017

Many years ago, I listened to a famous talk on why statistics is like literature and mathematics is like music. The point is that mathematics, like music, relies on deduction; statistics, on the other hand, is for induction. Suppose that A causes B and we know either A or B. If we know A, we use mathematics to deduce what will follow from A. If we know B, we use statistics to learn about the cause of B. Deduction is guided by clearly defined logic; induction doesn't have a set of rules. Before we take our first statistics class, we are entirely immersed in deduction. When we first learn statistics, we treat it as a topic in mathematics. The beginning of the class is inevitably probability, which reinforces the impression of deductive thinking. By the time the statistics portion of the class starts, we have already sunk hopelessly into the deduction mode, and most of us can never dig ourselves out of it. By the time we take our first graduate-level statistics class, we have probably forgotten all about the little statistics we learned as undergraduates.

What makes the learning even harder is that the graduate-level class is almost always taught by professors from the statistics department on a rotational basis. No statistics professor wants to teach an applied course in a science field: student evaluations from these courses are always below average, regardless of the quality of teaching. With professors teaching the class either for the first time ever or for the first time after a long hiatus, the teaching quality is always less than optimal.

From a student's perspective, statistics is impossible to learn well. The thought process of modern statistics is hypothetical deduction. To use it, we need to know a lot of science to propose reasonable hypotheses; we also need to know a lot about probability distributions to judge which distribution is most likely the relevant one. New students know neither. As a result, we teach a few simple models (t-test, ANOVA, regression). A good student can master each model and manage to use them in simple applications.

Recently, I read Neyman and Pearson (1933) to understand the development of the Neyman-Pearson Lemma. The first two pages of the article are particularly stimulating. Neyman and Pearson traced statistical hypothesis testing back to Bayes, as a test of a cause-and-effect hypothesis. They then described what looks like the hypothetical deductive process of hypothesis testing and concluded that "no test of this kind could give useful results." The Neyman-Pearson lemma is then described as a "rule of behavior" with regard to a hypothesis $H$: when following this rule, we "shall reject H when it is true not more, say, than once in a hundred times, and we shall reject H sufficiently often when it is false." Furthermore, "such a rule tells us nothing as to whether in a particular case $H$ is true or false," whether the test result is statistically significant or not. It appears to me that frequency-based classical statistics is really designed for engineers, whereas Bayesian statistics is suited for scientists.
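The "rule of behavior" reading can be checked with a quick simulation (my own toy setup, not from the 1933 paper): follow a fixed rejection rule at the 1% level over many repetitions in which $H$ is true, and you reject about once in a hundred times, yet no single rejection tells you whether $H$ is true in that particular case.

```python
import numpy as np

rng = np.random.default_rng(0)

# Rule of behavior: reject H (mu = 0) whenever |z| exceeds the two-sided
# 1% critical value of the standard normal distribution
alpha_crit = 2.576

n_rep, n = 20_000, 30
samples = rng.normal(0.0, 1.0, (n_rep, n))     # H is true in every repetition
z = samples.mean(axis=1) / (1.0 / np.sqrt(n))  # z-statistic with known sigma = 1
reject_rate = np.mean(np.abs(z) > alpha_crit)

print(reject_rate)  # close to 0.01: we reject H about once in a hundred times
```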

If teaching classical statistics is hard, teaching Bayesian statistics is harder (especially to American students who are poorly trained in calculus).

Peer-Review Fraud: Cite my paper or else

May 19, 2016 (revised in 2018)

I serve as a peer reviewer a lot because I value the peer-review process. In the first few years after graduate school, reviewer comments on my manuscripts were often the most helpful part of the writing process. I benefited from the process and am willing to do what I can to contribute to it. Reviewers are volunteers, and their service is a critical part of academic publication. As a result, I treat my review assignments seriously, write reviews objectively, and provide constructive recommendations. I want to do my part to keep this academic commons a sustainable endeavor.

In 2016, reviews of two manuscripts disturbed me. The lead authors of the two manuscripts were former students of mine. One is about the use of a generalized propensity score method to estimate the causal effect of nitrogen on stream benthic communities, and the other is on statistical issues in discretizing a continuous variable when constructing a Bayesian network model. These two manuscripts have nothing in common, except that I was the second author on both. Reviewers' comments on the two manuscripts came back in the same week, and one reviewer apparently reviewed both papers. This reviewer's comments on the two papers were essentially the same, and the suggestions were irrelevant to our work. It was clear to us that this reviewer was sending a message: cite my papers and I will let you go.

For the Bayesian network paper, we chose to ignore this reviewer, as he was one of four reviewers who commented on the manuscript; we also forwarded his nearly identical comments on our propensity score paper to the editor, and the paper is now published. The propensity score paper, however, had only one reviewer. The lead author was a student at the time and was eager to add publications to his resume before graduation. After discussion, I wrote to the editor of the journal to explain our concerns and requested that the manuscript be treated as a new submission and go through the review process again. Although it would have been easy to add a sentence or two with the recommended citations, I believed it was important to uphold the principle. The associate editor ignored my request, so I sent it to the editor-in-chief. Although the editor promised to handle the re-review himself, he delegated the work to the same associate editor, who in turn made sure that the paper went through repeated reviews until it was rejected. The paper is now published in a different journal.

I copy the reviews in question below, verbatim; hopefully readers will reach the same conclusion as I did. We want to publish, and we want our peers to read and cite our work because the work is worthwhile. Abusing one's "power" as a reviewer is just as bad as cheating!

Review on the Bayesian networks model paper:

General comments:

Overall I like the study and I feel it is fairly well written. My two observations are about the lack of global sensitivity and uncertainty analyses (GSUA) and a conversation about management implications that we can extract from the model/GSUA. Note that here with ''model'' I mean any method that use the data, yet any model that process the data in input and produce an output. That is useful for assessing input factor importance and interaction, regimes, and scaling laws between model input factors and outcomes. This differs from traditional sensitivity analysis methods. Thus, GSUA is very useful for finding out optimal management/design strategies. GSUA is a variance-based method for analyzing data and models given an objective function. It is a bit unclear how many realizations of the model have been run and how the authors maximized prediction accuracy. Are the values of the input factors taken to maximize predictions? GSUA (see references below) typically assigns probability distribution functions to all model factors and propagate those into model outputs.

In this context, that is about discretization methods for pdfs, the impact of discretization may be small or large depending on the pdf chosen (or suitable) for the variables; yet, the discretization may have different results as a function of the nature of the variables of interest as well as of the model used.

I think that independently of the model / variables used the authors should discuss these issues in their paper and possibly postpone further research along these lines to another paper.

Specific comments:

Variance-based methods (see Saltelli and Convertino below) are a class of probabilistic approaches which quantify the input and output uncertainties as probability distributions, and decompose the output variance into parts attributable to input variables and combinations of variables. The sensitivity of the output to an input variable is therefore measured by the amount of variance in the output caused by that input. Variance-based methods allow full exploration of the input space, accounting for interactions, and nonlinear responses. For these reasons they are widely used when it is feasible to calculate them. Typically this calculation involves the use of Monte Carlo methods, but since this can involve many thousands of model runs, other methods (such as emulators) can be used to reduce computational expense when necessary. Note that full variance decompositions are only meaningful when the input factors are independent from one another. If that is not the case information theory based GSUA is necessary (see Ludtke et al. )

Thus, I really would like to see GSUA done because it (i) informs about the dynamics of the processes investigated and (ii) is very important for management purposes.

Convertino et al. Untangling drivers of species distributions: Global sensitivity and uncertainty analyses of MaxEnt. Journal Environmental Modelling & Software archive Volume 51, January, 2014 Pages 296-309

Saltelli A, Marco Ratto, Terry Andres, Francesca Campolongo, Jessica Cariboni, Debora Gatelli, Michaela Saisana, Stefano Tarantola Global Sensitivity Analysis: The Primer ISBN: 978-0-470-05997-5

Ludtke et al. (2007), Information-theoretic Sensitivity Analysis: a general method for credit assignment in complex networks J. Royal Soc. Interface

Review on the propensity score paper:

GENERAL COMMENTS

After a careful reading of the manuscript I really like the study and I feel it can have some impact into the theory of biodiversity and biogeography at multiple scales. My two technical observations are about the lack of global sensitivity and uncertainty analyses (GSUA) and a conversation about management implications that we can extract from the model/GSUA. Also, I think the findings can be presented in a clearer way by focusing on (i) the universality of findings across macro-geographical areas, (2) probabilistic structure of the variable considered and (3) the possibility to discuss gradual and sudden change in a non-linear theoretical framework (tipping points and gradual change). I would strongly suggest to talk about ''potential causal factors/relationship'' rather than talking about true causality because that is very difficulty proven and many causality assessment methods exist (e.g. transfer entropy, conergence cross mapping, scaling analysis, etc.). Also, can you provide an explanation for Eq. 6? Figure 2 does not show regressions but scaling law relationship since you plot everything in loglog. This can be an important results, in fact I suggest you to consider this avenue of interpretation (see Convertino et al. 2014 but also other work or Rinaldo and Rodriguez-Iturbe).

Note that here with ''model'' I mean any method that use the data, yet any model that process the data in input and produce an output. Data in fact can be thought as a model and probability distribution functions (pdfs) can be assigned to data variables (see Convertino et al. 2014). These pdfs can be assigned to any source of uncertainty about a variable (e.g. changing presence / absence into a continuous variable) and the uncertainty of outputs (e.g. species richness) can be tested against the uncertainty of all input variables. I believe that just considering average values is not enough.

As for the rest I really love the paper. I suggest to also plot the patterns in Convertino et al (2009): these are for instance the JSI and the Regional Species Richness; in ecological terms these can be defined as alpha, beta and gamma diversity. These patters can be studied as a function of geomorphological patterns such as the distance from the coat in order to find potential drivers of diversity. These are just ideas that can be pursued further. Lastly I wonder if the data can be made available to the community for further studies. For all above motivations I suggest to accept the paper only after Moderate or Major Revisions. Again, I think that these revisions can just make better the paper.

SPECIFIC COMMENTS:

In any context, e.g. as in this paper GSUA is very important because it given an idea of what is driving the output in term of model input factor importance and interaction, and how that can be used for management. GSUA is a variance-based method for analyzing data and models given an objective function. It is a bit unclear how many realizations of the model have been run and how the authors maximized prediction accuracy. Are the values of the input factors taken to maximize predictions? GSUA (see references below) typically assigns probability distribution functions to all model factors and propagate that into model outputs. That is useful for assessing input factor importance and interaction, regimes, and scaling laws between model input factors and outcomes. This differs from traditional sensitivity analysis methods (that are even missing here)

Variance-based methods (see Saltelli and Convertino below) are a class of probabilistic approaches which quantify the input and output uncertainties as probability distributions, and decompose the output variance into parts attributable to input variables and combinations of variables. The sensitivity of the output to an input variable is therefore measured by the amount of variance in the output caused by that input. Variance-based methods allow full exploration of the input space, accounting for interactions, and nonlinear responses. For these reasons they are widely used when it is feasible to calculate them. Typically this calculation involves the use of Monte Carlo methods, but since this can involve many thousands of model runs, other methods (such as emulators) can be used to reduce computational expense when necessary. Note that full variance decompositions are only meaningful when the input factors are independent from one another. If that is not the case information theory based GSUA is necessary (see Ludtke et al. for an information theory model of GSUA).

Thus, I really would like to see GSUA done because it (i) informs about the dynamics of the processes investigated and (ii) is very important for management purposes.

REFERENCES

Convertino, M. et al (2009) On neutral metacommunity patterns of river basins at different scales of aggregation http://www1.maths.leeds.ac.uk/~fbssaz/articles/Convertino_WRR09.pdf

Convertino, M.; Baker, K.M.; Vogel, J.T.; Lu, C.; Suedel, B.; and Linkov, I., "Multi-criteria decision analysis to select metrics for design and monitoring of sustainable ecosystem restorations" (2013). US Army Research. Paper 190. http://digitalcommons.unl.edu/usarmyresearch/190 http://digitalcommons.unl.edu/cgi/viewcontent.cgi?article=1189&context=usarmyresearch

Convertino et al. Untangling drivers of species distributions: Global sensitivity and uncertainty analyses of MaxEnt Journal Environmental Modelling & Software archive Volume 51, January, 2014 Pages 296-309

Saltelli A, Marco Ratto, Terry Andres, Francesca Campolongo, Jessica Cariboni, Debora Gatelli, Michaela Saisana, Stefano Tarantola Global Sensitivity Analysis: The Primer ISBN: 978-0-470-05997-5

Ludtke et al. (2007), Information-theoretic Sensitivity Analysis: a general method for credit assignment in complex networks J. Royal Soc. Interface

Friday, February 10, 2017

Statistical Falsification

When I read the collection of papers in the 2014 p-value/AIC forum in Ecology, the short comment by Burnham and Anderson stood out as the most ridiculous one.  Not only is their framing of 20th- versus 21st-century statistics absurd (what about Bayesian statistics, a 17th-century invention?), their use of the falsification principle to claim that hypothesis testing is bogus seems odd.  They claimed that because null hypothesis testing cannot test or falsify the alternative hypothesis, hypothesis testing "seems almost a scandal."  In August 2016, I read a post by Mayo, which addressed the scandal part of Burnham and Anderson.  She said: "I am (almost) scandalized by this easily falsifiable allegation!"

Tuesday, June 28, 2016

Environmental and Ecological Statistics with R (2nd edition)

The second edition of EESwithR is coming in fall 2016.  I added one new chapter to the book, and it is posted as a sample chapter on GitHub, along with R code and data sets.  The other main change is the replacement of the term "statistically significant" with something like "statistically different from 0."  The second edition also includes a large number of exercises, many of which have been used as homework assignments in my classes over the last ten years.  I am working on a solution pamphlet, as well as additional problems.  One unique feature of these exercises is that almost none of them has a unique solution: there are always multiple interpretations of a problem.  When grading homework, I look for the student's thought process.  I welcome suggestions and recommendations for additional exercise problems.

Monday, June 13, 2016

Hypothesis testing and the Raven Paradox

I was going over my old reading logs the other day and saw my notes on the Raven paradox (a.k.a. Hempel's paradox).  The statement that "all ravens are black" is apparently straightforward.  The logical contrapositive, "everything that is not black is not a raven," is also obviously true and uncontroversial.  In mathematics, proof by contrapositive is a legitimate inference method: you can show "if not B then not A" to support "if A then B."  The raven paradox is paradoxical because it suggests that observing a white shoe is evidence supporting the claim that all ravens are black.  I.J. Good proposed a Bayesian explanation (or solution) of the paradox: the weight of evidence provided by seeing a white shoe (or any non-black object that is not a raven) is positive, but small when the number of ravens is small compared to the number of non-black objects.  But how is the paradox relevant to statistical hypothesis testing?
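Good's argument can be made concrete with a toy calculation (all counts are hypothetical, not from Good's papers): suppose the world contains a large number of non-black objects, and the competing hypothesis allows exactly one white raven among them. Randomly sampling a non-black object and finding it is not a raven then favors "all ravens are black," but only barely.

```python
import math

# Toy world: nb non-black objects in total (hypothetical count)
nb = 1_000_000

# H1: all ravens are black -> no non-black object is a raven
# H2: exactly one white raven hides among the non-black objects
p_obs_h1 = 1.0            # a sampled non-black object is surely not a raven
p_obs_h2 = (nb - 1) / nb  # one of the nb non-black objects is the white raven

# Weight of evidence in favor of H1: log of the likelihood ratio
woe = math.log(p_obs_h1 / p_obs_h2)
print(woe)  # positive, but on the order of 1/nb -- almost negligible
```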

Statistical Hypothesis Inference and Testing is relevant to discussing the Raven paradox because we show support for our theory (the alternative hypothesis) by showing that a non-black object is not a raven (the null hypothesis).  If we are interested in showing that a treatment has an effect, we start by setting the null hypothesis to be that the treatment has no effect.  Using statistics, we show that the data do not support the null hypothesis; the logic of the contrapositive then leads to the conclusion that the treatment is effective.  I have no problem with this thought process, as long as we are only interested in a yes/no answer about the effectiveness of the treatment; how effective it is, is of no interest.  But if we are interested in quantifying the treatment effect, hypothesis testing is almost always inappropriate.  When we quantify the effect, we are interested in a specific alternative.  For example, when discussing the effectiveness of agricultural conservation practices in reducing nutrient loss, we want to know the magnitude of the effect, not whether the effect exists.  Showing that the effect is not zero gives some support to the claim that the effect is X, but not much.  This is why we often advise our students that statistical significance is not always practically useful, especially when the null hypothesis itself is irrelevant to the hypothesis of interest (the alternative hypothesis).
A "threshold" model known as TITAN is a perfect example of the Raven paradox.  The basic building block of TITAN is a series of permutation tests.  Although TITAN's authors never clearly stated the null and alternative hypotheses, it is not difficult to derive them from the basic characteristics of a permutation test.  The hypothesis of interest (the alternative) is that changes in a taxon's abundance along an environmental gradient can be approximated by a threshold model (specifically, a step function).  The null hypothesis is that the taxon's abundance is constant along the same gradient.  We can rephrase the alternative hypothesis as: the pattern of change in a taxon's abundance follows a threshold model; the null is that the pattern of change is flat.  When we reject the null, we conclude that the pattern of change is not flat.  The rejection can be seen as evidence supporting the alternative, but the weight of evidence is small if the number of patterns that are neither flat nor threshold-like is large.
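The building block can be sketched as a permutation test of a step-function change against a flat null. This is my own minimal sketch with simulated abundances, not TITAN's actual statistic: the test statistic, candidate change points, and data below are all invented for illustration. Note what the small p-value does and does not say: it rejects "flat," which is weak evidence for "step function" in particular.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical taxon abundance along an environmental gradient:
# roughly constant below a threshold near 6, lower above it
gradient = np.linspace(0, 10, 60)
abundance = np.where(gradient < 6, 10.0, 4.0) + rng.normal(0, 2.0, 60)

def step_stat(y, x):
    """Largest |difference in mean abundance| over candidate change points."""
    return max(abs(y[x <= c].mean() - y[x > c].mean())
               for c in x[5:-5])  # skip the ends to avoid tiny groups

observed = step_stat(abundance, gradient)

# Permutation null: shuffling abundances destroys any structure along
# the gradient, so the statistic's permutation distribution reflects "flat"
n_perm = 999
perm = [step_stat(rng.permutation(abundance), gradient) for _ in range(n_perm)]
p_value = (1 + sum(s >= observed for s in perm)) / (n_perm + 1)
print(p_value)  # small: the pattern of change is not flat
```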

Thursday, February 4, 2016

The Everglades wetland's phosphorus retention capacity

In 1997, Curt Richardson and I published a paper on using a piecewise linear regression model to estimate the phosphorus retention capacity of the Everglades.  At the time, fitting a piecewise linear model was not a simple task.  As I was up to date on Bayesian computation, I used the Gibbs sampler; it was an interesting exercise to derive the full set of conditional probability distribution functions, a process tedious but not hard.  Applied to the Everglades data, the model led us to conclude that the Everglades' phosphorus retention capacity is about 1 gram of phosphorus per square meter per year (the median is 1.15), with a 90% credible interval of (0.61, 1.47) (Table 2 in Qian and Richardson, 1997).  The posterior distribution of the retention capacity is skewed to the left.  In subsequent papers, Curt Richardson named the result "the 1 gram rule."  The South Florida Water Management District (SFWMD) never believed our work and often claimed that the retention rate would be much higher.
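The same kind of piecewise linear ("hockey-stick") model can be sketched without a Gibbs sampler. The sketch below uses simulated data (all numbers hypothetical, not the Everglades data) and profiles the change point over a grid with ordinary least squares; the 1997 paper instead sampled the full posterior, which is what yields the credible interval.

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated data from a hockey-stick model: the response rises linearly
# with x up to a change point, then stays flat (all numbers invented)
x = np.sort(rng.uniform(0, 4, 80))
cp_true = 1.0
y = 2.0 * np.minimum(x, cp_true) + rng.normal(0, 0.3, x.size)

def sse_at(cp, x, y):
    """SSE of the least-squares fit y ~ b0 + b1 * min(x, cp)."""
    X = np.column_stack([np.ones_like(x), np.minimum(x, cp)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

# Profile the change point on a grid and keep the best-fitting value
grid = np.linspace(0.2, 3.8, 181)
cp_hat = grid[np.argmin([sse_at(c, x, y) for c in grid])]
print(cp_hat)  # near the true change point of 1.0
```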

Since then, SFWMD has constructed several Stormwater Treatment Areas (STAs) -- wetlands for removing phosphorus -- and has been monitoring their performance.  The latest results (Chen et al., 2015) showed that the retention capacity of these STAs is 1.1 +/- 0.5 grams per square meter per year.

I was satisfied that SFWMD finally agreed with my finding, even though the agreement took nearly 20 years (and hundreds of millions of dollars).

Chen, H., Ivanoff, D., and Pietro, K. (2015) Long-term phosphorus removal in the Everglades stormwater treatment areas of South Florida in the United States.  Ecological Engineering, 29:158-168.

Qian, S.S. and C.J. Richardson (1997) Estimating the long-term phosphorus accretion rate in the Everglades: a Bayesian approach with risk assessment.  Water Resources Research, 33(7): 1681-1688.