Log or not log, that is the question

May 19, 2018

In 2014 I taught a special topics class on statistical issues associated with measuring the concentration of the cyanobacterial toxin microcystins (MC) in Toledo's drinking water. That class led to a paper documenting the high level of uncertainty in the measured MC. One idea from that paper was to develop a better curve-fitting method to reduce the measurement uncertainty. The Ohio EPA and other regulatory agencies expressed no interest in my proposals.

In the following two summers, I directed two REU students to learn the measurement process. We designed an experiment using two ELISA kits to measure samples with known MC concentrations, obtained by diluting the standard solutions. Using the test results, we compared the standard curve fit on the MC concentration scale to the curve fit on the log-concentration scale. Both curves fit the data well, but the curve fit to the log-concentrations leads to far smaller predictive uncertainty. Based on this simple result, my students wrote a paper. After the paper was published, I learned that current ELISA test kits come with three quality-control samples of known MC concentrations. We are now searching for raw data from various sources to replicate our study. The replication is important because our study was done with a single test kit, and many argued that kit-to-kit variation is often more important, although we believe that such variation is largely due to the small sample size used for fitting the curve.
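For readers unfamiliar with ELISA calibration, the sketch below shows the basic machinery involved: fitting a four-parameter logistic standard curve (here parameterized in log10 concentration) to calibration standards and inverting it to estimate a sample's concentration. This is my own illustration, not the analysis from our paper; the calibration values, noise level, and parameter values are hypothetical, and the predictive-uncertainty comparison discussed above would additionally require propagating the fitted-curve uncertainty.

```python
# A minimal, purely illustrative sketch of ELISA standard-curve fitting and
# inverse prediction (not the analysis from the paper).  The calibration
# standards, noise level, and parameter values below are all hypothetical.
import numpy as np
from scipy.optimize import curve_fit

def four_pl(z, a, b, z0, d):
    """Four-parameter logistic response curve in z = log10(concentration)."""
    return d + (a - d) / (1.0 + np.exp(b * (z - z0)))

rng = np.random.default_rng(1)
conc = np.array([0.05, 0.15, 0.40, 0.75, 1.5, 3.0, 5.0])    # standards, ppb
true = four_pl(np.log10(conc), 2.0, 2.5, np.log10(0.75), 0.2)
resp = true + rng.normal(0.0, 0.03, size=conc.size)          # simulated absorbance

# Fit the standard curve on the log10-concentration scale
pars, cov = curve_fit(four_pl, np.log10(conc), resp,
                      p0=[2.0, 2.0, 0.0, 0.2], maxfev=10000)

# Inverse prediction: find the log10 concentration whose fitted response is
# closest to an observed sample response, then back-transform to ppb.
def estimate_conc(y_obs, pars):
    grid = np.linspace(np.log10(0.01), np.log10(6.0), 5000)
    z_hat = grid[np.argmin(np.abs(four_pl(grid, *pars) - y_obs))]
    return 10.0 ** z_hat

y_sample = four_pl(np.log10(1.0), 2.0, 2.5, np.log10(0.75), 0.2)  # a 1-ppb sample
print(f"estimated concentration: {estimate_conc(y_sample, pars):.2f} ppb")
```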

Revisit the Difficulty in Teaching Statistics

June 2017

Many years ago, I listened to a famous talk on why statistics is like literature and mathematics is like music. The point is that mathematics, like music, relies on deduction; statistics, on the other hand, is for induction. Suppose that A causes B and we know either A or B. If we know A, we use mathematics to deduce what will follow from A. If we know B, we use statistics to learn about the cause of B. Deduction is guided by clearly defined logic. Induction has no such set of rules. Before we take our first statistics class, we are entirely immersed in deduction. When we first learn statistics, we treat it as a topic in mathematics. The beginning of the class is inevitably probability, which reinforces the impression of deductive thinking. By the time the statistics portion of the class starts, we have already sunk hopelessly into the deduction mode, and most of us never dig ourselves out of it. By the time we take our first graduate-level statistics class, we have probably forgotten what little statistics we learned as undergraduates.

What makes the learning even harder is that the graduate-level class is almost always taught by professors from the statistics department on a rotational basis. No statistics professor wants to teach an applied course in a science field. Student teaching evaluations from these courses are always below average, regardless of the quality of teaching. With professors teaching the class either for the first time ever or for the first time after a long hiatus, the teaching quality is always less than optimal.

From a student's perspective, statistics is impossible to learn well. The thought process of modern statistics is hypothetical deduction. To understand the concept, we need to know a lot of science to be able to propose reasonable hypotheses; we also need to know a lot about probability distributions in order to know which distribution is most likely relevant. New students know neither. As a result, we teach a few simple models (t-test, ANOVA, regression). A good student can master each model and manage to use it in simple applications.

Recently, I read Neyman and Pearson (1933) to understand the development of the Neyman-Pearson lemma. The first two pages of the article are particularly stimulating. Neyman and Pearson traced statistical hypothesis testing back to Bayes, as a test of a cause-and-effect hypothesis. They then described what looks like the hypothetical-deductive process of hypothesis testing and concluded that "no test of this kind could give useful results." The Neyman-Pearson lemma is then described as a "rule of behavior" with regard to the hypothesis $H$. When following this rule, we "shall reject H when it is true not more, say, than once in a hundred times, and we shall reject H sufficiently often when it is false." Furthermore, "such a rule tells us nothing as to whether in a particular case $H$ is true or false," whether or not the test result is statistically significant. It appears to me that frequency-based classical statistics is really designed for engineers, whereas Bayesian statistics is suited for scientists.
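To make the "rule of behavior" reading concrete, here is a small simulation (my own illustration, not from Neyman and Pearson) showing that a 0.05-level test rejects a true null hypothesis in roughly 5% of repeated experiments, while saying nothing about the truth of H in any single experiment.

```python
# Simulating the long-run behavior of a 0.05-level one-sample t-test when the
# null hypothesis is true (illustration only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n = 10_000, 30
rejections = 0
for _ in range(n_experiments):
    x = rng.normal(loc=0.0, scale=1.0, size=n)    # data generated under H0: mu = 0
    _, p_value = stats.ttest_1samp(x, popmean=0.0)
    rejections += p_value < 0.05
print(f"long-run rejection rate under H0: {rejections / n_experiments:.3f}")  # about 0.05
```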

If teaching classical statistics is hard, teaching Bayesian statistics is harder (especially to American students who are poorly trained in calculus).

Peer-Review Fraud: Cite my paper or else!

May 19, 2016 (revised in 2018)

I serve as a peer reviewer a lot because I value the peer-review process. In the first few years after graduate school, reviewer comments on my manuscripts were often the most helpful part of the writing process. I benefited from the process, and I am willing to do what I can to contribute to it. Reviewers are volunteers, and their service is a critical part of academic publication. Because I believe in the system, I treat my review assignments seriously, write reviews objectively, and provide constructive recommendations. I want to do my part to keep this academic commons a sustainable endeavor.

In 2016, reviews of two manuscripts disturbed me. The lead authors of the two manuscripts were former students of mine. One is about the use of a generalized propensity score method to estimate the causal effect of nitrogen on stream benthic communities, and the other is on statistical issues in discretizing a continuous variable when constructing a Bayesian network model. The two manuscripts have nothing in common, except that I was the second author on both. Reviewers' comments on the two manuscripts came back in the same week, and one reviewer apparently reviewed both papers. This reviewer's comments on the two papers were essentially the same, and the suggestions were irrelevant to our work. It was clear to us that this reviewer was sending a message: cite my papers and I will let you go.

For the Bayesian network paper, we chose to ignore this reviewer, as he was one of four reviewers who commented on the paper; we also forwarded his comments on our propensity score paper to the editor, and the paper is now published. The propensity score paper, however, had only one reviewer. The lead author was a student at the time and was eager to add publications to his resume before graduation. After discussion, I wrote to the editor of the journal to explain our concerns and requested that the manuscript be treated as a new submission and go through the review process again. Although it would have been easy to add a sentence or two with the recommended citations, I believed it was important to uphold the principle. The associate editor ignored my request, so I sent it to the editor-in-chief. Although the editor promised to handle the re-review himself, he delegated the work to the same associate editor, who in turn made sure that the paper went through repeated reviews until it was rejected. The paper is now published in a different journal.

I copy the reviews in question below. Hopefully readers will reach the same conclusion I did. We want to publish, and we want our peers to read and cite our work because the work is worthwhile. Abusing one's "power" as a reviewer is just as bad as cheating!

Review on the Bayesian networks model paper:

General comments:

Overall I like the study and I feel it is fairly well written. My two observations are about the lack of global sensitivity and uncertainty analyses (GSUA) and a conversation about management implications that we can extract from the model/GSUA. Note that here with ''model'' I mean any method that use the data, yet any model that process the data in input and produce an output. That is useful for assessing input factor importance and interaction, regimes, and scaling laws between model input factors and outcomes. This differs from traditional sensitivity analysis methods. Thus, GSUA is very useful for finding out optimal management/design strategies. GSUA is a variance-based method for analyzing data and models given an objective function. It is a bit unclear how many realizations of the model have been run and how the authors maximized prediction accuracy. Are the values of the input factors taken to maximize predictions? GSUA (see references below) typically assigns probability distribution functions to all model factors and propagate those into model outputs.

In this context, that is about discretization methods for pdfs, the impact of discretization may be small or large depending on the pdf chosen (or suitable) for the variables; yet, the discretization may have different results as a function of the nature of the variables of interest as well as of the model used.

I think that independently of the model / variables used the authors should discuss these issues in their paper and possibly postpone further research along these lines to another paper.

Specific comments:

Variance-based methods (see Saltelli and Convertino below) are a class of probabilistic approaches which quantify the input and output uncertainties as probability distributions, and decompose the output variance into parts attributable to input variables and combinations of variables. The sensitivity of the output to an input variable is therefore measured by the amount of variance in the output caused by that input. Variance-based methods allow full exploration of the input space, accounting for interactions, and nonlinear responses. For these reasons they are widely used when it is feasible to calculate them. Typically this calculation involves the use of Monte Carlo methods, but since this can involve many thousands of model runs, other methods (such as emulators) can be used to reduce computational expense when necessary. Note that full variance decompositions are only meaningful when the input factors are independent from one another. If that is not the case information theory based GSUA is necessary (see Ludtke et al. )

Thus, I really would like to see GSUA done because it (i) informs about the dynamics of the processes investigated and (ii) is very important for management purposes.

Convertino et al. Untangling drivers of species distributions: Global sensitivity and uncertainty analyses of MaxEnt. Journal Environmental Modelling & Software archive Volume 51, January, 2014 Pages 296-309

Saltelli A, Marco Ratto, Terry Andres, Francesca Campolongo, Jessica Cariboni, Debora Gatelli, Michaela Saisana, Stefano Tarantola Global Sensitivity Analysis: The Primer ISBN: 978-0-470-05997-5

Ludtke et al. (2007), Information-theoretic Sensitivity Analysis: a general method for credit assignment in complex networks J. Royal Soc. Interface

Review on the propensity score paper:

GENERAL COMMENTS

After a careful reading of the manuscript I really like the study and I feel it can have some impact into the theory of biodiversity and biogeography at multiple scales. My two technical observations are about the lack of global sensitivity and uncertainty analyses (GSUA) and a conversation about management implications that we can extract from the model/GSUA. Also, I think the findings can be presented in a clearer way by focusing on (i) the universality of findings across macro-geographical areas, (2) probabilistic structure of the variable considered and (3) the possibility to discuss gradual and sudden change in a non-linear theoretical framework (tipping points and gradual change). I would strongly suggest to talk about ''potential causal factors/relationship'' rather than talking about true causality because that is very difficulty proven and many causality assessment methods exist (e.g. transfer entropy, conergence cross mapping, scaling analysis, etc.). Also, can you provide an explanation for Eq. 6? Figure 2 does not show regressions but scaling law relationship since you plot everything in loglog. This can be an important results, in fact I suggest you to consider this avenue of interpretation (see Convertino et al. 2014 but also other work or Rinaldo and Rodriguez-Iturbe).

Note that here with ''model'' I mean any method that use the data, yet any model that process the data in input and produce an output. Data in fact can be thought as a model and probability distribution functions (pdfs) can be assigned to data variables (see Convertino et al. 2014). These pdfs can be assigned to any source of uncertainty about a variable (e.g. changing presence / absence into a continuous variable) and the uncertainty of outputs (e.g. species richness) can be tested against the uncertainty of all input variables. I believe that just considering average values is not enough.

As for the rest I really love the paper. I suggest to also plot the patterns in Convertino et al (2009): these are for instance the JSI and the Regional Species Richness; in ecological terms these can be defined as alpha, beta and gamma diversity. These patters can be studied as a function of geomorphological patterns such as the distance from the coat in order to find potential drivers of diversity. These are just ideas that can be pursued further. Lastly I wonder if the data can be made available to the community for further studies. For all above motivations I suggest to accept the paper only after Moderate or Major Revisions. Again, I think that these revisions can just make better the paper.

SPECIFIC COMMENTS:

In any context, e.g. as in this paper GSUA is very important because it given an idea of what is driving the output in term of model input factor importance and interaction, and how that can be used for management. GSUA is a variance-based method for analyzing data and models given an objective function. It is a bit unclear how many realizations of the model have been run and how the authors maximized prediction accuracy. Are the values of the input factors taken to maximize predictions? GSUA (see references below) typically assigns probability distribution functions to all model factors and propagate that into model outputs. That is useful for assessing input factor importance and interaction, regimes, and scaling laws between model input factors and outcomes. This differs from traditional sensitivity analysis methods (that are even missing here)

Variance-based methods (see Saltelli and Convertino below) are a class of probabilistic approaches which quantify the input and output uncertainties as probability distributions, and decompose the output variance into parts attributable to input variables and combinations of variables. The sensitivity of the output to an input variable is therefore measured by the amount of variance in the output caused by that input. Variance-based methods allow full exploration of the input space, accounting for interactions, and nonlinear responses. For these reasons they are widely used when it is feasible to calculate them. Typically this calculation involves the use of Monte Carlo methods, but since this can involve many thousands of model runs, other methods (such as emulators) can be used to reduce computational expense when necessary. Note that full variance decompositions are only meaningful when the input factors are independent from one another. If that is not the case information theory based GSUA is necessary (see Ludtke et al. for an information theory model of GSUA).

Thus, I really would like to see GSUA done because it (i) informs about the dynamics of the processes investigated and (ii) is very important for management purposes.

REFERENCES

Convertino, M. et al (2009) On neutral metacommunity patterns of river basins at different scales of aggregation http://www1.maths.leeds.ac.uk/~fbssaz/articles/Convertino_WRR09.pdf

Convertino, M.; Baker, K.M.; Vogel, J.T.; Lu, C.; Suedel, B.; and Linkov, I., "Multi-criteria decision analysis to select metrics for design and monitoring of sustainable ecosystem restorations" (2013). US Army Research. Paper 190. http://digitalcommons.unl.edu/usarmyresearch/190 http://digitalcommons.unl.edu/cgi/viewcontent.cgi?article=1189&context=usarmyresearch

Convertino et al. Untangling drivers of species distributions: Global sensitivity and uncertainty analyses of MaxEnt Journal Environmental Modelling & Software archive Volume 51, January, 2014 Pages 296-309

Saltelli A, Marco Ratto, Terry Andres, Francesca Campolongo, Jessica Cariboni, Debora Gatelli, Michaela Saisana, Stefano Tarantola Global Sensitivity Analysis: The Primer ISBN: 978-0-470-05997-5

Ludtke et al. (2007), Information-theoretic Sensitivity Analysis: a general method for credit assignment in complex networks J. Royal Soc. Interface
