Tuesday, August 26, 2014

This is not how you should treat your graduate student,

even if you believe that she is not worthy of your talent.

Recently, I served as the advisor of a graduate student at Duke University. It is unusual for someone from another university to serve as the major advisor, but I agreed to do so two years ago because of the unusual circumstances.

The student came to the US from Iran, and her first advisor left Duke before she finished her first year. After another year, her second advisor retired, but he had arranged funding for her PhD dissertation and she worked with him for another year. When the funding ended at the end of her third year, she was forced to switch advisors again. This is when her third advisor (let's call him Dr. NC, as in North Carolina) came into the picture.

Dr. NC had no knowledge of Bayesian statistics and no interest in estuary eutrophication (the two main topics of the student's dissertation). I was invited to join her committee, becoming the only member who actually knew about her work. During the academic year when she worked under Dr. NC, he constantly questioned the scientific value of her research.

In completing her first manuscript, she included Dr. NC as a co-author because all advisors must be co-authors, no matter how much we contribute. But again, Dr. NC repeatedly questioned the scientific value of the manuscript, even as a co-author. He tried to prevent her from going to a conference to present the paper; she went anyway with funding from the university. In the end, he insisted that the manuscript be submitted to a journal in his field rather than a more applied one.

While the manuscript was under review, Dr. NC suddenly decided that the student's entire dissertation (including the submitted manuscript) was not worthy of his time and effort. He demanded that she either graduate with a master's degree or change direction altogether. This demand was made in the student's fourth year. She contacted me and her director of graduate studies. We decided that she should form a new committee and continue her research as planned, and the DGS asked me to serve as her advisor. Over the next 15 months, she completed her dissertation and successfully defended it in July of 2014. She is now working as a postdoc at the EPA.

But this was not the end of Dr. NC's involvement. In mid-2013, the submitted manuscript mentioned above was rejected (no surprise there). She and I read the reviewers' comments and decided that the rejection was mainly because the manuscript had been submitted to the wrong journal. We rewrote the manuscript, removing all material that had been included at Dr. NC's insistence on a theoretical orientation. We also changed the presentation so that the paper serves as the lead-in to the subsequent chapters of her dissertation. When it was submitted to an applied marine pollution journal, it was accepted with only minor revisions. Obviously, Dr. NC was not a co-author on the new paper.

After she left Duke in August, Dr. NC demanded an explanation of why he was not included as a co-author. I told him that the paper had been rewritten without any of his input. Given his earlier doubts about the scientific value of the student's work, we believed that he would not want to be associated with the paper. Dr. NC blasted the student and me for our unethical behavior and vowed further action against us through Duke University and the journal.

I would like to share the following with all young professors:

You should treat your students as equals. If you don't like what they want to do, make sure that you explain your reasons. You obviously don't want to advise a student who pursues a topic on which you are not qualified to advise: either do not take on the student as an advisee, or convince her/him to change direction from the very beginning. If you don't like a student and have dismissed her, you should just forget about her. It makes you look really bad to come back and claim credit after the student has succeeded without your involvement.

Monday, April 28, 2014

Statistics is More Than P-values and AIC

Introduction

The controversy between R.A. Fisher and J. Neyman and E. Pearson started with the publication of Neyman [1935]. It was not settled during the lifetimes of Fisher and Neyman, and it manifests today in the debate about the role of \( p \)-values in statistical inference. The March 2014 issue of Ecology is the latest round of this debate among ecologists and statisticians interested in ecological problems. Although I am sure that another discussion on the topic will not resolve the issue, I will try nevertheless, because the Ecology forum did not mention the Fisher – Neyman-Pearson controversy, and I believe that the underlying differences between the two parties can help us better understand the issue. It is through understanding the controversy that I changed my attitude towards the use of \( p \)-values and other related classical statistical concepts. In this paper, I argue that the role of \( p \)-values or other statistical concepts should be determined by the nature of the ecological problem, not by their mathematical characteristics, and that the role of statistics is to help us develop a “principled argument” in explaining the ecological phenomenon of interest.

The Ecology forum

On \( p \)-values

The Ecology forum focused on elucidating the \( p \)-value, the Akaike Information Criterion (AIC), and the proper use of these two statistics in model selection.

When arguing for the use of \( p \)-values, we read that a \( p \)-value is a monotonic function of the likelihood ratio in a model selection or comparison problem. Consequently, \( p \)-values, confidence intervals, and AIC are all based on the same basic information – the likelihood function – and which one to use is a question of style. Criticisms of the \( p \)-value would be more accurately described as criticisms of its rigid interpretation by practitioners. When discussing \( p \)-values as evidence against the null hypothesis, we equate evidence with the likelihood ratio. In this regard, a \( p \)-value is one piece of evidence. When we use a \( p \)-value as the only piece of information – for example, “we found no evidence (\( p = 0.06 \))” – we mistake the statistical definition of evidence (the likelihood ratio) for the broad definition of evidence (anything used to make a decision).

Critics of the \( p \)-value often use two lines of attack. One is to argue that a \( p \)-value violates the likelihood principle because we calculate it using both the observed data and more extreme data that were not observed. The other is to cite an intrinsic shortcoming of the \( p \)-value.

This intrinsic shortcoming is the incoherence of the \( p \)-value as a measure of evidence against the null hypothesis (the smaller the \( p \)-value, the stronger the evidence). When one null hypothesis (e.g., \( H_1 : \mu \leq 0 \)) includes another as a subset (e.g., \( H_2 : \mu = 0 \)), the measure of evidence against the subset (\( H_2 \)) should be as strong as or stronger than the measure of evidence against the full set (\( H_1 \)). As we know, if the data result in a one-sided \( p \)-value of 0.034 (i.e., for \( H_1 \)), the same data will result in a \( p \)-value of 0.068 for \( H_2 \). The evidence against the subset appears weaker than the evidence against the full set.
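A minimal numerical sketch of this example in R, assuming a normal test statistic (the numbers are the ones quoted above):

```r
# A one-sided p-value of 0.034 corresponds to a z-statistic of about 1.82;
# the two-sided p-value for the point null H2 doubles the tail area to 0.068
z <- qnorm(1 - 0.034)        # z implied by the one-sided p-value
p_H1 <- 1 - pnorm(z)         # evidence against H1: mu <= 0 (0.034)
p_H2 <- 2 * (1 - pnorm(z))   # evidence against H2: mu = 0  (0.068)
round(c(z = z, p_H1 = p_H1, p_H2 = p_H2), 3)
```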

Lavine [2014] summarizes what appears to be the common ground of most participants:

  • \( p \)-values, confidence intervals, and AIC are statistics based on the same statistical information;
  • these statistics are descriptive and they should not be used as formal quantification of evidence;
  • we should abandon binary (accept/reject) declarations, whether they are based on \( p \)-values or AIC;
  • we should be careful when interpreting a \( p \)-value or AIC as strength of evidence (the same \( p \)-value, say 0.01, in two problems may represent very different strength);
  • above all, we should interpret the model based on plots and checks of assumption compliance.

The consensus is in line with Abelson's MAGIC criteria, which state that a statistical inference should be a principled argument, judged by criteria representing Magnitude, Articulation, Generality, Interestingness, and Credibility [Abelson, 1995], not just a \( p \)-value, AIC, or any other single statistic. However, a survey of 24 early-career ecologists suggested that ecologists often pay more attention to \( p \)-values than to the parameter of biological interest – the effect size.

On AIC

Burnham and Anderson [2014] vehemently rejected the defense of \( p \)-values and insisted that we should use AIC when choosing among multiple alternative models. They dismissed hypothesis testing as 20th-century statistical science and proclaimed the use of AIC to be 21st-century statistical science. Instead of viewing the \( p \)-value as a monotonic function of the likelihood ratio in the context of model comparison, Burnham and Anderson [2014] reiterated the conditional-probability definition of a \( p \)-value and linked −2 times the log-likelihood ratio to “information.”

Aho et al. [2014] discussed the use of AIC and the Bayesian information criterion (BIC) for model selection. They conclude that AIC is a tool for picking the model that is most accurate in predicting out-of-sample data: when using AIC, we are not focused on selecting the correct model, but on a model that is adequate for prediction. BIC, in contrast, is an instrument for selecting the correct model: when using BIC, we assume that the correct model is among the candidate models. AIC is appealing for ecologists because we work with complex systems in which the correct model is almost always elusive.

More on AIC later.

The Fisher – Neyman-Pearson Controversy

The Ecology forum avoided the Fisher – Neyman-Pearson controversy. A revisit of the controversy may be helpful as it reveals the underlying philosophical difference between scientific research and management.

When facing a scientific problem, we are interested in the underlying causal relationship. Fisher views statistics as a tool for scientific research (or inductive reasoning). Fisher [1922] divides statistical analyses into three groups of problems – problems of specification, problems of estimation, and problems of distribution. Problems of specification represent the step of formulating a model, or hypothesis. In this step, we ask, “of what population is this a random sample?” An answer is a proposed model parameterized with unknown coefficients. Problems of estimation represent the step of estimating the model coefficients from observed data. I interpret problems of distribution as the step of model evaluation. In Fisher's terms, once we have selected a model, “the adequacy of our choice may be tested a posteriori.” This is a problem of distribution in that we must select models which we know how to handle.

I interpret these groups of problems in terms of a hypothetical deductive reasoning process. We start our research by proposing a hypothesis, or model, of the underlying causal relationship of interest. Fisher insisted that the model must be parametric because the “object of statistical methods is the reduction of data.” He further explained that “a quantity of data, which usually by its mere bulk is incapable of entering the mind, is to be replaced by relatively few quantities which shall adequately represent the whole, or which, in other words, shall contain as much as possible, ideally the whole, of the relevant information contained in the original data.” A parametric model achieves the goal of reducing the data while retaining the information in the data. With the model, we then estimate the unknown model coefficients from observed data. With the fitted model, we test the adequacy of the model choice. Under this framework, we interpret a simple linear regression model as follows. First, we assume that the observed values of the response variable (\( y \)) are random samples from a normal distribution, parameterized by its mean and standard deviation. The mean is further assumed to be a linear function of the predictor variable (\( x \)). This model can be expressed as \( y_i \sim N(\mu_i, \sigma^2) \), where \( \mu_i = \beta_0 + \beta_1x_i \). When data \( (y_i, x_i) \) are available, we estimate the model coefficients (\( \beta_0, \beta_1, \sigma \)) by maximum likelihood. Once the coefficients are estimated, we want to evaluate whether the model agrees with the data. Because the statistical inference prescribed by Fisher is a hypothetical deductive process, if the proposed model is inappropriate, the statistical inference based on it is meaningless. This is why we emphasize model checking after a regression model's coefficients are estimated.
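To make the specification–estimation sequence concrete, here is a minimal R sketch using simulated data (all numbers and variable names are hypothetical, not from any real study):

```r
# Specification: y_i ~ N(beta0 + beta1 * x_i, sigma^2)
set.seed(123)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50, sd = 1)

# Estimation: lm() gives the least-squares fit, which under the normal
# assumption is also the maximum likelihood estimate of beta0 and beta1
fit <- lm(y ~ x)
coef(fit)           # estimates of beta0 and beta1
summary(fit)$sigma  # estimate of the residual standard deviation
```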

Under Fisher’s statistical inference framework, we test the initial model by comparing the model predicted and the observed. For example, a linear regression model consists of at least two hypotheses. First, the response variable is a normal random variable, which predicts that the model residuals are random variates from a normal distribution with mean 0 and a constant variance. Once the model is fit to the data, we obtain a set of residual values (the observed) and we compare the residuals to a normal distribution, using various graphical or analytic methods. Second, we assume that the mean of the response variable is a linear function of the predictor \( x \). In a simple linear regression problem, we compare the estimated linear function to the observed data to check for departure from the linearity assumption. Graphical tools are often the most effective means for these comparisons [Cleveland, 1993].
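Continuing the same hypothetical example, the evaluation step can start with a few simple graphical comparisons (only a sketch; many other checks are possible):

```r
# Refit the simulated example and check the two assumptions graphically
set.seed(123)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50, sd = 1)
fit <- lm(y ~ x)

par(mfrow = c(1, 3))
plot(x, y); abline(fit)                       # departure from linearity?
plot(fitted(fit), resid(fit)); abline(h = 0)  # non-constant variance or curvature?
qqnorm(resid(fit)); qqline(resid(fit))        # residuals vs. a normal distribution
```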

The signature of the Neyman-Pearson approach is the Neyman-Pearson lemma, which shows that the test based on the likelihood ratio is the most powerful test. The lemma provides a framework for formulating a decision problem as a confrontation between two hypotheses. Under this framework, the decision (accepting one of the two hypotheses) is a mathematical process of evaluating the relative risks. Neyman and Pearson introduced the concepts of the significance level (the probability of erroneously rejecting a correct hypothesis) and power (the probability of correctly rejecting a false hypothesis). The Neyman-Pearson approach consists of three steps: formulating the two alternative hypotheses, setting the significance level, and maximizing the power. The last step is achieved by different calculations for different types of problems; the statistical tests we learn were mostly developed under this framework. The Neyman-Pearson approach fundamentally changed the emphasis of statistical inference: a hypothesis testing problem becomes a problem of mathematical deduction – one most powerful test for each type of problem. As a result, problem formulation is the important first step for determining which test to use. Once a problem is formulated, mathematics takes over to decide which hypothesis should be accepted.

Lenhard [2006] pointed out that the philosophical differences between Fisher and Neyman-Pearson lie in their interpretations of the role of models. In Lenhard's terms, Fisher views a statistical model as a mediator between mathematics and the real world, while Neyman and Pearson view a model as a pre-condition for deriving the optimal “behavior.” In Fisher's world, models can be changed upon observing new data. In Neyman-Pearson's world, models are an integral part of the inferential framework, and new data can change only the behavior, not the model. The two worlds share the same mathematics.

In Fisher's world, we pay attention to the \( p \)-value because it is used as evidence against the hypothesis. A large or small \( p \)-value tells us whether a new model should be attempted. In this world, our initial model can be wrong, and we have room for improving the model based on new information (data). In the Neyman-Pearson world, we are interested only in whether the \( p \)-value is below or above the significance level \( \alpha \). In this world, we derive statistical tests/procedures for different situations (e.g., if data are normal and independent, we use a \( t \)-test for comparing population means).

Statisticians mostly operate in Neyman-Pearson's world, in that most statisticians develop new methods for new problems. Ecologists should mostly operate in Fisher's world, in that we are interested in learning about the underlying model that can explain the pattern in the data. We conduct experiments and collect data to test our theories, and we are ready to modify our models when the data show evidence of weakness. For example, Beck [1987] presents an approach that uses repeated measurements over time to discover weaknesses in a water quality model for the River Cam in the U.K.; the exposed weaknesses are then used to modify the model.

However, ecologists learn statistics from statisticians. As a result, most of us are accustomed to the Neyman-Pearson world. We learn different statistical tests and models one at a time, from the \( t \)-test to ANOVA to linear regression. Because Fisher and Neyman-Pearson never settled their controversy, the statistics we learned is a hybrid of the two worlds. The concept of a \( p \)-value as evidence is naturally appealing to ecologists, but Neyman-Pearson's inferential structure is dominant in almost all ecological curricula. As a result, we construct a hypothesis testing procedure with a null hypothesis often known to be wrong and report a very small \( p \)-value to suggest that the alternative is true. Frequently, we mistake the small \( p \)-value for evidence supporting the specific value estimated from the data. We often see a \( p \)-value attached to an estimated quantity (e.g., “the estimated mean is 3.4 (\( p < 0.001 \))”) without a statement of the actual hypothesis. Presumably, the \( p \)-value is calculated against a null hypothesis mean of 0, the default of many statistical software packages.
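A small R illustration of this default behavior (simulated, purely hypothetical data):

```r
# t.test() tests against mu = 0 unless told otherwise, so a reported p-value
# without a stated hypothesis usually refers to H0: mu = 0
set.seed(1)
y <- rnorm(20, mean = 3.4, sd = 1)
t.test(y)          # default null hypothesis: mu = 0
t.test(y, mu = 3)  # the hypothesis has to be stated to test anything else
```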

Consequences of the Controversy

The philosophical difference between Fisher and Neyman-Pearson has real consequences. Because of the hybrid statistical paradigm, we often confuse a research question with a decision problem. This confusion is clearly illustrated by the well-publicized clinical trial reported by Ridker et al. [2008]. The objective of the trial was to decide whether a cholesterol-reducing drug (rosuvastatin, a statin) is effective in preventing “first major cardiovascular events.” The trial divided 17,802 apparently healthy men and women (with no elevated cholesterol levels) into treatment and control groups. By the end of the study, the treatment group had 142 cases of cardiovascular events, a risk of 1.6% (142/8901), and the control group had 251 cases, a risk of 2.8% (251/8901). These numbers were normalized to annual rates, and a statistical test showed that the observed annual risk ratio (0.56) is statistically different from the null-hypothesis ratio of 1 (no effect). Because 0.5 is inside the 95% confidence interval of the risk ratio, the result of the study was reported in the news media as “a 50% reduction in heart disease risk,” and accordingly, healthy people were recommended to take a daily dose of statin.
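As a rough check on these numbers (ignoring the trial's survival-analysis design, so the R sketch below is only an approximation of the published analysis):

```r
# Raw event counts quoted above: treatment vs. control, 8901 subjects each
events <- c(142, 251)
n <- c(8901, 8901)

events / n                                # raw risks, about 1.6% and 2.8%
(events[1] / n[1]) / (events[2] / n[2])   # raw risk ratio, about 0.57
prop.test(events, n)                      # test of equal proportions: tiny p-value
```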

If we take this trial as a scientific endeavor, the result may suggest a worthy research topic. Further research may lead to a better understanding of the causes of various heart diseases, which may result in better information on heart disease prevention at the individual level. When public health is of interest, we may want to examine whether the reduction in cardiovascular events (109 out of 8901) is practically meaningful.

If we take this trial as a decision process, the decision is a personal one. Whether or not I should take a daily dose of statin should be decided by me after consulting with my doctor. Mathematical considerations of type I and type II errors alone are not sufficient. In fact, type I and type II errors are irrelevant, because the estimated effect is a population average, not an effect specific to any individual.

In both cases, we should treat the evidence represented in the \( p \)-value as one piece of information but not the sole evidence. The use of \( p \)-values and AIC “has made scientific inference rather formulaic and somewhat trivialized it” by “putting too much credence on each individual outcome, rather than a broader body of evidence” [C.A. Stow, 2014, personal communication].

AIC and DIC

There are at least four information criteria (IC) frequently reported in the literature. However, the term “information” is somewhat misleading, especially when a specific IC value is interpreted on its own. Gelman et al. [2013] discussed three information criteria from a Bayesian point of view. I summarize their discussion of AIC and the deviance information criterion (DIC) in this section.

AIC is an approximation of a model's out-of-sample predictive accuracy, which is the expected log density of an out-of-sample observation given the estimated model parameters. Mathematically, this is expressed as \( E(\log p(\tilde{y}|\hat{\theta}(y))) \), where \( E \) is the mathematical expectation, \( p \) represents a probability density function, \( \tilde{y} \) is an out-of-sample observation (not used in model fitting), and \( \hat{\theta}(y) \) is the vector of model parameters estimated from the observations \( y \). Because we do not have \( \tilde{y} \), the out-of-sample predictive accuracy cannot be calculated directly. AIC uses the log predictive density of the observed data evaluated at the estimated parameter values (typically the MLE), \( \log p(y|\hat{\theta}_{mle}) \), as an approximation. In other words, the observed data are used first to estimate the model parameters and then again to calculate the predictive density. As a result, the approximation is an overestimate of the out-of-sample predictive accuracy. The simplest correction for this bias is to subtract the number of parameters (\( k \)) – \( \log p(y|\hat{\theta}_{mle}) − k \) – because the expected increase in the apparent predictive accuracy from adding one statistically insignificant parameter to the model is 1. This result is based on an asymptotic normal distribution (i.e., the posterior distribution of the model parameters, \( p(\theta|y) \), is approximately normal). Akaike [1974] defined AIC as this corrected predictive accuracy multiplied by −2: \( AIC = −2 \log p(y|\hat{\theta}_{mle}) + 2k \). AIC works well for linear models (including generalized linear models).

When the model structure is more complicated than a linear model, simply subtracting \( k \) is no longer appropriate; for a more complicated Bayesian model, especially a hierarchical Bayesian model, the correction can be too high. The deviance information criterion (DIC) uses the posterior mean for the parameter estimates and a data-based bias correction: \( DIC = −2\log p(y|\hat{\theta}_{Bayes}) + 2 p_{DIC} \), where \( p_{DIC} \) is the effective number of parameters, defined as \( p_{DIC} = 2\left(\log p(y|\hat{\theta}_{Bayes}) - E_{post}\left(\log p(y|\theta)\right)\right) \), with the second term being the average of the log likelihood over the posterior distribution of the model parameters.

In short, information criteria are measures of model predictive accuracy. They are typically defined based on deviance (-2 times log likelihood) evaluated using a point estimate of model parameters – \( −2 \log p(y|\hat{\theta}) \). The deviance is typically an overestimate of the out-of-sample predictive accuracy. AIC and DIC represent two approximations developed under different assumptions and conditions to correct the bias. Because an IC is evaluated using the deviance (a function of sample size, among other factors), the absolute value of AIC or DIC is meaningless. They should be used to compare alternative models fit to the same data.
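For a linear model, the AIC arithmetic can be verified by hand; a minimal R sketch with simulated (hypothetical) data:

```r
# AIC = -2 * log-likelihood + 2k, where k counts all estimated parameters
# (for lm: the regression coefficients plus the residual standard deviation)
set.seed(2)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
fit <- lm(y ~ x)

k <- attr(logLik(fit), "df")                   # 3 parameters here
aic_by_hand <- -2 * as.numeric(logLik(fit)) + 2 * k
c(by_hand = aic_by_hand, built_in = AIC(fit))  # the two agree
```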

Johnson (2013)

Johnson [2013] recommended using a significance level of 0.005 to increase the reproducibility of research. His recommendation is based on the target Bayes factor of 1/25 or 1/50 recommended by Jeffreys [1961]. This approach does not change the formulaic nature of hypothesis testing, but it does raise the question of how we balance the trade-off in the context of research and discovery. A smaller significance level is associated with lower power, that is, a lower chance of discovering a true effect. In ecological studies, sample size is often the limiting factor; as a result, requiring a much more stringent significance level will have a detrimental effect on ecological research.

Johnson's suggestion is understandable in the context of the statin clinical trial. If the true effect of statin on a healthy population is similar to the estimated difference of 1.2% (2.8% − 1.6%), the test reported in Ridker et al. [2008] has a power of more than 0.99991. (I simplified the problem to a two-sample proportion test and used the R function power.prop.test() to calculate the powers; see the sketch below.) From a decision-making perspective, a power of almost 1 indicates that we view the type II error as the far more serious error. Specifically, in this case we allow a 5% chance of making a type I error (a healthy person is prescribed statin although statin does not prevent heart disease) and a less than 0.01% chance of making a type II error (statin is marginally effective in preventing heart disease, but it is not prescribed to healthy people). Is the consequence of the type II error really so much more severe? If we want to balance the probabilities of the two types of errors, we can set the significance level between 0.0025 and 0.005 to obtain a power between 0.996 and 0.998. In an ecological study, by contrast, we often have sample sizes much smaller than the number needed to justify a type I error probability of 0.005. Therefore, a uniform threshold across all disciplines of science is unwise and counterproductive.
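A sketch of the power calculation mentioned above, using the same simplification to a two-sample proportion test (the exact value depends on the approximation used, so treat the output as illustrative):

```r
# Power to detect risks of 2.8% vs. 1.6% with 8901 subjects per group
power.prop.test(n = 8901, p1 = 0.028, p2 = 0.016, sig.level = 0.05)

# A smaller significance level trades a little power for fewer false positives
power.prop.test(n = 8901, p1 = 0.028, p2 = 0.016, sig.level = 0.005)
power.prop.test(n = 8901, p1 = 0.028, p2 = 0.016, sig.level = 0.0025)
```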

If a balance of type I and type II error probabilities is desired [Rotenberry and Wiens, 1985, Hayes and Steidl, 1997], the significance level should not be predetermined; rather, it should be chosen based on the sample size and the target effect size. This line of thinking reminds us that a \( p \)-value is one piece of information. It is not a good measure of evidence, and it may be incoherent: a \( p \)-value of 0.05 may represent strong evidence in one case and weak evidence in another. Information from both the data and elsewhere is needed. For example, to calculate the power of a test, we need to decide on an effect size that is ecologically meaningful. When a “balanced” significance level is large (e.g., 0.25), we should conclude that the existing data alone do not contain enough information, and either additional data should be collected or other sources of information should be sought.

Conclusion

The conceptual difficulty regarding the proper places of the \( p \)-value and AIC/DIC in ecological research has its roots in the controversy between Fisher and Neyman-Pearson. The controversy is about the role of models in statistical inference. Fisher views statistics as a tool for induction by means of hypothetical deduction; Neyman-Pearson's framework turns a statistical problem into one of mathematical deduction. Both have their place in real-world applications.

Many suggest that we should promote the use of Bayesian statistics to avoid the pitfalls of \( p \)-values. If the Bayesian alternative is simply to replace the \( p \)-value with the Bayes factor, we are still trapped in the same formulaic dichotomy defined by Neyman-Pearson. In my opinion, using Bayesian statistics is not necessarily a solution to the conceptual difficulty with \( p \)-values and AIC/DIC. The fundamental question is still whether we use statistics as a mathematical tool for making a decision or as part of our research, subject to change. Whether we use Bayesian or classical statistics has little bearing on this question.

Implications for education

I have been teaching classical statistics/biostatistics to first-year graduate students in environmental and ecological studies for the last ten years. Because my statistical training was mostly in Bayesian statistics, I have always been skeptical of \( p \)-values. Consequently, I deliberately deemphasize single statistics such as the \( p \)-value in my teaching. But I have been disappointed almost every year, because by the end of the semester most students automatically base their conclusions on the \( p \)-value! On reflection, I now believe that this outcome is to be expected. In applying statistics to a problem, the first question we ask is about the distribution of the data, or the most appropriate model to use. Answering this question requires not only knowledge of statistical distributions and models, but also ecological knowledge. Most first-year graduate students do not have enough ecological knowledge and experience to make such a choice; they are unprepared to make the connection between ecological data and statistical distributions. Statistical inference starts with the problem of specification, and this problem is essentially impossible for most students. The consequence is that we teach statistical tests and models as individual mathematical topics. No matter how hard we try, students will see these topics in isolation. The best we can expect is that some students will remember how to carry out some of the tests and how to fit some of the models. But because these tests and models remain isolated, most students cannot make the connection between statistical models and real-world problems. As a result, the most memorable things from a statistics course are the \( p \)-value of a test and the \( R^2 \) of a regression.

On the one hand, if we are teaching students to use statistics as a research tool, we should consider teaching statistics later in their studies, after they have learned some basics of their chosen field. On the other hand, statistics is also needed early in research (e.g., when designing an experiment). It is a conundrum without an easy solution.

However, Peters [1991] suggested that “statistics are better learned from direct applications of the statistics in the context of one's own research.” Accordingly, we should provide opportunities for our graduate students to learn statistics after they complete their required quantitative courses.

To close this discussion, I describe a project-based course I teach regularly. The content of the course is determined by student-proposed projects. In a typical year, I have fewer than 10 students in this class, each with a project related to his/her thesis or dissertation work. I start the course with student presentations of their problems, followed by my recommendations of appropriate statistical methods. I then teach selected topics while the students carry out their work. Students present their preliminary results in the second half of the semester. Based on the discussions and critiques during these presentations, students revise their work and write a final report. These reports are often turned into manuscripts, for example, Wu et al. [2011].

References

R.P. Abelson. Statistics as Principled Argument. Psychology Press, New York, 1995.

K. Aho, D. Derryberry, and T. Peterson. Model selection for ecologists: the worldviews of AIC and BIC. Ecology, 95(3):631–636, 2014.

H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, 1974.

M.B. Beck. Water quality modeling: a review of the analysis of uncertainty. Water Resources Research, 23(8):1393–1442, 1987.

K.P. Burnham and D.R. Anderson. P values are only an index to evidence: 20th- vs. 21st-century statistical science. Ecology, 95(3):627–630, 2014.

W.S. Cleveland. Visualizing Data. Hobart Press, Summit, NJ, 1993.

R.A. Fisher. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, Series A, 222:309–368, 1922.

A. Gelman, J. Hwang, and A. Vehtari. Understanding predictive information criteria for Bayesian models. Statistics and Computing, pages 1–20, 2013.

J.P. Hayes and R.J. Steidl. Statistical power analysis and amphibian population trends. Conservation Biology, 11(1):273–275, 1997.

H. Jeffreys. Theory of Probability. Oxford University Press, Oxford, 3rd edition, 1961.

V.E. Johnson. Revised standards for statistical evidence. Proceedings of the National Academy of Sciences, 110(48):19313–19317, 2013.

M. Lavine. Comment on Murtaugh. Ecology, 95(3):642–645, 2014.

J. Lenhard. Models and statistical inference: The controversy between Fisher and Neyman-Pearson. The British Journal for the Philosophy of Science, 57(1):69–91, 2006.

J. Neyman. Statistical problems in agricultural experimentation. Journal of the Royal Statistical Society, 2(Supplement):107–180, 1935.

R.H. Peters. A Critique for Ecology. Cambridge University Press, 1991.

P.M. Ridker, E. Danielson, F.A.H. Fonseca, J. Genest, A.M. Gotto, J.J.P. Kastelein, W. Koenig, P. Libby, A.J. Lorenzatti, J.G. MacFadyen, B.G. Nordestgaard, J. Shepherd, J.T. Willerson, and R.J. Glynn. Rosuvastatin to prevent vascular events in men and women with elevated C-reactive protein. The New England Journal of Medicine, 359(21):2195–2207, 2008.

J.T. Rotenberry and J.A. Wiens. Statistical power analysis and community-wide patterns. The American Naturalist, 125(1):164–168, 1985.

R. Wu, S.S. Qian, F. Hao, H. Cheng, D. Zhu, and J. Zhang. Modeling contaminant concentration distributions in China's centralized source waters. Environmental Science and Technology, 45(14):6041–6048, 2011.

Saturday, March 29, 2014

An Alternative Interpretation of the \( p \)-value

The likelihood principle is often used as the basis for criticizing the use of a \( p \)-value in classical statistics. Reading the recent discussions in the March issue of Ecology, I came up with an alternative interpretation of what a \( p \)-value is.

The \( p \)-value is defined as the conditional probability of observing something as extreme as or more extreme than the data if the “null” hypothesis is true. In a one-sample \( t \)-test problem, we are interested in testing whether the population mean is equal to a specific value: \[ H_0: \mu=\mu_0. \]

The test is based on the central limit theorem, which states that the sampling distribution of the sample mean \( \bar{x} \) is \( N(\mu,\sigma^2/n) \), where \( \mu \) and \( \sigma \) are the population mean and standard deviation and \( n \) is the sample size. In hypothesis testing, we compare the sample mean \( \bar{x} \) to the sampling distribution under the null hypothesis, \( N(\mu_0,\sigma^2/n) \).

[Figure: the sampling distribution of \( \bar{x} \) under the null hypothesis, \( N(\mu_0, \sigma^2/n) \), with the tail area beyond the observed \( \bar{x} \) shaded.]

For simplicity, we assume that \( \sigma \) is known and use a one-sided \( p \)-value. The typical definition of the \( p \)-value is the shaded area in the figure: the probability of observing sample means as extreme as or more extreme than \( \bar{x} \). Hence the criticism that it violates the likelihood principle, because data not observed must be used to calculate the \( p \)-value.
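A minimal R sketch (with made-up values of \( \mu_0 \), \( \sigma \), \( n \), and \( \bar{x} \)) that reproduces a figure like the one that appeared here:

```r
# Sampling distribution of the sample mean under H0, with the tail beyond
# the observed x-bar shaded (the usual picture of a one-sided p-value)
mu0 <- 0; sigma <- 1; n <- 25; xbar <- 0.4
s <- seq(mu0 - 4 * sigma / sqrt(n), mu0 + 4 * sigma / sqrt(n), length.out = 200)
plot(s, dnorm(s, mu0, sigma / sqrt(n)), type = "l",
     xlab = "sample mean", ylab = "density")
shade <- s[s >= xbar]
polygon(c(xbar, shade, max(s)), c(0, dnorm(shade, mu0, sigma / sqrt(n)), 0), col = "grey")
abline(v = xbar, lty = 2)
```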

The \( p \)-value can be interpreted as a probability, but it can also be interpreted as an indicator of the likelihood – the density value at \( \bar{x} \). That is, the tail area is a monotonic function of the likelihood: a small \( p \)-value corresponds to a small likelihood value and vice versa. A density value is hard to interpret on its own, while a probability is scaled and easy to understand. Alternatively, we can also measure the evidence against the null hypothesis using the distance between \( \bar{x} \) and \( \mu_0 \), \( d=|\bar{x} - \mu_0| \), and \( d \) is also a monotonic function of the likelihood. Because \( \bar{x} = \mu_0 \pm d \) share the same likelihood, the probabilistic interpretation of the \( p \)-value must then be 2 times the shaded area. As long as we know the rule for translating a likelihood value into a \( p \)-value, whether it is 1 or 2 times the shaded area is irrelevant.

Using a \( p \)-value, we interpret the “evidence” against the hypothesis in terms of a probability, and a binary decision rule becomes easy to justify (no matter how arbitrary). I would argue that the \( p \)-value itself does not violate the likelihood principle. It can be seen as an easy-to-understand indicator of the likelihood (of the observed sample mean being a random sample from the sampling distribution defined by the null hypothesis). It is the literal interpretation of the indicator that introduces confusion.

Many years ago, when I took my first statistics course, we had to find the \( p \)-value in the standard normal distribution table at the back of the text. That is, we calculated \( z=\frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}} \) and found the \( p \)-value in a table. Could using the \( p \)-value simply be a computational shortcut for the likelihood? After all, the likelihood is a monotonic function of the \( p \)-value.
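A small numerical sketch of this monotone relationship (all values are made up for illustration):

```r
# For a range of observed sample means, compare the one-sided p-value (tail area)
# with the likelihood (density of x-bar under the null sampling distribution)
mu0 <- 0; sigma <- 1; n <- 25
xbar <- seq(0, 1, by = 0.1)                      # hypothetical observed sample means
z <- (xbar - mu0) / (sigma / sqrt(n))
p_value <- 1 - pnorm(z)                          # upper-tail p-value
likelihood <- dnorm(xbar, mu0, sigma / sqrt(n))  # density of x-bar under H0
cbind(xbar, p_value, likelihood)                 # the two decrease together
```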

Monday, January 6, 2014

Using Bayesian to Cheat

A few days before the end of last semester, I stumbled upon a paper in the journal Ecological Indicators [Song and Guan(2013)]. The title indicates that a Bayesian estimation method was used. The authors started with a discussion of “environmental efficiency analysis” and introduced an indicator, which is some kind of mathematical function of inputs and outputs. There was no intuitive explanation of the indicator. The indicator was based on three input variables (population, “fixed capital formation,” and non-renewable energy consumption), one desirable output variable (GDP), and one undesirable output (industrial SO2 emission). The calculation resulted in 17 environmental efficiency scores (9 cities in 2 years). The main objective of the paper is to explore factors affecting these environmental efficiency scores (EE) using regression. The potential factors are (1) per capita GDP (RGDP), (2) total import and export volume (IE), (3) the proportion of the “second industry” in GDP (GY), (4) the proportion of “the industry of the second industry” in GDP (GGY), and (5) the proportion of environmental spending in GDP (HZ). The authors explained that (1) is an indicator of economic scale, (2) is a measure of economic exchange with the outside world (the authors used the term “opening up,” an awkward translation of a Chinese term), (3) and (4) are measures of industrial structure, and (5) is the “government factor.” Not being an economist, I don't want to comment on the choice of these potential factors, except to note that GDP is used both as part of the environmental efficiency score and as a potential factor to explain the variance of that score.
The authors used a multiple regression approach, but the regression coefficients were estimated using MCMC. I was expecting a discussion of the choice of prior distributions for these coefficients, but it soon became clear that there was no prior distribution. So why use MCMC for a multiple regression problem? Based on the authors' affiliation (School of Statistics and Mathematics), I assume they know that there should be no substantive difference between using MCMC and using OLS. The authors presented a regression model of the form

\[ EE_i = \alpha + \beta_1 RGDP_i + \beta_2 IE_i + \beta_3 GY_i + \beta_4 GGY_i + \beta_5 HZ_i + \varepsilon_i. \]
The estimated model coefficients may have revealed the answer:

coefficient    estimate    standard error
α               -0.1188         1.7260
β1              -0.1564         0.1562
β2               3.8130         1.5510
β3               4.8050         8.0620
β4              -3.3960         5.5560
β5               0.9362        18.8900
All slopes except β2 are statistically indistinguishable from 0! Had the authors used OLS, a typical regression output would have included a column of p-values, which would make the paper unpublishable. Using MCMC, the authors were able to present the estimated coefficients, standard errors, and selected quantiles. Without the column of p-values, a busy reviewer may not catch the problem. (But all my students in an introductory biostatistics class recognized it.)
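The missing column is easy to reconstruct from the reported estimates and standard errors; a quick R sketch using a large-sample normal approximation:

```r
# Approximate two-sided p-values implied by the published estimates and standard errors
est <- c(alpha = -0.1188, b1 = -0.1564, b2 = 3.8130, b3 = 4.8050, b4 = -3.3960, b5 = 0.9362)
se  <- c(         1.7260,       0.1562,      1.5510,      8.0620,       5.5560,     18.8900)
z <- est / se
round(cbind(estimate = est, se = se, z = z, p = 2 * pnorm(-abs(z))), 3)
# Only b2 has |z| > 2; every other coefficient is indistinguishable from 0
```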
Is this a successful story of cheating by using “Bayesian” statistics?

References
[Song and Guan(2013)] Malin Song and Youyi Guan. The environmental efficiency of Wanjian demonstration area: a Bayesian estimation approach. Ecological Indicators, 36:59–67, 2013. 
