Monday, April 28, 2014

Statistics is More Than \( P \)-values and AIC

Introduction

The controversy between R.A. Fisher and J. Neyman and E. Pearson was started by the publication of Neyman [1935]. It was not settled during the lifetimes of Fisher and Neyman, and it manifests today in the debate about the role of \( p \)-values in statistical inference. The March 2014 issue of Ecology is the latest installment of this debate among ecologists and statisticians interested in ecological problems. Although I am sure that another discussion of the topic will not resolve the issue, I will try nevertheless, because the Ecology forum did not mention the Fisher – Neyman-Pearson controversy and I believe that the underlying differences between the two parties can help us better understand the issue. It was through understanding the controversy that I changed my attitude towards the use of \( p \)-values and other related classical statistical concepts. In this essay, I argue that the role of \( p \)-values and other statistical concepts should be determined by the nature of the ecological problem, not by mathematical characteristics, and that the use of statistics is to help us develop a “principled argument” in explaining the ecological phenomenon of interest.

The Ecology forum

On \( p \)-values

The Ecology forum focused on elucidating the \( p \)-value, the Akaike Information Criterion (AIC), and the proper use of these two statistics in model selection.

When arguing for the use of \( p \)-values, we read that a \( p \)-value is a monotonic function of the likelihood ratio in a model selection or comparison problem. Consequently, \( p \)-values, confidence intervals, and AIC are all based on the same basic information – the likelihood function – and which one to use is a question of style. Criticisms of the \( p \)-value would be more accurately described as criticisms of its rigid interpretation by practitioners. When discussing \( p \)-values as evidence against the null hypothesis, we equate evidence with the likelihood ratio. In this regard, a \( p \)-value is one piece of evidence. When we use a \( p \)-value as the only piece of information – for example, “we found no evidence (\( p = 0.06 \))” – we mistake the statistical definition of evidence (the likelihood ratio) for the broad definition of evidence (anything used to decide).

Critics of the \( p \)-value often use two lines of attack. One is to suggest that a \( p \)-value violates the likelihood principle in that we calculate the \( p \)-value using both the observed data and more extreme data not observed. The other is to cite the intrinsic shortcoming of the \( p \)-value.

This intrinsic shortcoming is the incoherence of a \( p \)-value as a measure of evidence against the null hypothesis (the smaller the \( p \)-value, the stronger the evidence). When one null hypothesis (e.g., \( H_1 : \mu \leq 0 \)) includes another as a subset (e.g., \( H_2 : \mu = 0 \)), the measure of evidence against the subset (\( H_2 \)) should be as strong as or stronger than the measure against the full set (\( H_1 \)). As we know, if the data result in a one-sided \( p \)-value (i.e., for \( H_1 \)) of 0.034, the same data will result in a \( p \)-value of 0.068 for \( H_2 \). The evidence against the subset is weaker than the evidence against the full set.
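
To make the example concrete, here is a minimal R sketch (assuming, for illustration only, a standard normal test statistic \( z \)); the two-sided \( p \)-value for \( H_2 \) is exactly twice the one-sided \( p \)-value for \( H_1 \):

    z <- qnorm(1 - 0.034)                          # z statistic whose one-sided p-value is 0.034
    p.H1 <- pnorm(z, lower.tail = FALSE)           # test of H1: mu <= 0, gives 0.034
    p.H2 <- 2 * pnorm(abs(z), lower.tail = FALSE)  # test of H2: mu = 0, gives 0.068
    c(p.H1, p.H2)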

Lavine [2014] apparently summarizes the common ground of most participants:

  • \( p \)-values, confidence intervals, and AIC are statistics based on the same statistical information;
  • these statistics are descriptive and they should not be used as formal quantification of evidence;
  • we should abandon binary (accept/reject) declarations, whether based on \( p \)-values or AIC;
  • we should be careful when interpreting a \( p \)-value or AIC as strength of evidence (the same \( p \)-value, say 0.01, in two problems may represent very different strength);
  • above all, we should interpret the model based on plots and checks of assumption compliance.

The consensus is in line with Abelson's (1995) MAGIC criterion, which states that a statistical inference should be a principled argument, measured by criteria representing Magnitude, Articulation, Generality, Interestingness, and Credibility, not just a \( p \)-value or AIC or any other single statistic. However, a survey of 24 early-career ecologists suggested that ecologists often pay more attention to \( p \)-values than to the parameter of biological interest – the effect size.

On AIC

Burnham and Anderson [2014] vehemently rejected the defense of \( p \)-values and insisted that we should use AIC when choosing among multiple alternative models. They dismissed hypothesis testing as 20th-century statistical science and proclaimed the use of AIC to be 21st-century statistical science. Instead of viewing the \( p \)-value as a monotonic function of the likelihood ratio in the context of model comparison, Burnham and Anderson [2014] reiterated the conditional-probability definition of a \( p \)-value and linked -2 times the log-likelihood ratio to “information.”

Aho et al. [2014] discussed the use of AIC and the Bayesian information criterion (BIC) for model selection. They concluded that AIC is a tool for picking the model that is most accurate in predicting out-of-sample data. When using AIC, we are not focused on selecting the correct model, but on selecting a model that is adequate for prediction. BIC, by contrast, is an instrument for selecting the correct model. When using BIC, we assume that the correct model is among the candidate models. AIC is appealing for ecologists because we work with complex systems in which the correct model is almost always elusive.

More on AIC later.

The Fisher – Neyman-Pearson Controversy

The Ecology forum avoided the Fisher – Neyman-Pearson controversy. A revisit of the controversy may be helpful as it reveals the underlying philosophical difference between scientific research and management.

When facing a scientific problem, we are interested in the underlying causal relationship. Fisher views statistics as a tool for scientific research (or inductive reasoning). Fisher [1922] divides statistical analyses into three groups of problems – problems of specification, problems of estimation, and problems of distribution. Problems of specification represent the step of formulating a model, or hypothesis. In this step, we ask “of what population is this a random sample?” An answer is a proposed model parameterized with unknown coefficients. Problems of estimation represent the step of estimating model coefficients from observed data. I interpret problems of distribution as a step in model evaluation. In Fisher’s terms, once we have selected a model, “the adequacy of our choice may be tested a posteriori.” These are problems of distribution in that we must select models that we know how to handle.

I interpret these groups of problems in terms of a hypothetical deductive reasoning process. We start our research by proposing a hypothesis or model of the underlying causal relationship of interest. Fisher insisted that the model must be parametric because the “objective of statistical methods is the reduction of data.” He further explained that “a quantity of data, which usually by its mere bulk is incapable of entering the mind, is to be replaced by relatively few quantities which shall adequately represent the whole, or which, in other words, shall contain as much as possible, ideally the whole, of the relevant information contained in the original data.” A parametric model achieves the goal of reducing the data while retaining the information in the data. With the model, we estimate the unknown model coefficients from observed data. With the fitted model, we test the adequacy of the model choice. Under this framework, we interpret a simple linear regression model as follows. First, we assume that the observed values of the response variable (\( y \)) are a random sample from a normal distribution, which is parameterized by its mean and standard deviation. The mean is further assumed to be a linear function of the predictor variable (\( x \)). This model can be expressed as \( y_i \sim N(\mu_i, \sigma^2) \), where \( \mu_i = \beta_0 + \beta_1 x_i \). When data \( (y_i, x_i) \) are available, we estimate the model coefficients (\( \beta_0, \beta_1, \sigma \)) by maximum likelihood. Once the coefficients are estimated, we want to evaluate whether the model agrees with the data. Because the statistical inference prescribed by Fisher is a hypothetical deductive process, if the proposed model is inappropriate, the statistical inference based on the model is meaningless. This is why we emphasize model checking after a regression model's coefficients are estimated.
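
As a concrete sketch of the specification and estimation steps, the following R code fits a simple linear regression to simulated data (the sample size and coefficient values are hypothetical); for the normal model, the least-squares fit returned by lm() coincides with the maximum likelihood estimates of the regression coefficients:

    set.seed(1)
    n <- 50
    x <- runif(n, 0, 10)                  # hypothetical predictor values
    y <- 2 + 0.5 * x + rnorm(n, sd = 1)   # y_i ~ N(beta0 + beta1 * x_i, sigma^2)
    fit <- lm(y ~ x)                      # estimates of beta0 and beta1 (and sigma)
    summary(fit)                          # estimated coefficients and residual standard error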

Under Fisher’s statistical inference framework, we test the initial model by comparing model predictions with observations. For example, a linear regression model consists of at least two hypotheses. First, the response variable is a normal random variable, which predicts that the model residuals are random variates from a normal distribution with mean 0 and constant variance. Once the model is fit to the data, we obtain a set of residual values (the observed) and compare the residuals to a normal distribution, using various graphical or analytic methods. Second, we assume that the mean of the response variable is a linear function of the predictor \( x \). In a simple linear regression problem, we compare the estimated linear function to the observed data to check for departure from the linearity assumption. Graphical tools are often the most effective means for these comparisons [Cleveland, 1993].
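
Continuing the simulated example above, a minimal sketch of these graphical checks in R:

    res <- resid(fit)
    plot(fitted(fit), res,                # residuals vs fitted values: look for curvature
         xlab = "Fitted values", ylab = "Residuals")   # or non-constant variance
    abline(h = 0, lty = 2)
    qqnorm(res); qqline(res)              # compare residuals to a normal distribution
    plot(x, y); abline(fit)               # estimated line against the observed data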

The signature of the Neyman-Pearson approach is the Neyman-Pearson Lemma, which shows that the likelihood ratio test is the most powerful test. The Neyman-Pearson Lemma provides a framework for formulating a decision problem as a confrontation of two hypotheses. Under this framework, the decision (accepting one of the two hypotheses) is a mathematical process of evaluating the relative risks. Neyman and Pearson introduced the concepts of the significance level (the probability of erroneously rejecting a correct null hypothesis) and power (the probability of correctly rejecting a false null hypothesis). The Neyman-Pearson approach consists of three steps: formulating the two alternative hypotheses, setting the significance level, and maximizing the power. The last step is carried out by calculations that vary with the type of problem. The statistical tests we learn are mostly developed under this framework. The Neyman-Pearson approach fundamentally changed the emphasis of statistical inference: a hypothesis testing problem is seen as a problem of mathematical deduction – one most powerful test for each type of problem. As a result, problem formulation is the important first step for determining which test to use. Once a problem is formulated, mathematics takes over to decide which hypothesis should be accepted.
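
For a familiar example of this recipe, consider a two-sample comparison of means (the data below are simulated and hypothetical): the two hypotheses are formulated, the significance level is fixed in advance, and the \( t \)-test, a test derived under this framework, delivers the accept/reject decision.

    set.seed(2)
    alpha <- 0.05                         # significance level, fixed before seeing the data
    ctl <- rnorm(20, mean = 0)            # hypothetical control observations
    trt <- rnorm(20, mean = 0.5)          # hypothetical treatment observations
    out <- t.test(trt, ctl)               # H0: equal means vs H1: unequal means
    out$p.value < alpha                   # reject H0 if TRUE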

Lenhard [2006] pointed out that the philosophical differences between Fisher and Neyman-Pearson lie in their interpretations of the role of models. In Lenhard’s terms, Fisher viewed a statistical model as a mediator between mathematics and the real world, while Neyman and Pearson viewed a model as a pre-condition for deriving the optimal “behavior.” In Fisher’s world, models can be changed upon observing new data. In Neyman-Pearson’s world, models are an integral part of the inferential framework, and new data can only change the behavior, not the model. The two worlds share the same mathematics.

In Fisher's world, we pay attention to the \( p \)-value because it is used as evidence against the hypothesis. A large or small \( p \)-value tells us whether a new model should be attempted. In this world, our initial model can be wrong and we have room for improving the model based on new information (data). In the Neyman-Pearson world, we are interested in whether the \( p \)-value is below or above the significance level \( \alpha \). In this world, we derive statistical tests/procedures for different situations (e.g., if data are normal and independent, we use a \( t \)-test for population means).

Statisticians mostly operate in Neyman-Pearson's world, in that they develop new methods for new problems. Ecologists should mostly operate in Fisher's world, in that we are interested in learning about the underlying model that can explain the pattern in the data. We conduct experiments and collect data to test our theories, and we are ready to modify our models when data show evidence of weakness. For example, Beck [1987] presents an approach of using repeated measurements over time to discover weaknesses in a water quality model for the River Cam in the U.K. The exposed weaknesses are then used to modify the model.

However, ecologists learn statistics from statisticians. As a result, most of us are accustomed to the Neyman-Pearson world. We learn different statistical tests and models one at a time, from the \( t \)-test to ANOVA to linear regression. Because Fisher and Neyman-Pearson never settled their controversy, the statistics we learn is a hybrid of the two worlds. The concept of a \( p \)-value as evidence is naturally appealing to ecologists, but Neyman-Pearson's inferential structure is dominant in almost all ecological curricula. As a result, we construct a hypothesis testing procedure with a null hypothesis often known to be wrong and report a very small \( p \)-value to suggest that the alternative is true. Frequently, we mistake the small \( p \)-value for evidence supporting the specific value estimated from the data. We frequently see a \( p \)-value attached to an estimated quantity (e.g., “the estimated mean is 3.4 (\( p < 0.001 \))”) without a statement of the actual hypothesis. Presumably, the \( p \)-value is calculated against a null hypothesis mean of 0, the default of many statistical software packages.

Consequences of the Controversy

The philosophical difference between Fisher and Neyman-Pearson has real consequences. Because of the hybrid statistical paradigm, we often confuse a research question with a decision problem. This confusion is clearly illustrated by the well-publicized clinical trial reported by Ridker et al. [2008]. The objective of the trial was to decide whether a cholesterol-reducing drug (rosuvastatin, a statin) is effective in preventing “first major cardiovascular events.” The trial divided 17,802 apparently healthy men and women (with no elevated cholesterol level) into treatment and control groups. By the end of the study, the treatment group had 142 cases of cardiovascular events, a risk of 1.6% (142/8901), and the control group had 251 cases, a risk of 2.8% (251/8901). These numbers were normalized to annual rates, and a statistical test showed that the observed annual risk ratio (0.56) is statistically different from the null hypothesis ratio of 1 (no effect). Because 0.5 is inside the 95% confidence interval of the risk ratio, the result of the study was reported in the news media as “a 50% reduction in heart disease risk,” and, accordingly, healthy people were recommended to take a daily dose of statin.
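
As a rough check of these numbers in R (using the raw counts quoted above rather than the annualized rates, so this only approximates the published analysis):

    events <- c(142, 251)                    # cardiovascular events: treatment, control
    n <- c(8901, 8901)                       # group sizes
    events / n                               # risks of about 1.6% and 2.8%
    (events[1] / n[1]) / (events[2] / n[2])  # unadjusted risk ratio, roughly 0.57
    prop.test(events, n)                     # two-sample test of equal risks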

If we take this trial as a scientific endeavor, the result may suggest a worthy research topic. Further research may lead to a better understanding of the causes of various heart diseases, which may result in better information on heart disease prevention at the individual level. When public health is of interest, we may want to examine whether the reduced number of cardiovascular events (109 out of 8901) is practically meaningful.

If we take this trial as a decision process, the decision is a personal one. Whether or not I should take a daily dose of statin should be decided by me after consulting with my doctor. Mathematical considerations of type I and type II errors alone are not sufficient. In fact, type I and type II errors are irrelevant, because the estimated effect is a population average, not specifically for any individual.

In both cases, we should treat the evidence represented in the \( p \)-value as one piece of information but not the sole evidence. The use of \( p \)-values and AIC “has made scientific inference rather formulaic and somewhat trivialized it” by “putting too much credence on each individual outcome, rather than a broader body of evidence” [C.A. Stow, 2014, personal communication].

AIC and DIC

There are at least four information criteria (IC) frequently published in the literature. However, the term “information” is somewhat misleading, especially when a specific value of an IC is interpreted on its own. Gelman et al. [2013] discussed three information criteria from a Bayesian point of view. I summarize their discussion of AIC and the deviance information criterion (DIC) in this section.

AIC is an approximation of a model's out-of-sample predictive accuracy, which is the expected log density of the predictive distribution given a point estimate of the model parameters. Mathematically, this is expressed as \( E(\log p(\tilde{y}|\hat{\theta}(y))) \), where \( E \) is a mathematical expectation, \( p \) represents a probability density function, \( \tilde{y} \) is an out-of-sample observation (not used in model fitting), and \( \hat{\theta}(y) \) denotes the model parameters estimated from the observations \( y \). Because we do not have \( \tilde{y} \), the out-of-sample predictive accuracy cannot be calculated directly. AIC uses the log predictive density of the observed data evaluated at the estimated parameter values (typically the MLE), \( \log p(y|\hat{\theta}_{mle}) \), as an approximation. In other words, the observed data are used first to estimate the model parameters and then again to calculate the predictive density. As a result, the approximation overestimates the out-of-sample predictive accuracy. The simplest correction for this bias is to subtract the number of parameters (\( k \)) – \( \log p(y|\hat{\theta}_{mle}) − k \) – because, on average, each estimated parameter (even one with no predictive value) inflates the in-sample approximation by 1. This result is based on an asymptotic normal approximation (i.e., the posterior distribution of the model parameters, \( p(\theta|y) \), approaches a normal distribution). Akaike [1974] defined AIC as -2 times this bias-corrected predictive accuracy: \( AIC = −2 \log p(y|\hat{\theta}_{mle}) + 2k \). AIC works well for linear models (including generalized linear models). When the model structure is more complicated than a linear model, simply subtracting \( k \) is no longer appropriate; for a more complicated Bayesian model, especially a hierarchical Bayesian model, the correction can be too high. The deviance information criterion (DIC) uses the posterior mean for the parameter estimates and a data-based bias correction: \( DIC = −2\log p(y|\hat{\theta}_{Bayes})+2 p_{DIC} \), where \( p_{DIC} \) is the effective number of parameters, defined as \( p_{DIC} = 2\left (\log p(y|\hat{\theta}_{Bayes})-E_{post}\left (\log p(y|\theta)\right )\right ) \), in which the second term is the average of the log likelihood over the posterior distribution of the model parameters.
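
A minimal R sketch of the AIC construction for a linear model (the simulated data are hypothetical; for an lm fit, \( k \) counts the regression coefficients plus the residual standard deviation):

    set.seed(3)
    x <- runif(50, 0, 10)
    y <- 2 + 0.5 * x + rnorm(50)          # hypothetical data
    fit <- lm(y ~ x)
    k <- attr(logLik(fit), "df")          # number of estimated parameters (here 3)
    -2 * as.numeric(logLik(fit)) + 2 * k  # deviance at the MLE plus the bias correction
    AIC(fit)                              # agrees with R's built-in AIC()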

In short, information criteria are measures of model predictive accuracy. They are typically defined based on deviance (-2 times log likelihood) evaluated using a point estimate of model parameters – \( −2 \log p(y|\hat{\theta}) \). The deviance is typically an overestimate of the out-of-sample predictive accuracy. AIC and DIC represent two approximations developed under different assumptions and conditions to correct the bias. Because an IC is evaluated using the deviance (a function of sample size, among other factors), the absolute value of AIC or DIC is meaningless. They should be used to compare alternative models fit to the same data.
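
A minimal sketch of this intended use, comparing two candidate models fit to the same (hypothetical) data; only the difference in AIC between models carries information:

    set.seed(4)
    x <- runif(100, 0, 10)
    y <- 1 + 0.3 * x + 0.05 * x^2 + rnorm(100)   # hypothetical data
    m1 <- lm(y ~ x)                              # candidate 1: linear
    m2 <- lm(y ~ x + I(x^2))                     # candidate 2: quadratic
    AIC(m1, m2)                                  # compare models fit to the same data
    BIC(m1, m2)                                  # the BIC comparison discussed earlier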

Johnson (2013)

Johnson [2013] recommended using a significance level of 0.005 to increase the reproducibility of research. His recommendation is based on the target Bayes factor of 1/25 or 1/50 recommended by Jeffreys [1961]. This recommendation does not change the formulaic approach to a hypothesis testing problem, but it does raise the question of how we balance the trade-off between type I and type II errors in the context of research and discovery. A smaller significance level is associated with lower power, or a lower chance of discovering a true effect. In ecological studies, sample size is often the limiting factor. As a result, requiring a very small significance level, and thus accepting a further loss of power, will have a detrimental effect on ecological research.

Johnson’s suggestion is understandable in the context of the statin clinical trial. If the true effect size of statin on a healthy population is similar to the estimated size of 1.2% (2.8% - 1.6%), the test reported in Ridker et al. [2008] has a power of more than 0.99991. (I simplified the problem to a two-sample proportion test and used the R function power.prop.test() to calculate the powers.) From a decision-making perspective, a power of almost 1 indicates that we view the type II error as a far more serious error. Specifically, in this case, we allow a 5% chance of making a type I error (a healthy person is prescribed statin, but statin does not prevent heart disease) and a less than 0.01% chance of making a type II error (statin is marginally effective in preventing heart disease, but it is not prescribed to healthy people). Is the consequence of the type II error so much more severe? If we want to balance the probabilities of making type I and type II errors, we can set the significance level between 0.0025 and 0.005 to have a power between 0.996 and 0.998. In an ecological study, by contrast, we often have sample sizes far smaller than the number needed to justify a type I error probability of 0.005. Therefore, a uniform threshold across all disciplines of science is unwise and counterproductive.
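
The power calculations described above can be reproduced with a short R sketch (the two-sample proportion simplification and the group risks are taken from the text):

    p1 <- 0.016; p2 <- 0.028              # approximate risks in the two groups
    power.prop.test(n = 8901, p1 = p1, p2 = p2, sig.level = 0.05)$power    # essentially 1
    power.prop.test(n = 8901, p1 = p1, p2 = p2, sig.level = 0.005)$power
    power.prop.test(n = 8901, p1 = p1, p2 = p2, sig.level = 0.0025)$power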

If a balance of type I and type II error probabilities is desired [Rotenberry and Wiens, 1985, Hayes and Steidl, 1997], the significance level should not be predetermined; rather, it should be estimated based on the sample size and the target effect size. This line of thinking reminds us that a \( p \)-value is one piece of information. It is not a good measure of evidence, and it may be incoherent. A \( p \)-value of 0.05 may represent strong evidence in one case and weak evidence in another. Information from both the data and elsewhere is needed. For example, to calculate the power of a test, we need to decide on an effect size that is ecologically meaningful. When a “balanced” significance level is large (e.g., 0.25), we should conclude that the existing data alone do not have enough information, and either additional data should be collected or other sources of information should be sought.
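
One way to operationalize a “balanced” significance level is to solve for the \( \alpha \) at which the type I and type II error probabilities are equal, given the sample size and an ecologically meaningful effect size. The sketch below does this for a two-sample \( t \)-test under assumed values (n = 15 per group and a standardized effect size of 0.8, both hypothetical), not as a prescription:

    # find alpha such that alpha equals the type II error probability (1 - power)
    balanced.alpha <- function(n, delta, sd = 1) {
      f <- function(alpha)
        alpha - (1 - power.t.test(n = n, delta = delta, sd = sd,
                                  sig.level = alpha)$power)
      uniroot(f, interval = c(1e-6, 0.499))$root
    }
    balanced.alpha(n = 15, delta = 0.8)   # a large value signals too little information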

Conclusion

The conceptual difficulty regarding the proper places for \( p \)-value and AIC/DIC in ecological research has its roots in the controversy between Fisher and Neyman-Pearson. The controversy is about the role of models in statistical inference. Fisher views statistics as a tool for induction by means of hypothetical deduction. Neyman-Pearson’s framework turns a statistical problem into a mathematical deduction. They both have their places in real world applications.

Many suggest that we should promote the use of Bayesian statistics to avoid the pitfalls of \( p \)-values. If the Bayesian alternative is to replace the \( p \)-value with the Bayes factor, we are still trapped in the same formulaic dichotomy defined by Neyman-Pearson. In my opinion, using Bayesian statistics is not necessarily a solution to the conceptual difficulty with regard to \( p \)-values and AIC/DIC. The fundamental question is still whether we use statistics as a mathematical tool for making a decision or as part of our research, subject to change. Whether we use Bayesian or classical statistics has little bearing on this question.

Implications in education

I have been teaching classical statistics/biostatistics to first-year graduate students in environmental and ecological studies for the last ten years. Because my statistical training was mostly in Bayesian statistics, I have always been skeptical about \( p \)-values. Consequently, I deliberately deemphasize the use of a single statistic such as the \( p \)-value in my teaching. But I have been disappointed almost every year because most students automatically base their conclusions on the \( p \)-value by the end of the semester! After reflection, I now believe that this outcome is expected. In applying statistics to a problem, the first question we ask is about the distribution of the data, or the most appropriate model to use. To answer this question, we need not only knowledge of statistical distributions and models, but also ecological knowledge. Most first-year graduate students do not have enough ecological knowledge and experience to make such a selection. As a result, they are unprepared to make the connection between ecological data and statistical distributions. Statistical inference starts with the problem of specification, and this problem is essentially impossible for most students. The consequence is that we teach statistical tests and models as individual mathematical topics. No matter how hard we try, students will see these topics in isolation. The best we can expect is that some students will remember how to carry out some of the tests and know how to fit some of the models. But because these tests and models are taught in isolation, most students cannot make the connection between statistical models and real-world problems. As a result, the most memorable thing from a statistics course is the \( p \)-value of a test or the \( R^2 \) value of a regression problem.

On the one hand, if we are teaching students to use statistics as a research tool, we should consider teaching statistics later in their studies, after they have learned some basics of their chosen profession. On the other hand, statistics is also needed early in research (e.g., when designing an experiment). So it is a conundrum without an easy solution.

Peters [1991], however, suggested that “statistics are better learned from direct applications of the statistics in the context of one’s own research.” Accordingly, we should provide opportunities for our graduate students to learn statistics after they complete their required quantitative courses.

To close this discussion, I describe a project-based course I teach regularly. The contents of the course are determined by student-proposed projects. In a typical year, I have fewer than 10 students in this class, each with a project that is related to his/her thesis or dissertation work. I start the course with student presentations of their problems, followed by my recommendations of appropriate statistical methods. I then teach selected topics while students carry out their work. Students present their preliminary results in the second half of the semester. Based on discussions and critiques during the presentations, students revise their work and write a final report. These reports are often turned into manuscripts, for example, Wu et al. [2011].

References

R.P. Abelson. Statistics as Principled Argument. Psychology Press, New York, 1995.

K. Aho, D. Derreberry, and T. Peterson. Model selection for ecologists: the worldviews of AIC and BIC. Ecology, 95(3):631–636, 2014.

H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, 1974.

M.B. Beck. Water quality modeling: a review of the analysis of uncertainty. Water Resources Research, 23(8):1393–1442, 1987.

K.P. Burnham and D.R. Anderson. P values are only an index to evidence: 20th- vs. 21st-century statistical science. Ecology, 95(3):627–630, 2014.

W.S. Cleveland. Visualizing Data. Hobart Press, Summit, NJ, 1993.

R.A. Fisher. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, Series A, 222:309–368, 1922.

A. Gelman, J. Hwang, and A. Vehtari. Understanding predictive information criteria for Bayesian models. Statistics and Computing, pages 1–20, 2013.

J.P. Hayes and R.J. Steidl. Statistical power analysis and amphibian population trends. Conservation Biology, 11(1):273–275, 1997.

H. Jeffreys. Theory of Probability. Oxford Univ Press, Oxford, 3rd edition, 1961.

V.E. Johnson. Revised standards for statistical evidence. Proceedings of the National Academy of Sciences, 110(48):19313–19317, 2013.

M. Lavine. Comment on Murtaugh. Ecology, 95(3):642–645, 2014.

J. Lenhard. Models and statistical inference: The controversy between Fisher and Neyman-Pearson. The British Journal for the Philosophy of Science, 57(1):69–91, 2006.

J. Neyman. Statistical problems in agricultural experimentation. Journal of the Royal Statistical Society, 2(Supplement):107–180, 1935.

R.H. Peters. A Critique for Ecology. Cambridge University Press, 1991.

P.M. Ridker, E. Danielson, F.A.H. Fonseca, J. Genest, A.M. Gotto, J.J.P. Kastelein, W. Koenig, P. Libby, A.J. Lorenzatti, J.G. MacFadyen, B.G. Nordestgaard, J. Shepherd, J.T. Willerson, and R.J. Glynn. Rosuvastatin to prevent vascular events in men and women with elevated C-reactive protein. The New England Journal of Medicine, 359(21):2195–2207, 2008.

J.T. Rotenberry and J.A. Wiens. Statistical power analysis and community-wide patterns. The American Naturalist, 125(1):164–168, 1985.

R. Wu, S.S. Qian, F. Hao, H. Cheng, D. Zhu, and J. Zhang. Modeling contaminant concentration distributions in China's centralized source waters. Environmental Science and Technology, 45(14):6041–6048, 2011.

2 comments:

Anonymous said...

I enjoyed this overview--thanks! It's a topic I've wrestled with many times in the past, and still wrestle with to this day when interacting with those who never knew there was anything but the hybrid approach. One of stats education's major failures, in my mind. I particularly like your points on teaching; when I taught I ran into that exact same issue and never could figure out a way around it. Glad to know I wasn't alone. :-)

I think an important point you may have overlooked is that (mathematical) statisticians and philosophers of science, with very few exceptions, rejected Fisher's approach because there was no way you could reconcile that with the frequentist conception of probability. In other words, a p-value is actually meaningless for a single experiment if you believe probability is based on long-run frequencies.

I think if Fisher had modern computers, he'd have been a Bayesian from the start, cause that was what he always wanted to know. The backwards (il)logic of single-experiment p-values was the best he could do with the tools of the time, and since it often worked even when interpreted incorrectly, it seemed to be good enough for him. We know now that they "worked" because in the absence of other information, he'd get results similar to a Bayesian analysis with weak priors.

p-values can be useful, but only in carefully controlled experimental conditions in which there is a realistic chance that long-run conditions will allow for long-run frequencies as meaningful. And if you subscribe to evolution (and any ecologist who doesn't is a fraud!), there is no such thing as "long-run conditions." But while such things never happen in ecology, they happen all the time in industrial contexts (industrial QA/QC is a good, perfectly legitimate use, for example).

I'm not an ecologist, but I can't see how you could be anything but a Bayesian or Likelihood/I-T practitioner in that profession.

Song Qian said...

Thanks. Yes, I overlooked the fact that Neyman ultimately won the argument among mathematical (classical) statisticians. Ecologists (and most natural scientists) think naturally in Bayesian terms in their work. The concept of long-run frequency does not exist in ecology. Ultimately, I want to change my biostatistics course to a Bayesian one. This is a small step towards the goal.
