Thursday, June 5, 2008
Why teaching statistics is difficult
Many students find statistics difficult. They often complain that my class has no structure and that they don't know how to apply the material learned in class to their homework. I had the same experience. The only class I got a C in during college was probability and statistics. I did not like the course because I could not find a system in it the way I could in mathematics. We were taught a collection of techniques, and the homework assignments were mostly irrelevant to any real application. In graduate school, I liked the introductory Bayesian inference course very much because I could do the mathematics. All my energy went into deriving various posterior distributions, which gave me a sense of accomplishment. But after the introductory Bayesian course, I started to wonder why we have to do all that math just to arrive at a posterior distribution that is not entirely different from the results of simple models in classical statistics.

When I started teaching statistics to graduate students in environmental and ecological sciences, I wanted to teach in a different way. I wanted to show that statistics is different from mathematics, and that thinking in statistics is different from thinking in mathematics. Mathematics is deductive reasoning; statistics is inductive reasoning. Deductive reasoning starts from a set of premises and uses a set of rules of logic to move from point A to point B. Inductive reasoning is the opposite: we observe data and try to figure out what process generated the data. Deduction is "easy"; induction is difficult. We all know what will happen if we leave an ice cube on a table at room temperature. But if we see a puddle of water on the same table, tracing back to the source of the water is difficult. If we did not see how the water got onto the table, we can never know for sure.

The same situation applies to science. In science, we observe data and try to understand the cause behind the data. We can propose different hypotheses, but no matter how simple the problem is, we can never be sure that a theory is correct. This problem of induction, raised by David Hume in the eighteenth century, has yet to find a solution. Statistics is a tool for inductive reasoning. We observe data and try to estimate a parameter or a model. When we observe a sample and calculate the sample mean $\bar{x}$, we don't claim that we know the population mean; we don't even know whether the sample mean is close to the true population mean. To quantify the uncertainty, we calculate a confidence interval. But the confidence interval is a mathematical concept based on long-run frequencies, and its interpretation is counterintuitive. As a result, statistical reasoning is difficult: there is no rule to follow that will guarantee a correct answer. Yet introducing this line of thinking in class is foolhardy, because we are so used to deductive reasoning; most of our training in science is in the analytical skills necessary for deduction.
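The long-run-frequency interpretation is easier to see in a simulation than to explain in words. Here is a minimal R sketch (the population mean, standard deviation, and sample size are arbitrary choices for illustration): it draws many samples from a population with a known mean and counts how often the 95% interval covers that mean. Roughly 95% of the intervals do, but any single interval either covers the mean or it does not.

set.seed(101)
mu <- 10                               # true mean, known only because we simulate
covered <- replicate(10000, {
  x <- rnorm(25, mean = mu, sd = 3)    # one sample of size 25
  ci <- t.test(x)$conf.int             # 95% t-interval for the mean
  ci[1] < mu & mu < ci[2]              # does this interval cover mu?
})
mean(covered)                          # close to 0.95 -- a property of the procedure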
Sunday, May 18, 2008
The Likelihood Principle
Although systematically discussed by Birnbaum in 1962, the idea that the likelihood principle is the foundation of statistical inference is still not fully appreciated by environmental scientists and ecologists. The likelihood principle states that all evidence about an unknown parameter $\theta$ is contained in the likelihood function for the given data. On the one hand, the likelihood principle suggests that many classical statistical methods, such as hypothesis testing, are "contraindicated" (Berger and Wolpert, 1988). This implication of the likelihood principle has been discussed thoroughly in statistics and has started to attract the attention of scientists; doubts about the role of hypothesis testing are increasingly common in the ecological literature. However, I often hear hypothesis testing (especially the p-value) justified on the grounds that the p-value is the only common ground for communicating results.
On the other hand, the likelihood principle implies that a distributional assumption is the basis for informative inference (Birnbaum's term). Without an explicit distributional assumption, we cannot derive the likelihood function. This point is not well understood by scientists. One clear sign of this lack of understanding is the popularity of nonparametric or distribution-free methods in the biological/life sciences. In classical statistics, we rely on the normality assumption because of the central limit theorem; in Bayesian statistics, the distributional assumption is built into the model from the start. Because most scientists are trained in classical statistics, where the normality assumption is almost always the rule, most of us do not have a clear idea that statistical inference is conditional on a specific distributional assumption. If we are dealing with crop yields or other mean/sum variables, the central limit theorem ensures that the normality assumption, and hence the routine statistical methods, are adequate. When the normality assumption is clearly inappropriate (e.g., for species composition data), methods based on it are potentially misleading.
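To make the point concrete, here is a hedged R sketch (the counts below are made up): the same data produce different likelihood functions under a normal and a Poisson assumption, and without choosing a distribution there is no likelihood function at all.

y  <- c(2, 0, 3, 1, 4, 2, 5, 1)        # hypothetical count data
mu <- seq(0.5, 5, by = 0.01)           # candidate values of the mean
loglik.norm <- sapply(mu, function(m) sum(dnorm(y, mean = m, sd = sd(y), log = TRUE)))
loglik.pois <- sapply(mu, function(m) sum(dpois(y, lambda = m, log = TRUE)))
plot(mu, loglik.pois, type = "l", xlab = expression(mu), ylab = "log-likelihood")
lines(mu, loglik.norm, lty = 2)        # same data, a different evidence curve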
In a typical graduate-level statistics course, students are taught to check a regression model's residuals for possible departures from the normality assumption. Most of us remember that we need to do something about the fitted model if its residuals are obviously not normal, and that when comparing two populations with a t-test we should check for normality. But there is little we can do (other than a few variable transformations) if the normality assumption does not hold. Consequently, "distribution-free" or nonparametric methods are appealing: when we use a nonparametric method, we don't have to worry about the normality assumption anymore.
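In R, the routine checks and the usual remedy look something like the following sketch (the data frame dat and the variables y and x are hypothetical):

fit <- lm(y ~ x, data = dat)
qqnorm(residuals(fit)); qqline(residuals(fit))   # normal Q-Q plot of the residuals
plot(fitted(fit), residuals(fit))                # look for non-constant variance
fit.log <- lm(log(y) ~ x, data = dat)            # a common remedy: transform the response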
Unfortunately, the term "nonparametric" has two meanings. First, nonparametric methods can refer to order-statistics (rank-based) methods, used mostly in the context of hypothesis testing when the variable of interest is known to be non-normal (so the commonly used tests are inappropriate). The variable is rank-transformed, a test statistic is calculated from the ranks, and the statistic's probability distribution under the null hypothesis is derived; this null distribution is often tabulated rather than expressed as a distribution function. Second, nonparametric is used in the context of statistical modeling, where the expected value of a response variable is predicted by a function of one or more predictor variables. If the function of the predictor variable(s) is an algebraic formula with unknown parameters, the model is known as a parametric model. If the function is represented graphically (e.g., by a smoothed curve) rather than by a formula (hence no parameters), the model is known as a nonparametric model. The simplest parametric model is the linear regression model; an example of a nonparametric model is Cleveland's local regression model "loess".
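The two meanings correspond to two very different R calls, sketched below with hypothetical vectors y and x and a two-level grouping factor g:

wilcox.test(y ~ g)                    # first meaning: a rank-based test
fit <- loess(y ~ x)                   # second meaning: Cleveland's local regression
plot(x, y); lines(sort(x), fitted(fit)[order(x)])   # the fitted curve has no formula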
Fisher was irritated by the development of distribution-free methods for hypothesis testing (Box, 1976), and Box explained this irritation with a simulation. Box likened the problem of non-normality to a mouse and the problem of non-independence to a tiger: we can tolerate an occasional mouse in the house, but we cannot live with a wild tiger, period. Because hypothesis testing is something we should avoid anyway, this is not a point worth a lot of ink.
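A rough sketch of the kind of simulation Box had in mind (not his actual example; the sample size, skewness, and autocorrelation below are arbitrary choices): the t-test's Type I error rate is barely disturbed by skewed errors but badly inflated by serial dependence.

set.seed(2008)
p.skew <- replicate(5000, t.test(rexp(30) - 1)$p.value)                       # skewed, independent
p.ar   <- replicate(5000, t.test(arima.sim(list(ar = 0.7), n = 30))$p.value)  # autocorrelated
mean(p.skew < 0.05)   # not far from the nominal 0.05 -- the mouse
mean(p.ar < 0.05)     # far above 0.05 -- the tiger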
For nonparametric statistical models such as CART and GAM, we are still imposing parametric assumptions about the response variable. I was once confused on this point: I named a threshold detection method based on the first split of a single-predictor CART model "nonparametric" (Qian et al., 2003, Ecological Modelling). In this method, the split that results in the largest reduction in deviance is taken as the threshold value. The name is unfortunate because (1) the threshold model is a step function of the single predictor with three parameters, and (2) the distribution of the response variable must be considered when the deviance (-2 log-likelihood) is calculated. The default deviance in the S-Plus function I distributed is the sum of squares, which is the -2 log-likelihood kernel of a normal distribution with a constant variance.
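A minimal sketch of the first-split idea in R, using rpart (the R descendant of the S-Plus tree functions) with hypothetical vectors y and x; this illustrates the general approach, not the S-Plus code I distributed:

library(rpart)
fit <- rpart(y ~ x, control = rpart.control(maxdepth = 1, cp = 0, minsplit = 10))
threshold <- fit$splits[1, "index"]   # the cut point of the single split
# with the default method = "anova" for a numeric y, the deviance is the residual
# sum of squares, i.e., the normal-with-constant-variance -2 log-likelihood kernel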
The acceptance of AIC by environmental scientists and ecologists is an interesting phenomenon. First, model selection based on AIC does not involve a p-value. Second, AIC is $-2\log(\mathrm{likelihood}) + 2k$, where $k$ is the number of parameters, so its calculation depends on the probabilistic assumption about the response variable.
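The dependence is easy to see in R: changing the assumed distribution changes the log-likelihood and therefore the AIC. A sketch with hypothetical count data y and predictor x:

AIC(glm(y ~ x, family = gaussian))   # log-likelihood computed under a normal assumption
AIC(glm(y ~ x, family = poisson))    # log-likelihood computed under a Poisson assumption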
Now, the question is whether a proper probability distribution assumption is scientifically important.