Monday, January 6, 2014

Using Bayesian to Cheat

A few days before the end of last semester, I stumbled upon a paper in the journal Ecological Indicators [Song and Guan(2013)]. The title indicates that a Bayesian estimation method was used. The authors started with a discussion of “environmental efficiency analysis” and introduced an indicator, which is some kind of mathematical function of inputs and outputs. There was no intuitive explanation of the indicator. It was based on three input variables (population, “fixed capital formation,” and nonrenewable energy consumption), one desirable output variable (GDP), and one undesirable output variable (industrial SO2 emission). The calculation resulted in 17 environmental efficiency scores (9 cities in 2 years).
The main objective of the paper is to explore, using regression, the factors affecting these environmental efficiency scores (EE). The potential factors are (1) per capita GDP (RGDP), (2) total import and export volume (IE), (3) the proportion of the “second industry” in GDP (GY), (4) the proportion of “the industry of the second industry” in GDP (GGY), and (5) the proportion of environmental spending in GDP (HZ). The authors explained that (1) is an indicator of economic scale, (2) is a measure of economic exchange with the outside world (the authors used the term “opening up,” a bad translation of a Chinese term), (3) and (4) are measures of industry structure, and (5) is the “government factor.” Not being an economist, I don’t want to comment on the choice of these potential factors, except to note that GDP is used both as part of the environmental efficiency score and as a potential factor to explain the variance of that score.
The authors used a multiple regression approach, but the regression coefficients were estimated using MCMC. I was expecting a discussion of the choice of prior distributions for these coefficients, but it was soon clear that there was no prior distribution. So why did they use MCMC for a multiple regression problem? Based on the authors’ affiliation (School of Statistics and Mathematics), I assume they know that there should be no substantive difference between using MCMC and using OLS. The authors presented the following regression model:
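
[Equation not preserved: EE regressed on RGDP, IE, GY, GGY, and HZ, with intercept α and slopes β1 through β5]
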
The estimated model coefficients may have revealed the answer:

coefficient   estimate   standard error
α              -0.1188           1.7260
β1             -0.1564           0.1562
β2              3.8130           1.5510
β3              4.8050           8.0620
β4             -3.3960           5.5560
β5              0.9362          18.8900
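
A quick way to read a table like this is to divide each estimate by its standard error. The sketch below does that arithmetic in Python using the numbers from the table. The cutoff of 2 is only the usual rule of thumb, not something taken from the paper, and the variable names are mine.

```python
# Estimates and standard errors copied from the table above.
coefficients = {
    "alpha": (-0.1188, 1.7260),
    "beta1": (-0.1564, 0.1562),
    "beta2": (3.8130, 1.5510),
    "beta3": (4.8050, 8.0620),
    "beta4": (-3.3960, 5.5560),
    "beta5": (0.9362, 18.8900),
}

# Ratio of each estimate to its standard error (a t- or z-like statistic).
ratios = {name: est / se for name, (est, se) in coefficients.items()}

# Rule of thumb (an assumption here, not the paper's criterion):
# a ratio larger than 2 in absolute value is roughly "different from 0".
THRESHOLD = 2.0
flagged = [name for name, r in ratios.items() if abs(r) > THRESHOLD]

for name, r in ratios.items():
    print(f"{name}: estimate/se = {r:+.2f}")
print("exceeds the rule-of-thumb threshold:", flagged)
```

With so few observations the exact cutoff from a t distribution would be somewhat larger than 2, so the rule of thumb is, if anything, generous to the authors.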

All slopes except β2 are not statistically different from 0! Had the authors used OLS, a typical regression output would have included a column of p-values, which would have made the paper unpublishable. Using MCMC, the authors were able to present the estimated coefficients, standard errors, and selected quantiles instead. Without the column of p-values, a busy reviewer may not catch the problem. (But all my students in an introductory biostatistics class recognized the problem.)
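The premise that MCMC without a prior should agree with OLS can be checked on simulated data. The following is a minimal sketch, not the authors' procedure: it uses made-up data with a single predictor, treats the noise standard deviation as known, puts a flat prior on the intercept and slope, and samples with a hand-written random-walk Metropolis algorithm. All names, step sizes, and data values are assumptions chosen for illustration.

```python
import math
import random


def ols_fit(x, y):
    """Closed-form OLS intercept and slope for a single predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def log_posterior(a, b, x, y, sigma):
    """Gaussian log-likelihood up to a constant.

    With a flat prior on (a, b) this is also the log-posterior.
    """
    sse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    return -sse / (2.0 * sigma ** 2)


def metropolis(x, y, sigma, n_iter=20000, burn_in=2000,
               step=(0.3, 0.05), seed=1):
    """Random-walk Metropolis; returns posterior means of (a, b)."""
    rng = random.Random(seed)
    a, b = 0.0, 0.0  # deliberately start away from the OLS solution
    current = log_posterior(a, b, x, y, sigma)
    sum_a = sum_b = 0.0
    kept = 0
    for i in range(n_iter):
        a_new = a + rng.gauss(0.0, step[0])
        b_new = b + rng.gauss(0.0, step[1])
        proposed = log_posterior(a_new, b_new, x, y, sigma)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposed - current)):
            a, b, current = a_new, b_new, proposed
        if i >= burn_in:
            sum_a += a
            sum_b += b
            kept += 1
    return sum_a / kept, sum_b / kept


# Simulated data (hypothetical): y = 2 + 0.5 x + noise, noise sd = 1.
data_rng = random.Random(0)
x = [i - 9.5 for i in range(20)]
y = [2.0 + 0.5 * xi + data_rng.gauss(0.0, 1.0) for xi in x]

a_ols, b_ols = ols_fit(x, y)
a_mc, b_mc = metropolis(x, y, sigma=1.0)

print(f"OLS:        intercept = {a_ols:.3f}, slope = {b_ols:.3f}")
print(f"Metropolis: intercept = {a_mc:.3f}, slope = {b_mc:.3f}")
```

Up to Monte Carlo error, the two lines of output should agree, which is the point: without an informative prior, the sampler only reproduces the least-squares answer at greater cost.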
Is this a success story of cheating by using “Bayesian” statistics?

References
[Song and Guan(2013)] Malin Song and Youyi Guan. The environmental efficiency of Wanjiang demonstration area: a Bayesian estimation approach. Ecological Indicators, 36:59–67, 2013.
